Git new feature when cloning

From: Mathieu Arnold <mat_at_freebsd.org>
Date: Fri, 29 Jul 2022 11:41:53 UTC
Hi,

A while back, Git grew a way to filter the objects it asks the server
for when cloning. This can speed up the clone because less data is
downloaded. It also stores less data locally, which is a bonus.

The only drawback is that when you ask for information it does not have
locally, it has to download the missing data, which it then stores
locally so you never download the same thing twice. (This happens under
the hood; the only thing you will notice is the command taking a bit
longer to return.)

It all happens in the --filter argument to git clone; see
git-rev-list(1) for the full explanation and the range of things you
can do. It can filter on several criteria but, ordered from most to
least data downloaded, the most useful values I can see for our usage
are:

--filter=blob:none
  This will download all the commits and all the trees (which are the
  file lists of directories), and only the blobs needed to check out
  the branch you asked for.

--filter=tree:0
  This will download all the commits, and only the trees and blobs
  needed to check out the branch you asked for.
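
As a minimal sketch of the two variants above: the REPO_URL default and
the function names are assumptions chosen for illustration, not
something prescribed here, so substitute whatever repository you
normally clone.

```shell
#!/bin/sh
# REPO_URL is an assumed example; override it in the environment to
# point at the repository you actually use.
REPO_URL="${REPO_URL:-https://git.freebsd.org/src.git}"

# All commits and all trees; blobs are fetched only when needed.
# $1 = destination directory
clone_blobless() {
    git clone --filter=blob:none "$REPO_URL" "$1"
}

# All commits; trees and blobs are fetched only when needed.
# $1 = destination directory
clone_treeless() {
    git clone --filter=tree:0 "$REPO_URL" "$1"
}

# Usage (not run here):
#   clone_blobless src-blobless
#   clone_treeless src-treeless
```

In the resulting clone, "git config remote.origin.partialclonefilter"
shows which filter is in effect.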

Both of those can be combined with --sparse, which enables sparse
checkout: initially only the files in the root directory are checked
out, and you use git sparse-checkout to add or remove paths from the
checkout. That can be useful if you are short on disk space and need
several checkouts to work on. Note that you can't really use --sparse
on the ports tree if you want to build things out of it, because you
would need to add the framework and all the dependencies to build a
port. A kernel developer, though, can probably live with only the
kernel sources and not the whole world.
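
A sketch of the kernel-only case, under stated assumptions: REPO_URL,
the helper name clone_sparse and the directory name "sys" are
illustrative choices and do not come from this message.

```shell
#!/bin/sh
# REPO_URL and the directory name "sys" used below are assumptions.
REPO_URL="${REPO_URL:-https://git.freebsd.org/src.git}"

# Blobless, sparse clone: only root-level files are checked out at
# first, then the listed directories are added to the working tree.
# $1 = destination directory, remaining arguments = directories to add
clone_sparse() {
    dest="$1"
    shift
    git clone --filter=blob:none --sparse "$REPO_URL" "$dest" &&
        git -C "$dest" sparse-checkout add "$@"
}

# Usage (not run here):
#   clone_sparse src-kernel sys
#   git -C src-kernel sparse-checkout list    # show what is included
```

Blobs for the added directories are fetched at the moment they are
added, so that step needs network access.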

And now some numbers, because we all love numbers:

| filter           | SRC   | PORTS | DOC  |
|------------------|-------|-------|------|
| blob:none        |  605M |  576M | 119M |
| blob:none sparse |  314M |  498M |  37M |
| tree:0           |  407M |  238M |  97M |
| tree:0 sparse    |  115M |  115M |  15M |
| no filtering     | 1461M | 1010M | 321M |

These are the sizes of .git/objects for checkouts done this morning, so
they are roughly the amount of data downloaded from the server.
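
To get comparable figures for your own clones, something along these
lines should do. The helper name objects_size is made up for this
sketch, and the exact du invocation behind the table is not stated
here, so the rounding may differ.

```shell
#!/bin/sh
# Print the on-disk size of a clone's object store in human-readable
# units. $1 = path to the working tree (the directory containing .git)
objects_size() {
    du -sh "$1/.git/objects" | cut -f1
}

# Usage (not run here):
#   objects_size src-blobless
#   objects_size src-treeless
```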

Note the contrast with --depth=X, which limits the number of commits
you get from the server. A shallow clone is OK for testing but not
great for development because of some limitations. The repository you
get with --filter is fully usable; the only drawback is that if you
need bits of history you filtered out, they will be downloaded on the
fly, so Internet access may be required.

PS: as filtering is done on the server, a knob needed to be enabled on
    our servers; GitLab and GitHub already supported the feature.
    gitrepo.f.o and gitrepo-dev.f.o have it enabled. I am unsure about
    the status of the mirrors, but they should be OK too.
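
The knob is not named above. On a stock Git server it is presumably
the uploadpack.allowFilter setting, so the sketch below rests on that
assumption; the helper name is made up, and the actual server setup
may well look different.

```shell
#!/bin/sh
# Assumption: the server-side knob is uploadpack.allowFilter.
# Without it, git upload-pack does not advertise filter support and a
# client's --filter request is not honored.
# $1 = path to the server-side repository
enable_partial_clone() {
    git -C "$1" config uploadpack.allowFilter true
}

# Usage (not run here), on the machine hosting the repository:
#   enable_partial_clone /path/to/repo.git
```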

-- 
Mathieu Arnold