Comments (8)
> Today, the only way to efficiently handle cloning a large repository using `west update` that I am aware of is to limit the fetch depth as in `-o=--depth=`.

There are other (and not mutually exclusive) optimizations:

- Consider adding config option for treeless clones (`--fetch-opt=--filter=...`) #638
- Support for Treeless clones actions/checkout#1152
- performance: How long things take

> must be deep enough to include the specific SHA listed in the west.yml file, or else the update step fails.

I don't think this is how `west` typically works. Take a look at

> An alternative to limiting the fetch depth is to share objects with a local reference repository

Did you look at `west update -h`?
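For readers who haven't used these options, a rough sketch of what each looks like (the west flag spelling follows the quote above; `--filter=tree:0` is Git's treeless-clone filter, and the URL is only a placeholder):

```sh
# Shallow fetch: only works if the manifest SHA is within the depth.
west update -o=--depth=1

# Treeless clone at the Git level (what #638 asks west to expose):
# all commits are fetched, trees and blobs are downloaded on demand.
git clone --filter=tree:0 https://example.com/big-repo.git
```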
The caching mechanism added by @mbolivar-nordic in c50d342 does not actually take advantage of Git's object sharing. It still clones all of the objects and the entire history of the repository; it only does so locally rather than over the network, which of course is an improvement, but not what's being asked here. The crucial step is to set up a `.git/objects/info/alternates` file. Using the `--shared` option when running `git-clone` does that. Other tools that use Git would have to create this file themselves, which is fairly easy to do. That would result in a much smaller size for the `.git` directory in the project workspace as well, thus saving a lot of disk space in addition to speeding up creation of the work tree.
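To make the mechanism concrete, a minimal sketch, with `/mirrors/zephyr.git` standing in for a hypothetical local cache:

```sh
# git-clone can write the alternates file itself; --shared only
# works when the source repository is on the local filesystem.
git clone --shared /mirrors/zephyr.git zephyr
cat zephyr/.git/objects/info/alternates
# -> /mirrors/zephyr.git/objects

# A tool that does not go through git-clone creates the same file by
# hand, from inside an initialized repository: one absolute path to a
# borrowed objects directory per line.
echo /mirrors/zephyr.git/objects > .git/objects/info/alternates
```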
On a related note, using `git-init` and `git-fetch` is much preferred over using `git-clone`. In other words,

```sh
git init
git remote add <remote_name> <remote_url>
# set up .git/objects/info/alternates to point to objects in <local_cache>
git fetch <remote_name>
```

is much better than

```sh
git clone <local_cache>
git remote set-url <remote_name> <remote_url>
git fetch ...
```

which seems to be how west works when given a cached repository.
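Concretely, the preferred sequence might look like this (the remote name, URL, and cache path are all hypothetical):

```sh
git init zephyr && cd zephyr
git remote add origin https://github.com/zephyrproject-rtos/zephyr
echo /mirrors/zephyr.git/objects > .git/objects/info/alternates
git fetch origin
git checkout --detach FETCH_HEAD   # or the SHA pinned in west.yml
```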
> That would result in a much smaller size for the .git directory in the project workspace as well,

Much smaller disk space... if you don't count the initial repos.

> only does so locally rather than over the network, which of course is an improvement, but not what's being asked here

There is no doubt `--shared` would be a big optimization. But as with any optimization work, the most important question is: "How much?". More precisely: how much compared to existing optimizations? Greatly increasing the complexity of the code base for saving a few percent would never be worth it.

So far you haven't provided any number, not even an order of magnitude. You don't sound like you've explored all available options either: your first sentence at the top is "--depth is the only efficient way I'm aware of", which is incorrect.
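For what it's worth, that comparison is straightforward to run locally; a rough sketch, with `/cache/zephyr.git` as a hypothetical pre-fetched cache:

```sh
# Full local clone (what the current cache mechanism amounts to):
# every object is copied into the new workspace.
time git clone /cache/zephyr.git full-copy
du -sh full-copy/.git

# Shared clone: only the alternates file is written; objects stay
# in the cache and are borrowed from there.
time git clone --shared /cache/zephyr.git shared-copy
du -sh shared-copy/.git
```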
Interactive users clone very rarely from scratch. In our CI, `west update` takes 1-2 minutes from scratch (using the existing optimizations I listed), which is acceptable for us. We need some time to run tests anyway.

So what is your use case? Development normally happens to fix tangible and measurable issues, not just "cool ideas".

Before implementing one of the existing optimizations, @mbolivar-ampere spent a lot of time performing some measurements. You can find those at one of the links I shared above if you're interested.

> On a related note, using git-init and git-fetch is much preferred over using git-clone.

`west` used to do this but it was changed in e283d99.
> > That would result in a much smaller size for the .git directory in the project workspace as well,
>
> Much smaller disk space... if you don't count the initial repos.

Think of many concurrent workspaces, not just a single one.

> > only does so locally rather than over the network, which of course is an improvement, but not what's being asked here
>
> There is no doubt `--shared` would be a big optimization. But as with any optimization work, the most important question is: "How much?".

Multiple workspaces sharing the same Git objects is very clearly a huge advantage, both in terms of storage and speed of checkout. Imagine N users, or many concurrent CI jobs, using the same Git mirrors on some local NFS share. N workspaces sharing the same repository histories is very clearly an advantage over N workspaces with N replications of the same history. And it's faster.
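For illustration, the setup being described might look like this (the mirror path and URLs are hypothetical):

```sh
# One mirror on shared storage, refreshed periodically:
git clone --mirror https://github.com/zephyrproject-rtos/zephyr \
    /nfs/mirrors/zephyr.git

# Each of the N workspaces borrows objects instead of copying them:
git clone --reference /nfs/mirrors/zephyr.git \
    https://github.com/zephyrproject-rtos/zephyr workspace-1/zephyr
git clone --reference /nfs/mirrors/zephyr.git \
    https://github.com/zephyrproject-rtos/zephyr workspace-2/zephyr

# Each workspace's .git now stores only the objects missing from the
# mirror. Caveat: objects must never be pruned from the mirror while
# clones still borrow them, or those clones become corrupted.
```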
> More precisely: how much compared to existing optimizations?

I'm not going to do that comparison. But feel free to do some Google searching on the advantages of sharing Git objects with a reference repository.

> Greatly increasing the complexity of the code base for saving a few percent would never be worth it.

What exactly is the complexity? And no, it's not a few percent.

> So far you haven't provided any number, not even an order of magnitude.

It should be self-evident: a single replicated history, which is constant, versus N replicated histories.

> You don't sound like you've explored all available options either: your first sentence at the top is "--depth is the only efficient way I'm aware of", which is incorrect.

I indeed have. As I said, the caching implementation in west is mediocre at best and doesn't address the issue of object sharing.

> Interactive users clone very rarely from scratch.

Now I am going to challenge statements like this -- please provide some numbers. How many users? How often?

> In our CI, `west update` takes 1-2 minutes from scratch (using the existing optimizations I listed), which is acceptable for us. We need some time to run tests anyway.

The assumption in that statement is that the Git repositories involved are small. But what if large repositories are involved, and they may not be using LFS?

> So what is your use case?

I mentioned that earlier -- many concurrent workspaces using large repositories.

> Development normally happens to fix tangible and measurable issues, not just "cool ideas".

I appreciate that.

> Before implementing one of the existing optimizations, @mbolivar-ampere spent a lot of time performing some measurements. You can find those at one of the links I shared above if you're interested.

I'll take a look.

> > On a related note, using git-init and git-fetch is much preferred over using git-clone.
>
> `west` used to do this but it was changed in e283d99.
> > Interactive users clone very rarely from scratch.
>
> Now I am going to challenge statements like this -- please provide some numbers. How many users? How often?

You're the one asking for a "clearly", "self-evident" new feature, without providing any numbers, a reproducible use case, an example, measurements of the existing optimizations, prototype code, or any offer to contribute or help[1]. You seem to have a performance problem to solve[2]. I don't.

Now answering your question anyway:

- Doctor, it hurts when I keep cloning from scratch interactively.
- Don't.

> > You don't sound like you've explored all available options
>
> I indeed have.

Then share some reproducible examples and actual data, not "self-evidence".

[1] "I'll leave it to the feature designer/developer..." - who is that? "I'm not going to do that comparison. Feel free to Google..."
[2] assuming it's not an XY problem: https://en.wikipedia.org/wiki/XY_problem
> What exactly is the complexity?

This was just an example. Every feature and code addition increases complexity - and bugs, and maintenance costs. If you take a quick look at the git log, you'll notice this project is not really staffed with an army of full-time developers. Not even one full-time developer, in fact; very far from it.

I have no idea what the complexity would be in this particular case, but your description of the new feature is not exactly short while still leaving a lot of open questions. If you think this would be a small effort, then I can't wait for your pull requests (with some sample data to back them up). Don't forget the test code.
It is desirable for a tool built on top of Git to allow using the facilities Git offers for dealing with various complexities, particularly cloning large repositories. west is deficient on that front because it does not support Git's object sharing mechanism, which is a well-known and primary feature of the tool. While I'd like to motivate the feature request with numbers, suffice it to say that some development and test environments rely on object sharing. I ask that the reader refer to the wealth of literature available on this topic to learn more.

About the problem statement being long: sure, it could have been more concise, but thoroughness was the goal.

I realize and appreciate that, with limited time and resources, feature requests have to be addressed judiciously.

I can take a stab at extending west to add the desired behavior. I'll make a pull request if I decide that what I have is presentable. And I certainly hope that the conversation then goes a little better.
Turns out, this feature request is closely related to (really a duplicate of) #625.