Comments (11)
IIRC, the garbage refs are collected because of a mixture of reducing code duplication and another Result<Iter<Result>>
. The collection code itself should return an iterator which may be used directly.
However, this does not change a thing about the GC being slow. I'll try to have a look soonish.
from git-dit.
Note: while making the GC code capable of doing things in parallel might be of value in the future, I'd rather figure out why it's slow in the first place, because it shouldn't be.
from git-dit.
Maybe we are not to blame, after all: libgit2/libgit2#4428
Note: this is just a speculation, I'll make proper measurements some time in the future.
from git-dit.
@matthiasbeyer do you have a repo somewhere for reproduction/measurement?
from git-dit.
I have a repository but I would only share it privately to you.
from git-dit.
I obtained a repo for testing from @matthiasbeyer. I can somewhat reproduce the (low) performance observed by him:
$ time git-dit gc --dry-run
real 0m29.135s
user 0m10.965s
sys 0m17.705s
According to perf
, nearly 80% of the time is spent in git2::repo::Repository::references_glob
or git_reference_iterator_glob_new
(the latter is the libgit2
function invoked by the git2
abstraction).
This is clearly a problem for us, both because we inherently have lots of refs and there are not so many reasonable alternatives to retrieving the refs via globbing. I also suspect that libgit2
doesn't have much room for improvement itself or that it is reasonable for the devs to improve the performance at all cost. Git is not designed for many references.
I'm not yet sure how to solve this problem. Ideas are welcome!
from git-dit.
Maybe we could do some measurement how fast git gc
is, first. If it's in the same space, I'd rather mark this as non-issue.
Note that this case is extreme: We should run the GC more often than after 8k messages or whatever it was. Calling the GC on a small repository (tens or hundrets of refs to collect) this should be way faster.
All in the context "I have not tested how fast it is if it has nothing to collect", of course.
What I just remembered I should note as well: I have slow discs. I have 5400rpm discs in my machine... parallelization on our end could help with FS access, I guess.
from git-dit.
It may help to restrict collecting to some issues as the space of refnames to consider is much smaller. That's one of the reason why I added support for this to git dit gc
and (iirc) proposed crafting a hook using this functionality.
However, if the number of refs is huge, you will always run into this problem. Sure, we may be able to parallelize globbing: collect references for multiple issues in parallel, but the underlying problem remains: globbing is slow.
Maybe we can take advantage of the fact that we don't really need globbing but only directory-listing. This would, however, lead to a completely separate implementation for ref-discovery if not provided by libgit2 itself.
from git-dit.
Turns out libgit2
iterators over all references and matches each one against the glob provided. Hence, we end up with a runtime proportional to #issues x #all_references
. I'll try to introduce an optimization to libgit2
specific to the FS-backend.
Regardless whether the optimization will be accepted or not, it would be reasonable to modify the garbage-collection so it will also work with current versions of libgit2
and other backends: instead of querying the references for each issue, we would iterate over all references and filter them based on the selected issues and GC-parameters. We should also put a big sticker on it saying why we do it this way.
from git-dit.
As I already explained, the GC is slow with lots of issues because it scales a little worse than O(number_of_issues^2)
, because globbing leads to processing of each reference in the repository. For FS-based repo, there is an optimization (libgit2/libgit2#4629) which will, probably, soon be merged and be available via git2-rs at some point in the future. We'll "just" have to wait for that and then update the git2-dependency. However, we can also do a bit more than waiting:
The latest refactor of the GC in itself did not make collecting references faster. We could spare us some globbing, but only at the expense of effectively building our own in-memory refdb.
However, it did enable printing/processing the references collected for each issue directly, without an intermediate collection step. In other words, we can now tweak git dit gc
so that it doesn't sit there silently for eternity and suddenly spouts all the refs it accumulated but spits out what's availible in smaller chunks. I'll do that soonish, together with a few other changes.
Another thing we can to is building a pipeline where CollectableRefs::for_issue()
is run in another thread. This way we can collect references (and maybe find the next issue) while for_issue()
is busy globbing for refs and figuring out whether the local head can be collected. Implementing such a functionality is relatively easy using stuff from std::sync::mpsc. It appears that Revwalk
implements Send
, so we could easily implement Send
for RefsReferringTo
.
There is also a neat little crate just for pipelining (ppipe). However, we cannot run multiple worker threads if we use that. And we may want to do so, eventually.
from git-dit.
Bah. I misread, Revwalk
does not implement Send
. This, of course, complicates things a bit. I bet there are synchronization wrappers somewhere in the stdlib, but I did not see them, yet. We can just wrap the RefsReferringTo
in a Mutex
.
from git-dit.
Related Issues (20)
- Get the next release out HOT 2
- Workflow examples HOT 2
- Please add your project to https://dist-bugs.branchable.com/ HOT 6
- git-dit list/show panics HOT 14
- No crates actually released HOT 3
- We have no way to show all issue trees HOT 2
- [UI] "git dit show -g" should have colors
- Neither --verbose nor --trace print any output HOT 1
- `git dit list` shows committer date, should show author date
- Flag to _not_ use pager HOT 2
- Helppage: new --metadata missing prefix description HOT 1
- new --metadata keys HOT 1
- libgitdit is too hard to use HOT 6
- Trailers not beeing recognized (when having underscore in key?) HOT 4
- Workflow: Sending issues with git-format-patch / applying with git-am
- README clarifications HOT 3
- Does not build manpages with pandoc 2.2.1 HOT 3
- malformed URL error for git+ssh://user@host:/path/foo.git HOT 7
- Collaboration with git-bug? HOT 3
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from git-dit.