Hi, as I already told you in private, I am playing with importing al

Maybe we are not to blame, after all: <a class="issue-link js-issue-link" data-error-t

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

I obtained a repo for testing from <a class="user-mention notranslate" data-hovercard-

git-dit gc is slow about git-dit HOT 11 CLOSED

matthiasbeyer commented on June 12, 2024

git-dit gc is slow

from git-dit.

Comments (11)

neithernut commented on June 12, 2024

IIRC, the garbage refs are collected because of a mixture of reducing code duplication and another Result<Iter<Result>>. The collection code itself should return an iterator which may be used directly.

However, this does not change a thing about the GC being slow. I'll try to have a look soonish.

from git-dit.

neithernut commented on June 12, 2024

Note: while making the GC code capable of doing things in parallel might be of value in the future, I'd rather figure out why it's slow in the first place, because it shouldn't be.

from git-dit.

neithernut commented on June 12, 2024

Maybe we are not to blame, after all: libgit2/libgit2#4428
Note: this is just a speculation, I'll make proper measurements some time in the future.

from git-dit.

neithernut commented on June 12, 2024

@matthiasbeyer do you have a repo somewhere for reproduction/measurement?

from git-dit.

matthiasbeyer commented on June 12, 2024

I have a repository but I would only share it privately to you.

from git-dit.

neithernut commented on June 12, 2024

I obtained a repo for testing from @matthiasbeyer. I can somewhat reproduce the (low) performance observed by him:

$ time git-dit gc --dry-run
real	0m29.135s
user	0m10.965s
sys	0m17.705s

According to perf, nearly 80% of the time is spent in git2::repo::Repository::references_glob or git_reference_iterator_glob_new (the latter is the libgit2 function invoked by the git2 abstraction).

This is clearly a problem for us, both because we inherently have lots of refs and there are not so many reasonable alternatives to retrieving the refs via globbing. I also suspect that libgit2 doesn't have much room for improvement itself or that it is reasonable for the devs to improve the performance at all cost. Git is not designed for many references.

I'm not yet sure how to solve this problem. Ideas are welcome!

from git-dit.

matthiasbeyer commented on June 12, 2024

Maybe we could do some measurement how fast git gc is, first. If it's in the same space, I'd rather mark this as non-issue.

Note that this case is extreme: We should run the GC more often than after 8k messages or whatever it was. Calling the GC on a small repository (tens or hundrets of refs to collect) this should be way faster.

All in the context "I have not tested how fast it is if it has nothing to collect", of course.

What I just remembered I should note as well: I have slow discs. I have 5400rpm discs in my machine... parallelization on our end could help with FS access, I guess.

from git-dit.

neithernut commented on June 12, 2024

It may help to restrict collecting to some issues as the space of refnames to consider is much smaller. That's one of the reason why I added support for this to git dit gc and (iirc) proposed crafting a hook using this functionality.

However, if the number of refs is huge, you will always run into this problem. Sure, we may be able to parallelize globbing: collect references for multiple issues in parallel, but the underlying problem remains: globbing is slow.

Maybe we can take advantage of the fact that we don't really need globbing but only directory-listing. This would, however, lead to a completely separate implementation for ref-discovery if not provided by libgit2 itself.

from git-dit.

neithernut commented on June 12, 2024

Turns out libgit2 iterators over all references and matches each one against the glob provided. Hence, we end up with a runtime proportional to #issues x #all_references. I'll try to introduce an optimization to libgit2 specific to the FS-backend.

Regardless whether the optimization will be accepted or not, it would be reasonable to modify the garbage-collection so it will also work with current versions of libgit2 and other backends: instead of querying the references for each issue, we would iterate over all references and filter them based on the selected issues and GC-parameters. We should also put a big sticker on it saying why we do it this way.

from git-dit.

neithernut commented on June 12, 2024

As I already explained, the GC is slow with lots of issues because it scales a little worse than O(number_of_issues^2), because globbing leads to processing of each reference in the repository. For FS-based repo, there is an optimization (libgit2/libgit2#4629) which will, probably, soon be merged and be available via git2-rs at some point in the future. We'll "just" have to wait for that and then update the git2-dependency. However, we can also do a bit more than waiting:

The latest refactor of the GC in itself did not make collecting references faster. We could spare us some globbing, but only at the expense of effectively building our own in-memory refdb.

However, it did enable printing/processing the references collected for each issue directly, without an intermediate collection step. In other words, we can now tweak git dit gc so that it doesn't sit there silently for eternity and suddenly spouts all the refs it accumulated but spits out what's availible in smaller chunks. I'll do that soonish, together with a few other changes.

Another thing we can to is building a pipeline where CollectableRefs::for_issue() is run in another thread. This way we can collect references (and maybe find the next issue) while for_issue() is busy globbing for refs and figuring out whether the local head can be collected. Implementing such a functionality is relatively easy using stuff from std::sync::mpsc. It appears that Revwalk implements Send, so we could easily implement Send for RefsReferringTo.

There is also a neat little crate just for pipelining (ppipe). However, we cannot run multiple worker threads if we use that. And we may want to do so, eventually.

from git-dit.

neithernut commented on June 12, 2024

Bah. I misread, Revwalk does not implement Send. This, of course, complicates things a bit. ~~I bet there are synchronization wrappers somewhere in the stdlib, but I did not see them, yet.~~ We can just wrap the RefsReferringTo in a Mutex.

from git-dit.

git-dit gc is slow about git-dit HOT 11 CLOSED

Comments (11)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent