Comments (28)
@virgiliu: sure. I just hope @Byron doesn't mind that we're using his bug tracker to discuss like this. :)
So, in one of my projects I'm using the following solution. I haven't pushed it yet (mostly because I'm not entirely satisfied with it) so I can't just give you a link to the code. You can assume that the code below is MIT-licensed.
So first, I have the following decorator:
class subprocessed(object):
"""Decorator to run a function in a subprocess
We can't use GitPython directly because it leaks memory and file
descriptors:
https://github.com/gitpython-developers/GitPython/issues/60
Until it is fixed, we have to use multiple processes :(
"""
def __init__(self, func):
self.func = func
def __get__(self, instance, *args, **kwargs):
self.instance = instance
return self
def __call__(self, *args):
from multiprocessing import Process, Queue
q = Queue()
def wrapper(queue, func, *args):
queue.put(func(*args))
p = Process(target=wrapper,
args=(q, self.func, self.instance) + args)
p.start()
p.join()
return q.get()
Then, you just need to decorate a method with it, and it gets run as a subprocess!
In my case, I have a class Foobar
which does stuff on various Git repositories:
class Foobar(object):
def __init__(self):
self.git_rooturl = "git://git.example.com"
self.workdir = "/path/to/where/the/repo/will/be/cloned"
def do_stuff_on_repo(self, reponame):
curdir = os.getcwd()
# Clone the module
repo = git.Repo.clone_from("%s/%s" % (self.git_rooturl, reponame),
os.path.join(self.workdir, reponame))
os.chdir(repo.working_tree_dir)
# Do your stuff here
result = ...
os.chdir(curdir)
shutil.rmtree(repo.working_tree_dir)
del repo
return result
Unfortunately the do_stuff_on_repo
method goes through the wrong code path and I call it repeatedly (even on different repositories), so it leaks memory and fds.
As a result, I'm just defining it in this way:
@subprocessed
def do_stuff_on_repo(self, reponame):
...
All I changed is that I decorated it, and now each call to the function will be run in its own process, so the leaking memory and fds get cleared when the process exits.
The only tricky part which is not handled above is what to do in case of an error: if do_some_stuff_on_repo
raises an exception for example, then subprocess just exits, without affecting the main process where your application is running.
How to handle that properly is left as an exercise to the reader. (because that's the part I'm not very happy with :P)
from gitpython.
I think it will work if you use the latest HEAD
of this repository, which would be v1.0.2
if I hadn't lost access to my pypi account.
In v1.0.1
, you should be able to instantiate the Repo
type with git.Repo(path, odbt=GitCmdObjectDB)
to alleviate the issue.
If this works for you, I'd be glad if you could let me know here as well.
from gitpython.
Thanks for reporting this.
Probably this is related to the smmap library, which opens memory mappings of handled files. Apparently, there is a chance of not closing them after being done with the access.
I suffered from similar problems in windows tests (where open handles prevented the test-cases to clean themselves up properly), and this definitely should be looked into.
Until then, you might use the GitCmdDb Backend for your repository, which uses a different implementation.
from gitpython.
I can confirm the issue on fedora 17, there is *.pack and *.idx, open as file and memory mapped.
from gitpython.
I think, the lsof checks will have to go into the test-suite too, in order to protect against regression.
It will be a lesson for me, as I simply didn't verify I indeed close the handles (or that there are no dependency-cycles preventing objects to die).
Especially with something as vital (and limited) as system resources, I shouldn't leave anything to chance.
Currently I have no ETA on this, but I am planning to redirect development time to git-python once again.
from gitpython.
So, learning my way through the code with pdb
, I found where the file is actually opened.
In smmap/util.py
, in the constructor of the class MapRegion
, there's the following:
if isinstance(path_or_fd, int):
fd = path_or_fd
else:
fd = os.open(path_or_fd, os.O_RDONLY|getattr(os, 'O_BINARY', 0)|flags)
#END handle fd
In the case of this bug report, path_or_fd
is a path, so the fd
gets opened. At this point, it is file descriptor 3.
Then later on, in the same function, I can see:
finally:
if isinstance(path_or_fd, basestring):
os.close(fd)
#END only close it if we opened it
And here, the file descriptor 3 is closed. So everything is good, right?
No, because in between, there is this:
self._mf = mmap(fd, actual_size, **kwargs)
This maps the content of the file pointed by file descriptor 3 in memory... and it also opens the file again, as file descriptor 4!
It is this one which never gets closed.
Now let's consider the following:
import mmap
import os
raw_input("Check open files with `lsof -p %s | grep %s`" % (os.getpid(),
os.getcwd()))
f = open("hello", "r+b")
raw_input("Check open files again (opened 'hello' as fd %s)" % f.fileno())
mf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
raw_input("Check open files again (mmaped fd %s)" % f.fileno())
f.close()
raw_input("Check open files again (closed 'hello')")
mf.close()
raw_input("Check open files again (closed the mmap)")
This reproduces exactly the same behaviour: a file is opened as fd 3, when mmap-ing its content we also open the same file as fd 4, then we close the fd 3, and finally closing the map will also close fd 4.
So what is missing from smmap
is to close the memory map when done with it.
from gitpython.
Thank you for making the effort of pointing the course of the issue out so clearly.
As this affects all git-python installation, I should definitely be on it with high priority !
In any way, the memory map needs to remain open until the last user of the respective memory region fades away, and apparently this doesn't happen, maybe because someone keeps it open by keeping a reference to it - this at least would be my guess.
AFAIK, when the mmap instance goes out of scope, it will close automatically, maybe I am relying on this. Also, smmap has a known issue with reference counting in some python versions apparently, which may also lead to the issue we see here.
from gitpython.
As this affects all git-python installation, I should definitely be on it with high priority !
I'd certainly appreciate it. :)
Especially since it's not even possible to manually keep track of those files (by monkey-patching Python's open()
and file()
) and close them manually: the mmap
module is written in C and doesn't use either open()
or file()
.
In any way, the memory map needs to remain open until the last user of the respective memory region fades away,
I tried spending some time yesterday figuring where this should be, but that's just too much for me.
and apparently this doesn't happen, maybe because someone keeps it open by keeping a reference to it - this at least would be my guess.
Well, I tried destroying the repo object, but that didn't help. If a reference is kept on it, I really wonder where. :-/
Also, smmap has a known issue with reference counting in some python versions apparently, which may also lead to the issue we see here.
So I can confirm the issue on RHEL 6 (Python 2.6.6) and Fedora 16 (Python 2.7.2)
And mscherer above confirmed it on Fedora 17, which has Python 2.7.3.
So I'm not sure that's related to a Python version, or else we are extremely unlucky. :D
from gitpython.
When I implemented the reference counting, I decided to try using Python's internal one, with sys.getrefcount(). However, I realized that I had to put some magic numbers into the code to account for references to my object that I didn't really have on my radar. In that moment, I might have opened the gates for plenty of bugs.
Or I have a cyclic relation somewhere, even though I was very conscious about strong references and about who keeps pointers to whom.
Anyway, its buggy, and a real problem as GitPython uses this engine heavily under the hood, and by default.
from gitpython.
Any news on this issue?
from gitpython.
No, sorry. Also I cannot currently assign any time to the project.
from gitpython.
Any news? I'm getting the same error on Python 2.6.6 on Debian.
from gitpython.
Wow, exactly one year ago I said I had no time, and sadly this still didn't change. However, next year I could possibly get to it, there are many things I want to improve and do with git-python.
from gitpython.
That sounds great.
I've managed to make a workaround for the function that caused the "Too many open files" error using git commands.
Here's a simplified version of the code that caused the error:
import git
def last_commit(repo, rev, path):
commit = repo.iter_commits(rev, path, max_count=1)
return commit.next()
repo = git.Repo('/path/to/repo')
for i in range(1000):
print last_commit(repo, 'master', '/path/to/some/dir/inside/repo')
After about 500 iterations it crashes with
OSError: [Errno 24] Too many open files
Is there any way of getting the lastest commit from a directory or a file on a certain branch or tag besides using iter_commits ? I'd rather use objects instead of parsing raw git output and making my own dictionary with the values.
from gitpython.
@virgiliu My personal workaround is to run the GitPython stuff (at least the ones which lead to this bug) in different processes. :-/
from gitpython.
@bochecha could you show me an example of that?
from gitpython.
@bochecha Great stuff. π
Thanks a lot! I'll give it a try if/when I run into this problem again (for now I've circumvented it by using raw git commands) :)
from gitpython.
Bumping this. I just spent a ton of time tracking this down. It's still there :-)
from gitpython.
+1
from gitpython.
This is definitely making gitpython unusable for projects like grokmirror that must be able to operate on thousands of repositories. As this bug is now 2+ years old, it doesn't look like it'll be fixed at this point. :(
from gitpython.
Ahhh, I am having the same issue. Would be nice to see a fix. :)
from gitpython.
Is anyone working on this bug?
from gitpython.
The issue presented by @virgiliu seems to be fixed now (with the latest version of smmap 0.9.0), but by artificially constraining the file limit to 64, I can still see that pipes are leaked.
For some reason, repository instances are never freed, and I will have to figure out why that would happen.
from gitpython.
As __del__()
is not reliably called in python 3, the tests leak heavily as no repository it ever used gets cleaned up, leaving plenty of the processes shown above. Each of them comes with three pipes, quickly using up the processes resources.
For example, it's possible to run all tests with ulimit -n 64
in python 2, but needs ulimit -n 140
in python 3.
from gitpython.
It turned out I forgot to add a tear-down super class call, which was responsible for most of the test-leakage.
On the plus side, travis now checks for this as well.
An improved leak-test was added, which clearly shows that repositories dispose of their file handles correctly.
from gitpython.
Fantastic, thanks!
from gitpython.
I am still seeing this problem in version 1.0.1. Here is a little test script:
import git
repo = git.Repo('./')
for k in repo.iter_commits(max_count=25):
print k
At the same time, look at the number of open files with lsof | grep ".git"
. I see that the number of open files increases by about 25. The only way to decrease the number of open files is to close the python session.
(I should point out that I am running on Mac OSX, with a mostly MacPorts python installation.)
from gitpython.
I have some reports of the same issue in earwig/git-repo-updater#9, I think. Will ask people if GitPython HEAD fixes it.
from gitpython.
Related Issues (20)
- switch to Ruff instead of `flake8` HOT 6
- origin.pull(rebase=True) sometimes lost local commit , gitpyton version 3.1.31 HOT 3
- Wrong type annotation on `Index.commit` for `parent_commits` HOT 3
- Using create submodule doesn't create the correct submodule structure HOT 4
- Submodule.__init__ `parent_commit` conversion/validation is implied but not done HOT 8
- Module docstring says TagReference but means TagObject
- Diffable.diff is misleadingly annotated regarding special `other` values HOT 3
- Top-level refresh function not listed in __all__ HOT 1
- Bad git executable HOT 2
- Unable to specify lines of context while doing diff (-U flag) HOT 1
- γremote().exist()γ exception when lack remote section in config file
- Unhandled IndexError when calling .read() on a malformed config file HOT 6
- Detailed xfail output sometimes distracts from unexpected failures
- Some xfail markings fail to validate their exception types
- How can i set create_patch=true, and meanwhile get change_type in Diff
- git.Repo.clone_from() not working correctly in WSL HOT 3
- Fuzz Tests Are Crashing at Start-up on ClusterFuzz HOT 2
- GitConfigParser misparses quotes in options HOT 1
- Can't catch GitCommandError HOT 6
- Command's DEBUG reveals username/password from URL HOT 3
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
π Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. πππ
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google β€οΈ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from gitpython.