Code Monkey home page Code Monkey logo

Comments (28)

bochecha avatar bochecha commented on July 16, 2024 2

@virgiliu: sure. I just hope @Byron doesn't mind that we're using his bug tracker to discuss like this. :)

So, in one of my projects I'm using the following solution. I haven't pushed it yet (mostly because I'm not entirely satisfied with it) so I can't just give you a link to the code. You can assume that the code below is MIT-licensed.

So first, I have the following decorator:

class subprocessed(object):
    """Decorator to run a function in a subprocess

    We can't use GitPython directly because it leaks memory and file
    descriptors:
        https://github.com/gitpython-developers/GitPython/issues/60

    Until it is fixed, we have to use multiple processes :(
    """
    def __init__(self, func):
        self.func = func

    def __get__(self, instance, *args, **kwargs):
        self.instance = instance
        return self

    def __call__(self, *args):
        from multiprocessing import Process, Queue
        q = Queue()

        def wrapper(queue, func, *args):
            queue.put(func(*args))

        p = Process(target=wrapper,
                    args=(q, self.func, self.instance) + args)
        p.start()
        p.join()

        return q.get()

Then, you just need to decorate a method with it, and it gets run as a subprocess!

In my case, I have a class Foobar which does stuff on various Git repositories:

class Foobar(object):
    def __init__(self):
        self.git_rooturl = "git://git.example.com"
        self.workdir = "/path/to/where/the/repo/will/be/cloned"

    def do_stuff_on_repo(self, reponame):
        curdir = os.getcwd()

        # Clone the module
        repo = git.Repo.clone_from("%s/%s" % (self.git_rooturl, reponame),
                                   os.path.join(self.workdir, reponame))
        os.chdir(repo.working_tree_dir)

        # Do your stuff here
        result = ...

        os.chdir(curdir)
        shutil.rmtree(repo.working_tree_dir)
        del repo

        return result

Unfortunately the do_stuff_on_repo method goes through the wrong code path and I call it repeatedly (even on different repositories), so it leaks memory and fds.

As a result, I'm just defining it in this way:

    @subprocessed
    def do_stuff_on_repo(self, reponame):
        ...

All I changed is that I decorated it, and now each call to the function will be run in its own process, so the leaking memory and fds get cleared when the process exits.

The only tricky part which is not handled above is what to do in case of an error: if do_some_stuff_on_repo raises an exception for example, then subprocess just exits, without affecting the main process where your application is running.

How to handle that properly is left as an exercise to the reader. (because that's the part I'm not very happy with :P)

from gitpython.

Byron avatar Byron commented on July 16, 2024 1

I think it will work if you use the latest HEAD of this repository, which would be v1.0.2 if I hadn't lost access to my pypi account.

In v1.0.1, you should be able to instantiate the Repo type with git.Repo(path, odbt=GitCmdObjectDB) to alleviate the issue.

If this works for you, I'd be glad if you could let me know here as well.

from gitpython.

Byron avatar Byron commented on July 16, 2024

Thanks for reporting this.
Probably this is related to the smmap library, which opens memory mappings of handled files. Apparently, there is a chance of not closing them after being done with the access.

I suffered from similar problems in windows tests (where open handles prevented the test-cases to clean themselves up properly), and this definitely should be looked into.

Until then, you might use the GitCmdDb Backend for your repository, which uses a different implementation.

from gitpython.

mscherer avatar mscherer commented on July 16, 2024

I can confirm the issue on fedora 17, there is *.pack and *.idx, open as file and memory mapped.

from gitpython.

Byron avatar Byron commented on July 16, 2024

I think, the lsof checks will have to go into the test-suite too, in order to protect against regression.
It will be a lesson for me, as I simply didn't verify I indeed close the handles (or that there are no dependency-cycles preventing objects to die).

Especially with something as vital (and limited) as system resources, I shouldn't leave anything to chance.

Currently I have no ETA on this, but I am planning to redirect development time to git-python once again.

from gitpython.

bochecha avatar bochecha commented on July 16, 2024

So, learning my way through the code with pdb, I found where the file is actually opened.

In smmap/util.py, in the constructor of the class MapRegion, there's the following:

                if isinstance(path_or_fd, int):
                        fd = path_or_fd
                else:
                        fd = os.open(path_or_fd, os.O_RDONLY|getattr(os, 'O_BINARY', 0)|flags)
                #END handle fd

In the case of this bug report, path_or_fd is a path, so the fd gets opened. At this point, it is file descriptor 3.

Then later on, in the same function, I can see:

                finally:
                        if isinstance(path_or_fd, basestring):
                                os.close(fd)
                        #END only close it if we opened it

And here, the file descriptor 3 is closed. So everything is good, right?

No, because in between, there is this:

self._mf = mmap(fd, actual_size, **kwargs)

This maps the content of the file pointed by file descriptor 3 in memory... and it also opens the file again, as file descriptor 4!

It is this one which never gets closed.

Now let's consider the following:

import mmap
import os


raw_input("Check open files with `lsof -p %s | grep %s`" % (os.getpid(),
                                                            os.getcwd()))

f = open("hello", "r+b")
raw_input("Check open files again (opened 'hello' as fd %s)" % f.fileno())

mf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
raw_input("Check open files again (mmaped fd %s)" % f.fileno())

f.close()
raw_input("Check open files again (closed 'hello')")

mf.close()
raw_input("Check open files again (closed the mmap)")

This reproduces exactly the same behaviour: a file is opened as fd 3, when mmap-ing its content we also open the same file as fd 4, then we close the fd 3, and finally closing the map will also close fd 4.

So what is missing from smmap is to close the memory map when done with it.

from gitpython.

Byron avatar Byron commented on July 16, 2024

Thank you for making the effort of pointing the course of the issue out so clearly.
As this affects all git-python installation, I should definitely be on it with high priority !

In any way, the memory map needs to remain open until the last user of the respective memory region fades away, and apparently this doesn't happen, maybe because someone keeps it open by keeping a reference to it - this at least would be my guess.

AFAIK, when the mmap instance goes out of scope, it will close automatically, maybe I am relying on this. Also, smmap has a known issue with reference counting in some python versions apparently, which may also lead to the issue we see here.

from gitpython.

bochecha avatar bochecha commented on July 16, 2024

As this affects all git-python installation, I should definitely be on it with high priority !

I'd certainly appreciate it. :)

Especially since it's not even possible to manually keep track of those files (by monkey-patching Python's open() and file()) and close them manually: the mmap module is written in C and doesn't use either open() or file().

In any way, the memory map needs to remain open until the last user of the respective memory region fades away,

I tried spending some time yesterday figuring where this should be, but that's just too much for me.

and apparently this doesn't happen, maybe because someone keeps it open by keeping a reference to it - this at least would be my guess.

Well, I tried destroying the repo object, but that didn't help. If a reference is kept on it, I really wonder where. :-/

Also, smmap has a known issue with reference counting in some python versions apparently, which may also lead to the issue we see here.

So I can confirm the issue on RHEL 6 (Python 2.6.6) and Fedora 16 (Python 2.7.2)

And mscherer above confirmed it on Fedora 17, which has Python 2.7.3.

So I'm not sure that's related to a Python version, or else we are extremely unlucky. :D

from gitpython.

Byron avatar Byron commented on July 16, 2024

When I implemented the reference counting, I decided to try using Python's internal one, with sys.getrefcount(). However, I realized that I had to put some magic numbers into the code to account for references to my object that I didn't really have on my radar. In that moment, I might have opened the gates for plenty of bugs.

Or I have a cyclic relation somewhere, even though I was very conscious about strong references and about who keeps pointers to whom.

Anyway, its buggy, and a real problem as GitPython uses this engine heavily under the hood, and by default.

from gitpython.

bochecha avatar bochecha commented on July 16, 2024

Any news on this issue?

from gitpython.

Byron avatar Byron commented on July 16, 2024

No, sorry. Also I cannot currently assign any time to the project.

from gitpython.

virgiliu avatar virgiliu commented on July 16, 2024

Any news? I'm getting the same error on Python 2.6.6 on Debian.

from gitpython.

Byron avatar Byron commented on July 16, 2024

Wow, exactly one year ago I said I had no time, and sadly this still didn't change. However, next year I could possibly get to it, there are many things I want to improve and do with git-python.

from gitpython.

virgiliu avatar virgiliu commented on July 16, 2024

That sounds great.
I've managed to make a workaround for the function that caused the "Too many open files" error using git commands.

Here's a simplified version of the code that caused the error:

import git

def last_commit(repo, rev, path):
  commit = repo.iter_commits(rev, path, max_count=1)

  return commit.next()


repo = git.Repo('/path/to/repo')

for i in range(1000):
  print last_commit(repo, 'master', '/path/to/some/dir/inside/repo')

After about 500 iterations it crashes with

OSError: [Errno 24] Too many open files

Is there any way of getting the lastest commit from a directory or a file on a certain branch or tag besides using iter_commits ? I'd rather use objects instead of parsing raw git output and making my own dictionary with the values.

from gitpython.

bochecha avatar bochecha commented on July 16, 2024

@virgiliu My personal workaround is to run the GitPython stuff (at least the ones which lead to this bug) in different processes. :-/

from gitpython.

virgiliu avatar virgiliu commented on July 16, 2024

@bochecha could you show me an example of that?

from gitpython.

virgiliu avatar virgiliu commented on July 16, 2024

@bochecha Great stuff. πŸ‘
Thanks a lot! I'll give it a try if/when I run into this problem again (for now I've circumvented it by using raw git commands) :)

from gitpython.

dirkgr avatar dirkgr commented on July 16, 2024

Bumping this. I just spent a ton of time tracking this down. It's still there :-)

from gitpython.

abesto avatar abesto commented on July 16, 2024

+1

from gitpython.

mricon avatar mricon commented on July 16, 2024

This is definitely making gitpython unusable for projects like grokmirror that must be able to operate on thousands of repositories. As this bug is now 2+ years old, it doesn't look like it'll be fixed at this point. :(

from gitpython.

AnthonyBrunasso avatar AnthonyBrunasso commented on July 16, 2024

Ahhh, I am having the same issue. Would be nice to see a fix. :)

from gitpython.

smn avatar smn commented on July 16, 2024

Is anyone working on this bug?

from gitpython.

Byron avatar Byron commented on July 16, 2024

The issue presented by @virgiliu seems to be fixed now (with the latest version of smmap 0.9.0), but by artificially constraining the file limit to 64, I can still see that pipes are leaked.
For some reason, repository instances are never freed, and I will have to figure out why that would happen.
screen shot 2015-01-07 at 17 57 37

from gitpython.

Byron avatar Byron commented on July 16, 2024

As __del__() is not reliably called in python 3, the tests leak heavily as no repository it ever used gets cleaned up, leaving plenty of the processes shown above. Each of them comes with three pipes, quickly using up the processes resources.

For example, it's possible to run all tests with ulimit -n 64 in python 2, but needs ulimit -n 140 in python 3.

from gitpython.

Byron avatar Byron commented on July 16, 2024

It turned out I forgot to add a tear-down super class call, which was responsible for most of the test-leakage.
On the plus side, travis now checks for this as well.
An improved leak-test was added, which clearly shows that repositories dispose of their file handles correctly.

from gitpython.

mricon avatar mricon commented on July 16, 2024

Fantastic, thanks!

from gitpython.

smithsp avatar smithsp commented on July 16, 2024

I am still seeing this problem in version 1.0.1. Here is a little test script:

import git
repo = git.Repo('./')
for k in repo.iter_commits(max_count=25):
     print k

At the same time, look at the number of open files with lsof | grep ".git". I see that the number of open files increases by about 25. The only way to decrease the number of open files is to close the python session.
(I should point out that I am running on Mac OSX, with a mostly MacPorts python installation.)

from gitpython.

earwig avatar earwig commented on July 16, 2024

I have some reports of the same issue in earwig/git-repo-updater#9, I think. Will ask people if GitPython HEAD fixes it.

from gitpython.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    πŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. πŸ“ŠπŸ“ˆπŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❀️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.