
py-lmdb's People

Contributors

achalddave, asellappen, astraluma, avdn, chbrown, de-code, dw, ecederstrand, groz, gustavla, heyman, ispequalnp, itayb, jamespic, jcea, jnwatson, lamby, mbattifarano, nirs, odidev, ongteckwu, tejasvi, vitalyrepin, wbolster, willsthompson, zhukovalexander


py-lmdb's Issues

incorrect docs

03:46 hyc dw: hmm, lmdb 0.77 doc still talks about dbis being in shared memory, between processes

API discussion

The API is nasty in places, particularly around iteration and transaction management. This ticket is to track various ideas/compromises to produce a better API.

Prevent multiple-writer deadlock

Track the current non-nested write transaction in EnvObject. If trans_new() is called on the environment without parent=, raise an exception instead of deadlocking.
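A minimal sketch of the check at the Python level; the _active_write_txn bookkeeping is hypothetical, and clearing it on commit/abort is omitted:

import lmdb

_active_write_txn = {}    # hypothetical: env id -> live top-level write txn

def begin_write(env, parent=None):
    # A second top-level write transaction in the same process would
    # block forever inside mdb_txn_begin(); fail loudly instead.
    if parent is None:
        if _active_write_txn.get(id(env)) is not None:
            raise lmdb.Error('write transaction already active')
        txn = env.begin(write=True)
        _active_write_txn[id(env)] = txn
        return txn
    return env.begin(write=True, parent=parent)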

Cursor objects cannot cache key/value fields

Cursors must always call MDB_GET_CURRENT to fetch the current key/value, since caching the value during an insert would leave the cached pointers dangling. This is a memory corruption issue in the current design.
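The analogous safe pattern at the Python level is to re-fetch through the cursor after any write (Cursor.item() maps to MDB_GET_CURRENT); a sketch, assuming an open environment env:

with env.begin(write=True) as txn:
    curs = txn.cursor()
    curs.first()
    txn.put(b'new', b'value')    # the insert may rearrange pages
    key, value = curs.item()     # re-fetch via MDB_GET_CURRENT; never
                                 # reuse results grabbed before the put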

Consider 'transactionless' API

Consider adding an API that opens a temporary transaction for the operation. Since most of the cost of starting a transaction is Python overhead, at least in the CPython version, an API that combines begin/read/put/delete/commit into a single function would bring a significant performance benefit, and would additionally avoid allocating a Transaction object.

Maybe:

Environment.get(k)
Environment.gets([keys])
Environment.put(k, v)
Environment.puts({k: v} or ((k, v), ...))
Environment.delete(k)
Environment.deletes([keys])

etc
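A rough Python-level equivalent of the proposed Environment.put(), written as a plain function (a sketch only, not part of any API):

def env_put(env, key, value):
    # begin/put/commit fused into one call; the C version would also
    # skip allocating the Transaction object entirely.
    with env.begin(write=True) as txn:
        return txn.put(key, value)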

Wait until at least dupsort=True works, and any API changes are completed to accommodate it.

Upsides: encourages many short transactions
Downsides: encourages many short transactions

Vastly simplify C version

The C module is a mess. It has specialized versions of more or less every operation, even though LMDB internally implements everything in terms of cursors. Prior to introducing transaction/cursor/iterator freelists, all operations should be rewritten in terms of creating temporary CursorObjects, because that will greatly simplify the freelist change.
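The equivalence, expressed at the Python level (the C version would create a temporary CursorObject the same way); a sketch:

with env.begin(write=True) as txn:
    curs = txn.cursor()
    curs.put(b'k', b'v')       # mdb_cursor_put
    if curs.set_key(b'k'):     # mdb_cursor_get(MDB_SET_KEY)
        value = curs.value()
    curs.delete()              # mdb_cursor_del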

Strange behaviour

Using git master and lmdb 0.9.10 on FreeBSD with the following simple writer:

import lmdb
import time

env = lmdb.open('pb',map_size=1024**4,metasync=False,sync=False)

recnum=1000000
data='A'*250
start=time.time()
for i in xrange(recnum):
        with env.begin(write=True) as txn:
                txn.put(str(i), data)
end=time.time()
print end-start, recnum/(end-start)
env.close()

the process starts with 5896K memory usage and ends with 130M.
Re-running the same program again yields 408M memory usage (145M just after opening the environment). Running it a third time also gives a 408M maximum.
Why is this?

Also, I have a "reader", which is supposed to read all the records and delete them if processed:

import lmdb
import time

env = lmdb.open('pb',map_size=1024**4,metasync=False,sync=False)

while True:
        sleep=False
        with env.begin(write=True) as txn:
                cur=txn.cursor()
                res=cur.first()
                if res:
                        cur.delete()
                else:
                        sleep=True
        if sleep:
                time.sleep(0.1)

This runs for a few seconds and then dies with:
lmdb.PageFullError: mdb_cursor_del: MDB_PAGE_FULL: Internal error - page has no more space

Basically I would like to do a two process (reader/writer) FIFO queue here (I'm aware that the sorting is alphabetical, so the order is not perfect).
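One way to make the lexicographic order match insertion order is fixed-width, zero-padded keys; a sketch against the writer above:

for i in xrange(recnum):
    with env.begin(write=True) as txn:
        # '000000000', '000000001', ... sort numerically as strings
        txn.put('%09d' % i, data)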

I'm new to lmdb, so please bear with me, if I misunderstood something.

Test suite

Test all public APIs:

  • Forward iteration
    • From start
    • From existent key (SET_KEY)
    • From existent key (SET_RANGE)
    • From non-existent key (SET_RANGE)
    • From last key
    • From non-existent key past end
  • Forward iteration with dups
    • From existent (key, value) dup record
    • From non-existent (key, value) dup record
    • From last (key, value) dup record
    • From first (key, value) dup record
    • From last (key, value) in database
    • From last (key, value) past end
  • Reverse iteration as with forward
  • Reverse dup iteration as with forward dup
  • Test mixed seek/mutation:
    • As with forward iteration
    • put() during iteration
    • put() error conditions during iteration
    • delete() during iteration
    • delete() error conditions during iteration
  • delete() and put() after seek() errors
  • with/without writemap
  • error conditions for readonly txn/env
    • Cursor.put, Txn.put, Txn.delete, Txn.drop, etc
  • Leaks / crashes
    • Try every operation after close_db(), Env.close(), Txn.abort()
    • Try cursors after above
    • Existing cursors, try to create new cursors
    • Iterators
    • Delete lock file, database file
    • Truncate lock file, database file
  • Evil data
    • Empty string key, empty string value
    • huge key, huge value
    • Boundary case key (511 bytes, 510 bytes)
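A sketch of how one of these cases might be written, assuming a pytest-style fixture env pre-populated with keys b'a', b'b', b'c':

def test_forward_from_nonexistent_key_set_range(env):
    # SET_RANGE on a missing key must land on the next existing key.
    with env.begin() as txn:
        curs = txn.cursor()
        assert curs.set_range(b'b0')                       # lands on b'c'
        assert list(curs.iternext(values=False)) == [b'c']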

Explore releasing GIL around reads

There are several reasons why this may ultimately be pointless, particularly in Python, since multiple processes are already the obviously better route to higher throughput with MDB.

There are equally many reasons why it might be a good idea.

Granularity is another question: endlessly dropping and reacquiring the lock, e.g. during iteration, might cost more overhead than the operation itself. Requires testing.
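A sketch of the kind of measurement required, comparing threaded read throughput before and after any GIL change (env assumed open and pre-populated with zero-padded keys):

import threading, time

def reader(env, n):
    with env.begin() as txn:
        for i in xrange(n):
            txn.get('%09d' % (i % 1000))

threads = [threading.Thread(target=reader, args=(env, 100000))
           for _ in range(4)]
start = time.time()
for t in threads: t.start()
for t in threads: t.join()
print 'reads/sec:', 400000 / (time.time() - start)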

better exceptions

Currently it's impossible to differentiate between error codes. This isn't ordinarily a problem, since LMDB is quite disciplined about raising only truly fatal errors; still, in some circumstances it may be quite useful, e.g. when dealing with MDB_TXN_FULL.

Expose either a string attribute containing the constant name of the error that occurred (e.g. "MDB_TXN_FULL"), or expose the error constant value itself (along with all error constants at module scope), or define subclasses for each type of exception (almost certainly pointless).
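The first option might be used like this (a sketch; the .name attribute is hypothetical and does not exist today):

try:
    txn.put(key, value)
except lmdb.Error, e:
    if getattr(e, 'name', None) == 'MDB_TXN_FULL':   # hypothetical attribute
        txn.abort()   # e.g. fall back to smaller write batches
    else:
        raise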

Lack of transaction parameter in Database's get and put methods

I notice that Database.get and Database.put do not take a Transaction parameter. It sure would be nice if there were an optional field to allow a separate Transaction.

In order to use multiple transactions, do you intend the client of py-lmdb to have to close and re-open the Database object? That seems to me the only way to provide a new transaction handle to the underlying mdb_get and mdb_put methods. Or perhaps I don't understand something about mdb (for example I'm not sure why a transaction is passed into mdb_open: so the database creation itself can be rolled back?)

BTW, thanks for your work so far! I'm pleasantly impressed that I've seen mdb have comparable performance to regular Python dictionaries for my usage.

Interaction between cffi fallback and IPython

Installed: latest py-lmdb version 0.59 via "pip install".

The symptom is that when pydoc is loaded first (for example by typing "import lmdb" inside ipython or ipdb), the import fails because the cffi fallback doesn't work (see trace below). Is it supposed to?

If I remove the line below "Hack" from __init__.py, it works fine (because the cffi stuff isn't loaded).

try:
    # Hack: disable speedups while testing or reading docstrings.
    if any(k in sys.modules for k in ('sphinx', 'pydoc')) or \
            os.getenv('LMDB_FORCE_CFFI') is not None:
        raise ImportError

I get the same error via LMDB_FORCE_CFFI=1 python -c 'import lmdb' from the command line.

In [1]: import lmdb
---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-1-1d2f8f67cf57> in <module>()
----> 1 import lmdb

/usr/local/lib/python2.7/dist-packages/lmdb/__init__.py in <module>()
     35     from lmdb.cpython import *
     36 except ImportError:
---> 37     from lmdb.cffi import *
     38     from lmdb.cffi import __doc__
     39

/usr/local/lib/python2.7/dist-packages/lmdb/cffi.py in <module>()
    227     sources=['lib/mdb.c', 'lib/midl.c'],
    228     extra_compile_args=['-Wno-shorten-64-to-32'],
--> 229     include_dirs=['lib']
    230 )
    231

/usr/local/lib/python2.7/dist-packages/cffi/api.pyc in verify(self, source, tmpdir, **kwargs)
    309         tmpdir = tmpdir or _caller_dir_pycache()
    310         self.verifier = Verifier(self, source, tmpdir, **kwargs)
--> 311         lib = self.verifier.load_library()
    312         self._libraries.append(lib)
    313         return lib

/usr/local/lib/python2.7/dist-packages/cffi/verifier.pyc in load_library(self)
     66             self._locate_module()
     67             if not self._has_module:
---> 68                 self.compile_module()
     69         return self._load_library()
     70

/usr/local/lib/python2.7/dist-packages/cffi/verifier.pyc in compile_module(self)
     53             raise ffiplatform.VerificationError("module already compiled")
     54         if not self._has_source:
---> 55             self._write_source()
     56         self._compile_module()
     57

/usr/local/lib/python2.7/dist-packages/cffi/verifier.pyc in _write_source(self, file)
    126         if must_close:
    127             _ensure_dir(self.sourcefilename)
--> 128             file = open(self.sourcefilename, 'w')
    129         self._vengine._f = file
    130         try:

IOError: [Errno 2] No such file or directory: '/usr/local/lib/python2.7/dist-packages/lmdb/__pycache__/_cffi__xb735746ax21954bad.c'

document python-dev package requirement (was: error installing with pip in ubuntu)

Hello. After trying to install it on Windows 7 in different ways, it seems it also does not compile via pip on Ubuntu Server 14.01 (VirtualBox), as root (sudo), under Python 2.7 (virtualenv). Without virtualenv, the same:

....................................................

running install
running build
running build_py
creating build
creating build/lib.linux-i686-2.7
creating build/lib.linux-i686-2.7/lmdb
copying lmdb/__init__.py -> build/lib.linux-i686-2.7/lmdb
copying lmdb/cffi.py -> build/lib.linux-i686-2.7/lmdb
running build_ext
building 'cpython' extension
creating build/temp.linux-i686-2.7
creating build/temp.linux-i686-2.7/lmdb
creating build/temp.linux-i686-2.7/lib
gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -Ilib -I/usr/include/python2.7 -c lmdb/cpython.c -o build/temp.linux-i686-2.7/lmdb/cpython.o -Wno-shorten-64-to-32

lmdb/cpython.c:29:20: fatal error: Python.h: No such file or directory

compilation terminated.

error: command 'gcc' failed with exit status 1


Command /usr/bin/python -c "import setuptools;__file__='/home/u/py27/bin/build/lmdb/setup.py';exec(compile(open(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --single-version-externally-managed --record /tmp/pip-z_exK6-record/install-record.txt failed with error code 1
Storing complete log in /home/u/.pip/pip.log

...................................................................

"To install the Python module, ensure a C compiler and pip or easy_install are available and type:

pip install lmdb
/ or
easy_install lmdb"

I'm new to Python; I only know that there's a problem. I have gcc, so maybe it's some other issue. Thanks for this module.
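The missing Python.h ships in the Python development package, so on Debian/Ubuntu the fix is:

sudo apt-get install python-dev
pip install lmdb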

Fix resource management (invalidation/close)

I tried to implement a trick from wxPython where, in order to flag a dead object, an instance's __class__ is switched out for a 'class Invalid'. This way objects depending on some resource don't need to test for validity on every operation (allowing removal of two ifs on every iteration).

This hack, along with __del__, doesn't work properly yet, and needs to be fixed before the binding is used in long-lived code.
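The trick in miniature (a sketch; the class and method names are illustrative):

class Invalid(object):
    def _dead(self, *args, **kwargs):
        raise Exception('object is no longer valid')
    get = put = delete = _dead      # every public method raises

def invalidate(obj):
    # Swap the class: later calls fail loudly, with no per-call
    # validity test left on the hot path.
    obj.__class__ = Invalid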

Opening existing database in read-only mode fails

Latest lmdb 0.59, installed via "pip install".

This did work in 0.4, (via lmdb.connect())

In [2]: import lmdb
In [3]: env = lmdb.Environment( "/tmp/test.lmdb", readonly=True )
---------------------------------------------------------------------------
Error                                     Traceback (most recent call last)
<ipython-input-3-3c4e28c0391f> in <module>()
----> 1 env = lmdb.Environment( "/tmp/test.lmdb", readonly=True )

Error: mdb_txn_begin: Permission denied

Fix CPython module invalidation

Current scheme sucks. We can get rid of most of the valid() calls by installing a (void (*)(PyObject *)) callback in dependencies, which they fire to trigger invalidation. Then each object sets its valid = 0.

Clearly document memory usage measurement

Per issue #50, it is not entirely obvious using the standard Linux tools that LMDB's memory can be evicted under memory pressure. Add a small paragraph to the docs to assist users in correctly measuring memory usage.

Better method for opening main database handle (was: Write/open locking has changed)

py-lmdb version 0.59, installed from pip install. Operating on an existing database.

In process 1:

In [1]: import lmdb
In [2]: env = lmdb.Environment( "/tmp/test.lmdb" )
In [3]: txn = env.begin(write=True)

Then, in process 2:

In [1]: import lmdb
In [2]: env = lmdb.Environment( "/tmp/test.lmdb" )

Line 2 in process 2 hangs until process 1 executes txn.commit(). This prevents a process from connecting to a database (even just to read it) while a long-lived write transaction is outstanding. This certainly wasn't the behavior in version 0.4.

Experimental CPython extension

On CPython, returning a buffer to represent a memory area of even 8 bytes using cffi requires allocation of two objects: the cffi minibuffer (48 bytes) followed by the Python 2.x buffer object (64 bytes).

It should be possible to do much better, by having Transaction and Cursor preallocate two buffer instances at construction time when buffers=True, arrange for key/value/item to always return references to these buffers, and then use some evil trick to flip the pre-existing buffer's pointer/length to point at MDB_vals returned during lookup/iteration.
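The user-visible contract of buffers=True already permits this, since returned buffers are documented as valid only until the next operation on the transaction; a sketch of that existing behaviour:

with env.begin(write=True, buffers=True) as txn:
    buf = txn.get(b'a')      # buffer pointing into the map; no copy
    txn.put(b'b', b'2')      # buf may now reference stale memory
    buf = txn.get(b'a')      # re-fetch after any write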

lmdb.open map_size parameter not working

Version: 0.59 installed via 'pip install'

I'm getting lmdb.Error: mdb_put: MDB_MAP_FULL: Environment mapsize limit reached very quickly in my internal benchmark.

The following demonstrates the problem: the map_size argument is being ignored.

>>> import lmdb
>>> env = lmdb.open("/tmp/temp.lmdb", map_size=2**39 )
>>> env.info()
{'max_readers': 126L, 'map_size': 1048576L, 'last_pgno': 1L, 'num_readers': 0L, 'map_addr': 0L, 'last_txnid': 0L}

map_size=2**30 works fine. Perhaps a 32 vs. 64-bit problem.

I see map_size is an int (line 635). It really should be a size_t since that's what the underlying library asks for.

Iterators skip records when an independent cursor deletes the current record

Given 2 cursors:

  • Cursor 1 is positioned on key 'B'.
  • A 'started' iterator exists for Cursor 1.
  • Cursor 2 is positioned on key 'B'.
  • Cursor2.delete() is called.
  • Cursor2.delete() causes LMDB to reposition Cursor1 on next valid record ('C')
  • next(iterator) is called, causing iterator to advance Cursor1 to next position.
  • Except it's already in the next position.

So there needs to be a way to track repositioning. After various thoughts, it seems the best way to do this is:

  • Iterators become directly dependent on the Transaction object (PyObject_HEAD -> LmdbObject_HEAD)
  • Iterators grow a doubly linked list of sibling iterators belonging to the Transaction.
  • Transaction grows an iterator list head.
  • When iterator is created, it registers on the list.
  • When iterator is destroyed, it unregisters from the list.
  • When Transaction.delete() or Cursor.delete() is invoked, it performs:
for(IteratorObject *it = txn->iter_head; it != NULL; it = it->iter_next) {
    CursorObject *curs = it->curs;
    if(curs->positioned && it->started && it->op == MDB_NEXT &&
       curs->key.mv_size == to_delete.mv_size &&
       /* See ticket #43! */
       !memcmp(curs->key.mv_data, to_delete.mv_data, to_delete.mv_size)) {
        it->started = 0;
    }
}

By clearing it->started, the iterator skips its MDB_NEXT/MDB_PREV step on the following next() call. The op == MDB_NEXT test is there because a reverse iterator already behaves correctly: its MDB_PREV call skips past the record its cursor has been newly repositioned onto.

Ticket #43 describes yet another design screwup, in that curs->key shouldn't be cached at all. So any delete will require walking the list of all iterators, calling MDB_GET_CURRENT on each associated cursor, then string-comparing the result to see whether the iterator's state needs hacking.

This also means the cffi version will cross the Python/C boundary far more often, which will make performance suck even harder.

Example for Environment.puts() return value seems incorrect

The Environment.puts() docs (online version) show this example:

a_existed, b_existed = env.puts(overwrite=False, items={
    'a': '1',
    'b': '2'
})

if a_existed:
    print 'Did not overwrite a, it already existed.'
if b_existed:
    print 'Did not overwrite b, it already existed.'

However, dictionary order is undefined in Python, so the items={...} could be delivered in any order, making the unpacking shown above unreliable.
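Passing a sequence of pairs instead keeps the results aligned with the inputs, per the ((k, v), ...) form puts() also accepts; a sketch:

a_existed, b_existed = env.puts(overwrite=False,
                                items=[('a', '1'), ('b', '2')])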

Use bytes() in docs, not str()

The docs mention using str() to convert buffers to strings, but I think the correct suggestion (for both Python 2 and 3) would be bytes(), since we're dealing with byte strings here, not Unicode.
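For example:

buf = txn.get(b'key')    # with buffers=True: a buffer object, not a string
data = bytes(buf)        # a byte string on Python 2 (where bytes is str) and 3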
