jnwatson / py-lmdb
Universal Python binding for the LMDB 'Lightning' Database
Home Page: http://lmdb.readthedocs.io/
License: Other
03:46 hyc dw: hmm, lmdb 0.77 doc still talks about dbis being in shared memory, between processes
Try removing the buffer object hacks, so that each cursor iteration will produce 2 new buffers and a new tuple, and test the performance impact.
The API is nasty in places, particularly around iteration and transaction management. This ticket is to track various ideas/compromises to produce a better API.
Track the current non-nested write transaction in EnvObject. If trans_new() is called on the environment without parent=, raise an exception instead of deadlocking.
Looks like the implementations diverge; they should check for a write transaction and fail if not.
Cursors must always call MDB_GET_CURRENT to fetch the current key/value, since caching the value during an insert causes the cached pointers to become invalid. This is a memory corruption issue in the current design.
They're terribly inconsistent
Consider adding an API that opens a temporary transaction for the operation. Since most of the cost of starting a transaction is Python overhead, at least in the CPython version an API that combines begin/read/put/delete/commit into a single function would have a significant performance benefit, and additionally allow avoiding allocation of a Transaction object.
Maybe:
Environment.get(k)
Environment.gets([keys])
Environment.put(k, v)
Environment.puts({k: v} or ((k, v), ...))
Environment.delete(k)
Environment.deletes([keys])
etc
Wait until at least dupsort=True works, and any API changes are completed to accommodate it.
Upsides: encourages many short transactions
Downsides: encourages many short transactions
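The proposed convenience API could be a thin wrapper over the existing begin()/commit() pattern. A minimal sketch: the env_get/env_put/env_delete helpers are hypothetical names from the list above, and InMemoryEnv is a stand-in for lmdb.Environment so the sketch runs without LMDB installed.

```python
# Sketch of the proposed combined begin/op/commit helpers.
# InMemoryEnv and _Txn are stand-ins for lmdb.Environment and
# lmdb.Transaction; env_get/env_put/env_delete are hypothetical.

class _Txn(object):
    def __init__(self, env):
        self._env = env
    def __enter__(self):
        return self
    def __exit__(self, *exc):
        return False
    def get(self, key):
        return self._env._data.get(key)
    def put(self, key, value):
        self._env._data[key] = value
    def delete(self, key):
        return self._env._data.pop(key, None) is not None

class InMemoryEnv(object):
    def __init__(self):
        self._data = {}
    def begin(self, write=False):
        return _Txn(self)

def env_get(env, key):
    # Combines begin/read/abort into one call.
    with env.begin() as txn:
        return txn.get(key)

def env_put(env, key, value):
    # Combines begin/put/commit into one call.
    with env.begin(write=True) as txn:
        txn.put(key, value)

def env_delete(env, key):
    # Combines begin/delete/commit into one call.
    with env.begin(write=True) as txn:
        return txn.delete(key)

env = InMemoryEnv()
env_put(env, b'a', b'1')
assert env_get(env, b'a') == b'1'
assert env_delete(env, b'a') is True
```

The real implementation would also avoid allocating a Transaction object per call, which is where the claimed performance benefit comes from.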
The C module is a mess. It has specialized versions for more or less every operation, even though LMDB internally implements everything in terms of cursors. Prior to introducing transaction/cursor/iterator freelists, all operations should be rewritten in terms of creating temporary CursorObjects, because it will greatly simplify the freelist change.
http://pythonhosted.org/sphinxcontrib-doxylink/
Requires generating a tag file and keeping it consistent with Doxygen. Maybe too much effort.
Missing functionality
Using git master and lmdb 0.9.10 on FreeBSD with the following simple writer:
import lmdb
import time

env = lmdb.open('pb', map_size=1024**4, metasync=False, sync=False)
recnum = 1000000
data = 'A' * 250
start = time.time()
for i in xrange(recnum):
    with env.begin(write=True) as txn:
        txn.put(str(i), data)
end = time.time()
print end - start, recnum / (end - start)
env.close()
The process starts with 5896K memory usage and ends with 130M.
Re-running the same program yields 408M memory usage (145M just after opening the environment). Running it a third time also gives a 408M maximum.
Why is this?
Also, I have a "reader", which is supposed to read all the records and delete them if processed:
import lmdb
import time

env = lmdb.open('pb', map_size=1024**4, metasync=False, sync=False)
while True:
    sleep = False
    with env.begin(write=True) as txn:
        cur = txn.cursor()
        res = cur.first()
        if res:
            cur.delete()
        else:
            sleep = True
    if sleep:
        time.sleep(0.1)
This runs for a few seconds and then dies:
lmdb.PageFullError: mdb_cursor_del: MDB_PAGE_FULL: Internal error - page has no more space
Basically I would like to build a two-process (reader/writer) FIFO queue here (I'm aware that the sorting is alphabetical, so the order is not perfect).
I'm new to lmdb, so please bear with me, if I misunderstood something.
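Since LMDB orders keys lexicographically, the "order is not perfect" caveat above can be worked around with fixed-width, zero-padded keys, which sort in the same order as their numeric sequence numbers. A small sketch in plain Python (no LMDB needed; the 16-digit width is an arbitrary choice for illustration):

```python
# Fixed-width zero-padded keys sort lexicographically in the same
# order as their numeric values, so an LMDB cursor walking first()
# to last() would visit records in insertion order.

def fifo_key(seq):
    # Width of 16 digits is an assumption for this sketch.
    return b'%016d' % seq

keys = [fifo_key(n) for n in (1, 5, 10, 100, 2)]
assert sorted(keys) == [fifo_key(n) for n in (1, 2, 5, 10, 100)]

# Contrast with plain str(n) keys, where b'10' sorts before b'2':
plain = sorted(str(n).encode() for n in (1, 2, 10))
assert plain == [b'1', b'10', b'2']
```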
It's impossible to use pydoc on the module without causing a cffi build to start, which is horribly broken.
Are you aware of https://github.com/tspurway/pymdb-lightning ?
Test all public APIs.
There are several reasons why this may ultimately be pointless, particularly in Python, since multiprocessing is already the obviously better approach to throughput with MDB.
There are equally as many reasons why it might be a good idea.
Granularity is another question. Endlessly dropping/reacquiring the lock, e.g. during iteration, might result in more overhead than is required to actually complete the operation. Requires testing.
Mostly done already, needs tests
Currently it's impossible to differentiate between error codes. This isn't ordinarily a problem, since LMDB is quite disciplined about raising only truly fatal errors; however, in some circumstances differentiation may still be quite useful, e.g. when dealing with MDB_TXN_FULL.
Expose either a string attribute containing the "MDB_TXN_FULL" constant name of the error that occurred, or expose the error constant value itself (along with all error constants at module scope), or define subclasses for each type of exception (almost certainly pointless).
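From the caller's side, the two main options above could look roughly like this. The Error/TxnFullError classes and the `name` attribute below are stubs illustrating the proposal, not py-lmdb's current API:

```python
# Stub hierarchy illustrating the two proposals: a `name` attribute
# carrying the MDB_* constant name, and per-error subclasses.

class Error(Exception):
    name = None          # e.g. 'MDB_TXN_FULL' (hypothetical attribute)

class TxnFullError(Error):
    name = 'MDB_TXN_FULL'

def risky_put():
    # Stand-in for a put() that hits MDB_TXN_FULL.
    raise TxnFullError('mdb_put: MDB_TXN_FULL')

# Option 1: branch on the exposed constant name.
try:
    risky_put()
except Error as e:
    handled_by_name = (e.name == 'MDB_TXN_FULL')

# Option 2: catch the specific subclass.
try:
    risky_put()
except TxnFullError:
    handled_by_subclass = True

assert handled_by_name and handled_by_subclass
```

The string-attribute option keeps the exception surface to a single class, while subclasses let callers use ordinary except clauses; both are shown only as sketches.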
Hi,
with env.stat() I cannot see statistics for subdatabases. Perhaps exposing the mdb_stat function to Python (I use the CPython interface) could solve this issue.
TIA, best regards
/gp
I notice that Database.get and Database.put do not take a Transaction parameter. It sure would be nice if there were an optional field to allow a separate Transaction.
In order to use multiple transactions, do you intend the client of py-lmdb to have to close and re-open the Database object? That seems to me the only way to provide a new transaction handle to the underlying mdb_get and mdb_put methods. Or perhaps I don't understand something about mdb (for example I'm not sure why a transaction is passed into mdb_open: so the database creation itself can be rolled back?)
BTW, thanks for your work so far! I'm pleasantly impressed that I've seen mdb achieve comparable performance to regular Python dictionaries for my usage.
Installed: latest py-lmdb version 0.59 via "pip install".
The symptom is that when pydoc is loaded first (for example by typing "import lmdb" inside ipython or ipdb), the import fails because the cffi fallback doesn't work (see trace below). Is it supposed to?
If I remove from __init__.py the lines below "Hack", it works fine (because the cffi stuff isn't loaded).
try:
    # Hack: disable speedups while testing or reading docstrings.
    if any(k in sys.modules for k in ('sphinx', 'pydoc')) or \
            os.getenv('LMDB_FORCE_CFFI') is not None:
        raise ImportError
I get the same error via LMDB_FORCE_CFFI=1 python -c 'import lmdb' from the command line.
In [1]: import lmdb
---------------------------------------------------------------------------
IOError Traceback (most recent call last)
<ipython-input-1-1d2f8f67cf57> in <module>()
----> 1 import lmdb
/usr/local/lib/python2.7/dist-packages/lmdb/__init__.py in <module>()
35 from lmdb.cpython import *
36 except ImportError:
---> 37 from lmdb.cffi import *
38 from lmdb.cffi import __doc__
39
/usr/local/lib/python2.7/dist-packages/lmdb/cffi.py in <module>()
227 sources=['lib/mdb.c', 'lib/midl.c'],
228 extra_compile_args=['-Wno-shorten-64-to-32'],
--> 229 include_dirs=['lib']
230 )
231
/usr/local/lib/python2.7/dist-packages/cffi/api.pyc in verify(self, source, tmpdir, **kwargs)
309 tmpdir = tmpdir or _caller_dir_pycache()
310 self.verifier = Verifier(self, source, tmpdir, **kwargs)
--> 311 lib = self.verifier.load_library()
312 self._libraries.append(lib)
313 return lib
/usr/local/lib/python2.7/dist-packages/cffi/verifier.pyc in load_library(self)
66 self._locate_module()
67 if not self._has_module:
---> 68 self.compile_module()
69 return self._load_library()
70
/usr/local/lib/python2.7/dist-packages/cffi/verifier.pyc in compile_module(self)
53 raise ffiplatform.VerificationError("module already compiled")
54 if not self._has_source:
---> 55 self._write_source()
56 self._compile_module()
57
/usr/local/lib/python2.7/dist-packages/cffi/verifier.pyc in _write_source(self, file)
126 if must_close:
127 _ensure_dir(self.sourcefilename)
--> 128 file = open(self.sourcefilename, 'w')
129 self._vengine._f = file
130 try:
IOError: [Errno 2] No such file or directory: '/usr/local/lib/python2.7/dist-packages/lmdb/__pycache__/_cffi__xb735746ax21954bad.c'
cpython trans_new checks arg.env before calling parse_args().
This makes lmdb.Transaction(env) fail with "TypeError: 'env' argument required"
See also #5
Hello. After trying to install it on Windows 7 in different ways, it seems it also does not compile via pip installation on Ubuntu Server 14.01 (VirtualBox) as root (sudo) and under Python 2.7 (virtualenv). Without virtualenv, the same:
....................................................
running install
running build
running build_py
creating build
creating build/lib.linux-i686-2.7
creating build/lib.linux-i686-2.7/lmdb
copying lmdb/__init__.py -> build/lib.linux-i686-2.7/lmdb
copying lmdb/cffi.py -> build/lib.linux-i686-2.7/lmdb
running build_ext
building 'cpython' extension
creating build/temp.linux-i686-2.7
creating build/temp.linux-i686-2.7/lmdb
creating build/temp.linux-i686-2.7/lib
gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -Ilib -I/usr/include/python2.7 -c lmdb/cpython.c -o build/temp.linux-i686-2.7/lmdb/cpython.o -Wno-shorten-64-to-32
lmdb/cpython.c:29:20: fatal error: Python.h: No such file or directory
compilation terminated.
error: command 'gcc' failed with exit status 1
Command /usr/bin/python -c "import setuptools;__file__='/home/u/py27/bin/build/lmdb/setup.py';exec(compile(open(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --single-version-externally-managed --record /tmp/pip-z_exK6-record/install-record.txt failed with error code 1
Storing complete log in /home/u/.pip/pip.log
...................................................................
"To install the Python module, ensure a C compiler and pip or easy_install are available and type:
pip install lmdb
/ or
easy_install lmdb"
I'm new to Python; I only know that there's a problem. I have gcc, so maybe it's a different issue. Thanks for this module.
I tried to implement a trick from wxPython where, in order to flag a dead object, an instance's __class__ is switched out for a 'class Invalid'. This way objects depending on some resource don't need to test for validity on every operation (allowing removal of two ifs on every iteration).
This hack, along with __del__, doesn't work properly yet, and needs to be fixed before the binding is used in long-lived code.
Per issue #18, new arg parsing broke at least one thing, but also there are likely more of these issues hiding in the code.
Perhaps need to make NOINLINE conditional, and ensure it builds under MinGW or MSVC. Submit binaries to the cheese shop (Travis CI?).
Only 20 entries make it into the file. Either mdb_cursor_put is buggy or it's being used incorrectly in the binding.
Latest lmdb 0.59, installed via "pip install".
This did work in 0.4 (via lmdb.connect()).
In [2]: import lmdb
In [3]: env = lmdb.Environment( "/tmp/test.lmdb", readonly=True )
---------------------------------------------------------------------------
Error Traceback (most recent call last)
<ipython-input-3-3c4e28c0391f> in <module>()
----> 1 env = lmdb.Environment( "/tmp/test.lmdb", readonly=True )
Error: mdb_txn_begin: Permission denied
The current scheme sucks. We can get rid of most of the valid() calls by installing a (void (*)(), PyObject *) callback in dependencies, which they fire to trigger invalidation. Then each object sets its valid = 0.
cpython cursor_put sets MDB_APPEND if append=False, per IRC conversation.
Per issue #50, it is not entirely obvious using the standard Linux tools that LMDB's memory can be evicted under memory pressure. Add a small paragraph to the docs to assist users in correctly measuring memory usage.
py-lmdb version 0.59, installed from pip install. Operating on an existing database.
In process 1:
In [1]: import lmdb
In [2]: env = lmdb.Environment( "/tmp/test.lmdb" )
In [3]: txn = env.begin(write=True)
Then, in process 2:
In [1]: import lmdb
In [2]: env = lmdb.Environment( "/tmp/test.lmdb" )
Line 2 in process 2 hangs until process 1 executes a txn.commit(). This prevents a process from connecting to a database (even if it only wants to read from it) while a long-lived transaction is outstanding. This certainly wasn't the behavior in version 0.4.
On CPython, returning a buffer to represent a memory area of even 8 bytes using cffi requires allocation of two objects: the cffi minibuffer (48 bytes) followed by the Python 2.x buffer object (64 bytes).
It should be possible to do much better, by having Transaction and Cursor preallocate two buffer instances at construction time when buffers=True, arrange for key/value/item to always return references to these buffers, and then use some evil trick to flip the pre-existing buffer's pointer/length to point at MDB_vals returned during lookup/iteration.
Version: 0.59 installed via 'pip install'.
I'm getting lmdb.Error: mdb_put: MDB_MAP_FULL: Environment mapsize limit reached very quickly in my internal benchmark.
The following demonstrates the problem: the map_size argument is being ignored.
>>> import lmdb
>>> env = lmdb.open("/tmp/temp.lmdb", map_size=2**39 )
>>> env.info()
{'max_readers': 126L, 'map_size': 1048576L, 'last_pgno': 1L, 'num_readers': 0L, 'map_addr': 0L, 'last_txnid': 0L}
map_size=2**30 works fine. Perhaps a 32 vs. 64-bit problem.
I see map_size is an int (line 635). It really should be a size_t, since that's what the underlying library asks for.
Probably could use a fat warning label in the docs
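The int-vs-size_t point can be illustrated with the struct module, using the native format codes 'i' (C int) and 'N' (size_t) as stand-ins for the two parameter types:

```python
import struct

map_size = 2 ** 39  # 512 GiB, as in the report above

# A 32-bit C int cannot represent 2**39, so packing it fails ...
try:
    struct.pack('i', map_size)
    fits_in_int = True
except struct.error:
    fits_in_int = False
assert not fits_in_int

# ... but size_t can, on a 64-bit build (calcsize('N') == 8 there).
if struct.calcsize('N') == 8:
    packed = struct.pack('N', map_size)
    assert struct.unpack('N', packed)[0] == map_size
```

This is only a demonstration of the C-level range mismatch, not of py-lmdb's argument parsing; the truncation path inside the binding may differ.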
The documentation at RTD documents HEAD, but has lmdb 0.66 documentation as its title.
Figure out whether these calls are expensive, and if so, curtail their unnecessary use in various places.
Looks like an MDB-level problem, but possibly also memory corruption.
Transactions should allow dictionary like operations like
txn[key] = value
del txn[key]
key in txn
for key, value in txn.items()
etc.
Given 2 cursors:
So there needs to be a way to track repositioning. After various thoughts, it seems the best way to do this is:
for (IteratorObject *iter = txn->iter_head; iter != NULL; iter = iter->iter_next) {
    CursorObject *curs = iter->curs;
    if (curs->positioned && iter->started && iter->op == MDB_NEXT &&
            curs->key.mv_size == to_delete.mv_size &&
            /* See ticket #43! */
            !memcmp(curs->key.mv_data, to_delete.mv_data, to_delete.mv_size)) {
        iter->started = 0;
    }
}
By clearing iter->started, the iterator avoids MDB_NEXT/MDB_PREV prior to the next next() call. The MDB_NEXT test exists because a reverse iterator already behaves correctly, calling MDB_PREV to skip the record its cursor has been newly positioned onto.
Ticket #43 describes yet another design screwup: curs->key shouldn't be cached at all. So any delete will require walking a list of all iterators, calling MDB_GET_CURRENT on their associated cursor, then string-comparing the result to see if the iterator needs its state hacked.
This also means the cffi version will cross the Python/C boundary far more often, which will make performance suck even harder.
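The hazard being worked around can be simulated in plain Python: after a delete, the cursor already sits on the following record (as mdb_cursor_del leaves it), so a forward iterator that still advances with MDB_NEXT skips one entry. A toy model over a sorted key list:

```python
# Toy model of a cursor over sorted keys. Deleting the current key
# leaves the cursor positioned on the *next* key, so an iterator must
# suppress its pending MDB_NEXT step or it silently skips a record.

keys = [b'a', b'b', b'c', b'd']
pos = 0                      # cursor positioned on b'a'

del keys[pos]                # delete b'a'; cursor now "on" b'b'

# Naive iterator: advances (MDB_NEXT) before reading -> skips b'b'.
naive_next = keys[pos + 1]
assert naive_next == b'c'

# Fixed iterator: its 'started' flag was cleared (iter->started = 0),
# so it reads in place instead of advancing first.
started = False
fixed_next = keys[pos] if not started else keys[pos + 1]
assert fixed_next == b'b'
```

This is only an illustration of the state-tracking problem, not of the binding's actual cursor code.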
The Environment.puts() docs (online version) show this example:
a_existed, b_existed = env.puts(overwrite=False, items={
    'a': '1',
    'b': '2',
})
if a_existed:
    print 'Did not overwrite a, it already existed.'
if b_existed:
    print 'Did not overwrite b, it already existed.'
However, dictionary order is undefined in Python, so the items={...} could be processed in any order.
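Passing a sequence of pairs instead of a dict would make the result order well-defined. A sketch with a stand-in puts() that mimics the documented return value (fake_puts is an illustration of the ordering point, not py-lmdb code):

```python
# Stand-in for the documented Environment.puts(): returns one flag per
# item indicating whether the key already existed (overwrite=False
# means existing keys are left untouched). Hypothetical, for the
# ordering argument only.

def fake_puts(store, items, overwrite=True):
    existed = []
    for key, value in items:       # order of `items` drives result order
        present = key in store
        existed.append(present)
        if overwrite or not present:
            store[key] = value
    return existed

store = {'a': 'old'}
# A tuple of pairs has a defined order, unlike a dict literal:
a_existed, b_existed = fake_puts(
    store, (('a', '1'), ('b', '2')), overwrite=False)

assert a_existed is True           # 'a' was present, left as 'old'
assert b_existed is False          # 'b' was absent, now written
assert store == {'a': 'old', 'b': '2'}
```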
This is an upstream problem. Setting too low a map_size causes LMDB to explode.
The docs mention using str() to convert buffers to strings, but I think the correct suggestion (for both Python 2 and 3) would be bytes(), since we're dealing with byte strings here, not Unicode.
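The difference is easy to demonstrate with memoryview as a stand-in for the buffer objects py-lmdb returns in buffers=True mode:

```python
# Under Python 3, str() on a buffer-like object yields its repr, not
# the underlying data; bytes() copies the data out correctly.

buf = memoryview(b'value')   # stand-in for a buffer returned by py-lmdb

assert bytes(buf) == b'value'            # correct on both 2 and 3
assert str(buf).startswith('<memory')    # Python 3: repr, not the data
```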