jnwatson / py-lmdb
Universal Python binding for the LMDB 'Lightning' Database
Home Page: http://lmdb.readthedocs.io/
License: Other
03:46 hyc dw: hmm, lmdb 0.77 doc still talks about dbis being in shared memory, between processes
Try removing the buffer object hacks, so that each cursor iteration will produce 2 new buffers and a new tuple, and test the performance impact.
The API is nasty in places, particularly around iteration and transaction management. This ticket is to track various ideas/compromises to produce a better API.
Track the current non-nested write transaction in EnvObject. If trans_new() is called on the environment without parent=, raise an exception instead of deadlocking.
Looks like the implementations diverge; they should check for a write transaction and fail if not.
Cursors must always call MDB_GET_CURRENT to fetch the current key/value, since caching the value during an insert causes the cached pointers to become invalid. This is a memory corruption issue in the current design.
They're terribly inconsistent
Consider adding an API that opens a temporary transaction for the operation. Since most of the cost of starting a transaction is Python overhead, at least in the CPython version an API that combines begin/read/put/delete/commit into a single function would have a significant performance benefit, and additionally allow avoiding allocation of a Transaction object.
Maybe:
Environment.get(k)
Environment.gets([keys])
Environment.put(k, v)
Environment.puts({k: v} or ((k, v), ...))
Environment.delete(k)
Environment.deletes([keys])
etc
Wait until at least dupsort=True works, and any API changes are completed to accommodate it.
Upsides: encourages many short transactions
Downsides: encourages many short transactions
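The proposed convenience API could be a thin wrapper over the existing begin()/commit() pattern. A minimal sketch: the env_get/env_put/env_delete helpers are hypothetical names from the list above, and InMemoryEnv is a stand-in for lmdb.Environment so the sketch runs without LMDB installed.

```python
# Sketch of the proposed combined begin/op/commit helpers.
# InMemoryEnv and _Txn are stand-ins for lmdb.Environment and
# lmdb.Transaction; env_get/env_put/env_delete are hypothetical.

class _Txn(object):
    def __init__(self, env):
        self._env = env
    def __enter__(self):
        return self
    def __exit__(self, *exc):
        return False
    def get(self, key):
        return self._env._data.get(key)
    def put(self, key, value):
        self._env._data[key] = value
    def delete(self, key):
        return self._env._data.pop(key, None) is not None

class InMemoryEnv(object):
    def __init__(self):
        self._data = {}
    def begin(self, write=False):
        return _Txn(self)

def env_get(env, key):
    # Combines begin/read/abort into one call.
    with env.begin() as txn:
        return txn.get(key)

def env_put(env, key, value):
    # Combines begin/put/commit into one call.
    with env.begin(write=True) as txn:
        txn.put(key, value)

def env_delete(env, key):
    # Combines begin/delete/commit into one call.
    with env.begin(write=True) as txn:
        return txn.delete(key)

env = InMemoryEnv()
env_put(env, b'a', b'1')
assert env_get(env, b'a') == b'1'
assert env_delete(env, b'a') is True
```

The real implementation would also avoid allocating a Transaction object per call, which is where the claimed performance benefit comes from.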
The C module is a mess. It has specialized versions for more or less every operation, even though LMDB internally implements everything in terms of cursors. Prior to introducing transaction/cursor/iterator freelists, all operations should be rewritten in terms of creating temporary CursorObjects, because it will greatly simplify the freelist change.
http://pythonhosted.org/sphinxcontrib-doxylink/
Requires generating a tag file and keeping it consistent with Doxygen. Maybe too much effort.
Missing functionality
Using git master and lmdb 0.9.10 on FreeBSD with the following simple writer:
import lmdb
import time

env = lmdb.open('pb', map_size=1024**4, metasync=False, sync=False)
recnum = 1000000
data = 'A' * 250
start = time.time()
for i in xrange(recnum):
    with env.begin(write=True) as txn:
        txn.put(str(i), data)
end = time.time()
print end - start, recnum / (end - start)
env.close()
The process starts with 5896K memory usage and ends with 130M.
Re-running the same program yields 408M memory usage (145M just after opening the environment). Running it a third time also gives a 408M maximum.
Why is this?
Also, I have a "reader", which is supposed to read all the records and delete them if processed:
import lmdb
import time

env = lmdb.open('pb', map_size=1024**4, metasync=False, sync=False)
while True:
    sleep = False
    with env.begin(write=True) as txn:
        cur = txn.cursor()
        res = cur.first()
        if res:
            cur.delete()
        else:
            sleep = True
    if sleep:
        time.sleep(0.1)
This runs for a few seconds and then dies:
lmdb.PageFullError: mdb_cursor_del: MDB_PAGE_FULL: Internal error - page has no more space
Basically I would like to build a two-process (reader/writer) FIFO queue here (I'm aware that the sorting is alphabetical, so the order is not perfect).
I'm new to lmdb, so please bear with me, if I misunderstood something.
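Since LMDB orders keys lexicographically, the "order is not perfect" caveat above can be worked around with fixed-width, zero-padded keys, which sort in the same order as their numeric sequence numbers. A small sketch in plain Python (no LMDB needed; the 16-digit width is an arbitrary choice for illustration):

```python
# Fixed-width zero-padded keys sort lexicographically in the same
# order as their numeric values, so an LMDB cursor walking first()
# to last() would visit records in insertion order.

def fifo_key(seq):
    # Width of 16 digits is an assumption for this sketch.
    return b'%016d' % seq

keys = [fifo_key(n) for n in (1, 5, 10, 100, 2)]
assert sorted(keys) == [fifo_key(n) for n in (1, 2, 5, 10, 100)]

# Contrast with plain str(n) keys, where b'10' sorts before b'2':
plain = sorted(str(n).encode() for n in (1, 2, 10))
assert plain == [b'1', b'10', b'2']
```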
It's impossible to use pydoc on the module without causing a cffi build to start, which is horribly broken.
Are you aware of https://github.com/tspurway/pymdb-lightning ?
Test all public APIs.
There are several reasons why this may ultimately be pointless, particularly in Python, since multiprocessing is already the obviously better approach to throughput with MDB.
There are equally as many reasons why it might be a good idea.
Granularity is another question. Endlessly dropping/reacquiring the lock, e.g. during iteration, might result in more overhead than is required to actually complete the operation. Requires testing.
Mostly done already, needs tests
Currently it's impossible to differentiate between error codes. This isn't ordinarily a problem, since LMDB is quite disciplined about raising only truly fatal errors; however, in some circumstances differentiation may still be quite useful, e.g. when dealing with MDB_TXN_FULL.
Expose either a string attribute containing the "MDB_TXN_FULL" constant name of the error that occurred, or expose the error constant value itself (along with all error constants at module scope), or define subclasses for each type of exception (almost certainly pointless).
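From the caller's side, the two main options above could look roughly like this. The Error/TxnFullError classes and the `name` attribute below are stubs illustrating the proposal, not py-lmdb's current API:

```python
# Stub hierarchy illustrating the two proposals: a `name` attribute
# carrying the MDB_* constant name, and per-error subclasses.

class Error(Exception):
    name = None          # e.g. 'MDB_TXN_FULL' (hypothetical attribute)

class TxnFullError(Error):
    name = 'MDB_TXN_FULL'

def risky_put():
    # Stand-in for a put() that hits MDB_TXN_FULL.
    raise TxnFullError('mdb_put: MDB_TXN_FULL')

# Option 1: branch on the exposed constant name.
try:
    risky_put()
except Error as e:
    handled_by_name = (e.name == 'MDB_TXN_FULL')

# Option 2: catch the specific subclass.
try:
    risky_put()
except TxnFullError:
    handled_by_subclass = True

assert handled_by_name and handled_by_subclass
```

The string-attribute option keeps the exception surface to a single class, while subclasses let callers use ordinary except clauses; both are shown only as sketches.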
Hi,
with env.stat() I cannot see statistics for subdatabases. Perhaps exposing the mdb_stat function to Python (I use the CPython interface) could solve this issue.
TIA, best regards
/gp
I notice that Database.get and Database.put do not take a Transaction parameter. It sure would be nice if there were an optional field to allow a separate Transaction.
In order to use multiple transactions, do you intend the client of py-lmdb to have to close and re-open the Database object? That seems to me the only way to provide a new transaction handle to the underlying mdb_get and mdb_put methods. Or perhaps I don't understand something about mdb (for example I'm not sure why a transaction is passed into mdb_open: so the database creation itself can be rolled back?)
BTW, thanks for your work so far! I'm pleasantly impressed that I've seen mdb achieve comparable performance to regular Python dictionaries for my usage.
Installed: latest py-lmdb version 0.59 via "pip install".
The symptom is that when pydoc is loaded first (for example by typing "import lmdb" inside ipython or ipdb), the import fails because the cffi fallback doesn't work (see trace below). Is it supposed to?
If I remove from __init__.py the lines below "Hack", it works fine (because the cffi stuff isn't loaded).
try:
    # Hack: disable speedups while testing or reading docstrings.
    if any(k in sys.modules for k in ('sphinx', 'pydoc')) or \
            os.getenv('LMDB_FORCE_CFFI') is not None:
        raise ImportError
I get the same error via LMDB_FORCE_CFFI=1 python -c 'import lmdb' from the command line.
In [1]: import lmdb
---------------------------------------------------------------------------
IOError Traceback (most recent call last)
<ipython-input-1-1d2f8f67cf57> in <module>()
----> 1 import lmdb
/usr/local/lib/python2.7/dist-packages/lmdb/__init__.py in <module>()
35 from lmdb.cpython import *
36 except ImportError:
---> 37 from lmdb.cffi import *
38 from lmdb.cffi import __doc__
39
/usr/local/lib/python2.7/dist-packages/lmdb/cffi.py in <module>()
227 sources=['lib/mdb.c', 'lib/midl.c'],
228 extra_compile_args=['-Wno-shorten-64-to-32'],
--> 229 include_dirs=['lib']
230 )
231
/usr/local/lib/python2.7/dist-packages/cffi/api.pyc in verify(self, source, tmpdir, **kwargs)
309 tmpdir = tmpdir or _caller_dir_pycache()
310 self.verifier = Verifier(self, source, tmpdir, **kwargs)
--> 311 lib = self.verifier.load_library()
312 self._libraries.append(lib)
313 return lib
/usr/local/lib/python2.7/dist-packages/cffi/verifier.pyc in load_library(self)
66 self._locate_module()
67 if not self._has_module:
---> 68 self.compile_module()
69 return self._load_library()
70
/usr/local/lib/python2.7/dist-packages/cffi/verifier.pyc in compile_module(self)
53 raise ffiplatform.VerificationError("module already compiled")
54 if not self._has_source:
---> 55 self._write_source()
56 self._compile_module()
57
/usr/local/lib/python2.7/dist-packages/cffi/verifier.pyc in _write_source(self, file)
126 if must_close:
127 _ensure_dir(self.sourcefilename)
--> 128 file = open(self.sourcefilename, 'w')
129 self._vengine._f = file
130 try:
IOError: [Errno 2] No such file or directory: '/usr/local/lib/python2.7/dist-packages/lmdb/__pycache__/_cffi__xb735746ax21954bad.c'
cpython trans_new checks arg.env before calling parse_args().
This makes lmdb.Transaction(env) fail with "TypeError: 'env' argument required"
See also #5
Hello. After trying to install it on Windows 7 in different ways, it seems it also does not compile via pip installation on Ubuntu Server 14.01 (VirtualBox) as root (sudo) and under Python 2.7 (virtualenv). Without virtualenv, the same:
....................................................
running install
running build
running build_py
creating build
creating build/lib.linux-i686-2.7
creating build/lib.linux-i686-2.7/lmdb
copying lmdb/__init__.py -> build/lib.linux-i686-2.7/lmdb
copying lmdb/cffi.py -> build/lib.linux-i686-2.7/lmdb
running build_ext
building 'cpython' extension
creating build/temp.linux-i686-2.7
creating build/temp.linux-i686-2.7/lmdb
creating build/temp.linux-i686-2.7/lib
gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -Ilib -I/usr/include/python2.7 -c lmdb/cpython.c -o build/temp.linux-i686-2.7/lmdb/cpython.o -Wno-shorten-64-to-32
lmdb/cpython.c:29:20: fatal error: Python.h: No such file or directory
compilation terminated.
error: command 'gcc' failed with exit status 1
Command /usr/bin/python -c "import setuptools;__file__='/home/u/py27/bin/build/lmdb/setup.py';exec(compile(open(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --single-version-externally-managed --record /tmp/pip-z_exK6-record/install-record.txt failed with error code 1
Storing complete log in /home/u/.pip/pip.log
...................................................................
"To install the Python module, ensure a C compiler and pip or easy_install are available and type:
pip install lmdb
/ or
easy_install lmdb"
I'm new to Python; I only know that there's a problem. I have gcc, so maybe it's a different issue. Thanks for this module.
I tried to implement a trick from wxPython where, in order to flag a dead object, an instance's __class__ is switched out for a 'class Invalid'. This way objects depending on some resource don't need to test for validity on every operation (allowing removal of two ifs on every iteration).
This hack, along with __del__, doesn't work properly yet, and needs to be fixed before the binding is used in long-lived code.
Per issue #18, new arg parsing broke at least one thing, but also there are likely more of these issues hiding in the code.
Perhaps need to make NOINLINE conditional, and ensure it builds under MinGW or MSVC. Submit binaries to the cheese shop (Travis CI?).
Only 20 entries make it into the file. Either mdb_cursor_put is buggy or it's being used incorrectly in the binding.
Latest lmdb 0.59, installed via "pip install".
This did work in 0.4 (via lmdb.connect()).
In [2]: import lmdb
In [3]: env = lmdb.Environment( "/tmp/test.lmdb", readonly=True )
---------------------------------------------------------------------------
Error Traceback (most recent call last)
<ipython-input-3-3c4e28c0391f> in <module>()
----> 1 env = lmdb.Environment( "/tmp/test.lmdb", readonly=True )
Error: mdb_txn_begin: Permission denied
The current scheme sucks. We can get rid of most of the valid() calls by installing a (void (*)(), PyObject *) callback in dependencies, which they fire to trigger invalidation. Then each object sets its valid = 0.
cpython cursor_put sets MDB_APPEND if append=False, per IRC conversation.
Per issue #50, it is not entirely obvious using the standard Linux tools that LMDB's memory can be evicted under memory pressure. Add a small paragraph to the docs to assist users in correctly measuring memory usage.
py-lmdb version 0.59, installed from pip install. Operating on an existing database.
In process 1:
In [1]: import lmdb
In [2]: env = lmdb.Environment( "/tmp/test.lmdb" )
In [3]: txn = env.begin(write=True)
Then, in process 2:
In [1]: import lmdb
In [2]: env = lmdb.Environment( "/tmp/test.lmdb" )
Line 2 in process 2 hangs until process 1 executes a txn.commit(). This prevents a process from connecting to a database (even if it only wants to read from it) while a long-lived transaction is outstanding. This certainly wasn't the behavior in version 0.4.
On CPython, returning a buffer to represent a memory area of even 8 bytes using cffi requires allocation of two objects: the cffi minibuffer (48 bytes) followed by the Python 2.x buffer object (64 bytes).
It should be possible to do much better, by having Transaction and Cursor preallocate two buffer instances at construction time when buffers=True, arrange for key/value/item to always return references to these buffers, and then use some evil trick to flip the pre-existing buffer's pointer/length to point at MDB_vals returned during lookup/iteration.
Version: 0.59 installed via 'pip install'.
I'm getting lmdb.Error: mdb_put: MDB_MAP_FULL: Environment mapsize limit reached very quickly in my internal benchmark.
The following demonstrates the problem: the map_size argument is being ignored.
>>> import lmdb
>>> env = lmdb.open("/tmp/temp.lmdb", map_size=2**39 )
>>> env.info()
{'max_readers': 126L, 'map_size': 1048576L, 'last_pgno': 1L, 'num_readers': 0L, 'map_addr': 0L, 'last_txnid': 0L}
map_size=2**30 works fine. Perhaps a 32 vs. 64-bit problem.
I see map_size is an int (line 635). It really should be a size_t, since that's what the underlying library asks for.
Probably could use a fat warning label in the docs
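The int-vs-size_t point can be illustrated with the struct module, using the native format codes 'i' (C int) and 'N' (size_t) as stand-ins for the two parameter types:

```python
import struct

map_size = 2 ** 39  # 512 GiB, as in the report above

# A 32-bit C int cannot represent 2**39, so packing it fails ...
try:
    struct.pack('i', map_size)
    fits_in_int = True
except struct.error:
    fits_in_int = False
assert not fits_in_int

# ... but size_t can, on a 64-bit build (calcsize('N') == 8 there).
if struct.calcsize('N') == 8:
    packed = struct.pack('N', map_size)
    assert struct.unpack('N', packed)[0] == map_size
```

This is only a demonstration of the C-level range mismatch, not of py-lmdb's argument parsing; the truncation path inside the binding may differ.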
The documentation at RTD documents HEAD, but has lmdb 0.66 documentation as its title.
Figure out whether these calls are expensive, and if so, curtail their unnecessary use in various places.
Looks like an MDB-level problem, but possibly also memory corruption.
Transactions should allow dictionary like operations like
txn[key] = value
del txn[key]
key in txn
for key, value in txn.items()
etc.
Given 2 cursors:
So there needs to be a way to track repositioning. After various thoughts, it seems the best way to do this is:
for (IteratorObject *iter = txn->iter_head; iter != NULL; iter = iter->iter_next) {
    CursorObject *curs = iter->curs;
    if (curs->positioned && iter->started && iter->op == MDB_NEXT &&
            curs->key.mv_size == to_delete.mv_size &&
            /* See ticket #43! */
            !memcmp(curs->key.mv_data, to_delete.mv_data, to_delete.mv_size)) {
        iter->started = 0;
    }
}
By clearing iter->started, the iterator avoids MDB_NEXT/MDB_PREV prior to the next next() call. The MDB_NEXT test exists because a reverse iterator already behaves correctly, calling MDB_PREV to skip the record its cursor has been newly positioned onto.
Ticket #43 describes yet another design screwup: curs->key shouldn't be cached at all. So any delete will require walking a list of all iterators, calling MDB_GET_CURRENT on their associated cursor, then string-comparing the result to see if the iterator needs its state hacked.
This also means the cffi version will cross the Python/C boundary far more often, which will make performance suck even harder.
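The hazard being worked around can be simulated in plain Python: after a delete, the cursor already sits on the following record (as mdb_cursor_del leaves it), so a forward iterator that still advances with MDB_NEXT skips one entry. A toy model over a sorted key list:

```python
# Toy model of a cursor over sorted keys. Deleting the current key
# leaves the cursor positioned on the *next* key, so an iterator must
# suppress its pending MDB_NEXT step or it silently skips a record.

keys = [b'a', b'b', b'c', b'd']
pos = 0                      # cursor positioned on b'a'

del keys[pos]                # delete b'a'; cursor now "on" b'b'

# Naive iterator: advances (MDB_NEXT) before reading -> skips b'b'.
naive_next = keys[pos + 1]
assert naive_next == b'c'

# Fixed iterator: its 'started' flag was cleared (iter->started = 0),
# so it reads in place instead of advancing first.
started = False
fixed_next = keys[pos] if not started else keys[pos + 1]
assert fixed_next == b'b'
```

This is only an illustration of the state-tracking problem, not of the binding's actual cursor code.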
The Environment.puts() docs (online version) show this example:
a_existed, b_existed = env.puts(overwrite=False, items={
    'a': '1',
    'b': '2',
})
if a_existed:
    print 'Did not overwrite a, it already existed.'
if b_existed:
    print 'Did not overwrite b, it already existed.'
However, dictionary order is undefined in Python, so the items={...} could be processed in any order.
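Passing a sequence of pairs instead of a dict would make the result order well-defined. A sketch with a stand-in puts() that mimics the documented return value (fake_puts is an illustration of the ordering point, not py-lmdb code):

```python
# Stand-in for the documented Environment.puts(): returns one flag per
# item indicating whether the key already existed (overwrite=False
# means existing keys are left untouched). Hypothetical, for the
# ordering argument only.

def fake_puts(store, items, overwrite=True):
    existed = []
    for key, value in items:       # order of `items` drives result order
        present = key in store
        existed.append(present)
        if overwrite or not present:
            store[key] = value
    return existed

store = {'a': 'old'}
# A tuple of pairs has a defined order, unlike a dict literal:
a_existed, b_existed = fake_puts(
    store, (('a', '1'), ('b', '2')), overwrite=False)

assert a_existed is True           # 'a' was present, left as 'old'
assert b_existed is False          # 'b' was absent, now written
assert store == {'a': 'old', 'b': '2'}
```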
This is an upstream problem. Setting too low a map_size causes LMDB to explode.
The docs mention using str() to convert buffers to strings, but I think the correct suggestion (for both Python 2 and 3) would be bytes(), since we're dealing with byte strings here, not Unicode.
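The difference is easy to demonstrate with memoryview as a stand-in for the buffer objects py-lmdb returns in buffers=True mode:

```python
# Under Python 3, str() on a buffer-like object yields its repr, not
# the underlying data; bytes() copies the data out correctly.

buf = memoryview(b'value')   # stand-in for a buffer returned by py-lmdb

assert bytes(buf) == b'value'            # correct on both 2 and 3
assert str(buf).startswith('<memory')    # Python 3: repr, not the data
```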