jamesls / semidbm

Cross platform (fast) DBM interface in python

License: Other

Python 100.00%

semidbm's Introduction

Overview

Semidbm is a fast, pure python implementation of a dbm, which is a persistent key value store. It allows you to get and set keys through a dict interface:

import semidbm
db = semidbm.open('testdb', 'c')
db['foo'] = 'bar'
print(db['foo'])
db.close()

These values are persisted to disk, and you can later retrieve these key/value pairs:

# Then at a later time:
db = semidbm.open('testdb', 'r')
# prints "bar" (shown as b'bar' on Python 3)
print(db['foo'])
db.close()

It was written with these things in mind:

Pure python, supporting python 2.6, 2.7, 3.3, and 3.4.
Cross platform, works on Windows, Linux, Mac OS X.
Supports CPython, pypy, and jython (versions 2.7-b3 and higher).
Simple and Fast (See Benchmarking Semidbm).

Supported Python Versions

Semidbm supports python 2.6, 2.7, 3.3, and 3.4.

Official Docs

Read the semidbm docs for more information and how to use semidbm.

Features

Semidbm started as an improvement over the dumbdbm library in the python standard library. Below is a list of some of the improvements over dumbdbm.

Single Data File

Instead of an index file and a data file, the index and data have been consolidated into a single file. This single data file is only ever appended to; data already written to the file is never modified.

Data File Compaction

Semidbm uses an append-only file format, so the data file can grow large because space is never reclaimed. Semidbm addresses this with a compact() method that rewrites the data file to a minimal size.

Performance

Semidbm is significantly faster than dumbdbm (keep in mind both are pure python libraries) in just about every way. The documentation shows the results of semidbm vs. other dbms, along with how to run the benchmarking script yourself.

Limitations

Not thread safe; can't be accessed by multiple processes.
The entire index must fit in memory. This essentially means that all of the keys must fit in memory.
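
For the thread-safety limitation, one common workaround is to serialize every access through a single lock. The sketch below is an assumption about how a caller might do that, not something semidbm provides: LockedMapping is a hypothetical helper, and a plain dict stands in for the database handle. It only coordinates threads inside one process and does nothing for access from multiple processes.

```python
import threading


class LockedMapping(object):
    """Hypothetical wrapper: one lock guards every operation on `db`."""

    def __init__(self, db):
        self._db = db
        self._lock = threading.Lock()

    def __getitem__(self, key):
        with self._lock:
            return self._db[key]

    def __setitem__(self, key, value):
        with self._lock:
            self._db[key] = value

    def __delitem__(self, key):
        with self._lock:
            del self._db[key]

    def __contains__(self, key):
        with self._lock:
            return key in self._db

    def __len__(self):
        with self._lock:
            return len(self._db)


# A plain dict stands in for an open database handle.
shared = LockedMapping({})


def writer(worker_id):
    for i in range(100):
        shared[('%d-%d' % (worker_id, i)).encode('ascii')] = b'x'


threads = [threading.Thread(target=writer, args=(n,)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```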

Post feedback and issues on github issues, or check out the latest changes at the github repo.

semidbm's Issues

Using `semidbm` in a `shelve` object - a code snippet

Using Python 3.7's shelve with the default dbm on a Mac, I run into the same size limitation noted here http://jamesls.com/semidbm-a-pure-python-dbm.html (notably HASH: Out of overflow pages. Increase page size). I installed gdbm, but it does not show up in my Conda Pythons.

semidbm came to the rescue using the following code snippet. The class and function are lifted directly from Python's shelve.py. I see no speed difference, but I do see an ability to scale to more objects that dbm lacked. gdbm should have provided a similar solution, but I can't get it to work on my Anaconda distribution (for reference, import dbm.gnu generates ModuleNotFoundError: No module named '_gdbm').

Thank you for this package! I hope that the snippet below helps others who use shelve on a large dataset.

import semidbm
from shelve import Shelf


class DbfilenameShelfSemidbm(Shelf):
    """Shelf implementation using the "dbm" generic dbm interface.

    This is initialized with the filename for the dbm database.
    See the module's __doc__ string for an overview of the interface.
    """

    def __init__(self, filename, flag='c', protocol=None, writeback=False):
        # semidbm.open() replaces the dbm.open() call used in shelve.py.
        Shelf.__init__(self, semidbm.open(filename, flag), protocol, writeback)


def open_semidbm(filename, flag='c', protocol=None, writeback=False):
    """Open a persistent dictionary for reading and writing.

    The filename parameter is the base filename for the underlying
    database.  As a side-effect, an extension may be added to the
    filename and more than one file may be created.  The optional flag
    parameter has the same interpretation as the flag parameter of
    dbm.open(). The optional protocol parameter specifies the
    version of the pickle protocol.

    See the module's __doc__ string for an overview of the interface.
    """

    return DbfilenameShelfSemidbm(filename, flag, protocol, writeback)

Timing on a smaller client task (prior to the HASH error above):

dbm inside a default shelve 1m20
semidbm inside the derived shelve 1m20 (the same as dbm)

In Read-only file system, error happens

In a Google Cloud Function, calling semidbm.db.open(filename, "r") fails with this error:
"Function failed on loading user code. Error message: [Errno 30] Read-only file system: '/env/local/lib/python3.7/site-packages/pykakasi/kanwadict3.db/data'"
(This happens inside pykakasi.)
The cause is compat.DATA_OPEN_FLAGS.
Please fix this.
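
One way to avoid this class of error is to derive the os.open() flags from the requested mode, so that a read-only open never asks for write access. The sketch below is an assumption about a possible fix and does not reproduce the contents of compat.DATA_OPEN_FLAGS: data_open_flags is a hypothetical helper, and the mode letters follow the usual dbm convention ('r', 'w', 'c', 'n').

```python
import os


def data_open_flags(mode):
    """Return os.open() flags for a dbm-style mode (hypothetical helper)."""
    binary = getattr(os, 'O_BINARY', 0)  # only defined on Windows
    if mode == 'r':
        # Read-only: no write, append, or create bits are requested.
        return os.O_RDONLY | binary
    if mode not in ('w', 'c', 'n'):
        raise ValueError('unknown mode: %r' % (mode,))
    flags = os.O_RDWR | os.O_APPEND | binary
    if mode in ('c', 'n'):
        flags |= os.O_CREAT  # create the file if it is missing
    if mode == 'n':
        flags |= os.O_TRUNC  # always start with an empty file
    return flags
```

A call such as os.open(path, data_open_flags('r')) then only needs read permission on the file.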

Support python 3

I think I'd be fine with supporting just python3.3 for now.

File not closed when verify_header fails

I need to convert dbhash databases because dbhash is not supported in python3, so in python2 I:

try to open the file as semidbm
if that fails, open it as dbhash, copy the keys to a new name, close, and rename.

This lets the conversion happen automagically in my db code, rather than needing explicit conversion. However, the rename fails because the file has not been closed, so in mmapload.py I made this change:

        try:    # Phil added try...except
            header = f.read(8)
            self._verify_header(header)
        except:
            f.close()
            raise

I did not try to submit a change because it may be better wrapped in the other try clause.
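
The same pattern can be shown in isolation. This is a minimal sketch under stated assumptions, not the code in mmapload.py: open_verified and the 8-byte EXPECTED_HEADER value are hypothetical and exist only for illustration. The file handle is closed before the error propagates, so the caller is free to rename the file afterwards.

```python
import os
import tempfile

EXPECTED_HEADER = b'TOYDBM01'  # hypothetical 8-byte header, illustration only


def open_verified(path):
    """Open `path` and return the file object only if its header matches."""
    f = open(path, 'rb')
    try:
        header = f.read(8)
        if header != EXPECTED_HEADER:
            raise ValueError('unrecognized header: %r' % (header,))
    except Exception:
        # Release the handle before the error reaches the caller.
        f.close()
        raise
    return f


# A file in some other format: opening fails, then the rename goes through.
bad_path = os.path.join(tempfile.mkdtemp(), 'legacy.db')
with open(bad_path, 'wb') as fh:
    fh.write(b'not-a-toy-db')

try:
    open_verified(bad_path)
    open_failed = False
except ValueError:
    open_failed = True

renamed_path = bad_path + '.old'
os.rename(bad_path, renamed_path)
```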

Plans for new release?

Hi @jamesls and team,

Looks like there has not been a release for nearly 5 years now. Are there any plans to address this, especially merging open PRs to fix bugs, fixing the CI failures, etc?

If not, mind giving access to new contributors (github, travis, pypi.org) to revive the project and unblock dependent projects?

(cc @oxij, @sylvinus, @speedplane, @AnythingTechPro, @miurahr)

Does not build successfully

The latest semidbm version from your git repo does not build successfully:

$ python setup.py build
running build
running build_py
file semidbm.py (for module semidbm) not found
file semidbm.py (for module semidbm) not found

Installation also reports this error during sudo python setup.py install. The resulting semidbm cannot be imported.

Is there a release of 0.4 available that does build successfully? Or another git repo where it builds? Or is there an easy fix to this build problem?

Support jython 2.7

Not officially released yet, but semidbm should support jython2.7. This gives jython users an easy to use dbm for jython (gdbm is not available on jython).

In terms of the work, we'll have to refactor the use of mmap, as that's not available on the jython platform.

`in` appears to always return False

It should error out or return the correct result.

(Pdb) self.store["blah"] = "1"
(Pdb) "blah" in self.store
False
(Pdb) self.store["blah"]
b'1'
(Pdb) self.store
<semidbm.db._SemiDBM object at 0x11f6ec050>
(Pdb) semidbm.__version__
'0.5.1'
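
The session stores the str "1" and reads back b'1', so some str-to-bytes conversion happens on write. One possible cause, which is an assumption and not confirmed here, is that the membership test does not apply the same conversion to the key. The sketch below shows a wrapper that normalizes keys in every operation, including `in`. BytesKeyMapping is a hypothetical helper, the UTF-8 encoding is an assumption, and a plain dict stands in for the store.

```python
class BytesKeyMapping(object):
    """Hypothetical wrapper: str keys and values are encoded consistently."""

    def __init__(self, db):
        self._db = db

    @staticmethod
    def _to_bytes(item):
        # Assumption: str is encoded as UTF-8; bytes pass through unchanged.
        if isinstance(item, str):
            return item.encode('utf-8')
        return item

    def __setitem__(self, key, value):
        self._db[self._to_bytes(key)] = self._to_bytes(value)

    def __getitem__(self, key):
        return self._db[self._to_bytes(key)]

    def __delitem__(self, key):
        del self._db[self._to_bytes(key)]

    def __contains__(self, key):
        # Same normalization as __getitem__, so `in` and lookup agree.
        return self._to_bytes(key) in self._db


store = BytesKeyMapping({})  # a plain dict stands in for the db
store["blah"] = "1"
found = "blah" in store
value = store["blah"]
```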

Deleting a nonexistent key corrupts the database

>>> import semidbm
>>> db = semidbm.open('test', 'n')
>>> del db[b'nonexistent']
Traceback [...]
KeyError: b'nonexistent'
>>> db[b'foo'] = b'bar'
>>> db[b'foo']
b'exi'
>>> db.close()
>>> db = semidbm.open('test', 'r')
Traceback [...]
KeyError: b'nonexistent'

The fix is probably to move the del above the write in _SemiDBM.__delitem__ (and add a regression test).
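
The ordering idea behind that suggested fix can be sketched with a toy store. This is an illustration under assumptions, not the body of _SemiDBM.__delitem__: ToyLogStore is hypothetical, and a list stands in for the append-only data file. Removing the key from the index first means a missing key raises KeyError before any delete record is written.

```python
class ToyLogStore(object):
    """Toy store: a list stands in for the append-only data file."""

    def __init__(self):
        self._log = []    # records appended in write order
        self._index = {}  # key -> position of the latest value in the log

    def __setitem__(self, key, value):
        self._log.append(('set', key, value))
        self._index[key] = len(self._log) - 1

    def __getitem__(self, key):
        return self._log[self._index[key]][2]

    def __delitem__(self, key):
        # Update the index first: a missing key raises KeyError here,
        # before anything reaches the log.
        del self._index[key]
        self._log.append(('delete', key, None))


db = ToyLogStore()
try:
    del db[b'nonexistent']
    delete_failed = False
except KeyError:
    delete_failed = True
log_length_after_failed_delete = len(db._log)

db[b'foo'] = b'bar'
foo_value = db[b'foo']
```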

3.6, 3.7, 3.8, 3.9 on PyPI

Hello,
Congrats on this nice project.

pip install semidbm

does not work on Python 3.7.

Is it planned to update it on PyPI?

Thanks!

just a question about an issue i'm seeing with db disappearing

I have a dbm that I access through a read-only sdbm handle in a web server. I never write to this handle (which wouldn't be possible anyway), but every once in a while all of the contents disappear (the size goes from 2.6G to 8kb). This is hard to debug because I don't write to the handle anywhere, and it happens only once every few weeks, so I can't easily reproduce it. I do access the read-only handle from multiple threads, so I'm wondering whether that is causing the problem. Here is how I create the handle:

sdbm_read_only = sdbm.open(os.path.expandvars(self.config.get("sdbm_location")),'r')

Use binary format instead of ASCII

Consider switching to a binary format for the db. It would make the file more efficient to load and might improve performance.
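
As a rough illustration of what a binary record could look like, the sketch below packs the key and value sizes as fixed-width integers so a loader can skip from record to record without parsing text. The layout and the function names encode_record and decode_record are assumptions for illustration, not the format semidbm uses.

```python
import struct

# Hypothetical layout: key size and value size as unsigned 32-bit ints,
# in network byte order, followed by the raw key and value bytes.
_SIZES = struct.Struct('!II')


def encode_record(key, value):
    """Return one binary record for a bytes key and a bytes value."""
    return _SIZES.pack(len(key), len(value)) + key + value


def decode_record(buf, offset=0):
    """Return (key, value, offset of the next record) from `buf`."""
    key_size, value_size = _SIZES.unpack_from(buf, offset)
    key_start = offset + _SIZES.size
    value_start = key_start + key_size
    end = value_start + value_size
    return buf[key_start:value_start], buf[value_start:end], end


data = encode_record(b'foo', b'bar') + encode_record(b'k', b'longer value')
first_key, first_value, next_offset = decode_record(data)
second_key, second_value, end_offset = decode_record(data, next_offset)
```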

Test failure in windows python2.6

Confirmed this only happens in windows py26. Seems to be an issue with remapping pages once we exceed the number of mapped pages.

======================================================================
ERROR: test_remap_required (test_semidbm.TestRemapping)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\Administrator\Documents\GitHub\semidbm\test_semidbm.py", line 330, in test_remap_required
    db2 = self.open_db_file()
  File "C:\Users\Administrator\Documents\GitHub\semidbm\test_semidbm.py", line 26, in open_db_file
    return semidbm.open(self.dbdir, 'c', **kwargs)
  File "C:\Users\Administrator\Documents\GitHub\semidbm\semidbm\db.py", line 461, in open
    return _SemiDBM(filename, **kwargs)
  File "C:\Users\Administrator\Documents\GitHub\semidbm\semidbm\db.py", line 73, in __init__
    self._load_db()
  File "C:\Users\Administrator\Documents\GitHub\semidbm\semidbm\db.py", line 81, in _load_db
    self._index = self._load_index(self._data_filename)
  File "C:\Users\Administrator\Documents\GitHub\semidbm\semidbm\db.py", line 92, in _load_index
    return self._load_index_from_fileobj(filename)
  File "C:\Users\Administrator\Documents\GitHub\semidbm\semidbm\db.py", line 105, in _load_index_from_fileobj
    for key_name, offset, size in self._read_index(filename):
  File "C:\Users\Administrator\Documents\GitHub\semidbm\semidbm\db.py", line 163, in _read_index
    offset=num_resizes * remap_size)
WindowsError: [Error 8] Not enough storage is available to process this command

----------------------------------------------------------------------
Ran 93 tests in 0.952s

FAILED (errors=1)
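
For context on the offset argument in the failing call, the sketch below reads a file through a series of fixed-size mapped windows. It is a generic illustration of windowed mmap use under assumptions, not the code in db.py and not a fix for this failure: iter_mapped_chunks is a hypothetical helper, and the window size is an arbitrary multiple of mmap.ALLOCATIONGRANULARITY, which mmap offsets must be aligned to.

```python
import mmap
import os
import tempfile


def iter_mapped_chunks(path, chunk_size=mmap.ALLOCATIONGRANULARITY * 4):
    """Yield the file's contents one mapped window at a time."""
    # mmap offsets must be a multiple of ALLOCATIONGRANULARITY.
    if chunk_size <= 0 or chunk_size % mmap.ALLOCATIONGRANULARITY != 0:
        raise ValueError('chunk_size must be a positive multiple of '
                         'mmap.ALLOCATIONGRANULARITY')
    size = os.path.getsize(path)
    with open(path, 'rb') as f:
        offset = 0
        while offset < size:
            # Never map past the end of the file.
            length = min(chunk_size, size - offset)
            window = mmap.mmap(f.fileno(), length,
                               access=mmap.ACCESS_READ, offset=offset)
            try:
                yield window[:]
            finally:
                window.close()  # release each window before mapping the next
            offset += length


payload = os.urandom(mmap.ALLOCATIONGRANULARITY * 9 + 123)
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.bin')
with open(sample_path, 'wb') as fh:
    fh.write(payload)

chunks = list(iter_mapped_chunks(sample_path))
```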

Add version info to db file

This will make it easier to determine whether or not we can load the db file.
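
A version check of this kind might look like the sketch below. Everything in it is an assumption for illustration: the magic bytes, the major/minor fields, and the names make_header and read_version are hypothetical and do not describe semidbm's real header.

```python
import struct

MAGIC = b'TOYD'                   # hypothetical magic bytes
_HEADER = struct.Struct('!4sHH')  # magic, major version, minor version
SUPPORTED_MAJOR = 1


def make_header(major=SUPPORTED_MAJOR, minor=0):
    """Return the bytes written at the start of a new db file."""
    return _HEADER.pack(MAGIC, major, minor)


def read_version(data):
    """Return (major, minor), or raise ValueError if the file is unloadable."""
    if len(data) < _HEADER.size:
        raise ValueError('file too short to contain a header')
    magic, major, minor = _HEADER.unpack_from(data)
    if magic != MAGIC:
        raise ValueError('not a recognized db file')
    if major != SUPPORTED_MAJOR:
        raise ValueError('unsupported format version %d.%d' % (major, minor))
    return major, minor
```

Checking only the major number lets minor revisions stay loadable, which is one common convention and not the only option.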

Doesn't support all dict methods?

notably .items()?
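
For a store that only implements the core operations, Python's collections.abc.MutableMapping can supply the rest of the dict interface. The sketch below is a general illustration, not a description of how semidbm is written: DictLikeStore is a hypothetical class backed by a plain dict. Defining the five abstract methods is enough to get items(), keys(), values(), get(), pop(), and update() from the mixin.

```python
from collections.abc import MutableMapping


class DictLikeStore(MutableMapping):
    """Hypothetical store: five methods here, the rest come from the mixin."""

    def __init__(self):
        self._data = {}  # a plain dict stands in for the on-disk index

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


store = DictLikeStore()
store.update({b'a': b'1', b'b': b'2'})  # update() is provided by the mixin
pairs = sorted(store.items())           # so is items()
```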