pytries / marisa-trie
Static memory-efficient Trie-like structures for Python based on marisa-trie C++ library.
Home Page: https://marisa-trie.readthedocs.io/en/latest/
License: MIT License
Bug report by lazarou.
"Maybe I'm asking something very silly, but here goes. I get this error when trying to install the package (using Win 7 64-bit and the latest version of mingw):
C:\MinGW\bin\gcc.exe -mdll -O -Wall -Ilib -IC:\Python32\include -IC:\Python32\PC -c lib/marisa/grimoire/io\mapper.cc -o build\temp.win32-3.2\Release\lib\marisa\grimoire\io\mapper.o
lib/marisa/grimoire/io\mapper.cc: In member function 'void marisa::grimoire::io::Mapper::open_(const char*)':
lib/marisa/grimoire/io\mapper.cc:110:19: error: aggregate 'marisa::grimoire::io::Mapper::open(const char*)::_stat64 st' has incomplete type and cannot be defined
lib/marisa/grimoire/io\mapper.cc:111:3: error: '::_stat64' has not been declared
error: command 'gcc' failed with exit status 1
"
Edit: nvm, misunderstood purpose of Trie.load.
The documentation does not explain what format is expected for files passed to load. I tried the following:
In [1]: import marisa_trie
In [2]: trie = marisa_trie.Trie()
In [3]: trie.load('data')
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-3-d813dec57587> in <module>()
----> 1 trie.load('data')
src/marisa_trie.pyx in marisa_trie._Trie.load()
src/marisa_trie.pyx in marisa_trie._Trie.load()
RuntimeError: marisa-trie/lib/marisa/grimoire/trie/header.h:26: MARISA_FORMAT_ERROR: !test_header(buf)
Documentation does not specify what the file should be formatted like:
In [4]: trie.load?
Docstring:
_Trie.load(self, path)
Load a trie from a specified path.
Type: builtin_function_or_method
See below:
>>> import marisa_trie
>>> trie = marisa_trie.Trie(['zeroth', 'first', 'second', 'third', 'fourth', 'fifth'])
>>> # IDs aren't ordered the same as original input list:
... trie.get('zeroth')
2
>>> # IDs aren't ordered the same as iteration order of the trie, either:
... for word, ID in trie.items():
...     print(word, ID)
...
fifth 4
first 5
fourth 3
second 0
third 1
zeroth 2
This is inconvenient given that one possible use case, actually encouraged in the README, is to use the returned ID to store a value in a separate data structure (e.g. in a Python list).
Ideally I'd like to be able to loop over my elements in ID order to construct such a list. I guess I can create a list of the right length and then assign into it, but couldn't this be made easier, either by assigning IDs according to the order the words were passed to Trie() in, or by having iteration over the trie iterate in ID order?
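The "create a list of the right length and then assign into it" workaround mentioned above can be sketched as follows. This is a minimal illustration, not part of the library: values_in_id_order and value_for are hypothetical names, and the (word, ID) pairs are copied from the output shown in the report rather than produced by a live trie.

```python
def values_in_id_order(items, value_for):
    """Build a list where position i holds the value for the key with ID i.

    `items` is an iterable of (word, key_id) pairs, in any order.
    `value_for` is a hypothetical callable mapping a word to its value.
    """
    items = list(items)
    result = [None] * len(items)
    for word, key_id in items:
        result[key_id] = value_for(word)
    return result


# (word, ID) pairs as printed by trie.items() in the report above.
items = [("fifth", 4), ("first", 5), ("fourth", 3),
         ("second", 0), ("third", 1), ("zeroth", 2)]
by_id = values_in_id_order(items, str.upper)
print(by_id)
```

With a real trie, the same function should accept trie.items() directly, assuming it yields (key, ID) pairs as in the transcript.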
Please see whether this issue from here can be solved in this implementation ... we'd be very happy to switch to your implementation ...
https://code.google.com/p/marisa-trie/issues/detail?id=19
Thanks
Just wanted to make a suggestion to publish a recent version of this code on PyPI. The latest published PyPI version is 0.7.2, which is from April 2015.
I might be missing something out here, but I believe there is a consistency issue with the Trie implementation and keys that have null characters.
Take a look at the following code snippet:
key = 'Random\x00Key'
python_dict = { key : 'random_value' }
key in python_dict # prints True
python_dict.get(key) # returns 'random_value'
std_trie = marisa_trie.Trie(python_dict)
key in std_trie # prints True
std_trie.keys() # prints ['Random\x00Key']
std_trie.get(key) # should return 'random_value'
std_trie.key_id(key) # should return the key id
What happens is that std_trie.get(key) actually returns None, and std_trie.key_id(key) throws a KeyError exception with the following trace:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "src/marisa_trie.pyx", line 409, in marisa_trie.Trie.key_id
File "src/marisa_trie.pyx", line 417, in marisa_trie.Trie.key_id
KeyError: 'Random\x00Key'
Apparently, the RecordTrie implementation is immune to this consistency issue.
r_trie = marisa_trie.RecordTrie('<H', zip([key], [(1,)]))
key in r_trie # prints True
r_trie.keys() # prints ['Random\x00Key']
r_trie.get(key) # prints [(1,)]
By the way, if you confirm this issue as an actual bug, also check DAWG. I haven't used it extensively, but calling dawg.DAWG(python_dict) should throw the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "dawg.pyx", line 45, in dawg.DAWG.__init__ (src\dawg.cpp:2147)
File "dawg.pyx", line 70, in dawg.DAWG._build_from_iterable (src\dawg.cpp:2570)
dawg.Error: Can't insert key b'Random\x00Key' (with value 0)
I'm running Python 3.5.2 (v3.5.2:4def2a2901a5, Jun 25 2016, 22:18:55) [MSC v.1900 64 bit (AMD64)] on win32 and using marisa-trie 0.7.5 and DAWG 0.7.8.
Is it possible to add keys to the Trie after it's been created? I've seen the restore facility (e.g. trie.restore_key(1)) and looked into the source but haven't seen anything offering this. Something like:
t = Trie([u'one'])
t._add_key(u'two')
pip3 install marisa-trie is working fine for me, but when using it I get the following:
running build_clib
building 'libmarisa-trie' library
creating build/temp.macosx-10.14.6-x86_64-3.8
creating build/temp.macosx-10.14.6-x86_64-3.8/marisa-trie
creating build/temp.macosx-10.14.6-x86_64-3.8/marisa-trie/marisa-trie
creating build/temp.macosx-10.14.6-x86_64-3.8/marisa-trie/marisa-trie/lib
creating build/temp.macosx-10.14.6-x86_64-3.8/marisa-trie/marisa-trie/lib/marisa
creating build/temp.macosx-10.14.6-x86_64-3.8/marisa-trie/marisa-trie/lib/marisa/grimoire
creating build/temp.macosx-10.14.6-x86_64-3.8/marisa-trie/marisa-trie/lib/marisa/grimoire/io
creating build/temp.macosx-10.14.6-x86_64-3.8/marisa-trie/marisa-trie/lib/marisa/grimoire/trie
creating build/temp.macosx-10.14.6-x86_64-3.8/marisa-trie/marisa-trie/lib/marisa/grimoire/vector
clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -iwithsysroot/System/Library/Frameworks/System.framework/PrivateHeaders -iwithsysroot/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.8/Headers -arch arm64 -arch x86_64 -Imarisa-trie/marisa-trie/lib -Imarisa-trie/marisa-trie/include -c marisa-trie/marisa-trie/lib/marisa/trie.cc -o build/temp.macosx-10.14.6-x86_64-3.8/marisa-trie/marisa-trie/lib/marisa/trie.o
marisa-trie/marisa-trie/lib/marisa/trie.cc:1:10: fatal error: 'marisa/stdio.h' file not found
#include "marisa/stdio.h"
^~~~~~~~~~~~~~~~
1 error generated.
error: command 'clang' failed with exit status 1
----------------------------------------
I tried the steps in here but it didn't resolve the issue: #50
Any tips welcome, thanks!
Like so:
In [46]: marisa_trie.Trie([u'foo', u'bar']).restore_key(0)
Out[46]: u'bar\x02'
This doesn't happen if I first get the key_id for that key:
In [48]: t = marisa_trie.Trie([u'foo', u'bar'])
In [49]: t.key_id(u'bar')
Out[49]: 0
In [50]: t.restore_key(0)
Out[50]: u'bar'
If it's part of the contract that key_id is needed before restore_key, then it should probably be documented, and ideally an exception should be raised if the contract is violated rather than silently returning an incorrect result.
I see that this data structure supports prefix lookups. Does it also support fuzzy lookups (i.e. all records within a given Levenshtein distance)? If that's not supported in this package / this data structure, do you know of any other packages that would let me do in-memory fuzzy searching?
~ Ben
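For reference, "all records within a given Levenshtein distance" can be sketched as a brute-force scan over the keys. This is only an illustration of what the question asks for: it does not use the trie structure at all, it costs one distance computation per key, and fuzzy_matches is a hypothetical helper name.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (classic row-by-row DP)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute
            ))
        previous = current
    return previous[-1]


def fuzzy_matches(keys, query, max_distance):
    """Return the keys whose edit distance to query is at most max_distance."""
    return [key for key in keys if levenshtein(key, query) <= max_distance]


print(fuzzy_matches(["foo", "bar", "food"], "fool", 1))
```

The keys argument can be any iterable of strings, so the result of a keys() call on a trie should work as input.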
For example, I have a trie with:
>>>list(tree.iterkeys("GC"))
['GCGx0x0']
and yet:
>>>list(tree.iterkeys("G"))
[]
G is certainly a prefix of GCGx0x0, so shouldn't it return the same results in both cases?
When I use marisa_trie, my execution always runs:
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
I tried to construct the following trie:
trie = marisa_trie.Trie([('New', 'York'), ('New', 'Castle')])
Which gave me AttributeError: 'tuple' object has no attribute 'encode'. So I suppose the library accepts only strings, but sometimes you want other structures.
Thanks for building such a solid library wrapper! The marisa_trie C++ seems to expose a way of specifying order of nodes returned for a prefix, given the weight parameter.
Is there a way to expose this parameter in your lib?
Or alternatively, if you could provide some guarantee that nodes will return for a prefix in the same order that they had in the list that built the marisa tree, that would be fine too.
Would you please make a new release on PyPI with an updated patch-level version to include the fixes for deprecation warnings on non-deprecated methods? I'm just tired of seeing those warnings from our regular process and I wouldn't like to patch marisa-trie locally. Thank you.
Got this on Windows 64:
trie.keys()[3]
Traceback (most recent call last):
File "c:\Users\Administrator\Desktop\fuck.py", line 1, in <module>
import hashlib
File "c:\Python34\Lib\site-packages\marisa_trie.pyd", line 516, in marisa_trie.BytesTrie.keys (src\marisa_trie.cpp:9045)
File "c:\Python34\Lib\site-packages\marisa_trie.pyd", line 527, in marisa_trie.BytesTrie.keys (src\marisa_trie.cpp:8865)
builtins.RuntimeError: Unknown exception
The list of supported platforms is missing.
It would be nice to list which platforms are supported. If a platform is not supported, is there a way to make it work, or a workaround?
When I try pip install marisa-trie I get the following error:
Could not find a version that satisfies the requirement install (from versions: )
No matching distribution found for install
I'm finding the expected memory usage to be much higher than you suggest. Does anything strike you as odd about this?
import random
import string
import marisa_trie
keys = []
fmt = "<I"
for i in xrange(int(3e6)):
    key = "".join([random.choice(string.ascii_uppercase).decode('unicode-escape') for j in xrange(10)])
    keys.append(key)
t=marisa_trie.Trie(keys)
Just want to let you know that this project fails to compile on OS X 10.10.3 with CLI Tools 6.3 because of the missing <__debug> header. Check out this StackOverflow article for more info.
I've worked around this by simply downgrading to the previous CLI tools.
Or we could just call it a "longest" method - or "prefix" method (singular). Is there an efficient way to find the longest key that is a prefix of a string that I'm overlooking? It could be done with the current implementation by simply using prefixes, then finding the longest match, but there should be a much more efficient way possible by taking advantage of the Trie properties.
I can give it a go if you like - unless it's already implemented and I'm simply missing it.
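The "use prefixes, then find the longest match" approach described above can be sketched like this. It is the less efficient fallback, not the proposed trie-native method. The function takes any callable that returns the keys which are prefixes of a string; with the library, that would presumably be the trie's prefixes method. naive_prefixes below is a stand-in over a plain set so the sketch runs without the library.

```python
def longest_prefix(prefixes, text):
    """Return the longest key that is a prefix of text, or None if none exists.

    `prefixes` is any callable returning the keys that are prefixes of its
    argument (an assumption: a trie's prefixes method is expected to fit).
    """
    return max(prefixes(text), key=len, default=None)


# Stand-in for a trie: a plain set of keys, scanned prefix by prefix.
KEYS = {"key", "key1", "kite"}


def naive_prefixes(text):
    return [text[:i] for i in range(len(text) + 1) if text[:i] in KEYS]


print(longest_prefix(naive_prefixes, "key12"))
print(longest_prefix(naive_prefixes, "dog"))
```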
python3 -m pip install --user marisa-trie
There is no member PyTypeObject->tp_print in Python 3.9.
Python 3.9.2
Linux SPPI 5.10.46-v7l+ #1432 SMP Fri Jul 2 21:17:20 BST 2021 armv7l GNU/Linux
src/marisa_trie.cpp: In function ‘int __Pyx_modinit_type_init_code()’:
src/marisa_trie.cpp:17944:34: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
17944 | __pyx_type_11marisa_trie__Trie.tp_print = 0;
| ^~~~~~~~
src/marisa_trie.cpp:17968:39: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
17968 | __pyx_type_11marisa_trie_BinaryTrie.tp_print = 0;
| ^~~~~~~~
src/marisa_trie.cpp:17981:46: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
17981 | __pyx_type_11marisa_trie__UnicodeKeyedTrie.tp_print = 0;
| ^~~~~~~~
src/marisa_trie.cpp:17995:33: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
17995 | __pyx_type_11marisa_trie_Trie.tp_print = 0;
| ^~~~~~~~
src/marisa_trie.cpp:18014:38: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
18014 | __pyx_type_11marisa_trie_BytesTrie.tp_print = 0;
| ^~~~~~~~
src/marisa_trie.cpp:18039:40: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
18039 | __pyx_type_11marisa_trie__UnpackTrie.tp_print = 0;
| ^~~~~~~~
src/marisa_trie.cpp:18052:39: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
18052 | __pyx_type_11marisa_trie_RecordTrie.tp_print = 0;
| ^~~~~~~~
src/marisa_trie.cpp:18070:57: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
18070 | __pyx_type_11marisa_trie___pyx_scope_struct____init__.tp_print = 0;
| ^~~~~~~~
src/marisa_trie.cpp:18076:57: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
18076 | __pyx_type_11marisa_trie___pyx_scope_struct_1_genexpr.tp_print = 0;
| ^~~~~~~~
src/marisa_trie.cpp:18082:58: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
18082 | __pyx_type_11marisa_trie___pyx_scope_struct_2_iterkeys.tp_print = 0;
| ^~~~~~~~
src/marisa_trie.cpp:18088:63: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
18088 | __pyx_type_11marisa_trie___pyx_scope_struct_3_iter_prefixes.tp_print = 0;
| ^~~~~~~~
src/marisa_trie.cpp:18094:59: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
18094 | __pyx_type_11marisa_trie___pyx_scope_struct_4_iteritems.tp_print = 0;
| ^~~~~~~~
src/marisa_trie.cpp:18100:63: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
18100 | __pyx_type_11marisa_trie___pyx_scope_struct_5_iter_prefixes.tp_print = 0;
| ^~~~~~~~
src/marisa_trie.cpp:18106:59: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
18106 | __pyx_type_11marisa_trie___pyx_scope_struct_6_iteritems.tp_print = 0;
| ^~~~~~~~
src/marisa_trie.cpp:18112:58: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
18112 | __pyx_type_11marisa_trie___pyx_scope_struct_7___init__.tp_print = 0;
| ^~~~~~~~
src/marisa_trie.cpp:18118:57: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
18118 | __pyx_type_11marisa_trie___pyx_scope_struct_8_genexpr.tp_print = 0;
| ^~~~~~~~
src/marisa_trie.cpp:18124:59: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
18124 | __pyx_type_11marisa_trie___pyx_scope_struct_9_iteritems.tp_print = 0;
| ^~~~~~~~
src/marisa_trie.cpp:18130:59: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
18130 | __pyx_type_11marisa_trie___pyx_scope_struct_10_iterkeys.tp_print = 0;
| ^~~~~~~~
src/marisa_trie.cpp:18136:59: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
18136 | __pyx_type_11marisa_trie___pyx_scope_struct_11___init__.tp_print = 0;
| ^~~~~~~~
src/marisa_trie.cpp:18142:58: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
18142 | __pyx_type_11marisa_trie___pyx_scope_struct_12_genexpr.tp_print = 0;
| ^~~~~~~~
src/marisa_trie.cpp:18148:60: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
18148 | __pyx_type_11marisa_trie___pyx_scope_struct_13_iteritems.tp_print = 0;
| ^~~~~~~~
src/marisa_trie.cpp:18154:58: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
18154 | __pyx_type_11marisa_trie___pyx_scope_struct_14_genexpr.tp_print = 0;
| ^~~~~~~~
A trivial example:
>>> from marisa_trie import Trie
>>> Trie() == Trie()
False
>>> Trie(["foo", "bar"]) == Trie(["foo", "bar"])
False
There's one more interesting property: different tries seem to hash to the same value:
>>> hash(Trie())
271079393
>>> hash(Trie(["foo", "bar"]))
271079393
>>> hash(Trie(["foo", "bar", "boo"]))
271079393
This might be due to a free list-based allocation, but anyway the behaviour is confusing.
Hi. Great library. Would it be possible to add a has_keys_with_prefix method? Datrie has one.
set(dir(datrie.Trie)) - set(dir(marisa_trie.Trie))
set(['__delitem__', 'setdefault', '__getitem__', 'prefix_values', 'items', 'longest_prefix', 'has_keys_with_prefix', 'longest_prefix_value', 'is_dirty', '__setitem__', 'values', 'iter_prefix_items', 'longest_prefix_item', 'iter_prefix_values', '_delitem', 'prefix_items'])
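Until such a method exists, a has_keys_with_prefix check can be approximated by asking for the first key with the prefix and stopping there. The sketch below assumes the trie offers iterkeys(prefix) returning a lazy iterator, as used in another report on this page. FakeTrie is a hypothetical stand-in so the example runs without the library.

```python
def has_keys_with_prefix(trie, prefix):
    """Return True if at least one key in trie starts with prefix.

    Only the first match is consumed, so the cost does not grow with the
    number of matching keys (assuming iterkeys is lazy).
    """
    return next(iter(trie.iterkeys(prefix)), None) is not None


class FakeTrie:
    """Hypothetical stand-in exposing only iterkeys(prefix)."""

    def __init__(self, keys):
        self._keys = sorted(keys)

    def iterkeys(self, prefix=""):
        return (key for key in self._keys if key.startswith(prefix))


trie = FakeTrie(["key1", "key2", "kite"])
print(has_keys_with_prefix(trie, "ke"), has_keys_with_prefix(trie, "x"))
```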
As a user, I want to create an instance of Trie and add words to it one-at-a-time, so that I can use a Trie in a streaming environment (in which strings arrive on-the-fly). For example,
>>> from marisa_trie import Trie
>>> trie = Trie()
>>> trie.add(u'key1')
>>> trie.add(u'key12')
>>> u'key1' in trie
True
>>> u'key12' in trie
True
>>> u'key2' in trie
False
This also makes Trie behave more like a set of strings.
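Since the structure is static, one way to get add-like behaviour today is to buffer keys in an ordinary set and rebuild the static structure only when it is needed. The sketch below is an assumption-laden illustration: RebuildingKeySet, build and snapshot are hypothetical names, and frozenset is used as the default builder so the code runs without the library. Passing marisa_trie.Trie as build should work if it accepts an iterable of keys, as the constructor calls elsewhere on this page suggest.

```python
class RebuildingKeySet:
    """Hypothetical helper: accepts keys one at a time, rebuilds lazily."""

    def __init__(self, build=frozenset):
        self._build = build    # callable turning an iterable of keys into a static structure
        self._keys = set()     # mutable buffer of every key seen so far
        self._snapshot = None  # cached static structure, None when stale

    def add(self, key):
        if key not in self._keys:
            self._keys.add(key)
            self._snapshot = None  # invalidate the cached structure

    def __contains__(self, key):
        return key in self._keys

    def snapshot(self):
        """Return the static structure, rebuilding it only if keys changed."""
        if self._snapshot is None:
            self._snapshot = self._build(sorted(self._keys))
        return self._snapshot


trie = RebuildingKeySet()
trie.add(u'key1')
trie.add(u'key12')
print(u'key1' in trie, u'key12' in trie, u'key2' in trie)
```

Each rebuild costs a full construction, so this only suits workloads where lookups on the static structure are much more frequent than additions.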
The following code breaks under 64-bit Windows. 32-bit Windows is OK, and 64-bit Linux also works.
I've compiled the win-64bit extension with MSVC 2010.
marisa_trie.Trie([u'Das', u'Lahnth.al', u'mit', u'seinen', u'Heilquellen']).keys()
d:\vls-trunk\env-win64\Python27\lib\site-packages\marisa_trie.pyd in marisa_trie._Trie.keys (src\marisa_trie.cpp:4199)()
d:\vls-trunk\env-win64\Python27\lib\site-packages\marisa_trie.pyd in marisa_trie._Trie.keys (src\marisa_trie.cpp:4061)()
RuntimeError: Unknown exception
Hello,
I recently inherited some code from a developer who had departed. It is safe to say that the amount of data flowing into the trie has increased over time. This bug looks like an overflow.
Stack trace:
File "marisa_trie.pyx", line 422, in marisa_trie.BytesTrie.init (src/marisa_trie.cpp:7670)
File "marisa_trie.pyx", line 127, in marisa_trie._Trie.build (src/marisa_trie.cpp:2768)
RuntimeError: lib/marisa/grimoire/trie/tail.cc:192: MARISA_SIZE_ERROR: buf.size() > MARISA_UINT32_MAX
I think these two should be deprecated in favour of their path-based friends.
Three reasons:
API should be as small as possible to be useful in 90% of the cases.
The methods only work on file objects and produce ugly error messages when called on e.g. BytesIO:
>>> import io
>>> import marisa_trie
>>> marisa_trie.Trie().write(io.BytesIO())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "src/marisa_trie.pyx", line 193, in marisa_trie._Trie.write (src/marisa_trie.cpp:4201)
self._trie.write(f.fileno())
io.UnsupportedOperation: fileno
A related method mmap lacks a file-based version.
@kmike, what do you think?
Running into the following build error:
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/johannes/miniconda3/include -arch x86_64 -I/Users/johannes/miniconda3/include -arch x86_64 -Imarisa-trie/lib -Imarisa-trie/include -c marisa-trie/lib/marisa/trie.cc -o build/temp.macosx-10.7-x86_64-3.7/marisa-trie/lib/marisa/trie.o
clang: warning: include path for libstdc++ headers not found; pass '-stdlib=libc++' on the command line to use the libc++ standard library instead [-Wstdlibcxx-not-found]
In file included from marisa-trie/lib/marisa/trie.cc:1:
marisa-trie/include/marisa/stdio.h:4:10: fatal error: 'cstdio' file not found
#include <cstdio>
^~~~~~~~
1 error generated.
error: command 'gcc' failed with exit status 1
I tried to update the cpp files, but no luck either. Any hints welcome.
If I have the following variables:
keys = [u'1', u'12', u'13', u'123', u'132', u'1234']
vals = [u'a', u'b', u'c', u'd', u'e', u'f']
fmt = "s"
trie = marisa_trie.RecordTrie(fmt, zip(keys, vals))
But I keep getting argument for 's' must be a string
Any help?
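The error message comes from the struct module: the "s" format only accepts byte strings, and each record has to be a tuple with one element per field in the format. The sketch below uses only struct to show records shaped that way. It assumes, as I understand the README, that RecordTrie packs each value tuple with the given format. The "<1s" format (one byte per record) is chosen here for illustration because every value in the question is a single character.

```python
import struct

keys = ["1", "12", "13", "123", "132", "1234"]
vals = ["a", "b", "c", "d", "e", "f"]
fmt = "<1s"  # one field, one byte long

# Each record is a 1-tuple holding a bytes object, matching the single "s" field.
records = [(key, (val.encode("ascii"),)) for key, val in zip(keys, vals)]

# Round-trip through struct to confirm the records fit the format.
packed = [struct.pack(fmt, *fields) for _, fields in records]
unpacked = [struct.unpack(fmt, blob) for blob in packed]
print(records[0], unpacked[0])
```

Values longer than one byte would need a wider format such as "<5s"; struct truncates or pads to the declared width.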
Hi, I'm consistently getting the following error when trying to access a trie from a load or read from a file.
./read_trie_test.py
Traceback (most recent call last):
File "./read_trie_test.py", line 18, in <module>
print(t.restore_key(0))
File "marisa_trie.pyx", line 324, in marisa_trie.Trie.restore_key (src/marisa_trie.cpp:6365)
File "marisa_trie.pyx", line 334, in marisa_trie.Trie.restore_key (src/marisa_trie.cpp:6299)
File "marisa_trie.pyx", line 62, in marisa_trie._get_key (src/marisa_trie.cpp:1615)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 10: invalid start byte
I get the same error if the following code is used...
for k in t.keys():
    print(k)
and again the same error if I use:
t['someKey'] # or t[u'somekey']
The trie file reads in without any error, and I've written the file using both trie.save() and trie.write(),
and when writing the file I've used codec.open() and codec.write() to force UTF-8 encoding.
I'm not sure if this is similar to issue #10.
$ pip install marisa-trie==0.7.3
Collecting marisa-trie==0.7.3
Using cached marisa-trie-0.7.3.tar.gz
Building wheels for collected packages: marisa-trie
Running setup.py bdist_wheel for marisa-trie ... error
Complete output from command /Users/fredmailhot/anaconda/envs/marisa_test/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/h4/p3tnqg5n1rg54phgp1_g_8s00000gp/T/pip-build-qDhwKQ/marisa-trie/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /var/folders/h4/p3tnqg5n1rg54phgp1_g_8s00000gp/T/tmpA7YRglpip-wheel- --python-tag cp27:
running bdist_wheel
running build
running build_clib
building 'libmarisa-trie' library
creating build
creating build/temp.macosx-10.7-x86_64-2.7
creating build/temp.macosx-10.7-x86_64-2.7/marisa-trie
creating build/temp.macosx-10.7-x86_64-2.7/marisa-trie/lib
creating build/temp.macosx-10.7-x86_64-2.7/marisa-trie/lib/marisa
creating build/temp.macosx-10.7-x86_64-2.7/marisa-trie/lib/marisa/grimoire
creating build/temp.macosx-10.7-x86_64-2.7/marisa-trie/lib/marisa/grimoire/io
creating build/temp.macosx-10.7-x86_64-2.7/marisa-trie/lib/marisa/grimoire/trie
creating build/temp.macosx-10.7-x86_64-2.7/marisa-trie/lib/marisa/grimoire/vector
gcc -fno-strict-aliasing -I/Users/fredmailhot/anaconda/envs/marisa_test/include -arch x86_64 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -Imarisa-trie/lib -Imarisa-trie/include -c marisa-trie/lib/marisa/agent.cc -o build/temp.macosx-10.7-x86_64-2.7/marisa-trie/lib/marisa/agent.o
marisa-trie/lib/marisa/agent.cc:3:10: fatal error: 'marisa/agent.h' file not found
#include "marisa/agent.h"
^
1 error generated.
error: command 'gcc' failed with exit status 1
----------------------------------------
Failed building wheel for marisa-trie
Running setup.py clean for marisa-trie
Failed to build marisa-trie
Installing collected packages: marisa-trie
Running setup.py install for marisa-trie ... error
Complete output from command /Users/fredmailhot/anaconda/envs/marisa_test/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/h4/p3tnqg5n1rg54phgp1_g_8s00000gp/T/pip-build-qDhwKQ/marisa-trie/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /var/folders/h4/p3tnqg5n1rg54phgp1_g_8s00000gp/T/pip-vupQHV-record/install-record.txt --single-version-externally-managed --compile:
running install
running build
running build_clib
building 'libmarisa-trie' library
creating build
creating build/temp.macosx-10.7-x86_64-2.7
creating build/temp.macosx-10.7-x86_64-2.7/marisa-trie
creating build/temp.macosx-10.7-x86_64-2.7/marisa-trie/lib
creating build/temp.macosx-10.7-x86_64-2.7/marisa-trie/lib/marisa
creating build/temp.macosx-10.7-x86_64-2.7/marisa-trie/lib/marisa/grimoire
creating build/temp.macosx-10.7-x86_64-2.7/marisa-trie/lib/marisa/grimoire/io
creating build/temp.macosx-10.7-x86_64-2.7/marisa-trie/lib/marisa/grimoire/trie
creating build/temp.macosx-10.7-x86_64-2.7/marisa-trie/lib/marisa/grimoire/vector
gcc -fno-strict-aliasing -I/Users/fredmailhot/anaconda/envs/marisa_test/include -arch x86_64 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -Imarisa-trie/lib -Imarisa-trie/include -c marisa-trie/lib/marisa/agent.cc -o build/temp.macosx-10.7-x86_64-2.7/marisa-trie/lib/marisa/agent.o
marisa-trie/lib/marisa/agent.cc:3:10: fatal error: 'marisa/agent.h' file not found
#include "marisa/agent.h"
^
1 error generated.
error: command 'gcc' failed with exit status 1
It cannot be installed with PyPy.
The following error is shown:
error: use of undeclared identifier 'PyByteArray_FromStringAndSize';
I get the following error when trying to import on macOS Catalina 10.15.4. I have version 0.7.5 installed via pipenv. Super simple Python shell transcript below.
14:40:23 ❯ pipenv run python
Python 3.7.3 (default, Mar 6 2020, 22:34:30)
[Clang 11.0.3 (clang-1103.0.32.29)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import marisa_trie
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: dlopen(/Users/ianthetechie/.local/share/virtualenvs/python-json-apis-b20-K_lB/lib/python3.7/site-packages/marisa_trie.cpython-37m-darwin.so, 2): Symbol not found: __ZN6marisa4Trie4mmapEPKc
Referenced from: /Users/ianthetechie/.local/share/virtualenvs/python-json-apis-b20-K_lB/lib/python3.7/site-packages/marisa_trie.cpython-37m-darwin.so
Expected in: flat namespace
in /Users/ianthetechie/.local/share/virtualenvs/python-json-apis-b20-K_lB/lib/python3.7/site-packages/marisa_trie.cpython-37m-darwin.so
Consider publishing a wheel to PyPI. Makes installation more reliable (in particular possible on Windows and OS X without a compiler). See also http://pythonwheels.com/
Hi Mike,
Thanks for building this wrapper and providing the additional Bytes and RecordTrie classes, they're great and extremely useful and have been largely easy to build additional features into.
I would like to implement a RecordTrie feature as follows: if there are key-value pairs (u'a', (1, N_1)), (u'a', (2, N_2)), (u'a', (3, N_3)), ..., (u'a', (i, N_i)), and I know that I only need the values stored in N_p through N_q. If i is very large (say, a million), calling my_list = my_trie[u'a'] is prohibitively slow, and so I would like to start the loop at the p-th value, that is, at b_prefix = <bytes>u'a'.encode('utf8') + self._b_value_separator + <bytes>bytes(struct(">I", p)).
Of course, if I set Agent key to this, the loop will not continue on to (u'a', (p+1, N_{p+1})), and I cannot figure out how to "trick" the predictive search into continuing to loop past that specific prefix and on to anything with prefix simply <bytes>u'a'.encode('utf8') + self._b_value_separator without resetting the entire loop from the top.
Do you have any idea of how this might be accomplished, or if it is a limitation of the marisa-trie base library?
Thanks,
-George
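The idea behind the request rests on a property of the ">I" format: big-endian packed integers sort byte-wise in numeric order, so the entries for p through q sit next to each other in sorted order. The sketch below shows that property on a plain sorted list with bisect. It is not the marisa-trie API and does not answer whether the predictive search can be resumed this way. pack_entry and slice_by_index are hypothetical names.

```python
import struct
from bisect import bisect_left


def pack_entry(index, payload):
    """Prefix a payload with its big-endian 32-bit index."""
    return struct.pack(">I", index) + payload


def slice_by_index(sorted_entries, p, q):
    """Return the entries whose index lies in [p, q].

    Relies on byte-wise order matching numeric order for ">I" prefixes.
    q must be below 2**32 - 1 because q + 1 is packed as the upper bound.
    """
    low = bisect_left(sorted_entries, struct.pack(">I", p))
    high = bisect_left(sorted_entries, struct.pack(">I", q + 1))
    return sorted_entries[low:high]


entries = sorted(pack_entry(i, str(i).encode("ascii")) for i in range(1, 1001))
wanted = slice_by_index(entries, 300, 302)
print([struct.unpack(">I", entry[:4])[0] for entry in wanted])
```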
When I use marisa_trie, my execution always runs:
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
I tried to build marisa-trie with Python 3.9 and it failed with the following error:
gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -D_FORTIFY_SOURCE=2 -g -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -Werror=format-security -Imarisa-trie/include -I/opt/python3.9/include/python3.9 -c src/marisa_trie.cpp -o build/temp.linux-x86_64-3.9/src/marisa_trie.o
src/marisa_trie.cpp: In function ‘int __Pyx_modinit_type_init_code()’:
src/marisa_trie.cpp:17944:34: error: ‘PyTypeObject {aka struct _typeobject}’ has no member named ‘tp_print’
__pyx_type_11marisa_trie__Trie.tp_print = 0;
According to the documentation, tp_print was removed in Python 3.9: https://docs.python.org/dev/whatsnew/3.9.html#id3 .
I've tried this with both RecordTrie and BytesTrie and a number of different formats, but can't get things to work no matter what I do.
keys = ['foo', 'foo1', 'foobar', 'bar']
values = [(1,), (2,), (3,), (4,)]
fmt = str("<H")
trie = marisa_trie.RecordTrie(fmt, zip(keys, values), order=marisa_trie.WEIGHT_ORDER)
trie.items(u'')
>>> [(u'foo1', (4,)), (u'foobar', (3,)), (u'foo', (1,)), (u'bar', (2,))]
I tried adding a weights parameter as well, which isn't documented, but does seem supported in the code. It looks like it does something, because if an iterable with items that can't be converted to floats is passed in, it breaks. Nevertheless, values are still not returned in weight order:
trie = marisa_trie.RecordTrie(fmt, zip(keys, values), order=marisa_trie.WEIGHT_ORDER, weights=[1,2,3,4])
trie.items()
>>> [(u'foo1', (4,)), (u'foobar', (3,)), (u'foo', (1,)), (u'bar', (2,))]
Am I missing something here? Is there some other way to set the weight?
import marisa_trie
a = [(u'1', '1'), (u'1', '2')]
tr = marisa_trie.BytesTrie(a)
print tr.keys()
This will output [u'1', u'1'].
I guess this function should return [u'1'].
I'm ready to fix it if someone considers this a bug.
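As a caller-side workaround while keys() returns one entry per stored value, the duplicates can be dropped while keeping the first-seen order. This is a small sketch with a hypothetical helper name, not a change to the library.

```python
def unique_keys(keys):
    """Drop repeated keys, keeping the order of first appearance."""
    return list(dict.fromkeys(keys))


# The list reported above for a BytesTrie holding two values under u'1'.
print(unique_keys([u'1', u'1']))
```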
There's a nonsensical DeprecationWarning in Trie.load:
>>> import marisa_trie
>>> t = marisa_trie.Trie()
>>> t.load('data/language_names.marisa')
/home/rspeer/.virtualenvs/lum/bin/ipython:1: DeprecationWarning: Trie.save is deprecated and will be removed in marisa_trie 0.8.0. Please use Trie.load instead.
#!/home/rspeer/.virtualenvs/lum/bin/python3.5
<marisa_trie.Trie at 0x7f4e9fefb0f0>
Some things that are wrong with this:
I did not use Trie.save, I used Trie.load, which is exactly what it's telling me to use.
It is unclear how Trie.load would be able to replace Trie.save.
The deprecated method is presumably Trie.read, not Trie.save.
Trie.load is implemented by using Trie.read, so there is no way to avoid the DeprecationWarning.
Expected behavior: if I use the function that the DeprecationWarning tells me I should use, I should not get a DeprecationWarning.
Hi,
When creating a RecordTrie, the superclass _UnpackTrie unpacks all key value pairs in memory. So if I am correct, creating a Trie is not memory efficient at all. Is there a simple way to create large Tries more efficiently?
Thx,
joe
Using marisa-trie 0.7.4, on Python 3.5.1:
>>> t1 = marisa_trie.BytesTrie([('a', b'a'), ('ab', b'b'), ('ac', b'c')])
>>> t1.save('/tmp/t1.marisa')
/home/rspeer/.virtualenvs/lum/bin/ipython:1: DeprecationWarning: Trie.write is deprecated and will be removed in marisa_trie 0.8.0. Please use Trie.save instead.
#!/home/rspeer/.virtualenvs/lum/bin/python3.5
>>> t2 = marisa_trie.Trie()
>>> t2.load('/tmp/t1.marisa')
/home/rspeer/.virtualenvs/lum/bin/ipython:1: DeprecationWarning: Trie.save is deprecated and will be removed in marisa_trie 0.8.0. Please use Trie.load instead.
#!/home/rspeer/.virtualenvs/lum/bin/python3.5
<marisa_trie.Trie object at 0x7f4e9ea552d0>
>>> t2.keys()
Traceback (most recent call last):
File "<ipython-input-23-28b5ae76f2b3>", line 1, in <module>
t2.keys()
File "src/marisa_trie.pyx", line 267, in marisa_trie._Trie.keys (src/marisa_trie.cpp:6279)
File "src/marisa_trie.pyx", line 278, in marisa_trie._Trie.keys (src/marisa_trie.cpp:6172)
File "src/marisa_trie.pyx", line 403, in marisa_trie._UnicodeKeyedTrie._get_key (src/marisa_trie.cpp:8108)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 2: invalid start byte
Tests on AppVeyor now partially fail because pytest requires Python >= 2.7. Maybe it's time to deprecate 2.6 support, or at least to stop testing it?
I don't see any methods that would allow for just proceeding one edge in the trie, for example:
trie = marisa_trie.Trie([u'key1', u'key2', u'kite'])
trie.edges('k') #would return [u'ke', u'ki']
Using the prefixes method for something like this would be very expensive if the trie is big and the prefix is short. Is there some technical detail I'm missing for why implementing a function for this would be costly, or some other reason this isn't implemented? It seems like it's a necessary step in the traversal with the prefixes() method anyway, and quite useful for predictive lookup operations.
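For comparison, the expensive fallback can be sketched as a scan over every key that starts with the prefix, keeping one extra character of each. The edges function below is hypothetical: it mirrors the trie.edges('k') call proposed above but works on a plain list of keys, so it visits every matching key, which is exactly the cost a native one-step traversal would avoid.

```python
def edges(keys, prefix):
    """Return the distinct one-character extensions of prefix found in keys."""
    width = len(prefix) + 1
    return sorted({
        key[:width]
        for key in keys
        if key.startswith(prefix) and len(key) >= width
    })


keys = [u'key1', u'key2', u'kite']
print(edges(keys, u'k'))
```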
On PyPI I see there are many wheels available for 0.7.4, but the latest version only has a wheel for macOS, thus requiring other platforms to build from source, which means they need to have the Python development headers installed, etc. For casual users who are installing some package that depends on marisa-trie, this can be quite a burden.
Is it possible to set up some CD workflow that generates the wheels for various platforms?
I don't know whether this is user error, a return of Issue #34, or something new (since #34 seems to have been closed as resolved), but I just tried and failed to build marisa-trie under MacOS Mojave. Details below.
Vombatus:SciFi djb$ pip install marisa-trie
Collecting marisa-trie
Using cached https://files.pythonhosted.org/packages/20/95/d23071d0992dabcb61c948fb118a90683193befc88c23e745b050a29e7db/marisa-trie-0.7.5.tar.gz
Building wheels for collected packages: marisa-trie
Running setup.py bdist_wheel for marisa-trie ... error
Complete output from command /anaconda3/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/46/5fcnvzts14527hn_3ryrqr100000gn/T/pip-install-f2s2mhpn/marisa-trie/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /private/var/folders/46/5fcnvzts14527hn_3ryrqr100000gn/T/pip-wheel-syviv4xc --python-tag cp37:
running bdist_wheel
running build
running build_clib
building 'libmarisa-trie' library
creating build
creating build/temp.macosx-10.7-x86_64-3.7
creating build/temp.macosx-10.7-x86_64-3.7/marisa-trie
creating build/temp.macosx-10.7-x86_64-3.7/marisa-trie/lib
creating build/temp.macosx-10.7-x86_64-3.7/marisa-trie/lib/marisa
creating build/temp.macosx-10.7-x86_64-3.7/marisa-trie/lib/marisa/grimoire
creating build/temp.macosx-10.7-x86_64-3.7/marisa-trie/lib/marisa/grimoire/io
creating build/temp.macosx-10.7-x86_64-3.7/marisa-trie/lib/marisa/grimoire/trie
creating build/temp.macosx-10.7-x86_64-3.7/marisa-trie/lib/marisa/grimoire/vector
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/anaconda3/include -arch x86_64 -I/anaconda3/include -arch x86_64 -Imarisa-trie/lib -Imarisa-trie/include -c marisa-trie/lib/marisa/trie.cc -o build/temp.macosx-10.7-x86_64-3.7/marisa-trie/lib/marisa/trie.o
warning: include path for stdlibc++ headers not found; pass '-stdlib=libc++' on the command line to use the libc++ standard library instead [-Wstdlibcxx-not-found]
In file included from marisa-trie/lib/marisa/trie.cc:1:
marisa-trie/include/marisa/stdio.h:4:10: fatal error: 'cstdio' file not found
#include <cstdio>
^~~~~~~~
1 warning and 1 error generated.
error: command 'gcc' failed with exit status 1
----------------------------------------
Failed building wheel for marisa-trie
Running setup.py clean for marisa-trie
Failed to build marisa-trie
Installing collected packages: marisa-trie
Running setup.py install for marisa-trie ... error
Complete output from command /anaconda3/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/46/5fcnvzts14527hn_3ryrqr100000gn/T/pip-install-f2s2mhpn/marisa-trie/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /private/var/folders/46/5fcnvzts14527hn_3ryrqr100000gn/T/pip-record-539k0cjv/install-record.txt --single-version-externally-managed --compile:
running install
running build
running build_clib
building 'libmarisa-trie' library
creating build
creating build/temp.macosx-10.7-x86_64-3.7
creating build/temp.macosx-10.7-x86_64-3.7/marisa-trie
creating build/temp.macosx-10.7-x86_64-3.7/marisa-trie/lib
creating build/temp.macosx-10.7-x86_64-3.7/marisa-trie/lib/marisa
creating build/temp.macosx-10.7-x86_64-3.7/marisa-trie/lib/marisa/grimoire
creating build/temp.macosx-10.7-x86_64-3.7/marisa-trie/lib/marisa/grimoire/io
creating build/temp.macosx-10.7-x86_64-3.7/marisa-trie/lib/marisa/grimoire/trie
creating build/temp.macosx-10.7-x86_64-3.7/marisa-trie/lib/marisa/grimoire/vector
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/anaconda3/include -arch x86_64 -I/anaconda3/include -arch x86_64 -Imarisa-trie/lib -Imarisa-trie/include -c marisa-trie/lib/marisa/trie.cc -o build/temp.macosx-10.7-x86_64-3.7/marisa-trie/lib/marisa/trie.o
warning: include path for stdlibc++ headers not found; pass '-stdlib=libc++' on the command line to use the libc++ standard library instead [-Wstdlibcxx-not-found]
In file included from marisa-trie/lib/marisa/trie.cc:1:
marisa-trie/include/marisa/stdio.h:4:10: fatal error: 'cstdio' file not found
#include <cstdio>
^~~~~~~~
1 warning and 1 error generated.
error: command 'gcc' failed with exit status 1
----------------------------------------
Command "/anaconda3/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/46/5fcnvzts14527hn_3ryrqr100000gn/T/pip-install-f2s2mhpn/marisa-trie/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /private/var/folders/46/5fcnvzts14527hn_3ryrqr100000gn/T/pip-record-539k0cjv/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /private/var/folders/46/5fcnvzts14527hn_3ryrqr100000gn/T/pip-install-f2s2mhpn/marisa-trie/