robertostling / efselab
Efficient Sequence Labeling
License: GNU General Public License v3.0
I'm running the lemmatizer that's part of swe-pipeline on a very limited online host. It only gives me 500 MB of RAM, and I have to cram all my NLP stuff into that.
Here's a small test script that just loads the lemmatizer into memory and uses psutil
to measure the memory used:
def memory_usage_psutil():
    # Return the resident memory (RSS) of this process in MB
    import os
    import psutil
    process = psutil.Process(os.getpid())
    mem = process.memory_info()[0] / float(2 ** 20)
    return mem

if __name__ == '__main__':
    print("Base memory usage: %.2f MB" % memory_usage_psutil())
    import lemmatize
    lemmatizer = lemmatize.SUCLemmatizer()
    lemmatizer.load('swe-pipeline/suc-saldo.lemmas')
    print("Lemmatize memory usage: %.2f MB" % memory_usage_psutil())
To run it, put it in the efselab root directory, install psutil with pip install psutil, and run it.
(efselab) ~/Projects/efselab $ python test.py
Base memory usage: 9.10 MB
Lemmatize memory usage: 492.65 MB
I just updated to Python 3.5, and efselab now crashes with a segmentation fault.
$ make clean; make
python3 setup.py build_ext --inplace
running build_ext
building 'lemmatize' extension
creating build/temp.macosx-10.11-x86_64-3.5
clang -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -Qunused-arguments -Qunused-arguments -I/usr/local/Cellar/python3/3.5.2/Frameworks/Python.framework/Versions/3.5/include/python3.5m -c lemmatize.c -o build/temp.macosx-10.11-x86_64-3.5/lemmatize.o
lemmatize.c:8664:28: warning: unused function '__Pyx_PyObject_AsString' [-Wunused-function]
static CYTHON_INLINE char* __Pyx_PyObject_AsString(PyObject* o) {
^
lemmatize.c:8661:32: warning: unused function '__Pyx_PyUnicode_FromString' [-Wunused-function]
static CYTHON_INLINE PyObject* __Pyx_PyUnicode_FromString(const char* c_str) {
^
lemmatize.c:8776:33: warning: unused function '__Pyx_PyIndex_AsSsize_t' [-Wunused-function]
static CYTHON_INLINE Py_ssize_t __Pyx_PyIndex_AsSsize_t(PyObject* b) {
^
lemmatize.c:8838:33: warning: unused function '__Pyx_PyInt_FromSize_t' [-Wunused-function]
static CYTHON_INLINE PyObject * __Pyx_PyInt_FromSize_t(size_t ival) {
^
lemmatize.c:6907:32: warning: unused function '__Pyx_GetItemInt_List_Fast' [-Wunused-function]
static CYTHON_INLINE PyObject *__Pyx_GetItemInt_List_Fast(PyObject *o, Py_ssize_t i,
^
lemmatize.c:6922:32: warning: unused function '__Pyx_GetItemInt_Tuple_Fast' [-Wunused-function]
static CYTHON_INLINE PyObject *__Pyx_GetItemInt_Tuple_Fast(PyObject *o, Py_ssize_t i,
^
lemmatize.c:7084:26: warning: unused function '__Pyx_PyBytes_Equals' [-Wunused-function]
static CYTHON_INLINE int __Pyx_PyBytes_Equals(PyObject* s1, PyObject* s2, int equals) {
^
lemmatize.c:8081:26: warning: function '__Pyx_PyInt_As_int' is not needed and will not be emitted
[-Wunneeded-internal-declaration]
static CYTHON_INLINE int __Pyx_PyInt_As_int(PyObject *x) {
^
lemmatize.c:8406:32: warning: unused function '__Pyx_PyInt_From_int' [-Wunused-function]
static CYTHON_INLINE PyObject* __Pyx_PyInt_From_int(int value) {
^
lemmatize.c:8432:27: warning: function '__Pyx_PyInt_As_long' is not needed and will not be emitted
[-Wunneeded-internal-declaration]
static CYTHON_INLINE long __Pyx_PyInt_As_long(PyObject *x) {
^
10 warnings generated.
clang -bundle -undefined dynamic_lookup -Qunused-arguments -Qunused-arguments build/temp.macosx-10.11-x86_64-3.5/lemmatize.o -o /Users/EmilStenstrom/Projects/efselab/lemmatize.cpython-35m-darwin.so
building 'fasthash' extension
clang -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -Qunused-arguments -Qunused-arguments -I/usr/local/Cellar/python3/3.5.2/Frameworks/Python.framework/Versions/3.5/include/python3.5m -c fasthash.c -o build/temp.macosx-10.11-x86_64-3.5/fasthash.o -Wall
In file included from fasthash.c:3:
./c/hash.c:238:17: warning: unused function 'hash32_utf8_prefix' [-Wunused-function]
static uint32_t hash32_utf8_prefix(
^
./c/hash.c:252:17: warning: unused function 'hash32_utf8_suffix' [-Wunused-function]
static uint32_t hash32_utf8_suffix(
^
2 warnings generated.
clang -bundle -undefined dynamic_lookup -Qunused-arguments -Qunused-arguments build/temp.macosx-10.11-x86_64-3.5/fasthash.o -o /Users/EmilStenstrom/Projects/efselab/fasthash.cpython-35m-darwin.so
$ python3 build_udt_en.py --name udt_en --python
Building tagger...
Generating C code to udt_en.c...
Segmentation fault: 11
Things I've tried:
make clean
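One more thing that might help narrow this down: Python's standard faulthandler module can dump the Python-level traceback when the interpreter receives a fatal signal such as SIGSEGV. The following is a minimal sketch, not part of efselab; enable_crash_traceback is a hypothetical helper name, and the idea is to call it at the very top of the build script before anything else is imported.

```python
import faulthandler
import sys


def enable_crash_traceback(stream=None):
    """Dump Python tracebacks of all threads on fatal signals (e.g. SIGSEGV).

    `stream` must be a real file object with a file descriptor; the
    interpreter's original stderr is used when none is given.
    """
    if stream is None:
        stream = sys.__stderr__
    faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()


# Hypothetical usage: put this before the imports that load the
# compiled extension modules, so a crash inside them is reported.
crash_reporting_on = enable_crash_traceback()
print("faulthandler enabled:", crash_reporting_on)
```

The same effect is available without editing any file by starting the interpreter with the -X faulthandler option, for example python3 -X faulthandler build_udt_en.py --name udt_en --python. The traceback only shows which Python call was active when the crash happened; it does not explain the cause inside the C code.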
In the suc-saldo.lemmas file there's a minor error:
All words that start with "alkoholinnehav" have a zero-width space in front of them, so they never match any lemmatization rule. I found it by simply sorting the file and seeing that "alkoholinnehav" ended up at the bottom of the list instead of with the other A-words.
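Sorting works, but a small scan can find such characters directly. This is only a sketch: the set of characters checked is my own choice, find_invisible and strip_invisible are hypothetical helper names, and the sample lines are made up rather than copied from suc-saldo.lemmas.

```python
# Invisible code points that commonly sneak into text files.
INVISIBLE = {
    "\u200b": "ZERO WIDTH SPACE",
    "\u200c": "ZERO WIDTH NON-JOINER",
    "\u200d": "ZERO WIDTH JOINER",
    "\ufeff": "ZERO WIDTH NO-BREAK SPACE (BOM)",
}


def find_invisible(lines):
    """Yield (line_number, column, name) for every invisible character."""
    for line_no, line in enumerate(lines, start=1):
        for col, ch in enumerate(line, start=1):
            if ch in INVISIBLE:
                yield line_no, col, INVISIBLE[ch]


def strip_invisible(line):
    """Return the line with all invisible characters removed."""
    return "".join(ch for ch in line if ch not in INVISIBLE)


# Made-up sample lines; the real file format may differ.
sample = ["\u200balkoholinnehav", "alkohol"]
for line_no, col, name in find_invisible(sample):
    print("line %d, column %d: %s" % (line_no, col, name))
```

To check the real file, an open file object can be passed instead of the sample list, for example find_invisible(open('swe-pipeline/suc-saldo.lemmas', encoding='utf-8')), assuming the file is UTF-8 encoded.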
This link appears to be dead:
I've built the Swedish pipeline and it is working fine with swe_pipeline.py.
Now I'd like to use the Python interface, but I get the following error:
Python 3.6.5 (default, Apr 12 2018, 22:45:43)
[GCC 7.3.1 20180312] on linux
Type "help", "copyright", "credits" or "license" for more information.
$ import udt_suc_sv
$ with open('swe-pipeline/suc.bin', 'rb') as f: weights = f.read()
...
$ udt_suc_sv.tag(weights, ['öppningsbar', 'bro', '.'])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: Expected 3 fields for token, found single string
$
What am I doing wrong?
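The error message suggests that this tagger expects each token to be a sequence of 3 fields rather than a bare string. Which fields it wants is not stated here, so the sketch below rests on a guess: that they are (word form, lemma, SUC tag). The helper to_token_fields is hypothetical, and the lemmas and tags are hand-written placeholders, not output from the pipeline.

```python
def to_token_fields(words, lemmas, tags):
    """Zip three parallel lists into one 3-field tuple per token."""
    if not (len(words) == len(lemmas) == len(tags)):
        raise ValueError("words, lemmas and tags must have the same length")
    return tuple(zip(words, lemmas, tags))


words = ["öppningsbar", "bro", "."]
lemmas = ["öppningsbar", "bro", "."]  # placeholder values (assumption)
tags = ["JJ", "NN", "MAD"]            # placeholder values (assumption)

sentence = to_token_fields(words, lemmas, tags)
print(sentence)

# Hypothetical call, assuming the field order guessed above is right:
# udt_suc_sv.tag(weights, sentence)
```

If the guess about the fields is wrong, the shape of the input (one tuple per token) may still be the part that matters, since the error complains about finding a single string.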
Hi,
I'm trying to create a tagger with the Swedish pipeline on a machine running Windows 10 and Cygwin.
When I run the command 'python3 build_suc.py --skip-generate --python --n-train-fields 2' I get some gcc output and then nothing happens. I have been waiting for 30+ minutes.
I've taken a screenshot that shows the efselab folder as well as my terminal here: http://imgur.com/ApdwxQI
Any idea what's missing or wrong, or how to debug this?
When I try to follow the steps in the Readme, it works for udt_en, but not for suc. What happens with suc is as follows: python3 build_suc.py --name suc --python builds a C file and then gets stuck at cc -Wall -Wno-unused-function -Ofast -I /home/sasha/efselab -o suc suc.c, and nothing happens. I interrupt it and compile the C file using gcc -o suc suc.c. That works, and I am able to run ./suc train suc-data/suc-blogs.tab suc-data/talbanken_dev.tab suc.bin, which runs for 6 iterations and then crashes at
Finding optimal feature compression...
unable to open model file for reading: No such file or directory
At every iteration, it also reports:
Error at 235!
Tuning error: 100.00%
When tagging sentences that contain multi-token entities, I sometimes get incorrect tags:
Det brinner i Fort McMurray.
PN VB PP AB PM
Fort should be PM, just like McMurray.
(I understand this is a hard problem! I just wanted to document it somewhere)