efselab's People

Contributors

emilstenstrom, py3ams, robertostling

efselab's Issues

Memory usage high with lemmatizer

I'm running the lemmatizer that's part of swe-pipeline on a very limited online host. It only gives me 500 MB of RAM, and all my NLP code has to fit into that.

Here's a small test script that just loads the lemmatizer into memory and uses psutil to measure the memory used:

import os

import psutil


def memory_usage_psutil():
    # Return the resident set size (RSS) of this process in MB.
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / float(2 ** 20)

if __name__ == '__main__':
    print("Base memory usage: %.2f MB" % memory_usage_psutil())

    import lemmatize
    lemmatizer = lemmatize.SUCLemmatizer()
    lemmatizer.load('swe-pipeline/suc-saldo.lemmas')
    print("Lemmatize memory usage: %.2f MB" % memory_usage_psutil())

To run it, save it as test.py in the efselab root directory, install psutil with pip install psutil, and run python test.py.

(efselab) ~/Projects/efselab $ python test.py 
Base memory usage: 9.10 MB
Lemmatize memory usage: 492.65 MB

Segmentation fault with Python 3.5

I just updated to Python 3.5, and efselab now crashes with a segmentation fault.

$ make clean; make

python3 setup.py build_ext --inplace
running build_ext
building 'lemmatize' extension
creating build/temp.macosx-10.11-x86_64-3.5
clang -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -Qunused-arguments -Qunused-arguments -I/usr/local/Cellar/python3/3.5.2/Frameworks/Python.framework/Versions/3.5/include/python3.5m -c lemmatize.c -o build/temp.macosx-10.11-x86_64-3.5/lemmatize.o
lemmatize.c:8664:28: warning: unused function '__Pyx_PyObject_AsString' [-Wunused-function]
static CYTHON_INLINE char* __Pyx_PyObject_AsString(PyObject* o) {
                           ^
lemmatize.c:8661:32: warning: unused function '__Pyx_PyUnicode_FromString' [-Wunused-function]
static CYTHON_INLINE PyObject* __Pyx_PyUnicode_FromString(const char* c_str) {
                               ^
lemmatize.c:8776:33: warning: unused function '__Pyx_PyIndex_AsSsize_t' [-Wunused-function]
static CYTHON_INLINE Py_ssize_t __Pyx_PyIndex_AsSsize_t(PyObject* b) {
                                ^
lemmatize.c:8838:33: warning: unused function '__Pyx_PyInt_FromSize_t' [-Wunused-function]
static CYTHON_INLINE PyObject * __Pyx_PyInt_FromSize_t(size_t ival) {
                                ^
lemmatize.c:6907:32: warning: unused function '__Pyx_GetItemInt_List_Fast' [-Wunused-function]
static CYTHON_INLINE PyObject *__Pyx_GetItemInt_List_Fast(PyObject *o, Py_ssize_t i,
                               ^
lemmatize.c:6922:32: warning: unused function '__Pyx_GetItemInt_Tuple_Fast' [-Wunused-function]
static CYTHON_INLINE PyObject *__Pyx_GetItemInt_Tuple_Fast(PyObject *o, Py_ssize_t i,
                               ^
lemmatize.c:7084:26: warning: unused function '__Pyx_PyBytes_Equals' [-Wunused-function]
static CYTHON_INLINE int __Pyx_PyBytes_Equals(PyObject* s1, PyObject* s2, int equals) {
                         ^
lemmatize.c:8081:26: warning: function '__Pyx_PyInt_As_int' is not needed and will not be emitted
      [-Wunneeded-internal-declaration]
static CYTHON_INLINE int __Pyx_PyInt_As_int(PyObject *x) {
                         ^
lemmatize.c:8406:32: warning: unused function '__Pyx_PyInt_From_int' [-Wunused-function]
static CYTHON_INLINE PyObject* __Pyx_PyInt_From_int(int value) {
                               ^
lemmatize.c:8432:27: warning: function '__Pyx_PyInt_As_long' is not needed and will not be emitted
      [-Wunneeded-internal-declaration]
static CYTHON_INLINE long __Pyx_PyInt_As_long(PyObject *x) {
                          ^
10 warnings generated.
clang -bundle -undefined dynamic_lookup -Qunused-arguments -Qunused-arguments build/temp.macosx-10.11-x86_64-3.5/lemmatize.o -o /Users/EmilStenstrom/Projects/efselab/lemmatize.cpython-35m-darwin.so
building 'fasthash' extension
clang -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -Qunused-arguments -Qunused-arguments -I/usr/local/Cellar/python3/3.5.2/Frameworks/Python.framework/Versions/3.5/include/python3.5m -c fasthash.c -o build/temp.macosx-10.11-x86_64-3.5/fasthash.o -Wall
In file included from fasthash.c:3:
./c/hash.c:238:17: warning: unused function 'hash32_utf8_prefix' [-Wunused-function]
static uint32_t hash32_utf8_prefix(
                ^
./c/hash.c:252:17: warning: unused function 'hash32_utf8_suffix' [-Wunused-function]
static uint32_t hash32_utf8_suffix(
                ^
2 warnings generated.
clang -bundle -undefined dynamic_lookup -Qunused-arguments -Qunused-arguments build/temp.macosx-10.11-x86_64-3.5/fasthash.o -o /Users/EmilStenstrom/Projects/efselab/fasthash.cpython-35m-darwin.so
$ python3 build_udt_en.py --name udt_en --python
Building tagger...
Generating C code to udt_en.c...
Segmentation fault: 11

Things I've tried:

  • make clean
  • Deleting all .pyc-files
  • Creating a new virtualenv from scratch and reinstalling Cython

Minor error in suc-saldo.lemmas

In the suc-saldo.lemmas file there's a minor error:

All words that start with "alkoholinnehav" have a zero-width space in front of them, so they do not match any lemmatization rules. I found it by sorting the file and seeing that "alkoholinnehav" ended up at the bottom of the list instead of with the other A-words.
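The same check can be done without sorting. The following is a minimal sketch, assuming the file is UTF-8 text with one entry per line and that the stray character is U+200B (ZERO WIDTH SPACE). The function name find_zero_width_lines and the sample lines are made up for illustration; they are not part of efselab.

```python
import io

ZERO_WIDTH = "\u200b"  # assumption: the stray character is U+200B


def find_zero_width_lines(lines):
    """Yield (line_number, line) for every line that contains U+200B."""
    for number, line in enumerate(lines, start=1):
        if ZERO_WIDTH in line:
            yield number, line.rstrip("\n")


# Tiny in-memory sample standing in for the real file (made-up content).
sample = io.StringIO("alkohol\tx\n\u200balkoholinnehav\ty\n")
hits = list(find_zero_width_lines(sample))
for number, line in hits:
    print(number, repr(line))
```

To scan the real file, pass an open file object instead of the sample, for example open('swe-pipeline/suc-saldo.lemmas', encoding='utf-8').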

Python interface for Swedish tagger does not work

I've built the Swedish pipeline and it is working fine with swe_pipeline.py.
Now I'd like to use the Python interface, but I get the following error:

Python 3.6.5 (default, Apr 12 2018, 22:45:43)
[GCC 7.3.1 20180312] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import udt_suc_sv
>>> with open('swe-pipeline/suc.bin', 'rb') as f: weights = f.read()
...
>>> udt_suc_sv.tag(weights, ['öppningsbar', 'bro', '.'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: Expected 3 fields for token, found single string
>>>

What am I doing wrong?
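The error message suggests that tag() wants each token to be a sequence of three fields rather than a bare string. Which three fields it expects is not stated in this thread, so the sketch below only checks the shape of the input before the call. The helper check_token_fields is hypothetical and the field values in the example are placeholders, not the tagger's real input format.

```python
def check_token_fields(tokens, n_fields=3):
    """Return tokens unchanged if every token is a sequence of n_fields items.

    Raises ValueError naming the first offending token otherwise.
    """
    for index, token in enumerate(tokens):
        if isinstance(token, str):
            raise ValueError(
                "token %d is a bare string %r, expected %d fields"
                % (index, token, n_fields))
        if len(token) != n_fields:
            raise ValueError(
                "token %d has %d fields, expected %d"
                % (index, len(token), n_fields))
    return tokens


# Placeholder fields: what the three fields mean is an assumption left open.
ok = check_token_fields([("bro", "field2", "field3")])
print(ok)
```

Calling check_token_fields(['öppningsbar', 'bro', '.']) raises a ValueError for token 0, which mirrors the failure above at the Python level.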

Trouble with Swedish pipeline

Hi,
I'm trying to create a tagger with the Swedish pipeline on a machine running Windows 10 and Cygwin.
When I run the command 'python3 build_suc.py --skip-generate --python --n-train-fields 2', I get some gcc output and then nothing happens; I have waited for more than 30 minutes.
I've taken a screenshot that shows the efselab folder as well as my terminal: http://imgur.com/ApdwxQI

Any idea of what's missing/wrong or how to debug this?
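One way to narrow down a hang like this is to run the command under a time limit and look at the last output it produced before stalling. The following is a minimal sketch using only the standard library. The helper run_with_timeout is hypothetical, and it does not diagnose the cause; it only shows where the output stops.

```python
import subprocess
import sys


def run_with_timeout(command, seconds):
    """Run command and return (finished, output).

    If the time limit is hit, the process is killed and whatever output
    was captured so far (possibly nothing) is returned.
    """
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=seconds,
        )
        return True, result.stdout.decode(errors="replace")
    except subprocess.TimeoutExpired as exc:
        partial = exc.output or b""
        return False, partial.decode(errors="replace")


# Harmless demonstration command; replace it with the real build command.
finished, output = run_with_timeout([sys.executable, "-c", "print('ok')"], 30)
print(finished, output)
```

For the build in question, the command list would be ['python3', 'build_suc.py', '--skip-generate', '--python', '--n-train-fields', '2'] with a generous limit.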

unable to open model file for reading

When I follow the steps in the README, they work for udt_en but not for suc. With suc, python3 build_suc.py --name suc --python generates a C file and then hangs at cc -Wall -Wno-unused-function -Ofast -I /home/sasha/efselab -o suc suc.c. I interrupt it and compile the C file myself with gcc -o suc suc.c. That works, and I can run ./suc train suc-data/suc-blogs.tab suc-data/talbanken_dev.tab suc.bin, which runs for 6 iterations and then crashes at

Finding optimal feature compression...
unable to open model file for reading: No such file or directory

At every iteration, it also reports:

Error at 235!
  Tuning error:   100.00%

Named entities with multiple tokens sometimes get incorrect tags

When tagging sentences that contain entities made up of multiple tokens, I sometimes get incorrect tags:

Det  brinner  i   Fort  McMurray.
PN   VB       PP  AB    PM

Fort should be PM, just like McMurray.

(I understand this is a hard problem! I just wanted to document it somewhere)
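To collect more cases like this, tagger output can be scanned for capitalised tokens that sit directly before a PM token but are not tagged PM themselves. This is a rough heuristic sketch for finding candidates, not a fix and not part of efselab. The function name suspicious_name_parts is hypothetical, and sentence-initial tokens are skipped because they are capitalised anyway.

```python
def suspicious_name_parts(tokens, tags, name_tag="PM"):
    """Return indices of capitalised non-name tokens directly before a name token."""
    flagged = []
    # Start at 1: the first token of a sentence is capitalised regardless.
    for i in range(1, len(tokens) - 1):
        if (tags[i] != name_tag
                and tags[i + 1] == name_tag
                and tokens[i][:1].isupper()):
            flagged.append(i)
    return flagged


# The example from this issue, without the final punctuation.
tokens = ["Det", "brinner", "i", "Fort", "McMurray"]
tags = ["PN", "VB", "PP", "AB", "PM"]
flagged = suspicious_name_parts(tokens, tags)
print([tokens[i] for i in flagged])
```

The heuristic will also flag legitimate cases, so its output is a list to review by hand rather than a list of errors.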
