larsmans / seqlearn

Sequence learning toolkit for Python

Home Page: http://larsmans.github.io/seqlearn/

License: MIT License

Makefile 0.09% Python 99.91%

seqlearn's Introduction

seqlearn

seqlearn is a sequence classification toolkit for Python. It is designed to extend scikit-learn and to offer an API that is as similar as possible.

Compiling and installing

Get NumPy >=1.6, SciPy >=0.11, Cython >=0.20.2 and a recent version of scikit-learn. Then issue:

python setup.py install

to install seqlearn.

If you want to use seqlearn from its source directory without installing, you have to compile first:

python setup.py build_ext --inplace

Getting started

The easiest way to start using seqlearn is to fetch a dataset in CoNLL 2000 format. Define a task-specific feature extraction function, e.g.:

>>> def features(sequence, i):
...     yield "word=" + sequence[i].lower()
...     if sequence[i].isupper():
...         yield "Uppercase"
...
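
To see what such a function produces, it can be called on an in-memory sentence. The following is only an illustration of the function above; the example sentence is made up.

```python
def features(sequence, i):
    """Yield string features for the token at position i of sequence."""
    yield "word=" + sequence[i].lower()
    if sequence[i].isupper():
        yield "Uppercase"

# Hypothetical sentence, used only to show the generated features.
sentence = ["The", "NASA", "probe", "landed"]
per_token = [list(features(sentence, i)) for i in range(len(sentence))]

for token, feats in zip(sentence, per_token):
    print(token, feats)
```

Only tokens written entirely in capitals receive the "Uppercase" feature, because str.isupper() is false for a word like "The".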

Load the training file, say train.txt:

>>> from seqlearn.datasets import load_conll
>>> X_train, y_train, lengths_train = load_conll("train.txt", features)

Train a model:

>>> from seqlearn.perceptron import StructuredPerceptron
>>> clf = StructuredPerceptron()
>>> clf.fit(X_train, y_train, lengths_train)

Check how well you did on a validation set, say validation.txt:

>>> X_test, y_test, lengths_test = load_conll("validation.txt", features)
>>> from seqlearn.evaluation import bio_f_score
>>> y_pred = clf.predict(X_test, lengths_test)
>>> print(bio_f_score(y_test, y_pred))

For more information, see the documentation.

seqlearn's Issues

Great code. Could you share other repos that work in this direction?

Validation code needed for lengths parameters

Right now, passing a lengths array that is inconsistent with X.shape[0] will lead to a segfault in predict. We need validation code.
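
A minimal sketch of the kind of check this issue asks for, assuming the only requirement is that the lengths are non-negative and sum to the number of rows in X. The function name check_lengths is hypothetical and is not part of seqlearn.

```python
def check_lengths(n_samples, lengths):
    """Return lengths as a list of ints, or raise ValueError if inconsistent.

    n_samples is the number of rows in X (X.shape[0] for an array).
    """
    lengths = [int(n) for n in lengths]
    # Assumption: negative lengths are never meaningful.
    if any(n < 0 for n in lengths):
        raise ValueError("sequence lengths must be non-negative")
    total = sum(lengths)
    if total != n_samples:
        raise ValueError(
            "lengths sum to %d, but X has %d samples" % (total, n_samples)
        )
    return lengths


print(check_lengths(5, [2, 3]))
```

Running such a check at the start of fit and predict would turn the inconsistency into a Python exception before any compiled code is reached.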

Using a `requirements.txt` file

I was wondering if it would be possible to include a requirements.txt in the root directory with the prerequisites?

I am currently making a one-line installer for a (private) module that downloads and installs seqlearn directly from GitHub using pip install -e git+https://github.com/larsmans/seqlearn.git#egg=seqlearn. But this won't work in new virtual environments without first installing Cython manually.

I know this has limited usefulness, but it would simplify things for me a notch.

Awesome project you guys have got going here BTW.

How to use word embeddings as features?

Is there any way to pass word vectors extracted from word2vec as features? Or do I have to use or change something else in place of the existing feature hashing? Many thanks!

Dataset loading

I'm trying to train an HMM that classifies lines in an HTML document as belonging to a certain zone or class (e.g. body, header, footer, title, etc.). Thus, each sequence is a document and each sample is a line. On each line, I compute a number of floating-point features.

What is the correct input format for this data in order to train a model with seqlearn? I'm having trouble understanding how to format the data from the documentation.
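
The README's load_conll call returns three things (X, y, lengths), which suggests a layout where the samples of all sequences are stacked and lengths records how many consecutive samples belong to each sequence. A sketch of building that layout from per-document data follows. The helper name flatten_sequences and the toy values are hypothetical, and whether a given seqlearn model accepts real-valued features is not confirmed here.

```python
def flatten_sequences(docs, doc_labels):
    """Stack per-document rows into flat X, y plus a lengths list.

    docs:       list of documents, each a list of feature rows (one per line)
    doc_labels: list of label lists, parallel to docs
    """
    X, y, lengths = [], [], []
    for rows, labels in zip(docs, doc_labels):
        if len(rows) != len(labels):
            raise ValueError("each line needs exactly one label")
        X.extend(rows)
        y.extend(labels)
        lengths.append(len(rows))
    return X, y, lengths


# Two toy documents with 2 and 3 lines, and 2 float features per line.
docs = [
    [[0.1, 3.0], [0.7, 1.0]],
    [[0.2, 2.0], [0.9, 5.0], [0.4, 0.0]],
]
doc_labels = [["title", "body"], ["header", "body", "footer"]]

X, y, lengths = flatten_sequences(docs, doc_labels)
print(len(X), y, lengths)
```

The nested lists would then typically be converted to arrays (for example with numpy.asarray) before being passed to fit(X, y, lengths).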

Alternatives for load_conll

Hello,

I use your toolkit for sequence learning. I both train and predict using data from csv/tsv files.
On the other hand, it would be useful to make predictions directly from code (e.g. POS-tagging a sentence passed as a function argument rather than read from a file).
Does seqlearn only read data from files? Would it be possible to implement an alternative that loads Pandas dataframes?

Thanks,
Daria
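
One possible direction, sketched under assumptions: the feature extraction function from the README works on any in-memory list of tokens, so sentences passed as function arguments can be turned into per-token feature lists and a lengths list without touching a file. The helper name sentences_to_features is hypothetical, and the step that turns the feature strings into a matrix (for instance with scikit-learn's FeatureHasher) is deliberately left out.

```python
def features(sequence, i):
    """Feature extraction function as in the README."""
    yield "word=" + sequence[i].lower()
    if sequence[i].isupper():
        yield "Uppercase"


def sentences_to_features(sentences, feature_func):
    """Return (rows, lengths): one feature list per token, one length per sentence."""
    rows, lengths = [], []
    for tokens in sentences:
        rows.extend(list(feature_func(tokens, i)) for i in range(len(tokens)))
        lengths.append(len(tokens))
    return rows, lengths


rows, lengths = sentences_to_features(
    [["Hello", "world"], ["seqlearn", "tags", "TOKENS"]], features
)
print(rows, lengths)
```

For prediction, the feature strings must be vectorized in the same way as the training data, otherwise the columns of X will not line up with what the model learned.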

How to use seqlearn in Anaconda?

Can we somehow install seqlearn in Anaconda so that it can be used in Spyder? I searched, but it was not available in the anaconda.org database. Any help would be great.

Chahat

matlab in python ?

[Question] Training Algorithm for hmm

Hi,
Just have a question and would be happy if you could help.
Is there a fast way to use different training algorithms for HMM here? (I need something like hmmlearn's _BaseHMM in a supervised setting.)
Thank you!

in hmm.py there is a wrong reference to logsumexp method

In hmm.py logsumexp is imported as:
from scipy.misc import logsumexp
however in the new versions of scipy logsumexp was moved to scipy.special, therefore it should be:

from scipy.special import logsumexp

Can you share more references for the ideas behind the Viterbi perceptron algorithm?

Short, simple explanations of the ideas behind the Viterbi perceptron would be best.

How to use SequenceKFold?

SequenceKFold returns (train, test) indices, but the "fit" methods also require matching "lengths" arrays, so it is not clear how to use SequenceKFold for cross-validation.
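
To illustrate what a cross-validation splitter would need to return for this to work, here is a standalone sketch that keeps whole sequences together and yields lengths alongside the sample indices. It does not use or reproduce SequenceKFold; the function name sequence_folds and the round-robin assignment of sequences to folds are assumptions made for the example.

```python
def sequence_folds(lengths, n_folds):
    """Yield (train_idx, train_lengths, test_idx, test_lengths) per fold.

    Sequences are assigned to folds round-robin; samples of one sequence
    are never split between train and test.
    """
    # Start offset of every sequence in the flat sample array.
    starts, pos = [], 0
    for n in lengths:
        starts.append(pos)
        pos += n

    for k in range(n_folds):
        train_idx, train_len, test_idx, test_len = [], [], [], []
        for s, (start, n) in enumerate(zip(starts, lengths)):
            idx = range(start, start + n)
            if s % n_folds == k:
                test_idx.extend(idx)
                test_len.append(n)
            else:
                train_idx.extend(idx)
                train_len.append(n)
        yield train_idx, train_len, test_idx, test_len


folds = list(sequence_folds([2, 3, 1], n_folds=3))
for train_idx, train_len, test_idx, test_len in folds:
    print(train_idx, train_len, test_idx, test_len)
```

With X and y stored as NumPy arrays, X[train_idx] and y[train_idx] together with train_len would form the three arguments that fit expects.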

Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

I am trying to enable trans_feature to get a more sophisticated model in seqlearn, one that goes beyond the unary term and the pairwise term (transition matrix), but I hit the exit error in the title. I am using my own dataset; there was no problem when I tested with the provided cornell dataset. Have you ever encountered similar errors? The line where the exit happens is safe_add(A, B), which calls some compiled C code. I am not sure whether that is the cause or how to solve it. Thank you very much.

ValueError: Buffer dtype mismatch, expected 'npy_intp' but got 'int'

I am using the MultinomialHMM class of seqlearn and the code crashes with error

ValueError: Buffer dtype mismatch, expected 'npy_intp' but got 'int'

at the function count_trans() inside the fit() function. I am trying to change the data type of the y array from int to numpy.npy_intp but this type does not exist in numpy (np).

How can I fix this?
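
For reference, npy_intp is the C-level name of the pointer-sized integer type; on the Python side NumPy exposes it as np.intp. The snippet below only shows the conversion. Whether converting y before calling fit avoids the error depends on where the mismatched array is created inside seqlearn, which is not verified here.

```python
import numpy as np

# Labels stored as 32-bit integers, as can happen on some platforms.
y = np.array([0, 2, 1, 1], dtype=np.int32)

# np.intp is the Python-level counterpart of the C type npy_intp.
y_intp = y.astype(np.intp)

print(y.dtype, "->", y_intp.dtype)
```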

Does seqlearn support multiple cores?

I was wondering whether seqlearn supports multiple cores for big datasets. python-crfsuite seems nice, but it is restricted to a single core, which makes training slow on big datasets.

installation error: about vc++10.0

Hi, I am not sure whether this is the proper place for my question. When I install seqlearn, it shows the error "Microsoft Visual C++ 10.0 is required (unable to find vcvarsall.bat)", even though I have VC++ 2013 and VC++ 2008 installed on my laptop. I have searched many forums but could not make sense of their advice. How can I solve this?
Thanks!

transition feature

I tried enabling the transition feature in perceptron learning. After reading the source code, I found that the implementation is not consistent with the comments on make_trans_matrix(y, n_classes, dtype=np.float64) in transmatrix.py. As I understand it, relying only on the coefficients w and the label count matrix easily leads to a label bias problem: in real cases that use BIO tagging, the 'O' label dominates both the feature-space distribution and the label count matrix. The transition feature assumes that the feature distribution of the token preceding a given label is consistent and follows a pattern, so it should resolve some label bias issues. I am not sure whether my interpretation of the transition feature is correct. I have also modified the corresponding code; if you like, I will submit a merge request.

Installation Error on Python 3.5

When I tried to install seqlearn on my MacBook, the following error occurred:

ValueError: 'seqlearn/_decode/bestfirst.pyx' doesn't match any files

But I do have the bestfirst.pyx file in my folder.

seqlearn not working since new version of sklearn

In the new version of sklearn that I use, six is no longer part of sklearn but a separate library. As a result, seqlearn raises an error when I try to run a perceptron:

/usr/local/lib/python3.8/dist-packages/seqlearn/perceptron.py in <module>
      8 import numpy as np
      9 from scipy.sparse import csc_matrix
---> 10 from sklearn.externals import six
     11 
     12 from .base import BaseSequenceClassifier

ImportError: cannot import name 'six' from 'sklearn.externals' (/usr/local/lib/python3.8/dist-packages/sklearn/externals/__init__.py)

Will it work for multivariate time series?

Great code, thanks.
Could you clarify whether it will work for multivariate time series:

1. where all values are continuous, or
2. where the values are a mixture of continuous and categorical, for example 2 dimensions have continuous values and 3 dimensions are categorical?

     color    weight   gender   height   age
1    black    56       m        160      34
2    white    77       f        170      54
3    yellow   87       m        167      43
4    white    55       m        198      72
5    white    88       f        176      32

partial_fit method?

I would like to open this for discussion: the implementation of a partial_fit method for incremental learning.

Change verbose reporting format for StructuredPerceptron

What do you think about:

1. using sum_loss instead of loss (currently the printed loss is the loss for some random sequence), and
2. printing the iteration number on the same line as sum_loss (=> half as many output lines)?

How to use this for sequences of events?

I have data consisting of sequences of events, where each sequence maps to one class. For example, my data looks like this:

X['Event1', 'Event2', 'Event3'] ---> Y[C1]
X['Event2', 'Event1', 'Event5', 'Event2', 'Event1', 'Event5'] ---> Y[C2]
..

Also, is there any possibility of extending this to regression over sequences, so that Y is a real value instead of a class?

Sklearn compatibility

Hello,
I was hoping to reuse the model selection routines of the scikit-learn API (grid search CV and the like), but it appears that neither the HMM nor the StructuredPerceptron is considered a valid estimator (http://scikit-learn.org/stable/developers/contributing.html#rolling-your-own-estimator). Looking through the source code, everything seems to abide by the scikit-learn rules, but if I try:

from seqlearn.perceptron import StructuredPerceptron
from sklearn.utils.estimator_checks import check_estimator
model = StructuredPerceptron()
check_estimator(model)

I get:
AttributeError: 'StructuredPerceptron' object has no attribute 'name'

Any clue on how to fix this compatibility issue?
Thanks a lot in advance,
Enrico

pypi?

larsmans, would you be willing to push this to PyPI so it is easy to include as a package dependency in other tools? I am happy to help with this if you'd like.

Installation error: command 'clang' failed with exit status 1

The last part of the error message is:

seqlearn/_decode/bestfirst.c:600:10: fatal error: 'numpy/arrayobject.h' file not found
#include "numpy/arrayobject.h"
         ^~~~~~~~~~~~~~~~~~~~~
1 error generated.
error: command 'clang' failed with exit status 1
----------------------------------------
ERROR: Command errored out with exit status 1: /usr/local/opt/python/bin/python3.7 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/4f/zswmtvpn44gfvcz53m_c24400000gn/T/pip-install-6hdkoh2q/seqlearn/setup.py'"'"'; file='"'"'/private/var/folders/4f/zswmtvpn44gfvcz53m_c24400000gn/T/pip-install-6hdkoh2q/seqlearn/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /private/var/folders/4f/zswmtvpn44gfvcz53m_c24400000gn/T/pip-record-gr6jqkyt/install-record.txt --single-version-externally-managed --compile --install-headers /usr/local/include/python3.7m/seqlearn Check the logs for full command output.

load_conll, unicode support

Hi, loading data in CoNLL format fails on my custom dataset, which contains non-ASCII characters. When I read the data with the encoding set to 'utf-8', I get the following error:

  File "/usr/local/lib/python2.7/dist-packages/seqlearn/datasets.py", line 65, in <genexpr>
    lines = (str.split(line) for line in  f)
TypeError: descriptor 'split' requires a 'str' object but received a 'unicode'

def _conll_sequences(f, features, labels, lengths, split):
    # Divide input into blocks of empty and non-empty lines.
    lines = (str.strip(line) for line in  f)

Everything works perfectly when I modify the last line like this:

 lines = (line.strip() for line in  f)

Is there any reason why such a fix would be unwanted?
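
The difference between the two spellings is that str.strip(line) insists on a str instance, while line.strip() calls whichever strip method the object has, so it also works for Python 2 unicode objects. A standalone sketch of the "blocks of empty and non-empty lines" idea using the method call follows. The function name iter_blocks is hypothetical and this is not seqlearn's actual implementation.

```python
from itertools import groupby


def iter_blocks(lines):
    """Yield lists of stripped, non-empty lines, split on empty lines."""
    # line.strip() dispatches on the object's own type (str or unicode).
    stripped = (line.strip() for line in lines)
    for is_nonempty, group in groupby(stripped, bool):
        if is_nonempty:
            yield list(group)


sample = [u"Z\u00fcrich B-LOC\n", u"liegt O\n", u"\n", u"Ja O\n"]
blocks = list(iter_blocks(sample))
print(blocks)
```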

Can seqlearn use hmm/gmm?

Hello!
I want to solve this task http://cslu.ohsu.edu/~bedricks/courses/cs655/hw/hw4/hw4.html
Can I use seqlearn for supervised learning with an HMM/GMM structure?