seqlearn's Introduction

seqlearn

seqlearn is a sequence classification toolkit for Python. It is designed to extend scikit-learn and to offer an API that is as similar to scikit-learn's as possible.

Compiling and installing

Get NumPy >=1.6, SciPy >=0.11, Cython >=0.20.2 and a recent version of scikit-learn. Then issue:

python setup.py install

to install seqlearn.

If you want to use seqlearn from its source directory without installing, you have to compile first:

python setup.py build_ext --inplace

Getting started

The easiest way to start using seqlearn is to fetch a dataset in CoNLL 2000 format. Define a task-specific feature extraction function, e.g.:

>>> def features(sequence, i):
...     yield "word=" + sequence[i].lower()
...     if sequence[i].isupper():
...         yield "Uppercase"
...
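To see what this extractor produces, it can be run on a toy sentence outside of seqlearn. The sentence below is made up for illustration; the function body is the one shown above.

```python
def features(sequence, i):
    # Same extractor as in the README: each yielded string is one feature.
    yield "word=" + sequence[i].lower()
    if sequence[i].isupper():
        yield "Uppercase"

# Hypothetical toy sentence, not taken from any dataset.
sentence = ["NASA", "launched", "a", "rocket"]
per_token = [list(features(sentence, i)) for i in range(len(sentence))]
for token, feats in zip(sentence, per_token):
    print(token, feats)
```

Note that `"Uppercase"` fires only for tokens written entirely in capitals, because `str.isupper()` checks every cased character in the token.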

Load the training file, say train.txt:

>>> from seqlearn.datasets import load_conll
>>> X_train, y_train, lengths_train = load_conll("train.txt", features)

Train a model:

>>> from seqlearn.perceptron import StructuredPerceptron
>>> clf = StructuredPerceptron()
>>> clf.fit(X_train, y_train, lengths_train)

Check how well you did on a validation set, say validation.txt:

>>> X_test, y_test, lengths_test = load_conll("validation.txt", features)
>>> from seqlearn.evaluation import bio_f_score
>>> y_pred = clf.predict(X_test, lengths_test)
>>> print(bio_f_score(y_test, y_pred))

For more information, see the documentation.

seqlearn's Issues

Using a `requirements.txt` file

I was wondering if it would be possible to include a requirements.txt in the root directory with the prerequisites?

I am currently making a one-line installer for a (private) module that downloads and installs seqlearn directly from GitHub using pip install -e git+https://github.com/larsmans/seqlearn.git#egg=seqlearn. But this won't work in new virtual environments without first installing Cython manually.

I know this has limited usefulness, but it would simplify things for me a notch.

Awesome project you guys have got going here BTW.

How to use word embedding as a feature ?

Is there any way to pass word vectors extracted from word2vec as features? Or do I have to use or change something else in place of the existing feature hashing? Thanks!

Dataset loading

I'm trying to train an HMM that classifies lines in an HTML document as belonging to a certain zone or class (e.g. body, header, footer, title, etc.). Thus, each sequence is a document and each sample is a line. For each line, I compute a number of floating-point features.

What is the correct input format for this data in order to train a model with seqlearn? I'm having trouble understanding how to format the data from the documentation.
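The Getting started example passes three things to `fit`: a feature matrix with one row per sample, a label per sample, and a `lengths` array with the number of samples in each sequence. A minimal sketch of building that layout from per-document data follows. The `documents` structure and the helper `flatten_documents` are assumptions made for this example, not part of seqlearn.

```python
def flatten_documents(documents):
    """Concatenate per-document lines into one sample list plus a lengths list.

    `documents` is a list of (line_features, line_labels) pairs, one pair per
    document. This layout is an assumption made for the example.
    """
    X, y, lengths = [], [], []
    for line_features, line_labels in documents:
        if len(line_features) != len(line_labels):
            raise ValueError("each line needs exactly one label")
        X.extend(line_features)          # one row per line
        y.extend(line_labels)            # one label per line
        lengths.append(len(line_labels)) # one entry per document
    return X, y, lengths

# Two made-up documents with two float features per line.
documents = [
    ([[0.1, 12.0], [0.9, 3.0], [0.2, 40.0]], ["title", "header", "body"]),
    ([[0.8, 5.0], [0.3, 55.0]], ["header", "body"]),
]
X, y, lengths = flatten_documents(documents)
print(len(X), lengths)
```

The lists would still need converting to arrays (for instance with `numpy.asarray`) before being passed to an estimator. Whether a particular seqlearn model handles float-valued features well is a separate question that this sketch does not answer.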

Alternatives for load_conll

Hello,

I use your toolkit for sequence learning. I both train and predict using data from CSV/TSV files.
However, it would be useful to make predictions directly from code (i.e., POS-tagging a sentence passed as a function argument, not a sentence read from a file).
Does seqlearn only accept data from files? Would it be possible to implement an alternative that loads Pandas dataframes?
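As a rough illustration of what a table-based loader has to do, the sketch below groups in-memory rows into sequences and computes the per-sequence lengths. The row layout `(sentence_id, token, tag)` and the helper `rows_to_sequences` are hypothetical; nothing here is seqlearn API.

```python
from itertools import groupby
from operator import itemgetter

def rows_to_sequences(rows):
    """Group (sentence_id, token, tag) rows into token/tag lists and lengths.

    Rows of the same sentence are assumed to be contiguous and in order.
    """
    tokens, tags, lengths = [], [], []
    for _, group in groupby(rows, key=itemgetter(0)):
        group = list(group)
        tokens.extend(row[1] for row in group)
        tags.extend(row[2] for row in group)
        lengths.append(len(group))
    return tokens, tags, lengths

# Made-up rows, shaped like a table with one token per row.
rows = [
    (0, "The", "DT"), (0, "cat", "NN"), (0, "sleeps", "VBZ"),
    (1, "Dogs", "NNS"), (1, "bark", "VBP"),
]
tokens, tags, lengths = rows_to_sequences(rows)
print(tokens, lengths)
```

With Pandas, rows of this shape could be obtained from `df.itertuples(index=False)`, assuming the dataframe has exactly these three columns in this order. The tokens would still have to be turned into a feature matrix before training or prediction.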

Thanks,
Daria

How to use seqlearn in Anaconda?

Can we somehow install 'seqlearn' in Anaconda so that it can be used in Spyder? I searched, but it was not available in the anaconda.org database. Any help would be great.

Chahat

[Question] Training Algorithm for hmm

Hi,
I just have a question and would be happy if you could help.
Is there a fast way to use different training algorithms for the HMM here? (I need something like hmmlearn's _BaseHMM in a supervised setting.)
Thank you!

How to use SequenceKFold?

SequenceKFold returns (train, test) indices, but the "fit" methods also require matching "lengths" arrays, so it is not clear how to use SequenceKFold for cross-validation.
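The underlying requirement is that every fold carries both sample indices and the lengths of the sequences those indices cover. The sketch below shows one way to produce both together. `sequence_folds` is a hypothetical helper written for this illustration; it is not seqlearn's SequenceKFold and makes no claim about what that class returns.

```python
def sequence_folds(lengths, n_folds):
    """Yield (train_idx, train_lengths, test_idx, test_lengths) per fold.

    Whole sequences are assigned to folds round-robin, so no sequence is
    split between train and test. Hypothetical helper, not seqlearn code.
    """
    # Start offset of every sequence in the concatenated sample array.
    starts = [0]
    for n in lengths[:-1]:
        starts.append(starts[-1] + n)

    for fold in range(n_folds):
        train_idx, train_len, test_idx, test_len = [], [], [], []
        for seq, (start, n) in enumerate(zip(starts, lengths)):
            idx = range(start, start + n)
            if seq % n_folds == fold:
                test_idx.extend(idx)
                test_len.append(n)
            else:
                train_idx.extend(idx)
                train_len.append(n)
        yield train_idx, train_len, test_idx, test_len

lengths = [3, 2, 4, 1]
folds = list(sequence_folds(lengths, n_folds=2))
for train_idx, train_len, test_idx, test_len in folds:
    print(train_len, test_len)
```

With arrays, each fold would then be used along the lines of `clf.fit(X[train_idx], y[train_idx], train_len)` followed by `clf.predict(X[test_idx], test_len)`, matching the `fit` and `predict` calls in the Getting started section.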

Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

I am trying to enable trans_feature to get a more sophisticated model in seqlearn, beyond the unary term and the pairwise term (transition matrix). But I encountered the exit error in the title. I am using my own dataset; there is no problem when I test with the provided CoNLL dataset. Have you ever encountered similar errors? The line where the exit happens is safe_add(A, B), which calls some compiled C code. I am not sure whether that is the cause or how to solve it. Thank you very much.

ValueError: Buffer dtype mismatch, expected 'npy_intp' but got 'int'

I am using the MultinomialHMM class of seqlearn and the code crashes with error

ValueError: Buffer dtype mismatch, expected 'npy_intp' but got 'int'

at the function count_trans() inside the fit() function. I am trying to change the data type of the y array from int to numpy.npy_intp but this type does not exist in numpy (np).

How can I fix this?

installation error: about vc++10.0

Hi, I am not sure whether this is the proper place for my question. When I install seqlearn, it shows the error "Microsoft Visual C++ 10.0 is required (Unable to find vcvarsall.bat)". But I have VC++ 2013 and VC++ 2008 installed on my laptop. I have searched many forums, but I could not make sense of what they say about it. How can I solve this?
Thanks!

transition feature

I tried enabling transition features in perceptron learning. After reading the source code, I found that the implementation is inconsistent with the comments of make_trans_matrix(y, n_classes, dtype=np.float64) in transmatrix.py. As I understand it, relying only on the coefficients w and the label count matrix can easily lead to a label bias problem, because in real cases that use BIO tagging the 'O' label is predominant both in the feature space distribution and in the label count matrix. The transition features assume that the feature distribution at the position preceding a given label is consistent and follows a pattern, so they should resolve some label bias issues. I am not sure whether my interpretation of the transition features is correct. I have also modified the corresponding code. If you like, I will submit a merge request.

Installation Error on Python 3.5

When I tried to install seqlearn on my MacBook, the following error occurred:

ValueError: 'seqlearn/_decode/bestfirst.pyx' doesn't match any files

But I do have the bestfirst.pyx file in my folder.

seqlearn not working since new version of sklearn

In the newer version of sklearn that I use, six is no longer part of sklearn but a separate library. As a result, seqlearn raises an error when I try to run a perceptron:

/usr/local/lib/python3.8/dist-packages/seqlearn/perceptron.py in <module>
      8 import numpy as np
      9 from scipy.sparse import csc_matrix
---> 10 from sklearn.externals import six
     11 
     12 from .base import BaseSequenceClassifier

ImportError: cannot import name 'six' from 'sklearn.externals' (/usr/local/lib/python3.8/dist-packages/sklearn/externals/__init__.py)

will it work for multivariate time series?

Great code, thanks.
Could you clarify whether it will work for multivariate time series:
1. where all values are continuous values,
2. or even where the values are a mixture of continuous and categorical values, for example 2 dimensions with continuous values and 3 dimensions with categorical values?

   color   weight  gender  height  age
1  black   56      m       160     34
2  white   77      f       170     54
3  yellow  87      m       167     43
4  white   55      m       198     72
5  white   88      f       176     32
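Independent of which model is used, rows like these have to become numeric vectors first. The sketch below one-hot encodes the categorical columns and passes the numeric ones through as floats. `encode_rows` is a hypothetical helper, and the sketch does not settle whether any particular seqlearn estimator is suited to such features.

```python
def encode_rows(rows, categorical_columns):
    """One-hot encode categorical columns; pass numeric columns through as floats."""
    # Sorted category values per categorical column give a stable layout.
    categories = {
        col: sorted({row[col] for row in rows}) for col in categorical_columns
    }
    encoded = []
    for row in rows:
        vector = []
        for col, value in enumerate(row):
            if col in categories:
                vector.extend(1.0 if value == c else 0.0 for c in categories[col])
            else:
                vector.append(float(value))
        encoded.append(vector)
    return encoded

# The five rows from the table: color, weight, gender, height, age.
rows = [
    ("black", 56, "m", 160, 34),
    ("white", 77, "f", 170, 54),
    ("yellow", 87, "m", 167, 43),
    ("white", 55, "m", 198, 72),
    ("white", 88, "f", 176, 32),
]
X = encode_rows(rows, categorical_columns=[0, 2])
print(X[0])
```

Here color expands to three columns (black, white, yellow) and gender to two (f, m), so each row becomes a vector of eight floats.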

partial_fit method?

I consider this a point or matter of discussion, debate, or dispute: the implementation of a partial_fit method for incremental learning.

How to use this for sequence of events.?

I have data consisting of sequences of events, where each sequence maps to one class. For example, my data looks like this:

X['Event1', 'Event2', 'Event3'] ---> Y[C1]
X['Event2', 'Event1', 'Event5', 'Event2', 'Event1', 'Event5'] ---> Y[C2]
..

Also, is there any possibility of extending this to regression over sequences? Instead of classes, Y could be a regression target too!

Sklearn compatibility

Hello,
I was hoping to reuse the model selection routines of the scikit-learn API (grid search CV and the like), but it appears that neither the HMM nor the StructuredPerceptron is considered a valid estimator (http://scikit-learn.org/stable/developers/contributing.html#rolling-your-own-estimator). From reading the source code, everything seems to abide by scikit-learn's rules, but if I try:

from seqlearn.perceptron import StructuredPerceptron
from sklearn.utils.estimator_checks import check_estimator
model = StructuredPerceptron()
check_estimator(model)

I get:
AttributeError: 'StructuredPerceptron' object has no attribute 'name'

Any clue on how to fix this compatibility issue?
Thanks a lot in advance,
Enrico

pypi?

larsmans, would you be willing to push this to pypi so it is easy to include as a package dependency in other tools? I am happy to help with this if you'd like.

Installation error: command 'clang' failed with exit status 1

The last part of the error message is:

seqlearn/_decode/bestfirst.c:600:10: fatal error: 'numpy/arrayobject.h' file not found
#include "numpy/arrayobject.h"
         ^~~~~~~~~~~~~~~~~~~~~
1 error generated.
error: command 'clang' failed with exit status 1
----------------------------------------

ERROR: Command errored out with exit status 1: /usr/local/opt/python/bin/python3.7 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/4f/zswmtvpn44gfvcz53m_c24400000gn/T/pip-install-6hdkoh2q/seqlearn/setup.py'"'"'; file='"'"'/private/var/folders/4f/zswmtvpn44gfvcz53m_c24400000gn/T/pip-install-6hdkoh2q/seqlearn/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /private/var/folders/4f/zswmtvpn44gfvcz53m_c24400000gn/T/pip-record-gr6jqkyt/install-record.txt --single-version-externally-managed --compile --install-headers /usr/local/include/python3.7m/seqlearn Check the logs for full command output.

load_conll, unicode support

Hi, loading data in conll format fails on my custom dataset with non-ascii characters. So when I read data with encoding 'utf-8' set, I get corresponding errors here:

  File "/usr/local/lib/python2.7/dist-packages/seqlearn/datasets.py", line 65, in <genexpr>
    lines = (str.split(line) for line in  f)
TypeError: descriptor 'split' requires a 'str' object but received a 'unicode'

def _conll_sequences(f, features, labels, lengths, split):
    # Divide input into blocks of empty and non-empty lines.
    lines = (str.strip(line) for line in  f)

Everything works perfectly when I modify the last line like this:

 lines = (line.strip() for line in  f)

Is there any reason why such a fix would be unwanted?
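The difference is that `line.strip()` dispatches on whatever text type `line` has, while `str.strip(line)` insists on a `str`. A minimal sketch of blank-line-separated block parsing written with method calls is shown below. `conll_blocks` is a hypothetical stand-in for illustration, not the code in seqlearn/datasets.py.

```python
from itertools import groupby

def conll_blocks(lines):
    """Yield one list of whitespace-split rows per blank-line-separated block."""
    # Calling the method on each line works for any text type,
    # unlike the unbound str.strip(line) form.
    stripped = (line.strip() for line in lines)
    for nonempty, group in groupby(stripped, bool):
        if nonempty:
            yield [row.split() for row in group]

# Made-up sample with non-ASCII tokens (Zürich, wächst, Ça), written as escapes.
sample = [
    "Z\u00fcrich NNP B-NP\n",
    "w\u00e4chst VBZ B-VP\n",
    "\n",
    "\u00c7a NN B-NP\n",
    "marche VBZ B-VP\n",
]
blocks = list(conll_blocks(sample))
lengths = [len(block) for block in blocks]
print(lengths)
```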
