Comments (2)
Apologies for the answering delay, and thanks for using skweak!
Your code is mostly correct, but after some debugging I found a problem in the lambda functions associated with each FunctionAnnotator. As it turns out, the lambda functions seem to struggle with the references to the clf objects (which are overwritten in the loop), which is why the final result was always identical to that of the last model.
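A likely explanation for this behaviour is Python's late binding of closures: a lambda defined inside a loop looks up the loop variable when it is called, not when it is created. The sketch below reproduces the effect with plain string functions standing in for the trained classifiers (the helper names are hypothetical and not part of skweak):

```python
from functools import partial

# Stand-ins for trained models: each "model" is just a callable on text.
models = [str.upper, str.lower, str.title]

def make_predictors_buggy(models):
    predictors = []
    for model in models:
        # `model` is looked up when the lambda is called, i.e. after the
        # loop has finished, so every lambda ends up using the last model.
        predictors.append(lambda text: model(text))
    return predictors

def _predict(model, text):
    return model(text)

def make_predictors_fixed(models):
    # partial() binds the current model at creation time.
    return [partial(_predict, model) for model in models]

buggy_out = [f("hello World") for f in make_predictors_buggy(models)]
fixed_out = [f("hello World") for f in make_predictors_fixed(models)]
print(buggy_out)  # all three entries come from the last model
print(fixed_out)  # one distinct result per model
```

Storing the model on an object, as the wrapper class in the code below does, avoids the problem in the same way as `partial`.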
Here is code that should work:
import spacy
import skweak # needed for the skweak.base references below
from skweak import heuristics, aggregation
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
from sklearn.linear_model import LogisticRegressionCV,PassiveAggressiveClassifier
from sklearn.svm import NuSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score,accuracy_score
from afinn import Afinn
import random
from tensorflow.keras.datasets import imdb
nlp = spacy.load("en_core_web_sm")
afinn = Afinn(language='en')
# get IMDB sentiment data
(training_data, training_targets), (testing_data, testing_targets) = imdb.load_data(num_words=10000)
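# imdb.load_data() shifts every word index by 3 (0-2 are reserved for padding, start and unknown tokens)
ind2text = {x+3:k for k,x in imdb.get_word_index().items()}
# convert to text
n_train = 500 # train samples
n_test = 200 # test samples
# reserved indices are not in ind2text and are skipped
get_text = lambda data: [" ".join([ind2text[x] for x in d if x in ind2text]) for d in data]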
X_train,Y_train = get_text(training_data[0:n_train]),list(training_targets[0:n_train])
X_test,Y_test = get_text(testing_data[0:n_test]),list(testing_targets[0:n_test])
# create some whole-text classical classifiers
my_classifiers=[]
my_classifiers.append(Pipeline([
    ('vect', CountVectorizer(ngram_range=(1,2),max_features=500)),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegressionCV(penalty="l1",cv=5,solver='liblinear')),
]))
my_classifiers.append(Pipeline([
    ('vect', CountVectorizer(ngram_range=(1,2),max_features=500)),
    ('tfidf', TfidfTransformer()),
    ('clf', NuSVC()),
]))
my_classifiers.append(Pipeline([
    ('vect', CountVectorizer(ngram_range=(1,2),max_features=500)),
    ('tfidf', TfidfTransformer()),
    ('clf', PassiveAggressiveClassifier()),
]))
my_classifiers.append(Pipeline([
    ('vect', CountVectorizer(ngram_range=(1,2),max_features=500)),
    ('tfidf', TfidfTransformer(use_idf=False)),
    ('clf', RandomForestClassifier()),
]))
all_annotators = skweak.base.CombinedAnnotator() # labeling functions
# Annotator based on a sklearn classification pipeline
class SklearnAnnotator(skweak.base.SpanAnnotator):
    def __init__(self, name, trained_model):
        skweak.base.SpanAnnotator.__init__(self, name)
        self.trained_model = trained_model
    def find_spans(self, doc):
        yield 0, len(doc), str(self.trained_model.predict([doc.text])[0])
# add afinn annotator (span over whole document)
def labeling_fun_afinn(x):
    yield 0, len(x),('0' if afinn.score(x.text)<=0 else '1')
all_annotators.add_annotator(heuristics.FunctionAnnotator("afinn", labeling_fun_afinn))
# train each classifier and wrap it in its own annotator (span over whole document)
for i,clf in enumerate(my_classifiers):
    clf.fit(X_train,Y_train)
    all_annotators.add_annotator(SklearnAnnotator("classifier_%i" % i, clf))
# obtain processed docs
train_docs = list(all_annotators.pipe(nlp.pipe(X_train)))
test_docs = list(all_annotators.pipe(nlp.pipe(X_test)))
# create HMM aggregator
hmm = aggregation.HMM("hmm", ['0','1'],sequence_labelling=False)
# fit and annotate train data
hmm.fit(train_docs)
# apply model to test data (works as a "predict" function?)
test_docs = list(hmm.pipe(test_docs))
# get predicted classes
hmm_preds = [int(doc.spans["hmm"][0].label_) for doc in test_docs]
afinn_preds = [int(afinn.score(doc.text) > 0) for doc in test_docs] # same threshold as labeling_fun_afinn
print("\nResults")
print(" skweak HMM: F1=%f, accuracy=%f" % (f1_score(Y_test,hmm_preds),accuracy_score(Y_test,hmm_preds)))
print(" afinn: F1=%f, accuracy=%f" % (f1_score(Y_test,afinn_preds), accuracy_score(Y_test,afinn_preds)))
for i,clf in enumerate(my_classifiers):
    y = clf.predict(X_test)
    print(" classifier %i: F1=%f, accuracy=%f" % (i+1,f1_score(Y_test,y),accuracy_score(Y_test,y)))
So, the main difference is that I created a SklearnAnnotator class that works as a wrapper for the sklearn pipeline, instead of relying on lambda functions. Also note that I shortened the code by creating a CombinedAnnotator that includes all 5 annotators, and used pipe to run the annotators on all documents. It's much more efficient than looping over each document and annotator one by one.
It's important to point out that the initial objective of skweak is to aggregate weak/sparse labels, in particular when labelling functions may "abstain" and not give any prediction. You are here using skweak for a slightly different goal (as you also mentioned), namely to perform some type of ensemble learning where you want to combine the results of multiple classifiers. There are much more sophisticated approaches to ensemble learning / mixtures of experts that I would expect to give better results on this type of problem compared to the HMM model of skweak.
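To make the idea of "abstaining" labelling functions concrete, here is a minimal sketch in plain Python. It uses a simple majority vote as a stand-in aggregator; this is not skweak's HMM, and the `ABSTAIN` marker and `majority_vote` helper are hypothetical names chosen for illustration:

```python
from collections import Counter

ABSTAIN = None  # hypothetical marker for "this labelling function has no opinion"

def majority_vote(votes, default=ABSTAIN):
    """Aggregate the votes for one document, ignoring abstentions."""
    counts = Counter(v for v in votes if v is not ABSTAIN)
    if not counts:
        return default  # every labelling function abstained
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return default  # tie between the top labels
    return ranked[0][0]

# One row per document, one column per labelling function.
votes_per_doc = [
    ["1", ABSTAIN, "1"],          # two functions agree, one abstains
    [ABSTAIN, ABSTAIN, "0"],      # a single sparse label is enough
    ["0", "1", ABSTAIN],          # conflicting labels, no majority
    [ABSTAIN, ABSTAIN, ABSTAIN],  # nothing to aggregate
]
aggregated = [majority_vote(votes) for votes in votes_per_doc]
print(aggregated)
```

In the sentiment example above no annotator ever abstains, since every classifier labels every document, which is why the setup is closer to ensemble learning than to weak supervision.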
It's also why there is no predict method: the goal of skweak is to provide a way to easily apply different annotations to text collections and then aggregate those to get a single annotation layer. You can then apply this aggregation on new documents (either by using __call__ on individual docs, or pipe on iterables), but it's still an aggregation of existing results, not a prediction. To get a predictive model, what you need to do is to train a machine learning model (anything you want, e.g. a sklearn model) on the aggregated data.
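As a rough sketch of that last step, the snippet below trains a sklearn pipeline on aggregated labels. The texts and labels are toy stand-ins invented for illustration; in the code above they would presumably come from doc.text and doc.spans["hmm"][0].label_ for each aggregated training document. The choice of LogisticRegression is arbitrary.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-ins for the documents and their aggregated labels (assumed data).
texts = [
    "great movie loved it",
    "wonderful acting great story",
    "loved the wonderful cast",
    "terrible movie hated it",
    "awful acting boring story",
    "hated the boring plot",
]
aggregated_labels = ["1", "1", "1", "0", "0", "0"]  # string labels, as in the HMM above

# Train an ordinary supervised model on the aggregated annotation layer.
model = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", LogisticRegression()),
])
model.fit(texts, [int(label) for label in aggregated_labels])

# Unlike the aggregation step, this model can predict on raw, unannotated text.
predictions = model.predict(["great wonderful loved", "terrible boring awful"])
print(predictions)
```

The resulting model no longer needs the individual annotators at prediction time, which is the practical difference from calling pipe on new documents.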
Thank you! I now have much better understanding of skweak and how to use it properly.