Comments (2)

plison avatar plison commented on June 12, 2024

Apologies for the answering delay, and thanks for using skweak!

Your code is mostly correct, but after some debugging I found a problem in the lambda functions associated with each FunctionAnnotator. The lambdas refer to the clf variable, which is overwritten on every iteration of the loop. Most likely because Python only looks up that variable when the lambda is actually called, every annotator ended up using the last model, which is why the final results were always identical to those of the last model.
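To illustrate the pitfall in isolation, here is a minimal sketch in plain Python. The strings in `models` are hypothetical stand-ins for the trained classifiers and nothing here uses skweak. It assumes the original lambdas captured the loop variable directly, which is the usual cause of this symptom.

```python
# Hypothetical stand-ins for the trained models (plain strings, not classifiers).
models = ["model_a", "model_b", "model_c"]

# Pitfall: the lambda looks up `m` when it is called, not when it is created,
# so once the loop has finished every lambda sees the last value of `m`.
late_bound = []
for m in models:
    late_bound.append(lambda doc: m)

# One common workaround: bind the current value as a default argument.
early_bound = []
for m in models:
    early_bound.append(lambda doc, m=m: m)

late_results = [f("some text") for f in late_bound]
early_results = [f("some text") for f in early_bound]
print(late_results)   # ['model_c', 'model_c', 'model_c']
print(early_results)  # ['model_a', 'model_b', 'model_c']
```

Storing the model on an object, as a wrapper class does, avoids the problem for the same reason: each instance keeps its own reference instead of sharing the loop variable.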

Here is code that should work:

import spacy
import skweak.base  # binds the name `skweak`, needed for skweak.base.* below
from skweak import heuristics, aggregation
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
from sklearn.linear_model import LogisticRegressionCV,PassiveAggressiveClassifier
from sklearn.svm import NuSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score,accuracy_score
from afinn import Afinn
import random
from tensorflow.keras.datasets import imdb

nlp = spacy.load("en_core_web_sm")
afinn = Afinn(language='en')

# get IMDB sentiment data
(training_data, training_targets), (testing_data, testing_targets) = imdb.load_data(num_words=10000)
ind2text = {x:k for k,x in imdb.get_word_index().items()}
# convert to text
n_train = 500 # train samples
n_test = 200 # test samples
get_text = lambda data: [" ".join([ind2text[x] for x in d]) for d in data]
X_train,Y_train = get_text(training_data[0:n_train]),list(training_targets[0:n_train])
X_test,Y_test = get_text(testing_data[0:n_test]),list(testing_targets[0:n_test])

# create some whole-text classical classifiers
my_classifiers=[]
my_classifiers.append(Pipeline([
     ('vect', CountVectorizer(ngram_range=(1,2),max_features=500)),
     ('tfidf', TfidfTransformer()),
     ('clf', LogisticRegressionCV(penalty="l1",cv=5,solver='liblinear')),
]))
my_classifiers.append(Pipeline([
     ('vect', CountVectorizer(ngram_range=(1,2),max_features=500)),
     ('tfidf', TfidfTransformer()),
     ('clf', NuSVC()),
]))
my_classifiers.append(Pipeline([
     ('vect', CountVectorizer(ngram_range=(1,2),max_features=500)),
     ('tfidf', TfidfTransformer()),
     ('clf', PassiveAggressiveClassifier()),
]))
my_classifiers.append(Pipeline([
     ('vect', CountVectorizer(ngram_range=(1,2),max_features=500)),
     ('tfidf', TfidfTransformer(use_idf=False)),
     ('clf', RandomForestClassifier()),
]))

all_annotators =skweak.base.CombinedAnnotator() # labeling functions

# Annotator based on a sklearn classification pipeline
class SklearnAnnotator(skweak.base.SpanAnnotator):
    def __init__(self, name, trained_model):
        skweak.base.SpanAnnotator.__init__(self, name)
        self.trained_model = trained_model
        
    def find_spans(self, doc):
        yield 0, len(doc), str(self.trained_model.predict([doc.text])[0])
        
# add afinn annotator (span over whole document)
def labeling_fun_afinn(x):
    yield 0, len(x),('0' if afinn.score(x.text)<=0 else '1')
all_annotators.add_annotator(heuristics.FunctionAnnotator("afinn", labeling_fun_afinn))

# fit each classifier and wrap it in a SklearnAnnotator (span over whole document)
for i,clf in enumerate(my_classifiers):
    clf.fit(X_train,Y_train)
    all_annotators.add_annotator(SklearnAnnotator("classifier_%i" % i, clf))

# obtain processed docs
train_docs = list(all_annotators.pipe(nlp.pipe(X_train)))
test_docs = list(all_annotators.pipe(nlp.pipe(X_test)))

# create HMM aggregator
hmm = aggregation.HMM("hmm", ['0','1'],sequence_labelling=False)

# fit and annotate train data
hmm.fit(train_docs)
# apply the aggregation model to the test data (this aggregates the annotators' outputs; it is not a prediction from raw text)
test_docs = list(hmm.pipe(test_docs))

# get predicted classes
hmm_preds = [int(doc.spans["hmm"][0].label_) for doc in test_docs]
# same threshold as labeling_fun_afinn: a score <= 0 maps to class 0
afinn_preds = [int(afinn.score(doc.text) > 0) for doc in test_docs]

print("\nResults")
print(" skweak HMM: F1=%f, accuracy=%f" % (f1_score(Y_test,hmm_preds),accuracy_score(Y_test,hmm_preds)))
print(" afinn: F1=%f, accuracy=%f" % (f1_score(Y_test,afinn_preds), accuracy_score(Y_test,afinn_preds)))
for i,clf in enumerate(my_classifiers):
    y = clf.predict(X_test)
    print(" classifier %i: F1=%f, accuracy=%f" % (i+1,f1_score(Y_test,y),accuracy_score(Y_test,y)))

So, the main difference is that I created a SklearnAnnotator class that works as a wrapper for the sklearn pipeline, instead of relying on lambda functions. Also note that I shortened the code by creating a CombinedAnnotator that includes all five annotators, and used pipe to run the annotators on all documents. It's much more efficient than looping over each document and annotator one by one.

It's important to point out that the initial objective of skweak is to aggregate weak/sparse labels, in particular when labelling functions may "abstain" and not give any prediction. You are here using skweak for a slightly different goal (as you also mentioned), namely to perform a kind of ensemble learning in which you combine the results of multiple classifiers. There are much more sophisticated approaches to ensemble learning / mixtures of experts that I would expect to give better results on this type of problem than the HMM model of skweak.
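To make the idea of abstaining labelling functions concrete, here is a minimal sketch in plain Python. It uses a simple majority vote, which is not what skweak's HMM does. The `ABSTAIN` marker, the `majority_vote` helper and the vote lists are all hypothetical and only serve as illustration.

```python
from collections import Counter

# Hypothetical marker meaning "this labelling function gives no prediction".
ABSTAIN = None

def majority_vote(votes, default=ABSTAIN):
    """Return the most common non-abstaining label, or `default` if all abstain."""
    counted = Counter(v for v in votes if v is not ABSTAIN)
    if not counted:
        return default
    return counted.most_common(1)[0][0]

# One list of votes per document, one entry per labelling function.
votes_per_doc = [
    ["1", ABSTAIN, "1", "0"],              # three functions vote, one abstains
    [ABSTAIN, ABSTAIN, "0", ABSTAIN],      # sparse: a single vote decides
    [ABSTAIN, ABSTAIN, ABSTAIN, ABSTAIN],  # nobody votes, so no label at all
]
aggregated = [majority_vote(v) for v in votes_per_doc]
print(aggregated)  # ['1', '0', None]
```

In the ensemble setting described above, every classifier always votes, so the abstention case never arises; that is the sense in which this use differs from the weak supervision setting.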

It's also why there is no predict method -- the goal of skweak is to provide a way to easily apply different annotations to text collections and then aggregate those to get a single annotation layer. You can then apply this aggregation on new documents (either by using __call__ on individual docs, or pipe on iterables), but it's still an aggregation of existing results, not a prediction. To get a predictive model, what you need to do is to train a machine learning model (anything you want, e.g. a sklearn model) on the aggregated data.
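As a rough sketch of that last step, the snippet below trains a sklearn pipeline on aggregated labels and then predicts on unseen texts. The texts and labels are hypothetical stand-ins. In practice the labels would be read from the aggregated span layer (for instance doc.spans["hmm"], as the code above does for the test documents), and the choice of CountVectorizer with LogisticRegression is only an example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical stand-ins: raw texts and the labels an aggregator assigned to them.
texts = [
    "a wonderful and moving film",
    "great acting and a great story",
    "a dull and boring film",
    "terrible acting and a boring story",
]
aggregated_labels = [1, 1, 0, 0]

# Train an ordinary supervised model on the aggregated labels.
model = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", LogisticRegression()),
])
model.fit(texts, aggregated_labels)

# Unlike the aggregator, this model predicts directly from raw text.
new_texts = ["a wonderful story", "boring and terrible"]
predictions = [int(p) for p in model.predict(new_texts)]
print(predictions)
```

The trained model no longer needs the labelling functions at inference time, which is the practical difference between aggregating existing annotations and predicting.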

kauttoj avatar kauttoj commented on June 12, 2024

Thank you! I now have much better understanding of skweak and how to use it properly.
