First, thanks for this great tool. I'm trying to learn skweak for full document classi

Simple example of full document classification and questions about skweak (closed)

norskregnesentral commented on June 12, 2024

Comments (2)

plison commented on June 12, 2024

Apologies for the answering delay, and thanks for using skweak!

Your code is mostly correct, but after debugging it I found a problem in the lambda functions associated with each FunctionAnnotator. As it turns out, a lambda does not store the clf object it was created with; it looks up the clf variable when it is called, and that variable is overwritten on each pass through the loop. By the time the annotators run, every lambda refers to the last model, which is why the final result was always identical to the last model.
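To illustrate the lambda problem in isolation, here is a minimal sketch in plain Python. It assumes the original loop created one lambda per classifier; the strings in `models` are hypothetical stand-ins for fitted pipelines, and the default-argument trick shown is just one common workaround, not something skweak requires.

```python
# Hypothetical stand-ins for fitted classifiers.
models = ["model_a", "model_b", "model_c"]

# Late binding: each lambda looks up `model` when it is called,
# so after the loop all of them see the last value.
late_bound = []
for model in models:
    late_bound.append(lambda text: (model, text))

# One common workaround: bind the current value through a default argument.
early_bound = []
for model in models:
    early_bound.append(lambda text, model=model: (model, text))

late_results = [f("doc")[0] for f in late_bound]
early_results = [f("doc")[0] for f in early_bound]
print(late_results)   # ['model_c', 'model_c', 'model_c']
print(early_results)  # ['model_a', 'model_b', 'model_c']
```

A small wrapper class that stores the model as an attribute avoids the issue altogether.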

Here is code that should work:

import spacy
import skweak.base  # binds the name `skweak`, needed for skweak.base.* below
from skweak import heuristics, aggregation
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
from sklearn.linear_model import LogisticRegressionCV,PassiveAggressiveClassifier
from sklearn.svm import NuSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score,accuracy_score
from afinn import Afinn
import random
from tensorflow.keras.datasets import imdb

nlp = spacy.load("en_core_web_sm")
afinn = Afinn(language='en')

# get IMDB sentiment data
(training_data, training_targets), (testing_data, testing_targets) = imdb.load_data(num_words=10000)
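# index -> word; imdb.load_data shifts every word index by 3 (index_from=3)
# and reserves 0 (padding), 1 (start) and 2 (out-of-vocabulary)
ind2text = {x:k for k,x in imdb.get_word_index().items()}
# convert to text
n_train = 500 # train samples
n_test = 200 # test samples
get_text = lambda data: [" ".join([ind2text[x-3] for x in d if x>3]) for d in data]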
X_train,Y_train = get_text(training_data[0:n_train]),list(training_targets[0:n_train])
X_test,Y_test = get_text(testing_data[0:n_test]),list(testing_targets[0:n_test])

# create some whole-text classical classifiers
my_classifiers=[]
my_classifiers.append(Pipeline([
     ('vect', CountVectorizer(ngram_range=(1,2),max_features=500)),
     ('tfidf', TfidfTransformer()),
     ('clf', LogisticRegressionCV(penalty="l1",cv=5,solver='liblinear')),
]))
my_classifiers.append(Pipeline([
     ('vect', CountVectorizer(ngram_range=(1,2),max_features=500)),
     ('tfidf', TfidfTransformer()),
     ('clf', NuSVC()),
]))
my_classifiers.append(Pipeline([
     ('vect', CountVectorizer(ngram_range=(1,2),max_features=500)),
     ('tfidf', TfidfTransformer()),
     ('clf', PassiveAggressiveClassifier()),
]))
my_classifiers.append(Pipeline([
     ('vect', CountVectorizer(ngram_range=(1,2),max_features=500)),
     ('tfidf', TfidfTransformer(use_idf=False)),
     ('clf', RandomForestClassifier()),
]))

all_annotators =skweak.base.CombinedAnnotator() # labeling functions

# Annotator based on a sklearn classification pipeline
class SklearnAnnotator(skweak.base.SpanAnnotator):
    def __init__(self, name, trained_model):
        skweak.base.SpanAnnotator.__init__(self, name)
        self.trained_model = trained_model
        
    def find_spans(self, doc):
        yield 0, len(doc), str(self.trained_model.predict([doc.text])[0])
        
# add afinn annotator (span over whole document)
def labeling_fun_afinn(x):
    yield 0, len(x),('0' if afinn.score(x.text)<=0 else '1')
all_annotators.add_annotator(heuristics.FunctionAnnotator("afinn", labeling_fun_afinn))

# fit each classifier and wrap it in its own annotator (span over whole document)
for i,clf in enumerate(my_classifiers):
    clf.fit(X_train,Y_train)
    all_annotators.add_annotator(SklearnAnnotator("classifier_%i" % i, clf))

# obtain processed docs
train_docs = list(all_annotators.pipe(nlp.pipe(X_train)))
test_docs = list(all_annotators.pipe(nlp.pipe(X_test)))

# create HMM aggregator
hmm = aggregation.HMM("hmm", ['0','1'],sequence_labelling=False)

# fit and annotate train data
hmm.fit(train_docs)
# apply model to test data (works as a "predict" function?)
test_docs = list(hmm.pipe(test_docs))

# get predicted classes
hmm_preds = [int(doc.spans["hmm"][0].label_) for doc in test_docs]
# same threshold as labeling_fun_afinn: a score of 0 or below counts as class 0
afinn_preds = [int(afinn.score(doc.text) > 0) for doc in test_docs]

print("\nResults")
print(" skweak HMM: F1=%f, accuracy=%f" % (f1_score(Y_test,hmm_preds),accuracy_score(Y_test,hmm_preds)))
print(" afinn: F1=%f, accuracy=%f" % (f1_score(Y_test,afinn_preds), accuracy_score(Y_test,afinn_preds)))
for i,clf in enumerate(my_classifiers):
    y = clf.predict(X_test)
    print(" classifier %i: F1=%f, accuracy=%f" % (i+1,f1_score(Y_test,y),accuracy_score(Y_test,y)))

So, the main difference is that I created a SklearnAnnotator class that works as a wrapper for the sklearn pipeline, instead of relying on lambda functions. Also note that I shortened the code by creating a CombinedAnnotator that includes all 5 annotators, and used pipe to run the annotators on all documents. It's much more efficient than looping over each document and annotator one by one.

It's important to point out that the initial objective of skweak is to aggregate weak/sparse labels, in particular when labelling functions may "abstain" and not give any prediction. You are here using skweak for a slightly different goal (as you also mentioned), namely to perform some type of ensemble learning where you want to combine the results of multiple classifiers. And there are much more sophisticated approaches to ensemble learning / mixtures of experts that I would expect to give better results on this type of problem compared to the HMM model of skweak.
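To make the difference between sparse and dense votes concrete, here is a rough sketch using a plain majority vote. This is deliberately not the HMM that skweak uses, only a simplified stand-in; `ABSTAIN` and `majority_vote` are hypothetical names introduced for illustration.

```python
from collections import Counter

ABSTAIN = None  # hypothetical marker: the labelling function gives no label

def majority_vote(votes):
    """Return the most common non-abstaining label, or ABSTAIN if nobody voted."""
    counted = Counter(v for v in votes if v is not ABSTAIN)
    if not counted:
        return ABSTAIN
    best = max(counted.values())
    # Break ties by sorting the labels so the result is deterministic.
    return sorted(label for label, n in counted.items() if n == best)[0]

# Weak supervision setting: most labelling functions abstain on a given document.
sparse_votes = ["1", ABSTAIN, ABSTAIN, "1", "0"]
# Ensemble setting (as in your script): every classifier always votes.
dense_votes = ["1", "0", "0", "1", "0"]

sparse_label = majority_vote(sparse_votes)
dense_label = majority_vote(dense_votes)
empty_label = majority_vote([ABSTAIN, ABSTAIN])
print(sparse_label, dense_label, empty_label)  # 1 0 None
```

In the dense case nothing ever abstains, so the problem reduces to ordinary ensemble voting.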

It's also why there is no predict method -- the goal of skweak is to provide a way to easily apply different annotations to text collections and then aggregate those to get a single annotation layer. You can then apply this aggregation on new documents (either by using __call__ on individual docs, or pipe on iterables), but it's still an aggregation of existing results, not a prediction. To get a predictive model, what you need to do is to train a machine learning model (anything you want, e.g. a sklearn model) on the aggregated data.
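As a rough sketch of that last step, assuming the aggregated labels have already been read out of the documents: the `texts` and `aggregated_labels` below are made-up stand-ins (with the script above they might come from `doc.text` and `doc.spans["hmm"][0].label_` after running the aggregator over the training documents), and the choice of classifier is arbitrary.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up stand-ins for document texts and their aggregated labels.
texts = [
    "great film loved it",
    "wonderful acting and great story",
    "terrible plot hated it",
    "awful and boring film",
]
aggregated_labels = ["1", "1", "0", "0"]

# Any supervised model works here; it is trained on the aggregated labels
# as if they were gold labels.
final_model = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", LogisticRegression()),
])
final_model.fit(texts, aggregated_labels)

# Unlike the aggregator, this model needs no labelling functions at prediction time.
new_preds = [str(p) for p in final_model.predict(["great story", "awful boring plot"])]
print(new_preds)
```

The resulting model has a regular predict method and can be applied to raw text directly.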

kauttoj commented on June 12, 2024

Thank you! I now have a much better understanding of skweak and how to use it properly.
