martinthenext / eth_ml

Projects in Machine Learning: ETH team trying to use Mechanical Turk and active learning for solving a word-sense disambiguation task.
Currently, the bag-of-words feature takes the entire context of an ambiguous term as an argument. We need to implement a new feature that only accounts for the k words around the ambiguous term during vectorization.

Technical details:

- Use CountVectorizer to implement a bag-of-words window.
- OptionAwareNaiveBayesLeftRightCutoff needs to be modified to use all context and tested on Medline (see motivation here).
- The new classifier should be named OptionAwareNaiveBayesFullContextLeftRightCutoff and added to models.py.
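As a sketch of the windowed feature, assuming the annotation exposes its tokenized context and the position of the ambiguous term (the helper name and data layout here are illustrative, not the repo's API):

```python
from sklearn.feature_extraction.text import CountVectorizer

def window_context(tokens, term_index, k):
    """Return the k tokens on each side of the ambiguous term, joined as a string."""
    left = tokens[max(0, term_index - k):term_index]
    right = tokens[term_index + 1:term_index + 1 + k]
    return ' '.join(left + right)

# Illustrative usage: restrict each context to k=2 words around the term.
contexts = [
    (['the', 'patient', 'received', 'MS', 'treatment', 'today'], 3),
    (['MS', 'is', 'a', 'chronic', 'disease'], 0),
]
docs = [window_context(tokens, i, k=2) for tokens, i in contexts]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
```

The window is applied before vectorization, so CountVectorizer itself stays unchanged.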
Do cross-validation of the OptionAwareNaiveBayesLeftRight classifier on the MTurk data.
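A cross-validation run can be sketched with plain scikit-learn pieces; MultinomialNB stands in here for the repo's classifier, and the texts/labels are made up:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

# Stand-in data; in the repo, annotations would come from the MTurk tsv.
texts = ['ms multiple sclerosis', 'ms mass spectrometry', 'ms sclerosis chronic',
         'ms spectrometry sample', 'ms relapsing disease', 'ms ionization peak']
labels = np.array([0, 1, 0, 1, 0, 1])

X = CountVectorizer().fit_transform(texts)
scores = []
for train_idx, test_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(texts):
    clf = MultinomialNB().fit(X[train_idx], labels[train_idx])
    scores.append(clf.score(X[test_idx], labels[test_idx]))
mean_accuracy = float(np.mean(scores))
```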
To evaluate the accuracy of the best classifier (OptionAwareNaiveBayesFullContextLeftRightCutoff trained on Medline) on the "Gold standard", we need to measure its agreement with expert annotations. Expert annotations are stored in this file.

- Modify the load_ambiguous_annotations_labeled method from data.py so that it also works with data loaded from this tsv file.
- Create expert_classifer_agreement.py, where you use the function get_mturk_pickled_classifier_agreement from mturk_classifier_agreement.py to get the agreement between the pickled classifier (you need to load it with joblib.load) and the expert.
- expert_classifer_agreement.py should take a pickle of a classifier and an expert annotation tsv file as parameters and output two numbers:
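The agreement measure itself is just the fraction of matching labels; a minimal sketch (the real script would obtain the predictions from the classifier restored with `joblib.load(pickle_path)` and the expert labels from the tsv):

```python
def classifier_expert_agreement(predicted_labels, expert_labels):
    """Fraction of instances where the classifier's prediction matches the expert label."""
    assert len(predicted_labels) == len(expert_labels)
    matches = sum(p == e for p, e in zip(predicted_labels, expert_labels))
    return matches / len(expert_labels)
```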
In the dimensionality reduction section of the result summary there is a graph where each point is a feature set with certain parameters, and the coordinates are accuracy values on EMEA and Medline. Under the graph, in the re-evaluation section, you can find similar data for the new dataset. The task is to produce a new graph from this data. The graph should look like the old one: the Pareto front should be highlighted and color-coded according to features.
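Deciding which points to highlight is a standard Pareto-front computation (higher accuracy on both corpora is better); the accuracy pairs below are made up for illustration:

```python
def pareto_front(points):
    """Return indices of points not dominated by any other point.

    A point dominates another if it is >= in both coordinates
    and strictly greater in at least one.
    """
    front = []
    for i, (x, y) in enumerate(points):
        dominated = any(
            (x2 >= x and y2 >= y) and (x2 > x or y2 > y)
            for j, (x2, y2) in enumerate(points) if j != i
        )
        if not dominated:
            front.append(i)
    return front

# Hypothetical (EMEA accuracy, Medline accuracy) pairs, one per feature set.
points = [(0.61, 0.75), (0.63, 0.70), (0.58, 0.76), (0.60, 0.60)]
front = pareto_front(points)  # indices of points to highlight on the graph
```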
Fit 10 separate classifiers, one for every semantic group. For classification of an annotation instance:

The simplest classifier that outputs a probability is logistic regression. The resulting collection of classifiers should be wrapped into a classifier class.
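A sketch of such a wrapper, assuming pre-vectorized features and plain scikit-learn logistic regression (the class and attribute names are illustrative): each group gets a one-vs-rest binary model, and prediction picks the group with the highest positive-class probability.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class PerGroupClassifier:
    """One logistic regression per semantic group; predicts the group
    whose model assigns the highest positive-class probability."""

    def __init__(self, groups):
        self.models = {g: LogisticRegression() for g in groups}

    def fit(self, X, y):
        # Train each group's model as a one-vs-rest binary problem.
        for group, model in self.models.items():
            model.fit(X, (y == group).astype(int))
        return self

    def predict(self, X):
        groups = list(self.models)
        # Column g holds P(instance belongs to group g).
        probas = np.column_stack(
            [self.models[g].predict_proba(X)[:, 1] for g in groups])
        return np.array([groups[i] for i in probas.argmax(axis=1)])

# Toy usage with two groups and trivially separable features.
X = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
y = np.array(['DISO', 'PROC', 'PROC', 'DISO'][:0].__class__(['DISO', 'DISO', 'PROC', 'PROC']))
clf = PerGroupClassifier(['DISO', 'PROC']).fit(X, np.array(['DISO', 'DISO', 'PROC', 'PROC']))
```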
In this ticket you have to implement a very easy-to-use function for plotting learning curves. The image should be written to the specified file location. The user of the function should be able to use it without knowing how it works. A good example of a call would be:
plot_curves('output.jpg', passive_learner = [0.2, 0.21, 0.22],
active_learner = [0.2, 0.23, 0.55])
For every keyword argument (see info about kwargs) this would plot a line with the list index (starting at 1) on the X axis and the list values on the Y axis. For the supplied example it would plot the points (1, 0.2), (2, 0.21), (3, 0.22) in red and (1, 0.2), (2, 0.23), (3, 0.55) in blue, with a legend indicating that red means 'passive_learner' and blue means 'active_learner'. Optional control over graphical parameters could also be useful. Please describe how to use the function in a docstring.

If a plotting library motivates some other argument structure, that's OK; the main thing is that it should be very straightforward and easy to use. It would be nice to use matplotlib, as it is installed on the working server.
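A minimal matplotlib sketch of the requested function (the Agg backend is forced so it also works on a headless server; colors are left to matplotlib's default cycle rather than hard-coded red/blue):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend: write image files without a display
import matplotlib.pyplot as plt

def plot_curves(filename, **curves):
    """Plot one line per keyword argument and write the figure to `filename`.

    Each keyword's value is a list of y-values; x-values are the list
    indices starting at 1. The keyword names become the legend labels.

    Example:
        plot_curves('output.png',
                    passive_learner=[0.2, 0.21, 0.22],
                    active_learner=[0.2, 0.23, 0.55])
    """
    fig, ax = plt.subplots()
    for label, values in curves.items():
        ax.plot(range(1, len(values) + 1), values, marker='o', label=label)
    ax.set_xlabel('iteration')
    ax.set_ylabel('accuracy')
    ax.legend()
    fig.savefig(filename)
    plt.close(fig)
```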
Now in models.py there is a class called ContextRestrictedBagOfWords. It implements two functions, fit_transform and transform, to vectorize annotations. The task is to make a new version of this class, called ContextRestrictedBagOfBigrams, that uses word bigrams instead of just words. Please refer to the sklearn docs on CountVectorizer, specifically the ngram_range parameter.

To test the new class you can just plug it into an existing classifier instead of ContextRestrictedBagOfWords, like this:
class NaiveBayesContextRestricted(AnnotationClassifier):
    def __init__(self, **kwargs):
        # Multinomial Naive Bayes over the bigram features
        self.classifier = MultinomialNB()
        # default to a 3-word context window unless overridden
        window_size = kwargs.get('window_size', 3)
        self.vectorizer = ContextRestrictedBagOfBigrams(window_size)
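The core change is just the ngram_range argument; a standalone sketch of the idea (the real class would also keep the window-restriction logic):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['the patient received treatment', 'the patient felt better']

# ngram_range=(2, 2) makes CountVectorizer produce word bigrams only.
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
X = bigram_vectorizer.fit_transform(docs)
features = sorted(bigram_vectorizer.vocabulary_)
```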
This task is similar to the bigram one in the sense that one also needs to modify ContextRestrictedBagOfWords. There should be two feature vectors created for each annotation instead of one: a bag of words on the left context and a bag of words on the right context. The two vectors should then be joined into one.

To realize this idea, one should work with the feature matrices directly, joining the outputs of two CountVectorizers.
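Joining the two matrices can be done with scipy.sparse.hstack; a sketch with separate left- and right-context vectorizers on toy data:

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

left_contexts = ['patient received', 'doctor prescribed']
right_contexts = ['treatment today', 'medication dose']

left_vectorizer = CountVectorizer()
right_vectorizer = CountVectorizer()

# Two independent vocabularies, one sparse matrix each ...
X_left = left_vectorizer.fit_transform(left_contexts)
X_right = right_vectorizer.fit_transform(right_contexts)

# ... joined column-wise into a single feature matrix per annotation.
X = hstack([X_left, X_right])
```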
For the sake of dimensionality reduction, the following variants of ContextRestrictedBagOfWordsLeftRight should be implemented:

- ContextRestrictedBagOfWordsLeftRightCutoff: make the cut-off frequency a parameter of its constructor, just like the window size. Hint: the min_df parameter can be set to 3 in the CountVectorizer options.
- ContextRestrictedBagOfWordsLeftRightStopWords. Hint: stop_words='english'.

To compare the performance of the new vectorizers, create a script prototypes/compare_vectorizers.py that does the following: run OptionAwareNaiveBayesLeftRight on the given data, just like mturk_classifier_agreement.py does, for the different vectorizer settings (e.g. min_df) and report the according agreements. The resulting script should have the same command-line arguments as mturk_classifier_agreement.py.
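The two hints can be checked in isolation with CountVectorizer on toy data (a sketch, not repo code):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['the drug works', 'the drug fails', 'the trial ends']

# min_df=3: keep only terms that appear in at least 3 documents.
cutoff = CountVectorizer(min_df=3)
cutoff.fit(docs)

# stop_words='english': drop common English function words like 'the'.
no_stop = CountVectorizer(stop_words='english')
no_stop.fit(docs)
```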
For the sake of small-scale testing, generate a small subset of the two corpus files so that Maria and Valya can work on them without accessing the server.
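A sketch for cutting a fixed-size head off a corpus file (paths and subset size are placeholders):

```python
def make_subset(source_path, subset_path, n_lines=1000):
    """Copy the first n_lines of a corpus file into a small test file."""
    with open(source_path, encoding='utf-8') as src, \
         open(subset_path, 'w', encoding='utf-8') as dst:
        for i, line in enumerate(src):
            if i >= n_lines:
                break
            dst.write(line)
```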
Looks like train_and_serialize.py produces similar classifiers for any value of the dataset_fraction parameter.

Evidence 1: pickled classifier files for different fractions have the same size.
Evidence 2: the passive vs. active plots are exactly the same for the active learner and only slightly different for the passive one, which indicates that active learning is acting on the same data.

Full dataset: (plot)
What's expected to be the 5% fraction: (plot)
Implement a procedure to measure the agreement of a classifier with the labeled ambiguous data. For that purpose:

- The Annotation class should be modified to store ambiguous data as well.
- data.py should deserialize labeled ambiguous data (from MTurk or expert) to a list of Annotation objects.
- cv.py
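A generic sketch of tsv deserialization into annotation objects; the field layout and class definition here are hypothetical, since the real columns are defined by the MTurk/expert files and the repo's Annotation class:

```python
import csv
from io import StringIO

class Annotation:
    # Illustrative fields only; the repo's Annotation class defines the real ones.
    def __init__(self, text, ambiguous_term, label):
        self.text = text
        self.ambiguous_term = ambiguous_term
        self.label = label

def load_labeled_annotations(tsv_file):
    """Read (text, term, label) rows from a tab-separated file-like object."""
    reader = csv.reader(tsv_file, delimiter='\t')
    return [Annotation(text, term, label) for text, term, label in reader]

rows = StringIO('MS affects nerves\tMS\tDISO\nMS peak at 42\tMS\tPROC\n')
annotations = load_labeled_annotations(rows)
```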
In the result summary, OptionAwareNaiveBayesFullContextLeftRightCutoff attains 75% accuracy when trained on Medline and 61% accuracy when trained on EMEA. Nevertheless, when wrapped into a passive/active learner, it produces worse results: 63% and 56% accordingly.