NLP - Named Entity tagger with Viterbi algorithm

Code for assignments from Prof. Michael Collins's NLP course, COMS W4705 (Spring 2015) at Columbia University.

Summary: The Viterbi algorithm finds the maximum-probability tag sequence for a series of observations, based on emission and transition probabilities. In a hidden Markov model, emission is the probability of an output given a state, and transition is the probability of moving to a state given the previous states. In our case, the emission parameter e(x|y) is the probability of the word being x given that you attributed tag y. If your training data had 100 counts of 'person' tags, one of which is the word 'London' (I know a guy who named his kid London), then e('London'|'person') = 0.01. With 50 counts of 'location' tags, 5 of which are 'London', e('London'|'location') = 0.1, which clearly trumps 0.01.

The transition parameter q(yi | yi-2, yi-1) is the probability of putting tag yi in position i given its two previous tags, calculated as Count(trigram)/Count(bigram). For each word in the development data, the Viterbi algorithm scores each word-tag combination using the emission and transition parameters estimated from the training data; it does this for every possible tag and picks the most likely one. Clearly this won't be 100% correct, since natural language is unpredictable, but you should get pretty high accuracy.
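
To make that arithmetic concrete, here is a toy sketch of the two emission estimates above (the counts are the made-up numbers from the example, not real data):

```python
# Toy counts from the example above (hypothetical numbers, not real data).
count_y = {'person': 100, 'location': 50}        # Count(y): times each tag appears
count_xy = {('London', 'person'): 1,             # Count(x~y): times word x got tag y
            ('London', 'location'): 5}

def emission(word, tag):
    """Maximum-likelihood estimate e(x|y) = Count(x~y) / Count(y)."""
    return count_xy.get((word, tag), 0) / count_y[tag]

print(emission('London', 'person'))    # 0.01
print(emission('London', 'location'))  # 0.1  -> 'location' wins
```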

Optional Preprocessing

Re-label words that appear fewer than 5 times in the training data as 'RARE'. This isn't required, but it is useful; re-run count_freqs.py afterwards if you use it.

Python code: label_rare.py

Usage: python label_rare.py [input_file]

Pseudocode:

  1. Use Python's collections.Counter to obtain word counts in [input_file]; keep the word-count pairs with count < 5 in a dictionary named rare_words.
  2. Iterate through each line in [input_file]; if the word is in rare_words, replace it with RARE (see the sketch below).
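
A minimal sketch of this preprocessing step, assuming each non-blank line of the training file starts with the word (the usual `word tag` layout), with blank lines separating sentences:

```python
import sys
from collections import Counter

def label_rare(path, threshold=5, rare_token='RARE'):
    """Replace words occurring fewer than `threshold` times with `rare_token`."""
    with open(path) as f:
        lines = f.readlines()

    # First pass: count how often each word (first column) appears.
    counts = Counter(line.split()[0] for line in lines if line.strip())
    rare_words = {w for w, c in counts.items() if c < threshold}

    # Second pass: rewrite lines, swapping rare words for the rare token.
    for line in lines:
        parts = line.split()
        if parts and parts[0] in rare_words:
            parts[0] = rare_token
        print(' '.join(parts))

if __name__ == '__main__':
    label_rare(sys.argv[1])
```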

Step 1. Get Count(y) and Count(x~y)

Python code: emission_counts.py

Pseudocode:

  1. Iterate through each line in the ner.counts file and store each word-tag-count combination in a dictionary count_xy, updating a second dictionary count_y as you go. For example, count_xy['Peter']['I-PER'] returns the number of times the word 'Peter' was labeled 'I-PER' in the training data, and count_y['I-PER'] the total number of 'I-PER' tags. The dictionary count_y contains one entry per tag (O, I-MISC, I-PER, I-ORG, I-LOC, B-MISC, B-PER, B-ORG, B-LOC).
  2. Return count_xy and count_y (see the sketch below).
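
A minimal sketch of Step 1, assuming count_freqs.py emits WORDTAG lines of the form `count WORDTAG tag word` (check this against your own ner.counts file):

```python
from collections import defaultdict

def get_emission_counts(counts_file='ner.counts'):
    """Collect Count(x~y) and Count(y) from WORDTAG lines."""
    count_xy = defaultdict(dict)   # count_xy[word][tag] = Count(word tagged as tag)
    count_y = defaultdict(int)     # count_y[tag] = total Count(tag)

    with open(counts_file) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 4 and parts[1] == 'WORDTAG':
                count, _, tag, word = parts
                count_xy[word][tag] = int(count)
                count_y[tag] += int(count)
    return count_xy, count_y
```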

Step 2. Get bigram and trigram counts

Python code: transition_counts.py

Pseudocode:

  1. Iterate through each line in the n-gram counts file.
  2. If the line contains '2-GRAM', add an item to the bigram_counts dictionary using the bigram (the two space-separated tags following '2-GRAM') as key and the count as value. This dictionary will contain Count(yi-2, yi-1).
  3. If the line contains '3-GRAM', add an item to the trigram_counts dictionary using the trigram as key and the count as value. This dictionary will contain Count(yi-2, yi-1, yi).
  4. Return the dictionaries of bigram and trigram counts (see the sketch below).
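
A companion sketch for Step 2, under the same assumed file format, where n-gram lines look like `count 2-GRAM tag1 tag2` or `count 3-GRAM tag1 tag2 tag3`:

```python
def get_transition_counts(counts_file='ner.counts'):
    """Collect bigram and trigram tag counts from n-GRAM lines."""
    bigram_counts, trigram_counts = {}, {}

    with open(counts_file) as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2 and parts[1] == '2-GRAM':
                bigram_counts[tuple(parts[2:])] = int(parts[0])   # Count(yi-2, yi-1)
            elif len(parts) >= 2 and parts[1] == '3-GRAM':
                trigram_counts[tuple(parts[2:])] = int(parts[0])  # Count(yi-2, yi-1, yi)
    return bigram_counts, trigram_counts
```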

Step 3. Viterbi

For each word in each line of [input_file]:

  1. If the word was seen in the training data (present in the count_xy dictionary), for each of the possible tags for the word:
  2. Calculate emission = count_xy[word][tag] / float(count_y[tag]).
  3. Calculate transition = trigram_counts[trigram] / float(bigram_counts[bigram]). Note: yi-2 = *, yi-1 = * for the first word of a sentence.
  4. Set probability = emission × transition.
  5. Update max(probability) and arg max if needed.
  6. If the word was not seen in the training data, for each of the possible tags:
  7. Calculate emission = count_xy['RARE'][tag] / float(count_y[tag]).
  8. Calculate transition = q(yi | yi-2, yi-1) = trigram_counts[trigram] / float(bigram_counts[bigram]). Note: yi-2 = *, yi-1 = * for the first word of a sentence.
  9. Set probability = emission × transition.
  10. Update max(probability) and arg max if needed.
  11. Write arg max and log(max(probability)) to the output file.
  12. Update yi-2 and yi-1 with the chosen tag (a sketch of this decoding loop follows this list).
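
Below is a minimal sketch of the decoding loop described above, wired to the count dictionaries from Steps 1 and 2. It follows the greedy left-to-right formulation in the list (a full Viterbi implementation would instead keep dynamic-programming scores for every pair of previous tags); the function name tag_sentence and the 'RARE' fallback token are illustrative assumptions, not necessarily the repo's actual names.

```python
import math

def tag_sentence(words, count_xy, count_y, bigram_counts, trigram_counts):
    """Greedily tag one sentence left to right, as in the step list above."""
    tags = sorted(count_y)      # all labels seen in training
    y2, y1 = '*', '*'           # yi-2 and yi-1 start as the boundary tag '*'
    result = []

    for word in words:
        # Unseen words fall back to the counts of the 'RARE' token.
        key = word if word in count_xy else 'RARE'
        best_tag, best_prob = tags[0], 0.0   # default in case every score is zero

        for tag in tags:
            emission = count_xy.get(key, {}).get(tag, 0) / float(count_y[tag])
            bigram = bigram_counts.get((y2, y1), 0)
            trigram = trigram_counts.get((y2, y1, tag), 0)
            transition = trigram / float(bigram) if bigram else 0.0
            probability = emission * transition
            if probability > best_prob:
                best_tag, best_prob = tag, probability

        log_prob = math.log(best_prob) if best_prob > 0 else float('-inf')
        result.append((word, best_tag, log_prob))
        y2, y1 = y1, best_tag   # shift the tag history for the next word

    return result
```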

Evaluation

Prof. Michael Collins provided an evaluation script to verify the output of your Viterbi implementation.

Usage: python eval_ne_tagger.py ner_dev.key [output_file]
