
ingredient-phrase-tagger's Introduction

CRF Ingredient Phrase Tagger

This repo contains scripts to extract the Quantity, Unit, Name, and Comments from unstructured ingredient phrases. We use it on Cooking to format incoming recipes. Given the following input:

1 pound carrots, young ones if possible
Kosher salt, to taste
2 tablespoons sherry vinegar
2 tablespoons honey
2 tablespoons extra-virgin olive oil
1 medium-size shallot, peeled and finely diced
1/2 teaspoon fresh thyme leaves, finely chopped
Black pepper, to taste

Our tool produces something like:

{
    "qty":     "1",
    "unit":    "pound",
    "name":    "carrots",
    "other":   ",",
    "comment": "young ones if possible",
    "input":   "1 pound carrots, young ones if possible",
    "display": "<span class='qty'>1</span><span class='unit'>pound</span><span class='name'>carrots</span><span class='other'>,</span><span class='comment'>young ones if possible</span>"
}

We use a conditional random field model (CRF) to extract tags from labelled training data, which was tagged by human news assistants. We wrote about our approach on the New York Times Open blog. More information about CRFs can be found here.

On a 2012 MacBook Pro, training the model on 130k examples with the CRF++ library takes roughly 30 minutes.

Development

On OSX:

brew install crf++
python setup.py install

Quick Start

The most common usage is to train the model with a subset of our data, test the model against a different subset, then visualize the results. We provide a shell script to do this, at:

./roundtrip.sh

You can edit this script to specify the size of your training and testing set. The default is 20k training examples and 2k test examples.

Usage

Training

To train the model, we must first convert our input data into a format which crf_learn can accept:

bin/generate_data --data-path=input.csv --count=1000 --offset=0 > tmp/train_file

The count argument specifies the number of training examples (i.e. ingredient lines) to read, and offset specifies which line to start with. There are roughly 180k examples in our snapshot of the New York Times cooking database (which we include in this repo), so it is useful to run against a subset.

The output of this step looks something like:

1            I1      L8      NoCAP  NoPAREN  B-QTY
cup          I2      L8      NoCAP  NoPAREN  B-UNIT
white        I3      L8      NoCAP  NoPAREN  B-NAME
wine         I4      L8      NoCAP  NoPAREN  I-NAME

1/2          I1      L4      NoCAP  NoPAREN  B-QTY
cup          I2      L4      NoCAP  NoPAREN  B-UNIT
sugar        I3      L4      NoCAP  NoPAREN  B-NAME

2            I1      L8      NoCAP  NoPAREN  B-QTY
tablespoons  I2      L8      NoCAP  NoPAREN  B-UNIT
dry          I3      L8      NoCAP  NoPAREN  B-NAME
white        I4      L8      NoCAP  NoPAREN  I-NAME
wine         I5      L8      NoCAP  NoPAREN  I-NAME
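For readers who want to inspect this file programmatically, here is a minimal Python sketch that splits it back into phrases. The column meanings noted in the comments are inferred from the sample above, not taken from a specification, so treat them as an assumption. `parse_crf_rows` is a hypothetical helper and is not part of this repo.

```python
def parse_crf_rows(text):
    """Split CRF++-style text into sequences of (token, features, label)."""
    sequences, current = [], []
    for line in text.splitlines():
        columns = line.split()
        if not columns:
            # A blank line ends the current ingredient phrase.
            if current:
                sequences.append(current)
                current = []
            continue
        # Assumed layout: the token comes first and the label last.
        # In between (inferred from the sample): position index (I1, I2, ...),
        # a phrase-length bucket (L4, L8, ...), a capitalization flag and a
        # parenthesis flag.
        token, *features, label = columns
        current.append((token, features, label))
    if current:
        sequences.append(current)
    return sequences


sample = """\
1            I1      L8      NoCAP  NoPAREN  B-QTY
cup          I2      L8      NoCAP  NoPAREN  B-UNIT
white        I3      L8      NoCAP  NoPAREN  B-NAME
wine         I4      L8      NoCAP  NoPAREN  I-NAME

1/2          I1      L4      NoCAP  NoPAREN  B-QTY
cup          I2      L4      NoCAP  NoPAREN  B-UNIT
sugar        I3      L4      NoCAP  NoPAREN  B-NAME
"""

for phrase in parse_crf_rows(sample):
    print([(token, label) for token, _, label in phrase])
```

The B- and I- prefixes on the labels mark the first token of a chunk and its continuation, respectively.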

Next, we pass this file to crf_learn, to generate a model file:

crf_learn template_file tmp/train_file tmp/model_file

Testing

To use the model to tag your own arbitrary ingredient lines (stored here in input.txt), you must first convert them into the CRF++ format, then run them against the model file that we generated above. We provide another helper script to do this:

python bin/parse-ingredients.py input.txt > results.txt

The output is also in CRF++ format, which isn't terribly helpful to us. To convert it into JSON:

python bin/convert-to-json.py results.txt > results.json

See the top of this README for an example of the expected output.

Authors

License

Apache 2.0.

ingredient-phrase-tagger's People

Contributors

adammck, ericagreene, jsundram, macdiva, wrboyce

ingredient-phrase-tagger's Issues

Possibly Improved Sequence Tagger

Hi,
This is a pretty awesome project, thanks for posting it! I've been experimenting with Structured Prediction methods, and decided to use this project to compare CRFs and Learning to Search methods. Pending a more rigorous evaluation (fingers crossed) I'm seeing roughly 96% per-token accuracy and 95.55% sentence-level accuracy with L2S and vw, taking 22 minutes total for read+train+test. This is an 80/20 split on the full dataset, using the output of bin/generate. Once I've cleaned up the source, I'll be happy to send over a pull request.

- Arthur
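The two figures quoted above measure different things, which a short Python sketch can make concrete. This is only an illustration of the usual definitions (fraction of tokens tagged correctly, and fraction of phrases with every token correct); it is not the evaluation code used in the comparison, and `tagging_accuracy` is a hypothetical name.

```python
def tagging_accuracy(gold, predicted):
    """Return (per-token accuracy, sentence-level accuracy) for two lists of tag sequences.

    Assumes both lists are non-empty and aligned phrase by phrase.
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and the same length")
    tokens_total = tokens_correct = sentences_correct = 0
    for gold_tags, predicted_tags in zip(gold, predicted):
        if len(gold_tags) != len(predicted_tags):
            raise ValueError("each pair of sequences must have the same length")
        matches = sum(g == p for g, p in zip(gold_tags, predicted_tags))
        tokens_total += len(gold_tags)
        tokens_correct += matches
        # A phrase counts as correct only if every one of its tokens matches.
        sentences_correct += matches == len(gold_tags)
    return tokens_correct / tokens_total, sentences_correct / len(gold)


gold = [["B-QTY", "B-UNIT", "B-NAME", "I-NAME"], ["B-QTY", "B-UNIT", "B-NAME"]]
predicted = [["B-QTY", "B-UNIT", "B-NAME", "B-NAME"], ["B-QTY", "B-UNIT", "B-NAME"]]
token_accuracy, sentence_accuracy = tagging_accuracy(gold, predicted)
print(token_accuracy, sentence_accuracy)
```

In this toy input one of seven tokens is wrong, yet half of the phrases count as wrong, which is why sentence-level accuracy is the stricter number.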

Facing value error

While running this command I am getting an error:

Traceback (most recent call last):
File "bin/generate_data", line 7, in
cli.Cli(sys.argv[1:]).run()
File "/home/ramya/anaconda3/lib/python3.6/site-packages/ingredient_phrase_tagger-0.0.0.dev0-py3.6.egg/ingredient_phrase_tagger/training/cli.py", line 15, in run
File "/home/ramya/anaconda3/lib/python3.6/site-packages/ingredient_phrase_tagger-0.0.0.dev0-py3.6.egg/ingredient_phrase_tagger/training/cli.py", line 26, in generate_data
ValueError: invalid literal for int() with base 10: ''

Please help me
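The last line of the traceback says that int() was handed an empty string. Which value was empty cannot be determined from the report, so the Python sketch below only reproduces the message and shows one way a guard could give a clearer error. `parse_int_option` is a hypothetical helper, not code from this repo.

```python
def parse_int_option(name, raw_value):
    """Convert a command-line option value to int, with a descriptive error if it is empty."""
    if raw_value is None or raw_value.strip() == "":
        raise ValueError(f"--{name} needs an integer value, got an empty string")
    return int(raw_value)


# Calling int() on an empty string raises the same ValueError as in the report.
try:
    int("")
except ValueError as error:
    print(error)

print(parse_int_option("count", "1000"))
```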

Improve roundtrip script

It should take command line parameters for the number of training and testing examples: ./roundtrip --train-count 1000 --test-count 100.

And maybe even check that the sum is less than the number of lines of ingredients.csv.
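A rough Python sketch of what this request could look like, using argparse. The flag names come from the example above and the defaults from the Quick Start section; the function names, and the choice to place the test rows directly after the training rows, are assumptions made for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the two counts requested in this issue."""
    parser = argparse.ArgumentParser(description="Train and test on disjoint subsets.")
    parser.add_argument("--train-count", type=int, default=20000)
    parser.add_argument("--test-count", type=int, default=2000)
    return parser.parse_args(argv)


def plan_split(train_count, test_count, total_rows):
    """Return (offset, count) pairs for training and testing, or raise if they do not fit."""
    if train_count < 0 or test_count < 0:
        raise ValueError("counts must be non-negative")
    if train_count + test_count > total_rows:
        raise ValueError(
            f"requested {train_count + test_count} rows but only {total_rows} are available"
        )
    # Test rows start right after the training rows, so the two sets never overlap.
    return (0, train_count), (train_count, test_count)


args = parse_args(["--train-count", "1000", "--test-count", "100"])
print(plan_split(args.train_count, args.test_count, total_rows=5000))
```

Each (offset, count) pair maps onto the --offset and --count arguments of bin/generate_data described in the Training section.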

Issues parsing fractions with non-terminating decimals

Have you guys run into issues trying to parse strings with fractions whose decimal expansions do not terminate? For example, the following will all return null for qty.

1/3 cup flour
2/3 tsp almond extract
14/15 gallon milk
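The report does not establish whether rounding is what produces the null. If it is, one option is to keep such quantities exact instead of converting them to decimals. The Python sketch below uses the standard library's fractions module; `parse_quantity` is a hypothetical helper and not part of this repo.

```python
from fractions import Fraction


def parse_quantity(text):
    """Parse '2', '1/3' or '1 1/2' into an exact Fraction; return None if it is not a number."""
    parts = text.split()
    if not parts:
        return None
    total = Fraction(0)
    try:
        # Mixed numbers such as "1 1/2" are summed part by part.
        for part in parts:
            total += Fraction(part)
    except (ValueError, ZeroDivisionError):
        return None
    return total


for phrase in ["1/3", "2/3", "14/15", "1 1/2"]:
    print(phrase, "->", parse_quantity(phrase))
```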

CRF Output

Hi, I am not able to understand what these tab-separated fields mean.

1            I1      L8      NoCAP  NoPAREN  B-QTY
cup          I2      L8      NoCAP  NoPAREN  B-UNIT
white        I3      L8      NoCAP  NoPAREN  B-NAME
wine         I4      L8      NoCAP  NoPAREN  I-NAME

Please, help me out.

Thanks

BIO tagging/chunking bug

The first entry for the test set looks like this:

1	I1	L12	NoCAP	NoPAREN	B-QTY
boneless	I2	L12	NoCAP	NoPAREN	I-COMMENT
pork	I3	L12	NoCAP	NoPAREN	B-NAME
tenderloin	I4	L12	NoCAP	NoPAREN	I-NAME
,	I5	L12	NoCAP	NoPAREN	B-COMMENT
about	I6	L12	NoCAP	NoPAREN	I-COMMENT
1	I7	L12	NoCAP	NoPAREN	B-QTY
pound	I8	L12	NoCAP	NoPAREN	I-COMMENT

The corresponding CSV entry is: 20000,"1 boneless pork tenderloin, about 1 pound",pork tenderloin,1.0,0.0,,"boneless, about 1 pound"

The second token should be labelled "B-COMMENT" because there's no comment preceding it.

The issue is with addPrefixes and bestTag. addPrefixes determines that '1' is both the QTY and part of the entry's comment, so it lists the possible tags as ['B-COMMENT', 'B-QTY']. It then moves to the next token, determines that it is a COMMENT, and tags it as I-COMMENT because the previous token has B-COMMENT as a possible tag. bestTag picks anything over a COMMENT, so it assigns B-QTY to the '1', and 'boneless' is left incorrectly tagged as I-COMMENT.

Essentially, I think addPrefixes and bestTag should be combined into a single function since BIO chunking really needs to know what the previous tag is actually going to be.

Additionally, it may also be reasonable that if the first instance of '1' is labelled as QTY then the second should be labelled 'COMMENT', but that would be a separate issue apart from the BIO chunking.
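To illustrate the combined approach, here is a minimal Python sketch that picks one tag per token first and only then decides between B- and I-, based on the tag actually chosen for the previous token. It is not the repo's addPrefixes or bestTag. The function names, the rule of taking the first non-COMMENT candidate, and the candidate lists for the example phrase are all assumptions.

```python
def best_tag(candidates):
    """Pick one tag from a non-empty candidate list, preferring anything over COMMENT."""
    non_comment = [tag for tag in candidates if tag != "COMMENT"]
    return (non_comment or candidates)[0]


def add_bio_prefixes(candidate_tags):
    """Choose a tag per token, then prefix it using the tag chosen for the previous token."""
    labelled, previous = [], None
    for candidates in candidate_tags:
        tag = best_tag(candidates)
        # Continue a chunk only if the previous token really received the same tag.
        prefix = "I" if tag == previous else "B"
        labelled.append(f"{prefix}-{tag}")
        previous = tag
    return labelled


# Assumed candidate tags for "1 boneless pork tenderloin , about 1 pound".
candidates = [
    ["COMMENT", "QTY"], ["COMMENT"], ["NAME"], ["NAME"],
    ["COMMENT"], ["COMMENT"], ["COMMENT", "QTY"], ["COMMENT"],
]
print(add_bio_prefixes(candidates))
```

With this ordering 'boneless' comes out as B-COMMENT, because the token before it was resolved to QTY before any prefix was assigned.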

Alternative Units

Hi, you guys have created an awesome tool here.

I was wondering if it's possible to recognize (or 'teach' it to recognize) alternative units.

For example:

  • 2 ounces milk returns qty: 2, unit: ounce, name: milk
  • But 2 oz milk returns qty: 2, name: oz milk

It doesn't have any idea that oz is the same as ounce.

Thanks!
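One workaround that does not require retraining is to expand abbreviations before the phrase reaches the tagger. The Python sketch below is only an illustration of that preprocessing idea. The alias table and the `expand_unit_aliases` name are hypothetical, and nothing here is part of this repo.

```python
# Hypothetical abbreviation table; extend it to suit your data.
UNIT_ALIASES = {
    "oz": "ounce",
    "lb": "pound",
    "tsp": "teaspoon",
    "tbsp": "tablespoon",
}


def expand_unit_aliases(phrase):
    """Replace known unit abbreviations with their full names, token by token."""
    expanded = []
    for token in phrase.split():
        # Drop a trailing period and ignore case so "Oz." matches as well.
        key = token.rstrip(".").lower()
        expanded.append(UNIT_ALIASES.get(key, token))
    return " ".join(expanded)


print(expand_unit_aliases("2 oz milk"))
```

The alternative is to add abbreviated examples to the training data, which lets the model learn them directly but means retraining.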

Update README

  • Give some background. Maybe just copy-paste from the blog post about this?
  • Remove all NYT-specific references. Nobody should have to know what the D.U. is.
