nytimes / ingredient-phrase-tagger

Extract structured data from ingredient phrases using conditional random fields

Home Page: http://open.blogs.nytimes.com/2016/04/27/structured-ingredients-data-tagging/

License: Other

ingredient-phrase-tagger's Introduction

CRF Ingredient Phrase Tagger

This repo contains scripts to extract the Quantity, Unit, Name, and Comments from unstructured ingredient phrases. We use it on Cooking to format incoming recipes. Given the following input:

1 pound carrots, young ones if possible
Kosher salt, to taste
2 tablespoons sherry vinegar
2 tablespoons honey
2 tablespoons extra-virgin olive oil
1 medium-size shallot, peeled and finely diced
1/2 teaspoon fresh thyme leaves, finely chopped
Black pepper, to taste

Our tool produces something like:

{
    "qty":     "1",
    "unit":    "pound",
    "name":    "carrots",
    "other":   ",",
    "comment": "young ones if possible",
    "input":   "1 pound carrots, young ones if possible",
    "display": "<span class='qty'>1</span><span class='unit'>pound</span><span class='name'>carrots</span><span class='other'>,</span><span class='comment'>young ones if possible</span>"
}

We use a conditional random field model (CRF) to extract tags from labelled training data, which was tagged by human news assistants. We wrote about our approach on the New York Times Open blog. More information about CRFs can be found here.

On a 2012 MacBook Pro, training the model on 130k examples with the CRF++ library takes roughly 30 minutes.

Development

On OSX:

brew install crf++
python setup.py install

Quick Start

The most common usage is to train the model with a subset of our data, test the model against a different subset, then visualize the results. We provide a shell script to do this, at:

./roundtrip.sh

You can edit this script to specify the size of your training and testing set. The default is 20k training examples and 2k test examples.

Usage

Training

To train the model, we must first convert our input data into a format which crf_learn can accept:

bin/generate_data --data-path=input.csv --count=1000 --offset=0 > tmp/train_file

The count argument specifies the number of training examples (i.e. ingredient lines) to read, and offset specifies which line to start with. There are roughly 180k examples in our snapshot of the New York Times cooking database (which we include in this repo), so it is useful to run against a subset.
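As a rough illustration of how count and offset select a window of examples, here is a minimal Python sketch. It is not the code behind bin/generate_data; the function name read_subset and the column names in the sample CSV are assumptions made for the example.

```python
import csv
import io
from itertools import islice


def read_subset(csv_file, count, offset=0):
    """Return `count` rows starting at data row `offset` (header excluded)."""
    reader = csv.DictReader(csv_file)
    return list(islice(reader, offset, offset + count))


# Hypothetical three-row sample; the real CSV may use different columns.
sample = io.StringIO(
    "input,name,qty\n"
    "1 cup white wine,white wine,1.0\n"
    "1/2 cup sugar,sugar,0.5\n"
    "2 tablespoons dry white wine,dry white wine,2.0\n"
)

rows = read_subset(sample, count=2, offset=1)
print([row["input"] for row in rows])
```

Picking a test window whose offset starts at or after the end of the training window keeps the two subsets from overlapping.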

The output of this step looks something like:

1            I1      L8      NoCAP  NoPAREN  B-QTY
cup          I2      L8      NoCAP  NoPAREN  B-UNIT
white        I3      L8      NoCAP  NoPAREN  B-NAME
wine         I4      L8      NoCAP  NoPAREN  I-NAME

1/2          I1      L4      NoCAP  NoPAREN  B-QTY
cup          I2      L4      NoCAP  NoPAREN  B-UNIT
sugar        I3      L4      NoCAP  NoPAREN  B-NAME

2            I1      L8      NoCAP  NoPAREN  B-QTY
tablespoons  I2      L8      NoCAP  NoPAREN  B-UNIT
dry          I3      L8      NoCAP  NoPAREN  B-NAME
white        I4      L8      NoCAP  NoPAREN  I-NAME
wine         I5      L8      NoCAP  NoPAREN  I-NAME
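To make the layout concrete, here is a small Python sketch that reads this format back into sequences. It relies only on the general CRF++ conventions: whitespace-separated columns, one token per line, a blank line between sequences, and the label in the last column. The middle columns are treated as opaque features; in the sample the I column appears to track token position, but that reading is an inference, and parse_crfpp is a hypothetical helper rather than part of this repo.

```python
def parse_crfpp(text):
    """Split CRF++-style text into sequences of (token, features, label)."""
    sequences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sequence.
            if current:
                sequences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[1:-1], columns[-1]))
    if current:
        sequences.append(current)
    return sequences


sample = """\
1      I1  L8  NoCAP  NoPAREN  B-QTY
cup    I2  L8  NoCAP  NoPAREN  B-UNIT
white  I3  L8  NoCAP  NoPAREN  B-NAME
wine   I4  L8  NoCAP  NoPAREN  I-NAME

1/2    I1  L4  NoCAP  NoPAREN  B-QTY
cup    I2  L4  NoCAP  NoPAREN  B-UNIT
sugar  I3  L4  NoCAP  NoPAREN  B-NAME
"""

for sequence in parse_crfpp(sample):
    print([(token, label) for token, _, label in sequence])
```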

Next, we pass this file to crf_learn, to generate a model file:

crf_learn template_file tmp/train_file tmp/model_file

Testing

To use the model to tag your own arbitrary ingredient lines (stored here in input.txt), you must first convert them into the CRF++ format, then run them against the model file generated above. We provide another helper script to do this:

python bin/parse-ingredients.py input.txt > results.txt

The output is also in CRF++ format, which isn't terribly helpful to us. To convert it into JSON:

python bin/convert-to-json.py results.txt > results.json

See the top of this README for an example of the expected output.
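The general idea behind the conversion can be sketched in a few lines of Python: strip the B-/I- prefix from each label and join the tokens that share a label. This is only an illustration, not the logic of bin/convert-to-json.py; the function name is hypothetical, the OTHER label is assumed from the "other" field in the example output, and the "input" and "display" fields are omitted.

```python
import json


def tagged_tokens_to_fields(tagged):
    """Group (token, label) pairs into one string per label type."""
    fields = {}
    for token, label in tagged:
        # "B-QTY" -> "qty", "I-COMMENT" -> "comment", "OTHER" -> "other"
        key = label.split("-", 1)[-1].lower()
        fields.setdefault(key, []).append(token)
    return {key: " ".join(tokens) for key, tokens in fields.items()}


tagged = [
    ("1", "B-QTY"),
    ("pound", "B-UNIT"),
    ("carrots", "B-NAME"),
    (",", "OTHER"),
    ("young", "B-COMMENT"),
    ("ones", "I-COMMENT"),
    ("if", "I-COMMENT"),
    ("possible", "I-COMMENT"),
]

print(json.dumps(tagged_tokens_to_fields(tagged), indent=4))
```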

License

Apache 2.0.

ingredient-phrase-tagger's Issues

Possibly Improved Sequence Tagger

Hi,
This is a pretty awesome project, thanks for posting it! I've been experimenting with Structured Prediction methods, and decided to use this project to compare CRFs and Learning to Search methods. Pending a more rigorous evaluation (fingers crossed) I'm seeing roughly 96% per-token accuracy and 95.55% sentence-level accuracy with L2S and vw, taking 22 minutes total for read+train+test. This is an 80/20 split on the full dataset, using the output of bin/generate. Once I've cleaned up the source, I'll be happy to send over a pull request.

- Arthur

Facing value error

While running this command I am getting an error:

Traceback (most recent call last):
  File "bin/generate_data", line 7, in <module>
    cli.Cli(sys.argv[1:]).run()
  File "/home/ramya/anaconda3/lib/python3.6/site-packages/ingredient_phrase_tagger-0.0.0.dev0-py3.6.egg/ingredient_phrase_tagger/training/cli.py", line 15, in run
  File "/home/ramya/anaconda3/lib/python3.6/site-packages/ingredient_phrase_tagger-0.0.0.dev0-py3.6.egg/ingredient_phrase_tagger/training/cli.py", line 26, in generate_data
ValueError: invalid literal for int() with base 10: ''

Please help me

Improve roundtrip script

It should take command line parameters for the number of training and testing examples: ./roundtrip --train-count 1000 --test-count 100.

And maybe even check that the sum is less than the number of lines of ingredients.csv.
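One possible shape for this, sketched in Python with argparse rather than in the shell script itself. The flag names come from the proposal above and the defaults from the README (20k training, 2k test); the parse_args helper and its total_lines parameter are hypothetical, and the caller would need to count the lines of the CSV.

```python
import argparse


def parse_args(argv, total_lines):
    """Parse the proposed flags and reject counts that exceed the data size."""
    parser = argparse.ArgumentParser(prog="roundtrip")
    parser.add_argument("--train-count", type=int, default=20000)
    parser.add_argument("--test-count", type=int, default=2000)
    args = parser.parse_args(argv)
    if args.train_count + args.test_count > total_lines:
        # parser.error prints a usage message and exits with status 2.
        parser.error(
            "train-count + test-count (%d) exceeds the %d available lines"
            % (args.train_count + args.test_count, total_lines)
        )
    return args


args = parse_args(["--train-count", "1000", "--test-count", "100"], total_lines=180000)
print(args.train_count, args.test_count)
```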

Issues parsing irrational numbers

Have you guys run into issues trying to parse strings with fractions whose decimal expansions repeat or do not terminate cleanly (these are rational numbers, not irrational ones)? For example, the following will all return null for qty.

1/3 cup flour
2/3 tsp almond extract
14/15 gallon milk
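The cause of the null values is not diagnosed here. As a point of comparison, Python's standard fractions module parses such strings exactly, without going through a rounded float. The sketch below is a standalone illustration with a hypothetical parse_quantity helper, not the tagger's own quantity handling.

```python
from fractions import Fraction


def parse_quantity(text):
    """Parse '1/3', '2', or a mixed number like '1 1/2'; return None on failure."""
    parts = text.split()
    if not parts:
        return None
    try:
        # Summing the parts handles mixed numbers such as "1 1/2".
        return sum(Fraction(part) for part in parts)
    except (ValueError, ZeroDivisionError):
        return None


for quantity in ["1/3", "2/3", "14/15", "1 1/2", "pinch"]:
    print(quantity, "->", parse_quantity(quantity))
```

Keeping the value as a Fraction defers any decimal rounding until display time.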

CRF Output

Hi, I am not able to understand what these tab-separated fields mean.

1            I1      L8      NoCAP  NoPAREN  B-QTY
cup          I2      L8      NoCAP  NoPAREN  B-UNIT
white        I3      L8      NoCAP  NoPAREN  B-NAME
wine         I4      L8      NoCAP  NoPAREN  I-NAME

Please, help me out.

Thanks

BIO tagging/chunking bug

The first entry for the test set looks like this:

1	I1	L12	NoCAP	NoPAREN	B-QTY
boneless	I2	L12	NoCAP	NoPAREN	I-COMMENT
pork	I3	L12	NoCAP	NoPAREN	B-NAME
tenderloin	I4	L12	NoCAP	NoPAREN	I-NAME
,	I5	L12	NoCAP	NoPAREN	B-COMMENT
about	I6	L12	NoCAP	NoPAREN	I-COMMENT
1	I7	L12	NoCAP	NoPAREN	B-QTY
pound	I8	L12	NoCAP	NoPAREN	I-COMMENT

The corresponding CSV entry is: 20000,"1 boneless pork tenderloin, about 1 pound",pork tenderloin,1.0,0.0,,"boneless, about 1 pound"

The second token should be labelled "B-COMMENT" because there's no comment preceding it.

The issue is with addPrefixes and bestTag. addPrefixes determines that '1' is both the QTY and part of the entry's comment, so it lists the possible tags as ['B-COMMENT', 'B-QTY']. It then moves to the next token, determines that it is a COMMENT, and tags it I-COMMENT because the previous token has B-COMMENT among its possible tags. bestTag prefers any tag over COMMENT, so it assigns B-QTY to '1', which leaves 'boneless' incorrectly tagged I-COMMENT.

Essentially, I think addPrefixes and bestTag should be combined into a single function since BIO chunking really needs to know what the previous tag is actually going to be.

Additionally, it may also be reasonable that if the first instance of '1' is labelled as QTY then the second should be labelled 'COMMENT', but that would be a separate issue apart from the BIO chunking.
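A minimal Python sketch of the combined approach: pick the final tag for each token first, then derive the B-/I- prefix from the tag actually chosen for the previous token. The functions best_tag and tag_with_bio are hypothetical stand-ins, not the repo's bestTag and addPrefixes, and the "anything over COMMENT" preference is taken from the description above.

```python
def best_tag(candidates):
    """Prefer any non-COMMENT tag, as described above; otherwise take the first."""
    non_comment = [tag for tag in candidates if tag != "COMMENT"]
    return non_comment[0] if non_comment else candidates[0]


def tag_with_bio(candidates_per_token):
    """Choose each token's tag, then prefix it based on the previous chosen tag."""
    labels = []
    previous = None
    for candidates in candidates_per_token:
        tag = best_tag(candidates)
        prefix = "I" if tag == previous else "B"
        labels.append("%s-%s" % (prefix, tag))
        previous = tag
    return labels


# "1 boneless pork tenderloin , about 1 pound" with unprefixed candidate tags.
candidates = [
    ["COMMENT", "QTY"],  # 1
    ["COMMENT"],         # boneless
    ["NAME"],            # pork
    ["NAME"],            # tenderloin
    ["COMMENT"],         # ,
    ["COMMENT"],         # about
    ["COMMENT", "QTY"],  # 1
    ["COMMENT"],         # pound
]
print(tag_with_bio(candidates))
```

With this ordering 'boneless' comes out as B-COMMENT. The second '1' is still tagged QTY, which is the separate issue mentioned above.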

Alternative Units

Hi, you guys have created an awesome tool here.

I was wondering if it's possible to recognize (or 'teach' it to recognize) alternative units.

For example:

2 ounces milk returns qty: 2, unit: ounce, name: milk
But 2 oz milk returns qty: 2, name: oz milk

It doesn't have any idea that oz is the same as ounce.

Thanks!
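One workaround that avoids retraining is to expand abbreviations before the phrase reaches the tagger. The Python sketch below shows the idea; the alias table and normalize_units are illustrative assumptions, not part of this repo, and whether the model then tags the expanded word as a unit has not been verified.

```python
# Hypothetical alias table; extend as needed.
UNIT_ALIASES = {
    "oz": "ounce",
    "lb": "pound",
    "tsp": "teaspoon",
    "tbsp": "tablespoon",
}


def normalize_units(phrase):
    """Replace known unit abbreviations with their long forms."""
    tokens = phrase.split()
    return " ".join(
        # Ignore case and a trailing period when looking up an alias.
        UNIT_ALIASES.get(token.lower().rstrip("."), token)
        for token in tokens
    )


print(normalize_units("2 oz milk"))
```

The alternative is to add abbreviated examples to the training data so the model learns them directly.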

Update README

Give some background. Maybe just copy-paste from the blog post about this?
Remove all NYT-specific references. Nobody should have to know what the D.U. is.

Doesn't handle one type of format

The program doesn't handle the following ingredient phrase:

250 gram-4 pieces of Foli Fish