
nlvr's People

Contributors

alsuhr, alsuhr-c, debajyotidatta, yoavartzi

nlvr's Issues

Typos in the dataset

I found that there are many typos in the dataset (e.g., ciircles, yelllow, yelloe, ...). I can send a small PR that fixes some of the sentences with typos, but I was wondering whether you would rather keep the dataset as it is now (i.e., leave the typos as they are)?

Thanks for creating this cool dataset.

NLVR dataset uploaded & visualized at tagtog

Thank you so much for creating this great dataset.

I have uploaded the NLVR dataset to tagtog for easier visualization and exploration of the data.

Here is the project link, with its guidelines/README: https://www.tagtog.net/NLVR/NLVR/-settings#tab-guidelines

Here is a sample, for instance: https://www.tagtog.net/NLVR/NLVR/pool%2Ftrain/aWRfhf_ACQLgY5U9nULhEdhX8938-998_1.md?p=0&i=3

It looks like this:

[Screenshot]

--

Do you have any thoughts or feedback? It would also be interesting to explore the full NLVR2 dataset.

The images have 4 channels

The provided images have 4 colour channels, and the last channel looks like this:

[Screenshot from 2017-07-14 13-31-06]

Is there any particular reason for this?

Label distributions

I am looking at the distributions of labels for sentences across structured representations, and wanted to check if my observations are correct. I grouped structured representations by gathering those in examples whose identifiers have the same "m" values, where each identifier is of the form "m-n".

I imagined I would find each sentence occurring with four different structured representations, two in which it is true and two in which it is false. However, I see that

  1. Only 3358 out of 3696 sentences in train, dev and test occur with 4 different structured representations; the others have fewer than four structures.
  2. Only 2176 out of the 3696 sentences have equal numbers of true and false labels.
  3. 465 out of the 3696 sentences have all true or all false labels.

I can see how (1) may have happened during post-processing to remove examples with low agreement, but I am not sure about (2) and (3). Please let me know if this is expected.
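For anyone who wants to reproduce counts like these, the grouping described above can be sketched roughly as follows. This is only an illustration: it assumes each example is a dict with an "identifier" of the form "m-n" and a "label" of "true" or "false", and that every example within a group has a distinct structured representation. The field names and the helper `label_stats` are assumptions, not taken from the dataset documentation.

```python
from collections import Counter, defaultdict

def label_stats(examples):
    """Group examples by the "m" part of an "m-n" identifier and summarize labels.

    Assumes (hypothetically) that each example is a dict with "identifier"
    and "label" keys, and that each example in a group has its own structure.
    """
    groups = defaultdict(list)
    for ex in examples:
        # Everything before the last hyphen is treated as "m".
        m = ex["identifier"].rsplit("-", 1)[0]
        groups[m].append(str(ex["label"]).lower())

    stats = Counter()
    for labels in groups.values():
        counts = Counter(labels)
        stats["groups"] += 1
        if len(labels) == 4:
            stats["four_structures"] += 1   # observation (1)
        if counts["true"] == counts["false"]:
            stats["balanced"] += 1          # observation (2)
        if len(counts) == 1:
            stats["single_label"] += 1      # observation (3)
    return stats
```

Calling `label_stats` on the concatenated train, dev and test examples would give totals that can be compared against the three numbers listed above.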

Thanks!

Question about implementation details of Image Features+RNN

In the image features + RNN method, you use color, shape, size, etc. to construct a set of features for every object. Take color and shape as an example. Given an image with two objects:
Object 1 is represented as: color [0,0,1], shape [1,0,0]
Object 2 is represented as: color [0,1,0], shape [0,1,0]

You said you use the concatenation of the one-hot image features to compute the image embeddings
with two layers of size 32.

Does this mean that you first concatenate the features of each object, namely Object 1 -> [0,0,1,1,0,0] and Object 2 -> [0,1,0,0,1,0], and then feed them into embedding layer 1 to produce vectors e1 and e2?
Are e1 and e2 then concatenated again and fed into embedding layer 2 to produce the final image embedding?

Or does it mean that you concatenate all features of all objects, namely image -> [0,0,1,1,0,0,0,1,0,0,1,0], which is then embedded by the two embedding layers?

Maybe both of my interpretations are wrong.
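To make the two readings concrete, here is a small sketch of the input vectors each one would produce for the two-object example. It only builds the concatenated inputs and does not claim to show which variant the authors actually used; the helper `concat` and the dict layout are illustrative assumptions.

```python
def concat(vectors):
    """Flatten a list of one-hot vectors into a single list."""
    return [value for vec in vectors for value in vec]

# The two objects from the example (layout is an assumption for illustration).
objects = [
    {"color": [0, 0, 1], "shape": [1, 0, 0]},  # Object 1
    {"color": [0, 1, 0], "shape": [0, 1, 0]},  # Object 2
]

# Reading 1: one concatenated feature vector per object,
# each embedded separately before the results are combined.
per_object = [concat([obj["color"], obj["shape"]]) for obj in objects]

# Reading 2: a single vector for the whole image,
# embedded directly by both layers.
whole_image = concat(per_object)
```

Under the first reading, the first layer would see two inputs of length 6; under the second, it would see one input of length 12.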

Thank you for this cool corpus.

Broken json files

When I try to parse any of the JSON files in Python or Julia, I get the following error:

>>> import json
>>> with open('train.json') as data_file:
...     data= json.load(data_file)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "erenay/anaconda/lib/python2.7/json/__init__.py", line 291, in load
    **kw)
  File "erenay/anaconda/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "erenay/anaconda/lib/python2.7/json/decoder.py", line 367, in decode
    raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 2 column 1 - line 991 column 1 (char 700 - 901156)
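An "Extra data" error starting at line 2, column 1 is what `json.load` raises when a file holds more than one top-level JSON value, which would be consistent with the files storing one JSON object per line. That layout is an assumption here, not something confirmed in this thread. Under that assumption, a minimal sketch for reading the file line by line (`parse_json_lines` is a hypothetical helper name):

```python
import json

def parse_json_lines(lines):
    """Parse an iterable of text lines, each holding one JSON object.

    Assumes the one-object-per-line layout; blank lines are skipped.
    """
    examples = []
    for line in lines:
        line = line.strip()
        if line:
            examples.append(json.loads(line))
    return examples
```

An open file object can be passed directly, for example `with open('train.json') as data_file: data = parse_json_lines(data_file)`.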

Vocabulary words

This isn't necessarily an issue, but rather an observation. It seems there are quite a few misspelt words in the example sentences.

Here is the output of a script which takes all examples, extracts their sentences, splits on spaces, lowercases all words, and deduplicates the words to form a simple vocabulary:

['1', 'more', '', 'isa', 'yelow', '.', 'has', 'numbers', 'block', 'having', 'have', 'no', 'base', 'leats', 'item,', 'back', 'other.', 'all', 'triangle,', 'odd', 'circles', 'circle', 'od', 'but', 'two', 'adge', 'black.', 'another', 'at', 'every', 'nealy', 'and', 'bases', 'touching', 'underneath', 't', 'sthe', 'box', 'same', 'one.', 'count', 'one', 'other', 'among', 'directly', 'objects', 'color.', 'do', 'wirh', 'bow', 'to', 'only.', '3', 'left', 'third', 'middle', 'than', 'that', 'less', 'closely', 'line', 'items,', 'shape.', 'out', 'boxes.', 'line.', 'bellow', 'ones.', 'triangle.', 'wwith', 'item', 'first', 'lot', 'nearly', 'ablue', 'triangle', 'coloured', 'not', 'both', 'box,', 'lease', 'box.', 'square,', 's', 'set', 'different.', 'over', 'colors.', 'alternately', 'them', 'just', 'in', 'between', 'smaller', '2', 'yelloe', 'objects,', 'squere', 'which', 'small', 'bottom-right', 'blue.', 'where', 'most', 'traingles.', 'ble', 'either', 'blicks', 'near', 'each', 'side', 'grey', ',', 'traingle', 'block.', 'bottom', 'beneath', '5', 'an', 'touhing', 'blue,', 'items.', 'bottom.', 'containing', 'positioned.', 'tocuhing', 'they', 'corner', 'height', 'base.', 'lest', 'square', 'three.', 'blocks,', 'consecutive', 'are', 'one,', 'total', 'yellow', 'different', 'towers', 'top.', 'item.', 'yelllow', 'block/', 'none', 'size.', 'bule', 'shapes', 'objects.', 'triangles.', 'only', 'number', 'some', 'size', 'least', 'ans', 'their', 'colour', 'objetcs', 'yellow.', 'ciircles', 'black,', 'any', 'second', 'corners.', 'middle.', 'contain', 'six', 'wih', 'medium', 'including', 'black', 'color', 'yellow,', 'circles.', 'attach', 'under', 'shape', 'all.', 'wth', 'height.', 'blacks', 'attached.', 'blocks', 'right', 'atleast', 'four', 'exacty', 'tow', 'theer', 'block,', 'each.', 'ones', 'bloxk', 'ia', 'bases.', 'big', 'exactly', 'items', 'then,idle', 'blccks', 'roof', 'squares.', 'eactly', 'this', 'attached', 'wall.', 'that.', 'colors', 'exacrly', 'is', 'corner.', 'ha', 'triangles', 'below', 
'bo', 'opis', '6', 'edge', 'squares', 'thee', 'or', 'with', 'a', 'there', 'colored', 'exacts', 'towers.', 'boxes', 'hte', 'lower', 'positions', 'made', 'three', 'kinds', 'egde', 'top', 'almost', 'it.', 'four.', 'it', 'single', 'type', 'most.', 'cirlce', 'i', 'abox.', 'blocks.', 'same.', 'square.', 'blue', 'after', 'colours', 'bkack', 'together', 'colour.', 'ad', 'its', 'even', 'close', 'tleast', '4', 'the', 'five', 'seven', 'trianlge', 'circle,', 'position.', 'tower.', 'as', 'above', 'stacked', 'without', 'al', 'many', 'circ;e', 'tower', 'blocks..', 'side.', 'from', 'multiple', 'object', 'level.', 'stack', 'rectangle', 'b;ue', 'of', 'tower,', 'being', 'object.', 'circle.', 'on', 'sqaures', 'contains', 'wall', 'll']

I was just wondering whether there's any agreed-upon convention for preprocessing the text?
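For reference, the script described above can be sketched roughly as follows, together with one possible normalization step. The second function is only an illustration of a simple preprocessing choice (stripping leading and trailing punctuation), not a convention endorsed by the dataset authors; both function names are hypothetical.

```python
import string

def naive_vocab(sentences):
    """Lowercase, split on single spaces, and deduplicate (as described above)."""
    return sorted({word for s in sentences for word in s.lower().split(" ")})

def normalized_vocab(sentences):
    """Same idea, but split on any whitespace and strip surrounding punctuation."""
    vocab = set()
    for s in sentences:
        for word in s.lower().split():
            word = word.strip(string.punctuation)
            if word:  # drop tokens that were punctuation only
                vocab.add(word)
    return sorted(vocab)
```

The normalized version merges entries such as 'box.' and 'box,' into 'box' and drops the empty string, but it leaves misspellings like 'yelloe' and internal punctuation like 'circ;e' untouched, so spelling correction would still be a separate step.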
