pytorch / text Goto Github PK

View Code? Open in Web Editor NEW

3.5K 309.0 815.0 48.35 MB

Models, data loaders and abstractions for language processing, powered by PyTorch

Home Page: https://pytorch.org/text

License: BSD 3-Clause "New" or "Revised" License

Python 81.08% Shell 1.75% C++ 10.69% Batchfile 0.61% CMake 0.68% C 0.12% Jupyter Notebook 5.07%

nlp data-loader deep-learning pytorch dataset models

text's Introduction

https://circleci.com/gh/pytorch/text.svg?style=svg

torchtext

WARNING: TorchText development is stopped and the 0.18 release (April 2024) will be the last stable release of the library.

This repository consists of:

torchtext.datasets: The raw text iterators for common NLP datasets
torchtext.data: Some basic NLP building blocks
torchtext.transforms: Basic text-processing transformations
torchtext.models: Pre-trained models
torchtext.vocab: Vocab and Vectors related classes and factory functions
examples: Example NLP workflows with PyTorch and torchtext library.

Installation

We recommend Anaconda as a Python package management system. Please refer to pytorch.org for the details of PyTorch installation. The following are the corresponding torchtext versions and supported Python versions.

Version Compatibility

PyTorch version	torchtext version	Supported Python version
nightly build	main	>=3.8, <=3.11
2.3.0	0.18.0	>=3.8, <=3.11
2.2.0	0.17.0	>=3.8, <=3.11
2.1.0	0.16.0	>=3.8, <=3.11
2.0.0	0.15.0	>=3.8, <=3.11
1.13.0	0.14.0	>=3.7, <=3.10
1.12.0	0.13.0	>=3.7, <=3.10
1.11.0	0.12.0	>=3.6, <=3.9
1.10.0	0.11.0	>=3.6, <=3.9
1.9.1	0.10.1	>=3.6, <=3.9
1.9	0.10	>=3.6, <=3.9
1.8.1	0.9.1	>=3.6, <=3.9
1.8	0.9	>=3.6, <=3.9
1.7.1	0.8.1	>=3.6, <=3.9
1.7	0.8	>=3.6, <=3.8
1.6	0.7	>=3.6, <=3.8
1.5	0.6	>=3.5, <=3.8
1.4	0.5	2.7, >=3.5, <=3.8
0.4 and below	0.2.3	2.7, >=3.5, <=3.8

Using conda:

conda install -c pytorch torchtext

Using pip:

pip install torchtext

Optional requirements

If you want to use English tokenizer from SpaCy, you need to install SpaCy and download its English model:

pip install spacy
python -m spacy download en_core_web_sm

Alternatively, you might want to use the Moses tokenizer port in SacreMoses (split from NLTK). You have to install SacreMoses:

pip install sacremoses

For torchtext 0.5 and below, sentencepiece:

conda install -c powerai sentencepiece

Building from source

To build torchtext from source, you need git, CMake and C++11 compiler such as g++.:

git clone https://github.com/pytorch/text torchtext
cd torchtext
git submodule update --init --recursive

# Linux
python setup.py clean install

# OSX
CC=clang CXX=clang++ python setup.py clean install

# or ``python setup.py develop`` if you are making modifications.

Note

When building from source, make sure that you have the same C++ compiler as the one used to build PyTorch. A simple way is to build PyTorch from source and use the same environment to build torchtext. If you are using the nightly build of PyTorch, checkout the environment it was built with conda (here) and pip (here).

Additionally, datasets in torchtext are implemented using the torchdata library. Please take a look at the installation instructions to download the latest nightlies or install from source.

Documentation

Find the documentation here.

Datasets

The datasets module currently contains:

Language modeling: WikiText2, WikiText103, PennTreebank, EnWik9
Machine translation: IWSLT2016, IWSLT2017, Multi30k
Sequence tagging (e.g. POS/NER): UDPOS, CoNLL2000Chunking
Question answering: SQuAD1, SQuAD2
Text classification: SST2, AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull, IMDB
Model pre-training: CC-100

Models

The library currently consist of following pre-trained models:

RoBERTa: Base and Large Architecture
DistilRoBERTa
XLM-RoBERTa: Base and Large Architure
T5: Small, Base, Large, 3B, and 11B Architecture
Flan-T5: Base, Large, XL, and XXL Architecture

Tokenizers

The transforms module currently support following scriptable tokenizers:

Tutorials

To get started with torchtext, users may refer to the following tutorial available on PyTorch website.

Disclaimer on Datasets

This is a utility library that downloads and prepares public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use the dataset. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license.

If you're a dataset owner and wish to update any part of it (description, citation, etc.), or do not want your dataset to be included in this library, please get in touch through a GitHub issue. Thanks for your contribution to the ML community!

text's People

Contributors

Stargazers

Watchers

Forkers

chagge vyraun bmccann leonweber napsternxg smerity ink-pad binbinbian dvsrepo koustuvsinha mupavan dmitriy-serdyuk dogancan mattmacy terratenney fehiepsi little1tow zhanglae ajaytalati miyyer leezqcst gagb burtenshaw kobikun latkins nelson-liu joyce94 argilla-io smartai alrojo akornilo deeptechlabs mhossny tobby2002 sojvai stevenlol matt-peters jianyuzhan codeaudit atifs aa1607 mainak24 zhangyunyan77 huangyiran blazezhazha woollysocks aotemandess ankithakur47 xkuang wangzhen-nlp czhang99 psavine42 yanhedewang sivareddyg kylegao91 19ai ignaciocases sonalgupta oroszgy hailingc gregorysenay waynedane yongxiongwei ahhegazy siddsach rachtsingh zhongminjin ml-ai-nlp-ir sampwing chenglongchen hengqujushi joseprzmoreno qiao-zhang elanmart entilzha wabyking lingyongyan zbxzc35 ryanleary galsang sinboyxx kelleyyin xiaoduozhou jkr26 kmkurn woailaosang astonzhang shubhampachori12110095 7125messi catcatrun frankatmech bastings heliwang tkim oracle1983 orionr sebastiangehrmann dreamgonfly anupsavvy keitakurita

text's Issues

Length of iterator fails in Python 2

The division len(dataset) / batch_size will be cast to int in python2, so that math.ceil doesn't really work when len(dataset) is not a multiple of batch size.

Example of an embedding loader?

Hi,

is the any chance of an example of a pretrained word embeddings loader?

A single example of how to quickly load say, word2vec or glove, would be really cool. I guess once, people see a common example and use it, it should be straightforward to adapt the loader to other pretrained embeddings.

Thanks a lot 👍

PS - I saw this thread on the opennmt forum, but I couldn't get it to work?

Error while loading csv

This line complains when loading data via csv:

train, val, test = data.TabularDataset.splits(path='/home/data/',train='train.csv',
    validation='val.csv', test='test.csv', format='csv',
    fields=[('text', text_field), ('labels', label_field)])

Error:

    272         if data[-1] == '\n':
    273             data = data[:-1]
--> 274         return cls.fromlist(list(csv.reader([data]))[0])
    275 
    276     @classmethod

TypeError: fromlist() takes exactly 3 arguments (2 given)

Using python 2.7

<unk> token constant

Problem
To reference the '<unk>' token, one needs to rely on the string '<unk>' or vocab.itos[0].

Solution A
The client is able to set the '<unk>' token. The Vocab object throws an error if it is not set.

Solution B
The '<unk>' token needs to be defined as a constant in the Vocab class.

error on import

File "/home/ehoffer/anaconda2/lib/python2.7/site-packages/torchtext-0.1.1-py2.7.egg/torchtext/data.py", line 87
    def build_vocab(self, *args, lower=False, **kwargs):
                                     ^
SyntaxError: invalid syntax

Maybe a python 2.7 vs. 3 issue

How to use pytorch text in the projects

Hi,

I've very limited python knowledge. I couldn't find how to integrate pytorch-text to my projects. When I type from dataloaders.text.torchtext import data to python reply after import torch I got the following error:

>>> from dataloaders.text.torchtext import data
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named dataloaders.text.torchtext

Then I decided, my system don't have the corrresponding package and I tried to install it like torch-vision package. (pip install torchvision) Unfortunatelly i couldn't make it since there isn't such a package. So how could I use this package ?

Cache build_vocab; Shared vocabulary

src.build_vocab(mt_train, max_size=80000) trg.build_vocab(mt_train, max_size=40000)

In the README example, it looks like build_vocab is used twice on the same dataset. For large datasets this could take awhile.

Feature Request: Word+Character-level tokenization

Hi,
Thanks for your awesome work on this, this library looks super useful. I was wondering whether it was possible to tokenize a sequence into both words (list of string) and characters (list of list of 1-len string); from a look through the source code, it doesn't seem supported yet but I may have missed something.

I'd be happy to contribute something to extend torchtext to support this, but I'm not sure what the proper way to handle this would be (ideally it'd be extensible to other tokenization schemes as well, but perhaps that's a stretch). Thoughts?

Thanks!

Splitting up torchtext/data.py into files for each class

As it currently stands, torchtext/data.py has a lot of functionality in one file. I think it'd be nice to split the file into pieces separated by individual functionality, e.g. a file for Fields, a file for Datasets (and their subclasses, of course), a file for Iterators, etc. In particular, I think the structure of torch.nn would work well here.

I think doing this would be a lot clearer in terms of organization of the codebase, but the biggest issue I have with it is that it changes the syntax for the import from from torchtext import data to import torchtext.data as data (much as one would do import torch.nn as nn), rendering the code backwards-incompatible. But i'm not sure how much weight is placed on this, considering the repo isn't on pypi yet and there's a big WIP on the readme...what do you all think?

Special token index is verbose.

Context
Special tokens are frequently used for masking or padding or interpreting the model. It's important in a Encoder/Decoder context that the decoder and encoder share the same indexes for EOS, SOS, and PAD.

Problem
Creating two fields, one for French and one for English, there are no class constants for the index of eos_token. The only way to find out the index of eos_token is per instance of the class (etc. self.stoi[eos_token]).

The code by default is not designed to guarantee that the French dictionary has the same EOS index as the English dictionary.

Possible Solution A
With setting the optional parameter 'eos_token' would it be possible to set 'eos_token_index'?

Possible Solution B
Vocab or Field constant for the index of special tokens.

Batch does not carry index

Use Case:
replace_unk most strategies of replacing tokens rely on aligning with the source sequence before numericialize

Problem:
Using the Batch object, you are unable to retrieve the original text before padding and numericialize.
There are no indexes stored with the batch to retrieve the original text in the dataset.

Quick work around:
Define a field in dataset that is an 'index' field. While building your dataset, pass in indexes for each item.

Batch will then allow you to look up an index attribute.

Building docs

It'd be nice to actually get docs built for people to reference, even while this is still WIP. I'm happy to set up a Sphinx project, but i noticed that pytorch/vision doesn't have docs within the repo, but rather in the main pytorch repo.

Thus, is it preferable to:

Set up a Sphinx project in this repo (and move it over to the main repo when this is less WIP)?
Set up a Sphinx project in the main repo?
Don't even bother with building docs yet?

cc @jekbradbury (@soumith @apaszke may have things they want to say about this as well)

Include Moses Tokenizer

Add interfaces to expose itos and stoi

Maybe I didn't find it through the code, but it'd be nice to expose the token-index translation function through vocabulary interfaces.

Feature Request: Implement sorting within batches

I don't think this is currently implemented -- pardon me if it is. It'd be great to sort within batches, most notably for use with pack_padded_sequence.

Consistency with sorting: torch.RNN default vs torchtext.Iterator default

Torch RNN Default:
Using torch.RNN with batches it is required that the batch is sorted with decreasing lengths.

Default Behavior:*
The default behavior of BucketIterator does not work well with Torch RNN.

    train_iter, dev_iter, test_iter = data.BucketIterator.splits(
        (train, dev, test),
        batch_sizes=(32, 256, 256),
        sort_key=lambda x: data.interleave_keys(-len(x.input), -len(x.output)),
        device=-1)  # Use CPU

By default train_iter is compatible with Torch RNN. All the batches are sorted by decreasing lengths.

But by default dev_iter and test_iter are not sorted by decreasing lengths. This is because train=False then self.sort=True. Then because of this issue, #69 dev_iter and test_iter are actually shuffled.

Possible Solution:
It's possible that solving this issue #69 will solve this one as well but I think these are two separate bugs.

TypeError: BucketIterator object is not an iterator

BucketIterator it be an iterator must also implement the next function.

https://stackoverflow.com/questions/33956034/why-is-an-iterable-object-not-an-iterator

Missing check for whether device is None in data.Field.numericalize

I'm getting the following error when calling data.Field.numericalize without a device argument. It seems like there needs to be a check for whether device is None before entering this context manager.

Traceback (most recent call last):
  File "./train.py", line 102, in <module>
    main()
  File "./train.py", line 83, in main
    for step, batch in enumerate(train_iter):
  File "build/bdist.linux-x86_64/egg/torchtext/data.py", line 579, in __iter__
  File "build/bdist.linux-x86_64/egg/torchtext/data.py", line 482, in __init__
  File "build/bdist.linux-x86_64/egg/torchtext/data.py", line 233, in numericalize
  File "/home/dogan/anaconda2/lib/python2.7/site-packages/torch/cuda/__init__.py", line 132, in __enter__
    torch._C._cuda_setDevice(self.idx)
RuntimeError: invalid argument to setDevice

Bug in Last Commit: Missing function len() in type Example

Hi everyone!

It seems like the last commit of data.py by @jekbradbury introduced a bug. Here's the stacktrace of my code that works with the second last revision (8b5c731)

Traceback (most recent call last):
  File "/data/mulga/oana-2/anaconda3/envs/nips-2017/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/data/mulga/oana-2/anaconda3/envs/nips-2017/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/data/mulga/oana-2/experiments/src/main/python/nips2017/__main__.py", line 207, in main
    exp.run()
  File "/data/mulga/oana-2/experiments/src/main/python/nips2017/experiment/base_experiment.py", line 431, in run
    self._train(epoch)
  File "/data/mulga/oana-2/experiments/src/main/python/nips2017/experiment/experiment_2.py", line 375, in _train
    for iteration, batch in enumerate(self._data.training_iter, 1):
  File "/data/mulga/oana-2/anaconda3/envs/nips-2017/lib/python3.6/site-packages/torchtext/data.py", line 589, in __iter__
    for minibatch in self.batches:
  File "/data/mulga/oana-2/anaconda3/envs/nips-2017/lib/python3.6/site-packages/torchtext/data.py", line 463, in pool
    for p in batch(data, batch_size * 100, batch_size_fn):
  File "/data/mulga/oana-2/anaconda3/envs/nips-2017/lib/python3.6/site-packages/torchtext/data.py", line 442, in batch
    size_so_far += batch_size_fn(ex)
TypeError: object of type 'Example' has no len()

I didn't have a detailed look at it yet, and can provide a minimal example that reproduces the bug later, but I suppose the stacktrace might already be enough for you guys to know what's going on.

Best,
Patrick

min_freq=0 bug

Noticed:

>>>some_field.build_vocab(some_dataset, min_freq=0)
>>>padding_idx = some_field.vocab.stoi['<pad'>]
>>>print(padding_idx, '<pad>')
12 <pad>

Looks like is not equal to 1 which is not okay.

Printed stoi and itos as well:

>>>print(some_field.vocab.stoi)
defaultdict(<function Vocab.__init__.<locals>.<lambda> at 0x103f4f0d0>, {'<pad>': 12, '1': 2, '2': 3, '9': 4, '0': 5, '5': 6, '4': 7, '6': 8, '8': 9, '3': 10, '7': 11, '<unk>': 13})
>>>print(some_field.vocab.itos)
['<unk>', '<pad>', '1', '2', '9', '0', '5', '4', '6', '8', '3', '7', '<pad>', '<unk>']

Possible reason:
Counter subtract does remove the specials but puts their count at 0.
counter.subtract({tok: counter[tok] for tok in ['<unk>'] + specials})

Possible solution:
Throw an error if min_freq < 1

Does Field support numerical features?

Hi,

I am processing a text dataset. Beside the text feature which could be processed by Field in a subclass of Dataset, there is some float numbers which I would like to use with the text. So how could I create a subclass of data.Dataset with appropriate Field?

Thanks,

Consistency with sorting: `sort=True`

Problem:

    train_iter, dev_iter, test_iter = data.BucketIterator.splits(
        (train, dev, test),
        batch_sizes=(32, 256, 256),
        sort_key=lambda x: len(x.input),
        sort=True,
        device=-1)  # Use CPU

If sort=True and train=True, then the train_iter batches are shuffled. This behavior is unexpected.

Cause:
Because by default self.shuffle=True is train=True. Then https://github.com/pytorch/text/blob/master/torchtext/data/iterator.py#L113 shuffle overrides sort.

Possible Solution:
sort=True should override shuffle=None and train=True.

error when x is a string instead of a unicode

this line, six.text_type.lower requires the input x to be of type unicode, so when x is a string, this will complain with an error. what about changing this line to be:

 x = Pipeline(six.text_type.lower)(unicode(x))

any ideas

Randomly initialising word vectors

There doesn't seem to be the option to initialise word vectors without using pretrained embeddings. There's an option to fill in vectors for tokens missing from the pretrained embeddings with normally distributed values. It would be cool if there was a built in option to initialise embeddings from a uniform distribution without having to specify a word embedding file.

Serializing Fields

Is there a canonical way to serialize Fields for later use? (e.g. if you want to load a model, preprocess/numericalize some test data, and then run the model on the test data)?

torch.save won't work on it out of the box since Pipeline and Vocab both have un-pickleable lambda locals. I hackily got around it by just redefining these lambdas as named functions (happy to clean it up and send a PR if you want), but I'm wondering if there's a better way.

Thanks!

BucketIterator vs BPTTIterator Tensor vs Not

BPTTIterator numericalizes the data and turns it into a tensor while BucketIterator does not.

Do you think the behavior here should be consistent? Either the data is always converted to tensors or is not?

Possible bug in LanguageModelingDataset

In the code for LanguageModelingDataset, the original text seems to be pre-processed twice, viz.:

text += text_field.preprocess(line) at line 22
examples = [data.Example.fromlist([text], fields)] at line 26, which in turn calls
setattr(ex, name, field.preprocess(val)) at line 44 of example.py

In fact, if I try to create a simple LanguageModelingDataset, I am getting an error as follows:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/riddasgu/.local/lib/python2.7/site-packages/torchtext/datasets/language_modeling.py", line 28, in __init__
    examples = [data.Example.fromlist([text], fields)]
  File "/home/riddasgu/.local/lib/python2.7/site-packages/torchtext/data/example.py", line 44, in fromlist
    setattr(ex, name, field.preprocess(val))
  File "/home/riddasgu/.local/lib/python2.7/site-packages/torchtext/data/field.py", line 91, in preprocess
    x = self.tokenize(x)
  File "/home/riddasgu/.local/lib/python2.7/site-packages/torchtext/data/field.py", line 63, in <lambda>
    tokenize=(lambda s: s.split()), include_lengths=False,
AttributeError: 'list' object has no attribute 'split'

Include filter in preprocessing pipeline

Now there is filter_pred in Dataset's constructor, but it happens only before the preprocessing step.
Also the way preprocessing is done now (x = preprocess(x)) does not support filtering operations.
So filtering on tokenized or numerized sequences is not supported or very heavy if done in filter_pred.

Bug in tokenization during Field preprocessing

Refer to line - https://github.com/pytorch/text/blob/master/torchtext/data.py#L135.

        if self.sequential and isinstance(x, str):

Note that in python2, isinstance(x, str) returns False when x is unicode. Took a while to realize why tokenization was not happening for me. I guess a simple fix would be isinstance(x, six.string_types). Shall I issue a PR (though this is an extremely tiny change)?

load_vectors does not fail gracefully when the requested #dimensions is not available

While testing out my SQUAD loader I discovered that no matter what dimension is requested for glove.42B load_vectors will download the 300d vectors. And if it's not 300 it will download it over and over ad infinitum:

downloading word vectors from http://nlp.stanford.edu/data/glove.42B.300d.zip
extracting word vectors into .data_cache
downloading word vectors from http://nlp.stanford.edu/data/glove.42B.300d.zip
extracting word vectors into .data_cache
downloading word vectors from http://nlp.stanford.edu/data/glove.42B.300d.zip
extracting word vectors into .data_cache
downloading word vectors from http://nlp.stanford.edu/data/glove.42B.300d.zip

If you're busy I can fix it myself, but it seems pretty silly to have an option to the API that doesn't actually work.

[Discussion] Saving the field object

Usage:
The field object is critical to checkpointing as it provides:

tokenization
padding
numericalize

Having the ability to save the field object allows the user, given arbitrary text, to preprocess the text. The preprocessed text is then used with a checkpointed model. Then the output is predicted and interpreted without the output dictionary.

Problem:
torch.save is implemented with pickle. The field object accepts lambdas for tokenization, preprocessing and postprocessing; therefore, cannot be pickled.

Key Points:

The vocab object needs to be pickled because the output of the model is uninterpretable without it.

Discussion:
What is the right abstraction here?
Should the vocab object be saved and the field object discarded?
Is it appropriate to have the field object and the vocab object closely bound?

Dimension order of batches

It seems the BucketIterator produces batches in WxN dim order rather than NxW dim order, where N is the batch size. Given that PyTorch tends to want data with the batch dimension first, wouldn't it be better to transpose the output? Is there an advantage to the current output? (I suppose it's nice for RNN inputs, but I'm still trying to figure out why PyTorch wants RNN inputs in that form).

import error

I have installed this package in my machine with python 2.7, however, when i import it, it gives me error:

Segmentation fault (core dumped)

Datasets used in tests

Hello,

Can you please provide the datasets that are used in the "tests"?
It references to this path: "~/chainer-research/jmt-data/pos_wsj/pos_wsj" but I can't find the pos_wsj dataset.
Since the docs are not that great, I think the tests will be a good place to start learning.

Also, in the "tests/vocab.py" it references the Glove 300d vectors but it is not whether it is a text file or what?

max_size vocab is not consistent.

Context:
Num field includes the numbers 0 - 9. I set max_size=10. Then I print the vocab that was built:

    num_field.build_vocab(train, max_size=10)
    print(num_field.vocab.itos)
    # ['<unk>', '<pad>', '<s>', '</s>', u'1', u'2']
    print(len(num_field.vocab.itos))
    # 6

Then I checked the words created from tokenization:

print(words)
# [(u'1', 11308), (u'2', 11270), (u'9', 11058), (u'0', 11020), (u'5', 10952), (u'4', 10942), (u'6', 10914), (u'8', 10820), (u'3', 10766), (u'7', 10706), ('</s>', 0), ('<pad>', 0), ('<s>', 0), ('<unk>', 0)]

Looks like the vocab built includes only 6 tokens yet the max_size is 10 while there are 14 possible tokens.

Problem:
If the number of tokens is larger than max_size, build_vocab does not fill up the vocabulary up till max_size.

Possible Solution:
Update https://github.com/pytorch/text/blob/master/torchtext/vocab.py#L129 to not subtract len(self.itos) from max_size.

TypeError: decoding Unicode is not supported

Python Verison:

$ python --version
Python 2.7.13

Error:

Traceback (most recent call last):
  File "examples/sample.py", line 81, in <module>
    fields=[('input', qa_field), ('output', qa_field)])
  File "build/bdist.macosx-10.12-x86_64/egg/torchtext/data/dataset.py", line 56, in splits
  File "build/bdist.macosx-10.12-x86_64/egg/torchtext/data/dataset.py", line 107, in __init__
  File "build/bdist.macosx-10.12-x86_64/egg/torchtext/data/example.py", line 31, in fromTSV
  File "build/bdist.macosx-10.12-x86_64/egg/torchtext/data/example.py", line 44, in fromlist
  File "build/bdist.macosx-10.12-x86_64/egg/torchtext/data/field.py", line 89, in preprocess
  File "build/bdist.macosx-10.12-x86_64/egg/torchtext/data/pipeline.py", line 13, in __call__
  File "build/bdist.macosx-10.12-x86_64/egg/torchtext/data/pipeline.py", line 19, in call
  File "build/bdist.macosx-10.12-x86_64/egg/torchtext/data/field.py", line 89, in <lambda>
TypeError: decoding Unicode is not supported

Possible solution:
Check at line 88 that x is not already a Unicode:
https://github.com/pytorch/text/blob/master/torchtext/data/field.py#L88

Code:
if six.PY2 and isinstance(x, six.string_types) and not isinstance(x, unicode):

A typo in README

In README, under Data/Batching, padding, and numericalizing (including building a vocabulary object), there is one missing closing paren for the following part:

mt_dev = data.TranslationDataset(
    path='data/mt/newstest2014', exts=('.en', '.de'),
    fields=(src, trg)

Fuzzy dictionary request for Vocab stoi class member

I would like to make a feature request for making the Vocab.stoi dictionary (make it an option?) such as the below so that approximate words are matched such as "hellpp" -> "help".

from collections import defaultdict

from fuzzywuzzy import process


class FuzzyDict(defaultdict):
    """ FuzzyDict attempts to pair to find a good key match
        before resorting to the passed default_factory
    """

    def __init__(self, default_factory, threshold=85, process_fn=process.extractOne):
        """

        :default_factory: the factory function that outputs the values for
            keys not in the dictionary
        :params threshold: is the score that the process function should
            output to be accepted as a key
        """
        self.default_factory = default_factory
        self.threshold = threshold
        self.process_fn = process_fn

    def __missing__(self, key):
        """ Handle a key that does not exist in the FuzzyDict """

        if len(self) > 0:
            best_choice, score = self.process_fn(key, self.keys())
            if score > self.threshold:
                return self[best_choice]

        return self.default_factory()

Use case is as follows:

>>> fuzz_dict = FuzzyDict(lambda: 0)
>>> fuzz_dict['Obama'] = 1
>>> fuzz_dict['Omaha']
0
>>> fuzz_dict['Oabama']
1
>>> fuzz_dict['Obama ']
1
>>> fuzz_dict[  'Obama']
1
>>> fuzz_dict['  Obama']
1
>>> fuzz_dict['B Obama']
1
>>> fuzz_dict['B. Obama']
1
>>> fuzz_dict['Barack'] = 2
>>> fuzz_dict['Barak']
2
>>> fuzz_dict['help'] = 5
>>> fuzz_dict['hlep']
0
>>> fuzz_dict['helpp']
5

py2 fails with snli example

Just FYI, seems to fail on master.
Consider adding a travis contbuild like vision or tnt packages to catch these early.

~/local/examples/snli] python train.py
downloading
extracting
Traceback (most recent call last):
  File "train.py", line 22, in <module>
    train, dev, test = datasets.SNLI.splits(inputs, answers)
  File "/home/soumith/local/miniconda2/lib/python2.7/site-packages/torchtext/datasets/snli.py", line 47, in splits
    filter_pred=lambda ex: ex.label != '-')
  File "/home/soumith/local/miniconda2/lib/python2.7/site-packages/torchtext/data.py", line 324, in splits
    train_data = None if train is None else cls(path + train, **kwargs)
  File "/home/soumith/local/miniconda2/lib/python2.7/site-packages/torchtext/data.py", line 398, in __init__
    examples = [make_example(line, fields) for line in f]
  File "/home/soumith/local/miniconda2/lib/python2.7/site-packages/torchtext/data.py", line 232, in fromJSON
    return cls.fromdict(json.loads(data), fields)
  File "/home/soumith/local/miniconda2/lib/python2.7/site-packages/torchtext/data.py", line 241, in fromdict
    setattr(ex, name, field.preprocess(val))
  File "/home/soumith/local/miniconda2/lib/python2.7/site-packages/torchtext/data.py", line 136, in preprocess
    x = Pipeline(str.lower)(x)
  File "/home/soumith/local/miniconda2/lib/python2.7/site-packages/torchtext/data.py", line 30, in __call__
    x = pipe.call(x)
  File "/home/soumith/local/miniconda2/lib/python2.7/site-packages/torchtext/data.py", line 36, in call
    return self.convert_token(x, *args)
TypeError: descriptor 'lower' requires a 'str' object but received a 'unicode'

Segmentation Fault

I did a git clone of this repo, ran python setup.py install, and then when I try to do a simple import torchtext, I keep getting a segmentation fault.
Based on the comments from #11 , I tried re-installing all relevant package dependencies, such as numpy, scipy, nltk, but the error persists. As observed in the referenced issue, I have also tried the following which works fine:

import nltk (even numpy or scipy or matplotlib here works just fine)
import torchtext

Further, as suggested by @soumith in the other thread, I tried using gdb, which gives me the following point where the segfault is actually happening:

Program received signal SIGSEGV, Segmentation fault.
0x00007fffdfeb7fc0 in PyArray_API () from /home/riddhiman/.local/lib/python2.7/site-packages/numpy/core/multiarray.so

Any pointers as to why this is happening, and how to resolve this issue? For now, I am using the workaround of importing numpy and then importing torchtext, but it would be nice to know the real cause.

Restart Iterator from a particular Epoch and Batch

For Machine Learning tasks, it's useful to be able to restart the task at a checkpoint. For the iterator implementation, it looks it's not possible to start the iterator at an arbitrary epoch and batch.

Load data from a list of lists and build a vocabulary

I have a dataset in which each sample is essentially a sequence of sentences. I have parsed the data to form a list of lists (which if I convert to a numpy array would give an ndarray of shape (M,N) ). Most of the functionality from what I could see is for loading dataset directly from files. Is there any way I could load the data from such a list of lists or a numpy array/pytorch Tensor and then use the build_vocab method on that?

Epoch event in Iterator

The iterator implementation allows it to repeat. That is cool.

For Machine Learning, it is typical to evaluate the model at the end of an Epoch. Allow the user to add a function to the iterator that is called at the start of a new Epoch.

Using a field representing real numbers with the iterator

I am trying to learn a regressor on text data and I use torchtext in all my other tasks but I see a problem in using it for this use case.

I define the field for targets as follows:

TARGETS = data.Field(
            sequential=False, tensor_type=torch.DoubleTensor, batch_first=True)
self.fields = [('targets', TARGETS), ('text', TEXT)]
self.train, self.val, self.test = data.TabularDataset.splits(
            path=self.path,
            train=self.train_suffix,
            validation=self.val_suffix,
            test=self.test_suffix,
            format=formatting,
            fields=self.fields)
TEXT.build_vocab(self.train)

I have a file that contains tab separate \t

When I make iterators out of it,

train_iter, val_iter, test_iter = data.Iterator.splits(
                (self.train, self.val, self.test),
                batch_sizes=(self.batch_size, self.test_batch_size,
                             self.test_batch_size),
                sort_key=lambda x: len(x.text),
                shuffle=True)
print(next(iter(train_iter)))

it gives me an error when getting the next batch:

AttributeError: 'Field' object has no attribute 'vocab'

I know this is because I didn't run .build_vocab for the TARGETS field. But why do I really need to do this? What if I just want to get real numbers and compute losses on them?

Any workaround is appreciated. If I am doing something wrong, please let me know too.

[Discussion] Object Design

Background: Looking into including a subword tokenizer. Been having a difficult time figuring out the right abstractions. The problem is a subword tokenizer unlike other tokenizers requires to define its own subword vocabulary.

Thinking about the above, I have a couple discussion questions.

Field object vs Batch object abstraction
The Field object defines instructions for dealing with batches of examples. Should the Batch object handle this responsibility instead? The Field object should only be defined for one example.

Rename the Field object to TextEncoder
Tensor2Tensor defines the functionality of converting text to tensors as a "TextEncoder". "TextEncoder" to me is a clearer object name than "Field". Should we rename the "Field" object?

Reference: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/text_encoder.py

Proper abstraction tokenizers and field objects
The subword tokenizer requires a subword vocabulary. To achieve this, one would need to override the build_vocab method. To implement this, you'd need a separate subword field.

This is not consistent with the moses and spacy tokenizers that do not require separate fields.

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3

Tried opening a text/plain; charset=utf-8 file with torchtext.

# file -i data/simple_questions_wikidata/train.tsv
data/simple_questions_wikidata/train.tsv: text/plain; charset=utf-8

Got this stack trace:

Traceback (most recent call last):
  File "src/jobs/seq2seq/train.py", line 234, in <module>
    fields=[('input', input_field), ('output', output_field)])
  File "/usr/local/lib/python3.5/dist-packages/torchtext/data/dataset.py", line 56, in splits
    train_data = None if train is None else cls(path + train, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torchtext/data/dataset.py", line 107, in __init__
    for line in f]
  File "/usr/local/lib/python3.5/dist-packages/torchtext/data/dataset.py", line 106, in <listcomp>
    make_example(line.decode('utf-8') if six.PY2 else line, fields)
  File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0

Fixed with:
with open(os.path.expanduser(path), encoding='utf-8') as f:

Here: https://github.com/pytorch/text/blob/master/torchtext/data/dataset.py#L104

Understanding vocabulary of text and labels

I have a dataset containing a text field text and corresponding class field lbl, which consists of only two classes, numeric 0 and 1. I am loading the .tsv file as written in the Readme :

text_field = data.Field()
label_field = data.Field()

train, val, test = data.TabularDataset.splits(path='/data/',train='train.tsv',
    validation='val.tsv', test='test.tsv', format='tsv',
    fields=[('text', text_field), ('lbl', label_field)])

I understand that this line builds the vocabulary :

text_field.build_vocab(train,val)
label_field.build_vocab(train,val)

But when I am testing how many vocabulary items I have for label field using len(label_field.vocab). I am not getting the value 2. While the text field vocabulary seems correct (40k+). What am I doing wrong? Is there any way to view the data within text_field and label_field ?

Using Iterator.splits(...) only on a test Dataset object

I have noticed that for the function body of the Iterator.splits() function in torchtext/data.py. It assumes that the first dataset passed is a training dataset and the following are not.

    @classmethod
    def splits(cls, datasets, batch_sizes=None, **kwargs):
        """Create Iterator objects for multiple splits of a dataset.

        Arguments:
            datasets: Tuple of Dataset objects corresponding to the splits. The
                first such object should be the train set.
            batch_sizes: Tuple of batch sizes to use for the different splits,
                or None to use the same batch_size for all splits.
            Remaining keyword arguments: Passed to the constructor of the
                iterator class being used.
        """
        if batch_sizes is None:
            batch_sizes = [kwargs.pop('batch_size')] * len(datasets)
        ret = []
        for i in range(len(datasets)):
            train = i == 0
            ret.append(cls(
                datasets[i], batch_size=batch_sizes[i], train=train, **kwargs))
        return tuple(ret)

If I were to pass train argument (boolean used to make Variables volatile) inside the kwargs there will be a conflict between the two. In my use case, I had another python script to run a test case and I did not want to load all the train, dev data to use the test dataset for evaluation. It will fail from passing two train arguments inside.

test_iter = data.Iterator.splits(
    (test, ), batch_size=args.batch_size, device=args.gpu, repeat=False, train=False)[0]

I am not sure if this even needs attention, but I thought I might as well post it.

-- Edit
Closed because I should just initialize it with Iterator itself. Did not notice because I followed an example...

[Feature Request] parameter for detokenization in fields

The ability to specify how to detokenize a tensor using the field.

Particularly this came up when using https://github.com/google/sentencepiece. Would like the ability to add subword units together as part of field. A counter part to tokenize.

By default numericalize should have an opposite counter part to denumericalize.

Error on loading json data

The following code is giving me an AttributeError: 'list' object has no attribute 'items' error:

sent_feats = data.Field()
sequences = data.TabularDataset(path=captions_path, format="json", \
                    fields=[{'caption' : ("sentences", sent_feats)}])

My dict has the following structure:

{ seq_id : { caption_id : { caption: "" } } }

Is there any syntactical error I am making?