Hi, I've very limited python knowledge. I couldn't find how to integ

How to use pytorch text in the projects about text HOT 12 CLOSED

pytorch commented on September 18, 2024

How to use pytorch text in the projects

Comments (12)

jekbradbury commented on September 18, 2024

Does the change in #52 solve your problem? You should be able to try it locally.

nelson-liu commented on September 18, 2024

It's not on PyPI yet, so you can't pip install it. In your terminal, cd to the location of the torchtext code on your disk (download or clone the repo first), then run: python setup.py install
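
Once the install finishes, a quick sanity check is to try importing the package. The helper below is a minimal sketch using only the standard library; `is_importable` is a hypothetical name introduced here for illustration, not part of torchtext.

```python
import importlib


def is_importable(module_name):
    """Return True if `module_name` can be imported in this environment."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True
```

Calling `is_importable("torchtext")` after running python setup.py install should return True if the install succeeded for the interpreter you are using.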

mambuDL commented on September 18, 2024

Thanks for the reply. It solved my problem. Unfortunately, now I have another one. I tried to use the translation.py file inside the tests folder. To do so, I downloaded the same dataset and put it in a folder. In the code I only changed the path at line 25 so that it points to where I downloaded the data. I got the following error:

$ python translation.py
    Warning: no model found for 'de'
    Only loading the 'de' tokenizer.

    Warning: no model found for 'en'
    Only loading the 'en' tokenizer.

Traceback (most recent call last):
  File "translation.py", line 27, in <module>
    fields=(DE, EN))
  File "build/bdist.linux-x86_64/egg/torchtext/data/dataset.py", line 56, in splits
  File "build/bdist.linux-x86_64/egg/torchtext/datasets/translation.py", line 35, in __init__
    EN.build_vocab(train.trg, max_size=50000)
  File "build/bdist.linux-x86_64/egg/torchtext/data/example.py", line 44, in fromlist
  File "build/bdist.linux-x86_64/egg/torchtext/data/field.py", line 83, in preprocess
  File "translation.py", line 14, in tokenize_de
    return [tok.text for tok in spacy_de.tokenizer(url.sub('@URL@', text))]
TypeError: Argument 'string' has incorrect type (expected unicode, got str)

This is the modification I made:

train, val = datasets.TranslationDataset.splits(
    path='~/myproject/data/de-en/', train='train.tags.de-en',
    validation='IWSLT16.TED.tst2013.de-en', exts=('.de', '.en'),
    fields=(DE, EN))

Running ls under ~/myproject/data/de-en gives:

IWSLT16.TED.dev2010.de-en.de.xml
IWSLT16.TED.dev2010.de-en.en.xml
IWSLT16.TED.tst2010.de-en.de.xml
IWSLT16.TED.tst2010.de-en.en.xml
IWSLT16.TED.tst2011.de-en.de.xml
IWSLT16.TED.tst2011.de-en.en.xml
IWSLT16.TED.tst2012.de-en.de.xml
IWSLT16.TED.tst2012.de-en.en.xml
IWSLT16.TED.tst2013.de-en.de.xml
IWSLT16.TED.tst2013.de-en.en.xml
IWSLT16.TED.tst2014.de-en.de.xml
IWSLT16.TED.tst2014.de-en.en.xml
IWSLT16.TEDX.dev2012.de-en.de.xml
IWSLT16.TEDX.dev2012.de-en.en.xml
IWSLT16.TEDX.tst2013.de-en.de.xml
IWSLT16.TEDX.tst2013.de-en.en.xml
IWSLT16.TEDX.tst2014.de-en.de.xml
IWSLT16.TEDX.tst2014.de-en.en.xml
README
train.en
train.tags.de-en.de
train.tags.de-en.en
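
To compare the splits call with the directory listing, it can help to spell out which files each prefix would point to. The sketch below assumes each file name is formed by appending an extension from exts to the train or validation prefix inside the data directory. The thread does not confirm that this is how torchtext combines them, and `expected_files` and `missing_files` are hypothetical helpers, not library functions.

```python
import os


def expected_files(path, prefix, exts):
    """Build the file paths that a prefix plus extensions would point to.

    Assumption: each file name is prefix + extension, located inside the
    (user-expanded) data directory.
    """
    root = os.path.expanduser(path)
    return [os.path.join(root, prefix + ext) for ext in exts]


def missing_files(paths):
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]


# Values copied from the splits call above.
data_dir = '~/myproject/data/de-en/'
exts = ('.de', '.en')
train_files = expected_files(data_dir, 'train.tags.de-en', exts)
val_files = expected_files(data_dir, 'IWSLT16.TED.tst2013.de-en', exts)
```

Passing `train_files + val_files` to `missing_files` lists any candidate path that is absent on disk, which can then be checked against the ls output.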

nelson-liu commented on September 18, 2024

I'm presuming you're running this on Python 2, so you'll want to convert the string to the unicode type before tokenizing it. Either 1) run this on Python 3, or 2) convert the strings to unicode beforehand with six.text_type (you'll need to pip install six to use it; there's an example here).

@jekbradbury, perhaps it'd be worth converting to unicode in the preprocess function?
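
The conversion suggested above can be sketched roughly as follows. To stay dependency-free, the sketch defines its own `text_type` (six.text_type is `unicode` on Python 2 and `str` on Python 3). `to_text` and `tokenize` are hypothetical names; `tokenize` uses a plain whitespace split as a stand-in for the spaCy tokenizer call in translation.py, and UTF-8 is an assumed encoding.

```python
import sys

# Same meaning as six.text_type: unicode on Python 2, str on Python 3.
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821


def to_text(value, encoding="utf-8"):
    """Return `value` as a text (unicode) string, decoding bytes if needed."""
    if isinstance(value, bytes):
        # On Python 2, `bytes` is `str`, so plain byte strings land here.
        return value.decode(encoding)
    return text_type(value)


def tokenize(text):
    """Convert to text first, then tokenize (whitespace split as a stand-in)."""
    return to_text(text).split()
```

Converting at the top of the tokenizer keeps the rest of the pipeline working on a single string type, whichever Python version is in use.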

jekbradbury commented on September 18, 2024

Yeah, looks like that's an oversight since we do it in the lower=True case.

nelson-liu commented on September 18, 2024

@mambuDL if you pull from master, rerun python setup.py install, and try what you did again, it should work out.

mambuDL commented on September 18, 2024

@nelson-liu @jekbradbury Thanks, I applied what you suggested. That error is gone now, but a new one has appeared :)

This is the traceback I get when I run python translation.py:

python translation.py
    Warning: no model found for 'de'
    Only loading the 'de' tokenizer.

    Warning: no model found for 'en'
    Only loading the 'en' tokenizer.

Traceback (most recent call last):
  File "translation.py", line 27, in <module>
    fields=(DE, EN))
  File "build/bdist.linux-x86_64/egg/torchtext/data/dataset.py", line 56, in splits
  File "build/bdist.linux-x86_64/egg/torchtext/datasets/translation.py", line 35, in __init__
    EN.build_vocab(train.trg, max_size=50000)
  File "build/bdist.linux-x86_64/egg/torchtext/data/example.py", line 44, in fromlist
  File "build/bdist.linux-x86_64/egg/torchtext/data/field.py", line 89, in preprocess
  File "build/bdist.linux-x86_64/egg/torchtext/data/pipeline.py", line 13, in __call__
  File "build/bdist.linux-x86_64/egg/torchtext/data/pipeline.py", line 19, in call
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 19: ordinal not in range(128)

nelson-liu commented on September 18, 2024

Hmm, looks like the text needs to be decoded as UTF-8. Not sure whether it's better to do that in Field.preprocess or while reading the translation dataset (with io.open instead of open).
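
As a rough illustration of the io.open option: on Python 2, io.open with an explicit encoding returns unicode strings, so no implicit ASCII decode happens later in the pipeline. The byte 0xc3 in the error is the lead byte of many two-byte UTF-8 sequences, such as those for ü and ö. `read_lines` and `write_sample` are hypothetical helpers, and UTF-8 is assumed as the file encoding.

```python
import io
import os
import tempfile


def read_lines(path, encoding="utf-8"):
    """Read a text file and return its lines as unicode strings."""
    with io.open(path, mode="r", encoding=encoding) as f:
        return [line.rstrip("\n") for line in f]


def write_sample(path):
    """Write two UTF-8 encoded lines, one containing a non-ASCII character."""
    with open(path, "wb") as f:
        f.write(u"Sch\u00f6nen Tag\nGuten Morgen\n".encode("utf-8"))


# Round trip through a temporary file.
tmp = tempfile.NamedTemporaryFile(suffix=".de", delete=False)
tmp.close()
write_sample(tmp.name)
sample_lines = read_lines(tmp.name)
os.remove(tmp.name)
```

The same behaviour is available on Python 3 through the built-in open, which io.open aliases there.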

mambuDL commented on September 18, 2024

Does that mean that, if I stay on Python 2, I should give up on this package for now and write my own code to read the text dataset? Or are you planning to fix it soon?

PetrochukM commented on September 18, 2024

Following up on this. What is the release timeline? When will this project be released to PyPI?

marikgoldstein commented on September 18, 2024

It's up on pip but it's one of the 0.1.x versions. Cloning and running python setup.py install gives the most recent version with many working dataset module features.

marikgoldstein commented on September 18, 2024

@jekbradbury I left a comment in pull #52. Seems like there is still an ascii vs. UTF-8 issue. I commented there because this issue thread is a mix of a few issues.
