Comments (12)
Does the change in #52 solve your problem? You should be able to try it locally.
from text.
It's not on PyPI yet, so you can't pip install it. In your terminal, cd to the location of the torchtext code on your disk (download or clone the repo first), then run: python setup.py install
Thanks for the reply. It solved my problem. Unfortunately, now I have another one. I tried to use the translation.py file inside the tests folder. To do so, I downloaded the same dataset and put it in a folder. In the code I only changed the path at line 25 so that it points to where I downloaded the data. I got the following error:
$ python translation.py
Warning: no model found for 'de'
Only loading the 'de' tokenizer.
Warning: no model found for 'en'
Only loading the 'en' tokenizer.
Traceback (most recent call last):
File "translation.py", line 27, in <module>
fields=(DE, EN))
File "build/bdist.linux-x86_64/egg/torchtext/data/dataset.py", line 56, in splits
File "build/bdist.linux-x86_64/egg/torchtext/datasets/translation.py", line 35, in __init__
EN.build_vocab(train.trg, max_size=50000)
File "build/bdist.linux-x86_64/egg/torchtext/data/example.py", line 44, in fromlist
File "build/bdist.linux-x86_64/egg/torchtext/data/field.py", line 83, in preprocess
File "translation.py", line 14, in tokenize_de
return [tok.text for tok in spacy_de.tokenizer(url.sub('@URL@', text))]
TypeError: Argument 'string' has incorrect type (expected unicode, got str)
This is the modification I made:
train, val = datasets.TranslationDataset.splits(
path='~/myproject/data/de-en/', train='train.tags.de-en',
validation='IWSLT16.TED.tst2013.de-en', exts=('.de', '.en'),
fields=(DE, EN))
When I run ls under ~/myproject/data/de-en, this is the result:
IWSLT16.TED.dev2010.de-en.de.xml
IWSLT16.TED.tst2010.de-en.en.xml
IWSLT16.TED.tst2012.de-en.de.xml
IWSLT16.TED.tst2013.de-en.en.xml
IWSLT16.TEDX.dev2012.de-en.de.xml
IWSLT16.TEDX.tst2013.de-en.en.xml
README
train.tags.de-en.en
IWSLT16.TED.dev2010.de-en.en.xml
IWSLT16.TED.tst2011.de-en.de.xml
IWSLT16.TED.tst2012.de-en.en.xml
IWSLT16.TED.tst2014.de-en.de.xml
IWSLT16.TEDX.dev2012.de-en.en.xml
IWSLT16.TEDX.tst2014.de-en.de.xml
train.en
IWSLT16.TED.tst2010.de-en.de.xml
IWSLT16.TED.tst2011.de-en.en.xml
IWSLT16.TED.tst2013.de-en.de.xml
IWSLT16.TED.tst2014.de-en.en.xml
IWSLT16.TEDX.tst2013.de-en.de.xml
IWSLT16.TEDX.tst2014.de-en.en.xml
train.tags.de-en.de
I'm presuming you're running this on Python 2, so you're going to want to convert the string to the unicode type before tokenizing it. Either 1) run this on Python 3, or 2) convert the strings to unicode beforehand with six.text_type (you'll want to pip install six to use it; there's an example here).
@jekbradbury, perhaps it'd be worth converting to unicode in the preprocess function?
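A minimal sketch of option 2, assuming the tokenizer requires text (unicode) input and the data is UTF-8 encoded. It uses only the standard library so that it runs on both Python 2 and Python 3; `text_type` here is bound to the same type that `six.text_type` names. The helpers `ensure_text` and `tokenize` are hypothetical, and whitespace splitting stands in for the spaCy tokenizer.

```python
# unicode on Python 2, str on Python 3 (the type six.text_type refers to)
text_type = type(u"")


def ensure_text(s, encoding="utf-8"):
    """Return s as text, decoding byte strings with the given encoding.

    The UTF-8 default is an assumption about the dataset.
    """
    if isinstance(s, text_type):
        return s
    return s.decode(encoding)


def tokenize(s):
    # Stand-in for a tokenizer that only accepts text input:
    # convert first, then split on whitespace.
    return ensure_text(s).split()


raw = b"Das ist sch\xc3\xb6n"  # UTF-8 bytes for "Das ist schön"
tokens = tokenize(raw)
print(tokens)
```

Converting at the start of the tokenizer keeps the rest of the pipeline working with a single string type, whichever Python version is in use.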
Yeah, looks like that's an oversight since we do it in the lower=True case.
@mambuDL if you pull from master, rerun python setup.py install, and try what you did again, it should work out.
@nelson-liu @jekbradbury Thanks, I applied what you suggested. Although that error is gone now, new ones have appeared :)
This is the traceback I got when I ran python translation.py:
python translation.py
Warning: no model found for 'de'
Only loading the 'de' tokenizer.
Warning: no model found for 'en'
Only loading the 'en' tokenizer.
Traceback (most recent call last):
File "translation.py", line 27, in <module>
fields=(DE, EN))
File "build/bdist.linux-x86_64/egg/torchtext/data/dataset.py", line 56, in splits
File "build/bdist.linux-x86_64/egg/torchtext/datasets/translation.py", line 35, in __init__
EN.build_vocab(train.trg, max_size=50000)
File "build/bdist.linux-x86_64/egg/torchtext/data/example.py", line 44, in fromlist
File "build/bdist.linux-x86_64/egg/torchtext/data/field.py", line 89, in preprocess
File "build/bdist.linux-x86_64/egg/torchtext/data/pipeline.py", line 13, in __call__
File "build/bdist.linux-x86_64/egg/torchtext/data/pipeline.py", line 19, in call
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 19: ordinal not in range(128)
Hmm, looks like it's necessary to encode to UTF-8... not sure if it's better to do that in Field.preprocess, or while reading the translation dataset (with io.open instead of open).
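A sketch of the second option, reading the file as text with io.open. On both Python 2 and Python 3, io.open accepts an encoding argument and yields unicode lines, whereas Python 2's built-in open yields byte strings. The temporary file and its contents are made up for illustration, and UTF-8 is assumed to be the dataset's encoding.

```python
import io
import os
import tempfile

# Write a small UTF-8 file to stand in for one side of the dataset.
fd, path = tempfile.mkstemp(suffix=".de")
os.close(fd)
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"Gr\u00fc\u00dfe aus K\u00f6ln\n")

# Binary mode returns raw bytes. Non-ASCII characters such as "ü" are
# stored as multi-byte sequences starting with 0xc3, which the ascii
# codec cannot decode.
with io.open(path, "rb") as f:
    raw = f.readline()
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Text mode with an explicit encoding decodes each line while reading.
with io.open(path, encoding="utf-8") as f:
    lines = [line.strip() for line in f]

os.remove(path)
print(ascii_ok, lines)
```

Decoding at read time means every downstream step, including Field.preprocess, only ever sees text, so no implicit ASCII decoding is attempted later.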
Does this mean I should give up on using this package and write my own code to read the text dataset for now if I use Python 2, or are you planning to fix it soon?
Following up on this. What is the release timeline? When will this project be released to PyPI?
It's up on pip but it's one of the 0.1.x versions. Cloning and running python setup.py install gives the most recent version with many working dataset module features.
@jekbradbury I left a comment in pull #52. Seems like there is still an ascii vs. UTF-8 issue. I commented there because this issue thread is a mix of a few issues.