The way you preprocess data is different from that of Transformer-XL (adaptive-span issue, CLOSED)

facebookresearch commented on June 15, 2024

The way you preprocess data is different from that of Transformer-XL

Comments (5)

lucaslingle commented on June 15, 2024

In case anyone else was wondering about this:

The preprocessing script prep_enwik8.py used by Transformer-XL keeps newlines intact when it writes the characters back out to a separate file. Then, when vocab.encode_file is used, the code here reads in the preprocessed data line-by-line, keeping the ending \n appended to each line. Tokenization subsequently treats \n as a separate token.

Thus, \n tokens are not dropped by the Transformer-XL preprocessing, and no <eos> character is needed.
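To make the behavior described above concrete, here is a minimal sketch. It is not the actual Transformer-XL code: `encode_keep_newline` is a hypothetical helper, and it assumes plain character-level tokenization of raw text. It only illustrates that reading line by line while keeping the trailing \n leaves every character, newlines included, in the token stream.

```python
import io


def encode_keep_newline(text):
    """Hypothetical helper: read text line by line and emit one token per
    character, keeping the trailing newline of each line as its own token."""
    tokens = []
    # Iterating a file-like object yields lines with their '\n' still attached.
    for line in io.StringIO(text):
        tokens.extend(line)  # a str is iterated character by character
    return tokens


sample = "ab\ncd\n"
tokens = encode_keep_newline(sample)
print(tokens)
# No character is lost, so the token count equals the character count.
print(len(tokens) == len(sample))
```

Under these assumptions, the newline plays the role that `<eos>` plays in other loaders.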

ajoulin commented on June 15, 2024

Hi,

enwik8 and text8 were introduced in the context of data compression, where every character must be considered, including the end of line (\n replaced by <eos> in our code).

Example of a previous dataloader used for character level language modeling on enwik8:
https://github.com/salesforce/awd-lstm-lm/blob/master/data.py
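As a rough illustration of the point about counting every character, here is a minimal sketch of replacing each \n with an `<eos>` token at the character level. It is not the adaptive-span or awd-lstm-lm code: `encode_with_eos` and the `EOS` constant are hypothetical names chosen for this example.

```python
EOS = "<eos>"  # assumed end-of-line symbol, named after the token in this thread


def encode_with_eos(text):
    """Hypothetical helper: one token per character, with every newline
    replaced by the EOS symbol rather than dropped."""
    return [EOS if ch == "\n" else ch for ch in text]


sample = "ab\ncd\n"
tokens = encode_with_eos(sample)
print(tokens)
# The end of line still costs one token, so the token count matches the
# number of characters a compressor would have to encode.
print(len(tokens) == len(sample))
```

If a loader instead stripped the newline and added nothing back, the token count would fall short of the character count by the number of lines.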

yzh119 commented on June 15, 2024

Thanks for the explanation. It's odd that Transformer-XL does not add eos for these two datasets, considering you use exactly the same prep_enwik8.py to preprocess the data.

ajoulin commented on June 15, 2024

Note that prep_enwik8.py originally comes from the repository I linked in my previous message:

https://github.com/salesforce/awd-lstm-lm/tree/master/data/enwik8

yzh119 commented on June 15, 2024

Right, I mean that even though both use the same script, the way Transformer-XL loads data is still different from yours and awd-lstm's.
It seems that including eos is the right way in the context of compression.
