Comments (2)
Hello @520jefferson, there are examples at https://github.com/google/sentencepiece showing how to create custom vocabularies for T5. Once the custom vocab has been created as an .spm file, you can load it as a tokenizer with Hugging Face.
Please let us know exactly what you want to achieve, and we can provide further guidance.
Thanks for the reply; my code and vocab are as follows.
1. The vocab.txt (vocab.json is manually constructed from vocab.txt) and merges.txt are uploaded to Google Drive:
vocab.txt:https://drive.google.com/file/d/10jC8L_-RDLRv5QkAato8nJWGU1UQQcz1/view?usp=sharing
vocab.json:https://drive.google.com/file/d/1e5Ll0bAHhikhnYV5XaW3NB8aTSWdCvnC/view?usp=sharing
merges.txt:https://drive.google.com/file/d/1ifXlQaYod_kobqgNe82tmHTtHpxBYnBq/view?usp=sharing
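For reference, a minimal sketch of how a vocab.json like this can be derived from vocab.txt, assuming one token per line mapped to sequential ids (the `WordLevel` model expects a `{token: id}` mapping). The toy vocab file and the special-token names are assumptions for illustration:

```python
import json

# Toy vocab.txt standing in for the real file (one token per line);
# the special-token names are assumptions.
with open("vocab.txt", "w", encoding="utf-8") as f:
    f.write("<pad>\n</s>\n<unk>\n你\n觉得\n辛苦\n")

# Map each token to a sequential id, as WordLevel expects {token: id}.
with open("vocab.txt", encoding="utf-8") as f:
    tokens = [line.rstrip("\n") for line in f if line.strip()]
vocab = {tok: i for i, tok in enumerate(tokens)}

with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False)
```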
2. The sentences for training, validation, and test look like this (after BPE; tokens are split by " "):
你 觉得 大人 辛苦 还是 学生 辛苦 都 很 辛苦
头条 文章 没 啥 违规 , 却 被 小@@ 浪@@ 浪 屏蔽 了 , 而且 删 了 先生 的 转发 评价 , 农历 新年 将 至 , 俺 不想 发火 , 行 , 俺 再 发 一遍 ! 怎么 删 了 , 还 没 看 呢
专辑 有 签名 么 ? ! … 没有 机会 去 签@@ 售@@ 会 啦 幸好 里面 的 容 和 小 卡片 有 签名
你 帮 我 买 东西 吗 你 给钱 我 , 当然 帮 你 买 耶
你 说 那个 早晨 喝 那个 水有 什么 好处 可以 提高 睡眠 质量 养成 良好 的 睡眠 时间 和 习惯 慢慢 养成 早睡早起 的 习惯 , 习@@ 惯@@ 成@@ 自然
求个 风景 超 美的 网游 最好 是 韩国 的 剑侠情缘 叁
现在 百度 帐号 是 不能 拿 邮箱 注册 了 么 ? 只能 拿 手机号 了 么 ? 如果 可以 应该 怎么 拿 邮箱 注册 ? 谢谢 ! 先 用 手机 注册 , 然后 绑定 一个 邮箱 , 再@@ 解 绑 手机 即可
咱们 出去 转 会儿 遛@@ 弯@@ 儿 去 呗 我 在 工@@ 体 的 漫 咖啡 , 要 不要 来 坐 会儿
我 知道 最近 做 什么 准备 演唱会 的 事 吧
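As an aside, the `@@` in these samples is the subword-nmt style continuation marker, so detokenizing a line back to plain text is just a matter of removing the `@@ ` sequences; a minimal sketch:

```python
# subword-nmt style BPE marks non-final subwords with a trailing "@@";
# joining them back is a simple string replacement.
def merge_bpe(line: str) -> str:
    return line.replace("@@ ", "")

print(merge_bpe("习@@ 惯@@ 成@@ 自然"))  # → 习惯成自然
```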
3. I want to use the tokenizer to load the vocab, tokenize my sentences, and feed them to the T5 model.
The model is loaded like this (config: https://drive.google.com/file/d/1WOb-gqjkt1m6GBTFeq4wOWS3dW3Qt1oK/view?usp=sharing):
```python
from transformers import T5Config, T5ForConditionalGeneration

config = T5Config.from_json_file(config_file)
model = T5ForConditionalGeneration(config)
```
The tokenizer is loaded like this:

```python
from tokenizers.models import WordLevel
from transformers import PreTrainedTokenizerFast

vocab = WordLevel.from_file("vocab.json", "")
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=vocab)
fast_tokenizer.encode("你 觉得 大人 辛苦 还是 学生 辛苦 都 很 辛苦")
```
Then I hit this error: `AttributeError: 'tokenizers.models.WordLevel' object has no attribute 'truncation'`.
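For what it's worth, that error occurs because `PreTrainedTokenizerFast`'s `tokenizer_object` argument expects a full `tokenizers.Tokenizer`, not a bare `WordLevel` model. A minimal sketch of the likely fix, using a toy vocab in place of the real vocab.json (the `<unk>`/`<pad>`/`</s>` token names are assumptions about the vocab's contents):

```python
import json
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import WhitespaceSplit
from transformers import PreTrainedTokenizerFast

# Toy vocab standing in for the real vocab.json; the special-token
# names and ids here are assumptions.
vocab = {"<pad>": 0, "</s>": 1, "<unk>": 2, "你": 3, "觉得": 4, "辛苦": 5}
with open("toy_vocab.json", "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False)

# Wrap the WordLevel model in a Tokenizer -- PreTrainedTokenizerFast
# expects a tokenizers.Tokenizer, which is what caused the AttributeError.
tokenizer = Tokenizer(WordLevel.from_file("toy_vocab.json", unk_token="<unk>"))
tokenizer.pre_tokenizer = WhitespaceSplit()  # sentences are pre-split on spaces

fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="<unk>", pad_token="<pad>", eos_token="</s>",
)
print(fast_tokenizer.encode("你 觉得 辛苦"))  # word-level ids
```

With the `Tokenizer` wrapper in place, `truncation`, `padding`, and the rest of the fast-tokenizer API become available.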
So I want to load the vocab into a tokenizer and use it like this:

```python
source = tokenizer.batch_encode_plus(
    [source_text],
    max_length=75,
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)
```

returning

```python
{
    "source_ids": source_ids.to(dtype=torch.long),
    "source_mask": source_mask.to(dtype=torch.long),
    "target_ids": target_ids.to(dtype=torch.long),
    "target_ids_y": target_ids.to(dtype=torch.long),
}
```

then pass the tokenizer output to the model and train it like a translation task. How should I do this?
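A minimal end-to-end sketch of one such training step, under stated assumptions: a toy word-level vocab stands in for the real vocab.json, a tiny `T5Config` stands in for the real config.json, and the special-token names and ids are illustrative only. The key points are that `vocab_size` must match the custom vocab and that pad positions in the labels are set to `-100` so the loss ignores them:

```python
import torch
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import WhitespaceSplit
from transformers import PreTrainedTokenizerFast, T5Config, T5ForConditionalGeneration

# Toy word-level vocab; token names and ids are assumptions.
vocab = {"<pad>": 0, "</s>": 1, "<unk>": 2, "你": 3, "觉得": 4,
         "辛苦": 5, "都": 6, "很": 7}
tok = Tokenizer(WordLevel(vocab, unk_token="<unk>"))
tok.pre_tokenizer = WhitespaceSplit()
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tok,
    unk_token="<unk>", pad_token="<pad>", eos_token="</s>",
)

# Tiny T5 standing in for the real config.json; vocab_size must match
# the custom vocab, and pad id 0 matches T5's default decoder start token.
config = T5Config(vocab_size=len(vocab), d_model=32, d_ff=64, d_kv=16,
                  num_layers=2, num_heads=2, decoder_start_token_id=0)
model = T5ForConditionalGeneration(config)

src = tokenizer(["你 觉得 辛苦"], max_length=16, padding="max_length",
                truncation=True, return_tensors="pt")
tgt = tokenizer(["都 很 辛苦"], max_length=16, padding="max_length",
                truncation=True, return_tensors="pt")

labels = tgt.input_ids.clone()
labels[labels == tokenizer.pad_token_id] = -100  # pad ignored by the loss

out = model(input_ids=src.input_ids,
            attention_mask=src.attention_mask,
            labels=labels)
out.loss.backward()  # one training step's backward pass
```

From here, a normal optimizer loop (or the HF `Trainer` / `Seq2SeqTrainer`) over (source, target) pairs trains the model exactly like a translation task; passing `labels=` makes `T5ForConditionalGeneration` handle the decoder input shifting internally.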