
Comments (2)

mwitiderrick commented on September 27, 2024

Hello @520jefferson, there are examples at https://github.com/google/sentencepiece showing how to create custom vocabularies for T5. Once the custom vocab has been created as an spm file, you can load it as a tokenizer with Hugging Face (HF).

Please let us know exactly what you want to achieve, and we can provide further guidance.


520jefferson commented on September 27, 2024

@mwitiderrick

Thanks for the reply. My code and vocab are as follows.

1. I uploaded vocab.txt, vocab.json (manually constructed from vocab.txt), and merges.txt to Google Drive:
vocab.txt: https://drive.google.com/file/d/10jC8L_-RDLRv5QkAato8nJWGU1UQQcz1/view?usp=sharing
vocab.json: https://drive.google.com/file/d/1e5Ll0bAHhikhnYV5XaW3NB8aTSWdCvnC/view?usp=sharing
merges.txt: https://drive.google.com/file/d/1ifXlQaYod_kobqgNe82tmHTtHpxBYnBq/view?usp=sharing
2. The sentences for training, validation, and test look like this (after BPE, tokens separated by " "):
你 觉得 大人 辛苦 还是 学生 辛苦 都 很 辛苦
头条 文章 没 啥 违规 , 却 被 小@@ 浪@@ 浪 屏蔽 了 , 而且 删 了 先生 的 转发 评价 , 农历 新年 将 至 , 俺 不想 发火 , 行 , 俺 再 发 一遍 ! 怎么 删 了 , 还 没 看 呢
专辑 有 签名 么 ? ! … 没有 机会 去 签@@ 售@@ 会 啦 幸好 里面 的 容 和 小 卡片 有 签名
你 帮 我 买 东西 吗 你 给钱 我 , 当然 帮 你 买 耶
你 说 那个 早晨 喝 那个 水有 什么 好处 可以 提高 睡眠 质量 养成 良好 的 睡眠 时间 和 习惯 慢慢 养成 早睡早起 的 习惯 , 习@@ 惯@@ 成@@ 自然
求个 风景 超 美的 网游 最好 是 韩国 的 剑侠情缘 叁
现在 百度 帐号 是 不能 拿 邮箱 注册 了 么 ? 只能 拿 手机号 了 么 ? 如果 可以 应该 怎么 拿 邮箱 注册 ? 谢谢 ! 先 用 手机 注册 , 然后 绑定 一个 邮箱 , 再@@ 解 绑 手机 即可
咱们 出去 转 会儿 遛@@ 弯@@ 儿 去 呗 我 在 工@@ 体 的 漫 咖啡 , 要 不要 来 坐 会儿
我 知道 最近 做 什么 准备 演唱会 的 事 吧

3. I want to use a tokenizer that loads the vocab, tokenizes my sentences, and passes the result to the T5 model.
Load the model like this (config: https://drive.google.com/file/d/1WOb-gqjkt1m6GBTFeq4wOWS3dW3Qt1oK/view?usp=sharing):
from transformers import T5Config, T5ForConditionalGeneration
config = T5Config.from_json_file(config_file)
model = T5ForConditionalGeneration(config)

Load the tokenizer:
from tokenizers.models import WordLevel
from transformers import PreTrainedTokenizerFast
vocab = WordLevel.from_file("vocab.json","")
fast_tokenizer=PreTrainedTokenizerFast(tokenizer_object=vocab)
fast_tokenizer.encode("你 觉得 大人 辛苦 还是 学生 辛苦 都 很 辛苦")

Then I got this error: AttributeError: 'tokenizers.models.WordLevel' object has no attribute 'truncation'
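One likely cause of this error is that PreTrainedTokenizerFast expects a full tokenizers.Tokenizer as tokenizer_object, while the code above passes only the WordLevel model. The following is a minimal sketch of wrapping the model first, not a confirmed fix for this setup. The toy vocabulary, the special-token names (<pad>, </s>, <unk>), and the choice of WhitespaceSplit as pre-tokenizer are assumptions made for illustration; the real vocab.json may use different tokens.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import WhitespaceSplit
from transformers import PreTrainedTokenizerFast

# Hypothetical toy vocabulary standing in for the real vocab.json.
toy_vocab = {"<pad>": 0, "</s>": 1, "<unk>": 2, "你": 3, "觉得": 4, "辛苦": 5}

# Wrap the WordLevel model in a Tokenizer; the wrapper is what
# PreTrainedTokenizerFast needs (it carries truncation/padding state).
backend = Tokenizer(WordLevel(vocab=toy_vocab, unk_token="<unk>"))

# The sentences are already segmented, so split on whitespace only.
backend.pre_tokenizer = WhitespaceSplit()

fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=backend,
    unk_token="<unk>",
    pad_token="<pad>",
    eos_token="</s>",
)

ids = fast_tokenizer.encode("你 觉得 辛苦")
print(ids)
```

With the real file, WordLevel.from_file("vocab.json", unk_token="<unk>") could replace the in-memory dictionary, provided the unknown token actually exists in the vocabulary. Note that this sketch adds no end-of-sequence token, which T5 training usually expects, so a post-processor may still be needed.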

So I want to load the vocab into a tokenizer and use it like this:
source = tokenizer.batch_encode_plus([source_text], max_length=75, pad_to_max_length=True, truncation=True, padding="max_length", return_tensors='pt')
and return:
{ 'source_ids': source_ids.to(dtype=torch.long), 'source_mask': source_mask.to(dtype=torch.long), 'target_ids': target_ids.to(dtype=torch.long), 'target_ids_y': target_ids.to(dtype=torch.long) }
Then I want to pass the tokenizer output to the model and train it as in a translation task. How should I do this?
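The padded encoding described above could be sketched as follows, assuming the tokenizer is built from a wrapped WordLevel model. The toy vocabulary, the special-token names, and the helper encode_pair are hypothetical and not part of the original code. The sketch returns plain Python lists so that it runs without PyTorch; passing return_tensors="pt" would return tensors instead.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import WhitespaceSplit
from transformers import PreTrainedTokenizerFast

# Hypothetical toy vocabulary standing in for the real vocab.json.
toy_vocab = {"<pad>": 0, "</s>": 1, "<unk>": 2, "你": 3, "觉得": 4, "辛苦": 5}

backend = Tokenizer(WordLevel(vocab=toy_vocab, unk_token="<unk>"))
backend.pre_tokenizer = WhitespaceSplit()
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=backend,
    unk_token="<unk>",
    pad_token="<pad>",
    eos_token="</s>",
)


def encode_pair(source_text, target_text, max_length=75):
    """Encode one source/target pair, padded or truncated to max_length."""
    source = tokenizer(
        [source_text], max_length=max_length,
        padding="max_length", truncation=True,
    )
    target = tokenizer(
        [target_text], max_length=max_length,
        padding="max_length", truncation=True,
    )
    return {
        "source_ids": source["input_ids"][0],
        "source_mask": source["attention_mask"][0],
        "target_ids": target["input_ids"][0],
    }


batch = encode_pair("你 觉得 辛苦", "辛苦", max_length=5)
print(batch)
```

For training, the padded positions in the target ids are commonly replaced with -100 before being passed as labels, so that the loss ignores them; whether that applies here depends on the training loop.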
