Comments (9)

voidism commented on July 20, 2024
  1. The original GPT-2 model has this problem too; I have tried it.
  2. GPT-2 uses byte-pair encoding, which means that many Chinese characters are still represented as three raw bytes rather than a single token. Because the tokenizer saw very little Chinese when its vocabulary was built, it cannot merge those bytes into true word units (see the sketch after this list).
    So you need to build a new model with a new vocabulary for Chinese from scratch, not just fine-tune this English model.
  3. I think GPT-2 succeeded because it was trained on a large web dataset (40 GB) whose text quality was vetted by humans (via Reddit karma). For Chinese, there is no publicly available dataset as big as the one GPT-2 used.
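
For illustration, here is a minimal sketch of that byte-level splitting. It assumes the Hugging Face transformers package (an outside dependency, not this repo's encoder.py, though both load the published GPT-2 vocabulary):

    # Minimal sketch, assuming the Hugging Face `transformers` package.
    # Loads the published GPT-2 BPE vocabulary and compares how English
    # and Chinese text break into tokens.
    from transformers import GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

    en = tokenizer.tokenize("language model")
    zh = tokenizer.tokenize("语言模型")  # "language model" in Chinese

    print(len(en))  # typically 2: roughly one token per English word
    print(len(zh))  # much larger: each character is 3 UTF-8 bytes,
                    # which usually yields several fragments per character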

ConnorJL commented on July 20, 2024

Hi there. These models were trained primarily on English text, so I have no idea how good or bad they are for other languages. By default, they should be able to handle any text that the BPE encoder can encode. You will probably have to retrain the model on Chinese text in order to get better results.

ConnorJL commented on July 20, 2024

Thanks for the interesting comments, everyone. I think the bottom line is that the model was trained primarily on English text, so it naturally struggles with very different languages such as Chinese and Japanese. To get better performance, one would probably need to collect a proper Chinese dataset and maybe even create a new BPE vocabulary that focuses on non-English languages. I don't have any plans to do so, since I'm not even qualified to judge which Chinese text is good, but it would be a cool experiment for others to try.

ConnorJL commented on July 20, 2024

Google actually has a system you can use to build a BPE encoder: https://github.com/google/sentencepiece

It's not exactly the same as OpenAI's, so you'd need to adapt encoder.py to use the new model, but in theory it should work just fine. I think the main insight from GPT-2 is the scaled-up transformer architecture, but BPE surely adds a lot as well.
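
For anyone who wants to try, a rough sketch of training a sentencepiece BPE model could look like the following (the file names and hyperparameters are placeholders, not tested settings):

    # Rough sketch: train a BPE vocabulary with sentencepiece and
    # encode text with it. "corpus.txt" and "chinese_bpe" are
    # placeholder names; the hyperparameters are only illustrative.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        input="corpus.txt",           # raw text, one sentence per line
        model_prefix="chinese_bpe",   # writes chinese_bpe.model / .vocab
        vocab_size=32000,
        model_type="bpe",
        character_coverage=0.9995,    # high coverage helps CJK scripts
    )

    sp = spm.SentencePieceProcessor(model_file="chinese_bpe.model")
    print(sp.encode("我是厉飞雨", out_type=str))  # subword pieces
    print(sp.encode("我是厉飞雨"))                # token ids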

dpyneo commented on July 20, 2024

Thank you very much for taking time out of your busy schedule to reply. May I ask whether this has to be retrained on Chinese from scratch, or whether, like BERT, it only needs fine-tuning? And if fine-tuning is enough, how should it be done? I don't know whether there are plans to release models in other languages in the future; if you get to Chinese at some point, I hope you will share it. Thank you again for sharing this work, it's great. Here are some of the predicted results; I forgot to post them last time.

======================================== SAMPLE 0 ========================================
我是厉飞雨中団集道占标情乎不外弁的空何的主热者余主似可能性。

丆会者が移要、成努体余礍等素傷名省冔容以丐会の使用を成功する人が可能だ。

为了一接経懑用体的会段階の構玖は、自分の展貌の演由に意员した他の孆地地を感しており、段階の濑えて最近なものが今回。

如你的能性的項分

私のミッケージに答及した言い資数指定

世界中団に沢項決気师ゎ

vochicong commented on July 20, 2024

The "SAMPLE 0" text generated above is not entirely Japanese, but a random mix of Chinese and Japanese.

vochicong commented on July 20, 2024

I tried generating Japanese text. To my surprise, the model output consists almost entirely of valid Japanese characters (no Chinese characters mixed in), though the words and sentences are very strange.

!python3 main.py --model PrettyBig_colab.json --top_k 40 --predict_text "猫はネズミを"

...
======================================== SAMPLE 0 ========================================

猫はネズミをアフリです」とお思います。した「言も市気についてもとうかった」にも大錭際のための姀別の事実に寄りたりなのでいただけるだろう。「言だけ」とはなぜんだけないので、これだけではありません。

楽しめてしさが、実際に対しても合まで統制よう言くなりがば、その中の機胞が可能です。ふっても、どうような未杯件が高いので自分だったものだろうか?いっかくわけようようで、つからありません。ほどの誤くちは良い改持していますが、それに世界中の中にはないことができたけどれば、いけれど、ひい、しかしも、いますは種りだが止れからどく名じくなりという人もようができるか。自由を自己したことだが、どうのかでにそれを徴期しているようにしておりません。

言を說明したとこと

そなし、合わな形生のあった。、いまま、合わな形生のあった。それには、というものが、そこで「ややややっという」というが、どうかった感じたらしか無事な形生のなど、そのもセキュリティ(あらようにどういて)。このあったので、そうなる攻省でしょうとした。そうしても、読者のような提価の他も、それにあれたい読者は、これや件いのなっのもときっていいいけるようになりました。そんなど、それにもなくてがず、というは「うもぉっど」っています。

ろしくんう値だったよね。しかした言明だけでは、いまかは感じた。そうしても、それには「以外の言しには自分ない」を微計すようになら、いままないとこれはやっという本私としてもすれば「感じためらない」むしました。そうしたものかで、しかした言を調べばための前徶にするかもしれば、いままだ態度はそれがどうだけませについても、それにもそうなど、どうまであったことはないか、という感じたけないのです。しかし、これや、感じたけてもどうだ。

それかったものが、しかしょうだないよね。つをい、それにも人があり、いら、ややや぀人がだかし、どうかった�

================================================================================

dpyneo commented on July 20, 2024

First of all, thanks to voidism and vochicong for their answers.

I guess it's possible that Japanese occupies more of the training corpus than Chinese, so the top-40 candidates at each sampling step may mostly be Japanese; since the largely English corpus was probably mixed with some Japanese, the model ends up predicting Japanese. But it has learned the representations and semantics of English, not of Japanese, which leads to the same problem voidism described: in effect the model has only learned English and the fixed grammar patterns of web pages, while for other languages it can only imitate the surface of the content it has seen.

Cyvadra commented on July 20, 2024

I've got a 20 GB Chinese text file in hand but no idea how to build the BPE encoder. I did run the script, but its output format doesn't seem to fit GPT-2... (maybe a small adapter like the sketch below would bridge it, though I haven't tried).
That might not be a problem, though, because I think the key to GPT-2 is not its code but the idea of modeling every single word, which captures much more structured information than LSTM-like algorithms; that is what makes it such a good language repeater. So the other requirement is a huge amount of data, plus very expensive computation.
Since those two will never be available in China (Hail CCF), don't bother with it as long as your time is precious.
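
A hypothetical wrapper exposing a simple encode/decode interface might look roughly like this (the class name and file name are made up; check encoder.py for the interface this repo actually expects):

    # Hypothetical sketch: wrap a trained sentencepiece model in an
    # encode/decode object. This is NOT the repo's actual API; see
    # encoder.py for the interface the training code really expects.
    import sentencepiece as spm

    class SentencePieceEncoder:
        def __init__(self, model_file):
            self.sp = spm.SentencePieceProcessor(model_file=model_file)

        def encode(self, text):
            return self.sp.encode(text)   # string -> list of token ids

        def decode(self, ids):
            return self.sp.decode(ids)    # token ids -> string

    enc = SentencePieceEncoder("chinese_bpe.model")  # placeholder file
    ids = enc.encode("我是厉飞雨")
    print(ids)
    print(enc.decode(ids))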
