Comments (6)

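feiyunamy commented on May 28, 2024

Oh, it seems self.tokenizer.tokenize(triple[0])[1:-1] is there to strip the [CLS] and [SEP] tokens. So is this tokenization behavior a bug? But if I switch to a different tokenizer, the prediction part of the code also returns results that are one character short, so I'm not sure what the reasoning behind it was.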

from casrel.

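Phoeby2618 commented on May 28, 2024

Question 1: the tokenizer in the code appears to be written for English tokens. It runs WordPiece on each word and replaces the spaces between words with [unused1]. With Chinese text you get the behavior you describe, so the tokenizer needs to be rewritten for Chinese.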
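
For readers who want to see the behavior described above, here is a minimal, self-contained sketch. It is not the repository's HBTokenizer: `toy_word_piece`, `tokenize_with_marker`, and the tiny vocabularies are hypothetical stand-ins, and the only assumption taken from this thread is that each whitespace-separated word is split into word pieces and followed by an [unused1] marker.

```python
def toy_word_piece(word, vocab):
    """Greedy longest-match-first split; a stand-in for a real WordPiece step."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            # Continuation pieces carry the usual "##" prefix.
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                break
            end -= 1
        if end == start:
            # Nothing in the vocabulary matched: treat the whole word as unknown.
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


def tokenize_with_marker(text, vocab, marker="[unused1]"):
    """Split on whitespace, word-piece each word, append a marker after each word."""
    tokens = []
    for word in text.strip().split():
        tokens += toy_word_piece(word, vocab)
        tokens.append(marker)
    return tokens


if __name__ == "__main__":
    english_vocab = {"new", "york"}
    chinese_vocab = {"北", "##京"}
    # English: one marker per space-separated word.
    print(tokenize_with_marker("new york", english_vocab))
    # Chinese without spaces: the whole string counts as a single "word",
    # so it is split into continuation pieces and gets only one marker.
    print(tokenize_with_marker("北京", chinese_vocab))
```

Under these assumptions, unsegmented Chinese text is treated as one long word, which is consistent with the suggestion that the tokenizer needs adapting for Chinese.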

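shm007g commented on May 28, 2024

I agree with @Phoeby2618. I tried three setups: (1) split the Chinese text into a space-separated, English-like format and use the HBTokenizer from the code; (2) keep the original Chinese text, use the stock Tokenizer plus [unused1], and also change ' '.join(sub.split('[unused1]')) in the metric function to match; (3) keep the original Chinese text, use the stock Tokenizer without [unused1], with the same metric change.
The first two give similar results. In the last case, the number of predicted relation entities is always 0. Apparently [unused1] cannot simply be dropped, but I haven't figured out why yet.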
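
The exact metric change is not shown in the thread. As a rough sketch of what adapting ' '.join(sub.split('[unused1]')) for Chinese could look like, the hypothetical helper below rebuilds a string from predicted tokens and lets the caller choose the joiner: a space for English, an empty string for Chinese. The name `decode_span` and its parameters are assumptions made for illustration.

```python
def decode_span(tokens, joiner=" ", marker="[unused1]"):
    """Rebuild text from word pieces, replacing each marker with `joiner`."""
    # Drop the "##" continuation prefix and glue the pieces together.
    text = "".join(t[2:] if t.startswith("##") else t for t in tokens)
    # Split on the marker and re-join; strip the trailing joiner left by
    # the marker that follows the last word.
    return joiner.join(text.split(marker)).strip()


if __name__ == "__main__":
    english = ["new", "[unused1]", "york", "[unused1]"]
    chinese = ["北", "[unused1]", "京", "[unused1]"]
    print(decode_span(english))             # new york
    print(decode_span(chinese, joiner=""))  # 北京
```

With a space joiner, Chinese entities would come back with spaces between characters and fail an exact-match comparison against the gold text, which is why the joiner has to change together with the tokenizer.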

seokjin954 commented on May 28, 2024

I've just run into this problem too and plan to try method (2) that you describe above. Have you found a better approach since then?

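yanjiahui123 commented on May 28, 2024

The self.tokenizer.tokenize(triple[0])[1:-1] here is indeed meant to strip the leading [CLS] tag and the trailing [SEP], which the function adds internally. One problem remains, though: if an entity token is not in the vocabulary, that token gets split into several smaller tokens.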
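
A small sketch of why the [1:-1] slice depends on the tokenizer adding [CLS] and [SEP]. The helpers `add_special_tokens` and `find_sublist` are hypothetical and only mimic the pattern discussed here; they are not the repository's functions. If a replacement tokenizer does not add the two special tokens, the same slice removes real tokens instead, which may be one cause of results that come out a character short.

```python
def add_special_tokens(tokens):
    """Mimic a tokenizer that wraps its output in [CLS] ... [SEP]."""
    return ["[CLS]"] + tokens + ["[SEP]"]


def find_sublist(source, target):
    """Return the index where `target` first occurs in `source`, or -1."""
    n = len(target)
    for i in range(len(source) - n + 1):
        if source[i:i + n] == target:
            return i
    return -1


if __name__ == "__main__":
    sentence = add_special_tokens(
        ["obama", "[unused1]", "was", "[unused1]", "born", "[unused1]"]
    )
    # With special tokens present, [1:-1] leaves exactly the entity tokens.
    entity = add_special_tokens(["obama", "[unused1]"])[1:-1]
    print(entity, find_sublist(sentence, entity))

    # Without special tokens, the same slice eats the first and last real token.
    bare = ["北", "京"]
    print(bare[1:-1])
```

The subject lookup only succeeds when the entity is tokenized exactly as it is inside the sentence, so an out-of-vocabulary entity that splits differently in isolation would not be found.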

zjw-coder commented on May 28, 2024

Could any of you share your code for handling Chinese data, or explain what needs to be changed?
