Comments (4)

carlini commented on August 29, 2024

This deduplicator doesn't know anything about documents; it only knows strings. Do you use a document separator that's not present in any of the documents? (E.g., if you have a tokenizer with <65k tokens, you can use \xff\xff\xff\xff as a separator.)
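A minimal sketch of checking that a separator is absent from every document, assuming tokens are stored as little-endian uint16 values (an assumption, not something stated in this thread). The helper names `encode_tokens` and `find_unsafe_documents` are hypothetical and not part of the deduplicator.

```python
import struct

# Separator suggested above; it is only safe if no document contains it.
SEPARATOR = b"\xff\xff\xff\xff"

def encode_tokens(token_ids):
    """Pack token IDs as little-endian uint16 (assumed on-disk layout)."""
    return struct.pack(f"<{len(token_ids)}H", *token_ids)

def find_unsafe_documents(encoded_docs, separator=SEPARATOR):
    """Return indices of documents whose bytes already contain the separator."""
    return [i for i, doc in enumerate(encoded_docs) if separator in doc]

# Two toy documents; 50256 is the largest GPT-2 token ID.
docs = [encode_tokens([255, 50256]), encode_tokens([1, 2, 3])]
unsafe = find_unsafe_documents(docs)
```

If `unsafe` comes back non-empty, the chosen separator collides with real data and a different byte sequence is needed.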

from deduplicate-text-datasets.

gawei1995 commented on August 29, 2024

I use \xff\xff as a separator. The tokenizer is GPT-2, with <51k tokens. Is there a big difference between "\xff\xff" and "\xff\xff\xff\xff"? Thanks for the reply.

carlini commented on August 29, 2024

Huh. If you can be sure that 0xff00 isn't a valid token, then \xff\xff should work, so you should be able to get away with 2 bytes. Do you put a unique counter between documents as well?

Otherwise it could match [final bit of document 1][document separator][beginning of document 2] against the corresponding span across documents 3 and 4, if those had the same content in the same positions.
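A rough sketch of the unique-counter idea, assuming each document is already encoded as bytes and that a 4-byte little-endian counter follows the separator. The layout and the helper name `join_documents` are illustrative assumptions, not the deduplicator's confirmed format.

```python
import struct

SEPARATOR = b"\xff\xff"  # assumed not to occur inside any encoded document

def join_documents(encoded_docs):
    """Concatenate documents, prefixing each with SEPARATOR plus a unique counter.

    Because the counter differs at every boundary, a span that crosses one
    boundary cannot be byte-identical to a span that crosses another.
    """
    out = bytearray()
    for i, doc in enumerate(encoded_docs):
        out += SEPARATOR + struct.pack("<I", i)  # 4-byte little-endian counter
        out += doc
    return bytes(out)

# Three identical toy documents: without the counter, the bytes around
# every boundary would be the same and could be reported as duplicates.
stream = join_documents([b"AB", b"AB", b"AB"])
```

Here the span around the second boundary contains the counter value 1 and the span around the third contains 2, so the two cross-document spans no longer match each other.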

gawei1995 commented on August 29, 2024

Maybe, I'll try it. Thanks.
