Comments (9)

nelson-liu avatar nelson-liu commented on May 19, 2024

I know NLTK has a Moses tokenizer, but I'm unsure whether it's a good port; I've always used tokenize.pl anyway.

For reference, for large corpora (e.g., MT) I've found it faster to pretokenize your data once and feed it into TorchText with the str.split tokenizer than to tokenize on every run.
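A minimal sketch of that pretokenize-once workflow, using only the standard library. The regex tokenizer here is a stand-in chosen for illustration (it is not Moses), and the function and file names are hypothetical; the point is only that the expensive step runs once and later runs need nothing more than str.split.

```python
import re
import tempfile
from pathlib import Path

# Stand-in tokenizer (an assumption for illustration, NOT Moses): separates
# words from punctuation. Swap in whatever expensive tokenizer you really use.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def slow_tokenize(line):
    return _TOKEN_RE.findall(line)


def pretokenize(src_path, dst_path, tokenize=slow_tokenize):
    """Tokenize src_path once and write space-joined tokens to dst_path."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(" ".join(tokenize(line)) + "\n")


def load_pretokenized(path):
    """On later runs, recovering the tokens is just str.split per line."""
    with open(path, encoding="utf-8") as f:
        return [line.split() for line in f]


def demo():
    # Write a tiny raw corpus, pretokenize it, then read it back.
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "corpus.txt"
        tok = Path(tmp) / "corpus.tok.txt"
        raw.write_text("Hello, world!\nIt works.\n", encoding="utf-8")
        pretokenize(raw, tok)
        return load_pretokenized(tok)
```

Because the pretokenized file stores tokens separated by single spaces, str.split can be passed wherever a tokenizer callable is expected, and the original tokenizer is no longer needed at training time.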

from text.

PetrochukM avatar PetrochukM commented on May 19, 2024

@nelson-liu Is there an option to include a bash script to tokenize your data as part of the library?

This issue mentions that Moses in NLTK was fixed: nltk/nltk#1214


nelson-liu avatar nelson-liu commented on May 19, 2024

Hmm, not directly. Ostensibly you could write a function that takes an input string, runs the bash script on it (e.g., with subprocess), and parses the output to get the tokens, but that feels like a lot of overhead.
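A rough sketch of what such a wrapper could look like, assuming the external tokenizer reads text on stdin and writes space-separated tokens to stdout. The function name is hypothetical, and the command used below is a stand-in built from the Python interpreter so the example runs anywhere; for the Perl script you would pass its own command line instead.

```python
import subprocess
import sys


def tokenize_with_script(text, cmd):
    """Pipe text to an external tokenizer command and split its stdout."""
    result = subprocess.run(
        cmd, input=text, capture_output=True, text=True, check=True
    )
    return result.stdout.split()


# Stand-in for the real tokenizer script (an assumption for illustration):
# a one-line program that only puts a space before each comma.
STAND_IN_CMD = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().replace(',', ' ,'))",
]

tokens = tokenize_with_script("Hello, world", STAND_IN_CMD)
```

Each call spawns a new process, which is where the overhead comes from; calling this once per sentence on a large corpus would likely be far slower than tokenizing the whole file in one pass.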


PetrochukM avatar PetrochukM commented on May 19, 2024

Edited my comment above. Looks like NLTK contributors fixed Moses in this merged pull request: nltk/nltk#1553


jekbradbury avatar jekbradbury commented on May 19, 2024

I think including the NLTK version of the Moses tokenizer is a good idea, though it shouldn’t be difficult to use the existing API to call it for now.


nelson-liu avatar nelson-liu commented on May 19, 2024

+1. Slightly unrelated, but I wonder if it's worth including the other NLTK tokenizers (e.g., punkt or PTB) or having some sort of public-facing API for using the spaCy tokenizers in other languages. This seems like a slippery slope API design-wise, though...


jekbradbury avatar jekbradbury commented on May 19, 2024

Yeah, I think I’d rather show in the docs, etc., how easy it is to call them yourself. But lots of people (including me until a few minutes ago!) don’t know that NLTK now has a Moses-compatible tokenizer, so it might be worth including that one so more people know they can move away from the Perl script preprocessing approach.


nelson-liu avatar nelson-liu commented on May 19, 2024


nelson-liu avatar nelson-liu commented on May 19, 2024

PR at #58

