
Comments (5)

recski commented on September 16, 2024

@juditacs @zseder FYI
hunpos tag must not be called with input that cannot be encoded to latin1, and we have been passing unicode objects. I'm not sure this should be "fixed" in hunpos, as that would involve using some extra, possibly third-party code for transliteration, and there still wouldn't always be a working way to do that, let alone a unique one. So basically I agree with the nltk-hunpos developer that it should be the caller's responsibility to pass latin-1 encodable input to hunpos. In that spirit, I've added an "iconv -f UTF8 -t LATIN1//TRANSLIT" to our preprocessing pipeline, which took care of all our problems.
Comments welcome!
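
For anyone who wants to check their input before it reaches hunpos, here is a minimal Python 3 sketch of the same idea. It is not what iconv does internally: the `FALLBACKS` table and both function names are made up for illustration, and the table only covers a handful of punctuation marks, whereas iconv's `//TRANSLIT` table is much larger.

```python
# Hypothetical, deliberately tiny fallback table (an assumption for this
# sketch); iconv's //TRANSLIT covers far more characters than this.
FALLBACKS = {
    "\u2014": "-",   # em dash
    "\u2013": "-",   # en dash
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
}


def non_latin1_chars(text):
    """Return the characters in text that the latin-1 codec cannot encode."""
    # latin-1 covers exactly the code points U+0000 to U+00FF.
    return [ch for ch in text if ord(ch) > 0xFF]


def to_latin1_encodable(text):
    """Replace characters outside latin-1 using FALLBACKS, or '?' if unknown."""
    out = []
    for ch in text:
        if ord(ch) <= 0xFF:
            out.append(ch)
        else:
            out.append(FALLBACKS.get(ch, "?"))
    return "".join(out)


line = "na\u00efve \u2014 it\u2019s"
print(non_latin1_chars(line))
cleaned = to_latin1_encodable(line)
cleaned.encode("latin-1")  # no UnicodeEncodeError after cleaning
```

Accented letters such as "ï" are already inside latin-1 and pass through unchanged; only characters above U+00FF are substituted.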

from semeval.

juditacs commented on September 16, 2024

I disagree with transliterating everything to latin1 just because Hunpos is old. I think we should only encode the text to latin1 then decode it again upon calling HunposTagger.


recski commented on September 16, 2024

I don't get it, do you mean "decoding after calling HunposTagger"? If so,
I still don't really see the point, what is there to gain?


juditacs commented on September 16, 2024

I meant just before calling HunposTagger, we encode the strings in latin1 then decode them (the tagger expects unicode if I understand correctly) and feed HunposTagger with the latin1-encodable unicode input. We align the Hunpos output with the original (lossless) input.
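
A rough Python 3 sketch of the round trip described above, under some assumptions: `tag_fn` is a stand-in for whatever tagger call is used (it is assumed to return one `(token, tag)` pair per input token, in order), and both function names are invented for this example. Unencodable characters become `?` here because of `errors="replace"`; a transliteration step could be swapped in instead.

```python
def latin1_roundtrip(token):
    """Encode to latin-1 and decode again, so the result is latin-1 encodable.

    Characters outside latin-1 are replaced with '?' (errors="replace").
    """
    return token.encode("latin-1", errors="replace").decode("latin-1")


def tag_with_original_tokens(tokens, tag_fn):
    """Tag latin-1-safe copies of tokens, then pair tags with the originals.

    tag_fn is a hypothetical callable standing in for the tagger; it is
    assumed to return a list of (token, tag) pairs aligned with its input.
    """
    safe_tokens = [latin1_roundtrip(t) for t in tokens]
    tagged = tag_fn(safe_tokens)
    if len(tagged) != len(tokens):
        raise ValueError("tagger output is not aligned with the input tokens")
    # Keep the tag, but restore the original (lossless) token.
    return [(orig, tag) for orig, (_, tag) in zip(tokens, tagged)]


def dummy_tagger(tokens):
    """Placeholder tagger used only to demonstrate the alignment."""
    return [(t, "X") for t in tokens]


print(tag_with_original_tokens(["it\u2019s", "na\u00efve"], dummy_tagger))
```

The alignment is purely positional, so it only holds if the tagger never splits or merges tokens.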


recski commented on September 16, 2024

HunposTagger expects input that it can encode to latin1, but I think I know
what you mean. Sure, we could keep unicode data, I just didn't care, since
we are processing English input and the only non-ascii characters in there
(em-dashes, apostrophes, an occasional accent; altogether only about 50
lines contain any of these) can be transliterated to latin1 with iconv
without any real loss of information.

