Comments (5)
@juditacs @zseder FYI
hunpos tag must not be called with input that cannot be encoded to latin1, and we have been passing unicode objects. I'm not sure this should be "fixed" in hunpos: that would require extra, possibly third-party code for transliteration, and even then a transliteration would not always exist, let alone a unique one. So basically I agree with the nltk-hunpos developer that it should be the caller's responsibility to pass latin-1 encodable input to hunpos. In that spirit, I've added an "iconv -f UTF8 -t LATIN1//TRANSLIT" step to our preprocessing pipeline, which took care of all our problems.
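To see which inputs actually trigger the problem, a check along the following lines could help. This is only a minimal sketch in Python 3 (the thread itself predates that and talks about unicode objects); the function names `unencodable_chars` and `find_problem_lines` are made up for illustration and are not part of hunpos or nltk.

```python
def unencodable_chars(text, encoding="latin-1"):
    """Return the characters of `text` that cannot be encoded in `encoding`."""
    bad = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad.append(ch)
    return bad


def find_problem_lines(lines, encoding="latin-1"):
    """Return (line_number, offending_chars) for every line that would fail to encode.

    Line numbers are 1-based. Lines that encode cleanly are skipped.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        bad = unencodable_chars(line, encoding)
        if bad:
            problems.append((number, bad))
    return problems
```

Running `find_problem_lines` over the preprocessed corpus before tagging would show how many lines are affected and by which characters.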
Comments welcome!
from semeval.
I disagree with transliterating everything to latin1 just because Hunpos is old. I think we should only encode the text to latin1 then decode it again upon calling HunposTagger.
from semeval.
I don't get it, do you mean "decoding after calling HunposTagger"? If so,
I still don't really see the point, what is there to gain?
On Thu, Nov 20, 2014 at 4:48 PM, Judit Acs wrote:
I disagree with transliterating everything to latin1 just because Hunpos
is old. I think we should only encode the text to latin1 then decode it
again upon calling HunposTagger.
from semeval.
I meant just before calling HunposTagger, we encode the strings in latin1 then decode them (the tagger expects unicode if I understand correctly) and feed HunposTagger with the latin1-encodable unicode input. We align the Hunpos output with the original (lossless) input.
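The encode-then-align idea could look roughly like this. It is a sketch under assumptions, written for Python 3: `tag_fn` is a hypothetical stand-in for whatever callable wraps the tagger, assumed to return one `(token, tag)` pair per input token in the same order, and `latin1_safe` and `tag_with_alignment` are illustrative names, not existing functions.

```python
def latin1_safe(token):
    """Round-trip a token through latin-1; unencodable characters become '?'."""
    return token.encode("latin-1", errors="replace").decode("latin-1")


def tag_with_alignment(tokens, tag_fn):
    """Tag a latin-1-safe copy of `tokens`, then attach the tags to the originals.

    `tag_fn` is an assumed callable: list of str -> list of (token, tag) pairs,
    one pair per input token, in input order.
    """
    safe_tokens = [latin1_safe(token) for token in tokens]
    tagged = tag_fn(safe_tokens)
    if len(tagged) != len(tokens):
        raise ValueError("tagger output cannot be aligned with the input tokens")
    # Discard the lossy token returned by the tagger and keep the original one.
    return [(original, tag) for original, (_, tag) in zip(tokens, tagged)]
```

Because the replacement is done per token and never splits or merges tokens, the alignment is a plain positional zip. The tagger does see `?` in place of the original character, which may affect the tag it assigns.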
from semeval.
HunposTagger expects input that it can encode to latin1, but I think I know
what you mean. Sure, we could keep unicode data, I just didn't care, since
we are processing English input and the only non-ascii characters in there
(em-dashes, apostrophes, an occasional accent, altogether there's only
about 50 lines containing any of these) can be transliterated to latin1
with iconv without any real loss of information.
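For comparison, a pure-Python approximation of that kind of transliteration might look like the sketch below (Python 3). It is not a reimplementation of iconv and its output will not necessarily match `LATIN1//TRANSLIT`; the mapping table `PUNCT_MAP` and the function `to_latin1_text` are illustrative choices, covering only a few dashes and quotes.

```python
import unicodedata

# Hand-picked replacements for punctuation that latin-1 lacks (an assumption,
# not iconv's actual table).
PUNCT_MAP = str.maketrans({
    "\u2014": "--",  # em dash
    "\u2013": "-",   # en dash
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def to_latin1_text(text):
    """Return a latin-1 encodable approximation of `text`."""
    out = []
    for ch in text.translate(PUNCT_MAP):
        try:
            ch.encode("latin-1")
            out.append(ch)  # already representable, keep as-is
        except UnicodeEncodeError:
            # Decompose and drop combining marks, e.g. o with double acute -> o.
            decomposed = unicodedata.normalize("NFKD", ch)
            base = "".join(c for c in decomposed if not unicodedata.combining(c))
            # Anything still unencodable becomes '?'.
            safe = base.encode("latin-1", errors="replace").decode("latin-1")
            out.append(safe if safe else "?")
    return "".join(out)
```

Accented letters that latin-1 already contains, such as é, pass through unchanged; only characters outside latin-1 are rewritten.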
On Thu, Nov 20, 2014 at 8:20 PM, Judit Acs wrote:
I meant just before calling HunposTagger, we encode the strings in latin1
then decode them (the tagger expects unicode if I understand correctly) and
feed HunposTagger with the latin1-encodable unicode input. We align the
Hunpos output with the original (lossless) input.
from semeval.