
masakhane-ner's Issues

No dataset for Luo

Hi,

I found that there is no data for the Luo language in this repository, and it is not included on the Hugging Face page either. Could you also publish this data to make the dataset complete?

Many thanks!

Word separator character in the Amharic dataset

Thank you for providing such a nice dataset. We are currently working on integrating them into GERBIL to enable other researchers to use them more easily. However, while working with the Amharic dataset, we encountered a severe issue.

Problem description

The Amharic language uses the character ፡ (U+1361, ETHIOPIC WORDSPACE) to separate words from each other. This character is not used in the dataset, which looks like a reasonable decision, since it should be possible to add it automatically when reading a document from a file. However, in some places a single word separator character does appear within the dataset, e.g., at https://github.com/masakhane-io/masakhane-ner/blob/main/data/amh/test.txt#L83. This seems to be wrong, since the inconsistency makes the dataset harder to process. Either the word separator should be present between all words, or it should be omitted completely and left to the consumer of the dataset to add in the correct places.

Proposed fix

Either add the word separator in all places where the Amharic language would put it, OR remove all word separators and expect the consumer of the dataset to add them while loading the dataset.
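For the second option, a minimal sketch of the clean-up could look like the following. It assumes that each non-empty line has the form `token label`, and that a stray separator appears as a token of its own consisting only of U+1361 (ETHIOPIC WORDSPACE). The function name `drop_wordspace_tokens` is hypothetical and not part of the repository.

```python
WORDSPACE = "\u1361"  # ETHIOPIC WORDSPACE, the Amharic word separator


def drop_wordspace_tokens(lines):
    """Return the CoNLL-style lines without tokens that are only a word separator.

    Assumption: each non-empty line is "token label"; blank lines
    (sentence boundaries) are kept untouched.
    """
    kept = []
    for line in lines:
        parts = line.split()
        if parts and parts[0] == WORDSPACE:
            continue  # skip the stray separator token
        kept.append(line)
    return kept


# Tiny illustrative sample, not taken from the dataset.
sample = ["አምቦ B-LOC", "\u1361 O", "ከዚህ O", ""]
cleaned = drop_wordspace_tokens(sample)
```

If a separator token ever carries an entity label other than `O`, dropping it silently could change an entity span, so it may be worth logging such lines before removing them.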

Repository license

The arXiv paper states that the research is released under the open CC-BY license.

Can the authors confirm that this code is open source and released under an open license?

I can then submit a PR with a GNU + CC-BY license.

Feature request: Fula Pular

Are any of these languages related to Pular? My neighbors only speak Pular so this would be a game-changer to be able to converse with them.
Thanks for your consideration!

Truncated results for XLM-R and mBERT

Hi,

It seems that some of the prediction files cut long sentences short. For example, here is a long sentence in Hausa in the test file:

but this sentence is truncated in the XLM-R results:

Here's a similar result for mBERT:

Maybe you need to increase the maximum sequence length in whatever software you're using to be able to handle the whole sentences?
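To find all affected sentences, one option is to compare token counts per sentence between the test file and a prediction file. The sketch below assumes both files use the same CoNLL-style layout (one token per line, token in the first column, blank line between sentences) and list the sentences in the same order. The function names are hypothetical.

```python
def read_conll_sentences(lines):
    """Group CoNLL-style lines into sentences, keeping only the token column."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(line.split()[0])
    if current:
        sentences.append(current)
    return sentences


def find_truncated(gold_lines, pred_lines):
    """Return (sentence index, gold length, predicted length) for shorter predictions."""
    gold = read_conll_sentences(gold_lines)
    pred = read_conll_sentences(pred_lines)
    return [
        (i, len(g), len(p))
        for i, (g, p) in enumerate(zip(gold, pred))
        if len(p) < len(g)
    ]


# Tiny illustrative sample: the first predicted sentence lost its last token.
gold_sample = ["a O", "b O", "c O", "", "d O", ""]
pred_sample = ["a O", "b O", "", "d O", ""]
truncated = find_truncated(gold_sample, pred_sample)
```

The length of the longest affected gold sentence gives a rough lower bound on the word count the model needs to handle. Subword tokenization usually produces more pieces than words, so the maximum sequence length would need to be larger than that.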

Faulty full stop character in the Amharic dataset

Thank you for providing such a nice dataset. We are currently working on integrating them into GERBIL to enable other researchers to use them more easily. However, while working with the Amharic dataset, we encountered a severe issue.

Problem description

The Amharic language uses punctuation characters that are not common in other languages. The two important characters for this issue are the word separator ፡ (U+1361) and the full stop ። (U+1362).

This is an excerpt of the dataset (dev.txt):

አምቦ B-LOC
ከዚህ O
በኋላ O
የቱሪዝም O
የባህል O
እና O
የፖለቲካ O
ማዕከል O
ትሆናለች O
፡፡ O

The last character should be a full stop, i.e., ። (U+1362). However, in this example and in other sentences in the dataset, the last line comprises two word separators (2x ፡, U+1361). I think that this is a mistake and should be fixed within the dataset.

Proposed fix

Replace ፡፡ (two word separators, 2x U+1361) with ። (the full stop, U+1362) in all three files of the Amharic dataset.
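A minimal sketch of such a replacement, assuming the `token label` line format shown in the excerpt above and that the faulty full stop always appears as a token of its own. The function name `fix_full_stop` is hypothetical.

```python
DOUBLE_WORDSPACE = "\u1361\u1361"  # two ETHIOPIC WORDSPACE characters
FULL_STOP = "\u1362"               # ETHIOPIC FULL STOP


def fix_full_stop(lines):
    """Replace tokens made of two word separators with the Ethiopic full stop.

    Assumption: each non-empty line is "token label" separated by whitespace;
    blank lines (sentence boundaries) are passed through unchanged.
    """
    fixed = []
    for line in lines:
        parts = line.split()
        if parts and parts[0] == DOUBLE_WORDSPACE:
            parts[0] = FULL_STOP
            line = " ".join(parts)
        fixed.append(line)
    return fixed


# Last lines of the excerpt above, followed by a sentence boundary.
sample = ["ትሆናለች O", "\u1361\u1361 O", ""]
fixed_sample = fix_full_stop(sample)
```

Only whole tokens are replaced, so a word that merely contains the separator would be left alone and could be inspected separately.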

Improve readme.md

Definition: what does this code do?
Install: which dependencies are required, and what are the commands to install them?
Run: a demo run command.
Contribute: how to add a language?
License.

Thanks a lot for this project. 🙏🏼 African languages need more of those.
