mhagiwara / github-typo-corpus

GitHub Typo Corpus: A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errors

Python 100.00%

github-typo-corpus's Issues

Ways of getting more typos

Hi,
I would like to know whether you explored extracting typos from title changes in GitHub issues and pull requests. Many spelling corrections happen there as well. The classifier and language detection pipeline would not need to change. Only the data acquisition part would differ, since these edits carry no explicit indicator that the change is a typo fix.

Edits to comments on an issue or PR are another big source of such errors.

Dataset is very noisy

This dataset is very noisy: it has a lot of errors (bad corrections, wrong language detection, etc.). It doesn't make much sense to use it for evaluation.

Dataset Clarifications

Hi,
I had a few questions regarding the dataset and the methodology:

Why are cases where only a space is added between non-letter characters considered typos? I am talking about cases like this:

{
   "repo":"https://github.com/abacritt/angularx-social-login",
   "commit":"d4c912f5ccd70c81f424fadbf1fe1a2ecb942f07",
   "message":"Fix typo\n",
   "edits":[
      {
         "src":{
            "text":"            IN.User.authorize(function(){",
            "path":"src/lib/src/providers/linkedIn-login-provider.ts",
            "lang":"eng",
            "ppl":14.759690820409018
         },
         "tgt":{
            "text":"            IN.User.authorize(function() {",
            "path":"src/lib/src/providers/linkedIn-login-provider.ts",
            "lang":"eng",
            "ppl":13.02657495778316
         },
         "prob_typo":0.9393920322691369,
         "is_typo":true
      },
      {
         "src":{
            "text":"            IN.User.logout(function(){",
            "path":"src/lib/src/providers/linkedIn-login-provider.ts",
            "lang":"eng",
            "ppl":21.30207029147685
         },
         "tgt":{
            "text":"            IN.User.logout(function() {",
            "path":"src/lib/src/providers/linkedIn-login-provider.ts",
            "lang":"eng",
            "ppl":18.466537526108358
         },
         "prob_typo":0.9433146599837509,
         "is_typo":true
      }
   ]
}

Here the only change is a space inserted between ) and {. Doesn't this fall under semantic error? Also, isn't this kind of edit about a programming language rather than a human language?

I searched the dataset for "lang":"hin" and found many cases similar to the one listed below. The text contains no Hindi characters that could explain the Hindi tag. What could be the reason for this? Perhaps the language identifier has these irregularities. Coincidentally, this error is also another example of the first point I mentioned. Does having these examples help?

{
   "repo":"https://github.com/apache/airflow",
   "commit":"b81bd08a334efa5242af705743519be43346295e",
   "message":"[AIRFLOW-2538] Update faq doc on how to reduce airflow scheduler latency\n\nMake sure you have checked",
   "edits":[
      {
         "src":{
            "text":"---------------------------------------------------------------------------------------------",
            "path":"docs/faq.rst",
            "lang":"hin"
         },
         "tgt":{
            "text":"-----------------------------------------------------------------------------------------",
            "path":"docs/faq.rst",
            "lang":"hin"
         }
      }
   ]
}

In the above example, there is no is_typo flag. I observed this in cases for other languages as well. What might be the reason for that?

I am trying to understand the whole methodology, not to undermine your work. I read the paper as well. It was very well written; the classifier was especially neat. Thanks for releasing the dataset.
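
The two observations in this issue (edits that carry no is_typo flag, and edits whose text contains no letters at all yet still receive a language tag) can be checked mechanically. The following is a minimal sketch, assuming each record has the shape of the JSON examples quoted above. The helper names has_letters and audit_edit are hypothetical and are not part of the repository.

```python
def has_letters(text):
    """Return True if the text contains at least one alphabetic character."""
    return any(ch.isalpha() for ch in text)


def audit_edit(edit):
    """Return labels describing why an edit looks suspicious (hypothetical helper).

    Assumes the edit dict has "src" and "tgt" entries with a "text" field,
    as in the records quoted in this issue.
    """
    issues = []
    # Some records, like the apache/airflow example, have no is_typo flag.
    if "is_typo" not in edit:
        issues.append("missing_is_typo")
    # A line made only of punctuation gives a language identifier nothing to work with.
    if not has_letters(edit["src"]["text"]) and not has_letters(edit["tgt"]["text"]):
        issues.append("no_letters")
    return issues
```

Running audit_edit over every edit in the corpus and counting the labels would show how common each pattern is. It does not explain why the language identifier chose a particular tag.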

Interesting Tool for this

This is very cool.

We have a tool called Dolt (https://www.dolthub.com) that puts Git semantics around a SQL database. Would you consider working with us to version the output of your scraper? We are happy to run it for you in our Airflow instance.

https://github.com/liquidata-inc/liquidata-etl-jobs

Getting just the spelling mistakes of words?

Hi,

The very first entry has a correction that simply changes ){ to ) { (it inserts a space). For my application, I'd like to focus on typos of English words only. Do you have a suggested way to filter these out?

Your paper says that you annotated the dataset with some classifications, such as "Spell" when the error was a spelling mistake. I think this would help me do the filtering I need. Is it possible to get these annotations?

Thank you
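
Without the "Spell" annotations, one rough heuristic is to keep only edits that are tagged as English, are flagged as typos, and actually change a word rather than only whitespace or punctuation. The sketch below assumes the record fields (edits, src, tgt, text, lang, is_typo) shown in the JSON examples elsewhere in this thread. The function names are hypothetical, and matching runs of ASCII letters is a simplification, not the method used in the paper.

```python
import re

# Runs of ASCII letters; a simplifying assumption for English text.
WORD_RE = re.compile(r"[A-Za-z]+")


def is_word_level_edit(edit):
    """Return True if an edit is English, flagged as a typo, and changes a word."""
    src, tgt = edit["src"], edit["tgt"]
    if src.get("lang") != "eng" or tgt.get("lang") != "eng":
        return False
    # Records without an is_typo flag are treated as non-typos here.
    if not edit.get("is_typo", False):
        return False
    # If the word sequences are identical, only spacing or punctuation changed.
    return WORD_RE.findall(src["text"]) != WORD_RE.findall(tgt["text"])


def filter_commit(commit):
    """Return the edits of one commit record that pass is_word_level_edit."""
    return [e for e in commit.get("edits", []) if is_word_level_edit(e)]
```

This drops the ){ to ) { correction because both sides contain the same words. It still keeps edits to identifiers inside code, so a stricter filter (for example, restricting by file path) may be needed on top of it.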

SyntaxError: invalid syntax running collect_repositories.py

Hi! I tried to download your dataset and ran into this issue while running python3 collect_repositories.py:

github-typo-corpus/src$ python3 collect_repositories.py 
  File "collect_repositories.py", line 55
    archive_url = f'https://data.gharchive.org/{date_str}-{hour}.json.gz'

Can you help, please? Maybe I am doing something wrong, but I didn't see instructions in the README.
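
One plausible cause, given that the traceback points at an f-string: formatted string literals were added in Python 3.6, so an older python3 interpreter rejects that line with a SyntaxError. The Python version in use is not stated above, so this is an assumption. The sketch below shows a version check and a str.format equivalent of the URL construction. The function names are hypothetical and are not taken from collect_repositories.py.

```python
import sys


def supports_f_strings(version_info=sys.version_info):
    """Return True if the interpreter version is 3.6 or newer."""
    return tuple(version_info[:2]) >= (3, 6)


def archive_url(date_str, hour):
    """Build the same URL as the f-string in the traceback, using str.format."""
    return 'https://data.gharchive.org/{}-{}.json.gz'.format(date_str, hour)


if __name__ == "__main__":
    print("f-strings supported:", supports_f_strings())
```

If the check reports False, running the script with a newer interpreter is likely simpler than rewriting every f-string in the source.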
