github-typo-corpus's Issues

Ways of getting more typos

Hi,
I would like to know whether you explored extracting typos from title changes in GitHub issues and pull requests. Many of those changes are spelling corrections. The classifier and language-detection pipeline would not change; only the data acquisition part would differ, since there is no indicator that a change is a typo.

Edits to comments on an issue/PR are another big source of such errors.
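A minimal sketch of the title-change idea, assuming the issue events have already been fetched as a list of dictionaries in which a title change appears as a "renamed" event carrying a "rename" object with "from" and "to" fields. The function name, the similarity threshold, and the use of difflib as a cheap stand-in for the classifier are all illustrative assumptions, not part of the project's pipeline.

```python
from difflib import SequenceMatcher


def likely_typo_renames(events, min_ratio=0.9):
    """Yield (old_title, new_title) pairs that differ only slightly.

    `events` is assumed to be a list of dicts shaped like issue events,
    where a title change has event == "renamed" and a "rename" object
    with "from" and "to" keys. `min_ratio` is an arbitrary cut-off.
    """
    for event in events:
        if event.get("event") != "renamed":
            continue
        rename = event.get("rename") or {}
        old, new = rename.get("from"), rename.get("to")
        if not old or not new or old == new:
            continue
        # A high similarity ratio suggests a small correction rather than
        # a rewritten title; the real typo classifier would replace this.
        if SequenceMatcher(None, old, new).ratio() >= min_ratio:
            yield old, new
```

Candidate pairs produced this way would still need to go through language detection and the typo classifier, since a small edit to a title is not necessarily a spelling fix.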

Dataset is very noisy

This dataset is very noisy: it has a lot of errors (bad corrections, wrong language detection, etc.). It doesn't make much sense to use this dataset for evaluation.

Dataset Clarifications

Hi,
I had a few questions regarding the dataset and the methodology:

  1. Why are cases where a space is added between non-letter characters considered typos? I am talking about cases like this:
{
   "repo":"https://github.com/abacritt/angularx-social-login",
   "commit":"d4c912f5ccd70c81f424fadbf1fe1a2ecb942f07",
   "message":"Fix typo\n",
   "edits":[
      {
         "src":{
            "text":"            IN.User.authorize(function(){",
            "path":"src/lib/src/providers/linkedIn-login-provider.ts",
            "lang":"eng",
            "ppl":14.759690820409018
         },
         "tgt":{
            "text":"            IN.User.authorize(function() {",
            "path":"src/lib/src/providers/linkedIn-login-provider.ts",
            "lang":"eng",
            "ppl":13.02657495778316
         },
         "prob_typo":0.9393920322691369,
         "is_typo":true
      },
      {
         "src":{
            "text":"            IN.User.logout(function(){",
            "path":"src/lib/src/providers/linkedIn-login-provider.ts",
            "lang":"eng",
            "ppl":21.30207029147685
         },
         "tgt":{
            "text":"            IN.User.logout(function() {",
            "path":"src/lib/src/providers/linkedIn-login-provider.ts",
            "lang":"eng",
            "ppl":18.466537526108358
         },
         "prob_typo":0.9433146599837509,
         "is_typo":true
      }
   ]
}

Here the only change is a space between ) and {. Doesn't this count as a semantic error? Also, isn't this kind of edit specific to programming languages as opposed to human languages?

  2. I checked for "lang":"hin" in the dataset and found many cases similar to the one listed below. The text contains no Hindi characters that would explain why it was tagged as Hindi. What could be the reason behind this? The language identifier might have these irregularities. Coincidentally, this error is also another example of the first point I mentioned. Does having these examples help?
{
   "repo":"https://github.com/apache/airflow",
   "commit":"b81bd08a334efa5242af705743519be43346295e",
   "message":"[AIRFLOW-2538] Update faq doc on how to reduce airflow scheduler latency\n\nMake sure you have checked",
   "edits":[
      {
         "src":{
            "text":"---------------------------------------------------------------------------------------------",
            "path":"docs/faq.rst",
            "lang":"hin"
         },
         "tgt":{
            "text":"-----------------------------------------------------------------------------------------",
            "path":"docs/faq.rst",
            "lang":"hin"
         }
      }
   ]
}
  3. In the above example, there is no is_typo flag. I observed this in cases for other languages as well. What might be the reason for that?

I am trying to understand the whole methodology, not to undermine your work. I read the paper as well. It was very well written; the classifier was especially neat. Thanks for releasing the dataset.
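To make the three questions above easier to quantify, here is a rough sketch that flags the patterns they describe in a single edit. It assumes each edit is a dictionary shaped like the JSON examples in this issue ("src" and "tgt" with a "text" field, and an optional "is_typo" key); the function name and the flag names are hypothetical. It only reports the patterns and does not explain why they occur.

```python
def audit_edit(edit):
    """Flag the patterns raised in this issue for one edit dict."""
    src_text = edit["src"]["text"]
    tgt_text = edit["tgt"]["text"]
    return {
        # No alphabetic character on either side, so a language tag
        # such as "hin" cannot be based on actual words.
        "no_letters": not any(ch.isalpha() for ch in src_text + tgt_text),
        # The edit carries no "is_typo" key at all.
        "missing_is_typo": "is_typo" not in edit,
        # Source and target are identical once all whitespace is removed.
        "whitespace_only": "".join(src_text.split()) == "".join(tgt_text.split()),
    }
```

Counting these flags over the whole dataset would show how common each pattern is per language tag.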

Getting just the spelling mistakes of words?

Hi,

The very first entry has a correction that simply changes ){ to ) { (it inserts a space). For my application, I'd like to focus on typos of English words only. Do you have a suggested way to filter these out?

Your paper says that you annotated the dataset with some classifications, such as "Spell" when the error was a spelling mistake. I think these would help me do the filtering I need. Is it possible to get these annotations?

Thank you
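Until such annotations are available, one possible heuristic for the filtering asked about above is to keep only edits whose sequence of alphabetic words actually changes. This is a sketch under assumptions: it relies on the record layout shown in the JSON examples ("edits", "src", "tgt", "text", "is_typo"), treats a "word" as a run of ASCII letters, and uses hypothetical function names. It is not the "Spell" annotation from the paper.

```python
import re

# A "word" here is simply a run of ASCII letters; this is an assumption
# that suits English text and ignores digits and punctuation.
WORD_RE = re.compile(r"[A-Za-z]+")


def changes_words(edit):
    """Return True if the edit alters the sequence of alphabetic words."""
    src_words = WORD_RE.findall(edit["src"]["text"])
    tgt_words = WORD_RE.findall(edit["tgt"]["text"])
    return src_words != tgt_words


def word_level_typos(record):
    """Keep edits marked as typos that change at least one word."""
    return [
        edit
        for edit in record["edits"]
        if edit.get("is_typo") and changes_words(edit)
    ]
```

With this filter, an edit that only turns ){ into ) { is dropped because the words on both sides are identical. It will still keep non-spelling changes such as a replaced or deleted word, so a further check would be needed to isolate spelling mistakes alone.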

SyntaxError: invalid syntax running collect_repositories.py

Hi! I tried to download your dataset and faced this issue while running python3 collect_repositories.py

github-typo-corpus/src$ python3 collect_repositories.py 
  File "collect_repositories.py", line 55
    archive_url = f'https://data.gharchive.org/{date_str}-{hour}.json.gz'

Can you help, please? Maybe I am doing something wrong, but I didn't see instructions in the README.
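A likely cause, assuming the python3 on this machine is older than 3.6: the line in the traceback uses an f-string, which earlier Python 3 versions reject as a syntax error. The sketch below shows a version check and an equivalent of that line written with str.format. Both function names are hypothetical and not part of the repository.

```python
import sys


def supports_f_strings(version_info=sys.version_info):
    """f-strings were added in Python 3.6."""
    return tuple(version_info[:2]) >= (3, 6)


def build_archive_url(date_str, hour):
    """Build the same URL as the f-string on line 55, without f-strings."""
    return "https://data.gharchive.org/{}-{}.json.gz".format(date_str, hour)
```

Running python3 --version shows which interpreter is in use. If it reports a version below 3.6, running the script with a newer interpreter is probably simpler than rewriting every f-string.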
