wikidetox's People

Contributors

andreasveit, cjadams, cristiandnm, dartar, dborkan, dependabot[bot], ericagreene, iislucas, ipavlopoulos, iris-qu, jetpack, kearstenprince, nthain, sarathsaleem, sorensenjs, tamajongnc, vegetable68

wikidetox's Issues

Issue with multiple last edits to single last edit

Machine annotation marks a conversation as bad if at least one of its last edits is toxic.
Human annotation marks a conversation as bad based on whether its deepest last edit is toxic.

Thus we might need to post-process some human-annotated conversations when all of the following hold (see the sketch after this list):

  1. There are multiple last edits
  2. Machine annotated as bad
  3. Human annotated as fine
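
A minimal sketch of that filter, assuming each conversation record carries its last edits plus the machine and human labels (the field names here are illustrative):

def needs_postprocessing(conv):
    """Flag conversations where the two labeling rules can disagree."""
    return (len(conv['last_edits']) > 1          # 1. multiple last edits
            and conv['machine_label'] == 'bad'   # 2. machine-annotated bad
            and conv['human_label'] == 'fine')   # 3. human-annotated fine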

Save model files

The code here to train models for the toxicity Kaggle competition currently doesn't save the models. It writes out predictions and probabilities for the held-out data and unlabeled test data, but it doesn't save the model itself, so we can't reload and re-run it later.
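
A minimal sketch of what saving could look like, assuming a picklable scikit-learn-style estimator (the actual training framework may differ):

import joblib

def save_model(model, path='model.joblib'):
    """Persist the fitted model next to the predictions so a later run
    can reload it instead of retraining."""
    joblib.dump(model, path)

def load_model(path='model.joblib'):
    return joblib.load(path)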

Invalid JSON records

My pipeline detected 3 invalid JSON lines in the data. Here is one such record:

Check failed: reader_.parse(json_line, root, false) {"user_id": null, "user_text": "216.84.45.194", "timestamp": "2007-02-01T21:50:19Z", "authors": "[[null, "216.84.45.194"]]", "content": "* Thank you Beland for updating the page, and for making it clear that XML dumps are the way of the future.\n* Is there a pre-existing way that anyone knows of to load the XML file into MySQL '''without''' having to deal with MediaWiki? (What I and presumably most people want is to get the data into a database with minimum pain and as quickly as possible.)\n* Shouldn't this generate no errors?\nxmllint 20050909_pages_current.xml\nCurrently for me it generates errors like this:\n20050909_pages_current.xml:2771209: error: xmlParseCharRef: invalid xmlChar value 55296\n[[got:\ud800\udf37\ud800\udf3b\ud800\udf30\ud800\udf39\ud800&\n^\n20050909_pages_current.xml:2771209: error: xmlParseCharRef: invalid xmlChar value 57158\n\ud800\udf37\ud800\udf3b\ud800\udf30\ud800\udf39\ud800\udf46\n^\nAll the best,\n", "parent_id": null, "replyTo_id": "104935801.2734.0", "indentation": 1, "type": "COMMENT_ADDING", "conversation_id": "104935801.2734.0", "page_title": "Wikipedia talk:Database download", "page_id": "83068", "id": "104935801.2750.0", "rev_id": "104935801"}

Here is the corresponding wiki page https://en.wikipedia.org/w/index.php?title=Main_Page&oldid=104935801

It looks like the Unicode escaping might be problematic: the record contains unpaired UTF-16 surrogate escapes (e.g. \ud800, which is the "invalid xmlChar value 55296" in the quoted xmllint errors), and strict parsers reject those.
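
A sketch of how such lines could be found ahead of time (a hypothetical validator, not the pipeline's actual parser): Python's json module accepts lone surrogates, so the check round-trips each record through strict UTF-8, which rejects them.

import json

def find_bad_json_lines(path):
    """Return (line number, error) pairs for records whose strings
    contain unpaired surrogates."""
    bad = []
    with open(path, encoding='utf-8') as f:
        for lineno, line in enumerate(f, 1):
            try:
                record = json.loads(line)
                # Encoding to strict UTF-8 raises on lone surrogates
                # such as \ud800.
                json.dumps(record, ensure_ascii=False).encode('utf-8')
            except (ValueError, UnicodeEncodeError) as err:
                bad.append((lineno, str(err)))
    return bad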

Missing pages in reconstructed result

Observations:

  • The missing pages were also missing in the ingested input, so the loss happens at or before ingestion.
  • We have been unable to determine why this happened.

Missing page examples:
'Talk:First Taranaki War',
'Template talk:Summary of casualties of the Iraq War',
'User talk:216.21.150.44',
'User talk:FARVA',
'User talk:Hawkestone',
'User talk:Lasallefan',
'User talk:Relax ull be ok',
'User talk:Wiki-star'

Improve conversation edit/add distinctions

In a few conversations we seem to have mixed up one user's edit with another user's contribution, so the conversation looks like one person talking to themselves, when it's actually supposed to be more of a discussion, e.g. 292907612.6363.6363.

Conversation Reconstruction Error

80775063.11000.11002
One of the comments (the second one in the conversation; I removed its content, which began: Both Russian President [Vladimir Putin]....) was assigned to the wrong author.

Cleanup of directory layout

I think there's a bunch of old unused stuff that I'd like to remove, and some directory structure I'd like to improve too:

Cleanup by removing:

Restructuring:

  • I think we should also move the dataflow and local versions of conversation reconstruction into a root-level reconstruct_conversations directory, so they'd be reconstruct_conversations/dataflow and reconstruct_conversations/local, instead of having three root-level directories.

Merged conversations

Sometimes we find multiple conversations merged together as one single conversation. This seems to happen when the header for a new conversation starts with
=SECTION= rather than the more common ==SECTION==. This appears to be relatively common, and seems to have an easy fix.

Suggestions from Cristian Danescu and Jonathon Chang.
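
A minimal sketch of the fix, assuming headers sit on their own line (the regexes here are illustrative): treat a level-1 header as a conversation boundary too.

import re

LEVEL1_HEADER = re.compile(r'^=([^=].*?)=\s*$')    # =SECTION=
LEVEL2_HEADER = re.compile(r'^==([^=].*?)==\s*$')  # ==SECTION==

def is_conversation_boundary(line):
    """Both header levels start a new conversation."""
    return bool(LEVEL2_HEADER.match(line) or LEVEL1_HEADER.match(line))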

Conv-Viewer comments are shown with initial wiki-markup ':' symbols

Comments sometimes start with wiki-markup indents (':' characters at the beginning of the line).

Consider interpreting these, removing them in the comment plain-text transformation, or even having a full wiki-markup interpreter and showing the conversation as it would really look.
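
A minimal sketch of the middle option, removing the indent markers in the plain-text transformation (a hypothetical helper, not existing viewer code):

def strip_wiki_indents(text):
    """Strip leading ':' indent markers from each line of a comment,
    returning the plain text and the indent depth of the first line."""
    lines = text.split('\n')
    depth = len(lines[0]) - len(lines[0].lstrip(':'))
    plain = [line.lstrip(':').lstrip() for line in lines]
    return '\n'.join(plain), depth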

Add these data cleaning heuristics

These are not elegant, but might help.

  1. Write some curse-word regexes that would convert, for example, 'f u c k' --> 'fuck' and 'f*ck' --> 'fuck' (see the sketch after this list)
  2. Identify URLs
  3. Use a fancier tokenizer
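
A minimal sketch of heuristic 1 (the pattern list is illustrative; a real list would cover many more words and maskings):

import re

SPACED_OUT = re.compile(r'\bf\s*u\s*c\s*k\b', re.IGNORECASE)
STAR_MASKED = re.compile(r'\bf[*@#$]+ck\b', re.IGNORECASE)

def normalize_obfuscation(text):
    """Collapse spaced-out and symbol-masked spellings to the plain word."""
    text = SPACED_OUT.sub('fuck', text)
    return STAR_MASKED.sub('fuck', text)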

Write subset of predictions to HTML file

Reading CSV files is tough, but it's often useful to look through the test data and predictions beyond just the accuracy metrics. One solution is to write a sample of the predictions in an HTML format that we can add some basic styling to, so it's easy to read. That way we can go from new model -> analyzing results really quickly.
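
A minimal sketch using pandas (the column layout of the predictions CSV is assumed):

import pandas as pd

def write_predictions_html(csv_path, html_path, n=100):
    """Dump a random sample of the predictions CSV as a lightly styled
    HTML table for eyeballing."""
    df = pd.read_csv(csv_path)
    sample = df.sample(n=min(n, len(df)), random_state=0)
    with open(html_path, 'w', encoding='utf-8') as f:
        f.write('<style>td {padding: 4px; font-family: sans-serif}</style>\n')
        f.write(sample.to_html(index=False, border=0))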

Authors field is double encoded

The authors field is a JSON list that is escaped and encoded as a string:

"authors": "[["0", "Conversion script"], ["63", "Wesley"]]",

This should probably be flattened into a JSON list of strings.
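
A minimal sketch of the double decode as it stands today (the record literal is a toy example):

import json

raw = '{"authors": "[[\\"0\\", \\"Conversion script\\"], [\\"63\\", \\"Wesley\\"]]"}'
record = json.loads(raw)
# The field is a JSON array serialized again as a string, so parse it twice.
authors = json.loads(record['authors'])       # [['0', 'Conversion script'], ['63', 'Wesley']]
names = [name for _user_id, name in authors]  # the proposed flat list of strings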

Add flags for all the parameters

The script to train a model for the Kaggle competition currently has a lot of parameters (here) that can't be changed via a flag when you run the script. This makes things like trying 10 different settings for a hyper-parameter tricky.

We probably don't want to "flag-ify" all these parameters, but it would be nice to move most of them to flags with reasonable defaults.
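
A minimal sketch with argparse (the parameter names and defaults here are illustrative, not the script's actual ones):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--learning_rate', type=float, default=0.001)
parser.add_argument('--batch_size', type=int, default=64)
parser.add_argument('--dropout_rate', type=float, default=0.5)
args = parser.parse_args()
# e.g. python train.py --learning_rate 0.01 --batch_size 128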

Some conversations have multiple actions with no parent

In particular, about 1% of comments have a section-heading introduction at the same time as a comment addition (same revision & timestamp, but separated into 2 actions), e.g.

{  
   "96983583.202.24":{  
      "id":"96983583.202.24",
      "comment_type":"COMMENT_ADDING",
      "content":" From the other notes from other Wikipedia notes, I'm not the only one who questions your judgment.",
      "timestamp":"2006-12-28 19:35:53 UTC",
      "status":"just added",
      "page_title":"User talk:Jinxmchue",
      "user_text":"Sotaman",
      "parent_id":""
   },
   "96983583.24.24":{  
      "id":"96983583.24.24",
      "comment_type":"SECTION_CREATION",
      "content":" ==Bluffs? You see bluffs? I'll call your bluff==",
      "timestamp":"2006-12-28 19:35:53 UTC",
      "status":"just added",
      "page_title":"User talk:Jinxmchue",
      "user_text":"Sotaman",
      "parent_id":""
   },
   "96983583.48.24":{  
      "id":"96983583.48.24",
      "comment_type":"COMMENT_ADDING",
      "content":" Here, visit this link: [EXTERNA_LINK: http://images.google.com/images?q=bluffs&ie;=UTF-8&oe;=UTF-8&rls;=org.mozilla:en-US:official&client;=firefox-a&sa;=N&tab;=wi] Then tell me you see bluffs around Montevideo, much less (as you earlier characterized them) \"spectacular\" bluffs. Sorry if my standards for geographical features are a bit higher than what's seen in your minds eye.",
      "timestamp":"2006-12-28 19:35:53 UTC",
      "status":"just added",
      "page_title":"User talk:Jinxmchue",
      "user_text":"Sotaman",
      "parent_id":"96983583.24.24"
   },
   "97139226.240.240":{  
      "id":"97139226.240.240",
      "comment_type":"COMMENT_ADDING",
      "content":" :Higher?  More like narrower.  Bluffs are not just sheer cliffs.  As I said, check the dictionary.",
      "timestamp":"2006-12-29 14:41:34 UTC",
      "status":"just added",
      "page_title":"User talk:Jinxmchue",
      "user_text":"Jinxmchue",
      "parent_id":"96983583.202.24"
   }
},

We now handle this in the JS by inferring who the parent should be, but we might want to consider doing this in the Python in the future.
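
A minimal sketch of what the Python-side inference could look like (the matching rule, same timestamp within one conversation, is a simplifying assumption):

def infer_missing_parents(conversation):
    """Adopt a same-timestamp SECTION_CREATION action as the parent of
    any COMMENT_ADDING action that has none."""
    sections = {a['timestamp']: action_id
                for action_id, a in conversation.items()
                if a['comment_type'] == 'SECTION_CREATION'}
    for a in conversation.values():
        if (a['comment_type'] == 'COMMENT_ADDING' and not a['parent_id']
                and a['timestamp'] in sections):
            a['parent_id'] = sections[a['timestamp']]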

Cleanup of the data-structure for crowdsourcing

On the format for comments/conversations in the CSV rows for crowdflower:

So... what we currently have now does basically work, but it's messier and more error-prone than I'd like it to be.

The current interface/JSON type description for a comment is (in TypeScript syntax):

interface Comment {
  // TODO(ldixon): make it always a string, and have it empty string for not present, instead of -1; also rename to parent_id
  absolute_replyTo: string | number; // id of the parent
  comment_type : 'COMMENT_MODIFICATION' | 'COMMENT_ADDING';
  content: string;
  indentation : string;  // but this is a number inside the string;
  // TODO(ldixon): remove
  parent_ids: { [id:string]: boolean };
  // TODO(ldixon): remove
  relative_replyTo : string | number; // relative id of the parent.
  status: 'just added' | 'content changed';
  timestamp : string;
  // TODO(ldixon): remove; not needed and in fact harmful (can be used to game crowdsourcing)
  toxicity_score : number;
  // TODO(ldixon): change up-stream to be a hash of the user-id. rename to hashed_user_id
  user_text: string;
}

I suggest instead we make it look like so:

interface Comment {
  id: string;
  parent_id: string;
  comment_type : 'COMMENT_MODIFICATION' | 'COMMENT_ADDING' | ...;
  content: string;
  status: 'just added' | 'content changed'| ...;
  timestamp : string;
  hashed_user_id: string;
}

And we keep a conversation as just:

interface Conversation { [id: string]: Comment }

i.e. we change from data looking like this:

{
  "550613551.0.0":{
      "content":"== Name == ",
      "indentation":"-1",
      "comment_type":"COMMENT_MODIFICATION",
      "toxicity_score":0.11125048073648158,
      "user_text":"Adamdaley",
      "timestamp":"2013-04-16 08:55:31 UTC",
      "absolute_replyTo":-1,
      "status":"just added",
      "relative_replyTo":-1,
      "parent_ids":{
         "550613551.0.0":true
      }
   },
   "675014505.416.416":{
      "content":" I edited it to the largest \"labor\" uprising and the largest \"organized armed uprising\" since the civil war. They were not in rebellion per se and the race riots of the 60's are clearly a larger uprising (I'm not too sure on armed).",
      "indentation":"0",
      "comment_type":"COMMENT_ADDING",
      "toxicity_score":0.06011961435406282,
      "user_text":"70.151.72.162",
      "timestamp":"2015-08-07 17:03:18 UTC",
      "absolute_replyTo":"550613551.0.0",
      "status":"just added",
      "relative_replyTo":0,
      "parent_ids":{
         "675014505.416.416":true
      }
   }
}

to looking like so:

{
  "550613551.0.0":{
      "id": "550613551.0.0",
      "parent_id": "",
      "content":"== Name == ",
      "comment_type":"COMMENT_MODIFICATION",
      "hashed_user_id":"DKJHWEIU",
      "timestamp":"2013-04-16 08:55:31 UTC",
      "status":"just added",
   },
   "675014505.416.416":{
      "id": "550613551.0.0",
      "parent_id": "550613551.0.0",
      "content":" I edited it to the largest \"labor\" uprising and the largest \"organized armed uprising\" since the civil war. They were not in rebellion per se and the race riots of the 60's are clearly a larger uprising (I'm not too sure on armed).",
      "comment_type":"COMMENT_ADDING",
      "hashed_user_id":"NMCWUPR",
      "timestamp":"2015-08-07 17:03:18 UTC",
      "status":"just added",
   },
}

This avoids:

  • Leaking the score which would help people game the job
  • Showing IP addresses/real usernames to raters

It also:

  • Makes the names of fields consistent.
  • Removes fields that we don't actually want in the crowd-sourcing.
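
A minimal sketch of the old -> new conversion (the hashing scheme for user ids is an assumption; any keyed or salted hash would do):

import hashlib

def to_crowdsourcing_comment(comment_id, old):
    """Convert an old-format comment to the proposed format, dropping
    the fields we don't want raters to see."""
    parent = old.get('absolute_replyTo', -1)
    return {
        'id': comment_id,
        'parent_id': parent if isinstance(parent, str) else '',
        'comment_type': old['comment_type'],
        'content': old['content'],
        'status': old['status'],
        'timestamp': old['timestamp'],
        # Hash the username so raters never see IPs or real usernames.
        'hashed_user_id':
            hashlib.sha256(old['user_text'].encode('utf-8')).hexdigest()[:12],
    }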

Add page title to json

Let's add the page title to each comment in a conversation, so that we can provide it as additional context in the annotation job.

Compare Kaggle model results against Perspective API scores

As a way to further evaluate these models, it would be nice to have a flag that scores a subset of the test data using the Perspective API (see the sketch after this list). I'm imagining outputting results that have:

  • comment_id
  • comment_text
  • y_class (e.g. 'toxic', 'obscene' etc.)
  • y_gold (if available)
  • y_prob (e.g. .89, 0.03 etc.)
  • perspective_api_prob
  • |y_prob - perspective_api_prob|
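
A minimal sketch of fetching perspective_api_prob via the public comments:analyze endpoint (only TOXICITY requested; batching, retries, and quota handling omitted):

import requests

API_URL = 'https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze'

def perspective_toxicity(text, api_key):
    """Return the Perspective TOXICITY summary score for one comment."""
    body = {'comment': {'text': text},
            'requestedAttributes': {'TOXICITY': {}}}
    resp = requests.post(API_URL, params={'key': api_key}, json=body)
    resp.raise_for_status()
    return resp.json()['attributeScores']['TOXICITY']['summaryScore']['value']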

Write out the results in the correct format for the Kaggle competition

The Kaggle competition requires submissions to be formatted like this:

id,toxic,severe_toxic,obscene,threat,insult,identity_hate
6044863,0.5,0.5,0.5,0.5,0.5,0.5
6102620,0.5,0.5,0.5,0.5,0.5,0.5
14563293,0.5,0.5,0.5,0.5,0.5,0.5
21086297,0.5,0.5,0.5,0.5,0.5,0.5

We're not actually competing in the competition, but it would be good to output our predictions in the same format so we can test our scoring scripts.
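
A minimal sketch of a writer for that format ('predictions' is assumed to map comment id -> per-class probabilities):

import csv

CLASSES = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

def write_submission(predictions, path):
    """Write predictions as a Kaggle-format submission CSV."""
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['id'] + CLASSES)
        for comment_id, probs in predictions.items():
            writer.writerow([comment_id] + [probs[c] for c in CLASSES])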

Add fields to finished ingestions

Some of the previously finished ingested batches are missing two fields: records_count (the number of records after resizing the record) and record_index (the index of the sub-piece of the record), because they were run with an earlier version of the code.

Consider better separation of modifications from additions.

Consider whether we can separate out edits to other users' comments, as there seem to be quite a few. We should investigate conversations like conversation id 85577055.2121.2121.

For minor edits, ideally we'd show the main author, then just say "minor edits by ...", and perhaps even underline the bits that differ in some way.
