conversationai / wikidetox
Experiments to help discussion on Wikipedia talk pages.
License: Apache License 2.0
If we change the runtime arguments of dataflow_main, we should update the commands in the helper shell script accordingly. This can be done after all the data cleaning.
The ClusterSpec class will allow us to switch between training on CPU and GPU. Right now we only train on CPU, but it should be easy to use GPU with ml-engine.
Docs: https://www.tensorflow.org/api_docs/python/tf/train/ClusterSpec
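As a rough sketch of how this could work (the cluster address, device strings, and flag below are placeholders, not our actual ml-engine setup), a ClusterSpec plus tf.device lets the same training code target CPU or GPU:

import tensorflow as tf

# Hypothetical single-machine cluster; job name and port are placeholders.
cluster = tf.train.ClusterSpec({'worker': ['localhost:2222']})
server = tf.train.Server(cluster, job_name='worker', task_index=0)

use_gpu = False  # would come from a command-line flag
device = '/gpu:0' if use_gpu else '/cpu:0'

with tf.device(device):
    x = tf.placeholder(tf.float32, shape=[None, 10])
    w = tf.Variable(tf.zeros([10, 1]))
    y = tf.matmul(x, w)

# allow_soft_placement falls back to CPU when no GPU is available.
config = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(server.target, config=config) as sess:
    sess.run(tf.global_variables_initializer())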
Machine annotation: at least one of the last edits being toxic indicates a bad conversation.
Human annotation: whether the deepest last edit is toxic indicates a bad conversation.
Thus we might need to post-process some human-annotated conversations if they are:
The code here to train models for the toxicity Kaggle competition currently doesn't save the models. It writes out predictions and probabilities for the held-out data and unlabeled test data, but it doesn't save the model itself, so we can't re-run it later without retraining.
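A minimal sketch of saving and restoring, assuming a TF 1.x graph-based model (the variable, graph, and path here are placeholders, not the script's actual model):

import os
import tensorflow as tf

# Stand-in graph; the real script would build the toxicity model here.
w = tf.Variable(tf.zeros([10, 1]), name='weights')
saver = tf.train.Saver()

if not os.path.exists('checkpoints'):
    os.makedirs('checkpoints')

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # ... training would happen here ...
    save_path = saver.save(sess, 'checkpoints/toxicity_model.ckpt')

# Later, re-run predictions without retraining:
with tf.Session() as sess:
    saver.restore(sess, save_path)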
Add continuous integration testing using a .travis.yml file.
My pipeline detected 3 invalid JSON lines in the data. Here is one such record:
Check failed: reader_.parse(json_line, root, false) {"user_id": null, "user_text": "216.84.45.194", "timestamp": "2007-02-01T21:50:19Z", "authors": "[[null, "216.84.45.194"]]", "content": "* Thank you Beland for updating the page, and for making it clear that XML dumps are the way of the future.\n* Is there a pre-existing way that anyone knows of to load the XML file into MySQL '''without''' having to deal with MediaWiki? (What I and presumably most people want is to get the data into a database with minimum pain and as quickly as possible.)\n* Shouldn't this generate no errors?\nxmllint 20050909_pages_current.xml\nCurrently for me it generates errors like this:\n20050909_pages_current.xml:2771209: error: xmlParseCharRef: invalid xmlChar value 55296\n[[got:\ud800\udf37\ud800\udf3b\ud800\udf30\ud800\udf39\ud800&\n^\n20050909_pages_current.xml:2771209: error: xmlParseCharRef: invalid xmlChar value 57158\n\ud800\udf37\ud800\udf3b\ud800\udf30\ud800\udf39\ud800\udf46\n^\nAll the best,\n", "parent_id": null, "replyTo_id": "104935801.2734.0", "indentation": 1, "type": "COMMENT_ADDING", "conversation_id": "104935801.2734.0", "page_title": "Wikipedia talk:Database download", "page_id": "83068", "id": "104935801.2750.0", "rev_id": "104935801"}
Here is the corresponding wiki page https://en.wikipedia.org/w/index.php?title=Main_Page&oldid=104935801
It looks like the Unicode escaping might be the problem: the content includes a lone high surrogate (\ud800, which xmllint also rejects as invalid character 55296). The authors field's unescaped inner quotes ("[[null, "216.84.45.194"]]") would also make the line invalid JSON on their own.
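A quick sketch for counting and inspecting the bad lines (the file name is a placeholder, not the pipeline's actual input):

import json

bad_lines = []
with open('ingested_records.json') as f:  # hypothetical input file
    for i, line in enumerate(f):
        try:
            json.loads(line)
        except ValueError:  # json.JSONDecodeError in Python 3.5+
            bad_lines.append((i, line[:200]))

print('%d invalid JSON lines' % len(bad_lines))
for i, snippet in bad_lines:
    print(i, snippet)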
Observation:
The missing pages were also missing in the ingested input.
We were unable to determine why this happened.
Missing page examples:
'Talk:First Taranaki War',
'Template talk:Summary of casualties of the Iraq War',
'User talk:216.21.150.44',
'User talk:FARVA',
'User talk:Hawkestone',
'User talk:Lasallefan',
'User talk:Relax ull be ok',
'User talk:Wiki-star'
In a few conversations we seem to have mixed up one user's edit with another's contribution, so the conversation looks like one person talking to themselves, but actually it's supposed to be more of a discussion, I think, e.g. 292907612.6363.6363
80775063.11000.11002
One of the comments (the second one in the conversation; I removed it: "Both Russian President [Vladimir Putin]....") was assigned to the wrong author.
I think there's a bunch of old unused stuff that I'd like to remove, and some directory structure I'd like to improve on too:
Cleanup by removing:
- get_revisions directory. See PR: #63
- conversation_reconstruction directory, which doesn't currently contain anything that I think is in use. See PR: #64
- conversation_reconstruction_dataflow/ingest_revisions/.dataflow_main.py.swp; done in commit f92c785 (oops, I meant to do this as a PR; will set up security so we don't push directly to master in future)

Restructuring:
- Move things under a single reconstruct_conversations directory, so they'd be reconstruct_conversations/dataflow and reconstruct_conversations/local, instead of having three root-level directories.

Sometimes we find multiple conversations that are merged together as one single conversation. This seems to happen when the header for a new conversation starts with =SECTION= rather than the more common ==SECTION==. This seems to be relatively common, and to have an easy fix (sketched below).
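A sketch of the kind of heading-level check that might fix this (the helper name is made up):

import re

# Matches wiki headings like '=SECTION=' or '==SECTION=='; the length of
# group 1 is the heading level (1 for '=TITLE=', 2 for '==TITLE==').
HEADING_RE = re.compile(r'^(=+)\s*(.*?)\s*\1\s*$')

def heading_level(line):
    match = HEADING_RE.match(line.strip())
    return len(match.group(1)) if match else None

assert heading_level('=SECTION=') == 1    # should also start a new conversation
assert heading_level('==SECTION==') == 2  # the common case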
Suggestions from Cristian Danescu and Jonathon Chang.
Comments sometimes start with wiki-markup indents (':'s at the beginning of the line).
Consider interpreting these; removing them in the comment plain-text transformation; or even having a full wiki-markup interpreter and showing the conversation as it would really look.
These are not elegant, but might help; a minimal indent-stripping sketch is below.
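For the simplest option (the function name is made up):

def strip_wiki_indent(text):
    """Returns (indent_depth, text without its leading ':' markers)."""
    depth = 0
    while depth < len(text) and text[depth] == ':':
        depth += 1
    return depth, text[depth:].lstrip()

assert strip_wiki_indent('::I disagree.') == (2, 'I disagree.')
assert strip_wiki_indent('No indent.') == (0, 'No indent.')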
Reading CSV files is tough, but it's often useful to look through the test data and predictions beyond just the accuracy metrics. One solution is to write a sample of the predictions in an HTML format that we can add some basic styling to, so it's easy to read. That way we can go from new model -> analyzing results really quickly.
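A sketch using pandas (the file name and column layout are assumptions, not the script's actual output):

import pandas as pd

df = pd.read_csv('predictions.csv')  # hypothetical predictions file

# A small random sample is enough to eyeball model behaviour.
sample = df.sample(n=100, random_state=0)

with open('predictions_sample.html', 'w') as f:
    f.write('<style>td { max-width: 40em; font-family: sans-serif; }</style>')
    f.write(sample.to_html(index=False))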
The authors field is a JSON list that is escaped and encoded as a string:
"authors": "[["0", "Conversion script"], ["63", "Wesley"]]",
This should probably be flattened into a JSON list of strings.
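A sketch of the decode-and-flatten step (assuming records are dicts shaped like the example above):

import json

record = {'authors': '[["0", "Conversion script"], ["63", "Wesley"]]'}

# Decode the doubly-encoded list, then keep just the user-text strings.
pairs = json.loads(record['authors'])
record['authors'] = [user_text for _, user_text in pairs]
print(record['authors'])  # ['Conversion script', 'Wesley']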
The script to train a model for the Kaggle competition currently has a lot of parameters (here) that can't be changed via a flag when you run the script. This makes things like trying 10 different settings for a hyper-parameter tricky.
We probably don't want to "flag-ify" all these parameters, but it would be nice to move most of them to flags with reasonable defaults, as sketched below.
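For example, with argparse (the particular hyper-parameters and defaults are illustrative, not the script's actual values):

import argparse

parser = argparse.ArgumentParser(description='Train a toxicity model.')
parser.add_argument('--learning-rate', type=float, default=0.001)
parser.add_argument('--batch-size', type=int, default=128)
parser.add_argument('--dropout-rate', type=float, default=0.3)
args = parser.parse_args()

# e.g. python train.py --learning-rate 0.01 --batch-size 64
print(args.learning_rate, args.batch_size, args.dropout_rate)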
In particular, about 1% of comments have a section-heading introduction at the same time as a comment addition (same revision & timestamp, but separated into 2 actions), e.g.
{
"96983583.202.24":{
"id":"96983583.202.24",
"comment_type":"COMMENT_ADDING",
"content":" From the other notes from other Wikipedia notes, I'm not the only one who questions your judgment.",
"timestamp":"2006-12-28 19:35:53 UTC",
"status":"just added",
"page_title":"User talk:Jinxmchue",
"user_text":"Sotaman",
"parent_id":""
},
"96983583.24.24":{
"id":"96983583.24.24",
"comment_type":"SECTION_CREATION",
"content":" ==Bluffs? You see bluffs? I'll call your bluff==",
"timestamp":"2006-12-28 19:35:53 UTC",
"status":"just added",
"page_title":"User talk:Jinxmchue",
"user_text":"Sotaman",
"parent_id":""
},
"96983583.48.24":{
"id":"96983583.48.24",
"comment_type":"COMMENT_ADDING",
"content":" Here, visit this link: [EXTERNA_LINK: http://images.google.com/images?q=bluffs&ie;=UTF-8&oe;=UTF-8&rls;=org.mozilla:en-US:official&client;=firefox-a&sa;=N&tab;=wi] Then tell me you see bluffs around Montevideo, much less (as you earlier characterized them) \"spectacular\" bluffs. Sorry if my standards for geographical features are a bit higher than what's seen in your minds eye.",
"timestamp":"2006-12-28 19:35:53 UTC",
"status":"just added",
"page_title":"User talk:Jinxmchue",
"user_text":"Sotaman",
"parent_id":"96983583.24.24"
},
"97139226.240.240":{
"id":"97139226.240.240",
"comment_type":"COMMENT_ADDING",
"content":" :Higher? More like narrower. Bluffs are not just sheer cliffs. As I said, check the dictionary.",
"timestamp":"2006-12-29 14:41:34 UTC",
"status":"just added",
"page_title":"User talk:Jinxmchue",
"user_text":"Jinxmchue",
"parent_id":"96983583.202.24"
}
}
We now handle this in the JS by inferring who the parent should be, but we might want to consider doing this in the Python in future, e.g. along the lines of the sketch below.
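A sketch of that inference (the heuristic and function name are assumptions; actions are keyed by id as in the example above):

def infer_missing_parents(actions):
    """For a COMMENT_ADDING with no parent, if a SECTION_CREATION in the
    same conversation shares its timestamp, treat that section as parent."""
    sections = {a['timestamp']: a['id']
                for a in actions.values()
                if a['comment_type'] == 'SECTION_CREATION'}
    for a in actions.values():
        if (a['comment_type'] == 'COMMENT_ADDING' and not a['parent_id']
                and a['timestamp'] in sections):
            a['parent_id'] = sections[a['timestamp']]
    return actions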
Change the GroupBy on timestamp to a Partition function.
conversation id: 261839933.1349.1349
Ingestion failed on the 7z files listed here:
On the format for comments/conversations in the CSV rows for crowdflower:
So... what we currently have now does basically work, but it's messier and more error-prone than I'd like it to be.
The current interface/JSON type description for a comment is (in TypeScript syntax):
interface Comment {
// TODO(ldixon): make it always a string, and have it empty string for not present, instead of -1; also rename to parent_id
absolute_replyTo: string | number; // id of the parent
comment_type : 'COMMENT_MODIFICATION' | 'COMMENT_ADDING';
content: string;
indentation : string; // but this is a number inside the string;
// TODO(ldixon): remove
parent_ids: { [id:string]: boolean };
// TODO(ldixon): remove
relative_replyTo : string | number; // relative id of the parent.
status: 'just added' | 'content changed';
timestamp : string;
// TODO(ldixon): remove; not needed and in fact harmful (can be used to game crowdsourcing)
toxicity_score : number;
// TODO(ldixon): change up-stream to be a hash of the user-id. rename to hashed_user_id
user_text: string;
}
I suggest instead we make it look like so:
interface Comment {
id: string;
parent_id: string;
comment_type : 'COMMENT_MODIFICATION' | 'COMMENT_ADDING' | ...;
content: string;
status: 'just added' | 'content changed'| ...;
timestamp : string;
hashed_user_id: string;
}
And we keep a conversation as just:
interface Conversation { [id: string]: Comment }
i.e. we change from data looking like this:
{
"550613551.0.0":{
"content":"== Name == ",
"indentation":"-1",
"comment_type":"COMMENT_MODIFICATION",
"toxicity_score":0.11125048073648158,
"user_text":"Adamdaley",
"timestamp":"2013-04-16 08:55:31 UTC",
"absolute_replyTo":-1,
"status":"just added",
"relative_replyTo":-1,
"parent_ids":{
"550613551.0.0":true
}
},
"675014505.416.416":{
"content":" I edited it to the largest \"labor\" uprising and the largest \"organized armed uprising\" since the civil war. They were not in rebellion per se and the race riots of the 60's are clearly a larger uprising (I'm not too sure on armed).",
"indentation":"0",
"comment_type":"COMMENT_ADDING",
"toxicity_score":0.06011961435406282,
"user_text":"70.151.72.162",
"timestamp":"2015-08-07 17:03:18 UTC",
"absolute_replyTo":"550613551.0.0",
"status":"just added",
"relative_replyTo":0,
"parent_ids":{
"675014505.416.416":true
}
}
}
to looking like so:
{
"550613551.0.0":{
"id": "550613551.0.0",
"parent_id": "",
"content":"== Name == ",
"comment_type":"COMMENT_MODIFICATION",
"hashed_user_id":"DKJHWEIU",
"timestamp":"2013-04-16 08:55:31 UTC",
"status":"just added",
},
"675014505.416.416":{
"id": "550613551.0.0",
"parent_id": "550613551.0.0",
"content":" I edited it to the largest \"labor\" uprising and the largest \"organized armed uprising\" since the civil war. They were not in rebellion per se and the race riots of the 60's are clearly a larger uprising (I'm not too sure on armed).",
"comment_type":"COMMENT_ADDING",
"hashed_user_id":"NMCWUPR",
"timestamp":"2015-08-07 17:03:18 UTC",
"status":"just added",
},
}
This avoids:
It also:
Let's add the page title to each comment in a conversation so that we can provide it as additional context in the annotation job.
As a way to further evaluate these models, it would be nice to have a flag that scores a subset of the test data using the Perspective API. I'm imagining outputting results that have:
- comment_id
- comment_text
- y_class (e.g. 'toxic', 'obscene', etc.)
- y_gold (if available)
- y_prob (e.g. 0.89, 0.03, etc.)
- perspective_api_prob
- |y_prob - perspective_api_prob|
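A sketch of the scoring call (assumes a Perspective API key; the request shape follows the public AnalyzeComment REST endpoint):

import json
import requests

API_URL = ('https://commentanalyzer.googleapis.com/v1alpha1/'
           'comments:analyze?key=YOUR_API_KEY')  # key is a placeholder

def perspective_api_prob(comment_text):
    body = {'comment': {'text': comment_text},
            'requestedAttributes': {'TOXICITY': {}}}
    response = requests.post(API_URL, data=json.dumps(body))
    scores = response.json()
    return scores['attributeScores']['TOXICITY']['summaryScore']['value']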
The Kaggle competition requires the submissions be formatted like this:
id,toxic,severe_toxic,obscene,threat,insult,identity_hate
6044863,0.5,0.5,0.5,0.5,0.5,0.5
6102620,0.5,0.5,0.5,0.5,0.5,0.5
14563293,0.5,0.5,0.5,0.5,0.5,0.5
21086297,0.5,0.5,0.5,0.5,0.5,0.5
We're not actually competing in the competition, but it would be good to output our predictions in the same format so we can test our scoring scripts; a writer sketch is below.
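Assuming predictions as a dict of id -> per-class probabilities (the names here are placeholders):

import csv

CLASSES = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult',
           'identity_hate']

def write_submission(predictions, path):
    # predictions: {comment_id: {class_name: probability}}
    with open(path, 'w') as f:
        writer = csv.writer(f)
        writer.writerow(['id'] + CLASSES)
        for comment_id, probs in sorted(predictions.items()):
            writer.writerow([comment_id] + [probs[c] for c in CLASSES])

write_submission({6044863: {c: 0.5 for c in CLASSES}}, 'submission.csv')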
Some of the previously finished ingested batches are missing two extra fields: records_count (the number of records after resizing the record) and record_index (the index of the sub-piece of the record), due to having been run with an earlier version of the code.
Looks like ml-engine supports hyperparameter tuning. It would be great to integrate with that.
Docs: https://cloud.google.com/ml-engine/docs/hyperparameter-tuning-overview
e.g. in conversation id 232341942.8700.8700, the last comment starts: "Issue of the MonthThe great 'The/the debates' return. ". "Month" and "The" don't have any space between them. I think that's due to something in the Python comment interpretation?
It would be great to add a hook into Tensorboard so we can visualize training.
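A minimal TF 1.x sketch (the loss tensor and log directory are placeholders, not our actual training code):

import tensorflow as tf

loss = tf.placeholder(tf.float32, name='loss')  # stand-in for the real loss
tf.summary.scalar('loss', loss)
merged = tf.summary.merge_all()

with tf.Session() as sess:
    writer = tf.summary.FileWriter('logs/run1', sess.graph)
    for step in range(100):
        summary = sess.run(merged, feed_dict={loss: 1.0 / (step + 1)})
        writer.add_summary(summary, step)
    writer.close()

# Then visualize with: tensorboard --logdir logs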
Using token offsets makes us sensitive to the tokenization algorithm. It would be better to use character offsets instead, if that's reasonable to do; a conversion sketch is below.
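A sketch of recovering character offsets from tokens (assumes the tokens appear in order in the original text; the function name is made up):

def token_to_char_offsets(text, tokens):
    # Maps each token to its (start, end) character offsets in text.
    offsets = []
    cursor = 0
    for token in tokens:
        start = text.index(token, cursor)  # assumes in-order tokens
        end = start + len(token)
        offsets.append((start, end))
        cursor = end
    return offsets

print(token_to_char_offsets('Month and The', ['Month', 'and', 'The']))
# [(0, 5), (6, 9), (10, 13)]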
Consider whether we can separate out edits to others' comments, as there actually seem to be quite a few. We should investigate conversations like conversation id 85577055.2121.2121.
For minor edits, ideally we'd show the main author and then just say "minor edits by ...", and perhaps even underline the bits that are different in some way.