conversationai / wikidetox
Experiments to help discussion on Wikipedia talk pages.
License: Apache License 2.0
If we change the runtime arguments of dataflow_main, we should update the commands in the helper shell script accordingly. This can be done after all the data cleaning.
The ClusterSpec class will allow us to switch between training on CPU and GPU. Right now we only train on CPU, but it should be easy to use GPU with ml-engine.
Docs: https://www.tensorflow.org/api_docs/python/tf/train/ClusterSpec
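As a rough sketch of how this could work (the cluster address, device strings, and flag below are placeholders, not our actual ml-engine setup), a ClusterSpec plus tf.device lets the same training code target CPU or GPU:

import tensorflow as tf

# Hypothetical single-machine cluster; job name and port are placeholders.
cluster = tf.train.ClusterSpec({'worker': ['localhost:2222']})
server = tf.train.Server(cluster, job_name='worker', task_index=0)

use_gpu = False  # would come from a command-line flag
device = '/gpu:0' if use_gpu else '/cpu:0'

with tf.device(device):
    x = tf.placeholder(tf.float32, shape=[None, 10])
    w = tf.Variable(tf.zeros([10, 1]))
    y = tf.matmul(x, w)

# allow_soft_placement falls back to CPU when no GPU is available.
config = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(server.target, config=config) as sess:
    sess.run(tf.global_variables_initializer())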
Machine annotation: at least one of the last edits being toxic indicates a bad conversation.
Human annotation: whether the deepest last edit is toxic indicates a bad conversation.
Thus we might need to post-process some human-annotated conversations if they are:
The code here to train models for the toxicity Kaggle competition currently doesn't save the models. It writes out predictions and probabilities for the held-out data and unlabeled test data, but it doesn't save the model itself, so we can't re-run it later without retraining.
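A minimal sketch of saving and restoring, assuming a TF 1.x graph-based model (the variable, graph, and path here are placeholders, not the script's actual model):

import os
import tensorflow as tf

# Stand-in graph; the real script would build the toxicity model here.
w = tf.Variable(tf.zeros([10, 1]), name='weights')
saver = tf.train.Saver()

if not os.path.exists('checkpoints'):
    os.makedirs('checkpoints')

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # ... training would happen here ...
    save_path = saver.save(sess, 'checkpoints/toxicity_model.ckpt')

# Later, re-run predictions without retraining:
with tf.Session() as sess:
    saver.restore(sess, save_path)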
Add continuous integration testing using a .travis.yml file.
My pipeline detected 3 invalid JSON lines in the data. Here is one such record:
Check failed: reader_.parse(json_line, root, false) {"user_id": null, "user_text": "216.84.45.194", "timestamp": "2007-02-01T21:50:19Z", "authors": "[[null, "216.84.45.194"]]", "content": "* Thank you Beland for updating the page, and for making it clear that XML dumps are the way of the future.\n* Is there a pre-existing way that anyone knows of to load the XML file into MySQL '''without''' having to deal with MediaWiki? (What I and presumably most people want is to get the data into a database with minimum pain and as quickly as possible.)\n* Shouldn't this generate no errors?\nxmllint 20050909_pages_current.xml\nCurrently for me it generates errors like this:\n20050909_pages_current.xml:2771209: error: xmlParseCharRef: invalid xmlChar value 55296\n[[got:\ud800\udf37\ud800\udf3b\ud800\udf30\ud800\udf39\ud800&\n^\n20050909_pages_current.xml:2771209: error: xmlParseCharRef: invalid xmlChar value 57158\n\ud800\udf37\ud800\udf3b\ud800\udf30\ud800\udf39\ud800\udf46\n^\nAll the best,\n", "parent_id": null, "replyTo_id": "104935801.2734.0", "indentation": 1, "type": "COMMENT_ADDING", "conversation_id": "104935801.2734.0", "page_title": "Wikipedia talk:Database download", "page_id": "83068", "id": "104935801.2750.0", "rev_id": "104935801"}
Here is the corresponding wiki page https://en.wikipedia.org/w/index.php?title=Main_Page&oldid=104935801
It looks like the Unicode escaping might be the problem: the content includes a lone high surrogate (\ud800, which xmllint also rejects as invalid character 55296). The authors field's unescaped inner quotes ("[[null, "216.84.45.194"]]") would also make the line invalid JSON on their own.
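A quick sketch for counting and inspecting the bad lines (the file name is a placeholder, not the pipeline's actual input):

import json

bad_lines = []
with open('ingested_records.json') as f:  # hypothetical input file
    for i, line in enumerate(f):
        try:
            json.loads(line)
        except ValueError:  # json.JSONDecodeError in Python 3.5+
            bad_lines.append((i, line[:200]))

print('%d invalid JSON lines' % len(bad_lines))
for i, snippet in bad_lines:
    print(i, snippet)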
Observation:
The missing pages were also missing in the ingested input.
We were unable to determine why this happened.
Missing page examples:
'Talk:First Taranaki War',
'Template talk:Summary of casualties of the Iraq War',
'User talk:216.21.150.44',
'User talk:FARVA',
'User talk:Hawkestone',
'User talk:Lasallefan',
'User talk:Relax ull be ok',
'User talk:Wiki-star'
In a few conversations we seem to have mixed up one user's edit with another's contribution, so the conversation looks like one person talking to themselves, but actually it's supposed to be more of a discussion, I think, e.g. 292907612.6363.6363
80775063.11000.11002
One of the comments (the second one in the conversation; I removed it: "Both Russian President [Vladimir Putin]....") was assigned to the wrong author.
I think there's a bunch of old unused stuff that I'd like to remove, and some directory structure I'd like to improve on too:
Cleanup by removing:
- get_revisions directory. See PR: #63
- conversation_reconstruction directory, which doesn't currently contain anything that I think is in use. See PR: #64
- conversation_reconstruction_dataflow/ingest_revisions/.dataflow_main.py.swp; done in commit f92c785 (oops, I meant to do this as a PR; will set up security so we don't push directly to master in future)

Restructuring:
- Move things under a single reconstruct_conversations directory, so they'd be reconstruct_conversations/dataflow and reconstruct_conversations/local, instead of having three root-level directories.

Sometimes we find multiple conversations that are merged together as one single conversation. This seems to happen when the header for a new conversation starts with =SECTION= rather than the more common ==SECTION==. This seems to be relatively common, and to have an easy fix (sketched below).
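A sketch of the kind of heading-level check that might fix this (the helper name is made up):

import re

# Matches wiki headings like '=SECTION=' or '==SECTION=='; the length of
# group 1 is the heading level (1 for '=TITLE=', 2 for '==TITLE==').
HEADING_RE = re.compile(r'^(=+)\s*(.*?)\s*\1\s*$')

def heading_level(line):
    match = HEADING_RE.match(line.strip())
    return len(match.group(1)) if match else None

assert heading_level('=SECTION=') == 1    # should also start a new conversation
assert heading_level('==SECTION==') == 2  # the common case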
Suggestions from Cristian Danescu and Jonathon Chang.
Comments sometimes start with wiki-markup indents (':'s at the beginning of the line).
Consider interpreting these; removing them in the comment plain-text transformation; or even having a full wiki-markup interpreter and showing the conversation as it would really look.
These are not elegant, but might help; a minimal indent-stripping sketch is below.
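For the simplest option (the function name is made up):

def strip_wiki_indent(text):
    """Returns (indent_depth, text without its leading ':' markers)."""
    depth = 0
    while depth < len(text) and text[depth] == ':':
        depth += 1
    return depth, text[depth:].lstrip()

assert strip_wiki_indent('::I disagree.') == (2, 'I disagree.')
assert strip_wiki_indent('No indent.') == (0, 'No indent.')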
Reading CSV files is tough, but it's often useful to look through the test data and predictions beyond just the accuracy metrics. One solution is to write a sample of the predictions in an HTML format that we can add some basic styling to, so it's easy to read. That way we can go from new model -> analyzing results really quickly.
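A sketch using pandas (the file name and column layout are assumptions, not the script's actual output):

import pandas as pd

df = pd.read_csv('predictions.csv')  # hypothetical predictions file

# A small random sample is enough to eyeball model behaviour.
sample = df.sample(n=100, random_state=0)

with open('predictions_sample.html', 'w') as f:
    f.write('<style>td { max-width: 40em; font-family: sans-serif; }</style>')
    f.write(sample.to_html(index=False))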
The authors field is a JSON list that is escaped and encoded as a string:
"authors": "[["0", "Conversion script"], ["63", "Wesley"]]",
This should probably be flattened into a JSON list of strings.
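A sketch of the decode-and-flatten step (assuming records are dicts shaped like the example above):

import json

record = {'authors': '[["0", "Conversion script"], ["63", "Wesley"]]'}

# Decode the doubly-encoded list, then keep just the user-text strings.
pairs = json.loads(record['authors'])
record['authors'] = [user_text for _, user_text in pairs]
print(record['authors'])  # ['Conversion script', 'Wesley']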
The script to train a model for the Kaggle competition currently has a lot of parameters (here) that can't be changed via a flag when you run the script. This makes things like trying 10 different settings for a hyper-parameter tricky.
We probably don't want to "flag-ify" all these parameters, but it would be nice to move most of them to flags with reasonable defaults, as sketched below.
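For example, with argparse (the particular hyper-parameters and defaults are illustrative, not the script's actual values):

import argparse

parser = argparse.ArgumentParser(description='Train a toxicity model.')
parser.add_argument('--learning-rate', type=float, default=0.001)
parser.add_argument('--batch-size', type=int, default=128)
parser.add_argument('--dropout-rate', type=float, default=0.3)
args = parser.parse_args()

# e.g. python train.py --learning-rate 0.01 --batch-size 64
print(args.learning_rate, args.batch_size, args.dropout_rate)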
In particular, about 1% of comments have a section-heading introduction at the same time as a comment addition (same revision & timestamp, but separated into 2 actions), e.g.
{
"96983583.202.24":{
"id":"96983583.202.24",
"comment_type":"COMMENT_ADDING",
"content":" From the other notes from other Wikipedia notes, I'm not the only one who questions your judgment.",
"timestamp":"2006-12-28 19:35:53 UTC",
"status":"just added",
"page_title":"User talk:Jinxmchue",
"user_text":"Sotaman",
"parent_id":""
},
"96983583.24.24":{
"id":"96983583.24.24",
"comment_type":"SECTION_CREATION",
"content":" ==Bluffs? You see bluffs? I'll call your bluff==",
"timestamp":"2006-12-28 19:35:53 UTC",
"status":"just added",
"page_title":"User talk:Jinxmchue",
"user_text":"Sotaman",
"parent_id":""
},
"96983583.48.24":{
"id":"96983583.48.24",
"comment_type":"COMMENT_ADDING",
"content":" Here, visit this link: [EXTERNA_LINK: http://images.google.com/images?q=bluffs&ie;=UTF-8&oe;=UTF-8&rls;=org.mozilla:en-US:official&client;=firefox-a&sa;=N&tab;=wi] Then tell me you see bluffs around Montevideo, much less (as you earlier characterized them) \"spectacular\" bluffs. Sorry if my standards for geographical features are a bit higher than what's seen in your minds eye.",
"timestamp":"2006-12-28 19:35:53 UTC",
"status":"just added",
"page_title":"User talk:Jinxmchue",
"user_text":"Sotaman",
"parent_id":"96983583.24.24"
},
"97139226.240.240":{
"id":"97139226.240.240",
"comment_type":"COMMENT_ADDING",
"content":" :Higher? More like narrower. Bluffs are not just sheer cliffs. As I said, check the dictionary.",
"timestamp":"2006-12-29 14:41:34 UTC",
"status":"just added",
"page_title":"User talk:Jinxmchue",
"user_text":"Jinxmchue",
"parent_id":"96983583.202.24"
}
}
We now handle this in the JS by inferring who the parent should be, but we might want to consider doing this in the Python in future, e.g. along the lines of the sketch below.
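A sketch of that inference (the heuristic and function name are assumptions; actions are keyed by id as in the example above):

def infer_missing_parents(actions):
    """For a COMMENT_ADDING with no parent, if a SECTION_CREATION in the
    same conversation shares its timestamp, treat that section as parent."""
    sections = {a['timestamp']: a['id']
                for a in actions.values()
                if a['comment_type'] == 'SECTION_CREATION'}
    for a in actions.values():
        if (a['comment_type'] == 'COMMENT_ADDING' and not a['parent_id']
                and a['timestamp'] in sections):
            a['parent_id'] = sections[a['timestamp']]
    return actions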
Change the GroupBy on timestamp to a Partition function.
conversation id: 261839933.1349.1349
Ingestion failed on the 7z files listed here:
On the format for comments/conversations in the CSV rows for crowdflower:
So... what we currently have now does basically work, but it's messier and more error-prone than I'd like it to be.
The current interface/JSON type description for a comment is (in TypeScript syntax):
interface Comment {
// TODO(ldixon): make it always a string, and have it empty string for not present, instead of -1; also rename to parent_id
absolute_replyTo: string | number; // id of the parent
comment_type : 'COMMENT_MODIFICATION' | 'COMMENT_ADDING';
content: string;
indentation : string; // but this is a number inside the string;
// TODO(ldixon): remove
parent_ids: { [id:string]: boolean };
// TODO(ldixon): remove
relative_replyTo : string | number; // relative id of the parent.
status: 'just added' | 'content changed';
timestamp : string;
// TODO(ldixon): remove; not needed and in fact harmful (can be used to game crowdsourcing)
toxicity_score : number;
// TODO(ldixon): change up-stream to be a hash of the user-id. rename to hashed_user_id
user_text: string;
}
I suggest instead we make it look like so:
interface Comment {
id: string;
parent_id: string;
comment_type : 'COMMENT_MODIFICATION' | 'COMMENT_ADDING' | ...;
content: string;
status: 'just added' | 'content changed'| ...;
timestamp : string;
hashed_user_id: string;
}
And we keep a conversation as just:
interface Conversation { [id: string]: Comment }
i.e. we change from data looking like this:
{
"550613551.0.0":{
"content":"== Name == ",
"indentation":"-1",
"comment_type":"COMMENT_MODIFICATION",
"toxicity_score":0.11125048073648158,
"user_text":"Adamdaley",
"timestamp":"2013-04-16 08:55:31 UTC",
"absolute_replyTo":-1,
"status":"just added",
"relative_replyTo":-1,
"parent_ids":{
"550613551.0.0":true
}
},
"675014505.416.416":{
"content":" I edited it to the largest \"labor\" uprising and the largest \"organized armed uprising\" since the civil war. They were not in rebellion per se and the race riots of the 60's are clearly a larger uprising (I'm not too sure on armed).",
"indentation":"0",
"comment_type":"COMMENT_ADDING",
"toxicity_score":0.06011961435406282,
"user_text":"70.151.72.162",
"timestamp":"2015-08-07 17:03:18 UTC",
"absolute_replyTo":"550613551.0.0",
"status":"just added",
"relative_replyTo":0,
"parent_ids":{
"675014505.416.416":true
}
}
}
to looking like so:
{
"550613551.0.0":{
"id": "550613551.0.0",
"parent_id": "",
"content":"== Name == ",
"comment_type":"COMMENT_MODIFICATION",
"hashed_user_id":"DKJHWEIU",
"timestamp":"2013-04-16 08:55:31 UTC",
"status":"just added",
},
"675014505.416.416":{
"id": "550613551.0.0",
"parent_id": "550613551.0.0",
"content":" I edited it to the largest \"labor\" uprising and the largest \"organized armed uprising\" since the civil war. They were not in rebellion per se and the race riots of the 60's are clearly a larger uprising (I'm not too sure on armed).",
"comment_type":"COMMENT_ADDING",
"hashed_user_id":"NMCWUPR",
"timestamp":"2015-08-07 17:03:18 UTC",
"status":"just added",
},
}
This avoids:
It also:
Let's add the page title to each comment in a conversation so that we can provide it as additional context in the annotation job.
As a way to further evaluate these models, it would be nice to have a flag that scores a subset of the test data using the Perspective API. I'm imagining outputting results that have:
- comment_id
- comment_text
- y_class (e.g. 'toxic', 'obscene', etc.)
- y_gold (if available)
- y_prob (e.g. 0.89, 0.03, etc.)
- perspective_api_prob
- |y_prob - perspective_api_prob|
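A sketch of the scoring call (assumes a Perspective API key; the request shape follows the public AnalyzeComment REST endpoint):

import json
import requests

API_URL = ('https://commentanalyzer.googleapis.com/v1alpha1/'
           'comments:analyze?key=YOUR_API_KEY')  # key is a placeholder

def perspective_api_prob(comment_text):
    body = {'comment': {'text': comment_text},
            'requestedAttributes': {'TOXICITY': {}}}
    response = requests.post(API_URL, data=json.dumps(body))
    scores = response.json()
    return scores['attributeScores']['TOXICITY']['summaryScore']['value']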
The Kaggle competition requires the submissions be formatted like this:
id,toxic,severe_toxic,obscene,threat,insult,identity_hate
6044863,0.5,0.5,0.5,0.5,0.5,0.5
6102620,0.5,0.5,0.5,0.5,0.5,0.5
14563293,0.5,0.5,0.5,0.5,0.5,0.5
21086297,0.5,0.5,0.5,0.5,0.5,0.5
We're not actually competing in the competition, but it would be good to output our predictions in the same format so we can test our scoring scripts; a writer sketch is below.
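Assuming predictions as a dict of id -> per-class probabilities (the names here are placeholders):

import csv

CLASSES = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult',
           'identity_hate']

def write_submission(predictions, path):
    # predictions: {comment_id: {class_name: probability}}
    with open(path, 'w') as f:
        writer = csv.writer(f)
        writer.writerow(['id'] + CLASSES)
        for comment_id, probs in sorted(predictions.items()):
            writer.writerow([comment_id] + [probs[c] for c in CLASSES])

write_submission({6044863: {c: 0.5 for c in CLASSES}}, 'submission.csv')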
Some of the previously finished ingested batches are missing two extra fields: records_count (the number of records after resizing the record) and record_index (the index of the sub-piece of the record), due to having been run with an earlier version of the code.
Looks like ml-engine supports hyperparameter tuning. It would be great to integrate with that.
Docs: https://cloud.google.com/ml-engine/docs/hyperparameter-tuning-overview
e.g. in conversation id 232341942.8700.8700, the last comment starts: "Issue of the MonthThe great 'The/the debates' return. ". "Month" and "The" don't have any space between them. I think that's due to something in the Python comment interpretation?
It would be great to add a hook into Tensorboard so we can visualize training.
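A minimal TF 1.x sketch (the loss tensor and log directory are placeholders, not our actual training code):

import tensorflow as tf

loss = tf.placeholder(tf.float32, name='loss')  # stand-in for the real loss
tf.summary.scalar('loss', loss)
merged = tf.summary.merge_all()

with tf.Session() as sess:
    writer = tf.summary.FileWriter('logs/run1', sess.graph)
    for step in range(100):
        summary = sess.run(merged, feed_dict={loss: 1.0 / (step + 1)})
        writer.add_summary(summary, step)
    writer.close()

# Then visualize with: tensorboard --logdir logs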
Using token offsets makes us sensitive to the tokenization algorithm. It would be better to use character offsets instead, if that's reasonable to do; a conversion sketch is below.
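A sketch of recovering character offsets from tokens (assumes the tokens appear in order in the original text; the function name is made up):

def token_to_char_offsets(text, tokens):
    # Maps each token to its (start, end) character offsets in text.
    offsets = []
    cursor = 0
    for token in tokens:
        start = text.index(token, cursor)  # assumes in-order tokens
        end = start + len(token)
        offsets.append((start, end))
        cursor = end
    return offsets

print(token_to_char_offsets('Month and The', ['Month', 'and', 'The']))
# [(0, 5), (6, 9), (10, 13)]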
Consider whether we can separate out edits to others' comments, as there actually seem to be quite a few. We should investigate conversations like conversation id 85577055.2121.2121.
For minor edits, ideally we'd show the main author and then just say "minor edits by ...", and perhaps even underline the bits that are different in some way.