nlprinceton / sarc
Evaluation code for the Self-Annotated Reddit Corpus
License: MIT License
Hi, thanks a million if you can help me clarify a data source question!
I visited the link to the SARC files. There are only two folders, main and pol, for sarcasm evaluation, right?
However, I read two papers [1][2], and both claim to have used two subsets of your SARC dataset: /r/movies and /r/technology. [1] reports 8188 samples from /r/movies and 22510 samples from /r/technology. [2] reports 15019 samples from /r/movies and 13485 samples from /r/technology, and gives a link to SARC-2.0.
I checked that in the SARC-2.0 main folder, there are
train-balanced r/movies=1414, train-unbalanced r/movies=121595
test-balanced r/movies=364, test-unbalanced r/movies=27930
train-balanced r/technology=1652, train-unbalanced r/technology=73641
test-balanced r/technology=408, test-unbalanced r/technology=20104
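For reference, this is roughly how I computed the counts above. It is a minimal sketch assuming the pipe-delimited train/test format (post ids | response ids | labels, space-separated within each field); the `id_to_subreddit` mapping is a stand-in for a lookup built from comments.json, not something in the repo:

```python
import csv
import io
from collections import Counter

def count_labels_by_subreddit(csv_text, id_to_subreddit):
    """Count sarcastic (1) and non-sarcastic (0) responses per subreddit.

    Assumes each row is pipe-delimited: post_ids|response_ids|labels,
    with space-separated ids and 0/1 labels, and that id_to_subreddit
    maps a response id to its subreddit (e.g. built from comments.json).
    """
    counts = Counter()
    for row in csv.reader(io.StringIO(csv_text), delimiter='|'):
        response_ids = row[1].split(' ')
        labels = row[2].split(' ')
        for rid, label in zip(response_ids, labels):
            sub = id_to_subreddit.get(rid)
            if sub is not None:
                counts[(sub, int(label))] += 1
    return counts

# toy example with two hypothetical rows
sample = "p1|r1 r2|1 0\np2|r3 r4|0 1\n"
id2sub = {'r1': 'movies', 'r2': 'movies',
          'r3': 'technology', 'r4': 'technology'}
print(count_labels_by_subreddit(sample, id2sub))
```

I simply summed the label counts per subreddit over the balanced and unbalanced files separately.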
Given these numbers, it seems impossible to obtain balanced datasets of the sizes reported in [1] and [2], since the sarcastic samples are far fewer than the non-sarcastic ones.
Could you help me check whether there is any way the two papers could have obtained such data?
Thanks a million! This has confused me quite a lot.
[1] https://www.aclweb.org/anthology/P18-1093.pdf
[2] https://dl.acm.org/doi/pdf/10.1145/3308558.3313735
Hi,
I'm investigating sarcasm detection using your dataset, and I'm currently collecting user information. Please correct me where I misunderstand.
According to SARC/2.0/README.txt, raw/sarc.csv contains sarcastic and non-sarcastic comments by the authors in authors.json. When I read raw/sarc.csv, the first example is:
[0, "Yousa guys didn't upvote nothing!", 'BritishEnglishPolice', 'worldpolitics', 3, 3, 0, '2009-01', 1233446126, "Mafia business 'equal' to 9% of Italian GDP", 'c07e6gg', '7tvvp']
I guess the sentence "Yousa guys didn't upvote nothing!" is a post, "BritishEnglishPolice" is the author who made it, "worldpolitics" is the subreddit, and "3, 3, 0" correspond to score/ups/downs respectively. "2009-01" and "1233446126" are the date and the UTC timestamp. "Mafia business 'equal' to 9% of Italian GDP" is a comment on this post, and "c07e6gg" and "7tvvp" are the sarcastic and non-sarcastic responses to that comment. But when I search comments.json for "c07e6gg" and "7tvvp", nothing is returned. Besides, what does the "0" at the beginning mean? I see 0 appearing in many examples. Could you help me understand this sarc.csv file? My goal is to acquire sarcastic and non-sarcastic comments for a given author.
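To make sure I'm reading the file correctly, here is the minimal parsing sketch I'm working with. The field meanings (first field = label, second = comment text, third = author) are my own assumptions from inspecting the row above, not something confirmed by the README:

```python
import ast
from collections import defaultdict

def group_by_author(raw_lines):
    """Group raw/sarc.csv rows by author and (assumed) sarcasm label.

    Working assumptions (unconfirmed): each line is a Python-style list
    literal whose first field is the label (0 = non-sarcastic,
    1 = sarcastic), second field is the comment text, and third field
    is the author.
    """
    by_author = defaultdict(lambda: {0: [], 1: []})
    for line in raw_lines:
        row = ast.literal_eval(line)
        label, text, author = row[0], row[1], row[2]
        by_author[author][label].append(text)
    return by_author

# the first row of raw/sarc.csv, as quoted above
sample_line = ("[0, \"Yousa guys didn't upvote nothing!\", "
               "'BritishEnglishPolice', 'worldpolitics', 3, 3, 0, "
               "'2009-01', 1233446126, "
               "\"Mafia business 'equal' to 9% of Italian GDP\", "
               "'c07e6gg', '7tvvp']")
groups = group_by_author([sample_line])
print(groups['BritishEnglishPolice'][0])
```

If my reading of the fields is wrong, this grouping would be wrong too, which is why I'd appreciate a confirmation of the format.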
I'm using the SARC/2.0 main and pol datasets. I think SARC/2.0 should contain all the data in SARC/1.0 and SARC/0.0, right?
Thank you very much!
Best regards
text_embedding not found, and I also didn't find it in any Python libraries
Hey all,
I'm running your code and hitting an issue with the bag-of-words model.
I've made slight modifications to the instructions, so maybe that's playing a role, but I'm running this command line within the SARC directory:
`directory stuff/SARC> python eval.py main -l
Load SARC data
Traceback (most recent call last):
  File "eval.py", line 119, in <module>
    main()
  File "eval.py", line 45, in main
    load_sarc_responses(train_file, test_file, comment_file, lower=args.lower)
  File "directory stuff\SARC\utils.py", line 34, in load_sarc_responses
    responses = row[1].split(' ')
IndexError: list index out of range`
The reason I'm running this within the SARC directory is that I've added a few lines of code (listed below) at the top of eval.py to include the text_embedding module (it shares a parent directory with SARC). This was the only way I could figure out to import text_embedding inside eval.py, so if there's a better way I'm all ears!
`import sys
sys.path.append('../')`
Final note: this error has occurred with both the data linked in the README (https://nlp.cs.princeton.edu/SARC/2.0/) and the SARC data on Kaggle (https://www.kaggle.com/danofer/sarcasm). The data from the README looks odd when viewed in Excel, so I initially thought that was messing with the CSV parsing.
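In case it helps with debugging, here is a small check I wrote that flags the rows triggering the IndexError, under the assumption that the train/test CSVs are meant to be pipe-delimited with three fields per row (post ids | response ids | labels):

```python
import csv
import io

def check_sarc_rows(f):
    """Flag rows that would trigger the IndexError in load_sarc_responses.

    Assumes the SARC train/test CSVs are pipe-delimited with three
    fields per row: post ids | response ids | labels. Returns the
    1-based indices of malformed rows so they can be inspected before
    running eval.py.
    """
    bad = []
    for i, row in enumerate(csv.reader(f, delimiter='|'), start=1):
        if len(row) < 3:
            bad.append(i)
    return bad

# toy example: the second row is comma-delimited (like the Kaggle export),
# so it parses as a single field and gets flagged
sample = io.StringIO("p1|r1 r2|1 0\nlabel,comment,author\n")
print(check_sarc_rows(sample))  # -> [2]
```

Running this on the Kaggle file flags every row, which suggests the Kaggle export uses a different (comma-delimited) format than eval.py expects; I'm not sure yet what's wrong with the README data.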
Thank you,
Matt