nlprinceton / sarc
Evaluation code for the Self-Annotated Reddit Corpus
License: MIT License
Hi, thanks a million if you can help me clarify a data source question!
I visited the link to the SARC files. There are only two folders, main and pol, for sarcasm evaluation, right?
However, I read two papers [1][2], and both claim to have used two subsets of your SARC dataset: /r/movies and /r/technology. [1] reports 8188 samples from /r/movies and 22510 samples from /r/technology. [2] reports 15019 samples from /r/movies and 13485 samples from /r/technology, and gives a link to SARC-2.0.
I checked that in the SARC-2.0 main folder, there are
train-balanced r/movies=1414, train-unbalanced r/movies=121595
test-balanced r/movies=364, test-unbalanced r/movies=27930
train-balanced r/technology=1652, train-unbalanced r/technology=73641
test-balanced r/technology=408, test-unbalanced r/technology=20104
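For reference, this is roughly how I computed the counts above. It is a minimal sketch assuming the pipe-delimited train/test format (post ids | response ids | labels, space-separated within each field); the `id_to_subreddit` mapping is a stand-in for a lookup built from comments.json, not something in the repo:

```python
import csv
import io
from collections import Counter

def count_labels_by_subreddit(csv_text, id_to_subreddit):
    """Count sarcastic (1) and non-sarcastic (0) responses per subreddit.

    Assumes each row is pipe-delimited: post_ids|response_ids|labels,
    with space-separated ids and 0/1 labels, and that id_to_subreddit
    maps a response id to its subreddit (e.g. built from comments.json).
    """
    counts = Counter()
    for row in csv.reader(io.StringIO(csv_text), delimiter='|'):
        response_ids = row[1].split(' ')
        labels = row[2].split(' ')
        for rid, label in zip(response_ids, labels):
            sub = id_to_subreddit.get(rid)
            if sub is not None:
                counts[(sub, int(label))] += 1
    return counts

# toy example with two hypothetical rows
sample = "p1|r1 r2|1 0\np2|r3 r4|0 1\n"
id2sub = {'r1': 'movies', 'r2': 'movies',
          'r3': 'technology', 'r4': 'technology'}
print(count_labels_by_subreddit(sample, id2sub))
```

I simply summed the label counts per subreddit over the balanced and unbalanced files separately.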
Given these numbers, it seems impossible to obtain balanced datasets of the sizes reported in [1] and [2], since the sarcastic samples are far fewer than the non-sarcastic ones.
Could you help me check whether there is any way the two papers could have obtained such data?
Thanks a million! This has confused me quite a lot.
[1] https://www.aclweb.org/anthology/P18-1093.pdf
[2] https://dl.acm.org/doi/pdf/10.1145/3308558.3313735
Hi,
I'm investigating sarcasm detection using your dataset, and I'm currently collecting user information. Please correct me where I misunderstand.
According to SARC/2.0/README.txt, raw/sarc.csv contains sarcastic and non-sarcastic comments by the authors in authors.json. When I read raw/sarc.csv, the first example is:
[0, "Yousa guys didn't upvote nothing!", 'BritishEnglishPolice', 'worldpolitics', 3, 3, 0, '2009-01', 1233446126, "Mafia business 'equal' to 9% of Italian GDP", 'c07e6gg', '7tvvp']
I guess the sentence "Yousa guys didn't upvote nothing!" is a post, "BritishEnglishPolice" is the author who made it, "worldpolitics" is the subreddit, and "3, 3, 0" correspond to score/ups/downs respectively. "2009-01" and "1233446126" are the date and the UTC timestamp. "Mafia business 'equal' to 9% of Italian GDP" is a comment on this post, and "c07e6gg" and "7tvvp" are the sarcastic and non-sarcastic responses to that comment. But when I search comments.json for "c07e6gg" and "7tvvp", nothing is returned. Besides, what does the "0" at the beginning mean? I see 0 appearing in many examples. Could you help me understand this sarc.csv file? My goal is to acquire sarcastic and non-sarcastic comments for a given author.
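To make sure I'm reading the file correctly, here is the minimal parsing sketch I'm working with. The field meanings (first field = label, second = comment text, third = author) are my own assumptions from inspecting the row above, not something confirmed by the README:

```python
import ast
from collections import defaultdict

def group_by_author(raw_lines):
    """Group raw/sarc.csv rows by author and (assumed) sarcasm label.

    Working assumptions (unconfirmed): each line is a Python-style list
    literal whose first field is the label (0 = non-sarcastic,
    1 = sarcastic), second field is the comment text, and third field
    is the author.
    """
    by_author = defaultdict(lambda: {0: [], 1: []})
    for line in raw_lines:
        row = ast.literal_eval(line)
        label, text, author = row[0], row[1], row[2]
        by_author[author][label].append(text)
    return by_author

# the first row of raw/sarc.csv, as quoted above
sample_line = ("[0, \"Yousa guys didn't upvote nothing!\", "
               "'BritishEnglishPolice', 'worldpolitics', 3, 3, 0, "
               "'2009-01', 1233446126, "
               "\"Mafia business 'equal' to 9% of Italian GDP\", "
               "'c07e6gg', '7tvvp']")
groups = group_by_author([sample_line])
print(groups['BritishEnglishPolice'][0])
```

If my reading of the fields is wrong, this grouping would be wrong too, which is why I'd appreciate a confirmation of the format.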
I'm using the SARC/2.0 main and pol datasets. I think SARC/2.0 should contain all the data in SARC/1.0 and SARC/0.0, right?
Thank you very much!
Best regards
text_embedding not found, and I also didn't find it in any Python libraries
Hey all,
I'm running your code and hitting an issue with the bag-of-words model.
I've made slight modifications to the instructions, so maybe that's playing a role, but I'm running this command line within the SARC directory:
`directory stuff/SARC> python eval.py main -l
Load SARC data
Traceback (most recent call last):
  File "eval.py", line 119, in <module>
    main()
  File "eval.py", line 45, in main
    load_sarc_responses(train_file, test_file, comment_file, lower=args.lower)
  File "directory stuff\SARC\utils.py", line 34, in load_sarc_responses
    responses = row[1].split(' ')
IndexError: list index out of range`
The reason I'm running this within the SARC directory is that I've added a few lines of code (listed below) at the top of eval.py to include the text_embedding module (it shares a parent directory with SARC). This was the only way I could figure out to import text_embedding inside eval.py, so if there's a better way I'm all ears!
`import sys
sys.path.append('../')`
Final note: this error has occurred with both the data linked in the README (https://nlp.cs.princeton.edu/SARC/2.0/) and the SARC data on Kaggle (https://www.kaggle.com/danofer/sarcasm). The data from the README looks odd when viewed in Excel, so I initially thought that was messing with the CSV parsing.
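In case it helps with debugging, here is a small check I wrote that flags the rows triggering the IndexError, under the assumption that the train/test CSVs are meant to be pipe-delimited with three fields per row (post ids | response ids | labels):

```python
import csv
import io

def check_sarc_rows(f):
    """Flag rows that would trigger the IndexError in load_sarc_responses.

    Assumes the SARC train/test CSVs are pipe-delimited with three
    fields per row: post ids | response ids | labels. Returns the
    1-based indices of malformed rows so they can be inspected before
    running eval.py.
    """
    bad = []
    for i, row in enumerate(csv.reader(f, delimiter='|'), start=1):
        if len(row) < 3:
            bad.append(i)
    return bad

# toy example: the second row is comma-delimited (like the Kaggle export),
# so it parses as a single field and gets flagged
sample = io.StringIO("p1|r1 r2|1 0\nlabel,comment,author\n")
print(check_sarc_rows(sample))  # -> [2]
```

Running this on the Kaggle file flags every row, which suggests the Kaggle export uses a different (comma-delimited) format than eval.py expects; I'm not sure yet what's wrong with the README data.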
Thank you,
Matt