Wisesight Sentiment Corpus

Social media messages in the Thai language with sentiment labels (positive, neutral, negative, question), 26,737 messages in total. Released to the public domain under the Creative Commons Zero v1.0 Universal license.

Last update: 2019-03-31

For wisesight-160 and wisesight-1000, which are samples from this corpus in a tokenized form, see https://github.com/PyThaiNLP/wisesight-sentiment/tree/master/word-tokenization

For data exploration and classification examples, see Thai Text Classification Benchmarks.

Also available as Hugging Face datasets: wisesight_sentiment (which uses data from an earlier version of the Wisesight Sentiment Corpus) and wisesight1000.

Source

  • Size: 26,737 messages
  • Language: Central Thai
  • Style: Informal and conversational, with some news headlines and advertisements.
  • Time period: Around 2016 to early 2019, with a small amount from other periods.
  • Domains: Mixed. The majority are consumer products and services (restaurants, cosmetics, drinks, cars, hotels), with some current affairs.
  • Privacy:
    • Only messages that were made available to the public on the internet (websites, blogs, social network sites) are included.
    • For Facebook, this means public comments (visible to everyone) made on a public page.
    • Private/protected messages and messages in groups, chats, and inboxes are not included.
  • Alterations and modifications:
    • Keep in mind that this corpus does not statistically represent anything in the language register.
    • A large number of messages are not in their original form. Personal data is removed or masked.
    • Duplicated, leading, and trailing whitespace is removed. Other punctuation, symbols, and emojis are kept intact.
    • (Mis)spellings are kept intact.
    • Messages longer than 2,000 characters are removed.
    • Long non-Thai messages are removed. Duplicated messages (exact match) are removed.
  • More characteristics of the data can be explored in this notebook.
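
The whitespace, length, and duplicate rules listed above can be sketched in a few lines of Python. This is only an illustration, not the pipeline that was actually used to build the corpus: the function names, the order of the steps, and the choice to drop messages that end up empty are assumptions, and the personal-data masking and the long non-Thai filter are left out.

```python
import re

MAX_CHARS = 2000  # length limit stated in the list above


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace into one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def clean_messages(messages):
    """Apply the whitespace, length, and exact-duplicate rules in order."""
    seen = set()
    cleaned = []
    for raw in messages:
        text = normalize_whitespace(raw)
        # Assumption: messages that are empty after trimming are dropped.
        if not text or len(text) > MAX_CHARS:
            continue
        # Exact-match duplicate removal, keeping the first occurrence.
        if text in seen:
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned
```

Punctuation, symbols, emojis, and spellings pass through untouched, which matches the rules above.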

Corpus file structure

  • All files are UTF-8 encoded plain text.
  • One message per line. A newline character in the original message is replaced with a space.
  • q.txt - Questions (575 messages)
  • neg.txt - Messages with negative sentiment (6,823)
  • neu.txt - Messages with neutral sentiment (14,561)
  • pos.txt - Messages with positive sentiment (4,778)
  • The legacy dataset in Kaggle competition format is also provided inside the kaggle-competition/ directory:
    • train.txt - Messages for training (24,066 messages)
    • train_label.txt - Labels for training. Each line is the label corresponding to the same line in train.txt
    • test.txt - Messages for testing (2,674 messages)
    • test_label.txt - Labels for testing. Each line is the label corresponding to the same line in test.txt
    • test_majority.csv - Sample submission in Kaggle format. Contains the neu class for all predictions.
    • test_solution.csv - Test solution in Kaggle format.
    • Sample code for data exploration, training, and prediction is also provided.
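
Given the layout above, the files can be read with the Python standard library alone. The following is a minimal sketch: the function names and the short label strings assigned to the four per-label files are assumptions made for illustration, and the directory paths are whatever location the repository was cloned to.

```python
from pathlib import Path

# File name -> label. The label strings here are an assumption for this sketch.
LABEL_FILES = {"pos.txt": "pos", "neu.txt": "neu", "neg.txt": "neg", "q.txt": "q"}


def read_lines(path):
    """Read a UTF-8 text file and return its lines without the newline."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def load_corpus(corpus_dir):
    """Return (message, label) pairs from the four per-label files."""
    corpus_dir = Path(corpus_dir)
    pairs = []
    for filename, label in LABEL_FILES.items():
        for message in read_lines(corpus_dir / filename):
            pairs.append((message, label))
    return pairs


def load_kaggle_split(kaggle_dir, split="train"):
    """Pair each line of <split>.txt with the same line of <split>_label.txt."""
    kaggle_dir = Path(kaggle_dir)
    messages = read_lines(kaggle_dir / f"{split}.txt")
    labels = read_lines(kaggle_dir / f"{split}_label.txt")
    if len(messages) != len(labels):
        raise ValueError(
            f"{split}: {len(messages)} messages but {len(labels)} labels"
        )
    return list(zip(messages, labels))
```

For example, `load_kaggle_split("kaggle-competition", "test")` would pair test.txt with test_label.txt. The length check is there because the pairing relies entirely on line positions, so a mismatch is worth failing on early.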

Personal data

  • We try to exclude any known personally identifiable information from this data set.
  • Usernames and names of non-public figures are removed.
  • Phone numbers are masked (e.g. 088-888-8888, 09-9999-9999, 0-2222-2222).
  • If you see any personal data still remaining in the set, please tell us so we can remove it.
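
The exact masking procedure is not described here. As a purely hypothetical sketch of how phone numbers in the three hyphenated shapes shown above could be masked, the code below keeps the leading 0 and the hyphens and overwrites every other digit with a fill digit. The regular expression, the function name, and the fill digit are all assumptions, and numbers written without hyphens are not handled.

```python
import re

# Hypothetical pattern: a leading 0, then hyphen-separated digit groups,
# covering the shapes shown above (3-3-4, 2-4-4, and 1-4-4 digits).
PHONE_RE = re.compile(r"(?<!\d)0\d{0,2}-\d{3,4}-\d{4}(?!\d)")


def mask_phone_numbers(text: str, fill: str = "8") -> str:
    """Overwrite the digits of phone-like substrings, keeping their shape."""

    def _mask(match):
        number = match.group(0)
        # Keep the leading 0 and the hyphens; replace every other digit.
        return number[0] + re.sub(r"\d", fill, number[1:])

    return PHONE_RE.sub(_mask, text)
```

Keeping the shape of the number means a masked message still reads naturally while the original digits are not recoverable.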

Sentiment value annotation methodology

  • Sentiment values are assigned by human annotators.
  • A human annotator puts his/her best effort into assigning just one label, out of three, to a message.
  • A message can be ambiguous. When possible, the judgement is based solely on the text itself.
    • In some situations, such as when the context is missing, the annotator may have to rely on his/her own world knowledge and just guess.
    • In some cases, the human annotator may have access to the message's context, such as an image. This additional information is not included as part of this corpus.
  • Agreement, enjoyment, and satisfaction are positive. Disagreement, sadness, and disappointment are negative.
  • Showing interest in a topic or in a product is counted as positive.
    • In this sense, a question about a particular product could have a positive sentiment value if it shows interest in the product.
  • Saying that another product or service is better is counted as negative.
  • General information and news titles tend to be counted as neutral.

Copyright and Disclaimer

  • If applicable, copyright of each message content belongs to the original poster.
  • Annotation data (labels) are released to the public domain.
  • Wisesight (Thailand) Co., Ltd. helps facilitate the annotation, but does not necessarily agree with the labels made by the human annotators. This annotation is for research purposes and does not reflect the professional work that Wisesight has done for its customers.
  • The human annotator does not necessarily agree or disagree with the message. Likewise, the label he/she assigned to the message does not necessarily reflect his/her personal view of the message.

Citation

Please cite the following if you make use of the dataset:

Arthit Suriyawongkul, Ekapol Chuangsuwanich, Pattarawat Chormai, and Charin Polpanumas. 2019. PyThaiNLP/wisesight-sentiment: First release. September.

BibTeX:

@software{bact_2019_3457447,
  author       = {Suriyawongkul, Arthit and
                  Chuangsuwanich, Ekapol and
                  Chormai, Pattarawat and
                  Polpanumas, Charin},
  title        = {PyThaiNLP/wisesight-sentiment: First release},
  month        = sep,
  year         = 2019,
  publisher    = {Zenodo},
  version      = {v1.0},
  doi          = {10.5281/zenodo.3457447},
  url          = {https://doi.org/10.5281/zenodo.3457447}
}

Acknowledgement

Thanks to the PyThaiNLP community, Kitsuchart Pasupa (Faculty of Information Technology, King Mongkut's Institute of Technology Ladkrabang), and Ekapol Chuangsuwanich (Faculty of Engineering, Chulalongkorn University) for their advice. The original Kaggle competition, which used the first version of this corpus, can be found at https://www.kaggle.com/c/wisesight-sentiment/
