Thai Discord Chat Datasets

This repository contains datasets of Thai Discord chat conversations collected between 2016 and 2022. The dataset files are named data(01-18).txt and together contain 28,698,673 lines of chat.

Dataset Overview

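The chat logs are organized in a conversational format, with each line representing one message exchanged between users. As the sample below shows, a line begins with the speaker's label ("U" followed by that user's identifier within the chat scope) and a colon, followed by the message text. User mentions inside a message are written as "@U" followed by the mentioned user's identifier. In the sample, a line containing only "*" appears to mark the end of a conversation.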

Sample Chat Log:

U1:ง้าบเช่นกัน
U2:@U3 นอนยังฮับ
U2:ราตรีสวัสดิ์
*
U1:แล้วโทรศัพท์ก็ดับ
U2:อรุณสวัสดิ์
U2:@U1 @U4 @U3  อรุณสวัสดิ์ฮะ
U1:ทำเช้านี้เค็มจัง เกลือแบบโคตรๆเลยเนี่ย
*
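
The format above can be read with a few lines of Python. This is only a minimal sketch based on the sample: it assumes every message line has the shape "U<number>:<text>" and that a line containing only "*" closes a conversation. The names parse_conversations, LINE_RE and MENTION_RE are illustrative and are not part of this repository.

```python
import re
from typing import Iterable, Iterator, List, Tuple

# Assumed line shape, inferred from the sample: "U<number>:<message text>".
LINE_RE = re.compile(r"^(U\d+):(.*)$")
# Mentions look like "@U3" inside the message text.
MENTION_RE = re.compile(r"@(U\d+)")

# (speaker label, message text, labels of mentioned users)
Message = Tuple[str, str, List[str]]


def parse_conversations(lines: Iterable[str]) -> Iterator[List[Message]]:
    """Yield one list of messages per conversation.

    Assumption: a line consisting only of "*" ends the current conversation.
    """
    current: List[Message] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.strip() == "*":
            if current:
                yield current
            current = []
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip blank lines and anything that does not fit the assumed shape
        speaker, text = match.groups()
        current.append((speaker, text, MENTION_RE.findall(text)))
    if current:
        yield current  # last conversation had no closing "*"
```

An open file object can be passed directly, for example parse_conversations(open(path, encoding="utf-8")), where path points at one of the data files. Lines that do not match the assumed shape are silently skipped, so check a few files by hand before relying on this.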

Dataset Collection

The chat data in this repository was collected by me between 2016 and 2022, so it covers conversations from across those years.

Chat Filtering Rules

The chat data in this repository has been filtered according to the following rules:

  1. Thai Language: Only chat messages in Thai are included in the dataset.
  2. Exclusion of URLs: Chat messages containing URLs or links are omitted.
  3. Language Identification: The franc library was used to identify the language of each chat message, and only messages identified as "tha" (Thai) were kept.
  4. Exclusion of Common Bot Prefix: Chat messages that start with a common bot prefix are excluded.
  5. Conversational Scope: Each conversation spans at most 2 hours. Messages beyond that limit are treated as the start of a new conversation.
  6. Multiple Real Users: Each conversation includes more than two real users. Bot users are not counted toward this number.
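
To make the rules above concrete, here is a rough Python sketch of how such a filter could look. It is not the pipeline actually used to build this dataset. Several parts are assumptions: a Thai Unicode-range check stands in for franc's "tha" result, the BOT_PREFIXES values are hypothetical (the actual prefixes are not listed here), the 2-hour limit is measured from the first message of a conversation, and the (timestamp, user, text, is_bot) message tuple is invented for illustration.

```python
import re
from typing import Iterable, List, Tuple

# Hypothetical raw message: (unix timestamp in seconds, user id, text, is_bot)
Message = Tuple[float, str, str, bool]

URL_RE = re.compile(r"https?://|www\.", re.IGNORECASE)
# Stand-in for franc: treat a message as Thai if it has any Thai-block character.
THAI_RE = re.compile(r"[\u0E00-\u0E7F]")
# Hypothetical prefixes; the real list is not given in this README.
BOT_PREFIXES = ("!", "?", "-", ".")
MAX_SPAN_SECONDS = 2 * 60 * 60  # rule 5: conversations span at most 2 hours


def keep_message(text: str) -> bool:
    """Apply the per-message rules (1 to 4)."""
    if URL_RE.search(text):
        return False  # rule 2: no URLs or links
    if text.startswith(BOT_PREFIXES):
        return False  # rule 4: no bot-prefixed messages
    return THAI_RE.search(text) is not None  # rules 1 and 3 (approximation)


def split_conversations(messages: Iterable[Message]) -> List[List[Message]]:
    """Group kept messages into conversations (rule 5), then apply rule 6."""
    conversations: List[List[Message]] = []
    current: List[Message] = []
    for msg in messages:  # messages are assumed to be sorted by timestamp
        timestamp, _user, text, _is_bot = msg
        if not keep_message(text):
            continue
        # Assumption: the 2-hour window starts at the conversation's first message.
        if current and timestamp - current[0][0] > MAX_SPAN_SECONDS:
            conversations.append(current)
            current = []
        current.append(msg)
    if current:
        conversations.append(current)
    # Rule 6: more than two distinct real (non-bot) users.
    return [
        conv for conv in conversations
        if len({user for _, user, _, is_bot in conv if not is_bot}) > 2
    ]
```

The character-range check is much cruder than a real language identifier: it accepts any message containing at least one Thai character, including mixed-language messages that franc might classify differently.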

Usage

Researchers and developers can use this dataset for various purposes, such as natural language processing (NLP) tasks, chatbot training, sentiment analysis, or any other project that requires Thai language chat data. The dataset provides a valuable resource for understanding Thai language usage in Discord conversations over the specified time period.

Please note that the dataset is provided "as is," and I do not guarantee its accuracy or completeness. Users are encouraged to review and preprocess the data according to their specific requirements.

License

The Thai Discord Chat Datasets repository is made available under the MIT License. You are free to use, modify, and distribute the dataset in accordance with the terms specified in the license.

Contributors

chanios
