messaging-chat-parser's Introduction

📲 Messaging parser

Use what you have written

What is this repo?

This repository provides Python scripts to parse WhatsApp and Telegram messages.
The goal is to produce well-structured text files suitable for machine learning. [4]

📥 Inputs

Data to provide:

  • WhatsApp data
    • .txt files exported from one or more chats (WhatsApp's "Export chat" option)
      • place all .txt files in ./data/chat_raw/whatsapp/
  • Telegram data
    • .json file with the Telegram dump (Telegram Desktop's "Export Telegram data" option) [5]
      • copy the JSON file to ./data/chat_raw/telegram/ and rename it telegram_dump.json

⚙ Usage

  • Install the dependencies listed in requirements.txt (e.g. pip install -r requirements.txt)
  • WhatsApp [1]

    python ./src/whatsapp_parser.py --session_token "<|endoftext|>" --delta_h_threshold 4 --user_name <user_name>

  • Telegram [2]

    python ./src/telegram_parser.py --session_token "<|endoftext|>" --delta_h_threshold 4

  • Join files and extract user messages

    python ./src/joiner.py
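
As a rough illustration of what the join step produces, here is a minimal sketch. It is not the actual joiner.py: the function name join_and_extract is hypothetical, and stripping the [me] tag (described in the Outputs section) from the extracted rows is an assumption.

```python
from typing import Iterable, List, Tuple

ME_TAG = "[me]"  # tag that marks rows written by the user


def join_and_extract(
    parsed_files: Iterable[List[str]],
) -> Tuple[List[str], List[str]]:
    """Concatenate parsed rows and collect the rows written by the user.

    Hypothetical helper: the real joiner.py may behave differently.
    """
    all_rows: List[str] = []
    for rows in parsed_files:
        all_rows.extend(rows)
    # Keep only rows tagged [me]; removing the tag itself is an assumption.
    user_rows = [
        row[len(ME_TAG):].strip() for row in all_rows if row.startswith(ME_TAG)
    ]
    return all_rows, user_rows


telegram_rows = ["[me] hi", "[others] hello", "<|endoftext|>"]
whatsapp_rows = ["[others] ciao", "[me] come stai?"]
all_rows, user_rows = join_and_extract([telegram_rows, whatsapp_rows])
print(user_rows)
```

The first returned list corresponds to the idea behind all-messages.txt, the second to user-messages.txt.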

📤 Outputs

  • telegram-chats.txt and wa-chats.txt
    • Both files have this structure:
      [me] bla bla bla
      [others] bla bla bla
      [others] bla bla bla
      <|endoftext|>
      [me] bla bla bla
      ...
    • The three tags mean:
      • [me]: placed as a prefix of text written by the user [3]
      • [others]: placed as a prefix of text written by others
      • <|endoftext|>: the session token, added when more than 4 hours (the --delta_h_threshold value used above) elapse between two consecutive messages
  • all-messages.txt
    • A single file containing the rows of both telegram-chats.txt and wa-chats.txt.
  • user-messages.txt
    • One line per message written by the user [3]
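
The tagging and session-splitting rules described above can be sketched as follows. This is an illustration under assumptions, not the parsers' actual code: format_chat and the Message tuple are hypothetical names, and the parameters merely mirror the --session_token, --delta_h_threshold and --user_name options.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

Message = Tuple[datetime, str, str]  # (timestamp, author, text)


def format_chat(
    messages: List[Message],
    user_name: str,
    session_token: Optional[str] = "<|endoftext|>",
    delta_h_threshold: float = 4,
) -> List[str]:
    """Tag each message and insert session_token after long pauses."""
    rows: List[str] = []
    previous: Optional[datetime] = None
    for timestamp, author, text in messages:
        gap_exceeded = (
            previous is not None
            and timestamp - previous > timedelta(hours=delta_h_threshold)
        )
        # With session_token=None, session splitting is disabled.
        if session_token is not None and gap_exceeded:
            rows.append(session_token)
        tag = "[me]" if author == user_name else "[others]"
        rows.append(f"{tag} {text}")
        previous = timestamp
    return rows


# Made-up sample conversation.
chat = [
    (datetime(2019, 12, 12, 8, 40), "alice", "good morning"),
    (datetime(2019, 12, 12, 8, 45), "bob", "morning!"),
    (datetime(2019, 12, 12, 14, 0), "alice", "lunch?"),
]
for row in format_chat(chat, user_name="alice"):
    print(row)
```

In this sample the third message arrives more than 4 hours after the second, so the token is placed between them.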

๐Ÿ“ Notes

  • [1] How do I find the <user_name> value?
    • Take it from a line of the exported WhatsApp chat text, e.g.:
      12/12/19, 08:40 - <user_name>: bla bla bla
  • [2] Check that the Telegram dump is named telegram_dump.json and is located at
    ./data/chat_raw/telegram/telegram_dump.json
  • [3] user = the owner of the messages (hopefully the person running these scripts)
    • for Telegram: the account that made the data dump
    • for WhatsApp: the value passed with --user_name
  • [4] ⚠ It is always better not to run random scripts on personal information (like chat messages)
    • You can inspect this code yourself
    • Before running it, keep in mind:
      • This is a free-time project; I don't guarantee efficiency or good programming practices
      • I'm not so good at writing English
      • Good luck
  • [5] Be sure to select the "Account information" checkbox in the Telegram export dialog window
  • Neither the Telegram nor the WhatsApp parser has been tested on group chats, and neither is intended to handle that kind of data.
  • The chat session behavior can be changed:
    • --session_token sets the session-splitting token; if the argument is not provided, session splitting is disabled.
    • --delta_h_threshold sets how many hours must elapse between two consecutive messages before a session token is inserted
  • 📅 Parsing data with custom values:
    • Both the WhatsApp and Telegram parsers default to an Italian datetime format
    • You can always use a custom datetime format via the --time_format parameter:
      • WhatsApp:

      python ./src/whatsapp_parser.py --session_token "<|endoftext|>" --delta_h_threshold 4 --user_name <user_name> --time_format "%d/%m/%y, %H:%M"

      • Telegram:

      python ./src/telegram_parser.py --session_token "<|endoftext|>" --time_format "%Y-%m-%dT%H:%M:%S"
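
The format strings above use the same directives as Python's datetime.strptime. Assuming the parsers pass --time_format to strptime (not confirmed here), a candidate format can be checked against a sample timestamp before running a parser. The helper name matches_format and the sample timestamps are made up for illustration.

```python
from datetime import datetime

# Format strings taken from the commands above.
WHATSAPP_FORMAT = "%d/%m/%y, %H:%M"
TELEGRAM_FORMAT = "%Y-%m-%dT%H:%M:%S"


def matches_format(timestamp: str, time_format: str) -> bool:
    """Return True if timestamp parses with time_format (hypothetical helper)."""
    try:
        datetime.strptime(timestamp, time_format)
    except ValueError:
        return False
    return True


# Day-first timestamp, as in the WhatsApp line shown in note [1].
print(matches_format("12/12/19, 08:40", WHATSAPP_FORMAT))
# ISO-like timestamp (made-up sample).
print(matches_format("2019-12-12T08:40:00", TELEGRAM_FORMAT))
# Month-first timestamp: 31 is not a valid month for a day-first format.
print(matches_format("12/31/19, 08:40", WHATSAPP_FORMAT))
```

The first two checks succeed and the third fails, which is the kind of mismatch a custom --time_format is meant to address.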

messaging-chat-parser's People

Contributors

pistocop, sginj, shaunlwm

messaging-chat-parser's Issues

user not added to parsed messages

In the chat_parsed folder, both telegram-chats.txt and wa-chats.txt contain only [others] messages.

Wrong datetime format

I get the error ValueError: time data '04/01/2017, 10:31 ' does not match format '%d/%m/%y, %H:%M'. The problem seems to be the %y, which should be %Y.

My Python version is 3.9.5
My OS is MacOS
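
A minimal sketch for reproducing this outside the parser, using only datetime.strptime. The helper try_formats is hypothetical, and the sample is the timestamp from the error message with its trailing space trimmed.

```python
from datetime import datetime
from typing import Dict, Iterable, Optional

# Timestamp from the error message above, trailing space trimmed.
sample = "04/01/2017, 10:31"


def try_formats(
    timestamp: str, formats: Iterable[str]
) -> Dict[str, Optional[datetime]]:
    """Map each candidate format to the parsed datetime, or None on failure."""
    results: Dict[str, Optional[datetime]] = {}
    for time_format in formats:
        try:
            results[time_format] = datetime.strptime(timestamp, time_format)
        except ValueError:
            # %y expects a two-digit year, so a four-digit year fails here.
            results[time_format] = None
    return results


results = try_formats(sample, ["%d/%m/%y, %H:%M", "%d/%m/%Y, %H:%M"])
for time_format, parsed in results.items():
    print(time_format, "->", parsed)
```

If the four-digit-year variant parses, passing it through the --time_format parameter described in the Notes would be the thing to try.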

Ability to use parsers individually

How can I disable the Telegram parser and use only the WhatsApp parser, or vice versa?

I don't have a Telegram export, and the code fails when it tries to parse one.

The program detects all my lines as 'invalid'

Hello, I've found your project on Reddit, and I'm currently testing this repository.

I tried to preprocess my data, but unfortunately I'm getting this error. Any ideas?

Command used:
python ./src/whatsapp_parser.py --session_token "<|endoftext|>" --delta_h_threshold 4 --us --user_name johnny

Logs:
[whatsapp_parser.py][INFO]: WA_STOP_WORDS:['https', '<Media omessi>', '<Media omitted>', 'www']

[whatsapp_parser.py][INFO]: Found 1 txt files in ./data/chat_raw/whatsapp/ folder ['./data/chat_raw/whatsapp/_chat.txt']

[whatsapp_parser.py][INFO]: Found 33322 invalid lines in ./data/chat_raw/whatsapp/_chat.txt

[whatsapp_parser.py][INFO]: Saving ./data/chat_parsed/wa-chats.txt
