team-synackd / nawab Goto Github PK

View Code? Open in Web Editor NEW

8.0 1.0 2.0 87 KB

NaWaB is a bot which shares all sorts of information regarding Computer Networks scraping twitter content

License: MIT License

Python 100.00%

nawab-bot nawab twitter retweets python

nawab's People

Contributors

Stargazers

Watchers

Forkers

iammarco11 ananthanandanan

nawab's Issues

Change to python3 from python2.

Current codebase is mostly focused on python2 than python3. Remove deprecated functions and shift to python3.

Feature request for next release

Let's keep updating this list to track our own feature requests that we are interested to merge with the existing code.

Improved keyword search
User able to give twitter access to nawab bot so that they can retweet from the telegram bot itself.

Telegram bot: A button for admins to integrate with the bot on manually retweeting a choosen tweet.

Blacklist and whitelist seems to be broken

The blacklist feature included (atleast the logic on how it works) seems to be not perfect and is buggy.

Bootstrap a README file

Right now, we lack a README file elaborating on what the project is and other related context that can be put into a README. Would be good to have good README for this project.

Proper formatting of logs with log levels coming out of telegram bot

Blacklist the bot account itself

I did notice that the bot is retweeting the same tweet again from the retweet of the bot. Blacklisting the bot twitter account would reduce the duplication or retweets coming from the bot itself.

Need to keep track of tweets already scraped

Inorder to avoid duplicate retweet or posting to the channel, we need to keep track of all the tweets which has been already tweeted by the bot.

Hint: Each tweets have specific dedicated IDs, can have that over a database and always check if the next tweet we retweet exists in that database or not?

Script inactive at 12:30 am to 6:30 am

It can be arising from either tweepy or the Twitter API, Can also be one of the impact due to COVID-19. Better error handling of the error codes from all the modules we use might be one solution to identify the origin of the problem.

Better file format for tid_store.

Moving forward, since we need a better verbose output in tid_store, it would be better to change the way tid's are stored.

Resume from the last tid store received at the telegram bot/client

Assume a scenario where the user starts the telegram bot twice a day within, that is less than 24 hours. That would mean that with our current logic, the user will get the same tweet, again and again, showing duplicates at the telegram bot level.

Hence we need to find a way to resume sending tid to telegram bot only after the last send tid before stop command was executed for a particular user.

An ideal solution would be to store the last received tid at the telegram cache.

Verbose logging

Initial thoughts, the logs should also contain DateTime objects in a parsable format.

Telegram bot telemetry

We need to understand how many people have connected to telegram bot.
The frequency of people using the bot
The frequency of people quitting the bot, quitting the bot within 2-5 mins of time frame.
Usage of buttons etc
How many unique users are connected at the moment, how many have users in the past etc

telemetry and user data measurement can be tricky as we cannot compromise privacy. We don't need to know who has connected, we just need the numbers or the stats.

Let this be the start point of idea where we can improve upon, once we have a minimal idea on how the data would look like, we can update as we progress.

Performant way to read context about blacklist, whitelist or keywords

We currently have a few features like:

Blacklist: Accounts or context not to be tweeted or parsed by naWab.
Whitelist: Accounts or context which are always monitored by naWab and most of their tweets are scraped by naWab.
Keywords: We currently iterate through the list of keywords we have specific to networks that we scrap. This was done in order to generalise naWab as a scrapping tool and be used for not only network-related context but for any context.

Currently, all these are written and stored in a .txt file. We need to investigate what another sort of ways is better than to always open a file and reading the txt file for processing?

One option for us is to, use a pandas data frame to tabularise all the data and store all these into a single CSV or any other. The computation is done within the code utilising the data frame initialised once in the code is better than opening and reading a txt while throughout the process.

Is there a better way to tackle this?

A single entry point for the bots

Currently, there are two different entry points for the whole nawab. There is one from the tg_bot while other from the twitter bot. Yet, both are distinctive.

We need to device a scenario where there is only one single entry point to the nawab. That is, the twitter bot keeps scarping the tweets in the background while the telegram bot, once received a request to start, should carry on with its process. Even then the twitter bot should keep on collecting tweets.

Better default location for the logs

Currently, the logs are created inside the repo itself, which is absurd. We should have the logs by default in a location like /var/log/<name>.log.

Since we are integrating argparse #18, there should be a provision to choose a log directory location as well for easier traversal for the user if required.

Integrate argparse

We need to parameterize the script in order to make the whole bot experience a bit more flexible

tid_store is storing duplicate tids

Though, while retweet, it is not retweeting the duplicate tweets.

List of Arguments needed for argparse

Option to pass the username for blacklist
Enable a retweeting option
Silent logging mechanism; ie writing the logs to a file instead of printing them

Update readme with the newly added cli options

Another log level: debug

There can be another log level as debug, for example: logging.debug() and can be set to be written into a log file or so simple as by setting,

logging.basicConfig(filename='name.log', level=logging.DEBUG)

Code refactor

Currently, the code is written mostly in functions. Would be good to introduce classes into the code architecture.

Update the commit history as well.

Provide better error handling methods

Replace tid_store file with Dict()

This would reduce file writing and reading at multiple places. Thus, we could restrict writting tid store as a file at fewer places where we want to actually know what's inside.

Telegram bot: Integrate admin privilege to handle or input tweets.

The main python file nawab.py should be written in class

Can't kill nawab script while tg bot is running.

when the tg bot is executing its start command, and while that is in progress, if we try to kill the python program with keyboard interupt (CTRL + C) it doesn't seem to kill itself

There are multiple iteration for just a single keyword search

Log information:

INFO: 04/28/2020 06:19:45 PM |https://twitter.com/Gardena_Global/status/1254984986152841217
INFO: 04/28/2020 06:19:45 PM |https://twitter.com/Gardena_Global/status/1254984986152841217
INFO: 04/28/2020 06:19:45 PM |https://twitter.com/Gardena_Global/status/1254984986152841217
INFO: 04/28/2020 06:19:45 PM |Id: 1254984986152841217is stored to the db from this iteration
INFO: 04/28/2020 06:19:45 PM |Id: 1254984986152841217is stored to the db from this iteration
INFO: 04/28/2020 06:19:45 PM |Id: 1254984986152841217is stored to the db from this iteration
INFO: 04/28/2020 06:19:45 PM |starting new query search: #BGP
INFO: 04/28/2020 06:19:45 PM |starting new query search: #BGP
INFO: 04/28/2020 06:19:45 PM |starting new query search: #BGP
INFO: 04/28/2020 06:19:46 PM |https://twitter.com/MohsinulMalik/status/1255115033345835009
INFO: 04/28/2020 06:19:46 PM |https://twitter.com/MohsinulMalik/status/1255115033345835009
INFO: 04/28/2020 06:19:46 PM |https://twitter.com/MohsinulMalik/status/1255115033345835009
INFO: 04/28/2020 06:19:46 PM |Id: 1255115033345835009is stored to the db from this iteration
INFO: 04/28/2020 06:19:46 PM |Id: 1255115033345835009is stored to the db from this iteration
INFO: 04/28/2020 06:19:46 PM |Id: 1255115033345835009is stored to the db from this iteration
INFO: 04/28/2020 06:19:46 PM |starting new query search: #Routing
INFO: 04/28/2020 06:19:46 PM |starting new query search: #Routing
INFO: 04/28/2020 06:19:46 PM |starting new query search: #Routing
INFO: 04/28/2020 06:19:46 PM |https://twitter.com/Trifibre/status/1255111779899904002
INFO: 04/28/2020 06:19:46 PM |https://twitter.com/Trifibre/status/1255111779899904002
INFO: 04/28/2020 06:19:46 PM |https://twitter.com/Trifibre/status/1255111779899904002
INFO: 04/28/2020 06:19:46 PM |Id: 1255111779899904002is stored to the db from this iteration
INFO: 04/28/2020 06:19:46 PM |Id: 1255111779899904002is stored to the db from this iteration
INFO: 04/28/2020 06:19:46 PM |Id: 1255111779899904002is stored to the db from this iteration
INFO: 04/28/2020 06:19:46 PM |starting new query search: #IP
INFO: 04/28/2020 06:19:46 PM |starting new query search: #IP
INFO: 04/28/2020 06:19:46 PM |starting new query search: #IP

Inaccurate logging of the retweets; tweets are logged late after the next retweet is done

Hey,

So there is an issue which I encountered after running the script, as 2 tweets were retweeted, while it logged only one of those tweets in the file.

Telegram bot: /help and /stop options.

Need to add more verbose options like /help and /stop.

Can add more options to make this feel like a chatbot as well, if that is a good idea.

check if we are properly opening and closing the files

I was recently going through a few python coding guidelines from their official documentation, here is what I found:

Read From a File
Use the with open syntax to read from files. This will automatically close files for you.

Bad:

f = open('file.txt')
a = f.read()
print a
f.close()
Good:

with open('file.txt') as f:
    for line in f:
        print line
The with statement is better because it will ensure you always close the file, even if an exception is raised inside the with block.

See to it if we are properly handling things in here or not.

Telegram channel to output the tweets

Perhaps with retweeting the tweets related to Networks, would be good to have a dedicated Telegram channel to club all the tweets scraped by naWab.

blacklist the user account or the twitter account in charge of running the twitter bot

Telegram bot: Use the bot instead of channel

Since @iammarco11 figured out that a telegram channel can't be flexible enough to serve out needs, my suggestion is to use the telegram bot itself, instead of public service from the bot as a how channel would have served, it would be a bit stricter option where users have to activate the bot in order to keep receiving the tweets.

The workflow idea:

We keep the twitter bot keep running, which scrapes all the tweets, tweet id etc which is needed.
Have a telegram bot, for instance, NaWab bot. When /start is initiated at the bot, the bot can read/look at the most recent tweets scraped by the twitter bot and start sending it to the telegram bot. This can be done either by providing APIs from twitter scraping script we have, or either can read tid_store. An API to read tid_store would be the ideal way to proceed.
Add retweet inline keyboard option in telegram bot for admin specific use only.
We can customize and personalise the telegram bot as the above proceeds.

Let me know if there is any more questions.

Template config files

Add template config files on handling both twitter and telegram API tokens. Update the README according to it as well.

Telegram bot: When the user asks to show tweets curated, it should show the recently scraped ones.

When the telegram bot shows the tweets scraped, the tweets should not be from the start of the tid_store. Instead, the bot should be aware of the recent tweets and post that alone. Verbose tid_store would be required and be of help.

Example, Check the date user requested, and parse the tweet id only from that date in the tid_store.

We can also calibrate it to check time as well, if we are scarping more data per day.

Also, tid_store in json might be good too.