
nawab's People

Contributors

ananthanandanan, aniketh01, iammarco11


nawab's Issues

Feature request for next release

Let's keep updating this list to track the feature requests we are interested in merging into the existing code.

  • Improved keyword search
  • Allow users to grant Twitter access to the nawab bot so that they can retweet from the Telegram bot itself.

Bootstrap a README file

Right now we lack a README file elaborating on what the project is, along with other related context that belongs in a README. It would be good to have a solid README for this project.

Blacklist the bot account itself

I noticed that the bot retweets the same tweet again from its own retweet. Blacklisting the bot's Twitter account would reduce duplicate retweets originating from the bot itself.

Need to keep track of tweets already scraped

In order to avoid duplicate retweets or posts to the channel, we need to keep track of all the tweets the bot has already tweeted.

Hint: each tweet has a dedicated ID. We could store these IDs in a database and always check whether the next tweet we want to retweet already exists there.
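A minimal sketch of that hint (the table and helper names below are hypothetical, not from the codebase): a small SQLite table of seen tweet IDs is enough for the duplicate check.

```python
import sqlite3

def open_seen_db(path="seen_tweets.db"):
    """Create (if needed) and return a connection to the seen-IDs table."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS seen (tid TEXT PRIMARY KEY)")
    return conn

def mark_seen(conn, tid):
    """Record a tweet ID; ignore it if it is already stored."""
    conn.execute("INSERT OR IGNORE INTO seen (tid) VALUES (?)", (str(tid),))
    conn.commit()

def already_seen(conn, tid):
    """True if this tweet ID was processed in an earlier iteration."""
    cur = conn.execute("SELECT 1 FROM seen WHERE tid = ?", (str(tid),))
    return cur.fetchone() is not None
```

The `PRIMARY KEY` on `tid` makes duplicate inserts a no-op, so the check-then-store flow stays simple.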

Script inactive from 12:30 am to 6:30 am

It could be arising from either tweepy or the Twitter API; it could also be an effect of COVID-19-related service changes. Better handling of the error codes from all the modules we use might be one way to identify the origin of the problem.
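One stdlib-only sketch of that idea (not tied to tweepy's actual exception classes, which should be caught specifically once identified): a wrapper that logs which module an exception came from before re-raising, so we can tell a tweepy failure from a Twitter API failure from our own bug.

```python
import logging

def call_and_trace(func, *args, **kwargs):
    """Call func; on failure, log which module raised before re-raising.

    The exception type's __module__ tells us whether the failure came
    from tweepy, requests, our own code, etc. -- the first step in
    locating why the script goes quiet overnight.
    """
    try:
        return func(*args, **kwargs)
    except Exception as exc:
        logging.error("call to %s failed: %s.%s: %s",
                      getattr(func, "__name__", repr(func)),
                      type(exc).__module__, type(exc).__name__, exc)
        raise
```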

Resume from the last tid received at the telegram bot/client

Assume a scenario where the user starts the Telegram bot twice within a day, that is, less than 24 hours apart. With our current logic, the user would get the same tweets again and again, showing duplicates at the Telegram bot level.

Hence we need to find a way to resume sending tids to the Telegram bot only from the last tid sent before the stop command was executed for a particular user.

An ideal solution would be to store the last received tid in the Telegram cache.
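A minimal sketch of that resume logic (the per-chat dict and function names are hypothetical): remember the last tid delivered to each chat and only send newer ones, relying on Twitter IDs being monotonically increasing.

```python
# Maps each Telegram chat ID to the last tweet ID delivered to it.
last_sent = {}

def tweets_to_send(chat_id, tid_store):
    """Return only the tids newer than the last one this chat received.

    tid_store is assumed to be a list of numeric tweet IDs, oldest
    first (Twitter snowflake IDs increase over time).
    """
    last = last_sent.get(chat_id)
    fresh = [t for t in tid_store if last is None or t > last]
    if fresh:
        last_sent[chat_id] = fresh[-1]
    return fresh
```

In the real bot this dict would live in (or be restored from) the Telegram-side cache, so a restart does not reset it.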

Verbose logging

Initial thoughts: the logs should also contain timestamps in a parsable format.
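For instance (a sketch using only the stdlib logging module), an ISO 8601 `datefmt` makes every log line machine-parsable:

```python
import logging

# %(asctime)s with an ISO 8601 datefmt yields timestamps that
# datetime.fromisoformat() can parse back directly.
formatter = logging.Formatter(
    fmt="%(asctime)s |%(levelname)s |%(message)s",
    datefmt="%Y-%m-%dT%H:%M:%S",
)
handler = logging.StreamHandler()
handler.setFormatter(formatter)
log = logging.getLogger("nawab")
log.addHandler(handler)
log.setLevel(logging.INFO)
```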

Telegram bot telemetry

  • We need to understand how many people have connected to the Telegram bot.
  • The frequency of people using the bot.
  • The frequency of people quitting the bot, including those quitting within a 2–5 minute time frame.
  • Usage of buttons, etc.
  • How many unique users are connected at the moment, how many have used it in the past, etc.

Telemetry and user data measurement can be tricky, since we cannot compromise privacy. We don't need to know who has connected; we just need the numbers or the stats.

Let this be the starting point of an idea to improve upon. Once we have a minimal idea of how the data would look, we can update as we progress.
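One privacy-respecting sketch (the class and salt handling here are illustrative assumptions, not a settled design): count unique users by storing only salted hashes of chat IDs, so the stats never contain raw identities.

```python
import hashlib

class AnonCounter:
    """Counts unique and total connections without storing raw chat IDs."""

    def __init__(self, salt):
        self.salt = salt            # keep the salt secret and out of the repo
        self.seen = set()           # salted hashes only, never raw IDs
        self.total_starts = 0

    def record_start(self, chat_id):
        digest = hashlib.sha256(
            (self.salt + str(chat_id)).encode()
        ).hexdigest()
        self.seen.add(digest)
        self.total_starts += 1

    def stats(self):
        return {"unique_users": len(self.seen),
                "total_starts": self.total_starts}
```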

Performant way to read context about blacklist, whitelist or keywords

We currently have a few features like:

  1. Blacklist: accounts or context not to be tweeted or parsed by naWab.
  2. Whitelist: accounts or context that are always monitored by naWab and whose tweets are mostly scraped by naWab.
  3. Keywords: we currently iterate through the list of keywords specific to the networks we scrape. This was done in order to generalise naWab as a scraping tool, usable not only for network-related context but for any context.

Currently, all of these are written and stored in .txt files. We need to investigate which approaches are better than always opening and reading a txt file for processing.

One option is to use a pandas DataFrame to tabularise all the data and store it in a single CSV (or another format). Doing the computation within the code, using a DataFrame initialised once, is better than opening and reading a txt file throughout the process.

Is there a better way to tackle this?
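A simpler alternative sketch (file names hypothetical): load each list once at startup into a frozenset. Membership checks are then O(1), with no pandas dependency and no repeated file reads.

```python
def load_list(path):
    """Read one entry per line into a frozenset, skipping blank lines."""
    with open(path) as f:
        return frozenset(line.strip() for line in f if line.strip())

# Loaded once at startup instead of re-reading the txt files per tweet:
# blacklist = load_list("blacklist.txt")
# whitelist = load_list("whitelist.txt")
# keywords  = load_list("keywords.txt")
```

A frozenset fits here because these lists are read-mostly; a DataFrame would only pay off if we later need joins or per-entry metadata.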

A single entry point for the bots

Currently, there are two different entry points for the whole of nawab: one from the tg_bot and another from the Twitter bot, and the two are distinct.

We need to devise a setup with one single entry point to nawab. That is, the Twitter bot keeps scraping tweets in the background while the Telegram bot, once it receives a request to start, carries on with its own process. Even then, the Twitter bot should keep collecting tweets.
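That split can be sketched with a background thread (the names and the queue-based handoff are assumptions, not the project's actual design): the scraper thread keeps filling a queue regardless of whether the Telegram side is active.

```python
import queue
import threading

tid_queue = queue.Queue()

def scraper_loop(fetch, stop_event):
    """Background producer: keeps pulling tweet IDs into the queue."""
    while not stop_event.is_set():
        for tid in fetch():        # fetch() stands in for the Twitter scraper
            tid_queue.put(tid)
        stop_event.wait(1.0)       # poll interval; tune as needed

def start_scraper(fetch):
    """Single entry point: launch the scraper, return its stop switch."""
    stop = threading.Event()
    t = threading.Thread(target=scraper_loop, args=(fetch, stop), daemon=True)
    t.start()
    return stop
```

The Telegram bot would then consume from `tid_queue` (or the tid store) only when a user has started it, while the producer never pauses.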

Better default location for the logs

Currently, the logs are created inside the repo itself, which is not ideal. By default the logs should live in a location like /var/log/<name>.log.

Since we are integrating argparse (#18), there should also be a provision to choose the log directory location, for easier access if the user requires it.

Integrate argparse

We need to parameterize the script in order to make the whole bot experience a bit more flexible.

List of Arguments needed for argparse

  • Option to pass a username to add to the blacklist
  • Enable a retweeting option
  • Silent logging mechanism, i.e. writing the logs to a file instead of printing them
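A sketch of that parser (the flag names are suggestions, not settled), including a log-directory option for the related logs issue:

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        prog="nawab",
        description="Twitter scraper / Telegram relay bot",
    )
    parser.add_argument("--blacklist", action="append", default=[],
                        metavar="USERNAME",
                        help="username to add to the blacklist (repeatable)")
    parser.add_argument("--retweet", action="store_true",
                        help="enable retweeting of matched tweets")
    parser.add_argument("--silent", action="store_true",
                        help="write logs to a file instead of printing them")
    parser.add_argument("--log-dir", default="/var/log",
                        help="directory for log files (default: /var/log)")
    return parser
```

`action="append"` lets the user repeat `--blacklist` once per username instead of inventing a separator syntax.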

Another log level: debug

There can be another log level, debug (i.e. logging.debug()), which can be set to be written into a log file as simply as:

logging.basicConfig(filename='name.log', level=logging.DEBUG)

Code refactor

Currently, the code is written mostly as free functions. It would be good to introduce classes into the code architecture.

Update the commit history as well.

Replace tid_store file with Dict()

This would reduce file writing and reading in multiple places. We could then restrict writing the tid store to a file to the few places where we actually want to inspect its contents.
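A sketch of that idea (class and method names hypothetical): keep the tid store as a plain dict in memory and serialise it only on an explicit flush.

```python
import json

class TidStore:
    """In-memory tid store; file I/O happens only on explicit flush/load."""

    def __init__(self):
        self._store = {}            # tid -> tweet URL (or any metadata)

    def add(self, tid, url):
        self._store[str(tid)] = url

    def __contains__(self, tid):
        return str(tid) in self._store

    def flush(self, path):
        """Write to disk only where we actually want to inspect contents."""
        with open(path, "w") as f:
            json.dump(self._store, f, indent=2)

    def load(self, path):
        with open(path) as f:
            self._store = json.load(f)
```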

There are multiple iterations for just a single keyword search

Log information:

INFO: 04/28/2020 06:19:45 PM |https://twitter.com/Gardena_Global/status/1254984986152841217
INFO: 04/28/2020 06:19:45 PM |https://twitter.com/Gardena_Global/status/1254984986152841217
INFO: 04/28/2020 06:19:45 PM |https://twitter.com/Gardena_Global/status/1254984986152841217
INFO: 04/28/2020 06:19:45 PM |Id: 1254984986152841217is stored to the db from this iteration
INFO: 04/28/2020 06:19:45 PM |Id: 1254984986152841217is stored to the db from this iteration
INFO: 04/28/2020 06:19:45 PM |Id: 1254984986152841217is stored to the db from this iteration
INFO: 04/28/2020 06:19:45 PM |starting new query search: #BGP
INFO: 04/28/2020 06:19:45 PM |starting new query search: #BGP
INFO: 04/28/2020 06:19:45 PM |starting new query search: #BGP
INFO: 04/28/2020 06:19:46 PM |https://twitter.com/MohsinulMalik/status/1255115033345835009
INFO: 04/28/2020 06:19:46 PM |https://twitter.com/MohsinulMalik/status/1255115033345835009
INFO: 04/28/2020 06:19:46 PM |https://twitter.com/MohsinulMalik/status/1255115033345835009
INFO: 04/28/2020 06:19:46 PM |Id: 1255115033345835009is stored to the db from this iteration
INFO: 04/28/2020 06:19:46 PM |Id: 1255115033345835009is stored to the db from this iteration
INFO: 04/28/2020 06:19:46 PM |Id: 1255115033345835009is stored to the db from this iteration
INFO: 04/28/2020 06:19:46 PM |starting new query search: #Routing
INFO: 04/28/2020 06:19:46 PM |starting new query search: #Routing
INFO: 04/28/2020 06:19:46 PM |starting new query search: #Routing
INFO: 04/28/2020 06:19:46 PM |https://twitter.com/Trifibre/status/1255111779899904002
INFO: 04/28/2020 06:19:46 PM |https://twitter.com/Trifibre/status/1255111779899904002
INFO: 04/28/2020 06:19:46 PM |https://twitter.com/Trifibre/status/1255111779899904002
INFO: 04/28/2020 06:19:46 PM |Id: 1255111779899904002is stored to the db from this iteration
INFO: 04/28/2020 06:19:46 PM |Id: 1255111779899904002is stored to the db from this iteration
INFO: 04/28/2020 06:19:46 PM |Id: 1255111779899904002is stored to the db from this iteration
INFO: 04/28/2020 06:19:46 PM |starting new query search: #IP
INFO: 04/28/2020 06:19:46 PM |starting new query search: #IP
INFO: 04/28/2020 06:19:46 PM |starting new query search: #IP

Check if we are properly opening and closing the files

I was recently going through a few Python coding guidelines from the official documentation; here is what I found:

Read From a File
Use the with open syntax to read from files. This will automatically close files for you.

Bad:

f = open('file.txt')
a = f.read()
print(a)
f.close()
Good:

with open('file.txt') as f:
    for line in f:
        print(line)
The with statement is better because it will ensure you always close the file, even if an exception is raised inside the with block.

Check whether we are handling this properly in our code or not.

Telegram channel to output the tweets

Perhaps, along with retweeting the tweets related to networks, it would be good to have a dedicated Telegram channel collecting all the tweets scraped by naWab.

Telegram bot: Use the bot instead of channel

Since @iammarco11 figured out that a Telegram channel isn't flexible enough to serve our needs, my suggestion is to use the Telegram bot itself. Instead of the public broadcast a channel would have provided, this is a slightly stricter option where users have to activate the bot in order to keep receiving the tweets.

The workflow idea:

  1. We keep the Twitter bot running, which scrapes all the tweets, tweet IDs, etc. that are needed.
  2. Have a Telegram bot, for instance a NaWab bot. When /start is initiated at the bot, it can read the most recent tweets scraped by the Twitter bot and start sending them to the user. This can be done either by providing APIs from the Twitter scraping script we have, or by reading tid_store directly. An API to read tid_store would be the ideal way to proceed.
  3. Add a retweet inline-keyboard option in the Telegram bot for admin-specific use only.
  4. We can customise and personalise the Telegram bot as the above proceeds.

Let me know if there are any more questions.

Template config files

Add template config files for handling both the Twitter and Telegram API tokens. Update the README accordingly as well.

Telegram bot: When the user asks to show tweets curated, it should show the recently scraped ones.

When the Telegram bot shows the scraped tweets, they should not be from the start of the tid_store. Instead, the bot should be aware of the recent tweets and post only those. A more verbose tid_store would be required and would help here.

For example: check the date the user requested, and parse only the tweet IDs from that date in the tid_store.

We can also calibrate it to check the time as well, if we are scraping more data per day.

Also, keeping tid_store in JSON might be good too.
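A sketch of that JSON layout (the structure and names are one possible choice, not a spec): key the tid_store by ISO date, so a user request can be served from that day's bucket alone.

```python
import json
from datetime import date

def add_tid(store, tid, on_date=None):
    """Append a tweet ID under its scrape date (ISO string key)."""
    key = (on_date or date.today()).isoformat()
    store.setdefault(key, []).append(str(tid))

def tids_for(store, on_date):
    """Return only the tweet IDs scraped on the requested date."""
    return store.get(on_date.isoformat(), [])

# The whole store serialises to JSON, e.g.:
# {"2020-04-28": ["1254984986152841217", "1255115033345835009"]}
```

Finer time-based calibration could later replace the date key with an hourly one without changing the overall shape.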
