team-synackd / nawab
NaWaB is a bot that shares all sorts of information about computer networks by scraping Twitter content.
License: MIT License
The current codebase targets Python 2 rather than Python 3. Remove deprecated functions and port it to Python 3.
Let's keep updating this list to track the feature requests we are interested in merging into the existing code.
The blacklist feature (at least the logic behind how it works) seems imperfect and buggy.
Right now we lack a README explaining what the project is, along with other related context that belongs in one. It would be good to have a solid README for this project.
I noticed the bot retweeting the same tweet again, picked up from the bot's own retweet. Blacklisting the bot's own Twitter account would reduce this duplication of retweets originating from the bot itself.
In order to avoid duplicate retweets or posts to the channel, we need to keep track of all the tweets the bot has already tweeted.
Hint: each tweet has a dedicated ID. We could keep those IDs in a database and always check whether the tweet we are about to retweet already exists there.
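The hint above could be sketched with a small sqlite-backed store; the class and table names here are assumptions for illustration, not the project's actual schema:

```python
import sqlite3

# Hypothetical helper: remember every tweet ID the bot has already
# retweeted, so the same tweet is never processed twice.
class TweetStore:
    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS seen (tid TEXT PRIMARY KEY)"
        )

    def seen(self, tid):
        # Check whether this tweet ID was already retweeted.
        cur = self.conn.execute("SELECT 1 FROM seen WHERE tid = ?", (str(tid),))
        return cur.fetchone() is not None

    def mark(self, tid):
        # Record a tweet ID; INSERT OR IGNORE keeps duplicates harmless.
        self.conn.execute("INSERT OR IGNORE INTO seen VALUES (?)", (str(tid),))
        self.conn.commit()

store = TweetStore()
store.mark("1254984986152841217")
```

With a file path instead of ":memory:", the store would also survive bot restarts.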
It could arise from either tweepy or the Twitter API, and could also be a side effect of COVID-19. Better handling of the error codes from all the modules we use might be one way to identify the origin of the problem.
Moving forward, since we need more verbose output in tid_store, it would be better to change the way TIDs are stored.
Assume a scenario where the user starts the Telegram bot twice within a day, i.e. less than 24 hours apart. With our current logic, the user will get the same tweets again and again, producing duplicates at the Telegram bot level.
Hence we need a way to resume sending TIDs to the Telegram bot only from the last TID sent to that particular user before the stop command was executed.
An ideal solution would be to store the last received TID in the Telegram cache.
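A minimal sketch of that per-user cursor, assuming tweet IDs increase over time (Twitter snowflake IDs do); the dict-based cache and function name are illustrative stand-ins for the real Telegram cache:

```python
# chat_id -> last tweet ID delivered to that user; a stand-in for the
# persistent Telegram-side cache proposed above.
last_sent = {}

def tweets_to_send(chat_id, tid_store):
    """Return only tweets newer than the last one this chat received."""
    cursor = last_sent.get(chat_id, 0)
    fresh = [tid for tid in tid_store if tid > cursor]
    if fresh:
        # Advance the cursor so a restart within 24h sends no duplicates.
        last_sent[chat_id] = max(fresh)
    return fresh
```

On a second /start within the same day, the cursor filters out everything already delivered.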
Initial thought: the logs should also contain timestamps in a parsable format.
Telemetry and user-data measurement can be tricky, as we cannot compromise privacy. We don't need to know who has connected; we just need the numbers, i.e. the stats.
Let this be the starting point of an idea to improve upon; once we have a minimal picture of what the data would look like, we can update as we progress.
We currently have a few features like:
Currently, all of this is written to and stored in a .txt file. We need to investigate whether there are better approaches than always opening and reading the .txt file for processing.
One option is to use a pandas DataFrame to tabularise the data and store it in a single CSV (or another format). Doing the computation within the code, using a DataFrame initialised once, is better than opening and reading a .txt file throughout the process.
Is there a better way to tackle this?
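One way the pandas option could look, as a sketch: load the store once at startup and answer lookups in memory. The column names and sample rows are assumptions, not the current file layout.

```python
import pandas as pd

# Illustrative rows standing in for the contents of the current .txt file.
rows = [
    {"tid": "1254984986152841217", "query": "#BGP"},
    {"tid": "1255115033345835009", "query": "#Routing"},
]

# Initialised once at startup instead of re-opening a file per lookup.
df = pd.DataFrame(rows)

def already_seen(tid):
    # Pure in-memory membership check, no file I/O per call.
    return (df["tid"] == str(tid)).any()
```

Persisting back is then a single `df.to_csv(...)` at shutdown or on a timer, rather than a write on every iteration.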
Currently there are two distinct entry points into nawab: one from the tg_bot and the other from the Twitter bot.
We need to devise a design with a single entry point: the Twitter bot keeps scraping tweets in the background, while the Telegram bot, once it receives a start request, carries on with its own process. Even then, the Twitter bot should keep collecting tweets.
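The single-entry-point idea could be sketched with a daemon thread and a queue; the scraper function here is a stand-in for the real tweepy search loop, not the project's actual code:

```python
import queue
import threading
import time

# The scraper runs in a background thread and pushes tweet IDs onto a
# queue; the Telegram side consumes them whenever a user has run /start.
tweets = queue.Queue()
stop = threading.Event()

def scraper():
    n = 0
    while not stop.is_set():
        n += 1
        tweets.put(n)        # real code would put scraped tweet IDs here
        time.sleep(0.01)

t = threading.Thread(target=scraper, daemon=True)
t.start()

first = tweets.get(timeout=1)   # Telegram side consumes as items arrive
stop.set()
t.join()
```

Because the scraper never blocks on the Telegram side, it keeps collecting tweets whether or not anyone is listening.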
Currently the logs are created inside the repo itself, which is absurd. By default, the logs should live in a location like /var/log/<name>.log.
Since we are integrating argparse (#18), there should also be a provision to choose a log directory, for easier traversal if the user requires it.
We need to parameterize the script to make the whole bot experience a bit more flexible.
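A sketch of the argparse wiring discussed around #18, including the log-directory provision; the flag names and defaults here are assumptions, not the merged interface:

```python
import argparse
import os

def build_parser():
    parser = argparse.ArgumentParser(prog="nawab")
    parser.add_argument("--log-dir", default="/var/log",
                        help="directory where nawab.log is written")
    parser.add_argument("--debug", action="store_true",
                        help="enable DEBUG-level logging")
    return parser

# Parse an example command line instead of sys.argv for illustration.
args = build_parser().parse_args(["--log-dir", "/tmp/nawab"])
log_path = os.path.join(args.log_dir, "nawab.log")
```

With this in place, the /var/log default still holds while users can redirect logs anywhere they can traverse.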
However, when retweeting, it does not retweet duplicate tweets.
There can be another log level, debug, e.g. logging.debug(), which can be written to a log file simply by setting:
logging.basicConfig(filename='name.log', level=logging.DEBUG)
Currently the code is written mostly as functions. It would be good to introduce classes into the code architecture.
Update the commit history as well.
This would reduce file writing and reading in multiple places. Thus, we could restrict writing the tid store to a file to the few places where we actually want to know what's inside.
When the tg bot is executing its start command, and while that is in progress, trying to kill the Python program with a keyboard interrupt (Ctrl+C) doesn't seem to kill it.
Log information:
INFO: 04/28/2020 06:19:45 PM |https://twitter.com/Gardena_Global/status/1254984986152841217
INFO: 04/28/2020 06:19:45 PM |Id: 1254984986152841217is stored to the db from this iteration
INFO: 04/28/2020 06:19:45 PM |starting new query search: #BGP
INFO: 04/28/2020 06:19:46 PM |https://twitter.com/MohsinulMalik/status/1255115033345835009
INFO: 04/28/2020 06:19:46 PM |Id: 1255115033345835009is stored to the db from this iteration
INFO: 04/28/2020 06:19:46 PM |starting new query search: #Routing
INFO: 04/28/2020 06:19:46 PM |https://twitter.com/Trifibre/status/1255111779899904002
INFO: 04/28/2020 06:19:46 PM |Id: 1255111779899904002is stored to the db from this iteration
INFO: 04/28/2020 06:19:46 PM |starting new query search: #IP
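One possible fix for the Ctrl+C issue, as a sketch: install an explicit SIGINT handler that flips a flag the main loop checks, so a worker loop that would otherwise swallow KeyboardInterrupt still shuts down. The loop body here is a stand-in for one search/retweet cycle.

```python
import signal

shutting_down = False

def handle_sigint(signum, frame):
    # Flip a flag instead of raising, so any loop can exit cleanly.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGINT, handle_sigint)

iterations = 0
while not shutting_down and iterations < 100:
    iterations += 1          # real code: one search/retweet cycle
    if iterations == 2:
        # Simulate the user pressing Ctrl+C mid-run.
        signal.raise_signal(signal.SIGINT)
```

The handler runs before the next loop-condition check, so the bot exits after finishing the current cycle rather than hanging.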
Hey,
I ran into an issue after running the script: two tweets were retweeted, but only one of them was logged in the file.
We need to add more verbose options like /help and /stop.
We could also add more options to make this feel like a chatbot, if that's a good idea.
I was recently going through some Python coding guidelines from the official documentation; here is what I found:
Read From a File
Use the with open syntax to read from files. This will automatically close files for you.
Bad:
f = open('file.txt')
a = f.read()
print(a)
f.close()
Good:
with open('file.txt') as f:
    for line in f:
        print(line)
The with statement is better because it ensures the file is always closed, even if an exception is raised inside the with block.
Check whether we are handling things properly here.
Along with retweeting the tweets related to networks, it would be good to have a dedicated Telegram channel collecting all the tweets scraped by naWab.
Since @iammarco11 figured out that a Telegram channel isn't flexible enough to serve our needs, my suggestion is to use the Telegram bot itself. Instead of the public service a channel would have provided, the bot is a somewhat stricter option where users have to activate it in order to keep receiving the tweets.
The workflow idea:
Let me know if there are any more questions.
Add template config files for handling both the Twitter and Telegram API tokens, and update the README accordingly.
When the Telegram bot shows the scraped tweets, they should not start from the beginning of the tid_store. Instead, the bot should be aware of the recent tweets and post only those. A verbose tid_store would be required and helpful here.
For example: check the date the user made the request, and parse only the tweet IDs from that date in the tid_store.
We could also calibrate it to check the time as well, if we are scraping more data per day.
Also, storing tid_store as JSON might be good.
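A sketch of what a JSON tid_store keyed by scrape date could look like, so the bot can pull only the IDs from the day of the request; the layout is a proposal, not the current on-disk format:

```python
import json

# Proposed layout: ISO date -> tweet IDs scraped that day.
tid_store = {
    "2020-04-28": ["1254984986152841217", "1255115033345835009"],
    "2020-04-29": ["1255111779899904002"],
}

def tids_for(day, store=tid_store):
    """Return the tweet IDs scraped on the given ISO date, if any."""
    return store.get(day, [])

# JSON round-trips cleanly, ready to write to and read back from disk.
serialized = json.dumps(tid_store)
restored = json.loads(serialized)
```

A finer-grained variant could key by date and hour if we start scraping more data per day.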