
Crawler that grabs Twitter follower counts across time via internet archives given account user name

License: MIT License


followercounthistory's Introduction

Twitter Follower Count History via Web Archives

Follower Count History (fch) is a Python module that collects Twitter follower counts for a given Twitter handle from web archives, using MemGator to locate the mementos (archived copies of the profile page). The module extracts the follower count with a set of CSS selectors that match the follower count element on historical Twitter pages, covering almost every major overhaul the page layout has gone through. By default, the program collects data points from all available mementos.

Since PyPI version 1.0.14, the FCH package is platform independent. Versions of FCH on PyPI before 1.0.14 are only compatible with Linux systems.

[1] Mohammed Nauman Siddique. 2020. Historical Twitter Follower Count Via Web Archives. (August 2020). Retrieved August 05, 2020 from https://ws-dl.blogspot.com/2020/08/2020-08-05-historical-twitter-follower.html

[2] Miranda Smith. 2018. Twitter Follower Count History via the Internet Archive. (March 2018). Retrieved July 25, 2020 from https://ws-dl.blogspot.com/2018/03/2018-03-14-twitter-follower-count.html

Installation and Usage

Dependencies

Python 3
bs4
warcio
requests
pytest (used for testing purposes)
pytz (used for testing purposes)
R (optional: used to create graphs)

Usage

$ git clone https://github.com/oduwsdl/FollowerCountHistory.git
$ cd FollowerCountHistory
$ pip install -r requirements.txt
$  ./fch/__main__.py [-h] [--st] [--et] [--freq] [-f] <Twitter handle/ Twitter URL>

Install from PyPI

$ pip install fch
$  fch [-h] [--st] [--et] [--freq] [-f] <Twitter handle/ Twitter URL>

To just create the graph from a csv file

$ Rscript twitterFollowerCount.R <CSV file path>

Docker

We have published a Docker image at oduwsdl/fch with the tag 2.0, which can be used to run this tool as follows:

$  docker container run --rm -it   -v <Output Directory>:/app  -u $(id -u):$(id -g)  oduwsdl/fch:2.0 [options] <Twitter Handle>

Example of output being mapped to the current directory

$  docker container run --rm -it -v $PWD:/app -u $(id -u):$(id -g) oduwsdl/fch:2.0 --out  --st=20200101000000 --et=20200331000000 --freq=2592000  joebiden

Example of docker command for generating follower graph

$ docker container run --rm -it -v $PWD:/app -u $(id -u):$(id -g) --entrypoint /bin/bash oduwsdl/fch:2.0
I have no name!@736a209b64d6:/app$ ./fch/__main__.py --freq=2592000 joebiden| Rscript twitterFollowerCount.R

Options

Follower Count History (fch)

positional arguments:
  thandle     Enter a Twitter handle/ URL

optional arguments:
  -h, --help     show this help message and exit
  --st           Memento start datetime (YYYYMMDDHHMMSS)
  --et           Memento end datetime (YYYYMMDDHHMMSS)
  --freq         Sampling frequency of mementos (in seconds)
  -f             Output file path (Supported Extensions: JSON and CSV)
  -v, --version  Report the version of fch

--st: Defaults to Twitter's birth date (2006-03-21 12:00:00). Accepts the memento datetime in the ISO 8601 fourteen-digit variation (YYYYMMDDHHMMSS).
--et: Defaults to the current datetime. Accepts the memento datetime in the ISO 8601 fourteen-digit variation (YYYYMMDDHHMMSS).
--freq: Defaults to downloading all the mementos (no sampling).
-f: Accepts JSON and CSV file paths for output. If no value is provided, output is written to stdout in CSV format.
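As an illustration of the fourteen-digit format accepted by --st and --et, the following is a minimal Python sketch for converting such values to and from datetime objects. The function names are hypothetical and are not part of fch; treating the timestamps as UTC is an assumption.

```python
from datetime import datetime, timezone

TS_FORMAT = "%Y%m%d%H%M%S"  # fourteen digits: YYYYMMDDHHMMSS


def parse_memento_datetime(value: str) -> datetime:
    """Parse a fourteen-digit timestamp such as 20200101000000.

    Hypothetical helper; the value is assumed to be in UTC.
    """
    if len(value) != 14 or not value.isdigit():
        raise ValueError(f"expected 14 digits, got {value!r}")
    return datetime.strptime(value, TS_FORMAT).replace(tzinfo=timezone.utc)


def format_memento_datetime(moment: datetime) -> str:
    """Format a datetime back into the fourteen-digit form."""
    return moment.strftime(TS_FORMAT)


start = parse_memento_datetime("20200101000000")
end = parse_memento_datetime("20200331000000")
print((end - start).days)  # 90
```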

Output

The program can generate output in JSON and CSV formats. The -f option writes the CSV or JSON output to the supplied file path. By default, the module writes the output to stdout in CSV format.

Output Fields

Field	Description
MementoTimestamp	memento datetime in the ISO 8601 fourteen-digit variation
URI-M	link to the memento
FollowerCount	follower count from the URI-M
AbsGrowth	follower count increase/decrease w.r.t. the first memento
RelGrowth	follower count increase/decrease w.r.t. the previous memento
AbsPerGrowth	percentage increase/decrease in follower count w.r.t. the first memento
RelPerGrowth	percentage increase/decrease in follower count w.r.t. the previous memento
AbsFolRate	daily Twitter follower growth rate w.r.t. the first memento
RelFolRate	daily Twitter follower growth rate w.r.t. the previous memento
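The derived fields can be reproduced from the timestamps and follower counts alone. The sketch below is not fch's own code; the function name and dictionary layout are hypothetical. One observation worth flagging as an assumption: the rate values in the sample CSV output match growth divided by the elapsed time in seconds, so the sketch divides by seconds to stay consistent with those sample rows.

```python
from datetime import datetime


def _parse(ts):
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def growth_rows(points):
    """Compute growth fields for (timestamp, follower_count) pairs sorted by time.

    Hypothetical helper. "Abs" fields compare against the first memento,
    "Rel" fields against the previous one.
    """
    rows = []
    first_ts, first_count = points[0]
    prev_ts, prev_count = points[0]
    for ts, count in points:
        abs_growth = count - first_count
        rel_growth = count - prev_count
        abs_secs = (_parse(ts) - _parse(first_ts)).total_seconds()
        rel_secs = (_parse(ts) - _parse(prev_ts)).total_seconds()
        rows.append({
            "MementoDatetime": ts,
            "FollowerCount": count,
            "AbsGrowth": abs_growth,
            "RelGrowth": rel_growth,
            "AbsPerGrowth": round(100 * abs_growth / first_count, 2),
            "RelPerGrowth": round(100 * rel_growth / prev_count, 2),
            # Assumption: rate = growth / elapsed seconds (matches the sample CSV).
            "AbsFolRate": abs_growth / abs_secs if abs_secs else 0,
            "RelFolRate": rel_growth / rel_secs if rel_secs else 0,
        })
        prev_ts, prev_count = ts, count
    return rows


sample = [
    ("20200101001959", 4048208),
    ("20200131120028", 4142510),
    ("20200301001210", 4202148),
]
for row in growth_rows(sample):
    print(row["MementoDatetime"], row["AbsGrowth"], row["RelGrowth"])
```

The sketch assumes non-zero follower counts; a real implementation would need to guard the percentage calculations against division by zero.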

Sample Outputs

JSON Output

[{
	"MementoDatetime": "20200101001959",
	"URIM": "https://web.archive.org/web/20200101001959/https://twitter.com/JoeBiden",
	"FollowerCount": 4048208
}, {
	"MementoDatetime": "20200131120028",
	"URIM": "https://web.archive.org/web/20200131120028/https://twitter.com/joebiden",
	"FollowerCount": 4142510
}, {
	"MementoDatetime": "20200301001210",
	"URIM": "https://web.archive.org/web/20200301001210/https://twitter.com/JoeBiden/",
	"FollowerCount": 4202148
}]

CSV Output

MementoDatetime,URIM,FollowerCount,AbsGrowth,RelGrowth,AbsPerGrowth,RelPerGrowth,AbsFolRate,RelFolRate
20200101001959,https://web.archive.org/web/20200101001959/https://twitter.com/JoeBiden,4048208,0,0,0,0,0,0
20200131120028,https://web.archive.org/web/20200131120028/https://twitter.com/joebiden,4142510,94302,94302,2.33,2.33,0.0358,0.0358
20200301001210,https://web.archive.org/web/20200301001210/https://twitter.com/JoeBiden/,4202148,153940,59638,3.8,1.44,0.0297,0.02339
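For downstream analysis, the CSV output can be read with Python's standard csv module. The sketch below uses the three sample rows above as inline data and finds the memento with the largest growth relative to its predecessor; the helper name load_rows is hypothetical.

```python
import csv
import io

SAMPLE = """MementoDatetime,URIM,FollowerCount,AbsGrowth,RelGrowth,AbsPerGrowth,RelPerGrowth,AbsFolRate,RelFolRate
20200101001959,https://web.archive.org/web/20200101001959/https://twitter.com/JoeBiden,4048208,0,0,0,0,0,0
20200131120028,https://web.archive.org/web/20200131120028/https://twitter.com/joebiden,4142510,94302,94302,2.33,2.33,0.0358,0.0358
20200301001210,https://web.archive.org/web/20200301001210/https://twitter.com/JoeBiden/,4202148,153940,59638,3.8,1.44,0.0297,0.02339
"""


def load_rows(handle):
    """Read fch CSV output and convert the integer columns (hypothetical helper)."""
    rows = []
    for row in csv.DictReader(handle):
        row["FollowerCount"] = int(row["FollowerCount"])
        row["RelGrowth"] = int(row["RelGrowth"])
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(SAMPLE))
largest = max(rows, key=lambda row: row["RelGrowth"])
print(largest["MementoDatetime"], largest["RelGrowth"])  # 20200131120028 94302
```

To read a file written with -f instead, pass `open("joebiden.csv", newline="")` to load_rows in place of the StringIO object.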

Output to stdout

$ fch --st=20200101000000 --et=20200331000000  --freq=2592000 joebiden

Output to files

Command to return output to the file path

$ fch --st=20200101000000 --et=20200331000000  --freq=2592000 -f=output/joebiden.csv joebiden
$ fch --st=20200101000000 --et=20200331000000  --freq=2592000 -f=output/joebiden.json joebiden

Command to create graphs for each handle

$ Rscript twitterFollowerCount.R <file path>

List of Graphs for each Twitter handle:

File Name	Description
`<Twitterhandle>`-follower-count.jpg	shows Twitter follower growth over time
`<Twitterhandle>`-follower-growth-relative.jpg	shows Twitter follower growth w.r.t. previous memento
`<Twitterhandle>`-follower-growth.jpg	shows absolute number and percentage Twitter follower growth w.r.t. the first memento
`<Twitterhandle>`-follower-perc-growth-relative.jpg	shows Twitter follower growth over time w.r.t. previous memento in percentage
`<Twitterhandle>`-follower-rate-relative.jpg	shows new followers added per day w.r.t. previous memento
`<Twitterhandle>`-follower-rate.jpg	shows new followers added per day w.r.t. first memento

Examples

Command to find Twitter follower count for a Twitter handle from all the mementos since the account creation up until today
- Output to stdout as CSV
```
$  fch joebiden
```
- Output as CSV file
```
$  fch -f=joebiden.csv joebiden
```
Command to find Twitter follower count for a Twitter handle with a monthly sampling of the mementos since the account creation up until today
```
Frequency = 3600*24*30
Frequency = 2592000
```

Output to stdout as CSV

$  fch --freq=2592000 joebiden

Output as CSV file

$  fch -f=joebiden.csv --freq=2592000 joebiden
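How fch selects mementos for a given --freq is not spelled out here. One reading that is consistent with the sample output (one memento near the start of each 30-day window) is to split the time range into fixed windows of freq seconds starting at the start time and keep the first memento in each window. The sketch below implements that assumption; it is not taken from the fch source, and the function name is hypothetical.

```python
from datetime import datetime


def _parse(ts):
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def sample_by_frequency(timestamps, start, freq_seconds):
    """Keep the first memento in each freq_seconds-wide window after start.

    Assumed behaviour for illustration only.
    """
    start_dt = _parse(start)
    kept = {}
    for ts in sorted(timestamps):
        offset = (_parse(ts) - start_dt).total_seconds()
        if offset < 0:
            continue  # memento predates the start time
        window = int(offset // freq_seconds)
        kept.setdefault(window, ts)  # the first memento in the window wins
    return [kept[window] for window in sorted(kept)]


monthly = 3600 * 24 * 30  # 2592000 seconds
mementos = [
    "20200101001959",
    "20200115000000",
    "20200131120028",
    "20200210000000",
    "20200301001210",
]
print(sample_by_frequency(mementos, "20200101000000", monthly))
```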

Command to find Twitter follower count for a Twitter handle with a monthly sampling of the mementos within a specified start and end timestamp
- Output to stdout as CSV
```
$  fch --st=20200101000000 --et=20200331000000 --freq=2592000 joebiden
```
- Output as CSV file
```
$  fch -f=joebiden.csv --st=20200101000000 --et=20200331000000 --freq=2592000 joebiden
```
Fch group script
To run the 'fchgrp.sh' script, pass a txt or csv file as a parameter
- Output defaults to 'output/' folder
- Can use the same parameters as fch to generate the same results

$  ./fchgrp.sh user.txt

Setting output to 'newOutput/' for csv files using 'fchgrp.sh'

$  ./fchgrp.sh user.txt -f=newOutput/

Generate D3 Stub

GenerateD3Stub is a Python script that combines generateGraph_base.js, index.html, and the CSV data generated by fch.
Usage: pass any number of CSV file paths as parameters. The generated file, 'generatedGraphStub.html', contains the embeddable code.

$  cd html_generator; python generateD3stub.py csv1path csv2path ... csvnpath

By default, GenerateD3Stub prints the data passed to it to the console, so the user can verify that the correct data is being passed.

followercounthistory's Issues

TODO: push to archive

If there are no archives of the person's page, or only old ones, push the page to the archive.

Make available on pypi

It would be useful for some potential users to have the tool available on Pypi so instead of requiring them to download the source, they can run a single command like pip install fch or pip install followercounthistory.

We (@ibnesayeed and I) recently did this for cdxjGenerator, which has comparatively trivial code relative to FollowerCountHistory.

I have not had a chance to thoroughly examine the codebase for potential complications, but regardless, this would be worth considering to make the tool more accessible to those who would like to use it.

Execution echoes numbers to stdout, what do they mean?

Running:

python3 ./FollowerHist.py -e Ocasio2018

echoes the following to stdout:

http://web.archive.org/web/timemap/link/http://twitter.com/Ocasio2018
15 archive points found
20171018033758
2955
20180411044556
20180505060722
19150
20180523034652
20888
20180527074406
22706
20180529221423
24899
20180530045347
25135
20180606211514
33135
20180614141606
36746
20180621005111
43546
20180625203123
48758
20180625203536
48766
20180625203539
48766
20180627025800
85757
20180627165955

I presume the longer numerical strings are 14-digit datetimes of respective mementos and the other numerical strings are a count but there is nothing to signify this.

I propose making the stdout results a little more descriptive, even if the ultimate result is written to a file.

Flatten follower scraping logic

FollowerCount.py uses nested try-excepts to check for the historical Twitter UIs.

try:
	result = soup.select(".ProfileNav-item--followers")[0]
	try:
		result = result.find("a")['title']
	except:
		result = result.find("a")['data-original-title']
except:
	try:
		result = soup.select(".js-mini-profile-stat")[-1]['title']
	except:
		try:
			result = soup.select(".stats li")[-1].find("strong")['title']
		except:
			try:
				result = soup.select(".stats li")[-1].find("strong").text
			except:
...

This nesting goes down to excessive levels but seems to work for the expected logic. However, there are some stylistic issues that would help the programmatic flow.

For example, PEP 20 states "Flat is better than nested." PEP 8 also recommends a maximum line length of 79 characters (72 for comments and docstrings), which is far exceeded due to the nested try-except scoping. PEP 8 also discourages bare excepts because of the implications described there.

try-excepts are the right paradigm to use here, per the Python recommendation of asking for forgiveness (exceptions) over permission (conditionals), i.e., EAFP. However, the implementation could be improved by making the code structure flatter, which should have positive effects on maintainability, among other benefits.
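One possible flat shape, offered as a sketch rather than a drop-in patch: put each layout-specific lookup in an ordered list and try them in turn, catching only the lookup errors these expressions can raise. The selectors are the ones quoted above; the list stops where the quoted snippet is elided. The names EXTRACTORS, LOOKUP_ERRORS, and extract_follower_count are hypothetical, and soup is assumed to be a BeautifulSoup object.

```python
def _nav_link(soup):
    return soup.select(".ProfileNav-item--followers")[0].find("a")


# Ordered as in the nested version: first match wins.
EXTRACTORS = [
    lambda soup: _nav_link(soup)["title"],
    lambda soup: _nav_link(soup)["data-original-title"],
    lambda soup: soup.select(".js-mini-profile-stat")[-1]["title"],
    lambda soup: soup.select(".stats li")[-1].find("strong")["title"],
    lambda soup: soup.select(".stats li")[-1].find("strong").text,
]

# Empty select() result, missing attribute, or find() returning None.
LOOKUP_ERRORS = (IndexError, KeyError, TypeError, AttributeError)


def extract_follower_count(soup, extractors=EXTRACTORS):
    """Return the first follower-count string found, or None."""
    for extract in extractors:
        try:
            return extract(soup)
        except LOOKUP_ERRORS:
            continue
    return None
```

Supporting a new page layout then means appending one entry to the list, and no line grows with the number of fallbacks.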

Failed to establish a new connection to memgator.cs.odu.edu

Running fch joebiden gives the following error, tested on both Ubuntu as well as MacOS:

Fetch Timemap: Error: http://twitter.com/joebiden   HTTPSConnectionPool(host='memgator.cs.odu.edu', port=443): Max retries exceeded with url: /timemap/cdxj/http://twitter.com/joebiden (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fc868335ee0>: Failed to establish a new connection: [Errno 61] Connection refused'))

TODO: File append

Currently, each run just rewrites the data in the CSV file below the previous data, including the headers. Change this so that the code keeps the old data and only looks for new data from the archive.
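A minimal sketch of one way to do this, assuming rows are dictionaries keyed by the CSV column names and that MementoDatetime uniquely identifies a row. The function name is hypothetical, and the sketch does not check that an existing file has the same columns.

```python
import csv
import os


def append_new_rows(path, rows, key="MementoDatetime"):
    """Append only rows whose key is not already in the CSV at path.

    Writes the header once, when the file is first created.
    Returns the number of rows added.
    """
    file_exists = os.path.exists(path) and os.path.getsize(path) > 0
    seen = set()
    if file_exists:
        with open(path, newline="") as handle:
            seen = {row[key] for row in csv.DictReader(handle)}
    new_rows = [row for row in rows if row[key] not in seen]
    if not new_rows:
        return 0
    with open(path, "a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(new_rows[0].keys()))
        if not file_exists:
            writer.writeheader()
        writer.writerows(new_rows)
    return len(new_rows)
```

The set of timestamps already on disk could also be used to skip downloading those mementos in the first place, which is where most of the time would be saved.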

CLI feedback no longer working in 1.0.11

In the most recent version of fch (1.0.11), the command line flags no longer appear to work correctly. This was not the case in the previous version (1.0.10). I first noticed this when installing from source to verify #21 but was also able to replicate via the pypi release.

❯ fch
zsh: command not found: fch
❯ pip install fch==1.0.11
❯ fch
Traceback (most recent call last):
  File "/usr/local/bin/fch", line 5, in <module>
    from fch.__main__ import main
  File "/usr/local/lib/python3.8/site-packages/fch/__main__.py", line 11, in <module>
    from fch.core.config.configreader import ConfigurationReader
ModuleNotFoundError: No module named 'fch.core.config.configreader'
❯ pip uninstall -y fch 
❯ fch
zsh: command not found: fch
❯ pip install fch==1.0.10
❯ fch
usage: fch [-h] [--st] [--et] [--freq] [-f] thandle
fch: error: the following arguments are required: thandle
❯

TODO Redirects

The tool is accepting redirects as new mementos.

Get data for specific dates

Hi! I was wondering if there's a way to specify the dates for which I want data. For instance, if I wanted to only get data for the year 2022, how would I do so? Thanks in advance!

TODO: Fix axis label for large numbers

Axis label gets overlapped when the numbers on the y axis get large enough. Remove label or account for size dynamically

Fetch Timemap: Error: http://twitter.com/joebiden [WinError 3] The system cannot find the path specified: '/tmp\\TimeMap'

I installed via the "Install from PyPI" instructions.

I keep getting this error, even when working in the \tmp\ directory.

Do I need to edit anything in the TimeMapDownloader class?
Thanks.
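The path in the error message ('/tmp\\TimeMap') suggests a POSIX-style /tmp prefix joined with a Windows separator, though that is an inference from the message rather than from the source. The README notes that PyPI versions from 1.0.14 onward are platform independent, so upgrading may be enough. For reference, a portable way to build such a directory in Python looks roughly like this; the function name is hypothetical.

```python
import os
import tempfile


def timemap_cache_dir(name="TimeMap"):
    """Return a per-platform temp directory for timemaps, creating it if needed."""
    # gettempdir() resolves to /tmp on most Linux systems and to the
    # user's Temp folder on Windows.
    path = os.path.join(tempfile.gettempdir(), name)
    os.makedirs(path, exist_ok=True)
    return path


print(timemap_cache_dir())
```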

Cannot generate graphs either in the container or locally on my machine.

Hi all,

First off, thanks for the awesome tool! It was instrumental in my being able to pull Twitter follower statistics for a year-end review for the Monero project.

However, I can't seem to get graphs working properly, no matter what I try.

Commands used:

docker container run --rm -it -v $PWD:/app -u $(id -u):$(id -g) --entrypoint /bin/bash oduwsdl/fch:2.0
I have no name!@054b386f00b1:/app$ ./fch/__main__.py --st=20200418000000 --et=20210418000000 monero | Rscript twitterFollowerCount.R
Error in `$<-.data.frame`(`*tmp*`, MementoTimestamp, value = numeric(0)) :
  replacement has 0 rows, data has 7
Calls: $<- -> $<-.data.frame
Execution halted

Rscript version:

Rscript --version
R scripting front-end version 3.5.2 (2018-12-20)

Any help would be greatly appreciated!

TODO Account for other languages

Getting a few pages in Bengali from the Internet Archive that break the R script because the numbers are not Arabic numerals.
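One option is to normalise the count on the Python side before it reaches the CSV. Python's int() accepts decimal digits from any Unicode script, so a small helper can drop separators and convert. This is a sketch with a hypothetical function name; it does not handle abbreviated counts such as "1.2K".

```python
def parse_follower_count(text):
    """Convert a follower-count string in any decimal script to an int."""
    # Keep only Unicode decimal digits; this drops commas and whitespace.
    digits = "".join(ch for ch in text if ch.isdecimal())
    if not digits:
        raise ValueError(f"no digits found in {text!r}")
    return int(digits)


# "\u09e7,\u09e8\u09e9\u09ea" is 1,234 written with Bengali digits.
print(parse_follower_count("\u09e7,\u09e8\u09e9\u09ea"))  # 1234
print(parse_follower_count("4,048,208"))  # 4048208
```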

False negative format detection by R script with relative data path

I installed fch via pip but wanted to generate plots, so also ran git clone https://github.com/oduwsdl/FollowerCountHistory in /tmp/.

While my current working directory is /tmp/, I ran fch machawk1 > followers.csv. This created /tmp/followers.csv.

I then moved into the source directory using cd FollowerCountHistory/, ran Rscript twitterFollowerCount.R ../followers.csv, and received an error message with no plots generated:

[1] "Unsupported file type"
Warning message:
In if (ext == "csv") { :
  the condition has length > 1 and only the first element will be used

However, running the same command with an absolute path to the CSV file works, i.e., Rscript twitterFollowerCount.R /tmp/followers.csv generates plots without an error. Further, moving the CSV file into the source directory (mv ../followers.csv ./) and running Rscript twitterFollowerCount.R followers.csv works but adding the relative part of the data path, Rscript twitterFollowerCount.R ./followers.csv causes the same above error.

This is likely an issue with the Rscript trying to detect the file type and choking on anything but the absolute path or the data file in the same directory.

FollowerCountHistory from current master (e918af7)
fch 1.0.10 from pypi. (via pip freeze, as there is no documented -v flag, #21 )
Python 3.8.5, macOS 10.15.6
R scripting front-end version 4.0.2 (2020-06-22)

Provide a way for fch to report the current version of the tool that is installed

In testing #20, I found that fch does not have a -v or --version command line flag for the tool to self-report the installed version number. Doing so is typical of command-line tools, so I believe fch should exhibit this behavior.

An example to accomplish this using argparse is available here.
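For illustration, argparse's built-in "version" action prints a version string and exits. The sketch below uses a placeholder version constant and a hypothetical build_parser function; how fch actually stores its version number is not assumed here.

```python
import argparse

__version__ = "0.0.0"  # placeholder; the real value would come from the package


def build_parser():
    parser = argparse.ArgumentParser(
        prog="fch", description="Follower Count History (fch)"
    )
    parser.add_argument("thandle", help="Enter a Twitter handle/ URL")
    # The "version" action prints the string and exits with status 0.
    parser.add_argument(
        "-v", "--version", action="version",
        version=f"%(prog)s {__version__}",
        help="Report the version of fch",
    )
    return parser


args = build_parser().parse_args(["joebiden"])
print(args.thandle)  # joebiden
```

With this in place, `fch --version` prints the program name followed by the version, even when the positional handle is omitted.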

Remove push-to-archive feature, use in tool that imports FollowerCountHistory as a module

The README provides options to push the last memento to archives. This feature seems beyond the objective of this tool, however useful.

FollowerCountHistory ought to be more functionally cohesive. I am suggesting we remove this option and create another tool that imports FollowerCountHistory with this expanded functionality.

In doing this, FollowerCountHistory could also be adapted to be a Python module, uploaded to pip, and used by others' tools without the functional scope creep.

parse_timemap error

Good morning,

I just tried to run a basic query:

fch joebiden

Such command results in the following error:

parse_timemap: 'NoneType' object has no attribute 'groupdict'
'NoneType' object is not iterable

Am I missing something? All dependencies are installed.

thanks in advance for any help!

TODO: Change R graph

Fix the axis and labels to make a more attractive graph.