pbinkley / twarc-report


Data conversions and examples for generating reports from twarc collections using tools such as D3.js

License: Creative Commons Zero v1.0 Universal

JavaScript 15.62% Python 44.84% HTML 36.41% Shell 3.14%

python social

twarc-report

Requirements
Getting Started
Recommended Directory Structure
Harvest
Profile
D3 Visualizations
Exploring D3 Examples
Adding Scripts
License

These utilities accept a Twitter json file (as fetched by twarc), analyze it in various ways, and output a json or csv file. The initial purpose is to feed data into D3.js for various visualizations, but the intention is to make the outputs generic enough to serve other uses as well. Each utility has a D3 example template, which it can use to generate a self-contained html file. It can also generate csv or json output, and there is a worked example of how to use csv in a pre-existing D3 chart.

The d3graph.py utility was originally added to the twarc repo as directed.py but is moving here for consistency.

Requirements

All requirements may be installed with pip install -r requirements.txt

dateutil - python-dateutil
pytz - pip install pytz
tzlocal - pip install tzlocal
pysparklines - pip install pysparklines
requests_oauthlib - pip install requests_oauthlib

Install twarc according to its instructions, i.e. with pip install twarc. Run twarc.py once so that it can ask for your access token etc. (see twarc's readme). Make sure that twarc-archive.py is on the system path.

Getting Started

clone twarc-report to a local directory with your favorite Git client
install the requirements and populate the twarc submodule, as above
create a projects subdirectory under twarc-report
create a project directory under projects, named appropriately
in the project directory create metadata.json and fill in the search you want to track
in twarc-report, run ./harvest.py projects/[yourproject] to harvest your tweets (this may take some time - hours or days for very large searches)
run ./reportprofile.py projects/[yourproject] to see a summary of your harvest
run other scripts to generate various visualizations (see below)
run ./harvest.py projects/[yourproject] whenever you want to update your harvest.

Note that only tweets from the last 7 days or so are available from Twitter at any given time, so be sure to update your harvest accordingly to avoid gaps.

Recommended Directory Structure

twarc-report/ # local clone
    projects/
        assets/ # copy of twarc-report/assets/
        projectA/
            data/ # created by harvest.py
                tweets/ # populated with tweet*.json files by harvest.py
            metadata.json
            timeline.html # generated by a twarc-report script
            ...
        projectB/
        ...

Metadata about the project, including the search query, is kept in metadata.json. The metadata.json file is created by the user and contains metadata for the harvest. It should be in this form:

{"search": "#ferguson",
"title": "Ferguson Tweets",
"creator": "Peter Binkley"}

(Currently only the search value is used but other metadata fields will be used to populate HTML output in future releases.)
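As a minimal sketch of how a script might read this file (the helper name load_search is hypothetical and is not taken from harvest.py), the search value can be pulled out with the standard json module:

```python
import json
import os


def load_search(project_dir):
    """Return the "search" value from <project_dir>/metadata.json.

    Hypothetical helper for illustration; harvest.py may read the
    file differently.
    """
    path = os.path.join(project_dir, "metadata.json")
    with open(path, encoding="utf-8") as handle:
        metadata = json.load(handle)
    # Only "search" is used at present; the other fields are optional.
    return metadata["search"]
```

A missing "search" key raises KeyError here, which is probably the behaviour you want for a project that cannot be harvested without a query.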

The harvested tweets and other source data are stored in the data subdirectory, with the tweets going into the tweets directory. These directories are created by harvest.py if they don't exist.

Generated HTML files use relative paths like ../assets/d3.v3.min.js to call shared libraries from the assets directory. They can be created in the project directories (ProjectA etc.). This allows you to publish the output by syncing the project and assets directories to a web server while excluding the data subdirectory. You can also run Python's SimpleHTTPServer in the projects directory to load examples you've created in the project directories (SimpleHTTPServer is the Python 2 module; on Python 3 the equivalent is http.server):

python -m SimpleHTTPServer 8000

And then visit e.g. http://localhost:8000/ProjectA/projectA-timebar.html.

Harvest

The script harvest.py will use twarc's twarc-archive.py to start or update a harvest using a given search and stored in a given directory. The directory path is passed as the only parameter:

./harvest.py projects/ProjectA

The search is read from the metadata.json file, and tweets are stored in data/tweets.

Profile

Running reportprofile.py on a tweet collection with the flag -o text will generate a summary profile of the collection, with some basic stats (number of tweets, retweets, users, etc.) and some possibly interesting sparklines.

Count:        25100
Users:         5779
User percentiles: █▂▁▁▁▁▁▁▁▁
                  [62, 12, 6, 5, 3, 2, 2, 2, 2, 2]

That indicates that the top 10 percent of users accounted for 62% of the tweets, while the bottom 10% accounted for 2% of the tweets. This will give a quick sense of whether the collection is dominated by a few voices or has broad participation. The profile also includes the top 10 users and top 10 shared urls, with similar sparklines.
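To make the percentile figures concrete, here is a rough sketch of one way to compute them from a list of per-user tweet counts. The function name decile_shares and the bucketing by integer division are assumptions for illustration; the calculation in profiler.py may differ in detail.

```python
def decile_shares(tweets_per_user):
    """Percentage of all tweets contributed by each tenth of the users,
    most active tenth first. Illustrative only."""
    counts = sorted(tweets_per_user, reverse=True)
    total = sum(counts)
    if total == 0:
        return [0] * 10
    n = len(counts)
    shares = []
    for i in range(10):
        # Slice out the i-th tenth of the ranked users.
        start = i * n // 10
        end = (i + 1) * n // 10
        shares.append(round(100 * sum(counts[start:end]) / total))
    return shares
```

Because each share is rounded independently, the ten values need not sum to exactly 100, which is also true of the sample output above.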

Note: the sparklines are generated by pysparklines, using Unicode block characters. If they have an uneven baseline, it's the fault of the font. On a Mac, I find that Menlo Regular gives a good presentation in the terminal.

D3 Visualizations

Some utilities to generate D3.js visualizations of aspects of a collection of tweets are provided. Use "--output=json" or "--output=csv" to output the data for use with other D3 examples, or "--help" for other options.

d3graph.py

A directed graph of mentions, retweets or replies, in which nodes are users and arrows point from the original user to the user who mentions, retweets or replies to them:

% d3graph.py --mode mentions projects/nasa > projects/nasa/nasa-directed-mentions.html
% d3graph.py --mode retweets projects/nasa > projects/nasa/nasa-directed-retweets.html
% d3graph.py --mode replies projects/nasa > projects/nasa/nasa-directed-replies.html

d3cotag.py

An undirected graph of co-occurring hashtags:

% d3cotag.py projects/nasa > projects/nasa/nasa-cotags.html

A threshold can be specified with "-t": hashtags whose number of occurrences falls below this will not be linked. Instead, if "-k" is set, they will be replaced with the pseudo-hashtag "-OTHER". Hashtags can be excluded with "-e" (takes a comma-delimited list). If the tweets were harvested by a search for a single hashtag then it's a good idea to exclude that tag, since every other tag will link to it.
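The interaction of the threshold, keep and exclude options can be sketched as follows. This is an illustration of the behaviour described above, not the code of d3cotag.py; the function name cotag_links and its parameter names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cotag_links(tweets_tags, threshold=1, keep=False, exclude=()):
    """Count co-occurring hashtag pairs.

    tweets_tags: iterable of hashtag lists, one list per tweet.
    Tags occurring fewer than `threshold` times are dropped, or
    folded into "-OTHER" when `keep` is true.
    """
    excluded = set(exclude)
    cleaned = [sorted(set(tags) - excluded) for tags in tweets_tags]
    counts = Counter(tag for tags in cleaned for tag in tags)

    links = Counter()
    for tags in cleaned:
        kept = set()
        for tag in tags:
            if counts[tag] >= threshold:
                kept.add(tag)
            elif keep:
                kept.add("-OTHER")
        # Sorting makes each undirected pair appear in one canonical order.
        for pair in combinations(sorted(kept), 2):
            links[pair] += 1
    return links
```

Excluding the searched-for hashtag removes the hub that would otherwise link to every other node.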

d3timebar.py

A bar chart timeline with arbitrary intervals, here five minutes:

% d3times.py -a -t local -i 5M projects/nasa > projects/nasa/nasa-timebargraph.html

Examples

The output timezone is specified by "-t"; the interval is specified by "-i", using the standard abbreviations: seconds = S, minutes = M, hours = H, days = d, months = m, years = Y. The example above uses five-minute intervals. Output may be aggregated using "-a": each row has a time value and a count. Note that if you are generating the html example, you must use "-a".
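As a rough sketch of what aggregation at a fixed interval involves (assuming timezone-aware datetime values; the function name aggregate is hypothetical and this is not the d3times.py implementation), each timestamp is floored to the start of its interval and the intervals are counted:

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def aggregate(timestamps, minutes=5):
    """Count timezone-aware timestamps per bucket of `minutes` minutes."""
    width = timedelta(minutes=minutes)
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    counts = Counter()
    for ts in timestamps:
        # Floor each timestamp to the start of its bucket.
        bucket = epoch + ((ts - epoch) // width) * width
        counts[bucket] += 1
    # One (time, count) row per non-empty bucket, in time order.
    return sorted(counts.items())
```

Note that this sketch only emits buckets that contain at least one tweet; a chart that needs explicit zero rows would have to fill the gaps.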

d3wordcloud.py

An animated wordcloud, in which words are added and removed according to changes in frequency over time.

% d3wordcloud.py -t local -i 1H projects/nasa > projects/nasa/nasa-wordcloud.html

Example

The optional "-t" controls timezone and "-i" controls interval, as in d3timebar.py. Start and end timestamps may be set with "-s" and "-e".

This script calls a fork of Jason Davies' d3-cloud project. The forked version attempts to keep the carried-over words in transitions close to their previous position.

Exploring D3 Examples

The json and csv outputs can be used to view your data in D3 example visualizations with minimal fuss. There are many, many examples to be explored; Mike Bostock's Gallery is a good place to start. Here's a worked example, using Bostock's Zoomable Timeline Area Chart. It assumes no knowledge of D3.

First, look at the data input. In line 137 this example loads a csv file

d3.csv("flights-departed.csv", function(data) {

The csv file looks like this:

date,value
1988-01-01,12681
...

We can easily generate a csv file that matches that format:

% ./d3times.py -a -i 1d -o csv

(I.e. aggregate, one-day interval, output csv). We then just need to edit the output to make the column headers match the original csv, i.e. change them to "date,value".
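If you would rather not edit the header by hand, a few lines of Python can do it. This sketch assumes the csv has exactly two columns and simply replaces whatever header row is present; the helper name rename_header is hypothetical.

```python
import csv
import io


def rename_header(csv_text, new_header=("date", "value")):
    """Return csv_text with its first row replaced by new_header."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return csv_text
    rows[0] = list(new_header)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()
```

The data rows pass through unchanged, so the same helper works for any interval.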

We also need to check the way the example loads scripts and css assets, especially the D3 library. In this case it expects a local copy:

<script type="text/javascript" src="d3/d3.js"></script>
<script type="text/javascript" src="d3/d3.csv.js"></script>
<script type="text/javascript" src="d3/d3.time.js"></script>
<link type="text/css" rel="stylesheet" href="style.css"/>

Either change those links to point to the original location, or save a local copy. (Note that if you're going to put your example online you'll want local copies of scripts, since the same-origin policy will prevent them from being loaded from the source).

Once you've matched your data to the example and made sure it can load the D3.js library, the example may work. In this case it doesn't - it shows an empty chart. The title "U.S. Commercial Flights, 1999-2001" and the horizontal scale explain why: it expects dates within a certain (pre-Twitter) range, and the x domain is hard-coded accordingly. The setting is easy to find, in line 146:

x.domain([new Date(1999, 0, 1), new Date(2003, 0, 0)]);

Change those dates to include the date range of your data, and the example should work. Don't worry about matching your dates closely: the chart is zoomable, after all. Alternatively, you could borrow a snippet from the template timebar.html to set the domain to match the earliest and latest dates in your data:

x.domain([
    d3.min(values, function(d) {return d.name}), 
    d3.max(values, function(d) {return d.name})
 ]);

A typical twarc harvest gets you a few days' worth of tweets, so the day-level display of this example probably isn't very interesting. We're not bound by the time format of the example, however. We can see it in line 63:

parse = d3.time.format("%Y-%m-%d").parse,

We can change that to parse at the minute interval: "%Y-%m-%d %H:%M", and generate our csv at the same interval with "-i 1M". With those changes we can zoom in until bars represent a minute's worth of tweets.

This example doesn't work perfectly: I see some odd artifacts around the bottom of the chart, as if the baseline were slightly above the x axis and small values are presented as negative. And it doesn't render in Chrome at all (Firefox and Safari are fine). The example is from 2011 and uses an older version of the D3 library, and with some tinkering it could probably be updated and made functional. It serves to demonstrate, though, that only small changes and no knowledge of the complexities of D3 are needed to fit your data into an existing D3 example.

Adding Scripts

The heart of twarc-report is the Profiler class in profiler.py. The scripts pass json records from the twarc harvests to this class, and it tabulates some basic properties: number of tweets and authors, earliest and latest timestamp, etc. The scripts create their own profilers that inherit from this class and that process the extra fields etc. needed by the script. To add a new script, start by working out its profiler class to collect the data it needs from each tweet in the process() method, and to organize the output in the report() method.

The various output formats are generated by functions in d3output.py.
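The pattern might look something like the sketch below. The Profiler class here is a simplified stand-in written for this example, not the real class from profiler.py, whose constructor arguments and report structure may differ; LanguageProfiler is likewise hypothetical.

```python
from collections import Counter


class Profiler:
    """Stand-in for the base class in profiler.py (assumed shape)."""

    def __init__(self):
        self.count = 0

    def process(self, tweet):
        # Called once per tweet record.
        self.count += 1

    def report(self):
        return {"count": self.count}


class LanguageProfiler(Profiler):
    """Hypothetical profiler that tallies each tweet's "lang" field."""

    def __init__(self):
        super().__init__()
        self.languages = Counter()

    def process(self, tweet):
        super().process(tweet)  # keep the basic tabulation
        self.languages[tweet.get("lang", "und")] += 1

    def report(self):
        data = super().report()
        data["languages"] = dict(self.languages)
        return data
```

A script would then create one LanguageProfiler, call process() for each json record in the harvest, and pass the result of report() to an output function.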

License


twarc-report's Issues

d3.v3.min.js not found in wordcloud example

Timebar example works perfectly, and makes me really excited to explore this more.

But in https://github.com/pbinkley/twarc-report#d3wordcloudpy
clicking the example --> https://www.wallandbinkley.com/twarc/c4l15/animatedwordcloud.html

results in a

failed to load resource: the server responded with a status of 404 (Not Found)

for the

https://www.wallandbinkley.com/twarc/d3.v3.min.js

Populating the twarc subdirectory

When I try the second command:
git submodule update
I get:
Permission denied (publickey).
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
Clone of 'git@github.com:edsu/twarc.git' into submodule path 'twarc' failed

Project metadata

As a follower of an event that is being live-tweeted, I want to have a project directory where I will update a harvest periodically with a cronjob using twarc/utils/archive.py, with project metadata such as the twarc query, project title and creator, etc., all stored in a json file, so that the same cron job can generate twarc-report outputs that include the project metadata for clarity.

I'm thinking of json like this:

{"twarcquery": "#code4lib OR #c4l15 OR #code4arc", 
"title": "Code4lib Conference, Portland OR, 10-12 Feb. 2015",
"creator": "Peter Binkley"}

And have a module that loads it with:

import json

# The with block closes the file automatically on exit.
with open("metadata.json") as json_data:
    project_metadata = json.load(json_data)
title = project_metadata["title"]

And finally, use this in a script that embeds archive.py and runs the updates and the twarc-report outputs.

how to convert twarc data to full texts?

Hi everyone,
After a long time of struggling with twarc, I finally extracted the tweets for some Twitter hashtags. My question now is: how can I convert the data that I got to full text? All I can see now is just numbers. (the image is attached)

PS: I followed the steps here: https://github.com/DocNow/twarc, my file was saved as json, and I opened it in Excel.
Another PS: I am not a programmer nor developer :)

Add universal cron function

Add a script that can be run from a single cron job, and that will harvest all the active projects (based on start/end dates in metadata.json), and generate outputs.

Add jekyll integration

Generate outputs into a Jekyll site. Develop Jekyll plugins:

versioning of outputs at given intervals (say all the tweets from a calendar day). Provide menus of versions.
data-driven pages for images, link, wall, etc. Maintain local cache of thumbnails of images and links.
build and deploy jekyll site after each harvest

Replace old geo element with place

Locations are now contained in place element:

 "place": {
    "full_name": "Toronto, Ontario",
    "url": "https://api.twitter.com/1.1/geo/id/3797791ff9c0e4c6.json",
    "country": "Canada",
    "place_type": "city",
    "bounding_box": {
      "type": "Polygon",
      "coordinates": [
        [
          [
            -79.639319,
            43.403221
          ],
          [
            -78.90582,
            43.403221
          ],
          [
            -78.90582,
            43.855401
          ],
          [
            -79.639319,
            43.855401
          ]
        ]
      ]
    },
    "contained_within": [],
    "country_code": "CA",
    "attributes": {},
    "id": "3797791ff9c0e4c6",
    "name": "Toronto"
  },
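As an illustration of working with this element, the sketch below reduces a tweet's place bounding box to a single approximate point by averaging its corners. The function name place_centre is hypothetical, and averaging corners is only one simple choice that ignores map projection.

```python
def place_centre(tweet):
    """Return (longitude, latitude) at the centre of the tweet's
    place bounding box, or None if the tweet has no place."""
    place = tweet.get("place")
    if not place:
        return None
    # The polygon is a list of rings; the first ring holds the corners,
    # each given as [longitude, latitude].
    ring = place["bounding_box"]["coordinates"][0]
    lons = [point[0] for point in ring]
    lats = [point[1] for point in ring]
    return (sum(lons) / len(lons), sum(lats) / len(lats))
```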

harvest.py

line 41: module archive has no attribute main.

import archive
archive.main() <--- causing problems, please advise.

some

Hi, I frequently redirect reports to text files on my Debian and OS X machines, to keep track of ongoing twarc harvests. But in the first if/else block, four lines lack .encode("utf-8") when calling sparkline.sparkify() to print percentiles.
Adding .encode("utf-8"), as you do in all other calls, at lines 25, 29, 33 and 37 solves the following error:

reportprofile.py tweets.json > reporttweets.txt
Traceback (most recent call last):
  File "../twarc-report/reportprofile.py", line 25, in <module>
    print "User percentiles: " + sparkline.sparkify(data["userspercentiles"])
UnicodeEncodeError: 'ascii' codec can't encode characters in position 18-27: ordinal not in range(128)

Sorry if I'm not doing a pull/request, but I'm not sure it happens to all users.

Update use of twarc

Update the way native twarc functions are called, to use the new twarc structure.

Repackage scripts as subcommands

Refactor to imitate the structure of twarc, with a single executable twarc-report that takes subcommands to specify the desired script. Enable installation by pip install.

clarification of harvest.py's relationship to twarc submodule

Hello,

Thank you so much for extending the twarc library!

This isn't an issue in the classic sense so I apologize for using this mechanism.

I was wondering if you could say a bit more about the relationship between harvest.py and twarc (the submodule specified, not the most current version).

More specifically, from looking at the code in harvest.py which eventually calls upon twarc's archive.py, it does not seem that there is a mechanism for including the API keys. The version of twarc that twarc-report uses called upon one to enter them as:

twarc.py --consumer_key foo --consumer_secret bar --access_token baz --access_token_secret bez --search ferguson

How is this handled when using harvest.py?

Thanks for your help!

Benjamin

d3cotags.py -t option causes error in Python 3: "RuntimeError: dictionary changed size during iteration"

Any -t option greater than 1 causes an error, I think due to differences between Python 2 and 3 in iterating through the keys of a dict. It seems to work fine in Python 2.7.
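For context, this error generally arises in Python 3 when entries are deleted from a dict while looping over its live keys view; in Python 2, keys() returned a separate list, so the same loop was safe. The sketch below shows the usual remedy of iterating over a snapshot. It is a generic illustration with a hypothetical function name, not the code from the script.

```python
def drop_rare(counts, threshold):
    """Delete entries whose count is below threshold, in place."""
    # list(counts) copies the keys first, so deleting inside the loop
    # does not disturb the iteration (works in Python 2 and 3).
    for tag in list(counts):
        if counts[tag] < threshold:
            del counts[tag]
    return counts
```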

Just FYI, and thanks for the tools!

Add basic management suite

Add script to create new projects (specify the query, the title, etc., and it will create the project directory, populate it with the necessary files such as metadata.json).
Add script to generate all outputs for a given project (both twarc-report and native twarc) into an output directory, with an index.html

Updated pysparklines dependency does not support Python 2, need to set version in requirements.txt

I see your requirements specifies pysparklines but no version and I just wanted to let you know that the recently released 1.0 does not support Python versions before 3, so could cause your project issues. I would change your requirements.txt to use pysparklines==0.9 to resolve this problem until your project is fully Python 3 compatible.

archive.py

Hello,

When I try to execute harvest.py, I receive the following error:

Traceback (most recent call last):
  File "./harvest.py", line 41, in <module>
    archive.main()
  File "twarc/utils/archive.py", line 76, in main
    sys.exit(1)
NameError: global name 'sys' is not defined

Am I making a mistake?

Error when trying to run harvest.py

I receive the following error message on my first attempt to run harvest.py. I have tried several different solutions to ensure that twarc-archive.py is in my PATH, but continue to receive this error message.

vagelos-ve536-0866:twarc-report-master Research$ ./harvest.py projects/projectA
/Users/Research/anaconda/bin/twarc-archive.py
/Library/Frameworks/Python.framework/Versions/2.7/bin/twarc-archive.py
/Library/Frameworks/Python.framework/Versions/3.4/bin/twarc-archive.py
/opt/local/bin/twarc-archive.py
/opt/local/sbin/twarc-archive.py
/usr/local/bin/twarc-archive.py
Cannot run twarc-archive.py

Manage project config with git

Version the metadata.json in git to track changes in the query etc. (e.g. when you add an extra hashtag after you've been harvesting a project for a while). Associate the commit id with each harvest.
