Linux and Mac OS |
Windows OS |
Module Version |
---|---|---|
gdeltPyR is a Python-based framework to access and analyze Global Database of Events, Language, and Tone (GDELT) 1.0 or 2.0 data for analysis in Python Pandas or R dataframes. A user can enter a date, date range (two strings), or individual dates and return a tidy data set ready for scientific or data-driven exploration.
gdeltPyR
retrieves Global Database of Events, Language, and Tone (GDELT) data (version 1.0 or version 2.0) via parallel HTTP GET requests and is an alternative to accessing GDELT data via Google BigQuery . Therefore, the more cores you have, the less time it takes to pull more data. Moreover, the more RAM you have, the more data you can pull. And finally, for RAM-limited workflows, create a pipeline that pulls data, writes to disk, and flushes.
The GDELT Project advertises as the largest, most comprehensive, and highest resolution open database of human society ever created. It monitors print, broadcast, and web news media in over 100 languages from across every country in the world to keep continually updated on breaking developments anywhere on the planet. Its historical archives stretch back to January 1, 1979 and accesses the world’s breaking events and reaction in near-realtime as both the GDELT Event and Global Knowledge Graph update every 15 minutes. Visit the GDELT website to learn more about the project.
- Added geodataframe output; can be easily converted into a shapefile or choropleth visualization.
- Added continuous integration testing for Windows, OSX, and Linux (Ubuntu)
- Normalized columns output; export data with SQL ready columns (no special characters, all lowercase)
- Choosing between the native-english or translated-to-english datasets from GDELT v2.
import gdelt
gd= gdelt.gdelt(versin=2)
events = gd.Search(['2017 May 23'],table='events',output='gpd',normcols=True,coverage=False)
- Query Google's BigQuery directly from
gdeltPyR
using thepandas.io.gbq
interface; requires authentication and Google Compute account - Adding a query for GDELT Visual Knowledge Graph (VGKG)
- Adding a query for GDELT American Television Global Knowledge Graph (TV-GKG)
gdeltPyR
can be installed via pip
pip install gdelt
GDELT 1.0 Queries
import gdelt
# Version 1 queries
gd1 = gdelt.gdelt(version=1)
# pull single day, gkg table
results= gd1.Search('2016 Nov 01',table='gkg')
print(len(results))
# pull events table, range, output to json format
results = gd1.Search(['2016 Oct 31','2016 Nov 2'],coverage=True,table='events')
print(len(results))
GDELT 2.0 Queries
# Version 2 queries
gd2 = gdelt.gdelt(version=2)
# Single 15 minute interval pull, output to json format with mentions table
results = gd2.Search('2016 Nov 1',table='mentions',output='json')
print(len(results))
# Full day pull, output to pandas dataframe, events table
results = gd2.Search(['2016 11 01'],table='events',coverage=True)
print(len(results))
gdeltPyR
can output results directly into several formats which include:
- pandas dataframe
- csv
- json
- geopandas dataframe (as of version 0.1.10)
- GeoJSON (coming soon version 0.1.11)
- Shapefile (coming soon version 0.1.11)
Performance on 4 core, MacOS Sierra 10.12 with 16GB of RAM:
- 900,000 by 61 (rows x columns) pandas dataframe returned in 36 seconds
- data is a merged pandas dataframe of GDELT 2.0 events database data
gdeltPyR
provides access to 1.0 and 2.0 data. Four basic parameters guide the query syntax:
Name | Description | Input Possibilities/Examples |
---|---|---|
version | (integer) - Selects the version of GDELT data to query; defaults to version 2. | 1 or 2 |
date | (string or list of strings) - Dates to query | "2016 10 23" or "2016 Oct 23" |
coverage | (bool) - For GDELT 2.0, pulls every 15 minute interval in the dates passed in the 'date' parameter. Default coverage is False or None. gdeltPyR will pull the latest 15 minute interval for the current day or the last 15 minute interval for a historic day. |
True or False or None |
translation | (bool) - For GDELT 2.0, if the english or translated-to-english dataset should be downloaded | True or False |
tables | (string) - The specific GDELT table to pull. The default table is the 'events' table. See the GDELT documentation page for more information | 'events' or 'mentions' or 'gkg' |
output | (string) - The output type for the results | 'json' or 'csv' or 'gpd' |
These parameter values can be mixed and matched to return the data you want. the coverage parameter is used with GDELT version 2; when set to "True", the gdeltPyR will query all available 15 minute intervals for the dates passed. For the current day, the query will return the most recent 15 minute interval. |
Facts
- GDELT 1.0 is a daily dataset
- 1.0 only has 'events' and 'gkg' tables
- 1.0 posts the previous day's data at 6AM EST of next day (i.e. Monday's data will be available 6AM Tuesday EST)
- GDELT 2.0 is updated every 15 minutes
- 2.0 has 'events','gkg', and 'mentions' tables
- 2.0 has a distinction between native english and translated-to-english news
- 2.0 has more columns
- None
- Query Google BigQuery copy of GDELT directly from
gdeltPyR
; will require project ID and authentication usingpandas gbq
inteface. - Adding a query for GDELT Visual Knowledge Graph (VGKG)
- Adding a query for GDELT American Television Global Knowledge Graph (TV-GKG)