emergent-analytics / workstreams

This repository publishes notebooks created as part of an analysis of the impact of the 2020 COVID-19 crisis on the economy, performed by a team of data scientists from IBM and Rolls-Royce for Emergent Alliance.

License: MIT License

covid-19 covid19 covid coronavirus coronavirus-analysis geospatial geo python-multiprocessing nlp risk region economics economics-models airbnb notam notams sentiment-analysis sentiment jupyter-notebooks multithreading

workstreams's Introduction

Workstreams

The Regional Risk-Pulse Index project is organised into three broad thematic categories or workstreams:

  • WS1 - Emergent Risk Index. When is it the right time for local authorities to intervene based on the health status of a region?
  • WS2 - Emergent Pulse. How are people feeling and behaving as a result of the measures against Covid-19?
  • WS3 - Economic Scenario Modelling. How do shocks to specific industrial sectors (restaurant shutdowns, air travel restrictions, etc.) affect the whole economy?

Under these thematic categories, we have developed a wide array of interconnected analyses, which are outlined below.

WS1 - Emergent Risk Index

Health Risk Index

A composite regional risk index that takes into account the infection rate, the percentage of vulnerable population, and many other factors to assess the ‘riskiness’ of a specific region. How can we help local authorities decide when the right time to intervene is?

Geospatial

Geospatial knowledge of the world in terms of neighbourhood relationships at country and sub-country level. It includes a Jupyter notebook that executes parallel threads without the need for a running ipyparallel instance; see the computation of neighbours at the bottom.
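As a rough illustration of running a neighbour computation in parallel threads without an ipyparallel cluster, the sketch below uses only the standard library's `concurrent.futures`. It is not the notebook's code: the `BORDERS` data, the `neighbours_of` helper, and the "shared border segment" rule are all hypothetical stand-ins for the real geospatial test.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy input: each region is mapped to the ids of the border
# segments it touches. Real data would come from geospatial shapes.
BORDERS = {
    "A": {1, 2},
    "B": {2, 3},
    "C": {3},
    "D": {4},
}


def neighbours_of(region):
    """Return (region, sorted list of regions sharing a border segment)."""
    own = BORDERS[region]
    return region, sorted(
        other
        for other, segments in BORDERS.items()
        if other != region and own & segments
    )


def compute_neighbours(regions, max_workers=4):
    # The thread pool lives inside the notebook process, so no separate
    # ipyparallel instance has to be started beforehand.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(neighbours_of, regions))


neighbours = compute_neighbours(list(BORDERS))
```

Each region is handled by an independent task, so the same pattern should scale to country and sub-country levels as long as the per-region work does not mutate shared state.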

Labelling tool

A tagging tool for humans to annotate infection waves.

Cookiecutter Preview

WS2 - Emergent Pulse

The aim of this workstream is to create a series of indices that describe the status of a region, for example: What is the sentiment of the population of that region? How are they behaving? How are the news and social media changing? How is the tourism industry being affected? This image gives an overview of some of the topics covered in this workstream.

Workstream 2 topics

News Analysis

The News Analysis folder contains the analysis done on a corpus of news articles to understand the trends in frequency and sentiment of articles over time, and to perform binary classification on a topic of interest. Please note that this folder contains only code; the news data need to be provided by the user. The following provides a brief overview of the different notebooks available in the folder:

  • ws2_1_data_preparation.ipynb - preprocessing of textual data in English.
  • ws2_1_2_data_preparation_language.ipynb - preprocessing of textual data, German and French languages are added.
  • ws2_2_topic_modelling.ipynb - topic modeling of articles.
  • ws2_3_sentiment_analysis.ipynb - sentiment analysis of articles.
  • ws2_4_heatmaps.ipynb - creating heatmaps of frequency and sentiment of articles over time.
  • ws2_5_text_classification.ipynb - binary classification pipeline.

The meltwater folder is a Python package that provides a client to access the Export APIs from Meltwater: https://developer.meltwater.com. Note that we do not provide an interface to all the endpoints exposed by the API. We only provide an interface to two groups of endpoints: the "searches" endpoints and the "One-time export" endpoints. However, the client can easily be extended to include the other groups of endpoints.

Social Media Analysis

Stemming from the News Analysis and part of our collaboration with the NHS in Nottinghamshire, here we use data from Twitter to look at the perception of vaccines and lockdown measures in the Nottingham and Liverpool areas, as well as mental wellness. Twitter data allows us to monitor what type of information around Covid-19 is being shared in a local area. Some of the work we’ve done so far includes:

  • Analyse the content of tweets. We can infer the sentiment and emotions associated with a body of text. Based on its content, we can also group it with similar tweets that share a common “topic” using Topic Modelling techniques.
  • Identify what type of information is more popular in an area (viral tweets).
  • Identify the key spreaders of such information (influencers).

Mobility and Tourism

Here we use NOTAMS data to extract information on international travel restrictions to and from several countries. We also use Airbnb data to study how travelling patterns changed in the UK after the onset of the pandemic.

NOTAMS

The airport_restrictions folder contains the analysis done on NOTAM data and travel restriction data from Humanitarian exchange to extract quarantine and country restrictions. Please note that this folder contains only code; the NOTAM data will have to be downloaded manually. The following provides a brief overview of the different notebooks available in the folder:

  • ws2_snr_NOTAMs_1_data_preparation.ipynb - Basic preprocessing of NOTAM - removing special characters, expanding abbreviations, removing stop words.
  • ws2_snr_NOTAMs_2_topic_modeling.ipynb - Identification of different topics present in the NOTAM.
  • ws2_snr_notams_3_quarantine_text.ipynb - Extraction of quarantine duration from NOTAM using Named Entity Recognition (NER) and regex.
  • ws2_snr_NOTAMs_1_data_preparation_mulitple files.ipynb - Similar to the first notebook on data preparation. Iteration of data preprocessing to multiple files.
  • ws2_snr_NOTAMs_country_level_restrictions_timeline.ipynb - Information extraction of restriction on foreigners using NER, Part of speech tagging and dependency parser.
  • ws2_snr_humdata_country_level_restriction_timeline.ipynb - Information extraction of restriction on foreigners using the same set of rules used in the above notebook on a different data source (travel restriction data from humanitarian exchange)
  • ws2_snr_validation_information_extraction_rules.ipynb - validation of the information extraction rules based on the results generated using NOTAM and travel restriction data from humanitarian exchange
  • ws2_snr_travel_advisory_data_download.ipynb - download of the travel risk index from the travel-advisory website, which rates each country based on the travel advisories issued by different foreign countries.
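
The regex part of the quarantine-duration extraction can be sketched as follows. This is only a minimal illustration using the standard `re` module: the pattern, the `quarantine_days` helper, and the sample sentences are assumptions made for this example and are not the rules used in ws2_snr_notams_3_quarantine_text.ipynb, which also relies on Named Entity Recognition.

```python
import re

# Illustrative pattern (an assumption, not the notebook's rule set). It covers
# phrases such as "14 DAYS QUARANTINE", "14-DAY SELF-QUARANTINE" and
# "QUARANTINE FOR 10 DAYS".
DURATION = re.compile(
    r"(?:(\d+)[\s-]*DAYS?\s+(?:OF\s+)?(?:SELF[- ])?QUARANTINE"
    r"|QUARANTINE\s+(?:PERIOD\s+)?(?:OF|FOR)\s+(\d+)\s+DAYS?)",
    re.IGNORECASE,
)


def quarantine_days(text):
    """Return the quarantine duration in days, or None if none is stated."""
    match = DURATION.search(text)
    if match is None:
        return None
    # Exactly one of the two capture groups is filled, depending on word order.
    return int(match.group(1) or match.group(2))


samples = [
    "ALL PASSENGERS ARE SUBJECT TO 14 DAYS QUARANTINE",
    "Quarantine for 10 days upon arrival",
    "AIRPORT CLOSED TO ALL INTERNATIONAL FLIGHTS",
]
durations = [quarantine_days(s) for s in samples]
```

A pure regex like this only catches durations written as digits next to the word "quarantine"; durations spelled out in words or spread across sentences would need the NER step described above.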

A Named Entity Recognizer was used to identify country names mentioned in the data.

NER

Part of speech tagging was used to identify the verbs in the sentences and rules were used to determine if the verb had a positive or negative connotation.

POS tagging

Airbnb

The Airbnb folder contains the analysis done on the InsideAirbnb data: http://insideairbnb.com/get-the-data.html. The aim of this analysis is to offer data-driven insights into the new trends in tourism and hospitality. The following summarises the content of Airbnb notebooks.

  • Predictive_Model.ipynb - Predictive model using FBProphet that characterizes the Airbnb demand that would have been expected had the Covid-19 pandemic not happened.
  • Geo_Distribution_Tourism.ipynb - Geo distribution of the Airbnb demand in cities around the world.

Mobility estimator analysis

Causal inference of stringency measures on mobility. We use Microsoft's DoWhy library to carry out the causal inference analysis. The following gives an overview of the notebooks:

  • CA_Mobility_Weather_Countermeasures.ipynb - Investigate the effect of countermeasures/lockdown on mobility data.
  • CA_dowhy multiple scenarios_weekend_encoding.ipynb - Try out different hypotheses to investigate the effect of lockdown measures on mobility.
  • CA_dowhy_validation_weekend_encoding.ipynb - Build a mobility estimator model using the causal estimator function and validate the model.

CA_Causal_structure_discovery_economic_impact.ipynb

Investigate the effect of lockdown measures on the economy. To identify economic activity we consider data sets such as electricity consumption and heavy truck toll movement data. As trucks are mainly used to transport goods, their movement data helps in estimating the current economic activity.

Causal_graph

Our Results

The following blog posts can provide more information on the analysis and results of our work:

WS3 - Economic Scenario Modelling

We create an economic scenario modelling tool, the Emergent Economic Engine, to simulate how shocks to some industries may propagate to the rest of a networked economy. These shocks can take the form of sectoral shutdowns or travel restrictions. The tool also offers the possibility of counteracting these measures by injecting resources into the economy. The folder is organised into two subfolders:

  • Economic documentation, where we explain the basics of the Leontief Input-Output model and how we can shock it from a dynamic and a static point of view.
  • Simulation engine app, where we have the source code of the Emergent Economic Engine and the associated Input-Output tables.
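
To make the idea of shocking a Leontief Input-Output model concrete, here is a minimal static sketch in plain Python. It solves the standard balance equation x = A x + d by fixed-point iteration and compares total output before and after a cut in final demand. The two-sector coefficient matrix, the demand vectors, and the `total_output` helper are hypothetical and are not taken from the Emergent Economic Engine or its Input-Output tables.

```python
def total_output(A, d, rounds=200):
    """Solve x = A x + d by fixed-point iteration.

    Each round adds one more layer of indirect (supplier) demand; the
    iteration converges when the spectral radius of A is below 1.
    """
    n = len(d)
    x = list(d)
    for _ in range(rounds):
        x = [d[i] + sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    return x


# Hypothetical technical coefficients: A[i][j] is the input from sector i
# needed to produce one unit of output in sector j.
A = [[0.2, 0.3],
     [0.4, 0.1]]

demand = [100.0, 50.0]   # baseline final demand per sector
shocked = [100.0, 25.0]  # assumed shock: final demand for sector 2 is halved

baseline = total_output(A, demand)
after_shock = total_output(A, shocked)
# Output lost in each sector, including the indirect effect on sector 1.
loss = [b - s for b, s in zip(baseline, after_shock)]
```

In this toy economy the shock hits only sector 2 directly, yet sector 1 also loses output because it supplies inputs to sector 2. A dynamic treatment would track how the adjustment unfolds over time, which this static sketch does not attempt.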

Emergent Economic Engine

Our Results

In the following posts we explain the theory behind our model and the Emergent Economic Engine.

workstreams's People

Contributors

acorralescano, deepak-r2dl, deepaksrinivasan, dependabot[bot], giorgos-aniftos, klausgpaul, leekyuh-ibm, mariaivanciu125, mehrnoosh-vahdat, shrirajendran, vincent-nelis

workstreams's Issues

docker-compose build default variables file (.env) missing

Subject of the issue

Docker compose build default variable definitions were missing

Workstream

ws1/ labelling tool/ cookiecutter

Your environment

  • N/A, generic

Steps to reproduce

When building/running the container services, a missing POSTGRES_PORT definition is being flagged and the container cannot be run

Expected behaviour

Give an example .env file and instruct how to use it

Actual behaviour

No .env file available as a template

Provide docker setup for Economic Engine

Provisioning a docker image helps a lot in creating a reproducible, well configured environment.

Workstream

WS3, in particular Simulation engine app

Your environment

  • ubuntu 20/AMD/Intel and macOS Catalina 10.15
  • docker snap-in, docker app

Steps to reproduce

N/A

Expected behaviour

  • Allow docker container build straight from the repo
  • Amend README.md

Actual behaviour

N/A

ECDC Case data no longer useful after switching to weekly data

Subject of the issue

https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide

From the website

ECDC switched to a weekly reporting schedule for the COVID-19 situation worldwide and in the EU/EEA and the UK on 17 December this year. Hence, all daily updates have been discontinued from 14 December. ECDC will publish updates on the number of cases and deaths reported worldwide and aggregated by week every Thursday. The weekly data will be available as downloadable files in the following formats: XLSX, CSV, JSON and XML. As an exception, the weekly updates for the end-of-year festive season will be published on 23 December and 30 December 2020.

Workstream

Workstream 1, cookiecutter labelling

Your environment

  • N/A, backend

Steps to reproduce

The Jupyter notebook Download Case Data will fail because the column names have also changed.

Expected behaviour

European authorities to continue reporting daily numbers, maybe at a reduced frequency.

Actual behaviour

ECDC decided something else.

Cookiecutter case number displays out of range (datetime) data

Subject of the issue

Some case number datasets displayed by cookiecutter may compute out of range datetimes well before 2020-01-01.

Workstream

Workstream 1

Your environment

  • using the docker versions of python and libraries

Steps to reproduce

This may vary day by day as it is caused by numerical instabilities of some of the wave detectors

Expected behaviour

x axis/datetime should only display relevant time ranges

Actual behaviour

Display zooms out and displays data from 1980-01-01/1970-01-01
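
One possible display-side guard, sketched here under assumptions, is to drop any timestamps before a cut-off date before computing the x-axis range. This does not address the numerical instabilities in the wave detectors themselves. The `EARLIEST` cut-off, the `clip_to_relevant_range` helper, and the sample points are hypothetical and not part of cookiecutter.

```python
from datetime import datetime

# Assumed lower bound for meaningful case data, taken from the 2020-01-01
# date mentioned in this issue.
EARLIEST = datetime(2020, 1, 1)


def clip_to_relevant_range(points, earliest=EARLIEST):
    """Keep only (timestamp, value) pairs at or after the cut-off date."""
    return [(ts, value) for ts, value in points if ts >= earliest]


points = [
    (datetime(1970, 1, 1), 0),     # epoch artefact of the kind described above
    (datetime(2020, 3, 15), 120),
    (datetime(2020, 4, 1), 340),
]

kept = clip_to_relevant_range(points)
# Axis limits computed from the filtered data only.
x_range = (min(ts for ts, _ in kept), max(ts for ts, _ in kept))
```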

Incorporate Familiarisation Tutorials

Subject of the issue

The cookiecutter tool suite is very complex and some features are not obvious at first glance.

Your environment

  • N/A

Steps to reproduce

Directly connect to the bokeh application.

Expected behaviour

The features of the tool and their purposes ought to be described.

Actual behaviour

Users are not able to exploit the toolset.

Temporal clusters of measures not updated correctly in cookiecutter Health tab

Subject of the issue

The heatmap with the temporal clusters is not updated for certain countries.

Workstream

ws1, cookiecutter labelling tool

Your environment

  • N/A

Steps to reproduce

On cookiecutter,

  • look at the landing page (which shows Germany by default), it will display a correct heatmap for the Temporal clusters
  • select another country for which stringency cluster data would be available. e.g. United Kingdom,
  • the heatmap will shrink and no data will be displayed
  • changing back to Germany, the display will function

Expected behaviour

Stringency cluster data should be displayed when available

Actual behaviour

The code was assigning a newly created range to the y axis (the range holds the country names, which need updating). This does not update the parent figure.

Instead of

self.p_oxcluster.y_range = FactorRange(factors=sorted(df.country.unique(),reverse=True))

the correct assignment is to the factors attribute

self.p_oxcluster.y_range.factors=sorted(df.country.unique(),reverse=True)

Too many changepoint-detected computed waves displayed

Subject of the issue

There are multiple wave zones overlapping each other displayed in the case data bar chart.

Workstream

Workstream 1/cookiecutter labelling tool

Your environment

  • docker

Steps to reproduce

Open up cookiecutter, you will see multiple red-green wave/calm zones for one country, at country level

Expected behaviour

Changepoint detection should not result in overlapping zones of wave/calm

Actual behaviour

The query that retrieves the computed zones does not consider the selected data source and will, at country level, retrieve zones from both the Johns Hopkins global and the ECDC datasets. Adding the currently selected data source to the WHERE clause in the code should restore the desired behaviour, displaying only the computed waves pertinent to the selected dataset, with no overlaps.
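
The proposed fix can be illustrated with a small self-contained sketch. It uses an in-memory SQLite database as a stand-in for the real SQL backend, and the `zones` table, its columns, the `zones_for` helper, and the sample rows are all hypothetical; only the idea of adding the data source to the WHERE clause comes from this issue.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per computed wave/calm zone.
conn.execute(
    "CREATE TABLE zones ("
    "country TEXT, data_source TEXT, kind TEXT, start_date TEXT, end_date TEXT)"
)
# Made-up rows purely for illustration.
conn.executemany(
    "INSERT INTO zones VALUES (?, ?, ?, ?, ?)",
    [
        ("Germany", "Johns Hopkins global", "wave", "2020-03-01", "2020-05-15"),
        ("Germany", "Johns Hopkins global", "calm", "2020-05-16", "2020-09-30"),
        ("Germany", "ECDC", "wave", "2020-03-05", "2020-05-20"),
    ],
)


def zones_for(conn, country, data_source):
    """Fetch zones for one country, restricted to the selected data source."""
    cursor = conn.execute(
        "SELECT kind, start_date, end_date FROM zones "
        "WHERE country = ? AND data_source = ? "
        "ORDER BY start_date",
        (country, data_source),
    )
    return cursor.fetchall()


ecdc_zones = zones_for(conn, "Germany", "ECDC")
jhu_zones = zones_for(conn, "Germany", "Johns Hopkins global")
```

Without the `data_source` condition, the query would return all three rows for Germany, and the two wave zones from different sources would be drawn on top of each other.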

Merge branch cookiecutter-sql back to main

Subject of the issue

A lot of rework was done to move the data storage concept from files to SQL backend. This branch has now matured and can be pulled back into master.

Workstream

Cookiecutter supports ws1 (health) and ws3 (economic engine)

Your environment

  • python 3.6 and 3.8
  • ubuntu 16 and 18
  • Chrome Canary, Chrome

Steps to reproduce

N/A

Expected behaviour

Should work as specified

Actual behaviour

N/A
