
Internal Displacement

This repository is now archived. The project is being continued but is currently closed to new members. Data for Democracy is a community-driven organization. If you want to start a new project in a similar area, you are welcome to do so! Check out the #refugees channel and rally your fellow data nerds!

Slack Channel: #internal-displacement

Project Description: Classifying, tagging, analyzing and visualizing news articles about internal displacement. Based on a challenge from the IDMC.

The tool we are building carries out a number of functions:

  1. Ingest a list of URLs
  2. Scrape content from the respective web pages
  3. Tag the article as relating to disaster or conflict
  4. Extract key information from text
  5. Store information in a database
  6. Display data in interactive visualisations

The final aim is a simple app that can perform all of these functions with little technical knowledge needed by the user.

Project Lead:

Maintainers: These are the additional people mainly responsible for reviewing pull requests, providing feedback and monitoring issues.

Scraping, processing, NLP

Front end and infrastructure

Getting started:

  1. Join the Slack channel.
  2. Read the rest of this page and the IDETECT challenge page to understand the project.
  3. Check out our issues (small tasks) and milestones. Keep an eye out for help-wanted, beginner-friendly, and discussion tags.
  4. See something you want to work on? Make a comment on the issue or ping us on Slack to let us know.
  5. Beginner with GitHub? Make sure you've read the steps for contributing to a D4D project on GitHub.
  6. Write your code and submit a pull request to add it to the project. Reach out for help any time!

Things you should know

  • Beginners are welcome! We're happy to help you get started. (For beginners with Git and GitHub specifically, our github-playground repo and the #github-help Slack channel are good places to start.)
  • We believe good code is reviewed code. All commits to this repository are approved by project maintainers and/or leads (listed above). The goal here is not to criticize or judge your abilities; rather, it's to share insights and celebrate achievements. Code reviews help us continually refine the project's scope and direction, and encourage discussion.
  • This README belongs to everyone. If we've missed some crucial information or left anything unclear, edit this document and submit a pull request. We welcome the feedback! Up-to-date documentation is critical to what we do, and changes like this are a great way to make your first contribution to the project.

Project Overview

There are millions of articles containing information about displaced people. Each of these is a rich source of information that can be used to analyse the flow of people and reporting about them.

We are looking to record:

  • URL
  • Number of times URL has been submitted
  • Main text
  • Source (e.g. New York Times)
  • Publication date
  • Title
  • Author(s)
  • Language of article
  • Reason for displacement (violence/disaster/both/other)
  • The location where the displacement happened
  • Reporting term: displaced/evacuated/forced to flee/homeless/in relief camp/sheltered/relocated/destroyed housing/partially destroyed housing/uninhabitable housing
  • Reporting unit: people/persons/individuals/children/inhabitants/residents/migrants or families/households/houses/homes
  • Number displaced
  • Metrics relating to machine learning accuracy and reliability

Project Components

These are the main parts and functions that make up the project.

  • Scraper and Pipeline
    • Take lists of URLs as input from the input dataset
    • Filter out irrelevant articles and types of content (videos etc.)
    • Scrape the main body text and metadata (publish date, language etc.)
    • Store the information in a database
  • Interpreter
    • Classify URLs as conflict/violence, disaster or other. There is a training dataset to help with tagging.
    • Extract information from articles: location and number of reporting units (households or individuals) displaced, date published and reporting term (conflict/violence, disaster or other). The larger extended input dataset and the text from articles we have already scraped can be used to help here.
  • Visualizer
    • A mapping tool to visualize the displacement figures and locations, identify hotspots and trends.
    • Other visualizations for a selected region to identify reporting frequency on the area
    • Visualizing the excerpts of documents where the relevant information is reported (either looking at the map or browsing the list of URLs).
    • Visualise reliability of classification and information extraction algorithms (either overall or by article)
    • Some pre-tagged datasets (1, 2) can be used to start exploring visualization options.
  • App (in the internal-displacement-web folder)
    • A front end friendly to non-technical users, wrapping the components above for inputting URLs, managing the databases, verifying data and interacting with visualisations
    • Automation of scraping, pipeline and interpreter
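To make the flow between these components concrete, here is a rough Python sketch of how they might fit together. It borrows method names mentioned in the issues further down (Scraper.scrape, Interpreter.process_article_new); the import paths and signatures are assumptions, so treat this as orientation rather than working code.

    # Hypothetical end-to-end flow; names are taken from the issues below and
    # may not match the current code exactly.
    from internal_displacement.scraper import Scraper          # assumed path
    from internal_displacement.interpreter import Interpreter  # assumed path

    def process_urls(urls, session):
        scraper = Scraper()
        interpreter = Interpreter()
        for url in urls:
            article = scraper.scrape(url)           # fetch body text + metadata
            if article is None:
                continue                            # broken or irrelevant URL
            session.add(article)                    # store in the database
            for report in interpreter.process_article_new(article):
                session.add(report)                 # extracted displacement facts
        session.commit()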

Running in Docker

You can run everything as you're accustomed to by installing dependencies locally, but another option is to run in a Docker container. That way, all of the dependencies will be installed in a controlled, reproducible way.

  1. Install Docker: https://www.docker.com/products/overview

  2. Run this command:

    docker-compose up
    

    or

    docker-compose -f docker-compose-spacy.yml up
    

    The spacy version will include the en_core_web_md 1.2.1 NLP model, which is multiple gigabytes in size. The one without the model is much smaller.

    Either way, this will take some time the first time. It's fetching and building all of its dependencies. Subsequent runs should be much faster.

    This will start up several docker containers, running postgres, a Jupyter notebook server, and the node.js front end.

    In the output, you should see a line like:

    jupyter_1  |         http://0.0.0.0:3323/?token=536690ac0b189168b95031769a989f689838d0df1008182c
    

    That URL will connect you to the Jupyter notebook server.

  3. Visit the node.js server at http://localhost:3322

Note: You can stop the docker containers using Ctrl-C.

Note: If you already have something running on port 3322 or 3323, edit docker-compose.yml and change the first number in the ports config to a free port on your system, e.g. to use port 9999:

    ports:
      - "9999:3322"

Note: If you want to add python dependencies, add them to requirements.txt and run the jupyter-dev version of the docker-compose file:

docker-compose -f docker-compose-dev.yml up --build

You'll need to use the jupyter-dev version until your dependencies are merged to master and a new version is built. Talk to @aneel on Slack if you need to do this.

Note: if you want to run SQL commands against the database directly, you can do that by starting a Terminal within Jupyter and running the PostgreSQL shell:

psql -h localdb -U tester id_test

Note: If you want to connect to a remote database, edit the docker.env file with the DB url for your remote database.

Skills Needed

  • Python 3
  • JavaScript/HTML/CSS
  • Node.js
  • AWS
  • Visualisation (D3)

Tips for working on this project

  • Try to keep each contribution and pull request focussed mostly on solving the issue at hand. If you see more things that are needed, feel free to let us know and/or make another issue.
  • Datasets can be accessed from Dropbox
  • We have a working plan for the project.
  • Not ready to submit code to the main project? Feel free to play around with notebooks and submit them to the repository.

Things that inspire us

Refugees on IBM Watson News Explorer

Contributors

alexanderrich, arnold-jr, coldfashioned, domingohui, frenski, jlln, simonb83, wanderingstar, wwymak

internal-displacement's Issues

Infrastructure Plan

Here's a sketch of an infrastructure plan:

Development

  • Scrapers run locally (on a developer machine) in Docker for prototyping (internal-displacement repo)
    • Write to the local DB in Docker
    • Can read scrape requests from the database, but most scrapes will be triggered manually (through notebooks or scripts)
  • Web app runs locally in Docker for prototyping (internal-displacement-web repo)
    • Reads the local DB in Docker
    • Writes scrape requests to the database

IDETECT Preparation

  • Scraper and web app Docker containers deployed to an AWS instance or similar cloud-hosted infrastructure
    • Read and write to an Amazon RDS database
  • Large batch of scrape requests input into the database, read from there and processed by the Scraper(s)

Number-like entities to integers

Prior to saving reports to database quantities need to be converted to integers.

In some cases this is trivial (e.g. 500, 684); however, there are other cases that need more work, for example 'thousand', 'hundreds' etc.

Probably the best place to implement the conversion is in Report.__init__
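A small lookup table alongside ordinary int parsing would cover the common cases. The sketch below is a suggestion only; the magnitudes assigned to approximate terms are illustrative assumptions, not agreed conventions.

    # Sketch of a quantity normaliser for Report.__init__; approximate terms map
    # to rough magnitudes so orders of magnitude are preserved.
    APPROXIMATE_TERMS = {
        "dozens": 12,
        "scores": 20,
        "hundreds": 100,
        "thousands": 1000,
        "tens of thousands": 10000,
        "hundreds of thousands": 100000,
        "millions": 1000000,
    }

    def convert_quantity(text):
        """Convert a number-like string ('500', '1,200', 'thousands') to an int."""
        cleaned = text.strip().lower().replace(",", "")
        try:
            return int(cleaned)
        except ValueError:
            return APPROXIMATE_TERMS.get(cleaned)  # None if we cannot interpret it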

Pull data from S3 bucket

Currently, the csv data files are stored in my personal S3. We need to be able to download them locally or load them directly into a dataframe.
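A minimal way to do this with boto3 and pandas is sketched below; the bucket and key names are placeholders for wherever the files end up living.

    # Download a CSV from S3 and load it into a DataFrame.
    # Requires AWS credentials configured locally; bucket/key are placeholders.
    import boto3
    import pandas as pd

    s3 = boto3.client("s3")
    s3.download_file("example-bucket", "data/input_urls.csv", "input_urls.csv")
    df = pd.read_csv("input_urls.csv")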

Create, maintain and update user guide / admin guide.

The competition deliverables include:

  • A brief document describing the functionalities, such as a user guide.
  • A document describing the steps to maintain and update the tool with further features, such as an admin guide.

Although there is not a lot to say at this point in time, it is worth keeping in mind rather than leaving it to the last minute.

Scraped content to database

Make a call to Scraper.scrape for a given url and update the relevant attributes in the database with the result; add content to Content table

Train classifier on training dataset

There is a training dataset here that can be used to train a classifier to tag articles with "violence" or "disaster" as the cause of displacement. It is quite a short dataset, but using the URL scraping functionality in scrape_articles.py we should be able to make a start at training a classification algorithm.
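A simple scikit-learn baseline could look like the sketch below, assuming the article bodies and their labels have already been scraped into two lists; nothing here is final, it is just a starting point for experimentation.

    # Baseline text classifier sketch; `texts` and `labels` would come from the
    # training dataset once its URLs have been scraped.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    texts = ["..."]        # scraped article bodies (placeholders)
    labels = ["violence"]  # "violence" or "disaster" per article (placeholders)

    clf = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
    clf.fit(texts, labels)
    # clf.predict([new_article_text]) tags a new article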

Improve detection of numbers by spacy.

The extraction of information from unstructured texts currently relies on the ability of spaCy to identify number-like substrings within text.
Specifically, the like_num attribute of the spaCy Token class is used (https://spacy.io/docs/api/token).

This attribute fails to detect approximate numerical terms (e.g. hundreds, thousands, dozens, few). While these terms are approximate, they are still useful for establishing orders of magnitude and for comparison across reports.

If someone could create an improved method to determine if a spacy token is like a number, that would significantly improve our ability to extract information from texts.

More details on this issue are included in this notebook.
https://github.com/Data4Democracy/internal-displacement/blob/master/notebooks/DependencyTreeExperiments3.ipynb
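One low-effort improvement is to wrap like_num with a check against a small set of approximate terms, for example:

    # Broader "is this token number-like?" check; the word list is an assumption
    # and should grow as new cases turn up in the articles.
    APPROXIMATE_NUMBER_WORDS = {
        "few", "several", "dozens", "scores", "hundreds", "thousands", "millions",
    }

    def is_number_like(token):
        """True if a spaCy Token looks like a number, including approximate terms."""
        return token.like_num or token.lower_ in APPROXIMATE_NUMBER_WORDS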

Pipeline - save data to csv

The scraper puts out a list of dictionaries with the contents and metadata from a webpage. We need to be able to save this as a csv. Bonus points if you can append it back to the original csv.
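With pandas this only takes a few lines; the file path below is a placeholder, and appending assumes the columns match the original CSV.

    # Save the scraper's list of dicts to CSV, appending if the file already exists.
    import os
    import pandas as pd

    def save_scraped(records, path="scraped_articles.csv"):
        df = pd.DataFrame(records)              # one row per scraped page
        write_header = not os.path.exists(path)
        df.to_csv(path, mode="a", header=write_header, index=False)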

Reliability score for report interpretation

Write a function that calculates the percentage of missing fields in report.Report after an article has been interpreted.

We may expand this later to include weighting or other factors. Discussion welcome.
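A first cut could simply count empty attributes on the Report object; the field list below is an assumption about what report.Report holds and should be adjusted to the real attributes.

    # Fraction of fields missing from an interpreted report (sketch).
    REPORT_FIELDS = ["location", "reporting_term", "reporting_unit", "quantity", "event_date"]

    def missing_field_ratio(report):
        missing = sum(1 for field in REPORT_FIELDS if not getattr(report, field, None))
        return missing / len(REPORT_FIELDS)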

Articles to reports

Given an Article, if it is in English, make a call to Interpreter.process_article_new to obtain the Reports; for each Report returned, save in Report table; if no Reports, then set its relevance to be False
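In code, that flow might look roughly like the sketch below; the attribute names language and relevance, and the SQLAlchemy-style session, are assumptions about the current model.

    # Sketch: turn one Article into zero or more Reports.
    def article_to_reports(article, interpreter, session):
        if article.language != "en":
            return
        reports = interpreter.process_article_new(article)
        if not reports:
            article.relevance = False       # nothing displacement-related found
        for report in reports:
            session.add(report)             # save each Report to the Report table
        session.commit()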

Scrape and store article content from URLs

The master input, extended and training datasets all contain URLs. For initial exploration and later analysis, it would be nice to build functionality to scrape, strip, and store the article information.

Scraper - Tag broken URLs

Lots of the URLs are broken, or contain information that can't be parsed as text (e.g. videos, images). How can we filter them out and tag them as such?
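A lightweight first pass is a HEAD request that records the status code and content type; anything that errors out or returns a non-2xx status gets tagged as broken. A sketch, with error handling kept minimal:

    # Tag URLs that are unreachable or clearly not parseable as text.
    import requests

    def check_url(url, timeout=10):
        try:
            resp = requests.head(url, allow_redirects=True, timeout=timeout)
        except requests.RequestException:
            return {"url": url, "status": "broken", "content_type": None}
        return {
            "url": url,
            "status": "ok" if resp.ok else "broken",
            "content_type": resp.headers.get("Content-Type", ""),
        }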

Scraper - Tag content type

During scraping, can we tag whether something is text/video/image/pdf. Extra dessert if you can discern between news/blog etc.

Map of events/flow

World map(s) that show some combination of:

  • Number of events
  • Magnitude of events
  • Conflict/Violence labels

Bonus if we can filter the visualised points based on type, reporting unit, etc.

Visualization discussion

The ultimate aim of this project is to make a visualization tool that can:

  • Map the displacement figures and locations, identify hotspots and trends.
  • Visualize reporting frequency and statistics for a selected region (using histogram or other such charts)
  • Display excerpts of documents where the relevant information is reported (either by looking at the map or browsing the list of URLs).
  • Visualize anything else you can think of!

To get started, datasets to play with can be found here.

Classify article

Given an Article, run the classifier, and update its category - conflict/violence/disaster/both/other

Detect URLs with PDF

Copied from @jlln

Nice work with the parser. I have looked into incorporating it into the scraper but I have encountered the issue of identifying PDFs:

  • How to distinguish a URL returning a PDF from one returning HTML?
    • The URL alone is not sufficient to identify the type of the returned object.
    • I have tried using the Python requests module to pull the header and examine its content type. However, this doesn't always work, because some URLs return an HTML page that contains a PDF in an iframe, e.g. http://erccportal.jrc.ec.europa.eu/getdailymap/docId/1125
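One workaround for the iframe case is to fall back to fetching the page and looking for an embedded PDF whenever the header says HTML. A sketch using requests and BeautifulSoup (not yet part of the scraper):

    # Detect whether a URL ultimately points at a PDF, including PDFs embedded
    # in an iframe on an otherwise-HTML page.
    import requests
    from bs4 import BeautifulSoup

    def is_pdf(url, timeout=10):
        resp = requests.get(url, timeout=timeout)
        content_type = resp.headers.get("Content-Type", "").lower()
        if "pdf" in content_type:
            return True
        if "html" in content_type:
            soup = BeautifulSoup(resp.text, "html.parser")
            for frame in soup.find_all(["iframe", "embed", "object"]):
                src = frame.get("src") or frame.get("data") or ""
                if ".pdf" in src.lower():
                    return True
        return False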

Convert relative dates to absolute datetimes

An Article may have a publication date in datetime format. Dates extracted from text can often be relative or vague, e.g. "last Saturday".

Write a function to combine the article.Article publication date with dates interpreted in report.Report, in order to convert dates extracted from text into datetimes.
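One option is the dateparser library, which accepts a reference date for resolving relative expressions. A sketch, assuming the article exposes a publish_date attribute:

    # Resolve relative date strings ("last Saturday") against the article's
    # publication date. `publish_date` is an assumed attribute name.
    import dateparser

    def resolve_date(text, article):
        settings = {}
        if getattr(article, "publish_date", None) is not None:
            settings["RELATIVE_BASE"] = article.publish_date
        return dateparser.parse(text, settings=settings)  # None if unparseable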

Plot of events

Scatter plot (or other) visualising information related to events. Could include

  • Location
  • Number of events
  • Magnitude of events

among others.

pgConfig change breaks production config

This code in master breaks production:

//if not using docker
//create a pgConfig.js file in the same directory and put your credentials there
const connectionObj = require('./pgConfig');
nodejs_1   | [0] Error: Cannot find module './pgConfig'
nodejs_1   | [0]     at Function.Module._resolveFilename (module.js:470:15)
nodejs_1   | [0]     at Function.Module._load (module.js:418:25)
nodejs_1   | [0]     at Module.require (module.js:498:17)
nodejs_1   | [0]     at require (internal/module.js:20:19)
nodejs_1   | [0]     at Object.<anonymous> (/internal-displacement-web/server/pgDB/index.js:5:23)
nodejs_1   | [0]     at Module._compile (module.js:571:32)
nodejs_1   | [0]     at Object.Module._extensions..js (module.js:580:10)
nodejs_1   | [0]     at Module.load (module.js:488:32)
nodejs_1   | [0]     at tryModuleLoad (module.js:447:12)
nodejs_1   | [0]     at Function.Module._load (module.js:439:3)
nodejs_1   | [0] [nodemon] app crashed - waiting for file changes before starting...

Can this be made optional? If the file pgConfig exists, require it, otherwise use the environment variables?

Deal with update_status errors

In Pipeline.process_url we make multiple calls to article.update_status().

The update_status method may raise UnexpectedArticleStatusException if it appears that the status has been changed in the meantime.

process_url should be prepared for dealing with this exception.
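A minimal way to handle it is to catch the exception, log it, and skip (or later retry) that article. A sketch; the exception's import location and the retry policy are left open:

    # Tolerate concurrent status changes inside Pipeline.process_url.
    # from internal_displacement.article import UnexpectedArticleStatusException  (assumed path)
    import logging

    logger = logging.getLogger(__name__)

    def safe_update_status(article, new_status):
        try:
            article.update_status(new_status)
            return True
        except UnexpectedArticleStatusException:  # raised by update_status (see above)
            logger.warning("Status of %s changed underneath us; skipping", article.url)
            return False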

Implement filtering of documents not reporting on human mobility

This is the third filtering requirement from the competition guidelines, to eliminate articles which mention the word 'mobility' but are unrelated to human mobility.

As per @milanoleonardo, a possible approach:

this can be done by looking at the dependency trees of the sentences in the text to make sure there is a link between a “reporting term” and a “reporting unit” (see challenge for details). This would definitely remove all documents reporting on “hip displacement” or sentences like “displaced the body of people” etc.
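As a first approximation, that check can be done with spaCy by requiring that a reporting-term token and a reporting-unit token sit in the same dependency subtree of a sentence. The term lists below are abbreviated examples, not the full lists from the challenge:

    # Rough dependency-tree filter: keep a document only if some sentence links
    # a reporting term to a reporting unit.
    import spacy

    nlp = spacy.load("en_core_web_md")
    REPORTING_TERMS = {"displace", "evacuate", "flee", "relocate", "shelter"}
    REPORTING_UNITS = {"person", "people", "family", "household", "resident", "house"}

    def reports_human_mobility(text):
        doc = nlp(text)
        for sent in doc.sents:
            terms = [t for t in sent if t.lemma_.lower() in REPORTING_TERMS]
            units = [t for t in sent if t.lemma_.lower() in REPORTING_UNITS]
            for term in terms:
                for unit in units:
                    if term.is_ancestor(unit) or unit.is_ancestor(term):
                        return True
        return False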

Enhance country detection in article content

Enhance the country_code function in interpreter.py in order to more reliably recognize countries.
For example it currently fails for 'the United States' vs 'United States'.

It would also be good to try to detect countries even when the name is not explicitly mentioned, e.g. from city names.

The Mordecai library may be an option, however it requires its own NLP parsing and I was wondering if there was a simpler way to do this without using two NLP libraries + trained models.
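A lighter-weight alternative to pulling in a second NLP stack is a normalisation pass plus a pycountry lookup before giving up; pycountry is only a suggestion here, and this handles country names rather than cities.

    # Normalise a place name and look it up with pycountry; returns the ISO
    # alpha-3 code or None. Sketch only, not the existing country_code function.
    import pycountry

    def lookup_country_code(name):
        cleaned = name.strip()
        if cleaned.lower().startswith("the "):
            cleaned = cleaned[4:]            # "the United States" -> "United States"
        try:
            return pycountry.countries.lookup(cleaned).alpha_3
        except LookupError:
            return None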

Config for running in AWS

The docker-compose.yml and docker.env files are currently set up with local development in mind. We'll want a production-friendly config.

  • Don't run localdb
  • DB config refers to AWS RDS instance instead of localdb (please do not check credentials in to git)
  • Node.js runs production version, instead of development version

Pipeline testing for pdf articles

Make sure pipeline is working with pdf articles for different scenarios:

  • Non existent / broken url
  • Non English
  • Irrelevant
  • Relevant

Ideally include some tests in tests/test_Pipeline.py

Pipeline - consistent date and time

Haven't looked too deeply into newspaper's handling of datetimes, but if they vary from site to site we will need to make them consistent. Perhaps even storing separate values for the day, month and year published.

Python process to check for new URLs and run the pipeline on them

We would like the front end to be able to submit new URLs to process by writing an article row into the DB with a status of NEW. We need a process that runs on the back end, looks for such rows and kicks off the scraping & interpretation pipeline.

Because it takes a while to bring up the interpretation environment (loading dependencies & model), it probably makes sense to have a long-running process that spends most of its time sleeping and occasionally (once every 60s? configurable?) wakes up and looks for new DB rows to process.
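The skeleton of such a worker could look like the sketch below; the environment variable, the status value and the model import are assumptions still to be pinned down.

    # Long-running worker: load the heavy NLP machinery once, then poll for NEW
    # articles. Assumes `Article` is the SQLAlchemy model from model.py.
    # from internal_displacement.model import Article  (assumed import path)
    import os
    import time

    POLL_SECONDS = int(os.environ.get("PIPELINE_POLL_SECONDS", "60"))

    def run_worker(session, pipeline):
        while True:
            new_articles = session.query(Article).filter(Article.status == "NEW").all()
            for article in new_articles:
                pipeline.process_url(article.url)   # scrape + interpret
            time.sleep(POLL_SECONDS)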

Manage PDF scraping

PDF scraping could be hard-disk intensive and could slow down scraping when doing a bulk load of URLs. Can we:

  1. Have the option to turn off pdf scraping. What part of the code should control this?
  2. Delete a pdf as soon as it has been downloaded and parsed

Improve text extraction from URLs with beautifulsoup

There is a barebones function to extract text from the URLs. However, this hasn't been tested across many different URLs and does not necessarily do the best job of extracting the main body of relevant text.
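For comparison, a barebones BeautifulSoup extraction that keeps paragraph text and drops obvious noise might look like this; real pages will need smarter boilerplate removal.

    # Minimal main-text extraction with BeautifulSoup (sketch).
    import requests
    from bs4 import BeautifulSoup

    def extract_text(url):
        resp = requests.get(url, timeout=10)
        soup = BeautifulSoup(resp.text, "html.parser")
        for tag in soup(["script", "style", "nav", "footer"]):
            tag.decompose()                 # strip non-content elements
        paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
        return "\n".join(p for p in paragraphs if p)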

Database schema for documents

Need an initial DB schema to capture information about documents and facts.

Proposal:
Tables:

  • Article (id, URL, retrieval date, source, publication date, title, authors, language, analyzer, analysis date) -- metadata extracted from a retrieved article
  • Full Text (article, content) -- full text of the article
  • Analysis (id, article, reason, location, reporting term, reporting unit, number, metrics, analyzer, analysis date) -- analysis of a retrieved article
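Expressed as SQLAlchemy models, the proposal might look like the sketch below; column names and types are open to discussion, and the metrics/analyzer columns are omitted for brevity.

    # Sketch of the proposed schema as SQLAlchemy models.
    from sqlalchemy import Column, DateTime, ForeignKey, Integer, Text
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class Article(Base):
        __tablename__ = "article"
        id = Column(Integer, primary_key=True)
        url = Column(Text, nullable=False)
        retrieval_date = Column(DateTime)
        source = Column(Text)
        publication_date = Column(DateTime)
        title = Column(Text)
        authors = Column(Text)
        language = Column(Text)

    class Content(Base):
        __tablename__ = "content"
        article_id = Column(Integer, ForeignKey("article.id"), primary_key=True)
        content = Column(Text)

    class Analysis(Base):
        __tablename__ = "analysis"
        id = Column(Integer, primary_key=True)
        article_id = Column(Integer, ForeignKey("article.id"))
        reason = Column(Text)
        location = Column(Text)
        reporting_term = Column(Text)
        reporting_unit = Column(Text)
        number = Column(Integer)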

Explore refugee data in Jupyter Notebooks

We want people to play around with our tool and do some data analysis and visualisations. Start a notebook, see what you can make or break and let us know below.

Best NLP approach to extract useful info from articles

From the articles, we need to extract whether individuals or households are being displaced (the reporting unit), how many of them, the date the article was published, and which reporting term is most appropriate.

Scraping reliability score

Write a function in article.Article that calculates the percentage of scraped fields which are returned empty.

We may consider expanding the definition of scraping reliability later, so suggestions welcome.

Modify Location Schema to be able to distinguish between cities and country sub-divisions

The extracted location could be either a Country, a country sub-division (i.e. province or state) or a City.

We need to modify the schema for Location class in model.py to ensure that we can capture these different options:

CREATE TABLE location (
    id SERIAL PRIMARY KEY,
    description TEXT,
    city TEXT,
    state TEXT,
    country CHAR(3) REFERENCES country ON DELETE CASCADE,
    latlong TEXT
);

Generate a reliability score for a given article

In some contexts, information about IDPs is highly politicized, which could be problematic if you're drawing from media reports. You'd want to be very careful in selecting which sources you used for info about the Rohingya in Myanmar, for example.

It would be good to be able to score an article for reliability in order to help analysts as they analyze and interpret the extracted data.
In some cases, news sources may be government run, 'fake news' or have poor sources / track record, and so any data reported by and extracted from these sources should be identifiable as having potential issues.

On the front end, this could include a filter for analysts to use, whereby they can select all articles or only those with a reliability score above a certain threshold.

Some thoughts for implementation include:

  1. A maintainable list of known problematic sources
  2. Measuring similarity of reported facts between sources
  3. A maintainable list of highly trusted and common 'core' news sources and anything from these sources automatically gets a high reliability rating.
  4. New or unknown sources automatically get a lower rating unless their facts are similar enough to a report from a highly trusted source etc.

Integrate event fact extraction

Integrate work from notebooks into codebase in an attempt to extract

  • The reporting term (i.e. destroyed, displaced, etc.)
  • The reporting unit (i.e. houses, people, villages etc.)
  • The quantity referenced (i.e. 500, thousands, tens)
  • The date of the event (i.e. Saturday 09 May 2015, last Saturday)
  • The location of the event

from each article's content.
