
Internal Displacement

This repository is now archived. The project is being continued but is currently closed to new members. Data for Democracy is a community-driven organization. If you want to start a new project in a similar area, you are welcome to do so! Check out the #refugees channel and rally your fellow data nerds!

Slack Channel: #internal-displacement

Project Description: Classifying, tagging, analyzing and visualizing news articles about internal displacement. Based on a challenge from the IDMC.

The tool we are building carries out a number of functions:

  1. Ingest a list of URLs
  2. Scrape content from the respective web pages
  3. Tag the article as relating to disaster or conflict
  4. Extract key information from text
  5. Store information in a database
  6. Display data in interactive visualisations

The final aim is a simple app that can perform all of these functions with little technical knowledge needed by the user.

Project Lead:

Maintainers: These are the additional people mainly responsible for reviewing pull requests, providing feedback and monitoring issues.

Scraping, processing, NLP

Front end and infrastructure

Getting started:

  1. Join the Slack channel.
  2. Read the rest of this page and the IDETECT challenge page to understand the project.
  3. Check out our issues (small tasks) and milestones. Keep an eye out for help-wanted, beginner-friendly, and discussion tags.
  4. See something you want to work on? Make a comment on the issue or ping us on Slack to let us know.
  5. Beginner with GitHub? Make sure you've read the steps for contributing to a D4D project on GitHub.
  6. Write your code and submit a pull request to add it to the project. Reach out for help any time!

Things you should know

  • Beginners are welcome! We're happy to help you get started. (For beginners with Git and GitHub specifically, our github-playground repo and the #github-help Slack channel are good places to start.)
  • We believe good code is reviewed code. All commits to this repository are approved by project maintainers and/or leads (listed above). The goal here is not to criticize or judge your abilities; rather, it's to share insights and celebrate achievements. Code reviews help us continually refine the project's scope and direction, and encourage discussion.
  • This README belongs to everyone. If we've missed some crucial information or left anything unclear, edit this document and submit a pull request. We welcome the feedback! Up-to-date documentation is critical to what we do, and changes like this are a great way to make your first contribution to the project.

Project Overview

There are millions of articles containing information about displaced people. Each of these is a rich source of information that can be used to analyse the flow of people and reporting about them.

We are looking to record:

  • URL
  • Number of times URL has been submitted
  • Main text
  • Source (e.g. New York Times)
  • Publication date
  • Title
  • Author(s)
  • Language of article
  • Reason for displacement (violence/disaster/both/other)
  • The location where the displacement happened
  • Reporting term: displaced/evacuated/forced to flee/homeless/in relief camp/sheltered/relocated/destroyed housing/partially destroyed housing/uninhabitable housing
  • Reporting unit: people/persons/individuals/children/inhabitants/residents/migrants or families/households/houses/homes
  • Number displaced
  • Metrics relating to machine learning accuracy and reliability

Project Components

These are the main parts and functions that make up the project.

  • Scraper and Pipeline
    • Take lists of URLs as input from the input dataset
    • Filter out irrelevant articles and types of content (videos etc.)
    • Scrape the main body text and metadata (publish date, language etc.)
    • Store the information in a database
  • Interpreter
    • Classify URLs as conflict/violence, disaster or other. There is a training dataset to help with tagging.
    • Extract information from articles: location and number of reporting units (households or individuals) displaced, date published and reporting term (conflict/violence, disaster or other). The larger extended input dataset and the text from articles we have already scraped can be used to help here.
  • Visualizer
    • A mapping tool to visualize the displacement figures and locations, identify hotspots and trends.
    • Other visualizations for a selected region to identify reporting frequency on the area
    • Visualizing the excerpts of documents where the relevant information is reported (either looking at the map or browsing the list of URLs).
    • Visualise reliability of classification and information extraction algorithms (either overall or by article)
    • Some pre-tagged datasets (1, 2) can be used to start exploring visualization options.
  • App (in the internal-displacement-web folder)
    • A front end friendly to non-technical users, wrapping the components above for inputting URLs, managing the databases, verifying data and interacting with visualisations
    • Automation of scraping, pipeline and interpreter
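To make the flow between these components concrete, here is a rough Python sketch of how they might fit together. It borrows method names mentioned in the issues further down (Scraper.scrape, Interpreter.process_article_new); the import paths and signatures are assumptions, so treat this as orientation rather than working code.

    # Hypothetical end-to-end flow; names are taken from the issues below and
    # may not match the current code exactly.
    from internal_displacement.scraper import Scraper          # assumed path
    from internal_displacement.interpreter import Interpreter  # assumed path

    def process_urls(urls, session):
        scraper = Scraper()
        interpreter = Interpreter()
        for url in urls:
            article = scraper.scrape(url)           # fetch body text + metadata
            if article is None:
                continue                            # broken or irrelevant URL
            session.add(article)                    # store in the database
            for report in interpreter.process_article_new(article):
                session.add(report)                 # extracted displacement facts
        session.commit()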

Running in Docker

You can run everything as you're accustomed to by installing dependencies locally, but another option is to run in a Docker container. That way, all of the dependencies will be installed in a controlled, reproducible way.

  1. Install Docker: https://www.docker.com/products/overview

  2. Run this command:

    docker-compose up
    

    or

    docker-compose -f docker-compose-spacy.yml up
    

    The spacy version will include the en_core_web_md 1.2.1 NLP model, which is multiple gigabytes in size. The one without the model is much smaller.

    Either way, this will take some time the first time. It's fetching and building all of its dependencies. Subsequent runs should be much faster.

    This will start up several docker containers, running postgres, a Jupyter notebook server, and the node.js front end.

    In the output, you should see a line like:

    jupyter_1  |         http://0.0.0.0:3323/?token=536690ac0b189168b95031769a989f689838d0df1008182c
    

    That URL will connect you to the Jupyter notebook server.

  3. Visit the node.js server at http://localhost:3322

Note: You can stop the docker containers using Ctrl-C.

Note: If you already have something running on port 3322 or 3323, edit docker-compose.yml and change the first number in the ports config to a free port on your system, e.g. to use port 9999:

    ports:
      - "9999:3322"

Note: If you want to add python dependencies, add them to requirements.txt and run the jupyter-dev version of the docker-compose file:

docker-compose -f docker-compose-dev.yml up --build

You'll need to use the jupyter-dev version until your dependencies are merged to master and a new version is built. Talk to @aneel on Slack if you need to do this.

Note: if you want to run SQL commands against the database directly, you can do that by starting a Terminal within Jupyter and running the PostgreSQL shell:

psql -h localdb -U tester id_test

Note: If you want to connect to a remote database, edit the docker.env file with the DB url for your remote database.

Skills Needed

  • Python 3
  • JavaScript/HTML/CSS
  • Node.js
  • AWS
  • Visualisation (D3)

Tips for working on this project

  • Try to keep each contribution and pull request focussed mostly on solving the issue at hand. If you see more things that are needed, feel free to let us know and/or make another issue.
  • Datasets can be accessed from Dropbox
  • We have a working plan for the project.
  • Not ready to submit code to the main project? Feel free to play around with notebooks and submit them to the repository.

Things that inspire us

Refugees on IBM Watson News Explorer

Contributors

alexanderrich, arnold-jr, coldfashioned, domingohui, frenski, jlln, simonb83, wanderingstar, wwymak

internal-displacement's Issues

Infrastructure Plan

Here's a sketch of an infrastructure plan:

Development

  • Scrapers run locally (on a developer machine) in Docker for prototyping (internal-displacement repo)
    • Write to the local DB in Docker
    • Can read scrape requests from the database, but most scrapes will be triggered manually (through notebooks or scripts)
  • Web app runs locally in Docker for prototyping (internal-displacement-web repo)
    • Reads the local DB in Docker
    • Writes scrape requests to the database

IDETECT Preparation

  • Scraper and web app Docker containers deployed to an AWS instance or similar cloud-hosted infrastructure
    • Read and write to an Amazon RDS database
  • Large batch of scrape requests input into the database, read from there and processed by the Scraper(s)

Number-like entities to integers

Prior to saving reports to database quantities need to be converted to integers.

In some cases this is trivial (e.g. 500, 684); however, there are other cases that need more work, for example 'thousand', 'hundreds' etc.

Probably the best place to implement the conversion is in Report.__init__
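A small lookup table alongside ordinary int parsing would cover the common cases. The sketch below is a suggestion only; the magnitudes assigned to approximate terms are illustrative assumptions, not agreed conventions.

    # Sketch of a quantity normaliser for Report.__init__; approximate terms map
    # to rough magnitudes so orders of magnitude are preserved.
    APPROXIMATE_TERMS = {
        "dozens": 12,
        "scores": 20,
        "hundreds": 100,
        "thousands": 1000,
        "tens of thousands": 10000,
        "hundreds of thousands": 100000,
        "millions": 1000000,
    }

    def convert_quantity(text):
        """Convert a number-like string ('500', '1,200', 'thousands') to an int."""
        cleaned = text.strip().lower().replace(",", "")
        try:
            return int(cleaned)
        except ValueError:
            return APPROXIMATE_TERMS.get(cleaned)  # None if we cannot interpret it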

Pull data from S3 bucket

Currently, the csv data files are stored in my personal S3. We need to be able to download them locally or load them directly into a dataframe.
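A minimal way to do this with boto3 and pandas is sketched below; the bucket and key names are placeholders for wherever the files end up living.

    # Download a CSV from S3 and load it into a DataFrame.
    # Requires AWS credentials configured locally; bucket/key are placeholders.
    import boto3
    import pandas as pd

    s3 = boto3.client("s3")
    s3.download_file("example-bucket", "data/input_urls.csv", "input_urls.csv")
    df = pd.read_csv("input_urls.csv")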

Create, maintain and update user guide / admin guide.

The competition deliverables include:

  • A brief document describing the functionalities, such as a user guide.
  • A document describing the steps to maintain and update the tool with further features, such as an admin guide.

Although there is not a lot to say at this point in time, it is worth keeping in mind rather than leaving it to the last minute.

Scraped content to database

Make a call to Scraper.scrape for a given url and update the relevant attributes in the database with the result; add content to Content table

Train classifier on training dataset

There is a training dataset here that can be used to train a classifier to tag articles with "violence" or "disaster" as the cause of displacement. It is quite a short dataset, but using the URL scraping functionality in scrape_articles.py we should be able to make a start at training a classification algorithm.
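A simple scikit-learn baseline could look like the sketch below, assuming the article bodies and their labels have already been scraped into two lists; nothing here is final, it is just a starting point for experimentation.

    # Baseline text classifier sketch; `texts` and `labels` would come from the
    # training dataset once its URLs have been scraped.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    texts = ["..."]        # scraped article bodies (placeholders)
    labels = ["violence"]  # "violence" or "disaster" per article (placeholders)

    clf = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
    clf.fit(texts, labels)
    # clf.predict([new_article_text]) tags a new article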

Improve detection of numbers by spacy.

The extraction of information from unstructured texts currently relies on the ability of spaCy to identify number-like substrings within text.
Specifically, the like_num attribute of the spaCy Token class is used (https://spacy.io/docs/api/token).

This attribute fails to detect approximate numerical terms (e.g. hundreds, thousands, dozens, few). While these terms are approximate, they are still useful for establishing orders of magnitude and for comparison across reports.

If someone could create an improved method to determine if a spacy token is like a number, that would significantly improve our ability to extract information from texts.

More details on this issue are included in this notebook.
https://github.com/Data4Democracy/internal-displacement/blob/master/notebooks/DependencyTreeExperiments3.ipynb
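One low-effort improvement is to wrap like_num with a check against a small set of approximate terms, for example:

    # Broader "is this token number-like?" check; the word list is an assumption
    # and should grow as new cases turn up in the articles.
    APPROXIMATE_NUMBER_WORDS = {
        "few", "several", "dozens", "scores", "hundreds", "thousands", "millions",
    }

    def is_number_like(token):
        """True if a spaCy Token looks like a number, including approximate terms."""
        return token.like_num or token.lower_ in APPROXIMATE_NUMBER_WORDS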

Pipeline - save data to csv

The scraper puts out a list of dictionaries with the contents and metadata from a webpage. We need to be able to save this as a csv. Bonus points if you can append it back to the original csv.
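With pandas this only takes a few lines; the file path below is a placeholder, and appending assumes the columns match the original CSV.

    # Save the scraper's list of dicts to CSV, appending if the file already exists.
    import os
    import pandas as pd

    def save_scraped(records, path="scraped_articles.csv"):
        df = pd.DataFrame(records)              # one row per scraped page
        write_header = not os.path.exists(path)
        df.to_csv(path, mode="a", header=write_header, index=False)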

Reliability score for report interpretation

Write a function that calculates the percentage of missing fields in report.Report after an article has been interpreted.

We may expand this later to include weighting or other factors. Discussion welcome.
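A first cut could simply count empty attributes on the Report object; the field list below is an assumption about what report.Report holds and should be adjusted to the real attributes.

    # Fraction of fields missing from an interpreted report (sketch).
    REPORT_FIELDS = ["location", "reporting_term", "reporting_unit", "quantity", "event_date"]

    def missing_field_ratio(report):
        missing = sum(1 for field in REPORT_FIELDS if not getattr(report, field, None))
        return missing / len(REPORT_FIELDS)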

Articles to reports

Given an Article, if it is in English, make a call to Interpreter.process_article_new to obtain the Reports; for each Report returned, save in Report table; if no Reports, then set its relevance to be False
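In code, that flow might look roughly like the sketch below; the attribute names language and relevance, and the SQLAlchemy-style session, are assumptions about the current model.

    # Sketch: turn one Article into zero or more Reports.
    def article_to_reports(article, interpreter, session):
        if article.language != "en":
            return
        reports = interpreter.process_article_new(article)
        if not reports:
            article.relevance = False       # nothing displacement-related found
        for report in reports:
            session.add(report)             # save each Report to the Report table
        session.commit()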

Scrape and store article content from URLs

The master input, extended and training datasets all contain URLs. For initial exploration and later analysis, it would be nice to build functionality to scrape, strip, and store the article information.

Scraper - Tag broken URLs

Lots of the URLs are broken, or contain information that can't be parsed as text (e.g. videos, images). How can we filter them out and tag them as such?
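A lightweight first pass is a HEAD request that records the status code and content type; anything that errors out or returns a non-2xx status gets tagged as broken. A sketch, with error handling kept minimal:

    # Tag URLs that are unreachable or clearly not parseable as text.
    import requests

    def check_url(url, timeout=10):
        try:
            resp = requests.head(url, allow_redirects=True, timeout=timeout)
        except requests.RequestException:
            return {"url": url, "status": "broken", "content_type": None}
        return {
            "url": url,
            "status": "ok" if resp.ok else "broken",
            "content_type": resp.headers.get("Content-Type", ""),
        }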

Scraper - Tag content type

During scraping, can we tag whether something is text/video/image/pdf. Extra dessert if you can discern between news/blog etc.

Map of events/flow

World map(s) that show some combination of:

  • Number of events
  • Magnitude of events
  • Conflict/Violence labels

Bonus if we can filter the visualised points based on type, reporting unit, etc.

Visualization discussion

The ultimate aim of this project is to make a visualization tool that can:

  • Map the displacement figures and locations, identify hotspots and trends.
  • Visualize reporting frequency and statistics for a selected region (using histogram or other such charts)
  • Display excerpts of documents where the relevant information is reported (either by looking at the map or browsing the list of URLs).
  • Visualize anything else you can think of!

To get started, datasets to play with can be found here.

Classify article

Given an Article, run the classifier, and update its category - conflict/violence/disaster/both/other

Detect URLs with PDF

Copied from @jlln

Nice work with the parser. I have looked into incorporating it into the scraper but I have encountered the issue of identifying PDFs:

  • How to distinguish a URL returning a PDF from one returning HTML?
    • The URL alone is not sufficient to identify the type of the returned object.
    • I have tried using the Python requests module to pull the header and examine its content type. However, this doesn't always work, because some URLs return an HTML page that contains a PDF in an iframe, e.g. http://erccportal.jrc.ec.europa.eu/getdailymap/docId/1125
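One workaround for the iframe case is to fall back to fetching the page and looking for an embedded PDF whenever the header says HTML. A sketch using requests and BeautifulSoup (not yet part of the scraper):

    # Detect whether a URL ultimately points at a PDF, including PDFs embedded
    # in an iframe on an otherwise-HTML page.
    import requests
    from bs4 import BeautifulSoup

    def is_pdf(url, timeout=10):
        resp = requests.get(url, timeout=timeout)
        content_type = resp.headers.get("Content-Type", "").lower()
        if "pdf" in content_type:
            return True
        if "html" in content_type:
            soup = BeautifulSoup(resp.text, "html.parser")
            for frame in soup.find_all(["iframe", "embed", "object"]):
                src = frame.get("src") or frame.get("data") or ""
                if ".pdf" in src.lower():
                    return True
        return False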

Convert relative dates to absolute datetimes

An Article may have a publication date in datetime format. Dates extracted from text can often be relative or vague, e.g. "last Saturday".

Write a function to combine the article.Article publication date with dates interpreted in report.Report, in order to convert dates extracted from text into datetimes.
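One option is the dateparser library, which accepts a reference date for resolving relative expressions. A sketch, assuming the article exposes a publish_date attribute:

    # Resolve relative date strings ("last Saturday") against the article's
    # publication date. `publish_date` is an assumed attribute name.
    import dateparser

    def resolve_date(text, article):
        settings = {}
        if getattr(article, "publish_date", None) is not None:
            settings["RELATIVE_BASE"] = article.publish_date
        return dateparser.parse(text, settings=settings)  # None if unparseable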

Plot of events

Scatter plot (or other) visualising information related to events. Could include

  • Location
  • Number of events
  • Magnitude of events

among others.

pgConfig change breaks production config

This code in master breaks production:

//if not using docker
//create a pgConfig.js file in the same directory and put your credentials there
const connectionObj = require('./pgConfig');
nodejs_1   | [0] Error: Cannot find module './pgConfig'
nodejs_1   | [0]     at Function.Module._resolveFilename (module.js:470:15)
nodejs_1   | [0]     at Function.Module._load (module.js:418:25)
nodejs_1   | [0]     at Module.require (module.js:498:17)
nodejs_1   | [0]     at require (internal/module.js:20:19)
nodejs_1   | [0]     at Object.<anonymous> (/internal-displacement-web/server/pgDB/index.js:5:23)
nodejs_1   | [0]     at Module._compile (module.js:571:32)
nodejs_1   | [0]     at Object.Module._extensions..js (module.js:580:10)
nodejs_1   | [0]     at Module.load (module.js:488:32)
nodejs_1   | [0]     at tryModuleLoad (module.js:447:12)
nodejs_1   | [0]     at Function.Module._load (module.js:439:3)
nodejs_1   | [0] [nodemon] app crashed - waiting for file changes before starting...

Can this be made optional? If the file pgConfig exists, require it, otherwise use the environment variables?

Deal with update_status errors

In Pipeline.process_url we make multiple calls to article.update_status().

The update_status method may raise UnexpectedArticleStatusException if it appears that the status has been changed in the meantime.

process_url should be prepared for dealing with this exception.
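A minimal way to handle it is to catch the exception, log it, and skip (or later retry) that article. A sketch; the exception's import location and the retry policy are left open:

    # Tolerate concurrent status changes inside Pipeline.process_url.
    # from internal_displacement.article import UnexpectedArticleStatusException  (assumed path)
    import logging

    logger = logging.getLogger(__name__)

    def safe_update_status(article, new_status):
        try:
            article.update_status(new_status)
            return True
        except UnexpectedArticleStatusException:  # raised by update_status (see above)
            logger.warning("Status of %s changed underneath us; skipping", article.url)
            return False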

Implement filtering of documents not reporting on human mobility

This is the third filtering requirement from the competition guidelines, to eliminate articles which mention the word 'mobility' but are unrelated to human mobility.

As per @milanoleonardo, a possible approach:

this can be done by looking at the dependency trees of the sentences in the text to make sure there is a link between a “reporting term” and a “reporting unit” (see challenge for details). This would definitely remove all documents reporting on “hip displacement” or sentences like “displaced the body of people” etc.
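As a first approximation, that check can be done with spaCy by requiring that a reporting-term token and a reporting-unit token sit in the same dependency subtree of a sentence. The term lists below are abbreviated examples, not the full lists from the challenge:

    # Rough dependency-tree filter: keep a document only if some sentence links
    # a reporting term to a reporting unit.
    import spacy

    nlp = spacy.load("en_core_web_md")
    REPORTING_TERMS = {"displace", "evacuate", "flee", "relocate", "shelter"}
    REPORTING_UNITS = {"person", "people", "family", "household", "resident", "house"}

    def reports_human_mobility(text):
        doc = nlp(text)
        for sent in doc.sents:
            terms = [t for t in sent if t.lemma_.lower() in REPORTING_TERMS]
            units = [t for t in sent if t.lemma_.lower() in REPORTING_UNITS]
            for term in terms:
                for unit in units:
                    if term.is_ancestor(unit) or unit.is_ancestor(term):
                        return True
        return False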

Enhance country detection in article content

Enhance the country_code function in interpreter.py in order to more reliably recognize countries.
For example it currently fails for 'the United States' vs 'United States'.

It would also be good to try to detect countries even when the name is not explicitly mentioned, e.g. from city names.

The Mordecai library may be an option, however it requires its own NLP parsing and I was wondering if there was a simpler way to do this without using two NLP libraries + trained models.
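A lighter-weight alternative to pulling in a second NLP stack is a normalisation pass plus a pycountry lookup before giving up; pycountry is only a suggestion here, and this handles country names rather than cities.

    # Normalise a place name and look it up with pycountry; returns the ISO
    # alpha-3 code or None. Sketch only, not the existing country_code function.
    import pycountry

    def lookup_country_code(name):
        cleaned = name.strip()
        if cleaned.lower().startswith("the "):
            cleaned = cleaned[4:]            # "the United States" -> "United States"
        try:
            return pycountry.countries.lookup(cleaned).alpha_3
        except LookupError:
            return None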

Config for running in AWS

The docker-compose.yml and docker.env files are currently set up with local development in mind. We'll want a production-friendly config.

  • Don't run localdb
  • DB config refers to AWS RDS instance instead of localdb (please do not check credentials in to git)
  • Node.js runs production version, instead of development version

Pipeline testing for pdf articles

Make sure pipeline is working with pdf articles for different scenarios:

  • Non existent / broken url
  • Non English
  • Irrelevant
  • Relevant

Ideally include some tests in tests/test_Pipeline.py

Pipeline - consistent date and time

Haven't looked too deeply into newspaper's handling of datetimes, but if they vary from site to site we will need to make them consistent. Perhaps even storing separate values for the day, month and year published.

Python process to check for new URLs and run the pipeline on them

We would like the front end to be able to submit new URLs to process by writing an article row into the DB with a status of NEW. We need a process that runs on the back end, looks for such rows and kicks off the scraping & interpretation pipeline.

Because it takes a while to bring up the interpretation environment (loading dependencies & model), it probably makes sense to have a long-running process that spends most of its time sleeping and occasionally (once every 60s? configurable?) wakes up and looks for new DB rows to process.
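The skeleton of such a worker could look like the sketch below; the environment variable, the status value and the model import are assumptions still to be pinned down.

    # Long-running worker: load the heavy NLP machinery once, then poll for NEW
    # articles. Assumes `Article` is the SQLAlchemy model from model.py.
    # from internal_displacement.model import Article  (assumed import path)
    import os
    import time

    POLL_SECONDS = int(os.environ.get("PIPELINE_POLL_SECONDS", "60"))

    def run_worker(session, pipeline):
        while True:
            new_articles = session.query(Article).filter(Article.status == "NEW").all()
            for article in new_articles:
                pipeline.process_url(article.url)   # scrape + interpret
            time.sleep(POLL_SECONDS)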

Manage PDF scraping

PDF scraping could be hard-disk intensive and could slow down scraping when doing a bulk load of URLs. Can we:

  1. Have the option to turn off pdf scraping. What part of the code should control this?
  2. Delete a pdf as soon as it has been downloaded and parsed

Improve text extraction from URLs with beautifulsoup

There is a barebones function to extract text from the URLs. However, this hasn't been tested across many different URLs and does not necessarily do the best job of extracting the main body of relevant text.
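For comparison, a barebones BeautifulSoup extraction that keeps paragraph text and drops obvious noise might look like this; real pages will need smarter boilerplate removal.

    # Minimal main-text extraction with BeautifulSoup (sketch).
    import requests
    from bs4 import BeautifulSoup

    def extract_text(url):
        resp = requests.get(url, timeout=10)
        soup = BeautifulSoup(resp.text, "html.parser")
        for tag in soup(["script", "style", "nav", "footer"]):
            tag.decompose()                 # strip non-content elements
        paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
        return "\n".join(p for p in paragraphs if p)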

Database schema for documents

Need an initial DB schema to capture information about documents and facts.

Proposal:
Tables:

  • Article (id, URL, retrieval date, source, publication date, title, authors, language, analyzer, analysis date) -- metadata extracted from a retrieved article
  • Full Text (article, content) -- full text of the article
  • Analysis (id, article, reason, location, reporting term, reporting unit, number, metrics, analyzer, analysis date) -- analysis of a retrieved article
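Expressed as SQLAlchemy models, the proposal might look like the sketch below; column names and types are open to discussion, and the metrics/analyzer columns are omitted for brevity.

    # Sketch of the proposed schema as SQLAlchemy models.
    from sqlalchemy import Column, DateTime, ForeignKey, Integer, Text
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class Article(Base):
        __tablename__ = "article"
        id = Column(Integer, primary_key=True)
        url = Column(Text, nullable=False)
        retrieval_date = Column(DateTime)
        source = Column(Text)
        publication_date = Column(DateTime)
        title = Column(Text)
        authors = Column(Text)
        language = Column(Text)

    class Content(Base):
        __tablename__ = "content"
        article_id = Column(Integer, ForeignKey("article.id"), primary_key=True)
        content = Column(Text)

    class Analysis(Base):
        __tablename__ = "analysis"
        id = Column(Integer, primary_key=True)
        article_id = Column(Integer, ForeignKey("article.id"))
        reason = Column(Text)
        location = Column(Text)
        reporting_term = Column(Text)
        reporting_unit = Column(Text)
        number = Column(Integer)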

Explore refugee data in Jupyter Notebooks

We want people to play around with our tool and do some data analysis and visualisations. Start a notebook, see what you can make or break and let us know below.

Best NLP approach to extract useful info from articles

From the articles, we need to extract whether individuals or households are being displaced (the reporting unit), how many of them, the date the article was published, and which reporting term is most appropriate.

Scraping reliability score

Write a function in article.Article that calculates the percentage of scraped fields which are returned empty.

We may consider expanding the definition of scraping reliability later, so suggestions welcome.

Modify Location Schema to be able to distinguish between cities and country sub-divisions

The extracted location could be either a Country, a country sub-division (i.e. province or state) or a City.

We need to modify the schema for Location class in model.py to ensure that we can capture these different options:

CREATE TABLE location (
    id SERIAL PRIMARY KEY,
    description TEXT,
    city TEXT,
    state TEXT,
    country CHAR(3) REFERENCES country ON DELETE CASCADE,
    latlong TEXT
);

Generate a reliability score for a given article

In some contexts, information about IDPs is highly politicized, which could be problematic if you're drawing from media reports. You'd want to be very careful in selecting which sources you used for info about the Rohingya in Myanmar, for example.

It would be good to be able to score an article for reliability in order to help analysts as they analyze and interpret the extracted data.
In some cases, news sources may be government run, 'fake news' or have poor sources / track record, and so any data reported by and extracted from these sources should be identifiable as having potential issues.

On the front end, this could include a filter for analysts to use, whereby they can select all articles or only those with a reliability score above a certain threshold.

Some thoughts for implementation include:

  1. A maintainable list of known problematic sources
  2. Measuring similarity of reported facts between sources
  3. A maintainable list of highly trusted and common 'core' news sources and anything from these sources automatically gets a high reliability rating.
  4. New or unknown sources automatically get a lower rating unless their facts are similar enough to a report from a highly trusted source etc.

Integrate event fact extraction

Integrate work from notebooks into codebase in an attempt to extract

  • The reporting term (i.e. destroyed, displaced, etc.)
  • The reporting unit (i.e. houses, people, villages etc.)
  • The quantity referenced (i.e. 500, thousands, tens)
  • The date of the event (i.e. Saturday 09 May 2015, last Saturday)
  • The location of the event

from each article's content.
