
ner2sna's Introduction

Entity Extraction and Network Analysis

Or, how you can extract meaningful information from raw text and use it to analyze the networks of individuals hidden within your data set.

[Figure: network diagram]

We are all drowning in text. Fortunately, there are a number of data science strategies for handling the deluge. If you'd like to learn about using machine learning for this, check out my guide on document clustering. In this guide, I'm going to walk you through a strategy for making sense of massive troves of unstructured text using entity extraction and network analysis. These strategies are actively employed for legal e-discovery and within law enforcement and the intelligence community. Imagine you work at the FBI and you just uncovered a massive trove of documents on a confiscated laptop or server. What would you do? This guide offers an approach for dealing with this type of scenario. By the end of it, you'll have generated a graph like the one above, which you can use to analyze the network hidden within your data set.

Overview

We are going to take a set of documents (in our case, news articles), extract the entities within them, and build a social network based on document co-occurrence: two entities are linked when they appear in the same document. This can be a useful approach for getting a sense of which entities exist in a set of documents and how those entities might be related. I'll talk more about using document co-occurrence as the mechanism for drawing an edge in a social network graph later.
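To make the co-occurrence idea concrete, here is a minimal sketch using only the Python standard library. It is not the notebook's code: the function name `cooccurrence_edges` and the sample entity lists are hypothetical, and it assumes entities have already been extracted as one list of strings per document.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(doc_entities):
    """Count how many documents each pair of entities shares."""
    edges = Counter()
    for entities in doc_entities:
        # Deduplicate within a document so repeated mentions count once,
        # and sort so each pair has a single canonical ordering.
        unique = sorted(set(entities))
        for a, b in combinations(unique, 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical extracted entities, one list per document.
docs = [
    ["Alice Smith", "Bob Jones", "Acme Corp"],
    ["Alice Smith", "Bob Jones"],
    ["Bob Jones", "Acme Corp", "Bob Jones"],
]

edges = cooccurrence_edges(docs)
for (a, b), weight in sorted(edges.items()):
    print(f"{a} -- {b}: {weight}")
```

Each pair and its count could then be loaded into a NetworkX graph with `G.add_edge(a, b, weight=weight)`, so that entities sharing more documents get heavier edges.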

In this guide I rely on four primary pieces of software:

  1. Stanford CoreNLP
  2. FuzzyWuzzy
  3. NetworkX
  4. D3.js

If you're not familiar with these libraries, don't worry. I'll make it easy to get up and running with them.
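This overview doesn't say how fuzzy string matching fits into the pipeline. One plausible use, assumed here, is merging spelling variants of the same entity name before building the graph. The sketch below uses the standard library's `difflib` as a stand-in rather than Fuzzywuzzy itself; the helper names, the threshold of 85, and the sample names are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a case-insensitive similarity score from 0 to 100."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def merge_variants(names, threshold=85):
    """Map each name to the first earlier name it closely resembles."""
    canonical = []
    mapping = {}
    for name in names:
        # Reuse an existing canonical name if one is similar enough.
        match = next(
            (c for c in canonical if similarity(name, c) >= threshold), None
        )
        if match is None:
            canonical.append(name)
            match = name
        mapping[name] = match
    return mapping


# Hypothetical entity names containing one misspelled variant.
names = ["Barack Obama", "Barak Obama", "Angela Merkel"]
mapping = merge_variants(names)
print(mapping)
```

Fuzzywuzzy's `fuzz.ratio` also returns a score from 0 to 100, so it could take the place of `similarity` here. The right threshold depends on the data: too low merges distinct people, too high leaves duplicates in the graph.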

Running the code

You should download and run ner2sna.ipynb. It will walk you through everything you need. You can use corpus.txt as a sample data set if you'd like. Also, make sure to grab the force directory when you try to run this on your own. You need force/force.html, force/force.css, and force/force.js in order to create the chart at the end of the guide.

If you have any questions for me, feel free to reach out on Twitter to @brandonmrose or open an issue on the GitHub repo.

ner2sna's People

Contributors

brandomr


ner2sna's Issues

Better way to get entities

Instead of hacking together a function to glue NER tokens together, you can just ask CoreNLP to do it for you with the entitymentions annotator:

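In[2]:

# Assumes `nlp` is the CoreNLP client already created earlier in the notebook.
output = nlp.annotate(text, properties={
    'annotators': 'entitymentions',
    'outputFormat': 'json'
})

In[3]:

def proc_sentence(response):
    # Collect the text of every entity mention across all sentences.
    return [entity['text'] for sent in response['sentences'] for entity in sent['entitymentions']]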
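For comparison, the kind of token-gluing function the issue refers to might look roughly like the sketch below. This is not the notebook's actual function: `glue_entities` and the sample tokens are hypothetical, and it assumes each token arrives as a `(word, ner_tag)` pair, with `O` marking non-entity tokens as in CoreNLP's NER output.

```python
from itertools import groupby


def glue_entities(tagged_tokens):
    """Join runs of adjacent tokens that share a non-'O' NER tag."""
    entities = []
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != "O":
            # Concatenate the words in this run into one entity string.
            entities.append((" ".join(word for word, _ in group), tag))
    return entities


# Hypothetical tagged tokens for a single sentence.
tagged = [
    ("Barack", "PERSON"),
    ("Obama", "PERSON"),
    ("visited", "O"),
    ("Berlin", "LOCATION"),
    (".", "O"),
]

entities = glue_entities(tagged)
print(entities)
```

A weakness of this approach is that two distinct entities of the same type that sit next to each other are merged into one, which is one reason to prefer letting CoreNLP produce the mentions directly.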
