
birdview-patent-landscape's Introduction

Patent Vista

A bird's-eye view of patent landscape.

Click here to see my presentation.

Click here to see a video demo.

How to install and get it up and running

  1. Migrate data from Google Cloud to AWS S3

  2. Set up a Spark cluster on AWS (e.g., 4 EC2 instances including 1 master and 3 workers)

  3. Install Neo4j on a separate instance

  4. Set the start index and end index of the files to process in run.sh

  5. Run sh run.sh
    

Introduction


Patent Vista is a data pipeline designed to visualize patent citation relationships in order to:

  1. Spot the VIPs (very important patents) of a patent portfolio
    • More citations = more important (similar to how more retweets mean more influence)
  2. Check how a patent portfolio is structured (distribution across technical fields)
    • What is the strategy for protecting the core technologies?
  3. Identify potential licensees for monetizing patent assets
    • How can patent assets be monetized?
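
The "more citations = more important" idea can be illustrated with a simple count of how often each patent is cited. This is only a minimal sketch: the (citing, cited) pair order and the publication numbers are assumptions made for illustration, not taken from the project.

```python
from collections import Counter


def rank_by_citations(citation_pairs, top_n=3):
    """Return the top_n most-cited patents as (patent, count) tuples.

    citation_pairs is an iterable of (citing, cited) tuples. The pair
    order is an assumption made for this sketch.
    """
    counts = Counter(cited for _citing, cited in citation_pairs)
    return counts.most_common(top_n)


# Hypothetical publication numbers, for illustration only.
pairs = [
    ("US-B", "US-A"),
    ("US-C", "US-A"),
    ("US-D", "US-A"),
    ("US-D", "US-B"),
]
top = rank_by_citations(pairs, top_n=2)
print(top)
```

In the pipeline itself the same question would more likely be answered by querying the graph in Neo4j; the sketch only shows the ranking idea.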

Architecture


  1. 1.5 TB of data is migrated from Google Cloud to Amazon S3
  2. The data is split into 1667 batches of 1 GB each
  3. Each batch is loaded into Apache Spark for batch processing to extract citation information and patent ownership information
  4. The results of the batch processing are stored in Neo4j to visualize the relationships
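
The extraction in step 3 could look roughly like the function below, which turns one patent record into citation pairs and ownership pairs. It is a sketch under assumptions: the field names (publication_number, citation, assignee_harmonized) are a guess at the record layout, the sample record is made up, and the project's actual Spark job may differ.

```python
import json


def extract_edges(record):
    """Split one patent record into citation pairs and ownership pairs.

    The field names used here are assumptions about the record layout,
    not confirmed by the project.
    """
    patent = record["publication_number"]
    citations = [
        (patent, cited["publication_number"])
        for cited in record.get("citation", [])
        if cited.get("publication_number")  # skip entries without a patent number
    ]
    owners = [
        (assignee["name"], patent)
        for assignee in record.get("assignee_harmonized", [])
        if assignee.get("name")
    ]
    return citations, owners


# Hypothetical record, serialized as one JSON line.
line = json.dumps({
    "publication_number": "US-C",
    "citation": [
        {"publication_number": "US-A"},
        {"publication_number": ""},
        {"publication_number": "US-B"},
    ],
    "assignee_harmonized": [{"name": "EXAMPLE CORP"}],
})
citations, owners = extract_edges(json.loads(line))
print(citations, owners)
```

In a Spark job, a function like this could be applied to every record of a batch (for example with an RDD flatMap) before the pairs are written to Neo4j.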

Dataset


The raw dataset is the patents public data from Google Cloud.

Engineering challenges


Writing data from Spark to Neo4j was quite slow at the beginning. I used neo4j-python-driver for writing, and it took more than 20 minutes to write a single batch.

Meanwhile, neither the Neo4j JDBC driver nor neo4j-spark-connector supports writing to Neo4j, and neither has enough documentation on writing to Neo4j.

neo4j-admin import is a very handy built-in command provided by Neo4j; with it, writing a batch took only 6 seconds. The problem is that this command can only import data into an empty database; if the database already contains data, it cannot be used. As I have 1667 batches, merging the processing results of all batches just to use this command did not seem efficient.
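
For reference, merging batch results into files for neo4j-admin import might look roughly like the sketch below. It assumes the results are (start, end) tuples of patent names and uses the import tool's CSV header conventions (name:ID, :LABEL, :START_ID, :END_ID, :TYPE); the function name and sample data are hypothetical, and the exact command-line invocation depends on the Neo4j version.

```python
import csv
import io


def write_import_csv(pairs, nodes_file, rels_file):
    """Write node and relationship CSVs with neo4j-admin import style headers.

    pairs is a list of (start, end) tuples of patent names.
    """
    # Each node must appear once, so collect the distinct names first.
    names = sorted({name for pair in pairs for name in pair})

    node_writer = csv.writer(nodes_file, lineterminator="\n")
    node_writer.writerow(["name:ID", ":LABEL"])
    for name in names:
        node_writer.writerow([name, "PATENT"])

    rel_writer = csv.writer(rels_file, lineterminator="\n")
    rel_writer.writerow([":START_ID", ":END_ID", ":TYPE"])
    # Drop duplicate pairs that show up in more than one batch.
    for start, end in sorted(set(pairs)):
        rel_writer.writerow([start, end, "CITE"])


# Two hypothetical batch results that share one pair.
batch_1 = [("US-B", "US-A"), ("US-C", "US-A")]
batch_2 = [("US-C", "US-A")]
nodes_out, rels_out = io.StringIO(), io.StringIO()
write_import_csv(batch_1 + batch_2, nodes_out, rels_out)
print(nodes_out.getvalue())
print(rels_out.getvalue())
```

The deduplication step hints at the cost: all 1667 batch results have to be brought together before a single import can run.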

Then I learned how to do batch writing with Cypher:

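// $names is a list of two-element lists, each holding a pair of patent names
WITH $names AS nested
UNWIND nested AS x
// MERGE creates each patent node only if it does not exist yet
MERGE (w:PATENT {name: x[0]})
MERGE (n:PATENT {name: x[1]})
// No direction is given, so an existing CITE relationship in either direction is matched
MERGE (w)-[r:CITE]-(n)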

This is very fast for writing nodes in Neo4j, but writing the relationships could still take 50+ minutes.
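
From Python, the Cypher statement above could be driven roughly as follows: the pairs are cut into chunks and each chunk is sent as the $names parameter of a single query. This is a sketch, not the project's code. The chunk size is an arbitrary assumption, and RecordingSession is a hypothetical stand-in so the example runs without a database.

```python
BATCH_QUERY = """
WITH $names AS nested
UNWIND nested AS x
MERGE (w:PATENT {name: x[0]})
MERGE (n:PATENT {name: x[1]})
MERGE (w)-[r:CITE]-(n)
"""


def chunked(items, size):
    """Yield consecutive slices of items with at most size elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def write_pairs(session, pairs, chunk_size=10000):
    """Send pairs to Neo4j in chunks; return the number of queries issued.

    session only needs a run(query, **params) method, which is how a
    neo4j-python-driver session accepts query parameters.
    """
    queries = 0
    for chunk in chunked(pairs, chunk_size):
        session.run(BATCH_QUERY, names=[list(pair) for pair in chunk])
        queries += 1
    return queries


class RecordingSession:
    """Hypothetical stand-in for a driver session; it only records calls."""

    def __init__(self):
        self.calls = []

    def run(self, query, **params):
        self.calls.append(params)


session = RecordingSession()
pairs = [("US-B", "US-A"), ("US-C", "US-A"), ("US-D", "US-B")]
issued = write_pairs(session, pairs, chunk_size=2)
print(issued, session.calls)
```

With the real driver, the session would come from GraphDatabase.driver(uri, auth=(user, password)).session() instead of the stand-in.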

Finally I found my Swiss Army knife: creating a uniqueness constraint, which automatically creates a schema index in the database. With this, each batch takes less than 1 minute. 💪

CREATE CONSTRAINT ON (p:PATENT) ASSERT p.name IS UNIQUE
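
Since the speed-up depends on the index existing before the MERGE queries run, the constraint could be created once at the start of a load, as sketched below. The helper name, the constraint name patent_name_unique, and the stand-in session are hypothetical. The second statement is the FOR ... REQUIRE form used by newer Neo4j releases; which form applies depends on the Neo4j version, so check the documentation for yours.

```python
# Statement from this README (older Neo4j syntax).
LEGACY_CONSTRAINT = "CREATE CONSTRAINT ON (p:PATENT) ASSERT p.name IS UNIQUE"

# Form used by newer Neo4j releases; the constraint name is hypothetical.
NEWER_CONSTRAINT = (
    "CREATE CONSTRAINT patent_name_unique IF NOT EXISTS "
    "FOR (p:PATENT) REQUIRE p.name IS UNIQUE"
)


def ensure_constraint(session, newer_syntax=False):
    """Run the uniqueness constraint statement and return the text used."""
    statement = NEWER_CONSTRAINT if newer_syntax else LEGACY_CONSTRAINT
    session.run(statement)
    return statement


class RecordingSession:
    """Hypothetical stand-in for a driver session; it only records queries."""

    def __init__(self):
        self.queries = []

    def run(self, query, **params):
        self.queries.append(query)


session = RecordingSession()
used = ensure_constraint(session)
print(used)
```

Running this step before the first batch means every MERGE on name can use the index from the start.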

Trade-offs


birdview-patent-landscape's People

Contributors

shao-shuai
