Code Monkey home page Code Monkey logo

patent_similarity_search's Introduction

Finding a patent's prior art using text similarity

This repository contains research work on finding prior art for a given patent. The approach is to find the most similar documents for a given patent application by comparing them using similarity measures calculated on the documents' full texts. For further details on the experiments please refer to the paper:

@article{helmers2019automating,
  title={Automating the search for a patent's prior art with a full text similarity search},
  author={Helmers, Lea and Horn, Franziska and Biegler, Franziska and Oppermann, Tim and M{\"u}ller, Klaus-Robert},
  journal={{PLoS ONE}},
  volume={14},
  number={3},
  pages={e0212103},
  year={2019},
  publisher={Public Library of Science}
}

All the data sets needed for reproducing the analyses are available at: https://figshare.com and can be downloaded in a compressed format after sign-up

  • SQLite database-file: https://figshare.com/articles/Patent_Database/7264733
  • Patent scoring by expert and corpus subsample: https://figshare.com/articles/human_eval_tar_gz/7257215
  • Entire corpus: https://figshare.com/articles/corpus_tar_gz/7257194

Compile dataset and load it into sqlite database

Crawling patent files from google patents

  • Adapt the seed patents in the main functions in patentcollector.py
python patentcollector.py

Create SQLite DB

  • Save your patent files as .csv files with following metadata as columns: ['id', 'title', 'category', 'pub_number', 'app_number', 'pub_date', 'abstract', 'description', 'claims', 'cited_patents', 'pub_dates']
  • Adapt the path in the main function of make_patent_db.py to point to the directory containing your patent files
python make_patent_db.py

Exploratory data analysis

Evaluate Corpus statistics

Check out the category distributions in your corpus

python compare_cats.py

Run similarity search

The different feature extraction methods:

Bag-of-words with tf-idf
python idf_regression.py
Kernel-PCA
python kpca.py
Latent semantic analysis (LSA)
python lat_sem_ana.py
Word2vec
python word2vec_app.py
Doc2vec
python doc2vec.py
LICENSE

You are free to use the content of this repository under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

patent_similarity_search's People

Contributors

helmersl avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.