Invalid Patents

We want to download decisions from the Patent Trial and Appeal Board, which are published on the USPTO e-FOIA site. We can easily get decisions from the past N days by visiting http://e-foia.uspto.gov/Foia/DispatchBPAIServlet?RetrieveRecent=30 and replacing the '30' in the URL with the number of days we want.
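As a minimal sketch of the URL substitution described above, the helper below builds the address for a given number of days. The function name recent_decisions_url is hypothetical and is not taken from the scripts in this repository.

```python
from urllib.parse import urlencode

# Base address taken from the URL quoted above.
BASE_URL = "http://e-foia.uspto.gov/Foia/DispatchBPAIServlet"


def recent_decisions_url(days: int) -> str:
    """Return the listing URL for decisions from the past `days` days.

    Hypothetical helper for illustration only.
    """
    if days < 1:
        raise ValueError("days must be a positive integer")
    return f"{BASE_URL}?{urlencode({'RetrieveRecent': days})}"


print(recent_decisions_url(30))
# http://e-foia.uspto.gov/Foia/DispatchBPAIServlet?RetrieveRecent=30
```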

Using Selenium, we download the decision PDFs, extract the text from the PDF documents, and insert the full text into a RethinkDB database.

We use Slate/PDFMiner to extract text from PDF documents that contain only rendered text. Some PDF documents contain images instead, and their text is not easily extractable. For these, we use Tesseract as an OCR engine.
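One plausible way to combine the two approaches is to try direct extraction first and fall back to OCR when little or no text comes back. The sketch below shows only that decision logic. The two extractors are passed in as plain callables, and the function name and the 200-character threshold are assumptions for illustration, not how insertTexts.py necessarily decides.

```python
from typing import Callable, Tuple

# Assumed cutoff: fewer stripped characters than this suggests a scanned PDF.
MIN_CHARS = 200


def extract_with_fallback(
    pdf_path: str,
    extract_text: Callable[[str], str],
    ocr_text: Callable[[str], str],
    min_chars: int = MIN_CHARS,
) -> Tuple[str, str]:
    """Return (text, method), where method is "text" or "ocr".

    `extract_text` stands in for a Slate/PDFMiner call and `ocr_text`
    for a Tesseract call; both are supplied by the caller.
    """
    text = extract_text(pdf_path)
    if len(text.strip()) >= min_chars:
        return text, "text"
    # Too little text was recovered, so treat the PDF as image-only.
    return ocr_text(pdf_path), "ocr"


# Stub extractors so the sketch runs without any PDF libraries installed.
text, method = extract_with_fallback(
    "example.pdf",
    extract_text=lambda path: "",
    ocr_text=lambda path: "text recovered by OCR",
)
print(method)
```

Passing the extractors as arguments keeps the fallback rule separate from the libraries, so the threshold can be tuned or tested without touching any PDF files.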

Getting Set Up

You will need the following installed on your computer to run the scripts below:

RethinkDB - http://rethinkdb.com/
Python modules in requirements.txt
- you can run pip install -r requirements.txt to install these
Firefox - https://www.mozilla.org/en-US/firefox/new/
- any of the newer versions should work

Running Scripts

To run RethinkDB, start the following process in a window:

rethinkdb -d data

Make sure the required database and table are set up:

python setupRethinkDB.py

Download the PDF files from the last N days by running (expects N as an integer):

python getPDFs.py N

This will open Firefox on your computer and programmatically direct the browser to download all of the relevant PDFs, which will be placed in a newly created folder, pdf_files, in the same directory. Each file will be named according to its application number.

To run OCR on the PDFs, run

python insertTexts.py

which will automatically iterate through all the PDF documents in the pdf_files directory, extract their text (using OCR where needed), and upload the results to the RethinkDB instance you started above. The code is simple enough that you should be able to alter the script to output the data however you want.
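As an example of an alternative output, the sketch below writes one JSON record per PDF to a file instead of inserting into RethinkDB. It is not taken from insertTexts.py: the function name, the record fields, and the output file name are all assumptions, and the extraction step is passed in as a callable so the sketch stays independent of any PDF library. It relies only on the convention stated above that each file is named after its application number.

```python
import json
from pathlib import Path
from typing import Callable


def dump_texts(pdf_dir: str, out_path: str, extract: Callable[[Path], str]) -> int:
    """Write one JSON line per PDF in `pdf_dir`; return the number written.

    `extract` is whatever function turns a PDF path into text.
    The field names below are hypothetical.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
            record = {
                # The file name (minus .pdf) is assumed to be the application number.
                "application_number": pdf.stem,
                "text": extract(pdf),
            }
            out.write(json.dumps(record) + "\n")
            count += 1
    return count
```

For example, dump_texts("pdf_files", "decisions.jsonl", my_extractor) would produce a file with one decision per line, where my_extractor is your own extraction function.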

