REMERGE

Linking PATSTAT to Company databases

Michele Peruzzi, Georg Zachmann, Reinhilde Veugelers

Introduction

PATSTAT is a database published by the European Patent Office that includes information on millions of patents and patentees. Its usefulness is often limited by how little it records about the patentees themselves, and linking it to company databases has historically been a manual task. This is due to:

  • its focus on the patent applications, not on the patent applicants or inventors,
  • missing classification of patentees into categories such as individuals, companies, other organizations,
  • missing basic information on patentees, such as their address or their name.

In addition, large company databases consist mostly of non-patenting companies, so ultimately just a few patentees should be matched to a relatively small number of companies. For these reasons, advanced matching algorithms have not been used, as they make their comparisons using only the fields shared by both databases.

Remerge is a set of Python scripts for matching PATSTAT to company databases (in this case, Amadeus from Bureau van Dijk). It is not limited to comparisons between shared fields and uses as much information as possible. A Lasso regression model is estimated on the training set and then applied to the data to obtain the estimated probabilities of matching.
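To make the idea of "estimated probabilities of matching" concrete, here is a minimal sketch of an L1-penalised (Lasso) logistic regression in pure Python, fitted by proximal gradient descent. It is not the project's code (the actual model is fitted in R). The three features, the toy data, the penalty `lam`, and all function names are assumptions made for illustration. The third feature is deliberately built to carry no signal, so the penalty sets its coefficient exactly to zero.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def soft_threshold(w, t):
    # Proximal operator of t * |w|: shrinks w toward zero, exactly zero if |w| <= t.
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0

def fit_lasso_logistic(X, y, lam=0.02, step=0.5, n_iter=2000):
    """Fit intercept b and weights w by proximal gradient descent (ISTA)."""
    n, p = len(X), len(X[0])
    b, w = 0.0, [0.0] * p
    for _ in range(n_iter):
        # Residuals (predicted probability minus label) drive the gradient
        # of the mean logistic loss.
        resid = [sigmoid(b + sum(wj * xj for wj, xj in zip(w, row))) - yi
                 for row, yi in zip(X, y)]
        grad_b = sum(resid) / n
        grad_w = [sum(r * row[j] for r, row in zip(resid, X)) / n
                  for j in range(p)]
        b -= step * grad_b  # the intercept is not penalised
        w = [soft_threshold(wj - step * gj, step * lam)
             for wj, gj in zip(w, grad_w)]
    return b, w

def predict_proba(b, w, row):
    # Estimated probability that a candidate pair is a true match.
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, row)))

# Hypothetical hand-labelled pairs: (name similarity, same city, is a true match).
base = [
    (0.95, 1, 1), (0.90, 1, 1), (0.85, 0, 1), (0.75, 1, 1),
    (0.80, 0, 0), (0.50, 0, 0), (0.40, 1, 0), (0.30, 0, 0),
]
X, y = [], []
for name_sim, same_city, label in base:
    # Each pair appears twice with an uninformative feature of -1 and +1.
    for noise in (-1.0, 1.0):
        X.append([name_sim, same_city, noise])
        y.append(label)

b, w = fit_lasso_logistic(X, y)
p_close = predict_proba(b, w, [0.95, 1, 0.0])  # near-identical names, same city
p_far = predict_proba(b, w, [0.30, 1, 0.0])    # dissimilar names, same city
print(w, p_close, p_far)
```

The pair with the more similar names receives the higher estimated probability, and the uninformative feature drops out of the model, which is the variable-selection behaviour that motivates using a Lasso penalty when many candidate variables are available.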

Procedure

Starting from cleaned and geocoded data:

  1. filter_companies.py For every PATSTAT name, computes the Jaro-Winkler (JW) and Levenshtein (Lev) string distances and outputs Union(top10jw, top10lev), the union of the ten closest candidates under each distance. It also includes the computation of geo-location and the separation of names from legal identifiers, and adds Amadeus variables for later hand labeling. This is the most resource-intensive part of the algorithm.

  2. extract_sample.py Loads the raw PATSTAT and Amadeus data and the candidate matches, then takes the dataset from the previous step and asks the user to identify the true matches.

  3. remerge_sector_matrix.py (can be run before step 2) Calculates IPC-NAICS "similarity" from the unique exact matches. A unique exact match is, among all pairings between a PATSTAT name and a company, the only one in which the two names are the same. Most PATSTAT names have no exact match, and unique exact matches are rarer still.

  4. generate_vars.py and prepare_modelfit.py Generate some of the variables used by the Lasso regression.

  5. remerge_fitmodel_training.r Fits the Lasso regression model (calls some Python code). Loads R source code from regression_functions-modelmatrix.r

  6. remerge_fitmodel_wholedata.py Applies the fitted model to the whole dataset and saves the results.

  7. remerge_persontable.py (optional) Takes the matching results and returns a table of patstat_id : phat : company_id, where patstat_id is the same as person_id in PATSTAT and phat is the estimated probability of a match. The resulting table can then be loaded into an SQL server.
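The candidate filtering in step 1 can be sketched as follows. This is a simplified illustration, not the code in filter_companies.py: the function names, the tiny company list, and the choice of k are assumptions, and the geo-location and legal-identifier handling are left out. It computes both string measures in pure Python and keeps the union of the top k candidates under each.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance (insert, delete, substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def jaro(a, b):
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit, b_hit = [False] * len(a), [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == ca:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a_seq = [c for c, h in zip(a, a_hit) if h]
    b_seq = [c for c, h in zip(b, b_hit) if h]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3

def jaro_winkler(a, b, p=0.1):
    # Jaro similarity with a bonus for a shared prefix of up to 4 characters.
    j = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

def candidates(name, companies, k=10):
    # Union of the k best names by Jaro-Winkler similarity (higher is closer)
    # and the k best by Levenshtein distance (lower is closer).
    by_jw = sorted(companies, key=lambda c: -jaro_winkler(name, c))[:k]
    by_lev = sorted(companies, key=lambda c: levenshtein(name, c))[:k]
    return set(by_jw) | set(by_lev)

# Hypothetical company names, already cleaned and upper-cased.
companies = ["SIEMENS AG", "SIEMENS HEALTHCARE GMBH", "SIEMES",
             "BOSCH GMBH", "SAMSUNG"]
print(candidates("SIEMENS", companies, k=3))
```

Taking the union rather than a single ranking matters because the two measures disagree: Jaro-Winkler favours names sharing a prefix even when one is much longer, while Levenshtein favours names of similar length. Comparing every PATSTAT name against every company name in this way is quadratic, which is consistent with this step being the most resource-intensive one.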
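For step 7, loading the person table into an SQL database might look like the sketch below. It uses SQLite from the Python standard library as a stand-in for an SQL server. The table name `person_match`, the example rows, and the 0.5 cut-off in the query are all assumptions for illustration and are not taken from remerge_persontable.py.

```python
import sqlite3

# Hypothetical matching results in the patstat_id : phat : company_id layout.
person_table = [
    (101, 0.97, "DE0001"),
    (101, 0.12, "DE0002"),
    (102, 0.64, "FR0042"),
    (103, 0.08, "IT0007"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person_match (patstat_id INTEGER, phat REAL, company_id TEXT)"
)
conn.executemany("INSERT INTO person_match VALUES (?, ?, ?)", person_table)
conn.commit()

# Example query: keep only pairs whose estimated match probability is at
# least 0.5 (an arbitrary threshold chosen for this sketch).
likely = conn.execute(
    "SELECT patstat_id, phat, company_id FROM person_match "
    "WHERE phat >= ? ORDER BY patstat_id",
    (0.5,),
).fetchall()
print(likely)
conn.close()
```

Because patstat_id equals person_id in PATSTAT, a table like this can be joined directly to the PATSTAT person tables once it sits in the same database.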

remerge's Issues

data clean-up

In the procedure section you write "Starting from cleaned and geocoded data:".

In the original working paper, in the section "1. Clean and geocode the data" it is written "The technical details as well as the potential customisations are explained in the code documentation."

Data cleaning is a crucial part of this procedure. I would like to adapt the code to the latest PATSTAT version. Is it possible to have a look at the code for data cleaning?
