Comments (7)
Ah! Yes, makes sense. BTW, I just realized the final paper link wasn't updated in the notebook. It's here: https://link.springer.com/article/10.1007/s10506-018-9222-4.
If you want to cluster the patents, you can still leverage some of this:
- download the word2vec embeddings matrix
- take your set of candidate patents and normalize/tokenize the text (see tokenize.py)
- create patent embeddings by looking up embeddings for each word in the patent and combining them by either averaging the embeddings or doing some type of IDF weighting (e.g., as described in https://openreview.net/pdf?id=SyK00v5xx)
- use the patent embeddings to cluster in scikit (see http://scikit-learn.org/stable/modules/clustering.html)
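The steps above can be sketched roughly as follows. This is illustrative only: the variable names, the toy vocabulary, and the choice of KMeans are my assumptions, not code from the repo (in practice you'd load the downloaded word2vec matrix and tokenize with tokenize.py).

```python
import numpy as np
from sklearn.cluster import KMeans

def patent_embedding(tokens, embeddings, vocab):
    """Average the word vectors of all in-vocabulary tokens in one patent."""
    rows = [embeddings[vocab[t]] for t in tokens if t in vocab]
    if not rows:
        # no known tokens: fall back to a zero vector
        return np.zeros(embeddings.shape[1])
    return np.mean(rows, axis=0)

# Toy stand-ins for the real word2vec matrix and tokenized patent texts
embeddings = np.random.rand(5, 8)          # (vocab_size, embedding_dim)
vocab = {"battery": 0, "anode": 1, "cathode": 2, "solar": 3, "panel": 4}
patents = [["battery", "anode"], ["solar", "panel"], ["battery", "cathode"]]

# One embedding per patent, then cluster them
X = np.vstack([patent_embedding(p, embeddings, vocab) for p in patents])
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
```

Swapping the plain average for IDF/SIF weighting (per the OpenReview paper above) only changes `patent_embedding`; the clustering step stays the same.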
from patents-public-data.
(Below is only for the patent landscaping model - for the claims breadth model, Otto is better-suited to answer)
In terms of how much you'd pay for Cloud costs, unfortunately I don't have a good sense for that. FWIW, if you're not iterating a lot on different seed sets and you only run the big query once, it'll be cached locally so it'd just be a one-time penalty.
If you have a reasonably good local GPU, processing time should be minimal: training a model takes an hour at the very top end. For inference, it depends on how many patents you classify (that will also impact the BigQuery costs).
Local memory should also be minimal - the big memory hog is training the embeddings, but those are pre-trained and downloaded in the notebook. I don't know exact numbers though.
Suffice it to say that I'm able to run this on a Surface Pro 4 with 16 GB of RAM and it runs just fine. Training is slow there because that GPU isn't supported by TensorFlow. On a work desktop with a 1080 Ti and 64 GB of RAM, training the models takes about 5-10 minutes and memory use is barely noticeable.
Do report back with any findings you end up with, though - would be very curious.
Thanks for the blazing-fast reply! Unfortunately, after giving the paper a more thorough read-through, I'm not sure I'll get work hours to spend on this. My team is more concerned with clustering than with classifying patents, so a supervised learning approach won't do us much good.
Thanks again, I'll give these a read, and see if I can get some time to contribute!
Hi @SKalt - It sounds like you aren't going to run the supervised examples, but just in case - the costs to rerun the patent claim breadth are on the order of tens of dollars, provided you don't run the hyperparameter tuning job. Hyperparameter tuning can get expensive fast because of the number of models being run.
For clustering, in addition to the options described by @seinberg, you could cluster using some patent embeddings available in BigQuery: See the embedding_v1 field in this table.
From the description: "(Version 1) Machine-learned vector embedding based on document contents and metadata, where two documents that have similar technical content have a high dot product score of their embedding vectors."
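A minimal sketch of using those precomputed embeddings: note the table path in the query below is my assumption (the original comment's table link didn't survive), so verify it in the BigQuery console before running. The query itself is shown as a string rather than executed, since it needs google-cloud-bigquery credentials; the clustering step works on any fetched (n_docs, 64) array.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Query you would run via google-cloud-bigquery (table path is assumed):
QUERY = """
SELECT publication_number, embedding_v1
FROM `patents-public-data.google_patents_research.publications`
WHERE ARRAY_LENGTH(embedding_v1) > 0
LIMIT 10000
"""

def dot_similarity(a, b):
    """Per the field description, similar documents have a high dot product."""
    return float(np.dot(a, b))

# Stand-in for embeddings fetched from BigQuery: 6 documents x 64 dims
rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 64))

labels = AgglomerativeClustering(n_clusters=2).fit_predict(emb)
```

Since similarity here is a dot product rather than Euclidean distance, you may get better clusters by normalizing each embedding to unit length first (making the dot product a cosine similarity).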
Hi @SKalt , do you think you can help on this one #47?
The model was deleted from Cloud Storage, and I really want to try it. I would be very appreciative if you could share a local copy of the models with me if you still have them. Thanks a lot!
Unfortunately, I moved on from the job that was related to this work, so I no longer have access to my notes on this subject. I'm afraid I'll be of no help.