Comments (7)
Ah! Yes, makes sense. BTW, I just realized the final paper link wasn't updated in the notebook. It's here: https://link.springer.com/article/10.1007/s10506-018-9222-4.
If you want to cluster the patents, you can still leverage some of this:
- download the word2vec embeddings matrix
- take your set of candidate patents and normalize/tokenize the text (see tokenize.py)
- create patent embeddings by looking up embeddings for each word in the patent and combining them by either averaging the embeddings or doing some type of IDF weighting (e.g., as described in https://openreview.net/pdf?id=SyK00v5xx)
- use the patent embeddings to cluster in scikit (see http://scikit-learn.org/stable/modules/clustering.html)
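The steps above can be sketched roughly as follows. This is illustrative only: the variable names, the toy vocabulary, and the choice of KMeans are my assumptions, not code from the repo (in practice you'd load the downloaded word2vec matrix and tokenize with tokenize.py).

```python
import numpy as np
from sklearn.cluster import KMeans

def patent_embedding(tokens, embeddings, vocab):
    """Average the word vectors of all in-vocabulary tokens in one patent."""
    rows = [embeddings[vocab[t]] for t in tokens if t in vocab]
    if not rows:
        # no known tokens: fall back to a zero vector
        return np.zeros(embeddings.shape[1])
    return np.mean(rows, axis=0)

# Toy stand-ins for the real word2vec matrix and tokenized patent texts
embeddings = np.random.rand(5, 8)          # (vocab_size, embedding_dim)
vocab = {"battery": 0, "anode": 1, "cathode": 2, "solar": 3, "panel": 4}
patents = [["battery", "anode"], ["solar", "panel"], ["battery", "cathode"]]

# One embedding per patent, then cluster them
X = np.vstack([patent_embedding(p, embeddings, vocab) for p in patents])
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
```

Swapping the plain average for IDF/SIF weighting (per the OpenReview paper above) only changes `patent_embedding`; the clustering step stays the same.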
from patents-public-data.
(Below is only for the patent landscaping model - for the claims breadth model, Otto is better-suited to answer)
In terms of how much you'd pay for Cloud costs, unfortunately I don't have a good sense for that. FWIW, if you're not iterating a lot on different seed sets and you only run the big query once, it'll be cached locally so it'd just be a one-time penalty.
If you have a reasonably good local GPU, processing time should be minimal: training a model takes an hour at the very top end. For inference, it depends on how many patents you classify (that will also impact the BigQuery costs).
Local memory should also be minimal - the big memory hog is training the embeddings, but those are pre-trained and downloaded in the notebook. I don't know exact numbers though.
Suffice it to say that I'm able to run this on a Surface Pro 4 with 16 GB of RAM and it runs just fine. Training is slow there because that GPU isn't supported by TensorFlow. On a work desktop with a 1080 Ti and 64 GB of RAM, training the models takes about 5-10 minutes and memory use is barely noticeable.
Do report back with any findings you end up with, though - would be very curious.
Thanks for the blazing-fast reply! Unfortunately, after giving the paper a more thorough read-through, I'm not sure I'll get work hours to spend on this. My team is more concerned with clustering than with classifying patents, so a supervised learning approach won't do us much good.
Thanks again, I'll give these a read, and see if I can get some time to contribute!
Hi @SKalt - It sounds like you aren't going to run the supervised examples, but just in case - the costs to rerun the patent claim breadth are on the order of tens of dollars, provided you don't run the hyperparameter tuning job. Hyperparameter tuning can get expensive fast because of the number of models being run.
For clustering, in addition to the options described by @seinberg, you could cluster using some patent embeddings available in BigQuery: See the embedding_v1 field in this table.
From the description: "(Version 1) Machine-learned vector embedding based on document contents and metadata, where two documents that have similar technical content have a high dot product score of their embedding vectors."
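A minimal sketch of using those precomputed embeddings: note the table path in the query below is my assumption (the original comment's table link didn't survive), so verify it in the BigQuery console before running. The query itself is shown as a string rather than executed, since it needs google-cloud-bigquery credentials; the clustering step works on any fetched (n_docs, 64) array.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Query you would run via google-cloud-bigquery (table path is assumed):
QUERY = """
SELECT publication_number, embedding_v1
FROM `patents-public-data.google_patents_research.publications`
WHERE ARRAY_LENGTH(embedding_v1) > 0
LIMIT 10000
"""

def dot_similarity(a, b):
    """Per the field description, similar documents have a high dot product."""
    return float(np.dot(a, b))

# Stand-in for embeddings fetched from BigQuery: 6 documents x 64 dims
rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 64))

labels = AgglomerativeClustering(n_clusters=2).fit_predict(emb)
```

Since similarity here is a dot product rather than Euclidean distance, you may get better clusters by normalizing each embedding to unit length first (making the dot product a cosine similarity).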
Hi @SKalt , do you think you can help on this one #47?
The model was deleted from Cloud Storage, and I really want to try it. I would be very appreciative if you could share a local copy of the models with me if you still have them. Thanks a lot!
Unfortunately, I moved on from the job that was related to this work, so I no longer have access to my notes on this subject. I'm afraid I'll be of no help.