
Comments (7)

feltenberger avatar feltenberger commented on May 21, 2024

Ah! Yes, makes sense. BTW, I just realized the final paper link wasn't updated in the notebook. It's here: https://link.springer.com/article/10.1007/s10506-018-9222-4.

If you want to cluster the patents, you can still leverage some of this:

  • download the word2vec embeddings matrix
  • take your set of candidate patents and normalize/tokenize the text (see tokenize.py)
  • create patent embeddings by looking up embeddings for each word in the patent and combining them by either averaging the embeddings or doing some type of IDF weighting (e.g., as described in https://openreview.net/pdf?id=SyK00v5xx)
  • use the patent embeddings to cluster in scikit (see http://scikit-learn.org/stable/modules/clustering.html)
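The steps above can be sketched roughly as follows. This is a toy illustration, not the repo's actual pipeline: the embedding matrix here is a small random stand-in for the downloaded word2vec vectors, and `tokenize` is a hypothetical placeholder for the normalization done in tokenize.py.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the downloaded word2vec embedding matrix: word -> 50-dim vector.
rng = np.random.default_rng(0)
vocab = ["battery", "electrode", "lithium", "antenna", "signal", "modulation"]
embeddings = {w: rng.standard_normal(50) for w in vocab}

def tokenize(text):
    # Placeholder for the repo's tokenize.py normalization/tokenization.
    return text.lower().split()

def embed_patent(text):
    # Combine word embeddings by simple averaging (IDF weighting is the
    # alternative described in the comment above).
    vecs = [embeddings[t] for t in tokenize(text) if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(50)

patents = [
    "Lithium battery electrode coating",
    "Battery electrode with lithium",
    "Antenna signal modulation circuit",
]
X = np.stack([embed_patent(p) for p in patents])

# Cluster the patent embeddings with scikit-learn.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

With real embeddings you would load the matrix from the downloaded file instead of generating random vectors, and any clustering algorithm from scikit-learn (KMeans, DBSCAN, agglomerative) can be swapped in at the last step.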

from patents-public-data.

feltenberger avatar feltenberger commented on May 21, 2024

(Below is only for the patent landscaping model - for the claims breadth model, Otto is better-suited to answer)

In terms of how much you'd pay for Cloud costs, unfortunately I don't have a good sense for that. FWIW, if you're not iterating a lot on different seed sets and you only run the big query once, it'll be cached locally so it'd just be a one-time penalty.

If you have a reasonably good local GPU, processing time should be minimal: an hour to train a model at the very top end. For inference, it depends on how many patents you classify (which will also affect the BigQuery costs).

Local memory should also be minimal - the big memory hog is training the embeddings, but those are pre-trained and downloaded in the notebook. I don't know exact numbers though.

Suffice it to say that it runs just fine on a Surface Pro 4 with 16 GB of RAM, though training is slow because that GPU isn't supported by TensorFlow. On a work desktop with a 1080 Ti and 64 GB of RAM, training the models takes about 5-10 minutes and memory use is barely noticeable.

Do report back with any findings you end up with, though - would be very curious.


SKalt avatar SKalt commented on May 21, 2024

Thanks for the blazing-fast reply! Unfortunately, after giving the paper a more thorough read-through, I'm not sure I'll get work hours to spend on this. My team is concerned with clustering rather than classification of patents, so a supervised learning approach won't do us much good.


SKalt avatar SKalt commented on May 21, 2024

Thanks again, I'll give these a read, and see if I can get some time to contribute!


ostegm avatar ostegm commented on May 21, 2024

Hi @SKalt - it sounds like you aren't going to run the supervised examples, but just in case: the cost to rerun the patent claim breadth model is on the order of tens of dollars, provided you don't run the hyperparameter tuning job. Hyperparameter tuning can get expensive fast because of the number of models it trains.

For clustering, in addition to the options described by @seinberg, you could cluster using some patent embeddings available in BigQuery: See the embedding_v1 field in this table.

From the description: "(Version 1) Machine-learned vector embedding based on document contents and metadata, where two documents that have similar technical content have a high dot product score of their embedding vectors."
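Per that description, similarity between two patents is the dot product of their embedding_v1 vectors, and the vectors can be clustered directly once fetched. A minimal sketch, using toy vectors as stand-ins for real embedding_v1 values pulled from BigQuery:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-ins for embedding_v1 vectors fetched from the BigQuery table.
emb = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
])

# Per the field description, similar documents have a high dot product.
sim = emb @ emb.T

# Cluster the embedding vectors directly with scikit-learn.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(emb)
```

In practice you would fetch the embedding_v1 arrays with a BigQuery query and stack them into the matrix before clustering; the dot-product matrix is also usable as a precomputed similarity for algorithms that accept one.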


peiyu-wang avatar peiyu-wang commented on May 21, 2024

Hi @SKalt, do you think you could help with #47?
The model has been deleted from cloud storage, and I'd really like to try it. I'd greatly appreciate it if you could share a local copy of the models if you still have them. Thanks a lot!


SKalt avatar SKalt commented on May 21, 2024

Unfortunately, I've moved on from the job related to this work, so I no longer have access to my notes on the subject. I'm afraid I'll be of no help.

