finalfusion / finalfusion-python
Finalfusion embeddings in Python
Home Page: https://finalfusion.github.io/
License: Other
Put a very small model in the repo so we can run some tests from within Python? Otherwise I'm not sure how to include tests for the CI to pick up.
Individual lookups for quantized embeddings are extremely slow, batched lookups are a lot faster.
Support for this will be in the next finalfusion release. Update the finalfusion dependency once this is available.
Not sure if there's more, but that's what I remember:
I think AppVeyor is processing the builds sequentially, which takes a lot of time. We are currently building on both CI services when something is pushed to a branch and also when a PR is made.
Maybe we can restrict the AppVeyor builds to pull requests and releases? I'm not sure how release builds are triggered, so I don't know if this is actually doable.
Investigate current possibilities for documentation generation in pyo3 and pyo3-pack. Ideally, we'd generate something similar to readthedocs. This is the last item from #5.
In particular:
E.g. __getitem__ on the storage types is missing that among other methods. That leaves a lot of code paths un-analyzed because they implicitly become Any.
Might also be relevant for finalfusion-rust, I think we don't have finalfusion files with those chunks for testing.
Downloading a precompiled version of pyo3-pack would shave off more than 10 minutes from build times. Any objections to this?
Although it seems like something is messed up for the 0.6.1 binary, at least it's not possible for me to run it locally or on CI. Bash just states that the file or directory doesn't exist. The newest release works just fine locally.
vocab.__contains__ performs a linear search through the vocab, making it incredibly slow.
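One common way to avoid a linear scan is to keep a word-to-index mapping next to the word list. The sketch below only illustrates that idea: IndexedVocab and its index method are hypothetical names, not the finalfusion API.

```python
class IndexedVocab:
    """Hypothetical vocab that keeps a dict next to the word list."""

    def __init__(self, words):
        self._words = list(words)
        # Build the reverse mapping once; membership tests then hash
        # the word instead of scanning the whole list.
        self._index = {word: idx for idx, word in enumerate(self._words)}

    def __contains__(self, word):
        return word in self._index

    def __len__(self):
        return len(self._words)

    def index(self, word):
        """Return the position of `word`, or None if it is unknown."""
        return self._index.get(word)


vocab = IndexedVocab(["berlin", "paris", "tübingen"])
print("paris" in vocab, "london" in vocab)
```

The mapping costs extra memory proportional to the vocab size, in exchange for membership tests that no longer depend on the number of words.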
Just stumbled over this. Could we also have a wheel for Python 3.6 and 3.5 on macOS?
Most of our methods raise exceptions if the model can't produce an embedding for the given input, e.g. embedding() and similarity(), while we could also return None in those cases.
I'm not sure what's canonical Python here, but I'd prefer to do something like:
    embedding = embeds.Embedding("something oov")
    if embedding is None:
        embedding = generic_oov_embed()
over
    try:
        embedding = embeds.Embedding("something oov")
    except KeyError:
        embedding = generic_oov_embed()
What's your take on this?
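For comparison, Python's own dict offers both styles side by side: indexing raises KeyError, while get() returns a default. A minimal sketch of an embeddings object following that convention is shown here. ToyEmbeddings, its default parameter, and the fallback vector are all hypothetical and say nothing about how the bindings actually behave.

```python
class ToyEmbeddings:
    """Hypothetical stand-in for an embeddings object."""

    def __init__(self, table):
        self._table = dict(table)

    def __getitem__(self, word):
        # Strict lookup: raises KeyError for unknown words.
        return self._table[word]

    def embedding(self, word, default=None):
        # Lenient lookup: returns `default` for unknown words.
        return self._table.get(word, default)


embeds = ToyEmbeddings({"known": [0.1, 0.2]})
fallback = [0.0, 0.0]

# Lenient style: no exception handling needed at the call site.
lenient = embeds.embedding("something oov", default=fallback)

# Strict style: the caller handles the KeyError.
try:
    strict = embeds["something oov"]
except KeyError:
    strict = fallback
```

Offering both lets callers pick the style that fits their code, at the cost of a slightly larger interface.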
https://finalfusion.github.io/python is now outdated
Just wanted to have your opinion on this @sebpuetz. I think my objection at the time was that we had to keep two implementations (Rust, Python) in sync. But:
If you try to get an embedding without having NumPy installed in the Python environment, you get a panic. Is it possible to have NumPy be automatically installed as a dependency when you run pip install finalfusion?
setup.py: include README, update URL, add other URLs #106 & #104
I think the other issues can be resolved in 0.7.1 (e.g. error handlers for bad utf8 or batched lookups).
Is there anything else that you think should be done before the release?
Ensure that all methods are covered by unit tests (as far as possible).
As of now, everything assumes all words are valid UTF-8. Perhaps add a lossy arg analogous to finalfusion-rust in the read method(s).
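What a lossy flag could mean on the Python side can be sketched with the standard bytes.decode error handlers. The helper decode_word and its lossy parameter are hypothetical and only illustrate strict versus replacing behaviour; they are not part of the bindings.

```python
def decode_word(raw, lossy=False):
    """Decode the bytes of a word as UTF-8.

    With lossy=False, invalid bytes raise UnicodeDecodeError.
    With lossy=True, each invalid sequence becomes U+FFFD.
    """
    errors = "replace" if lossy else "strict"
    return raw.decode("utf-8", errors=errors)


valid = "über".encode("utf-8")
broken = b"caf\xe9"  # Latin-1 encoded "café", which is not valid UTF-8

print(decode_word(valid))
print(decode_word(broken, lossy=True))
```

Strict decoding stays the default here, so malformed input is still reported unless the caller explicitly opts into replacement.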
The interface could use getters for the various model parameters specified here, particularly the commonly used ones such as dimension, vocab size, context size, etc.
If no norms are passed to Embeddings, normalize embeddings and add norms to the embeddings.
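The normalization step itself is small: compute the L2 norm of every row, divide the row by it, and keep the norms so the original vectors can be restored. The sketch below uses plain Python lists to stay dependency-free. normalize_rows is a hypothetical helper, and leaving all-zero rows unchanged is an assumption rather than documented finalfusion behaviour.

```python
import math


def normalize_rows(matrix):
    """Return (unit_rows, norms) for a list of embedding rows."""
    unit_rows, norms = [], []
    for row in matrix:
        norm = math.sqrt(sum(x * x for x in row))
        norms.append(norm)
        # Leave all-zero rows untouched to avoid dividing by zero.
        unit_rows.append([x / norm for x in row] if norm > 0 else list(row))
    return unit_rows, norms


unit_rows, norms = normalize_rows([[3.0, 4.0], [0.0, 0.0]])
print(unit_rows, norms)
```

Multiplying each unit row by its stored norm recovers the original embedding, which is why the norms are worth keeping alongside the normalized matrix.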
On our finalfusion website:
https://finalfusion.github.io/python
similarity is now word_similarity.
I think I missed this during reviewing, but the analogy integration test has:
export LC_ALL=en_US.UTF-8
This fails in two cases:
@sebpuetz Do you encounter problems when e.g. using export LC_ALL=C? (Which should avoid doing any parsing of character data.)
Once PyO3/pyo3@ac28a31 is released, we should remove the workaround for embedding_similarity introduced in #45.
We have added quite a few features in the last month, e.g. the read methods for other formats, exposure of subword indices and the embedding similarity method.
Once #51 is done we could cut a new release, unless there is something else in the pipeline for the Python bindings.
What do you think?
Most can be copied from https://github.com/sebpuetz/ffp but I'll need to go through the sphinx stuff again to sync the changes here.