The finalfusion-python from finalfusion

Increase PyPI bus factor

If you want, I can also add you to the PyPI finalfusion account. That would increase the bus factor by 2, but I'd need your PyPI handles.

Set up infrastructure to automatically build Python wheels

Add tests.

Put a very small model in the repo so we can run some tests from within Python? Otherwise I'm not sure how to include tests for the CI to pick up.

Batched embedding lookup

Individual lookups for quantized embeddings are extremely slow; batched lookups are a lot faster.
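As a rough illustration of what a batched interface could look like, here is a pure-Python sketch with a toy product quantizer. Everything in it is an assumption made for illustration: `embedding_batch`, `reconstruct`, the codebooks and the "found" mask are hypothetical names, not part of the finalfusion API. In the real bindings the gain would presumably come from doing the whole reconstruction in one call instead of one call per word.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Toy product quantizer (hypothetical data): 2 subquantizers,
# each holding 2 centroids of length 2.
CODEBOOKS: List[List[List[float]]] = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[2.0, 2.0], [3.0, 3.0]],
]

# Quantized storage: one centroid index per subquantizer for every word.
CODES: Dict[str, Tuple[int, ...]] = {"a": (0, 1), "b": (1, 0)}


def reconstruct(code: Sequence[int]) -> List[float]:
    """Concatenate the centroids selected by ``code``."""
    vector: List[float] = []
    for codebook, centroid_idx in zip(CODEBOOKS, code):
        vector.extend(codebook[centroid_idx])
    return vector


def embedding_batch(words: Sequence[str]) -> Tuple[List[List[float]], List[bool]]:
    """Hypothetical batched lookup: one row per word plus a found mask.

    Unknown words get a zero row and ``False`` in the mask.
    """
    dims = sum(len(codebook[0]) for codebook in CODEBOOKS)
    rows: List[List[float]] = []
    found: List[bool] = []
    for word in words:
        code: Optional[Tuple[int, ...]] = CODES.get(word)
        found.append(code is not None)
        rows.append(reconstruct(code) if code is not None else [0.0] * dims)
    return rows, found
```

A caller would pass the whole list of words at once, e.g. `rows, found = embedding_batch(["a", "x", "b"])`, and use `found` to decide how to treat the unknown entries.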

Support memory-mapped quantized storage

Support for this will be in the next finalfusion release. Update the finalfusion dependency once this is available.

Add convenience methods etc.

Not sure if there's more, but that's what I remember:

Method to get the full embedding matrix, suggested by @Blubberli
numpy compatibility, suggested by @twuebi
make vocab accessible from python?
Check documentation

Restrict AppVeyor builds to pull requests and releases

I think AppVeyor processes the builds sequentially, which takes a lot of time. We are currently building on both CI services when something is pushed to a branch and also when a PR is made.

Maybe we can restrict the AppVeyor builds to pull requests and releases? I'm not sure how release builds are triggered, so I don't know if this is actually doable.

Investigate generation of documentation

Investigate current possibilities for documentation generation in pyo3 and pyo3-pack. Ideally, we'd generate something similar to readthedocs. This is the last item from #5.

Implement Analogies & Similarities

Add typing to all methods

E.g. __getitem__ on the storage types is missing type annotations, as are other methods. That leaves a lot of code paths unanalyzed because their types implicitly become Any.
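A minimal sketch of what an annotated `__getitem__` could look like, using `typing.overload` so that a type checker can tell integer indexing from slicing. `ToyStorage` is a hypothetical stand-in backed by plain lists, not the actual storage class.

```python
from typing import List, Sequence, Union, overload


class ToyStorage:
    """Hypothetical stand-in for a storage type, backed by plain lists."""

    def __init__(self, rows: Sequence[Sequence[float]]) -> None:
        self._rows: List[List[float]] = [list(row) for row in rows]

    @overload
    def __getitem__(self, key: int) -> List[float]: ...

    @overload
    def __getitem__(self, key: slice) -> List[List[float]]: ...

    def __getitem__(
        self, key: Union[int, slice]
    ) -> Union[List[float], List[List[float]]]:
        # An int selects one row, a slice selects a list of rows.
        return self._rows[key]

    def __len__(self) -> int:
        return len(self._rows)
```

With the overloads in place, `storage[0]` is checked as a single row and `storage[0:2]` as a list of rows, instead of both falling back to Any.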

Test files with ExplicitVocab & FastTextVocab

Might also be relevant for finalfusion-rust, I think we don't have finalfusion files with those chunks for testing.

Download pyo3-pack instead of cargo install

Downloading a precompiled version of pyo3-pack would shave off more than 10 minutes from build times. Any objections to this?

However, something seems to be wrong with the 0.6.1 binary: I can't run it locally or on CI. Bash just states that the file or directory doesn't exist. The newest release works just fine locally.

__contains__ on vocab is slow

vocab.__contains__ performs a linear search through the vocab, making it incredibly slow.
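The difference can be sketched with two toy vocabularies: one that scans its word list on every membership test, and one that keeps a word-to-index dict next to the list. Both classes are hypothetical and only illustrate the idea; they are not the actual vocab implementation.

```python
from typing import Dict, Iterable, List, Optional


class ListVocab:
    """Mimics the slow behaviour: membership scans the whole word list."""

    def __init__(self, words: Iterable[str]) -> None:
        self._words: List[str] = list(words)

    def __contains__(self, word: object) -> bool:
        return word in self._words  # linear in the vocabulary size


class IndexedVocab:
    """Keeps a dict from word to index for constant-time lookups on average."""

    def __init__(self, words: Iterable[str]) -> None:
        self._words: List[str] = list(words)
        self._index: Dict[str, int] = {w: i for i, w in enumerate(self._words)}

    def __contains__(self, word: object) -> bool:
        return word in self._index  # hash lookup

    def idx(self, word: str) -> Optional[int]:
        """Return the index of ``word`` or None if it is unknown."""
        return self._index.get(word)
```

The extra dict costs memory proportional to the vocabulary size, which is usually a small price next to the embedding matrix itself.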

3.5 and 3.6 wheels for macOS

Just stumbled over this. Could we also have wheels for Python 3.5 and 3.6 on macOS?

Release 0.4 branch

I would like to branch a new release, primarily to ensure that we are in sync with finalfusion and to get analogy queries with masking. It is nice to have the norms functionality as well.

@sebpuetz : do you want to get #32 in before the release?

Return None instead of raising exceptions

Most of our methods, e.g. embedding() and similarity(), raise exceptions if the model can't produce an embedding for the given input. We could also return None in those cases.

I'm not sure what's canonical Python here, but I'd prefer to do something like:

embedding = embeds.embedding("something oov")
if embedding is None:
    embedding = generic_oov_embed()

over

try:
    embedding = embeds.embedding("something oov")
except KeyError:
    embedding = generic_oov_embed()

What's your take on this?
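For comparison, Python's own dict offers both styles: `d[key]` raises KeyError, while `d.get(key, default)` returns a fallback. A toy sketch of embeddings that follow the same split is shown below; `ToyEmbeddings` and its `default` parameter are hypothetical and only illustrate the design option, not the current API.

```python
from typing import Dict, List, Mapping, Optional


class ToyEmbeddings:
    """Hypothetical embeddings offering both lookup styles, like dict."""

    def __init__(self, vectors: Mapping[str, List[float]]) -> None:
        self._vectors: Dict[str, List[float]] = dict(vectors)

    def embedding(
        self, word: str, default: Optional[List[float]] = None
    ) -> Optional[List[float]]:
        """Return the embedding, or ``default`` (None) for unknown words."""
        return self._vectors.get(word, default)

    def __getitem__(self, word: str) -> List[float]:
        """Strict lookup: raises KeyError for unknown words."""
        return self._vectors[word]


def generic_oov_embed() -> List[float]:
    """Placeholder fallback vector for out-of-vocabulary words."""
    return [0.0, 0.0]


embeds = ToyEmbeddings({"known": [1.0, 2.0]})
embedding = embeds.embedding("something oov")
if embedding is None:
    embedding = generic_oov_embed()
```

Callers that want the exception can keep using the indexing form, so neither style has to be given up.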

Update the finalfusion.io page

https://finalfusion.github.io/python is now outdated

Consider replacing finalfusion-python by ffp?

Just wanted to have your opinion on this, @sebpuetz. I think my objection at the time was that we would have to keep two implementations (Rust, Python) in sync. But:

finalfusion is mostly 'done' (API stable, data format stable), so having two implementations is less of an issue now.
In the meantime, we have seen cases of unsoundness in pyo3.

Numpy dependency should be specified

If you try to get an embedding without having NumPy installed in the Python environment, you get a panic. Is it possible to have NumPy installed automatically as a dependency when you run pip install finalfusion?

Release 0.7

Update setup.py: include README, update URL, add other URLs #106 & #104
Documentation #104
mention scripts in docs #116
Add release workflow #112
~~Fix installing wheels in CI on Windows~~ not going to install & test wheels for windows on release workflows.
Add MANIFEST.in for sdists #113
Mention analogy + similarity scripts in README #114

I think the other issues can be resolved in 0.7.1 (e.g. error handlers for bad utf8 or batched lookups).

Is there anything else that you think should be done before the release?

Improve unit test coverage

Ensure that all methods are covered by unit tests (as far as possible).

Handling malformed UTF8

As of now, everything assumes all words are valid UTF-8. Perhaps add a lossy argument to the read method(s), analogous to finalfusion-rust.
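In plain Python, such a switch could map onto the `errors` argument of `bytes.decode`. The sketch below assumes a hypothetical helper `decode_word` with a `lossy` flag; the name and signature are illustrative and not taken from the code base.

```python
def decode_word(raw: bytes, lossy: bool = False) -> str:
    """Decode a vocabulary entry from UTF-8.

    With ``lossy=False`` malformed input raises UnicodeDecodeError.
    With ``lossy=True`` malformed sequences are replaced by U+FFFD.
    """
    errors = "replace" if lossy else "strict"
    return raw.decode("utf-8", errors=errors)
```

Keeping strict decoding as the default means that corrupt files are still reported unless the caller explicitly opts in to replacement.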

Expose model parameters

The interface could use getters for the various model parameters specified here, particularly the commonly used ones such as dimension, vocab size, context size, etc.

Add scripts

Conversion scripts between embedding formats finalfusion/ffp#26
Bucket-to-explicit #109
Similarity & Analogies #111

Add static method to read fastText embeddings

Normalize storage in Embeddings constructor.

If no norms are passed to Embeddings, normalize embeddings and add norms to the embeddings.
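The normalization step itself can be sketched as follows: compute the l2 norm of every row, divide the row by it, and keep the norms so that the original vectors can be recovered. The function name `normalize_rows` and the handling of all-zero rows (left unchanged) are assumptions made for this sketch.

```python
import math
from typing import List, Sequence, Tuple


def normalize_rows(
    matrix: Sequence[Sequence[float]],
) -> Tuple[List[List[float]], List[float]]:
    """Return l2-normalized rows together with the original row norms.

    Rows with norm 0 are copied unchanged to avoid dividing by zero.
    """
    normalized: List[List[float]] = []
    norms: List[float] = []
    for row in matrix:
        norm = math.sqrt(sum(x * x for x in row))
        norms.append(norm)
        if norm > 0.0:
            normalized.append([x / norm for x in row])
        else:
            normalized.append(list(row))
    return normalized, norms
```

Multiplying a normalized row by its stored norm gives back the original embedding, which is why the norms are kept next to the storage.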

Update usage page on finalfusion.github.io

On our finalfusion website:
https://finalfusion.github.io/python

Some things are outdated. E.g. similarity is now word_similarity.
Some things are missing (reading word2vec/text/fastText embeddings).

Add README

Use of locale in analogy integration test

I think I missed this during reviewing, but the analogy integration test has:

export LC_ALL=en_US.UTF-8

This fails in two cases:

Sandboxed builds, in which glibc locales are typically not available.
People who have systems without this locale installed (e.g. because they use some non-English locale).

@sebpuetz Do you encounter problems when e.g. using export LC_ALL=C? (Which should avoid doing any parsing of character data.)

Remove workaround for skip-set

Once PyO3/pyo3@ac28a31 is released, we should remove the workaround for embedding_similarity introduced in #45.

Release 0.5

We have added quite a few features in the last month, e.g. the read methods for other formats, exposure of subword indices and the embedding similarity method.

Once #51 is done we could cut a new release, unless there is something else in the pipeline for the Python bindings.

What do you think?

Build macOS Python 3.7 wheels automatically

Fix up docstrings and set up API doc generation

Most can be copied from https://github.com/sebpuetz/ffp but I'll need to go through the sphinx stuff again to sync the changes here.
