finalfusion-python's People

Contributors

danieldk, sebpuetz

finalfusion-python's Issues

Add tests.

Put a very small model in the repo so we can run some tests from within Python? Otherwise I'm not sure how to include tests for the CI to pick up.

Batched embedding lookup

Individual lookups for quantized embeddings are extremely slow; batched lookups are a lot faster.
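A minimal sketch of what a batched lookup could look like, assuming the cost sits in entering the storage once per word. ToyStorage, reconstruct and embedding_batch are hypothetical names for illustration, not part of the finalfusion API.

```python
from typing import Dict, List, Sequence


class ToyStorage:
    """Hypothetical stand-in for a quantized storage; not the finalfusion API."""

    def __init__(self, rows: List[List[float]]) -> None:
        self._rows = rows
        self.calls = 0  # counts how often the (assumed expensive) entry point is hit

    def reconstruct(self, indices: Sequence[int]) -> List[List[float]]:
        # A real quantized storage would decode codebook entries here.
        self.calls += 1
        return [list(self._rows[i]) for i in indices]


def embedding_batch(
    vocab: Dict[str, int], storage: ToyStorage, words: Sequence[str]
) -> List[List[float]]:
    """Resolve all indices first, then enter the storage once for the whole batch."""
    indices = [vocab[w] for w in words]  # raises KeyError for unknown words
    return storage.reconstruct(indices)


vocab = {"berlin": 0, "paris": 1, "rome": 2}
storage = ToyStorage([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
batch = embedding_batch(vocab, storage, ["rome", "berlin"])
```

The whole batch enters the storage once, so any fixed per-call overhead is paid a single time rather than once per word.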

Add convenience methods etc.

Not sure if there's more, but that's what I remember:

  • Method to get the full embedding matrix, suggested by @Blubberli
  • numpy compatibility, suggested by @twuebi
  • make vocab accessible from python?
  • Check documentation

Restrict AppVeyor Builds to Pull Requests and Releases

I think AppVeyor processes the builds sequentially, which takes a lot of time. We currently build on both CI services when something is pushed to a branch and again when a PR is opened.

Maybe we can restrict the AppVeyor builds to pull requests and releases? I'm not sure how release builds are triggered, so I don't know if this is actually doable.

Investigate generation of documentation

Investigate current possibilities for documentation generation in pyo3 and pyo3-pack. Ideally, we'd generate something similar to readthedocs. This is the last item from #5.

Add typing to all methods

E.g. __getitem__ on the storage types is missing that among other methods. That leaves a lot of code paths un-analyzed because they implicitly become Any.
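As a rough illustration of the kind of annotation meant here, the sketch below types __getitem__ with typing.overload so that a checker can tell an integer index from a slice. TypedStorage is a hypothetical toy class, not one of the actual storage types.

```python
from typing import List, Union, overload


class TypedStorage:
    """Hypothetical toy storage used only to show annotated __getitem__."""

    def __init__(self, rows: List[List[float]]) -> None:
        self._rows = rows

    @overload
    def __getitem__(self, key: int) -> List[float]: ...

    @overload
    def __getitem__(self, key: slice) -> List[List[float]]: ...

    def __getitem__(
        self, key: Union[int, slice]
    ) -> Union[List[float], List[List[float]]]:
        # A slice yields several rows, an int yields a single row.
        if isinstance(key, slice):
            return [list(row) for row in self._rows[key]]
        return list(self._rows[key])


typed = TypedStorage([[1.0, 2.0], [3.0, 4.0]])
single_row = typed[1]
row_slice = typed[0:1]
```

With the overloads in place, a type checker can infer List[float] for typed[1] instead of falling back to Any.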

Download pyo3-pack instead of cargo install

Downloading a precompiled version of pyo3-pack would shave off more than 10 minutes from build times. Any objections to this?

However, something seems to be wrong with the 0.6.1 binary: I can't run it locally or on CI. Bash just states that the file or directory doesn't exist. The newest release works fine locally.

Release 0.4 branch

I would like to branch a new release, primarily to ensure that we are in sync with finalfusion and to get analogy queries with masking. It is nice to have the norms functionality as well.

@sebpuetz: do you want to get #32 in before the release?

Return None instead of raising exceptions

Most of our methods, e.g. embedding() and similarity(), raise exceptions if the model can't produce an embedding for the given input. We could return None in those cases instead.

I'm not sure what's canonical Python here, but I'd prefer to do something like:

embedding = embeds.embedding("something oov")
if embedding is None:
    embedding = generic_oov_embed()

over

try:
    embedding = embeds.embedding("something oov")
except KeyError:
    embedding = generic_oov_embed()
What's your take on this?
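Either style can be layered on top of the other. The sketch below adapts a raising lookup into one that returns None, assuming the lookup raises KeyError as in the snippet above. embedding_or_none, toy_lookup and the fallback vector are hypothetical helpers for illustration only.

```python
from typing import Callable, Dict, List, Optional

Vector = List[float]


def embedding_or_none(lookup: Callable[[str], Vector], word: str) -> Optional[Vector]:
    """Adapt a lookup that raises KeyError into one that returns None."""
    try:
        return lookup(word)
    except KeyError:
        return None


# Toy lookup standing in for a model; not the finalfusion API.
table: Dict[str, Vector] = {"known": [1.0, 0.0]}


def toy_lookup(word: str) -> Vector:
    return table[word]


vec = embedding_or_none(toy_lookup, "something oov")
if vec is None:
    vec = [0.0, 0.0]  # hypothetical generic fallback embedding
```

This mirrors how dict offers both d[key], which raises, and d.get(key), which returns None.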

Consider replacing finalfusion-python by ffp?

Just wanted to have your opinion on this, @sebpuetz. I think my objection at the time was that we would have to keep two implementations (Rust, Python) in sync. But:

  • finalfusion is mostly 'done' (API stable, data format stable), so having two implementations is less of an issue now.
  • In the meantime, we have seen cases of unsoundness in pyo3.

NumPy dependency should be specified

If you try to get an embedding without having NumPy installed in the Python environment, you get a panic. Is it possible to have NumPy installed automatically as a dependency when you run pip install finalfusion?

Release 0.7

  • Update setup.py: include README, update URL, add other URLs #106 & #104
  • Documentation #104
  • mention scripts in docs #116
  • Add release workflow #112
  • Fix installing wheels in CI on Windows (dropped: not going to install & test wheels for Windows on release workflows)
  • Add MANIFEST.in for sdists #113
  • Mention analogy + similarity scripts in README #114

I think the other issues can be resolved in 0.7.1 (e.g. error handlers for bad UTF-8 or batched lookups).

Is there anything else that you think should be done before the release?

Handling malformed UTF-8

As of now, everything assumes all words are valid UTF-8. Perhaps add a lossy argument, analogous to the one in finalfusion-rust, to the read method(s).
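To illustrate what a lossy option would mean in practice, here is a small sketch using Python's built-in bytes.decode error handlers. The decode_word helper and its lossy flag are hypothetical and only mimic the idea; they are not the read methods' actual signature.

```python
def decode_word(data: bytes, lossy: bool = False) -> str:
    """Decode a vocabulary entry as UTF-8.

    With lossy=True, invalid bytes become U+FFFD (the replacement character)
    instead of raising UnicodeDecodeError.
    """
    return data.decode("utf-8", errors="replace" if lossy else "strict")


# The second word contains a Latin-1 byte (0xEF) that is not valid UTF-8 here.
raw = b"caf\xc3\xa9 na\xefve"

lossy_word = decode_word(raw, lossy=True)

try:
    decode_word(raw)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

The strict path fails on the first invalid byte, while the lossy path keeps the rest of the word readable.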

Expose model parameters

The interface could use getters for the various model parameters specified here, particularly the commonly used ones such as dimension, vocab size, context size, etc.

Use of locale in analogy integration test

I think I missed this during reviewing, but the analogy integration test has:

export LC_ALL=en_US.UTF-8

This fails in two cases:

  • Sandboxed builds, in which glibc locales are typically not available.
  • People who have systems without this locale installed (e.g. because they use some non-English locale).

@sebpuetz Do you encounter problems when e.g. using export LC_ALL=C? (Which should avoid doing any parsing of character data.)
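If the test were driven from Python rather than a shell script, the locale could be set per process instead of exported. The sketch below assumes that setup; run_with_c_locale is a hypothetical helper, and the child command only echoes the variable back rather than running the real analogy script.

```python
import os
import subprocess
import sys
from typing import List


def run_with_c_locale(cmd: List[str]) -> subprocess.CompletedProcess:
    """Run cmd with LC_ALL=C, leaving the parent environment untouched."""
    env = dict(os.environ, LC_ALL="C")
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)


# Stand-in for the real test command: print the locale variable the child sees.
result = run_with_c_locale(
    [sys.executable, "-c", "import os; print(os.environ['LC_ALL'])"]
)
child_locale = result.stdout.strip()
```

The C locale is always available, so this does not depend on en_US.UTF-8 being installed on the build machine.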

Release 0.5

We have added quite a few features in the last month, e.g. the read methods for other formats, exposure of subword indices, and the embedding similarity method.

Once #51 is done we could cut a new release, unless there is something else in the pipeline for the Python bindings.

What do you think?
