featureform / featureform

The Virtual Feature Store. Turn your existing data infrastructure into a feature store.

Home Page: https://www.featureform.com

License: Mozilla Public License 2.0

Starlark 0.19% Python 11.78% C++ 0.54% Shell 0.11% Dockerfile 0.38% Makefile 0.21% JavaScript 3.48% CSS 0.01% Go 26.58% Smarty 0.10% HCL 0.25% Jupyter Notebook 55.85% Gherkin 0.53%
machine-learning data-science vector-database embeddings-similarity embeddings hacktoberfest feature-store mlops data-quality feature-engineering

featureform's Introduction


What is Featureform?

Featureform is a virtual feature store. It enables data scientists to define, manage, and serve their ML model's features. Featureform sits atop your existing infrastructure and orchestrates it to work like a traditional feature store. By using Featureform, a data science team can solve the following organizational problems:

  • Enhance Collaboration Featureform ensures that transformations, features, labels, and training sets are defined in a standardized form, so they can easily be shared, re-used, and understood across the team.
  • Organize Experimentation The days of untitled_128.ipynb are over. Transformations, features, and training sets can be pushed from notebooks to a centralized feature repository with metadata like name, variant, lineage, and owner.
  • Facilitate Deployment Once a feature is ready to be deployed, Featureform will orchestrate your data infrastructure to make it ready in production. Using the Featureform API, you won't have to worry about the idiosyncrasies of your heterogeneous infrastructure (beyond their transformation language).
  • Increase Reliability Featureform enforces that all features, labels, and training sets are immutable. This allows them to safely be re-used among data scientists without worrying about logic changing. Furthermore, Featureform's orchestrator will handle retry logic and attempt to resolve other common distributed system problems automatically.
  • Preserve Compliance With built-in role-based access control, audit logs, and dynamic serving rules, your compliance logic can be enforced directly by Featureform.

Further Reading

(Figure: A virtual feature store's architecture)

Why is Featureform unique?

Use your existing data infrastructure. Featureform does not replace your existing infrastructure. Rather, Featureform transforms your existing infrastructure into a feature store. By being infrastructure-agnostic, teams can pick the right data infrastructure to solve their processing problems, while Featureform provides a feature store abstraction above it. Featureform orchestrates and manages transformations rather than computing them itself; the computations are offloaded to the organization's existing data infrastructure. In this way, Featureform is more akin to a framework and workflow than to an additional piece of data infrastructure.

Designed for both single data scientists and large enterprise teams. Whether you're a single data scientist or part of a large enterprise organization, Featureform allows you to document and push your transformation, feature, and training set definitions to a centralized repository. It works everywhere from a laptop to a large heterogeneous cloud deployment.

  • A single data scientist working locally: The days of untitled_128.ipynb, df_final_final_7, and hundreds of undocumented versions of datasets are over. A data scientist working in a notebook can push transformation, feature, and training set definitions to a centralized, local repository.
  • A single data scientist with a production deployment: Register your PySpark transformations and let Featureform orchestrate your data infrastructure from Spark to Redis, and monitor both the infrastructure and the data.
  • A data science team: Share, re-use, and learn from each other's transformations, features, and training sets. Featureform standardizes how machine learning resources are defined and provides an interface for search and discovery. It also maintains a history of changes, allows for different variants of features, and enforces immutability to resolve the most common cases of failure when sharing resources.
  • A data science organization: An enterprise will have a variety of different rules around access control of their data and features. The rules may be based on the data scientist’s role, the model’s category, or dynamically based on a user’s input data (i.e. they are in Europe and subject to GDPR). All of these rules can be specified, and Featureform will enforce them. Data scientists can be sure to comply with the organization’s governance rules without modifying their workflow.

Native embeddings support. Featureform was built from the ground up with embeddings in mind. It supports vector databases as both inference and training stores. Transformer models can be used as transformations, so that embedding tables can be versioned and reliably regenerated. We even created and open-sourced a popular vector database, Embeddinghub.

Open-source Featureform is free to use under the Mozilla Public License 2.0.


The Featureform Abstraction



(Figure: The components of a feature)



In reality, the feature’s definition is split across different pieces of infrastructure: the data source, the transformations, the inference store, the training store, and all their underlying data infrastructure. However, a data scientist will think of a feature in its logical form, something like: “a user’s average purchase price”. Featureform allows data scientists to define features in their logical form through transformations, providers, labels, and training set resources. Featureform will then orchestrate the actual underlying components to achieve the data scientists' desired state.
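For illustration, here is a hedged sketch of that logical form using the local provider; the names, file path, and columns are hypothetical, and the calls mirror the registration API that appears in the issue reports later on this page:

import featureform as ff

ff.register_user("myself").make_default_owner()
local = ff.register_local()
user = ff.register_entity("user")

# Hypothetical raw data source: one row per purchase, with user_id and price columns.
purchases = local.register_file(
    name="purchases",
    variant="default",
    description="Raw purchase events",
    path="purchases.csv",
)

# The "logical" feature: a user's average purchase price, expressed as a transformation.
@local.df_transformation(variant="default", inputs=[("purchases", "default")])
def avg_purchase_price(df):
    """The average purchase price per user."""
    return df.groupby("user_id", as_index=False)["price"].mean()

# Featureform orchestrates the underlying components to materialize this definition.
avg_purchase_price.register_resources(
    entity=user,
    entity_column="user_id",
    inference_store=local,
    features=[
        {"name": "avg_purchase_price", "variant": "default",
         "column": "price", "type": "float32"},
    ],
)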

How to use Featureform

Featureform can be run locally on files or in Kubernetes with your existing infrastructure.

Kubernetes

Featureform on Kubernetes can be used to connect to your existing cloud infrastructure and can also be run locally on Minikube.

To check out how to run it in the cloud, follow our Kubernetes deployment guide.

To try Featureform in a single Docker container, follow our Docker quickstart guide.



Contributing

  • To contribute to Featureform, please check out the Contribution docs.
  • Welcome to our community, join us on Slack.

Report Issues

Please help us by reporting any issues you may have while using Featureform.


License

featureform's People

Contributors

ahmadnazeri, anthonylasso, antony-eng, aolfat, dependabot[bot], ekorman, epps, ihkap11, imanthorpe, jerempy, jmeisele, josephrocca, joshcolts18, ksshiraja, mmbazel, riddhibagadiaa, rushabh31, saadhvi27, samuell, sdreyer, shabbyjoon, simba-git, steffitan23, syedzubeen, zhilingc


featureform's Issues

[Bug]: Containerized Featureform redirects to the home page on refresh

Expected Behavior

When navigating to a sub-page such as /features or /users, refreshing the page should keep the user at the current address.

Actual Behavior

The application routes the user back to the home screen:

(Screen recording attached: Screen.Recording.2023-04-18.at.1.31.19.PM.mov)

This behavior is not present when running in local mode.

Steps To Reproduce

  • Run FF in Docker mode
  • Navigate to any page: /providers, /users, etc.
  • Refresh the browser window
  • The app will navigate to the home screen

What mode are you running Featureform in?

Local

What version of Python are you running?

3.7

Featureform Python Package Version

0.7.2

Featureform Helm Chart Version

No response

Kubernetes Version

No response

Relevant log output

No response

[Bug]: Dashboard shows the data source description for all features

Expected Behavior

I would expect to see descriptions like "The Foo feature" or "The Bar feature" in the red-marked places in this screenshot:

(screenshot)

Actual Behavior

Instead I see only the description for the data source, "This is a dummy dataset in CSV format".

Steps To Reproduce

Add this code to a file definitions.py:

import featureform as ff

ff.register_user("myself").make_default_owner()

local = ff.register_local()

dummydata = local.register_file(
    name="dummydata",
    variant="default",
    description="This is a dummy dataset in CSV format",
    path="data.csv",
)

person = ff.register_entity("person")

dummydata.register_resources(
    entity="person",
    entity_column="person_id",
    inference_store=local,
    features=[
        {
            "name": "foo",
            "variant": "default",
            "column": "foo",
            "type": "float32",
            "description": "The Foo feature",
        },
        {
            "name": "bar",
            "variant": "default",
            "column": "bar",
            "type": "float32",
            "description": "The Bar feature",
        },
    ],
    timestamp_column="time",
)

Then run:

# Apply it:
featureform apply --local definitions.py
# Start the dashboard
featureform dash

Then open up the dashboard, e.g. in http://127.0.0.1:3000/features

What mode are you running Featureform in?

Local

What version of Python are you running?

3.9

Featureform Python Package Version

1.1.13rc0

Featureform Helm Chart Version

No response

Kubernetes Version

No response

Relevant log output

No response

[Bug]: Get and List commands in local mode print debugging messages

Expected Behavior

Running featureform list:
NAME VARIANT STATUS

Actual Behavior

Running featureform list:
Featureform Database exists. Connecting...
Featureform Database exists. Connecting...
NAME VARIANT STATUS
Featureform Database exists. Connecting...
Featureform Database exists. Connecting...

Steps To Reproduce

Install the featureform package and run featureform list.

What mode are you running Featureform in?

Local

What version of Python are you running?

3.7

Featureform Python Package Version

1.1.10

Featureform Helm Chart Version

NA

Kubernetes Version

NA

Relevant log output

No response

[Bug]: 500 Server Error on some pages in the dashboard

Expected Behavior

Expect to see a normal page saying for example "No entities registered" or similar.

Actual Behavior

A "500 Server Error" is also thrown on some pages in some situations, such as http://127.0.0.1:3000/entities and http://127.0.0.1:3000/users (see Steps To Reproduce below).

Example screenshot from the first of those URLs:

(screenshot)

Steps To Reproduce

Add the following code to definitions.py:

import featureform as ff

ff.register_user("myself").make_default_owner()

local = ff.register_local()

dummydata = local.register_file(
    name="dummydata",
    variant="default",
    description="This is a dummy dataset in CSV format",
    path="data.csv",
)

person = ff.register_entity("person")

dummydata.register_resources(
    entity="person",
    entity_column="person_id",
    inference_store=local,
    features=[
        {
            "name": "foo",
            "variant": "default",
            "column": "foo",
            "type": "float32",
            "description": "The Foo feature",
        },
        {
            "name": "bar",
            "variant": "default",
            "column": "bar",
            "type": "float32",
            "description": "The Bar feature",
        },
    ],
    timestamp_column="time",
)

Run:

featureform apply --local definitions.py
featureform dash

Open in a browser: http://127.0.0.1:3000/entities

Open in a browser: http://127.0.0.1:3000/users

What mode are you running Featureform in?

Local

What version of Python are you running?

3.9

Featureform Python Package Version

1.1.12

Featureform Helm Chart Version

No response

Kubernetes Version

No response

Relevant log output

When accessing http://127.0.0.1:3000/entities 


127.0.0.1 - - [11/Sep/2022 00:06:39] "GET /entities HTTP/1.1" 200 -
127.0.0.1 - - [11/Sep/2022 00:06:39] "GET /_next/static/css/85a2addfd2efc882.css HTTP/1.1" 304 -
127.0.0.1 - - [11/Sep/2022 00:06:39] "GET /_next/static/chunks/webpack-b5a50f2710bf3333.js HTTP/1.1" 304 -
127.0.0.1 - - [11/Sep/2022 00:06:39] "GET /_next/static/chunks/framework-3412d1150754b2fb.js HTTP/1.1" 304 -
127.0.0.1 - - [11/Sep/2022 00:06:39] "GET /_next/static/chunks/main-2715d0c23f47c019.js HTTP/1.1" 304 -
127.0.0.1 - - [11/Sep/2022 00:06:39] "GET /_next/static/chunks/pages/_app-b217558d3c27b2cc.js HTTP/1.1" 304 -
127.0.0.1 - - [11/Sep/2022 00:06:39] "GET /_next/static/chunks/pages/%5Btype%5D-3c144c056366a40b.js HTTP/1.1" 304 -
127.0.0.1 - - [11/Sep/2022 00:06:39] "GET /_next/static/jgzCj9eheZJ2ZWZELvsjO/_buildManifest.js HTTP/1.1" 304 -
127.0.0.1 - - [11/Sep/2022 00:06:39] "GET /_next/static/jgzCj9eheZJ2ZWZELvsjO/_ssgManifest.js HTTP/1.1" 304 -
127.0.0.1 - - [11/Sep/2022 00:06:39] "GET /_next/static/media/Matter-Regular.f1ae4ce5.ttf HTTP/1.1" 304 -
127.0.0.1 - - [11/Sep/2022 00:06:39] "GET /static/FeatureForm_Logo_Full_Black.svg HTTP/1.1" 304 -
[2022-09-11 00:06:40,120] ERROR in app: Exception on /data/entities [GET]
Traceback (most recent call last):
  File "/home/sal/.cache/pypoetry/virtualenvs/03-ff-feature-descriptions-6H_O3yPY-py3.9/lib/python3.9/site-packages/flask/app.py", line 2525, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/sal/.cache/pypoetry/virtualenvs/03-ff-feature-descriptions-6H_O3yPY-py3.9/lib/python3.9/site-packages/flask/app.py", line 1822, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/sal/.cache/pypoetry/virtualenvs/03-ff-feature-descriptions-6H_O3yPY-py3.9/lib/python3.9/site-packages/flask/app.py", line 1820, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/sal/.cache/pypoetry/virtualenvs/03-ff-feature-descriptions-6H_O3yPY-py3.9/lib/python3.9/site-packages/flask/app.py", line 1796, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/home/sal/.cache/pypoetry/virtualenvs/03-ff-feature-descriptions-6H_O3yPY-py3.9/lib/python3.9/site-packages/flask_cors/decorator.py", line 128, in wrapped_function
    resp = make_response(f(*args, **kwargs))
  File "/home/sal/.cache/pypoetry/virtualenvs/03-ff-feature-descriptions-6H_O3yPY-py3.9/lib/python3.9/site-packages/featureform/dashboard_metadata.py", line 331, in GetMetadataList
    allData.append(entities(row))
  File "/home/sal/.cache/pypoetry/virtualenvs/03-ff-feature-descriptions-6H_O3yPY-py3.9/lib/python3.9/site-packages/featureform/dashboard_metadata.py", line 243, in entities
    label_list = sqlObject.query_resource( "label_variant", "entity", rowData['name'])
  File "/home/sal/.cache/pypoetry/virtualenvs/03-ff-feature-descriptions-6H_O3yPY-py3.9/lib/python3.9/site-packages/featureform/sqlite_metadata.py", line 197, in query_resource
    raise ValueError(f"{type} with {column}: {resource} not found")
ValueError: label_variant with entity: person not found
127.0.0.1 - - [11/Sep/2022 00:06:40] "GET /data/entities HTTP/1.1" 500 -
127.0.0.1 - - [11/Sep/2022 00:06:40] "GET /_next/static/chunks/pages/index-c2ee5b1681e97e4b.js HTTP/1.1" 304 -
127.0.0.1 - - [11/Sep/2022 00:06:40] "GET /static/favicon.ico HTTP/1.1" 304 -

When accessing http://127.0.0.1:3000/users

127.0.0.1 - - [11/Sep/2022 00:07:27] "GET /users HTTP/1.1" 304 -
127.0.0.1 - - [11/Sep/2022 00:07:27] "GET /_next/static/css/85a2addfd2efc882.css HTTP/1.1" 304 -
127.0.0.1 - - [11/Sep/2022 00:07:27] "GET /_next/static/chunks/webpack-b5a50f2710bf3333.js HTTP/1.1" 304 -
127.0.0.1 - - [11/Sep/2022 00:07:27] "GET /_next/static/chunks/pages/_app-b217558d3c27b2cc.js HTTP/1.1" 304 -
127.0.0.1 - - [11/Sep/2022 00:07:27] "GET /_next/static/chunks/framework-3412d1150754b2fb.js HTTP/1.1" 304 -
127.0.0.1 - - [11/Sep/2022 00:07:27] "GET /_next/static/chunks/main-2715d0c23f47c019.js HTTP/1.1" 304 -
127.0.0.1 - - [11/Sep/2022 00:07:27] "GET /_next/static/chunks/pages/%5Btype%5D-3c144c056366a40b.js HTTP/1.1" 304 -
127.0.0.1 - - [11/Sep/2022 00:07:27] "GET /_next/static/jgzCj9eheZJ2ZWZELvsjO/_buildManifest.js HTTP/1.1" 304 -
127.0.0.1 - - [11/Sep/2022 00:07:27] "GET /_next/static/jgzCj9eheZJ2ZWZELvsjO/_ssgManifest.js HTTP/1.1" 304 -
127.0.0.1 - - [11/Sep/2022 00:07:27] "GET /static/FeatureForm_Logo_Full_Black.svg HTTP/1.1" 304 -
[2022-09-11 00:07:27,451] ERROR in app: Exception on /data/users [GET]
Traceback (most recent call last):
  File "/home/sal/.cache/pypoetry/virtualenvs/03-ff-feature-descriptions-6H_O3yPY-py3.9/lib/python3.9/site-packages/flask/app.py", line 2525, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/sal/.cache/pypoetry/virtualenvs/03-ff-feature-descriptions-6H_O3yPY-py3.9/lib/python3.9/site-packages/flask/app.py", line 1822, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/sal/.cache/pypoetry/virtualenvs/03-ff-feature-descriptions-6H_O3yPY-py3.9/lib/python3.9/site-packages/flask/app.py", line 1820, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/sal/.cache/pypoetry/virtualenvs/03-ff-feature-descriptions-6H_O3yPY-py3.9/lib/python3.9/site-packages/flask/app.py", line 1796, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/home/sal/.cache/pypoetry/virtualenvs/03-ff-feature-descriptions-6H_O3yPY-py3.9/lib/python3.9/site-packages/flask_cors/decorator.py", line 128, in wrapped_function
    resp = make_response(f(*args, **kwargs))
  File "/home/sal/.cache/pypoetry/virtualenvs/03-ff-feature-descriptions-6H_O3yPY-py3.9/lib/python3.9/site-packages/featureform/dashboard_metadata.py", line 335, in GetMetadataList
    allData.append(users(row))
  File "/home/sal/.cache/pypoetry/virtualenvs/03-ff-feature-descriptions-6H_O3yPY-py3.9/lib/python3.9/site-packages/featureform/dashboard_metadata.py", line 275, in users
    variant_organiser(label_variant(sqlObject.query_resource( "label_variant", "owner", rowData['name']))[2]),
  File "/home/sal/.cache/pypoetry/virtualenvs/03-ff-feature-descriptions-6H_O3yPY-py3.9/lib/python3.9/site-packages/featureform/sqlite_metadata.py", line 197, in query_resource
    raise ValueError(f"{type} with {column}: {resource} not found")
ValueError: label_variant with owner: default_user not found
127.0.0.1 - - [11/Sep/2022 00:07:27] "GET /data/users HTTP/1.1" 500 -
127.0.0.1 - - [11/Sep/2022 00:07:27] "GET /_next/static/chunks/pages/index-c2ee5b1681e97e4b.js HTTP/1.1" 304 -
127.0.0.1 - - [11/Sep/2022 00:07:27] "GET /static/favicon.ico HTTP/1.1" 304 -

HNSW distance metric

Hi!
Is it possible to choose the distance metric for the HNSW algorithm?

As I see it, L2 is used as the default, but hnswlib also supports cosine similarity and dot product.
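
For reference, the underlying hnswlib API exposes the metric through its space argument ("l2", "cosine", or "ip" for dot product); whether Embeddinghub surfaces that option is exactly the question above. A minimal hnswlib-only sketch:

import hnswlib
import numpy as np

dims = 3
index = hnswlib.Index(space="cosine", dim=dims)  # "l2" and "ip" are also valid
index.init_index(max_elements=1000, ef_construction=200, M=16)

vectors = np.random.rand(100, dims).astype("float32")
index.add_items(vectors, np.arange(100))

labels, distances = index.knn_query(vectors[:1], k=2)
print(labels, distances)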

LocalConfig space fails on "get_space"

version: pip embeddinghub version 0.0.1.post12
env: Powershell, Windows 10

This code (shown in an attached screenshot) leads to an error (also shown in a screenshot).
get_space seems to be implemented following the server config model even on the local version, causing this error.

[Bug]: Broken image in "Software" column in dashboard at Home > Providers

Expected Behavior

The image should render whatever represents the value for "Software" in local mode.

Actual Behavior

The image element is broken given there's no value for the src attribute.

It appears that line 151 in ResourceListView.js indexes into the providerLogos object with the key LOCALMODE, which doesn't exist (see line 115 in Resources.js). This results in the value undefined for the src attribute.

Steps To Reproduce

  • Complete Steps 1-3 in the Quickstart (Local) and run featureform dash
  • Visit http://localhost:3000/providers
  • Observe the broken image under the "Software" column.
    (Screenshot attached: Screen Shot 2023-01-20 at 1 29 32 PM)

What mode are you running Featureform in?

Local

What version of Python are you running?

3.8

Featureform Python Package Version

1.4.4

Featureform Helm Chart Version

No response

Kubernetes Version

No response

Relevant log output

No response

[Bug]: Registering a training set with a variant that doesn't exist results in a GRPC dump

Expected Behavior

# Register feature from the average_user_transaction transformation
average_user_transaction.register_resources(
    entity=user,
    entity_column="user_id",
    inference_store=redis,
    features=[
        {"name": "avg_transactions", "variant": "default", "column": "avg_transaction_amt", "type": "float32"},
    ],
)

# Register label from our base Transactions table
transactions.register_resources(
    entity=user,
    entity_column="customerid",
    labels=[
        {"name": "fraudulent", "variant": "default", "column": "isfraud", "type": "bool"},
    ],
)

ff.register_training_set(
    "fraud_training", "default",
    label=("fraudulent", "default"),
    features=[("avg_transactions", "quickstart")],
)

Registering avg_transactions.default but using avg_transactions.quickstart in the training set should have an output like:

Creating feature avg_transactions
Creating label fraudulent
Creating training-set fraud_training
Error: Invalid Feature avg_transactions.quickstart in Training Set fraud_training.default

Actual Behavior

Creating training-set fraud_training
Traceback (most recent call last):
  File "/Users/sdreyer/Projects/embeddinghub/venv/bin/featureform", line 8, in <module>
    sys.exit(cli())
  File "/Users/sdreyer/Projects/embeddinghub/venv/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/Users/sdreyer/Projects/embeddinghub/venv/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/sdreyer/Projects/embeddinghub/venv/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/sdreyer/Projects/embeddinghub/venv/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/sdreyer/Projects/embeddinghub/venv/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/sdreyer/Projects/embeddinghub/venv/lib/python3.8/site-packages/featureform/cli.py", line 187, in apply
    rc.apply()
  File "/Users/sdreyer/Projects/embeddinghub/venv/lib/python3.8/site-packages/featureform/register.py", line 2603, in apply
    state().create_all(self._stub)
  File "/Users/sdreyer/Projects/embeddinghub/venv/lib/python3.8/site-packages/featureform/resources.py", line 1492, in create_all
    raise e
  File "/Users/sdreyer/Projects/embeddinghub/venv/lib/python3.8/site-packages/featureform/resources.py", line 1486, in create_all
    resource._create(stub)
  File "/Users/sdreyer/Projects/embeddinghub/venv/lib/python3.8/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
  File "/Users/sdreyer/Projects/embeddinghub/venv/lib/python3.8/site-packages/featureform/resources.py", line 1289, in _create
    stub.CreateTrainingSetVariant(serialized)
  File "/Users/sdreyer/Projects/embeddinghub/venv/lib/python3.8/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/Users/sdreyer/Projects/embeddinghub/venv/lib/python3.8/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNKNOWN
	details = "could not propogate: &{%!s(*proto.TrainingSetVariant=&{{{} [] [] 0xc00004c238} 121 [] fraud_training quickstart  default_user 0xc00068c540 spark <nil> [0xc0003283c0] 0xc000328410 <nil>  []})}: could not get dependencies for parent: &{%!s(*proto.TrainingSetVariant=&{{{} [] [] 0xc00004c238} 121 [] fraud_training quickstart  default_user 0xc00068c540 spark <nil> [0xc0003283c0] 0xc000328410 <nil>  []})}: could not create submap for IDs: [{default_user  USER} {spark  PROVIDER} {fraudulent quickstart LABEL_VARIANT} {fraud_training  TRAINING_SET} {avg_transactions quickstart FEATURE_VARIANT}]: submap deserialize: {avg_transactions quickstart FEATURE_VARIANT}: failed To Parse Resource: : unexpected end of JSON input"
	debug_error_string = "UNKNOWN:Error received from peer ipv4:34.174.72.100:443 {created_time:"2023-03-20T13:04:50.56439-07:00", grpc_status:2, grpc_message:"could not propogate: &{%!s(*proto.TrainingSetVariant=&{{{} [] [] 0xc00004c238} 121 [] fraud_training quickstart  default_user 0xc00068c540 spark <nil> [0xc0003283c0] 0xc000328410 <nil>  []})}: could not get dependencies for parent: &{%!s(*proto.TrainingSetVariant=&{{{} [] [] 0xc00004c238} 121 [] fraud_training quickstart  default_user 0xc00068c540 spark <nil> [0xc0003283c0] 0xc000328410 <nil>  []})}: could not create submap for IDs: [{default_user  USER} {spark  PROVIDER} {fraudulent quickstart LABEL_VARIANT} {fraud_training  TRAINING_SET} {avg_transactions quickstart FEATURE_VARIANT}]: submap deserialize: {avg_transactions quickstart FEATURE_VARIANT}: failed To Parse Resource: : unexpected end of JSON input"}"

Steps To Reproduce

  1. Register a feature with a name and variant
  2. Register a training set with a name but a variant that doesn't exist
  3. featureform apply
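
For reference, a sketch of the registration with matching variants, based on the code above; the bug report is about the unhelpful error, not about this fix:

import featureform as ff

# Reference the feature variant that was actually registered ("default"
# rather than "quickstart") so the dependency lookup succeeds.
ff.register_training_set(
    "fraud_training", "default",
    label=("fraudulent", "default"),
    features=[("avg_transactions", "default")],
)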

What mode are you running Featureform in?

Hosted

What version of Python are you running?

3.7

Featureform Python Package Version

1.6.1

Featureform Helm Chart Version

0.6.1

Kubernetes Version

1.23

Relevant log output

No response

[Bug]: Spark Dataframe Transformation Does Not Read Column Headers

Expected Behavior

Should be able to access columns by name in dataframe transformations

Actual Behavior

Cannot resolve column name "transactionamount" among (_c0, _c1, _c2, _c3, _c4, _c5, _c6, _c7)

    for location in sources:
        if location.endswith(".csv"):
            func_parameters.append(spark.read.option("recursiveFileLookup", "true").csv(location))
        elif location.endswith(".parquet"):
            func_parameters.append(spark.read.option("recursiveFileLookup", "true").parquet(location))
        else:
            raise Exception(f"the file type for '{location}' file is not supported.")

The code above is missing .option("header", "true").
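
A minimal sketch of the corrected CSV read, assuming the source files carry a header row as in the report above:

from pyspark.sql import SparkSession

def read_csv_source(location: str):
    spark = SparkSession.builder.getOrCreate()
    # Same read as in the runner above, with the missing header option included
    # so columns keep their names instead of _c0, _c1, ...
    return (
        spark.read.option("header", "true")
        .option("recursiveFileLookup", "true")
        .csv(location)
    )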

Steps To Reproduce

  1. Create a dataframe transformation with spark and try to access a column by name

What mode are you running Featureform in?

Hosted

What version of Python are you running?

3.7

Featureform Python Package Version

1.6.1

Featureform Helm Chart Version

0.6.1

Kubernetes Version

1.23

Relevant log output

No response

Connection to DynamoDB

Currently, the only way to connect to DynamoDB is to provide the access key and secret key explicitly in the code. Featureform also does not support connecting to a local DynamoDB.

Some suggestions:

  1. Connect to DynamoDB using an IAM Role.
  2. Support connecting to a local DynamoDB.
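
For context, a sketch of what the two requested connection modes look like with plain boto3; this is not Featureform's provider API, and the local endpoint URL is only an example:

import boto3

# 1. IAM-role-based access: omit explicit keys and let the default credential
#    chain (instance profile, IRSA, environment, etc.) supply credentials.
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")

# 2. Local DynamoDB: point the client at a local endpoint instead of AWS.
local_dynamodb = boto3.resource(
    "dynamodb",
    region_name="us-east-1",
    endpoint_url="http://localhost:8000",  # example DynamoDB Local endpoint
)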

Can't connect to local docker

After running the getting started guide, I get the following error:
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1640296696.148721299","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3186,"referenced_errors":[{"created":"@1640296696.148720244","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":146,"grpc_status":14}]}"

I can't connect to the Docker container.

[Bug]: Chaining Transformations in Spark Throws Errors

Expected Behavior

Should be able to use one transformation as an input to another transformation.

Actual Behavior

The file extension check does not check whether the location is a directory or a file.

    for location in sources:
        if location.endswith(".csv"):
            func_parameters.append(spark.read.option("recursiveFileLookup", "true").csv(location))
        elif location.endswith(".parquet"):
            func_parameters.append(spark.read.option("recursiveFileLookup", "true").parquet(location))
        else:
            raise Exception(f"the file type for '{location}' file is not supported.")
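
A hedged sketch of a more forgiving check, not the project's actual fix: treat extension-less locations, such as the transformation output directory shown in the log output below, as directories of parquet output.

import os
from pyspark.sql import SparkSession

def read_source(location: str):
    spark = SparkSession.builder.getOrCreate()
    if location.endswith(".csv"):
        return spark.read.option("recursiveFileLookup", "true").csv(location)
    # No file extension: assume a directory written by an earlier transformation,
    # which Spark stores as parquet part-files (an assumption for this sketch).
    if location.endswith(".parquet") or not os.path.splitext(location)[1]:
        return spark.read.option("recursiveFileLookup", "true").parquet(location)
    raise Exception(f"the file type for '{location}' file is not supported.")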

Steps To Reproduce

Create a transformation that uses a different transformation as its input

@spark.sql_transformation(variant="default")
def average_user_transaction():
    """the average transaction amount for a user """
    return "SELECT CustomerID as user_id, avg(TransactionAmount) " \
           "as avg_transaction_amt from {{transactions.default}} GROUP BY user_id"


@spark.df_transformation(inputs=[average_user_transaction])
def outlier_transactions(avg_transactions):
    q = avg_transactions["avg_transaction_amt"].quantile(0.99)
    outliers = avg_transactions[avg_transactions["avg_transaction_amt"] > q]
    return outliers

What mode are you running Featureform in?

Hosted

What version of Python are you running?

3.7

Featureform Python Package Version

1.6.1

Featureform Helm Chart Version

0.6.1

Kubernetes Version

1.23

Relevant log output

Traceback (most recent call last): File "/app/provider/scripts/spark/offline_store_spark_runner.py", line 38, in main output_location = execute_df_job(args.output_uri, args.code, args.store_type, args.spark_config, args.credential, args.source) 
File "/app/provider/scripts/spark/offline_store_spark_runner.py", line 112, in execute_df_job raise Exception(f"the file type for '{location}' file is not supported.") 
Exception: the file type for 'gs://featureform-webinar/data/featureform/Transformation/average_user_transaction/default2023-03-20-20-39-06-629509' file is not supported. During handling of the above exception, another exception occurred: 
Traceback (most recent call last): File "/app/provider/scripts/spark/offline_store_spark_runner.py", line 337, in <module> main(parse_args()) File "/app/provider/scripts/spark/offline_store_spark_runner.py", line 44, in main raise Exception(error_message) Exception: the df job failed. Error: the file type for 'gs://featureform-webinar/data/featureform/Transformation/average_user_transaction/default2023-03-20-20-39-06-629509' file is not supported. 

[Bug]: Features in training set ordered by feature name, not order defined when registering

Expected Behavior

I realize this might be a question of what the desired behavior is, so I'm going on my own subjective expectation here.

By executing the example code outlined further below, I would expect to see the following output:

$ python client.py 
[[1, 10, 100]]
[[2, 20, 200]]
[[3, 30, 300]]

Actual Behavior

$ python client.py 
[[10, 100, 1]]
[[20, 200, 2]]
[[30, 300, 3]]

Steps To Reproduce

Create the files:

data.csv

time,foo,bar,baz,person_id
2022-08-25 00:00:01,1,10,100,samuel
2022-08-25 00:00:02,2,20,200,samuel
2022-08-25 00:00:03,3,30,300,samuel

defs.py

import featureform as ff

ff.register_user("myself").make_default_owner()

local = ff.register_local()

dummydata = local.register_file(
    name="dummydata",
    variant="default",
    description="",
    path="data.csv",
)

person = ff.register_entity("person")

dummydata.register_resources(
    entity="person",
    entity_column="person_id",
    inference_store=local,
    features=[
        {
            "name": "foo",
            "variant": "default",
            "column": "foo",
            "type": "float32",
        },
        {
            "name": "bar",
            "variant": "default",
            "column": "bar",
            "type": "float32",
        },
        {
            "name": "baz",
            "variant": "default",
            "column": "baz",
            "type": "float32",
        },
    ],
    labels=[
        {
            "name": "baz",
            "variant": "default",
            "column": "baz",
            "type": "float32",
        },
    ],
    timestamp_column="time",
)

ff.register_training_set(
    "all_features",
    "default",
    features=[
        ("foo", "default"),
        ("bar", "default"),
        ("baz", "default"),
    ],
    label=("baz", "default"),
)

client.py

import featureform as ff

cli = ff.ServingClient(local=True)

dataset = cli.training_set("all_features", "default")

for row in dataset:
    print(row.features())

Run:

featureform apply --local defs.py
python client.py

What mode are you running Featureform in?

Local

What version of Python are you running?

3.9

Featureform Python Package Version

1.1.13rc0

Other comments

Some experimentation indicates the sorting is based on alphabetical order of the feature names, rather than the order in which the features are registered in defs.py, which is what I would have expected.
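
Until the ordering is made explicit, a small client-side workaround sketch, assuming the alphabetical ordering described above, that puts each row back into registration order:

REGISTERED_ORDER = ["foo", "bar", "baz"]   # order used in defs.py
RETURNED_ORDER = sorted(REGISTERED_ORDER)  # observed: alphabetical by feature name

def reorder(row_values):
    by_name = dict(zip(RETURNED_ORDER, row_values))
    return [by_name[name] for name in REGISTERED_ORDER]

print(reorder([10, 100, 1]))  # -> [1, 10, 100]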

Can't search by vector using nearest neighbors

Error Description

When trying to search by vector, an error is raised:

neighbors = space.nearest_neighbors(vector=[0,0,1], num=2)
Traceback (most recent call last):
  File "/workspace/main.py", line 31, in <module>
    neighbors = space.nearest_neighbors(vector=[0,0,1], num=10)
  File "/usr/local/lib/python3.9/site-packages/embeddinghub/client.py", line 154, in nearest_neighbors
    return self._client.nearest_neighbor(self._name, num, key=key, embedding=vector)
  File "/usr/local/lib/python3.9/site-packages/embeddinghub/client.py", line 365, in nearest_neighbor
    req = embedding_store_pb2.NearestNeighborRequest(space=str(space),
TypeError: Parameter to MergeFrom() must be instance of same class: expected featureform.embedding.proto.Embedding got list.

How to replicate

Full code:

import embeddinghub as eh

hub = eh.connect(eh.Config(host="embeddinghub", port=7462))  # we use docker
space = hub.create_space("quickstart", dims=3)

embeddings = {
    "apple": [1, 0, 0],
    "orange": [1, 1, 0],
    "potato": [0, 1, 0],
    "chicken": [-1, -1, 0],
}

space.multiset(embeddings)

neighbors = space.nearest_neighbors(vector=[0,0,1], num=2)  # << error here
print(neighbors)

How can we convert a list or array into an Embedding?
Thank you in advance 😄
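
A workaround sketch that only uses calls shown elsewhere in these issues, assuming a temporary key is acceptable: store the query vector under its own key and search by key rather than by raw vector.

import embeddinghub as eh

hub = eh.connect(eh.Config(host="embeddinghub", port=7462))
space = hub.create_space("quickstart", dims=3)  # or reuse the existing space object

# Insert the query vector under a throwaway key, then search by that key.
space.multiset({"_query": [0, 0, 1]})
# Ask for one extra neighbor in case the temporary key matches itself.
neighbors = space.nearest_neighbors(key="_query", num=3)
print(neighbors)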

[Bug]: Can not show information about transformation in the UI

Expected Behavior

I expect to see some metadata information about the transformation when clicking on it.

Actual Behavior

I only see the three animated dots, as if data is still loading:

(screenshot)

Steps To Reproduce

Add this code to a file named definitions.py:

import featureform as ff

ff.register_user("myself").make_default_owner()

local = ff.register_local()

dummydata = local.register_file(
    name="dummydata",
    variant="default",
    description="",
    path="data.csv",
)

person = ff.register_entity("person")

dummydata.register_resources(
    entity="person",
    entity_column="person_id",
    inference_store=local,
    features=[
        {
            "name": "foo",
            "variant": "default",
            "column": "foo",
            "type": "float32",
        },
        {
            "name": "bar",
            "variant": "default",
            "column": "bar",
            "type": "float32",
        },
    ],
    timestamp_column="time",
)


@local.df_transformation(variant="default", inputs=[("dummydata", "default")])
def compute_fooplusbar(df):
    df["fooplusbar"] = df["foo"] + df["bar"]
    return df


compute_fooplusbar.register_resources(
    entity=person,
    entity_column="person_id",
    inference_store=local,
    features=[
        {
            "name": "fooplusbar",
            "variant": "default",
            "column": "fooplusbar",
            "type": "float32",
        },
    ],
    labels=[
        {
            "name": "fooplusbar",
            "variant": "default",
            "column": "fooplusbar",
            "type": "float32",
        },
    ],
    timestamp_column="time",
)

Run:

featureform apply --local definitions.py
featureform dash

Enter the URL http://127.0.0.1:3000/sources/compute_fooplusbar

... or alternatively, open http://127.0.0.1:3000 and click "features" and then "compute_fooplusbar" on the "source" row in the table with metadata.

What mode are you running Featureform in?

Local

What version of Python are you running?

3.9

Featureform Python Package Version

1.1.12

[Bug]: Test bug report

Expected Behavior

Something I expected

Actual Behavior

Something that happened

Steps To Reproduce

Steps 1, 2, 3

What mode are you running Featureform in?

Local

What version of Python are you running?

3.8

Featureform Python Package Version

1.1.1.1.1

Featureform Helm Chart Version

2.3.4.5

Kubernetes Version

1.2.2.2

Relevant log output

Some logs

cli commands `list` and `get` don't work in secure non-local mode

Doing a command such as

featureform list sources --host <HOST_NAME> --cert tls.crt

results in the error ValueError: Cannot be local and have a host. This appears to be because insecure=False forces local mode:

https://github.com/featureform/featureform/blob/main/client/src/featureform/cli.py#L132

Similar logic is in the get command. The apply command, however, does work and has different logic for creating the ResourceClient:
https://github.com/featureform/featureform/blob/main/client/src/featureform/cli.py#L172

Is there a reason that the get and list commands don't create the ResourceClient this way?

[Bug]: BreadCrumbs.capitalize(word) throws exception if the path is undefined

Expected Behavior

Breadcrumb failures should not stop the entire application from loading.

Actual Behavior

The capitalize function does not do null checking on the string passed in. Any React re-render may cause an empty string to get passed to the breadCrumbs component, which will throw the NPE.

(Screen recording attached: Screen.Recording.2023-04-19.at.1.41.40.PM.mov)

Steps To Reproduce

  • Run FF in Docker mode
  • Navigate to any page: /providers, /users, etc.
  • The error will sporadically throw

What mode are you running Featureform in?

Local

What version of Python are you running?

3.7

Featureform Python Package Version

0.7.2

Featureform Helm Chart Version

No response

Kubernetes Version

No response

Relevant log output

No response

[Bug]: Inability to do a time-based join, and potentially to name the time column something other than 'ts'

Expected Behavior

Showing some numerical values from features and labels in the console.

(Documented in detail in this repo, which contains a full bug reproduction setup: https://github.com/samuell/bugs/tree/main/20220831-ff-join-bug#readme)

Actual Behavior

Got this stack trace:

$ python train.py 
Traceback (most recent call last):
  File "/home/sal/proj/sav/2022/bug-reproductions/20220831-ff-join-bug/train.py", line 5, in <module>
    train_data = client.training_set("traindata", "default")
  File "/home/sal/.cache/pypoetry/virtualenvs/20220831-ff-join-bug-JAwT4QYU-py3.9/lib/python3.9/site-packages/featureform/serving.py", line 49, in training_set
    return self._local_training_set(name, version)
  File "/home/sal/.cache/pypoetry/virtualenvs/20220831-ff-join-bug-JAwT4QYU-py3.9/lib/python3.9/site-packages/featureform/serving.py", line 112, in _local_training_set
    trainingset_df = pd.merge_asof(trainingset_df, df.sort_values(['ts']), direction='backward',
  File "/home/sal/.cache/pypoetry/virtualenvs/20220831-ff-join-bug-JAwT4QYU-py3.9/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/sal/.cache/pypoetry/virtualenvs/20220831-ff-join-bug-JAwT4QYU-py3.9/lib/python3.9/site-packages/pandas/core/frame.py", line 6259, in sort_values
    k = self._get_label_or_level_values(by, axis=axis)
  File "/home/sal/.cache/pypoetry/virtualenvs/20220831-ff-join-bug-JAwT4QYU-py3.9/lib/python3.9/site-packages/pandas/core/generic.py", line 1779, in _get_label_or_level_values
    raise KeyError(key)
KeyError: 'ts'

After trying to rename all the occurrences of "ts" to "time" in data.csv and defs.py, and running

rm -rf .featureform && sleep 1 && featureform apply --local defs.py
python train.py

.. I instead got this:

$ python train.py 
Traceback (most recent call last):
  File "/home/sal/proj/sav/2022/bug-reproductions/20220831-ff-join-bug/train.py", line 5, in <module>
    train_data = client.training_set("traindata", "default")
  File "/home/sal/.cache/pypoetry/virtualenvs/20220831-ff-join-bug-JAwT4QYU-py3.9/lib/python3.9/site-packages/featureform/serving.py", line 49, in training_set
    return self._local_training_set(name, version)
  File "/home/sal/.cache/pypoetry/virtualenvs/20220831-ff-join-bug-JAwT4QYU-py3.9/lib/python3.9/site-packages/featureform/serving.py", line 112, in _local_training_set
    trainingset_df = pd.merge_asof(trainingset_df, df.sort_values(['ts']), direction='backward',
  File "/home/sal/.cache/pypoetry/virtualenvs/20220831-ff-join-bug-JAwT4QYU-py3.9/lib/python3.9/site-packages/pandas/core/reshape/merge.py", line 580, in merge_asof
    op = _AsOfMerge(
  File "/home/sal/.cache/pypoetry/virtualenvs/20220831-ff-join-bug-JAwT4QYU-py3.9/lib/python3.9/site-packages/pandas/core/reshape/merge.py", line 1740, in __init__
    _OrderedMerge.__init__(
  File "/home/sal/.cache/pypoetry/virtualenvs/20220831-ff-join-bug-JAwT4QYU-py3.9/lib/python3.9/site-packages/pandas/core/reshape/merge.py", line 1623, in __init__
    _MergeOperation.__init__(
  File "/home/sal/.cache/pypoetry/virtualenvs/20220831-ff-join-bug-JAwT4QYU-py3.9/lib/python3.9/site-packages/pandas/core/reshape/merge.py", line 681, in __init__
    self._validate_specification()
  File "/home/sal/.cache/pypoetry/virtualenvs/20220831-ff-join-bug-JAwT4QYU-py3.9/lib/python3.9/site-packages/pandas/core/reshape/merge.py", line 1809, in _validate_specification
    raise MergeError(
pandas.errors.MergeError: Incompatible merge dtype, dtype('O') and dtype('O'), both sides must have numeric dtype

(Documented in detail in this repo, which contains a full bug reproduction setup: https://github.com/samuell/bugs/tree/main/20220831-ff-join-bug#readme)

Steps To Reproduce

  1. Make sure to have poetry installed (or install featureform 1.2.0 in another way).
  2. Run the following commands:
    git clone https://github.com/samuell/bugs.git
    cd bugs/20220831-ff-join-bug
    poetry install
    poetry shell
    featureform apply --local defs.py
    python train.py
    

(Documented in detail in this repo, which contains a full bug reproduction setup: https://github.com/samuell/bugs/tree/main/20220831-ff-join-bug#readme)

What mode are you running Featureform in?

Local

What version of Python are you running?

3.9

Featureform Python Package Version

1.2.0

Featureform Helm Chart Version

No response

Kubernetes Version

No response

Relevant log output

No response

[Bug]: Accessing localhost:7878 fails to open

Expected Behavior

Open the web page and maintain communication between Docker and the interface

Actual Behavior

time="2023-04-06T18:45:08Z" level=info msg="[core] [Server #7] grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams received bogus greeting from client: \"GET /favicon.ico HTTP/1.\""" system=system

Steps To Reproduce

Follow the tutorial from featureform using Docker

What mode are you running Featureform in?

Local

What version of Python are you running?

3.10

Featureform Python Package Version

1.7.2

Featureform Helm Chart Version

No response

Kubernetes Version

No response

Relevant log output

No response

[Bug]: Spark SQL Transformation requires a variant to be specified

Expected Behavior

@spark.sql_transformation()
def average_user_transaction():
    return "SELECT * FROM ......."

Should be valid

Actual Behavior

@spark.sql_transformation()
def average_user_transaction():
    return "SELECT * FROM ......."

Errors with

TypeError: sql_transformation() missing 1 required positional argument: 'variant'

Steps To Reproduce

  1. Create a spark sql transformation without a variant
  2. featureform apply

What mode are you running Featureform in?

Hosted

What version of Python are you running?

3.7

Featureform Python Package Version

1.6.1

Featureform Helm Chart Version

0.6.1

Kubernetes Version

1.23

Relevant log output

No response

Missing code in sdk/python when installing local build

Found this issue when I tried to import the Python package after installing a local build via pip install -e .

There seems to be some code missing under sdk/python:

>>> import embeddinghub
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/user/code/embeddinghub/sdk/python/embeddinghub.py", line 19, in <module>
    from sdk.python import embedding_store_pb2_grpc
ImportError: cannot import name 'embedding_store_pb2_grpc' from 'sdk.python' (/Users/user/code/embeddinghub/sdk/python/__init__.py)

The file structure currently looks like this:
(Screenshot attached: Bildschirmfoto 2021-10-27 um 12 47 46)

[Bug]: Querying for features based on transformations with entity name throws errors

Summary

I get errors when I try to query features that are registered on a df_transformation. It seems the entity name does not work there. If I use the entity column name instead, I don't get errors anymore and actually get some output, although it is not correct. See below, and also this folder in a separate repo for all the code needed to reproduce this.

Expected Behavior

Should get this output:

-------Foo-------
0.002
-------Bar-------
0.004
-------FooPlusBar-------
0.006

Actual Behavior

I get this output:

-------Foo-------
0.002
-------Bar-------
0.004
-------FooPlusBar-------
Traceback (most recent call last):
  File "/home/sal/proj/sav/2022/bug-reproductions/02-ff-entity-match/client.py", line 15, in <module>
    fooplusbar = client.features([("fooplusbar", "default")], {"person": "samuel"})
  File "/home/sal/.cache/pypoetry/virtualenvs/20220903-ff-entity-match-9koWNEX7-py3.9/lib/python3.9/site-packages/featureform/serving.py", line 104, in features
    return self.impl.features(features, entities)
  File "/home/sal/.cache/pypoetry/virtualenvs/20220903-ff-entity-match-9koWNEX7-py3.9/lib/python3.9/site-packages/featureform/serving.py", line 304, in features
    all_features_list = self.add_feature_dfs_to_list(feature_variant_list, entity_id)
  File "/home/sal/.cache/pypoetry/virtualenvs/20220903-ff-entity-match-9koWNEX7-py3.9/lib/python3.9/site-packages/featureform/serving.py", line 319, in add_feature_dfs_to_list
    raise ValueError(
ValueError: Could not set entity column. No column name person exists in compute_fooplusbar-default

Steps To Reproduce

The code to reproduce this is available in this repo, but adding the reproduction info here as well:

Save this data to file named data.csv:

time,foo,bar,person_id
2022-08-25 00:00:01,0.000,0.001,samuel
2022-08-25 00:00:02,0.001,0.002,samuel
2022-08-25 00:00:03,0.002,0.004,samuel

Put this in defs.py:

import featureform as ff

ff.register_user("myself").make_default_owner()

local = ff.register_local()

dummydata = local.register_file(
    name="dummydata",
    variant="default",
    description="",
    path="data.csv",
)

person = ff.register_entity("person")

dummydata.register_resources(
    entity="person",
    entity_column="person_id",
    inference_store=local,
    features=[
        {
            "name": "foo",
            "variant": "default",
            "column": "foo",
            "type": "float32",
        },
        {
            "name": "bar",
            "variant": "default",
            "column": "bar",
            "type": "float32",
        },
    ],
    timestamp_column="time",
)

@local.df_transformation(variant="default", inputs=[("dummydata", "default")])
def compute_fooplusbar(df):
    df["fooplusbar"] = df["foo"] + df["bar"]
    return df

compute_fooplusbar.register_resources(
    entity=person,
    entity_column="person_id",
    inference_store=local,
    features=[
        {
            "name": "fooplusbar",
            "variant": "default",
            "column": "fooplusbar",
            "type": "float32",
        },
    ],
    labels=[
        {
            "name": "fooplusbar",
            "variant": "default",
            "column": "fooplusbar",
            "type": "float32",
        },
    ],
    timestamp_column="time",
)

Put this in client.py:

import featureform as ff

client = ff.ServingClient(local=True)

print("-"*7 + "Foo" + "-"*7)
foo = client.features([("foo", "default")], {"person": "samuel"})
print(f"{foo[0]:.3f}")

print("-"*7 + "Bar" + "-"*7)
bar = client.features([("bar", "default")], {"person": "samuel"})
print(f"{bar[0]:.3f}")

print("-"*7 + "FooPlusBar" + "-"*7)
fooplusbar = client.features([("fooplusbar", "default")], {"person": "samuel"})
print(f"{fooplusbar[0]:.3f}")

Run:

featureform apply --local defs.py
python client.py

What mode are you running Featureform in?

Local

What version of Python are you running?

3.9

Featureform Python Package Version

1.1.12

Other info

If using the entity column name (person_id) instead of the entity name
(person) for the feature based on a transformation ... that is, putting this code into client_works_but_gives_wrong_value.py

import featureform as ff

client = ff.ServingClient(local=True)

print("-"*7 + "Foo" + "-"*7)
foo = client.features([("foo", "default")], {"person": "samuel"})
print(f"{foo[0]:.3f}")

print("-"*7 + "Bar" + "-"*7)
bar = client.features([("bar", "default")], {"person": "samuel"})
print(f"{bar[0]:.3f}")

print("-"*7 + "FooPlusBar" + "-"*7)
# *** NOTE below that we use "person_id" (the entity COLUMN NAME) instead of "person" (the entity NAME): ***
fooplusbar = client.features([("fooplusbar", "default")], {"person_id": "samuel"})
print(f"{fooplusbar[0]:.3f}")

... and executing it with python ... then we get output, but not the correct values:

Expected output:

-------Foo-------
0.002
-------Bar-------
0.004
-------FooPlusBar-------
0.006

Actual output

-------Foo-------
0.002
-------Bar-------
0.004
-------FooPlusBar-------
0.001

Persisting embeddings in local mode

I'm using EmbeddingHub in local mode like this:

import embeddinghub as eh
hub = eh.connect(eh.LocalConfig("/path/to/data"))
term_space = hub.create_space("terms", dims=768)

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-mpnet-base-v2')

terms = ["apple", "orange", "potato", "chicken"]
embeddings = {}
for t in terms:
  embeddings[t] = model.encode(t)

term_space.multiset(embeddings)

hub.save()

But hub.save() outputs "SAVE NOT IMPLEMENTED YET". On this page it says hub.save() is needed to persist data in local mode.

So given that hub.save() isn't yet implemented, I'm just wondering if there is currently any way to persist embeddings in local mode?

Thanks for your work on this project!
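
Until hub.save() is implemented, one possible workaround sketch, assuming it is acceptable to keep your own copy of the raw embeddings dict: persist that dict yourself and re-multiset it on the next run.

import pickle
import embeddinghub as eh

EMBEDDINGS_FILE = "terms_embeddings.pkl"  # hypothetical path

def save_embeddings(embeddings: dict) -> None:
    # Persist the raw key -> vector mapping alongside the hub's own data dir.
    with open(EMBEDDINGS_FILE, "wb") as f:
        pickle.dump(embeddings, f)

def restore_space():
    hub = eh.connect(eh.LocalConfig("/path/to/data"))
    space = hub.create_space("terms", dims=768)
    with open(EMBEDDINGS_FILE, "rb") as f:
        space.multiset(pickle.load(f))
    return space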

[Bug]: Can't use file object as input for spark DF transformation

Expected Behavior

transactions = spark.register_file(
    name="transactions",
    file_path=".......",
)

@spark.df_transformation(inputs=[transactions])
def outlier_transactions(df):
    return df

Should be valid

Actual Behavior

Throws an error

Traceback (most recent call last):
  File "/Users/sdreyer/Projects/embeddinghub/venv/bin/featureform", line 8, in <module>
    sys.exit(cli())
  File "/Users/sdreyer/Projects/embeddinghub/venv/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/Users/sdreyer/Projects/embeddinghub/venv/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/sdreyer/Projects/embeddinghub/venv/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/sdreyer/Projects/embeddinghub/venv/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/sdreyer/Projects/embeddinghub/venv/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/sdreyer/Projects/embeddinghub/venv/lib/python3.8/site-packages/featureform/cli.py", line 179, in apply
    read_file(file)
  File "/Users/sdreyer/Projects/embeddinghub/venv/lib/python3.8/site-packages/featureform/cli.py", line 221, in read_file
    exec_file(py, file)
  File "/Users/sdreyer/Projects/embeddinghub/venv/lib/python3.8/site-packages/featureform/cli.py", line 234, in exec_file
    exec(code)
  File "webinar.py", line 51, in <module>
    @spark.df_transformation(inputs=[transactions])
  File "/Users/sdreyer/Projects/embeddinghub/venv/lib/python3.8/site-packages/featureform/register.py", line 229, in df_transformation
    return self.__registrar.df_transformation(name=name,
  File "/Users/sdreyer/Projects/embeddinghub/venv/lib/python3.8/site-packages/featureform/register.py", line 2272, in df_transformation
    inputs[i] = nv.name_variant()
AttributeError: 'ColumnSourceRegistrar' object has no attribute 'name_variant'

Steps To Reproduce

  1. Register a file with spark
  2. Use the returned file object as an input into a dataframe transformation
  3. Run featureform apply

What mode are you running Featureform in?

Hosted

What version of Python are you running?

3.7

Featureform Python Package Version

1.6.1

Featureform Helm Chart Version

0.6.1

Kubernetes Version

1.23

Relevant log output

No response

[Bug]: ./pip_update.sh rebuilds the dashboard each time even without updates

Expected Behavior

If the dashboard hasn't been updated, then rebuilding the python package locally shouldn't rebuild the dashboard as it takes a significant amount of time.

The build steps for the python package should be moved to a Makefile and should only rebuild the dashboard if the dashboard has changed.

Actual Behavior

To build the python package, the dashboard is rebuilt every time

Steps To Reproduce

Run ./pip_update.sh

What mode are you running Featureform in?

Local

What version of Python are you running?

3.7

Featureform Python Package Version

Any

Featureform Helm Chart Version

No response

Kubernetes Version

No response

Relevant log output

No response

docs: small fix required

In https://github.com/featureform/embeddinghub/blob/main/docs/overview.md,

Embeddinghub is a database built for machine learning embeddings. It is built with four goals in mind.
...
Prior to Embeddinghub, many organizations would use three different tools to achieve these three goals. With Embeddinghub, you get a database that’s built from the ground up to achieve this functionality.

would use three different tools to achieve these three goals -> would use four different tools to achieve these four goals

RuntimeError: The number of elements exceeds the specified limit

Hi, I read through the tutorial and was trying to load a trained embedding from fastText using the multiset method but ran into this error. A further search on this error seems to suggest that it is thrown by hnswlib. I believe this has something to do with the max_elements value that has to be initialized at the start.

Script

import io
import embeddinghub as eh

hub = eh.connect(eh.LocalConfig("data/"))

def load_vectors(fname):
    fin = io.open(fname, "r", encoding="utf-8", newline="\n", errors="ignore")
    n, d = map(int, fin.readline().split())
    data = {}
    for line in fin:
        tokens = line.rstrip().split(" ")
        data[tokens[0]] = tokens[1:]
    return data

embeddings = load_vectors("wiki-news-300d-1M.vec")

space = hub.create_space("quickstart", dims=300)

space.multiset(embeddings)

neighbors = space.nearest_neighbors(key="apple", num=2)
print(neighbors)

Error

Traceback (most recent call last):
  File "run.py", line 29, in <module>
    space.multiset(embeddings)
  File "/home/derek/anaconda3/envs/py38/lib/python3.8/site-packages/embeddinghub/client.py", line 91, in multiset
    self._idx.multiset(keyed_embeddings)
  File "/home/derek/anaconda3/envs/py38/lib/python3.8/site-packages/embeddinghub/offlinehub.py", line 97, in multiset
    self._idx.add_items(embeddings, idxs)
RuntimeError: The number of elements exceeds the specified limit
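
For reference, the limit comes from the hnswlib index backing the space, as the traceback suggests; a minimal hnswlib-only sketch of sizing and resizing that limit, with made-up numbers:

import hnswlib
import numpy as np

dim = 300
index = hnswlib.Index(space="l2", dim=dim)
# Additions beyond max_elements raise the error above, so size the index for
# the roughly 1M-word .vec file (or resize it later).
index.init_index(max_elements=1_000_000)

vectors = np.random.rand(1000, dim).astype("float32")
index.add_items(vectors, np.arange(1000))

# If more items need to be added later, grow the index first:
index.resize_index(2_000_000)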

RocksDB directory path not expanded correctly (and in conflict with the docs)

The docs say:

docker run -d -v /custom/mount:/root/.embeddinghub/data -p 7462:7462 featureformcom/embeddinghub

However, the directory that is created in embedding_store.cc is metadata instead of data. Furthermore (and I'm not familiar with how path expansion is handled in your path lib) the path is expanded incorrectly, so it creates a literal ~ directory in the container instead of expanding to /root/.embeddinghub/metadata.

At the time of this writing the correct command to run the container is:

docker run -d -v /custom/mount:/~/.embeddinghub/metadata -p 7462:7462 featureformcom/embeddinghub

Actually, I prefer dropping metadata from the container path since I'd rather RocksDB create a metadata (or really data) directory on the host, so the ideal command IMHO is:

docker run -d -v /custom/mount:/~/.embeddinghub -p 7462:7462 featureformcom/embeddinghub

Thanks for the great work on the project!

Using `list` in the CLI returns NO_STATUS

Issue:

When running featureform list features --cert tls.crt
I get:

NAME                           VARIANT                        STATUS                        
avg_transactions               default (default)              NO_STATUS 

And when running:
featureform get feature avg_transactions --cert tls.crt
I get:

NAME:                          avg_transactions         
STATUS:                        NO_STATUS                
-----------------------------------------------
VARIANTS:
default                        default                  
-----------------------------------------------

Expected:

I expect to see a status next to the variant

NAME                           VARIANT                        STATUS                        
avg_transactions               default (default)              READY 

I don't expect a status here since it's not listing a specific variant:

NAME:                          avg_transactions                   
-----------------------------------------------
VARIANTS:
default                        default                  
-----------------------------------------------

Minikube serving client requires tls_verify=False when cert_path used

Using the serving client with a self-signed certificate requires tls_verify to be set to False, which is odd behavior.
serving = ServingClient(host="localhost:443", tls_verify=False, cert_path="tls.crt")

I expect it to work with tls_verify=True, which is the default.
serving = ServingClient(host="localhost:443", cert_path="tls.crt")
