
Sematic's Introduction

Sematic Logo

The open-source Continuous Machine Learning Platform

Build ML pipelines with only Python, run on your laptop or in the cloud.


Sematic Screenshot

Sematic is an open-source ML development platform. It lets ML Engineers and Data Scientists write arbitrarily complex end-to-end pipelines with simple Python and execute them on their local machine, in a cloud VM, or on a Kubernetes cluster to leverage cloud resources.

Sematic is based on learnings gathered at top self-driving car companies. It enables chaining data processing jobs (e.g. Apache Spark) with model training (e.g. PyTorch, TensorFlow), or any other arbitrary Python business logic into type-safe, traceable, reproducible end-to-end pipelines that can be monitored and visualized in a modern web dashboard.

Read our documentation and join our Discord channel.

Why Sematic

  • Easy onboarding – no deployment or infrastructure needed to get started, simply install Sematic locally and start exploring.
  • Local-to-cloud parity – run the same code on your local laptop and on your Kubernetes cluster.
  • End-to-end traceability – all pipeline artifacts are persisted, tracked, and visualizable in a web dashboard.
  • Access heterogeneous compute – customize required resources for each pipeline step to optimize your performance and cloud footprint (CPUs, memory, GPUs, Spark clusters, etc.).
  • Reproducibility – rerun your pipelines from the UI with guaranteed reproducibility of results.

Getting Started

To get started locally, simply install Sematic in your Python environment:

$ pip install sematic

Start the local web dashboard:

$ sematic start

Run an example pipeline:

$ sematic run examples/mnist/pytorch

Create a new boilerplate project:

$ sematic new my_new_project

Or from an existing example:

$ sematic new my_new_project --from examples/mnist/pytorch

Then run it with:

$ python3 -m my_new_project

To deploy Sematic to Kubernetes and leverage cloud resources, see our documentation.

Features

  • Lightweight Python SDK – define arbitrarily complex end-to-end pipelines
  • Pipeline nesting – arbitrarily nest pipelines into larger pipelines
  • Dynamic graphs – Python-defined graphs allow for iterations, conditional branching, etc.
  • Lineage tracking – all inputs and outputs of all steps are persisted and tracked
  • Runtime type-checking – fail early with run-time type checking
  • Web dashboard – monitor, track, and visualize pipelines in a modern web UI
  • Artifact visualization – visualize all inputs and outputs of all steps in the web dashboard
  • Local execution – run pipelines on your local machine without any deployment necessary
  • Cloud orchestration – run pipelines on Kubernetes to access GPUs and other cloud resources
  • Heterogeneous compute resources – run different steps on different machines (e.g. CPUs, memory, GPU, Spark, etc.)
  • Helm chart deployment – install Sematic on your Kubernetes cluster
  • Pipeline reruns – rerun pipelines from the UI from an arbitrary point in the graph
  • Step caching – cache expensive pipeline steps for faster iteration
  • Step retry – recover from transient failures with step retries
  • Metadata and collaboration – tags, source code visualization, docstrings, notes, etc.
  • Numerous integrations – See below

Integrations

  • Apache Spark – on-demand in-cluster Spark clusters
  • Ray – on-demand in-cluster Ray resources
  • Snowflake – easily query your data warehouse (other warehouses supported too)
  • Plotly, Matplotlib – visualize plot artifacts in the web dashboard
  • Pandas – visualize dataframe artifacts in the dashboard
  • Grafana – embed Grafana panels in the web dashboard
  • Bazel – integrate with your Bazel build system
  • Helm chart – deploy to Kubernetes with our Helm chart
  • Git – track git information in the web dashboard

Community and resources

Learn more about Sematic and get in touch through our documentation and Discord.

Contribute!

To contribute to Sematic, check out open issues tagged "good first issue", and get in touch with us on Discord. You can find instructions on how to get your development environment set up in our developer docs. If you'd like to add an example, you may also find this guide helpful.


Sematic's People

Contributors

anuragkanungo, augray, chance-an, chance-sematic, erikcek, idow09, jaichopra, jmalicki, kamalesh0406, katkag, kaushil24, materight, neutralino1, nvinayvarma189, sidguptajhs, snoshy, tscurtu, twitchax, v-pwais


Sematic's Issues

Add public API for blocking on a run until it's complete

People may want to launch a run and then wait to see if it succeeds (ex: if using a Sematic pipeline in CI). This can be done as follows:

import time

from sematic import api_client
from sematic.abstract_future import FutureState

def block_until_done(run_id: str, poll_interval_seconds: int) -> None:
    while True:
        run = api_client.get_run(run_id)
        state = FutureState[run.future_state]
        if state.is_terminal():
            break
        time.sleep(poll_interval_seconds)
    if state != FutureState.RESOLVED:
        raise RuntimeError(f"Run {run_id} finished in state {state}")

However, this uses non-public APIs. We should expose something like block_until_done as a public API.

Improve contributor guide

  • Mention that installing Postgres is a prerequisite (actually that's true for "real" users as well)
  • Describe how to run Bazel tests
  • Add instructions for installing the standard versions of black/flake8/mypy

Make resolver jobs re-entrant

Resolver pods can get evicted for various reasons and then replaced. These pods stick around for long-ish periods of time, so it would be good to have them be resilient in this scenario: able to re-load their state from the API and continue as if nothing had happened.

Error when returning a DataFrame containing a datetime[ns] type

I have a function that returns a DataFrame containing a column of type datetime[ns]. I get the following error when running:

Traceback (most recent call last):
  File "/Users/apope/.pyenv/versions/3.9.13/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/apope/.pyenv/versions/3.9.13/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/apope/nursefly-analytics/experiments/semantic-ai/my_first_pipeline/__main__.py", line 25, in <module>
    main()
  File "/Users/apope/nursefly-analytics/experiments/semantic-ai/my_first_pipeline/__main__.py", line 15, in main
    pipeline(
  File "/Users/apope/nursefly-analytics/experiments/semantic-ai/venv/lib/python3.9/site-packages/sematic/future.py", line 35, in resolve
    self.value = resolver.resolve(self)
  File "/Users/apope/nursefly-analytics/experiments/semantic-ai/venv/lib/python3.9/site-packages/sematic/resolvers/state_machine_resolver.py", line 28, in resolve
    self._schedule_future_if_input_ready(future_)
  File "/Users/apope/nursefly-analytics/experiments/semantic-ai/venv/lib/python3.9/site-packages/sematic/resolvers/state_machine_resolver.py", line 220, in _schedule_future_if_input_ready
    self._schedule_future(future)
  File "/Users/apope/nursefly-analytics/experiments/semantic-ai/venv/lib/python3.9/site-packages/sematic/resolvers/silent_resolver.py", line 12, in _schedule_future
    self._run_inline(future)
  File "/Users/apope/nursefly-analytics/experiments/semantic-ai/venv/lib/python3.9/site-packages/sematic/resolvers/silent_resolver.py", line 21, in _run_inline
    self._handle_future_failure(future, exception)
  File "/Users/apope/nursefly-analytics/experiments/semantic-ai/venv/lib/python3.9/site-packages/sematic/resolvers/state_machine_resolver.py", line 242, in _handle_future_failure
    raise exception
  File "/Users/apope/nursefly-analytics/experiments/semantic-ai/venv/lib/python3.9/site-packages/sematic/resolvers/silent_resolver.py", line 19, in _run_inline
    self._update_future_with_value(future, value)
  File "/Users/apope/nursefly-analytics/experiments/semantic-ai/venv/lib/python3.9/site-packages/sematic/resolvers/state_machine_resolver.py", line 278, in _update_future_with_value
    self._set_future_state(future, FutureState.RESOLVED)
  File "/Users/apope/nursefly-analytics/experiments/semantic-ai/venv/lib/python3.9/site-packages/sematic/resolvers/state_machine_resolver.py", line 114, in _set_future_state
    CALLBACKS[state](future)
  File "/Users/apope/nursefly-analytics/experiments/semantic-ai/venv/lib/python3.9/site-packages/sematic/resolvers/local_resolver.py", line 135, in _future_did_resolve
    output_artifact = make_artifact(future.value, future.calculator.output_type)
  File "/Users/apope/nursefly-analytics/experiments/semantic-ai/venv/lib/python3.9/site-packages/sematic/db/models/factories.py", line 45, in make_artifact
    json_summary = get_json_encodable_summary(value, type_)
  File "/Users/apope/nursefly-analytics/experiments/semantic-ai/venv/lib/python3.9/site-packages/sematic/types/serialization.py", line 100, in get_json_encodable_summary
    return to_json_encodable_summary_func(value, type_)
  File "/Users/apope/nursefly-analytics/experiments/semantic-ai/venv/lib/python3.9/site-packages/sematic/types/types/pandas/dataframe.py", line 22, in _dataframe_json_encodable_summary
    if len(json.dumps(payload)) > _PAYLOAD_CUTOFF:
  File "/Users/apope/.pyenv/versions/3.9.13/lib/python3.9/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/Users/apope/.pyenv/versions/3.9.13/lib/python3.9/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/Users/apope/.pyenv/versions/3.9.13/lib/python3.9/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/Users/apope/.pyenv/versions/3.9.13/lib/python3.9/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type Timestamp is not JSON serializable
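A plausible fix is to give json.dumps a default= fallback that serializes timestamp-like values as ISO-8601 strings; this sketch uses the standard library's datetime, relying on the fact that pandas.Timestamp also exposes .isoformat():

```python
import datetime
import json

def json_default(obj):
    """Fallback for json.dumps: render timestamp-like objects as ISO-8601
    strings instead of raising. A real fix would also handle NaN/NaT."""
    if hasattr(obj, "isoformat"):
        return obj.isoformat()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

payload = {"ts": datetime.datetime(2022, 7, 1, 12, 30)}
encoded = json.dumps(payload, default=json_default)
```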

Add retries to read queries in `queries.py`

Sometimes the connection to the DB can be closed unexpectedly but would be successful on a retry. For read queries, retrying is harmless. We should add the @retry decorator to reads for robustness.
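The decorator could be a standard retry wrapper; this generic sketch (names and defaults are illustrative, not Sematic's actual @retry) retries on a configurable exception tuple with a fixed delay:

```python
import functools
import time

def retry(tries: int = 3, delay: float = 0.1, exceptions=(Exception,)):
    """Retry a function up to `tries` times, sleeping `delay` between attempts."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == tries:
                        raise  # out of retries: surface the original error
                    time.sleep(delay)
        return wrapper
    return decorator
```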

Persist resource requirements to the DB

ResourceRequirements let users specify function-specific resource requirements. For example, a function can specify that it needs a particular type of Kubernetes node (e.g. GPU, high-mem).

At this time, these are used at runtime when launching K8s jobs, but they are not persisted in the DB. Persisting them is necessary to enable re-running pipelines from the UI and CLI, as well as to move Kubernetes job launches behind the server.

Option to suppress get_item in UI

Python 3.9
sematic 0.10.0

Within a sematic function, calling a sematic function that returns a tuple, then accessing components, causes a get_item to appear in the UI:

results = plot_experiments(status, experiments, exp_results[0])

@sematic.func
def update_status(status: str = None) -> str:
    status = (
        status
        + " set_path "
        + data_path
        + date_prefix
        + "_optimization_set.csv"
    )
    return status

status = update_status(results[1])

image

Add timeouts for Sematic funcs

It would be helpful if Sematic functions could have associated timeouts, such that if they take longer than a specified time the function automatically fails.
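For the local case, a timeout could be approximated by running the function on a worker thread and bounding the wait. This sketch abandons the thread on timeout, whereas a real implementation would also cancel the underlying job; all names are illustrative:

```python
import concurrent.futures

def run_with_timeout(func, timeout_seconds, *args, **kwargs):
    """Call func(*args, **kwargs), raising TimeoutError if it runs longer
    than timeout_seconds. Thread-based sketch only: the worker thread is
    abandoned on timeout rather than cancelled."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(func, *args, **kwargs).result(timeout=timeout_seconds)
    finally:
        # Don't block waiting for an abandoned worker thread.
        pool.shutdown(wait=False)
```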

[Bug] Inconsistent UI rendering

Description

When we run the liver_cirrhosis example provided by sematic, we get Unexpected token N in JSON at position {n} when sematic tries to render the output of the data preprocessing steps.

inconsistent_rendering_bug.webm

Is this a regression?

No. I'm able to reproduce this bug even in the previous version of this library (sematic==0.5.0)

To reproduce

this draft PR #55 has the code in which the issue is reproducible.

Diagnosis

Diagnosis by @neutralino1 is that:

In order to support future unpacking, Sematic injects _get_item futures to do the unpacking. So naturally those show up in the UI.
What we need to do is mark these runs with a label (e.g. "system") and not display them in the UI.

Expected behavior

The example runs without any issues in the UI.

Environment

  • Sematic version: commit version
  • Python version: 3.9.11
  • Node version: v16.14.2
  • Npm version: 8.12.2
  • OS: Ubuntu 20.04
  • Browser name: Brave
  • Browser version: 1.32.113

Show the server version on the UI

When people upgrade regularly, it can be helpful to know what version of the server is running. We should display this information discreetly in the UI.

Stable links to pipeline executions

There are URLs that will take you to a pipeline, but not to an individual execution of one. When people are sharing executions with one another, they will likely want some way to jump straight to viewing that particular execution.

Exception on `sematic start` after installation

After installing sematic for the first time, I try to start and get the following exception. I tried to dig in a bit but didn't find an obvious solution...

python version: Python 3.9.7

11:07:21 root@6b93e1bb3779 algo ±|dev ✗|→ sematic start
Traceback (most recent call last):
  File "/opt/pyenv/versions/3.9.7/bin/sematic", line 5, in <module>
    from sematic.cli.main import cli
  File "/opt/pyenv/versions/3.9.7/lib/python3.9/site-packages/sematic/cli/main.py", line 5, in <module>
    import sematic.cli.start  # noqa: F401
  File "/opt/pyenv/versions/3.9.7/lib/python3.9/site-packages/sematic/cli/start.py", line 12, in <module>
    from sematic.api.server import run_wsgi
  File "/opt/pyenv/versions/3.9.7/lib/python3.9/site-packages/sematic/api/server.py", line 7, in <module>
    import eventlet
  File "/opt/pyenv/versions/3.9.7/lib/python3.9/site-packages/eventlet/__init__.py", line 17, in <module>
    from eventlet import convenience
  File "/opt/pyenv/versions/3.9.7/lib/python3.9/site-packages/eventlet/convenience.py", line 7, in <module>
    from eventlet.green import socket
  File "/opt/pyenv/versions/3.9.7/lib/python3.9/site-packages/eventlet/green/socket.py", line 21, in <module>
    from eventlet.support import greendns
  File "/opt/pyenv/versions/3.9.7/lib/python3.9/site-packages/eventlet/support/greendns.py", line 66, in <module>
    setattr(dns, pkg, import_patched('dns.' + pkg))
  File "/opt/pyenv/versions/3.9.7/lib/python3.9/site-packages/eventlet/support/greendns.py", line 61, in import_patched
    return patcher.import_patched(module_name, **modules)
  File "/opt/pyenv/versions/3.9.7/lib/python3.9/site-packages/eventlet/patcher.py", line 129, in import_patched
    return inject(
  File "/opt/pyenv/versions/3.9.7/lib/python3.9/site-packages/eventlet/patcher.py", line 106, in inject
    module = __import__(module_name, {}, {}, module_name.split('.')[:-1])
  File "/opt/pyenv/versions/3.9.7/lib/python3.9/site-packages/dns/dnssec.py", line 483, in <module>
    from Crypto.PublicKey import RSA as CryptoRSA, DSA as CryptoDSA
  File "/opt/pyenv/versions/3.9.7/lib/python3.9/site-packages/Crypto/PublicKey/__init__.py", line 21, in <module>
    from Crypto.Util.asn1 import (DerSequence, DerInteger, DerBitString,
  File "/opt/pyenv/versions/3.9.7/lib/python3.9/site-packages/Crypto/Util/asn1.py", line 27, in <module>
    from Crypto.Util.number import long_to_bytes, bytes_to_long
  File "/opt/pyenv/versions/3.9.7/lib/python3.9/site-packages/Crypto/Util/number.py", line 399
    s = pack('>I', n & 0xffffffffL) + s
                                 ^
SyntaxError: invalid syntax

(What else is required for a diagnosis?)

Support __getattr__ for Future dataclasses

Ideally this would work:

@dataclass
class Foo:
    foo: int

@sematic.func
def make_foo(i: int) -> Foo:
    return Foo(foo=i)

@sematic.func
def get_int(foo: Foo) -> int:
    return make_foo(42).foo

But it doesn't, because make_foo returns a future, and attribute access isn't supported for futures wrapping dataclasses. But since dataclasses have type annotations, we could make this work.

There is a workaround, but it's kind of annoying for a simple field access:

@dataclass
class Foo:
    foo: int

@sematic.func
def make_foo(i: int) -> Foo:
    return Foo(foo=i)

@sematic.func
def get_int(foo: Foo) -> int:
    return get_foo_field(make_foo(42))

@sematic.func
def get_foo_field(foo: Foo) -> int:
    return foo.foo
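Since dataclasses expose their fields via dataclasses.fields, a future could intercept attribute access and validate the field name against the wrapped type. A hypothetical sketch that returns the declared field type just to show the lookup (Sematic would instead return a new typed future):

```python
import dataclasses

class Future:
    """Hypothetical stand-in for a future wrapping a dataclass value."""

    def __init__(self, wrapped_type):
        self._wrapped_type = wrapped_type

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        field_types = {f.name: f.type for f in dataclasses.fields(self._wrapped_type)}
        if name not in field_types:
            raise AttributeError(name)
        # A real implementation would return a future typed as field_types[name];
        # returning the declared type here just demonstrates the lookup.
        return field_types[name]
```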

Automatically convert Tuple-of-future to future-Tuple

We already support this for lists:

@sematic.func
def pipeline() -> typing.List[int]:
    return [foo(), bar()]

where foo and bar are Sematic funcs (thus returning futures). We should also support this:

@sematic.func
def pipeline() -> typing.Tuple[int]:
    return (foo(), bar())

Make "address" settings consistent

We have one case where an env var is expected to set the server address without the http (SEMATIC_SERVER_ADDRESS) and one where it is expected to have it (SEMATIC_API_ADDRESS). SEMATIC_WORKER_API_ADDRESS works both ways. We should have the other two work both ways as well.
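One way to make all three variables tolerant of either form is to normalize on read; a small sketch (the function name and default scheme are assumptions):

```python
def normalize_address(raw: str, default_scheme: str = "http") -> str:
    """Accept an address with or without the scheme and return a full URL,
    so the env vars above can be set either way."""
    if raw.startswith(("http://", "https://")):
        return raw
    return f"{default_scheme}://{raw}"
```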

Refactor server logging to use a SocketHandler rather than a file handler directly

From python logging cookbook:

When deploying Web applications using Gunicorn or uWSGI (or similar), multiple worker processes are created to handle client requests. In such environments, avoid creating file-based handlers directly in your web application. Instead, use a SocketHandler to log from the web application to a listener in a separate process. This can be set up using a process management tool such as Supervisor - see Running a logging socket listener in production for more details.

But we currently use a rotating file handler. We should refactor to log to a socket, which then uses a rotating file handler.
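The cookbook pattern boils down to attaching a logging.handlers.SocketHandler in each worker and letting one listener process own the rotating file handler. A minimal sketch of the worker side (host and port are placeholders):

```python
import logging
import logging.handlers

def configure_worker_logging(
    host: str = "localhost",
    port: int = logging.handlers.DEFAULT_TCP_LOGGING_PORT,
) -> logging.Logger:
    """Each web worker ships records over a socket; a separate listener
    process owns the RotatingFileHandler, avoiding interleaved writes."""
    root = logging.getLogger()
    root.setLevel(logging.INFO)
    root.addHandler(logging.handlers.SocketHandler(host, port))
    return root
```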

Always fail when doing __eq__ (and other comparisons) on a future

You can hit weird cases that behave in unexpected ways when you try to compare futures:

@sematic.func
def pipeline() -> str:
    if some_sematic_func() == 1:
        return "Yay!"
    else:
        return "Boo!"

No matter whether some_sematic_func() resolves to a 1 or not, this will always return "Boo!" because what's actually being compared is Future(...) == 1, which will always be False. We should save users from this and similar gotchas (other comparison operators) with good error messages.
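A sketch of the guard: a future type whose comparison operators raise a descriptive TypeError instead of silently evaluating to False (hypothetical, not Sematic's actual Future class):

```python
class Future:
    """Hypothetical stand-in for Sematic's future type."""

    def _refuse_comparison(self, other):
        raise TypeError(
            "Cannot compare an unresolved Sematic future; "
            "move the comparison into a @sematic.func instead."
        )

    __eq__ = _refuse_comparison
    __ne__ = _refuse_comparison
    __lt__ = _refuse_comparison
    __le__ = _refuse_comparison
    __gt__ = _refuse_comparison
    __ge__ = _refuse_comparison
    # Defining __eq__ would otherwise set __hash__ to None.
    __hash__ = object.__hash__
```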

Bind to 0.0.0.0 by default?

When trying out Sematic, the app is not accessible from another machine by default. How about binding to 0.0.0.0 by default, so that one can try it out without having to do a full-fledged remote deployment as described in https://docs.sematic.dev/diving-deeper/deploy?

Or maybe add an option flag to make it accessible externally? I got stuck on this when onboarding.

Add python 3.10 support

It would be nice to have support for the latest stable python version. We currently have some libraries that aren't playing well on 3.10 (I think eventlet, perhaps?).

Can not sign up to newsletter or discord

I can not sign up to your newsletter - I just get this error message:

Oops! Something went wrong while submitting the form.

Also the discord sign up screen just does not work - after typing a username nothing happens.

Besides using ublock origin to surf the internet (of course) I am not doing anything special.

Clean up external jobs when runs fail

When a resolution fails, the run objects are moved to the FAILED state, but the k8s jobs for them are not necessarily cleaned up. We should fix that and clean up the jobs.

Submit cloud jobs behind the Sematic API Server

Right now when submitting k8s jobs, it happens from the clients (either from the resolver in detached mode, or from a user machine in non-detached mode). We should put this behind the API server. This will help make re-runs easier to implement, and also make it so that end-users don't have to install/set up k8s to execute in the cloud.

Enable using a different API URL when executing locally vs when executing in the cloud

Currently whatever URL the end-user is using to talk to the server is passed as an env var to the cloud job, and the cloud job uses that same URL. However, some people have a setup where the end users use one URL, which passes through a reverse proxy (ex: Cloudflare), while the jobs executing on-cluster should use a different URL. We should allow this kind of execution mode.

Add "chunking" and log rotation to log streaming

Currently we upload the entirety of a sematic function's logs at once. We buffer them on disk. For long-running jobs, this may not be viable; the container may eat up too much of its allowed storage. We need to make it so we truncate logs after each upload so that the log file on disk doesn't get too large. But we'll need to be careful we don't lose any log lines when we do this.

Support dataclasses with frozen=True

Currently there is logic that attempts to cast dataclass values by progressively setting the field values. This casting logic runs even when none of the field values would actually be changed. For dataclasses with frozen=True set, this fails. We should fix this.
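The fix could route through dataclasses.replace, which builds a new instance instead of mutating in place and therefore works for frozen dataclasses too. A sketch with an illustrative Config dataclass:

```python
import dataclasses

@dataclasses.dataclass(frozen=True)
class Config:
    lr: float
    epochs: int

def cast_fields(value, new_values: dict):
    """Instead of setattr (which raises FrozenInstanceError on frozen
    dataclasses), construct a new instance with the changed fields."""
    return dataclasses.replace(value, **new_values)
```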

Unreturned futures should be executed

When I do this

@sematic.func
def foo() -> T:
    ...

@sematic.func
def bar() -> U:
    foo()
    ...

I expect foo to be resolved as part of graph. Currently it is not resolved because it is neither returned by bar, nor passed as an input to another Sematic Function.

We can solve this by having a global future registry that futures add themselves to upon instantiation.

Make the run id slugs on chats into links

The chat view on pipelines shows the first few characters of the run that they were made on. It would be great if you could click on those and be taken to the view for the corresponding run.

Enable testing with multiple interpreter versions

Currently Bazel tests only with Python 3.9. We should try to enable testing with more than one Python version, since users may be on a variety of versions. In the meantime, we should consider pinning Bazel to the minimum supported version.

Give a friendlier error message when Sematic is used with an incompatible python version

Users can currently receive a message like this if they try to use Sematic with unsupported Python versions:

    from typing import (  # type: ignore
ImportError: cannot import name 'GenericAlias' from 'typing' (/usr/lib/python3.7/typing.py)

For folks installing with pip, this shouldn't be possible, but other mechanisms of depending on sematic might make it possible to try it with older interpreters.
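A friendlier failure could be produced by an explicit check at import time; the version bounds below are assumptions for illustration:

```python
import sys

MIN_SUPPORTED = (3, 8)   # assumed lower bound
MAX_SUPPORTED = (3, 10)  # assumed upper bound

def check_python_version(version_info=sys.version_info):
    """Fail fast with an actionable message instead of a deep ImportError."""
    version = (version_info[0], version_info[1])
    if not (MIN_SUPPORTED <= version <= MAX_SUPPORTED):
        raise RuntimeError(
            f"Sematic supports Python {MIN_SUPPORTED[0]}.{MIN_SUPPORTED[1]} "
            f"through {MAX_SUPPORTED[0]}.{MAX_SUPPORTED[1]}, "
            f"but you are running {version[0]}.{version[1]}."
        )
```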

[Bug] Unable to render some dataframes in the UI

Description

When we run the liver_cirrhosis example provided by sematic, we get Unexpected token N in JSON at position {n} when sematic tries to render the output of the data preprocessing steps.

liver_cirrhosis_glitching.mp4

Is this a regression?

No. I'm able to reproduce this bug even in the previous version of this library (sematic==0.0.4a0)

To reproduce

  1. Start the sematic server: bazel run //sematic/api:server -- --debug
  2. Start sematic UI: cd sematic/ui && npm start

Example 1

Create a pipeline like this:

from sklearn.datasets import fetch_openml

@sematic.func
def pipeline() -> pd.DataFrame:
   X, y = fetch_openml('titanic', version=1, as_frame=True, return_X_y=True)
   return X

This will show this error
image

Example 2

Create a pipeline like this:

@sematic.func
def pipeline() -> pd.DataFrame:
   dataframe = pd.DataFrame(
       {
           "pclass": ["1", None, "hello", 2, 3, 3],
           "age": [3, 4, 5, 5, 5, 6]
       }
   )
   return dataframe

This will show this error
image

Diagnosis

It is related to artifacts=[artifact.to_json_encodable() for artifact in artifacts], here

artifact.to_json_encodable() is not able to handle some values of a dataframe. For example: a None value.
That's why we see an error in the UI while the data pre-processing steps of liver_cirrhosis are running. The dataframes being returned by these steps are not being converted correctly by the to_json_encodable function.

As a result, in the UI, within the fetchJSON function, when we do

fetch(url)
    .then((res) => res.json())

res.json() is causing an error

Expected behavior

UI code to be updated to handle the different dtypes that can be present in a dataframe and be able to render the dataframes properly.

Environment

  • Sematic version: 0.1.0a0
  • Python version: 3.9.11
  • Node version: v16.14.2
  • Npm version: 8.12.2
  • OS: Ubuntu 20.04
  • Browser name: Brave
  • Browser version: 1.32.113

Make authenticated endpoints the default

Right now developers need to decorate API endpoints with @authenticate to make sure an endpoint cannot be accessed without a valid API key in the headers.

This is ok, but it would be safer to require authentication by default, and have an API to mark certain endpoints as not requiring authentication.
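A sketch of the inverted default, using a hypothetical opt-out decorator and dispatcher (not Sematic's actual server code):

```python
PUBLIC_ENDPOINTS = set()

def public(func):
    """Opt-out marker: endpoints require auth unless explicitly public."""
    PUBLIC_ENDPOINTS.add(func.__name__)
    return func

def dispatch(endpoint, headers, api_keys):
    """Server-side gate that authenticates by default."""
    if endpoint.__name__ not in PUBLIC_ENDPOINTS:
        if headers.get("X-API-Key") not in api_keys:
            return 401, "Unauthorized"
    return 200, endpoint()
```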

Detect and recover from run read-modify-write race conditions

There are a few places that do read-modify-write with runs:

  • resolver jobs
  • the server (for updating runs from external jobs)
  • the worker (for storing outputs/exceptions)
  • eventually the UI (adding tags and stuff)

Supposing you have two entities, E1 and E2, whose operations interleave as follows:
E1.read
E2.read
E1.modify
E2.modify
E1.write
E2.write

In this case, the modification from E1 will be overwritten by the modification from E2. There are ways to detect this situation so the writer can retry making its modification. We should do that to prevent weird bugs!
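One common detection scheme is optimistic concurrency control: store a version counter with each run and make the write conditional on the version still matching what was read. A minimal in-memory sketch (a real database would express the check as an `UPDATE ... WHERE version = ?`):

```python
class StaleWriteError(Exception):
    """Raised when another writer won the race; the caller should re-read
    and retry its modification."""

class RunStore:
    def __init__(self):
        self._runs = {}  # run_id -> (version, state)

    def read(self, run_id):
        return self._runs.get(run_id, (0, None))

    def write(self, run_id, expected_version, new_state):
        current_version, _ = self.read(run_id)
        if current_version != expected_version:
            raise StaleWriteError(f"Run {run_id} changed since it was read")
        self._runs[run_id] = (current_version + 1, new_state)
```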

When using `Any` a TypeError is thrown: isinstance() arg 2 must be a type or tuple of types

Using HuggingFace datasets:

from datasets import load_dataset

@sematic.func
def pipeline() -> Any:
    dataset = load_yelp_dataset()
    return dataset

@sematic.func
def load_yelp_dataset() -> Any:
    dataset = load_dataset("yelp_review_full")
    print(type(dataset))
    return dataset

When using DatasetDict instead of Any it works

Here is the stack trace:

Traceback (most recent call last):
  File "/Users/jai/opt/anaconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/jai/opt/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/jai/projects/sematic/my_project/__main__.py", line 17, in <module>
    main()
  File "/Users/jai/projects/sematic/my_project/__main__.py", line 13, in main
    pipeline().resolve()
  File "/Users/jai/projects/sematic/sematic/future.py", line 21, in resolve
    self.value = resolver.resolve(self)
  File "/Users/jai/projects/sematic/sematic/resolvers/state_machine_resolver.py", line 28, in resolve
    self._schedule_future_if_input_ready(future_)
  File "/Users/jai/projects/sematic/sematic/resolvers/state_machine_resolver.py", line 220, in _schedule_future_if_input_ready
    self._schedule_future(future)
  File "/Users/jai/projects/sematic/sematic/resolvers/offline_resolver.py", line 84, in _schedule_future
    self._run_inline(future)
  File "/Users/jai/projects/sematic/sematic/resolvers/offline_resolver.py", line 93, in _run_inline
    self._handle_future_failure(future, exception)
  File "/Users/jai/projects/sematic/sematic/resolvers/state_machine_resolver.py", line 242, in _handle_future_failure
    raise exception
  File "/Users/jai/projects/sematic/sematic/resolvers/offline_resolver.py", line 91, in _run_inline
    self._update_future_with_value(future, value)
  File "/Users/jai/projects/sematic/sematic/resolvers/state_machine_resolver.py", line 271, in _update_future_with_value
    value = future.calculator.cast_output(value)
  File "/Users/jai/projects/sematic/sematic/calculator.py", line 131, in cast_output
    return self.cast_value(
  File "/Users/jai/projects/sematic/sematic/calculator.py", line 155, in cast_value
    cast_value, error = safe_cast(value, type_)
  File "/Users/jai/projects/sematic/sematic/types/casting.py", line 95, in safe_cast
    if isinstance(value, type_):
TypeError: isinstance() arg 2 must be a type or tuple of types

Document set_env

Using set_env is required for running locally but writing to a shared metadata server. We should document it.

Enforce that only allowed state transitions happen for runs

It could get tricky to verify that runs are always in a valid state, especially given how we dual-represent some things in memory and in the DB (both of which need to be kept somewhat in sync with what's happening with remote compute). One thing that would help ensure we're doing valid transitions would be to encode which future-state transitions are allowed, and assert that runs are always moved between these states in the expected way.
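An allowed-transitions table plus a small guard could enforce this; the state names below are illustrative stand-ins, not Sematic's actual FutureState values:

```python
# Which target states each state may legally move to (names assumed).
ALLOWED_TRANSITIONS = {
    "CREATED": {"SCHEDULED"},
    "SCHEDULED": {"RAN", "FAILED"},
    "RAN": {"RESOLVED", "FAILED"},
    "RESOLVED": set(),  # terminal
    "FAILED": set(),    # terminal
}

def transition(current: str, new: str) -> str:
    """Assert the move is legal before persisting it anywhere."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"Illegal run state transition: {current} -> {new}")
    return new
```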
