sematic-ai / sematic
An open-source ML pipeline development platform
License: Other
I have a function that returns a DataFrame containing a column of type `datetime64[ns]`. I get the following error when running:
Traceback (most recent call last):
File "/Users/apope/.pyenv/versions/3.9.13/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/Users/apope/.pyenv/versions/3.9.13/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/Users/apope/nursefly-analytics/experiments/semantic-ai/my_first_pipeline/__main__.py", line 25, in <module>
main()
File "/Users/apope/nursefly-analytics/experiments/semantic-ai/my_first_pipeline/__main__.py", line 15, in main
pipeline(
File "/Users/apope/nursefly-analytics/experiments/semantic-ai/venv/lib/python3.9/site-packages/sematic/future.py", line 35, in resolve
self.value = resolver.resolve(self)
File "/Users/apope/nursefly-analytics/experiments/semantic-ai/venv/lib/python3.9/site-packages/sematic/resolvers/state_machine_resolver.py", line 28, in resolve
self._schedule_future_if_input_ready(future_)
File "/Users/apope/nursefly-analytics/experiments/semantic-ai/venv/lib/python3.9/site-packages/sematic/resolvers/state_machine_resolver.py", line 220, in _schedule_future_if_input_ready
self._schedule_future(future)
File "/Users/apope/nursefly-analytics/experiments/semantic-ai/venv/lib/python3.9/site-packages/sematic/resolvers/silent_resolver.py", line 12, in _schedule_future
self._run_inline(future)
File "/Users/apope/nursefly-analytics/experiments/semantic-ai/venv/lib/python3.9/site-packages/sematic/resolvers/silent_resolver.py", line 21, in _run_inline
self._handle_future_failure(future, exception)
File "/Users/apope/nursefly-analytics/experiments/semantic-ai/venv/lib/python3.9/site-packages/sematic/resolvers/state_machine_resolver.py", line 242, in _handle_future_failure
raise exception
File "/Users/apope/nursefly-analytics/experiments/semantic-ai/venv/lib/python3.9/site-packages/sematic/resolvers/silent_resolver.py", line 19, in _run_inline
self._update_future_with_value(future, value)
File "/Users/apope/nursefly-analytics/experiments/semantic-ai/venv/lib/python3.9/site-packages/sematic/resolvers/state_machine_resolver.py", line 278, in _update_future_with_value
self._set_future_state(future, FutureState.RESOLVED)
File "/Users/apope/nursefly-analytics/experiments/semantic-ai/venv/lib/python3.9/site-packages/sematic/resolvers/state_machine_resolver.py", line 114, in _set_future_state
CALLBACKS[state](future)
File "/Users/apope/nursefly-analytics/experiments/semantic-ai/venv/lib/python3.9/site-packages/sematic/resolvers/local_resolver.py", line 135, in _future_did_resolve
output_artifact = make_artifact(future.value, future.calculator.output_type)
File "/Users/apope/nursefly-analytics/experiments/semantic-ai/venv/lib/python3.9/site-packages/sematic/db/models/factories.py", line 45, in make_artifact
json_summary = get_json_encodable_summary(value, type_)
File "/Users/apope/nursefly-analytics/experiments/semantic-ai/venv/lib/python3.9/site-packages/sematic/types/serialization.py", line 100, in get_json_encodable_summary
return to_json_encodable_summary_func(value, type_)
File "/Users/apope/nursefly-analytics/experiments/semantic-ai/venv/lib/python3.9/site-packages/sematic/types/types/pandas/dataframe.py", line 22, in _dataframe_json_encodable_summary
if len(json.dumps(payload)) > _PAYLOAD_CUTOFF:
File "/Users/apope/.pyenv/versions/3.9.13/lib/python3.9/json/__init__.py", line 231, in dumps
return _default_encoder.encode(obj)
File "/Users/apope/.pyenv/versions/3.9.13/lib/python3.9/json/encoder.py", line 199, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/Users/apope/.pyenv/versions/3.9.13/lib/python3.9/json/encoder.py", line 257, in iterencode
return _iterencode(o, 0)
File "/Users/apope/.pyenv/versions/3.9.13/lib/python3.9/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type Timestamp is not JSON serializable
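For reference, the failure reproduces outside Sematic with plain `json.dumps`, since the stdlib encoder has no handler for pandas `Timestamp` objects. A minimal sketch (the `default=str` fallback is only one possible mitigation, not necessarily what the summary code should do):

```python
import json

import pandas as pd

payload = {"created_at": pd.Timestamp("2022-06-01")}

try:
    json.dumps(payload)
except TypeError as e:
    print(e)  # Object of type Timestamp is not JSON serializable

# Falling back to str() for unknown objects sidesteps the error:
print(json.dumps(payload, default=str))  # {"created_at": "2022-06-01 00:00:00"}
```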
Feature request: the ability to choose the number of rows displayed, etc., to better inspect results.
Right now, Kubernetes job submission happens from clients (either from the resolver in detached mode, or from a user machine in non-detached mode). We should put it behind the API server. This will make re-runs easier to implement, and also mean that end users don't have to install/set up k8s to execute in the cloud.
It could get tricky to verify that runs are always in a valid state, especially given that we represent some things both in memory and in the DB (and both need to be kept roughly in sync with what's happening with remote compute). One thing that would help ensure we only make valid transitions would be to encode which future-state transitions are allowed, and assert that runs are always moved between these states in the expected way.
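A minimal sketch of what encoding the allowed transitions could look like (the state names and edges below are illustrative, not Sematic's actual `FutureState` values):

```python
from enum import Enum, unique


@unique
class RunState(Enum):
    CREATED = "CREATED"
    SCHEDULED = "SCHEDULED"
    RAN = "RAN"
    RESOLVED = "RESOLVED"
    FAILED = "FAILED"


# Illustrative transition table; the real states and edges would come from
# FutureState and the resolver logic.
_ALLOWED_TRANSITIONS = {
    RunState.CREATED: {RunState.SCHEDULED, RunState.FAILED},
    RunState.SCHEDULED: {RunState.RAN, RunState.FAILED},
    RunState.RAN: {RunState.RESOLVED, RunState.FAILED},
    RunState.RESOLVED: set(),
    RunState.FAILED: set(),
}


def assert_valid_transition(current: RunState, new: RunState) -> None:
    """Fail loudly if a run is moved between states in an unexpected way."""
    if new not in _ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"Illegal run state transition: {current} -> {new}")
```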
Many users expect `<!-- some comment -->` to hide text from being displayed in Markdown, and Sematic accepts Markdown for docstrings. However, comments of that format get displayed. We should suppress them.
When we run the `liver_cirrhosis` example provided by Sematic, we get `Unexpected token N in JSON at position {n}` when Sematic tries to render the output of the data preprocessing steps.
No. I'm able to reproduce this bug even in the previous version of this library (sematic==0.5.0).
This draft PR (#55) has the code in which the issue is reproducible.
Diagnosis by @neutralino1 is that:
In order to support future unpacking, Sematic injects _get_item futures to do the unpacking. So naturally those show up in the UI.
What we need to do is mark these runs with a label (e.g. "system") and not display them in the UI.
The example runs without any issues in the UI.
We have one case where an env var is expected to set the server address without the http:// scheme (`SEMATIC_SERVER_ADDRESS`) and one where it is expected to include it (`SEMATIC_API_ADDRESS`). `SEMATIC_WORKER_API_ADDRESS` works both ways. We should have the other two work both ways as well.
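One way to make all of them tolerant of both forms is to normalize whatever value is provided before use (a sketch; the helper name is hypothetical):

```python
def normalize_server_url(address: str, default_scheme: str = "http") -> str:
    """Accept 'example.com:5001' or 'https://example.com:5001' alike."""
    address = address.strip().rstrip("/")
    if not address.startswith(("http://", "https://")):
        address = f"{default_scheme}://{address}"
    return address


assert normalize_server_url("example.com:5001") == "http://example.com:5001"
assert normalize_server_url("https://example.com:5001") == "https://example.com:5001"
```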
It would be nice to have support for the latest stable python version. We currently have some libraries that aren't playing well on 3.10 (I think eventlet, perhaps?).
People may want to launch a run and then wait to see if it succeeds (ex: if using a Sematic pipeline in CI). This can be done as follows:
```python
import time

from sematic import api_client
from sematic.abstract_future import FutureState


def block_until_done(run_id: str, poll_interval_seconds: int):
    keep_going = True
    while keep_going:
        run = api_client.get_run(run_id)
        state = FutureState[run.future_state]
        keep_going = not state.is_terminal()
        time.sleep(poll_interval_seconds)
    if state != FutureState.RESOLVED:
        raise RuntimeError(f"Run {run_id} finished in state {state}")
```
However, this uses non-public APIs. We should expose something like `block_until_done` as a public API.
Currently whatever URL the end-user is using to talk to the server is passed as an env var to the cloud job, and the cloud job uses that same URL. However, some people have a setup where the end users use one URL, which passes through a reverse proxy (ex: Cloudflare), while the jobs executing on-cluster should use a different URL. We should allow this kind of execution mode.
When we run the `liver_cirrhosis` example provided by Sematic, we get `Unexpected token N in JSON at position {n}` when Sematic tries to render the output of the data preprocessing steps.
No. I'm able to reproduce this bug even in the previous version of this library (sematic==0.0.4a0).
```shell
bazel run //sematic/api:server -- --debug
cd sematic/ui && npm start
```
Create a pipeline like this:
```python
import pandas as pd
import sematic
from sklearn.datasets import fetch_openml


@sematic.func
def pipeline() -> pd.DataFrame:
    X, y = fetch_openml('titanic', version=1, as_frame=True, return_X_y=True)
    return X
```
Create a pipeline like this:
```python
import pandas as pd
import sematic


@sematic.func
def pipeline() -> pd.DataFrame:
    dataframe = pd.DataFrame(
        {
            "pclass": ["1", None, "hello", 2, 3, 3],
            "age": [3, 4, 5, 5, 5, 6],
        }
    )
    return dataframe
```
It is related to `artifacts=[artifact.to_json_encodable() for artifact in artifacts]` here: `artifact.to_json_encodable()` is not able to handle some values of a dataframe, for example a `None` value.
That's why we see an error in the UI while the data pre-processing steps of liver_cirrhosis are running. The data frames being returned by these steps are not being converted correctly by the `to_json_encodable` function.
As a result, in the UI, within the `fetchJSON` function, when we do `fetch(url).then((res) => res.json())`, the call to `res.json()` raises an error.
The UI code needs to be updated to handle the different dtypes that can be present in a dataframe and to render dataframes properly.
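The `Unexpected token N` message is consistent with a literal `NaN` token in the payload: Python's encoder emits `NaN` for missing float values by default, and that token is not valid JSON, so the browser's `res.json()` rejects it. A minimal sketch of the mismatch:

```python
import json

import pandas as pd

df = pd.DataFrame({"pclass": ["1", None, 3], "age": [3.0, None, 5.0]})

# Python happily serializes NaN by default...
payload = json.dumps(df.to_dict())  # contains the bare token NaN
print(payload)

# ...but that is not valid JSON, which is what JSON.parse / res.json()
# enforces in the browser. The same check can be forced server-side:
try:
    json.dumps(df.to_dict(), allow_nan=False)
except ValueError as e:
    print(e)  # Out of range float values are not JSON compliant
```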
Resolver pods can get evicted for various reasons, and then replaced. These pods stick around for long-ish periods of time, so it would be good to have them be resilient in this scenario--able to pick up and re-load their state from the API and continue as if nothing had happened.
There are URLs that will take you to a pipeline, but not to an individual execution of one. When people are sharing executions with one another, they will likely want some way to jump straight to viewing that particular execution.
Ideally this would work:
```python
from dataclasses import dataclass

import sematic


@dataclass
class Foo:
    foo: int


@sematic.func
def make_foo(i: int) -> Foo:
    return Foo(foo=i)


@sematic.func
def get_int(foo: Foo) -> int:
    return make_foo(42).foo
```
But it doesn't because make_foo returns a future, and getitem isn't supported for futures wrapping dataclasses. But since dataclasses have type annotations, we could make this work.
There is a workaround, but it's kind of annoying for a simple field access:
```python
@dataclass
class Foo:
    foo: int


@sematic.func
def make_foo(i: int) -> Foo:
    return Foo(foo=i)


@sematic.func
def get_int(foo: Foo) -> int:
    return get_foo_field(make_foo(42))


@sematic.func
def get_foo_field(foo: Foo) -> int:
    return foo.foo
```
Python 3.9
sematic 0.10.0
Within a sematic function, calling a sematic function that returns a tuple, then accessing components, causes a get_item to appear in the UI:
```python
results = plot_experiments(status, experiments, exp_results[0])


@sematic.func
def update_status(status: str = None) -> str:
    status = (status +
              ' set_path ' +
              data_path +
              date_prefix +
              '_optimization_set.csv')
    return status


status = update_status(results[1])
```
We already support this for lists:
```python
@sematic.func
def pipeline() -> typing.List[int]:
    return [foo(), bar()]
```
where `foo` and `bar` are Sematic funcs (thus returning futures). We should also support this:
```python
@sematic.func
def pipeline() -> typing.Tuple[int, int]:
    return (foo(), bar())
```
We should provide a way via the webapp where users can rotate their API keys in case they lose track of theirs.
Using set_env is required for running locally but writing to a shared metadata server. We should document it.
Currently there is logic that attempts to cast dataclass values by progressively setting the field values. This casting logic runs even when none of the field values would actually be changed. For dataclasses with `frozen=True` set, this fails. We should fix this.
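For reference, the failure mode reproduces with the stdlib alone; any logic that assigns field values on an existing instance hits it (a sketch, not Sematic's actual casting code):

```python
import dataclasses
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    threshold: float


config = Config(threshold=0.5)

try:
    # This is the kind of assignment a progressive field-setting cast would do.
    setattr(config, "threshold", 0.5)
except dataclasses.FrozenInstanceError as e:
    print(e)  # cannot assign to field 'threshold'
```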
There are multiple ways to deploy the Sematic server: one allows for cloud execution while the other does not. If a user tries to use the CloudResolver when the server can’t support it, we should give a message that indicates why it isn’t working and also points to the deployment docs.
Using HuggingFace datasets:
```python
from typing import Any

import sematic
from datasets import load_dataset


@sematic.func
def pipeline() -> Any:
    dataset = load_yelp_dataset()
    return dataset


@sematic.func
def load_yelp_dataset() -> Any:
    dataset = load_dataset("yelp_review_full")
    print(type(dataset))
    return dataset
```
When using `DatasetDict` instead of `Any` as the return type annotation, it works.
Here is the stack trace:
Traceback (most recent call last):
File "/Users/jai/opt/anaconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/Users/jai/opt/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/Users/jai/projects/sematic/my_project/__main__.py", line 17, in <module>
main()
File "/Users/jai/projects/sematic/my_project/__main__.py", line 13, in main
pipeline().resolve()
File "/Users/jai/projects/sematic/sematic/future.py", line 21, in resolve
self.value = resolver.resolve(self)
File "/Users/jai/projects/sematic/sematic/resolvers/state_machine_resolver.py", line 28, in resolve
self._schedule_future_if_input_ready(future_)
File "/Users/jai/projects/sematic/sematic/resolvers/state_machine_resolver.py", line 220, in _schedule_future_if_input_ready
self._schedule_future(future)
File "/Users/jai/projects/sematic/sematic/resolvers/offline_resolver.py", line 84, in _schedule_future
self._run_inline(future)
File "/Users/jai/projects/sematic/sematic/resolvers/offline_resolver.py", line 93, in _run_inline
self._handle_future_failure(future, exception)
File "/Users/jai/projects/sematic/sematic/resolvers/state_machine_resolver.py", line 242, in _handle_future_failure
raise exception
File "/Users/jai/projects/sematic/sematic/resolvers/offline_resolver.py", line 91, in _run_inline
self._update_future_with_value(future, value)
File "/Users/jai/projects/sematic/sematic/resolvers/state_machine_resolver.py", line 271, in _update_future_with_value
value = future.calculator.cast_output(value)
File "/Users/jai/projects/sematic/sematic/calculator.py", line 131, in cast_output
return self.cast_value(
File "/Users/jai/projects/sematic/sematic/calculator.py", line 155, in cast_value
cast_value, error = safe_cast(value, type_)
File "/Users/jai/projects/sematic/sematic/types/casting.py", line 95, in safe_cast
if isinstance(value, type_):
TypeError: isinstance() arg 2 must be a type or tuple of types
When trying out sematic, the app is not accessible from another machine by default.
How about binding to 0.0.0.0 by default, so that one can try it out without having to do a full-fledged remote deployment as described in https://docs.sematic.dev/diving-deeper/deploy, or maybe add an option flag to make it accessible externally?
I got stuck on this when onboarding.
Currently we upload the entirety of a sematic function's logs at once. We buffer them on disk. For long-running jobs, this may not be viable; the container may eat up too much of its allowed storage. We need to make it so we truncate logs after each upload so that the log file on disk doesn't get too large. But we'll need to be careful we don't lose any log lines when we do this.
It would be helpful if Sematic functions could have associated timeouts, such that if they take longer than a specified time the function automatically fails.
I would suggest using https://zulip.com/ instead, as it is free for open-source projects, has a much better UI, and plays fair.
Currently bazel tests only with Python 3.9. We should try to enable testing with more than one Python version, since users may be on a variety of versions. In the meantime, we should consider pinning bazel to the minimum supported Python version.
From python logging cookbook:
When deploying Web applications using Gunicorn or uWSGI (or similar), multiple worker processes are created to handle client requests. In such environments, avoid creating file-based handlers directly in your web application. Instead, use a SocketHandler to log from the web application to a listener in a separate process. This can be set up using a process management tool such as Supervisor - see Running a logging socket listener in production for more details.
But we currently use a rotating file handler directly. We should refactor to log to a socket, with a separate listener process that applies the rotating file handler.
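A minimal sketch of the worker-side change the cookbook suggests (host and port are placeholders; the listener process that owns the rotating file handler would run separately, e.g. under Supervisor):

```python
import logging
import logging.handlers

# In each Gunicorn/uWSGI worker: send records to a socket instead of writing
# files directly. A single listener process owns the RotatingFileHandler.
socket_handler = logging.handlers.SocketHandler(
    "localhost", logging.handlers.DEFAULT_TCP_LOGGING_PORT
)

root = logging.getLogger()
root.setLevel(logging.INFO)
root.addHandler(socket_handler)

logging.info("worker started")  # pickled and shipped to the listener
```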
Currently auth only supports a single email domain for authentication. It would be nice to be able to provide a list of such domains, as well as a list of allowed email addresses from any domain.
The chat view on pipelines shows the first few characters of the ID of the run the messages were made on. It would be great if you could click on those and be taken to the view for the corresponding run.
I can not sign up to your newsletter - I just get this error message:
Oops! Something went wrong while submitting the form.
Also the discord sign up screen just does not work - after typing a username nothing happens.
Besides using ublock origin to surf the internet (of course) I am not doing anything special.
There are a few places that do read-modify-write with runs:
Supposing you have two entities, E1 and E2, whose operations interleave as follows:
E1.read
E2.read
E1.modify
E2.modify
E1.write
E2.write
In this case, the modification from E1 will be overwritten by the modification from E2. There are ways to detect this situation so the writer can retry making its modification. We should do that to prevent weird bugs!
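One standard way to detect the lost update is optimistic locking: add a version counter to runs and make each write a compare-and-swap, retrying on conflict. A sketch using sqlite3 purely for illustration (the table and column names are hypothetical, not Sematic's schema):

```python
import sqlite3


def update_run_state(conn: sqlite3.Connection, run_id: str, new_state: str) -> None:
    """Read-modify-write guarded by a version column; retry on conflict."""
    for _ in range(5):
        row = conn.execute(
            "SELECT version FROM runs WHERE id = ?", (run_id,)
        ).fetchone()
        version = row[0]

        # ... compute the modification based on what was read ...

        cursor = conn.execute(
            "UPDATE runs SET future_state = ?, version = version + 1 "
            "WHERE id = ? AND version = ?",
            (new_state, run_id, version),
        )
        conn.commit()
        if cursor.rowcount == 1:
            return  # our write landed on the version we read
        # Someone else wrote in between; re-read and try again.
    raise RuntimeError(f"Could not update run {run_id} after 5 attempts")
```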
You can hit weird cases that behave in unexpected ways when you try to compare futures:
```python
@sematic.func
def pipeline() -> str:
    if some_sematic_func() == 1:
        return "Yay!"
    else:
        return "Boo!"
```
No matter whether `some_sematic_func()` resolves to a 1 or not, this will always return "Boo!", because what's actually being compared is `Future(...) == 1`, which will always be False. We should save users from this and similar gotchas (other comparison operators) with good error messages.
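A sketch of the kind of guard that could produce such an error message (an illustrative stand-in, not Sematic's actual future class):

```python
class Future:
    """Illustrative stand-in for a future wrapping an unresolved value."""

    def __eq__(self, other):
        raise TypeError(
            "A Sematic Future cannot be compared with '=='. Its value is not "
            "resolved yet; pass the future to another Sematic Function and "
            "compare inside it instead."
        )

    # Similar guards would be needed for __lt__, __gt__, __bool__, etc.
    # Overriding __eq__ clears __hash__, so restore identity hashing:
    __hash__ = object.__hash__
```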
When people upgrade regularly, it can be helpful to know what version of the server is running. We should display this information discreetly in the UI.
If a pod can't be scheduled within 30 min (or some other appropriate number), fail out with a useful error message. The error message should ideally specify the message for the FailedScheduling event if possible.
When a resolution fails, the run objects are moved to the FAILED state, but the k8s jobs for them are not necessarily cleaned up. We should fix that and clean up the jobs.
When I do this:
```python
@sematic.func
def foo() -> T:
    ...


@sematic.func
def bar() -> U:
    foo()
    ...
```
I expect `foo` to be resolved as part of the graph. Currently it is not resolved because it is neither returned by `bar`, nor passed as an input to another Sematic Function.
We can solve this by having a global future registry that futures add themselves to upon instantiation.
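A sketch of what such a registry could look like (names here are hypothetical):

```python
from typing import Any, Dict, List

# Module-level registry: every future adds itself at construction time, so the
# resolver can also resolve futures that are never returned or passed along.
_ALL_FUTURES: List["Future"] = []


class Future:
    """Illustrative stand-in for Sematic's future class."""

    def __init__(self, calculator: Any, kwargs: Dict[str, Any]):
        self.calculator = calculator
        self.kwargs = kwargs
        _ALL_FUTURES.append(self)


def drain_registered_futures() -> List["Future"]:
    """Hand all futures created so far to the resolver and reset the registry."""
    futures = list(_ALL_FUTURES)
    _ALL_FUTURES.clear()
    return futures
```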
After installing sematic for the first time, I try `sematic start` and get the following exception. I tried to dig in a bit but didn't find an obvious solution...
python version: Python 3.9.7
11:07:21 root@6b93e1bb3779 algo ±|dev ✗|→ sematic start
Traceback (most recent call last):
File "/opt/pyenv/versions/3.9.7/bin/sematic", line 5, in <module>
from sematic.cli.main import cli
File "/opt/pyenv/versions/3.9.7/lib/python3.9/site-packages/sematic/cli/main.py", line 5, in <module>
import sematic.cli.start # noqa: F401
File "/opt/pyenv/versions/3.9.7/lib/python3.9/site-packages/sematic/cli/start.py", line 12, in <module>
from sematic.api.server import run_wsgi
File "/opt/pyenv/versions/3.9.7/lib/python3.9/site-packages/sematic/api/server.py", line 7, in <module>
import eventlet
File "/opt/pyenv/versions/3.9.7/lib/python3.9/site-packages/eventlet/__init__.py", line 17, in <module>
from eventlet import convenience
File "/opt/pyenv/versions/3.9.7/lib/python3.9/site-packages/eventlet/convenience.py", line 7, in <module>
from eventlet.green import socket
File "/opt/pyenv/versions/3.9.7/lib/python3.9/site-packages/eventlet/green/socket.py", line 21, in <module>
from eventlet.support import greendns
File "/opt/pyenv/versions/3.9.7/lib/python3.9/site-packages/eventlet/support/greendns.py", line 66, in <module>
setattr(dns, pkg, import_patched('dns.' + pkg))
File "/opt/pyenv/versions/3.9.7/lib/python3.9/site-packages/eventlet/support/greendns.py", line 61, in import_patched
return patcher.import_patched(module_name, **modules)
File "/opt/pyenv/versions/3.9.7/lib/python3.9/site-packages/eventlet/patcher.py", line 129, in import_patched
return inject(
File "/opt/pyenv/versions/3.9.7/lib/python3.9/site-packages/eventlet/patcher.py", line 106, in inject
module = __import__(module_name, {}, {}, module_name.split('.')[:-1])
File "/opt/pyenv/versions/3.9.7/lib/python3.9/site-packages/dns/dnssec.py", line 483, in <module>
from Crypto.PublicKey import RSA as CryptoRSA, DSA as CryptoDSA
File "/opt/pyenv/versions/3.9.7/lib/python3.9/site-packages/Crypto/PublicKey/__init__.py", line 21, in <module>
from Crypto.Util.asn1 import (DerSequence, DerInteger, DerBitString,
File "/opt/pyenv/versions/3.9.7/lib/python3.9/site-packages/Crypto/Util/asn1.py", line 27, in <module>
from Crypto.Util.number import long_to_bytes, bytes_to_long
File "/opt/pyenv/versions/3.9.7/lib/python3.9/site-packages/Crypto/Util/number.py", line 399
s = pack('>I', n & 0xffffffffL) + s
^
SyntaxError: invalid syntax
(What else is required for a diagnosis?)
Sometimes the connection to the DB can be closed unexpectedly but would succeed on a retry. For read queries, retrying is harmless. We should add the `@retry` decorator to reads for robustness.
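A sketch of a simple retry decorator for read paths, in case there isn't already one to reuse (parameters are illustrative):

```python
import functools
import time


def retry(tries: int = 3, delay_seconds: float = 0.5, exceptions=(Exception,)):
    """Retry a read-only DB call a few times before giving up."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == tries:
                        raise
                    time.sleep(delay_seconds)

        return wrapper

    return decorator


@retry(tries=3, exceptions=(ConnectionError,))
def get_run(run_id: str):
    ...  # read-only query; safe to repeat on a dropped connection
```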
We should explain what's supported and what's not, as well as the functionality you get when using these types.
Right now developers need to decorate API endpoints with `@authenticate` to make sure an endpoint cannot be accessed without a valid API key in the headers.
This is OK, but it would be safer to require authentication by default, and have an API to mark certain endpoints as not requiring authentication.
There are some use cases for this, like adding tracking for Sematic executions in external systems by the Sematic run id.
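A sketch of the "secure by default" pattern in a Flask app (the decorator, header name, and helper below are illustrative, not Sematic's actual endpoints):

```python
from flask import Flask, abort, request

app = Flask(__name__)
_PUBLIC_ENDPOINTS = set()


def public(func):
    """Opt an endpoint out of authentication instead of opting in."""
    _PUBLIC_ENDPOINTS.add(func.__name__)
    return func


@app.before_request
def require_api_key():
    # Runs for every request; endpoints must explicitly opt out via @public.
    if request.endpoint in _PUBLIC_ENDPOINTS:
        return None
    if not _is_valid_api_key(request.headers.get("X-API-KEY")):
        abort(401)


def _is_valid_api_key(key) -> bool:
    ...  # placeholder: look the key up in the users table
```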
`ResourceRequirements` lets users specify function-specific resource requirements. For example, a function can specify that it needs a particular type of Kubernetes node (e.g. GPU, high-mem).
At this time, these are used at runtime when launching K8s jobs, but they are not persisted in the DB. Persisting them is necessary to enable re-running pipelines from the UI and CLI, as well as to move Kubernetes job launches behind the server.
```python
class Foo:
    @sematic.func
    def bar(self):
        pass
```
should ideally give an error at time of wrapping rather than runtime, and have a nice message.
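A sketch of a check the decorator could run at wrap time (heuristic: a function whose first parameter is `self` or `cls` is almost certainly being defined as a method; the decorator body here is an illustrative stand-in, not Sematic's actual `func`):

```python
import inspect


def func(f):
    """Illustrative stand-in for the @sematic.func decorator."""
    params = list(inspect.signature(f).parameters)
    if params and params[0] in ("self", "cls"):
        raise TypeError(
            f"@sematic.func cannot decorate '{f.__qualname__}': Sematic "
            "Functions must be plain functions, not instance or class methods."
        )
    ...  # proceed with normal wrapping
    return f
```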
Users can currently receive a message like this if they try to use sematic with invalid python versions:
from typing import ( # type: ignore
ImportError: cannot import name 'GenericAlias' from 'typing' (/usr/lib/python3.7/typing.py)
For folks installing with pip, this shouldn't be possible, but other mechanisms of depending on sematic might make it possible to try it with older interpreters.
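A sketch of an explicit guard at import time that would replace the confusing `ImportError` with a direct message (the `(3, 8)` floor is a placeholder for whatever minimum we actually support):

```python
import sys

# In sematic/__init__.py (or similar), before anything else is imported.
if sys.version_info < (3, 8):
    raise RuntimeError(
        "Sematic requires Python 3.8 or newer; "
        f"found {sys.version.split()[0]}. Please upgrade your interpreter."
    )
```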
People may not want to put their ECR repo URLs in the source code, but rather make this configurable at build time. We should make this possible, ideally.
Right now there's no way to tell in the UI who created a particular execution. We should add some capability to do that.
We should support abstract base classes as type annotations for sematic functions, but we currently do not.