raymon-ai / raymon
The official http://raymon.ai data profiling and logging library.
License: MIT License
If sending data to the backend fails, we need to show a warning (but not let prod code crash!).
The Raymon logging library should call the backend with the Authorization header set to a Bearer JWT token. It should get the token from our Auth0 endpoint. To get the token for machine-to-machine logging, it should send a request with the following parameters to the Auth0 endpoint:
RAYMON_AUTH0_URL: the endpoint to query the token on
RAYMON_GRANT_TYPE:
RAYMON_AUDIENCE: the API we want to log to
RAYMON_CLIENT_ID: client id
RAYMON_CLIENT_SECRET: client secret
These parameters should be loaded from ~/.raymon/secrets.json, <working_dir>/.raymon.json, environment variables, or a specific file, in that order of ascending priority.
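A minimal sketch of that ascending-priority config loading (the function name and the treatment of the explicit file are assumptions, not the library's implementation):

```python
import json
import os
from pathlib import Path

KEYS = ["RAYMON_AUTH0_URL", "RAYMON_GRANT_TYPE", "RAYMON_AUDIENCE",
        "RAYMON_CLIENT_ID", "RAYMON_CLIENT_SECRET"]

def load_m2m_config(explicit_path=None, workdir=None):
    """Merge config sources in ascending priority:
    ~/.raymon/secrets.json < <working_dir>/.raymon.json < env vars < explicit file."""
    config = {}
    sources = [Path.home() / ".raymon" / "secrets.json"]
    if workdir:
        sources.append(Path(workdir) / ".raymon.json")
    for path in sources:
        if path.is_file():
            config.update(json.loads(path.read_text()))
    # environment variables override both files
    for key in KEYS:
        if key in os.environ:
            config[key] = os.environ[key]
    # an explicitly passed file has the highest priority
    if explicit_path and Path(explicit_path).is_file():
        config.update(json.loads(Path(explicit_path).read_text()))
    return config
```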
Important: this use case is for ingesting data. For users querying and inspecting data, we need another workflow.
We currently set up a connection every time we log an artefact. This can be optimized by using a session. https://requests.readthedocs.io/en/master/user/advanced/
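One way session reuse could look, sketched with requests.Session (the helper name and header handling are illustrative); a long-lived session pools TCP connections across log calls instead of doing a new handshake per artefact:

```python
import requests

def make_session(token):
    """Create one long-lived Session so connections are reused across
    log calls; auth headers are set once and sent on every request."""
    session = requests.Session()
    session.headers.update({"Authorization": f"Bearer {token}"})
    return session
```

The session would then be stored on the logger and reused for every call.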
When logging nested data, we should gracefully check for types like np.int64 and such and convert them to something JSON-dumpable. Packages probably exist for this.
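A small sketch of the graceful conversion, using the fact that NumPy scalars (np.int64, np.float32, ...) expose a `.item()` method returning the native Python value; the helper name is illustrative:

```python
import json

def np_safe_default(obj):
    """Fallback for json.dumps: NumPy scalar types expose .item(),
    which returns the equivalent native Python value."""
    if hasattr(obj, "item"):
        return obj.item()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")
```

Usage would be `json.dumps(data, default=np_safe_default)`.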
Doing an API call on every log / tag statement is fine for demo and MVP purposes, but not for real use cases.
For simple functions, we should provide a wrapper class that takes care of loading / serialising the function.
We should support the following metrics for all component types (input, output, actual, scores) and make it configurable which ones to check for.
We need to support the batched API data ingestion method and do this async and error-proof.
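A hedged sketch of the buffering side of batched ingestion (names and sizes are illustrative; the real ingestion endpoint and async transport are out of scope here):

```python
import threading

class BatchBuffer:
    """Collect log messages and flush them in batches so we do not hit
    the API on every single log/tag call."""
    def __init__(self, flush_fn, max_size=100):
        self.flush_fn = flush_fn  # e.g. posts a batch to the ingestion endpoint
        self.max_size = max_size
        self._buf = []
        self._lock = threading.Lock()

    def add(self, msg):
        with self._lock:
            self._buf.append(msg)
            if len(self._buf) >= self.max_size:
                self._flush_locked()

    def flush(self):
        with self._lock:
            self._flush_locked()

    def _flush_locked(self):
        if not self._buf:
            return
        batch, self._buf = self._buf, []
        try:
            self.flush_fn(batch)
        except Exception:
            # error-proof: never let a failed flush crash prod code;
            # a real implementation would warn and/or retry
            pass
```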
We currently just have a ModelProfile. We could add a class Profile that only has inputs.
trace.tag should accept both Tag objects and dicts. We should convert them internally if needed, and only convert them to JSON when writing to a file or the API.
We need a way to indicate how confident we are about the drift, which will be a function of the amount of data we've analysed. We currently use a two-sample KS test for NumericStats, but do not return a p-value. We could easily do this, but we need to decide what to do with the other types first:
IntComponents?
CategoricalComponents?
For all component types / stats types, we need a test that returns a value between 0 and 1, and can return a p-value, confidence interval, or something else.
To avoid passing the trace in all functions, we should support a workflow like this:
import raymon
from raymon import Trace
trace = Trace(... ,global=True) # default
trace2 = raymon.current_trace()
assert trace == trace2
The pytorch dependency is huge, and only has limited value.
Let's try to replace it by ONNX, which is hopefully smaller. Alternatively, we should move the extractors that depend on pytorch to a separate package.
Allow users to log to text files, and ingest those text files offline in the API, much like prometheus / logstash.
When one profile has input_components=[a, b, c] and output_components=[d, e], and the other profile has input_components=[a, b] and output_components=[d], a schema contrast should only take the common components.
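A minimal sketch of taking only the common components, assuming components are keyed by name in a dict (the function name and layout are illustrative):

```python
def common_components(components_a, components_b):
    """Keep only components present in both profiles; a contrast is
    then computed pairwise over this intersection."""
    shared = components_a.keys() & components_b.keys()
    return {name: (components_a[name], components_b[name]) for name in shared}
```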
This should format the error nicely, tag the trace with the error and log the stacktrace as a trace element.
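A minimal sketch of such an error guard; the trace.tag / trace.log calls and the tag layout are assumptions, not the real raymon API:

```python
import traceback

class TraceErrorGuard:
    """Context manager: format the exception, tag the trace with it,
    and attach the stack trace as a trace element instead of crashing."""
    def __init__(self, trace):
        self.trace = trace

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc is not None:
            self.trace.tag({"name": "error", "value": exc_type.__name__})
            self.trace.log(ref="stacktrace",
                           data="".join(traceback.format_exception(exc_type, exc, tb)))
        return True  # swallow the error so prod code does not crash
```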
All REST API endpoints should be queryable through the raymon library.
e.g. api.search_object(...)
Lib should be as lightweight as possible.
https://github.com/pandas-profiling/pandas-profiling
We should be able to parse the output into one of our profiles.
Browsing to page 2 on one component_type and then switching component_types results in showing page 2 on the new component_type page too, which may not have 2 pages.
When switching component_types, reset the page to 0.
When building a schema from the database, its stats can be empty, and no contrast to another schema can be made for those components. Currently these return a drift of -1, which is shown in green, yikes! We should alert users with "No Data" instead. This should actually be a new type of alert (invalids, drift, no data).
We already have a parameter domain on the stats objects from before. We need to re-enable calling those.
profile.build() should have a parameter domains of the form:
{
'input_components': {name: domain, name2: domain},
}
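A hypothetical sketch of how profile.build(domains=...) could consume that dict, overriding the inferred domain of matching components (the dict-of-dicts component layout is an assumption):

```python
def apply_domains(components, domains):
    """Override inferred domains with user-supplied ones; components not
    mentioned in `domains` keep their inferred domain."""
    for group, overrides in domains.items():  # e.g. 'input_components'
        for name, domain in overrides.items():
            if name in components.get(group, {}):
                components[group][name]["domain"] = domain
    return components
```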
We should enable them both again, like we did before. They can be plotted on the same axis.
I got errors in examples.
ingest_retinopathy_1 | Traceback (most recent call last):
ingest_retinopathy_1 | File "process.py", line 242, in <module>
ingest_retinopathy_1 | ray_ids = run()
ingest_retinopathy_1 | File "process.py", line 235, in run
ingest_retinopathy_1 | oracle.process(ray_id=ray_id, metadata=metadata)
ingest_retinopathy_1 | File "process.py", line 141, in process
ingest_retinopathy_1 | ray.info(f"Logging ground truth for {ray}")
ingest_retinopathy_1 | File "/usr/local/lib/python3.7/site-packages/raymon/ray.py", line 41, in info
ingest_retinopathy_1 | self.logger.info(ray_id=str(self), text=text)
ingest_retinopathy_1 | File "/usr/local/lib/python3.7/site-packages/raymon/loggers.py", line 84, in info
ingest_retinopathy_1 | self.data_logger.info(json.dumps(kafka_msg))
ingest_retinopathy_1 | AttributeError: 'RaymonFileLogger' object has no attribute 'data_logger'
From: https://setuptools.readthedocs.io/en/latest/pkg_resources.html
Use of pkg_resources is discouraged in favor of importlib.resources, importlib.metadata, and their backports (resources, metadata). Please consider using those libraries instead of pkg_resources.
https://docs.python.org/3/library/importlib.html#module-importlib.resources
When we contrast profiles, some component may be missing. We need to deal with this.
We should simply save the image as a PIL image and save / load it in lossless PNG format.
We need basic tests for:
Instead of the validation checks, we should simply try to parse the data as a float.
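A one-function sketch of that "just try to parse" approach (the helper name is illustrative):

```python
def is_floatlike(value):
    """Replace explicit validation checks with an attempt to parse the
    value as a float; anything float() accepts counts as numeric."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False
```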
Data types should have the to_jcr and from_jcr functions implemented, and these should be unit tested.
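A sketch of a generic round-trip unit test; to_jcr / from_jcr are the names from this issue, but their exact signatures are assumptions:

```python
def assert_jcr_roundtrip(obj):
    """Any data type exposing to_jcr / from_jcr should survive a
    serialize-deserialize round trip unchanged."""
    jcr = obj.to_jcr()
    restored = type(obj).from_jcr(jcr)
    assert restored.to_jcr() == jcr
    return restored
```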
We currently only support the client_credentials flow. We also need to support users logging in via the CLI using the device flow grant.
When machine-to-machine credentials are available, the system should use client credentials; if not, it should try to log in the user.
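The fallback order can be sketched as below; the two login callables are injected only to keep the sketch self-contained (the real library would call Auth0 directly):

```python
def login(m2m_credentials, login_m2m, login_device_flow):
    """Prefer machine-to-machine client_credentials when available,
    otherwise fall back to the interactive device-flow login."""
    if m2m_credentials:
        return login_m2m(m2m_credentials)
    return login_device_flow()
```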
Using the statement api.system_metrics(), we should be able to get and send the following tags or global metrics to the backend:
These should be tags of type global-metric (?) and should not be attached to the ray (?), using ray.process_metrics().
I got an error when logging in for the first time.
FileNotFoundError Traceback (most recent call last)
~/raymon/examples/setup_project.py in
19 login_env = None
20 # api = RaymonAPI(url=f"https://api{ENV}.raymon.ai/v0", env=login_env)
---> 21 api = RaymonAPI(url=f"http://localhost:8000/v0", env=login_env)
22
23
~/opt/miniconda3/envs/retinopathy/lib/python3.8/site-packages/raymon/api.py in __init__(self, url, project_id, auth_path, env)
17 self.token = None
18
---> 19 self.login()
20
21 """
~/opt/miniconda3/envs/retinopathy/lib/python3.8/site-packages/raymon/api.py in login(self)
24
25 def login(self):
---> 26 self.token = login(fpath=self.auth_path, project_id=self.project_id, env=self.env)
27 self.headers["Authorization"] = f"Bearer {self.token}"
28
~/opt/miniconda3/envs/retinopathy/lib/python3.8/site-packages/raymon/auth/__init__.py in login(fpath, project_id, env)
71 # If we did not find m2m credentials, let the user login interactively.
72 try:
---> 73 token = login_user(credentials=credentials, out=fpath, env=env)
74 except (SecretException, NetworkException) as exc:
75 print(f"Could not login with user credentials.")
~/opt/miniconda3/envs/retinopathy/lib/python3.8/site-packages/raymon/auth/__init__.py in login_user(credentials, out, env)
40 if not token_ok(token):
41 token = login_device_flow(config)
---> 42 save_user_config(
43 existing=credentials,
44 auth_endpoint=config["auth_url"],
~/opt/miniconda3/envs/retinopathy/lib/python3.8/site-packages/raymon/auth/user.py in save_user_config(existing, auth_endpoint, audience, client_id, token, out, env)
31 user_config[env["auth_url"]] = env_config
32 known_configs["user"] = user_config
---> 33 with open(out, "w") as f:
34 json.dump(known_configs, fp=f, indent=4)
35
FileNotFoundError: [Errno 2] No such file or directory: '/Users/emreozan/.raymon/secrets.json'
5 cells were canceled due to an error in the previous cell.
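The crash happens because `~/.raymon/` does not exist yet on a first login, so `open(out, "w")` fails. A sketch of the likely fix, creating parent directories before writing (the helper name is illustrative):

```python
import json
from pathlib import Path

def save_json_config(out, payload):
    """Create the config directory (parents included) before writing,
    so a first-time login does not crash on a missing ~/.raymon/."""
    out = Path(out)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(payload, indent=4))
```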
A ray should not be able to log to the same peephole twice. Peepholes must be unique.
Use Cerberus or Schematics to validate the config values.
Can serve as inspiration for RDV.
The current implementation does not have authentication. Add this ASAP.
Simple: https://pypi.org/project/falcon-auth0/
Future proof: https://pypi.org/project/falcon-auth0/
Must work the same as #33.
When comparing 2 stats objects we currently do not use any confidence interval or p-value check, which we should. Since p-value checks can be overly sensitive on big data sets (which we happen to have a lot of in "big data"), we want to use confidence intervals. They also make for nice plots.
We can build these confidence intervals with minimal changes to this library and the backend, based solely on our stats (EDF / frequencies and sample sizes).
To contrast 2 stats objects, we can simply measure the max distance between the confidence intervals instead of the observed functions.
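One way to build such a band from just the EDF and sample size is the Dvoretzky-Kiefer-Wolfowitz inequality; this sketch (names illustrative, not the library's implementation) contrasts two EDFs via the max distance between their bands:

```python
import math

def dkw_epsilon(n, alpha=0.05):
    """DKW band half-width: the true CDF lies within edf +/- epsilon
    with probability at least 1 - alpha."""
    return math.sqrt(math.log(2 / alpha) / (2 * n))

def banded_distance(edf1, n1, edf2, n2, alpha=0.05):
    """Max distance between the two confidence bands, evaluated on a
    shared grid; 0 whenever the bands overlap everywhere."""
    eps = dkw_epsilon(n1, alpha) + dkw_epsilon(n2, alpha)
    return max(max(0.0, abs(f1 - f2) - eps) for f1, f2 in zip(edf1, edf2))
```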
We should make it easy for users to measure elapsed time.
with ray.time("your-ref"):
pass
and ray.time_ref(peephole="your-ref")
+ easy calculation of the time elapsed since the previous ref.
This should be added as a tag to the ray.
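A sketch of how the context-manager form could work; the tag layout is an assumption, and a plain list stands in for the ray here:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(trace_tags, ref):
    """Measure the wall time of the block and record it as a tag,
    even when the block raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        trace_tags.append({"name": f"{ref}-time", "value": elapsed, "type": "metric"})
```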