
phoenix's People

Contributors

a-s-gorski, amank94, anticorrelator, arizedatngo, axiomofjoy, camyoung93, cjunkin, davidgmonical, dependabot[bot], fjcasti1, github-actions[bot], gregwchase, hakantekgul, harrisonchu, jgilhuly, jlopatec, kryskirkland, lou-k, matthewsh, michaelschiff, mikeldking, mkhludnev, nate-mar, parker-stafford, pbadhe, rogerhyang, sallyannarize, shashankvsg, tammy37, trevor-laviale-arize

phoenix's Issues

gitignore compiled js bundle

Make the compiled JS bundles gitignored and find a way to include them as package assets when `pip install` runs. Right now the bundle fails to be mounted in the Python package if it is gitignored.
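
One possible direction, sketched with setuptools (the bundle path and package name are assumptions): declare the built bundle as package data so the wheel ships it even though git ignores it. An sdist would additionally need the files listed in MANIFEST.in, or the JS build wired into a build hook.

from setuptools import find_packages, setup

setup(
    name="phoenix",
    packages=find_packages(),
    # Hypothetical bundle location: ship the built JS with the wheel even
    # though .gitignore excludes it from version control.
    package_data={"phoenix": ["server/static/*.js"]},
)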

[metrics] Add support for getting timeseries data of any calculation

As a developer, I want to be able to compose a metric calculation with one or two datasets so that the metric is calculated over a specified time interval, at a granularity of my choosing (or automatically).

Pseudo-Code

metricAggregate = Calculate(df, metrics=[EuclideanDistance])  # retrieves the metric aggregated over the entire dataframe
metricOverTime = TimeSeries(df, metrics=[EuclideanDistance], granularity="hourly")  # retrieves the metric over time at the specified granularity

Calculate

psi_value = Calculate([primary_df, baseline_df], psi)

Returns

print(psi_value) # 2.73

Calculate Over Time

psiOverTime = CalculateOverTime([primary_df, baseline_df], psi, granularity="1hour")

Returns

print(psiOverTime) # [{ timestamp: "12-12-2022", v: 1.27 }, { timestamp: "12-12-2022", v: 1.27 }] 
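
Below is a minimal sketch, assuming pandas, of how CalculateOverTime could be implemented with time-bucketing; the psi implementation, the "timestamp" column, and the function names are illustrative assumptions, not the proposed API.

import numpy as np
import pandas as pd

def psi(primary: pd.Series, baseline: pd.Series, bins: int = 10) -> float:
    # Population Stability Index, binned on the baseline distribution.
    edges = np.histogram_bin_edges(baseline.dropna(), bins=bins)
    p, _ = np.histogram(primary.dropna(), bins=edges)
    b, _ = np.histogram(baseline.dropna(), bins=edges)
    p = np.clip(p / max(p.sum(), 1), 1e-6, None)
    b = np.clip(b / max(b.sum(), 1), 1e-6, None)
    return float(np.sum((p - b) * np.log(p / b)))

def calculate_over_time(primary_df, baseline_df, column, granularity="1h"):
    # Bucket the primary dataframe by time and compute the metric per
    # bucket against the full baseline.
    buckets = primary_df.set_index("timestamp").resample(granularity)
    return [
        {"timestamp": ts, "v": psi(group[column], baseline_df[column])}
        for ts, group in buckets
        if not group.empty
    ]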

[metrics] CSV parsing for embeddings

Vector columns are not parsed appropriately from CSV files and are not type safe.

Acceptance Criteria

  • Properly validate that the dataframe columns match the expected types (see the validation sketch below)
  • Remove the cast and make the column retrieval methods on `Dataset` type safe
  • Add tests for parsing
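
A minimal validation sketch, assuming the Dataset wraps a pandas dataframe (the function name, column handling, and error type are illustrative):

import numpy as np
import pandas as pd

def validate_vector_column(df: pd.DataFrame, column: str) -> None:
    # Fail fast if any cell is not a non-empty list or array of floats.
    is_bad = df[column].map(
        lambda v: not (isinstance(v, (list, np.ndarray)) and len(v) > 0)
    )
    if is_bad.any():
        raise TypeError(f"{column!r}: {int(is_bad.sum())} rows are not float vectors")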

pip install fails due to HDF5Close

Not sure how to avoid this:

(notebook) ➜  phoenix git:(47-lasso-select) ✗ (⎈|dev:arize-dev)pip install .
Processing /Users/mikeldking/work/phoenix
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting pandas
  Downloading pandas-1.5.2-cp310-cp310-macosx_11_0_arm64.whl (10.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.8/10.8 MB 18.7 MB/s eta 0:00:00
Collecting umap-learn
  Using cached umap_learn-0.5.3-py3-none-any.whl
Collecting numpy
  Using cached numpy-1.23.5-cp310-cp310-macosx_11_0_arm64.whl (13.4 MB)
Collecting hdbscan
  Using cached hdbscan-0.8.29-cp310-cp310-macosx_12_0_arm64.whl
Collecting tables
  Using cached tables-3.7.0.tar.gz (8.2 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error

  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [12 lines of output]
      /var/folders/1s/4vdv59n15b1ghg42frdd8f480000gn/T/H5close646cmm8d.c:2:5: error: implicit declaration of function 'H5close' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
          H5close();
          ^
      1 error generated.
      cpuinfo failed, assuming no CPU features: No module named 'cpuinfo'
      * Using Python 3.10.3 (main, Apr 14 2022, 13:44:37) [Clang 13.1.6 (clang-1316.0.21.2.3)]
      * Found cython 0.29.32
      * USE_PKGCONFIG: True
      .. ERROR:: Could not find a local HDF5 installation.
         You may need to explicitly state where your local HDF5 headers and
         library can be found by setting the ``HDF5_DIR`` environment
         variable or by using the ``--hdf5`` command-line option.
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
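
A common workaround on macOS (untested here) is to install HDF5 via Homebrew and point the tables build at it before installing, e.g. export HDF5_DIR=$(brew --prefix hdf5), as the build output above suggests.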

[metrics] Bottom out computation limits using pandas.

Metrics

  • Accuracy
  • Percent Empty (percent of each column that's NaN)
  • AUC
  • 60M predictions - drift calculation on 200 features using PSI
  • Embedding average 60M embeddings with 1K dimensions
  • Embedding average 60M embeddings with 10K dimensions
  • NDCG metric and precision metric on 60M predictions
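
A rough timing harness for these scale tests might look like the sketch below; the row count is scaled down so it runs on a laptop, and percent-empty stands in for the other metrics.

import time

import numpy as np
import pandas as pd

# Scale n_rows up toward the 60M target for the real benchmark.
n_rows, n_features = 1_000_000, 200
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.standard_normal((n_rows, n_features)),
    columns=[f"f{i}" for i in range(n_features)],
)

start = time.perf_counter()
percent_empty = df.isna().mean()  # fraction of each column that is NaN
elapsed = time.perf_counter() - start
print(f"percent-empty over {n_rows:,} rows x {n_features} cols: {elapsed:.2f}s")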

Improve docstring coverage GitHub action

The current (commented-out) GitHub action that checks for docstring coverage needs to:

  • Ignore `__init__.py` files
  • Allow a verbose option so we can see in the action console what went wrong, instead of having to run interrogate locally
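
For what it's worth, interrogate appears to expose flags covering both items (--ignore-init-module to skip `__init__.py` files, -v/-vv for per-file output); whether they slot cleanly into the action is untested here.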

Embeddings are read as full strings from csv files

Ideally, reading a table would give us an array of floats in every cell of the embedding vector column. This is not the case when reading data from a CSV file: the whole embedding is read as the string '[1.31,-0.46,...,-.108]' instead of the array [1.31,-0.46,...,-.108]. Other file formats, like HDF5, preserve the data structure better and don't have this problem.

This happens with essentially any field that expects an iterable inside a cell, e.g., it also happens with the list of token arrays.
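
A minimal workaround sketch for the CSV case, using a read_csv converter (the file and column names are assumptions):

import ast

import numpy as np
import pandas as pd

def parse_vector(cell: str) -> np.ndarray:
    # The CSV cell arrives as a string like "[1.31,-0.46,-.108]"; parse it
    # safely into a float array.
    return np.array(ast.literal_eval(cell), dtype=float)

df = pd.read_csv("inferences.csv", converters={"embedding_vector": parse_vector})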

Add flexibility on the number of points per dataset in UMAP

Users should have some control over the number of points they want from each dataset.

In addition, splitting the primary and reference datasets at a fixed number of points per dataset, as here, will error:

primary_dataset_points = construct_dataset_points(
    projections[:points_per_dataset], sampled_primary_dataset, embedding_feature
)
reference_dataset_points = construct_dataset_points(
    projections[points_per_dataset:], sampled_reference_dataset, embedding_feature
)
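
One possible fix, sketched under the assumption that the sampled datasets expose their length, is to split the projections at the actual primary sample count instead of a fixed constant:

n_primary = len(sampled_primary_dataset)  # actual number of sampled primary points
primary_dataset_points = construct_dataset_points(
    projections[:n_primary], sampled_primary_dataset, embedding_feature
)
reference_dataset_points = construct_dataset_points(
    projections[n_primary:], sampled_reference_dataset, embedding_feature
)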

Add a `Dataset` module

Add an `arize_toolbox.dataset` module that keeps track of inference records in a performant, normalized format. It must support features, actuals, and embeddings, and be:

  • serializable to a pandas dataframe
  • serializable to JSON
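
A minimal sketch of what the module could look like; the class shape and method names are assumptions, not a spec.

from dataclasses import dataclass

import pandas as pd

@dataclass
class Dataset:
    # One row per inference record; features, actuals, and embedding
    # columns live in the normalized dataframe.
    dataframe: pd.DataFrame

    def to_dataframe(self) -> pd.DataFrame:
        return self.dataframe

    def to_json(self) -> str:
        return self.dataframe.to_json(orient="records")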
