vertaai / modeldb
Open Source ML Model Versioning, Metadata, and Experiment Management
License: Apache License 2.0
Hello everyone,
First of all, greetings: this work looks absolutely great. I have a few questions for you.
Thanks a lot for your answers. I think some of the answers to these questions should appear somewhere in the README.md; they are must-know information for anyone interested in going live with this project.
Once again, thank you very much for your contribution to the ML community.
Best regards from France,
Jonathan D.
Model Filepath currently displays a timestamp. Fix this to read the actual filepath.
Backend TkAgg is interactive backend. Turning interactive mode on.
Traceback (most recent call last):
  File "C:\Program Files\JetBrains\PyCharm 2017.2.1\helpers\pydev\pydevd.py", line 1596, in <module>
    globals = debugger.run(setup['file'], None, None, is_module)
  File "C:\Program Files\JetBrains\PyCharm 2017.2.1\helpers\pydev\pydevd.py", line 1023, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "C:\Program Files\JetBrains\PyCharm 2017.2.1\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:/Users/vibhatia/PycharmProjects/modeldbexample/demo/linearmodel.py", line 34, in <module>
    df, target, test_size=0.3)
  File "C:\Users\vibhatia\AppData\Local\Programs\Python\Python35\lib\site-packages\modeldb\sklearn_native\ModelDbSyncer.py", line 258, in train_test_split_fn
    result = split_dfs[:len(split_dfs) / 2]
TypeError: slice indices must be integers or None or have an __index__ method
After changing the line to result = split_dfs[:int(len(split_dfs) / 2)], it works. I am using Python 3.5. Let me know if this works for you, and I can make the change and open a pull request.
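The failure is a Python 2/3 division difference: in Python 3, / on two ints returns a float, which is not a valid slice index. A minimal sketch of the problem and the proposed fix (the string list stands in for the real split DataFrames):

```python
# Stand-in for the DataFrames returned by the syncer's train/test split.
split_dfs = ["train_x", "test_x", "train_y", "test_y"]

# Fails on Python 3: len(split_dfs) / 2 is 2.0, a float, and
# "slice indices must be integers or None or have an __index__ method".
# result = split_dfs[:len(split_dfs) / 2]

# Either form is safe on both Python 2 and 3:
result = split_dfs[:len(split_dfs) // 2]        # floor division
result2 = split_dfs[:int(len(split_dfs) / 2)]   # the patch proposed above

assert result == result2 == ["train_x", "test_x"]
```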
Implement ModelDB client library in R. Includes:
Writing a thrift client in R
Implementing ModelDB light logging functionality
Implementing operator level logging in R
Follow on from #249
Updates to metadata should be tracked, whether these happen from the API directly or the frontend.
I am using an Ubuntu VM on Azure to set up ModelDB, and Docker to spin up the instance. When I try to run docker-compose up, I get the following error:
The following packages have unmet dependencies:
 mongodb-org-shell : Depends: libssl1.0.0 (>= 1.0.1) but it is not installable
E: Unable to correct problems, you have held broken packages.
ERROR: Service 'backend' failed to build: The command '/bin/sh -c apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 0C49F3730359A14518585931BC711F9BA15703C6 && echo "deb [ arch=amd64 ] http://repo.mongodb.org/apt/ubuntu trusty/mongodb-org/3.4 multiverse" | tee /etc/apt/sources.list.d/mongodb-org-3.4.list && apt-get update && apt-get install -y maven sqlite g++ make automake bison flex pkg-config libevent-dev libssl-dev libtool mongodb-org-shell && apt-get clean && update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java' returned a non-zero code: 100
Can someone give me a pointer as to what I am doing wrong?
This bug is important to us: we at Adobe are analyzing the feature set of ModelDB to see if we can use it for our use case.
modeldb.csail.mit.edu links to spark.ml and scikit-learn Getting Started docs, but those links are empty hrefs.
Also, there are no files in the gh-pages branch with that kind of content. master has files which look appropriate: https://github.com/mitdbg/modeldb/blob/master/docs/getting_started/scikit_learn.md and https://github.com/mitdbg/modeldb/blob/master/docs/getting_started/spark_ml.md.
Once a model has been trained/registered with ModelDB, expose functionality to add metadata to the model. This could include periodic metrics, failure reports, etc.
Potential APIs:
appendMetadata(int modelId, KV keyvalue)
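One possible client-side shape for this API, sketched in Python. The class and method names here are hypothetical, not part of the existing ModelDB client; the point is that metadata is append-only per model ID, so repeated measurements accumulate rather than overwrite:

```python
from collections import defaultdict

class ModelMetadataStore:
    """Hypothetical sketch of the proposed appendMetadata(modelId, keyvalue)
    API: attach arbitrary key-value metadata to an already-registered model."""

    def __init__(self):
        # model_id -> list of (key, value) pairs, in insertion order
        self._metadata = defaultdict(list)

    def append_metadata(self, model_id, key, value):
        # Append-only: periodic metrics and failure reports accumulate,
        # so the history of a model is preserved.
        self._metadata[model_id].append((key, value))

    def get_metadata(self, model_id):
        return list(self._metadata[model_id])

store = ModelMetadataStore()
store.append_metadata(42, "accuracy_daily", 0.91)
store.append_metadata(42, "failure", "OOM on 2017-10-11")
```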
Two of the Python tests have been failing for some time now. Could someone please take a look and either fix the tests or the code they're testing? I'm not familiar enough with it to do it myself.
testPipelineEvent.test_overall_pipeline_fit_event
and testPipelineEvent.test_pipeline_first_fit_stage
======================================================================
FAIL: test_overall_pipeline_fit_event (testPipelineEvent.TestPipelineEvent)
----------------------------------------------------------------------
Traceback (most recent call last):
File "testPipelineEvent.py", line 105, in test_overall_pipeline_fit_event
utils.is_equal_transformer_spec(spec, expected_spec, self)
File "/Users/arcarter/code/modeldb/client/python/modeldb/tests/utils.py", line 153, in is_equal_transformer_spec
tester.assertEqual(len(spec1.hyperparameters), len(spec2.hyperparameters))
AssertionError: 14 != 10
======================================================================
FAIL: test_pipeline_first_fit_stage (testPipelineEvent.TestPipelineEvent)
----------------------------------------------------------------------
Traceback (most recent call last):
File "testPipelineEvent.py", line 149, in test_pipeline_first_fit_stage
utils.is_equal_transformer_spec(spec, expected_spec, self)
File "/Users/arcarter/code/modeldb/client/python/modeldb/tests/utils.py", line 153, in is_equal_transformer_spec
tester.assertEqual(len(spec1.hyperparameters), len(spec2.hyperparameters))
AssertionError: 7 != 3
Currently the ModelDB syncer can be instantiated from the config file or explicitly. Allow the user to override the options read in from the config and provide a more flexible explicit constructor.
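A sketch of what the more flexible constructor could look like: config-file values are loaded first, and any explicitly passed keyword arguments override them. The option names and defaults below are illustrative, not the real syncer API:

```python
# Stand-in for values parsed from the syncer config file.
DEFAULT_CONFIG = {"host": "localhost", "port": 6543, "sync_interval": 10}

class Syncer:
    """Hypothetical flexible constructor: config first, explicit args win."""

    def __init__(self, config_path=None, **overrides):
        # In the real client this would read config_path (e.g. syncer.json);
        # here we use a fixed dict to keep the sketch self-contained.
        config = dict(DEFAULT_CONFIG)
        config.update(overrides)  # explicit keyword args override the file
        self.host = config["host"]
        self.port = config["port"]
        self.sync_interval = config["sync_interval"]

# Config supplies host and sync_interval; the caller overrides only the port.
syncer = Syncer(port=7000)
```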
The classes and interfaces for edu.mit.csail.db.ml.server.storage.metadata have the right package declaration but live under a different directory. IntelliJ IDEA was complaining about missing imports.
The current Docker image is several GB. Can we reduce its size?
The many import * statements make it hard to tell at a glance what is importing what, and from where. This is a ticket to pep8-ify the code and remove cruft where possible.
The TransformerSpec struct has a field called features. However, this field is never used anywhere; we should remove it. FitEvent includes the features, and we do use those.
When I try to run SimpleSampleWithModelDB.py, I get this error on both the server and client side.
May I know what the real problem is?
Also, how do I clear the log or metadata file? I was able to launch the frontend with the title Simple Sample, but it fails with an error when loading the metadata.
Model Name should be listed in the ID column under Model ID, Experiment ID, and Experiment Run ID.
The Scala client is not released in a public repo. In this scenario, every user who wants to write a client has to build it locally on their own computer. The Scala client should be released to a public repository.
The UI does not show models on which evaluate was not called. In our case, evaluate is not called most of the time, and there is no way to visualize those models. The culprit seems to be the following lines in the UI:
for (var i = 0; i < models.length; i++) {
  var model_metrics = models[i].metrics;
  var metrics = [];
  models[i].show = false;
  for (key in model_metrics) {
    if (model_metrics.hasOwnProperty(key)) {
      var val = Object.keys(model_metrics[key]).map(function(k){return model_metrics[key][k]})[0];
      val = Math.round(parseFloat(val) * 1000) / 1000;
      metrics.push({
        "key": key,
        "val": val
      });
      models[i].show = true;
    }
  }
  models[i].metrics = metrics;
}
models = models.filter(function(model) {
  return model.show;
});
If I comment out the lines
models = models.filter(function(model) {
  return model.show;
});
the models are shown, but this also surfaces an extra model called "pipeline model", which must not be shown if fitSync is called on a pipeline. Any idea why this is happening?
Does ModelDB support a use case of utilizing only the frontend? The scikit-learn and Apache Spark version requirements for the client do not fit my environment. I see the configurations can be YAML files, and I am wondering if this is a supported use case: manually creating the YAML files and using the frontend to view model results.
I was able to set up the server and successfully ran all the code in https://github.com/mitdbg/modeldb/tree/master/client/python/samples/sklearn.
However, when I tried the basic client, all the sample code in https://github.com/mitdbg/modeldb/tree/master/client/python/samples/basic yields this error:
Syncer.instance.add_to_buffer(fit_event)
AttributeError: type object 'Syncer' has no attribute 'instance'
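This AttributeError typically means no syncer was constructed before an event was logged: the client stores its singleton in a class-level Syncer.instance attribute, which only exists once a syncer has been created. A simplified stand-in (not the real modeldb code) showing the failure mode and the likely fix:

```python
class Syncer:
    """Simplified stand-in for the modeldb client's singleton syncer.
    Note: there is NO 'instance' class attribute until a syncer is built."""

    def __init__(self):
        self.buffer = []
        Syncer.instance = self  # register the singleton on construction

    @classmethod
    def create_syncer(cls):
        return cls()

    def add_to_buffer(self, event):
        self.buffer.append(event)

# This line would raise the reported error if run first:
#   Syncer.instance.add_to_buffer("fit_event")   # AttributeError

# Fix: create the syncer before logging any events (the sklearn samples
# do this up front, which is why they work).
syncer = Syncer.create_syncer()
Syncer.instance.add_to_buffer("fit_event")
```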
Being able to delete models once they are synced.
Currently this is blocked on a version issue related to the thrift-maven-plugin. Will be resolved when that is unblocked.
Missing a commons-lang dependency, and the path to thrift was not the same as on my system. Patch suggestion in #195
In the .thrift file, in FitEvent, the type of "LabelColumns" is list.
In CreateDb.sql, you store "LabelColumn" (no "s" in SQL) as a single string.
Are there supposed to be multiple label columns, separated by commas, in the SQL table? If so, we should rename the SQL column to "LabelColumns" and add an appropriate comment.
This will make it easier to refer to models from different parts of the system.
If two YAML files are synced, both containing:
PROJECT:
  NAME: my_name
  DESCRIPTION: my_description
but containing different MODEL information, the frontend does not correctly aggregate them under one project, and instead creates two distinct project IDs with the same name and description.
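The expected behavior can be sketched as keying projects on (NAME, DESCRIPTION), so that repeated syncs with an identical PROJECT block resolve to the same project ID. This is illustrative pseudocode for the bug report, not the actual frontend logic:

```python
# Expected aggregation: identical (name, description) -> one project ID.
projects = {}       # (name, description) -> project_id
next_id = [1]       # simple counter for new project IDs

def get_or_create_project(name, description):
    key = (name, description)
    if key not in projects:
        projects[key] = next_id[0]
        next_id[0] += 1
    return projects[key]

# First YAML sync and second YAML sync (different MODEL blocks, same PROJECT):
a = get_or_create_project("my_name", "my_description")
b = get_or_create_project("my_name", "my_description")
assert a == b   # one project, not two distinct IDs
```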
The scala client jar name has changed to modeldb-scala-client.jar. Update tests accordingly.
ModelDB currently does not have a concept of users or accounts. All data is visible to all users. However, with this model, users cannot do things like bookmark models or annotate them (for themselves as opposed to across all users).
Adding user accounts and authentication will enable these functions along with access control.
Sync'ing via YAML file requires that the CONFIG be non-empty and METRICS be iterable. These restrictions should be relaxed so that both fields can be empty.
There should be a live, public example of ModelDB for people to play with. Visitors should get to browse both a Jupyter notebook with example models to run and ModelDB's Node.js frontend to see reports about those models.
Architecture:
This will also require the creation of one or more .ipynb files that can be run by users to create reports in ModelDB.
Set up a continuous integration system for ModelDB including automated building and testing.
Requirements:
On running ./start_server.sh
[INFO] ------------------------------------------------------------------------
[INFO] Building modeldb 1.0-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- exec-maven-plugin:1.5.0:java (default-cli) @ modeldb ---
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Starting the simple server on port 6543...
Requires a dependency on the Simple Logging Facade for Java (https://www.slf4j.org). The most recent version is 1.7.24.
There should be a pip installable package for modeldb's python client. This will allow people to more easily deploy modeldb.
A PyPI package may require moving syncer.json, or duplicating its hardcoded values into ConfigUtils.py.
This is currently being worked on at the PyPI test site.
Today the models are generated in an SQLite-specific package, and all references in the code follow that pattern, which makes changing the database very hard. The models should instead be created in a database-agnostic package, so that changing the underlying database only requires changing the library.xml file.
http://localhost:3000/
First noticed at 152a63a
Bug not present at dea7604
TypeError: api.testConnection is not a function
at /Users/arcarter/code/modeldb/frontend/routes/index.js:7:7
at Layer.handle [as handle_request] (/Users/arcarter/code/modeldb/frontend/node_modules/express/lib/router/layer.js:95:5)
at next (/Users/arcarter/code/modeldb/frontend/node_modules/express/lib/router/route.js:131:13)
at Route.dispatch (/Users/arcarter/code/modeldb/frontend/node_modules/express/lib/router/route.js:112:3)
at Layer.handle [as handle_request] (/Users/arcarter/code/modeldb/frontend/node_modules/express/lib/router/layer.js:95:5)
at /Users/arcarter/code/modeldb/frontend/node_modules/express/lib/router/index.js:277:22
at Function.process_params (/Users/arcarter/code/modeldb/frontend/node_modules/express/lib/router/index.js:330:12)
at next (/Users/arcarter/code/modeldb/frontend/node_modules/express/lib/router/index.js:271:10)
at Function.handle (/Users/arcarter/code/modeldb/frontend/node_modules/express/lib/router/index.js:176:3)
at router (/Users/arcarter/code/modeldb/frontend/node_modules/express/lib/router/index.js:46:12)
Steps to reproduce:
Create a Spark job which runs for more than 10 minutes.
Call fitSync on any estimator in the Spark client.
17/10/11 08:13:33 ERROR ApplicationMaster: User class threw exception: com.twitter.finagle.ConnectionFailedException: Connection timed out at remote address: <DNS:port> from service: <DNS:port>. Remote Info: Upstream Address: Not Available, Upstream Client Id: Not Available, Downstream Address: <DNS:port>, Downstream Client Id: <DNS:port>, Trace Id: 0bb6d8aea19c2260.0bb6d8aea19c2260<:0bb6d8aea19c2260
com.twitter.finagle.ConnectionFailedException: Connection timed out at remote address: <DNS:port> from service: <DNS:port>. Remote Info: Upstream Address: Not Available, Upstream Client Id: Not Available, Downstream Address: <DNS:port>, Downstream Client Id: <DNS:port>, Trace Id: 0bb6d8aea19c2260.0bb6d8aea19c2260<:0bb6d8aea19c2260
Caused by: java.io.IOException: Connection timed out
  at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
  at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
  at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
  at sun.nio.ch.IOUtil.read(IOUtil.java:192)
  at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
  at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64)
  at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
  at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
  at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
  at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at com.twitter.finagle.util.ProxyThreadFactory$$anonfun$newProxiedRunnable$1$$anon$1.run(ProxyThreadFactory.scala:19)
  at java.lang.Thread.run(Thread.java:748)
For a job that runs less than 7 minutes, the results are fine.
Not sure if anyone else has experienced this issue.
I am using the docker-compose ModelDB installation on a Mac, and I am unable to connect the docker-machine to localhost. Here's another forum discussing the issue: https://forums.docker.com/t/using-localhost-for-to-access-running-container/3148
As a workaround, I ran the following script from the [path to modeldb directory], which find/replaces all the instances of localhost in the routing files:
sed -i -e 's/localhost/<docker-machine address>/g' server/src/main/resources/reference.conf
sed -i -e 's/localhost/<docker-machine address>/g' server/src/main/resources/reference-docker.conf
sed -i -e 's/localhost/<docker-machine address>/g' server/src/main/resources/reference-test.conf
sed -i -e 's/localhost/<docker-machine address>/g' client/syncer.json
sed -i -e 's/localhost/<docker-machine address>/g' frontend/util/check_thrift.js
sed -i -e 's/localhost/<docker-machine address>/g' frontend/util/thrift.js
sed -i -e 's/localhost/<docker-machine address>/g' client/python/modeldb/basic/Structs.py
To find the <docker-machine address>, do: $ docker-machine ls
NAME ACTIVE DRIVER STATE URL SWARM DOCKER ERRORS
default - virtualbox Running tcp://192.168.99.100:2376 v17.05.0-ce
After that, I was able to add new projects to the Dockerized ModelDB.
I can branch and submit a pull request. Or is there a better way? Please advise.
Being able to install and use the python client with python 3.x
When I try to create annotations on a model, it gives a 500 error:
Transformer is not defined
ReferenceError: Transformer is not defined
at Object.storeAnnotation (E:\machinelearning\git\modeldb\frontend\util\api.js:109:27)
at E:\machinelearning\git\modeldb\frontend\routes\models.js:30:7
at Layer.handle [as handle_request] (E:\machinelearning\git\modeldb\frontend\node_modules\express\lib\router\layer.js:95:5)
at next (E:\machinelearning\git\modeldb\frontend\node_modules\express\lib\router\route.js:131:13)
at Route.dispatch (E:\machinelearning\git\modeldb\frontend\node_modules\express\lib\router\route.js:112:3)
at Layer.handle [as handle_request] (E:\machinelearning\git\modeldb\frontend\node_modules\express\lib\router\layer.js:95:5)
at E:\machinelearning\git\modeldb\frontend\node_modules\express\lib\router\index.js:277:22
at param (E:\machinelearning\git\modeldb\frontend\node_modules\express\lib\router\index.js:349:14)
at param (E:\machinelearning\git\modeldb\frontend\node_modules\express\lib\router\index.js:365:14)
at Function.process_params (E:\machinelearning\git\modeldb\frontend\node_modules\express\lib\router\index.js:410:3)
Imagine that I have a long custom experiment pipeline managed with a bash script.
Is there a way to pass custom output/artifacts from such a pipeline to ModelDB?
I believe some kind of REST API would be nice.
It would also be OK to use other languages (generate artifacts somehow, then run a Python script to put them in the DB).
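To make the request concrete: ModelDB does not expose such a REST API today, so everything below is hypothetical — the route, field names, and values are invented to illustrate what a bash-driven pipeline could POST at the end of a run:

```python
import json

# Hypothetical payload a bash pipeline step could assemble and POST.
# None of these routes or fields exist in ModelDB today; this sketches
# the shape of the requested feature.
payload = {
    "project": "my_bash_pipeline",
    "experiment_run": "run-2017-10-11",
    "metrics": {"rmse": 0.42},
    "artifacts": ["s3://bucket/model.bin"],
}
body = json.dumps(payload)

# A pipeline step could then do something like (endpoint invented):
#   curl -X POST http://localhost:6543/api/runs -H 'Content-Type: application/json' -d "$body"
```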
Suggestion: remove the SQLite footprint from ModelDB and use MongoDB as the sole database.
Use case: to have ModelDB running in production, we run it in a Docker container with the Mongo database mounted as a volume from the host OS outside the container. This allows us to quickly respawn if the container goes down, and enables data recovery. As for the SQLite DB, it's not only redundant given the introduction of Mongo, but its path is also hardcoded, so mounting it separately is more of a hack.
Today the Mongo metadata store takes a host and port. The issue with this approach is that we cannot provide authentication information, since that generally goes in the connection URL. The proposal is to change the host and port to a URL, which covers the plain host-and-port approach (set the URL to <host>:<port>) as well as more complex use cases where we need to pass more information to the Mongo client.
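A small sketch of why a URL is strictly more expressive than a (host, port) pair; the hostnames and credentials below are placeholders:

```python
# Placeholder values, not real credentials or hosts.
user, password = "mdb_user", "s3cret"
host, port = "mongo.internal", 27017

# The current (host, port) configuration still maps cleanly onto a URL:
simple_url = "mongodb://{}:{}".format(host, port)

# But only a URL can carry authentication (and, more generally, options
# like replica sets or auth databases) that host/port alone cannot:
auth_url = "mongodb://{}:{}@{}:{}/admin".format(user, password, host, port)

# A Mongo client library (e.g. pymongo.MongoClient(auth_url)) would then
# handle authentication from the URL directly.
```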
Currently, I've commented out a line that counts the number of rows in a DataFrame.
Counting the number of rows in a DataFrame is a slow operation. However, we use the numRows field in our DataFrame thrift struct. If we insist on keeping the numRows field, then we need to accept the cost of counting the rows (this requires a full sequential scan of the dataset).
I wonder if it's inappropriate for ModelDB, which is supposed to be a low-overhead tool, to perform a sequential scan of a DataFrame. If we do this, the overhead of ModelDB will skyrocket and will scale linearly with the size of the user's dataset.
I'm thinking of making "should we count the number of rows?" configurable. If users are willing to accept the cost of ModelDB counting the rows in the DataFrame, they can indicate so; if they need the performance, they can indicate that instead. Also, to avoid recounting the rows of the same DataFrame, we can cache the counts.
Thoughts?
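The "configurable plus cached" proposal above can be sketched as follows. The flag name, the DataFrame-identity key, and the -1 sentinel for "unknown" are all illustrative choices, not ModelDB's actual API:

```python
# Cache of row counts, keyed by some stable DataFrame identity.
_row_count_cache = {}

def maybe_num_rows(df_id, count_fn, count_rows_enabled):
    """Return the row count only if the user opted in, caching the result
    per DataFrame so the expensive sequential scan runs at most once.
    Returns -1 as an 'unknown' sentinel when counting is disabled."""
    if not count_rows_enabled:
        return -1
    if df_id not in _row_count_cache:
        _row_count_cache[df_id] = count_fn()  # the full sequential scan
    return _row_count_cache[df_id]

# Track how many times the expensive scan actually runs.
calls = []
def slow_count():
    calls.append(1)
    return 1000

assert maybe_num_rows("df1", slow_count, False) == -1    # opted out: no scan
assert maybe_num_rows("df1", slow_count, True) == 1000   # opted in: one scan
assert maybe_num_rows("df1", slow_count, True) == 1000   # cached: still one scan
assert len(calls) == 1
```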
mongo --eval "db.getSiblingDB('admin').shutdownServer()"
doesn't actually shut down the server correctly. I've had a better experience just doing
ps -a | grep mongo
kill [PID]
This could obviously use some more elaboration, and I could look at it more. But it's worth noting for now that the instructions aren't doing what they're supposed to.