vertaai / modeldb
Open Source ML Model Versioning, Metadata, and Experiment Management
License: Apache License 2.0
Hello everyone,
First of all, greetings: this work looks absolutely great. I have a few questions for you.
Thanks a lot for your answers. I think some of the answers to these questions should appear somewhere in the README.md; they are must-know information for anyone interested in going live with this project.
Once again, thank you very much for your contribution to the ML community.
Best regards from France,
Jonathan D.
Model Filepath currently displays a timestamp. Fix this to read the actual filepath.
Backend TkAgg is interactive backend. Turning interactive mode on.
Traceback (most recent call last):
  File "C:\Program Files\JetBrains\PyCharm 2017.2.1\helpers\pydev\pydevd.py", line 1596, in <module>
    globals = debugger.run(setup['file'], None, None, is_module)
  File "C:\Program Files\JetBrains\PyCharm 2017.2.1\helpers\pydev\pydevd.py", line 1023, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "C:\Program Files\JetBrains\PyCharm 2017.2.1\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:/Users/vibhatia/PycharmProjects/modeldbexample/demo/linearmodel.py", line 34, in <module>
    df, target, test_size=0.3)
  File "C:\Users\vibhatia\AppData\Local\Programs\Python\Python35\lib\site-packages\modeldb\sklearn_native\ModelDbSyncer.py", line 258, in train_test_split_fn
    result = split_dfs[:len(split_dfs) / 2]
TypeError: slice indices must be integers or None or have an __index__ method
After changing the line to result = split_dfs[:int(len(split_dfs) / 2)], it works. I am using Python 3.5. Let me know if this works for you, and I can make the change and open a pull request.
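The failure is a Python 2/3 division difference: in Python 3, / on two ints returns a float, which is not a valid slice index. A minimal sketch of the problem and the proposed fix (the string list stands in for the real split DataFrames):

```python
# Stand-in for the DataFrames returned by the syncer's train/test split.
split_dfs = ["train_x", "test_x", "train_y", "test_y"]

# Fails on Python 3: len(split_dfs) / 2 is 2.0, a float, and
# "slice indices must be integers or None or have an __index__ method".
# result = split_dfs[:len(split_dfs) / 2]

# Either form is safe on both Python 2 and 3:
result = split_dfs[:len(split_dfs) // 2]        # floor division
result2 = split_dfs[:int(len(split_dfs) / 2)]   # the patch proposed above

assert result == result2 == ["train_x", "test_x"]
```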
Implement ModelDB client library in R. Includes:
Writing a thrift client in R
Implementing ModelDB light logging functionality
Implementing operator level logging in R
Follow on from #249
Updates to metadata should be tracked, whether these happen from the API directly or the frontend.
I am using an Ubuntu VM on Azure to set up ModelDB, and Docker to spin up the instance. When I try to run docker-compose up, I get the following error:
The following packages have unmet dependencies:
 mongodb-org-shell : Depends: libssl1.0.0 (>= 1.0.1) but it is not installable
E: Unable to correct problems, you have held broken packages.
ERROR: Service 'backend' failed to build: The command '/bin/sh -c apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 0C49F3730359A14518585931BC711F9BA15703C6 && echo "deb [ arch=amd64 ] http://repo.mongodb.org/apt/ubuntu trusty/mongodb-org/3.4 multiverse" | tee /etc/apt/sources.list.d/mongodb-org-3.4.list && apt-get update && apt-get install -y maven sqlite g++ make automake bison flex pkg-config libevent-dev libssl-dev libtool mongodb-org-shell && apt-get clean && update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java' returned a non-zero code: 100
Can someone give me a pointer as to what I am doing wrong?
This bug is important to us: we at Adobe are analyzing the feature set of ModelDB to see if we can use it for our use case.
modeldb.csail.mit.edu links to spark.ml and scikit-learn Getting Started docs, but those links are empty hrefs.
Also, there are no files in the gh-pages branch with that kind of content. master has files which look appropriate: https://github.com/mitdbg/modeldb/blob/master/docs/getting_started/scikit_learn.md and https://github.com/mitdbg/modeldb/blob/master/docs/getting_started/spark_ml.md.
Once a model has been trained/registered with ModelDB, expose functionality to add metadata to the model. This could include periodic metrics, failure reports, etc.
Potential APIs:
appendMetadata(int modelId, KV keyvalue)
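One possible client-side shape for this API, sketched in Python. The class and method names here are hypothetical, not part of the existing ModelDB client; the point is that metadata is append-only per model ID, so repeated measurements accumulate rather than overwrite:

```python
from collections import defaultdict

class ModelMetadataStore:
    """Hypothetical sketch of the proposed appendMetadata(modelId, keyvalue)
    API: attach arbitrary key-value metadata to an already-registered model."""

    def __init__(self):
        # model_id -> list of (key, value) pairs, in insertion order
        self._metadata = defaultdict(list)

    def append_metadata(self, model_id, key, value):
        # Append-only: periodic metrics and failure reports accumulate,
        # so the history of a model is preserved.
        self._metadata[model_id].append((key, value))

    def get_metadata(self, model_id):
        return list(self._metadata[model_id])

store = ModelMetadataStore()
store.append_metadata(42, "accuracy_daily", 0.91)
store.append_metadata(42, "failure", "OOM on 2017-10-11")
```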
Two of the Python tests have been failing for some time now. Could someone please take a look and either fix the tests or the code they're testing? I'm not familiar enough with it to do it myself.
testPipelineEvent.test_overall_pipeline_fit_event
and testPipelineEvent.test_pipeline_first_fit_stage
======================================================================
FAIL: test_overall_pipeline_fit_event (testPipelineEvent.TestPipelineEvent)
----------------------------------------------------------------------
Traceback (most recent call last):
File "testPipelineEvent.py", line 105, in test_overall_pipeline_fit_event
utils.is_equal_transformer_spec(spec, expected_spec, self)
File "/Users/arcarter/code/modeldb/client/python/modeldb/tests/utils.py", line 153, in is_equal_transformer_spec
tester.assertEqual(len(spec1.hyperparameters), len(spec2.hyperparameters))
AssertionError: 14 != 10
======================================================================
FAIL: test_pipeline_first_fit_stage (testPipelineEvent.TestPipelineEvent)
----------------------------------------------------------------------
Traceback (most recent call last):
File "testPipelineEvent.py", line 149, in test_pipeline_first_fit_stage
utils.is_equal_transformer_spec(spec, expected_spec, self)
File "/Users/arcarter/code/modeldb/client/python/modeldb/tests/utils.py", line 153, in is_equal_transformer_spec
tester.assertEqual(len(spec1.hyperparameters), len(spec2.hyperparameters))
AssertionError: 7 != 3
Currently the ModelDB syncer can be instantiated from the config file or explicitly. Allow the user to override the options read in from the config and provide a more flexible explicit constructor.
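A sketch of what the more flexible constructor could look like: config-file values are loaded first, and any explicitly passed keyword arguments override them. The option names and defaults below are illustrative, not the real syncer API:

```python
# Stand-in for values parsed from the syncer config file.
DEFAULT_CONFIG = {"host": "localhost", "port": 6543, "sync_interval": 10}

class Syncer:
    """Hypothetical flexible constructor: config first, explicit args win."""

    def __init__(self, config_path=None, **overrides):
        # In the real client this would read config_path (e.g. syncer.json);
        # here we use a fixed dict to keep the sketch self-contained.
        config = dict(DEFAULT_CONFIG)
        config.update(overrides)  # explicit keyword args override the file
        self.host = config["host"]
        self.port = config["port"]
        self.sync_interval = config["sync_interval"]

# Config supplies host and sync_interval; the caller overrides only the port.
syncer = Syncer(port=7000)
```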
The classes and interfaces for edu.mit.csail.db.ml.server.storage.metadata have the right package declaration but live under a different directory. IntelliJ IDEA was complaining about missing imports.
The current Docker image is several GB. Can we reduce its size?
The many import * statements make it hard to tell at a glance what is importing what, and from where. This is a ticket to pep8-ify the code and remove cruft where possible.
The TransformerSpec struct has a field called features. However, this field is never used anywhere; we should remove it. FitEvent includes the features, and we do use those.
When I try to run SimpleSampleWithModelDB.py, I get this error on both the server and client side.
May I know what the real problem is?
Also, how do I clear the log or metadata file? I was able to launch the frontend with the title Simple Sample, but it fails with an error when loading the metadata.
Model Name should be listed in the ID column under Model ID, Experiment ID, and Experiment Run ID.
The Scala client is not released in a public repo. In this scenario, every user who wants to write a client has to build it locally on their own computer. The Scala client should be released to a public repository.
The UI does not show models on which evaluate was not called. In our case, evaluate is not called most of the time, and there is no way to visualize those models. The culprit seems to be the following lines in the UI:
for (var i = 0; i < models.length; i++) {
  var model_metrics = models[i].metrics;
  var metrics = [];
  models[i].show = false;
  for (key in model_metrics) {
    if (model_metrics.hasOwnProperty(key)) {
      var val = Object.keys(model_metrics[key]).map(function(k){return model_metrics[key][k]})[0];
      val = Math.round(parseFloat(val) * 1000) / 1000;
      metrics.push({
        "key": key,
        "val": val
      });
      models[i].show = true;
    }
  }
  models[i].metrics = metrics;
}
models = models.filter(function(model) {
  return model.show;
});
If I comment out the lines
models = models.filter(function(model) {
  return model.show;
});
the models are shown, but this also surfaces an extra model called "pipeline model", which must not be shown if fitSync is called on a pipeline. Any idea why this is happening?
Does ModelDB support a use case of utilizing only the frontend? The scikit-learn and Apache Spark version requirements for the client do not fit my environment. I see the configurations can be YAML files, and I am wondering if this is a supported use case: manually creating the YAML files and using the frontend to view model results.
I was able to set up the server and successfully ran all the code in https://github.com/mitdbg/modeldb/tree/master/client/python/samples/sklearn.
However, when I tried the basic client, all the sample code in https://github.com/mitdbg/modeldb/tree/master/client/python/samples/basic yields this error:
Syncer.instance.add_to_buffer(fit_event)
AttributeError: type object 'Syncer' has no attribute 'instance'
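This AttributeError typically means no syncer was constructed before an event was logged: the client stores its singleton in a class-level Syncer.instance attribute, which only exists once a syncer has been created. A simplified stand-in (not the real modeldb code) showing the failure mode and the likely fix:

```python
class Syncer:
    """Simplified stand-in for the modeldb client's singleton syncer.
    Note: there is NO 'instance' class attribute until a syncer is built."""

    def __init__(self):
        self.buffer = []
        Syncer.instance = self  # register the singleton on construction

    @classmethod
    def create_syncer(cls):
        return cls()

    def add_to_buffer(self, event):
        self.buffer.append(event)

# This line would raise the reported error if run first:
#   Syncer.instance.add_to_buffer("fit_event")   # AttributeError

# Fix: create the syncer before logging any events (the sklearn samples
# do this up front, which is why they work).
syncer = Syncer.create_syncer()
Syncer.instance.add_to_buffer("fit_event")
```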
Being able to delete models once they are synced.
Currently this is blocked on a version issue related to the thrift-maven-plugin. Will be resolved when that is unblocked.
Missing a commons-lang dependency, and the path to thrift was not the same as on my system. Patch suggestion in #195
In the .thrift file, in FitEvent, the type of "LabelColumns" is list.
In CreateDb.sql, you store "LabelColumn" (no "s" in SQL) as a single string.
Are there supposed to be multiple label columns, separated by commas, in the SQL table? If so, we should rename the SQL column to "LabelColumns" and add an appropriate comment.
This will make it easier to refer to models from different parts of the system.
If two YAML files are synced, both containing:
PROJECT:
  NAME: my_name
  DESCRIPTION: my_description
but containing different MODEL information, the frontend does not correctly aggregate them under one project, and instead creates two distinct project IDs with the same name and description.
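The expected behavior can be sketched as keying projects on (NAME, DESCRIPTION), so that repeated syncs with an identical PROJECT block resolve to the same project ID. This is illustrative pseudocode for the bug report, not the actual frontend logic:

```python
# Expected aggregation: identical (name, description) -> one project ID.
projects = {}       # (name, description) -> project_id
next_id = [1]       # simple counter for new project IDs

def get_or_create_project(name, description):
    key = (name, description)
    if key not in projects:
        projects[key] = next_id[0]
        next_id[0] += 1
    return projects[key]

# First YAML sync and second YAML sync (different MODEL blocks, same PROJECT):
a = get_or_create_project("my_name", "my_description")
b = get_or_create_project("my_name", "my_description")
assert a == b   # one project, not two distinct IDs
```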
The scala client jar name has changed to modeldb-scala-client.jar. Update tests accordingly.
ModelDB currently does not have a concept of users or accounts. All data is visible to all users. However, with this model, users cannot do things like bookmark models or annotate them (for themselves as opposed to across all users).
Adding user accounts and authentication will enable these functions along with access control.
Sync'ing via YAML file requires that the CONFIG be non-empty and METRICS be iterable. These restrictions should be relaxed so that both fields can be empty.
There should be a live, public example of ModelDB for people to play with. Visitors should get to browse both a Jupyter notebook with example models to run and ModelDB's Node.js frontend to see reports about those models.
Architecture:
This will also require the creation of one or more .ipynb files that can be run by users to create reports in ModelDB.
Set up a continuous integration system for ModelDB including automated building and testing.
Requirements:
On running ./start_server.sh
[INFO] ------------------------------------------------------------------------
[INFO] Building modeldb 1.0-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- exec-maven-plugin:1.5.0:java (default-cli) @ modeldb ---
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Starting the simple server on port 6543...
Requires a dependency on the Simple Logging Facade for Java (https://www.slf4j.org). The most recent version is 1.7.24.
There should be a pip installable package for modeldb's python client. This will allow people to more easily deploy modeldb.
A PyPI package may require moving syncer.json, or duplicating its hardcoded values into ConfigUtils.py.
This is currently being worked on at the PyPI test site.
Today the models are generated in an SQLite-specific package, and all references in the code follow that pattern, which makes changing the database very hard. The models should instead be created in a database-agnostic package, so that changing the underlying database only requires changing the library.xml file.
http://localhost:3000/
First noticed at 152a63a
Bug not present at dea7604
TypeError: api.testConnection is not a function
at /Users/arcarter/code/modeldb/frontend/routes/index.js:7:7
at Layer.handle [as handle_request] (/Users/arcarter/code/modeldb/frontend/node_modules/express/lib/router/layer.js:95:5)
at next (/Users/arcarter/code/modeldb/frontend/node_modules/express/lib/router/route.js:131:13)
at Route.dispatch (/Users/arcarter/code/modeldb/frontend/node_modules/express/lib/router/route.js:112:3)
at Layer.handle [as handle_request] (/Users/arcarter/code/modeldb/frontend/node_modules/express/lib/router/layer.js:95:5)
at /Users/arcarter/code/modeldb/frontend/node_modules/express/lib/router/index.js:277:22
at Function.process_params (/Users/arcarter/code/modeldb/frontend/node_modules/express/lib/router/index.js:330:12)
at next (/Users/arcarter/code/modeldb/frontend/node_modules/express/lib/router/index.js:271:10)
at Function.handle (/Users/arcarter/code/modeldb/frontend/node_modules/express/lib/router/index.js:176:3)
at router (/Users/arcarter/code/modeldb/frontend/node_modules/express/lib/router/index.js:46:12)
Steps to reproduce:
Create a Spark job which runs for more than 10 minutes.
Call fitSync on any estimator in the Spark client.
17/10/11 08:13:33 ERROR ApplicationMaster: User class threw exception: com.twitter.finagle.ConnectionFailedException: Connection timed out at remote address: <DNS:port> from service: <DNS:port>. Remote Info: Upstream Address: Not Available, Upstream Client Id: Not Available, Downstream Address: <DNS:port>, Downstream Client Id: <DNS:port>, Trace Id: 0bb6d8aea19c2260.0bb6d8aea19c2260<:0bb6d8aea19c2260
com.twitter.finagle.ConnectionFailedException: Connection timed out at remote address: <DNS:port> from service: <DNS:port>. Remote Info: Upstream Address: Not Available, Upstream Client Id: Not Available, Downstream Address: <DNS:port>, Downstream Client Id: <DNS:port>, Trace Id: 0bb6d8aea19c2260.0bb6d8aea19c2260<:0bb6d8aea19c2260
Caused by: java.io.IOException: Connection timed out
  at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
  at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
  at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
  at sun.nio.ch.IOUtil.read(IOUtil.java:192)
  at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
  at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64)
  at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
  at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
  at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
  at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at com.twitter.finagle.util.ProxyThreadFactory$$anonfun$newProxiedRunnable$1$$anon$1.run(ProxyThreadFactory.scala:19)
  at java.lang.Thread.run(Thread.java:748)
For a job that runs less than 7 minutes, the results are fine.
Not sure if anyone else has experienced this issue.
I am using the docker-compose ModelDB installation on a Mac, and I am unable to connect the docker-machine to localhost. Here's another forum discussing the issue: https://forums.docker.com/t/using-localhost-for-to-access-running-container/3148
As a workaround, I ran the following script from the [path to modeldb directory], which find/replaces all the instances of localhost in the routing files:
sed -i -e 's/localhost/<docker-machine address>/g' server/src/main/resources/reference.conf
sed -i -e 's/localhost/<docker-machine address>/g' server/src/main/resources/reference-docker.conf
sed -i -e 's/localhost/<docker-machine address>/g' server/src/main/resources/reference-test.conf
sed -i -e 's/localhost/<docker-machine address>/g' client/syncer.json
sed -i -e 's/localhost/<docker-machine address>/g' frontend/util/check_thrift.js
sed -i -e 's/localhost/<docker-machine address>/g' frontend/util/thrift.js
sed -i -e 's/localhost/<docker-machine address>/g' client/python/modeldb/basic/Structs.py
To find the <docker-machine address>, do: $ docker-machine ls
NAME ACTIVE DRIVER STATE URL SWARM DOCKER ERRORS
default - virtualbox Running tcp://192.168.99.100:2376 v17.05.0-ce
After that, I was able to add new projects to the Dockerized ModelDB.
I can branch and submit a pull request. Or is there a better way? Please advise.
Being able to install and use the python client with python 3.x
When I try to create annotations on a model, it gives a 500 error:
Transformer is not defined
ReferenceError: Transformer is not defined
at Object.storeAnnotation (E:\machinelearning\git\modeldb\frontend\util\api.js:109:27)
at E:\machinelearning\git\modeldb\frontend\routes\models.js:30:7
at Layer.handle [as handle_request] (E:\machinelearning\git\modeldb\frontend\node_modules\express\lib\router\layer.js:95:5)
at next (E:\machinelearning\git\modeldb\frontend\node_modules\express\lib\router\route.js:131:13)
at Route.dispatch (E:\machinelearning\git\modeldb\frontend\node_modules\express\lib\router\route.js:112:3)
at Layer.handle [as handle_request] (E:\machinelearning\git\modeldb\frontend\node_modules\express\lib\router\layer.js:95:5)
at E:\machinelearning\git\modeldb\frontend\node_modules\express\lib\router\index.js:277:22
at param (E:\machinelearning\git\modeldb\frontend\node_modules\express\lib\router\index.js:349:14)
at param (E:\machinelearning\git\modeldb\frontend\node_modules\express\lib\router\index.js:365:14)
at Function.process_params (E:\machinelearning\git\modeldb\frontend\node_modules\express\lib\router\index.js:410:3)
Imagine that I have a long custom experiment pipeline managed with a bash script.
Is there a way to pass custom output/artifacts from such a pipeline to ModelDB?
I believe some kind of REST API would be nice.
It would also be OK to use other languages (generate artifacts somehow, then run a Python script to put them in the DB).
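To make the request concrete: ModelDB does not expose such a REST API today, so everything below is hypothetical — the route, field names, and values are invented to illustrate what a bash-driven pipeline could POST at the end of a run:

```python
import json

# Hypothetical payload a bash pipeline step could assemble and POST.
# None of these routes or fields exist in ModelDB today; this sketches
# the shape of the requested feature.
payload = {
    "project": "my_bash_pipeline",
    "experiment_run": "run-2017-10-11",
    "metrics": {"rmse": 0.42},
    "artifacts": ["s3://bucket/model.bin"],
}
body = json.dumps(payload)

# A pipeline step could then do something like (endpoint invented):
#   curl -X POST http://localhost:6543/api/runs -H 'Content-Type: application/json' -d "$body"
```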
Suggestion: remove the SQLite footprint from ModelDB and use MongoDB as the sole database.
Use case: to have ModelDB running in production, we run it in a Docker container with the Mongo database mounted as a volume from the host OS outside the container. This allows us to quickly respawn if the container goes down, and enables data recovery. As for the SQLite DB, it's not only redundant given the introduction of Mongo, but its path is also hardcoded, so mounting it separately is more of a hack.
Today the Mongo metadata store takes a host and port. The issue with this approach is that we cannot provide authentication information, since that generally goes in the connection URL. The proposal is to change the host and port to a URL, which covers the plain host-and-port approach (set the URL to <host>:<port>) as well as more complex use cases where we need to pass more information to the Mongo client.
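A small sketch of why a URL is strictly more expressive than a (host, port) pair; the hostnames and credentials below are placeholders:

```python
# Placeholder values, not real credentials or hosts.
user, password = "mdb_user", "s3cret"
host, port = "mongo.internal", 27017

# The current (host, port) configuration still maps cleanly onto a URL:
simple_url = "mongodb://{}:{}".format(host, port)

# But only a URL can carry authentication (and, more generally, options
# like replica sets or auth databases) that host/port alone cannot:
auth_url = "mongodb://{}:{}@{}:{}/admin".format(user, password, host, port)

# A Mongo client library (e.g. pymongo.MongoClient(auth_url)) would then
# handle authentication from the URL directly.
```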
Currently, I've commented out a line that counts the number of rows in a DataFrame.
Counting the number of rows in a DataFrame is a slow operation. However, we use the numRows field in our DataFrame thrift struct. If we insist on keeping the numRows field, then we need to accept the cost of counting the rows (this requires a full sequential scan of the dataset).
I wonder if it's inappropriate for ModelDB, which is supposed to be a low-overhead tool, to perform a sequential scan of a DataFrame. If we do this, the overhead of ModelDB will skyrocket and will scale linearly with the size of the user's dataset.
I'm thinking of making "should we count the number of rows?" configurable. If users are willing to accept the cost of ModelDB counting the rows in the DataFrame, they can indicate so; if they need the performance, they can indicate that instead. Also, to avoid recounting the rows of the same DataFrame, we can cache the counts.
Thoughts?
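The "configurable plus cached" proposal above can be sketched as follows. The flag name, the DataFrame-identity key, and the -1 sentinel for "unknown" are all illustrative choices, not ModelDB's actual API:

```python
# Cache of row counts, keyed by some stable DataFrame identity.
_row_count_cache = {}

def maybe_num_rows(df_id, count_fn, count_rows_enabled):
    """Return the row count only if the user opted in, caching the result
    per DataFrame so the expensive sequential scan runs at most once.
    Returns -1 as an 'unknown' sentinel when counting is disabled."""
    if not count_rows_enabled:
        return -1
    if df_id not in _row_count_cache:
        _row_count_cache[df_id] = count_fn()  # the full sequential scan
    return _row_count_cache[df_id]

# Track how many times the expensive scan actually runs.
calls = []
def slow_count():
    calls.append(1)
    return 1000

assert maybe_num_rows("df1", slow_count, False) == -1    # opted out: no scan
assert maybe_num_rows("df1", slow_count, True) == 1000   # opted in: one scan
assert maybe_num_rows("df1", slow_count, True) == 1000   # cached: still one scan
assert len(calls) == 1
```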
mongo --eval "db.getSiblingDB('admin').shutdownServer()"
doesn't actually shut down the server correctly. I've had a better experience just doing
ps -a | grep mongo
kill [PID]
This could obviously use some more elaboration, and I could look at it more. But it's worth noting for now that the instructions aren't doing what they're supposed to.