vc1492a / henosis
A Python framework for deploying recommendation models for form fields.
License: Other
Useful when a model's predicted probabilities do not show a clear indication of the optimal class category. Adds the ability to not provide a recommendation in certain scenarios.
In the case where preload_pickles is set to True in config.yaml but there are no objects in the S3 bucket, change preload_pickles to False.
Remove the library's dependency on imbalanced-learn and direct users to balance datasets appropriately prior to fitting the model in Henosis. This has the effect of reducing the size of the library and the number of dependencies.
Line 106 in 78acc32
Generalize the terminology away from form population and toward providing recommendations more generally.
Currently, the entire model data is retrieved from Elasticsearch before being parsed for the model IDs. Replace this functionality with code that retrieves only the model IDs from Elasticsearch, which should improve the performance of the library.
Replace yaml.load with yaml.safe_load in all files and make any other necessary changes. Using safe_load limits object creation to simple Python objects such as integers or lists, which matches the format of the user-provided config.yaml file. As users will be providing config.yaml, using safe_load guards against the possibility of creating a malicious object.
Restructure the SKModel class in models.py to inherit from Models as opposed to being structured as a class within a class.
When appropriate, migrate to using Python 3.7's data classes. This will clean up code and improve maintainability, but does require using 3.7 or higher.
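As a minimal sketch of what that migration could look like (the class and field names below are illustrative, loosely mirroring the model document shown later in these issues, and are not part of Henosis):

```python
from dataclasses import dataclass, field

# Hypothetical sketch: a model metadata record as a Python 3.7+ data class.
# __init__, __repr__, and __eq__ are generated automatically, which removes
# boilerplate and improves maintainability.
@dataclass
class ModelRecord:
    id: str
    dependent: str
    deployed: bool = False
    call_count: int = 0
    independent: list = field(default_factory=list)

record = ModelRecord(id="bb9e7a91", dependent="variableTwo")
print(record)  # readable auto-generated repr, no hand-written __repr__ needed
```

Note the trade-off called out above: dataclasses require dropping support for Python versions below 3.7.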
Currently, only AND search queries are supported.
From 0.12.2
Currently, Henosis only supports the use of scikit-learn models with or without predicted probabilities distinctly. It is currently not possible to use models that do and do not provide predicted probabilities in tandem. For example, a LinearSVC cannot be deployed to the same instance as a MultinomialNB if predict_probabilities is set to True in config.yaml, but may be deployed to the same instance if predict_probabilities is set to False in config.yaml.
Note this in the documentation and introduce a check that ignores models that do not return predicted probabilities if predict_probabilities is set to True in config.yaml.
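The check could look something like the following sketch (the function name filter_probabilistic is illustrative, not part of Henosis; the toy classes stand in for scikit-learn estimators):

```python
# When predict_probabilities is True, ignore any model whose estimator cannot
# return predicted probabilities. scikit-learn estimators expose predict_proba
# only when they support it, so hasattr is a reasonable capability check.
def filter_probabilistic(models, predict_probabilities):
    if not predict_probabilities:
        return models
    return [m for m in models if hasattr(m, "predict_proba")]

class WithProba:  # stands in for e.g. MultinomialNB
    def predict_proba(self, X):
        return [[0.5, 0.5] for _ in X]

class WithoutProba:  # stands in for e.g. LinearSVC, which has no predict_proba
    pass

models = [WithProba(), WithoutProba()]
print(len(filter_probabilistic(models, True)))  # -> 1, only the probabilistic model survives
```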
Add test coverage and metrics to the readme via a badge for improved transparency.
Scikit-learn now includes functionality to report precision, recall, and F-score metrics with various types of averages. Move to using this functionality, which will simplify the code base and allow the Models class to report training and test results with varying types of averages.
Use a while loop that continues until a successful response when reloading pickled objects.
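A minimal sketch of such a retry loop, under the assumption that the reload is driven by an HTTP-style call; reload_with_retry and fetch are illustrative names, not part of the Henosis API, and fetch() is assumed to return a (status_code, payload) pair:

```python
import logging
import time

def reload_with_retry(fetch, max_attempts=10, wait_seconds=5):
    # keep requesting until a successful response instead of failing after one attempt
    attempts = 0
    while attempts < max_attempts:
        status, payload = fetch()
        if status in (200, 201):
            return payload
        logging.warning('Reload failed with status %s, retrying.', status)
        attempts += 1
        time.sleep(wait_seconds)
    raise RuntimeError('Could not reload pickled objects after %d attempts.' % max_attempts)
```

Bounding the number of attempts (rather than looping forever) avoids hanging the server if the bucket is persistently unreachable.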
Change from gevent.wsgi to from gevent.pywsgi, as gevent.wsgi is deprecated.
Henosis currently supports classification models only, and classification-type recommendation (prediction) problems. As such, reference this fact more neatly and cleanly in the documentation while acknowledging future plans to introduce support for regression-type models and other approaches.
Currently, the train and test results stored in Henosis reflect the top-1 scores, despite the fact that the Henosis instance may be providing a top-n recommendation (e.g. n=3 or n=5). Add the ability in the framework to test against the top-n if specified when calling m.train(d) or m.test(d), where m is a Models() object.
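A top-n score could be computed along the lines of the sketch below, assuming each prediction is available as a dict mapping class labels to predicted probabilities; the helper name top_n_accuracy is illustrative, not part of the Models() API:

```python
def top_n_accuracy(y_true, y_proba, n=3):
    # a prediction counts as correct if the true label is among the n classes
    # with the highest predicted probability
    hits = 0
    for true_label, proba in zip(y_true, y_proba):
        top_n = sorted(proba, key=proba.get, reverse=True)[:n]
        if true_label in top_n:
            hits += 1
    return hits / len(y_true)

y_true = ["a", "b", "c"]
y_proba = [
    {"a": 0.5, "b": 0.3, "c": 0.2},
    {"a": 0.6, "b": 0.3, "c": 0.1},
    {"a": 0.4, "b": 0.4, "c": 0.2},
]
print(top_n_accuracy(y_true, y_proba, n=1))  # only the first label is the top-1 prediction
print(top_n_accuracy(y_true, y_proba, n=2))  # "b" falls within the top-2 as well
```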
Line 437 in 78acc32
Provide the model ID that is used in the system to the user when deploying or retracting the model. Add this change to model delete as well.
As implemented, Henosis creates a unique model ID by creating a hex string. This approach does not allow for version control of models: a retrained model with an identical specification to a previous model is assigned a completely new unique ID, with no information stored in relation to the previously stored and fit model.
Implementing a salted, directed acyclic graph using hashes of the documents (models) themselves would provide a means of version controlling models and associating models trained with the same data with one another. Derivatives of a model (such as the same model being retrained) that are downstream of an upstream change receive changes in their hash values, and these changes can be tracked in the graph. This functionality would be useful when examining the state of the system: developers operating the system would be able to keep track of various model specifications easily, allowing for A/B testing and other modes of operation.
Needs to be explored further, but could be implemented by using NetworkX on top of the Elasticsearch functionality.
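An exploratory sketch of the hashing part, using only the standard library (all names here are hypothetical; the edges of the resulting graph could later be stored via NetworkX as suggested above):

```python
import hashlib
import json

# Each model document is hashed together with a salt and the hash of its parent,
# so a retrained model receives a new hash that remains traceably linked to its
# predecessor - the parent/child pairs form the edges of a DAG.
def model_hash(model_doc, parent_hash=None, salt="henosis"):
    payload = json.dumps(model_doc, sort_keys=True)  # canonical form for stable hashing
    material = salt + (parent_hash or "") + payload
    return hashlib.sha256(material.encode("utf-8")).hexdigest()

doc = {"dependent": "variableTwo", "modelType": "MultinomialNB"}
v1 = model_hash(doc)                      # initial fit
v2 = model_hash(doc, parent_hash=v1)      # retrain with identical specification
print(v1 != v2)  # -> True: distinct versions, linked through the parent hash
```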
As the software adds functionality for models outside of scikit-learn, it will become necessary to distinguish model implementations in the database as well as in the response from the API. For example, the current API response is as follows:
{
  "_id": "bb9e7a91c256412ca40f57e27e8e90b6",
  "_index": "models",
  "_score": 1.0,
  "_source": {
    "callCount": 30,
    "dependent": "variableTwo",
    "deployed": true,
    "encoderPath": "encoder_variableTwo_1.pickle",
    "encoderType": "CountVectorizer",
    "id": "bb9e7a91c256412ca40f57e27e8e90b6",
    "independent": [
      {
        "generator_path": "clean_text.pickle",
        "inputs": [
          "title"
        ],
        "name": "cleanText"
      }
    ],
    "lastTestedDate": "2018-01-18T08:29:09",
    "lastTrainedDate": "2018-01-18T08:29:08",
    "modelPath": "model_variableTwo_1.pickle",
    "modelType": "OneVsRestClassifier(estimator=MultinomialNB(alpha=0.001, class_prior=None, fit_prior=True),\n n_jobs=1)",
    "recommendationThreshold": 0.2,
    "testAccuracy": 0.9019607843137255,
    "testF1": 0.9008452056839288,
    "testPrecision": 0.9044056750340173,
    "testRecall": 0.9019607843137255,
    "testTime": 0.005343914031982422,
    "trainAccuracy": 0.9849376731301939,
    "trainDataBalance": "upsample",
    "trainF1": 0.9849744199555293,
    "trainPrecision": 0.9854426711336097,
    "trainRecall": 0.9849376731301939,
    "trainTime": 0.04866385459899902
  },
  "_type": "model"
}
The updated response would include a modelClass field that indicates whether the model is one implemented in scikit-learn or another library, e.g. something along the lines of the following (see the modelClass field):
{
  "_id": "bb9e7a91c256412ca40f57e27e8e90b6",
  "_index": "models",
  "_score": 1.0,
  "_source": {
    "callCount": 30,
    "dependent": "variableTwo",
    "deployed": true,
    "encoderPath": "encoder_variableTwo_1.pickle",
    "encoderType": "CountVectorizer",
    "id": "bb9e7a91c256412ca40f57e27e8e90b6",
    "independent": [
      {
        "generator_path": "clean_text.pickle",
        "inputs": [
          "title"
        ],
        "name": "cleanText"
      }
    ],
    "lastTestedDate": "2018-01-18T08:29:09",
    "lastTrainedDate": "2018-01-18T08:29:08",
    "modelPath": "model_variableTwo_1.pickle",
    "modelClass": "SKModel()",
    "modelType": "OneVsRestClassifier(estimator=MultinomialNB(alpha=0.001, class_prior=None, fit_prior=True),\n n_jobs=1)",
    "recommendationThreshold": 0.2,
    "testAccuracy": 0.9019607843137255,
    "testF1": 0.9008452056839288,
    "testPrecision": 0.9044056750340173,
    "testRecall": 0.9019607843137255,
    "testTime": 0.005343914031982422,
    "trainAccuracy": 0.9849376731301939,
    "trainDataBalance": "upsample",
    "trainF1": 0.9849744199555293,
    "trainPrecision": 0.9854426711336097,
    "trainRecall": 0.9849376731301939,
    "trainTime": 0.04866385459899902
  },
  "_type": "model"
}
See PEP-0427.
Line 568 in 78acc32
Add the ability to provide additional information in the request response, such as predicted probabilities or other useful information, through a verbose parameter when submitting the request.
Add functionality to the Models() class that allows users to return a list of generator IDs in the Henosis instance, with or without associated information such as the models used by the generator, its inputs and outputs, etc.
This functionality is not yet present in the API (GET generators), and a new endpoint should be created before including the API's results as part of the class call. This functionality would be nice to have as part of the Models() class, as it would provide additional options and improve users' modeling workflow.
Using ultrajson or a similar library, speed up JSON operations to improve the performance of the server.
Add a series of tests that ensure proper functionality as part of the package.
Currently, calling Models().SKModel().load_model() results in a KeyError if the ID of a non-existent model is specified, see below:
WARNING:root:<Response [404]>
{'description': 'Error retrieving model from Elasticsearch.'}
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-10-8aa4cd950881> in <module>()
1 for m in list_models():
----> 2 model = Models().SKModel().load_model('bbe4830f5eb24a73a299b720eef58ccd', s)
3 print(model)
4
5
~/Files/Projects/5X/PRS Form Recommender/api/src/henosis/Henosis/model.py in load_model(self, model_id, server_config)
344 print(model_info)
345
--> 346 self.id = model_info['models']['id']
347 self.deployed = model_info['models']['deployed']
348 self.call_count = model_info['models']['callCount']
KeyError: 'models'
The aim here is to improve the response to the error, e.g. something along the lines of 'Specified model ID not present in Henosis instance.', and to not attempt to load the model if an error occurs in the API (currently, an error in the API response does not stop Henosis from attempting to return a model object).
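The guard could look something like this sketch (safe_load_model is an illustrative name, not the actual Henosis method; the API response is assumed to be available as a (status_code, payload_dict) pair):

```python
import logging

# Stop before building a model object when the API reports an error or the
# expected 'models' key is missing, instead of raising a KeyError downstream.
def safe_load_model(model_id, api_response):
    status, payload = api_response
    if status != 200 or "models" not in payload:
        logging.warning("Specified model ID not present in Henosis instance: %s", model_id)
        return None
    return payload["models"]

bad = (404, {"description": "Error retrieving model from Elasticsearch."})
print(safe_load_model("bbe4830f5eb24a73a299b720eef58ccd", bad))  # -> None, no KeyError
```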
Line 314 in 78acc32
Something along the lines of:
def scroll(self, url, auth, headers, q=None):
    if q is None:
        r = requests.get(url, headers=headers, verify=False, auth=auth)
    else:
        r = requests.get(url, json=q, headers=headers, verify=False, auth=auth)
    # check for success
    if r.status_code not in [200, 201]:
        logging.warning(r.status_code)
        logging.warning(r.text)
        time.sleep(5)
    data = r.json()
    parsed = self.parse_results(data)
    self.docs += [p for p in parsed if 'a' in p.keys() and p['a'] != 'Person']
    if '_scroll_id' in data.keys() and len(self.docs) != self.doc_count:
        # query strings like this only for version 1.4
        # you can otherwise scroll in newer versions of Elasticsearch by sending a JSON payload
        scroll_url = "/".join(url.split('/')[0:4]) + '/_search/scroll?scroll=1d&scroll_id=' + data['_scroll_id']
        # q = {
        #     "scroll": "1d",
        #     "scroll_id": data['_scroll_id']
        # }  # for newer versions of Elasticsearch
        self.doc_count = len(self.docs)
        logging.info(str(len(self.docs)) + ' documents retrieved from index.')
        self.scroll(scroll_url, auth, headers)
Add functionality to the Models() class that allows users to return a list of model IDs in the Henosis instance, with or without associated information such as the performance, training date, etc.
This functionality is present in the API (GET models), but would be nice to have as part of the Models() class, as it would provide additional options and improve users' modeling workflow.
Need to upsample only the training data. Failure to do so allows information from the test and validation sets to "bleed" into the training set.
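The correct ordering is: split first, then upsample the training portion only. A minimal pure-Python sketch (function and variable names are illustrative; Henosis itself would do this inside its training routine):

```python
import random

def split_then_upsample(samples, test_fraction=0.25, seed=0):
    # samples is a list of (features, label) pairs
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    train, test = shuffled[:cut], shuffled[cut:]
    # upsample minority classes in the training split ONLY - the test split
    # is left untouched so no test information bleeds into training
    by_label = {}
    for x, y in train:
        by_label.setdefault(y, []).append((x, y))
    largest = max(len(v) for v in by_label.values())
    upsampled = []
    for group in by_label.values():
        upsampled += group + rng.choices(group, k=largest - len(group))
    return upsampled, test

data = [(i, "minority" if i % 4 == 0 else "majority") for i in range(20)]
train, test = split_then_upsample(data)
print(len(test))  # -> 5: the held-out set is untouched by upsampling
```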
Line 300 in 78acc32
"must" should be replaced with the variable operator
as the above always results in a boolean must condition even if the operator is set to another value such as should
.
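The fix amounts to interpolating the operator variable into the bool clause rather than hard-coding "must"; a sketch, with the query structure assumed from the Elasticsearch bool query DSL and build_bool_query as an illustrative name:

```python
def build_bool_query(conditions, operator="must"):
    # the bool clause key comes from the operator variable, not a hard-coded "must"
    if operator not in ("must", "should", "must_not", "filter"):
        raise ValueError("Unsupported bool operator: %s" % operator)
    return {"query": {"bool": {operator: conditions}}}

q = build_bool_query([{"match": {"dependent": "variableTwo"}}], operator="should")
print("should" in q["query"]["bool"])  # -> True: no longer forced to "must"
```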
Reduce the API payload in the response and decrease the API response time by only returning the necessary IDs in the response.
See https://www.elastic.co/guide/en/elasticsearch/reference/6.2/search-request-source-filtering.html.
Right now, users pass the configuration file path into the Server() object. Instead, change the approach such that it mirrors the modeling process, with users creating a connection object and then passing that object into the Server() object for instantiation.
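A sketch of the proposed shape (class internals are hypothetical; only the call pattern matters here):

```python
# Mirror the modeling workflow: create a connection object first, then hand it
# to Server() for instantiation, instead of passing a config file path directly.
class Connection:
    def __init__(self, config_path):
        self.config_path = config_path
        self.settings = {}  # would be populated from config.yaml, e.g. via yaml.safe_load

class Server:
    def __init__(self, connection):
        self.connection = connection

# instead of Server('config.yaml'):
c = Connection('config.yaml')
s = Server(c)
print(s.connection.config_path)  # -> config.yaml
```

This keeps configuration concerns in one object and lets the same connection be reused or swapped out for testing.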