vc1492a / henosis
A Python framework for deploying recommendation models for form fields.
License: Other
Useful when a model's predicted probabilities do not show a clear indication of the optimal class category. Adds the ability to not provide a recommendation in certain scenarios.
In the case where preload_pickles is set to True in config.yaml but there are no objects in the S3 bucket, change preload_pickles to False.
Remove the library's dependency on imbalanced-learn and direct users to balance datasets appropriately prior to fitting the model in Henosis. This has the effect of reducing the size of the library and the number of dependencies.
Line 106 in 78acc32
Generalize the terminology away from form population and toward providing recommendations more generally.
Currently, the entire model data is retrieved from Elasticsearch before being parsed for the model IDs. Replace this functionality with code that retrieves only the model IDs from Elasticsearch, which should improve the performance of the library.
Replace yaml.load with yaml.safe_load in all files and make any other necessary changes. Using safe_load limits object creation to simple Python objects such as integers or lists, which matches the format of the user-provided config.yaml file. As users will be providing config.yaml, using safe_load guards against the possibility of creating a malicious object.
Restructure the SKModel class in models.py to inherit from Models as opposed to being structured as a class within a class.
When appropriate, migrate to using Python 3.7's data classes. This will clean up code and improve maintainability, but does require using 3.7 or higher.
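As a minimal sketch of what that migration could look like (the class and field names below are illustrative, loosely mirroring the model document shown later in these issues, and are not part of Henosis):

```python
from dataclasses import dataclass, field

# Hypothetical sketch: a model metadata record as a Python 3.7+ data class.
# __init__, __repr__, and __eq__ are generated automatically, which removes
# boilerplate and improves maintainability.
@dataclass
class ModelRecord:
    id: str
    dependent: str
    deployed: bool = False
    call_count: int = 0
    independent: list = field(default_factory=list)

record = ModelRecord(id="bb9e7a91", dependent="variableTwo")
print(record)  # readable auto-generated repr, no hand-written __repr__ needed
```

Note the trade-off called out above: dataclasses require dropping support for Python versions below 3.7.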
Currently, only AND search queries are supported.
From 0.12.2
Currently, Henosis only supports the use of scikit-learn models with or without predicted probabilities distinctly. It is currently not possible to use models that do and do not provide predicted probabilities in tandem. For example, a LinearSVC cannot be deployed to the same instance as a MultinomialNB if predict_probabilities is set to True in config.yaml, but may be deployed to the same instance if predict_probabilities is set to False in config.yaml.
Note this in the documentation and introduce a check that ignores models that do not return predicted probabilities if predict_probabilities is set to True in config.yaml.
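The check could look something like the following sketch (the function name filter_probabilistic is illustrative, not part of Henosis; the toy classes stand in for scikit-learn estimators):

```python
# When predict_probabilities is True, ignore any model whose estimator cannot
# return predicted probabilities. scikit-learn estimators expose predict_proba
# only when they support it, so hasattr is a reasonable capability check.
def filter_probabilistic(models, predict_probabilities):
    if not predict_probabilities:
        return models
    return [m for m in models if hasattr(m, "predict_proba")]

class WithProba:  # stands in for e.g. MultinomialNB
    def predict_proba(self, X):
        return [[0.5, 0.5] for _ in X]

class WithoutProba:  # stands in for e.g. LinearSVC, which has no predict_proba
    pass

models = [WithProba(), WithoutProba()]
print(len(filter_probabilistic(models, True)))  # -> 1, only the probabilistic model survives
```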
Add test coverage and metrics to the readme via a badge for improved transparency.
Scikit-learn now includes functionality to report precision, recall, and F-score metrics with various types of averages. Move to using this functionality, which will simplify the code base and allow the Models class to report training and test results with varying types of averages.
Use a while loop that continues until a successful response when reloading pickled objects.
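A minimal sketch of such a retry loop, under the assumption that the reload is driven by an HTTP-style call; reload_with_retry and fetch are illustrative names, not part of the Henosis API, and fetch() is assumed to return a (status_code, payload) pair:

```python
import logging
import time

def reload_with_retry(fetch, max_attempts=10, wait_seconds=5):
    # keep requesting until a successful response instead of failing after one attempt
    attempts = 0
    while attempts < max_attempts:
        status, payload = fetch()
        if status in (200, 201):
            return payload
        logging.warning('Reload failed with status %s, retrying.', status)
        attempts += 1
        time.sleep(wait_seconds)
    raise RuntimeError('Could not reload pickled objects after %d attempts.' % max_attempts)
```

Bounding the number of attempts (rather than looping forever) avoids hanging the server if the bucket is persistently unreachable.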
Change from gevent.wsgi to from gevent.pywsgi, as gevent.wsgi is deprecated.
Henosis currently supports classification models only, and classification-type recommendation (prediction) problems. As such, reference this fact more neatly and cleanly in the documentation while acknowledging future plans to introduce support for regression-type models and other approaches.
Currently, the train and test results stored in Henosis reflect the top-1 scores, despite the fact that the Henosis instance may be providing a top-n recommendation (e.g. n=3 or n=5). Add the ability in the framework to test against the top-n if specified when calling m.train(d) or m.test(d), where m is a Models() object.
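A top-n score could be computed along the lines of the sketch below, assuming each prediction is available as a dict mapping class labels to predicted probabilities; the helper name top_n_accuracy is illustrative, not part of the Models() API:

```python
def top_n_accuracy(y_true, y_proba, n=3):
    # a prediction counts as correct if the true label is among the n classes
    # with the highest predicted probability
    hits = 0
    for true_label, proba in zip(y_true, y_proba):
        top_n = sorted(proba, key=proba.get, reverse=True)[:n]
        if true_label in top_n:
            hits += 1
    return hits / len(y_true)

y_true = ["a", "b", "c"]
y_proba = [
    {"a": 0.5, "b": 0.3, "c": 0.2},
    {"a": 0.6, "b": 0.3, "c": 0.1},
    {"a": 0.4, "b": 0.4, "c": 0.2},
]
print(top_n_accuracy(y_true, y_proba, n=1))  # only the first label is the top-1 prediction
print(top_n_accuracy(y_true, y_proba, n=2))  # "b" falls within the top-2 as well
```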
Line 437 in 78acc32
Provide the model ID that is used in the system to the user when deploying or retracting the model. Add this change to model delete as well.
As implemented, Henosis creates a unique model ID by creating a hex string. This approach does not allow for version control of models: a retrained model with an identical specification to a previous model is assigned a completely new unique ID, with no information stored in relation to the previously stored and fit model.
Implementing a salted, directed acyclic graph using hashes of the documents (models) themselves would provide a means of version controlling models and associating models trained with the same data with one another. Derivatives of a model (such as the same model being retrained) that are downstream of an upstream change receive changes in their hash values, and these changes can be tracked in the graph. This functionality would be useful when examining the state of the system: developers operating the system would be able to keep track of various model specifications easily, allowing for A/B testing and other modes of operation.
Needs to be explored further, but could be implemented by using NetworkX on top of the Elasticsearch functionality.
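An exploratory sketch of the hashing part, using only the standard library (all names here are hypothetical; the edges of the resulting graph could later be stored via NetworkX as suggested above):

```python
import hashlib
import json

# Each model document is hashed together with a salt and the hash of its parent,
# so a retrained model receives a new hash that remains traceably linked to its
# predecessor - the parent/child pairs form the edges of a DAG.
def model_hash(model_doc, parent_hash=None, salt="henosis"):
    payload = json.dumps(model_doc, sort_keys=True)  # canonical form for stable hashing
    material = salt + (parent_hash or "") + payload
    return hashlib.sha256(material.encode("utf-8")).hexdigest()

doc = {"dependent": "variableTwo", "modelType": "MultinomialNB"}
v1 = model_hash(doc)                      # initial fit
v2 = model_hash(doc, parent_hash=v1)      # retrain with identical specification
print(v1 != v2)  # -> True: distinct versions, linked through the parent hash
```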
As the software adds functionality for models outside of scikit-learn, it will become necessary to distinguish model implementations in the database as well as in the response from the API. For example, the current API response is as follows:
{
  "_id": "bb9e7a91c256412ca40f57e27e8e90b6",
  "_index": "models",
  "_score": 1.0,
  "_source": {
    "callCount": 30,
    "dependent": "variableTwo",
    "deployed": true,
    "encoderPath": "encoder_variableTwo_1.pickle",
    "encoderType": "CountVectorizer",
    "id": "bb9e7a91c256412ca40f57e27e8e90b6",
    "independent": [
      {
        "generator_path": "clean_text.pickle",
        "inputs": [
          "title"
        ],
        "name": "cleanText"
      }
    ],
    "lastTestedDate": "2018-01-18T08:29:09",
    "lastTrainedDate": "2018-01-18T08:29:08",
    "modelPath": "model_variableTwo_1.pickle",
    "modelType": "OneVsRestClassifier(estimator=MultinomialNB(alpha=0.001, class_prior=None, fit_prior=True),\n n_jobs=1)",
    "recommendationThreshold": 0.2,
    "testAccuracy": 0.9019607843137255,
    "testF1": 0.9008452056839288,
    "testPrecision": 0.9044056750340173,
    "testRecall": 0.9019607843137255,
    "testTime": 0.005343914031982422,
    "trainAccuracy": 0.9849376731301939,
    "trainDataBalance": "upsample",
    "trainF1": 0.9849744199555293,
    "trainPrecision": 0.9854426711336097,
    "trainRecall": 0.9849376731301939,
    "trainTime": 0.04866385459899902
  },
  "_type": "model"
}
The updated response would include a modelClass field that indicates whether the model is one implemented in scikit-learn or another library, e.g. something along the lines of the following (see the modelClass field):
{
  "_id": "bb9e7a91c256412ca40f57e27e8e90b6",
  "_index": "models",
  "_score": 1.0,
  "_source": {
    "callCount": 30,
    "dependent": "variableTwo",
    "deployed": true,
    "encoderPath": "encoder_variableTwo_1.pickle",
    "encoderType": "CountVectorizer",
    "id": "bb9e7a91c256412ca40f57e27e8e90b6",
    "independent": [
      {
        "generator_path": "clean_text.pickle",
        "inputs": [
          "title"
        ],
        "name": "cleanText"
      }
    ],
    "lastTestedDate": "2018-01-18T08:29:09",
    "lastTrainedDate": "2018-01-18T08:29:08",
    "modelPath": "model_variableTwo_1.pickle",
    "modelClass": "SKModel()",
    "modelType": "OneVsRestClassifier(estimator=MultinomialNB(alpha=0.001, class_prior=None, fit_prior=True),\n n_jobs=1)",
    "recommendationThreshold": 0.2,
    "testAccuracy": 0.9019607843137255,
    "testF1": 0.9008452056839288,
    "testPrecision": 0.9044056750340173,
    "testRecall": 0.9019607843137255,
    "testTime": 0.005343914031982422,
    "trainAccuracy": 0.9849376731301939,
    "trainDataBalance": "upsample",
    "trainF1": 0.9849744199555293,
    "trainPrecision": 0.9854426711336097,
    "trainRecall": 0.9849376731301939,
    "trainTime": 0.04866385459899902
  },
  "_type": "model"
}
See PEP-0427.
Line 568 in 78acc32
Add the ability to provide additional information in the request response, such as predicted probabilities or other useful information, through a verbose parameter when submitting the request.
Add functionality to the Models() class that allows users to return a list of generator IDs in the Henosis instance, with or without associated information such as the models used by the generator, its inputs and outputs, etc.
This functionality is not yet present in the API (GET generators), and a new endpoint should be created before including the API's results as part of the class call. This functionality would be nice to have as part of the Models() class, as it would provide additional options and improve users' modeling workflow.
Using ultrajson or a similar library, speed up JSON operations to improve the performance of the server.
Add a series of tests that ensure proper functionality as part of the package.
Currently, calling Models().SKModel().load_model() results in a KeyError if the ID of a non-existent model is specified, see below:
WARNING:root:<Response [404]>
{'description': 'Error retrieving model from Elasticsearch.'}
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-10-8aa4cd950881> in <module>()
1 for m in list_models():
----> 2 model = Models().SKModel().load_model('bbe4830f5eb24a73a299b720eef58ccd', s)
3 print(model)
4
5
~/Files/Projects/5X/PRS Form Recommender/api/src/henosis/Henosis/model.py in load_model(self, model_id, server_config)
344 print(model_info)
345
--> 346 self.id = model_info['models']['id']
347 self.deployed = model_info['models']['deployed']
348 self.call_count = model_info['models']['callCount']
KeyError: 'models'
The aim here is to improve the response to the error, e.g. something along the lines of 'Specified model ID not present in Henosis instance.', and to not attempt to load the model if an error occurs in the API (currently, an error in the API response does not stop Henosis from attempting to return a model object).
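The guard could look something like this sketch (safe_load_model is an illustrative name, not the actual Henosis method; the API response is assumed to be available as a (status_code, payload_dict) pair):

```python
import logging

# Stop before building a model object when the API reports an error or the
# expected 'models' key is missing, instead of raising a KeyError downstream.
def safe_load_model(model_id, api_response):
    status, payload = api_response
    if status != 200 or "models" not in payload:
        logging.warning("Specified model ID not present in Henosis instance: %s", model_id)
        return None
    return payload["models"]

bad = (404, {"description": "Error retrieving model from Elasticsearch."})
print(safe_load_model("bbe4830f5eb24a73a299b720eef58ccd", bad))  # -> None, no KeyError
```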
Line 314 in 78acc32
Something along the lines of:
def scroll(self, url, auth, headers, q=None):
    if q is None:
        r = requests.get(url, headers=headers, verify=False, auth=auth)
    else:
        r = requests.get(url, json=q, headers=headers, verify=False, auth=auth)
    # check for success
    if r.status_code not in [200, 201]:
        logging.warning(r.status_code)
        logging.warning(r.text)
        time.sleep(5)
    data = r.json()
    parsed = self.parse_results(data)
    self.docs += [p for p in parsed if 'a' in p.keys() and p['a'] != 'Person']
    if '_scroll_id' in data.keys() and len(self.docs) != self.doc_count:
        # query strings like this only for version 1.4
        # you can otherwise scroll in newer versions of Elasticsearch by sending a JSON payload
        scroll_url = "/".join(url.split('/')[0:4]) + '/_search/scroll?scroll=1d&scroll_id=' + data['_scroll_id']
        # q = {
        #     "scroll": "1d",
        #     "scroll_id": data['_scroll_id']
        # }  # for newer versions of Elasticsearch
        self.doc_count = len(self.docs)
        logging.info(str(len(self.docs)) + ' documents retrieved from index.')
        self.scroll(scroll_url, auth, headers)
Add functionality to the Models() class that allows users to return a list of model IDs in the Henosis instance, with or without associated information such as the performance, training date, etc.
This functionality is present in the API (GET models), but would be nice to have as part of the Models() class, as it would provide additional options and improve users' modeling workflow.
Need to upsample only the training data. Failure to do so allows information from the test and validation sets to "bleed" into the training set.
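The correct ordering is: split first, then upsample the training portion only. A minimal pure-Python sketch (function and variable names are illustrative; Henosis itself would do this inside its training routine):

```python
import random

def split_then_upsample(samples, test_fraction=0.25, seed=0):
    # samples is a list of (features, label) pairs
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    train, test = shuffled[:cut], shuffled[cut:]
    # upsample minority classes in the training split ONLY - the test split
    # is left untouched so no test information bleeds into training
    by_label = {}
    for x, y in train:
        by_label.setdefault(y, []).append((x, y))
    largest = max(len(v) for v in by_label.values())
    upsampled = []
    for group in by_label.values():
        upsampled += group + rng.choices(group, k=largest - len(group))
    return upsampled, test

data = [(i, "minority" if i % 4 == 0 else "majority") for i in range(20)]
train, test = split_then_upsample(data)
print(len(test))  # -> 5: the held-out set is untouched by upsampling
```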
Line 300 in 78acc32
"must" should be replaced with the variable operator
as the above always results in a boolean must condition even if the operator is set to another value such as should
.
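The fix amounts to interpolating the operator variable into the bool clause rather than hard-coding "must"; a sketch, with the query structure assumed from the Elasticsearch bool query DSL and build_bool_query as an illustrative name:

```python
def build_bool_query(conditions, operator="must"):
    # the bool clause key comes from the operator variable, not a hard-coded "must"
    if operator not in ("must", "should", "must_not", "filter"):
        raise ValueError("Unsupported bool operator: %s" % operator)
    return {"query": {"bool": {operator: conditions}}}

q = build_bool_query([{"match": {"dependent": "variableTwo"}}], operator="should")
print("should" in q["query"]["bool"])  # -> True: no longer forced to "must"
```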
Reduce the API payload in the response and decrease the API response time by only returning the necessary IDs in the response.
See https://www.elastic.co/guide/en/elasticsearch/reference/6.2/search-request-source-filtering.html.
Right now, users pass the configuration file path into the Server() object. Instead, change the approach such that it mirrors the modeling process, with users creating a connection object and then passing that object into the Server() object for instantiation.
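A sketch of the proposed shape (class internals are hypothetical; only the call pattern matters here):

```python
# Mirror the modeling workflow: create a connection object first, then hand it
# to Server() for instantiation, instead of passing a config file path directly.
class Connection:
    def __init__(self, config_path):
        self.config_path = config_path
        self.settings = {}  # would be populated from config.yaml, e.g. via yaml.safe_load

class Server:
    def __init__(self, connection):
        self.connection = connection

# instead of Server('config.yaml'):
c = Connection('config.yaml')
s = Server(c)
print(s.connection.config_path)  # -> config.yaml
```

This keeps configuration concerns in one object and lets the same connection be reused or swapped out for testing.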