henosis's People

Contributors

vc1492a

henosis's Issues

Remove imbalanced-learn dependency.

Remove the library's dependency on imbalanced-learn and direct users to balance datasets appropriately prior to fitting the model in Henosis. This has the effect of reducing the size of the library and the number of dependencies.
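As a rough illustration of what balancing a dataset before fitting might look like without imbalanced-learn, the sketch below randomly upsamples minority classes using only the standard library. The function name `upsample`, its parameters, and the sample data are hypothetical and not part of Henosis.

```python
import random
from collections import defaultdict


def upsample(rows, labels, seed=0):
    """Randomly duplicate minority-class rows until every class is equally frequent."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row, label in zip(rows, labels):
        by_label[label].append(row)

    # Every class is grown to the size of the largest class.
    target = max(len(group) for group in by_label.values())
    balanced_rows, balanced_labels = [], []
    for label, group in by_label.items():
        extra = rng.choices(group, k=target - len(group))
        balanced_rows.extend(group + extra)
        balanced_labels.extend([label] * target)
    return balanced_rows, balanced_labels


# Hypothetical data: three rows of class "x" and one row of class "y".
rows = ["t1", "t2", "t3", "t4"]
labels = ["x", "x", "x", "y"]
balanced_rows, balanced_labels = upsample(rows, labels)
```

Balancing of this kind would normally be applied to the training split only, so that test scores still reflect the original class distribution.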

Restrict YAML object types

Replace yaml.load with yaml.safe_load in all files and make any other necessary changes. safe_load restricts deserialization to simple Python objects such as integers and lists, which is all the user-provided config.yaml file contains. Because users supply config.yaml, safe_load guards against a malicious file constructing arbitrary objects.
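A minimal sketch of the difference, assuming PyYAML is the library behind yaml.load and yaml.safe_load. The configuration keys are illustrative only (predict_probabilities appears elsewhere in these issues; `top_n` and `fields` are made up), and `load_config` is a hypothetical helper.

```python
import yaml  # PyYAML

# Illustrative config text; not the real config.yaml schema.
CONFIG_TEXT = """
predict_probabilities: true
top_n: 3
fields:
  - title
"""

# A document that asks the loader to construct an arbitrary Python object.
MALICIOUS_TEXT = "!!python/object/apply:os.system ['echo unsafe']"


def load_config(text):
    """Parse YAML into standard types such as dict, list, str, int and bool."""
    return yaml.safe_load(text)


config = load_config(CONFIG_TEXT)

try:
    load_config(MALICIOUS_TEXT)
    rejected = False
except yaml.YAMLError:
    # safe_load refuses Python-specific tags instead of acting on them.
    rejected = True
```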

Support Python 3.7 Data Classes

When appropriate, migrate to using Python 3.7's data classes. This will clean up code and improve maintainability, but does require using 3.7 or higher.
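To give a sense of the cleanup this could bring, here is a small sketch of a data class holding a few of the model attributes that appear in the API response. The class name `ModelRecord` and the choice of fields are assumptions for illustration, not the existing Henosis classes.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ModelRecord:
    """Hypothetical container for a handful of model attributes."""
    id: str
    dependent: str
    deployed: bool = False
    call_count: int = 0
    # Mutable defaults need a factory so instances do not share one list.
    independent: List[dict] = field(default_factory=list)
    test_accuracy: Optional[float] = None


record = ModelRecord(id="bb9e7a91c256412ca40f57e27e8e90b6", dependent="variableTwo")
record.call_count += 1

# asdict() gives a plain dict that could be serialized for storage.
as_document = asdict(record)
```

The generated `__init__`, `__repr__` and `__eq__` replace the hand-written attribute assignments that such classes usually carry.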

Note behavior of predict() and predict_proba() in documentation and introduce check

Currently, Henosis supports scikit-learn models either with or without predicted probabilities, but not both at once. Models that provide predicted probabilities cannot be used in tandem with models that do not. For example, a LinearSVC cannot be deployed to the same instance as a MultinomialNB if predict_probabilities is set to True in config.yaml, but may be deployed to the same instance if predict_probabilities is set to False in config.yaml.

Note this in the documentation and introduce a check that ignores models that do not return predicted probabilities if predict_probabilities is set to True in config.yaml.
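One possible shape for such a check is sketched below. It uses stand-in classes rather than real scikit-learn estimators so that it runs on its own; the function name `usable_models` is hypothetical, and the check simply looks for a callable predict_proba attribute.

```python
import logging


class WithProba:
    """Stand-in for an estimator such as MultinomialNB, which exposes predict_proba."""

    def predict(self, X):
        return [0 for _ in X]

    def predict_proba(self, X):
        return [[1.0, 0.0] for _ in X]


class WithoutProba:
    """Stand-in for an estimator such as LinearSVC, which has no predict_proba."""

    def predict(self, X):
        return [0 for _ in X]


def usable_models(models, predict_probabilities):
    """Return the models that can serve requests under the given setting."""
    if not predict_probabilities:
        return list(models)
    kept = []
    for model in models:
        if callable(getattr(model, "predict_proba", None)):
            kept.append(model)
        else:
            logging.warning("Ignoring %s: no predict_proba", type(model).__name__)
    return kept


models = [WithProba(), WithoutProba()]
with_probabilities = usable_models(models, predict_probabilities=True)
without_probabilities = usable_models(models, predict_probabilities=False)
```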

Add language referencing classification in the readme and documentation

Henosis currently supports classification models only, and classification-type recommendation (prediction) problems. As such, reference this fact more neatly and cleanly in the documentation while acknowledging future plans to introduce support for regression-type models and other approaches.

Integrate top-n train and test results into framework

Currently, the train and test results stored in Henosis reflect the top-1 scores, despite the fact that the Henosis instance may be providing a top-n recommendation (e.g. n=3 or n=5). Add the ability in the framework to test against the top-n if specified when calling m.train(d) or m.test(d), where m is a Models() object.
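For reference, a top-n accuracy score can be computed along the following lines. This is a standalone sketch: the function name, its parameters, and the sample data are assumptions, and how n would actually be passed to m.train(d) or m.test(d) is left open.

```python
def top_n_accuracy(y_true, probabilities, classes, n=1):
    """Fraction of samples whose true label is among the n most probable classes."""
    if not y_true:
        raise ValueError("y_true must not be empty")
    hits = 0
    for label, row in zip(y_true, probabilities):
        # Rank classes by predicted probability, highest first.
        ranked = sorted(zip(row, classes), key=lambda pair: pair[0], reverse=True)
        top = [cls for _, cls in ranked[:n]]
        hits += label in top
    return hits / len(y_true)


classes = ["a", "b", "c"]
probabilities = [
    [0.6, 0.3, 0.1],  # true label "a": counted at n=1
    [0.5, 0.4, 0.1],  # true label "b": counted only at n>=2
    [0.7, 0.2, 0.1],  # true label "c": counted only at n=3
]
y_true = ["a", "b", "c"]
```

With n=1 this reduces to ordinary accuracy, which is what the stored results reflect today.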

Directed acyclic graph with hashes/salts for model version control

As implemented, Henosis creates a unique model ID by generating a hex string. This approach does not allow for version control of models: a retrained model with a specification identical to a previous model is assigned a completely new unique ID, with no information stored relating it to the previously stored and fit model.

Implementing a salted, directed acyclic graph using hashes of the documents (models) themselves would provide a means of version controlling models and associating models trained with the same data with one another. Derivatives of a model (such as the same model being retrained) that are downstream of an upstream change receive new hash values, and these changes can be tracked in the graph. This functionality would be useful when examining the state of the system: developers operating the system would be able to keep track of various model specifications easily, allowing for A/B testing and other modes of operation.

Needs to be explored further, but could be implemented by using NetworkX on top of the Elasticsearch functionality.
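As a starting point for that exploration, the idea can be sketched with the standard library alone: each node's hash covers the model specification, a salt, and the parent's hash, and a plain dict records the edges. Everything here is an assumption for illustration, including the use of a training timestamp as the salt, the second timestamp value, and the function names. A NetworkX DiGraph could take the place of the dict.

```python
import hashlib
import json


def version_hash(spec, parent_hash=None, salt=""):
    """Hash a model specification together with its parent's hash."""
    payload = json.dumps(
        {"spec": spec, "parent": parent_hash, "salt": salt},
        sort_keys=True,  # key order must not change the hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Edges of the version graph: child hash -> parent hash (None for a root).
graph = {}

spec = {"modelType": "MultinomialNB", "dependent": "variableTwo"}
root = version_hash(spec, salt="2018-01-18T08:29:08")
graph[root] = None

# Retraining the same specification yields a new node linked to the original.
retrained = version_hash(spec, parent_hash=root, salt="2018-02-01T00:00:00")
graph[retrained] = root


def lineage(node):
    """Walk from a node back to its root."""
    chain = []
    while node is not None:
        chain.append(node)
        node = graph[node]
    return chain
```

Because the parent's hash feeds into the child's, a change anywhere upstream changes every hash downstream of it.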

Add modelClass identifier to Elasticsearch index and API response

As the software adds functionality for models outside of scikit-learn, it will become necessary to distinguish model implementations in the database as well as in the response from the API. For example, the current API response is as follows:

{
    "_id": "bb9e7a91c256412ca40f57e27e8e90b6",
    "_index": "models",
    "_score": 1.0,
    "_source": {
        "callCount": 30,
        "dependent": "variableTwo",
        "deployed": true,
        "encoderPath": "encoder_variableTwo_1.pickle",
        "encoderType": "CountVectorizer",
        "id": "bb9e7a91c256412ca40f57e27e8e90b6",
        "independent": [
            {
                "generator_path": "clean_text.pickle",
                "inputs": [
                    "title"
                ],
                "name": "cleanText"
            }
        ],
        "lastTestedDate": "2018-01-18T08:29:09",
        "lastTrainedDate": "2018-01-18T08:29:08",
        "modelPath": "model_variableTwo_1.pickle",
        "modelType": "OneVsRestClassifier(estimator=MultinomialNB(alpha=0.001, class_prior=None, fit_prior=True),\n          n_jobs=1)",
        "recommendationThreshold": 0.2,
        "testAccuracy": 0.9019607843137255,
        "testF1": 0.9008452056839288,
        "testPrecision": 0.9044056750340173,
        "testRecall": 0.9019607843137255,
        "testTime": 0.005343914031982422,
        "trainAccuracy": 0.9849376731301939,
        "trainDataBalance": "upsample",
        "trainF1": 0.9849744199555293,
        "trainPrecision": 0.9854426711336097,
        "trainRecall": 0.9849376731301939,
        "trainTime": 0.04866385459899902
    },
    "_type": "model"
}

The updated response would include a modelClass field that indicates whether the model is one implemented in scikit-learn or another library, e.g. something along the lines of the following (see the modelClass field):

{
    "_id": "bb9e7a91c256412ca40f57e27e8e90b6",
    "_index": "models",
    "_score": 1.0,
    "_source": {
        "callCount": 30,
        "dependent": "variableTwo",
        "deployed": true,
        "encoderPath": "encoder_variableTwo_1.pickle",
        "encoderType": "CountVectorizer",
        "id": "bb9e7a91c256412ca40f57e27e8e90b6",
        "independent": [
            {
                "generator_path": "clean_text.pickle",
                "inputs": [
                    "title"
                ],
                "name": "cleanText"
            }
        ],
        "lastTestedDate": "2018-01-18T08:29:09",
        "lastTrainedDate": "2018-01-18T08:29:08",
        "modelPath": "model_variableTwo_1.pickle",
        "modelClass": "SKModel()",
        "modelType": "OneVsRestClassifier(estimator=MultinomialNB(alpha=0.001, class_prior=None, fit_prior=True),\n          n_jobs=1)",
        "recommendationThreshold": 0.2,
        "testAccuracy": 0.9019607843137255,
        "testF1": 0.9008452056839288,
        "testPrecision": 0.9044056750340173,
        "testRecall": 0.9019607843137255,
        "testTime": 0.005343914031982422,
        "trainAccuracy": 0.9849376731301939,
        "trainDataBalance": "upsample",
        "trainF1": 0.9849744199555293,
        "trainPrecision": 0.9854426711336097,
        "trainRecall": 0.9849376731301939,
        "trainTime": 0.04866385459899902
    },
    "_type": "model"
}

Verbose option for API endpoints

Add the ability to provide additional information in the request response, such as predicted probabilities or other useful information, through a verbose parameter when submitting the request.

Add functionality to list all generators in instance

Add functionality to the Models() class that allows users to return a list of generator IDs in the Henosis instance, with or without associated information such as the models used by the generator, its inputs and outputs, etc.

This functionality is not yet present in the API (GET generators), so a new endpoint should be created before including the API's results as part of the class call. It would be nice to have as part of the Models() class so as to offer additional options and improve users' modeling workflow.

Add tests

Add a series of tests that ensure proper functionality as part of the package.

Add appropriate response to Models().SKModel().load_model('<model_id>', connection_object) when non-existent model is specified.

Currently, calling Models().SKModel().load_model() results in a KeyError if the ID of a non-existent model is specified, see below:

WARNING:root:<Response [404]>
{'description': 'Error retrieving model from Elasticsearch.'}
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-10-8aa4cd950881> in <module>()
      1 for m in list_models():
----> 2     model = Models().SKModel().load_model('bbe4830f5eb24a73a299b720eef58ccd', s)
      3     print(model)
      4 
      5 

~/Files/Projects/5X/PRS Form Recommender/api/src/henosis/Henosis/model.py in load_model(self, model_id, server_config)
    344             print(model_info)
    345 
--> 346             self.id = model_info['models']['id']
    347             self.deployed = model_info['models']['deployed']
    348             self.call_count = model_info['models']['callCount']

KeyError: 'models'

The aim here is to improve the response to the error, e.g. with a message along the lines of 'Specified model ID not present in Henosis instance.', and to not attempt to load the model if an error occurs in the API (currently, an error in the API response does not stop Henosis from attempting to return a model object).
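A minimal sketch of such a guard, assuming the API response is available as a status code plus a parsed JSON payload with a 'models' key on success (as the traceback suggests). The exception class and function name are hypothetical.

```python
class ModelNotFoundError(LookupError):
    """Hypothetical exception for a model ID that is absent from the instance."""


def parse_model_response(status_code, payload, model_id):
    """Return the model document, or raise before any attribute is read."""
    if status_code == 404:
        raise ModelNotFoundError(
            "Specified model ID not present in Henosis instance: " + model_id
        )
    if status_code != 200 or "models" not in payload:
        # Any other API error also stops the load instead of continuing.
        raise RuntimeError(payload.get("description", "Unexpected API response."))
    return payload["models"]


# The failing case from the traceback above.
try:
    parse_model_response(
        404,
        {"description": "Error retrieving model from Elasticsearch."},
        "bbe4830f5eb24a73a299b720eef58ccd",
    )
    message = None
except ModelNotFoundError as error:
    message = str(error)
```

load_model() would then populate self.id, self.deployed and the other attributes only from the returned document.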

Use Elasticsearch's scroll functionality in place of max search size

self.search_size = 10000

Something along the lines of:

import logging
import time

import requests


def scroll(self, url, auth, headers, q=None):
    # Intended as a method of a class that provides self.docs, self.doc_count
    # and self.parse_results.
    if q is None:
        r = requests.get(url, headers=headers, verify=False, auth=auth)
    else:
        r = requests.get(url, json=q, headers=headers, verify=False, auth=auth)

    # check for success; on failure, log the response and pause
    if r.status_code not in (200, 201):
        logging.warning(r.status_code)
        logging.warning(r.text)
        time.sleep(5)

    data = r.json()
    parsed = self.parse_results(data)
    self.docs += [p for p in parsed if 'a' in p and p['a'] != 'Person']
    if '_scroll_id' in data and len(self.docs) != self.doc_count:
        # query strings like this only for version 1.4
        # you can otherwise scroll in newer versions of Elasticsearch by sending a JSON payload
        scroll_url = "/".join(url.split('/')[0:4]) + '/_search/scroll?scroll=1d&scroll_id=' + data['_scroll_id']
        # q = {
        #     "scroll": "1d",
        #     "scroll_id": data['_scroll_id']
        # } # for newer versions of Elasticsearch
        self.doc_count = len(self.docs)
        logging.info(str(len(self.docs)) + ' documents retrieved from index.')
        self.scroll(scroll_url, auth, headers)
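The same idea can also be written as a loop rather than a recursive call, which avoids deep recursion on large indices. The sketch below is one possible shape, not the Henosis implementation: `scroll_all` and `fetch_next` are hypothetical names, and the HTTP request is injected as a callable so the paging logic can be exercised without a running cluster. It relies on the scroll response carrying a `_scroll_id` and a `hits.hits` list, with an empty list marking the end.

```python
def scroll_all(first_page, fetch_next):
    """Collect hits from every page of a scrolled search.

    first_page: parsed JSON of the initial search made with a scroll parameter.
    fetch_next: callable taking a scroll ID and returning the next page's JSON,
                e.g. by sending {"scroll": "1d", "scroll_id": ...} to /_search/scroll.
    """
    docs = []
    page = first_page
    while True:
        hits = page.get("hits", {}).get("hits", [])
        if not hits:
            break  # an empty page marks the end of the scroll
        docs.extend(hits)
        scroll_id = page.get("_scroll_id")
        if scroll_id is None:
            break
        page = fetch_next(scroll_id)
    return docs


# Fake pages standing in for Elasticsearch responses.
later_pages = {
    "s1": {"_scroll_id": "s2", "hits": {"hits": [{"_id": "3"}]}},
    "s2": {"_scroll_id": "s3", "hits": {"hits": []}},
}
first = {"_scroll_id": "s1", "hits": {"hits": [{"_id": "1"}, {"_id": "2"}]}}
all_docs = scroll_all(first, later_pages.__getitem__)
```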

Add functionality to list all models in instance

Add functionality to the Models() class that allows users to return a list of model IDs in the Henosis instance, with or without associated information such as the performance, training date, etc.

This functionality is present in the API (GET models), but would be nice to have as part of the Models() class so as to offer additional options and improve users' modeling workflow.

Alter flow when using Server() to match Connect()

Right now, users pass the configuration file path into the Server() object. Instead, change the approach such that it mirrors the modeling process, with users creating a connection object and then passing that object into the Server() object for instantiation.
