
Flavor chemical database and classifier

This project has 4 main parts:

  1. Scrape two flavor industry websites to create a database of flavor chemicals and their flavor descriptors
  2. Find the underlying flavor profiles in the database to create labels for a machine learning classifier
  3. Calculate chemical properties that could be used as features in a machine learning classifier
  4. Train a classifier to identify chemical class

Making the flavor chemical database

1_fema_extraction

In this notebook I extract information from The Flavor and Extract Manufacturers Association (FEMA) website.

Each chemical has its own page (for example, acetic acid) from which I extracted the following (see the scraping sketch after this list):

  • Flavor descriptors
  • FEMA and Chemical Abstracts Service (CAS) registry numbers
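
A minimal sketch of this kind of page scrape, assuming the requests and BeautifulSoup libraries; the URL pattern and CSS selectors are illustrative placeholders, not the FEMA site's actual layout:

```python
import requests
from bs4 import BeautifulSoup

def scrape_fema_page(url):
    """Pull flavor descriptors and registry numbers from one FEMA chemical page."""
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    record = {"url": url, "descriptors": None, "fema_number": None, "cas_number": None}

    # Hypothetical page layout: each property sits in a labeled field.
    for field in soup.select("div.field"):
        label = field.select_one(".label")
        value = field.select_one(".value")
        if label is None or value is None:
            continue
        key = label.get_text(strip=True).lower()
        text = value.get_text(strip=True)
        if "flavor" in key:
            record["descriptors"] = [d.strip().lower() for d in text.split(",")]
        elif "fema" in key:
            record["fema_number"] = text
        elif "cas" in key:
            record["cas_number"] = text
    return record

# Illustrative usage; the real page URL may differ.
# scrape_fema_page("https://www.femaflavor.org/flavor-library/acetic-acid")
```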

2_jecfa_extraction

In this notebook I extract information from the Joint FAO/WHO Expert Committee on Food Additives (JECFA) website.

Each chemical has its own page (for example, acetic acid) from which I extracted the following (see the sketch after this list):

  • Odor/flavor
  • Synonyms
  • Molecular weight
  • JECFA and FEMA numbers
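
The page parsing is analogous to the FEMA step; a hedged sketch of collecting the extracted JECFA fields into a table (the field names, values, and output file are illustrative assumptions):

```python
import pandas as pd

# Records as they might come out of a JECFA page parser; values are illustrative.
jecfa_records = [
    {
        "name": "acetic acid",
        "synonyms": "ethanoic acid",
        "molecular_weight": 60.05,
        "odor_flavor": "pungent, sour, vinegar",
        "jecfa_number": 81,
        "fema_number": 2006,
    },
]

jecfa_df = pd.DataFrame(jecfa_records)
jecfa_df.to_csv("jecfa_chemicals.csv", index=False)  # hypothetical output file
```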

3_fema_jecfa_merge

In this notebook I merge the information extracted from the FEMA and JECFA websites. I make sure that each entry is for the same chemical and that all chemicals included have usable flavor/aroma descriptors.
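
A minimal sketch of the merge, assuming both scraped tables carry a FEMA number column; the file and column names are illustrative:

```python
import pandas as pd

fema_df = pd.read_csv("fema_chemicals.csv")    # hypothetical output of notebook 1
jecfa_df = pd.read_csv("jecfa_chemicals.csv")  # hypothetical output of notebook 2

# Join on the shared FEMA number, keeping only chemicals present in both sources.
merged = fema_df.merge(jecfa_df, on="fema_number", how="inner",
                       suffixes=("_fema", "_jecfa"))

# Keep only entries that actually have usable flavor/aroma descriptors.
merged = merged.dropna(subset=["descriptors"])
merged = merged[merged["descriptors"].str.strip() != ""]
merged.to_csv("merged_chemicals.csv", index=False)
```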

4_rdkit_chemical_matching

In this notebook I pair the chemicals found above with their RDKit representations.

RDKit is a cheminformatics toolkit. It allows for the calculation of chemical descriptors, which can then be used as features for machine learning tasks.
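
A minimal sketch of this pairing, assuming a SMILES string has already been obtained for each chemical (for example via a CAS-to-structure lookup); the file and column names are illustrative:

```python
import pandas as pd
from rdkit import Chem

chems = pd.read_csv("merged_chemicals.csv")  # hypothetical file with a 'smiles' column

def to_mol(smiles):
    """Return an RDKit Mol for a SMILES string, or None if parsing fails."""
    if not isinstance(smiles, str) or not smiles.strip():
        return None
    return Chem.MolFromSmiles(smiles)

chems["mol"] = chems["smiles"].apply(to_mol)
chems = chems[chems["mol"].notna()]  # keep only chemicals RDKit could parse
```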

By this point I have 2170 chemicals that can be used to train a machine learning classifier.

Unsupervised clustering based on flavor descriptors

5_descriptor_clustering

In this notebook I use K-Means clustering to group the flavor chemicals based on their flavor and aroma descriptors.
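
A minimal sketch of this clustering step, assuming each chemical's descriptors are stored as a comma-separated string; the notebook's actual vectorization and K-Means settings may differ:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

chems = pd.read_csv("merged_chemicals.csv")  # hypothetical file with a 'descriptors' column

# Bag-of-descriptors: each chemical becomes a binary vector over descriptor terms.
tokenize = lambda s: [t.strip() for t in s.split(",") if t.strip()]
vectorizer = CountVectorizer(tokenizer=tokenize, binary=True)
X = vectorizer.fit_transform(chems["descriptors"].fillna(""))

# Two clusters, matching the two groups described below.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
chems["cluster"] = kmeans.fit_predict(X)
print(chems["cluster"].value_counts())
```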

I found two main groups:

  • One large fruity, floral group with 1880 chemicals
  • A smaller savory, roast group with 290 chemicals

They can be visualized with word clouds of all the descriptors in each group:

I can now use these labels to train a supervised machine learning classifier.

Calculating chemical properties to use as machine learning features

6_property_calculations

In this notebook I use the RDKit to calculate several quantitative chemical properties. I also generate three different "chemical fingerprints" for each molecule, based on chemical fragments, circular topology, or path-based topology. In all, 4,422 features are generated for each chemical.
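
A hedged sketch of one way to build such a feature vector with the RDKit; the descriptor list and fingerprint sizes below are illustrative and do not reproduce the notebook's exact 4,422 features:

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors, MACCSkeys

def featurize(mol):
    """Quantitative descriptors plus three fingerprint types for one molecule."""
    props = [
        Descriptors.MolWt(mol),
        Descriptors.MolLogP(mol),
        Descriptors.TPSA(mol),
        Descriptors.NumHDonors(mol),
        Descriptors.NumHAcceptors(mol),
    ]
    maccs = list(MACCSkeys.GenMACCSKeys(mol))                                 # fragment keys
    morgan = list(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048))  # circular topology
    path = list(Chem.RDKFingerprint(mol, fpSize=2048))                        # path-based topology
    return np.array(props + maccs + morgan + path, dtype=float)

features = featurize(Chem.MolFromSmiles("CC(=O)O"))  # acetic acid as an example
```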

Training and testing a classifier to identify chemical class

7_algorithm_comparison

In this notebook I compare unoptimized Naive Bayes, Support Vector Machines, Adaboost, Logistic Regression, and Multi-layer Perceptron classifiers to see if any stand out with this dataset.
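
A hedged sketch of this kind of comparison loop, using synthetic stand-in data so it runs on its own; the notebook's actual feature matrix, labels, and cross-validation settings may differ:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Stand-in binary features and labels; the real X and y come from the earlier notebooks.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 50)).astype(float)
y = rng.integers(0, 2, size=200)

classifiers = {
    "Bernoulli Naive Bayes": BernoulliNB(),
    "Support Vector Machine": SVC(),
    "AdaBoost": AdaBoostClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Multi-layer Perceptron": MLPClassifier(max_iter=500),
}
scoring = ["precision", "recall", "matthews_corrcoef", "roc_auc"]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

for name, clf in classifiers.items():
    scores = cross_validate(clf, X, y, cv=cv, scoring=scoring)
    summary = ", ".join(f"{m}={scores['test_' + m].mean():.2f}" for m in scoring)
    print(f"{name}: {summary}")
```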

A comparison of the average scores and 95% confidence intervals for precision, recall, Matthews correlation, and area under the Receiver Operating Characteristic curve (roc_auc) for the unoptimized classifiers:

Based on these results I decided to proceed with parameter optimization for:

  • Adaboost
  • Logistic Regression
  • Multi-layer perceptron

Support Vector Machines performed poorly, and Naive Bayes doesn't have many parameters to optimize, although it's worth noting that Bernoulli Naive Bayes performed as well as (and, in terms of recall, better than) the other top classifiers.

8_parameter_optimization

In this notebook I exhaustively searched the hyper-parameter space for the AdaBoost, Logistic Regression, and Multi-layer Perceptron classifiers.
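
A minimal sketch of the grid search, shown here for Logistic Regression only and with a small illustrative grid; the notebook searches a larger space and also covers AdaBoost and the Multi-layer Perceptron:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.linear_model import LogisticRegression

# Stand-in data; the real X and y come from the earlier notebooks.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 50)).astype(float)
y = rng.integers(0, 2, size=200)

param_grid = {"C": [0.001, 0.01, 0.1, 1, 10, 100]}  # illustrative grid
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```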

A comparison of the average scores and 95% confidence interval of the three optimized estimators:

These results indicate that the optimized Logistic Regression classifier performed best with this dataset.

9_estimator_analysis

In this notebook I look at the best classifier and how it performs on the dataset.

The best classifier was a Logistic Regression model (refit and scored in the sketch after this list) with:

  • A regularization parameter C of 0.1
  • An roc_auc score of 0.76 on the held-out test data, which indicates that the estimator has a 0.76 probability of ranking a random savory chemical above a random non-savory chemical.
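
A minimal sketch of refitting and scoring such a model on held-out data, with synthetic stand-in features; the split, solver settings, and data are assumptions, not the notebook's exact setup:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Stand-in data; the real X and y come from the earlier notebooks.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 50)).astype(float)
y = rng.integers(0, 2, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = LogisticRegression(C=0.1, max_iter=1000).fit(X_train, y_train)
print("held-out roc_auc:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```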

The validation curve for this parameter shows that C = 0.1 sits at the sweet spot, with the highest score and the lowest variability:

  • Anything below 0.1 produces an underfit (high bias) model, with low training and test scores.

  • Anything above 0.1 produces an overfit (high variance) model, with high training scores that don't generalize to the test data.

The learning curve shows a persistent gap between training and test scores regardless of the number of training examples, which suggests the current model is still somewhat overfit (high variance):

This suggests that the best way to further improve this estimator would be to add more training examples.
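
A hedged sketch of how the two diagnostic curves above can be computed with scikit-learn (plotting omitted, synthetic stand-in data):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, learning_curve, validation_curve
from sklearn.linear_model import LogisticRegression

# Stand-in data; the real X and y come from the earlier notebooks.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 50)).astype(float)
y = rng.integers(0, 2, size=400)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Validation curve: sweep C around 0.1 to see where bias and variance trade off.
train_scores, test_scores = validation_curve(
    LogisticRegression(max_iter=1000), X, y,
    param_name="C", param_range=[0.001, 0.01, 0.1, 1, 10],
    scoring="roc_auc", cv=cv,
)

# Learning curve: fix C=0.1 and grow the training set size.
sizes, lc_train, lc_test = learning_curve(
    LogisticRegression(C=0.1, max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), scoring="roc_auc", cv=cv,
)
```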
