
Flavor chemical database and classifier

This project has 4 main parts:

  1. Scrape two flavor industry websites to create a database of flavor chemicals and their flavor descriptors
  2. Find the underlying flavor profiles in the database to create labels for a machine learning classifier
  3. Calculate chemical properties that could be used as features in a machine learning classifier
  4. Train a classifier to identify chemical class

Making the flavor chemical database

1_fema_extraction

In this notebook I extract information from The Flavor and Extract Manufacturers Association (FEMA) website.

Each chemical has its own page (for example, acetic acid) from which I extracted the following (see the scraping sketch after this list):

  • Flavor descriptors
  • FEMA and Chemical Abstracts Service (CAS) registry numbers
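
A minimal sketch of this kind of page scrape, assuming the requests and BeautifulSoup libraries; the URL pattern and CSS selectors are illustrative placeholders, not the FEMA site's actual layout:

```python
import requests
from bs4 import BeautifulSoup

def scrape_fema_page(url):
    """Pull flavor descriptors and registry numbers from one FEMA chemical page."""
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    record = {"url": url, "descriptors": None, "fema_number": None, "cas_number": None}

    # Hypothetical page layout: each property sits in a labeled field.
    for field in soup.select("div.field"):
        label = field.select_one(".label")
        value = field.select_one(".value")
        if label is None or value is None:
            continue
        key = label.get_text(strip=True).lower()
        text = value.get_text(strip=True)
        if "flavor" in key:
            record["descriptors"] = [d.strip().lower() for d in text.split(",")]
        elif "fema" in key:
            record["fema_number"] = text
        elif "cas" in key:
            record["cas_number"] = text
    return record

# Illustrative usage; the real page URL may differ.
# scrape_fema_page("https://www.femaflavor.org/flavor-library/acetic-acid")
```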

2_jecfa_extraction

In this notebook I extract information from the Joint FAO/WHO Expert Committee on Food Additives (JECFA) website.

Each chemical has its own page (for example, acetic acid) from which I extracted the following (see the sketch after this list):

  • Odor/flavor
  • Synonyms
  • Molecular weight
  • JECFA and FEMA numbers
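
The page parsing is analogous to the FEMA step; a hedged sketch of collecting the extracted JECFA fields into a table (the field names, values, and output file are illustrative assumptions):

```python
import pandas as pd

# Records as they might come out of a JECFA page parser; values are illustrative.
jecfa_records = [
    {
        "name": "acetic acid",
        "synonyms": "ethanoic acid",
        "molecular_weight": 60.05,
        "odor_flavor": "pungent, sour, vinegar",
        "jecfa_number": 81,
        "fema_number": 2006,
    },
]

jecfa_df = pd.DataFrame(jecfa_records)
jecfa_df.to_csv("jecfa_chemicals.csv", index=False)  # hypothetical output file
```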

3_fema_jecfa_merge

In this notebook I merge the information extracted from the FEMA and JECFA websites. I make sure that each entry is for the same chemical and that all chemicals included have usable flavor/aroma descriptors.
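
A minimal sketch of the merge, assuming both scraped tables carry a FEMA number column; the file and column names are illustrative:

```python
import pandas as pd

fema_df = pd.read_csv("fema_chemicals.csv")    # hypothetical output of notebook 1
jecfa_df = pd.read_csv("jecfa_chemicals.csv")  # hypothetical output of notebook 2

# Join on the shared FEMA number, keeping only chemicals present in both sources.
merged = fema_df.merge(jecfa_df, on="fema_number", how="inner",
                       suffixes=("_fema", "_jecfa"))

# Keep only entries that actually have usable flavor/aroma descriptors.
merged = merged.dropna(subset=["descriptors"])
merged = merged[merged["descriptors"].str.strip() != ""]
merged.to_csv("merged_chemicals.csv", index=False)
```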

4_rdkit_chemical_matching

In this notebook I pair the chemicals found above with their RDKit representations.

RDKit is a cheminformatics toolkit. It allows for the calculation of chemical descriptors, which can then be used as features for machine learning tasks.
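
A minimal sketch of this pairing, assuming a SMILES string has already been obtained for each chemical (for example via a CAS-to-structure lookup); the file and column names are illustrative:

```python
import pandas as pd
from rdkit import Chem

chems = pd.read_csv("merged_chemicals.csv")  # hypothetical file with a 'smiles' column

def to_mol(smiles):
    """Return an RDKit Mol for a SMILES string, or None if parsing fails."""
    if not isinstance(smiles, str) or not smiles.strip():
        return None
    return Chem.MolFromSmiles(smiles)

chems["mol"] = chems["smiles"].apply(to_mol)
chems = chems[chems["mol"].notna()]  # keep only chemicals RDKit could parse
```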

By this point I have 2170 chemicals that can be used to train a machine learning classifier.

Unsupervised clustering based on flavor descriptors

5_descriptor_clustering

In this notebook I use K-Means clustering to group the flavor chemicals based on their flavor and aroma descriptors.
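
A minimal sketch of this clustering step, assuming each chemical's descriptors are stored as a comma-separated string; the notebook's actual vectorization and K-Means settings may differ:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

chems = pd.read_csv("merged_chemicals.csv")  # hypothetical file with a 'descriptors' column

# Bag-of-descriptors: each chemical becomes a binary vector over descriptor terms.
tokenize = lambda s: [t.strip() for t in s.split(",") if t.strip()]
vectorizer = CountVectorizer(tokenizer=tokenize, binary=True)
X = vectorizer.fit_transform(chems["descriptors"].fillna(""))

# Two clusters, matching the two groups described below.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
chems["cluster"] = kmeans.fit_predict(X)
print(chems["cluster"].value_counts())
```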

I found two main groups:

  • One large fruity, floral group with 1880 chemicals
  • A smaller savory, roast group with 290 chemicals

They can be visualized with word clouds of all the descriptors in each group:

I can now use these labels to train a supervised machine learning classifier.

Calculating chemical properties to use as machine learning features

6_property_calculations

In this notebook I use the RDKit to calculate several quantitative chemical properties. I also generate three different "chemical fingerprints" for each molecule, based on chemical fragments, circular topology, or path-based topology. In all, 4,422 features are generated for each chemical.
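
A hedged sketch of one way to build such a feature vector with the RDKit; the descriptor list and fingerprint sizes below are illustrative and do not reproduce the notebook's exact 4,422 features:

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors, MACCSkeys

def featurize(mol):
    """Quantitative descriptors plus three fingerprint types for one molecule."""
    props = [
        Descriptors.MolWt(mol),
        Descriptors.MolLogP(mol),
        Descriptors.TPSA(mol),
        Descriptors.NumHDonors(mol),
        Descriptors.NumHAcceptors(mol),
    ]
    maccs = list(MACCSkeys.GenMACCSKeys(mol))                                 # fragment keys
    morgan = list(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048))  # circular topology
    path = list(Chem.RDKFingerprint(mol, fpSize=2048))                        # path-based topology
    return np.array(props + maccs + morgan + path, dtype=float)

features = featurize(Chem.MolFromSmiles("CC(=O)O"))  # acetic acid as an example
```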

Training and testing a classifier to identify chemical class

7_algorithm_comparison

In this notebook I compare unoptimized Naive Bayes, Support Vector Machines, Adaboost, Logistic Regression, and Multi-layer Perceptron classifiers to see if any stand out with this dataset.
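
A hedged sketch of this kind of comparison loop, using synthetic stand-in data so it runs on its own; the notebook's actual feature matrix, labels, and cross-validation settings may differ:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Stand-in binary features and labels; the real X and y come from the earlier notebooks.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 50)).astype(float)
y = rng.integers(0, 2, size=200)

classifiers = {
    "Bernoulli Naive Bayes": BernoulliNB(),
    "Support Vector Machine": SVC(),
    "AdaBoost": AdaBoostClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Multi-layer Perceptron": MLPClassifier(max_iter=500),
}
scoring = ["precision", "recall", "matthews_corrcoef", "roc_auc"]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

for name, clf in classifiers.items():
    scores = cross_validate(clf, X, y, cv=cv, scoring=scoring)
    summary = ", ".join(f"{m}={scores['test_' + m].mean():.2f}" for m in scoring)
    print(f"{name}: {summary}")
```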

A comparison of the average scores and 95% confidence intervals for precision, recall, Matthews correlation, and area under the Receiver Operating Characteristic curve (roc_auc) for the unoptimized classifiers:

Based on these results I decided to proceed with parameter optimization for:

  • Adaboost
  • Logistic Regression
  • Multi-layer perceptron

Support Vector Machines performed poorly, and Naive Bayes doesn't have many parameters to optimize, although it's worth noting that Bernoulli Naive Bayes performed as well as (and, in terms of recall, better than) the other top classifiers.

8_parameter_optimization

In this notebook I exhaustively searched the hyper-parameter space for the AdaBoost, Logistic Regression, and Multi-layer Perceptron classifiers.
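
A minimal sketch of the grid search, shown here for Logistic Regression only and with a small illustrative grid; the notebook searches a larger space and also covers AdaBoost and the Multi-layer Perceptron:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.linear_model import LogisticRegression

# Stand-in data; the real X and y come from the earlier notebooks.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 50)).astype(float)
y = rng.integers(0, 2, size=200)

param_grid = {"C": [0.001, 0.01, 0.1, 1, 10, 100]}  # illustrative grid
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```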

A comparison of the average scores and 95% confidence interval of the three optimized estimators:

These results indicate that the optimized Logistic Regression classifier performed best with this dataset.

9_estimator_analysis

In this notebook I look at the best classifier and how it performs on the dataset.

The best classifier was a Logistic Regression model (refit and scored in the sketch after this list) with:

  • A regularization parameter C of 0.1
  • An roc_auc score of 0.76 on the held-out test data, which indicates that the estimator has a 0.76 probability of ranking a random savory chemical above a random non-savory chemical.
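
A minimal sketch of refitting and scoring such a model on held-out data, with synthetic stand-in features; the split, solver settings, and data are assumptions, not the notebook's exact setup:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Stand-in data; the real X and y come from the earlier notebooks.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 50)).astype(float)
y = rng.integers(0, 2, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = LogisticRegression(C=0.1, max_iter=1000).fit(X_train, y_train)
print("held-out roc_auc:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```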

The validation curve for this parameter shows that C = 0.1 sits at the sweet spot, with the highest score and the lowest variability:

  • Anything below 0.1 produces an underfit (high bias) model, with low training and test scores.

  • Anything above 0.1 produces an overfit (high variance) model, with high training scores that don't generalize to the test data.

The learning curve shows a persistent gap between training and test scores regardless of the number of training examples, which suggests the current model is still somewhat overfit (high variance):

This suggests that the best way to further improve this estimator would be to add more training examples.
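
A hedged sketch of how the two diagnostic curves above can be computed with scikit-learn (plotting omitted, synthetic stand-in data):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, learning_curve, validation_curve
from sklearn.linear_model import LogisticRegression

# Stand-in data; the real X and y come from the earlier notebooks.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 50)).astype(float)
y = rng.integers(0, 2, size=400)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Validation curve: sweep C around 0.1 to see where bias and variance trade off.
train_scores, test_scores = validation_curve(
    LogisticRegression(max_iter=1000), X, y,
    param_name="C", param_range=[0.001, 0.01, 0.1, 1, 10],
    scoring="roc_auc", cv=cv,
)

# Learning curve: fix C=0.1 and grow the training set size.
sizes, lc_train, lc_test = learning_curve(
    LogisticRegression(C=0.1, max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), scoring="roc_auc", cv=cv,
)
```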
