Why do people like some cuisines more than others? Is it because of the ingredient combinations a cuisine uses, which point to distinct flavor-profile patterns?
The inspiration for this project comes from my motivation to answer the following questions using data-driven and quantitative methods:
- How can international cuisines be compared to each other based on the ingredients they use? In other words, how similar or dissimilar are they?
- Given a person’s interest in a cuisine, how can a valid recommendation be made for other, similar cuisines?
Using a database of recipes obtained from online recipe repositories, I investigated the similarity of various cuisines in terms of ingredient combinations. Then, for any given cuisine, I use the findings from this analysis to recommend recipes from the most similar cuisines rather than from the selected cuisine itself.
- Data Collection and Storage
- EDA and Feature engineering
- Model development
- Visualization and web application
- Possible Next Steps
- Toolkit + Credits
Using Python, BeautifulSoup, Selenium, and AWS EC2, I scraped the recipe collections of the data sources to collect raw text, lists, and other information from the HTML. Each recipe record includes the following fields:
- Recipe Source (Non-Null) - Epicurious, BBC Food, Chowhound, BBC Good Food, Saveur
- Recipe Cuisine - 45 different cuisines
- Recipe Link (Non-Null) - url to recipes
- Recipe Name (Non-Null) - title of recipe
- Recipe Author/Chef - author/chef who put up the recipe
- Recipe Rating/Recommendations - ratings and/or recommendations for recipe
- Recipe Description - short description/summary of recipe
- Recipe Ingredients List (Non-Null) - list of ingredients and amounts used in recipe
- Recipe Preparation Steps (Non-Null) - preparation steps involved in recipe
- Recipe Prep Time - prep time required for recipe
- Recipe Cooking Time - cooking time required for recipe
- Total Nutrition - nutritional value of recipe
- Recipe Image Source - url to recipe image
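As a rough illustration of how a page can be turned into a record with the fields above, here is a minimal BeautifulSoup sketch. The HTML snippet, the tag and class names, the sample recipe, and the helpers `text_or_none`, `list_items`, and `parse_recipe` are all made up for this example; each real source has its own markup and needs its own parser.

```python
from bs4 import BeautifulSoup

# Hypothetical page markup; real sites each use their own tags and classes.
SAMPLE_HTML = """
<html><body>
  <h1 class="recipe-title">Chicken Tikka Masala</h1>
  <span class="cuisine">Indian</span>
  <ul class="ingredients">
    <li>2 cloves garlic, minced</li>
    <li>1 cup plain yogurt</li>
  </ul>
  <ol class="steps">
    <li>Marinate the chicken.</li>
    <li>Simmer in the sauce.</li>
  </ol>
</body></html>
"""


def text_or_none(node):
    """Return the stripped text of a tag, or None for a missing (nullable) field."""
    return node.get_text(strip=True) if node is not None else None


def list_items(soup, tag, css_class):
    """Return the text of every <li> inside the first matching container."""
    container = soup.find(tag, class_=css_class)
    if container is None:
        return []
    return [li.get_text(strip=True) for li in container.find_all("li")]


def parse_recipe(html, source, link):
    """Build one recipe record from raw HTML (field names are illustrative)."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "source": source,
        "link": link,
        "name": text_or_none(soup.find("h1", class_="recipe-title")),
        "cuisine": text_or_none(soup.find("span", class_="cuisine")),
        "author": text_or_none(soup.find("span", class_="author")),
        "ingredients": list_items(soup, "ul", "ingredients"),
        "steps": list_items(soup, "ol", "steps"),
    }


recipe = parse_recipe(SAMPLE_HTML, "Example Source", "https://example.com/recipes/1")
```

Fields that are not marked Non-Null, such as the author here, simply come back as `None` when the page does not provide them.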
Data from all sources was cleaned by replacing non-ASCII content with its corresponding ASCII values, then merged into one database of over 28K recipes. The scraped and cleaned data is stored in MongoDB, in a pickled pandas DataFrame, and in S3 for backup.
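One possible way to do this kind of non-ASCII cleanup with only the standard library is sketched below. The `REPLACEMENTS` table and the `to_ascii` helper are assumptions made for illustration, not the project's actual cleaning code.

```python
import unicodedata

# Hypothetical replacements for symbols that should map to specific ASCII text.
REPLACEMENTS = {
    "\u00bd": "1/2",   # vulgar fraction one half
    "\u00bc": "1/4",   # vulgar fraction one quarter
    "\u00be": "3/4",   # vulgar fraction three quarters
    "\u2019": "'",     # right single quotation mark
    "\u2013": "-",     # en dash
}


def to_ascii(text):
    """Replace known symbols, then strip accents from the remaining characters."""
    for symbol, replacement in REPLACEMENTS.items():
        text = text.replace(symbol, replacement)
    # NFKD splits accented letters into a base letter plus combining marks;
    # encoding with errors="ignore" then drops the marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


cleaned = to_ascii("\u00bd cup cr\u00e8me fra\u00eeche")
```

Characters with no entry in the table and no ASCII decomposition are silently dropped in this sketch, so the table would need to grow as new symbols turn up in the scraped text.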
To balance the number of recipes per cuisine and get a better signal, the 45 cuisines were grouped by geographical closeness into 19 cuisine groups.
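The grouping can be expressed as a simple lookup table. The mapping below is hypothetical: the group names are taken from the results table in this README, but the raw labels assigned to them are guesses for illustration, and the full 45-to-19 mapping is not reproduced here.

```python
from collections import Counter

# Hypothetical mapping from scraped cuisine labels to grouped labels.
CUISINE_GROUPS = {
    "Spanish": "Spanish/Portuguese",
    "Portuguese": "Spanish/Portuguese",
    "Cajun": "Cajun/Creole",
    "Creole": "Cajun/Creole",
    "Turkish": "Turkish and Middle Eastern",
    "Lebanese": "Turkish and Middle Eastern",
}


def group_cuisine(label):
    """Map a scraped label to its group; unmapped labels stay as they are."""
    return CUISINE_GROUPS.get(label, label)


raw_labels = ["Spanish", "Portuguese", "Cajun", "Indian", "Turkish", "Creole"]
grouped_counts = Counter(group_cuisine(label) for label in raw_labels)
```

Counting recipes per group before and after the mapping is a quick way to check whether the grouping actually evens out the distribution.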
The next step is quite possibly the most time-consuming, challenging, and rewarding part of the project. Using NLTK tokenization, lemmatization, stop-word removal, a bigram model, and a custom-built n-gram model, I extracted a list of unique ingredient names from each recipe's ingredient list. From these names, I generated a bag of unique ingredients across all recipes in the database. The ingredients were then converted into count vectors and TF-IDF vectors based on this bag of words. For modeling, any ingredient with fewer than two occurrences across all recipes was dropped.
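The sketch below shows the general shape of this step on three toy recipes. It is a simplification under stated assumptions: a regex and two small hand-written word lists (`UNITS`, `DESCRIPTORS`) stand in for the NLTK tokenizing, lemmatizing, and n-gram models, and scikit-learn's `min_df=2` (keep ingredients that appear in at least two recipes) approximates the frequency cutoff. All helper names are hypothetical.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical word lists; the real project used NLTK and custom n-gram models.
UNITS = {"cup", "cups", "tablespoon", "tablespoons", "teaspoon", "teaspoons",
         "clove", "cloves", "ounce", "ounces", "pound", "pounds"}
DESCRIPTORS = {"chopped", "fresh", "finely", "minced", "ground", "large", "small"}


def ingredient_name(line):
    """Reduce one ingredient line to a bare ingredient name."""
    head = line.lower().split(",")[0]      # drop trailing prep notes
    words = re.findall(r"[a-z]+", head)    # drop quantities and punctuation
    return " ".join(w for w in words if w not in UNITS and w not in DESCRIPTORS)


def recipe_tokens(ingredient_lines):
    """Return the non-empty ingredient names for one recipe."""
    names = (ingredient_name(line) for line in ingredient_lines)
    return [name for name in names if name]


def identity(tokens):
    """Analyzer that treats each ingredient name as a single token."""
    return tokens


recipes = [
    ["2 cloves garlic, minced", "3 tablespoons olive oil", "1 cup chopped fresh cilantro"],
    ["1/2 teaspoon ground cumin", "1 clove garlic", "2 tablespoons olive oil"],
    ["1 teaspoon ground cumin", "1 cup rice"],
]
docs = [recipe_tokens(lines) for lines in recipes]

# min_df=2 keeps only ingredients that appear in at least two recipes.
count_vectorizer = CountVectorizer(analyzer=identity, min_df=2)
counts = count_vectorizer.fit_transform(docs)

tfidf_vectorizer = TfidfVectorizer(analyzer=identity, min_df=2)
tfidf = tfidf_vectorizer.fit_transform(docs)

vocabulary = sorted(tfidf_vectorizer.vocabulary_)
```

Passing a callable as `analyzer` keeps multi-word names such as "olive oil" as single features instead of splitting them into separate words.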
The following classifiers were trained on the TF-IDF vectors to predict a recipe's cuisine: Logistic Regression, Random Forest Classifier, AdaBoost Classifier, and Multinomial Naive Bayes. Below is a comparison of these four models, after GridSearch, across various performance metrics.
I chose Logistic Regression because it performed best.
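A comparison like this can be set up with scikit-learn's `GridSearchCV`, as in the minimal sketch below. The toy recipes, the parameter grids, the two-fold cross-validation, and the accuracy metric are all assumptions chosen to keep the example small; they are not the settings used in the project.

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def split_ingredients(doc):
    """Treat each comma-separated ingredient name as one token."""
    return doc.split(", ")


# Toy data for illustration only.
docs = [
    "tomato, basil, olive oil, garlic",
    "olive oil, parmesan, basil",
    "tomato, parmesan, pasta",
    "pasta, olive oil, tomato",
    "tortilla, cumin, lime, garlic",
    "cilantro, lime, chili",
    "tortilla, chili, cumin",
    "cumin, cilantro, tortilla",
]
labels = ["Italian"] * 4 + ["Mexican"] * 4

# Each candidate pairs an estimator with a small, hypothetical parameter grid.
candidates = {
    "logistic_regression": (LogisticRegression(max_iter=1000), {"clf__C": [0.1, 1.0, 10.0]}),
    "random_forest": (RandomForestClassifier(random_state=0), {"clf__n_estimators": [10, 50]}),
    "ada_boost": (AdaBoostClassifier(random_state=0), {"clf__n_estimators": [10, 50]}),
    "multinomial_nb": (MultinomialNB(), {"clf__alpha": [0.1, 1.0]}),
}

results = {}
for name, (classifier, grid) in candidates.items():
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(analyzer=split_ingredients)),
        ("clf", classifier),
    ])
    search = GridSearchCV(pipeline, grid, cv=2, scoring="accuracy")
    search.fit(docs, labels)
    results[name] = (search.best_score_, search.best_params_)
```

Keeping the vectorizer inside the pipeline means the TF-IDF vocabulary is refit on each training fold, so the held-out fold does not leak into the features.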
In the 28.5K-recipe dataset, about 25% of the recipes were unlabeled, so I used this classification model to label them before analyzing cuisine similarities.
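The labeling step amounts to fitting on the labeled recipes and predicting the rest. A minimal sketch, assuming recipes are stored as dictionaries with a `cuisine` field that is `None` when unlabeled (the record layout and the `cuisine_is_predicted` flag are hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def split_ingredients(doc):
    """Treat each comma-separated ingredient name as one token."""
    return doc.split(", ")


# Toy records for illustration only.
recipes = [
    {"ingredients": "tomato, basil, olive oil", "cuisine": "Italian"},
    {"ingredients": "pasta, parmesan, tomato", "cuisine": "Italian"},
    {"ingredients": "olive oil, basil, parmesan", "cuisine": "Italian"},
    {"ingredients": "tortilla, cumin, lime", "cuisine": "Mexican"},
    {"ingredients": "cilantro, lime, chili", "cuisine": "Mexican"},
    {"ingredients": "tortilla, chili, cumin", "cuisine": "Mexican"},
    {"ingredients": "tomato, basil, pasta", "cuisine": None},
    {"ingredients": "tortilla, lime, cumin", "cuisine": None},
]

labeled = [r for r in recipes if r["cuisine"] is not None]
unlabeled = [r for r in recipes if r["cuisine"] is None]

model = make_pipeline(
    TfidfVectorizer(analyzer=split_ingredients),
    LogisticRegression(max_iter=1000),
)
model.fit([r["ingredients"] for r in labeled], [r["cuisine"] for r in labeled])

# Fill in the missing labels and remember which ones were predicted.
predictions = model.predict([r["ingredients"] for r in unlabeled])
for recipe, predicted in zip(unlabeled, predictions):
    recipe["cuisine"] = str(predicted)
    recipe["cuisine_is_predicted"] = True
```

Flagging predicted labels keeps them distinguishable from scraped ones, which is useful if the similarity results later need to be checked against labeled recipes only.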
For similarity analysis, I combined all the recipes for each cuisine into one vector and computed pairwise distances between all cuisines using the following metrics:
- scikit-learn: cityblock, cosine, euclidean, l1, l2, manhattan
- scipy.spatial.distance: braycurtis, canberra, chebyshev, correlation, jaccard, matching, yule
I settled on the 'braycurtis' metric because its results made the most intuitive sense for most cuisines. The following are some of the interesting findings of the similarity analysis:
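A minimal sketch of the aggregation and distance computation, using `braycurtis` from the SciPy list as the example metric. The toy vectors are made up, and summing recipe vectors per cuisine is an assumption: the text does not say whether the combination was a sum, a mean, or something else. The `most_similar` helper is hypothetical.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy recipe-by-ingredient matrix; columns: cumin, garlic, olive oil, tomato.
recipe_vectors = np.array([
    [1, 1, 0, 1],   # Mexican
    [1, 0, 0, 1],   # Mexican
    [0, 1, 1, 1],   # Italian
    [0, 1, 1, 0],   # Italian
    [1, 1, 0, 0],   # Indian
], dtype=float)
recipe_cuisines = ["Mexican", "Mexican", "Italian", "Italian", "Indian"]

# One vector per cuisine: here, the sum of its recipe vectors (an assumption).
cuisines = sorted(set(recipe_cuisines))
labels = np.array(recipe_cuisines)
cuisine_vectors = np.vstack([
    recipe_vectors[labels == cuisine].sum(axis=0) for cuisine in cuisines
])

# Square matrix of pairwise Bray-Curtis distances between cuisines.
distances = squareform(pdist(cuisine_vectors, metric="braycurtis"))


def most_similar(cuisine, k=5):
    """Return the k cuisines closest to the given one, nearest first."""
    i = cuisines.index(cuisine)
    order = np.argsort(distances[i])
    return [cuisines[j] for j in order if j != i][:k]


closest_to_mexican = most_similar("Mexican", k=2)
```

Swapping the `metric` argument for any other name in the SciPy list reruns the same comparison, which makes it straightforward to inspect how the rankings change from metric to metric.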
Cuisine | Interesting Similar Cuisines |
---|---|
Cajun/Creole | Mexican, Eastern European/Russian |
Indian | Central/South American/Caribbean, Mexican |
African | Spanish/Portuguese, Italian |
Mexican | Turkish and Middle Eastern, Cajun/Creole |
Central/South American/Caribbean | Cajun/Creole, Turkish and Middle Eastern |
Using the results from the similarity analysis, the recommendation model recommends recipes from similar cuisines for any selected cuisine and set of ingredients.
When a cuisine is selected, the search box (on the next page) displays only the ingredients that have been seen for that cuisine in the database. When a set of ingredients is selected, it is converted into a vector and compared to all the recipes in the group of top 5 similar cuisines for that cuisine. The top 20 results are displayed in ascending order of the distance between the search vector and each compared recipe.
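The ranking step can be sketched as follows. This is an illustration under assumptions: the `recommend` function and its arguments are hypothetical, the query is encoded as a binary vector, and Bray-Curtis is reused as the recipe-level distance even though the text does not name the metric used at this stage.

```python
import numpy as np
from scipy.spatial.distance import cdist


def recommend(selected, vocabulary, recipe_vectors, recipe_cuisines,
              similar_cuisines, top_n=20):
    """Rank recipes from the similar cuisines by distance to the selection.

    Returns (recipe_index, distance) pairs, nearest first.
    """
    # Encode the selected ingredients as a binary vector over the vocabulary.
    query = np.array([[1.0 if name in selected else 0.0 for name in vocabulary]])
    # Only recipes from the similar cuisines are candidates.
    candidates = [i for i, c in enumerate(recipe_cuisines) if c in similar_cuisines]
    if not candidates:
        return []
    dists = cdist(query, recipe_vectors[candidates], metric="braycurtis")[0]
    order = np.argsort(dists)[:top_n]
    return [(candidates[j], float(dists[j])) for j in order]


# Toy data for illustration only.
vocabulary = ["cumin", "garlic", "olive oil", "tomato"]
recipe_vectors = np.array([
    [1, 1, 0, 1],   # 0: Mexican
    [1, 0, 0, 1],   # 1: Mexican
    [0, 1, 1, 1],   # 2: Italian
    [0, 1, 1, 0],   # 3: Italian
    [1, 1, 0, 0],   # 4: Indian
], dtype=float)
recipe_cuisines = ["Mexican", "Mexican", "Italian", "Italian", "Indian"]

# A user who picked Mexican and searched for garlic and olive oil.
results = recommend({"garlic", "olive oil"}, vocabulary, recipe_vectors,
                    recipe_cuisines, similar_cuisines=["Indian", "Italian"])
```

Note that Bray-Curtis is undefined when both vectors are all zeros, so an empty selection would need to be handled before calling a function like this.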
- Equally distributed data - Since two of my data sources were from the BBC group, a disproportionate share of my data consists of English/Scottish recipes. For future work, I would like to collect more data to get a more evenly distributed dataset.
- Other recipe details (cooking time, cooking methods, nutritional value) - I would also like to use other recipe details to improve the performance of my models, as well as my recommendations.
- Historical colonization data and spice routes - For this project I validated my results based on what made intuitive sense. For future work, I would like to use historical colonization data as well as spice routes data to validate the results of my similarity analysis model.
data sources:
languages used:
- python
- bash
- html / javascript / css
python libraries used:
- pandas
- nltk
- requests
- beautifulsoup
- selenium
- pymongo - Chosen because my database operations involve more dumping recipe details in and pulling details out than creating complex queries.
- matplotlib
- seaborn
- sklearn
- scipy, numpy
- pickle
- flask
- jinja2
- math, string, re
- collections, itertools
other tools used: