These tasks were completed by Marisha Bhatti as part of the July 2021 batch of the internship program. The projects fall mainly under the domain of Data Science & Business Analytics.
- Task 1: Prediction using Supervised ML.
- Simple Linear Regression Task. Predict the percentage score of a student based on the number of study hours.
- Dataset can be seen and used from http://bit.ly/w-data.
- Mean Absolute Error: 4.183859899002982
- R2 Score: 0.9454906892105354
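
The two metrics reported for Task 1 can be reproduced on any fitted line. The snippet below is a minimal sketch, not the notebook's code: the `hours` and `scores` values are made-up illustration data (not the dataset linked above), and `fit_line`, `mean_abs_error` and `r_squared` are hypothetical helper names written in plain Python so the arithmetic is visible.

```python
# Hypothetical (study hours, percentage score) pairs, NOT the linked dataset.
hours = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
scores = [12.0, 21.0, 33.0, 39.0, 52.0, 60.0]


def fit_line(x, y):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_abs_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


slope, intercept = fit_line(hours, scores)
predicted = [slope * h + intercept for h in hours]

print("slope:", slope, "intercept:", intercept)
print("MAE:", mean_abs_error(scores, predicted))
print("R2:", r_squared(scores, predicted))
```

Because the sketch scores the line on the same points it was fitted to, its numbers are not comparable to the values listed above, which presumably come from a held-out test split.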
- Task 2: Prediction using Unsupervised ML.
- K-Means Clustering Task. Predict the optimum number of clusters and represent it visually.
- Dataset used is Iris.csv (can also be imported from sklearn.datasets).
- This notebook has two parts. The first part (Steps 1-6) uses a K-Nearest Neighbors model for a simple multi-class classification task. The second part (Steps 7-9) tackles the unsupervised machine learning problem using a K-Means Clustering model.
- The KNN model has an accuracy of 0.9736842105263158
- From the K-Means model we find that the optimum number of clusters is 3.
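
One common way to choose the number of clusters is the elbow method: fit K-Means for a range of `k` values and look for the point where the within-cluster sum of squares stops dropping sharply. The sketch below assumes this is the approach used, which the notes above do not state explicitly. It loads Iris from `sklearn.datasets`; the range of `k`, `n_init` and `random_state` are illustrative choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Four numeric features per flower; the species labels are not used here.
features = load_iris().data

k_values = list(range(1, 11))
inertias = []
for k in k_values:
    # n_init and random_state are assumptions made for reproducibility.
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(features)
    # inertia_ is the within-cluster sum of squared distances.
    inertias.append(model.inertia_)

for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 2))
```

Plotting `inertias` against `k_values` gives the usual elbow curve; the bend is read off by eye, so the method suggests a value rather than proving one.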
- Task 3: Exploratory Data Analysis - Retail
- Finding out the weak areas where more profit can be made.
- Dataset used is SampleSuperstore.csv (to view the dataset on GitHub, select "View raw").
- Category-wise:
  - Highest Profit: Furniture
  - Lowest Profit: Technology
  - Maximum Sales in Category: Technology
- Sub-Category-wise:
  - Highest Profit: Copiers
  - Lowest Profit: Tables
  - Top 3 High Discount Products: Binders, Machines, Tables
- State-wise:
  - Average Number of Deals per state is 203.9591836734694
  - Highest Profit: Vermont
  - Lowest Profit: Ohio
  - Highest amount of Sales: Wyoming
- City-wise:
  - Highest Profit: Jamestown
  - Lowest Profit: Bethlehem
- Task 4: Exploratory Data Analysis - Terrorism
- Finding out the hot zone of terrorism.
- The dataset can be downloaded from https://www.kaggle.com/START-UMD/gtd.
- Middle East & North Africa has the most terrorist attacks. South Asia has the second most.
- Iraq has the most terrorist attacks in the Middle East. Pakistan, Afghanistan and India are the Top 3 in South Asia.
- Iraq, Pakistan and Afghanistan are the Top 3 countries with most terrorist attacks.
- In Eastern Europe, the Middle East, South Asia, Southeast Asia and Sub-Saharan Africa there has been a huge increase in terrorist attacks, whereas other regions have seen a decrease since 2001.
- Task 5: Exploratory Data Analysis - Sports
- Finding out the most successful teams and players, and the factors contributing to the win or loss of a team.
- Datasets used are matches.csv and deliveries.csv.
- Mumbai Indians, Chennai Super Kings and Kolkata Knight Riders are the top three teams with the most wins.
- Top 3 Players based on Player of the Match Awards: Chris Gayle, AB de Villiers, MS Dhoni.
- Top Batsmen: Virat Kohli, SK Raina, Rohit Sharma.
- Top Bowlers: TG Southee, AD Mathews, SK Raina.
- Task 6: Prediction using Decision Tree Algorithm.
- Decision Tree Classifier Task. Once trained, the classifier should predict the correct class for any new data.
- Dataset used is Iris.csv (can also be imported from sklearn.datasets).
- Mean of Cross Validation Score: 0.9466666666666667
- Standard Deviation of Cross Validation Score: 0.04521553322083511
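
The mean and standard deviation of a cross-validation score can be computed with scikit-learn's `cross_val_score`. This is a minimal sketch under stated assumptions: the fold count (`cv=10`) and `random_state=0` are illustrative guesses, not settings taken from the notebook, so the printed values need not match the figures above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# random_state fixes the tie-breaking inside the tree; it is an assumption.
classifier = DecisionTreeClassifier(random_state=0)

# One accuracy value per fold; cv=10 is an assumed fold count.
scores = cross_val_score(classifier, iris.data, iris.target, cv=10)

print("mean:", scores.mean())
print("std:", scores.std())
```

Reporting the standard deviation alongside the mean shows how much the accuracy varies from fold to fold, which a single train/test split cannot reveal.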
- Task 7: Stock Market Prediction using Numerical and Textual analysis.
- Create a hybrid model for stock price/performance prediction using numerical analysis of historical stock prices and sentiment analysis of news headlines.
- For the numerical analysis, the historical stock prices dataset can be downloaded from finance.yahoo.com, or the SENSEX.csv file in the repository can be used. Textual (news) data can be downloaded from https://bit.ly/36fFPI6 or https://www.kaggle.com/therohk/india-headlines-news-dataset?select=india-news-headlines.csv.
- Mean Absolute Error: 0.5019762845849802
- Mean Squared Error: 0.5019762845849802
- Root Mean Squared Error: 0.7085028472666713
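
The three error metrics are closely related: RMSE is the square root of MSE, and MAE equals MSE whenever every individual error is either 0 or 1. One possible reading of the identical MAE and MSE values above, which is an assumption and not something the notes state, is that the model predicts a binary label such as up/down. The sketch below illustrates the relationship on made-up labels; `actual` and `predicted` are hypothetical and not the project's data.

```python
import math

# Hypothetical binary labels (1 = up, 0 = down), NOT the project's data.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 1, 0, 1, 0, 1, 1, 0]

errors = [a - p for a, p in zip(actual, predicted)]

mae = sum(abs(e) for e in errors) / len(errors)   # mean absolute error
mse = sum(e ** 2 for e in errors) / len(errors)   # mean squared error
rmse = math.sqrt(mse)                             # root mean squared error

# With errors limited to -1, 0 or 1, abs(e) == e ** 2, so MAE == MSE.
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
```

For continuous price predictions the two metrics would generally differ, so identical values are worth checking against how the targets were encoded.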