Puneet Tokhi: @puneettokhi
Sai Kapadekar: @Sai-kapadekar
Shivang Patel: @Shivang-Patel
Aaryaneil Nimbalkar: @aaryaneil
College Major Analysis Based on Economic Factors
- https://github.com/fivethirtyeight/data/blob/master/college-majors/all-ages.csv
- https://github.com/fivethirtyeight/data/blob/master/college-majors/grad-students.csv
- https://github.com/fivethirtyeight/data/blob/master/college-majors/majors-list.csv
- https://github.com/fivethirtyeight/data/blob/master/college-majors/recent-grads.csv
- https://github.com/fivethirtyeight/data/blob/master/college-majors/women-stem.csv
After high school, many students have only a vague idea about their college major, or they enter college with an undeclared major. Many are unaware of which careers pay well and end up settling for jobs that don’t require a college degree. Investing in a college degree has to be both fruitful and viable for students, because the economic factor is as important as the desire to pursue a field of interest.
There are multiple factors to consider, such as the employment ratio in a field, the number of job opportunities, and the median pay. Weighing these factors helps students form a clear vision of their future goals and make an informed decision about their future.
By analyzing the data on college majors, employment, and gender diversity, our goal is to provide a data model that can help students and parents choose a college major and understand how big a financial difference it makes.
To solve this problem, we considered clustering and decision trees as the primary data mining techniques for the project. Clustering can group majors that fall within defined salary ranges. With classification and decision trees, a major can be labeled as economically viable or not. Other techniques that may be considered after the initial implementation, if time permits, are linear regression, outlier detection, and association. We also consider using a random forest classifier, which is a classification algorithm that combines many decision trees.
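The clustering idea described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the major names and salary values are made up, and the helper `cluster_by_salary` is a hypothetical name. It assumes scikit-learn's `KMeans` as the clustering algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up median salaries (USD), for illustration only; not taken from the dataset.
majors = ["Major A", "Major B", "Major C", "Major D", "Major E", "Major F"]
median_salary = np.array([30000, 32000, 50000, 52000, 90000, 95000], dtype=float)


def cluster_by_salary(salaries, n_clusters=3, seed=0):
    """Cluster 1-D salary values with k-means.

    Returns one label per salary, renumbered so that label 0 is the
    lowest-paying cluster and label n_clusters - 1 the highest.
    """
    X = salaries.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)

    # k-means numbers its clusters arbitrarily, so rank them by centre value.
    order = np.argsort(model.cluster_centers_.ravel())
    rank = np.empty_like(order)
    rank[order] = np.arange(n_clusters)
    return rank[model.labels_]


labels = cluster_by_salary(median_salary)
for major, salary, label in zip(majors, median_salary, labels):
    print(f"{major}: median {salary:.0f} -> salary cluster {label}")
```

Renumbering the clusters by their centres makes the labels readable as low, middle, and high salary ranges regardless of how k-means happens to number them.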
For this project, we have decided to implement clustering, regression, and dimensionality reduction. The accuracy of these models can be measured using a confusion matrix, a classification report, or an accuracy score.
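As a rough sketch of how these three measures are computed, the example below uses scikit-learn's `sklearn.metrics` module. The labels are hypothetical (1 for an economically viable major, 0 otherwise) and do not come from the project's results. These measures apply to models that predict class labels.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical true and predicted labels: 1 = economically viable, 0 = not.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes, in the order [0, 1].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
acc = accuracy_score(y_true, y_pred)  # fraction of labels predicted correctly

print(cm)
print(f"Accuracy: {acc:.2f}")
# Per-class precision, recall, and F1 score.
print(classification_report(y_true, y_pred, target_names=["not viable", "viable"]))
```

Here six of the eight predictions match the true labels, so the accuracy is 0.75, and the confusion matrix shows one false positive and one false negative.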
You need Jupyter Notebook or any Python IDE, with Python 3.0 or later installed, to run the code. Update the data file paths to match your local directory before loading the data.
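One way to keep the data location in a single place is sketched below. The `DATA_DIR` value and the `load_dataset` helper are assumptions made for illustration, not names from the project's notebooks; point `DATA_DIR` at the folder where you saved the CSV files.

```python
from pathlib import Path

import pandas as pd

# Assumed location of the downloaded CSV files; change this to your own folder.
DATA_DIR = Path("data")


def load_dataset(filename, data_dir=DATA_DIR):
    """Read one of the college-majors CSV files from data_dir into a DataFrame."""
    return pd.read_csv(Path(data_dir) / filename)


# Example usage, once the files are in DATA_DIR:
# recent_grads = load_dataset("recent-grads.csv")
# print(recent_grads.head())
```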
Our project also requires the following libraries:
- pandas
- matplotlib
- seaborn
- numpy
These libraries need to be imported before use.
However, the following libraries must first be installed with pip:
- scipy
- sklearn (pip package name: scikit-learn)
- kneed
- pydot
- umap (pip package name: umap-learn)
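To check which of the libraries listed above are still missing from your environment, a small script along these lines may help. The `REQUIRED` mapping and the `missing_packages` helper are illustrative names, not part of the project. Note that two libraries are installed under a different pip name than the one used to import them: `sklearn` is installed as `scikit-learn`, and `umap` as `umap-learn`.

```python
from importlib.util import find_spec

# Import name -> pip package name for every library this project lists.
REQUIRED = {
    "pandas": "pandas",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "numpy": "numpy",
    "scipy": "scipy",
    "sklearn": "scikit-learn",
    "kneed": "kneed",
    "pydot": "pydot",
    "umap": "umap-learn",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of libraries that cannot be found for import."""
    return [pip_name for module, pip_name in required.items()
            if find_spec(module) is None]


missing = missing_packages()
if missing:
    print("Install with: pip install " + " ".join(missing))
else:
    print("All required libraries are available.")
```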