RAKSHIT SINHA's Projects
A content-based recommender system that recommends movies similar to one the user likes and analyses the sentiment of the reviews the user has given for that movie.
Recommends anime using content-based filtering (TF-IDF vectorization with a sigmoid kernel) and collaborative filtering (KNN).
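The content-based half of the description above can be sketched in plain Python. This is a minimal illustration, not the repository's code: the function names, the whitespace tokenizer, the sample titles and the `gamma` and `coef0` defaults are all assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one L2-normalised {term: weight} dict per document."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Smoothed IDF, so a term present in every document keeps a non-zero weight.
        vec = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def sigmoid_kernel(a, b, gamma=1.0, coef0=1.0):
    """tanh(gamma * <a, b> + coef0) for two sparse vectors (assumed defaults)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    return math.tanh(gamma * dot + coef0)


def recommend(titles, descriptions, liked_title, top_n=2):
    """Rank every other title by kernel similarity to the liked one."""
    vectors = tfidf_vectors(descriptions)
    liked = vectors[titles.index(liked_title)]
    scored = [
        (sigmoid_kernel(liked, vec), title)
        for title, vec in zip(titles, vectors)
        if title != liked_title
    ]
    return [title for _, title in sorted(scored, reverse=True)[:top_n]]


# Hypothetical catalogue, for illustration only.
titles = ["A", "B", "C"]
descriptions = [
    "space pirates fight in space",
    "pirates sail the space seas",
    "high school romance comedy",
]
print(recommend(titles, descriptions, "A", top_n=1))
```

Because tanh is monotonic, the ranking follows the TF-IDF dot product; the kernel only rescales the scores.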
Implements various ML regression models on bike-sharing data published by Capital Bikeshare (Washington, D.C.).
Builds an efficient active portfolio across 8 instruments, using various trading strategies to achieve a high Sharpe ratio.
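For reference, the Sharpe ratio this project optimises can be computed as below. This is a sketch under stated assumptions: returns are simple per-period returns, the risk-free rate is annual, and 252 trading periods per year is a common convention rather than something the repository states. Both function names are hypothetical.

```python
import math
import statistics


def portfolio_returns(instrument_returns, weights):
    """Weighted sum of instrument returns for each period.

    instrument_returns: one row per period, one column per instrument.
    """
    return [sum(w * r for w, r in zip(weights, period)) for period in instrument_returns]


def sharpe_ratio(returns, risk_free_rate=0.0, periods_per_year=252):
    """Annualised Sharpe ratio: mean excess return over its sample standard deviation."""
    per_period_rf = risk_free_rate / periods_per_year
    excess = [r - per_period_rf for r in returns]
    stdev = statistics.stdev(excess)
    if stdev == 0:
        raise ValueError("returns have zero volatility")
    return statistics.fmean(excess) / stdev * math.sqrt(periods_per_year)


# Hypothetical two-instrument, three-period example.
rows = [[0.01, 0.03], [0.02, 0.00], [0.00, 0.02]]
print(sharpe_ratio(portfolio_returns(rows, [0.5, 0.5])))
```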
Uses a CNN to classify chest X-ray images as showing pneumonia or as normal. The data is taken from Kaggle: https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia
Comment Processing-Tool
Predicts the sentiment of tweets about the Covid-19 pandemic. Tweets were classified as "Positive", "Extremely Positive", "Neutral", "Negative" or "Extremely Negative". TF-IDF vectorization was used to vectorize the tokens in the tweets, and the CatBoost algorithm was used for classification, achieving an accuracy of around 57%.
Predicts the salary in USD of data science jobs (for example Data Scientist, Data Engineer, Machine Learning Engineer, Data Analyst, BI Engineer) based on factors such as work year (the year in which you are looking for a job), pay grade, average pay scale in the country where the job is located, experience level and employment type.
Flower detection using CNN
The task is to predict how many days a patient will stay in hospital, as one of 10 categories, given 16 parameters. EDA, feature engineering and resampling were performed as data preprocessing. A CatBoost classification model was then implemented, achieving more than 41% accuracy.
Analysed a dataset to predict which patients are more likely to suffer a heart attack. The dataset is available on Kaggle, and so is my notebook: https://www.kaggle.com/raksh710/87-accuracy-85-f1-score-knn-14-lr-svc-rf-cbc
ice_breaker project forked from emarco177 to test LangChain's capabilities with various APIs
Compared CatBoostRegressor and Keras to find out which model performed best on the King County house price regression dataset from Kaggle. Link to the notebook: https://www.kaggle.com/raksh710/catboost-vs-keras-cb-wins
Given an input image, classify it into one of the following categories: 'buildings': 0, 'forest': 1, 'glacier': 2, 'mountain': 3, 'sea': 4, 'street': 5. Each key is a class name and each value is its numeric tag. The CNN model has 3 Conv2D, 3 MaxPool2D, 1 Flatten, 1 Dropout and 2 Dense layers. After training on 14034 images belonging to the 6 classes, the model was validated on a set of 3000 images belonging to the same 6 classes, on which it achieved an accuracy of 84.17%. Steps: 1) Specify the train, validation and test directories (where the images are stored). 2) Use an Image Generator to create more samples from the given training samples (in order to detect the class more accurately). Images were augmented by zooming in/out, shearing, rotating etc. 3) Images from the train and validation directories were passed through the Image Generator created in step 2. Shuffle was True for training and False for validation, because evaluating accuracy requires the validation images to stay in order. 4) Image samples from the train directory were fed to the CNN model, which was evaluated on the validation directory. 5) Image samples from the test directory were also predicted and evaluated manually.
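The evaluation in step 3 relies on predictions and true labels lining up position by position, which is why the validation set is not shuffled. A minimal sketch of that comparison follows; the class mapping is taken from the description above, while the helper names `decode` and `accuracy` are hypothetical and not from the repository.

```python
CLASS_INDICES = {"buildings": 0, "forest": 1, "glacier": 2, "mountain": 3, "sea": 4, "street": 5}
INDEX_TO_CLASS = {index: name for name, index in CLASS_INDICES.items()}


def decode(probabilities):
    """Return the class name with the highest predicted probability."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return INDEX_TO_CLASS[best]


def accuracy(predicted, actual):
    """Fraction of positions where prediction and label agree.

    Only meaningful if both lists are in the same order, hence
    shuffle=False on the validation data.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# One hypothetical softmax output for a single image.
print(decode([0.05, 0.10, 0.60, 0.15, 0.05, 0.05]))
```

As a sanity check on the reported figure, 2525 correct predictions out of 3000 rounds to 84.17%.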
A major chunk of bank revenue is generated by credit cards. Customers who fail to pay their credit card dues on time could cost banks a lot of revenue, so issuing credit cards to customers who are more likely to pay late carries a higher risk for the bank. Issuing these customers cards with a higher interest rate would work in the bank's favor. In order to make an informed decision about which customers are high risk and which are low risk, the firm would benefit from a prediction model that accurately predicts whether a customer will default. Prediction can be based on factors like job, education, balance, loans, and house ownership. Finding the factors most common among defaulters will also help the bank be cautious before issuing a credit card to customers who fall into one of those categories.
Classifies malicious websites from benign ones using a CatBoost classifier. The process involves data exploration, data cleaning, resampling (to handle highly imbalanced data), model implementation and evaluation.
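The description does not say which resampling technique was used. As one possibility, random oversampling of the minority class can be sketched as follows; the function name, the seed and the toy data are assumptions for illustration.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Draw the missing rows with replacement from the existing ones.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy imbalanced data: 8 benign rows, 2 malicious rows.
rows = list(range(10))
labels = ["benign"] * 8 + ["malicious"] * 2
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Resampling of this kind is normally applied to the training split only, so that evaluation still reflects the original class balance.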
We are working on UMD's Info Challenge, and our dataset is the ISCXIDS2012 cybersecurity dataset.
The task was to forecast the medical cost associated with each patient given their medical parameters and health history. The CatBoost algorithm was applied to the data after scaling (standardization).
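Standardization rescales each numeric feature to zero mean and unit variance. A minimal sketch for a single column, assuming the population standard deviation and using hypothetical function names and values:

```python
import statistics


def fit_standardizer(column):
    """Learn the mean and population standard deviation of one feature."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by zero.
    return mean, std or 1.0


def standardize(column, mean, std):
    """Apply z = (x - mean) / std to every value."""
    return [(x - mean) / std for x in column]


# Hypothetical feature values; fit on training data, then reuse the
# same mean and std for any later data.
train_column = [10.0, 20.0, 30.0]
mean, std = fit_standardizer(train_column)
print(standardize(train_column, mean, std))
```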
The input data contained grayscale images (1 color channel, width=28, height=28) of digits from 0 to 9, which the model has to identify. I implemented a CNN consisting of convolutional layers as well as MaxPool layers and achieved 99.6% accuracy on the test set. Link to my notebook: https://www.kaggle.com/raksh710/mnist-using-cnn-99-6-test-accuracy
My Resume
Customized chatbot for a particular PDF file
Dynamic dashboard created using Plotly Dash for historical stock prices.
Predict My Sleep is a Kaggle competition hosted by Rob Mulla (YouTuber and Twitch streamer). Trying to predict his sleep patterns since 2022 using historic data.
This repository contains my assignment solutions to the course "How to win a kaggle data challenge (Challenge Predict Future sales)" from Coursera. Please feel free to review and drop in your feedback.
Config files for my GitHub profile.
Uses Google's v3 API to fetch the top 100 relevant comments, runs sentiment analysis on each comment and returns the 'Average' sentiment. The application is hosted on Salesforce Heroku, a PaaS.
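The final averaging step can be sketched as follows. The description does not say which sentiment model is used, so this assumes each comment has already been scored in the range -1 to 1; the function names and the 0.05 threshold are hypothetical.

```python
def average_sentiment(scores):
    """Mean of per-comment sentiment scores (assumed to lie in [-1, 1])."""
    if not scores:
        raise ValueError("no comments to average")
    return sum(scores) / len(scores)


def sentiment_label(score, threshold=0.05):
    """Map an average score to a coarse label; the threshold is an assumption."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


# Hypothetical scores for three comments.
scores = [0.5, -0.5, 0.6]
print(sentiment_label(average_sentiment(scores)))
```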