NOTE: The projects are ordered from most recently completed to oldest.
DISCLOSURE: My Data Science course puts a heavy focus on group work, so most of these projects are collaborative and my contribution to each varied. Only the data visualization is fully mine.
For my bachelor project I worked with two other students to measure how polarized subreddits are. We used a Python library (PRAW) to scrape user data within subreddits, then we created networks of the users based on which users replied to other users' posts. We used pre-trained word embeddings to characterize the users based on their posts, and a pre-trained sentiment classifier to measure the sentiment of comments made on posts. Using the network structure, these interactions, and some clever algorithms, we measured whether users with differing characteristics interacted more positively or negatively than users with similar characteristics. In this way, we found that some subreddits were more polarized than others, and that some were much more similar than we initially thought. The graph below shows our main results. The further right a subreddit sits, and the further it is from 0 on the y-axis, the more polarized it is.
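As a rough illustration of the network-building step (not our actual project code), the sketch below turns a list of (replier, replied-to) pairs into a weighted, directed reply network. The function name `build_reply_network`, the usernames, and the choice to skip self-replies are all assumptions made for this example. In the project the pairs came from data scraped with PRAW.

```python
from collections import defaultdict


def build_reply_network(replies):
    """Build {replier: {replied_to: reply_count}} from (replier, replied_to) pairs.

    Hypothetical helper for illustration only.
    """
    network = defaultdict(lambda: defaultdict(int))
    for replier, replied_to in replies:
        if replier == replied_to:
            continue  # assumption: self-replies carry no interaction signal
        network[replier][replied_to] += 1
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {user: dict(targets) for user, targets in network.items()}


# Made-up example data: each pair is (author of reply, author of parent post).
replies = [("alice", "bob"), ("bob", "alice"), ("alice", "bob"), ("carol", "carol")]
network = build_reply_network(replies)
print(network)
```

The edge weights (reply counts) are one simple way to represent how strongly two users interact. A sentiment score per reply could be stored on the same edges.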
My group of 5 students was given an outdated Twitter-like website, which we had to update collaboratively to modern standards. To do this we used GitHub and its CI/CD, secrets, kanban board and other features. We also had quality gates through SonarCloud, Snyk, and MegaLinter. We used Django (Python) for the backend of the website, and Docker and Docker Hub to containerize it. We hosted it on DigitalOcean (an IaaS cloud service), but sadly, due to the cost of hosting it, we have since "destroyed" our "droplets".
Using Tableau I created an interactive visualisation showcasing ammunition statistics from the videogame "Escape from Tarkov".
Feel free to try a (less functional) public version here!
For more details read the paper I wrote about it here.
For this project, we investigated a method of training a model on multiple domains and then predicting on an unseen domain. For this we used review data from multiple different domains, such as Amazon. We wanted to see how choosing different amounts of each training domain could improve performance on the test domain, and we ended up finding a couple of preprocessing methods that boosted the accuracy of our model by up to 5%.
We used machine learning techniques to take various properties of glass as input and predict what type of glass the sample was. We wrote two classifiers from scratch: a feedforward neural network (which I did) and a decision tree classifier. Furthermore, we implemented an ensemble method, and then compared all 3 approaches to the problem.
For this project we had to choose a network dataset and analyse its various properties. We chose a dataset of the world's airports and the connections between them. A fun part of our project was to analyse the unrealistic hypothetical scenario of "which airports would still be above sea level should all of the ice in the world melt". I used the Python libraries "networkx" and "cartopy" to create maps that showed which airports would be underwater at different sea levels:
Notice how many of the coastal airports become submerged (red) and how many of the connections are lost!
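The idea behind those maps can be sketched in a few lines of plain Python. This is a simplified illustration, not the project code (which used networkx and cartopy): the airport codes, elevations, routes, and the function name `flood` are all made up for the example.

```python
def flood(airports, routes, sea_level_rise_m):
    """Return the airports still above water and the routes between them.

    airports: {code: elevation in metres}; routes: list of (code, code) pairs.
    Hypothetical helper for illustration only.
    """
    surviving = {code for code, elevation in airports.items()
                 if elevation > sea_level_rise_m}
    # A route survives only if both of its endpoints survive.
    kept_routes = [(a, b) for a, b in routes
                   if a in surviving and b in surviving]
    return surviving, kept_routes


# Made-up toy data: one low-lying coastal airport and two inland ones.
airports = {"AAA": 2.0, "BBB": 150.0, "CCC": 40.0}
routes = [("AAA", "BBB"), ("BBB", "CCC"), ("AAA", "CCC")]

surviving, kept_routes = flood(airports, routes, sea_level_rise_m=10.0)
lost_routes = len(routes) - len(kept_routes)
print(sorted(surviving), kept_routes, lost_routes)
```

Losing a single coastal airport removes every route that touches it, which is why the connection count can drop much faster than the airport count.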
We used machine learning and natural language processing techniques to automate sentiment classification of tweets. We trained two models: one to predict whether a tweet contained hate speech, and one to predict whether a tweet expressed anger, joy, optimism or sadness.
We used machine learning and image processing techniques to train a KNN classifier that analyses images of skin lesions and predicts whether a lesion is melanoma (a type of skin cancer) or benign (not harmful).
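For readers unfamiliar with KNN, here is a minimal from-scratch sketch of the technique. It is not our project code: the two-dimensional feature vectors and their values are invented for the example, and the function name `knn_predict` is hypothetical. In the project, the features were extracted from the lesion images.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbours."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up feature vectors (two invented features per lesion).
train_X = [(0.10, 0.20), (0.20, 0.10), (0.15, 0.15),
           (0.90, 0.80), (0.80, 0.90), (0.85, 0.95)]
train_y = ["benign", "benign", "benign",
           "melanoma", "melanoma", "melanoma"]

prediction_a = knn_predict(train_X, train_y, (0.82, 0.88), k=3)
prediction_b = knn_predict(train_X, train_y, (0.12, 0.18), k=3)
print(prediction_a, prediction_b)
```

An odd value of k avoids tied votes in a two-class problem like this one.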
We investigated the effects of weather on the spread of Covid-19 in Germany by analysing related data using Python. We discovered a negative correlation between the amount of ultraviolet light (UV index) and Covid-19 infections, suggesting that sunnier days lead to fewer infections. Whether this is due to ultraviolet light killing the virus or to changes in human social interaction on sunnier days remained inconclusive.
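To show what a "negative correlation" means in practice, the sketch below computes a Pearson correlation coefficient by hand. This is an assumption about the kind of measure involved, not a record of the exact statistic we used, and the numbers are made up purely to demonstrate the calculation. They are not our data or results.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Made-up toy values: as the UV index rises, the case count falls.
uv_index = [1, 2, 3, 4, 5]
new_cases = [50, 40, 35, 20, 10]

r = pearson(uv_index, new_cases)
print(round(r, 3))
```

A coefficient close to -1 means the two quantities move in opposite directions. As noted above, it says nothing about why they do.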
We investigated overall road safety in Leeds by analysing traffic data using Python. We made visualisations and concluded that many bicycle-vehicle collisions occur at junctions, so safety measures should focus particularly on those areas.