  • 👋 Hello, welcome to my repo and portfolio!
  • 👷 I'm Anthony De Faria, an engineer and data scientist.
  • 👀 I'm passionate about photography, travel and motorbikes.
  • 💪 Here I showcase my data science skills through a use-case project.

🏍️ Project #1: French second-hand motorcycle market visualization dashboard and price prediction.

💡 Introduction

The main project you can find here is a visualization dashboard and price predictor for the French second-hand motorcycle market. The objective is to predict the price of new motorbike ads and detect good deals 🤑!

How does it work?

The project is separated into several microservices:

  • Data gathering: preconfigured robots (also known as spiders) scrape 11 websites twice a day to gather market data.
  • Data cleaning: raw data from scraping is cleaned and prepared for visualization and machine learning.
  • Data storing: data is stored in a self-hosted PostgreSQL database.
  • Data visualization: data can be explored through an interactive dashboard to monitor the spiders and understand market trends.
  • Trainer (in progress): data is used to train a machine learning model to predict the price.
  • REST API (in progress): query the price prediction from anywhere you need it!

🌐 Micoservices detailed

🔒 Gathering data with scraping (private repo)

bike-price is a private repo with all my spiders ready to scrape the internet. It's a top secret repo, but I can at least detail the strategy (keep it to yourself 🕵️). I use the famous Scrapy framework to configure and run my spiders 🕷️. They are clever and will never scrape the same URL twice thanks to a fine-tuned middleware. They have a special digestion pipeline to format the data and send it to a PostgreSQL database. They are hosted on a scalable virtual machine, aka a DigitalOcean droplet 💧. They are dedicated workers: they wake up early, go to bed very late and always do what cron asks them to do. Sometimes the task can be tricky, as with JavaScript-based websites. They are smart enough to run a scriptable web browser called Splash to help them scrape the data. Thanks to continuous deployment, they are super easy to deploy even after a small modification. Pushing to the master branch triggers GitHub Actions to pull the modifications onto the droplet via SSH and then build the containers with Docker Compose 🐋.
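
The "never scrape the same URL twice" idea can be sketched without Scrapy itself. The snippet below is only an illustration, not the code from the private repo: the `SeenUrlStore` class, the `seen_urls` table and the normalization rule in `fingerprint` are all assumptions. It uses SQLite from the standard library so that it runs anywhere; the real service may well keep this state elsewhere.

```python
import hashlib
import sqlite3
from urllib.parse import urlsplit, urlunsplit


def fingerprint(url: str) -> str:
    """Hash a URL after lowercasing scheme/host and dropping the fragment."""
    parts = urlsplit(url)
    normalized = urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, "")
    )
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class SeenUrlStore:
    """Hypothetical persistent set of already-scraped URLs (SQLite-backed)."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS seen_urls (fp TEXT PRIMARY KEY)"
        )

    def is_new(self, url: str) -> bool:
        """Record the URL and return True only the first time it is seen."""
        cur = self.conn.execute(
            "INSERT OR IGNORE INTO seen_urls (fp) VALUES (?)",
            (fingerprint(url),),
        )
        self.conn.commit()
        # rowcount is 1 when the row was inserted, 0 when it was ignored.
        return cur.rowcount == 1
```

In a Scrapy project, a downloader middleware could call `is_new()` from its `process_request` hook and drop the request whenever it returns `False`.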

🧹 Cleaning and transforming data

Raw data is like an unpolished diamond: it needs some work before it is ready for visualization and machine learning. bike-price-cleaner is a service that automatically cleans and prepares the data after each scraping session, as detailed below.

  1. Remove data outside a physically plausible, acceptable range.
  2. Drop duplicates (vendors tend to post the same ad on multiple websites).
  3. Standardize brand names and categories with fuzzy matching, since the brand and model are generally free-text inputs.
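
The three steps above can be sketched in a few lines of standard-library Python. This is a rough illustration under stated assumptions rather than the actual bike-price-cleaner code: the field names, the `KNOWN_BRANDS` list, the numeric ranges and the 0.7 similarity cutoff are all made up for the example, and `difflib` stands in for whatever fuzzy-matching library the service really uses.

```python
from difflib import get_close_matches

# Assumed reference list and plausibility ranges (illustrative values only).
KNOWN_BRANDS = ["Yamaha", "Honda", "Kawasaki", "Suzuki", "Ducati", "Triumph"]
PRICE_RANGE_EUR = (100, 100_000)
MILEAGE_RANGE_KM = (0, 300_000)


def in_range(ad: dict) -> bool:
    """Step 1: keep only ads whose price and mileage are plausible."""
    return (
        PRICE_RANGE_EUR[0] <= ad["price_eur"] <= PRICE_RANGE_EUR[1]
        and MILEAGE_RANGE_KM[0] <= ad["mileage_km"] <= MILEAGE_RANGE_KM[1]
    )


def standardize_brand(raw: str, cutoff: float = 0.7) -> str:
    """Step 3: map free-text input to the closest known brand, if any."""
    lookup = {brand.lower(): brand for brand in KNOWN_BRANDS}
    match = get_close_matches(raw.strip().lower(), list(lookup), n=1, cutoff=cutoff)
    # Fall back to the trimmed raw value when nothing is similar enough.
    return lookup[match[0]] if match else raw.strip()


def clean(ads: list[dict]) -> list[dict]:
    """Apply range filtering, brand standardization and deduplication."""
    seen = set()
    cleaned = []
    for ad in ads:
        if not in_range(ad):
            continue
        brand = standardize_brand(ad["brand"])
        # Step 2: the same bike posted on several sites shares these fields.
        key = (brand, ad["model"].strip().lower(), ad["year"],
               ad["mileage_km"], ad["price_eur"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({**ad, "brand": brand})
    return cleaned
```

Standardizing the brand before deduplicating matters here: two copies of the same ad with differently spelled brands only collapse into one once both spellings map to the same canonical name.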

💻 Storing data

The data is stored in a PostgreSQL database.

📈 Visualize spider heartbeats and market trends

bike-price-dashboard is an interactive dashboard built with Plotly Dash (be patient, it can take a few seconds to load). It lets you visualize the spiders' workload and scraping anomalies. You can also dive into second-hand market stats and history, which is very handy if you are looking for a bike at the best price.

👨‍🏫 Machine learning training (in progress)

Coming soon:

  • Implementation of Celery and Redis to improve the UX
  • Standardization of bike model names with fuzzy matching
  • Deployment of an ExtraTreesRegressor model to predict the price
  • Check for PCA integration

Anthony DE FARIA's Projects

docker-cron icon docker-cron

An example of running cron job in a docker container
