data-analysis-of-bike-sharing's Introduction

Data analysis of Bike Sharing - Python, Pandas, Jupyter Notebook, Matplotlib, plotly

Introduction

Bike sharing has become more popular in recent years. It serves not only as a new mode of transit but also as a hobby for many people who prefer a more convenient way to ride.

The dataset includes station, trip, and weather data for three cities: Montreal, Toronto, and Washington DC.

Our goal is to improve an existing bike-sharing data analysis on Kaggle and to explore it in more detail by adding two hypotheses, making the project more comprehensive.

Installation

Use the package manager pip to install the required modules.

pip install scipy
pip install seaborn
pip install numpy
pip install pandas
pip install matplotlib
pip install haversine

Preparation

Import scipy, seaborn, numpy, pandas, matplotlib, datetime, haversine.

from scipy import stats
from haversine import haversine

import seaborn as sns
import numpy as np
import pandas as pd
import datetime
import matplotlib.pyplot as plt

Data Sources

Because of file-size limits on Git uploads, please download the files listed below to run the code.

Data source for hypothesis #1 to #3

  1. Bike Sharing - Big Data Systems development & implementation
    https://www.kaggle.com/v1teka/bike-sharing-big-data-systems-dev-impl/notebook

Data source for hypothesis #4 and #5

  1. Capital Bikeshare trip history data (please use the files for April, May, and June 2020)
    https://s3.amazonaws.com/capitalbikeshare-data/index.html

Data source for hypothesis #5

  1. COVID-19 in USA
    https://www.kaggle.com/sudalairajkumar/covid19-in-usa

Data - (For hypothesis #1, hypothesis #2, hypothesis #3)

Trip Dataframe

image

Station Dataframe

image

Weather Dataframe

image

Output

Weather conditions:

prectot: Precipitation (mm/day)
qv2m: Specific Humidity at 2 Meters (g/kg)
rh2m: Relative Humidity at 2 Meters (%)
ps: Surface Pressure (kPa)
t2m_range: Temperature Range at 2 Meters (C)
ts: Earth Skin Temperature (C)
t2mdew: Dew/Frost Point at 2 Meters (C)
t2mwet: Wet Bulb Temperature at 2 Meters (C)
t2m_max: Maximum Temperature at 2 Meters (C)
t2m_min: Minimum Temperature at 2 Meters (C)
t2m: Temperature at 2 Meters (C)
ws50m_range: Wind Speed Range at 50 Meters (m/s)
ws10m_range: Wind Speed Range at 10 Meters (m/s)
ws50m_min: Minimum Wind Speed at 50 Meters (m/s)
ws10m_min: Minimum Wind Speed at 10 Meters (m/s)
ws50m_max: Maximum Wind Speed at 50 Meters (m/s)
ws10m_max: Maximum Wind Speed at 10 Meters (m/s)
ws50m: Wind Speed at 50 Meters (m/s)
ws10m: Wind Speed at 10 Meters (m/s)

Hypothesis #1 - Total daily trip duration depends on weather conditions (using Montreal as an example)

The scatter plots below show the relationship between each weather condition and total daily trip duration. A for loop draws one scatter plot per factor and sets its title.

image
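The plotting loop described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook code: the DataFrame `daily`, its synthetic values, and the column name `total_duration` are assumptions made for the example, and only three of the weather factors are plotted.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the merged daily weather/trip table (values are made up).
rng = np.random.default_rng(0)
daily = pd.DataFrame({
    "t2m_max": rng.uniform(5, 30, 60),
    "prectot": rng.uniform(0, 20, 60),
    "ws50m_max": rng.uniform(2, 15, 60),
})
# Hypothetical target column: total trip duration per day, in seconds.
daily["total_duration"] = (
    1000 * daily["t2m_max"] - 300 * daily["ws50m_max"] + rng.normal(0, 500, 60)
)

factors = ["t2m_max", "prectot", "ws50m_max"]
fig, axes = plt.subplots(1, len(factors), figsize=(12, 3.5), sharey=True)

# One scatter plot per weather factor, each titled with the factor name.
for ax, factor in zip(axes, factors):
    ax.scatter(daily[factor], daily["total_duration"], s=10)
    ax.set_title(factor)
    ax.set_xlabel(factor)

axes[0].set_ylabel("total daily trip duration (s)")
fig.tight_layout()
```

In a Jupyter Notebook the figure renders inline; elsewhere, `fig.savefig(...)` writes it to a file.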

To quantify the relationship between each factor and total daily trip duration, we used the Pearson correlation coefficient, a measure of linear correlation between two sets of data. The method returns two numbers. The first is Pearson's r, a numerical summary of the strength and direction of the linear association between the variables. The second is the p-value: the probability of observing a correlation at least this strong if the true value of r were zero. The results show a linear relationship between maximum temperature and total daily trip duration. By contrast, maximum wind speed at 50 meters has the largest negative correlation with total daily trip duration.
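As a small illustration of the two numbers returned by the Pearson test, the sketch below calls `scipy.stats.pearsonr` on synthetic data. The arrays `t2m_max` and `total_duration` are made-up stand-ins, not values from the dataset.

```python
import numpy as np
from scipy import stats

# Synthetic data (an assumption for illustration): duration rises with temperature.
rng = np.random.default_rng(42)
t2m_max = rng.uniform(5, 30, 100)                           # daily max temperature (C)
total_duration = 1000 * t2m_max + rng.normal(0, 2000, 100)  # total seconds per day

# pearsonr returns Pearson's r and the two-sided p-value for the
# null hypothesis that the true correlation is zero.
r, p_value = stats.pearsonr(t2m_max, total_duration)
print(f"r = {r:.3f}, p-value = {p_value:.3g}")
```

A value of r near +1 or -1 indicates a strong linear association, and a small p-value indicates that such a correlation would be unlikely if the variables were uncorrelated.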

image

Here is the correlation matrix. Like every correlation matrix, it is symmetric.

image
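A correlation matrix like this one can be produced with `DataFrame.corr`. The sketch below uses a synthetic table with hypothetical column names and checks the symmetry property directly.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the daily table (column names are assumptions).
rng = np.random.default_rng(1)
daily = pd.DataFrame(
    rng.normal(size=(50, 3)),
    columns=["t2m_max", "ws50m_max", "total_duration"],
)

# Pairwise Pearson correlations between all numeric columns.
corr = daily.corr(method="pearson")
print(corr.round(2))

# corr(x, y) == corr(y, x), so the matrix equals its transpose,
# and every variable correlates perfectly with itself.
is_symmetric = np.allclose(corr.values, corr.values.T)
diagonal_is_one = np.allclose(np.diag(corr.values), 1.0)

# With seaborn installed, sns.heatmap(corr, annot=True) would draw
# this matrix as a heat map.
```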

The heat map below makes the linear relationships between variables easy to spot.

image

Hypothesis #2 - Club members ride longer per trip

We aimed to verify whether club members ride longer per trip.

In Montreal, casual riders have longer trip durations than member riders. The average difference is about 400 to 600 seconds, or roughly 7 to 10 minutes.

image
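A comparison of this kind comes down to a group-wise mean. The sketch below assumes a trip table with hypothetical columns `is_member` (1 for members, 0 for casual riders) and `duration_sec`; the values are invented for illustration.

```python
import pandas as pd

# Toy trip table; column names and values are assumptions, not the real data.
trips = pd.DataFrame({
    "is_member": [1, 1, 1, 0, 0, 0],
    "duration_sec": [600, 720, 480, 1100, 1300, 900],
})

# Average trip duration for each rider type.
mean_duration = trips.groupby("is_member")["duration_sec"].mean()

# Gap between casual riders (0) and members (1), in seconds and minutes.
gap_sec = mean_duration.loc[0] - mean_duration.loc[1]
gap_min = gap_sec / 60
print(mean_duration)
print(f"casual riders ride {gap_sec:.0f} s ({gap_min:.1f} min) longer on average")
```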

For Toronto, the visualization provides little information because the member-rider data in the dataset is incomplete, so we could not test this hypothesis further for this city.

image

In Washington, casual riders also ride longer per trip than member riders. Moreover, the gap between member and casual riders is larger than in Montreal: about 8 to 20 minutes on average.

image

Hypothesis #3 - Compared to casual riders, member riders are more likely to take bikes

Calculate the percentage of members riding bikes.
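One way to compute this percentage is sketched below, overall and per month. The column names `start_date` and `is_member` are hypothetical, and the rows are invented for illustration.

```python
import pandas as pd

# Toy trip table; column names and values are assumptions for illustration.
trips = pd.DataFrame({
    "start_date": pd.to_datetime([
        "2014-04-15", "2014-04-16", "2014-04-17", "2014-04-18",
        "2014-05-01", "2014-05-02", "2014-05-03", "2014-05-04",
    ]),
    "is_member": [1, 1, 1, 0, 1, 1, 0, 0],
})

# Overall share of trips taken by members, as a percentage.
overall_pct = trips["is_member"].mean() * 100

# Share of member trips per calendar month.
monthly_pct = trips.groupby(trips["start_date"].dt.month)["is_member"].mean() * 100
print(f"overall: {overall_pct:.1f}%")
print(monthly_pct)
```

Because `is_member` is coded as 0 or 1, its mean is the fraction of trips taken by members.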

For Montreal, some data is missing from the dataset. However, where a full year of data is available, as in 2014, member riders are more likely to take bikes than casual riders.

image

In Toronto, the dataset records only casual riders, so it can show only when casual riders take bikes. We cannot draw a conclusion about this hypothesis because the dataset does not provide enough data.

image

In Washington, the proportion of casual riders is quite low, just as in Montreal. This supports the hypothesis that member riders are more likely to take bikes than casual riders.

image

Hypothesis #4 - From April to June 2020, members tend to ride a longer distance per trip than casual riders in Washington.

In April, members tend to ride a longer distance than casual riders.

image

In May, members tend to ride a slightly longer distance than casual riders.

image

In June, casual riders tend to ride a longer distance than members.

image
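The project imports the haversine package, so the trip distances are presumably great-circle distances between start and end coordinates. The sketch below writes the haversine formula out with the standard library so each step is visible; the `rides` records, the `member_casual` field, and the coordinates are assumptions made for the example.

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Toy rides with invented coordinates near Washington DC.
rides = [
    {"member_casual": "member", "start": (38.9072, -77.0369), "end": (38.9172, -77.0369)},
    {"member_casual": "member", "start": (38.9072, -77.0369), "end": (38.9272, -77.0369)},
    {"member_casual": "casual", "start": (38.9072, -77.0369), "end": (38.9372, -77.0369)},
]

# Accumulate total distance and trip count per rider type, then average.
totals = defaultdict(lambda: [0.0, 0])
for ride in rides:
    distance = haversine_km(*ride["start"], *ride["end"])
    totals[ride["member_casual"]][0] += distance
    totals[ride["member_casual"]][1] += 1

mean_km = {rider: total / count for rider, (total, count) in totals.items()}
print(mean_km)
```

This measures straight-line distance between the two endpoints, not the route actually ridden, so a round trip that ends where it started has a distance of zero.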

Hypothesis #5 - From April to June 2020, as the number of confirmed coronavirus cases increases, the daily average trip distance decreases in Washington.

The relationship between daily distance and confirmed cases in April 2020.

Screen Shot 2021-12-12 at 1 52 01 PM

The relationship between daily distance and confirmed cases in May 2020.

Screen Shot 2021-12-12 at 1 55 09 PM

The relationship between daily distance and confirmed cases in June 2020.

Screen Shot 2021-12-12 at 1 57 09 PM
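Testing this hypothesis requires joining two sources on the date: the daily average trip distance and the number of confirmed cases. The sketch below shows one possible way to do that with pandas. The column names `date`, `distance_km`, and `confirmed`, and all values, are assumptions for illustration and do not come from the real files.

```python
import pandas as pd

# Toy per-trip distances, two trips per day (invented values).
trips = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-04-01", "2020-04-01", "2020-04-02", "2020-04-02", "2020-04-03",
        "2020-04-03", "2020-04-04", "2020-04-04", "2020-04-05", "2020-04-05",
    ]),
    "distance_km": [2.4, 2.6, 2.3, 2.5, 2.1, 2.3, 2.0, 2.2, 1.8, 2.0],
})

# Toy daily confirmed-case counts (invented values).
cases = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-04-01", "2020-04-02", "2020-04-03", "2020-04-04", "2020-04-05",
    ]),
    "confirmed": [100, 200, 300, 400, 500],
})

# Daily average trip distance.
daily = (
    trips.groupby("date", as_index=False)["distance_km"]
    .mean()
    .rename(columns={"distance_km": "avg_distance_km"})
)

# Keep only dates present in both tables, then measure the linear association.
merged = daily.merge(cases, on="date", how="inner")
r = merged["avg_distance_km"].corr(merged["confirmed"])
print(f"Pearson r = {r:.3f}")
```

A negative r would be consistent with the hypothesis, although a correlation on its own does not show that rising case counts caused shorter trips.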

Contributors

megwu1129
