**Import libraries **
import json
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
Download weather data
!wget https://datacases.s3.us-east-2.amazonaws.com/datathon-2020/Ernst+and+Young/Dubai+Weather_20180101_20200316.txt
--2020-05-17 11:35:08-- https://datacases.s3.us-east-2.amazonaws.com/datathon-2020/Ernst+and+Young/Dubai+Weather_20180101_20200316.txt
Resolving datacases.s3.us-east-2.amazonaws.com (datacases.s3.us-east-2.amazonaws.com)... 52.219.105.74
Connecting to datacases.s3.us-east-2.amazonaws.com (datacases.s3.us-east-2.amazonaws.com)|52.219.105.74|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6915588 (6.6M) [text/plain]
Saving to: ‘Dubai+Weather_20180101_20200316.txt’
Dubai+Weather_20180 100%[===================>] 6.59M 4.30MB/s in 1.5s
2020-05-17 11:35:10 (4.30 MB/s) - ‘Dubai+Weather_20180101_20200316.txt’ saved [6915588/6915588]
Weather Data
From "Predicting weather disruption of public transport" case, weather parameters are described below:
city_name City name
lat Geographical coordinates of the location (latitude)
lon Geographical coordinates of the location (longitude)
main
-
main.temp Temperature
-
main.feels_like This temperature parameter accounts for the human perception of weather
-
main.pressure Atmospheric pressure (on the sea level), hPa
-
main.humidity Humidity, %
-
main.temp_min Minimum temperature at the moment. This is deviation from temperature that is possible for large cities and megalopolises geographically expanded (use these parameter optionally).
-
main.temp_max Maximum temperature at the moment. This is deviation from temperature that is possible for large cities and megalopolises geographically expanded (use these parameter optionally).
wind
-
wind.speed Wind speed. Unit Default: meter/sec
-
wind.deg Wind direction, degrees (meteorological)
clouds
- clouds.all Cloudiness, %
rain
-
rain.1h Rain volume for the last hour, mm
-
rain.3h Rain volume for the last 3 hours, mm
weather (more info Weather condition codes)
-
weather.id Weather condition id
-
weather.main Group of weather parameters (Rain, Snow, Extreme etc.)
-
weather.description Weather condition within the group
-
weather.icon Weather icon id
dt Time of data calculation, unix, UTC
dt_isoDate and time in UTC format
df = pd.read_json("Dubai+Weather_20180101_20200316.txt")
df.dtypes
city_name object
lat float64
lon float64
main object
wind object
clouds object
weather object
dt int64
dt_iso object
timezone int64
rain object
dtype: object
Flat json data type
s = df.apply(lambda x: pd.Series(x['weather']),axis=1).stack().reset_index(level=1, drop=True) # Weather column is a list which only contains one object
s.name = 'weather'
df = df.drop('weather', axis=1).join(s)
json_struct = json.loads(df.to_json(orient="records"))
df_flat = pd.json_normalize(json_struct)
for col in df_flat.columns: # remove useless columns
unique = df_flat[col].unique()
print(f'{col} unique size: {unique.size}')
city_name unique size: 1
lat unique size: 1
lon unique size: 1
dt unique size: 19344
dt_iso unique size: 19344
timezone unique size: 1
rain unique size: 1
main.temp unique size: 3075
main.temp_min unique size: 2416
main.temp_max unique size: 2444
main.feels_like unique size: 3331
main.pressure unique size: 46
main.humidity unique size: 87
wind.speed unique size: 142
wind.deg unique size: 112
clouds.all unique size: 101
weather.id unique size: 17
weather.main unique size: 9
weather.description unique size: 17
weather.icon unique size: 16
rain.1h unique size: 23
rain.3h unique size: 18
Firstly, as observed from Dubai weather data, parameters including city name, latitude, longitude, time zone, and rain only contain one variable; hence, they can be dropped from the data table.
for col in df_flat.columns: # remove useless columns
unique = df_flat[col].unique()
if unique.size == 1:
df_flat.drop(col, axis=1, inplace=True)
print(f'drop {col}')
else:
print(f'{col} {unique.shape}: {unique[:5]}')
drop city_name
drop lat
drop lon
dt (19344,): [1514764800 1514768400 1514772000 1514775600 1514779200]
dt_iso (19344,): ['2018-01-01 00:00:00 +0000 UTC' '2018-01-01 01:00:00 +0000 UTC'
'2018-01-01 02:00:00 +0000 UTC' '2018-01-01 03:00:00 +0000 UTC'
'2018-01-01 04:00:00 +0000 UTC']
drop timezone
drop rain
main.temp (3075,): [14.99 14.63 14.03 13.78 14.28]
main.temp_min (2416,): [13. 12. 16. 19.31 20.61]
main.temp_max (2444,): [18. 17. 19. 21. 23.]
main.feels_like (3331,): [13.7 13.91 13.89 13.14 13.45]
main.pressure (46,): [1015 1016 1017 1014 1013]
main.humidity (87,): [87 93 68 64 56]
wind.speed (142,): [3.1 2.6 1.5 2.1 0.5]
wind.deg (112,): [150 180 160 0 340]
clouds.all (101,): [ 1 0 20 75 40]
weather.id (17,): [800 701 721 801 803]
weather.main (9,): ['Clear' 'Mist' 'Haze' 'Clouds' 'Rain']
weather.description (17,): ['sky is clear' 'mist' 'haze' 'few clouds' 'broken clouds']
weather.icon (16,): ['01n' '50n' '50d' '01d' '02n']
rain.1h (23,): [ nan 0.14 2.03 0.11 0.35]
rain.3h (18,): [ nan 2. 0.31 0.94 1. ]
Secondly, in the weather parameter, each id matches one description and each it is more descriptive than main and icon. Therefore, id can represent the weather parameters and others can be dropped.
df_flat.drop(['weather.main', 'weather.icon', 'weather.description'], axis=1, inplace=True)
Thirdly, noticed from rain for 1 hour and rain for 3 hours, most of data are null value since there is less frequent to rain in Dubai, they can be replaced as 0 instead.
df=df_flat
print(df.isnull().sum())
df.fillna(0, inplace=True)
dt 0
dt_iso 0
main.temp 0
main.temp_min 0
main.temp_max 0
main.feels_like 0
main.pressure 0
main.humidity 0
wind.speed 0
wind.deg 0
clouds.all 0
weather.id 0
rain.1h 19319
rain.3h 19262
dtype: int64
df.describe()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
dt | main.temp | main.temp_min | main.temp_max | main.feels_like | main.pressure | main.humidity | wind.speed | wind.deg | clouds.all | weather.id | rain.1h | rain.3h | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1.934700e+04 | 19347.000000 | 19347.000000 | 19347.000000 | 19347.000000 | 19347.000000 | 19347.000000 | 19347.000000 | 19347.000000 | 19347.000000 | 19347.000000 | 19347.000000 | 19347.000000 |
mean | 1.549584e+09 | 28.103518 | 26.662334 | 29.811445 | 27.685407 | 1009.415465 | 52.495891 | 3.879835 | 188.369153 | 13.758050 | 791.030806 | 0.000601 | 0.003641 |
std | 2.010261e+07 | 7.329281 | 7.579672 | 7.241003 | 8.309739 | 8.016809 | 21.659532 | 2.099726 | 106.259695 | 26.485413 | 41.399420 | 0.023247 | 0.068180 |
min | 1.514765e+09 | 10.890000 | 7.000000 | 12.000000 | 6.340000 | 972.000000 | 4.000000 | 0.300000 | 0.000000 | 0.000000 | 200.000000 | 0.000000 | 0.000000 |
25% | 1.532176e+09 | 22.030000 | 20.925000 | 23.845000 | 20.750000 | 1003.000000 | 35.000000 | 2.340000 | 100.000000 | 0.000000 | 800.000000 | 0.000000 | 0.000000 |
50% | 1.549588e+09 | 28.060000 | 26.670000 | 30.000000 | 27.320000 | 1011.000000 | 53.000000 | 3.600000 | 180.000000 | 1.000000 | 800.000000 | 0.000000 | 0.000000 |
75% | 1.566992e+09 | 33.880000 | 32.810000 | 35.130000 | 34.890000 | 1016.000000 | 69.000000 | 5.100000 | 290.000000 | 19.000000 | 800.000000 | 0.000000 | 0.000000 |
max | 1.584400e+09 | 45.940000 | 45.360000 | 48.000000 | 47.890000 | 1026.000000 | 100.000000 | 14.900000 | 360.000000 | 100.000000 | 804.000000 | 2.030000 | 3.810000 |
Convert time format to datetime
print(df.dt_iso, df.dtypes)
df.dt_iso = pd.to_datetime(df.dt_iso, format='%Y-%m-%d %H:%M:%S +0000 UTC')
print(df.dt_iso, df.dtypes)
0 2018-01-01 00:00:00 +0000 UTC
1 2018-01-01 01:00:00 +0000 UTC
2 2018-01-01 02:00:00 +0000 UTC
3 2018-01-01 03:00:00 +0000 UTC
4 2018-01-01 04:00:00 +0000 UTC
...
19342 2020-03-16 19:00:00 +0000 UTC
19343 2020-03-16 20:00:00 +0000 UTC
19344 2020-03-16 21:00:00 +0000 UTC
19345 2020-03-16 22:00:00 +0000 UTC
19346 2020-03-16 23:00:00 +0000 UTC
Name: dt_iso, Length: 19347, dtype: object dt int64
dt_iso object
main.temp float64
main.temp_min float64
main.temp_max float64
main.feels_like float64
main.pressure int64
main.humidity int64
wind.speed float64
wind.deg int64
clouds.all int64
weather.id int64
rain.1h float64
rain.3h float64
dtype: object
0 2018-01-01 00:00:00
1 2018-01-01 01:00:00
2 2018-01-01 02:00:00
3 2018-01-01 03:00:00
4 2018-01-01 04:00:00
...
19342 2020-03-16 19:00:00
19343 2020-03-16 20:00:00
19344 2020-03-16 21:00:00
19345 2020-03-16 22:00:00
19346 2020-03-16 23:00:00
Name: dt_iso, Length: 19347, dtype: datetime64[ns] dt int64
dt_iso datetime64[ns]
main.temp float64
main.temp_min float64
main.temp_max float64
main.feels_like float64
main.pressure int64
main.humidity int64
wind.speed float64
wind.deg int64
clouds.all int64
weather.id int64
rain.1h float64
rain.3h float64
dtype: object
As observed below, temperature and pressure are negatively correlated with each other. Meanwhile, min, max, feels like and average temperature are positively correlated with each other. Therefore, only the temperature should be kept and others can be dropped.
df.sort_values(by=['dt'], inplace=True)
dfwithouttime = df.drop(['dt','dt_iso'], axis=1)
#dfwithouttime=(dfwithouttime-dfwithouttime.min())/(dfwithouttime.max()-dfwithouttime.min()) #normalize
fig, axs = plt.subplots(3, 4, figsize=(28, 15))
fig.subplots_adjust(hspace=.5)
i = 0
j = 0
for col in dfwithouttime.columns:
dfwithouttime[col].plot(ax=axs[i][j], title=col)
j += 1
if j == 4:
j = 0
i += 1
df.drop(['main.temp_min', 'main.temp_max', 'main.feels_like', 'main.pressure'], axis=1, inplace=True)
Traffic data
acci_time Accident time
acci_name categorization of the accident
acci_x Latitude
acci_y Longitude
Download data from Dubai Pulse
!wget http://data.bayanat.ae/ar/dataset/ad38cee7-f70e-4764-9c9d-aab760ce1026/resource/025ea6b2-a806-49c2-8294-4f3a97c09090/download/traffic_incidents-1.csv
!wget https://www.dubaipulse.gov.ae/dataset/c9263194-5ee3-4340-b7c0-3269b26acb43/resource/c3ece154-3071-4116-8650-e769d8416d88/download/traffic_incidents.csv
--2020-05-17 11:35:28-- http://data.bayanat.ae/ar/dataset/ad38cee7-f70e-4764-9c9d-aab760ce1026/resource/025ea6b2-a806-49c2-8294-4f3a97c09090/download/traffic_incidents-1.csv
Resolving data.bayanat.ae (data.bayanat.ae)... 185.141.13.100
Connecting to data.bayanat.ae (data.bayanat.ae)|185.141.13.100|:80... connected.
HTTP request sent, awaiting response... 302 Found : Moved Temporarily
Location: https://data.bayanat.ae/ar/dataset/ad38cee7-f70e-4764-9c9d-aab760ce1026/resource/025ea6b2-a806-49c2-8294-4f3a97c09090/download/traffic_incidents-1.csv [following]
--2020-05-17 11:35:30-- https://data.bayanat.ae/ar/dataset/ad38cee7-f70e-4764-9c9d-aab760ce1026/resource/025ea6b2-a806-49c2-8294-4f3a97c09090/download/traffic_incidents-1.csv
Connecting to data.bayanat.ae (data.bayanat.ae)|185.141.13.100|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2716239 (2.6M) [text/csv]
Saving to: ‘traffic_incidents-1.csv’
traffic_incidents-1 100%[===================>] 2.59M 166KB/s in 16s
2020-05-17 11:35:47 (163 KB/s) - ‘traffic_incidents-1.csv’ saved [2716239/2716239]
--2020-05-17 11:35:48-- https://www.dubaipulse.gov.ae/dataset/c9263194-5ee3-4340-b7c0-3269b26acb43/resource/c3ece154-3071-4116-8650-e769d8416d88/download/traffic_incidents.csv
Resolving www.dubaipulse.gov.ae (www.dubaipulse.gov.ae)... 91.73.143.12
Connecting to www.dubaipulse.gov.ae (www.dubaipulse.gov.ae)|91.73.143.12|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6323862 (6.0M) [text/csv]
Saving to: ‘traffic_incidents.csv’
traffic_incidents.c 100%[===================>] 6.03M 170KB/s in 38s
2020-05-17 11:36:29 (163 KB/s) - ‘traffic_incidents.csv’ saved [6323862/6323862]
df1 = pd.read_csv("traffic_incidents.csv")
df2 = pd.read_csv("traffic_incidents-1.csv")
print(df1.dtypes, df2.dtypes)
print(df1.shape[0] + df2.shape[0])
df_union= pd.concat([df1, df2]).drop_duplicates()
print(df_union.shape)
acci_id int64
acci_time object
acci_name object
acci_x float64
acci_y float64
dtype: object acci_id int64
acci_time object
acci_name object
acci_x float64
acci_y float64
dtype: object
79599
(65667, 5)
First of all, only the accident time parameter can be used with weather data, other columns can be drop.
df_union = df_union[['acci_time']]
print(df_union.shape, df_union.acci_time.unique().shape)
df_union.tail
(65667, 1) (64600,)
<bound method NDFrame.tail of acci_time
0 17/05/2020 13:12:42
1 17/05/2020 13:24:43
2 17/05/2020 13:26:02
3 17/05/2020 13:37:07
4 17/05/2020 13:44:31
... ...
23816 27/06/2019 11:21:09
23817 27/06/2019 11:21:35
23818 27/06/2019 11:24:07
23819 27/06/2019 11:24:27
23820 27/06/2019 11:25:26
[65667 rows x 1 columns]>
df_union.acci_time = pd.to_datetime(df_union.acci_time, format='%d/%m/%Y %H:%M:%S')
df_union.sort_values(by=['acci_time'], inplace=True)
Additionally, the count of the number of traffic accidents occurred within each one hour can be added as one column.
dfh = df_union.groupby([pd.Grouper(key='acci_time',freq='H')]).size().reset_index(name='count')
dfh['count'].plot()
dfh
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
acci_time | count | |
---|---|---|
0 | 2019-06-27 10:00:00 | 4 |
1 | 2019-06-27 11:00:00 | 19 |
2 | 2019-06-27 12:00:00 | 21 |
3 | 2019-06-27 13:00:00 | 26 |
4 | 2019-06-27 14:00:00 | 0 |
... | ... | ... |
7800 | 2020-05-17 10:00:00 | 0 |
7801 | 2020-05-17 11:00:00 | 0 |
7802 | 2020-05-17 12:00:00 | 1 |
7803 | 2020-05-17 13:00:00 | 11 |
7804 | 2020-05-17 14:00:00 | 5 |
7805 rows × 2 columns
After inner joining weather data and traffic accidents' data, only time range between 2019-06-27 10:00 and 2020-03-16 23:00 with a total of 6327 hours data can be used for analysis.
result = pd.merge(df, dfh, how='inner', left_on=['dt_iso'], right_on=['acci_time'])
result.drop(['acci_time', 'dt'], axis=1, inplace=True)
result.shape
(6327, 10)
fig, axs = plt.subplots(3, 3, figsize=(21, 14))
fig.subplots_adjust(hspace=.5)
i = 0
j = 0
for col in set(result.columns) - set(['dt_iso']):
result.plot(x='dt_iso',y=col,ax=axs[i][j], title=col)
j += 1
if j == 3:
j = 0
i += 1
As seen from above, all rain for 1-hour data is 0, so rain 1h can be dropped.
result.drop('rain.1h', axis=1, inplace=True)
Interestingly, there are some rain 3-hours cases. After research, there was a flood during that time. Since there is barely raining in Dubai; as a result, drainage measures might not be advanced in Dubai, and there was a significantly increasing number of traffic cases during that time duration.
Important features to predict accident counts per hour
X = result.drop(['count', 'dt_iso'],axis=1)
y = result['count']
X = (X-X.min())/(X.max()-X.min()) # normalize the data
y = (y-y.min())/(y.max()-y.min()) # normalize the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=23)
random_forest = RandomForestRegressor(n_estimators=100)
random_forest.fit(X_train, y_train)
rf_feature = pd.Series(random_forest.feature_importances_,index=X.columns)
rf_feature = rf_feature.sort_values()
print(rf_feature[::-1][:5])
fig, axs = plt.subplots(1, 2, figsize=(12, 8))
fig.subplots_adjust(wspace=.5)
rf_feature.plot(kind="barh", ax=axs[0], title='Random Forest')
xgb = XGBRegressor(n_estimators=100)
xgb.fit(X_train, y_train)
xgb_feature = pd.Series(xgb.feature_importances_,index=X.columns)
xgb_feature = xgb_feature.sort_values()
xgb_feature.plot(kind="barh", ax=axs[1], title="XGBoost")
print(xgb_feature[::-1][:5])
main.temp 0.297278
wind.deg 0.255076
main.humidity 0.184519
wind.speed 0.141623
clouds.all 0.074437
dtype: float64
[11:36:35] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
wind.deg 0.430483
wind.speed 0.118081
main.humidity 0.111731
main.temp 0.109670
clouds.all 0.101215
dtype: float32
After using both Random forest and XGBoost which are the most popular machine learning algorithms to evaluate the weather and traffic accidents, the windy condition and temperature are important features concluded by both algorithms.
The assumption would be rain is more observable than wind, so people may not go out during the heavy rainy day, and the number of cars decreases. However, people may not be aware of the strong wind, and as a result, the wind has a stronger impact on traffic accidents.
Another assumption would be the temperature also plays an important role in traffic accidents. Perhaps, people should be aware of climate change with greenhouse. Protection of the environment is important to avoid exterme weather condition.
def test(model):
pred = model.predict(X_test)
def cross_val(model):
res = cross_val_score(model, X_train, y_train, cv=10, n_jobs=-1)
print(np.mean(res))
cross_val(xgb)
test(xgb)
cross_val(random_forest)
test(random_forest)
0.21253033333798896
0.21488870400506527
Both models average similar error rates after 10 times cross validation.
Conclusion
The government should make people aware of climate change. Specifically, be aware of wind speed and build cars heavier and more stable.