Technical Challenge 99 Data • New York City taxicabs • Isis Santos Costa, Nov 2020 - Jan 2021
- Data Extraction
- Data Transformation
- Data Transformation, part 2
- Data Transformation, part 2 (addendum)
- Data Transformation, part 3
- EDA
- Mᴏᴅᴇʟɪɴɢ: Time Series Analysis • Forecasting (wip)
- Mᴏᴅᴇʟɪɴɢ: Machine Learning (soon!)
- Rᴇᴘᴏʀᴛ ғᴏʀ ᴛʜᴇ 🎯 CEO 🎯
- Rᴇᴘᴏʀᴛ ғᴏʀ ᴛʜᴇ 💲 CFO 💲 (soon!)
- Rᴇᴘᴏʀᴛ ғᴏʀ ᴛʜᴇ 🎮 COO 🎮 (soon!)
- Pʀᴇᴘᴀʀɪɴɢ ғᴏʀ ᴛʜᴇ ғᴜᴛᴜʀᴇ: Aggregate Data
Notebook 1: DATA EXTRACTION
- Connecting to DB
- From DB into DF
- DF into CSV
- Nᴇxᴛ: Transformation
Notebook 2: DATA TRANSFORMATION
- Setting the DataFrames
- Taking a first look
- Naming the columns
- Basic inter-table consistency
- Checking & improving the data structure
- Trips x Orders
- Payment type
- Rate type
- Exporting results for retrieval
- Continues as Data Transformation, part 2
Notebook 3: DATA TRANSFORMATION, part 2
- Resuming: retrieving the DFs
- Checking & improving the data structure (cont'd)
- Rate type (cont'd)
- Basic information
- Converting strings into datetime
- Feature engineering: datetime 📅
- Orders
- Trips
- Treating missing data
- Exporting results for retrieval
- Continues as Data Transformation, part 2 (addendum)
Notebook 4: DATA TRANSFORMATION, part 2 (addendum)
- Resuming: retrieving the DFs
- Feature engineering: datetime 📅 (cont'd) [ Adding further new features for better analysis ]
- 📈 Adding further new features for better analysis
- Orders
- Month
- Week in the month
- Trips
- Month
- Week in the month
- Exporting results for retrieval
- Continues as Data Transformation, part 3 (Final!) (lat, long) ➔ neighborhoods
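The "Week in the month" feature listed for Notebook 4 can be sketched as below. This outline does not say how the notebook defines it, so the rule used here (days 1-7 are week 1, days 8-14 are week 2, and so on) is an assumption, and the function name `week_in_month` is hypothetical.

```python
from datetime import date


def week_in_month(d: date) -> int:
    """Return the 1-based week of the month for a date.

    Assumed rule (not confirmed by the notebook): days 1-7 -> week 1,
    days 8-14 -> week 2, ..., days 29-31 -> week 5.
    """
    return (d.day - 1) // 7 + 1


# Example: the first and last days of March 2014
first, last = date(2014, 3, 1), date(2014, 3, 31)
print(week_in_month(first), week_in_month(last))
```

Under this rule the fifth week covers at most three days, which is one reason a frequency distribution by week of the month would need to be normalized by the number of days in each week.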
Notebook 5: DATA TRANSFORMATION, part 3
- Resuming: retrieving the DFs
- Combining the tables: outer join 🔗
- Listing information added when orders get converted: trips_features
- Adding trips_features to the table of all orders
- 🗃️ Reordering the columns of the new DF df_orders_tF (𝘵ransformed, 𝘍inal)
- Feature engineering: 📍coordinates into 🗺️neighborhoods (reverse geocoding)
- Preparing for heavy processing:
- Exporting the dataframe as transformed so far 📤
- Autotime + tqdm (progress bar)
- Importing: Geopandas • Geopy (Nominatim+RateLimiter) • PyPlot • Plotly_express
- Constructing Geocoder
- Reverse geocoding
- Pilot test ✔️
- Full scale 📈, with Nominatim:
- 1st trial, by list ❌
- nth trial, by list ❌
- ( Resuming after Kernel shut down • Retrieving data: nice it'd been saved! 😅 )
- [ try / except ] ❌
- [ try / except ] querying item by item ✔️ (kinda... 90 days!)
- [ numba / njit / numpy vectorize ] ✔️ (kinda: not fast enough)
- Full scale 📈, taking Einstein's advice:
- « As simple as possible...
- ... but not simpler »
- Geometrical approach: using points and polynomials to define each borough
- Getting boroughs coordinates from Google Maps
- Defining a function for a « good enough » location of neighborhood
- Reverse Geocoding: the light way ✔️✔️✔️
- Getting coordinates of Manhattan regions from Google Maps
- Defining a « good enough » function to locate Manhattan regions
- (Further) Reverse Geocoding: Manhattan regions ✔️✔️✔️
- Preparing for heavy processing:
- Lessons learned
- Preparing for the future: Creating aggregate data tables
- Daily data
- Data by passenger
- Exporting results for retrieval
- Nᴇxᴛ: EDA
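The "light way" of reverse geocoding outlined for Notebook 5 replaces a web service call per row with a local geometric test. The sketch below is not the notebook's code: it swaps in a standard ray-casting point-in-polygon test as one possible way to do this, and the names `point_in_polygon`, `locate`, and `TOY_REGIONS` are hypothetical. The regions are toy squares on a unit grid, not real NYC coordinates.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that horizontal line
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def locate(lon, lat, regions, default="Other"):
    """Return the name of the first region containing the point."""
    for name, polygon in regions.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return default


# Toy regions for illustration only (NOT real borough boundaries)
TOY_REGIONS = {
    "A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "B": [(1, 0), (2, 0), (2, 1), (1, 1)],
}

print(locate(0.5, 0.5, TOY_REGIONS))  # point inside square A
print(locate(3.0, 3.0, TOY_REGIONS))  # point outside both squares
```

A test like this runs locally with no rate limit, so it may scale to every order in the table, at the cost of a coarser, "good enough" region label.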
Notebook 6: EDA • EXPLORATORY DATA ANALYSIS
- Resuming: retrieving the DF, final transformed
- Data by order
- Daily data
- Data by passenger
- Exᴘʟᴏʀᴀᴛᴏʀʏ Dᴀᴛᴀ Aɴᴀʟʏsɪs • Organizing features into classes
- Features List • Data by order
- Features List • Daily data
- Features List • Data by passenger
- Features classes • Data by order
- Features classes • Daily data
- Features classes • Data by passenger
- Exᴘʟᴏʀᴀᴛᴏʀʏ Dᴀᴛᴀ Aɴᴀʟʏsɪs • Importing matplotlib
- Exᴘʟᴏʀᴀᴛᴏʀʏ Dᴀᴛᴀ Aɴᴀʟʏsɪs • Raw data (by order) ‖ Correlations
- Predictor x Predictor: Trip duration vs. distance
- Predictor x Predictor: Dropoff latitude vs. Pickup latitude
- Predictor x Predictor: Dropoff longitude vs. Pickup longitude
- Predictor x Predictor: Dropoff datetime vs. Pickup datetime
- Predictor x Predictor: Speed vs. Pickup hour
- Target x Predictor: tip_amount (Hᴀᴘᴘɪɴᴇss) vs. Pickup datetime
- Target x Predictor: total_amount (Rᴇᴠᴇɴᴜᴇ) vs. Pickup datetime
- Target x Predictor: tip_amount (Hᴀᴘᴘɪɴᴇss) vs. Trip duration
- Target x Predictor: total_amount (Rᴇᴠᴇɴᴜᴇ) vs. Trip duration
- Target x Predictor: tip_amount (Hᴀᴘᴘɪɴᴇss) vs. Trip length
- Target x Predictor: total_amount (Rᴇᴠᴇɴᴜᴇ) vs. Trip length
- Exᴘʟᴏʀᴀᴛᴏʀʏ Dᴀᴛᴀ Aɴᴀʟʏsɪs • Aggregations: frequency distributions
- Tʀɪᴘs: Frequency distribution by month (%)
- Tʀɪᴘs: Frequency distribution by week of the month (%) [normalized]
- Tʀɪᴘs: Frequency distribution by day of week (%)
- Tʀɪᴘs: Frequency distribution by pickup hour (%)
- Tʀɪᴘs: Frequency distribution by pickup time of the day (%)
- Tʀɪᴘs: Frequency distribution by pickup time of the day (% hourly demand)
- Tʀɪᴘs: Frequency distribution by trajectory duration (%)
- Tʀɪᴘs: Frequency distribution by trajectory length (%)
- Tʀɪᴘs: Frequency distribution by pickup borough (%)
- Tʀɪᴘs: Frequency distribution by dropoff borough (%)
- Tʀɪᴘs: Frequency distribution by rate type (%)
- Tʀɪᴘs: Frequency distribution by payment type (%)
- Tʀɪᴘs: Frequency distribution by tip amount (%)
- Tʀɪᴘs: Frequency distribution by total amount (%)
- Tʀɪᴘs: Frequency distribution by average speed, mph (%)
- Exporting results for retrieval
- Nᴇxᴛ: Mᴏᴅᴇʟɪɴɢ: Time Series Analysis • Forecasting
Notebook 9: Rᴇᴘᴏʀᴛ ғᴏʀ ᴛʜᴇ CEO ‖ Mar-May 2014 Operations
- Cᴏɴᴄᴇᴘᴛᴜᴀʟ Fʀᴀᴍᴇᴡᴏʀᴋ • Vɪsɪᴏɴᴀʀʏ
- Dʀɪᴠɪɴɢ Qᴜᴇsᴛɪᴏɴs
- Hɪɢʜʟɪɢʜᴛs
- PROFIT 📈: REVENUE PROFILE
- PEOPLE 👥: PASSENGERS' ROUTINE
- Sᴜᴘᴘᴏʀᴛɪɴɢ Dᴀᴛᴀ: BUSINESS SUSTAINABILITY • PROFIT 📈
- CWGR: Compound Weekly Growth Rate
- WEEKLY PROFILE
- Sᴜᴘᴘᴏʀᴛɪɴɢ Dᴀᴛᴀ: BUSINESS SUSTAINABILITY • PEOPLE 👥
- % TRIPS BY DURATION ⏲️
- % TRIPS BY DISTANCE
- % TRIPS BY TIME OF THE DAY 🌃
- Sᴜᴘᴘᴏʀᴛɪɴɢ Dᴀᴛᴀ: BUSINESS SUSTAINABILITY • PLANET 🌎
- SOON!
- See also:
- Rᴇᴘᴏʀᴛ ғᴏʀ ᴛʜᴇ CFO
- Rᴇᴘᴏʀᴛ ғᴏʀ ᴛʜᴇ COO
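The CWGR (Compound Weekly Growth Rate) item in the CEO report can be illustrated with a short sketch. The formula is assumed here by analogy with the usual compound annual growth rate, CWGR = (last / first)^(1 / n) - 1, where n is the number of weekly intervals between the two observations. The report's exact definition is not given in this outline, and the function name `cwgr` is hypothetical.

```python
def cwgr(first_week_value, last_week_value, n_weeks):
    """Compound weekly growth rate over n_weeks weekly intervals.

    Assumed definition (by analogy with CAGR):
        (last_week_value / first_week_value) ** (1 / n_weeks) - 1
    """
    if first_week_value <= 0 or n_weeks <= 0:
        raise ValueError("first_week_value and n_weeks must be positive")
    return (last_week_value / first_week_value) ** (1 / n_weeks) - 1


# Illustrative numbers only: growing from 100 to 121 over two weekly
# intervals corresponds to roughly 10% compound growth per week.
rate = cwgr(100.0, 121.0, 2)
print(f"{rate:.2%}")
```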