Ignacio 27/4/2022
For this project we will analyze data for a wellness technology company to help it make informed decisions. We will follow the six steps of data analysis from the Google Data Analytics course: Ask, Prepare, Process, Analyze, Share and Act.
Bellabeat is a company that makes health-focused smart products, marketed for women, to inspire and empower them with knowledge of their health and habits. The company wants us to analyze smart device usage to gain insight into how consumers use these devices. We then need to apply those insights to a selected Bellabeat product.
The main questions are:
- What are some trends in smart device usage?
- How could these trends apply to Bellabeat customers?
- How could these trends help influence Bellabeat marketing strategy?
In order to answer the main questions we are encouraged to use publicly available data from smart device users.
- The data selected for this project is the FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Möbius).
- In this dataset 30 FitBit users consented to the submission of their personal tracker data, including minute-level output for physical activity, heart rate, sleep monitoring and weight tracking.
- The data covers a month of tracker use, between the 12th of April and the 12th of May, 2016.
- The collected output varies because of the different types of FitBit tracker and the individual tracking preferences.
Using the ROCCC method to assess the credibility of the data:
Reliable: The data comes in different formats (due to the different trackers), is incomplete (due to different tracking preferences), and one of the files contains repeated values.
Original: The data comes from a third party provider.
Comprehensive: The parameters measured are quite comprehensive when it comes to a fitness device tracker. However, it only includes 30 participants.
Current: The data is from 2016. We would not expect the overall health of the measured population to have changed since then, but the devices and the parameters they can measure may have improved.
Cited: As third-party data, we have no information about a credible source.
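The repeated values and the participant count mentioned above can be checked directly. The following is a minimal sketch on a made-up data frame (`sleep_toy` and its values are hypothetical stand-ins, not the real file), using only base R:

```r
# Toy stand-in for a sleep log; the real CSV is not read here.
sleep_toy <- data.frame(
  Id = c(1, 1, 1, 2, 2),
  SleepDay = c("4/12/2016", "4/13/2016", "4/13/2016", "4/12/2016", "4/13/2016"),
  TotalMinutesAsleep = c(420, 390, 390, 450, 410)
)

n_participants <- length(unique(sleep_toy$Id))  # distinct users in the file
n_duplicates   <- sum(duplicated(sleep_toy))    # rows that repeat an earlier row exactly
sleep_clean    <- unique(sleep_toy)             # same data with the repeats dropped

print(c(participants = n_participants,
        duplicates   = n_duplicates,
        rows_after   = nrow(sleep_clean)))
```

The same three calls can be pointed at any of the loaded data frames to see how many users and repeated rows each file contains.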
As the Google Data Analytics course focuses on R as its programming language, I will use it for the rest of the analysis.
From a previous inspection of the data, there is some repetition in the content. The daily activity file contains the merged data from the daily calories, daily intensities and daily steps files. There is also detailed data by the hour and by the minute for calories, steps and intensities, which is summed up in the daily activity file. Moreover, there are sleep, weight and heart rate logs. We will be using the daily activity data, as well as the sleep and weight data.
## Loading the data and libraries
library(tidyverse)
library(lubridate)
library(data.table)
library(ggpmisc)
library(patchwork)
daily_activity <- read.csv("dailyActivity_merged.csv")
weight_log <- read.csv("weightLogInfo_merged.csv")
sleep_day <- read.csv("sleepDay_merged.csv")
Converting the date and time values to a proper datetime format and adding the weekday name for more information
Sys.setlocale("LC_TIME", "C")
## [1] "C"
daily_activity <- mutate(daily_activity, ActivityDate = mdy(ActivityDate))
daily_activity <- daily_activity %>% mutate(weekday = weekdays(ActivityDate))
daily_activity$weekday <- ordered(daily_activity$weekday, levels=c("Monday", "Tuesday", "Wednesday", "Thursday","Friday", "Saturday", "Sunday"))
sleep_day <- mutate(sleep_day, Date = mdy_hms(SleepDay))
sleep_day <- sleep_day %>% mutate(weekday = weekdays(Date))
sleep_day$weekday <- ordered(sleep_day$weekday, levels=c("Monday", "Tuesday", "Wednesday", "Thursday","Friday", "Saturday", "Sunday"))
weight_log <- mutate(weight_log, Date = mdy_hms(Date))
weight_log <- weight_log %>% mutate(weekday = weekdays(Date))
weight_log$weekday <- ordered(weight_log$weekday, levels=c("Monday", "Tuesday", "Wednesday", "Thursday","Friday", "Saturday", "Sunday"))
weight_log <- mutate(weight_log, Time = format(Date, "%H:%M:%S"))
weight_log <- mutate(weight_log, Date = format(Date, "%Y-%m-%d"))
# Add a color palette for the graphs
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
merge_act_sleep_weight <-unique(merge(daily_activity, sleep_day, by.x = c("Id","ActivityDate"),by.y=c("Id","Date"), all=TRUE) %>% merge(weight_log, by.x = c("Id","ActivityDate"),by.y=c("Id","Date"),all=TRUE))
merge_act_sleep_weight %>%
select(Distance = TotalDistance, Steps = TotalSteps, Calories, SedentaryTime = SedentaryMinutes, Sleep_minutes = TotalMinutesAsleep, WeightKg, BMI) %>%
summary()
## Distance Steps Calories SedentaryTime
## Min. : 0.000 Min. : 0 Min. : 0 Min. : 0.0
## 1st Qu.: 2.620 1st Qu.: 3790 1st Qu.:1828 1st Qu.: 729.8
## Median : 5.245 Median : 7406 Median :2134 Median :1057.5
## Mean : 5.490 Mean : 7638 Mean :2304 Mean : 991.2
## 3rd Qu.: 7.713 3rd Qu.:10727 3rd Qu.:2793 3rd Qu.:1229.5
## Max. :28.030 Max. :36019 Max. :4900 Max. :1440.0
##
## Sleep_minutes WeightKg BMI
## Min. : 58.0 Min. : 52.60 Min. :21.45
## 1st Qu.:361.0 1st Qu.: 61.40 1st Qu.:23.96
## Median :432.5 Median : 62.50 Median :24.39
## Mean :419.2 Mean : 72.04 Mean :25.19
## 3rd Qu.:490.0 3rd Qu.: 85.05 3rd Qu.:25.56
## Max. :796.0 Max. :133.50 Max. :47.54
## NA's :530 NA's :873 NA's :873
p1 <- merge_act_sleep_weight %>% ggplot() + geom_histogram(aes(x=TotalSteps),fill = cbPalette[1]) + labs(title = "Daily steps", x="Daily steps")
p2 <- merge_act_sleep_weight %>% ggplot() + geom_histogram(aes(x=TotalDistance),fill = cbPalette[2]) + labs(title = "Daily km", x="Daily km distance")
p3 <- merge_act_sleep_weight %>% ggplot() + geom_histogram(aes(x=Calories),fill = cbPalette[3]) + labs(title = "Daily calories", x="Daily calories burnt")
p4 <- merge_act_sleep_weight %>% ggplot() + geom_histogram(aes(x=TotalMinutesAsleep/60),fill = cbPalette[4]) + labs(title = "Hours of sleep", x="Hours of sleep", y="Number of daily records")
p5 <- merge_act_sleep_weight %>% ggplot() + geom_histogram(aes(x=SedentaryMinutes/60),fill = cbPalette[5]) + labs(title = "Daily sedentary hours", x="Daily sedentary hours", y="Number of daily records")
p1+p2+p3+p4 +p5
This summary shows the distribution of some of the data, which leads us to some preliminary conclusions.
- The histograms of steps and distance have a similar shape, which suggests a correlation between the two.
- The average number of steps recommended for adults is 10000 according to the CDC, which is about 8 km. In this data the participants average 7638 steps and 5.49 km, well below that recommendation.
- On the same note, the users spent an average of 991 minutes a day inactive, more than 16.5 hours.
- The average sleep recorded is 419.2 minutes, about 7 hours.
- The merged data contains NA’s for sleep and weight, meaning that many of the participants don’t have daily records of those activities.
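The conversions quoted in the list above are simple arithmetic on the summary table. As a quick check (the variable names are ours; the numbers are the means printed in the summary):

```r
# Means taken from the summary() output above
mean_sedentary_min <- 991.2
mean_sleep_min     <- 419.2
mean_steps         <- 7638
mean_distance_km   <- 5.490

sedentary_hours <- mean_sedentary_min / 60               # minutes to hours
sleep_hours     <- mean_sleep_min / 60                   # minutes to hours
share_of_target <- mean_steps / 10000                    # fraction of the 10000-step target
metres_per_step <- mean_distance_km * 1000 / mean_steps  # average stride implied by the data

print(round(c(sedentary_hours = sedentary_hours,
              sleep_hours     = sleep_hours,
              share_of_target = share_of_target,
              metres_per_step = metres_per_step), 2))
```

The implied stride of roughly 0.7 m per step is in line with the per-user `distance_per_step` values computed later in the analysis.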
# Pivot the table to check whether there is step activity for each user on each day
pivot_daily_steps <- daily_activity %>% select(Id,ActivityDate,TotalSteps) %>% pivot_wider(names_from = ActivityDate, values_from=TotalSteps, values_fill = NaN)
pivot_daily_steps$num_na <- rowSums(is.na(pivot_daily_steps))
pivot_daily_steps %>% select(Id,num_na) %>% arrange(desc(num_na))
## # A tibble: 33 x 2
## Id num_na
## <dbl> <dbl>
## 1 4057192912 27
## 2 2347167796 13
## 3 8253242879 12
## 4 3372868164 11
## 5 6775888955 5
## 6 7007744171 5
## 7 6117666160 3
## 8 6290855005 2
## 9 8792009665 2
## 10 1644430081 1
## # ... with 23 more rows
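In the daily activity data, one user has only 4 days with recorded data and 27 missing values. The table has 33 rows, so the file contains 33 distinct user Ids rather than the 30 participants mentioned in the dataset description.
We can see here how much data the users recorded over the 30-day period.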
p1 <- unique(sleep_day) %>% group_by(Id) %>% summarise(n= n()) %>% ggplot(aes(x=n)) + geom_histogram(fill = cbPalette[2]) + labs(title = "Sleep records per user") + xlab("Days with activity") + ylab("Number of users")
p2 <- weight_log %>% group_by(Id) %>% summarise(n=n()) %>% ggplot(aes(x=n)) + geom_histogram(fill = cbPalette[3]) + labs(title = "Weight records per user") + xlab("Days with activity") + ylab("Number of users")
daily_min_pivot <- daily_activity %>% group_by(Id,ActivityDate) %>% summarise(min_record = sum(SedentaryMinutes+LightlyActiveMinutes+FairlyActiveMinutes+VeryActiveMinutes))
p3 <- daily_min_pivot %>% group_by(Id) %>% summarise(daily_use = mean(min_record), n=n()) %>% arrange(n) %>% ggplot(aes(x=n)) + geom_histogram(fill = cbPalette[4]) + labs(title = "Number of days with recorded activity per user") + xlab("Days with activity") + ylab("Number of users")
p1+p2+p3
While most of the participants recorded steps, calories and activity for most of the month, weight and sleep were not tracked as consistently.
The histogram in Fig 2 shows that weight is tracked by the fewest users, only 8 participants.
weight_log %>% ggplot(aes(x=day(Date),y=WeightKg)) + geom_col() + facet_wrap(~Id)
- Only 8 out of 30 participants have a minimum of 1 weight input
- Only 1 of those 8 uses an automatic method for a periodic input of weight.
- Only 2 of those 8 have more than 5 weight inputs in 30 days.
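The three counts above can be derived from the weight log rather than read off the plot. This is a sketch on made-up data; it assumes the log has an `IsManualReport` column holding "True"/"False" strings, which should be checked against the actual file before reuse:

```r
# Toy weight log; Id values and the IsManualReport column are assumptions.
weight_toy <- data.frame(
  Id = c(1, 1, 2, 3, 3, 3, 3, 3, 3),
  IsManualReport = c("True", "True", "True",
                     "False", "False", "False", "False", "False", "False")
)

entries <- table(weight_toy$Id)                                  # weight entries per user
manual  <- tapply(weight_toy$IsManualReport == "True",
                  weight_toy$Id, sum)                            # manual entries per user

per_user <- data.frame(
  Id      = names(entries),
  entries = as.vector(entries),
  manual  = as.vector(manual)
)

users_with_weight <- nrow(per_user)              # users with at least 1 entry
users_automatic   <- sum(per_user$manual == 0)   # users whose entries are all automatic
users_over_5      <- sum(per_user$entries > 5)   # users with more than 5 entries

print(per_user)
```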
p1 <- daily_activity %>%
ggplot(aes(x=TotalDistance,y=Calories)) + geom_point() + labs(title = "Correlation between calories and distance") + xlab("Daily distance (km)") + ylab("Daily calories") +geom_smooth(method=lm, se=FALSE, col='red', size=1)
p2 <- daily_activity %>%
ggplot(aes(x=TotalDistance,y=TotalSteps)) + geom_point() + labs(title = "Correlation between steps and distance") + xlab("Daily distance (km)") + ylab("Daily steps") +geom_smooth(method=lm, se=FALSE, col='red', size=1)
p3 <- daily_activity %>%
ggplot(aes(x=TotalSteps,y=Calories)) + geom_point() + labs(title = "Correlation between steps and calories") + xlab("Daily steps") + ylab("Daily calories") +geom_smooth(method=lm, se=FALSE, col='red', size=1)
p1+p2+p3
As speculated before, there is a strong correlation between steps and distance. The correlation of calories with either steps or distance is not as strong.
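To put a number on "strong" and "not as strong", the Pearson coefficient can be computed with `cor()`. The sketch below uses invented values (`toy` is hypothetical, not the FitBit data), so only the method carries over, not the coefficients:

```r
# Invented daily records: distance tracks steps closely, calories only loosely.
toy <- data.frame(
  TotalSteps    = c(2000, 4000, 6000, 8000, 10000, 12000),
  TotalDistance = c(1.4, 2.9, 4.2, 5.8, 7.1, 8.6),
  Calories      = c(1900, 1700, 2400, 2100, 2900, 2500)
)

r_steps_distance <- cor(toy$TotalSteps, toy$TotalDistance)  # Pearson by default
r_steps_calories <- cor(toy$TotalSteps, toy$Calories)

# Full pairwise matrix, rounded for reading
print(round(cor(toy), 3))
```

On the real data the same call would be `cor(daily_activity$TotalSteps, daily_activity$TotalDistance)`, and likewise for the other pairs.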
# Average time per activity type across all users
daily_pivot <- daily_activity %>% group_by(Id) %>% summarise(sedentary=mean(SedentaryMinutes),lightly=mean(LightlyActiveMinutes),fairly = mean(FairlyActiveMinutes),very=mean(VeryActiveMinutes))
daily_pivot <- as.data.frame(melt(as.data.table(daily_pivot),id.vars= "Id",variable.name = "category", value.name = "value"))
daily_pivot %>% group_by(category) %>% summarise(mean=mean(value)) %>% ggplot(aes(x=category,y=mean/60, fill=category)) + geom_bar(position="dodge", stat = "identity") + labs(title = "Average time spent for each activity intensity", x="Activity", y="Average time in hours") + scale_fill_manual(name="Type of activity", values = cbPalette)
The distribution of time per activity intensity is heavily skewed toward sedentary activity, with an average of 16 hours, and very little time spent on fairly active or very active exercise.
To study this further, we can group the participants into 4 categories according to their average steps:
- Below 5000 steps: Sedentary
- Between 5000 and 8000 steps: Lightly active
- Between 8000 and 10000 steps: Fairly active
- 10000 steps or more: Very active
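How the boundary values fall into these categories is easiest to see on a few test numbers. The following sketch uses base R's `cut()` as an alternative way to express the same thresholds; `mean_steps_toy` is a made-up vector chosen to sit on the boundaries:

```r
# Made-up averages placed on and next to each threshold
mean_steps_toy <- c(4999, 5000, 7999, 8000, 9999, 10000)

user_type <- cut(
  mean_steps_toy,
  breaks = c(-Inf, 5000, 8000, 10000, Inf),
  labels = c("sedentary", "lightly active", "fairly active", "very active"),
  right  = FALSE,          # intervals are [lower, upper), so 5000 is lightly active
  ordered_result = TRUE    # keeps sedentary < lightly < fairly < very
)

print(data.frame(mean_steps_toy, user_type))
```

With `right = FALSE`, a value exactly on a threshold goes to the higher category, so 10000 steps counts as very active.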
daily_activity %>% group_by(Id) %>%filter(Id!=4057192912) %>% summarise(mean_steps = mean(TotalSteps), mean_distance = mean(TotalDistance), distance_per_step = mean(TotalDistance)*1000/mean(TotalSteps),n=n()) %>% arrange(n)
## # A tibble: 32 x 5
## Id mean_steps mean_distance distance_per_step n
## <dbl> <dbl> <dbl> <dbl> <int>
## 1 2347167796 9520. 6.36 0.668 18
## 2 8253242879 6482. 4.67 0.720 19
## 3 3372868164 6862. 4.71 0.686 20
## 4 6775888955 2520. 1.81 0.720 26
## 5 7007744171 11323. 8.02 0.708 26
## 6 6117666160 7047. 5.34 0.758 28
## 7 6290855005 5650. 4.27 0.756 29
## 8 8792009665 1854. 1.19 0.640 29
## 9 1644430081 7283. 5.30 0.727 30
## 10 3977333714 10985. 7.52 0.684 30
## # ... with 22 more rows
daily_active <- unique(daily_activity) %>% group_by(Id) %>% summarise(mean_steps = mean(TotalSteps), mean_distance = mean(TotalDistance), mean_calories = mean(Calories),sd_step = sd(TotalSteps), sd_distance=sd(TotalDistance),sd_calories = sd(Calories),n=n()) %>% arrange(n) %>% mutate(user = case_when(
mean_steps >= 10000~"very active", mean_steps < 10000 & mean_steps >=8000~"fairly active", mean_steps <8000 & mean_steps>=5000~"lightly active", mean_steps <5000~"sedentary"
)) %>% merge(daily_activity,by="Id")
daily_active$user <- ordered(daily_active$user, levels=c("sedentary", "lightly active", "fairly active", "very active"))
daily_active %>% group_by(user) %>% summarise(n=n(), labels = paste(toString(round((n/940)*100)),"%",sep="")) %>% ggplot(aes(x="", y=n, fill=user)) + geom_bar(stat = "identity", width = 1) +coord_polar("y", start=0) + theme_minimal() + theme(axis.title.x= element_blank(), axis.title.y = element_blank(), panel.border = element_blank(), panel.grid = element_blank(), axis.ticks = element_blank(), axis.text.x = element_blank(), plot.title = element_text(hjust = 0.5, size=14, face = "bold")) +scale_fill_manual(values =cbPalette) +geom_text(aes(label = labels), position = position_stack(vjust = 0.5))+ labs(title="User type distribution")
#### After clustering by average steps, we can see if we find any pattern for activity level.
daily_pivot2 <- daily_active %>% group_by(user) %>% arrange(mean_steps) %>% summarise(sedentary=mean(SedentaryMinutes)/60,lightly=mean(LightlyActiveMinutes)/60,fairly = mean(FairlyActiveMinutes)/60,very=mean(VeryActiveMinutes)/60) %>% arrange()
daily_pivot2 <- as.data.frame(melt(as.data.table(daily_pivot2),id.vars= "user",variable.name = "category", value.name = "value"))
daily_pivot2 %>% mutate(name= fct_relevel(user,"sedentary","lightly active","fairly active","very active")) %>% ggplot(aes(x=name,y=value, fill=name)) + geom_bar(position="dodge", stat = "identity") + facet_wrap(~category, scales = "free") +theme_minimal() + labs(title="Type of activity per user type", x= "User type", y= "Hours spent") + scale_fill_manual(name="Type of user", values = cbPalette) + theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust=1))
Next, we can check whether the tracked data varies with the day of the week.
p1 <- daily_active %>% group_by(weekday) %>% summarise(mean_steps = mean(TotalSteps)) %>% ggplot(aes(x=weekday,y=mean_steps)) + geom_col(fill=cbPalette[7]) +theme(axis.text.x = element_text(angle = 45,vjust = 0.5, hjust = 1)) +labs(title="Steps per day", x= "Day", y="Steps")
p2 <- sleep_day %>% group_by(weekday) %>% summarise(mean_sleep = mean(TotalMinutesAsleep)/60) %>% ggplot(aes(x=weekday,y=mean_sleep)) + geom_col(fill=cbPalette[8]) +theme(axis.text.x = element_text(angle = 45,vjust = 0.5, hjust = 1)) +labs(title="Hours sleep per day", x= "Day", y="Hours slept")
p1+p2
Interestingly, but maybe not surprisingly, Sundays are the days with the fewest steps and the most hours of sleep.
Checking the sleep records, we see that only 15 participants (out of 33) have 10 or more days of sleep recorded out of the 30 possible. There are also 9 users with fewer than 9 records, and another 9 with no sleep records from their FitBit. Considering only the regular users, we see an average sleep time of 7 hours and 16 minutes, with a standard deviation of 53 minutes.
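The regular-user figures can be computed along the following lines. This is a sketch on made-up data, and it assumes that "regular" means 10 or more sleep records and that the mean and standard deviation are pooled over records rather than taken over per-user averages; the original calculation may have differed:

```r
# Toy sleep log: users A and B are regular (>= 10 records), C is not.
sleep_toy <- data.frame(
  Id = rep(c("A", "B", "C"), times = c(12, 10, 3)),
  TotalMinutesAsleep = c(rep(c(420, 450), 6), rep(c(400, 440), 5), 300, 310, 320)
)

records_per_user <- table(sleep_toy$Id)
regular_ids <- names(records_per_user)[records_per_user >= 10]  # assumed cutoff
regular     <- sleep_toy[sleep_toy$Id %in% regular_ids, ]

mean_min <- mean(regular$TotalMinutesAsleep)
sd_min   <- sd(regular$TotalMinutesAsleep)

# Express the mean as hours and minutes
m <- as.integer(round(mean_min))
mean_label <- sprintf("%d h %02d min", m %/% 60L, m %% 60L)

print(mean_label)
print(round(sd_min, 1))
```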
heartrate_seconds <- read.csv("heartrate_seconds_merged.csv")
heartrate_seconds <- mutate(heartrate_seconds, Time = mdy_hms(Time))
heartrate_seconds <- heartrate_seconds %>% mutate(weekday = weekdays(Time))
heartrate_seconds$weekday <- ordered(heartrate_seconds$weekday, levels=c("Monday", "Tuesday", "Wednesday", "Thursday","Friday", "Saturday", "Sunday"))
weekday_p <- heartrate_seconds %>% filter(weekday %in% c("Monday","Tuesday","Wednesday", "Thursday","Friday")) %>% group_by(hour=format(Time,"%H")) %>% summarise(n(), mean_val=mean(Value)) %>% ggplot(aes(x=hour,y=mean_val, fill=mean_val)) + geom_col() + labs(title = "Weekday average heartbeat",x="Hour of the day",y="Average pulse") + theme_minimal() +scale_fill_gradient(name="Average heartbeat",low = "blue", high = "red")
weekend_p <- heartrate_seconds %>% filter(weekday %in% c("Saturday","Sunday")) %>% group_by(hour=format(Time,"%H")) %>% summarise(n(), mean_val=mean(Value)) %>% ggplot(aes(x=hour,y=mean_val, fill=mean_val)) + geom_col() + labs(title = "Weekend average heartbeat", x="Hour of the day",y="Average pulse") + theme_minimal() +scale_fill_gradient(name="Average heartbeat",low = "blue", high = "red")
weekday_p / weekend_p
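The peak hour can also be read off numerically instead of from the bars. A minimal sketch, assuming a data frame with a `Time` and a `Value` column like `heartrate_seconds`; `hr_toy` and the helper `peak_hour()` are hypothetical:

```r
# Toy heart-rate log: 2016-04-12 is a Tuesday, 2016-04-16 a Saturday.
hr_toy <- data.frame(
  Time = as.POSIXct(c("2016-04-12 09:00:00", "2016-04-12 09:30:00",
                      "2016-04-12 16:00:00", "2016-04-12 16:30:00",
                      "2016-04-16 13:00:00", "2016-04-16 13:20:00",
                      "2016-04-16 20:00:00"), tz = "UTC"),
  Value = c(70, 74, 95, 99, 90, 92, 65)
)

hr_toy$hour <- format(hr_toy$Time, "%H")
# POSIXlt wday is 0 for Sunday and 6 for Saturday, independent of the locale
hr_toy$is_weekend <- as.POSIXlt(hr_toy$Time)$wday %in% c(0, 6)

# Hour with the highest mean heart rate in a subset
peak_hour <- function(d) {
  hourly <- tapply(d$Value, d$hour, mean)
  names(hourly)[which.max(hourly)]
}

weekday_peak <- peak_hour(hr_toy[!hr_toy$is_weekend, ])
weekend_peak <- peak_hour(hr_toy[hr_toy$is_weekend, ])
print(c(weekday = weekday_peak, weekend = weekend_peak))
```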
#sleep_day %>% ggplot(aes(x=TotalTimeInBed,y=TotalMinutesAsleep)) + geom_point() + facet_wrap(~Id)
unique(sleep_day) %>% group_by(Id) %>% summarise(sleep = mean(TotalMinutesAsleep), n=n()) %>% ggplot(aes(x=n,y=sleep/60, color=sleep/60)) + geom_point() + annotate("text",x=25, y=10 ,label = "Daily users") + annotate("rect",xmin=22, xmax=32,ymin=4.5, ymax=9, alpha=0.2, fill="Green") + annotate("text",x=15, y=10 ,label = "Frequent users") + annotate("rect",xmin=14, xmax=19,ymin=7, ymax=8.5, alpha=0.2, fill="Yellow") + annotate("text", x=5, y=10, label = "Rare users") + labs(title = "Average sleep vs use of the FitBit", color="Hours slept") + xlab("Days recording sleep") + ylab("Average sleep in hours")
# Count days once per user (Id, user) so that every user weighs equally in the average
p1 <- daily_active %>% group_by(Id, user) %>% summarise(n=n(), .groups="drop") %>% group_by(user) %>% summarise(mean_days=mean(n)) %>% ggplot(aes(x=user,y=mean_days)) + geom_col(fill=cbPalette[3]) + labs(title="Days with tracked activity by user's activity", x="Type of user", y="Days with activity") + theme(axis.text.x = element_text(angle = 45,vjust = 1, hjust = 1))
merge_act_sleep_weight <-unique(merge(daily_active, sleep_day, by.x = c("Id","ActivityDate"),by.y=c("Id","Date"), all=TRUE) %>% merge(weight_log, by.x = c("Id","ActivityDate"),by.y=c("Id","Date"),all=TRUE))
p2 <- merge_act_sleep_weight[!is.na(merge_act_sleep_weight$TotalMinutesAsleep),] %>% group_by(Id, user) %>% summarise(n=n(), .groups="drop") %>% group_by(user) %>% summarise(mean_sleep = mean(n)) %>% ggplot(aes(x=user,y=mean_sleep)) + geom_col(fill=cbPalette[2]) + labs(title="Days with tracked sleep by user's activity", x="Type of user", y="Days with activity") + theme(axis.text.x = element_text(angle = 45,vjust = 1, hjust = 1))
p3 <- merge_act_sleep_weight[!is.na(merge_act_sleep_weight$WeightKg),] %>% group_by(Id, user) %>% summarise(n=n(), .groups="drop") %>% group_by(user) %>% summarise(mean_weight = mean(n)) %>% ggplot(aes(x=user,y=mean_weight)) + geom_col(fill=cbPalette[2]) + labs(title="Days with tracked weight by user's activity", x="Type of user", y="Days with activity") + theme(axis.text.x = element_text(angle = 45,vjust = 1, hjust = 1))
p1+p2+p3
While we don’t see a major difference in the amount of tracked data for daily steps according to user activity, we do see a difference in sleep tracking. Users with a more sedentary life tend to track their sleep less.
As mentioned earlier, Bellabeat is a company that since 2013 has contributed to empowering women by giving them knowledge about their activity, sleep, stress and reproductive health. After analyzing smart tracker data from a third-party source, we can present some conclusions from our analysis.
1 - Among the participants in the study we see an average of 16 hours of sedentary activity daily (Fig 5). Those hours could certainly be reduced in favor of more activity. However, as we saw when sorting the data by step count, the users with higher step counts did not get there by reducing their sedentary time much, but rather by being more active during the rest of the day (Fig 7). To encourage this, we could send a reminder before the highest peaks of activity, which are at 16:00 on weekdays and 13:00 on weekends (Fig 9). A reward system could be implemented to reinforce it.
2 - There are very few sleep or weight records among the users in this study. Weight had to be tracked manually or with a device compatible with FitBit (Fig 3). Sleep, however, could be tracked with the same smart tracker, and only around 40% of users had records for more than 20 days (Fig 10). Although we have no information on why the users decided not to track their sleep, Bellabeat could improve the information it gives users about their sleeping behavior to encourage tracking. The size and comfort of the smart tracker could be important for its use during the night.
3 - While Bellabeat claims to inform its users about stress, there was no calculated stress parameter in the FitBit data analyzed. Stress could potentially be estimated from the differential increase in heart rate, together with the pedometer data. Implementing a stress alert would be an improvement to the device that could set Bellabeat apart from its competitors.
4 - Last but not least, when it comes to women’s reproductive health we cannot forget the menstrual cycle. We do not know whether the FitBit data came from male or female users, and it contained no menstrual cycle tracking. Adding this feature could be useful for Bellabeat users, so we suggest incorporating it into the device’s options.
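To make the stress idea in point 3 concrete, here is one possible sketch: flag minutes where the heart rate is well above a resting baseline while the step count shows the user is not moving. Everything in it is hypothetical (the `minute_toy` data, the median-based baseline and both thresholds); it is an illustration of the idea, not a validated stress measure:

```r
# Toy minute-level data: heart rate in beats per minute, steps per minute
minute_toy <- data.frame(
  minute    = 1:8,
  heartrate = c(62, 64, 63, 95, 98, 110, 112, 66),
  steps     = c(0, 0, 0, 0, 2, 90, 100, 0)
)

# Hypothetical baseline: median heart rate over minutes with no steps
resting_hr <- median(minute_toy$heartrate[minute_toy$steps == 0])
hr_margin  <- 20  # hypothetical margin above baseline, in beats per minute
max_steps  <- 5   # hypothetical cutoff for "not moving", in steps per minute

# Elevated heart rate without movement is flagged; elevated heart rate
# during walking or running (minutes 6 and 7 here) is not.
minute_toy$possible_stress <-
  minute_toy$heartrate > resting_hr + hr_margin & minute_toy$steps <= max_steps

print(minute_toy)
```

Combining the two signals is what separates exercise from a possible stress response, which is why the pedometer data matters as much as the heart rate.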