
##Getting and Cleaning Data Course Project

Welcome! This is my attempt at the Course Project. I am new to R, so keep that in mind if you find something done in a very crude way. I am still learning, and advice on coding is always appreciated.

Anyway!

##Files related

*This readme markdown document: instructions on how to acquire the data, the packages used, and the references and hints used.

*The R script named run_analysis.R: contains all the code with commentary.

*The codebook markdown document: describes all the variables and data transformation.

*The new dataset as a txt file.

##Getting the data.

You will have to manually download the data from this url: https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip

Save it in a directory named "data" inside your working directory (i.e. "yourworkingdirectory/data/UCIHARDataset.zip").

Yes, THIS IS ACCEPTABLE for the Course Project.

Alternatively, you can run the following code in R to make sure everything is where it should be.

    if (!file.exists("data")) { dir.create("data") }

    fileUrl <- "https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip"

    # mode = "wb" keeps the zip file intact on Windows
    download.file(fileUrl, destfile = "./data/dataset.zip", mode = "wb")

Unzip the file using your preferred software.
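If you would rather stay inside R, here is a minimal sketch of unzipping with the base function `unzip`. The helper name `unzip_if_present` is made up for this example, and it assumes the zip was saved as "./data/dataset.zip" as in the download code above.

```r
# Hypothetical helper: unzip the archive only if it is actually there.
unzip_if_present <- function(zip_path, exdir = "./data") {
  if (!file.exists(zip_path)) {
    message("Zip file not found: ", zip_path)
    return(FALSE)
  }
  # utils::unzip extracts the archive contents into exdir
  unzip(zip_path, exdir = exdir)
  TRUE
}

# Assumes the destfile used in the download step
unzip_if_present("./data/dataset.zip")
```

The function returns TRUE when it found and extracted the archive, and FALSE otherwise.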

##Installing required packages.

To run the script you will have to install the following packages if you haven't already:

    install.packages("plyr")
    install.packages("reshape2")
    install.packages("dplyr")
    install.packages("tidyr")

Installing the plyr package can take some time. The script itself loads the packages into R.
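To avoid reinstalling packages you already have, you could first check which ones are missing. This is only a sketch: the helper `missing_packages` is a made-up name, and the install call is left commented out so nothing is installed by accident.

```r
# Hypothetical helper: return the packages from pkgs that are not installed yet.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

needed <- c("plyr", "reshape2", "dplyr", "tidyr")
to_install <- missing_packages(needed)

if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to actually install them
}
```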

##Run the script.

Just run the whole "run_analysis.R" script and wait. You will find the new data in "./data/newUCIHARdataset.txt".

##How does it work?

The script is divided into the five required steps and is commented throughout.

During STEP 1 it reads all the necessary data text files, loads them into data frames, and combines them using the cbind and rbind functions.
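As a rough illustration of the STEP 1 idea, here is a sketch with tiny made-up data frames standing in for the real files. The object names and values are assumptions for the example, not what run_analysis.R actually uses.

```r
# Toy stand-ins for the measurement, activity and subject files (made-up values)
x_train <- data.frame(V1 = c(0.1, 0.2), V2 = c(0.3, 0.4))
y_train <- data.frame(activity = c(5, 4))
subject_train <- data.frame(subject = c(1, 1))

x_test <- data.frame(V1 = c(0.5, 0.6), V2 = c(0.7, 0.8))
y_test <- data.frame(activity = c(2, 3))
subject_test <- data.frame(subject = c(2, 2))

# cbind puts measurements, activity and subject side by side
train <- cbind(x_train, y_train, subject_train)
test <- cbind(x_test, y_test, subject_test)

# rbind stacks the training rows on top of the test rows
full <- rbind(train, test)
```

Here `full` ends up with 4 rows and 4 columns: the two measurement columns plus "activity" and "subject".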

During STEP 2 it names the columns of the dataset using the names contained in features.txt plus two extra labels, "activity" and "subject", which are not in that file. These are assigned as the column names of the dataset using the function colnames. Next, it subsets all variables whose names contain "mean()" or "std()" into a new data frame and then removes those with "Freq" in their names, as they involve the weighted mean. Finally it cbinds the new data frame with the "activity" and "subject" columns of the dataset.
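The column selection in STEP 2 could look something like the sketch below. The four feature names are just a small sample in the style of features.txt, the numbers are dummy values, and the exact patterns used in run_analysis.R may differ.

```r
# A few sample feature names (illustrative only)
feature_names <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X",
                   "fBodyAcc-meanFreq()-X", "tBodyAcc-max()-X")

# Dummy dataset: 4 feature columns plus activity and subject
dataset <- as.data.frame(matrix(1:12, nrow = 2))
colnames(dataset) <- c(feature_names, "activity", "subject")

# Keep the names that mention mean or std ...
keep <- grep("mean|std", colnames(dataset), value = TRUE)
# ... and drop the ones that mention Freq
keep <- keep[!grepl("Freq", keep)]

# Put the selected measurements back together with activity and subject
selected <- cbind(dataset[, keep], dataset[, c("activity", "subject")])
```

With this sample, only the "mean()" and "std()" columns survive next to "activity" and "subject".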

During STEP 3 it reads the provided "activity_labels.txt", creates an object containing the activity codes, and then rewrites the activity column of the data frame using the names and the match function.
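Here is a small sketch of the match idea from STEP 3. The `activity_labels` data frame is a hand-typed stand-in for what would be read from activity_labels.txt, and the column names `code` and `name` are assumptions for the example.

```r
# Stand-in for the contents of activity_labels.txt
activity_labels <- data.frame(
  code = 1:6,
  name = c("WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS",
           "SITTING", "STANDING", "LAYING"),
  stringsAsFactors = FALSE
)

# Activity column as numeric codes (made-up values)
activity <- c(5, 1, 6, 1)

# match() finds the row of each code; indexing by it returns the label
activity_named <- activity_labels$name[match(activity, activity_labels$code)]
```

After this, `activity_named` holds the descriptive labels in the same order as the original codes.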

During STEP 4 it uses the function gsub to replace all abbreviations in the variable names with more human-readable ones and stores the result in the object "varnames". Then it assigns that object to the colnames of the data frame.
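A sketch of how gsub can expand abbreviations, as in STEP 4. The specific replacements below are my guesses for the example; the ones in run_analysis.R may be different.

```r
# Two sample variable names (illustrative only)
varnames <- c("tBodyAcc-mean()-X", "fBodyGyro-std()-Z")

# Each gsub call swaps one abbreviation for a longer word (assumed replacements)
varnames <- gsub("^t", "time", varnames)        # leading t
varnames <- gsub("^f", "frequency", varnames)   # leading f
varnames <- gsub("Acc", "Accelerometer", varnames)
varnames <- gsub("Gyro", "Gyroscope", varnames)
```

The result can then be assigned with something like `colnames(dataset) <- varnames`, where `dataset` stands for whatever the data frame is called.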

During STEP 5 it loads the packages "plyr" and "reshape2". Then it reshapes the data frame using the function melt. Melt takes all the columns except the ones singled out as id variables and stacks them into a single column, so the resulting data frame has one column for the variable names and one for the values. After that it runs the function "ddply" from "plyr" to apply a calculation (in this case the mean) to all the subsets we're interested in, and assigns the result to a new final data frame. Then it loads "tidyr" and uses the function "spread" to transform the melted data into wide-form tidy data. Finally it runs the function "write.table" to write the new data frame as a text file to "./data/newUCIHARdataset.txt".
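The melt, ddply and spread chain from STEP 5 can be sketched on a toy data frame like this. It needs reshape2, plyr and tidyr installed. The column names and values are made up, and the real script may call these functions with different arguments.

```r
library(reshape2)
library(plyr)
library(tidyr)

# Toy data: two measurements for two subject/activity pairs (made-up values)
toy <- data.frame(
  subject = c(1, 1, 2, 2),
  activity = c("WALKING", "WALKING", "SITTING", "SITTING"),
  timeBodyAccMeanX = c(1, 3, 5, 7),
  timeBodyAccStdX = c(2, 4, 6, 8),
  stringsAsFactors = FALSE
)

# Long form: one row per subject, activity, variable and value
molten <- melt(toy, id.vars = c("subject", "activity"))

# Mean of every variable for each subject/activity pair
averages <- ddply(molten, c("subject", "activity", "variable"),
                  plyr::summarise, avg = mean(value))

# Back to wide form: one column per variable
tidy_means <- spread(averages, variable, avg)
```

In this toy case `tidy_means` has one row per subject/activity pair and one column of averages per measurement.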

##Reading the new file

If you want to read the new file, you can run the following code.

    x <- "./data/newUCIHARdataset.txt"
    newdataset <- read.table(x, header = TRUE)

##About the script.

This script was created using R version 3.0.3 (2014-03-06) -- "Warm Puppy"

I could not have finished this without help from various sources.

*Figuring out how to start the project: David Hood at Coursera

*Getting only the columns with mean and std: StackOverflow

*Getting rid of certain columns: StackOverflow StackOverflow

*Hints to rename the activity labels: StackOverflow

*Getting help renaming the variable names: rfunction.com

*Using dplyr and reshape2 to get the tidy data for the last step: r-bloggers

And that is it. Many thanks to everyone at Coursera forums and StackOverflow.

I hope this makes sense to you reading this!

-SevenWatch
