Getting-and-Cleaning-Data--course-3

Description of the variables, the data, and the transformations performed to clean up the data
-----------------------------------------------------------------------------------------------

The data sets used in this course project are downloaded from: https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip

A full description of the problem is available at the site where the data was obtained: http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphone

The R code is in "run_analysis.R", which performs the following steps:

STEP 1. The script merges the training and test data:

first, it merges "X_train.txt" and "X_test.txt" into the "merged_X" data set (10299 x 561);
second, it merges "Y_train.txt" and "Y_test.txt" into the "merged_Y" data set (10299 x 1); and
finally, it merges "subject_train.txt" and "subject_test.txt" into the "merged_subject" data set (10299 x 1).
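
The merge in STEP 1 amounts to stacking the rows of the training and test files. A minimal sketch, assuming rbind() is used (this README does not say which function run_analysis.R calls) and using small made-up data frames named train_X and test_X in place of the real files:

```r
# Toy stand-ins for the tables read from "X_train.txt" and "X_test.txt".
# The names train_X and test_X are hypothetical; the real tables have 561 columns.
train_X <- data.frame(V1 = c(0.1, 0.2, 0.3), V2 = c(1.1, 1.2, 1.3))
test_X  <- data.frame(V1 = c(0.4, 0.5),      V2 = c(1.4, 1.5))

# Stack the rows: training rows first, then test rows.
merged_X <- rbind(train_X, test_X)

# Number of rows is the sum of both inputs; number of columns is unchanged.
dim(merged_X)
```

Whatever function is used, the Y and subject files need to be stacked in the same train-then-test order so that row i of merged_X, merged_Y and merged_subject still describes the same observation.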

STEP 2. The script extracts only the measurements on the mean and standard deviation for each measurement. To do this, it reads features.txt (561 measurements in total) and uses the grep function on the second column of features.txt to select only the names containing "-mean" and "-std". The selected columns are stored back in merged_X, which is now a 10299 x 66 data set; that is, only 66 of the 561 measurements are "-mean" or "-std" measurements. In this step the script also sets descriptive variable names for merged_X by replacing all matches of "(|)" with " " and assigning the result with the names() function.
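
The selection in STEP 2 can be sketched as follows. The exact regular expressions in run_analysis.R are not shown in this README, so the two patterns below are assumptions: the first matches "-mean()" and "-std()" literally (so a name such as "-meanFreq()" is left out), and the second simply drops the parentheses from the selected names. The tables are small made-up stand-ins.

```r
# Toy stand-in for features.txt: column 1 is the index, column 2 the feature name.
features <- data.frame(
  V1 = 1:5,
  V2 = c("tBodyAcc-mean()-X", "tBodyAcc-std()-X", "tBodyAcc-mad()-X",
         "fBodyAcc-meanFreq()-X", "tBodyGyro-mean()-Z"),
  stringsAsFactors = FALSE
)

# Toy measurement table with one column per feature (2 rows x 5 columns).
merged_X <- as.data.frame(matrix(seq_len(10), nrow = 2, ncol = 5))

# Assumed pattern: "-mean()" or "-std()" exactly.
keep <- grep("-mean\\(\\)|-std\\(\\)", features$V2)

# Keep only the matching columns.
merged_X <- merged_X[, keep]

# Assumed clean-up: remove "(" and ")" from the selected names.
names(merged_X) <- gsub("\\(|\\)", "", features$V2[keep])

names(merged_X)
```

On the toy table this keeps columns 1, 2 and 5 and names them without parentheses.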

STEP 3. The script uses descriptive activity names to name the activities in the data set. The activity names are given in "activity_labels.txt". The script first reads activity_labels.txt into a temporary variable called tmp_activities (6 x 2 data set), then replaces every "_" in the activity names with " ", and sets the descriptive variable name with the names() function.
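
A minimal sketch of STEP 3, using made-up tables. The lookup by position and the column name "activity" are assumptions; this README does not state how run_analysis.R maps the codes or what the column is called.

```r
# Toy stand-in for activity_labels.txt: id in column 1, label in column 2.
tmp_activities <- data.frame(
  V1 = 1:3,
  V2 = c("WALKING", "WALKING_UPSTAIRS", "SITTING"),
  stringsAsFactors = FALSE
)

# Replace every underscore in the labels with a space.
tmp_activities$V2 <- gsub("_", " ", tmp_activities$V2)

# Toy activity codes, one per observation.
merged_Y <- data.frame(V1 = c(2L, 1L, 3L, 2L))

# Look up each code's label by position (assumes the ids are 1..n in row order).
merged_Y[, 1] <- tmp_activities$V2[merged_Y[, 1]]

# Assumed descriptive column name.
names(merged_Y) <- "activity"

merged_Y
```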

STEP 4. The script labels the data set with descriptive variable names. Here it works on the merged_subject data set and sets "subject" as its column name.

STEP 5. From the data set in step 4, the script creates a second, tidy data set that contains only the average of each variable for each activity and each subject. First, it combines the three data sets merged_subject, merged_Y and merged_X into tidy_ds (10299 x 68 data set) and writes it to "tidy_data_course project_step 4.txt" with the write.table() function. Then it calculates the average of each variable for each activity and each subject. This is done with a FOR loop controlled by the number of activities, subjects and variables in tidy_ds (STEP 4). Inside the loop, the means (averages) of each variable are calculated and the "final" data set is built up. Finally, the script writes "final" to "tidy_data_course project_step 5.txt". The table is written with row.name=FALSE, as suggested in the description of the problem.
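
The combination and averaging in STEP 5 can be sketched like this. It is an illustration on made-up data, not the loop from run_analysis.R: the use of cbind(), the loop structure, the helper names (rows, grp, measure_cols, out_file) and the temporary output file are all assumptions.

```r
# Toy versions of the three pieces built in the earlier steps.
merged_subject <- data.frame(subject = c(1, 1, 1, 2, 2))
merged_Y <- data.frame(
  activity = c("WALKING", "WALKING", "SITTING", "WALKING", "WALKING"),
  stringsAsFactors = FALSE
)
merged_X <- data.frame(
  "tBodyAcc-mean-X" = c(0.2, 0.4, 0.1, 0.5, 0.7),
  "tBodyAcc-std-X"  = c(1, 3, 2, 4, 6),
  check.names = FALSE
)

# Combine column-wise: subject, activity, then the measurements.
tidy_ds <- cbind(merged_subject, merged_Y, merged_X)

subjects     <- sort(unique(tidy_ds$subject))
activities   <- sort(unique(tidy_ds$activity))
measure_cols <- setdiff(names(tidy_ds), c("subject", "activity"))

# One output row per (subject, activity) pair that has observations.
rows <- list()
for (s in subjects) {
  for (a in activities) {
    grp <- tidy_ds[tidy_ds$subject == s & tidy_ds$activity == a,
                   measure_cols, drop = FALSE]
    if (nrow(grp) == 0) next  # skip pairs with no observations
    rows[[length(rows) + 1]] <- data.frame(
      subject = s, activity = a, t(colMeans(grp)),
      check.names = FALSE, stringsAsFactors = FALSE
    )
  }
}
final <- do.call(rbind, rows)

# Write without row names; a temporary file stands in for the real output path.
out_file <- tempfile(fileext = ".txt")
write.table(final, out_file, row.names = FALSE)

final
```

On the toy data, subject 1 has two WALKING rows that collapse into one averaged row, and the empty pair (subject 2, SITTING) is skipped, so "final" has three rows.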