---
title: "R Notebook"
output: html_notebook
---
# Introduction
This notebook is designed to clean and merge CSV files containing player ranking data. It includes steps to ensure data integrity by fixing ranking inconsistencies and appending year information based on filenames.
# Load Libraries
Start by loading necessary libraries for data manipulation.
```{r}
library(dplyr)
library(readr)
library(tidyr)
```
# Define Functions for Data Cleaning
## Fix Ranking Data
Create a function that repairs the `Rank` column: it fills each missing value with the nearest rank above it and converts the column to integer.
```{r}
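fix_ranking <- function(df) {
  df <- df %>%
    # Work on a character column so every input file is treated the same way
    mutate(Rank = as.character(Rank)) %>%
    # Carry the last seen rank down into rows where Rank is missing
    fill(Rank, .direction = "down") %>%
    # Non-numeric strings become NA (with a coercion warning)
    mutate(Rank = as.integer(Rank))
  return(df)
}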
```
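To make the fill-down step concrete, here is a minimal sketch that applies the same three steps to a toy table. The `Player` column and all values are hypothetical and are not taken from the real files.

```{r}
library(dplyr)
library(tidyr)

# Hypothetical toy data: rows 2 and 4 have no rank recorded
toy <- tibble(
  Rank = c("1", NA, "3", NA),
  Player = c("A", "B", "C", "D")
)

# Same steps as fix_ranking(): character, fill down, integer
toy_fixed <- toy %>%
  mutate(Rank = as.character(Rank)) %>%
  fill(Rank, .direction = "down") %>%
  mutate(Rank = as.integer(Rank))

toy_fixed$Rank
# → 1 1 3 3
```

Each missing rank takes the value of the row above it, so a leading `NA` (with nothing above it) would stay `NA`.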
## Read and Fix CSV Files
Define a function that reads one CSV file, applies the ranking fix, and adds a `Year` column extracted from the filename, which is expected to end in `_<4-digit year>.csv`.
```{r}
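read_and_fix_csv_with_year <- function(file_path) {
  df <- read_csv(file_path, show_col_types = FALSE)
  fixed_df <- fix_ranking(df)
  # Filenames are expected to end in "_<YYYY>.csv"; a non-matching name
  # yields NA (with a coercion warning) instead of a year
  year <- as.integer(gsub(".*_(\\d{4})\\.csv$", "\\1", basename(file_path)))
  fixed_df$Year <- year
  return(fixed_df)
}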
```
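The year extraction depends entirely on the filename pattern, so it may help to see what happens when a name does not match. The sketch below uses two hypothetical filenames (the real names in `top_100_each_year` are not shown in this notebook).

```{r}
# Hypothetical file paths: the first matches "_<YYYY>.csv", the second does not
example_paths <- c(
  "top_100_each_year/players_2019.csv",
  "top_100_each_year/players.csv"
)

year_pattern <- ".*_(\\d{4})\\.csv$"

# gsub() returns the input unchanged when the pattern does not match
extracted <- gsub(year_pattern, "\\1", basename(example_paths))
extracted
# → "2019" "players.csv"

# The unmatched name cannot be parsed as an integer, so it becomes NA
years <- suppressWarnings(as.integer(extracted))
years
# → 2019 NA
```

Checking for `NA` in `Year` after merging is therefore a quick way to spot files whose names do not follow the expected pattern.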
# Merge CSV Files
Combine multiple CSV files from a directory into a single dataset with corrected rankings and appended year information.
## Execute Merging Process
Define the merge function, then call it on the input directory. The function reads every CSV file in `dir_path`, applies the fixes above, stacks the results, and writes them to `output_file`.
```{r}
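merge_csv_files_with_year <- function(dir_path, output_file) {
  # `pattern` is a regular expression, not a shell glob: match names ending in ".csv"
  file_paths <- list.files(path = dir_path, pattern = "\\.csv$", full.names = TRUE)
  # Read and fix each file, then stack all rows into one data frame
  merged_df <- bind_rows(lapply(file_paths, read_and_fix_csv_with_year))
  write_csv(merged_df, output_file)
}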
```
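`list.files()` interprets `pattern` as a regular expression rather than a shell glob, so an anchored expression such as `"\\.csv$"` is the safer way to select CSV files. The sketch below illustrates this in a scratch directory; the directory and filenames are hypothetical and exist only for the demonstration.

```{r}
# Hypothetical scratch directory and filenames, used only for this demonstration
demo_dir <- file.path(tempdir(), "list_files_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(
  demo_dir,
  c("players_2019.csv", "players_2019.csv.bak", "csv_notes.txt")
)))

# Anchored regular expression: only names that end in ".csv"
found <- list.files(path = demo_dir, pattern = "\\.csv$", full.names = TRUE)
basename(found)
# → "players_2019.csv"

# glob2rx() (from utils) translates a shell glob into a regular expression
found_glob <- list.files(path = demo_dir, pattern = glob2rx("*.csv"))
found_glob
```

If glob syntax is preferred, wrapping it in `glob2rx()` keeps the intent readable while still passing `list.files()` a proper regular expression.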
```{r}
merge_csv_files_with_year("top_100_each_year", "dataset/rank.csv")
```
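The merged dataset now has integer ranks with gaps filled and a `Year` column.
It is not fully clean yet: not all values are alphanumeric, and not all rows are actual players, because some files contain duplicate header rows.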
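One possible way to handle the duplicate header rows is to drop them before the ranks are filled. This is only a sketch: it assumes a repeated header row carries the literal text `Rank` in the `Rank` column, and the `Player` column and values are hypothetical.

```{r}
library(dplyr)

# Hypothetical raw data with a header row repeated in the middle
raw <- tibble(
  Rank = c("1", "2", "Rank", "3"),
  Player = c("A", "B", "Player", "C")
)

# Keep rows with a missing rank (to be filled later) and drop repeated headers
no_headers <- raw %>%
  filter(is.na(Rank) | Rank != "Rank")

no_headers$Rank
# → "1" "2" "3"
```

Filtering before the fill step matters, because otherwise the text `Rank` would be carried down into any missing values that follow it.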