---
title: "Session 5 Summary"
author: "Batool Almarzouq"
date: "`r Sys.Date()`"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Loading packages

```{r}
# load the tidyverse (data wrangling) and here (project-relative file paths)
library(tidyverse)
library(here)
```

## Working directory

```{r}
getwd()
```

## Loading the data

```{r}
interviews <- read_csv(here("data", "SAFI_clean.csv"), na = "NULL")

```
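The `na = "NULL"` argument tells `read_csv()` to treat the literal text `NULL` as a missing value. The sketch below illustrates the effect on a made-up two-row CSV passed as literal text. The values are invented for illustration, and passing literal data with `I()` assumes readr 2.0 or later.

```{r}
library(readr)

# Hypothetical inline CSV; I() marks the string as literal data, not a file path
toy_csv <- I("village,memb_assoc\nChirodzo,yes\nRuaca,NULL\n")

# Read the same text twice: once with the defaults, once with na = "NULL"
toy_default <- read_csv(toy_csv, show_col_types = FALSE)
toy_na      <- read_csv(toy_csv, na = "NULL", show_col_types = FALSE)

toy_default$memb_assoc  # "NULL" is kept as ordinary text
toy_na$memb_assoc       # "NULL" is read as a missing value (NA)
```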

## Inspect the data

```{r}
head(interviews)
view(interviews)
glimpse(interviews)
```

## Selecting a single column

```{r}

interviews$no_membrs

```

## Selecting columns and filtering rows

```{r}

# select specific columns by name
interviews %>%
    select(village, no_membrs, months_lack_food)

# select a range of adjacent columns with `:`
interviews %>%
    select(village:respondent_wall_type)

```

To choose rows based on specific criteria, we can use the `filter()` function.

```{r, purl = FALSE}
# keep only the observations where the village name is "Chirodzo"
interviews %>%
    filter(village == "Chirodzo")
```

Inside `filter()`, conditions separated by commas are combined with "and". We
can also form "and" statements explicitly with the `&` operator instead of
commas:

```{r}

interviews %>%
    filter(village == "Chirodzo" &
               rooms > 1 &
               no_meals > 2)
```

To form "or" statements we use the logical operator for "or", which is the vertical bar (`|`):

```{r}
# filters observations with the "|" logical operator
# the output dataframe satisfies AT LEAST ONE of the specified conditions
interviews %>%
    filter(village == "Chirodzo" | village == "Ruaca")

```
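To make the difference between the operators concrete, here is a minimal sketch on a small made-up tibble (the rows are invented and are not part of the survey data). It checks that comma-separated conditions and `&` keep the same rows, and shows `%in%` as a compact alternative to chaining `==` with `|`.

```{r}
library(dplyr)

# Hypothetical toy data; values are invented for illustration
toy_filter <- tibble(
  village  = c("Chirodzo", "Ruaca", "God", "Chirodzo"),
  rooms    = c(1, 3, 2, 2),
  no_meals = c(2, 3, 3, 3)
)

# Comma-separated conditions and `&` keep the same rows
with_comma <- toy_filter %>% filter(village == "Chirodzo", rooms > 1)
with_and   <- toy_filter %>% filter(village == "Chirodzo" & rooms > 1)
identical(with_comma, with_and)

# `%in%` is a compact alternative to chaining `==` with `|`
with_or <- toy_filter %>% filter(village == "Chirodzo" | village == "Ruaca")
with_in <- toy_filter %>% filter(village %in% c("Chirodzo", "Ruaca"))
identical(with_or, with_in)
```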

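*Pipes* let you take the output of one function and send it directly to the
next, which is useful when you need to do several things to the same dataset.
The `%>%` pipe used throughout this summary comes from the **`magrittr`**
package and is loaded with the tidyverse. Here we filter and then select in a
single chain: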

```{r}

interviews %>%
    filter(village == "Chirodzo") %>%
    select(village:respondent_wall_type)

```
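To see what the pipe saves us, the sketch below writes the same filter-then-select step twice on a small made-up tibble (invented values): once as nested function calls and once as a pipeline. Both forms should give the same result; the piped version simply reads in the order the steps happen.

```{r}
library(dplyr)

# Hypothetical toy data; values are invented for illustration
toy_pipe <- tibble(
  village   = c("Chirodzo", "Ruaca", "Chirodzo"),
  no_membrs = c(3, 7, 10),
  rooms     = c(1, 3, 2)
)

# Nested calls are read from the inside out
nested <- select(filter(toy_pipe, village == "Chirodzo"), village, no_membrs)

# The piped version reads top to bottom, one step per line
piped <- toy_pipe %>%
  filter(village == "Chirodzo") %>%
  select(village, no_membrs)

identical(nested, piped)
```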

If we want to create a new object with this smaller version of the data, we
can assign it a new name:

```{r, purl = FALSE}
interviews_ch <- interviews %>%
    filter(village == "Chirodzo") %>%
    select(village:respondent_wall_type)

interviews_ch

```

> ## Exercise (10 mins)
>
>  Using pipes, subset the `interviews` data to include interviews
> where respondents were members of an irrigation association
> (`memb_assoc`) and retain only the columns `affect_conflicts`,
> `liv_count`, and `no_meals`.
>
> > ## Solution
> >
> > ```{r}
> > interviews %>%
> >     filter(memb_assoc == "yes") %>%
> >     select(affect_conflicts, liv_count, no_meals)
> > ```
> {: .solution}
{: .challenge}

### Mutate

Frequently you'll want to create new columns based on the values in existing
columns.

We might be interested in the ratio of number of household members
to rooms used for sleeping (i.e. avg number of people per room):


```{r, purl = FALSE}
interviews %>%
    mutate(people_per_room = no_membrs / rooms)
```
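A single `mutate()` call can create more than one column, and a later column can use one created earlier in the same call. A minimal sketch on made-up data follows; the `crowded` column and its threshold of 3 are hypothetical and only serve the illustration.

```{r}
library(dplyr)

# Hypothetical toy data; values are invented for illustration
toy_rooms <- tibble(no_membrs = c(3, 7, 10), rooms = c(1, 3, 2))

toy_ratio <- toy_rooms %>%
  mutate(people_per_room = no_membrs / rooms,
         # `crowded` is a made-up flag built from the column defined just above
         crowded = people_per_room > 3)

toy_ratio
```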

We may be interested in investigating whether being a member of an irrigation
association had any effect on the ratio of household members to rooms. To look
at this relationship, we will first remove the rows where the respondent didn't
answer the question of whether they were a member of an irrigation association.
These cases are recorded as "NULL" in the raw file and, because we passed
`na = "NULL"` to `read_csv()`, they appear as `NA` in `interviews`.

To remove these cases, we could insert a `filter()` in the chain:

```{r, purl = FALSE}
interviews %>%
    filter(!is.na(memb_assoc)) %>%
    mutate(people_per_room = no_membrs / rooms)
```

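The `!` symbol negates the result of `is.na()`. Where `is.na()` returns `TRUE`
(because `memb_assoc` is missing), `!` turns it into `FALSE`, so `filter()`
keeps only the rows where `memb_assoc` **is not** missing.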
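The negation can be checked on a plain vector, independent of the survey data. The three answers below are made up for illustration.

```{r}
# Hypothetical answers; NA marks a respondent who did not answer
memb <- c("yes", NA, "no")

is.na(memb)          # TRUE where the answer is missing
!is.na(memb)         # TRUE where an answer is present
memb[!is.na(memb)]   # keeps only the answered values
```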

> ## Exercise (15 mins)
>
>  Create a new dataframe from the `interviews` data that meets the following
>  criteria: contains only the `village` column and a new column called
>  `total_meals` containing a value that is equal to the total number of meals
>  served in the household per day on average (`no_membrs` times `no_meals`).
>  Only the rows where `total_meals` is greater than 20 should be shown in the
>  final dataframe.
>
>  **Hint**: think about how the commands should be ordered to produce this data
>  frame!
>
> > ## Solution
> >
> > ```{r}
> > interviews_total_meals <- interviews %>%
> >     mutate(total_meals = no_membrs * no_meals) %>%
> >     filter(total_meals > 20) %>%
> >     select(village, total_meals)
> > ```
> {: .solution}
{: .challenge}


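### Counting

When working with data, we often want to know the number of observations found
for each factor or combination of factors. For this task, **`dplyr`** provides
`count()`. For example, if we wanted to count the number of rows of data for
each village, we would do: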

```{r, purl = FALSE}
interviews %>%
    count(village)
```

For convenience, `count()` provides the `sort` argument to get results in
decreasing order:

```{r, purl = FALSE}
interviews %>%
    count(village, sort = TRUE)
```
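`count()` can be thought of as shorthand for grouping and then counting the rows in each group with `n()`. The sketch below compares the two on a made-up column of village names (invented values) and should give the same counts either way.

```{r}
library(dplyr)

# Hypothetical toy data; values are invented for illustration
toy_villages <- tibble(
  village = c("Ruaca", "God", "Ruaca", "Chirodzo", "Ruaca", "God")
)

# Shorthand
counted <- toy_villages %>% count(village)

# The same counts written out with group_by() and summarize()
counted_by_hand <- toy_villages %>%
  group_by(village) %>%
  summarize(n = n())

identical(counted$n, counted_by_hand$n)
```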

### Split-apply-combine data analysis and the `summarize()` function

Many data analysis tasks can be approached using the *split-apply-combine*
paradigm: split the data into groups, apply some analysis to each group, and
then combine the results. **`dplyr`** makes this very easy through the use of
the `group_by()` function.

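`group_by()` is often used together with `summarize()`, which collapses each
group into a single-row summary. So to compute the average household size by
village: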

```{r, purl = FALSE}
interviews %>%
    group_by(village) %>%
    summarize(mean_no_membrs = mean(no_membrs))
```
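To show what `group_by()` and `summarize()` do behind the scenes, here is a rough sketch of the three split-apply-combine steps written out by hand in base R, next to the equivalent pipeline. The data are made up, and the by-hand version is only an illustration of the idea, not how **`dplyr`** is implemented.

```{r}
library(dplyr)

# Hypothetical toy data; values are invented for illustration
toy_households <- tibble(
  village   = c("Ruaca", "God", "Ruaca", "God"),
  no_membrs = c(4, 6, 8, 10)
)

# Split: one vector of household sizes per village
pieces <- split(toy_households$no_membrs, toy_households$village)

# Apply: compute the mean of each piece
means <- sapply(pieces, mean)

# Combine: gather the per-village results into one table
manual_means <- tibble(village = names(means), mean_no_membrs = unname(means))

# The same three steps as one dplyr pipeline
dplyr_means <- toy_households %>%
  group_by(village) %>%
  summarize(mean_no_membrs = mean(no_membrs))

manual_means
dplyr_means
```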

You may also have noticed that the output from these calls doesn't run off the
screen anymore. This is one of the advantages of a tibble (`tbl_df`) over a
plain data frame.

You can also group by multiple columns:

```{r, purl = FALSE}
interviews %>%
    group_by(village, memb_assoc) %>%
    summarize(mean_no_membrs = mean(no_membrs))
```

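Note that the output is still a grouped tibble: by default `summarize()` drops
only the last grouping variable, so the result remains grouped by `village`. To
obtain an ungrouped tibble, use the `ungroup()` function: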

```{r, purl = FALSE}
interviews %>%
    group_by(village, memb_assoc) %>%
    summarize(mean_no_membrs = mean(no_membrs)) %>%
    ungroup()
```
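As an alternative to a separate `ungroup()` step, `summarize()` accepts a `.groups` argument (this assumes dplyr 1.0.0 or later). The sketch below, on made-up data, uses `group_vars()` to inspect which grouping variables remain in each case.

```{r}
library(dplyr)

# Hypothetical toy data; values are invented for illustration
toy_groups <- tibble(
  village    = c("Ruaca", "Ruaca", "God", "God"),
  memb_assoc = c("yes", "no", "yes", "yes"),
  no_membrs  = c(4, 6, 8, 10)
)

# Default behavior: only the last grouping variable is dropped
still_grouped <- toy_groups %>%
  group_by(village, memb_assoc) %>%
  summarize(mean_no_membrs = mean(no_membrs))

# .groups = "drop" returns an ungrouped tibble directly
dropped <- toy_groups %>%
  group_by(village, memb_assoc) %>%
  summarize(mean_no_membrs = mean(no_membrs), .groups = "drop")

group_vars(still_grouped)  # grouping by village remains
group_vars(dropped)        # no grouping variables left
```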

When grouping both by `village` and `memb_assoc`, we see rows in our table for
respondents who did not specify whether they were a member of an irrigation
association. We can exclude those data from our table using a filter step.


```{r, purl = FALSE}
interviews %>%
    filter(!is.na(memb_assoc)) %>%
    group_by(village, memb_assoc) %>%
    summarize(mean_no_membrs = mean(no_membrs))
```

Once the data are grouped, you can also summarize multiple variables at the same
time (and not necessarily on the same variable). For instance, we could add a
column indicating the minimum household size for each village for each group
(members of an irrigation association vs not):

```{r, purl = FALSE}
interviews %>%
    filter(!is.na(memb_assoc)) %>%
    group_by(village, memb_assoc) %>%
    summarize(mean_no_membrs = mean(no_membrs),
              min_membrs = min(no_membrs))
```

It is sometimes useful to rearrange the result of a query to inspect the values.
For instance, we can sort on `min_membrs` to put the group with the smallest
household first:


```{r, purl = FALSE}
interviews %>%
    filter(!is.na(memb_assoc)) %>%
    group_by(village, memb_assoc) %>%
    summarize(mean_no_membrs = mean(no_membrs),
              min_membrs = min(no_membrs)) %>%
    arrange(min_membrs)
```

To sort in descending order, we need to add the `desc()` function. If we want to
sort the results by decreasing order of minimum household size:

```{r, purl = FALSE}
interviews %>%
    filter(!is.na(memb_assoc)) %>%
    group_by(village, memb_assoc) %>%
    summarize(mean_no_membrs = mean(no_membrs),
              min_membrs = min(no_membrs)) %>%
    arrange(desc(min_membrs))
```
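`arrange()` also accepts more than one column, in which case later columns break ties in the earlier ones. A small sketch on a made-up summary table (invented values):

```{r}
library(dplyr)

# Hypothetical summary table; values are invented for illustration
toy_summary <- tibble(
  village    = c("Ruaca", "God", "Chirodzo"),
  min_membrs = c(2, 3, 2)
)

# Smallest minimum household size first
ascending <- toy_summary %>% arrange(min_membrs)

# Largest minimum household size first
descending <- toy_summary %>% arrange(desc(min_membrs))

# Ties in min_membrs are broken alphabetically by village
tie_broken <- toy_summary %>% arrange(desc(min_membrs), village)

tie_broken
```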

> ## Exercise
>
> How many households in the survey have an average of
> two meals per day? Three meals per day? Are there any other numbers
> of meals represented?
>
> > ## Solution
> >
> > ```{r}
> > interviews %>%
> >    count(no_meals)
> > ```
> {: .solution}
>
> Use `group_by()` and `summarize()` to find the mean, min, and max
> number of household members for each village. Also add the number of
> observations (hint: see `?n`).
>
> > ## Solution
> >
> > ```{r}
> > interviews %>%
> >   group_by(village) %>%
> >   summarize(
> >       mean_no_membrs = mean(no_membrs),
> >       min_no_membrs = min(no_membrs),
> >       max_no_membrs = max(no_membrs),
> >       n = n()
> >   )
> > ```
> {: .solution}
>
> What was the largest household interviewed in each month?
>
> > ## Solution
> >
> > ```{r}
> > # if not already included, add month, year, and day columns
> > library(lubridate) # load lubridate if not already loaded
> > interviews %>%
> >     mutate(month = month(interview_date),
> >            day = day(interview_date),
> >            year = year(interview_date)) %>%
> >     group_by(year, month) %>%
> >     summarize(max_no_membrs = max(no_membrs))
> > ```
> {: .solution}
{: .challenge}