Updates after review from intro to lesson 05
mattahrens committed Nov 9, 2023
1 parent 25d98f4 commit fe4016b
Showing 7 changed files with 49 additions and 22 deletions.
6 changes: 3 additions & 3 deletions docs/01-What-is-data.md
@@ -26,7 +26,7 @@ To expand on the definition of data as facts, I suggest that we think of data as

Similarly, if I told you the information *6 feet, 6 inches*, that wouldn't necessarily be data. But if I tell you that the average height of NBA basketball players is *6 feet, 6 inches*, that is a piece of data. And it becomes a piece of data that can be compared to another piece of data, such as the fact that the average height of an NFL football player is *6 feet, 2 inches*.

-Another important aspect of the definition of data is that it has to be something that can be measured or compared. That is, the fact or information that is attributed to something has to mean thing. If I tell you that my sister is smart, that isn't necessarrily a measurable piece of information. Rather, it is a subjective quality. Now if I tell you that my sister's IQ is 140, that becomes a measurable piece of information that qualifies as data.
+Another important aspect of the definition of data is that it has to be something that can be measured or compared. If I tell you that my sister is smart, that isn't a measurable piece of information. Rather, it is a subjective quality. Now if I tell you that my sister's IQ is 140, that becomes a measurable piece of information that qualifies as data.

Let's take another example about weather. If I tell you that yesterday was windy, that is a piece of information that is attributable to something (specifically yesterday), but it is not measurable unless you include how windy it was, say 30 miles per hour.

@@ -37,7 +37,7 @@ To summarize, data is:

### Why learn about data

-Now that we know what data is, why does it matter to learn about data and how to use it effectively? Data has become central to so much of what happens in the world. With the amount of available data growing each day, we are finding more use cases to use data to make decisions and to even automate tasks. Here are examples of ways that data is in our everyda life:
+Now that we know what data is, why does it matter to learn about data and how to use it effectively? Data has become central to so much of what happens in the world. With the amount of available data growing each day, we are finding more use cases to use data to make decisions and to even automate tasks. Here are examples of ways that data is in our everyday life:

- When you see a weather forecast, that is using data to try to predict what the weather will be like.
- When you use a video service like Netflix or Youtube, the recommendations you see are based on data from what you've previously watched and what others have watched.
@@ -68,4 +68,4 @@ For example, you can think about how tall you are or what your hair color is. W

## Summary

-Data is fact or piece of information that is attributable to something (or someone) and that can be measured or compared. Data is a key part of how the world works and understanding how to work with data is a very critical skill in our world.
+Data is a fact or piece of information that is attributable to something (or someone) and that can be measured or compared. Data is a key part of how the world works and understanding how to work with data is a very critical skill in our world.
14 changes: 8 additions & 6 deletions docs/02-Collecting-and-displaying-data.md
@@ -15,7 +15,7 @@ description: Lesson 2 - Collecting and displaying data
## Concept

### How to collect data
-We now can build on our definition of data as a fact that is attributable to something and that can be compared by thinking about how to collect data. The process of collect data simply involves making a record of the facts that we want to use later on. Before we start collecting data, there are some important things to think about:
+We now can build on our definition of data as a fact that is attributable to something and that can be compared by thinking about how to collect data. The process of collecting data simply involves making a record of the facts that we want to use later on. Before we start collecting data, there are some important things to think about:

- What is the main object that we are attributing the facts to -- a person, a country, etc.?
- What facts do we want to collect?
@@ -25,19 +25,19 @@ Let's look at each of these individually to help see the importance of thoughtfu

**What is the object we are interested in?**

-It is best to start with the object of our data as that will then help us answer the other questions. In figuring out the object of our data collection,we want to think about what is the main object that we are looking to analyze. If it is a person, then we can think about what specific attributes related to a person that matter to us in what we're trying to accomplish. If it is a country, then we can think about attributes that are related to a country.
+It is best to start with the object of our data as that will then help us answer the other questions. In figuring out the object of our data collection, we want to think about what is the main object that we are looking to analyze. If it is a person, then we can think about what specific attributes related to a person that matter to us in what we're trying to accomplish. If it is a country, then we can think about attributes that are related to a country.

It is possible to have multiple objects that are related that you want to attribute facts to. There could be different layers that you want to group together. A simple example would be countries and cities. You want to collect data about cities, but you also want to keep track of what country a city is associated with. While the country could be considered another piece of data about the city, it is better to think about it as another part of the object that you are collecting data about.

**What facts do we want to collect?**

-Once you have the object identified, then you can move on to the facts that you want to collect. In some respects, this is the simplest question because you can brainstorm all the various items that are interesting for you. You can think of the various attributes for an object that are relevant and keep track of them as you will use them when collecting data.
+Once you have the object identified, then you can move on to the facts that you want to collect. In some respects, this is the simplest question because you can brainstorm all the various items that are interesting for you. You can think of the various attributes for an object that are relevant and keep track of them as you will use them when collecting data. If you are collecting facts about human beings, you can collect information about their hair color, eye color, whether they wear glasses or not, their height, and so on.

**How will the facts be compared?**

The final question helps to shape the facts that you want to collect. After you have the list of facts that you are interested in collecting, you want to make sure you know how you will assign the facts as you collect them. Here is where you start to define how you will measure and collect facts.

-For example, in the weather example, are you going to collect the type of weather -- windy, sunny, rainy, cloudy -- for a given time period or will you collect a specific quantitative measurement such as temperature or wind speed? A way to think about this is what you have to note down for each type of fact what are the types of values you will collect for each fact. Or more specifically, you can establish a *range* of possible values for a given fact.
+If you are collecting data regarding weather, are you going to collect the type of weather -- windy, sunny, rainy, cloudy -- for a given time period or will you collect a specific quantitative measurement such as temperature or wind speed? One way to think about this is that you have to note down what are the types of values you will collect for each fact. Or more specifically, you can establish a *range* of possible values for a given fact.

It's time to look at some examples to help us understand all of this better.

@@ -102,11 +102,13 @@ We can take our exploration a step further by displaying the data. To create a c

![image](images/02-gsheet_chart.png)

+The chart shows us a breakdown of total attribute values by how each value. You will notice that the highest part of the histogram chart shows around the values of 9 and 10.
+
## Practice: Collect Your Data

-For an activity to collect data, you will want to go to your room (or think about what is in your room if you were at school) and collect data about what you have. You can think about how you would want to organize the data about your room in different ways. You can keep of count of types of objects in your room -- clothes, toys, pictures, clothes, etc. Or you can also add attributes for objects in your room such as color or size.
+For an activity to collect data, you will want to go to your room (or think about what is in your room if you are at school) and collect data about what you have. You can think about how you would want to organize the data about your room in different ways. You can keep a count of types of objects in your room -- clothes, toys, pictures, etc. Or you can also add attributes for objects in your room such as color or size.

-The output should be a set of records with fields representing what is in your room.
+The output should be a set of records with fields representing what is in your room. Each object would be its own record in your data.

## Summary

2 changes: 1 addition & 1 deletion docs/03-Querying-data-pivot-table.md
@@ -13,7 +13,7 @@ description: Lesson 3 - Querying data in a pivot table

## Concept

-When we have a dataset, one of the main uses of it is to get answers from the dataset for questions that we are interested in. When we ask a question to a dataset to get an answer, we call that a **query**. A query is a structured way to communicate what specific information we want to extract from a dataset. When we structure a query, we are usually trying to get answers related to the records in the dataset or the fields in the dataset, and often times we want a combination of both. Remember that the records in the dataset are represented by individual rows of data that each represent a set of attributes about an object. The fields in the dataset are those specific attributes that we have collected for our objects.
+When we have a dataset, one of the main uses of it is to ask questions of the dataset to get answers that we are interested in. When we ask a question to a dataset to get an answer, we call that a **query**. A query is a structured way to communicate what specific information we want to extract from a dataset. When we structure a query, we are usually trying to get answers related to the records in the dataset or the fields in the dataset, and often times we want a combination of both. Remember that the records in the dataset are represented by individual rows of data that each represent a set of attributes about an object. The fields in the dataset are those specific attributes that we have collected for our objects.

If we query data specifically to get a set of records, then we normally will structure our query to select records based on certain values of the attributes. For example, in our countries example from the previous lesson, if we want to only see records for countries that have population greater than 100 million, then we would write the query such that it only includes those records. When we are making a selection of records with conditions to pick what we want, that is done using a **filter** in our query.
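
The same filter idea can also be sketched in Python with pandas, the tool introduced in Lesson 4. This is only a rough illustration: the US and Brazil figures are the ones used in Lesson 4, while the Spain and Thailand figures and the variable names `countries` and `large` are approximate values and names assumed for the example.

```python
import pandas as pd

# Small example table; the last two population values are approximate
# and included only to give the filter something to exclude.
countries = pd.DataFrame({
    'Country': ['US', 'Brazil', 'Spain', 'Thailand'],
    'Population (millions)': [331.9, 214.3, 47.4, 71.6],
})

# The filter: keep only records whose population is above 100 million.
large = countries[countries['Population (millions)'] > 100]
print(large)
```

The condition inside the square brackets plays the role of the filter: it is checked for every record, and only the records that pass are kept.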

12 changes: 6 additions & 6 deletions docs/04-Loading-data-in-python.md
@@ -20,15 +20,15 @@ We are shifting from using Google Sheets to now using the Python programming lan

The dataframe structure should look similar to the Google Sheets data that we have looked at previously. You will see rows and columns of data to represent the records and fields for the dataset. As we write Python code to work with dataframes, we will start by using **functions** -- modules of code that accomplish a specific task -- to run queries.

-During the practice portion of the lesson, you will be using a Google Colab Notebook to write code and see results. To create a Google Colab notebook, go to [[https://colab.research.google.com]] and then navigate to `File -> New Notebook`.
+During the practice portion of the lesson, you will be using a Google Colab Notebook to write code and see results. To create a Google Colab notebook, go to <https://colab.research.google.com> and then navigate to `File -> New Notebook`.

![image](images/04-new_colab_notebook.png)

Once you have your new notebook, you are ready to begin the practice portions of this lesson.

## Practice: Creating a dataset in Python using Google Colab

-If your new notebook, you can see add snippets of code into what is called a **cell** and then run the cell to execute the code. To start, let's put in a simple statement to print out text and then run the cell by clicking the play button. This is what you should see after the cell is played:
+In your new notebook, you can add snippets of code into what is called a **cell** and then run the cell to execute the code. To start, let's put in a simple statement to print out text and then run the cell by clicking the play button. This is what you should see after the cell is played:

![image](images/04-colab_hello_world.png)

@@ -37,7 +37,7 @@ Going forward, you can copy the code that is displayed here into a new cell in y
```
import pandas as pd
-country_data = {'Country': ['US', 'Brazil', 'Spain', 'Thailand'], 'Population (millions)': [331.9, 214.3]}
+country_data = {'Country': ['US', 'Brazil'], 'Population (millions)': [331.9, 214.3]}
df = pd.DataFrame(country_data)
print(df)
```
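
The lists in the dictionary need to have the same number of items, because each position across the lists forms one record. As a small sketch of what happens otherwise (the names `mismatched` and `error_message` are only for illustration), pandas raises a `ValueError` when the lists have different lengths:

```python
import pandas as pd

# Four countries but only two population values: the lists do not line up.
mismatched = {
    'Country': ['US', 'Brazil', 'Spain', 'Thailand'],
    'Population (millions)': [331.9, 214.3],
}

try:
    pd.DataFrame(mismatched)
    error_message = None
except ValueError as err:
    # pandas refuses to build a dataframe from lists of unequal length.
    error_message = str(err)
    print('Could not build the dataframe:', error_message)
```

Giving every field a value for every record, as in the two-country version, avoids the error.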
@@ -48,7 +48,7 @@ Let's go through this code line by line to understand what is happening. The fi

## Practice: Loading a dataset in Python using Google Colab

-Now we're ready to move one from creating a dataset to loading a dataset. In this practice, we will be loading a dataset containing book reviews. Before loading the dataset into dataframes we first have to download it with this code in a new cell in your notebook:
+Now we're ready to move on from creating a dataset to loading a dataset. In this practice, we will be loading a dataset containing book reviews. Before loading the dataset into dataframes we first have to download it with this code in a new cell in your notebook:

```
!wget https://cdn.freecodecamp.org/project-data/books/book-crossings.zip
@@ -78,11 +78,11 @@ The read_csv function takes in parameters on how to load data, including what se
print(ratings_df)
```

-You can add similar cells for the `books_df` and `user_df` dataframes as well. To see full information about the fields in a dataframe, you can use the `info()` function.
+You can add similar cells to print the `books_df` and `user_df` dataframes as well. To see full information about the fields in a dataframe, you can use the `info()` function.

```
ratings_df.info()
```
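
To try `read_csv()` and `info()` without downloading anything, here is a minimal sketch that loads a tiny made-up table from text held in memory. The rows, the semicolon separator, the `User-ID` column name, and the names `sample_csv` and `sample_df` are assumptions for the example and are not taken from the book dataset files.

```python
import io
import pandas as pd

# A tiny made-up table written as text, with ';' between the fields.
sample_csv = io.StringIO(
    "User-ID;ISBN;Book-Rating\n"
    "1;0001;5\n"
    "2;0002;0\n"
    "3;0001;8\n"
)

# sep tells read_csv which character separates the fields;
# dtype keeps the ISBN values as text so leading zeros are not lost.
sample_df = pd.read_csv(sample_csv, sep=';', dtype={'ISBN': str})

print(sample_df)
sample_df.info()  # lists each field with its non-null count and data type
```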

## Summary
-In this lesson, we learned how to use Python to create our own dataset and to load a dataset from existing files that we downloaded. We were introduced to the concept of a dataframe with is how data is represented in Python. We also learned about functions which are modules of code that allow us to accomplish a specific task. Some of the functions we practiced were `print()` and `describe()`.
+In this lesson, we learned how to use Python to create our own dataset and to load a dataset from existing files that we downloaded. We were introduced to the concept of a dataframe which is how data is represented in Python. We also learned about functions which are modules of code that allow us to accomplish a specific task. Some of the functions we practiced were `info()`, `print()`, and `read_csv()`.
31 changes: 28 additions & 3 deletions docs/05-Querying-data-in-python.md
@@ -31,7 +31,7 @@ This single line of code looks complicated, so let's break it apart to see the v
```
books_df
```
-After that, we call the `groupby` function with the field `Year-Of-Publication` to say what we want to group the dataset by.
+After that, we call the `groupby` function with the field `Year-Of-Publication` to say what we want to group the dataset by. Grouping organizes the data together around the field(s) you specify in the group by function.
```
books_df.groupby(['Year-Of-Publication'])
```
@@ -43,7 +43,7 @@ The `count()` function simply adds up how many records there are for each value
```
books_df.groupby(['Year-Of-Publication']).count().sort_values(by=['ISBN'], ascending=False)
```
-We use the `sort_values()` function to sort the data and we give it two pieces of information. The first item is what field to sort by. In this case, it doesn't matter what field to sort by since we've counted up all the values which would be the same for all the fields. So we have given it the `ISBN` field but we could have given it any fields. The second information we pass to the `sort_values()` function is how we want to sort. Because we want the highest values at the top, we want it in descending order which is the opposite of ascending; so we say that ascending is False to tell it what order to do.
+We use the `sort_values()` function to sort the data and we give it two pieces of information. The first item is what field to sort by. In this case, it doesn't matter what field to sort by since we've counted up all the values which would be the same for all the fields. So we have given it the `ISBN` field but we could have given it any field. The second piece of information we pass to the `sort_values()` function is how we want to sort. Because we want the highest values at the top, we want it in descending order which is the opposite of ascending; so we say that ascending is False to tell it what order to use.
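
The grouping, counting, and sorting steps described above can be tried on a small made-up table. This is a sketch under assumptions: the rows, the `Book-Title` column, and the names `toy_books` and `counts` are invented for the example and are not part of the book dataset.

```python
import pandas as pd

# Six made-up books published across three different years.
toy_books = pd.DataFrame({
    'ISBN': ['a1', 'a2', 'a3', 'a4', 'a5', 'a6'],
    'Book-Title': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Year-Of-Publication': [1999, 2001, 1999, 2001, 2001, 1987],
})

counts = (
    toy_books.groupby(['Year-Of-Publication'])   # one group per year
    .count()                                     # records per group, for every field
    .sort_values(by=['ISBN'], ascending=False)   # biggest counts first
)

print(counts.head(2))  # only the two years with the most books
```

Because no values are missing, every field holds the same count within a year, which is why sorting by `ISBN` or by `Book-Title` gives the same order here.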

We've got the data grouped and counted and sorted, so now we're ready to see the top results.
```
@@ -53,7 +53,7 @@ The `head()` function will output the first 25 records of the data output from o

## Practice: Building your own queries in Python using Google Colab

-OK, now you've seen functions in action to build a query for our datasets, so now it's your turn. We'll go through a couple examples together and then I'll give you a couple queries to build yourself.
+OK, now you've seen functions in action to build a query for our datasets, so now it's your turn. We'll go through a couple examples together and then I'll give you a few queries to build yourself.

As we build a query, we first start with what question we want to ask of the dataset and then we build the query to represent that question.

@@ -89,3 +89,28 @@ If you have successfully built all of those queries to answer the questions, the
## Summary
In this lesson, we learned how to write queries in Python using functions. We explored our book ratings datasets to ask questions of the data. We used different functions to help us get the answers we wanted. Some of the functions included: `count()`, `groupby()`, `sort_values()`, and `head()`.

+## Answer key
+1. What is the age of the users who did reviews grouped by each age? Hint: you will have to use the users dataset for this query.
+```
+users_df.groupby('Age').count().sort_values(by=['Age'])
+```
+
+2. What is the overall average age of users? Hint: you will have to use the `mean()` function.
+```
+users_df['Age'].mean()
+```
+
+3. What is the number of ratings at each rating (0 - 10)? Hint: you will have to use the ratings dataset.
+```
+ratings_df.groupby('Book-Rating').count().sort_values(by=['Book-Rating'])
+```
+
+4. What is the overall average book rating from all ratings? Hint: you will have to use the `mean()` function.
+```
+ratings_df['Book-Rating'].mean()
+```
+
+5. How many distinct authors are in the dataset? Hint: you will have to use the books dataset and the `nunique()` function.
+```
+books_df['Book-Author'].nunique()
+```
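
To check how `count()`, `mean()`, and `nunique()` behave before running them on the full datasets, here is a minimal sketch on made-up data. The rows, the author names, and the `toy_` variable names are invented for the example; only the column names `ISBN`, `Book-Rating`, and `Book-Author` come from the lesson.

```python
import pandas as pd

# Four made-up ratings and three made-up books.
toy_ratings = pd.DataFrame({
    'ISBN': ['a1', 'a2', 'a3', 'a4'],
    'Book-Rating': [0, 5, 5, 10],
})
toy_books = pd.DataFrame({
    'Book-Author': ['Ann Lee', 'Ann Lee', 'Raj Patel'],
})

# How many ratings were given at each rating value.
rating_counts = toy_ratings.groupby('Book-Rating').count()

# The average of all ratings: (0 + 5 + 5 + 10) / 4.
average_rating = toy_ratings['Book-Rating'].mean()

# The number of different authors, counting repeats only once.
distinct_authors = toy_books['Book-Author'].nunique()

print(rating_counts)
print(average_rating)
print(distinct_authors)
```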
Binary file modified docs/images/04-colab_country_data.png