diff --git a/lessons/lesson3/Lesson 3.ipynb b/lessons/lesson3/Lesson 3.ipynb index 4a5b860..bad33cd 100644 --- a/lessons/lesson3/Lesson 3.ipynb +++ b/lessons/lesson3/Lesson 3.ipynb @@ -1 +1,1744 @@ -{"cells":[{"attachments":{},"cell_type":"markdown","metadata":{"id":"gPZMglnlttO8"},"source":["\n"," \n","# Basic Elementary Exploratory Data Analysis using Pandas\n","\n","_Author: Christopher Chan_"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"cdSXNUkhttO9"},"source":["### Objective\n","\n","Upon completion of this lesson you should be able to understand the following:\n","\n","1. Pandas library\n","2. Dataframes\n","3. Data selection\n","4. Data manipulation\n","5. Handling of missing data\n","\n","This is arguably the most important part of analysis, and it is also referred to as \"cleaning the data\". Data must be usable for the analysis to be valid. Otherwise it would be garbage in, garbage out."]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"1c9iQdrTttO-"},"source":["##### ==================================================================================================\n","## Data Selection and Inspection\n","\n","\n","### Pandas Library\n","\n","`pandas` is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,\n","built on top of the Python programming language.\n","\n","A `pandas` data frame can be created by loading data from existing external storage such as a SQL database or CSV files. A data frame can also be created from Python lists, dictionaries, etc. For simplicity, we will use `.csv` files. 
One of the ways to create a pandas data frame is shown below:\n","\n","### DataFrames\n","A data frame is a structured representation of data.\n","##### =================================================================================================="]},{"cell_type":"code","execution_count":null,"metadata":{"id":"0v8znxdlttO-"},"outputs":[],"source":["import pandas as pd"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"Q-fUhePhttO-"},"outputs":[],"source":["data = {'Name':['John', 'Tiffany', 'Chris', 'Winnie', 'David'],\n"," 'Age': [24, 23, 22, 19, 10], \n"," 'Salary': [60000,120000,1000000,75000,80000]}\n","\n","people_df = pd.DataFrame(data)"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"_fi2q8cuttO-"},"source":["##### ==================================================================================================\n","We can call on the dataframe we labeled `people_df` by applying the `.head()` function that would display the first five rows of the dataframe. 
Similarly, the `.tail()` function would return the last five rows of a dataframe."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"Muo9Gs_xttO_"},"outputs":[],"source":["people_df.head()"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"ZtcR6GJ2ttO_"},"source":["##### ==================================================================================================\n","We can also modify the number of rows we would like to display by inserting the integer into the `.head()` function.\n","\n","Example: Select the first 2 rows of the dataframe"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"k-lZOSuGttO_"},"outputs":[],"source":["people_df.head(2)"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"C_UTdB6IWiG_"},"source":["Example: Select the last 2 rows of the dataframe"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"tfNVLk_tWU52"},"outputs":[],"source":["people_df.tail(2)"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"Q8nzMIscttO_"},"source":["##### ==================================================================================================\n","Another way to create a dataframe would be to load an existing CSV file by using the `read_csv` function built into `pandas` onto the desired file path as shown below:\n","\n","`dataframe = pd.read_csv(\".../file_location/file_name.csv\")`"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"IbdNygPgttO_"},"outputs":[],"source":["# Saving the file location to our data into a variable\n","pixar_url = \"https://raw.githubusercontent.com/freestackinitiative/COOP-PythonLessons/main/lessons/lesson3/data/Pixar_Movies.csv\"\n","# Passing our file location to the read_csv function to locate and read our data into a DataFrame\n","movies_df = pd.read_csv(pixar_url)"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"4r7sM285ttPA"},"source":["##### 
=================================================================================================="]},{"cell_type":"code","execution_count":null,"metadata":{"id":"u-4v6IISttPA"},"outputs":[],"source":["movies_df.head(10)"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"5JISZYZHttPA"},"source":["#### The above python code is equivalent to SQL's\n","\n","```sql\n","SELECT * \n","FROM Movies\n","LIMIT 10\n","```\n","##### =================================================================================================="]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"uB-nxMi8ttPA"},"source":["`.shape` shows the number of rows and columns"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"tuyP3rLKttPA"},"outputs":[],"source":["movies_df.shape"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"sCnSo2HBttPA"},"source":["This shows us how many rows and columns are in the entire dataframe, 14 rows, 5 columns\n","\n","##### =================================================================================================="]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"zMhYN4Z1ttPA"},"source":["`.dtypes` shows the data types"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"UOImUY3attPA"},"outputs":[],"source":["movies_df.dtypes"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"244Ux8N_XWmo"},"source":["`.describe()` can be used to help summarize numerical data in our dataframe. 
It summarizes the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"-MPTc3c6YjMp"},"outputs":[],"source":["movies_df.describe()"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"uNXiyTCWYwl8"},"source":["You may optionally include categorical data in the `describe` method like so:"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"ITxGRSqQY8oX"},"outputs":[],"source":["movies_df.describe(include='all')"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"s3f7AzxJv_L5"},"outputs":[],"source":["movies_df.info()"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"QHrP16DdttPA"},"source":["##### ==================================================================================================\n","\n","### Row and Column Selection\n","\n","There are two common ways to select rows and columns in a dataframe: `.loc` and `.iloc`\n","\n","`.loc` selects rows and columns by label/name\n","\n","`.iloc` selects rows and columns by integer position\n","\n","Example: using `.loc` to select every row in the dataframe by using `:` and filtering the columns to just Title, Director and Year"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"_H7tY4X8ttPA"},"outputs":[],"source":["movies_df.loc[:, ['Title','Director','Year'] ]"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"0qwe2OwyttPA"},"source":["##### ==================================================================================================\n","\n","Similarly we obtain the same results using `.iloc` by filtering the columns to positions 1, 2, and 3, which correspond to Title, Director and Year respectively, as shown below:"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"3rPyj7J1ttPA"},"outputs":[],"source":["movies_df.iloc[ :, [1,2,3] ]"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"OigUAB8ottPB"},"source":["#### The two python 
codes above are equivalent to SQL's\n","\n","```sql\n","SELECT Title, Director, Year\n","FROM Movies\n","```\n","\n","##### =================================================================================================="]},{"cell_type":"code","execution_count":null,"metadata":{"id":"VrxiA9oittPB"},"outputs":[],"source":["movies_df.iloc[0:3,[1,2,3]]"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"W8Rpe-SBttPB"},"source":["#### The above python code is equivalent to SQL's\n","\n","```sql\n","SELECT Title, Director, Year\n","FROM Movies\n","LIMIT 3\n","```\n","##### =================================================================================================="]},{"cell_type":"code","execution_count":null,"metadata":{"id":"JYZXjJ7zttPB"},"outputs":[],"source":["movies_df.iloc[2:5, [1,2,3]]"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"xuF4sFtRttPB"},"source":["#### The above python code is equivalent to SQL's\n","\n","```sql\n","SELECT Title, Director, Year\n","FROM movies\n","LIMIT 3\n","OFFSET 2\n","```\n","##### =================================================================================================="]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"qoAct0MgZq2Y"},"source":["The `value_counts()` method returns the count of unique values in a given `Series`/column. 
For example, let's look at the number of entries each Director has in `movies_df`:"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"mAaANUmittPB"},"outputs":[],"source":["movies_df.loc[:,'Director'].value_counts()"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"UUKE7FJkttPB"},"source":["#### The above python code is equivalent to SQL's\n","```sql\n","SELECT Director, COUNT(*)\n","FROM Movies\n","GROUP BY Director\n","```\n"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"dqwCWeGUdDOO"},"source":["##### ==================================================================================================\n","\n","We can use the `mean()` method to help us find the average of a column or group of columns."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"85Yx7Q8MdXDp"},"outputs":[],"source":["movies_df.loc[:, 'Length_minutes'].mean()"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"hxFxYRlVgy8D"},"source":["#### The above python code is equivalent to SQL's\n","```sql\n","SELECT AVG(Length_minutes)\n","FROM Movies\n","```"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"CDzWn6ZYhdjl"},"source":["Using the `groupby()` method, we can perform operations that are similar to the `GROUP BY` clause in SQL.\n","\n","For example, let's get the average `Length_minutes` by `Director` to see the average number of minutes for each Director's movies:"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"1Pc8Bk75ePoi"},"outputs":[],"source":["movies_df.loc[:, ['Director', 'Length_minutes']].groupby('Director').mean()"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"jbDuTSGwiCmq"},"source":["#### The above python code is equivalent to SQL's\n","```sql\n","SELECT Director, AVG(Length_minutes) AS Length_minutes\n","FROM Movies\n","GROUP BY Director\n","```"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"cKaf4n4ycypo"},"source":["##### 
==================================================================================================\n","### Filtering Data\n","Comparison operators applied to a column return a boolean mask that we can use to keep only the rows matching our conditions\n","\n","Example: Suppose we want to return movie information only if the movie is longer than 100 minutes."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"Zgl74_zjttPB"},"outputs":[],"source":["# Create the filter \n","movie_filter = movies_df.loc[:, \"Length_minutes\"] > 100\n","# Use the filter in the `.loc` selector\n","movies_df.loc[movie_filter, :]\n","\n","# An example showing everything in a single step \n","movies_df.loc[movies_df.loc[:, \"Length_minutes\"] > 100, :]\n"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"1RAY_qWtttPB"},"source":["#### The above python code is equivalent to SQL's\n","```sql\n","SELECT *\n","FROM Movies\n","WHERE Length_minutes > 100\n","```\n","##### ==================================================================================================\n","\n","#### Multiple Conditional Filtering\n","\n","Suppose we want to return movie information only if the movie is longer than 100 minutes and was created before the year 2005"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"Dp1-vQ3mttPB"},"outputs":[],"source":["movie_len_filter = movies_df.loc[:, \"Length_minutes\"] > 100\n","movie_year_filter = movies_df.loc[:, \"Year\"] < 2005\n","\n","movies_df.loc[(movie_len_filter) & (movie_year_filter), :]"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"lQksNrTkttPB"},"source":["#### The above python code is equivalent to SQL's\n","```sql\n","SELECT *\n","FROM Movies\n","WHERE Length_minutes > 100\n","AND Year < 2005\n","```\n","##### =================================================================================================="]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"GVTfOPhottPB"},"source":["##### 
==================================================================================================\n","### Sorting Data\n","The `sort_values()` method sorts the list ascending by default. To sort by descending order, you must apply `ascending = False`. \n","\n","The `.reset_index(drop=True)` will re-index the index after sorting."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"KFQjjjOxttPC"},"outputs":[],"source":["movies_df.loc[:,\"Title\"].sort_values().reset_index(drop=True)"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"1fEi_PBfttPC"},"source":["#### The above python code is equivalent to SQL's\n","\n","```sql\n","SELECT Title\n","FROM Movies\n","ORDER BY Title\n","```\n","##### ==================================================================================================\n","\n","Sort the entire dataframe by a single column:"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"stgM1BXxttPC"},"outputs":[],"source":["movies_df.sort_values(\"Title\").reset_index(drop=True)"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"V5j8FDwuttPC"},"source":["#### The above python code is equivalent to SQL's\n","```sql\n","SELECT *\n","FROM Movies\n","ORDER BY Title\n","```\n","##### ==================================================================================================\n","\n","We can also sort using multiple columns.\n","Example: We can sort by Director first, then within each Director, sort the Title of the films."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"q6UqfJacttPC"},"outputs":[],"source":["movies_df.sort_values([\"Director\",\"Title\"], ascending=[True, False]).reset_index(drop=True)"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"5wlURoWy2eYC"},"source":["```sql\n","SELECT Director, Title\n","FROM Movies\n","ORDER BY\n"," Director ASC,\n"," Title DESC\n","```"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"AWFTUYNVttPC"},"source":["##### 
==================================================================================================\n","### Merging DataFrames\n","\n","In `pandas`, the `pd.concat` function combines dataframes, either stacked one on top of another or placed side by side.\n","\n","But first let us introduce a new dataset:"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"3C3P14EvttPC"},"outputs":[],"source":["other_movies_url = \"https://raw.githubusercontent.com/freestackinitiative/COOP-PythonLessons/main/lessons/lesson3/data/Other_Movies.csv\"\n","other_movies_df = pd.read_csv(other_movies_url)"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"fjEx1V8vttPC"},"outputs":[],"source":["other_movies_df.head()"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"DiyckWV1ttPC"},"source":["##### ==================================================================================================\n","Now let's combine the two dataframes, `movies_df` and `other_movies_df`, using the `pd.concat` function and call this new dataframe `all_movies_df`"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"pjvZ8wGFttPC"},"outputs":[],"source":["all_movies_df = pd.concat([movies_df,other_movies_df]).reset_index(drop=True)"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"DwGVaXWxttPC"},"outputs":[],"source":["all_movies_df.head(-1) # Using -1 in the head function shows every row except the last one"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"uwtTP015ttPD"},"source":["##### ==================================================================================================\n","Now let's introduce another dataframe containing the scores each movie received"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"wntotCmBttPD"},"outputs":[],"source":["movie_scores_url = \"https://raw.githubusercontent.com/freestackinitiative/COOP-PythonLessons/main/lessons/lesson3/data/Movie_Scores.csv\"\n","scores_df = 
pd.read_csv(movie_scores_url)"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":206},"executionInfo":{"elapsed":143,"status":"ok","timestamp":1680135575621,"user":{"displayName":"Martin Arroyo","userId":"00023833307036255373"},"user_tz":240},"id":"9xeFCBz5ttPD","outputId":"d749036a-deb4-4141-c95d-0428415de7a3"},"outputs":[{"data":{"text/html":["\n","
\n"," "],"text/plain":[" Score\n","0 8.3\n","1 7.2\n","2 7.9\n","3 8.1\n","4 8.2"]},"execution_count":35,"metadata":{},"output_type":"execute_result"}],"source":["scores_df.head()"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"Yarl7-KPttPD"},"source":["##### ==================================================================================================\n","Now we can combine the two dataframes side by side"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"W2zOhxPcttPD"},"outputs":[],"source":["movies_and_scores_df = pd.concat([all_movies_df,scores_df], axis = \"columns\").reset_index(drop=True)"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"VBMdQiRettPD"},"outputs":[],"source":["movies_and_scores_df.head(-1)"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"xoQ3fB8SttPD"},"source":["##### ==================================================================================================\n","\n"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"ZITCa9qYttPD"},"outputs":[],"source":["managers = pd.DataFrame(\n"," {\n"," 'Id': [1,2,3],\n"," 'Manager':['Chris','Maritza','Jamin']\n"," }\n",")"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"MX9spfihttPD"},"outputs":[],"source":["managers.head()"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"LtD1zQJuttPD"},"outputs":[],"source":["captains = pd.DataFrame(\n"," {\n"," 'Id': [2,2,3,1,1,3,2,3,1,1,3,3],\n"," 'Captain':['Derick','Shane','Becca','Anna','Christine','Melody','Tom','Eric','Naomi','Angelina','Nancy','Richard'],\n"," 'Title':['C','C','SC','C','SC','C','C','SC','C','EC','C','SC']\n"," }\n",")"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"0xOS-Bu4ttPD"},"outputs":[],"source":["captains.head(12)"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"3c478mlSttPD"},"outputs":[],"source":["roster = captains.merge(managers,left_on = 'Id', right_on = 
'Id')\n","roster.head(-1)"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"rJ9K1BPGXxzE"},"outputs":[],"source":["test_roster = pd.concat([captains, managers], axis=\"columns\").reset_index(drop=True)\n","test_roster.head()"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"2hro1V6XttPD"},"source":["#### The above python code is equivalent to SQL's\n","```sql\n","SELECT *\n","FROM Captains\n","INNER JOIN Managers\n","ON Captains.Id = Managers.Id\n","```\n","##### ==================================================================================================\n","## Column Renaming\n","\n","We can use the `.rename` function in python to relabel the columns of a dataframe. Suppose we want to rename `Id` to `Cohort` and `Title` to `Captain Rank`."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"-nELWGyPttPD"},"outputs":[],"source":["roster = roster.rename(columns = {\"Id\":\"Cohort\",\"Title\":\"Captain Rank\"})\n","roster.head(-1)"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":144,"status":"ok","timestamp":1680136209238,"user":{"displayName":"Martin Arroyo","userId":"00023833307036255373"},"user_tz":240},"id":"zrKc31ukYw3i","outputId":"40c853d2-aa1f-4f1d-ae91-3e803b90c20f"},"outputs":[{"data":{"text/plain":["Index(['Cohort', 'Captain', 'Captain Rank', 'Manager'], dtype='object')"]},"execution_count":45,"metadata":{},"output_type":"execute_result"}],"source":["roster.columns"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"N5H7HamottPE"},"source":["If we would like to replace all columns, we must use a list of equal length"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"eCTo6V3UttPE"},"outputs":[],"source":["roster.columns = ['Cohort Num','Capt','Capt Rank','Manager']\n","roster.head(-1)"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"Wp5nb6skttPE"},"source":["##### 
==================================================================================================\n","### Drop Columns"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"doploOj9ttPE"},"outputs":[],"source":["#df.drop([\"column1\",\"column2\"], axis = \"columns\")\n","\n","roster = roster.drop(\"Cohort Num\", axis = \"columns\")\n","roster.head(-1)"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"u-SBCempttPE"},"source":["##### ==================================================================================================\n","### Missing Values / NaN Values\n","\n","There are various types of missing data. Most commonly it could just be data was never collected, the data was handled incorrectly or null valued entry.\n","\n","Missing data can be remedied by the following:\n","1. Removing the row with the missing/NaN values\n","2. Removing the column with the missing/NaN values\n","3. Filling in the missing data\n","\n","For simplicity, we will only focus on the first two methods. The third method can be resolved with value interpolation by use of information from other rows or columns of the dataset. This process requires knowledge outside of the scope of this lesson. 
There are entire studies dedicated to this topic alone."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"yV1RhRDNttPE"},"outputs":[],"source":["cars_url = \"https://raw.githubusercontent.com/freestackinitiative/COOP-PythonLessons/main/lessons/lesson3/data/Cars.csv\"\n","cars = pd.read_csv(cars_url)\n","cars.head(-1)"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"zT2P3Mq9ttPE"},"source":["##### ==================================================================================================\n","Now lets sort the companies in alphabetical order"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"W4xHJumrttPE"},"outputs":[],"source":["cars = cars.sort_values(\"Company\").reset_index(drop=True)\n","cars.head(-1)"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"tFLokzyvttPE"},"source":["##### ==================================================================================================\n","Now lets check how many entry points are missing. 
As we can see, there are 4 entries missing in the Location column and 5 entries missing in the Year column."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":163,"status":"ok","timestamp":1680136659347,"user":{"displayName":"Martin Arroyo","userId":"00023833307036255373"},"user_tz":240},"id":"q33En74DttPE","outputId":"5b5675c2-4396-49ee-ef44-cf50d5c03b14"},"outputs":[{"data":{"text/plain":["Company 0\n","Location 4\n","Year 5\n","dtype: int64"]},"execution_count":54,"metadata":{},"output_type":"execute_result"}],"source":["cars.isna().sum()"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"cGKKoYTpttPE"},"source":["##### ==================================================================================================\n","Let's inspect all the rows with missing Location entries"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"dRmT5-TvttPE"},"outputs":[],"source":["missing_car_info_filter = cars.loc[:, \"Location\"].isna()\n","cars.loc[missing_car_info_filter, :]"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"tvT4mHb5ttPE"},"source":["##### ==================================================================================================\n","Let's inspect all the rows with missing Year entries"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"64m7mIH0ttPF"},"outputs":[],"source":["cars.loc[cars.loc[:, \"Year\"].isna(), :]"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"Kl_wIVHCttPF"},"source":["##### ==================================================================================================\n","For simplicity we can fill all the missing Location entries with \"NA\""]},{"cell_type":"code","execution_count":null,"metadata":{"id":"k0NDuUMhttPF"},"outputs":[],"source":["cars.loc[:, \"Location\"] = cars.loc[:, 
\"Location\"].fillna(value=\"NA\")"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"KXC45KtFttPF"},"outputs":[],"source":["cars.head(-1)\n","cars.isna().sum()"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"nB__rivattPF"},"source":["##### ==================================================================================================\n","Now let's drop any rows with missing entries"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":161,"status":"ok","timestamp":1680136931394,"user":{"displayName":"Martin Arroyo","userId":"00023833307036255373"},"user_tz":240},"id":"Ft1XTWOGttPF","outputId":"2a06ac7b-978b-4720-9cc9-52925c26c448"},"outputs":[{"data":{"text/plain":["Company 0\n","Location 0\n","Year 0\n","dtype: int64"]},"execution_count":61,"metadata":{},"output_type":"execute_result"}],"source":["cars = cars.dropna().reset_index(drop=True)\n","cars.head(-1)\n","cars.isna().sum()"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"MoUYqyzSeK9n"},"outputs":[],"source":["cars.info()"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"lbaxA3zrttPF"},"source":["##### ==================================================================================================\n","## Summary\n","\n","- `pandas` provides `Series` and `DataFrame` classes that work with tabular style data.\n","- `.loc` selects rows and columns based on their labels (index and column names).\n","- `.iloc` selects rows and columns based on their integer positions.\n","- Calling a DataFrame method with `axis=\"rows\"` or `axis=0` causes it to operate along the row axis.\n","- Calling a DataFrame method with `axis=\"columns\"` or `axis=1` causes it to operate along the columns axis.\n","- `.sort_values()` reorders rows by the values in one or more columns\n","- `.rename()` can rename columns in DataFrames. 
You can also rewrite the `.columns` attribute to rename columns.\n","- `.isna()` detects missing values\n","- `.fillna()` replaces missing (NaN) values with a specified value\n","- `.dropna()` removes all rows that contain missing (NaN) values\n","- `.merge()` joins two DataFrames on one or more key columns, similar to a SQL `JOIN`"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"k8I532SRttPF"},"source":["##### ==================================================================================================\n","### Exercise 1:\n","Create a new DataFrame called `cohort` by inner joining the two DataFrames `roster` and `exam`"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"t3G0XkmittPF"},"outputs":[],"source":["#solution\n","roster = pd.DataFrame(\n","{\n"," \"Name\" : [\"James\",\"Greg\",\"Patrick\",\"Chris\",\"Cynthia\",\"Chandra\", \"John\",\"David\",\"Tiffany\",\"Peter\"],\n"," \"Id\": [\"1\",\"2\",\"3\",\"4\",\"5\",\"6\",\"7\",\"8\",\"9\",\"10\"],\n"," \n","})\n","\n","exam = pd.DataFrame({\n"," \"Exam 1\" : [89,78,81,90,93,76,66,87,42,55],\n"," \"Exam 2\" : [100,74,20,86,60,76,92,97,88,90],\n"," \"Exam 3\" : [85,60,90,90,88,76,55,None,64,79],\n"," \"Id\" : [\"4\",\"2\",\"1\",\"7\",\"5\",\"10\",\"6\",\"3\",\"9\",\"8\"]\n","})\n","\n","# YOUR CODE HERE"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"rMRopV2FttPF"},"source":["##### ==================================================================================================\n","### Exercise 2:\n","Fill all missing grades with 0."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"DA8C74TLttPG"},"outputs":[],"source":["# YOUR CODE HERE"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"6a_N8JEEttPG"},"source":["##### ==================================================================================================\n","### Exercise 3:\n","Update James's Exam 2 score from 20 to 85 and update Tiffany's Exam 1 score from 42 to 
88"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"Mzka5Y3_ttPG"},"outputs":[],"source":["# YOUR CODE HERE"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"DkuO3tIPttPG"},"source":["##### ==================================================================================================\n","### Exercise 4:\n","\n","Create a series called `Average` that takes the average of Exam 1, Exam 2 and Exam 3 scores"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"QWXVYTj0ttPG"},"outputs":[],"source":["# YOUR CODE HERE"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"96hRtey9ttPG"},"source":["##### ==================================================================================================\n","### Exercise 5:\n","Incorporate the newly created `Average` column into the DataFrame `cohort`"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"wEysGqYyttPG"},"outputs":[],"source":["# YOUR CODE HERE"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"QHk1lZiDttPG"},"source":["##### ==================================================================================================\n","### Exercise 6:\n","Sort the dataset by Average in **descending** order and reindex the DataFrame"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"9azLYMHPttPG"},"outputs":[],"source":["# YOUR CODE HERE"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"yyWST6gUttPG"},"source":["##### ==================================================================================================\n","### Exercise 7:\n","Drop columns Exam 1, 2, and 3"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"PgD_KqCkttPG"},"outputs":[],"source":["# YOUR CODE HERE"]},{"attachments":{},"cell_type":"markdown","metadata":{"id":"OHg6AiIYttPG"},"source":["##### ==================================================================================================\n","### Exercise 8:\n","Select only the top 3 **Name, 
Id and Average only** based on highest Average grade"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"MmHW3ki9ttPG"},"outputs":[],"source":["# YOUR CODE HERE"]}],"metadata":{"colab":{"provenance":[{"file_id":"11txmjQA_zWvA1kWLMxvhr2VEhNoUk9uC","timestamp":1679880071146}]},"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.8.5"}},"nbformat":4,"nbformat_minor":0} +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "''" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "\"\"\"\n", + "For Captains using OLDER Colab notebooks: \n", + "Run this code if you're loading the .csv as \"pd.read_csv(\"/content/Pixar_Movies.csv\")\"\n", + "\n", + "This will download a folder with the CSVs and will move the files into your current working directory, so you can use the prewritten code.\n", + "Newer notebooks link directly to the CSV, so you don't need to take any additional steps.\n", + "\"\"\"\n", + "# !gdown --folder https://drive.google.com/drive/folders/1DyiVBtUKIMg311TQTUoGt0kV2vcFcseM\n", + "# !cd Content; mv *.csv ..; rmdir /content/Content\n", + ";" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gPZMglnlttO8" + }, + "source": [ + "\n", + " \n", + "# Basic Elementary Exploratory Data Analysis using Pandas\n", + "\n", + "_Author: Christopher Chan_" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cdSXNUkhttO9" + }, + "source": [ + "### Objective\n", + "\n", + "Upon completion of this lesson you should be able to understand the following:\n", + "\n", + "1. Pandas library\n", + "2. Dataframes\n", + "3. Data selection\n", + "4. 
Data manipulation\n", + "5. Handling of missing data\n", + "\n", + "This is arguably the most important part of analysis. It is also referred to as \"cleaning the data\". Data must be usable for an analysis to be valid; otherwise it is garbage in, garbage out." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1c9iQdrTttO-" + }, + "source": [ + "##### ==================================================================================================\n", + "## Data Selection and Inspection\n", + "\n", + "\n", + "### Pandas Library\n", + "\n", + "`pandas` is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,\n", + "built on top of the Python programming language.\n", + "\n", + "A `pandas` data frame can be created by loading data from existing external storage, such as a SQL database or CSV files. A data frame can also be created from lists, dictionaries, etc. For simplicity, we will use `.csv` files. One of the ways to create a pandas data frame is shown below:\n", + "\n", + "### DataFrames\n", + "A data frame is a two-dimensional, labeled table of data organized into rows and columns.\n", + "##### ==================================================================================================" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "0v8znxdlttO-" + }, + "outputs": [], + "source": [ + "import pandas as pd" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Q-fUhePhttO-" + }, + "outputs": [], + "source": [ + "data = {'Name':['John', 'Tiffany', 'Chris', 'Winnie', 'David'],\n", + " 'Age': [24, 23, 22, 19, 10], \n", + " 'Salary': [60000,120000,1000000,75000,80000]}\n", + "\n", + "people_df = pd.DataFrame(data)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_fi2q8cuttO-" + }, + "source": [ + "##### ==================================================================================================\n", + "We can call on 
the dataframe we labeled `people_df` and apply the `.head()` method, which displays the first five rows of the dataframe. Similarly, the `.tail()` method returns the last five rows of a dataframe." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Muo9Gs_xttO_" + }, + "outputs": [], + "source": [ + "people_df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZtcR6GJ2ttO_" + }, + "source": [ + "##### ==================================================================================================\n", + "We can also change the number of rows displayed by passing an integer to the `.head()` method.\n", + "\n", + "Example: Select the first 2 rows of the dataframe" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "k-lZOSuGttO_" + }, + "outputs": [], + "source": [ + "people_df.head(2)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C_UTdB6IWiG_" + }, + "source": [ + "Example: Select the last 2 rows of the dataframe" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "tfNVLk_tWU52" + }, + "outputs": [], + "source": [ + "people_df.tail(2)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Q8nzMIscttO_" + }, + "source": [ + "##### ==================================================================================================\n", + "Another way to create a dataframe is to load an existing CSV file by passing its file path to the `read_csv` function built into `pandas`, as shown below:\n", + "\n", + "`dataframe = pd.read_csv(\".../file_location/file_name.csv\")`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "IbdNygPgttO_" + }, + "outputs": [], + "source": [ + "# Save the location of our data file in a variable\n", + "pixar_url = 
\"https://raw.githubusercontent.com/freestackinitiative/COOP-PythonLessons/main/lessons/lesson3/data/Pixar_Movies.csv\"\n", + "# Pass the file location to the read_csv function to read our data into a DataFrame\n", + "movies_df = pd.read_csv(pixar_url)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4r7sM285ttPA" + }, + "source": [ + "##### ==================================================================================================" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "u-4v6IISttPA" + }, + "outputs": [], + "source": [ + "movies_df.head(10)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5JISZYZHttPA" + }, + "source": [ + "#### The above python code is equivalent to SQL's\n", + "\n", + "```sql\n", + "SELECT * \n", + "FROM Movies\n", + "LIMIT 10\n", + "```\n", + "##### ==================================================================================================" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uB-nxMi8ttPA" + }, + "source": [ + "`.shape` shows the number of rows and columns as a `(rows, columns)` tuple" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "tuyP3rLKttPA" + }, + "outputs": [], + "source": [ + "movies_df.shape" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sCnSo2HBttPA" + }, + "source": [ + "This shows us how many rows and columns are in the entire dataframe: 14 rows and 5 columns\n", + "\n", + "##### ==================================================================================================" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zMhYN4Z1ttPA" + }, + "source": [ + "`.dtypes` shows the data type of each column" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "UOImUY3attPA" + }, + "outputs": [], + "source": [ + "movies_df.dtypes" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "244Ux8N_XWmo" + 
}, + "source": [ + 
"`.describe()` can be used to help summarize numerical data in our dataframe. It summarizes the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "-MPTc3c6YjMp" + }, + "outputs": [], + "source": [ + "movies_df.describe()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uNXiyTCWYwl8" + }, + "source": [ + "You may optionally include categorical data in the `describe` method like so:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ITxGRSqQY8oX" + }, + "outputs": [], + "source": [ + "movies_df.describe(include='all')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "s3f7AzxJv_L5" + }, + "outputs": [], + "source": [ + "movies_df.info()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QHrP16DdttPA" + }, + "source": [ + "##### ==================================================================================================\n", + "\n", + "### Row and Column Selection\n", + "\n", + "There are two common ways to select rows and columns in a dataframe: `.loc` and `.iloc`\n", + "\n", + "`.loc` selects rows and columns by label/name\n", + "\n", + "`.iloc` selects rows and columns by integer position\n", + "\n", + "Example: using `.loc` to select every row in the dataframe by using `:` and filtering the columns to just Title, Director and Year" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "_H7tY4X8ttPA" + }, + "outputs": [], + "source": [ + "movies_df.loc[:, ['Title','Director','Year'] ]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0qwe2OwyttPA" + }, + "source": [ + "##### ==================================================================================================\n", + "\n", + "Similarly, we obtain the same results using `.iloc` by filtering to the columns at positions 1, 2, and 3 
(counting from 0), which correspond to Title, Director and Year respectively, as shown below:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "3rPyj7J1ttPA" + }, + "outputs": [], + "source": [ + "movies_df.iloc[ :, [1,2,3] ]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OigUAB8ottPB" + }, + "source": [ + "#### The two python snippets above are equivalent to SQL's\n", + "\n", + "```sql\n", + "SELECT Title, Director, Year\n", + "FROM Movies\n", + "```\n", + "\n", + "##### ==================================================================================================" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "VrxiA9oittPB" + }, + "outputs": [], + "source": [ + "movies_df.iloc[0:3,[1,2,3]]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "W8Rpe-SBttPB" + }, + "source": [ + "#### The above python code is equivalent to SQL's\n", + "\n", + "```sql\n", + "SELECT Title, Director, Year\n", + "FROM Movies\n", + "LIMIT 3\n", + "```\n", + "##### ==================================================================================================" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "JYZXjJ7zttPB" + }, + "outputs": [], + "source": [ + "movies_df.iloc[2:5, [1,2,3]]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xuF4sFtRttPB" + }, + "source": [ + "#### The above python code is equivalent to SQL's\n", + "\n", + "```sql\n", + "SELECT Title, Director, Year\n", + "FROM Movies\n", + "LIMIT 3\n", + "OFFSET 2\n", + "```\n", + "##### ==================================================================================================" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qoAct0MgZq2Y" + }, + "source": [ + "The `value_counts()` method returns how many times each unique value occurs in a given `Series`/column. 
For example, let's look at the number of entries each Director has in `movies_df`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "mAaANUmittPB" + }, + "outputs": [], + "source": [ + "movies_df.loc[:,'Director'].value_counts()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UUKE7FJkttPB" + }, + "source": [ + "#### The above python code is equivalent to SQL's\n", + "```sql\n", + "SELECT Director, COUNT(*)\n", + "FROM Movies\n", + "GROUP BY Director\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dqwCWeGUdDOO" + }, + "source": [ + "##### ==================================================================================================\n", + "\n", + "We can use the `mean()` method to help us find the average of a column or group of columns." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "85Yx7Q8MdXDp" + }, + "outputs": [], + "source": [ + "movies_df.loc[:, 'Length_minutes'].mean()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hxFxYRlVgy8D" + }, + "source": [ + "#### The above python code is equivalent to SQL's\n", + "```sql\n", + "SELECT AVG(Length_minutes)\n", + "FROM Movies\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CDzWn6ZYhdjl" + }, + "source": [ + "Using the `groupby()` method, we can perform operations that are similar to the `GROUP BY` clause in SQL.\n", + "\n", + "For example, let's get the average `Length_minutes` by `Director` to see the average number of minutes for each Director's movies:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "1Pc8Bk75ePoi" + }, + "outputs": [], + "source": [ + "movies_df.loc[:, ['Director', 'Length_minutes']].groupby('Director').mean()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jbDuTSGwiCmq" + }, + "source": [ + "#### The above python code is equivalent to SQL's\n", + "```sql\n", + 
"SELECT Director, AVG(Length_minutes) AS Length_minutes\n", + "FROM Movies\n", + "GROUP BY Director\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cKaf4n4ycypo" + }, + "source": [ + "##### ==================================================================================================\n", + "### Filtering Data\n", + "Comparison operators applied to a column produce a boolean filter, which we can use to return only the rows that meet our conditions.\n", + "\n", + "Example: Suppose we want to return movie information only if the movie is longer than 100 minutes." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Zgl74_zjttPB" + }, + "outputs": [], + "source": [ + "# Create the filter (a boolean Series: True where the movie is longer than 100 minutes)\n", + "movie_filter = movies_df.loc[:, \"Length_minutes\"] > 100\n", + "# Use the filter in the `.loc` selector\n", + "movies_df.loc[movie_filter, :]\n", + "\n", + "# An example showing everything in a single step \n", + "movies_df.loc[movies_df.loc[:, \"Length_minutes\"] > 100, :]\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1RAY_qWtttPB" + }, + "source": [ + "#### The above python code is equivalent to SQL's\n", + "```sql\n", + "SELECT *\n", + "FROM Movies\n", + "WHERE Length_minutes > 100\n", + "```\n", + "##### ==================================================================================================\n", + "\n", + "#### Multiple Conditional Filtering\n", + "\n", + "Suppose we want to return movie information only if the movie is longer than 100 minutes and was created before the year 2005" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Dp1-vQ3mttPB" + }, + "outputs": [], + "source": [ + "movie_len_filter = movies_df.loc[:, \"Length_minutes\"] > 100\n", + "movie_year_filter = movies_df.loc[:, \"Year\"] < 2005\n", + "\n", + "movies_df.loc[(movie_len_filter) & (movie_year_filter), :]" + ] + }, + 
{ + "cell_type": "markdown", + "metadata": { + "id": "lQksNrTkttPB" + }, + "source": [ + "#### The 
above python code is equivalent to SQL's\n", + "```sql\n", + "SELECT *\n", + "FROM Movies\n", + "WHERE Length_minutes > 100\n", + "AND Year < 2005\n", + "```\n", + "##### ==================================================================================================" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GVTfOPhottPB" + }, + "source": [ + "##### ==================================================================================================\n", + "### Sorting Data\n", + "The `sort_values()` method sorts the list ascending by default. To sort by descending order, you must apply `ascending = False`. \n", + "\n", + "The `.reset_index(drop=True)` will re-index the index after sorting." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "KFQjjjOxttPC" + }, + "outputs": [], + "source": [ + "movies_df.loc[:,\"Title\"].sort_values().reset_index(drop=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1fEi_PBfttPC" + }, + "source": [ + "#### The above python code is equivalent to SQL's\n", + "\n", + "```sql\n", + "SELECT Title\n", + "FROM Movies\n", + "ORDER BY Title\n", + "```\n", + "##### ==================================================================================================\n", + "\n", + "Sort the entire dataframe by a single column:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "stgM1BXxttPC" + }, + "outputs": [], + "source": [ + "movies_df.sort_values(\"Title\").reset_index(drop=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "V5j8FDwuttPC" + }, + "source": [ + "#### The above python code is equivalent to SQL's\n", + "```sql\n", + "SELECT *\n", + "FROM Movies\n", + "ORDER BY Title\n", + "```\n", + "##### ==================================================================================================\n", + "\n", + "We can also sort using multiple columns.\n", + "Example: We can sort by Director first, then 
within each Director, sort by Title in descending order." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "q6UqfJacttPC" + }, + "outputs": [], + "source": [ + "movies_df.sort_values([\"Director\",\"Title\"], ascending=[True, False]).reset_index(drop=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5wlURoWy2eYC" + }, + "source": [ + "#### The above python code is equivalent to SQL's\n", + "```sql\n", + "SELECT *\n", + "FROM Movies\n", + "ORDER BY\n", + " Director ASC,\n", + " Title DESC\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AWFTUYNVttPC" + }, + "source": [ + "##### ==================================================================================================\n", + "### Merging DataFrames\n", + "\n", + "In pandas, the `pd.concat` function combines dataframes. They can be stacked one on top of another or placed side by side.\n", + "\n", + "But first let us introduce a new dataset:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "3C3P14EvttPC" + }, + "outputs": [], + "source": [ + "other_movies_url = \"https://raw.githubusercontent.com/freestackinitiative/COOP-PythonLessons/main/lessons/lesson3/data/Other_Movies.csv\"\n", + "other_movies_df = pd.read_csv(other_movies_url)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "fjEx1V8vttPC" + }, + "outputs": [], + "source": [ + "other_movies_df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DiyckWV1ttPC" + }, + "source": [ + "##### ==================================================================================================\n", + "Now let's combine the two dataframes, `movies_df` and `other_movies_df`, using the `pd.concat` function and call the new dataframe `all_movies_df`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "pjvZ8wGFttPC" + }, + "outputs": [], + "source": [ + 
"all_movies_df = 
pd.concat([movies_df,other_movies_df]).reset_index(drop=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "DwGVaXWxttPC" + }, + "outputs": [], + "source": [ + "all_movies_df.head(-1) # head(-1) shows every row except the last one" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "#### The above python code is equivalent to SQL's\n", + "\n", + "```sql\n", + "WITH all_movies_df AS (\n", + " SELECT *\n", + " FROM movies_df\n", + " UNION ALL\n", + " SELECT *\n", + " FROM other_movies_df\n", + ")\n", + "SELECT *\n", + "FROM all_movies_df;\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uwtTP015ttPD" + }, + "source": [ + "##### ==================================================================================================\n", + "Now let's introduce another dataframe containing the scores the movies received" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "wntotCmBttPD" + }, + "outputs": [], + "source": [ + "movie_scores_url = \"https://raw.githubusercontent.com/freestackinitiative/COOP-PythonLessons/main/lessons/lesson3/data/Movie_Scores.csv\"\n", + "scores_df = pd.read_csv(movie_scores_url)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "executionInfo": { + "elapsed": 143, + "status": "ok", + "timestamp": 1680135575621, + "user": { + "displayName": "Martin Arroyo", + "userId": "00023833307036255373" + }, + "user_tz": 240 + }, + "id": "9xeFCBz5ttPD", + "outputId": "d749036a-deb4-4141-c95d-0428415de7a3" + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "
\n", + " " + ], + "text/plain": [ + " Score\n", + "0 8.3\n", + "1 7.2\n", + "2 7.9\n", + "3 8.1\n", + "4 8.2" + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "scores_df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Yarl7-KPttPD" + }, + "source": [ + "##### ==================================================================================================\n", + "Now we can combine the two dataframes side by side" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "W2zOhxPcttPD" + }, + "outputs": [], + "source": [ + "movies_and_scores_df = pd.concat([all_movies_df,scores_df], axis = \"columns\").reset_index(drop=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "VBMdQiRettPD" + }, + "outputs": [], + "source": [ + "movies_and_scores_df.head(-1)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xoQ3fB8SttPD" + }, + "source": [ + "##### ==================================================================================================\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ZITCa9qYttPD" + }, + "outputs": [], + "source": [ + "managers = pd.DataFrame(\n", + " {\n", + " 'Id': [1,2,3],\n", + " 'Manager':['Chris','Maritza','Jamin']\n", + " }\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "MX9spfihttPD" + }, + "outputs": [], + "source": [ + "managers.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "LtD1zQJuttPD" + }, + "outputs": [], + "source": [ + "captains = pd.DataFrame(\n", + " {\n", + " 'Id': [2,2,3,1,1,3,2,3,1,1,3,3],\n", + " 'Captain':['Derick','Shane','Becca','Anna','Christine','Melody','Tom','Eric','Naomi','Angelina','Nancy','Richard'],\n", + " 'Title':['C','C','SC','C','SC','C','C','SC','C','EC','C','SC']\n", + " }\n", + ")" + ] + 
}, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "0xOS-Bu4ttPD" + }, + "outputs": [], + "source": [ + "captains.head(12)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "3c478mlSttPD" + }, + "outputs": [], + "source": [ + "roster = captains.merge(managers,left_on = 'Id', right_on = 'Id')\n", + "roster.head(-1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "rJ9K1BPGXxzE" + }, + "outputs": [], + "source": [ + "test_roster = pd.concat([captains, managers], axis=\"columns\").reset_index(drop=True)\n", + "test_roster.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2hro1V6XttPD" + }, + "source": [ + "#### The above python code is equivalent to SQL's\n", + "```sql\n", + "SELECT *\n", + "FROM Captains\n", + "INNER JOIN Managers\n", + "ON Captains.Id = Managers.Id\n", + "```\n", + "##### ==================================================================================================\n", + "## Column Renaming\n", + "\n", + "We can use the `.rename` function in python to relabel the columns of a dataframe. Suppose we want to rename `Id` to `Cohort` and `Title` to `Captain Rank`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "-nELWGyPttPD" + }, + "outputs": [], + "source": [ + "roster = roster.rename(columns = {\"Id\":\"Cohort\",\"Title\":\"Captain Rank\"})\n", + "roster.head(-1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "executionInfo": { + "elapsed": 144, + "status": "ok", + "timestamp": 1680136209238, + "user": { + "displayName": "Martin Arroyo", + "userId": "00023833307036255373" + }, + "user_tz": 240 + }, + "id": "zrKc31ukYw3i", + "outputId": "40c853d2-aa1f-4f1d-ae91-3e803b90c20f" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "Index(['Cohort', 'Captain', 'Captain Rank', 'Manager'], dtype='object')" + ] + }, + "execution_count": 45, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "roster.columns" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "N5H7HamottPE" + }, + "source": [ + "If we would like to replace all columns, we must use a list of equal length" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "eCTo6V3UttPE" + }, + "outputs": [], + "source": [ + "roster.columns = ['Cohort Num','Capt','Capt Rank','Manager']\n", + "roster.head(-1)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Wp5nb6skttPE" + }, + "source": [ + "##### ==================================================================================================\n", + "### Drop Columns" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "doploOj9ttPE" + }, + "outputs": [], + "source": [ + "#df.drop([\"column1\",\"column2\"], axis = \"columns\")\n", + "\n", + "roster = roster.drop(\"Cohort Num\", axis = \"columns\")\n", + "roster.head(-1)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "u-SBCempttPE" + }, + "source": [ + "##### 
==================================================================================================\n", + "### Missing Values / NaN Values\n", + "\n", + "There are various types of missing data. Most commonly, the data was never collected, the data was handled incorrectly, or a null value was entered.\n", + "\n", + "Missing data can be remedied by the following:\n", + "1. Removing the row with the missing/NaN values\n", + "2. Removing the column with the missing/NaN values\n", + "3. Filling in the missing data\n", + "\n", + "For simplicity, we will focus on the first two methods, plus filling with a simple placeholder value. More principled filling, such as interpolating values from other rows or columns of the dataset, requires knowledge outside the scope of this lesson. There are entire studies dedicated to this topic alone." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "yV1RhRDNttPE" + }, + "outputs": [], + "source": [ + "cars_url = \"https://raw.githubusercontent.com/freestackinitiative/COOP-PythonLessons/main/lessons/lesson3/data/Cars.csv\"\n", + "cars = pd.read_csv(cars_url)\n", + "cars.head(-1)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zT2P3Mq9ttPE" + }, + "source": [ + "##### ==================================================================================================\n", + "Now let's sort the companies in alphabetical order" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "W4xHJumrttPE" + }, + "outputs": [], + "source": [ + "cars = cars.sort_values(\"Company\").reset_index(drop=True)\n", + "cars.head(-1)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tFLokzyvttPE" + }, + "source": [ + "##### ==================================================================================================\n", + "Now let's check how many entries are missing. 
As we can see, 4 entries are missing in the Location column and 5 entries are missing in the Year column." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "executionInfo": { + "elapsed": 163, + "status": "ok", + "timestamp": 1680136659347, + "user": { + "displayName": "Martin Arroyo", + "userId": "00023833307036255373" + }, + "user_tz": 240 + }, + "id": "q33En74DttPE", + "outputId": "5b5675c2-4396-49ee-ef44-cf50d5c03b14" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "Company 0\n", + "Location 4\n", + "Year 5\n", + "dtype: int64" + ] + }, + "execution_count": 54, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "cars.isna().sum()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cGKKoYTpttPE" + }, + "source": [ + "##### ==================================================================================================\n", + "Let's inspect all the rows with missing Location entries" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "dRmT5-TvttPE" + }, + "outputs": [], + "source": [ + "missing_car_info_filter = cars.loc[:, \"Location\"].isna()\n", + "cars.loc[missing_car_info_filter, :]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tvT4mHb5ttPE" + }, + "source": [ + "##### ==================================================================================================\n", + "Let's inspect all the rows with missing Year entries" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "64m7mIH0ttPF" + }, + "outputs": [], + "source": [ + "cars.loc[cars.loc[:, \"Year\"].isna(), :]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Kl_wIVHCttPF" + }, + "source": [ + "##### ==================================================================================================\n", + "For simplicity we can fill all the missing Location 
entries with \"NA\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "k0NDuUMhttPF" + }, + "outputs": [], + "source": [ + "cars.loc[:, \"Location\"] = cars.loc[:, \"Location\"].fillna(value=\"NA\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "KXC45KtFttPF" + }, + "outputs": [], + "source": [ + "cars.head(-1)\n", + "cars.isna().sum()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nB__rivattPF" + }, + "source": [ + "##### ==================================================================================================\n", + "Now let's drop any remaining rows with missing entries" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "executionInfo": { + "elapsed": 161, + "status": "ok", + "timestamp": 1680136931394, + "user": { + "displayName": "Martin Arroyo", + "userId": "00023833307036255373" + }, + "user_tz": 240 + }, + "id": "Ft1XTWOGttPF", + "outputId": "2a06ac7b-978b-4720-9cc9-52925c26c448" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "Company 0\n", + "Location 0\n", + "Year 0\n", + "dtype: int64" + ] + }, + "execution_count": 61, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "cars = cars.dropna().reset_index(drop=True)\n", + "cars.head(-1)\n", + "cars.isna().sum()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "MoUYqyzSeK9n" + }, + "outputs": [], + "source": [ + "cars.info()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lbaxA3zrttPF" + }, + "source": [ + "##### ==================================================================================================\n", + "## Summary\n", + "\n", + "- `pandas` provides `Series` and `DataFrame` classes for working with tabular data.\n", + "- `.loc` selects rows and columns based on their labels.\n", + "- `.iloc` selects rows and columns 
based on their integer positions.\n", + "- Calling a DataFrame method with `axis=\"rows\"` or `axis=0` causes it to operate along the row axis.\n", + "- Calling a DataFrame method with `axis=\"columns\"` or `axis=1` causes it to operate along the columns axis.\n", + "- `.sort_values()` reorders rows by the values in one or more columns\n", + "- `.rename()` can rename columns in DataFrames. You can also rewrite the `.columns` attribute to rename columns.\n", + "- `.isna()` detects missing values\n", + "- `.fillna()` replaces NULL values with a specified value\n", + "- `.dropna()` removes all rows that contain NULL values\n", + "- `.merge()` joins two DataFrames on one or more key columns, similar to a SQL join" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "k8I532SRttPF" + }, + "source": [ + "##### ==================================================================================================\n", + "### Exercise 1:\n", + "Create a new DataFrame called `cohort` by inner joining the two DataFrames `roster` and `exam`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "t3G0XkmittPF" + }, + "outputs": [], + "source": [ + "# Setup: data for the exercises\n", + "roster = pd.DataFrame(\n", + "{\n", + " \"Name\" : [\"James\",\"Greg\",\"Patrick\",\"Chris\",\"Cynthia\",\"Chandra\", \"John\",\"David\",\"Tiffany\",\"Peter\"],\n", + " \"Id\": [\"1\",\"2\",\"3\",\"4\",\"5\",\"6\",\"7\",\"8\",\"9\",\"10\"],\n", + " \n", + "})\n", + "\n", + "exam = pd.DataFrame({\n", + " \"Exam 1\" : [89,78,81,90,93,76,66,87,42,55],\n", + " \"Exam 2\" : [100,74,20,86,60,76,92,97,88,90],\n", + " \"Exam 3\" : [85,60,90,90,88,76,55,None,64,79],\n", + " \"Id\" : [\"4\",\"2\",\"1\",\"7\",\"5\",\"10\",\"6\",\"3\",\"9\",\"8\"]\n", + "})\n", + "\n", + "# YOUR CODE HERE" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rMRopV2FttPF" + }, + "source": [ + "##### ==================================================================================================\n", + 
"### Exercise 
2:\n", + "Fill all missing grades with 0." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "DA8C74TLttPG" + }, + "outputs": [], + "source": [ + "# YOUR CODE HERE" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6a_N8JEEttPG" + }, + "source": [ + "##### ==================================================================================================\n", + "### Exercise 3:\n", + "Update James's Exam 2 score from 20 to 85 and update Tiffany's Exam 1 score from 42 to 88" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Mzka5Y3_ttPG" + }, + "outputs": [], + "source": [ + "# YOUR CODE HERE" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DkuO3tIPttPG" + }, + "source": [ + "##### ==================================================================================================\n", + "### Exercise 4:\n", + "\n", + "Create a series called `Average` that takes the average of Exam 1, Exam 2 and Exam 3 scores" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "QWXVYTj0ttPG" + }, + "outputs": [], + "source": [ + "# YOUR CODE HERE" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "96hRtey9ttPG" + }, + "source": [ + "##### ==================================================================================================\n", + "### Exercise 5:\n", + "Incorporate the newly created `Average` column into the DataFrame `cohort`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "wEysGqYyttPG" + }, + "outputs": [], + "source": [ + "# YOUR CODE HERE" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QHk1lZiDttPG" + }, + "source": [ + "##### ==================================================================================================\n", + "### Exercise 6:\n", + "Sort the dataset by Average in **descending** order and reset the index of the DataFrame" + ] + }, + { + "cell_type": 
"code", + "execution_count": null, + "metadata": { + "id": "9azLYMHPttPG" + }, + "outputs": [], + "source": [ + "# YOUR CODE HERE" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yyWST6gUttPG" + }, + "source": [ + "##### ==================================================================================================\n", + "### Exercise 7:\n", + "Drop columns Exam 1, 2, and 3" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "PgD_KqCkttPG" + }, + "outputs": [], + "source": [ + "# YOUR CODE HERE" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OHg6AiIYttPG" + }, + "source": [ + "##### ==================================================================================================\n", + "### Exercise 8:\n", + "Select only the **Name, Id and Average** columns for the top 3 rows by highest Average grade" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "MmHW3ki9ttPG" + }, + "outputs": [], + "source": [ + "# YOUR CODE HERE" + ] + } + ], + "metadata": { + "colab": { + "provenance": [ + { + "file_id": "11txmjQA_zWvA1kWLMxvhr2VEhNoUk9uC", + "timestamp": 1679880071146 + } + ] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.18" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}