From 792294f8d44f57f68ce6409f7d09f1d4a6014385 Mon Sep 17 00:00:00 2001 From: Diwakar Gupta <39624018+Diwakar-Gupta@users.noreply.github.com> Date: Tue, 10 May 2022 10:33:53 +0530 Subject: [PATCH] Assignment Solution --- 22-05-07-End-To-End/Assignment_Solution.ipynb | 3190 +++++++++++++++++ 1 file changed, 3190 insertions(+) create mode 100644 22-05-07-End-To-End/Assignment_Solution.ipynb diff --git a/22-05-07-End-To-End/Assignment_Solution.ipynb b/22-05-07-End-To-End/Assignment_Solution.ipynb new file mode 100644 index 0000000..d0916a7 --- /dev/null +++ b/22-05-07-End-To-End/Assignment_Solution.ipynb @@ -0,0 +1,3190 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "End_To_End_Project_Assignment.ipynb", + "provenance": [], + "collapsed_sections": [], + "toc_visible": true, + "include_colab_link": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "language_info": { + "name": "python" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "source": [ + "# **Agenda**\n", + "\n", + "1. Look at the big picture.\n", + "2. Get the data.\n", + "3. Discover and visualize the data to gain insights.\n", + "4. Prepare the data for Machine Learning algorithms.\n", + "5. Select a model and train it.\n", + "6. Fine-tune your model.\n", + "7. Present your solution.\n", + "8. Launch, monitor, and maintain your system." 
+ ], + "metadata": { + "id": "BA7AsnN771kU" + } + }, + { + "cell_type": "markdown", + "source": [ + "## **How to Approach**\n", + "\n", + "Gather knowledge about the problem, its current solution, and how the result will be used by the company and by downstream systems.\n", + "\n", + "The first task is to frame the problem by asking questions.\n", + "\n", + "**Question:** What exactly is the business objective?\n", + "\n", + "This matters because it determines which performance measure you use to evaluate your model and how much time you spend tweaking it.\n", + "\n", + "**Question:** What is the current solution?\n", + "\n", + "It gives a reference performance, as well as insights on how to solve the problem.\n", + "\n", + "**Select a Performance Measure**\n", + "\n", + "A performance metric gives an idea of how much error the system typically makes in its predictions.\n", + "1. **RMSE:** Root Mean Squared Error \n", + "2. **MAE:** Mean Absolute Error\n", + "3. **Accuracy**: used for classification problems\n", + "\n", + "![image](https://miro.medium.com/max/710/1*5OQunI-NR-S0gAZFIit1Rw.png)\n", + "\n", + "**Check the Assumptions**\n", + "\n", + "The model's output may be consumed by another system.\n", + "Ask how the downstream system will use your output,\n", + "for example as an exact price or as labels (“cheap”, “medium”, or “expensive”)." + ], + "metadata": { + "id": "0WoXP9KK8YRT" + } + }, + { + "cell_type": "markdown", + "source": [ + "# **Import all the required packages**\n", + "\n", + "1. Numpy is required for all mathematical computations\n", + "2. Pandas is required for manipulating data\n", + "3. Matplotlib and Seaborn will be required for drawing graphs and drawing conclusions from the given data. 
Hence, EDA (Exploratory Data Analysis)" + ], + "metadata": { + "id": "YyVbMvC8-MHA" + } + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Bt5sE1_7wqNK" + }, + "outputs": [], + "source": [ + "import numpy as np \n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns" + ] + }, + { + "cell_type": "markdown", + "source": [ + "1. Read the Data from the following link:\n", + "https://raw.githubusercontent.com/Diwakar-Gupta/assets_resources/main/datasets/titanic.csv\n", + "\n", + "2. View the actual format of the data and the various features and label" + ], + "metadata": { + "id": "GWHXjbOV_HpJ" + } + }, + { + "cell_type": "markdown", + "source": [ + "# Get The Data" + ], + "metadata": { + "id": "SCrRPYuFY0aH" + } + }, + { + "cell_type": "code", + "source": [ + "df_train = pd.read_csv('https://raw.githubusercontent.com/Diwakar-Gupta/assets_resources/main/datasets/titanic.csv')\n", + "df_train.sample(5)" + ], + "metadata": { + "id": "3n0DBmSjM_2e", + "outputId": "674539e4-3ff3-4c1e-f27a-6a8f050ebd72", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 268 + } + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "\n", + "
<table>\n",
+ "<tr><th></th><th>PassengerId</th><th>Survived</th><th>Pclass</th><th>Name</th><th>Sex</th><th>Age</th><th>SibSp</th><th>Parch</th><th>Ticket</th><th>Fare</th><th>Cabin</th><th>Embarked</th></tr>\n",
+ "<tr><th>146</th><td>147</td><td>1</td><td>3</td><td>Andersson, Mr. August Edvard (\"Wennerstrom\")</td><td>male</td><td>27.0</td><td>0</td><td>0</td><td>350043</td><td>7.7958</td><td>NaN</td><td>S</td></tr>\n",
+ "<tr><th>205</th><td>206</td><td>0</td><td>3</td><td>Strom, Miss. Telma Matilda</td><td>female</td><td>2.0</td><td>0</td><td>1</td><td>347054</td><td>10.4625</td><td>G6</td><td>S</td></tr>\n",
+ "<tr><th>222</th><td>223</td><td>0</td><td>3</td><td>Green, Mr. George Henry</td><td>male</td><td>51.0</td><td>0</td><td>0</td><td>21440</td><td>8.0500</td><td>NaN</td><td>S</td></tr>\n",
+ "<tr><th>344</th><td>345</td><td>0</td><td>2</td><td>Fox, Mr. Stanley Hubert</td><td>male</td><td>36.0</td><td>0</td><td>0</td><td>229236</td><td>13.0000</td><td>NaN</td><td>S</td></tr>\n",
+ "<tr><th>551</th><td>552</td><td>0</td><td>2</td><td>Sharp, Mr. Percival James R</td><td>male</td><td>27.0</td><td>0</td><td>0</td><td>244358</td><td>26.0000</td><td>NaN</td><td>S</td></tr>\n",
+ "</table>\n",
+ "
\n", + " " + ], + "text/plain": [ + " PassengerId Survived Pclass ... Fare Cabin Embarked\n", + "146 147 1 3 ... 7.7958 NaN S\n", + "205 206 0 3 ... 10.4625 G6 S\n", + "222 223 0 3 ... 8.0500 NaN S\n", + "344 345 0 2 ... 13.0000 NaN S\n", + "551 552 0 2 ... 26.0000 NaN S\n", + "\n", + "[5 rows x 12 columns]" + ] + }, + "metadata": {}, + "execution_count": 33 + } + ] + }, + { + "cell_type": "code", + "source": [ + "" + ], + "metadata": { + "id": "C8F75E_nwycp" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "![image](https://raw.githubusercontent.com/Pepcoders/Pepcoding-Data-Science/main/dataset/images/titanicColumns.png)" + ], + "metadata": { + "id": "5XHHOzq7Eyep" + } + }, + { + "cell_type": "markdown", + "source": [ + "Look at the Shape of the DataFrame, to get an idea about the total entries in the dataset" + ], + "metadata": { + "id": "itcxxZD3_XfS" + } + }, + { + "cell_type": "code", + "source": [ + "df_train.shape" + ], + "metadata": { + "id": "AjxTYCMBNMpx", + "outputId": "a6c1af5f-11f7-4319-d876-b99680edc169", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(891, 12)" + ] + }, + "metadata": {}, + "execution_count": 34 + } + ] + }, + { + "cell_type": "code", + "source": [ + "" + ], + "metadata": { + "id": "rwAtb4jUw0Mb" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Now that we have a fair idea about the data, we need to look for the following things:\n", + "1. Try finding out the null values present in the data\n", + "2. 
See the data types of various features \n" + ], + "metadata": { + "id": "EtbPptD4_kh9" + } + }, + { + "cell_type": "code", + "source": [ + "df_train.info()" + ], + "metadata": { + "id": "jWzraccnNXGo", + "outputId": "2bb49461-d345-47c1-b15c-b332dd160390", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "<class 'pandas.core.frame.DataFrame'>\n", + "RangeIndex: 891 entries, 0 to 890\n", + "Data columns (total 12 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 PassengerId 891 non-null int64 \n", + " 1 Survived 891 non-null int64 \n", + " 2 Pclass 891 non-null int64 \n", + " 3 Name 891 non-null object \n", + " 4 Sex 891 non-null object \n", + " 5 Age 714 non-null float64\n", + " 6 SibSp 891 non-null int64 \n", + " 7 Parch 891 non-null int64 \n", + " 8 Ticket 891 non-null object \n", + " 9 Fare 891 non-null float64\n", + " 10 Cabin 204 non-null object \n", + " 11 Embarked 889 non-null object \n", + "dtypes: float64(2), int64(5), object(5)\n", + "memory usage: 83.7+ KB\n" + ] + } + ] + }, + { + "cell_type": "code", + "source": [ + "" + ], + "metadata": { + "id": "GihqtEzLw11j" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "# Data Cleaning + EDA" + ], + "metadata": { + "id": "Svbv6DHBY-Ox" + } + }, + { + "cell_type": "markdown", + "source": [ + "Now that you can see there are null values, find the total number of null values present for each feature. 
" + ], + "metadata": { + "id": "2ooFeq8cAa9R" + } + }, + { + "cell_type": "code", + "source": [ + "df_train.isna().sum()" + ], + "metadata": { + "id": "8AKR3Fg-Nb2n", + "outputId": "a1ed8165-b68c-4fe7-883c-e24d4c7468e5", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "PassengerId 0\n", + "Survived 0\n", + "Pclass 0\n", + "Name 0\n", + "Sex 0\n", + "Age 177\n", + "SibSp 0\n", + "Parch 0\n", + "Ticket 0\n", + "Fare 0\n", + "Cabin 687\n", + "Embarked 2\n", + "dtype: int64" + ] + }, + "metadata": {}, + "execution_count": 36 + } + ] + }, + { + "cell_type": "code", + "source": [ + "" + ], + "metadata": { + "id": "lM57wU13w3S-" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "The number of null values gives a good idea of what to do with them, but a graphical representation shows their impact relative to the total data present. Try drawing a map of the null values." + ], + "metadata": { + "id": "-fQDV6wpAnhL" + } + }, + { + "cell_type": "code", + "source": [ + "sns.heatmap(df_train.isnull(), cmap = 'rainbow')" + ], + "metadata": { + "id": "zrt1KIarNoW6", + "outputId": "0dffd173-3838-4e49-c407-0c843ab1c8b9", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 338 + } + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": {}, + "execution_count": 37 + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": {} + } + ] + }, + { + "cell_type": "code", + "source": [ + "" + ], + "metadata": { + "id": "ePSAYwlQw40U" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Now that we can see the null values, we first need to deal with the numerical ones. The most common options are:\n", + "1. Delete the entire row where the null value is present.\n", + "2. Fill the null value with the previous entry.\n", + "3. Fill the null value with the next entry.\n", + "4. Fill the null value with the mean, median, or mode.\n", + "\n", + "Which one to use is a judgment call. These four are not the only options, but they are the most frequent choices for a numerical column containing null values. \n", + "This time we have to deal with 'Age', and a reasonable choice is to fill it with the mean value. \n", + "\n", + "Plotting the distribution of the ages present in the data makes this choice easier to justify.\n", + "\n", + "If the distribution looks roughly normal, filling the null values with the mean is a safe default, because most values of a normal distribution lie close to its mean." + ], + "metadata": { + "id": "hXu3xyoFA8yb" + } + }, + { + "cell_type": "code", + "source": [ + "sns.distplot(df_train['Age'], bins=10)" + ], + "metadata": { + "id": "iSEhBpj4N9ml", + "outputId": "95f1f4b9-0054-489d-d9ec-936f7bba0a2d", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 351 + } + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "/usr/local/lib/python3.7/dist-packages/seaborn/distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. 
Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).\n", + " warnings.warn(msg, FutureWarning)\n" + ] + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": {}, + "execution_count": 38 + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": {} + } + ] + }, + { + "cell_type": "code", + "source": [ + "" + ], + "metadata": { + "id": "qGQKDLiww6ak" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Since we are filling the null values in the Age feature with the mean age, we first need to find that mean.\n", + "One way is to compute the arithmetic mean by hand, but that is tedious.\n", + "Try using the `describe()` function, which also gives the count, standard deviation, min, max, and quartile values. " + ], + "metadata": { + "id": "vBBhwUmpC9My" + } + }, + { + "cell_type": "code", + "source": [ + "df_train.describe()" + ], + "metadata": { + "id": "D99rZN4DOIUv", + "outputId": "8faca78a-b67d-4371-e1e1-d4114d281ab5", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 300 + } + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "\n", + "
<table>\n",
+ "<tr><th></th><th>PassengerId</th><th>Survived</th><th>Pclass</th><th>Age</th><th>SibSp</th><th>Parch</th><th>Fare</th></tr>\n",
+ "<tr><th>count</th><td>891.000000</td><td>891.000000</td><td>891.000000</td><td>714.000000</td><td>891.000000</td><td>891.000000</td><td>891.000000</td></tr>\n",
+ "<tr><th>mean</th><td>446.000000</td><td>0.383838</td><td>2.308642</td><td>29.699118</td><td>0.523008</td><td>0.381594</td><td>32.204208</td></tr>\n",
+ "<tr><th>std</th><td>257.353842</td><td>0.486592</td><td>0.836071</td><td>14.526497</td><td>1.102743</td><td>0.806057</td><td>49.693429</td></tr>\n",
+ "<tr><th>min</th><td>1.000000</td><td>0.000000</td><td>1.000000</td><td>0.420000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td></tr>\n",
+ "<tr><th>25%</th><td>223.500000</td><td>0.000000</td><td>2.000000</td><td>20.125000</td><td>0.000000</td><td>0.000000</td><td>7.910400</td></tr>\n",
+ "<tr><th>50%</th><td>446.000000</td><td>0.000000</td><td>3.000000</td><td>28.000000</td><td>0.000000</td><td>0.000000</td><td>14.454200</td></tr>\n",
+ "<tr><th>75%</th><td>668.500000</td><td>1.000000</td><td>3.000000</td><td>38.000000</td><td>1.000000</td><td>0.000000</td><td>31.000000</td></tr>\n",
+ "<tr><th>max</th><td>891.000000</td><td>1.000000</td><td>3.000000</td><td>80.000000</td><td>8.000000</td><td>6.000000</td><td>512.329200</td></tr>\n",
+ "</table>\n",
+ "
\n", + " " + ], + "text/plain": [ + " PassengerId Survived Pclass ... SibSp Parch Fare\n", + "count 891.000000 891.000000 891.000000 ... 891.000000 891.000000 891.000000\n", + "mean 446.000000 0.383838 2.308642 ... 0.523008 0.381594 32.204208\n", + "std 257.353842 0.486592 0.836071 ... 1.102743 0.806057 49.693429\n", + "min 1.000000 0.000000 1.000000 ... 0.000000 0.000000 0.000000\n", + "25% 223.500000 0.000000 2.000000 ... 0.000000 0.000000 7.910400\n", + "50% 446.000000 0.000000 3.000000 ... 0.000000 0.000000 14.454200\n", + "75% 668.500000 1.000000 3.000000 ... 1.000000 0.000000 31.000000\n", + "max 891.000000 1.000000 3.000000 ... 8.000000 6.000000 512.329200\n", + "\n", + "[8 rows x 7 columns]" + ] + }, + "metadata": {}, + "execution_count": 39 + } + ] + }, + { + "cell_type": "code", + "source": [ + "" + ], + "metadata": { + "id": "gCln1hnQw8Kb" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "**Here you can see that Age ranges from about 20 (1st quartile) to 38 (3rd quartile), which means roughly half of the passengers with a recorded age were between 20 and 38 years old. The minimum of 0.42 shows that an infant only a few months old was also on board.**\n", + "\n" + ], + "metadata": { + "id": "BO2o2CKNDo_c" + } + }, + { + "cell_type": "markdown", + "source": [ + "Now that we know the mean (about 29.7), store a rounded value in a variable so that we can use it to fill the null values later. \n", + "Also check whether there are duplicate rows in the data.\n", + "\n", + "All of this cleaning makes sure there are no absurd values fed into our model, which makes prediction easier." 
+ ], + "metadata": { + "id": "dd4LRsq6Doi4" + } + }, + { + "cell_type": "code", + "source": [ + "Age_fill = 30" + ], + "metadata": { + "id": "mc0YXGPzw9uR" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "We can also check for correlations in the data. Some features correlate strongly with the label, while others barely correlate at all. Features with little correlation can be dropped, since they will not affect our prediction significantly, or they can be used for feature engineering to create new features." + ], + "metadata": { + "id": "cfPWQPZmEDmN" + } + }, + { + "cell_type": "code", + "source": [ + "# Restrict to numeric columns first; newer pandas versions raise an error if corr() sees text columns\n", + "df_train.select_dtypes(include='number').corr()['Survived'].sort_values(ascending=False)" + ], + "metadata": { + "id": "win3CVI3Og4T", + "outputId": "0bb1c289-2201-480d-a5fa-979e8278c054", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "Survived 1.000000\n", + "Fare 0.257307\n", + "Parch 0.081629\n", + "PassengerId -0.005007\n", + "SibSp -0.035322\n", + "Age -0.077221\n", + "Pclass -0.338481\n", + "Name: Survived, dtype: float64" + ] + }, + "metadata": {}, + "execution_count": 41 + } + ] + }, + { + "cell_type": "code", + "source": [ + "" + ], + "metadata": { + "id": "AsDJ5uUwxdaW" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "We are now performing EDA, i.e., Exploratory Data Analysis. \n", + "So we need to know what each feature signifies and how much importance it will carry when we prepare our model." + ], + "metadata": { + "id": "0JmY-b7MEbdm" + } + }, + { + "cell_type": "markdown", + "source": [ + "First, we know that our label is whether a person survived. 
\n", + "Now in our training data, we need to know exactly, how many people survived, so that we know about the greater percentage in our data.\n", + "This means we must know whether more people survived in our training data or whether more people did not survive in our training data.\n", + "\n", + "We can use function for counting all of them, but generally graphs are preferred in EDA, simply due to visual appeal, people process images quickly when compared to raw numbers, hence try using graphs, as much as possible." + ], + "metadata": { + "id": "sr4Rxhw-Eu9S" + } + }, + { + "cell_type": "code", + "source": [ + "df_train[\"Survived\"].value_counts()" + ], + "metadata": { + "id": "4hMQyu2XO2vT", + "outputId": "766fd8e1-79bf-46b8-ef78-5ee60ea78360", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0 549\n", + "1 342\n", + "Name: Survived, dtype: int64" + ] + }, + "metadata": {}, + "execution_count": 42 + } + ] + }, + { + "cell_type": "code", + "source": [ + "" + ], + "metadata": { + "id": "4Gg9J1ZfFMPx" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "sns.set_style('whitegrid')\n", + "sns.countplot(x = 'Survived', data =df_train)" + ], + "metadata": { + "id": "rpchNdPYO9JE", + "outputId": "cc54dc06-ea4e-43d3-99f2-6ca750969e26", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 296 + } + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": {}, + "execution_count": 43 + }, + { + "output_type": "display_data", + "data": { + "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAYUAAAEGCAYAAACKB4k+AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAUfUlEQVR4nO3dfUyV9/3/8dcpp1CCgELknGr9LXHaSNQNs3V40kbnYUcsyEQqc3NjSrt1W6wO3eh0TS2rlfXGKemWbSEmju6bbvuWcNMNHShW6LY6E5V5k9NmtSG1jeccw4140x3geL5/mH1+tRV67OHiIDwff8F1znWdN+bCJ+cD5zq2cDgcFgAAku6I9QAAgLGDKAAADKIAADCIAgDAIAoAAMMe6wGi0dHRoYSEhFiPAQC3lWAwqKysrJvedltHISEhQZmZmbEeAwBuK16vd8jbWD4CABhEAQBgEAUAgEEUAAAGUQAAGEQBAGAQBQCAQRQAAAZRAAAYEz4KwYFQrEfAGMR5gYnqtr7MxUhIuDNOXyh/KdZjYIw59sK3Yz0CEBMT/pkCAOD/IwoAAIMoAAAMogAAMIgCAMAgCgAAgygAAAyiAAAwiAIAwCAKAACDKAAADKIAADCIAgDAsPQqqW63W0lJSbrjjjsUFxenuro69fb2atOmTXr//fc1ffp0VVVVKTU1VeFwWDt27FBbW5vuuusuPfvss5o7d66V4wEAPsLyZwo1NTVqbGxUXV2dJKm6uloul0stLS1yuVyqrq6WJLW3t6uzs1MtLS3avn27KioqrB4NAPARo7581NraqsLCQklSYWGhDh48eMN2m82mrKws9fX1KRAIjPZ4ADChWf4mO4888ohsNptWr16t1atXq6urSxkZGZKkqVOnqqurS5Lk9/vldDrNfk6nU36/39z3ZoLBoLxeb1TzZWZmRrU/xq9ozy3gdmRpFP7whz/I4XCoq6tLpaWlmjlz5g2322w22Wy2T338hIQE/lOHZTi3MF4N9wOPpctHDodDkpSeni6Px6OTJ08qPT3dLAsFAgGlpaWZ+/p8PrOvz+cz+wMARodlUbh69aouX75sPv773/+u2bNny+12q6GhQZLU0NCgnJwcSTLbw+GwOjo6lJycPOzSEQBg5Fm2fNTV1aX169dLkkKhkJYvX65FixZp/vz5KisrU21traZNm6aqqipJ0uLFi9XW1iaPx6PExERVVlZaNRoAYAi2cDgcjvUQn5bX6x2Rdd8vlL80AtNgPDn2wrdjPQJgmeH+7+QVzQAAgygAAAyiAAAwiAIAwCAKAACDKAAADKIAADCIAgDAIAoAAIMoAAAMogAAMIgCAMAgCgAAgygAAAyiAAAwiAIAwCAKAACDKAAADKIAADCIAgDAIAoAAIMoAAAMogAAMIgCAMAgCgAAgygAAAyiAAAwiAIAwCAKAADD8iiEQiEVFhbqe9/7niTp3LlzKi4ulsfjUVlZmfr7+yVJ/f39Kisrk8fjUXFxsd577z2rRwMAfITlUXjppZf02c9+1ny+c+dOrVu3TgcOHFBKSopqa2slSa+88opSUlJ04MABrVu3Tjt37rR6NADAR1gaBZ/Pp8OHD2vVqlWSpHA4rCNHjig3N1eStHLlSrW2tkqSDh06pJUrV0qScnNz9cYbbygcDls5HgDgI+xWHryyslLl5eW6cuWKJKmnp0cpKSmy268/rNPplN/vlyT5/X7dfffd14ey25WcnKyenh6lpaUNefxgMCiv1xvVjJmZmVHtj/Er2nMLuB1ZFoXXXntNaWlpmjdvnv75z39a8hgJCQn8pw7LcG5hvBruBx7LonD8+HEdOnRI7e3tCgaDunz5snbs2KG+vj4NDg7KbrfL5/PJ4XBIkhwOh86fPy+n06nBwUFdunRJU6ZMsWo8AMBNWPY7hR/96Edqb2/XoUOHtGvXLi1cuFC/+MUvlJ2drebmZklSfX293G63JMntdqu+vl6
S1NzcrIULF8pms1k1HgDgJkb9dQrl5eXau3evPB6Pent7VVxcLElatWqVent75fF4tHfvXv34xz8e7dEAYMKzhW/jP/Hxer0jsu77hfKXRmAajCfHXvh2rEcALDPc/528ohkAYBAFAIBBFAAABlEAABhEAQBgEAUAgEEUAAAGUQAAGEQBAGAQBQCAQRQAAAZRAAAYRAEAYBAFAIBBFAAABlEAABhEARijwoPBWI+AMcjq88Ju6dEBfGo2e4LefXp+rMfAGPP/tp2y9Pg8UwAAGEQBAGAQBQCAQRQAAAZRAAAYRAEAYBAFAIARURTWrl0b0TYAwO1t2BevBYNBffDBB+rp6dHFixcVDoclSZcvX5bf7x+VAQEAo2fYKPzxj39UTU2NAoGAioqKTBQmTZqkb33rW6MyIABg9AwbhbVr12rt2rX6/e9/r5KSkls6cDAY1De/+U319/crFAopNzdXGzdu1Llz57R582b19vZq7ty5ev755xUfH6/+/n49/vjjOnPmjCZPnqzdu3frnnvuieqLAwDcmoiufVRSUqLjx4/r/fffVygUMtsLCwuH3Cc+Pl41NTVKSkrSwMCA1qxZo0WLFmnv3r1at26d8vPztW3bNtXW1mrNmjV65ZVXlJKSogMHDqipqUk7d+5UVVVV9F8hACBiEf2iuby8XM8//7yOHTumU6dO6dSpUzp9+vSw+9hsNiUlJUmSBgcHNTg4KJvNpiNHjig3N1eStHLlSrW2tkqSDh06pJUrV0qScnNz9cYbb5jlKgDA6IjomcLp06e1b98+2Wy2Wzp4KBRSUVGR3n33Xa1Zs0YzZsxQSkqK7PbrD+t0Os0vrP1+v+6+++7rQ9ntSk5OVk9Pj9LS0m7pMQEAn15EUZg9e7YuXLigjIyMWzp4XFycGhsb1dfXp/Xr1+udd975VEMOJRgMyuv1RnWMzMzMEZoG402051a0ODcxFCvPzYii0NPTo/z8fH3uc5/TnXfeabb/9re/jehBUlJSlJ2drY6ODvX19WlwcFB2u10+n08Oh0OS5HA4dP78eTmdTg0ODurSpUuaMmXKsMdNSEjgGweW4dzCWBXtuTlcVCKKwoYNG275Qbu7u2W325WSkqL//Oc/+sc//qHvfve7ys7OVnNzs/Lz81VfXy+32y1Jcrvdqq+v14IFC9Tc3KyFCxfe8nIVACA6EUXhS1/60i0fOBAIaMuWLQqFQgqHw1q2bJmWLFmiWbNmadOmTaqqqlJmZqaKi4slSatWrVJ5ebk8Ho9SU1O1e/fuW35MAEB0IorCggULzE/tAwMDGhwcVGJioo4fPz7kPnPmzFFDQ8PHts+YMUO1tbUf256QkKAXX3wx0rkBABaIKAonTpwwH4fDYbW2tqqjo8OyoQAAsXHLV0m12Wz6yle+or/97W9WzAMAiKGInim0tLSYj69du6bTp08rISHBsqEAALERURRee+0183FcXJymT5+uX//615YNBQCIjYii8POf/9zqOQAAY0BEv1Pw+Xxav369XC6XXC6XNmzYIJ/PZ/VsAIBRFlEUtm7dKrfbrddff12vv/66lixZoq1bt1o9GwBglEUUhe7ubj300EOy2+2y2+0qKipSd3e31bMBAEZZRFGYPHmyGhsbFQqFFAqF1NjYqMmTJ1s9GwBglEUUhcrKSu3fv1/333+/HnjgATU3N+vZZ5+1ejYAwCiL6K+PXnzxRT333HNKTU2VJPX29uq5557jr5IAYJyJ6JnCW2+9ZYIgXV9OivW15gEAIy+iKFy7dk0XL140n/f29t7wXs0AgPEhouWjhx9+WKtXr9ayZcskSX/961/1/e9/39LBAACjL6IoFBYWat68eTpy5Igk6Ve/+pVmzZpl6WAAgNEXURQkadasWYQAAMa5W750NgBg/CIKAACDKAAADKIAADCIAgDAIAoAAIMoAAAMogAAMIgCAMAgCgAAgygAAAyiAAAwiAIAwLAsCufPn1dJSYny8vKUn5+vmpo
aSdffoKe0tFRLly5VaWmpefOecDisZ555Rh6PRwUFBTpz5oxVowEAhmBZFOLi4rRlyxbt27dPf/rTn/Tyyy/r7bffVnV1tVwul1paWuRyuVRdXS1Jam9vV2dnp1paWrR9+3ZVVFRYNRoAYAiWRSEjI0Nz586VJE2aNEkzZ86U3+9Xa2urCgsLJV1/856DBw9Kktlus9mUlZWlvr4+BQIBq8YDANxExG+yE4333ntPXq9Xn//859XV1aWMjAxJ0tSpU9XV1SVJ8vv9cjqdZh+n0ym/32/uezPBYFBerzeq2TIzM6PaH+NXtOdWtDg3MRQrz03Lo3DlyhVt3LhRP/3pTzVp0qQbbrPZbLLZbJ/62AkJCXzjwDKcWxiroj03h4uKpX99NDAwoI0bN6qgoEBLly6VJKWnp5tloUAgoLS0NEmSw+GQz+cz+/p8PjkcDivHAwB8hGVRCIfDeuKJJzRz5kyVlpaa7W63Ww0NDZKkhoYG5eTk3LA9HA6ro6NDycnJwy4dAQBGnmXLR8eOHVNjY6PuvfderVixQpK0efNmPfrooyorK1Ntba2mTZumqqoqSdLixYvV1tYmj8ejxMREVVZWWjUaAGAIlkXhi1/8ot56662b3vbf1yx8mM1m01NPPWXVOACACPCKZgCAQRQAAAZRAAAYRAEAYBAFAIBBFAAABlEAABhEAQBgEAUAgEEUAAAGUQAAGEQBAGAQBQCAQRQAAAZRAAAYRAEAYBAFAIBBFAAABlEAABhEAQBgEAUAgEEUAAAGUQAAGEQBAGAQBQCAQRQAAAZRAAAYRAEAYBAFAIBhWRS2bt0ql8ul5cuXm229vb0qLS3V0qVLVVpaqosXL0qSwuGwnnnmGXk8HhUUFOjMmTNWjQUAGIZlUSgqKtKePXtu2FZdXS2Xy6WWlha5XC5VV1dLktrb29XZ2amWlhZt375dFRUVVo0FABiGZVG47777lJqaesO21tZWFRYWSpIKCwt18ODBG7bbbDZlZWWpr69PgUDAqtEAAEOwj+aDdXV1KSMjQ5I0depUdXV1SZL8fr+cTqe5n9PplN/vN/cdSjAYlNfrjWqmzMzMqPbH+BXtuRUtzk0Mxcpzc1Sj8GE2m002my2qYyQkJPCNA8twbmGsivbcHC4qo/rXR+np6WZZKBAIKC0tTZLkcDjk8/nM/Xw+nxwOx2iOBgDQKEfB7XaroaFBktTQ0KCcnJwbtofDYXV0dCg5OfkTl44AACPPsuWjzZs36+jRo+rp6dGiRYu0YcMGPfrooyorK1Ntba2mTZumqqoqSdLixYvV1tYmj8ejxMREVVZWWjUWAGAYlkVh165dN91eU1PzsW02m01PPfWUVaMAACLEK5oBAAZRAAAYRAEAYBAFAIBBFAAABlEAABhEAQBgEAUAgEEUAAAGUQAAGEQBAGAQBQCAQRQAAAZRAAAYRAEAYBAFAIBBFAAABlEAABhEAQBgEAUAgEEUAAAGUQAAGEQBAGAQBQCAQRQAAAZRAAAYRAEAYBAFAIBBFAAAxpiKQnt7u3Jzc+XxeFRdXR3rcQBgwhkzUQiFQnr66ae1Z88eNTU16S9/+YvefvvtWI8FABPKmInCyZMn9ZnPfEYzZsxQfHy88vPz1draGuuxAGBCscd6gP/y+/1yOp3mc4fDoZMnTw67TzAYlNfrjfqx/+fh+6I+BsaXkTivRkTx/8Z6AowxI3FuBoPBIW8bM1H4NLKysmI9AgCMK2Nm+cjhcMjn85nP/X6/HA5HDCcCgIlnzERh/vz56uzs1Llz59Tf36+mpia53e5YjwUAE8qYWT6y2+3atm2bvvOd7ygUCumhhx7S7NmzYz0WAEwotnA4HI71EACAsWHMLB8BAGKPKAAADKIALi+CMWvr1q1yuVxavnx5rEeZMIjCBMflRTCWFRUVac+ePbEeY0IhChMclxfBWHbfffcpNTU11mNMKERhgrvZ5UX8fn8MJwIQS0QBAGAQhQmOy4sA+DCiMMF
xeREAH8YrmqG2tjZVVlaay4v84Ac/iPVIgCRp8+bNOnr0qHp6epSenq4NGzaouLg41mONa0QBAGCwfAQAMIgCAMAgCgAAgygAAAyiAAAwiAIg6Te/+Y3y8/NVUFCgFStW6F//+lfUx2xtbR2xq84uWLBgRI4DfJIx83acQKycOHFChw8fVn19veLj49Xd3a2BgYGI9h0cHJTdfvNvo5ycHOXk5IzkqIDleKaACe/ChQuaMmWK4uPjJUlpaWlyOBxyu93q7u6WJJ06dUolJSWSpF/+8pcqLy/X17/+dT3++OP62te+pn//+9/meCUlJTp16pTq6ur09NNP69KlS1qyZImuXbsmSbp69aoWL16sgYEBvfvuu3rkkUdUVFSkNWvW6OzZs5Kkc+fOafXq1SooKNDu3btH858DExxRwIR3//336/z588rNzVVFRYWOHj36ifucPXtWv/vd77Rr1y7l5eVp//79kqRAIKBAIKD58+eb+yYnJ2vOnDnmuIcPH9YDDzygO++8U08++aSefPJJ1dXV6Sc/+Yl+9rOfSZJ27Nihb3zjG/rzn/+sjIwMC75q4OaIAia8pKQk81N9WlqaNm3apLq6umH3cbvduuuuuyRJDz74oJqbmyVJ+/fv17Jlyz52/7y8PO3bt0+S1NTUpLy8PF25ckUnTpzQD3/4Q61YsULbtm3ThQsXJF1f0srPz5ckrVixYsS+VuCT8DsFQFJcXJyys7OVnZ2te++9Vw0NDYqLi9N/rwITDAZvuH9iYqL52OFwaPLkyXrzzTe1f/9+VVRUfOz4brdbu3fvVm9vr86cOaOFCxfqgw8+UEpKihobG286k81mG7kvEIgQzxQw4b3zzjvq7Ow0n3u9Xk2bNk3Tp0/X6dOnJUktLS3DHiMvL0979uzRpUuXNGfOnI/dnpSUpHnz5mnHjh368pe/rLi4OE2aNEn33HOPWXoKh8N68803JV3/a6OmpiZJ0quvvjoSXyYQEaKACe/q1avasmWL8vLyVFBQoLNnz+qxxx7TY489psrKShUVFSkuLm7YY+Tm5mrfvn168MEHh7xPXl6eXn31VeXl5ZltL7zwgmpra/XVr35V+fn5OnjwoCTpiSee0Msvv6yCggLeCQ+jiqukAgAMnikAAAyiAAAwiAIAwCAKAACDKAAADKIAADCIAgDA+D9EoeNtAM2agwAAAABJRU5ErkJggg==\n", + "text/plain": [ + "
" + ] + }, + "metadata": {} + } + ] + }, + { + "cell_type": "code", + "source": [ + "" + ], + "metadata": { + "id": "tqQOeyQVxA8X" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Now that we know the number of Survived people, we also need to know their variations when sex is accounted, this means we want to know how many men were their among the survived ones, because during a calamity such as the sinking of Titanic, women, children and elderly are the ones which will be saved first, but the mean age on the ship being 30, sex should be the first parameter which we look for and check it. " + ], + "metadata": { + "id": "dgT8sT2XFwpI" + } + }, + { + "cell_type": "code", + "source": [ + "sns.set_style('whitegrid')\n", + "sns.countplot(x= 'Survived', hue = 'Sex', data = df_train, palette = 'rainbow')" + ], + "metadata": { + "id": "YayC1lbVPWyn", + "outputId": "d82673f0-b073-4a8f-e7a3-8b605b1b0383", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 296 + } + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": {}, + "execution_count": 44 + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": {} + } + ] + }, + { + "cell_type": "code", + "source": [ + "" + ], + "metadata": { + "id": "aFYnX523xDhZ" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Appart from Sex, Rich people will get to survive, so we need to find rich or poor, among the survived ones. One thing to note here is that we do not have an income feature, so we try to find something that relates it, Passenger class could be a good idea, as Rich People will buy better Cabins. \n", + "Cabin could have been used, but firstly it is a categorical data which does not represent anything where we could deduce, moreover it has a high amount of null Values\n", + "\n", + "\n", + "Fare is also a good parameter to judge economy but we have no idea about the price range for each class, so it would lead to a random selection of arbitrary fare, therefore we find the Passenger class `Pclass` bought by survived people. " + ], + "metadata": { + "id": "UtZpoxUpGXhX" + } + }, + { + "cell_type": "code", + "source": [ + "sns.set_style('whitegrid')\n", + "sns.countplot(x='Survived', hue = 'Pclass', data = df_train, palette = 'rainbow')" + ], + "metadata": { + "id": "JhMv7cQ6Pvl3", + "outputId": "1d68370e-79e2-4b5f-fcd4-b2857b514917", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 296 + } + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": {}, + "execution_count": 45 + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": {} + } + ] + }, + { + "cell_type": "code", + "source": [ + "" + ], + "metadata": { + "id": "ipEzLdKfxK2E" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "We should also look for number of people travelling with parents, siblings and spouse as, one would try to save their family first, and a rough idea would help us select the type of model which needs to be prepared. " + ], + "metadata": { + "id": "M4T0-YfUHZVQ" + } + }, + { + "cell_type": "code", + "source": [ + "sns.countplot(x = 'SibSp', data = df_train)" + ], + "metadata": { + "id": "Bv5UpD8IQhFA", + "outputId": "3e01262d-0f61-43de-bf95-588a7a8263ac", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 296 + } + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": {}, + "execution_count": 46 + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": {} + } + ] + }, + { + "cell_type": "code", + "source": [ + "" + ], + "metadata": { + "id": "KeCq4pEexSit" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Now that we looked at each paremeters, lets look at the number of people that bought a certain type of ticket, like say there is a ticket priced 20$, getting to know the number of people which bought that ticket will help me know, the percentage economy of people on board " + ], + "metadata": { + "id": "gBdXTgDpHt0t" + } + }, + { + "cell_type": "code", + "source": [ + "sns.distplot(df_train['Fare'], kde = False, color = 'Darkred', bins = 40)" + ], + "metadata": { + "id": "ML6DxJ93QyGQ", + "outputId": "6bc716dc-7541-4ffc-fcf5-8b8c52db3f91", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 353 + } + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "/usr/local/lib/python3.7/dist-packages/seaborn/distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).\n", + " warnings.warn(msg, FutureWarning)\n" + ] + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": {}, + "execution_count": 47 + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": {} + } + ] + }, + { + "cell_type": "code", + "source": [ + "" + ], + "metadata": { + "id": "FgwJcqP0xXAj" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "One thing we can see is that we needed to assess each feature and its relation to the label, i.e., Survived. This might not be needed if we see a high correlation between the features and the label. \n", + "\n", + "Now that we know what data is present, let's clean the null values and prepare our model." + ], + "metadata": { + "id": "mu1MtePxIMTd" + } + }, + { + "cell_type": "markdown", + "source": [ + "First, fill the missing ages with the mean age, then check the null counts to confirm the missing values have been accounted for. " + ], + "metadata": { + "id": "5FrDKVumJD4Y" + } + }, + { + "cell_type": "code", + "source": [ + "df_train['Age'] = df_train['Age'].fillna(Age_fill)" + ], + "metadata": { + "id": "jBbJdCcSxp-d" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "df_train.isna().sum()" + ], + "metadata": { + "id": "cMkibyzdRYUm", + "outputId": "966adab0-2241-4d10-da84-65c3108bd6cb", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "PassengerId 0\n", + "Survived 0\n", + "Pclass 0\n", + "Name 0\n", + "Sex 0\n", + "Age 0\n", + "SibSp 0\n", + "Parch 0\n", + "Ticket 0\n", + "Fare 0\n", + "Cabin 687\n", + "Embarked 2\n", + "dtype: int64" + ] + }, + "metadata": {}, + "execution_count": 49 + } + ] + }, + { + "cell_type": "code", + "source": [ + "" + ], + "metadata": { + "id": "3EfsZWD3x5MA" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "One can also check the same thing with a heatmap, since plots make it easier to draw conclusions visually."
+ ], + "metadata": { + "id": "ev0lZjhDJMXi" + } + }, + { + "cell_type": "code", + "source": [ + "sns.heatmap(df_train.isnull(), cmap= 'viridis')" + ], + "metadata": { + "id": "9Cmnq_MKRhVV", + "outputId": "b7bd160e-6508-47e8-9e1e-cd6cc9b73047", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 338 + } + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": {}, + "execution_count": 50 + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": {} + } + ] + }, + { + "cell_type": "code", + "source": [ + "" + ], + "metadata": { + "id": "qH1m_NCKyAvW" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "pd.crosstab(df_train['Pclass'], df_train['Cabin'].apply(lambda x: x[0] if type(x) == str else np.nan))" + ], + "metadata": { + "id": "BdHsQshiRpK2", + "outputId": "011e207a-1c59-4268-985a-a915ab830ee3", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 175 + } + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "\n", + "
\n", + " " + ], + "text/plain": [ + "Cabin A B C D E F G T\n", + "Pclass \n", + "1 15 47 59 29 25 0 0 1\n", + "2 0 0 0 4 4 8 0 0\n", + "3 0 0 0 0 3 5 4 0" + ] + }, + "metadata": {}, + "execution_count": 51 + } + ] + }, + { + "cell_type": "code", + "source": [ + "" + ], + "metadata": { + "id": "SsLUwP40L3B7" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Notice that the `Cabin` feature has a huge number of null values, but this does not mean the feature is useless: `Pclass` and `Cabin` together still carry some information.\n", + "\n", + "**This is left for you to find.**\n", + "\n", + "For now, we simply remove that column from the dataset." + ], + "metadata": { + "id": "ocNWydUVJWAj" + } + }, + { + "cell_type": "code", + "source": [ + "df_train.drop(['Cabin'], axis = 1, inplace = True)" + ], + "metadata": { + "id": "TYGlqIKxyFXJ" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Another important point is that there are 2 null values in the feature 'Embarked', which records where the person boarded the ship. It is a categorical value, so the missing entries can be filled with the most frequently occurring value.\n", + "\n", + "First, make a plot to find the most frequently occurring value, and then fill the missing data with it." + ], + "metadata": { + "id": "-VjMUVQOJscz" + } + }, + { + "cell_type": "code", + "source": [ + "df_train['Embarked'].value_counts().plot(kind='pie', autopct='%.2f')\n", + "plt.show()" + ], + "metadata": { + "id": "gYzjodHrSmwI", + "outputId": "ac953477-2107-4d77-fbf8-5155260d99f4", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 248 + } + }, + "execution_count": null, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": {} + } + ] + }, + { + "cell_type": "code", + "source": [ + "" + ], + "metadata": { + "id": "sw1NOZgbxcMy" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "df_train['Embarked'] = df_train['Embarked'].fillna('S')" + ], + "metadata": { + "id": "ua8W8R4vyObf" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Check the dataframe info to see whether any null values remain." + ], + "metadata": { + "id": "Zmk-wTOJKV7m" + } + }, + { + "cell_type": "code", + "source": [ + "df_train.info()" + ], + "metadata": { + "id": "iovne8xLTEL-", + "outputId": "41b6cfdb-c331-4443-bb0f-0feb055ee122", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "\n", + "RangeIndex: 891 entries, 0 to 890\n", + "Data columns (total 11 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 PassengerId 891 non-null int64 \n", + " 1 Survived 891 non-null int64 \n", + " 2 Pclass 891 non-null int64 \n", + " 3 Name 891 non-null object \n", + " 4 Sex 891 non-null object \n", + " 5 Age 891 non-null float64\n", + " 6 SibSp 891 non-null int64 \n", + " 7 Parch 891 non-null int64 \n", + " 8 Ticket 891 non-null object \n", + " 9 Fare 891 non-null float64\n", + " 10 Embarked 891 non-null object \n", + "dtypes: float64(2), int64(5), object(4)\n", + "memory usage: 76.7+ KB\n" + ] + } + ] + }, + { + "cell_type": "code", + "source": [ + "" + ], + "metadata": { + "id": "ZRvefRdbzMJ4" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "# Feature Engineering" + ], + "metadata": { + "id": "hzN5qpiNZHjY" + } + }, + { + "cell_type": "markdown", + "source": [ + "SibSp: count of siblings or spouses\n", + "\n", + "Parch: count of parents and children\n", + "\n", + "Combining these two 
features makes sense as a count of family members." + ], + "metadata": { + "id": "zIKxKxifRAuV" + } + }, + { + "cell_type": "code", + "source": [ + "df_train['family_count'] = df_train['SibSp'] + df_train['Parch']" + ], + "metadata": { + "id": "Be4XHjbtRah1" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Now we don't need these two fields any more, so we can drop them both." + ], + "metadata": { + "id": "9esOCmy3RlB4" + } + }, + { + "cell_type": "code", + "source": [ + "df_train.drop(['SibSp', 'Parch'], axis = 1, inplace = True)" + ], + "metadata": { + "id": "Ktl-eX3bRjdP" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "# Preprocessing\n", + "\n", + "Prepare the data for the machine learning algorithm" + ], + "metadata": { + "id": "GXnrPIgTZP4M" + } + }, + { + "cell_type": "markdown", + "source": [ + "We can see that certain *text values* remain in our dataset. Our models do not work well with text values, so we need to convert them to numerical values of some sort.\n", + "\n", + "One-hot encoding and ordinal encoding are good methods for that, but since only 2 or 3 distinct values are present here, pandas `get_dummies` is both simpler and quicker.\n", + "\n", + "Here **Embarked and Sex are both of nominal type**, so get_dummies or one-hot encoding will do the job."
+ ], + "metadata": { + "id": "TJNGTxp2K8N7" + } + }, + { + "cell_type": "code", + "source": [ + "embk = pd.get_dummies(df_train['Embarked'], drop_first= True)\n", + "sex = pd.get_dummies(df_train['Sex'], drop_first= True)" + ], + "metadata": { + "id": "Qn_kjNb9zNdO" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Name, PassengerId and Ticket are assumed not to affect whether a person survives, so we can drop them.\n", + "Sex and Embarked have been encoded, so the original columns should be dropped; the encoded columns will be added to the dataframe. " + ], + "metadata": { + "id": "FPl90lkkMYei" + } + }, + { + "cell_type": "code", + "source": [ + "df_train.drop(['Name', 'Ticket', 'Sex', 'Embarked', 'PassengerId'], axis = 1, inplace = True)" + ], + "metadata": { + "id": "88Aqzd5izUxc" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Now, add the encoded values to the main dataframe.\n", + "\n", + "This makes our data clean and ready for training. " + ], + "metadata": { + "id": "_sJslI__MoZD" + } + }, + { + "cell_type": "code", + "source": [ + "df_train = pd.concat([df_train, sex, embk], axis = 1)" + ], + "metadata": { + "id": "MIPK5uTQzgSz" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "You can also view the dataframe in its current format to check whether it was converted appropriately. " + ], + "metadata": { + "id": "jUIld6qOM2BE" + } + }, + { + "cell_type": "code", + "source": [ + "df_train.head()" + ], + "metadata": { + "id": "nEi-fehVUoYh", + "outputId": "09a369d0-2aa1-4082-8a8e-3e66945ac32b", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + } + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "\n", + "
\n", + " " + ], + "text/plain": [ + " Survived Pclass Age Fare family_count male Q S\n", + "0 0 3 22.0 7.2500 1 1 0 1\n", + "1 1 1 38.0 71.2833 1 0 0 0\n", + "2 1 3 26.0 7.9250 0 0 0 1\n", + "3 1 1 35.0 53.1000 1 0 0 1\n", + "4 0 3 35.0 8.0500 0 1 0 1" + ] + }, + "metadata": {}, + "execution_count": 61 + } + ] + }, + { + "cell_type": "code", + "source": [ + "" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "id": "oOpaIjZZznNE", + "outputId": "5aeadb17-91d9-4f26-987c-931a2ee9c054" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/html": [ + "\n", + "
\n", + " " + ], + "text/plain": [ + " Survived Pclass Age Fare family_count male Q S\n", + "0 0 3 22.0 7.2500 1 1 0 1\n", + "1 1 1 38.0 71.2833 1 0 0 0\n", + "2 1 3 26.0 7.9250 0 0 0 1\n", + "3 1 1 35.0 53.1000 1 0 0 1\n", + "4 0 3 35.0 8.0500 0 1 0 1" + ] + }, + "metadata": {}, + "execution_count": 31 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "Separate the features and labels into x and y so that we can train a model on the features to predict the labels. " + ], + "metadata": { + "id": "0lkvZlf5NIAv" + } + }, + { + "cell_type": "code", + "source": [ + "y = df_train['Survived']" + ], + "metadata": { + "id": "wBpHCXr8zpca" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "x = df_train.drop('Survived', axis= 1)" + ], + "metadata": { + "id": "F9IONoVXzttZ" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Now that the features are stored separately, we need to scale them so that all values fall within a similar range. Fare and Age, for example, are on quite different scales, so we preprocess the data with StandardScaler. 
" + ], + "metadata": { + "id": "LhTW6AEJNxSx" + } + }, + { + "cell_type": "code", + "source": [ + "import numpy as np\n", + "from sklearn import preprocessing" + ], + "metadata": { + "id": "OYTzzTpP6aq5" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "# import StandardScaler\n", + "scaler = preprocessing.StandardScaler()\n", + "scaled_X = scaler.fit_transform(x)" + ], + "metadata": { + "id": "1YaYJ8kH6aly" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "# Select and Train Model" + ], + "metadata": { + "id": "b9UyMFoIZjzU" + } + }, + { + "cell_type": "markdown", + "source": [ + "Now the entire dataset is ready to be split into training and testing sets. Splitting the data is necessary to check the accuracy of our model. \n", + "While splitting the features, use the scaled values.\n" + ], + "metadata": { + "id": "zoiWQMp-O_E4" + } + }, + { + "cell_type": "code", + "source": [ + "# import train_test_split\n", + "from sklearn.model_selection import train_test_split" + ], + "metadata": { + "id": "n7cTtK30z0oS" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "# split using train_test_split with test_size=0.3\n", + "X_train, X_test, Y_train, Y_test = train_test_split(scaled_X, y, test_size = 0.3, random_state= 42)" + ], + "metadata": { + "id": "9NlHv-Igz3dW" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "We can use various models to train on and predict the data. \n", + "To begin, try Logistic Regression: fit the model, predict, and compute the score to see the accuracy of that algorithm.\n", + "\n", + "**Logistic Regression** is a classification algorithm, and here we want to classify whether each person survived or not."
+ ], + "metadata": { + "id": "INVbGqMbPOaX" + } + }, + { + "cell_type": "code", + "source": [ + "# import LogisticRegression and train the model\n", + "from sklearn.linear_model import LogisticRegression\n", + "\n", + "log_reg = LogisticRegression()\n", + "log_reg.fit(X_train, Y_train)" + ], + "metadata": { + "id": "to11XeZ_hXwQ", + "outputId": "840d3b33-3d66-4a68-a8fd-88681add0b34", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "LogisticRegression()" + ] + }, + "metadata": {}, + "execution_count": 68 + } + ] + }, + { + "cell_type": "code", + "source": [ + "" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "d5z1rezT16Mm", + "outputId": "79ebec11-40d3-4673-aa01-be0c202692a5" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "LogisticRegression()" + ] + }, + "metadata": {}, + "execution_count": 39 + } + ] + }, + { + "cell_type": "code", + "source": [ + "# predict on the test set and store the result in y_pred\n", + "y_pred = log_reg.predict(X_test)" + ], + "metadata": { + "id": "SUYw3Ozf2gmZ" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Try using accuracy_score to calculate the accuracy of the model. 
" + ], + "metadata": { + "id": "NIskLYQlPjEn" + } + }, + { + "cell_type": "code", + "source": [ + "# import accuracy_score and print the accuracy\n", + "from sklearn.metrics import accuracy_score\n", + "\n", + "titanic_predictions = log_reg.predict(X_test)\n", + "og_score = accuracy_score(Y_test, titanic_predictions)\n", + "og_score" + ], + "metadata": { + "id": "c9NiZtyckeim", + "outputId": "107432c5-9019-4734-ebfd-5d9f3ca9b7bb", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.8059701492537313" + ] + }, + "metadata": {}, + "execution_count": 70 + } + ] + }, + { + "cell_type": "code", + "source": [ + "" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "N7hpE0OP2yK0", + "outputId": "83c32006-26f1-4ef2-b792-921324a76754" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.8059701492537313" + ] + }, + "metadata": {}, + "execution_count": 41 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "From the score, we can see that logistic regression has done a decent job, but there is no harm in trying other classification algorithms such as DecisionTree.\n" + ], + "metadata": { + "id": "O5lDF2Q7Pte2" + } + }, + { + "cell_type": "code", + "source": [ + "# import DecisionTreeClassifier and train\n", + "from sklearn.tree import DecisionTreeClassifier\n", + "\n", + "tree_clf = DecisionTreeClassifier()\n", + "tree_clf.fit(X_train, Y_train)" + ], + "metadata": { + "id": "Mq3-cu65ky4q", + "outputId": "e80f20a2-05bf-41cb-e589-d8e893439e73", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "DecisionTreeClassifier()" + ] + }, + "metadata": {}, + "execution_count": 71 + } + ] + }, + { + "cell_type": "code", + 
"source": [ + 
"" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "irzOgxIG3M3J", + "outputId": "815f05a2-bc90-4ee4-8001-0adf9f2b9a51" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "DecisionTreeClassifier()" + ] + }, + "metadata": {}, + "execution_count": 42 + } + ] + }, + { + "cell_type": "code", + "source": [ + "# use accuracy_score to print the accuracy\n", + "titanic_predictions = tree_clf.predict(X_test)\n", + "tree_score = accuracy_score(Y_test, titanic_predictions)\n", + "tree_score" + ], + "metadata": { + "id": "GPeuQ2W2m3Aj", + "outputId": "96e2689d-cbe1-4f8f-ea6e-b7e182a81fd7", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.7798507462686567" + ] + }, + "metadata": {}, + "execution_count": 72 + } + ] + }, + { + "cell_type": "code", + "source": [ + "" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "D_2eNhfS3hdf", + "outputId": "624bd39c-e650-451d-9f2f-017f1397ea5f" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.7798507462686567" + ] + }, + "metadata": {}, + "execution_count": 43 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "Accuracy has decreased a bit. We can try cross-validation and see the scores for both Decision Tree and Logistic Regression. 
" + ], + "metadata": { + "id": "2fCRU-qDP8yV" + } + }, + { + "cell_type": "code", + "source": [ + "from sklearn.model_selection import cross_val_score\n", + "\n", + "scores = cross_val_score(tree_clf, X_train, Y_train,scoring=\"accuracy\", cv=10)" + ], + "metadata": { + "id": "fud7Tp2C3yQm" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "def display_scores(scores):\n", + " print(\"Scores:\", scores)\n", + " 
print(\"Mean:\", scores.mean())\n", + " print(\"Standard deviation:\", scores.std())\n", + "\n", + "display_scores(scores)" + ], + "metadata": { + "id": "pUsWjJxrndHi", + "outputId": "1d939448-9e68-49a9-9e18-9b430b631ca8", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Scores: [0.73015873 0.79365079 0.74603175 0.80645161 0.77419355 0.69354839\n", + " 0.75806452 0.82258065 0.85483871 0.80645161]\n", + "Mean: 0.7785970302099334\n", + "Standard deviation: 0.0453946434587528\n" + ] + } + ] + }, + { + "cell_type": "code", + "source": [ + "" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "Zn3xL51337Ig", + "outputId": "259f021a-ffc8-4dbc-fb1c-e8ac1d60cd33" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Scores: [0.71428571 0.77777778 0.71428571 0.80645161 0.77419355 0.70967742\n", + " 0.77419355 0.80645161 0.85483871 0.80645161]\n", + "Mean: 0.7738607270865334\n", + "Standard deviation: 0.04580104507258628\n" + ] + } + ] + }, + { + "cell_type": "code", + "source": [ + "log_scores = cross_val_score(log_reg, X_train, Y_train,scoring=\"accuracy\", cv=10)\n", + "\n", + "display_scores(log_scores)" + ], + "metadata": { + "id": "elbJHv1nnmtP", + "outputId": "714e9839-4102-4ff3-8fe0-6ccdab0cbe5d", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Scores: [0.76190476 0.76190476 0.88888889 0.85483871 0.80645161 0.74193548\n", + " 0.72580645 0.77419355 0.72580645 0.93548387]\n", + "Mean: 0.7977214541730671\n", + "Standard deviation: 0.0687047480828934\n" + ] + } + ] + }, + { + "cell_type": "code", + "source": [ + "" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "MUYyBM793-ru", + "outputId": 
"6745577f-8300-44e6-8c2a-22ef9cfdbe03" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Scores: [0.76190476 0.76190476 0.88888889 0.85483871 0.80645161 0.74193548\n", + " 0.72580645 0.77419355 0.72580645 0.93548387]\n", + "Mean: 0.7977214541730671\n", + "Standard deviation: 0.0687047480828934\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "Neither of the above algorithms may be the best for this dataset. We can try various algorithms to increase the accuracy; Random Forest often performs well on tabular data like this, so let's try that." + ], + "metadata": { + "id": "qxxaUjXyQJyW" + } + }, + { + "cell_type": "code", + "source": [ + "from sklearn.ensemble import RandomForestClassifier\n", + "\n", + "forest_clf = RandomForestClassifier()\n", + "forest_clf.fit(X_train, Y_train)\n", + "\n", + "titanic_predictions = forest_clf.predict(X_test)\n", + "forest_score = accuracy_score(Y_test, titanic_predictions)\n", + "forest_score" + ], + "metadata": { + "id": "qDAFM410nvV8", + "outputId": "55111a0e-7f57-4191-bbde-f2159218a0d9", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.7947761194029851" + ] + }, + "metadata": {}, + "execution_count": 78 + } + ] + }, + { + "cell_type": "code", + "source": [ + "" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "Spay_mrC4HXu", + "outputId": "2621c8f8-cc16-4220-8485-ea9973e11651" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.7835820895522388" + ] + }, + "metadata": {}, + "execution_count": 47 + } + ] + }, + { + "cell_type": "code", + "source": [ + "display_scores(forest_score)" + ], + "metadata": { + "id": "MGmhRKdan41I", + "outputId": "110abdf1-d45a-4cfd-fae2-2fd47e52a4ba", + "colab": { + "base_uri": 
"https://localhost:8080/" + } + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Scores: 0.7947761194029851\n", + "Mean: 0.7947761194029851\n", + "Standard deviation: 0.0\n" + ] + } + ] + }, + { + "cell_type": "code", + "source": [ + "" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "CkR-N9bw4aAl", + "outputId": "7ebd5c46-8565-49b5-8943-487be83084ba" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Scores: 0.7835820895522388\n", + "Mean: 0.7835820895522388\n", + "Standard deviation: 0.0\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "# Fine Tune Model\n", + "\n", + "HyperParameter Tuning" + ], + "metadata": { + "id": "iO_BYHibZpDA" + } + }, + { + "cell_type": "markdown", + "source": [ + "After looking at these scores, we can run GridSearchCV to find the best-suited hyperparameters for this data. 
\n" + ], + "metadata": { + "id": "Yr5Ck4baQnAN" + } + }, + { + "cell_type": "code", + "source": [ + "from sklearn.model_selection import GridSearchCV\n", + "\n", + "param_grid = [\n", + " {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6]},\n", + " {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},\n", + "]\n", + "\n", + "forest_clf = RandomForestClassifier()\n", + "\n", + "grid_search = GridSearchCV(forest_clf, param_grid, cv=5,\n", + " scoring='accuracy',\n", + " return_train_score=True)\n", + "\n", + "grid_search.fit(X_train, Y_train)\n", + "print(grid_search.best_params_)\n", + "print(grid_search.best_score_)" + ], + "metadata": { + "id": "OAh4eHctoAHR", + "outputId": "c1820e5e-e60f-4733-a57e-25bfd54c3b03", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "{'max_features': 4, 'n_estimators': 30}\n", + "0.8074838709677421\n" + ] + } + 
] + }, + { + "cell_type": "code", + "source": [ + "" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "JV_BgZFv4dGC", + "outputId": "5bc689e2-667a-4f6f-d179-7e5c5ca67c0f" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "{'max_features': 4, 'n_estimators': 30}\n", + "0.8010193548387097\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "Here the best model has the highest `n_estimators` value in the grid, so we can try increasing `n_estimators` further. Higher values of `n_estimators` take longer to fit; you can try this on your own. The reason for the longer fit time will be **clearly explained** in your future classes on RandomForest." + ], + "metadata": { + "id": "QGdcekx-WJph" + } + }, + { + "cell_type": "markdown", + "source": [ + "After finding the best params, we can list the mean score for every parameter combination and check that the best params give the highest score. Note that the code below prints the square root of each mean accuracy, which preserves the ranking but is not the accuracy itself." 
+ ], + "metadata": { + "id": "3r-YVWkyQ_Eh" + } + }, + { + "cell_type": "code", + "source": [ + "cvres = grid_search.cv_results_\n", + "for score, params in zip(cvres[\"mean_test_score\"], cvres[\"params\"]):\n", + " print(np.sqrt(score), params)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "VMcC4Jql4nuY", + "outputId": "0a43893f-95a2-4676-e87d-2f591206a1ee" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "0.8649072297819185 {'max_features': 2, 'n_estimators': 3}\n", + "0.8950112632198115 {'max_features': 2, 'n_estimators': 10}\n", + "0.888717815078954 {'max_features': 2, 'n_estimators': 30}\n", + "0.8777610814618249 {'max_features': 4, 'n_estimators': 3}\n", + "0.8823867925195296 {'max_features': 4, 'n_estimators': 10}\n", + "0.8986010633021431 {'max_features': 4, 'n_estimators': 30}\n", + "0.881457736896945 {'max_features': 6, 'n_estimators': 3}\n", + "0.8949896377125846 {'max_features': 6, 'n_estimators': 
10}\n", + "0.8913925314704462 {'max_features': 6, 'n_estimators': 30}\n", + "0.8769154150551183 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}\n", + "0.8796480234510223 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}\n", + "0.8860131798335225 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}\n", + "0.8859986164701934 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}\n", + "0.8859986164701934 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}\n", + "0.8823648576045421 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "# **Hurray! You have completed this exercise🤘🤘🤘**" + ], + "metadata": { + "id": "RaNppu5UXCEy" + } + }, + { + "cell_type": "markdown", + "source": [ + "Here are some tasks left for you; solve them if you wish to improve your skills.\n", + "\n", + "\n", + "\n", + "* **Find the relation between 'SibSp' and 'Parch'.**\n", + "* **Do some Hyperparameter Tuning in Logistic Regression to enhance your model.** To see the hyperparameters of logistic regression, you can refer to the official scikit-learn website.\n", + "\n" + ], + "metadata": { + "id": "_IxX1MxfXXv0" + } + } + ] +} \ No newline at end of file