From 792294f8d44f57f68ce6409f7d09f1d4a6014385 Mon Sep 17 00:00:00 2001
From: Diwakar Gupta <39624018+Diwakar-Gupta@users.noreply.github.com>
Date: Tue, 10 May 2022 10:33:53 +0530
Subject: [PATCH] Assignment Solution
---
22-05-07-End-To-End/Assignment_Solution.ipynb | 3190 +++++++++++++++++
1 file changed, 3190 insertions(+)
create mode 100644 22-05-07-End-To-End/Assignment_Solution.ipynb
diff --git a/22-05-07-End-To-End/Assignment_Solution.ipynb b/22-05-07-End-To-End/Assignment_Solution.ipynb
new file mode 100644
index 0000000..d0916a7
--- /dev/null
+++ b/22-05-07-End-To-End/Assignment_Solution.ipynb
@@ -0,0 +1,3190 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "name": "End_To_End_Project_Assignment.ipynb",
+ "provenance": [],
+ "collapsed_sections": [],
+ "toc_visible": true,
+ "include_colab_link": true
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ },
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "view-in-github",
+ "colab_type": "text"
+ },
+ "source": [
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# **Agenda**\n",
+ "\n",
+ "1. Look at the big picture.\n",
+ "2. Get the data.\n",
+ "3. Discover and visualize the data to gain insights.\n",
+ "4. Prepare the data for Machine Learning algorithms.\n",
+ "5. Select a model and train it.\n",
+ "6. Fine-tune your model.\n",
+ "7. Present your solution.\n",
+ "8. Launch, monitor, and maintain your system."
+ ],
+ "metadata": {
+ "id": "BA7AsnN771kU"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## **How to Approach**\n",
+ "\n",
+ "Gather knowledge about the problem, its current solution, and how the result will be used by the company and by downstream systems.\n",
+ "\n",
+ "The first task is to frame the problem by asking questions.\n",
+ "\n",
+ "**Question:** What exactly is the business objective?\n",
+ "\n",
+ "The answer determines which performance measure you use to evaluate your model and how much time you should spend tweaking it.\n",
+ "\n",
+ "**Question:** What is the current solution?\n",
+ "\n",
+ "It will give a reference performance, as well as insights on how to solve the problem.\n",
+ "\n",
+ "**Select a Performance Measure**\n",
+ "\n",
+ "A performance metric gives an idea of how much error the system typically makes in its predictions.\n",
+ "1. **RMSE:** Root Mean Squared Error \n",
+ "2. **MAE:** Mean Absolute Error\n",
+ "3. **Accuracy**: used for classification problems\n",
+ "\n",
+ "![image](https://miro.medium.com/max/710/1*5OQunI-NR-S0gAZFIit1Rw.png)\n",
+ "\n",
+ "**Check the Assumptions**\n",
+ "\n",
+ "The model's output is often consumed by another system.\n",
+ "Ask how the downstream system will use your output,\n",
+ "for example as an exact price or as labels (“cheap,” “medium,” or “expensive”)."
+ ],
+ "metadata": {
+ "id": "0WoXP9KK8YRT"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# **Import all the required packages**\n",
+ "\n",
+ "1. NumPy is required for numerical computations\n",
+ "2. pandas is required for manipulating data\n",
+ "3. Matplotlib and Seaborn are required for drawing graphs and drawing conclusions from the given data, that is, for EDA (Exploratory Data Analysis)"
+ ],
+ "metadata": {
+ "id": "YyVbMvC8-MHA"
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "Bt5sE1_7wqNK"
+ },
+ "outputs": [],
+ "source": [
+ "import numpy as np \n",
+ "import pandas as pd\n",
+ "import matplotlib.pyplot as plt\n",
+ "import seaborn as sns"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "1. Read the Data from the following link:\n",
+ "https://raw.githubusercontent.com/Diwakar-Gupta/assets_resources/main/datasets/titanic.csv\n",
+ "\n",
+ "2. View the actual format of the data and the various features and label"
+ ],
+ "metadata": {
+ "id": "GWHXjbOV_HpJ"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Get The Data"
+ ],
+ "metadata": {
+ "id": "SCrRPYuFY0aH"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "df_train = pd.read_csv('https://raw.githubusercontent.com/Diwakar-Gupta/assets_resources/main/datasets/titanic.csv')\n",
+ "df_train.sample(5)"
+ ],
+ "metadata": {
+ "id": "3n0DBmSjM_2e",
+ "outputId": "674539e4-3ff3-4c1e-f27a-6a8f050ebd72",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 268
+ }
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/html": [
+ "\n",
+ ""
+ ]
+ },
+ "metadata": {}
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ ""
+ ],
+ "metadata": {
+ "id": "ePSAYwlQw40U"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Now that we can see the null values, we first need to deal with the nulls in numerical columns. A few options are at our disposal:\n",
+ "1. Delete the entire row in which the null value is present.\n",
+ "2. Fill the null value with the previous entry.\n",
+ "3. Fill the null value with the next entry.\n",
+ "4. Fill the null value with the mean, median, or mode.\n",
+ "\n",
+ "Which option to use is a judgment call. These four are not the only ways to handle a null value, but they are the most common ones for a numerical column. \n",
+ "Here we have to deal with 'Age', and a reasonable choice is to fill it with the mean value. \n",
+ "\n",
+ "This choice becomes clearer if we plot the distribution of the ages present in the data.\n",
+ "\n",
+ "If the distribution looks roughly normal, filling the null values with the mean is a safe default, because most values of a normal distribution lie near its mean."
+ ],
+ "metadata": {
+ "id": "hXu3xyoFA8yb"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "sns.distplot(df_train['Age'], bins=10)"
+ ],
+ "metadata": {
+ "id": "iSEhBpj4N9ml",
+ "outputId": "95f1f4b9-0054-489d-d9ec-936f7bba0a2d",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 351
+ }
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "/usr/local/lib/python3.7/dist-packages/seaborn/distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).\n",
+ " warnings.warn(msg, FutureWarning)\n"
+ ]
+ },
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "execution_count": 38
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {}
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ ""
+ ],
+ "metadata": {
+ "id": "qGQKDLiww6ak"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Since we are filling the null values in the Age feature with the mean age, we first need to find that mean.\n",
+ "One way is to compute the arithmetic mean by hand, but that is tedious.\n",
+ "Try the `describe()` method instead, which also gives the min, max, and quartile values. "
+ ],
+ "metadata": {
+ "id": "vBBhwUmpC9My"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "df_train.describe()"
+ ],
+ "metadata": {
+ "id": "D99rZN4DOIUv",
+ "outputId": "8faca78a-b67d-4371-e1e1-d4114d281ab5",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 300
+ }
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ " "
+ ],
+ "text/plain": [
+ " PassengerId Survived Pclass ... SibSp Parch Fare\n",
+ "count 891.000000 891.000000 891.000000 ... 891.000000 891.000000 891.000000\n",
+ "mean 446.000000 0.383838 2.308642 ... 0.523008 0.381594 32.204208\n",
+ "std 257.353842 0.486592 0.836071 ... 1.102743 0.806057 49.693429\n",
+ "min 1.000000 0.000000 1.000000 ... 0.000000 0.000000 0.000000\n",
+ "25% 223.500000 0.000000 2.000000 ... 0.000000 0.000000 7.910400\n",
+ "50% 446.000000 0.000000 3.000000 ... 0.000000 0.000000 14.454200\n",
+ "75% 668.500000 1.000000 3.000000 ... 1.000000 0.000000 31.000000\n",
+ "max 891.000000 1.000000 3.000000 ... 8.000000 6.000000 512.329200\n",
+ "\n",
+ "[8 rows x 7 columns]"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 39
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ ""
+ ],
+ "metadata": {
+ "id": "gCln1hnQw8Kb"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "**Here you can see that the Age feature ranges from about 20 at the 1st quartile to 38 at the 3rd quartile, which means roughly half of the passengers on the ship are between 20 and 38 years old. The minimum age shows that an infant only a few months old was also on board.**\n",
+ "\n"
+ ],
+ "metadata": {
+ "id": "BO2o2CKNDo_c"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Now that we know the mean, store a rounded value in a variable so that we can fill the null values later. \n",
+ "Also check whether there are duplicate rows in the data.\n",
+ "\n",
+ "All of this is part of cleaning the data, which keeps absurd values out of our model and makes prediction easier."
+ ],
+ "metadata": {
+ "id": "dd4LRsq6Doi4"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "Age_fill = 30"
+ ],
+ "metadata": {
+ "id": "mc0YXGPzw9uR"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We can also check for correlations in the data. Sometimes a feature is highly correlated with the label, while other features show little correlation. We can drop the weakly correlated features, since they will not affect our prediction significantly, or use them for feature engineering to create new features."
+ ],
+ "metadata": {
+ "id": "cfPWQPZmEDmN"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "df_train.select_dtypes(include='number').corr()['Survived'].sort_values(ascending=False)"
+ ],
+ "metadata": {
+ "id": "win3CVI3Og4T",
+ "outputId": "0bb1c289-2201-480d-a5fa-979e8278c054",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ }
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "Survived 1.000000\n",
+ "Fare 0.257307\n",
+ "Parch 0.081629\n",
+ "PassengerId -0.005007\n",
+ "SibSp -0.035322\n",
+ "Age -0.077221\n",
+ "Pclass -0.338481\n",
+ "Name: Survived, dtype: float64"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 41
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ ""
+ ],
+ "metadata": {
+ "id": "AsDJ5uUwxdaW"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Now we perform EDA, i.e., Exploratory Data Analysis. \n",
+ "We need to know what each feature signifies and how much importance it will carry when we prepare our model."
+ ],
+ "metadata": {
+ "id": "0JmY-b7MEbdm"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "First, we know that our label is whether a person survived. \n",
+ "In our training data, we need to know exactly how many people survived, so that we know which class is in the majority.\n",
+ "In other words, we must know whether more people in our training data survived or did not survive.\n",
+ "\n",
+ "We can count them with a function such as `value_counts()`, but graphs are generally preferred in EDA because people process images faster than raw numbers, so try to use graphs as much as possible."
+ ],
+ "metadata": {
+ "id": "sr4Rxhw-Eu9S"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "df_train[\"Survived\"].value_counts()"
+ ],
+ "metadata": {
+ "id": "4hMQyu2XO2vT",
+ "outputId": "766fd8e1-79bf-46b8-ef78-5ee60ea78360",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ }
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "0 549\n",
+ "1 342\n",
+ "Name: Survived, dtype: int64"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 42
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ ""
+ ],
+ "metadata": {
+ "id": "4Gg9J1ZfFMPx"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "sns.set_style('whitegrid')\n",
+ "sns.countplot(x = 'Survived', data =df_train)"
+ ],
+ "metadata": {
+ "id": "rpchNdPYO9JE",
+ "outputId": "cc54dc06-ea4e-43d3-99f2-6ca750969e26",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 296
+ }
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "execution_count": 43
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {}
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ ""
+ ],
+ "metadata": {
+ "id": "tqQOeyQVxA8X"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Now that we know how many people survived, we also need to know how survival varies with sex, that is, how many men and women were among the survivors. During a calamity such as the sinking of the Titanic, women, children, and the elderly are saved first. Since the mean age on the ship is about 30, sex should be the first parameter we check. "
+ ],
+ "metadata": {
+ "id": "dgT8sT2XFwpI"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "sns.set_style('whitegrid')\n",
+ "sns.countplot(x= 'Survived', hue = 'Sex', data = df_train, palette = 'rainbow')"
+ ],
+ "metadata": {
+ "id": "YayC1lbVPWyn",
+ "outputId": "d82673f0-b073-4a8f-e7a3-8b605b1b0383",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 296
+ }
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "execution_count": 44
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {}
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ ""
+ ],
+ "metadata": {
+ "id": "aFYnX523xDhZ"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Apart from sex, we expect rich people to be more likely to survive, so we need to find out how many of the survivors were rich or poor. We do not have an income feature, so we look for a proxy. Passenger class is a good candidate, as rich people will buy better cabins. \n",
+ "Cabin could have been used, but it is categorical data from which wealth cannot be deduced directly, and it has a large number of null values.\n",
+ "\n",
+ "\n",
+ "Fare is also a good indicator of wealth, but we do not know the price range for each class, so any fare threshold we picked would be arbitrary. We therefore look at the passenger class `Pclass` of the people who survived. "
+ ],
+ "metadata": {
+ "id": "UtZpoxUpGXhX"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "sns.set_style('whitegrid')\n",
+ "sns.countplot(x='Survived', hue = 'Pclass', data = df_train, palette = 'rainbow')"
+ ],
+ "metadata": {
+ "id": "JhMv7cQ6Pvl3",
+ "outputId": "1d68370e-79e2-4b5f-fcd4-b2857b514917",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 296
+ }
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "execution_count": 45
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {}
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ ""
+ ],
+ "metadata": {
+ "id": "ipEzLdKfxK2E"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We should also look at the number of people travelling with parents, siblings, or a spouse, since people would try to save their family first. A rough idea of this helps us select the type of model to prepare. "
+ ],
+ "metadata": {
+ "id": "M4T0-YfUHZVQ"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "sns.countplot(x = 'SibSp', data = df_train)"
+ ],
+ "metadata": {
+ "id": "Bv5UpD8IQhFA",
+ "outputId": "3e01262d-0f61-43de-bf95-588a7a8263ac",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 296
+ }
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "execution_count": 46
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {}
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ ""
+ ],
+ "metadata": {
+ "id": "KeCq4pEexSit"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Now that we have looked at each parameter, let's look at the number of people who bought a ticket at a given price. For example, if a ticket is priced at 20 dollars, knowing how many people bought it helps us understand the economic distribution of the people on board. "
+ ],
+ "metadata": {
+ "id": "gBdXTgDpHt0t"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "sns.distplot(df_train['Fare'], kde = False, color = 'Darkred', bins = 40)"
+ ],
+ "metadata": {
+ "id": "ML6DxJ93QyGQ",
+ "outputId": "6bc716dc-7541-4ffc-fcf5-8b8c52db3f91",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 353
+ }
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "/usr/local/lib/python3.7/dist-packages/seaborn/distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).\n",
+ " warnings.warn(msg, FutureWarning)\n"
+ ]
+ },
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "execution_count": 47
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {}
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ ""
+ ],
+ "metadata": {
+ "id": "FgwJcqP0xXAj"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "One thing we can see is that we needed to assess each feature and its relation to our label, i.e., Survived. This might not be needed if we saw a high correlation between the features and the label. \n",
+ "\n",
+ "Now that we know the data, let's clean the null values and prepare our model."
+ ],
+ "metadata": {
+ "id": "mu1MtePxIMTd"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "First, fill the null ages with the mean age, then check the null counts to confirm that the null values have been handled. "
+ ],
+ "metadata": {
+ "id": "5FrDKVumJD4Y"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "df_train['Age'] = df_train['Age'].fillna(Age_fill)"
+ ],
+ "metadata": {
+ "id": "jBbJdCcSxp-d"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "df_train.isna().sum()"
+ ],
+ "metadata": {
+ "id": "cMkibyzdRYUm",
+ "outputId": "966adab0-2241-4d10-da84-65c3108bd6cb",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ }
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "PassengerId 0\n",
+ "Survived 0\n",
+ "Pclass 0\n",
+ "Name 0\n",
+ "Sex 0\n",
+ "Age 0\n",
+ "SibSp 0\n",
+ "Parch 0\n",
+ "Ticket 0\n",
+ "Fare 0\n",
+ "Cabin 687\n",
+ "Embarked 2\n",
+ "dtype: int64"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 49
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ ""
+ ],
+ "metadata": {
+ "id": "3EfsZWD3x5MA"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We can also plot a heatmap to see the same thing, since plots make the conclusions easier to see."
+ ],
+ "metadata": {
+ "id": "ev0lZjhDJMXi"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "sns.heatmap(df_train.isnull(), cmap= 'viridis')"
+ ],
+ "metadata": {
+ "id": "9Cmnq_MKRhVV",
+ "outputId": "b7bd160e-6508-47e8-9e1e-cd6cc9b73047",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 338
+ }
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "execution_count": 50
+ },
+ {
+ "output_type": "display_data",
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ "\n",
+ " "
+ ],
+ "text/plain": [
+ "Cabin A B C D E F G T\n",
+ "Pclass \n",
+ "1 15 47 59 29 25 0 0 1\n",
+ "2 0 0 0 4 4 8 0 0\n",
+ "3 0 0 0 0 3 5 4 0"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 51
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ ""
+ ],
+ "metadata": {
+ "id": "SsLUwP40L3B7"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Notice that the Cabin feature has a huge number of null values, but this does not mean the feature is useless: `Pclass` and `Cabin` together carry some information.\n",
+ "\n",
+ "**This is left for you to find.**\n",
+ "\n",
+ "For now we simply remove that column from the dataset."
+ ],
+ "metadata": {
+ "id": "ocNWydUVJWAj"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "df_train.drop(['Cabin'], axis = 1, inplace = True)"
+ ],
+ "metadata": {
+ "id": "TYGlqIKxyFXJ"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Another important point is that there are 2 null values in the feature 'Embarked', which represents where the person boarded the ship. It is a categorical value, so it can be filled with the most frequently occurring value.\n",
+ "\n",
+ "First make a plot to find the most frequently occurring value, then fill the nulls with it."
+ ],
+ "metadata": {
+ "id": "-VjMUVQOJscz"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "df_train['Embarked'].value_counts().plot(kind='pie', autopct='%.2f')\n",
+ "plt.show()"
+ ],
+ "metadata": {
+ "id": "gYzjodHrSmwI",
+ "outputId": "ac953477-2107-4d77-fbf8-5155260d99f4",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 248
+ }
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "display_data",
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {}
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ ""
+ ],
+ "metadata": {
+ "id": "sw1NOZgbxcMy"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "df_train['Embarked'] = df_train['Embarked'].fillna('S')"
+ ],
+ "metadata": {
+ "id": "ua8W8R4vyObf"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Check the DataFrame info to see whether any null values remain."
+ ],
+ "metadata": {
+ "id": "Zmk-wTOJKV7m"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "df_train.info()"
+ ],
+ "metadata": {
+ "id": "iovne8xLTEL-",
+ "outputId": "41b6cfdb-c331-4443-bb0f-0feb055ee122",
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ }
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "<class 'pandas.core.frame.DataFrame'>\n",
+ "RangeIndex: 891 entries, 0 to 890\n",
+ "Data columns (total 11 columns):\n",
+ " # Column Non-Null Count Dtype \n",
+ "--- ------ -------------- ----- \n",
+ " 0 PassengerId 891 non-null int64 \n",
+ " 1 Survived 891 non-null int64 \n",
+ " 2 Pclass 891 non-null int64 \n",
+ " 3 Name 891 non-null object \n",
+ " 4 Sex 891 non-null object \n",
+ " 5 Age 891 non-null float64\n",
+ " 6 SibSp 891 non-null int64 \n",
+ " 7 Parch 891 non-null int64 \n",
+ " 8 Ticket 891 non-null object \n",
+ " 9 Fare 891 non-null float64\n",
+ " 10 Embarked 891 non-null object \n",
+ "dtypes: float64(2), int64(5), object(4)\n",
+ "memory usage: 76.7+ KB\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ ""
+ ],
+ "metadata": {
+ "id": "ZRvefRdbzMJ4"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Feature Engineering"
+ ],
+ "metadata": {
+ "id": "hzN5qpiNZHjY"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "SibSp: count of siblings and spouses aboard\n",
+ "\n",
+ "Parch: count of parents and children aboard\n",
+ "\n",
+ "Combining these two features makes sense as a count of family members."
+ ],
+ "metadata": {
+ "id": "zIKxKxifRAuV"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "df_train['family_count'] = df_train['SibSp'] + df_train['Parch']"
+ ],
+ "metadata": {
+ "id": "Be4XHjbtRah1"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We no longer need these two fields, so we can drop them both."
+ ],
+ "metadata": {
+ "id": "9esOCmy3RlB4"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "df_train.drop(['SibSp', 'Parch'], axis = 1, inplace = True)"
+ ],
+ "metadata": {
+ "id": "Ktl-eX3bRjdP"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Preprocessing\n",
+ "\n",
+ "Prepare data for machine learning algorithm"
+ ],
+ "metadata": {
+ "id": "GXnrPIgTZP4M"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We can see that some *text values* remain in our dataset. Most models cannot work with text values directly, so we need to convert them to numerical values.\n",
+ "\n",
+ "One-hot encoding and ordinal encoding are good methods for this. Here each column has only 2 or 3 categories, so pandas `get_dummies` is both simpler and quicker.\n",
+ "\n",
+ "Here **Embarked and Sex are both of nominal type**, so `get_dummies` or one-hot encoding will do the job."
+ ],
+ "metadata": {
+ "id": "TJNGTxp2K8N7"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "embk = pd.get_dummies(df_train['Embarked'], drop_first= True)\n",
+ "sex = pd.get_dummies(df_train['Sex'], drop_first= True)"
+ ],
+ "metadata": {
+ "id": "Qn_kjNb9zNdO"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Name, PassengerId, and Ticket should not affect whether a person survives, so we can drop them.\n",
+ "Sex and Embarked have been encoded, so the original columns should be dropped, and the encoded columns will be added to the DataFrame. "
+ ],
+ "metadata": {
+ "id": "FPl90lkkMYei"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "df_train.drop(['Name', 'Ticket', 'Sex', 'Embarked', 'PassengerId'], axis = 1, inplace = True)"
+ ],
+ "metadata": {
+ "id": "88Aqzd5izUxc"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Now add the encoded values to the main DataFrame.\n",
+ "\n",
+ "Our data is now clean and ready for training. "
+ ],
+ "metadata": {
+ "id": "_sJslI__MoZD"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "df_train = pd.concat([df_train, sex, embk], axis = 1)"
+ ],
+ "metadata": {
+ "id": "MIPK5uTQzgSz"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "You can also view the DataFrame in its current form to check whether it was converted correctly. "
+ ],
+ "metadata": {
+ "id": "jUIld6qOM2BE"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "df_train.head()"
+ ],
+ "metadata": {
+ "id": "nEi-fehVUoYh",
+ "outputId": "09a369d0-2aa1-4082-8a8e-3e66945ac32b",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 206
+ }
+ },
+ "execution_count": null,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/html": [
+ "\n",
+ "