ta-data-bcn/lab-data-cleaning

Lab | Data Cleaning

Introduction

A commonly repeated claim is that 80% of a data scientist's work is data cleaning. We have no idea whether this number is accurate, but data scientists do spend a lot of time and effort collecting, cleaning, and preparing data for analysis, because datasets are usually messy and complex. Being able to refine and restructure a dataset into a usable state is an essential skill for moving on to the data analysis stage.

In this exercise, you will practice the data cleaning techniques we discussed in the lesson and learn new ones by looking up documentation and references. You will work on your own, but remember that the teaching staff is at your service whenever you run into problems.

Getting Started

By now you should be familiar with the workflow for solving and submitting the labs. If not, review the guidelines in the README.md in the repo root and in the previous lab.

In this lab you will be working on main.ipynb. To launch it, first navigate in the Terminal to the directory that contains main.ipynb, then run jupyter notebook. In the web page that opens automatically, click the main.ipynb link to launch it.

Once main.ipynb is open, read the instructions for each cell and provide your answers. Make sure to test your answers in each cell and save. Jupyter Notebook should save your progress automatically, but it's a good idea to save your work manually from time to time just in case.

Goals

Do you remember your MySQL project? In this lab, you will examine some MySQL tables from here. This database contains an anonymized dump of all user-contributed content on the Stats Stack Exchange network.

You will need to import the pymysql library and the create_engine function from the sqlalchemy library.

import pymysql
from sqlalchemy import create_engine

Once your connection to the database is established, you will use some basic SELECT queries to retrieve the data needed to answer the questions described next.

💡 If you receive import errors for pymysql or sqlalchemy, it means you need to install them with pip (pip install pymysql sqlalchemy).
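
As a rough sketch of the connection step, the snippet below builds a SQLAlchemy URL for the pymysql driver and wraps a basic SELECT query in a helper. The helper names (build_mysql_url, load_table), the credentials, the host, and the database name are all hypothetical placeholders, not values given by this lab; use the connection details you were given instead.

```python
from urllib.parse import quote


def build_mysql_url(user, password, host, database, port=3306):
    """Build a SQLAlchemy connection URL that uses the pymysql driver."""
    # Percent-encode the credentials so characters like "@" or "/" do not break the URL.
    user = quote(user, safe="")
    password = quote(password, safe="")
    return f"mysql+pymysql://{user}:{password}@{host}:{port}/{database}"


def load_table(table_name, url):
    """Read a whole table into a DataFrame with a basic SELECT query."""
    # Imported here so the URL helper above works without a database driver installed.
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine(url)
    # table_name is interpolated directly into the query, so only pass trusted names.
    return pd.read_sql(f"SELECT * FROM {table_name}", engine)


# Placeholder values only: replace them with the credentials given for the lab.
url = build_mysql_url("guest", "p@ss", "localhost", "stats")
```

With a reachable server, something like users = load_table("users", url) and posts = load_table("posts", url) would then return the two tables as dataframes.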

Challenge Questions

  1. Connect to the server and collect all the data from users and posts tables.

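  2. Create a merged dataframe from the users and posts tables. Keep in mind that you will need to prepare the tables before merging, for example by making sure the column you join on has the same name in both.

  3. Identify missing values in the merged dataframe and apply some of the methods from the lesson to handle them.

  4. Change the data types of the columns in your merged dataframe where appropriate.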

  5. Bonus Question: Create a dataframe with the outliers you have identified in the dataframe and export it to a csv file in your-code folder.
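
The merge, missing-value, and data-type steps above could look roughly like the following sketch. It runs on two tiny made-up dataframes rather than the real tables, and the column names (Id, OwnerUserId, Reputation, ViewCount) as well as the choice to fill missing view counts with 0 are assumptions for illustration only; check the actual columns and pick a strategy that fits the data.

```python
import pandas as pd

# Toy stand-ins for the two tables; the column names are assumptions.
users = pd.DataFrame({
    "Id": [1, 2, 3],
    "Reputation": ["101", "57", "8"],  # numbers stored as text
})
posts = pd.DataFrame({
    "Id": [10, 11, 12],
    "OwnerUserId": [1, 1, 2],
    "ViewCount": [120.0, None, 45.0],
})

# Both tables have an "Id" column with a different meaning, so rename before merging.
users = users.rename(columns={"Id": "userId"})
posts = posts.rename(columns={"Id": "postId", "OwnerUserId": "userId"})

# Keep only posts whose owner appears in the users table.
merged = posts.merge(users, on="userId", how="inner")

# Count missing values per column.
missing_counts = merged.isna().sum()

# One possible strategy: treat a missing view count as zero views.
merged["ViewCount"] = merged["ViewCount"].fillna(0)

# Convert columns to more suitable types once the gaps are filled.
merged = merged.astype({"Reputation": "int64", "ViewCount": "int64"})
```

Filling with a constant is only one option; dropping rows with dropna or leaving the gaps in place can be more appropriate depending on what the missing value means.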

❗ If you feel you are already good at Python/Pandas and don't need the instructions in main.ipynb to walk you through, please feel free to skip main.ipynb and create your own solution file.
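
For the bonus question, one common way to flag outliers is the interquartile range (IQR) rule, sketched below. This is only one of several possible definitions of an outlier, and the function name iqr_outliers, the sample data, and the multiplier k=1.5 are illustrative assumptions rather than anything the lab prescribes.

```python
import pandas as pd


def iqr_outliers(df, column, k=1.5):
    """Return the rows of df whose value in `column` lies outside the IQR fences."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[(df[column] < lower) | (df[column] > upper)]


# Made-up sample: one reputation value is far larger than the rest.
sample = pd.DataFrame({
    "userId": [1, 2, 3, 4, 5, 6],
    "Reputation": [10, 12, 11, 13, 12, 500],
})
outliers = iqr_outliers(sample, "Reputation")
```

The resulting dataframe can then be exported with outliers.to_csv("your-code/outliers.csv", index=False), where the file name outliers.csv is a placeholder.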

Deliverables

  • main.ipynb with your responses to each of the questions above.

Submission

Upon completion, add your deliverables to git, commit them, push to your forked repo, and create the pull request as in the previous labs. Remember that you have already forked the repo!

  • Commit your code and push it to GitHub:

    git add .
    git commit -m "<lab or project name>"
    git push origin master
    
  • Navigate to your repo and create a pull request with a title in this format: "[<your_campus>][<bootcamp_code>] [<lab/project_name>] <your_name>"

    • For instance, if you are doing the data bootcamp in Madrid, your name is Marc Pomar, and the lab you are working on is lab-numpy, your pull request should be named: "[MAD][datamad10108] [lab-numpy] Marc Pomar"
  • If you have successfully created the pull request you are done! CONGRATS :)

Resources

Data Cleaning with Numpy and Pandas

Data Cleaning Video

Data Preparation

Google Search
