Skip to content

Latest commit

 

History

History
945 lines (364 loc) · 47.9 KB

README.md

File metadata and controls

945 lines (364 loc) · 47.9 KB

How to use pandas_cub

The README.ipynb notebook will serve as the documentation and usage guide to pandas_cub.

Installation

pip install pandas-cub

What is pandas_cub?

pandas_cub is a simple data analysis library that emulates the functionality of the pandas library. The library is not meant for serious work. It was built as an assignment for one of Ted Petrou's Python classes. If you would like to complete the assignment on your own, visit this repository. There are about 40 steps and 100 tests that you must pass in order to rebuild the library. It is a good challenge and teaches you the fundamentals of how to build your own data analysis library.

pandas_cub functionality

pandas_cub has limited functionality but is still capable of a wide variety of data analysis tasks.

  • Subset selection with the brackets
  • Arithmetic and comparison operators (+, -, <, !=, etc...)
  • Aggregation of columns with most of the common functions (min, max, mean, median, etc...)
  • Grouping via pivot tables
  • String-only methods for columns containing strings
  • Reading in simple comma-separated value files
  • Several other methods

pandas_cub DataFrame

pandas_cub has a single main object, the DataFrame, to hold all of the data. The DataFrame is capable of holding 4 data types - booleans, integers, floats, and strings. All data is stored in NumPy arrays. panda_cub DataFrames have no index (as in pandas). The columns must be strings.

Missing value representation

Boolean and integer columns will have no missing value representation. The NumPy NaN is used for float columns and the Python None is used for string columns.

Code Examples

pandas_cub syntax is very similar to pandas, but implements much fewer methods. The below examples will cover just about all of the API.

Reading data with read_csv

pandas_cub consists of a single function, read_csv, that has a single parameter, the location of the file you would like to read in as a DataFrame. This function can only handle simple CSV's and the delimiter must be a comma. A sample employee dataset is provided in the data directory. Notice that the visual output of the DataFrame is nearly identical to that of a pandas DataFrame. The head method returns the first 5 rows by default.

import pandas_cub as pdc
df = pdc.read_csv('data/employee.csv')
df.head()
dept race gender salary
0Houston Police Department-HPDWhite Male 45279
1Houston Fire Department (HFD)White Male 63166
2Houston Police Department-HPDBlack Male 66614
3Public Works & Engineering-PWEAsian Male 71680
4Houston Airport System (HAS)White Male 42390

DataFrame properties

The shape property returns a tuple of the number of rows and columns

df.shape
(1535, 4)

The len function returns just the number of rows.

len(df)
1535

The dtypes property returns a DataFrame of the column names and their respective data type.

df.dtypes
Column NameData Type
0dept string
1race string
2gender string
3salary int

The columns property returns a list of the columns.

df.columns
['dept', 'race', 'gender', 'salary']

Set new columns by assigning the columns property to a list.

df.columns = ['department', 'race', 'gender', 'salary']
df.head()
departmentrace gender salary
0Houston Police Department-HPDWhite Male 45279
1Houston Fire Department (HFD)White Male 63166
2Houston Police Department-HPDBlack Male 66614
3Public Works & Engineering-PWEAsian Male 71680
4Houston Airport System (HAS)White Male 42390

The values property returns a single numpy array of all the data.

df.values
array([['Houston Police Department-HPD', 'White', 'Male', 45279],
       ['Houston Fire Department (HFD)', 'White', 'Male', 63166],
       ['Houston Police Department-HPD', 'Black', 'Male', 66614],
       ...,
       ['Houston Police Department-HPD', 'White', 'Male', 43443],
       ['Houston Police Department-HPD', 'Asian', 'Male', 55461],
       ['Houston Fire Department (HFD)', 'Hispanic', 'Male', 51194]],
      dtype=object)

Subset selection

Subset selection is handled with the brackets. To select a single column, place that column name in the brackets.

df['race'].head()
race
0White
1White
2Black
3Asian
4White

Select multiple columns with a list of strings.

df[['race', 'salary']].head()
race salary
0White 45279
1White 63166
2Black 66614
3Asian 71680
4White 42390

Simultaneously select rows and columns by passing the brackets the row selection followed by the column selection separated by a comma. Here we use integers for rows and strings for columns.

rows = [10, 50, 100]
cols = ['salary', 'race']
df[rows, cols]
salary race
0 77076Black
1 81239White
2 81239White

You can use integers for the columns as well.

rows = [10, 50, 100]
cols = [2, 0]
df[rows, cols]
gender department
0Male Houston Police Department-HPD
1Male Houston Police Department-HPD
2Male Houston Police Department-HPD

You can use a single integer and not just a list.

df[99, 3]
salary
0 66614

Or a single string for the columns

df[99, 'salary']
salary
0 66614

You can use a slice for the rows

df[20:100:10, ['race', 'gender']]
race gender
0White Male
1White Male
2Hispanic Male
3White Male
4White Male
5Hispanic Male
6Hispanic Male
7Black Female

You can also slice the columns with either integers or strings

df[20:100:10, :2]
departmentrace
0Houston Police Department-HPDWhite
1Houston Fire Department (HFD)White
2Houston Police Department-HPDHispanic
3Houston Police Department-HPDWhite
4Houston Fire Department (HFD)White
5Houston Police Department-HPDHispanic
6Houston Fire Department (HFD)Hispanic
7Houston Police Department-HPDBlack
df[20:100:10, 'department':'gender']
departmentrace gender
0Houston Police Department-HPDWhite Male
1Houston Fire Department (HFD)White Male
2Houston Police Department-HPDHispanic Male
3Houston Police Department-HPDWhite Male
4Houston Fire Department (HFD)White Male
5Houston Police Department-HPDHispanic Male
6Houston Fire Department (HFD)Hispanic Male
7Houston Police Department-HPDBlack Female

You can do boolean selection if you pass the brackets a one-column boolean DataFrame.

filt = df['salary'] > 100000
filt.head()
salary
0False
1False
2False
3False
4False
df[filt].head()
departmentrace gender salary
0Public Works & Engineering-PWEWhite Male 107962
1Health & Human ServicesBlack Male 180416
2Houston Fire Department (HFD)Hispanic Male 165216
3Health & Human ServicesWhite Female 100791
4Houston Airport System (HAS)White Male 120916
df[filt, ['race', 'salary']].head()
race salary
0White 107962
1Black 180416
2Hispanic 165216
3White 100791
4White 120916

Assigning Columns

You can only assign an entire new column or overwrite an old one. You cannot assign a subset of the data. You can assign a new column with a single value like this:

df['bonus'] = 1000
df.head()
departmentrace gender salary bonus
0Houston Police Department-HPDWhite Male 45279 1000
1Houston Fire Department (HFD)White Male 63166 1000
2Houston Police Department-HPDBlack Male 66614 1000
3Public Works & Engineering-PWEAsian Male 71680 1000
4Houston Airport System (HAS)White Male 42390 1000

You can assign with a numpy array the same length as a column.

import numpy as np
df['bonus'] = np.random.randint(100, 5000, len(df))
df.head()
departmentrace gender salary bonus
0Houston Police Department-HPDWhite Male 45279 3536
1Houston Fire Department (HFD)White Male 63166 1296
2Houston Police Department-HPDBlack Male 66614 511
3Public Works & Engineering-PWEAsian Male 71680 4267
4Houston Airport System (HAS)White Male 42390 3766

You can assign a new column with a one column DataFrame.

df['salary'] + df['bonus']
salary
0 48815
1 64462
2 67125
3 75947
4 46156
5 110001
6 53738
7 185348
8 32575
9 57918
......
1525 32936
1526 49294
1527 34218
1528 82795
1529 104900
1530 46408
1531 67050
1532 47368
1533 60013
1534 52624
df['total salary'] = df['salary'] + df['bonus']
df.head()
departmentrace gender salary bonus total salary
0Houston Police Department-HPDWhite Male 45279 3536 48815
1Houston Fire Department (HFD)White Male 63166 1296 64462
2Houston Police Department-HPDBlack Male 66614 511 67125
3Public Works & Engineering-PWEAsian Male 71680 4267 75947
4Houston Airport System (HAS)White Male 42390 3766 46156

Arithmetic and comparison operators

df1 = df[['salary', 'bonus']] * 5
df1.head()
salary bonus
0 226395 17680
1 315830 6480
2 333070 2555
3 358400 21335
4 211950 18830
df1 = df[['salary', 'bonus']] > 100000
df1.head()
salary bonus
0FalseFalse
1FalseFalse
2FalseFalse
3FalseFalse
4FalseFalse
df1 = df['race'] == 'White'
df1.head()
race
0True
1True
2False
3False
4True

Aggregation

Most of the common aggregation methods are available. They only work down the columns and not across the rows.

df.min()
departmentrace gender salary bonus total salary
0Health & Human ServicesAsian Female 24960 101 25913

Columns that the aggregation does not work are dropped.

df.mean()
salary bonus total salary
0 56278.746 2594.283 58873.029
df.argmax()
departmentrace gender salary bonus total salary
0 3 0 0 145 1516 145
df['salary'].argmin()
salary
0 347

Check if all salaries are greater than 20000

df1 = df['salary'] > 20000
df1.all()
salary
0True

Count the number of non-missing values

df.count()
departmentrace gender salary bonus total salary
0 1535 1535 1535 1535 1535 1535

Get number of unique values.

df.nunique()
departmentrace gender salary bonus total salary
0 6 5 2 548 1318 1524

Non-Aggregating Methods

These are methods that do not return a single value.

Get the unique values of each column. The unique method returns a list of DataFrames containing the unique values for each column.

dfs = df.unique()
dfs[0]
department
0Health & Human Services
1Houston Airport System (HAS)
2Houston Fire Department (HFD)
3Houston Police Department-HPD
4Parks & Recreation
5Public Works & Engineering-PWE
dfs[1]
race
0Asian
1Black
2Hispanic
3Native American
4White
dfs[2]
gender
0Female
1Male

Rename columns with a dictionary.

df.rename({'department':'dept', 'bonus':'BONUS'}).head()
dept race gender salary BONUS total salary
0Houston Police Department-HPDWhite Male 45279 3536 48815
1Houston Fire Department (HFD)White Male 63166 1296 64462
2Houston Police Department-HPDBlack Male 66614 511 67125
3Public Works & Engineering-PWEAsian Male 71680 4267 75947
4Houston Airport System (HAS)White Male 42390 3766 46156

Drop columns with a string or list of strings.

df.drop('race').head()
departmentgender salary bonus total salary
0Houston Police Department-HPDMale 45279 3536 48815
1Houston Fire Department (HFD)Male 63166 1296 64462
2Houston Police Department-HPDMale 66614 511 67125
3Public Works & Engineering-PWEMale 71680 4267 75947
4Houston Airport System (HAS)Male 42390 3766 46156
df.drop(['race', 'gender']).head()
departmentsalary bonus total salary
0Houston Police Department-HPD 45279 3536 48815
1Houston Fire Department (HFD) 63166 1296 64462
2Houston Police Department-HPD 66614 511 67125
3Public Works & Engineering-PWE 71680 4267 75947
4Houston Airport System (HAS) 42390 3766 46156

Non-aggregating methods that keep all columns

The next several methods are non-aggregating methods that return a DataFrame with the same exact shape as the original. They only work on boolean, integer and float columns and ignore string columns.

Absolute value

df.abs().head()
departmentrace gender salary bonus total salary
0Houston Police Department-HPDWhite Male 45279 3536 48815
1Houston Fire Department (HFD)White Male 63166 1296 64462
2Houston Police Department-HPDBlack Male 66614 511 67125
3Public Works & Engineering-PWEAsian Male 71680 4267 75947
4Houston Airport System (HAS)White Male 42390 3766 46156

Cumulative min, max, and sum

df.cummax().head()
departmentrace gender salary bonus total salary
0Houston Police Department-HPDWhite Male 45279 3536 48815
1Houston Fire Department (HFD)White Male 63166 3536 64462
2Houston Police Department-HPDBlack Male 66614 3536 67125
3Public Works & Engineering-PWEAsian Male 71680 4267 75947
4Houston Airport System (HAS)White Male 71680 4267 75947

Clip values to be within a range.

df.clip(40000, 60000).head()
departmentrace gender salary bonus total salary
0Houston Police Department-HPDWhite Male 45279 40000 48815
1Houston Fire Department (HFD)White Male 60000 40000 60000
2Houston Police Department-HPDBlack Male 60000 40000 60000
3Public Works & Engineering-PWEAsian Male 60000 40000 60000
4Houston Airport System (HAS)White Male 42390 40000 46156

Round numeric columns

df.round(-3).head()
departmentrace gender salary bonus total salary
0Houston Police Department-HPDWhite Male 45000 4000 49000
1Houston Fire Department (HFD)White Male 63000 1000 64000
2Houston Police Department-HPDBlack Male 67000 1000 67000
3Public Works & Engineering-PWEAsian Male 72000 4000 76000
4Houston Airport System (HAS)White Male 42000 4000 46000

Copy the DataFrame

df.copy().head()
departmentrace gender salary bonus total salary
0Houston Police Department-HPDWhite Male 45279 3536 48815
1Houston Fire Department (HFD)White Male 63166 1296 64462
2Houston Police Department-HPDBlack Male 66614 511 67125
3Public Works & Engineering-PWEAsian Male 71680 4267 75947
4Houston Airport System (HAS)White Male 42390 3766 46156

Take the nth difference.

df.diff(2).head(10)
departmentrace gender salary bonus total salary
0Houston Police Department-HPDWhite Male nan nan nan
1Houston Fire Department (HFD)White Male nan nan nan
2Houston Police Department-HPDBlack Male 21335.000 -3025.000 18310.000
3Public Works & Engineering-PWEAsian Male 8514.000 2971.000 11485.000
4Houston Airport System (HAS)White Male -24224.000 3255.000-20969.000
5Public Works & Engineering-PWEWhite Male 36282.000 -2228.000 34054.000
6Houston Fire Department (HFD)Hispanic Male 10254.000 -2672.000 7582.000
7Health & Human ServicesBlack Male 72454.000 2893.000 75347.000
8Public Works & Engineering-PWEBlack Male -22297.000 1134.000-21163.000
9Health & Human ServicesBlack Male -125147.000 -2283.000-127430.000

Find the nth percentage change.

df.pct_change(2).head(10)
departmentrace gender salary bonus total salary
0Houston Police Department-HPDWhite Male nan nan nan
1Houston Fire Department (HFD)White Male nan nan nan
2Houston Police Department-HPDBlack Male 0.471 -0.855 0.375
3Public Works & Engineering-PWEAsian Male 0.135 2.292 0.178
4Houston Airport System (HAS)White Male -0.364 6.370 -0.312
5Public Works & Engineering-PWEWhite Male 0.506 -0.522 0.448
6Houston Fire Department (HFD)Hispanic Male 0.242 -0.710 0.164
7Health & Human ServicesBlack Male 0.671 1.419 0.685
8Public Works & Engineering-PWEBlack Male -0.424 1.037 -0.394
9Health & Human ServicesBlack Male -0.694 -0.463 -0.688

Sort the DataFrame by one or more columns

df.sort_values('salary').head()
departmentrace gender salary bonus total salary
0Houston Police Department-HPDBlack Female 24960 953 25913
1Public Works & Engineering-PWEHispanic Male 26104 4258 30362
2Public Works & Engineering-PWEBlack Female 26125 3247 29372
3Houston Airport System (HAS)Hispanic Female 26125 832 26957
4Houston Airport System (HAS)Black Female 26125 2461 28586

Sort descending

df.sort_values('salary', asc=False).head()
departmentrace gender salary bonus total salary
0Houston Fire Department (HFD)White Male 210588 3724 214312
1Houston Police Department-HPDWhite Male 199596 848 200444
2Houston Airport System (HAS)Black Male 186192 1778 187970
3Health & Human ServicesBlack Male 180416 4932 185348
4Public Works & Engineering-PWEWhite Female 178331 2124 180455

Sort by multiple columns

df.sort_values(['race', 'salary']).head()
departmentrace gender salary bonus total salary
0Houston Airport System (HAS)Asian Female 26125 4446 30571
1Houston Police Department-HPDAsian Male 27914 2855 30769
2Houston Police Department-HPDAsian Male 28169 2572 30741
3Public Works & Engineering-PWEAsian Male 28995 2874 31869
4Public Works & Engineering-PWEAsian Male 30347 4938 35285

Randomly sample the DataFrame

df.sample(n=3)
departmentrace gender salary bonus total salary
0Houston Fire Department (HFD)White Male 62540 2995 65535
1Public Works & Engineering-PWEWhite Male 63336 1547 64883
2Houston Police Department-HPDWhite Male 52514 1150 53664

Randomly sample a fraction

df.sample(frac=.005)
departmentrace gender salary bonus total salary
0Houston Police Department-HPDHispanic Female 60347 1200 61547
1Public Works & Engineering-PWEBlack Male 49109 3598 52707
2Health & Human ServicesBlack Female 48984 4602 53586
3Houston Police Department-HPDWhite Male 55461 2813 58274
4Houston Airport System (HAS)Black Female 29286 1877 31163
5Houston Police Department-HPDAsian Male 66614 4480 71094
6Houston Fire Department (HFD)White Male 28024 4475 32499

Sample with replacement

df.sample(n=10000, replace=True).head()
departmentrace gender salary bonus total salary
0Parks & RecreationBlack Female 31075 1665 32740
1Public Works & Engineering-PWEHispanic Male 67038 644 67682
2Houston Police Department-HPDBlack Male 37024 1532 38556
3Health & Human ServicesBlack Female 57433 3106 60539
4Public Works & Engineering-PWEBlack Male 53373 924 54297

String-only methods

Use the str accessor to call methods available just to string columns. Pass the name of the string column as the first parameter for all these methods.

df.str.count('department', 'P').head()
department
0 2
1 0
2 2
3 2
4 0
df.str.lower('department').head()
department
0houston police department-hpd
1houston fire department (hfd)
2houston police department-hpd
3public works & engineering-pwe
4houston airport system (has)
df.str.find('department', 'Houston').head()
department
0 0
1 0
2 0
3 -1
4 0

Grouping

pandas_cub provides the value_counts method for simple frequency counting of unique values and pivot_table for grouping and aggregating.

The value_counts method returns a list of DataFrames, one for each column.

dfs = df[['department', 'race', 'gender']].value_counts()
dfs[0]
departmentcount
0Houston Police Department-HPD 570
1Houston Fire Department (HFD) 365
2Public Works & Engineering-PWE 341
3Health & Human Services 103
4Houston Airport System (HAS) 103
5Parks & Recreation 53
dfs[1]
race count
0White 542
1Black 518
2Hispanic 381
3Asian 87
4Native American 7
dfs[2]
gender count
0Male 1135
1Female 400

If your DataFrame has one column, a DataFrame and not a list is returned. You can also return the relative frequency by setting the normalize parameter to True.

df['race'].value_counts(normalize=True)
race count
0White 0.353
1Black 0.337
2Hispanic 0.248
3Asian 0.057
4Native American 0.005

The pivot_table method allows to group by one or two columns and aggregate values from another column. Let's find the average salary for each race and gender. All parameters must be strings.

df.pivot_table(rows='race', columns='gender', values='salary', aggfunc='mean')
race Female Male
0Asian 58304.222 60622.957
1Black 48133.382 51853.000
2Hispanic 44216.960 55493.064
3Native American 58844.333 68850.500
4White 66415.528 63439.196

If you don't provide values or aggfunc then by default it will return frequency (a contingency table).

df.pivot_table(rows='race', columns='gender')
race Female Male
0Asian 18 69
1Black 207 311
2Hispanic 100 281
3Native American 3 4
4White 72 470

You can group by just a single column.

df.pivot_table(rows='department', values='salary', aggfunc='mean')
departmentmean
0Health & Human Services 51324.981
1Houston Airport System (HAS) 53990.369
2Houston Fire Department (HFD) 59960.441
3Houston Police Department-HPD 60428.746
4Parks & Recreation 39426.151
5Public Works & Engineering-PWE 50207.806
df.pivot_table(columns='department', values='salary', aggfunc='mean')
Health & Human ServicesHouston Airport System (HAS)Houston Fire Department (HFD)Houston Police Department-HPDParks & RecreationPublic Works & Engineering-PWE
0 51324.981 53990.369 59960.441 60428.746 39426.151 50207.806