You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The README.ipynb notebook will serve as the documentation and usage guide to pandas_cub.
Installation
pip install pandas-cub
What is pandas_cub?
pandas_cub is a simple data analysis library that emulates the functionality of the pandas library. The library is not meant for serious work. It was built as an assignment for one of Ted Petrou's Python classes. If you would like to complete the assignment on your own, visit this repository. There are about 40 steps and 100 tests that you must pass in order to rebuild the library. It is a good challenge and teaches you the fundamentals of how to build your own data analysis library.
pandas_cub functionality
pandas_cub has limited functionality but is still capable of a wide variety of data analysis tasks.
Subset selection with the brackets
Arithmetic and comparison operators (+, -, <, !=, etc...)
Aggregation of columns with most of the common functions (min, max, mean, median, etc...)
Grouping via pivot tables
String-only methods for columns containing strings
Reading in simple comma-separated value files
Several other methods
pandas_cub DataFrame
pandas_cub has a single main object, the DataFrame, to hold all of the data. The DataFrame is capable of holding 4 data types - booleans, integers, floats, and strings. All data is stored in NumPy arrays. panda_cub DataFrames have no index (as in pandas). The columns must be strings.
Missing value representation
Boolean and integer columns will have no missing value representation. The NumPy NaN is used for float columns and the Python None is used for string columns.
Code Examples
pandas_cub syntax is very similar to pandas, but implements much fewer methods. The below examples will cover just about all of the API.
Reading data with read_csv
pandas_cub consists of a single function, read_csv, that has a single parameter, the location of the file you would like to read in as a DataFrame. This function can only handle simple CSV's and the delimiter must be a comma. A sample employee dataset is provided in the data directory. Notice that the visual output of the DataFrame is nearly identical to that of a pandas DataFrame. The head method returns the first 5 rows by default.
importpandas_cubaspdc
df=pdc.read_csv('data/employee.csv')
df.head()
dept
race
gender
salary
0
Houston Police Department-HPD
White
Male
45279
1
Houston Fire Department (HFD)
White
Male
63166
2
Houston Police Department-HPD
Black
Male
66614
3
Public Works & Engineering-PWE
Asian
Male
71680
4
Houston Airport System (HAS)
White
Male
42390
DataFrame properties
The shape property returns a tuple of the number of rows and columns
df.shape
(1535, 4)
The len function returns just the number of rows.
len(df)
1535
The dtypes property returns a DataFrame of the column names and their respective data type.
df.dtypes
Column Name
Data Type
0
dept
string
1
race
string
2
gender
string
3
salary
int
The columns property returns a list of the columns.
df.columns
['dept', 'race', 'gender', 'salary']
Set new columns by assigning the columns property to a list.
The values property returns a single numpy array of all the data.
df.values
array([['Houston Police Department-HPD', 'White', 'Male', 45279],
['Houston Fire Department (HFD)', 'White', 'Male', 63166],
['Houston Police Department-HPD', 'Black', 'Male', 66614],
...,
['Houston Police Department-HPD', 'White', 'Male', 43443],
['Houston Police Department-HPD', 'Asian', 'Male', 55461],
['Houston Fire Department (HFD)', 'Hispanic', 'Male', 51194]],
dtype=object)
Subset selection
Subset selection is handled with the brackets. To select a single column, place that column name in the brackets.
df['race'].head()
race
0
White
1
White
2
Black
3
Asian
4
White
Select multiple columns with a list of strings.
df[['race', 'salary']].head()
race
salary
0
White
45279
1
White
63166
2
Black
66614
3
Asian
71680
4
White
42390
Simultaneously select rows and columns by passing the brackets the row selection followed by the column selection separated by a comma. Here we use integers for rows and strings for columns.
You can also slice the columns with either integers or strings
df[20:100:10, :2]
department
race
0
Houston Police Department-HPD
White
1
Houston Fire Department (HFD)
White
2
Houston Police Department-HPD
Hispanic
3
Houston Police Department-HPD
White
4
Houston Fire Department (HFD)
White
5
Houston Police Department-HPD
Hispanic
6
Houston Fire Department (HFD)
Hispanic
7
Houston Police Department-HPD
Black
df[20:100:10, 'department':'gender']
department
race
gender
0
Houston Police Department-HPD
White
Male
1
Houston Fire Department (HFD)
White
Male
2
Houston Police Department-HPD
Hispanic
Male
3
Houston Police Department-HPD
White
Male
4
Houston Fire Department (HFD)
White
Male
5
Houston Police Department-HPD
Hispanic
Male
6
Houston Fire Department (HFD)
Hispanic
Male
7
Houston Police Department-HPD
Black
Female
You can do boolean selection if you pass the brackets a one-column boolean DataFrame.
filt=df['salary'] >100000filt.head()
salary
0
False
1
False
2
False
3
False
4
False
df[filt].head()
department
race
gender
salary
0
Public Works & Engineering-PWE
White
Male
107962
1
Health & Human Services
Black
Male
180416
2
Houston Fire Department (HFD)
Hispanic
Male
165216
3
Health & Human Services
White
Female
100791
4
Houston Airport System (HAS)
White
Male
120916
df[filt, ['race', 'salary']].head()
race
salary
0
White
107962
1
Black
180416
2
Hispanic
165216
3
White
100791
4
White
120916
Assigning Columns
You can only assign an entire new column or overwrite an old one. You cannot assign a subset of the data. You can assign a new column with a single value like this:
df['bonus'] =1000df.head()
department
race
gender
salary
bonus
0
Houston Police Department-HPD
White
Male
45279
1000
1
Houston Fire Department (HFD)
White
Male
63166
1000
2
Houston Police Department-HPD
Black
Male
66614
1000
3
Public Works & Engineering-PWE
Asian
Male
71680
1000
4
Houston Airport System (HAS)
White
Male
42390
1000
You can assign with a numpy array the same length as a column.
The next several methods are non-aggregating methods that return a DataFrame with the same exact shape as the original. They only work on boolean, integer and float columns and ignore string columns.
Absolute value
df.abs().head()
department
race
gender
salary
bonus
total salary
0
Houston Police Department-HPD
White
Male
45279
3536
48815
1
Houston Fire Department (HFD)
White
Male
63166
1296
64462
2
Houston Police Department-HPD
Black
Male
66614
511
67125
3
Public Works & Engineering-PWE
Asian
Male
71680
4267
75947
4
Houston Airport System (HAS)
White
Male
42390
3766
46156
Cumulative min, max, and sum
df.cummax().head()
department
race
gender
salary
bonus
total salary
0
Houston Police Department-HPD
White
Male
45279
3536
48815
1
Houston Fire Department (HFD)
White
Male
63166
3536
64462
2
Houston Police Department-HPD
Black
Male
66614
3536
67125
3
Public Works & Engineering-PWE
Asian
Male
71680
4267
75947
4
Houston Airport System (HAS)
White
Male
71680
4267
75947
Clip values to be within a range.
df.clip(40000, 60000).head()
department
race
gender
salary
bonus
total salary
0
Houston Police Department-HPD
White
Male
45279
40000
48815
1
Houston Fire Department (HFD)
White
Male
60000
40000
60000
2
Houston Police Department-HPD
Black
Male
60000
40000
60000
3
Public Works & Engineering-PWE
Asian
Male
60000
40000
60000
4
Houston Airport System (HAS)
White
Male
42390
40000
46156
Round numeric columns
df.round(-3).head()
department
race
gender
salary
bonus
total salary
0
Houston Police Department-HPD
White
Male
45000
4000
49000
1
Houston Fire Department (HFD)
White
Male
63000
1000
64000
2
Houston Police Department-HPD
Black
Male
67000
1000
67000
3
Public Works & Engineering-PWE
Asian
Male
72000
4000
76000
4
Houston Airport System (HAS)
White
Male
42000
4000
46000
Copy the DataFrame
df.copy().head()
department
race
gender
salary
bonus
total salary
0
Houston Police Department-HPD
White
Male
45279
3536
48815
1
Houston Fire Department (HFD)
White
Male
63166
1296
64462
2
Houston Police Department-HPD
Black
Male
66614
511
67125
3
Public Works & Engineering-PWE
Asian
Male
71680
4267
75947
4
Houston Airport System (HAS)
White
Male
42390
3766
46156
Take the nth difference.
df.diff(2).head(10)
department
race
gender
salary
bonus
total salary
0
Houston Police Department-HPD
White
Male
nan
nan
nan
1
Houston Fire Department (HFD)
White
Male
nan
nan
nan
2
Houston Police Department-HPD
Black
Male
21335.000
-3025.000
18310.000
3
Public Works & Engineering-PWE
Asian
Male
8514.000
2971.000
11485.000
4
Houston Airport System (HAS)
White
Male
-24224.000
3255.000
-20969.000
5
Public Works & Engineering-PWE
White
Male
36282.000
-2228.000
34054.000
6
Houston Fire Department (HFD)
Hispanic
Male
10254.000
-2672.000
7582.000
7
Health & Human Services
Black
Male
72454.000
2893.000
75347.000
8
Public Works & Engineering-PWE
Black
Male
-22297.000
1134.000
-21163.000
9
Health & Human Services
Black
Male
-125147.000
-2283.000
-127430.000
Find the nth percentage change.
df.pct_change(2).head(10)
department
race
gender
salary
bonus
total salary
0
Houston Police Department-HPD
White
Male
nan
nan
nan
1
Houston Fire Department (HFD)
White
Male
nan
nan
nan
2
Houston Police Department-HPD
Black
Male
0.471
-0.855
0.375
3
Public Works & Engineering-PWE
Asian
Male
0.135
2.292
0.178
4
Houston Airport System (HAS)
White
Male
-0.364
6.370
-0.312
5
Public Works & Engineering-PWE
White
Male
0.506
-0.522
0.448
6
Houston Fire Department (HFD)
Hispanic
Male
0.242
-0.710
0.164
7
Health & Human Services
Black
Male
0.671
1.419
0.685
8
Public Works & Engineering-PWE
Black
Male
-0.424
1.037
-0.394
9
Health & Human Services
Black
Male
-0.694
-0.463
-0.688
Sort the DataFrame by one or more columns
df.sort_values('salary').head()
department
race
gender
salary
bonus
total salary
0
Houston Police Department-HPD
Black
Female
24960
953
25913
1
Public Works & Engineering-PWE
Hispanic
Male
26104
4258
30362
2
Public Works & Engineering-PWE
Black
Female
26125
3247
29372
3
Houston Airport System (HAS)
Hispanic
Female
26125
832
26957
4
Houston Airport System (HAS)
Black
Female
26125
2461
28586
Sort descending
df.sort_values('salary', asc=False).head()
department
race
gender
salary
bonus
total salary
0
Houston Fire Department (HFD)
White
Male
210588
3724
214312
1
Houston Police Department-HPD
White
Male
199596
848
200444
2
Houston Airport System (HAS)
Black
Male
186192
1778
187970
3
Health & Human Services
Black
Male
180416
4932
185348
4
Public Works & Engineering-PWE
White
Female
178331
2124
180455
Sort by multiple columns
df.sort_values(['race', 'salary']).head()
department
race
gender
salary
bonus
total salary
0
Houston Airport System (HAS)
Asian
Female
26125
4446
30571
1
Houston Police Department-HPD
Asian
Male
27914
2855
30769
2
Houston Police Department-HPD
Asian
Male
28169
2572
30741
3
Public Works & Engineering-PWE
Asian
Male
28995
2874
31869
4
Public Works & Engineering-PWE
Asian
Male
30347
4938
35285
Randomly sample the DataFrame
df.sample(n=3)
department
race
gender
salary
bonus
total salary
0
Houston Fire Department (HFD)
White
Male
62540
2995
65535
1
Public Works & Engineering-PWE
White
Male
63336
1547
64883
2
Houston Police Department-HPD
White
Male
52514
1150
53664
Randomly sample a fraction
df.sample(frac=.005)
department
race
gender
salary
bonus
total salary
0
Houston Police Department-HPD
Hispanic
Female
60347
1200
61547
1
Public Works & Engineering-PWE
Black
Male
49109
3598
52707
2
Health & Human Services
Black
Female
48984
4602
53586
3
Houston Police Department-HPD
White
Male
55461
2813
58274
4
Houston Airport System (HAS)
Black
Female
29286
1877
31163
5
Houston Police Department-HPD
Asian
Male
66614
4480
71094
6
Houston Fire Department (HFD)
White
Male
28024
4475
32499
Sample with replacement
df.sample(n=10000, replace=True).head()
department
race
gender
salary
bonus
total salary
0
Parks & Recreation
Black
Female
31075
1665
32740
1
Public Works & Engineering-PWE
Hispanic
Male
67038
644
67682
2
Houston Police Department-HPD
Black
Male
37024
1532
38556
3
Health & Human Services
Black
Female
57433
3106
60539
4
Public Works & Engineering-PWE
Black
Male
53373
924
54297
String-only methods
Use the str accessor to call methods available just to string columns. Pass the name of the string column as the first parameter for all these methods.
df.str.count('department', 'P').head()
department
0
2
1
0
2
2
3
2
4
0
df.str.lower('department').head()
department
0
houston police department-hpd
1
houston fire department (hfd)
2
houston police department-hpd
3
public works & engineering-pwe
4
houston airport system (has)
df.str.find('department', 'Houston').head()
department
0
0
1
0
2
0
3
-1
4
0
Grouping
pandas_cub provides the value_counts method for simple frequency counting of unique values and pivot_table for grouping and aggregating.
The value_counts method returns a list of DataFrames, one for each column.
If your DataFrame has one column, a DataFrame and not a list is returned. You can also return the relative frequency by setting the normalize parameter to True.
df['race'].value_counts(normalize=True)
race
count
0
White
0.353
1
Black
0.337
2
Hispanic
0.248
3
Asian
0.057
4
Native American
0.005
The pivot_table method allows to group by one or two columns and aggregate values from another column. Let's find the average salary for each race and gender. All parameters must be strings.