Fixing problem with Columns natively inferred to be Boolean
John Hawkins authored and committed Aug 6, 2020
1 parent 56a7745 commit 9677ff4
Showing 6 changed files with 35 additions and 14 deletions.
11 changes: 9 additions & 2 deletions README.md
@@ -32,7 +32,7 @@ based on the implementation by [Javia Jinkal](https://github.com/javiajinkal/Fla
 This [review article by Phillip Gibbons](https://www.cs.cmu.edu/~gibbons/Phillip%20B.%20Gibbons_files/Distinct-Values-Estimation-over-Data-Streams-PBGibbons.pdf) gives a great overview of the alternatives.
 
 
-## Testing
+## Usage
 
 You can use this application multiple ways
 
@@ -55,12 +55,19 @@ Or simply install the package and use the command line application directly
 
 # Installation
 
-Installation from the source tree (or via pip from PyPI)::
+Installation from the source tree:
 
 ```
 python setup.py install
 ```
 
+(or via pip from PyPI):
+
+```
+pip install dfsummarizer
+```
+
+
 Now, the ``dfsummarizer`` command is available::
 
 ```
14 changes: 7 additions & 7 deletions data/test.csv
@@ -1,7 +1,7 @@
-id,opening,first,state,balance,duration,years,flag,comments
-S001,2019-01-01,YES,NSW,230.40,24,2,,Simple transactions
-S002,2019-03-13,NO,QLD,4230.90,12,3,1,Temporary savings account
-S003,2019-06-09,YES,,900.00,24,4,,Combined savings account
-S004,2019-05-21,NO,VIC,500.00,24,4,,Holdings
-S005,2019-07-12,NO,NSW,200.00,,2,1,Customer called to make a complaint
-S006,2019-03-25,,VIC,500.00,,3,,Unknown origin
+id,opening,first,last,state,balance,duration,years,flag,comments
+S001,2019-01-01,YES,FALSE,NSW,230.40,24,2,,Simple transactions
+S002,2019-03-13,NO,,QLD,4230.90,12,3,1,Temporary savings account
+S003,2019-06-09,YES,FALSE,,900.00,24,4,,Combined savings account
+S004,2019-05-21,NO,,VIC,500.00,24,4,,Holdings
+S005,2019-07-12,NO,TRUE,NSW,200.00,,2,1,Customer called to make a complaint
+S006,2019-03-25,,,VIC,500.00,,3,,Unknown origin
2 changes: 1 addition & 1 deletion dfsummarizer/dfsummarizer.py
@@ -2,7 +2,7 @@
 
 """dfsummarizer.dfsummarizer: provides entry point main()."""
 
-__version__ = "0.1.1"
+__version__ = "0.1.2"
 
 import numpy as np
 import pandas as pd
16 changes: 12 additions & 4 deletions dfsummarizer/funcs.py
@@ -243,10 +243,15 @@ def infer_type(thetype, unicount, uniques):
         valtype = "Date"
     if thetype == "<class 'pandas._libs.tslibs.timestamps.Timestamp'>" :
         valtype = "Date"
-    # Infer Booleans by 2 unique values and additional criteria
-    #print("Type: ", thetype)
-    if unicount == 2:
-        if (valtype == "Char") :
+    if thetype == "<class 'numpy.bool_'>":
+        valtype = "Bool"
+    if thetype == "<class 'bool'>":
+        valtype = "Bool"
+
+    # Additional Inference of Booleans by strings with 2 unique values
+    # and common names as additional criteria
+    if (valtype == "Char") :
+        if unicount == 2:
             temp = [x.lower() for x in uniques if x is not None]
             temp.sort()
             if (temp == ['no', 'yes']):
@@ -289,8 +294,11 @@ def booleanize(x):
         return x
     elif x is None :
         return x
+    elif str(type(x)) == "<class 'bool'>":
+        return x
     else :
         x = x.lower()
+
     if x == "yes" or x == "y" or x == "true" or x == "t" or x == 1:
         return 1
     else :
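The two `funcs.py` hunks above can be read as a two-stage check: a value whose type is already a native Boolean is classified directly, and only string columns fall through to the "two unique, well-known labels" test. The following is a minimal sketch of that idea, not the code in `funcs.py`. The names `infer_bool` and `booleanize_sketch` are hypothetical, only the `['no', 'yes']` pair is visible in the hunk, and the final `return 0` is an assumption because the hunk is cut off at `else :`.

```python
# Hypothetical illustration of the logic in the hunks above; not the package's code.

# Type names compared as strings, as the diff does.
NATIVE_BOOL_TYPES = {"<class 'bool'>", "<class 'numpy.bool_'>"}

# Only the ['no', 'yes'] pair is visible in the hunk; other pairs are not assumed here.
KNOWN_PAIRS = [["no", "yes"]]


def infer_bool(values):
    """Return True if the non-missing values look like a Boolean column."""
    present = [v for v in values if v is not None]
    if not present:
        return False
    # Stage 1: native booleans qualify on type alone.
    if all(str(type(v)) in NATIVE_BOOL_TYPES for v in present):
        return True
    # Stage 2: strings qualify only with exactly the two known labels.
    if all(isinstance(v, str) for v in present):
        return sorted({v.lower() for v in present}) in KNOWN_PAIRS
    return False


def booleanize_sketch(x):
    """Map a value to 1/0, passing None and native booleans through unchanged."""
    if x is None or isinstance(x, bool):
        return x
    if isinstance(x, str):  # guard: only strings have .lower()
        x = x.lower()
    # Assumption: the truncated else branch in the hunk returns 0.
    return 1 if x in ("yes", "y", "true", "t", 1) else 0
```

For example, `infer_bool([True, False, None])` and `infer_bool(["YES", "no"])` both hold, while `infer_bool(["NSW", "VIC"])` does not. Unlike the hunk, the sketch guards the `.lower()` call with an `isinstance` check so that a numeric `1` does not raise.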
1 change: 1 addition & 0 deletions markdown_test.md
@@ -3,6 +3,7 @@
 | id | Char | 100.0% | 0.0% | 4 | 4.0 | 4 |
 | opening | Date | 100.0% | 0.0% | 2019-01-01 | 2019-04-18 | 2019-07-12 |
 | first | Bool | 33.3% | 16.7% | 0.0 | 0.4 | 1.0 |
+| last | Bool | 33.3% | 50.0% | 0 | 0.333 | 1 |
 | state | Char | 50.0% | 16.7% | 3.0 | 3.0 | 3.0 |
 | balance | Float | 83.3% | 0.0% | 200.0 | 1093.55 | 4230.9 |
 | duration | Float | 50.0% | 33.3% | 12.0 | 21.0 | 24.0 |
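The `first` and `last` rows above can be checked by hand against the new `data/test.csv`. The table header is outside this hunk, so the sketch below assumes the two percentage columns are percent unique and percent missing (both over all six rows) and that the middle numeric column is the mean of the booleanized values. The helper `summarize_flag` is hypothetical and uses only the standard library, not dfsummarizer itself.

```python
import csv
import io

# The `first` and `last` columns of the six rows in the new data/test.csv.
CSV_TEXT = """id,first,last
S001,YES,FALSE
S002,NO,
S003,YES,FALSE
S004,NO,
S005,NO,TRUE
S006,,
"""

# Labels treated as true, taken from the booleanize() condition in funcs.py.
TRUE_LABELS = {"yes", "y", "true", "t"}


def summarize_flag(rows, column):
    """Return (unique %, missing %, mean) for a yes/no-style column."""
    raw = [row[column] for row in rows]
    present = [v for v in raw if v != ""]
    unique_pct = round(100 * len(set(present)) / len(raw), 1)
    missing_pct = round(100 * (len(raw) - len(present)) / len(raw), 1)
    mean = round(sum(v.lower() in TRUE_LABELS for v in present) / len(present), 3)
    return unique_pct, missing_pct, mean


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
print(summarize_flag(rows, "first"))  # (33.3, 16.7, 0.4)
print(summarize_flag(rows, "last"))   # (33.3, 50.0, 0.333)
```

Under those assumptions the numbers agree with the table: `last` has three blanks out of six rows (50.0% missing), two distinct values (33.3%), and one `TRUE` among three present values (mean 0.333).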
5 changes: 5 additions & 0 deletions setup.py
@@ -14,6 +14,11 @@
 with open("README.md", "rb") as f:
     long_descr = f.read().decode("utf-8")
 
+with open("markdown_test.md", "rb") as f:
+    example = f.read().decode("utf-8")
+
+long_descr = long_descr + "\n" + example
+
 setup(
     name = "dfsummarizer",
     packages = ["dfsummarizer"],
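The `setup.py` change above appends the example table to the README text before it is passed to `setup()`. As a minimal sketch of the same concatenation, assuming a hypothetical helper named `build_long_description` that is not part of the commit:

```python
from pathlib import Path


def build_long_description(*paths):
    """Read UTF-8 text files in order and join them with a newline."""
    return "\n".join(Path(p).read_text(encoding="utf-8") for p in paths)
```

Calling `build_long_description("README.md", "markdown_test.md")` from the repository root would produce the same string as `long_descr + "\n" + example` in the hunk, and a helper of this shape extends to further files without more `with open(...)` blocks.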
