fix: specify dtype for read_csv method #88

nielsbril · 2019-09-10T13:49:46Z

Fixing #87

nielsbril · 2019-12-27T09:05:20Z

@JosseVanDelm @jbelien Can you have a look at this? We use my own fork for now, but it would be nice to use your official repo in the future.

jbelien · 2019-12-28T09:46:07Z

filter/filter.py

@@ -62,7 +62,7 @@ def filter_file(args):
 """
 logger.info('Started reading input file')
 try:
- file = pd.read_csv(args.input_file)
+ file = pd.read_csv(args.input_file, dtype='unicode')


I'm not a Python expert, so I may be wrong, but since the script is supposed to run with Python 3, shouldn't it be dtype='str'?

JosseVanDelm · 2019-12-30

@jbelien I just read this Stack Overflow post:
https://stackoverflow.com/questions/24251219/pandas-read-csv-low-memory-and-dtype-options
According to this post, the proposed code should silence the error, but it does not resolve the underlying problem: pandas tries to guess the data type of every CSV column, and to do this it has to load all the data into memory.

If I understand it correctly, the proper way to do it is to explicitly state the NumPy data type for each column, which makes the code more efficient. I am not sure what implications this has for the rest of the code, so for now we can definitely accept this pull request, but we should investigate this issue further in the future.
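For reference, declaring a type per column could look roughly like the sketch below. The column names, types, and sample data are hypothetical, since the real input file's layout is not shown in this thread; it only illustrates the `dtype` mapping that `pd.read_csv` accepts.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the real input file.
csv_text = "id,name,score\n1,alice,3.5\n2,bob,4.0\n"

# Assumed column names and types, for illustration only.
column_types = {"id": "int64", "name": str, "score": "float64"}

# With an explicit mapping, pandas does not need to infer these columns' types.
frame = pd.read_csv(io.StringIO(csv_text), dtype=column_types)
print(frame.dtypes)
```

Columns missing from the mapping would still be inferred, so a mapping like this only helps if it covers the columns that trigger the warning.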

nielsbril · 2019-12-30

I'm not a Python developer either; this code change seemed to fix the issue. It tells pandas to treat each column as unicode, which resolves the issue for strings, numbers, ... The fix has been working for several months now on my own fork, but could (temporarily) be applied here too. I agree this should be investigated further by someone with more knowledge of Python and the pandas module.
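A small sketch of what reading every column as text does in practice. It uses `dtype=str` (the spelling suggested in the review) rather than `dtype='unicode'`, on the assumption that both mean "read as string" on Python 3. The sample data is hypothetical.

```python
import io

import pandas as pd

# Hypothetical sample data; the real input file is not shown in this thread.
csv_text = "zip,amount\n01234,12\n56789,unknown\n"

# Default behaviour: pandas infers a type per column.
inferred = pd.read_csv(io.StringIO(csv_text))

# With dtype=str, every column is kept as text and no inference happens.
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(inferred["zip"].tolist())  # parsed as integers, leading zero lost
print(as_text["zip"].tolist())   # kept exactly as written in the file
```

The trade-off is that any numeric column has to be converted explicitly later (for example with `pd.to_numeric`) before doing arithmetic on it.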

fix: specify dtype for read_csv method

33e5d35

JosseVanDelm Dec 30, 2019

nielsbril Dec 30, 2019
