Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH : improvement csv processing #517

Closed
wants to merge 10 commits into from

Conversation

kalounis
Copy link
Contributor

Pull request type

  • Code changes (bugfix, features)

Checklist

  • Tests for the changes have been added (if needed)
  • Docs have been reviewed and added / updated
  • Lint (black rocketpy/ tests/) has passed locally
  • All tests (pytest --runslow) have passed locally
  • CHANGELOG.md has been updated (if relevant)

Current behavior

The current code isn't able to deal with CSV files that contain a header.

New behavior

Now, when we create an instance of the Function Class, we can give for argument a CSV file with a header : the code will process this CSV file (drop NaN values) and will be able to deal with the header.

Breaking change

  • No

Additional information

Enter text here...

@kalounis kalounis changed the title Enh/improvement csv processing ENH : improvement csv processing Dec 19, 2023
Copy link
Member

@Gui-FernandesBR Gui-FernandesBR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work! I liked the way you separated the file handler function into different functions within the tools.py module

This is my first review, I will later check documentation and tests, but you can already work on the suggestions.

Btw I think this function can be used in other parts of the code as well.

cleaned_data.csv Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you commit this file by mistake?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I forgot to delete it, sorry!

rocketpy/mathutils/function.py Outdated Show resolved Hide resolved
rocketpy/mathutils/function.py Show resolved Hide resolved
rocketpy/tools.py Outdated Show resolved Hide resolved
rocketpy/tools.py Outdated Show resolved Hide resolved
@phmbressan
Copy link
Collaborator

phmbressan commented Dec 19, 2023

Could you elaborate a bit more on why is this implementation an advantage over the current one, forgive me if I am mistaken, but my interpretation is that:

  • The current way the CSVs are read follows the numpy.loadtxt in a try-except clause: numpy will try to convert the CSV to an array of floats, if there is an error, we skip the first header line. If there is still an error, the input is likely bad formatted.
  • This new implementation creates aditional functions to process the data: the headers will be checked by trying a float() cast.

This seems to the same thing to me, but in the first case we can delegate to numpy the trouble of reading and not maintain these new functions.

We had recently a PR (#485 and #495) to handle CSV headers and there is even a test for it: test_func_from_csv_with_header.

@kalounis
Copy link
Contributor Author

kalounis commented Dec 19, 2023

Hello @phmbressan, this code is not only able to skip the first row but also to drop NaN values which it is not the case in the current code. Moreover, the current code is not very efficient dealing with these cases (from my perspective). Try with the following CSV file that @Gui-FernandesBR gave me. The current code throws an error whereas mine deals perfectly with it. Let me know if you want a complete report on this code!
test_dataset_with_header.csv

Copy link
Collaborator

@phmbressan phmbressan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks to your clarifications @kalounis , now I understand the reasoning behind this PR. In fact, processing NaN values can be quite useful.

As I commented, we just need to make sure to inform if there were any problems on reading the source. After that, we are likely good to go.

Comment on lines +476 to +479
# Save the processed data to a new CSV file
with open(output_path, "w", encoding="utf-8") as output_file:
writer = csv.writer(output_file, delimiter=",")
writer.writerows(data_no_headers)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I don't think writing a new file with the processed data is really needed here.

More important, I think, is that we must make it clear to the user that not everything of their file was used (because there were NaN). So, could you trigger a warning that informs if any lines were skipped on source processing.

The implementation of this warning is up to you, but I believe that a simple boolean that is set to True in the for loop above if there were any skipped lines is enough. Then, if this boolean is True raise the warning.

There are other places in rocketpy that we raise warnings if you want to base the implementation on that. Of course, should you have any doubts don't hesitate in commenting.

@Gui-FernandesBR
Copy link
Member

tests not passing kinda worries me a bit but I couldn't find enough time to investigate it.

@Gui-FernandesBR
Copy link
Member

If the problem we are trying to solve is to read .csv files that have NaN values, I think we should consider Changning the np.loadtxt to the np.genfromtxt (see https://numpy.org/doc/stable/reference/generated/numpy.genfromtxt.html).

This function is a bit more complex and can

Relying in numpy may be more beneficial than using the full pythonic approach, both in terms of maintenance and speed. But this is something that can be discussed.

https://stackoverflow.com/questions/20245593/difference-between-numpy-genfromtxt-and-numpy-loadtxt-and-unpack

@Gui-FernandesBR Gui-FernandesBR marked this pull request as draft January 25, 2024 04:07
@Gui-FernandesBR
Copy link
Member

Closed as not finished.
The branch is still available in the repo tho.

@Gui-FernandesBR Gui-FernandesBR deleted the enh/improvement-csv-processing branch August 3, 2024 16:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Closed
Development

Successfully merging this pull request may close these issues.

4 participants