MNT: Cleaning the data folder #281

Closed
5 tasks done
Gui-FernandesBR opened this issue Nov 10, 2022 · 11 comments · Fixed by #721
Assignees
Labels
Git housekeeping Clean and organize our github Good first issue Good for newcomers

Comments

@Gui-FernandesBR
Member

Gui-FernandesBR commented Nov 10, 2022

What I propose is: (soft suggestions)

  • Create a "rockets" folder and save "caldene", "calisto", "euporia", "valetudo" folders inside that
  • Move "jiboia", "keron", "mandioca" to "motors" folder
  • Clean the "weather" folder by removing most of the files and keeping only the crucial ones.
  • Do not duplicate files in tests/fixtures. The best place to save data is the data folder!
  • Organize motors in different subfolders if needed
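
The moves proposed above could be scripted roughly as follows. This is only a sketch: the `PROPOSED_MOVES` mapping and the `reorganize` helper are hypothetical names, the folder names are the ones listed in the bullets, and inside a real checkout `git mv` would likely be preferable to `shutil.move` so that git tracks the renames.

```python
import shutil
from pathlib import Path

# Hypothetical mapping built from the bullets above: folder -> new parent.
PROPOSED_MOVES = {
    "caldene": "rockets",
    "calisto": "rockets",
    "euporia": "rockets",
    "valetudo": "rockets",
    "jiboia": "motors",
    "keron": "motors",
    "mandioca": "motors",
}


def reorganize(data_dir, moves=PROPOSED_MOVES, dry_run=True):
    """Move each listed folder under its new parent inside data_dir.

    Returns the planned (source, destination) pairs. With dry_run=True
    nothing is changed on disk.
    """
    data_dir = Path(data_dir)
    planned = []
    for name, parent in sorted(moves.items()):
        source = data_dir / name
        if not source.is_dir():
            continue  # folder is absent in this checkout, skip it
        destination = data_dir / parent / name
        planned.append((source, destination))
        if not dry_run:
            destination.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(source), str(destination))
    return planned
```

Calling `reorganize("data")` first with the default `dry_run=True` lets you review the plan before anything is moved.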
@Gui-FernandesBR
Member Author

This issue requires at least a little discussion before we start applying changes. @giovaniceotto could you help with some insights here?

@Gui-FernandesBR Gui-FernandesBR added Help wanted Extra attention is needed Git housekeeping Clean and organize our github labels Nov 10, 2022
@giovaniceotto
Member

Great suggestions @Gui-FernandesBR.

I agree that the data folder may no longer be needed. We can move some of the files used for tests to the tests folders as fixtures, and create a specific folder for complete example cases, including comparisons with real flight data.

The only point I disagree with is creating a separate repository for examples. While this would make this repository lighter, it would be a nightmare to manage. Imagine having to sync two separate repositories so that the examples can always run with the latest RocketPy version. I do not believe this is worth the effort.

@Gui-FernandesBR
Member Author

> I agree that the data folder may no longer be needed.

Nooo, did I say that? Sorry hahahaha
The data folder is indeed important. A lot of notebooks use files from it, for example powerOnDrag='data/calisto/....csv'. The problem is that the folder is currently a mess.

What I think we could do is:

  1. Delete the less useful files in the data folder (MAINT: Cleaning up some repo files #316)
  2. Keep only the necessary files in tests/fixtures, without duplicating files that already exist in the data folder.
  3. Comparisons and example cases serve mainly as documentation, so we should add them as notebooks and reference them in our docs. The current tests\fixtures\acceptance\EPFL_Bella_Lui\bella_lui_flight_sim.py file, for instance, could be converted into a Jupyter notebook, which we can then convert to .rst and add to our docs.
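
One way to find the duplicated files mentioned in step 2 is to hash file contents and group identical digests. The sketch below assumes nothing about RocketPy itself; `file_digest` and `find_duplicates` are hypothetical helper names, and the folder paths in the usage note are the ones named in this thread.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(*folders):
    """Group files with identical content across the given folders.

    Returns a list of groups; each group is a list of two or more
    paths whose bytes are identical.
    """
    by_digest = defaultdict(list)
    for folder in folders:
        for path in sorted(Path(folder).rglob("*")):
            if path.is_file():
                by_digest[file_digest(path)].append(path)
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

For example, `find_duplicates("data", "tests/fixtures")` would list every fixture that is a byte-for-byte copy of a file in the data folder (and any duplicates within each folder).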

@Gui-FernandesBR Gui-FernandesBR added this to the Release v1.1.0 milestone May 26, 2023
@Gui-FernandesBR Gui-FernandesBR moved this from Backlog to Mid-Term in LibDev Roadmap Nov 20, 2023
@Gui-FernandesBR Gui-FernandesBR added Good first issue Good for newcomers and removed Help wanted Extra attention is needed labels Jul 10, 2024
@Gui-FernandesBR Gui-FernandesBR changed the title Cleaning the data folder MNT: Cleaning the data folder Oct 11, 2024
@Gui-FernandesBR Gui-FernandesBR linked a pull request Nov 2, 2024 that will close this issue
7 tasks
@aureliobarbosa

aureliobarbosa commented Nov 7, 2024

To give some context: I talked a little with your colleagues at PythonBR 2024 and they said contributions are welcome. I also mentioned that I would be mostly interested in CI, lib infrastructure and code maintenance in general...

Regarding the data folder, why do I think it is a serious problem?

I started cloning the project over Wi-Fi and it took forever on a slow broadband connection (depending on the context, this could drive potential contributors away). On my home setup I had to plug in a cable to download the project!

Adding to this discussion: I used cloc to count the lines of code in the data folder and found that it is an order of magnitude higher than in all the other files. Only .git is larger than this folder, and I presume .git holds many versions of the files in it (possibly due to file reorganizations). Have you thought about using git lfs (git large file storage), or at least checking whether it can help you reduce the size of your repository?

My guess is that it could potentially reduce the size of the whole repository by 90%.

I can devote some time to investigating (and eventually implementing, or helping the team to implement, git large file storage on this repository) if the team thinks this could be valuable. Regards

edit: just corrected the name of the plugin.

RocketPy on master is 📦 v1.6.1 via 🐍 v3.12.3 (rocketpy-devenv)
❯ cloc .
     496 text files.
     470 unique files.                                          
      75 files ignored.

github.com/AlDanial/cloc v 1.98  T=0.86 s (548.6 files/s, 546445.3 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
CSV                             43              0              0         218429
Python                         167           7218          21945          26633
Text                            22              3              0          12145
SVG                              9             42             43          10185
reStructuredText               183           3093           4180           3488
Jupyter Notebook                17              0         154818           3350
Markdown                         7            292             26            579
JSON                             2              0              0            348
YAML                            11             41             15            289
MATLAB                           1             25            137            115
XML                              1              0              0             86
TOML                             1             12              2             77
CSS                              1             15              3             62
make                             2             16              8             40
DOS Batch                        1              8              1             26
HTML                             1             25            256             14
Dockerfile                       1              7             12             11
-------------------------------------------------------------------------------
SUM:                           470          10797         181446         275877
-------------------------------------------------------------------------------

RocketPy on master is 📦 v1.6.1 via 🐍 v3.12.3 (rocketpy-devenv)
❯ du -h --max-depth=1 --total .
1,9M	./rocketpy
829M	./.git
1,8M	./tests
60K	./.github
16K	./.vscode
162M	./data
40M	./docs
1,1G	.
1,1G	total

RocketPy on master is 📦 v1.6.1 via 🐍 v3.12.3 (rocketpy-devenv)
❯ cloc tests/ data/
     130 text files.
     115 unique files.                                          
      30 files ignored.

github.com/AlDanial/cloc v 1.98  T=0.42 s (271.8 files/s, 546528.9 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
CSV                             43              0              0         218429
Python                          66           1857           2708           6851
Text                             6              3              0           1367
-------------------------------------------------------------------------------
SUM:                           115           1860           2708         226647
-------------------------------------------------------------------------------
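
For anyone without `du` at hand, a rough per-folder breakdown like the one above can be reproduced in Python. This is a sketch under assumptions: `directory_size` and `size_report` are hypothetical helpers, and they sum apparent file sizes, so the numbers will not match `du` exactly (which reports disk blocks).

```python
import os
from pathlib import Path


def directory_size(path):
    """Sum the sizes in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def size_report(repo_root):
    """Return (name, bytes, fraction_of_total) rows, largest first."""
    sizes = {}
    for entry in Path(repo_root).iterdir():
        if entry.is_dir() and not entry.is_symlink():
            sizes[entry.name] = directory_size(entry)
        elif entry.is_file():
            sizes[entry.name] = entry.stat().st_size
    total = sum(sizes.values()) or 1  # avoid dividing by zero on empty dirs
    rows = [(name, size, size / total) for name, size in sizes.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Running `size_report(".")` at the repository root shows at a glance how much of the checkout is taken by `.git` and `data`.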

@Gui-FernandesBR
Member Author

Yes, this has been a known issue for a while now. We always assumed that few people actually clone the repo, and that those who do may be okay with long download times.

We made a big mistake by committing .nc and .csv files in previous versions. Even after deleting most of those files, I believe git still holds many versions of the same files in its history. 3k commits later, the repo already takes up more than 1 GB.

Implementing git LFS would be a good idea, but how would non-experienced developers react to it? What would be the possible development overhead with such addition?

Alternatively, I wonder if we could simply delete some files from git history.
We could even delete the whole data folder from the git tree and add the files again.
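
Before deleting anything from history, it may help to see which files actually weigh the most in it. The sketch below is one possible approach, not something used in this repo: it pipes `git rev-list --objects --all` into `git cat-file --batch-check` and keeps the largest blobs. The helper names are hypothetical, and the sizes are uncompressed object sizes rather than on-disk pack sizes.

```python
import subprocess


def parse_batch_check(lines):
    """Parse 'type sha size path' lines and keep only blobs.

    Returns (size_in_bytes, path) tuples sorted from largest to smallest.
    """
    blobs = []
    for line in lines:
        parts = line.rstrip("\n").split(" ", 3)
        if len(parts) == 4 and parts[0] == "blob":
            blobs.append((int(parts[2]), parts[3]))
    return sorted(blobs, reverse=True)


def largest_blobs(repo=".", top=10):
    """List the largest blobs reachable from any ref in the repository."""
    objects = subprocess.run(
        ["git", "-C", repo, "rev-list", "--objects", "--all"],
        check=True, capture_output=True, text=True,
    ).stdout
    sizes = subprocess.run(
        ["git", "-C", repo, "cat-file",
         "--batch-check=%(objecttype) %(objectname) %(objectsize) %(rest)"],
        input=objects, check=True, capture_output=True, text=True,
    ).stdout
    return parse_batch_check(sizes.splitlines())[:top]
```

If the top entries turn out to be old .nc or .csv files, that would support rewriting history or moving those files elsewhere.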


Regarding contributions, we are really excited about new contributions coming from you.
Are you available on Discord? We could use that channel to communicate more easily and better assist you on this journey.
There's a lot of CI features we could add.

@aureliobarbosa

I am not used to Discord and couldn't find the Discord link in the project documentation. Could you please provide it?

@Gui-FernandesBR
Member Author

> I am not used to Discord and couldn't find the Discord link in the project documentation. Could you please provide it?

Please check our README, in the "join our community" section.

@Gui-FernandesBR
Member Author

Gui-FernandesBR commented Nov 7, 2024

I just realized that my local RocketPy repo is larger than 3.5 GB.

Gonna try cleaning this up

@Gui-FernandesBR
Member Author

Gui-FernandesBR commented Nov 7, 2024

I tried running git gc --aggressive; it helped, but not significantly.

  • Before: 3.64 GB
  • After: 3.55 GB
  • Saved space: 90 MB

Then I remembered that .venv folders usually consume a lot of space!
I deleted my .venv folders (I had 2):

  • Before: 3.55 GB
  • After: 1.14 GB
  • Saved space: 2.41 GB

Of the 1.14 GB that I currently have, the .git folder weighs 768 MB (67%), so it is more than clear to me that the problem is related to our git tree.
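
The size figures in this comment can be re-derived in a few lines. This is just arithmetic on the numbers quoted above, assuming decimal units (1 GB = 1000 MB); the function names are made up for illustration.

```python
MB_PER_GB = 1000  # assumption: decimal units, 1 GB = 1000 MB


def saved_mb(before_gb, after_gb):
    """Space saved between two measurements, in whole megabytes."""
    return round((before_gb - after_gb) * MB_PER_GB)


def share(part_mb, total_gb):
    """Fraction of the total (given in GB) taken by a part (given in MB)."""
    return part_mb / (total_gb * MB_PER_GB)


gc_saving = saved_mb(3.64, 3.55)    # effect of git gc --aggressive
venv_saving = saved_mb(3.55, 1.14)  # effect of deleting the .venv folders
git_share = share(768, 1.14)        # weight of .git in what remains

print(f"git gc saved {gc_saving} MB")
print(f"removing .venv saved {venv_saving} MB")
print(f".git is {git_share:.0%} of the remaining checkout")
```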

@Gui-FernandesBR
Member Author

@aureliobarbosa I've raised issue #727 to deal specifically with the repo size reduction task.

As for the current issue, the goal was to refactor both the data and tests folders, which was accomplished by merging #726. Therefore, I'm marking this one as closed.

@aureliobarbosa , let's collaborate and work together on the new #727 issue!

@github-project-automation github-project-automation bot moved this from Mid-Term to Closed in LibDev Roadmap Nov 9, 2024