OpenWPM StorageWatchdog complete #1039

Merged
merged 10 commits into from
Oct 11, 2023

Conversation

gridl0ck
Contributor

Introducing the StorageWatchdog module for OpenWPM, a powerful tool designed to enhance your web scraping experience. This module offers a range of features that enable you to efficiently manage temporary files, monitor their size, and effectively handle browser profiles, all without compromising performance.

With the StorageWatchdog module, you gain the ability to easily redirect temporary files to a directory of your choice. This functionality ensures that any temporary files generated during your web scraping activities are conveniently stored in a location that suits your needs. Whether you prefer a specific directory for organization or have limited storage space concerns, this feature allows you to maintain control over the storage location of these files effortlessly.
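As a rough illustration of the redirection idea, here is a minimal standard-library sketch that creates a browser profile directory under a caller-chosen root instead of the system default. The function name `make_profile_dir` and the `firefox_profile_` prefix are assumptions made for this example; they are not taken from the PR's diff.

```python
import tempfile
from pathlib import Path


def make_profile_dir(tmp_root: Path = Path("/tmp")) -> Path:
    """Create a fresh, uniquely named profile directory underneath tmp_root.

    Hypothetical helper: it only mirrors the idea of a configurable
    temporary location that defaults to /tmp.
    """
    # Make sure the chosen root exists before asking tempfile to use it.
    tmp_root.mkdir(parents=True, exist_ok=True)
    # mkdtemp creates the directory and returns its absolute path.
    return Path(tempfile.mkdtemp(prefix="firefox_profile_", dir=str(tmp_root)))
```

Pointing `tmp_root` at a larger or separate volume keeps profile data off a small system disk, which is the situation the description is aimed at.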

In addition to redirecting the temporary files, the StorageWatchdog module monitors their size. This lets you keep track of how much disk space the temporary files and browser profiles consume. The watchdog runs a check every 5 minutes: it measures the size of the current browser profile and decides whether that specific browser needs to be reset to free up space.
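The periodic size check can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions, not the code in storage_watchdog.py: the names `directory_size`, `profile_needs_reset`, and `watch`, and the callback-based reset, are hypothetical.

```python
import os
import threading
from pathlib import Path
from typing import Callable

CHECK_INTERVAL_SECONDS = 5 * 60  # the 5-minute interval from the description


def directory_size(path: Path) -> int:
    """Total size in bytes of all files below path (symlinks are not followed)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                # Files can disappear while a browser is running; skip them.
                continue
    return total


def profile_needs_reset(profile_dir: Path, max_bytes: int) -> bool:
    """True if the profile directory has grown beyond max_bytes."""
    return directory_size(profile_dir) > max_bytes


def watch(
    profile_dir: Path,
    max_bytes: int,
    on_exceeded: Callable[[Path], None],
    stop: threading.Event,
    interval: float = CHECK_INTERVAL_SECONDS,
) -> None:
    """Check the profile size every `interval` seconds until `stop` is set."""
    # Event.wait returns False on timeout, so the loop body runs once per interval.
    while not stop.wait(interval):
        if profile_needs_reset(profile_dir, max_bytes):
            on_exceeded(profile_dir)
```

Running `watch` in a daemon thread and passing a callback that restarts the browser would approximate the behavior described above; using an `Event` for the sleep lets the crawl shut the watcher down promptly.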

This addition was created as a result of my capstone team running into issues with longer crawls running out of storage on our smaller infrastructure. Because we only needed the data from the generated database (more than most people need in the first place), these other random artifacts were simply taking up space on our system.

Usage is fairly simple, and a demo_watchdog.py is supplied to demonstrate it. Enable the watchdog as you would any other watchdog, configure the size in bytes for the check to monitor (this is set in the periodic_check function in storage_watchdog.py), and optionally specify a tmp file location (this defaults to /tmp, as before) using the new browser parameter, tmp_profile_dir.
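To make the two knobs concrete, the sketch below groups a temporary directory and a size limit in one small settings object. `StorageSettings` and `max_profile_bytes` are hypothetical names invented for this example; only `tmp_profile_dir` and its /tmp default come from the description, and this is not OpenWPM's actual parameter class.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class StorageSettings:
    """Hypothetical container for the two settings described above."""

    # Where temporary profile data goes; /tmp matches the stated default.
    tmp_profile_dir: Path = field(default_factory=lambda: Path("/tmp"))
    # Size limit in bytes for one browser profile; None means no limit.
    max_profile_bytes: Optional[int] = None

    def __post_init__(self) -> None:
        # Accept plain strings as well as Path objects.
        self.tmp_profile_dir = Path(self.tmp_profile_dir)
        if self.max_profile_bytes is not None and self.max_profile_bytes <= 0:
            raise ValueError("max_profile_bytes must be a positive number of bytes")


# Example values only: a 50 MiB limit with profiles kept on a separate volume.
settings = StorageSettings(
    tmp_profile_dir="/data/crawl_tmp",
    max_profile_bytes=50 * 1024**2,
)
```

Keeping the limit next to the directory setting, rather than inside the check function, would let each crawl choose its own threshold without editing the watchdog module.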

I hope you accept these changes and agree that this simple addition could be helpful to others!

Contributor

@vringar vringar left a comment

Hey,
thank you for this PR. I can see that this is a useful feature, especially when running natively on machines that are not under your control.
While this looks good on a first pass, it needs a more thorough review, which I don't have time for atm.
I'll try to get back to you as soon as my schedule allows.

Review thread: openwpm/config.py (outdated, resolved)

codecov bot commented Jun 26, 2023

Codecov Report

Attention: 64 lines in your changes are missing coverage. Please review.

Comparison is base (761e46d) 46.20% compared to head (528602f) 40.77%.

❗ Current head 528602f differs from pull request most recent head 213a4c9. Consider uploading reports for the commit 213a4c9 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1039      +/-   ##
==========================================
- Coverage   46.20%   40.77%   -5.43%     
==========================================
  Files          34       35       +1     
  Lines        3398     3480      +82     
==========================================
- Hits         1570     1419     -151     
- Misses       1828     2061     +233     
Files                                        Coverage Δ
openwpm/config.py                            94.69% <100.00%> (+0.16%) ⬆️
openwpm/deploy_browsers/deploy_firefox.py    24.70% <0.00%> (ø)
openwpm/browser_manager.py                   49.12% <50.00%> (-0.63%) ⬇️
openwpm/task_manager.py                      71.12% <28.57%> (-1.33%) ⬇️
openwpm/utilities/storage_watchdog.py        17.64% <17.64%> (ø)

... and 11 files with indirect coverage changes


Contributor Author

gridl0ck commented Sep 2, 2023

If there are any other changes requested, feel free to comment! I really enjoyed working on this project and would love to help continue to make it better!

Contributor

@vringar vringar left a comment

Thanks for following up on this and sorry for letting it linger.

Review threads:
environment.yaml (resolved)
openwpm/config.py (outdated, resolved)
openwpm/utilities/storage_watchdog.py (resolved)
openwpm/config.py (outdated, resolved)
openwpm/utilities/storage_watchdog.py (outdated, resolved)
Contributor Author

@gridl0ck gridl0ck left a comment

See current implementation but marking this as resolved.

@vringar vringar merged commit 213a4c9 into openwpm:master Oct 11, 2023
3 of 12 checks passed
2 participants