GSoC 2023 ‐ Code Refactoring and Parallelization
- Name - Harsh (@peb-peb)
- Organisation - GCP Scanner
- Project - Code Refactoring & Parallelization
- Proposal - [link to proposal]
GCP Scanner previously did not support parallel enumeration of GCP resources or parallel scanning of GCP targets.
To address this, project-based and resource-based parallelization was implemented using multithreading. To use it, pass the -wc or --worker-count flag followed by an integer giving the number of workers to use while crawling.
Example command:
gcp-scanner -o <<output_dir>> -g - --worker-count 8
Note: Some of the issues faced while developing the above solution and their discussion can be found here.
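The worker-count idea described above can be sketched with a thread pool. This is only an illustrative sketch, not GCP Scanner's actual code: scan_project, scan_all, and the project IDs are hypothetical names, and the real crawlers issue GCP API calls where the placeholder returns a fixed result.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scan_project(project_id: str) -> dict:
    """Hypothetical stand-in for crawling a single GCP project."""
    # A real crawler would issue GCP API calls here (I/O-bound work).
    return {"project": project_id, "resources": []}


def scan_all(project_ids: list[str], worker_count: int = 1) -> dict[str, dict]:
    """Scan projects concurrently using at most `worker_count` threads."""
    results: dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        # Map each submitted future back to the project it belongs to.
        futures = {pool.submit(scan_project, pid): pid for pid in project_ids}
        for future in as_completed(futures):
            pid = futures[future]
            try:
                results[pid] = future.result()
            except Exception as exc:
                # One failing project should not abort the whole scan.
                results[pid] = {"project": pid, "error": str(exc)}
    return results


if __name__ == "__main__":
    print(scan_all(["project-a", "project-b", "project-c"], worker_count=8))
```

With worker_count set to 1 the projects are scanned one at a time, which corresponds to the behaviour before parallelization was added.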
GCP Scanner had one giant scanning loop from which it launched all GCP resource crawlers. We needed to split each crawler into an individual module with proper error handling to improve code readability and quality.
This was solved by implementing the factory design pattern for the crawlers. I used Python classes to manage execution state, parse the config, and enable or disable individual scanner functionality.
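A minimal sketch of what a crawler factory can look like is shown below. The class names, registry keys, and config shape are assumptions made for illustration and do not mirror the actual GCP Scanner modules.

```python
from abc import ABC, abstractmethod


class Crawler(ABC):
    """Common interface that every resource crawler implements."""

    @abstractmethod
    def crawl(self, project_id: str) -> list[dict]:
        ...


class BucketCrawler(Crawler):  # hypothetical crawler
    def crawl(self, project_id: str) -> list[dict]:
        return [{"kind": "bucket", "project": project_id}]


class SqlInstanceCrawler(Crawler):  # hypothetical crawler
    def crawl(self, project_id: str) -> list[dict]:
        return [{"kind": "sql_instance", "project": project_id}]


class CrawlerFactory:
    """Maps a resource name from the config to a crawler instance."""

    _registry = {
        "buckets": BucketCrawler,
        "sql_instances": SqlInstanceCrawler,
    }

    @classmethod
    def create(cls, name: str) -> Crawler:
        try:
            return cls._registry[name]()
        except KeyError:
            raise ValueError(f"unknown crawler: {name}") from None


def run(config: dict[str, bool], project_id: str) -> dict[str, list[dict]]:
    """Run only the crawlers that are enabled in the config."""
    results = {}
    for name, enabled in config.items():
        if not enabled:
            continue  # crawler disabled in the config
        results[name] = CrawlerFactory.create(name).crawl(project_id)
    return results


if __name__ == "__main__":
    print(run({"buckets": True, "sql_instances": False}, "project-a"))
```

The scanning loop then only needs to know resource names; adding a crawler means adding one class and one registry entry rather than editing the loop itself.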
- Python parallelization and multiprocessing libraries: I learned about parallelization and multiprocessing in Python, including the different libraries available, such as multiprocessing, threading, and concurrent.futures. I also learned about the pros and cons of each library.
- Multiprocessing vs Multithreading: I learned the difference between multiprocessing and multithreading, and when to use each one. I learned that multiprocessing is used for CPU-bound tasks and multithreading for I/O-bound tasks.
- Refactoring: I learned the art of refactoring, which is the process of improving the structure of the code without changing its functionality. Refactoring can help to make the code more readable, maintainable, and efficient.
- Communication: The skill from this program that will help me most throughout my career is communicating with my mentors. I am grateful to them for their support and for helping me learn it.
- Time management: During the entire period of the program I never felt stressed. My mentors were awesome. Also, I learnt the importance of time management and task prioritization.
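The point above about multithreading suiting I/O-bound tasks can be illustrated with a small timing sketch. Here time.sleep stands in for a network call, and fake_api_call and compare are hypothetical names used only for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_api_call(delay: float) -> float:
    """Simulates an I/O-bound call; the thread is idle while it waits."""
    time.sleep(delay)
    return delay


def compare(delays: list[float], workers: int) -> tuple[float, float]:
    """Return (sequential_seconds, threaded_seconds) for the same calls."""
    start = time.perf_counter()
    for d in delays:
        fake_api_call(d)
    sequential = time.perf_counter() - start

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fake_api_call, delays))
    threaded = time.perf_counter() - start
    return sequential, threaded


if __name__ == "__main__":
    seq, thr = compare([0.05] * 8, workers=8)
    print(f"sequential: {seq:.2f}s, threaded: {thr:.2f}s")
```

Because the waiting threads overlap, the threaded run finishes in roughly the time of one call. A CPU-bound function would not see this benefit from threads in CPython because of the global interpreter lock, which is where multiprocessing becomes the better fit.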
Parallelization
GCP Scanner now supports parallel enumeration of projects and resources, which reduced the scan time from ~120s to ~12s, a roughly 90% reduction (tested on the test/gcp-scanner-2 project). This was achieved with the following PRs:
PRs:
- https://github.com/google/gcp_scanner/pull/265 - project based parallelization
- https://github.com/google/gcp_scanner/pull/269 - resource based parallelization
Related Issues:
- https://github.com/google/gcp_scanner/pull/265
Code Refactoring
I refactored the existing crawlers to follow the Client-Factory architecture. The refactored crawlers are listed below:
PRs:
- https://github.com/google/gcp_scanner/pull/194 - crawler/app services
- https://github.com/google/gcp_scanner/pull/200 - crawler/bigtable instances
- https://github.com/google/gcp_scanner/pull/202 - crawler/filestore instances
- https://github.com/google/gcp_scanner/pull/215 - crawler/spanner instances
- https://github.com/google/gcp_scanner/pull/216 - crawler/pubsub subs
- https://github.com/google/gcp_scanner/pull/217 - crawler/cloud functions
- https://github.com/google/gcp_scanner/pull/218 - crawler/sql instances
- https://github.com/google/gcp_scanner/pull/221 - crawler/kms keys
- https://github.com/google/gcp_scanner/pull/222 - crawler/bigquery
- https://github.com/google/gcp_scanner/pull/223 - crawler/cloud resource manager
Related Issues:
- https://github.com/google/gcp_scanner/issues/153
- https://github.com/google/gcp_scanner/issues/192 - crawler factory for app resources
- https://github.com/google/gcp_scanner/issues/199 - crawler factory for bigtable resources
- https://github.com/google/gcp_scanner/issues/201 - crawler factory for filestore resources
- https://github.com/google/gcp_scanner/issues/192
- https://github.com/google/gcp_scanner/pull/194
Other Bugs solved
PR:
The tool has improved and changed a lot since I first started contributing. I plan to keep working on the project and contribute as much as I can. Some of the features I'd like to work on in the future are:
- Add local testing support for developers of the tool.
- Expand support for more GCP resources.
- Improve logging and CLI appearance of the tool.
I would like to thank Google and GCP Scanner for providing me with this wonderful opportunity and my mentors Maksim Shudrak and Calle Svensson who guided me and taught me all sorts of things during this summer.
I would also like to thank my fellow GSoCer Sudipto Baral and GCP Scanner Community for helping me during the program.