[WIP][vpj] Add a way to run DataWriter jobs in an isolated environment #1265
Add a way to run DataWriter jobs in an isolated environment
Currently, VPJ launches its compute tasks from its local environment. If an environment only allows running Spark jobs via `spark-submit`, the entire VPJ logic needs to run on the Spark driver, making the driver non-idempotent.
With this change, we can separate the VPJ driver logic from the Spark compute environment and launch only the compute tasks on Spark. The driver on Spark is a very thin wrapper that only parses CLI args and launches the compute jobs. This makes the Spark compute job idempotent as well, and can improve resiliency.
Another benefit of this change is a better debugging experience: the logs can now be viewed in the environment where the user's job was triggered, so users don't have to check the logs on the Spark driver.
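As a rough illustration of the thin-wrapper idea above, the sketch below flattens job properties into CLI args on the launching side and parses them back on the Spark side. It is not the code in this PR: the class name `JobPropsCliSketch`, the `--conf key=value` flag format, and the sample property names are all hypothetical and chosen only for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.TreeSet;

// Hypothetical sketch: the class name, flag name, and property names are
// assumptions for illustration, not the actual VPJ implementation.
public class JobPropsCliSketch {
  static final String CONF_FLAG = "--conf";

  // Launching side: flatten properties into repeated "--conf key=value" args.
  // Keys are sorted so the same properties always yield the same arg list.
  static List<String> toCliArgs(Properties props) {
    List<String> args = new ArrayList<>();
    for (String key : new TreeSet<>(props.stringPropertyNames())) {
      args.add(CONF_FLAG);
      args.add(key + "=" + props.getProperty(key));
    }
    return args;
  }

  // Isolated side: rebuild the properties from the args and nothing else.
  // The first '=' splits key from value, so keys must not contain '='.
  static Properties fromCliArgs(String[] args) {
    Properties props = new Properties();
    for (int i = 0; i < args.length; i += 2) {
      if (!CONF_FLAG.equals(args[i]) || i + 1 >= args.length) {
        throw new IllegalArgumentException("Expected '" + CONF_FLAG + " key=value' at position " + i);
      }
      String pair = args[i + 1];
      int eq = pair.indexOf('=');
      if (eq <= 0) {
        throw new IllegalArgumentException("Malformed key=value pair: " + pair);
      }
      props.setProperty(pair.substring(0, eq), pair.substring(eq + 1));
    }
    return props;
  }

  public static void main(String[] args) {
    Properties jobProps = new Properties();
    jobProps.setProperty("example.store.name", "my-store");    // hypothetical key
    jobProps.setProperty("example.input.path", "/data/input"); // hypothetical key

    List<String> cliArgs = toCliArgs(jobProps);
    Properties parsed = fromCliArgs(cliArgs.toArray(new String[0]));

    System.out.println(cliArgs);
    System.out.println("round trip preserved properties: " + jobProps.equals(parsed));
  }
}
```

Because the Spark-side entry point in this sketch depends only on its arguments, relaunching it with the same args reconstructs the same configuration, which is the property that makes a retry of the compute job safe.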
This change is a new implementation of the `DataWriterComputeJob` interface, and all interactions with external systems are contained within this interface. This change serializes the job properties and job configs as CLI args and passes them to the isolated environment. The main class in the isolated environment parses these CLI args and configures the actual compute job. At the end of the compute job, the driver program in the isolated environment serializes the `DataWriterTaskTracker` to HDFS, and the VPJ driver program reads the same file from HDFS and returns it to VPJ to perform further validation and job polling.
How was this PR tested?
Tested manually and in integration tests. More testing is in progress. Unit tests still need to be added.
Does this PR introduce any user-facing changes?