Pre-validate checkpoint location write permission #414

dai-chen
Collaborator

@dai-chen dai-chen commented Jul 9, 2024

Description

This PR enhances the pre-validation process for checkpoint locations by adding a write permission check. Previously, only read permissions were verified. With this update, a temporary file is created at the specified location to verify write permissions.

The side effects of this change include:

  1. The checkpoint folder will be created earlier than before: previously, the streaming job created it on startup.
  2. A temporary file will remain in the checkpoint folder: this file is intentionally left undeleted to avoid requiring delete permissions, which will only be necessary during the index vacuuming process once the work in [FEATURE] Checkpoint folder data cleanup limitation #102 is completed.
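
For illustration, the write check described above can be sketched against a local filesystem. This is a minimal sketch, not the code in this PR: it uses `java.nio.file` rather than Spark's `CheckpointFileManager`, and the names `CheckpointWriteCheck` and `isWritable` are hypothetical. It mirrors both side effects: the folder is created if it is missing, and the temporary file is left in place.

```scala
import java.nio.file.{Files, Path}
import java.util.UUID
import scala.util.Try

// Hypothetical helper, assuming a local filesystem instead of S3.
object CheckpointWriteCheck {

  /** Returns true if an empty temporary file can be created under `dir`.
    * The folder is created when missing, and the temporary file is
    * deliberately not deleted, so no delete permission is required.
    */
  def isWritable(dir: Path): Boolean =
    Try {
      Files.createDirectories(dir)
      Files.createFile(dir.resolve(s"${UUID.randomUUID()}.tmp"))
    }.isSuccess
}
```

Any non-fatal failure (for example, an access-denied error or a path that exists but is not a directory) is reported as `false` rather than propagated, so a caller can turn it into a single validation error.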

Testing

Tested with an S3 bucket by first revoking the PutObject permission:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:AbortMultipartUpload",
                "s3:CreateBucket",
                "s3:DeleteObject",
                "s3:GetBucketVersioning",
                "s3:GetObject",
                "s3:GetObjectTagging",
                "s3:GetObjectVersion",
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads",
                "s3:ListBucketVersions",
                "s3:ListMultipartUploadParts",
                "s3:PutBucketVersioning",
                "s3:PutObject",
                "s3:PutObjectTagging"
            ],
            "Resource": [
                "arn:aws:s3:::*"
            ]
        }
    ]
}

Attempted to create a skipping index with the checkpoint_location option:

spark-sql> CREATE SKIPPING INDEX ON myglue.ds_tables.http_logs (
         >   status VALUE_SET
         > )
         > WITH (
         >   auto_refresh = true,
         >   checkpoint_location = 's3://validation_test/test-1'
         > );

Observed the expected error message:

java.lang.IllegalArgumentException: requirement failed: No sufficient permission to access
the checkpoint location s3://validation_test/test-1
  at scala.Predef$.require(Predef.scala:281) ~[scala-library-2.12.15.jar:?]
  at org.opensearch.flint.spark.refresh.AutoIndexRefresh.validate(AutoIndexRefresh.scala:59)
  at org.opensearch.flint.spark.FlintSparkIndexBuilder.validateIndex(FlintSparkIndexBuilder.scala:100)
  at org.opensearch.flint.spark.FlintSparkIndexBuilder.create(FlintSparkIndexBuilder.scala:63)
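
The `requirement failed:` prefix in the trace comes from Scala's `Predef.require`, which throws an `IllegalArgumentException` when its condition is false. A minimal sketch of that pattern follows; the object name, method signature, and the `isAccessible` parameter are hypothetical, and only the message text is taken from the trace above.

```scala
// Hypothetical validation helper; not the actual AutoIndexRefresh.validate.
object CheckpointValidation {

  /** Throws IllegalArgumentException if `isAccessible` rejects the location. */
  def validate(checkpointLocation: String, isAccessible: String => Boolean): Unit =
    require(
      isAccessible(checkpointLocation),
      s"No sufficient permission to access the checkpoint location $checkpointLocation")
}
```

Passing the permission check in as a function keeps the sketch independent of any particular storage backend.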

Granted the PutObject permission and created the index successfully. Here is what the checkpoint folder looks like:

➜  ~ aws s3 ls s3://validation_test/test-2/
                           PRE commits/
                           PRE offsets/
                           PRE sources/
2024-07-10 11:39:23          0 51377eda-b2dc-4fbf-803f-aa5ac4e586b0.tmp
2024-07-10 11:39:30          0 commits_$folder$
2024-07-10 11:39:30         45 metadata
2024-07-10 11:39:30          0 offsets_$folder$
2024-07-10 11:39:31          0 sources_$folder$

Issues Resolved

#404

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@dai-chen dai-chen added enhancement New feature or request 0.5 labels Jul 9, 2024
@dai-chen dai-chen self-assigned this Jul 9, 2024
@dai-chen dai-chen marked this pull request as ready for review July 11, 2024 16:18
Comment on lines 19 to +20
import org.apache.spark.sql.execution.streaming.CheckpointFileManager
import org.apache.spark.sql.execution.streaming.CheckpointFileManager.RenameHelperMethods
Collaborator

nit: can we merge the import?

Collaborator Author

@dai-chen dai-chen Jul 11, 2024

I couldn't find a way to do this in Scala. Please show me if you know how. Thanks!

Collaborator

Based on the class structure of CheckpointFileManager, it appears that the only option to avoid another import is to refer to the nested type by its qualified name, CheckpointFileManager.RenameHelperMethods.
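
To illustrate the qualified-name option, here is a small self-contained sketch. `FileManager` and `RenameHelper` are hypothetical stand-ins for an outer object with a nested trait; they are not Spark's actual types.

```scala
// Hypothetical stand-in for an object that nests another type.
object FileManager {
  trait RenameHelper {
    def rename(from: String, to: String): String
  }
}

object QualifiedNameExample {
  // No separate `import FileManager.RenameHelper` is needed here:
  // the nested type is reached through the name of the outer object.
  val helper: FileManager.RenameHelper = new FileManager.RenameHelper {
    def rename(from: String, to: String): String = s"$from -> $to"
  }
}
```

The trade-off is a longer type name at each use site in exchange for one fewer import line.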

@dai-chen dai-chen merged commit 7c4d0d6 into opensearch-project:main Jul 12, 2024
4 checks passed
@dai-chen dai-chen deleted the prevalidate-checkpoint-location-write-permission branch July 12, 2024 20:43