-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix empty path segments in Data Path transformations #453
base: main
Are you sure you want to change the base?
Conversation
…ker interface for AccessKeyCredentials in the AWS context.
…rage-client # Conflicts: # adapta/security/clients/aws/_aws_client.py
Coverage Report
|
adapta/storage/models/aws.py
Outdated
@@ -36,7 +36,7 @@ def to_uri(self) -> str: | |||
if not self.bucket or not self.path: | |||
raise ValueError("Bucket and path must be defined") | |||
|
|||
return f"s3://{self.bucket}/{self.path}" | |||
return f"s3a://{self.bucket.rstrip('/')}/{self.path}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use __post_init__
and assert or rstrip in there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same for path
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added __post_init__
with a regex expression to check all valid cases for s3 data paths as suggested in cbf15e9
Since it is a complex regex I suggest looking into the unit tests, I tried to cover most of the corner cases
adapta/storage/models/aws.py
Outdated
@@ -70,25 +70,30 @@ def from_hdfs_path(cls, hdfs_path: str) -> "S3Path": | |||
""" | |||
assert hdfs_path.startswith("s3a://"), "HDFS S3 path should start with s3a://" | |||
uri = urlparse(hdfs_path) | |||
parsed_path = uri.path.split("/") | |||
parsed_path = uri.path.replace("//", "/").split("/") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change the check to use regex, will simplify assertion here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Included this on the regex check done in __post_init__
done in cbf15e
… Bug-fix-empty-path-segment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs work
adapta/storage/models/aws.py
Outdated
if not self.bucket: | ||
raise ValueError("Bucket must be defined") | ||
|
||
path_regex = r"^(?![0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$)[a-z0-9]([a-z0-9\-]{1,61}[a-z0-9])?(\/(?!.*(\/\/|\\))([^\/].{0,1022}\/?)?)?$" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
simplify this regex to just check for //
adapta/storage/models/aws.py
Outdated
|
||
path_regex = r"^(?![0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$)[a-z0-9]([a-z0-9\-]{1,61}[a-z0-9])?(\/(?!.*(\/\/|\\))([^\/].{0,1022}\/?)?)?$" | ||
|
||
s3_path_without_prefix = f"{self.bucket}/{self.path}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
regex should only be applied to self.path
, bucket is checked on line 62, we do not aim to validate bucket name with this PR, out of scope
adapta/storage/models/aws.py
Outdated
match = re.match(path_regex, s3_path_without_prefix) | ||
|
||
if not match: | ||
raise ValueError(f"Invalid S3Path provided, must comply with : {path_regex}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do not put regex in error message, explain in human language :)
Fixes #449
Scope
Implemented:
Checklist
.pylintrc
with exceptions.latest
commit.