Fix empty path segments in Data Path transformations #453

Open
wants to merge 37 commits into
base: main

Andre-Lx-Costa
Contributor

Fixes #449

Scope

Implemented:

  • Fixed transformations between S3 DataPath objects and path strings (URI and HDFS forms) so that empty path segments are handled

Checklist

  • GitHub issue exists for this change.
  • Unit tests added and they pass.
  • Pylint 10.0/10.0 without bloating .pylintrc with exceptions.
  • Review requested on latest commit.


github-actions bot commented Jul 18, 2024

Coverage

Coverage Report
File | Stmts | Miss | Cover | Missing
adapta/connectors/service_bus
   __init__.py110%18
   _connector.py17170%19–61
adapta/logs
   _async_logger.py81396%55, 78–79
   _base.py55689%35, 41, 44, 94–97
   _internal_logger.py109694%276–286
adapta/logs/handlers
   datadog_api_handler.py1093172%89, 106–113, 124, 136–152, 161–195, 204, 210, 238
adapta/metrics/providers
   datadog_provider.py43430%19–147
adapta/ml
   __init__.py110%19
   _model.py10100%17–42
adapta/ml/mlflow
   __init__.py220%17–18
   _client.py46460%19–164
   _functions.py47470%17–121
adapta/process_communication
   _models.py33682%90–96
adapta/security/clients
   __init__.py261254%27–28, 34–35, 41–42, 48–49, 53–54, 58–59
   _azure_client.py735032%42, 55–65, 75–78, 81, 84–86, 95–153, 156, 159–197
adapta/security/clients/aws
   _aws_client.py381755%37–40, 47, 57, 63, 75, 100–108, 114–117
   _aws_credentials.py733059%60–79, 83, 87, 91, 95, 99, 108–112, 116, 120, 124, 128, 132
adapta/security/clients/hashicorp_vault
   hashicorp_vault_client.py31487%46, 87, 91, 95
   kubernetes_client.py21576%45–48, 67–68
   oidc_client.py452056%33–62, 80–83, 92
   token_client.py17759%42–45, 52–53, 56, 59
adapta/storage/blob
   azure_storage_client.py1195157%71–78, 88, 95–96, 99–105, 127–128, 131–156, 159, 175–188, 196–200, 219, 240–242, 251, 267–271, 281–283, 286–305, 312
   local_storage_client.py54787%42, 67, 74, 87, 101, 104, 107
   s3_storage_client.py1177734%57, 61, 68–80, 91–96, 117–125, 133–134, 148–165, 181–185, 203–212, 222–243, 254–275
adapta/storage/cache
   redis_cache.py37370%19–107
adapta/storage/database/v2
   azure_sql.py34340%21–140
   odbc.py73730%21–219
   snowflake_sql.py69690%6–228
   trino_sql.py39390%21–127
adapta/storage/database/v2/models
   __init__.py110%19
   _models.py11110%20–54
adapta/storage/database/v3
   azure_sql.py322038%55, 70, 84–122, 132
   odbc.py721185%94–104, 114, 125–126, 140, 149–155, 182
   snowflake_sql.py772370%64–78, 93–95, 106–115, 142, 178, 193–196
   trino_sql.py38380%20–119
adapta/storage/delta_lake/v2
   _functions.py684534%73–106, 153–168, 210–294
adapta/storage/delta_lake/v3
   _functions.py661282%72, 156, 161, 163, 165, 221, 231, 241–251, 257, 292
adapta/storage/distributed_object_store/v2/datastax_astra
   _models.py19953%54–59, 62, 65, 68
   astra_client.py19614029%40–43, 113–137, 143–184, 190–191, 197–198, 201, 211, 223–224, 233, 273–361, 369, 393–520, 530, 541–556, 579–594, 616–632, 656–673
adapta/storage/distributed_object_store/v3/datastax_astra
   _model_mappers.py1612386%64, 130, 136, 138, 142, 148, 152, 167–170, 186–187, 218, 275–285, 370, 411–423, 455
   _models.py453131%63–69, 72, 75, 78–122, 125
   astra_client.py1396851%35–38, 133–174, 180–181, 187–188, 191, 201, 213–214, 223, 267–353, 361, 373, 384–403, 424–440, 462–482, 507–519
adapta/storage/models
   astra.py351071%37, 40, 44, 59–61, 64, 70–73
   aws.py37489%53–55, 63
   azure.py601772%32, 36–40, 47, 67, 70–71, 74–75, 89–93, 100, 113, 120, 123–124, 127
   filter_expression.py125596%55, 183–184, 238, 329
   hive.py572556%37, 41, 44, 92–99, 111, 114–115, 124–172, 175
   local.py21481%31, 35, 38, 50
adapta/storage/query_enabled_store
   _models.py611084%76, 138–141, 147–150, 156–157, 163
   _qes_astra.py571377%65–66, 78, 83–96, 106–109
   _qes_delta.py39490%33, 60, 72, 80
adapta/storage/secrets
   azure_secret_client.py20200%19–66
adapta/utils
   _common.py951584%36–37, 70–83, 94, 122, 142, 162, 245
   concurrent_task_runner.py27196%109
adapta/utils/data_structures
   _functions.py34197%134
adapta/utils/decorators
   _logging.py41198%32
   _rate_limit.py25196%58
adapta/utils/python_typing
   _functions.py11191%24
tests
   test_filtering_api.py32294%197–198
   test_s3_storage_client.py43295%34–35
   test_utils.py166199%371
   test_vault_client.py801878%33–35, 40–42, 47–51, 56–57, 62–66
TOTAL | 4570 | 1338 | 71%

Tests Skipped Failures Errors Time
215 5 💤 0 ❌ 0 🔥 1m 2s ⏱️


github-actions bot commented Jul 18, 2024

Coverage

Coverage Report
File | Stmts | Miss | Cover | Missing
adapta/connectors/service_bus
   __init__.py110%18
   _connector.py17170%19–61
adapta/logs
   _async_logger.py81396%55, 78–79
   _base.py55689%35, 41, 44, 94–97
   _internal_logger.py109694%276–286
adapta/logs/handlers
   datadog_api_handler.py1093172%89, 106–113, 124, 136–152, 161–195, 204, 210, 238
adapta/metrics/providers
   datadog_provider.py43430%19–147
adapta/ml
   __init__.py110%19
   _model.py10100%17–42
adapta/ml/mlflow
   __init__.py220%17–18
   _client.py46460%19–164
   _functions.py47470%17–121
adapta/process_communication
   _models.py33682%90–96
adapta/security/clients
   __init__.py261254%27–28, 34–35, 41–42, 48–49, 53–54, 58–59
   _azure_client.py735032%42, 55–65, 75–78, 81, 84–86, 95–153, 156, 159–197
adapta/security/clients/aws
   _aws_client.py381755%37–40, 47, 57, 63, 75, 100–108, 114–117
   _aws_credentials.py733059%60–79, 83, 87, 91, 95, 99, 108–112, 116, 120, 124, 128, 132
adapta/security/clients/hashicorp_vault
   hashicorp_vault_client.py31487%46, 87, 91, 95
   kubernetes_client.py21576%45–48, 67–68
   oidc_client.py452056%33–62, 80–83, 92
   token_client.py17759%42–45, 52–53, 56, 59
adapta/storage/blob
   azure_storage_client.py1185058%71–78, 88, 95–96, 99–105, 127–128, 131–156, 159, 175–188, 196–200, 219, 240–242, 251, 267–271, 281–283, 286–305, 312
   local_storage_client.py54787%42, 67, 74, 87, 101, 104, 107
   s3_storage_client.py1177734%57, 61, 68–80, 91–96, 117–125, 133–134, 148–165, 181–185, 203–212, 222–243, 254–275
adapta/storage/cache
   redis_cache.py37370%19–107
adapta/storage/database/v2
   azure_sql.py34340%21–140
   odbc.py73730%21–219
   snowflake_sql.py69690%6–228
   trino_sql.py39390%21–127
adapta/storage/database/v2/models
   __init__.py110%19
   _models.py11110%20–54
adapta/storage/database/v3
   azure_sql.py322038%55, 70, 84–122, 132
   odbc.py721185%94–104, 114, 125–126, 140, 149–155, 182
   snowflake_sql.py772370%64–78, 93–95, 106–115, 142, 178, 193–196
   trino_sql.py38380%20–119
adapta/storage/delta_lake/v2
   _functions.py684534%73–106, 153–168, 210–294
adapta/storage/delta_lake/v3
   _functions.py661282%72, 156, 161, 163, 165, 221, 231, 241–251, 257, 292
adapta/storage/distributed_object_store/v2/datastax_astra
   _models.py19953%54–59, 62, 65, 68
   astra_client.py19614029%40–43, 113–137, 143–184, 190–191, 197–198, 201, 211, 223–224, 233, 273–361, 369, 393–520, 530, 541–556, 579–594, 616–632, 656–673
adapta/storage/distributed_object_store/v3/datastax_astra
   _model_mappers.py1612386%64, 130, 136, 138, 142, 148, 152, 167–170, 186–187, 218, 275–285, 370, 411–423, 455
   _models.py453131%63–69, 72, 75, 78–122, 125
   astra_client.py1396851%35–38, 133–174, 180–181, 187–188, 191, 201, 213–214, 223, 267–353, 361, 373, 384–403, 424–440, 462–482, 507–519
adapta/storage/models
   astra.py351071%37, 40, 44, 59–61, 64, 70–73
   aws.py37489%53–55, 63
   azure.py601772%32, 36–40, 47, 67, 70–71, 74–75, 89–93, 100, 113, 120, 123–124, 127
   filter_expression.py125596%55, 183–184, 238, 329
   hive.py572556%37, 41, 44, 92–99, 111, 114–115, 124–172, 175
   local.py21481%31, 35, 38, 50
adapta/storage/query_enabled_store
   _models.py611084%76, 138–141, 147–150, 156–157, 163
   _qes_astra.py571377%65–66, 78, 83–96, 106–109
   _qes_delta.py39490%33, 60, 72, 80
adapta/storage/secrets
   azure_secret_client.py20200%19–66
adapta/utils
   _common.py951584%36–37, 70–83, 94, 122, 142, 162, 245
   concurrent_task_runner.py27196%109
adapta/utils/data_structures
   _functions.py34197%134
adapta/utils/decorators
   _logging.py41198%32
   _rate_limit.py25196%58
adapta/utils/python_typing
   _functions.py11373%7–9, 22
tests
   test_filtering_api.py32294%197–198
   test_python_typing_functions.py11282%38–39
   test_s3_storage_client.py43295%34–35
   test_utils.py166199%371
   test_vault_client.py801878%33–35, 40–42, 47–51, 56–57, 62–66
TOTAL | 4568 | 1341 | 71%

Tests Skipped Failures Errors Time
215 6 💤 0 ❌ 0 🔥 1m 3s ⏱️

Andre-Lx-Costa marked this pull request as ready for review July 18, 2024 07:55
@@ -36,7 +36,7 @@ def to_uri(self) -> str:
         if not self.bucket or not self.path:
             raise ValueError("Bucket and path must be defined")
 
-        return f"s3://{self.bucket}/{self.path}"
+        return f"s3a://{self.bucket.rstrip('/')}/{self.path}"
Contributor

Use __post_init__ and either assert or rstrip in there.

Contributor

Same for path.
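A minimal sketch of what the two comments above suggest, assuming a dataclass-based path model. The class name S3PathSketch and its normalization rules are hypothetical and are not taken from the PR's code; the point is only that normalizing once in __post_init__ removes the need for rstrip calls in each conversion method:

```python
from dataclasses import dataclass


@dataclass
class S3PathSketch:
    """Hypothetical stand-in for the S3 path model discussed in this PR."""

    bucket: str
    path: str

    def __post_init__(self) -> None:
        # Normalize once at construction so that to_uri() and similar
        # methods do not each need their own rstrip/lstrip calls.
        self.bucket = self.bucket.rstrip("/")
        self.path = self.path.lstrip("/")

    def to_uri(self) -> str:
        if not self.bucket or not self.path:
            raise ValueError("Bucket and path must be defined")
        return f"s3://{self.bucket}/{self.path}"
```

The alternative mentioned in the comment, asserting instead of stripping, would reject such inputs rather than repair them; which behaviour is preferable is left open here.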

Contributor Author

Added a __post_init__ with a regex that checks all valid cases for S3 data paths, as suggested, in cbf15e9.

Since the regex is complex, I suggest looking at the unit tests; I tried to cover most of the corner cases.

@@ -70,25 +70,30 @@ def from_hdfs_path(cls, hdfs_path: str) -> "S3Path":
         """
         assert hdfs_path.startswith("s3a://"), "HDFS S3 path should start with s3a://"
         uri = urlparse(hdfs_path)
-        parsed_path = uri.path.split("/")
+        parsed_path = uri.path.replace("//", "/").split("/")
Contributor

Change the check to use a regex; that will simplify the assertion here.

Contributor Author

Andre-Lx-Costa, Jul 29, 2024

Included this in the regex check in __post_init__, done in cbf15e
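For context on why the replace-based line in the hunk above is fragile: str.replace("//", "/") makes a single non-overlapping pass, so three consecutive slashes still leave a double slash behind. The sketch below shows one alternative, filtering out empty segments after the split. The helper name split_hdfs_path is hypothetical, and silently dropping empty segments (including a trailing slash) is an assumption; the thread itself moves toward rejecting such paths through validation instead.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_hdfs_path(hdfs_path: str) -> Tuple[str, str]:
    """Hypothetical helper: return (bucket, path) with empty segments dropped."""
    if not hdfs_path.startswith("s3a://"):
        raise ValueError("HDFS S3 path should start with s3a://")
    uri = urlparse(hdfs_path)
    # Filtering on truthiness drops every empty segment, including those
    # produced by three or more consecutive slashes.
    segments = [segment for segment in uri.path.split("/") if segment]
    return uri.netloc, "/".join(segments)


# A single replace pass does not collapse a triple slash completely.
leftover = "a///b".replace("//", "/")
```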

Contributor

george-zubrienko left a comment

Needs work

if not self.bucket:
    raise ValueError("Bucket must be defined")

path_regex = r"^(?![0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$)[a-z0-9]([a-z0-9\-]{1,61}[a-z0-9])?(\/(?!.*(\/\/|\\))([^\/].{0,1022}\/?)?)?$"
Contributor

Simplify this regex to just check for //.


path_regex = r"^(?![0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$)[a-z0-9]([a-z0-9\-]{1,61}[a-z0-9])?(\/(?!.*(\/\/|\\))([^\/].{0,1022}\/?)?)?$"

s3_path_without_prefix = f"{self.bucket}/{self.path}"
Contributor

The regex should be applied only to self.path. The bucket is checked on line 62; validating the bucket name is out of scope for this PR.

match = re.match(path_regex, s3_path_without_prefix)

if not match:
    raise ValueError(f"Invalid S3Path provided, must comply with : {path_regex}")
Contributor

Do not put the regex in the error message; explain in human language :)
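Taken together, the last three review comments ask for a check that looks only for //, applies only to the path, and reports failures in plain language. A minimal sketch under those assumptions follows; the function name validate_path, the constant name, and the message wording are hypothetical and not part of the PR:

```python
import re

# Matches an empty path segment, i.e. two consecutive forward slashes.
EMPTY_SEGMENT = re.compile(r"//")


def validate_path(path: str) -> None:
    """Hypothetical validator applied to the path only, not the bucket."""
    if EMPTY_SEGMENT.search(path):
        # Describe the rule in words instead of echoing the regex.
        raise ValueError(
            f"Invalid S3 path '{path}': it must not contain empty segments "
            "(two or more consecutive '/' characters)"
        )
```

In a dataclass this would typically be called from __post_init__ with self.path, leaving the existing bucket check untouched.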

Successfully merging this pull request may close these issues.

[BUG] Delta RS read fails if you have an empty path segment
2 participants