[C++] Don't allow the inclusion of passwords (storage account keys) in Azure ABFS URLs #43197
Could you clarify your suggestion? Do you mean the following?
BTW, it seems that you misinterpreted the supported formats: all of the formats you listed are compatible with https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction-abfs-uri .
@Tom-Newton Could you take a look at this?
@kou
I suggest only the first point. Regarding the second point, I don't have a strong objection to the existing patterns as long as a file system implementation supports the Hadoop-compatible syntax, since that is the only syntax commonly supported by most file system implementations (except DuckDB...). However, since having storage account keys in ABFS URLs is a new practice invented by Apache Arrow and raises security concerns, I suggest removing this feature.
You are right. I have updated the issue description; let me copy-paste the original comment in the source code:

```
/// 1. abfs[s]://[:<password>@]<account>.blob.core.windows.net
///    [/<container>[/<path>]]
/// 2. abfs[s]://<container>[:<password>]@<account>.dfs.core.windows.net
///    [/path]
/// 3. abfs[s]://[<account>[:<password>]@]<host>[.domain][:<port>]
///    [/<container>[/path]]
/// 4. abfs[s]://[<account>[:<password>]@]<container>[/path]
```

In my view, only the second pattern (without the password) is Hadoop-compatible. Patterns 1 and 3 do not include the container (file system) name in the authority part, and pattern 4 has a different syntax for the authority part. But as I explained above, I don't have strong objections against these patterns except for having passwords in URLs.
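To see where a generic URL parser places each component of these patterns, the stdlib parser can be used as a probe. This is a sketch with placeholder names (`mycontainer`, `secretkey`, `myaccount`), not Arrow's actual parsing code; it shows pattern 2 with an embedded password:

```python
from urllib.parse import urlsplit

# Pattern 2 with a password (storage account key) embedded.
# "secretkey" is a placeholder, not a real account key.
url = "abfss://mycontainer:secretkey@myaccount.dfs.core.windows.net/dir/data.parquet"
parts = urlsplit(url)

print(parts.username)  # mycontainer -> interpreted as the container name
print(parts.password)  # secretkey   -> interpreted as the account key
print(parts.hostname)  # myaccount.dfs.core.windows.net
print(parts.path)      # /dir/data.parquet
```

Note that the key lands in the standard `password` slot of the URL's userinfo, which is exactly the part most tooling treats as non-confidential path data.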
To be honest, I think I will struggle to find time to work on this. Personally I'm in favour of the proposed changes: removing some of this extra URL parsing functionality, including passwords. However, I'm also not especially concerned about it. Certainly an account key is more concerning, but it is common practice to use SAS tokens in blob storage URLs.
I'm neutral on this proposal. If we reject the password value, users can't use account-key-based authentication with the URI interface. It's useful for local development with Azurite. Could you share the bad scenarios you have in mind?
Could you start a discussion on the
I use Apache Arrow mainly in Python code, so let me explain using PyArrow as an example. When working with the PyArrow API, we have two methods to specify a file system:

```python
import pyarrow.parquet as pq
from pyarrow import fs

# 1. Explicitly set
s3 = fs.S3FileSystem(...)
pq.read_table("my-bucket/data.parquet", filesystem=s3)

# 2. Infer from a URL
pq.read_table("s3://my-bucket/data.parquet")
```

For 1, we don't need to embed a storage account key or any other credentials for file system access in a file path URL, as long as we can set them when we create a file system instance:

```python
s3 = fs.S3FileSystem(access_key=...)
```

For 2, many existing file system libraries provide an interface to configure credentials for file system access in a global context.

Even if it looks convenient, embedding credentials in a file path URL is generally unnecessary. Other file system implementations work well without this method. In the Azure SDK, there is no standardized environment variable for setting storage account keys, but we should consider these common practices instead of inventing new ABFS URL syntax.
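As a sketch of the common practice described above, credentials can be resolved from an explicit argument or from the environment rather than from the path URL. The helper name and the environment variable name below are illustrative assumptions, not part of any library discussed in this thread:

```python
import os


def resolve_account_key(explicit_key=None, env_var="AZURE_STORAGE_ACCOUNT_KEY"):
    """Resolve a storage account key from an explicit argument or the
    environment, never from the file path URL (hypothetical helper)."""
    if explicit_key is not None:
        return explicit_key
    return os.environ.get(env_var)


# Placeholder value for demonstration only.
os.environ["AZURE_STORAGE_ACCOUNT_KEY"] = "dummy-key"
print(resolve_account_key())            # dummy-key
print(resolve_account_key("override"))  # override
```

With this kind of resolution order, file path URLs stay free of secrets and the same path string can be handed to any implementation.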
There are several Azure Blob File System implementations in Python, and we frequently need to use multiple implementations in the same code. However,
Because of 2, an Arrow-style ABFS URL containing a storage account key can be accidentally passed to a different ABFS implementation. However, that implementation usually does not assume the passed URL contains a storage account key, as explained in 1. This leads to rejection of the URL and an error message like the one below, which can be exposed in error logs, HTTP error responses, etc.
Having storage account keys in ABFS URLs can cause this kind of interoperability issue with other ABFS implementations and unexpected exposure of keys.
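One consequence of this exposure risk: any code that logs such URLs would need a redaction step before the URL can safely appear in error messages. The helper below is a hypothetical, stdlib-only sketch of that burden, not part of any library discussed here:

```python
from urllib.parse import urlsplit, urlunsplit


def redact_url_password(url):
    """Replace any password in the userinfo part of a URL before logging.
    Illustrative helper; names and behaviour are assumptions."""
    parts = urlsplit(url)
    if parts.password is None:
        return url
    netloc = f"{parts.username or ''}:***@{parts.hostname}"
    if parts.port is not None:
        netloc += f":{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


url = "abfss://container:secretkey@account.dfs.core.windows.net/data"
print(redact_url_password(url))
# abfss://container:***@account.dfs.core.windows.net/data
```

Implementations that never accept credentials in URLs avoid needing this kind of redaction in the first place.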
This is true, but a blob URL containing a SAS token is usually (and should be) treated as confidential information that must be carefully handled to avoid unexpected leaks. We cannot apply the same practice to file paths in general, which appear in many places in code.
@sugibuchi Could you do this with the additional information?
The discussion thread on
### Rationale for this change

Other Azure Blob Storage based filesystem API implementations don't use the password field in URIs. We don't use it either, for compatibility.

### What changes are included in this PR?

Ignore the password field.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

**This PR includes breaking changes to public APIs.**

* GitHub Issue: #43197

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
Issue resolved by pull request 44220
Describe the enhancement requested
Outline
The Azure Blob File System (ABFS) support in Apache Arrow, implemented in the C++ API by #18014 and integrated into the Python API by #39968, currently allows embedding a storage account key as a password in an ABFS URL.
https://github.com/apache/arrow/blob/r-16.1.0/cpp/src/arrow/filesystem/azurefs.h#L138-L144
However, I strongly recommend stopping this practice for two reasons.
Security
An access key of a storage account is practically a "root password," giving full access to the data in the storage account.
Microsoft repeatedly emphasises this point in various places in the documentation and encourages the protection of account keys in a secure place like Azure Key Vault.
Embedding a storage account key in an ABFS URL, which is usually not considered confidential information, may lead to unexpected exposure of the key.
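A small sketch of how such exposure can happen in practice: an implementation that does not expect credentials in the path simply echoes the full URL in its error message. The function and the key value below are placeholders, not real APIs or secrets:

```python
def open_dataset(path):
    """Stand-in for any file system API that does not expect credentials
    in the path and echoes the path verbatim on failure (hypothetical)."""
    raise FileNotFoundError(f"No such file or directory: '{path}'")


try:
    open_dataset("abfss://data:MY_ACCOUNT_KEY@account.dfs.core.windows.net/t")
except FileNotFoundError as e:
    # The account key ends up verbatim in logs, tracebacks, and error reports.
    print(e)
```

Because file paths are routinely logged, retried, and reported, a key embedded in one is effectively published to every sink those paths reach.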
Interoperability with other file system implementations
For historical reasons, the syntax of the Azure Blob File System (ABFS) URL is inconsistent between different file system implementations.
- Original implementation by Apache Hadoop's `hadoop-azure` package (link). This syntax is widely used, particularly by Apache Spark.
- Python `adlfs` for `fsspec` (link)
- Rust `object_store::azure` (link)
- DuckDB `azure` extension (link)
- Apache Arrow (link)
This inconsistency of syntax already causes problems in applications using different frameworks, including the additional overhead of translating ABFS URLs between different syntaxes. It may also lead to unexpected behaviours due to misinterpretation of the same URL by different file system implementations.
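The translation overhead mentioned above can be sketched as follows. `hadoop_to_adlfs` is a hypothetical helper that rewrites a Hadoop-style URL (container in the userinfo slot, account in the host) into the adlfs short form, assuming the account name is configured out of band:

```python
from urllib.parse import urlsplit


def hadoop_to_adlfs(url):
    """Translate a Hadoop-style ABFS URL
    (abfs[s]://container@account.dfs.core.windows.net/path) to the
    adlfs-style short form (abfs[s]://container/path).
    Illustrative sketch only; drops the account name, which adlfs
    takes from its own configuration."""
    parts = urlsplit(url)
    container = parts.username  # Hadoop puts the container in the userinfo slot
    return f"{parts.scheme}://{container}{parts.path}"


print(hadoop_to_adlfs("abfss://mycontainer@myaccount.dfs.core.windows.net/dir/file.parquet"))
# abfss://mycontainer/dir/file.parquet
```

Every pair of syntaxes in the list above potentially needs a shim like this, and each new syntax multiplies the combinations.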
I believe a new file system implementation should respect the existing syntax of a URL scheme and SHOULD NOT invent new ones. As far as I understand, no other ABFS file system implementation allows embedding storage account keys in ABFS URLs.
Component(s)
C++, Python