-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow for filepaths to include #161
Conversation
When playing with omlmd I tried to push a file which contained a ":" and oras choked on it. I belive that you should check if the file exists and only split off the last : looking for options, not the first colon. Signed-off-by: Daniel J Walsh <[email protected]>
@@ -346,7 +346,8 @@ def split_path_and_content(ref: str) -> PathAndOptionalContent: | |||
: return: A Tuple of the path in the reference, and the content-type if one found, | |||
otherwise None. | |||
""" | |||
if ":" not in ref: | |||
|
|||
if os.path.exists(ref) or ":" not in ref: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another case to consider handling here... ref's that start with colon. I don't think they should be treated as a naked content type, right? Instead, I guess the entire ref should be considered as a path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@halfline you mean like :<content-type>
? Wouldn't that akin to an empty / missing file and a content type (is that allowed / is there a use case for it)? I'm wondering if we should specifically not allow that (e.g., if the path doesn't exist but the string starts with :
maybe we need to raise an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so in my mind, if someone e.g. does
╎❯ omlmd push http://quay.io/foo/foo:latest '.file-that-doesnt-exist'
and
╎❯ omlmd push http://quay.io/foo/foo:latest ':file-that-doesnt-exist'
in both cases the error message should be the same and the overall behavior should be the same.
But that's just a driveby subjective opinion. I haven't looked deeply into how split_path_and_content
is used or could be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the path doesn't exist and the colon is being used weirdly like that to not separate a filepath from a content type, I think an error should be raised.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree if the file does not exist with or with the colon removed then raise the ENOENT error.
If the file exists with the colon then use it. If the file exists with the colon and a legitimate option then just use the file and let the user deal with it. IE change the names of the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rhatdan the discussion below is getting out of scope. If you'd like to add any additional tweaks based on the discussion here, please do. You'll also need to bump the version in oras/version.py and add a line to the changelog. When those are done I'll review once more and we can merge, and follow up discussion about reference formats, etc. should go in an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Proposed test
@pytest.mark.with_auth(False)
def test_file_contains_column(tmp_path, registry, credentials, target):
"""
Test for file containing column symbol
"""
client = oras.client.OrasClient(hostname=registry, insecure=True)
artifact = os.path.join(here, "artifact.txt")
assert os.path.exists(artifact)
try:
contains_column = here / "some:file"
with open(contains_column, "w") as f:
f.write("hello world some:file")
res = client.push(
files=[contains_column],
target=target
)
assert res.status_code in [200, 201]
files = client.pull(target, outdir=tmp_path / "download")
download = str(tmp_path / "download/some:file")
assert download in files
assert oras.utils.get_file_hash(str(contains_column)) == oras.utils.get_file_hash(
download
)
finally:
contains_column.unlink()
try:
contains_column = here / ":somefile"
with open(contains_column, "w") as f:
f.write("hello world :somefile")
res = client.push(
files=[contains_column],
target=target
)
assert res.status_code in [200, 201]
files = client.pull(target, outdir=tmp_path / "download")
download = str(tmp_path / "download/:somefile")
assert download in files
assert oras.utils.get_file_hash(str(contains_column)) == oras.utils.get_file_hash(
download
)
finally:
contains_column.unlink()
with pytest.raises(FileNotFoundError):
client.push(
files=[":doesnotexist"],
target=target
)
with pytest.raises(FileNotFoundError):
client.push(
files=[".doesnotexist"],
target=target
)
( I don't have rights to add myself to this PR )
Other considerations
With this patch, this test case on main starts to fail:
oras-py/oras/tests/test_provider.py
Lines 133 to 137 in 8f52216
testref = "path/to/config:application/vnd.oci.image.config.v1+json:extra" | |
remote = oras.provider.Registry(hostname=registry, insecure=True) | |
ref, content_type = remote._parse_manifest_ref(testref) | |
assert ref == "path/to/config" | |
assert content_type == "application/vnd.oci.image.config.v1+json:extra" |
but honestly that looks to me a wrong test case since it seems to me it's not a valid mime-type 🤔 should we drop that test case? i.e.: I'm not sure the trailing :extra
would make it into a valid mime-type application/vnd.oci.image.config.v1+json:extra
or I haven't found the requirement that motivates it
( I don't have rights to add myself to this PR )
HTH
@tarilabs akin to our discussion above, I think we should allow maximally one |
@vsoch we could transform the already existing test case on main, failing with this PR, as a failure case. If I understand that correctly, I agree. I do not concur we should allow maximally one Edit: or you meant "allow maximally one :" ...as a separator? with pytest.raises(FileNotFoundError, match=r".*does:not:exists.*"):
client.push(
files=["does:not:exists:mime"],
target=target
)
with pytest.raises(FileNotFoundError, match=r".*does:not:exists.*"):
client.push(
files=["does:not:exists:mime+ext"],
target=target
) Let me know how could I help on this PR ! Thanks for the early feedback |
Again just driveby commenting with limited context so feel free to disregard but... why would you allow one colon but not more than one colon? what if a filename has two? (eg machineid:hash-type:hash) what if the ref has one and a content type (eg hash-type:hash:content-type)? if the plan is to delegitimize certain filenames to avoid ambiguity with filenames with embedded content types then why allow embedded colons at all? Special case for just Dan's case seems strange to me, it should either allow colons in general or not all imo also, is there any security ramifications? is this ever used in untrusted scenarios? just thinking about the case where a content type could be spoofed by having files named |
More than one colon is fine if the file is found to exist. It's the cases where you've parsed a mimetype and there is an extra colon not associated with a file that should raise an error. That's likely a mistake. |
But if the file doesn't exist it ultimately raises an error anyway right (regardless of colons)? I guess the question is should, Another complication is mime types can support parameters, and the values of parameters can have colons (e.g, just making something up, prolly not be real, ref="file.mp4:video/mpeg4; aspect_ratio=16:9") . So perhaps this code should be validating the content type against the spec as well, and the kind of error returned perhaps should depend on if the content type is valid. |
Is that applicable here? 🤔
(source) it seems to me it's only the media-type itself, not the parameters typically found in content-type. |
We can discuss those more complex cases when they actually appear. I don't think it makes sense to solve a problem that doesn't exist yet. |
I think the main use case for parameters is legacy text files (e.g. ISO-8859-1 or Windows-1252 instead of utf-8), maybe not applicable, or worth supporting...no idea. Complications with parameters aside, I guess my main point, though, was, this merge request is focusing on validating the filename and doing I/O calls to make heuristic decisions, but it might be worthwhile to validate the media type too. Knowing if the media type is valid or invalid is useful in its own right, and can also potentially inform the heuristics as well. |
Totally agree! It's out of scope for this PR, but it would be great to open an issue (if you don't want to work on it) or PR directly to tackle it. |
okay, but note files can legitimately start with colons, may be none that will ever go through these functions, i'm don't know. (but I have some in my home dir right now for instance)
makes sense to me
assume you mean that the other way around, but makes sense
makes sense
I still think it probably makes sense to do more validation than just "is empty" since the docs say the format is a hard requirement. I wouldn't be surprised if there is some regex already crafted out there that validates to spec.
yea sorry not trying to derail things for cases that don't matter. maybe i'm pushing for over engineering here... |
We can move the "starts with" to be later, and yes other order is what I meant. Apologies - I keep switching state between deep programming and trying to be articulate - doesn't always work super well. And generally whatever you think is best practice I am happy to review. |
So I don't want to invest the time to build and test this right now, but my suggestion would be something like this, (though maybe slightly different if you and @tarilabs decide media type parameters aren't worthy of supporting) def split_path_and_content(ref: str) -> PathAndOptionalContent:
"""
Parse a string containing a path and an optional content
Examples
--------
<path>:<content-type>
path/to/config:application/vnd.oci.image.config.v1+json
/dev/null:application/vnd.oci.image.config.v1+json
C:\\myconfig:application/vnd.oci.image.config.v1+json
Or,
<path>
/dev/null
C:\\myconfig
:param ref: the manifest reference to parse (examples above)
:type ref: str
: return: A Tuple of the path in the reference, and the content-type if one found,
otherwise None.
"""
- if ":" not in ref:
- return PathAndOptionalContent(ref, None)
-
- if pathlib.Path(ref).drive:
- # Running on Windows and Path has Windows drive letter in it, it definitely has
- # one colon and could have two or feasibly more, e.g.
- # C:\test.tar
- # C:\test.tar:application/vnd.oci.image.layer.v1.tar
- # C:\test.tar:application/vnd.oci.image.layer.v1.tar:somethingelse
- #
- # This regex matches two colons in the string and returns everything before
- # the second colon as the "path" group and everything after the second colon
- # as the "context" group.
- # i.e.
- # (C:\test.tar):(application/vnd.oci.image.layer.v1.tar)
- # (C:\test.tar):(application/vnd.oci.image.layer.v1.tar:somethingelse)
- # But C:\test.tar along will not match and we just return it as is.
- path_and_content = re.search(r"(?P<path>.*?:.*?):(?P<content>.*)", ref)
- if path_and_content:
- return PathAndOptionalContent(
- path_and_content.group("path"), path_and_content.group("content")
- )
- return PathAndOptionalContent(ref, None)
- else:
- path_content_list = ref.split(":", 1)
- return PathAndOptionalContent(path_content_list[0], path_content_list[1])
+
+ # RFC 6838 Section 4.2: Media Type Syntax
+ # https://tools.ietf.org/html/rfc6838#section-4.2
+ #
+ # Media types are defined as:
+ #
+ # type-name = restricted-name
+ # subtype-name = restricted-name
+ # restricted-name = restricted-name-first *126restricted-name-chars
+ # restricted-name-first = ALPHA / DIGIT
+ # restricted-name-chars = ALPHA / DIGIT / "!" / "#" / "$" / "&" / "-" / "^" / "_"
+ #
+ # So valid characters are letters, digits, and the symbols: ! # $ & - ^ _
+ # Additionally, parameters can be added, which are not specified in detail in RFC 6838.
+ # For parameters, we'll accept any token or quoted string after a semi
+
+ RESTRICTED_NAME_FIRST = r"(?:[A-Za-z0-9])"
+ RESTRICTED_NAME_CHARS = r"[A-Za-z0-9!#$&\-\^_]"
+ RESTRICTED_NAME = rf"{RESTRICTED_NAME_FIRST}{RESTRICTED_NAME_CHARS}{{0,126}}"
+ QUOTED_STRING = r'"(?:\\.|[^"\\])*"'
+ VALUE = rf"(?:{RESTRICTED_NAME}|{QUOTED_STRING})"
+ PARAMETER = rf"(?:;\s*{RESTRICTED_NAME}\s*=\s*{VALUE})"
+ MEDIA_TYPE = rf"{RESTRICTED_NAME}/{RESTRICTED_NAME}(?:{PARAMETER})*"
+
+ media_type_regex = re.compile(rf"{MEDIA_TYPE}\Z")
+
+ if not ref:
+ raise ValueError("Invalid manifest reference")
+
+ path = pathlib.Path(ref)
+ drive = path.drive
+
+ path_candidate = ref
+ content_candidate = ''
+ search_start = len(drive)
+ search_end = len(ref)
+ while search_end >= 0:
+ search_end = ref.rfind(':', search_start, search_end)
+
+ if search_end >= 0:
+ path_candidate = ref[:search_end]
+ content_candidate = ref[search_end + 1:]
+
+ if os.path.exists(path_candidate):
+ if content_candidate:
+ if media_type_regex.fullmatch(content_candidate):
+ return PathAndOptionalContent(path_candidate, content_candidate)
+ else:
+ raise ValueError(f"Invalid media type '{content_candidate}'.")
+ else:
+ return PathAndOptionalContent(path_candidate, None)
+
+ raise FileNotFoundError(f"Filename '{ref}' does not exist.") |
To clarify, I don't think is a decision here, I simply believe the OCI spec I've liked is not supporting params. Did you have a chance to look at the source spec I've linked in #161 (comment) and would you reach the same conclusion? |
my reading of the source you pointed me to says the media type must follow rfc 6838. It then directs the reader to section 4.2, with language like "including section 4.2", not "limited to section 4.2". I mean the parameter bit is the very next section, 4.3, and it's a clear continuation of 4.2, so I don't know why it wouldn't be part of things, but that's just my hot take. |
Signed-off-by: tarilabs <[email protected]>
as the column does not belong to the file Signed-off-by: tarilabs <[email protected]>
Signed-off-by: tarilabs <[email protected]>
followup to: - oras-project/oras-py#161 - containers/omlmd#22 Signed-off-by: tarilabs <[email protected]>
When playing with omlmd I tried to push a file which contained a ":" and oras choked on it. I belive that you should check if the file exists and only split off the last : looking for options, not the first colon.