Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow for filepaths to include #161

Merged
merged 4 commits into from
Oct 9, 2024
Merged

Allow for filepaths to include #161

merged 4 commits into from
Oct 9, 2024

Conversation

rhatdan
Copy link
Contributor

@rhatdan rhatdan commented Oct 7, 2024

When playing with omlmd I tried to push a file which contained a ":" and oras choked on it. I belive that you should check if the file exists and only split off the last : looking for options, not the first colon.

When playing with omlmd I tried to push a file which contained a ":" and
oras choked on it. I belive that you should check if the file exists and
only split off the last : looking for options, not the first colon.

Signed-off-by: Daniel J Walsh <[email protected]>
@@ -346,7 +346,8 @@ def split_path_and_content(ref: str) -> PathAndOptionalContent:
: return: A Tuple of the path in the reference, and the content-type if one found,
otherwise None.
"""
if ":" not in ref:

if os.path.exists(ref) or ":" not in ref:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another case to consider handling here... ref's that start with colon. I don't think they should be treated as a naked content type, right? Instead, I guess the entire ref should be considered as a path?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@halfline you mean like :<content-type> ? Wouldn't that akin to an empty / missing file and a content type (is that allowed / is there a use case for it)? I'm wondering if we should specifically not allow that (e.g., if the path doesn't exist but the string starts with : maybe we need to raise an error.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so in my mind, if someone e.g. does

╎❯ omlmd push http://quay.io/foo/foo:latest '.file-that-doesnt-exist'

and

╎❯ omlmd push http://quay.io/foo/foo:latest ':file-that-doesnt-exist'

in both cases the error message should be the same and the overall behavior should be the same.

But that's just a driveby subjective opinion. I haven't looked deeply into how split_path_and_content is used or could be used.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the path doesn't exist and the colon is being used weirdly like that to not separate a filepath from a content type, I think an error should be raised.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree if the file does not exist with or with the colon removed then raise the ENOENT error.
If the file exists with the colon then use it. If the file exists with the colon and a legitimate option then just use the file and let the user deal with it. IE change the names of the file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rhatdan the discussion below is getting out of scope. If you'd like to add any additional tweaks based on the discussion here, please do. You'll also need to bump the version in oras/version.py and add a line to the changelog. When those are done I'll review once more and we can merge, and follow up discussion about reference formats, etc. should go in an issue.

Copy link
Contributor

@tarilabs tarilabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Proposed test

@pytest.mark.with_auth(False)
def test_file_contains_column(tmp_path, registry, credentials, target):
    """
    Test for file containing column symbol
    """
    client = oras.client.OrasClient(hostname=registry, insecure=True)
    artifact = os.path.join(here, "artifact.txt")
    assert os.path.exists(artifact)

    try:
        contains_column = here / "some:file"
        with open(contains_column, "w") as f:
            f.write("hello world some:file")

        res = client.push(
            files=[contains_column],
            target=target
        )
        assert res.status_code in [200, 201]

        files = client.pull(target, outdir=tmp_path / "download")
        download = str(tmp_path / "download/some:file")
        assert download in files
        assert oras.utils.get_file_hash(str(contains_column)) == oras.utils.get_file_hash(
            download
        )
    finally:
        contains_column.unlink()

    try:
        contains_column = here / ":somefile"
        with open(contains_column, "w") as f:
            f.write("hello world :somefile")

        res = client.push(
            files=[contains_column],
            target=target
        )
        assert res.status_code in [200, 201]

        files = client.pull(target, outdir=tmp_path / "download")
        download = str(tmp_path / "download/:somefile")
        assert download in files
        assert oras.utils.get_file_hash(str(contains_column)) == oras.utils.get_file_hash(
            download
        )
    finally:
        contains_column.unlink()

    with pytest.raises(FileNotFoundError):
        client.push(
            files=[":doesnotexist"],
            target=target
        )

    with pytest.raises(FileNotFoundError):
        client.push(
            files=[".doesnotexist"],
            target=target
        )

( I don't have rights to add myself to this PR )

Other considerations

With this patch, this test case on main starts to fail:

testref = "path/to/config:application/vnd.oci.image.config.v1+json:extra"
remote = oras.provider.Registry(hostname=registry, insecure=True)
ref, content_type = remote._parse_manifest_ref(testref)
assert ref == "path/to/config"
assert content_type == "application/vnd.oci.image.config.v1+json:extra"

but honestly that looks to me a wrong test case since it seems to me it's not a valid mime-type 🤔 should we drop that test case? i.e.: I'm not sure the trailing :extra would make it into a valid mime-type application/vnd.oci.image.config.v1+json:extra or I haven't found the requirement that motivates it

( I don't have rights to add myself to this PR )

HTH

@vsoch
Copy link
Contributor

vsoch commented Oct 8, 2024

@tarilabs akin to our discussion above, I think we should allow maximally one :, and then if there is another one (and the filepath does not exist) an error should be raised. So your last example above (the newly failing test) should raise that error I think.

@tarilabs
Copy link
Contributor

tarilabs commented Oct 8, 2024

@vsoch we could transform the already existing test case on main, failing with this PR, as a failure case. If I understand that correctly, I agree.

I do not concur we should allow maximally one :, as it seems to me in case of Win you want to allow C:\file:mime and in other cases we want to allow my:file:containing:columns:mime.

Edit: or you meant "allow maximally one :" ...as a separator?

    with pytest.raises(FileNotFoundError, match=r".*does:not:exists.*"):
        client.push(
            files=["does:not:exists:mime"],
            target=target
        )

    with pytest.raises(FileNotFoundError, match=r".*does:not:exists.*"):
        client.push(
            files=["does:not:exists:mime+ext"],
            target=target
        )

Let me know how could I help on this PR ! Thanks for the early feedback

@halfline
Copy link

halfline commented Oct 8, 2024

@tarilabs akin to our discussion above, I think we should allow maximally one :, and then if there is another one (and the filepath does not exist) an error should be raised. So your last example above (the newly failing test) should raise that error I think.

Again just driveby commenting with limited context so feel free to disregard but...

why would you allow one colon but not more than one colon? what if a filename has two? (eg machineid:hash-type:hash) what if the ref has one and a content type (eg hash-type:hash:content-type)?

if the plan is to delegitimize certain filenames to avoid ambiguity with filenames with embedded content types then why allow embedded colons at all?

Special case for just Dan's case seems strange to me, it should either allow colons in general or not all imo

also, is there any security ramifications? is this ever used in untrusted scenarios? just thinking about the case where a content type could be spoofed by having files named file and file:wrongtype on disk at the time of the call

@vsoch
Copy link
Contributor

vsoch commented Oct 8, 2024

why would you allow one colon but not more than one colon? what if a filename has two? (eg machineid:hash-type:hash) what if the ref has one and a content type (eg hash-type:hash:content-type)?

More than one colon is fine if the file is found to exist. It's the cases where you've parsed a mimetype and there is an extra colon not associated with a file that should raise an error. That's likely a mistake.

@halfline
Copy link

halfline commented Oct 8, 2024

More than one colon is fine if the file is found to exist. It's the cases where you've parsed a mimetype and there is an extra colon not associated with a file that should raise an error. That's likely a mistake.

But if the file doesn't exist it ultimately raises an error anyway right (regardless of colons)? I guess the question is should, missing;file:type and missing:file:type both ultimately lead to FileNotFoundError or should missing:file:type return a ValueError instead.

Another complication is mime types can support parameters, and the values of parameters can have colons (e.g, just making something up, prolly not be real, ref="file.mp4:video/mpeg4; aspect_ratio=16:9") .

So perhaps this code should be validating the content type against the spec as well, and the kind of error returned perhaps should depend on if the content type is valid.

@tarilabs
Copy link
Contributor

tarilabs commented Oct 8, 2024

Another complication is mime types can support parameters, and the values of parameters can have colons (e.g, just making something up, prolly not be real, ref="file.mp4:video/mpeg4; aspect_ratio=16:9") .

Is that applicable here? 🤔

... If defined, the value MUST comply with RFC 6838, including the naming requirements in its section 4.2, ...

(source)

it seems to me it's only the media-type itself, not the parameters typically found in content-type.
wdyt?

@vsoch
Copy link
Contributor

vsoch commented Oct 8, 2024

  1. If the path starts with a colon, fail right away.
  2. If there are one or more colons, split by the colon, just once (as the PR does here)
  3. Then assume the first part is a media type, the second part is a file.
  4. If the file does not exist, error.
  5. If the media type is empty, also error.

We can discuss those more complex cases when they actually appear. I don't think it makes sense to solve a problem that doesn't exist yet.

@halfline
Copy link

halfline commented Oct 8, 2024

Another complication is mime types can support parameters, and the values of parameters can have colons (e.g, just making something up, prolly not be real, ref="file.mp4:video/mpeg4; aspect_ratio=16:9") .

Is that applicable here? 🤔

I think the main use case for parameters is legacy text files (e.g. ISO-8859-1 or Windows-1252 instead of utf-8), maybe not applicable, or worth supporting...no idea.

Complications with parameters aside, I guess my main point, though, was, this merge request is focusing on validating the filename and doing I/O calls to make heuristic decisions, but it might be worthwhile to validate the media type too. Knowing if the media type is valid or invalid is useful in its own right, and can also potentially inform the heuristics as well.

@vsoch
Copy link
Contributor

vsoch commented Oct 8, 2024

Complications with parameters aside, I guess my main point, though, was, this merge request is focusing on validating the filename and doing I/O calls to make heuristic decisions, but it might be worthwhile to validate the media type too. Knowing if the media type is valid or invalid is useful in its own right, and can also potentially inform the heuristics as well.

Totally agree! It's out of scope for this PR, but it would be great to open an issue (if you don't want to work on it) or PR directly to tackle it.

@halfline
Copy link

halfline commented Oct 8, 2024

  1. If the path starts with a colon, fail right away.

okay, but note files can legitimately start with colons, may be none that will ever go through these functions, i'm don't know. (but I have some in my home dir right now for instance)

2. If there are one or more colons, split by the colon, just _once_ (as the PR does here)

makes sense to me

3. Then assume the first part is a media type, the second part is a file.

assume you mean that the other way around, but makes sense

4. If the file does not exist, error.

makes sense

5. If the media type is empty, also error.

I still think it probably makes sense to do more validation than just "is empty" since the docs say the format is a hard requirement. I wouldn't be surprised if there is some regex already crafted out there that validates to spec.

We can discuss those more complex cases when they actually appear. I don't think it makes sense to solve a problem that doesn't exist yet.

yea sorry not trying to derail things for cases that don't matter. maybe i'm pushing for over engineering here...

@vsoch
Copy link
Contributor

vsoch commented Oct 8, 2024

We can move the "starts with" to be later, and yes other order is what I meant. Apologies - I keep switching state between deep programming and trying to be articulate - doesn't always work super well.

And generally whatever you think is best practice I am happy to review.

@halfline
Copy link

halfline commented Oct 8, 2024

And generally whatever you think is best practice I am happy to review.

So I don't want to invest the time to build and test this right now, but my suggestion would be something like this, (though maybe slightly different if you and @tarilabs decide media type parameters aren't worthy of supporting)

 def split_path_and_content(ref: str) -> PathAndOptionalContent:
     """
     Parse a string containing a path and an optional content
 
     Examples
     --------
     <path>:<content-type>
     path/to/config:application/vnd.oci.image.config.v1+json
     /dev/null:application/vnd.oci.image.config.v1+json
     C:\\myconfig:application/vnd.oci.image.config.v1+json
 
     Or,
     <path>
     /dev/null
     C:\\myconfig
 
     :param ref: the manifest reference to parse (examples above)
     :type ref: str
     : return: A Tuple of the path in the reference, and the content-type if one found,
               otherwise None.
     """
-    if ":" not in ref:
-        return PathAndOptionalContent(ref, None)
-
-    if pathlib.Path(ref).drive:
-        # Running on Windows and Path has Windows drive letter in it, it definitely has
-        # one colon and could have two or feasibly more, e.g.
-        # C:\test.tar
-        # C:\test.tar:application/vnd.oci.image.layer.v1.tar
-        # C:\test.tar:application/vnd.oci.image.layer.v1.tar:somethingelse
-        #
-        # This regex matches two colons in the string and returns everything before
-        # the second colon as the "path" group and everything after the second colon
-        # as the "context" group.
-        # i.e.
-        # (C:\test.tar):(application/vnd.oci.image.layer.v1.tar)
-        # (C:\test.tar):(application/vnd.oci.image.layer.v1.tar:somethingelse)
-        # But C:\test.tar along will not match and we just return it as is.
-        path_and_content = re.search(r"(?P<path>.*?:.*?):(?P<content>.*)", ref)
-        if path_and_content:
-            return PathAndOptionalContent(
-                path_and_content.group("path"), path_and_content.group("content")
-            )
-        return PathAndOptionalContent(ref, None)
-    else:
-        path_content_list = ref.split(":", 1)
-        return PathAndOptionalContent(path_content_list[0], path_content_list[1])
+
+    # RFC 6838 Section 4.2: Media Type Syntax
+    # https://tools.ietf.org/html/rfc6838#section-4.2
+    #
+    # Media types are defined as:
+    #
+    # type-name = restricted-name
+    # subtype-name = restricted-name
+    # restricted-name = restricted-name-first *126restricted-name-chars
+    # restricted-name-first  = ALPHA / DIGIT
+    # restricted-name-chars  = ALPHA / DIGIT / "!" / "#" / "$" / "&" / "-" / "^" / "_"
+    #
+    # So valid characters are letters, digits, and the symbols: ! # $ & - ^ _
+    # Additionally, parameters can be added, which are not specified in detail in RFC 6838.
+    # For parameters, we'll accept any token or quoted string after a semi
+
+    RESTRICTED_NAME_FIRST = r"(?:[A-Za-z0-9])"
+    RESTRICTED_NAME_CHARS = r"[A-Za-z0-9!#$&\-\^_]"
+    RESTRICTED_NAME = rf"{RESTRICTED_NAME_FIRST}{RESTRICTED_NAME_CHARS}{{0,126}}"
+    QUOTED_STRING = r'"(?:\\.|[^"\\])*"'
+    VALUE = rf"(?:{RESTRICTED_NAME}|{QUOTED_STRING})"
+    PARAMETER = rf"(?:;\s*{RESTRICTED_NAME}\s*=\s*{VALUE})"
+    MEDIA_TYPE = rf"{RESTRICTED_NAME}/{RESTRICTED_NAME}(?:{PARAMETER})*"
+
+    media_type_regex = re.compile(rf"{MEDIA_TYPE}\Z")
+
+    if not ref:
+        raise ValueError("Invalid manifest reference")
+
+    path = pathlib.Path(ref)
+    drive = path.drive
+
+    path_candidate = ref
+    content_candidate = ''
+    search_start = len(drive)
+    search_end = len(ref)
+    while search_end >= 0:
+        search_end = ref.rfind(':', search_start, search_end)
+
+        if search_end >= 0:
+            path_candidate = ref[:search_end]
+            content_candidate = ref[search_end + 1:]
+
+        if os.path.exists(path_candidate):
+            if content_candidate:
+                if media_type_regex.fullmatch(content_candidate):
+                    return PathAndOptionalContent(path_candidate, content_candidate)
+                else:
+                    raise ValueError(f"Invalid media type '{content_candidate}'.")
+            else:
+                return PathAndOptionalContent(path_candidate, None)
+
+    raise FileNotFoundError(f"Filename '{ref}' does not exist.")

@tarilabs
Copy link
Contributor

tarilabs commented Oct 8, 2024

decide media type parameters aren't worthy of supporting

To clarify, I don't think is a decision here, I simply believe the OCI spec I've liked is not supporting params. Did you have a chance to look at the source spec I've linked in #161 (comment) and would you reach the same conclusion?

@halfline
Copy link

halfline commented Oct 9, 2024

decide media type parameters aren't worthy of supporting

To clarify, I don't think is a decision here, I simply believe the OCI spec I've liked is not supporting params. Did you have a chance to look at the source spec I've linked in #161 (comment) and would you reach the same conclusion?

my reading of the source you pointed me to says the media type must follow rfc 6838. It then directs the reader to section 4.2, with language like "including section 4.2", not "limited to section 4.2". I mean the parameter bit is the very next section, 4.3, and it's a clear continuation of 4.2, so I don't know why it wouldn't be part of things, but that's just my hot take.

as the column does not belong to the file

Signed-off-by: tarilabs <[email protected]>
@vsoch vsoch merged commit 66d57d3 into oras-project:main Oct 9, 2024
5 checks passed
@tarilabs
Copy link
Contributor

thank you @vsoch @rhatdan

tarilabs added a commit to tarilabs/ramalama that referenced this pull request Oct 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants