
Add a metadata API #383

Closed
Tracked by #526
brettcannon opened this issue Jan 4, 2021 · 119 comments · Fixed by #518

Comments

@brettcannon
Member

This is related to #147, but not quite since I want to go beyond validation and have read/write support for things like TOML, METADATA, etc.

I see two potential API forms. One is a dataclass-like object that stores the data with some methods to help read/write to various formats. The other form is a dict with various functions to manage the reading/writing.

I actually think the dict approach might be the best and easiest to work with for two reasons. One, I don't see anything inherent in the potential design that will require OOP (i.e. method overriding to influence results). I have not thought of an API that would call other methods where overriding would be necessary.

Two, it makes representing the explicit lack of data versus what goes into dynamic à la PEP 621 more obvious. If the lack of a key means "information is explicitly not provided", then that can be expressed in a dict. If we make the existence of a key with a None value represent something that would go into dynamic, then we don't have to maintain dynamic details separately. Using a dataclass doesn't buy us anything over this, as it makes representing no data a little trickier: we would have to add some way to delineate that (e.g. some other falsey value that isn't None?). The best benefit to a dataclass I can think of is a property to dynamically calculate dynamic (😁), which could also be done with a function.

There's also no loss in typing as a partial TypedDict would work in this instance.

So for me, the simplicity of using a dict for storing data in the appropriate formats and using the lack of key to delineate purposefully missing data is the way to go.
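To make that concrete, here's a minimal sketch of the three-way state a plain dict gives you (the names and the helper are hypothetical, not a proposed API):

# A key with a value: the data was provided.
# A key set to None: the field is dynamic (filled in later).
# A missing key: the information is explicitly not provided.
metadata = {
    "name": "spam",
    "keywords": None,
}


def dynamic_fields(metadata: dict) -> frozenset:
    """Derive `dynamic` from the keys whose value is None."""
    return frozenset(key for key, value in metadata.items() if value is None)


assert dynamic_fields(metadata) == frozenset({"keywords"})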

@uranusjr
Member

uranusjr commented Jan 5, 2021

METADATA is already covered by importlib-metadata so a discussion on who handles what would be needed.

@di
Member

di commented Jan 5, 2021

I'll cross-link the #332 draft PR here. It contains no actual validation logic, and is only focused on defining an API that tries to align with what was being discussed in https://discuss.python.org/t/package-metadata-description-field/4508/ -- basically your first approach.

It seems like regardless of the approach, using TypedDict would make defining the differences between metadata versions a little cleaner, so I'd be on board with that.

@brettcannon
Member Author

@uranusjr It basically returns a dictionary. And I don't think there needs to be a direct discussion on that, as the target audiences are different: we are looking at tool support for packaging front-ends and back-ends, while importlib.metadata is for people wanting to know what is installed and details about those installed projects (while replacing pkg_resources).

@di that PR is what instigated all of this. 😄 With PEP 621 thrown into the mix since that PR was opened I tried to think of an API where the three-way data/no-data/dynamic state could be cleanly represented, and dicts came the closest for me.

As for TypedDict handling version differences more cleanly, are you thinking of having one for each metadata version as the way to make it cleaner?

@di
Member

di commented Jan 5, 2021

As for TypedDict handling version differences more cleanly, are you thinking of having one for each metadata version as the way to make it cleaner?

Yes, with inheritance. Does that still work w/ what you're thinking about?

@uranusjr
Member

uranusjr commented Jan 5, 2021

I was worrying about packaging also implementing an API to access the actual files in .dist-info, but #332’s approach (it only deals with in-memory structures) looks fine to me. I have no objections if we stick to that and leave file access to other tools like importlib.metadata.

@brettcannon
Member Author

It should (and the PEP says inheritance is supported).

The one question is what to do with Metadata-Version and how it would affect return types. I can see three approaches.

One is to just leave it as a str and not worry about it. That would basically mean we always return the Metadata type from functions.

Two, update the potential literal values at each bump and only support returning the latest one (but that assumes types don't ever change for existing keys). This would mean we return whatever the newest Metadata type is, but that is technically a type change.

Three, we do whatever inheritance tricks we need to for superclasses and have an explicit literal for each metadata version and have all functions return the union of all possible metadata types. That's the most explicit and the most accurate going forward.

And to be clear about that third option, I'm thinking:

class RequiredMetadata(typing.TypedDict):
    name: str
    version: Optional[version.Version]
    ...

class CommonMetadata(RequiredMetadata, total=False):
    ...

class Metadata2_2(CommonMetadata):
    metadata_version: typing.Literal["2.2"]

class Metadata2_1(CommonMetadata):
    metadata_version: typing.Literal["2.1"]

...

There's also the question of naming (for either dicts or objects). Would we use the capitalization of the core metadata spec, or lower-case as it's written for PEP 621 in TOML? And what about dashes? If we use the alternative syntax for TypedDict we could keep the names as-is from either, but we would lose typing strictness as we couldn't cleanly delineate required versus optional keys.
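For reference, a minimal sketch of that alternative (functional) TypedDict syntax: it keeps the spec's names verbatim, but a single total= applies to every key (key set trimmed for illustration):

import typing

# Functional syntax allows keys that aren't valid Python identifiers,
# at the cost of one totality setting shared by all keys.
CoreMetadata = typing.TypedDict(
    "CoreMetadata",
    {
        "Metadata-Version": str,
        "Name": str,
        "Home-page": str,
    },
    total=False,
)

md: CoreMetadata = {"Name": "spam"}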

@brettcannon
Member Author

@uranusjr one tricky consequence of having no file access is PEP 621's license and readme values, as those can specify file paths, which don't translate directly to the core metadata spec since it embeds the text in that case.
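A sketch of the wrinkle, using a hypothetical resolve_readme() helper: producing core metadata from PEP 621 data means reading any referenced file relative to pyproject.toml, so some path context is unavoidable even without a general file-access API:

from __future__ import annotations

import pathlib


def resolve_readme(project: dict, path: pathlib.Path | None) -> str:
    # PEP 621 allows `readme = "README.md"` or a table with a `file` key,
    # while core metadata embeds the description text itself, so file
    # references have to be read relative to pyproject.toml.
    readme = project["readme"]
    if isinstance(readme, str):
        readme = {"file": readme}
    if "file" in readme:
        if path is None:
            raise ValueError("need the pyproject.toml path to resolve readme files")
        return (path.parent / readme["file"]).read_text(encoding="utf-8")
    return readme["text"]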

@MrMino
Contributor

MrMino commented Jan 12, 2021

Hi there.

I see two potential API forms. One is a dataclass-like object that stores the data with some methods to help read/write to various formats. The other form is a dict with various functions to manage the reading/writing.

I prototyped & used both forms, and, as this is of particular interest to me, let me share my perspective on the pros and cons of each.

For the dict form, I used Paul Moore's gist. This approach makes for the most concise implementation, and puts the spec front and center. But it is an unimaginable pain to have to work with the metadata keys in the regime where an arbitrary string is accepted. Without looking at the spec / code, could you tell which of these three triplets contain only correct keys?

Description-Content-Type
Classifier
Home-Page

Descriptlon-Content-Type
Classifiers
Home-Page

Description-Content-Type
Classiffier
Home-page

That's right, none of them 🙃. The issue is alleviated somewhat by naively converting to the json form on the fly (replace + lower), but still - there's only one set of keys above that would pass that. Linters won't highlight anything in these strings.
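(For reference, that naive conversion matches the JSON field-name transformation from PEP 566; a two-line sketch:)

def json_form(key: str) -> str:
    # Lowercase the key and replace dashes with underscores.
    return key.lower().replace("-", "_")


assert json_form("Description-Content-Type") == "description_content_type"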

I guess this is mostly fixed by TypedDict, but I haven't used that, so I cannot speak to the devX of it. Even if the incorporation of annotations can fix the issue outlined above, it still leaves the usability defect of there being no code completion for these keys in most tools.

The dataclass style, on the other hand, provides many opportunities for improving usability, depending on how far you'd like to get carried away with it. So far my experience with it is that the further you go, the less the spec is visible in the API - maybe that's a good thing?

My implementation goes quite far in that direction, mostly because I need to teach this stuff to an audience that isn't very familiar with the subject, and I don't want to waste time on debugging silly mistakes. Disclaimer: I had to make this far longer than it should be, because the use case for v0.0.1 forbids me from using dataclass backports - I'm targeting 3.6.

I tried the following (a combined sketch follows the list):

  • All attributes can be given via __init__().
  • Attributes corresponding to Name and Version are mandatory.
  • __init__() uses keyword-only arguments for every passed value. This prevents misassigning values by erroneous usage of positionals. With this many parameters positionals are a no-go.
  • __slots__ are used to prevent misspellings of object attributes - I added this after the third time I used md.classiffiers = ....
  • Naming scheme for fields is simplified by using lowercase with underscores.
  • Multiple-use fields have s appended to their names. This makes for a prose-like readability when an attribute is used in a loop. The only place this behaves weirdly is with Keywords field, but this is alleviated by the fact that they can be given either by a list or a single string (the actual attribute gets a list afterwards). Also, the new Dynamic field has its meaning twisted - Dynamics. But I guess if you are going this far from the spec you might as well use dynamic_fields for it - it even makes more intuitive sense when encountered in the code.
  • All multiple-use fields default to an empty list - this makes it simple to loop over them without having to worry about Nones.
  • Metadata-Version cannot be changed - it's hardcoded to 2.1, and my current plan is to do this in every other class, should I actually manage to implement all of them >=1.1.
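Pulling those choices together, a minimal sketch (hypothetical names, trimmed to a handful of fields):

class Metadata:
    METADATA_VERSION = "2.1"  # hardcoded, not settable by callers
    __slots__ = ("name", "version", "summary", "classifiers")

    def __init__(self, name, version, *, summary="", classifiers=()):
        # Name and Version are mandatory; everything else is keyword-only.
        self.name = name
        self.version = version
        self.summary = summary
        self.classifiers = list(classifiers)  # multiple-use: defaults to empty list


md = Metadata("spam", "1.0", summary="Lovely spam")
try:
    md.classiffiers = []  # the misspelling from above
except AttributeError:
    pass  # __slots__ makes the typo fail loudly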

Code completion makes using this a breeze, and improves discoverability.
[screenshot: editor code completion suggesting metadata attributes]

While this wins on the usability front - the poka-yoke of it - the big con of all of that is simply: the implementation feels like taking out a bazooka to shoot down a fly.

Converting between the names of attributes and the keys from the spec is counter-intuitive, and the fact that the spec keys don't have consistent (i.e. algorithmic) casing doesn't help. Inheritance between classes that implement different versions of this spec is probably neither possible nor advisable - DRY would probably have to be violated in places.

I would be eternally grateful if any of you would provide any thoughts on this. This should by no means be read as me trying to advertise my implementation for inclusion in packaging, rather I'm trying to find out the best option out there.

@uranusjr
Member

Without looking at the spec / code, could you tell which of these three triplets contain only correct keys?

The second is actually valid, if I read it correctly. Since METADATA is stored in an Email-ish format (and actually parsed with email.parser internally), the metadata keys are case-insensitive. But of course, this introduces other problems to the dict-based approach, since we can’t just use a plain dict; we'd need a case-insensitive (but maybe case-preserving?) custom mapping like email.message.EmailMessage.
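A quick stdlib illustration of that behaviour: lookups are case-insensitive, while the stored casing is preserved on output:

import email

msg = email.message_from_string("Home-page: https://example.com\n\nREADME text")
assert msg["HOME-PAGE"] == "https://example.com"
assert msg["home-page"] == "https://example.com"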

@MrMino
Contributor

MrMino commented Jan 12, 2021

Without looking at the spec / code, could you tell which of these three triplets contain only correct keys?

The second is actually valid, if I read it correctly.

Maybe you meant the first one - then yes, that's correct, if you assume case insensitivity.

Since METADATA is stored in an Email-ish format (and actually parsed with email.parser internally), the metadata keys are case-insensitive.

Is it correct to assume that all tools that are made up-to-spec use metadata keys in a case-insensitive manner? That's a big assumption that isn't explicitly in the spec.

@uranusjr
Member

Is it correct to assume that all tools that are made up-to-spec use metadata keys in a case-insensitive manner?

All tools? Of course not 😉 But it is my understanding that common tools (basically pkg_resources, importlib-metadata, and maybe distlib) treat them as case-insensitive.

@MrMino
Contributor

MrMino commented Jan 12, 2021

Is it correct to assume that all tools that are made up-to-spec use metadata keys in a case-insensitive manner?

All tools? Of course not. But it is my understanding that common tools (basically pkg_resources, importlib-metadata, and maybe distlib) treat them as case-insensitive.

Being able to take mis-cased keys is a must if you wish to go down the dict route. But based on what you've said, I don't think outputting them that way is a good idea. Especially in the case of wheels, where these mistakes become persistent. I feel like you could easily end up in a situation where tools based on different implementations of metadata parsers cannot read each other's wheels.

@uranusjr
Member

It’s also notable that no common tools (AFAIK) can read Core Metadata into a data structure and then output that data structure into Core Metadata. Existing tools I know all produce Core Metadata from a foreign format, so the point you raised here (whether to preserve casing) is untrodden territory.

I do agree with you, though; it would be preferable for the tool to canonicalise casing.

@brettcannon
Member Author

@MrMino Thanks for the thorough look at this and the feedback!

I guess this is mostly fixed by TypedDict, but I haven't used that, so I cannot speak to the devX of it.

Basically, if you run a type checker (e.g. mypy), it will tell you when you get something wrong. E.g., in VS Code with the Pylance extension (disclaimer: I work at MS and work closely with the Pylance team):

[screenshot: Pylance flagging an invalid TypedDict key in VS Code]

For your "could you tell which of these three triplets contain only correct keys?" example, a type checker would tell you when you were at least using an incorrect key.

it still leaves the usability defect of there being no code completion for these keys in most tools.

Depends on the tool, but hopefully they can eventually provide it (and I'm going to open an issue on Pylance to support this exact feature right now 😁 ).

@brettcannon
Member Author

It’s also notable that no common tools (AFAIK) can read Core Metadata into a data structure and then output that data structure into Core Metadata.

That is a key goal of this for me. I want a way to read and write core metadata, so this is part of defining a common data format to go to/from while being easy to work with from a tool author perspective.

@MrMino
Contributor

MrMino commented Jan 12, 2021

Basically, if you run a type checker (e.g. mypy), it will tell you when you get something wrong

Yea, that's what I was expecting, but since I haven't used it I wouldn't know if the tools support it (and how).

Depends on the tool, but hopefully they can eventually provide it

Me too. I'm worried, though, that projects like jedi might not have taken this into account architecturally. Providing autocompletion in string literals would be the last thing I would think of if I were to create an autocompletion engine. This seems like something ~atypical to implement.

Wonder if I can PR this there, it would be super handy to have that.

@MrMino
Contributor

MrMino commented Jan 12, 2021

Wow, actually, scratch that. Jedi already supports TypedDicts on a branch (see jedi/#1478).

Things seem to be moving much faster than I expected.

@brettcannon
Member Author

FYI Pylance landed key completions for indexing on a TypedDict.

What I think I might do is write out the data representation using both an object and TypedDict and see how they look (and post them here). Then based on that we can make a decision as to which looks like it would be the best going forward.

@MrMino
Contributor

MrMino commented Jan 20, 2021

@brettcannon feel free to copy paste some of mine for the attributes. Writing this by hand gets tedious fast 😉.

@brettcannon
Member Author

brettcannon commented Jan 25, 2021

OK, as a way to do some independent validation of ideas, I purposefully didn't read what @MrMino wrote while writing out the following examples. I covered:

  1. A single TypedDict
  2. "Stacked" TypedDict approach so each metadata version could be independently typed (for @di )
  3. Traditional class approach
from __future__ import annotations

import pathlib
import typing
from typing import Any, FrozenSet, Iterable, Literal, Mapping, Type

from . import requirements
from . import specifiers
from . import version as pkg_version


# https://packaging.python.org/specifications/core-metadata/


class RequiredMetadata(typing.TypedDict):
    metadata_version: Literal["1.0", "1.1", "1.2", "2.1", "2.2"]
    name: str
    version: pkg_version.Version


class MetadataDict(RequiredMetadata, total=False):
    # `dynamic` is left out as it's calculated based on attributes being set to `None`.
    # dynamic: 2.2
    platforms: Iterable[str] | None
    supported_platforms: Iterable[str] | None  # 1.1
    summary: str | None
    description: tuple[str, str] | None  # content-type added in 2.1.
    keywords: Iterable[str] | None
    home_page: str | None
    download_url: str | None  # 1.1
    authors: Iterable[
        tuple[str, str]
    ] | None  # Is there a better way to represent name and email?
    maintainers: Iterable[tuple[str, str]] | None  # 1.2
    license: str | None
    classifiers: Iterable[str] | None  # Should be typed based on trove-classifiers package. 1.1
    requires_dists: Iterable[requirements.Requirement] | None  # 1.2
    requires_python: specifiers.SpecifierSet | None  # 1.2
    requires_externals: Iterable[str] | None  # 1.2
    project_urls: Mapping[str, str] | None  # 1.2
    provides_extras: Iterable[str] | None  # 2.1
    provides_dists: Iterable[requirements.Requirement] | None  # Or str? 1.2
    obsoletes_dists: Iterable[requirements.Requirement] | None  # Or str? 1.2


class BaseMetadata(typing.TypedDict):
    name: str
    version: pkg_version.Version


class Metadata_1_0_Base(BaseMetadata, total=False):
    platforms: Iterable[str] | None
    summary: str | None
    description: str | None
    keywords: Iterable[str] | None
    home_page: str | None
    authors: Iterable[
        tuple[str, str]
    ] | None  # Is there a better way to represent name and email?
    license: str | None


class Metadata_1_0(Metadata_1_0_Base):
    metadata_version: Literal["1.0"]


class Metadata_1_1_Base(Metadata_1_0_Base, total=False):
    supported_platforms: Iterable[str] | None
    download_url: str | None
    classifiers: Iterable[str] | None


class Metadata_1_1(Metadata_1_1_Base):
    metadata_version: Literal["1.1"]


class Metadata_1_2_Base(Metadata_1_1_Base, total=False):
    maintainers: Iterable[tuple[str, str]] | None
    requires_dists: Iterable[requirements.Requirement] | None
    requires_python: specifiers.SpecifierSet | None
    requires_externals: Iterable[str] | None
    project_urls: Mapping[str, str] | None
    provides_dists: Iterable[requirements.Requirement] | None
    obsoletes_dists: Iterable[requirements.Requirement] | None


class Metadata_1_2(Metadata_1_2_Base):
    metadata_version: Literal["1.2"]


class Metadata_2_1_Base(Metadata_1_2_Base, total=False):
    # Using a two-item tuple would require having a 'description' mixin for
    # previous versions as you can't redefine keys in a TypedDict.
    description_content_type: str | None
    provides_extras: Iterable[str] | None


class Metadata_2_1(Metadata_2_1_Base):
    metadata_version: Literal["2.1"]


def dynamic(metadata: MetadataDict) -> FrozenSet[str]:
    """Calculate the value of `dynamic` based on which keys have a `None` value."""
    # TODO: Technically should return an Iterable of literals that match core
    #       metadata field names.
    raise NotImplementedError


def from_project(
    project: Mapping[str, Any], *, path: pathlib.Path | None = None
) -> MetadataDict:
    """Extract metadata from the `[project]` table of a `pyproject.toml` file.

    The `path` argument is the path to the `pyproject.toml` file. This is
    required to resolve and read files specified in the data relative to the
    `pyproject.toml` file itself.
    """
    raise NotImplementedError


def from_pkg_info(pkg_info: str, /) -> MetadataDict:
    """Parse PKG-INFO data for metadata."""
    raise NotImplementedError


def from_metadata(metadata_source: str, /) -> MetadataDict:
    """Parse METADATA data for metadata."""
    raise NotImplementedError


T = typing.TypeVar("T", bound="MetadataClass")

_UNSPECIFIED = object()


class MetadataClass:

    # Any field that is set to `None` is considered dynamic, while an otherwise false value means purposefully unspecified.

    name: str
    version: pkg_version.Version
    platforms: Iterable[str] | None
    supported_platforms: Iterable[str] | None  # 1.1
    summary: str | None
    description: tuple[str, str] | None  # content-type added in 2.1.
    keywords: Iterable[str] | None
    home_page: str | None
    download_url: str | None  # 1.1
    authors: Iterable[
        tuple[str, str]
    ] | None  # Is there a better way to represent name and email?
    maintainers: Iterable[tuple[str, str]] | None  # 1.2
    license: str | None
    classifiers: Iterable[str] | None  # Should be typed based on trove-classifiers package. 1.1
    requires_dists: Iterable[requirements.Requirement] | None  # 1.2
    requires_python: specifiers.SpecifierSet | None  # 1.2
    requires_externals: Iterable[str] | None  # 1.2
    project_urls: Mapping[str, str] | None  # 1.2
    provides_extras: Iterable[str] | None  # 2.1
    provides_dists: Iterable[requirements.Requirement] | None  # Or str? 1.2
    obsoletes_dists: Iterable[requirements.Requirement] | None  # Or str? 1.2

    def __init__(
        self,
        name,
        version,
        *,
        platforms=(),
        supported_platforms=(),
        summary="",
        description=("", ""),
        keywords=(),
        home_page="",
        download_url="",
        authors=(),
        maintainers=(),
        license="",
        classifiers=(),
        requires_dists=(),
        requires_python=specifiers.SpecifierSet(),
        requires_externals=(),
        # A dict default value may get accidentally mutated.
        project_urls=_UNSPECIFIED,
        provides_extras=(),
        provides_dists=(),
        obsoletes_dists=(),
    ):
        self.name = name
        self.version = version
        self.platforms = platforms
        self.supported_platforms = supported_platforms
        self.summary = summary
        self.description = description
        self.keywords = keywords
        self.home_page = home_page
        self.download_url = download_url
        self.authors = authors
        self.maintainers = maintainers
        self.license = license
        self.classifiers = classifiers
        self.requires_dists = requires_dists
        self.requires_python = requires_python
        self.requires_externals = requires_externals
        self.project_urls = project_urls if project_urls is not _UNSPECIFIED else {}
        self.provides_extras = provides_extras
        self.provides_dists = provides_dists
        self.obsoletes_dists = obsoletes_dists

    def dynamic(self) -> FrozenSet[str]:
        """Calculate the `dynamic` field based on which fields are set to `None`."""
        raise NotImplementedError

    @classmethod
    def from_project(
        cls: Type[T], project: Mapping[str, Any], *, path: pathlib.Path | None = None
    ) -> T:
        """Extract metadata from the `[project]` table of a `pyproject.toml` file.

        The `path` argument is the path to the `pyproject.toml` file. This is
        required to resolve and read files specified in the data relative to the
        `pyproject.toml` file itself.
        """
        raise NotImplementedError

    @classmethod
    def from_pkg_info(cls: Type[T], pkg_info: str, /) -> T:
        """Parse PKG-INFO data for metadata."""
        raise NotImplementedError

    @classmethod
    def from_metadata(cls: Type[T], metadata_source: str, /) -> T:
        """Parse METADATA data for metadata."""
        raise NotImplementedError

    def to_pkg_info(self) -> str:
        """Return the representation of the metadata for PKG-INFO."""
        raise NotImplementedError

    def to_metadata(self) -> str:
        """Retrun the representation of the metadata for METADATA."""
        raise NotImplementedError

I think I prefer the class approach, but I'm flexible. I'm also curious what people think about the proposed types and naming for all the fields (I did take the plurality suggestion from @MrMino ). If we can get some agreement on this we can implement this gradually (i.e. just a class to hold the data, from pyproject.toml, from METADATA, to METADATA, from PKG-INFO, to PKG-INFO).
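For what it's worth, the dynamic() calculation sketched above could be as simple as the following (mapping attribute names back to core metadata field names, e.g. requires_dists to Requires-Dist, is assumed and elided):

def dynamic(metadata) -> frozenset:
    # Any attribute set to None is considered dynamic; an otherwise
    # false value means purposefully unspecified.
    return frozenset(name for name, value in vars(metadata).items() if value is None)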

@pradyunsg
Member

I think I prefer the class approach, but I'm flexible.

Same.

@MrMino
Contributor

MrMino commented Jan 25, 2021

2 questions about the class approach:

  • What's the deal with the project_urls=_UNSPECIFIED? I see the comment right above it but I'm not sure if I understand what it tries to convey - I get that you wouldn't want to use a mutable object in the signature, but why not just use None? Could you please elaborate on that?
  • Isn't an empty string field different from having the field missing from the file? In that case, shouldn't __init__ values default to None / ()?

@brettcannon
Member Author

@MrMino

I can answer both questions with the same answer: None means the field is dynamic, not that it's unspecified. As of core metadata 2.2 we need to distinguish between "I purposefully didn't set this metadata" from "I didn't specify anything because it's coming later".

And we can't use () everywhere as that doesn't type appropriately for string fields.

@MrMino
Contributor

MrMino commented Jan 27, 2021

@MrMino

I can answer both questions with the same answer: None means the field is dynamic, not that it's unspecified.

Understood, thanks

@brettcannon
Member Author

One thing that may inform attribute names is https://www.python.org/dev/peps/pep-0566/#json-compatible-metadata.

@merwok

merwok commented Mar 9, 2021

FTR we had a Metadata class in distutils2; it was more than a typed data class/dict, and did version detection, mutations and validation: https://hg.python.org/distutils2/file/tip/distutils2/metadata.py#l197

distlib updated it for versions 2.1 and 2.0 (withdrawn): https://bitbucket.org/pypa/distlib/src/master/distlib/metadata.py

@FFY00
Member

FFY00 commented May 7, 2021

FYI, I have written a parser for PEP 621 metadata for my backend. It could be used as a starting point if it makes sense.

https://github.com/FFY00/trampolim/blob/main/trampolim/_metadata.py
https://github.com/FFY00/trampolim/blob/main/tests/test_metadata.py

@brettcannon
Member Author

I still hope to get to this some day, but not right now. In spite of that, my brain is still thinking about the API. 😄

As such, I'm trying to think through the single dict approach or the dataclass-like approach as the preferred representation. I think I'm leaning towards the dict, so that in the future any metadata changes that add keys will naturally flow into an API where keys may be missing, compared to a class approach where the lack of an attribute could throw people off if old and new code/metadata versions are mixed.

@FFY00
Member

FFY00 commented May 20, 2021

When you come up with a final API, I may be able to take some time to write the implementation. Let me know.

@abravalheri
Contributor

Are there other issues to track the remaining aspects of the API?

I can see an existing issue for the validation of metadata, but none related to the "ingestion" (via pyproject.toml/email message) and "production" (email message/anything else?). Should we create new ones?

@pfmoore
Member

pfmoore commented Jun 26, 2022

I have https://pypi.org/project/pkg-metadata/ but I don't know if I want to offer it to packaging or keep it standalone. It's still at a relatively early stage right now.

@brettcannon brettcannon reopened this Jun 27, 2022
@brettcannon
Member Author

@abravalheri GitHub got too aggressive with the auto-closing of this issue. 😅

@abravalheri
Contributor

abravalheri commented Jun 27, 2022

Thank you very much @brettcannon 🎉

@pfmoore, I think it makes sense for packaging to expose these APIs directly, since it is "in-line" with the project description (i.e., interoperability specifications).

@pfmoore
Member

pfmoore commented Jun 27, 2022

@abravalheri I'm not against packaging having such APIs, I'm just not sure if the goals of pkg_metadata are the same. My interest is in APIs that will read metadata leniently, not failing on (for example) an invalid Metadata-Version value. My impression is that @brettcannon wants packaging to be strict.

If we can agree on an approach, I'm happy for pkg_metadata to be the packaging implementation. If not, then it's fine if we have two separate implementations for two different purposes.

@abravalheri
Contributor

Thank you very much for the clarification @pfmoore. I understand now, this makes a lot of sense (I also suggested a lenient conversion from/to email-header in #498).

Maybe there is a difference of interests in terms of who the audience is for the APIs packaging offers.

Probably for PyPI a more strict approach would be suitable, while for a backend like setuptools ingesting not-particularly-well-formed email-headers would be the way to go as long as the output is well-formed.

@pfmoore
Member

pfmoore commented Jun 27, 2022

My personal use case is data analysis of projects already on PyPI. To put it politely, there are a lot of "deviations from the standards" on there 🙂 (some understandable, they are projects that pre-date the relevant standards, others less so, they simply make no sense to me at all!)

I want a library that can parse and read all of that data, so that I can analyze the weirdnesses with useful data structures. The packaging.Metadata class works for this use case, although you have to violate the type declarations (not all versions on PyPI are parseable as a packaging Version, for example). I'm not sure how viable deliberately ignoring type hints is in practice, though.

I don't know how best to handle the discrepancy in use cases here, which is why I say that for now I don't know what will happen.

@MrMino
Contributor

MrMino commented Jun 27, 2022

I don't know how best to handle the discrepancy in use cases here, which is why I say that for now I don't know what will happen.

My wheel API project has exactly this problem. To maximize the utility it has to cater to both usage scenarios, and this means having separate "strict" and "lenient" modes (or at least it will, whenever I get to implementing that ;)).

It would be nice if packaging didn't force strictness in cases where we know that deviations from the standard exist in the wild.

@brettcannon
Member Author

Maybe there is a difference of interests in terms of who the audience is for the APIs packaging offers.

As of right now I would say what @pfmoore has for lenient parsing serves that group well (which are probably installers). But for builders, you want the strict solution to prevent users from propagating mistakes, and that doesn't seem to exist in a reusable package. And there are also those who want to tell users that their metadata is malformed (e.g. PyPI rejecting packages that messed something up). So the current plan is to support strict to start.

But that isn't to say the strict code couldn't be tweaked to emit a warning instead of raising an exception if some strict=False argument were passed to the appropriate from_*() class method. I'm not sure if that would be enough, but if that was all it took then I would be willing to contort the initialization/processing code and put in the effort to (eventually) change every raise InvalidMetadata(msg) to:

if strict:
    raise InvalidMetadata(msg)
else:
    warnings.warn(msg, MetadataWarning)

... or whatever the equivalent would be. But what I'm not personally up for is maintaining two copies of metadata parsing to support both use-cases.
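A sketch of that single choke point (the exception and warning names come from the comment above; everything else is assumed):

import warnings


class InvalidMetadata(ValueError):
    pass


class MetadataWarning(UserWarning):
    pass


def _complain(msg: str, *, strict: bool) -> None:
    # Every validation failure funnels through here, so strict=False
    # only has to be threaded to a single place.
    if strict:
        raise InvalidMetadata(msg)
    warnings.warn(msg, MetadataWarning, stacklevel=2)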

@pfmoore
Member

pfmoore commented Jun 28, 2022

IMO, it's important for a project to have a clear understanding of its own goals, and resist the temptation to try to be "all things to all men" as the saying goes. If packaging intends to focus on the strict case [1], then that's perfectly OK, and I see no problem with that. pkg_metadata is aimed at the lenient case and the two can happily coexist. What may be possible is for the strict case to reuse the lenient one for initial parsing and then add a validation layer on top - I don't know what the policy here is on having external dependencies, but that might offer a way to avoid duplicating effort. Alternatively, pkg_metadata is designed to allow the "dictionary form" to be a common internal format, so having a Metadata constructor from dictionary form might cover both use cases sufficiently [2].

Footnotes

  1. Which it has for other classes like Version.

  2. The "dict from pyproject" code in pkg_metadata is something of an afterthought, and I'd be happy to remove it in favour of something in packaging. It doesn't really fit with the "lenient" approach, and it's only there because I had a need for it and nowhere better to put it.

@MrMino
Contributor

MrMino commented Jun 28, 2022

From my experience, the number of packages that fail to meet the standard, or use PEPs that aren't accepted yet (or never will be), is far larger than one would expect.

My personal goal for wheelfile was that it would be able to open every wheel that is both a) downloadable from PyPI, b) installable by some 3.6-compatible version of pip. The "3.6" requirement would get continuously updated with time, gradually lowering the number of necessary workarounds and the amount of leniency.

I imagined it to be a viable strategy for packaging too, but now I see that it's already too late for that. Metadata does not have some of the fields that non-conformant packages do. E.g. jupyterlab-3.1.7.dist-info/METADATA contains a License-File field (PEP 639), which packaging.metadata.Metadata does not.

@brettcannon
Member Author

Metadata does not have some of the fields that non-conformant packages do. E.g. jupyterlab-3.1.7.dist-info/METADATA contains a License-File field (PEP 639), which packaging.metadata.Metadata does not.

That is only a concern if (a) we choose to reject core metadata fields that shouldn't be there, and (b) you want access to those fields somehow.

BTW, it's way easier to go from strict to lenient than the other way around. I would much rather be as strict as possible, do an analysis of where things fail from packages on PyPI due to misaligned expectations/assumptions, and then loosen accordingly to find a good middle ground.

@abravalheri
Contributor

abravalheri commented Jul 6, 2022

But for builders, you want the strict solution to prevent users from propagating mistakes.

If builders include build backends, I think they would also prefer lenient parsing + strict dumping.

The use case is building from old sdists:

  • If a backend builds from an old sdist, it would be good to parse PKG-INFO in the lenient mode, but produce the new METADATA file with strict compliance to the latest standards, so it would stop mistakes being propagated without making old distributions unusable...

@brettcannon
Member Author

If builders include build back-ends

I'm thinking Flit, Hatch, Setuptools, etc. that have a pyproject.toml, and not build, pip, etc. that have to work with anything/everything.

@abravalheri
Contributor

abravalheri commented Jul 6, 2022

I'm thinking Flit, Hatch, Setuptools, etc. that have a pyproject.toml, and not build, pip, etc. that have to work with anything/everything.

Since version caps for build dependencies are not the most prevalent practice in the Python packaging ecosystem, I think any build backend like Flit, Hatch, or Setuptools would have to deal with old sdists. For example, let's say Setuptools adopts packaging's Metadata API in the near future; it would still have to be able to parse PKG-INFO files created years ago, which might not be compliant.

(Just to clarify, I think it is completely OK for packaging to implement strict parsing... I am just discussing what different audiences might be looking for. And it is completely OK to not focus on some audiences).

@brettcannon
Member Author

For example, let's say Setuptools adopts packaging's Metadata API in the near future; it would still have to be able to parse PKG-INFO files created years ago, which might not be compliant.

Yep, setuptools specifically is in a tough spot with this.

Just to clarify, I think it is completely OK for packaging to implement strict parsing... I am just discussing what different audiences might be looking for. And it is completely OK to not focus on some audiences

And I do appreciate it!

Here are my rough plans for packaging.metadata (time permitting 😅):

  1. Implement a from_email_headers() class method that does strict parsing for wheels [1] (a rough sketch follows the footnotes below)
  2. Implement a to_email_headers() method for wheels [2]
  3. Evaluate where from_email_headers() could potentially be tweaked to be useful to preexisting wheels that are not compliant (i.e. try to be data-driven about what flexibility is necessary) [3]
  4. Update from_email_headers() to handle sdists/PKG-INFO explicitly
  5. Update to_email_headers() to handle sdists/PKG-INFO explicitly
  6. Implement to_json()/from_json() [4]
  7. Implement from_pyproject()/from_toml()/from_project_dict()/choose-a-name for [project] from pyproject.toml

Footnotes

  1. For PyPI.

  2. Hopefully useful to some back-ends, since sdists are stricter than wheels thanks to Dynamic and thus could theoretically be viewed as a subset.

  3. Expands usefulness to back-ends and front-ends dealing with old wheels.

  4. So we can hopefully move over to the format some day 🙏.
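As a rough illustration of step 1, a from_email_headers() built on the stdlib email parser might start like this sketch; the actual strict validation of field names and values is elided:

import email.parser


def from_email_headers(source: str) -> dict:
    msg = email.parser.Parser().parsestr(source)
    metadata: dict = {}
    for key, value in msg.items():
        # Fold repeated (multiple-use) fields into lists.
        metadata.setdefault(key.lower(), []).append(value)
    if msg.get_payload():
        # Since metadata 2.1 the message body may carry the description.
        metadata["description"] = msg.get_payload()
    # Strict parsing would validate Metadata-Version, field names, and
    # field values here, raising InvalidMetadata on any violation.
    return metadata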

@pradyunsg
Member

@brettcannon Let's file a tracking issue, with that list as a checklist? That way, others can chime in and help get this over the line and we have a single place to look at how this is progressing. :)

@brettcannon
Member Author

#570 has the TODO list.
