Skip to content
This repository has been archived by the owner on Oct 28, 2024. It is now read-only.

Add a dialect.type property #74

Closed
wants to merge 1 commit into from

Conversation

khusmann
Copy link
Contributor

@khusmann khusmann commented Jun 6, 2024

Addresses frictionlessdata/datapackage#921

I put this PR together to see what a dialect.type field would look like, and I like the result. I think it helps clarify the dialect definitions and more clearly defines which (shared) properties apply to which types, benefiting both data producers and implementations parsing the descriptors.

It does limit which dialect properties can be combined, (e.g. you can't simultaneously define dialect.table and dialect.sheetName), but I would argue this is a good thing. Otherwise implementations have to anticipate what these combinations might mean in obscure cases with obscure formats, and we start risking divergent expectations. With the dialect.type approach, by contrast, if there was a format where it made sense to define both table and sheetName, it would constitute a new dialect type that should be explicitly added to the list of dialect types (edit: and/or move table and sheetName to the "Additional Properties" section). When this was done, it would explicitly notify data producers and implementations it was a combination they can consider supporting. This also makes warnings / errors more informative in implementations (e.g. "dialect type X not supported for Y format" vs silently ignoring table or sheetName and getting an unrelated error somewhere else down the line)

@roll @peterdesmet Thoughts?

@roll
Copy link
Member

roll commented Jun 10, 2024

I'm neutral on this as, in my opinion, there are two main contradicting factors:

  • of course, mixing different dialects might be a bad idea for metadata quality
  • on the other hand, restricting dialect to pre-defined types might close a path for using Table Dialect the way we don't know at the moment. Let's say dialect.table is applicable for some other tabular types or at some point we will have dialet.optionX which are applicable for both e.g. delimited and structured (this case is easier to handle although)

@khusmann
Copy link
Contributor Author

on the other hand, restricting dialect to pre-defined types might close a path for using Table Dialect the way we don't know at the moment.

I don't see it closing any paths, because as a tagged union all new cases can be handled via 1) adding new dialect types, or 2) promoting options from the type-specific to the shared "Additional Properties" section (as we would do with both examples you provide).

The dialect.type property simply provides explicit scoping for properties, ensuring that implementations will have similar behaviors when processing the same format. (In other words, all allowable combinations of properties are standardized across implementations).

It also allows for more principled extensions -- if producers are using a format that doesn't match one of the current dialect types, they can use a custom type with its own group of properties, which can be later adopted by the standard. In the meantime, it would throw a clear "unrecognized dialect type" when running on standard implementations.

That said, if I'm the only one who is interested in this property, then it's probably not worth adding for now!

@peterdesmet
Copy link
Member

I mainly don't like how the documentation is currently structured and would prefer one where the Types are explained first and then all the Properties, in an order that makes sense (cf. Field Types and Field Constraints in Table Schema):

Overview
Language
Introduction
Descriptor
Tabular Data Types
- Delimited
- Structured
- Spreadsheet
- Database
Properties
- $schema
- header
- headerRows
- headerJoin
- commentRows
- commentChar
- delimiter
- lineTerminator
- quoteChar
- doubleQuote
- escapeChar
- nullSequence
- skipInitialSpace
- property
- itemType
- itemKeys
- sheetNumber
- sheetName
- table
Excluded
References

Each Tabular Field Type section would consist of:

Heading

Paragraph explaining what this Tabular Field Type is.

Applicable properties:

Example (with just the default values):

{
  "$schema": "https://datapackage.org/profiles/1.0/tabledialect.json",
  "header": true,
  "headerRows": 1,
  "headerJoin": " ",
  "delimiter": ",",
  "lineTerminator": "\r\n",
  "quoteChar": "\"",
  "doubleQuote": true,
  "skipInitialSpace": false
}

Each Property section would consist of (cf. Field Constraints):

header

  • Applies to: all

Current documentation

Note: if all dialect table properties are MAY then we should repeat "A Table Dialect descriptor MAY have ..." 18 times, but just list that as a general section under properties and shorten the descriptions for each property.

@khusmann @roll, if you agree, I can make a stab at this.

@peterdesmet
Copy link
Member

peterdesmet commented Jun 13, 2024

All that said, I agree with @khusmann that a dialect.type might be useful to generate errors such as:

Not valid: dialect property "sheet" is not supported by type "structured"

Or:

frictionless-r currently doesn't support reading tabular data of type "database"

But I can't assess how necessary it is to have without getting my hands dirty on implementation. 😄

@khusmann
Copy link
Contributor Author

if you agree, I can make a stab at this.

I agree! Feel free to take a stab.

@peterdesmet
Copy link
Member

Ok! @roll are you fine with my suggestion?

@roll
Copy link
Member

roll commented Jun 14, 2024

@peterdesmet
I don't have strong opinion. Data Resource and Table Schema also use "Properties" subsections to organize long lists

@peterdesmet
Copy link
Member

My opinion is that we don't need a dialect.type property for now, but keep it simply as a list of potential properties. But happy to revisit when working on implementation.

@khusmann
Copy link
Contributor Author

khusmann commented Jun 20, 2024

One more advantage of dialect.type: because it enables static validation of dialect types, it's a big boost to intellesense:

When dialect.type = "database":

image

When dialect.type = "structured":

image

Without dialect.type, all the properties are mixed:

image

My opinion is that we don't need a dialect.type property for now, but keep it simply as a list of potential properties. But happy to revisit when working on implementation.

I agree that dialect.type isn't strictly necessary, especially given your reorganization in #76. But is this something we will be able to add later though? Wouldn't it technically be a breaking change? If someone authors a "structured" dialect without dialect.type, for example, it would default to delimited and fail parsing. We'd have to fall back on dialect discovery, which defeats the purpose / value of dialect.type

That all said, I don't see this as a make-or-break issue, so I'm fine letting it drop (clarifying which dialect properties go with which types as we do in #76 is the bigger issue I think -- dialect.type just allows these groupings to be explicit & statically enforced e.g. via a JSON schema)

@ezwelty
Copy link

ezwelty commented Jun 21, 2024

For me, the best argument for dialect.type is this line from the documentation suggested in #76:

Some properties are generic and can be used for multiple formats, while others are specific to one format. A property MUST be ignored if it is no applicable for an arbitrary data format. For example, SQL databases do not have a concept of a header row.

So it seems like we are both defining clearly-delineated types (i.e. which property and/or default applies to which type) while simultaneously trying to avoid defining a formal type? Frankly this seems slippery... How can header be ignored for a database if table is omitted (since it is an optional property)? What I read between the lines is that the type will be inferred by the presence of properties that happen to be unique for a given type (table, sheetName), which doesn't seem manageable in the long term if more types are added. As @roll wrote (but as an argument against dialect.type):

Let's say dialect.table is applicable for some other tabular types

Once that's the case and all we have is table, then how do we know what type we are working with? Will adding this future type represent a breaking change for existing implicitly-database table dialects? I guess implementations could next try to guess based on the file extension of the path (or content thereof) of the parent Data Resource... This reminds me of using GDAL without the ability to specify a driver (https://gdal.org/drivers/vector) – it works, except when it doesn't. Except even wilder, because we aren't defining types by file type, but by conceptual categories, which map even less clearly to file type/extensions.

So I would suggest dialect.type as an optional property without a default. If set, it overrides whatever guess work an implementation might attempt. If not set, guess away. This could provide an opportunity to define in the spec exactly what guess work we have in mind.

@peterdesmet
Copy link
Member

@ezwelty @khusmann you have me convinced. I think we're painting ourselves into a corner if we don't add it now:

  • Better error messages
  • Code completion
  • No messy inference

So I would suggest dialect.type as an optional property without a default.

I would say: add it with default delimited, since Table Dialect is the follow up to CSV dialect (which $schema defaults to), which was only used for delimited files.

@khusmann
Copy link
Contributor Author

What I read between the lines is that the type will be inferred by the presence of properties that happen to be unique for a given type (table, sheetName), which doesn't seem manageable in the long term if more types are added.

💯. This was exactly my underlying concern but I wasn't doing a good job articulating it… Thanks for jumping in @ezwelty!

I would say: add it with default delimited, since Table Dialect is the follow up to CSV dialect (which $schema defaults to), which was only used for delimited files.

I agree; this is the behavior I wrote into the current PR. This gives us complete backwards compatibility with CSV dialect, and avoids future implementation guesswork as we add more dialect types.

@peterdesmet do you want to write dialect.type into #76, or would you like me to reorganize this PR in the same way you did there?

@peterdesmet
Copy link
Member

@khusmann do you mind branching of #76 and incorporating dialect.type there as a new PR? Will be easier to review.

@roll
Copy link
Member

roll commented Jun 24, 2024

@khusmann
Now it's updated - https://datapackage.org/specifications/table-dialect/ (also #79 is on its way) -- I think dialect.table will naturally fit into the new structure (personally ready to vote for it)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants