Add a `dialect.type` property #74

khusmann · 2024-06-06T00:58:06Z

Addresses frictionlessdata/datapackage#921

I put this PR together to see what a dialect.type field would look like, and I like the result. I think it helps clarify the dialect definitions and more clearly defines which (shared) properties apply to which types, benefiting both data producers and implementations parsing the descriptors.

It does limit which dialect properties can be combined, (e.g. you can't simultaneously define dialect.table and dialect.sheetName), but I would argue this is a good thing. Otherwise implementations have to anticipate what these combinations might mean in obscure cases with obscure formats, and we start risking divergent expectations. With the dialect.type approach, by contrast, if there was a format where it made sense to define both table and sheetName, it would constitute a new dialect type that should be explicitly added to the list of dialect types (edit: and/or move table and sheetName to the "Additional Properties" section). When this was done, it would explicitly notify data producers and implementations it was a combination they can consider supporting. This also makes warnings / errors more informative in implementations (e.g. "dialect type X not supported for Y format" vs silently ignoring table or sheetName and getting an unrelated error somewhere else down the line)

@roll @peterdesmet Thoughts?

roll · 2024-06-10T07:18:23Z

I'm neutral on this as, in my opinion, there are two main contradicting factors:

of course, mixing different dialects might be a bad idea for metadata quality
on the other hand, restricting dialect to pre-defined types might close a path for using Table Dialect the way we don't know at the moment. Let's say dialect.table is applicable for some other tabular types or at some point we will have dialet.optionX which are applicable for both e.g. delimited and structured (this case is easier to handle although)

khusmann · 2024-06-12T17:10:58Z

on the other hand, restricting dialect to pre-defined types might close a path for using Table Dialect the way we don't know at the moment.

I don't see it closing any paths, because as a tagged union all new cases can be handled via 1) adding new dialect types, or 2) promoting options from the type-specific to the shared "Additional Properties" section (as we would do with both examples you provide).

The dialect.type property simply provides explicit scoping for properties, ensuring that implementations will have similar behaviors when processing the same format. (In other words, all allowable combinations of properties are standardized across implementations).

It also allows for more principled extensions -- if producers are using a format that doesn't match one of the current dialect types, they can use a custom type with its own group of properties, which can be later adopted by the standard. In the meantime, it would throw a clear "unrecognized dialect type" when running on standard implementations.

That said, if I'm the only one who is interested in this property, then it's probably not worth adding for now!

peterdesmet · 2024-06-13T15:23:15Z

I mainly don't like how the documentation is currently structured and would prefer one where the Types are explained first and then all the Properties, in an order that makes sense (cf. Field Types and Field Constraints in Table Schema):

Overview
Language
Introduction
Descriptor
Tabular Data Types
- Delimited
- Structured
- Spreadsheet
- Database
Properties
- $schema
- header
- headerRows
- headerJoin
- commentRows
- commentChar
- delimiter
- lineTerminator
- quoteChar
- doubleQuote
- escapeChar
- nullSequence
- skipInitialSpace
- property
- itemType
- itemKeys
- sheetNumber
- sheetName
- table
Excluded
References

Each Tabular Field Type section would consist of:

Heading

Paragraph explaining what this Tabular Field Type is.

Applicable properties:

Example (with just the default values):

{
  "$schema": "https://datapackage.org/profiles/1.0/tabledialect.json",
  "header": true,
  "headerRows": 1,
  "headerJoin": " ",
  "delimiter": ",",
  "lineTerminator": "\r\n",
  "quoteChar": "\"",
  "doubleQuote": true,
  "skipInitialSpace": false
}

Each Property section would consist of (cf. Field Constraints):

header

Applies to: all

Current documentation

Note: if all dialect table properties are MAY then we should repeat "A Table Dialect descriptor MAY have ..." 18 times, but just list that as a general section under properties and shorten the descriptions for each property.

@khusmann @roll, if you agree, I can make a stab at this.

peterdesmet · 2024-06-13T15:26:03Z

All that said, I agree with @khusmann that a dialect.type might be useful to generate errors such as:

Not valid: dialect property "sheet" is not supported by type "structured"

Or:

frictionless-r currently doesn't support reading tabular data of type "database"

But I can't assess how necessary it is to have without getting my hands dirty on implementation. 😄

khusmann · 2024-06-14T01:24:12Z

if you agree, I can make a stab at this.

I agree! Feel free to take a stab.

peterdesmet · 2024-06-14T08:29:11Z

Ok! @roll are you fine with my suggestion?

roll · 2024-06-14T09:32:07Z

@peterdesmet
I don't have strong opinion. Data Resource and Table Schema also use "Properties" subsections to organize long lists

peterdesmet · 2024-06-20T14:10:35Z

My opinion is that we don't need a dialect.type property for now, but keep it simply as a list of potential properties. But happy to revisit when working on implementation.

khusmann · 2024-06-20T17:10:45Z

One more advantage of dialect.type: because it enables static validation of dialect types, it's a big boost to intellesense:

When dialect.type = "database":

When dialect.type = "structured":

Without dialect.type, all the properties are mixed:

My opinion is that we don't need a dialect.type property for now, but keep it simply as a list of potential properties. But happy to revisit when working on implementation.

I agree that dialect.type isn't strictly necessary, especially given your reorganization in #76. But is this something we will be able to add later though? Wouldn't it technically be a breaking change? If someone authors a "structured" dialect without dialect.type, for example, it would default to delimited and fail parsing. We'd have to fall back on dialect discovery, which defeats the purpose / value of dialect.type

That all said, I don't see this as a make-or-break issue, so I'm fine letting it drop (clarifying which dialect properties go with which types as we do in #76 is the bigger issue I think -- dialect.type just allows these groupings to be explicit & statically enforced e.g. via a JSON schema)

ezwelty · 2024-06-21T07:17:44Z

For me, the best argument for dialect.type is this line from the documentation suggested in #76:

Some properties are generic and can be used for multiple formats, while others are specific to one format. A property MUST be ignored if it is no applicable for an arbitrary data format. For example, SQL databases do not have a concept of a header row.

So it seems like we are both defining clearly-delineated types (i.e. which property and/or default applies to which type) while simultaneously trying to avoid defining a formal type? Frankly this seems slippery... How can header be ignored for a database if table is omitted (since it is an optional property)? What I read between the lines is that the type will be inferred by the presence of properties that happen to be unique for a given type (table, sheetName), which doesn't seem manageable in the long term if more types are added. As @roll wrote (but as an argument against dialect.type):

Let's say dialect.table is applicable for some other tabular types

Once that's the case and all we have is table, then how do we know what type we are working with? Will adding this future type represent a breaking change for existing implicitly-database table dialects? I guess implementations could next try to guess based on the file extension of the path (or content thereof) of the parent Data Resource... This reminds me of using GDAL without the ability to specify a driver (https://gdal.org/drivers/vector) – it works, except when it doesn't. Except even wilder, because we aren't defining types by file type, but by conceptual categories, which map even less clearly to file type/extensions.

So I would suggest dialect.type as an optional property without a default. If set, it overrides whatever guess work an implementation might attempt. If not set, guess away. This could provide an opportunity to define in the spec exactly what guess work we have in mind.

peterdesmet · 2024-06-21T07:27:57Z

@ezwelty @khusmann you have me convinced. I think we're painting ourselves into a corner if we don't add it now:

Better error messages
Code completion
No messy inference

So I would suggest dialect.type as an optional property without a default.

I would say: add it with default delimited, since Table Dialect is the follow up to CSV dialect (which $schema defaults to), which was only used for delimited files.

khusmann · 2024-06-21T18:20:16Z

What I read between the lines is that the type will be inferred by the presence of properties that happen to be unique for a given type (table, sheetName), which doesn't seem manageable in the long term if more types are added.

💯. This was exactly my underlying concern but I wasn't doing a good job articulating it… Thanks for jumping in @ezwelty!

I would say: add it with default delimited, since Table Dialect is the follow up to CSV dialect (which $schema defaults to), which was only used for delimited files.

I agree; this is the behavior I wrote into the current PR. This gives us complete backwards compatibility with CSV dialect, and avoids future implementation guesswork as we add more dialect types.

@peterdesmet do you want to write dialect.type into #76, or would you like me to reorganize this PR in the same way you did there?

peterdesmet · 2024-06-21T23:02:17Z

@khusmann do you mind branching of #76 and incorporating dialect.type there as a new PR? Will be easier to review.

roll · 2024-06-24T09:59:59Z

@khusmann
Now it's updated - https://datapackage.org/specifications/table-dialect/ (also #79 is on its way) -- I think dialect.table will naturally fit into the new structure (personally ready to vote for it)

add dialect.type property

086c22f

peterdesmet mentioned this pull request Jun 20, 2024

Table dialect reorganized #76

Merged

ezwelty mentioned this pull request Jun 21, 2024

Migrate to Data Package v2 (once supported by frictionless-py) mjacqu/glenglat#60

Open

khusmann mentioned this pull request Jun 24, 2024

Add a dialect.type property (with table dialect reorganized) #82

Closed

khusmann closed this Jun 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a `dialect.type` property #74

Add a `dialect.type` property #74

khusmann commented Jun 6, 2024 •

edited

Loading

roll commented Jun 10, 2024

khusmann commented Jun 12, 2024

peterdesmet commented Jun 13, 2024

Heading

header

peterdesmet commented Jun 13, 2024 •

edited

Loading

khusmann commented Jun 14, 2024

peterdesmet commented Jun 14, 2024

roll commented Jun 14, 2024

peterdesmet commented Jun 20, 2024

khusmann commented Jun 20, 2024 •

edited

Loading

ezwelty commented Jun 21, 2024

peterdesmet commented Jun 21, 2024

khusmann commented Jun 21, 2024

peterdesmet commented Jun 21, 2024

roll commented Jun 24, 2024

Add a dialect.type property #74

Add a dialect.type property #74

Conversation

khusmann commented Jun 6, 2024 • edited Loading

roll commented Jun 10, 2024

khusmann commented Jun 12, 2024

peterdesmet commented Jun 13, 2024

Heading

header

peterdesmet commented Jun 13, 2024 • edited Loading

khusmann commented Jun 14, 2024

peterdesmet commented Jun 14, 2024

roll commented Jun 14, 2024

peterdesmet commented Jun 20, 2024

khusmann commented Jun 20, 2024 • edited Loading

ezwelty commented Jun 21, 2024

peterdesmet commented Jun 21, 2024

khusmann commented Jun 21, 2024

peterdesmet commented Jun 21, 2024

roll commented Jun 24, 2024

Add a `dialect.type` property #74

Add a `dialect.type` property #74

khusmann commented Jun 6, 2024 •

edited

Loading

peterdesmet commented Jun 13, 2024 •

edited

Loading

khusmann commented Jun 20, 2024 •

edited

Loading