Provide special treatment for human-written text #261

chuwy · 2017-08-24T16:29:51Z

In JSON Schema specification maxLength keyword specifies the amount of characters in field, not bytes. However, in Redshift (as in most databases, I believe) in VARCHAR we specify amount of bytes, which may introduce mismatch in text-fields, usually written by humans, not computers.

This is generally not a problem as analytical data typically contains ASCII-text (written by computers), where amount of bytes precisely match amount of characters.

But at the same time, I can imagine an issue:

User supposed to enter his/her city name in native language (likely with non-ASCII characters)
Web-developer constrains input-field to 32 characters
Analysts makes a wrong assumption that maxLength: 32 is correct constrain
Redshift truncates all non-ASCII city names to 16 characters

This could be done as part of #170 (format: "unicode", which specify that string has absolutely no structure) or similar custom JSON-schema extension.

The text was updated successfully, but these errors were encountered:

kasper-agg · 2017-08-25T09:32:56Z

I would like to add to point 3 that it is not necessarily a mistake or assumption.

In our case we use the schema as source of truth and generate DataTransferObjects from the schema, in order to comply with the schema.
In other words, if the schema defines a specific maxLength expressed in characters, then that is what the application (and developer) honors and should be able to trust to be stored as the schema describes.

Maybe igluctl can "detect" if specific storage engines need over provisioning based on the schema requirements and thus generate "larger" fields in the generated DDL (Redschift sql schema) files?

chuwy · 2017-08-25T09:36:59Z

Hey @kasper-agg,

Yes I agree that using JSON Schemas as source of truth is a great idea and this issue prevents engineers from doing that. I also agree that it would be a right thing to do to over-provision Redshift, but problem here is that we need to come up with way to "mark" these fields as it is relatively narrow subset of cases.

kasper-agg · 2017-08-25T09:41:19Z

Regardless of the solution, I must agree it sounds like a challenge. Even more to fix it in clean way...

chuwy · 2017-08-25T09:43:10Z

I'm quite keen to add custom extensions into JSON Schema (example #170, but new properties as well - it is fully-compatible with standard) and I think this is the way to go here.

alexanderdean · 2017-08-25T09:47:53Z

I have a weird feeling we may end up having to drive a Draft 5 (?) of the JSON Schema spec, given there don't seem to be many other vendors taking a major interest in it...

chuwy · 2017-08-25T09:48:57Z

That would be lovely! I've seen efforts on Draft 5 specification, but almost no implementations out there.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Provide special treatment for human-written text #261

Provide special treatment for human-written text #261

chuwy commented Aug 24, 2017

kasper-agg commented Aug 25, 2017

chuwy commented Aug 25, 2017

kasper-agg commented Aug 25, 2017

chuwy commented Aug 25, 2017 •

edited

Loading

alexanderdean commented Aug 25, 2017

chuwy commented Aug 25, 2017 •

edited

Loading

Provide special treatment for human-written text #261

Provide special treatment for human-written text #261

Comments

chuwy commented Aug 24, 2017

kasper-agg commented Aug 25, 2017

chuwy commented Aug 25, 2017

kasper-agg commented Aug 25, 2017

chuwy commented Aug 25, 2017 • edited Loading

alexanderdean commented Aug 25, 2017

chuwy commented Aug 25, 2017 • edited Loading

chuwy commented Aug 25, 2017 •

edited

Loading

chuwy commented Aug 25, 2017 •

edited

Loading