Check for valid postgres identifier length #5853
Conversation
which cannot be longer than 63 characters
ckanext/datastore/helpers.py (Outdated)

     '''
     return (name and name == name.strip() and
             not name.startswith('_') and
-            '"' not in name)
+            '"' not in name and
+            not len(name) > 63)
Does postgres count character length the same way we do in python?
Does the encoding of the datastore database affect this calculation?
I believe so.
In my testing, I encountered this problem when I was automatically creating aliases using the template "ownerorg-packagename-resourcename" whenever a resource was created (ckan/ideas#264); sometimes this produced aliases longer than 63 characters, and the resulting aliases were truncated at 63 characters by postgres.
Also, there is another area where the datastore checks for 63 characters already -
ckan/ckanext/datastore/backend/postgres.py
Lines 863 to 867 in cc000d3
for field_id in field_ids:
    # Postgres has a limit of 63 characters for a column name
    if len(field_id) > 63:
        message = 'Column heading "{0}" exceeds limit of 63 '\
                  'characters.'.format(field_id)
However, it only does this for field names, not table or alias names.
Both might be wrong. It sounds like the identifier length is limited to 63 bytes, so we either need to know the encoding of the DB or we should do a select octet_length(...) on the strings we plan to use as field or alias names.
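A minimal sketch of what that could look like, assuming a psycopg2 connection to the datastore database (the DSN, helper name, and the 63-byte limit are illustrative, not part of this PR):

import psycopg2

# Hypothetical DSN; the real one would come from the datastore configuration.
conn = psycopg2.connect("dbname=datastore_default")

def identifier_byte_length(name):
    # Let postgres itself measure the identifier in bytes, using the
    # encoding of the connected database.
    with conn.cursor() as cur:
        cur.execute("SELECT octet_length(%s)", (name,))
        return cur.fetchone()[0]

print(identifier_byte_length(u"ÀÁÂÃ") <= 63)  # 8 bytes under UTF8, 4 under WIN1252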
Since postgres UTF8 encoding is already required, and the code has the # encoding utf-8 magic comment, won't doing len() be good enough? If not, I'd go for len(name.encode("utf-8")) instead of querying postgres for octet_length(...).
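A minimal sketch of that Python-only check, assuming the database encoding really is UTF-8 (the constant and helper name are illustrative):

MAX_IDENTIFIER_LENGTH = 63  # postgres limit for identifiers, in bytes

def fits_identifier_limit(name):
    # len(name) counts characters; postgres truncates identifiers at 63
    # *bytes*, so measure the UTF-8 encoded length instead.
    return len(name.encode("utf-8")) <= MAX_IDENTIFIER_LENGTH

print(fits_identifier_limit(u"a" * 63))  # True: 63 chars, 63 bytes
print(fits_identifier_limit(u"à" * 63))  # False: 63 chars, 126 bytes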
Where is postgres utf-8 encoding required? There's nothing preventing a user from setting the datastore encoding to cp1252 to increase performance or space-efficiency for their data, is there?
The docs imply a utf-8 requirement:
https://docs.ckan.org/en/latest/maintaining/installing/install-from-source.html?highlight=UTF#setup-a-postgresql-database
https://docs.ckan.org/en/latest/maintaining/datastore.html?highlight=UTF#set-up-the-database
There is also a hardcoded utf8 assumption in the postgres backend.
ckan/ckanext/datastore/backend/postgres.py
Lines 555 to 557 in cc000d3
def _generate_index_name(resource_id, field):
    value = (resource_id + field).encode('utf-8')
    return hashlib.sha1(value).hexdigest()
Regardless, the docs should be updated to state whether utf8 encoding is actually required, and maybe even the tests updated to cover alternate encodings.
This encode is used to generate a hexadecimal hash; it doesn't affect any postgres encoding.
In other issues we're talking about letting users create their own materialized views or foreign data wrappers or other objects. It's not unreasonable for a user to want an encoding optimal for their use case.
postgres=# create database cp1252test encoding 'win1252' lc_ctype='C' lc_collate='C' template=template0;
CREATE DATABASE
cp1252test=# create table t1 ( ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ text );
NOTICE: identifier "ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ" will be truncated to "ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþ"
vs
utf8test=# create table t1 ( ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ text );
NOTICE: identifier "ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ" will be truncated to "ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞ"
CREATE TABLE
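The different truncation points above follow from the byte length of the same 64-character string under each encoding; a quick Python illustration (assuming the cp1252 codec is available):

name = u"ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"

print(len(name))                   # 64 characters
print(len(name.encode("cp1252")))  # 64 bytes  -> postgres keeps 63 characters
print(len(name.encode("utf-8")))   # 128 bytes -> postgres keeps only 31 characters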
Point taken! Especially as someone who's asking for more advanced postgres support. :)
And speaking of advanced postgres support: perhaps, instead of hardcoding it, get it from postgres directly with pg_control_init(), now that postgres v9.5 is about to be EOLed on Feb 11, 2021.
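A sketch of how that could look, assuming a psycopg2 connection and postgres 9.6 or newer, where pg_control_init() exposes max_identifier_length (the DSN below is illustrative):

import psycopg2

conn = psycopg2.connect("dbname=datastore_default")  # hypothetical DSN

with conn.cursor() as cur:
    # max_identifier_length is one of the columns returned by pg_control_init()
    cur.execute("SELECT max_identifier_length FROM pg_control_init()")
    max_identifier_length = cur.fetchone()[0]

print(max_identifier_length)  # 63 on a default build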
The postgresql encoding names don't match the ones in Python:

cp1252test=# SELECT character_set_name FROM INFORMATION_SCHEMA.CHARACTER_SETS;
 character_set_name
--------------------
 WIN1252
(1 row)

>>> 'hi'.encode('WIN1252')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding: WIN1252

Doing a
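If the byte counting were ever done in Python for arbitrary database encodings, the postgres encoding name would first have to be translated to a Python codec name; a hypothetical partial mapping (the table and helper are illustrative only, not part of CKAN):

# Partial, hand-maintained translation table; postgres and Python do not
# share encoding names (e.g. WIN1252 vs cp1252).
PG_TO_PYTHON_ENCODING = {
    "UTF8": "utf-8",
    "WIN1252": "cp1252",
    "LATIN1": "iso8859-1",
    "SQL_ASCII": "ascii",
}

def byte_length(name, pg_encoding):
    codec = PG_TO_PYTHON_ENCODING[pg_encoding]
    return len(name.encode(codec))

print(byte_length(u"ÀÁÂÃ", "WIN1252"))  # 4
print(byte_length(u"ÀÁÂÃ", "UTF8"))     # 8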
Closing due to inactivity. Feel free to reopen it if you consider that it is still relevant.
Fixes #5834