Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check for valid postgres identifer length #5853

Closed

Conversation

jqnatividad
Copy link
Contributor

Fixes #5834

Proposed fixes:

  • add len check to is_valid_field_name

Features:

  • includes tests covering changes
  • includes updated documentation
  • includes user-visible changes
  • includes API changes
  • includes bugfix for possible backport

which cannot be longer than 63 characters
'''
return (name and name == name.strip() and
not name.startswith('_') and
'"' not in name)
'"' not in name and
not len(name) > 63)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does postgres count character length the same way we do in python?

Does the encoding of the datastore database affect this calculation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe so.

In my testing, I encountered this problem when I was automatically creating aliases when a resource was created using the template "ownerorg-packagename-resourcename"(ckan/ideas#264), and sometimes, it will create aliases longer than 63 characters and the resulting aliases are truncated at 63 characters by postgres.

Also, there is another area where the datastore checks for 63 characters already -

for field_id in field_ids:
# Postgres has a limit of 63 characters for a column name
if len(field_id) > 63:
message = 'Column heading "{0}" exceeds limit of 63 '\
'characters.'.format(field_id)

However, it only does it for field names. Not table or alias names.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both might be wrong. It sounds like the identifier length is limited to 63 bytes, so we either need to know the encoding of the DB or we should do a select octet_length(...) on the strings we plan to use field or alias names.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since postgres UTF8 encoding is already required, and the code has the # encoding utf-8 magic comment, won't doing len() be good enough?

If not, I'd go for len(name.encode("utf-8")) instead of querying postgres for octet_length(...).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is postgres utf-8 encoding required? There's nothing preventing a user from setting the datastore encoding to cp1252 to increase performance or space-efficiency for their data is there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docs imply a utf-8 requirement:

https://docs.ckan.org/en/latest/maintaining/installing/install-from-source.html?highlight=UTF#setup-a-postgresql-database
https://docs.ckan.org/en/latest/maintaining/datastore.html?highlight=UTF#set-up-the-database

There is also a hardcoded utf8 assumption in the postgres backend.

def _generate_index_name(resource_id, field):
value = (resource_id + field).encode('utf-8')
return hashlib.sha1(value).hexdigest()

Regardless, the docs should be updated to state if utf8 encoding is actually not required, and maybe even the tests updated for alternate encodings.

Copy link
Contributor

@wardi wardi Jan 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This encode is used to generate a hexadecimal hash, it doesn't affect any postgres encoding.

In other issues we're talking about letting users create their own materialized views or foreign data wrappers or other objects. It's not unreasonable for a user to want an encoding optimal for their use case.

postgres=# create database cp1252test encoding 'win1252' lc_ctype='C' lc_collate='C' template=template0;
CREATE DATABASE
cp1252test=# create table t1 ( ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ text );
NOTICE:  identifier "ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ" will be truncated to "ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþ"

vs

utf8test=# create table t1 ( ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ text );
NOTICE:  identifier "ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ" will be truncated to "ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞ"
CREATE TABLE

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Point taken! Especially as someone who's asking for more advanced postgres support. :)

And speaking of advanced postgres support, perhaps, instead of hardcoding it, get it from postgres directly with pg_control_init() now that postgres v9.5 is about to get EOLed Feb 11, 2021.

https://pgpedia.info/p/pg_control_init.html

@wardi
Copy link
Contributor

wardi commented Jan 30, 2021

the postgresql encoding names don't match the ones in Python

cp1252test=# SELECT character_set_name FROM INFORMATION_SCHEMA.CHARACTER_SETS;
 character_set_name 
--------------------
 WIN1252
(1 row)
>>> 'hi'.encode('WIN1252')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding: WIN1252

Doing a select octet_length(...) < max_identifier_length for each column name would be the safest. You should be able to pass them as an array and get the result back with a single query.

@jqnatividad jqnatividad marked this pull request as draft February 1, 2021 13:31
@pdelboca
Copy link
Member

Closing due to inactivity.

Feel free to reopen it if consider that it is still relevant.

@pdelboca pdelboca closed this Apr 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Postgres check for valid identifiers does not check for length
3 participants