Check for valid postgres identifier length #5853
Conversation
which cannot be longer than 63 characters
ckanext/datastore/helpers.py (Outdated)

     '''
     return (name and name == name.strip() and
             not name.startswith('_') and
-            '"' not in name)
+            '"' not in name and
+            not len(name) > 63)
Does postgres count character length the same way we do in python?
Does the encoding of the datastore database affect this calculation?
I believe so.
In my testing, I encountered this problem when I was automatically creating aliases using the template "ownerorg-packagename-resourcename" whenever a resource was created (ckan/ideas#264); sometimes this produced aliases longer than 63 characters, and the resulting aliases were truncated at 63 characters by postgres.
Also, there is another area where the datastore checks for 63 characters already -
ckan/ckanext/datastore/backend/postgres.py
Lines 863 to 867 in cc000d3
for field_id in field_ids:
    # Postgres has a limit of 63 characters for a column name
    if len(field_id) > 63:
        message = 'Column heading "{0}" exceeds limit of 63 '\
                  'characters.'.format(field_id)
However, it only does this for field names, not table or alias names.
Both might be wrong. It sounds like the identifier length is limited to 63 bytes, so we either need to know the encoding of the DB or we should do a select octet_length(...) on the strings we plan to use as field or alias names.
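A minimal sketch of what that could look like, assuming a psycopg2 connection to the datastore database (the DSN, helper name, and the 63-byte limit are illustrative, not part of this PR):

import psycopg2

# Hypothetical DSN; the real one would come from the datastore configuration.
conn = psycopg2.connect("dbname=datastore_default")

def identifier_byte_length(name):
    # Let postgres itself measure the identifier in bytes, using the
    # encoding of the connected database.
    with conn.cursor() as cur:
        cur.execute("SELECT octet_length(%s)", (name,))
        return cur.fetchone()[0]

print(identifier_byte_length(u"ÀÁÂÃ") <= 63)  # 8 bytes under UTF8, 4 under WIN1252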
Since postgres UTF8 encoding is already required, and the code has the # encoding utf-8 magic comment, won't doing len() be good enough? If not, I'd go for len(name.encode("utf-8")) instead of querying postgres for octet_length(...).
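A minimal sketch of that Python-only check, assuming the database encoding really is UTF-8 (the constant and helper name are illustrative):

MAX_IDENTIFIER_LENGTH = 63  # postgres limit for identifiers, in bytes

def fits_identifier_limit(name):
    # len(name) counts characters; postgres truncates identifiers at 63
    # *bytes*, so measure the UTF-8 encoded length instead.
    return len(name.encode("utf-8")) <= MAX_IDENTIFIER_LENGTH

print(fits_identifier_limit(u"a" * 63))  # True: 63 chars, 63 bytes
print(fits_identifier_limit(u"à" * 63))  # False: 63 chars, 126 bytes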
Where is postgres utf-8 encoding required? There's nothing preventing a user from setting the datastore encoding to cp1252 to increase performance or space-efficiency for their data, is there?
The docs imply a utf-8 requirement:
https://docs.ckan.org/en/latest/maintaining/installing/install-from-source.html?highlight=UTF#setup-a-postgresql-database
https://docs.ckan.org/en/latest/maintaining/datastore.html?highlight=UTF#set-up-the-database
There is also a hardcoded utf8 assumption in the postgres backend.
ckan/ckanext/datastore/backend/postgres.py
Lines 555 to 557 in cc000d3
def _generate_index_name(resource_id, field):
    value = (resource_id + field).encode('utf-8')
    return hashlib.sha1(value).hexdigest()
Regardless, the docs should be updated to state whether utf8 encoding is actually required, and maybe even the tests updated to cover alternate encodings.
This encode is used to generate a hexadecimal hash; it doesn't affect any postgres encoding.
In other issues we're talking about letting users create their own materialized views or foreign data wrappers or other objects. It's not unreasonable for a user to want an encoding optimal for their use case.
postgres=# create database cp1252test encoding 'win1252' lc_ctype='C' lc_collate='C' template=template0;
CREATE DATABASE
cp1252test=# create table t1 ( ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ text );
NOTICE: identifier "ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ" will be truncated to "ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþ"
vs
utf8test=# create table t1 ( ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ text );
NOTICE: identifier "ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ" will be truncated to "ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞ"
CREATE TABLE
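The different truncation points above follow from the byte length of the same 64-character string under each encoding; a quick Python illustration (assuming the cp1252 codec is available):

name = u"ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"

print(len(name))                   # 64 characters
print(len(name.encode("cp1252")))  # 64 bytes  -> postgres keeps 63 characters
print(len(name.encode("utf-8")))   # 128 bytes -> postgres keeps only 31 characters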
Point taken! Especially as someone who's asking for more advanced postgres support. :)
And speaking of advanced postgres support: perhaps, instead of hardcoding it, get it from postgres directly with pg_control_init(), now that postgres v9.5 is about to be EOLed on Feb 11, 2021.
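A sketch of how that could look, assuming a psycopg2 connection and postgres 9.6 or newer, where pg_control_init() exposes max_identifier_length (the DSN below is illustrative):

import psycopg2

conn = psycopg2.connect("dbname=datastore_default")  # hypothetical DSN

with conn.cursor() as cur:
    # max_identifier_length is one of the columns returned by pg_control_init()
    cur.execute("SELECT max_identifier_length FROM pg_control_init()")
    max_identifier_length = cur.fetchone()[0]

print(max_identifier_length)  # 63 on a default build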
The postgresql encoding names don't match the ones in Python:

cp1252test=# SELECT character_set_name FROM INFORMATION_SCHEMA.CHARACTER_SETS;
 character_set_name
--------------------
 WIN1252
(1 row)

>>> 'hi'.encode('WIN1252')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding: WIN1252

Doing a
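If the byte counting were ever done in Python for arbitrary database encodings, the postgres encoding name would first have to be translated to a Python codec name; a hypothetical partial mapping (the table and helper are illustrative only, not part of CKAN):

# Partial, hand-maintained translation table; postgres and Python do not
# share encoding names (e.g. WIN1252 vs cp1252).
PG_TO_PYTHON_ENCODING = {
    "UTF8": "utf-8",
    "WIN1252": "cp1252",
    "LATIN1": "iso8859-1",
    "SQL_ASCII": "ascii",
}

def byte_length(name, pg_encoding):
    codec = PG_TO_PYTHON_ENCODING[pg_encoding]
    return len(name.encode(codec))

print(byte_length(u"ÀÁÂÃ", "WIN1252"))  # 4
print(byte_length(u"ÀÁÂÃ", "UTF8"))     # 8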
Closing due to inactivity. Feel free to reopen it if you consider that it is still relevant.
Fixes #5834