-
Notifications
You must be signed in to change notification settings - Fork 323
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement auto_value_type
operation
#7908
Changes from all commits
c31df3c
10c04f1
7b7f8ae
38889e4
605be6b
fcf623c
ade2f32
345499b
5ef6641
9d6ec5f
fd5341c
62f949f
8aa7e88
a69f41c
a7dddf1
e4be978
741a5c0
556a5e8
a41f367
24cc278
d321290
aabbf47
519cc00
e1eac40
533bf7a
0f15f74
735494a
7154cbb
4880f47
7e56752
4459569
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1762,6 +1762,43 @@ type Column | |
on_problems.attach_problems_before problems <| | ||
Column.from_storage self.name new_storage | ||
|
||
## Change the value type of the column to a more specific one, based on its | ||
contents. | ||
|
||
Arguments: | ||
- shrink_types: If set `True`, smaller types will be chosen if possible, | ||
according to the rules below. Defaults to `False`. | ||
|
||
? Auto Type Selection Rules | ||
|
||
- If a `Mixed` column can be assigned a single type, like `Char` or | ||
`Integer`, that will be used. | ||
- Text columns are not parsed. To do that, use the `parse` method. | ||
- If a `Float` column contains only integers, it will be converted to | ||
an Integer column. | ||
- If a `Decimal` column contains only integers that could fit in a | ||
64-bit integer storage, it will be converted to an Integer column. | ||
- If `shrink_types` is `False` (default), no other transformations are | ||
applied. | ||
- However, if `shrink_types` is set to `True`, then: | ||
- Integer columns will be assigned the smallest size that can fit all | ||
values (down to 16-bit integers; converting to the `Byte` type has | ||
to be done manually through `cast`). | ||
- If all elements in a text column have the same length, the type | ||
will become fixed length. | ||
- Otherwise, if a text column is variable length, but all text | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Which is length 255 treated specially here? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Apparently because it is usually the cutoff point where Databases will stop storing the strings 'in-place'. That was the design discussed by @jdunkerley and @NedHarding. I think the rationale is that we don't want to shrink the column length just to the maximum length of the longest entry because it may be quite arbitrary and in other runs the length may differ - thus if we e.g. create a Database column out of this, we could get too small a limit. So instead, we see if the values are 'small' (i.e. fit in this 255 cutoff) and if so, we shrink the type to that and otherwise keep the type 'large' - i.e. unlimited. The user can always manually choose whatever limit they want through |
||
elements are no longer than 255 characters, the column will get a | ||
max length of 255. Otherwise, the column size limit will stay | ||
unchanged. | ||
auto_value_type : Boolean -> Column | ||
auto_value_type self shrink_types=False = | ||
new_value_type = case shrink_types of | ||
False -> self.inferred_precise_value_type | ||
True -> | ||
Storage.to_value_type self.java_column.getStorage.inferPreciseTypeShrunk | ||
# We run with Report_Error because we do not expect any problems. | ||
self.cast new_value_type on_problems=Problem_Behavior.Report_Error | ||
|
||
## ALIAS transform column | ||
|
||
Applies `function` to each item in this column and returns the column | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -88,10 +88,20 @@ type Table | |
Column.from_vector (v.at 0) (v.at 1) . java_column | ||
Column.Value java_col -> java_col | ||
_ -> invalid_input_shape | ||
if cols.is_empty then Error.throw (Illegal_Argument.Error "Cannot create a table with no columns.") else | ||
if (cols.all c-> c.getSize == cols.first.getSize).not then Error.throw (Illegal_Argument.Error "All columns must have the same row count.") else | ||
if cols.distinct .getName . length != cols.length then Error.throw (Illegal_Argument.Error "Column names must be distinct.") else | ||
Table.Value (Java_Table.new cols) | ||
Panic.recover Illegal_Argument <| | ||
if cols.is_empty then | ||
Panic.throw (Illegal_Argument.Error "Cannot create a table with no columns.") | ||
|
||
if cols.distinct .getName . length != cols.length then | ||
Panic.throw (Illegal_Argument.Error "Column names must be distinct.") | ||
|
||
mismatched_size_column = cols.find if_missing=Nothing c-> | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Since there aren't going to be a lot of columns most of the time, I think it would be good to report here the lengths of all the columns so it's easier for the user to know which columns are wrong. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. IMO if we just list lengths of all columns it will be harder to find which ones fit and which ones do not, especially if there are >5 columns which is rather common. IMO best to just show one that is wrong - then the user can fix it - if there are any more wrong ones - a new one will appear. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not just the lengths, but the names of the columns as well, so it wouldn't be confusing. If we only show one, and there are 10 columns with different lengths, that would take 9-10 runs to find all the problems. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, but keep in mind that our IDE error message box is currently very small, so the bigger the error is the less readable it becomes. I'm happy to revisit it at later point, e.g. when our error messages will be displayed larger. For now IMO this is already improving that we know at least one 'offending example', so I'd keep it as is. If the user has just a few columns they will just see the sizes. If there's more of them, the message will be too long to easily see anyway - the user can always do |
||
c.getSize != cols.first.getSize | ||
if mismatched_size_column.is_nothing.not then | ||
msg = "All columns must have the same row count, but the column [" + mismatched_size_column.getName + "] has " + mismatched_size_column.getSize.to_text + " rows, while the column [" + cols.first.getName + "] has " + cols.first.getSize.to_text + " rows." | ||
Panic.throw (Illegal_Argument.Error msg) | ||
|
||
Table.Value (Java_Table.new cols) | ||
|
||
## GROUP Standard.Base.Constants | ||
Creates a new table from a vector of column names and a vector of vectors | ||
|
@@ -946,6 +956,9 @@ type Table | |
Arguments: | ||
- columns: The selection of columns to cast. | ||
- value_type: The `Value_Type` to cast the column to. | ||
- error_on_missing_columns: Specifies if a missing input column should | ||
result in an error regardless of the `on_problems` settings. Defaults | ||
to `True`. | ||
- on_problems: Specifies how to handle problems if they occur, reporting | ||
them as warnings by default. | ||
|
||
|
@@ -996,6 +1009,50 @@ type Table | |
new_column = column_to_cast.cast value_type on_problems | ||
table.set new_column new_name=column_to_cast.name set_mode=Set_Mode.Update | ||
|
||
## Change the value type of table columns to a more specific one, based on | ||
their contents. | ||
|
||
This is most useful for `Mixed` type columns and will allow to narrow | ||
down the type if all values in the column fit a more specific type. | ||
|
||
Arguments: | ||
- columns: The selection of columns to convert. | ||
- shrink_types: If set `True`, smaller types will be chosen if possible, | ||
according to the rules below. Defaults to `False`. | ||
- error_on_missing_columns: Specifies if a missing input column should | ||
result in an error regardless of the `on_problems` settings. Defaults | ||
to `True`. | ||
- on_problems: Specifies how to handle problems if they occur, reporting | ||
them as warnings by default. | ||
|
||
? Auto Type Selection Rules | ||
|
||
- If a `Mixed` column can be assigned a single type, like `Char` or | ||
`Integer`, that will be used. | ||
- Text columns are not parsed. To do that, use the `parse` method. | ||
- If a `Float` column contains only integers, it will be converted to | ||
an Integer column. | ||
- If a `Decimal` column contains only integers that could fit in a | ||
64-bit integer storage, it will be converted to an Integer column. | ||
- If `shrink_types` is `False` (default), no other transformations are | ||
applied. | ||
- However, if `shrink_types` is set to `True`, then: | ||
- Integer columns will be assigned the smallest size that can fit all | ||
values (down to 16-bit integers; converting to the `Byte` type has | ||
to be done manually through `cast`). | ||
- If all elements in a text column have the same length, the type | ||
will become fixed length. | ||
- Otherwise, if a text column is variable length, but all text | ||
elements are no longer than 255 characters, the column will get a | ||
max length of 255. Otherwise, the column size limit will stay | ||
unchanged. | ||
auto_value_types : Vector (Text | Integer | Regex) | Text | Integer | Regex -> Boolean -> Boolean -> Problem_Behavior -> Table | ||
auto_value_types self columns=self.column_names shrink_types=False error_on_missing_columns=True on_problems=Problem_Behavior.Report_Warning = | ||
selected = self.columns_helper.select_columns columns Case_Sensitivity.Default reorder=False error_on_missing_columns=error_on_missing_columns on_problems=on_problems error_on_empty=False | ||
selected.fold self table-> column_to_cast-> | ||
new_column = column_to_cast.auto_value_type shrink_types | ||
table.set new_column new_name=column_to_cast.name set_mode=Set_Mode.Update | ||
|
||
## GROUP Standard.Base.Conversions | ||
Splits a column of text into a set of new columns. | ||
The original column will be removed from the table. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like this pattern, I hope we use it more.