SNOW-1871175: Add support for specifying a schema string for DataFrame.create_dataframe
#2828
base: main
bracket_depth += 1
# we don't store the opening bracket in 'inside_chars'
# if bracket_depth was 0 -> 1, to skip the outer bracket
if bracket_depth > 1:
    inside_chars.append(c)
Is this form allowed: array<<<...>>>?
Yes, for something like array<array<...>>. I added a comment here.
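For readers following the thread, here is a minimal sketch of the depth-tracking idea discussed above. It is not the PR's actual code: the function name `extract_inner` is hypothetical, and the handling of closing brackets is an assumption that mirrors the opening-bracket rule in the snippet.

```python
def extract_inner(s: str) -> str:
    """Return the text inside the outermost <...> or (...) pair of `s`.

    Hypothetical helper for illustration only; not the PR's function.
    """
    bracket_depth = 0
    inside_chars = []
    for c in s:
        if c in "<(":
            bracket_depth += 1
            # Skip the outer opening bracket (depth 0 -> 1); keep nested ones.
            if bracket_depth > 1:
                inside_chars.append(c)
        elif c in ">)":
            bracket_depth -= 1
            if bracket_depth < 0:
                raise ValueError(f"Mismatched bracket in '{s}'.")
            # Skip the outer closing bracket (depth 1 -> 0); keep nested ones.
            if bracket_depth > 0:
                inside_chars.append(c)
        elif bracket_depth > 0:
            # Ordinary character inside the outer bracket.
            inside_chars.append(c)
    if bracket_depth != 0:
        raise ValueError(f"Mismatched bracket in '{s}'.")
    return "".join(inside_chars)


print(extract_inner("array<array<int>>"))
```

With this rule, nested brackets survive intact, so the inner type string can be parsed recursively by the same logic.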
for field_def in field_defs:
    # Try splitting on colon first, else whitespace
    if ":" in field_def:
        left, right = field_def.split(":", 1)
Should we consider cases with multiple colons, like "a:b:c"? Or is this already handled by upstream/downstream logic?
nah, PySpark's simpleString format only considers the first colon or whitespace.
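To make the "first colon or whitespace" behavior concrete, here is a small sketch. The helper name `split_field_def` and its error message are hypothetical and not taken from the PR; only the `split(":", 1)` call comes from the snippet above.

```python
from typing import Tuple


def split_field_def(field_def: str) -> Tuple[str, str]:
    """Split 'name: type' or 'name type' into (name, type).

    Hypothetical helper for illustration only.
    """
    field_def = field_def.strip()
    if ":" in field_def:
        # maxsplit=1: only the first colon separates name from type.
        left, right = field_def.split(":", 1)
    else:
        # No colon: split once on the first run of whitespace.
        parts = field_def.split(None, 1)
        if len(parts) != 2:
            raise ValueError(f"Cannot parse field definition '{field_def}'.")
        left, right = parts
    return left.strip(), right.strip()


print(split_field_def("a:b:c"))        # ('a', 'b:c')
print(split_field_def("name string"))  # ('name', 'string')
```

Under this sketch, "a:b:c" yields the name "a" and leaves "b:c" for the type parser to accept or reject.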
for i, c in enumerate(s):
    if c in ["<", "("]:
        bracket_depth += 1
    elif c in [">", ")"]:
        bracket_depth -= 1
        if bracket_depth < 0:
            raise ValueError(f"Mismatched bracket in '{s}'.")
I feel this bracket-check logic is repeated multiple times.
Do you think it's possible to check bracket matching once, as an initial step over the whole input string, so that the downstream logic can focus only on extracting the names and types?
We need to parse the brackets anyway to split the fields and extract names and types. The check for whether the bracket expression is valid is indeed duplicated, and we could remove it. But to keep each function self-contained, maybe let's still keep it? These cases are also covered by the tests.
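As a rough illustration of why the depth tracking is needed even if validation is done once up front, here is a sketch that splits fields on top-level commas and validates brackets in the same pass. The name `split_top_level_fields` is hypothetical, and this is an assumption about the shape of the logic, not the PR's implementation.

```python
from typing import List


def split_top_level_fields(s: str) -> List[str]:
    """Split a struct body on commas that are not nested inside <...> or (...).

    Hypothetical helper for illustration only.
    """
    fields: List[str] = []
    current: List[str] = []
    depth = 0
    for c in s:
        if c in "<(":
            depth += 1
        elif c in ">)":
            depth -= 1
            if depth < 0:
                raise ValueError(f"Mismatched bracket in '{s}'.")
        if c == "," and depth == 0:
            # A comma at depth 0 ends the current field.
            fields.append("".join(current).strip())
            current = []
        else:
            current.append(c)
    if depth != 0:
        raise ValueError(f"Mismatched bracket in '{s}'.")
    last = "".join(current).strip()
    if last:
        fields.append(last)
    return fields


print(split_top_level_fields("a: int, b: map<string, int>, c: decimal(10, 2)"))
```

The commas inside `map<string, int>` and `decimal(10, 2)` are left alone because the depth is non-zero there, which is the same bookkeeping a standalone validity check would need.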
src/snowflake/snowpark/session.py
Outdated
    f"Invalid schema string: {schema}. "
    f"You should provide a valid schema string representing a struct type."
)
if isinstance(schema, (StructType, str)):
Would the schema still be of type str after being processed by type_string_to_type_object? I feel we don't need the str in the instance check here.
Suggested change:
- if isinstance(schema, (StructType, str)):
+ if isinstance(schema, StructType):
Yes, that was a mistake.
Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.
Fixes SNOW-1871175
Fill out the following pre-review checklist:
Please describe how your code solves the related issue.
Support schema string