Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNOW-1871175: Add support for specifying a schema string for DataFrame.create_dataframe #2828

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

sfc-gh-jdu
Copy link
Collaborator

@sfc-gh-jdu sfc-gh-jdu commented Jan 7, 2025

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-1871175

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
    • I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
  3. Please describe how your code solves the related issue.

    Support schema string

Comment on lines +1011 to +1015
bracket_depth += 1
# we don't store the opening bracket in 'inside_chars'
# if bracket_depth was 0 -> 1, to skip the outer bracket
if bracket_depth > 1:
inside_chars.append(c)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this form allowed array<<<...>>>?

Copy link
Collaborator Author

@sfc-gh-jdu sfc-gh-jdu Jan 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea for something like array<array<...>>, added a comment here

for field_def in field_defs:
# Try splitting on colon first, else whitespace
if ":" in field_def:
left, right = field_def.split(":", 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we consider multiple colon cases like "a:b:c"? or this is handled by upstream/downstream logic already

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nah, PySpark's simpleString format only considers the first colon or whitespace.

Comment on lines +1071 to +1077
for i, c in enumerate(s):
if c in ["<", "("]:
bracket_depth += 1
elif c in [">", ")"]:
bracket_depth -= 1
if bracket_depth < 0:
raise ValueError(f"Mismatched bracket in '{s}'.")
Copy link
Contributor

@sfc-gh-aling sfc-gh-aling Jan 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel this bracket check logic has repeated multiple times
do you think it's possible to check the bracket match as the initial step for only one time for the whole input string, and then in the downstream logic we can only focus on extracting the names and types

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to parse bracket to split fields, and extract names and types anyway. There is indeed a duplicate of validating whether the bracket expression is valid or not, maybe we can remove it. But to make the function self-contained, maybe let's still keep it? They are also covered in the test.

f"Invalid schema string: {schema}. "
f"You should provide a valid schema string representing a struct type."
)
if isinstance(schema, (StructType, str)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would the schema still be of type str after being processed by the type_string_to_type_object?
I feel we don't need to the str in the instance check here

Suggested change
if isinstance(schema, (StructType, str)):
if isinstance(schema, StructType):

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea it's by mistake

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants