Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNOW-1167007: Validate no null values for not null cols #739

Merged

Conversation

sfc-gh-wtrefon
Copy link
Contributor

Validate if row contains null values for non-nullable columns. If yes, add such column names to InsertValidationResponse error as NullValueForNotNullColNames so it can be used for schema evolution.

@sfc-gh-wtrefon sfc-gh-wtrefon requested review from sfc-gh-tzhang and a team as code owners April 15, 2024 13:21
@sfc-gh-wtrefon sfc-gh-wtrefon force-pushed the wtrefon-SNOW-1167007-null-values-for-not-null-cols branch from bc0795c to bf9ed76 Compare April 15, 2024 13:24
Comment on lines +446 to +452
List<String> nullValueNotNullCols = new ArrayList<>();
for (String columnName : this.nonNullableFieldNames) {
if (inputColNamesMap.containsKey(columnName)
&& row.get(inputColNamesMap.get(columnName)) == null) {
nullValueNotNullCols.add(statsMap.get(columnName).getColumnDisplayName());
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already throw an exception at the end of parseColumnValueToParquet when we insert null into a non-nullable column, should be just catch that and add the col name to the nullValueNotNullCols list?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if it is the right place. The verifyInputColumns is called before and sounds like the right place for such validation. verifyInputColumns is also in the abstract class, the parseColumnValueToParquet only in one implementation (Parquet).

If we catched the exception in insertRows function we would only add single column to the nullValueNotNullCols even if there is more columns with null values. We would need to pass the error object into the addRow function to catch all the exception that may be thrown in a loop.

Therefore i think the verifyInputColumns is much better place for the validation

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, makes sense to me, my concern was that in the worst case, we're iterating through all the data again to check this which is not ideal so I was hoping that we can do this as part of data validation.

@@ -442,6 +442,26 @@ Set<String> verifyInputColumns(
"Values for all non-nullable columns must be specified, rowIndex:%d", rowIndex));
}

// Check for null values in not-nullable columns in the row
List<String> nullValueNotNullCols = new ArrayList<>();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to differentiate between missingCols and nullValueNotNullCols?
Is there a conceptual difference between the cases when a column is present with null value and when it isn't present at all.

Copy link
Contributor Author

@sfc-gh-wtrefon sfc-gh-wtrefon Apr 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i was thinking about this, the only thing is the naming of missingCols. But if we do like this then we don't need to adapt anything in kafka push code. What u think @sfc-gh-tzhang ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with combing it and we will need to update the name to make it clear, the only concern is that we're updating a public API, but I highly doubt any customer is using this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we would need to change the name anyways (so we still would need changes in kafka push code) and it changes also the public API i would keep the lists separated.

The benefits I see from 2 separate lists:

  • not breaking the public api (mentioned above)
  • we give more detailed information what is going on. Missing columns vs null values is a bit different and might help in debugging in future.

So I would leave it as it is.

Copy link
Contributor

@sfc-gh-tzhang sfc-gh-tzhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@@ -442,6 +442,26 @@ Set<String> verifyInputColumns(
"Values for all non-nullable columns must be specified, rowIndex:%d", rowIndex));
}

// Check for null values in not-nullable columns in the row
List<String> nullValueNotNullCols = new ArrayList<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with combing it and we will need to update the name to make it clear, the only concern is that we're updating a public API, but I highly doubt any customer is using this.

Comment on lines +446 to +452
List<String> nullValueNotNullCols = new ArrayList<>();
for (String columnName : this.nonNullableFieldNames) {
if (inputColNamesMap.containsKey(columnName)
&& row.get(inputColNamesMap.get(columnName)) == null) {
nullValueNotNullCols.add(statsMap.get(columnName).getColumnDisplayName());
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, makes sense to me, my concern was that in the worst case, we're iterating through all the data again to check this which is not ideal so I was hoping that we can do this as part of data validation.

@sfc-gh-wtrefon sfc-gh-wtrefon merged commit b1318be into master Apr 17, 2024
15 checks passed
@sfc-gh-wtrefon sfc-gh-wtrefon deleted the wtrefon-SNOW-1167007-null-values-for-not-null-cols branch April 17, 2024 14:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants