[SPARK-44732][XML][FOLLOWUP] Partial backport of spark-xml "Shortcut common type inference cases to fail fast"

### What changes were proposed in this pull request?

Partial back-port of databricks/spark-xml@994e357 from spark-xml.

### Why are the changes needed?

Though no more development was intended on spark-xml, there was a non-trivial improvement to inference speed that I committed anyway to resolve a customer issue. Part of it can be 'backported' here to sync the code. I attached this as a follow-up to the main code port JIRA.

There is still, in general, no intent to commit more to spark-xml in the meantime unless it's significantly important.

### Does this PR introduce _any_ user-facing change?

No, this should only speed up schema inference without behavior change.

### How was this patch tested?

Tested in spark-xml, and will be covered by the tests here too.

Closes #42844 from srowen/SPARK-44732.2.

Authored-by: Sean Owen <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
srowen committed Sep 7, 2023
1 parent b8b58e0 commit a37c265
Showing 1 changed file with 16 additions and 0 deletions.
@@ -155,6 +155,12 @@ private[sql] object TypeCast {
     } else {
       value
     }
+    // A little shortcut to avoid trying many formatters in the common case that
+    // the input isn't a double. All built-in formats will start with a digit or period.
+    if (signSafeValue.isEmpty ||
+      !(Character.isDigit(signSafeValue.head) || signSafeValue.head == '.')) {
+      return false
+    }
     // Rule out strings ending in D or F, as they will parse as double but should be disallowed
     if (value.nonEmpty && (value.last match {
       case 'd' | 'D' | 'f' | 'F' => true
@@ -171,6 +177,11 @@ private[sql] object TypeCast {
     } else {
       value
     }
+    // A little shortcut to avoid trying many formatters in the common case that
+    // the input isn't a number. All built-in formats will start with a digit.
+    if (signSafeValue.isEmpty || !Character.isDigit(signSafeValue.head)) {
+      return false
+    }
     (allCatch opt signSafeValue.toInt).isDefined
   }

@@ -180,6 +191,11 @@ private[sql] object TypeCast {
     } else {
       value
     }
+    // A little shortcut to avoid trying many formatters in the common case that
+    // the input isn't a number. All built-in formats will start with a digit.
+    if (signSafeValue.isEmpty || !Character.isDigit(signSafeValue.head)) {
+      return false
+    }
     (allCatch opt signSafeValue.toLong).isDefined
   }
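
For readers who want to try the fail-fast idea outside of Spark, the integer check from the hunks above can be sketched as a standalone snippet. This is only an illustration under stated assumptions: the object name `FailFastSketch` and the helper `stripSign` are hypothetical and do not appear in the commit, and the hunks do not show how `TypeCast` declares these methods. In this sketch the only work the shortcut skips is the failed `toInt` parse and the exception it throws.

```scala
import scala.util.control.Exception.allCatch

// Hypothetical standalone object for illustration; the real checks live in TypeCast.
object FailFastSketch {
  // Hypothetical helper: drop one leading '+' or '-' so the first-character
  // test below looks at the number itself.
  private def stripSign(value: String): String =
    if (value.startsWith("+") || value.startsWith("-")) value.substring(1) else value

  // Integer check with the shortcut: input that is empty or does not start
  // with a digit is rejected before the exception-based parse is attempted.
  def isInt(value: String): Boolean = {
    val signSafeValue = stripSign(value)
    if (signSafeValue.isEmpty || !Character.isDigit(signSafeValue.head)) {
      false
    } else {
      allCatch.opt(signSafeValue.toInt).isDefined
    }
  }
}
```

With this sketch, `FailFastSketch.isInt("42")` and `FailFastSketch.isInt("-7")` return `true`, while `FailFastSketch.isInt("abc")`, `FailFastSketch.isInt("")` and `FailFastSketch.isInt("12abc")` return `false`. Only the last of these reaches the `toInt` call, because it passes the first-character test and then fails to parse.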
