[SPARK-44732][XML][FOLLOWUP] Partial backport of spark-xml "Shortcut common type inference cases to fail fast"

### What changes were proposed in this pull request?

Partial back-port of databricks/spark-xml@994e357?diff=split from spark-xml

### Why are the changes needed?

Though no more development was intended on spark-xml, there was a non-trivial improvement to inference speed that I committed anyway to resolve a customer issue. Part of it can be 'backported' here to sync the code. I attached this as a follow-up to the main code port JIRA. There is still, in general, no intent to commit more to spark-xml in the meantime unless it's significantly important.

### Does this PR introduce _any_ user-facing change?

No, this should only speed up schema inference without behavior change.

### How was this patch tested?

Tested in spark-xml, and will be tested by tests here too

Closes #42844 from srowen/SPARK-44732.2.

Authored-by: Sean Owen <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
apache · Sep 7, 2023 · a37c265
1 parent b8b58e0
Showing 1 changed file with 16 additions and 0 deletions.
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/TypeCast.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/TypeCast.scala
@@ -155,6 +155,12 @@ private[sql] object TypeCast {
     } else {
       value
     }
+    // A little shortcut to avoid trying many formatters in the common case that
+    // the input isn't a double. All built-in formats will start with a digit or period.
+    if (signSafeValue.isEmpty ||
+      !(Character.isDigit(signSafeValue.head) || signSafeValue.head == '.')) {
+      return false
+    }
     // Rule out strings ending in D or F, as they will parse as double but should be disallowed
     if (value.nonEmpty && (value.last match {
           case 'd' | 'D' | 'f' | 'F' => true
@@ -171,6 +177,11 @@ private[sql] object TypeCast {
     } else {
       value
     }
+    // A little shortcut to avoid trying many formatters in the common case that
+    // the input isn't a number. All built-in formats will start with a digit.
+    if (signSafeValue.isEmpty || !Character.isDigit(signSafeValue.head)) {
+      return false
+    }
     (allCatch opt signSafeValue.toInt).isDefined
   }
 
@@ -180,6 +191,11 @@ private[sql] object TypeCast {
     } else {
       value
     }
+    // A little shortcut to avoid trying many formatters in the common case that
+    // the input isn't a number. All built-in formats will start with a digit.
+    if (signSafeValue.isEmpty || !Character.isDigit(signSafeValue.head)) {
+      return false
+    }
     (allCatch opt signSafeValue.toLong).isDefined
   }
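
The shortcut added in the hunks above can be sketched in isolation as follows. This is a minimal sketch, not the code in `TypeCast.scala`: the object name `FailFastSketch` and the helper `stripSign` are hypothetical, and the sign handling is an assumption, since the diff shows only the tail of that branch. The idea is that a cheap first-character test rejects most non-numeric strings before the parser is invoked and an exception is thrown and caught.

```scala
import scala.util.control.Exception.allCatch

// Hypothetical standalone object for illustration; the real logic lives in TypeCast.scala.
object FailFastSketch {

  // Assumption: a single leading '+' or '-' is dropped before the check,
  // so the first-character test looks at the number itself.
  private def stripSign(value: String): String =
    if (value.startsWith("+") || value.startsWith("-")) value.substring(1) else value

  def isLong(value: String): Boolean = {
    val signSafeValue = stripSign(value)
    if (signSafeValue.isEmpty || !Character.isDigit(signSafeValue.head)) {
      // Fail fast: nothing that starts with a non-digit can parse as a long,
      // so skip the parse attempt and its exception handling entirely.
      false
    } else {
      // Slow path: only strings that start with a digit reach the parser.
      (allCatch opt signSafeValue.toLong).isDefined
    }
  }
}
```

With this sketch, inputs such as `"hello"`, `""`, or `"-"` return `false` without ever calling `toLong`, while `"12abc"` still passes the first-character test and is rejected by the parse attempt. The sketch uses an `if`/`else` expression rather than the early `return` seen in the diff; the two forms are equivalent here.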