Bugfix/466 standardization support for second fractions #679

benedeki · 2019-07-17T14:59:28Z

* Class Section to define parts of the pattern string and of the string to convert
* Additional logic in implicit classes to enable String and Column to operate with the Section class
* Actual implementation of the support for second fractions
  * S, i and n placeholders in pattern
  * EnceladusDateTimeParser support for the new placeholders
  * Standardization (TypeParser) support for the new placeholders
* Fixing and enhancing UTs for the new second fractions placeholders
* Validators to support the new second fractions placeholders
* README.md update

Zejnilovic

I overestimated myself, I am going to rest and finish this tomorrow. Sorry for splitting it.

README.md

Zejnilovic · 2019-07-17T19:27:18Z

README.md

-| _epoch_ | Seconds since 1970/01/01 00:00:00 | 1557136493|
-| _milliepoch_ | Milliseconds since 1970/01/01 00:00:00.0000| 15571364938124 |
+| _epoch_ | Seconds since 1970/01/01 00:00:00 | 1557136493, 1557136493.136|
+| _epochmilli_ | Milliseconds since 1970/01/01 00:00:00.0000| 15571364938124, 15571364938124.001 |


Why does the milli example have 4 more digits than the epoch one?
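For reference, a 10-digit epoch value in seconds corresponds to a 13-digit value in milliseconds, i.e. three more digits. A minimal sketch of converting a decimal epoch value such as `1557136493.136` into a timestamp follows. `EpochExample` and `epochToTimestamp` are hypothetical names used for illustration only; this is not the PR's implementation.

```scala
import java.sql.Timestamp

object EpochExample {
  // Hypothetical helper: converts epoch seconds with an optional decimal
  // fraction (e.g. "1557136493.136") to a java.sql.Timestamp.
  def epochToTimestamp(epochSeconds: String): Timestamp = {
    val value = BigDecimal(epochSeconds)
    // Whole seconds, rounded towards negative infinity so the fraction stays non-negative
    val wholeSeconds = value.setScale(0, BigDecimal.RoundingMode.FLOOR)
    // Remaining fraction of a second expressed in nanoseconds
    val nanos = ((value - wholeSeconds) * BigDecimal(1000000000L)).toInt
    val result = new Timestamp(wholeSeconds.toLong * 1000L)
    result.setNanos(nanos)
    result
  }
}
```

With this sketch, `1557136493.136` seconds and `1557136493136` milliseconds describe the same instant.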

Zejnilovic · 2019-07-17T19:29:27Z

README.md


 **NB!** Spark uses US Locale and because on-the-fly conversion would be complicated, at the moment we stick to this 
 hardcoded locale as well. E.g. `am/pm` for `a` placeholder, English names of days and months etc.

 **NB!** The keywords are case **insensitive**. Therefore, there is no difference between `epoch` and `EpoCH`.
+
+**NB!** While _nanoseconds_ designation is supported on input, it's not supported in storage or further usage. So any


Please put a `*` or some other marker next to the format, so that people are led to this note.

Zejnilovic · 2019-07-17T19:39:00Z

...tion/src/main/scala/za/co/absa/enceladus/standardization/interpreter/stages/TypeParser.scala

-      val interim: Column = to_timestamp(stringColumn, pattern)
-      pattern.defaultTimeZone.map(to_utc_timestamp(interim, _)).getOrElse(interim)
+      if (pattern.containsSecondFractions) {
+        //this is a trick how to enforce fractions of seconds into teh timestamp


Suggested change

//this is a trick how to enforce fractions of seconds into teh timestamp

//this is a trick how to enforce fractions of seconds into the timestamp

utils/src/main/scala/za/co/absa/enceladus/utils/general/Section.scala

Zejnilovic · 2019-07-17T19:50:52Z

utils/src/main/scala/za/co/absa/enceladus/utils/implicits/ColumnImplicits.scala


 object ColumnImplicits {
  implicit class ColumnEnhancements(column: Column) {
    def isInfinite: Column = {
      column.isin(Double.PositiveInfinity, Double.NegativeInfinity)
    }
+
+    def zeroBasedSubstr(startPos: Int): Column = {


Would you add a basic doc for public members?

Thanks, forgot to put them here.

benedeki · 2019-07-17T21:22:37Z

I separated the logical chain of dependencies into separate commits, in the hope it would help with understanding this PR.

The basic logic of the solution is that the pattern is checked for the presence of milli-, micro- and/or nanoseconds. If none are found, nothing special is needed.
If present, their location within the pattern is identified (class `Section`). The parsing is then done in two steps:

1. The second fraction parts are removed from both the pattern and the source, and the rest is parsed; this yields a timestamp with seconds precision.
2. The fractions are extracted from the source and divided to get the appropriate decimal value of a second, then added to the timestamp.

Unfortunately, there are complications that I failed to take into account and discovered only during testing. See bug #677, "If Timestamp/Date pattern contains both literal or variable length entry and fractions of second placeholders it's likely to cause error". Still, this enhances the current parsing possibilities, and the errors are corner cases. Therefore I think this is worth including despite the known bug.

Btw, epochmicro and epochnano were added as well. Furthermore, the epoch values don't have to be whole numbers; they can be decimal.
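The two steps described above can be sketched roughly as follows. This is a simplified illustration, not the PR's code: it uses plain `java.text.SimpleDateFormat` instead of `Section` and `EnceladusDateTimeParser`, handles a single contiguous run of `S` only (no `i` or `n` placeholders), treats the matching digits as plain decimal digits of a second, and assumes every placeholder before the fraction has a fixed width so that pattern and source positions line up (the limitation behind #677). All names in the block are hypothetical.

```scala
import java.sql.Timestamp
import java.text.SimpleDateFormat
import java.util.TimeZone

object TwoStepParseExample {
  // Removes `length` characters starting at `start`
  private def removeAt(text: String, start: Int, length: Int): String =
    text.take(start) + text.drop(start + length)

  // Parses with seconds precision; UTC is assumed to keep the sketch deterministic
  private def parseToMillis(pattern: String, source: String): Long = {
    val format = new SimpleDateFormat(pattern)
    format.setTimeZone(TimeZone.getTimeZone("UTC"))
    format.parse(source).getTime
  }

  def parse(pattern: String, source: String): Timestamp = {
    val start = pattern.indexOf('S')
    if (start < 0) {
      // No second fractions in the pattern, nothing special is needed
      new Timestamp(parseToMillis(pattern, source))
    } else {
      val length = pattern.drop(start).takeWhile(_ == 'S').length
      // Step 1: strip the fraction from pattern and source, parse the rest
      val wholeSeconds = parseToMillis(removeAt(pattern, start, length), removeAt(source, start, length))
      // Step 2: scale the extracted digits to nanoseconds and add them
      val digits = source.substring(start, start + length)
      val nanos = (BigDecimal("0." + digits) * BigDecimal(1000000000L)).toInt
      val result = new Timestamp(wholeSeconds)
      result.setNanos(nanos)
      result
    }
  }
}
```

For example, `TwoStepParseExample.parse("yyyy-MM-dd HH:mm:ss.SSS", "2019-07-17 14:59:28.123")` parses the seconds-precision part first and then sets the nanoseconds from the digits `123`.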

Zejnilovic · 2019-07-18T06:32:24Z

utils/src/main/scala/za/co/absa/enceladus/utils/implicits/StringImplicits.scala

@@ -22,6 +22,21 @@ import scala.annotation.tailrec
 object StringImplicits {
  implicit class StringEnhancements(string: String) {

+    def injectWith(what: String, where: Int): String = {


Please add docs here as well, and explain whether the string is inserted at position `where` or after it.

👍
I wonder whether to keep it, though. I created the method, but it turned out I don't need it. Is it useful enough to keep? Opinions?

If you're not using it, I'd say remove it and consider putting it in commons instead.

Zejnilovic · 2019-07-18T06:56:03Z

utils/src/main/scala/za/co/absa/enceladus/utils/time/EnceladusDateTimeParser.scala

+  private def makePreciseTimestamp(seconds: Long, nanoseconds: Int): Timestamp = {
+    val result = new Timestamp(seconds * MillisecondsInSecond)
+    if (nanoseconds > 0) {
+    result.setNanos(nanoseconds)


Suggested change

    result.setNanos(nanoseconds)

      result.setNanos(nanoseconds)

GeorgiChochov

I'm going to need a tutorial on this one lol

README.md

...rc/test/scala/za/co/absa/enceladus/utils/validation/field/FieldValidatorTimestampSuite.scala

utils/src/test/scala/za/co/absa/enceladus/utils/validation/field/FieldValidatorDateSuite.scala

utils/src/test/scala/za/co/absa/enceladus/utils/general/SectionSuite.scala

utils/src/main/scala/za/co/absa/enceladus/utils/general/Section.scala

…mments

* Removed `injectWith` method from `StringImplicits` as it's not used
* `EnceladusDateTimeParser.parseDate` now trims the time part of the date information
* Improved some existing and added missing class and method documentation
* Better UTs
* README.md better wording
* Code reordering
* Typos

yruslan · 2019-07-22T07:59:57Z

README.md

-| _milliepoch_ | Milliseconds since 1970/01/01 00:00:00.0000| 15571364938124 |
+| placeholder | Description | Example | Note |
+| --- | --- | --- | --- |
+| G | Era designator | AD | |


I like that we have such a good description of pattern characters. I had to look this up every time I needed to write a pattern. Now the README is the place to look. 👍

Zejnilovic

Comments as discussed with @benedeki, @yruslan, @GeorgiChochov

Zejnilovic · 2019-07-23T14:01:10Z

utils/src/main/scala/za/co/absa/enceladus/utils/general/Section.scala

+  }
+
+  /**
+    * Metrics defined on Section


Number of positions between two sections

Zejnilovic · 2019-07-23T14:01:31Z

utils/src/main/scala/za/co/absa/enceladus/utils/general/Section.scala

+        (start min forString.length, start + length)
+      } else {
+        val startIndex = forString.length + start
+        if (startIndex >= 0) { //


Suggested change

if (startIndex >= 0) { //

if (startIndex >= 0) {

Zejnilovic · 2019-07-23T14:02:26Z

utils/src/main/scala/za/co/absa/enceladus/utils/general/Section.scala

+          (0, length + startIndex max 0)
+        }
+      }
+    (realStart, after min forString.length)


Change lines 72 and 75 from infix notation to prefix (function-call) notation.
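As a small illustration of why the infix form can be hard to read: in Scala, alphanumeric operators such as `min` and `max` have the lowest precedence in infix position, so `length + startIndex max 0` groups as `(length + startIndex) max 0`. The values below are made up for the example.

```scala
object InfixPrecedenceExample {
  val length = 3
  val startIndex = -5

  // Parses as (length + startIndex) max 0, because `+` binds tighter than `max`
  val infix: Int = length + startIndex max 0

  // The same computation with an explicit function call
  val explicit: Int = Math.max(length + startIndex, 0)

  // A different grouping has to be spelled out with parentheses
  val regrouped: Int = length + (startIndex max 0)
}
```

Here `infix` and `explicit` are both `0`, while `regrouped` is `3`.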

Zejnilovic · 2019-07-23T14:03:16Z

utils/src/main/scala/za/co/absa/enceladus/utils/general/Section.scala

+    * @param fromString the string to apply the section to
+    * @return           substring defined by this section
+    */
+  def extract(fromString: String): String = {


Suggested change

def extract(fromString: String): String = {

def extractFrom(string: String): String = {

Make a similar change to the rest of the methods, as discussed at the meeting.

Zejnilovic · 2019-07-23T14:03:41Z

utils/src/main/scala/za/co/absa/enceladus/utils/general/Section.scala

+    * @return     None - if one Section has a negative start and the other positive or zero
+    *             The end of the smaller section subtracted from the start of the greater one (see comparison), e.g. can be negative
+    */
+  def distance(that: Section): Option[Int] = {


Suggested change

def distance(that: Section): Option[Int] = {

def distance(secondSection: Section): Option[Int] = {

Zejnilovic · 2019-07-23T14:07:17Z

utils/src/main/scala/za/co/absa/enceladus/utils/general/Section.scala

+    *             The end of the smaller section subtracted from the start of the greater one (see comparison), e.g. can be negative
+    */
+  def distance(that: Section): Option[Int] = {
+    def subtractLike(from: Section, what: Section): Int = {


Suggested change

def subtractLike(from: Section, what: Section): Int = {

def calculateDistance(first: Section, second: Section): Int = {

Zejnilovic · 2019-07-23T14:12:07Z

utils/src/main/scala/za/co/absa/enceladus/utils/general/Section.scala

+}
+
+object Section {
+  def fromIndexes(start: Int, end: Int): Section = {


please add comments to public methods

Zejnilovic · 2019-07-23T14:13:27Z

utils/src/main/scala/za/co/absa/enceladus/utils/general/Section.scala

+
+object Section {
+  def fromIndexes(start: Int, end: Int): Section = {
+    Section(start, end - start + 1)


add a check that they are in the right order
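One way to handle the ordering, sketched under assumptions: instead of rejecting indexes given in the wrong order, the factory can accept them in either order. The `Section` below is a simplified stand-in that models only `start` and `length`; it does not reproduce the PR's class, and mixing negative (from-the-end) and non-negative indexes is not treated specially here.

```scala
object FromIndexesExample {
  // Simplified stand-in for the PR's Section class
  final case class Section(start: Int, length: Int)

  object Section {
    /** Creates a section covering both indexes (inclusive), whichever order they are given in. */
    def fromIndexes(a: Int, b: Int): Section = {
      val first = Math.min(a, b)
      val last = Math.max(a, b)
      Section(first, last - first + 1)
    }
  }
}
```

With this sketch, `fromIndexes(2, 5)` and `fromIndexes(5, 2)` both yield `Section(2, 4)`.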

Zejnilovic · 2019-07-23T14:16:25Z

utils/src/main/scala/za/co/absa/enceladus/utils/general/Section.scala

+    Section(start, end - start + 1)
+  }
+
+  def ofSameChars(fromString: String, fromIndex: Int): Section = {


Suggested change

def ofSameChars(fromString: String, fromIndex: Int): Section = {

def ofSameChars(inputString: String, signedIndex: Int): Section = {

Zejnilovic · 2019-07-23T14:17:11Z

utils/src/main/scala/za/co/absa/enceladus/utils/general/Section.scala

+  }
+
+  def ofSameChars(fromString: String, fromIndex: Int): Section = {
+    val realFromIndex = if (fromIndex >= 0) {


Suggested change

val realFromIndex = if (fromIndex >= 0) {

val index = if (fromIndex >= 0) {

…mments

* Section.fromIndexes is now agnostic to input parameters order
* Using Math.min/max rather than infix operators
* Better method names for Section class
* Better method parameter names for Section class
* Better test descriptions for SectionSuite class

yruslan · 2019-07-24T12:36:11Z

utils/src/test/scala/za/co/absa/enceladus/utils/general/SectionSuite.scala

+    assert(result == sections)
+  }
+
+  test("mergeSections: touching") {


Maybe illustrate this with an example, explaining why we get a particular output for the given input?

For instance,

For a string: `1234567890ACDFEFGHIJKLMNOPQUSTUVWXYZ`, mark the positions covered by Section(1,1), Section(3,2), Section(5,1) and Section(-4,2) underneath it, followed by: Output of the merge: ...

etc.

That is so we won't forget your explanation.
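A rough sketch of what such a merge does, using the three non-negative sections from the example above. It is a simplification under stated assumptions: `Span`, `touches` and `mergeTouching` are hypothetical names, negative (from-the-end) starts are not supported, and the real `Section` logic may differ in details.

```scala
object MergeExample {
  // Hypothetical simplified section with a non-negative start only
  final case class Span(start: Int, length: Int) {
    // First index after the span
    def end: Int = start + length
    // True when the spans overlap or are directly adjacent
    def touches(that: Span): Boolean = start <= that.end && that.start <= end
  }

  // Merges touching spans; the result is ordered from the greatest start to the smallest
  def mergeTouching(spans: Seq[Span]): List[Span] = {
    spans.sortBy(_.start).toList match {
      case Nil => Nil
      case head :: tail =>
        tail.foldLeft(List(head)) { (acc, item) =>
          val last = acc.head
          if (item.touches(last)) {
            val start = Math.min(last.start, item.start)
            val end = Math.max(last.end, item.end)
            Span(start, end - start) :: acc.tail // replace the last span with the merged one
          } else {
            item :: acc
          }
        }
    }
  }
}
```

In this sketch, `Span(3,2)` and `Span(5,1)` are adjacent and merge into `Span(3,3)`, while `Span(1,1)` stays separate because index 2 lies between it and the next span.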

Zejnilovic · 2019-07-24T05:55:35Z

utils/src/main/scala/za/co/absa/enceladus/utils/time/DateTimePattern.scala

+
+  val secondFractionsSections: Seq[Section]
+  val patternWithoutSecondFractions: String
+  def containsSecondFractions: Boolean = secondFractionsSections.nonEmpty


Since this value does not change and is called a few times, it could be turned into a lazy val instead of a def.

A lazy val has a bigger overhead than this simple expression, which will very likely even be inlined by the compiler.
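To make the trade-off concrete, a small sketch of the behavioural difference: a `def` re-evaluates its body on every call, while a `lazy val` evaluates once and caches the result, at the cost of extra initialization bookkeeping. The class names and the counter are made up for the illustration.

```scala
object DefVsLazyValExample {
  // Counts how many times either body below is evaluated
  var evaluations = 0

  class WithDef(items: Seq[Int]) {
    // Evaluated on every call
    def nonEmpty: Boolean = { evaluations += 1; items.nonEmpty }
  }

  class WithLazyVal(items: Seq[Int]) {
    // Evaluated on first access only, then cached
    lazy val nonEmpty: Boolean = { evaluations += 1; items.nonEmpty }
  }
}
```

For an expression as cheap as `secondFractionsSections.nonEmpty`, re-evaluating it is unlikely to matter, which is the point made in the reply above.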

Zejnilovic · 2019-07-24T08:34:43Z

utils/src/main/scala/za/co/absa/enceladus/utils/general/Section.scala

+  }
+
+  /**
+    * Inverse function for `remove`, inserts the `what` string into the `into` string as defined by the `section`


Does the length of the string have to be the same as the length of the section?
Add constraints description to the docs please

Thanks, good idea 👍

Zejnilovic · 2019-07-24T12:21:14Z

utils/src/main/scala/za/co/absa/enceladus/utils/general/Section.scala

+      } else {
+        val sorted = input.sorted
+        sorted.tail.foldLeft(List(sorted.head)) { (resultAcc, item) =>
+          if (item touches resultAcc.head) {


Please change from infix notation to dot (method-call) notation.

Suggested change

if (item touches resultAcc.head) {

if (item.touches(resultAcc.head)) {

I have to admit I really like the infix here. It makes the code easy to read, with no precedence issues. Do you mind it much? 😉

I don't mind it, I would just keep everything the "same".

Zejnilovic · 2019-07-24T12:31:12Z

utils/src/main/scala/za/co/absa/enceladus/utils/general/Section.scala

+    * @param sections the sections to merge
+    * @return         an ordered from greater to smaller sequence of distinct sections (their distance is at least 1 or undefined)
+    */
+  def mergeSections(sections: Seq[Section]): Seq[Section] = {


Suggested change

def mergeSections(sections: Seq[Section]): Seq[Section] = {

def mergeTouchingSections(sections: Seq[Section]): Seq[Section] = {

Eh, sorry, the name comes off as misleading IMO.
Any sequence of Sections can be provided, not just "touching" ones: distinct, overlapping, one included in another, identical ...

`mergeSections` sounds like you are going to merge all of them, even with gaps in between. Let's think of something else then.

…mments

* Removed exception throwing from the code
* Function renamed to better express purpose: mergeSections -> mergeTouchingSectionsAndSort
* Further improvement on comments

benedeki · 2019-07-27T11:04:30Z

I wanted to change the microsecond placeholder to u (a widely used shorthand for the micro prefix). Unfortunately, it turns out u is already used for the day of the week. But it made me spot bug #707.

benedeki added 7 commits July 9, 2019 10:26

#466: Standardization support for second fractions

b7c5baa

#466: Standardization support for second fractions

a280691

* Class Section to define parts of the pattern string and string to convert

#466: Standardization support for second fractions

7785145

* Implicit classes additional logic to enable String and Columns to operate with Section class

#466: Standardization support for second fractions

d995710

* Actual implementation of the support for second fractions
  * S, i and n placeholders in pattern
  * EnceladusDateTimeParser support for the new placeholders
  * Standardization (TypeParser) support for the new placeholders

#466: Standardization support for second fractions

4b6da16

* Fixing and enhancing UTs for the new second fractions placeholders

#466: Standardization support for second fractions

27c7d00

* Validators to support the new second fractions placeholders

#466: Standardization support for second fractions

078b26d

* README.md update

benedeki added labels bug (Something isn't working), feature (New feature), Standardization (Standardization Job affected) Jul 17, 2019

benedeki requested review from yruslan, Zejnilovic and GeorgiChochov July 17, 2019 14:59

benedeki self-assigned this Jul 17, 2019

Zejnilovic reviewed Jul 17, 2019

Zejnilovic reviewed Jul 18, 2019

GeorgiChochov requested changes Jul 18, 2019

yruslan reviewed Jul 22, 2019

Zejnilovic reviewed Jul 23, 2019

yruslan reviewed Jul 24, 2019

Zejnilovic reviewed Jul 24, 2019

#466: Standardization support for second fractions - addressing PR co…

3e7a5eb

GeorgiChochov approved these changes Jul 26, 2019

yruslan approved these changes Jul 26, 2019

benedeki mentioned this pull request Jul 27, 2019

Parsing leniency differs between EnceladusDateTimeParser and Spark #707

Closed

benedeki merged commit cdd35a7 into develop Jul 27, 2019

benedeki deleted the bugfix/466-standardization-support-for-second-fractions branch July 27, 2019 11:49
