Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix/466 standardization support for second fractions #679

Merged

Conversation

benedeki
Copy link
Collaborator

  • Class Section to define parts of the pattern string and string to convert
  • Implicit classes additional logic to enable String and Columns to operate with Section class
  • Actual implementation of the support for second fractions
  • S, i and n placeholders in pattern
  • EnceladusDateTimeParser support for the new placeholders
  • Standardization (TypeParser) support for the new placeholders
  • Fixing and enhancing UTs for the new second fractions placeholders
  • Validators to support the new second fractions placeholders
  • README.md update

* Class Section to define parts of the pattern string and string to convert
* Implicit classes additional logic to enable String and Columns to operate with Section class
* Actual implementation of the support for second fractions
* S, i and n placeholders in pattern
* EnceladusDateTimeParser support for the new placeholders
* Standardization  (TypeParser) support for the new placeholders
* Fixing and enhancing UTs for the new second fractions placeholders
* Validators to support the new second fractions placeholders
@benedeki benedeki added bug Something isn't working feature New feature Standardization Standardization Job affected labels Jul 17, 2019
@benedeki benedeki self-assigned this Jul 17, 2019
Copy link
Contributor

@Zejnilovic Zejnilovic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I overestimated myself, I am going to rest and finish this tomorrow. Sorry for splitting it.

README.md Outdated Show resolved Hide resolved
README.md Outdated
| _epoch_ | Seconds since 1970/01/01 00:00:00 | 1557136493|
| _milliepoch_ | Milliseconds since 1970/01/01 00:00:00.0000| 15571364938124 |
| _epoch_ | Seconds since 1970/01/01 00:00:00 | 1557136493, 1557136493.136|
| _epochmilli_ | Milliseconds since 1970/01/01 00:00:00.0000| 15571364938124, 15571364938124.001 |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does milli have 4 more places

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo


**NB!** Spark uses US Locale and because on-the-fly conversion would be complicated, at the moment we stick to this
hardcoded locale as well. E.g. `am/pm` for `a` placeholder, English names of days and months etc.

**NB!** The keywords are case **insensitive**. Therefore, there is no difference between `epoch` and `EpoCH`.

**NB!** While _nanoseconds_ designation is supported on input, it's not supported in storage or further usage. So any
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please put some * or something to the format or something, that would lead people to this point.

val interim: Column = to_timestamp(stringColumn, pattern)
pattern.defaultTimeZone.map(to_utc_timestamp(interim, _)).getOrElse(interim)
if (pattern.containsSecondFractions) {
//this is a trick how to enforce fractions of seconds into teh timestamp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//this is a trick how to enforce fractions of seconds into teh timestamp
//this is a trick how to enforce fractions of seconds into the timestamp


object ColumnImplicits {
implicit class ColumnEnhancements(column: Column) {
def isInfinite: Column = {
column.isin(Double.PositiveInfinity, Double.NegativeInfinity)
}

def zeroBasedSubstr(startPos: Int): Column = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you add a basic doc for public members?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, forgot to put them here.

@benedeki
Copy link
Collaborator Author

I separated the logical chain of dependencies into separate commits, in hope it would help understanding this PR.

The basic logic of the solution is, that the the pattern is checked for the presence of milli-, micro- and/or nanoseconds. If not found, nothing special is needed.
If present their location is identified withing the pattern (class `Section'), The parsing is then done in two steps:

  1. The second fractions parts are removed both from the pattern and source and parsed - this is a timestamp with seconds precision
  2. The fractions are extracted from the source and divided to be the appropriate decimal value of a second; then added to the timestamp.
    Unfortunately there are complications, I failed to take into account and discovered during testing. See bug If Timestamp/Date pattern contains both literal or variable length entry and fractions of second placeholders it's likely to cause error #677. Still this enhances the current possibilities of parsing, the errors are kind of corner-cases. Therefore I think this is worth to include, despite the known bug.

Btw, epochmicro and epochnano added as well. Furthermore, they don't have to be whole numbers only, can be decimal.

@@ -22,6 +22,21 @@ import scala.annotation.tailrec
object StringImplicits {
implicit class StringEnhancements(string: String) {

def injectWith(what: String, where: Int): String = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add docs here as well and explanation if the string is added on the position where or after it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍
I wonder if to keep it though. I created the methods, but turned out, I don't need it. Wondered if its useful enough to keep. Opinions?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're not using it, I'd say remove it and consider putting it in commons instead.

private def makePreciseTimestamp(seconds: Long, nanoseconds: Int): Timestamp = {
val result = new Timestamp(seconds * MillisecondsInSecond)
if (nanoseconds > 0) {
result.setNanos(nanoseconds)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
result.setNanos(nanoseconds)
result.setNanos(nanoseconds)

Copy link
Collaborator

@GeorgiChochov GeorgiChochov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to need a tutorial on this one lol

…mments

* Removed `injectWith` method from `StringImplicits` as it's not used
* `EnceladusDateTimeParser.parseDate` now trims time of the date information
* Improved some existing added missing classes and methods documentation
* Better UTs
* README.md better wording
* Code reordering
* Typos
| _milliepoch_ | Milliseconds since 1970/01/01 00:00:00.0000| 15571364938124 |
| placeholder | Description | Example | Note |
| --- | --- | --- | --- |
| G | Era designator | AD | |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that we have such a good description of pattern characters. I had to look up for this every time I needed to write a pattern. Now README is the place to look. 👍

Copy link
Contributor

@Zejnilovic Zejnilovic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments as discussed with @benedeki, @yruslan, @GeorgiChochov

}

/**
* Metrics defined on Section
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Number of positions between two sections

(start min forString.length, start + length)
} else {
val startIndex = forString.length + start
if (startIndex >= 0) { //
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (startIndex >= 0) { //
if (startIndex >= 0) {

(0, length + startIndex max 0)
}
}
(realStart, after min forString.length)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change lines 72, 75 from infix notation to prefix notation

* @param fromString the string to apply the section to
* @return substring defined by this section
*/
def extract(fromString: String): String = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def extract(fromString: String): String = {
def extractFrom(string: String): String = {

Do a similar change the rest of methods as consulted at the meeting

* @return None - if one Section has a negative start and the other positive or zero
* The end of the smaller section subtracted from the start of the greater one (see comparison), e.g. can be negative
*/
def distance(that: Section): Option[Int] = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def distance(that: Section): Option[Int] = {
def distance(secondSection: Section): Option[Int] = {

* The end of the smaller section subtracted from the start of the greater one (see comparison), e.g. can be negative
*/
def distance(that: Section): Option[Int] = {
def subtractLike(from: Section, what: Section): Int = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def subtractLike(from: Section, what: Section): Int = {
def calculateDistance(first: Section, second: Section): Int = {

}

object Section {
def fromIndexes(start: Int, end: Int): Section = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add comments to public methods


object Section {
def fromIndexes(start: Int, end: Int): Section = {
Section(start, end - start + 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a check that they are in the right order

Section(start, end - start + 1)
}

def ofSameChars(fromString: String, fromIndex: Int): Section = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def ofSameChars(fromString: String, fromIndex: Int): Section = {
def ofSameChars(inputString: String, signedIndex: Int): Section = {

}

def ofSameChars(fromString: String, fromIndex: Int): Section = {
val realFromIndex = if (fromIndex >= 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
val realFromIndex = if (fromIndex >= 0) {
val index = if (fromIndex >= 0) {

…mments

* Section.fromIndexes is now agnostic to input parameters order
* Using rather Math.min/max instead of infix operators
* Better method names for Section class
* Better method parameter names for Section class
* Better test descriptions for SectionSuite class
assert(result == sections)
}

test("mergeSections: touching") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe illustrate an example of to with the explanation why for the given input we get a particular output?

For instance,

For a string:
1234567890ACDFEFGHIJKLMNOPQUSTUVWXYZ
 ^ ^                            ^^
 | | ^^                         |
 | | Section(5,1)               Section(-4,2)
 | Section(3,2)
 Section(1,1)

Output of the merge:
...

etc.

That is so we won't forget your explanation.


val secondFractionsSections: Seq[Section]
val patternWithoutSecondFractions: String
def containsSecondFractions: Boolean = secondFractionsSections.nonEmpty
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is not changing value and si called a few times, it could be turned into lazy val instead of def.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lazy having bigger overhead than this simple expression, which might very likely be even inligned by the compiler.

}

/**
* Inverse function for `remove`, inserts the `what` string into the `into` string as defined by the `section`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the length of the string have to be the same as the length of the section?
Add constraints description to the docs please

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, good idea 👍

} else {
val sorted = input.sorted
sorted.tail.foldLeft(List(sorted.head)) { (resultAcc, item) =>
if (item touches resultAcc.head) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please change from infix notation to prefix notation

Suggested change
if (item touches resultAcc.head) {
if (item.touches(resultAcc.head)) {

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to admit I really like the inflix here. Makes it easy to read, no issues with precedence. You mind it much? 😉

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind it, I would just keep everything the "same".

* @param sections the sections to merge
* @return an ordered from greater to smaller sequence of distinct sections (their distance is at least 1 or undefined)
*/
def mergeSections(sections: Seq[Section]): Seq[Section] = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def mergeSections(sections: Seq[Section]): Seq[Section] = {
def mergeTouchingSections(sections: Seq[Section]): Seq[Section] = {

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eh, sorry the name comes off as misleading IMO.
And sequence of Sections can be provided. Not just "touching" ones, distinct, overlapping, one included in other, same ...

Copy link
Contributor

@Zejnilovic Zejnilovic Jul 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merge sections seems like you are going to merge all of them and allow gaps in between them. Let's think of something then.

…mments

* Removed exception throwing from the code
* Function renamed to better express purpose -> mergeSections->mergeTouchingSectionsAndSort
* Further improvement on comments
@benedeki
Copy link
Collaborator Author

Wanted to change the microsecond placeholder to u (widely used shorthand for micro prefix). Unfortunately turns out, u is already being used for day of the week. But it made me spot the bug #707

@benedeki benedeki merged commit cdd35a7 into develop Jul 27, 2019
@benedeki benedeki deleted the bugfix/466-standardization-support-for-second-fractions branch July 27, 2019 11:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working feature New feature Standardization Standardization Job affected
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants