
Base feature #1

Merged
merged 16 commits into from
Dec 13, 2021

Conversation

Zejnilovic
Collaborator

This is a PR with base standardization features.

I am not saying this is how it should be on release. There is probably a lot of duplicated code and room for improvement. My idea is to create a number of issues from this PR, maybe even for absa commons.
Even the name and package probably need renaming.

The main class is Standardization.class

Contributor

@AdrianOlosutean AdrianOlosutean left a comment

Correct me if I'm wrong, but all the files under test/resources/interpreter are not needed

object TimeZoneNormalizer {
private val log: Logger = LogManager.getLogger(this.getClass)
private val generalConfig: Config = ConfigFactory.load()
Contributor

It could be brought up to date with the latest version in Enceladus

Contributor

@dk1844 dk1844 Dec 2, 2021

I keep wondering: we have a ConfigReader that allows passing an existing config -- and we sometimes use it, and sometimes, e.g. here, we don't.

Is there decision logic in place? (In certain cases, you may be fine with hard-coded config loading, but with a library, we may want to be cautious.) Probably a bit of a discussion point.

//edit:
-> #8
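As a hypothetical illustration of the config-injection pattern under discussion (the names below are made up; the actual ConfigReader in the codebase may look different), the idea is to accept a config as a parameter instead of hard-coding ConfigFactory.load():

```scala
// Hypothetical sketch of config injection vs. hard-coded loading.
// `Config` is a stand-in type here (a plain Map); the real code uses Typesafe Config.
object ConfigInjectionSketch {
  type Config = Map[String, String]

  // Stands in for ConfigFactory.load(): the classpath-based default.
  private def loadDefault(): Config = Map("timezone" -> "UTC")

  // Taking the config as a parameter (defaulting to the classpath load)
  // lets a library user supply their own settings instead of being
  // locked to whatever is on the application classpath.
  def timeZone(config: Config = loadDefault()): String =
    config.getOrElse("timezone", "UTC")
}
```

With a default argument, existing callers keep working unchanged, while library users can pass a pre-built config explicitly.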

Contributor

-> #8 (also a discussion point)

final val stdNullErr = "stdNullErr"
final val stdSchemaErr = "stdSchemaErr"

final val confMappingErr = "confMappingErr"
Contributor

Same as with the functions, the conf names should be removed, either in this PR or in another one

Contributor

Due to a slight complication with SparkUtils.withColumnIfDoesNotExist (tied to conformance), let's discuss and solve separately => #6

src/test/resources/application.conf (outdated; resolved)
Contributor

@dk1844 dk1844 left a comment

Awesome effort on the library base! This is a must for broader acceptance of the standardization tools by other parties.

I have found some imperfections/suggestion points, see comments below.

It was a bit challenging to follow the exact changes made when originating from Enceladus, but I mostly found a way to do so.

Other notes

  • It seems that some test resources are not used:
    • src/test/resources/data/standardization_*
    • src/test/resources/data/interpreter/*

(Just read the code, haven't run it)

src/main/scala/za/co/absa/standardization/JsonUtils.scala (outdated; resolved)

.editorconfig (resolved)
build.sbt (outdated; resolved)
src/main/scala/za/co/absa/standardization/Constants.scala (outdated; resolved)
Comment on lines +36 to +38
(implicit sparkSession: SparkSession): DataFrame = {
implicit val udfLib: UDFLibrary = new UDFLibrary
implicit val hadoopConf: Configuration = sparkSession.sparkContext.hadoopConfiguration
Contributor

Suggested change:

- (implicit sparkSession: SparkSession): DataFrame = {
-   implicit val udfLib: UDFLibrary = new UDFLibrary
-   implicit val hadoopConf: Configuration = sparkSession.sparkContext.hadoopConfiguration
+ (implicit sparkSession: SparkSession, hadoopConf: Configuration): DataFrame = {
+   implicit val udfLib: UDFLibrary = new UDFLibrary

I would suggest making the Hadoop Configuration an implicit param as well in order to allow customizations of the config. To make it easier for the user, so they don't have to define

implicit val hdpCnf = sparkSession.sparkContext.hadoopConfiguration

every time they don't want to customize, you could perhaps create a convenience object with a ready-to-use implicit, e.g.

object HadoopConfImplicits {
  // note: sparkSession needs to be in scope here; taking it implicitly makes the sketch compile
  implicit def DefaultHadoopConfig(implicit sparkSession: SparkSession): Configuration =
    sparkSession.sparkContext.hadoopConfiguration
}

and the no-customization users can just call:

import HadoopConfImplicits.DefaultHadoopConfig 

Contributor

Discussion point: this.

Contributor

I like the suggestion. It could be even easier than an import, though: I believe implicit parameters can have default values.
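A minimal sketch of that Scala 2 behavior (the types here are placeholders, not the library's real ones): when no matching implicit is found at the call site, the default value of the implicit parameter is used, while an implicit in scope takes precedence.

```scala
// Hypothetical sketch: implicit parameters with default values (Scala 2).
object ImplicitDefaultSketch {
  // Placeholder standing in for org.apache.hadoop.conf.Configuration.
  final case class HadoopConf(name: String)

  // If no implicit HadoopConf is in scope at the call site, the default is used.
  def standardize(data: String)(implicit conf: HadoopConf = HadoopConf("default")): String =
    s"$data standardized with ${conf.name}"
}
```

Calling `ImplicitDefaultSketch.standardize("df")` with nothing in scope picks the default, whereas declaring `implicit val custom: HadoopConf = HadoopConf("custom")` before the call overrides it, so no convenience import is needed for the common case.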

dk1844 added a commit that referenced this pull request Dec 7, 2021
…oString returning `Boolean` vs `scala.Boolean`
…oString returning `Boolean` vs `scala.Boolean` + test
dk1844 added a commit that referenced this pull request Dec 8, 2021
build.sbt (outdated; resolved)
Contributor

@AdrianOlosutean AdrianOlosutean left a comment

Looks good enough for now

@dk1844
Contributor

dk1844 commented Dec 9, 2021

Just my changes on this PR: https://github.com/AbsaOSS/spark-data-standardization/pull/1/files/dedf3a591932e06d77a61e7ff3c1cd2ffe951215..3dbe9fb094408dbd0788100a3dc72f63b772757f

This was referenced Dec 10, 2021
@dk1844 dk1844 requested a review from benedeki December 10, 2021 12:15
@dk1844 dk1844 mentioned this pull request Dec 10, 2021
@dk1844 dk1844 merged commit ed7c2aa into master Dec 13, 2021
@dk1844 dk1844 deleted the base-feature branch December 14, 2021 08:07
4 participants