
Base feature #1

Merged
merged 16 commits into from
Dec 13, 2021

Conversation

Zejnilovic
Collaborator

This is a PR with base standardization features.

I am not saying this is how it should be on release. There is probably a lot of duplicated code and room for improvement. My idea is to create a number of issues from this PR, maybe even for absa commons.
Even the name and package probably need renaming.

The main class is Standardization.class

Contributor

@AdrianOlosutean AdrianOlosutean left a comment

Correct me if I'm wrong, but all the files under test/resources/interpreter are not needed

object TimeZoneNormalizer {
private val log: Logger = LogManager.getLogger(this.getClass)
private val generalConfig: Config = ConfigFactory.load()
Contributor

It could be brought up to date with the latest version in Enceladus

Contributor

@dk1844 dk1844 Dec 2, 2021

I keep wondering: we have a ConfigReader that allows passing an existing config -- and we sometimes use it, and sometimes, e.g. here, we don't.

Is there decision logic in place? (In certain cases, you may be fine with hard-coded config loading, but with a library, we may want to be cautious.) Probably a bit of a discussion point.

//edit:
-> #8
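As a hypothetical illustration of the config-injection pattern under discussion (the names below are made up; the actual ConfigReader in the codebase may look different), the idea is to accept a config as a parameter instead of hard-coding ConfigFactory.load():

```scala
// Hypothetical sketch of config injection vs. hard-coded loading.
// `Config` is a stand-in type here (a plain Map); the real code uses Typesafe Config.
object ConfigInjectionSketch {
  type Config = Map[String, String]

  // Stands in for ConfigFactory.load(): the classpath-based default.
  private def loadDefault(): Config = Map("timezone" -> "UTC")

  // Taking the config as a parameter (defaulting to the classpath load)
  // lets a library user supply their own settings instead of being
  // locked to whatever is on the application classpath.
  def timeZone(config: Config = loadDefault()): String =
    config.getOrElse("timezone", "UTC")
}
```

With a default argument, existing callers keep working unchanged, while library users can pass a pre-built config explicitly.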

Contributor

-> #8 (also a discussion point)

final val stdNullErr = "stdNullErr"
final val stdSchemaErr = "stdSchemaErr"

final val confMappingErr = "confMappingErr"
Contributor

Same as with the functions, the conf names should be removed, either in this PR or in another one

Contributor

Due to a slight complication with SparkUtils.withColumnIfDoesNotExist (tied to conformance), let's discuss and solve separately => #6

src/test/resources/application.conf (outdated; resolved)
Contributor

@dk1844 dk1844 left a comment

Awesome effort on the library base! This is a must for broader acceptance of the standardization tools by other parties.

I have found some imperfections/suggestion points, see comments below.

It was a bit challenging to follow the exact changes made when originating from Enceladus, but I mostly found a way to do so.

Other notes

  • It seems that some test resources are not used:
    • src/test/resources/data/standardization_*
    • src/test/resources/data/interpreter/*

(Just read the code, haven't run it)

src/main/scala/za/co/absa/standardization/JsonUtils.scala (outdated; resolved)

.editorconfig (resolved)
build.sbt (outdated; resolved)
src/main/scala/za/co/absa/standardization/Constants.scala (outdated; resolved)
Comment on lines +36 to +38
(implicit sparkSession: SparkSession): DataFrame = {
implicit val udfLib: UDFLibrary = new UDFLibrary
implicit val hadoopConf: Configuration = sparkSession.sparkContext.hadoopConfiguration
Contributor

Suggested change:

- (implicit sparkSession: SparkSession): DataFrame = {
-   implicit val udfLib: UDFLibrary = new UDFLibrary
-   implicit val hadoopConf: Configuration = sparkSession.sparkContext.hadoopConfiguration
+ (implicit sparkSession: SparkSession, hadoopConf: Configuration): DataFrame = {
+   implicit val udfLib: UDFLibrary = new UDFLibrary

I would suggest making the Hadoop Configuration an implicit param as well in order to allow customizations of the config. To make it easier for the user, so they don't have to define

implicit val hdpCnf = sparkSession.sparkContext.hadoopConfiguration

every time they don't want to customize, you could perhaps create a convenience object with a ready-to-use implicit, e.g.

object HadoopConfImplicits {
  // note: sparkSession needs to be in scope here; taking it implicitly makes the sketch compile
  implicit def DefaultHadoopConfig(implicit sparkSession: SparkSession): Configuration =
    sparkSession.sparkContext.hadoopConfiguration
}

and the no-customization users can just call:

import HadoopConfImplicits.DefaultHadoopConfig 

Contributor

Discussion point: this.

Contributor

I like the suggestion. It could be even easier than an import, though: I believe implicit parameters can have default values.
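A minimal sketch of that Scala 2 behavior (the types here are placeholders, not the library's real ones): when no matching implicit is found at the call site, the default value of the implicit parameter is used, while an implicit in scope takes precedence.

```scala
// Hypothetical sketch: implicit parameters with default values (Scala 2).
object ImplicitDefaultSketch {
  // Placeholder standing in for org.apache.hadoop.conf.Configuration.
  final case class HadoopConf(name: String)

  // If no implicit HadoopConf is in scope at the call site, the default is used.
  def standardize(data: String)(implicit conf: HadoopConf = HadoopConf("default")): String =
    s"$data standardized with ${conf.name}"
}
```

Calling `ImplicitDefaultSketch.standardize("df")` with nothing in scope picks the default, whereas declaring `implicit val custom: HadoopConf = HadoopConf("custom")` before the call overrides it, so no convenience import is needed for the common case.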

dk1844 added a commit that referenced this pull request Dec 7, 2021
…oString returning `Boolean` vs `scala.Boolean`
…oString returning `Boolean` vs `scala.Boolean` + test
dk1844 added a commit that referenced this pull request Dec 8, 2021
build.sbt (outdated; resolved)
Contributor

@AdrianOlosutean AdrianOlosutean left a comment

Looks good enough for now

@dk1844
Contributor

dk1844 commented Dec 9, 2021

Just my changes on this PR: https://github.com/AbsaOSS/spark-data-standardization/pull/1/files/dedf3a591932e06d77a61e7ff3c1cd2ffe951215..3dbe9fb094408dbd0788100a3dc72f63b772757f

This was referenced Dec 10, 2021
@dk1844 dk1844 requested a review from benedeki December 10, 2021 12:15
@dk1844 dk1844 mentioned this pull request Dec 10, 2021
@dk1844 dk1844 merged commit ed7c2aa into master Dec 13, 2021
@dk1844 dk1844 deleted the base-feature branch December 14, 2021 08:07
4 participants