-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Base feature #1
Base feature #1
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct me if I'm wrong, but all the files under test/resources/interpreter are not needed
*/ | ||
object TimeZoneNormalizer { | ||
private val log: Logger = LogManager.getLogger(this.getClass) | ||
private val generalConfig: Config = ConfigFactory.load() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be up to date with the latest version in Enceladus
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I keep wondering, we have a ConfigReader
that allows passing existing config -- and we sometimes use it and sometimes, e.g. here, we don't.
Is there a decision login in place? (In certain cases, you may be fine with hard-coded config loading, but with a library, we may want to be cautious). Probably a bit of a discussion point.
//edit:
-> #8
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-> #8 (also a discussion point)
final val stdNullErr = "stdNullErr" | ||
final val stdSchemaErr = "stdSchemaErr" | ||
|
||
final val confMappingErr = "confMappingErr" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same like with the functions, the conf names should be removed, either in this PR or another one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Due to a slight complication with SparkUtils.withColumnIfDoesNotExist
(tied to conformance), let's discuss and solve separately => #6
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome effort on the library base! This is a must for broader acceptance of the standardization tools by other parties.
I have found some imperfections/suggestion points, see comments below.
It was a bit challenging to follow on the exact changes made when originating from Enceladus, but mostly found a way to do.
Other notes
- It seems that some test resources are not used:
src/test/resources/data/standardization_*
src/test/resources/data/interpreter/*
(Just read the code, haven't run it)
src/main/scala/za/co/absa/standardization/implicits/ColumnImplicits.scala
Outdated
Show resolved
Hide resolved
src/main/scala/za/co/absa/standardization/validation/field/FieldValidationFailure.scala
Outdated
Show resolved
Hide resolved
*/ | ||
object TimeZoneNormalizer { | ||
private val log: Logger = LogManager.getLogger(this.getClass) | ||
private val generalConfig: Config = ConfigFactory.load() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I keep wondering, we have a ConfigReader
that allows passing existing config -- and we sometimes use it and sometimes, e.g. here, we don't.
Is there a decision login in place? (In certain cases, you may be fine with hard-coded config loading, but with a library, we may want to be cautious). Probably a bit of a discussion point.
//edit:
-> #8
(implicit sparkSession: SparkSession): DataFrame = { | ||
implicit val udfLib: UDFLibrary = new UDFLibrary | ||
implicit val hadoopConf: Configuration = sparkSession.sparkContext.hadoopConfiguration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(implicit sparkSession: SparkSession): DataFrame = { | |
implicit val udfLib: UDFLibrary = new UDFLibrary | |
implicit val hadoopConf: Configuration = sparkSession.sparkContext.hadoopConfiguration | |
(implicit sparkSession: SparkSession, hadoopConf: Configuration): DataFrame = { | |
implicit val udfLib: UDFLibrary = new UDFLibrary |
I would suggest making the Hadoop Configuration an implicit param as well in order to allow customizations of the config. To make it easier for the user, so they don't have to define
implicit val hdpCnf = sparkSession.sparkContext.hadoopConfiguration
everytime they don't want to customize, you could perhaps create a convenience object with the ready-to-use implicit, e.g.
object HadoopConfImplicits {
implicit val DefaultHadoopConfig = sparkSession.sparkContext.hadoopConfiguration
}
and the no-customization users can just call:
import HadoopConfImplicits.DefaultHadoopConfig
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussion point: this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the suggestion. It could be even easier though then import
, I believe even implicit
parameters can have a default value.
sbt-header check + github workflow
…-compatiblity), created issue #5 to follow up
…oString returning `Boolean` vs `scala.Boolean`
…oString returning `Boolean` vs `scala.Boolean` + test
…dationFailure` renamed to `FieldValidationIssue`
build name = small caps,
…ed to `StandardizationCsvSuite`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good enough for now
This is a PR with base standardization features.
I am not saying this is how it should be on release. There is probably a lot of duplicities and places for improvement. My idea is to create a lot of issues from this PR and maybe even for absa commons.
Even name and package probably need a new name.
Class in is
Standardization.class