Feature/1692 api v3 runs #2073
Conversation
* 1422 and 1423 Remove HDFS and Oozie from Menas
* #1422 Fix HDFS location validation
* #1424 Add Menas Dockerfile
* #1416 hadoop-aws 2.8.5 + s3 aws sdk 2.13.65 compiles.
* #1416 Enceladus on S3:
  * all directly-HDFS-touching stuff disabled (Atum, performance measurements, info files, output path checking)
  * add menasfargate into hosts: `sudo nano /etc/hosts`, paste `20.0.63.69 menasfargate`, save & exit (Ctrl+O, Ctrl+X)
  * running standardization works via: `spark-submit --class za.co.absa.enceladus.standardization.StandardizationJob --conf "spark.driver.extraJavaOptions=-Dmenas.rest.uri=http://menasfargate:8080 -Dstandardized.hdfs.path=s3://euw1-ctodatadev-dev-bigdatarnd-s3-poc/enceladusPoc/ao-hdfs-data/stdOutput/standardized-{0}-{1}-{2}-{3}" ~/enceladusPoc/spark-jobs-2.11.0-SNAPSHOT.jar --menas-credentials-file ~/enceladusPoc/menas-credentials.properties --dataset-name dk_test1_emr285 --raw-format json --dataset-version 1 --report-date 2019-11-27 --report-version 1 2> ~/enceladusPoc/stderr.txt`
* #1416 Enceladus on S3: (crude) conformance works on S3 (S3 std input, S3 conf output)
  * all directly-HDFS-touching stuff disabled (Atum, performance measurements, info files, output path checking)
  * add menasfargate into hosts: `sudo nano /etc/hosts`, paste `20.0.63.69 menasfargate`, save & exit (Ctrl+O, Ctrl+X)
  * running conformance works via: `spark-submit --class za.co.absa.enceladus.conformance.DynamicConformanceJob --conf "spark.driver.extraJavaOptions=-Dmenas.rest.uri=http://menasfargate:8080 -Dstandardized.hdfs.path=s3://euw1-ctodatadev-dev-bigdatarnd-s3-poc/enceladusPoc/ao-hdfs-data/stdOutput/standardized-{0}-{1}-{2}-{3}" ~/enceladusPoc/spark-jobs-2.11.0-SNAPSHOT.jar --menas-credentials-file ~/enceladusPoc/menas-credentials.properties --dataset-name dk_test1_emr285 --dataset-version 1 --report-date 2019-11-27 --report-version 1 2> ~/enceladusPoc/conf-log.txt`
* ref issue = 1416
* related test cases ignored (issue reference added)
* PR updates
* Merge spline 0.5.3 into aws-poc
* Update spline to 0.5.4 for AWS PoC
* #1503 Remove HDFS URL validation. This is a temporary solution: we currently experiment with many forms of URLs, and having a regex there now slows us down.
* New Dockerfile - smaller image
* S3 persistence (Atum, SDK fs usage, ...) (#1526)
  * FsUtils divided into LocalFsUtils & HdfsUtils
  * PathConfigSuite update
  * S3FsUtils with tail-recursive pagination accumulation - now generic with optional short-circuit breakOut; TestRunnerJob updated to manually cover the cases - should serve as a basis for tests
  * HdfsUtils replaced by trait DistributedFsUtils (except for MenasCredentials loading & nonSplittable splitting)
  * using final version of S3-powered Atum (3.0.0)
  * mockito version update, scalatest version update
  * S3FsUtilsSuite: exists, read, sizeDir (hidden, non-hidden, recursive), non-splittable (simple, recursive with breakOut), delete (recursive), version find (simple - empty, recursive)
  * explicit stubbing fix for hyperdrive
* Feature/1556 file access PoC using Hadoop FS API (#1586)
  * S3 using Hadoop FS API
  * S3 SDK usage removed (pom, classes, tests)
  * Atum final version 3.1.0 used
  * `readStandardizationInputData(... path: String)(implicit ... fs: FileSystem)` -> `readStandardizationInputData(input: PathWithFs)`
* 1554 Tomcat with TLS in Docker container (#1585)
  * #1554 Tomcat with TLS container
  * #1554 Added envoy config + enabled running an unencrypted container
* #1499 Add authentication to /lineage + update spline to 0.5.5
* #1618 Fixes failing spline 0.5.5 integration by providing a compatible commons library version. Test-ran on EMR. (#1619)
* #1612 Separation start
* #1612 Updated DAO for spark-jobs
* #1612 Fixed spline integration and schema, removed redundant code
* #1612 Fixed tests, removed unused dependency
* #1612 Added back dependency
* WIP fixing merge issues
  * merge compiles, tests pass, depends on ATUM 3.1.1-SNAPSHOT (the bugfix for AbsaOSS/atum#48)
* #1612 Removed Spring from menas-web, enabled building war and static resources. Removed version subpath in menas-web + added theme dependencies in repo
* #1612 Cookies + updated lineage
* put back HDFS browser, put back Oozie, downgraded Spline
* AWS SDK exclusion
* #1612 Included HDFSFolder + missing Oozie parts
* New ATUM version
* Adding missing files
* #1612 menas-web on nginx container and passing API_URL
* #1612 Working TLS on nginx, resources not included in code
* 1622: Merge of aws-poc to develop branch
* Addressed issues identified by reviewers
* comments improvement
* 1434 Add new way of serving properties to Docker
* #1612 Building using ui5 + reused /api route
* #1612 Project version
* #713 Add favicon
* #1612 Merges
* #1612 pom parent version
* #1648 Fix war deployment + adding back spline to menas
* #1612 other fixes
* #1612 added pom package.json version sync
* #1612 newline
* #1612 fix version sync + cleaning dist
* 1648 merge to develop
* 1648 merge fix
* 1648 Fixes schema upload
* 1648 Fixes schema registry request
* 1648 pom version
* 1612 add docker build
* #601 Swagger 2 PoC
* #1648 Updating menas-web to 3.0
* #1612 Updated npm project versions + mvn plugin
* #1612 license_check.yml
* #1612 licence check fix

Co-authored-by: Saša Zejnilović <[email protected]>
Co-authored-by: Daniel Kavan <[email protected]>
Co-authored-by: Jan Scherbaum <[email protected]>
Co-authored-by: David Benedeki <[email protected]>
Conflicts: data-model/src/main/scala/META-INF/MANIFEST.MF, menas-web/ui/components/dataset/conformanceRule/MappingConformanceRule/targetSchemaFieldSelector.fragment.xml, menas/ui/index.html
…velop-ver.30
Conflicts: menas/src/main/scala/za/co/absa/enceladus/menas/MvcConfig.scala
merging develop into develop-ver3.0 (via feature/merging-develop-ver.30 branch)
KafkaErrorSenderPluginSuite test fix for SourcePhase.{Standardization, Conformance} capitalization.
* #601 Swagger api docs
…XMLHack removed, the test holds. (#1783)
* #1770 Rename menas-web to menas
Conflicts: dao/pom.xml, data-model/pom.xml, examples/pom.xml, menas/pom.xml, migrations-cli/pom.xml, migrations/pom.xml, plugins-api/pom.xml, plugins-builtin/pom.xml, pom.xml, spark-jobs/pom.xml, utils/pom.xml
Spline 0.6 integration
Conflicts: menas/pom.xml, menas/ui/index.html, pom.xml, rest-api/src/main/scala/za/co/absa/enceladus/rest_api/WebSecurityConfig.scala, rest-api/src/main/scala/za/co/absa/enceladus/rest_api/controllers/DatasetController.scala
Conflicts: menas/pom.xml
…elate to rest_api, not menas: `/3rdParty/**` should not be added anymore as it is menas-related)
* Adding back the Menas module that somehow got omitted.
…lop-ver-3 Merging Release 2.23.0 into develop ver 3
Conflicts: README.md, spark-jobs/src/main/scala/za/co/absa/enceladus/common/CommonJobExecution.scala
```scala
.find[BsonDocument](filter)
  .headOption()
  .map(_.map(bson => SerializationUtils.fromJson[Run](bson.toJson)))
// why not just .find[Run]? Because Run.RunStatus.RunState is a Scala enum that does not play nice with bson-serde
```
Love the comment 👍
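For context, here is a minimal, self-contained sketch of the issue the comment alludes to. The names and enum values below are invented for illustration and are not the real Enceladus model:

```scala
// Hypothetical, trimmed-down shapes -- names invented for illustration only.
object RunState extends Enumeration {
  type RunState = Value
  val running, allSucceeded, failed = Value
}

case class RunStatus(status: RunState.RunState, error: Option[String])
case class Run(runId: Int, dataset: String, runStatus: RunStatus)

object EnumSerdeSketch extends App {
  // A macro-derived BSON codec (e.g. org.mongodb.scala.bson.codecs.Macros.createCodecProvider[Run])
  // has no codec for the inner Enumeration#Value type, so the repository sidesteps BSON
  // deserialization: it reads a raw BsonDocument and converts it via JSON, where the project's
  // SerializationUtils knows how to handle the enum:
  //
  //   .find[BsonDocument](filter)
  //     .headOption()
  //     .map(_.map(bson => SerializationUtils.fromJson[Run](bson.toJson)))
  println(Run(1, "exampleDataset", RunStatus(RunState.allSucceeded, None)))
}
```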
```diff
 import za.co.absa.enceladus.model.{Run, SplineReference, Validation}

 import scala.concurrent.Future
 import scala.util.{Failure, Success, Try}

-@Service
+@Service("runService") // by-name qualifier: making V2 autowiring un-ambiguous
 class RunService @Autowired()(val mongoRepository: RunMongoRepository)
```
If you have time, please explain this to me (educating 😉).
`RunServiceV3` extends `RunService`. This has a few (expected) effects:

- Wherever a field of type `RunServiceV3` is needed, `@Autowired` has only one option to use: `RunServiceV3`.
- Wherever a field of type `RunService` is needed, `@Autowired` has two options to use, `RunService` or `RunServiceV3`, and thus it fails with an ambiguity exception (`No qualifying bean of type 'za.co.absa.enceladus.rest_api.services.RunService' available: expected single matching bean but found 2: runService,runServiceV3`). Therefore it needs to be disambiguated somehow (some details here, but basically it is summed up by: by default, Spring resolves `@Autowired` entries by type).

There are multiple ways to hint which wiring is desired, e.g. marking one of the implementations with `@Primary`, etc. The option I have used here is to mark the bean with a qualifier -- in my case one that matches the field name where it is requested. So I have:

```scala
@Service("runService") // by-name qualifier: making V2 autowiring un-ambiguous
class RunService()

@Service
class RunServiceV3() extends RunService

class LandingPageController @Autowired() (
  ...,
  runService: RunService, // matches the qualifier of RunService by field name => no ambiguity, RunService is wired.
)
```

In case you could not match the name of the qualifier with the field name (unable to change either for some reason), this can be solved by adding `@Qualifier(qualifierName)` to the property, as shown with `@Qualifier("fooFormatter")` at https://www.baeldung.com/spring-autowire#1-autowiring-by-qualifier. So it would look like this:

```scala
@Service("runService")
class RunService()

@Service
class RunServiceV3() extends RunService

class LandingPageController @Autowired() (
  ...,
  @Qualifier("runService") // matches the qualifier of RunService => no ambiguity, RunService is wired.
  runServiceWithADifferentName: RunService,
)
```
```diff
@@ -42,6 +44,7 @@ class RunRepositoryIntegrationSuite extends BaseRepositoryTest {
   private val runFixture: RunFixtureService = null

   @Autowired
+  @Qualifier("runMongoRepository") // to correctly wire V2 runMongoRepository
```
Another education request, please.
This is the same case as #2073 (comment). A field of type `RunMongoRepository` can be wired with more than one class, so the qualifier here helps to resolve that -- when the field of this type is named `runMongoRepository` or accompanied by `@Qualifier("runMongoRepository")`.
Question 1: On non-existing URLs I do not get a 404 but rather an empty array. I can see why an empty array is returned if the dataset exists but has no runs for the name or the version, but if the dataset name or the combination of name and version does not even exist in the Dataset collection, then it is misleading. Thoughts?

I agree: 404 in case the Dataset & Version does not exist.
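A minimal sketch of the agreed behaviour: check the dataset's existence first, and only then return the (possibly empty) list of runs. Names like `datasetExists`, `runsFor`, and `NotFoundException` are placeholders, not the actual Enceladus classes:

```scala
import scala.concurrent.{ExecutionContext, Future}

class NotFoundException(msg: String) extends RuntimeException(msg)

// Hypothetical shape of the check; the real controller/service wiring differs.
class RunsV3Sketch(datasetExists: (String, Int) => Future[Boolean],
                   runsFor: (String, Int) => Future[Seq[String]])
                  (implicit ec: ExecutionContext) {

  def getRunSummaries(datasetName: String, datasetVersion: Int): Future[Seq[String]] =
    datasetExists(datasetName, datasetVersion).flatMap {
      case false =>
        // dataset name + version not found at all -> translates to HTTP 404
        Future.failed(new NotFoundException(s"Dataset $datasetName v$datasetVersion not found"))
      case true =>
        // dataset exists, possibly with no runs -> HTTP 200 with an empty array is fine
        runsFor(datasetName, datasetVersion)
    }
}
```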
Suggestion 2: on update status, that is

```json
{
  "status": "updated",
  "message": "message"
}
```

But this already smells of the JSON envelope, so maybe we keep it for the next PR.
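Purely as an illustration, the suggested envelope could map to something like the following case class; the name is hypothetical and not necessarily what the envelope work in #2062 will standardize on:

```scala
// Hypothetical response envelope matching the JSON above (illustration only).
case class StatusMessage(status: String, message: String)

object StatusMessageExample extends App {
  println(StatusMessage(status = "updated", message = "message"))
}
```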
… instead of lexicographical comparison as before (which does not work for DD-MM-YYYY). Input changed for controllers - now accepts YYYY-MM-DD instead of DD-MM-YYYY from the client. Tests added + existing adjusted.
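The commit above replaces string comparison of dates with real date comparison; a small illustration (values made up) of why lexicographic comparison breaks for DD-MM-YYYY while ISO YYYY-MM-DD works:

```scala
import java.time.LocalDate

object DateComparisonSketch extends App {
  val a = "02-01-2020" // 2 January 2020
  val b = "15-12-2019" // 15 December 2019
  println(a > b)       // false: as strings, "0..." < "1...", although a is the later date

  // ISO dates (YYYY-MM-DD) order correctly, and parsing makes the comparison explicit:
  println(LocalDate.parse("2020-01-02").isAfter(LocalDate.parse("2019-12-15"))) // true
}
```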
- code reviewed
- built
- run
Code review
…tasetVersion} are checked for ds existence => 404
…rns a wrapped message now
Hi guys,

Question 1 I have fixed as suggested. As for Suggestion 2: yes, it does smell a bit like it should be part of #2062, so, for now, I have just wrapped the response in an object with a message field (so it is valid JSON with the message, but no other fields like `status` yet).
SonarCloud Quality Gate failed. 0 Bugs. No Coverage information.
LGTM
This PR implements the Runs v3 API #1692. `RunRepositoryV3` (working on the `run` collection) is covered by a separate integTest, too.

**Endpoints details**

- `GET /runs[?startDate=yyyy-MM-dd|sparkAppId=...|uniqueId=...]`
  - returns `RunSummaries` (`Run` without `splineRef` and `controlMeasure`)
  - per `datasetName` and `versionName`, only the `Run` with the latest `runId` is returned
- `GET /runs/{datasetName}[?startDate=yyyy-MM-dd]`
- `GET /runs/{datasetName}/{datasetVersion}[?startDate=yyyy-MM-dd]`
  - return `RunSummaries` (`Run` without `splineRef` and `controlMeasure`); all `runId`s should be returned
- `PUT /runs/{datasetName}/{datasetVersion}`
  - accepts a `Run`
  - responds with a `Location` header with the URL to the newly created entity
- `GET /runs/{datasetName}/{datasetVersion}/{runId}`
  - returns the whole `Run` object (contains `splineRef` and `controlMeasure`)
  - `latest` can be used for a `runId` instead of #
- `PUT /runs/{datasetName}/{datasetVersion}/{runId}`
  - accepts a `RunStatus` (logically updated state, optional error)
- `GET /runs/{datasetName}/{datasetVersion}/{runId}/checkpoints`
  - `latest` can be used for a `runId` instead of #
- `POST /runs/{datasetName}/{datasetVersion}/{runId}/checkpoints`
  - accepts a `Checkpoint`
  - responds with a `Location` header with the URL to the newly created entity
- `GET /runs/{datasetName}/{datasetVersion}/{runId}/checkpoints/{checkpointName}`
  - `latest` can be used for a `runId` instead of #
- `GET /runs/{datasetName}/{datasetVersion}/{runId}/metadata`
  - `latest` can be used for a `runId` instead of #
  - there is no `GET .../metadata/{metadataName}`, because it does not seem to make a lot of sense (but I am open to suggestions in this part)

An example call against these endpoints is sketched below.
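This is a hedged usage sketch only: the host, port, base path (`/api-v3`), dataset name, and the omission of authentication are all assumptions about a local deployment, not details confirmed by this PR:

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object RunsV3CallSketch extends App {
  val client = HttpClient.newHttpClient()

  // Latest run of a hypothetical dataset "myDataset", version 1 -- `latest` instead of a concrete runId.
  // Base URL and path prefix are assumptions; authentication headers are omitted for brevity.
  val request = HttpRequest.newBuilder(
      URI.create("http://localhost:8080/api-v3/runs/myDataset/1/latest"))
    .header("Accept", "application/json")
    .GET()
    .build()

  val response = client.send(request, HttpResponse.BodyHandlers.ofString())
  println(s"${response.statusCode()}: ${response.body()}")
}
```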
**Consideration for latest-of-each RunSummaries in MongoDB**

The latest-of-each retrieval poses an approach conundrum. I have basically thought about these options to implement it:

1. `lookup`, just as the V2 repository does it
2. `aggregate`, storing the documents in the group and then filtering by the `max(runId)`
3. `aggregate` with a `first` accumulator (guaranteed to be the max `runId` because of the pre-sorting)

The first option seems like overkill, but I admit I have not tried it. With options 2 and 3, I have written basic implementations of such aggregate queries and rudimentarily measured them on our UAT environment holding ~450k run records.
Option 3 appeared to take about twice as long (~30s) as option 2 (~15s), provided that there is an appropriate index present in the collection for the pre-sorting. Surprisingly, an index for this sort (`{ $sort: { "runId": -1 } }`) was present on our UAT, but it must have been added extra, because our default setup does not introduce this index on the `runs_v1` collection. Thus, in the code, I have added it so that it is present for future MongoDB provisionings.

Option 2 mongo script used:
Option 3 mongo script used:
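Purely as an illustration (not the scripts that were actually measured), option 2 and option 3 could be expressed along these lines with the MongoDB Scala driver. The connection string, database and collection names, and the field names (`dataset`, `datasetVersion`, `runId`) are assumptions for this sketch:

```scala
import org.bson.BsonDocument
import org.mongodb.scala._
import org.mongodb.scala.model.{Accumulators, Aggregates, Sorts}

import scala.concurrent.Await
import scala.concurrent.duration._

object LatestRunSummariesSketch extends App {
  // Connection string, database and collection names are assumptions for this sketch.
  val collection = MongoClient("mongodb://localhost:27017")
    .getDatabase("menas")
    .getCollection("run")

  // Option 2: group per (dataset, version), remember max(runId) and all docs, keep the matching doc.
  val option2 = Seq(
    Aggregates.group(
      BsonDocument.parse("""{ "dataset": "$dataset", "datasetVersion": "$datasetVersion" }"""),
      Accumulators.max("maxRunId", "$runId"),
      Accumulators.push("runs", "$$ROOT")
    ),
    // keep only the run whose runId equals the group's maxRunId
    // (the match ends up wrapped in the one-element "latestRun" array; a further $unwind +
    //  $replaceRoot would flatten it, omitted here)
    Aggregates.project(BsonDocument.parse(
      """{ "latestRun": { "$filter": { "input": "$runs", "as": "r", "cond": { "$eq": [ "$$r.runId", "$maxRunId" ] } } } }"""))
  )

  // Option 3: pre-sort by runId descending (ideally backed by a { runId: -1 } index),
  // then $first within each group is guaranteed to be the latest run.
  val option3 = Seq(
    Aggregates.sort(Sorts.descending("runId")),
    Aggregates.group(
      BsonDocument.parse("""{ "dataset": "$dataset", "datasetVersion": "$datasetVersion" }"""),
      Accumulators.first("latestRun", "$$ROOT")
    ),
    Aggregates.replaceRoot("$latestRun")
  )

  val latest = Await.result(collection.aggregate(option3).toFuture(), 60.seconds)
  latest.foreach(doc => println(doc.toJson()))
}
```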