diff --git a/.env b/.env index 78f4ea85..fdd6ecbb 100644 --- a/.env +++ b/.env @@ -1,3 +1,3 @@ # Defines environment variables for docker-compose. # Can be overridden via e.g. `MARKLOGIC_TAG=latest-10.0 docker-compose up -d --build`. -MARKLOGIC_TAG=11.1.0-centos-1.1.0 +MARKLOGIC_TAG=11.2.0-centos-1.1.2 diff --git a/.gitignore b/.gitignore index ce326605..486958c0 100644 --- a/.gitignore +++ b/.gitignore @@ -17,3 +17,4 @@ logs venv .venv docker +export diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 02a738e3..7da6aa5c 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -24,26 +24,6 @@ The above will result in a new MarkLogic instance with a single node. Alternatively, if you would like to test against a 3-node MarkLogic cluster with a load balancer in front of it, run `docker-compose -f docker-compose-3nodes.yaml up -d --build`. -## Accessing MarkLogic logs in Grafana - -This project's `docker-compose-3nodes.yaml` file includes -[Grafana, Loki, and promtail services](https://grafana.com/docs/loki/latest/clients/promtail/) for the primary reason of -collecting MarkLogic log files and allowing them to be viewed and searched via Grafana. - -Once you have run `docker-compose`, you can access Grafana at http://localhost:3000 . Follow these instructions to -access MarkLogic logging data: - -1. Click on the hamburger in the upper left hand corner and select "Explore", or simply go to - http://localhost:3000/explore . -2. Verify that "Loki" is the default data source - you should see it selected in the upper left hand corner below - the "Home" link. -3. Click on the "Select label" dropdown and choose `job`. Click on the "Select value" label for this filter and - select `marklogic` as the value. -4. Click on the blue "Run query" button in the upper right hand corner. - -You should now see logs from all 3 nodes in the MarkLogic cluster. - - ## Deploying the test application To deploy the test application, first create `./gradle-local.properties` and add the following to it: @@ -63,20 +43,6 @@ To run the tests against the test application, run the following Gradle task: ./gradlew test -If you installed MarkLogic using this project's `docker-compose.yaml` file, you can also run the tests from within the -Docker environment by first running the following task: - - ./gradlew dockerBuildCache - -The above task is a mostly one-time step to build a Docker image that contains all of this project's Gradle -dependencies. This will allow the next step to run much more quickly. You'll only need to run this again when the -project's Gradle dependencies change. - -You can then run the tests from within the Docker environment via the following task: - - ./gradlew dockerTest - - ## Generating code quality reports with SonarQube In order to use SonarQube, you must have used Docker to run this project's `docker-compose.yml` file and you must @@ -117,6 +83,25 @@ you've introduced on the feature branch you're working on. You can then click on Note that if you only need results on code smells and vulnerabilities, you can repeatedly run `./gradlew sonar` without having to re-run the tests. +## Accessing MarkLogic logs in Grafana + +This project's `docker-compose-3nodes.yaml` file includes +[Grafana, Loki, and promtail services](https://grafana.com/docs/loki/latest/clients/promtail/) for the primary reason of +collecting MarkLogic log files and allowing them to be viewed and searched via Grafana. + +Once you have run `docker-compose`, you can access Grafana at http://localhost:3000 . 
Follow these instructions to +access MarkLogic logging data: + +1. Click on the hamburger in the upper left hand corner and select "Explore", or simply go to + http://localhost:3000/explore . +2. Verify that "Loki" is the default data source - you should see it selected in the upper left hand corner below + the "Home" link. +3. Click on the "Select label" dropdown and choose `job`. Click on the "Select value" label for this filter and + select `marklogic` as the value. +4. Click on the blue "Run query" button in the upper right hand corner. + +You should now see logs from all 3 nodes in the MarkLogic cluster. + # Testing with PySpark The documentation for this project @@ -131,7 +116,7 @@ This will produce a single jar file for the connector in the `./build/libs` dire You can then launch PySpark with the connector available via: - pyspark --jars build/libs/marklogic-spark-connector-2.2-SNAPSHOT.jar + pyspark --jars build/libs/marklogic-spark-connector-2.3.0.rc1.jar The below command is an example of loading data from the test application deployed via the instructions at the top of this page. @@ -171,14 +156,28 @@ df2.head() json.loads(df2.head()['content']) ``` +For a quick test of writing documents, use the following: + +``` + +spark.read.option("header", True).csv("src/test/resources/data.csv")\ + .repartition(2)\ + .write.format("marklogic")\ + .option("spark.marklogic.client.uri", "spark-test-user:spark@localhost:8000")\ + .option("spark.marklogic.write.permissions", "spark-user-role,read,spark-user-role,update")\ + .option("spark.marklogic.write.logProgress", 50)\ + .option("spark.marklogic.write.batchSize", 10)\ + .mode("append")\ + .save() +``` # Testing against a local Spark cluster When you run PySpark, it will create its own Spark cluster. If you'd like to try against a separate Spark cluster that still runs on your local machine, perform the following steps: -1. Use [sdkman to install Spark](https://sdkman.io/sdks#spark). Run `sdk install spark 3.4.1` since we are currently -building against Spark 3.4.1. +1. Use [sdkman to install Spark](https://sdkman.io/sdks#spark). Run `sdk install spark 3.4.3` since we are currently +building against Spark 3.4.3. 2. `cd ~/.sdkman/candidates/spark/current/sbin`, which is where sdkman will install Spark. 3. Run `./start-master.sh` to start a master Spark node. 4. `cd ../logs` and open the master log file that was created to find the address for the master node. It will be in a @@ -193,7 +192,7 @@ The Spark master GUI is at . You can use this to view det Now that you have a Spark cluster running, you just need to tell PySpark to connect to it: - pyspark --master spark://NYWHYC3G0W:7077 --jars build/libs/marklogic-spark-connector-2.2-SNAPSHOT.jar + pyspark --master spark://NYWHYC3G0W:7077 --jars build/libs/marklogic-spark-connector-2.3.0.rc1.jar You can then run the same commands as shown in the PySpark section above. The Spark master GUI will allow you to examine details of each of the commands that you run. 
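If you just want a quick sanity check that the connector can reach MarkLogic from this shell, the following sketch reads a handful of document URIs via the custom code options described in `docs/configuration.md`. This is only a sketch: it assumes the test application described at the top of this page has been deployed and that `spark-test-user` is permitted to evaluate custom code.

```
df = spark.read.format("marklogic") \
    .option("spark.marklogic.client.uri", "spark-test-user:spark@localhost:8000") \
    .option("spark.marklogic.read.javascript", "fn.subsequence(cts.uris(), 1, 10)") \
    .load()
df.show()
```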
@@ -212,12 +211,12 @@ You will need the connector jar available, so run `./gradlew clean shadowJar` if You can then run a test Python program in this repository via the following (again, change the master address as needed); note that you run this outside of PySpark, and `spark-submit` is available after having installed PySpark: - spark-submit --master spark://NYWHYC3G0W:7077 --jars build/libs/marklogic-spark-connector-2.2-SNAPSHOT.jar src/test/python/test_program.py + spark-submit --master spark://NYWHYC3G0W:7077 --jars build/libs/marklogic-spark-connector-2.3.0.rc1.jar src/test/python/test_program.py You can also test a Java program. To do so, first move the `com.marklogic.spark.TestProgram` class from `src/test/java` to `src/main/java`. Then run `./gradlew clean shadowJar` to rebuild the connector jar. Then run the following: - spark-submit --master spark://NYWHYC3G0W:7077 --class com.marklogic.spark.TestProgram build/libs/marklogic-spark-connector-2.2-SNAPSHOT.jar + spark-submit --master spark://NYWHYC3G0W:7077 --class com.marklogic.spark.TestProgram build/libs/marklogic-spark-connector-2.3.0.rc1.jar Be sure to move `TestProgram` back to `src/test/java` when you are done. diff --git a/Jenkinsfile b/Jenkinsfile index 2cd80227..ff347234 100644 --- a/Jenkinsfile +++ b/Jenkinsfile @@ -40,7 +40,6 @@ pipeline{ buildDiscarder logRotator(artifactDaysToKeepStr: '7', artifactNumToKeepStr: '', daysToKeepStr: '30', numToKeepStr: '') } environment{ - JAVA8_HOME_DIR="/home/builder/java/openjdk-1.8.0-262" JAVA11_HOME_DIR="/home/builder/java/jdk-11.0.2" GRADLE_DIR =".gradle" DMC_USER = credentials('MLBUILD_USER') diff --git a/LICENSE.txt b/LICENSE.txt index 0cf4434d..ee0ffb33 100644 --- a/LICENSE.txt +++ b/LICENSE.txt @@ -1,4 +1,4 @@ -Copyright © 2023 MarkLogic Corporation. +Copyright © 2024 MarkLogic Corporation. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at diff --git a/build.gradle b/build.gradle index 9b3212cb..a050df72 100644 --- a/build.gradle +++ b/build.gradle @@ -2,7 +2,7 @@ plugins { id 'java-library' id 'net.saliman.properties' version '1.5.2' id 'com.github.johnrengelman.shadow' version '8.1.1' - id "com.marklogic.ml-gradle" version "4.6.0" + id "com.marklogic.ml-gradle" version "4.7.0" id 'maven-publish' id 'signing' id "jacoco" @@ -10,53 +10,66 @@ plugins { } group 'com.marklogic' -version '2.2.0' +version '2.3.0.rc1' java { - sourceCompatibility = 1.8 - targetCompatibility = 1.8 + // To support reading RDF files, Apache Jena is used - but that requires Java 11. If we want to do a 2.2.0 release + // without requiring Java 11, we'll remove the support for reading RDF files along with the Jena dependency. + sourceCompatibility = 11 + targetCompatibility = 11 } repositories { mavenCentral() } +configurations { + // Defines all the implementation dependencies, but in such a way that they are not included as dependencies in the + // library's pom.xml file. This is due to the shadow jar being published instead of a jar only containing this + // project's classes. The shadow jar is published due to the need to relocate several packages to avoid conflicts + // with Spark. + shadowDependencies + + // This approach allows for all of the dependencies to be available for compilation and for running tests. 
+ compileOnly.extendsFrom(shadowDependencies) + testImplementation.extendsFrom(compileOnly) +} + dependencies { - compileOnly 'org.apache.spark:spark-sql_2.12:' + sparkVersion - implementation ("com.marklogic:marklogic-client-api:6.5.0") { + // This is compileOnly as any environment this is used in will provide the Spark dependencies itself. + compileOnly ('org.apache.spark:spark-sql_2.12:' + sparkVersion) { + // Excluded from our ETL tool for size reasons, so excluded here as well to ensure we don't need it. + exclude module: "rocksdbjni" + } + + shadowDependencies ("com.marklogic:marklogic-client-api:6.6.1") { // The Java Client uses Jackson 2.15.2; Scala 3.4.x does not yet support that and will throw the following error: // Scala module 2.14.2 requires Jackson Databind version >= 2.14.0 and < 2.15.0 - Found jackson-databind version 2.15.2 // So the 4 Jackson modules are excluded to allow for Spark's to be used. - exclude module: 'jackson-core' - exclude module: 'jackson-databind' - exclude module: 'jackson-annotations' - exclude module: 'jackson-dataformat-csv' + exclude group: "com.fasterxml.jackson.core" + exclude group: "com.fasterxml.jackson.dataformat" } + // Required for converting JSON to XML. Using 2.14.2 to align with Spark 3.4.1. + shadowDependencies "com.fasterxml.jackson.dataformat:jackson-dataformat-xml:2.14.2" + // Need this so that an OkHttpClientConfigurator can be created. - implementation 'com.squareup.okhttp3:okhttp:4.12.0' + shadowDependencies 'com.squareup.okhttp3:okhttp:4.12.0' - // Makes it possible to use lambdas in Java 8 to implement Spark's Function1 and Function2 interfaces - // See https://github.com/scala/scala-java8-compat for more information - implementation("org.scala-lang.modules:scala-java8-compat_2.12:1.0.2") { - // Prefer the Scala libraries used within the user's Spark runtime. - exclude module: "scala-library" + shadowDependencies ("org.apache.jena:jena-arq:4.10.0") { + exclude group: "com.fasterxml.jackson.core" + exclude group: "com.fasterxml.jackson.dataformat" } - testImplementation 'org.apache.spark:spark-sql_2.12:' + sparkVersion + shadowDependencies "org.jdom:jdom2:2.0.6.1" - // The exclusions in these two modules ensure that we use the Jackson libraries from spark-sql when running the tests. - testImplementation ('com.marklogic:ml-app-deployer:4.6.0') { - exclude module: 'jackson-core' - exclude module: 'jackson-databind' - exclude module: 'jackson-annotations' - exclude module: 'jackson-dataformat-csv' + testImplementation ('com.marklogic:ml-app-deployer:4.7.0') { + exclude group: "com.fasterxml.jackson.core" + exclude group: "com.fasterxml.jackson.dataformat" } testImplementation ('com.marklogic:marklogic-junit5:1.4.0') { - exclude module: 'jackson-core' - exclude module: 'jackson-databind' - exclude module: 'jackson-annotations' - exclude module: 'jackson-dataformat-csv' + exclude group: "com.fasterxml.jackson.core" + exclude group: "com.fasterxml.jackson.dataformat" } testImplementation "ch.qos.logback:logback-classic:1.3.14" @@ -105,7 +118,11 @@ if (JavaVersion.current().isCompatibleWith(JavaVersion.VERSION_17)) { } shadowJar { - // "all" is the default; no need for that in the connector filename. + configurations = [project.configurations.shadowDependencies] + + // "all" is the default; no need for that in the connector filename. This also results in this becoming the library + // artifact that is published as a dependency. 
That is desirable as it includes the relocated packages listed below, + // which a dependent would otherwise have to manage themselves. archiveClassifier.set("") // Spark uses an older version of OkHttp; see @@ -121,38 +138,6 @@ task perfTest(type: JavaExec) { args mlHost } -task dockerBuildCache(type: Exec) { - description = "Creates an image named 'marklogic-spark-cache' containing a cache of the Gradle dependencies." - commandLine 'docker', 'build', '--no-cache', '-t', 'marklogic-spark-cache', '.' -} - -task dockerTest(type: Exec) { - description = "Run all of the tests within a Docker environment." - commandLine 'docker', 'run', - // Allows for communicating with the MarkLogic cluster that is setup via docker-compose.yaml. - '--network=marklogic_spark_external_net', - // Map the project directory into the Docker container. - '-v', getProjectDir().getAbsolutePath() + ':/root/project', - // Working directory for the Gradle tasks below. - '-w', '/root/project', - // Remove the container after it finishes running. - '--rm', - // Use the output of dockerBuildCache to avoid downloading all the Gradle dependencies. - 'marklogic-spark-cache:latest', - 'gradle', '-i', '-PmlHost=bootstrap_3n.local', 'test' -} - -task dockerPerfTest(type: Exec) { - description = "Run PerformanceTester a Docker environment." - commandLine 'docker', 'run', - '--network=marklogic_spark_external_net', - '-v', getProjectDir().getAbsolutePath() + ':/root/project', - '-w', '/root/project', - '--rm', - 'marklogic-spark-cache:latest', - 'gradle', '-i', '-PmlHost=bootstrap_3n.local', 'perfTest' -} - task sourcesJar(type: Jar, dependsOn: classes) { archiveClassifier = "sources" from sourceSets.main.allSource diff --git a/docker-compose-3nodes.yaml b/docker-compose-3nodes.yaml index ec7766fb..228a85e8 100644 --- a/docker-compose-3nodes.yaml +++ b/docker-compose-3nodes.yaml @@ -30,7 +30,7 @@ services: # by this host. Note that each MarkLogic host has its 8000-8002 ports exposed externally so that the apps on those # ports can each be accessed if needed. bootstrap_3n: - image: "marklogicdb/marklogic-db:11.1.0-centos-1.1.0" + image: "marklogicdb/marklogic-db:${MARKLOGIC_TAG}" platform: linux/amd64 container_name: bootstrap_3n hostname: bootstrap_3n.local @@ -50,7 +50,7 @@ services: - internal_net node2: - image: "marklogicdb/marklogic-db:11.1.0-centos-1.1.0" + image: "marklogicdb/marklogic-db:${MARKLOGIC_TAG}" platform: linux/amd64 container_name: node2 hostname: node2.local @@ -74,7 +74,7 @@ services: - internal_net node3: - image: "marklogicdb/marklogic-db:11.1.0-centos-1.1.0" + image: "marklogicdb/marklogic-db:${MARKLOGIC_TAG}" platform: linux/amd64 container_name: node3 hostname: node3.local diff --git a/docker-compose.yaml b/docker-compose.yaml index e48d78b2..0dc976f4 100644 --- a/docker-compose.yaml +++ b/docker-compose.yaml @@ -19,7 +19,8 @@ services: # Copied from https://docs.sonarsource.com/sonarqube/latest/setup-and-upgrade/install-the-server/#example-docker-compose-configuration . sonarqube: - image: sonarqube:community + # Using 10.2 to avoid requiring Java 17 for now. 
+ image: sonarqube:10.2.1-community depends_on: - postgres environment: diff --git a/docs/Gemfile.lock b/docs/Gemfile.lock index b4ee060f..9f936db9 100644 --- a/docs/Gemfile.lock +++ b/docs/Gemfile.lock @@ -224,8 +224,8 @@ GEM rb-fsevent (0.11.2) rb-inotify (0.10.1) ffi (~> 1.0) - rexml (3.2.8) - strscan (>= 3.0.9) + rexml (3.3.2) + strscan rouge (3.26.0) ruby2_keywords (0.0.5) rubyzip (2.3.2) diff --git a/docs/configuration.md b/docs/configuration.md index 288c2a33..3fa79822 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -126,7 +126,9 @@ The following options control how the connector reads rows from MarkLogic via cu | --- | --- | | spark.marklogic.read.invoke | The path to a module to invoke; the module must be in your application's modules database. | | spark.marklogic.read.javascript | JavaScript code to execute. | +| spark.marklogic.read.javascriptFile | Local file path containing JavaScript code to execute. | | spark.marklogic.read.xquery | XQuery code to execute. | +| spark.marklogic.read.xqueryFile | Local file path containing XQuery code to execute. | | spark.marklogic.read.vars. | Prefix for user-defined variables to be sent to the custom code. | If you are using Spark's streaming support with custom code, or you need to break up your custom code query into @@ -136,7 +138,9 @@ multiple queries, the following options can also be used to control how partitio | --- | --- | | spark.marklogic.read.partitions.invoke | The path to a module to invoke; the module must be in your application's modules database. | | spark.marklogic.read.partitions.javascript | JavaScript code to execute. | +| spark.marklogic.read.partitions.javascriptFile | Local file path containing JavaScript code to execute. | | spark.marklogic.read.partitions.xquery | XQuery code to execute. | +| spark.marklogic.read.partitions.xqueryFile | Local file path containing XQuery code to execute. | ### Read options for documents @@ -156,6 +160,22 @@ The following options control how the connector reads document rows from MarkLog | spark.marklogic.read.documents.transformParams | Comma-delimited sequence of transform parameter names and values - e.g. `param1,value1,param2,value`. | | spark.marklogic.read.documents.transformParamsDelimiter | Delimiter for transform parameters; defaults to a comma. | +### Read options for files + +As of the 2.3.0 release, the connector supports reading aggregate XML files, RDF files, and ZIP files. The following +options control how the connector reads files: + +| Option | Description | +| --- | --- | +| spark.marklogic.read.aggregates.xml.element | Required when reading aggregate XML files; defines the name of the element for selecting elements to convert into Spark rows. | +| spark.marklogic.read.aggregates.xml.namespace | Optional namespace for the element identified by `spark.marklogic.read.aggregates.xml.element`. | +| spark.marklogic.read.aggregates.xml.uriElement | Optional element name for constructing a URI based on an element value. | +| spark.marklogic.read.aggregates.xml.uriNamespace | Optional namespace for the element identified by `spark.marklogic.read.aggregates.xml.uriElement`. | +| spark.marklogic.read.files.abortOnFailure | Set to `false` so that the connector logs errors and continues processing files. Defaults to `true`. | +| spark.marklogic.read.files.compression | Set to `gzip` or `zip` when reading compressed files. | +| spark.marklogic.read.files.type | Set to `rdf` when reading RDF files. 
This option only needs to be set when the connector is otherwise unable to detect that it should perform some sort of handling for the file. | + + ## Write options See [the guide on writing](writing.md) for more information on how data is written to MarkLogic. @@ -171,8 +191,10 @@ The following options control how the connector writes rows as documents to Mark | spark.marklogic.write.collections | Comma-delimited string of collection names to add to each document. | | spark.marklogic.write.permissions | Comma-delimited string of role names and capabilities to add to each document - e.g. role1,read,role2,update,role3,execute . | | spark.marklogic.write.fileRows.documentType | Forces a document type when MarkLogic does not recognize a URI extension; must be one of `JSON`, `XML`, or `TEXT`. | +| spark.marklogic.write.jsonRootName | As of 2.3.0, specifies a root field name when writing JSON documents based on arbitrary rows. | | spark.marklogic.write.temporalCollection | Name of a temporal collection to assign each document to. | -| spark.marklogic.write.threadCount | The number of threads used within each partition to send documents to MarkLogic; defaults to 4. | +| spark.marklogic.write.threadCount | The number of threads used across all partitions to send documents to MarkLogic; defaults to 4. | +| spark.marklogic.write.threadCountPerPartition | New in 2.3.0; the number of threads used per partition to send documents to MarkLogic. | | spark.marklogic.write.transform | Name of a REST transform to apply to each document. | | spark.marklogic.write.transformParams | Comma-delimited string of transform parameter names and values - e.g. param1,value1,param2,value2 . | | spark.marklogic.write.transformParamsDelimiter | Delimiter to use instead of a command for the `transformParams` option. | @@ -191,7 +213,9 @@ The following options control how rows can be processed with custom code in Mark | spark.marklogic.write.batchSize | The number of rows sent in a call to MarkLogic; defaults to 1. | | spark.marklogic.write.invoke | The path to a module to invoke; the module must be in your application's modules database. | | spark.marklogic.write.javascript | JavaScript code to execute. | +| spark.marklogic.write.javascriptFile | Local file path containing JavaScript code to execute. | | spark.marklogic.write.xquery | XQuery code to execute. | +| spark.marklogic.write.xqueryFile | Local file path containing XQuery code to execute. | | spark.marklogic.write.externalVariableName | Name of the external variable in custom code that is populated with row values; defaults to `URI`. | | spark.marklogic.write.externalVariableDelimiter | Delimiter used when multiple row values are sent in a single call; defaults to a comma. | | spark.marklogic.write.vars. | Prefix for user-defined variables to be sent to the custom code. | diff --git a/docs/reading-data/custom-code.md b/docs/reading-data/custom-code.md index 8d48a2d0..9375c684 100644 --- a/docs/reading-data/custom-code.md +++ b/docs/reading-data/custom-code.md @@ -55,6 +55,13 @@ df = spark.read.format("marklogic") \ df.show() ``` +As of the 2.3.0 release, you can also specify a local file path containing either JavaScript or XQuery code via +the `spark.marklogic.read.javascriptFile` and `spark.marklogic.read.xqueryFile` options. The value of the option +must be a file path that can be resolved by the Spark environment running the connector. The file will not be loaded +into your application's modules database. 
Its content will be read in and then evaluated in the same fashion as +when specifying code via `spark.marklogic.read.javascript` or `spark.marklogic.read.xquery`. + + ## Custom code schemas While the connector can infer a schema when executing an Optic query, it does not have any way to do so with custom @@ -104,12 +111,15 @@ your query into many smaller queries, you can use one of the following options t - `spark.marklogic.read.partitions.invoke` - `spark.marklogic.read.partitions.javascript` +- `spark.marklogic.read.partitions.javascriptFile` (New in 2.3.0) - `spark.marklogic.read.partitions.xquery` +- `spark.marklogic.read.partitions.xqueryFile` (New in 2.3.0) If one of the above options is defined, the connector will execute the code associated with the option and expect a sequence of values to be returned. You can return any values you want to define partitions; the connector does not care what the values represent. The connector will then execute your custom code - defined by `spark.marklogic.read.invoke`, -`spark.marklogic.read.javascript`, or `spark.marklogic.read.xquery` - once for each partition value. The partition value +`spark.marklogic.read.javascript`, `spark.marklogic.read.javascriptFile`, `spark.marklogic.read.xquery` or +`spark.marklogic.read.xqueryFile` - once for each partition value. The partition value will be defined in an external variable named `PARTITION`. Note as well that any external variables you define via the `spark.marklogic.read.vars` prefix will also be sent to the code for returning partitions. @@ -151,8 +161,9 @@ to read a stream of data from MarkLogic. This can be useful for when you wish to MarkLogic and immediately send them to a Spark writer. When streaming results from your custom code, you will need to set one of the options described above - either -`spark.marklogic.read.partitions.invoke`, `spark.marklogic.read.partitions.javascript`, or -`spark.marklogic.read.partitions.xquery` - for defining partitions. +`spark.marklogic.read.partitions.invoke`, `spark.marklogic.read.partitions.javascript`, +`spark.marklogic.read.partitions.javascriptFile`, `spark.marklogic.read.partitions.xquery`, or +`spark.marklogic.read.partitions.xqueryFile` - for defining partitions. The following example shows how the same connector configuration can be used for defining partitions and the custom code for returning rows, just with different Spark APIs. 
In this example, Spark will invoke the custom code once diff --git a/docs/reading-data/documents.md b/docs/reading-data/documents.md index cc7ca5e5..d39ff270 100644 --- a/docs/reading-data/documents.md +++ b/docs/reading-data/documents.md @@ -142,8 +142,8 @@ df.show(2) +--------------------+--------------------+------+-----------+--------------------+-------+----------+--------------+ | URI| content|format|collections| permissions|quality|properties|metadataValues| +--------------------+--------------------+------+-----------+--------------------+-------+----------+--------------+ -|/employee/70325be...|[7B 22 47 55 49 4...| JSON| [employee]|{rest-reader -> [...| 0| {}| {}| -|/employee/58ef1ba...|[7B 22 47 55 49 4...| JSON| [employee]|{rest-reader -> [...| 0| {}| {}| +|/employee/70325be...|[7B 22 47 55 49 4...| JSON| [employee]|{rest-reader -> [...| 0| null| {}| +|/employee/58ef1ba...|[7B 22 47 55 49 4...| JSON| [employee]|{rest-reader -> [...| 0| null| {}| +--------------------+--------------------+------+-----------+--------------------+-------+----------+--------------+ ``` diff --git a/docs/reading-data/reading-files/aggregate-xml.md b/docs/reading-data/reading-files/aggregate-xml.md new file mode 100644 index 00000000..96763445 --- /dev/null +++ b/docs/reading-data/reading-files/aggregate-xml.md @@ -0,0 +1,106 @@ +--- +layout: default +title: Aggregate XML +parent: Reading files +grand_parent: Reading Data +nav_order: 2 +--- + +XML files often contain aggregate data that can be disaggregated by splitting it into +multiple smaller documents rooted at a recurring element. Disaggregating large XML files consumes fewer resources +during loading and improves performance when searching and retrieving content. This guide describes how to use the +connector to read aggregate XML files and produce many rows from specific child elements. + +## Table of contents +{: .no_toc .text-delta } + +- TOC +{:toc} + +## Usage + +The connector supports the above use case via the `spark.marklogic.read.aggregates.xml.element` and optional +`spark.marklogic.read.aggregates.xml.namespace` options. When using these options, the connector will return rows with +the same schema as used by +[Spark's Binary data source](https://spark.apache.org/docs/latest/sql-data-sources-binaryFile.html). The connector +knows how to write rows adhering to this schema as documents in MarkLogic. + +The `examples/getting-started` directory in this repository contains a small XML file with multiple occurrences of +the element `Employee` in the namespace `org:example`. 
The following command demonstrates how to read this file such
+that each occurrence of the element `Employee` becomes a separate row in Spark (note that the namespace option is
+not required):
+
+```
+df = spark.read.format("marklogic") \
+    .option("spark.marklogic.read.aggregates.xml.element", "Employee") \
+    .option("spark.marklogic.read.aggregates.xml.namespace", "org:example") \
+    .load("data/employees.xml")
+df.show()
+```
+
+You can then write each of the rows as separate XML documents in MarkLogic:
+
+```
+df = spark.read.format("marklogic") \
+    .option("spark.marklogic.read.aggregates.xml.element", "Employee") \
+    .option("spark.marklogic.read.aggregates.xml.namespace", "org:example") \
+    .load("data/employees.xml")
+
+df.write.format("marklogic") \
+    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003") \
+    .option("spark.marklogic.write.collections", "aggregate-xml") \
+    .option("spark.marklogic.write.permissions", "rest-reader,read,rest-writer,update") \
+    .option("spark.marklogic.write.uriReplace", ".*/data,'/xml'") \
+    .mode("append") \
+    .save()
+```
+
+The above will produce 3 XML documents, each with a root element of `Employee` in the `org:example` namespace, in the
+`aggregate-xml` collection in MarkLogic.
+
+## Generating a URI via an element
+
+Some XML documents may contain a particular element that is useful for generating a unique URI for each document.
+You can specify the element name and optional namespace via the `spark.marklogic.read.aggregates.xml.uriElement` and
+optional `spark.marklogic.read.aggregates.xml.uriNamespace` options as shown below:
+
+```
+df = spark.read.format("marklogic") \
+    .option("spark.marklogic.read.aggregates.xml.element", "Employee") \
+    .option("spark.marklogic.read.aggregates.xml.namespace", "org:example") \
+    .option("spark.marklogic.read.aggregates.xml.uriElement", "name") \
+    .option("spark.marklogic.read.aggregates.xml.uriNamespace", "org:example") \
+    .load("data/employees.xml")
+df.show()
+```
+
+## Reading compressed files
+
+The connector supports reading GZIP and ZIP compressed files via the `spark.marklogic.read.files.compression` option.
+
+For a GZIP compressed file, set the option to a value of `gzip`:
+
+```
+.option("spark.marklogic.read.files.compression", "gzip")
+```
+
+Each aggregate XML file will be unzipped first and then processed normally.
+
+For a ZIP compressed file, which may contain one to many aggregate XML files, set the option to a value of `zip`:
+
+```
+.option("spark.marklogic.read.files.compression", "zip")
+```
+
+Each entry in the zip file must be an aggregate XML file. The same element and namespace, along with URI element and
+namespace, will be applied to every file in the zip.
+
+## Error handling
+
+The connector defaults to throwing any error that occurs while reading an aggregate XML file. You can set the
+`spark.marklogic.read.files.abortOnFailure` option to `false` to have each error logged instead. The connector will
+continue trying to process each aggregate XML file.
+
+In the case of an error due to an element missing the child element specified by
+`spark.marklogic.read.aggregates.xml.uriElement`, the connector will log the error and continue trying to process
+elements in the aggregate XML file.
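+
+As an illustration - and as a sketch only, reusing the `employees.xml` example from above - the following read sets
+`spark.marklogic.read.files.abortOnFailure` to `false` so that any file or element that cannot be processed is logged
+and skipped rather than failing the job:
+
+```
+df = spark.read.format("marklogic") \
+    .option("spark.marklogic.read.aggregates.xml.element", "Employee") \
+    .option("spark.marklogic.read.aggregates.xml.namespace", "org:example") \
+    .option("spark.marklogic.read.files.abortOnFailure", "false") \
+    .load("data/employees.xml")
+df.show()
+```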
diff --git a/docs/reading-data/reading-files/generic-file-support.md b/docs/reading-data/reading-files/generic-file-support.md new file mode 100644 index 00000000..d3b3f10f --- /dev/null +++ b/docs/reading-data/reading-files/generic-file-support.md @@ -0,0 +1,81 @@ +--- +layout: default +title: Generic file support +parent: Reading files +grand_parent: Reading Data +nav_order: 1 +--- + +The MarkLogic connector extends Spark's support for reading files to include file types that benefit from special +handling when trying to import files into MarkLogic. This page describes the features that are inherited from +Spark for reading files. + +## Table of contents +{: .no_toc .text-delta } + +- TOC +{:toc} + +## Selecting files to read + +Use Spark's standard `load()` function or `path` option: + +``` +df = spark.read.format("marklogic") \ + .option("spark.marklogic.read.files.compression", "zip") \ + .load("path/to/zipfiles") +``` + +Or: + +``` +df = spark.read.format("marklogic") \ + .option("spark.marklogic.read.files.compression", "zip") \ + .option("path", "path/to/zipfiles") \ + .load() +``` + +## Generic Spark file source options + +The connector also supports the following +[generic Spark file source options](https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html): + +- Use `pathGlobFilter` to only include files with file names matching the given pattern. +- Use `recursiveFileLookup` to include files in child directories. +- Use `modifiedBefore` and `modifiedAfter` to select files based on their modification time. + +## Reading any file + +If you wish to read files without any special handling provided by the connector, you can use the +[Spark Binary data source](https://spark.apache.org/docs/latest/sql-data-sources-binaryFile.html). If you try to write these rows as documents, the connector will recognize +the Binary data source schema and write each row as a separate document. For example, the following will +write each file in the `examples/getting-started/data` directory in this repository without any special handling +of each file: + +``` +spark.read.format("binaryFile") \ + .option("recursiveFileLookup", True) \ + .load("data") \ + .write.format("marklogic") \ + .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003") \ + .option("spark.marklogic.write.collections", "binary-example") \ + .option("spark.marklogic.write.permissions", "rest-reader,read,rest-writer,update") \ + .option("spark.marklogic.write.uriReplace", ".*data,'/binary-example'") \ + .mode("append") \ + .save() +``` + +The above will result in each document in the `data` directory being written as a document to MarkLogic. MarkLogic +will determine the document type based on the file extension. + +If you are writing files with extensions that MarkLogic does not recognize based on its configured set of MIME types, +you can force a document type for each file with an unrecognized extension: + +``` + .option("spark.marklogic.write.fileRows.documentType", "JSON") +``` + +The `spark.marklogic.write.fileRows.documentType` option supports values of `JSON`, `XML`, and `TEXT`. + +Please see [the guide on writing data](../../writing.md) for information on how "file rows" can then be written to +MarkLogic as documents. 
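+
+As a brief sketch combining the generic Spark file source options listed above with this connector's ZIP handling,
+the following limits a read to ZIP files found recursively under a directory (the directory path is only a
+placeholder):
+
+```
+df = spark.read.format("marklogic") \
+    .option("spark.marklogic.read.files.compression", "zip") \
+    .option("recursiveFileLookup", True) \
+    .option("pathGlobFilter", "*.zip") \
+    .load("path/to/zipfiles")
+df.show()
+```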
diff --git a/docs/reading-data/reading-files/rdf-data.md b/docs/reading-data/reading-files/rdf-data.md new file mode 100644 index 00000000..2c4b9cb4 --- /dev/null +++ b/docs/reading-data/reading-files/rdf-data.md @@ -0,0 +1,108 @@ +--- +layout: default +title: RDF data +parent: Reading files +grand_parent: Reading Data +nav_order: 3 +--- + +The MarkLogic connector supports reading [files containing RDF data](https://www.w3.org/RDF/), allowing you to ingest +this data into MarkLogic and leverage the power of +[semantic graphs in MarkLogic](https://docs.marklogic.com/guide/semantics). + +## Table of contents +{: .no_toc .text-delta } + +- TOC +{:toc} + +## Reading RDF files + +To read RDF files, configure the connector by setting the option `spark.marklogic.read.files.type` to a value of `rdf`: + +``` +df = spark.read.format("marklogic") \ + .option("spark.marklogic.read.files.type", "rdf") \ + .load("data/taxonomy.xml") +df.show() +``` + +The connector returns rows with the following schema: + +- `subject` = string representing the subject of the RDF triple. +- `predicate` = string representing the predicate of the RDF triple. +- `object` = string representing the object value of the RDF triple. +- `datatype` = optional string defining datatype URI of the literal value in the `object` column. +- `lang` = optional string defining the language of the literal value in the `object` column. +- `graph` = optional string defining the graph associated with the RDF triple; only populated when reading quads. + +When reading files containing quads, such as TriG and N-Quads files, the `graph` column will be populated with the +semantic graph associated with each triple: + +``` +df = spark.read.format("marklogic") \ + .option("spark.marklogic.read.files.type", "rdf") \ + .load("data/quads.trig") +df.show() +``` + +## Supported RDF file types + +The connector supports the same [RDF data formats](https://docs.marklogic.com/guide/semantics/loading#id_70682) as +MarkLogic server does, which are listed below: + +- RDF/JSON +- RDF/XML +- N3 +- N-Quads +- N-Triples +- TriG +- Turtle + +The connector depends on the [Apache Jena library](https://jena.apache.org/index.html) for reading RDF files. While the +connector has only been officially tested with the above files types, you may be able to read all the +[supported Jena file types](https://jena.apache.org/documentation/io/#command-line-tools). + +## Configuring a graph + +The connector supports configuring a graph when writing triples to MarkLogic, but not yet when reading data from RDF +files. However, you can easily set the value of the `graph` column in each row via Spark - one approach is shown below: + +``` +from pyspark.sql.functions import lit +spark.read.format("marklogic") \ + .option("spark.marklogic.read.files.type", "rdf") \ + .load("data/taxonomy.xml") \ + .withColumn("graph", lit("example-graph")) \ + .show() +``` + +## Reading compressed files + +The connector supports reading GZIP and ZIP compressed files via the `spark.marklogic.read.files.compression` option. + +For a GZIP compressed RDF file, set the option to a value of `gzip`: + +``` +.option("spark.marklogic.read.files.compression", "gzip") +``` + +Each RDF file will be unzipped first and then processed normally. + +For a ZIP compressed file, which may contain one to many RDF files, set the option to a value of `zip`: + +``` +.option("spark.marklogic.read.files.compression", "zip") +``` + +The ZIP may contain RDF files of different types. 
Each RDF file will be processed separately, with each triple in +each file becoming a separate Spark row. + + +## Error handling + +The connector defaults to throwing any error that occurs while reading an RDF file. You can set the +`spark.marklogic.read.files.abortOnFailure` option to `false` to have each error logged instead. When an error occurs, +the connector will not process the rest of the file that caused the error, but it will continue processing every other +selected file or entry in a zip file. + diff --git a/docs/reading-data/reading-files/reading-files.md b/docs/reading-data/reading-files/reading-files.md new file mode 100644 index 00000000..3be7c932 --- /dev/null +++ b/docs/reading-data/reading-files/reading-files.md @@ -0,0 +1,19 @@ +--- +layout: default +title: Reading files +nav_order: 4 +has_children: true +parent: Reading Data +--- + + +As of the 2.3.0 release, the MarkLogic Spark connector extends the out-of-the-box Spark capabilities for reading files +to include support for reading the following file types: + +- Aggregate XML files +- RDF files +- ZIP files + +Please see the guides below for more information on each of the above file types, as well as for using +[Spark's Binary data source](https://spark.apache.org/docs/latest/sql-data-sources-binaryFile.html) to read any type +of file and write it to MarkLogic. diff --git a/docs/reading-data/reading-files/zip-files.md b/docs/reading-data/reading-files/zip-files.md new file mode 100644 index 00000000..cffa5f4a --- /dev/null +++ b/docs/reading-data/reading-files/zip-files.md @@ -0,0 +1,48 @@ +--- +layout: default +title: ZIP files +parent: Reading files +grand_parent: Reading Data +nav_order: 4 +--- + +The MarkLogic connector has special handling for ZIP files, enabling each entry in a ZIP file to be read as a separate +row and eventually written to MarkLogic as a separate document. + +## Table of contents +{: .no_toc .text-delta } + +- TOC +{:toc} + +## Specifying ZIP files to read + +To configure the connector to read each entry in one or more ZIP files as separate rows, set the +`spark.marklogic.read.files.compression` option to a value of `zip`: + +``` +df = spark.read.format("marklogic") \ + .option("spark.marklogic.read.files.compression", "zip") \ + .load("data/employees.zip") +df.show() +``` + +The connector will return 1 row per entry in each zip file, with each row conforming to the +[Spark Binary data source schema](https://spark.apache.org/docs/latest/sql-data-sources-binaryFile.html). Each row +will have a `path` column with a value based on the path of the ZIP file and the name of the ZIP entry. + +To see the full path - which you will likely want to customize if writing these rows as documents to +MarkLogic - try the following: + +``` +df.select("path").show(20, 0, True) +``` + +Please see [the guide on writing data](../../writing.md) for information on how "file rows" can then be written to +MarkLogic as documents. + +## Error handling + +Due to how the underlying Java support for reading ZIP files works, files that are not valid ZIP files do not result +in any errors being thrown. Instead, the Java support simply does not return any rows for any file that it cannot read +as a ZIP file. 
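+
+For completeness, here is a sketch that chains the read shown above with a write, using the same connection details
+and permissions as the other examples in these docs; the collection name and the `uriReplace` value are arbitrary
+examples that you would adjust for your own data:
+
+```
+spark.read.format("marklogic") \
+    .option("spark.marklogic.read.files.compression", "zip") \
+    .load("data/employees.zip") \
+    .write.format("marklogic") \
+    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003") \
+    .option("spark.marklogic.write.collections", "zip-example") \
+    .option("spark.marklogic.write.permissions", "rest-reader,read,rest-writer,update") \
+    .option("spark.marklogic.write.uriReplace", ".*/data,'/zip-example'") \
+    .mode("append") \
+    .save()
+```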
diff --git a/docs/reading-data/reading.md b/docs/reading-data/reading.md index 766e44a2..42da62a2 100644 --- a/docs/reading-data/reading.md +++ b/docs/reading-data/reading.md @@ -9,5 +9,9 @@ permalink: /docs/reading-data The connector allows for data to be retrieved from MarkLogic as rows either via an [Optic query](https://docs.marklogic.com/guide/app-dev/OpticAPI#id_46710) or via custom code written in JavaScript or XQuery, or as "document rows" where each row corresponds -to a document in MarkLogic. See the guides below for more information. +to a document in MarkLogic. + +As of the 2.3.0 release, you can also use the connector to read files from a local filesystem, HDFS, or S3. + +See the guides below for more information. diff --git a/docs/writing.md b/docs/writing.md index 49b2998f..774b5f85 100644 --- a/docs/writing.md +++ b/docs/writing.md @@ -66,9 +66,13 @@ documents to MarkLogic: 4. `collections` is an array of `string`s. 5. `permissions` is a map with keys of type `string` and values that are arrays of `string`s. 6. `quality` is an `integer`. -7. `properties` is a map with keys and values of type `string`. +7. `properties` is of type `string` and must be a serialized XML string of MarkLogic properties in the `http://marklogic.com/xdmp/property` namespace. 8. `metadataValues` is a map with keys and values of type `string`. +Note that in the 2.2.0 release of the connector, the `properties` column was a map with keys and values of type +`string`. This approach could not handle complex XML structures and was thus fixed in the 2.3.0 release to be of +type `string`. + Writing rows corresponding to the "document row" schema is largely the same as writing rows of any arbitrary schema, but bear in mind these differences: @@ -104,6 +108,27 @@ the parameter values contains a comma: .option("spark.marklogic.write.transformParams", "my-param;has,commas") .option("spark.marklogic.write.transformParamsDelimiter", ";") +### Setting a JSON root name + +As of 2.3.0, when writing JSON documents based on arbitrary rows, you can specify a root field name to be inserted +at the top level of the document. Each column value will then be included in an object assigned to that root field name. +This can be useful when you want your JSON data to be more self-documenting. For example, a document representing an +employee may be easier to understand if it has a single root field named "Employee" with every property of the employee +captured in an object assigned to the "Employee" field. + +The following will produce JSON documents that each have a single root field named "myRootField", with all column +values in a row assigned to an object associated with "myRootField": + +``` +df.write.format("marklogic") \ + .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003") \ + .option("spark.marklogic.write.permissions", "rest-reader,read,rest-writer,update") \ + .option("spark.marklogic.write.uriPrefix", "/write/") \ + .option("spark.marklogic.write.jsonRootName", "myRootField") \ + .mode("append") \ + .save() +``` + ### Controlling document URIs By default, the connector will construct a URI for each document beginning with a UUID and ending with `.json`. A @@ -135,6 +160,14 @@ The following template would construct URIs based on those two columns: Both columns should have values in each row in the DataFrame. If the connector encounters a row that does not have a value for any column in the URI template, an error will be thrown. 
+As of the 2.3.0 release, you can also use a [JSONPointer](https://www.rfc-editor.org/rfc/rfc6901) expression to +reference a value. This is often useful in conjunction with the `spark.marklogic.write.jsonRootName` option. For +example, if `spark.marklogic.write.jsonRootName` is set to "Employee", and you wish to include the `employee_id` +value in a URI, you would use the following configuration: + + .option("spark.marklogic.write.uriTemplate", "/example/{organization}/{/Employee/employee_id}.json") + + If you are writing file rows that conform to [Spark's binaryFile schema](https://spark.apache.org/docs/latest/sql-data-sources-binaryFile.html), the `path`, `modificationTime`, and `length` columns will be available for use with the template. The `content` column will not be @@ -211,8 +244,15 @@ The connector uses MarkLogic's following options can be set to adjust how the connector performs when writing data: - `spark.marklogic.write.batchSize` = the number of documents written in one call to MarkLogic; defaults to 100. -- `spark.marklogic.write.threadCount` = the number of threads used by each partition to write documents to MarkLogic; +- `spark.marklogic.write.threadCount` = the number of threads used across all partitions to write documents to MarkLogic; defaults to 4. +- `spark.marklogic.write.threadCountPerPartition` = the number of threads to use per partition to write documents to +MarkLogic. If set, will be used instead of `spark.marklogic.write.threadCount`. + +Prior to the 2.3.0 release, `spark.marklogic.write.threadCount` configured a number of threads per partition. Based on +feedback, this was changed to the number of total threads used across all partitions to match what users typically +expect "thread count" to mean in the context of writing to MarkLogic. `spark.marklogic.write.threadCountPerPartition` +was then added for users who do wish to configure a number of threads per Spark partition. These options are in addition to the number of partitions within the Spark DataFrame that is being written to MarkLogic. For each partition in the DataFrame, a separate instance of a MarkLogic batch writer is created, each @@ -231,7 +271,7 @@ the connector can directly connect to each host in the cluster. The rule of thumb above can thus be expressed as: - Number of partitions * Value of spark.marklogic.write.threadCount <= Number of hosts * number of app server threads + Value of spark.marklogic.write.threadCount <= Number of hosts * number of app server threads ### Using a load balancer @@ -317,6 +357,12 @@ spark.read.format("marklogic") \ .save() ``` +As of the 2.3.0 release, you can also specify a local file path containing either JavaScript or XQuery code via +the `spark.marklogic.write.javascriptFile` and `spark.marklogic.write.xqueryFile` options. The value of the option +must be a file path that can be resolved by the Spark environment running the connector. The file will not be loaded +into your application's modules database. Its content will be read in and then evaluated in the same fashion as +when specifying code via `spark.marklogic.write.javascript` or `spark.marklogic.write.xquery`. + ### Processing multiple rows in a single call By default, a single row is sent by the connector to your custom code. 
In many use cases, particularly when writing diff --git a/examples/entity-aggregation/build.gradle b/examples/entity-aggregation/build.gradle index 403de1a9..6fca031f 100644 --- a/examples/entity-aggregation/build.gradle +++ b/examples/entity-aggregation/build.gradle @@ -7,9 +7,9 @@ repositories { } dependencies { - implementation 'org.apache.spark:spark-sql_2.12:3.4.1' + implementation 'org.apache.spark:spark-sql_2.12:3.4.3' implementation "com.marklogic:marklogic-spark-connector:2.2.0" - implementation "org.postgresql:postgresql:42.6.0" + implementation "org.postgresql:postgresql:42.6.2" } task importCustomers(type: JavaExec) { diff --git a/examples/getting-started/data/employees.xml b/examples/getting-started/data/employees.xml new file mode 100644 index 00000000..269f4c9e --- /dev/null +++ b/examples/getting-started/data/employees.xml @@ -0,0 +1,16 @@ + + + John + 40 + + + Jane + 41 + + + + Brenda + 42 + + + diff --git a/examples/getting-started/data/quads.trig b/examples/getting-started/data/quads.trig new file mode 100644 index 00000000..898900df --- /dev/null +++ b/examples/getting-started/data/quads.trig @@ -0,0 +1,30 @@ +# This document encodes three graphs. + +@prefix rdf: . +@prefix xsd: . +@prefix swp: . +@prefix dc: . +@prefix ex: . +@prefix : . + +:G1 { :Monica ex:name "Monica Murphy" . + :Monica ex:homepage . + :Monica ex:email . + :Monica ex:hasSkill ex:Management } + +:G2 { :Monica rdf:type ex:Person . + :Monica ex:hasSkill ex:Programming } + +:G3 { :G1 swp:assertedBy _:w1 . + _:w1 swp:authority :Chris . + _:w1 dc:date "2003-10-02"^^xsd:date . + :G2 swp:quotedBy _:w2 . + :G3 swp:assertedBy _:w2 . + _:w2 dc:date "2003-09-03"^^xsd:date . + _:w2 swp:authority :Chris . + :Chris rdf:type ex:Person . + :Chris ex:email } + +{ + :Default ex:graphname "Default" +} diff --git a/examples/getting-started/data/taxonomy.xml b/examples/getting-started/data/taxonomy.xml new file mode 100644 index 00000000..9fc7d0da --- /dev/null +++ b/examples/getting-started/data/taxonomy.xml @@ -0,0 +1,18 @@ + + + + + + wb + 2013-05-21T15:49:55Z + 0 + Debt Management + + + + + + diff --git a/examples/java-dependency/build.gradle b/examples/java-dependency/build.gradle index c50b5eb4..f20a43bc 100644 --- a/examples/java-dependency/build.gradle +++ b/examples/java-dependency/build.gradle @@ -7,7 +7,7 @@ repositories { } dependencies { - implementation 'org.apache.spark:spark-sql_2.12:3.4.1' + implementation 'org.apache.spark:spark-sql_2.12:3.4.3' implementation 'com.marklogic:marklogic-spark-connector:2.2.0' } diff --git a/gradle.properties b/gradle.properties index b208ed9b..d70c573e 100644 --- a/gradle.properties +++ b/gradle.properties @@ -1,13 +1,6 @@ -# Testing against 3.3.2 for the 2.0.0 release as 3.3.0 was released in June 2022 and 3.3.2 in February 2023, while -# 3.4.0 is fairly new - April 2023. And at least AWS Glue and EMR are only on 3.3.0. But 3.3.2 has bug fixes that -# affect some of our tests - see PushDownGroupByCountTest for an example. So we're choosing to build and test -# against the latest 3.3.x release so we're not writing assertions based on buggy behavior in Spark 3.3.0. -# -# For 2.1.0, planning on using at least 3.4.x, and possibly 3.5.x. All tests are passing with 3.4.x when authors are -# in a single document on MarkLogic 11. The tests that verify the number of rows read from MarkLogic (as opposed to -# rows in the Spark dataset) will fail on MarkLogic 12 for now given that all rows come from the same document, and thus -# all come from a single call to MarkLogic. 
-sparkVersion=3.4.1 +# Staying with 3.4.x for now, as some pushdown tests are failing when using 3.5.x. +# 3.4.3 release notes - https://spark.apache.org/releases/spark-release-3-4-3.html . +sparkVersion=3.4.3 # Only used for the test app and for running tests. mlHost=localhost diff --git a/src/main/java/com/marklogic/spark/ConnectionString.java b/src/main/java/com/marklogic/spark/ConnectionString.java new file mode 100644 index 00000000..953b6bde --- /dev/null +++ b/src/main/java/com/marklogic/spark/ConnectionString.java @@ -0,0 +1,83 @@ +package com.marklogic.spark; + +import java.io.UnsupportedEncodingException; +import java.net.URLDecoder; + +public class ConnectionString { + + private final String host; + private final int port; + private final String username; + private final String password; + private final String database; + + public ConnectionString(String connectionString, String optionNameForErrorMessage) { + final String errorMessage = String.format( + "Invalid value for %s; must be username:password@host:port/optionalDatabaseName", + optionNameForErrorMessage + ); + + String[] parts = connectionString.split("@"); + if (parts.length != 2) { + throw new ConnectorException(errorMessage); + } + String[] tokens = parts[0].split(":"); + if (tokens.length != 2) { + throw new ConnectorException(errorMessage); + } + this.username = decodeValue(tokens[0], "username"); + this.password = decodeValue(tokens[1], "password"); + + tokens = parts[1].split(":"); + if (tokens.length != 2) { + throw new ConnectorException(errorMessage); + } + this.host = tokens[0]; + if (tokens[1].contains("/")) { + tokens = tokens[1].split("/"); + this.port = parsePort(tokens[0], optionNameForErrorMessage); + this.database = tokens[1]; + } else { + this.port = parsePort(tokens[1], optionNameForErrorMessage); + this.database = null; + } + } + + private int parsePort(String value, String optionNameForErrorMessage) { + try { + return Integer.parseInt(value); + } catch (NumberFormatException e) { + throw new ConnectorException(String.format( + "Invalid value for %s; port must be numeric, but was '%s'", optionNameForErrorMessage, value + )); + } + } + + private String decodeValue(String value, String label) { + try { + return URLDecoder.decode(value, "UTF-8"); + } catch (UnsupportedEncodingException e) { + throw new ConnectorException(String.format("Unable to decode '%s'; cause: %s", label, e.getMessage())); + } + } + + public String getHost() { + return host; + } + + public int getPort() { + return port; + } + + public String getUsername() { + return username; + } + + public String getPassword() { + return password; + } + + public String getDatabase() { + return database; + } +} diff --git a/src/main/java/com/marklogic/spark/ContextSupport.java b/src/main/java/com/marklogic/spark/ContextSupport.java index ffb538f0..1d18ac18 100644 --- a/src/main/java/com/marklogic/spark/ContextSupport.java +++ b/src/main/java/com/marklogic/spark/ContextSupport.java @@ -22,8 +22,6 @@ import org.slf4j.LoggerFactory; import java.io.Serializable; -import java.io.UnsupportedEncodingException; -import java.net.URLDecoder; import java.util.HashMap; import java.util.Map; import java.util.concurrent.TimeUnit; @@ -68,7 +66,13 @@ public DatabaseClient connectToMarkLogic(String host) { } DatabaseClient.ConnectionResult result = client.checkConnection(); if (!result.isConnected()) { - throw new ConnectorException(String.format("Unable to connect to MarkLogic; status code: %d; error message: %s", result.getStatusCode(), 
result.getErrorMessage())); + if (result.getStatusCode() == 404) { + throw new ConnectorException(String.format("Unable to connect to MarkLogic; status code: 404; ensure that " + + "you are attempting to connect to a MarkLogic REST API app server. See the MarkLogic documentation on " + + "REST API app servers for more information.")); + } + throw new ConnectorException(String.format( + "Unable to connect to MarkLogic; status code: %d; error message: %s", result.getStatusCode(), result.getErrorMessage())); } return client; } @@ -86,15 +90,12 @@ protected final Map buildConnectionProperties() { connectionProps.put("spark.marklogic.client.authType", "digest"); connectionProps.put("spark.marklogic.client.connectionType", "gateway"); connectionProps.putAll(this.properties); - if (optionExists(Options.CLIENT_URI)) { - parseClientUri(properties.get(Options.CLIENT_URI), connectionProps); + parseConnectionString(properties.get(Options.CLIENT_URI), connectionProps); } - if ("true".equalsIgnoreCase(properties.get(Options.CLIENT_SSL_ENABLED))) { connectionProps.put("spark.marklogic.client.sslProtocol", "default"); } - return connectionProps; } @@ -103,52 +104,34 @@ public final boolean optionExists(String option) { return value != null && value.trim().length() > 0; } - private void parseClientUri(String clientUri, Map connectionProps) { - final String errorMessage = String.format("Invalid value for %s; must be username:password@host:port", Options.CLIENT_URI); - String[] parts = clientUri.split("@"); - if (parts.length != 2) { - throw new IllegalArgumentException(errorMessage); - } - String[] tokens = parts[0].split(":"); - if (tokens.length != 2) { - throw new IllegalArgumentException(errorMessage); - } - connectionProps.put(Options.CLIENT_USERNAME, decodeValue(tokens[0], "username")); - connectionProps.put(Options.CLIENT_PASSWORD, decodeValue(tokens[1], "password")); - - tokens = parts[1].split(":"); - if (tokens.length != 2) { - throw new IllegalArgumentException(errorMessage); - } - connectionProps.put(Options.CLIENT_HOST, tokens[0]); - if (tokens[1].contains("/")) { - tokens = tokens[1].split("/"); - connectionProps.put(Options.CLIENT_PORT, tokens[0]); - connectionProps.put(Options.CLIENT_DATABASE, tokens[1]); - } else { - connectionProps.put(Options.CLIENT_PORT, tokens[1]); - } + public final String getOptionNameForMessage(String option) { + return Util.getOptionNameForErrorMessage(option); } - private String decodeValue(String value, String label) { - try { - return URLDecoder.decode(value, "UTF-8"); - } catch (UnsupportedEncodingException e) { - throw new ConnectorException(String.format("Unable to decode %s; cause: %s", label, value)); + private void parseConnectionString(String value, Map connectionProps) { + ConnectionString connectionString = new ConnectionString(value, getOptionNameForMessage("spark.marklogic.client.uri")); + connectionProps.put(Options.CLIENT_USERNAME, connectionString.getUsername()); + connectionProps.put(Options.CLIENT_PASSWORD, connectionString.getPassword()); + connectionProps.put(Options.CLIENT_HOST, connectionString.getHost()); + connectionProps.put(Options.CLIENT_PORT, connectionString.getPort() + ""); + + String db = connectionString.getDatabase(); + if (db != null && db.trim().length() > 0) { + connectionProps.put(Options.CLIENT_DATABASE, db); } } - protected long getNumericOption(String optionName, long defaultValue, long minimumValue) { + public final long getNumericOption(String optionName, long defaultValue, long minimumValue) { try { long value = 
this.getProperties().containsKey(optionName) ? Long.parseLong(this.getProperties().get(optionName)) : defaultValue; - if (value < minimumValue) { - throw new ConnectorException(String.format("Value of '%s' option must be %d or greater.", optionName, minimumValue)); + if (value != defaultValue && value < minimumValue) { + throw new ConnectorException(String.format("The value of '%s' must be %d or greater.", getOptionNameForMessage(optionName), minimumValue)); } return value; } catch (NumberFormatException ex) { - throw new ConnectorException(String.format("Value of '%s' option must be numeric.", optionName), ex); + throw new ConnectorException(String.format("The value of '%s' must be numeric.", getOptionNameForMessage(optionName)), ex); } } @@ -170,6 +153,10 @@ public final boolean hasOption(String... options) { return Util.hasOption(this.properties, options); } + public final String getStringOption(String option) { + return hasOption(option) ? properties.get(option).trim() : null; + } + public Map getProperties() { return properties; } diff --git a/src/main/java/com/marklogic/spark/DefaultSource.java b/src/main/java/com/marklogic/spark/DefaultSource.java index c336ab38..01d546ae 100644 --- a/src/main/java/com/marklogic/spark/DefaultSource.java +++ b/src/main/java/com/marklogic/spark/DefaultSource.java @@ -20,7 +20,7 @@ import com.marklogic.client.row.RowManager; import com.marklogic.spark.reader.document.DocumentRowSchema; import com.marklogic.spark.reader.document.DocumentTable; -import com.marklogic.spark.reader.file.FileRowSchema; +import com.marklogic.spark.reader.file.TripleRowSchema; import com.marklogic.spark.reader.optic.SchemaInferrer; import com.marklogic.spark.writer.WriteContext; import org.apache.spark.sql.SparkSession; @@ -63,12 +63,14 @@ public String shortName() { public StructType inferSchema(CaseInsensitiveStringMap options) { final Map properties = options.asCaseSensitiveMap(); if (isFileOperation(properties)) { - return FileRowSchema.SCHEMA; + final String type = properties.get(Options.READ_FILES_TYPE); + return "rdf".equalsIgnoreCase(type) ? TripleRowSchema.SCHEMA : DocumentRowSchema.SCHEMA; } if (isReadDocumentsOperation(properties)) { return DocumentRowSchema.SCHEMA; - } - if (isReadWithCustomCodeOperation(properties)) { + } else if (isReadTriplesOperation(properties)) { + return TripleRowSchema.SCHEMA; + } else if (Util.isReadWithCustomCodeOperation(properties)) { return new StructType().add("URI", DataTypes.StringType); } return inferSchemaFromOpticQuery(properties); @@ -77,24 +79,35 @@ public StructType inferSchema(CaseInsensitiveStringMap options) { @Override public Table getTable(StructType schema, Transform[] partitioning, Map properties) { if (isFileOperation(properties)) { + // Not yet supporting progress logging for file operations. return new MarkLogicFileTable(SparkSession.active(), new CaseInsensitiveStringMap(properties), JavaConverters.asScalaBuffer(getPaths(properties)), schema ); } + final ContextSupport tempContext = new ContextSupport(properties); + + // The appropriate progress logger is reset here so that when the connector is used repeatedly in an + // environment like PySpark, the counts start with zero on each new Spark job. 
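The reset described in the comment above is driven by the `spark.marklogic.read.logProgress` option read on the following line (the constant is added to `Options.java` further down in this diff). For context, a minimal, hypothetical Java usage is sketched below; the connection string, Optic query, and interval are placeholders and not values from this change.

```
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadProgressExample {
    public static void main(String[] args) {
        SparkSession session = SparkSession.builder()
            .master("local[*]")
            .appName("read-progress-example")
            .getOrCreate();

        // For an Optic read, DefaultSource initializes ReadProgressLogger with the
        // "Rows read: {}" message, so a log line appears after roughly every 10,000 rows here.
        Dataset<Row> rows = session.read()
            .format("marklogic")
            .option("spark.marklogic.client.uri", "myUser:myPassword@localhost:8000")
            .option("spark.marklogic.read.opticQuery", "op.fromView('example', 'employee')")
            .option("spark.marklogic.read.logProgress", 10000)
            .load();

        System.out.println("Row count: " + rows.count());
    }
}
```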
+ final long readProgressInterval = tempContext.getNumericOption(Options.READ_LOG_PROGRESS, 0, 0); if (isReadDocumentsOperation(properties)) { - return new DocumentTable(); - } - else if (isReadOperation(properties)) { - if (logger.isDebugEnabled()) { - logger.debug("Creating new table for reading"); - } + ReadProgressLogger.initialize(readProgressInterval, "Documents read: {}"); + return new DocumentTable(DocumentRowSchema.SCHEMA); + } else if (isReadTriplesOperation(properties)) { + ReadProgressLogger.initialize(readProgressInterval, "Triples read: {}"); + return new DocumentTable(TripleRowSchema.SCHEMA); + } else if (properties.get(Options.READ_OPTIC_QUERY) != null) { + ReadProgressLogger.initialize(readProgressInterval, "Rows read: {}"); + return new MarkLogicTable(schema, properties); + } else if (Util.isReadWithCustomCodeOperation(properties)) { + ReadProgressLogger.initialize(readProgressInterval, "Items read: {}"); return new MarkLogicTable(schema, properties); } - if (logger.isDebugEnabled()) { - logger.debug("Creating new table for writing"); - } + + final long writeProgressInterval = tempContext.getNumericOption(Options.WRITE_LOG_PROGRESS, 0, 0); + String message = Util.isWriteWithCustomCodeOperation(properties) ? "Items processed: {}" : "Documents written: {}"; + WriteProgressLogger.initialize(writeProgressInterval, message); return new MarkLogicTable(new WriteContext(schema, properties)); } @@ -113,35 +126,43 @@ private boolean isFileOperation(Map properties) { return properties.containsKey("path") || properties.containsKey("paths"); } - private boolean isReadOperation(Map properties) { - return properties.get(Options.READ_OPTIC_QUERY) != null || isReadWithCustomCodeOperation(properties); - } - private boolean isReadDocumentsOperation(Map properties) { return properties.containsKey(Options.READ_DOCUMENTS_QUERY) || properties.containsKey(Options.READ_DOCUMENTS_STRING_QUERY) || properties.containsKey(Options.READ_DOCUMENTS_COLLECTIONS) || properties.containsKey(Options.READ_DOCUMENTS_DIRECTORY) || - properties.containsKey(Options.READ_DOCUMENTS_OPTIONS); + properties.containsKey(Options.READ_DOCUMENTS_OPTIONS) || + properties.containsKey(Options.READ_DOCUMENTS_URIS); } - private boolean isReadWithCustomCodeOperation(Map properties) { - return Util.hasOption(properties, Options.READ_INVOKE, Options.READ_XQUERY, Options.READ_JAVASCRIPT); + private boolean isReadTriplesOperation(Map properties) { + return Util.hasOption(properties, + Options.READ_TRIPLES_GRAPHS, + Options.READ_TRIPLES_COLLECTIONS, + Options.READ_TRIPLES_QUERY, + Options.READ_TRIPLES_STRING_QUERY, + Options.READ_TRIPLES_URIS, + Options.READ_TRIPLES_DIRECTORY + ); } private StructType inferSchemaFromOpticQuery(Map caseSensitiveOptions) { final String query = caseSensitiveOptions.get(Options.READ_OPTIC_QUERY); if (query == null || query.trim().length() < 1) { - throw new IllegalArgumentException(String.format("No Optic query found; must define %s", Options.READ_OPTIC_QUERY)); + throw new ConnectorException(Util.getOptionNameForErrorMessage("spark.marklogic.read.noOpticQuery")); } RowManager rowManager = new ContextSupport(caseSensitiveOptions).connectToMarkLogic().newRowManager(); RawQueryDSLPlan dslPlan = rowManager.newRawQueryDSLPlan(new StringHandle(query)); try { // columnInfo is what forces a minimum MarkLogic version of 10.0-9 or higher. 
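The comment above flags `columnInfo` as the call that sets the 10.0-9 minimum server version; the statement that follows feeds that column information to `SchemaInferrer`. As an illustrative aside, the same call can be made directly with the MarkLogic Java Client. The sketch below mirrors the connector's own calls; the host, port, credentials, and Optic query are placeholders.

```
import com.marklogic.client.DatabaseClient;
import com.marklogic.client.DatabaseClientFactory;
import com.marklogic.client.io.StringHandle;
import com.marklogic.client.row.RawQueryDSLPlan;
import com.marklogic.client.row.RowManager;

public class ColumnInfoExample {
    public static void main(String[] args) {
        DatabaseClient client = DatabaseClientFactory.newClient("localhost", 8000,
            new DatabaseClientFactory.DigestAuthContext("myUser", "myPassword"));
        try {
            RowManager rowManager = client.newRowManager();
            RawQueryDSLPlan plan = rowManager.newRawQueryDSLPlan(
                new StringHandle("op.fromView('example', 'employee')"));
            // The same call the connector makes; its output describes each column and is what
            // SchemaInferrer converts into a Spark StructType.
            String columnInfo = rowManager.columnInfo(plan, new StringHandle()).get();
            System.out.println(columnInfo);
        } finally {
            client.release();
        }
    }
}
```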
StringHandle columnInfoHandle = rowManager.columnInfo(dslPlan, new StringHandle()); - return SchemaInferrer.inferSchema(columnInfoHandle.get()); + StructType schema = SchemaInferrer.inferSchema(columnInfoHandle.get()); + if (Util.MAIN_LOGGER.isDebugEnabled()) { + logger.debug("Inferred schema from Optic columnInfo: {}", schema); + } + return schema; } catch (Exception ex) { - throw new ConnectorException(String.format("Unable to run Optic DSL query %s; cause: %s", query, ex.getMessage()), ex); + throw new ConnectorException(String.format("Unable to run Optic query %s; cause: %s", query, ex.getMessage()), ex); } } diff --git a/src/main/java/com/marklogic/spark/JsonRowSerializer.java b/src/main/java/com/marklogic/spark/JsonRowSerializer.java new file mode 100644 index 00000000..6b6dec76 --- /dev/null +++ b/src/main/java/com/marklogic/spark/JsonRowSerializer.java @@ -0,0 +1,82 @@ +/* + * Copyright © 2024 Progress Software Corporation and/or its subsidiaries or affiliates. All Rights Reserved. + */ +package com.marklogic.spark; + +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.catalyst.json.JSONOptions; +import org.apache.spark.sql.catalyst.json.JacksonGenerator; +import org.apache.spark.sql.types.StructType; +import scala.Predef; +import scala.collection.JavaConverters; + +import java.io.StringWriter; +import java.util.HashMap; +import java.util.Map; + +/** + * Handles serializing a Spark row into a JSON string. Includes support for all the options defined in Spark's + * JSONOptions.scala class. + */ +public class JsonRowSerializer { + + private final StructType schema; + private final JSONOptions jsonOptions; + private final boolean includeNullFields; + + public JsonRowSerializer(StructType schema, Map connectorProperties) { + this.schema = schema; + + final Map options = buildOptionsForJsonOptions(connectorProperties); + this.includeNullFields = "false".equalsIgnoreCase(options.get("ignoreNullFields")); + + this.jsonOptions = new JSONOptions( + // Funky code to convert a Java map into a Scala immutable Map. + JavaConverters.mapAsScalaMapConverter(options).asScala().toMap(Predef.$conforms()), + + // As verified via tests, this default timezone ID is overridden by a user via + // the spark.sql.session.timeZone option. + "Z", + + // We don't expect corrupted records - i.e. corrupted values - to be present in the index. But Spark + // requires this to be set. See + // https://medium.com/@sasidharan-r/how-to-handle-corrupt-or-bad-record-in-apache-spark-custom-logic-pyspark-aws-430ddec9bb41 + // for more information. + "_corrupt_record" + ); + } + + public String serializeRowToJson(InternalRow row) { + StringWriter writer = new StringWriter(); + JacksonGenerator jacksonGenerator = new JacksonGenerator(this.schema, writer, this.jsonOptions); + jacksonGenerator.write(row); + jacksonGenerator.flush(); + return writer.toString(); + } + + /** + * A user can specify any of the options found in the JSONOptions.scala class - though it's not yet clear where + * a user finds out about these except via the Spark source code. "ignoreNullFields" however is expected to be the + * primary one that is configured. + */ + private Map buildOptionsForJsonOptions(Map connectorProperties) { + Map options = new HashMap<>(); + // Default to include null fields, as they are easily queried in MarkLogic. 
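To make the default above concrete: `buildOptionsForJsonOptions` starts from `ignoreNullFields=false` and then copies in any option whose name begins with the `spark.marklogic.write.json.` prefix, handing the result to Spark's `JSONOptions`. A hypothetical write that overrides the default is sketched below; the data, credentials, and permission values are placeholders.

```
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JsonSerializationOptionsExample {
    public static void main(String[] args) {
        SparkSession session = SparkSession.builder()
            .master("local[*]")
            .appName("json-serialization-example")
            .getOrCreate();

        // A single row with a null column, to show the effect of ignoreNullFields.
        Dataset<Row> rows = session.sql("SELECT 'Jane' AS name, CAST(NULL AS STRING) AS nickname");

        rows.write()
            .format("marklogic")
            .option("spark.marklogic.client.uri", "myUser:myPassword@localhost:8000")
            .option("spark.marklogic.write.permissions", "rest-reader,read,rest-writer,update")
            // Overrides the connector default of ignoreNullFields=false, so the null "nickname"
            // column is omitted from each serialized JSON document.
            .option("spark.marklogic.write.json.ignoreNullFields", "true")
            .mode("append")
            .save();
    }
}
```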
+ options.put("ignoreNullFields", "false"); + connectorProperties.forEach((key, value) -> { + if (key.startsWith(Options.WRITE_JSON_SERIALIZATION_OPTION_PREFIX)) { + String optionName = key.substring(Options.WRITE_JSON_SERIALIZATION_OPTION_PREFIX.length()); + options.put(optionName, value); + } + }); + return options; + } + + public JSONOptions getJsonOptions() { + return jsonOptions; + } + + public boolean isIncludeNullFields() { + return this.includeNullFields; + } +} diff --git a/src/main/java/com/marklogic/spark/MarkLogicFileTable.java b/src/main/java/com/marklogic/spark/MarkLogicFileTable.java index c3f76ca8..32a5c759 100644 --- a/src/main/java/com/marklogic/spark/MarkLogicFileTable.java +++ b/src/main/java/com/marklogic/spark/MarkLogicFileTable.java @@ -45,7 +45,7 @@ public ScanBuilder newScanBuilder(CaseInsensitiveStringMap options) { @Override public WriteBuilder newWriteBuilder(LogicalWriteInfo info) { // Need to pass along a serializable object. - return new DocumentFileWriteBuilder(this.options.asCaseSensitiveMap()); + return new DocumentFileWriteBuilder(this.options.asCaseSensitiveMap(), this.schema); } @Override diff --git a/src/main/java/com/marklogic/spark/MarkLogicTable.java b/src/main/java/com/marklogic/spark/MarkLogicTable.java index 38435ac5..77ba7707 100644 --- a/src/main/java/com/marklogic/spark/MarkLogicTable.java +++ b/src/main/java/com/marklogic/spark/MarkLogicTable.java @@ -16,7 +16,7 @@ package com.marklogic.spark; import com.marklogic.spark.reader.optic.OpticScanBuilder; -import com.marklogic.spark.reader.optic.ReadContext; +import com.marklogic.spark.reader.optic.OpticReadContext; import com.marklogic.spark.reader.customcode.CustomCodeScanBuilder; import com.marklogic.spark.writer.MarkLogicWriteBuilder; import com.marklogic.spark.writer.WriteContext; @@ -75,7 +75,7 @@ class MarkLogicTable implements SupportsRead, SupportsWrite { */ @Override public ScanBuilder newScanBuilder(CaseInsensitiveStringMap options) { - if (Util.hasOption(readProperties, Options.READ_INVOKE, Options.READ_JAVASCRIPT, Options.READ_XQUERY)) { + if (Util.isReadWithCustomCodeOperation(readProperties)) { if (logger.isDebugEnabled()) { logger.debug("Will read rows via custom code"); } @@ -91,7 +91,7 @@ public ScanBuilder newScanBuilder(CaseInsensitiveStringMap options) { // This is needed by the Optic partition reader; capturing it in the ReadContext so that the reader does not // have a dependency on an active Spark session, which makes certain kinds of tests easier. int defaultMinPartitions = SparkSession.active().sparkContext().defaultMinPartitions(); - return new OpticScanBuilder(new ReadContext(readProperties, readSchema, defaultMinPartitions)); + return new OpticScanBuilder(new OpticReadContext(readProperties, readSchema, defaultMinPartitions)); } @Override @@ -107,7 +107,7 @@ public String name() { /** * @deprecated Marked as deprecated in the Table interface. 
*/ - @SuppressWarnings("java:S1133") + @SuppressWarnings({"java:S1133", "java:S6355"}) @Deprecated @Override public StructType schema() { diff --git a/src/main/java/com/marklogic/spark/Options.java b/src/main/java/com/marklogic/spark/Options.java index 1a825afc..9f7580f3 100644 --- a/src/main/java/com/marklogic/spark/Options.java +++ b/src/main/java/com/marklogic/spark/Options.java @@ -29,12 +29,16 @@ public abstract class Options { public static final String READ_INVOKE = "spark.marklogic.read.invoke"; public static final String READ_JAVASCRIPT = "spark.marklogic.read.javascript"; + public static final String READ_JAVASCRIPT_FILE = "spark.marklogic.read.javascriptFile"; public static final String READ_XQUERY = "spark.marklogic.read.xquery"; + public static final String READ_XQUERY_FILE = "spark.marklogic.read.xqueryFile"; public static final String READ_VARS_PREFIX = "spark.marklogic.read.vars."; public static final String READ_PARTITIONS_INVOKE = "spark.marklogic.read.partitions.invoke"; public static final String READ_PARTITIONS_JAVASCRIPT = "spark.marklogic.read.partitions.javascript"; + public static final String READ_PARTITIONS_JAVASCRIPT_FILE = "spark.marklogic.read.partitions.javascriptFile"; public static final String READ_PARTITIONS_XQUERY = "spark.marklogic.read.partitions.xquery"; + public static final String READ_PARTITIONS_XQUERY_FILE = "spark.marklogic.read.partitions.xqueryFile"; public static final String READ_OPTIC_QUERY = "spark.marklogic.read.opticQuery"; public static final String READ_NUM_PARTITIONS = "spark.marklogic.read.numPartitions"; @@ -55,8 +59,28 @@ public abstract class Options { public static final String READ_DOCUMENTS_TRANSFORM = "spark.marklogic.read.documents.transform"; public static final String READ_DOCUMENTS_TRANSFORM_PARAMS = "spark.marklogic.read.documents.transformParams"; public static final String READ_DOCUMENTS_TRANSFORM_PARAMS_DELIMITER = "spark.marklogic.read.documents.transformParamsDelimiter"; + public static final String READ_DOCUMENTS_URIS = "spark.marklogic.read.documents.uris"; + public static final String READ_TRIPLES_GRAPHS = "spark.marklogic.read.triples.graphs"; + public static final String READ_TRIPLES_COLLECTIONS = "spark.marklogic.read.triples.collections"; + public static final String READ_TRIPLES_QUERY = "spark.marklogic.read.triples.query"; + public static final String READ_TRIPLES_STRING_QUERY = "spark.marklogic.read.triples.stringQuery"; + public static final String READ_TRIPLES_URIS = "spark.marklogic.read.triples.uris"; + public static final String READ_TRIPLES_DIRECTORY = "spark.marklogic.read.triples.directory"; + public static final String READ_TRIPLES_OPTIONS = "spark.marklogic.read.triples.options"; + public static final String READ_TRIPLES_FILTERED = "spark.marklogic.read.triples.filtered"; + public static final String READ_TRIPLES_BASE_IRI = "spark.marklogic.read.triples.baseIri"; + + // For logging progress when reading documents, rows, or items via custom code. Defines the interval at which + // progress should be logged - e.g. a value of 10,000 will result in a message being logged on every 10,000 items + // being written/processed. 
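As a usage note for the read options added in this hunk: `spark.marklogic.read.documents.uris` takes a newline-delimited list (per `DocumentContext`, the REST API allows commas in URIs but not newlines), and `spark.marklogic.read.logProgress`, declared just below, applies the interval-based logging described in the comment above. A hypothetical read combining the two is sketched below; the URIs and credentials are placeholders.

```
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadDocumentsByUriExample {
    public static void main(String[] args) {
        SparkSession session = SparkSession.builder()
            .master("local[*]")
            .appName("read-documents-by-uri-example")
            .getOrCreate();

        // DocumentContext splits this value on newlines to produce the list of URIs to read.
        String uris = String.join("\n", "/example/doc1.json", "/example/doc2.json");

        Dataset<Row> docs = session.read()
            .format("marklogic")
            .option("spark.marklogic.client.uri", "myUser:myPassword@localhost:8000")
            .option("spark.marklogic.read.documents.uris", uris)
            // Logs a "Documents read: {}" message after roughly every 10 documents.
            .option("spark.marklogic.read.logProgress", 10)
            .load();

        docs.show();
    }
}
```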
+ public static final String READ_LOG_PROGRESS = "spark.marklogic.read.logProgress"; + + public static final String READ_FILES_TYPE = "spark.marklogic.read.files.type"; public static final String READ_FILES_COMPRESSION = "spark.marklogic.read.files.compression"; + public static final String READ_FILES_ENCODING = "spark.marklogic.read.files.encoding"; + public static final String READ_FILES_ABORT_ON_FAILURE = "spark.marklogic.read.files.abortOnFailure"; + public static final String READ_ARCHIVES_CATEGORIES = "spark.marklogic.read.archives.categories"; // "Aggregate" = an XML document containing N child elements, each of which should become a row / document. // "xml" is included in the name in anticipation of eventually supporting "aggregate JSON" - i.e. an array of N @@ -68,19 +92,29 @@ public abstract class Options { public static final String WRITE_BATCH_SIZE = "spark.marklogic.write.batchSize"; public static final String WRITE_THREAD_COUNT = "spark.marklogic.write.threadCount"; + public static final String WRITE_THREAD_COUNT_PER_PARTITION = "spark.marklogic.write.threadCountPerPartition"; public static final String WRITE_ABORT_ON_FAILURE = "spark.marklogic.write.abortOnFailure"; + // For logging progress when writing documents or processing with custom code. Defines the interval at which + // progress should be logged - e.g. a value of 10,000 will result in a message being logged on every 10,000 items + // being written/processed. + public static final String WRITE_LOG_PROGRESS = "spark.marklogic.write.logProgress"; + // For writing via custom code. public static final String WRITE_INVOKE = "spark.marklogic.write.invoke"; public static final String WRITE_JAVASCRIPT = "spark.marklogic.write.javascript"; + public static final String WRITE_JAVASCRIPT_FILE = "spark.marklogic.write.javascriptFile"; public static final String WRITE_XQUERY = "spark.marklogic.write.xquery"; + public static final String WRITE_XQUERY_FILE = "spark.marklogic.write.xqueryFile"; public static final String WRITE_EXTERNAL_VARIABLE_NAME = "spark.marklogic.write.externalVariableName"; public static final String WRITE_EXTERNAL_VARIABLE_DELIMITER = "spark.marklogic.write.externalVariableDelimiter"; public static final String WRITE_VARS_PREFIX = "spark.marklogic.write.vars."; // For writing documents to MarkLogic. + public static final String WRITE_ARCHIVE_PATH_FOR_FAILED_DOCUMENTS = "spark.marklogic.write.archivePathForFailedDocuments"; public static final String WRITE_COLLECTIONS = "spark.marklogic.write.collections"; public static final String WRITE_PERMISSIONS = "spark.marklogic.write.permissions"; + public static final String WRITE_JSON_ROOT_NAME = "spark.marklogic.write.jsonRootName"; public static final String WRITE_TEMPORAL_COLLECTION = "spark.marklogic.write.temporalCollection"; public static final String WRITE_URI_PREFIX = "spark.marklogic.write.uriPrefix"; public static final String WRITE_URI_REPLACE = "spark.marklogic.write.uriReplace"; @@ -89,13 +123,45 @@ public abstract class Options { public static final String WRITE_TRANSFORM_NAME = "spark.marklogic.write.transform"; public static final String WRITE_TRANSFORM_PARAMS = "spark.marklogic.write.transformParams"; public static final String WRITE_TRANSFORM_PARAMS_DELIMITER = "spark.marklogic.write.transformParamsDelimiter"; + public static final String WRITE_XML_ROOT_NAME = "spark.marklogic.write.xmlRootName"; + public static final String WRITE_XML_NAMESPACE = "spark.marklogic.write.xmlNamespace"; + + // For serializing a row into JSON. 
Intent is to allow for other constants defined in the Spark + // JSONOptions.scala class to be used after "spark.marklogic.write.json." + // Example - "spark.marklogic.write.json.ignoreNullFields=false. + public static final String WRITE_JSON_SERIALIZATION_OPTION_PREFIX = "spark.marklogic.write.json."; + - // For writing rows adhering to {@code FileRowSchema} to MarkLogic. + // For writing RDF + public static final String WRITE_GRAPH = "spark.marklogic.write.graph"; + public static final String WRITE_GRAPH_OVERRIDE = "spark.marklogic.write.graphOverride"; + + /** + * For writing rows adhering to Spark's binaryFile schema - https://spark.apache.org/docs/latest/sql-data-sources-binaryFile.html . + * + * @deprecated since 2.3.0 + */ + @Deprecated(since = "2.3.0", forRemoval = true) + // We don't need Sonar to remind us of this deprecation. + @SuppressWarnings("java:S1133") public static final String WRITE_FILE_ROWS_DOCUMENT_TYPE = "spark.marklogic.write.fileRows.documentType"; - // For writing rows adhering to {@code DocumentRowSchema} to a filesystem. + // Forces a document type when writing rows corresponding to our document row schema. Used when the URI extension + // does not result in MarkLogic choosing the correct document type. + public static final String WRITE_DOCUMENT_TYPE = "spark.marklogic.write.documentType"; + + // For writing rows adhering to {@code DocumentRowSchema} or {@code TripleRowSchema} to a filesystem. public static final String WRITE_FILES_COMPRESSION = "spark.marklogic.write.files.compression"; + // Applies to XML and JSON documents. + public static final String WRITE_FILES_PRETTY_PRINT = "spark.marklogic.write.files.prettyPrint"; + + // Applies to writing documents as files, gzipped files, and as entries in zips/archives. + public static final String WRITE_FILES_ENCODING = "spark.marklogic.write.files.encoding"; + + public static final String WRITE_RDF_FILES_FORMAT = "spark.marklogic.write.files.rdf.format"; + public static final String WRITE_RDF_FILES_GRAPH = "spark.marklogic.write.files.rdf.graph"; + private Options() { } } diff --git a/src/main/java/com/marklogic/spark/ReadProgressLogger.java b/src/main/java/com/marklogic/spark/ReadProgressLogger.java new file mode 100644 index 00000000..784119d4 --- /dev/null +++ b/src/main/java/com/marklogic/spark/ReadProgressLogger.java @@ -0,0 +1,38 @@ +/* + * Copyright © 2024 Progress Software Corporation and/or its subsidiaries or affiliates. All Rights Reserved. + */ +package com.marklogic.spark; + +import java.io.Serializable; +import java.util.concurrent.atomic.AtomicLong; + +/** + * Handles the progress counter for any operation involving reading from MarkLogic. A Spark job/application can only have + * one reader, and thus DefaultSource handles resetting this counter before a new read job starts up. A static counter + * is used so that all reader partitions in the same JVM can have their progress aggregated and logged. 
+ */ +public class ReadProgressLogger implements Serializable { + + static final long serialVersionUID = 1L; + + private static final AtomicLong progressCounter = new AtomicLong(0); + private static long progressInterval; + private static long nextProgressInterval; + private static String message; + + public static void initialize(long progressInterval, String message) { + progressCounter.set(0); + ReadProgressLogger.progressInterval = progressInterval; + nextProgressInterval = progressInterval; + ReadProgressLogger.message = message; + } + + public static void logProgressIfNecessary(long itemCount) { + if (progressInterval > 0 && progressCounter.addAndGet(itemCount) >= nextProgressInterval) { + synchronized (progressCounter) { + Util.MAIN_LOGGER.info(message, nextProgressInterval); + nextProgressInterval += progressInterval; + } + } + } +} diff --git a/src/main/java/com/marklogic/spark/Util.java b/src/main/java/com/marklogic/spark/Util.java index aa100071..5617db22 100644 --- a/src/main/java/com/marklogic/spark/Util.java +++ b/src/main/java/com/marklogic/spark/Util.java @@ -15,14 +15,10 @@ */ package com.marklogic.spark; -import org.apache.spark.sql.catalyst.json.JSONOptions; import org.slf4j.Logger; import org.slf4j.LoggerFactory; -import scala.collection.immutable.HashMap; -import java.util.ArrayList; -import java.util.List; -import java.util.Map; +import java.util.*; import java.util.stream.Stream; public interface Util { @@ -33,19 +29,6 @@ public interface Util { */ Logger MAIN_LOGGER = LoggerFactory.getLogger("com.marklogic.spark"); - JSONOptions DEFAULT_JSON_OPTIONS = new JSONOptions( - new HashMap<>(), - - // As verified via tests, this default timezone ID is overridden by a user via the spark.sql.session.timeZone option. - "Z", - - // We don't expect corrupted records - i.e. corrupted values - to be present in the index. But Spark - // requires this to be set. See - // https://medium.com/@sasidharan-r/how-to-handle-corrupt-or-bad-record-in-apache-spark-custom-logic-pyspark-aws-430ddec9bb41 - // for more information. - "_corrupt_record" - ); - static boolean hasOption(Map properties, String... options) { return Stream.of(options) .anyMatch(option -> properties.get(option) != null && properties.get(option).trim().length() > 0); @@ -73,4 +56,32 @@ static List parsePaths(String pathsValue) { } return paths; } + + static boolean isReadWithCustomCodeOperation(Map properties) { + return Util.hasOption(properties, + Options.READ_INVOKE, Options.READ_XQUERY, Options.READ_JAVASCRIPT, + Options.READ_JAVASCRIPT_FILE, Options.READ_XQUERY_FILE + ); + } + + static boolean isWriteWithCustomCodeOperation(Map properties) { + return Util.hasOption(properties, + Options.WRITE_INVOKE, Options.WRITE_JAVASCRIPT, Options.WRITE_XQUERY, + Options.WRITE_JAVASCRIPT_FILE, Options.WRITE_XQUERY_FILE + ); + } + + /** + * Allows Flux to override what's shown in a validation error. The connector is fine showing option names + * such as "spark.marklogic.read.opticQuery", but that is meaningless to a Flux user. This can also be used to + * access any key in the messages properties file. + * + * @param option + * @return + */ + static String getOptionNameForErrorMessage(String option) { + ResourceBundle bundle = ResourceBundle.getBundle("marklogic-spark-messages", Locale.getDefault()); + String optionName = bundle.getString(option); + return optionName != null && optionName.trim().length() > 0 ? 
optionName.trim() : option; + } } diff --git a/src/main/java/com/marklogic/spark/WriteProgressLogger.java b/src/main/java/com/marklogic/spark/WriteProgressLogger.java new file mode 100644 index 00000000..f3989923 --- /dev/null +++ b/src/main/java/com/marklogic/spark/WriteProgressLogger.java @@ -0,0 +1,38 @@ +/* + * Copyright © 2024 Progress Software Corporation and/or its subsidiaries or affiliates. All Rights Reserved. + */ +package com.marklogic.spark; + +import java.io.Serializable; +import java.util.concurrent.atomic.AtomicLong; + +/** + * Handles the progress counter for any operation involving writing to MarkLogic. A Spark job/application can only have + * one writer, and thus DefaultSource handles resetting this counter before a new write job starts up. A static counter + * is used so that all writer partitions in the same JVM can have their progress aggregated and logged. + */ +public class WriteProgressLogger implements Serializable { + + static final long serialVersionUID = 1L; + + private static final AtomicLong progressCounter = new AtomicLong(0); + private static long progressInterval; + private static long nextProgressInterval; + private static String message; + + public static void initialize(long progressInterval, String message) { + progressCounter.set(0); + WriteProgressLogger.progressInterval = progressInterval; + nextProgressInterval = progressInterval; + WriteProgressLogger.message = message; + } + + public static void logProgressIfNecessary(long itemCount) { + if (progressInterval > 0 && progressCounter.addAndGet(itemCount) >= nextProgressInterval) { + synchronized (progressCounter) { + Util.MAIN_LOGGER.info(message, nextProgressInterval); + nextProgressInterval += progressInterval; + } + } + } +} diff --git a/src/main/java/com/marklogic/spark/reader/JsonRowDeserializer.java b/src/main/java/com/marklogic/spark/reader/JsonRowDeserializer.java index a7e4e6ff..51394065 100644 --- a/src/main/java/com/marklogic/spark/reader/JsonRowDeserializer.java +++ b/src/main/java/com/marklogic/spark/reader/JsonRowDeserializer.java @@ -2,9 +2,10 @@ import com.fasterxml.jackson.core.JsonFactory; import com.fasterxml.jackson.core.JsonParser; -import com.marklogic.spark.Util; +import com.marklogic.spark.JsonRowSerializer; import org.apache.spark.sql.catalyst.InternalRow; import org.apache.spark.sql.catalyst.json.CreateJacksonParser; +import org.apache.spark.sql.catalyst.json.JSONOptions; import org.apache.spark.sql.catalyst.json.JacksonParser; import org.apache.spark.sql.sources.Filter; import org.apache.spark.sql.types.StructType; @@ -13,9 +14,9 @@ import scala.Function2; import scala.collection.JavaConverters; import scala.collection.Seq; -import scala.compat.java8.JFunction; import java.util.ArrayList; +import java.util.HashMap; /** * Handles deserializing a JSON object into a Spark InternalRow. This is accomplished via Spark's JacksonParser. @@ -33,17 +34,10 @@ public class JsonRowDeserializer { private final Function2 jsonParserCreator; private final Function1 utf8StringCreator; - // Ignoring warnings about JFunction.func until an alternative can be found. - @SuppressWarnings("java:S1874") public JsonRowDeserializer(StructType schema) { this.jacksonParser = newJacksonParser(schema); - - // Used https://github.com/scala/scala-java8-compat in the DHF Spark 2 connector. Per the README for - // scala-java8-compat, we should be able to use scala.jdk.FunctionConverters since those are part of Scala - // 2.13. However, that is not yet working within PySpark. 
So sticking with this "legacy" approach as it seems - // to work fine in both vanilla Spark (i.e. JUnit tests) and PySpark. - this.jsonParserCreator = JFunction.func(CreateJacksonParser::string); - this.utf8StringCreator = JFunction.func(UTF8String::fromString); + this.jsonParserCreator = CreateJacksonParser::string; + this.utf8StringCreator = UTF8String::fromString; } public InternalRow deserializeJson(String json) { @@ -53,6 +47,7 @@ public InternalRow deserializeJson(String json) { private JacksonParser newJacksonParser(StructType schema) { final boolean allowArraysAsStructs = true; final Seq filters = JavaConverters.asScalaIterator(new ArrayList().iterator()).toSeq(); - return new JacksonParser(schema, Util.DEFAULT_JSON_OPTIONS, allowArraysAsStructs, filters); + JSONOptions jsonOptions = new JsonRowSerializer(schema, new HashMap<>()).getJsonOptions(); + return new JacksonParser(schema, jsonOptions, allowArraysAsStructs, filters); } } diff --git a/src/main/java/com/marklogic/spark/reader/customcode/CustomCodeContext.java b/src/main/java/com/marklogic/spark/reader/customcode/CustomCodeContext.java index db45fd31..4c29952a 100644 --- a/src/main/java/com/marklogic/spark/reader/customcode/CustomCodeContext.java +++ b/src/main/java/com/marklogic/spark/reader/customcode/CustomCodeContext.java @@ -4,9 +4,15 @@ import com.marklogic.client.eval.ServerEvaluationCall; import com.marklogic.spark.ConnectorException; import com.marklogic.spark.ContextSupport; +import com.marklogic.spark.Options; +import org.apache.commons.io.FileUtils; import org.apache.spark.sql.types.DataTypes; import org.apache.spark.sql.types.StructType; +import java.io.File; +import java.io.IOException; +import java.nio.charset.Charset; +import java.nio.file.NoSuchFileException; import java.util.Arrays; import java.util.Map; import java.util.stream.Collectors; @@ -46,7 +52,14 @@ public ServerEvaluationCall buildCall(DatabaseClient client, CallOptions callOpt call.javascript(properties.get(callOptions.javascriptOptionName)); } else if (optionExists(callOptions.xqueryOptionName)) { call.xquery(properties.get(callOptions.xqueryOptionName)); + } else if (optionExists(callOptions.javascriptFileOptionName)) { + String content = readFileToString(properties.get(callOptions.javascriptFileOptionName)); + call.javascript(content); + } else if (optionExists(callOptions.xqueryFileOptionName)) { + String content = readFileToString(properties.get(callOptions.xqueryFileOptionName)); + call.xquery(content); } else { + // The ETL tool validates this itself via a validator. throw new ConnectorException("Must specify one of the following options: " + Arrays.asList( callOptions.invokeOptionName, callOptions.javascriptOptionName, callOptions.xqueryOptionName )); @@ -56,20 +69,48 @@ public ServerEvaluationCall buildCall(DatabaseClient client, CallOptions callOpt return call; } + private String readFileToString(String file) { + try { + // commons-io is a Spark dependency, so safe to use. 
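The `readFileToString` helper being added here backs the new `*File` options, such as `spark.marklogic.read.javascriptFile`, which let custom code live in a file instead of being passed inline. A hypothetical read using it is sketched below; the file path, its contents, and the credentials are placeholders. By default the result uses the single-column `URI` schema set up in `DefaultSource`.

```
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CustomCodeFromFileExample {
    public static void main(String[] args) {
        SparkSession session = SparkSession.builder()
            .master("local[*]")
            .appName("custom-code-from-file-example")
            .getOrCreate();

        // The connector reads the file's contents via FileUtils.readFileToString and evaluates
        // it the same way as an inline spark.marklogic.read.javascript value.
        Dataset<Row> rows = session.read()
            .format("marklogic")
            .option("spark.marklogic.client.uri", "myUser:myPassword@localhost:8000")
            .option("spark.marklogic.read.javascriptFile", "src/main/js/read-uris.js")
            .load();

        rows.show();
    }
}
```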
+ return FileUtils.readFileToString(new File(file), Charset.defaultCharset()); + } catch (IOException e) { + String message = e.getMessage(); + if (e instanceof NoSuchFileException) { + message += " was not found."; + } + throw new ConnectorException(String.format("Cannot read from file %s; cause: %s", file, message), e); + } + } + public boolean isCustomSchema() { return customSchema; } + boolean hasPartitionCode() { + return hasOption( + Options.READ_PARTITIONS_INVOKE, + Options.READ_PARTITIONS_JAVASCRIPT, + Options.READ_PARTITIONS_JAVASCRIPT_FILE, + Options.READ_PARTITIONS_XQUERY, + Options.READ_PARTITIONS_XQUERY_FILE + ); + } + // Intended solely to simplify passing these 3 option names around. public static class CallOptions { private final String invokeOptionName; private final String javascriptOptionName; private final String xqueryOptionName; + private final String javascriptFileOptionName; + private final String xqueryFileOptionName; - public CallOptions(String invokeOptionName, String javascriptOptionName, String xqueryOptionName) { + public CallOptions(String invokeOptionName, String javascriptOptionName, String xqueryOptionName, + String javascriptFileOptionName, String xqueryFileOptionName) { this.javascriptOptionName = javascriptOptionName; this.xqueryOptionName = xqueryOptionName; this.invokeOptionName = invokeOptionName; + this.javascriptFileOptionName = javascriptFileOptionName; + this.xqueryFileOptionName = xqueryFileOptionName; } } } diff --git a/src/main/java/com/marklogic/spark/reader/customcode/CustomCodePartitionReader.java b/src/main/java/com/marklogic/spark/reader/customcode/CustomCodePartitionReader.java index 02ac739c..25ca4446 100644 --- a/src/main/java/com/marklogic/spark/reader/customcode/CustomCodePartitionReader.java +++ b/src/main/java/com/marklogic/spark/reader/customcode/CustomCodePartitionReader.java @@ -4,6 +4,7 @@ import com.marklogic.client.eval.EvalResultIterator; import com.marklogic.client.eval.ServerEvaluationCall; import com.marklogic.spark.Options; +import com.marklogic.spark.ReadProgressLogger; import com.marklogic.spark.reader.JsonRowDeserializer; import org.apache.spark.sql.catalyst.InternalRow; import org.apache.spark.sql.catalyst.expressions.GenericInternalRow; @@ -19,17 +20,24 @@ class CustomCodePartitionReader implements PartitionReader { private final JsonRowDeserializer jsonRowDeserializer; private final DatabaseClient databaseClient; + // Only needed for logging progress. 
+ private final long batchSize; + private long progressCounter; + public CustomCodePartitionReader(CustomCodeContext customCodeContext, String partition) { this.databaseClient = customCodeContext.connectToMarkLogic(); this.serverEvaluationCall = customCodeContext.buildCall( this.databaseClient, - new CustomCodeContext.CallOptions(Options.READ_INVOKE, Options.READ_JAVASCRIPT, Options.READ_XQUERY) + new CustomCodeContext.CallOptions(Options.READ_INVOKE, Options.READ_JAVASCRIPT, Options.READ_XQUERY, + Options.READ_JAVASCRIPT_FILE, Options.READ_XQUERY_FILE) ); if (partition != null) { this.serverEvaluationCall.addVariable("PARTITION", partition); } + this.batchSize = customCodeContext.getNumericOption(Options.READ_BATCH_SIZE, 1, 1); + this.isCustomSchema = customCodeContext.isCustomSchema(); this.jsonRowDeserializer = new JsonRowDeserializer(customCodeContext.getSchema()); } @@ -48,6 +56,11 @@ public InternalRow get() { if (this.isCustomSchema) { return this.jsonRowDeserializer.deserializeJson(val); } + progressCounter++; + if (progressCounter >= batchSize) { + ReadProgressLogger.logProgressIfNecessary(progressCounter); + progressCounter = 0; + } return new GenericInternalRow(new Object[]{UTF8String.fromString(val)}); } diff --git a/src/main/java/com/marklogic/spark/reader/customcode/CustomCodeScan.java b/src/main/java/com/marklogic/spark/reader/customcode/CustomCodeScan.java index ebdae9d9..d1c91b2a 100644 --- a/src/main/java/com/marklogic/spark/reader/customcode/CustomCodeScan.java +++ b/src/main/java/com/marklogic/spark/reader/customcode/CustomCodeScan.java @@ -21,17 +21,18 @@ public CustomCodeScan(CustomCodeContext customCodeContext) { this.customCodeContext = customCodeContext; this.partitions = new ArrayList<>(); - if (this.customCodeContext.hasOption(Options.READ_PARTITIONS_INVOKE, Options.READ_PARTITIONS_JAVASCRIPT, Options.READ_PARTITIONS_XQUERY)) { + if (this.customCodeContext.hasPartitionCode()) { DatabaseClient client = this.customCodeContext.connectToMarkLogic(); try { this.customCodeContext .buildCall(client, new CustomCodeContext.CallOptions( - Options.READ_PARTITIONS_INVOKE, Options.READ_PARTITIONS_JAVASCRIPT, Options.READ_PARTITIONS_XQUERY + Options.READ_PARTITIONS_INVOKE, Options.READ_PARTITIONS_JAVASCRIPT, Options.READ_PARTITIONS_XQUERY, + Options.READ_PARTITIONS_JAVASCRIPT_FILE, Options.READ_PARTITIONS_XQUERY_FILE )) .eval() .forEach(result -> this.partitions.add(result.getString())); } catch (Exception ex) { - throw new ConnectorException("Unable to retrieve partitions", ex); + throw new ConnectorException(String.format("Unable to retrieve partitions; cause: %s", ex.getMessage()), ex); } finally { client.release(); } diff --git a/src/main/java/com/marklogic/spark/reader/document/DocumentBatch.java b/src/main/java/com/marklogic/spark/reader/document/DocumentBatch.java index 30a7f71c..8cb91aea 100644 --- a/src/main/java/com/marklogic/spark/reader/document/DocumentBatch.java +++ b/src/main/java/com/marklogic/spark/reader/document/DocumentBatch.java @@ -6,6 +6,7 @@ import com.marklogic.client.query.QueryManager; import com.marklogic.client.query.SearchQueryDefinition; import com.marklogic.spark.Util; +import com.marklogic.spark.reader.file.TripleRowSchema; import org.apache.spark.sql.connector.read.Batch; import org.apache.spark.sql.connector.read.InputPartition; import org.apache.spark.sql.connector.read.PartitionReaderFactory; @@ -29,7 +30,10 @@ class DocumentBatch implements Batch { DatabaseClient client = this.context.connectToMarkLogic(); Forest[] forests = 
client.newDataMovementManager().readForestConfig().listForests(); - SearchQueryDefinition query = this.context.buildSearchQuery(client); + SearchQueryDefinition query = TripleRowSchema.SCHEMA.equals(context.getSchema()) ? + this.context.buildTriplesSearchQuery(client) : + this.context.buildSearchQuery(client); + // Must null this out so SearchHandle still works below. query.setResponseTransform(null); diff --git a/src/main/java/com/marklogic/spark/reader/document/DocumentContext.java b/src/main/java/com/marklogic/spark/reader/document/DocumentContext.java index d3ab58d2..2d7daee4 100644 --- a/src/main/java/com/marklogic/spark/reader/document/DocumentContext.java +++ b/src/main/java/com/marklogic/spark/reader/document/DocumentContext.java @@ -5,6 +5,7 @@ import com.marklogic.client.query.SearchQueryDefinition; import com.marklogic.spark.ContextSupport; import com.marklogic.spark.Options; +import org.apache.spark.sql.types.StructType; import org.apache.spark.sql.util.CaseInsensitiveStringMap; import java.util.HashSet; @@ -14,9 +15,11 @@ class DocumentContext extends ContextSupport { private Integer limit; + private final StructType schema; - DocumentContext(CaseInsensitiveStringMap options) { + DocumentContext(CaseInsensitiveStringMap options, StructType schema) { super(options.asCaseSensitiveMap()); + this.schema = schema; } Set getRequestedMetadata() { @@ -50,6 +53,11 @@ boolean contentWasRequested() { SearchQueryDefinition buildSearchQuery(DatabaseClient client) { final Map props = getProperties(); + // REST API allows commas in URIs, but not newlines, so that's safe to use as a delimiter. + String[] uris = null; + if (hasOption(Options.READ_DOCUMENTS_URIS)) { + uris = getStringOption(Options.READ_DOCUMENTS_URIS).split("\n"); + } return new SearchQueryBuilder() .withStringQuery(props.get(Options.READ_DOCUMENTS_STRING_QUERY)) .withQuery(props.get(Options.READ_DOCUMENTS_QUERY)) @@ -59,9 +67,39 @@ SearchQueryDefinition buildSearchQuery(DatabaseClient client) { .withTransformName(props.get(Options.READ_DOCUMENTS_TRANSFORM)) .withTransformParams(props.get(Options.READ_DOCUMENTS_TRANSFORM_PARAMS)) .withTransformParamsDelimiter(props.get(Options.READ_DOCUMENTS_TRANSFORM_PARAMS_DELIMITER)) + .withUris(uris) + .buildQuery(client); + } + + SearchQueryDefinition buildTriplesSearchQuery(DatabaseClient client) { + final Map props = getProperties(); + String[] uris = null; + if (hasOption(Options.READ_TRIPLES_URIS)) { + uris = getStringOption(Options.READ_TRIPLES_URIS).split("\n"); + } + return new SearchQueryBuilder() + .withStringQuery(props.get(Options.READ_TRIPLES_STRING_QUERY)) + .withQuery(props.get(Options.READ_TRIPLES_QUERY)) + .withCollections(combineCollectionsAndGraphs()) + .withDirectory(props.get(Options.READ_TRIPLES_DIRECTORY)) + .withOptionsName(props.get(Options.READ_TRIPLES_OPTIONS)) + .withUris(uris) .buildQuery(client); } + private String combineCollectionsAndGraphs() { + String graphs = getProperties().get(Options.READ_TRIPLES_GRAPHS); + String collections = getProperties().get(Options.READ_TRIPLES_COLLECTIONS); + if (graphs != null && graphs.trim().length() > 0) { + if (collections == null || collections.trim().length() == 0) { + collections = graphs; + } else { + collections += "," + graphs; + } + } + return collections; + } + int getBatchSize() { // Testing has shown that at least for smaller documents, 100 or 200 can be significantly slower than something // like 1000 or even 10000. 
500 is thus used as a default that should still be reasonably performant for larger @@ -82,4 +120,8 @@ void setLimit(Integer limit) { Integer getLimit() { return limit; } + + StructType getSchema() { + return schema; + } } diff --git a/src/main/java/com/marklogic/spark/reader/document/DocumentRowBuilder.java b/src/main/java/com/marklogic/spark/reader/document/DocumentRowBuilder.java new file mode 100644 index 00000000..1bb8df6a --- /dev/null +++ b/src/main/java/com/marklogic/spark/reader/document/DocumentRowBuilder.java @@ -0,0 +1,197 @@ +package com.marklogic.spark.reader.document; + +import com.marklogic.client.document.DocumentManager; +import com.marklogic.client.io.DocumentMetadataHandle; +import com.marklogic.spark.ConnectorException; +import org.apache.spark.sql.catalyst.expressions.GenericInternalRow; +import org.apache.spark.sql.catalyst.util.ArrayBasedMapData; +import org.apache.spark.sql.catalyst.util.ArrayData; +import org.apache.spark.unsafe.types.ByteArray; +import org.apache.spark.unsafe.types.UTF8String; +import org.jdom2.Document; +import org.jdom2.Element; +import org.jdom2.Namespace; +import org.jdom2.input.SAXBuilder; +import org.jdom2.output.XMLOutputter; + +import java.io.ByteArrayInputStream; +import java.util.*; + +/** + * Knows how to build a Spark row conforming to our {@code DocumentRowSchema}. + *
+ * This has to support two different ways of specifying which metadata to include. {@code ForestReader} needs to + * capture the requested metadata in one way, while other approaches can just capture the metadata categories as a + * simple list of strings. + */ +public class DocumentRowBuilder { + + private final List metadataCategories; + private final Set requestedMetadata; + + // For handling XML document properties + private final SAXBuilder saxBuilder; + private final XMLOutputter xmlOutputter; + private static final Namespace PROPERTIES_NAMESPACE = Namespace.getNamespace("prop", "http://marklogic.com/xdmp/property"); + + private String uri; + private byte[] content; + private String format; + private DocumentMetadataHandle metadata; + + public DocumentRowBuilder(List metadataCategories) { + this.saxBuilder = new SAXBuilder(); + this.xmlOutputter = new XMLOutputter(); + this.metadataCategories = metadataCategories != null ? metadataCategories : new ArrayList<>(); + this.requestedMetadata = null; + } + + public DocumentRowBuilder(Set requestedMetadata) { + this.saxBuilder = new SAXBuilder(); + this.xmlOutputter = new XMLOutputter(); + this.requestedMetadata = requestedMetadata; + this.metadataCategories = null; + } + + public DocumentRowBuilder withUri(String uri) { + this.uri = uri; + return this; + } + + public DocumentRowBuilder withContent(byte[] content) { + this.content = content; + return this; + } + + public DocumentRowBuilder withFormat(String format) { + this.format = format; + return this; + } + + public DocumentRowBuilder withMetadata(DocumentMetadataHandle metadata) { + this.metadata = metadata; + return this; + } + + public GenericInternalRow buildRow() { + Object[] row = new Object[8]; + row[0] = UTF8String.fromString(uri); + row[1] = ByteArray.concat(content); + if (format != null) { + row[2] = UTF8String.fromString(format); + } + if (metadata != null) { + if (includeCollections()) { + populateCollectionsColumn(row, metadata); + } + if (includePermissions()) { + populatePermissionsColumn(row, metadata); + } + if (includeQuality()) { + populateQualityColumn(row, metadata); + } + if (includeProperties()) { + populatePropertiesColumn(row, metadata); + } + if (includeMetadataValues()) { + populateMetadataValuesColumn(row, metadata); + } + } + return new GenericInternalRow(row); + } + + private boolean includeCollections() { + return includeMetadata("collections", DocumentManager.Metadata.COLLECTIONS); + } + + private boolean includePermissions() { + return includeMetadata("permissions", DocumentManager.Metadata.PERMISSIONS); + } + + private boolean includeQuality() { + return includeMetadata("quality", DocumentManager.Metadata.QUALITY); + } + + private boolean includeProperties() { + return includeMetadata("properties", DocumentManager.Metadata.PROPERTIES); + } + + private boolean includeMetadataValues() { + return includeMetadata("metadatavalues", DocumentManager.Metadata.METADATAVALUES); + } + + private boolean includeMetadata(String categoryName, DocumentManager.Metadata metadataType) { + return metadataCategories != null ? 
+ metadataCategories.contains(categoryName) || metadataCategories.isEmpty() : + requestedMetadata.contains(metadataType) || requestedMetadata.contains(DocumentManager.Metadata.ALL); + } + + private void populateCollectionsColumn(Object[] row, DocumentMetadataHandle metadata) { + UTF8String[] collections = new UTF8String[metadata.getCollections().size()]; + Iterator iterator = metadata.getCollections().iterator(); + for (int i = 0; i < collections.length; i++) { + collections[i] = UTF8String.fromString(iterator.next()); + } + row[3] = ArrayData.toArrayData(collections); + } + + private void populatePermissionsColumn(Object[] row, DocumentMetadataHandle metadata) { + DocumentMetadataHandle.DocumentPermissions perms = metadata.getPermissions(); + UTF8String[] roles = new UTF8String[perms.size()]; + Object[] capabilityArrays = new Object[perms.size()]; + int i = 0; + for (Map.Entry> entry : perms.entrySet()) { + roles[i] = UTF8String.fromString(entry.getKey()); + UTF8String[] capabilities = new UTF8String[entry.getValue().size()]; + int j = 0; + Iterator iterator = entry.getValue().iterator(); + while (iterator.hasNext()) { + capabilities[j++] = UTF8String.fromString(iterator.next().name()); + } + capabilityArrays[i++] = ArrayData.toArrayData(capabilities); + } + row[4] = ArrayBasedMapData.apply(roles, capabilityArrays); + } + + private void populateQualityColumn(Object[] row, DocumentMetadataHandle metadata) { + row[5] = metadata.getQuality(); + } + + /** + * The properties fragment can be a complex XML structure with mixed content and attributes and thus cannot be + * defined as a map of particular types. Instead, as of the 2.3.0 release of the connector, the properties column + * is of type String and is expected to contain a serialized string of XML representing the contents of the + * properties fragment. To obtain that, this method serializes the metadata object into its REST API XML + * serialization and then extracts the portion containing the document properties. 
+ * + * @param row + * @param metadata + */ + private void populatePropertiesColumn(Object[] row, DocumentMetadataHandle metadata) { + if (metadata.getProperties() == null || metadata.getProperties().size() == 0) { + return; + } + try { + Document doc = this.saxBuilder.build(new ByteArrayInputStream(metadata.toBuffer())); + Element properties = doc.getRootElement().getChild("properties", PROPERTIES_NAMESPACE); + if (properties != null) { + row[6] = UTF8String.fromString(this.xmlOutputter.outputString(properties)); + } + } catch (Exception e) { + throw new ConnectorException(String.format( + "Unable to process XML document properties for row with URI %s; cause: %s", row[0], e.getMessage()), e); + } + } + + private void populateMetadataValuesColumn(Object[] row, DocumentMetadataHandle metadata) { + DocumentMetadataHandle.DocumentMetadataValues metadataValues = metadata.getMetadataValues(); + UTF8String[] keys = new UTF8String[metadataValues.size()]; + UTF8String[] values = new UTF8String[metadataValues.size()]; + int index = 0; + for (Map.Entry entry : metadataValues.entrySet()) { + keys[index] = UTF8String.fromString(entry.getKey()); + values[index++] = UTF8String.fromString(entry.getValue()); + } + row[7] = ArrayBasedMapData.apply(keys, values); + } +} diff --git a/src/main/java/com/marklogic/spark/reader/document/DocumentRowSchema.java b/src/main/java/com/marklogic/spark/reader/document/DocumentRowSchema.java index aa08d3b4..e4de9ee6 100644 --- a/src/main/java/com/marklogic/spark/reader/document/DocumentRowSchema.java +++ b/src/main/java/com/marklogic/spark/reader/document/DocumentRowSchema.java @@ -1,5 +1,9 @@ package com.marklogic.spark.reader.document; +import com.marklogic.client.io.DocumentMetadataHandle; +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.catalyst.util.ArrayData; +import org.apache.spark.sql.catalyst.util.MapData; import org.apache.spark.sql.types.DataTypes; import org.apache.spark.sql.types.StructType; @@ -15,9 +19,79 @@ public abstract class DocumentRowSchema { DataTypes.createArrayType(DataTypes.StringType)) ) .add("quality", DataTypes.IntegerType) - .add("properties", DataTypes.createMapType(DataTypes.StringType, DataTypes.StringType)) + .add("properties", DataTypes.StringType) .add("metadataValues", DataTypes.createMapType(DataTypes.StringType, DataTypes.StringType)); private DocumentRowSchema() { } + + /** + * Given a row that conforms to this class's schema, return a {@code DocumentMetadataHandle} that contains the + * metadata from the given row. 
+ * + * @param row + * @return + */ + public static DocumentMetadataHandle makeDocumentMetadata(InternalRow row) { + DocumentMetadataHandle metadata = new DocumentMetadataHandle(); + addCollectionsToMetadata(row, metadata); + addPermissionsToMetadata(row, metadata); + if (!row.isNullAt(5)) { + metadata.setQuality(row.getInt(5)); + } + addPropertiesToMetadata(row, metadata); + addMetadataValuesToMetadata(row, metadata); + return metadata; + } + + private static void addCollectionsToMetadata(InternalRow row, DocumentMetadataHandle metadata) { + if (!row.isNullAt(3)) { + ArrayData collections = row.getArray(3); + for (int i = 0; i < collections.numElements(); i++) { + String value = collections.get(i, DataTypes.StringType).toString(); + metadata.getCollections().add(value); + } + } + } + + private static void addPermissionsToMetadata(InternalRow row, DocumentMetadataHandle metadata) { + if (!row.isNullAt(4)) { + MapData permissions = row.getMap(4); + ArrayData roles = permissions.keyArray(); + ArrayData capabilities = permissions.valueArray(); + for (int i = 0; i < roles.numElements(); i++) { + String role = roles.get(i, DataTypes.StringType).toString(); + ArrayData caps = capabilities.getArray(i); + DocumentMetadataHandle.Capability[] capArray = new DocumentMetadataHandle.Capability[caps.numElements()]; + for (int j = 0; j < caps.numElements(); j++) { + String value = caps.get(j, DataTypes.StringType).toString(); + capArray[j] = DocumentMetadataHandle.Capability.valueOf(value.toUpperCase()); + } + metadata.getPermissions().add(role, capArray); + } + } + } + + private static void addPropertiesToMetadata(InternalRow row, DocumentMetadataHandle metadata) { + if (!row.isNullAt(6)) { + String propertiesXml = row.getString(6); + String metadataXml = String.format("%s", propertiesXml); + DocumentMetadataHandle tempMetadata = new DocumentMetadataHandle(); + tempMetadata.fromBuffer(metadataXml.getBytes()); + metadata.setProperties(tempMetadata.getProperties()); + } + } + + private static void addMetadataValuesToMetadata(InternalRow row, DocumentMetadataHandle metadata) { + if (!row.isNullAt(7)) { + MapData properties = row.getMap(7); + ArrayData keys = properties.keyArray(); + ArrayData values = properties.valueArray(); + for (int i = 0; i < keys.numElements(); i++) { + String key = keys.get(i, DataTypes.StringType).toString(); + String value = values.get(i, DataTypes.StringType).toString(); + metadata.getMetadataValues().put(key, value); + } + } + } } diff --git a/src/main/java/com/marklogic/spark/reader/document/DocumentScan.java b/src/main/java/com/marklogic/spark/reader/document/DocumentScan.java index b86254f9..e7198150 100644 --- a/src/main/java/com/marklogic/spark/reader/document/DocumentScan.java +++ b/src/main/java/com/marklogic/spark/reader/document/DocumentScan.java @@ -7,14 +7,16 @@ class DocumentScan implements Scan { private final DocumentBatch batch; + private final DocumentContext context; DocumentScan(DocumentContext context) { + this.context = context; this.batch = new DocumentBatch(context); } @Override public StructType readSchema() { - return DocumentRowSchema.SCHEMA; + return context.getSchema(); } @Override diff --git a/src/main/java/com/marklogic/spark/reader/document/DocumentScanBuilder.java b/src/main/java/com/marklogic/spark/reader/document/DocumentScanBuilder.java index 36a34e95..8c5b9e3e 100644 --- a/src/main/java/com/marklogic/spark/reader/document/DocumentScanBuilder.java +++ b/src/main/java/com/marklogic/spark/reader/document/DocumentScanBuilder.java @@ -3,14 +3,15 @@ 
import org.apache.spark.sql.connector.read.Scan; import org.apache.spark.sql.connector.read.ScanBuilder; import org.apache.spark.sql.connector.read.SupportsPushDownLimit; +import org.apache.spark.sql.types.StructType; import org.apache.spark.sql.util.CaseInsensitiveStringMap; class DocumentScanBuilder implements ScanBuilder, SupportsPushDownLimit { private final DocumentContext context; - DocumentScanBuilder(CaseInsensitiveStringMap options) { - this.context = new DocumentContext(options); + DocumentScanBuilder(CaseInsensitiveStringMap options, StructType schema) { + this.context = new DocumentContext(options, schema); } @Override diff --git a/src/main/java/com/marklogic/spark/reader/document/DocumentTable.java b/src/main/java/com/marklogic/spark/reader/document/DocumentTable.java index 4b0b4f64..79a95616 100644 --- a/src/main/java/com/marklogic/spark/reader/document/DocumentTable.java +++ b/src/main/java/com/marklogic/spark/reader/document/DocumentTable.java @@ -25,9 +25,15 @@ public class DocumentTable implements SupportsRead, SupportsWrite { capabilities.add(TableCapability.BATCH_WRITE); } + private final StructType schema; + + public DocumentTable(StructType schema) { + this.schema = schema; + } + @Override public ScanBuilder newScanBuilder(CaseInsensitiveStringMap options) { - return new DocumentScanBuilder(options); + return new DocumentScanBuilder(options, this.schema); } @Override @@ -42,7 +48,7 @@ public String name() { @Override public StructType schema() { - return DocumentRowSchema.SCHEMA; + return this.schema; } @Override diff --git a/src/main/java/com/marklogic/spark/reader/document/ForestReader.java b/src/main/java/com/marklogic/spark/reader/document/ForestReader.java index 89884620..cbcef175 100644 --- a/src/main/java/com/marklogic/spark/reader/document/ForestReader.java +++ b/src/main/java/com/marklogic/spark/reader/document/ForestReader.java @@ -12,20 +12,13 @@ import com.marklogic.client.query.SearchQueryDefinition; import com.marklogic.client.query.StructuredQueryBuilder; import com.marklogic.spark.Options; +import com.marklogic.spark.ReadProgressLogger; import org.apache.spark.sql.catalyst.InternalRow; -import org.apache.spark.sql.catalyst.expressions.GenericInternalRow; -import org.apache.spark.sql.catalyst.util.ArrayBasedMapData; -import org.apache.spark.sql.catalyst.util.ArrayData; import org.apache.spark.sql.connector.read.PartitionReader; -import org.apache.spark.unsafe.types.ByteArray; -import org.apache.spark.unsafe.types.UTF8String; import org.slf4j.Logger; import org.slf4j.LoggerFactory; -import javax.xml.namespace.QName; -import java.util.Iterator; import java.util.List; -import java.util.Map; import java.util.Set; /** @@ -112,22 +105,21 @@ public boolean next() { @Override public InternalRow get() { DocumentRecord document = this.currentDocumentPage.next(); - - Object[] row = new Object[8]; - row[0] = UTF8String.fromString(document.getUri()); + DocumentRowBuilder builder = new DocumentRowBuilder(requestedMetadata).withUri(document.getUri()); if (this.contentWasRequested) { - row[1] = ByteArray.concat(document.getContent(new BytesHandle()).get()); - final String format = document.getFormat() != null ? document.getFormat().toString() : Format.UNKNOWN.toString(); - row[2] = UTF8String.fromString(format); + builder.withContent(document.getContent(new BytesHandle()).get()); + builder.withFormat(document.getFormat() != null ? 
document.getFormat().toString() : Format.UNKNOWN.toString()); } - if (!requestedMetadata.isEmpty()) { - DocumentMetadataHandle metadata = document.getMetadata(new DocumentMetadataHandle()); - populateMetadataColumns(row, metadata); + builder.withMetadata(document.getMetadata(new DocumentMetadataHandle())); } - docCount++; - return new GenericInternalRow(row); + return builder.buildRow(); + } + + @Override + public void close() { + closeCurrentDocumentPage(); } private List getNextBatchOfUris() { @@ -154,87 +146,10 @@ private DocumentPage readPage(List uris) { if (logger.isTraceEnabled()) { logger.trace("Retrieved page of documents in {}ms from partition {}", (System.currentTimeMillis() - start), this.forestPartition); } + ReadProgressLogger.logProgressIfNecessary(page.getPageSize()); return page; } - private void populateMetadataColumns(Object[] row, DocumentMetadataHandle metadata) { - if (requestedMetadataHas(DocumentManager.Metadata.COLLECTIONS)) { - populateCollectionsColumn(row, metadata); - } - if (requestedMetadataHas(DocumentManager.Metadata.PERMISSIONS)) { - populatePermissionsColumn(row, metadata); - } - if (requestedMetadataHas(DocumentManager.Metadata.QUALITY)) { - row[5] = metadata.getQuality(); - } - if (requestedMetadataHas(DocumentManager.Metadata.PROPERTIES)) { - populatePropertiesColumn(row, metadata); - } - if (requestedMetadataHas(DocumentManager.Metadata.METADATAVALUES)) { - populateMetadataValuesColumn(row, metadata); - } - } - - private void populateCollectionsColumn(Object[] row, DocumentMetadataHandle metadata) { - UTF8String[] collections = new UTF8String[metadata.getCollections().size()]; - Iterator iterator = metadata.getCollections().iterator(); - for (int i = 0; i < collections.length; i++) { - collections[i] = UTF8String.fromString(iterator.next()); - } - row[3] = ArrayData.toArrayData(collections); - } - - private void populatePermissionsColumn(Object[] row, DocumentMetadataHandle metadata) { - DocumentMetadataHandle.DocumentPermissions perms = metadata.getPermissions(); - UTF8String[] roles = new UTF8String[perms.size()]; - Object[] capabilityArrays = new Object[perms.size()]; - int i = 0; - for (Map.Entry> entry : perms.entrySet()) { - roles[i] = UTF8String.fromString(entry.getKey()); - UTF8String[] capabilities = new UTF8String[entry.getValue().size()]; - int j = 0; - Iterator iterator = entry.getValue().iterator(); - while (iterator.hasNext()) { - capabilities[j++] = UTF8String.fromString(iterator.next().name()); - } - capabilityArrays[i++] = ArrayData.toArrayData(capabilities); - } - row[4] = ArrayBasedMapData.apply(roles, capabilityArrays); - } - - private void populatePropertiesColumn(Object[] row, DocumentMetadataHandle metadata) { - DocumentMetadataHandle.DocumentProperties props = metadata.getProperties(); - UTF8String[] keys = new UTF8String[props.size()]; - UTF8String[] values = new UTF8String[props.size()]; - int index = 0; - for (QName key : props.keySet()) { - keys[index] = UTF8String.fromString(key.toString()); - values[index++] = UTF8String.fromString(props.get(key, String.class)); - } - row[6] = ArrayBasedMapData.apply(keys, values); - } - - private void populateMetadataValuesColumn(Object[] row, DocumentMetadataHandle metadata) { - DocumentMetadataHandle.DocumentMetadataValues metadataValues = metadata.getMetadataValues(); - UTF8String[] keys = new UTF8String[metadataValues.size()]; - UTF8String[] values = new UTF8String[metadataValues.size()]; - int index = 0; - for (Map.Entry entry : metadataValues.entrySet()) { - keys[index] = 
UTF8String.fromString(entry.getKey()); - values[index++] = UTF8String.fromString(entry.getValue()); - } - row[7] = ArrayBasedMapData.apply(keys, values); - } - - private boolean requestedMetadataHas(DocumentManager.Metadata metadata) { - return requestedMetadata.contains(metadata) || requestedMetadata.contains(DocumentManager.Metadata.ALL); - } - - @Override - public void close() { - closeCurrentDocumentPage(); - } - private void closeCurrentDocumentPage() { if (currentDocumentPage != null) { currentDocumentPage.close(); diff --git a/src/main/java/com/marklogic/spark/reader/document/ForestReaderFactory.java b/src/main/java/com/marklogic/spark/reader/document/ForestReaderFactory.java index 92caebd2..8b7ab061 100644 --- a/src/main/java/com/marklogic/spark/reader/document/ForestReaderFactory.java +++ b/src/main/java/com/marklogic/spark/reader/document/ForestReaderFactory.java @@ -1,5 +1,6 @@ package com.marklogic.spark.reader.document; +import com.marklogic.spark.reader.file.TripleRowSchema; import org.apache.spark.sql.catalyst.InternalRow; import org.apache.spark.sql.connector.read.InputPartition; import org.apache.spark.sql.connector.read.PartitionReader; @@ -8,7 +9,7 @@ class ForestReaderFactory implements PartitionReaderFactory { static final long serialVersionUID = 1; - + private DocumentContext documentContext; ForestReaderFactory(DocumentContext documentContext) { @@ -17,6 +18,8 @@ class ForestReaderFactory implements PartitionReaderFactory { @Override public PartitionReader createReader(InputPartition partition) { - return new ForestReader((ForestPartition) partition, documentContext); + return TripleRowSchema.SCHEMA.equals(documentContext.getSchema()) ? + new OpticTriplesReader((ForestPartition) partition, documentContext) : + new ForestReader((ForestPartition) partition, documentContext); } } diff --git a/src/main/java/com/marklogic/spark/reader/document/OpticTriplesReader.java b/src/main/java/com/marklogic/spark/reader/document/OpticTriplesReader.java new file mode 100644 index 00000000..84b6403a --- /dev/null +++ b/src/main/java/com/marklogic/spark/reader/document/OpticTriplesReader.java @@ -0,0 +1,157 @@ +package com.marklogic.spark.reader.document; + +import com.marklogic.client.DatabaseClient; +import com.marklogic.client.expression.PlanBuilder; +import com.marklogic.client.query.SearchQueryDefinition; +import com.marklogic.client.row.RowManager; +import com.marklogic.client.row.RowRecord; +import com.marklogic.client.type.PlanColumn; +import com.marklogic.spark.Options; +import com.marklogic.spark.ReadProgressLogger; +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.catalyst.expressions.GenericInternalRow; +import org.apache.spark.sql.connector.read.PartitionReader; +import org.apache.spark.unsafe.types.UTF8String; + +import java.io.IOException; +import java.net.URI; +import java.net.URISyntaxException; +import java.util.Iterator; +import java.util.List; + +/** + * Reads triples from a batch of document URIs via the Optic fromTriples data accessor. 
+ */ +class OpticTriplesReader implements PartitionReader { + + private static final String DATATYPE_COLUMN = "datatype"; + private static final String GRAPH_COLUMN = "graph"; + private static final String OBJECT_COLUMN = "object"; + + private final UriBatcher uriBatcher; + private final DatabaseClient databaseClient; + private final DocumentContext documentContext; + private final RowManager rowManager; + private final PlanBuilder op; + private final String graphBaseIri; + + // Only for logging + private final long batchSize; + private long progressCounter; + + private Iterator currentRowIterator; + + public OpticTriplesReader(ForestPartition forestPartition, DocumentContext context) { + this.documentContext = context; + this.graphBaseIri = context.getStringOption(Options.READ_TRIPLES_BASE_IRI); + this.databaseClient = context.isDirectConnection() ? + context.connectToMarkLogic(forestPartition.getHost()) : + context.connectToMarkLogic(); + this.rowManager = this.databaseClient.newRowManager(); + this.op = this.rowManager.newPlanBuilder(); + + final SearchQueryDefinition query = context.buildTriplesSearchQuery(this.databaseClient); + boolean filtered = false; + if (context.hasOption(Options.READ_TRIPLES_FILTERED)) { + filtered = Boolean.parseBoolean(context.getProperties().get(Options.READ_TRIPLES_FILTERED)); + } + this.uriBatcher = new UriBatcher(this.databaseClient, query, forestPartition, context.getBatchSize(), filtered); + + this.batchSize = context.getBatchSize(); + } + + @Override + public boolean next() throws IOException { + if (currentRowIterator != null && currentRowIterator.hasNext()) { + return true; + } + while (currentRowIterator == null || !currentRowIterator.hasNext()) { + List uris = uriBatcher.nextBatchOfUris(); + if (uris.isEmpty()) { + return false; // End state; no more matching documents were found. + } + readNextBatchOfTriples(uris); + } + return true; + } + + @Override + public InternalRow get() { + Object[] row = convertNextTripleIntoRow(); + progressCounter++; + if (progressCounter >= batchSize) { + ReadProgressLogger.logProgressIfNecessary(this.progressCounter); + progressCounter = 0; + } + return new GenericInternalRow(row); + } + + @Override + public void close() { + // Nothing to close. + } + + private void readNextBatchOfTriples(List uris) { + PlanBuilder.ModifyPlan plan = op + .fromTriples(op.pattern(op.col("subject"), op.col("predicate"), op.col(OBJECT_COLUMN), op.graphCol(GRAPH_COLUMN))) + .where(op.cts.documentQuery(op.xs.stringSeq(uris.toArray(new String[0])))); + + if (documentContext.hasOption(Options.READ_TRIPLES_GRAPHS)) { + String[] graphs = documentContext.getStringOption(Options.READ_TRIPLES_GRAPHS).split(","); + plan = plan.where(op.in(op.col(GRAPH_COLUMN), op.xs.stringSeq(graphs))); + } + + plan = bindDatatypeAndLang(plan); + + currentRowIterator = rowManager.resultRows(plan).iterator(); + } + + /** + * Ideally, fromTriples would allow for columns to be declared so that datatype and lang could be easily fetched. + * Instead, we have to bind additional columns to retrieve these values. 
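To make the per-batch Optic query issued by `OpticTriplesReader` above more concrete, here is a minimal sketch using the Java Client's row API directly. It assumes an already-constructed `DatabaseClient`, mirrors the `fromTriples`/`pattern`/`documentQuery` calls shown in `readNextBatchOfTriples`, and omits the graph filtering and datatype/lang binding; the helper name is invented:

```
import com.marklogic.client.DatabaseClient;
import com.marklogic.client.expression.PlanBuilder;
import com.marklogic.client.row.RowManager;
import com.marklogic.client.row.RowRecord;

import java.util.List;

public class TriplesQuerySketch {

    // Prints the subject/predicate/object of every triple contained in the given documents.
    static void printTriples(DatabaseClient client, List<String> uris) {
        RowManager rowManager = client.newRowManager();
        PlanBuilder op = rowManager.newPlanBuilder();

        PlanBuilder.ModifyPlan plan = op
            .fromTriples(op.pattern(op.col("subject"), op.col("predicate"), op.col("object"), op.graphCol("graph")))
            .where(op.cts.documentQuery(op.xs.stringSeq(uris.toArray(new String[0]))));

        for (RowRecord row : rowManager.resultRows(plan)) {
            System.out.println(row.getString("subject") + " " + row.getString("predicate") + " " + row.getString("object"));
        }
    }
}
```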
+ */ + private PlanBuilder.ModifyPlan bindDatatypeAndLang(PlanBuilder.ModifyPlan plan) { + final PlanColumn objectCol = op.col(OBJECT_COLUMN); + return plan.bindAs(DATATYPE_COLUMN, op.caseExpr( + op.when(op.sem.isLiteral(objectCol), op.sem.datatype(objectCol)), + op.elseExpr(op.sem.iri(op.xs.string(""))) + )).bindAs("lang", op.caseExpr( + op.when(op.eq(op.col(DATATYPE_COLUMN), op.sem.iri("http://www.w3.org/1999/02/22-rdf-syntax-ns#langString")), op.sem.lang(objectCol)), + op.elseExpr(op.xs.string("")) + )); + } + + private Object[] convertNextTripleIntoRow() { + RowRecord row = currentRowIterator.next(); + return new Object[]{ + getString(row, "subject"), + getString(row, "predicate"), + getString(row, OBJECT_COLUMN), + getString(row, DATATYPE_COLUMN), + getString(row, "lang"), + getGraph(row) + }; + } + + private UTF8String getGraph(RowRecord row) { + String value = row.getString(GRAPH_COLUMN); + if (this.graphBaseIri != null && isGraphRelative(value)) { + value = this.graphBaseIri + value; + } + return value != null && value.trim().length() > 0 ? UTF8String.fromString(value) : null; + } + + private boolean isGraphRelative(String value) { + try { + return value != null && !(new URI(value).isAbsolute()); + } catch (URISyntaxException e) { + // If the graph is not a valid URI, it is not an absolute URI, and thus the base IRI will be prepended. + return true; + } + } + + private UTF8String getString(RowRecord row, String column) { + String value = row.getString(column); + return value != null && value.trim().length() > 0 ? UTF8String.fromString(value) : null; + } +} diff --git a/src/main/java/com/marklogic/spark/reader/document/SearchQueryBuilder.java b/src/main/java/com/marklogic/spark/reader/document/SearchQueryBuilder.java index 3b96bee9..81d70bae 100644 --- a/src/main/java/com/marklogic/spark/reader/document/SearchQueryBuilder.java +++ b/src/main/java/com/marklogic/spark/reader/document/SearchQueryBuilder.java @@ -5,6 +5,7 @@ import com.marklogic.client.io.Format; import com.marklogic.client.io.StringHandle; import com.marklogic.client.query.*; +import com.marklogic.spark.Util; /** * Potentially reusable class for the Java Client that handles constructing a query based on a common @@ -20,6 +21,7 @@ public class SearchQueryBuilder { private String transformName; private String transformParams; private String transformParamsDelimiter; + private String[] uris; SearchQueryDefinition buildQuery(DatabaseClient client) { QueryDefinition queryDefinition = buildQueryDefinition(client); @@ -82,8 +84,25 @@ public SearchQueryBuilder withTransformParamsDelimiter(String delimiter) { return this; } + public SearchQueryBuilder withUris(String... uris) { + this.uris = uris; + return this; + } + private QueryDefinition buildQueryDefinition(DatabaseClient client) { final QueryManager queryManager = client.newQueryManager(); + + if (uris != null && uris.length > 0) { + StructuredQueryDefinition urisQuery = queryManager.newStructuredQueryBuilder().document(this.uris); + if (stringQuery != null && stringQuery.length() > 0) { + urisQuery.withCriteria(stringQuery); + } + if (this.query != null) { + Util.MAIN_LOGGER.warn("Ignoring query since a list of URIs was provided; query: {}", this.query); + } + return urisQuery; + } + if (query != null) { StringHandle queryHandle = new StringHandle(query); // v1/search assumes XML by default, so only need to set to JSON if the query is JSON. @@ -95,14 +114,16 @@ private QueryDefinition buildQueryDefinition(DatabaseClient client) { // for any of the 3 query types. 
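The `withUris` support added to `SearchQueryBuilder` reduces to a structured document query, optionally narrowed by a string query. A minimal sketch of the equivalent Java Client calls, assuming an existing `DatabaseClient`; the helper name and sample arguments are invented:

```
import com.marklogic.client.DatabaseClient;
import com.marklogic.client.query.QueryManager;
import com.marklogic.client.query.StructuredQueryDefinition;

public class UrisQuerySketch {

    // Builds a query that matches only the given documents, optionally
    // constrained further by a string query such as "hello".
    static StructuredQueryDefinition buildUrisQuery(DatabaseClient client, String stringQuery, String... uris) {
        QueryManager queryManager = client.newQueryManager();
        StructuredQueryDefinition urisQuery = queryManager.newStructuredQueryBuilder().document(uris);
        if (stringQuery != null && stringQuery.length() > 0) {
            urisQuery.withCriteria(stringQuery);
        }
        return urisQuery;
    }
}
```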
RawStructuredQueryDefinition queryDefinition = queryManager.newRawStructuredQueryDefinition(queryHandle); if (stringQuery != null && stringQuery.length() > 0) { - queryDefinition.withCriteria(stringQuery); + queryDefinition.setCriteria(stringQuery); } return queryDefinition; } + StringQueryDefinition queryDefinition = queryManager.newStringDefinition(); if (this.stringQuery != null && stringQuery.length() > 0) { queryDefinition.setCriteria(this.stringQuery); } + return queryDefinition; } @@ -131,7 +152,7 @@ private ServerTransform buildServerTransform() { String delimiter = transformParamsDelimiter != null && transformParamsDelimiter.trim().length() > 0 ? transformParamsDelimiter : ","; String[] params = transformParams.split(delimiter); if (params.length % 2 != 0) { - throw new IllegalArgumentException("Transform params must have an equal number of parameter names and values: " + transformParams); + throw new IllegalArgumentException("Transform parameters must have an equal number of parameter names and values: " + transformParams); } for (int i = 0; i < params.length; i += 2) { String name = params[i]; diff --git a/src/main/java/com/marklogic/spark/reader/file/AggregateXMLFileReader.java b/src/main/java/com/marklogic/spark/reader/file/AggregateXMLFileReader.java deleted file mode 100644 index 4c074e11..00000000 --- a/src/main/java/com/marklogic/spark/reader/file/AggregateXMLFileReader.java +++ /dev/null @@ -1,73 +0,0 @@ -package com.marklogic.spark.reader.file; - -import com.marklogic.spark.ConnectorException; -import org.apache.commons.io.IOUtils; -import org.apache.hadoop.fs.Path; -import org.apache.spark.sql.catalyst.InternalRow; -import org.apache.spark.sql.connector.read.PartitionReader; -import org.apache.spark.util.SerializableConfiguration; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import java.io.IOException; -import java.io.InputStream; -import java.util.Map; - -class AggregateXMLFileReader implements PartitionReader { - - private static final Logger logger = LoggerFactory.getLogger(AggregateXMLFileReader.class); - - private final String path; - private final InputStream inputStream; - private final AggregateXMLSplitter aggregateXMLSplitter; - - AggregateXMLFileReader(FilePartition partition, Map properties, SerializableConfiguration hadoopConfiguration) { - if (logger.isTraceEnabled()) { - logger.trace("Reading path: {}", partition.getPath()); - } - this.path = partition.getPath(); - Path hadoopPath = new Path(this.path); - - try { - this.inputStream = makeInputStream(hadoopPath, hadoopConfiguration); - } catch (IOException e) { - throw new ConnectorException(String.format("Unable to open %s; cause: %s", path, e.getMessage()), e); - } - - String identifierForError = "file " + hadoopPath; - try { - this.aggregateXMLSplitter = new AggregateXMLSplitter(identifierForError, this.inputStream, properties); - } catch (Exception e) { - // Interestingly, this won't fail if the file is malformed or not XML. It's only when we try to get the - // first element. 
- throw new ConnectorException(String.format("Unable to read %s", hadoopPath), e); - } - } - - @Override - public boolean next() { - try { - return this.aggregateXMLSplitter.hasNext(); - } catch (RuntimeException e) { - String message = String.format("Unable to read XML from %s; cause: %s", this.path, e.getMessage()); - throw new ConnectorException(message, e); - } - } - - @Override - public InternalRow get() { - return this.aggregateXMLSplitter.nextRow(this.path); - } - - @Override - public void close() { - IOUtils.closeQuietly(this.inputStream); - } - - // Protected so that it can be overridden for gzipped files. - protected InputStream makeInputStream(Path path, SerializableConfiguration hadoopConfiguration) throws IOException { - // Contrary to writing files, testing has shown no difference in performance with using e.g. FileInputStream - // instead of fileSystem.open when fileSystem is a LocalFileSystem. - return path.getFileSystem(hadoopConfiguration.value()).open(path); - } -} diff --git a/src/main/java/com/marklogic/spark/reader/file/AggregateXmlFileReader.java b/src/main/java/com/marklogic/spark/reader/file/AggregateXmlFileReader.java new file mode 100644 index 00000000..0b98ed2d --- /dev/null +++ b/src/main/java/com/marklogic/spark/reader/file/AggregateXmlFileReader.java @@ -0,0 +1,101 @@ +package com.marklogic.spark.reader.file; + +import com.marklogic.spark.ConnectorException; +import com.marklogic.spark.Util; +import org.apache.commons.io.IOUtils; +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.connector.read.PartitionReader; + +import java.io.InputStream; + +class AggregateXmlFileReader implements PartitionReader { + + private final FilePartition filePartition; + private final FileContext fileContext; + + private InputStream inputStream; + private AggregateXmlSplitter aggregateXMLSplitter; + private InternalRow nextRowToReturn; + private int filePathIndex = 0; + + AggregateXmlFileReader(FilePartition filePartition, FileContext fileContext) { + this.filePartition = filePartition; + this.fileContext = fileContext; + } + + @Override + public boolean next() { + if (aggregateXMLSplitter == null && !initializeAggregateXMLSplitter()) { + return false; + } + + // Iterate until either a valid element is found or we run out of elements. + while (true) { + try { + if (!this.aggregateXMLSplitter.hasNext()) { + aggregateXMLSplitter = null; + filePathIndex++; + return next(); + } + } catch (ConnectorException ex) { + if (fileContext.isReadAbortOnFailure()) { + throw ex; + } + Util.MAIN_LOGGER.warn(ex.getMessage()); + aggregateXMLSplitter = null; + filePathIndex++; + return next(); + } + + try { + nextRowToReturn = this.aggregateXMLSplitter.nextRow(filePartition.getPaths().get(filePathIndex)); + return true; + } catch (RuntimeException ex) { + // Error is expected to be friendly already. 
+ if (fileContext.isReadAbortOnFailure()) { + throw ex; + } + Util.MAIN_LOGGER.warn(ex.getMessage()); + } + } + } + + @Override + public InternalRow get() { + return nextRowToReturn; + } + + @Override + public void close() { + IOUtils.closeQuietly(this.inputStream); + } + + private boolean initializeAggregateXMLSplitter() { + if (filePathIndex >= filePartition.getPaths().size()) { + return false; + } + + final String filePath = filePartition.getPaths().get(filePathIndex); + try { + this.inputStream = fileContext.openFile(filePath); + String identifierForError = "file " + filePath; + this.aggregateXMLSplitter = new AggregateXmlSplitter(identifierForError, this.inputStream, fileContext); + return true; + } catch (ConnectorException ex) { + if (fileContext.isReadAbortOnFailure()) { + throw ex; + } + Util.MAIN_LOGGER.warn(ex.getMessage()); + filePathIndex++; + return initializeAggregateXMLSplitter(); + } catch (Exception ex) { + String message = String.format("Unable to read file at %s; cause: %s", filePath, ex.getMessage()); + if (fileContext.isReadAbortOnFailure()) { + throw new ConnectorException(message, ex); + } + Util.MAIN_LOGGER.warn(ex.getMessage()); + filePathIndex++; + return initializeAggregateXMLSplitter(); + } + } +} diff --git a/src/main/java/com/marklogic/spark/reader/file/AggregateXMLSplitter.java b/src/main/java/com/marklogic/spark/reader/file/AggregateXmlSplitter.java similarity index 66% rename from src/main/java/com/marklogic/spark/reader/file/AggregateXMLSplitter.java rename to src/main/java/com/marklogic/spark/reader/file/AggregateXmlSplitter.java index 0a19fe7e..533b16d7 100644 --- a/src/main/java/com/marklogic/spark/reader/file/AggregateXMLSplitter.java +++ b/src/main/java/com/marklogic/spark/reader/file/AggregateXmlSplitter.java @@ -9,19 +9,19 @@ import org.apache.spark.unsafe.types.ByteArray; import org.apache.spark.unsafe.types.UTF8String; +import javax.xml.stream.XMLInputFactory; import javax.xml.stream.XMLStreamException; import javax.xml.stream.XMLStreamReader; import java.io.ByteArrayInputStream; import java.io.IOException; import java.io.InputStream; import java.util.Iterator; -import java.util.Map; /** * Knows how to split an aggregate XML document and return a row for each user-defined child element. Each row has - * a schema matching that of {@code FileRowSchema}. + * a schema matching that of {@code DocumentRowSchema}. */ -class AggregateXMLSplitter { +class AggregateXmlSplitter { private final Iterator contentStream; private final String identifierForErrors; @@ -32,20 +32,37 @@ class AggregateXMLSplitter { private int rowCounter = 1; + private static XMLInputFactory xmlInputFactory; + + static { + xmlInputFactory = XMLInputFactory.newFactory(); + // The following prevents XXE attacks, per Sonar java:S2755 rule. + // Note that setting XMLConstants.ACCESS_EXTERNAL_DTD and XMLConstants.ACCESS_EXTERNAL_SCHEMA to empty + // strings is also suggested by the Sonar S2755 docs and will work fine in this connector project - but it + // will result in warnings in the Flux application that oddly cause no data to be read. So do not set those + // to empty strings here. The below config satisfies Sonar in terms of preventing XXE attacks and does not + // impact functionality. 
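As a reference for the Sonar S2755 hardening applied to the shared `XMLInputFactory` above, the same two properties can be set on a standalone factory. This is only a sketch; the file name and encoding are invented:

```
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class SafeXmlReaderSketch {

    public static void main(String[] args) throws IOException, XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        // Disable external entities and DTD support to prevent XXE attacks.
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);

        try (InputStream inputStream = new FileInputStream("example-aggregate.xml")) {
            XMLStreamReader reader = factory.createXMLStreamReader(inputStream, "UTF-8");
            while (reader.hasNext()) {
                reader.next();
            }
        }
    }
}
```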
+ xmlInputFactory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE); + xmlInputFactory.setProperty(XMLInputFactory.SUPPORT_DTD, false); + } + /** * @param identifierForErrors allows the caller of this class to provide a useful description to be included in * any errors to help users with debugging. * @param inputStream the stream of aggregate XML data - * @param properties connector properties + * @param fileContext */ - AggregateXMLSplitter(String identifierForErrors, InputStream inputStream, Map properties) { + AggregateXmlSplitter(String identifierForErrors, InputStream inputStream, FileContext fileContext) { this.identifierForErrors = identifierForErrors; - this.uriElement = properties.get(Options.READ_AGGREGATES_XML_URI_ELEMENT); - this.uriNamespace = properties.get(Options.READ_AGGREGATES_XML_URI_NAMESPACE); - String namespace = properties.get(Options.READ_AGGREGATES_XML_NAMESPACE); - String element = properties.get(Options.READ_AGGREGATES_XML_ELEMENT); + this.uriElement = fileContext.getStringOption(Options.READ_AGGREGATES_XML_URI_ELEMENT); + this.uriNamespace = fileContext.getStringOption(Options.READ_AGGREGATES_XML_URI_NAMESPACE); + final String namespace = fileContext.getStringOption(Options.READ_AGGREGATES_XML_NAMESPACE); + final String element = fileContext.getStringOption(Options.READ_AGGREGATES_XML_ELEMENT); + final String encoding = fileContext.getStringOption(Options.READ_FILES_ENCODING); + try { - this.contentStream = XMLSplitter.makeSplitter(namespace, element).split(inputStream).iterator(); + XMLStreamReader reader = xmlInputFactory.createXMLStreamReader(inputStream, encoding); + this.contentStream = XMLSplitter.makeSplitter(namespace, element).split(reader).iterator(); } catch (IOException | XMLStreamException e) { throw new ConnectorException( String.format("Unable to read XML at %s; cause: %s", this.identifierForErrors, e.getMessage()), e @@ -54,14 +71,19 @@ class AggregateXMLSplitter { } boolean hasNext() { - return this.contentStream.hasNext(); + try { + return this.contentStream.hasNext(); + } catch (Exception e) { + String message = String.format("Unable to read XML from %s; cause: %s", identifierForErrors, e.getMessage()); + throw new ConnectorException(message, e); + } } /** - * @param uriPrefix used to construct a URI if no uriElement was specified - * @return a row corresponding to the {@code FileRowSchema} + * @param pathPrefix used to construct a path if no uriElement was specified + * @return a row corresponding to the {@code DocumentRowSchema} */ - InternalRow nextRow(String uriPrefix) { + InternalRow nextRow(String pathPrefix) { String xml; try { xml = this.contentStream.next().get(); @@ -71,17 +93,18 @@ InternalRow nextRow(String uriPrefix) { throw new ConnectorException(message, ex); } - final String uri = this.uriElement != null && !this.uriElement.trim().isEmpty() ? + final String path = this.uriElement != null && !this.uriElement.trim().isEmpty() ? 
extractUriElementValue(xml) : - uriPrefix + "-" + rowCounter + ".xml"; + pathPrefix + "-" + rowCounter + ".xml"; rowCounter++; byte[] content = xml.getBytes(); - long length = content.length; return new GenericInternalRow(new Object[]{ - UTF8String.fromString(uri), null, length, - ByteArray.concat(xml.getBytes()), + UTF8String.fromString(path), + ByteArray.concat(content), + UTF8String.fromString("xml"), + null, null, null, null, null }); } diff --git a/src/main/java/com/marklogic/spark/reader/file/ArchiveFileReader.java b/src/main/java/com/marklogic/spark/reader/file/ArchiveFileReader.java new file mode 100644 index 00000000..d490fc29 --- /dev/null +++ b/src/main/java/com/marklogic/spark/reader/file/ArchiveFileReader.java @@ -0,0 +1,113 @@ +package com.marklogic.spark.reader.file; + +import com.marklogic.client.io.DocumentMetadataHandle; +import com.marklogic.spark.ConnectorException; +import com.marklogic.spark.Options; +import com.marklogic.spark.Util; +import com.marklogic.spark.reader.document.DocumentRowBuilder; +import org.apache.commons.io.IOUtils; +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.connector.read.PartitionReader; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.List; +import java.util.zip.ZipEntry; +import java.util.zip.ZipInputStream; + +class ArchiveFileReader implements PartitionReader { + + private final FilePartition filePartition; + private final FileContext fileContext; + private final List metadataCategories; + + private String currentFilePath; + private ZipInputStream currentZipInputStream; + private int nextFilePathIndex; + private InternalRow nextRowToReturn; + + ArchiveFileReader(FilePartition filePartition, FileContext fileContext) { + this.filePartition = filePartition; + this.fileContext = fileContext; + this.metadataCategories = new ArrayList<>(); + if (fileContext.hasOption(Options.READ_ARCHIVES_CATEGORIES)) { + for (String category : fileContext.getStringOption(Options.READ_ARCHIVES_CATEGORIES).split(",")) { + this.metadataCategories.add(category.toLowerCase()); + } + } + + openNextFile(); + } + + @Override + public boolean next() { + try { + ZipEntry contentZipEntry = FileUtil.findNextFileEntry(currentZipInputStream); + if (contentZipEntry == null) { + return openNextFileAndReadNextEntry(); + } + byte[] content = fileContext.readBytes(currentZipInputStream); + if (content == null || content.length == 0) { + return openNextFileAndReadNextEntry(); + } + final String zipEntryName = contentZipEntry.getName(); + + byte[] metadataBytes = readMetadataEntry(zipEntryName); + if (metadataBytes == null || metadataBytes.length == 0) { + return openNextFileAndReadNextEntry(); + } + + DocumentMetadataHandle metadata = new DocumentMetadataHandle(); + metadata.fromBuffer(metadataBytes); + this.nextRowToReturn = new DocumentRowBuilder(this.metadataCategories) + .withUri(zipEntryName).withContent(content).withMetadata(metadata) + .buildRow(); + return true; + } catch (IOException e) { + String message = String.format("Unable to read archive file at %s; cause: %s", this.currentFilePath, e.getMessage()); + if (fileContext.isReadAbortOnFailure()) { + throw new ConnectorException(message, e); + } + Util.MAIN_LOGGER.warn(message); + return openNextFileAndReadNextEntry(); + } + } + + @Override + public InternalRow get() { + return nextRowToReturn; + } + + @Override + public void close() { + IOUtils.closeQuietly(this.currentZipInputStream); + } + + private void openNextFile() { + this.currentFilePath = 
filePartition.getPaths().get(nextFilePathIndex); + nextFilePathIndex++; + this.currentZipInputStream = new ZipInputStream(fileContext.openFile(this.currentFilePath)); + } + + private boolean openNextFileAndReadNextEntry() { + close(); + if (nextFilePathIndex >= this.filePartition.getPaths().size()) { + return false; + } + openNextFile(); + return next(); + } + + private byte[] readMetadataEntry(String zipEntryName) throws IOException { + ZipEntry metadataEntry = FileUtil.findNextFileEntry(currentZipInputStream); + if (metadataEntry == null || !metadataEntry.getName().endsWith(".metadata")) { + String message = String.format("Could not find metadata entry for entry %s in file %s", zipEntryName, this.currentFilePath); + if (fileContext.isReadAbortOnFailure()) { + throw new ConnectorException(message); + } + Util.MAIN_LOGGER.warn(message); + return new byte[0]; + } + return fileContext.readBytes(currentZipInputStream); + } +} diff --git a/src/main/java/com/marklogic/spark/reader/file/FileBatch.java b/src/main/java/com/marklogic/spark/reader/file/FileBatch.java index 9e030173..809c1bee 100644 --- a/src/main/java/com/marklogic/spark/reader/file/FileBatch.java +++ b/src/main/java/com/marklogic/spark/reader/file/FileBatch.java @@ -1,5 +1,7 @@ package com.marklogic.spark.reader.file; +import com.marklogic.spark.ConnectorException; +import com.marklogic.spark.Options; import org.apache.hadoop.conf.Configuration; import org.apache.spark.sql.SparkSession; import org.apache.spark.sql.connector.read.Batch; @@ -22,13 +24,17 @@ class FileBatch implements Batch { @Override public InputPartition[] planInputPartitions() { - // TBD For gzipped files, we may want a different approach that isn't one partition per file. - String[] files = fileIndex.inputFiles(); - InputPartition[] result = new InputPartition[files.length]; - for (int i = 0; i < result.length; i++) { - result[i] = new FilePartition(files[i]); + String[] inputFiles = fileIndex.inputFiles(); + int numPartitions = inputFiles.length; + if (properties.containsKey(Options.READ_NUM_PARTITIONS)) { + String value = properties.get(Options.READ_NUM_PARTITIONS); + try { + numPartitions = Integer.parseInt(value); + } catch (NumberFormatException e) { + throw new ConnectorException(String.format("Invalid value for number of partitions: %s", value)); + } } - return result; + return FileUtil.makeFilePartitions(inputFiles, numPartitions); } @Override @@ -36,6 +42,7 @@ public PartitionReaderFactory createReaderFactory() { // This config is needed to resolve file paths. This is our last chance to access it and provide a serialized // version to the factory, which must be serializable itself. 
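The `ArchiveFileReader` above expects each content entry in a zip to be immediately followed by a metadata entry whose name ends in `.metadata`. As a hedged illustration of that layout (entry names and content are invented, and `toBuffer` is assumed to produce the REST metadata XML that `fromBuffer` reads back), a compatible archive could be written like this:

```
import com.marklogic.client.io.DocumentMetadataHandle;

import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ArchiveLayoutSketch {

    public static void main(String[] args) throws IOException {
        DocumentMetadataHandle metadata = new DocumentMetadataHandle();
        metadata.getCollections().add("example-collection");
        metadata.setQuality(1);

        try (ZipOutputStream zip = new ZipOutputStream(new FileOutputStream("example-archive.zip"))) {
            // Content entry first...
            zip.putNextEntry(new ZipEntry("/example/doc1.json"));
            zip.write("{\"hello\":\"world\"}".getBytes());
            zip.closeEntry();
            // ...immediately followed by its companion metadata entry.
            zip.putNextEntry(new ZipEntry("/example/doc1.json.metadata"));
            zip.write(metadata.toBuffer());
            zip.closeEntry();
        }
    }
}
```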
Configuration config = SparkSession.active().sparkContext().hadoopConfiguration(); - return new FilePartitionReaderFactory(properties, new SerializableConfiguration(config)); + FileContext fileContext = new FileContext(properties, new SerializableConfiguration(config)); + return new FilePartitionReaderFactory(fileContext); } } diff --git a/src/main/java/com/marklogic/spark/reader/file/FileContext.java b/src/main/java/com/marklogic/spark/reader/file/FileContext.java new file mode 100644 index 00000000..3595c213 --- /dev/null +++ b/src/main/java/com/marklogic/spark/reader/file/FileContext.java @@ -0,0 +1,68 @@ +package com.marklogic.spark.reader.file; + +import com.marklogic.spark.ConnectorException; +import com.marklogic.spark.ContextSupport; +import com.marklogic.spark.Options; +import org.apache.hadoop.fs.FSDataInputStream; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; +import org.apache.spark.util.SerializableConfiguration; + +import java.io.IOException; +import java.io.InputStream; +import java.io.Serializable; +import java.nio.charset.Charset; +import java.nio.charset.UnsupportedCharsetException; +import java.util.Map; +import java.util.zip.GZIPInputStream; + +class FileContext extends ContextSupport implements Serializable { + + private SerializableConfiguration hadoopConfiguration; + private final String encoding; + + FileContext(Map properties, SerializableConfiguration hadoopConfiguration) { + super(properties); + this.hadoopConfiguration = hadoopConfiguration; + this.encoding = getStringOption(Options.READ_FILES_ENCODING); + if (this.encoding != null) { + try { + Charset.forName(this.encoding); + } catch (UnsupportedCharsetException e) { + throw new ConnectorException(String.format("Unsupported encoding value: %s", this.encoding), e); + } + } + } + + boolean isZip() { + return "zip".equalsIgnoreCase(getStringOption(Options.READ_FILES_COMPRESSION)); + } + + boolean isGzip() { + return "gzip".equalsIgnoreCase(getStringOption(Options.READ_FILES_COMPRESSION)); + } + + InputStream openFile(String filePath) { + try { + Path hadoopPath = new Path(filePath); + FileSystem fileSystem = hadoopPath.getFileSystem(hadoopConfiguration.value()); + FSDataInputStream inputStream = fileSystem.open(hadoopPath); + return this.isGzip() ? new GZIPInputStream(inputStream) : inputStream; + } catch (Exception e) { + throw new ConnectorException(String.format( + "Unable to read file at %s; cause: %s", filePath, e.getMessage()), e); + } + } + + boolean isReadAbortOnFailure() { + if (hasOption(Options.READ_FILES_ABORT_ON_FAILURE)) { + return Boolean.parseBoolean(getStringOption(Options.READ_FILES_ABORT_ON_FAILURE)); + } + return true; + } + + byte[] readBytes(InputStream inputStream) throws IOException { + byte[] bytes = FileUtil.readBytes(inputStream); + return this.encoding != null ? 
new String(bytes, this.encoding).getBytes() : bytes; + } +} diff --git a/src/main/java/com/marklogic/spark/reader/file/FilePartition.java b/src/main/java/com/marklogic/spark/reader/file/FilePartition.java index 68091e59..1e452e81 100644 --- a/src/main/java/com/marklogic/spark/reader/file/FilePartition.java +++ b/src/main/java/com/marklogic/spark/reader/file/FilePartition.java @@ -2,17 +2,24 @@ import org.apache.spark.sql.connector.read.InputPartition; +import java.util.List; + class FilePartition implements InputPartition { static final long serialVersionUID = 1; - private String path; + private final List paths; + + public FilePartition(List paths) { + this.paths = paths; + } - FilePartition(String path) { - this.path = path; + List getPaths() { + return paths; } - String getPath() { - return path; + @Override + public String toString() { + return this.paths.toString(); } } diff --git a/src/main/java/com/marklogic/spark/reader/file/FilePartitionReaderFactory.java b/src/main/java/com/marklogic/spark/reader/file/FilePartitionReaderFactory.java index bb304e34..f47ba40f 100644 --- a/src/main/java/com/marklogic/spark/reader/file/FilePartitionReaderFactory.java +++ b/src/main/java/com/marklogic/spark/reader/file/FilePartitionReaderFactory.java @@ -1,47 +1,44 @@ package com.marklogic.spark.reader.file; -import com.marklogic.spark.ConnectorException; import com.marklogic.spark.Options; import org.apache.spark.sql.catalyst.InternalRow; import org.apache.spark.sql.connector.read.InputPartition; import org.apache.spark.sql.connector.read.PartitionReader; import org.apache.spark.sql.connector.read.PartitionReaderFactory; -import org.apache.spark.util.SerializableConfiguration; - -import java.util.Map; class FilePartitionReaderFactory implements PartitionReaderFactory { static final long serialVersionUID = 1; - private final Map properties; - private final SerializableConfiguration hadoopConfiguration; + private final FileContext fileContext; - FilePartitionReaderFactory(Map properties, SerializableConfiguration hadoopConfiguration) { - this.properties = properties; - this.hadoopConfiguration = hadoopConfiguration; + FilePartitionReaderFactory(FileContext fileContext) { + this.fileContext = fileContext; } @Override public PartitionReader createReader(InputPartition partition) { - FilePartition filePartition = (FilePartition) partition; - String compression = this.properties.get(Options.READ_FILES_COMPRESSION); - final boolean isZip = "zip".equalsIgnoreCase(compression); - final boolean isGzip = "gzip".equalsIgnoreCase(compression); + final FilePartition filePartition = (FilePartition) partition; + final String fileType = fileContext.getStringOption(Options.READ_FILES_TYPE); - String aggregateXmlElement = this.properties.get(Options.READ_AGGREGATES_XML_ELEMENT); - if (aggregateXmlElement != null && !aggregateXmlElement.trim().isEmpty()) { - if (isZip) { - return new ZipAggregateXMLFileReader(filePartition, properties, hadoopConfiguration); - } else if (isGzip) { - return new GZIPAggregateXMLFileReader(filePartition, properties, hadoopConfiguration); + if ("rdf".equalsIgnoreCase(fileType)) { + if (fileContext.isZip()) { + return new RdfZipFileReader(filePartition, fileContext); } - return new AggregateXMLFileReader(filePartition, properties, hadoopConfiguration); - } else if (isZip) { - return new ZipFileReader(filePartition, hadoopConfiguration); - } else if (isGzip) { - return new GZIPFileReader(filePartition, hadoopConfiguration); + return new RdfFileReader(filePartition, fileContext); + } else if 
("mlcp_archive".equalsIgnoreCase(fileType)) { + return new MlcpArchiveFileReader(filePartition, fileContext); + } else if ("archive".equalsIgnoreCase(fileType)) { + return new ArchiveFileReader(filePartition, fileContext); + } else if (fileContext.hasOption(Options.READ_AGGREGATES_XML_ELEMENT)) { + return fileContext.isZip() ? + new ZipAggregateXmlFileReader(filePartition, fileContext) : + new AggregateXmlFileReader(filePartition, fileContext); + } else if (fileContext.isZip()) { + return new ZipFileReader(filePartition, fileContext); + } else if (fileContext.isGzip()) { + return new GzipFileReader(filePartition, fileContext); } - throw new ConnectorException("Only zip and gzip files supported, more to come before 2.2.0 release."); + return new GenericFileReader(filePartition, fileContext); } } diff --git a/src/main/java/com/marklogic/spark/reader/file/FileRowSchema.java b/src/main/java/com/marklogic/spark/reader/file/FileRowSchema.java deleted file mode 100644 index e91fae3c..00000000 --- a/src/main/java/com/marklogic/spark/reader/file/FileRowSchema.java +++ /dev/null @@ -1,19 +0,0 @@ -package com.marklogic.spark.reader.file; - -import org.apache.spark.sql.types.DataTypes; -import org.apache.spark.sql.types.StructType; - -public abstract class FileRowSchema { - - // Same as Spark's binaryType. - // See https://spark.apache.org/docs/latest/sql-data-sources-binaryFile.html . - public static final StructType SCHEMA = new StructType() - .add("path", DataTypes.StringType) - .add("modificationTime", DataTypes.TimestampType) - .add("length", DataTypes.LongType) - .add("content", DataTypes.BinaryType); - - private FileRowSchema() { - } - -} diff --git a/src/main/java/com/marklogic/spark/reader/file/FileScan.java b/src/main/java/com/marklogic/spark/reader/file/FileScan.java index 3b40bd6d..2bab9b11 100644 --- a/src/main/java/com/marklogic/spark/reader/file/FileScan.java +++ b/src/main/java/com/marklogic/spark/reader/file/FileScan.java @@ -1,5 +1,6 @@ package com.marklogic.spark.reader.file; +import com.marklogic.spark.reader.document.DocumentRowSchema; import org.apache.spark.sql.connector.read.Batch; import org.apache.spark.sql.connector.read.Scan; import org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex; @@ -19,7 +20,7 @@ class FileScan implements Scan { @Override public StructType readSchema() { - return FileRowSchema.SCHEMA; + return DocumentRowSchema.SCHEMA; } @Override diff --git a/src/main/java/com/marklogic/spark/reader/file/FileUtil.java b/src/main/java/com/marklogic/spark/reader/file/FileUtil.java index 18a52ef1..9dd5ccd0 100644 --- a/src/main/java/com/marklogic/spark/reader/file/FileUtil.java +++ b/src/main/java/com/marklogic/spark/reader/file/FileUtil.java @@ -3,19 +3,27 @@ import java.io.ByteArrayOutputStream; import java.io.IOException; import java.io.InputStream; +import java.util.ArrayList; +import java.util.List; import java.util.zip.ZipEntry; import java.util.zip.ZipInputStream; public interface FileUtil { + /** + * Does not handle file encoding - {@code FileContext} is expected to handle that as it has access to the + * user's options. 
+ * + * @param inputStream + * @return + * @throws IOException + */ static byte[] readBytes(InputStream inputStream) throws IOException { byte[] buffer = new byte[1024]; - int offset = 0; ByteArrayOutputStream baos = new ByteArrayOutputStream(); int read; while ((read = inputStream.read(buffer)) != -1) { - baos.write(buffer, offset, read); - offset += read; + baos.write(buffer, 0, read); } return baos.toByteArray(); } @@ -31,4 +39,26 @@ static ZipEntry findNextFileEntry(ZipInputStream zipInputStream) throws IOExcept } return !entry.isDirectory() ? entry : findNextFileEntry(zipInputStream); } + + static FilePartition[] makeFilePartitions(String[] files, int numPartitions) { + int filesPerPartition = (int) Math.ceil((double) files.length / (double) numPartitions); + if (files.length < numPartitions) { + numPartitions = files.length; + } + final FilePartition[] partitions = new FilePartition[numPartitions]; + List currentPartition = new ArrayList<>(); + int partitionIndex = 0; + for (int i = 0; i < files.length; i++) { + if (currentPartition.size() == filesPerPartition) { + partitions[partitionIndex] = new FilePartition(currentPartition); + partitionIndex++; + currentPartition = new ArrayList<>(); + } + currentPartition.add(files[i]); + } + if (!currentPartition.isEmpty()) { + partitions[partitionIndex] = new FilePartition(currentPartition); + } + return partitions; + } } diff --git a/src/main/java/com/marklogic/spark/reader/file/GZIPAggregateXMLFileReader.java b/src/main/java/com/marklogic/spark/reader/file/GZIPAggregateXMLFileReader.java deleted file mode 100644 index 8a993fea..00000000 --- a/src/main/java/com/marklogic/spark/reader/file/GZIPAggregateXMLFileReader.java +++ /dev/null @@ -1,25 +0,0 @@ -package com.marklogic.spark.reader.file; - -import org.apache.hadoop.fs.Path; -import org.apache.spark.util.SerializableConfiguration; - -import java.io.IOException; -import java.io.InputStream; -import java.util.Map; -import java.util.zip.GZIPInputStream; - -/** - * Functions the same as reading an aggregate XML file - it just needs to wrap the input stream in a - * {@code GZIPInputStream} first. 
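The new `makeFilePartitions` helper above spreads the input file paths across at most the requested number of partitions. A small usage sketch with invented paths, assuming it runs in the same package since `FilePartition` is package-private:

```
import java.util.Arrays;

public class FilePartitionSketch {

    public static void main(String[] args) {
        String[] files = new String[]{"a.xml", "b.xml", "c.xml", "d.xml", "e.xml"};

        // Five files spread across two partitions: the first gets three paths,
        // the second gets the remaining two.
        FilePartition[] partitions = FileUtil.makeFilePartitions(files, 2);
        System.out.println(Arrays.toString(partitions)); // [[a.xml, b.xml, c.xml], [d.xml, e.xml]]
    }
}
```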
- */ -class GZIPAggregateXMLFileReader extends AggregateXMLFileReader { - - GZIPAggregateXMLFileReader(FilePartition partition, Map properties, SerializableConfiguration hadoopConfiguration) { - super(partition, properties, hadoopConfiguration); - } - - @Override - protected InputStream makeInputStream(Path path, SerializableConfiguration hadoopConfiguration) throws IOException { - return new GZIPInputStream(path.getFileSystem(hadoopConfiguration.value()).open(path)); - } -} diff --git a/src/main/java/com/marklogic/spark/reader/file/GZIPFileReader.java b/src/main/java/com/marklogic/spark/reader/file/GZIPFileReader.java deleted file mode 100644 index af68a4f2..00000000 --- a/src/main/java/com/marklogic/spark/reader/file/GZIPFileReader.java +++ /dev/null @@ -1,89 +0,0 @@ -package com.marklogic.spark.reader.file; - -import com.marklogic.spark.ConnectorException; -import org.apache.commons.io.IOUtils; -import org.apache.hadoop.fs.FileSystem; -import org.apache.hadoop.fs.Path; -import org.apache.spark.sql.catalyst.InternalRow; -import org.apache.spark.sql.catalyst.expressions.GenericInternalRow; -import org.apache.spark.sql.connector.read.PartitionReader; -import org.apache.spark.unsafe.types.ByteArray; -import org.apache.spark.unsafe.types.UTF8String; -import org.apache.spark.util.SerializableConfiguration; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import java.io.IOException; -import java.util.zip.GZIPInputStream; - -/** - * Expects to read a single gzipped file and return a single row. May expand the scope of this later to expect multiple - * files and to thus return multiple rows. - */ -class GZIPFileReader implements PartitionReader { - - private static final Logger logger = LoggerFactory.getLogger(GZIPFileReader.class); - - private final String path; - private final SerializableConfiguration hadoopConfiguration; - private boolean fileHasBeenRead; - - GZIPFileReader(FilePartition partition, SerializableConfiguration hadoopConfiguration) { - this.path = partition.getPath(); - this.hadoopConfiguration = hadoopConfiguration; - - } - - @Override - public boolean next() { - return !fileHasBeenRead; - } - - @Override - public InternalRow get() { - GZIPInputStream gzipInputStream = openGZIPFile(); - byte[] content = extractGZIPContents(gzipInputStream); - IOUtils.closeQuietly(gzipInputStream); - this.fileHasBeenRead = true; - - String uri = makeURI(); - long length = content.length; - return new GenericInternalRow(new Object[]{ - UTF8String.fromString(uri), null, length, ByteArray.concat(content) - }); - } - - @Override - public void close() { - // Nothing to close. - } - - private GZIPInputStream openGZIPFile() { - try { - if (logger.isTraceEnabled()) { - logger.trace("Reading gzip file {}", this.path); - } - Path hadoopPath = new Path(this.path); - FileSystem fileSystem = hadoopPath.getFileSystem(hadoopConfiguration.value()); - return new GZIPInputStream(fileSystem.open(hadoopPath)); - } catch (IOException e) { - throw new ConnectorException(String.format("Unable to read gzip file at %s; cause: %s", this.path, e.getMessage()), e); - } - } - - private byte[] extractGZIPContents(GZIPInputStream gzipInputStream) { - try { - return FileUtil.readBytes(gzipInputStream); - } catch (IOException e) { - throw new ConnectorException(String.format("Unable to read from gzip file at %s; cause: %s", - this.path, e.getMessage()), e); - } - } - - private String makeURI() { - // Copied from MLCP. - return path.endsWith(".gzip") || path.endsWith(".gz") ? 
- path.substring(0, path.lastIndexOf(".")) : - path; - } -} diff --git a/src/main/java/com/marklogic/spark/reader/file/GenericFileReader.java b/src/main/java/com/marklogic/spark/reader/file/GenericFileReader.java new file mode 100644 index 00000000..c8849d5f --- /dev/null +++ b/src/main/java/com/marklogic/spark/reader/file/GenericFileReader.java @@ -0,0 +1,67 @@ +package com.marklogic.spark.reader.file; + +import com.marklogic.spark.ConnectorException; +import com.marklogic.spark.Util; +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.catalyst.expressions.GenericInternalRow; +import org.apache.spark.sql.connector.read.PartitionReader; +import org.apache.spark.unsafe.types.ByteArray; +import org.apache.spark.unsafe.types.UTF8String; + +import java.io.IOException; +import java.io.InputStream; + +/** + * "Generic" = read each file as-is with no special processing. + */ +class GenericFileReader implements PartitionReader { + + private final FilePartition filePartition; + private final FileContext fileContext; + + private InternalRow nextRowToReturn; + private int filePathIndex; + + GenericFileReader(FilePartition filePartition, FileContext fileContext) { + this.filePartition = filePartition; + this.fileContext = fileContext; + } + + @Override + public boolean next() { + if (filePathIndex >= filePartition.getPaths().size()) { + return false; + } + + final String path = filePartition.getPaths().get(filePathIndex); + filePathIndex++; + try { + try (InputStream inputStream = fileContext.openFile(path)) { + byte[] content = fileContext.readBytes(inputStream); + nextRowToReturn = new GenericInternalRow(new Object[]{ + UTF8String.fromString(path), + ByteArray.concat(content), + null, null, null, null, null, null + }); + } + } catch (Exception ex) { + String message = String.format("Unable to read file at %s; cause: %s", path, ex.getMessage()); + if (fileContext.isReadAbortOnFailure()) { + throw new ConnectorException(message, ex); + } + Util.MAIN_LOGGER.warn(message); + return next(); + } + return true; + } + + @Override + public InternalRow get() { + return nextRowToReturn; + } + + @Override + public void close() throws IOException { + // Nothing to close. + } +} diff --git a/src/main/java/com/marklogic/spark/reader/file/GzipFileReader.java b/src/main/java/com/marklogic/spark/reader/file/GzipFileReader.java new file mode 100644 index 00000000..dff15688 --- /dev/null +++ b/src/main/java/com/marklogic/spark/reader/file/GzipFileReader.java @@ -0,0 +1,86 @@ +package com.marklogic.spark.reader.file; + +import com.marklogic.spark.ConnectorException; +import com.marklogic.spark.Util; +import org.apache.commons.io.IOUtils; +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.catalyst.expressions.GenericInternalRow; +import org.apache.spark.sql.connector.read.PartitionReader; +import org.apache.spark.unsafe.types.ByteArray; +import org.apache.spark.unsafe.types.UTF8String; + +import java.io.IOException; +import java.io.InputStream; + +/** + * Expects to read a single gzipped file and return a single row. May expand the scope of this later to expect multiple + * files and to thus return multiple rows. 
+ */ +class GzipFileReader implements PartitionReader { + + private final FilePartition filePartition; + private final FileContext fileContext; + + private int nextFilePathIndex; + private InternalRow rowToReturn; + + GzipFileReader(FilePartition filePartition, FileContext fileContext) { + this.filePartition = filePartition; + this.fileContext = fileContext; + } + + @Override + public boolean next() { + if (nextFilePathIndex >= filePartition.getPaths().size()) { + return false; + } + + String currentFilePath = filePartition.getPaths().get(nextFilePathIndex); + nextFilePathIndex++; + InputStream gzipInputStream = null; + try { + gzipInputStream = fileContext.openFile(currentFilePath); + byte[] content = extractGZIPContents(currentFilePath, gzipInputStream); + String uri = makeURI(currentFilePath); + this.rowToReturn = new GenericInternalRow(new Object[]{ + UTF8String.fromString(uri), ByteArray.concat(content), + null, null, null, null, null, null + }); + return true; + } catch (RuntimeException ex) { + if (fileContext.isReadAbortOnFailure()) { + throw ex; + } + Util.MAIN_LOGGER.warn("Unable to read file at {}; cause: {}", currentFilePath, ex.getMessage()); + return next(); + } finally { + IOUtils.closeQuietly(gzipInputStream); + } + } + + @Override + public InternalRow get() { + return rowToReturn; + } + + @Override + public void close() { + // Nothing to close. + } + + private byte[] extractGZIPContents(String currentFilePath, InputStream gzipInputStream) { + try { + return fileContext.readBytes(gzipInputStream); + } catch (IOException e) { + throw new ConnectorException(String.format("Unable to read from gzip file at %s; cause: %s", + currentFilePath, e.getMessage()), e); + } + } + + private String makeURI(String path) { + // Copied from MLCP. + return path.endsWith(".gzip") || path.endsWith(".gz") ? 
+ path.substring(0, path.lastIndexOf(".")) : + path; + } +} diff --git a/src/main/java/com/marklogic/spark/reader/file/MlcpArchiveFileReader.java b/src/main/java/com/marklogic/spark/reader/file/MlcpArchiveFileReader.java new file mode 100644 index 00000000..66d0f9e8 --- /dev/null +++ b/src/main/java/com/marklogic/spark/reader/file/MlcpArchiveFileReader.java @@ -0,0 +1,219 @@ +package com.marklogic.spark.reader.file; + +import com.marklogic.spark.ConnectorException; +import com.marklogic.spark.Options; +import com.marklogic.spark.Util; +import com.marklogic.spark.reader.document.DocumentRowBuilder; +import org.apache.commons.io.IOUtils; +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.connector.read.PartitionReader; + +import java.io.ByteArrayInputStream; +import java.io.IOException; +import java.util.ArrayList; +import java.util.List; +import java.util.zip.ZipEntry; +import java.util.zip.ZipInputStream; + +class MlcpArchiveFileReader implements PartitionReader { + + private final FilePartition filePartition; + private final MlcpMetadataConverter mlcpMetadataConverter; + private final FileContext fileContext; + private final List metadataCategories; + + private String currentFilePath; + private ZipInputStream currentZipInputStream; + private int nextFilePathIndex; + private InternalRow nextRowToReturn; + + MlcpArchiveFileReader(FilePartition filePartition, FileContext fileContext) { + this.filePartition = filePartition; + this.fileContext = fileContext; + this.mlcpMetadataConverter = new MlcpMetadataConverter(); + + this.metadataCategories = new ArrayList<>(); + if (fileContext.hasOption(Options.READ_ARCHIVES_CATEGORIES)) { + for (String category : fileContext.getStringOption(Options.READ_ARCHIVES_CATEGORIES).split(",")) { + this.metadataCategories.add(category.toLowerCase()); + } + } + + openNextFile(); + } + + @Override + public boolean next() { + ZipEntry metadataZipEntry = getNextMetadataEntry(); + if (metadataZipEntry == null) { + return openNextFileAndReadNextEntry(); + } + + MlcpMetadata mlcpMetadata = readMetadataEntry(metadataZipEntry); + if (mlcpMetadata == null) { + return openNextFileAndReadNextEntry(); + } + + if (metadataZipEntry.getName().endsWith(".naked")) { + if (readNakedEntry(metadataZipEntry, mlcpMetadata)) { + return true; + } + return openNextFileAndReadNextEntry(); + } + + ZipEntry contentZipEntry = getContentEntry(metadataZipEntry); + if (contentZipEntry == null) { + return openNextFileAndReadNextEntry(); + } + + byte[] content = readBytesFromContentEntry(contentZipEntry); + if (content == null || content.length == 0) { + return openNextFileAndReadNextEntry(); + } + + try { + nextRowToReturn = makeRow(contentZipEntry, content, mlcpMetadata); + return true; + } catch (Exception ex) { + String message = String.format("Unable to process entry %s from zip file at %s; cause: %s", + contentZipEntry.getName(), this.currentFilePath, ex.getMessage()); + if (fileContext.isReadAbortOnFailure()) { + throw new ConnectorException(message, ex); + } + Util.MAIN_LOGGER.warn(message); + return openNextFileAndReadNextEntry(); + } + } + + @Override + public InternalRow get() { + return nextRowToReturn; + } + + @Override + public void close() { + IOUtils.closeQuietly(this.currentZipInputStream); + } + + private void openNextFile() { + this.currentFilePath = filePartition.getPaths().get(nextFilePathIndex); + nextFilePathIndex++; + this.currentZipInputStream = new ZipInputStream(fileContext.openFile(this.currentFilePath)); + } + + private boolean 
openNextFileAndReadNextEntry() { + close(); + if (nextFilePathIndex >= this.filePartition.getPaths().size()) { + return false; + } + openNextFile(); + return next(); + } + + private ZipEntry getNextMetadataEntry() { + // MLCP always includes a metadata entry, even if the user asks for no metadata. And the metadata entry is + // always first. + try { + return FileUtil.findNextFileEntry(currentZipInputStream); + } catch (IOException e) { + String message = String.format("Unable to read from zip file: %s; cause: %s", this.currentFilePath, e.getMessage()); + if (fileContext.isReadAbortOnFailure()) { + throw new ConnectorException(message, e); + } + Util.MAIN_LOGGER.warn(message); + return null; + } + } + + private MlcpMetadata readMetadataEntry(ZipEntry metadataZipEntry) { + try { + return this.mlcpMetadataConverter.convert(new ByteArrayInputStream(fileContext.readBytes(currentZipInputStream))); + } catch (Exception e) { + String message = String.format("Unable to read metadata for entry: %s; file: %s; cause: %s", + metadataZipEntry.getName(), this.currentFilePath, e.getMessage()); + if (fileContext.isReadAbortOnFailure()) { + throw new ConnectorException(message, e); + } + // Contrary to a zip of aggregate XML files, if we get a bad metadata file, we likely do not have a valid + // MLCP archive file and thus there's no reason to continue processing this particular file. + Util.MAIN_LOGGER.warn(message); + return null; + } + } + + private ZipEntry getContentEntry(ZipEntry metadataZipEntry) { + ZipEntry contentZipEntry; + try { + contentZipEntry = FileUtil.findNextFileEntry(currentZipInputStream); + } catch (IOException e) { + String message = String.format("Unable to read content entry from file: %s; cause: %s", this.currentFilePath, e.getMessage()); + if (fileContext.isReadAbortOnFailure()) { + throw new ConnectorException(message, e); + } + Util.MAIN_LOGGER.warn(message); + return null; + } + + if (contentZipEntry == null) { + String message = String.format("No content entry found for metadata entry: %s; file: %s", metadataZipEntry.getName(), this.currentFilePath); + if (fileContext.isReadAbortOnFailure()) { + throw new ConnectorException(message); + } + Util.MAIN_LOGGER.warn(message); + return null; + } + return contentZipEntry; + } + + private byte[] readBytesFromContentEntry(ZipEntry contentZipEntry) { + try { + return fileContext.readBytes(currentZipInputStream); + } catch (IOException e) { + String message = String.format("Unable to read entry %s from zip file at %s; cause: %s", + contentZipEntry.getName(), this.currentFilePath, e.getMessage()); + if (fileContext.isReadAbortOnFailure()) { + throw new ConnectorException(message, e); + } + Util.MAIN_LOGGER.warn(message); + return new byte[0]; + } + } + + /** + * A "naked" entry refers to a properties fragment with no associate document. MLCP supports exporting these, and + * so we need to support reading them in from an MLCP archive file. 
+ */ + private boolean readNakedEntry(ZipEntry metadataZipEntry, MlcpMetadata mlcpMetadata) { + try { + nextRowToReturn = makeNakedRow(metadataZipEntry, mlcpMetadata); + return true; + } catch (Exception ex) { + String message = String.format("Unable to process entry %s from zip file at %s; cause: %s", + metadataZipEntry.getName(), this.currentFilePath, ex.getMessage()); + if (fileContext.isReadAbortOnFailure()) { + throw new ConnectorException(message, ex); + } + Util.MAIN_LOGGER.warn(message); + return false; + } + } + + private InternalRow makeNakedRow(ZipEntry metadataZipEntry, MlcpMetadata mlcpMetadata) { + return new DocumentRowBuilder(metadataCategories) + .withUri(metadataZipEntry.getName()) + .withMetadata(mlcpMetadata.getMetadata()) + .buildRow(); + } + + private InternalRow makeRow(ZipEntry contentZipEntry, byte[] content, MlcpMetadata mlcpMetadata) { + DocumentRowBuilder rowBuilder = new DocumentRowBuilder(metadataCategories) + .withUri(contentZipEntry.getName()) + .withContent(content) + .withMetadata(mlcpMetadata.getMetadata()); + + if (mlcpMetadata.getFormat() != null) { + rowBuilder.withFormat(mlcpMetadata.getFormat().name()); + } + return rowBuilder.buildRow(); + } +} diff --git a/src/main/java/com/marklogic/spark/reader/file/MlcpMetadata.java b/src/main/java/com/marklogic/spark/reader/file/MlcpMetadata.java new file mode 100644 index 00000000..3d7405f0 --- /dev/null +++ b/src/main/java/com/marklogic/spark/reader/file/MlcpMetadata.java @@ -0,0 +1,26 @@ +package com.marklogic.spark.reader.file; + +import com.marklogic.client.io.DocumentMetadataHandle; +import com.marklogic.client.io.Format; + +/** + * Captures all the metadata, including document format, from an XML metadata entry in an MLCP archive file. + */ +class MlcpMetadata { + + private DocumentMetadataHandle metadata; + private Format format; + + MlcpMetadata(DocumentMetadataHandle metadata, Format format) { + this.metadata = metadata; + this.format = format; + } + + DocumentMetadataHandle getMetadata() { + return metadata; + } + + Format getFormat() { + return format; + } +} diff --git a/src/main/java/com/marklogic/spark/reader/file/MlcpMetadataConverter.java b/src/main/java/com/marklogic/spark/reader/file/MlcpMetadataConverter.java new file mode 100644 index 00000000..05631957 --- /dev/null +++ b/src/main/java/com/marklogic/spark/reader/file/MlcpMetadataConverter.java @@ -0,0 +1,162 @@ +package com.marklogic.spark.reader.file; + +import com.marklogic.client.io.DocumentMetadataHandle; +import com.marklogic.client.io.Format; +import org.jdom2.Document; +import org.jdom2.Element; +import org.jdom2.JDOMException; +import org.jdom2.Namespace; +import org.jdom2.input.SAXBuilder; + +import java.io.IOException; +import java.io.InputStream; +import java.io.StringReader; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +/** + * Handles converting an MLCP metadata document, generated when creating an MLCP archive, into a + * {@code DocumentMetadataHandle} instance, thus allowing it to be reused with the REST API. The MLCP metadata + * document is expected to have a root element of "com.marklogic.contentpump.DocumentMetadata". 
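+ * The converter only reads the child elements it recognizes - "properties", "collectionsList", "quality", "meta",
+ * "format", "permString", and "permissionsList" - and silently skips any that are absent. A minimal usage sketch,
+ * where metadataInputStream is an illustrative InputStream over a metadata entry read from the archive:
+ *   MlcpMetadata mlcpMetadata = new MlcpMetadataConverter().convert(metadataInputStream);
+ *   DocumentMetadataHandle restMetadata = mlcpMetadata.getMetadata();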
+ */ +class MlcpMetadataConverter { + + private static final Namespace SECURITY_NAMESPACE = Namespace.getNamespace("sec", "http://marklogic.com/xdmp/security"); + private final SAXBuilder saxBuilder; + + MlcpMetadataConverter() { + this.saxBuilder = new SAXBuilder(); + } + + MlcpMetadata convert(InputStream inputStream) throws JDOMException, IOException { + Document doc = this.saxBuilder.build(inputStream); + Element mlcpMetadata = doc.getRootElement(); + Element properties = mlcpMetadata.getChild("properties"); + + DocumentMetadataHandle restMetadata = properties != null ? + newMetadataWithProperties(properties.getText()) : + new DocumentMetadataHandle(); + + addCollections(mlcpMetadata, restMetadata); + addPermissions(mlcpMetadata, restMetadata); + addQuality(mlcpMetadata, restMetadata); + addMetadataValues(mlcpMetadata, restMetadata); + + Format javaFormat = getFormat(mlcpMetadata); + return new MlcpMetadata(restMetadata, javaFormat); + } + + private Format getFormat(Element mlcpMetadata) { + Element format = mlcpMetadata.getChild("format"); + if (format != null && format.getChild("name") != null) { + String value = format.getChildText("name"); + // MLCP uses "text()" for an unknown reason. + if (value.startsWith("text")) { + value = "text"; + } + return Format.valueOf(value.toUpperCase()); + } + return null; + } + + /** + * This allows for the logic in DocumentMetadataHandle for parsing the properties fragment to be reused. + * + * @param propertiesXml + * @return + */ + private DocumentMetadataHandle newMetadataWithProperties(String propertiesXml) { + String restXml = String.format("%s", propertiesXml); + DocumentMetadataHandle metadata = new DocumentMetadataHandle(); + metadata.fromBuffer(restXml.getBytes()); + return metadata; + } + + private void addCollections(Element mlcpMetadata, DocumentMetadataHandle restMetadata) { + Element collections = mlcpMetadata.getChild("collectionsList"); + if (collections != null) { + for (Element string : collections.getChildren("string")) { + restMetadata.getCollections().add(string.getText()); + } + } + } + + private void addQuality(Element mlcpMetadata, DocumentMetadataHandle restMetadata) { + Element quality = mlcpMetadata.getChild("quality"); + if (quality != null) { + restMetadata.setQuality(Integer.parseInt(quality.getText())); + } + } + + private void addMetadataValues(Element mlcpMetadata, DocumentMetadataHandle restMetadata) { + Element collections = mlcpMetadata.getChild("meta"); + if (collections != null) { + for (Element entry : collections.getChildren("entry")) { + List strings = entry.getChildren("string"); + String key = strings.get(0).getText(); + String value = strings.get(1).getText(); + restMetadata.getMetadataValues().put(key, value); + } + } + } + + /** + * The "permissionsList" element can contain permissions that have references to other permissions, where it's + * not clear which permission it's referring to. For example: + *
+ * (Example omitted: a "permissionsList" of com.marklogic.xcc.ContentPermission elements that reference one another.)
+ *
+ * It this does not seem reliably to determine the capability of each permission based on each + * com.marklogic.xcc.ContentPermission element. But each such element does associate a roleId and a roleName + * together. The value of the "permString" element, which associates roleIds and capabilities, can then be used as + * a reliable source of permission information, with each roleId being bounced against a map of roleId -> roleName. + * + * @param mlcpMetadata + * @param restMetadata + * @throws IOException + * @throws JDOMException + */ + private void addPermissions(Element mlcpMetadata, DocumentMetadataHandle restMetadata) throws IOException, JDOMException { + Element permString = mlcpMetadata.getChild("permString"); + if (permString == null) { + return; + } + + Element permissionsList = mlcpMetadata.getChild("permissionsList"); + if (permissionsList == null) { + return; + } + + Map roleIdsToNames = buildRoleMap(permissionsList); + + Element perms = this.saxBuilder.build(new StringReader(permString.getText())).getRootElement(); + for (Element perm : perms.getChildren("permission", SECURITY_NAMESPACE)) { + String capability = perm.getChildText("capability", SECURITY_NAMESPACE); + DocumentMetadataHandle.Capability cap = DocumentMetadataHandle.Capability.valueOf(capability.toUpperCase()); + String roleId = perm.getChildText("role-id", SECURITY_NAMESPACE); + String roleName = roleIdsToNames.get(roleId); + if (restMetadata.getPermissions().containsKey(roleName)) { + restMetadata.getPermissions().get(roleName).add(cap); + } else { + restMetadata.getPermissions().add(roleName, cap); + } + } + } + + /** + * @param permissionsList + * @return a map of roleId -> roleName based on the role and role elements in each + * com.marklogic.xcc.ContentPermission element. + */ + private Map buildRoleMap(Element permissionsList) { + Map roleIdsToNames = new HashMap<>(); + for (Element permission : permissionsList.getChildren("com.marklogic.xcc.ContentPermission")) { + String role = permission.getChildText("role"); + String roleId = permission.getChildText("roleId"); + roleIdsToNames.put(roleId, role); + } + return roleIdsToNames; + } +} diff --git a/src/main/java/com/marklogic/spark/reader/file/QuadStreamReader.java b/src/main/java/com/marklogic/spark/reader/file/QuadStreamReader.java new file mode 100644 index 00000000..6ec59f7c --- /dev/null +++ b/src/main/java/com/marklogic/spark/reader/file/QuadStreamReader.java @@ -0,0 +1,46 @@ +package com.marklogic.spark.reader.file; + +import com.marklogic.spark.ConnectorException; +import org.apache.jena.riot.RDFParserBuilder; +import org.apache.jena.riot.RiotException; +import org.apache.jena.riot.system.AsyncParser; +import org.apache.jena.sparql.core.Quad; +import org.apache.spark.sql.catalyst.InternalRow; + +import java.io.IOException; +import java.io.InputStream; +import java.util.Iterator; + +/** + * Knows how to convert a stream of Jena Quad objects into Spark rows. 
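+ * Each quad is serialized via RdfSerializer into the six columns of TripleRowSchema - subject, predicate, object,
+ * datatype, lang, and graph - with the graph column falling back to MarkLogic's default graph
+ * (http://marklogic.com/semantics#default-graph) when the quad does not specify one.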
+ */ +class QuadStreamReader implements RdfStreamReader { + + private final String path; + private final Iterator quadStream; + private final RdfSerializer rdfSerializer = new RdfSerializer(); + + QuadStreamReader(String path, InputStream inputStream) { + this.path = path; + this.quadStream = AsyncParser.of(RDFParserBuilder.create() + .source(inputStream) + .lang(RdfUtil.getQuadsLang(path)) + .errorHandler(new RdfErrorHandler(path)) + .base(path) + ).streamQuads().iterator(); + } + + @Override + public boolean hasNext() throws IOException { + try { + return this.quadStream.hasNext(); + } catch (RiotException e) { + throw new ConnectorException(String.format("Unable to read %s; cause: %s", this.path, e.getMessage()), e); + } + } + + @Override + public InternalRow get() { + return rdfSerializer.serialize(this.quadStream.next()); + } +} diff --git a/src/main/java/com/marklogic/spark/reader/file/RdfErrorHandler.java b/src/main/java/com/marklogic/spark/reader/file/RdfErrorHandler.java new file mode 100644 index 00000000..cf0062d3 --- /dev/null +++ b/src/main/java/com/marklogic/spark/reader/file/RdfErrorHandler.java @@ -0,0 +1,63 @@ +package com.marklogic.spark.reader.file; + +import com.marklogic.spark.Util; +import org.apache.jena.riot.system.ErrorHandler; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * Copied from MLCP. Once RDF writing is supported, need to test via the files referenced in + * https://progresssoftware.atlassian.net/browse/MLE-2133 . + */ +class RdfErrorHandler implements ErrorHandler { + + private static final Logger logger = LoggerFactory.getLogger(RdfErrorHandler.class); + + private final String path; + + RdfErrorHandler(String path) { + this.path = path; + } + + @Override + public void warning(String message, long line, long col) { + boolean isDebugMessage = message.contains("Bad IRI:") || message.contains("Illegal character in IRI") || message.contains("Not advised IRI"); + if (isDebugMessage) { + if (logger.isDebugEnabled()) { + logger.debug(formatMessage(message, line, col)); + } + } else if (Util.MAIN_LOGGER.isWarnEnabled()) { + Util.MAIN_LOGGER.warn(formatMessage(message, line, col)); + } + } + + @Override + public void error(String message, long line, long col) { + boolean isDebugMessage = message.contains("Bad character in IRI") || message.contains("Problem setting StAX property"); + if (isDebugMessage) { + if (logger.isDebugEnabled()) { + logger.debug(formatMessage(message, line, col)); + } + } else if (Util.MAIN_LOGGER.isErrorEnabled()) { + Util.MAIN_LOGGER.error(formatMessage(message, line, col)); + } + } + + @Override + public void fatal(String message, long line, long col) { + if (Util.MAIN_LOGGER.isErrorEnabled()) { + Util.MAIN_LOGGER.error(formatMessage(message, line, col)); + } + } + + private String formatMessage(String message, long line, long col) { + String preamble = this.path + ":"; + if (line >= 0) { + preamble += line; + if (col >= 0) { + preamble += ":" + col; + } + } + return preamble + " " + message; + } +} diff --git a/src/main/java/com/marklogic/spark/reader/file/RdfFileReader.java b/src/main/java/com/marklogic/spark/reader/file/RdfFileReader.java new file mode 100644 index 00000000..280424ea --- /dev/null +++ b/src/main/java/com/marklogic/spark/reader/file/RdfFileReader.java @@ -0,0 +1,96 @@ +package com.marklogic.spark.reader.file; + +import com.marklogic.spark.ConnectorException; +import com.marklogic.spark.Util; +import org.apache.commons.io.IOUtils; +import org.apache.spark.sql.catalyst.InternalRow; +import 
org.apache.spark.sql.connector.read.PartitionReader; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; +import java.io.InputStream; + +/** + * Reduces some duplication across RdfFileReader and QuadsFileReader. + */ +class RdfFileReader implements PartitionReader { + + private final Logger logger = LoggerFactory.getLogger(getClass()); + + private final FilePartition filePartition; + private final FileContext fileContext; + + private String currentFilePath; + private RdfStreamReader currentRdfStreamReader; + private InputStream currentInputStream; + private int nextFilePathIndex; + + RdfFileReader(FilePartition filePartition, FileContext fileContext) { + this.filePartition = filePartition; + this.fileContext = fileContext; + } + + @Override + public boolean next() { + try { + if (currentRdfStreamReader == null) { + if (nextFilePathIndex >= this.filePartition.getPaths().size()) { + return false; + } + if (!initializeRdfStreamReader()) { + return next(); + } + } + if (currentRdfStreamReader.hasNext()) { + return true; + } + currentRdfStreamReader = null; + return next(); + } catch (ConnectorException ex) { + if (fileContext.isReadAbortOnFailure()) { + throw ex; + } + Util.MAIN_LOGGER.warn(ex.getMessage()); + return next(); + } catch (Exception ex) { + if (fileContext.isReadAbortOnFailure()) { + throw new ConnectorException(String.format("Unable to process RDF file %s, cause: %s", currentFilePath, ex.getMessage()), ex); + } + Util.MAIN_LOGGER.warn(ex.getMessage()); + return next(); + } + } + + @Override + public InternalRow get() { + return currentRdfStreamReader.get(); + } + + @Override + public void close() throws IOException { + IOUtils.closeQuietly(this.currentInputStream); + } + + private boolean initializeRdfStreamReader() { + this.currentFilePath = this.filePartition.getPaths().get(nextFilePathIndex); + if (logger.isDebugEnabled()) { + logger.debug("Reading file {}", this.currentFilePath); + } + try { + this.currentInputStream = fileContext.openFile(this.currentFilePath); + this.currentRdfStreamReader = RdfUtil.isQuadsFile(this.currentFilePath) ? + new QuadStreamReader(this.currentFilePath, this.currentInputStream) : + new TripleStreamReader(this.currentFilePath, this.currentInputStream); + this.nextFilePathIndex++; + return true; + } catch (Exception e) { + if (fileContext.isReadAbortOnFailure()) { + throw new ConnectorException(String.format( + "Unable to read file at %s; cause: %s", this.currentFilePath, e.getMessage()), e); + } + Util.MAIN_LOGGER.warn("Unable to read file at {}; cause: {}", this.currentFilePath, e.getMessage()); + return false; + } + } +} diff --git a/src/main/java/com/marklogic/spark/reader/file/RdfSerializer.java b/src/main/java/com/marklogic/spark/reader/file/RdfSerializer.java new file mode 100644 index 00000000..8825f706 --- /dev/null +++ b/src/main/java/com/marklogic/spark/reader/file/RdfSerializer.java @@ -0,0 +1,120 @@ +package com.marklogic.spark.reader.file; + +import org.apache.jena.graph.Node; +import org.apache.jena.graph.Triple; +import org.apache.jena.sparql.core.Quad; +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.catalyst.expressions.GenericInternalRow; +import org.apache.spark.unsafe.types.UTF8String; + +import java.util.Random; + +/** + * Captures the logic from Content Pump for serializing a Jena Triple or Quad into a string representation. 
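+ * For a literal object, the datatype and lang columns are populated as well: a literal with a language tag is
+ * given the datatype http://www.w3.org/1999/02/22-rdf-syntax-ns#langString, while an untyped literal without a
+ * language tag defaults to http://www.w3.org/2001/XMLSchema#string.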
+ */ +class RdfSerializer { + + private static final String DEFAULT_GRAPH = "http://marklogic.com/semantics#default-graph"; + + // These are both used in the MLCP-specific code below for generating a "blank" value. + private static final long HASH64_STEP = 15485863L; + + // Sonar is suspicious about the use of Random, the actual random number doesn't matter for any functionality. + @SuppressWarnings("java:S2245") + private final Random random = new Random(); + + InternalRow serialize(Triple triple) { + String[] objectValues = serializeObject(triple.getObject()); + return new GenericInternalRow(new Object[]{ + UTF8String.fromString(serialize(triple.getSubject())), + UTF8String.fromString(serialize(triple.getPredicate())), + UTF8String.fromString(objectValues[0]), + objectValues[1] != null ? UTF8String.fromString(objectValues[1]) : null, + objectValues[2] != null ? UTF8String.fromString(objectValues[2]) : null, + null + }); + } + + InternalRow serialize(Quad quad) { + String[] objectValues = serializeObject(quad.getObject()); + return new GenericInternalRow(new Object[]{ + UTF8String.fromString(serialize(quad.getSubject())), + UTF8String.fromString(serialize(quad.getPredicate())), + UTF8String.fromString(objectValues[0]), + objectValues[1] != null ? UTF8String.fromString(objectValues[1]) : null, + objectValues[2] != null ? UTF8String.fromString(objectValues[2]) : null, + quad.getGraph() != null ? + UTF8String.fromString(serialize(quad.getGraph())) : + UTF8String.fromString(DEFAULT_GRAPH) + }); + } + + private String serialize(Node node) { + return node.isBlank() ? generateBlankValue(node) : node.toString(); + } + + /** + * @param triple + * @return an array containing a string serialization of the object; an optional datatype; and an optional "lang" value. + */ + private String[] serializeObject(Node object) { + if (object.isLiteral()) { + String type = object.getLiteralDatatypeURI(); + String lang = object.getLiteralLanguage(); + if ("".equals(lang)) { + lang = null; + } + if (lang != null && lang.trim().length() > 0) { + // MarkLogic uses this datatype when a string has a lang associated with it. + type = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"; + } else if ("".equals(lang) || lang == null) { + if (type == null) { + type = "http://www.w3.org/2001/XMLSchema#string"; + } + } else { + type = null; + } + return new String[]{object.getLiteralLexicalForm(), type, lang}; + } else if (object.isBlank()) { + return new String[]{generateBlankValue(object), null, null}; + } else { + return new String[]{object.toString(), null, null}; + } + } + + /** + * Reuses copy/pasted code from the MLCP codebase for generating a blank value for a "blank node" - see + * https://en.wikipedia.org/wiki/Blank_node for more details. It is not known why a UUID isn't used. 
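+ * The generated value has the form "http://marklogic.com/semantics/blank/" followed by a hex string derived from
+ * the blank node label, the current time, and a random long.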
+ * + * @return + */ + private String generateBlankValue(Node blankNode) { + String value = Long.toHexString( + hash64( + fuse(scramble(System.currentTimeMillis()), random.nextLong()), + blankNode.getBlankNodeLabel() + ) + ); + return "http://marklogic.com/semantics/blank/" + value; + } + + private long hash64(long value, String str) { + char[] arr = str.toCharArray(); + for (int i = 0; i < str.length(); i++) { + value = (value + Character.getNumericValue(arr[i])) * HASH64_STEP; + } + return value; + } + + private long fuse(long a, long b) { + return rotl(a, 8) ^ b; + } + + private long scramble(long x) { + return x ^ rotl(x, 20) ^ rotl(x, 40); + } + + private long rotl(long x, long y) { + return (x << y) ^ (x >> (64 - y)); + } +} diff --git a/src/main/java/com/marklogic/spark/reader/file/RdfStreamReader.java b/src/main/java/com/marklogic/spark/reader/file/RdfStreamReader.java new file mode 100644 index 00000000..79c13d02 --- /dev/null +++ b/src/main/java/com/marklogic/spark/reader/file/RdfStreamReader.java @@ -0,0 +1,16 @@ +package com.marklogic.spark.reader.file; + +import org.apache.spark.sql.catalyst.InternalRow; + +import java.io.IOException; + +/** + * Allows the logic for reading Jena quads and triples as Spark rows to be easily reused without being tied to a + * specific Spark partition reader. + */ +public interface RdfStreamReader { + + boolean hasNext() throws IOException; + + InternalRow get(); +} diff --git a/src/main/java/com/marklogic/spark/reader/file/RdfUtil.java b/src/main/java/com/marklogic/spark/reader/file/RdfUtil.java new file mode 100644 index 00000000..714b116e --- /dev/null +++ b/src/main/java/com/marklogic/spark/reader/file/RdfUtil.java @@ -0,0 +1,29 @@ +package com.marklogic.spark.reader.file; + +import org.apache.jena.riot.Lang; + +public interface RdfUtil { + + static boolean isQuadsFile(String filename) { + return isTrigFile(filename) || isTrixFile(filename) || + filename.endsWith(".nq") || + filename.endsWith(".nq.gz"); + } + + static Lang getQuadsLang(String filename) { + if (isTrigFile(filename)) { + return Lang.TRIG; + } else if (isTrixFile(filename)) { + return Lang.TRIX; + } + return Lang.NQ; + } + + private static boolean isTrigFile(String filename) { + return filename.endsWith(".trig") || filename.endsWith(".trig.gz"); + } + + private static boolean isTrixFile(String filename) { + return filename.endsWith(".trix") || filename.endsWith(".trix.gz"); + } +} diff --git a/src/main/java/com/marklogic/spark/reader/file/RdfZipFileReader.java b/src/main/java/com/marklogic/spark/reader/file/RdfZipFileReader.java new file mode 100644 index 00000000..7adace25 --- /dev/null +++ b/src/main/java/com/marklogic/spark/reader/file/RdfZipFileReader.java @@ -0,0 +1,117 @@ +package com.marklogic.spark.reader.file; + +import com.marklogic.spark.ConnectorException; +import com.marklogic.spark.Util; +import org.apache.commons.io.IOUtils; +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.connector.read.PartitionReader; +import org.jetbrains.annotations.NotNull; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; +import java.io.InputStream; +import java.util.zip.ZipEntry; +import java.util.zip.ZipInputStream; + +class RdfZipFileReader implements PartitionReader { + + private static final Logger logger = LoggerFactory.getLogger(RdfZipFileReader.class); + + private final FilePartition filePartition; + private final FileContext fileContext; + + private String currentFilePath; + private CustomZipInputStream 
currentZipInputStream; + private RdfStreamReader currentRdfStreamReader; + private int nextFilePathIndex; + + RdfZipFileReader(FilePartition filePartition, FileContext fileContext) { + this.filePartition = filePartition; + this.fileContext = fileContext; + } + + @Override + public boolean next() { + try { + // If we've got a reader and a row, we're good to go. + if (currentRdfStreamReader != null && currentRdfStreamReader.hasNext()) { + return true; + } + + // In case this was not null but just out of entries, set it to null. + currentRdfStreamReader = null; + + // If we have a zip open, look for the next RDF entry to process. + if (currentZipInputStream != null) { + ZipEntry zipEntry = FileUtil.findNextFileEntry(currentZipInputStream); + if (zipEntry == null) { + // If the zip is empty, go to the next file. + currentZipInputStream = null; + return next(); + } + if (logger.isTraceEnabled()) { + logger.trace("Reading entry {} in {}", zipEntry.getName(), this.currentFilePath); + } + this.currentRdfStreamReader = RdfUtil.isQuadsFile(zipEntry.getName()) ? + new QuadStreamReader(zipEntry.getName(), currentZipInputStream) : + new TripleStreamReader(zipEntry.getName(), currentZipInputStream); + return next(); + } + + // If we get here, it's time for the next file, if one exists. + if (nextFilePathIndex >= filePartition.getPaths().size()) { + return false; + } + + // Open up the next zip. + this.currentFilePath = filePartition.getPaths().get(nextFilePathIndex); + nextFilePathIndex++; + this.currentZipInputStream = new CustomZipInputStream(fileContext.openFile(currentFilePath)); + return next(); + } catch (Exception ex) { + if (fileContext.isReadAbortOnFailure()) { + throw new ConnectorException(String.format("Unable to process zip file at %s, cause: %s", currentFilePath, ex.getMessage()), ex); + } + Util.MAIN_LOGGER.warn(ex.getMessage()); + return next(); + } + } + + @Override + public InternalRow get() { + return this.currentRdfStreamReader.get(); + } + + @Override + public void close() throws IOException { + if (this.currentZipInputStream != null) { + this.currentZipInputStream.readyToClose = true; + IOUtils.closeQuietly(this.currentZipInputStream); + } + } + + /** + * Per https://jena.apache.org/documentation/io/rdf-input.html#iterating-over-parser-output , Jena will call + * close() on an iterator after reading all the triples/quads. This results in the ZipInputStream being closed, + * which prevents any additional entries from being read. + *

+ * We know we only want to close the stream when Spark calls close() or abort(). So this modifies the close() method + * to only close after a boolean has been flipped to true. + */ + private static class CustomZipInputStream extends ZipInputStream { + + private boolean readyToClose = false; + + public CustomZipInputStream(@NotNull InputStream in) { + super(in); + } + + @Override + public void close() throws IOException { + if (readyToClose) { + super.close(); + } + } + } +} diff --git a/src/main/java/com/marklogic/spark/reader/file/TripleRowSchema.java b/src/main/java/com/marklogic/spark/reader/file/TripleRowSchema.java new file mode 100644 index 00000000..186009d5 --- /dev/null +++ b/src/main/java/com/marklogic/spark/reader/file/TripleRowSchema.java @@ -0,0 +1,22 @@ +package com.marklogic.spark.reader.file; + +import org.apache.spark.sql.types.DataTypes; +import org.apache.spark.sql.types.StructType; + +/** + * Represents a triple as read from an RDF file and serialized into the 3 XML elements comprising + * a MarkLogic triple. + */ +public abstract class TripleRowSchema { + + public static final StructType SCHEMA = new StructType() + .add("subject", DataTypes.StringType) + .add("predicate", DataTypes.StringType) + .add("object", DataTypes.StringType) + .add("datatype", DataTypes.StringType) + .add("lang", DataTypes.StringType) + .add("graph", DataTypes.StringType); + + private TripleRowSchema() { + } +} diff --git a/src/main/java/com/marklogic/spark/reader/file/TripleStreamReader.java b/src/main/java/com/marklogic/spark/reader/file/TripleStreamReader.java new file mode 100644 index 00000000..e2966b56 --- /dev/null +++ b/src/main/java/com/marklogic/spark/reader/file/TripleStreamReader.java @@ -0,0 +1,73 @@ +package com.marklogic.spark.reader.file; + +import com.marklogic.spark.ConnectorException; +import org.apache.jena.graph.Triple; +import org.apache.jena.riot.Lang; +import org.apache.jena.riot.RDFParserBuilder; +import org.apache.jena.riot.RiotException; +import org.apache.jena.riot.system.AsyncParser; +import org.apache.spark.sql.catalyst.InternalRow; + +import java.io.IOException; +import java.io.InputStream; +import java.util.Iterator; + +/** + * Knows how to convert a stream of Jena Triple objects into Spark rows. + */ +class TripleStreamReader implements RdfStreamReader { + + private final String path; + private final Iterator tripleStream; + private final RdfSerializer rdfSerializer = new RdfSerializer(); + + TripleStreamReader(String path, InputStream inputStream) { + this.path = path; + RDFParserBuilder parserBuilder = RDFParserBuilder.create() + .source(inputStream) + .errorHandler(new RdfErrorHandler(path)) + .lang(determineLang(path)) + .base(path); + this.tripleStream = AsyncParser.of(parserBuilder).streamTriples().iterator(); + } + + @Override + public boolean hasNext() throws IOException { + try { + return this.tripleStream.hasNext(); + } catch (RiotException e) { + if (e.getMessage().contains("Failed to determine the RDF syntax")) { + throw new ConnectorException(String.format("Unable to read file at %s; RDF syntax is not supported or " + + "the file extension is not recognized.", this.path), e); + } + throw new ConnectorException(String.format("Unable to read %s; cause: %s", this.path, e.getMessage()), e); + } + } + + @Override + public InternalRow get() { + // Per the Jena javadocs, next() is not expected to throw an error; if an were to occur, it would occur when + // calling hasNext() first. 
+ return rdfSerializer.serialize(this.tripleStream.next()); + } + + /** + * This is only defining extensions that Jena does not appear to recognize. Testing has shown that providing the + * file path for the Jena {@code base} method will work for all the file types we support - except for RDF JSON, + * we need to Jena that ".json" maps to RDF JSON. + * + * @param path + * @return + */ + private Lang determineLang(String path) { + if (path.endsWith(".json") || path.endsWith(".json.gz")) { + return Lang.RDFJSON; + } else if (path.endsWith(".thrift")) { + return Lang.RDFTHRIFT; + } else if (path.endsWith(".binpb")) { + // See https://protobuf.dev/programming-guides/techniques/#suffixes . + return Lang.RDFPROTO; + } + return null; + } +} diff --git a/src/main/java/com/marklogic/spark/reader/file/ZipAggregateXMLFileReader.java b/src/main/java/com/marklogic/spark/reader/file/ZipAggregateXMLFileReader.java deleted file mode 100644 index 39a7154e..00000000 --- a/src/main/java/com/marklogic/spark/reader/file/ZipAggregateXMLFileReader.java +++ /dev/null @@ -1,76 +0,0 @@ -package com.marklogic.spark.reader.file; - -import com.marklogic.spark.ConnectorException; -import org.apache.commons.io.IOUtils; -import org.apache.hadoop.fs.FileSystem; -import org.apache.hadoop.fs.Path; -import org.apache.spark.sql.catalyst.InternalRow; -import org.apache.spark.sql.connector.read.PartitionReader; -import org.apache.spark.util.SerializableConfiguration; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import java.io.IOException; -import java.util.Map; -import java.util.zip.ZipEntry; -import java.util.zip.ZipInputStream; - -class ZipAggregateXMLFileReader implements PartitionReader { - - private static final Logger logger = LoggerFactory.getLogger(ZipAggregateXMLFileReader.class); - - private final Map properties; - private final ZipInputStream zipInputStream; - private final String path; - - private AggregateXMLSplitter aggregateXMLSplitter; - - // Used solely for a default URI prefix. - private int entryCounter; - - ZipAggregateXMLFileReader(FilePartition partition, Map properties, SerializableConfiguration hadoopConfiguration) { - this.properties = properties; - this.path = partition.getPath(); - if (logger.isTraceEnabled()) { - logger.trace("Reading path: {}", this.path); - } - try { - Path hadoopPath = new Path(partition.getPath()); - FileSystem fileSystem = hadoopPath.getFileSystem(hadoopConfiguration.value()); - this.zipInputStream = new ZipInputStream(fileSystem.open(hadoopPath)); - } catch (Exception e) { - throw new ConnectorException(String.format("Unable to read %s; cause: %s", this.path, e.getMessage()), e); - } - } - - @Override - public boolean next() throws IOException { - if (aggregateXMLSplitter != null && aggregateXMLSplitter.hasNext()) { - return true; - } - - // Once we no longer have any valid zip entries, we're done. 
- ZipEntry zipEntry = FileUtil.findNextFileEntry(zipInputStream); - if (zipEntry == null) { - return false; - } - - if (logger.isTraceEnabled()) { - logger.trace("Reading entry {} in {}", zipEntry.getName(), this.path); - } - entryCounter++; - String identifierForError = "entry " + zipEntry.getName() + " in " + this.path; - aggregateXMLSplitter = new AggregateXMLSplitter(identifierForError, this.zipInputStream, properties); - return true; - } - - @Override - public InternalRow get() { - return this.aggregateXMLSplitter.nextRow(this.path + "-" + entryCounter); - } - - @Override - public void close() throws IOException { - IOUtils.closeQuietly(this.zipInputStream); - } -} diff --git a/src/main/java/com/marklogic/spark/reader/file/ZipAggregateXmlFileReader.java b/src/main/java/com/marklogic/spark/reader/file/ZipAggregateXmlFileReader.java new file mode 100644 index 00000000..421803dc --- /dev/null +++ b/src/main/java/com/marklogic/spark/reader/file/ZipAggregateXmlFileReader.java @@ -0,0 +1,159 @@ +package com.marklogic.spark.reader.file; + +import com.marklogic.spark.ConnectorException; +import com.marklogic.spark.Util; +import org.apache.commons.io.IOUtils; +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.connector.read.PartitionReader; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; +import java.util.zip.ZipEntry; +import java.util.zip.ZipInputStream; + +class ZipAggregateXmlFileReader implements PartitionReader { + + private static final Logger logger = LoggerFactory.getLogger(ZipAggregateXmlFileReader.class); + + private final FileContext fileContext; + private final FilePartition filePartition; + private AggregateXmlSplitter aggregateXMLSplitter; + + // Used solely for a default URI prefix. + private int entryCounter; + + private InternalRow rowToReturn; + + private int nextFilePathIndex = 0; + private String currentFilePath; + private ZipInputStream currentZipInputStream; + + ZipAggregateXmlFileReader(FilePartition filePartition, FileContext fileContext) { + this.fileContext = fileContext; + this.filePartition = filePartition; + this.openNextFile(); + } + + /** + * Finds the next valid XML element from either the current zip entry or the next valid zip entry. + * + * @return + * @throws IOException + */ + @Override + public boolean next() { + while (true) { + // If we don't already have a splitter open on a zip entry, find the next valid zip entry to process. + if (aggregateXMLSplitter == null) { + boolean foundZipEntry = findNextValidZipEntry(); + if (!foundZipEntry) { + close(); + if (nextFilePathIndex >= filePartition.getPaths().size()) { + return false; + } + this.openNextFile(); + } + } + + // If we have a splitter open on a zip entry, find the next valid row to return from the entry. + if (aggregateXMLSplitter != null) { + boolean foundRow = findNextRowToReturn(); + if (foundRow) { + return true; + } + } + } + } + + @Override + public InternalRow get() { + return this.rowToReturn; + } + + @Override + public void close() { + IOUtils.closeQuietly(this.currentZipInputStream); + } + + private void openNextFile() { + this.currentFilePath = filePartition.getPaths().get(nextFilePathIndex); + nextFilePathIndex++; + this.currentZipInputStream = new ZipInputStream(fileContext.openFile(this.currentFilePath)); + } + + /** + * Find the next valid entry in the zip file. The most likely reason an entry will fail is because it's not XML + * or not a valid XML document. 
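+ * A failed entry is either rethrown as a ConnectorException, when fileContext.isReadAbortOnFailure() is true, or
+ * logged as a warning so the reader can move on to the next entry.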
+ * + * @return false if there are no more valid entries in the zip; true otherwise. + * @throws IOException + */ + private boolean findNextValidZipEntry() { + while (true) { + // Once we no longer have any valid zip entries, we're done. + ZipEntry zipEntry; + try { + zipEntry = FileUtil.findNextFileEntry(currentZipInputStream); + } catch (IOException e) { + String message = String.format("Unable to read zip entry from %s; cause: %s", this.currentFilePath, e.getMessage()); + if (fileContext.isReadAbortOnFailure()) { + throw new ConnectorException(message, e); + } + Util.MAIN_LOGGER.warn(message); + return false; + } + + if (zipEntry == null) { + return false; + } + if (logger.isTraceEnabled()) { + logger.trace("Reading entry {} in {}", zipEntry.getName(), this.currentFilePath); + } + entryCounter++; + String identifierForError = "entry " + zipEntry.getName() + " in " + this.currentFilePath; + + try { + aggregateXMLSplitter = new AggregateXmlSplitter(identifierForError, this.currentZipInputStream, this.fileContext); + // Fail fast if the next entry is not valid XML. + aggregateXMLSplitter.hasNext(); + return true; + } catch (Exception ex) { + if (fileContext.isReadAbortOnFailure()) { + throw ex; + } + aggregateXMLSplitter = null; + Util.MAIN_LOGGER.warn(ex.getMessage()); + } + } + } + + /** + * Find the next row to return, where a row is constructed from a child XML element as specified by the user-defined + * aggregate XML element name and optional namespace. The most likely reason this will fail is because a + * child element is found but it does not have the user-defined URI element in it. + * + * @return + */ + private boolean findNextRowToReturn() { + while (true) { + // This hasNext() call shouldn't fail except when the splitter is first created, and we call it then to + // ensure that the entry is a valid XML file. + if (!aggregateXMLSplitter.hasNext()) { + aggregateXMLSplitter = null; + return false; + } + + try { + this.rowToReturn = this.aggregateXMLSplitter.nextRow(this.currentFilePath + "-" + entryCounter); + return true; + } catch (Exception ex) { + if (fileContext.isReadAbortOnFailure()) { + throw ex; + } + // Warn that the element failed, and keep going. 
+ Util.MAIN_LOGGER.warn(ex.getMessage()); + } + } + } +} diff --git a/src/main/java/com/marklogic/spark/reader/file/ZipFileReader.java b/src/main/java/com/marklogic/spark/reader/file/ZipFileReader.java index 9a3f2369..7fbd8996 100644 --- a/src/main/java/com/marklogic/spark/reader/file/ZipFileReader.java +++ b/src/main/java/com/marklogic/spark/reader/file/ZipFileReader.java @@ -1,14 +1,12 @@ package com.marklogic.spark.reader.file; import com.marklogic.spark.ConnectorException; -import org.apache.hadoop.fs.FileSystem; -import org.apache.hadoop.fs.Path; +import org.apache.commons.crypto.utils.IoUtils; import org.apache.spark.sql.catalyst.InternalRow; import org.apache.spark.sql.catalyst.expressions.GenericInternalRow; import org.apache.spark.sql.connector.read.PartitionReader; import org.apache.spark.unsafe.types.ByteArray; import org.apache.spark.unsafe.types.UTF8String; -import org.apache.spark.util.SerializableConfiguration; import org.slf4j.Logger; import org.slf4j.LoggerFactory; @@ -21,57 +19,66 @@ class ZipFileReader implements PartitionReader { private static final Logger logger = LoggerFactory.getLogger(ZipFileReader.class); - private final String path; - private final ZipInputStream zipInputStream; + private final FilePartition filePartition; + private final FileContext fileContext; + private int nextFilePathIndex; + private String currentFilePath; + private ZipInputStream currentZipInputStream; private ZipEntry currentZipEntry; - ZipFileReader(FilePartition partition, SerializableConfiguration hadoopConfiguration) { - this.path = partition.getPath(); - try { - if (logger.isDebugEnabled()) { - logger.debug("Reading zip file {}", this.path); - } - Path hadoopPath = new Path(this.path); - FileSystem fileSystem = hadoopPath.getFileSystem(hadoopConfiguration.value()); - this.zipInputStream = new ZipInputStream(fileSystem.open(hadoopPath)); - } catch (IOException e) { - throw new ConnectorException(String.format("Unable to read zip file at %s; cause: %s", this.path, e.getMessage()), e); - } + ZipFileReader(FilePartition filePartition, FileContext fileContext) { + this.filePartition = filePartition; + this.fileContext = fileContext; + openNextFile(); } @Override public boolean next() throws IOException { - currentZipEntry = FileUtil.findNextFileEntry(zipInputStream); - return currentZipEntry != null; + currentZipEntry = FileUtil.findNextFileEntry(currentZipInputStream); + if (currentZipEntry != null) { + return true; + } + close(); + if (nextFilePathIndex == filePartition.getPaths().size()) { + return false; + } + openNextFile(); + return next(); } @Override public InternalRow get() { String zipEntryName = currentZipEntry.getName(); if (logger.isTraceEnabled()) { - logger.trace("Reading zip entry {} from zip file {}.", zipEntryName, this.path); + logger.trace("Reading zip entry {} from zip file {}.", zipEntryName, this.currentFilePath); } String uri = zipEntryName.startsWith("/") ? 
- this.path + zipEntryName : - this.path + "/" + zipEntryName; + this.currentFilePath + zipEntryName : + this.currentFilePath + "/" + zipEntryName; byte[] content = readZipEntry(); - long length = content.length; return new GenericInternalRow(new Object[]{ - UTF8String.fromString(uri), null, length, ByteArray.concat(content) + UTF8String.fromString(uri), ByteArray.concat(content), + null, null, null, null, null, null }); } @Override - public void close() throws IOException { - this.zipInputStream.close(); + public void close() { + IoUtils.closeQuietly(this.currentZipInputStream); + } + + private void openNextFile() { + this.currentFilePath = this.filePartition.getPaths().get(nextFilePathIndex); + nextFilePathIndex++; + this.currentZipInputStream = new ZipInputStream(fileContext.openFile(this.currentFilePath)); } private byte[] readZipEntry() { try { - return FileUtil.readBytes(zipInputStream); + return fileContext.readBytes(currentZipInputStream); } catch (IOException e) { throw new ConnectorException(String.format("Unable to read from zip file at %s; cause: %s", - this.path, e.getMessage()), e); + this.currentFilePath, e.getMessage()), e); } } } diff --git a/src/main/java/com/marklogic/spark/reader/filter/OpticFilter.java b/src/main/java/com/marklogic/spark/reader/filter/OpticFilter.java index 7ea80d26..ee81a437 100644 --- a/src/main/java/com/marklogic/spark/reader/filter/OpticFilter.java +++ b/src/main/java/com/marklogic/spark/reader/filter/OpticFilter.java @@ -41,4 +41,14 @@ public interface OpticFilter extends Serializable { * @return */ PlanBuilder.Plan bindFilterValue(PlanBuilder.Plan plan); + + /** + * Allows the filter to determine if - after having been constructed - it's not a valid Optic expression and thus + * cannot be pushed down to MarkLogic. + * + * @return + */ + default boolean isValid() { + return true; + } } diff --git a/src/main/java/com/marklogic/spark/reader/filter/ParentFilter.java b/src/main/java/com/marklogic/spark/reader/filter/ParentFilter.java index 6d00ea45..b1a48798 100644 --- a/src/main/java/com/marklogic/spark/reader/filter/ParentFilter.java +++ b/src/main/java/com/marklogic/spark/reader/filter/ParentFilter.java @@ -52,6 +52,28 @@ class ParentFilter implements OpticFilter { this.filters = filters; } + /** + * Per the docs at https://docs.marklogic.com/op.sqlCondition, an Optic sqlCondition returns a "filterdef" + * instead of a boolean expression and thus cannot be used in an Optic and/or/not clause. Thus, if this filter + * contains a SqlConditionFilter at any depth, it is not valid and cannot be pushed down. 
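+ * When this returns false, OpticScanBuilder leaves the filter out of the pushdown and returns it to Spark as an
+ * unsupported filter, so Spark applies it to the rows returned from MarkLogic instead.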
+ * + * @return + */ + @Override + public boolean isValid() { + for (OpticFilter filter : filters) { + if (filter instanceof SqlConditionFilter) { + return false; + } else if (filter instanceof ParentFilter) { + boolean valid = filter.isValid(); + if (!valid) { + return false; + } + } + } + return true; + } + @Override public void populateArg(ObjectNode arg) { ArrayNode args = arg.put("ns", "op").put("fn", this.functionName).putArray("args"); diff --git a/src/main/java/com/marklogic/spark/reader/optic/OpticBatch.java b/src/main/java/com/marklogic/spark/reader/optic/OpticBatch.java index 555dcf55..453742ac 100644 --- a/src/main/java/com/marklogic/spark/reader/optic/OpticBatch.java +++ b/src/main/java/com/marklogic/spark/reader/optic/OpticBatch.java @@ -25,12 +25,12 @@ class OpticBatch implements Batch { private static final Logger logger = LoggerFactory.getLogger(OpticBatch.class); - private final ReadContext readContext; + private final OpticReadContext opticReadContext; private final InputPartition[] partitions; - OpticBatch(ReadContext readContext) { - this.readContext = readContext; - PlanAnalysis planAnalysis = readContext.getPlanAnalysis(); + OpticBatch(OpticReadContext opticReadContext) { + this.opticReadContext = opticReadContext; + PlanAnalysis planAnalysis = opticReadContext.getPlanAnalysis(); partitions = planAnalysis != null ? planAnalysis.getPartitionArray() : new InputPartition[]{}; @@ -43,9 +43,9 @@ public InputPartition[] planInputPartitions() { @Override public PartitionReaderFactory createReaderFactory() { - if (logger.isDebugEnabled()) { - logger.debug("Creating new partition reader factory"); + if (logger.isTraceEnabled()) { + logger.trace("Creating new partition reader factory"); } - return new OpticPartitionReaderFactory(readContext); + return new OpticPartitionReaderFactory(opticReadContext); } } diff --git a/src/main/java/com/marklogic/spark/reader/optic/OpticMicroBatchStream.java b/src/main/java/com/marklogic/spark/reader/optic/OpticMicroBatchStream.java index ba0b255c..41da87e8 100644 --- a/src/main/java/com/marklogic/spark/reader/optic/OpticMicroBatchStream.java +++ b/src/main/java/com/marklogic/spark/reader/optic/OpticMicroBatchStream.java @@ -39,13 +39,13 @@ class OpticMicroBatchStream implements MicroBatchStream { private static final Logger logger = LoggerFactory.getLogger(OpticMicroBatchStream.class); - private ReadContext readContext; + private OpticReadContext opticReadContext; private List allBuckets; private int bucketIndex; - OpticMicroBatchStream(ReadContext readContext) { - this.readContext = readContext; - this.allBuckets = this.readContext.getPlanAnalysis().getAllBuckets(); + OpticMicroBatchStream(OpticReadContext opticReadContext) { + this.opticReadContext = opticReadContext; + this.allBuckets = this.opticReadContext.getPlanAnalysis().getAllBuckets(); } @Override @@ -77,7 +77,7 @@ public InputPartition[] planInputPartitions(Offset start, Offset end) { @Override public PartitionReaderFactory createReaderFactory() { - return new OpticPartitionReaderFactory(this.readContext); + return new OpticPartitionReaderFactory(this.opticReadContext); } @Override diff --git a/src/main/java/com/marklogic/spark/reader/optic/OpticPartitionReader.java b/src/main/java/com/marklogic/spark/reader/optic/OpticPartitionReader.java index 2a2c9cb1..4760b9e0 100644 --- a/src/main/java/com/marklogic/spark/reader/optic/OpticPartitionReader.java +++ b/src/main/java/com/marklogic/spark/reader/optic/OpticPartitionReader.java @@ -20,6 +20,7 @@ import 
com.fasterxml.jackson.databind.ObjectMapper; import com.fasterxml.jackson.databind.node.ObjectNode; import com.marklogic.client.row.RowManager; +import com.marklogic.spark.ReadProgressLogger; import com.marklogic.spark.reader.JsonRowDeserializer; import org.apache.spark.sql.catalyst.InternalRow; import org.apache.spark.sql.connector.read.PartitionReader; @@ -33,7 +34,7 @@ class OpticPartitionReader implements PartitionReader { private static final Logger logger = LoggerFactory.getLogger(OpticPartitionReader.class); - private final ReadContext readContext; + private final OpticReadContext opticReadContext; private final PlanAnalysis.Partition partition; private final RowManager rowManager; @@ -46,20 +47,23 @@ class OpticPartitionReader implements PartitionReader { // Used solely for logging metrics private long totalRowCount; private long totalDuration; + private long progressCounter; + private final long batchSize; // Used solely for testing purposes; is never expected to be used in production. Intended to provide a way for // a test to get the count of rows returned from MarkLogic, which is important for ensuring that pushdown operations // are working correctly. static Consumer totalRowCountListener; - OpticPartitionReader(ReadContext readContext, PlanAnalysis.Partition partition) { - this.readContext = readContext; + OpticPartitionReader(OpticReadContext opticReadContext, PlanAnalysis.Partition partition) { + this.opticReadContext = opticReadContext; + this.batchSize = opticReadContext.getBatchSize(); this.partition = partition; - this.rowManager = readContext.connectToMarkLogic().newRowManager(); + this.rowManager = opticReadContext.connectToMarkLogic().newRowManager(); // Nested values won't work with the JacksonParser used by JsonRowDeserializer, so we ask for type info to not // be in the rows. 
this.rowManager.setDatatypeStyle(RowManager.RowSetPart.HEADER); - this.jsonRowDeserializer = new JsonRowDeserializer(readContext.getSchema()); + this.jsonRowDeserializer = new JsonRowDeserializer(opticReadContext.getSchema()); } @Override @@ -86,7 +90,7 @@ public boolean next() { PlanAnalysis.Bucket bucket = partition.getBuckets().get(nextBucketIndex); nextBucketIndex++; long start = System.currentTimeMillis(); - this.rowIterator = readContext.readRowsInBucket(rowManager, partition, bucket); + this.rowIterator = opticReadContext.readRowsInBucket(rowManager, partition, bucket); if (logger.isDebugEnabled()) { this.totalDuration += System.currentTimeMillis() - start; } @@ -101,6 +105,11 @@ public boolean next() { public InternalRow get() { this.currentBucketRowCount++; this.totalRowCount++; + this.progressCounter++; + if (this.progressCounter >= this.batchSize) { + ReadProgressLogger.logProgressIfNecessary(this.progressCounter); + this.progressCounter = 0; + } JsonNode row = rowIterator.next(); return this.jsonRowDeserializer.deserializeJson(row.toString()); } diff --git a/src/main/java/com/marklogic/spark/reader/optic/OpticPartitionReaderFactory.java b/src/main/java/com/marklogic/spark/reader/optic/OpticPartitionReaderFactory.java index 3a88be9f..cca4bea9 100644 --- a/src/main/java/com/marklogic/spark/reader/optic/OpticPartitionReaderFactory.java +++ b/src/main/java/com/marklogic/spark/reader/optic/OpticPartitionReaderFactory.java @@ -27,10 +27,10 @@ class OpticPartitionReaderFactory implements PartitionReaderFactory { static final long serialVersionUID = 1; private static final Logger logger = LoggerFactory.getLogger(OpticPartitionReaderFactory.class); - private final ReadContext readContext; + private final OpticReadContext opticReadContext; - OpticPartitionReaderFactory(ReadContext readContext) { - this.readContext = readContext; + OpticPartitionReaderFactory(OpticReadContext opticReadContext) { + this.opticReadContext = opticReadContext; } @Override @@ -38,6 +38,6 @@ public PartitionReader createReader(InputPartition partition) { if (logger.isDebugEnabled()) { logger.debug("Creating reader for partition: {}", partition); } - return new OpticPartitionReader(this.readContext, (PlanAnalysis.Partition) partition); + return new OpticPartitionReader(this.opticReadContext, (PlanAnalysis.Partition) partition); } } diff --git a/src/main/java/com/marklogic/spark/reader/optic/ReadContext.java b/src/main/java/com/marklogic/spark/reader/optic/OpticReadContext.java similarity index 90% rename from src/main/java/com/marklogic/spark/reader/optic/ReadContext.java rename to src/main/java/com/marklogic/spark/reader/optic/OpticReadContext.java index 60fae43a..9be812db 100644 --- a/src/main/java/com/marklogic/spark/reader/optic/ReadContext.java +++ b/src/main/java/com/marklogic/spark/reader/optic/OpticReadContext.java @@ -40,23 +40,20 @@ import org.slf4j.Logger; import org.slf4j.LoggerFactory; -import java.util.ArrayList; -import java.util.Iterator; -import java.util.List; -import java.util.Map; +import java.util.*; import java.util.stream.Collectors; import java.util.stream.Stream; /** - * Captures state - all of which is serializable - that can be calculated at different times based on a user's inputs. + * Captures state - all of which is serializable - associated with a read operation involving an Optic query. * Also simplifies passing state around to the various Spark-required classes, as we only need one argument instead of * N arguments. 
*/ -public class ReadContext extends ContextSupport { +public class OpticReadContext extends ContextSupport { static final long serialVersionUID = 1; - private static final Logger logger = LoggerFactory.getLogger(ReadContext.class); + private static final Logger logger = LoggerFactory.getLogger(OpticReadContext.class); // The ideal batch size depends highly on what a user chooses to do after a load() - and of course the user may // choose to perform multiple operations on the dataset, each of which may benefit from a fairly different batch @@ -68,17 +65,18 @@ public class ReadContext extends ContextSupport { private StructType schema; private long serverTimestamp; private List opticFilters; + private final long batchSize; - public ReadContext(Map properties, StructType schema, int defaultMinPartitions) { + public OpticReadContext(Map properties, StructType schema, int defaultMinPartitions) { super(properties); this.schema = schema; final long partitionCount = getNumericOption(Options.READ_NUM_PARTITIONS, defaultMinPartitions, 1); - final long batchSize = getNumericOption(Options.READ_BATCH_SIZE, DEFAULT_BATCH_SIZE, 0); + this.batchSize = getNumericOption(Options.READ_BATCH_SIZE, DEFAULT_BATCH_SIZE, 0); final String dslQuery = properties.get(Options.READ_OPTIC_QUERY); if (dslQuery == null || dslQuery.trim().length() < 1) { - throw new IllegalArgumentException(String.format("No Optic query found; must define %s", Options.READ_OPTIC_QUERY)); + throw new ConnectorException(Util.getOptionNameForErrorMessage("spark.marklogic.read.noOpticQuery")); } DatabaseClient client = connectToMarkLogic(); @@ -111,7 +109,7 @@ private void handlePlanAnalysisError(String query, FailedRequestException ex) { if (ex.getMessage().contains(indicatorOfNoRowsExisting)) { Util.MAIN_LOGGER.info("No rows were found, so will not create any partitions."); } else { - throw new ConnectorException(String.format("Unable to run Optic DSL query %s; cause: %s", query, ex.getMessage()), ex); + throw new ConnectorException(String.format("Unable to run Optic query %s; cause: %s", query, ex.getMessage()), ex); } } @@ -155,7 +153,9 @@ private PlanBuilder.Plan buildPlanForBucket(RowManager rowManager, PlanAnalysis. void pushDownFiltersIntoOpticQuery(List opticFilters) { this.opticFilters = opticFilters; - addOperatorToPlan(PlanUtil.buildWhere(opticFilters)); + // Add each filter in a separate "where" so we don't toss an op.sqlCondition into an op.and, + // which Optic does not allow. + opticFilters.forEach(filter -> addOperatorToPlan(PlanUtil.buildWhere(filter))); } void pushDownLimit(int limit) { @@ -172,7 +172,10 @@ void pushDownAggregation(Aggregation aggregation) { .map(PlanUtil::expressionToColumnName) .collect(Collectors.toList()); - addOperatorToPlan(PlanUtil.buildGroupByAggregation(groupByColumnNames, aggregation)); + if (logger.isDebugEnabled()) { + logger.debug("groupBy column names: {}", groupByColumnNames); + } + addOperatorToPlan(PlanUtil.buildGroupByAggregation(new HashSet<>(groupByColumnNames), aggregation)); StructType newSchema = buildSchemaWithColumnNames(groupByColumnNames); @@ -211,6 +214,9 @@ void pushDownAggregation(Aggregation aggregation) { this.planAnalysis = new PlanAnalysis(planAnalysis.getBoundedPlan(), mergedPartitions); } + if (Util.MAIN_LOGGER.isDebugEnabled()) { + Util.MAIN_LOGGER.debug("Schema after pushing down aggregation: {}", newSchema); + } this.schema = newSchema; } @@ -278,4 +284,8 @@ PlanAnalysis getPlanAnalysis() { long getBucketCount() { return planAnalysis != null ? 
planAnalysis.getAllBuckets().size() : 0; } + + long getBatchSize() { + return batchSize; + } } diff --git a/src/main/java/com/marklogic/spark/reader/optic/OpticScan.java b/src/main/java/com/marklogic/spark/reader/optic/OpticScan.java index f09a8fc2..4dabc829 100644 --- a/src/main/java/com/marklogic/spark/reader/optic/OpticScan.java +++ b/src/main/java/com/marklogic/spark/reader/optic/OpticScan.java @@ -27,15 +27,15 @@ class OpticScan implements Scan { private static final Logger logger = LoggerFactory.getLogger(OpticScan.class); - private ReadContext readContext; + private OpticReadContext opticReadContext; - OpticScan(ReadContext readContext) { - this.readContext = readContext; + OpticScan(OpticReadContext opticReadContext) { + this.opticReadContext = opticReadContext; } @Override public StructType readSchema() { - return readContext.getSchema(); + return opticReadContext.getSchema(); } @Override @@ -45,14 +45,14 @@ public String description() { @Override public Batch toBatch() { - if (logger.isDebugEnabled()) { - logger.debug("Creating new batch"); + if (logger.isTraceEnabled()) { + logger.trace("Creating new batch"); } - return new OpticBatch(readContext); + return new OpticBatch(opticReadContext); } @Override public MicroBatchStream toMicroBatchStream(String checkpointLocation) { - return new OpticMicroBatchStream(readContext); + return new OpticMicroBatchStream(opticReadContext); } } diff --git a/src/main/java/com/marklogic/spark/reader/optic/OpticScanBuilder.java b/src/main/java/com/marklogic/spark/reader/optic/OpticScanBuilder.java index 91a4beb7..47908060 100644 --- a/src/main/java/com/marklogic/spark/reader/optic/OpticScanBuilder.java +++ b/src/main/java/com/marklogic/spark/reader/optic/OpticScanBuilder.java @@ -35,7 +35,7 @@ public class OpticScanBuilder implements ScanBuilder, SupportsPushDownFilters, S private static final Logger logger = LoggerFactory.getLogger(OpticScanBuilder.class); - private final ReadContext readContext; + private final OpticReadContext opticReadContext; private List pushedFilters; private static final Set> SUPPORTED_AGGREGATE_FUNCTIONS = new HashSet<>(); @@ -49,16 +49,16 @@ public class OpticScanBuilder implements ScanBuilder, SupportsPushDownFilters, S SUPPORTED_AGGREGATE_FUNCTIONS.add(Sum.class); } - public OpticScanBuilder(ReadContext readContext) { - this.readContext = readContext; + public OpticScanBuilder(OpticReadContext opticReadContext) { + this.opticReadContext = opticReadContext; } @Override public Scan build() { - if (logger.isDebugEnabled()) { - logger.debug("Creating new scan"); + if (logger.isTraceEnabled()) { + logger.trace("Creating new scan"); } - return new OpticScan(readContext); + return new OpticScan(opticReadContext); } /** @@ -69,7 +69,7 @@ public Scan build() { @Override public Filter[] pushFilters(Filter[] filters) { pushedFilters = new ArrayList<>(); - if (readContext.planAnalysisFoundNoRows()) { + if (opticReadContext.planAnalysisFoundNoRows()) { return filters; } @@ -80,7 +80,7 @@ public Filter[] pushFilters(Filter[] filters) { } for (Filter filter : filters) { OpticFilter opticFilter = FilterFactory.toPlanFilter(filter); - if (opticFilter != null) { + if (opticFilter != null && opticFilter.isValid()) { if (logger.isDebugEnabled()) { logger.debug("Pushing down filter: {}", filter); } @@ -94,7 +94,7 @@ public Filter[] pushFilters(Filter[] filters) { } } - readContext.pushDownFiltersIntoOpticQuery(opticFilters); + opticReadContext.pushDownFiltersIntoOpticQuery(opticFilters); return unsupportedFilters.toArray(new Filter[0]); } 
@@ -108,19 +108,19 @@ public Filter[] pushedFilters() { @Override public boolean pushLimit(int limit) { - if (readContext.planAnalysisFoundNoRows()) { + if (opticReadContext.planAnalysisFoundNoRows()) { return false; } if (logger.isDebugEnabled()) { logger.debug("Pushing down limit: {}", limit); } - readContext.pushDownLimit(limit); + opticReadContext.pushDownLimit(limit); return true; } @Override public boolean pushTopN(SortOrder[] orders, int limit) { - if (readContext.planAnalysisFoundNoRows()) { + if (opticReadContext.planAnalysisFoundNoRows()) { return false; } // This will be invoked when the user calls both orderBy and limit in their Spark program. If the user only @@ -129,7 +129,7 @@ public boolean pushTopN(SortOrder[] orders, int limit) { if (logger.isDebugEnabled()) { logger.debug("Pushing down topN: {}; limit: {}", Arrays.asList(orders), limit); } - readContext.pushDownTopN(orders, limit); + opticReadContext.pushDownTopN(orders, limit); return true; } @@ -138,7 +138,7 @@ public boolean isPartiallyPushed() { // If a single bucket exists - i.e. a single call will be made to MarkLogic - then any limit/orderBy can be // fully pushed down to MarkLogic. Otherwise, we also need Spark to apply the limit/orderBy to the returned rows // to ensure that the user receives the correct response. - return readContext.getBucketCount() > 1; + return opticReadContext.getBucketCount() > 1; } /** @@ -152,7 +152,7 @@ public boolean isPartiallyPushed() { */ @Override public boolean supportCompletePushDown(Aggregation aggregation) { - if (readContext.planAnalysisFoundNoRows() || pushDownAggregatesIsDisabled()) { + if (opticReadContext.planAnalysisFoundNoRows() || pushDownAggregatesIsDisabled()) { return false; } @@ -164,7 +164,7 @@ public boolean supportCompletePushDown(Aggregation aggregation) { return false; } - if (readContext.getBucketCount() > 1) { + if (opticReadContext.getBucketCount() > 1) { if (Util.MAIN_LOGGER.isInfoEnabled()) { Util.MAIN_LOGGER.info("Multiple requests will be made to MarkLogic; aggregation will be applied by Spark as well: {}", describeAggregation(aggregation)); @@ -179,7 +179,7 @@ public boolean pushAggregation(Aggregation aggregation) { // For the initial 2.0 release, there aren't any known unsupported aggregate functions that can be called // after a "groupBy". If one is detected though, the aggregation won't be pushed down as it's uncertain if // pushing it down would produce the correct results. 
- if (readContext.planAnalysisFoundNoRows() || hasUnsupportedAggregateFunction(aggregation)) { + if (opticReadContext.planAnalysisFoundNoRows() || hasUnsupportedAggregateFunction(aggregation)) { return false; } @@ -191,16 +191,16 @@ public boolean pushAggregation(Aggregation aggregation) { if (logger.isDebugEnabled()) { logger.debug("Pushing down aggregation: {}", describeAggregation(aggregation)); } - readContext.pushDownAggregation(aggregation); + opticReadContext.pushDownAggregation(aggregation); return true; } @Override public void pruneColumns(StructType requiredSchema) { - if (readContext.planAnalysisFoundNoRows()) { + if (opticReadContext.planAnalysisFoundNoRows()) { return; } - if (requiredSchema.equals(readContext.getSchema())) { + if (requiredSchema.equals(opticReadContext.getSchema())) { if (logger.isDebugEnabled()) { logger.debug("The schema to push down is equal to the existing schema, so not pushing it down."); } @@ -209,7 +209,7 @@ public void pruneColumns(StructType requiredSchema) { if (logger.isDebugEnabled()) { logger.debug("Pushing down required schema: {}", requiredSchema.json()); } - readContext.pushDownRequiredSchema(requiredSchema); + opticReadContext.pushDownRequiredSchema(requiredSchema); } } @@ -226,6 +226,6 @@ private String describeAggregation(Aggregation aggregation) { } private boolean pushDownAggregatesIsDisabled() { - return "false".equalsIgnoreCase(readContext.getProperties().get(Options.READ_PUSH_DOWN_AGGREGATES)); + return "false".equalsIgnoreCase(opticReadContext.getProperties().get(Options.READ_PUSH_DOWN_AGGREGATES)); } } diff --git a/src/main/java/com/marklogic/spark/reader/optic/PlanUtil.java b/src/main/java/com/marklogic/spark/reader/optic/PlanUtil.java index 07d4168a..e3538c46 100644 --- a/src/main/java/com/marklogic/spark/reader/optic/PlanUtil.java +++ b/src/main/java/com/marklogic/spark/reader/optic/PlanUtil.java @@ -29,9 +29,7 @@ import org.slf4j.Logger; import org.slf4j.LoggerFactory; -import java.util.HashMap; -import java.util.List; -import java.util.Map; +import java.util.*; import java.util.function.Consumer; import java.util.function.Function; @@ -72,7 +70,13 @@ public abstract class PlanUtil { private PlanUtil() { } - static ObjectNode buildGroupByAggregation(List columnNames, Aggregation aggregation) { + /** + * @param columnNames a set of unique column names is needed as Optic will otherwise throw an error via a + * "duplicate check" per a fix for https://bugtrack.marklogic.com/56662 . + * @param aggregation + * @return + */ + static ObjectNode buildGroupByAggregation(Set columnNames, Aggregation aggregation) { return newOperation("group-by", groupByArgs -> { ArrayNode columns = groupByArgs.addArray(); columnNames.forEach(columnName -> populateSchemaCol(columns.addObject(), columnName)); @@ -181,16 +185,8 @@ private static String removeTickMarksFromColumnName(String columnName) { columnName; } - static ObjectNode buildWhere(List opticFilters) { - return newOperation("where", args -> { - // If there's only one filter, can toss it into the "where" clause. Else, toss an "and" into the "where" and - // then toss every filter into the "and" clause (which accepts 2 to N args). - final ArrayNode targetArgs = opticFilters.size() == 1 ? 
- args : - args.addObject().put("ns", "op").put("fn", "and").putArray("args"); - - opticFilters.forEach(planFilter -> planFilter.populateArg(targetArgs.addObject())); - }); + static ObjectNode buildWhere(OpticFilter filter) { + return newOperation("where", args -> filter.populateArg(args.addObject())); } private static ObjectNode newOperation(String name, Consumer withArgs) { diff --git a/src/main/java/com/marklogic/spark/reader/optic/SchemaInferrer.java b/src/main/java/com/marklogic/spark/reader/optic/SchemaInferrer.java index cd725910..2e527986 100644 --- a/src/main/java/com/marklogic/spark/reader/optic/SchemaInferrer.java +++ b/src/main/java/com/marklogic/spark/reader/optic/SchemaInferrer.java @@ -83,7 +83,7 @@ public static StructType inferSchema(String columnInfoResponse) { } schema = schema.add(makeColumnName(column), determineSparkType(column), isColumnNullable(column)); } catch (JsonProcessingException e) { - throw new ConnectorException(String.format("Unable to parse JSON: %s; cause: %s", columnInfo, e.getMessage()), e); + throw new ConnectorException(String.format("Unable to parse schema JSON: %s; cause: %s", columnInfo, e.getMessage()), e); } } return schema; diff --git a/src/main/java/com/marklogic/spark/writer/ArbitraryRowConverter.java b/src/main/java/com/marklogic/spark/writer/ArbitraryRowConverter.java new file mode 100644 index 00000000..909f1bbd --- /dev/null +++ b/src/main/java/com/marklogic/spark/writer/ArbitraryRowConverter.java @@ -0,0 +1,166 @@ +package com.marklogic.spark.writer; + +import com.fasterxml.jackson.core.JsonProcessingException; +import com.fasterxml.jackson.databind.JsonNode; +import com.fasterxml.jackson.databind.ObjectMapper; +import com.fasterxml.jackson.databind.node.ObjectNode; +import com.fasterxml.jackson.dataformat.xml.XmlMapper; +import com.marklogic.client.io.Format; +import com.marklogic.client.io.JacksonHandle; +import com.marklogic.client.io.StringHandle; +import com.marklogic.client.io.marker.AbstractWriteHandle; +import com.marklogic.spark.ConnectorException; +import com.marklogic.spark.JsonRowSerializer; +import com.marklogic.spark.Options; +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.types.StructField; +import org.apache.spark.sql.types.StructType; + +import java.util.ArrayList; +import java.util.List; +import java.util.Optional; +import java.util.UUID; + +/** + * Handles building a document from an "arbitrary" row - i.e. one with an unknown schema, where the row will be + * serialized by Spark to a JSON object. 
+ */ +class ArbitraryRowConverter implements RowConverter { + + private static final String MARKLOGIC_SPARK_FILE_PATH_COLUMN_NAME = "marklogic_spark_file_path"; + + private final ObjectMapper objectMapper; + private final XmlMapper xmlMapper; + private final JsonRowSerializer jsonRowSerializer; + private final String uriTemplate; + private final String jsonRootName; + private final String xmlRootName; + private final String xmlNamespace; + private final int filePathIndex; + + ArbitraryRowConverter(WriteContext writeContext) { + this.filePathIndex = determineFilePathIndex(writeContext.getSchema()); + this.uriTemplate = writeContext.getStringOption(Options.WRITE_URI_TEMPLATE); + this.jsonRootName = writeContext.getStringOption(Options.WRITE_JSON_ROOT_NAME); + this.xmlRootName = writeContext.getStringOption(Options.WRITE_XML_ROOT_NAME); + this.xmlNamespace = writeContext.getStringOption(Options.WRITE_XML_NAMESPACE); + this.objectMapper = new ObjectMapper(); + this.xmlMapper = this.xmlRootName != null ? new XmlMapper() : null; + this.jsonRowSerializer = new JsonRowSerializer(writeContext.getSchema(), writeContext.getProperties()); + } + + @Override + public Optional convertRow(InternalRow row) { + String initialUri = null; + if (this.filePathIndex > -1) { + initialUri = row.getString(this.filePathIndex) + "/" + UUID.randomUUID(); + row.setNullAt(this.filePathIndex); + } + + final String json = this.jsonRowSerializer.serializeRowToJson(row); + + AbstractWriteHandle contentHandle = null; + ObjectNode deserializedJson = null; + ObjectNode uriTemplateValues = null; + final boolean mustRemoveFilePathField = this.filePathIndex > 1 && jsonRowSerializer.isIncludeNullFields(); + + if (this.jsonRootName != null || this.xmlRootName != null || this.uriTemplate != null || mustRemoveFilePathField) { + deserializedJson = readTree(json); + if (mustRemoveFilePathField) { + deserializedJson.remove(MARKLOGIC_SPARK_FILE_PATH_COLUMN_NAME); + } + } + + if (this.uriTemplate != null) { + uriTemplateValues = deserializedJson; + } + + if (this.jsonRootName != null) { + ObjectNode jsonObjectWithRootName = objectMapper.createObjectNode(); + jsonObjectWithRootName.set(jsonRootName, deserializedJson); + contentHandle = new JacksonHandle(jsonObjectWithRootName); + if (this.uriTemplate != null) { + uriTemplateValues = jsonObjectWithRootName; + } + } + + if (contentHandle == null) { + // If the user wants XML, then we've definitely deserialized the JSON and removed the file path if + // needed. So use that JsonNode to produce an XML string. + if (xmlRootName != null) { + contentHandle = new StringHandle(convertJsonToXml(deserializedJson)).withFormat(Format.XML); + } + // If we've already gone to the effort of creating deserializedJson, use it for the content. + else if (deserializedJson != null) { + contentHandle = new JacksonHandle(deserializedJson); + } else { + // Simplest scenario where we never have a reason to incur the expense of deserializing the JSON string, + // so we can just use StringHandle. + contentHandle = new StringHandle(json).withFormat(Format.JSON); + } + } + + return Optional.of(new DocBuilder.DocumentInputs(initialUri, contentHandle, uriTemplateValues, null)); + } + + @Override + public List getRemainingDocumentInputs() { + return new ArrayList<>(); + } + + /** + * A Spark user can add a column via: + * withColumn("marklogic_spark_file_path", new Column("_metadata.file_path")) + *

+ * This allows access to the file path when using a Spark data source - e.g. CSV, Parquet - to read a file. + * The column will be used to generate an initial URI for the corresponding document, and the column will then + * be removed after that so that it's not included in the document. + * + * @return + */ + private int determineFilePathIndex(StructType schema) { + StructField[] fields = schema.fields(); + for (int i = 0; i < fields.length; i++) { + if (MARKLOGIC_SPARK_FILE_PATH_COLUMN_NAME.equals(fields[i].name())) { + return i; + } + } + return -1; + } + + private ObjectNode readTree(String json) { + // We don't ever expect this to fail, as the JSON is produced by Spark's JacksonGenerator and should always + // be valid JSON. But Jackson throws a checked exception, so gotta handle it. + try { + return (ObjectNode) objectMapper.readTree(json); + } catch (JsonProcessingException e) { + throw new ConnectorException(String.format("Unable to read JSON row: %s", e.getMessage()), e); + } + } + + /** + * jackson-xml-mapper unfortunately does not yet support a root namespace. Nor does it allow for the root element + * to be omitted. So we always end up with "ObjectNode" as a root element. See + * https://github.com/FasterXML/jackson-dataformat-xml/issues/541 for more information. So this method does some + * work to replace that root element with one based on user inputs. + * + * @param doc + * @return + */ + private String convertJsonToXml(JsonNode doc) { + try { + String xml = xmlMapper.writer().writeValueAsString(doc); + String startTag = this.xmlNamespace != null ? + String.format("<%s xmlns='%s'>", this.xmlRootName, this.xmlNamespace) : + String.format("<%s>", this.xmlRootName); + return new StringBuilder(startTag) + .append(xml.substring("".length(), xml.length() - "".length())) + .append(String.format("", this.xmlRootName)) + .toString(); + } catch (JsonProcessingException e) { + // We don't expect this occur; Jackson should be able to convert any JSON object that it created into + // a valid XML document. + throw new ConnectorException(String.format("Unable to convert JSON to XML for doc: %s", doc), e); + } + } +} diff --git a/src/main/java/com/marklogic/spark/writer/ArbitraryRowFunction.java b/src/main/java/com/marklogic/spark/writer/ArbitraryRowFunction.java deleted file mode 100644 index a89439b0..00000000 --- a/src/main/java/com/marklogic/spark/writer/ArbitraryRowFunction.java +++ /dev/null @@ -1,57 +0,0 @@ -package com.marklogic.spark.writer; - -import com.fasterxml.jackson.core.JsonProcessingException; -import com.fasterxml.jackson.databind.ObjectMapper; -import com.fasterxml.jackson.databind.node.ObjectNode; -import com.marklogic.client.io.Format; -import com.marklogic.client.io.StringHandle; -import com.marklogic.spark.ConnectorException; -import com.marklogic.spark.Options; -import com.marklogic.spark.Util; -import org.apache.spark.sql.catalyst.InternalRow; -import org.apache.spark.sql.catalyst.json.JacksonGenerator; - -import java.io.StringWriter; -import java.util.function.Function; - -/** - * Handles building a document from an "arbitrary" row - i.e. one with an unknown schema, where the row will be - * serialized by Spark to a JSON object. 
- */ -class ArbitraryRowFunction implements Function { - - private static ObjectMapper objectMapper = new ObjectMapper(); - - private WriteContext writeContext; - - ArbitraryRowFunction(WriteContext writeContext) { - this.writeContext = writeContext; - } - - @Override - public DocBuilder.DocumentInputs apply(InternalRow row) { - String json = convertRowToJSONString(row); - StringHandle content = new StringHandle(json).withFormat(Format.JSON); - ObjectNode columnValues = null; - if (writeContext.hasOption(Options.WRITE_URI_TEMPLATE)) { - try { - columnValues = (ObjectNode) objectMapper.readTree(json); - } catch (JsonProcessingException e) { - throw new ConnectorException(String.format("Unable to read JSON row: %s", e.getMessage()), e); - } - } - return new DocBuilder.DocumentInputs(null, content, columnValues, null); - } - - private String convertRowToJSONString(InternalRow row) { - StringWriter jsonObjectWriter = new StringWriter(); - JacksonGenerator jacksonGenerator = new JacksonGenerator( - this.writeContext.getSchema(), - jsonObjectWriter, - Util.DEFAULT_JSON_OPTIONS - ); - jacksonGenerator.write(row); - jacksonGenerator.flush(); - return jsonObjectWriter.toString(); - } -} diff --git a/src/main/java/com/marklogic/spark/writer/BatchRetrier.java b/src/main/java/com/marklogic/spark/writer/BatchRetrier.java new file mode 100644 index 00000000..18bb5772 --- /dev/null +++ b/src/main/java/com/marklogic/spark/writer/BatchRetrier.java @@ -0,0 +1,89 @@ +package com.marklogic.spark.writer; + +import com.marklogic.client.datamovement.WriteBatch; +import com.marklogic.client.datamovement.WriteEvent; +import com.marklogic.client.document.DocumentWriteOperation; +import com.marklogic.client.document.DocumentWriteSet; +import com.marklogic.client.impl.GenericDocumentImpl; + +import java.util.Iterator; +import java.util.function.BiConsumer; +import java.util.function.Consumer; + +/** + * Handles retrying a failed batch from a DMSDK WriteBatcher. The client has no idea how many documents in a batch + * failed or which ones failed. So this class divides the batch into two and retries each smaller batch, repeating that + * processing until a batch either succeeds or it fails with a single document in it. Once the latter occurs, this + * class logs the URI of the document that failed along with the error message. + */ +class BatchRetrier { + + private final GenericDocumentImpl documentManager; + private final String temporalCollection; + private final BiConsumer failedDocumentConsumer; + private final Consumer successfulBatchConsumer; + + /** + * @param documentManager requires the concrete class so that the methods that allow a temporal collection are available. + * MLE-13453 was created to address this. + * @param temporalCollection optional temporal collection. + * @param successfulBatchConsumer client provides an implementation of this to optionally perform any processing after a + * batch is successfully written. + * @param failedDocumentConsumer client provides an implementation of this to handle whatever logic is required + * when a failed document is identified. + */ + BatchRetrier(GenericDocumentImpl documentManager, String temporalCollection, + Consumer successfulBatchConsumer, + BiConsumer failedDocumentConsumer) { + this.documentManager = documentManager; + this.temporalCollection = temporalCollection; + this.successfulBatchConsumer = successfulBatchConsumer; + this.failedDocumentConsumer = failedDocumentConsumer; + } + + /** + * Intended to be invoked by a DMSDK batch failure listener. 
+ * + * @param batch + * @param failure + */ + void retryBatch(WriteBatch batch, Throwable failure) { + DocumentWriteSet writeSet = this.documentManager.newWriteSet(); + for (WriteEvent item : batch.getItems()) { + writeSet.add(item.getTargetUri(), item.getMetadata(), item.getContent()); + } + divideInHalfAndRetryEachBatch(writeSet, failure); + } + + /** + * Recursive method that ends when the set of failed documents has a size of 1, as there's nothing more to split + * up and retry. + * + * @param failedWriteSet + */ + private void divideInHalfAndRetryEachBatch(DocumentWriteSet failedWriteSet, Throwable failure) { + final int docCount = failedWriteSet.size(); + if (docCount == 1) { + DocumentWriteOperation failedDoc = failedWriteSet.iterator().next(); + this.failedDocumentConsumer.accept(failedDoc, failure); + return; + } + + DocumentWriteSet newWriteSet = this.documentManager.newWriteSet(); + Iterator failedDocs = failedWriteSet.iterator(); + while (failedDocs.hasNext()) { + newWriteSet.add(failedDocs.next()); + if (newWriteSet.size() == docCount / 2 || !failedDocs.hasNext()) { + try { + this.documentManager.write(newWriteSet, null, null, this.temporalCollection); + if (this.successfulBatchConsumer != null) { + this.successfulBatchConsumer.accept(newWriteSet); + } + } catch (Exception ex) { + divideInHalfAndRetryEachBatch(newWriteSet, ex); + } + newWriteSet = this.documentManager.newWriteSet(); + } + } + } +} diff --git a/src/main/java/com/marklogic/spark/writer/CommitMessage.java b/src/main/java/com/marklogic/spark/writer/CommitMessage.java index 8bb3703c..2a6b7af9 100644 --- a/src/main/java/com/marklogic/spark/writer/CommitMessage.java +++ b/src/main/java/com/marklogic/spark/writer/CommitMessage.java @@ -2,36 +2,36 @@ import org.apache.spark.sql.connector.write.WriterCommitMessage; +import java.util.Set; + public class CommitMessage implements WriterCommitMessage { private final int successItemCount; private final int failedItemCount; - private final int partitionId; - private final long taskId; - private final long epochId; + private final Set graphs; - public CommitMessage(int successItemCount, int failedItemCount, int partitionId, long taskId, long epochId) { + /** + * @param successItemCount + * @param failedItemCount + * @param graphs zero or more MarkLogic Semantics graph names, each of which is associated with a + * graph document in MarkLogic that must be created after all the documents have been + * written. + */ + public CommitMessage(int successItemCount, int failedItemCount, Set graphs) { this.successItemCount = successItemCount; this.failedItemCount = failedItemCount; - this.partitionId = partitionId; - this.taskId = taskId; - this.epochId = epochId; + this.graphs = graphs; } - public int getSuccessItemCount() { + int getSuccessItemCount() { return successItemCount; } - public int getFailedItemCount() { + int getFailedItemCount() { return failedItemCount; } - @Override - public String toString() { - return epochId != 0L ? 
- String.format("[partitionId: %d; taskId: %d; epochId: %d]; docCount: %d; failedItemCount: %d", - partitionId, taskId, epochId, successItemCount, failedItemCount) : - String.format("[partitionId: %d; taskId: %d]; docCount: %d; failedItemCount: %d", - partitionId, taskId, successItemCount, failedItemCount); + Set getGraphs() { + return graphs; } } diff --git a/src/main/java/com/marklogic/spark/writer/DocBuilder.java b/src/main/java/com/marklogic/spark/writer/DocBuilder.java index 583f6a21..6fade8bb 100644 --- a/src/main/java/com/marklogic/spark/writer/DocBuilder.java +++ b/src/main/java/com/marklogic/spark/writer/DocBuilder.java @@ -15,46 +15,63 @@ */ package com.marklogic.spark.writer; -import com.fasterxml.jackson.databind.node.ObjectNode; +import com.fasterxml.jackson.databind.JsonNode; import com.marklogic.client.document.DocumentWriteOperation; import com.marklogic.client.impl.DocumentWriteOperationImpl; import com.marklogic.client.io.DocumentMetadataHandle; import com.marklogic.client.io.marker.AbstractWriteHandle; -class DocBuilder { +public class DocBuilder { public interface UriMaker { - String makeURI(String initialUri, ObjectNode columnValues); + String makeURI(String initialUri, JsonNode uriTemplateValues); } + /** + * Captures the various inputs used for constructing a document to be written to MarkLogic. {@code graph} refers + * to an optional MarkLogic semantics graph, which must be added to the final set of collections for the + * document. + */ public static class DocumentInputs { private final String initialUri; private final AbstractWriteHandle content; - private final ObjectNode columnValuesForUriTemplate; + private final JsonNode columnValuesForUriTemplate; private final DocumentMetadataHandle initialMetadata; + private final String graph; - public DocumentInputs(String initialUri, AbstractWriteHandle content, ObjectNode columnValuesForUriTemplate, DocumentMetadataHandle initialMetadata) { + public DocumentInputs(String initialUri, AbstractWriteHandle content, JsonNode columnValuesForUriTemplate, + DocumentMetadataHandle initialMetadata) { + this(initialUri, content, columnValuesForUriTemplate, initialMetadata, null); + } + + public DocumentInputs(String initialUri, AbstractWriteHandle content, JsonNode columnValuesForUriTemplate, + DocumentMetadataHandle initialMetadata, String graph) { this.initialUri = initialUri; this.content = content; this.columnValuesForUriTemplate = columnValuesForUriTemplate; this.initialMetadata = initialMetadata; + this.graph = graph; } - public String getInitialUri() { + String getInitialUri() { return initialUri; } - public AbstractWriteHandle getContent() { + AbstractWriteHandle getContent() { return content; } - public ObjectNode getColumnValuesForUriTemplate() { + JsonNode getColumnValuesForUriTemplate() { return columnValuesForUriTemplate; } - public DocumentMetadataHandle getInitialMetadata() { + DocumentMetadataHandle getInitialMetadata() { return initialMetadata; } + + String getGraph() { + return graph; + } } private final UriMaker uriMaker; @@ -66,12 +83,31 @@ public DocumentMetadataHandle getInitialMetadata() { } DocumentWriteOperation build(DocumentInputs inputs) { - String uri = uriMaker.makeURI(inputs.getInitialUri(), inputs.getColumnValuesForUriTemplate()); - DocumentMetadataHandle initialMetadata = inputs.getInitialMetadata(); + final String uri = uriMaker.makeURI(inputs.getInitialUri(), inputs.getColumnValuesForUriTemplate()); + final String graph = inputs.getGraph(); + final DocumentMetadataHandle initialMetadata = 
inputs.getInitialMetadata(); + + final boolean isNakedProperties = inputs.getContent() == null; + if (isNakedProperties) { + if (initialMetadata != null) { + overrideInitialMetadata(initialMetadata); + } + return new DocumentWriteOperationImpl(uri, initialMetadata, null); + } + if (initialMetadata != null) { overrideInitialMetadata(initialMetadata); + if (graph != null) { + initialMetadata.getCollections().add(graph); + } return new DocumentWriteOperationImpl(uri, initialMetadata, inputs.getContent()); } + + if (graph != null && !metadata.getCollections().contains(graph)) { + DocumentMetadataHandle newMetadata = newMetadataWithGraph(graph); + return new DocumentWriteOperationImpl(uri, newMetadata, inputs.getContent()); + } + return new DocumentWriteOperationImpl(uri, metadata, inputs.getContent()); } @@ -98,4 +134,24 @@ private void overrideInitialMetadata(DocumentMetadataHandle initialMetadata) { initialMetadata.setMetadataValues(metadata.getMetadataValues()); } } + + /** + * If a semantics graph is specified in the set of document inputs, must copy the DocumentMetadataHandle instance + * in this class to a new DocumentMetadataHandle instance that includes the graph as a collection. This is done to + * avoid modifying the DocumentMetadataHandle instance owned by this class which is expected to be reused for + * many documents. + * + * @param graph + * @return + */ + private DocumentMetadataHandle newMetadataWithGraph(String graph) { + DocumentMetadataHandle newMetadata = new DocumentMetadataHandle(); + newMetadata.getCollections().addAll(metadata.getCollections()); + newMetadata.getPermissions().putAll(metadata.getPermissions()); + newMetadata.setQuality(metadata.getQuality()); + newMetadata.setProperties(metadata.getProperties()); + newMetadata.setMetadataValues(metadata.getMetadataValues()); + newMetadata.getCollections().add(graph); + return newMetadata; + } } diff --git a/src/main/java/com/marklogic/spark/writer/DocumentRowConverter.java b/src/main/java/com/marklogic/spark/writer/DocumentRowConverter.java new file mode 100644 index 00000000..dd556e30 --- /dev/null +++ b/src/main/java/com/marklogic/spark/writer/DocumentRowConverter.java @@ -0,0 +1,75 @@ +package com.marklogic.spark.writer; + +import com.fasterxml.jackson.databind.JsonNode; +import com.fasterxml.jackson.databind.ObjectMapper; +import com.fasterxml.jackson.databind.node.ObjectNode; +import com.marklogic.client.io.BytesHandle; +import com.marklogic.client.io.DocumentMetadataHandle; +import com.marklogic.client.io.Format; +import com.marklogic.spark.Options; +import com.marklogic.spark.reader.document.DocumentRowSchema; +import org.apache.spark.sql.catalyst.InternalRow; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.List; +import java.util.Optional; + +/** + * Knows how to build a document from a row corresponding to our {@code DocumentRowSchema}. 
+ */ +class DocumentRowConverter implements RowConverter { + + private final ObjectMapper objectMapper; + private final String uriTemplate; + private final Format documentFormat; + + DocumentRowConverter(WriteContext writeContext) { + this.uriTemplate = writeContext.getStringOption(Options.WRITE_URI_TEMPLATE); + this.documentFormat = writeContext.getDocumentFormat(); + this.objectMapper = new ObjectMapper(); + } + + @Override + public Optional convertRow(InternalRow row) { + final String uri = row.getString(0); + + final boolean isNakedProperties = row.isNullAt(1); + if (isNakedProperties) { + DocumentMetadataHandle metadata = DocumentRowSchema.makeDocumentMetadata(row); + return Optional.of(new DocBuilder.DocumentInputs(uri, null, null, metadata)); + } + + final BytesHandle content = new BytesHandle(row.getBinary(1)); + if (this.documentFormat != null) { + content.withFormat(this.documentFormat); + } + + JsonNode uriTemplateValues = null; + if (this.uriTemplate != null && this.uriTemplate.trim().length() > 0) { + String format = row.isNullAt(2) ? null : row.getString(2); + uriTemplateValues = deserializeContentToJson(uri, content, format); + } + DocumentMetadataHandle metadata = DocumentRowSchema.makeDocumentMetadata(row); + return Optional.of(new DocBuilder.DocumentInputs(uri, content, uriTemplateValues, metadata)); + } + + @Override + public List getRemainingDocumentInputs() { + return new ArrayList<>(); + } + + private JsonNode deserializeContentToJson(String initialUri, BytesHandle contentHandle, String format) { + try { + return objectMapper.readTree(contentHandle.get()); + } catch (IOException e) { + // Preserves the initial support in the 2.2.0 release. + ObjectNode values = objectMapper.createObjectNode(); + values.put("URI", initialUri); + if (format != null) { + values.put("format", format); + } + return values; + } + } +} diff --git a/src/main/java/com/marklogic/spark/writer/DocumentRowFunction.java b/src/main/java/com/marklogic/spark/writer/DocumentRowFunction.java deleted file mode 100644 index 49a8d5cd..00000000 --- a/src/main/java/com/marklogic/spark/writer/DocumentRowFunction.java +++ /dev/null @@ -1,95 +0,0 @@ -package com.marklogic.spark.writer; - -import com.fasterxml.jackson.databind.ObjectMapper; -import com.fasterxml.jackson.databind.node.ObjectNode; -import com.marklogic.client.io.BytesHandle; -import com.marklogic.client.io.DocumentMetadataHandle; -import org.apache.spark.sql.catalyst.InternalRow; -import org.apache.spark.sql.catalyst.util.ArrayData; -import org.apache.spark.sql.catalyst.util.MapData; -import org.apache.spark.sql.types.DataTypes; - -import javax.xml.namespace.QName; -import java.util.function.Function; - -/** - * Knows how to build a document from a row corresponding to our {@code DocumentRowSchema}. 
- */ -class DocumentRowFunction implements Function { - - private static ObjectMapper objectMapper = new ObjectMapper(); - - @Override - public DocBuilder.DocumentInputs apply(InternalRow row) { - String uri = row.getString(0); - BytesHandle content = new BytesHandle(row.getBinary(1)); - - ObjectNode columnValues = objectMapper.createObjectNode(); - columnValues.put("URI", uri); - columnValues.put("format", row.getString(2)); - - DocumentMetadataHandle metadata = new DocumentMetadataHandle(); - addCollectionsToMetadata(row, metadata); - addPermissionsToMetadata(row, metadata); - if (!row.isNullAt(5)) { - metadata.setQuality(row.getInt(5)); - } - addPropertiesToMetadata(row, metadata); - addMetadataValuesToMetadata(row, metadata); - return new DocBuilder.DocumentInputs(uri, content, columnValues, metadata); - } - - private void addCollectionsToMetadata(InternalRow row, DocumentMetadataHandle metadata) { - if (!row.isNullAt(3)) { - ArrayData collections = row.getArray(3); - for (int i = 0; i < collections.numElements(); i++) { - String value = collections.get(i, DataTypes.StringType).toString(); - metadata.getCollections().add(value); - } - } - } - - private void addPermissionsToMetadata(InternalRow row, DocumentMetadataHandle metadata) { - if (!row.isNullAt(4)) { - MapData permissions = row.getMap(4); - ArrayData roles = permissions.keyArray(); - ArrayData capabilities = permissions.valueArray(); - for (int i = 0; i < roles.numElements(); i++) { - String role = roles.get(i, DataTypes.StringType).toString(); - ArrayData caps = capabilities.getArray(i); - DocumentMetadataHandle.Capability[] capArray = new DocumentMetadataHandle.Capability[caps.numElements()]; - for (int j = 0; j < caps.numElements(); j++) { - String value = caps.get(j, DataTypes.StringType).toString(); - capArray[j] = DocumentMetadataHandle.Capability.valueOf(value.toUpperCase()); - } - metadata.getPermissions().add(role, capArray); - } - } - } - - private void addPropertiesToMetadata(InternalRow row, DocumentMetadataHandle metadata) { - if (!row.isNullAt(6)) { - MapData properties = row.getMap(6); - ArrayData qnames = properties.keyArray(); - ArrayData values = properties.valueArray(); - for (int i = 0; i < qnames.numElements(); i++) { - String qname = qnames.get(i, DataTypes.StringType).toString(); - String value = values.get(i, DataTypes.StringType).toString(); - metadata.getProperties().put(QName.valueOf(qname), value); - } - } - } - - private void addMetadataValuesToMetadata(InternalRow row, DocumentMetadataHandle metadata) { - if (!row.isNullAt(7)) { - MapData properties = row.getMap(7); - ArrayData keys = properties.keyArray(); - ArrayData values = properties.valueArray(); - for (int i = 0; i < keys.numElements(); i++) { - String key = keys.get(i, DataTypes.StringType).toString(); - String value = values.get(i, DataTypes.StringType).toString(); - metadata.getMetadataValues().put(key, value); - } - } - } -} diff --git a/src/main/java/com/marklogic/spark/writer/FileRowConverter.java b/src/main/java/com/marklogic/spark/writer/FileRowConverter.java new file mode 100644 index 00000000..0d9979cd --- /dev/null +++ b/src/main/java/com/marklogic/spark/writer/FileRowConverter.java @@ -0,0 +1,80 @@ +package com.marklogic.spark.writer; + +import com.fasterxml.jackson.databind.JsonNode; +import com.fasterxml.jackson.databind.ObjectMapper; +import com.fasterxml.jackson.databind.node.ObjectNode; +import com.marklogic.client.io.BytesHandle; +import com.marklogic.client.io.Format; +import com.marklogic.spark.Options; +import 
org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.types.DataTypes; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.List; +import java.util.Optional; + +/** + * Knows how to build a document from a row corresponding to our {@code FileRowSchema}. + */ +class FileRowConverter implements RowConverter { + + private final WriteContext writeContext; + private final ObjectMapper objectMapper; + private final String uriTemplate; + + FileRowConverter(WriteContext writeContext) { + this.writeContext = writeContext; + this.uriTemplate = writeContext.getStringOption(Options.WRITE_URI_TEMPLATE); + this.objectMapper = new ObjectMapper(); + } + + @Override + public Optional convertRow(InternalRow row) { + final String path = row.getString(writeContext.getFileSchemaPathPosition()); + BytesHandle contentHandle = new BytesHandle(row.getBinary(writeContext.getFileSchemaContentPosition())); + forceFormatIfNecessary(contentHandle); + Optional uriTemplateValues = deserializeContentToJson(path, contentHandle, row); + return Optional.of(new DocBuilder.DocumentInputs(path, contentHandle, uriTemplateValues.orElse(null), null)); + } + + @Override + public List getRemainingDocumentInputs() { + return new ArrayList<>(); + } + + // Telling Sonar to not tell us to remove this code, since we can't until 3.0. + @SuppressWarnings("java:S1874") + private void forceFormatIfNecessary(BytesHandle content) { + Format format = writeContext.getDocumentFormat(); + if (format != null) { + content.withFormat(format); + } else { + format = writeContext.getDeprecatedFileRowsDocumentFormat(); + if (format != null) { + content.withFormat(format); + } + } + } + + private Optional deserializeContentToJson(String path, BytesHandle contentHandle, InternalRow row) { + if (this.uriTemplate == null || this.uriTemplate.trim().length() == 0) { + return Optional.empty(); + } + try { + JsonNode json = objectMapper.readTree(contentHandle.get()); + return Optional.of(json); + } catch (IOException e) { + // Preserves the initial support in the 2.2.0 release. + ObjectNode values = objectMapper.createObjectNode(); + values.put("path", path); + if (!row.isNullAt(1)) { + values.put("modificationTime", row.get(1, DataTypes.LongType).toString()); + } + if (!row.isNullAt(2)) { + values.put("length", row.getLong(2)); + } + return Optional.of(values); + } + } +} diff --git a/src/main/java/com/marklogic/spark/writer/FileRowFunction.java b/src/main/java/com/marklogic/spark/writer/FileRowFunction.java deleted file mode 100644 index 726c0ba7..00000000 --- a/src/main/java/com/marklogic/spark/writer/FileRowFunction.java +++ /dev/null @@ -1,57 +0,0 @@ -package com.marklogic.spark.writer; - -import com.fasterxml.jackson.databind.ObjectMapper; -import com.fasterxml.jackson.databind.node.ObjectNode; -import com.marklogic.client.io.BytesHandle; -import com.marklogic.client.io.Format; -import com.marklogic.spark.ConnectorException; -import com.marklogic.spark.Options; -import org.apache.spark.sql.catalyst.InternalRow; -import org.apache.spark.sql.types.DataTypes; - -import java.util.function.Function; - -/** - * Knows how to build a document from a row corresponding to our {@code FileRowSchema}. 
- */ -class FileRowFunction implements Function { - - private static ObjectMapper objectMapper = new ObjectMapper(); - - private WriteContext writeContext; - - FileRowFunction(WriteContext writeContext) { - this.writeContext = writeContext; - } - - @Override - public DocBuilder.DocumentInputs apply(InternalRow row) { - String initialUri = row.getString(writeContext.getFileSchemaPathPosition()); - BytesHandle content = new BytesHandle(row.getBinary(writeContext.getFileSchemaContentPosition())); - forceFormatIfNecessary(content); - ObjectNode columnValues = null; - if (writeContext.hasOption(Options.WRITE_URI_TEMPLATE)) { - columnValues = objectMapper.createObjectNode(); - columnValues.put("path", initialUri); - Object modificationTime = row.get(1, DataTypes.LongType); - if (modificationTime != null) { - columnValues.put("modificationTime", modificationTime.toString()); - } - columnValues.put("length", row.getLong(2)); - // Not including content as it's a byte array that is not expected to be helpful for making a URI. - } - return new DocBuilder.DocumentInputs(initialUri, content, columnValues, null); - } - - private void forceFormatIfNecessary(BytesHandle content) { - if (writeContext.hasOption(Options.WRITE_FILE_ROWS_DOCUMENT_TYPE)) { - String value = writeContext.getProperties().get(Options.WRITE_FILE_ROWS_DOCUMENT_TYPE); - try { - content.withFormat(Format.valueOf(value.toUpperCase())); - } catch (IllegalArgumentException e) { - String message = "Invalid value for option %s: %s; must be one of 'JSON', 'XML', or 'TEXT'."; - throw new ConnectorException(String.format(message, Options.WRITE_FILE_ROWS_DOCUMENT_TYPE, value)); - } - } - } -} diff --git a/src/main/java/com/marklogic/spark/writer/MarkLogicWrite.java b/src/main/java/com/marklogic/spark/writer/MarkLogicWrite.java index 2b8fa5db..e4f5c2fc 100644 --- a/src/main/java/com/marklogic/spark/writer/MarkLogicWrite.java +++ b/src/main/java/com/marklogic/spark/writer/MarkLogicWrite.java @@ -19,36 +19,29 @@ import com.marklogic.spark.Util; import com.marklogic.spark.reader.customcode.CustomCodeContext; import com.marklogic.spark.writer.customcode.CustomCodeWriterFactory; +import com.marklogic.spark.writer.rdf.GraphWriter; +import org.apache.hadoop.conf.Configuration; +import org.apache.spark.sql.SparkSession; import org.apache.spark.sql.connector.write.BatchWrite; import org.apache.spark.sql.connector.write.DataWriterFactory; import org.apache.spark.sql.connector.write.PhysicalWriteInfo; import org.apache.spark.sql.connector.write.WriterCommitMessage; import org.apache.spark.sql.connector.write.streaming.StreamingDataWriterFactory; import org.apache.spark.sql.connector.write.streaming.StreamingWrite; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; +import org.apache.spark.util.SerializableConfiguration; -import java.util.Arrays; +import java.util.HashSet; +import java.util.Set; import java.util.function.Consumer; public class MarkLogicWrite implements BatchWrite, StreamingWrite { - private static final Logger logger = LoggerFactory.getLogger("com.marklogic.spark"); - private WriteContext writeContext; // Used solely for testing. Will never be populated in a real world scenario. 
private static Consumer successCountConsumer; private static Consumer failureCountConsumer; - public static void setSuccessCountConsumer(Consumer consumer) { - successCountConsumer = consumer; - } - - public static void setFailureCountConsumer(Consumer consumer) { - failureCountConsumer = consumer; - } - MarkLogicWrite(WriteContext writeContext) { this.writeContext = writeContext; } @@ -60,46 +53,47 @@ public boolean useCommitCoordinator() { @Override public DataWriterFactory createBatchWriterFactory(PhysicalWriteInfo info) { - int partitions = info.numPartitions(); - int threadCount = writeContext.getThreadCount(); - Util.MAIN_LOGGER.info("Number of partitions: {}; thread count per partition: {}; total threads used for writing: {}", - partitions, threadCount, partitions * threadCount); - return (DataWriterFactory) determineWriterFactory(); + int numPartitions = info.numPartitions(); + writeContext.setNumPartitions(numPartitions); + DataWriterFactory dataWriterFactory = determineWriterFactory(); + if (dataWriterFactory instanceof WriteBatcherDataWriterFactory) { + logPartitionAndThreadCounts(numPartitions); + } else { + Util.MAIN_LOGGER.info("Number of partitions: {}", numPartitions); + } + return dataWriterFactory; } @Override public void commit(WriterCommitMessage[] messages) { if (messages != null && messages.length > 0) { - if (logger.isDebugEnabled()) { - logger.debug("Commit messages received: {}", Arrays.asList(messages)); - } - int successCount = 0; - int failureCount = 0; - for (WriterCommitMessage message : messages) { - CommitMessage msg = (CommitMessage)message; - successCount += msg.getSuccessItemCount(); - failureCount += msg.getFailedItemCount(); + final CommitResults commitResults = aggregateCommitMessages(messages); + if (!commitResults.graphs.isEmpty()) { + new GraphWriter( + writeContext.connectToMarkLogic(), + writeContext.getProperties().get(Options.WRITE_PERMISSIONS) + ).createGraphs(commitResults.graphs); } + if (successCountConsumer != null) { - successCountConsumer.accept(successCount); + successCountConsumer.accept(commitResults.successCount); } if (failureCountConsumer != null) { - failureCountConsumer.accept(failureCount); + failureCountConsumer.accept(commitResults.failureCount); } + if (Util.MAIN_LOGGER.isInfoEnabled()) { - Util.MAIN_LOGGER.info("Success count: {}", successCount); + Util.MAIN_LOGGER.info("Success count: {}", commitResults.successCount); } - if (failureCount > 0) { - Util.MAIN_LOGGER.error("Failure count: {}", failureCount); + if (commitResults.failureCount > 0) { + Util.MAIN_LOGGER.error("Failure count: {}", commitResults.failureCount); } } } @Override public void abort(WriterCommitMessage[] messages) { - if (messages != null && messages.length > 0 && messages[0] != null) { - Util.MAIN_LOGGER.warn("Abort messages received: {}", Arrays.asList(messages)); - } + // No action. We may eventually want to show the partial progress via the commit messages. 
} @Override @@ -109,25 +103,75 @@ public StreamingDataWriterFactory createStreamingWriterFactory(PhysicalWriteInfo @Override public void commit(long epochId, WriterCommitMessage[] messages) { - if (messages != null && messages.length > 0 && logger.isDebugEnabled()) { - logger.debug("Commit messages received for epochId {}: {}", epochId, Arrays.asList(messages)); - } + commit(messages); } @Override public void abort(long epochId, WriterCommitMessage[] messages) { - if (messages != null && messages.length > 0) { - Util.MAIN_LOGGER.warn("Abort messages received for epochId {}: {}", epochId, Arrays.asList(messages)); - } + abort(messages); } - private Object determineWriterFactory() { - if (writeContext.hasOption(Options.WRITE_INVOKE, Options.WRITE_JAVASCRIPT, Options.WRITE_XQUERY)) { + private DataWriterFactory determineWriterFactory() { + if (Util.isWriteWithCustomCodeOperation(writeContext.getProperties())) { CustomCodeContext context = new CustomCodeContext( writeContext.getProperties(), writeContext.getSchema(), Options.WRITE_VARS_PREFIX ); return new CustomCodeWriterFactory(context); } - return new WriteBatcherDataWriterFactory(writeContext); + + // This is the last chance we have for accessing the hadoop config, which is needed by the writer. + // SerializableConfiguration allows for it to be sent to the factory. + Configuration config = SparkSession.active().sparkContext().hadoopConfiguration(); + return new WriteBatcherDataWriterFactory(writeContext, new SerializableConfiguration(config)); } + + private void logPartitionAndThreadCounts(int numPartitions) { + int userDefinedPartitionThreadCount = writeContext.getUserDefinedThreadCountPerPartition(); + if (userDefinedPartitionThreadCount > 0) { + Util.MAIN_LOGGER.info("Number of partitions: {}; total thread count: {}; thread count per partition: {}", + numPartitions, numPartitions * userDefinedPartitionThreadCount, userDefinedPartitionThreadCount); + } else { + Util.MAIN_LOGGER.info("Number of partitions: {}; total threads used for writing: {}", + numPartitions, writeContext.getTotalThreadCount()); + } + } + + private CommitResults aggregateCommitMessages(WriterCommitMessage[] messages) { + int successCount = 0; + int failureCount = 0; + Set graphs = new HashSet<>(); + for (WriterCommitMessage message : messages) { + CommitMessage msg = (CommitMessage) message; + successCount += msg.getSuccessItemCount(); + failureCount += msg.getFailedItemCount(); + if (msg.getGraphs() != null) { + graphs.addAll(msg.getGraphs()); + } + } + return new CommitResults(successCount, failureCount, graphs); + } + + /** + * Aggregates the results of each CommitMessage. 
+ */ + private static class CommitResults { + final int successCount; + final int failureCount; + final Set graphs; + + public CommitResults(int successCount, int failureCount, Set graphs) { + this.successCount = successCount; + this.failureCount = failureCount; + this.graphs = graphs; + } + } + + public static void setSuccessCountConsumer(Consumer consumer) { + successCountConsumer = consumer; + } + + public static void setFailureCountConsumer(Consumer consumer) { + failureCountConsumer = consumer; + } + } diff --git a/src/main/java/com/marklogic/spark/writer/RowConverter.java b/src/main/java/com/marklogic/spark/writer/RowConverter.java new file mode 100644 index 00000000..e22b07bb --- /dev/null +++ b/src/main/java/com/marklogic/spark/writer/RowConverter.java @@ -0,0 +1,29 @@ +package com.marklogic.spark.writer; + +import org.apache.spark.sql.catalyst.InternalRow; + +import java.util.List; +import java.util.Optional; + +/** + * Strategy interface for how a Spark row is converted into a set of inputs for writing a document to MarkLogic. + */ +public interface RowConverter { + + /** + * An implementation can return an empty Optional, which will happen when the row will be used with other rows to + * form a document. + * + * @param row + * @return + */ + Optional convertRow(InternalRow row); + + /** + * Called when {@code WriteBatcherDataWriter} has no more rows to send, but the implementation may have one or + * more documents that haven't been returned yet via {@code convertRow}. + * + * @return + */ + List getRemainingDocumentInputs(); +} diff --git a/src/main/java/com/marklogic/spark/writer/SparkRowUriMaker.java b/src/main/java/com/marklogic/spark/writer/SparkRowUriMaker.java index 8b869756..a2c14c3e 100644 --- a/src/main/java/com/marklogic/spark/writer/SparkRowUriMaker.java +++ b/src/main/java/com/marklogic/spark/writer/SparkRowUriMaker.java @@ -16,9 +16,7 @@ package com.marklogic.spark.writer; import com.fasterxml.jackson.databind.JsonNode; -import com.fasterxml.jackson.databind.node.ObjectNode; import com.marklogic.spark.ConnectorException; -import com.marklogic.spark.Options; import java.util.regex.Matcher; import java.util.regex.Pattern; @@ -28,28 +26,35 @@ */ class SparkRowUriMaker implements DocBuilder.UriMaker { - private String uriTemplate; + private final String uriTemplate; + private final String uriTemplateOptionName; // The matcher can be reused as this class is not expected to be thread-safe, as each WriteBatcherDataWriter creates // its own and never has multiple threads trying to access it at the same time. private Matcher matcher; - SparkRowUriMaker(String uriTemplate) { - validateUriTemplate(uriTemplate); + SparkRowUriMaker(String uriTemplate, String uriTemplateOptionName) { this.uriTemplate = uriTemplate; + this.uriTemplateOptionName = uriTemplateOptionName; + validateUriTemplate(uriTemplate); this.matcher = Pattern.compile("\\{([^}]+)\\}", Pattern.CASE_INSENSITIVE).matcher(uriTemplate); } @Override - public String makeURI(String initialUri, ObjectNode columnValues) { + public String makeURI(String initialUri, JsonNode uriTemplateValues) { + if (uriTemplateValues == null) { + throw new ConnectorException(String.format("Unable to create URI using template '%s' for initial URI '%s'; no URI template values found.", + this.uriTemplate, initialUri)); + } + // initialUri is ignored as the intent is to build the entire URI from the template. 
// Inspired by https://www.baeldung.com/java-regex-token-replacement int lastIndex = 0; StringBuilder output = new StringBuilder(); while (matcher.find()) { output.append(this.uriTemplate, lastIndex, matcher.start()); - String columnName = matcher.group(1); - output.append(getColumnValue(columnValues, columnName)); + String expression = matcher.group(1); + output.append(getExpressionValue(uriTemplateValues, expression)); lastIndex = matcher.end(); } if (lastIndex < this.uriTemplate.length()) { @@ -62,22 +67,22 @@ public String makeURI(String initialUri, ObjectNode columnValues) { private void validateUriTemplate(String uriTemplate) { // Copied from the DHF Spark 2 connector - final String preamble = String.format("Invalid value for %s: %s; ", Options.WRITE_URI_TEMPLATE, uriTemplate); + final String preamble = String.format("Invalid value for %s: %s; ", uriTemplateOptionName, uriTemplate); boolean inToken = false; int tokenSize = 0; char[] chars = uriTemplate.toCharArray(); for (char ch : chars) { if (ch == '}') { if (!inToken) { - throw new IllegalArgumentException(preamble + "closing brace found before opening brace"); + throw new ConnectorException(preamble + "closing brace found before opening brace"); } if (tokenSize == 0) { - throw new IllegalArgumentException(preamble + "no column name within opening and closing brace"); + throw new ConnectorException(preamble + "no column name within opening and closing brace"); } inToken = false; } else if (ch == '{') { if (inToken) { - throw new IllegalArgumentException(preamble + "expected closing brace, but found opening brace"); + throw new ConnectorException(preamble + "expected closing brace, but found opening brace"); } inToken = true; tokenSize = 0; @@ -86,27 +91,33 @@ private void validateUriTemplate(String uriTemplate) { } } if (inToken) { - throw new IllegalArgumentException(preamble + "opening brace without closing brace"); + throw new ConnectorException(preamble + "opening brace without closing brace"); } } - private String getColumnValue(JsonNode row, String columnName) { - if (!row.has(columnName)) { + private String getExpressionValue(JsonNode uriTemplateValues, String expression) { + JsonNode node; + // As of 2.3.0, now supports a JSONPointer expression, which is indicated by the first character being a "/". + if (expression.startsWith("/")) { + node = uriTemplateValues.at(expression); + } else { + node = uriTemplateValues.has(expression) ? 
uriTemplateValues.get(expression) : null; + } + + if (node == null || node.isMissingNode()) { throw new ConnectorException( - String.format("Did not find column '%s' in row: %s; column is required by URI template: %s", - columnName, row, uriTemplate + String.format("Expression '%s' did not resolve to a value in row: %s; expression is required by URI template: %s", + expression, uriTemplateValues, uriTemplate )); } - String text = row.get(columnName).asText(); + String text = node.asText(); if (text.trim().length() == 0) { throw new ConnectorException( - String.format("Column '%s' is empty in row: %s; column is required by URI template: %s", - columnName, row, uriTemplate + String.format("Expression '%s' resolved to an empty string in row: %s; expression is required by URI template: %s", + expression, uriTemplateValues, uriTemplate )); } - return text; } - } diff --git a/src/main/java/com/marklogic/spark/writer/StandardUriMaker.java b/src/main/java/com/marklogic/spark/writer/StandardUriMaker.java index 685160be..1307f87d 100644 --- a/src/main/java/com/marklogic/spark/writer/StandardUriMaker.java +++ b/src/main/java/com/marklogic/spark/writer/StandardUriMaker.java @@ -1,6 +1,6 @@ package com.marklogic.spark.writer; -import com.fasterxml.jackson.databind.node.ObjectNode; +import com.fasterxml.jackson.databind.JsonNode; import com.marklogic.spark.ConnectorException; import java.util.UUID; @@ -18,7 +18,7 @@ class StandardUriMaker implements DocBuilder.UriMaker { } @Override - public String makeURI(String initialUri, ObjectNode columnValues) { + public String makeURI(String initialUri, JsonNode uriTemplateValues) { String uri = initialUri != null ? initialUri : ""; if (uriReplace != null && uriReplace.trim().length() > 0) { uri = applyUriReplace(uri); diff --git a/src/main/java/com/marklogic/spark/writer/WriteBatcherDataWriter.java b/src/main/java/com/marklogic/spark/writer/WriteBatcherDataWriter.java index 510944a4..86512c45 100644 --- a/src/main/java/com/marklogic/spark/writer/WriteBatcherDataWriter.java +++ b/src/main/java/com/marklogic/spark/writer/WriteBatcherDataWriter.java @@ -18,18 +18,33 @@ import com.marklogic.client.DatabaseClient; import com.marklogic.client.datamovement.DataMovementManager; import com.marklogic.client.datamovement.WriteBatcher; +import com.marklogic.client.document.DocumentWriteOperation; +import com.marklogic.client.impl.HandleAccessor; +import com.marklogic.client.io.DocumentMetadataHandle; +import com.marklogic.client.io.marker.AbstractWriteHandle; +import com.marklogic.spark.ConnectorException; +import com.marklogic.spark.Options; import com.marklogic.spark.Util; +import com.marklogic.spark.reader.document.DocumentRowBuilder; import com.marklogic.spark.reader.document.DocumentRowSchema; +import com.marklogic.spark.reader.file.TripleRowSchema; +import com.marklogic.spark.writer.file.ZipFileWriter; +import com.marklogic.spark.writer.rdf.RdfRowConverter; import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.catalyst.expressions.GenericInternalRow; import org.apache.spark.sql.connector.write.DataWriter; import org.apache.spark.sql.connector.write.WriterCommitMessage; +import org.apache.spark.unsafe.types.ByteArray; +import org.apache.spark.util.SerializableConfiguration; import org.slf4j.Logger; import org.slf4j.LoggerFactory; -import java.io.IOException; +import java.util.ArrayList; +import java.util.List; +import java.util.Optional; +import java.util.Set; import java.util.concurrent.atomic.AtomicInteger; import 
java.util.concurrent.atomic.AtomicReference; -import java.util.function.Function; /** * Uses the Java Client's WriteBatcher to handle writing rows as documents to MarkLogic. @@ -42,93 +57,126 @@ class WriteBatcherDataWriter implements DataWriter { private final DatabaseClient databaseClient; private final DataMovementManager dataMovementManager; private final WriteBatcher writeBatcher; + private final BatchRetrier batchRetrier; + private final ZipFileWriter archiveWriter; + private final DocBuilder docBuilder; - private final int partitionId; - private final long taskId; - private final long epochId; // Used to capture the first failure that occurs during a request to MarkLogic. private final AtomicReference writeFailure; - private final Function rowToDocumentFunction; + private final RowConverter rowConverter; // Updated as batches are processed. - private AtomicInteger successItemCount = new AtomicInteger(0); - private AtomicInteger failedItemCount = new AtomicInteger(0); + private final AtomicInteger successItemCount = new AtomicInteger(0); + private final AtomicInteger failedItemCount = new AtomicInteger(0); - WriteBatcherDataWriter(WriteContext writeContext, int partitionId, long taskId, long epochId) { + WriteBatcherDataWriter(WriteContext writeContext, SerializableConfiguration hadoopConfiguration, int partitionId) { this.writeContext = writeContext; - this.partitionId = partitionId; - this.taskId = taskId; - this.epochId = epochId; this.writeFailure = new AtomicReference<>(); this.docBuilder = this.writeContext.newDocBuilder(); this.databaseClient = writeContext.connectToMarkLogic(); - this.dataMovementManager = this.databaseClient.newDataMovementManager(); - this.writeBatcher = writeContext.newWriteBatcher(this.dataMovementManager); - addBatchListeners(); - this.dataMovementManager.startJob(this.writeBatcher); + this.rowConverter = determineRowConverter(); - if (writeContext.isUsingFileSchema()) { - rowToDocumentFunction = new FileRowFunction(writeContext); - } else if (DocumentRowSchema.SCHEMA.equals(writeContext.getSchema())) { - rowToDocumentFunction = new DocumentRowFunction(); + if (writeContext.isAbortOnFailure()) { + this.batchRetrier = null; + this.archiveWriter = null; } else { - rowToDocumentFunction = new ArbitraryRowFunction(writeContext); + this.batchRetrier = makeBatchRetrier(); + this.archiveWriter = writeContext.hasOption(Options.WRITE_ARCHIVE_PATH_FOR_FAILED_DOCUMENTS) ? 
+ createArchiveWriter(hadoopConfiguration, partitionId) : null; } + + this.dataMovementManager = this.databaseClient.newDataMovementManager(); + this.writeBatcher = writeContext.newWriteBatcher(this.dataMovementManager); + addBatchListeners(this.writeBatcher); + this.dataMovementManager.startJob(this.writeBatcher); } @Override - public void write(InternalRow row) throws IOException { + public void write(InternalRow row) { throwWriteFailureIfExists(); - DocBuilder.DocumentInputs inputs = rowToDocumentFunction.apply(row); - this.writeBatcher.add(this.docBuilder.build(inputs)); + Optional document = rowConverter.convertRow(row); + if (document.isPresent()) { + this.writeBatcher.add(this.docBuilder.build(document.get())); + } } @Override - public WriterCommitMessage commit() throws IOException { - this.writeBatcher.flushAndWait(); - CommitMessage message = new CommitMessage(successItemCount.get(), failedItemCount.get(), partitionId, taskId, epochId); - if (logger.isDebugEnabled()) { - logger.debug("Committing {}", message); + public WriterCommitMessage commit() { + List documentInputs = rowConverter.getRemainingDocumentInputs(); + if (documentInputs != null) { + documentInputs.forEach(inputs -> { + DocumentWriteOperation writeOp = this.docBuilder.build(inputs); + this.writeBatcher.add(writeOp); + }); } + this.writeBatcher.flushAndWait(); + throwWriteFailureIfExists(); - return message; + + // Need this hack so that the complete set of graphs can be reported back to MarkLogicWrite, which handles + // creating the graphs after all documents have been written. + Set graphs = null; + if (this.rowConverter instanceof RdfRowConverter) { + graphs = ((RdfRowConverter) rowConverter).getGraphs(); + } + + return new CommitMessage(successItemCount.get(), failedItemCount.get(), graphs); } @Override public void abort() { - Util.MAIN_LOGGER.warn("Abort called; stopping job"); + Util.MAIN_LOGGER.warn("Abort called."); stopJobAndRelease(); + closeArchiveWriter(); + Util.MAIN_LOGGER.info("Finished abort."); } @Override public void close() { if (logger.isDebugEnabled()) { - logger.debug("Close called; stopping job."); + logger.debug("Close called."); } stopJobAndRelease(); + closeArchiveWriter(); } - private void addBatchListeners() { - this.writeBatcher.onBatchSuccess(batch -> this.successItemCount.getAndAdd(batch.getItems().length)); + private void addBatchListeners(WriteBatcher writeBatcher) { + writeBatcher.onBatchSuccess(batch -> this.successItemCount.getAndAdd(batch.getItems().length)); if (writeContext.isAbortOnFailure()) { - // Logging not needed here, as WriteBatcherImpl already logs this at the warning level. - this.writeBatcher.onBatchFailure((batch, failure) -> - this.writeFailure.compareAndSet(null, failure) - ); + // WriteBatcherImpl has its own warn-level logging which is a bit verbose, including more than just the + // message from the server. This is intended to always show up and be associated with our Spark connector + // and also to be more brief, just capturing the main message from the server. 
+ writeBatcher.onBatchFailure((batch, failure) -> { + Util.MAIN_LOGGER.error("Failed to write documents: {}", failure.getMessage()); + this.writeFailure.compareAndSet(null, failure); + }); } else { - this.writeBatcher.onBatchFailure((batch, failure) -> - this.failedItemCount.getAndAdd(batch.getItems().length) - ); + writeBatcher.onBatchFailure(this.batchRetrier::retryBatch); } } - private synchronized void throwWriteFailureIfExists() throws IOException { + private RowConverter determineRowConverter() { + if (writeContext.isUsingFileSchema()) { + return new FileRowConverter(writeContext); + } else if (DocumentRowSchema.SCHEMA.equals(writeContext.getSchema())) { + return new DocumentRowConverter(writeContext); + } else if (TripleRowSchema.SCHEMA.equals(writeContext.getSchema())) { + return new RdfRowConverter(writeContext); + } + return new ArbitraryRowConverter(writeContext); + } + + private synchronized void throwWriteFailureIfExists() { if (writeFailure.get() != null) { + Throwable failure = writeFailure.get(); + if (failure instanceof ConnectorException) { + throw (ConnectorException) failure; + } // Only including the message seems sufficient here, as Spark is logging the stacktrace. And the user // most likely only needs to know the message. - throw new IOException(writeFailure.get().getMessage()); + throw new ConnectorException(failure.getMessage()); } } @@ -140,4 +188,60 @@ private void stopJobAndRelease() { this.databaseClient.release(); } } + + private BatchRetrier makeBatchRetrier() { + return new BatchRetrier( + writeContext.newDocumentManager(this.databaseClient), + writeContext.getStringOption(Options.WRITE_TEMPORAL_COLLECTION), + successfulBatch -> successItemCount.getAndAdd(successfulBatch.size()), + (failedDoc, failure) -> { + Util.MAIN_LOGGER.error("Unable to write document with URI: {}; cause: {}", failedDoc.getUri(), failure.getMessage()); + failedItemCount.incrementAndGet(); + if (this.archiveWriter != null) { + writeFailedDocumentToArchive(failedDoc); + } + } + ); + } + + /** + * Need this to be synchronized so that 2 or more WriteBatcher threads don't try to write zip entries at the same + * time. + * + * @param failedDoc + */ + private synchronized void writeFailedDocumentToArchive(DocumentWriteOperation failedDoc) { + AbstractWriteHandle contentHandle = failedDoc.getContent(); + byte[] content = ByteArray.concat(HandleAccessor.contentAsString(contentHandle).getBytes()); + + GenericInternalRow row = new DocumentRowBuilder(new ArrayList<>()) + .withUri(failedDoc.getUri()).withContent(content) + .withMetadata((DocumentMetadataHandle) failedDoc.getMetadata()) + .buildRow(); + + try { + archiveWriter.write(row); + } catch (Exception e) { + ConnectorException ex = new ConnectorException(String.format( + "Unable to write failed documents to archive file at %s; URI of failed document: %s; cause: %s", + archiveWriter.getZipPath(), failedDoc.getUri(), e.getMessage() + ), e); + this.writeFailure.compareAndSet(null, ex); + throw ex; + } + } + + private ZipFileWriter createArchiveWriter(SerializableConfiguration hadoopConfiguration, int partitionId) { + String path = writeContext.getStringOption(Options.WRITE_ARCHIVE_PATH_FOR_FAILED_DOCUMENTS); + // The zip file is expected to be created lazily - i.e. only when a document fails. This avoids creating + // empty archive zip files when no errors occur. 
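The comment above notes that the archive zip file is created lazily, only once a document actually fails, so that successful runs do not leave behind empty archives. A minimal sketch of that lazy-creation idea, using hypothetical class and method names rather than the connector's own `ZipFileWriter`:

```
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Illustrative only: opens the zip on the first failed document, so no empty
// archive is created when every document is written successfully.
class LazyFailureArchive implements AutoCloseable {

    private final String zipPath;
    private ZipOutputStream zipStream; // created on first use

    LazyFailureArchive(String zipPath) {
        this.zipPath = zipPath;
    }

    // Synchronized for the same reason as the connector's method: multiple
    // writer threads must not interleave zip entries.
    synchronized void archive(String uri, byte[] content) {
        try {
            if (zipStream == null) {
                zipStream = new ZipOutputStream(new FileOutputStream(zipPath));
            }
            zipStream.putNextEntry(new ZipEntry(uri));
            zipStream.write(content);
            zipStream.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public synchronized void close() throws IOException {
        if (zipStream != null) {
            zipStream.close();
        }
    }
}
```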
+ return new ZipFileWriter(path, writeContext.getProperties(), hadoopConfiguration, partitionId, false); + } + + private void closeArchiveWriter() { + if (archiveWriter != null) { + Util.MAIN_LOGGER.info("Wrote failed documents to archive file at {}.", archiveWriter.getZipPath()); + archiveWriter.close(); + } + } } diff --git a/src/main/java/com/marklogic/spark/writer/WriteBatcherDataWriterFactory.java b/src/main/java/com/marklogic/spark/writer/WriteBatcherDataWriterFactory.java index ebd73979..220290ac 100644 --- a/src/main/java/com/marklogic/spark/writer/WriteBatcherDataWriterFactory.java +++ b/src/main/java/com/marklogic/spark/writer/WriteBatcherDataWriterFactory.java @@ -19,22 +19,25 @@ import org.apache.spark.sql.connector.write.DataWriter; import org.apache.spark.sql.connector.write.DataWriterFactory; import org.apache.spark.sql.connector.write.streaming.StreamingDataWriterFactory; +import org.apache.spark.util.SerializableConfiguration; class WriteBatcherDataWriterFactory implements DataWriterFactory, StreamingDataWriterFactory { - private WriteContext writeContext; + private final WriteContext writeContext; + private final SerializableConfiguration hadoopConfiguration; - WriteBatcherDataWriterFactory(WriteContext writeContext) { + WriteBatcherDataWriterFactory(WriteContext writeContext, SerializableConfiguration hadoopConfiguration) { this.writeContext = writeContext; + this.hadoopConfiguration = hadoopConfiguration; } @Override public DataWriter createWriter(int partitionId, long taskId) { - return new WriteBatcherDataWriter(writeContext, partitionId, taskId, 0L); + return new WriteBatcherDataWriter(writeContext, hadoopConfiguration, partitionId); } @Override public DataWriter createWriter(int partitionId, long taskId, long epochId) { - return new WriteBatcherDataWriter(writeContext, partitionId, taskId, epochId); + return createWriter(partitionId, taskId); } } diff --git a/src/main/java/com/marklogic/spark/writer/WriteContext.java b/src/main/java/com/marklogic/spark/writer/WriteContext.java index 67d4af5e..e5bafe7f 100644 --- a/src/main/java/com/marklogic/spark/writer/WriteContext.java +++ b/src/main/java/com/marklogic/spark/writer/WriteContext.java @@ -15,20 +15,24 @@ */ package com.marklogic.spark.writer; +import com.marklogic.client.DatabaseClient; import com.marklogic.client.datamovement.DataMovementManager; import com.marklogic.client.datamovement.WriteBatch; import com.marklogic.client.datamovement.WriteBatcher; import com.marklogic.client.datamovement.WriteEvent; +import com.marklogic.client.document.GenericDocumentManager; import com.marklogic.client.document.ServerTransform; -import com.marklogic.spark.ContextSupport; -import com.marklogic.spark.Options; -import com.marklogic.spark.Util; +import com.marklogic.client.impl.GenericDocumentImpl; +import com.marklogic.client.io.Format; +import com.marklogic.spark.*; import com.marklogic.spark.reader.document.DocumentRowSchema; +import com.marklogic.spark.reader.file.TripleRowSchema; import org.apache.spark.sql.types.StructType; import java.util.Arrays; import java.util.List; import java.util.Map; +import java.util.Optional; import java.util.stream.Stream; public class WriteContext extends ContextSupport { @@ -37,13 +41,18 @@ public class WriteContext extends ContextSupport { private final StructType schema; private final boolean usingFileSchema; + private final int batchSize; private int fileSchemaContentPosition; private int fileSchemaPathPosition; + // This unfortunately is not final as we don't know it when this object is 
created. + private int numPartitions; + public WriteContext(StructType schema, Map properties) { super(properties); this.schema = schema; + this.batchSize = (int) getNumericOption(Options.WRITE_BATCH_SIZE, 100, 1); // We support the Spark binaryFile schema - https://spark.apache.org/docs/latest/sql-data-sources-binaryFile.html - // so that reader can be reused for loading files as-is. @@ -61,34 +70,72 @@ public StructType getSchema() { return schema; } - int getThreadCount() { - return (int)getNumericOption(Options.WRITE_THREAD_COUNT, 4, 1); + /** + * @return the total number of threads to use across all partitions. This is typically how a user thinks in terms + * of, as they are not likely to know how many partitions will be created. But they will typically know how many + * hosts are in their MarkLogic cluster and how many threads are available to an app server on each host. + */ + int getTotalThreadCount() { + return (int) getNumericOption(Options.WRITE_THREAD_COUNT, 4, 1); + } + + /** + * @return the thread count to use per partition where a user has specified the total thread count across all + * partitions. + */ + int getCalculatedThreadCountPerPartition() { + int threadCount = getTotalThreadCount(); + if (this.numPartitions > 0) { + return (int) Math.ceil((double) threadCount / (double) numPartitions); + } + return threadCount; + } + + /** + * @return the thread count to use per partition where a user has used an option to explicitly define how many + * threads should be used by a partition. + */ + int getUserDefinedThreadCountPerPartition() { + return (int) getNumericOption(Options.WRITE_THREAD_COUNT_PER_PARTITION, 0, 1); } WriteBatcher newWriteBatcher(DataMovementManager dataMovementManager) { + // If the user told us how many threads they want per partition (we expect this to be rare), then use that. + // Otherwise, use the calculated number of threads per partition based on the total thread count that either + // the user configured or using the default value for that option. + final int threadCount = getUserDefinedThreadCountPerPartition() > 0 ? + getUserDefinedThreadCountPerPartition() : getCalculatedThreadCountPerPartition(); + + if (Util.MAIN_LOGGER.isDebugEnabled()) { + Util.MAIN_LOGGER.debug("Creating new batcher with thread count of {} and batch size of {}.", threadCount, batchSize); + } WriteBatcher writeBatcher = dataMovementManager .newWriteBatcher() - .withBatchSize((int) getNumericOption(Options.WRITE_BATCH_SIZE, 100, 1)) - .withThreadCount(getThreadCount()) - // WriteBatcherImpl has its own warn-level logging which is a bit verbose, including more than just the - // message from the server. This is intended to always show up and be associated with our Spark connector - // and also to be more brief, just capturing the main message from the server. 
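As a quick illustration of the per-partition thread calculation above (the user's total thread count divided across partitions, rounded up), here is a standalone sketch with made-up numbers:

```
// Illustrative only: mirrors the ceil(totalThreads / numPartitions) calculation
// described above. The numbers are hypothetical.
public class ThreadCountExample {

    static int threadsPerPartition(int totalThreadCount, int numPartitions) {
        if (numPartitions <= 0) {
            return totalThreadCount;
        }
        return (int) Math.ceil((double) totalThreadCount / (double) numPartitions);
    }

    public static void main(String[] args) {
        // e.g. a user asks for 16 total threads and Spark creates 6 partitions:
        // each partition's WriteBatcher gets ceil(16 / 6) = 3 threads.
        System.out.println(threadsPerPartition(16, 6)); // 3
        // With the default of 4 total threads and 4 partitions, each partition gets 1 thread.
        System.out.println(threadsPerPartition(4, 4));  // 1
    }
}
```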
- .onBatchFailure((batch, failure) -> - Util.MAIN_LOGGER.error("Failed to write documents: {}", failure.getMessage()) - ); + .withBatchSize(batchSize) + .withThreadCount(threadCount) + .withTemporalCollection(getStringOption(Options.WRITE_TEMPORAL_COLLECTION)) + .onBatchSuccess(this::logBatchOnSuccess); - if (logger.isDebugEnabled()) { - writeBatcher.onBatchSuccess(this::logBatchOnSuccess); + Optional transform = makeRestTransform(); + if (transform.isPresent()) { + writeBatcher.withTransform(transform.get()); } + return writeBatcher; + } - String temporalCollection = getProperties().get(Options.WRITE_TEMPORAL_COLLECTION); - if (temporalCollection != null && temporalCollection.trim().length() > 0) { - writeBatcher.withTemporalCollection(temporalCollection); + /** + * @param client + * @return a {@code GenericDocumentImpl}, which exposes the methods that accept a temporal collection as an input. + * Has the same configuration as the {@code WriteBatcher} created by this class as well, thus allowing for documents + * in a failed batch to be retried via this document manager. + */ + GenericDocumentImpl newDocumentManager(DatabaseClient client) { + GenericDocumentManager mgr = client.newDocumentManager(); + Optional transform = makeRestTransform(); + if (transform.isPresent()) { + mgr.setWriteTransform(transform.get()); } - - configureRestTransform(writeBatcher); - - return writeBatcher; + return (GenericDocumentImpl) mgr; } DocBuilder newDocBuilder() { @@ -96,32 +143,93 @@ DocBuilder newDocBuilder() { .withCollections(getProperties().get(Options.WRITE_COLLECTIONS)) .withPermissions(getProperties().get(Options.WRITE_PERMISSIONS)); - String uriTemplate = getProperties().get(Options.WRITE_URI_TEMPLATE); - if (uriTemplate != null && uriTemplate.trim().length() > 0) { - factory.withUriMaker(new SparkRowUriMaker(uriTemplate)); - Stream.of(Options.WRITE_URI_PREFIX, Options.WRITE_URI_SUFFIX, Options.WRITE_URI_REPLACE).forEach(option -> { - String value = getProperties().get(option); - if (value != null && value.trim().length() > 0) { - Util.MAIN_LOGGER.warn("Option {} will be ignored since option {} was specified.", option, Options.WRITE_URI_TEMPLATE); - } - }); + if (hasOption(Options.WRITE_URI_TEMPLATE)) { + configureTemplateUriMaker(factory); } else { - String uriSuffix = null; - if (hasOption(Options.WRITE_URI_SUFFIX)) { - uriSuffix = getProperties().get(Options.WRITE_URI_SUFFIX); - } else if (!isUsingFileSchema() && !DocumentRowSchema.SCHEMA.equals(this.schema)) { - uriSuffix = ".json"; - } - factory.withUriMaker(new StandardUriMaker( - getProperties().get(Options.WRITE_URI_PREFIX), uriSuffix, - getProperties().get(Options.WRITE_URI_REPLACE) - )); + configureStandardUriMaker(factory); } return factory.newDocBuilder(); } - private void configureRestTransform(WriteBatcher writeBatcher) { + Format getDocumentFormat() { + if (hasOption(Options.WRITE_DOCUMENT_TYPE)) { + String value = getStringOption(Options.WRITE_DOCUMENT_TYPE); + try { + return Format.valueOf(value.toUpperCase()); + } catch (IllegalArgumentException e) { + String message = "Invalid value for %s: %s; must be one of 'JSON', 'XML', or 'TEXT'."; + String optionAlias = getOptionNameForMessage(Options.WRITE_DOCUMENT_TYPE); + throw new ConnectorException(String.format(message, optionAlias, value)); + } + } + return null; + } + + /** + * @deprecated since 2.3.0; users should use getDocumentFormat instead. + */ + @Deprecated(since = "2.3.0") + // We don't need Sonar to remind us of this deprecation. 
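Both document-type methods above parse the option value case-insensitively into a `Format` and turn the resulting `IllegalArgumentException` into a friendlier error message. A hedged sketch of that parse-and-rethrow pattern, using a stand-in enum and exception rather than the Java Client's `Format` and the connector's `ConnectorException`:

```
// Illustrative only: case-insensitive enum parsing with a descriptive error,
// as used for the document-type options. The enum and exception are stand-ins.
public class DocumentTypeParsing {

    enum DocFormat { JSON, XML, TEXT }

    static DocFormat parseDocumentType(String optionName, String value) {
        try {
            return DocFormat.valueOf(value.toUpperCase());
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException(String.format(
                "Invalid value for %s: %s; must be one of 'JSON', 'XML', or 'TEXT'.", optionName, value));
        }
    }

    public static void main(String[] args) {
        // "the document type option" is a placeholder for the real option name.
        System.out.println(parseDocumentType("the document type option", "xml")); // XML
        // parseDocumentType("the document type option", "pdf") -> descriptive error
    }
}
```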
+ @SuppressWarnings("java:S1133") + Format getDeprecatedFileRowsDocumentFormat() { + final String deprecatedOption = Options.WRITE_FILE_ROWS_DOCUMENT_TYPE; + if (hasOption(deprecatedOption)) { + String value = getStringOption(deprecatedOption); + try { + return Format.valueOf(value.toUpperCase()); + } catch (IllegalArgumentException e) { + String message = "Invalid value for %s: %s; must be one of 'JSON', 'XML', or 'TEXT'."; + String optionAlias = getOptionNameForMessage(deprecatedOption); + throw new ConnectorException(String.format(message, optionAlias, value)); + } + } + return null; + } + + /** + * The URI template approach will typically be used with rows with an "arbitrary" schema where each column value + * may be useful in constructing a URI. + * + * @param factory + */ + private void configureTemplateUriMaker(DocBuilderFactory factory) { + String uriTemplate = getProperties().get(Options.WRITE_URI_TEMPLATE); + String optionAlias = getOptionNameForMessage(Options.WRITE_URI_TEMPLATE); + factory.withUriMaker(new SparkRowUriMaker(uriTemplate, optionAlias)); + Stream.of(Options.WRITE_URI_PREFIX, Options.WRITE_URI_SUFFIX, Options.WRITE_URI_REPLACE).forEach(option -> { + String value = getProperties().get(option); + if (value != null && value.trim().length() > 0) { + Util.MAIN_LOGGER.warn("Option {} will be ignored since option {} was specified.", option, Options.WRITE_URI_TEMPLATE); + } + }); + } + + /** + * For rows with an "arbitrary" schema, the URI suffix defaults to ".json" or ".xml" as we know there won't be an + * initial URI for these rows. + * + * @param factory + */ + private void configureStandardUriMaker(DocBuilderFactory factory) { + String uriSuffix = null; + if (hasOption(Options.WRITE_URI_SUFFIX)) { + uriSuffix = getProperties().get(Options.WRITE_URI_SUFFIX); + } else if (!isUsingFileSchema() && !DocumentRowSchema.SCHEMA.equals(this.schema) && !TripleRowSchema.SCHEMA.equals(this.schema)) { + String xmlRootName = getStringOption(Options.WRITE_XML_ROOT_NAME); + if (xmlRootName != null && getStringOption(Options.WRITE_JSON_ROOT_NAME) != null) { + throw new ConnectorException(String.format("Cannot specify both %s and %s", + getOptionNameForMessage(Options.WRITE_JSON_ROOT_NAME), getOptionNameForMessage(Options.WRITE_XML_ROOT_NAME))); + } + uriSuffix = xmlRootName != null ? ".xml" : ".json"; + } + factory.withUriMaker(new StandardUriMaker( + getProperties().get(Options.WRITE_URI_PREFIX), uriSuffix, + getProperties().get(Options.WRITE_URI_REPLACE) + )); + } + + private Optional makeRestTransform() { String transformName = getProperties().get(Options.WRITE_TRANSFORM_NAME); if (transformName != null && transformName.trim().length() > 0) { ServerTransform transform = new ServerTransform(transformName); @@ -129,8 +237,9 @@ private void configureRestTransform(WriteBatcher writeBatcher) { if (paramsValue != null && paramsValue.trim().length() > 0) { addRestTransformParams(transform, paramsValue); } - writeBatcher.withTransform(transform); + return Optional.of(transform); } + return Optional.empty(); } private void addRestTransformParams(ServerTransform transform, String paramsValue) { @@ -138,9 +247,9 @@ private void addRestTransformParams(ServerTransform transform, String paramsValu String delimiter = delimiterValue != null && delimiterValue.trim().length() > 0 ? 
delimiterValue : ","; String[] params = paramsValue.split(delimiter); if (params.length % 2 != 0) { - throw new IllegalArgumentException( + throw new ConnectorException( String.format("The %s option must contain an equal number of parameter names and values; received: %s", - Options.WRITE_TRANSFORM_PARAMS, paramsValue) + getOptionNameForMessage(Options.WRITE_TRANSFORM_PARAMS), paramsValue) ); } for (int i = 0; i < params.length; i += 2) { @@ -154,12 +263,14 @@ private void logBatchOnSuccess(WriteBatch batch) { WriteEvent firstEvent = batch.getItems()[0]; // If the first event is the item added by DMSDK for the default metadata object, ignore it when showing // the count of documents in the batch. - // the count of documents in the batch. if (firstEvent.getTargetUri() == null && firstEvent.getMetadata() != null) { docCount--; } } - logger.debug("Wrote batch; length: {}; job batch number: {}", docCount, batch.getJobBatchNumber()); + WriteProgressLogger.logProgressIfNecessary(docCount); + if (logger.isTraceEnabled()) { + logger.trace("Wrote batch; length: {}; job batch number: {}", docCount, batch.getJobBatchNumber()); + } } boolean isUsingFileSchema() { @@ -173,4 +284,8 @@ int getFileSchemaContentPosition() { int getFileSchemaPathPosition() { return fileSchemaPathPosition; } + + public void setNumPartitions(int numPartitions) { + this.numPartitions = numPartitions; + } } diff --git a/src/main/java/com/marklogic/spark/writer/customcode/CustomCodeWriter.java b/src/main/java/com/marklogic/spark/writer/customcode/CustomCodeWriter.java index bc2af7b8..13a875aa 100644 --- a/src/main/java/com/marklogic/spark/writer/customcode/CustomCodeWriter.java +++ b/src/main/java/com/marklogic/spark/writer/customcode/CustomCodeWriter.java @@ -9,19 +9,15 @@ import com.marklogic.client.io.JacksonHandle; import com.marklogic.client.io.StringHandle; import com.marklogic.client.io.marker.AbstractWriteHandle; -import com.marklogic.spark.ConnectorException; -import com.marklogic.spark.Options; -import com.marklogic.spark.Util; +import com.marklogic.spark.*; import com.marklogic.spark.reader.customcode.CustomCodeContext; import com.marklogic.spark.writer.CommitMessage; import org.apache.spark.sql.catalyst.InternalRow; -import org.apache.spark.sql.catalyst.json.JacksonGenerator; import org.apache.spark.sql.connector.write.DataWriter; import org.apache.spark.sql.connector.write.WriterCommitMessage; import org.slf4j.Logger; import org.slf4j.LoggerFactory; -import java.io.StringWriter; import java.util.ArrayList; import java.util.List; import java.util.stream.Collectors; @@ -32,31 +28,23 @@ class CustomCodeWriter implements DataWriter { private final DatabaseClient databaseClient; private final CustomCodeContext customCodeContext; - + private final JsonRowSerializer jsonRowSerializer; private final int batchSize; + private final List currentBatch = new ArrayList<>(); private final String externalVariableDelimiter; private ObjectMapper objectMapper; - // Only used for commit messages - private final int partitionId; - private final long taskId; - private final long epochId; - // Updated after each call to MarkLogic. 
private int successItemCount; private int failedItemCount; - CustomCodeWriter(CustomCodeContext customCodeContext, int partitionId, long taskId, long epochId) { + CustomCodeWriter(CustomCodeContext customCodeContext) { this.customCodeContext = customCodeContext; - this.partitionId = partitionId; - this.taskId = taskId; - this.epochId = epochId; - this.databaseClient = customCodeContext.connectToMarkLogic(); + this.jsonRowSerializer = new JsonRowSerializer(customCodeContext.getSchema(), customCodeContext.getProperties()); - this.batchSize = customCodeContext.optionExists(Options.WRITE_BATCH_SIZE) ? - Integer.parseInt(customCodeContext.getProperties().get(Options.WRITE_BATCH_SIZE)) : 1; + this.batchSize = (int) customCodeContext.getNumericOption(Options.WRITE_BATCH_SIZE, 1, 1); this.externalVariableDelimiter = customCodeContext.optionExists(Options.WRITE_EXTERNAL_VARIABLE_DELIMITER) ? customCodeContext.getProperties().get(Options.WRITE_EXTERNAL_VARIABLE_DELIMITER) : ","; @@ -69,11 +57,10 @@ class CustomCodeWriter implements DataWriter { @Override public void write(InternalRow row) { String rowValue = customCodeContext.isCustomSchema() ? - convertRowToJSONString(row) : + jsonRowSerializer.serializeRowToJson(row) : row.getString(0); this.currentBatch.add(rowValue); - if (this.currentBatch.size() >= this.batchSize) { flush(); } @@ -82,7 +69,7 @@ public void write(InternalRow row) { @Override public WriterCommitMessage commit() { flush(); - CommitMessage message = new CommitMessage(successItemCount, failedItemCount, partitionId, taskId, epochId); + CommitMessage message = new CommitMessage(successItemCount, failedItemCount, null); if (logger.isDebugEnabled()) { logger.debug("Committing {}", message); } @@ -91,13 +78,13 @@ public WriterCommitMessage commit() { @Override public void abort() { - Util.MAIN_LOGGER.warn("Abort called; stopping job"); + // No action to take. 
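CustomCodeWriter buffers row values and flushes to MarkLogic whenever the configured batch size is reached, with `commit()` flushing any remainder. A compact sketch of that buffer-and-flush pattern, with a hypothetical sink in place of the custom-code call:

```
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustrative only: buffers items and hands them to a sink whenever the
// configured batch size is reached; commit() flushes whatever remains.
class BatchingWriter {

    private final int batchSize;
    private final Consumer<List<String>> sink; // stand-in for the custom-code invocation
    private final List<String> currentBatch = new ArrayList<>();

    BatchingWriter(int batchSize, Consumer<List<String>> sink) {
        this.batchSize = batchSize;
        this.sink = sink;
    }

    void write(String rowValue) {
        currentBatch.add(rowValue);
        if (currentBatch.size() >= batchSize) {
            flush();
        }
    }

    void commit() {
        flush();
    }

    private void flush() {
        if (!currentBatch.isEmpty()) {
            sink.accept(new ArrayList<>(currentBatch));
            currentBatch.clear();
        }
    }
}
```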
} @Override public void close() { if (logger.isDebugEnabled()) { - logger.debug("Close called; stopping job."); + logger.debug("Close called."); } if (databaseClient != null) { databaseClient.release(); @@ -113,13 +100,14 @@ private void flush() { return; } - if (logger.isDebugEnabled()) { - logger.debug("Calling custom code in MarkLogic"); + if (logger.isTraceEnabled()) { + logger.trace("Calling custom code in MarkLogic"); } final int itemCount = currentBatch.size(); ServerEvaluationCall call = customCodeContext.buildCall( this.databaseClient, - new CustomCodeContext.CallOptions(Options.WRITE_INVOKE, Options.WRITE_JAVASCRIPT, Options.WRITE_XQUERY) + new CustomCodeContext.CallOptions(Options.WRITE_INVOKE, Options.WRITE_JAVASCRIPT, Options.WRITE_XQUERY, + Options.WRITE_JAVASCRIPT_FILE, Options.WRITE_XQUERY_FILE) ); call.addVariable(determineExternalVariableName(), makeVariableValue()); currentBatch.clear(); @@ -156,22 +144,11 @@ private ArrayNode makeArrayFromCurrentBatch() { return array; } - private String convertRowToJSONString(InternalRow row) { - StringWriter jsonObjectWriter = new StringWriter(); - JacksonGenerator jacksonGenerator = new JacksonGenerator( - this.customCodeContext.getSchema(), - jsonObjectWriter, - Util.DEFAULT_JSON_OPTIONS - ); - jacksonGenerator.write(row); - jacksonGenerator.flush(); - return jsonObjectWriter.toString(); - } - private void executeCall(ServerEvaluationCall call, int itemCount) { try { call.evalAs(String.class); this.successItemCount += itemCount; + WriteProgressLogger.logProgressIfNecessary(itemCount); } catch (RuntimeException ex) { if (customCodeContext.isAbortOnFailure()) { throw ex; diff --git a/src/main/java/com/marklogic/spark/writer/customcode/CustomCodeWriterFactory.java b/src/main/java/com/marklogic/spark/writer/customcode/CustomCodeWriterFactory.java index 28daef21..baaab9ad 100644 --- a/src/main/java/com/marklogic/spark/writer/customcode/CustomCodeWriterFactory.java +++ b/src/main/java/com/marklogic/spark/writer/customcode/CustomCodeWriterFactory.java @@ -16,11 +16,11 @@ public CustomCodeWriterFactory(CustomCodeContext customCodeContext) { @Override public DataWriter createWriter(int partitionId, long taskId) { - return new CustomCodeWriter(customCodeContext, partitionId, taskId, 0l); + return new CustomCodeWriter(customCodeContext); } @Override public DataWriter createWriter(int partitionId, long taskId, long epochId) { - return new CustomCodeWriter(customCodeContext, partitionId, taskId, epochId); + return new CustomCodeWriter(customCodeContext); } } diff --git a/src/main/java/com/marklogic/spark/writer/file/ContentWriter.java b/src/main/java/com/marklogic/spark/writer/file/ContentWriter.java new file mode 100644 index 00000000..d59657ad --- /dev/null +++ b/src/main/java/com/marklogic/spark/writer/file/ContentWriter.java @@ -0,0 +1,141 @@ +package com.marklogic.spark.writer.file; + +import com.fasterxml.jackson.databind.JsonNode; +import com.fasterxml.jackson.databind.ObjectMapper; +import com.marklogic.spark.ConnectorException; +import com.marklogic.spark.Options; +import com.marklogic.spark.reader.document.DocumentRowSchema; +import org.apache.spark.sql.catalyst.InternalRow; + +import javax.xml.XMLConstants; +import javax.xml.transform.*; +import javax.xml.transform.stream.StreamResult; +import javax.xml.transform.stream.StreamSource; +import java.io.ByteArrayInputStream; +import java.io.IOException; +import java.io.OutputStream; +import java.nio.charset.Charset; +import java.util.Map; + +/** + * Knows how to write the value in the 
"content" column of a row conforming to our {@code DocumentRowSchema}. Supports + * pretty-printing as well. This keeps an instance of a JAXP Transformer, which is safe for one thread to use + * multiple times. + */ +class ContentWriter { + + private final Transformer transformer; + private final ObjectMapper objectMapper; + private final boolean prettyPrint; + private final Charset encoding; + + ContentWriter(Map properties) { + this.encoding = determineEncoding(properties); + this.prettyPrint = "true".equalsIgnoreCase(properties.get(Options.WRITE_FILES_PRETTY_PRINT)); + if (this.prettyPrint) { + this.objectMapper = new ObjectMapper(); + this.transformer = newTransformer(); + } else { + this.transformer = null; + this.objectMapper = null; + } + } + + void writeContent(InternalRow row, OutputStream outputStream) throws IOException { + if (this.prettyPrint) { + prettyPrintContent(row, outputStream); + } else { + byte[] bytes = row.getBinary(1); + if (this.encoding != null) { + // We know the string from MarkLogic is UTF-8, so we use getBytes to convert it to the user's + // specified encoding (as opposed to new String(bytes, encoding)). + outputStream.write(new String(bytes).getBytes(this.encoding)); + } else { + outputStream.write(row.getBinary(1)); + } + } + } + + void writeMetadata(InternalRow row, OutputStream outputStream) throws IOException { + String metadataXml = DocumentRowSchema.makeDocumentMetadata(row).toString(); + // Must honor the encoding here as well, as a user could easily have values that require encoding in metadata + // values or in a properties fragment. + if (this.encoding != null) { + outputStream.write(metadataXml.getBytes(this.encoding)); + } else { + outputStream.write(metadataXml.getBytes()); + } + } + + private Charset determineEncoding(Map properties) { + String encodingValue = properties.get(Options.WRITE_FILES_ENCODING); + if (encodingValue != null && encodingValue.trim().length() > 0) { + try { + return Charset.forName(encodingValue); + } catch (Exception ex) { + throw new ConnectorException(String.format("Unsupported encoding value: %s", encodingValue), ex); + } + } + return null; + } + + private Transformer newTransformer() { + try { + TransformerFactory factory = TransformerFactory.newInstance(); + // Disables certain features as recommended by Sonar to prevent security vulnerabilities. + // Also see https://stackoverflow.com/questions/32178558/how-to-prevent-xml-external-entity-injection-on-transformerfactory . + factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true); + factory.setAttribute(XMLConstants.ACCESS_EXTERNAL_DTD, ""); + factory.setAttribute(XMLConstants.ACCESS_EXTERNAL_STYLESHEET, ""); + final Transformer t = factory.newTransformer(); + if (this.encoding != null) { + t.setOutputProperty(OutputKeys.ENCODING, this.encoding.name()); + } else { + t.setOutputProperty(OutputKeys.ENCODING, "UTF-8"); + } + t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes"); + t.setOutputProperty(OutputKeys.INDENT, "yes"); + return t; + } catch (TransformerConfigurationException e) { + throw new ConnectorException( + String.format("Unable to instantiate transformer for pretty-printing XML; cause: %s", e.getMessage()), e + ); + } + } + + private void prettyPrintContent(InternalRow row, OutputStream outputStream) throws IOException { + final byte[] content = row.getBinary(1); + final String format = row.isNullAt(2) ? 
null : row.getString(2); + if ("JSON".equalsIgnoreCase(format)) { + prettyPrintJson(content, outputStream); + } else if ("XML".equalsIgnoreCase(format)) { + prettyPrintXml(content, outputStream); + } else { + if (this.encoding != null) { + outputStream.write(new String(content).getBytes(this.encoding)); + } else { + outputStream.write(content); + } + } + } + + private void prettyPrintJson(byte[] content, OutputStream outputStream) throws IOException { + JsonNode node = this.objectMapper.readTree(content); + String prettyJson = node.toPrettyString(); + if (this.encoding != null) { + outputStream.write(prettyJson.getBytes(this.encoding)); + } else { + outputStream.write(prettyJson.getBytes()); + } + } + + private void prettyPrintXml(byte[] content, OutputStream outputStream) { + Result result = new StreamResult(outputStream); + Source source = new StreamSource(new ByteArrayInputStream(content)); + try { + this.transformer.transform(source, result); + } catch (TransformerException e) { + throw new ConnectorException(String.format("Unable to pretty print XML; cause: %s", e.getMessage()), e); + } + } +} diff --git a/src/main/java/com/marklogic/spark/writer/file/DocumentFileBatch.java b/src/main/java/com/marklogic/spark/writer/file/DocumentFileBatch.java index 276a9e93..5bf31418 100644 --- a/src/main/java/com/marklogic/spark/writer/file/DocumentFileBatch.java +++ b/src/main/java/com/marklogic/spark/writer/file/DocumentFileBatch.java @@ -1,11 +1,13 @@ package com.marklogic.spark.writer.file; +import com.marklogic.spark.Util; import org.apache.hadoop.conf.Configuration; import org.apache.spark.sql.SparkSession; import org.apache.spark.sql.connector.write.BatchWrite; import org.apache.spark.sql.connector.write.DataWriterFactory; import org.apache.spark.sql.connector.write.PhysicalWriteInfo; import org.apache.spark.sql.connector.write.WriterCommitMessage; +import org.apache.spark.sql.types.StructType; import org.apache.spark.util.SerializableConfiguration; import java.util.Map; @@ -13,9 +15,11 @@ class DocumentFileBatch implements BatchWrite { private final Map properties; + private final StructType schema; - DocumentFileBatch(Map properties) { + DocumentFileBatch(Map properties, StructType schema) { this.properties = properties; + this.schema = schema; } @Override @@ -23,12 +27,36 @@ public DataWriterFactory createBatchWriterFactory(PhysicalWriteInfo info) { // This is the last chance we have for accessing the hadoop config, which is needed by the writer. // SerializableConfiguration allows for it to be sent to the factory. Configuration config = SparkSession.active().sparkContext().hadoopConfiguration(); - return new DocumentFileWriterFactory(properties, new SerializableConfiguration(config)); + return new DocumentFileWriterFactory(properties, new SerializableConfiguration(config), schema); } @Override public void commit(WriterCommitMessage[] messages) { - // No messages expected yet. 
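ContentWriter above configures a single JAXP Transformer with secure-processing features and indentation enabled for pretty-printing XML. A self-contained sketch of that setup and a round trip through it (the sample XML string is made up):

```
import javax.xml.XMLConstants;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.StringReader;
import java.io.StringWriter;

// Illustrative only: the same JAXP settings shown above for pretty-printing XML
// while disabling external DTD and stylesheet access.
public class PrettyPrintXmlExample {

    public static void main(String[] args) throws Exception {
        TransformerFactory factory = TransformerFactory.newInstance();
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        factory.setAttribute(XMLConstants.ACCESS_EXTERNAL_DTD, "");
        factory.setAttribute(XMLConstants.ACCESS_EXTERNAL_STYLESHEET, "");

        Transformer transformer = factory.newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");

        StringWriter out = new StringWriter();
        transformer.transform(
            new StreamSource(new StringReader("<doc><child>value</child></doc>")),
            new StreamResult(out));
        System.out.println(out); // indented XML
    }
}
```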
+ if (Util.MAIN_LOGGER.isInfoEnabled() && messages != null && messages.length > 0) { + int fileCount = 0; + int zipFileCount = 0; + int zipEntryCount = 0; + String path = null; + for (WriterCommitMessage message : messages) { + if (message instanceof FileCommitMessage) { + path = ((FileCommitMessage)message).getPath(); + fileCount += ((FileCommitMessage) message).getFileCount(); + } else if (message instanceof ZipCommitMessage) { + path = ((ZipCommitMessage)message).getPath(); + zipFileCount++; + zipEntryCount += ((ZipCommitMessage) message).getZipEntryCount(); + } + } + if (fileCount == 1) { + Util.MAIN_LOGGER.info("Wrote 1 file to {}.", path); + } else if (fileCount > 1) { + Util.MAIN_LOGGER.info("Wrote {} files to {}.", fileCount, path); + } else if (zipFileCount == 1) { + Util.MAIN_LOGGER.info("Wrote 1 zip file containing {} entries to {}.", zipEntryCount, path); + } else if (zipFileCount > 1) { + Util.MAIN_LOGGER.info("Wrote {} zip files containing a total of {} entries to {}.", zipFileCount, zipEntryCount, path); + } + } } @Override diff --git a/src/main/java/com/marklogic/spark/writer/file/DocumentFileWriteBuilder.java b/src/main/java/com/marklogic/spark/writer/file/DocumentFileWriteBuilder.java index bce5e80d..9ce5e53b 100644 --- a/src/main/java/com/marklogic/spark/writer/file/DocumentFileWriteBuilder.java +++ b/src/main/java/com/marklogic/spark/writer/file/DocumentFileWriteBuilder.java @@ -3,15 +3,18 @@ import org.apache.spark.sql.connector.write.BatchWrite; import org.apache.spark.sql.connector.write.Write; import org.apache.spark.sql.connector.write.WriteBuilder; +import org.apache.spark.sql.types.StructType; import java.util.Map; public class DocumentFileWriteBuilder implements WriteBuilder { private final Map properties; + private final StructType schema; - public DocumentFileWriteBuilder(Map properties) { + public DocumentFileWriteBuilder(Map properties, StructType schema) { this.properties = properties; + this.schema = schema; } @Override @@ -19,7 +22,7 @@ public Write build() { return new Write() { @Override public BatchWrite toBatch() { - return new DocumentFileBatch(properties); + return new DocumentFileBatch(properties, schema); } }; } diff --git a/src/main/java/com/marklogic/spark/writer/file/DocumentFileWriter.java b/src/main/java/com/marklogic/spark/writer/file/DocumentFileWriter.java index 47a7a62b..dfcadc81 100644 --- a/src/main/java/com/marklogic/spark/writer/file/DocumentFileWriter.java +++ b/src/main/java/com/marklogic/spark/writer/file/DocumentFileWriter.java @@ -21,23 +21,28 @@ class DocumentFileWriter implements DataWriter { private static final Logger logger = LoggerFactory.getLogger(DocumentFileWriter.class); - private final Map properties; private final SerializableConfiguration hadoopConfiguration; + private final ContentWriter contentWriter; + + private final String path; + private int fileCounter; DocumentFileWriter(Map properties, SerializableConfiguration hadoopConfiguration) { - this.properties = properties; + this.path = properties.get("path"); this.hadoopConfiguration = hadoopConfiguration; + this.contentWriter = new ContentWriter(properties); } @Override public void write(InternalRow row) throws IOException { - final Path path = makePath(row); + final Path filePath = makeFilePath(row); if (logger.isTraceEnabled()) { - logger.trace("Will write to: {}", path); + logger.trace("Will write to: {}", filePath); } - OutputStream outputStream = makeOutputStream(path); + OutputStream outputStream = makeOutputStream(filePath); try { - 
outputStream.write(row.getBinary(1)); + this.contentWriter.writeContent(row, outputStream); + fileCounter++; } finally { IOUtils.closeQuietly(outputStream); } @@ -45,7 +50,7 @@ public void write(InternalRow row) throws IOException { @Override public WriterCommitMessage commit() { - return null; + return new FileCommitMessage(this.path, fileCounter); } @Override @@ -58,11 +63,10 @@ public void close() { // Nothing to close. } - private Path makePath(InternalRow row) { - String dir = properties.get("path"); + private Path makeFilePath(InternalRow row) { final String uri = row.getString(0); - String path = makeFilePath(uri); - return path.charAt(0) == '/' ? new Path(dir + path) : new Path(dir, path); + String filePath = makeFilePath(uri); + return filePath.charAt(0) == '/' ? new Path(this.path + filePath) : new Path(this.path, filePath); } // Protected so it can be overridden by subclass. diff --git a/src/main/java/com/marklogic/spark/writer/file/DocumentFileWriterFactory.java b/src/main/java/com/marklogic/spark/writer/file/DocumentFileWriterFactory.java index e5abe42d..5537a329 100644 --- a/src/main/java/com/marklogic/spark/writer/file/DocumentFileWriterFactory.java +++ b/src/main/java/com/marklogic/spark/writer/file/DocumentFileWriterFactory.java @@ -1,9 +1,11 @@ package com.marklogic.spark.writer.file; import com.marklogic.spark.Options; +import com.marklogic.spark.reader.file.TripleRowSchema; import org.apache.spark.sql.catalyst.InternalRow; import org.apache.spark.sql.connector.write.DataWriter; import org.apache.spark.sql.connector.write.DataWriterFactory; +import org.apache.spark.sql.types.StructType; import org.apache.spark.util.SerializableConfiguration; import java.util.Map; @@ -14,20 +16,25 @@ class DocumentFileWriterFactory implements DataWriterFactory { private final Map properties; private final SerializableConfiguration hadoopConfiguration; + private final StructType schema; - DocumentFileWriterFactory(Map properties, SerializableConfiguration hadoopConfiguration) { + DocumentFileWriterFactory(Map properties, SerializableConfiguration hadoopConfiguration, StructType schema) { this.properties = properties; this.hadoopConfiguration = hadoopConfiguration; + this.schema = schema; } @Override public DataWriter createWriter(int partitionId, long taskId) { + if (this.schema.equals(TripleRowSchema.SCHEMA)) { + return new RdfFileWriter(properties, hadoopConfiguration, partitionId); + } String compression = this.properties.get(Options.WRITE_FILES_COMPRESSION); if (compression != null && compression.length() > 0) { if ("zip".equalsIgnoreCase(compression)) { return new ZipFileWriter(properties, hadoopConfiguration, partitionId); } - return new GZIPFileWriter(properties, hadoopConfiguration); + return new GzipFileWriter(properties, hadoopConfiguration); } return new DocumentFileWriter(properties, hadoopConfiguration); } diff --git a/src/main/java/com/marklogic/spark/writer/file/FileCommitMessage.java b/src/main/java/com/marklogic/spark/writer/file/FileCommitMessage.java new file mode 100644 index 00000000..9f05aa83 --- /dev/null +++ b/src/main/java/com/marklogic/spark/writer/file/FileCommitMessage.java @@ -0,0 +1,22 @@ +package com.marklogic.spark.writer.file; + +import org.apache.spark.sql.connector.write.WriterCommitMessage; + +class FileCommitMessage implements WriterCommitMessage { + + private final String path; + private final int fileCount; + + FileCommitMessage(String path, int fileCount) { + this.path = path; + this.fileCount = fileCount; + } + + String getPath() { + return path; + 
} + + int getFileCount() { + return fileCount; + } +} diff --git a/src/main/java/com/marklogic/spark/writer/file/FileUtil.java b/src/main/java/com/marklogic/spark/writer/file/FileUtil.java index 8ebfa62d..2d5d7e7b 100644 --- a/src/main/java/com/marklogic/spark/writer/file/FileUtil.java +++ b/src/main/java/com/marklogic/spark/writer/file/FileUtil.java @@ -1,6 +1,6 @@ package com.marklogic.spark.writer.file; -import com.marklogic.spark.ConnectorException; +import com.marklogic.spark.Util; import java.net.URI; import java.net.URISyntaxException; @@ -8,15 +8,19 @@ abstract class FileUtil { static String makePathFromDocumentURI(String documentURI) { - // Copied from MLCP - URI uri; + // Mostly copied from MLCP. try { - uri = new URI(documentURI); + URI uri = new URI(documentURI); + // The isOpaque check is made because an opaque URI will not have a path. + return uri.isOpaque() ? uri.getSchemeSpecificPart() : uri.getPath(); } catch (URISyntaxException e) { - throw new ConnectorException(String.format("Unable to construct URI from: %s", documentURI), e); + // MLCP logs errors from parsing the URI at the "WARN" level. That seems noisy, as large numbers of URIs + // could e.g. have spaces in them. So DEBUG is used instead. + if (Util.MAIN_LOGGER.isDebugEnabled()) { + Util.MAIN_LOGGER.debug("Unable to parse document URI: {}; will use unparsed URI as file path.", documentURI); + } + return documentURI; } - // The isOpaque check is made because an opaque URI will not have a path. - return uri.isOpaque() ? uri.getSchemeSpecificPart() : uri.getPath(); } private FileUtil() { diff --git a/src/main/java/com/marklogic/spark/writer/file/GZIPFileWriter.java b/src/main/java/com/marklogic/spark/writer/file/GzipFileWriter.java similarity index 84% rename from src/main/java/com/marklogic/spark/writer/file/GZIPFileWriter.java rename to src/main/java/com/marklogic/spark/writer/file/GzipFileWriter.java index ca479f41..cdd1d040 100644 --- a/src/main/java/com/marklogic/spark/writer/file/GZIPFileWriter.java +++ b/src/main/java/com/marklogic/spark/writer/file/GzipFileWriter.java @@ -8,9 +8,9 @@ import java.util.Map; import java.util.zip.GZIPOutputStream; -class GZIPFileWriter extends DocumentFileWriter { +class GzipFileWriter extends DocumentFileWriter { - GZIPFileWriter(Map properties, SerializableConfiguration hadoopConfiguration) { + GzipFileWriter(Map properties, SerializableConfiguration hadoopConfiguration) { super(properties, hadoopConfiguration); } diff --git a/src/main/java/com/marklogic/spark/writer/file/RdfFileWriter.java b/src/main/java/com/marklogic/spark/writer/file/RdfFileWriter.java new file mode 100644 index 00000000..2082e5e7 --- /dev/null +++ b/src/main/java/com/marklogic/spark/writer/file/RdfFileWriter.java @@ -0,0 +1,212 @@ +package com.marklogic.spark.writer.file; + +import com.marklogic.spark.ConnectorException; +import com.marklogic.spark.ContextSupport; +import com.marklogic.spark.Options; +import com.marklogic.spark.Util; +import org.apache.commons.io.IOUtils; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.LocalFileSystem; +import org.apache.hadoop.fs.Path; +import org.apache.jena.datatypes.BaseDatatype; +import org.apache.jena.graph.Triple; +import org.apache.jena.rdf.model.Property; +import org.apache.jena.rdf.model.RDFNode; +import org.apache.jena.rdf.model.Resource; +import org.apache.jena.rdf.model.ResourceFactory; +import org.apache.jena.riot.Lang; +import org.apache.jena.riot.system.StreamRDF; +import org.apache.jena.riot.system.StreamRDFWriter; +import 
org.apache.jena.sparql.core.Quad; +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.connector.write.DataWriter; +import org.apache.spark.sql.connector.write.WriterCommitMessage; +import org.apache.spark.util.SerializableConfiguration; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.*; +import java.text.SimpleDateFormat; +import java.util.Date; +import java.util.Map; +import java.util.zip.GZIPOutputStream; + +class RdfFileWriter implements DataWriter { + + private static final Logger logger = LoggerFactory.getLogger(RdfFileWriter.class); + + private final RdfContext rdfContext; + private final String path; + private final SerializableConfiguration hadoopConfiguration; + private final int partitionId; + private final LangAndExtension langAndExtension; + private final String graphOverride; + private final boolean isGZIP; + + private OutputStream outputStream; + private StreamRDF stream; + + RdfFileWriter(Map properties, SerializableConfiguration hadoopConfiguration, int partitionId) { + this.rdfContext = new RdfContext(properties); + this.path = properties.get("path"); + this.graphOverride = rdfContext.getStringOption(Options.WRITE_RDF_FILES_GRAPH); + this.hadoopConfiguration = hadoopConfiguration; + this.partitionId = partitionId; + this.langAndExtension = determineLangAndExtension(); + + String value = rdfContext.getStringOption(Options.WRITE_FILES_COMPRESSION); + if ("gzip".equals(value)) { + this.isGZIP = true; + } else if (value != null && value.trim().length() > 0) { + throw new ConnectorException(String.format("Unsupported compression value; only 'gzip' is supported: %s", value)); + } else { + this.isGZIP = false; + } + + if (!this.langAndExtension.supportsGraph() && this.graphOverride != null) { + Util.MAIN_LOGGER.warn("RDF graph '{}' will be ignored since the target RDF format of '{}' does not support graphs.", + this.graphOverride, this.langAndExtension.lang.getName()); + } + } + + @Override + public void write(InternalRow row) throws IOException { + if (outputStream == null) { + createStream(); + } + + final Triple triple = makeTriple(row); + final String graph = determineGraph(row); + if (graph == null || !this.langAndExtension.supportsGraph()) { + this.stream.triple(triple); + } else { + this.stream.quad(new Quad(ResourceFactory.createResource(graph).asNode(), triple)); + } + } + + @Override + public WriterCommitMessage commit() throws IOException { + if (this.stream != null) { + this.stream.finish(); + } + return null; + } + + @Override + public void abort() { + // Nothing to do yet. 
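RdfFileWriter streams rows out through Jena's StreamRDF API, emitting a plain triple when the chosen format has no graph support and a quad otherwise. A minimal sketch of that streaming flow for Turtle output, with a made-up file name and triple:

```
import org.apache.jena.graph.Triple;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.rdf.model.ResourceFactory;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.system.StreamRDF;
import org.apache.jena.riot.system.StreamRDFWriter;

import java.io.FileOutputStream;
import java.io.OutputStream;

// Illustrative only: streams a single hard-coded triple to a Turtle file using
// the same Jena calls referenced above.
public class StreamTurtleExample {

    public static void main(String[] args) throws Exception {
        try (OutputStream out = new FileOutputStream("example.ttl")) {
            StreamRDF stream = StreamRDFWriter.getWriterStream(out, Lang.TTL);
            stream.start();

            Resource subject = ResourceFactory.createResource("http://example.org/subject");
            Property predicate = ResourceFactory.createProperty("http://example.org/predicate");
            Resource object = ResourceFactory.createResource("http://example.org/object");
            Triple triple = ResourceFactory.createStatement(subject, predicate, object).asTriple();

            stream.triple(triple);
            stream.finish();
        }
    }
}
```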
+ } + + @Override + public void close() throws IOException { + IOUtils.closeQuietly(this.outputStream); + } + + private void createStream() throws IOException { + final Path filePath = makeFilePath(); + logger.info("Will write to: {}", filePath.toUri()); + + FileSystem fileSystem = filePath.getFileSystem(this.hadoopConfiguration.value()); + fileSystem.setWriteChecksum(false); + if (fileSystem instanceof LocalFileSystem) { + File file = new File(filePath.toUri().getPath()); + if (!file.exists() && file.getParentFile() != null) { + file.getParentFile().mkdirs(); + } + this.outputStream = new BufferedOutputStream(new FileOutputStream(file, false)); + } else { + this.outputStream = new BufferedOutputStream(fileSystem.create(filePath, false)); + } + + if (isGZIP) { + this.outputStream = new GZIPOutputStream(this.outputStream); + } + + this.stream = StreamRDFWriter.getWriterStream(this.outputStream, langAndExtension.lang); + this.stream.start(); + } + + private Path makeFilePath() { + final String timestamp = new SimpleDateFormat("yyyyMMddHHmmssZ").format(new Date()); + String filename = String.format("%s-%d.%s", timestamp, partitionId, langAndExtension.extension); + if (this.isGZIP) { + filename += ".gz"; + } + return new Path(this.path, filename); + } + + /** + * See https://jena.apache.org/documentation/io/streaming-io.html#rdfformat-and-lang . Tried using RDFFormat for + * more choices, but oddly ran into errors when using TTL and a couple other formats. Lang seems to work just fine. + * + * @return + */ + private LangAndExtension determineLangAndExtension() { + if (rdfContext.hasOption(Options.WRITE_RDF_FILES_FORMAT)) { + String value = rdfContext.getStringOption(Options.WRITE_RDF_FILES_FORMAT); + if ("trig".equalsIgnoreCase(value)) { + return new LangAndExtension(Lang.TRIG, "trig"); + } else if ("nt".equalsIgnoreCase(value) || "ntriples".equalsIgnoreCase(value)) { + return new LangAndExtension(Lang.NTRIPLES, "nt"); + } else if ("nq".equalsIgnoreCase(value) || "nquads".equalsIgnoreCase(value)) { + return new LangAndExtension(Lang.NQUADS, "nq"); + } else if ("trix".equalsIgnoreCase(value)) { + return new LangAndExtension(Lang.TRIX, "trix"); + } else if ("rdfthrift".equalsIgnoreCase(value)) { + return new LangAndExtension(Lang.RDFTHRIFT, "thrift"); + } else if ("rdfproto".equalsIgnoreCase(value)) { + // See https://protobuf.dev/programming-guides/techniques/#suffixes . + return new LangAndExtension(Lang.RDFPROTO, "binpb"); + } + } + return new LangAndExtension(Lang.TTL, "ttl"); + } + + private Triple makeTriple(InternalRow row) { + Resource subject = ResourceFactory.createResource(row.getString(0)); + Property predicate = ResourceFactory.createProperty(row.getString(1)); + RDFNode object = makeObjectNode(row); + return ResourceFactory.createStatement(subject, predicate, object).asTriple(); + } + + private RDFNode makeObjectNode(InternalRow row) { + final String objectValue = row.getString(2); + final int datatypeIndex = 3; + if (row.isNullAt(datatypeIndex)) { + return ResourceFactory.createResource(objectValue); + } + String datatype = row.getString(datatypeIndex); + String lang = row.isNullAt(4) ? null : row.getString(4); + return "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString".equals(datatype) ? + ResourceFactory.createLangLiteral(objectValue, lang) : + ResourceFactory.createTypedLiteral(objectValue, new BaseDatatype(datatype)); + } + + private String determineGraph(InternalRow row) { + if (this.graphOverride != null) { + return this.graphOverride; + } + return row.isNullAt(5) ? 
null : row.getString(5); + } + + // Exists so we can use the convenience methods on ContextSupport. + private static class RdfContext extends ContextSupport { + private RdfContext(Map properties) { + super(properties); + } + } + + private static class LangAndExtension { + private final Lang lang; + private final String extension; + + LangAndExtension(Lang lang, String extension) { + this.lang = lang; + this.extension = extension; + } + + boolean supportsGraph() { + return lang.equals(Lang.TRIG) || lang.equals(Lang.NQUADS) || lang.equals(Lang.TRIX); + } + } +} diff --git a/src/main/java/com/marklogic/spark/writer/file/ZipCommitMessage.java b/src/main/java/com/marklogic/spark/writer/file/ZipCommitMessage.java new file mode 100644 index 00000000..81b8979e --- /dev/null +++ b/src/main/java/com/marklogic/spark/writer/file/ZipCommitMessage.java @@ -0,0 +1,22 @@ +package com.marklogic.spark.writer.file; + +import org.apache.spark.sql.connector.write.WriterCommitMessage; + +class ZipCommitMessage implements WriterCommitMessage { + + private final String path; + private final int zipEntryCount; + + ZipCommitMessage(String path, int zipEntryCount) { + this.path = path; + this.zipEntryCount = zipEntryCount; + } + + String getPath() { + return path; + } + + int getZipEntryCount() { + return zipEntryCount; + } +} diff --git a/src/main/java/com/marklogic/spark/writer/file/ZipFileWriter.java b/src/main/java/com/marklogic/spark/writer/file/ZipFileWriter.java index 3bdb046a..e8229ca6 100644 --- a/src/main/java/com/marklogic/spark/writer/file/ZipFileWriter.java +++ b/src/main/java/com/marklogic/spark/writer/file/ZipFileWriter.java @@ -19,42 +19,60 @@ import java.util.zip.ZipEntry; import java.util.zip.ZipOutputStream; -class ZipFileWriter implements DataWriter { +public class ZipFileWriter implements DataWriter { private static final Logger logger = LoggerFactory.getLogger(ZipFileWriter.class); + private final Map properties; + private final SerializableConfiguration hadoopConfiguration; + + private final String zipPath; + + // These can be instantiated lazily depending on which constructor is used. 
+ private ContentWriter contentWriter; private ZipOutputStream zipOutputStream; + private int zipEntryCounter; + ZipFileWriter(Map properties, SerializableConfiguration hadoopConfiguration, int partitionId) { - Path path = makeFilePath(properties, partitionId); - if (logger.isDebugEnabled()) { - logger.debug("Will write to: {}", path); - } - try { - FileSystem fileSystem = path.getFileSystem(hadoopConfiguration.value()); - fileSystem.setWriteChecksum(false); - zipOutputStream = new ZipOutputStream(fileSystem.create(path, true)); - } catch (IOException e) { - throw new ConnectorException("Unable to create stream for writing zip file: " + e.getMessage(), e); + this(properties.get("path"), properties, hadoopConfiguration, partitionId, true); + } + + public ZipFileWriter(String path, Map properties, SerializableConfiguration hadoopConfiguration, + int partitionId, boolean createZipFileImmediately) { + this.zipPath = makeFilePath(path, partitionId); + this.properties = properties; + this.hadoopConfiguration = hadoopConfiguration; + if (createZipFileImmediately) { + createZipFileAndContentWriter(); } } @Override public void write(InternalRow row) throws IOException { + if (contentWriter == null) { + createZipFileAndContentWriter(); + } final String uri = row.getString(0); final String entryName = FileUtil.makePathFromDocumentURI(uri); zipOutputStream.putNextEntry(new ZipEntry(entryName)); - zipOutputStream.write(row.getBinary(1)); + this.contentWriter.writeContent(row, zipOutputStream); + zipEntryCounter++; + if (hasMetadata(row)) { + zipOutputStream.putNextEntry(new ZipEntry(entryName + ".metadata")); + this.contentWriter.writeMetadata(row, zipOutputStream); + zipEntryCounter++; + } } @Override - public void close() throws IOException { + public void close() { IOUtils.closeQuietly(zipOutputStream); } @Override public WriterCommitMessage commit() { - return null; + return new ZipCommitMessage(zipPath, zipEntryCounter); } @Override @@ -62,6 +80,25 @@ public void abort() { // No action to take. } + private void createZipFileAndContentWriter() { + Path filePath = new Path(zipPath); + if (logger.isDebugEnabled()) { + logger.debug("Will write to: {}", filePath); + } + this.contentWriter = new ContentWriter(properties); + try { + FileSystem fileSystem = filePath.getFileSystem(hadoopConfiguration.value()); + fileSystem.setWriteChecksum(false); + zipOutputStream = new ZipOutputStream(fileSystem.create(filePath, true)); + } catch (IOException e) { + throw new ConnectorException("Unable to create stream for writing zip file: " + e.getMessage(), e); + } + } + + private boolean hasMetadata(InternalRow row) { + return !row.isNullAt(3) || !row.isNullAt(4) || !row.isNullAt(5) || !row.isNullAt(6) || !row.isNullAt(7); + } + /** * Copies some of what MLCP's ArchiveWriter does, but does not create a zip file per document type. The reason * for that behavior in MLCP isn't known. It would not help for importing the zip files, where the URI extension will @@ -70,13 +107,16 @@ public void abort() { * Additionally, a user can arrive at that outcome if desired by using Spark to repartion the dataset based on * the "format" column. 
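The zip layout above is one entry per document plus an optional companion entry with a ".metadata" suffix when any metadata column is populated. A small sketch of writing that layout (the URIs and content are invented):

```
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Illustrative only: one entry per document plus an optional ".metadata"
// companion entry, matching the layout described above.
public class ArchiveEntryExample {

    public static void main(String[] args) throws IOException {
        try (ZipOutputStream zip = new ZipOutputStream(new FileOutputStream("archive-example.zip"))) {
            writeDocument(zip, "/example/doc1.json", "{\"hello\":\"world\"}", "<metadata/>");
            writeDocument(zip, "/example/doc2.json", "{\"hello\":\"again\"}", null);
        }
    }

    static void writeDocument(ZipOutputStream zip, String uri, String content, String metadataXml) throws IOException {
        zip.putNextEntry(new ZipEntry(uri));
        zip.write(content.getBytes(StandardCharsets.UTF_8));
        if (metadataXml != null) {
            // Companion entry written only when metadata is present.
            zip.putNextEntry(new ZipEntry(uri + ".metadata"));
            zip.write(metadataXml.getBytes(StandardCharsets.UTF_8));
        }
    }
}
```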
* - * @param properties + * @param path * @param partitionId * @return */ - private Path makeFilePath(Map properties, int partitionId) { + private String makeFilePath(String path, int partitionId) { final String timestamp = new SimpleDateFormat("yyyyMMddHHmmssZ").format(new Date()); - String path = String.format("%s%s%s-%d.zip", properties.get("path"), File.separator, timestamp, partitionId); - return new Path(path); + return String.format("%s%s%s-%d.zip", path, File.separator, timestamp, partitionId); + } + + public String getZipPath() { + return zipPath; } } diff --git a/src/main/java/com/marklogic/spark/writer/rdf/GraphWriter.java b/src/main/java/com/marklogic/spark/writer/rdf/GraphWriter.java new file mode 100644 index 00000000..95d9483b --- /dev/null +++ b/src/main/java/com/marklogic/spark/writer/rdf/GraphWriter.java @@ -0,0 +1,69 @@ +package com.marklogic.spark.writer.rdf; + +import com.marklogic.client.DatabaseClient; +import com.marklogic.client.eval.ServerEvaluationCall; +import com.marklogic.client.io.DocumentMetadataHandle; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.Set; + +/** + * Knows how to use the non-public "sem:create-graph-document(iri, permissions)" function to write sem:graph + * documents. + */ +public class GraphWriter { + + private static final Logger logger = LoggerFactory.getLogger(GraphWriter.class); + + private final DatabaseClient databaseClient; + private final String permissions; + + public GraphWriter(DatabaseClient databaseClient, String permissionsString) { + this.databaseClient = databaseClient; + this.permissions = permissionsString != null && permissionsString.trim().length() > 0 ? + parsePermissions(permissionsString) : + "xdmp:default-permissions()"; + } + + public void createGraphs(Set graphs) { + for (String graph : graphs) { + StringBuilder query = new StringBuilder("declare variable $GRAPH external; "); + query.append(String.format( + "if (fn:doc-available($GRAPH)) then () else sem:create-graph-document(sem:iri($GRAPH), %s)", + permissions) + ); + + if (logger.isDebugEnabled()) { + logger.debug("Writing graph {} if it does not yet exist.", graph); + } + ServerEvaluationCall call = databaseClient.newServerEval().xquery(query.toString()); + call.addVariable("GRAPH", graph); + call.evalAs(String.class); + } + } + + /** + * We know the permissions string is valid at this point, as if it weren't, the writing process would have failed + * before the connector gets to here. 
+ * + * @param permissions + * @return + */ + private String parsePermissions(final String permissions) { + DocumentMetadataHandle metadata = new DocumentMetadataHandle(); + metadata.getPermissions().addFromDelimitedString(permissions); + StringBuilder permissionsString = new StringBuilder("("); + boolean firstOne = true; + for (String role : metadata.getPermissions().keySet()) { + for (DocumentMetadataHandle.Capability cap : metadata.getPermissions().get(role)) { + if (!firstOne) { + permissionsString.append(", "); + } + permissionsString.append(String.format("xdmp:permission('%s', '%s')", role, cap.toString().toLowerCase())); + firstOne = false; + } + } + return permissionsString.append(")").toString(); + } +} diff --git a/src/main/java/com/marklogic/spark/writer/rdf/RdfRowConverter.java b/src/main/java/com/marklogic/spark/writer/rdf/RdfRowConverter.java new file mode 100644 index 00000000..f76eb2d3 --- /dev/null +++ b/src/main/java/com/marklogic/spark/writer/rdf/RdfRowConverter.java @@ -0,0 +1,112 @@ +package com.marklogic.spark.writer.rdf; + +import com.marklogic.spark.ConnectorException; +import com.marklogic.spark.Options; +import com.marklogic.spark.writer.DocBuilder; +import com.marklogic.spark.writer.RowConverter; +import com.marklogic.spark.writer.WriteContext; +import org.apache.spark.sql.catalyst.InternalRow; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.util.*; +import java.util.stream.Collectors; + +/** + * Converts each row into a sem:triple element, which is then added to a sem:triples XML document associated with a + * graph. + */ +public class RdfRowConverter implements RowConverter { + + static final String DEFAULT_MARKLOGIC_GRAPH = "http://marklogic.com/semantics#default-graph"; + static final String DEFAULT_JENA_GRAPH = "urn:x-arq:DefaultGraphNode"; + + private static final Logger logger = LoggerFactory.getLogger(RdfRowConverter.class); + + // Need to keep track of each graph that is seen in the rows so that they can eventually be created in MarkLogic + // if they don't yet exist. + private final Set graphs = new HashSet<>(); + + // Map of graph name to documents containing sem:triple elements. 
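+ // A triples document is removed from this map and returned as a row once it reaches the maximum triple count; see convertRow().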
+ private Map triplesDocuments = new HashMap<>(); + + private final String defaultGraph; + private final String graphOverride; + + public RdfRowConverter(WriteContext writeContext) { + String graph = writeContext.getStringOption(Options.WRITE_GRAPH); + String tempGraphOverride = writeContext.getStringOption(Options.WRITE_GRAPH_OVERRIDE); + if (graph != null && tempGraphOverride != null) { + throw new ConnectorException(String.format("Can only specify one of %s and %s.", + writeContext.getOptionNameForMessage(Options.WRITE_GRAPH), + writeContext.getOptionNameForMessage(Options.WRITE_GRAPH_OVERRIDE))); + } + if (graph != null) { + this.defaultGraph = graph; + this.graphOverride = null; + } else if (tempGraphOverride != null) { + this.defaultGraph = tempGraphOverride; + this.graphOverride = tempGraphOverride; + } else { + this.defaultGraph = DEFAULT_MARKLOGIC_GRAPH; + this.graphOverride = null; + } + if (logger.isDebugEnabled()) { + logger.debug("Default graph: {}", defaultGraph); + } + } + + @Override + public Optional convertRow(InternalRow row) { + final String graph = determineGraph(row); + graphs.add(graph); + + TriplesDocument triplesDocument; + if (triplesDocuments.containsKey(graph)) { + triplesDocument = triplesDocuments.get(graph); + } else { + triplesDocument = new TriplesDocument(graph); + triplesDocuments.put(graph, triplesDocument); + } + + triplesDocument.addTriple(row); + if (triplesDocument.hasMaxTriples()) { + triplesDocuments.remove(graph); + return Optional.of(triplesDocument.buildDocument()); + } + return Optional.empty(); + } + + /** + * Return a DocumentInputs for each triples document that has not yet reached "max triples". + * + * @return + */ + @Override + public List getRemainingDocumentInputs() { + return this.triplesDocuments.values().stream() + .map(TriplesDocument::buildDocument) + .collect(Collectors.toList()); + } + + /** + * Allows WriteBatcherDataWriter to access all the graphs that this class has seen. + * + * @return + */ + public Set getGraphs() { + return graphs; + } + + private String determineGraph(InternalRow row) { + if (graphOverride != null) { + return graphOverride; + } + if (row.isNullAt(5)) { + return defaultGraph; + } + String graph = row.getString(5); + return DEFAULT_JENA_GRAPH.equals(graph) ? defaultGraph : graph; + } + +} diff --git a/src/main/java/com/marklogic/spark/writer/rdf/TriplesDocument.java b/src/main/java/com/marklogic/spark/writer/rdf/TriplesDocument.java new file mode 100644 index 00000000..5ada0d64 --- /dev/null +++ b/src/main/java/com/marklogic/spark/writer/rdf/TriplesDocument.java @@ -0,0 +1,57 @@ +package com.marklogic.spark.writer.rdf; + +import com.marklogic.client.extra.jdom.JDOMHandle; +import com.marklogic.spark.writer.DocBuilder; +import org.apache.spark.sql.catalyst.InternalRow; +import org.jdom2.Document; +import org.jdom2.Element; +import org.jdom2.Namespace; + +import java.util.UUID; + +/** + * Keeps track of a sem:triples document containing 1 to many sem:triple elements. + */ +class TriplesDocument { + + static final Namespace SEMANTICS_NAMESPACE = Namespace.getNamespace("sem", "http://marklogic.com/semantics"); + private static final Namespace XML_NAMESPACE = Namespace.getNamespace("xml", "http://www.w3.org/XML/1998/namespace"); + + // May allow the user to configure this. 
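+ // Maximum number of sem:triple elements per sem:triples document; once reached, the document is flushed by convertRow() and a new one is started for the graph.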
+ private static final int TRIPLES_PER_DOCUMENT = 100; + + private final String graph; + private final Document document; + private int tripleCount; + + TriplesDocument(String graph) { + this.graph = graph; + this.document = new Document(new Element("triples", SEMANTICS_NAMESPACE)); + } + + void addTriple(InternalRow row) { + Element triple = new Element("triple", SEMANTICS_NAMESPACE); + document.getRootElement().addContent(triple); + triple.addContent(new Element("subject", SEMANTICS_NAMESPACE).addContent(row.getString(0))); + triple.addContent(new Element("predicate", SEMANTICS_NAMESPACE).addContent(row.getString(1))); + Element object = new Element("object", SEMANTICS_NAMESPACE).addContent(row.getString(2)); + if (!row.isNullAt(3)) { + object.setAttribute("datatype", row.getString(3)); + } + if (!row.isNullAt(4)) { + object.setAttribute("lang", row.getString(4), XML_NAMESPACE); + } + triple.addContent(object); + tripleCount++; + } + + boolean hasMaxTriples() { + return tripleCount >= TRIPLES_PER_DOCUMENT; + } + + DocBuilder.DocumentInputs buildDocument() { + JDOMHandle content = new JDOMHandle(document); + String uri = String.format("/triplestore/%s.xml", UUID.randomUUID()); + return new DocBuilder.DocumentInputs(uri, content, null, null, graph); + } +} diff --git a/src/main/resources/marklogic-spark-messages.properties b/src/main/resources/marklogic-spark-messages.properties new file mode 100644 index 00000000..882404bf --- /dev/null +++ b/src/main/resources/marklogic-spark-messages.properties @@ -0,0 +1,18 @@ +# Defines various messages for the connector. Intended to be inherited and overridden by the ETL tool via +# marklogic-spark-messages_en.properties, where each option name can be associated with a CLI option in the ETL tool. +spark.marklogic.client.uri= +spark.marklogic.read.batchSize= +spark.marklogic.read.documents.partitionsPerForest= +spark.marklogic.read.numPartitions= +spark.marklogic.read.noOpticQuery=No Optic query found; must define spark.marklogic.read.opticQuery +spark.marklogic.write.batchSize= +spark.marklogic.write.documentType= +spark.marklogic.write.fileRows.documentType= +spark.marklogic.write.graph= +spark.marklogic.write.graphOverride= +spark.marklogic.write.jsonRootName= +spark.marklogic.write.threadCount= +spark.marklogic.write.threadCountPerPartition= +spark.marklogic.write.transformParams= +spark.marklogic.write.uriTemplate= +spark.marklogic.write.xmlRootName= diff --git a/src/test/java/com/marklogic/spark/AbstractIntegrationTest.java b/src/test/java/com/marklogic/spark/AbstractIntegrationTest.java index 8898ac2f..28a5d043 100644 --- a/src/test/java/com/marklogic/spark/AbstractIntegrationTest.java +++ b/src/test/java/com/marklogic/spark/AbstractIntegrationTest.java @@ -17,11 +17,13 @@ import com.fasterxml.jackson.databind.ObjectMapper; import com.marklogic.client.io.DocumentMetadataHandle; +import com.marklogic.junit5.XmlNode; import com.marklogic.junit5.spring.AbstractSpringMarkLogicTest; import com.marklogic.junit5.spring.SimpleTestConfig; import org.apache.spark.SparkException; import org.apache.spark.sql.*; import org.apache.spark.util.VersionUtils; +import org.jdom2.Namespace; import org.junit.jupiter.api.AfterEach; import org.springframework.beans.factory.annotation.Autowired; import org.springframework.core.io.ClassPathResource; @@ -49,6 +51,7 @@ public abstract class AbstractIntegrationTest extends AbstractSpringMarkLogicTes protected static final String CONNECTOR_IDENTIFIER = "marklogic"; protected static final String NO_AUTHORS_QUERY = 
"op.fromView('Medical', 'NoAuthors', '')"; protected static final String DEFAULT_PERMISSIONS = "spark-user-role,read,spark-user-role,update"; + protected static final Namespace PROPERTIES_NAMESPACE = Namespace.getNamespace("prop", "http://marklogic.com/xdmp/property"); protected static final ObjectMapper objectMapper = new ObjectMapper(); @@ -176,4 +179,12 @@ protected final DocumentMetadataHandle readMetadata(String uri) { // This should really be in marklogic-unit-test. return getDatabaseClient().newDocumentManager().readMetadata(uri, new DocumentMetadataHandle()); } + + @Override + protected XmlNode readDocumentProperties(String uri) { + // This should be fixed in marklogic-unit-test to include the properties namespace by default. + XmlNode props = super.readDocumentProperties(uri); + props.setNamespaces(new Namespace[]{PROPERTIES_NAMESPACE}); + return props; + } } diff --git a/src/test/java/com/marklogic/spark/BuildConnectionPropertiesTest.java b/src/test/java/com/marklogic/spark/BuildConnectionPropertiesTest.java index 4d2a485c..3ed34c4d 100644 --- a/src/test/java/com/marklogic/spark/BuildConnectionPropertiesTest.java +++ b/src/test/java/com/marklogic/spark/BuildConnectionPropertiesTest.java @@ -40,6 +40,22 @@ void useDefaults() { "they can simply set this to 'direct' instead."); } + @Test + void clientUriWithDatabase() { + properties.put(Options.CLIENT_DATABASE, "Documents"); + properties.put(Options.CLIENT_URI, "user:password@host:8016"); + Map connectionProps = new ContextSupport(properties).buildConnectionProperties(); + + assertEquals("Documents", connectionProps.get(Options.CLIENT_DATABASE), "If the user does not specify a " + + "database in the connection string, then the database value specified separately should still be used. " + + "This avoids a potential issue where the user is not aware that the connection string accepts a database " + + "and thus specifies it via the database option."); + assertEquals("user", connectionProps.get(Options.CLIENT_USERNAME)); + assertEquals("password", connectionProps.get(Options.CLIENT_PASSWORD)); + assertEquals("host", connectionProps.get(Options.CLIENT_HOST)); + assertEquals("8016", connectionProps.get(Options.CLIENT_PORT)); + } + @Test void overrideDefaults() { properties.put(AUTH_TYPE, "basic"); diff --git a/src/test/java/com/marklogic/spark/reader/customcode/ReadWithCustomCodeTest.java b/src/test/java/com/marklogic/spark/reader/customcode/ReadWithCustomCodeTest.java index 73a53133..95df9240 100644 --- a/src/test/java/com/marklogic/spark/reader/customcode/ReadWithCustomCodeTest.java +++ b/src/test/java/com/marklogic/spark/reader/customcode/ReadWithCustomCodeTest.java @@ -3,6 +3,7 @@ import com.marklogic.client.FailedRequestException; import com.marklogic.client.MarkLogicIOException; import com.marklogic.spark.AbstractIntegrationTest; +import com.marklogic.spark.ConnectorException; import com.marklogic.spark.Options; import org.apache.spark.SparkException; import org.apache.spark.sql.DataFrameReader; @@ -30,6 +31,16 @@ void evalJavaScript() { verifyUriSchemaIsUsed(rows); } + @Test + void evalJavaScriptFile() { + List rows = readRows(Options.READ_JAVASCRIPT_FILE, "src/test/resources/custom-code/my-reader.js"); + + assertEquals(2, rows.size()); + assertEquals("firstValue", rows.get(0).getString(0)); + assertEquals("secondValue", rows.get(1).getString(0)); + verifyUriSchemaIsUsed(rows); + } + @Test void evalXQuery() { List rows = readRows(Options.READ_XQUERY, "(1,2,3)"); @@ -41,6 +52,25 @@ void evalXQuery() { verifyUriSchemaIsUsed(rows); } 
+ @Test + void evalXQueryFile() { + List rows = readRows(Options.READ_XQUERY_FILE, "src/test/resources/custom-code/my-reader.xqy"); + + assertEquals(3, rows.size(), "Expected 3 rows; actual rows: " + rowsToString(rows)); + assertEquals("1", rows.get(0).getString(0)); + assertEquals("2", rows.get(1).getString(0)); + assertEquals("3", rows.get(2).getString(0)); + verifyUriSchemaIsUsed(rows); + } + + @Test + void evalXQueryFileMissing() { + ConnectorException ex = assertThrowsConnectorException( + () -> readRows(Options.READ_XQUERY_FILE, "doesnt-exist.xqy")); + + assertEquals("Cannot read from file doesnt-exist.xqy; cause: doesnt-exist.xqy was not found.", ex.getMessage()); + } + @Test void invokeJavaScript() { List rows = readRows(Options.READ_INVOKE, "/getAuthors.sjs"); @@ -107,6 +137,13 @@ void partitionsFromJavaScript() { ); } + @Test + void partitionsFromJavaScriptFile() { + verifyRowsAreReadFromEachForest( + Options.READ_PARTITIONS_JAVASCRIPT_FILE, "src/test/resources/custom-code/my-partitions.js" + ); + } + @Test void partitionsFromXQuery() { verifyRowsAreReadFromEachForest( @@ -114,6 +151,13 @@ void partitionsFromXQuery() { ); } + @Test + void partitionsFromXQueryFile() { + verifyRowsAreReadFromEachForest( + Options.READ_PARTITIONS_XQUERY_FILE, "src/test/resources/custom-code/my-partitions.xqy" + ); + } + @Test void partitionsFromInvoke() { verifyRowsAreReadFromEachForest( @@ -129,7 +173,8 @@ void badJavascriptForPartitions() { .load(); RuntimeException ex = assertThrows(RuntimeException.class, () -> dataset.collectAsList()); - assertEquals("Unable to retrieve partitions", ex.getMessage()); + assertTrue(ex.getMessage().contains("Unable to retrieve partitions; cause: Local message: failed to apply resource at eval"), + "Unexpected error: " + ex.getMessage()); assertTrue(ex.getCause() instanceof FailedRequestException, "Unexpected cause: " + ex.getCause()); } @@ -152,6 +197,8 @@ void verifyTimeoutWorks() { private List readRows(String option, String value) { return startRead() .option(option, value) + // Adding these only for manual inspection of logging and to ensure they don't cause errors. 
+ .option(Options.READ_LOG_PROGRESS, "1") .load() .collectAsList(); } diff --git a/src/test/java/com/marklogic/spark/reader/document/BuildSearchQueryTest.java b/src/test/java/com/marklogic/spark/reader/document/BuildSearchQueryTest.java index 8b8bb8a3..7f666a84 100644 --- a/src/test/java/com/marklogic/spark/reader/document/BuildSearchQueryTest.java +++ b/src/test/java/com/marklogic/spark/reader/document/BuildSearchQueryTest.java @@ -48,6 +48,6 @@ void wrongNumberOfParams() { .withTransformParams("param1,value1,param2"); IllegalArgumentException ex = assertThrows(IllegalArgumentException.class, () -> builder.buildQuery(client)); - assertEquals("Transform params must have an equal number of parameter names and values: param1,value1,param2", ex.getMessage()); + assertEquals("Transform parameters must have an equal number of parameter names and values: param1,value1,param2", ex.getMessage()); } } diff --git a/src/test/java/com/marklogic/spark/reader/document/MakeForestPartitionsTest.java b/src/test/java/com/marklogic/spark/reader/document/MakeForestPartitionsTest.java index b06dac93..c4dc9947 100644 --- a/src/test/java/com/marklogic/spark/reader/document/MakeForestPartitionsTest.java +++ b/src/test/java/com/marklogic/spark/reader/document/MakeForestPartitionsTest.java @@ -70,7 +70,7 @@ private void verifyPartition(int partitionIndex, String forestName, int offsetSt ForestPartition partition = partitions[partitionIndex]; assertEquals(forestName, partition.getForestName()); assertEquals(FAKE_SERVER_TIMESTAMP, partition.getServerTimestamp()); - assertEquals(new Long(offsetStart), partition.getOffsetStart()); - assertEquals(offsetEnd == null ? null : new Long(offsetEnd), partition.getOffsetEnd()); + assertEquals(Long.valueOf(offsetStart), partition.getOffsetStart()); + assertEquals(offsetEnd == null ? 
null : Long.valueOf(offsetEnd), partition.getOffsetEnd()); } } diff --git a/src/test/java/com/marklogic/spark/reader/document/ReadDocumentRowsByUrisTest.java b/src/test/java/com/marklogic/spark/reader/document/ReadDocumentRowsByUrisTest.java new file mode 100644 index 00000000..fa48ef73 --- /dev/null +++ b/src/test/java/com/marklogic/spark/reader/document/ReadDocumentRowsByUrisTest.java @@ -0,0 +1,88 @@ +package com.marklogic.spark.reader.document; + +import com.marklogic.spark.AbstractIntegrationTest; +import com.marklogic.spark.Options; +import org.apache.spark.sql.DataFrameReader; +import org.apache.spark.sql.Row; +import org.junit.jupiter.api.Test; + +import java.util.List; + +import static org.junit.jupiter.api.Assertions.assertEquals; + +class ReadDocumentRowsByUrisTest extends AbstractIntegrationTest { + + @Test + void readByUris() { + List rows = startRead() + .option(Options.READ_DOCUMENTS_URIS, "/author/author1.json\n" + + "/author/author12.json\n" + + "/author/author3.json") + .option(Options.READ_DOCUMENTS_DIRECTORY, "/author/") + .option(Options.READ_DOCUMENTS_COLLECTIONS, "author") + .load() + .collectAsList(); + + assertEquals(3, rows.size()); + } + + @Test + void urisWithStringQuery() { + List rows = startRead() + .option(Options.READ_DOCUMENTS_URIS, "/author/author1.json\n" + + "/author/author12.json\n" + + "/author/author3.json") + .option(Options.READ_DOCUMENTS_STRING_QUERY, "Wooles") + .load() + .collectAsList(); + + assertEquals(1, rows.size(), "The string query should result in only author1 being returned."); + assertEquals("/author/author1.json", rows.get(0).getString(0)); + } + + @Test + void urisWithOtherQuery() { + final String query = "{\"ctsquery\": {\"wordQuery\": {\"text\": \"Vivianne\"}}}"; + + List rows = startRead() + .option(Options.READ_DOCUMENTS_URIS, "/author/author12.json") + .option(Options.READ_DOCUMENTS_QUERY, query) + .load() + .collectAsList(); + + assertEquals(1, rows.size(), "When a query is specified along with a list of URIs, the connector should log " + + "a warning about ignoring the query and just using the list of URIs. 
This is due to us not yet having a " + + "reliable way of combining a directory query with a structured query, a serialized CTS query, and a " + + "combined query, particularly when those can be in XML or JSON."); + assertEquals("/author/author12.json", rows.get(0).getString(0)); + } + + @Test + void urisWithWrongDirectory() { + long count = startRead() + .option(Options.READ_DOCUMENTS_URIS, "/author/author1.json") + .option(Options.READ_DOCUMENTS_DIRECTORY, "/wrong/") + .load() + .count(); + + assertEquals(0, count, "This verifies that the directory impacts the list of URIs."); + } + + @Test + void urisWithWrongCollection() { + long count = startRead() + .option(Options.READ_DOCUMENTS_URIS, "/author/author1.json") + .option(Options.READ_DOCUMENTS_COLLECTIONS, "wrong") + .load() + .count(); + + assertEquals(0, count, "This verifies that the collection impacts the list of URIs."); + } + + private DataFrameReader startRead() { + return newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()); + } + +} diff --git a/src/test/java/com/marklogic/spark/reader/document/ReadDocumentRowsTest.java b/src/test/java/com/marklogic/spark/reader/document/ReadDocumentRowsTest.java index 3c570649..02881878 100644 --- a/src/test/java/com/marklogic/spark/reader/document/ReadDocumentRowsTest.java +++ b/src/test/java/com/marklogic/spark/reader/document/ReadDocumentRowsTest.java @@ -6,6 +6,7 @@ import com.marklogic.spark.AbstractIntegrationTest; import com.marklogic.spark.ConnectorException; import com.marklogic.spark.Options; +import com.marklogic.spark.writer.AbstractWriteTest; import org.apache.spark.SparkException; import org.apache.spark.sql.DataFrameReader; import org.apache.spark.sql.Dataset; @@ -18,7 +19,7 @@ import static org.junit.jupiter.api.Assertions.*; -class ReadDocumentRowsTest extends AbstractIntegrationTest { +class ReadDocumentRowsTest extends AbstractWriteTest { @Test void readByCollection() { @@ -37,6 +38,20 @@ void readByCollection() { assertEquals("Vivianne", doc.get("ForeName").asText()); } + @Test + void logProgress() { + newWriter().save(); + + Dataset rows = startRead() + .option(Options.READ_DOCUMENTS_PARTITIONS_PER_FOREST, 1) + .option(Options.READ_DOCUMENTS_COLLECTIONS, "write-test") + .option(Options.READ_BATCH_SIZE, 10) + .option(Options.READ_LOG_PROGRESS, 50) + .load(); + + assertEquals(200, rows.count()); + } + @Test void readViaDirectConnect() { Dataset rows = startRead() @@ -66,7 +81,7 @@ void invalidBatchSize() { .load(); ConnectorException ex = assertThrowsConnectorException(() -> dataset.count()); - assertEquals("Value of 'spark.marklogic.read.batchSize' option must be numeric.", ex.getMessage()); + assertEquals("The value of 'spark.marklogic.read.batchSize' must be numeric.", ex.getMessage()); } @Test diff --git a/src/test/java/com/marklogic/spark/reader/document/ReadDocumentRowsWithMetadataTest.java b/src/test/java/com/marklogic/spark/reader/document/ReadDocumentRowsWithMetadataTest.java index 321096c1..e8b9e2e6 100644 --- a/src/test/java/com/marklogic/spark/reader/document/ReadDocumentRowsWithMetadataTest.java +++ b/src/test/java/com/marklogic/spark/reader/document/ReadDocumentRowsWithMetadataTest.java @@ -1,9 +1,11 @@ package com.marklogic.spark.reader.document; +import com.marklogic.junit5.XmlNode; import com.marklogic.spark.AbstractIntegrationTest; import com.marklogic.spark.Options; import com.marklogic.spark.TestUtil; import org.apache.spark.sql.Row; +import org.jdom2.Namespace; import org.junit.jupiter.api.BeforeEach; 
import org.junit.jupiter.api.Test; import scala.collection.JavaConverters; @@ -123,10 +125,10 @@ private void verifyAllMetadataColumnsArePopulated(Row row) { assertEquals(10, row.getInt(5)); - Map properties = JavaConverters.mapAsJavaMap((scala.collection.immutable.Map) row.get(6)); - assertEquals(2, properties.size()); - assertEquals("value1", properties.get("{org:example}key1")); - assertEquals("value2", properties.get("key2")); + XmlNode properties = new XmlNode(row.getString(6), Namespace.getNamespace("ex", "org:example"), + PROPERTIES_NAMESPACE); + properties.assertElementValue("/prop:properties/ex:key1", "value1"); + properties.assertElementValue("/prop:properties/key2", "value2"); Map metadataValues = JavaConverters.mapAsJavaMap((scala.collection.immutable.Map) row.get(7)); assertEquals(2, metadataValues.size()); diff --git a/src/test/java/com/marklogic/spark/reader/document/ReadDocumentRowsWithPartitionCountsTest.java b/src/test/java/com/marklogic/spark/reader/document/ReadDocumentRowsWithPartitionCountsTest.java index d5f27196..3671370a 100644 --- a/src/test/java/com/marklogic/spark/reader/document/ReadDocumentRowsWithPartitionCountsTest.java +++ b/src/test/java/com/marklogic/spark/reader/document/ReadDocumentRowsWithPartitionCountsTest.java @@ -47,7 +47,7 @@ void invalidValue() { .load(); ConnectorException ex = assertThrows(ConnectorException.class, () -> dataset.count()); - assertEquals("Value of 'spark.marklogic.read.documents.partitionsPerForest' option must be numeric.", ex.getMessage()); + assertEquals("The value of 'spark.marklogic.read.documents.partitionsPerForest' must be numeric.", ex.getMessage()); } @ParameterizedTest diff --git a/src/test/java/com/marklogic/spark/reader/file/ConvertMlcpMetadataTest.java b/src/test/java/com/marklogic/spark/reader/file/ConvertMlcpMetadataTest.java new file mode 100644 index 00000000..75b0414a --- /dev/null +++ b/src/test/java/com/marklogic/spark/reader/file/ConvertMlcpMetadataTest.java @@ -0,0 +1,45 @@ +package com.marklogic.spark.reader.file; + +import com.marklogic.client.io.DocumentMetadataHandle; +import com.marklogic.client.io.Format; +import com.marklogic.junit5.PermissionsTester; +import org.junit.jupiter.api.Test; +import org.springframework.core.io.ClassPathResource; + +import javax.xml.namespace.QName; +import java.io.InputStream; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertTrue; + +class ConvertMlcpMetadataTest { + + @Test + void test() throws Exception { + InputStream input = new ClassPathResource("mlcp-metadata/complete.xml").getInputStream(); + MlcpMetadata mlcpMetadata = new MlcpMetadataConverter().convert(input); + assertEquals(Format.XML, mlcpMetadata.getFormat()); + + DocumentMetadataHandle metadata = mlcpMetadata.getMetadata(); + + assertEquals(2, metadata.getCollections().size()); + assertTrue(metadata.getCollections().contains("collection1")); + assertTrue(metadata.getCollections().contains("collection2")); + + assertEquals(10, metadata.getQuality()); + + assertEquals(2, metadata.getMetadataValues().size()); + assertEquals("value1", metadata.getMetadataValues().get("meta1")); + assertEquals("value2", metadata.getMetadataValues().get("meta2")); + + assertEquals(2, metadata.getProperties().size()); + assertEquals("value1", metadata.getProperties().get(new QName("org:example", "key1"))); + assertEquals("value2", metadata.getProperties().get("key2")); + + assertEquals(2, metadata.getPermissions().size()); + PermissionsTester tester = new 
PermissionsTester(metadata.getPermissions()); + tester.assertReadPermissionExists("spark-user-role"); + tester.assertUpdatePermissionExists("spark-user-role"); + tester.assertReadPermissionExists("qconsole-user"); + } +} diff --git a/src/test/java/com/marklogic/spark/reader/file/MakeFilePartitionsTest.java b/src/test/java/com/marklogic/spark/reader/file/MakeFilePartitionsTest.java new file mode 100644 index 00000000..19bf80c1 --- /dev/null +++ b/src/test/java/com/marklogic/spark/reader/file/MakeFilePartitionsTest.java @@ -0,0 +1,38 @@ +package com.marklogic.spark.reader.file; + +import org.junit.jupiter.api.Test; + +import static org.junit.jupiter.api.Assertions.assertEquals; + +class MakeFilePartitionsTest { + + @Test + void oddNumber() { + FilePartition[] partitions = FileUtil.makeFilePartitions(new String[]{"A", "B", "C", "D", "E"}, 3); + assertEquals(3, partitions.length); + assertEquals("A", partitions[0].getPaths().get(0)); + assertEquals("B", partitions[0].getPaths().get(1)); + assertEquals("C", partitions[1].getPaths().get(0)); + assertEquals("D", partitions[1].getPaths().get(1)); + assertEquals("E", partitions[2].getPaths().get(0)); + } + + @Test + void evenPartitions() { + FilePartition[] partitions = FileUtil.makeFilePartitions(new String[]{"A", "B", "C", "D"}, 2); + assertEquals(2, partitions.length); + assertEquals("A", partitions[0].getPaths().get(0)); + assertEquals("B", partitions[0].getPaths().get(1)); + assertEquals("C", partitions[1].getPaths().get(0)); + assertEquals("D", partitions[1].getPaths().get(1)); + } + + @Test + void morePartitionsThanFiles() { + FilePartition[] partitions = FileUtil.makeFilePartitions(new String[]{"A", "B", "C"}, 4); + assertEquals(3, partitions.length); + assertEquals("A", partitions[0].getPaths().get(0)); + assertEquals("B", partitions[1].getPaths().get(0)); + assertEquals("C", partitions[2].getPaths().get(0)); + } +} diff --git a/src/test/java/com/marklogic/spark/reader/file/ReadAggregateXMLFilesTest.java b/src/test/java/com/marklogic/spark/reader/file/ReadAggregateXmlFilesTest.java similarity index 65% rename from src/test/java/com/marklogic/spark/reader/file/ReadAggregateXMLFilesTest.java rename to src/test/java/com/marklogic/spark/reader/file/ReadAggregateXmlFilesTest.java index bc8e3db6..62800954 100644 --- a/src/test/java/com/marklogic/spark/reader/file/ReadAggregateXMLFilesTest.java +++ b/src/test/java/com/marklogic/spark/reader/file/ReadAggregateXmlFilesTest.java @@ -11,16 +11,18 @@ import java.util.List; -import static org.junit.jupiter.api.Assertions.assertEquals; -import static org.junit.jupiter.api.Assertions.assertTrue; +import static org.junit.jupiter.api.Assertions.*; -class ReadAggregateXMLFilesTest extends AbstractIntegrationTest { +class ReadAggregateXmlFilesTest extends AbstractIntegrationTest { + + private static final String ISO_8859_1_ENCODED_FILE = "src/test/resources/encoding/medline.iso-8859-1.txt"; @Test void noNamespace() { List rows = newSparkSession().read() .format(CONNECTOR_IDENTIFIER) .option(Options.READ_AGGREGATES_XML_ELEMENT, "Employee") + .option(Options.READ_NUM_PARTITIONS, 1) .load("src/test/resources/aggregates") .collectAsList(); @@ -140,12 +142,81 @@ void notXmlFile() { "The error should identify the fail and the root cause; actual error: " + message); } + @Test + void ignoreUriElementNotFound() { + long count = newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.READ_NUM_PARTITIONS, 1) + .option(Options.READ_AGGREGATES_XML_ELEMENT, "Employee") + 
.option(Options.READ_AGGREGATES_XML_URI_ELEMENT, "id") + .option(Options.READ_FILES_ABORT_ON_FAILURE, false) + .load("src/test/resources/aggregates/employees.xml") + .count(); + + assertEquals(2, count, "The one employee without an 'id' element should have caused an error to be " + + "caught and logged. The 2 employees with 'id' should be returned since abortOnFailure = false."); + } + + @Test + void ignoreInvalidXmlFile() { + long count = newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.READ_NUM_PARTITIONS, 1) + .option(Options.READ_AGGREGATES_XML_ELEMENT, "Employee") + .option(Options.READ_FILES_ABORT_ON_FAILURE, false) + .load("src/test/resources/junit-platform.properties", "src/test/resources/aggregates/employees.xml") + .count(); + + assertEquals(3, count); + } + + @Test + void encoding() { + List rows = newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.READ_AGGREGATES_XML_ELEMENT, "MedlineCitation") + .option(Options.READ_FILES_ENCODING, "ISO-8859-1") + .load(ISO_8859_1_ENCODED_FILE) + .collectAsList(); + + assertEquals(2, rows.size(), "This verifies that the encoded file can be parsed correctly when the user " + + "specifies the associated encoding as an option."); + } + + @Test + void wrongEncoding() { + Dataset dataset = newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.READ_AGGREGATES_XML_ELEMENT, "MedlineCitation") + .option(Options.READ_FILES_ENCODING, "UTF-16") + .load(ISO_8859_1_ENCODED_FILE); + + ConnectorException ex = assertThrowsConnectorException(() -> dataset.show()); + assertTrue(ex.getMessage().contains("Failed to traverse document"), "When an incorrect encoding is used, " + + "the connector should throw an error stating that it cannot read the document. The stacktrace has more " + + "detail in it. 
Actual error: " + ex.getMessage()); + } + + @Test + void invalidEncoding() { + Dataset dataset = newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.READ_AGGREGATES_XML_ELEMENT, "MedlineCitation") + .option(Options.READ_FILES_ENCODING, "Not-a-real-encoding") + .load(ISO_8859_1_ENCODED_FILE); + + ConnectorException ex = assertThrows(ConnectorException.class, () -> dataset.show()); + assertTrue(ex.getMessage().contains("Unsupported encoding value: Not-a-real-encoding"), + "Actual error: " + ex.getMessage()); + } + private void verifyRow(Row row, String expectedUriSuffix, String rootPath, String name, int age) { String uri = row.getString(0); - assertTrue(uri.endsWith(expectedUriSuffix), String.format("URI %s doesn't end with %s", uri, expectedUriSuffix)); - String xml = new String((byte[]) row.get(3)); + assertTrue(uri.endsWith(expectedUriSuffix), format("URI %s doesn't end with %s", uri, expectedUriSuffix)); + String xml = new String((byte[]) row.get(1)); XmlNode doc = new XmlNode(xml, Namespace.getNamespace("ex", "org:example")); doc.assertElementValue(String.format("%sname", rootPath), name); doc.assertElementValue(String.format("%sage", rootPath), Integer.toString(age)); + assertEquals("xml", row.getString(2)); } } diff --git a/src/test/java/com/marklogic/spark/reader/file/ReadAggregateXMLZipFilesTest.java b/src/test/java/com/marklogic/spark/reader/file/ReadAggregateXmlZipFilesTest.java similarity index 65% rename from src/test/java/com/marklogic/spark/reader/file/ReadAggregateXMLZipFilesTest.java rename to src/test/java/com/marklogic/spark/reader/file/ReadAggregateXmlZipFilesTest.java index 5a28c89c..061260a4 100644 --- a/src/test/java/com/marklogic/spark/reader/file/ReadAggregateXMLZipFilesTest.java +++ b/src/test/java/com/marklogic/spark/reader/file/ReadAggregateXmlZipFilesTest.java @@ -14,7 +14,7 @@ import static org.junit.jupiter.api.Assertions.assertEquals; import static org.junit.jupiter.api.Assertions.assertTrue; -class ReadAggregateXMLZipFilesTest extends AbstractIntegrationTest { +class ReadAggregateXmlZipFilesTest extends AbstractIntegrationTest { @Test void zipWithTwoAggregateXMLFiles() { @@ -34,6 +34,23 @@ void zipWithTwoAggregateXMLFiles() { verifyRow(rows.get(3), "employee-aggregates.zip-2-1.xml", rootPath, "Linda", 43); } + @Test + void twoZipsOnePartition() { + List rows = newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.READ_NUM_PARTITIONS, 1) + .option(Options.READ_AGGREGATES_XML_ELEMENT, "Employee") + .option(Options.READ_FILES_COMPRESSION, "zip") + .load( + "src/test/resources/aggregate-zips/employee-aggregates.zip", + "src/test/resources/aggregate-zips/employee-aggregates-copy.zip" + ) + .collectAsList(); + + assertEquals(8, rows.size(), "The two zip files, each containing 4 rows, should be read by a single " + + "partition reader that iterates over the two file paths."); + } + @Test void uriElementHasMixedContent() { Dataset dataset = newSparkSession().read() @@ -102,12 +119,57 @@ void notXmlFileInZip() { "The error should identify the file and the root cause; actual error: " + message); } + @Test + void ignoreUriElementNotFound() { + long count = newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.READ_AGGREGATES_XML_ELEMENT, "Employee") + .option(Options.READ_AGGREGATES_XML_URI_ELEMENT, "id") + .option(Options.READ_FILES_COMPRESSION, "zip") + .option(Options.READ_FILES_ABORT_ON_FAILURE, false) + .load("src/test/resources/aggregate-zips/employee-aggregates.zip") + .count(); + + assertEquals(3, 
count, "The element without an 'id' element should be ignored and an error should be logged " + + "for it, and then the 3 elements with an 'id' child element should be returned as rows. 2 of those " + + "elements come from employees.xml, and the 3rd comes from employees2.xml."); + } + + @Test + void ignoreBadFileInZip() { + long count = newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.READ_AGGREGATES_XML_ELEMENT, "Employee") + .option(Options.READ_FILES_COMPRESSION, "zip") + .option(Options.READ_FILES_ABORT_ON_FAILURE, false) + .load("src/test/resources/aggregate-zips/xml-and-json.zip") + .count(); + + assertEquals(3, count, "The JSON file in the zip should result in the error being logged, and the valid " + + "XML file should still be processed."); + } + + @Test + void ignoreAllBadFilesInZip() { + long count = newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.READ_AGGREGATES_XML_ELEMENT, "Employee") + .option(Options.READ_FILES_COMPRESSION, "zip") + .option(Options.READ_FILES_ABORT_ON_FAILURE, false) + .load("src/test/resources/zip-files/mixed-files.zip") + .count(); + + assertEquals(0, count, "Every file in mixed-files.zip is either not XML or it's an XML document " + + "with no occurrences of 'Employee', so 0 rows should be returned."); + } + private void verifyRow(Row row, String expectedUriSuffix, String rootPath, String name, int age) { String uri = row.getString(0); assertTrue(uri.endsWith(expectedUriSuffix), String.format("URI %s doesn't end with %s", uri, expectedUriSuffix)); - String xml = new String((byte[]) row.get(3)); + String xml = new String((byte[]) row.get(1)); XmlNode doc = new XmlNode(xml, Namespace.getNamespace("ex", "org:example")); doc.assertElementValue(String.format("%sname", rootPath), name); doc.assertElementValue(String.format("%sage", rootPath), Integer.toString(age)); + assertEquals("xml", row.getString(2)); } } diff --git a/src/test/java/com/marklogic/spark/reader/file/ReadArchiveFileTest.java b/src/test/java/com/marklogic/spark/reader/file/ReadArchiveFileTest.java new file mode 100644 index 00000000..ea0c5b92 --- /dev/null +++ b/src/test/java/com/marklogic/spark/reader/file/ReadArchiveFileTest.java @@ -0,0 +1,272 @@ +package com.marklogic.spark.reader.file; + +import com.marklogic.junit5.XmlNode; +import com.marklogic.spark.AbstractIntegrationTest; +import com.marklogic.spark.ConnectorException; +import com.marklogic.spark.Options; +import com.marklogic.spark.TestUtil; +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.SaveMode; +import org.jdom2.Namespace; +import org.junit.jupiter.api.BeforeEach; +import org.junit.jupiter.api.Test; +import org.junit.jupiter.api.io.TempDir; +import scala.collection.JavaConversions; +import scala.collection.mutable.WrappedArray; + +import java.nio.charset.StandardCharsets; +import java.nio.file.Path; +import java.util.List; +import java.util.Map; + +import static org.junit.jupiter.api.Assertions.*; + +class ReadArchiveFileTest extends AbstractIntegrationTest { + + @BeforeEach + void beforeEach() { + TestUtil.insertTwoDocumentsWithAllMetadata(getDatabaseClient()); + } + + @Test + void testAllMetadata(@TempDir Path tempDir) { + newSparkSession().read().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.READ_DOCUMENTS_COLLECTIONS, "collection1") + .option(Options.READ_DOCUMENTS_CATEGORIES, "content,metadata") + .load() + .repartition(1) + .write().format(CONNECTOR_IDENTIFIER) + 
.option(Options.WRITE_FILES_COMPRESSION, "zip") + .mode(SaveMode.Append) + .save(tempDir.toFile().getAbsolutePath()); + + verifyAllMetadata(tempDir, 2); + } + + @Test + void testCollectionsAndPermissions(@TempDir Path tempDir) { + newSparkSession().read().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.READ_DOCUMENTS_COLLECTIONS, "collection1") + .option(Options.READ_DOCUMENTS_CATEGORIES, "content,metadata") + .load() + .repartition(1) + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.WRITE_FILES_COMPRESSION, "zip") + .option(Options.WRITE_FILES_COMPRESSION, "zip") + .mode(SaveMode.Append) + .save(tempDir.toFile().getAbsolutePath()); + + List rows = newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_TYPE, "archive") + .option(Options.READ_ARCHIVES_CATEGORIES, "collections,permissions") + .load(tempDir.toFile().getAbsolutePath()) + .collectAsList(); + assertEquals(2, rows.size(), "Expecting 2 rows in the zip."); + rows.forEach(row -> { + verifyContent(row); + assertTrue(row.isNullAt(2)); + verifyCollections(row); + verifyPermissions(row); + assertNull(row.get(5), "Quality column should be null."); + assertNull(row.get(6), "Properties column should be null."); + assertNull(row.get(7), "MetadataValues column should be null."); + }); + } + + @Test + void testQualityPropertiesAndMetadatavalues(@TempDir Path tempDir) { + newSparkSession().read().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.READ_DOCUMENTS_COLLECTIONS, "collection1") + .option(Options.READ_DOCUMENTS_CATEGORIES, "content,metadata") + .load() + .repartition(1) + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.WRITE_FILES_COMPRESSION, "zip") + .mode(SaveMode.Append) + .save(tempDir.toFile().getAbsolutePath()); + + List rows = newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_TYPE, "archive") + .option(Options.READ_ARCHIVES_CATEGORIES, "quality,properties,metadataValues") + .load(tempDir.toFile().getAbsolutePath()) + .collectAsList(); + assertEquals(2, rows.size(), "Expecting 2 rows in the zip."); + rows.forEach(row -> { + verifyContent(row); + assertTrue(row.isNullAt(2)); + assertNull(row.get(3), "Collections column should be null."); + assertNull(row.get(4), "Permissions column should be null."); + assertEquals(10, row.get(5)); + verifyProperties(row); + verifyMetadatavalues(row); + }); + } + + @Test + void testCollectionsAndMetadataValues(@TempDir Path tempDir) { + newSparkSession().read().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.READ_DOCUMENTS_COLLECTIONS, "collection1") + .option(Options.READ_DOCUMENTS_CATEGORIES, "content,metadata") + .load() + .repartition(1) + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.WRITE_FILES_COMPRESSION, "zip") + .mode(SaveMode.Append) + .save(tempDir.toFile().getAbsolutePath()); + + List rows = newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_TYPE, "archive") + .option(Options.READ_ARCHIVES_CATEGORIES, "collections,metadataValues") + .load(tempDir.toFile().getAbsolutePath()) + .collectAsList(); + assertEquals(2, rows.size(), "Expecting 2 rows in the zip."); + rows.forEach(row -> { + verifyContent(row); + assertTrue(row.isNullAt(2)); + verifyCollections(row); + assertNull(row.get(4), "Permissions column should be null."); + assertNull(row.get(5), "Quality column should be null."); + assertNull(row.get(6), "Properties column 
should be null."); + verifyMetadatavalues(row); + }); + } + + @Test + void invalidArchiveAndAbort() { + Dataset dataset = newSparkSession().read().format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_TYPE, "archive") + .load("src/test/resources/archive-files/firstEntryInvalid.zip"); + + ConnectorException ex = assertThrowsConnectorException(() -> dataset.count()); + assertTrue(ex.getMessage().startsWith("Could not find metadata entry for entry test/1.xml in file"), + "The connector should default to throwing an error when it cannot find a metadata entry; error: " + ex.getMessage()); + } + + @Test + void invalidArchiveDontAbort() { + List rows = newSparkSession().read().format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_TYPE, "archive") + .option(Options.READ_FILES_ABORT_ON_FAILURE, false) + .load( + "src/test/resources/archive-files/firstEntryInvalid.zip", + "src/test/resources/archive-files/archive1.zip" + ) + .collectAsList(); + + assertEquals(2, rows.size(), "Because the first entry in bad.zip is invalid as it does not have a metadata " + + "entry, no rows should be returned for that zip, but the connector should still process the valid " + + "zip and return its 2 rows."); + + assertEquals("/test/1.xml", rows.get(0).getString(0)); + assertEquals("/test/2.xml", rows.get(1).getString(0)); + } + + @Test + void secondEntryInvalidDontAbort() { + List rows = newSparkSession().read().format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_TYPE, "archive") + .option(Options.READ_FILES_ABORT_ON_FAILURE, false) + .load("src/test/resources/archive-files/secondEntryInvalid.zip") + .collectAsList(); + + assertEquals(1, rows.size(), "The first entry in the zip is valid, and so a row should be returned for it " + + "and its associated metadata entry. The second entry is invalid because it's missing a metadata entry. " + + "But no error should be thrown since the connector is configured to not abort on failure."); + assertEquals("test/1.xml", rows.get(0).getString(0)); + } + + @Test + void threeZipsOnePartition() { + List rows = newSparkSession().read().format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_TYPE, "archive") + .option(Options.READ_FILES_ABORT_ON_FAILURE, false) + .option(Options.READ_NUM_PARTITIONS, 1) + .load( + "src/test/resources/archive-files/archive1.zip", + "src/test/resources/archive-files/firstEntryInvalid.zip", + "src/test/resources/archive-files/secondEntryInvalid.zip" + ) + .collectAsList(); + + assertEquals(3, rows.size(), "Expecting 2 rows from archive1.zip, none from firstEntryInvalid.zip, " + + "and 1 from secondEntryInvalid.zip."); + } + + /** + * Verifies that the encoding is applied to documents read from an archive zip. In this case, iso-8859-1 is required + * for the content entry but still works fine for the metadata entry, which is UTF-8. 
+ */ + @Test + void customEncoding() { + newSparkSession().read().format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_TYPE, "archive") + .option(Options.READ_FILES_ENCODING, "iso-8859-1") + .load("src/test/resources/encoding/medline.iso-8859-1.archive.zip") + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .mode(SaveMode.Append) + .save(); + + XmlNode doc = readXmlDocument("test/medline.iso-8859-1.xml", "collection1"); + doc.assertElementExists("/MedlineCitationSet"); + } + + private void verifyAllMetadata(Path tempDir, int rowCount) { + List rows = sparkSession.read().format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_TYPE, "archive") + .load(tempDir.toFile().getAbsolutePath()) + .collectAsList(); + assertEquals(rowCount, rows.size(), "Expecting 2 rows in the zip."); + + for (int i = 0; i < rowCount; i++) { + Row row = rows.get(i); + assertTrue(row.getString(0).endsWith("/test/" + (i + 1) + ".xml")); + verifyContent(row); + assertTrue(row.isNullAt(2), "There's no indication in an archive file as to what the format of a " + + "content entry is, so the 'format' column should always be null."); + verifyCollections(row); + verifyPermissions(row); + assertEquals(10, row.get(5)); + verifyProperties(row); + verifyMetadatavalues(row); + } + } + + private void verifyContent(Row row) { + String content = new String((byte[]) row.get(1), StandardCharsets.UTF_8); + assertTrue(content.contains("world")); + } + + private void verifyCollections(Row row) { + List collections = JavaConversions.seqAsJavaList(row.getSeq(3)); + assertEquals("collection1", collections.get(0)); + assertEquals("collection2", collections.get(1)); + } + + private void verifyPermissions(Row row) { + Map permissions = row.getJavaMap(4); + assertTrue(permissions.get("spark-user-role").toString().contains("READ")); + assertTrue(permissions.get("spark-user-role").toString().contains("UPDATE")); + assertTrue(permissions.get("qconsole-user").toString().contains("READ")); + } + + private void verifyProperties(Row row) { + XmlNode properties = new XmlNode(row.getString(6), PROPERTIES_NAMESPACE, Namespace.getNamespace("ex", "org:example")); + properties.assertElementValue("/prop:properties/ex:key1", "value1"); + properties.assertElementValue("/prop:properties/key2", "value2"); + } + + private void verifyMetadatavalues(Row row) { + Map metadataValues = row.getJavaMap(7); + assertEquals("value1", metadataValues.get("meta1")); + assertEquals("value2", metadataValues.get("meta2")); + } +} diff --git a/src/test/java/com/marklogic/spark/reader/file/ReadGenericFilesTest.java b/src/test/java/com/marklogic/spark/reader/file/ReadGenericFilesTest.java new file mode 100644 index 00000000..5f123895 --- /dev/null +++ b/src/test/java/com/marklogic/spark/reader/file/ReadGenericFilesTest.java @@ -0,0 +1,128 @@ +package com.marklogic.spark.reader.file; + +import com.fasterxml.jackson.databind.JsonNode; +import com.marklogic.client.io.BytesHandle; +import com.marklogic.client.io.Format; +import com.marklogic.client.io.StringHandle; +import com.marklogic.junit5.XmlNode; +import com.marklogic.spark.AbstractIntegrationTest; +import com.marklogic.spark.ConnectorException; +import com.marklogic.spark.Options; +import org.apache.spark.sql.DataFrameWriter; +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.SaveMode; +import org.junit.jupiter.api.Test; + +import java.util.List; +import java.util.stream.Stream; + +import static org.junit.jupiter.api.Assertions.*; + 
+/** + * The generic file reader has support for aborting or continuing on failure - but we haven't yet found a way to + * force an error to occur. An encoding issue doesn't cause an error because the reader simply reads in all the + * bytes from the file. + */ +class ReadGenericFilesTest extends AbstractIntegrationTest { + + private static final String ISO_8859_1_ENCODED_FILE = "src/test/resources/encoding/medline.iso-8859-1.txt"; + + @Test + void readAndWriteMixedFiles() { + Dataset dataset = newSparkSession().read().format(CONNECTOR_IDENTIFIER) + .option(Options.READ_NUM_PARTITIONS, 2) + .load("src/test/resources/mixed-files"); + + List rows = dataset.collectAsList(); + assertEquals(4, rows.size()); + rows.forEach(row -> { + assertFalse(row.isNullAt(0)); // URI + assertFalse(row.isNullAt(1)); // content + Stream.of(2, 3, 4, 5, 6, 7).forEach(index -> assertTrue(row.isNullAt(index), + "Expecting a null value for every column that isn't URI or content; index: " + index)); + }); + + defaultWrite(dataset.write().format(CONNECTOR_IDENTIFIER) + .option(Options.WRITE_COLLECTIONS, "generic") + .option(Options.WRITE_URI_REPLACE, ".*/mixed-files,''")); + + JsonNode doc = readJsonDocument("/hello.json", "generic"); + assertEquals("world", doc.get("hello").asText()); + XmlNode xmlDoc = readXmlDocument("/hello.xml", "generic"); + xmlDoc.assertElementValue("/hello", "world"); + String text = getDatabaseClient().newTextDocumentManager().read("/hello.txt", new StringHandle()).get(); + assertEquals("hello world", text.trim()); + BytesHandle handle = getDatabaseClient().newBinaryDocumentManager().read("/hello2.txt.gz", new BytesHandle()); + assertEquals(Format.BINARY, handle.getFormat()); + } + + /** + * Need to actually write the document to force an error to occur. 
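+ * Reading only captures the raw bytes, so a wrong encoding is not detected until MarkLogic rejects the non-UTF-8 content during the write.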
+ */ + @Test + void wrongEncoding() { + DataFrameWriter writer = newSparkSession().read().format(CONNECTOR_IDENTIFIER) + .load(ISO_8859_1_ENCODED_FILE) + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.WRITE_PERMISSIONS, DEFAULT_PERMISSIONS) + .mode(SaveMode.Append); + + ConnectorException ex = assertThrowsConnectorException(() -> writer.save()); + assertTrue(ex.getMessage().contains("document is not UTF-8 encoded"), "Actual error: " + ex.getMessage()); + } + + @Test + void customEncoding() { + newSparkSession().read().format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_ENCODING, "ISO-8859-1") + .load(ISO_8859_1_ENCODED_FILE) + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.WRITE_PERMISSIONS, DEFAULT_PERMISSIONS) + .option(Options.WRITE_URI_TEMPLATE, "/iso-doc.xml") + .mode(SaveMode.Append) + .save(); + + XmlNode doc = readXmlDocument("/iso-doc.xml"); + doc.assertElementExists("/MedlineCitationSet"); + doc.assertElementValue("/MedlineCitationSet/MedlineCitation/Affiliation", + "Istituto di Anatomia e Istologia Patologica, Università di Ferrara, Italy."); + } + + @Test + void invalidEncodingValue() { + DataFrameWriter writer = newSparkSession().read().format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_ENCODING, "not-a-real-encoding") + .load(ISO_8859_1_ENCODED_FILE) + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.WRITE_PERMISSIONS, DEFAULT_PERMISSIONS) + .mode(SaveMode.Append); + + ConnectorException ex = assertThrows(ConnectorException.class, () -> writer.save()); + assertTrue(ex.getMessage().contains("Unsupported encoding value: not-a-real-encoding"), "Actual error: " + ex.getMessage()); + } + + /** + * Verifies that encoding is applied when a file is gzipped as well. Neat! 
+ */ + @Test + void gzippedCustomEncoding() { + newSparkSession().read().format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_ENCODING, "ISO-8859-1") + .option(Options.READ_FILES_COMPRESSION, "gzip") + .load("src/test/resources/encoding/medline2.iso-8859-1.xml.gz") + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.WRITE_PERMISSIONS, DEFAULT_PERMISSIONS) + .option(Options.WRITE_COLLECTIONS, "encoding-test") + .mode(SaveMode.Append) + .save(); + + String uri = getUrisInCollection("encoding-test", 1).get(0); + XmlNode doc = readXmlDocument(uri); + doc.assertElementExists("/MedlineCitationSet"); + } +} diff --git a/src/test/java/com/marklogic/spark/reader/file/ReadGZIPAggregateXMLFilesTest.java b/src/test/java/com/marklogic/spark/reader/file/ReadGzipAggregateXmlFilesTest.java similarity index 70% rename from src/test/java/com/marklogic/spark/reader/file/ReadGZIPAggregateXMLFilesTest.java rename to src/test/java/com/marklogic/spark/reader/file/ReadGzipAggregateXmlFilesTest.java index 85132b04..4432ad8d 100644 --- a/src/test/java/com/marklogic/spark/reader/file/ReadGZIPAggregateXMLFilesTest.java +++ b/src/test/java/com/marklogic/spark/reader/file/ReadGzipAggregateXmlFilesTest.java @@ -14,7 +14,7 @@ import static org.junit.jupiter.api.Assertions.assertEquals; import static org.junit.jupiter.api.Assertions.assertTrue; -class ReadGZIPAggregateXMLFilesTest extends AbstractIntegrationTest { +class ReadGzipAggregateXmlFilesTest extends AbstractIntegrationTest { @Test void noNamespace() { @@ -42,16 +42,31 @@ void nonGZIPFile() { ConnectorException ex = assertThrowsConnectorException(() -> dataset.count()); String message = ex.getMessage(); - assertTrue(message.startsWith("Unable to open file:///"), "Unexpected error: " + message); + assertTrue(message.startsWith("Unable to read file at file:///"), "Unexpected error: " + message); assertTrue(message.endsWith("cause: Not in GZIP format"), "Unexpected error: " + message); } + @Test + void ignoreInvalidGzipFile() { + List rows = newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.READ_AGGREGATES_XML_ELEMENT, "Employee") + .option(Options.READ_FILES_COMPRESSION, "gzip") + .option(Options.READ_FILES_ABORT_ON_FAILURE, false) + .load("src/test/resources/aggregates/employees.xml", "src/test/resources/aggregate-gzips/employees.xml.gz") + .collectAsList(); + + assertEquals(3, rows.size(), "The error from the non-gzipped file should be ignored and logged, and the " + + "3 rows from the valid gzip file should be returned."); + } + private void verifyRow(Row row, String expectedUriSuffix, String rootPath, String name, int age) { String uri = row.getString(0); assertTrue(uri.endsWith(expectedUriSuffix), String.format("URI %s doesn't end with %s", uri, expectedUriSuffix)); - String xml = new String((byte[]) row.get(3)); + String xml = new String((byte[]) row.get(1)); XmlNode doc = new XmlNode(xml, Namespace.getNamespace("ex", "org:example")); doc.assertElementValue(String.format("%sname", rootPath), name); doc.assertElementValue(String.format("%sage", rootPath), Integer.toString(age)); + assertEquals("xml", row.getString(2)); } } diff --git a/src/test/java/com/marklogic/spark/reader/file/ReadGZIPFilesTest.java b/src/test/java/com/marklogic/spark/reader/file/ReadGzipFilesTest.java similarity index 60% rename from src/test/java/com/marklogic/spark/reader/file/ReadGZIPFilesTest.java rename to src/test/java/com/marklogic/spark/reader/file/ReadGzipFilesTest.java index 893bbfd0..4e5f6bd5 
100644 --- a/src/test/java/com/marklogic/spark/reader/file/ReadGZIPFilesTest.java +++ b/src/test/java/com/marklogic/spark/reader/file/ReadGzipFilesTest.java @@ -12,7 +12,7 @@ import static org.junit.jupiter.api.Assertions.*; -class ReadGZIPFilesTest extends AbstractIntegrationTest { +class ReadGzipFilesTest extends AbstractIntegrationTest { @Test void readThreeGZIPFiles() { @@ -30,6 +30,19 @@ void readThreeGZIPFiles() { verifyRow(rows.get(2), "/src/test/resources/gzip-files/level1/level2/hello.json", "{\"hello\":\"world\"}\n"); } + @Test + void threeFilesOnePartition() { + List rows = newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_COMPRESSION, "gzip") + .option(Options.READ_NUM_PARTITIONS, 1) + .option("recursiveFileLookup", "true") + .load("src/test/resources/gzip-files") + .collectAsList(); + + assertEquals(3, rows.size()); + } + @Test void filesNotGzipped() { Dataset dataset = newSparkSession().read() @@ -39,14 +52,28 @@ void filesNotGzipped() { SparkException ex = assertThrows(SparkException.class, () -> dataset.count()); assertTrue(ex.getCause() instanceof ConnectorException); - assertTrue(ex.getCause().getMessage().startsWith("Unable to read gzip file at "), + assertTrue(ex.getCause().getMessage().startsWith("Unable to read file at file:///"), "Unexpected error message: " + ex.getCause().getMessage()); } + @Test + void dontAbortOnFailure() { + List rows = newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_COMPRESSION, "gzip") + .option(Options.READ_FILES_ABORT_ON_FAILURE, false) + .option("recursiveFileLookup", true) + .load("src/test/resources/zip-files/mixed-files.zip", "src/test/resources/gzip-files") + .collectAsList(); + + assertEquals(3, rows.size(), "Expecting to get the 3 files back from the gzip-files directory, with the " + + "error for the non-gzipped mixed-files.zip file being logged as a warning but not causing a failure."); + } + private void verifyRow(Row row, String expectedUriSuffix, String expectedContent) { String uri = row.getString(0); assertTrue(uri.endsWith(expectedUriSuffix), "Unexpected URI: " + uri); - String content = new String((byte[]) row.get(3)); + String content = new String((byte[]) row.get(1)); assertEquals(expectedContent, content); } } diff --git a/src/test/java/com/marklogic/spark/reader/file/ReadGzipRdfFilesTest.java b/src/test/java/com/marklogic/spark/reader/file/ReadGzipRdfFilesTest.java new file mode 100644 index 00000000..fc45e1eb --- /dev/null +++ b/src/test/java/com/marklogic/spark/reader/file/ReadGzipRdfFilesTest.java @@ -0,0 +1,43 @@ +package com.marklogic.spark.reader.file; + +import com.marklogic.spark.AbstractIntegrationTest; +import com.marklogic.spark.Options; +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.Row; +import org.junit.jupiter.params.ParameterizedTest; +import org.junit.jupiter.params.provider.CsvSource; + +import java.util.List; + +import static org.junit.jupiter.api.Assertions.assertEquals; + +/** + * These tests only verify that the expected number of rows are returned, thereby verifying that the gzipped file + * is read correctly. The files are the same as those used in ReadRdfFilesTest, just gzipped. We count on + * ReadRdfFilesTest to verify that the triple values are correct (which really is just verifying that Jena works + * correctly). 
+ */ +class ReadGzipRdfFilesTest extends AbstractIntegrationTest { + + @ParameterizedTest + @CsvSource({ + "englishlocale2.ttl.gz,32", + "mini-taxonomy2.xml.gz,8", + "semantics2.json.gz,12", + "semantics2.n3.gz,25", + "semantics2.nt.gz,8", + "three-quads2.trig.gz,16", + "semantics2.nq.gz,4" + }) + void test(String file, int expectedCount) { + Dataset dataset = newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_TYPE, "rdf") + .option(Options.READ_FILES_COMPRESSION, "gzip") + .load("src/test/resources/rdf/" + file); + + List rows = dataset.collectAsList(); + assertEquals(expectedCount, rows.size()); + } + +} diff --git a/src/test/java/com/marklogic/spark/reader/file/ReadMlcpArchiveFilesTest.java b/src/test/java/com/marklogic/spark/reader/file/ReadMlcpArchiveFilesTest.java new file mode 100644 index 00000000..c942db47 --- /dev/null +++ b/src/test/java/com/marklogic/spark/reader/file/ReadMlcpArchiveFilesTest.java @@ -0,0 +1,296 @@ +package com.marklogic.spark.reader.file; + +import com.marklogic.junit5.XmlNode; +import com.marklogic.spark.AbstractIntegrationTest; +import com.marklogic.spark.ConnectorException; +import com.marklogic.spark.Options; +import org.apache.spark.sql.Column; +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.Row; +import org.jdom2.Namespace; +import org.junit.jupiter.api.Test; +import scala.collection.mutable.WrappedArray; + +import java.util.ArrayList; +import java.util.List; +import java.util.Map; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertTrue; + +class ReadMlcpArchiveFilesTest extends AbstractIntegrationTest { + + private static final int COLLECTIONS_COLUMN = 3; + private static final int PERMISSIONS_COLUMN = 4; + private static final int QUALITY_COLUMN = 5; + private static final int PROPERTIES_COLUMN = 6; + private static final int METADATAVALUES_COLUMN = 7; + + /** + * Depends on an MLCP archive file containing the two XML documents and their metadata entries as produced by + * TestUtil.insertTwoDocumentsWithAllMetadata. 
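+ * For orientation, the document rows produced for an archive use the columns URI (0), content (1), format (2), collections (3), permissions (4), quality (5), properties (6), and metadataValues (7), matching the column-index constants declared above.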
+ */ + @Test + void readMlcpArchiveFile() { + List rows = newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_TYPE, "mlcp_archive") + .load("src/test/resources/mlcp-archive-files/files-with-all-metadata.mlcp.zip") + .collectAsList(); + + assertEquals(2, rows.size(), "The .metadata entries should not be returned as rows, but rather should be " + + "used to populate the metadata columns in the rows for the documents they're associated with."); + + verifyFirstRow(rows.get(0)); + verifySecondRow(rows.get(1)); + } + + @Test + void twoArchivesOnePartition() { + List rows = newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_TYPE, "mlcp_archive") + .load( + "src/test/resources/mlcp-archive-files/complex-properties.zip", + "src/test/resources/mlcp-archive-files/files-with-all-metadata.mlcp.zip" + ) + .collectAsList(); + + assertEquals(3, rows.size(), "A single partition reader should iterate over the two files and get " + + "2 rows from files-with-all-metadata and 1 row from complex-properties."); + } + + @Test + void mlcpArchivesContainingAllFourTypesOfDocuments() { + List rows = newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_TYPE, "mlcp_archive") + .load("src/test/resources/mlcp-archive-files/all-four-document-types") + .sort(new Column("URI")) + .collectAsList(); + + assertEquals(4, rows.size()); + + rows.forEach(row -> System.out.println(row.prettyJson())); + + assertEquals("/mixed-files/hello.json", rows.get(0).getString(0)); + assertEquals("JSON", rows.get(0).getString(2)); + + assertEquals("/mixed-files/hello.txt", rows.get(1).getString(0)); + assertEquals("TEXT", rows.get(1).getString(2)); + + assertEquals("/mixed-files/hello.xml", rows.get(2).getString(0)); + assertEquals("XML", rows.get(2).getString(2)); + + assertEquals("/mixed-files/hello2.txt.gz", rows.get(3).getString(0)); + assertEquals("JSON", rows.get(3).getString(2), "MLCP appears to have a bug where a binary file has its " + + "format captured as JSON. MLE-12923 was created to capture this bug. 
Once the bug is fixed, this should " + + "equal BINARY."); + } + + @Test + void subsetOfCategories() { + readArchiveWithCategories("collections,permissions,quality").forEach(row -> { + verifyCollections(row); + verifyPermissions(row); + verifyQuality(row); + verifyColumnsAreNull(row, PROPERTIES_COLUMN, METADATAVALUES_COLUMN); + }); + } + + @Test + void onlyCollections() { + readArchiveWithCategories("collections").forEach(row -> { + verifyCollections(row); + verifyColumnsAreNull(row, PERMISSIONS_COLUMN, QUALITY_COLUMN, PROPERTIES_COLUMN, METADATAVALUES_COLUMN); + }); + } + + @Test + void onlyPermissions() { + readArchiveWithCategories("permissions").forEach(row -> { + verifyPermissions(row); + verifyColumnsAreNull(row, COLLECTIONS_COLUMN, QUALITY_COLUMN, PROPERTIES_COLUMN, METADATAVALUES_COLUMN); + }); + } + + @Test + void onlyQuality() { + readArchiveWithCategories("quality").forEach(row -> { + verifyQuality(row); + verifyColumnsAreNull(row, COLLECTIONS_COLUMN, PERMISSIONS_COLUMN, PROPERTIES_COLUMN, METADATAVALUES_COLUMN); + }); + } + + @Test + void onlyProperties() { + readArchiveWithCategories("properties").forEach(row -> { + verifyProperties(row); + verifyColumnsAreNull(row, COLLECTIONS_COLUMN, PERMISSIONS_COLUMN, QUALITY_COLUMN, METADATAVALUES_COLUMN); + }); + } + + @Test + void onlyMetadataValues() { + readArchiveWithCategories("metadatavalues").forEach(row -> { + verifyMetadataValues(row); + verifyColumnsAreNull(row, COLLECTIONS_COLUMN, PERMISSIONS_COLUMN, QUALITY_COLUMN, PROPERTIES_COLUMN); + }); + } + + @Test + void complexProperties() { + List rows = newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_TYPE, "mlcp_archive") + .load("src/test/resources/mlcp-archive-files/complex-properties.zip") + .collectAsList(); + + assertEquals(1, rows.size()); + + XmlNode properties = new XmlNode(rows.get(0).getString(PROPERTIES_COLUMN), + PROPERTIES_NAMESPACE, Namespace.getNamespace("flexrep", "http://marklogic.com/xdmp/flexible-replication")); + properties.assertElementValue( + "This verifies that the properties column can contain any serialized string of XML. This is necessary so " + + "that complex XML structures can be read from and written to MarkLogic.", + "/prop:properties/flexrep:document-status/flexrep:document-uri", "/equipment/DX06040.json"); + } + + @Test + void notAnMlcpArchive() { + Dataset dataset = newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_TYPE, "mlcp_archive") + .load("src/test/resources/zip-files/mixed-files.zip"); + + ConnectorException ex = assertThrowsConnectorException(() -> dataset.count()); + String message = ex.getMessage(); + assertTrue(message.startsWith("Unable to read metadata for entry: mixed-files/hello.json"), "Unexpected message: " + message); + } + + @Test + void notAZipFile() { + Dataset dataset = newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_TYPE, "mlcp_archive") + .load("src/test/resources/logback.xml"); + + assertEquals(0, dataset.count(), "Just like with reading zip files, the underlying Java ZipInputStream " + + "does not throw an error when it reads a non-zip file. It just doesn't return any zip entries. For now, " + + "we are not treating this as an error condition either. 
The user will simply get back zero rows."); + } + + @Test + void dontAbortOnBadArchiveFile() { + List rows = newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_TYPE, "mlcp_archive") + .option(Options.READ_FILES_ABORT_ON_FAILURE, false) + .option(Options.READ_NUM_PARTITIONS, 1) + .load( + "src/test/resources/zip-files/mixed-files.zip", + "src/test/resources/mlcp-archive-files/files-with-all-metadata.mlcp.zip" + ) + .collectAsList(); + + assertEquals(2, rows.size(), "When abortOnFailure is false, the error from mixed-files.zip should be logged " + + "and the connector should keep processing other files, thus returning the 2 valid rows from " + + "files-with-all-metadata.mlcp.zip."); + } + + @Test + void dontAbortOnArchiveFileMissingContentEntry() { + List rows = newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_TYPE, "mlcp_archive") + .option(Options.READ_FILES_ABORT_ON_FAILURE, false) + .load( + "src/test/resources/mlcp-archive-files/missing-content-entry.mlcp.zip", + "src/test/resources/mlcp-archive-files/files-with-all-metadata.mlcp.zip" + ) + .collectAsList(); + + assertEquals(3, rows.size(), "The connector should get 1 valid row from missing-content-entry.mlcp.zip, " + + "as it's the second metadata entry that is missing a content entry. It should also get 2 rows from " + + "files-with-all-metadata.mlcp.zip. And the error from the missing content entry should be logged but " + + "not thrown."); + } + + private void verifyFirstRow(Row row) { + assertEquals("/test/1.xml", row.getString(0)); + XmlNode doc = new XmlNode(new String((byte[]) row.get(1))); + assertEquals("world", doc.getElementValue("hello")); + verifyRowMetadata(row); + } + + private void verifySecondRow(Row row) { + assertEquals("/test/2.xml", row.getString(0)); + XmlNode doc = new XmlNode(new String((byte[]) row.get(1))); + assertEquals("world", doc.getElementValue("hello")); + verifyRowMetadata(row); + } + + private void verifyRowMetadata(Row row) { + assertEquals("XML", row.getString(2)); // format + verifyCollections(row); + verifyPermissions(row); + verifyQuality(row); + verifyProperties(row); + verifyMetadataValues(row); + } + + private List readArchiveWithCategories(String categories) { + return newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_TYPE, "mlcp_archive") + .option(Options.READ_ARCHIVES_CATEGORIES, categories) + .load("src/test/resources/mlcp-archive-files/files-with-all-metadata.mlcp.zip") + .collectAsList(); + } + + private void verifyCollections(Row row) { + List collections = row.getList(COLLECTIONS_COLUMN); + assertEquals(2, collections.size()); + assertEquals("collection1", collections.get(0)); + assertEquals("collection2", collections.get(1)); + } + + private void verifyPermissions(Row row) { + Map> permissions = row.getJavaMap(PERMISSIONS_COLUMN); + assertEquals(2, permissions.size()); + WrappedArray capabilities = permissions.get("qconsole-user"); + assertEquals(1, capabilities.size()); + assertEquals("READ", capabilities.apply(0)); + capabilities = permissions.get("spark-user-role"); + assertEquals(2, capabilities.size()); + List list = new ArrayList<>(); + list.add(capabilities.apply(0)); + list.add(capabilities.apply(1)); + assertTrue(list.contains("READ")); + assertTrue(list.contains("UPDATE")); + } + + private void verifyQuality(Row row) { + assertEquals(10, row.getInt(QUALITY_COLUMN)); + } + + private void verifyProperties(Row row) { + XmlNode properties = new 
XmlNode(row.getString(PROPERTIES_COLUMN), + PROPERTIES_NAMESPACE, Namespace.getNamespace("ex", "org:example")); + properties.assertElementValue("/prop:properties/ex:key1", "value1"); + properties.assertElementValue("/prop:properties/key2", "value2"); + } + + private void verifyMetadataValues(Row row) { + Map metadataValues = row.getJavaMap(METADATAVALUES_COLUMN); + assertEquals(2, metadataValues.size()); + assertEquals("value1", metadataValues.get("meta1")); + assertEquals("value2", metadataValues.get("meta2")); + } + + private void verifyColumnsAreNull(Row row, int... indices) { + for (int index : indices) { + assertTrue(row.isNullAt(index), "Unexpected non-null column: " + index + "; value: " + row.get(index)); + } + } +} diff --git a/src/test/java/com/marklogic/spark/reader/file/ReadMlcpArchiveWithNakedPropertiesTest.java b/src/test/java/com/marklogic/spark/reader/file/ReadMlcpArchiveWithNakedPropertiesTest.java new file mode 100644 index 00000000..ade03e25 --- /dev/null +++ b/src/test/java/com/marklogic/spark/reader/file/ReadMlcpArchiveWithNakedPropertiesTest.java @@ -0,0 +1,111 @@ +package com.marklogic.spark.reader.file; + +import com.marklogic.junit5.XmlNode; +import com.marklogic.spark.AbstractIntegrationTest; +import com.marklogic.spark.Options; +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.SaveMode; +import org.junit.jupiter.api.BeforeEach; +import org.junit.jupiter.api.Test; + +import java.util.List; +import java.util.stream.Stream; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertTrue; + +/** + * A "naked properties" URI in MarkLogic is possible by creating a properties fragment at a URI but not + * assigning any document content to it. MLCP archives can contain these, and thus we need to support them when reading + * an MLCP archive. However, because v1/search cannot find these documents, it's not possible for the archives created + * by this connector to contain them. + */ +class ReadMlcpArchiveWithNakedPropertiesTest extends AbstractIntegrationTest { + + private static final int PROPERTIES_COLUMN = 6; + + /** + * The plumbing in the parent class for deleting documents before a test runs won't catch naked properties created + * by this test, so we ensure they're deleted here. 
+ */ + @BeforeEach + void deleteNakedPropertiesFromPreviousTestRuns() { + Stream.of("example.xml.naked", "example2.xml.naked", "naked/example.xml.naked").forEach(uri -> { + String query = String.format("xdmp:document-delete('%s')", uri); + try { + getDatabaseClient().newServerEval().xquery(query).evalAs(String.class); + } catch (Exception e) { + logger.debug("Ignoring this error because it's only due to the naked properties fragment not existing"); + } + }); + } + + @Test + void twoNakedEntries() { + newSparkSession().read().format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_TYPE, "mlcp_archive") + .load("src/test/resources/mlcp-archive-files/two-naked-entries.zip") + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.WRITE_COLLECTIONS, "naked") + .option(Options.WRITE_PERMISSIONS, DEFAULT_PERMISSIONS) + .mode(SaveMode.Append) + .save(); + + assertCollectionSize("Using v1/search should not find the naked URIs since they do not have a document " + + "associated with them", "naked", 0); + + Stream.of("example.xml.naked", "example2.xml.naked").forEach(uri -> { + String collection = getDatabaseClient().newServerEval() + .javascript(String.format("xdmp.documentGetCollections('%s')[0]", uri)) + .evalAs(String.class); + assertEquals("naked", collection, "Each naked properties document should still be assigned to the " + + "collection found in its MLCP metadata entry from the archive file. But these URIs aren't returned " + + "by v1/search since there are no documents associated with them."); + }); + + XmlNode props = readDocumentProperties("example.xml.naked"); + props.assertElementValue("/prop:properties/priority", "1"); + props = readDocumentProperties("example2.xml.naked"); + props.assertElementValue("/prop:properties/priority", "2"); + } + + @Test + void normalAndNakedEntry() { + Dataset dataset = newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_TYPE, "mlcp_archive") + .load("src/test/resources/mlcp-archive-files/normal-and-naked-entry.zip"); + + List rows = dataset.collectAsList(); + assertEquals(2, rows.size(), "The example.xml.naked entry should have produced 1 row."); + assertEquals("xml/1.xml", rows.get(1).getString(0)); + + final String expectedNakedPropertiesUrl = "naked/example.xml.naked"; + Row nakedRow = rows.get(0); + assertEquals(expectedNakedPropertiesUrl, nakedRow.getString(0)); + assertTrue(nakedRow.isNullAt(1), "Content should be null."); + assertTrue(nakedRow.isNullAt(2), "Format should be null, since there's no content."); + XmlNode properties = new XmlNode(nakedRow.getString(PROPERTIES_COLUMN), PROPERTIES_NAMESPACE); + properties.assertElementValue("/prop:properties/priority", "1"); + + // Write the rows to verify that the naked document is created correctly. 
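+ // The dataset read above is written as-is; the naked-properties row has null content, which the writer accepts, as the final assertion in this test verifies.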
+ dataset.write().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.WRITE_COLLECTIONS, "naked-test") + .option(Options.WRITE_PERMISSIONS, DEFAULT_PERMISSIONS) + .mode(SaveMode.Append) + .save(); + + List uris = getUrisInCollection("naked-test", 1); + assertEquals("xml/1.xml", uris.get(0), "getUrisInCollection uses v1/search to find URIs, and thus it " + + "should only find the URI of the normal document and not the one of the naked properties document."); + + XmlNode nakedProperties = readDocumentProperties(expectedNakedPropertiesUrl); + nakedProperties.assertElementValue( + "As of Java Client 6.6.1, a DMSDK WriteBatcher now allows for a document to have a null content handle, " + + "which allows for 'naked properties' URIs to be written.", + "/prop:properties/priority", "1"); + } +} diff --git a/src/test/java/com/marklogic/spark/reader/file/ReadRdfFilesTest.java b/src/test/java/com/marklogic/spark/reader/file/ReadRdfFilesTest.java new file mode 100644 index 00000000..e9420b29 --- /dev/null +++ b/src/test/java/com/marklogic/spark/reader/file/ReadRdfFilesTest.java @@ -0,0 +1,231 @@ +package com.marklogic.spark.reader.file; + +import com.marklogic.spark.AbstractIntegrationTest; +import com.marklogic.spark.ConnectorException; +import com.marklogic.spark.Options; +import org.apache.spark.sql.DataFrameReader; +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.Row; +import org.junit.jupiter.api.Test; + +import java.util.List; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertTrue; + +class ReadRdfFilesTest extends AbstractIntegrationTest { + + @Test + void rdfXml() { + Dataset dataset = startRead().load("src/test/resources/rdf/mini-taxonomy.xml"); + + List rows = dataset.collectAsList(); + assertEquals(8, rows.size(), "Expecting 8 triples, as there are 8 child elements in the " + + "single rdf:Description element in the test file."); + + // Verify a few triples to make sure things look good. + final String subject = "http://vocabulary.worldbank.org/taxonomy/451"; + verifyRow(rows.get(0), subject, "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", "http://www.w3.org/2004/02/skos/core#Concept"); + verifyRow(rows.get(1), subject, "http://purl.org/dc/terms/creator", "wb", "http://www.w3.org/2001/XMLSchema#string", null); + verifyRow(rows.get(4), subject, "http://www.w3.org/2004/02/skos/core#prefLabel", "Debt Management", + "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString", "en"); + } + + @Test + void emptyRdfXml() { + Dataset dataset = startRead().load("src/test/resources/rdf/empty-taxonomy.xml"); + assertEquals(0, dataset.count(), "Verifying that no error is thrown if an RDF file is valid but simply " + + "has no triples in it."); + } + + /** + * Verifies that blank nodes are generated in the same manner as with MLCP. 
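+ * In the expected values below, 'BLANK' is a placeholder understood by assertEqualsOrBlank, which accepts any generated blank-node IRI beginning with http://marklogic.com/semantics/blank/.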
+ */ + @Test + void blankNodes() { + Dataset dataset = startRead().load("src/test/resources/rdf/blank-nodes.xml"); + + List rows = dataset.collectAsList(); + assertEquals(4, rows.size()); + + final String subject = "http://example.org/web-data"; + verifyRow(rows.get(0), subject, "http://example.org/data#title", "Web Data", "http://www.w3.org/2001/XMLSchema#string", null); + verifyRow(rows.get(1), subject, "http://example.org/data#professor", "BLANK"); + verifyRow(rows.get(2), "BLANK", "http://example.org/data#fullName", "Alice Carol", "http://www.w3.org/2001/XMLSchema#string", null); + verifyRow(rows.get(3), "BLANK", "http://example.org/data#homePage", "http://example.net/alice-carol"); + } + + @Test + void turtleTriples() { + Dataset dataset = startRead().load("src/test/resources/rdf/englishlocale.ttl"); + List rows = dataset.collectAsList(); + assertEquals(32, rows.size()); + + // Verify a few rows as a sanity check. + final String subject = "http://marklogicsparql.com/id#1111"; + verifyRow(rows.get(0), subject, "http://marklogicsparql.com/addressbook#firstName", "John", + "http://www.w3.org/2001/XMLSchema#string", null, null); + verifyRow(rows.get(1), subject, "http://marklogicsparql.com/addressbook#lastName", "Snelson", + "http://www.w3.org/2001/XMLSchema#string", null, null); + verifyRow(rows.get(31), "http://marklogicsparql.com/id#8888", "http://marklogicsparql.com/addressbook#email", + "Lihan.Wang@gmail.com", "http://www.w3.org/2001/XMLSchema#string", null, null); + } + + @Test + void unrecognizedExtension() { + Dataset dataset = startRead().load("src/test/resources/rdf/turtle-triples.txt"); + ConnectorException ex = assertThrowsConnectorException(() -> dataset.show()); + assertTrue( + ex.getMessage().contains("RDF syntax is not supported or the file extension is not recognized."), + "We rely on Jena to identify a file based on its extension (with the exception of RDF JSON; Jena for " + + "some reason does not recognize .json as an extension). If the user provides a file with an " + + "extension that Jena does not recognize, we expect a friendly error that doesn't contain the " + + "Jena message of '(.lang or .base required)', which a user won't understand and can't do " + + "anything about. Actual error: " + ex.getMessage() + ); + } + + @Test + void rdfJson() { + Dataset dataset = startRead().load("src/test/resources/rdf/semantics.json"); + List rows = dataset.collectAsList(); + assertEquals(12, rows.size()); + + // Verify a few rows as a sanity check. + final String subject = "http://jondoe.example.org/#me"; + verifyRow(rows.get(0), subject, "http://www.w3.org/2000/01/rdf-schema#type", "http://xmlns.com/foaf/0.1/Person"); + verifyRow(rows.get(1), subject, "http://xmlns.com/foaf/0.1/name", "Jon", + "http://www.w3.org/2001/XMLSchema#string", null, null); + verifyRow(rows.get(11), subject, "http://www.w3.org/2006/vcard/ns#tel", "+49-12-3546789", + "http://www.w3.org/2001/XMLSchema#string", null, null); + } + + @Test + void n3Triples() { + Dataset dataset = startRead().load("src/test/resources/rdf/semantics.n3"); + List rows = dataset.collectAsList(); + assertEquals(25, rows.size()); + + // Verify a few rows as a sanity check. 
+ verifyRow(rows.get(0), "http://www.w3.org/1999/02/22-rdf-syntax-ns#nil", "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", + "http://www.w3.org/1999/02/22-rdf-syntax-ns#List"); + verifyRow(rows.get(14), "http://purl.org/dc/elements/1.1/", "http://purl.org/dc/elements/1.1/description", + "The Dublin Core Element Set v1.1 namespace provides URIs for the Dublin Core Elements v1.1. Entries are declared using RDF Schema language to support RDF applications.", + "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString", "en-US", null); + verifyRow(rows.get(24), "http://purl.org/dc/elements/1.1/", "http://purl.org/dc/terms/modified", "2003-03-24", + "http://www.w3.org/2001/XMLSchema#string", null, null); + } + + @Test + void ntriples() { + Dataset dataset = startRead().load("src/test/resources/rdf/semantics.nt"); + List rows = dataset.collectAsList(); + assertEquals(8, rows.size()); + + // Verify a few rows as a sanity check. + verifyRow(rows.get(0), "http://www.w3.org/2001/sw/RDFCore/ntriples/", "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", + "http://xmlns.com/foaf/0.1/Document"); + verifyRow(rows.get(1), "http://www.w3.org/2001/sw/RDFCore/ntriples/", "http://purl.org/dc/terms/title", + "N-Triples", "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString", "en-US", null); + verifyRow(rows.get(7), "BLANK", "http://xmlns.com/foaf/0.1/name", "Dave Beckett", + "http://www.w3.org/2001/XMLSchema#string", null, null); + } + + @Test + void trigQuads() { + Dataset dataset = startRead().load("src/test/resources/rdf/three-quads.trig"); + List rows = dataset.collectAsList(); + assertEquals(16, rows.size()); + + verifyRow(rows.get(0), "http://www.example.org/exampleDocument#Monica", "http://www.example.org/vocabulary#name", + "Monica Murphy", "http://www.w3.org/2001/XMLSchema#string", null, "http://www.example.org/exampleDocument#G1"); + + verifyRow(rows.get(4), "http://www.example.org/exampleDocument#Monica", "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", + "http://www.example.org/vocabulary#Person", null, null, "http://www.example.org/exampleDocument#G2"); + + verifyRow(rows.get(6), "http://www.example.org/exampleDocument#G1", "http://www.w3.org/2004/03/trix/swp-1/assertedBy", + "BLANK", null, null, "http://www.example.org/exampleDocument#G3"); + + // Verifies that Jena uses urn:x-arq:DefaultGraphNode as the default graph when a graph is not specified + // in a quads file. + verifyRow(rows.get(15), "http://www.example.org/exampleDocument#Default", "http://www.example.org/vocabulary#graphname", + "Default", "http://www.w3.org/2001/XMLSchema#string", null, "urn:x-arq:DefaultGraphNode"); + } + + @Test + void nquads() { + Dataset dataset = startRead().load("src/test/resources/rdf/semantics.nq"); + + List rows = dataset.collectAsList(); + assertEquals(4, rows.size()); + + verifyRow(rows.get(0), "http://dbpedia.org/resource/Autism", "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", + "http://dbpedia.org/ontology/Disease", null, null, "http://en.wikipedia.org/wiki/Autism?oldid=495234324#absolute-line=9"); + + // MLCP converts urn:x-arq:DefaultGraphNode into MarkLogic's default graph. 
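+ // When reading the file itself, the connector surfaces Jena's urn:x-arq:DefaultGraphNode value as-is in the graph column, as verified here.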
+ verifyRow(rows.get(1), "http://dbpedia.org/resource/Autism", "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", + "http://dbpedia.org/ontology/Disease1", null, null, "urn:x-arq:DefaultGraphNode"); + + verifyRow(rows.get(2), "http://dbpedia.org/resource/Animal_Farm", "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", + "http://schema.org/CreativeWork", null, null, "http://en.wikipedia.org/wiki/Animal_Farm?oldid=494597186#absolute-line=5"); + + verifyRow(rows.get(3), "http://dbpedia.org/resource/Aristotle", "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", + "http://dbpedia.org/ontology/Agent", null, null, "http://en.wikipedia.org/wiki/Aristotle?oldid=494147695#absolute-line=4"); + } + + @Test + void dontAbortOnTriplesFailure() { + List rows = startRead() + .option(Options.READ_FILES_ABORT_ON_FAILURE, false) + .load("src/test/resources/data.csv", "src/test/resources/rdf/mini-taxonomy.xml") + .collectAsList(); + + assertEquals(8, rows.size(), "The error from data.csv should be caught and logged, and then the 8 triples " + + "from mini-taxonomy.xml should be returned."); + } + + @Test + void dontAbortOnQuadsFailure() { + List rows = startRead() + .option(Options.READ_FILES_ABORT_ON_FAILURE, false) + .load("src/test/resources/rdf/bad-quads.trig", "src/test/resources/rdf/three-quads.trig") + .collectAsList(); + + assertEquals(20, rows.size(), "Should have the 20 triples from three-quads.trig, and the error from " + + "bad-quads.trig should have been caught and logged at the WARN level."); + } + + private void verifyRow(Row row, String subject, String predicate, String object) { + verifyRow(row, subject, predicate, object, null, null); + } + + private void verifyRow(Row row, String subject, String predicate, String object, String datatype, String lang) { + verifyRow(row, subject, predicate, object, datatype, lang, null); + } + + private void verifyRow(Row row, String subject, String predicate, String object, String datatype, String lang, String graph) { + assertEqualsOrBlank(subject, row.getString(0)); + assertEquals(predicate, row.getString(1)); + assertEqualsOrBlank(object, row.getString(2)); + assertEquals(datatype, row.get(3)); + assertEquals(lang, row.getString(4)); + assertEquals(graph, row.getString(5)); + } + + private void assertEqualsOrBlank(String expectedValue, String actualValue) { + if ("BLANK".equals(expectedValue)) { + assertTrue(actualValue.startsWith("http://marklogic.com/semantics/blank/"), + "We are reusing copy/pasted code from MLCP for generating a 'blank' value, which is expected to end with " + + "a random hex value. It is not known why this isn't just a Java-generated UUID; we're simply reusing " + + "the code because it's what MLCP does. 
Actual value: " + actualValue); + } else { + assertEquals(expectedValue, actualValue); + } + } + + private DataFrameReader startRead() { + return newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_TYPE, "rdf"); + } +} diff --git a/src/test/java/com/marklogic/spark/reader/file/ReadRdfZipFilesTest.java b/src/test/java/com/marklogic/spark/reader/file/ReadRdfZipFilesTest.java new file mode 100644 index 00000000..0f8b9493 --- /dev/null +++ b/src/test/java/com/marklogic/spark/reader/file/ReadRdfZipFilesTest.java @@ -0,0 +1,126 @@ +package com.marklogic.spark.reader.file; + +import com.marklogic.spark.AbstractIntegrationTest; +import com.marklogic.spark.ConnectorException; +import com.marklogic.spark.Options; +import org.apache.spark.sql.DataFrameReader; +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.Row; +import org.junit.jupiter.api.Test; + +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertTrue; + +class ReadRdfZipFilesTest extends AbstractIntegrationTest { + + @Test + void twoRdfFilesInZip() { + List rows = startRead() + .load("src/test/resources/rdf/two-rdf-files.zip") + .collectAsList(); + + assertEquals(40, rows.size(), "Expecting 32 triples from the englishlocale.ttl file in the zip " + + "and 8 triples from the mini-taxonomy.xml file."); + + Map subjectCounts = getSubjectCounts(rows); + assertEquals(4, subjectCounts.get("http://marklogicsparql.com/id#1111")); + assertEquals(8, subjectCounts.get("http://vocabulary.worldbank.org/taxonomy/451")); + } + + @Test + void twoFilesOnePartition() { + List rows = startRead() + .option(Options.READ_NUM_PARTITIONS, 1) + .load("src/test/resources/rdf/two-rdf-files.zip", "src/test/resources/rdf/two-rdf-files.zip") + .collectAsList(); + + assertEquals(80, rows.size(), "A single partition reader should get 40 triples from each of the files - " + + "and this shows you can also pass in the same path twice, which was a little surprising."); + } + + @Test + void zipHasEmptyRdfFile() { + List rows = startRead() + .load("src/test/resources/rdf/has-empty-entry.zip") + .collectAsList(); + + assertEquals(32, rows.size(), "Expecting 32 triples from englishlocale.ttl and zero triples from " + + "empty-taxonomy.xml. 
The fact that empty-taxonomy.xml is a valid XML file but has no triples should not " + + "result in any error."); + } + + @Test + void eachRdfFileTypeInZip() { + List rows = startRead() + .load("src/test/resources/rdf/each-rdf-file-type.zip") + .collectAsList(); + + assertEquals(105, rows.size(), "Expecting the following counts: 32 from englishlocale.ttl; 8 from " + + "mini-taxonomy.xml; 12 from semantics.json; 25 from semantics.n3; 8 from semantics.nt; 16 from " + + "three-quads.trig; and 4 from semantics.nq."); + + Map subjectCounts = getSubjectCounts(rows); + assertEquals(4, subjectCounts.get("http://marklogicsparql.com/id#1111"), + "Verifies that englishlocale.ttl was read correctly."); + assertEquals(8, subjectCounts.get("http://vocabulary.worldbank.org/taxonomy/451"), + "Verifies that mini-taxonomy.xml was read correctly."); + assertEquals(12, subjectCounts.get("http://jondoe.example.org/#me"), + "Verifies that semantics.json was read correctly."); + assertEquals(4, subjectCounts.get("http://www.w3.org/2001/sw/RDFCore/ntriples/"), + "Verifies that semantics.nt was read correctly."); + assertEquals(11, subjectCounts.get("http://purl.org/dc/elements/1.1/"), + "Verifies that semantics.n3 was read correctly."); + assertEquals(1, subjectCounts.get("http://dbpedia.org/resource/Animal_Farm"), + "Verifies that semantics.nq was read correctly."); + assertEquals(6, subjectCounts.get("http://www.example.org/exampleDocument#Monica"), + "Verifies that three-quads.trig was read correctly."); + } + + @Test + void abortOnBadEntry() { + Dataset dataset = startRead().load("src/test/resources/rdf/good-and-bad-rdf.zip"); + ConnectorException ex = assertThrowsConnectorException(() -> dataset.count()); + assertTrue(ex.getMessage().contains("Unable to read bad-quads.trig; cause: "), + "Unexpected error: " + ex.getMessage()); + } + + @Test + void zipWithBadAndGoodEntry() { + List rows = startRead() + .option(Options.READ_FILES_ABORT_ON_FAILURE, false) + .option(Options.READ_NUM_PARTITIONS, 1) + .load( + "src/test/resources/rdf/good-and-bad-rdf.zip", + "src/test/resources/rdf/has-empty-entry.zip" + ) + .collectAsList(); + + assertEquals(68, rows.size(), "Expecting 4 quads to be read from the 'bad' entry until an error occurs, " + + "and expecting 32 triples to be read from the 'good' entry. 
Then expecting 32 more triples to be " + + "read from has-empty-entry.zip."); + } + + private DataFrameReader startRead() { + return newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_TYPE, "rdf") + .option(Options.READ_FILES_COMPRESSION, "zip"); + } + + private Map getSubjectCounts(List rows) { + Map subjectCounts = new HashMap<>(); + rows.forEach(row -> { + String subject = row.getString(0); + if (subjectCounts.containsKey(subject)) { + subjectCounts.put(subject, subjectCounts.get(subject) + 1); + } else { + subjectCounts.put(subject, 1); + } + }); + return subjectCounts; + } +} diff --git a/src/test/java/com/marklogic/spark/reader/file/ReadZipFilesTest.java b/src/test/java/com/marklogic/spark/reader/file/ReadZipFilesTest.java index 106d4091..a1029b39 100644 --- a/src/test/java/com/marklogic/spark/reader/file/ReadZipFilesTest.java +++ b/src/test/java/com/marklogic/spark/reader/file/ReadZipFilesTest.java @@ -27,9 +27,14 @@ void readAndWriteFourFilesInZip() { Dataset reader = newZipReader() .load("src/test/resources/zip-files/mixed*.zip"); - verifyFileRows(reader.collectAsList()); + List rows = reader.collectAsList(); + assertEquals(4, rows.size(), "Expecting 1 row for each of the 4 entries in the zip."); + verifyUriEndsWith(rows.get(0), "mixed-files.zip/mixed-files/hello.json"); + verifyUriEndsWith(rows.get(1), "mixed-files.zip/mixed-files/hello.txt"); + verifyUriEndsWith(rows.get(2), "mixed-files.zip/mixed-files/hello.xml"); + verifyUriEndsWith(rows.get(3), "mixed-files.zip/mixed-files/hello2.txt.gz"); - // Now write the rows so we can verify the doc in MarkLogic. + // Now write the rows so that we can verify the doc in MarkLogic. defaultWrite(reader.write() .format(CONNECTOR_IDENTIFIER) .option(Options.WRITE_URI_REPLACE, ".*/mixed-files.zip,''") @@ -51,6 +56,7 @@ void readAndWriteFourFilesInZip() { void readViaMultiplePaths() { List rows = newZipReader() .option(Options.READ_FILES_COMPRESSION, "zip") + .option(Options.READ_NUM_PARTITIONS, 1) .load( "src/test/resources/zip-files/mixed-files.zip", "src/test/resources/zip-files/child/logback.zip" @@ -60,6 +66,32 @@ void readViaMultiplePaths() { assertEquals(5, rows.size(), "Expecting 4 rows from mixed-files.zip and 1 row from logback.zip."); } + /** + * See https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html . + */ + @Test + void modifiedBefore() { + long count = newZipReader() + .option(Options.READ_FILES_COMPRESSION, "zip") + .option("modifiedBefore", "2020-06-01T13:00:00") + .load("src/test/resources/zip-files/mixed-files.zip") + .count(); + assertEquals(0, count, "Verifying that 'modifiedBefore' 'just works'."); + } + + /** + * See https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html . 
+ */ + @Test + void modifiedAfter() { + long count = newZipReader() + .option(Options.READ_FILES_COMPRESSION, "zip") + .option("modifiedAfter", "2020-06-01T13:00:00") + .load("src/test/resources/zip-files/mixed-files.zip") + .count(); + assertEquals(4, count, "Verifying that 'modifiedAfter' 'just works'."); + } + @Test void readTwoZipFilesViaRecursiveLookupWithFilter() { List rows = newZipReader() @@ -101,24 +133,9 @@ void pathDoesntExist() { "standard Spark exception that is thrown when a non-existent path is detected."); } - private void verifyFileRows(List rows) { - assertEquals(4, rows.size(), "Expecting 1 row for each of the 4 entries in the zip."); - Row row = rows.get(0); - assertTrue(row.getString(0).endsWith("mixed-files.zip/mixed-files/hello.json")); - assertNull(row.get(1)); - assertEquals(23, row.getLong(2)); - row = rows.get(1); - assertTrue(row.getString(0).endsWith("mixed-files.zip/mixed-files/hello.txt")); - assertNull(row.get(1)); - assertEquals(12, row.getLong(2)); - row = rows.get(2); - assertTrue(row.getString(0).endsWith("mixed-files.zip/mixed-files/hello.xml")); - assertNull(row.get(1)); - assertEquals(21, row.getLong(2)); - row = rows.get(3); - assertTrue(row.getString(0).endsWith("mixed-files.zip/mixed-files/hello2.txt.gz")); - assertNull(row.get(1)); - assertEquals(43, row.getLong(2)); + private void verifyUriEndsWith(Row row, String value) { + String uri = row.getString(0); + assertTrue(uri.endsWith(value), format("URI '%s' does not end with %s", uri, value)); } private DataFrameReader newZipReader() { diff --git a/src/test/java/com/marklogic/spark/reader/optic/GroupByDuplicateColumnNamesTest.java b/src/test/java/com/marklogic/spark/reader/optic/GroupByDuplicateColumnNamesTest.java new file mode 100644 index 00000000..9a6c5523 --- /dev/null +++ b/src/test/java/com/marklogic/spark/reader/optic/GroupByDuplicateColumnNamesTest.java @@ -0,0 +1,85 @@ +package com.marklogic.spark.reader.optic; + +import com.marklogic.spark.Options; +import org.apache.spark.sql.Column; +import org.apache.spark.sql.Row; +import org.junit.jupiter.api.Test; + +import java.util.List; + +import static org.junit.jupiter.api.Assertions.assertEquals; + +class GroupByDuplicateColumnNamesTest extends AbstractPushDownTest { + + @Test + void sameColumnNameTwice() { + List rows = newDefaultReader() + .option(Options.READ_OPTIC_QUERY, "op.fromView('Medical', 'Authors', '')") + .load() + .groupBy("CitationID", "CitationID") + .sum("LuckyNumber") + .orderBy("CitationID") + .collectAsList(); + + assertEquals(5, rows.size(), "This verifies that the duplicate groupBy column name is not passed to Optic, " + + "which would cause an error of 'Grouping key shouldn't have duplicates'."); + assertRowsReadFromMarkLogic(5, "The groupBy/sum should have been pushed down."); + + String message = "The first two columns are expected to both be named " + + "'CitationID', which is a little surprising but appears to be expected by Spark. 
And the " + "columns should have the same CitationID value."; + for (int i = 0; i < 5; i++) { + Row row = rows.get(i); + int expectedCitationID = i + 1; + assertEquals(expectedCitationID, row.getLong(0), message); + assertEquals(expectedCitationID, row.getLong(1), message); + } + } + + @Test + void sameColumnViaFieldReference() { + List<Row> rows = newDefaultReader() + .option(Options.READ_OPTIC_QUERY, "op.fromView('Medical', 'Authors', '')") + .load() + .withColumn("otherID", new Column("CitationID")) + .groupBy("otherID", "CitationID") + .sum("LuckyNumber") + .collectAsList(); + + assertEquals(5, rows.size()); + assertRowsReadFromMarkLogic(5, "The groupBy/sum should have been pushed down."); + + String message = "withColumn does not seem to work correctly for our connector when it only creates an alias. Our " + "connector is not passed 'CitationID' and 'otherID', but rather it gets 'CitationID' twice. So our " + "connector doesn't even know about 'otherID'. The connector then returns rows with the correct values for " + "'CitationID' and 'SUM(LuckyNumber)', but for an unknown reason, JsonRowDeserializer does not copy the " + "'CitationID' value into the 'CitationID' column. And it understandably leaves the 'otherID' column " + "blank since it doesn't know about that column."; + rows.forEach(row -> { + assertEquals(0, row.getLong(0), message); + assertEquals(0, row.getLong(1), message); + }); + } + + @Test + void groupByTwoColumns() { + List<Row> rows = newDefaultReader() + .option(Options.READ_OPTIC_QUERY, "op.fromView('Medical', 'Authors', '')") + .load() + .withColumn("CitationIDPlusOne", new Column("CitationID").plus(1)) + .groupBy("CitationID", "CitationIDPlusOne") + .sum("LuckyNumber") + .orderBy("CitationID") + .collectAsList(); + + assertEquals(5, rows.size()); + assertRowsReadFromMarkLogic(15, "Surprisingly, Spark never calls pushDownAggregation in this scenario. The " + "use of withColumn with a new set of values seems to prevent that, but it is not known why. We get back " + "the correct data from Spark, but the groupBy/sum are not pushed down."); + for (int i = 0; i < 5; i++) { + Row row = rows.get(i); + assertEquals(i + 1, row.getLong(0)); + assertEquals(i + 2, row.getLong(1)); + } + } +} diff --git a/src/test/java/com/marklogic/spark/reader/optic/PushDownFilterTest.java b/src/test/java/com/marklogic/spark/reader/optic/PushDownFilterTest.java index c369f3a5..758f4b60 100644 --- a/src/test/java/com/marklogic/spark/reader/optic/PushDownFilterTest.java +++ b/src/test/java/com/marklogic/spark/reader/optic/PushDownFilterTest.java @@ -129,6 +129,44 @@ void and() { assertRowsReadFromMarkLogic(9); } + /** + * Captured in MLE-13771. + */ + @Test + void multipleFilters() { + Dataset<Row> dataset = newDataset(); + dataset = dataset + .filter(dataset.col("LastName").contains("umbe")) + .filter(dataset.col("CitationID").equalTo(5)); + + List<Row> rows = dataset.collectAsList(); + assertEquals(1, rows.size()); + assertRowsReadFromMarkLogic(1, "The two filters should be tossed into separate Optic 'where' clauses so " + "that an op.sqlCondition is not improperly added to an op.and, which Optic does not allow.
The " + "filters should thus both be pushed down successfully."); + } + + @Test + void orClauseWithSqlCondition() { + assertEquals(2, getCountOfRowsWithFilter("LastName LIKE '%ool%' OR LastName LIKE '%olb%'")); + assertRowsReadFromMarkLogic(15, "An OR with a sqlCondition cannot be pushed down."); + } + + @Test + void notClauseWithSqlCondition() { + assertEquals(14, getCountOfRowsWithFilter("NOT LastName LIKE '%ool%'")); + assertRowsReadFromMarkLogic(15, "A NOT with a sqlCondition cannot be pushed down."); + } + + @Test + void andClauseWithSqlCondition() { + assertEquals(1, getCountOfRowsWithFilter("LastName LIKE '%ool%' AND ForeName LIKE '%ivi%'")); + assertRowsReadFromMarkLogic(1, "Since Spark defaults to AND'ing clauses together, it will not construct " + "an 'AND' operator. Instead, it will just send the two 'LIKE' expressions as two separate filters to " + "our connector, and our connector will create two separate Optic sqlCondition's, thus pushing both " + "filters down to MarkLogic."); + } + @Test void or() { assertEquals(8, getCountOfRowsWithFilter("CitationID == 1 OR CitationID == 2")); @@ -250,6 +288,7 @@ void stringEndsWithNoMatch() { private Dataset<Row> newDataset() { return newDefaultReader() .option(Options.READ_OPTIC_QUERY, QUERY_WITH_NO_QUALIFIER) + .option(Options.READ_PUSH_DOWN_AGGREGATES, false) .load(); } diff --git a/src/test/java/com/marklogic/spark/reader/optic/ReadFromNonRestApiServerTest.java b/src/test/java/com/marklogic/spark/reader/optic/ReadFromNonRestApiServerTest.java new file mode 100644 index 00000000..b787bff9 --- /dev/null +++ b/src/test/java/com/marklogic/spark/reader/optic/ReadFromNonRestApiServerTest.java @@ -0,0 +1,30 @@ +package com.marklogic.spark.reader.optic; + +import com.marklogic.spark.AbstractIntegrationTest; +import com.marklogic.spark.ConnectorException; +import com.marklogic.spark.Options; +import org.apache.spark.sql.DataFrameReader; +import org.junit.jupiter.api.Test; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertThrows; + +class ReadFromNonRestApiServerTest extends AbstractIntegrationTest { + + /** + * Verifies that when a user connects to a non-REST-API server, we give them something more helpful than just a + * 404. + */ + @Test + void test() { + DataFrameReader reader = newSparkSession() + .read().format(CONNECTOR_IDENTIFIER) + .option(Options.READ_OPTIC_QUERY, NO_AUTHORS_QUERY) + .option(Options.CLIENT_URI, String.format("admin:admin@%s:8001", testConfig.getHost())); + + ConnectorException ex = assertThrows(ConnectorException.class, () -> reader.load()); + assertEquals("Unable to connect to MarkLogic; status code: 404; ensure that " + "you are attempting to connect to a MarkLogic REST API app server.
See the MarkLogic documentation on " + + "REST API app servers for more information.", ex.getMessage()); + } +} diff --git a/src/test/java/com/marklogic/spark/reader/optic/ReadRowsFromTempViewTest.java b/src/test/java/com/marklogic/spark/reader/optic/ReadRowsFromTempViewTest.java new file mode 100644 index 00000000..f1e51041 --- /dev/null +++ b/src/test/java/com/marklogic/spark/reader/optic/ReadRowsFromTempViewTest.java @@ -0,0 +1,55 @@ +package com.marklogic.spark.reader.optic; + +import com.marklogic.spark.AbstractIntegrationTest; +import com.marklogic.spark.Options; +import org.apache.spark.sql.Row; +import org.junit.jupiter.api.Test; + +import java.util.List; + +import static org.junit.jupiter.api.Assertions.assertEquals; + +class ReadRowsFromTempViewTest extends AbstractIntegrationTest { + + /** + * Demonstrates that Spark's sql method can be used against a temporary Spark view based on our connector. This is + * not intended to demonstrate that every possible SQL clause works - though the assumption is that a SQL query + * is translated into an expected set of Spark filter classes that may be pushed down to our connector. + */ + @Test + void tempView() { + newDefaultReader() + .option(Options.READ_OPTIC_QUERY, "op.fromView('Medical','Authors', '')") + .load() + .createOrReplaceTempView("Author"); + + assertEquals(15, sparkSession.sql("select * from Author").count(), "All 15 author rows should " + + "be accessible via Spark's sql method against the temp view."); + + List rows = sparkSession.sql("select * from Author where Base64Value is not null").collectAsList(); + assertEquals(1, rows.size(), "The 'is not null' should get pushed down to our connector via the temp view so " + + "that only the 1 row with a Base64Value is selected."); + assertEquals(1, rows.get(0).getLong(0)); + assertEquals("Golby", rows.get(0).getString(1)); + + rows = sparkSession.sql("select * from Author where Base64Value is not null and CitationID = 2").collectAsList(); + assertEquals(0, rows.size(), "No rows are expected since the only row with a Base64Value has a CitationID of 1."); + } + + @Test + void sqlWithLocate() { + newDefaultReader() + .option(Options.READ_OPTIC_QUERY, "op.fromView('Medical','Authors', 'test')") + .load() + .createOrReplaceTempView("Author"); + + List rows = sparkSession + .sql("select `test.CitationID`, `test.LastName` from Author where locate('umb', `test.LastName`) >= 1") + .collectAsList(); + assertEquals(1, rows.size()); + assertEquals(5, rows.get(0).getLong(0)); + assertEquals("Humbee", rows.get(0).getString(1)); + } + + +} diff --git a/src/test/java/com/marklogic/spark/reader/optic/ReadRowsTest.java b/src/test/java/com/marklogic/spark/reader/optic/ReadRowsTest.java index 114be1eb..be42a3cb 100644 --- a/src/test/java/com/marklogic/spark/reader/optic/ReadRowsTest.java +++ b/src/test/java/com/marklogic/spark/reader/optic/ReadRowsTest.java @@ -86,7 +86,7 @@ void invalidQuery() { DataFrameReader reader = newDefaultReader().option(Options.READ_OPTIC_QUERY, "op.fromView('Medical', 'ViewNotFound')"); RuntimeException ex = assertThrows(RuntimeException.class, () -> reader.load()); - assertTrue(ex.getMessage().startsWith("Unable to run Optic DSL query op.fromView('Medical', 'ViewNotFound'); cause: "), + assertTrue(ex.getMessage().startsWith("Unable to run Optic query op.fromView('Medical', 'ViewNotFound'); cause: "), "If the query throws an error for any reason other than no rows being found, the error should be wrapped " + "in a new error with a message containing the user's query to 
assist with debugging; actual " + "message: " + ex.getMessage()); @@ -109,7 +109,7 @@ void nonNumericPartitionCount() { .option(Options.READ_NUM_PARTITIONS, "abc") .load(); ConnectorException ex = assertThrows(ConnectorException.class, () -> reader.count()); - assertEquals("Value of 'spark.marklogic.read.numPartitions' option must be numeric.", ex.getMessage()); + assertEquals("The value of 'spark.marklogic.read.numPartitions' must be numeric.", ex.getMessage()); } @Test @@ -119,7 +119,7 @@ void partitionCountLessThanOne() { .load(); ConnectorException ex = assertThrows(ConnectorException.class, () -> reader.count()); - assertEquals("Value of 'spark.marklogic.read.numPartitions' option must be 1 or greater.", ex.getMessage()); + assertEquals("The value of 'spark.marklogic.read.numPartitions' must be 1 or greater.", ex.getMessage()); } @Test @@ -129,7 +129,7 @@ void nonNumericBatchSize() { .load(); ConnectorException ex = assertThrows(ConnectorException.class, () -> reader.count()); - assertEquals("Value of 'spark.marklogic.read.batchSize' option must be numeric.", ex.getMessage()); + assertEquals("The value of 'spark.marklogic.read.batchSize' must be numeric.", ex.getMessage()); } @Test @@ -138,6 +138,6 @@ void batchSizeLessThanZero() { .option(Options.READ_BATCH_SIZE, "-1") .load(); ConnectorException ex = assertThrows(ConnectorException.class, () -> reader.count()); - assertEquals("Value of 'spark.marklogic.read.batchSize' option must be 0 or greater.", ex.getMessage()); + assertEquals("The value of 'spark.marklogic.read.batchSize' must be 0 or greater.", ex.getMessage()); } } diff --git a/src/test/java/com/marklogic/spark/reader/optic/ReadStreamOfRowsTest.java b/src/test/java/com/marklogic/spark/reader/optic/ReadStreamOfRowsTest.java index a43e9910..18dfb5f6 100644 --- a/src/test/java/com/marklogic/spark/reader/optic/ReadStreamOfRowsTest.java +++ b/src/test/java/com/marklogic/spark/reader/optic/ReadStreamOfRowsTest.java @@ -16,6 +16,7 @@ package com.marklogic.spark.reader.optic; import com.marklogic.spark.AbstractIntegrationTest; +import com.marklogic.spark.ConnectorException; import com.marklogic.spark.Options; import org.apache.spark.sql.streaming.DataStreamReader; import org.junit.jupiter.api.Test; @@ -87,7 +88,7 @@ void readWithNoQuery() { .format(CONNECTOR_IDENTIFIER) .option(Options.CLIENT_URI, makeClientUri()); - IllegalArgumentException ex = assertThrows(IllegalArgumentException.class, () -> reader.load()); + ConnectorException ex = assertThrows(ConnectorException.class, () -> reader.load()); assertEquals("No Optic query found; must define spark.marklogic.read.opticQuery", ex.getMessage()); } } diff --git a/src/test/java/com/marklogic/spark/reader/optic/ReadWithClientUriTest.java b/src/test/java/com/marklogic/spark/reader/optic/ReadWithClientUriTest.java index 53fa8d32..fe55b532 100644 --- a/src/test/java/com/marklogic/spark/reader/optic/ReadWithClientUriTest.java +++ b/src/test/java/com/marklogic/spark/reader/optic/ReadWithClientUriTest.java @@ -16,6 +16,7 @@ package com.marklogic.spark.reader.optic; import com.marklogic.spark.AbstractIntegrationTest; +import com.marklogic.spark.ConnectorException; import com.marklogic.spark.Options; import org.apache.spark.sql.Row; import org.apache.spark.sql.types.DataTypes; @@ -134,6 +135,22 @@ void threeTokensAfterAtSymbol() { verifyClientUriIsInvalid("user:password@host:port:something"); } + @Test + void nonNumericPort() { + ConnectorException ex = assertThrows(ConnectorException.class, + () -> 
readRowsWithClientUri("user:password@host:nonNumericPort")); + assertEquals("Invalid value for spark.marklogic.client.uri; port must be numeric, but was 'nonNumericPort'", + ex.getMessage()); + } + + @Test + void nonNumericPortWithDatabase() { + ConnectorException ex = assertThrows(ConnectorException.class, + () -> readRowsWithClientUri("user:password@host:nonNumericPort/database")); + assertEquals("Invalid value for spark.marklogic.client.uri; port must be numeric, but was 'nonNumericPort'", + ex.getMessage()); + } + private List readRowsWithClientUri(String clientUri) { return newSparkSession() .read() @@ -145,13 +162,10 @@ private List readRowsWithClientUri(String clientUri) { } private void verifyClientUriIsInvalid(String clientUri) { - IllegalArgumentException ex = assertThrows( - IllegalArgumentException.class, - () -> readRowsWithClientUri(clientUri) - ); - + ConnectorException ex = assertThrows(ConnectorException.class, + () -> readRowsWithClientUri(clientUri)); assertEquals( - "Invalid value for spark.marklogic.client.uri; must be username:password@host:port", + "Invalid value for spark.marklogic.client.uri; must be username:password@host:port/optionalDatabaseName", ex.getMessage() ); } diff --git a/src/test/java/com/marklogic/spark/reader/optic/SerializeOpticReaderObjectsTest.java b/src/test/java/com/marklogic/spark/reader/optic/SerializeOpticReaderObjectsTest.java index 74b08a1b..b5debc0a 100644 --- a/src/test/java/com/marklogic/spark/reader/optic/SerializeOpticReaderObjectsTest.java +++ b/src/test/java/com/marklogic/spark/reader/optic/SerializeOpticReaderObjectsTest.java @@ -1,6 +1,5 @@ package com.marklogic.spark.reader.optic; -import com.fasterxml.jackson.databind.ObjectMapper; import com.fasterxml.jackson.databind.node.ObjectNode; import com.marklogic.spark.AbstractIntegrationTest; import com.marklogic.spark.Options; @@ -38,7 +37,7 @@ void factory() { Map props = new HashMap<>(); props.put(Options.CLIENT_URI, makeClientUri()); props.put(Options.READ_OPTIC_QUERY, NO_AUTHORS_QUERY); - ReadContext context = new ReadContext(props, new StructType().add("myType", DataTypes.StringType), 1); + OpticReadContext context = new OpticReadContext(props, new StructType().add("myType", DataTypes.StringType), 1); OpticPartitionReaderFactory factory = new OpticPartitionReaderFactory(context); factory = (OpticPartitionReaderFactory) SerializeUtil.serialize(factory); diff --git a/src/test/java/com/marklogic/spark/reader/triples/ReadTriplesTest.java b/src/test/java/com/marklogic/spark/reader/triples/ReadTriplesTest.java new file mode 100644 index 00000000..f12e4918 --- /dev/null +++ b/src/test/java/com/marklogic/spark/reader/triples/ReadTriplesTest.java @@ -0,0 +1,166 @@ +package com.marklogic.spark.reader.triples; + +import com.marklogic.spark.AbstractIntegrationTest; +import com.marklogic.spark.Options; +import org.apache.spark.sql.DataFrameReader; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.SaveMode; +import org.junit.jupiter.api.Test; + +import java.util.List; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertTrue; + +class ReadTriplesTest extends AbstractIntegrationTest { + + @Test + void graph() { + List rows = startRead() + .option(Options.READ_TRIPLES_GRAPHS, "http://example.org/graph") + .load().collectAsList(); + + assertEquals(8, rows.size()); + + // Verify a row with both datatype and lang. 
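+ // Triples rows expose the columns: subject (0), predicate (1), object (2), datatype (3), lang (4), graph (5).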
+ Row langRow = rows.stream().filter(row -> "Debt Management".equals(row.getString(2))).findFirst().get(); + assertEquals("http://vocabulary.worldbank.org/taxonomy/451", langRow.getString(0)); + assertEquals("http://www.w3.org/2004/02/skos/core#prefLabel", langRow.getString(1)); + assertEquals("http://www.w3.org/1999/02/22-rdf-syntax-ns#langString", langRow.getString(3)); + assertEquals("en", langRow.getString(4)); + assertEquals("http://example.org/graph", langRow.getString(5)); + + // Verify a row with a datatype but no lang. + Row creatorRow = rows.stream().filter(row -> "http://purl.org/dc/terms/creator".equals(row.getString(1))).findFirst().get(); + assertEquals("wb", creatorRow.getString(2)); + assertEquals("http://www.w3.org/2001/XMLSchema#string", creatorRow.getString(3)); + assertTrue(creatorRow.isNullAt(4)); + + // Verify a row with neither a datatype nor a lang. + Row conceptRow = rows.stream().filter(row -> "http://www.w3.org/2004/02/skos/core#Concept".equals(row.getString(2))).findFirst().get(); + assertTrue(conceptRow.isNullAt(3)); + assertTrue(conceptRow.isNullAt(4)); + } + + @Test + void defaultGraph() { + newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_TYPE, "rdf") + .load("src/test/resources/rdf/mini-taxonomy.xml") + .write() + .format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.WRITE_PERMISSIONS, DEFAULT_PERMISSIONS) + .option(Options.WRITE_URI_PREFIX, "/defaultgraph") + .mode(SaveMode.Append) + .save(); + + List rows = startRead() + .option(Options.READ_TRIPLES_DIRECTORY, "/defaultgraph/") + .load().collectAsList(); + + assertEquals(8, rows.size()); + rows.forEach(row -> assertEquals("http://marklogic.com/semantics#default-graph", + row.getString(5), "Since no graph was specified when the triples were loaded, the triples should have " + + "been assigned to MarkLogic's default graph. That value should then be in the 'graph' column for " + + "each of the triples when they're read back.")); + } + + @Test + void multipleGraphs() { + long count = startRead() + .option(Options.READ_TRIPLES_GRAPHS, "http://example.org/graph,other-graph") + .load().count(); + + assertEquals(16, count, "Should include the 8 triples from each test file in the 2 given graphs."); + } + + @Test + void collections() { + long count = startRead() + .option(Options.READ_TRIPLES_COLLECTIONS, "http://example.org/graph") + .load().count(); + + assertEquals(16, count, "We expect 16 triples back as MarkLogic sees each triple on the test document " + + "as belonging to two graphs - 'http://example.org/graph' and 'test-config'."); + } + + @Test + void twoCollections() { + long count = startRead() + .option(Options.READ_TRIPLES_COLLECTIONS, "http://example.org/graph,other-graph") + .option(Options.READ_BATCH_SIZE, 5) + .option(Options.READ_LOG_PROGRESS, 10) + .load().count(); + + assertEquals(32, count, "Since both test triples files belong to 'test-config', and each also belongs to " + + "a second collection, the 8 triples in each file are returned twice - once for each collection - " + + "producing a total of 32 triples."); + } + + @Test + void graphAndCollection() { + long count = startRead() + .option(Options.READ_TRIPLES_GRAPHS, "http://example.org/graph") + .option(Options.READ_TRIPLES_COLLECTIONS, "test-config") + .load().count(); + + assertEquals(8, count, "When both graphs and collections are specified, they should result in a collection " + + "query that constrains on both collections. 
But since a graph is specified, we only expect back the 8 " + + "triples in the given graph."); + } + + @Test + void directory() { + long count = startRead() + .option(Options.READ_TRIPLES_DIRECTORY, "/triples/") + .load().count(); + + assertEquals(16, count, "Since no graph is specified, we should get 16 triples, as each of the 8 triples " + + "is associated with 2 collections. The directory option ensures we pull from only one of the two test " + + "triples files."); + } + + @Test + void stringQueryWithOptions() { + long count = startRead() + .option(Options.READ_TRIPLES_STRING_QUERY, "tripleObject:\"http://vocabulary.worldbank.org/taxonomy/1107\"") + .option(Options.READ_TRIPLES_OPTIONS, "test-options") + .load().count(); + + assertEquals(16, count, "Since no graph is specified, we should get 16 triples, as each of the 8 triples " + + "is associated with 2 collections. The string query plus the options ensures we only get triples from =" + + "the test file that has a sem:object value of 'http://vocabulary.worldbank.org/taxonomy/1107'."); + } + + @Test + void structuredQuery() { + String query = "" + + "Debt Management"; + + long count = startRead() + .option(Options.READ_TRIPLES_QUERY, query) + .load().count(); + + assertEquals(16, count, "Since no graph is specified, we should get 16 triples, as each of the 8 triples " + + "is associated with 2 collections. And we should only get triples from the test triples document that " + + "has the term 'Debt Management in it."); + } + + @Test + void uris() { + long count = startRead() + .option(Options.READ_TRIPLES_URIS, "/triples/mini-taxonomy.xml\n/other-triples/other-taxonomy.xml") + .load().count(); + + assertEquals(32, count, "Since no graph is specified, the 8 triples in each test triples document get " + + "returned twice, once per collection assigned to the document."); + } + + private DataFrameReader startRead() { + return newSparkSession() + .read().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()); + } +} diff --git a/src/test/java/com/marklogic/spark/reader/triples/ReadTriplesWithBaseIriTest.java b/src/test/java/com/marklogic/spark/reader/triples/ReadTriplesWithBaseIriTest.java new file mode 100644 index 00000000..31de4d8f --- /dev/null +++ b/src/test/java/com/marklogic/spark/reader/triples/ReadTriplesWithBaseIriTest.java @@ -0,0 +1,44 @@ +package com.marklogic.spark.reader.triples; + +import com.marklogic.spark.AbstractIntegrationTest; +import com.marklogic.spark.Options; +import org.apache.spark.sql.DataFrameReader; +import org.apache.spark.sql.Row; +import org.junit.jupiter.api.Test; + +import java.util.List; + +import static org.junit.jupiter.api.Assertions.assertEquals; + +class ReadTriplesWithBaseIriTest extends AbstractIntegrationTest { + + @Test + void relativeGraph() { + List rows = startRead() + .option(Options.READ_TRIPLES_GRAPHS, "other-graph") + .option(Options.READ_TRIPLES_BASE_IRI, "/my-base-iri/") + .load().collectAsList(); + + assertEquals(8, rows.size()); + rows.forEach(row -> assertEquals("/my-base-iri/other-graph", row.getString(5), "When the triple's graph " + + "value is relative, the user-provided base IRI should be prepended to it.")); + } + + @Test + void absoluteGraph() { + List rows = startRead() + .option(Options.READ_TRIPLES_GRAPHS, "http://example.org/graph") + .option(Options.READ_TRIPLES_BASE_IRI, "/my-base-iri") + .load().collectAsList(); + + assertEquals(8, rows.size()); + rows.forEach(row -> assertEquals("http://example.org/graph", row.getString(5), "When the triple's graph 
" + + "is absolute, the user-provided base IRI should not be prepended to it.")); + } + + private DataFrameReader startRead() { + return newSparkSession() + .read().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()); + } +} diff --git a/src/test/java/com/marklogic/spark/writer/IgnoreNullValuesTest.java b/src/test/java/com/marklogic/spark/writer/IgnoreNullValuesTest.java new file mode 100644 index 00000000..0dd9d18b --- /dev/null +++ b/src/test/java/com/marklogic/spark/writer/IgnoreNullValuesTest.java @@ -0,0 +1,102 @@ +/* + * Copyright © 2024 Progress Software Corporation and/or its subsidiaries or affiliates. All Rights Reserved. + */ +package com.marklogic.spark.writer; + +import com.fasterxml.jackson.databind.JsonNode; +import com.marklogic.junit5.XmlNode; +import com.marklogic.spark.AbstractIntegrationTest; +import com.marklogic.spark.Options; +import org.apache.spark.sql.Column; +import org.apache.spark.sql.SaveMode; +import org.junit.jupiter.api.Test; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertFalse; + +class IgnoreNullValuesTest extends AbstractIntegrationTest { + + @Test + void jsonWithEmptyValues() { + newSparkSession().read() + .option("header", "true") + .option("inferSchema", "true") + .csv("src/test/resources/csv-files/empty-values.csv") + // Add the special file path column so we can verify it's not included in the JSON. + .withColumn("marklogic_spark_file_path", new Column("_metadata.file_path")) + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.WRITE_PERMISSIONS, DEFAULT_PERMISSIONS) + .option(Options.WRITE_URI_TEMPLATE, "/a/{number}.json") + .option(Options.WRITE_JSON_SERIALIZATION_OPTION_PREFIX + "ignoreNullFields", "true") + .mode(SaveMode.Append) + .save(); + + JsonNode doc = readJsonDocument("/a/1.json"); + assertEquals(1, doc.get("number").asInt()); + assertEquals("blue", doc.get("color").asText()); + assertEquals(2, doc.size(), "The flag column should not be included in the serialization."); + + doc = readJsonDocument("/a/2.json"); + assertEquals(2, doc.get("number").asInt()); + assertEquals(" ", doc.get("color").asText(), "Verifies that whitespace is retained by default."); + assertFalse(doc.get("flag").asBoolean()); + assertEquals(3, doc.size()); + } + + @Test + void xmlWithEmptyValues() { + newSparkSession().read() + .option("header", "true") + .option("inferSchema", "true") + .csv("src/test/resources/csv-files/empty-values.csv") + .withColumn("marklogic_spark_file_path", new Column("_metadata.file_path")) + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.WRITE_PERMISSIONS, DEFAULT_PERMISSIONS) + .option(Options.WRITE_XML_ROOT_NAME, "test") + .option(Options.WRITE_URI_TEMPLATE, "/a/{number}.xml") + .option(Options.WRITE_JSON_SERIALIZATION_OPTION_PREFIX + "ignoreNullFields", "true") + .mode(SaveMode.Append) + .save(); + + XmlNode doc = readXmlDocument("/a/1.xml"); + doc.assertElementMissing("The empty flag column should be ignored", "/test/flag"); + doc.assertElementValue("/test/number", "1"); + doc.assertElementValue("/test/color", "blue"); + + // This is oddly misleading and won't show the whitespace in an element. 
+ doc = readXmlDocument("/a/2.xml"); + doc.assertElementValue("/test/number", "2"); + doc.assertElementValue("/test/color", " "); + doc.assertElementValue("/test/flag", "false"); + } + + @Test + void jsonLinesWithNestedFieldsConvertedToXml() { + newSparkSession().read() + .option("ignoreNullFields", "false") + .json("src/test/resources/json-lines/nested-objects.txt") + .withColumn("marklogic_spark_file_path", new Column("_metadata.file_path")) + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.WRITE_PERMISSIONS, DEFAULT_PERMISSIONS) + .option(Options.WRITE_XML_ROOT_NAME, "parent") + .option(Options.WRITE_URI_TEMPLATE, "/a/{id}.xml") + .option(Options.WRITE_JSON_SERIALIZATION_OPTION_PREFIX + "ignoreNullFields", "true") + .mode(SaveMode.Append) + .save(); + + XmlNode doc = readXmlDocument("/a/1.xml"); + doc.assertElementValue("/parent/data/color", "blue"); + doc.assertElementValue("/parent/data/numbers[1]", "1"); + doc.assertElementValue("/parent/data/numbers[2]", "2"); + doc.assertElementValue("/parent/hello", "world"); + doc.assertElementValue("/parent/id", "1"); + + doc = readXmlDocument("/a/2.xml"); + doc.assertElementMissing("'hello' should not appear. Spark JSON will actually include it in the schema and " + + "give it a value of null. But with ignoreNullFields set to true, it should be discarded.", + "/parent/hello"); + } +} diff --git a/src/test/java/com/marklogic/spark/writer/SerializeWriterObjectsTest.java b/src/test/java/com/marklogic/spark/writer/SerializeWriterObjectsTest.java index f7edd481..2d4932d0 100644 --- a/src/test/java/com/marklogic/spark/writer/SerializeWriterObjectsTest.java +++ b/src/test/java/com/marklogic/spark/writer/SerializeWriterObjectsTest.java @@ -19,7 +19,7 @@ void test() { Map props = new HashMap<>(); props.put(Options.CLIENT_URI, makeClientUri()); WriteContext writeContext = new WriteContext(new StructType().add("myType", DataTypes.StringType), props); - WriteBatcherDataWriterFactory factory = new WriteBatcherDataWriterFactory(writeContext); + WriteBatcherDataWriterFactory factory = new WriteBatcherDataWriterFactory(writeContext, null); factory = (WriteBatcherDataWriterFactory) SerializeUtil.serialize(factory); WriteBatcherDataWriter writer = (WriteBatcherDataWriter) factory.createWriter(1, 1l); diff --git a/src/test/java/com/marklogic/spark/writer/WriteArchiveOfFailedDocumentsTest.java b/src/test/java/com/marklogic/spark/writer/WriteArchiveOfFailedDocumentsTest.java new file mode 100644 index 00000000..3f3fd371 --- /dev/null +++ b/src/test/java/com/marklogic/spark/writer/WriteArchiveOfFailedDocumentsTest.java @@ -0,0 +1,166 @@ +package com.marklogic.spark.writer; + +import com.marklogic.spark.ConnectorException; +import com.marklogic.spark.Options; +import org.apache.spark.sql.Column; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.SaveMode; +import org.apache.spark.sql.SparkSession; +import org.junit.jupiter.api.AfterEach; +import org.junit.jupiter.api.Test; +import org.junit.jupiter.api.io.TempDir; +import scala.collection.JavaConversions; +import scala.collection.mutable.WrappedArray; + +import java.io.File; +import java.nio.file.Path; +import java.util.List; +import java.util.Map; +import java.util.concurrent.atomic.AtomicInteger; + +import static org.junit.jupiter.api.Assertions.*; + +class WriteArchiveOfFailedDocumentsTest extends AbstractWriteTest { + + @AfterEach + void afterEach() { + MarkLogicWrite.setFailureCountConsumer(null); + MarkLogicWrite.setSuccessCountConsumer(null); + 
} + + @Test + void happyPath(@TempDir Path tempDir) { + AtomicInteger successCount = new AtomicInteger(); + AtomicInteger failureCount = new AtomicInteger(); + MarkLogicWrite.setSuccessCountConsumer(count -> successCount.set(count)); + MarkLogicWrite.setFailureCountConsumer(count -> failureCount.set(count)); + + SparkSession session = newSparkSession(); + + defaultWrite(session.read().format(CONNECTOR_IDENTIFIER) + .load("src/test/resources/mixed-files") + .repartition(1) // Forces a single partition writer and thus a single archive file being written. + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.WRITE_COLLECTIONS, "partial-batch") + .option(Options.WRITE_URI_SUFFIX, ".json") + .option(Options.WRITE_ABORT_ON_FAILURE, false) + .option(Options.WRITE_ARCHIVE_PATH_FOR_FAILED_DOCUMENTS, tempDir.toFile().getAbsolutePath()) + ); + + assertCollectionSize("Only the JSON document should have succeeded; error messages should have been logged " + + "for the other 3 documents.", "partial-batch", 1); + assertEquals(1, successCount.get()); + assertEquals(3, failureCount.get()); + + // Read the archive file back in and verify the contents. + List rows = session.read().format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_TYPE, "archive") + .load(tempDir.toFile().getAbsolutePath()) + .sort(new Column("URI")) + .collectAsList(); + verifyArchiveRows(rows); + } + + @Test + void multiplePartitions(@TempDir Path tempDir) { + defaultWrite(newSparkSession().read().format(CONNECTOR_IDENTIFIER) + .load("src/test/resources/mixed-files") + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.WRITE_URI_SUFFIX, ".json") + .option(Options.WRITE_ABORT_ON_FAILURE, false) + .option(Options.WRITE_COLLECTIONS, "multiple-partitions") + .option(Options.WRITE_ARCHIVE_PATH_FOR_FAILED_DOCUMENTS, tempDir.toFile().getAbsolutePath()) + ); + + File[] archiveFiles = tempDir.toFile().listFiles(); + assertEquals(3, archiveFiles.length, "Each file is read and thus written via a separate partition. We " + + "expect 3 files then, 1 for each failed file. We should not get a 4th file for the partition that was " + + "able to write a file as a JSON document successfully."); + assertCollectionSize("multiple-partitions", 1); + } + + @Test + void invalidArchivePath() { + ConnectorException ex = assertThrowsConnectorException(() -> defaultWrite(newSparkSession().read().format(CONNECTOR_IDENTIFIER) + .load("src/test/resources/mixed-files") + .repartition(1) + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.WRITE_URI_SUFFIX, ".json") + .option(Options.WRITE_ABORT_ON_FAILURE, false) + .option(Options.WRITE_COLLECTIONS, "should-be-empty") + .option(Options.WRITE_ARCHIVE_PATH_FOR_FAILED_DOCUMENTS, "/invalid/path/doesnt/exist") + )); + + String message = ex.getMessage(); + assertTrue(message.startsWith("Unable to write failed documents to archive file at /invalid/path/doesnt/exist"), + "The write should have failed because the connector could not write an archive file to the invalid path. " + + "To avoid creating empty zip files, we don't try to create a zip file right away. The downside of " + + "this is that we don't 'fail fast' - i.e. we don't know that the path is invalid until a document " + + "fails to be written, and then we get an error when trying to write that document to an archive " + + "zip file. However, this scenario seems rare - a user can be expected to typically provide a valid " + + "path for their archive files. 
The rarity of that seems acceptable in favor of never creating empty " + + "zip files, which seems more annoying for a user to deal with. Actual error: " + message); + } + + @Test + void multipleFailedBatches(@TempDir Path tempDir) { + newSparkSession().read().format("json") + .option("multiLine", true) + .load("src/test/resources/500-employees.json") + .repartition(4) + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.WRITE_COLLECTIONS, "partial-batch") + .option(Options.WRITE_URI_SUFFIX, ".xml") + .option(Options.WRITE_ABORT_ON_FAILURE, false) + .option(Options.WRITE_ARCHIVE_PATH_FOR_FAILED_DOCUMENTS, tempDir.toFile().getAbsolutePath()) + .option(Options.WRITE_PERMISSIONS, DEFAULT_PERMISSIONS) + .mode(SaveMode.Append) + .save(); + + assertCollectionSize("partial-batch", 0); + assertEquals(4, tempDir.toFile().listFiles().length, "Expecting 1 archive for each of the 4 partition writers."); + + long count = newSparkSession().read().format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_TYPE, "archive") + .load(tempDir.toFile().getAbsolutePath()) + .count(); + + assertEquals(500, count, + "All 500 employee docs should have failed since we tried to write them as XML documents. All 500 " + + "should have been written to 4 separate archive files."); + } + + private void verifyArchiveRows(List rows) { + assertEquals(3, rows.size(), "Expecting one row for each of the 3 failed documents."); + + Row row = rows.get(0); + assertTrue(row.getString(0).endsWith("/hello.txt.json"), "Unexpected URI: " + row.getString(0)); + verifyMetadataColumns(row); + + row = rows.get(1); + assertTrue(row.getString(0).endsWith("/hello.xml.json"), "Unexpected URI: " + row.getString(0)); + verifyMetadataColumns(row); + + row = rows.get(2); + assertTrue(row.getString(0).endsWith("/hello2.txt.gz.json"), "Unexpected URI: " + row.getString(0)); + verifyMetadataColumns(row); + } + + private void verifyMetadataColumns(Row row) { + List collections = JavaConversions.seqAsJavaList(row.getSeq(3)); + assertEquals(1, collections.size()); + assertEquals("partial-batch", collections.get(0)); + + Map permissions = row.getJavaMap(4); + assertTrue(permissions.get("spark-user-role").toString().contains("READ")); + assertTrue(permissions.get("spark-user-role").toString().contains("UPDATE")); + + assertEquals(0, row.getInt(5)); + + assertTrue(row.isNullAt(6)); + + Map metadataValues = row.getJavaMap(7); + assertEquals(0, metadataValues.size()); + } +} diff --git a/src/test/java/com/marklogic/spark/writer/WriteFileRowsTest.java b/src/test/java/com/marklogic/spark/writer/WriteFileRowsTest.java index f2f1cd74..22b0e16b 100644 --- a/src/test/java/com/marklogic/spark/writer/WriteFileRowsTest.java +++ b/src/test/java/com/marklogic/spark/writer/WriteFileRowsTest.java @@ -1,12 +1,12 @@ package com.marklogic.spark.writer; -import com.fasterxml.jackson.databind.JsonNode; import com.marklogic.client.DatabaseClient; import com.marklogic.client.document.JSONDocumentManager; import com.marklogic.client.io.Format; import com.marklogic.client.io.InputStreamHandle; import com.marklogic.client.io.JacksonHandle; import com.marklogic.client.io.StringHandle; +import com.marklogic.junit5.XmlNode; import com.marklogic.spark.ConnectorException; import com.marklogic.spark.Options; import org.apache.spark.SparkException; @@ -16,7 +16,6 @@ import org.apache.spark.sql.SaveMode; import org.junit.jupiter.api.Test; -import java.io.File; import java.util.List; import java.util.stream.Stream; @@ -89,28 +88,26 @@ void 
replacePatternMissingSingleQuotes() { @Test void uriTemplate() { - File f = new File("src/test/resources/mixed-files/hello.json"); - Dataset dataset = newSparkSession() .read() .format("binaryFile") - .load("src/test/resources/mixed-files/*.json"); + .load("src/test/resources/mixed-files/*.xml"); Row row = dataset.collectAsList().get(0); // For as-yet unknown reasons, the timestamp in an InternalRow - which the connector writer receives - will // have 000 appended to it, thus capturing microseconds. But a Row will have the same value that the JVM // returns when getting lastModified for a File. This does not seem like an issue for a user, we just need to // account for it in our test. - final String expectedURI = String.format("/testfile/%d000/%d.json", row.getTimestamp(1).getTime(), row.getLong(2)); + final String expectedURI = String.format("/testfile/%d000/%d.xml", row.getTimestamp(1).getTime(), row.getLong(2)); defaultWrite(dataset.write() .format(CONNECTOR_IDENTIFIER) .option(Options.WRITE_COLLECTIONS, "template-test") - .option(Options.WRITE_URI_TEMPLATE, "/testfile/{modificationTime}/{length}.json") + .option(Options.WRITE_URI_TEMPLATE, "/testfile/{modificationTime}/{length}.xml") ); - JsonNode doc = readJsonDocument(expectedURI); - assertEquals("world", doc.get("hello").asText()); + XmlNode doc = readXmlDocument(expectedURI); + doc.assertElementValue("/hello", "world"); } @Test @@ -151,7 +148,7 @@ void invalidDocumentType() { SparkException ex = assertThrows(SparkException.class, () -> writer.save()); assertTrue(ex.getCause() instanceof ConnectorException); ConnectorException ce = (ConnectorException) ex.getCause(); - assertEquals("Invalid value for option " + Options.WRITE_FILE_ROWS_DOCUMENT_TYPE + ": not valid; " + + assertEquals("Invalid value for " + Options.WRITE_FILE_ROWS_DOCUMENT_TYPE + ": not valid; " + "must be one of 'JSON', 'XML', or 'TEXT'.", ce.getMessage()); } diff --git a/src/test/java/com/marklogic/spark/writer/WritePartialBatchTest.java b/src/test/java/com/marklogic/spark/writer/WritePartialBatchTest.java new file mode 100644 index 00000000..123e2e20 --- /dev/null +++ b/src/test/java/com/marklogic/spark/writer/WritePartialBatchTest.java @@ -0,0 +1,63 @@ +package com.marklogic.spark.writer; + +import com.marklogic.spark.ConnectorException; +import com.marklogic.spark.Options; +import org.apache.spark.sql.DataFrameWriter; +import org.apache.spark.sql.SaveMode; +import org.junit.jupiter.api.Test; + +import static org.junit.jupiter.api.Assertions.assertTrue; + +class WritePartialBatchTest extends AbstractWriteTest { + + @Test + void threeOutOfFourShouldFail() { + defaultWrite(newSparkSession().read().format(CONNECTOR_IDENTIFIER) + .load("src/test/resources/mixed-files") + .repartition(1) // Forces all 4 docs to be written in a single batch. 
+ .write().format(CONNECTOR_IDENTIFIER) + .option(Options.WRITE_COLLECTIONS, "partial-batch") + .option(Options.WRITE_URI_SUFFIX, ".json") + .option(Options.WRITE_URI_REPLACE, ".*/mixed-files,''") + .option(Options.WRITE_ABORT_ON_FAILURE, false) + ); + + assertCollectionSize("Only the JSON document should have succeeded; error messages should have been logged " + + "for the other 3 documents.", "partial-batch", 1); + } + + @Test + void shouldThrowError() { + DataFrameWriter writer = newSparkSession().read().format(CONNECTOR_IDENTIFIER) + .load("src/test/resources/mixed-files") + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.WRITE_URI_SUFFIX, ".json") + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.WRITE_PERMISSIONS, DEFAULT_PERMISSIONS) + .mode(SaveMode.Append); + + ConnectorException ex = assertThrowsConnectorException(() -> writer.save()); + assertTrue(ex.getMessage().contains("Document is not JSON"), "Verifying that trying to write non-JSON " + + "documents with a .json extension should produce an error; unexpected error: " + ex.getMessage()); + } + + @Test + void fiveOutOfTenShouldFailWithTwoBatches() { + defaultWrite(newSparkSession().read().format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_COMPRESSION, "zip") + .load("src/test/resources/spark-json/some-bad-json-docs.zip") + .repartition(1) // Forces a single partition writer. + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.WRITE_COLLECTIONS, "partial-batch") + .option(Options.WRITE_URI_REPLACE, ".*/spark-json,''") + .option(Options.WRITE_ABORT_ON_FAILURE, false) + .option(Options.WRITE_BATCH_SIZE, 5) // Forces two batches, each of which should have 1 or more failures. + ); + + assertCollectionSize( + "5 of the docs in the zip file are valid JSON docs, so those should all be written, regardless of how " + + "many partition writers and batches are created. Errors should be logged for the 5 invalid JSON docs.", + "partial-batch", 5 + ); + } +} diff --git a/src/test/java/com/marklogic/spark/writer/WriteRowsTest.java b/src/test/java/com/marklogic/spark/writer/WriteRowsTest.java index 4cf20e0f..7677a47b 100644 --- a/src/test/java/com/marklogic/spark/writer/WriteRowsTest.java +++ b/src/test/java/com/marklogic/spark/writer/WriteRowsTest.java @@ -23,7 +23,6 @@ import org.apache.spark.sql.DataFrameWriter; import org.junit.jupiter.api.Test; -import java.io.IOException; import java.util.concurrent.atomic.AtomicInteger; import static org.junit.jupiter.api.Assertions.*; @@ -39,6 +38,25 @@ void defaultBatchSizeAndThreadCount() { verifyTwoHundredDocsWereWritten(); } + @Test + void logProgressTest() { + newWriter(4) + // Including these options here to ensure they don't cause any issues, though we're not yet able to + // assert on the info-level log entries that they add. + .option(Options.WRITE_BATCH_SIZE, 8) + .option(Options.WRITE_THREAD_COUNT, 8) + .option(Options.WRITE_LOG_PROGRESS, 20) + .save(); + + verifyTwoHundredDocsWereWritten(); + + // For manual inspection, run it again to ensure that the progress counter was reset. 
+ newWriter(2) + .option(Options.WRITE_BATCH_SIZE, 10) + .option(Options.WRITE_LOG_PROGRESS, 40) + .save(); + } + @Test void batchSizeGreaterThanNumberOfRowsToWrite() { newWriter() @@ -56,13 +74,36 @@ void batchSizeGreaterThanNumberOfRowsToWrite() { @Test void twoPartitions() { - newWriter(2).save(); + newWriter(2) + .option(Options.WRITE_THREAD_COUNT_PER_PARTITION, 8) + .option(Options.WRITE_BATCH_SIZE, 10) + .save(); // Just verifies that the operation succeeds with multiple partitions. Check the logging to see that two // partitions were in fact created, each with its own WriteBatcher. verifyTwoHundredDocsWereWritten(); } + @Test + void insufficientPrivilegeForOtherDatabase() { + DataFrameWriter writer = newWriter(2) + .option(Options.WRITE_THREAD_COUNT_PER_PARTITION, 8) + .option(Options.WRITE_BATCH_SIZE, 10) + .option(Options.CLIENT_URI, "spark-test-user:spark@localhost:8016/Documents"); + + SparkException ex = assertThrows(SparkException.class, () -> writer.save()); + assertNull(ex.getCause(), "Surprisingly, in this scenario where the exception is thrown during the " + + "construction of WriteBatcherDataWriter, Spark does not populate the 'cause' of the exception but rather " + + "shoves the entire stacktrace of the exception into the exception message. This is not a good UX for " + + "connector or Flux users, as it puts an ugly stacktrace right into their face. I have not figured out " + + "how to avoid this yet, so this test is capturing this behavior in the hopes that an upgraded version of " + + "Spark will properly set the cause instead."); + assertTrue(ex.getMessage().contains("at com.marklogic.client.impl.OkHttpServices"), "This is confirming that " + + "the exception message contains the stacktrace of the MarkLogic exception - which we don't want. Hoping " + + "this assertion breaks during a future upgrade of Spark and we have a proper exception message " + + "instead. Actual message: " + ex.getMessage()); + } + @Test void temporalTest() { newWriterWithDefaultConfig("temporal-data.csv", 1) @@ -99,15 +140,15 @@ void testWithCustomConfig() { void invalidThreadCount() { DataFrameWriter writer = newWriter().option(Options.WRITE_THREAD_COUNT, 0); ConnectorException ex = assertThrows(ConnectorException.class, () -> writer.save()); - assertEquals("Value of 'spark.marklogic.write.threadCount' option must be 1 or greater.", ex.getMessage()); + assertEquals("The value of 'spark.marklogic.write.threadCount' must be 1 or greater.", ex.getMessage()); verifyNoDocsWereWritten(); } @Test void invalidBatchSize() { DataFrameWriter writer = newWriter().option(Options.WRITE_BATCH_SIZE, 0); - ConnectorException ex = assertThrowsConnectorException(() -> writer.save()); - assertEquals("Value of 'spark.marklogic.write.batchSize' option must be 1 or greater.", ex.getMessage(), + ConnectorException ex = assertThrows(ConnectorException.class, () -> writer.save()); + assertEquals("The value of 'spark.marklogic.write.batchSize' must be 1 or greater.", ex.getMessage(), "Note that batchSize is very different for writing than it is for reading. For writing, it specifies the " + "exact number of documents to send to MarkLogic in each call. 
For reading, it used to determine how " + "many requests will be made by a partition, and zero is a valid value for reading."); @@ -121,11 +162,10 @@ void invalidBatchSize() { */ @Test void userNotPermittedToWriteAndFailOnCommit() { - SparkException ex = assertThrows(SparkException.class, - () -> newWriter() - .option(Options.CLIENT_USERNAME, "spark-no-write-user") - .option(Options.WRITE_BATCH_SIZE, 500) - .save() + ConnectorException ex = assertThrowsConnectorException(() -> newWriter() + .option(Options.CLIENT_USERNAME, "spark-no-write-user") + .option(Options.WRITE_BATCH_SIZE, 500) + .save() ); verifyFailureIsDueToLackOfPermission(ex); @@ -152,12 +192,11 @@ void invalidPassword() { */ @Test void userNotPermittedToWriteAndFailOnWrite() { - SparkException ex = assertThrows(SparkException.class, - () -> newWriter() - .option(Options.CLIENT_USERNAME, "spark-no-write-user") - .option(Options.WRITE_BATCH_SIZE, 1) - .option(Options.WRITE_THREAD_COUNT, 1) - .save() + ConnectorException ex = assertThrowsConnectorException(() -> newWriter() + .option(Options.CLIENT_USERNAME, "spark-no-write-user") + .option(Options.WRITE_BATCH_SIZE, 1) + .option(Options.WRITE_THREAD_COUNT, 1) + .save() ); verifyFailureIsDueToLackOfPermission(ex); @@ -196,12 +235,9 @@ void dontAbortOnFailure() { assertEquals(1, successCount.get()); } - private void verifyFailureIsDueToLackOfPermission(SparkException ex) { - Throwable cause = getCauseFromWriterException(ex); - assertNotNull(cause, "Unexpected exception with no cause: " + ex.getClass() + "; " + ex.getMessage()); - assertTrue(cause instanceof IOException, "Unexpected cause: " + cause.getClass()); - assertTrue(cause.getMessage().contains("Server Message: You do not have permission to this method and URL"), - "Unexpected cause message: " + cause.getMessage()); + private void verifyFailureIsDueToLackOfPermission(ConnectorException ex) { + assertTrue(ex.getMessage().contains("Server Message: You do not have permission to this method and URL"), + "Unexpected cause message: " + ex.getMessage()); verifyNoDocsWereWritten(); } diff --git a/src/test/java/com/marklogic/spark/writer/WriteRowsWithFilePathTest.java b/src/test/java/com/marklogic/spark/writer/WriteRowsWithFilePathTest.java new file mode 100644 index 00000000..45b12f87 --- /dev/null +++ b/src/test/java/com/marklogic/spark/writer/WriteRowsWithFilePathTest.java @@ -0,0 +1,55 @@ +/* + * Copyright © 2024 Progress Software Corporation and/or its subsidiaries or affiliates. All Rights Reserved. + */ +package com.marklogic.spark.writer; + +import com.fasterxml.jackson.databind.JsonNode; +import com.marklogic.spark.Options; +import org.apache.spark.sql.Column; +import org.apache.spark.sql.SaveMode; +import org.junit.jupiter.api.Test; + +import java.util.List; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertTrue; + +class WriteRowsWithFilePathTest extends AbstractWriteTest { + + /** + * Intended to allow for Flux to optionally use the filename for an initial URI. Relevant any time we use Flux with + * a Spark data source that produces arbitrary data rows. 
+ */ + @Test + void test() { + newSparkSession().read() + .option("header", true) + .format("csv") + .csv("src/test/resources/data.csv") + .withColumn("marklogic_spark_file_path", new Column("_metadata.file_path")) + .limit(10) + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.WRITE_PERMISSIONS, DEFAULT_PERMISSIONS) + .option(Options.WRITE_COLLECTIONS, "some-files") + .option(Options.WRITE_URI_REPLACE, ".*/src/test,'/test'") + .mode(SaveMode.Append) + .save(); + + List uris = getUrisInCollection("some-files", 10); + uris.forEach(uri -> { + assertTrue(uri.startsWith("/test/resources/data.csv/"), "When a column named 'marklogic_spark_file_path' is passed " + + "to the connector for writing arbitrary rows, it will be used to construct an initial URI that " + + "also has a UUID in it. This is useful for the somewhat rare use case of wanting the physical file " + + "path to be a part of the URI (as opposed to using a URI template). Actual URI: " + uri); + + JsonNode doc = readJsonDocument(uri); + assertEquals(2, doc.size(), "The marklogic_spark_file_path column should not have been used when " + + "constructing the JSON document. This includes when ignoreNullFields is set to false. We still want " + + "the column removed as the column is an implementation detail that should not be exposed to the user. " + + "If we ever want the file path to be included in the document, we'll add an explicit feature for that."); + assertTrue(doc.has("docNum")); + assertTrue(doc.has("docName")); + }); + } +} diff --git a/src/test/java/com/marklogic/spark/writer/WriteRowsWithJsonRootNameTest.java b/src/test/java/com/marklogic/spark/writer/WriteRowsWithJsonRootNameTest.java new file mode 100644 index 00000000..65bd6c5f --- /dev/null +++ b/src/test/java/com/marklogic/spark/writer/WriteRowsWithJsonRootNameTest.java @@ -0,0 +1,48 @@ +package com.marklogic.spark.writer; + +import com.fasterxml.jackson.databind.JsonNode; +import com.marklogic.spark.Options; +import org.junit.jupiter.api.Test; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertTrue; + +class WriteRowsWithJsonRootNameTest extends AbstractWriteTest { + + @Test + void test() { + newWriter() + .option(Options.WRITE_JSON_ROOT_NAME, "myRootName") + .option(Options.WRITE_URI_TEMPLATE, "/myRootNameTest/{/myRootName/docNum}.json") + .save(); + + // Verify a few of the docs to ensure the rootName was used correctly. + for (int i = 1; i <= 3; i++) { + String uri = String.format("/myRootNameTest/%d.json", i); + JsonNode doc = readJsonDocument(uri, COLLECTION); + assertTrue(doc.has("myRootName")); + assertEquals(i, doc.get("myRootName").get("docNum").asInt()); + assertEquals("doc" + i, doc.get("myRootName").get("docName").asText()); + } + } + + /** + * Per https://restfulapi.net/valid-json-key-names/ , it appears that any valid, escaped string works as a + * field name. So this verifies that a potentially bad field name is escaped correctly. Implementation-wise, we + * expect that to be handled automatically by the Jackson library. + */ + @Test + void rootNameThatNeedsEscaping() { + final String weirdRootName = "{is-'this\"-valid?{"; + newWriter() + .option(Options.WRITE_JSON_ROOT_NAME, weirdRootName) + .save(); + + // Verify one document to ensure the weird root name is valid. 
+ String uri = getUrisInCollection(COLLECTION, 200).get(0); + JsonNode doc = readJsonDocument(uri); + assertTrue(doc.has(weirdRootName)); + assertTrue(doc.get(weirdRootName).has("docNum")); + assertTrue(doc.get(weirdRootName).has("docName")); + } +} diff --git a/src/test/java/com/marklogic/spark/writer/WriteRowsWithTransformTest.java b/src/test/java/com/marklogic/spark/writer/WriteRowsWithTransformTest.java index bbbb72de..9c578c7f 100644 --- a/src/test/java/com/marklogic/spark/writer/WriteRowsWithTransformTest.java +++ b/src/test/java/com/marklogic/spark/writer/WriteRowsWithTransformTest.java @@ -19,6 +19,7 @@ import com.marklogic.junit5.MarkLogicNamespaceProvider; import com.marklogic.junit5.NamespaceProvider; import com.marklogic.junit5.XmlNode; +import com.marklogic.spark.ConnectorException; import com.marklogic.spark.Options; import org.apache.spark.SparkException; import org.junit.jupiter.api.Test; @@ -93,32 +94,27 @@ void withParamsAndCustomDelimiter() { @Test void invalidTransform() { - SparkException ex = assertThrows(SparkException.class, - () -> newWriterForSingleRow() + ConnectorException ex = assertThrowsConnectorException(() -> newWriterForSingleRow() .option(Options.WRITE_TRANSFORM_NAME, "this-doesnt-exist") .save()); - Throwable cause = getCauseFromWriterException(ex); - assertTrue(cause instanceof IOException); - assertTrue(cause.getMessage().contains("Extension this-doesnt-exist or a dependency does not exist"), + assertTrue(ex.getMessage().contains("Extension this-doesnt-exist or a dependency does not exist"), "The connector can't easily validate that a REST transform is valid, but the expectation is that the " + "error message from the REST API will make the problem evident to the user; " + - "unexpected message: " + cause.getMessage()); + "unexpected message: " + ex.getMessage()); } @Test void invalidTransformParams() { - SparkException ex = assertThrows(SparkException.class, + ConnectorException ex = assertThrowsConnectorException( () -> newWriterForSingleRow() .option(Options.WRITE_TRANSFORM_NAME, "withParams") .option(Options.WRITE_TRANSFORM_PARAMS, "param1,value1,param2") .save()); - Throwable cause = getCauseFromWriterException(ex); - assertTrue(cause instanceof IllegalArgumentException); assertEquals( "The spark.marklogic.write.transformParams option must contain an equal number of parameter names and values; received: param1,value1,param2", - cause.getMessage() + ex.getMessage() ); } } diff --git a/src/test/java/com/marklogic/spark/writer/WriteRowsWithUriTemplateTest.java b/src/test/java/com/marklogic/spark/writer/WriteRowsWithUriTemplateTest.java index 481f4306..65b9b6d9 100644 --- a/src/test/java/com/marklogic/spark/writer/WriteRowsWithUriTemplateTest.java +++ b/src/test/java/com/marklogic/spark/writer/WriteRowsWithUriTemplateTest.java @@ -15,6 +15,7 @@ */ package com.marklogic.spark.writer; +import com.marklogic.spark.ConnectorException; import com.marklogic.spark.Options; import org.apache.spark.SparkException; import org.junit.jupiter.api.Test; @@ -72,12 +73,12 @@ void columnNameDoesntExist() { Throwable cause = getCauseFromWriterException(ex); assertTrue(cause instanceof RuntimeException, "Unexpected cause: " + cause); - final String expectedMessage = "Did not find column 'doesntExist' in row: " + + final String expectedMessage = "Expression 'doesntExist' did not resolve to a value in row: " + "{\"id\":\"1\",\"content\":\"hello world\"," + "\"systemStart\":\"2014-04-03T11:00:00\",\"systemEnd\":\"2014-04-03T16:00:00\"," + 
"\"validStart\":\"2014-04-03T11:00:00\",\"validEnd\":\"2014-04-03T16:00:00\"," + "\"columnWithOnlyWhitespace\":\" \"}; " + - "column is required by URI template: /test/{id}/{doesntExist}.json"; + "expression is required by URI template: /test/{id}/{doesntExist}.json"; assertEquals(expectedMessage, cause.getMessage(), "The entire JSON row is being included in the error " + "message so that the user is able to figure out what a column they chose in the URI template isn't " + @@ -100,15 +101,11 @@ void columnWithOnlyWhitespace() { } private void verifyTemplateIsInvalid(String uriTemplate, String expectedMessage) { - SparkException ex = assertThrows( - SparkException.class, + ConnectorException ex = assertThrowsConnectorException( () -> newWriter().option(Options.WRITE_URI_TEMPLATE, uriTemplate).save() ); - Throwable cause = getCauseFromWriterException(ex); - assertTrue(cause instanceof IllegalArgumentException, "Unexpected cause: " + cause); - - String message = cause.getMessage(); + String message = ex.getMessage(); expectedMessage = "Invalid value for " + Options.WRITE_URI_TEMPLATE + ": " + uriTemplate + "; " + expectedMessage; assertEquals(expectedMessage, message, "Unexpected error message: " + message); } diff --git a/src/test/java/com/marklogic/spark/writer/WriteSparkJsonTest.java b/src/test/java/com/marklogic/spark/writer/WriteSparkJsonTest.java new file mode 100644 index 00000000..925c367d --- /dev/null +++ b/src/test/java/com/marklogic/spark/writer/WriteSparkJsonTest.java @@ -0,0 +1,153 @@ +package com.marklogic.spark.writer; + +import com.fasterxml.jackson.databind.node.ArrayNode; +import com.fasterxml.jackson.databind.node.ObjectNode; +import com.marklogic.spark.Options; +import org.apache.spark.sql.SaveMode; +import org.junit.jupiter.api.Test; + +import static org.junit.jupiter.api.Assertions.assertEquals; + +/** + * Purpose of this class is to demonstrate how the Spark JSON data source - + * https://spark.apache.org/docs/latest/sql-data-sources-json.html - can be used with our connector. There are 3 + * separate use cases we can support - 1) Treat each JSON file as a separate row; 2) Treat each object in an array in + * a JSON file as a separate row; and 3) Use the 'JSON Lines' support for a JSON lines file, where each line becomes + * a row. In each scenario, we write each row as a document in MarkLogic. + *
+ * This test therefore doesn't test any code in our connector that isn't tested elsewhere. It's simply demonstrating + * how Spark JSON can be used with our connector. + */ +class WriteSparkJsonTest extends AbstractWriteTest { + + /** + * The default behavior of Spark JSON is that each line is expected to be a separate JSON object. Each line then + * becomes a row, which then gets written as a separate document to MarkLogic. + */ + @Test + void eachLineInJsonLinesFileBecomesADocument() { + newSparkSession().read().format("json") + .load("src/test/resources/spark-json/json-lines.txt") + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.WRITE_PERMISSIONS, DEFAULT_PERMISSIONS) + .option(Options.WRITE_URI_TEMPLATE, "/spark-json/{number}.json") + .option(Options.WRITE_JSON_SERIALIZATION_OPTION_PREFIX + "ignoreNullFields", "true") + .mode(SaveMode.Append) + .save(); + + verifyTheTwoJsonDocuments(); + } + + /** + * Happily, when Spark JSON is used with multiLine=true on a file containing an array of objects, Spark will + * read each object as a separate row. This allows for each object to become a separate document in MarkLogic. + * If the user does not want this behavior but rather wants the entire file to become a document, they need to use + * Spark's binaryFile data source as shown in a test below this one. + *
+ * Note that if a user attempts to read an array file without multiLine=true, they'll get this error: + * "Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the + * referenced columns only include the internal corrupt record column + * (named _corrupt_record by default)." + */ + @Test + void eachObjectInArrayBecomesADocument() { + newSparkSession().read().format("json") + .option("multiLine", true) + .load("src/test/resources/spark-json/array-of-objects.json") + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.WRITE_PERMISSIONS, DEFAULT_PERMISSIONS) + .option(Options.WRITE_URI_TEMPLATE, "/spark-json/{number}.json") + .option(Options.WRITE_JSON_SERIALIZATION_OPTION_PREFIX + "ignoreNullFields", "true") + .mode(SaveMode.Append) + .save(); + + verifyTheTwoJsonDocuments(); + } + + /** + * This test shows that the same multiLine=true reader can be used for both single object files and files containing + * an array of objects. + */ + @Test + void singleObjectFileAndArrayOfObjectsFile() { + newSparkSession().read().format("json") + .option("multiLine", true) + .load( + "src/test/resources/spark-json/array-of-objects.json", + "src/test/resources/spark-json/single-object.json" + ) + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.WRITE_PERMISSIONS, DEFAULT_PERMISSIONS) + .option(Options.WRITE_URI_TEMPLATE, "/spark-json/{number}.json") + .option(Options.WRITE_JSON_SERIALIZATION_OPTION_PREFIX + "ignoreNullFields", "true") + .mode(SaveMode.Append) + .save(); + + // Verifies the two objects from the array. + verifyTheTwoJsonDocuments(); + + // And verify that the object from single-object.json was created as well. + ObjectNode doc = (ObjectNode) readJsonDocument("/spark-json/3.json"); + assertEquals(3, doc.get("number").asInt()); + assertEquals("text", doc.get("parent").get("child").asText()); + } + + /** + * If the user has a JSON file that is an object, the Spark JSON data source can be used with multiLine=true. + * The benefit is that the fields in the JSON object become Spark columns, thereby making them accessible to our + * URI template feature. + */ + @Test + void jsonObjectFileBecomesDocument() { + newSparkSession().read().format("json") + .option("multiLine", true) + .load("src/test/resources/spark-json/single-object.json") + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.WRITE_PERMISSIONS, DEFAULT_PERMISSIONS) + .option(Options.WRITE_URI_TEMPLATE, "/spark-json/{number}.json") + .mode(SaveMode.Append) + .save(); + + ObjectNode doc = (ObjectNode) readJsonDocument("/spark-json/3.json"); + assertEquals(3, doc.get("number").asInt()); + assertEquals("text", doc.get("parent").get("child").asText()); + } + + /** + * If the user wants a JSON array file to become a single document, they need to use Spark's Binary data source + * so that the document is ingested as-is. This prohibits the use of our "URI template" feature, but that's likely + * not useful for a file containing an array as opposed to a file containing an object. 
+ */ + @Test + void jsonArrayFileBecomesADocument() { + newSparkSession().read().format("binaryFile") + .load("src/test/resources/spark-json/array-of-objects.json") + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.WRITE_PERMISSIONS, DEFAULT_PERMISSIONS) + .option(Options.WRITE_COLLECTIONS, "spark-file") + .mode(SaveMode.Append) + .save(); + + String uri = getUrisInCollection("spark-file", 1).get(0); + ArrayNode doc = (ArrayNode) readJsonDocument(uri); + assertEquals(2, doc.size(), "Expecting the file to be ingested as-is, so there should be an array with " + + "2 objects in it."); + } + + private void verifyTheTwoJsonDocuments() { + ObjectNode doc = (ObjectNode) readJsonDocument("/spark-json/1.json"); + assertEquals(1, doc.get("number").asInt()); + assertEquals("world", doc.get("hello").asText()); + assertEquals(2, doc.size(), "Should only have 'number' and 'hello' fields."); + + doc = (ObjectNode) readJsonDocument("/spark-json/2.json"); + assertEquals(2, doc.get("number").asInt()); + assertEquals("This is different from the first object.", doc.get("description").asText()); + assertEquals(2, doc.size(), "Should only have 'number' and 'description' fields."); + } +} diff --git a/src/test/java/com/marklogic/spark/writer/WriteXmlRowsTest.java b/src/test/java/com/marklogic/spark/writer/WriteXmlRowsTest.java new file mode 100644 index 00000000..92be741b --- /dev/null +++ b/src/test/java/com/marklogic/spark/writer/WriteXmlRowsTest.java @@ -0,0 +1,85 @@ +package com.marklogic.spark.writer; + +import com.marklogic.junit5.XmlNode; +import com.marklogic.spark.ConnectorException; +import com.marklogic.spark.Options; +import org.apache.spark.sql.DataFrameWriter; +import org.apache.spark.sql.SaveMode; +import org.jdom2.Namespace; +import org.junit.jupiter.api.Test; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertTrue; + +class WriteXmlRowsTest extends AbstractWriteTest { + + @Test + void rootNameAndNamespace() { + defaultWrite(newDefaultReader() + .load() + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.WRITE_COLLECTIONS, "with-namespace") + .option(Options.WRITE_XML_ROOT_NAME, "myDoc") + .option(Options.WRITE_XML_NAMESPACE, "example.org")); + + getUrisInCollection("with-namespace", 15).forEach(uri -> { + assertTrue(uri.endsWith(".xml"), + "When an XML root name is specified, the URI suffix should default to .xml; actual URI: " + uri); + XmlNode doc = readXmlDocument(uri); + doc.setNamespaces(new Namespace[]{Namespace.getNamespace("ex", "example.org")}); + doc.assertElementExists("/ex:myDoc/ex:Medical.Authors.ForeName"); + }); + } + + @Test + void noNamespace() { + defaultWrite(newDefaultReader() + .load() + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.WRITE_COLLECTIONS, "no-namespace") + .option(Options.WRITE_XML_ROOT_NAME, "myDoc")); + + getUrisInCollection("no-namespace", 15).forEach(uri -> { + assertTrue(uri.endsWith(".xml")); + readXmlDocument(uri).assertElementExists("/myDoc/Medical.Authors.ForeName"); + }); + } + + /** + * The URI template feature is a little funky when generating XML documents, because - at least as of 2.3.0 - the + * source for the template values is a JSON representation of the row. If/when we support XPath expression in a + * URI template, it would naturally make sense to instead create an XML representation of the row. 
But for now, our + * documentation will need to note that regardless of whether the user is generating a JSON or XML document, the + * URI template source is a JSON representation of the row. + */ + @Test + void uriTemplate() { + defaultWrite(newDefaultReader() + .option(Options.READ_OPTIC_QUERY, "op.fromView('Medical','Authors', '')") + .load() + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.WRITE_COLLECTIONS, "with-uri-template") + .option(Options.WRITE_URI_TEMPLATE, "/xml/{ForeName}") + .option(Options.WRITE_XML_ROOT_NAME, "myDoc") + .option(Options.WRITE_XML_NAMESPACE, "example.org")); + + assertCollectionSize("with-uri-template", 15); + XmlNode doc = readXmlDocument("/xml/Appolonia"); + doc.setNamespaces(new Namespace[]{Namespace.getNamespace("ex", "example.org")}); + doc.assertElementValue("/ex:myDoc/ex:LastName", "Edeler"); + } + + @Test + void xmlRootNameAndJsonRootName() { + DataFrameWriter writer = newDefaultReader() + .load() + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.WRITE_JSON_ROOT_NAME, "myJson") + .option(Options.WRITE_XML_ROOT_NAME, "myDoc") + .mode(SaveMode.Append); + + ConnectorException ex = assertThrowsConnectorException(() -> writer.save()); + assertEquals("Cannot specify both spark.marklogic.write.jsonRootName and spark.marklogic.write.xmlRootName", + ex.getMessage()); + } +} diff --git a/src/test/java/com/marklogic/spark/writer/customcode/ProcessWithCustomCodeTest.java b/src/test/java/com/marklogic/spark/writer/customcode/ProcessWithCustomCodeTest.java index 3638e8e8..aec436cf 100644 --- a/src/test/java/com/marklogic/spark/writer/customcode/ProcessWithCustomCodeTest.java +++ b/src/test/java/com/marklogic/spark/writer/customcode/ProcessWithCustomCodeTest.java @@ -16,6 +16,25 @@ class ProcessWithCustomCodeTest extends AbstractWriteTest { + @Test + void logProgressTest() { + newSparkSession().read().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.READ_XQUERY, "for $i in 1 to 100 return $i") + .load() + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + // With "uneven" numbers like this, the user will still see 5 progress entries, but the counts won't even - + // they'll be 24, 40, 64, 80, and 100. 
+ .option(Options.WRITE_BATCH_SIZE, 8) + .option(Options.WRITE_LOG_PROGRESS, 20) + .option(Options.WRITE_JAVASCRIPT, "var URI; console.log('Nothing to do here.')") + .mode(SaveMode.Append) + .save(); + + assertTrue(true, "No assertion needed, this test is only for manual inspection of the progress log entries."); + } + @Test void invokeJavaScript() { newWriterWithDefaultConfig("three-uris.csv", 2) @@ -37,6 +56,15 @@ void evalJavaScript() { verifyThreeJsonDocumentsWereWritten(); } + @Test + void evalJavaScriptFile() { + newWriterWithDefaultConfig("three-uris.csv", 2) + .option(Options.WRITE_JAVASCRIPT_FILE, "src/test/resources/custom-code/my-writer.js") + .save(); + + verifyThreeJsonDocumentsWereWritten(); + } + @Test void invokeXQuery() { newWriterWithDefaultConfig("three-uris.csv", 1) @@ -60,6 +88,15 @@ void evalXQuery() { verifyThreeXmlDocumentsWereWritten(); } + @Test + void evalXQueryFile() { + newWriterWithDefaultConfig("three-uris.csv", 1) + .option(Options.WRITE_XQUERY_FILE, "src/test/resources/custom-code/my-writer.xqy") + .save(); + + verifyThreeXmlDocumentsWereWritten(); + } + @Test void customExternalVariableName() { newWriterWithDefaultConfig("three-uris.csv", 2) diff --git a/src/test/java/com/marklogic/spark/writer/document/WriteDocumentRowsToMarkLogicTest.java b/src/test/java/com/marklogic/spark/writer/document/WriteDocumentRowsToMarkLogicTest.java index 12d2300a..26e27899 100644 --- a/src/test/java/com/marklogic/spark/writer/document/WriteDocumentRowsToMarkLogicTest.java +++ b/src/test/java/com/marklogic/spark/writer/document/WriteDocumentRowsToMarkLogicTest.java @@ -4,8 +4,10 @@ import com.marklogic.junit5.PermissionsTester; import com.marklogic.junit5.XmlNode; import com.marklogic.spark.AbstractIntegrationTest; +import com.marklogic.spark.ConnectorException; import com.marklogic.spark.Options; import com.marklogic.spark.TestUtil; +import org.apache.spark.sql.DataFrameWriter; import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row; import org.apache.spark.sql.SaveMode; @@ -74,6 +76,57 @@ void uriPrefix() { assertInCollections("/backup/test/2.xml", "backup-docs"); } + @Test + void writeXmlAsBinary() { + readTheTwoTestDocuments() + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.WRITE_URI_SUFFIX, ".unknown") + .mode(SaveMode.Append) + .save(); + + Stream.of("/test/1.xml.unknown", "/test/2.xml.unknown").forEach(uri -> { + String kind = getDatabaseClient().newServerEval() + .xquery(String.format("xdmp:node-kind(fn:doc('%s')/node())", uri)) + .evalAs(String.class); + assertEquals("binary", kind, "Verifying that MarkLogic stores the document as binary since it does not " + + "recognize the 'unknown' extension."); + }); + } + + @Test + void forceDocumentType() { + readTheTwoTestDocuments() + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.WRITE_URI_SUFFIX, ".unknown") + .option(Options.WRITE_DOCUMENT_TYPE, "xml") + .mode(SaveMode.Append) + .save(); + + Stream.of("/test/1.xml.unknown", "/test/2.xml.unknown").forEach(uri -> { + String kind = getDatabaseClient().newServerEval() + .xquery(String.format("xdmp:node-kind(fn:doc('%s')/node())", uri)) + .evalAs(String.class); + assertEquals("element", kind, "MarkLogic should write each document as XML, as it doesn't recognize the " + + "URI extension but the WRITE_DOCUMENT_TYPE option should force the document type in that scenario."); + }); + } + + @Test + void invalidDocumentType() { + DataFrameWriter writer = 
readTheTwoTestDocuments() + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.WRITE_URI_SUFFIX, ".unknown") + .option(Options.WRITE_DOCUMENT_TYPE, "notvalid") + .mode(SaveMode.Append); + + ConnectorException ex = assertThrowsConnectorException(() -> writer.save()); + assertEquals("Invalid value for spark.marklogic.write.documentType: notvalid; must be one of 'JSON', 'XML', or 'TEXT'.", + ex.getMessage()); + } + @Test void overrideMetadataFromCopiedDocuments() { readTheTwoTestDocuments() diff --git a/src/test/java/com/marklogic/spark/writer/document/WriteJsonRowsWithUriTemplateTest.java b/src/test/java/com/marklogic/spark/writer/document/WriteJsonRowsWithUriTemplateTest.java new file mode 100644 index 00000000..cf944a24 --- /dev/null +++ b/src/test/java/com/marklogic/spark/writer/document/WriteJsonRowsWithUriTemplateTest.java @@ -0,0 +1,81 @@ +package com.marklogic.spark.writer.document; + +import com.fasterxml.jackson.databind.JsonNode; +import com.fasterxml.jackson.databind.node.ArrayNode; +import com.marklogic.spark.AbstractIntegrationTest; +import com.marklogic.spark.Options; +import org.apache.spark.sql.SaveMode; +import org.junit.jupiter.api.Test; + +import static org.junit.jupiter.api.Assertions.assertEquals; + +class WriteJsonRowsWithUriTemplateTest extends AbstractIntegrationTest { + + @Test + void zipOfJsonObjects() { + newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_COMPRESSION, "zip") + .load("src/test/resources/spark-json/json-objects.zip") + .write() + .format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.WRITE_URI_TEMPLATE, "/zip/{number}.json") + .option(Options.WRITE_COLLECTIONS, "zip-test") + .option(Options.WRITE_PERMISSIONS, DEFAULT_PERMISSIONS) + .mode(SaveMode.Append) + .save(); + + assertCollectionSize("zip-test", 2); + + JsonNode doc = readJsonDocument("/zip/1.json"); + assertEquals("text1", doc.get("parent").get("child").asText()); + + doc = readJsonDocument("/zip/2.json"); + assertEquals("text2", doc.get("parent").get("child").asText()); + } + + + @Test + void zipWithJsonArray() { + newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_COMPRESSION, "zip") + .load("src/test/resources/spark-json/json-array.zip") + .write() + .format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.WRITE_URI_TEMPLATE, "/zip/{/0/number}.json") + .option(Options.WRITE_COLLECTIONS, "json-arrays") + .option(Options.WRITE_PERMISSIONS, DEFAULT_PERMISSIONS) + .mode(SaveMode.Append) + .save(); + + assertCollectionSize("json-arrays", 1); + ArrayNode array = (ArrayNode) readJsonDocument("/zip/1.json"); + assertEquals(2, array.size(), "A URI template can work against a JSON array by utilizing JSON Pointer " + + "expressions to refer to specific elements in an array."); + } + + @Test + void jsonReadViaBinaryFileSource() { + newSparkSession().read() + .format("binaryFile") + .load("src/test/resources/spark-json/single-object.json") + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.WRITE_URI_TEMPLATE, "/zip/{number}.json") + .option(Options.WRITE_COLLECTIONS, "binary-test") + .option(Options.WRITE_PERMISSIONS, DEFAULT_PERMISSIONS) + .mode(SaveMode.Append) + .save(); + + assertCollectionSize( + "The Spark binaryFile data source should be supported in that our connector should be able to apply a " + + "URI template against a JSON document read via 
the binaryFile data source.", + "binary-test", 1 + ); + JsonNode doc = readJsonDocument("/zip/3.json"); + assertEquals("text", doc.get("parent").get("child").asText()); + } +} diff --git a/src/test/java/com/marklogic/spark/writer/file/FileUtilTest.java b/src/test/java/com/marklogic/spark/writer/file/FileUtilTest.java index 4260c261..cbbac79f 100644 --- a/src/test/java/com/marklogic/spark/writer/file/FileUtilTest.java +++ b/src/test/java/com/marklogic/spark/writer/file/FileUtilTest.java @@ -1,10 +1,8 @@ package com.marklogic.spark.writer.file; -import com.marklogic.spark.ConnectorException; import org.junit.jupiter.api.Test; import static org.junit.jupiter.api.Assertions.assertEquals; -import static org.junit.jupiter.api.Assertions.assertThrows; class FileUtilTest { @@ -22,7 +20,9 @@ void makePathFromOpaqueURI() { @Test void makePathWithInvalidURI() { - ConnectorException ex = assertThrows(ConnectorException.class, () -> FileUtil.makePathFromDocumentURI(":::")); - assertEquals("Unable to construct URI from: :::", ex.getMessage()); + String uri = FileUtil.makePathFromDocumentURI("has space.json"); + assertEquals("has space.json", uri, "If a java.net.URI cannot be constructed - in this case, it's due to " + + "the space in the string - then the error should be logged at the DEBUG level and the original value " + + "should be returned."); } } diff --git a/src/test/java/com/marklogic/spark/writer/file/PrettyPrintFilesTest.java b/src/test/java/com/marklogic/spark/writer/file/PrettyPrintFilesTest.java new file mode 100644 index 00000000..aeba8a86 --- /dev/null +++ b/src/test/java/com/marklogic/spark/writer/file/PrettyPrintFilesTest.java @@ -0,0 +1,108 @@ +package com.marklogic.spark.writer.file; + +import com.marklogic.spark.AbstractIntegrationTest; +import com.marklogic.spark.Options; +import org.apache.commons.io.FileUtils; +import org.apache.spark.sql.Column; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.SaveMode; +import org.apache.spark.sql.SparkSession; +import org.junit.jupiter.api.Test; +import org.junit.jupiter.api.io.TempDir; + +import java.io.File; +import java.io.IOException; +import java.nio.file.Path; +import java.util.List; + +import static org.junit.jupiter.api.Assertions.assertEquals; + +class PrettyPrintFilesTest extends AbstractIntegrationTest { + + @Test + void xmlAndJson(@TempDir Path tempDir) throws IOException { + newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.READ_DOCUMENTS_COLLECTIONS, "pretty-print") + .load() + .write() + .format(CONNECTOR_IDENTIFIER) + .option(Options.WRITE_FILES_PRETTY_PRINT, "true") + .mode(SaveMode.Append) + .save(tempDir.toFile().getAbsolutePath()); + + File dir = new File(tempDir.toFile(), "pretty-print"); + String doc1 = FileUtils.readFileToString(new File(dir, "doc1.xml"), "UTF-8"); + assertEquals("\n" + + " world\n" + + "\n", doc1, + "Pretty-printing should result in the XML declaration being omitted and child elements being " + + "indented with a default indent of 4. 
This mirrors how XML is pretty-printed by MLCP."); + + String doc2 = FileUtils.readFileToString(new File(dir, "doc2.json"), "UTF-8"); + assertEquals("{\n" + + " \"hello\" : \"world\"\n" + + "}", doc2, "The JSON should be pretty-printed."); + } + + @Test + void zipWithXmlAndJson(@TempDir Path tempDir) throws Exception { + SparkSession session = newSparkSession(); + + session.read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.READ_DOCUMENTS_COLLECTIONS, "pretty-print") + .load() + .repartition(1) + .write() + .format(CONNECTOR_IDENTIFIER) + .option(Options.WRITE_FILES_PRETTY_PRINT, "true") + .option(Options.WRITE_FILES_COMPRESSION, "zip") + .mode(SaveMode.Append) + .save(tempDir.toFile().getAbsolutePath()); + + File[] files = tempDir.toFile().listFiles(); + assertEquals(1, files.length, "Expecting a single zip file due to the repartition call."); + + // Use the connector to read the entries back in, which is a convenient way of checking their content. + List rows = session.read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_COMPRESSION, "zip") + .load(files[0].getAbsolutePath()) + .orderBy(new Column("uri")) + .collectAsList(); + + String xml = new String((byte[]) rows.get(0).get(1)); + assertEquals("\n" + + " world\n" + + "\n", xml); + + String json = new String((byte[]) rows.get(1).get(1)); + assertEquals("{\n" + + " \"hello\" : \"world\"\n" + + "}", json); + } + + @Test + void notPrettyPrinted(@TempDir Path tempDir) throws IOException { + newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.READ_DOCUMENTS_COLLECTIONS, "pretty-print") + .load() + .write() + .format(CONNECTOR_IDENTIFIER) + .mode(SaveMode.Append) + .save(tempDir.toFile().getAbsolutePath()); + + File dir = new File(tempDir.toFile(), "pretty-print"); + String doc1 = FileUtils.readFileToString(new File(dir, "doc1.xml"), "UTF-8"); + assertEquals("\n" + + "world", doc1); + + String doc2 = FileUtils.readFileToString(new File(dir, "doc2.json"), "UTF-8"); + assertEquals("{\"hello\":\"world\"}", doc2); + } +} diff --git a/src/test/java/com/marklogic/spark/writer/file/WriteArchiveTest.java b/src/test/java/com/marklogic/spark/writer/file/WriteArchiveTest.java new file mode 100644 index 00000000..1023af6c --- /dev/null +++ b/src/test/java/com/marklogic/spark/writer/file/WriteArchiveTest.java @@ -0,0 +1,137 @@ +package com.marklogic.spark.writer.file; + +import com.marklogic.junit5.XmlNode; +import com.marklogic.spark.AbstractIntegrationTest; +import com.marklogic.spark.Options; +import com.marklogic.spark.TestUtil; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.SaveMode; +import org.jdom2.Namespace; +import org.junit.jupiter.api.BeforeEach; +import org.junit.jupiter.api.io.TempDir; +import org.junit.jupiter.params.ParameterizedTest; +import org.junit.jupiter.params.provider.ValueSource; + +import java.nio.file.Path; +import java.util.List; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertTrue; + +class WriteArchiveTest extends AbstractIntegrationTest { + + @BeforeEach + void beforeEach() { + TestUtil.insertTwoDocumentsWithAllMetadata(getDatabaseClient()); + } + + @ParameterizedTest + @ValueSource(strings = { + "metadata", + "permissions", + "collections", + "quality", + "properties", + "metadatavalues" + }) + void writeAllMetadata(String metadata, @TempDir Path tempDir) throws Exception { + newSparkSession().read() + 
.format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.READ_DOCUMENTS_COLLECTIONS, "collection1") + .option(Options.READ_DOCUMENTS_CATEGORIES, "content," + metadata) + .load() + .repartition(1) + .write() + .format(CONNECTOR_IDENTIFIER) + .option(Options.WRITE_FILES_COMPRESSION, "zip") + .mode(SaveMode.Append) + .save(tempDir.toFile().getAbsolutePath()); + + assertEquals(1, tempDir.toFile().listFiles().length, "Expecting 1 zip since repartition created 1 partition writer."); + verifyMetadataFiles(tempDir, metadata); + } + + private void verifyMetadataFiles(Path tempDir, String metadataValue) { + final List rows = newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_COMPRESSION, "zip") + .load(tempDir.toFile().getAbsolutePath()) + .collectAsList(); + + assertEquals(4, rows.size(), "Expecting the 2 author JSON entries and 2 entries for metadata."); + + final String expectedUriPrefix = "file://" + tempDir.toFile().getAbsolutePath(); + for (Row row : rows) { + String uri = row.getString(0); + assertTrue(uri.startsWith(expectedUriPrefix), "Unexpected URI, which is expected to start with the " + + "absolute path of the zip file: " + uri); + + if (uri.endsWith(".xml")) { + XmlNode doc = new XmlNode(new String((byte[]) row.get(1))); + doc.assertElementValue("/hello", "world"); + } else { + assertTrue(uri.endsWith(".metadata")); + verifyMetadata(row, metadataValue); + } + } + } + + private void verifyMetadata(Row row, String metadataValue) { + String xml = new String((byte[]) row.get(1)); + XmlNode metadata = new XmlNode(xml, + Namespace.getNamespace("rapi", "http://marklogic.com/rest-api"), + PROPERTIES_NAMESPACE, + Namespace.getNamespace("ex", "org:example")); + + switch (metadataValue) { + case "collections": + verifyCollections(metadata); + break; + case "permissions": + verifyPermissions(metadata); + break; + case "quality": + verifyQuality(metadata); + break; + case "properties": + verifyProperties(metadata); + break; + case "metadatavalues": + verifyMetadataValues(metadata); + break; + case "metadata": + verifyCollections(metadata); + verifyPermissions(metadata); + verifyQuality(metadata); + verifyProperties(metadata); + verifyMetadataValues(metadata); + break; + } + } + + private void verifyCollections(XmlNode metadata) { + metadata.assertElementValue("/rapi:metadata/rapi:collections/rapi:collection", "collection1"); + metadata.assertElementValue("/rapi:metadata/rapi:collections/rapi:collection", "collection2"); + } + + private void verifyPermissions(XmlNode metadata) { + String path = "/rapi:metadata/rapi:permissions/rapi:permission"; + metadata.assertElementExists(path + "[rapi:role-name = 'spark-user-role' and rapi:capability='update']"); + metadata.assertElementExists(path + "[rapi:role-name = 'spark-user-role' and rapi:capability='read']"); + metadata.assertElementExists(path + "[rapi:role-name = 'qconsole-user' and rapi:capability='read']"); + } + + private void verifyQuality(XmlNode metadata) { + metadata.assertElementValue("/rapi:metadata/rapi:quality", "10"); + } + + private void verifyProperties(XmlNode metadata) { + metadata.assertElementValue("/rapi:metadata/prop:properties/ex:key1", "value1"); + metadata.assertElementValue("/rapi:metadata/prop:properties/key2", "value2"); + } + + private void verifyMetadataValues(XmlNode metadata) { + metadata.prettyPrint(); + } +} diff --git a/src/test/java/com/marklogic/spark/writer/file/WriteArchiveWithEncodingTest.java 
b/src/test/java/com/marklogic/spark/writer/file/WriteArchiveWithEncodingTest.java new file mode 100644 index 00000000..2b97ff62 --- /dev/null +++ b/src/test/java/com/marklogic/spark/writer/file/WriteArchiveWithEncodingTest.java @@ -0,0 +1,79 @@ +package com.marklogic.spark.writer.file; + +import com.fasterxml.jackson.databind.JsonNode; +import com.marklogic.spark.AbstractIntegrationTest; +import com.marklogic.spark.Options; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.SaveMode; +import org.junit.jupiter.api.Test; +import org.junit.jupiter.api.io.TempDir; + +import java.nio.file.Path; +import java.util.List; +import java.util.Map; + +import static org.junit.jupiter.api.Assertions.assertEquals; + +class WriteArchiveWithEncodingTest extends AbstractIntegrationTest { + + @Test + void test(@TempDir Path tempDir) { + addMetadataToTestDocument(); + + // Write the JSON test document to an archive with ISO encoding, including its metadata. + newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.READ_DOCUMENTS_URIS, "/utf8-sample.json") + .option(Options.READ_DOCUMENTS_CATEGORIES, "content,metadata") + .load() + .repartition(1) + .write() + .format(CONNECTOR_IDENTIFIER) + .option(Options.WRITE_FILES_ENCODING, "ISO-8859-1") + .option(Options.WRITE_FILES_COMPRESSION, "zip") + .mode(SaveMode.Append) + .save(tempDir.toFile().getAbsolutePath()); + + // Read the archive with ISO encoding and loading it into MarkLogic. + sparkSession.read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_ENCODING, "ISO-8859-1") + .option(Options.READ_FILES_COMPRESSION, "zip") + .option(Options.READ_FILES_TYPE, "archive") + .load(tempDir.toAbsolutePath().toString()) + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.WRITE_COLLECTIONS, "loaded-data") + .option(Options.WRITE_URI_PREFIX, "/loaded") + .mode(SaveMode.Append) + .save(); + + JsonNode doc = readJsonDocument("/loaded/utf8-sample.json"); + assertEquals("MaryZhengäöüß??", doc.get("text").asText(), "The value should be mostly the same as the " + + "original value, except for the last two characters which are replaced when encoded to ISO-8859-1."); + + // Read the loaded document. + List rows = sparkSession.read().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.READ_DOCUMENTS_URIS, "/loaded/utf8-sample.json") + .option(Options.READ_DOCUMENTS_CATEGORIES, "metadatavalues") + .load().collectAsList(); + + Map metadata = rows.get(0).getJavaMap(7); + assertEquals("MaryZhengäöüß??", metadata.get("text"), "The user-defined encoding should be applied to " + + "each metadata entry in the archive file as well. This ensures that the encoding is applied to things " + + "like metadata values and properties fragments, where a user is free to capture any text they want."); + } + + /** + * It's fine to add this to the test document, which is created by the test app. It won't impact any other tests, + * and it can be run repeatedly without any ill effects. 
+ */ + private void addMetadataToTestDocument() { + getDatabaseClient().newServerEval() + .javascript("declareUpdate(); " + + "xdmp.documentSetMetadata('/utf8-sample.json', {\"text\": \"MaryZhengäöüß测试\"})") + .evalAs(String.class); + } +} diff --git a/src/test/java/com/marklogic/spark/writer/file/WriteDocumentFilesTest.java b/src/test/java/com/marklogic/spark/writer/file/WriteDocumentFilesTest.java index fca36013..cab4c1f0 100644 --- a/src/test/java/com/marklogic/spark/writer/file/WriteDocumentFilesTest.java +++ b/src/test/java/com/marklogic/spark/writer/file/WriteDocumentFilesTest.java @@ -1,7 +1,6 @@ package com.marklogic.spark.writer.file; import com.fasterxml.jackson.databind.JsonNode; -import com.fasterxml.jackson.databind.ObjectMapper; import com.marklogic.client.document.DocumentWriteSet; import com.marklogic.client.document.TextDocumentManager; import com.marklogic.client.io.DocumentMetadataHandle; @@ -85,4 +84,36 @@ void variousURIs(@TempDir Path tempDir) throws Exception { content = new String(FileCopyUtils.copyToByteArray(files[filenames.indexOf("example2.txt")])); assertEquals("Opaque URI", content); } + + @Test + void uriHasSpace(@TempDir Path tempDir) { + final String uri = "/has space.json"; + + newSparkSession().read().format(CONNECTOR_IDENTIFIER) + .load("src/test/resources/spark-json/single-object.json") + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.WRITE_COLLECTIONS, "char-test") + .option(Options.WRITE_PERMISSIONS, DEFAULT_PERMISSIONS) + .option(Options.WRITE_URI_TEMPLATE, uri) + .mode(SaveMode.Append) + .save(); + + sparkSession.read().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.READ_DOCUMENTS_URIS, uri) + .load() + .write().format(CONNECTOR_IDENTIFIER) + .mode(SaveMode.Append) + .save(tempDir.toFile().getAbsolutePath()); + + File dir = tempDir.toFile(); + assertEquals(1, dir.listFiles().length); + String filename = dir.listFiles()[0].getName(); + System.out.println(filename); + assertEquals("has space.json", filename, + "Just like MLCP, if the connector cannot construct a java.net.URI from the document URI (it will fail " + + "due to a space), the error should be logged and the file should be written with its unaltered " + + "document URI used for the file path."); + } } diff --git a/src/test/java/com/marklogic/spark/writer/file/WriteDocumentGZIPFilesTest.java b/src/test/java/com/marklogic/spark/writer/file/WriteDocumentGzipFilesTest.java similarity index 97% rename from src/test/java/com/marklogic/spark/writer/file/WriteDocumentGZIPFilesTest.java rename to src/test/java/com/marklogic/spark/writer/file/WriteDocumentGzipFilesTest.java index 159da366..150a5e89 100644 --- a/src/test/java/com/marklogic/spark/writer/file/WriteDocumentGZIPFilesTest.java +++ b/src/test/java/com/marklogic/spark/writer/file/WriteDocumentGzipFilesTest.java @@ -15,7 +15,7 @@ import static org.junit.jupiter.api.Assertions.assertEquals; import static org.junit.jupiter.api.Assertions.assertTrue; -class WriteDocumentGZIPFilesTest extends AbstractIntegrationTest { +class WriteDocumentGzipFilesTest extends AbstractIntegrationTest { @Test void test(@TempDir Path tempDir) { diff --git a/src/test/java/com/marklogic/spark/writer/file/WriteDocumentZipFilesTest.java b/src/test/java/com/marklogic/spark/writer/file/WriteDocumentZipFilesTest.java index d17a99c9..f41e44ab 100644 --- a/src/test/java/com/marklogic/spark/writer/file/WriteDocumentZipFilesTest.java +++ 
b/src/test/java/com/marklogic/spark/writer/file/WriteDocumentZipFilesTest.java @@ -88,6 +88,7 @@ void opaqueURI(@TempDir @NotNull Path tempDir) throws IOException { "'schema-specific part', which is just example/123.xml."); } + private Dataset readAuthorCollection() { return newSparkSession().read() .format(CONNECTOR_IDENTIFIER) @@ -125,10 +126,7 @@ private void verifyZipFilesContainFifteenAuthors(Path tempDir) throws IOExceptio assertTrue(uri.startsWith(expectedUriPrefix), "Unexpected URI, which is expected to start with the " + "absolute path of the zip file: " + uri); - long length = row.getLong(2); - assertTrue(length > 0, "Length wasn't set to something greater than zero: " + length); - - JsonNode doc = objectMapper.readTree((byte[]) row.get(3)); + JsonNode doc = objectMapper.readTree((byte[]) row.get(1)); assertTrue(doc.has("CitationID"), "Unexpected JSON: " + doc); } } diff --git a/src/test/java/com/marklogic/spark/writer/file/WriteFilesWithEncodingTest.java b/src/test/java/com/marklogic/spark/writer/file/WriteFilesWithEncodingTest.java new file mode 100644 index 00000000..d278f82a --- /dev/null +++ b/src/test/java/com/marklogic/spark/writer/file/WriteFilesWithEncodingTest.java @@ -0,0 +1,163 @@ +package com.marklogic.spark.writer.file; + +import com.fasterxml.jackson.databind.JsonNode; +import com.marklogic.junit5.XmlNode; +import com.marklogic.spark.AbstractIntegrationTest; +import com.marklogic.spark.ConnectorException; +import com.marklogic.spark.Options; +import org.apache.spark.sql.DataFrameWriter; +import org.apache.spark.sql.SaveMode; +import org.junit.jupiter.api.Test; +import org.junit.jupiter.api.io.TempDir; +import org.springframework.util.FileCopyUtils; + +import java.io.File; +import java.io.IOException; +import java.nio.file.Path; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertTrue; + +/** + * These tests are simpler than they look at first glance. Each one reads a doc from MarkLogic that contains characters + * supported by UTF-8 but not supported by ISO-8859-1. The test then writes the doc to a file using ISO-8859-1. It then + * reads the file and loads it back into MarkLogic and verifies that the contents of both the written file and written + * document meet the expectations for ISO-8859-1 encoding. 
+ */ +class WriteFilesWithEncodingTest extends AbstractIntegrationTest { + + private static final String ISO_ENCODING = "ISO-8859-1"; + private static final String SAMPLE_XML_DOC_URI = "/utf8-sample.xml"; + private static final String SAMPLE_JSON_DOC_URI = "/utf8-sample.json"; + private static final String ORIGINAL_XML_TEXT = "UTF-8 Text: MaryZhengäöüß测试"; + + @Test + void writeXmlFile(@TempDir Path tempDir) { + XmlNode sampleDoc = readXmlDocument(SAMPLE_XML_DOC_URI); + sampleDoc.assertElementValue( + "Verifying that the sample doc was loaded correctly in the test app; also showing what the text looks " + + "to make this test easier to understand.", + "/doc", ORIGINAL_XML_TEXT); + + newSparkSession().read().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.READ_DOCUMENTS_URIS, SAMPLE_XML_DOC_URI) + .load() + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.WRITE_FILES_ENCODING, ISO_ENCODING) + .mode(SaveMode.Append) + .save(tempDir.toAbsolutePath().toString()); + + String fileContent = readFileContents(tempDir, "utf8-sample.xml"); + assertTrue(fileContent.contains("UTF-8 Text: MaryZheng����??"), + "Unexpected file content: " + fileContent); + + newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_ENCODING, ISO_ENCODING) + .load(tempDir.toAbsolutePath().toString()) + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.WRITE_PERMISSIONS, DEFAULT_PERMISSIONS) + .option(Options.WRITE_URI_TEMPLATE, "/iso-doc.xml") + .mode(SaveMode.Append) + .save(); + + XmlNode doc = readXmlDocument("/iso-doc.xml"); + doc.assertElementValue( + "Verifies that the ISO-encoded text is then converted back to UTF-8 when stored in MarkLogic, but the " + + "value is slightly different due to the use of replacement characters in ISO-8859-1.", + "/doc", "UTF-8 Text: MaryZhengäöüß??"); + } + + @Test + void prettyPrintXmlFile(@TempDir Path tempDir) { + newSparkSession().read().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.READ_DOCUMENTS_URIS, SAMPLE_XML_DOC_URI) + .load() + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.WRITE_FILES_ENCODING, ISO_ENCODING) + .option(Options.WRITE_FILES_PRETTY_PRINT, true) + .mode(SaveMode.Append) + .save(tempDir.toAbsolutePath().toString()); + + String fileContent = readFileContents(tempDir, "utf8-sample.xml"); + assertTrue(fileContent.contains("UTF-8 Text: MaryZheng����测试"), + "Pretty-printing results in some of the characters being escaped by the Java Transformer class, " + + "even though it's been configured to use the user-specified encoding. Unexpected text: " + fileContent); + + newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_ENCODING, ISO_ENCODING) + .load(tempDir.toAbsolutePath().toString()) + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.WRITE_PERMISSIONS, DEFAULT_PERMISSIONS) + .option(Options.WRITE_URI_TEMPLATE, "/iso-doc.xml") + .mode(SaveMode.Append) + .save(); + + XmlNode doc = readXmlDocument("/iso-doc.xml"); + doc.assertElementValue( + "The written doc should have the original XML text, as the problematic characters for ISO-8859-1 were " + + "escaped by the Java Transformer class during the pretty-printing process. 
This shows that " + + "pretty-printing can actually result in fewer characters being altered via replacement tokens.", + "/doc", ORIGINAL_XML_TEXT); + } + + @Test + void prettyPrintJsonFile(@TempDir Path tempDir) { + newSparkSession().read().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.READ_DOCUMENTS_URIS, SAMPLE_JSON_DOC_URI) + .load() + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.WRITE_FILES_ENCODING, ISO_ENCODING) + .option(Options.WRITE_FILES_PRETTY_PRINT, true) + .mode(SaveMode.Append) + .save(tempDir.toAbsolutePath().toString()); + + String fileContent = readFileContents(tempDir, "utf8-sample.json"); + assertTrue(fileContent.contains("MaryZheng����??"), + "Pretty-printing JSON doesn't impact the encoding at all since the underlying Jackson library " + + "doesn't need to escape any of the characters. Unexpected text: " + fileContent); + + newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_ENCODING, ISO_ENCODING) + .load(tempDir.toAbsolutePath().toString()) + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.WRITE_PERMISSIONS, DEFAULT_PERMISSIONS) + .option(Options.WRITE_URI_TEMPLATE, "/iso-doc.json") + .mode(SaveMode.Append) + .save(); + + JsonNode doc = readJsonDocument("/iso-doc.json"); + assertEquals("MaryZhengäöüß??", doc.get("text").asText()); + } + + @Test + void invalidEncoding(@TempDir Path tempDir) { + DataFrameWriter writer = newSparkSession().read().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.READ_DOCUMENTS_URIS, SAMPLE_JSON_DOC_URI) + .load() + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.WRITE_FILES_ENCODING, "not-valid-encoding") + .mode(SaveMode.Append); + + ConnectorException ex = assertThrowsConnectorException(() -> writer.save(tempDir.toAbsolutePath().toString())); + assertEquals("Unsupported encoding value: not-valid-encoding", ex.getMessage()); + } + + private String readFileContents(Path tempDir, String filename) { + File file = new File(tempDir.toFile(), filename); + try { + return new String(FileCopyUtils.copyToByteArray(file)); + } catch (IOException e) { + throw new RuntimeException(e); + } + } +} diff --git a/src/test/java/com/marklogic/spark/writer/file/WriteRdfFilesTest.java b/src/test/java/com/marklogic/spark/writer/file/WriteRdfFilesTest.java new file mode 100644 index 00000000..328af778 --- /dev/null +++ b/src/test/java/com/marklogic/spark/writer/file/WriteRdfFilesTest.java @@ -0,0 +1,192 @@ +package com.marklogic.spark.writer.file; + +import com.marklogic.spark.AbstractIntegrationTest; +import com.marklogic.spark.Options; +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.SaveMode; +import org.junit.jupiter.api.Test; +import org.junit.jupiter.api.io.TempDir; +import org.junit.jupiter.params.ParameterizedTest; +import org.junit.jupiter.params.provider.CsvSource; + +import java.io.File; +import java.nio.file.Path; +import java.util.List; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertTrue; + +class WriteRdfFilesTest extends AbstractIntegrationTest { + + @ParameterizedTest + @CsvSource({ + ",ttl", // TTL is the default format. 
+ "nt,nt", + "ntriples,nt", + "rdfthrift,thrift", + "rdfproto,binpb" + }) + void tripleFormats(String format, String fileExtension, @TempDir Path tempDir) { + writeExampleGraphToFiles(tempDir, format, fileExtension); + + List rows = readRdfFiles(tempDir); + verifyGraphIsNullForEachRow(rows); + verifyDebtRowHasLang(rows); + verifyCreatorRowHasDatatype(rows); + } + + /** + * TriX - https://en.wikipedia.org/wiki/TriX_(serialization_format) - is treated as a quads format as it can + * support a graph at /trix/graph/uri in each TriX XML document. + *

+ * Also, Jena is reporting errors with a message of "Unexpected attribute : :datatype at typedLiteral", but that + * does not seem to be causing any issue. Perhaps a bug in Jena, as TriX supports a 'datatype' attribute on a + * 'typedLiteral' element. + */ + @Test + void trix(@TempDir Path tempDir) { + writeExampleGraphToFiles(tempDir, "trix"); + + List rows = readRdfFiles(tempDir); + verifyEachRowHasGraph(rows); + verifyDebtRowHasLang(rows); + verifyCreatorRowHasDatatype(rows); + + rows.forEach(row -> System.out.println(row.prettyJson())); + } + + @Test + void trig(@TempDir Path tempDir) { + writeExampleGraphToFiles(tempDir, "trig"); + + List rows = readRdfFiles(tempDir); + verifyEachRowHasGraph(rows); + verifyDebtRowHasLang(rows); + verifyCreatorRowHasDatatype(rows); + } + + @Test + void nq(@TempDir Path tempDir) { + writeExampleGraphToFiles(tempDir, "nq"); + + List rows = readRdfFiles(tempDir); + verifyEachRowHasGraph(rows); + verifyDebtRowHasLang(rows); + verifyCreatorRowHasDatatype(rows); + } + + @Test + void nquads(@TempDir Path tempDir) { + writeExampleGraphToFiles(tempDir, "nquads", "nq"); + + List rows = readRdfFiles(tempDir); + verifyEachRowHasGraph(rows); + verifyDebtRowHasLang(rows); + verifyCreatorRowHasDatatype(rows); + } + + @Test + void tripleFormatWithGraphOverride(@TempDir Path tempDir) { + writeExampleGraphToFiles(tempDir, null, "ttl", "this-should-be-ignored"); + + // If the format doesn't support a graph, then a user-defined graph should be ignored without any error + // thrown. + List rows = readRdfFiles(tempDir); + verifyGraphIsNullForEachRow(rows); + } + + @Test + void quadsFormatWithGraphOverride(@TempDir Path tempDir) { + writeExampleGraphToFiles(tempDir, "nquads", "nq", "use-this-graph"); + + List rows = readRdfFiles(tempDir); + verifyEachRowHasGraph(rows, "use-this-graph"); + } + + @Test + void noTriplesFound(@TempDir Path tempDir) { + Dataset dataset = newSparkSession() + .read().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.READ_TRIPLES_URIS, "/not-a-document.json") + .load(); + + assertEquals(0, dataset.count()); + + dataset.write().format(CONNECTOR_IDENTIFIER) + .mode(SaveMode.Append) + .save(tempDir.toFile().getAbsolutePath()); + + assertEquals(0, tempDir.toFile().listFiles().length, "No files should have been written since no triples " + + "were found."); + } + + + private void writeExampleGraphToFiles(Path tempDir, String format) { + writeExampleGraphToFiles(tempDir, format, format); + } + + private void writeExampleGraphToFiles(Path tempDir, String format, String expectedFileExtension) { + writeExampleGraphToFiles(tempDir, format, expectedFileExtension, null); + } + + private void writeExampleGraphToFiles(Path tempDir, String format, String expectedFileExtension, String graphOverride) { + newSparkSession() + .read().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.READ_TRIPLES_GRAPHS, "http://example.org/graph") + .load() + .repartition(2) // Force 2 files to be written, just so we can ensure more than 1 works. 
+ .write().format(CONNECTOR_IDENTIFIER) + .option(Options.WRITE_RDF_FILES_FORMAT, format) + .option(Options.WRITE_RDF_FILES_GRAPH, graphOverride) + .mode(SaveMode.Append) + .save(tempDir.toFile().getAbsolutePath()); + + File[] files = tempDir.toFile().listFiles(); + assertEquals(2, files.length); + + assertTrue(files[0].getName().endsWith(expectedFileExtension), "Unexpected filename: " + files[0].getName()); + assertTrue(files[1].getName().endsWith(expectedFileExtension), "Unexpected filename: " + files[1].getName()); + } + + private List readRdfFiles(Path tempDir) { + List rows = sparkSession.read().format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_TYPE, "rdf") + .load(tempDir.toFile().getAbsolutePath()) + .collectAsList(); + assertEquals(8, rows.size()); + return rows; + } + + private void verifyEachRowHasGraph(List rows) { + verifyEachRowHasGraph(rows, "http://example.org/graph"); + } + + private void verifyEachRowHasGraph(List rows, String graph) { + rows.forEach(row -> assertEquals(graph, row.getString(5), "Since the trig format " + + "supports quads, the graph should have been included in the written files and then read back by our " + + "connector.")); + } + + private void verifyGraphIsNullForEachRow(List rows) { + rows.forEach(row -> assertTrue(row.isNullAt(5), "Each row should have a null graph since it was written " + + "to a format that only supports triples and not quads.")); + } + + private void verifyDebtRowHasLang(List rows) { + Row debtRow = rows.stream().filter(row -> "Debt Management".equals(row.getString(2))).findFirst().get(); + assertEquals("http://www.w3.org/1999/02/22-rdf-syntax-ns#langString", debtRow.getString(3), "For a " + + "langString, the datatype can't be included in the file as Jena only lets us write either a typed literal " + + "or a lang literal. 
But when reading in the rows, we know that if a 'lang' value exists, then the " + + "datatype must be a langString."); + assertEquals("en", debtRow.getString(4)); + } + + private void verifyCreatorRowHasDatatype(List rows) { + Row creatorRow = rows.stream().filter(row -> "wb".equals(row.getString(2))).findFirst().get(); + assertEquals("http://www.w3.org/2001/XMLSchema#string", creatorRow.getString(3), "Verifying that the " + + "datatype is set correctly based on what is written to the file."); + } +} diff --git a/src/test/java/com/marklogic/spark/writer/file/WriteRdfGzipFilesTest.java b/src/test/java/com/marklogic/spark/writer/file/WriteRdfGzipFilesTest.java new file mode 100644 index 00000000..a4b552e4 --- /dev/null +++ b/src/test/java/com/marklogic/spark/writer/file/WriteRdfGzipFilesTest.java @@ -0,0 +1,68 @@ +package com.marklogic.spark.writer.file; + +import com.marklogic.spark.AbstractIntegrationTest; +import com.marklogic.spark.ConnectorException; +import com.marklogic.spark.Options; +import org.apache.spark.sql.DataFrameWriter; +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.SaveMode; +import org.junit.jupiter.api.Test; +import org.junit.jupiter.api.io.TempDir; + +import java.io.File; +import java.nio.file.Path; +import java.util.List; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertTrue; + +class WriteRdfGzipFilesTest extends AbstractIntegrationTest { + + @Test + void gzip(@TempDir Path tempDir) { + Dataset dataset = newSparkSession() + .read().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.READ_TRIPLES_GRAPHS, "http://example.org/graph") + .load(); + + assertEquals(8, dataset.count()); + + dataset.repartition(1) + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.WRITE_RDF_FILES_FORMAT, "nt") + .option(Options.WRITE_FILES_COMPRESSION, "gzip") + .mode(SaveMode.Append) + .save(tempDir.toFile().getAbsolutePath()); + + File[] files = tempDir.toFile().listFiles(); + assertEquals(1, files.length, "Expecting 1 gzip file due to repartition=1 producing one writer."); + assertTrue(files[0].getName().endsWith(".nt.gz"), "Unexpected filename: " + files[0].getName()); + + List rows = sparkSession.read().format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_TYPE, "rdf") + .option(Options.READ_FILES_COMPRESSION, "gzip") + .load(tempDir.toFile().getAbsolutePath()) + .collectAsList(); + + assertEquals(8, rows.size(), "Expecting the 8 rows originally read from MarkLogic to be read " + + "from the single gzipped file."); + } + + @Test + void zipIsntValidChoice(@TempDir Path tempDir) { + DataFrameWriter writer = newSparkSession() + .read().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.READ_TRIPLES_GRAPHS, "http://example.org/graph") + .load() + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.WRITE_RDF_FILES_FORMAT, "nt") + .option(Options.WRITE_FILES_COMPRESSION, "zip") + .mode(SaveMode.Append); + + ConnectorException ex = assertThrowsConnectorException(() -> writer.save(tempDir.toFile().getAbsolutePath())); + assertEquals("Unsupported compression value; only 'gzip' is supported: zip", ex.getMessage()); + } +} diff --git a/src/test/java/com/marklogic/spark/writer/rdf/AbstractWriteRdfTest.java b/src/test/java/com/marklogic/spark/writer/rdf/AbstractWriteRdfTest.java new file mode 100644 index 00000000..e4913897 --- /dev/null +++ 
b/src/test/java/com/marklogic/spark/writer/rdf/AbstractWriteRdfTest.java @@ -0,0 +1,33 @@ +package com.marklogic.spark.writer.rdf; + +import com.marklogic.client.io.StringHandle; +import com.marklogic.client.semantics.GraphManager; +import com.marklogic.client.semantics.RDFMimeTypes; +import com.marklogic.junit5.XmlNode; +import com.marklogic.spark.AbstractIntegrationTest; +import org.junit.jupiter.api.BeforeEach; + +abstract class AbstractWriteRdfTest extends AbstractIntegrationTest { + + private GraphManager graphManager; + + @BeforeEach + void beforeEach() { + graphManager = getDatabaseClient().newGraphManager(); + } + + protected final XmlNode readTriplesInGraph(String graph) { + String content = graphManager.read(graph, new StringHandle().withMimetype(RDFMimeTypes.TRIPLEXML)).get(); + return new XmlNode(content, TriplesDocument.SEMANTICS_NAMESPACE); + } + + protected final void assertTripleCount(String graph, int count, String message) { + XmlNode doc = readTriplesInGraph(graph); + doc.assertElementCount(message, "/sem:triples/sem:triple", count); + } + + protected final void assertTripleCount(String graph, int count) { + XmlNode doc = readTriplesInGraph(graph); + doc.assertElementCount("/sem:triples/sem:triple", count); + } +} diff --git a/src/test/java/com/marklogic/spark/writer/rdf/WriteQuadsTest.java b/src/test/java/com/marklogic/spark/writer/rdf/WriteQuadsTest.java new file mode 100644 index 00000000..65f84589 --- /dev/null +++ b/src/test/java/com/marklogic/spark/writer/rdf/WriteQuadsTest.java @@ -0,0 +1,160 @@ +package com.marklogic.spark.writer.rdf; + +import com.fasterxml.jackson.databind.JsonNode; +import com.fasterxml.jackson.databind.node.ArrayNode; +import com.marklogic.client.document.DocumentPage; +import com.marklogic.client.document.XMLDocumentManager; +import com.marklogic.client.ext.helper.ClientHelper; +import com.marklogic.client.io.DocumentMetadataHandle; +import com.marklogic.junit5.PermissionsTester; +import com.marklogic.spark.Options; +import org.apache.spark.sql.DataFrameWriter; +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.SaveMode; +import org.junit.jupiter.api.Test; + +import java.util.List; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertTrue; + +/** + * Tests "triple rows" where the "graph" column is populated, hence making it a quad. 
+ */ +class WriteQuadsTest extends AbstractWriteRdfTest { + + private static final String GRAPH_COLLECTION = "http://marklogic.com/semantics#graphs"; + + @Test + void quads() { + readRdfAndWrite() + .option(Options.WRITE_PERMISSIONS, DEFAULT_PERMISSIONS + ",qconsole-user,read") + .option(Options.WRITE_COLLECTIONS, "test-triples") + .mode(SaveMode.Append) + .save(); + + verifyFourGraphsExist(); + verifyGraphPermissions(); + assertCollectionSize("Expecting 1 triples document for each graph, as there's only 1 partition writer.", + "test-triples", 4); + + assertTripleCount("http://www.example.org/exampleDocument#G1", 4); + assertTripleCount("http://www.example.org/exampleDocument#G2", 2); + assertTripleCount("http://www.example.org/exampleDocument#G3", 9); + assertTripleCount(RdfRowConverter.DEFAULT_MARKLOGIC_GRAPH, 1); + } + + @Test + void quadsInUserDefinedGraph() { + readRdfAndWrite() + .option(Options.WRITE_PERMISSIONS, DEFAULT_PERMISSIONS) + .option(Options.WRITE_GRAPH, "my-graph") + .mode(SaveMode.Append) + .save(); + + assertTripleCount("my-graph", 1, "The one triple in the quads file without a graph is first " + + "assigned by Jena to the urn:x-arq:DefaultGraphNode graph. The user-defined graph is then used instead of that."); + assertTripleCount("http://www.example.org/exampleDocument#G1", 4); + assertTripleCount("http://www.example.org/exampleDocument#G2", 2); + assertTripleCount("http://www.example.org/exampleDocument#G3", 9); + } + + @Test + void quadsWithGraphOverride() { + readRdfAndWrite() + .option(Options.WRITE_PERMISSIONS, DEFAULT_PERMISSIONS) + .option(Options.WRITE_GRAPH_OVERRIDE, "my-graph-override") + .mode(SaveMode.Append) + .save(); + + assertTripleCount("my-graph-override", 16, "The 'graph override' value should override the graph in every " + + "quad, regardless of whether it has a graph value or not."); + + assertCollectionSize(GRAPH_COLLECTION, 1); + } + + @Test + void threeQuadsTwoPartitions() { + readRdf("src/test/resources/rdf/three-quads.trig") + .repartition(2) + .write().format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()) + .option(Options.WRITE_PERMISSIONS, DEFAULT_PERMISSIONS) + .option(Options.WRITE_COLLECTIONS, "multiple-partitions") + .mode(SaveMode.Append) + .save(); + + verifyFourGraphsExist(); + assertTripleCount("http://www.example.org/exampleDocument#G1", 4); + assertTripleCount("http://www.example.org/exampleDocument#G2", 2); + assertTripleCount("http://www.example.org/exampleDocument#G3", 9); + assertTripleCount(RdfRowConverter.DEFAULT_MARKLOGIC_GRAPH, 1); + + List uris = new ClientHelper(getDatabaseClient()).getUrisInCollection("multiple-partitions"); + assertTrue(uris.size() > 4, "With 2 partition writers, we expect more than 4 triple documents because it is " + + "almost certainly the case that the two writers get triples from the same graph. Since they're independent " + + "of each other, they'll both create a separate triples document in the same graph. This means a user can " + + "end up with far fewer than 100 triples in one document, but that doesn't really have any functional " + + "impact. The user primarily cares that the triples are in the triple index and queryable via the correct " + + "graph. 
Actual count of triple document URIs: " + uris.size()); + } + + private Dataset readRdf(String path) { + return newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_TYPE, "rdf") + .load(path); + } + + private DataFrameWriter readRdfAndWrite() { + return readRdf("src/test/resources/rdf/three-quads.trig") + .write() + .format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()); + } + + private void verifyFourGraphsExist() { + assertCollectionSize(GRAPH_COLLECTION, 4); + assertInCollections("http://www.example.org/exampleDocument#G1", GRAPH_COLLECTION); + assertInCollections("http://www.example.org/exampleDocument#G2", GRAPH_COLLECTION); + assertInCollections("http://www.example.org/exampleDocument#G3", GRAPH_COLLECTION); + + // The MarkLogic default graph should be used for the one triple that is assigned to the Jena default + // graph - urn:x-arq:DefaultGraphNode. + assertInCollections(RdfRowConverter.DEFAULT_MARKLOGIC_GRAPH, GRAPH_COLLECTION); + } + + /** + * This verifies the permissions on each graph document, which are assumed to be set via the + * sem:create-graph-document function. It also does a check on the output of sem.graphGetPermissions, verifying + * that the 3 expected permissions are present. + */ + private void verifyGraphPermissions() { + XMLDocumentManager mgr = getDatabaseClient().newXMLDocumentManager(); + String[] graphUris = new String[]{ + "http://www.example.org/exampleDocument#G1", + "http://www.example.org/exampleDocument#G2", + "http://www.example.org/exampleDocument#G3", + "http://www.example.org/exampleDocument#G3" + }; + + for (String graphUri : graphUris) { + ArrayNode perms = (ArrayNode) getDatabaseClient().newServerEval() + .javascript(String.format("sem.graphGetPermissions('%s')", graphUri)) + .evalAs(JsonNode.class); + assertEquals(3, perms.size(), "Expecting 2 permissions for spark-user-role and 1 for qconsole-user."); + } + + DocumentPage page = mgr.readMetadata(graphUris); + while (page.hasNext()) { + DocumentMetadataHandle metadata = page.next().getMetadata(new DocumentMetadataHandle()); + DocumentMetadataHandle.DocumentPermissions perms = metadata.getPermissions(); + assertEquals(2, perms.size(), "Expecting qconsole-user and spark-user-role as the 2 roles containing permissions."); + PermissionsTester tester = new PermissionsTester(metadata.getPermissions()); + tester.assertUpdatePermissionExists("spark-user-role"); + tester.assertReadPermissionExists("spark-user-role"); + tester.assertReadPermissionExists("qconsole-user"); + } + } +} diff --git a/src/test/java/com/marklogic/spark/writer/rdf/WriteTriplesTest.java b/src/test/java/com/marklogic/spark/writer/rdf/WriteTriplesTest.java new file mode 100644 index 00000000..6b65b724 --- /dev/null +++ b/src/test/java/com/marklogic/spark/writer/rdf/WriteTriplesTest.java @@ -0,0 +1,96 @@ +package com.marklogic.spark.writer.rdf; + +import com.marklogic.junit5.PermissionsTester; +import com.marklogic.spark.ConnectorException; +import com.marklogic.spark.Options; +import org.apache.spark.sql.DataFrameWriter; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.SaveMode; +import org.junit.jupiter.api.Test; + +import static org.junit.jupiter.api.Assertions.assertEquals; + +class WriteTriplesTest extends AbstractWriteRdfTest { + + @Test + void triplesInDefaultGraph() { + readRdfAndWrite() + .option(Options.WRITE_PERMISSIONS, DEFAULT_PERMISSIONS) + .mode(SaveMode.Append) + .save(); + + assertTripleCount(RdfRowConverter.DEFAULT_MARKLOGIC_GRAPH, 8, + "Triples 
should be added to the default MarkLogic graph when the user does not specify a graph."); + } + + @Test + void triplesInUserDefinedGraph() { + readRdfAndWrite() + .option(Options.WRITE_PERMISSIONS, DEFAULT_PERMISSIONS) + .option(Options.WRITE_GRAPH, "my-graph") + .mode(SaveMode.Append) + .save(); + + assertTripleCount("my-graph", 8, + "The spark.marklogic.write.graph option should be used to specify a graph for each triple."); + } + + @Test + void triplesInGraphOverride() { + readRdfAndWrite() + .option(Options.WRITE_PERMISSIONS, DEFAULT_PERMISSIONS) + .option(Options.WRITE_GRAPH_OVERRIDE, "my-graph-override") + .mode(SaveMode.Append) + .save(); + + assertTripleCount("my-graph-override", 8, + "The spark.marklogic.write.graphOverride option should function the exact same way as " + + "spark.marklogic.write.graph for triples, as both are specifying a graph for every triple."); + } + + @Test + void bothGraphAndGraphOverrideSpecified() { + DataFrameWriter writer = readRdfAndWrite() + .option(Options.WRITE_PERMISSIONS, DEFAULT_PERMISSIONS) + .option(Options.WRITE_GRAPH, "my-graph") + .option(Options.WRITE_GRAPH_OVERRIDE, "my-graph-override") + .mode(SaveMode.Append); + + ConnectorException ex = assertThrowsConnectorException(() -> writer.save()); + assertEquals("Can only specify one of spark.marklogic.write.graph and spark.marklogic.write.graphOverride.", + ex.getMessage()); + } + + @Test + void writeTwiceWithDifferentPermissions() { + readRdfAndWrite() + .option(Options.WRITE_PERMISSIONS, DEFAULT_PERMISSIONS) + .mode(SaveMode.Append) + .save(); + + readRdfAndWrite() + .option(Options.WRITE_PERMISSIONS, "rest-extension-user,read,rest-extension-user,update") + .option(Options.WRITE_COLLECTIONS, "second-write") + .mode(SaveMode.Append) + .save(); + + PermissionsTester tester = readDocumentPermissions(getUrisInCollection("second-write", 1).get(0)); + tester.assertReadPermissionExists("rest-extension-user"); + tester.assertUpdatePermissionExists("rest-extension-user"); + assertEquals(1, tester.getDocumentPermissions().size(), "Should only have permissions " + + "for the rest-extension-user role. Not sure if this is correct though. MLCP does some upfront work " + + "to retrieve permissions for any existing graph and then reuses those, ignoring what the user " + + "specifies. That behavior is not documented though and may seem surprising to a user."); + } + + private DataFrameWriter readRdfAndWrite() { + return newSparkSession().read() + .format(CONNECTOR_IDENTIFIER) + .option(Options.READ_FILES_TYPE, "rdf") + .load("src/test/resources/rdf/mini-taxonomy.xml") + .write() + .format(CONNECTOR_IDENTIFIER) + .option(Options.CLIENT_URI, makeClientUri()); + } + +} diff --git a/src/test/ml-data/other-triples/collections.properties b/src/test/ml-data/other-triples/collections.properties new file mode 100644 index 00000000..b3eac7c2 --- /dev/null +++ b/src/test/ml-data/other-triples/collections.properties @@ -0,0 +1,3 @@ +# This file is included to ensure that it doesn't get pulled in by queries that are expecting to +# find triples in the triples/mini-taxonomy.xml file. 
+*=other-graph,test-config diff --git a/src/test/ml-data/other-triples/other-taxonomy.xml b/src/test/ml-data/other-triples/other-taxonomy.xml new file mode 100644 index 00000000..03decc47 --- /dev/null +++ b/src/test/ml-data/other-triples/other-taxonomy.xml @@ -0,0 +1,42 @@ + + + http://vocabulary.worldbank.org/taxonomy/451 + http://www.w3.org/1999/02/22-rdf-syntax-ns#type + http://www.w3.org/2004/02/skos/core#Concept + + + http://vocabulary.worldbank.org/taxonomy/451 + http://purl.org/dc/terms/creator + wb + + + http://vocabulary.worldbank.org/taxonomy/451 + http://purl.org/dc/terms/created + 2013-05-21T15:49:55Z + + + http://vocabulary.worldbank.org/taxonomy/451 + http://purl.org/dc/terms/hasVersion + 0 + + + http://vocabulary.worldbank.org/taxonomy/451 + http://www.w3.org/2004/02/skos/core#prefLabel + NOT Management of Debt + + + http://vocabulary.worldbank.org/taxonomy/451 + http://www.w3.org/2004/02/skos/core#broader + http://vocabulary.worldbank.org/taxonomy/450 + + + http://vocabulary.worldbank.org/taxonomy/451 + http://www.w3.org/2004/02/skos/core#narrower + http://vocabulary.worldbank.org/taxonomy/1137 + + + http://vocabulary.worldbank.org/taxonomy/1107 + http://www.w3.org/2004/02/skos/core#narrower + http://vocabulary.worldbank.org/taxonomy/451 + + diff --git a/src/test/ml-data/pretty-print/collections.properties b/src/test/ml-data/pretty-print/collections.properties new file mode 100644 index 00000000..1164b225 --- /dev/null +++ b/src/test/ml-data/pretty-print/collections.properties @@ -0,0 +1 @@ +*=test-config,pretty-print diff --git a/src/test/ml-data/pretty-print/doc1.xml b/src/test/ml-data/pretty-print/doc1.xml new file mode 100644 index 00000000..017dd70a --- /dev/null +++ b/src/test/ml-data/pretty-print/doc1.xml @@ -0,0 +1,2 @@ + +world diff --git a/src/test/ml-data/pretty-print/doc2.json b/src/test/ml-data/pretty-print/doc2.json new file mode 100644 index 00000000..56c8e280 --- /dev/null +++ b/src/test/ml-data/pretty-print/doc2.json @@ -0,0 +1 @@ +{"hello": "world"} diff --git a/src/test/ml-data/triples/collections.properties b/src/test/ml-data/triples/collections.properties new file mode 100644 index 00000000..de8f646e --- /dev/null +++ b/src/test/ml-data/triples/collections.properties @@ -0,0 +1 @@ +*=http://example.org/graph,test-config diff --git a/src/test/ml-data/triples/mini-taxonomy.xml b/src/test/ml-data/triples/mini-taxonomy.xml new file mode 100644 index 00000000..83a25080 --- /dev/null +++ b/src/test/ml-data/triples/mini-taxonomy.xml @@ -0,0 +1,42 @@ + + + http://vocabulary.worldbank.org/taxonomy/451 + http://www.w3.org/1999/02/22-rdf-syntax-ns#type + http://www.w3.org/2004/02/skos/core#Concept + + + http://vocabulary.worldbank.org/taxonomy/451 + http://purl.org/dc/terms/creator + wb + + + http://vocabulary.worldbank.org/taxonomy/451 + http://purl.org/dc/terms/created + 2013-05-21T15:49:55Z + + + http://vocabulary.worldbank.org/taxonomy/451 + http://purl.org/dc/terms/hasVersion + 0 + + + http://vocabulary.worldbank.org/taxonomy/451 + http://www.w3.org/2004/02/skos/core#prefLabel + Debt Management + + + http://vocabulary.worldbank.org/taxonomy/451 + http://www.w3.org/2004/02/skos/core#broader + http://vocabulary.worldbank.org/taxonomy/450 + + + http://vocabulary.worldbank.org/taxonomy/451 + http://www.w3.org/2004/02/skos/core#narrower + http://vocabulary.worldbank.org/taxonomy/1137 + + + http://vocabulary.worldbank.org/taxonomy/451 + http://www.w3.org/2004/02/skos/core#narrower + http://vocabulary.worldbank.org/taxonomy/1107 + + diff --git 
a/src/test/ml-data/utf8-sample.json b/src/test/ml-data/utf8-sample.json new file mode 100644 index 00000000..f1df7775 --- /dev/null +++ b/src/test/ml-data/utf8-sample.json @@ -0,0 +1,3 @@ +{ + "text": "MaryZhengäöüß测试" +} diff --git a/src/test/ml-data/utf8-sample.xml b/src/test/ml-data/utf8-sample.xml new file mode 100644 index 00000000..3fc77f9b --- /dev/null +++ b/src/test/ml-data/utf8-sample.xml @@ -0,0 +1 @@ +UTF-8 Text: MaryZhengäöüß测试 diff --git a/src/test/ml-modules/options/test-options.xml b/src/test/ml-modules/options/test-options.xml index a0bb85a3..a0c3ba2c 100644 --- a/src/test/ml-modules/options/test-options.xml +++ b/src/test/ml-modules/options/test-options.xml @@ -4,4 +4,9 @@ CitationID + + + + + diff --git a/src/test/resources/aggregate-zips/employee-aggregates-copy.zip b/src/test/resources/aggregate-zips/employee-aggregates-copy.zip new file mode 100644 index 00000000..ea9724a7 Binary files /dev/null and b/src/test/resources/aggregate-zips/employee-aggregates-copy.zip differ diff --git a/src/test/resources/aggregate-zips/employee-aggregates.zip b/src/test/resources/aggregate-zips/employee-aggregates.zip index dcdea294..ea9724a7 100644 Binary files a/src/test/resources/aggregate-zips/employee-aggregates.zip and b/src/test/resources/aggregate-zips/employee-aggregates.zip differ diff --git a/src/test/resources/aggregate-zips/xml-and-json.zip b/src/test/resources/aggregate-zips/xml-and-json.zip new file mode 100644 index 00000000..74c11a16 Binary files /dev/null and b/src/test/resources/aggregate-zips/xml-and-json.zip differ diff --git a/src/test/resources/aggregates/employees.xml b/src/test/resources/aggregates/employees.xml index e9f9f29f..9aabb1e1 100644 --- a/src/test/resources/aggregates/employees.xml +++ b/src/test/resources/aggregates/employees.xml @@ -1,6 +1,7 @@ John + 1 40 has mixed content @@ -12,6 +13,7 @@ Brenda + 2 42 has mixed content diff --git a/src/test/resources/archive-files/archive1.zip b/src/test/resources/archive-files/archive1.zip new file mode 100644 index 00000000..310035d1 Binary files /dev/null and b/src/test/resources/archive-files/archive1.zip differ diff --git a/src/test/resources/archive-files/firstEntryInvalid.zip b/src/test/resources/archive-files/firstEntryInvalid.zip new file mode 100644 index 00000000..06b2d6e4 Binary files /dev/null and b/src/test/resources/archive-files/firstEntryInvalid.zip differ diff --git a/src/test/resources/archive-files/secondEntryInvalid.zip b/src/test/resources/archive-files/secondEntryInvalid.zip new file mode 100644 index 00000000..2c27327c Binary files /dev/null and b/src/test/resources/archive-files/secondEntryInvalid.zip differ diff --git a/src/test/resources/csv-files/empty-values.csv b/src/test/resources/csv-files/empty-values.csv new file mode 100644 index 00000000..5329204a --- /dev/null +++ b/src/test/resources/csv-files/empty-values.csv @@ -0,0 +1,3 @@ +number,color,flag +1,blue, +2, ,false diff --git a/src/test/resources/custom-code/my-partitions.js b/src/test/resources/custom-code/my-partitions.js new file mode 100644 index 00000000..26b25679 --- /dev/null +++ b/src/test/resources/custom-code/my-partitions.js @@ -0,0 +1 @@ +xdmp.databaseForests(xdmp.database()) diff --git a/src/test/resources/custom-code/my-partitions.xqy b/src/test/resources/custom-code/my-partitions.xqy new file mode 100644 index 00000000..2260cf4c --- /dev/null +++ b/src/test/resources/custom-code/my-partitions.xqy @@ -0,0 +1 @@ +xdmp:database-forests(xdmp:database()) diff --git a/src/test/resources/custom-code/my-reader.js 
b/src/test/resources/custom-code/my-reader.js new file mode 100644 index 00000000..2f8394d0 --- /dev/null +++ b/src/test/resources/custom-code/my-reader.js @@ -0,0 +1 @@ +Sequence.from(['firstValue', 'secondValue']) diff --git a/src/test/resources/custom-code/my-reader.xqy b/src/test/resources/custom-code/my-reader.xqy new file mode 100644 index 00000000..6de5abe5 --- /dev/null +++ b/src/test/resources/custom-code/my-reader.xqy @@ -0,0 +1 @@ +(1,2,3) diff --git a/src/test/resources/custom-code/my-writer.js b/src/test/resources/custom-code/my-writer.js new file mode 100644 index 00000000..bd37948f --- /dev/null +++ b/src/test/resources/custom-code/my-writer.js @@ -0,0 +1,7 @@ +declareUpdate(); +var URI; + +xdmp.documentInsert(URI + ".json", {"hello": "world"}, { + "permissions": [xdmp.permission("spark-user-role", "read"), xdmp.permission("spark-user-role", "update")] +}); + diff --git a/src/test/resources/custom-code/my-writer.xqy b/src/test/resources/custom-code/my-writer.xqy new file mode 100644 index 00000000..4e84f7f3 --- /dev/null +++ b/src/test/resources/custom-code/my-writer.xqy @@ -0,0 +1,8 @@ +xquery version "1.0-ml"; + +declare variable $URI external; + +xdmp:document-insert($URI || ".xml", world, ( + xdmp:permission("spark-user-role", "read"), + xdmp:permission("spark-user-role", "update") +)); diff --git a/src/test/resources/encoding/medline.iso-8859-1.archive.zip b/src/test/resources/encoding/medline.iso-8859-1.archive.zip new file mode 100644 index 00000000..e6705fd1 Binary files /dev/null and b/src/test/resources/encoding/medline.iso-8859-1.archive.zip differ diff --git a/src/test/resources/encoding/medline.iso-8859-1.txt b/src/test/resources/encoding/medline.iso-8859-1.txt new file mode 100644 index 00000000..4bc2f2f8 --- /dev/null +++ b/src/test/resources/encoding/medline.iso-8859-1.txt @@ -0,0 +1,12 @@ + + +10605436 +Concerning the localization of steroids in centrioles and basal bodies by immunofluorescence. +Istituto di Anatomia e Istologia Patologica, Università di Ferrara, Italy. 
+ + +12261559 +[An attempt to study, through genealogies, family structures in the case of a non-noble family] +Un tentativo di studio, tramite, genealogie, di strutture familiari nel caso di una famiglia non nobile + + diff --git a/src/test/resources/encoding/medline2.iso-8859-1.xml.gz b/src/test/resources/encoding/medline2.iso-8859-1.xml.gz new file mode 100644 index 00000000..268716c4 Binary files /dev/null and b/src/test/resources/encoding/medline2.iso-8859-1.xml.gz differ diff --git a/src/test/resources/json-lines/nested-objects.txt b/src/test/resources/json-lines/nested-objects.txt new file mode 100644 index 00000000..709f1359 --- /dev/null +++ b/src/test/resources/json-lines/nested-objects.txt @@ -0,0 +1,2 @@ +{"id": 1, "data": {"color": "blue", "numbers": [1, 2]}, "hello": "world"} +{"id": 2, "data": {"color": "green", "numbers": [2, 3]}} diff --git a/src/test/resources/logback.xml b/src/test/resources/logback.xml index a7649a17..ef6f1136 100644 --- a/src/test/resources/logback.xml +++ b/src/test/resources/logback.xml @@ -30,4 +30,11 @@ + + + diff --git a/src/test/resources/mlcp-archive-files/all-four-document-types/mlcp-binary.zip b/src/test/resources/mlcp-archive-files/all-four-document-types/mlcp-binary.zip new file mode 100644 index 00000000..fdceff65 Binary files /dev/null and b/src/test/resources/mlcp-archive-files/all-four-document-types/mlcp-binary.zip differ diff --git a/src/test/resources/mlcp-archive-files/all-four-document-types/mlcp-json.zip b/src/test/resources/mlcp-archive-files/all-four-document-types/mlcp-json.zip new file mode 100644 index 00000000..c53e374e Binary files /dev/null and b/src/test/resources/mlcp-archive-files/all-four-document-types/mlcp-json.zip differ diff --git a/src/test/resources/mlcp-archive-files/all-four-document-types/mlcp-text.zip b/src/test/resources/mlcp-archive-files/all-four-document-types/mlcp-text.zip new file mode 100644 index 00000000..a4587890 Binary files /dev/null and b/src/test/resources/mlcp-archive-files/all-four-document-types/mlcp-text.zip differ diff --git a/src/test/resources/mlcp-archive-files/all-four-document-types/mlcp-xml.zip b/src/test/resources/mlcp-archive-files/all-four-document-types/mlcp-xml.zip new file mode 100644 index 00000000..f040eee9 Binary files /dev/null and b/src/test/resources/mlcp-archive-files/all-four-document-types/mlcp-xml.zip differ diff --git a/src/test/resources/mlcp-archive-files/complex-properties.zip b/src/test/resources/mlcp-archive-files/complex-properties.zip new file mode 100644 index 00000000..b295d630 Binary files /dev/null and b/src/test/resources/mlcp-archive-files/complex-properties.zip differ diff --git a/src/test/resources/mlcp-archive-files/files-with-all-metadata.mlcp.zip b/src/test/resources/mlcp-archive-files/files-with-all-metadata.mlcp.zip new file mode 100644 index 00000000..10c87c56 Binary files /dev/null and b/src/test/resources/mlcp-archive-files/files-with-all-metadata.mlcp.zip differ diff --git a/src/test/resources/mlcp-archive-files/missing-content-entry.mlcp.zip b/src/test/resources/mlcp-archive-files/missing-content-entry.mlcp.zip new file mode 100644 index 00000000..5cb13b72 Binary files /dev/null and b/src/test/resources/mlcp-archive-files/missing-content-entry.mlcp.zip differ diff --git a/src/test/resources/mlcp-archive-files/normal-and-naked-entry.zip b/src/test/resources/mlcp-archive-files/normal-and-naked-entry.zip new file mode 100644 index 00000000..e0bde268 Binary files /dev/null and b/src/test/resources/mlcp-archive-files/normal-and-naked-entry.zip 
differ diff --git a/src/test/resources/mlcp-archive-files/two-naked-entries.zip b/src/test/resources/mlcp-archive-files/two-naked-entries.zip new file mode 100644 index 00000000..d7313725 Binary files /dev/null and b/src/test/resources/mlcp-archive-files/two-naked-entries.zip differ diff --git a/src/test/resources/mlcp-metadata/complete.xml b/src/test/resources/mlcp-metadata/complete.xml new file mode 100644 index 00000000..aff5aed2 --- /dev/null +++ b/src/test/resources/mlcp-metadata/complete.xml @@ -0,0 +1,46 @@ + + + xml + + + collection1 + collection2 + + + + + read + R + + spark-user-role + 3908672739498265499 + + + + qconsole-user + 16675984334178111809 + + + + update + U + + spark-user-role + 3908672739498265499 + + + <perms><sec:permission xmlns:sec="http://marklogic.com/xdmp/security"><sec:capability>read</sec:capability><sec:role-id>3908672739498265499</sec:role-id></sec:permission><sec:permission xmlns:sec="http://marklogic.com/xdmp/security"><sec:capability>read</sec:capability><sec:role-id>16675984334178111809</sec:role-id></sec:permission><sec:permission xmlns:sec="http://marklogic.com/xdmp/security"><sec:capability>update</sec:capability><sec:role-id>3908672739498265499</sec:role-id></sec:permission></perms> + 10 + + + meta2 + value2 + + + meta1 + value1 + + + <prop:properties xmlns:prop="http://marklogic.com/xdmp/property"><key2 xsi:type="xs:string" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">value2</key2><key1 xsi:type="xs:string" xmlns="org:example" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">value1</key1></prop:properties> + false + diff --git a/src/test/resources/rdf/bad-quads.trig b/src/test/resources/rdf/bad-quads.trig new file mode 100644 index 00000000..cb568ed1 --- /dev/null +++ b/src/test/resources/rdf/bad-quads.trig @@ -0,0 +1,11 @@ +@prefix rdf: . +@prefix xsd: . +@prefix swp: . +@prefix dc: . +@prefix ex: . +@prefix : . + +:G1 { :Monica ex:name "Monica Murphy" . + :Monica ex:homepage . + :Monica ex:email . + :Monica ex:hasSkill ex:Management diff --git a/src/test/resources/rdf/blank-nodes.xml b/src/test/resources/rdf/blank-nodes.xml new file mode 100644 index 00000000..08705407 --- /dev/null +++ b/src/test/resources/rdf/blank-nodes.xml @@ -0,0 +1,11 @@ + + + + + + + + + diff --git a/src/test/resources/rdf/each-rdf-file-type.zip b/src/test/resources/rdf/each-rdf-file-type.zip new file mode 100644 index 00000000..fd850d97 Binary files /dev/null and b/src/test/resources/rdf/each-rdf-file-type.zip differ diff --git a/src/test/resources/rdf/empty-taxonomy.xml b/src/test/resources/rdf/empty-taxonomy.xml new file mode 100644 index 00000000..4630e2ba --- /dev/null +++ b/src/test/resources/rdf/empty-taxonomy.xml @@ -0,0 +1,2 @@ + + diff --git a/src/test/resources/rdf/englishlocale.ttl b/src/test/resources/rdf/englishlocale.ttl new file mode 100644 index 00000000..8a9b8506 --- /dev/null +++ b/src/test/resources/rdf/englishlocale.ttl @@ -0,0 +1,34 @@ +@prefix ad: . +@prefix id: . +id:1111 ad:firstName "John" . +id:1111 ad:lastName "Snelson" . +id:1111 ad:homeTel 1111111111 . +id:1111 ad:email "John.Snelson@marklogic.com" . +id:2222 ad:firstName "Micah" . +id:2222 ad:lastName "Dubinko" . +id:2222 ad:homeTel "(222) 222-2222" . +id:2222 ad:email "Micah.Dubinko@marklogic.com" . +id:3333 ad:firstName "Fei" . +id:3333 ad:lastName "Ling" . +id:3333 ad:email "FeiLing@yahoo.com" . +id:3333 ad:email "Fei.Ling@marklogic.com" . +id:4444 ad:firstName "Ling"@en . +id:4444 ad:lastName "Ling" . +id:4444 ad:email "lingling@yahoo.com" . 
+id:4444 ad:email "Ling.Ling@marklogic.com" .
+id:5555 ad:firstName "Fei" .
+id:5555 ad:lastName "Xiang" .
+id:5555 ad:email "FeiXiang@yahoo.comm" .
+id:5555 ad:email "Fei.Xiang@marklogic.comm" .
+id:6666 ad:firstName "Lei" .
+id:6666 ad:lastName "Pei" .
+id:6666 ad:homeTel "(666) 666-6666" .
+id:6666 ad:email "Lei.Pei@gmail.com" .
+id:7777 ad:firstName "Meng" .
+id:7777 ad:lastName "Chen" .
+id:7777 ad:homeTel "(777) 777-7777" .
+id:7777 ad:email "Meng.Chen@gmail.com" .
+id:8888 ad:firstName "Lihan" .
+id:8888 ad:lastName "Wang" .
+id:8888 ad:email "lihanwang@yahoo.com" .
+id:8888 ad:email "Lihan.Wang@gmail.com" .
diff --git a/src/test/resources/rdf/englishlocale2.ttl.gz b/src/test/resources/rdf/englishlocale2.ttl.gz
new file mode 100644
index 00000000..b503e4a0
Binary files /dev/null and b/src/test/resources/rdf/englishlocale2.ttl.gz differ
diff --git a/src/test/resources/rdf/good-and-bad-rdf.zip b/src/test/resources/rdf/good-and-bad-rdf.zip
new file mode 100644
index 00000000..abc5e662
Binary files /dev/null and b/src/test/resources/rdf/good-and-bad-rdf.zip differ
diff --git a/src/test/resources/rdf/has-empty-entry.zip b/src/test/resources/rdf/has-empty-entry.zip
new file mode 100644
index 00000000..957b61fd
Binary files /dev/null and b/src/test/resources/rdf/has-empty-entry.zip differ
diff --git a/src/test/resources/rdf/mini-taxonomy.xml b/src/test/resources/rdf/mini-taxonomy.xml
new file mode 100644
index 00000000..9fc7d0da
--- /dev/null
+++ b/src/test/resources/rdf/mini-taxonomy.xml
@@ -0,0 +1,18 @@
+
+
+
+
+ wb
+ 2013-05-21T15:49:55Z
+ 0
+ Debt Management
+
+
+
+
+
+
diff --git a/src/test/resources/rdf/mini-taxonomy2.xml.gz b/src/test/resources/rdf/mini-taxonomy2.xml.gz
new file mode 100644
index 00000000..173cfeba
Binary files /dev/null and b/src/test/resources/rdf/mini-taxonomy2.xml.gz differ
diff --git a/src/test/resources/rdf/semantics.json b/src/test/resources/rdf/semantics.json
new file mode 100644
index 00000000..75f5409b
--- /dev/null
+++ b/src/test/resources/rdf/semantics.json
@@ -0,0 +1,75 @@
+{
+ "http://jondoe.example.org/#me": {
+ "http://www.w3.org/2000/01/rdf-schema#type": [
+ {
+ "value": "http://xmlns.com/foaf/0.1/Person",
+ "type": "uri"
+ }
+ ],
+ "http://xmlns.com/foaf/0.1/name": [
+ {
+ "value": "Jon",
+ "type": "literal"
+ }
+ ],
+ "http://example.org/icollide#name": [
+ {
+ "value": "Jon",
+ "type": "literal"
+ }
+ ],
+ "http://xmlns.com/foaf/0.1/depiction": [
+ {
+ "value": "http://jondoe.example.org/me.jpg",
+ "type": "uri"
+ }
+ ],
+ "http://xmlns.com/foaf/0.1/knows": [
+ {
+ "value": "http://janedoe.example.org/#me",
+ "type": "uri"
+ },
+ {
+ "value": "_:b0",
+ "type": "bnode"
+ }
+ ],
+ "http://xmlns.com/foaf/0.1/birthday": [
+ {
+ "value": "2010-03-23T13:40:22.489+00:00",
+ "type": "literal",
+ "datatype": "http://www.w3.org/2001/XMLSchema#dateTime"
+ }
+ ],
+ "http://xmlns.com/foaf/0.1/age": [
+ {
+ "value": "1",
+ "type": "literal"
+ }
+ ],
+ "http://xmlns.com/foaf/0.1/description": [
+ {
+ "value": "Just another Jon Doe",
+ "type": "literal",
+ "lang": "en"
+ },
+ {
+ "value": "Justement un autre Jon Doe",
+ "type": "literal",
+ "lang": "fr"
+ }
+ ],
+ "http://www.w3.org/2006/vcard/ns#geo": [
+ {
+ "value": "_:b1",
+ "type": "bnode"
+ }
+ ],
+ "http://www.w3.org/2006/vcard/ns#tel": [
+ {
+ "value": "+49-12-3546789",
+ "type": "literal"
+ }
+ ]
+ }
+}
diff --git a/src/test/resources/rdf/semantics.n3 b/src/test/resources/rdf/semantics.n3
new file mode 100644
index 00000000..705b5178
--- /dev/null
+++ b/src/test/resources/rdf/semantics.n3
@@ -0,0 +1,51 @@
+#Processed by Id: cwm.py,v 1.197 2007-12-13 15:38:39 syosi Exp
+ # using base file:///projects/MarkLogic/data.marklogic.com/w3c-tr.rdf
+
+# Notation3 generation by
+# notation3.py,v 1.200 2007-12-11 21:18:08 syosi Exp
+
+# Base was: file:///projects/MarkLogic/data.marklogic.com/w3c-tr.rdf
+ @prefix : <#> .
+ @prefix e: .
+ @prefix e2: .
+ @prefix g: .
+ @prefix j.0: .
+ @prefix j.1: .
+ @prefix j.2: .
+ @prefix j.3: .
+ @prefix j.4: .
+ @prefix j.6: .
+ @prefix j.7: .
+ @prefix j.8: .
+ @prefix j.9: .
+ @prefix owl: .
+ @prefix rdf: .
+ @prefix rdfs: .
+
+ () a rdf:List,
+ rdfs:Resource;
+ rdfs:comment "The empty list, with no items in it. If the rest of a list is nil then the list has no more items in it.";
+ rdfs:isDefinedBy rdf:;
+ rdfs:label "nil";
+ rdfs:seeAlso rdf: .
+
+ rdf:type a rdf:Property,
+ rdfs:Resource;
+ rdfs:comment "The subject is an instance of a class.";
+ rdfs:domain rdfs:Resource;
+ rdfs:isDefinedBy rdf:;
+ rdfs:label "type";
+ rdfs:range rdfs:Class;
+ rdfs:seeAlso rdf: .
+
+ "The Dublin Core Element Set v1.1 namespace provides URIs for the Dublin Core Elements v1.1. Entries are declared using RDF Schema language to support RDF applications."@en-US;
+ "English"@en-US;
+ "The Dublin Core Metadata Initiative"@en-US;
+ ,
+ ;
+ "The Dublin Core Element Set v1.1 namespace providing access to its content by means of an RDF Schema"@en-US;
+ ,
+ ;
+ ;
+ "1999-07-02";
+ "2003-03-24" .
diff --git a/src/test/resources/rdf/semantics.nq b/src/test/resources/rdf/semantics.nq
new file mode 100644
index 00000000..955e4667
--- /dev/null
+++ b/src/test/resources/rdf/semantics.nq
@@ -0,0 +1,5 @@
+# started 2012-06-04T11:00:11Z
+ .
+ .
+ .
+ .
diff --git a/src/test/resources/rdf/semantics.nt b/src/test/resources/rdf/semantics.nt
new file mode 100644
index 00000000..3e8b0269
--- /dev/null
+++ b/src/test/resources/rdf/semantics.nt
@@ -0,0 +1,31 @@
+# The N-Triples statements below are equivalent to this RDF/XML:
+#
+#
+#
+# N-Triples
+#
+#
+# Art Barstow
+#
+#
+#
+#
+# Dave Beckett
+#
+#
+#
+#
+
+
+ .
+ "N-Triples"@en-US .
+ _:art .
+ _:dave .
+
+_:art .
+_:art "Art Barstow".
+
+_:dave .
+_:dave "Dave Beckett".
diff --git a/src/test/resources/rdf/semantics2.json.gz b/src/test/resources/rdf/semantics2.json.gz
new file mode 100644
index 00000000..c71b35d6
Binary files /dev/null and b/src/test/resources/rdf/semantics2.json.gz differ
diff --git a/src/test/resources/rdf/semantics2.n3.gz b/src/test/resources/rdf/semantics2.n3.gz
new file mode 100644
index 00000000..cb90e2a7
Binary files /dev/null and b/src/test/resources/rdf/semantics2.n3.gz differ
diff --git a/src/test/resources/rdf/semantics2.nq.gz b/src/test/resources/rdf/semantics2.nq.gz
new file mode 100644
index 00000000..7822471b
Binary files /dev/null and b/src/test/resources/rdf/semantics2.nq.gz differ
diff --git a/src/test/resources/rdf/semantics2.nt.gz b/src/test/resources/rdf/semantics2.nt.gz
new file mode 100644
index 00000000..b50d1c52
Binary files /dev/null and b/src/test/resources/rdf/semantics2.nt.gz differ
diff --git a/src/test/resources/rdf/three-quads.trig b/src/test/resources/rdf/three-quads.trig
new file mode 100644
index 00000000..898900df
--- /dev/null
+++ b/src/test/resources/rdf/three-quads.trig
@@ -0,0 +1,30 @@
+# This document encodes three graphs.
+
+@prefix rdf: .
+@prefix xsd: .
+@prefix swp: .
+@prefix dc: .
+@prefix ex: .
+@prefix : .
+
+:G1 { :Monica ex:name "Monica Murphy" .
+ :Monica ex:homepage .
+ :Monica ex:email .
+ :Monica ex:hasSkill ex:Management }
+
+:G2 { :Monica rdf:type ex:Person .
+ :Monica ex:hasSkill ex:Programming }
+
+:G3 { :G1 swp:assertedBy _:w1 .
+ _:w1 swp:authority :Chris .
+ _:w1 dc:date "2003-10-02"^^xsd:date .
+ :G2 swp:quotedBy _:w2 .
+ :G3 swp:assertedBy _:w2 .
+ _:w2 dc:date "2003-09-03"^^xsd:date .
+ _:w2 swp:authority :Chris .
+ :Chris rdf:type ex:Person .
+ :Chris ex:email }
+
+{
+ :Default ex:graphname "Default"
+}
diff --git a/src/test/resources/rdf/three-quads2.trig.gz b/src/test/resources/rdf/three-quads2.trig.gz
new file mode 100644
index 00000000..30604110
Binary files /dev/null and b/src/test/resources/rdf/three-quads2.trig.gz differ
diff --git a/src/test/resources/rdf/turtle-triples.txt b/src/test/resources/rdf/turtle-triples.txt
new file mode 100644
index 00000000..8a9b8506
--- /dev/null
+++ b/src/test/resources/rdf/turtle-triples.txt
@@ -0,0 +1,34 @@
+@prefix ad: .
+@prefix id: .
+id:1111 ad:firstName "John" .
+id:1111 ad:lastName "Snelson" .
+id:1111 ad:homeTel 1111111111 .
+id:1111 ad:email "John.Snelson@marklogic.com" .
+id:2222 ad:firstName "Micah" .
+id:2222 ad:lastName "Dubinko" .
+id:2222 ad:homeTel "(222) 222-2222" .
+id:2222 ad:email "Micah.Dubinko@marklogic.com" .
+id:3333 ad:firstName "Fei" .
+id:3333 ad:lastName "Ling" .
+id:3333 ad:email "FeiLing@yahoo.com" .
+id:3333 ad:email "Fei.Ling@marklogic.com" .
+id:4444 ad:firstName "Ling"@en .
+id:4444 ad:lastName "Ling" .
+id:4444 ad:email "lingling@yahoo.com" .
+id:4444 ad:email "Ling.Ling@marklogic.com" .
+id:5555 ad:firstName "Fei" .
+id:5555 ad:lastName "Xiang" .
+id:5555 ad:email "FeiXiang@yahoo.comm" .
+id:5555 ad:email "Fei.Xiang@marklogic.comm" .
+id:6666 ad:firstName "Lei" .
+id:6666 ad:lastName "Pei" .
+id:6666 ad:homeTel "(666) 666-6666" .
+id:6666 ad:email "Lei.Pei@gmail.com" .
+id:7777 ad:firstName "Meng" .
+id:7777 ad:lastName "Chen" .
+id:7777 ad:homeTel "(777) 777-7777" .
+id:7777 ad:email "Meng.Chen@gmail.com" .
+id:8888 ad:firstName "Lihan" .
+id:8888 ad:lastName "Wang" .
+id:8888 ad:email "lihanwang@yahoo.com" .
+id:8888 ad:email "Lihan.Wang@gmail.com" .
diff --git a/src/test/resources/rdf/two-rdf-files.zip b/src/test/resources/rdf/two-rdf-files.zip
new file mode 100644
index 00000000..453ac09b
Binary files /dev/null and b/src/test/resources/rdf/two-rdf-files.zip differ
diff --git a/src/test/resources/spark-json/array-of-objects.json b/src/test/resources/spark-json/array-of-objects.json
new file mode 100644
index 00000000..7f831814
--- /dev/null
+++ b/src/test/resources/spark-json/array-of-objects.json
@@ -0,0 +1,10 @@
+[
+ {
+ "number": 1,
+ "hello": "world"
+ },
+ {
+ "number": 2,
+ "description": "This is different from the first object."
+ }
+]
diff --git a/src/test/resources/spark-json/json-array.zip b/src/test/resources/spark-json/json-array.zip
new file mode 100644
index 00000000..76149605
Binary files /dev/null and b/src/test/resources/spark-json/json-array.zip differ
diff --git a/src/test/resources/spark-json/json-lines.txt b/src/test/resources/spark-json/json-lines.txt
new file mode 100644
index 00000000..3125c2a7
--- /dev/null
+++ b/src/test/resources/spark-json/json-lines.txt
@@ -0,0 +1,2 @@
+{"number": 1, "hello": "world"},
+{"number": 2, "description": "This is different from the first object."}
diff --git a/src/test/resources/spark-json/json-objects.zip b/src/test/resources/spark-json/json-objects.zip
new file mode 100644
index 00000000..dd801895
Binary files /dev/null and b/src/test/resources/spark-json/json-objects.zip differ
diff --git a/src/test/resources/spark-json/single-object.json b/src/test/resources/spark-json/single-object.json
new file mode 100644
index 00000000..5c13526a
--- /dev/null
+++ b/src/test/resources/spark-json/single-object.json
@@ -0,0 +1,6 @@
+{
+ "number": 3,
+ "parent": {
+ "child": "text"
+ }
+}
diff --git a/src/test/resources/spark-json/some-bad-json-docs.zip b/src/test/resources/spark-json/some-bad-json-docs.zip
new file mode 100644
index 00000000..bfcd3706
Binary files /dev/null and b/src/test/resources/spark-json/some-bad-json-docs.zip differ