diff --git a/.github/workflows/build-fcs-endpoint-dev-tutorial-adoc.yml b/.github/workflows/build-fcs-endpoint-dev-tutorial-adoc.yml new file mode 100644 index 0000000..b53cf80 --- /dev/null +++ b/.github/workflows/build-fcs-endpoint-dev-tutorial-adoc.yml @@ -0,0 +1,36 @@ +name: build adocs + +on: + push: + branches: + - main + - dev + - feature/fcs-endpoint-dev-tutorial + paths: + - 'fcs-endpoint-dev-tutorial/**' + - '.github/workflows/build-fcs-endpoint-dev-tutorial-adoc.yml' + workflow_dispatch: + +concurrency: + group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }} + cancel-in-progress: true + +jobs: + build: + runs-on: ubuntu-latest + container: asciidoctor/docker-asciidoctor + + steps: + - uses: actions/checkout@v3 + + - name: Build HTML + run: asciidoctor -v -D docs -a data-uri --backend=html5 -o fcs-endpoint-dev-tutorial.html fcs-endpoint-dev-tutorial/index.adoc + + - name: Build PDF + run: asciidoctor-pdf -v -D docs -o fcs-endpoint-dev-tutorial.pdf fcs-endpoint-dev-tutorial/index.adoc + + - name: Store results + uses: actions/upload-artifact@v3 + with: + name: fcs-endpoint-dev-tutorial + path: docs/* diff --git a/fcs-endpoint-dev-tutorial/index.adoc b/fcs-endpoint-dev-tutorial/index.adoc new file mode 100644 index 0000000..2ccc642 --- /dev/null +++ b/fcs-endpoint-dev-tutorial/index.adoc @@ -0,0 +1,29 @@ += FCS 2.0 Endpoint Developer's Tutorial +Oliver Schonefeld ; Erik Körner +v1.0, 2016-01 +// more metadata +:description: This is a tutorial on how to develop CLARIN FCS endpoints. 
+:organization: CLARIN +// settings +:doctype: book +// source code +:source-highlighter: rouge +:rouge-style: igor_pro +// toc and heading +:toc: +:toclevels: 4 +:sectnums: +:sectnumlevels: 4 +:appendix-caption!: +// directory stuff +:imagesdir: images +// pdf +ifdef::backend-pdf[] +:pdf-theme: clarin +:pdf-themesdir: {docdir}/themes +:title-logo-image: image:{docdir}/themes/clarin-logo.svg[pdfwidth=5.75in,align=center] +endif::[] + +<<< + +include::java/index.adoc[leveloffset=+1] diff --git a/fcs-endpoint-dev-tutorial/java/adaption.adoc b/fcs-endpoint-dev-tutorial/java/adaption.adoc new file mode 100644 index 0000000..fa18f14 --- /dev/null +++ b/fcs-endpoint-dev-tutorial/java/adaption.adoc @@ -0,0 +1,58 @@ += Adaptation + +The easiest way to get started is to adapt the <>. + + +== SRUSearchEngine/SRUSearchEngineBase + +By extending `SimpleEndpointSearchEngineBase` or, if it suits your search engine's needs better, +`SRUSearchEngineBase` directly, you adapt the behaviour to your search engine. A few notes: + +* Do not override `init()`; use `doInit()` instead. +* If you need to do cleanup, do not override `destroy()`; use `doDestroy()` instead. +* Implementing the scan method is optional. If you want to provide custom scan behaviour for a different index, override the `doScan()` method. +* Implementing the explain method is optional. It is only needed if you need to fill the `writeExtraResponseData` block of the SRU response. The implementation of this method must be thread-safe. The `SimpleEndpointSearchEngineBase` implementation returns an `SRUExplainResult` with diagnostics only when this is requested via a request parameter. + + +=== Initialize the search engine + +The initialization should be tailored towards your environment and needs. You need to provide the context (`ServletContext`), the config (`SRUServerConfig`) and, if you want to register additional query parsers, a query parser builder (`SRUQueryParserRegistry.Builder`). 
In addition you can provide parameters gathered from servlet configuration and the servlet context. + + +== EndpointDescription + +`SimpleEndpointDescription` is an implementation of an endpoint description that is initialized from static information supplied at construction time. You will probably use the `SimpleEndpointDescriptionParser` to provide the endpoint description, but you can generate the list of resource info records in any way suitable to your situation. This is probably not the first behaviour you need to adapt, since the parser can be instantiated from either a URL or a W3C DOM `Document`. + + +== EndpointDescriptionParser + +The `SimpleEndpointDescriptionParser` is able to do the heavy lifting for you by parsing and extracting the information from the endpoint description, including everything needed for basic and required FCS 2.0 features like capabilities, supported layers and data views, resource enumeration etc. It also already provides simple consistency checks, such as checking that IDs are unique and that the declared capabilities and data views match. See <> section for further details. + + +== SRUSearchResultSet + +This class needs to be implemented to support your search engine's behaviour. Implement these methods: + +* `writeRecord()`, +* `getResultCountPrecision()`, +* `getRecordIdentifier()`, +* `nextRecord()`, +* `getRecordSchemaIdentifier()`, +* `getRecordCount()`, and +* `getTotalRecordCount()`. + + +== SRUScanResultSet + +This class needs to be implemented to support your search engine's behaviour. Implement these methods: + +* `getWhereInList()`, +* `getNumberOfRecords()`, +* `getDisplayTerm()`, +* `getValue()`, and +* `nextTerm()`. + + +== SRUExplainResult + +This class needs to be implemented to support your search engine's data source. 
diff --git a/fcs-endpoint-dev-tutorial/java/code-examples.adoc b/fcs-endpoint-dev-tutorial/java/code-examples.adoc new file mode 100644 index 0000000..165afab --- /dev/null +++ b/fcs-endpoint-dev-tutorial/java/code-examples.adoc @@ -0,0 +1,105 @@ += Code examples + +This section walks through the classes and methods you are most likely to override or implement, with code examples from one or more of the reference implementations. + +.Extract FCS-QL query from request +[source,java] +---- +if (request.isQueryType(Constants.FCS_QUERY_TYPE_FCS)) { + /* + * Got a FCS query (SRU 2.0). + * Translate to a proper Lucene query + */ + final FCSQueryParser.FCSQuery q = request.getQuery(FCSQueryParser.FCSQuery.class); + query = makeSpanQueryFromFCS(q); +} +---- + +.Translate FCS-QL query to `SpanTermQuery` +[source,java] +---- +private SpanQuery makeSpanQueryFromFCS(FCSQueryParser.FCSQuery query) throws SRUException { + QueryNode tree = query.getParsedQuery(); + logger.debug("FCS-Query: {}", tree.toString()); + // crude query translator + if (tree instanceof QuerySegment) { + QuerySegment segment = (QuerySegment) tree; + if ((segment.getMinOccurs() == 1) && (segment.getMaxOccurs() == 1)) { + QueryNode child = segment.getExpression(); + if (child instanceof Expression) { + Expression expression = (Expression) child; + if (expression.getLayerIdentifier().equals("text") && + (expression.getLayerQualifier() == null) && + (expression.getOperator() == Operator.EQUALS) && + (expression.getRegexFlags() == null)) { + return new SpanTermQuery(new Term("text", expression.getRegexValue().toLowerCase())); + } else { + throw new SRUException( + Constants.FCS_DIAGNOSTIC_GENERAL_QUERY_TOO_COMPLEX_CANNOT_PERFORM_QUERY, + "Endpoint only supports 'text' layer, the '=' operator and no regex flags"); + } + } else { + throw new SRUException( + Constants.FCS_DIAGNOSTIC_GENERAL_QUERY_TOO_COMPLEX_CANNOT_PERFORM_QUERY, + "Endpoint only supports simple expressions"); + } + } else { + throw new 
SRUException( + Constants.FCS_DIAGNOSTIC_GENERAL_QUERY_TOO_COMPLEX_CANNOT_PERFORM_QUERY, + "Endpoint only supports default occurances in segments"); + } + } else { + throw new SRUException( + Constants.FCS_DIAGNOSTIC_GENERAL_QUERY_TOO_COMPLEX_CANNOT_PERFORM_QUERY, + "Endpoint only supports single segment queries"); + } +} +---- + +.Serialize a single XML record as Data Views +[source,java] +---- +@Override +public void writeRecord(XMLStreamWriter writer) throws XMLStreamException { + XMLStreamWriterHelper.writeStartResource(writer, idno, null); + XMLStreamWriterHelper.writeStartResourceFragment(writer, null, null); + /* + * NOTE: use only AdvancedDataViewWriter, even if we are only doing + * legacy/simple FCS. + * The AdvancedDataViewWriter instance could also be + * reused, by calling reset(), if it was used in a smarter fashion. + */ + AdvancedDataViewWriter helper = new AdvancedDataViewWriter(AdvancedDataViewWriter.Unit.ITEM); + URI layerId = URI.create("http://endpoint.example.org/Layers/orth1"); + String[] words; + long start = 1; + if ((left != null) && !left.isEmpty()) { + words = left.split("\\s+"); + for (int i = 0; i < words.length; i++) { + long end = start + words[i].length(); + helper.addSpan(layerId, start, end, words[i]); + start = end + 1; + } + } + words = keyword.split("\\s+"); + for (int i = 0; i < words.length; i++) { + long end = start + words[i].length(); + helper.addSpan(layerId, start, end, words[i], 1); + start = end + 1; + } + if ((right != null) && !right.isEmpty()) { + words = right.split("\\s+"); + for (int i = 0; i < words.length; i++) { + long end = start + words[i].length(); + helper.addSpan(layerId, start, end, words[i]); + start = end + 1; + } + } + helper.writeHitsDataView(writer, layerId); + if (advancedFCS) { + helper.writeAdvancedDataView(writer); + } + XMLStreamWriterHelper.writeEndResourceFragment(writer); + XMLStreamWriterHelper.writeEndResource(writer); +} +---- diff --git a/fcs-endpoint-dev-tutorial/java/configuration.adoc 
b/fcs-endpoint-dev-tutorial/java/configuration.adoc new file mode 100644 index 0000000..199be44 --- /dev/null +++ b/fcs-endpoint-dev-tutorial/java/configuration.adoc @@ -0,0 +1,127 @@ += Configuration + +== Maven + +To include <> these are the dependencies: + +[source,xml] +---- + + + eu.clarin.sru.fcs + fcs-simple-endpoint + 1.3.0 + + + javax.servlet + servlet-api + 2.5 + jar + provided + + +---- + +If you want the current development version, it is `1.4-SNAPSHOT`; you then need to enable the CLARIN snapshots repository. + + +== Endpoint + +To enable SRU 2.0, which is required for FCS 2.0 functionality, you need to provide the following +initialization parameters to the servlet context: + +[source,xml] +---- + + eu.clarin.sru.server.sruSupportedVersionMax + 2.0 + + + eu.clarin.sru.server.legacyNamespaceMode + loc + +---- + +The endpoint configuration consists of the already mentioned context (`ServletContext`), a config (`SRUServerConfig`) and, if you want further query parsers, a query parser builder (`SRUQueryParserRegistry.Builder`). Additional parameters gathered from the servlet configuration and the servlet context are also available. + + +== EndpointDescriptionParser + +You will probably start out using the provided `EndpointDescriptionParser`. It will parse and make available what is required and also do some sanity checking. + +* `Capabilities`: the _basic search_ capability is required and _advanced search_ is available for FCS 2.0. The parser checks that any given capability is encoded as a proper URI and that the IDs are unique. +* Supported Data views: the parser checks that `` elements have: ++ +-- +** a proper `@id` attribute and that the value is unique. +** a `@delivery-policy` attribute, e.g. `DeliveryPolicy.SEND_BY_DEFAULT`, `DeliveryPolicy.NEED_TO_REQUEST`. +** a child text node with a MIME-type as its content, e.g. 
for _basic search (hits)_: `application/x-clarin-fcs-hits+xml` and for _advanced search_: `application/x-clarin-fcs-adv+xml` +-- ++ +Sample: `application/x-clarin-fcs-adv+xml` + +The parser makes sure that capabilities and declared data views actually match; otherwise it will warn you. + +* Supported Layers: the parser checks that `` elements have: + +** a proper `@id` attribute and that the value is unique. +** a proper `@result-id` attribute that is encoded as a proper URI, and that the child text node is "text", "lemma", "pos", "orth", "norm", "phonetic", or another value starting with "x-". +** if an `@alt-value-info-uri` attribute is present, that it is encoded as a proper URI, e.g. pointing to a tag description. +** if _advanced search_ is given in the capabilities, that it is also available. + +* Resources: the parser checks that some resources are actually defined, and that they have: + +** a proper `@xml:lang` attribute on its `` element. +** a child `` element +** a child `` element that uses ISO 639-3 three-letter language codes + + +== Translation library + +The current version of the translation library needs a mapping from <> to the word classes you use for the word class layer. It currently also does <> conversion for the phonetic layer. The mappings are specified in one configuration file, an XML document. The mapping will mostly be 1-to-1, but might require lossy translation in either direction. To guide you, we walk through configuration and mapping examples from the reference implementations. + + +=== Part-of-Speech (PoS) + +The PoS translation configuration is expressed in a TranslationTable element with the attributes `@fromResourceLayer`, `@toResourceLayer` and `@translationType`: + +[source,xml] +---- + + + +---- + +`@translationType` is currently a closed set of two values, but could be extended with any further definition of how the replacement is to be done. 
The values are _replaceWhole_ and _replaceSegments_, but _replaceSegments_ requires further definitions of trellis segment translations, which will not be +addressed by this tutorial. + +The values of `@fromResourceLayer` and `@toResourceLayer` only depend on these being declared +by `` elements under `//`: + +[source,xml] +---- + +---- + +The attributes of `` are `@resource`, `@layer` and `@formalism`. The value of `@layer` is (most easily) the identifier which is used for the layer in the FCS 2.0 specification. `@formalism` is (most easily) the namespace value prefix or a URI. E.g. for PoS this can be _SUC-PoS_ for the +already mentioned SUC PoS tagset, _CGN_ or _UD-17_. These tag sets often also include morphosyntactic descriptions (_MSD_) in their original form, but since MSD is not part of the FCS 2.0 specification we are only dealing with the PoS tags here. + +Going from UD-17's _VERB_ tag to Stockholm Umeå Corpus (SUC) Part-of-Speech you get two tags, +VB and PC: + +[source,xml] +---- + + +---- + +Adding the translation of the UD-17 AUX tag also gives VB in SUC-PoS, but in this direction it is a 1-to-1 translation. + +[source,xml] +---- + +---- + +As you can see from this, the precision varies and could become too low to be useful when going both ways, from the <> to the endpoint and then back. For this you can use the available alerting methods given in the FCS 2.0 specification. + +With non-1-to-1 translations you need to know how alternatives are expressed in the endpoint's query language. This is where the not yet available conversion library would use the translation library, adding rule-based knowledge on how to translate to e.g. CQP `[pos = "VB" | pos = "PC"]`. 
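
The handling of alternatives described above can be sketched in Java. This is only a minimal illustration under stated assumptions, not the conversion library (which is not yet available) and not part of any CLARIN API: the class `PosAlternatives`, the map `UD17_TO_SUC` and the methods `toCqp` and `translate` are hypothetical names, and the table holds only the two UD-17 to SUC-PoS examples from this section. It shows how a 1-to-many translation could be rendered as a single CQP token expression using `|`.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper for illustration; not part of the FCS libraries. */
final class PosAlternatives {

    // Assumed in-memory form of the UD-17 -> SUC-PoS translation table,
    // filled only with the two examples discussed in this section.
    static final Map<String, List<String>> UD17_TO_SUC = new LinkedHashMap<>();
    static {
        UD17_TO_SUC.put("VERB", List.of("VB", "PC")); // 1-to-2 translation
        UD17_TO_SUC.put("AUX", List.of("VB"));        // 1-to-1 translation
    }

    /**
     * Renders the given tags as alternatives on one CQP token.
     * Simplification: tag values are not escaped, so they must not
     * contain quotes or regex metacharacters.
     */
    static String toCqp(String attribute, List<String> tags) {
        if (tags == null || tags.isEmpty()) {
            throw new IllegalArgumentException("At least one tag is required");
        }
        return tags.stream()
                .map(tag -> attribute + " = \"" + tag + "\"")
                .collect(Collectors.joining(" | ", "[", "]"));
    }

    /** Looks up a UD-17 tag and renders its SUC-PoS translations as CQP. */
    static String translate(String udTag) {
        List<String> tags = UD17_TO_SUC.get(udTag);
        if (tags == null) {
            throw new IllegalArgumentException("No mapping for UD-17 tag: " + udTag);
        }
        return toCqp("pos", tags);
    }

    public static void main(String[] args) {
        System.out.println(translate("VERB")); // [pos = "VB" | pos = "PC"]
        System.out.println(translate("AUX"));  // [pos = "VB"]
    }
}
```

Keeping the lookup (`translate`) separate from the rendering (`toCqp`) means the same table could feed a different renderer for an endpoint whose query language expresses alternatives in another way.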
diff --git a/fcs-endpoint-dev-tutorial/java/index.adoc b/fcs-endpoint-dev-tutorial/java/index.adoc new file mode 100644 index 0000000..ee7e133 --- /dev/null +++ b/fcs-endpoint-dev-tutorial/java/index.adoc @@ -0,0 +1,9 @@ += Java FCS-SRU Endpoint + +include::introduction.adoc[] + +include::adaption.adoc[leveloffset=+1] + +include::code-examples.adoc[leveloffset=+1] + +include::configuration.adoc[leveloffset=+1] diff --git a/fcs-endpoint-dev-tutorial/java/introduction.adoc b/fcs-endpoint-dev-tutorial/java/introduction.adoc new file mode 100644 index 0000000..f1357f6 --- /dev/null +++ b/fcs-endpoint-dev-tutorial/java/introduction.adoc @@ -0,0 +1,80 @@ +== Requirements + +* Reference libraries: <>, <>, <> or your own selected FCS 2.0 and +SRU 2.0 compatible libraries. +* Endpoint reference library: <> or your own from scratch. +* Translation library (optional) + + +== Resources + +Specifications:: + * FCS 2.0 specification -- <> + * SRU 2.0 specification -- <> + +Maven dependencies:: + Reference libraries: <>, <>, and <> (simple as well as other ones). See <> section. 
+ +Implementations:: + * http://clarin.ids-mannheim.de/downloads/clarin/DigiBibSRU-source-2016-02-08.zip + * https://github.com/clarin-eric/fcs-korp-endpoint/[Korp Endpoint] + + +== References + +[[ref:SRUServer]]SRUServer:: + SRU/CQL server implementation, conforming to SRU/CQL protocol version 1.1 and 1.2 and (partially) 2.0, June 2023, + https://github.com/clarin-eric/fcs-sru-server/ + +[[ref:SRUClient]]SRUClient:: + SRU/CQL client implementation, conforming to SRU/CQL protocol version 1.1, 1.2 and (partially) 2.0, June 2023, + https://github.com/clarin-eric/fcs-sru-client/ + +[[ref:FCS-QL]]FCS-QL:: + CLARIN-FCS Core 2.0 query language grammar and parser, June 2023, + https://github.com/clarin-eric/fcs-ql/ + +[[ref:FCSSimpleEndpoint]]FCSSimpleEndpoint:: + A simple CLARIN FCS endpoint, June 2023, + https://github.com/clarin-eric/fcs-simple-endpoint/ + +[[ref:FCSAggregator]]FCSAggregator:: + Federated Content Search Aggregator, June 2023, + https://github.com/clarin-eric/fcs-sur-aggregator/, + https://contentsearch.clarin.eu/ + +[[ref:CLARIN-FCSCore20]]CLARIN-FCS-Core 2.0:: + CLARIN Federated Content Search (CLARIN-FCS) - Core 2.0, SCCTC FCS Task-Force, June 2023, + https://office.clarin.eu/v/CE-2017-1046-FCS-Specification-v20230426.pdf[PDF], + https://github.com/clarin-eric/fcs-misc/tree/main/fcs-core-2.0[sources (asciidoc, examples, xml schema)] + +[[ref:OASIS-SRU20]]OASIS-SRU20:: + searchRetrieve: Part 3. 
SRU searchRetrieve Operation: APD Binding for SRU 2.0 Version 1.0, OASIS, January 2013, + http://www.loc.gov/standards/sru/sru-2-0.html, + http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part3-sru2.0/searchRetrieve-v1.0-os-part3-sru2.0.doc, + http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part3-sru2.0/searchRetrieve-v1.0-os-part3-sru2.0.html[(HTML)], + http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part3-sru2.0/searchRetrieve-v1.0-os-part3-sru2.0.pdf[(PDF)] + +[[ref:UD-POS]]UD-POS:: + Universal Dependencies, Universal POS tags v2.0, + https://universaldependencies.github.io/u/pos/index.html + +[[ref:SAMPA]]SAMPA:: + Dafydd Gibbon, Inge Mertins, Roger Moore (Eds.): Handbook of Multimodal and Spoken Language Systems. Resources, Terminology and Product Evaluation, Kluwer Academic Publishers, Boston MA, 2000, ISBN 0-7923-7904-7 + + +== Typographic and XML Namespace conventions + +The following typographic conventions for XML fragments will be used throughout this tutorial: + +* `<prefix:Element>` ++ +An XML element with the Generic Identifier _Element_ that is bound to an XML namespace denoted by the prefix _prefix_. + +* `@attr` ++ +An XML attribute with the name _attr_. + +* `string` ++ +The literal _string_ must be used either as element content or attribute value. diff --git a/fcs-endpoint-dev-tutorial/themes b/fcs-endpoint-dev-tutorial/themes new file mode 120000 index 0000000..de90031 --- /dev/null +++ b/fcs-endpoint-dev-tutorial/themes @@ -0,0 +1 @@ +../themes \ No newline at end of file diff --git a/historical/documents/FCS-2-endpoint-developers-tutorial.pdf b/historical/documents/FCS-2-endpoint-developers-tutorial.pdf new file mode 100644 index 0000000..3162827 Binary files /dev/null and b/historical/documents/FCS-2-endpoint-developers-tutorial.pdf differ