The indexing system has grown complex over the years, and some features aren't very useful, or aren't needed anymore if you use Saxon with XPath 3 support. We should deprecate and eventually remove unnecessary features to reduce complexity. There should be good documentation that makes it clear how to accomplish your goals without relying on deprecated features.
Linked documents
Could we deprecate the "linked documents" feature and use XPath's doc() function to look things up instead? We'd have to look into this, especially when it comes to AutoSearch and uploading (possibly zipped) contents and metadata, but it's probably possible with a custom document resolver in Saxon.
tokenIdPath/tokenRefPath (Standoff annotations)
Standoff annotations that index an inline tag or relation are still useful, as they refer to multiple words. But maybe we can do away with tokenIdPath/tokenRefPath (capturing and referring to token id values), and simply have the standoff annotation's XPaths select the relevant word nodes, then look up the token positions of those nodes.
Standoff annotations tied to a single word can be retrieved while indexing that word using XPath, but it probably doesn't make sense to remove this option from standoffAnnotations. Sometimes doing it this way can be simpler.
Non-XML file formats don't support standoff annotations, so they don't use tokenIdPath/tokenRefPath anyway.
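To make the idea concrete, here is a minimal Python sketch of resolving a standoff annotation by selecting word nodes and looking up their token positions, rather than capturing token ids in a separate step. It is not BlackLab code: the element and attribute names (`w`, `span`, `from`, `to`, `id`) and the function `index_standoff` are made up for illustration, and Python's ElementTree stands in for Saxon.

```python
import xml.etree.ElementTree as ET

# Hypothetical input; all element and attribute names are assumptions.
DOC = """\
<doc>
  <text>
    <w id="a">The</w>
    <w id="b">quick</w>
    <w id="c">fox</w>
    <w id="d">jumps</w>
  </text>
  <standoff>
    <span type="np" from="a" to="c"/>
  </standoff>
</doc>
"""

def index_standoff(xml_text):
    root = ET.fromstring(xml_text)
    # Token positions are assigned in document order while "indexing" the words.
    # Elements hash by identity, so the node itself can serve as the key.
    position_of = {w: pos for pos, w in enumerate(root.iter("w"))}

    spans = []
    for span in root.iter("span"):
        # The annotation's path selects the word nodes themselves ...
        first = root.find(f".//w[@id='{span.get('from')}']")
        last = root.find(f".//w[@id='{span.get('to')}']")
        # ... and the token position is looked up from the node, not from an id table.
        spans.append((span.get("type"), position_of[first], position_of[last]))
    return spans

print(index_standoff(DOC))
```

The only per-token bookkeeping left is the node-to-position map, which the indexer would build anyway while walking the words.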
Subannotations
How useful is the concept of subannotations?
they automatically inherit the parent's name as a prefix (doesn't seem essential)
they can re-use the parent's matching nodes if the valuePaths match (but that optimization could be extended to all annotations that share the same valuePath, and becomes a little trickier when you're doing more work in XPath expressions)
they support forEach, unlike top-level annotations, although each subannotation does need to be declared. So there's no reason we couldn't support forEach for top-level annotations in the same way.
Other than that, they're just annotations like any other, so it may not be worth it to treat them differently. This would simplify the code and documentation.
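The node re-use optimization mentioned above does not depend on the parent/subannotation relationship. A rough Python sketch, assuming a per-word cache keyed on the path string (the class `CachingEvaluator`, the function `index_word` and the sample XML are hypothetical, and ElementTree paths stand in for real XPath):

```python
import xml.etree.ElementTree as ET

class CachingEvaluator:
    """Evaluates paths relative to the current word, re-using results for repeated paths."""

    def __init__(self):
        self.evaluations = 0  # counts actual path evaluations (cache misses)
        self._word = None
        self._cache = {}

    def start_word(self, word):
        # Cached nodes are only valid for one context node, so reset per word.
        self._word = word
        self._cache.clear()

    def find(self, path):
        if path not in self._cache:
            self.evaluations += 1
            self._cache[path] = self._word.findall(path)
        return self._cache[path]

def index_word(word, annotations, evaluator):
    """annotations: (name, path, attribute) triples; attribute None means element text."""
    evaluator.start_word(word)
    values = {}
    for name, path, attribute in annotations:
        nodes = evaluator.find(path)
        if nodes:
            node = nodes[0]
            values[name] = node.get(attribute) if attribute else (node.text or "")
    return values

# Hypothetical word element and annotation declarations.
WORD = ET.fromstring('<w>is<ana lemma="be" pos="VERB"/></w>')
ANNOTATIONS = [("lemma", "./ana", "lemma"), ("pos", "./ana", "pos"), ("word", ".", None)]

evaluator = CachingEvaluator()
values = index_word(WORD, ANNOTATIONS, evaluator)
print(values, evaluator.evaluations)
```

Here `lemma` and `pos` share one evaluation of `./ana` even though neither is declared as a subannotation of the other. As noted in the list, this gets harder once the value itself is computed inside a larger XPath expression, because the shared part is then no longer a simple string match.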
Capture value paths, processing steps
XML file formats should be able to replace these with XPath 3.1 expressions. If we're capturing annotations with forEach, but want to perform different processing for different annotations, we can't do that yet. Maybe we need a per-annotation processPath that is applied after the annotation node or value is captured while executing the forEach.
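One way to picture the per-annotation processing step is sketched below in Python. This is only an illustration of the control flow: the `PROCESS` table, the function `capture_for_each` and the sample XML are hypothetical, and plain Python functions stand in for what would be an XPath expression (the proposed processPath) in a real config.

```python
import xml.etree.ElementTree as ET

# Hypothetical word element with name/value feature pairs.
WORD = ET.fromstring(
    '<w><feat name="lemma" value="Be"/><feat name="pos" value="verb.fin"/></w>'
)

# Hypothetical per-annotation processing, applied after forEach has captured the value.
PROCESS = {
    "lemma": str.lower,
    "pos": lambda value: value.split(".")[0].upper(),
}

def capture_for_each(word, for_each_path, name_attr, value_attr, process):
    values = {}
    for node in word.findall(for_each_path):
        name = node.get(name_attr)
        value = node.get(value_attr)
        # Annotations without a processing entry keep their captured value.
        values[name] = process.get(name, lambda v: v)(value)
    return values

result = capture_for_each(WORD, "./feat", "name", "value", PROCESS)
print(result)
```

The point is that the forEach loop stays generic, while each declared annotation can attach its own post-processing.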
Non-XML file formats also use processing steps, and there's no clear alternative here. Maybe a few precooked options, or plugins?
Note also that XPath can be a little tricky, so the documentation should give lots of examples of how to accomplish the same functionality in XPath. E.g. a default value is done like this:
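# get 'ana' attribute from link element, or use 'dep' as the default value (XPath 2+)
# (@ana/string() yields an empty sequence if the attribute is missing, whereas string(@ana) would yield '')
valuePath: "./link/(@ana/string(), 'dep')[1]"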
Input format inheritance (NOW DEPRECATED, remove in v5)
It is possible to derive input formats from other input formats, but this isn't very useful in practice. It is rare that your input format just needs to add a small thing compared to a base format; more often you need to make more changes. Having the whole format in one file is the simplest and most readable solution, even if it leads to a little bit of duplication sometimes.
Removing inheritance would simplify loading formats, as we wouldn't need to ensure the base format has been loaded before loading the derived format. It also just reduces BlackLab's complexity.
jan-niestadt changed the title from "Simplify indexing by deprecating unnecessary features such as inheritance" to "Simplify indexing by deprecating questionable features such as linked documents" on Sep 8, 2023