Support indexless|hierarchical generic content part declaration patterns. #1
This is relevant in this sense:
In some cases it makes sense to use a second pass for |
Allows e.g. hierarchical progressive splitting of sections, i.e. gathering context (!) while on the way to detecting […] Thus, in the end, the leaf content parts found are the same as in the current weighted score system where […] As only markup stores the hierarchy level, there is no known way around extending the declaration detection on a per sheet document type basis, e.g. ODT, DOCX, MD, RST, ... In these sets of declarations, the uppermost level must by all means get the highest weight, except where it occurs more than once, to ensure a top-down approach, which is mandatory here due to the tree structure. (Currently, as said above, only leaf content parts are detected in |
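The top-down splitting with context gathering described above can be sketched for the MD case as follows. This is only an illustration, not the project's detection code: the function name `split_parts_with_context`, the restriction to `#`-style headings and the returned tuple layout are all assumptions.

```python
import re

# Assumption: only ATX headings ("#" to "######") declare a hierarchy level.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_parts_with_context(markdown_text):
    """Return (context, body) pairs for every heading that has body text.

    The context is the path of headings from the top level down to the
    heading that owns the body, gathered while descending the tree.
    """
    stack = []  # (level, title) of the headings that are currently open
    parts = []  # collected (context path, body text) pairs
    body = []   # body lines seen since the last heading

    def flush():
        text = "\n".join(body).strip()
        if stack and text:
            parts.append((tuple(title for _, title in stack), text))
        body.clear()

    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match is None:
            body.append(line)
            continue
        flush()
        level = len(match.group(1))
        # Close every open heading that is at the same or a deeper level.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, match.group(2).strip()))
    flush()
    return parts


if __name__ == "__main__":
    sample = "# A\n## A1\ntext one\n## A2\ntext two\n# B\nintro\n"
    for context, text in split_parts_with_context(sample):
        print(" > ".join(context), "|", text)
```

Headings without body text of their own (here `A`) yield no part; they only contribute context to the parts below them.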
…orlddevelopment#1 May be extended to tuples with more than two entries.
…orlddevelopment#1 These are:
* isGeneric
* isSpecific
* isMarkupPhraseFilter
* ...
Also add functions that are required since commit c879d6c:
* canResultBeIndex
* canMatchIndex
Also add a so far unused function, which is less specific because many kinds of patterns can contain a number (some for hierarchy, some for index|pos, some as real content numbers):
* canResultContainNumber
Also add a TODO to rename isWordedPattern to
* isContentPatternIndexless or
* isContentPhraseFilterIndexless
Also change some variable names from e[_]? to p\1 as the rename from exercise to (content) part is due with worlddevelopment#1.
Definitions
Indexless := no index numbering scheme, i.e. if a number occurs then it is either content or denoting a hierarchy in a markup and not a series. => numbers are explicit (no regular expression) => can only have an implicit ordering.
Indexed := with index numbering scheme (i.e. explicit order)
Generic := filter by an expression (regex|wildcard|...)
Specific := explicit := filter by explicit content (repeating phrase)
Raw content := markup content
Content := plain text content, i.e. the visual content like information text, media, ...
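To make the Generic versus Specific distinction concrete, here is a minimal sketch. The class `PartFilter` and its members are hypothetical names chosen for illustration and are not taken from the project: a specific filter is matched literally, a generic one is interpreted as a regular expression.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class PartFilter:
    """Hypothetical container for one content part declaration."""

    pattern: str
    is_generic: bool  # True: regular expression, False: explicit phrase

    def to_regex(self):
        # A specific filter is escaped so that every character is literal.
        source = self.pattern if self.is_generic else re.escape(self.pattern)
        return re.compile(source)

    def find_starts(self, text):
        """Return the start offsets of all declarations found in text."""
        return [match.start() for match in self.to_regex().finditer(text)]


if __name__ == "__main__":
    specific = PartFilter("Task (a)", is_generic=False)
    generic = PartFilter(r"\d+\.", is_generic=True)
    print(specific.find_starts("Task (a) foo Task (a) bar"))
    print(generic.find_starts("1. a 2. b"))
```

The specific filter can only give an implicit ordering (the order of occurrence), while the generic one captures the number and could therefore match an index.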
Content part declarations
[Content phrase filter] (matching all specific content parts within all hierarchy levels mixed)
generic|regex|wildcard
can match index (have an explicit series order)
all|mixed
numbers+special chars only (have an explicit series order, hierarchy e.g. 1.1, 1.2, 2.1, ...)
Note: A number-based index may need filtering of false positives because numbers occur in the content parts, too.
guaranteed indexless
specific|explicit (guaranteed indexless)
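One possible way to filter the false positives mentioned in the note is to accept a candidate only if it continues the series. The sketch below assumes a flat index that starts at 1 and increases by 1; the candidate pattern and the function name `series_starts` are assumptions, not the project's implementation.

```python
import re

# Assumption: an index stands at the start of a line and ends in "." or ")".
CANDIDATE = re.compile(r"^(\d+)[.)]\s", re.MULTILINE)


def series_starts(text, first=1):
    """Return the offsets of candidates that continue the series first, first+1, ..."""
    expected = first
    starts = []
    for match in CANDIDATE.finditer(text):
        if int(match.group(1)) == expected:
            starts.append(match.start())
            expected += 1
        # Any other number is treated as content, not as an index.
    return starts


if __name__ == "__main__":
    sample = "1. Add the values.\n42. is not an index here\n2. Subtract.\n"
    for start in series_starts(sample):
        print(sample[start:].splitlines()[0])
```

A hierarchical scheme such as 1.1, 1.2, 2.1 would need one expected counter per level instead of a single one.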
[Raw content | Markup phrase filter]
specific|explicit, match only indexless (no order; numbers denote hierarchy depth; Matches only within one hierarchy level)
generic, can match index (Matching all series within all hierarchy levels in one pass! [1])
Note: This is the default case for XML-based file formats. It requires keeping track of the hierarchy depth in code because a node has no number attached! It can, however, have a style attached that denotes depth.
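The depth counting mentioned in the note could look roughly like this with the Python standard library. The tag name `section` and the attribute `title` are placeholders for illustration; real formats such as ODT or DOCX use their own element names and often encode the level in a style.

```python
import xml.etree.ElementTree as ET


def section_levels(element, tag="section", level=0):
    """Yield (level, title) for every nested element with the given tag.

    The level is counted in code while descending, because the node
    itself carries no number.
    """
    if element.tag == tag:
        level += 1
        yield level, element.get("title", "")
    for child in element:
        yield from section_levels(child, tag, level)


if __name__ == "__main__":
    sample = (
        '<doc><section title="1"><p>x</p><section title="1.1"/></section>'
        '<section title="2"/></doc>'
    )
    for level, title in section_levels(ET.fromstring(sample)):
        print(level, title)
```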
[Mixed: Markup & Content phrase filters]
Note: For all XML-based file formats this merged pattern is easier to achieve via postprocessing the respective content part's head after employing the generic, indexless filter.
[1] Only of limited use, as higher-level elements have no content part if strict sectioning is followed. What remains is only the declaration, unless there is summary|description content between e.g. 1. and its subsection 1.1.
Purpose
They are essential for the worlddevelopment civilization editor, open bookkeeper bot, ...