UIMA Tokens Regex is a UIMA Analysis Engine (AE) that implements regex-like pattern matching over a sequence of UIMA annotations.
UIMA Tokens Regex has been described together with TermSuite and published as system demonstration at ACL2016:
Damien Cram and Béatrice Daille. Terminology Extraction with Term Variant Detection. Proceedings of ACL-2016 System Demonstrations. Download PDF
- Getting started
- Documentation
- The UIMA Tokens Regex resource file
- Rule file header: import,
use
,set
- Rule definition: the keyword rule
- Matcher definition
- Matcher predefinition: the keyword matcher
- The keyword as
- The quantifiers:
?
,*
,+
,{n}
,{m,n}
- Sub-rule disjunctions |
- Sub-rule quantifiers
- The universal matcher []
- Ignoring labels with ~
- Commenting with #
- Example rule list resource file
- Rule file header: import,
- UIMA Tokens Regex Analysis Engine (AE)
- Example implementation of UIMA Tokens Regex AE
- The UIMA Tokens Regex resource file
Add UIMA Tokens Regex to your classpath, with Maven or Gradle:
Maven:
<dependency>
<groupId>fr.univ-nantes.julestar</groupId>
<artifactId>uima-tokens-regex</artifactId>
<version>1.5</version>
</dependency>
Gradle:
compile 'fr.univ-nantes.julestar:uima-tokens-regex:1.5'
Or download the jar and add it manually to your classpath.
The next thing to do is writing the regex rules in valid UIMA Tokens Regex format. Example:
# file: my-regex-list.txt
# your UIMA type system
import fr.univnantes.termsuite.types.TermSuiteTypeSystem;
# the annotation type your need to analyse
use fr.univnantes.termsuite.types.WordAnnotation;
rule "rule 1": [category="noun"] /^of$/ /^the$/? [category="noun"] ;
The minimalist file above defines only one rule: rule 1
, which matches in a sequence of word annotations any subsequence starting with a noun, followed by the preprosition of, followed optionnaly by (see the quantifier "?
") the determiner the, followed by a noun again.
For example, this rule matches the following subsequences: top of the tower
, source of energy
, power of the wind
, etc.
The import
statement points to your UIMA type system (must-be accessible from UIMAfit). The use
statement points to the UIMA annotation type being analyzed in the type system. See Documentation for more details on the syntax.
Create a java class extending TokenRegexAE
. It will be called whenever a rule has matched. The following example creates a new UIMA annotation of type OccAnno whenever a rule matches.
public class RecogEngine extends TokenRegexAE {
@Override
protected void ruleMatched(JCas jCas, RegexOccurrence occurrence) {
OccAnno a = new OccAnno(jCas);
a.setBegin(occurrence.getBegin());
a.setEnd(occurrence.getEnd());
a.setRule(occurrence.getRule().getName());
a.setPattern(occurrence.getLabelledAnnotations().stream()
.map(LabelledAnnotation::getLabel)
.collect(Collectors.joining(" ")));
a.addToIndexes();
}
}
Create your AE description and and run the pipeline:
// The AE description
AnalysisEngineDescription ae = AnalysisEngineFactory.createEngineDescription(RecogEngine.class);
// The UIMA resource description of your list of regex rules
ExternalResourceDescription resDesc = ExternalResourceFactory.createExternalResourceDescription(
RegexListResource.class,
"file:my-regex-list.txt"
);
// Bind AE description to the resource description
ExternalResourceFactory.bindResource(
ae,
TokenRegexAE.TOKEN_REGEX_RULES,
resDesc);
// Run the pipeline
AnalysisEngine engine = UIMAFramework.produceAnalysisEngine(ae);
SimplePipeline.runPipeline(cas, engine);
Getting UIMA Tokens Regex to work requires two mandatory phases: defining the UIMA Tokens Regex resource file (a list of rules, see The UIMA Tokens Regex resource file) and implementing the engine logic in Java
(UIMA Tokens Regex AE).
The UIMA Tokens Regex resource file is the list of rules that UIMA Tokens Regex AE needs as input. This resource file consists in three parts, details in next subsections :
- the file header (mandatory)
- the definition of matchers (optional)
- the definition of rules (mandatory)
Example
# Header
import eu.project.ttc.types.TermSuiteTypeSystem;
use eu.project.ttc.types.WordAnnotation;
set inline = false;
# Definition of matchers
matcher A: [category=="adjective"];
matcher N: [category=="noun"];
# Definition of rules
rule "an": A N;
rule "nnn": N N N;
rule "nn": N N;
The import
statement is mandatory and unique. It takes as value the path to the type system definition of the CAS to analyse.
The use
statement is mandatory and unique. It takes as value the fully qualified name of the analysed type (must be defined in the type system). The feature names available in matcher declarations (see Matcher definition) are the features contained in the type declared with the keyword use
.
The set
statement is optional and may be declared multiple times. It configures the options of the UIMA Tokens Regex's resource parser. Available options are :
inline
(boolean, default:true
) : when set totrue
, matchers can be declared "inline", i.e. directly in the right parts of the rules. When set totrue
, the matcher definition part is optional.
Each rule is defined with the keyword rule
, a mandatory rule name, a colon :
, the regular expression part (i.e. a sequence of annotation matchers with associated quantifiers) and a semi-colon ;
.
The example below defines the rule named My rule, with two annotation matchers : one that matches any annotation with a feature category
having a value of "noun" (occurring 1 to n times sequentially), followed by one that matches exactly one annotation with a feature category
having a value of "adjective".
rule "My rule": [category=="noun"]+ [category=="adjective"];
There are two types of matcher, each defined with its own syntax : the The feature structure matcher and the covered text string matcher.
The feature structure matcher applies to UIMA feature structures, i.e. annotations. It is a boolean expression defined within brackets [
and ]
.
Each boolean expression has the form feature
op
literal
. In the example below, the rule has one matcher, where feature
is category, op
is the equality operator ==
, and literal
is the string value noun.
rule "My rule": [category=="noun"];
Boolean expressions within brackets can also be composed of sub-expressions with the help of parentheses and three types of boolean operators :
operator | description |
---|---|
& |
the logical conjunction |
| |
the logical disjunction |
The operators &
and |
are n-ary:
rule "My rule": [category=="noun" & (lemma == "chair" |
(stem=="mang" & mood="participle" & tense="past") | lemma=="vitesse")];
UIMA types' features
All features declared in the type used (see use}) can be referenced from the left part of a boolean expression in matchers. For example, if the type used is the UIMA WordAnnotation
type from platform TermSuite, the available features would be:
feature | literal type |
---|---|
begin |
Integer |
end |
Integer |
category |
String |
lemma |
String |
stem |
String |
tag |
String |
subCategory |
String |
number |
String |
gender |
String |
case |
String |
mood |
String |
tense |
String |
person |
String |
possessor |
String |
degree |
String |
formation |
String |
Operators
There exists 6 different operators :
operator | description |
---|---|
== |
the equality |
!= |
the inequality |
<= |
less than or equal to |
< |
less than |
>= |
greater than or equal to |
> |
greater than |
Literals
Four different types of literals are defined:
| Literal type | Possible values in UIMA Tokens Regex syntax | Mapped UIMA types |
|---|---|
| String
| "noun", "chair", "axe" | String
|
| Integer
| Integer
, Short
, Long
|
| Float
| Float
, Double
|
| Boolean
| true, false | Boolean
|
UIMA array and list types and associated collection operators (in
, not in
, etc.) are not yet supported by UIMA Tokens Regex.
The covered text string matcher applies to any UIMA textual annotation, i.e. annotation with begin
and end
features applying on string SOFAs. It is a Java
regular expressions language defined between two slashes /
that matches against the raw text covered by the annotation, i.e. the value returned by the method getCoveredText()
.
For example, the rule below has exactly one matcher that matches any annotation covering a text starting with "mang".
rule "My rule": /^mang.*$/;
Matcher predefinitions are a list a matcher
statements occurring after the header and before the rule
statements.
So far, our examples showed matchers defined inline, i.e. directly on the right part of the rule. The keyword matcher
allow to predefine matchers before they are used in rule declaration.
matcher M1 : /^mang.*$/;
matcher M2 : [category=="noun"];
rule "My rule": M1 M2;
The matcher predefinition declares a matcher label (M1
and M2
in the example above) and the actual matcher definition.
Matcher labels can be invoked both in rule statements and in other matcher statements. Pay attention that a matcher label invoked in a matcher statement must be defined prior to this statement.
matcher B : /^mang.*$/;
matcher B2 : [B & category=="noun"];
rule "My rule": B2+;
With such a predefinition syntax, a matcher can be reused several times in different rules, without having to copy many times the same matcher code, hence improving the readability of the rule list.
The keyword as
can be optionally used in a matcher predefinition to rename its matcher label.
matcher A1 as A: [category=="adjective"];
matcher A2 as A: [category=="verb" & tense="past" & mood="participle"];
matcher N: [category=="noun"];
rule "My rule 1": A1 N;
rule "My rule 2": A2 N;
For both rules above, the sequence of matcher labels attached to each rule occurrence extracted by UIMA Tokens Regex's engine (see UIMA Tokens Regex AE) will be A N
, instead of A1 N
for the first rule and A2 N
for the second rule, because of the as
keyword.
The keyword as
has no effect in rule declaration nor in the way the UIMA Tokens Regex engine extracts rule occurrences, it only renames labels for later reuse by other AEs in the UIMA AE process flow.
Any matcher can be associated with one the the following regular expression quantifiers.
Quantifier | Quantity of successive matching annotations |
---|---|
? |
zero or one |
* |
any number (zero or n) |
+ |
at least one (one or n) |
{n} |
exactly n |
{m,n} |
from m to n (m ≤ n) |
Example:
matcher A: [category=="adjective"];
matcher N: [category=="noun"];
matcher N: [category=="preposition"];
# Matches NA, NRA, NRRA, NRRRA, etc.
rule "My rule 1": N R* A;
# Matches NPN and NPDN
rule "My rule 2": N P D? N;
# Matches NA, NAA, NAAA, etc.
rule "My rule 3": N A+;
The quantifiers can be associated to any type of matcher : feature structure matchers, regular expression matchers, and matcher labels :
matcher N: [category=="noun"];
matcher N: [category=="adjective"];
# A quantifier associated to a feature structure matcher
rule "My rule 1": N [category="adverb"]* A;
# A quantifier associated to a regular expression matcher
rule "My rule 2": N P /^(de)|(des)|(du)$/? N;
# A quantifier associated to a matcher label (predefined matcher)
rule "My rule 3": N A+;
Since version 1.5, UIMA Tokens Regex allows to declare alternative sub-rule, as with usual regular expressions:
rule "My rule 4": N (P N | A) ;
Since version 1.5, UIMA Tokens Regex allows to quantify sub-rule groups as with any matcher:
rule "My rule 4": N (P N | A)+ ;
All matcher quantifiers available are also allowed on sub-rules.
The universal matcher []
matches any annotation. It is the equivalent of the dot .
in string regular expressions.
# Matches any subsequence of one of the forms: NPN, NPDN, NPAN, NPDAN, etc.
rule "complex term 1": N P []* N;
In the right part of rule declaration, any matcher label can be prefixed with the special character ~
to refrain the following label to appear in the rule occurrences extracted by UIMA Tokens Regex's engine.
Each time the engine finds a matching occurrence for a rule, it produces an instance of the class Occurrence (see UIMA Tokens Regex AE). The class Occurrence has a method named getPattern()
returning the sequence of matcher labels that have matched. The effect of the special character ~
is to remove some labels from the pattern.
This functionality is useful when some annotation are mandatory to specify a matching context but useless for further analysis.
The example illustrates this functionality in the case of terminology extraction. We need the rule My rule
to extract all subsequences having the pattern NPN
, but we need to appear after a determiner D
. As we don't want the D
to appear in the subsequence's pattern, we hide it from the pattern by prefixing it with ~
.
matcher N: [category=="noun"];
matcher P: [category=="preposition"];
matcher D: [category=="determiner"];
# The following rule matches DNPDN or DNPN
# but getPattern() will always return NPN
rule "My rule": ~D N P ~D? N;
The character ~
does not apply to feature structure matchers nor to regular expression matchers.
Any sequence of characters starting with #
will be considered as a comment and will be ignored by the UIMA Tokens Regex parser until the end of the line.
# This is ignored
rule "My rule 1": [category=="noun"]
rule "My rule 2": [category=="adjective"];# This is ignored
The example below is a complete example of a UIMA Tokens Regex rule list that is used in the context of complex term detection over english word annotations (WordAnnotation
) in TermSuite.
import fr.univnantes.termsuite.types.TermSuiteTypeSystem;
use fr.univnantes.termsuite.types.WordAnnotation;
set inline = false;
matcher Prep: [category=="adposition" & subCategory=="preposition"];
matcher P1: [lemma=="of" | lemma=="with"]; # on? between?
matcher P as P: [Prep & P1];
matcher V: [category=="verb"];
matcher A: [category=="adjective" & lemma!="same" & lemma!="many" & lemma!="other" & lemma!="much" & lemma!="several" & lemma!="new"];
matcher Vpp as A: [V & tag=="VBN"]; # past participle
matcher Ving as A: [V & tag=="VBG"]; # gerund or present participle
matcher A2 as A: [A | Vpp | Ving];
matcher N: [category=="noun"];
matcher N1 as N: [category=="noun" & lemma!="number"];
matcher D: [category=="determiner" | category=="article"];
rule "n": N;
rule "a": A2;
rule "r": R;
# from ruta
rule "an": A2 N;
rule "nn": N N;
rule "npn": N1 P ~D? N ; # modified
# ...
See the complete regex file here.
The UIMA Tokens Regex AE takes its resource file (c.f. The UIMA Tokens Regex resource file) as input, scans all annotations sequentially (see Sequential scanning of annotations for more details) and notifies a Java
handler method, which must be implemented (See Definition of the Java handler method), every time a rule matches.
The UIMA Tokens Regex AE parses the rule list and creates for each rule exactly one finite state automaton, where each annotation matcher plays the role of a transition label (There are as many annotation matchers as there are transitions). The engine iterates over the sequence of all annotations of the type declared by the keyword set
(c.f. set) taken in their natural UIMA order (begin
ascending, end
descending). The engine propagates the current annotation to all automaton, i.e. to n automata, where n is the number of rules declared.
Each time an automaton fails, i.e. an automaton cannot propagate the current annotation, the AE behaves as follows:
- if the automaton was in an acceptable state before failing, then a new occurrence of the rule is created and the handler method
ruleMatched()
is invoked. All other automaton instances of the same automaton are dropped, - if the automaton was in no acceptable state before the annotation, the AE drops the current automaton instance and tries to create a new one with the current annotation.
We can deduce two important consequences of this behaviour:
- the AE will never produce two overlapping occurrences of a single rule,
- occurrences of different rules are produced independently from each other can overlap.
Example:
Let's consider the two following rules :
matcher N : [myFeat=="N"];
matcher P : [myFeat=="P"];
matcher A : [myFeat=="A"];
rule "Rule1": N P N;
rule "Rule2": N P A N;
And the following sequence of annotations (actually figure below represents only the values of the annotations for the feature myFeat
):
NPNPANPN
The occurrences recognized by the AE will be :
--- | --- |
---|---|
Occurrences for Rule1 |
[0,3], [5,8], [7,10] |
Occurrences extracted by the AE for Rule1 |
[0,3], [5,8] |
Occurrences for Rule2 |
[2,6] |
Occurrences extracted by the AE for Rule2 |
[2,6] |
Final occurrence set | Rule1 : [0,3], [5,8] -- Rule2 : [2,6] |
In this example, the occurrence [5,8] of Rule1
overlaps with the previous occurrence of the same rule [0,3]. It is then not recognized by the AE.
As explained in Section Sequential scanning of annotations, every time an occurrence is recognized, the method ruleMatched()
is invoked. All the developer has to do is to extends the class TokenRegexAE
and to override its methods (including ruleMatched()
) so as to implement the desired behaviour of the AE.
Here is the list of the abstract methods of the class TokenRegexAE
that the user can override :
Method | Description |
---|---|
ruleMatched(JCas, Occurrence) |
Invoked whenever a rule has matched the sequence of annotations. It is given the UIMA CAS object and an instance of the class Occurrence with all the details about the matching rule and the matched annotation. (See Occurrence ) |
beforeRuleProcessing(JCas) |
Invoked before the AE starts processing the sequence of annotations. This is where the developer can implement pre-processing instructions. |
afterRuleProcessing(JCas) |
Invoked after the AE ends processing the sequence of annotations. This is where the developer can implement post-processing instructions. |
Whenever a rule matches in the sequence of annotations, the AE invokes the method ruleMatched()
and passes to it an instance of the class Occurrence
. Here is the list of public methods of this class that the developer can invoke so as to access all matching information available.
Method | Description |
---|---|
int getBegin() |
returns the begin offset of the matching occurrence in the document string. |
int getEnd() |
returns the end offset of the matching occurrence in the SOFA string. |
Rule getRule() |
returns the rule that this occurrence has matched, as a Rule object (see Rule). |
List getAllMatchingAnnotations() |
returns the list of all annotations that make the occurrence of the matching rule. The list includes ignored annotations, i.e. annotations being prefixed with ~ . |
List getLabelledAnnotations() |
returns the list of annotations that make the occurrence of the matching rule (without ignored annotations) |
List getLabels() |
returns the list of the labels of the annotations that make the matching occurrence. |
String asPatternString() |
returns the string representation of the labels returned by getLabels() , i.e. the labels joined with white spaces. |
int size() |
returns the number of annotations that make the occurrence of the matching rule, i.e. the size of the list returned by the method getLabelledAnnotations() . |
Each occurrence object has a reference to its matching rule. This rule is an instance of the class Rule
and has the following public methods.
Method | Description |
---|---|
Method | Description |
String getName() |
Returns the rule name as declared by in the left part of the rule statement. |
In the example implementation below, we create on each occurrence recognized an annotation of type OccAnno
and we set its begin
, end
, and pattern
features from the Occurrence
object and and its associated rule's name and id.
public class RecogEngine extends TokenRegexAE {
@Override
protected void ruleMatched(JCas jCas, RegexOccurrence occurrence) {
OccAnno a = new OccAnno(jCas);
a.setBegin(occurrence.getBegin());
a.setEnd(occurrence.getEnd());
a.setRule(occurrence.getRule().getName());
a.setPattern(occurrence.getLabelledAnnotations().stream().map(LabelledAnnotation::getLabel).collect(Collectors.joining(" ")));
a.addToIndexes();
}
}