Documentation

Table of Contents

Interface
    Types
        TextScanner
        XMLScanner
        XMLPathSelectAutomaton
        XMLPathSelect
Character set encodings
    Predefined
    How to define your own
Iterators
    Input
    Output
How to define an XML path expression automaton
textwolf and XPath

Interface

Types

textwolf defines the following templates and types

TextScanner

The TextScanner class template defines an iterator on the characters of the input, which it exposes as:

Unicode characters
ASCII characters
XML control characters

It has two template arguments:

InputIterator (explained in the section iterator)
Charset = character set encoding of the input (explained in the section character set encodings)

Example:

 char* input = ...;
 textwolf::TextScanner<char*,charset::UCS2<LE> > itr( input);
 while (itr->chr()) itr++;

XMLScanner

The XMLScanner class template defines the state of a parser scanning XML elements such as tags, attributes, values and content on a source defined by an input iterator. It is constructed by passing:

InputIterator& src = an input iterator reference (template argument)
OutputBuffer outbuf = buffer to use for output
EntityMap* emap = pointer to a read only map of entities (template argument, optional)

Template parameters

InputIterator = input iterator type (explained in the section iterator)
InputCharSet = character set encoding of the input (explained in the section character set encodings)
OutputCharSet = character set encoding of the output (explained in the section character set encodings)
OutputBuffer = buffer type to use for the tokens parsed (STL back insertion sequence interface)
EntityMap = read only map from ASCII const char* to UChar that describes the mapping of named entities to unicode characters. (default is std::map)

Example:

 char* input = ...;
 std::string outputbuf;
 typedef textwolf::XMLScanner<char*,charset::IsoLatin1,charset::IsoLatin1,std::string> Scan;
 Scan xs( input, outputbuf);
 for (Scan::iterator itr = xs.begin(); itr != xs.end(); itr++)
 {
     switch (itr->type())
     {
           case Scan::ErrorOccurred: throw std::runtime_error( itr->content());
           case Scan::OpenTag: ... break;
           case Scan::Content: ... break;
           ...
     }
 }

End-of-tag events come without a tag name, so it is not possible to validate an XML document with textwolf.
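Because close-tag events carry no name, a caller that needs to know which element is being closed has to track that itself. The following is a minimal sketch of one way to do so, assuming the scanner events have already been reduced to a type and a content string. The `Event` struct, the `EventType` enum and the function `closedTagNames` are hypothetical names used for illustration only; they are not part of textwolf.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified event model (not the textwolf types).
enum EventType { OpenTag, CloseTag, Content };

struct Event
{
    EventType type;
    std::string content; // tag name for OpenTag, text for Content, empty for CloseTag
};

// Returns the names of the elements in the order in which they are closed,
// recovered by keeping a stack of the currently open tags.
std::vector<std::string> closedTagNames( const std::vector<Event>& events)
{
    std::vector<std::string> open;   // stack of open tag names
    std::vector<std::string> closed; // names in closing order
    for (std::vector<Event>::const_iterator ei = events.begin(); ei != events.end(); ++ei)
    {
        if (ei->type == OpenTag)
        {
            open.push_back( ei->content);
        }
        else if (ei->type == CloseTag && !open.empty())
        {
            closed.push_back( open.back());
            open.pop_back();
        }
    }
    return closed;
}
```

The stack only recovers the name that belongs to each close event; it cannot detect a mismatched closing tag, since that name never reaches the caller.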

XMLPathSelectAutomaton

This class template defines an automaton that selects XML path expressions and assigns an integer type to the filtered tokens. It has one template argument:

OutputCharSet = the character set in which the tokens are stored, and therefore the format in which they are processed.

The automaton construction is described in the section "how to define an XML path expression automaton".

XMLPathSelect

The XMLPathSelect class template defines the state of a set of XML path selections filtered on an input iterator. It is constructed by passing

textwolf::Automaton* atm = the pointer to an XML path selection automaton
InputIterator& src = an input iterator reference (template argument)
OutputBuffer outbuf = buffer to use for output
EntityMap* emap = pointer to a read only map of entities (template argument, optional)

The class template definition has the following parameters

InputIterator = input iterator type (explained in the section iterator)
InputCharSet = character set encoding of the input (explained in the section character set encodings)
OutputCharSet = character set encoding of the output (explained in the section character set encodings)
OutputBuffer = buffer type to use for the tokens parsed (STL back insertion sequence interface)
EntityMap = read only map from ASCII const char* to UChar that describes the mapping of named entities to unicode characters. (default is std::map)

XMLPathSelect looks very similar to XMLScanner, except that an automaton is passed and the iterator elements carry the types assigned to the expressions instead of the predefined XML element enum ids.

Character set encodings

Predefined

The following character set encodings are defined in the textwolf::charset namespace as examples:

textwolf::charset::UTF8 = UTF-8
textwolf::charset::IsoLatin1 = ISO Latin-1
textwolf::charset::UCS2<LE> = UCS2 Little Endian
textwolf::charset::UCS2<BE> = UCS2 Big Endian
textwolf::charset::UCS4<LE> = UCS4 Little Endian
textwolf::charset::UCS4<BE> = UCS4 Big Endian

How to define your own

You can define your own character set encodings. The following structure is passed to a textwolf iterator as a character set encoding definition. textwolf uses a single Unicode character type (UChar) as the common representation, so that characters of different encodings can be mapped to each other.

 struct MyCharSet
 {
     static unsigned int asize();
     static unsigned int size( const char*);
     static char achar( const char* buf);
     static UChar value( const char* buf);
     static unsigned int print( UChar chr, char* buf, unsigned int bufsize);
 };

asize = returns the number of bytes that have to be read to identify an ASCII character (XML control character). For fixed-length character formats it equals 'size'. For variable-length formats such as UTF-8 it does not.
size = returns the number of bytes that have to be read for the whole character
achar = returns the ASCII character, 0xFF for any other character, or 0 for EOF
value = returns the decoded Unicode character
print = prints the Unicode character 'chr' to 'buf', which has a capacity of 'bufsize' bytes, and returns the length of the printed character in bytes
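To illustrate how the five functions fit together for a variable-length format, here is a minimal sketch of a UTF-8 style definition. It is not the implementation shipped with textwolf: the name `Utf8Sketch` is hypothetical, `UChar` is declared locally as a stand-in for the textwolf type, returning 0 from `print` when the buffer is too small is an assumption, and malformed byte sequences are not validated.

```cpp
#include <cassert>

typedef unsigned int UChar; // assumption: local stand-in for textwolf's UChar

struct Utf8Sketch
{
    // One byte is enough to decide whether the character is ASCII.
    static unsigned int asize() { return 1; }

    // Total length of the character in bytes, derived from the lead byte.
    static unsigned int size( const char* buf)
    {
        unsigned char lead = static_cast<unsigned char>( buf[0]);
        if (lead < 0x80) return 1;
        if ((lead & 0xE0) == 0xC0) return 2;
        if ((lead & 0xF0) == 0xE0) return 3;
        if ((lead & 0xF8) == 0xF0) return 4;
        return 1; // invalid lead byte: treated as a single byte in this sketch
    }

    // The ASCII character, 0xFF for a non-ASCII character, 0 for EOF.
    static char achar( const char* buf)
    {
        unsigned char lead = static_cast<unsigned char>( buf[0]);
        return (lead < 0x80) ? static_cast<char>( lead) : static_cast<char>( 0xFF);
    }

    // Decode the character into a Unicode code point.
    static UChar value( const char* buf)
    {
        unsigned int len = size( buf);
        unsigned char lead = static_cast<unsigned char>( buf[0]);
        if (len == 1) return lead;
        UChar res = lead & (0xFF >> (len + 1)); // payload bits of the lead byte
        for (unsigned int ii = 1; ii < len; ++ii)
        {
            res = (res << 6) | (static_cast<unsigned char>( buf[ii]) & 0x3F);
        }
        return res;
    }

    // Encode 'chr' into 'buf'; returns the number of bytes written.
    static unsigned int print( UChar chr, char* buf, unsigned int bufsize)
    {
        unsigned int len = (chr < 0x80) ? 1 : (chr < 0x800) ? 2 : (chr < 0x10000) ? 3 : 4;
        if (len > bufsize) return 0; // assumption: 0 signals "does not fit"
        if (len == 1) { buf[0] = static_cast<char>( chr); return 1; }
        static const unsigned char leadbits[5] = {0, 0, 0xC0, 0xE0, 0xF0};
        for (unsigned int ii = len - 1; ii > 0; --ii)
        {
            buf[ii] = static_cast<char>( 0x80 | (chr & 0x3F));
            chr >>= 6;
        }
        buf[0] = static_cast<char>( leadbits[len] | chr);
        return len;
    }
};
```

Here `asize()` is always 1 while `size()` ranges from 1 to 4, which is the difference between the two functions described above for variable-length formats.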

Iterators

Input

The textwolf scanners, parsers and filters expect an input iterator on a sequence of bytes with the following properties:

It implements pre increment:

 iterator& operator++()

It implements single-byte access that returns a sequence of 0 (zero) bytes after reaching the end of data and throws an exception when it reaches the end of a message:

 char operator*() const

The exception thrown by the input iterator is caught by the caller of textwolf. textwolf only guarantees that its state is preserved, so that it can be called again and continue once more data is available.

To use it with a pair of STL iterators (begin,end), you have to define and pass a structure like this:

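 template <typename iterator>
 struct twiterator
 {
    struct EndOfMessage {};
 
    twiterator( const iterator& begin, const iterator& end, bool eof)
        :m_itr(begin),m_end(end),m_eof(eof) {}
 
    // Return the current byte, 0 at end of data, or throw at end of message
    char operator*() const {
        if (m_itr >= m_end) {
            if (m_eof) {
                return 0;
            } else {
                throw EndOfMessage();
            }
        } else {
            return *m_itr;
        }
    }
    twiterator& operator++()
    {
        ++m_itr; return *this;
    }
 
 private:
    iterator m_itr;   // current position
    iterator m_end;   // end of the current chunk
    bool m_eof;       // true if no more data follows this chunk
 };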

What this structure looks like depends very much on the environment in which you use textwolf, so its definition is left to the user. It is not part of textwolf.
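The catch-and-resume protocol on the caller side can be sketched without textwolf. In the following example all names (`ChunkIterator`, `ByteCounter`, `feed`) are hypothetical. The `ByteCounter` merely counts bytes and stands in for a textwolf scanner that keeps its state between calls; a real scanner would be used in its place.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

struct EndOfMessage {};

// Input iterator over one chunk of data, following the rules above.
class ChunkIterator
{
public:
    ChunkIterator( const std::string& chunk, bool eof)
        :m_itr(chunk.begin()),m_end(chunk.end()),m_eof(eof) {}

    char operator*() const
    {
        if (m_itr == m_end)
        {
            if (m_eof) return 0;  // end of data: deliver zeros
            throw EndOfMessage(); // end of message: more data is expected
        }
        return *m_itr;
    }
    ChunkIterator& operator++()
    {
        if (m_itr != m_end) ++m_itr;
        return *this;
    }

private:
    std::string::const_iterator m_itr;
    std::string::const_iterator m_end;
    bool m_eof;
};

// Stand-in for a scanner: counts bytes until the terminating 0 and keeps
// its state (the counter) across calls.
struct ByteCounter
{
    std::size_t count;
    bool done;
    ByteCounter() :count(0),done(false) {}

    void run( ChunkIterator& src)
    {
        while (*src != 0) // may throw EndOfMessage
        {
            ++count;
            ++src;
        }
        done = true;
    }
};

// Caller side: feed one chunk, catch the exception, report whether finished.
bool feed( ByteCounter& consumer, const std::string& chunk, bool lastChunk)
{
    ChunkIterator src( chunk, lastChunk);
    try
    {
        consumer.run( src);
    }
    catch (const EndOfMessage&)
    {
        // Chunk exhausted; the consumer kept its state and can be called again.
    }
    return consumer.done;
}
```

Calling `feed` once per received chunk, with `lastChunk` set only for the final one, lets the consumer continue exactly where the previous chunk ended.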

Output

The following iterators refer to InputIterator as the required input iterator type (see input):

Iterator on the characters of the input

 InputIterator in = input.begin();
 textwolf::TextScanner<InputIterator,textwolf::charset::UTF8> itr( in);
 for (; itr.control() != textwolf::EndOfText; itr++)
 {
     UChar chr = *itr; // parse the unicode character
 }

Iterator on the XML elements (tags, attributes, values, etc.) of the input

 typedef XMLScanner<InputIterator,charset::IsoLatin1,charset::IsoLatin1,std::string> MyXMLScanner;
 std::string outputbuf;
 InputIterator in = input.begin();
 MyXMLScanner xs( in, outputbuf);

 MyXMLScanner::iterator itr;
 for (itr=xs.begin(); itr != xs.end(); itr++)
 {
     const char* typestr = 0;
     switch (itr->type())
     {
           case MyXMLScanner::ErrorOccurred: throw std::runtime_error( itr->content());
           case MyXMLScanner::HeaderAttribName: typestr = "attribute name"; break;
           case MyXMLScanner::HeaderAttribValue: typestr = "attribute value"; break;
           case MyXMLScanner::HeaderEnd: typestr = "end of header"; break;
           case MyXMLScanner::TagAttribName: typestr = "attribute name"; break;
           case MyXMLScanner::TagAttribValue: typestr = "attribute value"; break;
           case MyXMLScanner::OpenTag: typestr = "open tag"; break;
           case MyXMLScanner::CloseTag: typestr = "close tag"; break;
           case MyXMLScanner::CloseTagIm: typestr = "close tag"; break;
           case MyXMLScanner::Content: typestr = "content"; break;
           case MyXMLScanner::Exit: typestr = "end of document"; break;
     }
     std::cout << "Element (" << itr->name() << ") " << typestr << ": " << itr->content() << std::endl;
 }

Iterator on the XML path expressions filtered in the input

 // define the XML Path selection by the automaton over the source iterator
 typedef XMLPathSelect<char*,charset::UTF8,charset::UTF8,std::string> MyXMLPathSelect; 
 std::string outputbuf;
 MyXMLPathSelect xs( &atm, src, outputbuf);
 
 // iterating through the produced elements and printing them
 MyXMLPathSelect::iterator itr=xs.begin(),end=xs.end();
 for (; itr!=end; itr++)
 {
    std::cout << "Element " << itr->type() << ": " << itr->content() << std::endl;
 }

How to define an XML path expression automaton

An XML path expression automaton is defined as a tree. For every expression, we first select the root node and then describe each following node on the expression path. Operators are defined on the nodes to declare the type of the following node. For a node 'A':

select the tag with the name "doc" following node A

 A["doc"]

or

 A.selectTag("doc")

select the value of the attribute "id"

 A("id")

or

 A.selectAttribute("id")

seek the attribute "id" (also defined as function ifAttribute)

 A("id",0)

or

 A.ifAttribute("id",0)

seek the attribute "id" with value "188" (same ifAttribute)

 A("id","188")

or

 A.ifAttribute("id","188")

select all content values of the node selected by A (also as function selectContent)

 A()

or

 A.selectContent()

stop selecting elements with index beyond 24 counted from 0

 A.TO(25)

ignore elements with an index smaller than 2 (counted from 0)

 A.FROM(2)

assign the value 3 as type to all values selected by the expression A (also defined as function assignType)

 A = 3

or

 A.assignType(3)

select all nodes below the current node, so that all conditions expressed in this context apply transitively to all successors of A (also as function doFollow)

 A--

or

 A.doFollow()

The operator '--' has a special role. It corresponds to the operator '//' in the abbreviated syntax of XPath expressions. It says that the following selection also applies to all successors of the current node.

The root node is selected with the operator '*' on the automaton. The following code selects all 'alt/prd' tag content elements in the document and assigns them the type 1:

 typedef XMLPathSelectAutomaton<charset::UTF8> Automaton;
 Automaton atm;
 (*atm)--["alt"]["prd"]() = 1;

If no content or attribute value is selected then textwolf just triggers an event when the tag or attribute appears:

Get an event '111' for every new document:

 (*atm) = 111;

Get an event '234' for every 'txt' tag:

 (*atm)["txt"] = 234;

Get an event '761' for every attribute 'id' of a 'pers' tag:

 (*atm)["pers"]("id") = 761;

textwolf and XPath

textwolf does not have the power of XPath and does not aim to. It does not buffer more than the currently processed token, so it cannot detect patterns that require buffering. It cannot even cope with the fact that tag attributes in XML have no order. For expressions that cannot be expressed in this model, you have to build the logic around textwolf. textwolf is not XPath, but with some additional effort you get an engine that is able to process at least the 'abbreviated syntax of XPath' without parent references and content conditions. For example

 A//ter[@id='5' and @name='kaspar']

has to be translated to

 A--["ter"]("id","5")("name","kaspar")
 A--["ter"]("name","kaspar")("id","5")

and

 A//ter[@id='5' or @name='kaspar']

to

 A--["ter"]("id","5")
 A--["ter"]("name","kaspar")

Some cases are even worse. If you select attribute values subject to attribute conditions, you can solve this only in filter functions applied to the iterator output after calling textwolf. Selections have to be at the end of an expression, because tokens are not buffered. Therefore something like

 A//ter[@id='se1']/@name

cannot be expressed in textwolf for the case where 'name' appears before 'id' in the XML. The expression

 A--["ter"]("id","se1")("name",0)

works only for the case where 'id' appears before 'name'. A possible solution is to define

 A--["ter"]("id","se1") = 201;
 A--["ter"]("name",0) = 202;

and, in your own code, to store the value on the element when event 202 arrives and to set a flag when event 201 arrives. The element is accepted only when both events have been seen.
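The combination of the two events can be sketched independently of textwolf. The sketch below rests on assumptions that go beyond the text: a third expression such as `A--["ter"] = 200` is defined so that an event marks the start of every 'ter' tag, event 202 delivers the value of the 'name' attribute as its content, and the events arrive as a simple list. The `SelectEvent` struct and the function `selectNames` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical event as obtained by iterating over the selections:
// the assigned type and the token content.
struct SelectEvent
{
    int type;
    std::string content;
    SelectEvent( int t, const std::string& c) :type(t),content(c) {}
};

// Collects the 'name' values of all 'ter' elements that also matched id='se1',
// regardless of the order of the two attributes.
std::vector<std::string> selectNames( const std::vector<SelectEvent>& events)
{
    std::vector<std::string> result;
    std::string name;       // value stored on event 202
    bool hasName = false;
    bool idMatches = false; // flag set on event 201

    for (std::size_t ii = 0; ii <= events.size(); ++ii)
    {
        // A new tag (200) or the end of the input closes the previous element.
        bool boundary = (ii == events.size()) || events[ii].type == 200;
        if (boundary)
        {
            if (hasName && idMatches) result.push_back( name);
            hasName = false;
            idMatches = false;
            continue;
        }
        if (events[ii].type == 201)
        {
            idMatches = true;
        }
        else if (events[ii].type == 202)
        {
            name = events[ii].content;
            hasName = true;
        }
    }
    return result;
}
```

Because the decision is deferred until the next tag starts, it no longer matters whether 'id' or 'name' appears first in the XML.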
