Documentation
textwolf defines the following templates and types
The TextScanner class template defines an iterator on the characters of the input as
- unicode characters
- ascii characters
- XML control characters
It has the following template arguments:
- InputIterator = input iterator type (explained in the section iterator)
- Charset = character set encoding of the input (explained in the section character set encodings)
char* input = ...;
textwolf::TextScanner<char*,charset::UCS2<LE> > itr( input);
while (itr->chr()) itr++;
The XMLScanner class template defines the state of a parser that scans XML elements (tags, attributes, values, content, etc.) on a source defined by an input iterator. It is constructed by passing
- InputIterator& src = an input iterator reference (template argument)
- OutputBuffer outbuf = buffer to use for output
- EntityMap* emap = pointer to a read only map of entities (template argument, optional)
It has the following template arguments:
- InputIterator = input iterator type (explained in the section iterator)
- InputCharSet = character set encoding of the input (explained in the section character set encodings)
- OutputCharSet = character set encoding of the output (explained in the section character set encodings)
- OutputBuffer = buffer type to use for the tokens parsed (STL back insertion sequence interface)
- EntityMap = read only map from ASCII const char* to UChar that describes the mapping of named entities to unicode characters. (default is std::map)
char* input = ...;
std::string outputbuf;
typedef textwolf::XMLScanner<char*,charset::IsoLatin1,charset::IsoLatin1,std::string> Scan;
Scan xs( input, outputbuf);
for (Scan::iterator itr = xs.begin(); itr != xs.end(); itr++)
{
    switch (itr->type())
    {
        // the event types are members of the scanner type 'Scan'
        case Scan::ErrorOccurred: throw std::runtime_error( itr->content());
        case Scan::OpenTag: ... break;
        case Scan::Content: ... break;
        ...
    }
}
Close tag events come without the tag name, so it is not possible to validate an XML document with 'textwolf' alone.
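Because close tag events carry no name, a caller that needs the name has to track it. The following is a minimal sketch of that bookkeeping with a stack. It does not use textwolf itself: `TagEvent`, `TagEventType` and `closedTagNames` are hypothetical stand-ins for the scanner events, and it assumes that an immediately closed tag is reported as an open tag event followed by a close event.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical mirror of the scanner event types used in this sketch.
enum TagEventType { OpenTag, CloseTag, CloseTagIm, Content };

// Hypothetical event record: the type plus the token content.
struct TagEvent
{
    TagEventType type;
    std::string content;
    TagEvent( TagEventType t, const std::string& c) : type(t), content(c) {}
};

// Returns, for every close event in order, the name of the tag it closes.
// The name is recovered from a stack of the currently open tags.
std::vector<std::string> closedTagNames( const std::vector<TagEvent>& events)
{
    std::vector<std::string> open;
    std::vector<std::string> closed;
    for (std::size_t i = 0; i < events.size(); ++i)
    {
        const TagEvent& ev = events[i];
        if (ev.type == OpenTag)
        {
            open.push_back( ev.content);
        }
        else if ((ev.type == CloseTag || ev.type == CloseTagIm) && !open.empty())
        {
            closed.push_back( open.back());
            open.pop_back();
        }
    }
    return closed;
}
```

In a real loop the same stack would be updated inside the switch on the event type while iterating over the scanner.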
The XMLPathSelectAutomaton class template defines an automaton for selecting XML path expressions and for assigning a type, as an integer, to the filtered tokens. It has one template argument:
- OutputCharSet = the character set in which the tokens are stored and therefore in which format they are processed.
The XMLPathSelect class template defines the state of a set of XML path selections filtered on an input iterator. It is constructed by passing
- textwolf::Automaton* atm = the pointer to an XML path selection automaton
- InputIterator& src = an input iterator reference (template argument)
- OutputBuffer outbuf = buffer to use for output
- EntityMap* emap = pointer to a read only map of entities (template argument, optional)
It has the following template arguments:
- InputIterator = input iterator type (explained in the section iterator)
- InputCharSet = character set encoding of the input (explained in the section character set encodings)
- OutputCharSet = character set encoding of the output (explained in the section character set encodings)
- OutputBuffer = buffer type to use for the tokens parsed (STL back insertion sequence interface)
- EntityMap = read only map from ASCII const char* to UChar that describes the mapping of named entities to unicode characters. (default is std::map)
The following character set encodings are defined in the textwolf::charset namespace as examples:
- textwolf::charset::UTF8 = utf-8
- textwolf::charset::IsoLatin1 = Iso-Latin-1
- textwolf::charset::UCS2<LE> = UCS2 Little Endian
- textwolf::charset::UCS2<BE> = UCS2 Big Endian
- textwolf::charset::UCS4<LE> = UCS4 Little Endian
- textwolf::charset::UCS4<BE> = UCS4 Big Endian
You can define your own character set encodings. A structure like the following is passed to a textwolf iterator as the character set encoding definition. Textwolf uses a single Unicode character type (UChar) as the common representation, so that the encodings can be mapped to each other.
struct MyCharSet
{
    static unsigned int asize();
    static unsigned int size( const char*);
    static char achar( const char* buf);
    static UChar value( const char* buf);
    static unsigned int print( UChar chr, char* buf, unsigned int bufsize);
};
- asize = return number of bytes that have to be read to identify an ascii character (XML control character). For fixed length character formats it equals 'size'. For variable length formats like UTF-8 it does not.
- size = return number of bytes that have to be read for the whole character
- achar = return the ascii character or 0xFF for any other character or 0 for EOF
- value = return the decoded unicode character
- print = print the unicode character 'chr' to 'buf' with byte length 'bufsize' and return the length of the character printed in bytes.
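To make the five functions concrete, here is a minimal sketch of a variable length encoding written against the interface above. It is an illustration, not the implementation shipped with textwolf: the name `SketchUtf8`, the `UChar` typedef, and the choices to return 0 from `print` when the buffer is too small and to treat an invalid lead byte as a single byte are all assumptions.

```cpp
// Assumption: UChar is an unsigned integer wide enough for any code point.
typedef unsigned int UChar;

// Hypothetical UTF-8 encoding definition following the MyCharSet interface.
struct SketchUtf8
{
    // One byte is enough to decide whether a character is ascii.
    static unsigned int asize() { return 1; }

    // Byte length of the whole character, derived from the lead byte.
    static unsigned int size( const char* buf)
    {
        unsigned char b = (unsigned char)buf[0];
        if (b < 0x80) return 1;
        if ((b & 0xE0) == 0xC0) return 2;
        if ((b & 0xF0) == 0xE0) return 3;
        if ((b & 0xF8) == 0xF0) return 4;
        return 1; // assumption: an invalid lead byte counts as one byte
    }

    // The ascii character, 0xFF for any other character, 0 for EOF.
    static char achar( const char* buf)
    {
        unsigned char b = (unsigned char)buf[0];
        return (b < 0x80) ? (char)b : (char)0xFF;
    }

    // Decode the character at 'buf' into a unicode code point.
    static UChar value( const char* buf)
    {
        unsigned int n = size( buf);
        unsigned char b = (unsigned char)buf[0];
        if (n == 1) return b;
        UChar res = b & (0xFF >> (n + 1)); // payload bits of the lead byte
        for (unsigned int i = 1; i < n; ++i)
        {
            res = (res << 6) | ((unsigned char)buf[i] & 0x3F);
        }
        return res;
    }

    // Encode 'chr' into 'buf'; returns the number of bytes written,
    // or 0 if 'bufsize' is too small (assumption of this sketch).
    static unsigned int print( UChar chr, char* buf, unsigned int bufsize)
    {
        unsigned int n = (chr < 0x80) ? 1 : (chr < 0x800) ? 2 : (chr < 0x10000) ? 3 : 4;
        if (bufsize < n) return 0;
        if (n == 1) { buf[0] = (char)chr; return 1; }
        static const unsigned char lead[] = { 0, 0, 0xC0, 0xE0, 0xF0 };
        for (unsigned int i = n - 1; i > 0; --i)
        {
            buf[i] = (char)(0x80 | (chr & 0x3F)); // continuation byte
            chr >>= 6;
        }
        buf[0] = (char)(lead[n] | chr);
        return n;
    }
};
```

Here 'asize' is always 1 while 'size' varies between 1 and 4, which is the difference between the two functions for variable length formats.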
The textwolf scanners, parsers and filters expect an input iterator on a sequence of bytes with the following properties:
- It implements pre increment:
iterator& operator++()
- It implements single byte access that keeps returning 0 (zero) after reaching the end of data and that throws an exception if it reaches the end of a message while more data may still follow:
char operator*() const
The exception thrown by the input iterator is caught by the caller of textwolf. Textwolf only ensures that its state is preserved, so that it can be called again and continue once more data is available.
- To use it with a pair of STL iterators (begin,end), you have to define and pass a structure like this:
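struct EndOfMessage {};
struct twiterator
{
    // assumption: any random access iterator over bytes (needed for '>=')
    typedef std::string::const_iterator iterator;
    twiterator( const iterator& begin, const iterator& end, bool eof)
        :m_itr(begin),m_end(end),m_eof(eof) {}
    char operator*() const
    {
        if (m_itr >= m_end)
        {
            // end of data returns 0, end of message throws
            if (m_eof) return 0;
            throw EndOfMessage();
        }
        return *m_itr;
    }
    twiterator& operator++() { ++m_itr; return *this; }
private:
    iterator m_itr;
    iterator m_end;
    bool m_eof;
};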
What this structure looks like depends very much on the environment in which you use textwolf, so its definition is left to the user. It is not part of textwolf.
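The interplay between the end of message exception and a later call can be sketched without textwolf. In the following illustration `ChunkIterator`, `consume` and `readAll` are hypothetical names: `consume` stands in for a textwolf scanner that reads bytes until it sees 0, and `readAll` plays the role of the caller that catches the exception and supplies the next chunk.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct EndOfMessage {};

// Hypothetical byte source over one chunk of a larger input.
class ChunkIterator
{
public:
    ChunkIterator() : m_pos(0), m_eof(false) {}

    // Replace the current chunk; 'eof' marks the last chunk of the input.
    void feed( const std::string& chunk, bool eof)
    {
        m_chunk = chunk;
        m_pos = 0;
        m_eof = eof;
    }

    // Returns 0 at end of data, throws at end of message.
    char operator*() const
    {
        if (m_pos >= m_chunk.size())
        {
            if (m_eof) return 0;
            throw EndOfMessage();
        }
        return m_chunk[m_pos];
    }

    ChunkIterator& operator++() { ++m_pos; return *this; }

private:
    std::string m_chunk;
    std::size_t m_pos;
    bool m_eof;
};

// Stand-in for a scanner: consumes bytes until 0. Its progress is kept
// in 'out', so it can be called again after an EndOfMessage exception.
void consume( ChunkIterator& it, std::string& out)
{
    for (char c = *it; c != 0; c = *it)
    {
        out.push_back( c);
        ++it;
    }
}

// The caller: feeds chunk after chunk and resumes after each exception.
std::string readAll( const std::vector<std::string>& chunks)
{
    ChunkIterator it;
    std::string out;
    for (std::size_t i = 0; i < chunks.size(); ++i)
    {
        it.feed( chunks[i], i + 1 == chunks.size());
        try
        {
            consume( it, out);
            return out; // end of data reached
        }
        catch (const EndOfMessage&)
        {
            // more data expected: continue with the next chunk
        }
    }
    return out;
}
```

Calling `readAll` with the chunks "<a>", "x</" and "a>" yields the complete text "<a>x</a>", although the consumer was interrupted twice.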
The following iterators refer to InputIterator as the required input iterator type (see input):
- Iterator on the characters of the input
InputIterator in = input.begin();
textwolf::TextScanner<InputIterator,textwolf::charset::UTF8> itr( in);
for (; itr.control() != textwolf::EndOfText; itr++)
{
    UChar chr = *itr;
    // parse the unicode character
}
- Iterator on the XML element (tags,attributes,values,etc.) of the input
typedef XMLScanner<InputIterator,charset::IsoLatin1,charset::IsoLatin1,std::string> MyXMLScanner;
std::string outputbuf;
InputIterator in = input.begin();
MyXMLScanner xs( in, outputbuf);
MyXMLScanner::iterator itr;
for (itr=xs.begin(); itr != xs.end(); itr++)
{
    std::cout << "Element " << itr->name() << ": " << itr->content() << std::endl;
    const char* typestr = 0;
    switch (itr->type())
    {
        case MyXMLScanner::ErrorOccurred: throw std::runtime_error( itr->content());
        case MyXMLScanner::HeaderAttribName: typestr = "attribute name"; break;
        case MyXMLScanner::HeaderAttribValue: typestr = "attribute value"; break;
        case MyXMLScanner::HeaderEnd: typestr = "end of header"; break;
        case MyXMLScanner::TagAttribName: typestr = "attribute name"; break;
        case MyXMLScanner::TagAttribValue: typestr = "attribute value"; break;
        case MyXMLScanner::OpenTag: typestr = "open tag"; break;
        case MyXMLScanner::CloseTag: typestr = "close tag"; break;
        case MyXMLScanner::CloseTagIm: typestr = "close tag"; break;
        case MyXMLScanner::Content: typestr = "content"; break;
        case MyXMLScanner::Exit: typestr = "end of document"; break;
    }
    std::cout << "Element (" << itr->name() << ")" << typestr << ": " << itr->content() << std::endl;
}
- Iterator on the XML path expressions filtered in the input
// define the XML Path selection by the automaton over the source iterator
typedef XMLPathSelect<char*,charset::UTF8,charset::UTF8,std::string> MyXMLPathSelect;
std::string outputbuf;
MyXMLPathSelect xs( &atm, src, outputbuf);
// iterating through the produced elements and printing them
MyXMLPathSelect::iterator itr=xs.begin(),end=xs.end();
for (; itr!=end; itr++)
{
    std::cout << "Element " << itr->type() << ": " << itr->content() << std::endl;
}
An XML path expression automaton is defined as a tree. Every expression starts at the root node and describes the follow nodes along the expression path. Operators on a node declare the type of its follow node. For a node 'A':
- select the tag with the name "doc" following node A
A["doc"]
or
A.selectTag("doc")
- select the value of the attribute "id"
A("id")
or
A.selectAttribute("id")
- seek the attribute "id" (also defined as function ifAttribute)
A("id",0)
or
A.ifAttribute("id",0)
- seek the attribute "id" with value "188" (also defined as function ifAttribute)
A("id","188")
or
A.ifAttribute("id","188")
- select all content values of the node selected by A (also as function selectContent)
A()
or
A.selectContent()
- stop selecting elements with index beyond 24 counted from 0
A.TO(25)
- ignore elements with an index smaller than 2 (counted from 0)
A.FROM(2)
- assign the value 3 as type to all values selected by the expression A (also defined as function assignType)
A = 3
or
A.assignType(3)
- select all nodes below the current node, so that all conditions expressed in this context apply transitively to all successors of A (also as function doFollow)
A--
or
A.doFollow()
The operator '--' has a special role. It corresponds to the operator '//' in the abbreviated syntax of XPath expressions. It says that the following selection also applies to all successors of the current node.
The root node is selected with the operator '*' on the automaton. The following code selects all 'alt/prd' tag content elements in the document and assigns them the type 1:
typedef XMLPathSelectAutomaton<charset::UTF8> Automaton;
Automaton atm;
(*atm)--["alt"]["prd"]() = 1;
If no content or attribute value is selected then textwolf just triggers an event when the tag or attribute appears:
- Get an event '111' for every new document:
(*atm) = 111;
- Get an event '234' for every 'txt' tag:
(*atm)["txt"] = 234;
- Get an event '761' for every attribute 'id' of a 'pers' tag:
(*atm)["pers"]("id") = 761;
Textwolf does not have the power of XPath and it does not aim to. It buffers nothing beyond the currently processed token, so it cannot detect patterns that require buffering. It cannot even cope with the fact that tag attributes in XML have no order. For expressions that cannot be expressed in this model, you have to build the logic around textwolf. Textwolf is not XPath, but with some additional effort you get an engine that is able to process at least the 'abbreviated syntax of XPath' without parent references and content conditions. For example
A//ter[@id='5' and @name='kaspar']
has to be translated to
A--["ter"]("id","5")("name","kaspar")
A--["ter"]("name","kaspar")("id","5")
and
A//ter[@id='5' or @name='kaspar']
to
A--["ter"]("id","5")
A--["ter"]("name","kaspar")
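The 'and' translation above lists one expression per attribute order. For more than two conditions this would presumably mean one expression per permutation, which is an extrapolation from the two attribute case shown here. The following sketch only produces the expression text. It does not build an automaton, and `AttrCond` and `andExpressions` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical attribute condition: (attribute name, required value).
typedef std::pair<std::string, std::string> AttrCond;

// Returns the text of one expression per ordering of the conditions,
// each one starting with A--["tag"].
std::vector<std::string> andExpressions( const std::string& tag, std::vector<AttrCond> conds)
{
    // std::next_permutation enumerates all orderings from a sorted start
    std::sort( conds.begin(), conds.end());
    std::vector<std::string> result;
    do
    {
        std::string expr = "A--[\"" + tag + "\"]";
        for (std::size_t i = 0; i < conds.size(); ++i)
        {
            expr += "(\"" + conds[i].first + "\",\"" + conds[i].second + "\")";
        }
        result.push_back( expr);
    }
    while (std::next_permutation( conds.begin(), conds.end()));
    return result;
}
```

The number of expressions grows factorially with the number of conditions, so this approach is only practical for a few attributes.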
Some cases are even worse. If you select an attribute value subject to a condition on another attribute, you can solve it only in the filter functions applied to the iterator after calling textwolf. Selections have to come last in an expression, because selected values are not buffered. Therefore something like
A//ter[@id='se1']/@name
cannot be expressed in textwolf for the case where 'name' appears before 'id' in the XML. The expression
A--["ter"]("id","se1")("name",0)
works only for the case where 'id' appears before 'name'. A possible solution is to define
A--["ter"]("id","se1") = 201;
A--["ter"]("name",0) = 202;
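and, in the code that iterates over the results, to store the value delivered with type 202 on the element and to set a flag when type 201 is seen. The element is created only when both the flag and the value are present.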
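The combination of the two event types can be sketched as a small filter over the produced elements. The sketch below works on a simulated event list instead of a textwolf iterator. `SelectEvent` and `selectNames` are hypothetical names, and the third type 200, fired for every 'ter' tag by an additional selection such as A--["ter"] = 200, is an assumption added here to reset the state between elements.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result element: assigned type and token content.
struct SelectEvent
{
    int type;
    std::string content;
    SelectEvent( int t, const std::string& c) : type(t), content(c) {}
};

// Returns the 'name' values of all elements whose 'id' condition matched,
// independent of the order of the two attributes in the XML.
std::vector<std::string> selectNames( const std::vector<SelectEvent>& events)
{
    std::vector<std::string> result;
    bool idMatched = false; // flag set by type 201
    bool haveName = false;  // value stored by type 202
    bool emitted = false;   // avoid emitting an element twice
    std::string name;
    for (std::size_t i = 0; i < events.size(); ++i)
    {
        switch (events[i].type)
        {
            case 200: // assumption: new 'ter' tag, reset the state
                idMatched = haveName = emitted = false;
                name.clear();
                break;
            case 201:
                idMatched = true;
                break;
            case 202:
                name = events[i].content;
                haveName = true;
                break;
            default:
                continue;
        }
        if (idMatched && haveName && !emitted)
        {
            result.push_back( name);
            emitted = true;
        }
    }
    return result;
}
```

This sketch does not handle nested 'ter' elements. Supporting them would need a stack of states instead of a single set of flags.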