-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add EN Readme for Pire #84
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the huge work you have done!
I pointed out some technical issues.
As for the translation, I did not understand several parts of the text well enough.
Could You please proofread it?
Or at least compare it to what translate.google.ru yields for the Russian readme.
@@ -120,8 +120,8 @@ LongestPrefix(), LongestSuffix(), ShortestPrefix() и ShortestSuffix(). Они | |||
Сканеры объединяются функцией Pire::Scanner::Glue(). В результате получается сканер, | |||
который за один проход проверяет строку на соответствие всем содержащимся в нём | |||
паттернам. Общее количество паттернов возвращает Scanner::RegexpsCount(). Вместо | |||
Final() имеет смысл вызвать AcceptedRegexps(const State&), возвращающая пару | |||
итераторов (с семантикой [begin,end)), указывающую на диапазон номеров регулярок, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was the correct closing parenthesis )
in [begin,end)
meaning the end
is not inclusive. Just like in math.
README.en
Outdated
is to construct it from the string with the regular expression using the Pire::Lexer class. | ||
|
||
Regular expressions have the syntax similar to POSIX. It supports standard features as | ||
a|b, a*, a+, ., [a-z], a{3}, a{3,5}, a{3,}) and character classes (\w (alphanumeric characters), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
\w
includes only alphabetical, not all alphanumerical characters. Likewise, \W
is a complementary set of all not alphabetical characters.
README.en
Outdated
The function Pire::Scanner::Glue() unites several machines. The resulting scanner | ||
checks whether the string matches all united patterns at a time. The Scanner::RegexpsCount() | ||
function returns common number of patterns. It is recommended to replace running Final() with | ||
AcceptedRegexps(const State&) that returns iterators with the [begin,end] semantics indicating the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be a half-open interval [begin, end)
meaning the end
is not inclusive.
README.en
Outdated
|
||
Pire::Scanner CompileRegexp(const char* pattern) | ||
{ | ||
// Переводим шаблон в UCS4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is some text left in Russian.
README.en
Outdated
* Finals(), IsFinal(), SetFinal(), ClearFinal(). | ||
|
||
In addition, a bit flag field will be associated with each state and each transition of the machine. | ||
Transition flags are called outputs (since they implement a classic spacecraft with an output). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please replace 'spacecraft' with 'FSM'.
I'm sorry for the long replay. Thank you for the feedback! I've fixed the technical issues that you've found in the translation. Please could you point me in translated parts that are not enough clear? |
\S (non-whitespace)). Operators a&b (intersection, see Fsm::operator &), ~a (inversion, see | ||
Fsm::Complement), non-greedy analogs of repetitions *?, +?, ?? (Fsm::SlowCapturingScanner). | ||
Operators in this scanner can be set to as short expressions as possible, in contrast to greedy | ||
analogs that set to as long expressions as possible. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The source paragraph
Кроме того, могут быть добавлены
операции a&b (пересечение, см. Fsm::operator &) и ~a (инверсия, см. Fsm::Complement).
Помимо этого, для Fsm::SlowCapturingScanner есть возможность использовать нежадные аналоги
повторений - операторы *?, +?, ??. В этом сканере данные операторы будут принимать
как можно более короткие выражения, в отличие от жадных аналогов (которые, наоборот, принимают
как можно более длинные.
is hard for google translate.
But after tweaking both the source and its suggested translation I have got
In addition, a&b (intersection, see Fsm::operator &) and ~a (inverse, see Fsm::Complement) can be used.
Also, Fsm :: SlowCapturingScanner has the ability to use non-greedy repetition operators - * ?, + ?, ??.
These operators will accept as short subtexts as possible, as opposed to greedy counterparts (which, on the contrary, accept as long as possible).
Our FSMs are acceptors. So the words "to accept" and "to match" can be used interchangeably. Whilst the word "to set" is not suitable here.
United machines can be united once again. However, it is not recommended to repeat union | ||
indefinitely because the number of states in the resulting scanner grows | ||
exponentially with each added pattern. The max size of the scanners can be specified in | ||
the maxSize parameter in the function Glue(). If the specified size is exceeded, an empty scanner | ||
will be returned (Size() == 0, Empty() == true). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the source paragraph
Склеенные сканеры в свою очередь можно склеить между собой. В то же время процесс склейки
нельзя повторять до бесконечности, потому как количество состояний в полученном автомате
растёт экспоненциально с каждым добавленным паттерном. Иногда имеет смысл ограничить
размер полученного автомата, указав ненулевой параметр maxSize в Glue(). В таком случае
при превышении указанного размера возвращается пустой автомат (Size() == 0, Empty() == true).
https://translate.google.ru suggests
The glued scanners, in turn, can be glued together. At the same time, the gluing process
cannot be repeated indefinitely, because the number of states in the resulting machine
grows exponentially with each added pattern. Sometimes it makes sense to limit
the size of the resulting automaton by specifying a nonzero maxSize parameter in Glue (). In this case
when the specified size is exceeded, an empty automaton is returned (Size () == 0, Empty () == true).
Which seams a bit clearer to me. First of all, it uses the words "to glue" and "the scanner" which are used for the same purpose in the code itself. This makes it easier to understand the code by reading the documentation.
I would suggest using "to glue" and "glued" instead of "to unite" and "united" in the previous paragraphs as well. Even if there seems to be a lot of tautologies this way, it is better for technical documentation.
|
||
Each scanner has the State type that stores the state, and methods Initialize(State&) | ||
to return the initial machine state, Next(State&, char) to take the machine to the next state, | ||
Final(const State&) to report whether the state is valid. To check whether the string matches |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is better to use the word 'final' instead of 'valid':
Final(const State&) to report whether the state is final (ie. the processed text is accepted by the machine).
Numbers start from 0 and, after union, numbers of expressions on the right are shifted | ||
by the expression numbers on the left. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Numbers start from 0 and, after the Pire::Scanner::Glue() function, numbers of patterns on the right are shifted by the pattern count on the left.
|
||
* i - case insensitive | ||
* u - compile the expression in UTF-8 | ||
* s - escape the expression .* on each side |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- s - surround the pattern by .* on each side
The following characters can be in the string with flags: | ||
|
||
* i - case insensitive | ||
* u - compile the expression in UTF-8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The terms "regular expression" and "pattern" can be used interchangeably, but I would not use just "expression" because it is a broader term.
In the inherited class, the following methods should be overridden: | ||
|
||
* wchar32 FromLocal(const char*& begin, const char* end) reads and returns input character and forwards begin. | ||
Returns an exceprion if the input includes an invalid sequence. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, replace "exceprion" with "exception".
And it is more common to say "to throw an exception" than "to return an exception".
* Default constructor: creates the scanner that allows an empty string | ||
* MakeFalse(): creates the scanner that does not allows any string | ||
* Size(): the scanner size (number of states) | ||
* Append(unsigned char c): adds to the machine the transition for the match with . |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Append(unsigned char c): adds to the machine the transition matching the given character.
* MakeFalse(): creates the scanner that does not allows any string | ||
* Size(): the scanner size (number of states) | ||
* Append(unsigned char c): adds to the machine the transition for the match with . | ||
* AppendStrings(const vector<string>&): adds to the machine the transition for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- AppendStrings(const vector&): adds transitions to the machine that match any of the given strings
cycle writing it as a single expression (see an example). | ||
|
||
Four more functions helpful in lexical analysis are LongestPrefix(), LongestSuffix(), ShortestPrefix(), | ||
and ShortestSuffix(). These functions returns the largest / the smallest prefix or suffix allowed by |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is better to say that the FSM "matches" or "accepts" the given text, rather than "allows". Please, check other uses of the word "to allow" in the same context.
* i - case insensitive | ||
* u - compile the expression in UTF-8 | ||
* s - escape the expression .* on each side | ||
* a - enable checking for the & and ~ operators in expressions (see above) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- a - enable support for the & and ~ operators in expressions (see above)
should be performed. | ||
|
||
The serialized scanner is not portable between architectures (even between x86 and x86_64). | ||
When trying to read or map the regular expression serialized in another architecture, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
serialized on another architecture
Returns an exceprion if the input includes an invalid sequence. | ||
* std::string ToLocal(wchar32) returns returns the byte representation of the character in this encoding. | ||
If the character in this encoding is not representable, returns an empty string. | ||
* AppendDot(Fsm&) adds a fragment to the end machine that allows any one character in this encoding. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- AppendDot(Fsm&) appends to the end of the machine a fragment that accepts any character in this encoding.
|
||
* wchar32 FromLocal(const char*& begin, const char* end) reads and returns input character and forwards begin. | ||
Returns an exceprion if the input includes an invalid sequence. | ||
* std::string ToLocal(wchar32) returns returns the byte representation of the character in this encoding. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
returns returns -> returns
cycle writing it as a single expression (see an example). | ||
|
||
Four more functions helpful in lexical analysis are LongestPrefix(), LongestSuffix(), ShortestPrefix(), | ||
and ShortestSuffix(). These functions returns the largest / the smallest prefix or suffix allowed by |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These functions returns -> These functions return
As an example, I suggested replacements for a couple of paragraphs. |
No description provided.