Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add EN Readme for Pire #84

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Conversation

AnastasiaTWW
Copy link

No description provided.

Copy link
Collaborator

@sergey-v-galtsev sergey-v-galtsev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the huge work you have done!

I pointed out some technical issues.

As for the translation, I did not understand several parts of the text well enough.
Could You please proofread it?
Or at least compare it to what translate.google.ru yields for the Russian readme.

@@ -120,8 +120,8 @@ LongestPrefix(), LongestSuffix(), ShortestPrefix() и ShortestSuffix(). Они
Сканеры объединяются функцией Pire::Scanner::Glue(). В результате получается сканер,
который за один проход проверяет строку на соответствие всем содержащимся в нём
паттернам. Общее количество паттернов возвращает Scanner::RegexpsCount(). Вместо
Final() имеет смысл вызвать AcceptedRegexps(const State&), возвращающая пару
итераторов (с семантикой [begin,end)), указывающую на диапазон номеров регулярок,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was the correct closing parenthesis ) in [begin,end) meaning the end is not inclusive. Just like in math.

README.en Outdated
is to construct it from the string with the regular expression using the Pire::Lexer class.

Regular expressions have the syntax similar to POSIX. It supports standard features as
a|b, a*, a+, ., [a-z], a{3}, a{3,5}, a{3,}) and character classes (\w (alphanumeric characters),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

\w includes only alphabetical, not all alphanumerical characters. Likewise, \W is a complementary set of all not alphabetical characters.

README.en Outdated
The function Pire::Scanner::Glue() unites several machines. The resulting scanner
checks whether the string matches all united patterns at a time. The Scanner::RegexpsCount()
function returns common number of patterns. It is recommended to replace running Final() with
AcceptedRegexps(const State&) that returns iterators with the [begin,end] semantics indicating the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be a half-open interval [begin, end) meaning the end is not inclusive.

README.en Outdated

Pire::Scanner CompileRegexp(const char* pattern)
{
// Переводим шаблон в UCS4
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is some text left in Russian.

README.en Outdated
* Finals(), IsFinal(), SetFinal(), ClearFinal().

In addition, a bit flag field will be associated with each state and each transition of the machine.
Transition flags are called outputs (since they implement a classic spacecraft with an output).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please replace 'spacecraft' with 'FSM'.

@AnastasiaTWW
Copy link
Author

Thank you for the huge work you have done!

I pointed out some technical issues.

As for the translation, I did not understand several parts of the text well enough.
Could You please proofread it?
Or at least compare it to what translate.google.ru yields for the Russian readme.

I'm sorry for the long replay. Thank you for the feedback! I've fixed the technical issues that you've found in the translation.

Please could you point me in translated parts that are not enough clear?

Comment on lines +139 to +142
\S (non-whitespace)). Operators a&b (intersection, see Fsm::operator &), ~a (inversion, see
Fsm::Complement), non-greedy analogs of repetitions *?, +?, ?? (Fsm::SlowCapturingScanner).
Operators in this scanner can be set to as short expressions as possible, in contrast to greedy
analogs that set to as long expressions as possible.
Copy link
Collaborator

@sergey-v-galtsev sergey-v-galtsev Sep 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The source paragraph

Кроме того, могут быть добавлены
операции a&b (пересечение, см. Fsm::operator &) и ~a (инверсия, см. Fsm::Complement).
Помимо этого, для Fsm::SlowCapturingScanner есть возможность использовать нежадные аналоги
повторений - операторы *?, +?, ??. В этом сканере данные операторы будут принимать
как можно более короткие выражения, в отличие от жадных аналогов (которые, наоборот, принимают
как можно более длинные.

is hard for google translate.
But after tweaking both the source and its suggested translation I have got

In addition, a&b (intersection, see Fsm::operator &) and ~a (inverse, see Fsm::Complement) can be used.
Also, Fsm :: SlowCapturingScanner has the ability to use non-greedy repetition operators - * ?, + ?, ??.
These operators will accept as short subtexts as possible, as opposed to greedy counterparts (which, on the contrary, accept as long as possible).

Our FSMs are acceptors. So the words "to accept" and "to match" can be used interchangeably. Whilst the word "to set" is not suitable here.

Comment on lines +123 to +127
United machines can be united once again. However, it is not recommended to repeat union
indefinitely because the number of states in the resulting scanner grows
exponentially with each added pattern. The max size of the scanners can be specified in
the maxSize parameter in the function Glue(). If the specified size is exceeded, an empty scanner
will be returned (Size() == 0, Empty() == true).
Copy link
Collaborator

@sergey-v-galtsev sergey-v-galtsev Sep 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the source paragraph

Склеенные сканеры в свою очередь можно склеить между собой. В то же время процесс склейки
нельзя повторять до бесконечности, потому как количество состояний в полученном автомате
растёт экспоненциально с каждым добавленным паттерном. Иногда имеет смысл ограничить
размер полученного автомата, указав ненулевой параметр maxSize в Glue(). В таком случае
при превышении указанного размера возвращается пустой автомат (Size() == 0, Empty() == true).

https://translate.google.ru suggests

The glued scanners, in turn, can be glued together. At the same time, the gluing process
cannot be repeated indefinitely, because the number of states in the resulting machine
grows exponentially with each added pattern. Sometimes it makes sense to limit
the size of the resulting automaton by specifying a nonzero maxSize parameter in Glue (). In this case
when the specified size is exceeded, an empty automaton is returned (Size () == 0, Empty () == true).

Which seams a bit clearer to me. First of all, it uses the words "to glue" and "the scanner" which are used for the same purpose in the code itself. This makes it easier to understand the code by reading the documentation.

I would suggest using "to glue" and "glued" instead of "to unite" and "united" in the previous paragraphs as well. Even if there seems to be a lot of tautologies this way, it is better for technical documentation.


Each scanner has the State type that stores the state, and methods Initialize(State&)
to return the initial machine state, Next(State&, char) to take the machine to the next state,
Final(const State&) to report whether the state is valid. To check whether the string matches
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is better to use the word 'final' instead of 'valid':

Final(const State&) to report whether the state is final (ie. the processed text is accepted by the machine).

Comment on lines +120 to +121
Numbers start from 0 and, after union, numbers of expressions on the right are shifted
by the expression numbers on the left.
Copy link
Collaborator

@sergey-v-galtsev sergey-v-galtsev Sep 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Numbers start from 0 and, after the Pire::Scanner::Glue() function, numbers of patterns on the right are shifted by the pattern count on the left.


* i - case insensitive
* u - compile the expression in UTF-8
* s - escape the expression .* on each side
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • s - surround the pattern by .* on each side

The following characters can be in the string with flags:

* i - case insensitive
* u - compile the expression in UTF-8
Copy link
Collaborator

@sergey-v-galtsev sergey-v-galtsev Sep 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The terms "regular expression" and "pattern" can be used interchangeably, but I would not use just "expression" because it is a broader term.

In the inherited class, the following methods should be overridden:

* wchar32 FromLocal(const char*& begin, const char* end) reads and returns input character and forwards begin.
Returns an exceprion if the input includes an invalid sequence.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, replace "exceprion" with "exception".

And it is more common to say "to throw an exception" than "to return an exception".

* Default constructor: creates the scanner that allows an empty string
* MakeFalse(): creates the scanner that does not allows any string
* Size(): the scanner size (number of states)
* Append(unsigned char c): adds to the machine the transition for the match with .
Copy link
Collaborator

@sergey-v-galtsev sergey-v-galtsev Sep 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Append(unsigned char c): adds to the machine the transition matching the given character.

* MakeFalse(): creates the scanner that does not allows any string
* Size(): the scanner size (number of states)
* Append(unsigned char c): adds to the machine the transition for the match with .
* AppendStrings(const vector<string>&): adds to the machine the transition for
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • AppendStrings(const vector&): adds transitions to the machine that match any of the given strings

cycle writing it as a single expression (see an example).

Four more functions helpful in lexical analysis are LongestPrefix(), LongestSuffix(), ShortestPrefix(),
and ShortestSuffix(). These functions returns the largest / the smallest prefix or suffix allowed by
Copy link
Collaborator

@sergey-v-galtsev sergey-v-galtsev Sep 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is better to say that the FSM "matches" or "accepts" the given text, rather than "allows". Please, check other uses of the word "to allow" in the same context.

* i - case insensitive
* u - compile the expression in UTF-8
* s - escape the expression .* on each side
* a - enable checking for the & and ~ operators in expressions (see above)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • a - enable support for the & and ~ operators in expressions (see above)

should be performed.

The serialized scanner is not portable between architectures (even between x86 and x86_64).
When trying to read or map the regular expression serialized in another architecture,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

serialized on another architecture

Returns an exceprion if the input includes an invalid sequence.
* std::string ToLocal(wchar32) returns returns the byte representation of the character in this encoding.
If the character in this encoding is not representable, returns an empty string.
* AppendDot(Fsm&) adds a fragment to the end machine that allows any one character in this encoding.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • AppendDot(Fsm&) appends to the end of the machine a fragment that accepts any character in this encoding.


* wchar32 FromLocal(const char*& begin, const char* end) reads and returns input character and forwards begin.
Returns an exceprion if the input includes an invalid sequence.
* std::string ToLocal(wchar32) returns returns the byte representation of the character in this encoding.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

returns returns -> returns

cycle writing it as a single expression (see an example).

Four more functions helpful in lexical analysis are LongestPrefix(), LongestSuffix(), ShortestPrefix(),
and ShortestSuffix(). These functions returns the largest / the smallest prefix or suffix allowed by
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These functions returns -> These functions return

@sergey-v-galtsev
Copy link
Collaborator

As an example, I suggested replacements for a couple of paragraphs.
And highlighted some other issues of the translation that caught my eye.
Could You please proofread the rest of the text?
Thank You in advance!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants