Tokens
The first step in interpreting Proton code is to break the source code into tokens. Each token has a type and a content (as well as `**kwargs`, but those can be ignored since I never got around to using them). The type determines which component the token becomes in the AST. The content is kept in the AST, which is especially useful for literals and operators.
The lexer works by comparing the code against a set of rules in a fixed order of precedence. Once a rule matches, the lexer breaks off that section of the code and yields a token. The lexer is a generator, though its sole use is inside a `list(...)` call.
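The matching loop described above can be sketched roughly as follows. This is a minimal illustration, not Proton's actual lexer: the `Token` class, the `RULES` list, the type names, and the `tokenize` function are all hypothetical, and only a handful of the real rules are included.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    type: str
    content: str


# Hypothetical, heavily shortened rule list. Order matters: earlier rules win.
# A type of None means "match and skip without producing a token".
RULES = [
    (re.compile(r"#.+"), None),
    (re.compile(r"\d*\.\d+"), "literal:float"),
    (re.compile(r"\d+"), "literal:int"),
    (re.compile(r"[A-Za-z_][A-Za-z_0-9]*"), "identifier"),
    (re.compile(r"\s+"), None),
]


def tokenize(code: str) -> Iterator[Token]:
    """Yield tokens by trying each rule, in order, at the current position."""
    pos = 0
    while pos < len(code):
        for pattern, token_type in RULES:
            match = pattern.match(code, pos)
            if match:
                if token_type is not None:
                    yield Token(token_type, match.group())
                pos = match.end()  # break off the matched section
                break
        else:
            # No rule matched at this position.
            raise SyntaxError(f"unexpected character {code[pos]!r} at offset {pos}")


# The generator is consumed in one go, as the text describes.
tokens = list(tokenize("x 1.5 42 # note"))
```

Here the comment and the whitespace are consumed but never yielded, so `tokens` holds only the identifier and the two numeric literals.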
The rules are, in the order in which the lexer tries them:

Regex | Type | Description |
---|---|---|
`#.+` | Comment | If this is matched, skip the rest of the line |
`/\*([^*]\|\*[^/])*\*/` | Comment | If this is matched, skip everything in it (`/* ... */`) |
`\d*\.\d+j` | Literal Complex Number | If it is in the form `<0+ digits>.<1+ digits>j`, then it is a complex literal. Note that `2.3ja` would still become a complex literal |
`\d+j` | Literal Complex Number | Same as above, except for an integer multiple of the imaginary unit |
`\d*\.\d+` | Literal Floating Point Number | Digits before the decimal point are optional, but at least one digit must follow it (this is to avoid a bug that I now cannot recall) |
`\d+` | Literal Integer | If it's not a complex number, a rational, or a floating point number, and it has a bunch of digits, it's probably an integer. Note that a trailing space is not necessary |
`"([^"\\]\|\\.)*"` | Literal string | A quote, then any number of characters that are neither quotes nor backslashes, or backslashes followed by any character, then a closing quote. The string is evaluated using `ast.literal_eval`, so strings work just like in Python. Actual newlines can be present in strings (no triple quotes needed) |
`'([^'\\]\|\\.)*'` | Literal string | Same as above, but with single quotes |
`"([^"\\]\|\\.)*` | UnclosedStringError | Raises an error if the string is unclosed. Used by the shell to determine when to continue code across multiple lines |
`'([^'\\]\|\\.)*` | UnclosedStringError | Same as above, but with single quotes |
(no regex) | Keyword | Keywords are matched here. See the keywords subsection below on this page |
`[A-Za-z_][A-Za-z_0-9]*` | Identifier | If it starts with a letter or an underscore and is followed by any number of underscores, letters, or digits, then it's an identifier, unless it's an operator. See the operators subsection below on this page |
`(;\|,\|\?\|:>\|->\|=>)` | Special Statements | These are really simple tokens, and their functionality need not be explained until later |
(no regex) | Operator | Operators are matched here (if not already by the Identifier section). See the operators subsection below on this page |
`[\(\)\[\]\{\}]` | Bracket | These single-character tokens become important in the parser |
`\s+` | Whitespace | If it matches whitespace, skip it. This, like a comment, is not actually tokenized |
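To see why the order of the numeric rules matters, the short sketch below tries the four number regexes from the table against the same input, once in table order and once reversed. The `first_match` helper and the type names are hypothetical and exist only for this illustration.

```python
import re

# The four number regexes from the table, in table order.
# The short type names are placeholders, not Proton's real token types.
NUMBER_RULES = [
    (r"\d*\.\d+j", "complex"),
    (r"\d+j", "complex"),
    (r"\d*\.\d+", "float"),
    (r"\d+", "int"),
]


def first_match(code, rules):
    """Return (type, text) for the first rule matching at the start of code."""
    for pattern, token_type in rules:
        match = re.match(pattern, code)
        if match:
            return token_type, match.group()
    return None


# In table order, the complex rule claims "2.3j" before the others get a chance.
in_order = first_match("2.3ja", NUMBER_RULES)

# With the order reversed, the integer rule would break off just "2".
reversed_order = first_match("2.3ja", list(reversed(NUMBER_RULES)))

# A leading digit is optional for floats, but a trailing digit is required.
leading_dot = first_match(".5", NUMBER_RULES)
trailing_dot = first_match("5.", NUMBER_RULES)
```

With the rules in table order, `2.3ja` yields the complex literal `2.3j`, which matches the note in the table. Placing the more specific patterns first is what prevents the integer rule from consuming only the leading `2`.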
Keywords are special words that mean things, like `for` and `if`. This is a list of all keywords, followed by the uses of those that are not part of most mainstream programming languages:

`if`, `else`, `unless`, `while`, `for`, `try`, `except`, `exist not`, `exist`, `exists not`, `exists`, `break`, `continue`, `import`, `include`, `as`, `from`, `to`, `by`, `timeof`

`unless` - essentially a backwards `if` statement. More in the Control Flow section

`exist not`, `exist`, `exists not`, `exists` - check if a variable is defined; if the argument is a literal, return `True`, since literals are always defined. More in the Unique Features section

`timeof` - time how long an expression takes to be evaluated. More in the Unique Features section
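
The rule table lists keywords as "(no regex)", and this page does not say how they are actually matched. One plausible approach is sketched below under three assumptions that are mine, not the source's: longer keywords are tried first, a keyword must not end in the middle of a word, and the two-word keywords are written with exactly one space. The `match_keyword` function is hypothetical.

```python
KEYWORDS = [
    "if", "else", "unless", "while", "for", "try", "except",
    "exist not", "exist", "exists not", "exists",
    "break", "continue", "import", "include",
    "as", "from", "to", "by", "timeof",
]


def match_keyword(code, pos=0):
    """Return the keyword starting at pos, or None (assumed matching strategy)."""
    # Assumption: try longer keywords first so that "exists not" wins
    # over "exists", and "exists" wins over "exist".
    for keyword in sorted(KEYWORDS, key=len, reverse=True):
        if code.startswith(keyword, pos):
            end = pos + len(keyword)
            # Assumption: reject a match that stops mid-word,
            # e.g. "for" at the start of "format".
            if end == len(code) or not (code[end].isalnum() or code[end] == "_"):
                return keyword
    return None
```

Under these assumptions, `match_keyword("exists not x")` returns `"exists not"`, while `match_keyword("format")` returns `None` and leaves `format` for the Identifier rule.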
Operators are things that join or modify expressions to compute things. This is a list of all operators, in descending order of precedence:

- `.`
- `**`
- `>>`, `<<`
- `*`, `/`, `//`
- `%`
- `+`, `-`
- `>`, `<`, `<=`, `>=`
- `&`
- `|`
- `^`
- `**=`, `*=`, `/=`, `//=`, `+=`, `-=`, `>>=`, `<<=`, `%=`, `&=`, `|=`, `&&=`, `||=`
- `==`, `!=`, `:=`, `=`, `=:`
- `&&`, `and`
- `||`, `or`
- `in`, `not, in`, `is`, `are`, `is, not`, `are, not`

`is`, `are`, `is not`, and `are not` check whether or not an object is of a certain type.
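
Like keywords, operators have no regex in the rule table. Many of the symbolic operators are prefixes of one another (`*`, `**`, `**=`), so a lexer has to pick the longest one that fits. The sketch below shows one way to do that. The longest-first strategy and the `match_operator` function are assumptions for illustration, and the word operators (`and`, `or`, `in`, `is`, `are`) are left out because the table says they may already be handled by the Identifier rule.

```python
# Symbolic operators from the precedence list above.
SYMBOL_OPERATORS = [
    ".", "**", ">>", "<<", "*", "/", "//", "%", "+", "-",
    ">", "<", "<=", ">=", "&", "|", "^",
    "**=", "*=", "/=", "//=", "+=", "-=", ">>=", "<<=", "%=",
    "&=", "|=", "&&=", "||=",
    "==", "!=", ":=", "=", "=:", "&&", "||",
]


def match_operator(code, pos=0):
    """Return the longest symbolic operator starting at pos, or None."""
    # Assumption: longest candidates first, so "**=" is not split into "**" and "=".
    for op in sorted(SYMBOL_OPERATORS, key=len, reverse=True):
        if code.startswith(op, pos):
            return op
    return None
```

Note that this only decides where an operator token ends. Precedence, as given in the list above, is a separate concern that applies later, when the tokens are assembled into the AST.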