Scanning Source Code
Now we have established Possum’s use case and its formal grammar, we can almost start writing some Xojo code to implement the langauge!
A programming langauge can be imlpemented in many ways but Possum will take a “traditional” approach and split the implementation into the following phases:
- Scanning (aka tokenisation)
The scanner’s job is to take a Xojo
String of Possum source code and convert it into an array of tokens. These tokens will subsequently be fed to the parser. Think of a token as the atomic building block of a programming language. In essence, they represent either an operator symbol (e.g:
>>), a literal value (e.g:
true) or keyword (
else). Some tokens don’t fall into one of these categories and are “special” (e.g: the end-of-file marker, newlines, indentation, etc). The scanner’s job is to analyse the characters fed to it and partition them into tokens. In addition to determining the character content of a token (it’s lexeme), the scanner will also store the line number the token occurs on and the position in the source code of the first character of the token. This will come in handy later on if we encounter an error and want to report its position to the user.
Types of Tokens
Before we can tokenise any source code, we need to have a firm definition of what Possum’s syntax is. Below is an overview.
Comments begin with a
# and continue until the end of the line. They are ignored by the scanner.
+ # Addition - # Subtraction / unary negation * # Multiplication / # Division % # Remainder = # Assignment += # Addition assignment -= # Subtraction assignment *= # Multiplication assignment /= # Division assignment %= # Remainder assignment == # Equality < # Less than > # Greater than <> # Not equal to <= # Less than or equal to >= # Greater than or equal to & # Bitwise AND | # Bitwise OR ^ # Bitwise XOR ~ # Bitwise NOT << # Left shift >> # Right shift <= # Function block indicator => # Key/value operator ? # Ternary conditional component : # Ternary conditional component / block start indicator . # Access operator / decimal point , # Separator ?? # Nothing coalescing operator _ # Line continuation marker " # String delimiter @" # Escaped string literal indicator ; # Optional statement terminator
Keywords are case-sensitive in Possum.
and as block class constructor downto else elseif exit false for foreach foreign function if import is not nothing or pass quit repeat return skip static super then this true until var while xor yield
Like keywords, identifiers are case sensitive. Valid identifiers begin with a Unicode letter or underscore and may be followed by >= 0 Unicode letters, digits or the
_ character. An identifier may also optionally be followed by either a single
valid OK mutate snake_case myvar1 _classProperty __staticClassProperty isHappy? chop!
Note that an identifier prefixed with a single
_ is a class property and an identifier prefixed with
__ is a static class property.
There are three ways to represent integer numbers in Possum:
64 # Integer 0x40 # Hexadecimal 0b1000000 # Binary
Optionally, you can separate digits with the
_ character. For example
100_000 is the same as
100000. This approach works for all types of integer literals (e.g:
0b1000_000 ). The
_ is simply removed from the value.
A non-integer number is a real number written as the integer component followed by a period and then the fractional part. Non-integers may also be written in scientific notation with
e indicating the power of 10:
1.0 2.590 100_000.5 # 100000.5 1e3 # 1000 3e2.5 # Invalid decimal point
Textual data in Possum is handled by the primitive
String datatype. String literals are created by matching double (
") quotes and may span multiple lines. Text flanked by double quotes is known as a verbatim literal. They are verbatim because every character between the opening and closing
" is included.
var t = "Hello World" var multiline = "This is over three lines"
Character combinations consisting of a backslash (
\) followed by a certain character are called escape sequences. Escape sequences only work within escaped string literals. These are prefixed with the
var t = @"Hello\tWorld\n" # Includes a tab and newline var t2 = @"A double quote: \"" # A double quote: "
\n Newline \t Horizontal tab \" Double quote \\ Backslash \u Unicode code point
It’s not possible to include a double quote within a verbatim literal. If you don’t need to include a
" in the literal, using a verbatim literal is recommended as parsing it is much faster than parsing an escaped literal.
\u followed by one to eight hex digits can be used to specify a Unicode code point:
System.print(@"\u41\ub83\u00DE") # AஃÞ System.print(@"\u1F64A\u1F680") # 🙊🚀
Rather than using curly braces to enclose blocks of code, Possum takes inspiration from Python and uses indentation to denote a block. Unlike Python, spaces at the beginning of a line have no meaning and are detected by the scanner as an error. Only horizontal tabs can be used for indentation. The scanner needs to correctly identify when indentation and dedentation has occurred. For example, given the following code:
a b c d
The scanner needs to produce the following tokens:
IDENTIFIER(a) INDENT IDENTIFIER(b) INDENT IDENTIFIER(c) DEDENT DEDENT IDENTIFIER(d)
The next post will walk through the code for Possum’s scanner.