Scanning Source Code
Now we have established Possum’s use case and its formal grammar, we can almost start writing some Xojo code to implement the langauge!
A programming langauge can be imlpemented in many ways but Possum will take a “traditional” approach and split the implementation into the following phases:
- Scanning (aka tokenisation)
- Parsing
- Compilation
- Interpretation
The scanner’s job is to take a Xojo String
of Possum source code and convert it into an array of tokens. These tokens will subsequently be fed to the parser. Think of a token as the atomic building block of a programming language. In essence, they represent either an operator symbol (e.g: =
, ~
or >>
), a literal value (e.g: "Hello World"
, 1
, true
) or keyword (and
, else
). Some tokens don’t fall into one of these categories and are “special” (e.g: the end-of-file marker, newlines, indentation, etc). The scanner’s job is to analyse the characters fed to it and partition them into tokens. In addition to determining the character content of a token (it’s lexeme), the scanner will also store the line number the token occurs on and the position in the source code of the first character of the token. This will come in handy later on if we encounter an error and want to report its position to the user.
Types of Tokens
Before we can tokenise any source code, we need to have a firm definition of what Possum’s syntax is. Below is an overview.
Comments
Comments begin with a #
and continue until the end of the line. They are ignored by the scanner.
Operators
+ # Addition
- # Subtraction / unary negation
* # Multiplication
/ # Division
% # Remainder
= # Assignment
+= # Addition assignment
-= # Subtraction assignment
*= # Multiplication assignment
/= # Division assignment
%= # Remainder assignment
== # Equality
< # Less than
> # Greater than
<> # Not equal to
<= # Less than or equal to
>= # Greater than or equal to
& # Bitwise AND
| # Bitwise OR
^ # Bitwise XOR
~ # Bitwise NOT
<< # Left shift
>> # Right shift
<= # Function block indicator
=> # Key/value operator
? # Ternary conditional component
: # Ternary conditional component / block start indicator
. # Access operator / decimal point
, # Separator
?? # Nothing coalescing operator
_ # Line continuation marker
" # String delimiter
@" # Escaped string literal indicator
; # Optional statement terminator
Keywords
Keywords are case-sensitive in Possum.
and as block class constructor
downto else elseif exit false
for foreach foreign function if
import is not nothing or
pass quit repeat return skip
static super then this true
until var while xor yield
Identifiers
Like keywords, identifiers are case sensitive. Valid identifiers begin with a Unicode letter or underscore and may be followed by >= 0 Unicode letters, digits or the _
character. An identifier may also optionally be followed by either a single ?
or !
.
Examples:
valid
OK
mutate
snake_case
myvar1
_classProperty
__staticClassProperty
isHappy?
chop!
Note that an identifier prefixed with a single _
is a class property and an identifier prefixed with __
is a static class property.
Numbers
Integers
There are three ways to represent integer numbers in Possum:
64 # Integer
0x40 # Hexadecimal
0b1000000 # Binary
Optionally, you can separate digits with the _
character. For example 100_000
is the same as 100000
. This approach works for all types of integer literals (e.g: 0x4_0
and 0b1000_000
). The _
is simply removed from the value.
Non-Integers
A non-integer number is a real number written as the integer component followed by a period and then the fractional part. Non-integers may also be written in scientific notation with E
or e
indicating the power of 10:
1.0
2.590
100_000.5 # 100000.5
1e3 # 1000
3e2.5 # Invalid decimal point
Strings
Textual data in Possum is handled by the primitive String
datatype. String literals are created by matching double ( "
) quotes and may span multiple lines. Text flanked by double quotes is known as a verbatim literal. They are verbatim because every character between the opening and closing "
is included.
var t = "Hello World"
var multiline = "This is
over three
lines"
Escape sequences
Character combinations consisting of a backslash (\
) followed by a certain character are called escape sequences. Escape sequences only work within escaped string literals. These are prefixed with the @
character.
var t = @"Hello\tWorld\n" # Includes a tab and newline
var t2 = @"A double quote: \"" # A double quote: "
Escape sequences:
\n Newline
\t Horizontal tab
\" Double quote
\\ Backslash
\u Unicode code point
It’s not possible to include a double quote within a verbatim literal. If you don’t need to include a "
in the literal, using a verbatim literal is recommended as parsing it is much faster than parsing an escaped literal.
\u
followed by one to eight hex digits can be used to specify a Unicode code point:
System.print(@"\u41\ub83\u00DE") # AஃÞ
System.print(@"\u1F64A\u1F680") # 🙊🚀
Note: The above Unicode escaping has been edited to support suggestions made below by @npalardy and @Rick.A.
Identation
Rather than using curly braces to enclose blocks of code, Possum takes inspiration from Python and uses indentation to denote a block. Unlike Python, spaces at the beginning of a line have no meaning and are detected by the scanner as an error. Only horizontal tabs can be used for indentation. The scanner needs to correctly identify when indentation and dedentation has occurred. For example, given the following code:
a
b
c
d
The scanner needs to produce the following tokens:
IDENTIFIER(a)
INDENT
IDENTIFIER(b)
INDENT
IDENTIFIER(c)
DEDENT
DEDENT
IDENTIFIER(d)
The next post will walk through the code for Possum’s scanner.