Possum DevLog 2: Tokens

Garry · 17 April 2020 21:34

Scanning Source Code

Now that we have established Possum’s use case and its formal grammar, we can almost start writing some Xojo code to implement the language!

A programming language can be implemented in many ways, but Possum will take a “traditional” approach and split the implementation into the following phases:

Scanning (aka tokenisation)
Parsing
Compilation
Interpretation

The scanner’s job is to take a Xojo String of Possum source code and convert it into an array of tokens. These tokens will subsequently be fed to the parser. Think of a token as the atomic building block of a programming language. In essence, each token represents either an operator symbol (e.g: =, ~ or >>), a literal value (e.g: "Hello World", 1, true) or a keyword (e.g: and, else). Some tokens don’t fall into one of these categories and are “special” (e.g: the end-of-file marker, newlines, indentation, etc). The scanner analyses the characters fed to it and partitions them into tokens. In addition to determining the character content of a token (its lexeme), the scanner will also store the line number the token occurs on and the position in the source code of the first character of the token. This will come in handy later on if we encounter an error and want to report its position to the user.
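To make the shape of a token concrete, here is a minimal sketch in Python (not the Xojo code Possum actually uses). The names TokenType, Token and describe are hypothetical, and the set of token types is only an assumption based on the categories described above.

```python
from dataclasses import dataclass
from enum import Enum, auto


class TokenType(Enum):
    # Hypothetical categories mirroring the description above.
    OPERATOR = auto()
    LITERAL = auto()
    KEYWORD = auto()
    IDENTIFIER = auto()
    NEWLINE = auto()
    INDENT = auto()
    DEDENT = auto()
    EOF = auto()


@dataclass(frozen=True)
class Token:
    type: TokenType
    lexeme: str    # the exact characters that make up the token
    line: int      # 1-based line number the token starts on
    position: int  # 0-based offset of the token's first character in the source


def describe(token: Token) -> str:
    """Format a token's location the way an error message might."""
    return f"{token.type.name} '{token.lexeme}' at line {token.line}, offset {token.position}"
```

Storing the line and offset on every token costs little and means later phases never need to re-scan the source to report where something went wrong.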

Types of Tokens

Before we can tokenise any source code, we need to have a firm definition of what Possum’s syntax is. Below is an overview.

Comments

Comments begin with a # and continue until the end of the line. They are ignored by the scanner.

Operators

+	# Addition
-	# Subtraction / unary negation
*	# Multiplication
/	# Division
%	# Remainder

=	# Assignment
+=	# Addition assignment
-=	# Subtraction assignment
*=	# Multiplication assignment
/=	# Division assignment
%=	# Remainder assignment

==	# Equality
<	# Less than
>	# Greater than
<>	# Not equal to
<=	# Less than or equal to
>=	# Greater than or equal to
&	# Bitwise AND
|	# Bitwise OR
^	# Bitwise XOR
~	# Bitwise NOT
<<	# Left shift
>>	# Right shift

<=	# Function block indicator
=>	# Key/value operator
?	# Ternary conditional component
:	# Ternary conditional component / block start indicator
.	# Access operator / decimal point
,	# Separator
??	# Nothing coalescing operator
_	# Line continuation marker
"	# String delimiter
@"	# Escaped string literal indicator
;	# Optional statement terminator
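
Several of these symbols share a prefix (<, <=, <<, <>), so a scanner has to decide how many characters to consume. One common technique is to try the longest candidate first ("maximal munch"). The Python sketch below illustrates that idea only; it is not Possum's Xojo scanner. OPERATORS and match_operator are hypothetical names, and the string delimiters and the _ continuation marker are left out on the assumption that they are handled elsewhere.

```python
from typing import Optional

# Symbols from the table above (string delimiters and "_" deliberately omitted).
OPERATORS = {
    "+", "-", "*", "/", "%",
    "=", "+=", "-=", "*=", "/=", "%=",
    "==", "<", ">", "<>", "<=", ">=",
    "&", "|", "^", "~", "<<", ">>",
    "=>", "?", ":", ".", ",", "??", ";",
}
MAX_LEN = max(len(op) for op in OPERATORS)


def match_operator(source: str, pos: int) -> Optional[str]:
    """Return the longest operator starting at pos, or None if there is none."""
    for length in range(MAX_LEN, 0, -1):
        candidate = source[pos:pos + length]
        # The length check stops a short slice at end-of-input matching by accident.
        if len(candidate) == length and candidate in OPERATORS:
            return candidate
    return None
```

For example, match_operator("a <= b", 2) returns "<=" rather than "<", because the two-character candidate is tried before the one-character one.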

Keywords

Keywords are case-sensitive in Possum.

and		as		block		class		constructor
downto	else	elseif		exit		false
for		foreach	foreign		function	if
import	is		not			nothing		or
pass	quit	repeat		return		skip
static	super	then		this		true
until	var		while		xor			yield

Identifiers

Like keywords, identifiers are case-sensitive. Valid identifiers begin with a Unicode letter or underscore and may be followed by zero or more Unicode letters, digits or the _ character. An identifier may also optionally be followed by either a single ? or !.

Examples:

valid
OK
mutate
snake_case
myvar1
_classProperty
__staticClassProperty
isHappy?
chop!

Note that an identifier prefixed with a single _ is a class property and an identifier prefixed with __ is a static class property.
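
The identifier and keyword rules can be sketched as follows. This is an illustrative Python version under a few assumptions: Python's str.isalpha and str.isdigit stand in for "Unicode letter" and "digit", and the names scan_identifier, classify and the category strings are hypothetical. A lone _ (the line continuation marker) is not treated specially here.

```python
from typing import Optional

KEYWORDS = frozenset({
    "and", "as", "block", "class", "constructor", "downto", "else",
    "elseif", "exit", "false", "for", "foreach", "foreign", "function",
    "if", "import", "is", "not", "nothing", "or", "pass", "quit",
    "repeat", "return", "skip", "static", "super", "then", "this",
    "true", "until", "var", "while", "xor", "yield",
})


def scan_identifier(source: str, pos: int) -> Optional[str]:
    """Return the identifier lexeme starting at pos, or None if there isn't one."""
    if pos >= len(source):
        return None
    first = source[pos]
    if not (first.isalpha() or first == "_"):
        return None
    end = pos + 1
    while end < len(source) and (
        source[end].isalpha() or source[end].isdigit() or source[end] == "_"
    ):
        end += 1
    # Optional single trailing ? or !
    if end < len(source) and source[end] in "?!":
        end += 1
    return source[pos:end]


def classify(lexeme: str) -> str:
    """Classify a lexeme; the lookup is case-sensitive, so 'While' is not a keyword."""
    if lexeme in KEYWORDS:
        return "KEYWORD"
    if lexeme.startswith("__"):
        return "STATIC_PROPERTY"
    if lexeme.startswith("_"):
        return "PROPERTY"
    return "IDENTIFIER"
```

Checking the __ prefix before the _ prefix matters, since every static property name also starts with a single underscore.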

Numbers

Integers

There are three ways to represent integer numbers in Possum:

64			# Integer
0x40		# Hexadecimal
0b1000000 	# Binary

Optionally, you can separate digits with the _ character. For example 100_000 is the same as 100000. This approach works for all types of integer literals (e.g: 0x4_0 and 0b1000_000). The _ is simply removed from the value.
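
As a rough illustration of how an integer lexeme could be turned into a value, here is a small Python sketch. The function name parse_integer is hypothetical, and it assumes only the lowercase 0x and 0b prefixes shown above are accepted.

```python
def parse_integer(lexeme: str) -> int:
    """Convert a Possum-style integer lexeme to its value."""
    digits = lexeme.replace("_", "")  # separators carry no meaning
    if digits.startswith("0x"):
        return int(digits[2:], 16)
    if digits.startswith("0b"):
        return int(digits[2:], 2)
    return int(digits, 10)
```

With this, parse_integer("0x4_0"), parse_integer("0b1000_000") and parse_integer("64") all give 64.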

Non-Integers

A non-integer number is a real number written as the integer component followed by a period and then the fractional part. Non-integers may also be written in scientific notation with E or e indicating the power of 10:

1.0
2.590
100_000.5	# 100000.5
1e3			# 1000
3e2.5 		# Invalid decimal point
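
The decimal forms above can be described with a regular expression. The Python sketch below is one possible reading of the rules, not the scanner's real logic: scan_number is a hypothetical name, hex and binary literals are out of scope, and signed exponents are not handled because the rules above don't mention them.

```python
import re

DIGITS = r"[0-9](?:_?[0-9])*"  # digits with optional single _ separators
NUMBER = re.compile(rf"{DIGITS}(?:\.{DIGITS})?(?:[eE]{DIGITS})?")


def scan_number(source: str, pos: int = 0):
    """Return (lexeme, value) for the decimal number at pos, or None."""
    match = NUMBER.match(source, pos)
    if match is None:
        return None
    lexeme = match.group()
    rest = source[match.end():]
    has_exponent = "e" in lexeme.lower()
    # Reject a fractional part after the exponent, as in 3e2.5
    if has_exponent and len(rest) >= 2 and rest[0] == "." and rest[1].isdigit():
        raise ValueError("invalid decimal point after exponent")
    text = lexeme.replace("_", "")
    is_real = "." in text or has_exponent
    value = float(text) if is_real else int(text)
    return lexeme, value
```

Requiring a digit after the period means that in something like 1.foo only the 1 is consumed as a number, leaving the . to be scanned as the access operator.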

Strings

Textual data in Possum is handled by the primitive String datatype. String literals are created by matching double quotes (") and may span multiple lines. Text flanked by double quotes is known as a verbatim literal. They are verbatim because every character between the opening and closing " is included.

var t = "Hello World"
var multiline = "This is 
over three 
lines"

Escape sequences

Character combinations consisting of a backslash (\) followed by a certain character are called escape sequences. Escape sequences only work within escaped string literals. These are prefixed with the @ character.

var t = @"Hello\tWorld\n" # Includes a tab and newline
var t2 = @"A double quote: \"" # A double quote: "

Escape sequences:

\n		Newline
\t		Horizontal tab
\"		Double quote
\\		Backslash
\u		Unicode code point

It’s not possible to include a double quote within a verbatim literal. If you don’t need to include a " in the literal, using a verbatim literal is recommended as parsing it is much faster than parsing an escaped literal.

\u followed by one to eight hex digits can be used to specify a Unicode code point:

System.print(@"\u41\ub83\u00DE") # AஃÞ
System.print(@"\u1F64A\u1F680") # 🙊🚀

Note: The above Unicode escaping has been edited to support suggestions made below by @npalardy and @Rick.A.
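
To show how these escape sequences might be decoded, here is a hedged Python sketch. decode_escaped is a hypothetical name, it operates on the text between the quotes of an escaped literal, and raising an error for an unknown escape is an assumption rather than documented Possum behaviour.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"
SIMPLE_ESCAPES = {"n": "\n", "t": "\t", '"': '"', "\\": "\\"}


def decode_escaped(body: str) -> str:
    """Decode the body of an escaped (@"...") string literal."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("dangling backslash at end of literal")
        code = body[i + 1]
        if code in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[code])
            i += 2
        elif code == "u":
            start = end = i + 2
            # Greedily take between one and eight hex digits.
            while end < len(body) and end - start < 8 and body[end] in HEX_DIGITS:
                end += 1
            if end == start:
                raise ValueError("\\u must be followed by at least one hex digit")
            out.append(chr(int(body[start:end], 16)))  # chr rejects values above 0x10FFFF
            i = end
        else:
            raise ValueError(f"unknown escape sequence \\{code}")
    return "".join(out)
```

Because the hex digits are consumed greedily, any hex-looking character that follows a \u escape becomes part of the code point unless all eight digits are written out.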

Indentation

Rather than using curly braces to enclose blocks of code, Possum takes inspiration from Python and uses indentation to denote a block. Unlike Python, spaces at the beginning of a line have no meaning and are detected by the scanner as an error. Only horizontal tabs can be used for indentation. The scanner needs to correctly identify when indentation and dedentation have occurred. For example, given the following code:

a
	b
		c
d

The scanner needs to produce the following tokens:

IDENTIFIER(a)
INDENT
IDENTIFIER(b)
INDENT
IDENTIFIER(c)
DEDENT
DEDENT
IDENTIFIER(d)
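
A simple way to produce these tokens is to compare each line's leading tab count with the current depth. The Python sketch below is an illustration under several assumptions: indentation_tokens is a hypothetical name, each non-blank line is treated as a single identifier to keep the example short, blank lines are ignored, and any blocks still open at the end of the source are closed with DEDENT tokens.

```python
from typing import List


def indentation_tokens(source: str) -> List[str]:
    """Emit INDENT/DEDENT markers around one identifier per line."""
    tokens: List[str] = []
    depth = 0
    for line_no, line in enumerate(source.split("\n"), start=1):
        if line.strip() == "":
            continue  # assumption: blank lines do not affect indentation
        content = line.lstrip("\t")
        if content.startswith(" "):
            raise SyntaxError(f"line {line_no}: spaces cannot be used for indentation")
        level = len(line) - len(content)  # number of leading tabs
        while level > depth:
            tokens.append("INDENT")
            depth += 1
        while level < depth:
            tokens.append("DEDENT")
            depth -= 1
        tokens.append(f"IDENTIFIER({content})")
    # Close any blocks still open at end of input.
    tokens.extend(["DEDENT"] * depth)
    return tokens
```

Running it on the four-line example above yields the same eight tokens listed, with two DEDENT tokens emitted before d because the depth drops from two to zero in one step.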

The next post will walk through the code for Possum’s scanner.

npalardy · 17 April 2020 22:12

no unary negation ?
ie/ is

-10

legal ?

Note that an identifier prefixed with a single _ is a class property and an identifier prefixed with __ is a static class property.

this distinction should probably NOT be part of the grammar

Rick.A · 17 April 2020 22:38

Besides the myValue = -myVar that Norman pointed out, as you absorbed C like escapes, I would suggest that the line continuation would be a backslash as the last non-blank (code > 0x20) before the end of line, as we do in shell scripts and C. In most keyboards it is one keystroke only and complements your escaping options.

Garry · 17 April 2020 22:53

A simple omission. The parser already supports it. I’ve edited the post.

Why so? I thought it would make it easier to have separate tokens for static and instance class properties since Possum does not allow access to member properties outside of getters and setters. By specifying single and double underscores to indicate property identifiers it allows the parser to recognise errors like myClassInstance._prop as a syntax error.

I did consider using the backslash at the end of a line as the continuation marker but I decided I liked Xojo’s approach of using the underscore more. Just a personal preference really.

Rick.A · 17 April 2020 22:58

Also, I don’t know why \u and \U, to me \u is enough, the variant part is the number of digits. Can be any value from \u0 to \u7FFFFFFF with any quantity of digits forming a valid value.

Garry · 17 April 2020 23:08

For ease of tokenising really. The maximum number of bytes needed to specify any code point is 4 bytes (8 hex digits). I wanted to allow the user to specify code points in the basic multilingual plane (which only requires a maximum of 2 bytes) using 4 hex digits without having to check for the remaining 4 digits.

Rick.A · 17 April 2020 23:09

Garry · 17 April 2020 23:10

It also makes it clear in the source code that the programmer is trying to represent a common lingual character rather than (for instance) an emoji.

Garry · 17 April 2020 23:11

Rick.A · 17 April 2020 23:17

I don’t like the idea of having 2 escape codes for the same thing (“unicoding” a value), but… do the things the way you prefer. It’s your playground.

Garry · 17 April 2020 23:18

I’ll ruminate on it in bed. I’m always open to changing my mind

npalardy · 18 April 2020 00:01

this distinction should probably NOT be part of the grammar

I should have said not part of the tokenizer

IDENTIFIER should be some legal sequence of characters which the tokenizer doesnt need to distinguish

The tokenizer should be worried about “does this mean A or B”
It really should be devoid of SEMANTIC meaning

The SEMANTIC difference should be in a higher level based on what can be expected, accepted, etc

npalardy · 18 April 2020 00:04

I tend to agree with Rick that \u and \U shouldnt matter one way or the other
Semantically they are so nearly identical it could be confusing

Garry · 18 April 2020 07:33

You both make a compelling argument. I’ve changed the scanner now such that the escape sequence is simply \u followed by 1 to 8 hex digits. This is now valid:

System.print(@"\u41\ub83\uDE") # AஃÞ
System.print(@"\u1F64A\u1F680") # 🙊🚀

Of course the trade off now is that if you want an unusual character to immediately precede a simple character, you have to use 8 hex digits instead of 4 unless you use string concatenation:

Before:

@"\u00E0B" # àB

Now:

@"\u000000E0B" # àB

# Or

@"\u00E0" + "B" # àB

npalardy · 18 April 2020 14:20

ah … I might have used \u\u instead of one long string of digits just to reduce ambiguity

npalardy · 18 April 2020 14:54

have you sat down and written a full grammar for possum ?
that might be a useful exercise to get a good handle on implementing everything

SOME systems basically write the tokenizer as part of the grammar ( see antlr and the list of grammars that are written for a LOT of languages at https://github.com/antlr/grammars-v4 )

there are a couple built for BASIC which makes it easier to read since you already know basic

Rick.A · 18 April 2020 15:21

You just should stop the \u analysis when finding the first non-hex code, and you can elect a stop sign to be not included in the string when at end of an \u declaration, let’s suppose the “+” here like:

“\u41\ub83here” # Aஃhere as h is not in 0-9 a-f
“\u41\ub83fascinating” # A?scinating (unicode u-b83fa)
“\u41\ub83+fascinating” # Aஃfascinating (unicode u-b83, the stop encoding sign “+” is ignored)
“\u41\ub83++fascinating” # Aஃ+fascinating (unicode u-b83, the next + is IN the string)
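
For anyone wanting to experiment with this stop-sign idea, it can be sketched in a few lines of Python. This is only an illustration of the proposal, not something Possum implements; decode_u_with_stop is a hypothetical name, only \u escapes are handled, and the limit of eight hex digits is carried over from the discussion above.

```python
import re

# \u, then 1 to 8 hex digits, then an optional "+" stop sign that is dropped.
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{1,8})\+?")


def decode_u_with_stop(body: str) -> str:
    """Replace each \\u escape with its character, swallowing one trailing '+'."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), body)
```

Only the first + after the hex digits is consumed, so a doubled ++ leaves one literal + in the result, matching the last example.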

Garry · 18 April 2020 15:30

Here’s the grammar I’m using:

npalardy · 18 April 2020 16:42

Thanks !
I’ll probably just extract things and stick it in a plain txt document so I can poke about quick & easy

I notice you use “regex” style syntax which can sometimes be a pain in the butt to turn into code

Garry · 18 April 2020 18:40

I’m using a Pages document - it’s here if you want it: