Possum DevLog 4: The Scanner (Part 1)

Garry · 24 April 2020 10:37

OK, now we’ve “sorted out” our grammar and explained what tokens are, we’re finally in position to walk through Possum’s scanner line by line. This post will take a look at the PossumScanner class whose task is to take a Xojo String representing Possum source code and spit out an array of PossumToken instances which will be subsequently fed into Possum’s parser.

External Dependencies

I had considered writing Possum such that it was an entirely self-contained project with no external dependencies. Whilst Possum doesn’t need any external plugins it does depend on a few of my own external modules for common functionality. Fortunately these are freely available on my GitHub account and can be used in many other projects. The only dependency that the scanner depends on is my StringKit module which provides common String manipulation methods. You’ll need to add this to your project.

Required Classes

Since Possum is a case-sensitive language, we need to implement our own version of Xojo’s Dictionary class that will treat String keys of different cases as being different. The native Dictionary class doesn’t do this. Fortunately, this is easy to overcome with API 2.0 by subclassing the native Dictionary. We’ll call this class CaseSensitiveDictionary:

Class CaseSensitiveDictionary Inherits Dictionary
	Public Sub Constructor(ParamArray entries As Pair)	  
	  Super.Constructor(AddressOf KeyComparer)
	  
	  For Each keyValue As Pair In entries
	    Self.Value(keyValue.Left) = keyValue.Right
	  Next keyValue
	End Sub
	
	Private Function KeyComparer(leftKey As Variant, rightKey As Variant) as Integer  
		If leftKey.Type = Variant.TypeString And rightKey.Type = Variant.TypeString Then
		  // Both keys are strings so do a case-sensitive comparison.
		  Return leftKey.StringValue.Compare(rightKey.StringValue, ComparisonOptions.CaseSensitive)
		Else
		  If leftKey = rightKey Then
		    Return 0
		  ElseIf leftKey < rightKey Then
		    Return -1
		  Else
		    Return 1
		  End If
		End If
	End Function
End Class

Now we can create a case-sensitive dictionary like so:

Var cd As New CaseSensitiveDictionary
cd.Value("hello") = True
If cd.HasKey("hello") Then
  // This is True.
End If
If cd.HasKey("Hello") Then
  // This is False.
End If

Next up, we need a class to represent a token. This class needs to store the following information:

The token’s lexeme These are the actual characters that the token is comprised of
The (1-based) line number of the source code that this token originated on
The (0-based) index within the original source code of the first character of this token
The type of token (e.g. keyword, number, etc). This will be an enumeration.

Here’s a basic structure for the class:

Class PossumToken
	Property Lexeme As String
	Property Line As Integer = 1
	Property Position As Integer = 0
	Property Type As Possum.TokenTypes
End Class

You’ll notice that we have an enumeration called TokenTypes which I have embedded within a module called Possum. The reason for the separate module is because as the pipeline for Possum grows, we will need more properties and methods accessible to the parser, scanner, compiler, etc that are best suited within a separate module. You may be asking at this point, why not put all classes within the Possum module and stop prefixing classes with Possum altogether? Well, this is because I want to have all of Possum’s classes as external dependencies to a project. This means that I can edit and enhance them in one project and have those changes propagate to any project that is using them. Unfortunately, the Xojo IDE cannot externalise a module if it contains classes.

We have many different types of token. Here is the TokenTypes enumeration in full:

Protected Enum TokenTypes
	Ampersand
	BooleanFalse
	BooleanTrue
	Caret
	Colon
	Comma
	Dedent
	Dot
	EOF
	Equal
	EqualEqual
	ExclusiveRange
	ForwardSlash
	ForwardSlashEqual
	Greater
	GreaterEqual
	GreaterGreater
	Identifier
	InclusiveRange
	Indent
	KeywordAnd
	KeywordAs
	KeywordBlock
	KeywordClass
	KeywordConstructor
	KeywordDownTo
	KeywordElse
	KeywordElseIf
	KeywordExit
	KeywordFor
	KeywordForEach
	KeywordForeign
	KeywordFunction
	KeywordImport
	KeywordIf
	KeywordIs
	KeywordNot
	KeywordOr
	KeywordPass
	KeywordQuit
	KeywordRepeat
	KeywordReturn
	KeywordSkip
	KeywordStatic
	KeywordSuper
	KeywordThen
	KeywordThis
	KeywordUntil
	KeywordVar
	KeywordWhile
	KeywordXor
	KeywordYield
	LCurly
	LeftArrow
	LParen
	LSquare
	Less
	LessEqual
	LessLess
	Minus
	MinusEqual
	NotEqual
	Nothing
	NumberLiteral
	Percent
	PercentEqual
	Pipe
	Plus
	PlusEqual
	PropertyIdentifier
	Query
	QueryQuery
	RCurly
	RParen
	RightArrow
	RSquare
	Star
	StarEqual
	StaticPropertyIdentifier
	StringLiteral
	Terminator
	Tilde
	Underscore
End Enum

As good API 2.0 Xojo programmers, we will use exceptions to handle errors. To make it easier, let’s create our own PossumScannerException class that will be raised by our scanner whenever something bad happens. It’ll be a subclass of Xojo’s RuntimeException with two additional properties to store the 1-based line number and 0-based character localising where the error occurred in the source code:

Class PossumScannerException Inherits RuntimeException
	Property LineCharacter As Integer = 0
	Property LineNumber As Integer = 1
End Class

Finally, we need a scanner class to do the tokenisation. Imaginatively we’ll call it PossumScanner. There are a number of properties that this class needs to function correctly and I’ll explain them momentarily. We will add the methods later as we work through this article.

Class PossumScanner
	Private Property AtLineStart as Boolean = True
	Private Property Chars() as String
	Private Property CharsCount as Integer = 0
	Private Property CharsLastRowIndex as Integer = -1
	Private Property Current as Integer = 0
	Private Property IndentStack() as Integer
	Private Property LineNumber as Integer = 1
	Private Property Lines() as String
	Public Property ReservedWords as CaseSensitiveDictionary
		Get
        End Get
	End Property
	Private Property Source as String
	Private Property Tokens() as PossumToken
	Private Property TokenStart as Integer = 0
	Private Property UnclosedCurlyCount as Integer = 0
	Private Property UnclosedParenCount as Integer = 0
	Private Property UnclosedSquareCount as Integer = 0
End Class

Below is a brief description of what each of the scanner’s properties do.

`AtLineStart`

True if we are at the start of a new line.

`Chars()`

An array containing the individual characters of the source code this scanner will tokenise. Line endings have been standardised to UNIX.

`CharsCount`

The number of characters in the Chars array.

`CharsLastRowIndex`

The cached last row index for the Chars array.

`Current`

The 0-based current position in the source code. Points to the next character to handle.

`IndentStack()`

Used when computing the current level of indentation. We’ll cover this in a later topic.

`LineNumber`

The 1-based line number that the scanner is currently on.

`Lines()`

An array containing each line of the source code this scanner is tokenising (including blank lines).

`ReservedWords`

This is a CaseSensitiveDictionary containing all of Possum’s keywords (plus "true", "false" and "nothing") as keys mapped to their token type (as the value). We use our custom CaseSensitiveDictionary class described above because Possum is a case-sensitive language but Xojo is not.

Since this dictionary never changes we can take advantage of Xojo’s Static keyword to create the dictionary once when it is first accessed by making the ReservedWords property computed. In the property’s Get method, put this code:

// This code will only execute once.
Static mReservedWords As New CaseSensitiveDictionary(_
		"and" : Possum.TokenTypes.KeywordAnd, _
		"as" : Possum.TokenTypes.KeywordAs, _
		"block" : Possum.TokenTypes.KeywordBlock, _
		"class" : Possum.TokenTypes.KeywordClass, _
		"constructor" : Possum.TokenTypes.KeywordConstructor, _
		"downto" : Possum.TokenTypes.KeywordDownTo, _
		"else" : Possum.TokenTypes.KeywordElse, _
		"elseif" : Possum.TokenTypes.KeywordElseIf, _
		"exit" : Possum.TokenTypes.KeywordExit, _
		"for" : Possum.TokenTypes.KeywordFor, _
		"foreach" : Possum.TokenTypes.KeywordForEach, _
		"foreign" : Possum.TokenTypes.KeywordForeign, _
		"function" : Possum.TokenTypes.KeywordFunction, _
		"if" : Possum.TokenTypes.KeywordIf, _
		"import" : Possum.TokenTypes.KeywordImport, _
		"is" : Possum.TokenTypes.KeywordIs, _
		"not" : Possum.TokenTypes.KeywordNot, _
		"or" : Possum.TokenTypes.KeywordOr, _
		"pass" : Possum.TokenTypes.KeywordPass, _
		"quit" : Possum.TokenTypes.KeywordQuit, _
		"repeat" : Possum.TokenTypes.KeywordRepeat, _
		"return" : Possum.TokenTypes.KeywordReturn, _
		"skip" : Possum.TokenTypes.KeywordSkip, _
		"static" : Possum.TokenTypes.KeywordStatic, _
		"super" : Possum.TokenTypes.KeywordSuper, _
		"then" : Possum.TokenTypes.KeywordThen, _
		"this" : Possum.TokenTypes.KeywordThis, _
		"until" : Possum.TokenTypes.KeywordUntil, _
		"var" : Possum.TokenTypes.KeywordVar, _
		"while" : Possum.TokenTypes.KeywordWhile, _
		"xor" : Possum.TokenTypes.KeywordXor, _
		"yield" : Possum.TokenTypes.KeywordYield, _
		"false" : Possum.TokenTypes.BooleanFalse, _
		"true" : Possum.TokenTypes.BooleanTrue, _
		"nothing": Possum.TokenTypes.Nothing _
	)

Return mReservedWords

This way, the first time ReservedWords is accessed by any instance of PossumScanner, we create our dictionary. Subsequent calls to the ReservedWords will simply return the already created dictionary.

`Source`

The source code string the scanner is tokenising.

`Tokens()`

The tokens produced by this scanner after tokenising its source code.

`TokenStart`

The 0-based index in the source code where the current token begins.

`UnclosedCurlyCount`

The number of unclosed curly brackets.

`UnclosedParenCount`

The number of unclosed parentheses.

`UnclosedSquareCount`

The number of unclosed square brackets.

Kick Off

We will start off the tokenisation process by passing some Possum source code to an instance of our scanner:

Var scanner = New PossumScanner(someSourceCode)

// We need an array of tokens to store the result.
Var tokens() As PossumToken

// Our scanner raises a PossumScannerException if something goes wrong 
// with the tokenisation process so make sure we catch it.
Try
  tokens = Scanner.Tokenise
Catch e As PossumScannerException
  // Handle the error here.
End Try

// At this point - we have an array of tokens we can use.

Constructing The Scanner

In the code above, you’ll see that the scanner takes some Possum source code as part of its constructor. We need to implement that:

Sub Constructor(source As String)
	// First we need to make sure that the StringKit module is initialised.
	// This is a limitation of StringKit, not Possum.
	StringKit.Initialise
	
	// Initialise our properties.
	Self.Source = source
	Self.Current = 0
	IndentStack = Array(0) // Put the base level of indentation on the stack.
	UnclosedCurlyCount = 0
	UnclosedParenCount = 0
	UnclosedSquareCount = 0
	Self.TokenStart = 0
	
	// Standardise line endings to UNIX ones.
	Var standardised As String = source.ReplaceLineEndings(EndOfLine.UNIX)

	// Split the standardised source code into individual characters.
	Chars = standardised.Split("")
	
	// Since we'll be checking for the upper bounds of Chars a lot, cache the value now.
	CharsLastRowIndex = Chars.LastRowIndex
	
	// Similarly, we'll be checking the number of characters a lot. Cache it.
	CharsCount = Chars.Count
End Sub

“What’s this standardisation business about?” I hear you cry. Well, since Possum is cross-platform, we have to account for the fact that if the scanner is running on macOS but the source code was created on Windows (or vice versa) that the character used to represent a line ending will be different. I’ve settled on using the UNIX line ending character (&h0A). It doesn’t really matter though as long as we’re consistent. We therefore use Xojo’s built in String.ReplaceLineEndings() method to replace all line endings with the UNIX one.

Our scanner will step through each character one-by-one so it makes sense to split the now standardised string into its constituent characters up front now rather than repeated calls to String.Middle() which would be much slower.

The last couple of lines are just there for small performance gains as it’s quicker to compare against a fixed integer than keep on calling a method on an array.

All that’s left now is to implement the Tokenise() method. That’s a lot of work and a lot of text so we’ll carry that on in the next post.

npalardy · 1 May 2020 17:51

The Code project is one I’ve subscribed to for a while and they often have useful articles wort stealing ideas from
And this one just came in
Constructing Fast Lexical Analyzers with RE/flex - Why Another Scanner Generator?

It might be a useful read to steal ideas from

Garry · 1 May 2020 19:34

Good article - a good read. Bookmarked for future reference.

Thanks.