OK, now we’ve “sorted out” our grammar and explained what tokens are, we’re finally in position to walk through Possum’s scanner line by line. This post will take a look at the
PossumScanner class whose task is to take a Xojo
String representing Possum source code and spit out an array of
PossumToken instances which will be subsequently fed into Possum’s parser.
I had considered writing Possum such that it was an entirely self-contained project with no external dependencies. Whilst Possum doesn’t need any external plugins it does depend on a few of my own external modules for common functionality. Fortunately these are freely available on my GitHub account and can be used in many other projects. The only dependency that the scanner depends on is my StringKit module which provides common
String manipulation methods. You’ll need to add this to your project.
Since Possum is a case-sensitive language, we need to implement our own version of Xojo’s
Dictionary class that will treat
String keys of different cases as being different. The native
Dictionary class doesn’t do this. Fortunately, this is easy to overcome with API 2.0 by subclassing the native
Dictionary. We’ll call this class
Class CaseSensitiveDictionary Inherits Dictionary Public Sub Constructor(ParamArray entries As Pair) Super.Constructor(AddressOf KeyComparer) For Each keyValue As Pair In entries Self.Value(keyValue.Left) = keyValue.Right Next keyValue End Sub Private Function KeyComparer(leftKey As Variant, rightKey As Variant) as Integer If leftKey.Type = Variant.TypeString And rightKey.Type = Variant.TypeString Then // Both keys are strings so do a case-sensitive comparison. Return leftKey.StringValue.Compare(rightKey.StringValue, ComparisonOptions.CaseSensitive) Else If leftKey = rightKey Then Return 0 ElseIf leftKey < rightKey Then Return -1 Else Return 1 End If End If End Function End Class
Now we can create a case-sensitive dictionary like so:
Var cd As New CaseSensitiveDictionary cd.Value("hello") = True If cd.HasKey("hello") Then // This is True. End If If cd.HasKey("Hello") Then // This is False. End If
Next up, we need a class to represent a token. This class needs to store the following information:
- The token’s lexeme These are the actual characters that the token is comprised of
- The (1-based) line number of the source code that this token originated on
- The (0-based) index within the original source code of the first character of this token
- The type of token (e.g. keyword, number, etc). This will be an enumeration.
Here’s a basic structure for the class:
Class PossumToken Property Lexeme As String Property Line As Integer = 1 Property Position As Integer = 0 Property Type As Possum.TokenTypes End Class
You’ll notice that we have an enumeration called
TokenTypes which I have embedded within a module called
Possum. The reason for the separate module is because as the pipeline for Possum grows, we will need more properties and methods accessible to the parser, scanner, compiler, etc that are best suited within a separate module. You may be asking at this point, why not put all classes within the
Possum module and stop prefixing classes with
Possum altogether? Well, this is because I want to have all of Possum’s classes as external dependencies to a project. This means that I can edit and enhance them in one project and have those changes propagate to any project that is using them. Unfortunately, the Xojo IDE cannot externalise a module if it contains classes.
We have many different types of token. Here is the
TokenTypes enumeration in full:
Protected Enum TokenTypes Ampersand BooleanFalse BooleanTrue Caret Colon Comma Dedent Dot EOF Equal EqualEqual ExclusiveRange ForwardSlash ForwardSlashEqual Greater GreaterEqual GreaterGreater Identifier InclusiveRange Indent KeywordAnd KeywordAs KeywordBlock KeywordClass KeywordConstructor KeywordDownTo KeywordElse KeywordElseIf KeywordExit KeywordFor KeywordForEach KeywordForeign KeywordFunction KeywordImport KeywordIf KeywordIs KeywordNot KeywordOr KeywordPass KeywordQuit KeywordRepeat KeywordReturn KeywordSkip KeywordStatic KeywordSuper KeywordThen KeywordThis KeywordUntil KeywordVar KeywordWhile KeywordXor KeywordYield LCurly LeftArrow LParen LSquare Less LessEqual LessLess Minus MinusEqual NotEqual Nothing NumberLiteral Percent PercentEqual Pipe Plus PlusEqual PropertyIdentifier Query QueryQuery RCurly RParen RightArrow RSquare Star StarEqual StaticPropertyIdentifier StringLiteral Terminator Tilde Underscore End Enum
As good API 2.0 Xojo programmers, we will use exceptions to handle errors. To make it easier, let’s create our own
PossumScannerException class that will be raised by our scanner whenever something bad happens. It’ll be a subclass of Xojo’s
RuntimeException with two additional properties to store the 1-based line number and 0-based character localising where the error occurred in the source code:
Class PossumScannerException Inherits RuntimeException Property LineCharacter As Integer = 0 Property LineNumber As Integer = 1 End Class
Finally, we need a scanner class to do the tokenisation. Imaginatively we’ll call it
PossumScanner. There are a number of properties that this class needs to function correctly and I’ll explain them momentarily. We will add the methods later as we work through this article.
Class PossumScanner Private Property AtLineStart as Boolean = True Private Property Chars() as String Private Property CharsCount as Integer = 0 Private Property CharsLastRowIndex as Integer = -1 Private Property Current as Integer = 0 Private Property IndentStack() as Integer Private Property LineNumber as Integer = 1 Private Property Lines() as String Public Property ReservedWords as CaseSensitiveDictionary Get End Get End Property Private Property Source as String Private Property Tokens() as PossumToken Private Property TokenStart as Integer = 0 Private Property UnclosedCurlyCount as Integer = 0 Private Property UnclosedParenCount as Integer = 0 Private Property UnclosedSquareCount as Integer = 0 End Class
Below is a brief description of what each of the scanner’s properties do.
True if we are at the start of a new line.
An array containing the individual characters of the source code this scanner will tokenise. Line endings have been standardised to UNIX.
The number of characters in the
The cached last row index for the
The 0-based current position in the source code. Points to the next character to handle.
Used when computing the current level of indentation. We’ll cover this in a later topic.
The 1-based line number that the scanner is currently on.
An array containing each line of the source code this scanner is tokenising (including blank lines).
This is a
CaseSensitiveDictionary containing all of Possum’s keywords (plus
"nothing") as keys mapped to their token type (as the value). We use our custom
CaseSensitiveDictionary class described above because Possum is a case-sensitive language but Xojo is not.
Since this dictionary never changes we can take advantage of Xojo’s
Static keyword to create the dictionary once when it is first accessed by making the
ReservedWords property computed. In the property’s
Get method, put this code:
// This code will only execute once. Static mReservedWords As New CaseSensitiveDictionary(_ "and" : Possum.TokenTypes.KeywordAnd, _ "as" : Possum.TokenTypes.KeywordAs, _ "block" : Possum.TokenTypes.KeywordBlock, _ "class" : Possum.TokenTypes.KeywordClass, _ "constructor" : Possum.TokenTypes.KeywordConstructor, _ "downto" : Possum.TokenTypes.KeywordDownTo, _ "else" : Possum.TokenTypes.KeywordElse, _ "elseif" : Possum.TokenTypes.KeywordElseIf, _ "exit" : Possum.TokenTypes.KeywordExit, _ "for" : Possum.TokenTypes.KeywordFor, _ "foreach" : Possum.TokenTypes.KeywordForEach, _ "foreign" : Possum.TokenTypes.KeywordForeign, _ "function" : Possum.TokenTypes.KeywordFunction, _ "if" : Possum.TokenTypes.KeywordIf, _ "import" : Possum.TokenTypes.KeywordImport, _ "is" : Possum.TokenTypes.KeywordIs, _ "not" : Possum.TokenTypes.KeywordNot, _ "or" : Possum.TokenTypes.KeywordOr, _ "pass" : Possum.TokenTypes.KeywordPass, _ "quit" : Possum.TokenTypes.KeywordQuit, _ "repeat" : Possum.TokenTypes.KeywordRepeat, _ "return" : Possum.TokenTypes.KeywordReturn, _ "skip" : Possum.TokenTypes.KeywordSkip, _ "static" : Possum.TokenTypes.KeywordStatic, _ "super" : Possum.TokenTypes.KeywordSuper, _ "then" : Possum.TokenTypes.KeywordThen, _ "this" : Possum.TokenTypes.KeywordThis, _ "until" : Possum.TokenTypes.KeywordUntil, _ "var" : Possum.TokenTypes.KeywordVar, _ "while" : Possum.TokenTypes.KeywordWhile, _ "xor" : Possum.TokenTypes.KeywordXor, _ "yield" : Possum.TokenTypes.KeywordYield, _ "false" : Possum.TokenTypes.BooleanFalse, _ "true" : Possum.TokenTypes.BooleanTrue, _ "nothing": Possum.TokenTypes.Nothing _ ) Return mReservedWords
This way, the first time
ReservedWords is accessed by any instance of
PossumScanner, we create our dictionary. Subsequent calls to the
ReservedWords will simply return the already created dictionary.
The source code string the scanner is tokenising.
The tokens produced by this scanner after tokenising its source code.
The 0-based index in the source code where the current token begins.
The number of unclosed curly brackets.
The number of unclosed parentheses.
The number of unclosed square brackets.
We will start off the tokenisation process by passing some Possum source code to an instance of our scanner:
Var scanner = New PossumScanner(someSourceCode) // We need an array of tokens to store the result. Var tokens() As PossumToken // Our scanner raises a PossumScannerException if something goes wrong // with the tokenisation process so make sure we catch it. Try tokens = Scanner.Tokenise Catch e As PossumScannerException // Handle the error here. End Try // At this point - we have an array of tokens we can use.
Constructing The Scanner
In the code above, you’ll see that the scanner takes some Possum source code as part of its constructor. We need to implement that:
Sub Constructor(source As String) // First we need to make sure that the StringKit module is initialised. // This is a limitation of StringKit, not Possum. StringKit.Initialise // Initialise our properties. Self.Source = source Self.Current = 0 IndentStack = Array(0) // Put the base level of indentation on the stack. UnclosedCurlyCount = 0 UnclosedParenCount = 0 UnclosedSquareCount = 0 Self.TokenStart = 0 // Standardise line endings to UNIX ones. Var standardised As String = source.ReplaceLineEndings(EndOfLine.UNIX) // Split the standardised source code into individual characters. Chars = standardised.Split("") // Since we'll be checking for the upper bounds of Chars a lot, cache the value now. CharsLastRowIndex = Chars.LastRowIndex // Similarly, we'll be checking the number of characters a lot. Cache it. CharsCount = Chars.Count End Sub
“What’s this standardisation business about?” I hear you cry. Well, since Possum is cross-platform, we have to account for the fact that if the scanner is running on macOS but the source code was created on Windows (or vice versa) that the character used to represent a line ending will be different. I’ve settled on using the UNIX line ending character (
&h0A). It doesn’t really matter though as long as we’re consistent. We therefore use Xojo’s built in
String.ReplaceLineEndings() method to replace all line endings with the UNIX one.
Our scanner will step through each character one-by-one so it makes sense to split the now standardised string into its constituent characters up front now rather than repeated calls to
String.Middle() which would be much slower.
The last couple of lines are just there for small performance gains as it’s quicker to compare against a fixed integer than keep on calling a method on an array.
All that’s left now is to implement the
Tokenise() method. That’s a lot of work and a lot of text so we’ll carry that on in the next post.