Best way to implement a case sensitive dictionary

Tons of languages support some form of generics
And this fits in VERY nicely with Xojo’s “strongly typed” focus
Variants are used in many places because it was the ONLY way to support something remotely like a “generic” but they are so loosely typed they cause as many issues as they solve

They literally let you escape the type system

I’d love to see that if you ever come across it.

I’m pretty sure I can knock a simple trie out though with a Select Case statement since the language only has about 30ish keywords. What is a little annoying is that my language is case sensitive and Xojo’s String equality operations aren’t without the verbose String.Compare(other, ComparionOptions.CaseSensitive) incantation. Bloody API 2.0.

keywords ?
what is this even for that you’re just using keywords ?

a trie is overkill for this :slight_smile:

This is for Possum’s tokeniser.

When the tokeniser is at the start of a new token and needs to know if it’s a variable identifier name or a reserved word/keyword. I’m just looking for the fastest way to determine that. I could just consume the run of valid alphanumeric characters and lookup that String in a case sensitive Dictionary containing all the keywords as keys. If it’s in the Dictionary then it is a keyword. I was just wondering if using a trie is faster than hashing the lexeme then looking up the key.

For example take the following (non-sensical) source code:

123 true
    ^ 
    ^
    current pointer

The tokeniser is at t. For all it knows, this could be a valid identifier named t, could be the start of an identifier named tried or might be the start of the true reserved word. I could just gobble up all the contiguous alpha numeric characters (in this case “true”) and look that up in a case sensitive dictionary. It’ll find true as a key and we therefore know its a reserved word.

However, that obviously relies on the Dictionary class under the hood hashing the String “true”, finding the correct bin in the Dictionary then retrieving the value. I’m wondering if doing something like this is better (pseudocode):

Select Case firstLetter
Case "t"
  If sourceLength - currentPosition > 3 then // At least 3 more letters...
    If NextCharacter("r") Then
      // Could be "true". Need to check the rest...
    Else
     // Must be an identifier but isn't "true"
    End If
  End If
End Select

It’s sure uglier than:

// Consume the run of characters to get the lexeme "true"
If keywords.HasKey(lexeme) Then
  // It's a keyword
Else
  // It's an identifier.
End If

But it might be faster. Like I said earlier, I’ll probably use the Dictionary until later when I’m trying to squeeze more performance out of the tokeniser.

ah ok …

tokenizers I’ve written before used a finite state machine and thats what things like Flex cranks out when you hand it a grammar (they are complex as hell to read but still they are fsm’s)

and then when in any given state you have a list of whats allowable as the next input char, and what state things transition TO on certain characters

and if the character isnt one of the allowed characters what error state you transition to