Finite state machines

Wikipedia defines a finite state machine as

an abstract machine that can be in exactly one of a finite number of states at any given time
and they are very handy in all kinds of ways in computing

You can use them to create parsers, protocol handlers, and lots of other tasks that require your code to react to incoming data in very regular and predictable ways.

Lets look at creating one to parse a floating point value from a text input.
We’re JUST going to create the machine for parsing the floating point value not a full blown tokenizer which might need different inputs, outputs and a better way to report errors in whatever it gets to parse.

The first thing we need to do is to describe our machine, the states it will have and the things that cause it to transition from one state to another. Often this is done in a digram form and it how I usually get started.
There ARE other forms as well but this is the one I tend to like.
This is the one I drew off the top of my head for this tutorial

Everything starts at the top left with the “Start” state.
This is our initial state and we NEVER return to this state while we are working.
In the start state there are only a few things that can occur

  1. we have an empty input string and so return - this may be an error
  2. we get a + or - which we retain, and we change into the “leading sign” state
  3. we get a digit - which we retain, and we change into the “leading digits” state
  4. we get a . - which we retain, and we change into the “leading decimal” state

Note that every start has one, or more, incoming lines from another state.
Each of those lines is labelled with the character we got to cause us to transition to the state that the line connects to.

Some states may have a line that connects back to itself. This might be something like a run of digits where any number of digits is allowed one after the other and so they are just accumulated until something else causes a transition.

One of the problems with my initial diagram is noted at the lower left.
Its possible that “no char” is an error AND also that its perfectly OK and we should exit cleanly.
If we get just a “.” then we would get a sequence of “.” followed by “no char” which is NOT a legal floating point value.
However, we could also get a digit followed by a “.” and that IS legal.
Ambiguities in such a machine are bad.
While it wouldnt be too hard to handle is thing small machine eliminating them is a useful thing to do especially if you want to create a much larger one. Getting rid of the ambiguities means you dont have to have “special case code” for them and debugging the machine becomes simpler.

So here’s a better effort

This removes the ambiguity about the sequence “.” vs “digit.”
But it still has an issue if you follow the sequence “+.” - when we get to the no char case we could consider this legal input but its clearly not.

But this case is much simpler to fix. Simply redirect the line leaving the + - handling node to the new node we inserted at the lower left so we would HAVE to get some digits following he “.” to consider input valid.

Here’s the third attempt at this (and this is why I tend to do this on paper or wipe boards or something like that LONG before I try code because its so much easier to do this way before you have committed to a lot of code that may be hard to spot such oversights and errors)

Now if we, by hand, try sequences like :

  +      // invalid
  +.     // invalid 
  0.0.0 // invalid
  +.0   // valid
  0.8  // valid
  -1.9 // valid

So it seems that we have a decent machine that we can work with.
And yes I know there are states I missed :wink:

One thing to notice about the diagram is that every state should cover all the inputs. They should account for “no character available”, a valid character that causes a state transition, and those that cause invalid transition which is usually “everything else” not valid)
My diagram does have several places where “everything else” is not handled.

In a lot of diagrams you see people try to label the circles, or states, sometimes with a nice name that makes sense. We could do that but often I just put numbers on them. This is up to you.
So far I have left mine unlabelled and in a small machine like this it might be easy to com up with nice named constants for them. In a much larger machine that might be a lot harder.
Work with whatever makes sense to you.

I’m just going to number them starting with 0 (for start) and work my way up as I write the code to handle each one.

As well I probably need to label the error AND exit points to make things clearer.
(sorry for the slightly messy diagram - I missed one error state label and redid them all and had not white out so …)

But now that everything has a label we can certainly turn this almost directly into code.

Our little machine is just going to take a single input and produce one output.

I’m starting with a new desktop project.
To the default Window1 I’m going to add one method runMachine(input as string) as string
I’m not going to put any code in there (yet)
Also add the Open event to Window1.
In there lets start with some test code. Lets start with the cases we evaluated by hand earlier.


dim testString as string
dim result as string

testString = "+"      // invalid
result = runMachine( testString )

testString = "+."     // invalid 
result = runMachine( testString )

testString = "0.0.0" // invalid
result = runMachine( testString )

testString = "+.0"   // valid
result = runMachine( testString )

testString = "0.8"  // valid
result = runMachine( testString )

testString = "-1.9" // valid
result = runMachine( testString )

Obviously we’ll need some way for our machine to signal errors back to our other code or there really isnt much point is there ?

At this point I’m just going to use an exception. But, in many parsers/tokenizers you might NOT want to do this and may need different information returned. Since the tokenizer is just reading a continuous stream of characters what you may want back could be just the portion that could be read and then the next call to a tokenizer maybre reads the next portion. The next thing may be some other valid token - suppose you had “+.09+4” The second + in that string is a valid “token” in a mathematical expression but its not part of the floating point value we’re reading.

But, thats a different design choice.

Since we’re just going to use an exception lets alter the code to



dim testString as string
dim result as string

try
  testString = "+"      // invalid
  result = runMachine( testString )
Catch error As UnsupportedFormatException
  Break
End Try

try
  testString = "+."     // invalid 
  result = runMachine( testString )
Catch error As UnsupportedFormatException
  Break
End Try

try
  testString = "0.0.0" // invalid
  result = runMachine( testString )
Catch error As UnsupportedFormatException
  Break
End Try
try
  testString = "+.0"   // valid
  result = runMachine( testString )
Catch error As UnsupportedFormatException
  Break
End Try

try
  testString = "0.8"  // valid
  result = runMachine( testString )
Catch error As UnsupportedFormatException
  Break
End Try

try
  testString = "-1.9" // valid
  result = runMachine( testString )
Catch error As UnsupportedFormatException
  Break
End Try

This _ should_ be all the testing set up we need.
Now lets work on our machine.

First things first. We will need a local to represent the current state. As well we are going to need one to hold the current position in the input, and one to hold the result we have accumulated so far.

Dim currentState As Integer = 0
Dim charpos As Integer = 1 // mid starts at 1 
Dim result As String

Out machine tries to continue until it either returns a result OR it runs out of characters. And it it runs through all the input then in most cases this means we had success and what we were given was valid. Except in the case that the input was empty - although in some cases that may be allowable its not in our example.

So our code should be encapsulated in a loop.
This loop will either consume all the input OR encounter an error.
Fortunately for us Xojo’s mid returns “” when you ask for an out of bounds character in a string. We could code for this case but dont have to.

if input = "" then
  // this is error state 1 !!!
  Dim tmpException As New UnsupportedFormatException
  tmpException.Message = "State " + Str(currentState) + " got empty string" 

  Raise tmpException
end if

While true
   // one char at a time
  Dim currentChar As String = Mid(Input, charPos, 1)

  charpos = charpos + 1

  // the entire code for  out state machine will be in here !
  // down to the end of th while loop 

Wend

// if we get here we ate the entire string successfully and its valid !
Return result

And we can then start coding up our machine directly from the diagrams

The easiest way to do this is often in a select case statement rather than a long list of if statements.


// given the current state
// how should this character be handled ?

Select Case currentState
  
Case 0
  
  If currentChar = "+" Or currentChar = "-" Then
    // retain this char as part of the result we want to accumulate
    result = result + currentChar
    
    // switch to the new state
    currentState = 1
  Elseif currentChar = "." Then
    // retain this char as part of the result we want to accumulate
    result = result + currentChar
    
    // switch to the new state
    currentState = 3
  Elseif InStr("0123456789",currentChar) > 0 Then
    // retain this char as part of the result we want to accumulate
    result = result + currentChar
    
    // switch to the new state
    currentState = 2
  Else
    Dim tmpException As New UnsupportedFormatException
    tmpException.Message = "State " + Str(currentState) + " got unexpect character [" + currentChar + "]" 
    
    Raise tmpException
  End If
  
Else 
  Break
End Select

you can see that we have taken our diagram and, for state 0 (start), simply added one IF for each of the conditions. And if we need to retain the character that was encountered we append it to the result. As well if that character results in a new state we set the current state variable.

And we progress through our diagram with each state until they are all written.

Note that in the select case I also added an ELSE which is just in case we somehow make a mistake and the currentState variable gets set to a value that is not handled. It makes tracking down invalid states easier if you include it.

And, in state 0 you notice that the ELSE clause of the if is where we detected that there is an error in the input and raise an exception. This is similar in the other states we’ll translate from the diagram.

When you’re done you runMachine might look like

Dim currentState As Integer = 0
Dim charpos As Integer = 1 // mid starts at 1 
Dim result As String

If Input = "" Then
  // this is error state 1 !!!
  Dim tmpException As New UnsupportedFormatException
  tmpException.Message = "State " + Str(currentState) + " got empty string" 
  
  Raise tmpException
End If

While True
  
  // one char at a time
  Dim currentChar As String = Mid(Input, charPos, 1)
  
  charpos = charpos + 1
  
  // given the current state
  // how should this character be handled ?
  
  Select Case currentState
    
  Case 0
    
    If currentChar = "+" Or currentChar = "-" Then
      // retain this char as part of the result we want to accumulate
      result = result + currentChar
      
      // switch to the new state
      currentState = 1
    Elseif currentChar = "." Then
      // retain this char as part of the result we want to accumulate
      result = result + currentChar
      
      // switch to the new state
      currentState = 3
    Elseif InStr("0123456789",currentChar) > 0 Then
      // retain this char as part of the result we want to accumulate
      result = result + currentChar
      
      // switch to the new state
      currentState = 2
    Else
      Dim tmpException As New UnsupportedFormatException
      tmpException.Message = "State " + Str(currentState) + " got unexpected character [" + currentChar + "]" 
      
      Raise tmpException
    End If
    
    
    
  Case 1
    If currentChar = "" Then
      Dim tmpException As New UnsupportedFormatException
      tmpException.Message = "State " + Str(currentState) + " got unexpected end of input"
      
      Raise tmpException
    Elseif currentChar = "." Then
      result = result + currentChar
      currentState = 3
    Elseif InStr("0123456789",currentChar) > 0 Then
      // retain this char as part of the result we want to accumulate
      result = result + currentChar
      
      // switch to the new state
      currentState = 2
    Else
      Dim tmpException As New UnsupportedFormatException
      tmpException.Message = "State " + Str(currentState) + " got unexpected character [" + currentChar + "]" 
      
      Raise tmpException
    End If
    
    
  Case 2
    If currentChar = "" Then
      // no error
      Return result
    Elseif currentChar = "." Then
      result = result + currentChar
      currentState = 4
    Elseif InStr("0123456789",currentChar) > 0 Then
      // retain this char as part of the result we want to accumulate
      result = result + currentChar
      
      // switch to the new state
      currentState = 2
    Else
      Dim tmpException As New UnsupportedFormatException
      tmpException.Message = "State " + Str(currentState) + " got unexpected character [" + currentChar + "]" 
      
      Raise tmpException
    End If
    
    
  Case 3
    If currentChar = "" Then
      Dim tmpException As New UnsupportedFormatException
      tmpException.Message = "State " + Str(currentState) + " expected additional input"
      
      Raise tmpException
    Elseif InStr("0123456789",currentChar) > 0 Then
      // retain this char as part of the result we want to accumulate
      result = result + currentChar
      
      // switch to the new state
      currentState = 4
    Else
      Dim tmpException As New UnsupportedFormatException
      tmpException.Message = "State " + Str(currentState) + " got unexpected character [" + currentChar + "]" 
      
      Raise tmpException
    End If
    
    
  Case 4
    If currentChar = "" Then
      // no error
      Return result
    Elseif InStr("0123456789",currentChar) > 0 Then
      // retain this char as part of the result we want to accumulate
      result = result + currentChar
      
      // switch to the new state
      currentState = 5
    Else
      Dim tmpException As New UnsupportedFormatException
      tmpException.Message = "State " + Str(currentState) + " got unexpected character [" + currentChar + "]" 
      
      Raise tmpException
    End If
    
  Case 5
    If currentChar = "" Then
      // no error
      Return result
    Elseif InStr("0123456789",currentChar) > 0 Then
      // retain this char as part of the result we want to accumulate
      result = result + currentChar
      
      // switch to the new state
      currentState = 5
    Else
      Dim tmpException As New UnsupportedFormatException
      tmpException.Message = "State " + Str(currentState) + " got unexpected character [" + currentChar + "]" 
      
      Raise tmpException
    End If
    
  Else 
    Break
  End Select
  
  
Wend

// if we get here we ate the entire string successfully and its valid !
Return result

and there you have the basics of how to implement a finite state machine from scratch to implementation.

Code is available here

2 Likes

Thanks for such a comprehensive tutorial @npalardy. Really useful and clearly explained. I of course now hate you as I think I owe it to myself to rewrite my scanner to use state machines. They look much more efficient.

They can be applied to a wide set of computing tasks like tokenizers, protocol handlers and many many other things in code.

Norman very well laid out and easy to follow.
Henry

Glad you liked it !

Without reading the details, seems a great educational post. Congratulations for the effort.

I post them because I think folks will find them useful
Others can too :slight_smile:

1 Like

I like such kind of content and thinking very much.