Tokenizing Sentences

While reading, humans unconsciously combine letters into words before recognizing grammatical structure. As adults, we are so good at it that we don’t notice doing it. Beginning readers, on the other hand, trace along with a finger, trying to sound out each word.

Reading Morse code exposes this hidden process nicely by forcing us to move character by character. Before interpreting a sentence as English, we have to convert the dots and dashes to letters and then combine the letters into words. Once we have words, we can apply English grammatical structure. For example, here is print 34 in Morse code:

 
.--. .-. .. -. - ...-- ....-
 
p r i n t 3 4
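
The two steps described above (dots and dashes to characters, then characters to words) can be sketched in a few lines of Python. This is only an illustration, not code from the book: the names `MORSE_TO_CHAR`, `decode_morse`, and `group_into_tokens` and the token kinds `NAME` and `INT` are hypothetical, and the Morse table is deliberately partial, covering only the symbols in the example. Because the signal has no explicit word gap, the sketch assumes that a run of letters forms one word and a run of digits forms another.

```python
# Partial Morse table: only the symbols needed for the "print 34" example.
MORSE_TO_CHAR = {
    ".--.": "p",
    ".-.": "r",
    "..": "i",
    "-.": "n",
    "-": "t",
    "...--": "3",
    "....-": "4",
}


def decode_morse(signal):
    """Step 1: convert each space-separated dot-dash group to one character."""
    return [MORSE_TO_CHAR[group] for group in signal.split()]


def group_into_tokens(chars):
    """Step 2: merge adjacent characters of the same class into one token.

    Assumption for this sketch: digits form INT tokens, everything else NAME.
    """
    tokens = []
    for ch in chars:
        kind = "INT" if ch.isdigit() else "NAME"
        if tokens and tokens[-1][0] == kind:
            # Same class as the previous character: extend the current token.
            tokens[-1] = (kind, tokens[-1][1] + ch)
        else:
            # Class changed (or first character): start a new token.
            tokens.append((kind, ch))
    return tokens


chars = decode_morse(".--. .-. .. -. - ...-- ....-")
print(chars)                     # ['p', 'r', 'i', 'n', 't', '3', '4']
print(group_into_tokens(chars))  # [('NAME', 'print'), ('INT', '34')]
```

Only after the second step do we have units, `print` and `34`, to which grammatical structure could be applied.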

Recognizers that feed off character streams are called tokenizers or lexers (see Pattern ...
