Compiler From Scratch: Phase 1 - Tokenizer Generator 017: Fixing encoding issues, more build testing

8 months ago

Technology Software & Development Programming C Compilers Tokenizer Lexer NFA DFA

Streamed on 2024-11-08 (https://www.twitch.tv/thediscouragerofhesitancy)

Zero Dependencies Programming!

Picking up where we left off last week, the build permutation testing script is mostly working and has revealed that the different encodings supported aren't equally well supported. So I started out by turning the UTF8CharRef into just CharRef, where each CharRef has a flag for its own encoding, and a statically set default encoding. The VVProject sets the CharRef default encoding once it parses that encoding and then the rest of the parsing works the same as before. Fixing a couple of other things here and there got all the encodings to build.

I added some code for testing. Each encoding has a pre-set string to tokenize. Then the Tokenizer generated by our test script is compiled and run. Regardless of build option, the expected outputs are the same. It verifies the number of tokens, the number of lines and the length (in bytes) of the longest token.

With an actual test actually running it was time to run all the permutations. This testing revealed at the end of the stream that there is a bug when hitting the end of the text buffer while using LAZY processing. We tried a couple of things, but didn't have time to debug it. That is where we'll pick up next week.

Loading comments...

Comments