Martin Spasov

Further improvements to the tokenizer and parser

03.11.2025 - 16.11.2025

Disclaimer: Please take everything I write here with a grain of salt—I’m by no means an expert. Some of these concepts I’m encountering for the first time, and my understanding might have significant gaps—and that’s okay. It’s one of the reasons I decided to take on this project. If you see anything wrong, let me know and I’ll fix it.

I was able to complete 2 of the 4 goals for this sprint.

Math and SVG tag support

The initial implementation is done: tags are now parsed correctly in most test cases, and I was able to add 160 new tests. It is going to be a while before I do anything else with these tags, as full support is a lot of work. My plan for now is to just render these nodes as text.

Supported  Test suite name
Yes        adoption01.dat
Yes        adoption02.dat
Yes        blocks.dat
Yes        comments01.dat
Yes        doctype01.dat
No         domjs-unsafe.dat
Yes        entities01.dat
Yes        entities02.dat
No         foreign-fragment.dat
Yes        html5test-com.dat
Yes        inbody01.dat
No         isindex.dat
Yes        main-element.dat
No         math.dat
No         menuitem-element.dat
No         namespace-sensitivity.dat
Yes        noscript01.dat
Yes        pending-spec-changes-plain-text-unsafe.dat
Yes        pending-spec-changes.dat
No         plain-text-unsafe.dat
No         quirks01.dat
Yes        ruby.dat
Yes        scriptdata01.dat
No         search-element.dat
No         svg.dat
Yes        tables01.dat
No         template.dat
Yes        tests1.dat
Yes        tests10.dat
Yes        tests11.dat
No         tests12.dat
Yes        tests14.dat
Yes        tests15.dat
Yes        tests16.dat
Yes        tests17.dat
Yes        tests18.dat
Yes        tests19.dat
Yes        tests2.dat
Yes        tests20.dat
No         tests21.dat
Yes        tests22.dat
Yes        tests23.dat
Yes        tests24.dat
Yes        tests25.dat
Yes        tests26.dat
Yes        tests3.dat
No         tests4.dat
Yes        tests5.dat
Yes        tests6.dat
Yes        tests7.dat
Yes        tests8.dat
No         tests9.dat
No         tests_innerHTML_1.dat
Yes        tricky01.dat
Yes        webkit01.dat
Yes        webkit02.dat

CSS tokenizer

The HTML parser and tokenizer are progressing nicely, and there are now plenty of tests validating the logic. There’s still work to be done, but the next steps aren’t entirely clear yet. I’d like to get a better sense of how the calling code will eventually use the parser and the DOM tree. In my opinion, the best way to explore that is to start rendering simple buffers to the screen. Before I can do that, though, I need to implement version 1 of the CSS parser and the CSS DOM.

I’ve already started working on the CSS tokenizer and plan to continue during the next sprint. The tokenization logic seems simpler than the HTML side, so hopefully I’ll have something functional within the next couple of weeks.
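To give a rough idea of why the CSS side looks simpler, here is a minimal sketch of a tokenizer loop in Python. It is not the project's actual code, and Python is only used for illustration. The token type names follow the CSS Syntax Module, but everything else (the `Token` class, the helper names, the heavily reduced rules) is an assumption made for this example. It ignores comments, strings, escapes, decimals, signs, exponents, and identifiers that start with a hyphen.

```python
from dataclasses import dataclass

WHITESPACE = " \t\n"
DIGITS = "0123456789"

# Single-character tokens, named after the CSS Syntax Module token types.
PUNCTUATION = {
    "{": "{-token",
    "}": "}-token",
    "(": "(-token",
    ")": ")-token",
    ":": "colon-token",
    ";": "semicolon-token",
    ",": "comma-token",
}


@dataclass
class Token:  # hypothetical container, not the project's type
    type: str
    raw: str


def is_ident_start(ch: str) -> bool:
    # Simplified: the real rules also allow escapes and non-ASCII code points.
    return ch.isascii() and (ch.isalpha() or ch == "_")


def is_ident_char(ch: str) -> bool:
    return ch.isascii() and (ch.isalnum() or ch in "_-")


def tokenize(source: str) -> list[Token]:
    tokens = []
    i = 0
    while i < len(source):
        start = i
        ch = source[i]
        if ch in WHITESPACE:
            # Collapse a run of whitespace into one token.
            while i < len(source) and source[i] in WHITESPACE:
                i += 1
            kind = "whitespace-token"
        elif is_ident_start(ch):
            while i < len(source) and is_ident_char(source[i]):
                i += 1
            kind = "ident-token"
        elif ch in DIGITS:
            # Integers only; a trailing unit or "%" changes the token type.
            while i < len(source) and source[i] in DIGITS:
                i += 1
            if i < len(source) and is_ident_start(source[i]):
                while i < len(source) and is_ident_char(source[i]):
                    i += 1
                kind = "dimension-token"
            elif i < len(source) and source[i] == "%":
                i += 1
                kind = "percentage-token"
            else:
                kind = "number-token"
        else:
            # Known punctuation gets its own type; anything else is a delim.
            i += 1
            kind = PUNCTUATION.get(ch, "delim-token")
        tokens.append(Token(kind, source[start:i]))
    return tokens


for token in tokenize("a { margin: 10px; }"):
    print(token.type, repr(token.raw))
```

One property worth keeping even in a real implementation: joining the `raw` text of all tokens gives back the original input, which makes it easy to check that no characters were dropped.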

I’ll be using the tests from this repository: https://github.com/romainmenke/css-tokenizer-tests. I’ll need to add a translation layer similar to the one used for the HTML parser tests, but that should be fairly straightforward.
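As a rough illustration of what such a translation layer might do, here is a small Python sketch. It assumes each test case provides a JSON array of expected tokens, each with `type` and `raw` fields. Those field names, the sample data, and the helper names `load_expected` and `first_mismatch` are assumptions for this example, so the repository's documentation should be checked for the real layout. The actual files may also carry extra fields that this sketch ignores.

```python
import json

# Hypothetical expected-token data, inlined so the example runs without files.
SAMPLE = """
[
    {"type": "ident-token", "raw": "a"},
    {"type": "whitespace-token", "raw": " "},
    {"type": "{-token", "raw": "{"}
]
"""


def load_expected(json_text: str) -> list[tuple[str, str]]:
    """Translate the suite's JSON into plain (type, raw) pairs."""
    return [(entry["type"], entry["raw"]) for entry in json.loads(json_text)]


def first_mismatch(actual, expected):
    """Return the index of the first differing token, or None if they match."""
    for index, (got, want) in enumerate(zip(actual, expected)):
        if got != want:
            return index
    if len(actual) != len(expected):
        # One list is a strict prefix of the other.
        return min(len(actual), len(expected))
    return None


expected = load_expected(SAMPLE)
# In practice this list would come from the tokenizer under test.
actual = [("ident-token", "a"), ("whitespace-token", " "), ("{-token", "{")]
print(first_mismatch(actual, expected))
```

Reducing both sides to simple pairs keeps the comparison independent of how the tokenizer represents tokens internally, and reporting the first differing index tends to make failures easier to track down than a plain pass/fail result.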

Martin