Martin Spasov

Further improvements to the tokenizer and parser

03.11.2025 - 16.11.2025

Disclaimer: Please take everything I write here with a grain of salt—I’m by no means an expert. Some of these concepts I’m encountering for the first time, and my understanding might have significant gaps—and that’s okay. It’s one of the reasons I decided to take on this project. If you see anything wrong, let me know and I’ll fix it.

I was able to complete 2 of the 4 goals for this sprint.

add support for math tags - v1 done
add support for svg tags - v1 done
add support for template tags - this is a bit more involved, so I decided I’ll come back to it in the future
add support for NOT_SUPPORTED sections - there are no tests at the moment that would let me finish these sections

Math and SVG tag support

The initial implementation is done: tags are now parsed correctly in most test cases, and I was able to add 160 new tests. It will be a while before I do anything else with these tags, as full support is a lot of work. My plan for now is to render these nodes as text.

Supported	Test suite name
Yes	adoption01.dat
Yes	adoption02.dat
Yes	blocks.dat
Yes	comments01.dat
Yes	doctype01.dat
No	domjs-unsafe.dat
Yes	entities01.dat
Yes	entities02.dat
No	foreign-fragment.dat
Yes	html5test-com.dat
Yes	inbody01.dat
No	isindex.dat
Yes	main-element.dat
No	math.dat
No	menuitem-element.dat
No	namespace-sensitivity.dat
Yes	noscript01.dat
Yes	pending-spec-changes-plain-text-unsafe.dat
Yes	pending-spec-changes.dat
No	plain-text-unsafe.dat
No	quirks01.dat
Yes	ruby.dat
Yes	scriptdata01.dat
No	search-element.dat
No	svg.dat
Yes	tables01.dat
No	template.dat
Yes	tests1.dat
Yes	tests10.dat
Yes	tests11.dat
No	tests12.dat
Yes	tests14.dat
Yes	tests15.dat
Yes	tests16.dat
Yes	tests17.dat
Yes	tests18.dat
Yes	tests19.dat
Yes	tests2.dat
Yes	tests20.dat
No	tests21.dat
Yes	tests22.dat
Yes	tests23.dat
Yes	tests24.dat
Yes	tests25.dat
Yes	tests26.dat
Yes	tests3.dat
No	tests4.dat
Yes	tests5.dat
Yes	tests6.dat
Yes	tests7.dat
Yes	tests8.dat
No	tests9.dat
No	tests_innerHTML_1.dat
Yes	tricky01.dat
Yes	webkit01.dat
Yes	webkit02.dat
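
For readers who have not seen these files: the suite names above match the tree-construction tests from html5lib-tests, where each test is a run of sections introduced by header lines such as #data, #errors and #document. Below is a minimal sketch of how such a file could be split into test cases. It is written in Python purely for illustration and is not the translation layer used in this project; parse_dat and TreeTest are hypothetical names, and the section list is my assumption about which headers can appear.

```python
from dataclasses import dataclass, field

# Section headers assumed to appear in html5lib-style tree-construction files.
HEADERS = {
    "#data", "#errors", "#new-errors", "#document-fragment",
    "#script-on", "#script-off", "#document",
}


@dataclass
class TreeTest:
    data: str = ""                               # input handed to the parser
    errors: list = field(default_factory=list)   # expected parse errors
    fragment_context: str = ""                   # empty for full-document tests
    document: str = ""                           # expected tree, "| "-prefixed lines


def parse_dat(text):
    """Split the text of one .dat file into a list of TreeTest values."""
    tests = []
    sections = {}
    current = None

    def flush():
        # Emit the test collected so far, if a #data section was seen.
        if "#data" not in sections:
            return
        tests.append(TreeTest(
            data="\n".join(sections["#data"]),
            errors=[e for e in sections.get("#errors", []) if e],
            fragment_context="\n".join(
                sections.get("#document-fragment", [])).strip(),
            # Drop the blank separator line that precedes the next #data.
            document="\n".join(sections.get("#document", [])).rstrip("\n"),
        ))

    for line in text.split("\n"):
        if line == "#data":
            flush()          # a new #data header starts the next test
            sections = {}
        if line in HEADERS:
            current = line
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    flush()
    return tests
```

Each TreeTest can then be fed to the parser, with the resulting tree serialized into the same "| "-prefixed form and compared against document. Input that itself contains a line equal to one of the headers would confuse this sketch, so a real loader needs to be stricter.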

CSS tokenizer

The HTML parser and tokenizer are progressing nicely, and there are now plenty of tests validating the logic. There’s still work to be done, but the next steps aren’t entirely clear yet. I’d like to get a better sense of how the calling code will eventually use the parser and the DOM tree. In my opinion, the best way to explore that is to start rendering simple buffers to the screen. Before I can do that, though, I need to implement version 1 of the CSS parser and the CSS DOM.

I’ve already started working on the CSS tokenizer and plan to continue during the next sprint. The tokenization logic seems simpler than the HTML side, so hopefully I’ll have something functional within the next couple of weeks.

I’ll be using the tests from this repository: https://github.com/romainmenke/css-tokenizer-tests. I’ll need to add a translation layer similar to the one used for the HTML parser tests, but that should be fairly straightforward.
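
To make the idea of a translation layer concrete, here is a rough sketch in Python (used only for illustration, not the project's code). It assumes each test case in that repository is a directory holding a source.css file and a tokens.json file, and that every JSON entry carries type, raw, startIndex and endIndex fields. That is my reading of the repository layout and should be checked against it. ExpectedToken, translate and load_cases are hypothetical names.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ExpectedToken:
    kind: str    # token type name from the suite, e.g. "ident-token"
    raw: str     # exact slice of the source the token covers
    start: int   # start offset as recorded by the suite
    end: int     # end offset as recorded by the suite


def translate(entries):
    """Map decoded tokens.json entries onto ExpectedToken values."""
    return [
        ExpectedToken(e["type"], e["raw"], e["startIndex"], e["endIndex"])
        for e in entries
    ]


def load_cases(root):
    """Yield (case name, CSS source, expected tokens) per test directory."""
    root = Path(root)
    for source in sorted(root.glob("**/source.css")):
        expected = source.with_name("tokens.json")
        if not expected.exists():
            continue  # skip directories without expectations
        # Decode the raw bytes so that newlines are not normalized on read;
        # the tokenizer should see exactly what is in the file.
        css = source.read_bytes().decode("utf-8")
        entries = json.loads(expected.read_bytes().decode("utf-8"))
        yield source.parent.relative_to(root).as_posix(), css, translate(entries)
```

A test runner would then tokenize css, convert its own tokens into the same ExpectedToken shape, and compare the two lists. Keeping the comparison on a small neutral type means the tokenizer's internal token representation can change without touching the test data.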

Martin