Martin Spasov

Further improvements to the tokenizer and parser

20.10.2025 - 02.11.2025

Disclaimer: Please take everything I write here with a grain of salt—I’m by no means an expert. Some of these concepts I’m encountering for the first time, and my understanding might have significant gaps—and that’s okay. It’s one of the reasons I decided to take on this project. If you see anything wrong, let me know and I’ll fix it.

Sprint completed successfully with most of the planned changes going in.

Tokenizer improvements

I spent some time going over the code and cleaning up what I could. There are still things that need to be addressed, but that can be done in the future. The current changes include:

More parser tests

Most of the sprint was spent on parser improvements.

The html5lib-tests repo contains a total of 56 test files. At the moment the parser runs approximately 1,100 tests, which is a big improvement over the previous sprint (100 tests).

Supported  Test suite name
Yes        adoption01.dat
Yes        adoption02.dat
Yes        blocks.dat
Yes        comments01.dat
Yes        doctype01.dat
No         domjs-unsafe.dat
Yes        entities01.dat
Yes        entities02.dat
No         foreign-fragment.dat
Yes        html5test-com.dat
Yes        inbody01.dat
No         isindex.dat
Yes        main-element.dat
No         math.dat
No         menuitem-element.dat
No         namespace-sensitivity.dat
Yes        noscript01.dat
Yes        pending-spec-changes-plain-text-unsafe.dat
Yes        pending-spec-changes.dat
No         plain-text-unsafe.dat
No         quirks01.dat
Yes        ruby.dat
Yes        scriptdata01.dat
No         search-element.dat
No         svg.dat
Yes        tables01.dat
No         template.dat
Yes        tests1.dat
No         tests10.dat
No         tests11.dat
No         tests12.dat
Yes        tests14.dat
Yes        tests15.dat
Yes        tests16.dat
Yes        tests17.dat
Yes        tests18.dat
Yes        tests19.dat
Yes        tests2.dat
No         tests20.dat
No         tests21.dat
Yes        tests22.dat
Yes        tests23.dat
Yes        tests24.dat
Yes        tests25.dat
Yes        tests26.dat
Yes        tests3.dat
No         tests4.dat
Yes        tests5.dat
Yes        tests6.dat
Yes        tests7.dat
Yes        tests8.dat
No         tests9.dat
No         tests_innerHTML_1.dat
Yes        tricky01.dat
Yes        webkit01.dat
Yes        webkit02.dat
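
For readers who have not looked inside these .dat files, here is a rough sketch of how one of them can be split into individual tests. It is written in Python purely for illustration and is not this project's actual test runner. The names TreeTest, parse_dat and SAMPLE are made up for the example, and the handling of section headings reflects my reading of the html5lib-tests tree-construction format, so treat the details as assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Section headings that can appear in a tree-construction .dat file.
HEADINGS = {
    "#data", "#errors", "#new-errors", "#document-fragment",
    "#script-on", "#script-off", "#document",
}


@dataclass
class TreeTest:
    data: str                                 # HTML input fed to the parser
    errors: List[str] = field(default_factory=list)
    document: str = ""                        # expected tree dump ("| <html>" ...)
    fragment_context: Optional[str] = None    # set for fragment-parsing tests
    scripting: Optional[bool] = None          # None: run with scripting on and off


def _build(sections: Dict[str, List[str]]) -> TreeTest:
    # Tests are separated by a blank line, which ends up at the tail of #document.
    document = list(sections.get("#document", []))
    while document and document[-1] == "":
        document.pop()
    fragment = sections.get("#document-fragment")
    scripting = None
    if "#script-on" in sections:
        scripting = True
    elif "#script-off" in sections:
        scripting = False
    return TreeTest(
        data="\n".join(sections.get("#data", [])),
        errors=[e for e in sections.get("#errors", []) if e],
        document="\n".join(document),
        fragment_context=fragment[0] if fragment else None,
        scripting=scripting,
    )


def parse_dat(text: str) -> List[TreeTest]:
    tests: List[TreeTest] = []
    sections: Optional[Dict[str, List[str]]] = None
    current: Optional[str] = None
    for line in text.split("\n"):
        if line in HEADINGS:
            if line == "#data":               # every "#data" starts a new test
                if sections is not None:
                    tests.append(_build(sections))
                sections = {}
            if sections is None:              # heading before the first "#data"
                continue
            current = line
            sections[current] = []
        elif sections is not None and current is not None:
            sections[current].append(line)
    if sections is not None:
        tests.append(_build(sections))
    return tests


SAMPLE = """\
#data
<p>One
#errors
(1,3): expected-doctype-but-got-start-tag
#document
| <html>
|   <head>
|   <body>
|     <p>
|       "One"

#data
<b>x
#errors
#document-fragment
div
#script-off
#document
| <b>
|   "x"
"""

tests = parse_dat(SAMPLE)
```

Each TreeTest can then be fed to the parser and the resulting tree serialized in the same "| " format for comparison against document. Only known headings are treated as section markers, so input lines that merely start with "#" stay part of the test data.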

For the next sprint, the focus will remain on the parser. I want to:

Martin