06.10.2025 - 19.10.2025
Disclaimer: Please take everything I write here with a grain of salt—I’m by no means an expert. Some of these concepts I’m encountering for the first time, and my understanding might have significant gaps—and that’s okay. It’s one of the reasons I decided to take on this project. If you see anything wrong, let me know and I’ll fix it.
Last sprint I implemented the automated test runner for the parser and 90% of the test runner for the tokenizer. This sprint I wanted to finish the tokenizer runner and add more tests.
The format I started with mirrored the parser tests:
#description
Unescaped ampersand in attribute value
#data
<h a='&'>
#errors
#states
Data state
#output
StartTag "h"
Attr "a" "&"
There were a few issues with the above format.
At first I started fixing these issues, but at some point I realised that if I flattened the data further, they would not occur at all.
The new format is basically one value per line, each preceded by a header line describing it. As long as the test data does not contain the header names, we are good to go. I had a look over the data and didn't see any issues.
#description
Open angled bracket in unquoted attribute value state
#data
<a a=f<>
#errors
(1, 7): unexpected-character-in-unquoted-attribute-value
#states
Data state
#start-tag
a
#attr-name
a
#attr-value
f<
#end-test
There are more fields, but you get the idea.
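To make the one-value-per-line idea concrete, here is a minimal sketch of how a file in this format could be read, written in Python purely for illustration. The `parse_tests` function, the `HEADERS` set and the list-of-pairs representation are hypothetical names and choices, not the actual runner code, and the header set only covers the fields shown in the example above.

```python
# Hypothetical header set: only the fields shown in the example above.
# The real format has more fields.
HEADERS = {
    "#description", "#data", "#errors", "#states",
    "#start-tag", "#attr-name", "#attr-value",
}
END_TEST = "#end-test"


def parse_tests(text):
    """Return a list of tests; each test is a list of (header, value) pairs."""
    tests = []
    current = []    # (header, value) pairs of the test being read
    header = None   # most recently seen header line
    for line in text.split("\n"):
        if line == END_TEST:
            tests.append(current)
            current, header = [], None
        elif line in HEADERS:
            # A line that exactly matches a header name starts a new field.
            header = line
        elif header is not None:
            # Any other line is one value belonging to the current header.
            current.append((header, line))
    return tests


sample = """#description
Open angled bracket in unquoted attribute value state
#data
<a a=f<>
#errors
(1, 7): unexpected-character-in-unquoted-attribute-value
#states
Data state
#start-tag
a
#attr-name
a
#attr-value
f<
#end-test
"""

for header, value in parse_tests(sample)[0]:
    print(header, value)
```

Keeping the pairs in file order means the expected output (start tag, then attribute name, then attribute value) can be compared against the tokenizer's output in sequence. The exact-match check on header names is also why the test data must never contain a line equal to a header name.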
The spec dictates that newlines in the input buffer should be normalized before processing. This means two things:
Replace every \r\n with \n
Replace every remaining \r with \n
Currently the buffer is copied, so I can modify it freely.
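The two replacements can be sketched in a few lines of Python. This is only an illustration of the normalization step, not the tokenizer's actual code, and `normalize_newlines` is a hypothetical name. The order matters: the \r\n pairs have to be collapsed first, otherwise each pair would turn into two line feeds.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Return a copy of data with CRLF pairs and lone CRs replaced by LF."""
    # First collapse every CRLF pair into a single LF,
    # then turn any CR that is still left into LF.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


print(normalize_newlines(b"<p>\r\nline\rbreak\n</p>"))
```

`bytes.replace` returns a new object, so the original buffer stays untouched, which mirrors the copy-then-modify approach described above.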
After all the changes, I was able to add more than 4000 new tests. Currently 11 of the 14 tokenizer test files in html5lib-tests are passing.
Supported | Test suite name |
---|---|
No | contentModelFlags.test |
No | domjs.test |
Yes | entities.test |
Yes | escapeFlag.test |
Yes | namedEntities.test |
Yes | numericEntities.test |
Yes | pendingSpecChanges.test |
Yes | test1.test |
Yes | test2.test |
Yes | test3.test |
Yes | test4.test |
Yes | unicodeChars.test |
Yes | unicodeCharsProblematic.test |
No | xmlViolation.test |
I think this is enough work on the test runner for now. I have a good number of tests, so now I can focus on refactoring the tokenizer. The initial implementation has a lot that can be improved.
Martin