22.09.2025 - 5.10.2025
Disclaimer: Please take everything I write here with a grain of salt—I’m by no means an expert. Some of these concepts I’m encountering for the first time, and my understanding might have significant gaps—and that’s okay. It’s one of the reasons I decided to take on this project. If you see anything wrong, let me know and I’ll fix it.
During this sprint, the primary focus was the implementation of automated test runners for the tokenizer and tree-construction tests.
Until now, I had to manually add each test from html5lib-tests as a test case in C. This approach was acceptable initially, as the focus was on implementing the parser and tokenizer. However, I always knew that some form of automation would be necessary later. The test suite contains too many tests to manage manually.
The plan is as follows:
Version 1 (current): Automatically run all tests that are already passing. It’s acceptable to add more tests as long as they pass with the current implementation. Do not modify the parser or tokenizer implementation as part of this task.
Version 2: Improve the test runners once all missing implementation sections in the parser and tokenizer are complete.
The tree-construction tests were straightforward. Their definitions use a simple line-based text format, so the runner only has to read each line, work out which section it belongs to, and collect the expected results.
This runner is done, and all 112 tests in tests1.dat pass.
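To illustrate the line-by-line parsing described above, here is a minimal sketch in Python. The actual runner is part of the C project; Python is used here only to keep the example short. The function name `parse_dat` and the sample input are made up for illustration, and the section names are the ones I know from the html5lib-tests tree-construction format. It assumes that a blank line followed by `#data` separates two tests.

```python
# Section markers used by the tree-construction .dat format.
SECTIONS = {"#data", "#errors", "#new-errors", "#document-fragment",
            "#script-on", "#script-off", "#document"}


def parse_dat(text):
    """Split the contents of a .dat file into tests.

    Each test is a dict that maps a section name to its list of lines.
    """
    tests = []
    current = None  # the test currently being filled
    section = None  # the section that following lines belong to
    for line in text.split("\n"):
        if line == "#data":
            # "#data" always starts a new test.
            current = {}
            tests.append(current)
        if current is None:
            continue  # ignore anything before the first "#data"
        if line in SECTIONS:
            section = line
            current[section] = []
        else:
            current[section].append(line)
    for test in tests:
        # The blank separator line lands at the end of the last section.
        last = list(test)[-1]
        if test[last] and test[last][-1] == "":
            test[last].pop()
    return tests


# Hypothetical sample in the same format (not copied from tests1.dat).
sample = '''#data
Test
#errors
(1,0): expected-doctype-but-got-chars
#document
| <html>
|   <head>
|   <body>
|     "Test"

#data
<p>One<p>Two
#errors
(1,3): expected-doctype-but-got-start-tag
#document
| <html>
|   <head>
|   <body>
|     <p>
|       "One"
|     <p>
|       "Two"
'''

for test in parse_dat(sample):
    print(test["#data"], len(test["#document"]), "tree lines")
```

The sketch does not handle input data that itself contains a line equal to one of the section markers; a real runner would need a rule for that case.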
The tokenizer tests are more complex. The test definitions in the main repository are provided in JSON format, which is a problem because the project does not have a JSON parser yet. To work around this, I wrote a Python script that converts the tests into a text format similar to the one used for the tree-construction tests.
Original definition
{"description":"Unescaped ampersand in attribute value",
"input":"<h a='&'>",
"output":[["StartTag", "h", { "a":"&" }]]},
Translated definition
#description
Unescaped ampersand in attribute value
#data
<h a='&'>
#errors
#states
Data state
#output
StartTag "h"
Attr "a" "&"
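The original definition omits fields that have their default value (in this example, "errors" and "initialStates"). I made these explicit in the translated definition to keep the parsing simple. The fields map as follows: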
"description" -> #description
"input" -> #data
"errors" -> #errors
"initialStates" -> #states
"output" -> #output
[["StartTag", "h", { "a":"&" }]] -> StartTag "h"
-> Attr "a" "&"
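The mapping above can be sketched as a small Python converter. This is not the actual script, only a minimal version under a few assumptions: the names `convert_test`, `convert_output`, and `quote` are made up, values are quoted as JSON-style strings, each error is written as its error code on a separate line, and only `StartTag` tokens are handled because that is the only token type shown in this post. The default of "Data state" for a missing "initialStates" follows the html5lib-tests tokenizer format.

```python
import json

# Used when a test has no "initialStates" field.
DEFAULT_STATES = ["Data state"]


def quote(text):
    # Assumption: values are written as JSON-style double-quoted strings.
    return json.dumps(text, ensure_ascii=False)


def convert_output(tokens):
    """Turn the "output" token list into one line per tag and attribute."""
    lines = []
    for token in tokens:
        kind = token[0]
        if kind != "StartTag":
            # The post does not show how other token types are written.
            raise NotImplementedError(f"token type not covered: {kind}")
        lines.append(f"StartTag {quote(token[1])}")
        # The optional self-closing flag (token[3]) is ignored in this sketch.
        attrs = token[2] if len(token) > 2 else {}
        for name, value in attrs.items():
            lines.append(f"Attr {quote(name)} {quote(value)}")
    return lines


def convert_test(test):
    """Convert one JSON test definition (a dict) into the text format."""
    lines = ["#description", test["description"], "#data", test["input"]]
    lines.append("#errors")
    # Assumption: one error code per line; the post shows no error example.
    lines += [error["code"] for error in test.get("errors", [])]
    lines.append("#states")
    lines += test.get("initialStates", DEFAULT_STATES)
    lines.append("#output")
    lines += convert_output(test["output"])
    return "\n".join(lines) + "\n"


example = json.loads('''{"description":"Unescaped ampersand in attribute value",
"input":"<h a='&'>",
"output":[["StartTag", "h", { "a":"&" }]]}''')
print(convert_test(example), end="")
```

For the example definition from this post, the sketch prints the translated definition shown earlier. Inputs that contain newlines would need extra escaping, which is left out here.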
This test runner is still a work in progress, as several special cases have not been implemented yet. Even so, the number of tests that run automatically has already increased significantly.
Martin