02.06.2025 - 15.06.2025
Disclaimer: Please take everything I write here with a grain of salt—I’m by no means an expert. Some of these concepts I’m encountering for the first time, and my understanding might have significant gaps—and that’s okay. It’s one of the reasons I decided to take on this project. If you see anything wrong, let me know and I’ll fix it.
In the second sprint of the project, the focus was on completing the tokenizer implementation and adding corresponding tests. Similar to the previous sprint, the time spent on these tasks was less than originally planned due to external factors.
The three main work items for this sprint were:
- the initial tokenizer implementation,
- converting conformance tests into C, and
- tokenizer error handling.
The goal of the initial tokenizer implementation is to cover the majority of the functionality required by the spec, though not all of it. There are still parts I haven’t figured out how to approach—named character references, for example. Rather than getting stuck on smaller or more complex pieces early on, I’ve added a backlog item for this and will revisit it once I’ve had more time to think it through.
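To give a sense of why this piece is tricky: the spec defines a table of roughly two thousand named references, some names are prefixes of others, and a number of them are valid even without a trailing semicolon, so the tokenizer has to consume the longest match it can. The snippet below is only a rough sketch of that longest-match idea against a tiny hypothetical table (named_refs and longest_named_ref are illustrative names, not part of my tokenizer):

#include <stdint.h>
#include <string.h>

/* Hypothetical sketch only: a few named references and their UTF-8
 * expansions. The real table in the spec has over two thousand entries,
 * including names that are valid without the trailing semicolon. */
typedef struct { const char* name; const char* utf8; } named_ref_t;

static const named_ref_t named_refs[] = {
    { "&not",    "\xC2\xAC"     }, /* U+00AC, no trailing semicolon */
    { "&not;",   "\xC2\xAC"     }, /* U+00AC */
    { "&notin;", "\xE2\x88\x89" }, /* U+2209 */
};

/* Return the index of the longest entry matching the start of input, or -1.
 * For input "&notin;" the longer match must win over its prefix "&not". */
static int32_t longest_named_ref(const char* input, uint32_t size)
{
    int32_t best = -1;
    uint32_t best_len = 0;
    for (uint32_t i = 0; i < sizeof(named_refs) / sizeof(named_refs[0]); i++)
    {
        uint32_t len = (uint32_t)strlen(named_refs[i].name);
        if (len <= size && len > best_len && memcmp(input, named_refs[i].name, len) == 0)
        {
            best = (int32_t)i;
            best_len = len;
        }
    }
    return best;
}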
There have been a few changes since the last sprint, including:
- DATA_STATE

After some research, I found a repository containing a comprehensive set of tests for parser conformance. The tests are specified in JSON format and are relatively straightforward to understand. Since I don’t have a JSON parser implemented yet, I’m going through the tests manually, converting each one into C code, similar to the example below.
static void hexadecimal_entity_pair_representing_surrogate_pair()
{
    // {"description":"Hexadecimal entity pair representing a surrogate pair",
    // "input":"&#xD869;&#xDED6;",
    // "output":[["Character", "\uFFFD\uFFFD"]],
    // "errors":[
    //     { "code": "surrogate-character-reference", "line": 1, "col": 9 },
    //     { "code": "surrogate-character-reference", "line": 1, "col": 17 }
    // ]},
    html_token_t tokens[SIZE_TEN] = { 0 };
    const char buffer[] = "&#xD869;&#xDED6;";
    const uint32_t buffer_size = sizeof(buffer) - 1;
    html_tokenizer_init(buffer, buffer_size, tokens, SIZE_TEN);
    // Nothing has been tokenized yet, so no token may be marked valid.
    for (uint32_t i = 0; i < SIZE_TEN; i++) { ASSERT_FALSE(tokens[i].is_valid); }

    // Expected number of tokens produced by each call to html_tokenizer_next().
    uint32_t return_sizes[] = { 1, 1, 1 };
    // Expected return code of each call.
    html_tokenizer_error_e errors[] = { HTML_TOKENIZER_SURROGATE_CHARACTER_REFERENCE,
                                        HTML_TOKENIZER_SURROGATE_CHARACTER_REFERENCE,
                                        HTML_TOKENIZER_OK };
    // Expected tokens: U+FFFD (0xEF 0xBF 0xBD in UTF-8) for each surrogate reference, then EOF.
    html_token_t tokens_e[][1] = { { {.is_valid = true, .type = HTML_CHARACTER_TOKEN, .data_size = 3, .data = { [0] = 0xef, [1] = 0xbf, [2] = 0xbd } } },
                                   { {.is_valid = true, .type = HTML_CHARACTER_TOKEN, .data_size = 3, .data = { [0] = 0xef, [1] = 0xbf, [2] = 0xbd } } },
                                   { {.is_valid = true, .type = HTML_EOF_TOKEN } } };

    // The three expectation arrays must stay in sync.
    uint32_t return_sizes_len = sizeof(return_sizes) / sizeof(uint32_t);
    uint32_t errors_len = sizeof(errors) / sizeof(html_tokenizer_error_e);
    uint32_t tokens_e_len = sizeof(tokens_e) / sizeof(html_token_t) / 1;
    ASSERT_EQUAL(return_sizes_len, errors_len);
    ASSERT_EQUAL(tokens_e_len, errors_len);

    uint32_t tests = sizeof(return_sizes) / sizeof(uint32_t);
    for (uint32_t i = 0; i < tests; i++)
    {
        uint32_t size_e = return_sizes[i];
        html_tokenizer_error_e err_e = errors[i];
        html_tokenizer_error_e err_a = html_tokenizer_next();
        ASSERT_TOKENS_SIZE(size_e, SIZE_TEN);
        ASSERT_EQUAL(err_a, err_e);
        for (uint32_t j = 0; j < size_e; j++)
        {
            html_token_t token_e = tokens_e[i][j];
            if (!token_e.is_valid) { continue; }
            ASSERT_TOKEN(tokens[j], token_e);
        }
    }
}
There are some hardcoded values in my current setup, and a few differences between my test implementation and the test repository. One such difference is the correctness attribute, which corresponds to the inverse of the force_quirks attribute defined in the spec.

Despite these differences, the test cases have already helped me uncover and fix several bugs in the implementation, so it’s great that this repository exists. The tokenizer section alone includes a large number of tests, and I’ve only converted a small portion so far.
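To make the inversion concrete: as I understand the test format, a DOCTYPE expectation carries a correctness value, and translating it by hand means negating that value to get the spec’s force-quirks flag. The struct and field names below are purely illustrative and don’t reflect my actual token layout:

#include <stdbool.h>

/* Illustrative only: the JSON tests describe DOCTYPE tokens with a
 * correctness flag, the spec's DOCTYPE token has a force-quirks flag,
 * and the two are inverses of each other. */
typedef struct
{
    const char* name;
    bool force_quirks;
} doctype_expectation_t;

static doctype_expectation_t doctype_from_test(const char* name, bool correctness)
{
    doctype_expectation_t expected = { 0 };
    expected.name = name;
    expected.force_quirks = !correctness; /* inverse of the JSON correctness value */
    return expected;
}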
One of the things I chose to leave out of the initial implementation was tokenizer error handling. At the time, I wasn’t entirely sure how it should work, so I decided to postpone it. However, as I’ve started working through the test cases, I now have a clearer understanding and have begun incorporating error handling into the tokenizer. There’s still more work to be done, but I’m confident that by the time all the tests are converted, this part of the implementation will be complete.
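The shape the converted tests already follow, and the one I’m building towards, is that each call to html_tokenizer_next returns either HTML_TOKENIZER_OK or a specific error code while still emitting the tokens the spec asks for, since parse errors are recoverable. Below is a rough sketch of a caller under that assumption; whether the token array is refreshed on every call is still an open question, so treat this as a guess rather than settled API.

#include <stdbool.h>
#include <stdint.h>

/* Rough sketch: keep calling html_tokenizer_next() until an EOF token shows
 * up, and treat any non-OK return value as a recoverable parse error rather
 * than a reason to stop. */
static void tokenize_all(const char* buffer, uint32_t buffer_size)
{
    html_token_t tokens[SIZE_TEN] = { 0 };
    html_tokenizer_init(buffer, buffer_size, tokens, SIZE_TEN);

    bool done = false;
    while (!done)
    {
        html_tokenizer_error_e err = html_tokenizer_next();
        if (err != HTML_TOKENIZER_OK)
        {
            /* Record or log the parse error, then keep tokenizing. */
        }
        for (uint32_t i = 0; i < SIZE_TEN; i++)
        {
            if (tokens[i].is_valid && tokens[i].type == HTML_EOF_TOKEN)
            {
                done = true;
            }
        }
    }
}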
The focus of sprint 3 will be converting more tests, finishing error support, and fixing bugs.