14.07.2025 - 27.07.2025; 34h
Disclaimer: Please take everything I write here with a grain of salt—I’m by no means an expert. Some of these concepts I’m encountering for the first time, and my understanding might have significant gaps—and that’s okay. It’s one of the reasons I decided to take on this project. If you see anything wrong, let me know and I’ll fix it.
This sprint was shorter than usual due to some external factors. I worked on getting test structure similar to the tokenizer. I arrived at something that works for now, not sure if this will be the final form as i still have my doubts about the structure of html_node_t
.
The input comes from the html5lib-tests repository. The structure is as follows
#data
Line1<br>Line2<br>Line3<br>Line4
#errors
(1,0): expected-doctype-but-got-chars
#document
| <html>
| <head>
| <body>
| "Line1"
| <br>
| "Line2"
| <br>
| "Line3"
| <br>
| "Line4"
#data
- contains the raw buffer passed to the parser#errors
- the list of expected errors#document
- represents the final expected html document treeThere are some additional types of sections for more advanced features but i will stick to the above for the initial version of the parser.
I have arrived at the following structure
static void test_when_input_is_pure_text_then_add_missing_nodes()
{
// #data
// Test
// #errors
// (1,0): expected-doctype-but-got-chars
// #document
// | <html>
// | <head>
// | <body>
// | "Test"
unsigned char buffer[] = "Test";
html_node_t* document = html_document_new();
html_node_t* html = html_element_new(document, "html", 4);
html_node_t* head = html_element_new(document, "head", 4);
html_node_t* body = html_element_new(document, "body", 4);
html_node_t* text = html_text_new(document, "Test", 4);
APPEND_TO_TREE(document, html);
APPEND_TO_TREE(html, head);
APPEND_TO_TREE(html, body);
APPEND_TO_TREE(body, text);
RUN_TEST_AND_ASSERT_DOCUMENT(buffer, document);
}
Similar to the tokenizer tests, we define the input buffer and then assert expectation.
APPEND_TO_TREE
handles the appropriate logic to add a node to the tree. This can be done with one of the html_node_t
functions so i might change it at a later point.
#define APPEND_TO_TREE(root, node) \
do \
{ \
html_node_t* last_child = root->last_child; \
if (last_child) \
{ \
last_child->next_sibling = node; \
node->prev_sibling = last_child; \
root->last_child = node; \
} \
else \
{ \
root->first_child = node; \
root->last_child = node; \
} \
} while(0);
RUN_TEST_AND_ASSERT_DOCUMENT
macro parses the buffer, gets a tree back and uses a recursive function to compare the nodes.
#define RUN_TEST_AND_ASSERT_DOCUMENT(buffer, expected) \
do \
{ \
html_parser_init(); \
html_node_t* actual = html_parser_run(buffer, sizeof(buffer) - 1); \
ASSERT_NODE(actual, expected); \
html_node_free(expected); \
html_node_free(actual); \
html_parser_free(); \
} while (0);
After a lot of back and forth on how to handle the parser implementation (i mean a lot) I have now settled on something. The parser is a huge switch statement full of NOT_IMPLEMENTED
calls. I add a test and implement the related missing functionality (within reason). There are some things I prefer to leave for the future (fragment parsing, scripting stuff etc). This way, i do not feel overwhelmed and can take small steps forward.
This is what a snippet of the parser looks like atm
html_node_t* html_parser_run(const unsigned char* buffer, const uint32_t size)
{
html_tokenizer_init(buffer, size, tokens, MAX_TOKENS);
mode = HTML_PARSER_MODE_INITIAL;
document = html_document_new();
bool will_use_rules_for = false;
bool use_rules_for = false;
html_node_t* head_element = NULL;
html_node_t* form_element = NULL;
bool scripting_enabled = false;
while (!stop)
{
html_tokenizer_error_e err = html_tokenizer_next();
uint32_t tokens_size = get_tokens_size();
uint32_t i = 0;
if (err != HTML_TOKENIZER_OK || tokens_size == 0) { break; }
while (i < tokens_size)
{
bool consume = true;
html_token_t t = tokens[i];
bool is_doctype = t.type == HTML_DOCTYPE_TOKEN;
bool is_start = t.type == HTML_START_TOKEN;
bool is_end = t.type == HTML_END_TOKEN;
bool is_comment = t.type == HTML_COMMENT_TOKEN;
bool is_character = t.type == HTML_CHARACTER_TOKEN;
bool is_eof = t.type == HTML_EOF_TOKEN;
switch (mode)
{
... MORE
// https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inhead
case HTML_PARSER_MODE_IN_HEAD:
if (is_character && (t.data[0] == '\t' || t.data[0] == '\n' || t.data[0] == '\f' || t.data[0] == '\r' || t.data[0] == ' '))
{
insert_character(&t);
}
else if (is_comment)
{
insert_comment(&t, NULL);
}
else if (is_doctype)
{
// todo: parse error, ignore token
}
else if (is_start && name_is(HTML, HTML_SIZE, &t))
{
consume = false;
mode = HTML_PARSER_MODE_IN_BODY;
original_mode = HTML_PARSER_MODE_IN_HEAD;
will_use_rules_for = true;
}
else if (is_start && (name_is(BASE, BASE_SIZE, &t) ||
name_is(BASEFONT, BASEFONT_SIZE, &t) ||
name_is(BGSOUND, BGSOUND_SIZE, &t) ||
name_is(LINK, LINK_SIZE, &t)))
{
insert_html_element(t.name, t.name_size);
stack_pop();
// todo: ack self closing tag
}
else if (is_start && name_is(META, META_SIZE, &t))
{
insert_html_element(t.name, t.name_size);
stack_pop();
// todo: ack self closing tag
// todo: speculative parsing logic
}
else if (is_start && name_is(TITLE, TITLE_SIZE, &t))
{
insert_html_element(t.name, t.name_size);
html_tokenizer_set_state(HTML_TOKENIZER_RCDATA_STATE);
original_mode = mode;
mode = HTML_PARSER_MODE_TEXT;
}
else if ((is_start && name_is(NOSCRIPT, NOSCRIPT_SIZE, &t) && scripting_enabled) ||
(is_start && (name_is(NOFRAMES, NOFRAMES_SIZE, &t) || name_is(STYLE, STYLE_SIZE, &t))))
{
insert_html_element(t.name, t.name_size);
html_tokenizer_set_state(HTML_TOKENIZER_RAWTEXT_STATE);
original_mode = mode;
mode = HTML_PARSER_MODE_TEXT;
}
else if (is_start && name_is(NOSCRIPT, NOSCRIPT_SIZE, &t) && !scripting_enabled)
{
insert_html_element(t.name, t.name_size);
mode = HTML_PARSER_MODE_IN_HEAD_NOSCRIPT;
}
else if (is_start && name_is(SCRIPT, SCRIPT_SIZE, &t))
{
NOT_IMPLEMENTED
}
else if (is_end && name_is(HEAD, HEAD_SIZE, &t))
{
stack_pop();
mode = HTML_PARSER_MODE_AFTER_HEAD;
}
else if (is_start && name_is(TEMPLATE, TEMPLATE_SIZE, &t))
{
NOT_IMPLEMENTED
}
else if (is_end && name_is(TEMPLATE, TEMPLATE_SIZE, &t))
{
NOT_IMPLEMENTED
}
else if ((is_start && name_is(HEAD, HEAD_SIZE, &t)) ||
(is_end && !(name_is(BODY, BODY_SIZE, &t) ||
name_is(HTML, HTML_SIZE, &t) ||
name_is(BR, BR_SIZE, &t))))
{
// todo: parse error
NOT_IMPLEMENTED
}
else
{
stack_pop();
mode = HTML_PARSER_MODE_AFTER_HEAD;
consume = false;
}
break;
... MORE
}
if (use_rules_for) { use_rules_for = false; mode = original_mode; }
if (will_use_rules_for) { will_use_rules_for = false; use_rules_for = true; }
if (consume) { i++; }
}
}
html_tokenizer_free();
return document;
}
This approach seems to be working better than my previous attempts so for next sprint I will continue adding tests and implementation logic to the parser. The biggest problem now is deciding which parts should be implemented for the initial version and which parts should be postponed. The guiding principle I try to follow is “Is this logic necessary to parse a simple html document with no attributes?”. Sometimes it seems that it is necessary, but as i start implementation I see that its not actually needed and changes have to be reverted.
Martin