Martin Spasov

HTML Tokenization and Testing Again

16.06.2025 - 29.06.2025; 45h

Disclaimer: Please take everything I write here with a grain of salt—I’m by no means an expert. Some of these concepts I’m encountering for the first time, and my understanding might have significant gaps—and that’s okay. It’s one of the reasons I decided to take on this project. If you see anything wrong, let me know and I’ll fix it.

For this sprint, the focus remained the same: tokenizer bug fixing and testing.

Completed work items:

Code coverage with Codecov
More tests for the tokenizer
More work on the tokenizer: bug fixing and parse error support

Code coverage

Initially, I planned to generate coverage information for the diff using gcov directly. This approach would provide coverage data only on pull requests and only relative to a base branch—meaning there wouldn’t be a full project-wide overview. My idea was to post this information as a comment using a GitHub Action.

However, I ran into an issue: the action couldn’t post the comment due to a persistent permission error. Rather than troubleshoot that further, I decided to switch to using the Codecov service. I’ve used Codecov before, and for a simple use case like mine, it’s straightforward to configure and works well. The free tier offers everything I need.

To get things running, you need to:

Create an account with Codecov (you can log in with your GitHub account)
Give Codecov access to the repository
Add a GitHub Action to your pipeline that uploads the coverage reports
Generate a secret token and add it to your repo (Settings -> Secrets and variables -> Actions -> Repository secrets)
Add a configuration file

You can view the coverage at: https://app.codecov.io/github/marsp0/vl-browser

The GitHub Action step looks like this:

- name: Upload coverage reports to Codecov
  uses: codecov/codecov-action@v5
  with:
    fail_ci_if_error: true
    disable_telem: true
    token: $

The action has many more configuration options, but this is a good starting point. The Codecov configuration file looks like this:

comment: false                             # no comments on the PR
coverage:
    status:                                # status checks configuration
        patch: off                         # no status check for patch diff
        project:                           # only run status check on whole project
            default:
                informational: true        # do not block merge
                only_pulls: true           # only post these status checks on PRs

You can check an example status check here: https://github.com/marsp0/vl-browser/pull/5

More tests

The plan for the last sprint was to add nearly all tests from the html5lib-tests repository. The main challenge is that I have to manually translate the tests into C. I managed to get through test1.data and test2.data, but test3.data appears to be autogenerated and is massive—roughly 2,000 tests by my estimate. Processing that manually would take far too much time.

Instead, I’ve decided to focus on adding enough tests to get tokenizer coverage as close to 100% as possible. Currently, coverage sits at around 85%. Once the implementations for CDATA and NAMED_CHARACTER_REFERENCE are complete, I’ll be able to add more tests and bump that number higher. I’ve postponed those implementations for now, as I’m still figuring out the best approach.
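To give a sense of what translating one case by hand involves, here is a minimal sketch. The html5lib-tests tokenizer cases are JSON objects with a description, an input string, and a list of expected output tokens. The C types and names below (tok_t, test_case_t, tokens_match) are hypothetical and are not taken from the project, and the case itself is made up for illustration rather than copied from the test suite.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token model, reduced to a type and one string. */
typedef enum { TOK_CHARACTER, TOK_START_TAG, TOK_END_TAG, TOK_COMMENT } tok_type_e;

typedef struct {
    tok_type_e  type;
    const char* data;   /* tag name, comment text, or character data */
} tok_t;

typedef struct {
    const char*  description;
    const char*  input;
    const tok_t* expected;
    size_t       expected_count;
} test_case_t;

/* Illustrative JSON case, written out by hand as C data:
 *   {"description": "start tag, text, end tag",
 *    "input": "<h>x</h>",
 *    "output": [["StartTag", "h", {}], ["Character", "x"], ["EndTag", "h"]]}
 */
static const tok_t simple_expected[] = {
    { TOK_START_TAG, "h" },
    { TOK_CHARACTER, "x" },
    { TOK_END_TAG,   "h" },
};

static const test_case_t simple_case = {
    "start tag, text, end tag", "<h>x</h>", simple_expected, 3
};

/* Returns 1 when both token lists have the same length, types and data. */
static int tokens_match(const tok_t* expected, size_t n_expected,
                        const tok_t* actual, size_t n_actual)
{
    assert(expected != NULL && actual != NULL);
    if (n_expected != n_actual) return 0;
    for (size_t i = 0; i < n_expected; i++) {
        if (expected[i].type != actual[i].type) return 0;
        if (strcmp(expected[i].data, actual[i].data) != 0) return 0;
    }
    return 1;
}
```

A real test would fill the actual array from the tokenizer's output for simple_case.input and then call tokens_match. Attributes, doctype fields, and expected parse errors are left out of this sketch, and writing them out by hand for every case is a large part of what makes the manual translation slow.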

More work on tokenizer

The tokenizer now supports nearly all parse errors. The few that are still missing are either related to the postponed implementations or are triggered by the parser rather than the tokenizer itself. Currently, the tokenizer can return only one parse error per call to html_tokenizer_next. This limitation isn’t an issue for now, but it’s something to keep in mind for future improvements.
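One possible way to lift the one-error-per-call limit later is to hand the tokenizer a small fixed-size error buffer instead of a single error slot. The sketch below is only an illustration under that assumption: error_list_t, error_list_push, and toy_scan are hypothetical names, and toy_scan is a stand-in scanner, not the real html_tokenizer_next. The two error codes mirror parse error names from the HTML standard.

```c
#include <assert.h>
#include <stddef.h>

/* All names below are hypothetical; none are taken from the project. */
#define MAX_ERRORS_PER_CALL 4

typedef enum {
    ERR_UNEXPECTED_NULL_CHARACTER,
    ERR_EOF_IN_TAG
} parse_error_e;

typedef struct {
    parse_error_e errors[MAX_ERRORS_PER_CALL];
    size_t        count;    /* errors stored in the buffer */
    size_t        dropped;  /* errors that did not fit */
} error_list_t;

static void error_list_reset(error_list_t* list)
{
    assert(list != NULL);
    list->count = 0;
    list->dropped = 0;
}

/* Stores the error if there is room, otherwise counts it as dropped. */
static void error_list_push(error_list_t* list, parse_error_e err)
{
    assert(list != NULL);
    if (list->count < MAX_ERRORS_PER_CALL)
        list->errors[list->count++] = err;
    else
        list->dropped++;
}

/* Toy scanner: reports every NUL byte, plus an unclosed '<' at the end of
 * input. Simplified on purpose; the real state machine distinguishes more
 * cases (for example, EOF right after '<' is a different error). */
static void toy_scan(const char* buf, size_t len, error_list_t* out)
{
    int in_tag = 0;
    error_list_reset(out);
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '<')       in_tag = 1;
        else if (buf[i] == '>')  in_tag = 0;
        else if (buf[i] == '\0') error_list_push(out, ERR_UNEXPECTED_NULL_CHARACTER);
    }
    if (in_tag) error_list_push(out, ERR_EOF_IN_TAG);
}
```

With the four-byte input '<', 'a', NUL, 'b', a single call reports both the NUL character and the unterminated tag. The fixed capacity avoids allocation, and the dropped counter makes overflow visible instead of silently losing errors.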

Thanks to the coverage reports and tests, I was able to catch and fix a lot of bugs in the implementation. For the next sprint or two, I’ll be shifting focus to the tree-building stage of the parser. After that, I plan to return to the tokenizer. Once test coverage for the tokenizer is close to 100%, I’ll start refactoring it—the current code is pretty ugly.

Martin