17.11.2025 - 30.11.2025
Disclaimer: Please take everything I write here with a grain of salt—I’m by no means an expert. Some of these concepts I’m encountering for the first time, and my understanding might have significant gaps—and that’s okay. It’s one of the reasons I decided to take on this project. If you see anything wrong, let me know and I’ll fix it.
The focus for this sprint was on an initial implementation of the CSS tokenizer that can be tested.
The tokenizer uses a series of consume_* algorithms and checks to process the input buffer.
- consume_token
- consume_comment
- consume_numeric_token
- consume_ident_token
- consume_string
- consume_url_token
- consume_escaped_code_point
- consume_number
- consume_bad_url
- convert_string_to_number
- is_valid_escape - checks if two code points are a valid escape
- is_ident_start - checks if three code points start an ident sequence
- is_number_start - checks if three code points start a number

The good thing about the above is that each consume_* algorithm is self-contained and returns a single token. Each call to css_tokenizer_next() therefore returns exactly one token, unlike the HTML tokenizer, which can return multiple tokens.
One drawback is that all of these algorithms can consume a code point and advance the cursor on their own. In contrast, the HTML tokenizer advances the cursor at a single call site, which I prefer.
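To make the last three checks from the list above a bit more concrete, here is a minimal sketch of what they could look like, assuming a C-style implementation. The names is_valid_escape, is_ident_start and is_number_start are the ones listed above; everything else (the code-point type, the small helper predicates, the simplified ident-start classification) is purely illustrative, but the branching follows the corresponding checks in the CSS Syntax spec.

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimal, illustrative sketch of the three check helpers. Only the three
 * is_* names come from the list above; the rest is made up for this post. */

static bool is_digit_cp(uint32_t c)
{
    return c >= '0' && c <= '9';
}

/* Simplified: letters, underscore, or any non-ASCII code point. */
static bool is_ident_start_code_point(uint32_t c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_' || c >= 0x80;
}

/* Two code points are a valid escape if the first is a backslash and the
 * second is not a newline (assuming the input is preprocessed so that all
 * newlines are '\n'). */
static bool is_valid_escape(uint32_t c0, uint32_t c1)
{
    return c0 == '\\' && c1 != '\n';
}

/* Three code points start an ident sequence; this is what makes an input
 * like "--0" (see the example below) an ident-token rather than a number. */
static bool is_ident_start(uint32_t c0, uint32_t c1, uint32_t c2)
{
    if (c0 == '-')
        return c1 == '-' || is_ident_start_code_point(c1) || is_valid_escape(c1, c2);
    if (is_ident_start_code_point(c0))
        return true;
    return is_valid_escape(c0, c1);
}

/* Three code points start a number, e.g. "12", "+5", "-0.5", ".25". */
static bool is_number_start(uint32_t c0, uint32_t c1, uint32_t c2)
{
    if (c0 == '+' || c0 == '-')
        return is_digit_cp(c1) || (c1 == '.' && is_digit_cp(c2));
    if (c0 == '.')
        return is_digit_cp(c1);
    return is_digit_cp(c0);
}
```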
Thankfully I was able to find a repo with tests that I can use: css-tokenizer-tests. Using a translation script (similar to the ones used for the HTML tokenizer and parser), I was able to convert the tests to a familiar line-based format.
The repo is divided into categories; each category has a folder for each test, and each folder contains a file with the input buffer and a JSON file with the expected tokens.
Example: ident/0003
source.css

```css
--0
```

tokens

```json
[
    {
        "type": "ident-token",
        "raw": "--0",
        "startIndex": 0,
        "endIndex": 3,
        "structured": {
            "value": "--0"
        }
    },
    {
        "type": "whitespace-token",
        "raw": "\n",
        "startIndex": 3,
        "endIndex": 4,
        "structured": null
    }
]
```
ident.txt

```
#data-0003
--0\n
#token-type
ident-token
#token-value
--0
#token-type
whitespace-token
#end-test
```
As you can see, the translation only uses the type and structured fields; the rest are ignored.
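For a rough idea of what that mapping boils down to, here is an illustrative sketch of the per-token part of such a translation, written in C with the cJSON library purely for illustration. It is not the actual script, and it skips the #data header, escaping of the source text, and numeric structured values, but it shows how type and structured map to the #token-type / #token-value lines.

```c
#include <stdio.h>
#include <cjson/cJSON.h>

/* Illustrative only: emits the #token-type / #token-value lines for one test
 * from the contents of its tokens JSON file. Only "type" and "structured"
 * are read; "raw", "startIndex" and "endIndex" are ignored. */
static void translate_tokens(FILE *out, const char *json_text)
{
    cJSON *tokens = cJSON_Parse(json_text);
    if (tokens == NULL)
        return;

    cJSON *token = NULL;
    cJSON_ArrayForEach(token, tokens)
    {
        cJSON *type       = cJSON_GetObjectItemCaseSensitive(token, "type");
        cJSON *structured = cJSON_GetObjectItemCaseSensitive(token, "structured");

        if (cJSON_IsString(type))
            fprintf(out, "#token-type\n%s\n", type->valuestring);

        /* "structured" is null for tokens that carry no value (e.g. whitespace);
         * only string values are handled here, numbers and dimensions would
         * need their numeric fields as well. */
        cJSON *value = cJSON_GetObjectItemCaseSensitive(structured, "value");
        if (cJSON_IsString(value))
            fprintf(out, "#token-value\n%s\n", value->valuestring);
    }
    fprintf(out, "#end-test\n");

    cJSON_Delete(tokens);
}
```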
As with the other translation scripts, there are some unsupported tests:

```json
[
    ["dimension", "0008", "null byte"],
    ["fuzz", "b69ece36-057f-4450-9423-a1661787bce6", "null bytes"],
    ["fuzz", "4f865903-e4dd-4a0b-83ed-e630cfa9dcca", "null bytes"],
    ["fuzz", "5181013c-60ab-483b-9c06-fb32c7e1e7e8", "null bytes"],
    ["fuzz", "4e630a47-507b-4b79-b00f-57f7dc1cc79d", "null bytes"],
    ["fuzz", "6d07fc79-586f-4efa-a0a2-37d4dd3beb09", "null bytes"],
    ["fuzz", "864d7812-b82f-47c2-94e4-8402ba6ba94a", "null bytes"],
    ["fuzz", "2abe9406-c063-4e9a-85ac-b13660671553", "long string"],
    ["fuzz", "7f49c8fc-8292-4a3e-828b-b5d028a80d5f", "long string"],
    ["url", "0010", "long string"],
    ["url", "0009", "long string"]
]
```
Test status as of this post:

| Category | Supported / total |
|---|---|
| at-keyword | 9/9 |
| bad-string | 5/5 |
| bad-url | 8/8 |
| colon | 1/1 |
| comma | 1/1 |
| comment | 6/6 |
| digit | 1/1 |
| dimension | 7/8 |
| escaped-code-point | 16/16 |
| full-stop | 3/3 |
| fuzz | 4/12 |
| hash | 0/15 |
| hyphen-minus | 6/6 |
| ident-like | 9/9 |
| ident | 9/9 |
| left-curly-bracket | 1/1 |
| left-parenthesis | 1/1 |
| left-square-bracket | 1/1 |
| less-than | 4/4 |
| number | 20/20 |
| numeric | 4/4 |
| plus | 4/4 |
| right-curly-bracket | 1/1 |
| right-parenthesis | 1/1 |
| right-square-bracket | 1/1 |
| semi-colon | 1/1 |
| string | 9/9 |
| url | 13/15 |
| whitespace | 8/8 |
For the next sprint I will try to convert the tokenizer to an iterative approach similar to the one used by the HTML tokenizer, as I like it more.
Martin