01.12.2025 - 14.12.2025
Disclaimer: Please take everything I write here with a grain of salt—I’m by no means an expert. Some of these concepts I’m encountering for the first time, and my understanding might have significant gaps—and that’s okay. It’s one of the reasons I decided to take on this project. If you see anything wrong, let me know and I’ll fix it.
The focus of this sprint was on improving the implementation of the CSS tokenizer. The first step in this effort was to convert it into a state machine similar to the HTML tokenizer. One of the challenges is that, at certain points, the tokenizer needs to examine multiple code points at the same time. For example, is a dot part of a number? Is a code point part of an identifier sequence? etc.
I started with an input buffer that decodes code points one at a time using a buffer cursor. Below is the general structure of the code (I am not including all of the code, as it is quite long, but the overall approach is the same).
```c
typedef enum
{
    CSS_TOKENIZER_STATE_DATA,
    CSS_TOKENIZER_STATE_NUMBER,
    CSS_TOKENIZER_STATE_THAT_NEEDS_MULTIPLE_CPS,
    ...
} css_tokenizer_state_e;
```
```c
css_token_t css_tokenizer_next()
{
    css_token_t t = { .type = CSS_TOKEN_EOF };
    // current state
    css_tokenizer_state_e state = CSS_TOKENIZER_STATE_DATA;
    // buffered code points awaiting a decision
    uint32_t t_buf[4] = { 0 };
    uint32_t t_buf_size = 0;
    // current code point
    uint32_t cp = 0;
    int32_t cp_len;
    bool is_eof = false;
    // advance the buffer
    bool consume = true;
    // move cursor back equal to size of t_buf
    bool rewind = false;
    uint32_t rewind_start = 0;
    bool emit = false;
    while (!emit)
    {
        // reset per-iteration flags
        consume = true;
        rewind = false;
        cp_len = utf8_decode(...);
        if (cp_len < 0)
        {
            is_eof = true;
            cp = 0;
        }
        switch (state)
        {
        case CSS_TOKENIZER_STATE_DATA:
            if (cp == '\\')
            {
                // do something
                consume = false;
            }
            if (cp == '.')
            {
                t_buf[t_buf_size] = cp;
                t_buf_size++;
                state = CSS_TOKENIZER_STATE_THAT_NEEDS_MULTIPLE_CPS;
            }
            break;
        case CSS_TOKENIZER_STATE_THAT_NEEDS_MULTIPLE_CPS:
            if (is_number_start(t_buf, t_buf_size))
            {
                rewind = true;
                state = CSS_TOKENIZER_STATE_NUMBER;
            }
            else
            {
                t.type = CSS_TOKEN_DELIM;
                emit = true;
                rewind = true;
                rewind_start = 1;
            }
            break;
        ...
        // more state handlers
        }
        if (consume && cp_len > 0) { buf_cur += (uint32_t)cp_len; }
        if (rewind)
        {
            for (uint32_t i = rewind_start; i < t_buf_size; i++)
            {
                unsigned char b[4] = { 0 };
                buf_cur -= (uint32_t)utf8_encode(t_buf[i], b);
            }
            t_buf_size = 0;
        }
    }
    return t;
}
```
The above approach allows us to reprocess code points by rewinding the cursor position. One drawback is that it introduces an additional layer of logic.
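The `is_number_start()` check used in the handler above isn't shown, so here is a minimal sketch of it following the "check if three code points would start a number" algorithm from the CSS Syntax spec. This is my reconstruction, not the project's actual implementation; the buffered code points stand in for the spec's three code points of lookahead, with missing entries treated as 0 (end of input):

```c
#include <stdbool.h>
#include <stdint.h>

static bool is_digit(uint32_t cp)
{
    return cp >= '0' && cp <= '9';
}

// Would the buffered code points start a number?
static bool is_number_start(const uint32_t *cps, uint32_t n)
{
    uint32_t c1 = n > 0 ? cps[0] : 0;
    uint32_t c2 = n > 1 ? cps[1] : 0;
    uint32_t c3 = n > 2 ? cps[2] : 0;

    if (c1 == '+' || c1 == '-')
    {
        // "+5", "-.5" start numbers; "+a" does not
        return is_digit(c2) || (c2 == '.' && is_digit(c3));
    }
    if (c1 == '.')
    {
        return is_digit(c2); // ".5" starts a number, ".a" does not
    }
    return is_digit(c1);
}
```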
Another issue I encountered was the need for nested states. For example, different states can transition to `CSS_TOKENIZER_STATE_ID_SEQ` or `CSS_TOKENIZER_STATE_ESCAPE`. The tokenizer needs to know which state to return to once handling of the nested state is complete.
Initially, I used two different approaches to address this:
1. **A new variable called `r_state`.** Any state transitioning to `CSS_TOKENIZER_STATE_ID_SEQ` would also set `r_state` to indicate where the tokenizer should return once the identifier sequence was consumed.
2. **Duplicated states to handle escaped code points.** Each state that needed to handle escaped code points had its own copy of the handling logic under a new state, `CSS_TOKENIZER_STATE_..._ESCAPE`. These new states could only transition back to their parent state. This approach was essentially the same as `r_state`, but embedded directly in the state machine.
The issues with these approaches were clear from the start: the `r_state` solution only allowed a single level of nesting, while the `_ESCAPE_` states resulted in significant code duplication. To address this, I added a fixed-size stack of states (size 10) and merged all the `_ESCAPE_` states into a single state.
```c
static css_tokenizer_state_e stack[10] = { 0 };
static uint32_t stack_idx = 0;
...
static void push_state(css_tokenizer_state_e state);
static void pop_state();
// change the top state, equivalent to pop -> push
static void change_state(css_tokenizer_state_e state);
static css_tokenizer_state_e get_state();
```
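The bodies of these helpers aren't shown above, so here is one self-contained way they could look. The enum values below are placeholders for the real (much larger) state enum, and silently ignoring stack overflow/underflow is my assumption; a real tokenizer might assert instead:

```c
#include <stdint.h>

typedef enum
{
    CSS_TOKENIZER_STATE_DATA, // placeholder values for this sketch
    CSS_TOKENIZER_STATE_ID_SEQ,
    CSS_TOKENIZER_STATE_ESCAPE
} css_tokenizer_state_e;

static css_tokenizer_state_e stack[10] = { 0 };
static uint32_t stack_idx = 0;

static void push_state(css_tokenizer_state_e state)
{
    if (stack_idx + 1 < 10) { stack[++stack_idx] = state; }
}

static void pop_state(void)
{
    if (stack_idx > 0) { stack_idx--; }
}

// change the top state, equivalent to pop -> push
static void change_state(css_tokenizer_state_e state)
{
    stack[stack_idx] = state;
}

static css_tokenizer_state_e get_state(void)
{
    return stack[stack_idx];
}
```

With this in place, any state that encounters an escape can call `push_state(CSS_TOKENIZER_STATE_ESCAPE)`, and the single escape handler simply calls `pop_state()` when done, returning control to whichever state pushed it.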
One idea that came to mind once I started modifying the code was that it might be simpler to allow the tokenizer to peek at the first three code points in the buffer. This would eliminate the need for the rewind logic. The tokenizer would then look like this:
```c
css_token_t css_tokenizer_next()
{
    css_token_t t = { .type = CSS_TOKEN_EOF };
    uint32_t end_cp = 0;
    uint32_t cp1 = 0;
    int32_t cp1_len = -1;
    uint32_t cp2 = 0;
    int32_t cp2_len = -1;
    uint32_t cp3 = 0;
    int32_t cp3_len = -1;
    uint32_t escaped_cp = 0;
    uint32_t escaped_cp_digits = 0;
    bool emit = false;
    reset_states();
    change_state(CSS_TOKENIZER_STATE_DATA);
    while (!emit)
    {
        bool consume = true;
        bool consume_peeked = false;
        bool is_eof = false;
        css_tokenizer_state_e state = get_state();
        // reset the peeked lengths so stale values don't survive the iteration
        cp2_len = -1;
        cp3_len = -1;
        cp1_len = utf8_decode(buf, buf_size, buf_cur, &cp1);
        if (cp1_len < 0)
        {
            cp1 = 0;
            is_eof = true;
        }
        else
        {
            cp2_len = utf8_decode(buf, buf_size, buf_cur + (uint32_t)cp1_len, &cp2);
        }
        if (cp2_len > 0)
        {
            cp3_len = utf8_decode(buf, buf_size, buf_cur + (uint32_t)cp1_len + (uint32_t)cp2_len, &cp3);
        }
        switch (state)
        {
        // ... process states
        }
        if ((consume || consume_peeked) && cp1_len > 0)
        {
            buf_cur += (uint32_t)cp1_len;
        }
        if (consume_peeked && cp2_len > 0)
        {
            buf_cur += (uint32_t)cp2_len;
        }
        if (consume_peeked && cp3_len > 0)
        {
            buf_cur += (uint32_t)cp3_len;
        }
    }
    return t;
}
```
`consume` is still used to advance the buffer. A new variable, `consume_peeked`, is used when we also want to consume `cp2` and `cp3`, which actually only happens in one state. I realise that a call like the one below is very ugly, but in my opinion it is still better than the `t_buf` approach.
```c
cp3_len = utf8_decode(buf, buf_size, buf_cur + (uint32_t)cp1_len + (uint32_t)cp2_len, &cp3);
```
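One way I can think of to tidy the chained offset arithmetic (my sketch, not the project's code) is a helper that decodes the n-th code point after the cursor. A minimal UTF-8 decoder is included only so the sketch is self-contained; it stands in for the tokenizer's own `utf8_decode`:

```c
#include <stdint.h>

// Minimal UTF-8 decoder (no overlong/surrogate checks), standing in for
// the tokenizer's utf8_decode(). Returns the byte length of the code
// point at `pos`, or -1 on end of input / invalid byte.
static int32_t utf8_decode(const unsigned char *buf, uint32_t buf_size,
                           uint32_t pos, uint32_t *cp)
{
    if (pos >= buf_size) { return -1; }
    unsigned char b = buf[pos];
    int32_t len;
    uint32_t v;
    if (b < 0x80)                { len = 1; v = b; }
    else if ((b & 0xE0) == 0xC0) { len = 2; v = b & 0x1Fu; }
    else if ((b & 0xF0) == 0xE0) { len = 3; v = b & 0x0Fu; }
    else if ((b & 0xF8) == 0xF0) { len = 4; v = b & 0x07u; }
    else                         { return -1; }
    if (pos + (uint32_t)len > buf_size) { return -1; }
    for (int32_t i = 1; i < len; i++)
    {
        if ((buf[pos + i] & 0xC0) != 0x80) { return -1; } // bad continuation
        v = (v << 6) | (buf[pos + i] & 0x3Fu);
    }
    *cp = v;
    return len;
}

// Decode the n-th code point after the cursor (n = 0 is the current one).
// Returns that code point's byte length, or -1 if it doesn't exist.
static int32_t peek_cp(const unsigned char *buf, uint32_t buf_size,
                       uint32_t cur, uint32_t n, uint32_t *cp)
{
    uint32_t pos = cur;
    int32_t len = -1;
    for (uint32_t i = 0; i <= n; i++)
    {
        len = utf8_decode(buf, buf_size, pos, cp);
        if (len < 0) { return -1; }
        pos += (uint32_t)len;
    }
    return len;
}
```

The three decode calls then collapse to `peek_cp(buf, buf_size, buf_cur, 0, &cp1)`, `peek_cp(..., 1, &cp2)` and `peek_cp(..., 2, &cp3)`, at the cost of re-decoding earlier code points on each peek.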
Another thing I did this sprint was to add all of the `hash.txt` tests. The tests table now looks like this:
| Test | # passing |
|---|---|
| at-keyword | 9/9 |
| bad-string | 5/5 |
| bad-url | 8/8 |
| colon | 1/1 |
| comma | 1/1 |
| comment | 6/6 |
| digit | 1/1 |
| dimension | 7/8 |
| escaped-code-point | 16/16 |
| full-stop | 3/3 |
| fuzz | 7/12 (new) |
| hash | 15/15 (new) |
| hyphen-minus | 6/6 |
| ident-like | 9/9 |
| ident | 9/9 |
| left-curly-bracket | 1/1 |
| left-parenthesis | 1/1 |
| left-square-bracket | 1/1 |
| less-than | 4/4 |
| number | 20/20 |
| numeric | 4/4 |
| plus | 4/4 |
| right-curly-bracket | 1/1 |
| right-parenthesis | 1/1 |
| right-square-bracket | 1/1 |
| semi-colon | 1/1 |
| string | 9/9 |
| url | 13/15 |
| whitespace | 8/8 |
For the next sprint I will work on an initial implementation of the parser that can be tested. I did some preliminary googling to find tests for the parser and came across postcss-parser-tests, which I think I can extract some tests from.
Martin