Now I am become the Destroyer of Threads

We can confidently say that multithreading is a game-changing addition to the WebAssembly spec, considering all that it has enabled us to do, like creating a Node runtime that runs on the Web. Moreover, as we have said before, the Rust tooling around it is (generally) excellent. We borrow liberally from the crates ecosystem, and being able to just pull in, say, parking_lot and use it out of the box in your project feels like holding fire in your hands.

However, we have also mentioned in the past how we frequently run into the limitations of the current state of affairs. We recently encountered a particularly nasty one that we had to go ahead and solve ourselves…

No destructor for you

If you use WebAssembly in Rust, you are most likely doing it through wasm-bindgen, which is pretty much the standard for Wasm <-> host interoperation (especially if you’re targeting the Web!). Although wasm-bindgen supports multithreading, it requires that you use nightly Rust and enable the right combination of compiler flags. It’s not ideal, but it gets the job done.

This support is carefully threaded^[1] between LLVM, the Rust compiler, its standard library and wasm-bindgen itself, and it does not fully support all thread-based abstractions. For instance, std::thread does not work and there is little hope for that since WebAssembly itself does not define any notion of “thread/agent spawning”. Instead, wasm-bindgen assumes you’ll be the one spawning workers (either WebWorkers or Node Workers) and manually sharing the Wasm code and the Memory instances between them.

Another subtler thing that does not work out of the box is “thread destructors”. Every time you run your Wasm code in a new thread, it needs to initialize some memory for its exclusive use, namely TLS and a stack:

TLS means “thread-local storage”, and it’s required for every thread-local variable you use (for example, through thread_local!).
The need for a “stack” might be a little more obscure: the WebAssembly abstract machine specifies both a linearly addressable memory and a stack of values that can be pushed to and popped from local variables. This stack is not directly addressable and does not live in linear memory. However, LLVM does generate a “shadow stack”, a piece of linear memory managed by the Wasm code, using a Global that acts as the “stack pointer”. Hence, both this space and the global need to be initialized for every thread.
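To make the TLS half of this concrete, here is a minimal sketch of what “thread-local” means in practice. It runs natively and uses std::thread purely for illustration (as noted above, that is not available on the Wasm target); the names CALLS, bump and bump_on_new_thread are ours, not part of any library.

```rust
use std::cell::Cell;
use std::thread;

thread_local! {
    // Every thread gets its own copy of this counter, stored in its TLS area.
    static CALLS: Cell<u32> = Cell::new(0);
}

/// Increments the calling thread's counter and returns the new value.
fn bump() -> u32 {
    CALLS.with(|c| {
        c.set(c.get() + 1);
        c.get()
    })
}

/// Bumps the counter `n` times on a freshly spawned thread and returns
/// the last value that thread observed.
fn bump_on_new_thread(n: u32) -> u32 {
    thread::spawn(move || {
        let mut last = 0;
        for _ in 0..n {
            last = bump();
        }
        last
    })
    .join()
    .expect("worker thread panicked")
}
```

The spawned thread starts counting from zero and never touches the caller's counter, which is exactly why each new thread needs a freshly initialized TLS chunk.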

What’s missing then? Well, after all this memory is allocated, there is no automatic way to deallocate it once it’s not needed anymore. If you’re running native, multithreaded code, the operating system is responsible for reclaiming all the space allocated for a thread when it finishes. But, as we hinted before, this model does not map well to some WebAssembly hosts, such as web browsers or Node: there is no obvious “scheduler” that should perform these cleaning tasks.

This “problem” is actually not that big of a deal depending on your use case. If you keep around a fixed-size worker pool to help with some heavy computations, there’s nothing to worry about. But if you spawn and tear down workers dynamically depending on user requests (for example, if you were to roughly map user-spawned processes to workers… 🙈) then this does affect you: you effectively have a memory leak which, by the way, is at least 1MiB per worker by default (!).
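As a rough back-of-the-envelope check of that figure, the sketch below computes the stack memory leaked after a number of workers have been torn down. It assumes the default of 16 pages per thread stack and ignores the (much smaller) TLS chunk; leaked_stack_bytes is a hypothetical helper, not an existing API.

```rust
/// Size of one WebAssembly memory page.
const PAGE_SIZE: u64 = 64 * 1024;
/// Pages reserved for each thread's stack (assumed default: 16 pages = 1MiB).
const STACK_PAGES: u64 = 16;

/// Lower bound on the bytes leaked once `workers_torn_down` workers are gone,
/// counting only their stacks.
fn leaked_stack_bytes(workers_torn_down: u64) -> u64 {
    workers_torn_down * STACK_PAGES * PAGE_SIZE
}
```

A thousand short-lived workers would therefore leave roughly 1000MiB of linear memory behind.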

What we did

We don’t have a general solution for how to map the Rust abstractions to the multithreaded WebAssembly target,^[2] but we would love to fix this leak. Since there is no obvious place to perform the cleanup, we proposed exposing a (highly unstable, not very visible) helper function, __wbindgen_thread_destroy(), to do it manually. This way we choose when to carry it out: it is a bit DIY, but so is initializing anyway.

How does it work? Let us first look at how that memory is initialized:

(func $start (params)
  ; CHECK_NEEDS_STACK
  ; we keep a thread counter somewhere in memory,
  ; which we now increase
  i32.const 327684 ; THREAD_COUNTER_ADDRESS
  i32.const 1
  i32.atomic.rmw.add

  ; the previous instruction gives us the _old_ value of the counter.
  ; if it is non-zero, we are not the first thread, so
  ; we need to initialize stuff.
  if
    ; grow memory 16 pages, totaling 1MiB
    memory.grow 16

    ; the previous line returned the previous memory size, in pages.
    ; multiplied by the page size, we have the previous
    ; memory size in bytes.
    i32.const 65536 ; 64KiB
    i32.mul

    ; plus 1MiB that we just added, we have the current size in bytes
    i32.const 1048576 ; 1MiB
    i32.add

    ; set the stack pointer. The stack grows backwards, so
    ; it should be equal to the current size of the memory
    global.set $stack_pointer
  else
    ; nothing to do
  end

  ; now we initialize TLS
  ; first, we call `malloc` to allocate some space
  i32.const 128 ; TLS_SIZE
  call $__wbindgen_malloc

  ; and now we can call the initialization function
  call $__wasm_init_tls
)

This $start function runs every time you instantiate the WebAssembly module in a new thread. It assumes the existence of a few things that are either emitted by the LLVM backend or arranged by wasm-bindgen itself: the $__wbindgen_malloc and $__wbindgen_free functions, the $__wasm_init_tls routine that initializes the TLS, the $stack_pointer global and the THREAD_COUNTER_ADDRESS memory address.
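If the stack arithmetic is easier to follow in Rust, here is a minimal model of the stack half of that routine. Memory and start are hypothetical stand-ins: Memory only tracks a size in pages, and start returns the value that would be written to $stack_pointer (or None for the first thread, which keeps the initial stack).

```rust
use std::sync::atomic::{AtomicU32, Ordering};

const PAGE_SIZE: u32 = 65_536; // 64KiB
const STACK_PAGES: u32 = 16; // 1MiB of stack per thread

/// Toy stand-in for linear memory: it only tracks its size in pages.
struct Memory {
    pages: u32,
}

impl Memory {
    /// Mirrors `memory.grow`: returns the previous size in pages.
    fn grow(&mut self, delta: u32) -> u32 {
        let previous = self.pages;
        self.pages += delta;
        previous
    }
}

/// Models the stack setup: bump the thread counter and, unless we are the
/// first thread, grow memory and place the stack pointer at its new end.
fn start(thread_counter: &AtomicU32, memory: &mut Memory) -> Option<u32> {
    // `fetch_add` returns the old value, like `i32.atomic.rmw.add`.
    let previous_count = thread_counter.fetch_add(1, Ordering::SeqCst);
    if previous_count == 0 {
        return None; // first thread: nothing to do
    }
    let previous_pages = memory.grow(STACK_PAGES);
    // Previous size in bytes plus the 1MiB we just added.
    Some(previous_pages * PAGE_SIZE + STACK_PAGES * PAGE_SIZE)
}
```

Note that nothing in this model remembers where the new stack lives once the stack pointer starts moving, which is the gap the next step addresses.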

OK, so we just need to “undo” everything that happened above. But first, let’s make sure that we keep enough data around that points to the memory that needs cleaning:

  i32.const 1048576 ; 1MiB
  i32.add
+ global.set $stack_alloc
+ global.get $stack_alloc
  global.set $stack_pointer

  ;...

  i32.const 128 ; TLS_SIZE
  call $__wbindgen_malloc
+ global.set $tls_base
+ global.get $tls_base
  call $__wasm_init_tls

Now we have a global ($tls_base) that points to the TLS chunk and another new global ($stack_alloc) that points to the new address of the stack pointer. Note that we can’t simply rely on $stack_pointer itself, since it is subject to change during runtime.^[3] We can try writing our destructor:

(func $__wbindgen_thread_destroy (params)
  ; we call `free` for the TLS chunk
  global.get $tls_base
  i32.const 128 ; TLS_SIZE
  call $__wbindgen_free

  ; and now we try to do the same with the stack
  global.get $stack_alloc
  ; hmmm, and now... memory.ungrow...?
)

There is a little problem though. The stack was allocated with a raw call to memory.grow, a low-level primitive from the WebAssembly runtime itself. There is no “memory.ungrow” instruction we could call to return this chunk back to the host.^[4] Even if it existed, it would be a bit weird to use: between the time when the stack was allocated and the moment we want to destroy it, the Wasm instance might have been called any number of times and new pages could have been added to the memory. “Freeing” a memory page “sandwiched” between two that are still in use does not make a lot of sense.

This is all just hinting that we should leave these matters to the allocator (those $__wbindgen_malloc and $__wbindgen_free calls we’ve already seen), which is in charge of growing memory when needed and keeping track of which pages can actually be reused. So we’ll go back to $start again and change the raw grow instruction into a call to malloc:

- ; grow memory 16 pages, totaling  1MiB
- memory.grow 16

- ; the previous line returned the previous memory size, in pages.
- ; multiplied by the page size, we have the previous
- ; memory size in bytes.
- i32.const 65536 ; 64KiB
- i32.mul

- ; plus 1MiB that we just added, we have the current size in bytes
+ ; allocate 1MiB for the stack
  i32.const 1048576 ; 1MiB
- i32.add
+ call $__wbindgen_malloc
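The idea of handing both chunks to the allocator and remembering their base addresses can be modelled in native Rust. This is only a sketch under our own assumptions: ThreadMemory is a hypothetical type, std::alloc stands in for $__wbindgen_malloc and $__wbindgen_free, and the 16-byte alignment is an arbitrary choice.

```rust
use std::alloc::{alloc, dealloc, Layout};

const TLS_SIZE: usize = 128;
const STACK_SIZE: usize = 1 << 20; // 1MiB
const ALIGN: usize = 16; // assumed alignment, for illustration only

/// Remembers the two per-thread allocations so they can be freed later,
/// playing the role of the `$tls_base` and `$stack_alloc` globals.
struct ThreadMemory {
    tls_base: *mut u8,
    stack_alloc: *mut u8,
}

impl ThreadMemory {
    fn layout(size: usize) -> Layout {
        Layout::from_size_align(size, ALIGN).expect("valid layout")
    }

    fn new() -> Self {
        // SAFETY: both layouts have a non-zero size.
        let tls_base = unsafe { alloc(Self::layout(TLS_SIZE)) };
        let stack_alloc = unsafe { alloc(Self::layout(STACK_SIZE)) };
        assert!(!tls_base.is_null() && !stack_alloc.is_null(), "allocation failed");
        ThreadMemory { tls_base, stack_alloc }
    }

    /// The stack is used from its highest address towards its base, so the
    /// initial stack pointer is the end of the chunk.
    fn stack_top(&self) -> usize {
        self.stack_alloc as usize + STACK_SIZE
    }
}

impl Drop for ThreadMemory {
    fn drop(&mut self) {
        // SAFETY: both pointers came from `alloc` with these exact layouts.
        unsafe {
            dealloc(self.tls_base, Self::layout(TLS_SIZE));
            dealloc(self.stack_alloc, Self::layout(STACK_SIZE));
        }
    }
}
```

In native Rust the destructor runs on the caller's ordinary stack, so this model says nothing about which stack the allocator itself runs on.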

And now we encounter a second, more subtle difficulty. Even if we don’t know much about $__wbindgen_malloc, there is one thing we should assume: that it most likely accesses $stack_pointer! That is the whole point of having a “stack”: that it acts as a “scratch space” in linear memory for each function to read and write freely from.^[5]

If $start is supposed to initialize the value of $stack_pointer, what would this previous malloc call encounter there? Well, the initial value of $stack_pointer is the same for all threads, and it is set to point to an unused 1MiB chunk of linear memory. The first spawned thread does not attempt to allocate a stack: it assumes that initial value is perfectly fine and uses it as its stack. Our attempt to use malloc above would clobber this chunk from two different threads concurrently, a recipe for total disaster. 🙈🙈

While we are at it, the same can be said about $__wbindgen_free: if it needs a stack to function, how do we expect it to work when we try to destroy that very stack while using it?

The final solution is not very elegant, but it works. Together with the 1MiB “initial stack” for the first thread, we will add a second (statically allocated) chunk of memory, the “temp stack”. Now, before calling malloc or free, we will set $stack_pointer to point to TEMP_STACK. Before we can do that, we have to make sure that other threads are not trying to do the same concurrently, so we will need to write a little mutex-y loop to “acquire” the temporary stack. When putting everything together, this is the final result:^[6]

(func $start (params)
  ; OMITTED: CHECK_NEEDS_STACK as above
  if
    ; use the temporary stack
    global.set $stack_pointer 393216 ; TEMP_STACK

    ; before calling any function, make sure
    ; the temporary stack is "available"

    ; TODO: GRAB_LOCK

    ; call malloc
    i32.const 1048576 ; 1MiB
    call $__wbindgen_malloc

    ; save the newly allocated stack to destroy it later
    global.set $stack_alloc

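    ; set the current stack pointer. The stack grows backwards,
    ; so it must point to the end of the new chunk
    global.get $stack_alloc
    i32.const 1048576 ; 1MiB
    i32.add
    global.set $stack_pointer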

    ; TODO: RELEASE_LOCK
  else
  end

  ; OMITTED: tls initialization is the same
)

(func $__wbindgen_thread_destroy (params)
  ; we call `free` for the TLS chunk
  global.get $tls_base
  i32.const 128 ; TLS_SIZE
  call $__wbindgen_free

  ; and now we try to do the same with the stack

  ; use the temporary stack
  global.set $stack_pointer 393216 ; TEMP_STACK

  ; before calling any function, make sure
  ; the temporary stack is "available"

  ; TODO: GRAB_LOCK

  global.get $stack_alloc
  i32.const 1048576 ; 1MiB
  call $__wbindgen_free

  ; TODO: RELEASE_LOCK
)
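For readers more at home in Rust than in raw Wasm, the GRAB_LOCK and RELEASE_LOCK steps can be sketched as a tiny compare-and-swap lock. This is a native model under our own assumptions: TempStackLock and contended_total are hypothetical names, and where a Wasm thread would go to sleep waiting on the lock address, this version simply yields.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

/// 0 means "free", 1 means "held".
struct TempStackLock {
    state: AtomicU32,
}

impl TempStackLock {
    const fn new() -> Self {
        Self { state: AtomicU32::new(0) }
    }

    /// GRAB_LOCK: atomically flip 0 -> 1, retrying while someone else holds it.
    fn acquire(&self) {
        while self
            .state
            .compare_exchange(0, 1, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            thread::yield_now(); // a Wasm thread would block here instead
        }
    }

    /// RELEASE_LOCK: set the value back to 0.
    fn release(&self) {
        self.state.store(0, Ordering::Release);
    }
}

/// Spawns `threads` workers that each perform `rounds` deliberately
/// non-atomic read-modify-write updates while holding the lock.
fn contended_total(threads: u32, rounds: u32) -> u32 {
    let lock = Arc::new(TempStackLock::new());
    let shared = Arc::new(AtomicU32::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let (lock, shared) = (Arc::clone(&lock), Arc::clone(&shared));
            thread::spawn(move || {
                for _ in 0..rounds {
                    lock.acquire();
                    // Split load/store: only correct because the lock is held.
                    let value = shared.load(Ordering::Relaxed);
                    shared.store(value + 1, Ordering::Relaxed);
                    lock.release();
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().expect("worker panicked");
    }
    shared.load(Ordering::Relaxed)
}
```

The shared counter plays the role of the temporary stack: a resource that is only safe to touch by one thread at a time. If the lock works, no update is lost and the total equals threads × rounds.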

Happily Ever After

All this work was pretty fun and taught us a few things about low-level WebAssembly and its tooling. The shortcomings that we have shown in our partial solutions above are not narrative tricks: they are exactly the mistakes we made along the way. We got it merged into wasm-bindgen a while ago, with the invaluable support of its maintainers. So we feel we’ve earned the (somewhat ostentatious) title of this post: we have become the Destroyer of Threads.^[7]

Multithreaded WebAssembly recently reached phase 3 in the proposals pipeline and has been shipped by all major JS runtimes for quite a while already. A more comprehensive way to use it is already in the works with wasi-threads, so the time to give it a try is now!

[1] “threaded”! See what we did here?
[2] The state of multithreaded WebAssembly has not changed much in the last 4 years (!). Most of the difficulties we described were already laid out quite clairvoyantly in this post by the Rust/Wasm Working Group.
[3] We are hand-waving the fact that, when we modify the $start function, we also have to modify the rest of the module to make sure the required elements are present. For instance, we have to add the new (global) declarations to the module definition. We also need to alter the initial memory layout when we later introduce a temporary stack.
[4] This is a known shortcoming of WebAssembly as it stands today. For an interesting discussion see design#1397, where an in-flux proposal is mentioned.
[5] Not 100% “freely”: they are subject to the calling convention.

[6] For simplicity, we have omitted what the actual locking looks like, but here it is in case you are interested:

; GRAB_LOCK
loop $acquire
  ; try to acquire the lock
  ; that is by atomically exchanging its value from 0 to 1
  i32.const 327684 ; TEMP_STACK_LOCK
  i32.const 0
  i32.const 1
  i32.atomic.rmw.cmpxchg

  ; if it did _not_ return 0, it is currently locked,
  ; so we need to put the thread to sleep
  if
    i32.const 327684 ; TEMP_STACK_LOCK
    i32.const 1
    i64.const -1
    memory.atomic.wait32
    drop

    ; keep looping
    br $acquire
  else

  ; the previous value was 0, so we can stop looping

  end
end

; ...

; RELEASE_LOCK
; set the lock value back to 0
i32.const 327684 ; TEMP_STACK_LOCK
i32.const 0
i32.atomic.store

; awake 1 blocked thread, if any
i32.const 327684 ; TEMP_STACK_LOCK
i32.const 1
memory.atomic.notify

[7] This is not the end of the story, by the way. Just recently we had to go back and tweak this destruction routine to better fit our use case. Still having fun!
