We can confidently say that multithreading is a game-changing addition to the WebAssembly spec, considering all that it has enabled us to do, like creating a Node runtime that runs on the Web. Moreover, as we have said before, the Rust tooling around it is (generally) excellent. We borrow liberally from the crates ecosystem, and being able to just pull in, say, parking_lot and use it out of the box in your project feels like holding fire in your hands.

However, we have also mentioned in the past how we frequently run into the limitations of the current state of affairs. We recently encountered a particularly nasty one that we had to go ahead and solve ourselves...

No destructor for you

If you use WebAssembly in Rust, you are most likely doing it through wasm-bindgen, which is pretty much the standard for Wasm <--> host interoperation (especially if you're targeting the Web!). Although wasm-bindgen supports multithreading, it requires that you use nightly Rust and enable the right combination of compiler flags. It's not ideal, but it gets the job done.

This support is carefully threaded[1] between LLVM, the Rust compiler, its standard library and wasm-bindgen itself, and it does not cover all thread-based abstractions. For instance, std::thread does not work, and there is little hope that it will, since WebAssembly itself does not define any notion of "thread/agent spawning". Instead, wasm-bindgen assumes you'll be the one spawning workers (either Web Workers or Node Workers) and manually sharing the Wasm code and the Memory instances between them.

Another, subtler thing that does not work out of the box is "thread destructors". Every time you run your Wasm code in a new thread, it needs to initialize some memory for its exclusive use, namely thread-local storage (TLS) and a stack.

What's missing then? Well, after all this memory is allocated, there is no automatic way to deallocate it once it's not needed anymore. If you're running native, multithreaded code, the operating system is responsible for reclaiming all the space allocated for a thread when it finishes. But, as we hinted before, this model does not map well to some WebAssembly hosts, such as web browsers or Node: there is no obvious "scheduler" that should perform these cleanup tasks.

This "problem" is actually not that big of a deal, depending on your use case. If you keep a fixed-size worker pool around to help with some heavy computations, there's nothing to worry about. But if you spawn and tear down workers dynamically depending on user requests (for example, if you were to roughly map user-spawned processes to workers... 🙈) then this does affect you: you effectively have a memory leak which, by the way, is at least 1MiB per worker by default (!).
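To get a feeling for how quickly this adds up, here is a back-of-the-envelope sketch in Rust. The function name and the `tls_size` parameter are our own, purely for illustration; the only numbers taken from above are the 64KiB page size and the 16-page (1MiB) default stack.

```rust
/// Size of one WebAssembly page: 64 KiB.
const PAGE_SIZE: u64 = 64 * 1024;
/// Default per-thread stack described in this post: 16 pages = 1 MiB.
const STACK_PAGES: u64 = 16;

/// Lower bound on the bytes leaked after `workers_torn_down` workers
/// (each of which allocated its own stack) were spawned and terminated
/// without any cleanup.
/// `tls_size` is a hypothetical per-thread TLS size in bytes.
fn leaked_bytes(workers_torn_down: u64, tls_size: u64) -> u64 {
    workers_torn_down * (STACK_PAGES * PAGE_SIZE + tls_size)
}
```

For instance, `leaked_bytes(1000, 128)` is 1,048,704,000 bytes: just over 1000MiB of linear memory that is never handed back.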

What we did

We don't have a general solution for how to map the Rust abstractions to the multithreaded WebAssembly target,[2] but we would love to fix this leak. Since there is no obvious place to perform the cleanup, we proposed exposing a (highly unstable, not very visible) helper function, __wbindgen_thread_destroy(), to do it manually. This way we choose when to carry it out: it is a bit DIY, but so is initializing anyway.

How does it work? Let us first look at how that memory is initialized:

(func $start (params)
; CHECK_NEEDS_STACK
; we keep a thread counter somewhere in memory,
; which we now atomically increase by one
i32.const 327684 ; THREAD_COUNTER_ADDRESS
i32.const 1
i32.atomic.rmw.add

; the previous instruction gives us the _old_ value of the counter.
; if it is non-zero, we are not the first thread, so
; we need to initialize stuff.
if
; grow memory by 16 pages, totaling 1MiB
i32.const 16
memory.grow

; the previous line returned the previous memory size, in pages.
; multiplied by the page size, we have the previous
; memory size in bytes.
i32.const 65536 ; 64KiB
i32.mul

; plus 1MiB that we just added, we have the current size in bytes
i32.const 1048576 ; 1MiB
i32.add

; set the stack pointer. The stack grows backwards, so
; it should be equal to the current size of the memory
global.set $stack_pointer
else
; nothing to do
end

; now we initialize TLS
; first, we call `malloc` to allocate some space
i32.const 128 ; TLS_SIZE
call $__wbindgen_malloc

; and now we can call the initialization function
call $__wasm_init_tls
)

This $start function runs every time you instantiate the WebAssembly module in a new thread. It assumes the existence of a few things that are either emitted by the LLVM backend or arranged by wasm-bindgen itself: the $__wbindgen_malloc and $__wbindgen_free functions, the $__wasm_init_tls routine that initializes the TLS, the $stack_pointer global and the THREAD_COUNTER_ADDRESS memory address.
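If the stack-based notation is hard to follow, the stack half of $start can be modelled in plain Rust. This is only a sketch under our own assumptions: `SharedMemory` and its methods are hypothetical stand-ins that track nothing but the memory size in pages and the thread counter, and the TLS half is left out.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

const PAGE_SIZE: u32 = 65_536; // 64 KiB per WebAssembly page
const STACK_PAGES: u32 = 16; // 16 pages = 1 MiB of stack

/// Hypothetical model of the shared linear memory: only its size
/// (in pages) and the thread counter are tracked.
struct SharedMemory {
    pages: AtomicU32,
    thread_counter: AtomicU32,
}

impl SharedMemory {
    fn new(initial_pages: u32) -> Self {
        Self {
            pages: AtomicU32::new(initial_pages),
            thread_counter: AtomicU32::new(0),
        }
    }

    /// Current size of the memory, in pages.
    fn pages(&self) -> u32 {
        self.pages.load(Ordering::SeqCst)
    }

    /// Mimics `memory.grow`: adds `delta` pages and returns the
    /// previous size in pages (failure is not modelled).
    fn grow(&self, delta: u32) -> u32 {
        self.pages.fetch_add(delta, Ordering::SeqCst)
    }

    /// Mimics the stack part of `$start`. Returns the new stack
    /// pointer, or `None` for the first thread, which keeps using
    /// the initial stack.
    fn start(&self) -> Option<u32> {
        // Atomically bump the counter and look at its old value.
        let previous_count = self.thread_counter.fetch_add(1, Ordering::SeqCst);
        if previous_count == 0 {
            return None;
        }
        // Previous size in bytes plus the 1 MiB we just added: the
        // stack grows backwards from the new end of the memory.
        let previous_pages = self.grow(STACK_PAGES);
        Some(previous_pages * PAGE_SIZE + STACK_PAGES * PAGE_SIZE)
    }
}
```

Calling `start()` once per simulated thread shows the pattern: the first call leaves the memory untouched, and every later call grows it by 16 pages that nothing ever gives back.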

OK, so we just need to "undo" everything that happened above. But first, let's make sure that we keep enough data around that points to the memory that needs cleaning:

  i32.const 1048576 ; 1MiB
  i32.add
+ global.set $stack_alloc
+ global.get $stack_alloc
  global.set $stack_pointer

  ;...

  i32.const 128 ; TLS_SIZE
  call $__wbindgen_malloc
+ global.set $tls_base
+ global.get $tls_base
  call $__wasm_init_tls

Now we have a global ($tls_base) that points to the TLS chunk and another new global ($stack_alloc) that points to the new address of the stack pointer. Note that we can't simply rely on $stack_pointer itself, since it is subject to change during runtime.[3] We can try writing our destructor:

(func $__wbindgen_thread_destroy (params)
; we call `free` for the TLS chunk
global.get $tls_base
i32.const 128 ; TLS_SIZE
call $__wbindgen_free

; and now we try to do the same with the stack
global.get $stack_alloc
; hmmm, and now... memory.ungrow...?
)

There is a little problem though. The stack was allocated with a raw call to memory.grow, a low-level primitive from the WebAssembly runtime itself. There is no "memory.ungrow" instruction we could call to return this chunk to the host.[4] Even if it existed, it would be a bit weird to use: between the time when the stack was allocated and the moment we want to destroy it, the Wasm instance might have been called any number of times and new pages could have been added to the memory. "Freeing" a memory page "sandwiched" between two that are still in use does not make a lot of sense.

This is all just hinting that we should leave these matters to the allocator (those $__wbindgen_malloc and $__wbindgen_free calls we've already seen), which is in charge of growing memory when needed and keeping track of which pages can actually be reused. So we'll go back to $start again and change the raw grow instruction into a call to malloc:

- ; grow memory 16 pages, totaling 1MiB
- memory.grow 16
-
- ; the previous line returned the previous memory size, in pages.
- ; multiplied by the page size, we have the previous
- ; memory size in bytes.
- i32.const 65536 ; 64KiB
- i32.mul
-
- ; plus 1MiB that we just added, we have the current size in bytes
+ ; allocate 1MiB for the stack
  i32.const 1048576 ; 1MiB
- i32.add
+ call $__wbindgen_malloc
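The appeal of going through the allocator is easier to see in ordinary Rust, on a native target where the allocator already has a stack of its own to run on. The sketch below is not what wasm-bindgen emits: `ThreadStack` is a hypothetical type, and the 16-byte alignment is an assumption of ours. It only shows the bookkeeping: remember the base address at allocation time, start the stack pointer at the high end of the chunk, and hand the very same chunk back when done.

```rust
use std::alloc::{alloc, dealloc, Layout};

const STACK_SIZE: usize = 1 << 20; // 1 MiB, as in the post
const STACK_ALIGN: usize = 16; // assumed alignment, for illustration only

/// Hypothetical record of an allocator-backed thread stack.
struct ThreadStack {
    base: *mut u8,
    layout: Layout,
}

impl ThreadStack {
    /// Asks the allocator for the stack instead of growing memory directly.
    fn allocate() -> Option<Self> {
        let layout = Layout::from_size_align(STACK_SIZE, STACK_ALIGN).ok()?;
        // SAFETY: `layout` has a non-zero size.
        let base = unsafe { alloc(layout) };
        if base.is_null() {
            None
        } else {
            Some(Self { base, layout })
        }
    }

    /// The stack grows backwards, so the initial stack pointer is the
    /// high end of the chunk, not the address the allocator returned.
    fn initial_stack_pointer(&self) -> usize {
        self.base as usize + self.layout.size()
    }
}

impl Drop for ThreadStack {
    fn drop(&mut self) {
        // SAFETY: `base` came from `alloc` with this exact layout and
        // has not been freed before.
        unsafe { dealloc(self.base, self.layout) };
    }
}
```

Because the allocator keeps track of the chunk, freeing it makes that memory reusable for later allocations, which is exactly what a raw memory.grow cannot offer.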

And now we encounter a second, more subtle difficulty. Even if we don't know much about $__wbindgen_malloc, there is one thing we should assume: that it most likely accesses $stack_pointer! That is the whole point of having a "stack": that it acts as a "scratch space" in linear memory for each function to read and write freely from.[5]

If $start is supposed to initialize the value of $stack_pointer, what would this previous malloc call encounter there? Well, the initial value of $stack_pointer is the same for all threads, and it is set to point to an unused 1MiB chunk of linear memory. The first spawned thread does not attempt to allocate a stack: it assumes that initial value is perfectly fine and uses it as its stack. Our attempt to use malloc above would clobber this chunk from two different threads concurrently, a recipe for total disaster. 🙈🙈

While we are at it, the same can be said about $__wbindgen_free: if it needs a stack to function, how do we expect it to work when we try to destroy that very stack while using it?

The final solution is not very elegant, but it works. Together with the 1MiB "initial stack" for the first thread, we will add a second (statically allocated) chunk of memory, the "temp stack". Now, before calling malloc or free, we will set $stack_pointer to point to TEMP_STACK. Before we can do that, we have to make sure that other threads are not trying to do the same concurrently, so we will need to write a little mutex-y loop to "acquire" the temporary stack. When putting everything together, this is the final result:[6]

(func $start (params)
; OMITTED: CHECK_NEEDS_STACK as above
if
; use the temporary stack
i32.const 393216 ; TEMP_STACK
global.set $stack_pointer

; before calling any function, make sure
; the temporary stack is "available"

; TODO: GRAB_LOCK

; call malloc
i32.const 1048576 ; 1MiB
call $__wbindgen_malloc

; save the newly allocated stack to destroy it later
global.set $stack_alloc

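; set the current stack pointer to the high end of the
; new chunk, since the stack grows backwards
global.get $stack_alloc
i32.const 1048576 ; 1MiB
i32.add
global.set $stack_pointer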

; TODO: RELEASE_LOCK
else
end

; OMITTED: tls initialization is the same
)

(func $__wbindgen_thread_destroy (params)
; we call `free` for the TLS chunk
global.get $tls_base
i32.const 128 ; TLS_SIZE
call $__wbindgen_free

; and now we try to do the same with the stack

; use the temporary stack
i32.const 393216 ; TEMP_STACK
global.set $stack_pointer

; before calling any function, make sure
; the temporary stack is "available"

; TODO: GRAB_LOCK

global.get $stack_alloc
i32.const 1048576 ; 1MiB
call $__wbindgen_free

; TODO: RELEASE_LOCK
)
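The GRAB_LOCK and RELEASE_LOCK steps are left as TODOs above. As a rough idea of what such a "mutex-y loop" can look like, here is a minimal spinlock sketch in Rust. It is our own illustration and not the code that wasm-bindgen generates: `TempStackLock`, its methods, and the choice of 0 for "free" and 1 for "taken" are all assumptions.

```rust
use std::sync::atomic::{AtomicI32, Ordering};

/// Hypothetical lock word guarding the temporary stack:
/// 0 means "free", 1 means "taken".
struct TempStackLock {
    state: AtomicI32,
}

impl TempStackLock {
    const fn new() -> Self {
        Self {
            state: AtomicI32::new(0),
        }
    }

    /// GRAB_LOCK: spin until this thread is the one that flips 0 -> 1.
    fn acquire(&self) {
        while self
            .state
            .compare_exchange_weak(0, 1, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            std::hint::spin_loop();
        }
    }

    /// RELEASE_LOCK: mark the temporary stack as free again.
    fn release(&self) {
        self.state.store(0, Ordering::Release);
    }

    /// Runs `f` (standing in for the malloc or free call) while holding
    /// the lock. If `f` panics the lock stays taken, which is acceptable
    /// for a sketch but not for production code.
    fn with_temp_stack<T>(&self, f: impl FnOnce() -> T) -> T {
        self.acquire();
        let result = f();
        self.release();
        result
    }
}
```

The critical section is tiny (a single allocator call), which is why a busy loop is a reasonable fit here: no thread should hold the temporary stack for long.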

Happily Ever After

All this work was pretty fun and taught us a few things about low-level WebAssembly and its tooling. The shortcomings that we have shown in our partial solutions above are not narrative tricks: they are exactly the mistakes we made along the way. We got it merged into wasm-bindgen a while ago, with the invaluable support of its maintainers. So we feel we've earned the (somewhat ostentatious) title of this post: we have become the Destroyer of Threads.[7]

Multithreaded WebAssembly recently reached stage 3 in the proposals pipeline and has been shipping in all major JS runtimes for quite a while. A more comprehensive way to use it is already cooking in the form of wasi-threads, so the time to give it a try is now!


Roberto Vidal

Engineer at StackBlitz. Talk to me about #rustlang
