We can confidently say that multithreading is a game-changing addition to the WebAssembly spec, considering all that it has enabled us to do, like creating a Node runtime that runs on the Web. Moreover, as we have said before, the Rust tooling around it is (generally) excellent. We borrow liberally from the crates ecosystem, and being able to just pull, say, parking_lot and use it out of the box in your project feels like holding fire in your hands.
However, we have also mentioned in the past how we frequently run into the limitations of the current state of affairs. We recently encountered a particularly nasty one that we had to go ahead and solve ourselves…
No destructor for you
If you use WebAssembly in Rust, you are most likely doing it through wasm-bindgen, which is pretty much the standard for Wasm <-> host interoperation (especially if you’re targeting the Web!). Although wasm-bindgen supports multi-threading, it requires that you use nightly Rust and enable the right combination of compiler flags. It’s not ideal, but it gets the job done.
This support is carefully threaded[1] between LLVM, the Rust compiler, its standard library and wasm-bindgen itself, and it does not fully support all thread-based abstractions. For instance, std::thread does not work and there is little hope for that since WebAssembly itself does not define any notion of “thread/agent spawning”. Instead, wasm-bindgen assumes you’ll be the one spawning workers (either WebWorkers or Node Workers) and manually sharing the Wasm code and the Memory instances between them.
Another subtler thing that does not work out of the box is “thread destructors”. Every time you run your Wasm code in a new thread, it needs to initialize some memory for its exclusive use, namely TLS and a stack:
- TLS means “thread-local storage”, and it’s required for every thread-local variable you use (for example, through thread_local!).
- The need for a “stack” might be a little more obscure: the WebAssembly abstract machine specifies both a linearly addressable memory and a stack of values that can be pushed/popped into local variables. This stack is not directly addressable per se and does not live together with the linear memory. However, LLVM does generate a “shadow stack”, a piece of linear memory managed by the Wasm code, using a Global acting as “stack pointer”. Hence, both this space and the global need to be initialized for every thread.
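To make the shadow stack a bit more concrete, here is a minimal Rust model of it. This is only a sketch: the ShadowStack type and its methods are hypothetical names invented for illustration, and a real module keeps the stack pointer in a Wasm global rather than in a struct field.

```rust
/// Toy model of a linear memory that hosts an LLVM-style shadow stack.
/// Every name here is hypothetical and exists only for illustration.
struct ShadowStack {
    /// Stand-in for the Wasm linear memory.
    memory: Vec<u8>,
    /// Plays the role of the stack pointer global.
    stack_pointer: usize,
    /// Lowest address that the stack is allowed to reach.
    stack_limit: usize,
}

impl ShadowStack {
    /// Reserves the last `stack_size` bytes of a `memory_size`-byte memory.
    fn new(memory_size: usize, stack_size: usize) -> Self {
        ShadowStack {
            memory: vec![0; memory_size],
            // The stack grows downwards, so it starts at the very top.
            stack_pointer: memory_size,
            stack_limit: memory_size - stack_size,
        }
    }

    /// Function prologue: carve out `size` bytes of scratch space.
    /// Returns the address of the new frame, or `None` on overflow.
    fn push_frame(&mut self, size: usize) -> Option<usize> {
        let new_sp = self.stack_pointer.checked_sub(size)?;
        if new_sp < self.stack_limit {
            return None;
        }
        self.stack_pointer = new_sp;
        Some(new_sp)
    }

    /// Function epilogue: give the scratch space back.
    fn pop_frame(&mut self, size: usize) {
        self.stack_pointer += size;
    }
}
```

Two threads sharing the same memory but also the same initial stack_pointer would hand out overlapping frames, which is why each thread needs its own region and its own copy of the pointer.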
What’s missing then? Well, after all this memory is allocated, there is no automatic way to deallocate it once it’s not needed anymore. If you’re running native, multithreaded code, when a thread finishes, the operating system is responsible for reclaiming all the space allocated for it. But, as we hinted before, this model does not map well to some WebAssembly hosts, such as web browsers or Node: there is no obvious “scheduler” that should perform these cleaning tasks.
This “problem” is actually not that big of a deal depending on your use case. If you keep around a fixed-size worker pool to help with some heavy computations, there’s nothing to worry about. But if you spawn and tear down workers dynamically depending on user requests (for example, if you were to roughly map user-spawned processes to workers… 🙈) then this does affect you: you effectively have a memory leak which, by the way, is at least 1MiB per worker by default (!).
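As a rough back-of-the-envelope estimate, the leak grows linearly with the number of workers that have been torn down. The sketch below assumes the default 1MiB stack and an illustrative 128-byte TLS block; leaked_bytes is a hypothetical helper, not part of any library.

```rust
/// Default stack size allocated for each additional thread (1MiB).
const STACK_SIZE: u64 = 1 << 20;
/// Illustrative TLS size; the real value depends on the module.
const TLS_SIZE: u64 = 128;

/// Bytes that stay allocated forever after `workers_torn_down` workers
/// have come and gone without any cleanup.
fn leaked_bytes(workers_torn_down: u64) -> u64 {
    workers_torn_down * (STACK_SIZE + TLS_SIZE)
}
```

Under these assumptions, a thousand short-lived workers leave roughly 1GiB of linear memory behind.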
What we did
We don’t have a general solution for how to map the Rust abstractions to the multithreaded WebAssembly target,[2] but we would love to fix this leak. Since there is no obvious place to perform the cleaning, we proposed exposing a (highly-unstable, not very visible) helper function, __wbindgen_thread_destroy(), to do it manually. This way we choose when to carry it out: it is a bit DIY, but so is initializing anyway.
How does it work? Let us first look at how that memory is initialized:
(func $start (params)
; CHECK_NEEDS_STACK
; we keep a thread counter somewhere in memory,
; which we now increase by one, atomically
i32.const 327684 ; THREAD_COUNTER_ADDRESS
i32.const 1
i32.atomic.rmw.add
; the previous block gives us the _old_ value of the counter.
; if it is non-zero, we are not the first thread, so
; we need to initialize stuff.
if
; grow memory 16 pages, totaling 1MiB
memory.grow 16
; the previous line returned the previous memory size, in pages.
; multiplied by the page size, we have the previous
; memory size in bytes.
i32.const 65536 ; 64KiB
i32.mul
; plus 1MiB that we just added, we have the current size in bytes
i32.const 1048576 ; 1MiB
i32.add
; set the stack pointer. The stack grows backwards, so
; it should be equal to the current size of the memory
global.set $stack_pointer
else
; nothing to do
end
; now we initialize TLS
; first, we call `malloc` to allocate some space
i32.const 128 ; TLS_SIZE
call $__wbindgen_malloc
; and now we can call the initialization function
call $__wasm_init_tls
)
This $start function runs every time you instantiate the WebAssembly module in a new thread. It assumes the existence of a few things that are either emitted by the LLVM backend or arranged by wasm-bindgen itself: namely, the $__wbindgen_malloc and $__wbindgen_free functions, the $__wasm_init_tls routine that initializes the TLS, the $stack_pointer global and the THREAD_COUNTER_ADDRESS memory address.
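For readers more at home in Rust than in Wasm text, the stack-related half of this routine can be modelled roughly as follows. This is a sketch under assumptions: SharedModule and its methods are hypothetical stand-ins for the shared memory and the thread counter, and the counter is bumped with an atomic add, as the comment in the listing describes.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

const PAGE_SIZE: u32 = 65536; // one Wasm page, 64KiB
const STACK_PAGES: u32 = 16; // 16 pages, totaling 1MiB

/// Hypothetical stand-in for the state shared by all threads.
struct SharedModule {
    thread_counter: AtomicU32,
    memory_pages: AtomicU32,
}

impl SharedModule {
    /// Mimics `memory.grow`: returns the previous size, in pages.
    fn memory_grow(&self, delta: u32) -> u32 {
        self.memory_pages.fetch_add(delta, Ordering::SeqCst)
    }

    /// Mimics the stack setup in `$start`. Returns the stack pointer for
    /// the calling thread, or `None` for the first thread, which keeps
    /// the initial stack.
    fn start(&self) -> Option<u32> {
        let previous_threads = self.thread_counter.fetch_add(1, Ordering::SeqCst);
        if previous_threads == 0 {
            return None;
        }
        let old_pages = self.memory_grow(STACK_PAGES);
        // The stack grows backwards, so it starts at the end of the new chunk.
        Some(old_pages * PAGE_SIZE + STACK_PAGES * PAGE_SIZE)
    }
}
```

Note that nothing in this model remembers where each stack lives once start returns, which is exactly the gap a destructor has to fill.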
OK, so we just need to “undo” everything that happened above. But first, let’s make sure that we keep enough data around that points to the memory that needs cleaning:
i32.const 1048576 ; 1MiB
i32.add
+ global.set $stack_alloc
+ global.get $stack_alloc
global.set $stack_pointer
;...
i32.const 128 ; TLS_SIZE
call $__wbindgen_malloc
+ global.set $tls_base
+ global.get $tls_base
call $__wasm_init_tls
Now we have a global ($tls_base) that points to the TLS chunk and another new global ($stack_alloc) that points to the new address of the stack pointer. Note that we can’t simply rely on $stack_pointer itself, since it is subject to change during runtime.[3] We can try writing our destructor:
(func $__wbindgen_thread_destroy (params)
; we call `free` for the TLS chunk
global.get $tls_base
i32.const 128 ; TLS_SIZE
call $__wbindgen_free
; and now we try to do the same with the stack
global.get $stack_alloc
; hmmm, and now... memory.ungrow...?
)
There is a little problem though. The stack was allocated with a raw call to memory.grow, a low-level primitive from the WebAssembly runtime itself. There is no “memory.ungrow” instruction we could call to return this chunk back to the host.[4] Even if it existed, it would be a bit weird to use: between the time when the stack was allocated and the moment we want to destroy it, the Wasm instance might have been called any number of times and new pages could have been added to the memory. “Freeing” a memory page “sandwiched” between two that are still in use does not make a lot of sense.
This is all just hinting that we should leave these matters to the allocator (those $__wbindgen_malloc and $__wbindgen_free calls we’ve already seen), which is in charge of growing memory when needed and keeping track of which pages can actually be reused. So we’ll go back to $start again and change the raw grow instruction into a call to malloc:
- ; grow memory 16 pages, totaling 1MiB
- memory.grow 16
- ; the previous line returned the previous memory size, in pages.
- ; multiplied by the page size, we have the previous
- ; memory size in bytes.
- i32.const 65536 ; 64KiB
- i32.mul
- ; plus 1MiB that we just added, we have the current size in bytes
+ ; allocate 1MiB for the stack
i32.const 1048576 ; 1MiB
- i32.add
+ call $__wbindgen_malloc
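To see why handing the stack to an allocator sidesteps the missing “ungrow”, consider the deliberately naive model below. It is only a sketch: GrowOnlyMemory and ReusingAllocator are hypothetical types, and a real allocator is far more sophisticated than an exact-size free list.

```rust
/// Grow-only memory, like a Wasm `Memory`: it can never shrink.
struct GrowOnlyMemory {
    size: usize,
}

impl GrowOnlyMemory {
    /// Returns the previous size, which is the address of the new chunk.
    fn grow(&mut self, bytes: usize) -> usize {
        let old = self.size;
        self.size += bytes;
        old
    }
}

/// Naive allocator (hypothetical): it remembers freed chunks and hands
/// them out again before asking the memory to grow.
struct ReusingAllocator {
    memory: GrowOnlyMemory,
    free_chunks: Vec<(usize, usize)>, // (address, size)
}

impl ReusingAllocator {
    fn malloc(&mut self, size: usize) -> usize {
        // Reuse a previously freed chunk of the same size, if any.
        if let Some(i) = self.free_chunks.iter().position(|&(_, s)| s == size) {
            return self.free_chunks.swap_remove(i).0;
        }
        self.memory.grow(size)
    }

    fn free(&mut self, address: usize, size: usize) {
        self.free_chunks.push((address, size));
    }
}
```

In this model, a stack freed by one worker is handed to the next worker that asks for one, even if it sits “sandwiched” between live allocations, so the memory stops growing once the number of concurrent workers stabilizes.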
And now we encounter a second, more subtle difficulty. Even if we don’t know much about $__wbindgen_malloc, there is one thing we should assume: that it most likely accesses $stack_pointer! That is the whole point of having a “stack”: that it acts as a “scratch space” in linear memory for each function to read and write freely from.[5]
If $start is supposed to initialize the value of $stack_pointer, what would this previous malloc call encounter there? Well, the initial value of $stack_pointer is the same for all threads, and it is set to point to an unused 1MiB chunk of linear memory. The first spawned thread does not attempt to allocate a stack: it assumes that initial value is perfectly fine and uses it as its stack. Our attempt to use malloc above would clobber this chunk from two different threads concurrently, a recipe for total disaster. 🙈🙈
While we are at it, the same can be said about $__wbindgen_free: if it needs a stack to function, how do we expect it to work when we try to destroy that very stack while using it?
The final solution is not very elegant, but it works. Together with the 1MiB “initial stack” for the first thread, we will add a second (statically allocated) chunk of memory, the “temp stack”. Now, before calling malloc or free, we will set $stack_pointer to point to TEMP_STACK. Before we can do that, we have to make sure that other threads are not trying to do the same concurrently, so we will need to write a little mutex-y loop to “acquire” the temporary stack. When putting everything together, this is the final result:[6]
(func $start (params)
; OMITTED: CHECK_NEEDS_STACK as above
if
; use the temporary stack
i32.const 393216 ; TEMP_STACK
global.set $stack_pointer
; before calling any function, make sure
; the temporary stack is "available"
; TODO: GRAB_LOCK
; call malloc
i32.const 1048576 ; 1MiB
call $__wbindgen_malloc
; save the newly allocated stack to destroy it later
global.set $stack_alloc
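; set the current stack pointer. The stack grows backwards,
; so it must point to the end of the newly allocated chunk
global.get $stack_alloc
i32.const 1048576 ; 1MiB
i32.add
global.set $stack_pointer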
; TODO: RELEASE_LOCK
else
end
; OMITTED: tls initialization is the same
)
(func $__wbindgen_thread_destroy (params)
; we call `free` for the TLS chunk
global.get $tls_base
i32.const 128 ; TLS_SIZE
call $__wbindgen_free
; and now we try to do the same with the stack
; use the temporary stack
i32.const 393216 ; TEMP_STACK
global.set $stack_pointer
; before calling any function, make sure
; the temporary stack is "available"
; TODO: GRAB_LOCK
global.get $stack_alloc
i32.const 1048576 ; 1MiB, the size of the stack
call $__wbindgen_free
; TODO: RELEASE_LOCK
)
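The locking discipline around the temporary stack can be sketched in native Rust as well. This is a model under assumptions, not the code that ships in wasm-bindgen: TempStackLock and with_temp_stack are hypothetical names, and the sketch yields to the scheduler where the Wasm version would sleep with memory.atomic.wait32.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical stand-in for the lock word that guards the temp stack.
struct TempStackLock {
    state: AtomicU32, // 0 = free, 1 = taken
}

impl TempStackLock {
    const fn new() -> Self {
        TempStackLock { state: AtomicU32::new(0) }
    }

    /// GRAB_LOCK: atomically flip 0 -> 1, retrying until it succeeds.
    fn acquire(&self) {
        while self
            .state
            .compare_exchange(0, 1, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            // The Wasm version sleeps until notified; this sketch just yields.
            std::thread::yield_now();
        }
    }

    /// RELEASE_LOCK: store 0 so that another thread can take the temp stack.
    fn release(&self) {
        self.state.store(0, Ordering::Release);
    }
}

/// Runs `body` while holding the lock, the way `$start` and the destructor
/// wrap their calls to `malloc` and `free`.
fn with_temp_stack<T>(lock: &TempStackLock, body: impl FnOnce() -> T) -> T {
    lock.acquire();
    let result = body();
    lock.release();
    result
}
```

The critical section is kept as short as possible: only the allocator call runs on the temporary stack, so contention between workers that start or stop at the same time stays brief.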
Happily Ever After
All this work was pretty fun and taught us a few things about low-level WebAssembly and its tooling. The shortcomings that we have shown in our partial solutions above are not narrative tricks: they are exactly the mistakes we made along the way. We got it merged into wasm-bindgen a while ago, with the invaluable support of its maintainers. So we feel we’ve earned the (somewhat ostentatious) title of this post: we have become the Destroyer of Threads.[7]
Multithreaded WebAssembly recently reached stage 3 in the proposals pipeline and has been shipped by all major JS runtimes for quite a while. A more comprehensive way to use it is already cooking with wasi-threads, so the time to give it a try is now!
- [1] “threaded”! See what we did here?
- [2] The state of multithreaded WebAssembly has not changed much in the last 4 years (!). Most of the difficulties we described were already laid out quite clairvoyantly in this post by the Rust/Wasm Working Group.
- [3] We are hand-waving the fact that, when we modify the $start function, we also have to modify the rest of the module to make sure the required elements are present. For instance, we have to add the new (global) declarations to the module definition. We also need to alter the initial memory layout when we later introduce a temporary stack.
- [4] This is a known shortcoming of WebAssembly as it stands today. For an interesting discussion see design#1397, where an in-flux proposal is mentioned.
- [5] Not 100% “freely”: they are subject to the calling convention.
- [6] For simplicity, we have omitted what the actual locking looks like, but here it is in case you are interested:
; GRAB_LOCK
loop $acquire
; try to acquire the lock
; that is by atomically exchanging its value from 0 to 1
i32.const 327684 ; TEMP_STACK_LOCK
i32.const 0
i32.const 1
i32.atomic.rmw.cmpxchg
; if it did _not_ return 0, it is currently locked,
; so we need to put the thread to sleep
if
i32.const 327684 ; TEMP_STACK_LOCK
i32.const 1
i64.const -1
memory.atomic.wait32
drop
; keep looping
br $acquire
else
; the previous value was 0, so we can stop looping
end
end
; ...
; RELEASE_LOCK
; set the lock value back to 0
i32.const 327684 ; TEMP_STACK_LOCK
i32.const 0
i32.atomic.store
; awake 1 blocked thread, if any
i32.const 327684 ; TEMP_STACK_LOCK
i32.const 1
memory.atomic.notify
; discard the number of threads that were woken up
drop
- [7] This is not the end of the story, by the way. Just recently we had to go back and tweak this destruction routine to better fit our use case. Still having fun!