
wasmtime slow due to lock contention in kernel during munmap() #177

Open
sunshowers opened this issue Nov 1, 2024 · 4 comments

@sunshowers
Contributor

Encountered an interesting bug while trying to port wasmtime to illumos, as documented in bytecodealliance/wasmtime#9535.

Steps to reproduce (note the +beta: compiling wasmtime on illumos requires Rust 1.83 or above):

```
git clone https://github.com/bytecodealliance/wasmtime
cd wasmtime
git checkout 44da05665466edb301558aa617d9a7bff295c461
git submodule init
git submodule update --recursive

cargo +beta test --test wast -- --test-threads 1 Cranelift/pooling/tests/spec_testsuite/load.wast
```

This takes around 0.07 seconds on Linux but around 5-6 seconds on illumos.

DTrace samples:

From my naive reading, particularly of the kernel stacks, it seems like most of the time is spent waiting on locks to various degrees.

Per Alex Crichton in this comment:

Whoa! It looks like the pooling allocator is the part that's slow here and that, by default, has a large number of virtual memory mappings associated with it. For example it'll allocate terabytes of virtual memory and then within that giant chunk it'll slice up roughly 10_000 linear memories (each with guard regions between them). These are prepared with a MemoryImageSlot each.

My guess is that the way things are managed is tuned to "this is acceptable due to some fast path in Linux we're hitting" which we didn't really design for and just happened to run across.

This corresponds to PoolingInstanceAllocator in wasmtime. Alex suggests tweaking how the allocator works, either on illumos specifically or in general, but given the size of the performance gap between illumos and Linux it seems that a kernel-level improvement might help.
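To make the scale concrete, here is a back-of-envelope sketch. The ~10_000 slot count comes from Alex's comment above; the per-slot sizes are illustrative assumptions, not wasmtime's actual defaults:

```rust
// Back-of-envelope sketch of the reservation size. The ~10_000 slot count
// comes from the comment above; the per-slot sizes below are illustrative
// assumptions, not wasmtime's actual defaults.
const SLOTS: u64 = 10_000;
const GIB: u64 = 1 << 30;
const ADDRESSABLE: u64 = 4 * GIB; // full 32-bit wasm address space per memory
const GUARD: u64 = 2 * GIB; // guard region between adjacent slots

fn main() {
    let total = SLOTS * (ADDRESSABLE + GUARD);
    // Prints "58 TiB reserved": tens of terabytes of virtual address space
    // from a single up-front mapping, later carved into slots.
    println!("{} TiB reserved", total / (1 << 40));
}
```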

cc @iximeow, @rmustacc who I briefly chatted with about this.

@sunshowers changed the title from "wasmtime slow due to lock contention in kernel during free()" to "wasmtime slow due to lock contention in kernel during munmap()" on Nov 1, 2024
@jclulow
Collaborator

jclulow commented Nov 1, 2024

We did a bit of digging, and it seems like the crux of the matter is that wasmtime makes an extremely large mapping to reserve address space, so that other things don't end up surprise-mapped halfway through one of the slots set aside for a potential WASM instance. Today that mapping appears to be MAP_ANON | MAP_NORESERVE with PROT_NONE. It would probably be cheaper/faster to open /dev/null and use MAP_FILE | MAP_SHARED | MAP_NORESERVE, still with PROT_NONE, instead. I think that would otherwise be functionally compatible with what's already going on: since the pages are PROT_NONE anyway, you would already have to adjust (a subset of) the mappings to make them useful before touching them.
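For concreteness, here is a minimal sketch of the two reservation strategies using the Rust libc crate; the 1 TiB size is illustrative and error handling is elided:

```rust
// Minimal sketch of both strategies, using the `libc` crate.
// SIZE is illustrative; error handling is elided; 64-bit targets assumed.
use std::ffi::CString;
use std::ptr;

const SIZE: usize = 1 << 40; // 1 TiB reservation, for illustration

// What wasmtime does today: anonymous, PROT_NONE, no swap reservation.
unsafe fn reserve_anon() -> *mut libc::c_void {
    libc::mmap(
        ptr::null_mut(),
        SIZE,
        libc::PROT_NONE,
        libc::MAP_PRIVATE | libc::MAP_ANON | libc::MAP_NORESERVE,
        -1,
        0,
    )
}

// The suggested alternative: a shared PROT_NONE mapping of /dev/null.
// (MAP_FILE is historically zero, so MAP_SHARED | MAP_NORESERVE suffices.)
unsafe fn reserve_dev_null() -> *mut libc::c_void {
    let path = CString::new("/dev/null").unwrap();
    let fd = libc::open(path.as_ptr(), libc::O_RDWR);
    let addr = libc::mmap(
        ptr::null_mut(),
        SIZE,
        libc::PROT_NONE,
        libc::MAP_SHARED | libc::MAP_NORESERVE,
        fd,
        0,
    );
    libc::close(fd); // the live mapping keeps the underlying object alive
    addr
}
```

Whether the /dev/null-backed variant actually avoids the lock contention on illumos is exactly what would need measuring.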

I would also note, if useful, that while we strongly encourage people not to depend on the Linux-specific MADV_DONTNEED behaviour, we appear to have a compatible implementation under a separate name: MADV_PURGE.
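A hypothetical sketch of how that substitution might look, assuming the libc crate exposes MADV_PURGE for illumos targets:

```rust
// Hypothetical sketch: resetting a slot's pages with MADV_PURGE on illumos
// where Linux code would use MADV_DONTNEED. Assumes the `libc` crate
// exposes MADV_PURGE for illumos targets.
#[cfg(target_os = "illumos")]
const PURGE_ADVICE: libc::c_int = libc::MADV_PURGE;
#[cfg(not(target_os = "illumos"))]
const PURGE_ADVICE: libc::c_int = libc::MADV_DONTNEED;

unsafe fn reset_slot(addr: *mut libc::c_void, len: usize) {
    // Tell the kernel to discard the pages; the next touch sees zeroes.
    libc::madvise(addr, len, PURGE_ADVICE);
}
```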

@sunshowers
Contributor Author

sunshowers commented Nov 2, 2024

It would probably be cheaper/faster to open /dev/null and use MAP_FILE | MAP_SHARED | MAP_NORESERVE, still with PROT_NONE

I gave this a shot but ran into the fact that wasmtime doesn't really centralize its memory allocation tracking anywhere -- several independent components all call mmap directly. AFAICT that's not great design on wasmtime's part, but it's also a fairly big lift to fix.

sunshowers/wasmtime#1 is my WIP (see mmap.rs for the /dev/null + interval tree-based impl, which is the only meaty part -- the rest is just type tetris). If we do want to invest in this, it would probably be good to establish the performance benefits of the /dev/null-based mmap using my impl before continuing.
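The WIP uses a real interval tree; as a rough illustration of the kind of centralized containment lookup involved, here is a hypothetical registry built on a BTreeMap (all names invented for illustration, not taken from the branch):

```rust
use std::collections::BTreeMap;

/// Hypothetical centralized registry of reservations, keyed by start
/// address. (A BTreeMap is enough to sketch the containment lookup an
/// interval tree would provide.)
struct MmapRegistry {
    ranges: BTreeMap<usize, usize>, // start address -> length
}

impl MmapRegistry {
    fn insert(&mut self, start: usize, len: usize) {
        self.ranges.insert(start, len);
    }

    /// Find the reservation containing `addr`, if any.
    fn lookup(&self, addr: usize) -> Option<(usize, usize)> {
        // Greatest start <= addr, then check that the range covers addr.
        let (&start, &len) = self.ranges.range(..=addr).next_back()?;
        (addr < start + len).then_some((start, len))
    }
}
```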

@sunshowers
Contributor Author

Ended up deciding to use unsafe code to store a *const Mmap, and it works and is extremely fast! Wow. This is far from shippable but is definitely a prototype worth presenting upstream.

sunshowers/wasmtime#1
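As a rough, hypothetical illustration of the *const Mmap approach (the real code is in the linked branch): each slot carries a raw pointer back to the mapping that owns it, trading compiler-checked lifetimes for a plain pointer, on the promise that the owning Mmap outlives every slot carved out of it.

```rust
/// Hypothetical illustration of the trick, not the branch's actual types:
/// each slot stores a raw pointer back to the mapping that owns it, on the
/// promise that the `Mmap` outlives every slot carved out of it.
struct Mmap {
    base: *mut u8,
    len: usize,
}

struct Slot {
    offset: usize,
    owner: *const Mmap, // no lifetime ties this to the owning Mmap
}

impl Slot {
    /// Safety: `self.owner` must still point at a live `Mmap`.
    unsafe fn addr(&self) -> *mut u8 {
        (*self.owner).base.add(self.offset)
    }
}
```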

@sunshowers
Contributor Author

Filed an issue on the wasmtime repo talking about my prototype, the issues I ran into while writing it, and possible solutions: bytecodealliance/wasmtime#9544
