prov/psm3: "munmap_chunk(): invalid pointer" on cleanup of fi_rdm_tagged_peek with OOB #10123

Open
zachdworkin opened this issue Jun 25, 2024 · 3 comments

@zachdworkin
Contributor

fi_rdm_tagged_peek fails to clean up on the server side, aborting with "munmap_chunk(): invalid pointer", when FI_PROVIDER="psm3" is set.

To Reproduce
server_cmd: FI_PROVIDER=psm3 fi_rdm_tagged_peek -p psm3 -E
client_cmd: FI_PROVIDER=psm3 fi_rdm_tagged_peek -p psm3 -E "server_address"

Expected behavior
Test passes successfully

Output
Server Output:
Sending 10 tagged messages
Waiting for messages to complete
munmap_chunk(): invalid pointer
Aborted (core dumped)

Server Backtrace:
(gdb) bt
#0 0x00007ffff6496aff in raise () from /lib64/libc.so.6
#1 0x00007ffff6469ea5 in abort () from /lib64/libc.so.6
#2 0x00007ffff64d9097 in __libc_message () from /lib64/libc.so.6
#3 0x00007ffff64e04ec in malloc_printerr () from /lib64/libc.so.6
#4 0x00007ffff64e079c in munmap_chunk () from /lib64/libc.so.6
#5 0x00007ffff7a88e0f in psm3_free_internal (ptr=0x735a80, curloc=0x7ffff7b12953 "prov/psm3/psm3/psm_ep.c:1163")
at prov/psm3/psm3/psm_utils.c:3964
#6 0x00007ffff7a63d41 in psm3_ep_close (ep=0x636ac0, mode=0, timeout_in=2000000000) at prov/psm3/psm3/psm_ep.c:1163
#7 0x00007ffff7a29b31 in psmx3_trx_ctxt_free (trx_ctxt=0x62b3a0, usage_flags=3) at prov/psm3/src/psmx3_trx_ctxt.c:223
#8 0x00007ffff7a11cea in psmx3_ep_close (fid=0x7349b0) at prov/psm3/src/psmx3_ep.c:234
#9 0x0000000000403fb1 in fi_close (fid=)
at /path_to_libfabric_install/include/rdma/fabric.h:632
#10 ft_close_fids () at common/shared.c:1792
#11 0x0000000000404a9a in ft_free_res () at common/shared.c:1862
#12 0x0000000000401b2a in main (argc=, argv=) at functional/rdm_tagged_peek.c:364

Client Output:
Peek for a bad msg
Peek w/ claim for a bad msg
Peek msg 1
Receive msg 1
Peek w/ claim msg 2
Receive claimed msg 2
Peek & discard msg 3
Checking to see if msg 3 was discarded
Peek w/ claim msg 4
Claim and discard msg 4
Receive msg 5
Receive msg 6
Receive msg 10
Receive msg 9
Receive msg 8
Receive msg 7

Environment:
rocky 8.7 mlnx 5.0

Additional context
Setting and unsetting FI_PROVIDER fixes this bug
The specific free() call that fails is the one freeing the hfi_nids struct at psm_ep.c:1163

@zachdworkin
Contributor Author

#10124 disables the fi_rdm_tagged_peek test in CI while this bug is investigated. Please revert that change once this is resolved.

@acgoldma
Contributor

acgoldma commented Dec 9, 2024

Sorry, this one got lost. I will add it to our internal bug tracker so we can fix it.

@brooksmi

@zachdworkin Is this still reproducible? Can you provide any details on the system hardware and configuration?

I've been unable to reproduce this on our PSM test systems so far (tried RHEL 8.10 w/single MT28000 in eth mode on the commit just prior to the test disable commit).

Based on the stack trace above, this is hitting a libc malloc guard in free(). libc believes it is freeing a memory-mapped chunk, which should not be the case here. That suggests the private malloc header may have been overwritten, e.g. by a buffer underflow.
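
To make that reasoning concrete, here is a rough, simplified model of the decision glibc's free() makes from the chunk header that sits just below the user pointer. It is a sketch only, not glibc or psm3 code: the struct, enum, and function names are hypothetical, a 64-bit layout is assumed, and only the flag bit values and the page-alignment check are meant to mirror glibc's behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag bits glibc stores in the low bits of a chunk's size field. */
#define PREV_INUSE     ((size_t)0x1)
#define IS_MMAPPED     ((size_t)0x2)
#define NON_MAIN_ARENA ((size_t)0x4)
#define FLAG_MASK      (PREV_INUSE | IS_MMAPPED | NON_MAIN_ARENA)

/* Hypothetical model of the two words that precede a user pointer. */
struct model_chunk_hdr {
    size_t prev_size; /* for mmapped chunks: offset back to the mapping start */
    size_t size;      /* chunk size, with the flag bits in the low three bits */
};

enum free_path {
    FREE_HEAP,                  /* ordinary heap chunk: normal free path */
    FREE_MUNMAP,                /* looks like a valid mmapped chunk */
    FREE_ABORT_INVALID_POINTER  /* "munmap_chunk(): invalid pointer" */
};

/*
 * Model of the decision: if IS_MMAPPED is set, the chunk is handed to
 * munmap_chunk(), which aborts unless both the mapping start and the
 * total length are page aligned.
 */
enum free_path model_free_path(const struct model_chunk_hdr *hdr,
                               uintptr_t chunk_addr, size_t page_size)
{
    if (!(hdr->size & IS_MMAPPED))
        return FREE_HEAP;

    size_t size = hdr->size & ~FLAG_MASK;
    uintptr_t block = chunk_addr - hdr->prev_size;
    size_t total = hdr->prev_size + size;

    if (((block | total) & (page_size - 1)) != 0)
        return FREE_ABORT_INVALID_POINTER;
    return FREE_MUNMAP;
}
```

In this model, a small heap chunk whose size field has bit 0x2 set by a stray write takes the munmap path and fails the alignment check, which matches the abort in frame #4. The model does not say where the stray write comes from. Running the reproducer under AddressSanitizer or Valgrind may help locate it.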
