Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prov/verbs: Add support for IBV_ACCESS_RELAXED_ORDERING #9378

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

sydidelot
Copy link
Member

@sydidelot sydidelot commented Sep 27, 2023

IBV_ACCESS_RELAXED_ORDERING allows the system to reorder
Send/Write/Atomic operations to improve performance.

The patch enables IBV_ACCESS_RELAXED_ORDERING if the application
has requested no ordering in TX/RX attributes.

man/fi_mr.3.md Outdated Show resolved Hide resolved
@sydidelot sydidelot force-pushed the verbs_relaxed_ordering branch 2 times, most recently from 7e4e296 to a40b2f7 Compare September 28, 2023 16:18
@sydidelot sydidelot changed the title api,prov/verbs: Add support for MR_RELAXED_ORDERING prov/verbs: Add support for IBV_ACCESS_RELAXED_ORDERING Sep 28, 2023
@sydidelot sydidelot force-pushed the verbs_relaxed_ordering branch from a40b2f7 to c3bc905 Compare September 28, 2023 16:21
include/ofi_util.h Outdated Show resolved Hide resolved
prov/verbs/src/verbs_mr.c Outdated Show resolved Hide resolved
@@ -112,6 +112,8 @@ util_domain_init(struct util_domain *domain, const struct fi_info *info,
domain->info_domain_caps = info->caps | info->domain_attr->caps;
domain->info_domain_mode = info->mode | info->domain_attr->mode;
domain->mr_mode = info->domain_attr->mr_mode;
domain->tx_msg_order = info->tx_attr->msg_order;
domain->rx_msg_order = info->rx_attr->msg_order;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering if we could OR in the msg_order values every time we allocate an endpoint. That at least updates the tx/rx_msg_order fields. However, it's possible for the app to register memory prior to creating any endpoints. In that case, we would incorrectly use relaxed ordering.

So, I'm wondering if we have to have a domain that only supports relaxed ordering. If the app calls fi_info with no ordering set, we can return that domain as one of the options. Endpoints allocated on that domain are only unordered.

This brings up that I think most apps need ordering for sends. AFAICT, the relax ordering here mostly impacts RMA ordering: WAW, RAR, etc. We may only need to check that those bits are not set.

@sydidelot sydidelot force-pushed the verbs_relaxed_ordering branch from c3bc905 to 338bba5 Compare October 6, 2023 08:25
@sydidelot
Copy link
Member Author

@shefty I have updated the PR based on your suggestion to create one domain for relaxed ordering.
The PR only supports msg endpoints for now, but I believe the feature could easily be extended for XRC and DGRAM domains.
I verified the code with a modified version of fi_pingpong that sets tx_attr->msg_order = rx_attr->msg_order = 0 before calling fi_getinfo().

prov/verbs/src/verbs_info.c Outdated Show resolved Hide resolved
IBV_ACCESS_RELAXED_ORDERING allows the system to reorder
Send/Write/Atomic operations to improve performance.

The patch enables IBV_ACCESS_RELAXED_ORDERING if the application
has requested no ordering in TX/RX attributes.

Signed-off-by: Sylvain Didelot <[email protected]>
@sydidelot sydidelot force-pushed the verbs_relaxed_ordering branch from 338bba5 to 0609cd6 Compare October 10, 2023 20:28
@shefty
Copy link
Member

shefty commented Oct 11, 2023

CI failure is related. Changes break verbs;ofi_rxd. Most or all fabtests fail, but here's a simple one:

fi_getinfo_test -s cb14-ib0 cb14-ib0 -p "verbs;ofi_rxd"
13:41:03    timestamp: 20231010-203537+0000
13:41:03    result: Fail

Quickly scanning the changes didn't point anything to me. AFAICT, dgram endpoints shouldn't have been impacted by this change.

@iziemba
Copy link
Contributor

iziemba commented Oct 11, 2023

Question: Is the correct behavior to zero the message order flags? Or, is the correct behavior to set max_order_waw_size to zero? The thinking here is that for FI_MSG EPs, the operations would still be ordered in the network. It is target PCIe write operations which will be unordered with respect to each other.

@shefty
Copy link
Member

shefty commented Oct 11, 2023

IMO, the feature isn't well enough documented at the verbs layer to figure out what exactly it does. The PR comment indicates that sends could be received out of order. Is that true, or would sends still match receive buffers in order? Hmm... the latter seems reasonable, since matching would be done by the NIC. That implies that WAW ordering, and RAR I guess, is not guaranteed.

@thomasgillis
Copy link
Contributor

Following a recent issue I have with PCIe ordering (#9621), I am worried that the possible reordering of the PCIe writes at the target will break the completion ordering of RDMA_WRITE_WITH_IMM_DATA.
I couldn't find many details on IBV_ACCESS_RELAXED_ORDERING, is there any guarantee that it will not break that?

@shefty
Copy link
Member

shefty commented Dec 5, 2023

I haven't found any information on what IBV_ACCESS_RELAXED_ORDERING actually does. If completions can be written before the message data is received, then it's use seems suspect. I will try to find out more details.

@shefty
Copy link
Member

shefty commented Dec 6, 2023

Based on initial feedback, IBV_ACCESS_RELAXED_ORDERING should not impact completions. After reading a completion, all data should be present. I'm still investigating what it does beyond that and if it impacts message ordering.

@shefty
Copy link
Member

shefty commented Dec 7, 2023

See linux-rdma/rdma-core#1413 for a proposed update to better document the change in verbs API behavior. My understanding is that @iziemba 's earlier proposal to disable waw ordering is correct.

@sydidelot
Copy link
Member Author

@shefty Thanks for the feedback and for the clarification of IBV_ACCESS_RELAXED_ORDERING.
Should we enable relaxed ordering if the user provides hints->ep_attr->max_order_waw_size == 0 as suggested by @iziemba earlier in the discussion?

@shefty
Copy link
Member

shefty commented Dec 11, 2023

I honestly still don't know what that flag does. It sets the relaxed ordering bit for PCI transfers, but it's unclear when that bit is set.
I submitted text for it to libibverbs, which was merged, but I don't know if that text was actually correct. WAW ordering may actually be maintained, so the situation may be better than it seems. I'm still trying to determine this.

The flag applies at the receiving side. If WAW ordering is impacted, I would also check/update the msg_order flags.

@sydidelot
Copy link
Member Author

I just discovered that the new CXI provider has support for Relaxed Ordering: https://github.com/ofiwg/libfabric/blob/main/man/fi_cxi.7.md#pcie-ordering
We could check the source code to see how it's implemented in this provider :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants