Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prov/verbs: add param for setting RDMA CM ToS #10360

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

jakemoroni
Copy link

@jakemoroni jakemoroni commented Sep 4, 2024

Opens:

  • I am unable to test on Windows. I added a no-op handler for rdma_set_option so it should build and have no effect if no ToS is set. If a ToS is set, then it should result in a warning.
  • I'm not quite sure about the calls in verbs_domain_xrc.c. They may not all be necessary.
  • The valid range for RDMA CM ToS is 0-255, but the new FI_VERBS_TOS param takes a range of -1 through 255, with -1 being a special value to indicate "no ToS set" (default), which allows the call to rdma_set_option to be completely skipped. My original plan was to just have 0 be the default and always call rdma_set_option but this would introduce a behavioral change on systems where the default is non-0 and would add extra overhead in scenarios where the ToS is a don't care.

By default, if the RDMA CM application does not explicitly set a ToS, then the system default is used. This system default can be changed via the default_roce_tos configfs param, but this is global to the system and is somewhat cumbersome to deal with.

This change introduces a new environment param: FI_VERBS_TOS

This param can be used to explicitly set the ToS via the rdma_set_option API. If the ToS is set, then the system default is ignored and the new value is used instead.

This allows for multiple concurrent workloads to use different ToS values.

Valid range is -1 through 255. If unset or set to -1, then the call to rdma_set_option is omitted, thus preserving existing behavior.

Copy link
Contributor

@j-xiong j-xiong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good in general. Just some nitpicks.

prov/verbs/src/verbs_domain_xrc.c Outdated Show resolved Hide resolved
prov/verbs/src/verbs_ofi.h Outdated Show resolved Hide resolved
prov/verbs/src/verbs_ofi.h Outdated Show resolved Hide resolved
@j-xiong
Copy link
Contributor

j-xiong commented Sep 6, 2024

Please add a description of the new variable to the verbs provider man page. You can also drop the "RFC" part from the title.

@jakemoroni jakemoroni force-pushed the main branch 2 times, most recently from 80995b1 to 20fac51 Compare September 6, 2024 18:02
@jakemoroni jakemoroni changed the title RFC: prov/verbs: add param for setting RDMA CM ToS prov/verbs: add param for setting RDMA CM ToS Sep 6, 2024
@jakemoroni
Copy link
Author

Please add a description of the new variable to the verbs provider man page. You can also drop the "RFC" part from the title.

Done. Also rebased and retested.

By default, if the RDMA CM application does not explicitly
set a ToS, then the system default is used. This system default
can be changed via the default_roce_tos configfs param, but this
is global to the system and is somewhat cumbersome to deal with.

This change introduces a new environment param: FI_VERBS_TOS

This param can be used to explicitly set the ToS via the
rdma_set_option API. If the ToS is set, then the system default
is ignored and the new value is used instead.

This allows for multiple concurrent workloads to use different
ToS values.

Valid range is -1 through 255. If unset or set to -1, then
the call to rdma_set_option is omitted, thus preserving existing
behavior.

Not supported/tested on Windows.

Signed-off-by: Jacob Moroni <[email protected]>
@shefty
Copy link
Member

shefty commented Sep 10, 2024

The fi_tx_attr includes a tclass value that can be used to control the service level. It supports DSCP via helper functions. Can this be used instead?

@a-szegel
Copy link
Contributor

bot:aws:retest

@jakemoroni
Copy link
Author

The fi_tx_attr includes a tclass value that can be used to control the service level. It supports DSCP via helper functions. Can this be used instead?

I do think this is a good idea. I have some questions regarding how to map these libfabric tclass values to an appropriate "TOS" value though.

The RDMA CM API, like to the regular socket API, allows you to specify an 8 bit "TOS" value. However, the definition of this octet has changed over the years and seems to have converged on two incompatible encodings:

  • RFC 1349 TOS (legacy)
  • DSCP

For either encoding, the value is set using the same API (DSCP values need to be left shifted by 2 first) and there's unfortunately no way to indicate the encoding. I assume that the RDMA CM TOS is semantically similar to the regular socket TOS in this regard.

DSCP is the more modern encoding, but some places still assume that this value is an RFC 1349 TOS. For example, Linux assumes that the value specified in a setsockopt(IP_TOS) call on an IPv4 socket is an RFC 1349 TOS, so if you provide a DSCP value here, it will be mapped to a totally incorrect user priority (EF maps to a lower priority than AF12, etc.).

With a regular IPv4 socket, this can be worked around by just setting the priority afterwards with setsockopt(SO_PRIORITY), but this is not possible with RDMA CM.

I'm wondering if it makes sense to support a few options here:

  • An env param like FI_VERBS_TOS which can be set to: off (default), tos, dscp, or some value in the range of 0x00-0xFF

    • off: No TOS is set at all; RDMA CM default (configurable via sysfs) is used
    • tos: The libfabric tclass gets mapped to a RFC 1349 style TOS value
    • dscp: The libfabric tclass gets mapped to a DSCP value
    • 0x00-0xFF: The raw value is set directly

    ** For both the tos and dscp values, if the libfabric tclass value has the DSCP bit set (FI_TC_DSCP), then value is left shifted by 2 and set directly.

WDYT?

@sydidelot
Copy link
Member

Hi @jakemoroni, I came across your PR while searching for a solution to set ToS/DSCP values to RoCE endpoints.
I like @shefty's suggestion to use the tclass value from fi_tx_attr, which allows to set custom ToS/DSCP values on a per endpoint basis. Environment variables are dangerous: it's not always easy to determine if they have been taken into account by libfabric. As an example, few years ago it was decided to remove the FI_FORK_SAFE environment variable in this patch: #4973. Unfortunately, this environment variable was used in some products at DDN and we didn't immediately notice this change when we upgraded libfabric. IMHO, it would better not to use environment variables.

For ToS/DSCP, can't we simply pass the output of fi_tc_dscp_get(fi_tx_attr::tclass) to rdma_set_option() if the value is different than 0/FI_TC_UNSPEC? I also think we should fail the caller if rdma_set_option() returns an error.

@sydidelot
Copy link
Member

@jakemoroni I have been working on the changes to implement ToS using fi_tx_attr::tclass as suggested by Sean. My changes are based on this current PR and can be found in this private branch: https://github.com/sydidelot/libfabric/commits/tos/

With these changes, setting ToS/DSCP then becomes super easy as the application is just required to call fi_info::tx_attr::tclass = fi_tc_dscp_set(<value>).
If the changes make sense to you, feel free to cherry-pick them and squash them in this PR.

Note 1: In my changes, the DSCP/ToS value is not left shifted by 2. This operation must be done by the application.
Note 2: I think we will need to add a fi_get_opt(ep_fid, ...) call to allow the application to retrieve what is the current DSCP value set for a given endpoint.

@jakemoroni
Copy link
Author

Hey @sydidelot,

Thanks for the input!

  • Agreed on your points about environment variables. The only thing I am a little worried about is introducing a regression/breakage for workloads that I can't test on my end. For instance, there may be some application out there that sets the DSCP value (which is currently just being ignored). With this change, there are two new behaviors:

    1. Now, the DSCP value may have an effect on performance (but this is probably what we want anyway)
    2. If we fail on the rdma_set_option for some reason, the application which used to work will now fail

    I think it's probably low risk, but just wanted to raise awareness. Also, for the Windows stub, we'd probably want to change it to always report success rather than failure if we decide to fail on rdma_set_option errors. What do you think?

  • Agreed that we can pass the tclass value without any remapping if FI_TC_DSCP bit is set.

A few questions:

  • Regarding the left shift by 2: Do we have a sense on what existing applications that use fi_tc_dscp_set are currently passing in? A DSCP value is 6 bits, and I have encountered some other untrelated tools that accept DSCP arguments in this form and then do the shift internally before passing it down to the kernel. One example is rclone

    • On one hand, the name being DSCP makes me lean towards doing the shift in libfabric. On the other hand, it may be useful for users to have a way to pass raw unmodified 8 bit values all the way down.
  • This is more of a follow-on, but what should we do about the other predefined tclass values? Maybe we can start with passing these DSCP values through, then address that in a subsequent commit?

@sydidelot
Copy link
Member

Hey @jakemoroni. Please accept my apologies for the late reply.

Now, the DSCP value may have an effect on performance (but this is probably what we want anyway)

I guess that's fine if applications pay the price of rdma_set_option() if DSCP is set. If DSCP is not set correctly, the application may run into different problems :)

If we fail on the rdma_set_option for some reason, the application which used to work will now fail

Right, that's a risk. @j-xiong what is your opinion on this point? Shall we enforce rdma_set_option to be always successful if DSCP is set or do we want the application to manually check if DSCP value was set correctly as an extra step? For example using a new call to fi_get_opt(ep_fid, ...)?

Also, for the Windows stub, we'd probably want to change it to always report success rather than failure if we decide to fail on rdma_set_option errors. What do you think?

For widows we could return -FI_ENOSYS and fail the caller only if rdma_set_option() returns something different than 0 and -FI_ENOSYS

Regarding the left shift by 2: Do we have a sense on what existing applications that use fi_tc_dscp_set are currently passing in?

I have no idea how existing applications are using fi_tc_dscp_set(). I regret that the OFI interface doesn't allow to set the entire 8-bit ToS field (including the 2 ECN bits).

This is more of a follow-on, but what should we do about the other predefined tclass values? Maybe we can start with passing these DSCP values through, then address that in a subsequent commit?

Yes, I think we should focus on DSCP/ToS first and address other tclass values later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants