Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python bindings to cuFileDriverOpen() and cuFileDriverClose() #514

Merged
merged 18 commits into from
Nov 1, 2024

Conversation

madsbk
Copy link
Member

@madsbk madsbk commented Oct 24, 2024

Changes:

  • Adding Python bindings to cuFileDriverOpen() and cuFileDriverClose().
  • We now only open the cufile driver explicitly in CUDA versions older than v12.2.
  • Introducing kvikio.cufile_driver.initialize(), which open the cuFile driver and close it again at module exit.
  • Let CI fail if KvikIO wasn't built with cuFile support.
    • Except on cuda11.8+arm64; cuFile didn't support arm until cuda v12.4.
  • Some refactor and clean up!

@madsbk madsbk added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Oct 24, 2024
@madsbk madsbk force-pushed the cufile_driver branch 2 times, most recently from 47a0452 to 4b8181c Compare October 24, 2024 13:10
@madsbk madsbk force-pushed the cufile_driver branch 10 times, most recently from 4967be7 to e7bde28 Compare October 25, 2024 11:50
@madsbk madsbk force-pushed the cufile_driver branch 2 times, most recently from 3d02555 to 0b4537f Compare October 26, 2024 11:00
@madsbk
Copy link
Member Author

madsbk commented Oct 26, 2024

@EricKern calling kvikio.cufile_driver.initialize() should fix your segfault. Are you able to build KvikIO in your setup? If so, could you test this PR?

@EricKern
Copy link

Thank you @madsbk for your efforts!
I'll try to build it next week. Maybe I can even find time tomorrow.

We want to trigger CI error if a package was built without cufile
support
@madsbk madsbk force-pushed the cufile_driver branch 3 times, most recently from 37d318f to 3f8f8b7 Compare October 27, 2024 13:51
@madsbk
Copy link
Member Author

madsbk commented Oct 28, 2024

Thanks @EricKern, good to get this confirmed. Let's continue the discussion in #497

@madsbk madsbk marked this pull request as ready for review October 28, 2024 06:59
@madsbk madsbk requested review from a team as code owners October 28, 2024 06:59
@madsbk madsbk requested a review from KyleFromNVIDIA October 28, 2024 06:59
return cufile_driver.driver_close()


def initialize() -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: Whose job is it to call initialize?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user for now. If we find a Python example that segfaults because of cuFile's termination issues, we should consider calling it in __init__.py.

Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiny suggestions, none are blocking, I think

cpp/include/kvikio/shim/cufile.hpp Outdated Show resolved Hide resolved
cpp/include/kvikio/shim/cufile.hpp Outdated Show resolved Hide resolved
cpp/include/kvikio/shim/cufile.hpp Outdated Show resolved Hide resolved
cpp/include/kvikio/shim/cufile.hpp Show resolved Hide resolved
cpp/include/kvikio/shim/cufile.hpp Show resolved Hide resolved
python/kvikio/kvikio/cufile_driver.py Show resolved Hide resolved
python/kvikio/kvikio/cufile_driver.py Show resolved Hide resolved
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly cosmetic suggestions. I had one thought about an atexit handler but it seems like that doesn't order the way we need w.r.t. CUDA teardown?

cpp/include/kvikio/shim/cufile.hpp Outdated Show resolved Hide resolved
cpp/include/kvikio/shim/cufile.hpp Outdated Show resolved Hide resolved
// not allowed to call CUDA after main[1]. This is because, cuFile will segfault if the
// driver isn't closed on program exit i.e. we are doomed if we do, doomed if we don't, but
// this seems to be the lesser of two evils.
// [1] <https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#initialization>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we explicitly register a std::atexit handler? Assuming that the CUDA runtime is doing the same, atexit has ordering guarantees so we'll be pushing our exit handler onto the stack to run before CUDA does its cleanup.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. This is explicitly UB

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know how CUDA does its teardown, if not via atexit handlers that have predictable ordering?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think std::atexit has ordering guarantees between multiple shared libraries :/

python/kvikio/kvikio/cufile_driver.py Outdated Show resolved Hide resolved
python/kvikio/kvikio/cufile_driver.py Outdated Show resolved Hide resolved
If cuFile isn't available.
"""
driver_open()
atexit.register(driver_close)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm well if this can error it seems to invalidate my suggestion to use an atexit handler in the C++...

@madsbk
Copy link
Member Author

madsbk commented Nov 1, 2024

Thanks @wence- and @vyasr

@madsbk
Copy link
Member Author

madsbk commented Nov 1, 2024

/merge

@rapids-bot rapids-bot bot merged commit c9209aa into rapidsai:branch-24.12 Nov 1, 2024
57 checks passed
@madsbk madsbk deleted the cufile_driver branch December 6, 2024 07:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improves an existing functionality non-breaking Introduces a non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants