feat: add monitoring (crankshaft-monitor) #4

Open
7 tasks
claymcleod opened this issue Dec 4, 2024 · 0 comments

claymcleod commented Dec 4, 2024

@BradenEverson is going to look into writing a monitoring system for Crankshaft. To that end, I've compiled a set of notes regarding my thoughts on what that would look like here.

Inspiration

  • Take a look at how common HPC job systems report status. For example, St. Jude uses LSF, which has the bjobs command. Other institutions commonly use SLURM, which has sstat. These are the kinds of commands many users are used to running when gathering statistics on their jobs, though we can definitely improve on a lot of things too.
  • To that end, we designed an interface we really liked, called oliver, that was built on top of Cromwell. The whirlwind tour gives you a sense of how the process went, but you can just focus on the monitoring pieces.

Other considerations

  • This code should be in a crankshaft-monitor package.
  • The things I think would be helpful to report are:
    • Minimally
      • Job status (alive, killed, finished, pending).
      • Streaming the logs from stdout and stderr.
    • More difficult, but cooler
      • CPU utilization over time.
      • Memory usage over time.
      • Disk space usage over time.
  • For the cooler ones above, you will need to extend the Backend trait so that backends can implement viewing the current CPU/memory/disk usage at a single point in time. Some backends will not support some or all of those, and that should be handled gracefully.
  • The monitoring service itself should be a service (e.g., it should live alongside Runner here) and should be register-able within a Crankshaft engine. It should access the statistics through the Crankshaft core facilities and/or through the extended functionality in the Backend trait at whatever interval the user has configured (this will require adding a new polling option to the configuration).
  • As a proof of concept, I think simply having that service print out the gathered metrics at some interval (different than the collection interval above, but also configurable) as INFO would be a great start.
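
To make the trait extension idea above a bit more concrete, here is a rough sketch of how optional point-in-time metrics could be modeled. Everything in it is hypothetical: `MonitorableBackend`, `ResourceSnapshot`, `JobStatus`, and the two example backends are illustrative names, not the existing Crankshaft `Backend` trait or its types, and the real trait may well be async. The point is only that each metric is an `Option`, so a backend that cannot report something returns `None` rather than failing.

```rust
use std::fmt;

/// Job states named in the notes above (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobStatus {
    Pending,
    Alive,
    Finished,
    Killed,
}

/// A sample taken at a single point in time. Every field is optional so a
/// backend can report only what it is able to measure.
#[derive(Debug, Clone, Copy, Default, PartialEq)]
pub struct ResourceSnapshot {
    pub cpu_percent: Option<f64>,
    pub memory_bytes: Option<u64>,
    pub disk_bytes: Option<u64>,
}

impl fmt::Display for ResourceSnapshot {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Unsupported metrics are rendered as "n/a" instead of being an error.
        fn show<T: fmt::Display>(v: Option<T>) -> String {
            v.map_or_else(|| "n/a".to_string(), |x| x.to_string())
        }
        write!(
            f,
            "cpu_pct={} mem_bytes={} disk_bytes={}",
            show(self.cpu_percent),
            show(self.memory_bytes),
            show(self.disk_bytes)
        )
    }
}

/// Hypothetical extension trait; NOT the real Crankshaft `Backend` trait.
pub trait MonitorableBackend {
    fn status(&self, job_id: &str) -> JobStatus;

    /// Backends that cannot report resource usage keep this default,
    /// which yields a snapshot with every metric set to `None`.
    fn snapshot(&self, _job_id: &str) -> ResourceSnapshot {
        ResourceSnapshot::default()
    }
}

/// Example backend that only knows job status.
pub struct StatusOnlyBackend;

impl MonitorableBackend for StatusOnlyBackend {
    fn status(&self, _job_id: &str) -> JobStatus {
        JobStatus::Alive
    }
}

/// Example backend that supports CPU and memory, but not disk.
pub struct PartialBackend;

impl MonitorableBackend for PartialBackend {
    fn status(&self, _job_id: &str) -> JobStatus {
        JobStatus::Pending
    }

    fn snapshot(&self, _job_id: &str) -> ResourceSnapshot {
        ResourceSnapshot {
            cpu_percent: Some(12.5),
            memory_bytes: Some(1024),
            disk_bytes: None,
        }
    }
}

fn main() {
    let backends: Vec<Box<dyn MonitorableBackend>> =
        vec![Box::new(StatusOnlyBackend), Box::new(PartialBackend)];
    for backend in &backends {
        println!("{:?}: {}", backend.status("job-1"), backend.snapshot("job-1"));
    }
}
```

A default method on the trait is one way to keep existing backends compiling while new metrics are added; a separate opt-in trait would be another reasonable choice.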

Perhaps in a future implementation, we can actually think about communicating with Crankshaft over gRPC/HTTP to gather and display the metrics in a separate process in the terminal or a browser.
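
For the proof of concept with separate collection and reporting intervals, a minimal sketch of the scheduling logic might look like the following. All names here (`MonitorConfig`, `Monitor`, `advance_to`) are made up for illustration, the sampler is a plain closure standing in for whatever the backend exposes, and `println!` stands in for an INFO-level log call. Time is passed in explicitly so the logic can be exercised without sleeping; a real service would presumably drive it from a timer.

```rust
use std::time::Duration;

/// Hypothetical configuration: how often to sample and how often to report.
#[derive(Debug, Clone, Copy)]
pub struct MonitorConfig {
    pub collect_every: Duration,
    pub report_every: Duration,
}

/// Collects samples on one interval and summarizes them on another.
pub struct Monitor<F: FnMut() -> f64> {
    config: MonitorConfig,
    sample: F,
    samples: Vec<f64>,
    next_collect: Duration,
    next_report: Duration,
}

impl<F: FnMut() -> f64> Monitor<F> {
    pub fn new(config: MonitorConfig, sample: F) -> Self {
        // Zero-length intervals would make `advance_to` loop forever.
        assert!(!config.collect_every.is_zero() && !config.report_every.is_zero());
        Self {
            config,
            sample,
            samples: Vec::new(),
            next_collect: config.collect_every,
            next_report: config.report_every,
        }
    }

    /// Processes every collection and report that is due at `elapsed`
    /// (time since the monitor started) and returns the report lines.
    pub fn advance_to(&mut self, elapsed: Duration) -> Vec<String> {
        let mut reports = Vec::new();
        loop {
            let collect_due = self.next_collect <= elapsed;
            let report_due = self.next_report <= elapsed;
            if collect_due && (!report_due || self.next_collect <= self.next_report) {
                // On a tie, collect first so the report includes the sample.
                let value = (self.sample)();
                self.samples.push(value);
                self.next_collect += self.config.collect_every;
            } else if report_due {
                reports.push(self.summarize());
                self.samples.clear();
                self.next_report += self.config.report_every;
            } else {
                break;
            }
        }
        reports
    }

    fn summarize(&self) -> String {
        if self.samples.is_empty() {
            return "INFO no samples collected".to_string();
        }
        let avg = self.samples.iter().sum::<f64>() / self.samples.len() as f64;
        format!("INFO cpu avg={:.1}% over {} samples", avg, self.samples.len())
    }
}

fn main() {
    let config = MonitorConfig {
        collect_every: Duration::from_secs(1),
        report_every: Duration::from_secs(3),
    };
    // Fake sampler: returns 10.0, 20.0, 30.0, ... on successive calls.
    let mut reading = 0.0;
    let mut monitor = Monitor::new(config, move || {
        reading += 10.0;
        reading
    });
    // Simulate six seconds passing, one second at a time.
    for second in 1..=6 {
        for line in monitor.advance_to(Duration::from_secs(second)) {
            println!("{line}");
        }
    }
}
```

Keeping the two intervals independent means a user could sample frequently while only logging an aggregate occasionally, which seems to be the intent of the notes above.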
