feat: add monitoring (`crankshaft-monitor`) #4

claymcleod · 2024-12-04T22:09:36Z

@BradenEverson is going to look into writing a monitoring system for Crankshaft. To that end, I've compiled a set of notes regarding my thoughts on what that would look like here.

Inspiration

Take a look at how common HPC job systems report status. For example, St. Jude uses LSF, which has the bjobs command. Other institutions commonly use SLURM, which uses sstat. These are the kinds of commands many users are used to interfacing with when gathering statistics on their jobs, though we can definitely improve on a lot of things too.
To that end, we designed an interface that we really liked, built on top of Cromwell, called oliver. The whirlwind tour gives you a sense of how the process went, but you can just focus on the monitoring pieces.

Other considerations

This code should be in a crankshaft-monitor package.
The things I think would be helpful to report are:
- Minimally
  - Job status (alive, killed, finished, pending).
  - Streaming the logs from stdout and stderr.
- More difficult but cooler
  - CPU utilization over time.
  - Memory usage over time.
  - Disk space usage over time.
For the cooler ones above, you will need to extend the Backend trait so that backends can report the current CPU, memory, and disk usage at a single point in time. Some backends will not support some or all of those metrics, and that should be handled gracefully.
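To make the "handled gracefully" part concrete, here is a rough sketch of one way the extension could look. Everything in it is an assumption for illustration: `ResourceSnapshot`, `resource_snapshot`, `describe`, and the two example backends are hypothetical names, and the `Backend` trait shown is a stand-in, not the real Crankshaft trait. The idea is that each metric is an `Option`, and the new method has a default implementation that reports nothing, so existing backends keep compiling unchanged.

```rust
/// Hypothetical point-in-time reading. `None` means the backend
/// cannot report that particular metric.
#[derive(Debug, Clone, Copy, Default, PartialEq)]
pub struct ResourceSnapshot {
    /// CPU utilization as a percentage.
    pub cpu_percent: Option<f64>,
    /// Memory in use, in bytes.
    pub memory_bytes: Option<u64>,
    /// Disk space in use, in bytes.
    pub disk_bytes: Option<u64>,
}

/// Stand-in for Crankshaft's `Backend` trait (assumption: the real
/// trait has other items that are not reproduced here).
pub trait Backend {
    fn name(&self) -> &str;

    /// Hypothetical extension point. The default reports no metrics,
    /// so backends without any support need no extra code.
    fn resource_snapshot(&self, _job_id: &str) -> ResourceSnapshot {
        ResourceSnapshot::default()
    }
}

/// Example backend that supports none of the metrics.
pub struct NoStatsBackend;

impl Backend for NoStatsBackend {
    fn name(&self) -> &str {
        "no-stats"
    }
}

/// Example backend that supports CPU and memory, but not disk.
/// The numbers are fixed placeholders, not real measurements.
pub struct PartialBackend;

impl Backend for PartialBackend {
    fn name(&self) -> &str {
        "partial"
    }

    fn resource_snapshot(&self, _job_id: &str) -> ResourceSnapshot {
        ResourceSnapshot {
            cpu_percent: Some(42.5),
            memory_bytes: Some(1024),
            disk_bytes: None,
        }
    }
}

/// Render a snapshot, printing "n/a" for unsupported metrics.
pub fn describe(snapshot: &ResourceSnapshot) -> String {
    fn show<T: std::fmt::Display>(value: Option<T>, unit: &str) -> String {
        match value {
            Some(v) => format!("{}{}", v, unit),
            None => "n/a".to_string(),
        }
    }
    format!(
        "cpu={} mem={} disk={}",
        show(snapshot.cpu_percent, "%"),
        show(snapshot.memory_bytes, "B"),
        show(snapshot.disk_bytes, "B"),
    )
}
```

With this shape, a caller never has to special-case a backend: `describe(&PartialBackend.resource_snapshot("job-1"))` yields `cpu=42.5% mem=1024B disk=n/a`, and a backend that overrides nothing yields `n/a` for all three.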
The monitoring service itself should be a service (e.g., it should live alongside Runner here) and should be register-able within a Crankshaft engine. It should access the statistics through the Crankshaft core facilities and/or through the extended functionality in the Backend trait at whatever interval the user has configured (this will require adding a new polling option to the configuration).
As a proof of concept, I think simply having that service print out the gathered metrics as INFO at some interval (different from the collection interval above, but also configurable) would be a great start.
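The two-interval idea (collect often, report less often) could be sketched roughly as below. This is only an illustration under stated assumptions: `MonitorConfig`, `Monitor`, `poll_interval`, `report_interval`, and `tick` are hypothetical names, the sample is reduced to a single optional memory reading, and the report is returned as a string rather than logged so the sketch does not assume a particular logging crate. Time is passed in explicitly so the scheduling logic can be exercised without sleeping.

```rust
use std::time::{Duration, Instant};

/// Hypothetical configuration with the two user-configurable intervals.
#[derive(Debug, Clone, Copy)]
pub struct MonitorConfig {
    /// How often metrics are collected.
    pub poll_interval: Duration,
    /// How often collected metrics are summarized and reported.
    pub report_interval: Duration,
}

/// Hypothetical proof-of-concept monitor holding samples between reports.
pub struct Monitor {
    config: MonitorConfig,
    /// `None` until the first sample, so the first tick always polls.
    last_poll: Option<Instant>,
    /// Starts at creation time, so the first report waits a full interval.
    last_report: Instant,
    /// Memory readings in bytes; `None` when the backend had no value.
    samples: Vec<Option<u64>>,
}

impl Monitor {
    pub fn new(config: MonitorConfig, start: Instant) -> Self {
        Monitor {
            config,
            last_poll: None,
            last_report: start,
            samples: Vec::new(),
        }
    }

    /// Advance the monitor to `now`. Calls `sample` only if the poll
    /// interval has elapsed, and returns a summary line only if the
    /// report interval has elapsed.
    pub fn tick(
        &mut self,
        now: Instant,
        sample: impl FnOnce() -> Option<u64>,
    ) -> Option<String> {
        let poll_due = self
            .last_poll
            .map_or(true, |t| now.duration_since(t) >= self.config.poll_interval);
        if poll_due {
            self.samples.push(sample());
            self.last_poll = Some(now);
        }

        if now.duration_since(self.last_report) < self.config.report_interval {
            return None;
        }
        self.last_report = now;

        // Summarize and clear the samples gathered since the last report.
        let total = self.samples.len();
        let peak = self.samples.drain(..).flatten().max();
        Some(match peak {
            Some(bytes) => format!("samples={} peak_mem={}B", total, bytes),
            None => format!("samples={} peak_mem=n/a", total),
        })
    }
}
```

In a real service, something would call `tick(Instant::now(), ...)` from a loop or timer and emit the returned line at INFO level through whatever logging facility the project already uses. Keeping the clock outside the struct is a design choice made here for testability, not a requirement.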

Perhaps in a future implementation, we can actually think about communicating with Crankshaft over gRPC/HTTP to gather and display the metrics in a separate process in the terminal or a browser.
