@BradenEverson is going to look into writing a monitoring system for Crankshaft. To that end, I've compiled a set of notes regarding my thoughts on what that would look like here.
Inspiration
Take a look at how common HPC job systems report status. For example, St. Jude uses LSF, which has the bjobs command, while other institutions commonly use SLURM, which has sstat. These are the kinds of commands many users are used to interfacing with when gathering statistics on their jobs, though we can definitely improve on a lot of things too.
To that end, we designed an interface that we really liked, built on top of Cromwell, called oliver. The whirlwind tour gives you a sense of how the process went, but you can just focus on the monitoring pieces.
Other considerations
This code should be in a crankshaft-monitor package.
The things I think would be helpful to report are:
Minimally
Job status (alive, killed, finished, pending).
Streaming the logs from stdout and stderr.
More difficult but cooler
CPU utilization over time.
Memory usage over time.
Disk space usage over time.
For the cooler ones above, you will need to extend the Backend trait to allow backends to report the current CPU/memory/disk usage at a single point in time. Some backends will not support some or all of those, and that should be handled gracefully.
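To make that concrete, here's a rough sketch of what such an extension might look like. None of these names (Utilization, ResourceMonitoring, utilization) exist in Crankshaft today, the real version would likely be async and keyed off whatever task handle the Backend trait already uses, and the Option fields are just one way a backend could signal that it can't report a given metric:

```rust
/// A point-in-time snapshot of a task's resource usage (sketch only).
#[derive(Debug, Clone)]
pub struct Utilization {
    /// CPU usage as a fraction of one core (e.g., 1.5 = 150%), if reported.
    pub cpu: Option<f64>,
    /// Resident memory in bytes, if reported.
    pub memory: Option<u64>,
    /// Disk space used by the task in bytes, if reported.
    pub disk: Option<u64>,
}

/// A method along these lines could be added to the existing `Backend` trait;
/// it's shown as a separate trait here only to keep the sketch self-contained.
pub trait ResourceMonitoring {
    /// Returns the current utilization for a running task, or `None` if this
    /// backend cannot report utilization at all.
    fn utilization(&self, task_id: &str) -> Option<Utilization>;
}
```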
The monitoring service itself should be a service (e.g., it should live alongside Runner here) and should be registerable within a Crankshaft engine. It should access the statistics through the Crankshaft core facilities and/or through the extended functionality in the Backend trait at whatever interval the user has configured (this will require adding a new polling configuration option to the configuration).
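For the configuration piece, something like the following could work; the MonitorConfig name, its fields, and the default values are purely illustrative and would need to fit into however Crankshaft's existing configuration is structured:

```rust
use std::time::Duration;

/// Hypothetical monitoring options added to the engine configuration.
#[derive(Debug, Clone)]
pub struct MonitorConfig {
    /// How often the service samples metrics from the backend(s).
    pub collection_interval: Duration,
    /// How often the service reports what it has collected.
    pub report_interval: Duration,
}

impl Default for MonitorConfig {
    fn default() -> Self {
        Self {
            collection_interval: Duration::from_secs(5),
            report_interval: Duration::from_secs(30),
        }
    }
}
```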
As a proof of concept, I think simply having that service print out the metrics it has gathered at some interval (different from the collection interval above, but also configurable) as INFO would be a great start.
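A minimal sketch of that proof-of-concept loop, assuming tokio and tracing and a made-up Monitor type standing in for whatever state the service actually keeps:

```rust
use std::time::Duration;

/// Made-up stand-in for whatever state the monitor service keeps.
struct Monitor;

impl Monitor {
    /// Poll the backend(s) for the latest utilization snapshots.
    async fn collect(&self) {}

    /// Return the most recent snapshot per task as (task name, summary) pairs.
    fn latest(&self) -> Vec<(String, String)> {
        Vec::new()
    }
}

async fn run(monitor: Monitor, collection_interval: Duration, report_interval: Duration) {
    let mut collect = tokio::time::interval(collection_interval);
    let mut report = tokio::time::interval(report_interval);

    loop {
        tokio::select! {
            _ = collect.tick() => {
                // Sample metrics on the (faster) collection interval.
                monitor.collect().await;
            }
            _ = report.tick() => {
                // Report what we have as INFO-level log lines on the
                // (slower) reporting interval.
                for (task, stats) in monitor.latest() {
                    tracing::info!(%task, %stats, "task utilization");
                }
            }
        }
    }
}
```

The two intervals map onto the collection and reporting intervals described above, and swapping the tracing::info! call for a gRPC/HTTP push later would leave the rest of the loop unchanged.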
Perhaps in a future implementation, we can actually think about communicating with Crankshaft over gRPC/HTTP to gather and display the metrics in a separate process in the terminal or a browser.