Context: we are using `qfsadmin -s $ip -p $port ping` to collect metrics from our QFS cluster.
One metric we use is the `ncorrupt` counter. When it's not 0, we get an alert to check the disk of the particular chunkserver.
Our problem is that the `ncorrupt` counter doesn't reset to 0 when the disk issue is fixed, until we restart the corresponding chunkserver. If we don't restart the chunkserver, the `ncorrupt` counter stays the same.
Is this a feature or a bug?
If this is intended, we'll need to resolve it on our end, but I thought it was worth a shot asking.
The short answer is "yes": this is the intended behavior. The counter is the sum of the corresponding counters of all of the chunk server's chunk directories. The typical use is to generate events based on the counter's derivative, i.e. its change between samples, rather than on its absolute value.
Chunk directories have two different subsets of counters: one is cumulative, and the other gets reset when the chunk server starts using the chunk directory (again). Several error counters are included in the read and write chunk directory counters that get reset every time the chunk server decides to use the chunk directory.
For example: Read-err, Write-err, Read-timeout, Write-timeout, Read-err-checksum, Write-err-checksum.
The respective cumulative subset of counters has Total-read- and Total-write- prefixes.
Chunk directory counters can be viewed in the meta server web UI. The meta server aggregates the chunk server counters, and the web UI uses the get chunk directory counters RPC to query the aggregated counters from the meta server.
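To illustrate alerting on the counter derivative rather than on its absolute value, here is a minimal Python sketch. It is not part of QFS: the function name `corrupt_deltas` and the server labels are hypothetical, and it assumes you already collect one `ncorrupt` value per chunkserver at each polling interval. It also assumes that a value lower than the previous sample means the chunkserver was restarted, so the new value is counted as the increase.

```python
def corrupt_deltas(previous, current):
    """Return the per-chunkserver increase in ncorrupt between two samples.

    previous, current: dicts mapping a chunkserver label to its ncorrupt value.
    Only chunkservers whose counter increased are included in the result.
    """
    deltas = {}
    for server, count in current.items():
        last = previous.get(server)
        if last is None:
            # First sample for this chunkserver: no baseline yet, so no delta.
            continue
        if count < last:
            # Assumption: the counter went backwards because the chunkserver
            # restarted, so everything counted since then is new.
            increase = count
        else:
            increase = count - last
        if increase > 0:
            deltas[server] = increase
    return deltas


# Hypothetical samples from two consecutive polls.
previous = {"cs1": 65, "cs2": 3, "cs3": 10}
current = {"cs1": 65, "cs2": 5, "cs3": 2, "cs4": 7}
print(corrupt_deltas(previous, current))
```

With this approach a chunkserver stuck at `ncorrupt=65` stops alerting once the disk is fixed, because the value no longer changes between polls.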
Example `ping` output for one chunkserver from the report above:
s=REDUCTED, p=REDUCTED, rack=REDUCTED, used=28464644933285, free=23535571826431, total=53541442322432, util=56.04, nblocks=437542, lastheard=0, ncorrupt=65, nchunksToMove=0, numDrives=6, numWritableDrives=6, overloaded=0, numReplications=0, numReadReplications=0, good=1, nevacuate=0, bytesevacuate=0, nlost=0, nwrites=40, load=0, md5sum=a95d6ff5740cb73bd29d8330233c40ff, replay=0, connected=1, stopped=0, chunks=437552, tiers=10:1:19:1482:2.37e+12:3.94e+12:39.76;15:5:23:436070:2.12e+13:4.96e+13:57.34, lostChunkDirs=
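A line like the one above can be split into fields before reading `ncorrupt`. The following Python sketch is only an illustration: the helper name `parse_ping_server_line` is hypothetical, the host and port in the usage example are made up, and it assumes that fields are separated by `", "` and that no value contains that separator, which holds for the sample shown here.

```python
def parse_ping_server_line(line):
    """Split one 'key=value, key=value, ...' chunkserver line into a dict of strings."""
    fields = {}
    for part in line.strip().split(", "):
        key, sep, value = part.partition("=")
        if sep:
            # Values are kept as strings; an empty value (e.g. lostChunkDirs=) becomes "".
            fields[key.strip()] = value.strip()
    return fields


# Shortened sample with made-up host and port values.
sample = "s=10.0.0.1, p=20000, util=56.04, ncorrupt=65, numDrives=6, lostChunkDirs="
fields = parse_ping_server_line(sample)
print(int(fields["ncorrupt"]))
```

Keeping the values as strings avoids guessing at types for fields such as `tiers`; convert only the fields you need, as with `int(fields["ncorrupt"])`.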