This gathers points related to performance and functionality for benchmarking.
- UChicago dev instance: `FileNotFoundError` when running over local files (reproducer)
- UChicago prod instance: `KilledWorker` exception at scale (reproducer: default notebook with `N_FILES_MAX_PER_SAMPLE >= 1000`)
- scaling beyond ~50 workers at UNL
- `dask.distributed` scaling behavior (reproducer below): differs per site; occasionally there are long tails where the last few tasks are not picked up for a long time, and sometimes workers process only a few tasks while tasks are still remaining
- basket size can matter a lot for `uproot`: 10-100 kB per basket is good (could try re-merging with `hadd -O`); the small input files have `.num_baskets == 1`, while files merged 10 -> 1 have 10 baskets (see the `uproot` sketch after this list)
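A minimal sketch for inspecting basket counts and sizes with `uproot`; the file path is a placeholder and the tree name `events` is an assumption about the notebook's inputs:

```python
import uproot

# Placeholder: one of the input files used by the notebook.
path = "root://some.site//path/to/input_file.root"

with uproot.open(path) as f:
    tree = f["events"]  # tree name used in the notebook (assumption)
    for name, branch in tree.items():
        n = branch.num_baskets
        if n == 0:
            continue
        # Average uncompressed bytes per basket; 10-100 kB per basket is the
        # sweet spot noted above. If baskets are tiny, re-merging the inputs
        # (e.g. with hadd -O) can increase the basket size.
        avg_kb = branch.uncompressed_bytes / n / 1000
        print(f"{name}: {n} basket(s), ~{avg_kb:.1f} kB/basket (uncompressed)")
```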
Event rate measurement: what to use as reference, events in input or events passing selection (a small sketch follows the breakdown below)? For reference (10 input files per process):
- by file size, W+jets is ~31% of all events available, ttbar nominal is ~40%, and the four ttbar variations are ~17% together
- by number of events (948 M events total), the breakdown is 46% W+jets, 30% ttbar nominal, 12% ttbar variations, 11% single top t-channel
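To make the two candidate references explicit, a minimal sketch: `time_taken` and `metrics` are the quantities used in the benchmarking cell below, while `n_selected` is a hypothetical count that would have to be summed from the output histograms or a cutflow.

```python
# Rate with respect to all input events (what the cell below reports):
rate_input = metrics["entries"] / time_taken      # events read per second

# Rate with respect to events passing the selection; n_selected is a placeholder
# that would need to be summed from the output histograms / cutflow.
rate_selected = n_selected / time_taken           # selected events per second

print(f"input-event rate:    {rate_input / 1000:.3f} kHz")
print(f"selected-event rate: {rate_selected / 1000:.3f} kHz")
```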
some benchmarking calculations for the usual notebook:
print(f"\nexecution took {time_taken:.2f} seconds")
print(f"event rate / worker: {metrics['entries'] /NUM_WORKERS/time_taken/1000:.3f} kHz (including overhead, so pessimistic estimate)")
print(f"data read: {metrics['bytesread']/1000**3:.3f} GB")
print(f"events processed: {metrics['entries']/1000**2:.3f} M")
print(f"processtime: {metrics['processtime']:.3f} s (?!)")
print(f"processtime per worker: {metrics['processtime']/NUM_WORKERS:.3f} s (should be similar to real runtime, will be lower if opening files etc. is a significant contribution)")
print(f"processtime per chunk: {metrics['processtime']/metrics['chunks']:.3f} s")
print(metrics)
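For context, a minimal sketch of how `time_taken` and `metrics` can be obtained with the coffea 0.7 `Runner`; the scheduler address, fileset, processor class, tree name, and chunk size are placeholders, not the exact notebook configuration:

```python
import time
from coffea import processor
from coffea.nanoevents import NanoAODSchema
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder scheduler address
NUM_WORKERS = len(client.scheduler_info()["workers"])

run = processor.Runner(
    executor=processor.DaskExecutor(client=client),
    schema=NanoAODSchema,   # schema used here as an assumption
    savemetrics=True,       # makes run(...) also return the metrics dict
    chunksize=100_000,      # placeholder chunk size
)

# fileset ({dataset: [files]}) and MyProcessor are placeholders defined elsewhere.
t0 = time.monotonic()
out, metrics = run(fileset, treename="events", processor_instance=MyProcessor())
time_taken = time.monotonic() - t0  # feeds the print statements above
```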
`bytesread` bug: the `bytesread` reported in the metrics varies depending on the file source and disagrees with pure `uproot` (scikit-hep/coffea#717); a cross-check sketch follows below.
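A minimal sketch of how the coffea `bytesread` metric could be cross-checked against pure `uproot`, assuming uproot's per-source request counters (`num_requests`, `num_requested_bytes`); the file path, branch list, and tree name are placeholders:

```python
import uproot

# Placeholders: one of the input files and the branches the processor actually reads.
path = "root://some.site//path/to/input_file.root"
branches = ["jet_pt", "jet_eta", "jet_phi"]

with uproot.open(path) as events:
    tree = events["events"]   # tree name used in the notebook (assumption)
    tree.arrays(branches)     # read the same columns as the processor

    # Counters kept by uproot's file source; compare with metrics["bytesread"].
    source = events.file.source
    print(f"requests:        {source.num_requests}")
    print(f"bytes requested: {source.num_requested_bytes / 1000**2:.3f} MB")
```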
things to add to notebook (-> #85)

points to follow up on (thanks @nsmith-!):
- `py-spy` for profiling: https://github.com/benfred/py-spy

if there is time:
- related tooling: `rootreadspeed`: https://github.com/root-project/root/blob/master/tree/readspeed/README.md
- `dask.distributed` setup for simple scaling tests (see the sketch below)
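A minimal sketch of what such a scaling test could look like, using only `dask.distributed` (the scheduler address, task count, and sleep duration are placeholders): it submits many identical trivial tasks and writes a performance report, which makes long tails and idle workers easy to spot.

```python
import time
from dask.distributed import Client, performance_report, wait

client = Client("tcp://scheduler:8786")  # placeholder scheduler address
n_tasks = 2_000                          # placeholder task count

def trivial_task(i, duration=0.5):
    """Stand-in for a chunk of real work with a known, uniform duration."""
    time.sleep(duration)
    return i

with performance_report(filename="dask-scaling-report.html"):
    start = time.monotonic()
    futures = client.map(trivial_task, range(n_tasks))
    wait(futures)  # long tails show up as a slow final stretch here
    elapsed = time.monotonic() - start

n_workers = len(client.scheduler_info()["workers"])
print(f"{n_tasks} tasks on {n_workers} workers in {elapsed:.1f} s "
      f"({n_tasks / elapsed:.1f} tasks/s)")
```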