Benchmarking epic #79

Open
4 of 16 tasks
alexander-held opened this issue Aug 30, 2022 · 0 comments
Labels
implementation concerns analysis implementation

Comments


alexander-held commented Aug 30, 2022

This gathers points related to performance and functionality for benchmarking.

things to add to notebook (-> #85):

  • pre-processing skip
  • adjustable number of branches accessed
  • turning off processor logic (remove I/O and/or awkward)
  • optionally removing systematics

points to follow up on (thanks @nsmith-!)

if there is time:

  • investigate whether unused branches in the files matter for performance, as observed by @andrzejnovak in another context

related tooling:

dask.distributed setup for simple scaling tests:

```python
import time
from dask.distributed import Client, progress

# connect to an existing scheduler (address is deployment-specific)
client = Client("tls://localhost:8786")

def do_something(x):
    # stand-in workload: sleep for 5 seconds, return the input unchanged
    time.sleep(5)
    return x

# submit 1000 tasks and display a progress bar while they complete
x = client.map(do_something, range(1000))
progress(x)
```
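Once wall times have been measured with this setup for different worker counts, strong-scaling numbers can be derived. A minimal sketch (the function name and the example timings are illustrative, not measured values):

```python
def scaling_efficiency(t_serial: float, t_parallel: float, n_workers: int) -> tuple[float, float]:
    """Return (speedup, efficiency) for a strong-scaling measurement.

    t_serial: wall time with a single worker
    t_parallel: wall time with n_workers workers
    """
    speedup = t_serial / t_parallel
    # efficiency of 1.0 means perfect linear scaling
    return speedup, speedup / n_workers

# example: 100 s on one worker, 30 s on four workers (made-up numbers)
speedup, eff = scaling_efficiency(100.0, 30.0, 4)
print(f"speedup: {speedup:.2f}x, efficiency: {eff:.1%}")
```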

Event rate measurement: what to use as reference (events in input vs events passing selection)? For reference (10 input files per process):

```
ttbar__nominal            : 442122 events in input ->  88013 passing selection (frac: 19.91%)
ttbar__scaledown          : 435118 events in input ->  81571 passing selection (frac: 18.75%)
ttbar__scaleup            : 416314 events in input ->  86154 passing selection (frac: 20.69%)
ttbar__ME_var             : 455600 events in input ->  94637 passing selection (frac: 20.77%)
ttbar__PS_var             : 406193 events in input ->  76202 passing selection (frac: 18.76%)
single_top_s_chan__nominal: 221600 events in input ->  14847 passing selection (frac: 6.70%)
single_top_t_chan__nominal: 413691 events in input ->  22346 passing selection (frac: 5.40%)
single_top_tW__nominal    : 382354 events in input ->  39945 passing selection (frac: 10.45%)
wjets__nominal            : 412269 events in input ->    438 passing selection (frac: 0.11%)
```

where (by file size) W+jets is ~31% of all events available, ttbar nominal is ~40%, the four ttbar variations are ~17% together.
By number of events (948 M events total), the breakdown is 46% W+jets, 30% ttbar nominal, 12% ttbar variations, 11% single top t-channel.
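The per-process fractions above follow directly from the counts; a small sketch aggregating them into an overall selection fraction for the 10-file subset (counts copied from the table above):

```python
# (events in input, events passing selection) per process, from the table above
counts = {
    "ttbar__nominal": (442122, 88013),
    "ttbar__scaledown": (435118, 81571),
    "ttbar__scaleup": (416314, 86154),
    "ttbar__ME_var": (455600, 94637),
    "ttbar__PS_var": (406193, 76202),
    "single_top_s_chan__nominal": (221600, 14847),
    "single_top_t_chan__nominal": (413691, 22346),
    "single_top_tW__nominal": (382354, 39945),
    "wjets__nominal": (412269, 438),
}

# totals over all processes in the 10-file subset
total_in = sum(n_in for n_in, _ in counts.values())
total_pass = sum(n_pass for _, n_pass in counts.values())

for name, (n_in, n_pass) in counts.items():
    print(f"{name:26}: frac {n_pass / n_in:.2%}")
print(f"overall: {total_pass} / {total_in} passing ({total_pass / total_in:.2%})")
```

This makes the choice of reference explicit: an "events passing selection" rate would be roughly a factor 7 below an "events in input" rate for this mix.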

some benchmarking calculations for the usual notebook (`metrics` is assumed to be the dict returned by the coffea executor with metrics enabled; `time_taken` is the measured wall time, `NUM_WORKERS` the worker count):

```python
print(f"\nexecution took {time_taken:.2f} seconds")
print(f"event rate / worker: {metrics['entries'] / NUM_WORKERS / time_taken / 1000:.3f} kHz (including overhead, so pessimistic estimate)")

print(f"data read: {metrics['bytesread']/1000**3:.3f} GB")
print(f"events processed: {metrics['entries']/1000**2:.3f} M")
print(f"processtime: {metrics['processtime']:.3f} s (?!)")
print(f"processtime per worker: {metrics['processtime']/NUM_WORKERS:.3f} s (should be similar to real runtime, will be lower if opening files etc. is a significant contribution)")
print(f"processtime per chunk: {metrics['processtime']/metrics['chunks']:.3f} s")
print(metrics)
```
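For illustration, the same derived quantities computed with a stand-in `metrics` dict (all numbers below are made up, not measured):

```python
# hypothetical values standing in for a real executor metrics dict
metrics = {"entries": 1_000_000, "bytesread": 2_000_000_000, "processtime": 400.0, "chunks": 50}
NUM_WORKERS = 4
time_taken = 120.0  # seconds, measured wall time around the executor call

# event rate per worker in kHz, including all overhead (pessimistic)
event_rate_khz = metrics["entries"] / NUM_WORKERS / time_taken / 1000
print(f"event rate / worker: {event_rate_khz:.3f} kHz")
print(f"processtime per chunk: {metrics['processtime'] / metrics['chunks']:.3f} s")
```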
@alexander-held alexander-held added the implementation concerns analysis implementation label May 1, 2023