Reduce allocations in Dropout #1791

Open · wants to merge 4 commits into base: master
src/layers/normalise.jl (1 addition, 1 deletion)

@@ -31,7 +31,7 @@ The [`Dropout`](@ref) layer is what you should use in most scenarios.
 function dropout(x, p; dims=:, active::Bool=true)
   active || return x
   y = rand!(similar(x, _dropout_shape(x, dims)))
-  @. y = x * _dropout_kernel(y, p, 1-p)
+  x .* _dropout_kernel.(y, p, 1-p)
Contributor Author:

This allocates a new vector rather than reusing y. I tried this variant and it produced the same lowered code as the original.

Member:

I see, it's a compelling change, but I don't think it works when dims is set, because then y is actually smaller than x. The current code relies on broadcasting to expand the size-1 dims of the mask to the corresponding full-size dims of x.
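To illustrate the shape point (a small sketch using Flux's internal _dropout_shape helper, which the diff also uses; the mask comparison here is just for demonstration):

using Flux, Random

x = randn(Float32, 3, 4)
y = rand!(similar(x, Flux._dropout_shape(x, 1)))  # size(y) == (3, 1): one mask value per row
z = x .* (y .> 0.5f0)                             # broadcasting expands the size-1 dim across all 4 columns
@assert size(z) == size(x)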

Member:

Something here breaks type-stability, which isn't visible on the benchmarks of large arrays:

using Flux, Random, BenchmarkTools

x = randn(3,4)
function dropout_pr(x, p; dims=:, active::Bool=true)
    active || return x
    y = rand!(similar(x, Flux._dropout_shape(x, dims)))
    x .* Flux._dropout_kernel.(y, p, 1-p)
end
@btime Flux.dropout($x, 0.5; dims=1) #  60.359 ns
@btime dropout_pr($x, 0.5; dims=1)   # 619.457 ns

@code_warntype dropout_pr(x, 0.5; dims=1); # Body::Any

Also, it is odd that the calculation of 1-p is pulled out of the kernel, but the more expensive division 1/q (with q = 1-p) is not. IMO this should be written _dropout_kernel(y, p, invq) = ifelse(y > p, invq, zero(invq)), although in the examples I tried the compiler does seem to figure this out. But explicit is better than magic.
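A sketch of that suggestion, with the reciprocal hoisted out of the broadcast and a branch-free kernel (dropout_invq is an illustrative name, not code from the PR):

using Flux, Random

_dropout_kernel(y, p, invq) = ifelse(y > p, invq, zero(invq))

function dropout_invq(x, p; dims=:, active::Bool=true)
    active || return x
    y = rand!(similar(x, Flux._dropout_shape(x, dims)))
    invq = 1 / (1 - p)   # one division, computed once outside the element-wise kernel
    x .* _dropout_kernel.(y, p, invq)
end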

Member:

Seems to infer fine for me, perhaps Random.rand! wasn't imported beforehand? I get 100.5 ns for dropout_pr and 108.5 ns for Flux.dropout.

Riffing off an earlier comment, I wonder if x should also be an arg to _dropout_kernel. Local benchmarking didn't show much of a difference, but as long as it doesn't hurt codegen it could help to eliminate some redundancy.
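A sketch of that variant, folding x into the kernel so the whole computation is a single broadcast call (the names _dropout_kernel_x and dropout_fused are illustrative):

using Flux, Random

_dropout_kernel_x(x, y, p, invq) = x * ifelse(y > p, invq, zero(invq))

function dropout_fused(x, p; dims=:, active::Bool=true)
    active || return x
    y = rand!(similar(x, Flux._dropout_shape(x, dims)))
    invq = 1 / (1 - p)
    _dropout_kernel_x.(x, y, p, invq)   # one broadcast over x and the (possibly smaller) mask y
end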

Member:

OK, I can't reproduce this on a restart, not sure what was wrong, sorry.

I doubt it matters much whether you write x .* _dropout_kernel.(y, p, invq) or _dropout_kernel.(x, y, p, invq), but I'm not opposed. Your hope is roughly that it'll compile one broadcast for forwards & back, instead of two?

Pulling out the division and avoiding a branch seems like a good idea, although likewise I can't prove it causes issues.

Member:

Cool. It may not work at all, not sure. It's also possible that this should be y .= rand.(Float32) .> p.
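For reference, the bool_y and bool_y32 functions benchmarked below are not defined in the thread; presumably they are along these lines (a sketch, building the mask by comparing fresh uniform draws against p, not the exact definitions used):

using Flux, Random

function bool_y(x, p; dims=:)
    y = similar(x, Flux._dropout_shape(x, dims))
    y .= rand.(eltype(x)) .> p     # one rand per mask element, stored as 0/1 in y
end

function bool_y32(x, p; dims=:)
    y = similar(x, Flux._dropout_shape(x, dims))
    y .= rand.(Float32) .> p       # draw Float32 randoms explicitly, as suggested above
end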

Member (@mcabbott, Nov 29, 2021):

I get:

julia> cx = cu(randn(100, 100));

julia> @btime CUDA.@sync pr_y($cx, dims=1);
  min 12.529 μs, mean 14.025 μs (5 allocations, 160 bytes)

julia> @btime CUDA.@sync bool_y($cx, 0.5, dims=1);
  min 14.835 μs, mean 16.531 μs (12 allocations, 640 bytes)

julia> @btime CUDA.@sync bool_y32($cx, 0.5f0, dims=1);
  min 14.647 μs, mean 16.188 μs (13 allocations, 656 bytes)

julia> CUDA.@time pr_y(cx, dims=1);  # these times very noisy
  0.010914 seconds (165 CPU allocations: 8.750 KiB) (1 GPU allocation: 400 bytes, 0.26% memmgmt time)

julia> CUDA.@time bool_y32(cx, 0.5f0, dims=1);
  0.006785 seconds (71 CPU allocations: 3.625 KiB) (1 GPU allocation: 100 bytes, 0.39% memmgmt time)

Member:

using CUDA, Flux, Random, BenchmarkTools

function pr_y(x, p::Real; dims=:)
    y = rand!(similar(x, Flux._dropout_shape(x, dims)))
    y .= y .> p
end

julia> gx = cu(x);

julia> CUDA.@sync @btime bool_y($gx, 0.5);
  5.213 μs (28 allocations: 1.41 KiB)

julia> CUDA.@sync @btime bool_y($gx, 0.5, dims=1);
  5.212 μs (26 allocations: 1.36 KiB)

julia> CUDA.@sync @btime pr_y($gx, 0.5);
  9.604 μs (30 allocations: 1.75 KiB)

julia> CUDA.@sync @btime pr_y($gx, 0.5, dims=1);
  8.112 μs (26 allocations: 1.64 KiB)

But confusingly:

julia> @btime bool_y($x, 0.5, dims=1);
  207.288 ns (1 allocation: 144 bytes)

julia> @btime pr_y($x, 0.5, dims=1);
  121.337 ns (1 allocation: 496 bytes)

Member (@mcabbott, Nov 29, 2021):

What does CUDA.@sync @btime do? It seems like that would sync once after the benchmark has run, but perhaps it's not like that? I am never sure about GPU benchmarks.

The CPU result is surprising. Note that your pr_y is different from mine: it makes a second pass over y, and broadcasts back into the same array in-place, so it might hit JuliaLang/julia#43153. I was assuming that, if you materialise an array of random numbers, you should still fuse the .> p loop with the broadcast over x. Those differences should, if anything, make it slower, though.

In the GPU case, this 2nd pass (over a small array, 50x2?) might mean one more kernel launch, and it's possible this is the entire timing here, 2 vs 1?
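For concreteness, the two patterns being contrasted (a sketch with illustrative names; pr_y above corresponds to the first mask construction):

using Flux, Random

# In-place second pass: the mask is rewritten into the same array it reads from,
# the pattern that may hit the aliasing check in JuliaLang/julia#43153.
function mask_secondpass(x, p; dims=:)
    y = rand!(similar(x, Flux._dropout_shape(x, dims)))
    y .= y .> p
    x .* y
end

# Fused: the .> p comparison is folded into the broadcast over x, so y is only read.
function mask_fused(x, p; dims=:)
    y = rand!(similar(x, Flux._dropout_shape(x, dims)))
    x .* (y .> p)
end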

Member (@ToucheSir, Nov 29, 2021):

The @sync @btime was just crossed wires on my end, these should be accurate:

julia> @btime CUDA.@sync pr_y($gx, 0.5, dims=1);
  13.444 μs (26 allocations: 1.64 KiB)

julia> @btime CUDA.@sync pr_y($gx, 0.5);
  14.847 μs (30 allocations: 1.75 KiB)

julia> @btime CUDA.@sync bool_y($gx, 0.5, dims=1);
  10.059 μs (26 allocations: 1.36 KiB)

julia> @btime CUDA.@sync bool_y($gx, 0.5);
  10.148 μs (28 allocations: 1.41 KiB)

# pr_randonly(x; dims=:) = rand!(similar(x, Flux._dropout_shape(x, dims)))
julia> @btime CUDA.@sync pr_randonly($gx, dims=1);
  8.575 μs (4 allocations: 128 bytes)

julia> @btime CUDA.@sync pr_randonly($gx);
  8.107 μs (4 allocations: 128 bytes)

AFAICT from @device_code_llvm, the GPU path doesn't include a similar aliasing check.

Edit: another factor to consider is that rand! may be quite a bit faster on CPU with 1.7+ because of the new SIMD-friendly Xoshiro implementation.

 end

 @adjoint function dropout(x, p; dims=:, active::Bool=true)