Reduce allocations in Dropout #1791

mcognetta · 2021-11-29T05:40:32Z

The current implementation of Dropout constructs an intermediate vector before doing a matrix multiply, but these operations could be fused. This PR explicitly fuses the dropout masking, which results in a reduction in allocations and a modest speedup on CPU and GPUs for both training and inference.

Note: A helper function, dropout_mask was used in the previous operation (it was basically integrated directly into the dropout implementation in this PR). I can't tell if other packages are using this, so I didn't remove it. Perhaps it should be marked deprecated and then removed later?

Some benchmarks, first on my not great computer/GPU and then on a better one that was lent to me:

julia> versioninfo()
Julia Version 1.6.4
Commit 35f0c911f4 (2021-11-19 03:54 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, haswell)

julia> CUDA.versioninfo()
CUDA toolkit 11.4, artifact installation
NVIDIA driver 495.44.0, for CUDA 11.5
CUDA driver 11.5

Libraries: 
- CUBLAS: 11.5.4
- CURAND: 10.2.5
- CUFFT: 10.5.1
- CUSOLVER: 11.2.0
- CUSPARSE: 11.6.0
- CUPTI: 14.0.0
- NVML: 11.0.0+495.44
- CUDNN: 8.20.2 (for CUDA 11.4.0)
- CUTENSOR: 1.3.0 (for CUDA 11.2.0)

Toolchain:
- Julia: 1.6.4
- LLVM: 11.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

1 device:
  0: NVIDIA GeForce GTX 980 (sm_52, 2.938 GiB / 3.948 GiB available)

julia> a = CUDA.rand(256, 256, 256); b = rand(256, 256, 256); c = CUDA.rand(256, 64);

julia> @benchmark CUDA.@sync Flux.dropout($a, .1)
BenchmarkTools.Trial: 
  memory estimate:  4.28 KiB
  allocs estimate:  67
  --------------
  minimum time:     3.473 ms (0.00% GC)
  median time:      3.499 ms (0.00% GC)
  mean time:        10.601 ms (0.40% GC)
  maximum time:     160.137 ms (0.78% GC)
  --------------
  samples:          477
  evals/sample:     1

julia> @benchmark CUDA.@sync new_dropout($a, .1)
BenchmarkTools.Trial: 
  memory estimate:  1.73 KiB
  allocs estimate:  24
  --------------
  minimum time:     2.285 ms (0.00% GC)
  median time:      2.314 ms (0.00% GC)
  mean time:        5.849 ms (0.37% GC)
  maximum time:     168.861 ms (0.70% GC)
  --------------
  samples:          855
  evals/sample:     1

julia> @benchmark Flux.dropout($b, .1)
BenchmarkTools.Trial: 
  memory estimate:  256.00 MiB
  allocs estimate:  11
  --------------
  minimum time:     113.488 ms (0.00% GC)
  median time:      127.785 ms (7.41% GC)
  mean time:        133.344 ms (11.72% GC)
  maximum time:     262.799 ms (56.75% GC)
  --------------
  samples:          38
  evals/sample:     1

julia> @benchmark new_dropout($b, .1)
BenchmarkTools.Trial: 
  memory estimate:  128.00 MiB
  allocs estimate:  9
  --------------
  minimum time:     97.156 ms (0.00% GC)
  median time:      113.923 ms (0.00% GC)
  mean time:        123.853 ms (12.71% GC)
  maximum time:     332.686 ms (65.71% GC)
  --------------
  samples:          41
  evals/sample:     1

julia> @benchmark CUDA.@sync Flux.dropout($c, .1)
BenchmarkTools.Trial: 
  memory estimate:  3.81 KiB
  allocs estimate:  61
  --------------
  minimum time:     24.919 μs (0.00% GC)
  median time:      26.891 μs (0.00% GC)
  mean time:        38.205 μs (0.00% GC)
  maximum time:     932.582 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark CUDA.@sync new_dropout($c, .1)
BenchmarkTools.Trial: 
  memory estimate:  896 bytes
  allocs estimate:  18
  --------------
  minimum time:     16.825 μs (0.00% GC)
  median time:      17.680 μs (0.00% GC)
  mean time:        23.051 μs (0.00% GC)
  maximum time:     266.705 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> versioninfo()                                                                                                                                                       
Julia Version 1.6.4                                                                                                                                                        
Commit 35f0c911f4 (2021-11-19 03:54 UTC)                                                                                                                                   
Platform Info:                                                                                                                                                             
  OS: Linux (x86_64-pc-linux-gnu)                                                                                                                                          
  CPU: AMD Ryzen 9 3950X 16-Core Processor                                                                                                                                 
  WORD_SIZE: 64                                                                                                                                                            
  LIBM: libopenlibm                                                                                                                                                        
  LLVM: libLLVM-11.0.1 (ORCJIT, znver2)                                                                                                                                    
                                                                                                                                                                           
julia> CUDA.versioninfo()                                                                                                                                                  
CUDA toolkit 11.4, artifact installation                                                                                                                                   
NVIDIA driver 495.44.0, for CUDA 11.5                                                                                                                                      
CUDA driver 11.5                                                                                                                                                           
                                                                                                                                                                           
Libraries:                                                                                                                                                                 
- CUBLAS: 11.5.4                                                                                                                                                           
- CURAND: 10.2.5                                                                                                                                                           
- CUFFT: 10.5.1                                                                                                                                                            
- CUSOLVER: 11.2.0                                                                                                                                                         
- CUSPARSE: 11.6.0                                                                                                                                                         
- CUPTI: 14.0.0                                                                                                                                                            
- NVML: 11.0.0+495.44                                                                                                                                                      
- CUDNN: 8.20.2 (for CUDA 11.4.0)                                                                                                                                          
- CUTENSOR: 1.3.0 (for CUDA 11.2.0)                                                                                                                                        
                                                                                                                                                                           
Toolchain:                                                                                                                                                                 
- Julia: 1.6.4                                                                                                                                                             
- LLVM: 11.0.1                                                                                                                                                             
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0                                                                                              
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80                                                            
                                                                                                                                                                           
1 device:                                                                                                                                                                  
  0: NVIDIA GeForce RTX 2060 (sm_75, 4.548 GiB / 5.792 GiB available) 

julia> @benchmark CUDA.@sync Flux.dropout($a, .1)                                                                                                                          
BenchmarkTools.Trial: 2392 samples with 1 evaluation.                                                                                                                      
 Range (min … max):  1.714 ms …  10.763 ms  ┊ GC (min … max): 0.00% … 0.00%                                                                                                
 Time  (median):     1.792 ms               ┊ GC (median):    0.00%                                                                                                        
 Time  (mean ± σ):   2.088 ms ± 501.271 μs  ┊ GC (mean ± σ):  1.12% ± 4.41%                                                                                                
                                                                                                                                                                           
  ▃█▇  ▁▃▂  ▁▃▂▁▁  ▁▁▂ ▁▂▁               ▂▂  ▁                                                                                                                             
  ███▇▄████▇█████▇████████▇▇▇▆▆▆█▆▅▆▅▅▄▅▆██▇▇█▆▆▆▇▅▅▆▆▅▆▅▅▆▄▅ █                                                                                                            
  1.71 ms      Histogram: log(frequency) by time      3.52 ms <                                                                                                            
                                                                                                                                                                           
 Memory estimate: 4.34 KiB, allocs estimate: 69.                                                                                                                           
                                                                                                                                                                           
julia> @benchmark CUDA.@sync new_dropout($a, .1)                                                                                                                           
BenchmarkTools.Trial: 4003 samples with 1 evaluation.                                                                                                                      
 Range (min … max):  1.088 ms …   3.040 ms  ┊ GC (min … max): 0.00% … 44.48%                                                                                               
 Time  (median):     1.276 ms               ┊ GC (median):    0.00%                                                                                                        
 Time  (mean ± σ):   1.247 ms ± 175.018 μs  ┊ GC (mean ± σ):  0.78% ±  3.65%                                                                                               
                                                                                                                                                                           
    ▃█▂        ▆▆                                                                                                                                                          
  ▂▄███▄▂▁▁▁▂▃▇███▃▂▂▂▁▂▂▂▂▂▂▂▂▁▂▁▁▁▂▂▂▂▂▂▂▂▁▁▁▁▁▁▂▁▂▂▂▂▂▂▁▂▂ ▃                                                                                                            
  1.09 ms         Histogram: frequency by time        1.99 ms <                                                                                                            
                                                                                                                                                                           
 Memory estimate: 1.77 KiB, allocs estimate: 25.                                                                                                                           
                                                                                                                                                                           
julia> @benchmark Flux.dropout($b, .1)                                                                                                                                     
BenchmarkTools.Trial: 61 samples with 1 evaluation.                                                                                                                        
 Range (min … max):  66.213 ms … 185.794 ms  ┊ GC (min … max):  0.00% … 62.21%                                                                                             
 Time  (median):     80.290 ms               ┊ GC (median):    13.02%                                                                                                      
 Time  (mean ± σ):   81.996 ms ±  19.175 ms  ┊ GC (mean ± σ):  15.68% ± 11.43%                                                                                             
                                                                                                                                                                           
  ▅       █   ▅                                                                                                                                                            
  █▅▁▅▁▁▅▁█▁▁▁█▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁                                                                                                           
  66.2 ms       Histogram: log(frequency) by time       166 ms <                                                                                                           
                                                                                                                                                                           
 Memory estimate: 256.00 MiB, allocs estimate: 11.

julia> @benchmark new_dropout($b, .1)                                                                                                                                      
BenchmarkTools.Trial: 100 samples with 1 evaluation.                                                                                                                       
 Range (min … max):  41.640 ms … 161.174 ms  ┊ GC (min … max):  0.00% … 72.10%                                                                                             
 Time  (median):     46.021 ms               ┊ GC (median):     1.87%                                                                                                      
 Time  (mean ± σ):   50.089 ms ±  15.792 ms  ┊ GC (mean ± σ):  13.74% ± 12.79%                                                                                             
                                                                                                                                                                           
  █▂    ▇▄                                                                                                                                                                 
  ██▆▁▁▁██▆▆▁▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ ▅                                                                                                           
  41.6 ms       Histogram: log(frequency) by time       142 ms <                                                                                                           
                                                                                                                                                                           
 Memory estimate: 128.00 MiB, allocs estimate: 9.                                                                                                                          
                                                                                                                                                                           
julia> @benchmark CUDA.@sync Flux.dropout($c, .1)                                                                                                                          
BenchmarkTools.Trial: 10000 samples with 1 evaluation.                                                                                                                     
 Range (min … max):  15.639 μs … 267.015 μs  ┊ GC (min … max): 0.00% … 0.00%                                                                                               
 Time  (median):     16.471 μs               ┊ GC (median):    0.00%                                                                                                       
 Time  (mean ± σ):   16.732 μs ±   2.683 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%                                                                                               
                                                                                                                                                                           
         ▂▃▅█▅▇▅▃▁                                                                                                                                                         
  ▁▁▁▂▂▄▅██████████▇▅▄▃▃▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃                                                                                                           
  15.6 μs         Histogram: frequency by time         19.5 μs <                                                                                                           
                                                                                                                                                                           
 Memory estimate: 3.33 KiB, allocs estimate: 59.                                                                                                                           
                                                                                                                                                                           
julia> @benchmark CUDA.@sync new_dropout($c, .1)                                                                                                                           
BenchmarkTools.Trial: 10000 samples with 1 evaluation.                                                                                                                     
 Range (min … max):  13.616 μs … 352.747 μs  ┊ GC (min … max): 0.00% … 0.00%                                                                                               
 Time  (median):     14.608 μs               ┊ GC (median):    0.00%                                                                                                       
 Time  (mean ± σ):   14.767 μs ±   3.820 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%                                                                                               
                                                                                                                                                                           
               ▂▆██▆▃▁                                                                                                                                                     
  ▁▁▁▂▂▂▂▂▁▂▂▃▅███████▇▅▄▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂                                                                                                           
  13.6 μs         Histogram: frequency by time         17.2 μs <                                                                                                           
                                                                                                                                                                           
 Memory estimate: 816 bytes, allocs estimate: 16.

An example for gradient calculation. m1 has the current Dropout implementation, and m2 has the updated one:

julia> m1 = gpu(m1)
Chain(Diagonal(10000), Dropout(0.1))  # 20_000 parameters

julia> m2 = gpu(m2)
Chain(Diagonal(10000), NewDropout(0.1))  # 20_000 parameters

julia> @benchmark CUDA.@sync g1 = Flux.gradient(Flux.params(m1)) do
           sum(m1(x))/10000
       end
BenchmarkTools.Trial: 
  memory estimate:  22.53 KiB
  allocs estimate:  369
  --------------
  minimum time:     148.950 μs (0.00% GC)
  median time:      166.127 μs (0.00% GC)
  mean time:        248.822 μs (3.34% GC)
  maximum time:     112.305 ms (15.11% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark CUDA.@sync g2 = Flux.gradient(Flux.params(m2)) do
           sum(m2(x))/10000
       end
BenchmarkTools.Trial: 
  memory estimate:  18.81 KiB
  allocs estimate:  273
  --------------
  minimum time:     123.252 μs (0.00% GC)
  median time:      145.639 μs (0.00% GC)
  mean time:        223.976 μs (2.80% GC)
  maximum time:     131.651 ms (12.27% GC)
  --------------
  samples:          10000
  evals/sample:     1

DhairyaLGandhi · 2021-11-29T05:49:51Z

src/layers/normalise.jl

 end

 @adjoint function dropout(x, p; dims=:, active::Bool=true)
  active || return x, Δ -> (Δ, nothing)
-  y = dropout_mask(x, p, dims=dims)
-  return x .* y, Δ -> (Δ .* y, nothing)
+  y = rand!(similar(x, _dropout_shape(x, dims)))


Do we need to replace dropout_mask here?

It would probably be easier to just remove the call to dropout_kernel in dropout_mask and keep the first line here. That also gives a reason for dropout_mask's continued existence (I can see it coming in handy in the future if we ever think of more efficient ways to generate or store the mask depending on input type).

Edit: a slimmed down dropout_mask could also be used by

Flux.jl/src/layers/normalise.jl

Line 114 in 1242c20

noise = rand!(similar(x))

.

ToucheSir · 2021-11-29T06:01:09Z

src/layers/normalise.jl

-  y = dropout_mask(x, p, dims=dims)
-  return x .* y, Δ -> (Δ .* y, nothing)
+  y = rand!(similar(x, _dropout_shape(x, dims)))
+  return x .* _dropout_kernel.(y, p, 1-p), Δ -> (Δ .* _dropout_kernel.(y, p, 1-p), nothing)


This makes me wonder if _dropout_kernel should subsume the pointwise mul as well.

That, I believe, would be equivalent to the change here (but perhaps with neater packaging).

I believe it would also save a multiplication per element (assuming _dropout_kernel(x, y::T, p, q) where {T} = y > p ? T(x / q) : T(0) or some such)

I think it'd be equivalent in the end. Maybe check the generated code to verify.

src/layers/normalise.jl

Co-authored-by: Brian Chen <[email protected]>

mcognetta · 2021-11-29T14:54:48Z

src/layers/normalise.jl

@@ -31,7 +31,7 @@ The [`Dropout`](@ref) layer is what you should use in most scenarios.
 function dropout(x, p; dims=:, active::Bool=true)
  active || return x
  y = rand!(similar(x, _dropout_shape(x, dims)))
-  @. y = x * _dropout_kernel(y, p, 1-p)
+  x .* _dropout_kernel.(y, p, 1-p)


This allocates a new vector rather than reusing y. I tried this variant and it produced the same lowered code as the original.

I see, it's a compelling change, but I don't think it works when dims is set, because then the size of y is actually smaller than x. The current code relies on broadcasting to inflate size-1 dims in the mask to the equivalent full dim size in x.

Something here breaks type-stability, which isn't visible on the benchmarks of large arrays:

x = randn(3,4) function dropout_pr(x, p; dims=:, active::Bool=true) active || return x y = rand!(similar(x, Flux._dropout_shape(x, dims))) x .* Flux._dropout_kernel.(y, p, 1-p) end @btime Flux.dropout($x, 0.5; dims=1) # 60.359 ns @btime dropout_pr($x, 0.5; dims=1) # 619.457 ns @code_warntype dropout_pr(x, 0.5; dims=1); # Body::Any

Also, it is odd that the calculation of 1-p is pulled out of the kernel, but the more expensive 1/q is not. IMO this should be written _dropout_kernel(y, p, invq) = ifelse(y > p, invq, zero(invq)), although in examples I tried the compiler does seem to figure this out. But explicit is better than magic.

Seems to infer fine for me, perhaps Random.rand! wasn't imported beforehand? I get 100.5ns for dropout_pr and 108.5ns for Flux.dropout.

Riffing of an earlier comment, I wonder if x should also be an arg to dropout_kernel. Local benchmarking didn't show much of a difference, but as long as it doesn't hurt codegen it could help to eliminate some redundancy.

OK, I can't reproduce this on a restart, not sure what was wrong, sorry.

I doubt it matters much whether you write x .* _dropout_kernel.(y, p, invq) or _dropout_kernel.(x, y, p, invq), but not opposed. Your hope is roughly that it'll compile one broadcast for forwards & back, instead of two?

Pulling out the division and avoiding a branch seems like a good idea, although likewise I can't prove it causes issues.

Cool. It may not work at all, not sure. It's also possible that this should be y .= rand.(Float32) .> p.

I get:

julia> cx = cu(randn(100, 100)); julia> @btime CUDA.@sync pr_y($cx, dims=1); min 12.529 μs, mean 14.025 μs (5 allocations, 160 bytes) julia> @btime CUDA.@sync bool_y($cx, 0.5, dims=1); min 14.835 μs, mean 16.531 μs (12 allocations, 640 bytes) julia> @btime CUDA.@sync bool_y32($cx, 0.5f0, dims=1); min 14.647 μs, mean 16.188 μs (13 allocations, 656 bytes) julia> CUDA.@time pr_y(cx, dims=1); # these times very noisy 0.010914 seconds (165 CPU allocations: 8.750 KiB) (1 GPU allocation: 400 bytes, 0.26% memmgmt time) julia> CUDA.@time bool_y32(cx, 0.5f0, dims=1); 0.006785 seconds (71 CPU allocations: 3.625 KiB) (1 GPU allocation: 100 bytes, 0.39% memmgmt time)

function pr_y(x, p::Real; dims=:) y = rand!(similar(x, Flux._dropout_shape(x, dims))) y .= y .> p end julia> gx = cu(x); julia> CUDA.@sync @btime bool_y($gx, 0.5); 5.213 μs (28 allocations: 1.41 KiB) julia> CUDA.@sync @btime bool_y($gx, 0.5, dims=1); 5.212 μs (26 allocations: 1.36 KiB) julia> CUDA.@sync @btime pr_y($gx, 0.5); 9.604 μs (30 allocations: 1.75 KiB) julia> CUDA.@sync @btime pr_y($gx, 0.5, dims=1); 8.112 μs (26 allocations: 1.64 KiB)

But confusingly:

julia> @btime bool_y($x, 0.5, dims=1); 207.288 ns (1 allocation: 144 bytes) julia> @btime pr_y($x, 0.5, dims=1); 121.337 ns (1 allocation: 496 bytes)

What does CUDA.@sync @btime do? It seems like that would sync once after the benchmark has run, but perhaps it's not like that? I am never sure about GPU benchmarks.

The CPU result is sturprising. Note that your pr_y is different to mine, it makes a second pass over y, and broadcasts back to the same array in-place, so it might hit JuliaLang/julia#43153 . I was assuming that, if you materialise an array of random numbers, you should still fuse the .> p loop with the x one. These should if anything make it slower, though.

In the GPU case, this 2nd pass (over a small array, 50x2?) might mean one more kernel launch, and it's possible this is the entire timing here, 2 vs 1?

The @sync @btime was just crossed wires on my end, these should be accurate:

julia> @btime CUDA.@sync pr_y($gx, 0.5, dims=1); 13.444 μs (26 allocations: 1.64 KiB) julia> @btime CUDA.@sync pr_y($gx, 0.5); 14.847 μs (30 allocations: 1.75 KiB) julia> @btime CUDA.@sync bool_y($gx, 0.5, dims=1); 10.059 μs (26 allocations: 1.36 KiB) julia> @btime CUDA.@sync bool_y($gx, 0.5); 10.148 μs (28 allocations: 1.41 KiB) # pr_randonly(x; dims=:) = rand!(similar(x, Flux._dropout_shape(x, dims))) julia> @btime CUDA.@sync pr_randonly($gx, dims=1); 8.575 μs (4 allocations: 128 bytes) julia> @btime CUDA.@sync pr_randonly($gx); 8.107 μs (4 allocations: 128 bytes)

AFAICT from @device_code_llvm, the GPU path doesn't include a similar aliasing check.

Edit: another factor to consider is that rand! may be quite a bit faster on CPU with 1.7+ because of the new SIMD-friendly Xoshiro implementation.

mcabbott · 2021-11-29T17:40:55Z

src/layers/normalise.jl

-  y = dropout_mask(x, p, dims=dims)
-  return x .* y, Δ -> (Δ .* y, nothing)
+  y = rand!(similar(x, _dropout_shape(x, dims)))
+  return x .* _dropout_kernel.(y, p, 1-p), Δ -> (Δ .* _dropout_kernel.(y, p, 1-p), nothing)
 end

 function dropout_mask(x, p; dims=:)


Note BTW that this re-use of y to save memory suffers from JuliaLang/julia#43153 . Applying the (possible) fix from there saves 30% or so:

julia> x = randn(Float32, 100, 1000); julia> @btime Flux.dropout_mask($x, 0.5; dims=:); min 70.791 μs, mean 129.631 μs (7 allocations, 390.80 KiB. GC mean 24.65%) julia> @eval Base.Broadcast @inline function copyto!(dest::AbstractArray, bc::Broadcasted{Nothing}) axes(dest) == axes(bc) || throwdm(axes(dest), axes(bc)) # Performance optimization: broadcast!(identity, dest, A) is equivalent to copyto!(dest, A) if indices match if bc.f === identity && bc.args isa Tuple{AbstractArray} # only a single input argument to broadcast! A = bc.args[1] if axes(dest) == axes(A) return copyto!(dest, A) end end bc′ = preprocess(dest, bc) # Performance may vary depending on whether `@inbounds` is placed outside the # for loop or not. (cf. https://github.com/JuliaLang/julia/issues/38086) @simd ivdep for I in eachindex(dest) @inbounds dest[I] = bc′[I] end return dest end copyto! (generic function with 126 methods) julia> @btime Flux.dropout_mask($x, 0.5; dims=:); min 55.750 μs, mean 102.479 μs (7 allocations, 390.80 KiB. GC mean 24.71%)

That's another reason to avoid this, in favour of the fusion proposed here.

Also, I think deleting an internal function like this should be fine. If anyone was overloading this for some reason, better they find out sooner than later.

Would this correctly not trigger for GPU arrays? The type of dest seems pretty broad.

Not sure this issue exists for GPU, nor whether it calls the method which I pirate here.

IDK, but the definition appears specific enough to not cause major problems: https://github.com/JuliaGPU/GPUArrays.jl/blob/master/src/host/broadcast.jl#L50

ToucheSir · 2021-12-13T23:49:28Z

I think the biggest outstanding question here is how to handle the case when dims != :, because then the mask will be smaller than the output and thus not reusable as a buffer when generating the final output. The most obvious option is to allocate a full-sized array and selectively fill in/mask the important bits (essentially doing what broadcasting implicitly handles), but it's not clear to me whether that will be more or less efficient than the current approach when dims are set.

Another consideration is that the changed code path(s) are generally only active when taking gradients. Which means that most users will only hit the adjoint (which we haven't figured out a more memory-efficient path) and not the main function. One possible direction here is to create a bool mask (saving ~4x the memory). If we can show that doing that and calling the kernel part twice is not slower than what's currently on master, this should be a shoe-in :)

mcognetta added 3 commits November 29, 2021 00:27

fuse loops

214b05c

Update normalise.jl

185ab40

Update normalise.jl

1242c20

DhairyaLGandhi reviewed Nov 29, 2021

View reviewed changes

ToucheSir reviewed Nov 29, 2021

View reviewed changes

src/layers/normalise.jl Outdated Show resolved Hide resolved

Update src/layers/normalise.jl

64d554f

Co-authored-by: Brian Chen <[email protected]>

mcognetta commented Nov 29, 2021

View reviewed changes

mcabbott reviewed Nov 29, 2021

View reviewed changes

mcabbott added the performance label Jan 14, 2022

ToucheSir mentioned this pull request Feb 5, 2022

Use LogExpFunctions for losses #1866

Open

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce allocations in Dropout #1791

Reduce allocations in Dropout #1791

mcognetta commented Nov 29, 2021 •

edited

Loading

DhairyaLGandhi Nov 29, 2021

ToucheSir Nov 29, 2021 •

edited

Loading

ToucheSir Nov 29, 2021

mcognetta Nov 29, 2021

ToucheSir Nov 29, 2021

DhairyaLGandhi Nov 29, 2021

mcognetta Nov 29, 2021

ToucheSir Nov 29, 2021

mcabbott Nov 29, 2021

ToucheSir Nov 29, 2021

mcabbott Nov 29, 2021

mcabbott Nov 29, 2021

mcabbott Nov 29, 2021 •

edited

Loading

ToucheSir Nov 29, 2021

mcabbott Nov 29, 2021 •

edited

Loading

ToucheSir Nov 29, 2021 •

edited

Loading

mcabbott Nov 29, 2021 •

edited

Loading

mcabbott Nov 29, 2021

ToucheSir Nov 29, 2021

mcabbott Nov 29, 2021

ToucheSir Nov 29, 2021

ToucheSir commented Dec 13, 2021

Reduce allocations in Dropout #1791

Are you sure you want to change the base?

Reduce allocations in Dropout #1791

Conversation

mcognetta commented Nov 29, 2021 • edited Loading

Choose a reason for hiding this comment

ToucheSir Nov 29, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mcabbott Nov 29, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mcabbott Nov 29, 2021 • edited Loading

Choose a reason for hiding this comment

ToucheSir Nov 29, 2021 • edited Loading

Choose a reason for hiding this comment

mcabbott Nov 29, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ToucheSir commented Dec 13, 2021

mcognetta commented Nov 29, 2021 •

edited

Loading

ToucheSir Nov 29, 2021 •

edited

Loading

mcabbott Nov 29, 2021 •

edited

Loading

mcabbott Nov 29, 2021 •

edited

Loading

ToucheSir Nov 29, 2021 •

edited

Loading

mcabbott Nov 29, 2021 •

edited

Loading