Reduce allocations in Dropout #1791

Open
wants to merge 4 commits into
base: master
Changes from 2 commits
10 changes: 5 additions & 5 deletions src/layers/normalise.jl
@@ -30,14 +30,14 @@ The [`Dropout`](@ref) layer is what you should use in most scenarios.
"""
function dropout(x, p; dims=:, active::Bool=true)
active || return x
y = dropout_mask(x, p, dims=dims)
return x .* y
y = rand!(similar(x, _dropout_shape(x, dims)))
@inbounds @. y = x * _dropout_kernel(y, p, 1-p)
end
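For orientation, the new lines lean on two small helpers defined elsewhere in src/layers/normalise.jl and not shown in this diff. Roughly, paraphrasing rather than quoting the file:

# Paraphrased helpers (illustrative, not part of this diff):
_dropout_shape(s, ::Colon) = size(s)                       # mask matches the full array
_dropout_shape(s, dims) = tuple((i in dims ? si : 1 for (i, si) in enumerate(size(s)))...)  # singleton elsewhere, so the mask broadcasts

# keep an entry when its uniform sample exceeds p, rescaling survivors by 1/q with q = 1 - p
_dropout_kernel(y::T, p, q) where {T} = y > p ? T(1/q) : T(0)

With dims=:, the noise buffer y has the same shape as x, which is what lets the second new line overwrite y in place instead of allocating a separate output.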

@adjoint function dropout(x, p; dims=:, active::Bool=true)
  active || return x, Δ -> (Δ, nothing)
-  y = dropout_mask(x, p, dims=dims)
-  return x .* y, Δ -> (Δ .* y, nothing)
+  y = rand!(similar(x, _dropout_shape(x, dims)))
Review thread on the line above:

Member: Do we need to replace dropout_mask here?

ToucheSir (Member), Nov 29, 2021: It would probably be easier to just remove the call to dropout_kernel in dropout_mask and keep the first line here. That also gives a reason for dropout_mask's continued existence (I can see it coming in handy in the future if we ever think of more efficient ways to generate or store the mask depending on input type).
Edit: a slimmed-down dropout_mask could also be used by `noise = rand!(similar(x))`.
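A sketch of that suggestion (the reviewer's idea, not code from this PR): dropout_mask keeps only the noise generation, and dropout fuses the kernel into the same broadcast that applies the mask:

using Random: rand!

dropout_mask(x, p; dims=:) = rand!(similar(x, _dropout_shape(x, dims)))   # just uniform noise, no kernel applied

function dropout(x, p; dims=:, active::Bool=true)
  active || return x
  y = dropout_mask(x, p, dims=dims)
  return x .* _dropout_kernel.(y, p, 1 - p)   # kernel and multiply in one fused broadcast
end

Because the final broadcast is out of place, this variant also works when dims narrows the mask to a smaller shape than x.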

+  return x .* _dropout_kernel.(y, p, 1-p), Δ -> (Δ .* _dropout_kernel.(y, p, 1-p), nothing)
Review thread on the line above:

Member: This makes me wonder if _dropout_kernel should subsume the pointwise mul as well.

Contributor (PR author): That, I believe, would be equivalent to the change here (but perhaps with neater packaging).

Member: I believe it would also save a multiplication per element (assuming `_dropout_kernel(x, y::T, p, q) where {T} = y > p ? T(x / q) : T(0)` or some such).

Member: I think it'd be equivalent in the end. Maybe check the generated code to verify.
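For concreteness, a sketch of the kernel being discussed, following the signature suggested above (illustrative only, not merged code):

# Fused kernel: decide keep/drop, rescale, and multiply by x in a single branch,
# so dropped entries skip the multiplication entirely.
_dropout_kernel(x, y::T, p, q) where {T} = y > p ? T(x / q) : T(0)

# It would slot into dropout as one broadcast over the noise buffer, e.g.
#   y = rand!(similar(x, _dropout_shape(x, dims)))
#   @. y = _dropout_kernel(x, y, p, 1 - p)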

end

function dropout_mask(x, p; dims=:)
Review thread on the line above:

mcabbott (Member), Nov 29, 2021: Note BTW that this re-use of y to save memory suffers from JuliaLang/julia#43153. Applying the (possible) fix from there saves 30% or so:

julia> x = randn(Float32, 100, 1000);

julia> @btime Flux.dropout_mask($x, 0.5; dims=:);
  min 70.791 μs, mean 129.631 μs (7 allocations, 390.80 KiB. GC mean 24.65%)

julia> @eval Base.Broadcast @inline function copyto!(dest::AbstractArray, bc::Broadcasted{Nothing})
           axes(dest) == axes(bc) || throwdm(axes(dest), axes(bc))
           # Performance optimization: broadcast!(identity, dest, A) is equivalent to copyto!(dest, A) if indices match
           if bc.f === identity && bc.args isa Tuple{AbstractArray} # only a single input argument to broadcast!
               A = bc.args[1]
               if axes(dest) == axes(A)
                   return copyto!(dest, A)
               end
           end
           bc′ = preprocess(dest, bc)
           # Performance may vary depending on whether `@inbounds` is placed outside the
           # for loop or not. (cf. https://github.com/JuliaLang/julia/issues/38086)
           @simd ivdep for I in eachindex(dest)
               @inbounds dest[I] = bc′[I]
           end
           return dest
       end
copyto! (generic function with 126 methods)

julia> @btime Flux.dropout_mask($x, 0.5; dims=:);
  min 55.750 μs, mean 102.479 μs (7 allocations, 390.80 KiB. GC mean 24.71%)

That's another reason to avoid this, in favour of the fusion proposed here.

Member: Also, I think deleting an internal function like this should be fine. If anyone was overloading it for some reason, better they find out sooner than later.

Member: Would this correctly not trigger for GPU arrays? The type of dest seems pretty broad.

Member: Not sure this issue exists for GPU, nor whether it calls the method which I pirate here.

Member: IDK, but the definition appears specific enough to not cause major problems: https://github.com/JuliaGPU/GPUArrays.jl/blob/master/src/host/broadcast.jl#L50
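To make the allocation argument concrete, a rough comparison one could run (assumes BenchmarkTools is installed; the kernel and function names are made up for this sketch, and timings will vary by machine):

using BenchmarkTools
using Random: rand!

kernel(y::T, p, q) where {T} = y > p ? T(1/q) : T(0)   # same idea as _dropout_kernel

function dropout_twopass(x, p)        # old shape of the code: materialise a mask, then multiply
  y = rand!(similar(x))               # first array: the noise / mask buffer
  y .= kernel.(y, p, 1 - p)           # turn it into the scaled 0-or-1/q mask in place
  return x .* y                       # second array: the result
end

function dropout_fused(x, p)          # shape of the code proposed in this PR
  y = rand!(similar(x))
  @. y = x * kernel(y, p, 1 - p)      # overwrite the noise buffer in place
  return y
end

x = randn(Float32, 100, 1000)
@btime dropout_twopass($x, 0.5f0);    # expect two 100×1000 Float32 arrays allocated
@btime dropout_fused($x, 0.5f0);      # expect one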

@@ -56,7 +56,7 @@ e.g. `Dropout(p; dims = 3)` will randomly zero out entire channels on WHCN input
(also called 2D dropout).

Does nothing to the input once [`Flux.testmode!`](@ref) is `true`.
-"""
+"""`
mutable struct Dropout{F,D}
  p::F
  dims::D