CPU Memory Usage for Tasks with CPU-GPU Transfer #1351
Comments
Your chunk size is …
Also, where are you getting the …
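For reference, the per-chunk footprint implied by the reproducer (M = 100_000, N = 4_000, float64) can be checked directly; a minimal sketch:

```python
M, N = 100_000, 4_000
bytes_per_chunk = M * N * 8           # float64 -> 8 bytes per element
print(bytes_per_chunk / 1e9)          # 3.2, i.e. ~3.2 GB per chunk, not 320 MB
```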
@pentschev Apologies for the bad math, shameful on my part. However, there are still some things I don't understand that are affecting us:
Perhaps both of these are aberrations. I will keep investigating.
Huh, now this also seems to be the opposite. Not sure what's up with that.
Sorry, the reshape operation is costly. So

```python
import dask.distributed as dd
import dask.array as da  # needed for map_blocks below
import numpy as np

cluster = dd.LocalCluster(n_workers=1)
client = dd.Client(cluster)

M = 100_000
N = 4_000

def make_chunk():
    arr = np.random.random((M, N))
    return arr

arr = da.map_blocks(make_chunk, meta=np.array((1.,), dtype=np.float64), dtype=np.float64, chunks=((M,) * 50, (N,) * 1))

def __double_block(block):
    return block * 2

doubled_matrices = da.map_blocks(__double_block, arr)
doubled_matrices.sum(axis=0).compute()
```

uses less memory than the GPU equivalent: it peaks at a similar amount but then falls back faster to a near-0 amount. I must have copy-and-pasted the wrong cell.
When you say "post-completion of the task", are you referring to the return of `compute()`?
With your CPU version of the code I also see memory peak at ~7GB on my end, which is similar to what I see in your original/GPU code. I'm not sure exactly what you're seeing on your end, but I cannot reproduce the CPU-only setup using much less memory, as you're reporting.
I am saying after the final `compute()` returns.
Yup, I'm saying that the peak is the same, about 7GB, but it returns to near-0 GB during processing instead of staying at a 4GB baseline as in the GPU version.
Measuring by eyeballing may be tricky here. I wouldn't be surprised if the GPU implementation seemingly doesn't drop as much because the processing is much faster and values simply do not update fast enough. With that said, unless you're measuring at high frequency, I wouldn't put too much trust in the appearance that they are behaving differently.
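One way to measure at higher frequency than the dashboard is to poll the workers' resident set size directly; a rough sketch, assuming `psutil` is installed on the workers and the computation is submitted asynchronously so the client is free to poll:

```python
import time
import psutil

def worker_rss_gb():
    # Resident set size of the current worker process, in GB.
    return psutil.Process().memory_info().rss / 1e9

future = client.compute(doubled_matrices.sum(axis=0))  # non-blocking submit
while not future.done():
    print(client.run(worker_rss_gb))  # one reading per worker
    time.sleep(0.1)                   # ~10 samples per second
```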
I would tend to agree here, but …
I have considered just turning off the memory monitoring of dask, but I am not sure what that will do if …
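For what it's worth, Dask's worker memory management can be relaxed through configuration rather than removed outright; a sketch using the documented `distributed.worker.memory.*` settings (whether this changes anything here is untested):

```python
import dask

# Disable the spill/pause/terminate thresholds before the cluster is created.
dask.config.set({
    "distributed.worker.memory.target": False,     # no spilling based on managed memory
    "distributed.worker.memory.spill": False,      # no spilling based on process memory
    "distributed.worker.memory.pause": False,      # never pause the worker
    "distributed.worker.memory.terminate": False,  # never restart the worker at 95%
})
```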
In your real use case, do you have controlled data sizes like in the code you shared here, or does that vary across chunks? Is it possible that you're running out of memory because you really have too much data being read/stored in memory? Note that depending on your system setup, data access patterns, etc., it's possible that you end up with a different load pattern because GPU compute completes faster than the CPU version did. One thing you may try is adding …
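Since the exact suggestion above is truncated, here is one thing that is sometimes tried in this situation (a guess at a mitigation, not necessarily what was being suggested): synchronize the stream inside the transfer task so the host buffer is no longer referenced by the time the task returns.

```python
import cupy as cp

def to_gpu(block):
    out = cp.asarray(block)
    # Wait for the host->device copy to finish before the numpy block goes out
    # of scope, so nothing still references its host buffer.
    cp.cuda.get_current_stream().synchronize()
    return out

arr = arr.map_blocks(to_gpu, meta=cp.array((1.,), dtype=cp.float64), dtype=cp.float64)
```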
Yup, it's not an exact match, but the overhead should not be too high. I will check tomorrow to be sure, but we're talking on the order of a few extra megabytes per read; the chunk size is rather small.
I will also have to double-check this to be 100% sure (you can see I am a bit careless sometimes), but I have definitely tried this as well, to no avail. What still seems strange to me is the memory hanging around after job completion. I will also try an even dumber reproducer for that tomorrow, like one job, no chunks, just CPU-GPU, and see if the memory hangs around.
This generally means there is some circular reference that the garbage collector couldn't break and release. I haven't checked what happens to memory after your sample(s) above complete while the Dask cluster remains alive; do you observe memory still resident in those samples too, or just in your real code?
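A quick way to test the circular-reference theory (a sketch, not from the thread) is to force a full collection on every worker and see whether resident memory drops afterwards:

```python
import gc

# Returns, per worker, the number of objects the collector freed; if worker
# memory drops right after this, uncollected cycles were holding the data.
print(client.run(gc.collect))
```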
Yes, the reproducer definitely does show the "bad behavior." I've checked the garbage collector and don't see anything too big hanging around, but will have another look.
So when I print out the leftover objects (after …):

```python
import gc
import sys

def make_chunk():
    arr = np.random.random((M, N))
    chunk = cp.array(arr)
    del arr
    # Report any tracked objects larger than ~100MB left on the worker.
    dask.distributed.print([type(obj) for obj in gc.get_objects() if (sys.getsizeof(obj) / 1000000) > 100])
    return chunk
```
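One caveat with the check above (an aside, assuming standard NumPy behavior): `sys.getsizeof` only counts memory an object itself owns, so a large buffer that is still alive but only reachable through a view or another container will not show up in that list.

```python
import sys
import numpy as np

a = np.random.random((100_000, 4_000))
v = a[:, :]                       # a view sharing a's 3.2 GB buffer
print(sys.getsizeof(a) / 1e6)     # ~3200 MB: `a` owns its data, buffer is counted
print(sys.getsizeof(v) / 1e6)     # <1 MB: the view does not own the buffer
```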
Ah, I also forgot another counterfactual. The following displays identical behavior:

```python
def make_chunk():
    return np.random.random((M, N))

arr = da.map_blocks(make_chunk, meta=np.array((1.,), dtype=np.float64), dtype=np.float64, chunks=((M,) * 15, (N,) * 1))
arr = arr.map_blocks(cp.array, meta=cp.array((1.,), dtype=cp.float64), dtype=cp.float64)
```

followed by some operation like:

```python
def __double_block(block):
    return block * 2

doubled_matrices = da.map_blocks(__double_block, arr)
doubled_matrices.sum(axis=0).compute()
```

So I'm not sure about garbage collection. After that operation completes you see … and as confirmation, …
I can reproduce that too, using your latest example. I see … commenting …

```python
for i in range(50):
    a = np.random.random((100_000, 4_000))
    b = cp.array(a)
    print(a)
    print(b)
```

While I can confirm the behavior on my end, I don't think I'll have much time to push this further in the short term. In any case, I'll summarize my findings as you did above. I can see about 4GB of host memory still being held by the Dask worker after execution completes (see images in #1351 (comment) for examples); the code I ran is below:

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask.array as da
import dask
import cupy as cp
import rmm
import numpy as np
from rmm.allocators.cupy import rmm_cupy_allocator

if __name__ == "__main__":
    def set_mem():
        rmm.reinitialize(managed_memory=True)
        cp.cuda.set_allocator(rmm_cupy_allocator)

    cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="0")
    client = Client(cluster)
    client.run(set_mem)

    M = 100_000
    N = 4_000

    def make_chunk():
        return np.random.random((M, N))

    arr = da.map_blocks(make_chunk, meta=np.array((1.,), dtype=np.float64), dtype=np.float64, chunks=((M,) * 15, (N,) * 1))

    # Commenting this line out, thus making the code CPU-only, prevents the 4GB from
    # being held by the worker.
    arr = arr.map_blocks(cp.array, meta=cp.array((1.,), dtype=cp.float64), dtype=cp.float64)

    def __double_block(block):
        return block * 2

    doubled_matrices = da.map_blocks(__double_block, arr)
    doubled_matrices.sum(axis=0).compute()

    import time
    time.sleep(600)
```

Besides the above, commenting out the …

Nevertheless, it would be interesting to know what your use case is, @ilan-gold, as this might help us prioritize this issue.

@charlesbluca @quasiben @rjzamora @wence- pinging you in case anybody has time or can think of ways to debug this further.
We are reading off disk via … The issue is really that we are CPU-memory limited to 256GB. So this becomes a tradeoff between the number of workers and memory, as this problem grows with the number of workers.
I'm not so sure this behavior has anything to do with dask. It definitely seems like cupy is somehow holding on to a host-memory reference somewhere. As far as I can tell, creating and then deleting a cupy array (in the absence of dask) also leaves behind a residual host-memory footprint. |
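If CuPy itself is holding host memory, one place worth checking is its default pinned-memory pool, which caches page-locked host buffers used for transfers; a sketch using CuPy's pool APIs (whether this particular path is involved here is not established):

```python
import numpy as np
import cupy as cp

pinned_pool = cp.get_default_pinned_memory_pool()

a = np.random.random((100_000, 4_000))
b = cp.array(a)                      # host -> device transfer
del a, b

print(pinned_pool.n_free_blocks())   # cached pinned host blocks, if any
pinned_pool.free_all_blocks()        # hand them back to the OS
```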
There's definitely some residual memory, but what I observed is in the ~50MB range, whereas in this case we're seeing more like 4-5GB, which seems consistent with the original size of the array. However, as per my previous post, I couldn't reproduce more than something in the 50MB range of memory footprint with the following simple loop:

```python
for i in range(50):
    a = np.random.random((100_000, 4_000))
    b = cp.array(a)
    print(a)
    print(b)
```

I'd be happy to be proven wrong, as it would be much easier to debug too; perhaps I missed some detail and did not really write equivalent code above.
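To put a number on "residual memory" instead of eyeballing it, the same loop can be bracketed with RSS measurements; a sketch, assuming `psutil` is available (note the first `cp.array` call also initializes the CUDA context, which by itself accounts for some resident host memory):

```python
import numpy as np
import cupy as cp
import psutil

proc = psutil.Process()
before = proc.memory_info().rss

for i in range(50):
    a = np.random.random((100_000, 4_000))
    b = cp.array(a)

del a, b
after = proc.memory_info().rss
print(f"residual host memory: {(after - before) / 1e6:.0f} MB")
```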
Just as some more evidence here of a leak:
The global array size would be much larger than 4-5GB (or maybe I'm mistaken?). 4-5GB seems more like an accumulation of many smaller chunks, no? I'm certainly not 100% sure that Dask is not the problem; I just suspect that dask is amplifying some strange cupy behavior. Another interesting note: if cupy is used to generate the array in the original repro, the problem seems to go away?
Yes, but this makes sense, no? I'm specifically saying the CPU-GPU transfer is what I suspect to be problematic (and probably its garbage collection), Dask or no Dask.
No idea!
Yes, I agree that this is expected, so maybe not so "interesting" :)
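For concreteness, the "cupy is used to generate the array" variant being referred to presumably looks something like the following (a sketch, not code from the thread); here no intermediate numpy array and no host-to-device copy is involved:

```python
def make_chunk_gpu():
    # Chunk is created directly on the GPU.
    return cp.random.random((M, N))

arr = da.map_blocks(make_chunk_gpu, meta=cp.array((1.,), dtype=cp.float64),
                    dtype=cp.float64, chunks=((M,) * 15, (N,) * 1))
```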
When I run my real-world job I see:
So, more evidence that this is a memory leak. There is definitely no reason my job needs 22GB, much less 32GB.
Is it possible dask (i.e., maybe not something in this package) itself is secretly allocating CPU memory because it is not properly aware of GPU arrays? |
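On the question of whether Dask accounts for GPU arrays at all: Dask's size bookkeeping goes through its `sizeof` dispatch, which can be probed directly; a sketch (this only checks the accounting, it does not show where the host memory goes):

```python
import cupy as cp
from dask.sizeof import sizeof

x = cp.ones((100_000, 4_000), dtype=cp.float64)
print(x.nbytes)    # 3_200_000_000
print(sizeof(x))   # should match nbytes if dask recognizes CuPy arrays
```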
Issue + Reproducers
So I have an i/o job that reads in data to the CPU and passes it to the GPU in a `map_blocks` call, and then uses CuPy downstream for a non-standard map-blocks call. Here is the reproducer minus the i/o: …

This uses an unaccountable amount of CPU memory, on the order of 4-8 GB, but I have no idea why this is happening. I don't have any CPU memory that should be used here except the initial read. And when the job completes, dask still reports that it is holding on to 4GB (!) of memory. I see at most 2 tasks running, with another 2-6 in memory. In total, the CPU memory being so high doesn't make sense, since the individual numpy arrays are 320MB, so this should be at most 640MB (and even that seems high given how long they last on the CPU before I call `del`). I don't think this is a dask-memory-reporting issue, because `top` shows the same amount of memory usage.

I also don't think this has to do with the computation I chose, as: … has the same issue, albeit with a warning about chunk sizes (although I'm not sure why I'm getting the warning, since the reshape is right along the blocks). In any case, it's the same memory problem. Minus the reshape, and so minus that warning, same behavior: …
Some env info:
Some monitoring screenshots:
Addendum 1: I don't see this as spilling
I also don't think this is spilling, because the GPU memory is not that high:
That number is also basically correct: `((4000 * 4000) * 4 + (100_000 * 4000) * 4) < 2GB`, where the first `4000 * 4000` is from holding the partial `sum` result in memory and the `100_000 * 4000` is the input data.
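Evaluating that expression (a quick sketch; the `* 4` assumes 4-byte elements as written, so with the float64 chunks from the reproducer the totals would be roughly double, still in the same ballpark):

```python
partial_sum = 4_000 * 4_000 * 4            # ~0.064 GB held for the running sum
input_chunk = 100_000 * 4_000 * 4          # ~1.6 GB for one input chunk
print((partial_sum + input_chunk) / 1e9)   # ~1.66, i.e. under 2 GB
```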
Addendum 2: This is GPU specific
This behavior does not happen on CPU dask:
I see the memory use fluctuating, but the baseline of 1.5GB makes sense given the in-memory/processing stats I cited above. It also releases the memory at the end.
Addendum 3: Not an rmm issue
I tried commenting out `client.run(set_mem)` and that also had no effect.