Can't rebuild container image from Dockerfile #3

Open

khinsen opened this issue Jun 21, 2021 · 5 comments

khinsen commented Jun 21, 2021

I made an attempt to rebuild the container image locally from Dockerfile-0.5.1-GPU-OpenMPI-xenial-devel, using the command line from docker/README.md:

docker build --tag=mypetibm:mytag --file=Dockerfile-0.5.1-GPU-OpenMPI-xenial-devel .

The build is aborted with the following error message:

#10 2765.4 make: *** [all] Error 2
#10 2765.4 Makefile:127: recipe for target 'all' failed
------
executor failed running [/bin/sh -c REPO=https://github.com/NVIDIA/AMGX &&     AMGX_DIR=/opt/amgx/${AMGX_VERSION} &&     AMGX_ARCH=linux-gnu-openmpi-opt &&     git clone ${REPO} ${AMGX_DIR} &&     cd ${AMGX_DIR} &&     git checkout -b use4petibm ${AMGX_SHA} &&     BUILDDIR=${AMGX_DIR}/${AMGX_ARCH}/build &&     mkdir -p ${BUILDDIR} &&     cd ${BUILDDIR} &&     cmake ${AMGX_DIR}       -DCMAKE_BUILD_TYPE="Release"       -DCMAKE_INSTALL_PREFIX=/usr/local/amgx-${AMGX_VERSION}       -DCMAKE_C_COMPILER=mpicc       -DCMAKE_C_FLAGS_PROFILE="-O3 -DNDEBUG"       -DCMAKE_CXX_COMPILER=mpicxx       -DCMAKE_CXX_FLAGS_PROFILE="-O3 -DNDEBUG"       -DMPI_CXX_COMPILER=mpicxx       -DMPI_C_COMPILER=mpicc       -DCUDA_ARCH="35 37 60 70"       -DCUDA_HOST_COMPILER=/usr/bin/gcc-5 &&     make -j"$(nproc)" all &&     make install]: exit code: 2

I am using Docker Desktop 3.3.3 (64133) under macOS, on a computer that has neither an NVIDIA GPU nor any NVIDIA software installed. If NVIDIA GPUs and/or drivers are a requirement for building the image, it would be nice to indicate this in the README.

piyueh (Member) commented Jun 24, 2021

I can confirm this recipe is not working, though I'm not sure we are seeing the same error. The error I encountered is ../libamgxsh.so: undefined reference to 'cusparseSetMatFullPrecision'. I believe this is related to NVIDIA/AMGX#75.

The reason it worked in the past but does not now is probably that NVIDIA keeps updating the base image nvidia/cuda:10.1-devel-ubuntu16.04, which is used by the recipe Dockerfile-0.5.1-GPU-OpenMPI-xenial-devel. NVIDIA bumped the CUDA version in the base image from 10.1.168 to 10.1.243 in commit c8f0a52, so when rebuilding with the recipe, the CUDA version may differ slightly from the version at the time @mesnardo pushed this recipe.
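One way to reduce this kind of drift (just an untested sketch; the digest below is a placeholder, not the digest of the original known-good image) would be to pin the base image by content digest instead of by tag in the Dockerfile:

FROM nvidia/cuda:10.1-devel-ubuntu16.04@sha256:<digest-of-the-known-good-image>

Docker then pulls exactly those image layers, regardless of what the nvidia/cuda:10.1-devel-ubuntu16.04 tag currently points to.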

@khinsen Besides the above reason, another possibility is out-of-memory. By default, this recipe uses all cores (with hyperthreading) to build AMGX, so it requires a lot of memory. But if you didn't experience laggy responses when the failure happened, then out-of-memory is probably not the cause.
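If memory does turn out to be the problem, one workaround (an untested sketch; the job count of 4 is arbitrary) is to cap the number of parallel compile jobs in the AMGX build step of the Dockerfile, replacing

make -j"$(nproc)" all && make install

with something like

make -j4 all && make install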

khinsen (Author) commented Jun 25, 2021

@piyueh No laggy responses indeed. I have 8 GB of memory, but I don't know how much of that is allocated to Docker.

Your explanation sounds quite plausible, and it illustrates the main issue I see with using containers for reproducibility: most container images are themselves not reproducible. In this case, the base image keeps changing, but on top of that, the Dockerfile does apt-get update, so there are two reasons why the build process is variable.
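For illustration only (the package name and version string below are placeholders, not taken from the actual Dockerfile), apt packages can in principle be pinned to exact versions:

RUN apt-get update && \
    apt-get install -y --no-install-recommends libopenmpi-dev=2.0.2-2 && \
    rm -rf /var/lib/apt/lists/*

but even then the build depends on the mirrors still serving those exact versions, so pinning alone does not make it fully reproducible.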

Another potential issue is that the NVIDIA CUDA images come with the statement that "The NVIDIA Container Toolkit for Docker is required to run CUDA images." I definitely don't have that toolkit. But I am not trying to run the image, I am just trying to build it. Does building also require the NVIDIA toolkit?

mesnardo (Member) commented Jul 1, 2021

@khinsen You are completely right: commands such as apt-get update will likely change the final image from one build to another (say, a year apart).
In our case, we also depend on a base image and cannot guarantee that it will not be altered (and here it was).
The build of the Docker image is not reproducible given just a Dockerfile.

As for the issue reported here, I think @piyueh gave the right answer.
I tried to rebuild the image on my laptop (no NVIDIA device, no CUDA Toolkit installed); hardware specs: Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz (4 cores, 2 threads per core), 8 GB of memory.
My first attempt resulted in an out-of-memory failure
(probably because the Dockerfile builds the AmgX library using all cores, with hyperthreading).
I then limited the build to 4 threads and got the same error as the one reported above:

../libamgxsh.so: undefined reference to `cusparseSetMatFullPrecision'

which was a consequence of bumping the CUDA version in the base image (nvidia/cuda:10.1-devel-ubuntu16.04) from 10.1.168 to 10.1.243.
(The AmgX commit pinned in the Dockerfile is not compatible with 10.1.243.)

Of course, there are ways to update the Dockerfile to make the build successful, but that does not resolve the reproducibility issue with images.
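For example (an untested sketch; the revision below is a placeholder, not a verified fix), the checkout step shown in the error log above could be pointed at an AmgX revision that supports CUDA 10.1.243:

git checkout -b use4petibm <a newer AmgX revision compatible with CUDA 10.1.243>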

Also, you do not need an NVIDIA GPU device or the CUDA Toolkit installed on the host to build the image.

khinsen (Author) commented Jul 3, 2021

Given the importance of both CUDA and reproducibility in scientific computing, I wonder if NVIDIA could be convinced to provide reproducible Docker images, at least from time to time, much like Ubuntu's LTS releases. Technically, one option would be to use Debian's Debuerreotype as the lowest-layer image on which to build. Alternatively, they could at least archive their images in a more permanent way, e.g. on Zenodo.
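For the archiving route, something along these lines would do (a sketch; the archive name is arbitrary): export the image to a tarball with docker save, upload it to Zenodo, and restore it later with docker load:

docker save nvidia/cuda:10.1-devel-ubuntu16.04 | gzip > cuda-10.1-devel-ubuntu16.04.tar.gz
# later, on any machine:
gunzip -c cuda-10.1-devel-ubuntu16.04.tar.gz | docker load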

mesnardo (Member) commented

We deposited the Docker and Singularity images (the ones reported in the manuscript you are editing) to Zenodo:

[DOI badge]
