Referee reports #3, CMAME Sep. 2016 (decision: reject) #21
We don't claim to make a theoretical contribution. But we do appear to be the first to use the existing theory of inexact GMRES to speed up the FMM algorithm. Even though it may seem obvious in hindsight, it hadn't been done, despite the possibility of good speed-ups. The contribution is to implement the idea in an FMM code, test it thoroughly, and show the actual speed-ups it can offer. Having shown this result, the community has a good reason to adopt the method.
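For concreteness, the sketch below shows the shape of the idea: a GMRES driver whose matrix-vector product takes the expansion order p as an argument, and which lowers p as the residual drops. This is a minimal illustration, not the code used in the paper; the `2**-p` error model, the parameter defaults, and the `noisy` stand-in for the FMM product are all assumptions made for the example.

```python
import numpy as np

def choose_p(rel_res, tol, p_max=16, p_min=3):
    # Lowest order whose assumed truncation error ~ 2**-p fits the
    # inexact-Krylov budget tol / rel_res (the budget GROWS as rel_res shrinks).
    if rel_res <= tol:
        return p_min
    p = int(np.ceil(np.log2(rel_res / tol)))
    return max(p_min, min(p, p_max))

def inexact_gmres(matvec, b, tol=1e-4, max_iter=50):
    # GMRES with zero initial guess; matvec(x, p) may be computed inexactly.
    n = b.size
    Q = np.zeros((n, max_iter + 1))
    H = np.zeros((max_iter + 1, max_iter))
    beta = np.linalg.norm(b)
    Q[:, 0] = b / beta
    rel_res = 1.0
    for k in range(max_iter):
        p = choose_p(rel_res, tol)        # relax p as the iteration converges
        w = matvec(Q[:, k], p)            # the inexact (FMM-like) product
        for j in range(k + 1):            # Arnoldi, modified Gram-Schmidt
            H[j, k] = Q[:, j] @ w
            w -= H[j, k] * Q[:, j]
        H[k + 1, k] = np.linalg.norm(w)
        Q[:, k + 1] = w / H[k + 1, k]
        e1 = np.zeros(k + 2)
        e1[0] = beta
        y, *_ = np.linalg.lstsq(H[:k + 2, :k + 1], e1, rcond=None)
        rel_res = np.linalg.norm(H[:k + 2, :k + 1] @ y - e1) / beta
        if rel_res < tol:
            break
    return Q[:, :k + 1] @ y, rel_res

# Toy check: a well-conditioned dense matrix plus a synthetic error ~ 2**-p.
rng = np.random.default_rng(0)
A = np.eye(200) + 0.01 * rng.standard_normal((200, 200))
b = rng.standard_normal(200)
noisy = lambda x, p: A @ x + 2.0**-p * np.linalg.norm(x) * rng.standard_normal(x.size)
x, rel = inexact_gmres(noisy, b)
```

The point of the sketch is only the coupling between `rel_res` and `p`: early iterations use the most accurate (most expensive) product, and later ones get away with cheaper products without hurting the final tolerance.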
We cite the relevant literature for inexact GMRES. The referee says there is a significant amount of work on accelerating Krylov methods, but cites authors working on direct solvers, not iterative solvers. We could not find others that modify standard FMM to accelerate Krylov methods: there are of course many authors applying FMM in this scenario, but without modification. Grama et al. (1998) use a sparse approximate preconditioner for the linear system, while Corona's preprint presents a preconditioning method based on tensor decomposition. Preconditioning attacks the time to solution by reducing the number of iterations, while we attack the cost of each iteration by reducing p; the two approaches are orthogonal. We have added a reference to a review paper that covers FMM acceleration of Krylov methods: Nishimura, 2002. (commit 1554d6c)
First, regarding the performance of the FMM code used in this study: Figure 3 shows that it takes under 10 s to compute 1 million points. This timing falls in the middle of the range shown by Yokota (2013) for five different publicly available codes. Both his benchmark and our result are multi-core, on six CPU cores, although the CPUs are not exactly the same. This indicates that our code is representative in terms of performance. (It may not be the fastest of them all, but it is competitive.) Next, regarding the balance between near and far field: we show the time breakdown for each iteration in Figure 9. We have now added the total time spent in P2P and M2L: 63 vs. 47 seconds. This is evidence that the overall computation is balanced. We have also collected detailed evidence of the exercise to find the best Ncrit in the supplementary materials. All the runs are reproducible: detailed information is provided to repeat each case (and of course the code is open and available). There is no better evidence than open code and data: anyone who doubts a result can run the cases themselves.
We used an adaptive octree in our FMM, added in commit d90a0c8.
The text in the paper made no such claim. It says that Cartesian expansions scale as O(p^6) and that using spherical harmonics "reduces this to O(p^4)". Yes, this can be reduced further to O(p^3) with rotations, but the rotation-based operator gives a real benefit only at very high p, say p > 20. We have modified the text to make this clearer, and added a reference to a paper showing that the O(p^4) operator is faster at common values of p (below about 20); see commit a7e9937.
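To make the crossover argument concrete (the constants below are illustrative assumptions, not measurements): if one spherical-harmonics M2L translation costs about C4·p^4 operations and the rotation-based variant about C3·p^3, the rotation-based operator only wins when

```latex
\[
  C_3\,p^{3} \;<\; C_4\,p^{4}
  \quad\Longleftrightarrow\quad
  p \;>\; \frac{C_3}{C_4},
\]
```

so a constant ratio C3/C4 of roughly 20 reproduces a crossover near p = 20; at p = 10 the p^4 operator would still be about twice as cheap under these constants.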
It is the absolute residual at step k-1. We've added the missing "absolute"; see commit 1595beb.
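For reference, the relaxation criterion follows the inexact-Krylov analyses of Simoncini and Szyld (2003) and van den Eshof and Sleijpen (2004); in that style of bound (with l_m a modest problem-dependent constant, an assumption stated loosely here, and epsilon the target tolerance), the perturbation E_k of the k-th matrix-vector product is allowed to satisfy

```latex
\[
  \|E_k\| \;\le\; \frac{\ell_m}{\|r_{k-1}\|}\,\varepsilon,
  \qquad k = 1, \dots, m .
\]
```

Since the denominator is the absolute residual of the previous step, the bound loosens as the iteration converges, which is exactly what permits lowering p.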
The value of Ncrit used in the results of Figure 3 was chosen after systematically running with 11 values between 100 and 200, plus additional cases between 50 and 400, for the case with the largest N, i.e., one million points. Ncrit = 125 gives the shortest time to solution, with P2P taking 2.3 s and M2L taking 2.7 s: a well-balanced FMM calculation. The fact that the scaling displays O(N) complexity is evidence that the FMM is balanced throughout. (More details about this have been added to the supplementary materials.) We added the accuracy information, as requested. GFLOP/s are not informative in this case: consider that pure P2P would offer the most GFLOP/s! The referee's concern is whether the code used for this work is a good implementation, and for this we can use the benchmark provided by Yokota (2013) comparing five publicly available codes. For one million points, we time at 10 s, which is in the middle of the range of timings given by Yokota (all multi-core tests on six cores). This is evidence that our code performs competitively. Figure 3 has the sole purpose of showing that our implementation scales as expected, at O(N). There is no reason to repeat this with different values of p. Of course, if we ran with a higher p, the optimal Ncrit would be larger (more expensive far-field calculation).
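The tuning loop itself is simple; a sketch is below, where `run_fmm` is a hypothetical hook that performs one full FMM evaluation at a given Ncrit and returns the P2P and M2L wall-clock times in seconds (the default ranges match the sweep described above):

```python
import itertools

def tune_ncrit(run_fmm, coarse=range(50, 401, 50), fine=range(100, 201, 10)):
    """Return the Ncrit that minimizes total P2P + M2L time."""
    best_time, best_ncrit = float("inf"), None
    for ncrit in sorted(set(itertools.chain(coarse, fine))):
        t_p2p, t_m2l = run_fmm(ncrit)       # hypothetical: one timed FMM run
        total = t_p2p + t_m2l
        # A balanced tree has t_p2p close to t_m2l; print the ratio as a check.
        print(f"Ncrit={ncrit:4d}: P2P={t_p2p:.2f}s  M2L={t_m2l:.2f}s  "
              f"ratio={t_p2p / t_m2l:.2f}")
        if total < best_time:
            best_time, best_ncrit = total, ncrit
    return best_ncrit
```

At the optimum found this way (Ncrit = 125 for the million-point case), the near and far field are within about 20% of each other, which is the balance reported above.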
Figure 5 has the purpose of showing the spatial convergence of the full boundary-integral solution, using an analytical solution as reference. It is certainly not the gist of the paper: it is due diligence. The reason to show both p = 10 and p = 12 was to provide evidence that p = 12 gives only a tiny improvement in accuracy at the largest value of N, and that p = 10 was accurate enough for the first-kind equation. To avoid confusion, we have changed the presentation of the convergence, now in separate figures for the first- and second-kind equations. We used two sets of accuracy parameters: p, the number of Gauss points, and the solver tolerance. The "tight" set was used to compute the observed order of convergence; the "loose" set is for the later speed-up tests, which would be biased otherwise. The plots show that accuracy is minimally affected by the "loose" set (but this was not known ahead of time, hence the need for the "tight" set). See Table 1 for the parameter values. The referee's point about observing the actual drop in accuracy with p (not really addressed in Figure 5) has been documented in the supplementary materials. There, we show the error of the potential at the collocation points, and at an exterior point, with varying p. For example, with N = 8192, the error falls one order of magnitude as p is increased from 3 to 6; it then falls another 4x as p is increased to 9.
Figure 6 shows results with p = 8 (left) and p = 12 (right). More iterations are needed to converge with the lower starting p. In Tables 1 and 2 we do not see this effect because only one value of p is used in all cases.
Revised paper submitted to CMAME on 15 March 2016 — Decision: Reject
Editor comments
Referee # 3 comments
SUMMARY
The paper concerns the solution of linear systems related to boundary element (BE) discretizations of single- and double-layer operators for the Laplace and Stokes equations in three dimensions. The contribution of the paper is to apply the ideas in [13] to these BE linear systems and to empirically study their performance. The authors use a collocation method with adaptive Gaussian quadrature defined on a triangulation of smooth surfaces. The discretized systems are accelerated with a fast multipole method in which the far field is approximated using analytic expansions. The resulting linear system is solved with an unpreconditioned Krylov method. Numerical results are presented for exterior problems on simply- and multiply-connected smooth domains, for single- and double-layer operators, and for the Laplace and Stokes problems with up to 400K unknowns.
The key idea (section 2.6, page 8) is to use variable-fidelity matrix-vector products in the Krylov method to accelerate the overall scheme (by using inexact, and thus cheaper, matrix-vector products). The authors claim 3X speedups.
SUMMARY REVIEW
The paper is incremental and has technical shortcomings that make me question the validity of the conclusions from the numerical results. For these two reasons I don't think that this paper is at a level appropriate for CMAME.
--- DETAILED REVIEW ---
STRENGTHS
WEAKNESSES
There is a significant amount of work in accelerating Krylov methods for integral equations. Many alternative methods come with theoretical results for optimality.
The only reason one would consider the scheme proposed by the authors is that it is trivial to implement. Papers by Greengard, Rokhlin, Ho, Ying, Zorin, Martinsson, Gillman, Darve, etc. on fast direct solvers (which are typically used as preconditioners) are not even mentioned. See arXiv:1511.06029v1 (optimal preconditioners in 3D, with many references to prior work) and Grama, Kumar, Sameh (SISC, 1998; inexact FMM as a preconditioner).
I don't believe it is possible that changing the far-field approximation fidelity can have such dramatic effects. In well-tuned FMM implementations the near and far field should take roughly the same time. The authors dispute that, but I don't see strong evidence. I agree that Ncrit should be chosen to optimize wall-clock time, and as one changes p the optimal Ncrit changes as well.
Given that when solving the boundary integrals one has to include the quadrature costs, the far field ends up taking less than one third of the overall computation. So I don't see how you can get more than a 1.6X speedup per iteration. Also, shouldn't the number of Krylov iterations increase as the accuracy is varied? And where are the setup times for the matrix? I'd like to see end-to-end timings to check whether the benefits persist.
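In Amdahl's-law terms: if the far field is a fraction f of the per-iteration cost and only that part is sped up by a factor s, the per-iteration speedup is

```latex
\[
  S(s) \;=\; \frac{1}{(1 - f) + f/s}
  \;\le\; \frac{1}{1 - f},
\]
```

so with f at or below 1/3 the cap is at most 1.5X, hence the claim that more than 1.6X per iteration is implausible no matter how aggressively the far field is relaxed.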
DETAILED COMMENTS/QUESTIONS
See doi:10.1006/jcph.1999.6355 (Cheng, Greengard, Rokhlin).
to state it)