#### 0. Linear system under test

Size: 500,000 × 500,000, with 1.7e7 nonzero entries (nnz). Source: an IPC simulation of a lattice compression scenario.

#### 1. AMGCL: CPU backend (time is dominated by the solve phase)

| Solver     | Relaxation       | Coarsening                                            | Time cost (s) |
| ---------- | ---------------- | ----------------------------------------------------- | ------------- |
| CG         | gauss_seidel     | ruge_stuben                                           | 51.55         |
| CG         | gauss_seidel     | aggregation                                           | 31.56         |
| CG         | gauss_seidel     | smoothed_aggregation (eps_strong=0.0, block_size = 3) | 38.48         |
| CG         | gauss_seidel     | smoothed_aggr_emin (eps_strong=0.0, block_size = 3)   | 42.70         |
| CG         | spai0            | aggregation                                           | 51.30         |
| LGMRES     | gauss_seidel     | ruge_stuben                                           | 25.54         |
| **LGMRES** | **gauss_seidel** | **aggregation**                                       | **17.86**     |
| LGMRES     | gauss_seidel     | smoothed_aggregation (eps_strong=0.0, block_size = 3) | 30.30         |
| LGMRES     | gauss_seidel     | smoothed_aggr_emin (eps_strong=0.0, block_size = 3)   | 34.64         |
| LGMRES     | spai0            | aggregation                                           | 38.44         |

#### 2. AMGCL: CUDA backend (GPU: GeForce GTX 1080 Ti; time is dominated by the setup phase)

| Solver             | Relaxation | Coarsening                                            | Time cost (s) |
| ------------------ | ---------- | ----------------------------------------------------- | ------------- |
| CG                 | spai0      | ruge_stuben                                           | 7.35          |
| CG                 | spai0      | aggregation                                           | 4.30          |
| CG                 | spai0      | smoothed_aggregation (eps_strong=0.0, block_size = 3) | 9.98          |
| CG                 | spai0      | smoothed_aggr_emin (eps_strong=0.0, block_size = 3)   | 14.78         |
| BiCGStabL(L=2)     | spai0      | ruge_stuben                                           | 7.85          |
| **BiCGStabL(L=2)** | **spai0**  | **aggregation**                                       | **3.88**      |
| BiCGStabL(L=2)     | spai0      | smoothed_aggregation (eps_strong=0.0, block_size = 3) | 10.10         |
| BiCGStabL(L=2)     | spai0      | smoothed_aggr_emin (eps_strong=0.0, block_size = 3)   | 15.07         |
| LGMRES             | spai0      | ruge_stuben                                           | 7.55          |
| LGMRES             | spai0      | aggregation                                           | 4.22          |
| LGMRES             | spai0      | smoothed_aggregation (eps_strong=0.0, block_size = 3) | 10.06         |
| LGMRES             | spai0      | smoothed_aggr_emin (eps_strong=0.0, block_size = 3)   | 14.44         |

#### 3. Time to solve the same system with CHOLMOD and cuSolver

| Solver                                                          | Time cost (s) |
| --------------------------------------------------------------- | ------------- |
| CHOLMOD                                                         | 9.05          |
| cuSolver (GPU: GeForce GTX 1080 Ti, solver=chol, reorder=metis) | 17.11         |

#### 4. Summary: choosing the backend, solver, preconditioner, relaxation, and coarsening in AMGCL (specific to this linear system; may not hold in general)

##### Backend:

* CUDA is roughly **5-12x** faster than the CPU (builtin) backend.

##### Solver:

* On the CPU, the faster solvers are FGMRES > LGMRES > CG (fastest first); BiCGStab was not tested; IDR(s) and Richardson are slower.
* On CUDA, CG, BiCGStab, LGMRES, and FGMRES perform similarly and are the faster group; IDR(s) and Richardson are slower.

##### Preconditioner:

* AMG is fast; other preconditioners were not tested. The composite preconditioners target saddle point or Stokes-like systems.

##### Relaxation:

* On the CPU, faster: Gauss-Seidel and SPAI0; slower: Damped Jacobi, Chebyshev, ILU, SPAI1.
* On CUDA, faster: SPAI0; slower: Damped Jacobi, Chebyshev, ILU, SPAI1; not usable: Gauss-Seidel.

##### Coarsening:

* The CPU and CUDA results agree: aggregation > ruge_stuben > smoothed_aggregation > smoothed_aggr_emin (fastest first). The one exception in the tables above is CG on the CPU, where ruge_stuben is the slowest.
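
The comparisons in this summary can be rechecked directly from the measured timings. The following is a minimal Python sketch, not part of AMGCL: the dictionaries `cpu`, `cuda`, and `direct` and the helper `fastest` are illustrative names, and the numbers are copied by hand from the tables in sections 1 to 3.

```python
# Timings in seconds, keyed by (solver, relaxation, coarsening).
# Values are transcribed from the benchmark tables; names are illustrative.
cpu = {
    ("CG", "gauss_seidel", "ruge_stuben"): 51.55,
    ("CG", "gauss_seidel", "aggregation"): 31.56,
    ("CG", "gauss_seidel", "smoothed_aggregation"): 38.48,
    ("CG", "gauss_seidel", "smoothed_aggr_emin"): 42.70,
    ("CG", "spai0", "aggregation"): 51.30,
    ("LGMRES", "gauss_seidel", "ruge_stuben"): 25.54,
    ("LGMRES", "gauss_seidel", "aggregation"): 17.86,
    ("LGMRES", "gauss_seidel", "smoothed_aggregation"): 30.30,
    ("LGMRES", "gauss_seidel", "smoothed_aggr_emin"): 34.64,
    ("LGMRES", "spai0", "aggregation"): 38.44,
}
cuda = {
    ("CG", "spai0", "ruge_stuben"): 7.35,
    ("CG", "spai0", "aggregation"): 4.30,
    ("CG", "spai0", "smoothed_aggregation"): 9.98,
    ("CG", "spai0", "smoothed_aggr_emin"): 14.78,
    ("BiCGStabL(L=2)", "spai0", "ruge_stuben"): 7.85,
    ("BiCGStabL(L=2)", "spai0", "aggregation"): 3.88,
    ("BiCGStabL(L=2)", "spai0", "smoothed_aggregation"): 10.10,
    ("BiCGStabL(L=2)", "spai0", "smoothed_aggr_emin"): 15.07,
    ("LGMRES", "spai0", "ruge_stuben"): 7.55,
    ("LGMRES", "spai0", "aggregation"): 4.22,
    ("LGMRES", "spai0", "smoothed_aggregation"): 10.06,
    ("LGMRES", "spai0", "smoothed_aggr_emin"): 14.44,
}
direct = {"CHOLMOD": 9.05, "cuSolver": 17.11}


def fastest(timings):
    """Return the (config, seconds) pair with the smallest time."""
    return min(timings.items(), key=lambda kv: kv[1])


best_cpu = fastest(cpu)
best_cuda = fastest(cuda)

# CPU/CUDA ratio for configurations measured on both backends.
like_for_like = {cfg: cpu[cfg] / cuda[cfg] for cfg in sorted(set(cpu) & set(cuda))}

# How much faster the best CUDA run is than each direct solver.
vs_direct = {name: t / best_cuda[1] for name, t in direct.items()}

if __name__ == "__main__":
    print("best CPU :", best_cpu)
    print("best CUDA:", best_cuda)
    for cfg, ratio in like_for_like.items():
        print(f"CUDA speedup for {cfg}: {ratio:.2f}x")
    for name, ratio in vs_direct.items():
        print(f"best CUDA vs {name}: {ratio:.2f}x")
```

With these numbers, the two configurations measured on both backends (spai0 with aggregation, under CG and LGMRES) come out about 11.9x and 9.1x faster on CUDA, while the best CUDA run is about 4.6x faster than the best CPU run and about 2.3x faster than CHOLMOD.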