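0. Linear system under test: size 500,000 × 500,000, with nnz = 1.7e7 nonzero entries. Source: an IPC simulation of a lattice compression scenario.
1. AMGCL: CPU backend (most of the time is spent in the solve phase)
Solver | Relaxation | Coarsening | Time cost (s) |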
---|---|---|---|
CG | gauss_seidel | ruge_stuben | 51.55 |
CG | gauss_seidel | aggregation | 31.56 |
CG | gauss_seidel | smoothed_aggregation (eps_strong=0.0, block_size = 3) | 38.48 |
CG | gauss_seidel | smoothed_aggr_emin (eps_strong=0.0, block_size = 3) | 42.70 |
CG | spai0 | aggregation | 51.30 |
LGMRES | gauss_seidel | ruge_stuben | 25.54 |
LGMRES | gauss_seidel | aggregation | 17.86 |
LGMRES | gauss_seidel | smoothed_aggregation (eps_strong=0.0, block_size = 3) | 30.30 |
LGMRES | gauss_seidel | smoothed_aggr_emin (eps_strong=0.0, block_size = 3) | 34.64 |
LGMRES | spai0 | aggregation | 38.44 |
2. AMGCL: CUDA backend (GPU: GeForce GTX 1080 Ti; most of the time is spent in the setup phase)
Solver | Relaxation | Coarsening | Time cost (s) |
---|---|---|---|
CG | spai0 | ruge_stuben | 7.35 |
CG | spai0 | aggregation | 4.30 |
CG | spai0 | smoothed_aggregation (eps_strong=0.0, block_size = 3) | 9.98 |
CG | spai0 | smoothed_aggr_emin (eps_strong=0.0, block_size = 3) | 14.78 |
BiCGStabL(L=2) | spai0 | ruge_stuben | 7.85 |
BiCGStabL(L=2) | spai0 | aggregation | 3.88 |
BiCGStabL(L=2) | spai0 | smoothed_aggregation (eps_strong=0.0, block_size = 3) | 10.10 |
BiCGStabL(L=2) | spai0 | smoothed_aggr_emin (eps_strong=0.0, block_size = 3) | 15.07 |
LGMRES | spai0 | ruge_stuben | 7.55 |
LGMRES | spai0 | aggregation | 4.22 |
LGMRES | spai0 | smoothed_aggregation (eps_strong=0.0, block_size = 3) | 10.06 |
LGMRES | spai0 | smoothed_aggr_emin (eps_strong=0.0, block_size = 3) | 14.44 |
3. Time required to solve the same system with CHOLMOD and with cuSolver
Solver | Time cost (s) |
---|---|
CHOLMOD | 9.05 |
cuSolver (GPU: GeForce GTX 1080 Ti, solver=chol, reorder=metis) | 17.11 |
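
To put the direct-solver times next to the iterative ones, the sketch below divides each direct-solver time by the fastest AMGCL entry on each backend. It uses only numbers from the tables above; the names `direct`, `best_amgcl`, and `ratios` are illustrative. Because CPU time is dominated by the solve phase and CUDA time by the setup phase, the ratios are indicative only.

```python
# Times in seconds, copied from the tables above.
direct = {"CHOLMOD": 9.05, "cuSolver (chol, metis)": 17.11}

# Fastest row of each AMGCL table (minimum of the "Time cost" column).
best_amgcl = {
    "CPU: LGMRES + gauss_seidel + aggregation": 17.86,
    "CUDA: BiCGStabL(L=2) + spai0 + aggregation": 3.88,
}


def ratios(direct_times, amgcl_times):
    """Return {(direct, amgcl): direct_time / amgcl_time}.

    A value above 1 means the AMGCL configuration was faster.
    """
    return {
        (d, a): td / ta
        for d, td in direct_times.items()
        for a, ta in amgcl_times.items()
    }


for (d, a), r in ratios(direct, best_amgcl).items():
    print(f"{d} / {a}: {r:.2f}")
```

A ratio below 1 in the output means the direct solver beat that AMGCL configuration for this system.
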
4. Summary of backend, solver, preconditioner, relaxation, and coarsening choices in AMGCL (specific to this linear system; not applicable in general)
Backend:
- CUDA is roughly 5-12x faster than the CPU (builtin) backend
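
The speedup depends on which rows are paired, and the notes do not say which pairs the 5-12x figure is based on. As a hedged sketch, the snippet below recomputes CPU/CUDA time ratios from the two tables for two assumed pairings: the same relaxation on both backends (spai0), and the CPU gauss_seidel rows against the CUDA spai0 rows. The dictionary and function names are illustrative.

```python
# Times in seconds from the CPU and CUDA tables, keyed by (solver, coarsening).
cpu_gauss_seidel = {
    ("CG", "ruge_stuben"): 51.55,
    ("CG", "aggregation"): 31.56,
    ("LGMRES", "ruge_stuben"): 25.54,
    ("LGMRES", "aggregation"): 17.86,
}
cpu_spai0 = {
    ("CG", "aggregation"): 51.30,
    ("LGMRES", "aggregation"): 38.44,
}
cuda_spai0 = {
    ("CG", "ruge_stuben"): 7.35,
    ("CG", "aggregation"): 4.30,
    ("LGMRES", "ruge_stuben"): 7.55,
    ("LGMRES", "aggregation"): 4.22,
}


def speedup(cpu_times, cuda_times):
    """Return cpu_time / cuda_time for every key present in both tables."""
    return {
        key: cpu_times[key] / cuda_times[key]
        for key in cpu_times
        if key in cuda_times
    }


print("spai0 on both backends:")
for key, s in speedup(cpu_spai0, cuda_spai0).items():
    print(f"  {key}: {s:.2f}x")

print("CPU gauss_seidel vs CUDA spai0:")
for key, s in speedup(cpu_gauss_seidel, cuda_spai0).items():
    print(f"  {key}: {s:.2f}x")
```
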
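Solver:
- On CPU, the faster solvers rank FGMRES > LGMRES > CG (fastest first); BiCGStab was not tested; IDR(s) and Richardson are slower
- On CUDA, CG, BiCGStab, LGMRES, and FGMRES perform similarly and are all fast; IDR(s) and Richardson are slower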
Preconditioners:
- AMG is fast; the others were not tested. The composite preconditioners are intended for saddle point or Stokes-like systems
Relaxation:
- On CPU, faster: Gauss-Seidel and SPAI0; slower: Damped Jacobi, Chebyshev, ILU, SPAI1
- On CUDA, faster: SPAI0; slower: Damped Jacobi, Chebyshev, ILU, SPAI1; not usable: Gauss-Seidel
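Coarsening:
- CPU and CUDA lead to the same ranking, fastest first: aggregation > ruge_stuben > smoothed aggregation > smoothed_aggr_emin (the one exception in the tables above is CG on the CPU backend, where ruge_stuben at 51.55 s is slower than both smoothed variants)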
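
The coarsening ranking can be rechecked per backend and solver by sorting the measured times. This is a minimal sketch over the table values above; `times` and `rank` are illustrative names, and the CPU rows are the gauss_seidel ones while the CUDA rows are the spai0 ones.

```python
# Times in seconds, grouped by (backend, solver), then by coarsening.
times = {
    ("CPU", "CG"): {
        "ruge_stuben": 51.55, "aggregation": 31.56,
        "smoothed_aggregation": 38.48, "smoothed_aggr_emin": 42.70,
    },
    ("CPU", "LGMRES"): {
        "ruge_stuben": 25.54, "aggregation": 17.86,
        "smoothed_aggregation": 30.30, "smoothed_aggr_emin": 34.64,
    },
    ("CUDA", "CG"): {
        "ruge_stuben": 7.35, "aggregation": 4.30,
        "smoothed_aggregation": 9.98, "smoothed_aggr_emin": 14.78,
    },
    ("CUDA", "BiCGStabL(L=2)"): {
        "ruge_stuben": 7.85, "aggregation": 3.88,
        "smoothed_aggregation": 10.10, "smoothed_aggr_emin": 15.07,
    },
    ("CUDA", "LGMRES"): {
        "ruge_stuben": 7.55, "aggregation": 4.22,
        "smoothed_aggregation": 10.06, "smoothed_aggr_emin": 14.44,
    },
}


def rank(row):
    """Return coarsening names ordered from fastest to slowest."""
    return sorted(row, key=row.get)


for group, row in times.items():
    print(group, " > ".join(rank(row)))
```
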