#### 0. System under test: 500,000 × 500,000, with 1.7e7 non-zero entries (nnz). Source: an IPC simulation of a lattice compression scenario.
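To give a feel for these numbers, the short sketch below derives the average row fill, the density, and an approximate storage size. The storage estimate assumes CSR format with 8-byte values and 4-byte indices; neither the format nor the index width is stated in these notes.

```python
# Figures quoted in the heading above.
n = 500_000        # rows (= columns) of the square system
nnz = 17_000_000   # non-zero entries, 1.7e7

avg_nnz_per_row = nnz / n    # average number of non-zeros per row
density = nnz / (n * n)      # fraction of all entries that are non-zero

# Assumption: CSR storage, double-precision values, 32-bit indices.
VALUE_BYTES, INDEX_BYTES = 8, 4
csr_bytes = nnz * (VALUE_BYTES + INDEX_BYTES) + (n + 1) * INDEX_BYTES

print(f"average nnz per row: {avg_nnz_per_row:.0f}")
print(f"density: {density:.1e}")
print(f"approximate CSR size: {csr_bytes / 1e6:.0f} MB")
```

Under these assumptions the matrix has about 34 non-zeros per row and occupies roughly 206 MB.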



#### 1. AMGCL: CPU backend (most of the time is spent in the solve phase)

| Solver     | Relaxation       | Coarsening                                            | Time cost (s) |
| ---------- | ---------------- | ----------------------------------------------------- | ------------- |
| CG         | gauss_seidel     | ruge_stuben                                           | 51.55         |
| CG         | gauss_seidel     | aggregation                                           | 31.56         |
| CG         | gauss_seidel     | smoothed_aggregation (eps_strong=0.0, block_size = 3) | 38.48         |
| CG         | gauss_seidel     | smoothed_aggr_emin (eps_strong=0.0, block_size = 3)   | 42.70         |
| CG         | spai0            | aggregation                                           | 51.30         |
| LGMRES     | gauss_seidel     | ruge_stuben                                           | 25.54         |
| **LGMRES** | **gauss_seidel** | **aggregation**                                       | **17.86**     |
| LGMRES     | gauss_seidel     | smoothed_aggregation (eps_strong=0.0, block_size = 3) | 30.30         |
| LGMRES     | gauss_seidel     | smoothed_aggr_emin (eps_strong=0.0, block_size = 3)   | 34.64         |
| LGMRES     | spai0            | aggregation                                           | 38.44         |



#### 2. AMGCL: CUDA backend (GPU: GeForce GTX 1080 Ti; most of the time is spent in the setup phase)

| Solver             | Relaxation | Coarsening                                            | Time cost (s) |
| ------------------ | ---------- | ----------------------------------------------------- | ------------- |
| CG                 | spai0      | ruge_stuben                                           | 7.35          |
| CG                 | spai0      | aggregation                                           | 4.30          |
| CG                 | spai0      | smoothed_aggregation (eps_strong=0.0, block_size = 3) | 9.98          |
| CG                 | spai0      | smoothed_aggr_emin (eps_strong=0.0, block_size = 3)   | 14.78         |
| BiCGStabL(L=2)     | spai0      | ruge_stuben                                           | 7.85          |
| **BiCGStabL(L=2)** | **spai0**  | **aggregation**                                       | **3.88**      |
| BiCGStabL(L=2)     | spai0      | smoothed_aggregation (eps_strong=0.0, block_size = 3) | 10.10         |
| BiCGStabL(L=2)     | spai0      | smoothed_aggr_emin (eps_strong=0.0, block_size = 3)   | 15.07         |
| LGMRES             | spai0      | ruge_stuben                                           | 7.55          |
| LGMRES             | spai0      | aggregation                                           | 4.22          |
| LGMRES             | spai0      | smoothed_aggregation (eps_strong=0.0, block_size = 3) | 10.06         |
| LGMRES             | spai0      | smoothed_aggr_emin (eps_strong=0.0, block_size = 3)   | 14.44         |
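As a rough illustration, the bold (fastest) rows of the CPU and CUDA tables could be written as dotted `key=value` parameters. This is only a sketch: the key names are assumed to follow AMGCL's runtime-interface naming and should be checked against the AMGCL version in use, these notes do not say how the solvers were actually configured, and `to_cli_args` is a hypothetical helper.

```python
# Assumption: dotted keys in the style of AMGCL's runtime interface
# (solver.type, precond.relax.type, ...). Verify against your AMGCL version.
best_cpu = {
    "solver.type": "lgmres",
    "precond.class": "amg",
    "precond.relax.type": "gauss_seidel",
    "precond.coarsening.type": "aggregation",
}
best_cuda = {
    "solver.type": "bicgstabl",
    "solver.L": 2,
    "precond.class": "amg",
    "precond.relax.type": "spai0",
    "precond.coarsening.type": "aggregation",
}


def to_cli_args(params):
    """Hypothetical helper: format a parameter dict as key=value strings."""
    return [f"{key}={value}" for key, value in params.items()]


print(" ".join(to_cli_args(best_cpu)))
print(" ".join(to_cli_args(best_cuda)))
```

Keeping each configuration in a plain dictionary makes it easy to loop over solver, relaxation, and coarsening combinations when repeating a sweep like the one in these tables.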



#### 3. Time to solve the same system with CHOLMOD and cuSolver

| Solver                                                          | Time cost (s) |
| --------------------------------------------------------------- | ------------- |
| CHOLMOD                                                         | 9.05          |
| cuSolver (GPU: GeForce GTX 1080 Ti, solver=chol, reorder=metis) | 17.11         |
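For context, the sketch below compares these direct-solver timings with the fastest AMGCL rows from sections 1 and 2. All numbers are copied from the tables in these notes; the variable names are illustrative only.

```python
# Timings in seconds, copied from the tables in these notes.
cholmod = 9.05
cusolver = 17.11
best_amgcl_cpu = 17.86    # LGMRES + gauss_seidel + aggregation (CPU)
best_amgcl_cuda = 3.88    # BiCGStabL(L=2) + spai0 + aggregation (CUDA)

# Ratio > 1 means the first solver took longer than the second.
cholmod_vs_amgcl_cuda = cholmod / best_amgcl_cuda
amgcl_cpu_vs_cholmod = best_amgcl_cpu / cholmod
cusolver_vs_cholmod = cusolver / cholmod

print(f"CHOLMOD / best AMGCL CUDA: {cholmod_vs_amgcl_cuda:.2f}")
print(f"best AMGCL CPU / CHOLMOD:  {amgcl_cpu_vs_cholmod:.2f}")
print(f"cuSolver / CHOLMOD:        {cusolver_vs_cholmod:.2f}")
```

On this system the best CUDA AMGCL run is a little over 2x faster than CHOLMOD, while CHOLMOD is roughly 2x faster than the best CPU AMGCL run.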



#### 4. Summary of backend, solver, preconditioner, relaxation, and coarsening choices in AMGCL (for this linear system only; not applicable to all cases)

##### Backend:

* CUDA is roughly **5-12x** faster than the CPU (builtin) backend
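The speedup can be checked directly against the tables in sections 1 and 2. The sketch below uses only the rows that were measured with the same solver, relaxation, and coarsening on both backends (the CPU runs mostly use Gauss-Seidel, which is not available on CUDA), plus the best-versus-best comparison.

```python
# (solver, relaxation, coarsening) -> seconds, copied from the two tables.
cpu = {
    ("CG", "spai0", "aggregation"): 51.30,
    ("LGMRES", "spai0", "aggregation"): 38.44,
}
cuda = {
    ("CG", "spai0", "aggregation"): 4.30,
    ("LGMRES", "spai0", "aggregation"): 4.22,
}

# Speedup for configurations measured identically on both backends.
matched = {cfg: cpu[cfg] / cuda[cfg] for cfg in cpu.keys() & cuda.keys()}

# Fastest CPU row (LGMRES + gauss_seidel + aggregation) versus
# fastest CUDA row (BiCGStabL(L=2) + spai0 + aggregation).
best_vs_best = 17.86 / 3.88

for cfg, speedup in sorted(matched.items()):
    print(f"{cfg}: {speedup:.1f}x")
print(f"best CPU vs best CUDA: {best_vs_best:.1f}x")
```

The matched configurations give roughly 9x and 12x, and the best-versus-best comparison gives about 4.6x, which is in line with the range quoted above.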

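##### Solver:

* On the CPU, the faster solvers rank FGMRES > LGMRES > CG; BiCGStab was not tested; IDR(s) and Richardson are slower
* On CUDA, CG, BiCGStab, LGMRES, and FGMRES perform similarly and are fast; IDR(s) and Richardson are slower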

##### Preconditioners:

* AMG is fast; the others were not tested. The composite preconditioners target saddle-point or Stokes-like systems

##### Relaxation:

* On the CPU, Gauss-Seidel and SPAI0 are faster; Damped Jacobi, Chebyshev, ILU, and SPAI1 are slower
* On CUDA, SPAI0 is faster; Damped Jacobi, Chebyshev, ILU, and SPAI1 are slower; Gauss-Seidel cannot be used

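##### Coarsening:

* The CPU and CUDA results largely agree, from fastest to slowest: aggregation > ruge_stuben > smoothed_aggregation > smoothed_aggr_emin. The one exception in the tables is CG on the CPU, where ruge_stuben is the slowest (51.55 s)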
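The ordering can be re-derived from the tables in sections 1 and 2. The sketch below groups the timings by backend and solver (Gauss-Seidel relaxation for the CPU rows, SPAI0 for the CUDA rows) and sorts the coarsening methods from fastest to slowest; the dictionary layout is only an illustration.

```python
# (backend, solver) -> {coarsening: seconds}, copied from the two tables.
# CPU rows use gauss_seidel relaxation; CUDA rows use spai0.
times = {
    ("CPU", "CG"): {
        "ruge_stuben": 51.55, "aggregation": 31.56,
        "smoothed_aggregation": 38.48, "smoothed_aggr_emin": 42.70,
    },
    ("CPU", "LGMRES"): {
        "ruge_stuben": 25.54, "aggregation": 17.86,
        "smoothed_aggregation": 30.30, "smoothed_aggr_emin": 34.64,
    },
    ("CUDA", "CG"): {
        "ruge_stuben": 7.35, "aggregation": 4.30,
        "smoothed_aggregation": 9.98, "smoothed_aggr_emin": 14.78,
    },
    ("CUDA", "BiCGStabL(L=2)"): {
        "ruge_stuben": 7.85, "aggregation": 3.88,
        "smoothed_aggregation": 10.10, "smoothed_aggr_emin": 15.07,
    },
    ("CUDA", "LGMRES"): {
        "ruge_stuben": 7.55, "aggregation": 4.22,
        "smoothed_aggregation": 10.06, "smoothed_aggr_emin": 14.44,
    },
}

# Sort the coarsening methods of each group by time, fastest first.
rankings = {group: sorted(row, key=row.get) for group, row in times.items()}

for (backend, solver), order in rankings.items():
    print(f"{backend:4s} {solver:15s} {' > '.join(order)}")
```

Any group whose order differs from the summary shows up directly in the printed output.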