The implementations of the algorithms described in this chapter are still not yet mature enough to be considered core-functionality, and/or are available with the OpenCL backend only.

Warning: The implementations of the algorithms described in this chapter are not considered mature yet.

Additional Preconditioners

In addition to the Preconditioners available in the ViennaCL core, two more preconditioners are available with the OpenCL backend and are described in the following.

Sparse Approximate Inverses

Note: Sparse Approximate Inverse preconditioners are only available with the OpenCL backend and are experimental in ViennaCL. Interface changes as well as considerable performance improvements may be included in future releases!; Sparse Approximate Inverse preconditioners depend on Boost.uBLAS.

An alternative construction of a preconditioner for a sparse system matrix $ A $ is to compute a matrix $ M $ with a prescribed sparsity pattern such that

$\Vert AM - I \Vert_F \rightarrow \min \ ,$

where $\Vert \cdot \Vert_F$ denotes the Frobenius norm. This is the basic idea of sparse approximate inverse (SPAI) preconditioner. It becomes increasingly attractive because of their inherent high degree of parallelism, since the minimization problem can be solved independently for each column of $ M $ . ViennaCL provides two preconditioners of this family: The first is the classical SPAI algorithm as described by Grote and Huckle [16] , the second is the factored SPAI (FSPAI) for symmetric matrices as proposed by Huckle [17] .

SPAI can be employed for a CPU matrix A of type MatrixType as follows:

// setup SPAI preconditioner, purely CPU-based
viennacl::linalg::spai_precond<MatrixType> spai_cpu(A, viennacl::linalg::spai_tag(1e-3, 3, 5e-2));
//solve (e.g. using stab. Bi-conjugate gradient solver)
vcl_result = viennacl::linalg::solve(A, rhs,
                                     viennacl::linalg::bicgstab_tag(),
                                     spai_cpu);

The first parameter denotes the residual norm threshold for the full matrix, the second parameter the maximum number of pattern updates, and the third parameter is the threshold for the residual of each minimization problem.

For GPU-matrices, only parts of the setup phase are computed on the CPU, because compute-intensive tasks can be carried out on the GPU:

// setup SPAI preconditioner, GPU-assisted
viennacl::linalg::spai_precond<GPUMatrixType> spai_gpu(vcl_matrix, viennacl::linalg::spai_tag(1e-3, 3, 5e-2));
//solve (e.g. using conjugate gradient solver)
vcl_result = viennacl::linalg::solve(vcl_matrix,
                                     vcl_rhs,
                                     viennacl::linalg::bicgstab_tag(),
                                     spai_gpu);

The GPUMatrixType is typically a viennacl::compressed_matrix type.

For symmetric matrices, FSPAI can be used with the conjugate gradient solver:

viennacl::linalg::fspai_precond<MatrixType> fspai_cpu(A, viennacl::linalg::fspai_tag());
//solve (e.g. using stab. Bi-conjugate gradient solver)
vcl_result = viennacl::linalg::solve(A,
                                     rhs,
                                     viennacl::linalg::cg_tag(),
                                     fspai_cpu);

Our experience is that FSPAI is typically more efficient than SPAI when applied to the same matrix, both in computational effort and in terms of convergence acceleration of the iterative solvers.

Note: At present, there is no GPU-accelerated FSPAI included in ViennaCL.

Note that FSPAI depends on the ordering of the unknowns, thus bandwidth reduction algorithms may be employed first, cf. Bandwidth Reduction.

Additional Eigenvalue Routines

Several routines for computing the eigenvalues of symmetric tridiagonal as well as dense matrices are provided with ViennaCL. The following routines are available for one or two compute backends, hence they are not considered to be part of the core. Also, use the implementations with care: Even though the routines pass the respective tests, they are not as extensively tested as core functionality.

Symmetric Tridiagonal Matrices: Bisection

The bisection method for finding the eigenvalues of a symmetric tridiagonal matrix has been implemented within the CUDA SDK as an example. ViennaCL provides an extension of the implementation also suitable for eigenvalues with multiplicity higher than one. In addition, our implementation is not limited to CUDA and also available for OpenCL.

The interface in ViennaCL takes STL vectors as inputs, holding the diagonal and the sub/superdiagonal. The third argument is the vector to be overwritten with the computed eigenvalues. A sample code snippet is as follows:

std::vector<NumericT> diagonal(mat_size);
std::vector<NumericT> superdiagonal(mat_size);
std::vector<NumericT> eigenvalues_bisect(mat_size);
bool bResult = viennacl::linalg::bisect(diagonal, superdiagonal, eigenvalues_bisect);

The return value bResult is false if an error occurred, and true otherwise. Note that the sub/superdiagonal array superdiagonal is required to have an value of zero at index zero (indicating an element outside the matrix).

Note: A fully working example is available in examples/tutorial/bisect.cpp.

Symmetric Tridiagonal Matrices: TQL2

The bisection method allows for a fast computation of eigenvalues, but it does not compute eigenvectors directly. If eigenvectors are needed, the tql2-version of the QL algorithm as described in the Algol procedures can be used.

The interface to the routine tql2() is as follows: The first argument is a dense ViennaCL matrix. The second and third argument are STL vectors representing the diagonal and the sub/superdiagonal, respectively. Thus, a minimal code example for a 20-by-20 tridiagonal matrix is as follows:

std::size_t N = 20;
std::vector<NumericT> d(N), e(N);
viennacl::matrix<NumericT> Q = viennacl::identity_matrix<NumericT>(N);
// fill d and e with data here
viennacl::linalg::tql2(Q, d, e);

Note that the sub/superdiagonal array e is required to have an value of zero at index zero (indicating an element outside the matrix).

Note: A fully working example is available in examples/tutorial/tql2.cpp.

QR Method for Symmetric Dense Matrices

Symmetric dense real-valued matrices have real-valued eigenvalues, hence the current lack of complex arithmetic in ViennaCL is not limiting. An implementation of the QR method for the symmetric case for computing eigenvalues and eigenvectors is provided.

The current implementation of the QR method for symmetric dense matrices in ViennaCL takes three parameters:

The input matrix to find eigenvalues and eigenvectors from
A matrix in which the calculated eigenvectors will be stored
An STL vector in which the calculated eigenvalues will be stored

A minimal example is as follows:

std::size_t N = 20;
viennacl::matrix<ScalarType> A_input(N,N);
viennacl::matrix<ScalarType> Q(N, N);
std::vector<ScalarType> eigenvalues(N);
viennacl::linalg::qr_method_sym(A_input, Q, eigenvalues);

Note: A fully working example is available in examples/tutorial/qr_method.cpp.

Fast Fourier Transform

Since there is no standardized complex type in OpenCL at the time of the release of ViennaCL, vectors need to be set up with real- and imaginary part before computing a fast Fourier transform (FFT). In order to store complex numbers $ z_0 $ , $ z_1 $ , etc. in a viennacl::vector, say v, the real and imaginary parts are mapped to even and odd entries of v respectively: v[0] = Real(z_0), v[1] = Imag(z_0), v[2] = Real(z_1), v[3] = Imag(z_1), etc.

The FFT of v can then be computed either by writing to a second vector output or by directly writing the result to v

viennacl::fft(v, output);

viennacl::inplace_fft(v);

Conversely, the inverse FFT is computed as

viennacl::ifft(v, output);

viennacl::inplace_ifft(v);

The second option for computing the FFT is with Bluestein algorithm. Currently, the implementation supports only input sizes less than $2^{16} = 65536$ . The Bluestein algorithm uses at least three-times more additional memory than another algorithms, but should be fast for any size of data. As with any efficient FFT algorithm, the sequential implementation has a complexity of $\mathcal{O}(n * \lg n)$ . To compute the FFT with the Bluestein algorithm from a complex vector v and store the result in a vector output, one uses the code

viennacl::linalg::bluestein(v, output,batch_size);

Warning: In ViennaCL the FFT with complexity $N \log N$ is only computed for vectors with a size of a power of two. For other vector sizes, a standard discrete Fourier transform with complexity is employed. This is subject to change in future versions.

Some of the FFT functions are also suitable for matrices and can be computed in 2D. The computation of an FFT for objects of type viennacl::matrix, say mat, require that even entries are real parts and odd entries are imaginary parts of complex numbers. In order to store complex numbers $ z_0 $ , $ z_1 $ , etc.~in mat: mat(0,0) = Real(z_0), mat(0,1) = Imag(z_0),mat(0,2) = Real(z_0), mat(0,3) = Imag(z_0) etc.

Than user can compute FFT either by writing to a second matrix output or by directly writing the result to mat

viennacl::fft(mat, output);

viennacl::inplace_fft(v);

Note: In ViennaCL the FFT with complexity $N \log N$ is computed for matrices with a number of rows and columns a power of two only. For other matrix sizes, a standard discrete Fourier transform with complexity is employed. This is subject to change in future versions.

There are two additional functions to calculate the convolution of two vectors. It expresses the amount of overlap of one function represented by vector v as it is shifted over another function represented by vector u. Formerly a convolution is defined by the integral

$(v*u)(t) = \int_{-\infty}^\infty f(x)g(t-x) dx$

Define $\mathcal{F}$ as the Fast Fourier Transform operator, there holds for a convolution of two infinite sequences v and u

$\mathcal{F}\{v*u\} = \mathcal{F}\{u\} \mathcal{F}\{v\} \rightarrow v*u = \mathcal{F}^{-1}\{ \mathcal{F}\{v\} \mathcal{F}\{u\} \}$

To compute the convolution of two complex vectors v and u, where the result will be stored in output, one can use

viennacl::linalg::convolve(v, u, output);

which does not modify the input. If a modification of the input vectors is acceptable,

viennacl::linalg::convolve_i(v, u, output);

can be used, reducing the overall memory requirements.

Multiplication of two complex vectors u, v where the result will be stored in output, is provided by

viennacl::linalg::multiply_complex(u, v, output);

For creating a complex vector v from a real vector u with suitable sizes (size = v.size() / 2), use

viennacl::linalg::real_to_complex(u, v, size);

Conversely, computing a real vector v from a complex vector u with length size = v.size() / 2 is achieved through

viennacl::linalg::complex_to_real(u, v, size);

To reverse a vector v inplace, use

viennacl::linalg::reverse(v);

Singular Value Decomposition

The singular decomposition of a real-valued matrix $A \in \mathbb{R}^{M \times N}$ is a decomposition of the form

$A = U \Sigma V^T$

where $U \in \mathbb{R}^{M\times M}$ is a unitary matrix holding the left-singular vectors, $\Sigma \in \mathbb{R}^{M \times N}$ holds the singular values of $ A $ on the diagonal, and $V \in \mathbb{R}^{N \times N}$ holds the right-singular vectors of $ A $ . The singular vectors are stored column-wise in both $ U $ and $ V $ .

ViennaCL provides an implementation of the SVD method based on bidiagonalization and QR methods with shift. The API of the svd() routine takes three parameters:

The input matrix , which gets overwritten with $\Sigma$ ,
The matrix for the left-singular vectors,
The matrix for the right-singular vectors.

A minimal example is as follows:

std::size_t M = 20;
std::size_t N = 30;
viennacl::matrix<NumericT> A(M, N), U(M, M), V(N, N);
// fill A with values here
viennacl::linalg::svd(A, U, V);

Note: Have a look at tests/src/svd.cpp for an example.; There are known performance bottlenecks in the current implementation. Any contributions welcome!

Bandwidth Reduction

Note: Bandwidth reduction algorithms are experimental in ViennaCL. Interface changes as well as considerable performance improvements may be included in future releases!

The bandwidth of a sparse matrix is defined as the maximum difference of the indices of nonzero entries in a row, taken over all rows. A low bandwidth may allows for the use of efficient banded matrix solvers instead of iterative solvers. Moreover, better cache utilization as well as lower fill-in in LU-factorization based algorithms can be expected.

For a given sparse matrix with large bandwidth, ViennaCL provides routines for renumbering the unknowns such that the reordered system matrix shows much smaller bandwidth. Typical applications stem from the discretization of partial differential equations by means of the finite element or the finite difference method. The algorithms employed are as follows:

Classical Cuthill-McKee algorithm [9]
Iterated Cuthill-McKee algorithm with different seeds [9]
Gibbs-Poole-Stockmeyer algorithm [20]

The iterated Cuthill-McKee algorithm applies the classical Cuthill-McKee algorithm to different starting nodes with small, but not necessarily minimal degree as root node into account. While this iterated application is more expensive in times of execution times, it may lead to better results than the classical Cuthill-McKee algorithm. A parameter $a \in [0,1]$ controls the number of nodes considered: All nodes with degree $ d $ fulfilling

$d_{\min} \leq d \leq d_{\min} + a(d_{\max} - d_{\min})$

are considered, where $d_{\min}$ and $d_{\max}$ are the miminum and maximum nodal degrees in the graph. A second parameter gmax specifies the number of additional root nodes considered.

The algorithms are called for a matrix A of a type compatible with std::vector< std::map<int, double> > by

r = viennacl::reorder(A, viennacl::cuthill_mckee_tag());
r = viennacl::reorder(A, viennacl::advanced_cuthill_mckee_tag(a, gmax));
r = viennacl::reorder(A, viennacl::gibbs_poole_stockmeyer_tag());

and return the permutation array. In ViennaCL, the user then needs to manually reorder the sparse matrix based on the permutation array. Example code can be found in examples/tutorial/bandwidth-reduction.cpp.

Nonnegative Matrix Factorization

In various fields such as text mining, a matrix V needs to be factored into factors W and H with the property that all three matrices have no negative elements, such that the function

$f(W, H) = \Vert V - WH \Vert_{\mathrm{F}}^2$

is minimized. This can be achieved using the algorithm proposed by Lee and Seoung [19] with the following code:

#include "viennacl/matrix.hpp"
#include "viennacl/linalg/nmf_operations.hpp"
viennacl::matrix<ScalarType> V(size1, size2);
viennacl::matrix<ScalarType> W(size1, k);
viennacl::matrix<ScalarType> H(k, size2);
viennacl::linalg::nmf_config conf;
viennacl::linalg::nmf(V, W, H, conf);

The fourth and last parameter to the function nmf() is an object of type viennacl::linalg::nmf_config containing all necessary parameters of NMF routine:

eps_: Relative tolerance for convergence
stagnation_eps_: Relative tolerance for the stagnation check
max_iters_: Maximum number of iterations for the NMF algorithm
iters_: The number of iterations of the last NMF run using this configuration object
print_relative_error_: Flag specifying whether the relative tolerance should be printed in each iteration
check_after_steps_: Number of steps after which the convergence of NMF should be checked (again)

Multiple tests can be found in file viennacl/test/src/nmf.cpp and tutorial in file viennacl/examples/tutorial/nmf.cpp