During the implementation of ViennaCL, several design decisions were necessary, which are often a trade-off among various advantages and disadvantages. In the following, we discuss several design decisions and the reasoning that has led to the same.
The ViennaCL scalar type scalar<>
essentially behaves like a CPU scalar in order to make any access to GPU ressources as simple as possible, for example
float cpu_float = 1.0f; viennacl::linalg::scalar<float> gpu_float = cpu_float; gpu_float = gpu_float * gpu_float; gpu_float -= cpu_float; cpu_float = gpu_float;
As an alternative, the user could have been required to use copy
as for the vector and matrix classes, but this would unnecessarily complicated many commonly used operations like
if (norm_2(gpu_vector) < 1e-10) { ... }
or
gpu_vector[0] = 2.0f;
where one of the operands resides on the CPU and the other on the GPU. Initialization of a separate type followed by a call to copy
iscertainly not desired for the above examples.
However, one should use scalar<>
with care, because the overhead for transfers from CPU to GPU and vice versa is very large for the simple scalar<>
type.
scalar<>
with care, operations may be much slower than built-in types on the CPU!The present way of data transfer for vectors and matrices from CPU to GPU to CPU is to use the provided copy
function, which is similar to its counterpart in the Standard Template Library (STL):
A first alternative approach would have been to to overload the assignment operator like this:
The first overload can be directly applied to the vector
-class provided by ViennaCL. However, the question of accessing data in the cpu_vector
object arises. For std::vector
and C arrays, the bracket operator can be used, but the parenthesis operator cannot. However, other vector types may not provide a bracket operator. Using STL iterators is thus the more reliable variant.
The transfer from GPU to CPU would require to overload the assignment operator for the CPU class, which cannot be done by ViennaCL. Thus, the only possibility within ViennaCL is to provide conversion operators. Since many different libraries could be used in principle, the only possibility is to provide conversion of the form
for the types in ViennaCL. However, this would allow even totally meaningless conversions, e.g. from a GPU vector to a CPU boolean and may result in obscure unexpected behavior.
Moreover, with the use of copy
functions it is much clearer, at which point in the source code large amounts of data are transferred between CPU and GPU.
We decided to provide an interface compatible to Boost.uBLAS for dense matrix operations. The only possible generalization for iterative solvers was to use the tagging facility for the specification of the desired iterative solver.
Since we use the iterator-driven copy
function for transfer from CPU to GPU to CPU, iterators have to be provided anyway. However, it has to be repeated that they are usually VERY slow, because each data access (i.e. dereferentiation) implies a new transfer between CPU and GPU. Nevertheless, CPU-cached vector and matrix classes could be introduced in future releases of ViennaCL.
A remedy for quick iteration over the entries of e.g. a vector is the following:
The three extra code lines can be wrapped into a separate iterator class by the library user, who also has to ensure data consistency during the loop.
Since OpenCL relies on passing the OpenCL source code to a built-in just-in-time compiler at run time, the necessary kernels have to be generated every time an application using ViennaCL is started.
One possibility was to require a mandatory
viennacl::init();
before using any other objects provided by ViennaCL, but this approach was discarded for the following two reasons:
viennacl::init();
is accidentally forgotten by the user, the program will most likely terminate in a rather uncontrolled way.Initialization is instead done in a lazy manner when requesting OpenCL kernels. Kernels with similar functionality are grouped together in a common compilation units. This allows a fine-grained control over which source code to compile where and when. For example, there is no reason to compile the sparse matrix compute kernels at program startup if there are no sparse matrices used at all.
Moreover, the just-in-time compilation of all available compute kernels in ViennaCL takes several seconds. Therefore, a request-based compilation is used to minimize any overhead due to just-in-time compilation.
The request-based compilation is a two-step process: At the first instantiation of an object of a particular type from ViennaCL, the full source code for all objects of the same type is compiled into an OpenCL program for that type. Each program contains plenty of compute kernels, which are not yet initialized. Only if an argument for a compute kernel is set, the kernel actually cares about its own initialization. Any subsequent calls of that kernel reuse the already compiled and initialized compute kernel.