Chapter 9: Basic Matrix/Vector Operations

Programming Notes for BLAS Using NVIDIA

This reference material is intended for users who want to use the computational resources of their NVIDIA GPU board for BLAS. Users who do not have the NVIDIA GPU board can ignore this section.

Rationale, General Algorithm and an Example

NVIDIA Corp. has implemented certain Level 1, 2 and 3 BLAS in its library, the CUDA CUBLAS Library, V3.1, July 2010.  The NVIDIA external names and argument protocols differ from the equivalent Fortran names and argument conventions.  See Table 9.2 for names marked in the color GREEN.  IMSL has written these marked Fortran BLAS so that they call the equivalent NVIDIA C language codes from the CUBLAS library.  No direct use or knowledge of C is required by a Fortran programmer in order to take advantage of these codes.  It is necessary that a user code or application package be compiled with a Fortran 2003 compiler that has implemented the C Interoperability Standard feature.  See The Fortran 2003 Handbook, Adams, et al., p. 561.  IMSL's use of this feature is the key to providing a portable version of these Fortran-callable IMSL/NVIDIA BLAS.  The program or application is then compiled and linked using the IMSL and NVIDIA libraries that contain these BLAS.
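To illustrate the C interoperability mechanism, the following is a minimal sketch (not the actual IMSL source) of a Fortran 2003 interface block that binds a Fortran name to a CUBLAS C entry point.  The C name cublasSaxpy and its argument list follow the CUBLAS V3.1 documentation; the module and Fortran names here are illustrative:

MODULE SAXPY_BINDING_EXAMPLE   ! illustrative name, not the IMSL module
  USE, INTRINSIC :: ISO_C_BINDING
  INTERFACE
     ! C prototype: void cublasSaxpy(int n, float alpha, const float *x,
     !                               int incx, float *y, int incy)
     SUBROUTINE CUBLAS_SAXPY_C(N, ALPHA, X, INCX, Y, INCY) &
          BIND(C, NAME='cublasSaxpy')
       IMPORT :: C_INT, C_FLOAT, C_PTR
       INTEGER(C_INT), VALUE :: N, INCX, INCY
       REAL(C_FLOAT), VALUE :: ALPHA
       TYPE(C_PTR), VALUE :: X, Y   ! device pointers on the GPU
     END SUBROUTINE
  END INTERFACE
END MODULE SAXPY_BINDING_EXAMPLE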

Note: An NVIDIA Graphics Processing Unit (GPU) is required to take advantage of the BLAS.

The strategy for using the attached NVIDIA GPU is given by the following algorithm (a minimal sketch of the switchover decision appears after this list):

      If the maximum of the vector or matrix dimensions is larger than a switchover array size, NSTART, and NVIDIA provides a CUBLAS code, then

            Copy the required vector and matrix data from the CPU to the GPU,

            Compute the result on the GPU,

            Copy the result from the GPU to the CPU.

      Else, use the equivalent IMSL version of the BLAS routine that does not use the GPU.
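The decision step can be pictured with the short, self-contained sketch below; the dimensions and the PRINT placeholders are illustrative only, since the actual copy and compute steps are hidden inside the IMSL/NVIDIA BLAS:

PROGRAM SWITCHOVER_SKETCH
  IMPLICIT NONE
  INTEGER, PARAMETER :: NSTART = 32   ! default switchover array size
  INTEGER :: M, N
  M = 1000; N = 800                   ! illustrative problem dimensions
  IF (MAX(M, N) > NSTART) THEN
     ! GPU path: copy operands to the GPU, compute with the CUBLAS
     ! code, then copy the result back to the CPU.
     PRINT *, 'GPU path: copy in, compute with CUBLAS, copy out'
  ELSE
     ! Below the switchover size the CPU Fortran version is used.
     PRINT *, 'CPU path: use the equivalent IMSL Fortran BLAS'
  END IF
END PROGRAM SWITCHOVER_SKETCH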

Normally a code that calls an IMSL/NVIDIA BLAS routine does not need to be aware of the copy steps or the switchover size, NSTART; these are hidden from the user code.  In the first algorithm step, a working block is allocated on the GPU for each array argument.  A table within the IMSL module CUDABLAS_LIBRARY records the sizes and GPU addresses of these blocks.  If the blocks are too small for the current problem size and data type, they are reallocated to an adequate size.  The same working block on the GPU may be used for many calls to the IMSL/NVIDIA BLAS.  The IMSL versions of the BLAS also allow a user to define an individual value of NSTART for each routine.  This is important because using the GPU may be slower than using the CPU Fortran version until the switchover array size is reached; thereafter the GPU version is typically faster, and increasingly so as the problem size grows.  The default value NSTART=32 is used for each vector/matrix argument of each routine, but it may not be optimal.  This default allows the routines to function correctly without initial attention to this value.

The user can change the default switchover value for all IMSL/NVIDIA BLAS vector/matrix arguments by setting NSTART to the desired value prior to calling a BLAS routine. Additionally, users can reset this value for each individual vector/matrix argument of the routines listed in Table 9.2 and marked with the color GREEN by using the IMSL routine CUBLAS_SET(…).  Note that CUBLAS_SET cannot be used prior to an initial call to a BLAS code.  The current switchover values can be obtained using the IMSL routine CUBLAS_GET().

The floating-point results obtained using the CPU vs. the GPU will likely differ in the low-order bits of each component.  These differences come from non-equivalent floating-point arithmetic strategies and rounding modes implemented on the NVIDIA board.  This can be an important detail when comparing results for purposes of benchmarking or code regression; a tolerance-based comparison such as the sketch below is more appropriate than a bitwise one.  Generally, either result should be acceptable for numerical work.
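The following self-contained sketch illustrates such a comparison; the array values and the error bound of 16*EPSILON are illustrative choices, not IMSL requirements:

PROGRAM COMPARE_SKETCH
  IMPLICIT NONE
  REAL :: Y_CPU(3), Y_GPU(3), RELERR
  Y_CPU = (/ 1.0E0, 2.0E0, 3.0E0 /)
  Y_GPU = (/ 1.0E0, 2.0000002E0, 2.9999998E0 /)   ! low-order-bit differences
  ! Compare with a small multiple of machine epsilon, not bitwise.
  RELERR = MAXVAL(ABS(Y_GPU - Y_CPU)) / MAXVAL(ABS(Y_CPU))
  IF (RELERR <= 16.0E0*EPSILON(1.0E0)) THEN
     PRINT *, 'Results agree to within expected rounding differences'
  ELSE
     PRINT *, 'Difference exceeds the tolerance; investigate'
  END IF
END PROGRAM COMPARE_SKETCH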

As an added feature, the user can flag that the data values for a vector or matrix are already present on the GPU and hence suppress the initial copy step in the IMSL/NVIDIA BLAS code.  This is often important since data movement from the CPU to the GPU may be a significant part of the computation time.  If there is no indication that the data is present, it is copied from the CPU to the GPU each time a routine is called.  Whether copying is necessary for each use of a BLAS code depends on the application.  Valid results are always copied back from the GPU to the CPU memory.  A negative switchover value for a positional array argument indicates that the argument requires no initial copy step; the absolute value is used as the switchover value.  (See the second ‘SGEMV’ example below.)  Caution:  Be sure to reset this to a positive value when the argument again requires an initial copy step.

In Tables 9.3-9.5 we list an enumeration that includes the routines in Table 9.2 marked with the color GREEN.  Each enumerated name is the BLAS name prefixed with the string ‘CUDABLAS_’.  Some enumerated names currently do not use the NVIDIA hardware; they are included in anticipation of future additions that will use the CUBLAS library.

There are four utility routines provided in the IMSL module CUDABLAS_LIBRARY that can be used to help manage the use of NVIDIA BLAS. These utilities appear in Table 9.7 and are described in more detail in the routines section of these notes.

For example, to set to 500 the value at which the GPU is first used for the first positional array argument, ‘A(*,*)’, of the Level-2 routine ‘SGEMV’, i.e. Array_Arg = 1, execute the code:

USE CUDABLAS_LIBRARY
INTEGER ISWITCH, Array_Arg
…
ISWITCH = 500
Array_Arg = 1
! Switch to using the GPU when the largest dimension of A(*,*) > 500.
CALL CUBLAS_SET(CUDABLAS_SGEMV, Array_Arg, ISWITCH)
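For context, a complete call sequence might look as follows; this is a sketch that assumes the standard BLAS argument list for ‘SGEMV’ (y := alpha*A*x + beta*y), with illustrative dimensions:

USE CUDABLAS_LIBRARY
INTEGER, PARAMETER :: M = 600, N = 600
REAL :: A(M,N), X(N), Y(M)
CALL RANDOM_NUMBER(A)
CALL RANDOM_NUMBER(X)
Y = 0.E0
! The GPU is used once the largest dimension of A(*,*) exceeds 500.
CALL CUBLAS_SET(CUDABLAS_SGEMV, 1, 500)
CALL SGEMV('N', M, N, 1.E0, A, M, X, 1, 0.E0, Y, 1)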

When the positional array argument ‘A(*,*)’ does not have to be copied for each subsequent use of ‘SGEMV’:

USE CUDABLAS_LIBRARY
INTEGER ISWITCH, Array_Arg
Array_Arg = 1
ISWITCH = CUBLAS_GET(CUDABLAS_SGEMV, Array_Arg)
! Avoid copying data from CPU to GPU for subsequent calls to ‘SGEMV’.
CALL CUBLAS_SET(CUDABLAS_SGEMV, Array_Arg, -abs(ISWITCH))
! Make several calls to ‘SGEMV’ with A(*,*) maintained unchanged on the GPU.
! Reset the flag for copying A(*,*) when this matrix-vector product sequence is completed.
CALL CUBLAS_SET(CUDABLAS_SGEMV, Array_Arg, abs(ISWITCH))

Some NVIDIA hardware does not provide working double precision versions of the BLAS because no double precision arithmetic is available on the device.  However, the double precision code itself is part of the CUDA CUBLAS library, and it will appear to execute even though it cannot give correct results when the device has no double precision arithmetic.  When the IMSL software detects that correct results are not returned, a warning error message is printed.  The user may instruct the application to use the Fortran code from then on by setting the switchover value to zero.  For example, if it is known that the hardware does not support DOUBLE PRECISION, then a code that calls ‘DGEMM’ will use an alternate version of this routine.  Ignoring the error message and continuing will therefore compute the result with the alternate version.  That code would include:

 

USE CUDABLAS_LIBRARY
! Flag the first array argument, A(*,*), to avoid use of the GPU for DGEMM:
CALL CUBLAS_SET(CUDABLAS_DGEMM, 1, 0)

 

If it is necessary to know whether the GPU or the CPU version of ‘SGEMM’ was used following a call to that code, the inquiry code would include:

USE CUDABLAS_LIBRARY
! Get the current status of the last call to SGEMM with the INTEGER function
! CUBLAS_GET.  The value ISWITCH=0 if the alternate code was used, and
! ISWITCH=1 if the GPU was used.
ISWITCH = CUBLAS_GET(CUDABLAS_SGEMM, 4)

Enumeration of IMSL/NVIDIA BLAS

Table 9.3. Enumeration of Level-1 BLAS

CUDABLAS_SROTG     CUDABLAS_DROTG     CUDABLAS_CROTG     CUDABLAS_ZROTG
CUDABLAS_SROTMG    CUDABLAS_DROTMG
CUDABLAS_SROT      CUDABLAS_DROT      CUDABLAS_CROT      CUDABLAS_ZROT
CUDABLAS_SROTM     CUDABLAS_DROTM     CUDABLAS_CSROT     CUDABLAS_ZSROT
CUDABLAS_SSWAP     CUDABLAS_DSWAP     CUDABLAS_CSWAP     CUDABLAS_ZSWAP
CUDABLAS_SCOPY     CUDABLAS_DCOPY     CUDABLAS_CCOPY     CUDABLAS_ZCOPY
CUDABLAS_SAXPY     CUDABLAS_DAXPY     CUDABLAS_CAXPY     CUDABLAS_ZAXPY
CUDABLAS_SDOT      CUDABLAS_DDOT      CUDABLAS_CDOTC     CUDABLAS_ZDOTC
CUDABLAS_SDSDOT    CUDABLAS_DSDOT     CUDABLAS_CDOTU     CUDABLAS_ZDOTU
CUDABLAS_SSCAL     CUDABLAS_DSCAL     CUDABLAS_CSCAL     CUDABLAS_ZSCAL
                                      CUDABLAS_CSSCAL    CUDABLAS_ZSSCAL
CUDABLAS_SNRM2     CUDABLAS_DNRM2     CUDABLAS_SCNRM2    CUDABLAS_DZNRM2
CUDABLAS_SASUM     CUDABLAS_DASUM     CUDABLAS_SCASUM    CUDABLAS_DZASUM
CUDABLAS_ISAMIN    CUDABLAS_IDAMIN    CUDABLAS_ICAMIN    CUDABLAS_IZAMIN
CUDABLAS_ISAMAX    CUDABLAS_IDAMAX    CUDABLAS_ICAMAX    CUDABLAS_IZAMAX

Table 9.4. Enumeration of Level-2 BLAS

CUDABLAS_SGEMV     CUDABLAS_DGEMV     CUDABLAS_CGEMV     CUDABLAS_ZGEMV
CUDABLAS_SGBMV     CUDABLAS_DGBMV     CUDABLAS_CGBMV     CUDABLAS_ZGBMV
CUDABLAS_SSYMV     CUDABLAS_DSYMV     CUDABLAS_CHEMV     CUDABLAS_ZHEMV
CUDABLAS_SSBMV     CUDABLAS_DSBMV     CUDABLAS_CHBMV     CUDABLAS_ZHBMV
CUDABLAS_SSPMV     CUDABLAS_DSPMV     CUDABLAS_CHPMV     CUDABLAS_ZHPMV
CUDABLAS_STRMV     CUDABLAS_DTRMV     CUDABLAS_CTRMV     CUDABLAS_ZTRMV
CUDABLAS_STBMV     CUDABLAS_DTBMV     CUDABLAS_CTBMV     CUDABLAS_ZTBMV
CUDABLAS_STPMV     CUDABLAS_DTPMV     CUDABLAS_CTPMV     CUDABLAS_ZTPMV
CUDABLAS_STRSV     CUDABLAS_DTRSV     CUDABLAS_CTRSV     CUDABLAS_ZTRSV
CUDABLAS_STBSV     CUDABLAS_DTBSV     CUDABLAS_CTBSV     CUDABLAS_ZTBSV
CUDABLAS_STPSV     CUDABLAS_DTPSV     CUDABLAS_CTPSV     CUDABLAS_ZTPSV
CUDABLAS_SGER      CUDABLAS_DGER      CUDABLAS_CGERU     CUDABLAS_ZGERU
                                      CUDABLAS_CGERC     CUDABLAS_ZGERC
CUDABLAS_SSYR      CUDABLAS_DSYR      CUDABLAS_CHER      CUDABLAS_ZHER
CUDABLAS_SSYR2     CUDABLAS_DSYR2     CUDABLAS_CHER2     CUDABLAS_ZHER2
CUDABLAS_SSPR      CUDABLAS_DSPR      CUDABLAS_CHPR      CUDABLAS_ZHPR
CUDABLAS_SSPR2     CUDABLAS_DSPR2     CUDABLAS_CHPR2     CUDABLAS_ZHPR2

Table 9.5. Enumeration of Level-3 BLAS

CUDABLAS_SGEMM     CUDABLAS_DGEMM     CUDABLAS_CGEMM     CUDABLAS_ZGEMM
CUDABLAS_SSYMM     CUDABLAS_DSYMM     CUDABLAS_CSYMM     CUDABLAS_ZSYMM
CUDABLAS_SSYRK     CUDABLAS_DSYRK     CUDABLAS_CSYRK     CUDABLAS_ZSYRK
CUDABLAS_SSYR2K    CUDABLAS_DSYR2K    CUDABLAS_CSYR2K    CUDABLAS_ZSYR2K
CUDABLAS_STRMM     CUDABLAS_DTRMM     CUDABLAS_CTRMM     CUDABLAS_ZTRMM
CUDABLAS_STRSM     CUDABLAS_DTRSM     CUDABLAS_CTRSM     CUDABLAS_ZTRSM
                                      CUDABLAS_CHEMM     CUDABLAS_ZHEMM
                                      CUDABLAS_CHERK     CUDABLAS_ZHERK
                                      CUDABLAS_CHER2K    CUDABLAS_ZHER2K

Table 9.6.  Public Symbols and Parameters in Module CUDABLAS_LIBRARY

CUBLAS_STATUS_SUCCESS=0
CUBLAS_STATUS_NOT_INITIALIZED=1
CUBLAS_STATUS_ALLOC_FAILED=3
CUBLAS_STATUS_INVALID_VALUE=7
CUBLAS_STATUS_ARCH_MISMATCH=8
CUBLAS_STATUS_MAPPING_ERROR=11
CUBLAS_STATUS_EXECUTION_FAILED=13
CUBLAS_STATUS_INTERNAL_ERROR=14
FSIZE=4
DSIZE=8
CSIZE=8
ZSIZE=16
SKIND=kind(1.E0)
DKIND=kind(1.D0)
SZERO=0.E0
DZERO=0.D0
SONE=1.E0
DONE=1.D0
LEVEL=6 (IMSL Error or Warning Level)
NSTART=32 (Default Switchover Value)

 

Table 9.7.  Subprograms Packaged in Module CUDABLAS_LIBRARY

Fortran Name Implemented in Module

CUBLAS_GET
CUBLAS_SET
CHECK_BUFFER_ALLOCATION
CUDA_ERROR_PRINT

 

Table 9.8 lists a number of NVIDIA helper subprograms called within the CUDABLAS_LIBRARY module.  These are mostly for internal use only, but are documented in case a knowledgeable NVIDIA library user chooses to make use of them.

 

Table 9.8.  NVIDIA Helper Subprograms Called in Module CUDABLAS_LIBRARY

Fortran Usage Name in Module                                NVIDIA External C Name

ISW = cublasInit()                                          cublasInit()
ISW = cublasShutdown()                                      cublasShutdown()
ISW = cublasError()                                         cublasError()
ISW = cublasAlloc(n, datasize, c_ptr)                       cublasAlloc(n, datasize, c_ptr)
ISW = cublasFree(c_ptr)                                     cublasFree(c_ptr)
ISW = cublasSetVector(n, datasize, x, incx, y, incy)        cublasSetVector(n, datasize, x, incx, y, incy)
ISW = cublasGetVector(n, datasize, x, incx, y, incy)        cublasGetVector(n, datasize, x, incx, y, incy)
ISW = cublasSetMatrix(m, n, datasize, A, lda, devA, ldd)    cublasSetMatrix(m, n, datasize, A, lda, devA, ldd)
ISW = cublasGetMatrix(m, n, datasize, devA, lda, B, ldb)    cublasGetMatrix(m, n, datasize, devA, lda, B, ldb)

In Table 9.8 the arguments c_ptr, x, y, A, devA, and B are C pointers to arrays either on the GPU or the CPU.  These are instantiated with calls to the helper routine cublasAlloc(), or by use of the Fortran 2003 intrinsic function c_loc() for array arguments residing on the CPU; this intrinsic returns a C pointer to a Fortran object.  The helper function cublasError() is called from each of the double precision IMSL/NVIDIA BLAS codes to assess the availability of double precision floating-point hardware on the GPU.
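As an illustration of these helpers, the following sketch allocates a device block and copies a CPU array to it.  It assumes the Table 9.8 interfaces and the FSIZE constant from Table 9.6 are accessible from the module, and omits error checking of ISW for brevity:

USE CUDABLAS_LIBRARY
USE, INTRINSIC :: ISO_C_BINDING
INTEGER, PARAMETER :: N = 1000
REAL, TARGET :: X(N)          ! TARGET is required for c_loc()
TYPE(C_PTR) :: devX
INTEGER :: ISW
CALL RANDOM_NUMBER(X)
ISW = cublasInit()
ISW = cublasAlloc(N, FSIZE, devX)                       ! device block for X
ISW = cublasSetVector(N, FSIZE, c_loc(X), 1, devX, 1)   ! CPU -> GPU copy
! ... computations that use devX on the GPU ...
ISW = cublasFree(devX)
ISW = cublasShutdown()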

The NVIDIA Environmental Subprograms listed in Table 9.9 provide details about the runtime working environment.

 

Table 9.9.  NVIDIA Environmental Subprograms

Fortran Usage Name in Module                                    NVIDIA External Name

ISW = cudaGetDeviceCount(ICOUNT)                                cudaGetDeviceCount()
ISW = cudaSetDevice(IDEVICE)  {0 indexed}                       cudaSetDevice()
ISW = cudaGetDeviceProperties(<TYPE> cudaDeviceProp, IDEVICE)   cudaGetDeviceProperties()

One argument of cudaGetDeviceProperties is a Fortran derived type, cudaDeviceProp, with a C binding.  This type contains technical information about the device, including its name.  The C character string, NAME(*), is terminated with C_NULL_CHAR.  The derived type cudaDeviceProp is described below, followed by a short usage sketch:

 

TYPE, BIND(C) :: cudaDeviceProp
        CHARACTER(C_CHAR) NAME(256)
        INTEGER(C_SIZE_T) totalGlobalMem
        INTEGER(C_SIZE_T) sharedMemPerBlock
        INTEGER(C_INT) regsPerBlock
        INTEGER(C_INT) warpSize
        INTEGER(C_SIZE_T) memPitch
        INTEGER(C_INT) maxThreadsPerBlock
        INTEGER(C_INT) maxThreadsDim(3)
        INTEGER(C_INT) maxGridSize(3)
        INTEGER(C_SIZE_T) totalConstMem
        INTEGER(C_INT) major
        INTEGER(C_INT) minor
        INTEGER(C_INT) clockRate
        INTEGER(C_SIZE_T) textureAlignment
        INTEGER(C_INT) deviceOverlap
        INTEGER(C_INT) multiProcessorCount
        INTEGER(C_INT) kernelExecTimeoutEnabled
        INTEGER(C_INT) integrated
        INTEGER(C_INT) canMapHostMemory
        INTEGER(C_INT) computeMode
        INTEGER(C_INT) concurrentKernels
END TYPE
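A usage sketch, assuming the Table 9.9 interfaces and the type above are accessible from the module; error checking of ISW is omitted for brevity:

USE CUDABLAS_LIBRARY
USE, INTRINSIC :: ISO_C_BINDING
TYPE(cudaDeviceProp) :: PROP
INTEGER :: ISW, ICOUNT, I
CHARACTER(LEN=256) :: DEVNAME
ISW = cudaGetDeviceCount(ICOUNT)
IF (ICOUNT > 0) THEN
   ISW = cudaGetDeviceProperties(PROP, 0)      ! query device 0
   DEVNAME = ' '
   DO I = 1, 256
      IF (PROP%NAME(I) == C_NULL_CHAR) EXIT    ! C string terminator
      DEVNAME(I:I) = PROP%NAME(I)
   END DO
   PRINT *, 'Device 0: ', TRIM(DEVNAME)
END IF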

 

Required NVIDIA Copyright Notice:

© 2005–2010 by NVIDIA Corporation. All rights reserved.

Portions of the NVIDIA SGEMM and DGEMM library routines were written by Vasily Volkov and are subject to the Modified Berkeley Software Distribution License as follows:

Copyright (©) 2007-09, Regents of the University of California

All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.  (See CUDA CUBLAS Library, Version 3.1, July 2010, for the remaining conditions.)


