Assignment 1
Due Date
- October 8th (this assignment can be done in pairs)
Problem statement
Your task is to optimize matrix multiplication to run fast on a single processor core of XSEDE's Bridges-2 cluster.
We consider a special case of matmul:

C := C + A * B

where A, B, and C are \(n \times n\) matrices. This can be performed using \(2n^3\) floating point operations (\(n^3\) adds and \(n^3\) multiplies), as in the following pseudocode:
for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    end
  end
end
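For concreteness, here is a minimal, unoptimized sketch of this algorithm in C, using the square_dgemm signature from the Submission Instructions and the column-major storage layout described in the Notes below (the 0-based (i,j) entry lives at index i + j*n):

```c
/* Naive 2n^3 matmul, C := C + A*B, a direct translation of the pseudocode.
 * All matrices are n-by-n and stored in column-major order, so the
 * (i,j) entry (0-based) is at index i + j*n. */
void square_dgemm(int n, double* A, double* B, double* C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                /* Note: A is read with stride n in this innermost loop --
                 * poor cache behavior, and one of the first things an
                 * optimized version should address. */
                C[i + j*n] += A[i + k*n] * B[k + j*n];
}
```

This is only a starting point; making it run fast is the assignment.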
Notes
Please carefully read this section for implementation details.
- If you are new to optimizing numerical codes, we recommend reading the papers in the references section.
- There are other formulations of matmul (e.g., Strassen's algorithm) that are mathematically equivalent but perform asymptotically fewer computations; we will not grade submissions that do fewer computations than the \(2n^3\) algorithm. Once you have finished and are happy with your square_dgemm implementation, you may consider this and other optional improvements for further coding practice, but they will not be graded for this assignment.
- Your code must use double precision to represent real numbers. A common reference implementation for double-precision matrix multiplication is the dgemm (double-precision general matrix-matrix multiply) routine in the level-3 BLAS. We will compare your implementation with the tuned dgemm implementation available on the machine; on Bridges-2, this is the Intel MKL implementation of dgemm. Note that dgemm has a more general interface than square_dgemm; an optional part of this assignment encourages you to explore this richer tuning space.
- You may use any compiler available. We recommend starting with the GNU C compiler (gcc). If you use a compiler other than gcc, you will have to change the provided Makefile, which uses gcc-specific flags. Note that the default compilers, loaded every time you open a new terminal, are PGI; you will have to type module unload pgi or module purge and then module load gnu to switch to, e.g., the GNU compilers. You can type module list to see which compiler wrapper you have loaded.
- You may use C99 features when available. The provided benchmark.c uses C99 features, so you must compile with C99 support; for gcc, this means using the flag -std=gnu99 (see the Makefile). Here is the status of C99 functionality in gcc; note that C90 (ANSI C) is fully implemented.
- Besides compiler intrinsic functions and built-ins, your code (dgemm-blocked.c) must only call into the C standard library.
- You may not use compiler flags that automatically detect dgemm kernels and replace them with BLAS calls, e.g., Intel's -matmul flag.
- You should try to use your compiler's automatic vectorizer before manually vectorizing.
  - GNU C provides many extensions, which include intrinsics for vector (SIMD) instructions and data alignment (other compilers may have different interfaces).
  - Ideally, your compiler injects the appropriate intrinsics into your code automatically (e.g., automatic vectorization and/or automatic data alignment). gcc's auto-vectorizer reports diagnostics that may help you identify whether manual vectorization is required.
  - To manually vectorize, you must add compiler intrinsics to your code; a minimal sketch appears after these notes.
  - Consult your compiler's documentation.
- You may assume that A and B do not alias C; however, A and B may alias each other. It is therefore semantically correct to qualify C (the last argument to square_dgemm) with the C99 restrict keyword. There is a lot online about restrict and pointer aliasing; this is a good article to start with.
- The matrices are all stored in column-major order, i.e., \(C_{i,j} = C(i,j) = C[(i-1)+(j-1)*n]\) for \(i = 1:n\), where \(n\) is the number of rows in \(C\). Note that we use 1-based indexing with the mathematical notation \(C_{i,j}\) and the MATLAB notation \(C(i,j)\), and 0-based indexing with the C notation \(C[(i-1)+(j-1)*n]\).
- We will check correctness by the following componentwise error bound: \( |\text{square\_dgemm}(n,A,B,0) - A*B| < \mathrm{eps} \cdot n \cdot |A| \cdot |B| \), where \(\mathrm{eps} := 2^{-52} \approx 2.2 \times 10^{-16}\) is the machine epsilon.
- One possible optimization to consider, given the multiple tuning parameters in an optimized matrix multiplication code, is autotuning: automatically searching over the parameter space to find the best available values. Libraries like OSKI and ATLAS have shown that the best performance can sometimes be achieved most efficiently by automatic search. Some papers on this topic can be found on the Berkeley Benchmarking and Optimization (BeBOP) page. A minimal sketch of a run-time parameter search also appears after these notes.
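As mentioned in the vectorization note above, here is a minimal sketch of what manual vectorization can look like with intrinsics. It assumes an x86 core with AVX and FMA support and the matching gcc flags (e.g., -mavx -mfma or -march=native); axpy_col is a hypothetical helper name, not part of the provided code. It also uses the restrict qualifier discussed above.

```c
#include <immintrin.h>

/* Hypothetical helper: y := y + a*x over n doubles, vectorized with
 * AVX/FMA intrinsics (4 doubles per 256-bit register). With the i-loop
 * innermost in a column-major matmul, this is exactly the update
 * C(:,j) += A(:,k) * B(k,j). */
static void axpy_col(int n, double a, const double* restrict x,
                     double* restrict y)
{
    __m256d va = _mm256_set1_pd(a);          /* broadcast the scalar */
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d vx = _mm256_loadu_pd(x + i); /* unaligned load, 4 doubles */
        __m256d vy = _mm256_loadu_pd(y + i);
        vy = _mm256_fmadd_pd(va, vx, vy);    /* vy += va * vx, fused */
        _mm256_storeu_pd(y + i, vy);
    }
    for (; i < n; i++)                       /* scalar cleanup for n % 4 */
        y[i] += a * x[i];
}
```

Called as axpy_col(n, B[k + j*n], &A[k*n], &C[j*n]) inside loops over j and k, this performs the same computation as the pseudocode above; whether it actually beats gcc's auto-vectorizer at -O3 is exactly the kind of thing to measure and discuss in your write-up.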
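Similarly, as mentioned in the autotuning note, the simplest possible autotuner just times each candidate parameter value and keeps the best. The sketch below assumes a hypothetical square_dgemm_blocked variant that accepts its block size at run time; in practice you might instead rebuild with different -DBLOCK_SIZE values and compare the benchmark's reported performance. (clock_gettime is POSIX; it is visible under the Makefile's -std=gnu99.)

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical blocked variant, parameterized by block size at run time. */
void square_dgemm_blocked(int n, int block, double* A, double* B, double* C);

/* Time each candidate block size once and return the fastest. A real
 * autotuner would reset C between trials, average several runs, and
 * search more of the space (loop order, unrolling, ...). */
int tune_block_size(int n, double* A, double* B, double* C)
{
    const int candidates[] = {16, 32, 64, 128, 256};
    int best = candidates[0];
    double best_time = 1e30;
    for (int c = 0; c < (int)(sizeof candidates / sizeof candidates[0]); c++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        square_dgemm_blocked(n, candidates[c], A, B, C);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double sec = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        printf("block %3d: %.3f s\n", candidates[c], sec);
        if (sec < best_time) { best_time = sec; best = candidates[c]; }
    }
    return best;
}
```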
Submission Instructions
You will submit your files via Gradescope. Your submission should be a single zip archive named hw1.zip. It should contain (please use exactly these formats and naming conventions):
- dgemm-blocked.c, a C-language source file containing your implementation of the routine void square_dgemm(int, double*, double*, double*);
- Makefile, only if you modified it. If you modified it, make sure it still correctly builds the provided benchmark.c, which we will use to grade your submission.
- hw1.pdf, your write-up.
Your write-up should contain:
- the names of the people in your group (and each member's contribution),
- the optimizations used or attempted,
- the results of those optimizations, and
- the reason for any odd behavior (e.g., dips) in performance, and how the performance changed when running your optimized code on a different machine.
Grading
- Your grade will depend on the performance sustained by your code on Bridges-2, as a percentage of peak:
  - If your sustained performance is between 0% and 50%, you will receive a score between 0 and 75, proportional to your sustained performance (e.g., 25% gives a score of 37.5).
  - If your sustained performance is between 50% and 80%, you will receive a score between 75 and 100, scaled linearly with your sustained performance (e.g., 56% gives a score of 80).
  - If your sustained performance is above 80%, you will receive 100.
- Any compiler information, flags, and options must be copied into the starting comments of the submitted dgemm file; the default options available in the Makefile will already be present in dgemm-blocked.c.
- Your submission must pass the error bound test and cannot call BLAS for dgemm; any submission that fails these tests will receive a grade of 0.
Source Files
The starting files with their descriptions can be found in the starting code archive.
Login, Compilation and Job Submission
The easiest way to access the machines is to log in with your own ssh client to login.xsede.org and from there ssh into the correct machine. For this assignment we will be using Bridges-2.
To unarchive the files from the tar archive use the following command:
$ tar -xvf xsede-cs267-hw1-2021.tar.gz
Enter the folder, run source modules.sh, and then simply type make to compile the code.
To submit jobs on the Bridges-2 system you will use the SLURM interface, for example:
$ sbatch job-blocked
To check the status of running jobs you can use the following squeue command, where $USER should be replaced with your username:
$ squeue -u $USER
For more details on sbatch commands please see the Bridges-2 documentation page.
References
- Goto, K., and van de Geijn, R. A. 2008. Anatomy of High-Performance Matrix Multiplication. ACM Transactions on Mathematical Software 34, 3, Article 12.
- Chellappa, S., Franchetti, F., and Püschel, M. 2008. How To Write Fast Numerical Code: A Small Introduction. Lecture Notes in Computer Science 5235, 196-259.
- Bilmes, J., et al. The PHiPAC (Portable High Performance ANSI C) Page for BLAS3 Compatible Fast Matrix Matrix Multiply.
- Lam, M. S., Rothberg, E. E., and Wolf, M. E. 1991. The Cache Performance and Optimizations of Blocked Algorithms. ASPLOS '91, 63-74.