Assignment 1
Due Date
 October 8th (this assignment can be done in pairs)
Problem statement
Your task is to optimize matrix multiplication to run fast on a single processor core of XSEDE's Bridges2 cluster.
We consider a special case of matmul:
C := C + A * B
where A
, B
, and C
are \(n \times n\) matrices. This can be performed using \(2n^3\) floating point operations (\(n^3\) adds, \(n^3\) multiplies), as in the following pseudocode:
for i = 1 to n
for j = 1 to n
for k = 1 to n
C(i,j) = C(i,j) + A(i,k) * B(k,j)
end
end
end
Notes
Please carefully read this section for implementation details.

If you are new to optimizing numerical codes, we recommend reading the papers in the references section.

There are other formulations of matmul (e.g., Strassen's algorithm) that are mathematically equivalent, but perform asymptotically fewer computations  we will not grade submissions that do fewer computations than the \(2n^3\) algorithm. Once you have finished and are happy with your
square_dgemm
implementation you should consider this and other optional improvements for further coding practice but they will not be graded for this assignment. 
Your code must use doubleprecision to represent real numbers. A common reference implementation for doubleprecision matrix multiplication is the dgemm (doubleprecision general matrixmatrix multiply) routine in the level3 BLAS. We will compare your implementation with the tuned
dgemm
implementation available  on Bridges2, we will compare with the Intel MKL implementation ofdgemm
. Note thatdgemm
has a more general interface thansquare_dgemm
 an optional part of this assignment encourages you to explore this richer tuning space. 
You may use any compiler available. We recommend starting with the GNU C compiler (
gcc
). If you use a compiler other thangcc
, you will have to change the providedMakefile
, which uses gccspecific flags. Note that the default compilers, every time you open a new terminal, arePGI
 you will have to typemodule unload pgi
ormodule purge
and thenmodule load gnu
to switch to, e.g., GNU compilers. You can typemodule list
to see which compiler wrapper you have loaded. 
You may use C99 features when available. The provided
benchmark.c
uses C99 features, so you must compile with C99 support  forgcc
, this means using the flagstd=gnu99
(see theMakefile
). Here is the status of C99 functionality in gcc  note that C90 (ANSI C) is fully implemented. 
Besides compiler intrinsic functions and builtins, your code (
dgemmblocked.c
) must only call into the C standard library. 
You may not use compiler flags that automatically detect
dgemm
kernels and replace them with BLAS calls, i.e. Intel's matmul flag. 
You should try to use your compiler's automatic vectorizer before manually vectorizing.
 GNU C provides many extensions, which include intrinsics for vector (SIMD) instructions and data alignment (other compilers may have different interfaces)
 Ideally your compiler injects the appropriate intrinsics into your code automatically (eg, automatic vectorization and/or automatic data alignment). gcc's autovectorizer reports diagnostics that may help you identify if manual vectorization is required
 To manually vectorize, you must add compiler intrinsics to your code.
 Consult your compiler's documentation.

You may assume that
A
andB
do not aliasC
; however,A
andB
may alias each other. It is semantically correct to qualifyC
(the last argument tosquare_dgemm
) with the C99restrict
keyword. There is a lot online about restrict and pointeraliasing  this is a good article to start with. 
The matrices are all stored in columnmajor order, i.e. \(C_{i,j} == C(i,j) == C[(i1)+(j1)*n]\), for \(i=1:n\), where \(n\) is the number of rows in \(C\). Note that we use 1based indexing when using mathematical symbols \(C_{i,j}\) and MATLAB index notation \(C(i,j)\), and 0based indexing when using C index notation \(C[(i1)+(j1)*n]\).

We will check correctness by the following componentwise error bound: \( \text{square_dgemm}(n,A,B,0)  A*B < eps*n*A*B \) where \(eps := 2^{52} = 2.2 * 10^{16}\) is the machine epsilon.

One possible optimization to consider for the multiple tuning parameters in an optimized Matrix Multiplication code is autotuning in order to find the optimal/best available value. Libraries like OSKI and ATLAS have shown that achieving the best performance sometimes can be done most efficiently by automatic searching over the parameter space. Some papers on this topic can be found on the Berkeley Benchmarking and Optimization (BeBOP) page.
Submission Instructions
You will submit your files via Gradescope. Your submission should be a single zip archive name: hw1.zip
. It should contain (please do use exactly these formats and naming conventions):
dgemmblocked.c
, a Clanguage source file containing your implementation of the routine:void square_dgemm(int, double*, double*, double*);
Makefile
, only if you modified it. If you modified it, make sure it still correctly builds the provided benchmark.c, which we will use to grade your submission.hw1.pdf
, your writeup.
Your writeup should contain:

the names of the people in your group (and each member's contribution),

the optimizations used or attempted,

the results of those optimizations,

the reason for any odd behavior (e.g., dips) in performance, and how the performance changed when running your optimized code on a different machine.
Grading

Your grade will depend on the performance sustained by your codes on Bridges2 as a percentage of peak:
 If your sustained performance is between 0% and 50% you will receive a score between 0 and 75 proportional to your sustained performance (Ex: 25% gives a score of 37.5)
 If your sustained performance is between 50% and 80% you will receive a score between 75 and 100 proportional to your sustained performance (Ex: 56% gives a score of 80)
 If your sustained performance is above 80% you will receive 100

Any compiler information, flags and options must be copied to the starting comments of the
dgemm
submitted file; The default options available in theMakefile
will already be present indgemm_blocked.c

Your submission must pass the error bound test and cannot call BLAS for
dgemm
; any submissions that fail these tests this will receive a grade of 0
Source Files
The starting files with their descriptions can be found in the starting code archive.
Login, Compilation and Job Submission
The easiest way to access the machines is to login directly with your own ssh client to login.xsede.org
and from there ssh into the correct machine. For this assignment we will be using Bridges2.
To unarchive the files from the tar archive use the following command:
$ tar xvf xsedecs267hw12021.tar.gz
Enter into the folder, run source modules.sh
and then you can simply type make
to compile the code.
To submit jobs on the Bridges2 system you will use the SLURM
interface for example:
$ sbatch jobblocked
To check the status of running jobs you can use the following squeue
command where again $USER
should be replaced with your username
$ squeue u $USER
For more details on sbatch
commands please see Bridges2' documentation page.
References

Goto, K., and van de Geijn, R. A. 2008. Anatomy of HighPerformance Matrix Multiplication, ACM Transactions on Mathematical Software 34, 3, Article 12.

Chellappa, S., Franchetti, F., and Puschel, M. 2008. How To Write Fast Numerical Code: A Small Introduction, Lecture Notes in Computer Science 5235, 196259.

Bilmes, et al. The PHiPAC (Portable High Performance ANSI C) Page for BLAS3 Compatible Fast Matrix Matrix Multiply.

Lam, M. S., Rothberg, E. E, and Wolf, M. E. 1991. The Cache Performance and Optimization of Blocked Algorithms, ASPLOS'91, 6374.