# Assignment 2

## Due Date

• October 27th (this assignment can be done in pairs)

## Problem Statement

This assignment will test your ability to write scalable and efficient shared memory code. It should also teach you to recognize the importance of starting from an efficient serial implementation before parallelization.

Your goal is to parallelize, on XSEDE's Bridges-2 supercomputer, a toy particle simulator (similar particle simulators are used in mechanics, biology, astronomy, etc.) that reproduces the behaviour shown in the following animation:

The range of interaction forces (cutoff) is limited as shown in grey for a selected particle. Density is set sufficiently low so that given $$n$$ particles, only $$O(n)$$ interactions are expected.
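One standard way to exploit the cutoff and reach the expected $$O(n)$$ interaction count is spatial binning: partition the domain into square cells at least as wide as the cutoff, so each particle only needs to check its own cell and the eight neighboring ones. A minimal sketch of the binning step, with hypothetical names (the starter code defines its own particle type):

```cpp
#include <vector>

// Hypothetical particle type; the starter code defines its own.
struct Particle { double x, y; };

// Assign each particle index to a square bin of side `cell` (>= cutoff),
// for a size-by-size domain. The force pass then only visits a particle's
// own bin and its 8 neighbors, giving O(n) expected work at low density.
std::vector<std::vector<int>> build_bins(const std::vector<Particle>& p,
                                         double size, double cell) {
    int nb = static_cast<int>(size / cell) + 1;   // bins per side
    std::vector<std::vector<int>> bins(nb * nb);
    for (int i = 0; i < static_cast<int>(p.size()); ++i) {
        int bx = static_cast<int>(p[i].x / cell);
        int by = static_cast<int>(p[i].y / cell);
        bins[by * nb + bx].push_back(i);
    }
    return bins;
}
```

Rebuilding the bins each time step is itself $$O(n)$$, so the overall step stays linear in the particle count.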

Suppose we have a code that runs in time $$T = O(n)$$ on a single processor. Then we'd hope to run in time $$T/p$$ when using $$p$$ processors. We'd like you to write parallel codes that approach these expectations.
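For the shared-memory versions, the natural starting point is to split the per-particle loops across threads. A minimal OpenMP sketch with a hypothetical function name (compile with `-fopenmp`; without that flag the pragma is ignored and the loop runs serially with the same result):

```cpp
#include <vector>

// Sketch only: with the serial work split across p threads by
// `parallel for`, runtime ideally drops toward T/p.
std::vector<double> zero_accelerations(int n) {
    std::vector<double> acc(n, -1.0);
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        acc[i] = 0.0;   // each particle is handled by exactly one thread
    return acc;
}
```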

## Correctness and Performance

A simple correctness check is provided which computes the minimal distance between two particles over the entire simulation. In a correct simulation, particles stay at distances greater than 0.4 (of the cutoff), with typical values between 0.7 and 0.8. In a simulation where particles don't interact correctly, the minimum will be less than 0.4 (of the cutoff), with typical values between 0.01 and 0.05. More details, as well as an average distance, are described in the source file.

The code we are providing performs this distance check based on the calls to the interaction function, but the full autograder will perform the checks based on the output text file. We recommend keeping the correctness checker in your code (for the OpenMP, Pthreads and MPI codes the overhead isn't significant), but depending on performance goals it can be removed as long as the simulation can still output a correct text file.
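Conceptually, the check reduces to tracking the smallest pairwise separation observed across all interaction calls, as a fraction of the cutoff. A hypothetical sketch (the provided checker's actual implementation is in the source file):

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical tracker mirroring the provided check: record the smallest
// separation (as a fraction of the cutoff) seen across all interactions.
struct MinDistCheck {
    double min_frac = 1.0;   // > 0.4 expected for a correct simulation
    void observe(double dx, double dy, double cutoff) {
        double r = std::sqrt(dx * dx + dy * dy);
        min_frac = std::min(min_frac, r / cutoff);
    }
    bool looks_correct() const { return min_frac > 0.4; }
};
```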

The performance of the code is determined by doing multiple runs of the simulation with increasing particle numbers and comparing the performance with the autograder.cpp provided. This can be done automatically with the auto-* scripts.

There will be two types of scaling that are tested for your parallel codes:

• In strong scaling we keep the problem size constant but increase the number of processors
• In weak scaling we increase the problem size proportionally to the number of processors so the work/processor stays the same (Note that for the purposes of this assignment we will assume a linear scaling between work and processors)
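The autograder's exact computation is in autograder.cpp; the textbook definitions of these efficiencies, for serial time $$T_1$$ and $$p$$-processor time $$T_p$$, are:

```cpp
// Standard scaling-efficiency definitions (the autograder's exact
// computation lives in autograder.cpp; these are the textbook versions).
double strong_eff(double t1, double tp, int p) {
    return t1 / (p * tp);   // same problem size on 1 and on p processors
}
double weak_eff(double t1, double tp) {
    return t1 / tp;         // problem size grown in proportion to p
}
```

Perfect scaling gives an efficiency of 1.0 in both cases; values below 1.0 reflect parallel overheads.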

For more details on the options provided by each file you can use the -h flag on the compiled executables.

While the scripts we are providing have small numbers of particles (500-1000) to allow for the $$O(n^2)$$ algorithm to finish execution, the final codes will be tested with values 100 times larger (50000-100000) to better see their performance.

• Serial Code performance will be tested via fitting multiple runs of the code to a line on a log/log plot and calculating the slope of that line. This will determine whether your code is attaining the $$O(n)$$ desired complexity versus the starting $$O(n^2)$$. With an achieved result of $$O(n^x)$$ you will receive:

• if $$x$$ is between 2 and 1.5 you will receive a serial score between 0 and 75, linearly interpolated in $$x$$ (e.g. $$x = 1.75$$ gives a score of 37.5)
• if $$x$$ is between 1.5 and 1.3 you will receive a serial score between 75 and 100, linearly interpolated in $$x$$ (e.g. $$x = 1.34$$ gives a score of 95)
• if $$x$$ is below 1.3 you will receive a serial score of 100
• Shared Memory performance will be tested via calculating the strong and weak scaling average efficiencies for 1, 2, 4, 8, 16 processor/thread runs.

• the strong and weak scaling average efficiencies ($$\text{eff}_{ss}$$, $$\text{eff}_{ws}$$) will each contribute 50% of the grade
• the grade will be selected as the maximum between the OpenMP and Pthreads grades so only one implementation needs to be provided
• an efficiency (strong or weak) will give a grade according to the following rules:
• if the efficiency is between 0 and 0.5 you will receive a score between 0 and 75 proportional to it (e.g. an efficiency of 0.2 gives a score of 30)
• if the efficiency is between 0.5 and 0.8 you will receive a score between 75 and 100, linearly interpolated (e.g. an efficiency of 0.62 gives a score of 85)
• if the efficiency is 0.8 or above you will receive a score of 100
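The serial complexity test above amounts to a least-squares fit of $$\log T$$ against $$\log n$$, whose slope estimates the exponent $$x$$ in $$O(n^x)$$. A sketch of that fit (the autograder's exact procedure may differ):

```cpp
#include <cmath>
#include <vector>

// Least-squares slope of log(t) vs log(n): estimates x in T = O(n^x).
// n holds particle counts, t the corresponding run times.
double loglog_slope(const std::vector<double>& n, const std::vector<double>& t) {
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    int m = static_cast<int>(n.size());
    for (int i = 0; i < m; ++i) {
        double x = std::log(n[i]), y = std::log(t[i]);
        sx += x; sy += y; sxx += x * x; sxy += x * y;
    }
    return (m * sxy - sx * sy) / (m * sxx - sx * sx);
}
```

For example, run times growing as $$n^2$$ (quadrupling when $$n$$ doubles) yield a slope near 2, while a well-binned code should fit near 1.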

## Source files

The starting files with their descriptions can be found in the starting code archive.

## Login, Compilation and Job submission

The easiest way to access the machines is to login directly with your own ssh client to login.xsede.org and from there ssh into the correct machine. For this assignment we will be using Bridges-2.

To unarchive the files from the tar archive use the following command:

```
$ tar -xvf xsede-cs267-hw2-2021.tar.gz
```

Enter the folder, run `source modules.sh`, and then you can simply type `make` to compile the code. To submit jobs on the Bridges-2 system you will use the SLURM interface, for example:

```
$ sbatch job-bridges-openmp16
```

To check the status of running jobs you can use the following `squeue` command, where `$USER` should be replaced with your username:

```
$ squeue -u $USER
```

For more details on sbatch commands please see the Bridges-2 documentation page. Once you have code that gives correct output, you should check its efficiency with the provided autograder by running:

```
$ sbatch auto-bridges-openmp16
```
This example will run your code with the "no output" (-no) option and run 1, 2, 4, 8, 16 thread runs of your parallel code with varying sizes of problems to determine strong and weak scaling.

Note that at the moment the Bridges-2 system processes jobs very quickly; depending on your problem size, a job might finish before you can even run the `squeue` command. If your submission was successful, check the folder for the `.stdout` output file.

## Submission Instructions

You will submit your files via Gradescope. Your submission should be a single zip archive named hw2.zip. It should contain the following (please use exactly these formats and naming conventions):

• serial.cpp and openmp.cpp (or pthreads.cpp)
• Makefile, only if you modified it
• hw2.pdf, your write-up.

• A plot in log-log scale that shows that your serial and parallel codes run in $$O(n)$$ time and a description of the methods you used to achieve it