Scientific Computing on a Cluster

CSE Training • Spring 2016

This workshop is intended to give you a whirlwind tour of the issues underlying scientific computing, particularly as it is practiced today on distributed clusters. You should walk out of here today with at least a list of topics to dig deeper into, because entire semester-long courses could be built from the outline we use to introduce scientific computing.

What is Scientific Computing?

What is High-Performance Computing (HPC)?

Strategies

Seymour Cray
If you were plowing a field, which would you rather use: Two strong oxen or 1024 chickens?

— Seymour Cray

Concepts

Hardware mapping


Scalability

Let’s imagine we are solving a problem of size N (that is, we have to perform N operations) and there are P processing elements (CPUs, cores, processes, threads, etc.) at our disposal. To execute code on a massively parallel machine, we have to be sure that our code uses computer resources effectively (time- and energy-wise). We can measure how much faster the code runs as we add processing elements by calculating its speedup:
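Written out, these are the standard textbook definitions: speedup compares the runtime on one processing element to the runtime on P, and parallel efficiency normalizes that by P.

```latex
S(P) = \frac{T(1)}{T(P)}, \qquad E(P) = \frac{S(P)}{P}
```

Here T(P) is the time to solve the problem with P processing elements; ideal (linear) scaling means S(P) = P and E(P) = 1, which real codes rarely sustain as P grows.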

Implementations (software libraries)

Executing Scientific Code

1. Shells & architecture

99.999% of the time, you will be accessing a supercomputer (i.e., a cluster of computers) remotely.
99.999% of the time, you will be using the ssh program to do that.
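In practice that looks like the following; the hostname and username here are placeholders, not a real machine:

```shell
# connect to the cluster's login node (placeholder address)
ssh username@cluster.example.edu

# copy files to and from the cluster with scp, ssh's companion tool
scp input.dat username@cluster.example.edu:~/project/
```

Once you have logged in, everything in the rest of this lesson happens inside that remote shell session.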


2. Environment

Be aware of the environment: $PATH, $LD_LIBRARY_PATH, module
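A quick sketch of what "being aware of the environment" means in practice; the `module` lines are commented out because available module names vary from site to site, and the install directory is hypothetical:

```shell
# where does the shell look for executables and shared libraries?
echo "$PATH"
echo "$LD_LIBRARY_PATH"

# prepend a personal install location (hypothetical directory)
export PATH="$HOME/local/bin:$PATH"
export LD_LIBRARY_PATH="$HOME/local/lib:$LD_LIBRARY_PATH"

# on clusters that use environment modules:
# module avail        # list software the site provides
# module load gcc     # adjust PATH/LD_LIBRARY_PATH for you
# module list         # show what is currently loaded
```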

3. Queueing / Scheduling

The most popular queueing systems are PBS and SLURM.
PBS: Portable Batch System
SLURM: Simple Linux Utility for Resource Management
Both require so-called submission scripts: ordinary shell scripts (bash, sh, ksh, zsh, csh) containing a set of special directives for the queueing system.
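As an illustration, a minimal SLURM submission script might look like this; the job name, partition name, and executable are placeholders, and your cluster's documentation is the authority on which directives it requires:

```shell
#!/bin/bash
#SBATCH --job-name=myjob        # name shown in the queue
#SBATCH --nodes=1               # number of nodes
#SBATCH --ntasks=16             # total number of MPI ranks
#SBATCH --time=01:00:00         # wall-time limit (hh:mm:ss)
#SBATCH --partition=secondary   # hypothetical partition name

module load gcc                 # recreate the environment used at compile time
mpiexec -n 16 ./a.out           # launch the parallel executable
```

You would submit this with `sbatch` and watch it with `squeue -u $USER`; PBS works analogously with `#PBS` directives and `qsub`.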

4. Job Scripts (mpiexec, etc.)

Remember, you cannot simply do a plain ./myprogram.exe.
You always have to do something like mpiexec -n 16 ./a.out.

5. Checkpoint often if your code supports it.

This is common sense when working on a supercomputer: you don’t want to lose hours of computation when an emergency happens, such as a node failure or your job hitting its wall-time limit.
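The idea can be sketched in plain shell: record progress in a state file after every unit of work, and on startup resume from that file if it exists. The file name and step count here are arbitrary, purely for illustration:

```shell
STATE=checkpoint.dat

# resume from the last checkpoint, or start from scratch
if [ -f "$STATE" ]; then
    step=$(cat "$STATE")
else
    step=0
fi

while [ "$step" -lt 10 ]; do
    step=$((step + 1))
    # ... one expensive unit of work would go here ...
    echo "$step" > "$STATE"    # checkpoint after every step
done

echo "done at step $(cat "$STATE")"
```

Kill this script at any point and rerun it: it picks up where it left off instead of starting over, which is exactly what you want after a node failure.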



Compiling Scientific Code

wget https://github.com/maxim-belkin/hpc-sp16/raw/gh-pages/lessons/scicomp/hpc-code-examples.tar.gz
tar -xvzf hpc-code-examples.tar.gz
module load gcc

Compiling & building (gcc, make)

Common workflow:



Developing Scientific Code


Why do we never have time to do it right, but always have time to DO IT OVER?

— Anonymous

Design

Design refers to the decisions you need to make about what the software will do and how it will work. This includes choosing the language and libraries you require, and the target platform.

Construction

Construction refers to the process of actually coding the software.

You are not a serious software developer. That is not to say that you are not serious, nor that you do not develop software; however, you think of yourself as an engineer first and as a coder only incidentally.

There are extremely sophisticated tool chains in use in software development today, but we are going to highlight only a few immediately useful selections.

External to coding

Internal to coding

Other potentially useful utilities include documentation generators and integrated development environments.

Access

Testing

Before software can be reusable it first has to be usable.

— Ralph Johnson

$ module load valgrind
$ valgrind --tool=memcheck --leak-check=yes --show-reachable=yes -v ./cache_test 10000
==12877== Memcheck, a memory error detector
==12877== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==12877== Using Valgrind-3.9.0 and LibVEX; rerun with -h for copyright info
==12877== Command: ./cache_test 10000
...
--12877-- Valgrind library directory: /usr/local/apps/valgrind/3.9.0/lib/valgrind
...
Matrix size:  10000x10000
--12877-- REDIR: 0x4eaa520 (free) redirected to 0x4c273fd (free)
--12877-- REDIR: 0x4ea9640 (malloc) redirected to 0x4c27a23 (malloc)
==12877== 
==12877== HEAP SUMMARY:
==12877==     in use at exit: 0 bytes in 0 blocks
==12877==   total heap usage: 10,001 allocs, 10,001 frees, 400,080,000 bytes allocated
==12877== 
==12877== All heap blocks were freed -- no leaks are possible
==12877== 
==12877== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 6 from 6)
--12877-- 
--12877-- used_suppression:      6 dl-hack3-cond-1 /usr/local/apps/valgrind/3.9.0/lib/valgrind/default.supp:1196
==12877== 
==12877== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 6 from 6)
  
$ valgrind --tool=cachegrind ./cache_test 10000
==12881== Cachegrind, a cache and branch-prediction profiler
==12881== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote et al.
==12881== Using Valgrind-3.9.0 and LibVEX; rerun with -h for copyright info
==12881== Command: ./cache_test 10000
==12881== 
--12881-- warning: L3 cache found, using its data for the LL simulation.
Matrix size:  10000x10000
==12881== 
==12881== I   refs:      504,082,163
==12881== I1  misses:            868
==12881== LLi misses:            860
==12881== I1  miss rate:        0.00%
==12881== LLi miss rate:        0.00%
==12881== 
==12881== D   refs:      201,473,856  (100,934,392 rd   + 100,539,464 wr)
==12881== D1  misses:    112,537,715  ( 12,524,791 rd   + 100,012,924 wr)
==12881== LLd misses:     97,883,166  (     54,665 rd   +  97,828,501 wr)
==12881== D1  miss rate:        55.8% (       12.4%     +        99.4%  )
==12881== LLd miss rate:        48.5% (        0.0%     +        97.3%  )
==12881== 
==12881== LL refs:       112,538,583  ( 12,525,659 rd   + 100,012,924 wr)
==12881== LL misses:      97,884,026  (     55,525 rd   +  97,828,501 wr)
==12881== LL miss rate:         13.8% (        0.0%     +        97.3%  )

First, learn to use debugging and profiling tools. Two that are supported on the Campus Cluster are gdb and valgrind. gdb, the GNU Debugger, works with many programming languages besides C, but it can be painful to use with a parallel program. (Indeed, any parallel debugging is uniquely painful.) In any case, if you intend to use gdb, you need to compile your code with gcc using the -g flag.

The other tool, valgrind, monitors memory behavior. It can detect cache misses, memory leaks, and other problems which can lead to poor code performance and excessive memory demand.

Debugging

It’s hard enough to find an error in your code when you’re looking for it;
it’s even harder when you’ve _assumed_ your code is _error-free_.

— Steve McConnell

wget https://github.com/maxim-belkin/hpc-sp16/raw/gh-pages/lessons/scicomp/numerical-error.ipynb
source /class/cs101/etc/venv/cse/bin/activate /class/cs101/etc/venv/cse/
jupyter notebook