APC523 Lectures 13 and 14: Hardware and Software Issues in Computational Physics

These are notes, lecture slides and code for Lectures 13 and 14 of Princeton University’s graduate course “Numerical Algorithms for Scientific Computing” for the year 2020. The main instructor for the course is Greg Hammett. All material is copyrighted by Ammar Hakim and released under the Creative Commons CC BY License.

Background

Computational physics software is very complex. In addition to the physics itself being complicated (multiple spatial and temporal scales, highly nonlinear physics, coupling between models, huge numbers of unknowns, etc.), the algorithms required may be sophisticated, difficult to implement efficiently, and may require complex data structures. In Lectures 13 and 14 we will look at the specific software and hardware issues that arise in many computational physics codes.

In general, to achieve expertise in this topic one needs to write a lot of code under the supervision of a “Master Craftsman”. However, I will try my best to point out resources and issues that commonly arise while writing computational physics codes, which will allow you to learn and explore on your own.

Github repo for lectures and code

You can get all the tex files for the slides, source for this website and the code I ran in class by cloning the Github repo for these lectures.

Lecture 13: Hardware and Software Issues: History and Optimization

PDF of Lecture 13 slides

In this lecture we will look at hardware-oriented optimization of our algorithms. We will begin by looking at the early history of computer hardware. Numerical computation has driven the design of computers from the beginning, and even though non-numeric applications now dominate, numerics continues to push the boundaries of hardware design. Even in what may appear to be non-numeric applications, mathematical algorithms are deeply embedded in nearly all forms of software. For example, games and movies use very sophisticated rendering algorithms (almost always hardware accelerated), and designing better, more realistic rendering is a multi-billion-dollar industry.

The first and perhaps greatest pioneer in computer hardware was Charles Babbage. He essentially designed, ab initio, a series of mechanical computers, culminating in the Analytical Engine. Most of Babbage’s machines were not built in his lifetime. However, his design for the Analytical Engine contains all the major architectural features found in modern processors (of course, Babbage worked with mechanical machines, not electronics). By a stroke of misfortune (probably because Babbage published very little), his ideas were not widely known and his designs fell into obscurity. They were only rediscovered in the 1960s, well after the modern von Neumann architecture was designed. That two independent designs made a century apart should be so similar is remarkable.

Allan Bromley is credited with rediscovering Babbage’s legacy. See his paper in the Annals of the History of Computing for a detailed overview of the Analytical Engine. Babbage’s Difference Engine has now been built twice. See the Computer History Museum page. Babbage also designed an extraordinary printer, which was built by the Science Museum, London. See the BBC news report.

The von Neumann architecture is named after the polymath John von Neumann. He worked on his designs over many years, including at the Institute for Advanced Study, where he designed a machine called the IAS Machine. von Neumann and Herman Goldstine wrote the first major paper on the error analysis of Gaussian elimination, bringing numerical analysis into the mainstream of applied mathematics.

Moore’s Law continues to hold: the number of transistors on a chip doubles roughly every two years. However, we are now hitting the limits of clock speed, and the way toward faster computing is via multi-core machines. Many companies ship chips with a large number of cores. For example, Intel’s Cascade Lake architecture has up to 28 cores, and AMD’s Epyc chips come with up to 64 cores. These are now becoming available at various supercomputing centers (including Princeton Research Computing).

We looked at a matrix-matrix multiply problem, showing how rearranging the data access pattern improves performance. However, we saw that professional-grade libraries like Eigen still outperform our hand-written code by \(5\times\) or more. Further improvements can be made by using blocking algorithms and SIMD instructions, in essence vectorizing the code.

For excruciating details on how to optimize C++ code see Agner Fog’s website. In particular, he has five(!) manuals on all sorts of gory details on processor architectures and how you can utilize this information to optimize your code. His first manual is probably the most useful.

We saw an example of SIMD programming. Manually programming SIMD is painful and very tedious. The processor intrinsics are not intuitive and need significant expertise to use. Wrapper C++ libraries should be used instead; for example, Agner Fog’s VectorClass library or Agenium Scale’s NSIMD library. (Intel has its own library, but it does not seem to be widely used or even maintained. I may be wrong, though.)

Lecture 14: Hardware and Software Issues: C++ and Parallel Programming

PDF of Lecture 14 slides

In this lecture we will look at some concepts in software engineering and parallel programming. In particular, we will focus on the MPI standard, briefly mentioning OpenMP and GPUs. The latter are major topics in themselves and cannot be covered in a single lecture like this.

My general recommendation for new projects is to use modern C++ (i.e. the C++ language as standardized in 2011 and after). Modern C++ is a very powerful language that has the right combination of high-level features and low-level control, which can result in very flexible but optimized code. (Often, keeping a code flexible and ensuring it is efficient is difficult.) Most newer large libraries are being written in C++, and it is good to get familiar with the language.

However, C++ is a difficult language to learn and very hard to use effectively. It is best to study good C++ code and apprentice to a “Master Craftsman” to become a real expert. I recommend Bjarne Stroustrup’s “The C++ Programming Language” as the standard reference. Stroustrup is the inventor of C++ and also wrote the first compiler for it.

The most comprehensive and complete reference documentation for C++ is the cppreference website. If you write any serious program in C++ this is indispensable. This is not for beginning C++ programmers.

When using C++, use the Standard Library (often called the “Standard Template Library” (STL)) extensively. For example, use the std::vector<> class to allocate arrays rather than allocating memory directly yourself. Use plenty of struct or class, and in parts of the code use object-oriented programming. Object-oriented programming is a powerful paradigm but is not always suitable for computational code. For example, the STL is not object oriented. Instead it uses the idea of “types” with operators that work on abstract types satisfying some property. The theory of types has a long and distinguished history in mathematics and logic, going back to Bertrand Russell. Many programming languages have a very sophisticated concept of types, though the C++ type system leaves much to be desired.

The Message Passing Interface (MPI) is the de-facto standard in high-performance computing (HPC) for message based communication between processors. The MPI standard specifies an API that applications can use to communicate in parallel. See the documentation here. The MPI standard is complex. However, for most explicit codes you only need to understand a few key methods and can learn others as and when needed.

I went over the basic design pattern of an explicit, parallel PDE solver. (In the CS literature “Design Patterns” mean something very specific: they encode a fundamental algorithmic pattern that occurs again and again in a large number of applications. CS folks have designed and discovered many patterns, and good programmers should be aware of some of these.) I showed that to communicate between sub-domains one needs to copy “skin-cell” data from one sub-domain to the “ghost-cell” region of another sub-domain. The size and layout of the skin/ghost-cell regions depend on the stencil you are using (for example, whether you are using a 5-point or 9-point Laplacian stencil). For unstructured grids the layout of skin/ghost-cell regions is very complex and needs significant book-keeping.

As a complete illustration of the steps needed in writing an explicit, parallel PDE solver, I wrote and showed an advection equation solver. See the source code in the Github directory. This implements a forward Euler scheme in time and a central-difference scheme in space. Note that this scheme is unstable; implementing a stable scheme (using upwinding) is left as an exercise. To understand the structure of the code see the slides from Lecture 14.