{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Cluster computing\n", "\n", "## Performance of geodynamic models\n", "\n", "![Memory hierarchy](img/memory-hierarchy.jpg)\n", "\n", "*Memory hierarchy of modern PC ([https://allthingsvlsi.wordpress.com](https://allthingsvlsi.wordpress.com))*\n", "\n", "Let's consider a typical 2D thermomechanical geodynamic model:\n", "\n", "- Dimensions: 1000 km x 1000 km\n", "- Resolution: 1 km \n", " - Grid points: **1 000 000**\n", " - Maximum time step (diffusion limit): 16 kyrs\n", "- Four unknowns: $v_x$, $v_y$, $P$, $T$\n", " - Four equations per grid point\n", "- Discretized versions of the equations:\n", " - About 20 operations (+ - \\* /) per equation\n", " - Operations per step: **80 000 000**\n", "- Modern PC processors can do about 10-100 GFLOPS (1 GFLOP = $10^9$ floating-point operations per second)\n", " - The *processor* could do 1000 steps per second\n", " - For example, 50 Myrs / 16 kyrs per step = 3200 steps\n", " - Model run time: **3.2 secs**\n", "- **BUT**: Memory access time (random): approx. 50 ns\n", " - Each operation needs to fetch at least one number from memory\n", " - Worst case: Random location:\n", " - $80\\times10^6\\times50\\times10^{-9}~s=4.0~s$ per step\n", " - **Total runtime** (\"**wall clock time**\"): $\\approx 4~\\mathrm{s/step}\\times3200~\\mathrm{steps}=3.5~\\mathrm{hours}$\n", "- Also, a lot of other \"book keeping\" during the model calculations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise - How heavy is a 3D model?\n", "\n", "- Make a similar runtime estimation for a 3D model with same resolution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Improving model performance\n", "\n", "The cure: Split the job onto multiple processors.\n", "\n", "- Each will have fewer operations to do\n", " - *Partitioning* of the job:\n", " - Each processor will handle its own grid points, or\n", " - Each processor will handle its own part in solving the coefficient matrix\n", "- Each will have a smaller memory region to worry about (can store numbers closer to the processing unit)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Modern computer architecture\n", "\n", "![Processor architecture](img/arch1.jpg)\n", "\n", "*Processor architecture of a 4-core processor ([http://sips.inesc-id.pt/~nfvr/msc_theses/msc10g/](http://sips.inesc-id.pt/~nfvr/msc_theses/msc10g/))*\n", "\n", "Modern PCs already use multiple *cores* (CPUs within one physical processor).\n", "\n", "- No speedup if the program/code used does not support multiple cores!\n", "- Limited (currently) to about 16 cores, typically 2-4\n", " - Some CPUs with larger numbers of cores (up to 72) exist, but are expensive\n", "- Some PC hardware allows two physical processors\n", " \n", "More cores can be used by interconnecting multiple physical computers (*nodes*)\n", "\n", "- Needs a fast way to communicate between computers\n", " - Faster is better (>10 Gb/s)\n", "- Needs a protocol for CPUs/nodes to discuss with each other in order to distribute (partition) the work\n", " - One of the most common: MPI (Message Passing Interface)\n", " \n", "![Architecture of a computing cluster](img/nodesNetwork.gif)\n", "\n", "*Architecture of a computing cluster*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The **geo-hpcc** computer cluster\n", "\n", "![The geo-hpcc cluster](img/IMG_1903.jpg)\n", "\n", "- 35 nodes, each with 2 processors, each with 8 cores = 560 cores" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "## Performance of parallel programs\n", "\n", "We will test the effect of running a code in parallel, using the *geo-hpcc* cluster.\n", "\n", "1. Login to the cluster using instructions at [https://introgm.github.io/2020/instructions/cluster.html](https://introgm.github.io/2020/instructions/cluster.html)\n", "2. Type\n", "```bash\n", "$ cd mpi\n", "$ srun -n 64 python mpi.py\n", "```\n", "3. To see and edit the Python code\n", "```bash\n", "$ nano mpi.py\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise - Timing parallel performance\n", "\n", "- Run the `mpi.py` script with different number of cores (modify the number after `-n`, try values between 1-400 cores). Keep record of the core count and time elapsed. We'll compile our results and plot them as a group.\n", " - What kind of relationship would you expect to see?\n", " - What do you actually see?\n", "- Try commands `squeue` and `sinfo` to see the job queue and the status of different nodes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parallel performance results\n", "You can find the results of the parallel performance exercise in the [exercise summary notebook](https://introgm.github.io/2020/day-4/parallel-performance.html)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" } }, "nbformat": 4, "nbformat_minor": 2 }