diff --git a/01projects/sec02IntroToCpp.md b/01projects/sec02IntroToCpp.md index 2851e556e..3343e2650 100644 --- a/01projects/sec02IntroToCpp.md +++ b/01projects/sec02IntroToCpp.md @@ -12,25 +12,26 @@ Since 1998, C++'s development has been governed by an ISO working group that col This course will assume the use of the `C++17` standard, which is presently widely supported by compilers. (Compiler support tends to lag behind the release of the C++ standard, since the compiler developers need time to implement it and check that the new compilers work properly!) -> \* The exact meaning of "low level" depends on whom you ask, but in general _lower level_ languages are languages more directly represent the way that the machine works, or give you control over more aspects of the machine, and _higher level_ languages abstract more of that away, and focus on declaring the behaviour that you want. Traditionally, a "low level" language refers to _assembly code_, which is where the instructions that you write are the instruction set of the machine itself! This code is by its nature specific to a given kind of machine and therefore not portable between systems, and doesn't express things in a way that is intuitive to most people. High level languages were introduced to make code easier to understand and more independent of the hardware; the highest level languages, like Haskell, are highly mathematical in their structure and give hardly any indication of how the computer works at all! C++ falls somewhere in the middle, with plenty of high level abstractions and portability, but it still gives us some features associated with low level programming like direct addressing of memory. This extra degree of control is very valuable when you need to get the best out of systems that require high performance or have limited resources. +> \* The exact meaning of "low level" depends on whom you ask, but in general _lower level_ languages are languages more directly represent the way that the machine works, or give you control over more aspects of the machine, and _higher level_ languages abstract more of that away, and focus on declaring the behaviour that you want. Traditionally, a "low level" language refers to _assembly code_, which is where the instructions that you write are (a human readable version of) the instruction set of the machine itself! This code is by its nature specific to a given kind of machine and therefore not portable between systems, and doesn't express things in a way that is intuitive to most people. High level languages were introduced to make code easier to understand and more independent of the hardware; the highest level languages, like Haskell, are highly mathematical in their structure and give hardly any indication of how the computer works at all! C++ falls somewhere in the middle, with plenty of high level abstractions and portability, but it still gives us some features associated with low level programming like direct addressing of memory. This extra degree of control is very valuable when you need to get the best out of systems that require high performance or have limited resources. ## Why are we using C++? The most common language for students to learn at present is probably Python, and many of you may have taken the Python based software engineering course last term. So why are we now changing to C++? 1. C++ is the standard language for high performance computing, both in research and industry. -2. C++ code runs much faster than native Python code. 
Those fast running Python libraries are written in C! As scientific programmers, we sometimes have to implement our own novel methods, which need to run efficiently. +2. C++ code runs much faster than native Python code. Those fast running Python libraries are written in C! As scientific programmers, we sometimes have to implement our own novel methods, which need to run efficiently. We can't always rely on someone else having implemented the tools that we need for us. 3. C++ is a great tool for starting to understand memory management better. - Most code that we write will not need us to allocate and free resources manually, but C++ gives us a clear understanding of when resources are allocated and freed, and this is important for writing effective and safe programs. - Many structures in C++ have easy to understand and well defined layouts in memory. The way that data is laid out in memory can have a major impact on performance as we shall see later, and interacting with some high performance libraries requires directly referencing contiguous blocks of memory. 4. C++ has strong support for object-oriented programming, including a number of features not present in Python. These features allow us to create programs that are safer and more correct, by allowing us to define objects which have properties that have particular properties (called _invariants_). For example, defining a kind of list that is always sorted, and can't be changed into an un-sorted state, means that we can use faster algorithms that rely on sorted data _without having to check that the data is sorted_. 5. C++ is multi-paradigm and gives us a lot of freedom in how we write our programs, making it a great language to explore styles and different programming patterns. +6. C++ has a static type system (as do many other languages), which is quite a big shift from Python's dynamic typing. Familiarity with this kind of type system is extremely useful if you haven't used it before, and as we will see it can help us to write faster code with fewer bugs. ### Why is C++ fast? Because a C++ program is compiled before the program runs, it can be much faster than interpreted languages. Not only is the program compiled to native machine code, the lowest-level representation of a program possible with today's CPUs, compilers are capable of performing clever optimisations to vastly improve runtimes. With C++, C, Rust, Fortran, and other natively compiled languages, there is no virtual machine (like in Java) or interpreter (like in Python) that could introduce overheads that affect performance. -Many languages use a process called _garbage collection_ to free memory resources, which adds run-time overheads and is less predictable than C++'s memory management system. In C++ we know when resources will be allocated and freed, and we can run without less computational overhead, at the cost of having to be careful to free any resources that we manually allocate. (Manually allocating memory is relatively rare in modern C++ practices! This is more common in legacy code or C code, with which you will sometimes need to interact.) +Many languages use a process called _garbage collection_ to free memory resources, which adds run-time overheads and is less predictable than C++'s memory management system. In C++ we know when resources will be allocated and freed, and we can run with less computational overhead, at the cost of having to be careful to free any resources that we manually allocate. 
(Manually allocating memory is relatively rare in modern C++ practices! This is more common in legacy code or C code, with which you will sometimes need to interact.) Static type checking also helps to improve performance, because it means that the types of variables do not need to be checked during run-time, and that extra type data doesn't need to be stored. @@ -48,12 +49,14 @@ C++ Pros: - Can write code which runs on exciting and powerful hardware like supercomputing clusters, GPUs, FPGAs, and more! - Can program for "bare metal", i.e. architectures with no operating system, making it appropriate for extremely high performance or restrictive environments such as embedded systems. - Static typing makes programs safer and easier to reason about. +- C++ is well known in high performance computing (HPC) communities, which is useful for collaborative work. C++ Cons: - Code can be more verbose than a language like Python. - C++ is a very large language, so there can be a lot to learn. - More control also means more responsibility: it's very possible to cause memory leaks or undefined behaviour if you misuse C++. - Compilation and program structure means there's a bit of overhead to starting a C++ project, and you can't run it interactively. This makes it harder to jump into experimenting and plotting things the way you can in the Python terminal. +- C++ is less well known in more general research communities, so isn't always the most accessible choice for collaboration outside of HPC. (You can also consider creating Python bindings to C or C++ code if you need the performance but your collaborators don't want to deal with the language!) For larger scale scientific projects where performance and correctness are critical, then C++ can be a great choice. This makes C++ an excellent choice for numerical simulations, low-level libraries, machine learning backends, system utilities, browsers, high-performance games, operating systems, embedded system software, renderers, audio workstations, etc, but a poor choice for simple scripts, small data analyses, frontend development, etc. If you want to do some scripting, or a bit of basic data processing and plotting, then it's probably not the best way to go (this is where Python shines). For interactive applications with GUIs other languages, like C# or Java, are often more desirable (although C++ has some options for this too). diff --git a/05libraries/sec01DesigningClasses.md b/05libraries/sec01DesigningClasses.md index 23997fde0..35c1d7d1d 100644 --- a/05libraries/sec01DesigningClasses.md +++ b/05libraries/sec01DesigningClasses.md @@ -135,115 +135,220 @@ The goal is to guarantee the following: - Resources that are required by the object exist for the full lifetime of the object. This will prevent invalid memory access attempts. - Resources that are allocated by the object do not exist for longer than the object itself. This will prevent memory leaks. -Since it's good practice to use smart pointers for any pointers which actually own data (and therefore we should not need to manually make calls to `delete` in our destructor), the main times when we need to be concerned with RAII are in dealing with opening and reading or writing resources such as files. 
+Since it's good practice to use smart pointers for any pointers which actually own data (and therefore we should not need to manually make calls to `delete` in our destructor), the main times when we need to be concerned with RAII are in dealing with opening and reading or writing resources such as files. However, RAII can also be very useful for interfacing with C libraries which deal with raw pointers and which have specialised methods for creating and freeing them (rather than using `new` and `delete`); it can also often be easier to deal with C-style arrays rather than vectors when programming with MPI or interfacing to other devices like GPUs. RAII typically means wrapping these resources that you want to use in some class: rather than accessing a file directly in a function, which could be interrupted by an exception before it can close the file, wrap the file in a class which will automatically close the file in the destructor if the object goes out of scope. Then use that class in your function to access your file. If something goes wrong and an exception is thrown, your file will be closed when the stack unwinds and the file wrapper object is deleted. +### RAII and Copying + +Special care needs to be taken when objects that implement the RAII pattern are allowed to be copied. C++ will, where possible, implement a _default_ [copy constructor](https://en.cppreference.com/w/cpp/language/copy_constructor), which allows the object to be copied, e.g. + +```cpp +MyObj obj1; +MyObj obj2 = obj1; // calls copy constructor to build obj2 by copying data from obj1 +``` + +The trouble with copies comes when we have classes which contain resources like raw pointers or file handles that need to be deleted or closed. + +```cpp +class MyObj +{ +public: + MyObj() + { + p = new int(5); + } + + ~MyObj() + { + delete p; + } + +private: + int *p; +}; +``` +- This class very responsibly places the allocation for the pointer in the constructor and the deallocation in the destructor, so creating a `MyObj` and letting it go out of scope will not cause any leaks. + +The problem with the raw pointer is that the default copy will simply copy the pointer across. This means that **both** `obj1` and `obj2` will contain a pointer **to the same address**, and consequenctly **both destructors will attempt to free it**. This is a double free error and will cause our program to crash! We have failed to properly model _ownership_ of the resources in the case of the copy: we must always know which objects own what resources. + +When smart pointers are not appropriate, we can control this copy behaviour by overriding or disabling the copy constructor. + +#### Overriding the Copy Constructor + +The copy constructor for a given type looks like this: + +```cpp +class MyObj +{ +public: + // copy constructor + MyObj(const MyObj &other) + { + ... + } + +... +``` +- It takes a (possibly `const`) _reference_ to an object of the same class as its argument. It's usually a good idea to make this a `const` reference since you probably don't want your copy operation to be able to alter the original object! +- It can take other parameters if you want **but** they must have default values supplied in the argument list. E.g. `MyObj(const MyObj &other, int i=2){...}`. + +We can override this to make a deep copy by having the new object's pointer point to a different memory location, and instead copy the _data_ that the first object points to into the new location as well. 
+ +```cpp +class MyObj +{ +public: + // copy constructor + MyObj(const MyObj &other) + { + p = new int(*other.p); + } + +... +``` + +Note that this deep copy means that the data that these two objects point to is now independent: changing one won't change the other because they are looking at different addresses. + +#### Disabling the Copy Constructor + +We can prevent copying entirely by disabling the copy constructor. + +```cpp +class MyObj +{ +public: + // copy constructor + MyObj(const MyObj &other) = delete; + +... +``` +This makes it a compilation error to try to copy the object, and therefore the our code `MyObj obj2 = obj1;` won't compile at all. + +There are more approaches that one can take to this problem depending on exactly what ownership behaviour you want, just **always remember to consider ownership when implementing classes with RAII**. + ## Decoupling Code with Abstract Classes & Dependency Injection Dependency injection is a commonly used technique to make a pair of classes which depend on one another _loosely coupled_, i.e. to make changes to one class as independent of the other class as possible. -Consider for example the case where we have one class which contains an instance of another. +Consider for example the case where we have one class which contains an instance of another. In this case, a `Simulation` class which contains a simple `Data` class. + ```cpp -class Bar +class Data { public: void print() { - cout << "BAR" << endl; + for(auto x: data) + { + std::cout << x << " "; + } + std::cout << std::endl; } + + private: + vector data; }; -class Foo +class Simulation { public: - Foo() + Simulation() { - myBar = std::unique_ptr(new Bar()); + data = std::unique_ptr(new Data()); } - void printBar() + void printData() { - myBar->print(); + data->print(); } private: - std::unique_ptr myBar; + std::unique_ptr data; }; ``` -- The definition of class `Foo` is dependent on the definition of class `Bar`. -- The constructor for `Foo` calls the constructor of `Bar` directly; if the constructor of `Bar` changes then the class `Foo` must also be changed. -- The class `Bar` may develop and contain functionality that is irrelevant to what `Foo` needs. +- The definition of class `Simulation` is dependent on the definition of class `Data`. +- The constructor for `Simulation` calls the constructor of `Data` directly; if the constructor of `Data` changes (because we have changed something about our data representation) then the class `Simulation` must also be changed. +- The class `Data` may develop and contain functionality that is irrelevant to what `Simulation` needs. Dependency injection is generally achieved by using an abstract class in place of a concrete type for a component of a class. The abstract class defines a interface that must be met by any class that you want to use, but does not enforce what exactly that class should be. This allows you to design a class which can be reused with different components which fulfil the same functionality depending on what you need it for. ```cpp -class AbstractBar +class AbstractSimData { public: virtual void print() = 0; }; -class Bar : public AbstractBar +class Data : public AbstractSimData { public: void print() { - cout << "BAR" << endl; + for(auto x: data) + { + std::cout << x << " "; + } + std::cout << std::endl; } + + private: + vector data; }; ``` -- `AbstractBar` is an abstract class, because its function `print` is not implemented. +- `AbstractSimData` is an abstract class, because its function `print` is not implemented. 
It defines the interface that any data class that wants to be used with the `Simulation` class would need to implement. - `print` is _pure_ and _virtual_ which means that it will always be overridden by a derived class. This defines a "contract": a set of functionality that anything which inherits from this abstract class _must_ implement. We can use such abstract classes to define minimal functionality required by other classes: this is sometimes referred to as an "interface". - Interfaces are a core language feature of some other languages like Java and C#, but are not explicitly implemented in C++. - In C++ we generally implement interfaces using abstract classes containing only pure virtual functions and variables. -The trick with dependency injection is to the then pass (or "inject") the component you want to use to a constructor or setter function. This is done at runtime rather than compile time, and means that different instances of the class can be instantiated with different components based on run-time considerations. +The trick with dependency injection is to then pass (a.k.a. "inject") the component you want to use to a constructor or setter function. This is done at runtime rather than compile time, and means that different instances of the class can be instantiated with different components based on run-time considerations. ```cpp -class Foo +class Simulation { - Foo(unique_ptr &inBar) + Simulation(unique_ptr &inData) { - myBar = std::move(inBar); + data = std::move(inData); } - void printBar() + void printData() { - myBar->print(); + data->print(); } private: - std::unique_ptr myBar; + std::unique_ptr data; }; ``` -- Now `Foo` works with an abstract class `AbstractBar`, which does not itself contain an implementation of `print`. -- Note that the `Foo` class now does not call the constructor for the `myBar` object: the `Bar` implementation can change completely as long as it still implements the `print` method, which is the only thing that we need from it in this example. +- Now `Simulation` works with an abstract class `AbstractSimData`, which does not itself contain an implementation of `print`. It doesn't care _how_ it gets done, just that it _can_ be done. +- Note that the `Simulation` class now does not call the constructor for the `data` object: the `Data` implementation can change completely as long as it still implements the `print` method, which is the only thing that we need from it in this example. The `Simulation` class is now _decoupled_ from any elements of the `Data` class that it does not directly need to know about and use. We gain even more flexibility by using a setter function. With this kind of structure we can also create classes that allow components to be swapped out during the lifetime of the object, meaning that the functionality of the object can be changed during runtime. ```cpp -class Foo +class Simulation { public: - Foo(unique_ptr &inBar) + Simulation(unique_ptr &inData) { - myBar = std::move(inBar); + data = std::move(inData); } - void setBar(unique_ptr &inBar) + void setData(unique_ptr &inData) { - myBar = std::move(inBar); + data = std::move(inData); } - void printBar() + void printData() { - myBar->print(); + data->print(); } private: - std::unique_ptr myBar; + std::unique_ptr data; }; ``` +- If we have two data sets `dataSet1` and `dataSet2` we can now change the data that the `Simulation` object looks at runtime without creating a new `Simulation` object. 
+- `dataSet1` and `dataSet2` don't even need to be the same type, as long as they are both of a type which inherits from `AbstractSimData`! ## Example: Strategy Pattern diff --git a/07performance/sec03Optimisation.md b/07performance/sec03Optimisation.md index 8577dda77..04d51d2dc 100644 --- a/07performance/sec03Optimisation.md +++ b/07performance/sec03Optimisation.md @@ -6,7 +6,7 @@ Estimated Reading Time: 45 minutes # Compiler Optimisation -Compilation is the translation of our high level code (in this case C++) into machine code that reflects the instruction set of the specific hardware for which it is compiled. This machine code can closely reflect the C++ code, implementing everything explicitly the way that it's written, or it can be quite different from the structure and form of the C++ code as long as it produces an equivalent program. The purpose of this restructuring is to provide optimisations, usually for speed. Modern compilers have a vast array of optimisations which can be applied to code as it is compiled to the extent that few people could write better optimised assembly code manually, a task that rapidly becomes infeasible and forbiddingly time consuming as projects become larger and more complex. +Compilation is the translation of our high level code (in this case C++) into machine code that reflects the instruction set of the specific hardware for which it is compiled. This machine code can closely reflect the C++ code, implementing everything explicitly the way that it's written, or it can be quite different from the structure and form of the C++ code **as long as it produces an equivalent program**. The purpose of this restructuring is to provide optimisations, usually for speed. Modern compilers have a vast array of optimisations which can be applied to code as it is compiled to the extent that few people could write better optimised assembly code manually, a task that rapidly becomes infeasible and forbiddingly time consuming as projects become larger and more complex. There is another benefit to automated compiler optimisation. Compilers, by necessity, do produce hardware specific output, as they must translate programs into the instruction set of a given processor. This means that even if we have written highly portable code which makes no hardware specific optimisations, we can still benefit from these optimisations if they can be done by the compiler when compiling our code for different targets! As we shall see below, some processors may have different features such as machine level instructions for vectorised arithmetic which can be implemented by the compiler without changing the C++ code, producing different optimised programs for different hardware from a single, generic C++ code. @@ -40,6 +40,34 @@ There are others in the documentation linked above which I would recommend that ## Some Optimisation Examples +In the following sections we will look at a few common compiler optimisations and their implications. It's important to bear in mind however that these optimisations are not _guaranteed_ to happen, and optimisation is usually a heuristic process. Analysing programs is complex, and you are certainly not guaranteed to get the optimal version of your compiled code out of the compiler, and the compiler will make decisions about what optimisations to apply where based on various rules of thumb that are different for each compiler (and target architecture). 
If you _need_ an optimisation to be applied that isn't guaranteed by the C++ standard then you should consider implementing it yourself directly in your source code. + +### Compile-time Calculations and Redundancy + +If the compiler is given an expression which it can calculate and replace with a value at compile time, then it may do so without changing the meaning of the program. As a simple example: + +```c++ +const int x = 12; +const int y = 8; + +int z = x + y; +``` + +- The compiler knows the values of `x` and `y`, and they can't have been changed (partly because there is no intervening code, and also because they are declared `const`). So at compile time the compiler can deduce that the initial value of `z` will be `20`. The compiler can replace the addition operation with a hard coded initialisation value. + +The compiler may also notice that certain variables are not accessed or used, or that the result of a calculation is thrown away, and therefore that some calculation can be avoided. In general it's good practice to remove any redundant variables or calculations yourself! (Remember that you can turn on compiler warnings to help with this.) + +You may find that this kind of optimisation can negatively affect simple benchmarks that you write. Let's say we want to benchmark a sorting algorithm: + +```c++ +vector unsorted = gen_list(1000'000); +auto t_start = std::chrono::high_resolution_clock::now(); +vector sorted = sort(unsorted); +auto t_end = std::chrono::high_resolution_clock::now(); +``` + +If our program never accesses `sorted` before it terminates (because we were only interested in the timing information), and the compiler can tell that `sort` does not have side-effects, then the compiler may recognise that this calculation is redundant and skip it entirely! When benchmarking with optimisations on you may have to force the program to actually do the work you're interested in, e.g. by accessing the result directly afterwards and making sure that the result won't be pre-calculated at compile time. When you benchmark things, make sure that you're getting sensible results and check how they scale with the problem size to make sure that work is actually being done. + ### Loop Unrolling Loops in C++ are usually directly modelled in the machine code as well, using conditional tests and "jump" statements (essentially `goto` statements). This means that when we have a loop in our code like this: @@ -69,15 +97,42 @@ v[7] = 7; Modern CPUs typically contain units specially designed for SIMD, or "Single Instruction Multiple Data", workflows. As the name suggests, SIMD refers to performing the same operation on multiple pieces of data at the same time. (This kind of behaviour is done at a much larger scale on accelerated devices like GPUs!) -A typical CPU (x86 architecture) SIMD register will be 128 bits, meaning that it can operate simultaneously on: +A typical CPU (x86 architecture or ARM) SIMD register will be 128 bits, meaning that it can operate simultaneously on: - 2 x `double` (each 64 bits) - 4 x `float` (32 bit) - 4 x `int` (32 bit) -and might contain 8 or 16 of these registers. +and might contain 8 or 16 of these registers. Many x86 processors will also have 256-bit or even 512-bit registers; making use of these requires compiling with additional flags since they are less common and therefore the code generated will be less portable. 
-Compilers can make use of these kinds of units as long as the calculations are independent - we can't calculate two things in parallel if one depends on the output of the other. Determining whether things are independent in this way, especially when there are loops involved, is not always trivial so you may not always get the most usage out of these registers. A loop like the following:
+#### Data Alignment for SIMD
+
+To get the best performance out of SIMD operations we need to consider data _alignment_. When we load, for example, four `float` values into a 16-byte (128-bit) SIMD register, we load 16 contiguous bytes of memory. The most efficient loading mechanism doesn't just load 16-byte pieces of memory starting at _any_ address; rather, the view of RAM is broken up into 16-byte sections, and you can load any one of these sections quickly. A load of four floats that crosses one of these boundaries is less efficient than one that is _aligned_ with the boundaries. Luckily we can align our data with specific boundaries in memory using the following syntax:
+
+```cpp
+// Aligned stack allocations
+// 4 floats aligned to 16 byte boundary
+alignas(16) float f16[4];
+
+// 24 floats aligned to 32 byte boundary.
+alignas(32) float f32[24];
+
+// Aligned heap allocations
+// four floats aligned to 16-byte boundary
+float *x = new (std::align_val_t(16)) float[4];
+// four doubles aligned to 32-byte boundary
+double *y = new (std::align_val_t(32)) double[4];
+```
+
+- For 128-bit registers in x86 and ARM processors you want to be aligned with 16-byte boundaries.
+- For 256-bit registers as in AVX you want to be aligned with 32-byte boundaries.
+    - 32-byte alignment will also work with 128-bit registers, which means it can allow for efficient vectorisation whether or not you have the larger SIMD registers, but is not necessary if you are worried about packing data as densely as possible in memory.
+- In the example above the first value of each array is aligned to the boundary. If we look at `f32`, which contains 24 floats aligned to a 32-byte boundary, then `f32[0]`, `f32[8]`, and `f32[16]` will all be aligned with 32-byte boundaries, because each float is 4 bytes and therefore there are 8 floats to a 32-byte block. This means that we can efficiently load the blocks `f32[0] ... f32[7]`, `f32[8] ... f32[15]`, and `f32[16] ... f32[23]` into registers for SIMD operations.
+- We _can_ load unaligned data into registers for SIMD operations, it just isn't quite as efficient.
+
+#### SIMD Optimisation and Loop Dependency
+
+Compilers can make use of these kinds of units as long as the calculations are independent - we can't calculate two things in parallel if one depends on the output of the other. Determining whether things are independent in this way, especially when there are loops involved, is not always trivial so you may not always get the most usage out of these registers. A loop like the following:
 ```cpp
 for(int i = 0; i < 4; i++)
@@ -95,7 +150,62 @@ for(int i = 0; i < 4; i++)
 }
-can't because there is loop dependency.
+can't because there is loop dependency. Calculating a sum like this in parallel would require reformulating the problem in a way that the compiler will not do by itself. Remember that **re-ordering floating point operations changes the result** and so many arithmetic processes cannot be automatically vectorised by the compiler, even if the vectorisation appears obvious.
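+
+As a concrete illustration of the kind of reformulation meant here, the sketch below (not taken from the course code; the function and variable names are purely illustrative) accumulates into four independent partial sums, so that consecutive iterations no longer depend on one another. Note that this is exactly the kind of re-ordering of floating point additions that the compiler will not perform for you, because it (slightly) changes the result.
+
+```cpp
+#include <cstddef>
+#include <vector>
+
+// Sum a vector of floats using four independent partial sums.
+// The four accumulations in the loop body are independent of each other,
+// so they are candidates for packing into a single SIMD register.
+float partial_sum(const std::vector<float> &v)
+{
+    float partial[4] = {0.0f, 0.0f, 0.0f, 0.0f};
+    std::size_t i = 0;
+    for(; i + 4 <= v.size(); i += 4)
+    {
+        partial[0] += v[i];
+        partial[1] += v[i + 1];
+        partial[2] += v[i + 2];
+        partial[3] += v[i + 3];
+    }
+    // handle any leftover elements at the end
+    for(; i < v.size(); ++i)
+    {
+        partial[0] += v[i];
+    }
+    return (partial[0] + partial[1]) + (partial[2] + partial[3]);
+}
+```
+
+Whether the compiler actually emits SIMD instructions for a loop like this still depends on the compiler, the target architecture, and the optimisation flags, but the chain of dependent additions that blocked vectorisation in the simple loop has been removed.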
+
+#### Manual SIMD
+
+SIMD can be manually implemented using [C++ intrinsics](https://learn.microsoft.com/en-us/cpp/intrinsics/compiler-intrinsics?view=msvc-170), which map very closely onto specific assembly level instructions. These are easier to use than writing directly in assembly, and can be used to enforce that you get the exact vectorisation strategy that you want, but because of their close relationship with low-level instructions these are not as portable as normal C++ code. x86 and ARM processors have similar functionality for the most part, but a completely different set of intrinsics. In order to write portable code with this kind of approach, programmers usually detect or specify the architecture at build time and use pre-processor directives to determine which intrinsics are used, what data alignments are required, and so on. That however is beyond the scope of this course!
+
+To see how intrinsics work, consider the example of adding floats in parallel using SIMD.
+1. You load 4 (or 8) floats into a 16 (or 32) byte register from your first memory address.
+2. You load 4 (or 8) floats into a 16 (or 32) byte register from your second memory address.
+3. You perform a vectorised addition operation.
+4. You place the resulting 4 (or 8) floats into your destination memory address.
+
+For x86 we need the following includes and intrinsics:
+```cpp
+    // 128 bit definitions
+    #include <xmmintrin.h>
+
+    // 256 bit definitions
+    #include <immintrin.h>
+
+    // _mm_load_ps takes a pointer to the first float of a pack of four floats
+    __m128 loaded_x = _mm_load_ps(address_x);
+    __m128 loaded_y = _mm_load_ps(address_y);
+
+    // vectorised addition, takes 2 __m128 arguments
+    __m128 result = _mm_add_ps(loaded_x, loaded_y);
+
+    // store value into memory address, takes a pointer and an __m128
+    _mm_store_ps(address, result);
+
+    // 256 bit intrinsics
+    // _mm256_load_ps loads a pack of eight floats into a __m256
+    __m256 loaded_x8 = _mm256_load_ps(address_x);
+    __m256 loaded_y8 = _mm256_load_ps(address_y);
+
+    // performing a vector addition
+    __m256 result8 = _mm256_add_ps(loaded_x8, loaded_y8);
+
+    // store
+    _mm256_store_ps(address, result8);
+```
+
+For ARM we need the following includes and intrinsics:
+
+```cpp
+    #include <arm_neon.h>
+
+    // Load data: takes a pointer to the first of the four floats
+    // float32x4_t is for 4 floats i.e. 128 bit
+    float32x4_t loaded_x = vld1q_f32(address_x);
+    float32x4_t loaded_y = vld1q_f32(address_y);
+
+    // Vectorised add: takes 2 float32x4_t type arguments
+    float32x4_t result = vaddq_f32(loaded_x, loaded_y);
+
+    // store in memory: takes a pointer and a float32x4_t argument and stores the result at that address
+    vst1q_f32(address, result);
+```
+
 ### Function Inlining
 
@@ -139,15 +249,17 @@ Floating point computation **is not exact**.
 - Adding values of very different sizes leads to significant loss of precision since values must be converted to have the same exponent to be added together. This means the difference in scale is pushed into the mantissa, which then loses precision due to leading `0` digits on the smaller number. In some cases the smaller number may be so small that the closest representable number with that exponent is `0` and so the addition is lost completely.
 - Subtracting values which are close in size leads to cancellation of many digits and a result with far fewer significant digits and therefore lower precision.
 - Identities from real arithmetic do not necessarily hold, in particular addition and multiplication are not associative, so $(a + b) + c \neq a + (b + c)$ in floating point!
-- Handling these difficulties in numerical methods is a major field in and of itself.
Many numerical algorithms are specially crafted to correct rounding errors in floating point arithmetic. +- Handling these difficulties in numerical methods is a major field in and of itself. Many numerical algorithms are specially crafted to correct rounding errors in floating point arithmetic. + +### Floating Point Precision -### Precision and Speed +Higher precision floating point numbers (like `double` as opposed to `float`) will give more accurate results when doing numerical work, but may also be slower to perform operations. Historically `double` operations have taken more time to compute than `float` operations, although this is no longer typically the case on modern CPUs. Nevertheless, if you are exploiting SIMD registers for maximal performance, fewer `double` values can fit in an available register and therefore fewer operations can be performed in a given amount of time. Some fast algorithms use single precision `float` or even half precision floating point numbers in areas where the results will not be significantly impacted by this. This is particularly common in areas like statistics and machine learning where the statistical variance is much larger than the precision of the floating point numbers. You must always bear in mind that the use of lower precision floating point types can lead to numerical instability from cancellation errors or division/multiplication by extreme values causing under or overflow. You should always have tests for robustness and precision to check that any compromises made to precision are acceptable. -Higher precision floating point numbers (like `double` as opposed to `float`) will give more accurate results when doing numerical work, but are also slower to perform calculations. `double` arithmetic takes more time, and if you are exploiting SIMD registers, fewer `double` values can fit in an available register. Some fast algorithms use single precision `float` or even half precision floating point numbers in areas where the results will not be significantly impacted by this. You should always have tests for robustness and precision to check that these compromises are acceptable. +### Optimisation of Floating Point Arithmetic -### Fast Math +Since floating point computation is not exact, many statements which are mathematically equivalent using real numbers are not equivalent in floating point. As mentioned above, compiler optimisations should not change the meaning of the code - the outcome of a calculation. Floating point operations therefore place limitations on the kinds of optimisations that can be applied, and if performance is a major consideration you should try to write out your floating point operations in the most efficient way possible to begin with. -Compilers can optimise numerical code for speed by rearranging arithmetic operations. Most C++ optimisations are designed not to change the results of the calculations as written but some, such as those enabled by `-ffast-math`, allow for rearrangement of arithmetic according to the rules of _real numbers_, for example rearranging associations. +Nevertheless, compilers can optimise numerical code for speed by rearranging arithmetic operations, even floating point operations. (Integer arithmetic can be rearranged by the compiler because it is exact.) While C++ optimisations are generally designed not to change the results of the calculations as written there are some, such as those enabled by `-ffast-math`, that allow for rearrangement of arithmetic according to the rules of _real numbers_. 
This allows for example the rearranging associations. This can be a powerful tool in some cases, not only allowing your numerical code to be rearranged into a more efficient format but also permitting the compiler to make the necessary reorderings of operations for vectorising floating point algorithms. There are however, significant drawbacks to using fast-math optimisations due to the way that it can change the meaning of your program. Suppose we have a large number $N$ and two much smaller numbers $a$ and $b$ and we want to calculate $N + a + b$. We know from the way that floating point numbers work that adding small and large numbers leads to rounding errors, so the best order to add these number is $(a+b) + N$, to keep the precision of $(a+b)$ and maximise the size of the smaller number that we add to $N$. This is why re-associations in optimisation can cause a problem. Numerical algorithms with factors which compensate for rounding errors can have their error corrections optimised away by fast-math, because for _real_ numbers the error corrections would be zero and the compiler can identify them as redundant! @@ -163,7 +275,8 @@ As we have mentioned already, debugging optimised code can be substantially more - Compile code without optimisations (or with `-Og`) for debugging and most development purposes other than profiling / benchmarking. - Turn on appropriate optimisations when compiling for actual deployment. -- Run your unit tests on both unoptimised and optimised code. +- Run your unit tests on both unoptimised and optimised code. + - If you have tests which pass when unoptimised but fail when optimised, then the chances are you have a bug in your code causing [**undefined behaviour**](https://en.cppreference.com/w/cpp/language/ub). ### Optional: Dumping Executables and Inspecting Assembly Code @@ -258,6 +371,7 @@ If I optimise my code however, my loop can look quite different: - `jne 11e0 ` jumps back to the start of the loop if the values in these are not equal. - The value in `%rax` which represents the loop iteration variable is now advanced by `0x10` (16), resulting in 25 iterations. This is what we might expect given that we are using an SIMD register which can handle 4 32-bit integers at a time. - Note that although I have multiple SIMD registers, I can still only perform one instruction at a time, which means I can't do more than 4 integer additions at once. -- We're now making use of SIMD instructions like `movdqa`, `paddd`, and `movaps` for _packed integers_ i.e. multiple integers stored as one string of data. +- We're now making use of SIMD instructions like `movdqa`, `paddd`, and `movaps` for _packed integers_ i.e. multiple integers stored as one string of data. + - Note that compiled programs will often use the vectorised registers `xmm_` and `ymm_` for sequential floating point arithmetic as well as vectorised arithmetic, so you need to look out specifically for the vectorised arithmetic instructions rather than just what registers are being used. There are many other changes made to this program due to the high level of optimisation asked for (`-03`), but this should illustrate the impact that compiler optimisation can have on the actual machine operation, and how we can inspect and understand this. 
diff --git a/09distributed_computing/sec01DistributedMemoryModels.md b/09distributed_computing/sec01DistributedMemoryModels.md index 63d3bb89e..a2a8c5a48 100644 --- a/09distributed_computing/sec01DistributedMemoryModels.md +++ b/09distributed_computing/sec01DistributedMemoryModels.md @@ -5,10 +5,10 @@ title: Distributed Memory Model # The Distributed Memory Model of Parallel Programming Last week we looked at the use of shared memory parallel programming. As a reminder, the shared memory model is used when each thread has access to the same memory space. -- Shared memory means that all threads can rapidly access the full data set.* +- Shared memory means that all threads can rapidly access the full data set. - Can lead to awkward problems like data races. - Sometimes requires additional structures like mutexes to be introduced. - - A mutex can refer to any solution which means that a variable can only be accessed by a single thread at a time. + - A mutex can refer to any solution which means that a variable can only be accessed by a single thread at a time. It is a contraction of "mutually exclusive". - The more concurrent threads we have operating on shared memory, the more pressure we put on resources which require access controls like mutexes, which can delay our program execution. - Shared memory is limited by the number of cores which can share on-chip RAM. This is generally not a very large number, even for high performance computing resources. The more cores we try to connect to a single piece of memory, the more latency there will be for at least some of these cores, and the more independent ports the memory banks will need to have. @@ -19,13 +19,13 @@ In the distributed memory model, we take the parallelisable part of our program Distributed memory programming is incredibly broad and flexible, as we've only specified that there are processes with private memory and some kind of message passing. We've said nothing about what each of the processes _does_ (they can all do entirely different things; not just different tasks but even entirely different programs), what those processes run _on_ (you could have many nodes in a cluster or a series of completely different devices), or what medium they use to communicate (they can all be directly linked up or they could be communicated over channels like the internet). The distributed memory model can apply to anything from running a simple program with different initial conditions on a handful of nodes in a cluster to running a client-server application with many users on computers and mobile devices to a world-wide payment system involving many different potential individuals, institutions, devices, and software. It can even apply to separate processes running on the _same core_ or on cores with shared memory, as long as the memory is partitioned in such a way that the processes cannot _access_ the same memory. (Remember when you write programs you use _virtual memory addresses_ which are mapped to a limited subset of memory as allocated by your OS; you generally have many processes running on the same core or set of cores with access to non-overlapping subsets of RAM.) -For our purposes, we will focus on code written for a multi-node HPC cluster, such as [UCL's Myriad cluster](https://www.rc.ucl.ac.uk/docs/Clusters/Myriad/), using the [MPI (Message Passing Interface) standard](https://www.mpi-forum.org/). 
We will, naturally, do our programming with C++, but it is worth noting that the MPI standard has been implemented for many languages including C, C#, Fortran, and Python. We will use the [Open MPI](https://www.open-mpi.org/) implementation. We won't be covering much programming in these notes, but focussing on the models that we use and their implications. +For our purposes, we will focus on code written for a multi-node HPC cluster, such as [UCL's Myriad cluster](https://www.rc.ucl.ac.uk/docs/Clusters/Myriad/), using the [MPI (Message Passing Interface) standard](https://www.mpi-forum.org/). We will, naturally, do our programming with C++, but it is worth noting that the MPI standard has been implemented for many languages including C, C#, Fortran, and Python. We will use the [Open MPI](https://www.open-mpi.org/) implementation. We won't be covering much programming in this section, but focussing on the models that we use and their implications. ## Aside: Task Parallelism It's worth addressing the fact that different processes or threads don't need to do identical work. If we need to calculate something of the form: -$y = f(x) \circ g(x)$ +$y = f(x) + g(x)$ then we can assign the calculation of $f(x)$ and $g(x)$ to separate processes or threads to run concurrently. Then there needs to be some kind of synchronisation (using a barrier or message passing) to bring the results together when they're both finished and calculate $y$. @@ -162,10 +162,11 @@ Over the last few weeks we've looked at a few different ways of approaching perf ### Mixing Distributed and Shared Memory Models -Distributed memory models have a lot of advantages in terms of simplifying memory management and, generally speaking, allowing for a greater amount of flexibility and scalability. However, data sharing in distributed systems is slow, which makes them very poorly suited to parallelising certain kinds of problems, especially those which are _memory bound_. There'd be no point in using a distributed system to transpose a matrix in parallel because passing the matrix components between processes would take longer than the memory reads to perform the transpose itself! But a matrix transpose _does_ parallelise easily in a shared memory system, and if you do an out of place transpose there is no chance of writing to the same place in memory! In general things like linear algebra and list methods work well using shared memory models but rarely scale to the point of benefitting from distributed systems, so if you have a program that uses a lot of these kinds of things (and can't be parallelised between processes using task parallelism in an effective way), you might want to consider including shared memory parallelism within independent processes. +Distributed memory models have a lot of advantages in terms of simplifying memory management and allowing for much more complex systems involving different machines. However, data sharing in distributed systems is slow, which makes them very poorly suited to parallelising certain kinds of problems which involve substantial data sharing. There'd be no point in using a distributed system to transpose a matrix in parallel because passing the matrix components between processes would take longer than the memory reads to perform the transpose itself! But a matrix transpose _does_ parallelise easily in a shared memory system, and if you do an out of place transpose there is no chance of writing to the same place in memory! 
Many of our basic list and matrix methods make more sense to perform with shared memory parallelism to avoid the messaging overheads. -Let's take for example likelihood sampling, a very common need in scientific applications. We have some parameter space which defines our model (e.g. the cosmological parameters $\Lambda$, $\Omega_m$, etc...) and we have a set of points in this space which represent models with different sets of values for these parameters, and we want to calculate the likelihood of each of these points. Calculating the likelihood will involve generating some observables from a model, and doing some kind of multi-variate Gaussian. +As an example of a complex problem which can take advantage of multiple levels of parallelism, let's consider likelihood sampling, a very common need in scientific applications. We have some parameter space which defines our model (for example properties of the universe that we want to measure in physics, or parameters describing a disease in epidemiology) and we have a set of points in this space which represent models with different sets of values for these parameters, and we want to calculate the likelihood of each of these sets of parameters given the data that we have. This is how we _infer_ model parameters from data. Calculating the likelihood will involve generating some observables from a model (i.e. calculating some function of these parameters), and doing some kind of multi-variate Gaussian (to compare to the data). -- Each likelihood calculation is computationally expensive, but only needs a small amount of data to initialise (just the set of parameters). A perfect example of something which can be allocated to separate processes in a distributed model! Each process is allocated one point in the parameter space and calculates the likelihood. There is no need for communication between processes to calculate the likelihood since they all work on independent points, and they only need to send a message to the parent process when they are done to report the calculated likelihood (again a very small amount of data). -- The likelihood calculation itself will generally involve a lot of steps which need to be ordered (we tend to calculate a lot of functions which depend on the results of other functions etc. in science) and some good old linear algebera (for our multi-variate Gaussian we will need to handle the $x^T \Sigma^{-1} x$ term for our covariance matrix $\Sigma$, which itself may need to be generated for each point). This would mean a lot of communication and potential stalls if we were to try to parallelise these operations in a distributed way, but likely some good opportunities for threading when we need to deal with vectors, matrices, integrations and so on. So each process could also be multi-threaded on cores with shared memory. +- Each model evaluation and likelihood calculation is computationally expensive, but only needs a small amount of data to initialise (just the set of parameters)[^data]. This is a perfect example of something which can be allocated to separate processes in a distributed model! Each process is allocated one point in the parameter space and calculates the likelihood. There is no need for communication between processes to calculate the likelihood since they all work on independent points, and they only need to send a message to the parent process when they are done to report the calculated likelihood (again a very small amount of data). 
+- The likelihood calculation itself will generally involve a lot of steps which need to be ordered (we tend to calculate a lot of functions which depend on the results of other functions etc. in scientific applications) and some linear algebra (for our multi-variate Gaussian we will need to handle the $x^T \Sigma^{-1} x$ term for our covariance matrix $\Sigma$, which itself may need to be generated for each point). This would mean a lot of communication and potential stalls if we were to try to parallelise these operations in a distributed way, but likely some good opportunities for threading when we need to deal with vectors, matrices, integrations and so on. So each process could also be multi-threaded on cores with shared memory.
+
+[^data]: The observable data _does_ need to be communicated to every process, _but only once_ since it does not change. So for each new calculation we only need the new parameter set.
\ No newline at end of file
diff --git a/10parallel_algorithms/AsynchronousMPI.md b/10parallel_algorithms/AsynchronousMPI.md
new file mode 100644
index 000000000..a290fdfc1
--- /dev/null
+++ b/10parallel_algorithms/AsynchronousMPI.md
@@ -0,0 +1,101 @@
+---
+title: Asynchronous MPI Programs
+---
+
+# Asynchronous Strategies
+
+Last week we focussed our attention on distributed programs in which each process completes an equal amount of work in a roughly equal period of time, and then shares its results either with all the other processes or with a master process. This kind of parallelism, in which the processes proceed in lock-step with one another, is called _synchronous_; by contrast an _asynchronous_ program allows the processes to proceed more independently without barriers which force all of the processes to be kept closely in sync.
+
+Asynchronous strategies are particularly useful when the amount of time that a process will take to produce a result that needs to be shared is unpredictable. Consider the following simple problem of cracking a password by brute force:
+
+- For an $n$-bit key there are $2^n$ possible keys
+- Each possible key is tried until a key is found which works to decrypt the message / access the resource etc.
+
+This brute force approach is easily parallelised by assigning each process a subset of the possible keys to check (so that no process repeats the work of another process). Here are two possible synchronous approaches to the problem:
+
+1. Frequent Updates:
+    - Each process tries one key and determines success or failure.
+    - Each process shares its result with the others and if any process is successful then the key is sent to the master process and all other processes stop.
+2. Update once at the end:
+    - Each process tries each key in its subset and determines success or failure each time.
+    - A process finishes its work when it finds the correct key or exhausts its subset of keys.
+    - Each process sends a message to the master process when it finishes its work.
+    - In general the master process has to wait for all processes to finish before it can receive messages since it doesn't know which process will finish first, and therefore doesn't know which message to check first.
+
+We can see that both of these approaches are sub-optimal in different ways.
The first allows for early termination once the key has been found, but wastes enormous amounts of time sending messages with no useful information since every process must report its failures, and this reporting blocks each of the processes from continuing their work until it has been determined whether the key has been found. The second approach avoids all this message sending wasting time on each process, but it is disastrous because most processes are doomed to fail (the key will only be found in one subset i.e. one process) and waiting for these to finish means waiting for an _exponential time_ algorithm to complete.
+
+An _asynchronous_ approach allows us to take advantage of the early termination of the first strategy, and the minimal messaging of the second strategy.
+
+The asynchronous approach is straight-forward to conceptualise in human terms:
+
+- Each process needs to check through its subset of keys one at a time.
+- If a process finds the key, then it gives the key to the master process and all other processes are told to stop.
+
+The problem here is that processes need a way of knowing that another process has sent them a message. Using `MPI_Recv` means that our process sits around _waiting for a matching message_, which we don't want to do since we don't know _which_ process will send us a message or when, so we are wasting valuable time and might be waiting for a message that never comes.
+
+If we know that a process _might_ receive a message, then we need to regularly _check_ whether a message has arrived or not. If a message has arrived, we can read it and act upon it, and if not we can continue to do something else while we wait.
+
+We can use this idea to make our asynchronous algorithm more explicit:
+
+- Each process loops over the keys in its subset
+- The master process loop:
+    - the master process checks its current key
+    - if the master process finds the key then it sends a message to the worker processes to stop.
+    - otherwise it checks to see if any of the other processes have sent it a message
+    - if they have then it stores the received key and sends a message to the other worker processes to stop
+    - if no processes have sent it a message it generates the next key and returns to the start of the loop
+- The worker process loop:
+    - the worker process checks its current key
+    - if the worker process finds the key then it sends a message to the master process with the key in it and can then stop.
+    - otherwise it checks to see if the master has sent it a message to stop
+    - if it is not told to stop then it generates the next key and returns to the beginning of the loop
+
+Note that _checking_ if a message has arrived does not require the process to wait for a message: if nothing has arrived it moves right along. Also note that even though each process is going through a similar loop, **each process' loop is independent of all the others**. Processes can move at very different speeds and they are not required to synchronise at each iteration.
+
+Checking for messages like this can feel a bit clunky, but it is necessary for any process to be aware of what is being sent to it. If you are working on an asynchronous design it's important to check regularly for updates to avoid processes wasting work on outdated information or a task which has already been completed elsewhere.
+
+## MPI Syntax For Asynchronous Methods
+
+In order to implement this kind of asynchronous strategy we need to introduce some new MPI functions.
+To check for a message without blocking we use the `MPI_Iprobe` function (`MPI_Probe` also exists, but it waits until a matching message arrives, which is exactly what we want to avoid here). The arguments are very similar to those of `MPI_Recv`, although they are in a different order and you don't need to know the size and type of the data being sent.
+
+- `int MPI_Iprobe(int source, int tag, MPI_Comm comm, int *flag, MPI_Status *status)`
+    - `source`, `tag`, `comm`, and `status` function the same as they do for `MPI_Recv`.
+    - `flag` is the really crucial part of this function. This is a **pointer** to an `int` which will be modified by the function: if a message with a matching source and tag has arrived then `*flag` will be set to `1`, and otherwise `*flag` will be set to `0`.
+
+We might implement this inside a loop as follows:
+
+```
+while(!complete)
+{
+    // do work for this iteration
+    ...
+
+    // check for messages from other ranks
+    for(int rank : process_ranks)
+    {
+        int message_found = 0;
+        MPI_Iprobe(rank, 0, MPI_COMM_WORLD, &message_found, MPI_STATUS_IGNORE);
+        if(message_found)
+        {
+            complete = true;
+            break;
+        }
+    }
+}
+
+// terminate processes and clean up
+...
+```
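+
+As a rough sketch of how these pieces might fit together for the brute-force key search, the example below has the master (rank 0) and the workers each looping over their own keys and using `MPI_Iprobe` to check for messages without blocking. The `key_matches` function, the size of the key space, and the round-robin split of the keys are all hypothetical placeholders, and we assume the key exists somewhere in the key space; a real implementation would differ in the details.
+
+```
+#include <mpi.h>
+#include <cstdint>
+#include <iostream>
+
+// Hypothetical stand-in for the real decryption test.
+bool key_matches(std::uint64_t key) { return key == 123456789ULL; }
+
+int main(int argc, char** argv)
+{
+    MPI_Init(&argc, &argv);
+    int rank = 0, size = 1;
+    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
+    MPI_Comm_size(MPI_COMM_WORLD, &size);
+
+    const std::uint64_t n_keys = 1ULL << 32;  // toy key space
+    const int TAG_FOUND = 1, TAG_STOP = 2;
+    bool complete = false;
+    bool stop_consumed = false;  // has this worker already received its stop message?
+
+    // Round-robin split of the key space: rank r checks keys r, r + size, r + 2*size, ...
+    for (std::uint64_t key = rank; key < n_keys && !complete; key += size)
+    {
+        if (key_matches(key))
+        {
+            if (rank == 0) std::cout << "Master found key " << key << std::endl;
+            else MPI_Send(&key, 1, MPI_UINT64_T, 0, TAG_FOUND, MPI_COMM_WORLD);  // report to master
+            complete = true;
+        }
+        else if (rank == 0)
+        {
+            // Master: check, without blocking, whether any worker has reported the key.
+            int message_found = 0;
+            MPI_Iprobe(MPI_ANY_SOURCE, TAG_FOUND, MPI_COMM_WORLD, &message_found, MPI_STATUS_IGNORE);
+            if (message_found)
+            {
+                std::uint64_t found_key = 0;
+                MPI_Recv(&found_key, 1, MPI_UINT64_T, MPI_ANY_SOURCE, TAG_FOUND, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
+                std::cout << "A worker found key " << found_key << std::endl;
+                complete = true;
+            }
+        }
+        else
+        {
+            // Worker: check, without blocking, whether the master has told us to stop.
+            int stop_message = 0;
+            MPI_Iprobe(0, TAG_STOP, MPI_COMM_WORLD, &stop_message, MPI_STATUS_IGNORE);
+            if (stop_message)
+            {
+                int dummy = 0;
+                MPI_Recv(&dummy, 1, MPI_INT, 0, TAG_STOP, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
+                stop_consumed = true;
+                complete = true;
+            }
+        }
+    }
+
+    if (rank == 0)
+    {
+        if (!complete)  // master exhausted its own subset: wait for a worker to report the key
+        {
+            std::uint64_t found_key = 0;
+            MPI_Recv(&found_key, 1, MPI_UINT64_T, MPI_ANY_SOURCE, TAG_FOUND, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
+            std::cout << "A worker found key " << found_key << std::endl;
+        }
+        // Once the key is known, tell every worker to stop.
+        for (int worker = 1; worker < size; ++worker)
+        {
+            int stop = 1;
+            MPI_Send(&stop, 1, MPI_INT, worker, TAG_STOP, MPI_COMM_WORLD);
+        }
+    }
+    else if (!stop_consumed)
+    {
+        // A worker that found the key (or ran out of keys) still drains its stop
+        // message so that nothing is left unmatched when we finalise.
+        int dummy = 0;
+        MPI_Recv(&dummy, 1, MPI_INT, 0, TAG_STOP, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
+    }
+
+    MPI_Finalize();
+    return 0;
+}
+```
+
+In practice you would probably test a whole batch of keys between probes, since probing for messages after every single key would add a lot of overhead relative to the tiny amount of work each key check involves.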
+
+# Optional: Asynchronous Time Dependent Simulations
+
+Oftentimes in the sciences we are working on _simulations_, typically updating a time-dependent system by some time-step in a loop. We can parallelise simulations by dividing the domain of the simulation between processes, as in the example of dividing a grid from last week. In simulations, however, information often needs to be communicated across the boundaries of the sub-domains: think, for example, of a moving particle crossing from one quadrant of space to another. In order for our simulation to be consistent we need to make sure that the information which is communicated from one sub-domain to the other happens consistently *in time*: if the particle leaves one quadrant at time $t$ then it also needs to enter the other quadrant at time $t$. This is automatically the case in a synchronous simulation with a barrier at the end of each iteration to exchange information and update the global clock. But what about an _asynchronous_ simulation? In this case different processes -- different sub-domains of the simulation -- could have reached different simulation times $t_p$ for each process $p$. How can we communicate information across a boundary like this?
+
+Let's assume that information must pass from process $p$ to $q$, i.e. $p$ is sending some information which will affect the subsequent simulation on $q$. There are three cases to consider:
+1. $t_p = t_q$: the information can be exchanged just as in the synchronous simulation.
+2. $t_p > t_q$: process $p$ is ahead of process $q$, so $q$ is not yet ready to receive the information. In this case the information can be stored until $q$ catches up, and then the update can take place. Process $p$ can continue simulating as normal while this is happening, so the simulation is not blocked.
+3. $t_p < t_q$: process $q$ is ahead of process $p$. This is the most challenging case because all timesteps that $q$ has calculated which are $> t_p$ are now invalidated. In this case we have wasted some work, and we must also be able to **backtrack** process $q$ to time $t_p$, then perform the update using the information from process $p$ and restart the simulation on $q$. This approach requires the simulation to be reversible in some way.
+    - Simulations typically can't be reversed analytically in this way and so this may require storing some kind of state history, which can use significant memory.
+    - Depending on your balance of resources, you may only be able to store occasional snapshots of the state and therefore have to backtrack _further_ than $t_p$ and evolve forward to $t_p$ again before being able to perform the update.
+    - You can also store the _changes_ to the state rather than the state itself for each time step (like a git commit!), and roll back these changes in order to reach a previous state.
+    - The best strategy will be strongly dependent on your specific simulation problem and the computing resources available to you; a minimal sketch of the snapshot-and-replay approach is given below.
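+
+To make the snapshot idea a little more concrete, here is a minimal sketch (in ordinary C++, independent of MPI) of the sort of bookkeeping a process might keep. The `State`, `SnapshotHistory` and `step` names are hypothetical, and a real simulation would still need to apply the incoming boundary information after rolling back, and to choose a snapshot interval that balances memory use against recomputation.
+
+```
+#include <functional>
+#include <map>
+#include <vector>
+
+// State of one sub-domain: the simulation time and the field values.
+struct State
+{
+    double time;
+    std::vector<double> values;
+};
+
+// Stores occasional snapshots of the state so that the simulation can be
+// rolled back when a neighbouring process sends information from our past.
+class SnapshotHistory
+{
+public:
+    explicit SnapshotHistory(double snapshot_interval) : interval(snapshot_interval) {}
+
+    // Record a snapshot if enough simulation time has passed since the last one.
+    void maybe_record(const State& state)
+    {
+        if (snapshots.empty() || state.time - snapshots.rbegin()->first >= interval)
+        {
+            snapshots[state.time] = state;
+        }
+    }
+
+    // Roll back to the most recent snapshot taken at or before t_target, then
+    // re-evolve forward to t_target using the supplied time-stepping function.
+    // Assumes a snapshot at or before t_target exists (e.g. the initial state
+    // is always recorded) and that step() advances state.time.
+    State rollback_to(double t_target, const std::function<State(const State&)>& step) const
+    {
+        auto it = snapshots.upper_bound(t_target);  // first snapshot strictly after t_target
+        --it;                                       // hence the last snapshot at or before it
+        State state = it->second;
+        while (state.time < t_target)
+        {
+            state = step(state);
+        }
+        return state;
+    }
+
+private:
+    double interval;                    // minimum simulation time between snapshots
+    std::map<double, State> snapshots;  // snapshots keyed by simulation time
+};
+```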
diff --git a/10parallel_algorithms/index.md b/10parallel_algorithms/index.md
index 228bb8791..9bdad2cd2 100644
--- a/10parallel_algorithms/index.md
+++ b/10parallel_algorithms/index.md
@@ -4,6 +4,7 @@ title: "Week 10: Work Depth Models and Parallel Strategies"

 ## Week 10: Overview

-This week we'll take a deeper look at how we quantify how parallelisable algorithms are, and discuss some broad strategies for tacking parallel problems. These approaches apply to both shared and dsitributed memory systems.
+This week we'll take a deeper look at how we quantify how parallelisable algorithms are, and discuss some broad strategies for tackling parallel problems. These approaches apply to both shared and distributed memory systems. We'll also look at how to program with MPI for asynchronous computations - those in which the processes do not necessarily run in lock-step with one another.

 1. [Work Depth Models](WorkDepth.html)
+2. [Asynchronous MPI](AsynchronousMPI.html)
\ No newline at end of file
diff --git a/index.md b/index.md
index 9071713a2..f3b3cbd7b 100644
--- a/index.md
+++ b/index.md
@@ -7,24 +7,22 @@ slidelink: false

 ## Introduction

 In this course, we build on your knowledge of C++ to enable you to work on complex numerical codes for research.
-Research software needs to handle complex mathematics quickly, so the course focuses on writing software to perform multi-threaded computation. But research software is also
-very complicated, so the course also focuses on techniques which will enable you to build software which is easy for colleagues
-to understand and adapt.
+Research software frequently needs to handle complex mathematics quickly, so the course will introduce techniques for writing performant and parallelised programs; however, research software is also a complicated research output which must be understood, trusted, and maintained by your colleagues, and so the course will also teach techniques to enable you to build software which is robust, well tested, and maintainable.

 ## Pre-requisites

-We have found in previous years that C++ is no longer commonly taught at undergraduate level. As such, we do not expect deep prior understanding of C++ but **we do expect prior programming experience and some basic C++ knowledge**. This should include:
+We have found in previous years that C++ is no longer commonly taught at undergraduate level. As such, we do not expect deep prior understanding of C++, but **prior programming experience is essential**, and some knowledge of C/C++ is very useful. Prior programming experience should include:

-* Variables, loops and control flow statements (like `if`)
-* Arrays and structures
-* Basic object oriented design (classes, inheritance, polymorphism)
+* Variables, loops and control flow statements (like `if`/`else`)
+* Basic containers like arrays / vectors / lists / dictionaries
+* Basic object oriented design (classes, member variables / functions)

-This could be obtained through online resources such as the the C++ Essential Training course by Bill Weinman on [LinkedIn Learning](https://www.ucl.ac.uk/isd/linkedin-learning) (accessible using your UCL single sign-on) or via a variety of C++ courses in college, such as [MPHYGB24](https://moodle.ucl.ac.uk).
+You may find the C++ Essential Training course by Bill Weinman on [LinkedIn Learning](https://www.ucl.ac.uk/isd/linkedin-learning) (accessible using your UCL single sign-on) to be a useful resource throughout the course if you want to spend more time on your introduction to C++.

-* Eligibility: This course designed for UCL post-graduate students but with agreement of their course tutor a limited number of undegraduate students can also take it.
+* Eligibility: This course is designed for UCL post-graduate students, but with the agreement of their course tutor a limited number of undergraduate students can also take it. Due to the introductory nature of some of the computing material, this course is not appropriate for those who have undertaken undergraduate studies in Computer Science.

 ## Registration

 Members of doctoral training schools, or Masters courses who offer this module as part of their programme should register through their course organisers.

-This course may not be audited without the prior permission of the course organiser Dr. Jamie Quinn as due to the practical nature of the lectures there is a cap on the total number of students who can enrol.
+This course may not be audited without the prior permission of the course organiser Dr. Michael McLeod, as the practical nature of the lectures means there is a cap on the total number of students who can enrol.