itsdrike.com/content/posts/concurrency-and-parallelism.md

---
title: Concurrency and Parallelism
date: 2021-11-17
tags: [programming, python]
---

Concurrency is an exciting topic that's becoming more and more important, yet I see so many people that aren't very
familiar with topic and it's possibilities. I'll try to explain the differences between threading, multiprocessing and
asynchronous run. I'll also show some examples when concurrency should be avoided, and when it makes sense.

I'll be talking about concurrency with python language in mind, but even if you don't use python, I still think that
you can learn a lot from this article if you aren't that familiar with key concurrency concepts. My hope is that after
you read this article, you will confidently know the differences between the various concurrency methods and their
individual advantages or disadvantages when compared to each other.

## Why concurrency?

In programming, we often have the need to do things very quickly so that our program isn't slow. But we also often need
to perform complex operations which take some time to actually compute. To cope with this, we can sometimes perform
certain tasks at the same time.

As an example, we can think about concurrency as the amount of lanes on a highway. If we have a highway with just one
single lane, all cars on it would have to use that lane and they would travel only as quickly as the slowest car in
front of them. But once we bring in another lane, we can already see huge improvements because the cars can go at their
own speeds on separate lanes and we can physically fit in more cars.

Similarly to this example, when we use concurrency, we allocate multiple physical CPUs/cores to a process, essentially
giving it more clock cycles, however not every task is suited for concurrent run, consider this example:

```python
x = my_function()
y = my_other_function(x)
```

We can clearly see that `my_other_function` is completely dependent on the result of `my_function`, this means that it
wouldn't make any sense to run these concurrently on 2 cores, because `my_other_function` would just wait for
`my_function` and only after it's finished will it start running. We just used 2 cores and did something slower than
with one core. It was slower because it took some time to send the result of `my_function` to `my_other_function`
running in a separate process.

This shows us that not all tasks are suited to run concurrently, but there are some that really could benefit from this
form of run. For example if we wanted to read the content of 200 different files, reading them one-by-one would take a
lot of time, but if we were able to read all 200 concurrently, it would only take us the duration of reading 1 file,
yet we would get the content of all 200 files. (Of course we're assuming that our disk would support reading 200 things
at once).

Even though this sounds good, it's never as simple as it first sounds. Even though it is true, on most machines, we
won't actually be able to run 200 things at once because we don't have 200 CPUs/cores. Concurrency like this will
always be limited by the hardware of the computer your software is running on. If you have a computer with 8 logical
CPUs/cores we can only run 8 things at once. Even though it obviously won't be as good as running 200 tasks at once, it
will still be way better than running single task at once. In our example, we would be able to get the results of all
200 tasks in the amount of time it would take to run 25 tasks sequentially, this is still a huge improvement.

## Threads vs Processes

Understanding the concept of concurrency is one thing, but how do we actually run our code on multiple cores/CPUs?
Luckily for us, this is all handled by the operating system that we're running. The kernel of this OS has to manage
thousands of processes with their threads that all have to run and all of those constantly fight to get as much CPU
clock cycles as they can. It is up to the OS to determine which processes are more important then others and to
interchange these processes so that each process gets enough CPU time, having multiple cores helps a lot because rather
than constantly swapping processes around on a single core, we can run n processes at once and the OS has less overall
swapping to do. But what are these processes and the threads attached to hem?

But what are Threads? The concept of a process is probably not a hard one to understand, it's just a separate program
that we started, but a thread is a bit more interesting than that. Threads are essentially a way for a single process
to do 2 things concurrently, yet keep existing in the shared-state of a single process. This means we don't have any
communication overhead and it's very easy to pass information along. However this property can often be disadvantage,
since threads work on a single shared state, we often need to use locks to properly communicate without causing issues.
(I'll explain the importance of locks with some examples later, but essentially, we need locks to prevent data loss
when 2 threads make a change to the same place in memory at once.)

As you were probably able to figure out, the advantage of processes is that they don't have these shared states,
processes are fully independent, however this is also a disadvantage because it is hard to communicate between these
processes. Since we don't have this shared state, if processes want to talk to each other, they need to find take the
objects from memory, serialize them and move them across a raw socket to another process, where it can get
deserialized. (This will most likely be done with `pickle` library in python.) This means processes have huge
communication cost compared to threads.

## Why do we need locks?

Consider this code:

```python
>>> import sys
>>> a = []
>>> b = a
>>> sys.getrefcount(a)
3
```

In the example here, we can see that python keeps a reference count for the empty list object, and in this case, it was
3. The list object was referenced by a, b and the argument passed to `sys.getrefcount`. If we didn't have locks,
threads could attempt to increase the reference count at once, this is a problem because what would actually happen
would go something like this:

> Thread 1: Read the current amount of references from memory (for example 5) <br>
> Thread 2: Read the current amount of references from memory (same as above - 5) <br>
> Thread 1: Increase this amount by 1 (we're now at 6) <br>
> Thread 2: Increase this amount by 1 (we're also at 6 in the 2nd thread) <br>
> Thread 1: Store the increased amount back to memory (we store this increased amount of 6 back to memory) <br>
> Thread 2: Store the increased amount back to memory (we store the increased amount of 6 to memory?) <br>

[Treat sections of 2 lines as things happening concurrently]

You can see that because threads 1 and 2 both read the reference amount from memory  at the same time, they read the
same number, then they've increased it and stored it back without ever knowing that some other thread is also in the
process of increasing the reference count but it read the same amount from memory as this process, so even though the
first thread stored the updated amount, the 2nd thread also stored the updated amount, except they were the same
amounts.

Suddenly we have no solid way of knowing how many references there actually are to our list which means it may get
removed by automated garbage collection because we've hit 0 references when we actually still have an active reference.
There is a way to circumvent this though, and that is with the use of locks

Dummy internal code:

```python
lock.acquire()
references = sys.getrefcount()
references += 1
update_references(references)
lock.release()
```

Here, before we even started to read the amount of references, we've acquired a lock, preventing other threads from
continuing and causing them to wait until a lock is released so that another thread can acquire it. With this code,
it wold go something like this:

> Thread 1: Try to acquire a shared lock between threads (lock is free, Thread 1 now has the lock) <br>
> Thread 2: Try to acquire a shared lock between threads (lock is already acquired by Thread 1, we're waiting) <br>
> Thread 1: Read the current amount of references from memory (for example 5) <br>
> Thread 2: Try to acquire the lock (still waiting) <br>
> Thread 1: Increase this amount by 1 (we're now at 6) <br>
> Thread 2: Try to acquire the lock (still waiting) <br>
> Thread 1: Store the increased amount back to memory (we now have 6 in memory) <br>
> Thread 2: Try to acquire the lock (still waiting) <br>
> Thread 1: Release the lock <br>
> Thread 2: Try to acquire the lock (success, Thread 2 now has the lock) <br>
> Thread 1: Finished (died) <br>
> Thread 2: Read the current amount of references from memory (read value 6 from memory) <br>
> Thread 2: Increase this amount by 1 (we're now at 7) <br>
> Thread 2: Store the increased amount back to memory (we now have 7 in memory) <br>
> Thread 2: Release the lock <br>
> Thread 2: Finished (died) <br>

We can immediately see that this is a lot more complex than having lock-free code, but it did fix our problem, we
managed to correctly increase the reference count across multiple threads. The question is, at what cost?

It takes a while to acquire or release a lock and these additional instructions slow down our code a lot, not to
mention that thread 2 was completely blocked while thread 1 had the lock and it was spending CPU cycles by sleeping and
waiting for the 1st thread to finish and release the lock. This is why threading can be quite complicated to deal with
and why some tasks should really stay single-threaded.

In this small example, it may be easy to understand what's going on, but if you add enough locks, it becomes
increasingly difficult to know whether there will be any "dead-locks" (this can happen when a thread acquires a lock
but never releases it, often the case if we forcefully kill a thread), to test your code, etc. Managing locks can
become a nightmare in a complex enough code-base.

Another problem about locks is, that they don't actually lock anything. Lock is essentially just a signal that can be
checked for and if it's active the thread can choose to wait until that signal is gone (the lock is released). But this
only happens if we actually check and if we decide to respect it, the threads are supposed to respect them, but there's
absolutely nothing preventing these threads from actually running anyway. If these threads forget to acquire a lock
they can do something that they shouldn't have been able to do. This means that even if we have a large code-base with
a lot of locks written correctly, it may not stay correct over time. Small adjustments to the code can cause it to
become incorrect in a way that's hard to see during code reviews.

## Debugging multi-threaded code

As an example, this is a multi-threaded code that will pass all tests and yet it is full of bugs:
```python
import threading

counter = 0

def foo():
    global counter
    counter += 1
    print(f"The count is {counter}")
    print("----------------------")

print("Starting")
for _ in range(5):
    threading.Thread(target=foo).start()
print("Finished")
```
When you run this code, you will most likely get a result that you would expect, but it is possible that you could also
get a complete mess, it's just not very likely because the code runs very quickly. This means you can write code
multi-threaded code that will pass all tests and still fail in production, which is very dangerous.

To actually debug this code, we can use a technique called "fuzzing". With it, we essentially add a random sleep delay
behind every instruction to ensure that it is safe if a switch happens during that time. But even with this technique,
it is advised to run the code multiple times because there is a chance of getting the correct result even with this
method since it always is one of the possibilities, this is why multi-threaded code can introduce a lot of problems.
This would be the code with this "fuzzing" method applied:
```python
import threading
import time
import random

def fuzz():
    time.sleep(random.random())

counter = 0

def foo():
    global counter

    fuzz()
    old_counter = counter
    fuzz()
    counter = old_counter + 1
    fuzz()
    print(f"The count is {counter}")
    fuzz()
    print("----------------------")

print("Starting")
for _ in range(5):
    threading.Thread(target=foo).start()
print("Finished")
```
You may also notice that I didn't just add `fuzz()` call to every line, I've also split the line that incremented
counter into 2 lines, one that reads the counter and another one that actually increments it, this is because
internally, that's what would be happening it would just be hidden away, so to add a delay between these instructions
I had to actually split the code like this. This makes it almost impossible to test multi-threaded code, which is a big
problem.

It is possible to fix this code with the use of locks, which would look like this:
```python
import threading

counter_lock = threading.Lock()
printer_lock = threading.Lock()

counter = 0

def foo():
    global counter
    with counter_lock:
        counter += 1
        with printer_lock:
            print(f"The count is {counter}")
            print("----------------------")

with printer_lock:
    print("Starting")

worker_threads = []
for _ in range(5):
    t = threading.Thread(target=foo)
    worker_threads.append(t)
    t.start()

for t in worker_threads:
    t.join()

with printer_lock:
    print("Finished")
```
As we can see, this code is a lot more complex than the previous one, it's not terrible, but you can probably imagine
that with a bigger codebase, this wouldn't be fun to manage.

Not to mention that there is a core issue with this code.  Even though the code will work and doesn't actually have any
bugs, it is still wrong. Why? When we use enough locks in our multi-threaded code, we may end up making it full
sequential, which is what happened here. Our code is running synchronously, with huge amount of overhead from the locks
that didn't need to be there and the actual code that would've been sufficient looks like this:
```python
counter = 0
print("Starting")
for _ in range(5)
    counter += 1
    print(f"The count is {counter}")
    print("----------------------")
print("Finished")
```
While in this particular case, it may be pretty obvious that there was no need to use threading at all, there are a lot
of cases in which it isn't as clear and I have seen some projects with code that could've been sequential but they were
already using threading for something else and so they made use of locks and added some other functionality, which made
the whole code completely sequential and they didn't even realize.

## Global Interpreter Lock in Python

As I said, this article is mainly based around the Python language, if you aren't interested in python, this part
likely won't be very relevant to you. However it is still pretty interesting to know how it works and why it isn't such
a huge issue as many claim it is. I also explain something about how threads are managed by the OS here which may be
interesting for you too.

Concurrency in python is a bit complicated because it has something called the "Global Interpreter Lock" (GIL), or at
least, that's what many people think, I actually quite like the GIL, this is what it does and why it actually isn't as
bad as many people think it is:

GIL solves the problem of needing countless locks all across the standard library. These locks would force the threads
to wait for some other thread that currently has the lock acquired which is inevitable at some places, as explained in
the above section. Removing the global lock and introducing this many smaller locks isn't even that complicated, just
time-taking, the real problem about it is that acquiring and releasing locks is expensive and takes some time, so not
only does removing GIL introduce a lot of additional complexity of dealing locks all over the standard library, it also
makes python a lot slower.

What's actually bad about GIL is the fact that it completely prevents 2 threads from being able to run parallel with
each other. 1 thread running at 1 core and 2nd thread running along the 1st one on another core. But this isn't as big
of an issue as it may sound. Even though we can't run more threads at once, i.e. there's no actual parallelism
involved, it doesn't prevent concurrency. Instead, these threads are constantly being switched around first we're at
thread 1, then thread 2, then thread 1, then back to thread 2, etc. The lock is constantly moving from one thread to
another.

But this interchanging of threads is happening in languages without any interpreter-wide lock. Every machine will have
limited amount of cores/CPUs at it's disposal and it is actually up to the OS itself to manage when a thread is
scheduled to run. The OS needs to determine the importance of each process and it's threads and decide which should run
and when. Sometimes it may happen that the OS will schedule 2 threads of the same process at once to be ran, which
wouldn't be possible with python due to GIL, but if other processes occupy the cores, every other thread on the system
is paused and waiting for the OS to start it again. This switching between the threads itself can happen at any
arbitrary instruction and we don't have control over it anyway.

So when would it make sense to even use threads if they can't run in parallel? Even though we don't have control over
when these switches happen, we do have control over when the GIL is passed, and the OS is clever enough to not schedule
in a thread that is currently waiting to acquire a lock, it will schedule to active thread that is actually doing
something. The advantage of threads is just that they can cleverly take turns to speed up the overall process. Say
you have a `time.sleep(10)` operation in one thread, we can pass the GIL over to another thread, that isn't currently
waiting and constantly check if the first thread is done yet, once it is, we can switch around between them at
arbitrary order, until again it makes more sense to run one thread over another, such as when a thread is sleeping.

## Threads vs Asynchronous run

As I explained in the last paragraph of the previous section about GIL, threads are always being interchanged for us,
we don't need any code that explicitly causes this switching, which is an advantage of threading. This interchanging
allows for some speed-ups and we don't need to worry about the switching ourselves at all!

But the cost to this convenience is that you have to assume a switch can happen at any time, this means we can hop over
to another thread after the first one finished reading data from memory, but it didn't yet store them back. This is why
we need locks. Threads switch preemptively, the system decides for us.

The limit on threads is the total CPU power we have minus the cost of task switches and synchronization overhead
(locks).

With asynchronous processing, we switch cooperatively, i.e. we use explicit code (`await` keyword in python) to cause a
task switch manually. This means that locks and other synchronization is no longer necessary. (In practice we actually
do still have locks even in async code, but they're much less common and many people don't even know about them because
they're simply not necessary in most cases)

With python's asyncio, the cost of task switches is incredibly low, because they internally use generators (awaitables)
and it is much quicker to restart a generator that stores it's all of it's state, than calling a pure python function
which has to build up a whole new stack frame on every call whereas a generator already has a stack frame and picks up
where it left off, this makes asyncio task switching the cheapest way to handle task-switching in python by far. In
comparison, you can run hundreds of threads, but tens of thousands of async tasks per second.

This makes async easier to get done than threads, and much faster and lighter-weight in comparison.
But nothing can be perfect, and async has it's downside too, one downside is that we have to perform the switches
cooperatively, so we need to add the `await` keyword to our code, but that's not very hard. The much more relevant
downside is that everything we now do has to be non-blocking. We can no longer simply read from a file, we need to
launch a task to read from a file, let it start reading and when the data is available, go back and pick it up. This
means we can't even use regular `time.sleep` anymore, instead, we need it's async alternative `await asyncio.sleep`.

This means that we need a huge ecosystem of support tools that adds the support for asynchronous alternatives to every
blocking synchronous operation, which increases the learning curve.

### Comparison

- Async maximizes CPU utilization because it has less overhead than threads
- Threading typically works with existing code and tools as long as locks are added around critical sections
- For complex systems, async is much easier to get right than threads with locks
- Threads require very little tooling (locks and queues)
- Async needs a lot of tooling (futures, event loops, non-blocking versions of everything)

## Conclusion

- If you need to run something in parallel, you will need to use multiprocessing because GIL prevents parallel threads
- If you need to run something concurrently, but not necessarily in parallel, you can either use threads or async
- Threads make more sense if you already have a huge code-base because they don't require rewriting everything to
  non-blocking versions you will just need to add some locks and queues
- Async make more sense if you know you will need concurrency from the start, since it helps to keep everything a lot
  more manageable and it's quicker than threads.