---
title: Concurrency and Parallelism
date: 2021-11-17
tags: [programming]
---

Concurrency is an exciting topic that's becoming more and more important, yet I see so many people who aren't very
familiar with the topic and its possibilities. I'll try to explain the differences between threading, multiprocessing
and asynchronous execution. I'll also show some examples of when concurrency should be avoided, and when it makes
sense.

I'll be talking about concurrency with the Python language in mind, but even if you don't use Python, I still think
you can learn a lot from this article if you aren't that familiar with the key concurrency concepts. My hope is that
after you read this article, you will confidently know the differences between the various concurrency methods and
their individual advantages and disadvantages when compared to each other.

## Why concurrency?

In programming, we often need our programs to be fast, yet we also often need to perform complex operations that take
some time to actually compute. To cope with this, we can sometimes perform certain tasks at the same time.

As an example, we can think about concurrency as the number of lanes on a highway. If we have a highway with just one
single lane, all cars have to use that lane, and each one can only travel as quickly as the slowest car in front of
it. But once we bring in another lane, we immediately see huge improvements, because the cars can go at their own
speeds in separate lanes and we can physically fit in more cars.

Similarly to this example, when we use concurrency, we allocate multiple physical CPUs/cores to a process, essentially
giving it more clock cycles. However, not every task is suited for concurrent execution; consider this example:

```py
x = my_function()
y = my_other_function(x)
```

We can clearly see that `my_other_function` is completely dependent on the result of `my_function`. This means that it
wouldn't make any sense to run these concurrently on 2 cores, because `my_other_function` would just wait for
`my_function` to finish before it could start running. We just used 2 cores and did something slower than we could
with one core: it was slower because it took some time to send the result of `my_function` over to `my_other_function`
running in a separate process.

This shows us that not all tasks are suited to run concurrently, but some really could benefit from this form of
execution. For example, if we wanted to read the contents of 200 different files, reading them one by one would take a
lot of time, but if we were able to read all 200 concurrently, it would only take us the duration of reading 1 file,
yet we would get the contents of all 200 files. (Of course, we're assuming that our disk could handle 200 reads at
once.)

Even though this sounds good, it's never as simple as it first seems. On most machines, we won't actually be able to
run 200 things at once, because we don't have 200 CPUs/cores. Concurrency like this will always be limited by the
hardware of the computer your software is running on. If you have a computer with 8 logical CPUs/cores, you can only
run 8 things at once. Even though that obviously won't be as good as running 200 tasks at once, it will still be way
better than running a single task at a time. In our example, we would get the results of all 200 tasks in the amount
of time it would take to run 25 tasks sequentially, which is still a huge improvement.

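To make this concrete, here's a minimal sketch of the file-reading idea using a thread pool. The file paths and the
pool size are made up for illustration:

```py
from concurrent.futures import ThreadPoolExecutor


def read_file(path: str) -> str:
    """Read a single file; runs in a worker thread."""
    with open(path) as f:
        return f.read()


# Hypothetical list of 200 files we want to read.
paths = [f"data/file_{n}.txt" for n in range(200)]

# With 8 workers, the reads are processed roughly 8 at a time, so the
# total time is close to that of 200/8 = 25 sequential reads.
with ThreadPoolExecutor(max_workers=8) as pool:
    contents = list(pool.map(read_file, paths))
```
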
## Threads vs Processes

Understanding the concept of concurrency is one thing, but how do we actually run our code on multiple cores/CPUs?
Luckily for us, this is all handled by the operating system we're running. The kernel of this OS has to manage
thousands of processes, along with their threads, all of which have to run and constantly fight to get as many CPU
clock cycles as they can. It is up to the OS to determine which processes are more important than others and to
interchange these processes so that each one gets enough CPU time. Having multiple cores helps a lot here, because
rather than constantly swapping processes around on a single core, we can run n processes at once and the OS has less
overall swapping to do. But what are these processes and the threads attached to them?

The concept of a process is probably not a hard one to understand: it's just a separate program that we started. A
thread, however, is a bit more interesting. Threads are essentially a way for a single process to do 2 things
concurrently while still existing within the shared state of that one process. This means there is no communication
overhead and it's very easy to pass information along. However, this property can often be a disadvantage: since
threads work on a single shared state, we often need to use locks to communicate properly without causing issues.
(I'll explain the importance of locks with some examples later, but essentially, we need locks to prevent data loss
when 2 threads make a change to the same place in memory at once.)

As you were probably able to figure out, the advantage of processes is that they don't have these shared states;
processes are fully independent. However, this is also a disadvantage, because it makes it hard for processes to
communicate. Since there is no shared state, if processes want to talk to each other, they need to take the objects
from memory, serialize them, and move them across a raw socket to the other process, where they get deserialized.
(In Python, this will most likely be done with the `pickle` library.) This means processes have a huge communication
cost compared to threads.

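As a rough sketch of what that communication looks like in Python, here's one process sending a list to another
through a `multiprocessing.Queue` (under the hood, the object is pickled, sent across, and unpickled):

```py
import multiprocessing


def worker(queue):
    # This list is pickled, pushed through a pipe to the
    # parent process, and unpickled on the other side.
    queue.put([1, 2, 3])


if __name__ == "__main__":
    queue = multiprocessing.Queue()
    process = multiprocessing.Process(target=worker, args=(queue,))
    process.start()
    print(queue.get())  # [1, 2, 3]
    process.join()
```
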
## Why do we need locks?

Consider this code:

```py
>>> import sys
>>> a = []
>>> b = a
>>> sys.getrefcount(a)
3
```

In this example, we can see that Python keeps a reference count for the empty list object; in this case, it was 3.
The list object was referenced by `a`, `b` and the argument passed to `sys.getrefcount`. If we didn't have locks, two
threads could attempt to increase the reference count at the same time. This is a problem, because what would
actually happen would go something like this:

> Thread 1: Read the current amount of references from memory (for example 5)
> Thread 2: Read the current amount of references from memory (same as above - 5)
> Thread 1: Increase this amount by 1 (we're now at 6)
> Thread 2: Increase this amount by 1 (we're also at 6 in the 2nd thread)
> Thread 1: Store the increased amount back to memory (we store the increased amount of 6 back to memory)
> Thread 2: Store the increased amount back to memory (we store the same increased amount of 6 back to memory)

[Treat each section of 2 lines as things happening concurrently]

You can see that because threads 1 and 2 both read the reference count from memory at the same time, they read the
same number. Each then increased it and stored it back without ever knowing that another thread was also in the
middle of increasing the reference count. So even though both threads stored an updated amount, they stored the same
amount: two increments happened, but the count only went up by one.

Suddenly we have no solid way of knowing how many references there actually are to our list, which means it may get
removed by automated garbage collection once we hit 0 references, while we actually still have an active reference.
There is a way to circumvent this though, and that is with the use of locks.

Dummy internal code:

```py
lock.acquire()
references = sys.getrefcount(obj)
references += 1
update_references(obj, references)
lock.release()
```

Here, before we even started to read the amount of references, we acquired a lock, preventing other threads from
continuing and causing them to wait until the lock is released so that another thread can acquire it. With this code,
it would go something like this:

> Thread 1: Try to acquire a shared lock between threads (lock is free, Thread 1 now has the lock)
> Thread 2: Try to acquire a shared lock between threads (lock is already acquired by Thread 1, we're waiting)
> Thread 1: Read the current amount of references from memory (for example 5)
> Thread 2: Try to acquire the lock (still waiting)
> Thread 1: Increase this amount by 1 (we're now at 6)
> Thread 2: Try to acquire the lock (still waiting)
> Thread 1: Store the increased amount back to memory (we now have 6 in memory)
> Thread 2: Try to acquire the lock (still waiting)
> Thread 1: Release the lock
> Thread 2: Try to acquire the lock (success, Thread 2 now has the lock)
> Thread 1: Finished (died)
> Thread 2: Read the current amount of references from memory (read value 6 from memory)
> Thread 2: Increase this amount by 1 (we're now at 7)
> Thread 2: Store the increased amount back to memory (we now have 7 in memory)
> Thread 2: Release the lock
> Thread 2: Finished (died)

We can immediately see that this is a lot more complex than lock-free code, but it did fix our problem: we managed to
correctly increase the reference count across multiple threads. The question is, at what cost?

It takes a while to acquire or release a lock, and these additional instructions slow down our code considerably. Not
to mention that thread 2 was completely blocked while thread 1 held the lock: it spent its whole time slice just
waiting for the 1st thread to finish and release the lock. This is why threading can be quite complicated to deal
with and why some tasks should really stay single-threaded.

In this small example, it may be easy to understand what's going on, but if you add enough locks, it becomes
increasingly difficult to know whether there will be any "dead-locks" (for example when a thread acquires a lock but
never releases it, often because it was forcefully killed, or when two threads each wait on a lock the other one
holds), to test your code, etc. Managing locks can become a nightmare in a complex enough code-base.

Another problem with locks is that they don't actually lock anything. A lock is essentially just a signal that can be
checked for, and if it's active, a thread can choose to wait until that signal is gone (the lock is released). But
this only happens if we actually check for the lock and decide to respect it. Threads are supposed to respect locks,
but there's absolutely nothing preventing them from running anyway. If a thread forgets to acquire a lock, it can do
something that it shouldn't have been able to do. This means that even if we have a large code-base with a lot of
locks written correctly, it may not stay correct over time: small adjustments to the code can make it incorrect in a
way that's hard to see during code reviews.

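A tiny sketch of this advisory nature (hypothetical code): the second function mutates the shared list without ever
touching the lock, and nothing stops it:

```py
import threading

lock = threading.Lock()
shared = []


def well_behaved():
    with lock:  # respects the lock
        shared.append(1)


def rogue():
    shared.append(2)  # nothing prevents this unguarded write
```
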
## Debugging multi-threaded code

As an example, here is some multi-threaded code that will pass all tests and yet is full of bugs:
```py
import threading

counter = 0


def foo():
    global counter
    counter += 1
    print(f"The count is {counter}")
    print("----------------------")


print("Starting")
for _ in range(5):
    threading.Thread(target=foo).start()
print("Finished")
```
When you run this code, you will most likely get the result you would expect, but it is also possible to get a
complete mess; it's just not very likely, because the code runs very quickly. This means you can write multi-threaded
code that will pass all tests and still fail in production, which is very dangerous.

To actually debug this code, we can use a technique called "fuzzing". With it, we essentially add a random sleep
delay after every instruction, simulating a thread switch at that point, so we can check whether the code stays
correct if a switch happens there. But even with this technique, it is advised to run the code multiple times,
because there is still a chance of getting the correct result: it always remains one of the possibilities. This is
why multi-threaded code can introduce a lot of problems. This is the code with the "fuzzing" method applied:
```py
import threading
import time
import random


def fuzz():
    time.sleep(random.random())


counter = 0


def foo():
    global counter

    fuzz()
    old_counter = counter
    fuzz()
    counter = old_counter + 1
    fuzz()
    print(f"The count is {counter}")
    fuzz()
    print("----------------------")


print("Starting")
for _ in range(5):
    threading.Thread(target=foo).start()
print("Finished")
```
You may also notice that I didn't just add a `fuzz()` call after every line; I also split the line that incremented
the counter into 2 lines, one that reads the counter and another that actually increments it. This is because that's
what happens internally anyway, it's just hidden away, so to add a delay between these two instructions I had to
split the code like this. All of this makes multi-threaded code almost impossible to test reliably, which is a big
problem.

It is possible to fix this code with the use of locks, which would look like this:
```py
import threading

counter_lock = threading.Lock()
printer_lock = threading.Lock()

counter = 0


def foo():
    global counter
    with counter_lock:
        counter += 1
        with printer_lock:
            print(f"The count is {counter}")
            print("----------------------")


with printer_lock:
    print("Starting")

worker_threads = []
for _ in range(5):
    t = threading.Thread(target=foo)
    worker_threads.append(t)
    t.start()

for t in worker_threads:
    t.join()

with printer_lock:
    print("Finished")
```
As we can see, this code is a lot more complex than the previous one. It's not terrible, but you can probably imagine
that with a bigger codebase, this wouldn't be fun to manage.

Not to mention that there is a core issue with this code. Even though it will work and doesn't actually have any
bugs, it is still wrong. Why? When we use enough locks in our multi-threaded code, we may end up making it fully
sequential, which is what happened here. Our code is running synchronously, with a huge amount of overhead from locks
that didn't need to be there, when the actual code that would've been sufficient looks like this:
```py
counter = 0

print("Starting")
for _ in range(5):
    counter += 1
    print(f"The count is {counter}")
    print("----------------------")
print("Finished")
```
While in this particular case it may be pretty obvious that there was no need to use threading at all, there are a
lot of cases where it isn't as clear. I have seen projects with code that could've been sequential, but because they
were already using threading for something else, they made use of locks when adding some other functionality, which
made the whole thing completely sequential without them even realizing it.

## Global Interpreter Lock in Python

As I said, this article is mainly based around the Python language; if you aren't interested in Python, this part
likely won't be very relevant to you. However, it is still pretty interesting to know how the GIL works and why it
isn't as huge an issue as many claim it to be. I also explain a bit about how threads are managed by the OS here,
which may be interesting for you too.

Concurrency in Python is a bit complicated because of something called the "Global Interpreter Lock" (GIL). Or at
least, that's what many people think. I actually quite like the GIL; here is what it does and why it isn't as bad as
many people think it is:

The GIL solves the problem of needing countless locks all across the standard library. Such locks would force threads
to wait for whichever other thread currently has the lock acquired, which is inevitable in some places, as explained
in the section above. Removing the global lock and introducing this many smaller locks isn't even that complicated,
just time-consuming. The real problem is that acquiring and releasing locks is expensive, so not only would removing
the GIL introduce a lot of additional complexity from dealing with locks all over the standard library, it would also
make Python a lot slower.

What's actually bad about the GIL is the fact that it completely prevents 2 threads from running in parallel with
each other: 1 thread running on 1 core while a 2nd thread runs alongside it on another core. But this isn't as big of
an issue as it may sound. Even though we can't run more threads at once, i.e. there's no actual parallelism involved,
it doesn't prevent concurrency. Instead, the threads are constantly being switched around: first we're at thread 1,
then thread 2, then thread 1, then back to thread 2, etc. The lock is constantly moving from one thread to another.

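A small sketch to demonstrate this limitation: two CPU-bound threads take roughly as long as running the same work
sequentially, because the GIL only lets one of them execute bytecode at a time (the workload here is made up):

```py
import threading
import time


def cpu_heavy():
    # Pure computation; this rarely releases the GIL.
    sum(i * i for i in range(10**7))


start = time.perf_counter()
threads = [threading.Thread(target=cpu_heavy) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Takes about as long as calling cpu_heavy() twice sequentially.
print(f"Took {time.perf_counter() - start:.2f}s")
```
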
This interchanging of threads happens even in languages without an interpreter-wide lock. Every machine has a limited
amount of cores/CPUs at its disposal, and it is up to the OS itself to manage when a thread is scheduled to run. The
OS needs to determine the importance of each process and its threads and decide which should run and when. Sometimes
the OS will schedule 2 threads of the same process to run at once, which wouldn't be possible in Python due to the
GIL, but if other processes occupy the cores, every other thread on the system is paused and waiting for the OS to
start it again. This switching between threads can happen at any arbitrary instruction, and we don't have control
over it anyway.

So when would it even make sense to use threads if they can't run in parallel? Even though we don't have control over
when the OS switches threads, we do have some control over when the GIL is passed along, and the OS is clever enough
not to schedule a thread that is currently waiting to acquire a lock; it will schedule an active thread that is
actually doing something. The advantage of threads is that they can cleverly take turns to speed up the overall
process. Say you have a `time.sleep(10)` call in one thread: we can pass the GIL over to another thread that isn't
currently waiting, and once the first thread is done sleeping, the two can again switch around in arbitrary order,
until it once more makes sense to favor one thread over another, such as when a thread is sleeping.

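To see this hand-off in practice, here's a minimal sketch: two threads that mostly sleep finish in roughly one second
in total rather than two, because the GIL is passed over while each of them sleeps:

```py
import threading
import time


def io_bound_task(name):
    # time.sleep releases the GIL, letting the other thread run.
    time.sleep(1)
    print(f"{name} finished")


start = time.perf_counter()
threads = [threading.Thread(target=io_bound_task, args=(f"thread-{n}",)) for n in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"Took {time.perf_counter() - start:.2f}s")  # ~1s, not ~2s
```
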
## Threads vs Asynchronous Execution

As I explained in the last paragraph of the previous section about the GIL, threads are always being interchanged for
us; we don't need any code that explicitly causes this switching, which is an advantage of threading. This
interchanging allows for some speed-ups, and we don't need to worry about the switching ourselves at all!

But the cost of this convenience is that you have to assume a switch can happen at any time. This means we can hop
over to another thread right after the first one finished reading data from memory but hasn't yet stored it back,
which is why we need locks. Threads switch preemptively: the system decides for us.

The limit on threads is the total CPU power we have, minus the cost of task switches and synchronization overhead
(locks).

With asynchronous processing, we switch cooperatively, i.e. we use explicit code (the `await` keyword in Python) to
cause a task switch manually. This means that locks and other synchronization are no longer necessary. (In practice
we do still have locks even in async code, but they're much less common, and many people don't even know about them
because they're simply not necessary in most cases.)

With Python's asyncio, the cost of task switches is incredibly low, because it internally uses generators
(awaitables). It is much quicker to resume a generator, which keeps all of its state, than to call a pure Python
function, which has to build up a whole new stack frame on every call; a generator already has a stack frame and just
picks up where it left off. This makes asyncio by far the cheapest way to handle task switching in Python. In
comparison, where you can feasibly run hundreds of threads, you can run tens of thousands of async tasks.

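To illustrate the underlying mechanism, here's a tiny sketch of a generator resuming with its state intact, which is
exactly the property asyncio builds on:

```py
def gen():
    x = 0
    while True:
        x += 1
        yield x


g = gen()
print(next(g))  # 1
print(next(g))  # 2 - the generator resumed with its local state (x) intact
```
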
This makes async easier to get right than threads, as well as much faster and lighter-weight in comparison. But
nothing is perfect, and async has its downsides too. One downside is that we have to perform the switches
cooperatively, so we need to add the `await` keyword to our code, but that's not very hard. The much more relevant
downside is that everything we do now has to be non-blocking. We can no longer simply read from a file; we need to
launch a task to read from a file, let it start reading, and once the data is available, go back and pick it up. This
means we can't even use the regular `time.sleep` anymore; instead, we need its async alternative,
`await asyncio.sleep`.

This means that we need a huge ecosystem of support tools that adds asynchronous alternatives to every blocking
synchronous operation, which increases the learning curve.

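As a minimal sketch of what cooperative switching looks like, three tasks that each "sleep" for a second finish in
about one second in total, because every `await` hands control back to the event loop:

```py
import asyncio


async def task(name):
    # await is an explicit, cooperative switch point.
    await asyncio.sleep(1)
    print(f"{name} finished")


async def main():
    # Run three tasks concurrently; total time is ~1s, not ~3s.
    await asyncio.gather(task("a"), task("b"), task("c"))


asyncio.run(main())
```
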
### Comparison

- Async maximizes CPU utilization because it has less overhead than threads
- Threading typically works with existing code and tools as long as locks are added around critical sections
- For complex systems, async is much easier to get right than threads with locks
- Threads require very little tooling (locks and queues)
- Async needs a lot of tooling (futures, event loops, non-blocking versions of everything)

## Conclusion

- If you need to run something in parallel, you will need to use multiprocessing, because the GIL prevents threads
  from running in parallel (see the sketch below)
- If you need to run something concurrently, but not necessarily in parallel, you can use either threads or async
- Threads make more sense if you already have a huge code-base, because they don't require rewriting everything to
  non-blocking versions; you just need to add some locks and queues
- Async makes more sense if you know you will need concurrency from the start, since it helps keep everything a lot
  more manageable, and it's quicker than threads

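For completeness, a minimal sketch of the multiprocessing option mentioned above (`cpu_heavy` is a stand-in for real
work):

```py
import multiprocessing


def cpu_heavy(n):
    # Stand-in for a CPU-bound computation.
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    # Each worker is a separate process with its own interpreter and
    # its own GIL, so the computations genuinely run in parallel.
    with multiprocessing.Pool() as pool:
        results = pool.map(cpu_heavy, [10**6] * 8)
    print(results)
```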