From de560ef99be60c513ccd6cb6efc856f6a5f99a8e Mon Sep 17 00:00:00 2001 From: ItsDrike Date: Wed, 17 Nov 2021 18:54:11 +0100 Subject: [PATCH] Add a post about concurrency and parallelism --- content/posts/concurrency-and-parallelism.md | 371 +++++++++++++++++++ 1 file changed, 371 insertions(+) create mode 100644 content/posts/concurrency-and-parallelism.md diff --git a/content/posts/concurrency-and-parallelism.md b/content/posts/concurrency-and-parallelism.md new file mode 100644 index 0000000..00c4ad5 --- /dev/null +++ b/content/posts/concurrency-and-parallelism.md @@ -0,0 +1,371 @@

---
title: Concurrency and Parallelism
date: 2021-11-17
tags: [programming]
---

Concurrency is an exciting topic that's becoming more and more important, yet I see many people who aren't very familiar with it and its possibilities. I'll try to explain the differences between threading, multiprocessing and asynchronous execution. I'll also show some examples of when concurrency should be avoided, and when it makes sense.

I'll be talking about concurrency with the Python language in mind, but even if you don't use Python, I think you can still learn a lot from this article if you aren't familiar with the key concurrency concepts. My hope is that after reading it, you will confidently know the differences between the various concurrency methods and their individual advantages and disadvantages compared to each other.

## Why concurrency?

In programming, we often need our programs to be fast, yet we also often need to perform complex operations that take some time to compute. To cope with this, we can sometimes perform certain tasks at the same time.

As an example, we can think of concurrency as the number of lanes on a highway. If a highway has just one lane, all cars have to use it, and each can only travel as fast as the slowest car in front of it.
But once we add another lane, things improve considerably: cars can travel at their own speeds in separate lanes, and we can physically fit in more cars.

Similarly, when we use concurrency, we allocate multiple physical CPUs/cores to a process, essentially giving it more clock cycles. However, not every task is suited for concurrent execution. Consider this example:

```py
x = my_function()
y = my_other_function(x)
```

We can clearly see that `my_other_function` is completely dependent on the result of `my_function`. This means it wouldn't make any sense to run these concurrently on 2 cores, because `my_other_function` would just wait for `my_function` and only start running after it finished. We would have used 2 cores and done something slower than with one core: slower because it takes some time to send the result of `my_function` to `my_other_function` running in a separate process.

This shows us that not all tasks are suited to run concurrently, but some really can benefit from this form of execution. For example, if we wanted to read the contents of 200 different files, reading them one by one would take a lot of time, but if we were able to read all 200 concurrently, it would only take us the duration of reading 1 file, yet we would get the contents of all 200. (Of course, we're assuming that our disk supports reading 200 things at once.)

Even though this sounds good, it's never as simple as it first seems. On most machines, we won't actually be able to run 200 things at once, because we don't have 200 CPUs/cores. Concurrency like this will always be limited by the hardware of the computer your software is running on. If you have a computer with 8 logical CPUs/cores, you can only run 8 things at once. Even though that obviously won't be as good as running 200 tasks at once, it will still be way better than running a single task at a time.
In our example, we would be able to get the results of all 200 tasks in the amount of time it would take to run 25 tasks sequentially, which is still a huge improvement.

## Threads vs Processes

Understanding the concept of concurrency is one thing, but how do we actually run our code on multiple cores/CPUs? Luckily for us, this is all handled by the operating system we're running. The kernel of the OS has to manage thousands of processes, along with their threads, all of which constantly compete for as many CPU clock cycles as they can get. It is up to the OS to determine which processes are more important than others and to interchange them so that each process gets enough CPU time. Having multiple cores helps a lot, because rather than constantly swapping processes around on a single core, we can run n processes at once and the OS has less overall swapping to do.

But what are these processes and the threads attached to them? The concept of a process is probably not a hard one to understand: it's just a separate program that we started. A thread is a bit more interesting. Threads are essentially a way for a single process to do 2 things concurrently while still existing in the shared state of that single process. This means we don't have any communication overhead and it's very easy to pass information along. However, this property can often be a disadvantage: since threads work on a single shared state, we often need to use locks to communicate properly without causing issues. (I'll explain the importance of locks with some examples later, but essentially, we need locks to prevent data loss when 2 threads make a change to the same place in memory at once.)
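To make this shared state concrete, here is a minimal sketch (the `worker` function and `results` list are made up for this example) where several threads write into one list that lives in their common process memory, with no serialization needed:

```py
import threading

results = []  # shared state: every thread in this process sees the same list

def worker(n):
    # mutate the shared list directly -- no copying or pickling involved
    results.append(n * n)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # sorted, since the threads may finish in any order
```

Note that we sort the output: the one thing we cannot rely on is the order in which the threads finish.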
As you were probably able to figure out, the advantage of processes is that they don't have these shared states. Processes are fully independent; however, this is also a disadvantage, because it is hard to communicate between them. Since we don't have a shared state, if processes want to talk to each other, they need to take the objects from memory, serialize them and move them across a raw socket to another process, where they get deserialized. (In Python, this will most likely be done with the `pickle` library.) This means processes have a huge communication cost compared to threads.

## Why do we need locks?

Consider this code:

```py
>>> import sys
>>> a = []
>>> b = a
>>> sys.getrefcount(a)
3
```

In this example, we can see that Python keeps a reference count for the empty list object, and in this case, it was 3. The list object was referenced by `a`, `b` and the argument passed to `sys.getrefcount`. If we didn't have locks, two threads could attempt to increase the reference count at once. This is a problem, because what would actually happen would go something like this:

> Thread 1: Read the current amount of references from memory (for example 5)
> Thread 2: Read the current amount of references from memory (same as above - 5)
> Thread 1: Increase this amount by 1 (we're now at 6)
> Thread 2: Increase this amount by 1 (we're also at 6 in the 2nd thread)
> Thread 1: Store the increased amount back to memory (we store this increased amount of 6 back to memory)
> Thread 2: Store the increased amount back to memory (we also store the increased amount of 6 to memory)
[Treat sections of 2 lines as things happening concurrently]

You can see that because threads 1 and 2 both read the reference count from memory at the same time, they read the same number, increased it and stored it back, without ever knowing that some other thread was also in the process of increasing the reference count. Even though the first thread stored the updated amount, the 2nd thread then stored its own updated amount, except they were the same number.

Suddenly we have no solid way of knowing how many references there actually are to our list, which means it may get removed by automatic garbage collection because we've "reached" 0 references while we actually still have an active reference. There is a way to circumvent this though, and that is the use of locks.

Dummy internal code:

```py
lock.acquire()
references = sys.getrefcount(obj)
references += 1
update_references(obj, references)
lock.release()
```

Here, before we even started to read the reference count, we acquired a lock, preventing other threads from continuing and causing them to wait until the lock is released so that another thread can acquire it.
With this code, it would go something like this:

> Thread 1: Try to acquire a shared lock between threads (lock is free, Thread 1 now has the lock)
> Thread 2: Try to acquire a shared lock between threads (lock is already acquired by Thread 1, we're waiting)
> Thread 1: Read the current amount of references from memory (for example 5)
> Thread 2: Try to acquire the lock (still waiting)
> Thread 1: Increase this amount by 1 (we're now at 6)
> Thread 2: Try to acquire the lock (still waiting)
> Thread 1: Store the increased amount back to memory (we now have 6 in memory)
> Thread 2: Try to acquire the lock (still waiting)
> Thread 1: Release the lock
> Thread 2: Try to acquire the lock (success, Thread 2 now has the lock)
> Thread 1: Finished (died)
> Thread 2: Read the current amount of references from memory (read value 6 from memory)
> Thread 2: Increase this amount by 1 (we're now at 7)
> Thread 2: Store the increased amount back to memory (we now have 7 in memory)
> Thread 2: Release the lock
> Thread 2: Finished (died)

We can immediately see that this is a lot more complex than lock-free code, but it did fix our problem: we managed to correctly increase the reference count across multiple threads. The question is, at what cost?

It takes a while to acquire or release a lock, and these additional instructions slow down our code, not to mention that Thread 2 was completely blocked while Thread 1 held the lock, spending CPU cycles sleeping and waiting for the 1st thread to finish and release it. This is why threading can be quite complicated to deal with and why some tasks should really stay single-threaded.
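To get a rough feel for that overhead, we can time an uncontended lock acquire/release against a plain no-op. The exact numbers are machine-dependent, so treat this as a sketch rather than a benchmark:

```py
import threading
import timeit

lock = threading.Lock()

def with_lock():
    # acquire and release an uncontended lock on every call
    with lock:
        pass

def without_lock():
    pass

t_locked = timeit.timeit(with_lock, number=100_000)
t_plain = timeit.timeit(without_lock, number=100_000)
print(f"with lock: {t_locked:.4f}s, without: {t_plain:.4f}s")
```

And this only measures the uncontended case; when another thread actually holds the lock, we additionally pay for the blocked thread sitting idle.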
In this small example, it may be easy to understand what's going on, but with enough locks, it becomes increasingly difficult to know whether there will be any "deadlocks" (a thread waiting forever on a lock that will never be released, for example because the thread holding it was forcefully killed, or because two threads each wait for a lock the other one holds), to test your code, etc. Managing locks can become a nightmare in a complex enough code-base.

Another problem with locks is that they don't actually lock anything. A lock is essentially just a signal that can be checked for, and if it's active, a thread can choose to wait until that signal is gone (the lock is released). But this only happens if we actually check, and if we decide to respect it. Threads are supposed to respect locks, but there's absolutely nothing preventing them from running anyway. If a thread forgets to acquire a lock, it can do something it shouldn't have been able to do. This means that even if we have a large code-base with a lot of correctly written locks, it may not stay correct over time. Small adjustments to the code can make it incorrect in a way that's hard to see during code reviews.

## Debugging multi-threaded code

As an example, this is multi-threaded code that will pass all tests and yet is full of bugs:
```py
import threading

counter = 0

def foo():
    global counter
    counter += 1
    print(f"The count is {counter}")
    print("----------------------")

print("Starting")
for _ in range(5):
    threading.Thread(target=foo).start()
print("Finished")
```
When you run this code, you will most likely get the result you would expect, but it is also possible to get a complete mess; it's just not very likely, because the code runs very quickly. This means you can write multi-threaded code that passes all tests and still fails in production, which is very dangerous.

To actually debug this code, we can use a technique called "fuzzing".
With it, we essentially add a random sleep delay after every instruction to simulate a thread switch happening at that point and see whether the code still behaves correctly. Even with this technique, it is advised to run the code multiple times, because there is always a chance of getting the correct result anyway, since it remains one of the possibilities. This is why multi-threaded code can introduce a lot of problems. This is the code with the "fuzzing" method applied:
```py
import threading
import time
import random

def fuzz():
    time.sleep(random.random())

counter = 0

def foo():
    global counter

    fuzz()
    old_counter = counter
    fuzz()
    counter = old_counter + 1
    fuzz()
    print(f"The count is {counter}")
    fuzz()
    print("----------------------")

print("Starting")
for _ in range(5):
    threading.Thread(target=foo).start()
print("Finished")
```
You may notice that I didn't just add a `fuzz()` call on every line: I also split the line that increments the counter into 2 lines, one that reads the counter and another that actually increments it. This is because internally, that's what happens; it's just hidden away, so to add a delay between these instructions I had to split the code like this. All of this makes it almost impossible to test multi-threaded code properly, which is a big problem.
It is possible to fix this code with the use of locks, which would look like this:
```py
import threading

counter_lock = threading.Lock()
printer_lock = threading.Lock()

counter = 0

def foo():
    global counter
    with counter_lock:
        counter += 1
        # keep the print inside the counter lock, so the printed
        # value always matches the increment we just made
        with printer_lock:
            print(f"The count is {counter}")
            print("----------------------")

with printer_lock:
    print("Starting")

worker_threads = []
for _ in range(5):
    t = threading.Thread(target=foo)
    worker_threads.append(t)
    t.start()

for t in worker_threads:
    t.join()

with printer_lock:
    print("Finished")
```
As we can see, this code is a lot more complex than the previous one. It's not terrible, but you can probably imagine that in a bigger codebase, this wouldn't be fun to manage.

Not to mention that there is a core issue with this code. Even though it works and doesn't actually have any bugs, it is still wrong. Why? When we use enough locks in our multi-threaded code, we may end up making it fully sequential, which is what happened here. Our code is running synchronously, with a huge amount of overhead from locks that didn't need to be there, when the actual code that would've been sufficient looks like this:
```py
counter = 0
print("Starting")
for _ in range(5):
    counter += 1
    print(f"The count is {counter}")
    print("----------------------")
print("Finished")
```
While in this particular case it may be pretty obvious that there was no need to use threading at all, there are a lot of cases in which it isn't as clear. I have seen projects with code that could've been sequential, but they were already using threading for something else, so when they added some other functionality they made use of locks, which made the whole thing completely sequential without them even realizing.
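When threads genuinely do need to coordinate, a common way to avoid hand-written locks is to communicate through `queue.Queue`, which handles all of its locking internally. A small sketch (the worker/sentinel setup is made up for this example):

```py
import queue
import threading

tasks = queue.Queue()    # queue.Queue does its locking internally
results = queue.Queue()

def worker():
    while True:
        n = tasks.get()
        if n is None:      # sentinel value: no more work for this thread
            break
        results.put(n * n)

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()

for n in range(10):
    tasks.put(n)
for _ in threads:          # one sentinel per worker thread
    tasks.put(None)
for t in threads:
    t.join()

# drain the results (safe here: all workers have finished)
out = []
while not results.empty():
    out.append(results.get())
print(sorted(out))  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

The design choice here is that no user code ever touches a lock: all shared state lives inside the queues, which is exactly the "locks and queues" tooling mentioned in the conclusion.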
## Global Interpreter Lock in Python

As I said, this article is mainly based around the Python language. If you aren't interested in Python, this part likely won't be very relevant to you. However, it is still pretty interesting to know how the GIL works and why it isn't as huge an issue as many claim it to be. I also explain something about how threads are managed by the OS here, which may be interesting for you too.

Concurrency in Python is a bit complicated because of something called the "Global Interpreter Lock" (GIL), or at least, that's what many people think. I actually quite like the GIL. This is what it does and why it isn't as bad as many people think:

The GIL solves the problem of needing countless locks all across the standard library. These locks would force threads to wait for whichever thread currently holds a given lock, which is inevitable in some places, as explained in the section above. Removing the global lock and introducing this many smaller locks isn't even that complicated, just time-consuming. The real problem is that acquiring and releasing locks is expensive, so not only does removing the GIL introduce a lot of additional complexity from dealing with locks all over the standard library, it also makes Python a lot slower.

What's actually bad about the GIL is that it completely prevents 2 threads from running in parallel with each other: 1 thread running on one core while a 2nd thread runs alongside it on another core. But this isn't as big of an issue as it may sound. Even though we can't run multiple threads at once, i.e. there's no actual parallelism involved, the GIL doesn't prevent concurrency. Instead, the threads are constantly being switched around: first we're in thread 1, then thread 2, then thread 1, then back to thread 2, etc. The lock is constantly moving from one thread to another.
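We can observe this with a CPU-bound task: splitting the work across 2 threads doesn't make it faster on a standard (GIL-enabled) CPython build, because only one thread can hold the GIL at a time. The numbers are machine-dependent; this is just a sketch:

```py
import threading
import time

def count_down(n):
    # pure CPU-bound busy work
    while n > 0:
        n -= 1

N = 5_000_000

start = time.perf_counter()
count_down(N)
count_down(N)
sequential = time.perf_counter() - start

start = time.perf_counter()
t1 = threading.Thread(target=count_down, args=(N,))
t2 = threading.Thread(target=count_down, args=(N,))
t1.start(); t2.start()
t1.join(); t2.join()
threaded = time.perf_counter() - start

# On a GIL-enabled build, the threaded version is usually
# no faster than the sequential one (often a bit slower,
# because of the extra thread-switching overhead).
print(f"sequential: {sequential:.2f}s  threaded: {threaded:.2f}s")
```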
But this interchanging of threads also happens in languages without any interpreter-wide lock. Every machine has a limited number of cores/CPUs at its disposal, and it is up to the OS itself to manage when a thread is scheduled to run. The OS needs to determine the importance of each process and its threads and decide which should run and when. Sometimes the OS will schedule 2 threads of the same process to run at once, which wouldn't be possible in Python due to the GIL, but if other processes occupy the cores, every other thread on the system is paused and waiting for the OS to start it again. This switching between threads can happen at any arbitrary instruction, and we don't have control over it anyway.

So when would it even make sense to use threads if they can't run in parallel? Even though we don't have control over when these switches happen, we do have control over when the GIL is passed, and the OS is clever enough not to schedule a thread that is currently waiting to acquire a lock; it will schedule an active thread that is actually doing something. The advantage of threads is that they can cleverly take turns to speed up the overall process. Say you have a `time.sleep(10)` operation in one thread: we can pass the GIL over to another thread that isn't currently waiting, and once the first thread is done sleeping, we can switch between them in arbitrary order, until it again makes more sense to run one thread over another, such as when a thread is sleeping.

## Threads vs Asynchronous run

As I explained in the last paragraph of the previous section about the GIL, threads are always interchanged for us; we don't need any code that explicitly causes this switching, which is an advantage of threading. This interchanging allows for some speed-ups, and we don't need to worry about the switching ourselves at all!
But the cost of this convenience is that you have to assume a switch can happen at any time. This means we can hop over to another thread after the first one has read data from memory but before it has stored the result back. This is why we need locks. Threads switch preemptively: the system decides for us.

The limit on threads is the total CPU power we have, minus the cost of task switches and synchronization overhead (locks).

With asynchronous processing, we switch cooperatively, i.e. we use explicit code (the `await` keyword in Python) to cause a task switch manually. This means that locks and other synchronization are no longer necessary. (In practice, we actually do still have locks even in async code, but they're much less common, and many people don't even know about them because they're simply not necessary in most cases.)

With Python's asyncio, the cost of a task switch is incredibly low, because tasks internally use generators (awaitables), and it is much quicker to resume a generator, which stores all of its state, than to call a pure Python function, which has to build up a whole new stack frame on every call; a generator already has a stack frame and simply picks up where it left off. This makes asyncio by far the cheapest way to handle task switching in Python. In comparison, you can run hundreds of threads, but tens of thousands of async tasks per second.

This makes async easier to get right than threads, and much faster and lighter-weight in comparison. But nothing is perfect, and async has its downsides too. One downside is that we have to perform the switches cooperatively, so we need to add the `await` keyword to our code, but that's not very hard. The much more relevant downside is that everything we do now has to be non-blocking. We can no longer simply read from a file; we need to launch a task to read from a file, let it start reading, and when the data is available, go back and pick it up.
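To make the cooperative style concrete, here is a minimal asyncio sketch (the `job`/`main` names are made up for this example) where two tasks take turns at every `await`, so their waits overlap instead of adding up:

```py
import asyncio
import time

async def job(name, delay):
    # "await" is the explicit switch point: while this task waits,
    # the event loop is free to run the other task
    await asyncio.sleep(delay)
    return f"{name} done"

async def main():
    start = time.perf_counter()
    # run both jobs concurrently: total time is roughly
    # max(delays), not their sum
    results = await asyncio.gather(job("a", 0.2), job("b", 0.2))
    elapsed = time.perf_counter() - start
    print(results, f"in {elapsed:.2f}s")
    return results, elapsed

results, elapsed = asyncio.run(main())
```

Both jobs finish in about 0.2 seconds total rather than 0.4, even though there is only one thread and no locks anywhere.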
This means we can't even use regular `time.sleep` anymore; instead, we need its async alternative, `await asyncio.sleep`.

This means we need a huge ecosystem of support tools that adds asynchronous alternatives to every blocking synchronous operation, which increases the learning curve.

### Comparison

- Async maximizes CPU utilization because it has less overhead than threads
- Threading typically works with existing code and tools as long as locks are added around critical sections
- For complex systems, async is much easier to get right than threads with locks
- Threads require very little tooling (locks and queues)
- Async needs a lot of tooling (futures, event loops, non-blocking versions of everything)

## Conclusion

- If you need to run something in parallel, you will need to use multiprocessing, because the GIL prevents parallel threads
- If you need to run something concurrently, but not necessarily in parallel, you can use either threads or async
- Threads make more sense if you already have a huge code-base, because they don't require rewriting everything to non-blocking versions; you will just need to add some locks and queues
- Async makes more sense if you know you will need concurrency from the start, since it helps to keep everything a lot more manageable, and it's quicker than threads