mirror of
https://github.com/ItsDrike/itsdrike.com.git
synced 2025-01-23 20:14:33 +00:00
Add a post about concurrency and parallelism
This commit is contained in:
parent
ea39ed18c7
commit
de560ef99b
371
content/posts/concurrency-and-parallelism.md
Normal file
---
title: Concurrency and Parallelism
date: 2021-11-17
tags: [programming]
---

Concurrency is an exciting topic that's becoming more and more important, yet I see so many people who aren't very
familiar with the topic and its possibilities. I'll try to explain the differences between threading, multiprocessing
and asynchronous execution. I'll also show some examples of when concurrency should be avoided, and when it makes sense.

I'll be talking about concurrency with the Python language in mind, but even if you don't use Python, I still think
you can learn a lot from this article if you aren't that familiar with key concurrency concepts. My hope is that after
you read this article, you will confidently know the differences between the various concurrency methods and their
individual advantages and disadvantages when compared to each other.

## Why concurrency?

In programming, we often need to do things quickly so that our program isn't slow. But we also often need to perform
complex operations that take some time to compute. To cope with this, we can sometimes perform certain tasks at the
same time.

As an example, we can think about concurrency as the number of lanes on a highway. If we have a highway with just a
single lane, all cars have to use that lane and they can only travel as quickly as the slowest car in front of them.
But once we bring in another lane, we can already see huge improvements, because the cars can go at their own speeds
in separate lanes and we can physically fit in more cars.

Similarly, when we run tasks at the same time, we let a program use multiple physical CPUs/cores, essentially giving
it more clock cycles. However, not every task is suited for this. Consider this example:

```py
x = my_function()
y = my_other_function(x)
```

We can clearly see that `my_other_function` is completely dependent on the result of `my_function`. This means that it
wouldn't make any sense to run these on 2 cores, because `my_other_function` would just wait for `my_function` and
only start running once it has finished. We would have used 2 cores and done something slower than with one core. It
would be slower because it takes some time to send the result of `my_function` to `my_other_function` running in a
separate process.

This shows us that not all tasks are suited to run concurrently, but some can really benefit from it. For example, if
we wanted to read the content of 200 different files, reading them one by one would take a lot of time, but if we were
able to read all 200 at once, it would only take roughly as long as reading 1 file, yet we would get the content of
all 200 files. (Of course, we're assuming that our disk would support reading 200 things at once.)

Even though this sounds good, it's never as simple as it first sounds. On most machines, we won't actually be able to
run 200 things at once, because we don't have 200 CPUs/cores. Concurrency like this will always be limited by the
hardware your software is running on. On a computer with 8 logical CPUs/cores, we can only run 8 things at once. Even
though that obviously isn't as good as running 200 tasks at once, it is still way better than running a single task at
a time. In our example, we would get the results of all 200 tasks in the time it would take to run 25 tasks
sequentially (200 / 8 = 25), which is still a huge improvement.

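The arithmetic above can be tried out with a small sketch. It is only an illustration under a few assumptions: the
"work" is a short `time.sleep` standing in for reading a file, the names `fake_task`, `TASKS`, `WORKERS` and
`TASK_TIME` are made up for this example, and a thread pool (threads are explained in the following sections) is used
to run at most 8 tasks at once.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor

TASKS = 200        # number of independent tasks, as in the example above
WORKERS = 8        # how many tasks may run at once (assumed value)
TASK_TIME = 0.01   # hypothetical duration of a single task, in seconds


def fake_task(index: int) -> int:
    """Stand-in for real work such as reading one file."""
    time.sleep(TASK_TIME)
    return index


# With 8 workers, 200 tasks are processed in ceil(200 / 8) = 25 rounds.
rounds = math.ceil(TASKS / WORKERS)

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    # map() hands tasks to the workers and returns results in input order
    results = list(pool.map(fake_task, range(TASKS)))
elapsed = time.perf_counter() - start

print(f"{rounds} rounds, finished in {elapsed:.2f}s "
      f"(one by one it would take about {TASKS * TASK_TIME:.2f}s)")
```
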
## Threads vs Processes

Understanding the concept of concurrency is one thing, but how do we actually run our code on multiple cores/CPUs?
Luckily for us, this is all handled by the operating system. The kernel has to manage thousands of processes and their
threads, all of which constantly compete for as many CPU clock cycles as they can get. It is up to the OS to determine
which processes are more important than others and to swap them in and out so that each one gets enough CPU time.
Having multiple cores helps a lot here, because rather than constantly swapping processes around on a single core, we
can run n processes at once and the OS has less swapping to do overall.

But what are these processes and the threads attached to them? The concept of a process is probably not hard to
understand, it's just a separate program that we started. A thread is a bit more interesting. Threads are essentially
a way for a single process to do 2 things concurrently while still living in the shared state of that process. This
means we don't have any communication overhead and it's very easy to pass information along. However, this property
can often be a disadvantage: since threads work on a single shared state, we often need to use locks to communicate
properly without causing issues. (I'll explain the importance of locks with some examples later, but essentially, we
need locks to prevent data loss when 2 threads change the same place in memory at once.)

As you were probably able to figure out, the advantage of processes is that they don't have this shared state.
Processes are fully independent. However, this is also a disadvantage, because it makes communication between
processes hard. Without shared state, if processes want to talk to each other, they need to take the objects from
memory, serialize them and send them to another process (for example over a pipe or a socket), where they get
deserialized. (In Python, this will most likely be done with the `pickle` module.) This means processes have a huge
communication cost compared to threads.

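To make the difference more tangible, here is a minimal sketch. It doesn't start a second process. Instead, it
imitates the serialization step with a `pickle` round-trip, which is an assumption made to keep the example small. The
names `shared`, `bump` and `copy_in_other_process` are made up for the illustration.

```python
import pickle
import threading

shared = {"visits": 0}


def bump(state: dict) -> None:
    # A thread receives a reference to the very same dict object.
    state["visits"] += 1


worker = threading.Thread(target=bump, args=(shared,))
worker.start()
worker.join()

# Imitate what crossing a process boundary involves: the object is
# serialized to bytes and rebuilt on the other side as a separate copy.
payload = pickle.dumps(shared)
copy_in_other_process = pickle.loads(payload)
copy_in_other_process["visits"] += 1

print(shared["visits"])                 # changed by the thread only
print(copy_in_other_process["visits"])  # the copy changed, the original did not
print(len(payload), "bytes had to be serialized")
```
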
## Why do we need locks?

Consider this code:

```py
>>> import sys
>>> a = []
>>> b = a
>>> sys.getrefcount(a)
3
```

In the example here, we can see that Python keeps a reference count for the empty list object, and in this case it was
3. The list object was referenced by `a`, `b` and the argument passed to `sys.getrefcount`. If we didn't have locks,
two threads could attempt to increase the reference count at the same time. This is a problem, because what would
actually happen would go something like this:

> Thread 1: Read the current amount of references from memory (for example 5)
> Thread 2: Read the current amount of references from memory (same as above - 5)
> Thread 1: Increase this amount by 1 (we're now at 6)
> Thread 2: Increase this amount by 1 (we're also at 6 in the 2nd thread)
> Thread 1: Store the increased amount back to memory (we store this increased amount of 6 back to memory)
> Thread 2: Store the increased amount back to memory (we store the increased amount of 6 to memory?)

[Treat each pair of lines as things happening concurrently]

You can see that because threads 1 and 2 both read the reference count from memory at the same time, they read the
same number. Each of them then increased it and stored it back, never knowing that another thread was in the middle of
doing the same thing with the same starting value. So even though both threads stored an updated count, they stored
the same one: memory now holds 6, although two new references were created and the count should be 7.

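The same interleaving can be replayed by hand in plain Python. This is only a sketch: there are no real threads here,
the two "threads" are simulated with ordinary variables, and `memory`, `thread_1_value` and `thread_2_value` are
made-up names. It just makes the lost update visible.

```python
memory = {"refcount": 5}  # hypothetical shared value, as in the walk-through

# Both "threads" read before either of them writes back.
thread_1_value = memory["refcount"]   # Thread 1 reads 5
thread_2_value = memory["refcount"]   # Thread 2 reads 5

thread_1_value += 1                   # Thread 1 computes 6
thread_2_value += 1                   # Thread 2 computes 6

memory["refcount"] = thread_1_value   # Thread 1 stores 6
memory["refcount"] = thread_2_value   # Thread 2 stores 6 again

print(memory["refcount"])  # 6, although two increments happened
```
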
Suddenly we have no reliable way of knowing how many references there actually are to our list, which means it may get
removed by automated garbage collection because the count hit 0 while we actually still have an active reference.
There is a way to prevent this though, and that is with the use of locks.

Dummy internal code:

```py
# Illustrative pseudocode, not real interpreter code: `lock`, `obj` and
# `update_references` stand in for what would happen internally
lock.acquire()
references = sys.getrefcount(obj)
references += 1
update_references(obj, references)
lock.release()
```

Here, before we even start to read the reference count, we acquire a lock. This prevents other threads from continuing
and makes them wait until the lock is released so that one of them can acquire it. With this code, it would go
something like this:

> Thread 1: Try to acquire a shared lock between threads (lock is free, Thread 1 now has the lock)
> Thread 2: Try to acquire a shared lock between threads (lock is already acquired by Thread 1, we're waiting)
> Thread 1: Read the current amount of references from memory (for example 5)
> Thread 2: Try to acquire the lock (still waiting)
> Thread 1: Increase this amount by 1 (we're now at 6)
> Thread 2: Try to acquire the lock (still waiting)
> Thread 1: Store the increased amount back to memory (we now have 6 in memory)
> Thread 2: Try to acquire the lock (still waiting)
> Thread 1: Release the lock
> Thread 2: Try to acquire the lock (success, Thread 2 now has the lock)
> Thread 1: Finished (died)
> Thread 2: Read the current amount of references from memory (read value 6 from memory)
> Thread 2: Increase this amount by 1 (we're now at 7)
> Thread 2: Store the increased amount back to memory (we now have 7 in memory)
> Thread 2: Release the lock
> Thread 2: Finished (died)

We can immediately see that this is a lot more complex than the lock-free code, but it did fix our problem: we managed
to correctly increase the reference count across multiple threads. The question is, at what cost?

It takes a while to acquire or release a lock, and these additional instructions slow down our code a lot. On top of
that, thread 2 was completely blocked while thread 1 held the lock: it could do no useful work and simply had to wait
for the 1st thread to finish and release the lock. This is why threading can be quite complicated to deal with and why
some tasks should really stay single-threaded.

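As a rough sketch of the locked version with real threads, the snippet below lets 4 threads increment one shared
counter. The names `add_many`, `INCREMENTS` and `THREADS` are made up for this example. The result is always correct,
but note that the lock has to be acquired and released 40,000 times to get there.

```python
import threading

counter = 0
counter_lock = threading.Lock()
INCREMENTS = 10_000   # increments per thread (arbitrary value)
THREADS = 4           # number of threads (arbitrary value)


def add_many() -> None:
    global counter
    for _ in range(INCREMENTS):
        # Only one thread at a time may do the read-modify-write step.
        with counter_lock:
            counter += 1


threads = [threading.Thread(target=add_many) for _ in range(THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait until every thread has finished

print(counter)  # THREADS * INCREMENTS
```
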
In this small example, it may be easy to understand what's going on, but if you add enough locks, it becomes
increasingly difficult to test your code and to know whether there will be any "dead-locks". (A dead-lock can happen
when a thread acquires a lock but never releases it, which is often the case if we forcefully kill a thread, or when
two threads each wait for a lock that the other one holds.) Managing locks can become a nightmare in a complex enough
code-base.

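The first kind of dead-lock can be shown in a few lines. This is a minimal sketch with a made-up `careless_worker`
function. To keep the example from hanging forever, the second `acquire` call uses the `timeout` argument of
`threading.Lock.acquire` and gives up after a short while.

```python
import threading

lock = threading.Lock()


def careless_worker() -> None:
    # Acquires the lock and exits without ever releasing it.
    lock.acquire()


t = threading.Thread(target=careless_worker)
t.start()
t.join()

# A plain lock.acquire() would now block forever. Passing a timeout lets
# us give up instead and report the problem.
got_lock = lock.acquire(timeout=0.2)
print("acquired" if got_lock else "gave up: the lock was never released")
```
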
Another problem with locks is that they don't actually lock anything. A lock is essentially just a signal that can be
checked, and if it's active, the thread can choose to wait until that signal is gone (the lock is released). But this
only happens if we actually check it and decide to respect it. Threads are supposed to respect locks, but there's
absolutely nothing preventing them from running anyway. If a thread forgets to acquire a lock, it can do something
that it shouldn't have been able to do. This means that even if we have a large code-base with a lot of locks written
correctly, it may not stay correct over time. Small adjustments to the code can make it incorrect in a way that's hard
to see during code reviews.

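A tiny sketch shows how little a lock enforces. The names `balance`, `balance_lock` and `rude_writer` are invented for
this illustration: one part of the program holds the lock, yet a thread that never asks for it can still change the
data.

```python
import threading

balance_lock = threading.Lock()
balance = {"value": 100}

# Pretend some thread is in the middle of an update and holds the lock.
balance_lock.acquire()


def rude_writer() -> None:
    # Never asks for the lock, so nothing stops it from writing.
    balance["value"] = 0


t = threading.Thread(target=rude_writer)
t.start()
t.join()

print(balance["value"])  # modified even though the lock was held
balance_lock.release()
```
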
## Debugging multi-threaded code

As an example, this is multi-threaded code that will pass all tests and yet is full of bugs:
```py
import threading

counter = 0

def foo():
    global counter
    counter += 1  # read-modify-write on shared state, done without a lock
    print(f"The count is {counter}")
    print("----------------------")

print("Starting")
for _ in range(5):
    threading.Thread(target=foo).start()
print("Finished")  # nothing joins the threads, so this may print too early
```
When you run this code, you will most likely get the result you would expect, but it is possible to get a complete
mess instead. It's just not very likely, because the code runs very quickly. This means you can write multi-threaded
code that will pass all tests and still fail in production, which is very dangerous.

To actually debug this code, we can use a technique called "fuzzing". With it, we essentially add a random sleep delay
after every instruction to check that the code stays safe if a thread switch happens at that point. But even with this
technique, it is advised to run the code multiple times, because the correct result is always one of the possible
outcomes and you may get it by chance. This is why multi-threaded code can introduce a lot of problems. This would be
the code with the "fuzzing" method applied:
```py
import threading
import time
import random

def fuzz():
    # Sleep for a random time (under 1 second) to provoke thread switches
    time.sleep(random.random())

counter = 0

def foo():
    global counter

    fuzz()
    old_counter = counter
    fuzz()
    counter = old_counter + 1
    fuzz()
    print(f"The count is {counter}")
    fuzz()
    print("----------------------")

print("Starting")
for _ in range(5):
    threading.Thread(target=foo).start()
print("Finished")
```
You may also notice that I didn't just add a `fuzz()` call after every line. I've also split the line that incremented
the counter into 2 lines, one that reads the counter and another one that increments and stores it. Internally, that's
what happens anyway, it's just hidden away, so to add a delay between these steps I had to split the code like this.
This makes it almost impossible to test multi-threaded code properly, which is a big problem.

It is possible to fix this code with the use of locks, which would look like this:
```py
import threading

counter_lock = threading.Lock()
printer_lock = threading.Lock()

counter = 0

def foo():
    global counter
    with counter_lock:
        counter += 1
        # Print while still holding counter_lock so the value can't change
        with printer_lock:
            print(f"The count is {counter}")
            print("----------------------")

with printer_lock:
    print("Starting")

worker_threads = []
for _ in range(5):
    t = threading.Thread(target=foo)
    worker_threads.append(t)
    t.start()

# Wait for all workers before announcing that we're done
for t in worker_threads:
    t.join()

with printer_lock:
    print("Finished")
```
As we can see, this code is a lot more complex than the previous one. It's not terrible, but you can probably imagine
that with a bigger codebase, this wouldn't be fun to manage.

Not to mention that there is a core issue with this code. Even though it works and doesn't actually have any bugs, it
is still wrong. Why? When we use enough locks in our multi-threaded code, we may end up making it fully sequential,
which is what happened here. Our code runs one step at a time, with a huge amount of overhead from locks that didn't
need to be there. The code that would've been sufficient looks like this:
```py
counter = 0
print("Starting")
for _ in range(5):
    counter += 1
    print(f"The count is {counter}")
    print("----------------------")
print("Finished")
```
While in this particular case it may be pretty obvious that there was no need to use threading at all, there are a lot
of cases in which it isn't as clear. I have seen projects with code that could've been sequential, but they were
already using threading for something else, so they kept adding locks along with new functionality until the whole
code ended up completely sequential without anyone realizing.

## Global Interpreter Lock in Python

As I said, this article is mainly based around the Python language. If you aren't interested in Python, this part
likely won't be very relevant to you. However, it is still pretty interesting to know how the GIL works and why it
isn't as huge an issue as many claim it is. I also explain a bit about how threads are managed by the OS here, which
may be interesting for you too.

Concurrency in Python is a bit complicated because of something called the "Global Interpreter Lock" (GIL), or at
least that's what many people think. I actually quite like the GIL. This is what it does and why it isn't as bad as
many people think it is:

The GIL solves the problem of needing countless locks all across the standard library. These locks would force threads
to wait for whichever thread currently holds the lock, which is inevitable in some places, as explained in the section
above. Removing the global lock and introducing this many smaller locks isn't even that complicated, just
time-consuming. The real problem is that acquiring and releasing locks is expensive, so not only does removing the GIL
introduce a lot of additional complexity from dealing with locks all over the standard library, it also makes Python a
lot slower.

What's actually bad about the GIL is that it completely prevents 2 threads of the same process from running Python
code in parallel, i.e. 1 thread running on 1 core and a 2nd thread running alongside it on another core. But this
isn't as big of an issue as it may sound. Even though we can't run multiple threads at once, i.e. there's no actual
parallelism involved, it doesn't prevent concurrency. Instead, the threads are constantly being switched around: first
we're at thread 1, then thread 2, then thread 1, then back to thread 2, etc. The lock is constantly moving from one
thread to another.

But this interchanging of threads also happens in languages without any interpreter-wide lock. Every machine has a
limited number of cores/CPUs at its disposal, and it is up to the OS to manage when a thread is scheduled to run. The
OS needs to determine the importance of each process and its threads and decide which should run and when. Sometimes
the OS may schedule 2 threads of the same process to run at once, which isn't possible with Python due to the GIL, but
when other processes occupy the cores, every other thread on the system is paused and waiting for the OS to resume it.
This switching between threads can happen at any arbitrary instruction and we don't have control over it anyway.

So when would it make sense to even use threads if they can't run in parallel? Even though we don't have control over
when these switches happen, we do have some control over when the GIL is passed on, and the OS is clever enough not to
schedule a thread that is currently waiting to acquire a lock. It will schedule an active thread that is actually
doing something. The advantage of threads is just that they can cleverly take turns to speed up the overall process.
Say you have a `time.sleep(10)` call in one thread. While it sleeps, the GIL can be passed over to another thread that
isn't currently waiting. Once the first thread is done sleeping, the threads switch around in arbitrary order again,
until it once more makes sense to run one thread over another, such as when a thread is sleeping.

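This turn-taking is easy to observe. In the sketch below, `time.sleep` stands in for any blocking wait (the names
`nap`, `NAP` and `THREADS` are made up for the example). Five threads each sleep for 0.2 seconds, yet the whole run
takes about as long as a single nap, because a sleeping thread doesn't hold on to the GIL.

```python
import threading
import time

NAP = 0.2      # seconds each thread sleeps (stand-in for blocking I/O)
THREADS = 5    # number of threads (arbitrary value)


def nap() -> None:
    time.sleep(NAP)  # other threads are free to run while this one sleeps


start = time.perf_counter()
threads = [threading.Thread(target=nap) for _ in range(THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

print(f"{THREADS} naps of {NAP}s took {elapsed:.2f}s in total")
```
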
## Threads vs Asynchronous run

As I explained at the end of the previous section about the GIL, threads are always being interchanged for us. We
don't need any code that explicitly causes this switching, which is an advantage of threading. This interchanging
allows for some speed-ups and we don't need to worry about the switching ourselves at all!

But the cost of this convenience is that you have to assume a switch can happen at any time. This means we can hop
over to another thread after the first one has finished reading data from memory but hasn't yet stored it back. This
is why we need locks. Threads switch preemptively, the system decides for us.

The limit on threads is the total CPU power we have minus the cost of task switches and synchronization overhead
(locks).

With asynchronous processing, we switch cooperatively, i.e. we use explicit code (the `await` keyword in Python) to
cause a task switch manually. This means that locks and other synchronization are mostly no longer necessary. (In
practice we do still have locks even in async code, but they're much less common, and many people don't even know
about them because they're simply not needed in most cases.)

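Here is a small sketch of what "cooperative" means in practice. The function names `unsafe_increment`,
`safe_increment` and `run` are made up for this example. A task can only be interrupted at an `await`, so an update
with no `await` in the middle needs no lock, while putting an `await` between the read and the write brings the lost
updates right back.

```python
import asyncio


async def unsafe_increment(state: dict) -> None:
    value = state["count"]
    await asyncio.sleep(0)      # explicit switch point between read and write
    state["count"] = value + 1


async def safe_increment(state: dict) -> None:
    value = state["count"]      # no await here, so no other task can run
    state["count"] = value + 1
    await asyncio.sleep(0)      # switching after the update is harmless


async def run(worker, tasks: int = 100) -> int:
    state = {"count": 0}
    # Start `tasks` copies of the worker and wait for all of them
    await asyncio.gather(*(worker(state) for _ in range(tasks)))
    return state["count"]


unsafe_total = asyncio.run(run(unsafe_increment))
safe_total = asyncio.run(run(safe_increment))
print(f"with a switch in the middle: {unsafe_total}, without: {safe_total}")
```
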
With Python's asyncio, the cost of task switches is incredibly low, because tasks are internally built on generators
(awaitables). It is much quicker to resume a generator, which stores all of its state, than to call a pure Python
function, which has to build up a whole new stack frame on every call, whereas a generator already has a stack frame
and picks up where it left off. This makes asyncio by far the cheapest way to handle task switching in Python. In
comparison, you can run hundreds of threads, but tens of thousands of async tasks per second.

This makes async easier to get right than threads, and much faster and lighter-weight in comparison. But nothing is
perfect, and async has its downsides too. One downside is that we have to perform the switches cooperatively, so we
need to add the `await` keyword to our code, but that's not very hard. The much more relevant downside is that
everything we do now has to be non-blocking. We can no longer simply read from a file. We need to launch a task to
read from the file, let it start reading and, when the data is available, go back and pick it up. This means we can't
even use the regular `time.sleep` anymore. Instead, we need its async alternative, `await asyncio.sleep`.

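As a rough illustration of how lightweight this is, the sketch below starts 1,000 tasks on a single thread. `fetch`
and `main` are made-up names, and `asyncio.sleep` stands in for a real non-blocking operation such as a network
request. All tasks wait at the same time, so the whole batch takes only slightly longer than one of them.

```python
import asyncio
import time


async def fetch(index: int) -> int:
    # Hypothetical non-blocking operation; asyncio.sleep stands in for I/O.
    await asyncio.sleep(0.1)
    return index


async def main() -> list:
    # 1,000 tasks wait at the same time on a single thread.
    return await asyncio.gather(*(fetch(i) for i in range(1000)))


start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start

print(f"{len(results)} tasks finished in {elapsed:.2f}s")
```
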
This means that we need a huge ecosystem of support tools that provide asynchronous alternatives to every blocking
synchronous operation, which steepens the learning curve.

### Comparison

- Async maximizes CPU utilization because it has less overhead than threads
- Threading typically works with existing code and tools as long as locks are added around critical sections
- For complex systems, async is much easier to get right than threads with locks
- Threads require very little tooling (locks and queues)
- Async needs a lot of tooling (futures, event loops, non-blocking versions of everything)

## Conclusion

- If you need to run something in parallel, you will need to use multiprocessing, because the GIL prevents parallel
  threads
- If you need to run something concurrently, but not necessarily in parallel, you can use either threads or async
- Threads make more sense if you already have a huge code-base, because they don't require rewriting everything to
  non-blocking versions, you will just need to add some locks and queues
- Async makes more sense if you know you will need concurrency from the start, since it helps to keep everything a lot
  more manageable and it's quicker than threads.