---
title: Removing duplicates from lists
date: 2021-08-31
tags: [programming]
---

In programming, we often need to remove duplicates from an array-like structure. While this may seem like a simple problem with an obvious solution, that's not always the case. It is easy to solve for something like a list of numbers, but much harder when we need to deal with non-hashable objects while keeping the algorithm's runtime relatively low.

For this post, I'll be using Python for the explanations and examples, but the same ideas apply to pretty much every programming language, just with a different syntax.

## Removing hashable duplicates (Simple case)

Let's say we have a list like `[1, 2, 4, 4, 5, 1, 3]`. The easiest and most optimized way to remove all duplicates from this list is to simply convert it into a `set`. In Python, a set is a data-structure that acts a bit like a mapping/dictionary. Sets rely on hashes, and since equal objects should hash to the same value, sets by design prevent the possibility of holding multiple equal elements.

An example of this is as simple as:

```python
duplicate_list = [1, 2, 4, 4, 5, 1, 3]
result = list(set(duplicate_list))
```

This results in a clean list without any duplicate values. Converting the list into a set loops over it once (done internally), which takes `O(n)`; converting back into a list is another `O(n)`, but `O(2n)` is still just `O(n)`. This makes hashing a very compelling solution.

However, this solution does have its issues, specifically the fact that it can't be used for everything. If we had a list such as `[[1, 2, 3], [1, 2, 3]]`, we would have a pretty big problem: the inner lists aren't hashable, so we can't use sets here. The same issue occurs with inner sets or any custom unhashable types.
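
For illustration, trying the set conversion on such a list fails immediately (the error below is what CPython raises):

```python
duplicate_list = [[1, 2, 3], [1, 2, 3]]
set(duplicate_list)  # TypeError: unhashable type: 'list'
```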

## Removing unhashable duplicates

To cope with this problem, we have 5 general options:

1. Do `==` (`__eq__`) comparisons for each value against each other value
2. Compare the exact objects (by memory identity)
3. Convert the elements into their hashable counterparts
4. Generate our own universal hashes for our unhashables
5. Generate specific hashes depending on the objects

**Note about hashes:** in the last 3 options, what we're essentially doing is making an unhashable object hashable. While this is probably fine for our use-case, there is usually a reason why an object isn't hashable (unless the object's author was just careless). That reason is usually that the object is mutable, i.e. it can be changed without making a new object. If the hash is computed from the object's values, mutating the object would change its hash, which breaks the assumptions that hash-based containers rely on.

### Equality comparisons with `__eq__` (the simpler option)

The easiest way is to simply rely on `__eq__` (i.e. the special method Python calls when performing `==` comparisons). This means we don't need hashable objects (objects defining `__hash__`); instead, we need the objects to define this equality check. That isn't a huge issue though, since it is much more common for objects to define equality checks than to be hashable. At least among the built-ins, all of the unhashable objects define it.

We can tackle this by building another `result` list that will only hold each value once. This is the list against which we run the `==` comparisons (for each value in it), to avoid ever inserting a duplicate value into it. Example:

```python
duplicate_list = [1, 1, 2, 4, [1, 2], [2, 1], [1, 2], "foo", "bar"]
result = []
for element in duplicate_list:
    # Go through the whole `result` list and only add the current
    # `element` if it isn't already there.
    if element not in result:
        result.append(element)
```

Using `element in result` internally calls the `__contains__` method of our `result` list object, which is implemented to go through every element of the list and perform an `__eq__` equality check; if one succeeds, it returns early with `True`. In the worst case, it performs this comparison against every element of the list, and only once all elements are depleted can it return `False` (the element wasn't in the `result` list).

This is a very common and simple solution, but it isn't a very efficient one. In the worst case (every element is unique), we go through the whole `duplicate_list` (that's `O(n)`), and for each element we go through the whole `result` list, which keeps growing with every iteration until it reaches the size of `n` (since in the worst case, every element is unique and gets added). This results in another `O(n)` inside of our original `O(n)`, making the algorithm `O(n^2)`.

To resolve the efficiency problem, we could add an optimization where all hashable values go through a `set`, and we only resort to the `__eq__` check for non-hashables. This still has a worst case of `O(n^2)` (all elements are non-hashable), but a best case of `O(n)` (all elements are hashable). This approach is probably the best compromise between a simple implementation and relatively good efficiency; a sketch of it follows.
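
Here is a minimal sketch of this hybrid approach (the variable names are mine, not from any library): hashable elements take the fast `set` path, while unhashable ones fall back to the linear membership check.

```python
duplicate_list = [1, 1, 2, [1, 2], [2, 1], [1, 2], "foo"]

seen_hashable = set()
result = []
for element in duplicate_list:
    try:
        # Fast path: hashable elements get O(1) set lookups
        if element not in seen_hashable:
            seen_hashable.add(element)
            result.append(element)
    except TypeError:
        # Slow path: unhashable elements fall back to == comparisons
        if element not in result:
            result.append(element)
```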

### Compare exact objects

In most programming languages, there is a way to access the memory id of an object; in Python, this is done with the `id()` function. This memory id will always be unique, since we can't have 2 separate objects sharing the same memory id. Essentially, this is the difference between `x is y` and `x == y`: `is` basically does `id(x) == id(y)`, while `==` uses the `x.__eq__(y)` method defined by `x`'s class.

We can use this `id` for our comparisons instead of `__eq__`, and since it will always be unique, we can even use it as a hash of the object. However, we should know that this will result in some weirdness, and this method likely isn't what you'd want for general duplicate removal.

```python
a = "hello"
b = a  # Python will not make a new string object here, it will reuse the same one

print(a == b)  # <- True
print(a is b)  # <- True

c = "hi"
d = "hi"  # This is a new string object, so the ids will be different here

print(c == d)  # <- True
print(c is d)  # <- False
```

**Note:** In many Python implementations, `c is d` will actually also be `True` because of internal memory optimizations (string interning): it is simply more efficient to reuse an existing object if one already exists. This shouldn't be relied on though; with some interpreters it simply won't happen, and we would actually get the shown `False`.

The algorithm itself is relatively simple:

```python
class Foo:
    # A minimal stand-in class, since the original snippet didn't define Foo
    def __init__(self, x):
        self.x = x

x = Foo(x=5)
y = Foo(x=5)
duplicate_list = [x, x, y, 1, 2, "hi", Foo(x=5)]

result = {}
for element in duplicate_list:
    result[id(element)] = element

deduplicated = list(result.values())
```

To preserve the original elements, we used a dict that holds the unique, hashable memory ids of our objects as keys and the actual unhashable objects as values. Once we're done, we just collect all of its values into a list.

The result of this would be `[x, y, 1, 2, "hi", Foo(x=5)]`. *(Note that `x`, `y` and `Foo(x=5)` would actually be printed the same way, since they're instances of the same class, sharing the same `__repr__`.)* From this output we can clearly see that even though `x`, `y`, and `Foo(x=5)` are conceptually the same thing, sharing the same attributes, they're different objects and therefore have different memory ids, so our algorithm didn't remove them. However, there is now only one `x`, because the second one was indeed exactly the same object, so it did get removed.

### Convert unhashables into their hashable counterpart (case-by-case)

We could try to find an alternative hashable data-structure for each of our unhashable ones. With the built-in data-structures, this is completely doable:

- An unhashable `list` can be fully represented by a hashable, immutable `tuple`: tuples are ordered, so a hash of `(1, 2)` will differ from `(2, 1)`, just like it should with lists. Since tuples aren't mutable, they can be hashable.
- An unhashable `set` can be fully represented by a hashable, immutable `frozenset`: this is a Python built-in; even though it's not very well known, it's simply an immutable set, which gives it the possibility to be hashable too.

However, we can't simply represent these objects as their counterparts directly, since this would cause collisions. For example, with a list like `[[1, 2], (1, 2)]`, the list `[1, 2]` would get converted into the tuple `(1, 2)`, making it identical to the tuple that was already there, even though it really shouldn't be. To fix this, we can represent each object with a tuple of `(type(obj), hashable_counterpart(obj))`. For a list, this would look like `(list, tuple(obj))`, with the first element in this tuple being the actual type; real tuples would simply become `(tuple, obj)`, making the conversion back possible once we're finished.

Even though this is a viable solution for the built-in data types, we need to be sure that our list really will only contain these built-ins, and that it will never contain anything else, or if it will, that it's handled appropriately. The full code is entirely dependent on which data types can appear in the list, but a minimal sketch covering just lists and sets is shown below.
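
This sketch is my own minimal illustration of the idea, assuming only flat lists, sets, and otherwise-hashable elements; nested unhashables would need a recursive conversion:

```python
def remove_duplicates(items: list) -> list:
    seen = set()
    result = []
    for obj in items:
        # Represent each element as (type, hashable form) to avoid
        # collisions such as [1, 2] vs (1, 2)
        if isinstance(obj, list):
            key = (list, tuple(obj))
        elif isinstance(obj, set):
            key = (set, frozenset(obj))
        else:
            key = (type(obj), obj)
        if key not in seen:
            seen.add(key)
            result.append(obj)
    return result

print(remove_duplicates([[1, 2], (1, 2), [1, 2], {1, 2}, "foo"]))
# [[1, 2], (1, 2), {1, 2}, 'foo']
```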

This results in `O(n)`, but keep in mind that the conversion into the hashable counterparts can take significant time: we're creating a new object for each unhashable element, which could end up taking longer than the `O(n^2)` method using simple `__eq__` comparisons. This will depend entirely on how the individual cases are handled.

### Generate universal custom hashes for unhashables

If we don't have a strictly defined set of data structures that can be present in our list, we need to handle every possibility. We can devise our own way to generate a unique hash for any object whatsoever; this may sound hard, but it's actually not that complicated.

We can go about this in many ways, but as a simple proof that it's possible: all objects must be stored somehow, and that happens in memory (RAM), a device that can only hold bits. This means that our object is stored somewhere in memory as an array of bytes, and any array of bytes can be interpreted as a simple integer. Since integers are hashable, we're done; we can compute a hash of any object.

Python has a handy library called `pickle` that is used for data serialization (converting data into a stream of bytes that can be written into a file, for storage, or for sending whole Python objects). Since these streams of bytes are hashable, we can simply pickle all of the objects and use the hashes of those byte streams.
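
A minimal sketch of this idea follows; here the pickled bytes themselves serve as the dictionary keys. One caveat: the bytes are compared directly, so two equal objects that happen to serialize to different byte streams would not be detected as duplicates.

```python
import pickle

duplicate_list = [1, 1, [1, 2], [1, 2], {"a": 1}, {"a": 1}]

result = {}
for element in duplicate_list:
    # pickle.dumps gives us a hashable bytes representation of the object
    result.setdefault(pickle.dumps(element), element)

deduplicated = list(result.values())
print(deduplicated)  # [1, [1, 2], {'a': 1}]
```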

Again though, just like in the previous case, we should keep in mind that this process can take a while, especially when using `pickle` on objects with a big amount of internal attributes, not to mention that these attributes may themselves hold a wide amount of other attributes. This could end up being a very intensive process, depending on the individual objects. So even with the `O(n)` benefit of using hashing, this could take longer than the `O(n^2)` equality approach.

### Generate hashes specific to our objects (case-by-case)

Another approach we could take is similar to the above, but rather than using `pickle` and generating a hash from every single aspect of our object, we could use a more specific approach and only use the needed aspects of these objects to make the hash. This is perhaps a bit more similar to the case-by-case representation with a hashable counterpart.

For example, for an object that holds many attributes, but whose uniqueness is determined by only a few of them, we can construct our hash purely from those: `hash((obj.attribute1, obj.attribute2))`. Ideally we would also include the object's class: `hash((obj.__class__, ...))`.
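
A small sketch of what this might look like, using a hypothetical `Point` class (not from the original post):

```python
class Point:
    # A hypothetical class whose uniqueness is determined by x and y
    def __init__(self, x, y):
        self.x = x
        self.y = y

def point_key(obj):
    # Hash built only from the attributes that determine uniqueness,
    # plus the class itself to avoid collisions with other types
    return hash((obj.__class__, obj.x, obj.y))

duplicate_list = [Point(1, 2), Point(1, 2), Point(3, 4)]
result = {}
for element in duplicate_list:
    result.setdefault(point_key(element), element)

deduplicated = list(result.values())  # two points remain: (1, 2) and (3, 4)
```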

This ends up being a lot faster than doing a full serialization of our object with `pickle`. However, since this is a case-by-case approach, it again only works when we have a pre-determined set of classes that can appear in our list; it isn't applicable to everything, since we don't know the unique attributes of every custom class. That's why such custom classes should define their own `__hash__`. And again, since this relies on hashing, it will be `O(n)`.

## Recap

In most cases, the first approach will work just fine (the simple case where our list only holds hashable objects).

If, however, we do need to deal with unhashables, we can either use `==` comparisons, or direct memory ids to ensure that the objects are exactly the same, or we can use some hashable counterpart that can represent our unhashable object. We can also construct a universal hash function by serializing our objects, or a specific one if we know which classes will appear in the list.

Even though we do have ways to deal with unhashables, if you're in control of the classes and they aren't supposed to be mutable, always make sure to add a `__hash__` method to them, so that duplicates can be removed in `O(n)` without any complicated inconveniences.
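
As a closing illustration, here is a minimal sketch of such a class (a hypothetical `Color`, not from the original post) that defines both `__eq__` and `__hash__`, so plain `set()` deduplication works on it directly:

```python
class Color:
    def __init__(self, red, green, blue):
        self.red = red
        self.green = green
        self.blue = blue

    def __eq__(self, other):
        if not isinstance(other, Color):
            return NotImplemented
        return (self.red, self.green, self.blue) == (other.red, other.green, other.blue)

    def __hash__(self):
        # Instances are expected to stay immutable after creation
        return hash((self.__class__, self.red, self.green, self.blue))

colors = [Color(255, 0, 0), Color(255, 0, 0), Color(0, 0, 255)]
unique = list(set(colors))  # O(n), leaves just the red and blue instances
```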