Previously, we discussed hashing briefly under the section IP hashing-based selection in the article โ Load Balancing. However, it might be a little tricky for some which is why this is an article with the basic concept of hashing.
What is hashing?
The internet has been growing in its content rapidly. The content on the internet doubles every six months making it very difficult to find a particular one. The usual array and linked lists used to store data lead to a hard time when the data is large.
Thus, Enters HASHING.
A certain data of large size in an array takes O(logn) times in a sorted one and O(n) in the unsorted one. It might feel easier if the sorted array uses special techniques like binary search but if itโs not, we might have to use linear search. However, we cannot be sure if the error margin will be minimized. Hence, hashing is used so that we can have the constant time of O(1) on the operations. This is a far more effective method since the time of operation is independent of the data size n.
Map-The data structure
The word โmapโ means the relation of two sets โ (key, value). The relation between the key and the value is that given a key, you can use a function to get a value. The function is called a hash function.
For a simple understanding, letโs imagine an array where the key is the index, and the value of that index is the required value. The array X with the key โiโ gives the value from X[i]. The difference from the concept of an array on hashing is that the concept of a hash table is more generalized in comparison. The latter can have any value as a key instead of having an integer all the time. It can even be a name or an object times according to your requirement. The only trick here is to have a valid hash function to form an index that can store the object at a particular location from where we can later find the object easily.
Example: -
Letโs imagine a set of strings {โpragyaโ, โsapkotaโ, โmediumโ}. Now we need to store these strings on a table. We need to have a quick update and the operation time should be O(1). We canโt be responsible for the ordering of the operation or even the maintenance.
Now, letโs think of a schema where we number the alphabets like a=1, b=2, c=3,. โฆ.., z=26.
So, letโs have a number for each string.
โpragyaโ = 16+18+1+7+25+1 = 68
โsapkotaโ = 19+1+16+11+15+20+1 = 83
โmediumโ = 13+5+4+9+21+13 = 65
If you have a table of size 4, we can use the hash function as
sum mod 4
and thus place it on the table for the table.
โpragyaโ = 68 mod 4 = 0
โsapkotaโ = 83 mod 4 = 3
โmediumโ = 65 mod 4 = 1
The table will look as below: -
0 |
1 |
2 |
3 |
pragya |
medium |
|
sapkota |
As you can notice from above that the idea is computing a location for a string with the help of a hash function. Here, the hash function is sum of the characters mod Table size
.
The way of searching a string with the use of the value is a very effective way of storing words in a dictionary.
Hashing: The problems
On the above strings {โpragyaโ, โsapkotaโ, โmediumโ} we used the lowercase letters which have the numbers 68, 83, and 65 respectively. However, if we have the permutations of the letters such as โaygarpโ for โpragyaโ or โsaPkotAโ for โsapkotaโ or โMediumโ for โmediumโ โ the numbers will the same as 68, 83, and 65. This leads to storing more than one string in one location which ultimately leads to what we call on hashing terms โ COLLISION.
Likewise, another problem with hashing is determining the size of a hash table. The size is preferred to be a prime number to prevent collisions.
All of this leads us to two points as follows: -
The hash function is supposed to be easy to compute and the one to avoid the maximum collisions.
Consistent Hashing
If you have five servers and one of the servers dies due to some reason, the hashing function wonโt be able to know it by itself. And if you then add a replacement server, it wonโt know about it too. But the mod operator still stays 5 despite one server being removed and the other being added. For solving this problem, we have Consistent Hashing.
The concept of consistent hashing was brought to reduce the problem of hashing. However, it doesnโt eliminate those problems. This may not sound much in small-scale data but for large data โ this is very important. The function applies to the requests and the servers to result in output in a set range of values. It reduces the problem of failed servers where the traffic still gets routed and the new server that has no allocations.
Conclusion
Hashing can be an intricate topic when we learn system design. But we can divide each of the terms and mini topics to learn them individually in detail. Consistent hashing is one of the important ones to study and understand so that we understand other system design concepts like load balancing.
A repo with some system design concepts.
System Design
Systems design is the process of defining elements of a system like modules, architecture, components and their interfaces and data for a system based on the specified requirements.
This is a index for the concepts of system.
If you wish to open these in a new tab, Press CTRL+click
Thank you!!!
Show some โค๏ธ by starring this repository!
I hope this article was helpful to you.
Please donโt forget to follow me!!!
Any kind of feedback or comment is welcome!!!
Thank you for your time and support!!!!
Keep Reading!! Keep Learning!!!