dev-resources.site
for different kinds of informations.
Bloom Filters
Imagine walking into a mobile shop that only sells Samsung and OnePlus phones. You ask the shopkeeper for an iphone. Without wasting any time, he says, âNo, we donât sell iphones.â Now, you change your mind to buy a One Plus model(a rare one). Instead of denying straightaway, the shopkeeper responds, âIt might be in stock, but I canât be 100% sure until I look.â
This is similar to how a Bloom Filter works, it can definitely tell you if something doesnât exist(like an iphone in an android shop), but if it says something might exist, there is a certain uncertainty of the dataâs presence.
To keep it more simple, Bloom Filter can give you a false positive(i.e. return true for something that is false). But it can never give you a false negative(i.e. return false for something that is true).
We all have encountered those warning messages while signing up on some platform that says, âUsername already taken/existsâ. Thatâs Bloom Filters in picture(not always as there are some more methods like querying the database or cache but that becomes inefficient when the user traffic explodes).
Letâs understand this Data Structure bit by bit:
A Bloom Filter is a probabilistic data structure. Now what does that mean?
Probabilistic in context of data structures means that this data structure provides answers with a certain possibility of being correct. This filter can definitely say when an element is not present inside it, but it might be wrong in some cases when it says, âYes, I have this dataâ.
A bloom filter uses a bit array to store the data. Now when I say it stores the data, it doesnât store the actual raw data but applies some hash functions to those data sets and accordingly marks the corresponding positions in the bit array. For example, if you want to check if âKartikâ exists inside a bloom filter it will apply multiple hash functions to this input and then mark their corresponding indexes as 1.
Letâs break this down:
Input â âKartikâ to be added inside a bloom filter.
Hash functions applied: Since a bloom filter uses multiple hash functions, we will consider it has 3 for better understanding.
Positions to be marked: Since this filter uses a bit array(a relatively large size bit arrray) where all indexes are set to 0(switch off) by default, based on the hash functionâs result the positions will be marked as 1(switch on).
Thatâs it. The data is stored now. Go ahead to check if it is present or not.
Below is a visual representation of hash functions being applied to an input:
Now, the outputs of the above hash functions(3, 5, 9) are the positions that need to be set. Below is its visual representation:
As you can see the positions 3,5 and 9 are set to 1.
Checking if the data is present in the filter or not:
You give the input. Letâs say the input is again âKartikâ.
The same hash functions are applied to âKartik.â
On applying the hash functions we would get the same result for obvious reasons which would fetch us the same positions inside the bit array i.e. 3,5 and 9.
The filter checks if these positions in the bit array are set to 1.
Oh, these indexes are already marked as 1(they are on). This means the data might be present.(Username already exists, please try using a different username).
However, if any of the hash function results point to a position that is still 0, the filter can confidently say, âNo, the data is not here.â This is because the same input, when hashed, will always give the same result, so if those positions arenât set, the data was never added.
A Bloom filter can give false positives (saying something might exist when it doesnât), but it can never give false negatives (saying something doesnât exist when it actually does). So, when you check for âKartikâ and all bits are 1, it means âKartikâ might exist. If even one bit is 0, the filter is sure it doesnât exist.
To minimize the number of false positives you carefully need to decide the length of the bit array to be used(within your systemâs capabilities) and also the number of hash functions.
Increasing the number of hash functions exponentially would result in slowing down the process but decreasing them wonât increase the speed but would result in an increased number of false positives.
Read more about how to calculate the above specifications here
Thatâs all that I know about Bloom Filters as of now. I have used it in my personal projects but not on big scale ones so the choice of using this data structure should be entirely yours. Iâm attaching a link where you can experiment on this data structure and find out how many times you get a false positive?
Here you go:
https://llimllib.github.io/bloomfilter-tutorial/
If you read it till here, Iâd like you to please share your story in case you used this data structure in your experience. If not used, do consider giving it a try and if you liked this post, please like, comment and share to increase the reach. You can also follow me for more such content. I post on weekends. Thanks :)
Featured ones: