dev-resources.site
for different kinds of informations.
Distributed Hash Generation Algorithm in Python โ Twitter Snowflake Approach
Into The Backgroundโฆ
In any software development lifecycle, unique IDs play a major role in manipulating and showcasing data. A lot depends on the uniqueness of each data segment associated with user transactions.
There are several ways to generate unique IDs, but when it comes to generating hashes at a rate of approximately 10,000 hashes per second in a distributed fashion, it becomes a computational challenge.
The Twitter Snowflake approach is a widely used concept in recent advancements in distributed algorithms. Several other approaches can be used for hash generation, which will be covered in a future article!
Some Knowledge on Approachโฆ
To enable distributed hash generation, this algorithm employs the principle of โdivide and conquerโ on a 64-bit hash. The algorithm uses five different parameters to establish uniqueness while generating the hash and ensuring there are no collisions.
The five parameters are as follows:
- Timestamp (41 bits): This is the binary version of the total milliseconds from the difference between the current time and the default epoch set by the party developing this algorithm. (In our example, we have set it to Twitterโs default epoch, i.e., November 4, 2010, at 1:42:54 UTC.)
- Datacenter ID (5 bits): This is the binary version of the Datacenter ID, which can accommodate 2โต datacenters, totaling 32 data centers.
- Machine ID (5 bits): This is the binary version of the Machine ID, which can accommodate 2โต machines per datacenter, totaling 32 machines per datacenter. (Virtual addressing can be introduced to increase the count, but currently, the algorithm limits it to 32 machines.)
- Sequence Number (12 bits): This is the binary format of the sequence number, which counts the number of hashes generated by a machine/process within any millisecond. The sequence number resets to 0 after each millisecond. After each ID generation, the sequence number increments by one if the request is received within the same millisecond.
- Sign Bit (1-bit defaults to 0): The one extra bit in the system is reserved for any custom case while using this algorithm on a large scale. It can be used as parity, to distinguish between signed and unsigned numbers, to set flags, or for any future uses. Each machine in the system involved in generating hashes can run the following templates to generate a hash:
Letโs Codeโฆ
While writing the code for generating a hash, if you donโt have a distributed architecture or the resources to implement one, you can set the default bits of โDatacenter IDโ as โ00000โ and โMachine IDโ as โ00000โ. This scenario is also reflected in the following code:
from datetime import datetime, timedelta
class GenerateHash:
def __init__(self):
self.sequence_timestamp = datetime.utcnow()
self.sequence_number = 0
# Setting default epoch to 04 November 2010 at 1:42:54 UTC Time
self.time_ref = datetime(2010, 11, 4, 1, 42,54)
def generate(self):
## Sign bit: Will be useful for future references
sign_bit = '0'
## Datacenter bit (Default Datacenter: '00000')
datacenter_bit = '0' * 5
## Machine bit (Default Machine: '00000')
mac_bit = '0' * 5
## Timestamp bit
# Getting Current Time in utc
time_now = datetime.utcnow()
# Time Difference since default epoch
time_diff = time_now - self.time_ref
# Total time_diff in miliseconds
ms = int(time_diff.total_seconds() * 1000)
# Miliseconds conversion in binary to 41 bits
ms_bin = format(ms, '41b')
timestamp_bit = ms_bin
## Sequence bit
# Time difference since last request on this machine
time_difference = time_now - self.sequence_timestamp
# If time_difference > 1 ms: reset the sequence number
if time_difference > timedelta(milliseconds=1):
self.sequence_number = 0
# Set the sequence timestamp to current requests timestamp
self.sequence_timestamp = time_now
# Joining all strings
# Also formatting self.sequence_number int -> binary of twelve bytes
final_bin = sign_bit + timestamp_bit + datacenter_bit + mac_bit + format(self.sequence_number, '12b')
# Replacing all spaces with zero as bit
final_bin = final_bin.replace(' ', '0')
# Incrementing the sequence_number
self.sequence_number += 1
# Converting final binary to decimal and decimal to hex value
hex_val = hex(int(final_bin, 2))
# Avoiding extracting string 0x.. generated due to hex() function
final_hex = hex_val[2:]
return final_hex
hash_generator = GenerateHash()
new_hash = hash_generator.generate()
print(new_hash)
Here we have used the โdatetimeโ library to support time operations since we use universal time to employ machines/servers globally.
The Benefits, Concerns, and moreโฆ
Benefits First :)
- We generate alphanumeric hashes in a distributed fashion ensuring consistency and scalability across our systems.
- These hashes are timestamp-based, allowing us to track their creation time, sort them accordingly, and trace clicks on features, among other uses.
- Not only can we control when these hashes are generated, but we can also customize how they are generated. This flexibility enables us to modify or update our hash generation methods as our systems evolve.
- This algorithm supports the generation of at least 10,000 hashes per second.
Some Concerns :(
- The hash IDs are timestamp-based and include specific data center and machine IDs, which makes them vulnerable to the prediction of future generated IDs.
- Due to the predefined guidelines for hash generation and the lengthy hash string, there is limited scope for customization. This constraint can restrict the algorithmโs utility and adaptability in various systems.
Something more to focus onโฆ
- In our current algorithm, we do not have built-in support for clock synchronization in a multi-machine environment. We currently assume that our ID generation servers have synchronized clocks. However, the optimal and widely adopted solution would be to implement the Network Time Protocol (NTP), which we will cover in future articles. So stay tunedโฆ
Thank you so much for getting this far!
Featured ones: