Logo

dev-resources.site

for different kinds of informations.

Distributed Hash Generation Algorithm in Python โ€” Twitter Snowflake Approach

Published at
7/3/2024
Categories
distributedsystems
softwaredevelopment
hashing
systemdesign
Author
manandoshi1301
Author
14 person written this
manandoshi1301
open
Distributed Hash Generation Algorithm in Python โ€” Twitter Snowflake Approach

Into The Backgroundโ€ฆ

In any software development lifecycle, unique IDs play a major role in manipulating and showcasing data. A lot depends on the uniqueness of each data segment associated with user transactions.

There are several ways to generate unique IDs, but when it comes to generating hashes at a rate of approximately 10,000 hashes per second in a distributed fashion, it becomes a computational challenge.

The Twitter Snowflake approach is a widely used concept in recent advancements in distributed algorithms. Several other approaches can be used for hash generation, which will be covered in a future article!

Some Knowledge on Approachโ€ฆ

To enable distributed hash generation, this algorithm employs the principle of โ€œdivide and conquerโ€ on a 64-bit hash. The algorithm uses five different parameters to establish uniqueness while generating the hash and ensuring there are no collisions.

Twitter Snowflake Hash template

The five parameters are as follows:

  • Timestamp (41 bits): This is the binary version of the total milliseconds from the difference between the current time and the default epoch set by the party developing this algorithm. (In our example, we have set it to Twitterโ€™s default epoch, i.e., November 4, 2010, at 1:42:54 UTC.)
  • Datacenter ID (5 bits): This is the binary version of the Datacenter ID, which can accommodate 2โต datacenters, totaling 32 data centers.
  • Machine ID (5 bits): This is the binary version of the Machine ID, which can accommodate 2โต machines per datacenter, totaling 32 machines per datacenter. (Virtual addressing can be introduced to increase the count, but currently, the algorithm limits it to 32 machines.)
  • Sequence Number (12 bits): This is the binary format of the sequence number, which counts the number of hashes generated by a machine/process within any millisecond. The sequence number resets to 0 after each millisecond. After each ID generation, the sequence number increments by one if the request is received within the same millisecond.
  • Sign Bit (1-bit defaults to 0): The one extra bit in the system is reserved for any custom case while using this algorithm on a large scale. It can be used as parity, to distinguish between signed and unsigned numbers, to set flags, or for any future uses. Each machine in the system involved in generating hashes can run the following templates to generate a hash:

Letโ€™s Codeโ€ฆ

While writing the code for generating a hash, if you donโ€™t have a distributed architecture or the resources to implement one, you can set the default bits of โ€˜Datacenter IDโ€™ as โ€˜00000โ€™ and โ€˜Machine IDโ€™ as โ€˜00000โ€™. This scenario is also reflected in the following code:

from datetime import datetime, timedelta

class GenerateHash:
    def __init__(self):
        self.sequence_timestamp = datetime.utcnow()
        self.sequence_number = 0

        # Setting default epoch to 04 November 2010 at 1:42:54 UTC Time
        self.time_ref = datetime(2010, 11, 4, 1, 42,54)

    def generate(self):        
        ## Sign bit: Will be useful for future references
        sign_bit = '0'

        ## Datacenter bit (Default Datacenter: '00000')
        datacenter_bit = '0' * 5

        ## Machine bit (Default Machine: '00000')
        mac_bit = '0' * 5

        ## Timestamp bit                
        # Getting Current Time in utc
        time_now = datetime.utcnow()

        # Time Difference since default epoch
        time_diff = time_now - self.time_ref

        # Total time_diff in miliseconds
        ms = int(time_diff.total_seconds() * 1000)

        # Miliseconds conversion in binary to 41 bits
        ms_bin = format(ms, '41b')
        timestamp_bit = ms_bin        

        ## Sequence bit
        # Time difference since last request on this machine
        time_difference = time_now - self.sequence_timestamp
        # If time_difference > 1 ms: reset the sequence number
        if time_difference > timedelta(milliseconds=1):
            self.sequence_number = 0

        # Set the sequence timestamp to current requests timestamp 
        self.sequence_timestamp = time_now

        # Joining all strings 
        # Also formatting self.sequence_number int -> binary of twelve bytes
        final_bin = sign_bit + timestamp_bit + datacenter_bit + mac_bit + format(self.sequence_number, '12b')
        # Replacing all spaces with zero as bit
        final_bin = final_bin.replace(' ', '0')

        # Incrementing the sequence_number
        self.sequence_number += 1

        # Converting final binary to decimal and decimal to hex value
        hex_val = hex(int(final_bin, 2))
        # Avoiding extracting string 0x.. generated due to hex() function
        final_hex = hex_val[2:]
        return final_hex

hash_generator = GenerateHash()
new_hash = hash_generator.generate()
print(new_hash)
Enter fullscreen mode Exit fullscreen mode

Here we have used the โ€˜datetimeโ€™ library to support time operations since we use universal time to employ machines/servers globally.

The Benefits, Concerns, and moreโ€ฆ

Benefits First :)

  1. We generate alphanumeric hashes in a distributed fashion ensuring consistency and scalability across our systems.
  2. These hashes are timestamp-based, allowing us to track their creation time, sort them accordingly, and trace clicks on features, among other uses.
  3. Not only can we control when these hashes are generated, but we can also customize how they are generated. This flexibility enables us to modify or update our hash generation methods as our systems evolve.
  4. This algorithm supports the generation of at least 10,000 hashes per second.

Some Concerns :(

  1. The hash IDs are timestamp-based and include specific data center and machine IDs, which makes them vulnerable to the prediction of future generated IDs.
  2. Due to the predefined guidelines for hash generation and the lengthy hash string, there is limited scope for customization. This constraint can restrict the algorithmโ€™s utility and adaptability in various systems.

Something more to focus onโ€ฆ

  • In our current algorithm, we do not have built-in support for clock synchronization in a multi-machine environment. We currently assume that our ID generation servers have synchronized clocks. However, the optimal and widely adopted solution would be to implement the Network Time Protocol (NTP), which we will cover in future articles. So stay tunedโ€ฆ

Thank you so much for getting this far!

hashing Article's
30 articles in total
Favicon
What is Hashing? A Complete Guide for Developers and Security Professionals
Favicon
The Evolution of Hashing Algorithms: From MD5 to Modern Day
Favicon
Pattern 9: Hashing
Favicon
Lithe Hash: Um Mรณdulo Robusto para Hashing Seguro de Senhas
Favicon
The Bcrypt Algorithm for Secure Password Hashing
Favicon
Sorting Algorithms That Use Hash Tables
Favicon
๐–ค๐—‡๐–ผ๐—‹๐—’๐—‰๐—๐—‚๐—ˆ๐—‡ ๐–บ๐—‡๐–ฝ ๐–ง๐–บ๐—Œ๐—๐—‚๐—‡๐—€: ๐–ง๐—ˆ๐— ๐–ณ๐—๐–พ๐—’ ๐–ฏ๐—‹๐—ˆ๐—๐–พ๐–ผ๐— ๐–ธ๐—ˆ๐—Ž๐—‹ ๐–ฃ๐–บ๐—๐–บ ๐–ฃ๐—‚๐–ฟ๐–ฟ๐–พ๐—‹๐–พ๐—‡๐—๐—…๐—’
Favicon
Two Unconventional Ways to store Passwords: Honeywords & Rock Salt
Favicon
Difference Between Encryption and Hashing ๐Ÿ”๐Ÿ”‘
Favicon
Basic File Integrity Monitoring System
Favicon
Lithe Hash: A Robust Module for Secure Password Hashing
Favicon
Comparative Analysis of Password Hashing Algorithms: Argon2, bcrypt, scrypt, and PBKDF2
Favicon
Cryptojs vs. Bcryptjs: Which password hashing method should you trust?
Favicon
Distributed Hash Generation Algorithm in Python โ€” Twitter Snowflake Approach
Favicon
Understanding Bcrypt Rounds: Balancing Security and Performance
Favicon
Wednesday Links - Edition 2023-05-03 ๐Ÿ‡ต๐Ÿ‡ฑ๐Ÿ“œ
Favicon
Understanding Hashing Algorithms: A Beginner's Guide
Favicon
Hashing in Blockchain Technology: How it Ensures Data Integrity
Favicon
Using Hash Codes as Unique Identifiers
Favicon
What is Password Hashing Algorithm?
Favicon
Hashing
Favicon
๐Ÿ”How to encrypt variables in NodeJS
Favicon
Lava lamps securing the Web?๐Ÿคทโ€โ™‚๏ธ
Favicon
What is the difference between encryption, hashing and salting?
Favicon
Security in The Blockchain
Favicon
Hashing Algorithms and creating a simple file integrity monitor (FIM)
Favicon
Hashing, Explain it to me like I'm 5.
Favicon
#Hashing in Java
Favicon
How to hash a password in Go
Favicon
Sony FAIL : why it is absolutely necessary to encrypt passwords

Featured ones: