
The Unicode encoding system

Published at: 2/11/2022
Categories: encoding, python, beginners
Author: salemzii

Have you ever wondered what goes on behind the scenes when you type a series of characters on your keyboard? Or when you send an e-mail to a friend in Rwanda who speaks French: how does the same message display correctly on a machine set up for a different language? How does the computer find a numeric equivalent for each character you type? Does the computer understand English? So many questions come to mind. What we can all agree on for sure is that the computer understands only two symbols, "0" and "1", which are referred to as bits. This implies that every letter of the alphabet has to be represented as a number before a computer can store text.
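You can peek at this character-to-number mapping yourself in Python (the snippets in this article are quick illustrative sketches, not part of any particular library):

```python
# Every character is stored as a number; ord() reveals that number
# and bin() shows its underlying bit pattern.
for ch in "Hi!":
    print(ch, ord(ch), bin(ord(ch)))
# H 72 0b1001000
# i 105 0b1101001
# ! 33 0b100001
```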

So how were all these characters incorporated into computer systems? Well, in the early days of computing (the 1960s), the primary means of electronic text communication were teletypes (teleprinters). These machines used a 5-bit encoding, which could represent at most 32 characters (2^5 = 32). The problem with this scheme was that 32 slots were not nearly enough for all the English letters (a-z, A-Z), punctuation marks, digits, and the other characters needed for effective communication.

Introducing ASCII

Due to the limitations of the 5-bit encoding, there was a need for a better, standardized scheme. In October 1960 the American Standards Association (ASA), now the American National Standards Institute (ANSI), began work on ASCII, an acronym for American Standard Code for Information Interchange, with Robert William Bemer (February 8, 1920 - June 22, 2004) among its key contributors. In 1963 the ASA published the first version of ASCII. Unlike its 5-bit predecessor, it was a 7-bit encoding that could hold up to 128 characters (2^7 = 128), numbered 0-127.

So for the English language, with its 26 letters, ASCII had enough slots for both upper and lower cases, the digits 0-9, punctuation marks, and non-printable control codes for teleprinters.
The ASCII table
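You can explore the table yourself with Python's built-in ord() and chr() functions (another small sketch):

```python
# ord() maps a character to its ASCII code; chr() goes the other way.
print(ord("A"))  # 65
print(chr(97))   # 'a'

# The printable portion of the ASCII table spans codes 32-126:
print("".join(chr(i) for i in range(32, 127)))
```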

It was a great improvement, obviously, and in March 1968 then US President Lyndon B. Johnson mandated that computers purchased by the federal government support ASCII as the standard for information interchange. But as with every technology, ASCII had its own bottlenecks, one of which was its inability to represent non-English characters. For European languages that use accented letters, such as German ä and ë or Polish ź, ł and ę, ASCII was not a workable option.
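You can watch ASCII fail on an accented letter directly in Python; this sketch simply asks the built-in ASCII codec to encode one:

```python
# The 7-bit ASCII codec has no slot for 'ä' (its value, 228, is > 127),
# so encoding raises an error instead of producing bytes.
try:
    "ä".encode("ascii")
except UnicodeEncodeError as err:
    print(err)
# 'ascii' codec can't encode character '\xe4' in position 0: ordinal not in range(128)
```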

Unicode to the rescue

Once again there was a need for a more diverse encoding system, one that bridged the disparities in communication and enabled universal inclusion, as every other attempt at tackling the problem only produced a messier patchwork of incompatible encodings. During this period, globalization and internationalization had become core aspects of marketing and distribution, so global inclusion was vital.

So in 1988 Joe Becker, a computer scientist at Xerox and an expert on multilingual computing, proposed an encoding scheme known as Unicode (a name he coined to suggest a unique, unified, universal encoding), in which each character is assigned a unique number known as a code point (the value that a character is given in the Unicode standard). This was a real breakthrough: it applied not to the English language alone, but to every written language in the world. The objective of Unicode was, and still is, to unify all the different encoding schemes so that the confusion between computers is kept to a minimum.
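In Python, ord() gives you the code point of any character, which Unicode convention writes as U+ followed by at least four hex digits (a quick sketch):

```python
# Code points cover every script and symbol, not just English letters.
for ch in ("A", "ä", "ł", "😊"):
    print(f"{ch!r} -> U+{ord(ch):04X}")
# 'A' -> U+0041
# 'ä' -> U+00E4
# 'ł' -> U+0142
# '😊' -> U+1F60A
```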
Today Unicode defines three main encoding forms, compared in the snippet after this list:

  • UTF-8: a variable-width encoding that uses one to four bytes per character (just one byte for the ASCII range), well known for its wide adoption in email systems and on the internet in general

  • UTF-16: as you guessed, uses two bytes per character, or four bytes for characters outside the Basic Multilingual Plane

  • UTF-32: this encoding form uses a fixed four bytes (32 bits) for every character.

Note: UTF stands for Unicode Transformation Format.
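To see the difference between the three forms, you can encode the same characters with each of them in Python; the big-endian variants are used here so the raw bytes are easy to read:

```python
# UTF-8 is variable-width, UTF-16 uses 2 or 4 bytes, UTF-32 always 4.
# (bytes.hex() with a separator argument needs Python 3.8+.)
for ch in ("A", "ä", "😊"):
    for enc in ("utf-8", "utf-16-be", "utf-32-be"):
        print(f"{ch!r:6} {enc:9} -> {ch.encode(enc).hex(' ')}")
# 'A'    utf-8     -> 41
# 'A'    utf-16-be -> 00 41
# 'A'    utf-32-be -> 00 00 00 41
# 'ä'    utf-8     -> c3 a4
# 'ä'    utf-16-be -> 00 e4
# 'ä'    utf-32-be -> 00 00 00 e4
# '😊'   utf-8     -> f0 9f 98 8a
# '😊'   utf-16-be -> d8 3d de 0a
# '😊'   utf-32-be -> 00 01 f6 0a
```

Notice that UTF-8 stores plain ASCII in a single byte, which is a big part of why it dominates the web.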

And that brings us to the end of this article. Of course there is much more to encoding; this was just a quick overview of the broad field of encoding and multilingual text processing.
If you enjoyed this article, kindly leave a comment on what you learned from it. Peace Out :)
