Most HTML templates include a meta tag whose charset attribute is set to UTF-8. What is it?
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="ie=edge">
<title>Document</title>
</head>
<body>
</body>
</html>
Image: Unicode homepage
History - starts with ASCII
ASCII - American Standard Code for Information Interchange
It is a character encoding standard that allows electronic communication between machines.
ASCII is a 7-bit binary system. Each letter that is keyed in is converted into 7 binary digits (bits) and sent over the wire.
With 7 bits, we can represent 128 values: 0-127.
The first 32 values (0-31) are for control codes.
For example:
A - 65: 10 00001
B - 66: 10 00010
a - 97: 11 00001
b - 98: 11 00010
Without the first two binary digits, A and a are both just 1 (00001), and B and b are both 2 (00010). This makes it easy to see a letter's position in the English alphabet.
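To see these bit patterns for yourself, here is a small Python sketch. The helper names ascii_bits and alphabet_position are made up for this post; ord and format are standard built-ins.

```python
def ascii_bits(ch: str) -> str:
    """Return the 7-bit binary string for an ASCII character."""
    code = ord(ch)
    if code > 127:
        raise ValueError(f"{ch!r} is not an ASCII character")
    return format(code, "07b")


def alphabet_position(ch: str) -> int:
    """Keep only the low 5 bits of a letter: A/a -> 1, B/b -> 2, ..."""
    return ord(ch) & 0b11111


for ch in "ABab":
    bits = ascii_bits(ch)
    # Print the character, its code, the first two bits, and the last five bits.
    print(ch, ord(ch), bits[:2], bits[2:], alphabet_position(ch))
```

The first two bits only tell upper case (10) from lower case (11). The last five bits are the same for both cases.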
Image: ASCII table from http://www.asciitable.com/
Other countries created their own encodings. Japan, for example, created multibyte encodings that can include Kanji characters.
This caused incompatibility in communication among machines. If your computer can't decode a message it receives, the message ends up garbled.
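Here is a rough Python sketch of that garbling. Shift_JIS is just one Japanese multibyte encoding picked for illustration, and Latin-1 stands in for "some other encoding the receiver assumes"; neither choice comes from the original video.

```python
text = "日本語"  # the word for "Japanese language", written in Kanji

# The sender encodes the text with a Japanese multibyte encoding.
raw = text.encode("shift_jis")

# A receiver that assumes a different encoding gets garbage characters.
garbled = raw.decode("latin-1")

# A receiver that knows the right encoding recovers the original text.
recovered = raw.decode("shift_jis")

print(len(raw))        # each Kanji takes 2 bytes in Shift_JIS
print(repr(garbled))   # unreadable, one character per byte
print(recovered)
```

The bytes on the wire are identical in both cases. Only the receiver's guess about the encoding differs.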
Unicode - UTF-8
The Unicode Consortium created a standard that aims to cover all the characters in the world.
If we used 32 bits to encode every character directly, characters that fit in ASCII would carry a long prefix of zeros. Each English character would take 4 times as much space as it does in a single 8-bit byte.
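A quick way to see the wasted space, assuming Python's built-in utf-32-be codec as a stand-in for "32 bits per character" (the big-endian variant is used here only because it adds no byte-order mark):

```python
letter = "A"

fixed_width = letter.encode("utf-32-be")  # always 4 bytes per character
utf8 = letter.encode("utf-8")             # 1 byte for ASCII characters

# Show every byte as 8 bits: three all-zero bytes, then the ASCII code for A.
print(" ".join(format(b, "08b") for b in fixed_width))
print(len(fixed_width), len(utf8))
```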
Problems that had to be solved:
1) Get rid of all the wasted zeros in English characters and the rest of the ASCII set.
2) Work with old computer systems that interpret 8 zeros in a row as NULL (end of string) and stop reading.
3) Be backward-compatible, so that machines that understand only basic ASCII can still understand ASCII text written in the new encoding.
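The three requirements above can be checked against Python's built-in UTF-8 codec. This is only a spot check on a few sample strings that I picked, not a proof.

```python
sample = "Aé日"  # one ASCII letter, one accented letter, one Kanji
encoded = sample.encode("utf-8")

# 1) An ASCII character still takes a single byte, with no zero padding.
ascii_size = len("A".encode("utf-8"))

# 2) No all-zero byte appears, so old systems will not see a false end of string.
has_zero_byte = 0 in encoded

# 3) Pure ASCII text produces exactly the same bytes in both encodings.
same_as_ascii = "Hello".encode("utf-8") == "Hello".encode("ascii")

print(ascii_size, has_zero_byte, same_as_ascii)
print(" ".join(format(b, "08b") for b in encoded))
```

The last line prints the bits of each byte, so you can see that every byte of a non-ASCII character starts with a 1.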
A character in the ASCII range stays as a single octet that starts with 0. Any other character starts with a header octet that tells how many octets the character takes. The number of leading 1s in the header is the number of octets in the current character.
For example:
1) First octet starts with 110. Two 1's mean two octets: a header octet and a continuation octet. A continuation octet is marked by 10. The x's are filled with the bits of the number that represents the character being transmitted.
110xxxxx 10xxxxxx
2) First octet starts with 1110. Three 1's mean three octets: a header octet and two continuation octets. Each continuation octet is again marked by 10, and the x's carry the bits of the character's number.
1110xxxx 10xxxxxx 10xxxxxx
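To make the two patterns above concrete, here is a minimal hand-written encoder in Python. The function name encode_utf8_manual is made up for this post, and it is only a sketch: it covers the one-, two-, and three-octet cases and skips the validation a real encoder does. The result is compared against Python's built-in encoder.

```python
def encode_utf8_manual(code_point: int) -> bytes:
    """Encode one code point using the 1-, 2-, or 3-octet patterns."""
    if code_point < 0x80:
        # 0xxxxxxx: plain ASCII, one octet.
        return bytes([code_point])
    if code_point < 0x800:
        # 110xxxxx 10xxxxxx: 5 bits in the header, 6 in the continuation.
        return bytes([
            0b11000000 | (code_point >> 6),
            0b10000000 | (code_point & 0b111111),
        ])
    if code_point < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx: 4 bits in the header, 6 + 6 after it.
        return bytes([
            0b11100000 | (code_point >> 12),
            0b10000000 | ((code_point >> 6) & 0b111111),
            0b10000000 | (code_point & 0b111111),
        ])
    raise ValueError("longer sequences are not covered in this sketch")


for ch in "Aé€":
    manual = encode_utf8_manual(ord(ch))
    bits = " ".join(format(b, "08b") for b in manual)
    # The last column is True when the result matches Python's own encoder.
    print(ch, bits, manual == ch.encode("utf-8"))
```

A takes one octet, é takes two, and € takes three.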
This UTF-8 system solved the 3 problems mentioned above cleverly.
Source: Computerphile on YouTube
Thanks for reading!
Jun