Signs That Read â€å“no Smoking,ã¢â‚¬â â€å“honk Horn to Open,ã¢â‚¬â and â€å“emergency Exit Onlyã¢â‚¬â Specify

Your PHP project probably involves dealing with lots of data coming from different places, such as the database or an API, and every time you need to process it, yous may run into an encoding issue.

This article volition help you prepare for when that happens and meliorate empathize what'southward going on behind the scenes.

An introduction to encoding

Encoding is at the core of whatever programming language, and commonly, we have information technology for granted. Everything works until information technology doesn't, and we get an ugly mistake, such as "Malformed UTF-8 characters, peradventure incorrectly encoded".

To find out why something in the encoding might non piece of work, nosotros first need to understand what we mean by encoding and how it works.

Morse code

Morse lawmaking is a neat way to explain what encoding is about. When it was adult, it was one of the first times in history that a message could be encoded, sent, and then decoded and understood past the receiver.

If we used Morse code to transmit a message, we'd first need to transform our message into dots and dashes (also called brusk and long marks), the but two signals available in this method. Once the message reaches its destination, the receiver needs to transform it from Morse code to English. It looks something like this:

              "Hi" -> Encode("Hullo") -> Transport(".... ..") -> Decode(".... ..") -> "How-do-you-do"                          

This organization was invented effectually 1837, and people manually encoded and decoded the messages. For example,

  • S is encoded as ... (three curt marks)
  • T as - (1 long marking)
  • U as ..- (ii short marks and one long mark)

Here's a radio operator encoding in Morse lawmaking:

radio_operator

On the Titanic, Morse code was used to send and receive letters, including the last one where they were request for help ("CQD" is a distress call).

              SOS SOS CQD CQD Titanic. We are sinking fast. Passengers are being put into boats. Titanic                          

In computer encoding, computers encode and decode characters in a very similar mode. The only difference is that instead of dots and dashes, we accept ones and zeros in a binary lawmaking.

Binary and characters

Equally you probably know, computers merely understand binary lawmaking in 1s and 0s, and so at that place's no such affair as a graphic symbol. It's interpreted by the software you use.

To encode and decode characters into 1s and 0s, we demand a standard way to do information technology then that if I send you lot a bunch of 1s and 0s, you will translate them (decode them) in the same way I've encoded them.

Imagine what would happen if each computer translated binary code into characters and vice-versa in its own way. If you sent a bulletin to a friend, they couldn't see your real bulletin because, for their reckoner, your 1s and 0s would mean something else. This is why we demand to agree on how nosotros transform the characters into binary lawmaking and vice-versa; we need a standard.

Standards

Encoding standards have a long history. We don't need to fully explore the history here, merely it'due south essential to know two pregnant milestones that divers how computers can use encoding, peculiarly with the nascency of the Internet.

ASCII

ASCII, developed in 1963, is one of the first and most important standards, and it is still in utilize (nosotros'll explicate this afterwards). ASCII stands for American Standard Code for Information Interchange. The "American" part is very relevant since it could just encode 127 characters in its beginning version, including the English alphabet and some basic symbols, such equally "?" and ";".

Here's the full table:

ASCII Table Source

Computers tin can't really use numbers. Equally we already know, computers only understand binary code, 1s and 0s, so these values were and then encoded into binary.

For example, "K" is 75 in ASCII, so we can transform it into binary by dividing 75 by 2 and continue until nosotros go 0. If the sectionalization is not verbal, we add i every bit a remainder:

              75 / 2 = 37 + 1 37 / 2 = 18 + 1 18 / 2 =  ix + 0 9 / 2 =   4 + 1 4 / ii =   2 + 0 ii / two =   one + 0 1 / ii =   0 + 1                          

Now, we extract the "remainders" and put in them in inverse order:

And then, in ASCII, "Grand" is encoded as 1001011 in binary.

The main problem with ASCII was that it didn't cover other languages. If you lot wanted to utilize your figurer in Russian or Japanese, you needed a dissimilar encoding standard, which would not be compatible with ASCII.

Have you always seen symbols like "???" or "Ã,ÂÂÂÂ" in your text? They're caused by an encoding trouble. The plan tries to interpret the characters using one encoding method, but they don't represent anything meaningful since it was created with some other encoding method. This is why we needed our 2d big breakthrough, Unicode and UTF-8.

Unicode

The goal in developing Unicode was to take a unique way to transform any character or symbol in any language in the globe into a unique number, zippo more.

If you go to unicode.org, you can expect upward the number for whatsoever character, including emojis!

For instance, "A" is 65, "Y" is 121, and 🍐 is 127824.

The problem is that computers can simply store and deal with binary code, so we still demand to transform these numbers. A variety of encoding systems tin achieve this feat, but nosotros'll focus on the most common one today: UTF-viii.

UTF-8

UTF-8 makes the Unicode standard usable by giving united states an efficient way to transform numbers into binary lawmaking. In many cases, it'southward the default encoding for many programming languages and websites for two crucial reasons:

  • UTF-viii (and Unicode) are compatible with ASCII. When UTF-8 was created in 1993, a lot of data was in ASCII, so past making UTF-eight compatible with information technology, people didn't demand to transform the data before using it. Substantially, a file in ASCII can be treated as UTF-8, and it only works!
  • UTF-8 is efficient. When we shop or send characters through computers, it's important that they don't have upwards too much space. Who wants a 1 GB file when you can have a 256 MB one?

Let's explore how UTF-8 works a bit further and why it has different lengths depending on the character being encoded.

How is UTF-8 efficient?

UTF-eight stores the numbers in a dynamic way. The beginning ones in the Unicode listing take 1 byte, merely the last ones tin accept upwards to 4 bytes, then if you're dealing with an English language file, most characters are likely only taking 1 byte, the same as in ASCII.

This works by covering different ranges in the Unicode spectrum with a different number of bytes.

For instance, to encode any graphic symbol in the original ASCII table (from 0 to 127 in decimals), nosotros simply need seven bits since 2^7 = 128. Therefore, we tin store everything in 1 byte of eight bits, and we however have 1 gratis.

For the next range (from 128 to 2047), nosotros demand 11 $.25 since 2^11 = 2,048, which is 2 bytes in UTF-viii, with some permanent bits to give us some clues. Let's take a look at the full table, and you'll see what I mean:

Variable Width in UTF-8

When reading 1s and 0s in a figurer, nosotros don't take the concept of space betwixt them, so we need a way to say, "here comes this kind of value", or "read 10 $.25 now". In UTF-viii, we achieve this past strategically placing some 1s and 0s.

If y'all're a figurer and read something that starts with 0 in UTF-eight, y'all know that you only need to read 1 byte and show the correct graphic symbol from Unicode in the range of 0-127.

If you encounter 2 1s together, information technology means you need to read two bytes, and y'all're in the range of 128-ii,047. Three 1s together ways that y'all demand to read 3 bytes.

width-utf8-2

Allow's run into a couple of examples:

A grapheme (such as "A") is translated into a number according to the giant Unicode table ("65"). Then, UTF-8 transforms this number into binary code (01000001) following the pattern we've shown.

If we have a character in a higher range, such as the emoji "⚡", which is 9889 according to Unicode, nosotros need 3 bytes:

              11100010 10011010 10100001                          

We can also prove how this works with PHP just for fun:

                              // We first excerpt the hexadecimal value of a string, like "A"                $value                =                unpack                (                'H*'                ,                "A"                );                // Convert it at present from hexadecimal to decimal (just a number)                $unicodeValue                =                base_convert                (                $value                [                1                ],                xvi                ,                10                );                // $unicodeValue is 65                // Now we transform it from base x (decimal) to base of operations 2 (binary)                echo                base_convert                (                $unicodeValue                ,                x                ,                2                );                // 1000001                          

Encoding in PHP

Now that we've taken a wait at how encoding works in general, nosotros tin focus on the essential parts that nosotros ordinarily need to handle in PHP.

A quick annotation on PHP versions

As y'all probably know, PHP has had a bad reputation for quite some time. Withal, thankfully, many of its original flaws were fixed in the more recent versions (from five.X). Therefore, I recommend that you lot apply the most modernistic version you tin to prevent whatsoever unexpected problems.

Where encoding matters in PHP

There are usually three places where encoding matters in a plan:

  • The source code files for your programme.
  • The input you receive.
  • The output you show or shop in a database.

Setting the right default encoding

Since UTF-viii is and so universal, information technology's a adept idea to prepare it as the default encoding for PHP. This encoding is set past default, but if someone has inverse this setting, here'due south how to do it. Become to your php.ini file and add together (or update) the following line:

                              default_charset                =                "UTF-viii"                          

What happens when a string coming in uses a different encoding? Permit's encounter what to do in this case.

Detecting encoding

When we receive a string from reading a file, for example, or in a database, we don't know the encoding, so the first pace is to observe information technology.

Detecting a specific encoding isn't always possible, just we have a practiced take chances with mb_detect_encoding. To use it, nosotros need to pass the string, a listing of valid encodings that yous expect to discover, and whether you want a strict comparison (recommended in well-nigh cases).

Here's an example of a way to determine whether a string is in UTF-eight:

                              mb_detect_encoding                (                $string                ,                'UTF-eight'                ,                true                );                          

With a list of potential encodings, we could pass a string or an assortment:

                              mb_detect_encoding                (                $string                ,                "JIS, eucjp-win, sjis-win"                ,                truthful                );                $array                []                =                "ASCII"                ;                $assortment                []                =                "JIS"                ;                $array                []                =                "EUC-JP"                ;                mb_detect_encoding                (                $string                ,                $array                ,                truthful                );                          

This part will render the detected grapheme encoding or fake if information technology cannot observe the encoding.

Convert to a different encoding

Once it's clear which encoding we're dealing with, the side by side footstep is transforming information technology to our default encoding, ordinarily UTF-eight. Now, this is not always possible since some encodings are non compatible, just we tin can effort the following arroyo:

                              // Convert EUC-JP to UTF-eight                $string                =                mb_convert_encoding                (                $stringInEUCJP                ,                "UTF-8"                ,                "EUC-JP"                );                          

If we desire to car-detect the encoding from a list, we can utilise the following:

                              // Auto notice encoding from JIS, eucjp-win, sjis-win, and so convert str to UTF-viii                $string                =                mb_convert_encoding                (                $str                ,                "UTF-8"                ,                "JIS, eucjp-win, sjis-win"                );                          

We also have another function in PHP called iconv, but since information technology depends on the underlying implementation, using mb_convert_encoding is more reliable and consistent.

Checking that we have the right encoding

Earlier processing or storing whatsoever input, information technology is a good thought to check that we take the string in the right encoding. To attain this, we can utilise mb_check_encoding, and information technology'll return true or false. For example, to bank check that a string is in UTF-8:

                              mb_check_encoding                (                $string                ,                "UTF-8"                );                          

Output in HTML

Since information technology's so common to render some HTML code for a website from PHP, hither's how we can make sure that nosotros prepare the right encoding for the browser. We tin can do it just by sending a header before the output:

                              header                (                'Content-Type: text/html; charset=utf-8'                );                          

A note on databases

Databases are an of import office of handling encoding correctly since they are configured to use one for all the data we have there.

In many cases, they're where we'll store all our strings and from where we'll read them to show them to the user.

I recommend that you make sure that the encoding yous're using for your project is also the same you lot have set in your database to prevent problems in the futurity.

Setting your encoding for the database depends on the database system you lot use, so we tin can't describe every way in this article. However, it makes sense to go to the online docs and meet how we can change information technology. For instance, hither's how to do it with PostgreSQL and with MySQL.

Malformed UTF-8 characters, possibly incorrectly encoded

When transforming an array to JSON with json_encode, yous might run across this result. This merely means that what PHP was expecting to get as UTF-8 is not in that encoding, so we can solve the problem by converting it start:

                              $assortment                [                'name'                ]                =                mb_convert_encoding                (                $array                [                'name'                ],                'UTF-8'                ,                'UTF-8'                );                          

Encoding error in the database

When reading from or writing to a database, you might run across some weird characters, such equally the post-obit:

This mistake is unremarkably a sign that the encoding yous're using to read your string is not the same ane that the database is using. To fix this issue, make certain that you're checking the string'due south encoding before storing information technology and that yous take set the right encoding in your database.

Determination

Encoding is sometimes difficult to sympathise, just hopefully, with this commodity, information technology's a scrap clearer, and yous feel more prepared to fix any errors that might come up your way.

The nearly important lesson to take away is to always remember that all strings have an associated encoding, and then brand sure y'all're using the right one from the kickoff time you lot run across information technology, and utilise the aforementioned encoding in your whole project, including the database and source files. If you lot need to pick ane, selection a modern and common one, such every bit UTF-eight, since information technology will serve you well with any new characters that might announced in the hereafter, and it'due south very well designed.

wardwate1989.blogspot.com

Source: https://www.honeybadger.io/blog/php-character-encoding-unicode-utf8-ascii/

0 Response to "Signs That Read â€å“no Smoking,ã¢â‚¬â â€å“honk Horn to Open,ã¢â‚¬â and â€å“emergency Exit Onlyã¢â‚¬â Specify"

Post a Comment

Iklan Atas Artikel

Iklan Tengah Artikel 1

Iklan Tengah Artikel 2

Iklan Bawah Artikel