C++ Character Encoding: char vs wchar_t vs Unicode

The char type in C++ has an interesting history, tied to both the development of the C programming language (from which C++ was derived) and the evolution of character encoding standards, especially ASCII and, later, Unicode. Here's a brief overview of that history.

In the beginning: 8-bit character sets

The char type was first introduced in the C programming language in the early 1970s as a way to represent single-byte characters. C's char type is based on ASCII (American Standard Code for Information Interchange), which was the dominant character encoding standard at the time. ASCII uses 7 bits to represent characters, but the char type in C is typically 8 bits (1 byte) to allow for additional control characters and potential extensions beyond ASCII.
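
The following minimal sketch (standard C++ only) shows what this means in practice: a char holds a one-byte integer, and an ASCII character literal is simply that character's numeric code.

#include <iostream>

int main() {
    char c = 'A';   // the ASCII code 65 stored in a single byte
    std::cout << "sizeof(char) = " << sizeof(char) << " byte" << std::endl;   // always 1 by definition
    std::cout << "'A' as an integer: " << static_cast<int>(c) << std::endl;   // 65

    const char* greeting = "Hi";   // a C-style string: 'H', 'i', then the null terminator '\0'
    std::cout << greeting << std::endl;
    return 0;
}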

When C++ was developed by Bjarne Stroustrup in the early 1980s as an extension of C, the char type was adopted from C without modification. As in C, char was used in C++ to represent single-byte characters, typically to store ASCII values or arrays of characters (strings) terminated by the null character ('\0'). As computer use spread around the world, the need to represent characters from languages other than English led to the development of many 8-bit character sets that start with ASCII and then assign the remaining 128 code values to characters from other languages. These include the ISO 8859 character sets, the Windows code pages and the MacOS character sets, among others.

  • ISO 8859-1 (Latin-1): Supports Western European languages (e.g., English, French, Spanish, German)
  • ISO 8859-2 (Latin-2): Supports Central and Eastern European languages (e.g., Polish, Czech, Hungarian)
  • ISO 8859-5: Supports Cyrillic alphabet languages (e.g., Russian, Bulgarian)
  • ISO 8859-7: Support for the Greek alphabet
  • MacRoman: The standard character set for Western systems on older Macintosh computers
  • Windows-1252: Most common Windows code page, often used as the default in Western systems
  • Windows-1250: Similar to ISO 8859-2 for Central and Eastern European languages
  • Windows-1251: Supports Cyrillic script for Russian and other languages
  • Windows-1253: Support for the Greek alphabet
  • Windows-1255: Supports the Hebrew alphabet and right-to-left script
  • EBCDIC: A character encoding used by IBM mainframes (not based on ASCII)

The problem with this approach is that you have to pick one character set to use, which makes it impossible to represent characters from languages that character set does not support. It also makes it difficult to exchange documents among different computers. If you create a document using MacRoman, it won't display correctly on a Windows system using code page Windows-1252, on a Linux system that uses ISO Latin-1, or on an IBM mainframe.
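
The following minimal sketch makes the ambiguity concrete. The same byte, 0xE9, is 'é' in ISO 8859-1/Windows-1252, while commonly published tables map it to a Cyrillic letter in Windows-1251 and a Greek letter in ISO 8859-7; treat the specific mappings in the comments as illustrative rather than authoritative.

#include <iostream>

int main() {
    unsigned char b = 0xE9;   // one raw byte; nothing in the byte says which character set to use

    std::cout << "Byte 0x" << std::hex << static_cast<int>(b) << " means:" << std::endl;
    std::cout << "  ISO 8859-1 / Windows-1252: e with acute accent (U+00E9)" << std::endl;
    std::cout << "  Windows-1251 (Cyrillic):   short i (U+0439)" << std::endl;
    std::cout << "  ISO 8859-7 (Greek):        small iota (U+03B9)" << std::endl;
    // The reader of the byte has to guess (or be told) the character set,
    // which is exactly why exchanging 8-bit text between systems was fragile.
    return 0;
}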

The Solution: A Universal Character Set

To solve the problem created by the proliferation of 8-bit character sets, a universal character set was needed. The history of Unicode is tied to this need for a universal character encoding system capable of representing the world’s diverse scripts and symbols.

The idea of a universal character encoding standard was first proposed by Joe Becker of Xerox in 1987, and soon after, a group of engineers from Xerox and Apple began working on it. These engineers included Joe Becker, Lee Collins, and Mark Davis. The working group grew over the following years, and in 1991 the Unicode Consortium was formally incorporated with the goal of developing a standard encoding system that could be applied consistently across different platforms and languages.

As with 8-bit character sets such as ASCII or ISO Latin-1, the idea was to assign a numeric code to each character of all the known languages. They started with the 128 ASCII characters plus the 128 additional ISO Latin-1 characters to minimize the conversion effort for Western documents. Then they continued to add new codes for new languages, one language at a time. This necessitated using more than just 8 bits per character.

In 1991, the first official version of the Unicode Standard (Unicode 1.0) was released. It defined a 16-bit encoding system that could represent up to 65,536 characters (64K characters). The initial release included characters for many of the world's most common scripts, such as Latin, Greek, Cyrillic, Hebrew, Arabic, and a large set of Chinese characters.

As the rise of the internet and the increasing need for cross-platform communication made the limitations of traditional encodings more apparent, Unicode became attractive to tech companies, especially for internationalization and supporting multiple languages in software and websites. Because Unicode 1.0 used codes that fit in 16 bits, many languages designed in the 1990s, like Java and JavaScript, assumed that 16 bits was the final answer for how big a character type should be.

Wide Characters: 16-bit character sets

The wchar_t type was introduced in ANSI C (C89), later ratified as ISO C90, as a "wide character" type to represent characters that do not fit in a single byte. wchar_t is a data type intended to handle character encodings that require more than 8 bits. It was defined as a separate type from char, primarily for supporting non-ASCII character encodings, such as Unicode or locale-specific multibyte character sets. Like other C intrinsic types, its size is implementation-defined, meaning it can vary across platforms. Common implementations use either 16 bits or 32 bits for wchar_t.

When C++ was standardized in 1998 (C++98), it inherited wchar_t from C, but made it a distinct built-in type rather than a typedef as in C. However, its size was still not standardized, so different platforms continued to use different sizes. For example:

  • Windows typically defined wchar_t as 16 bits
  • Linux/Unix systems generally used 32 bits

Along with the introduction of wchar_t, C++ also added wide-character string literals using the prefix L, e.g., L"wide character string", and wide-character streams like std::wcout for outputting wide characters.
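
Here is a minimal sketch of wide characters in use; the reported size and whether the text displays correctly depend on the platform and the active locale.

#include <iostream>

int main() {
    // sizeof(wchar_t) is implementation-defined: commonly 2 bytes on Windows, 4 on Linux and macOS.
    std::wcout << L"sizeof(wchar_t) = " << sizeof(wchar_t) << L" bytes" << std::endl;

    // A wide-character string literal uses the L prefix.
    const wchar_t* wide = L"wide character string";
    std::wcout << wide << std::endl;
    return 0;
}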

Like the raw char type in C++, wchar_t does not mandate any particular character set. This flexibility allows it to work with Latin character sets as well as more complex character sets used for Japanese, Chinese or Farsi. But because a character set is not specified for wchar_t, it causes the same confusion and difficulties that we have with the different character sets used with the char type. The solution: Unicode.

Encoding Schemes: Beyond 16 bits

As Unicode adoption grew, it became clear that the original 16-bit space would not be enough to represent all characters, especially given the vast number of East Asian ideographs and historical scripts. Unicode 2.0 was released in 1996 and expanded the encoding beyond the 16-bit limit.

Up until Unicode 2.0, a character set, i.e. the character symbols and the integers that represent them, was all that you needed to know. Character sets that could fit into 8 bits were stored in 8-bit integers. Character sets that needed more than 8 bits but no more than 16 bits were stored as 16-bit integers.

Unicode 2.0 introduced encoding forms that can be used to store Unicode characters as 8-bit, 16-bit or 32-bit values. The encoding form is distinct from the Unicode character set. One Unicode character can be represented as a single 32-bit value, as one or two 16-bit values, or as one to four 8-bit values. To differentiate between the numerical value the Unicode character set assigns to each character and the numerical values a specific encoding scheme assigns to that character, we need different terms. We call the numerical value that Unicode assigns each character that character's codepoint, and the values produced by the encoding scheme code units. Codepoints are encoded to code unit streams using either the UTF-8, UTF-16 or UTF-32 encoding forms.

  • UTF-8: A variable-length encoding that uses 1 to 4 bytes per codepoint. It is backward compatible with ASCII, making it ideal for use in systems originally designed for ASCII
  • UTF-16: A variable-length encoding that uses 2 or 4 bytes per codepoint. It uses 2 bytes for the BMP (Basic Multilingual Plane), i.e. the first 65,536 codepoints, and 4 bytes for all higher codepoints
  • UTF-32: A fixed-length encoding that uses 4 bytes for every codepoint, simplifying some processing but being less space-efficient

This expansion allowed Unicode to cover more scripts and symbols, including ancient scripts and large sets of ideographic characters from Chinese, Japanese, and Korean.
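
To see the codepoint/code unit distinction in code, here is a minimal sketch (it assumes a C++20 compiler for std::u8string) that stores the single codepoint U+1F600, the grinning-face emoji, in each of the three encoding forms and counts the resulting code units.

#include <iostream>
#include <string>

int main() {
    // One codepoint, U+1F600, stored in the three Unicode encoding forms.
    std::u8string  utf8  = u8"\U0001F600";
    std::u16string utf16 = u"\U0001F600";
    std::u32string utf32 = U"\U0001F600";

    std::cout << "UTF-8 code units:  " << utf8.size()  << std::endl;   // 4 one-byte code units
    std::cout << "UTF-16 code units: " << utf16.size() << std::endl;   // 2 code units (a surrogate pair)
    std::cout << "UTF-32 code units: " << utf32.size() << std::endl;   // 1 code unit
    return 0;
}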

The Rise of Unicode

The web and software localization accelerated the adoption of Unicode in the 2000s. Browsers, programming languages, and operating systems began adopting Unicode as the default character encoding. UTF-8 became the dominant encoding for web content due to its backward compatibility with ASCII and its efficiency in encoding characters. By the early 2000s, most major software platforms (such as Windows, macOS, and Linux) supported Unicode natively, allowing text in multiple languages to be processed and displayed seamlessly.

Unicode continued to expand to cover more characters, adding new languages, scripts, symbols, and emoji with each new version. The standard now includes not only modern scripts but also historical scripts, mathematical symbols, emoji, and many other types of characters. Unicode 15.0 (released in 2022) includes over 149,000 characters across 161 scripts, covering a wide array of languages, symbols, and emoji. Fictional scripts such as Tengwar (J.R.R. Tolkien's Elvish script), Cirth (Tolkien's Dwarvish runes) and Klingon (Star Trek) have been proposed for inclusion but are not part of the official standard.

How to Use Unicode in C++

C++11 and C++20 introduced new types specifically for dealing with UTF-8, UTF-16 or UTF-32 Unicode encodings.

char8_t (Introduced in C++20)

  • Purpose: Represents characters encoded in UTF-8
  • Size: 8 bits (1 byte)
  • Usage: This type is used when working with UTF-8 encoded text. Each char8_t represents one byte of a UTF-8 sequence. ASCII characters can be represented by one char8_t, but all others require a variable-length encoding consisting of multiple char8_t (2 to 4 char8_t per code point).
  • Literals: UTF-8 string literals are written with the u8 prefix, for example: u8"hello"
  • Note: Before C++20, UTF-8 strings were typically stored in std::string using char, but with C++20, char8_t was introduced to provide explicit distinction for UTF-8 text.
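
A minimal sketch of the variable-length encoding (C++20 is assumed for char8_t); the hex values in the comments are the standard UTF-8 encoding of U+00E9.

#include <iostream>
#include <string>

int main() {
    std::u8string s = u8"\u00E9";   // 'é' (U+00E9) encoded in UTF-8
    std::cout << "code units: " << s.size() << std::endl;   // 2
    for (char8_t cu : s)
        std::cout << std::hex << static_cast<unsigned>(cu) << ' ';   // c3 a9
    std::cout << std::endl;
    return 0;
}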

char16_t (Introduced in C++11)

  • Purpose: Represents characters encoded in UTF-16
  • Size: 16 bits (2 bytes)
  • Usage: This type is used when working with UTF-16 encoded text, where each code unit is 16 bits. Basic Multilingual Plane (BMP) characters can be represented with a single char16_t, but characters beyond the BMP (like emojis) require surrogate pairs (two char16_t values).
  • Literals: UTF-16 string literals are written with the u prefix, for example: u"hello"
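
A minimal sketch of the surrogate-pair behavior described above:

#include <iostream>
#include <string>

int main() {
    std::u16string bmp   = u"\u00E9";       // 'é' is inside the BMP: one char16_t code unit
    std::u16string emoji = u"\U0001F600";   // the emoji is outside the BMP: a surrogate pair
    std::cout << "BMP character:    " << bmp.size()   << " code unit(s)" << std::endl;   // 1
    std::cout << "Beyond-BMP emoji: " << emoji.size() << " code unit(s)" << std::endl;   // 2
    return 0;
}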

char32_t (Introduced in C++11)

  • Purpose: Represents characters encoded in UTF-32
  • Size: 32 bits (4 bytes)
  • Usage: This type is used when working with UTF-32 encoded text, where each char32_t represents a single Unicode codepoint directly, without the need for variable length encodings.
  • Literals: UTF-32 string literals are written with the U prefix, for example: U"hello"
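
A minimal sketch showing one codepoint per char32_t code unit; the cast is needed because streaming a char32_t directly is disallowed (deleted) in C++20.

#include <iostream>
#include <string>

int main() {
    std::u32string s = U"\u00E9\U0001F600";   // 'é' followed by the emoji: two codepoints
    std::cout << "code units: " << s.size() << std::endl;   // 2, one per codepoint
    for (char32_t cp : s)
        std::cout << "U+" << std::hex << static_cast<unsigned long>(cp) << ' ';   // U+e9 U+1f600
    std::cout << std::endl;
    return 0;
}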

So we have intrinsic types for the code units of UTF-8, UTF-16 and UTF-32, and we have string literal syntax for strings in all three encoding forms, so how do we manipulate the various types of Unicode strings? Unfortunately, there isn't much support for working with Unicode strings in the C++ standard library. Fortunately, there is a good third-party library, Boost.Text, that helps us out.

UnicodeStrings.cpp
#include <iostream>
#include <string>
#include <ranges>
#include <boost/text/transcode_algorithm.hpp>
#include <boost/text/normalize_algorithm.hpp>
#ifdef _WIN32
#include <windows.h>
#endif

/******************************************************************************
 * Boost
 * https://www.boost.org/users/download/
 * run bootstrap.sh or bootstrap.bat
 * run
 * > ./b2 install

 * Boost Text
 * https://github.com/tzlaine/text  (Click code button to download ZIP)
 * Move include/boost/text folder to 
 *    /usr/local/include/boost                (on Linux or MacOS)
 *    C:\Boost\include\boost-1_86\boost       (on Windows)
 * 
 * On Windows Microsoft C++ compiler add compiler options
 * /source-charset:utf-8 /I C:/Boost/include/boost-1_86 /std:c++latest
 *****************************************************************************/

#define utf8AsString(utf8Str) (*reinterpret_cast<std::string*>(&(utf8Str)))

int main() {
    // Set Windows console to UTF-8 code page
    // Linux and macOS terminals are already UTF-8
#ifdef _WIN32
    SetConsoleOutputCP(CP_UTF8);
#endif
    /*******************
     * UTF-8
     ******************/
    std::cout << "Sizeof char8_t: " << sizeof(char8_t) << std::endl;
    // UTF-8 encoded string
    std::u8string utf8Str = u8"Hello, World! – ¡Hola, Mundo! – Olá, Mundo! – こんにちは、世界! – 你好, 世界! – שלום, עולם! 😀";

    // Output the size of the string (in bytes)
    std::cout << "UTF-8 string size: " << utf8Str.size() << " bytes" << std::endl;

    // Display the string (you'll need to convert to std::string or cast if necessary)
    std::cout << "UTF-8 string: " << utf8AsString(utf8Str) << std::endl;
    std::string displayStr(utf8Str.begin(), utf8Str.end());  // Converting to std::string
    std::cout << "UTF-8 string: " << displayStr << std::endl;

    /*******************
     * UTF-16
     ******************/
    std::cout << "Sizeof char16_t: " << sizeof(char16_t) << std::endl;
    // UTF-16 encoded string
    std::u16string utf16Str = u"Hello, World! – ¡Hola, Mundo! – Olá, Mundo! – こんにちは、世界! – 你好, 世界! – שלום, עולם! 😀";

    // Output the size of the string (in char16_t units)
    std::cout << "UTF-16 string size: " << utf16Str.size() << " units = "
        << (utf16Str.size() * sizeof(char16_t)) << " bytes" << std::endl;

    // Display the string (conversion required to output)
    // Create a UTF-8 string
    std::string utf16to8Str;

    // Convert UTF-16 to UTF-8 using Boost.Text's transcode algorithm
    boost::text::transcode_to_utf8(utf16Str.begin(), utf16Str.end(), std::back_inserter(utf16to8Str));

    // Output the result
    std::cout << "UTF-16 string: " << utf16to8Str << std::endl;
    
    /*******************
     * UTF-32
     ******************/
    std::cout << "Sizeof char32_t: " << sizeof(char32_t) << std::endl;
    // UTF-32 encoded string
    std::u32string utf32Str = U"Hello, World! – ¡Hola, Mundo! – Olá, Mundo! – こんにちは、世界! – 你好, 世界! – שלום, עולם! 😀";

    // Output the size of the string (in char32_t units)
    std::cout << "UTF-32 string size: " << utf32Str.size() << " units = "
        << (utf32Str.size() * sizeof(char32_t)) << " bytes" << std::endl;

    // UTF-32 strings are easier to manipulate since each character is one 32-bit unit.
    // Create a UTF-8 string
    std::string utf32to8Str;

    // Convert UTF-32 to UTF-8 using Boost.Text's transcode algorithm
    boost::text::transcode_to_utf8(utf32Str.begin(), utf32Str.end(), std::back_inserter(utf32to8Str));

    // Output the result
    std::cout << "UTF-32 string: " << utf32to8Str << std::endl;

    /*******************
     * Unicode Codepoints
     ******************/
    // Transcode the UTF-8 string to 32-bit Unicode codepoints to perform string operations.
    std::u32string codepoints;
    boost::text::transcode_to_utf32(utf8Str.begin(), utf8Str.end(), std::back_inserter(codepoints));
    for (long cp : codepoints) {
        std::cout << "U+" << std::hex << cp << " ";   // space-separated list of codepoints
    }
    std::cout << std::endl;
    /*******************
     * Unicode Normalization
     ******************/
    std::u8string str8A = u8"café"; // precomposed é
    std::u8string str8B = u8"cafe\u0301"; // "e" with combining accent
    std::u16string str16A = u"café"; // precomposed é
    std::u16string str16B = u"cafe\u0301"; // "e" with combining accent
    std::u32string str32A = U"café"; // precomposed é
    std::u32string str32B = U"cafe\u0301"; // "e" with combining accent

    std::cout << "UTF8 strings equal: " << (str8A == str8B) << std::endl;
    std::cout << "UTF16 strings equal: " << (str16A == str16B) << std::endl;
    std::cout << "UTF32 strings equal: " << (str32A == str32B) << std::endl;

    // Normalize both strings using Boost.Text to NFC
    std::u8string str8ANorm;
    std::u8string str8BNorm;
    // Normalize codepoints to NFC (composed normal form)
    boost::text::normalize_append<boost::text::nf::c>(str32A.begin(), str32A.end(), str8ANorm);
    boost::text::normalize_append<boost::text::nf::c>(str32B.begin(), str32B.end(), str8BNorm);

    std::cout << "Normal UTF8 strings equal: " << (str8ANorm == str8BNorm) << std::endl;

    /* 
    Note: had to modify boost/text/view_adaptor.hpp to change 
    std::range::range_adaptor_closure to std::__range_adaptor_closure
    */
    return 0;
}
Sizeof char8_t: 1
UTF-8 string size: 128 bytes
UTF-8 string: Hello, World! – ¡Hola, Mundo! – Olá, Mundo! – こんにちは、世界! – 你好, 世界! – שלום, עולם! 😀
UTF-8 string: Hello, World! – ¡Hola, Mundo! – Olá, Mundo! – こんにちは、世界! – 你好, 世界! – שלום, עולם! 😀
Sizeof char16_t: 2
UTF-16 string size: 82 units = 164 bytes
UTF-16 string: Hello, World! – ¡Hola, Mundo! – Olá, Mundo! – こんにちは、世界! – 你好, 世界! – שלום, עולם! 😀
Sizeof char32_t: 4
UTF-32 string size: 81 units = 324 bytes
UTF-32 string: Hello, World! – ¡Hola, Mundo! – Olá, Mundo! – こんにちは、世界! – 你好, 世界! – שלום, עולם! 😀
U+48 U+65 U+6c U+6c U+6f U+2c U+20 U+57 U+6f U+72 U+6c U+64 U+21 U+20 U+2013 U+20 U+a1 U+48 U+6f U+6c U+61 U+2c U+20 U+4d U+75 U+6e U+64 U+6f U+21 U+20 U+2013 U+20 U+4f U+6c U+e1 U+2c U+20 U+4d U+75 U+6e U+64 U+6f U+21 U+20 U+2013 U+20 U+3053 U+3093 U+306b U+3061 U+306f U+3001 U+4e16 U+754c U+21 U+20 U+2013 U+20 U+4f60 U+597d U+2c U+20 U+4e16 U+754c U+21 U+20 U+2013 U+20 U+5e9 U+5dc U+5d5 U+5dd U+2c U+20 U+5e2 U+5d5 U+5dc U+5dd U+21 U+20 U+1f600
UTF8 strings equal: 0
UTF16 strings equal: 0
UTF32 strings equal: 0
Normal UTF8 strings equal: 1

Linux and MacOS terminals display UTF-8 by default, but the Windows console does not use the UTF-8 code page by default. That is why the example program begins with Windows-specific code that sets the console output code page to UTF-8.

The program has three sections that demonstrate how to use each of the UTF-8, UTF-16 and UTF-32 encoding forms in C++. Each section starts by printing the size of a code unit using sizeof(). Then we create a string variable and initialize it with a corresponding string literal. The source file is encoded in UTF-8, but the string literals are stored in the generated code as UTF-8, UTF-16 or UTF-32 depending on the string literal prefix used: u8, u or U. Depending on your compiler, you may have to indicate that the source file is UTF-8 encoded; for example, Microsoft's C++ compiler requires the /source-charset:utf-8 command line switch. Then we use the size() member of the string object to print the number of code units in the string. The number of code units times the sizeof() of the code unit type gives us the total number of bytes used to store the string.

For the UTF-16 and UTF-32 strings, we then use boost::text::transcode_to_utf8() to convert them into UTF-8 encoded strings so that we can print them to std::cout. Even though a std::u8string holds exactly the same bytes as a UTF-8 encoded std::string, we can't print a std::u8string directly. You can copy it byte by byte into a std::string, but since both types are instantiated from the same std::basic_string<> template using equivalent one-byte types, you can also just cast it to a std::string. That is what the utf8AsString() macro defined at the top of the program does with a reinterpret_cast.

Finally, we conclude with an example of converting a UTF-8 string to Unicode codepoints (UTF-32) so that we can work with Unicode directly. However, in Unicode, a codepoint doesn't always represent a standalone character in the usual sense. While many codepoints do correspond to individual characters, some are modifiers, such as accents or umlauts, that combine with other codepoints to form composite characters. The complexity increases because Unicode includes both precomposed characters (like ö, an "o" with an umlaut, U+00F6) and decomposed forms (an "o" followed by a combining umlaut, U+006F followed by U+0308). Although linguistically equivalent, these variations are encoded differently, which makes comparing and ordering Unicode text challenging. We usually have to rely on third-party libraries for these Unicode functions.
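
A minimal sketch of the problem, using the ö example above and only standard C++:

#include <iostream>
#include <string>

int main() {
    std::u32string precomposed = U"\u00F6";    // ö as the single codepoint U+00F6
    std::u32string decomposed  = U"o\u0308";   // 'o' followed by the combining diaeresis U+0308
    // Linguistically the same text, but the codepoint sequences differ,
    // so a plain comparison reports them as unequal.
    std::cout << std::boolalpha
              << "equal without normalization: " << (precomposed == decomposed) << std::endl;   // false
    return 0;
}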

With boost::text we can convert a sequence of Unicode codepoints to a number of different normal forms. The most common are NFC, which converts all decomposed forms into their composed equivalents, and NFD, which converts composed forms into their decomposed equivalents. Once normalized, the strings can be compared for equality. The normalize_append<>() function is used to convert to a normal form. It takes a template parameter, which is either boost::text::nf::c for NFC or boost::text::nf::d for NFD. The function arguments are the begin and end iterators of a container of Unicode codepoints, such as a u32string, followed by a u8string that receives the result of the normalization.
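
As a complement to the NFC example in the program above, here is a minimal sketch of the NFD direction, using the same normalize_append<>() call and assuming the same Boost.Text setup described in the program's header comment.

#include <string>
#include <boost/text/normalize_algorithm.hpp>

int main() {
    // Decompose the precomposed 'é' (U+00E9) into 'e' (U+0065) plus the combining acute accent (U+0301).
    std::u32string precomposed = U"caf\u00E9";
    std::u8string nfd;
    boost::text::normalize_append<boost::text::nf::d>(precomposed.begin(), precomposed.end(), nfd);
    // nfd now holds the UTF-8 encoding of "cafe" followed by U+0301:
    // one more codepoint than the NFC form, and six bytes instead of five.
    return 0;
}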