JavaScript: Convert a string to Unicode code points (2 ways)

Updated: August 5, 2023 By: Khue

Unicode code points are the numerical values that identify each character in the Unicode standard. The standard defines a code space of 1,114,112 code points (U+0000 to U+10FFFF), of which roughly 150,000 are currently assigned to characters from various languages, scripts, symbols, and emojis. Converting a string to Unicode code points can be useful for various purposes, such as encoding, decoding, escaping, or analyzing text data.

This concise, example-based article will walk you through a couple of different ways to turn a given string into an array of Unicode code points in both modern JavaScript (ES6 and beyond) and classic JavaScript (that can run on ancient browsers like IE 10). Without any further ado, let’s get started.

Using String.prototype.codePointAt()

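This approach uses the built-in String.prototype.codePointAt() method, which returns the Unicode code point of the character that starts at a given index in the string. Combined with a for-of loop, which iterates over a string by code point rather than by UTF-16 code unit, it can handle any valid Unicode character in the string, including supplementary characters such as emojis.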

The steps to get the job done are:

  1. Declare an empty array to store the output code points.
  2. Use a for-of loop to iterate over each character in the string. The loop yields whole code points, so a surrogate pair arrives as a single two-unit string.
  3. Call codePointAt(0) on the current character to get its code point value.
  4. Push the code point value to the output array using the Array.prototype.push() method.
  5. Return or log the output array.

Here’s an example that puts these steps together:

// Input string
const str = 'Welcome to Sling Academy!';

// Output array
const codePoints = [];

// Loop over each character in the string
for (const char of str) {
  // Get the code point value of the character
  const codePoint = char.codePointAt(0);
  // Push the code point value to the output array
  codePoints.push(codePoint);
}

// Log the output array
console.log(codePoints);

Output:

[87, 101, 108, 99, 111, 109, 101, 32, 116, 111, 32, 83, 108, 105, 110, 103, 32, 65, 99, 97, 100, 101, 109, 121, 33]
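The sample string above contains only ASCII characters, so it does not show how supplementary characters behave. The sketch below runs the same loop on a string that contains an emoji and also shows a more compact variant based on Array.from(). The test string and the variable names are assumptions chosen for illustration; they are not part of the original example.

```javascript
// Illustrative input: 'A', the emoji U+1F600 (outside the BMP), and '!'
const sample = 'A\u{1F600}!';

// Same for-of loop as above: each iteration yields one whole code point
const viaLoop = [];
for (const char of sample) {
  viaLoop.push(char.codePointAt(0));
}

// Compact variant: Array.from() also iterates a string by code point
const viaArrayFrom = Array.from(sample, (char) => char.codePointAt(0));

console.log(viaLoop);       // [65, 128512, 33]
console.log(viaArrayFrom);  // [65, 128512, 33]
console.log(sample.length); // 4, because the emoji takes two UTF-16 code units
```

Both variants return three code points even though the string's length is 4, since length counts UTF-16 code units rather than characters.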

Both codePointAt() and the for-of loop are ES6 features, so this approach does not work in older browsers or environments, such as Internet Explorer. If you need to support them, the next section of this article is the way to go.

Using String.prototype.charCodeAt() and bitwise operations

This approach is compatible with older browsers and environments that do not support ES6 features. It can also handle any valid Unicode character in the string. The trade-off is that it is more complex and verbose than the preceding technique.

The core idea here is to use the built-in String.prototype.charCodeAt() method, which returns the UTF-16 code unit at a given index in the string. For characters in the Basic Multilingual Plane (BMP), which occupy a single 16-bit code unit, that value is the code point itself. Supplementary characters occupy two code units (a surrogate pair), so charCodeAt() returns each half separately. To get the full code point value, the high and low surrogates must be combined with some arithmetic, which is equivalent to a bit shift and a mask.
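To make the surrogate-pair behavior concrete, here is a small sketch of what charCodeAt() returns for a supplementary character. The emoji U+1F600 is an illustrative choice, written as its two escaped code units so the snippet also runs in pre-ES6 environments.

```javascript
// Illustrative input: the emoji U+1F600, written as its surrogate pair
var emoji = '\uD83D\uDE00';

// One visible character, but two UTF-16 code units
console.log(emoji.length); // 2

// charCodeAt() returns each half of the surrogate pair separately
var high = emoji.charCodeAt(0);
var low = emoji.charCodeAt(1);
console.log(high.toString(16)); // 'd83d' (high surrogate, 0xD800 to 0xDBFF)
console.log(low.toString(16));  // 'de00' (low surrogate, 0xDC00 to 0xDFFF)

// A BMP character fits in one code unit, so the value is the code point itself
console.log('A'.charCodeAt(0)); // 65
```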

The steps are as follows:

  1. Declare an empty array to store the output code points.
  2. Use a classic for loop to iterate over the string by index, one UTF-16 code unit at a time.
  3. Call charCodeAt() with the current index to get the UTF-16 code unit value.
  4. Check if the code unit value is between 0xD800 and 0xDBFF, which means it is a high surrogate of a supplementary character.
  5. If yes, call charCodeAt() again with the next index to get the low surrogate value (between 0xDC00 and 0xDFFF), then combine the two into a full code point value. The formula is: (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000. Skip the next index, since it has already been processed.
  6. If no, use the code unit value as the code point value directly.
  7. Push the code point value to the output array.
  8. Return or log the output array.
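As a worked example of the formula in step 5, the sketch below plugs in the surrogate pair of the emoji U+1F600 (an illustrative choice) and compares the arithmetic form with an equivalent bitwise form. The variable names are assumptions made for this sketch.

```javascript
// Surrogate pair for U+1F600 (illustrative values)
var high = 0xd83d;
var low = 0xde00;

// Formula from step 5: 0x3D * 0x400 + 0x200 + 0x10000
var viaArithmetic = (high - 0xd800) * 0x400 + (low - 0xdc00) + 0x10000;

// Equivalent bitwise form: keep the low 10 bits of each half,
// shift the high part left by 10 bits, then add the 0x10000 offset
var viaBitwise = (((high & 0x3ff) << 10) | (low & 0x3ff)) + 0x10000;

console.log(viaArithmetic.toString(16)); // '1f600'
console.log(viaBitwise.toString(16));    // '1f600'
```

Multiplying by 0x400 is the same as shifting left by 10 bits, which is why the two forms agree.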

Code example:

// Input string
var str = 'Welcome to Sling Academy!';

// Output array
var codePoints = [];

// Loop over each UTF-16 code unit in the string
// (var instead of let/const so the code also runs in pre-ES6 browsers)
for (var i = 0; i < str.length; i++) {
  // Get the UTF-16 code unit value at the current index
  var codeUnit = str.charCodeAt(i);
  // Get the next code unit (NaN when there is none)
  var nextUnit = str.charCodeAt(i + 1);
  // Check for a high surrogate that is followed by a low surrogate
  if (
    codeUnit >= 0xd800 && codeUnit <= 0xdbff &&
    nextUnit >= 0xdc00 && nextUnit <= 0xdfff
  ) {
    // Combine them into a full code point value
    var codePoint =
      (codeUnit - 0xd800) * 0x400 + (nextUnit - 0xdc00) + 0x10000;
    // Push the code point value to the output array
    codePoints.push(codePoint);
    // Skip the next code unit as it is already processed
    i++;
  } else {
    // BMP character (or unpaired surrogate): use the code unit value directly
    codePoints.push(codeUnit);
  }
}

// Log the output array
console.log(codePoints);

Output:

[87, 101, 108, 99, 111, 109, 101, 32, 116, 111, 32, 83, 108, 105, 110, 103, 32, 65, 99, 97, 100, 101, 109, 121, 33]

The result is the same as with the first approach, but the code is considerably longer.
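Because the sample string is pure ASCII, the surrogate branch never runs in the example above. As a rough sketch, the same logic can be wrapped in a function and tried on a string that contains an emoji. The function name toCodePoints and the test strings are assumptions made for illustration. The sketch also assumes that an unpaired surrogate should be kept as its raw code unit value, which matches what codePointAt() returns in that case.

```javascript
// Hypothetical helper name; ES5-compatible (no let/const, no for-of)
function toCodePoints(str) {
  var result = [];
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    var next = str.charCodeAt(i + 1); // NaN past the end of the string
    var isPair =
      unit >= 0xd800 && unit <= 0xdbff && next >= 0xdc00 && next <= 0xdfff;
    if (isPair) {
      // Combine the high and low surrogates into one code point
      result.push((unit - 0xd800) * 0x400 + (next - 0xdc00) + 0x10000);
      i++; // skip the low surrogate
    } else {
      result.push(unit); // BMP character or unpaired surrogate
    }
  }
  return result;
}

console.log(toCodePoints('A\uD83D\uDE00!')); // [65, 128512, 33]
console.log(toCodePoints('\uD83D'));         // [55357] (unpaired high surrogate kept as-is)
```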