JavaScript: Convert a string to Unicode code points (2 ways)

Unicode code points are the numerical values that identify each character in the Unicode standard, whose code space has room for more than a million code points (0 through 0x10FFFF) covering languages, scripts, symbols, and emojis. Converting a string to Unicode code points is useful for tasks such as encoding, decoding, escaping, or analyzing text data.

This concise, example-based article walks you through two ways to turn a given string into an array of Unicode code points: one for modern JavaScript (ES6 and beyond) and one for classic JavaScript (that can run on old browsers like IE 10). Let’s get started.

Using String.prototype.codePointAt()

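This approach uses the built-in method codePointAt() of the String.prototype object to return the Unicode code point of the character that starts at a given index in the string. Combined with a for-of loop, which iterates over a string by code point rather than by UTF-16 code unit, it can handle any valid Unicode character in the string.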
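To see why the index matters, here is a small sketch (the sample string and variable names are made up for illustration). The emoji 😀 (U+1F600) occupies two UTF-16 code units, so the string has a length of 3, and calling codePointAt() in the middle of the pair returns only the trailing half:

```javascript
// 'a' takes one UTF-16 code unit, '😀' (U+1F600) takes two
const sample = 'a😀';

// length counts UTF-16 code units, not characters
const lengthInCodeUnits = sample.length; // 3

// Index 1 is the start of the surrogate pair: the full code point is returned
const atIndex1 = sample.codePointAt(1); // 128512 (0x1F600)

// Index 2 is the second half of the pair: only the low surrogate is returned
const atIndex2 = sample.codePointAt(2); // 56832 (0xDE00)

console.log(lengthInCodeUnits, atIndex1, atIndex2);
```

This is the reason the steps below iterate with for-of and call codePointAt(0) on each character instead of walking the string index by index.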

The steps to get the job done are:

Declare an empty array to store the output code points.
Use a for-of loop to iterate over each character in the string.
Call codePointAt(0) on the current character to get its code point value.
Push the code point value to the output array using the Array.prototype.push() method.
Return or log the output array.

Words might be confusing. Here’s an example:

// Input string
const str = 'Welcome to Sling Academy!';

// Output array
const codePoints = [];

// Loop over each character in the string
for (const char of str) {
  // Get the code point value of the character
  const codePoint = char.codePointAt(0);
  // Push the code point value to the output array
  codePoints.push(codePoint);
}

// Log the output array
console.log(codePoints);

Output:

[87, 101, 108, 99, 111, 109, 101, 32, 116, 111, 32, 83, 108, 105, 110, 103, 32, 65, 99, 97, 100, 101, 109, 121, 33]
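The sample string above contains only ASCII characters, so it does not exercise the interesting case. As a rough sketch of a more compact variant (the names text and points are illustrative only), Array.from() also iterates a string by code point and accepts a mapping function, so the loop can be collapsed into one call:

```javascript
// A string that includes a supplementary character (U+1F600)
const text = 'Hi 😀';

// Array.from() walks the string by code point and maps each character
const points = Array.from(text, (char) => char.codePointAt(0));

console.log(points); // [72, 105, 32, 128512]
```

The emoji produces a single entry (128512, or 0x1F600) even though it occupies two UTF-16 code units in the string.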

This approach does not work in older browsers or environments that lack ES6 features such as for-of and codePointAt() (Internet Explorer, for example). If you need to support them, the next section of this article is the way to go.

Using String.prototype.charCodeAt() and bitwise operations

This approach is compatible with older browsers and environments that do not support ES6 features. It can also handle any valid Unicode character in the string. The trade-off is that it is more complex and verbose than the preceding technique.

The core idea here is to use the built-in method charCodeAt() of the String.prototype object to return the UTF-16 code unit at a given index in the string. For characters in the Basic Multilingual Plane (BMP), which fit in a single 16-bit code unit, that value is the code point itself. Supplementary characters (code points above 0xFFFF) are stored as two code units, called a surrogate pair, and charCodeAt() returns each half separately. To get the full code point value, the high and low surrogates have to be combined with a little arithmetic (or the equivalent bitwise operations).

The steps are as follows:

Declare an empty array to store the output code points.
Use a classic for loop with an index to iterate over each UTF-16 code unit in the string (for-of is an ES6 feature and is not available in old browsers).
Use the charCodeAt() method with the current index as the argument to get the UTF-16 code unit value.
Check if the code unit value is between 0xD800 and 0xDBFF, which means it is a high surrogate of a supplementary character.
If yes, call charCodeAt() again with the next index to get the low surrogate value (which should be between 0xDC00 and 0xDFFF), then combine the two into a full code point value. The formula is: (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000.
If no, use the code unit value as the code point value directly.
Push the code point value to the output array.
Return or log the output array.
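The formula in the steps above can be checked by hand on a single character. The sketch below uses 😀 (U+1F600), whose surrogate pair is 0xD83D and 0xDE00; the variable names are only for illustration:

```javascript
// Surrogate pair of '😀' as returned by charCodeAt(0) and charCodeAt(1)
var high = '😀'.charCodeAt(0); // 0xD83D
var low = '😀'.charCodeAt(1); // 0xDE00

// (0xD83D - 0xD800) * 0x400 = 0x3D * 0x400 = 0xF400
// (0xDE00 - 0xDC00)         = 0x200
// 0xF400 + 0x200 + 0x10000  = 0x1F600
var combined = (high - 0xd800) * 0x400 + (low - 0xdc00) + 0x10000;

console.log(combined.toString(16)); // "1f600"
```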

Code example:

// Input string
var str = 'Welcome to Sling Academy!';

// Output array
var codePoints = [];

// Loop over each UTF-16 code unit in the string
// (var is used instead of let/const so the code runs in pre-ES6 browsers)
for (var i = 0; i < str.length; i++) {
  // Get the UTF-16 code unit at the current index
  var codeUnit = str.charCodeAt(i);
  // Get the next code unit (NaN if we are at the end of the string)
  var next = str.charCodeAt(i + 1);
  // Check for a high surrogate followed by a valid low surrogate
  if (
    codeUnit >= 0xd800 && codeUnit <= 0xdbff &&
    next >= 0xdc00 && next <= 0xdfff
  ) {
    // Combine them into a full code point value
    var codePoint = (codeUnit - 0xd800) * 0x400 + (next - 0xdc00) + 0x10000;
    // Push the code point value to the output array
    codePoints.push(codePoint);
    // Skip the low surrogate as it is already processed
    i++;
  } else {
    // BMP character or unpaired surrogate: use the code unit value directly
    codePoints.push(codeUnit);
  }
}

// Log the output array
console.log(codePoints);

Output:

[87, 101, 108, 99, 111, 109, 101, 32, 116, 111, 32, 83, 108, 105, 110, 103, 32, 65, 99, 97, 100, 101, 109, 121, 33]

The result is the same as the first approach. However, the code is far longer.
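If you need this logic in more than one place, it can be wrapped in a function. The following is a minimal sketch, not a definitive implementation: the name toCodePoints is hypothetical, and it swaps the arithmetic formula for the equivalent bitwise form (masks and a shift). It sticks to ES5 syntax so that it can run in old browsers:

```javascript
// Hypothetical helper: converts a string to an array of code points (ES5 only)
function toCodePoints(input) {
  var result = [];
  for (var i = 0; i < input.length; i++) {
    var unit = input.charCodeAt(i);
    // NaN at the end of the string; NaN & mask evaluates to 0, so the check fails
    var next = input.charCodeAt(i + 1);
    // The top 6 bits identify a high (0xD800) or low (0xDC00) surrogate
    if ((unit & 0xfc00) === 0xd800 && (next & 0xfc00) === 0xdc00) {
      // Keep the low 10 bits of each half, join them, then add the 0x10000 offset
      result.push((((unit & 0x3ff) << 10) | (next & 0x3ff)) + 0x10000);
      // Skip the low surrogate as it is already processed
      i++;
    } else {
      // BMP character or unpaired surrogate
      result.push(unit);
    }
  }
  return result;
}

console.log(toCodePoints('Hi 😀')); // [72, 105, 32, 128512]
```

An unpaired surrogate is passed through as its raw code unit value here, which matches what codePointAt() does in the first approach.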
