utf32char

4-byte-width (UTF-32) characters and unsigned integers for working with strings

Usage (no npm install needed!)

<script type="module">
  import utf32char from 'https://cdn.skypack.dev/utf32char';
</script>


UTF32Char

A minimalist, dependency-free implementation of immutable 4-byte-width (UTF-32) characters for easy manipulation of characters and glyphs, including simple emoji.

Also includes an immutable unsigned 4-byte-width integer data type, UInt32, and easy conversions to and from UTF32Char.

Motivation

If you want to allow a single "character" of input, but consider emoji to be single characters, you'll have some difficulty using basic JavaScript strings, which use UTF-16 encoding by default. While ASCII characters all have length 1...

console.log("?".length) // 1

...many emoji have length > 1

console.log("ðŸ’Đ".length) // 2

...and with modifiers and accents, that number can get much larger

console.log("!ĖŋĖ‹ÍĨÍĨĖ‚ÍĢĖĖĖÍžÍœÍ–ĖŽĖ°Ė™Ė—".length) // 17

As every Unicode code point can be expressed in the fixed-width UTF-32 encoding, this package mitigates the problem, though it doesn't completely solve it: it accepts any group of one to four bytes as a "single UTF-32 character", whether or not those bytes are rendered as a single grapheme. See this package if you want to split text into graphemes, regardless of the number of bytes required to render each one.
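For reference, standard JavaScript can already read and write whole code points, which is the fixed-width view this package builds on:

// every Unicode code point fits in 32 bits (the maximum is 0x10FFFF)
console.log("💩".codePointAt(0)!.toString(16)) // 1f4a9
console.log(String.fromCodePoint(0x1f4a9))     // 💩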

If you just want a simple, dependency-free API to deal with 4-byte strings, then this package is for you.

This package provides an implementation of 4-byte, UTF-32 "characters" (UTF32Char) and corresponding unsigned integers (UInt32). The unsigned integers have the added benefit of being usable as safe array indices.
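For instance (a minimal sketch, with a hypothetical words array):

import { UInt32 } from "utf32char"

const words: Array<string> = [ "foo", "bar", "baz" ]
const i: UInt32 = UInt32.fromNumber(2)

// a UInt32 is always an integer in [0, 2^32 - 1],
// so it can never be a negative or fractional index
console.log(words[i.toNumber()]) // baz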

Installation

Install from npm with

$ npm i utf32char

Or try it online at npm.runkit.com

var lib = require("utf32char")

let char = new lib.UTF32Char("😮")

Use

Create new UTF32Chars and UInt32s like so

let index: UInt32 = new UInt32(42)
let char: UTF32Char = new UTF32Char("😮")

You can convert to basic JavaScript types

console.log(index.toNumber()) // 42
console.log(char.toString())  // 😮

Easily convert between characters and integers

let indexAsChar: UTF32Char = index.toUTF32Char()
let charAsUInt: UInt32 = char.toUInt32()

console.log(indexAsChar.toString()) // *
console.log(charAsUInt.toNumber())  // 3627933230

...or skip the middleman and convert integers directly to strings, or strings directly to integers:

console.log(index.toString()) // *
console.log(char.toNumber())  // 3627933230
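The packing is easy to check by hand: consistent with the numbers above, the character's two UTF-16 code units appear to occupy the upper and lower 16 bits of the integer. A quick sketch in plain JavaScript:

const s: string = "😮"

// "😮" is the surrogate pair 0xD83D, 0xDE2E in UTF-16;
// packed: 0xD83D * 0x10000 + 0xDE2E = 3627933230
const packed: number = s.charCodeAt(0) * 0x10000 + s.charCodeAt(1)
console.log(packed) // 3627933230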

Edge Cases

UInt32 and UTF32Char ranges are enforced upon object creation, so you never have to worry about bounds checking:

let tooLow: UInt32 = UInt32.fromNumber(-1)
// range error: UInt32 has MIN_VALUE 0, received -1

let tooHigh: UInt32 = UInt32.fromNumber(2**32)
// range error: UInt32 has MAX_VALUE 4294967295 (2^32 - 1), received 4294967296

let tooShort: UTF32Char = UTF32Char.fromString("")
// invalid argument: cannot convert empty string to UTF32Char

let tooLong: UTF32Char = UTF32Char.fromString("hey!")
// invalid argument: lossy compression of length-3+ string to UTF32Char

Because the implementation accepts any 4-byte string as a "character", the following are allowed

let char: UTF32Char = UTF32Char.fromString("hi")
let num: number = char.toNumber()

console.log(num) // 6815849
console.log(char.toString()) // hi
console.log(UTF32Char.fromNumber(num).toString()) // hi
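In hex, that number reads 0x00680069: 'h' (0x68) sits in the upper two bytes and 'i' (0x69) in the lower two:

console.log((6815849).toString(16)) // 680069
console.log(0x68 * 0x10000 + 0x69)  // 6815849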

Floating-point values are truncated to integers when creating UInt32s, like in many other languages:

let pi: UInt32 = UInt32.fromNumber(3.141592654)
console.log(pi.toNumber()) // 3

let squeeze: UInt32 = UInt32.fromNumber(UInt32.MAX_VALUE + 0.9)
console.log(squeeze.toNumber()) // 4294967295

Compound emoji, created using variation selectors and zero-width joiners, are often wider than 4 bytes and will therefore throw errors when used to construct UTF32Chars:

let smooch: UTF32Char = UTF32Char.fromString("👩‍❤️‍💋‍👩")
// invalid argument: lossy compression of length-3+ string to UTF32Char

console.log("ðŸ‘Đ‍âĪïļâ€ðŸ’‹â€ðŸ‘Đ".length) // 11

...but many basic emoji are fine:

// emojiTest.ts
import { UTF32Char } from "utf32char"

let emoji: Array<string> = [ "😂", "😭", "🥺", "🤣", "❤️", "✨", "😍", "🙏", "😊", "🥰", "👍", "💕", "🤔", "👩‍❤️‍💋‍👩" ]

for (const e of emoji) {
  try {
    UTF32Char.fromString(e)
    console.log(`✅: ${e}`)
  } catch (_) {
    console.log(`❌: ${e}`)
  }
}
$ npx ts-node emojiTest.ts
✅: 😂
✅: 😭
✅: 🥺
✅: 🤣
✅: ❤️
✅: ✨
✅: 😍
✅: 🙏
✅: 😊
✅: 🥰
✅: 👍
✅: 💕
✅: 🤔
❌: 👩‍❤️‍💋‍👩

Arithmetic, Comparison, and Immutability

UInt32 provides basic arithmetic and comparison operators

let increased: UInt32 = index.plus(19)
console.log(increased.toNumber()) // 61

let comp: boolean = increased.greaterThan(index)
console.log(comp) // true

Verbose and shortened versions of the comparison functions are available (see the example after this list)

  • lt and lessThan
  • gt and greaterThan
  • le and lessThanOrEqualTo
  • ge and greaterThanOrEqualTo
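For example (a minimal sketch; each pair of forms should behave identically):

const a: UInt32 = UInt32.fromNumber(1)
const b: UInt32 = UInt32.fromNumber(2)

console.log(a.lt(b), a.lessThan(b))             // true true
console.log(b.ge(a), b.greaterThanOrEqualTo(a)) // true true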

Since UInt32s are immutable, plus() and minus() return new objects, which are of course bounds-checked upon creation:

let whoops: UInt32 = increased.minus(100)
// range error: UInt32 has MIN_VALUE 0, received -39
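A quick check of that immutability, reusing the values from above:

// index and increased are unchanged by plus() and minus()
console.log(index.toNumber())     // still 42
console.log(increased.toNumber()) // still 61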

Contact

Feel free to open an issue with any bug reports, or a PR with any fixes or performance improvements.

Support me @ Ko-fi!

Check out my DEV.to blog!