unzalgo

Transforms ť͈͓̠̓̆ͅḥ͇͔̟̭̰̤̏̔̐̑ì̭̜͔ͯ͞s̈́̄̑͋̋͏̟͓̞͖͝ͅ into this without breaking internationalization

Installation

$ npm install unzalgo

About

You can use unzalgo to both detect Zalgo text and transform it back into normal text without breaking internationalization. For example, you could transform:

T͘H͈̩̬̺̩̭͇I͏̼̪͚̪͚S͇̬̺ ́E̬̬͈̮̻̕V҉̙I̧͖̜̹̩̞̱L͇͍̝ ̺̮̟̙̘͎U͝S̞̫̞͝E͚̘͝R IṊ͍̬͞P̫Ù̹̳̝͓̙̙T̜͕̺̺̳̘͝

into

THIS EVIL USER INPUT

while also keeping

thiŝ te̅xt unchanged, since some lângûaĝes aĉtuallŷ uŝe thêse sŷmbo̅ls,

and, at the same time, keeping all diacritics in

Z nich ovšem pouze předposlední sdílí s výše uvedenou větou příliš žluťoučký kůň úpěl […]

which remains unchanged after a transformation.

Is there a demo?

Yes! You can check it out here. You can edit the text at the top; the lower part shows the text after cleaning with the default detection threshold.

How does it work?

In Unicode, every character is assigned to a character category. Zalgo text uses characters that belong to the categories Mn (Mark, Nonspacing) or Me (Mark, Enclosing).

First, the text is divided into words. Each word is then assigned a score that reflects how heavily it uses characters from the categories above, combined with a small amount of statistics. If the score exceeds a threshold, the word is classified as Zalgo text, and all characters from the above categories can be stripped from it.
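The idea above can be sketched in a few lines of JavaScript. This is a simplified illustration and not unzalgo's actual implementation: sketchScore and sketchClean are hypothetical names, the score is assumed to be a plain ratio of Mn/Me code points (without the statistics mentioned above), words are assumed to be separated by whitespace, and the threshold comparison is assumed to be inclusive.

```javascript
// Matches a single code point in the categories Mn or Me.
const COMBINING_MARK = /[\p{Mn}\p{Me}]/u;

// Hypothetical helper: ratio of Mn/Me code points to all code points
// in one word, measured on the decomposed (NFD) form.
function sketchScore(word) {
    const codePoints = [...word.normalize("NFD")];
    if (codePoints.length === 0) {
        return 0;
    }
    const marks = codePoints.filter(cp => COMBINING_MARK.test(cp)).length;
    return marks / codePoints.length;
}

// Hypothetical helper: strip marks only from words whose score reaches
// the threshold; all other words are returned untouched.
function sketchClean(text, threshold = 0.55) {
    return text
        .split(/(\s+)/) // the capture group keeps the whitespace separators
        .map(word => sketchScore(word) >= threshold
            ? word.normalize("NFD").replace(/[\p{Mn}\p{Me}]/gu, "").normalize("NFC")
            : word)
        .join("");
}

// "français" decomposes to 9 code points, 1 of which is a mark (1/9 ≈ 0.11),
// so it stays below the assumed default threshold and is left alone.
console.log(sketchClean("z\u0300\u0301\u0302 français")); // "z français"
```

Because the score is computed per word, a heavily decorated word can be stripped while a neighbouring word with a single legitimate diacritic is preserved.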

Getting started

Regular cleaning

import assert from "node:assert";
import { clean } from "unzalgo";
assert("this" === clean("ť͈̓̆h̏̔̐̑ì̭ͯ͞s̈́̄̑͋"));

Configuring detection

import assert from "node:assert";
import { clean } from "unzalgo";
/* Clean only if there are no "normal" characters in the word (t, h, i and s are "normal") */
assert("ť͈̓̆h̏̔̐̑ì̭ͯ͞s̈́̄̑͋" === clean("ť͈̓̆h̏̔̐̑ì̭ͯ͞s̈́̄̑͋", {
    detectionThreshold: 1
}));
/* Clean only if there is at least one combining character */
assert("francais" === clean("français", {
    detectionThreshold: 0
}));
/* `français` remains intact by default */
assert("français" === clean("français"));

Internationalization

import assert from "node:assert";
import { isZalgo } from "unzalgo";
/* "français" is not a Zalgo text, of course */
assert(isZalgo("français") === false);
/* Unless you define the Zalgo property as containing combining characters */
assert(isZalgo("français", 0) === true);
/* You can also define the Zalgo property as consisting of nothing but combining characters */
assert(isZalgo("français", 1) === false);

Detection threshold

Some of this library's functions accept a detectionThreshold option that lets you configure how sensitively unzalgo behaves. detectionThreshold is a number from 0 to 1 and defaults to 0.55.

A detection threshold of 0 indicates that a string should be classified as Zalgo text if at least 0 % of its codepoints have the Unicode category Mn or Me.

A detection threshold of 1 indicates that a string should be classified as Zalgo text if at least 100 % of its codepoints have the Unicode category Mn or Me.
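To make the two endpoints concrete, here is a small self-contained sketch. It does not call unzalgo: markRatio and meetsThreshold are hypothetical helpers, the ratio is assumed to be the plain share of Mn/Me code points in the decomposed string, and "at least" is read as >=.

```javascript
// Hypothetical helper: share of Mn/Me code points in the NFD form of a string.
function markRatio(text) {
    const codePoints = [...text.normalize("NFD")];
    const marks = codePoints.filter(cp => /[\p{Mn}\p{Me}]/u.test(cp));
    return codePoints.length === 0 ? 0 : marks.length / codePoints.length;
}

// Hypothetical helper: assumed inclusive comparison against the threshold.
const meetsThreshold = (text, detectionThreshold = 0.55) =>
    markRatio(text) >= detectionThreshold;

const word = "français"; // 9 code points after decomposition, 1 of them a mark

console.log(meetsThreshold(word, 0)); // true: any ratio is at least 0
console.log(meetsThreshold(word));    // false: 1/9 is below 0.55
console.log(meetsThreshold(word, 1)); // false: not every code point is a mark
```

Under these assumptions, only a string made up entirely of combining characters reaches a threshold of 1.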

Exports

clean(string[, options]): string [default export]

Removes all combining characters from every word in a string that is classified as Zalgo text. If targetDensity is specified, not all of the Zalgo characters are removed. Instead, they are thinned out uniformly.

Returns a cleaned, more readable string.

Arguments:

  • string: string A string for which combining characters are removed for every word whose Zalgo property is met.
  • options: object An object of options.
  • options.detectionThreshold: number = 0.55 A threshold ∈ [0, 1]. The higher the threshold, the more combining characters are needed for it to be detected as Zalgo text.
  • options.targetDensity: number = 0 A density ∈ [0, 1]. The higher the density, the more Zalgo characters will be part of the resulting string. The result is guaranteed to have a Zalgo-character density that is less than or equal to the one provided. A target density of 0 indicates that none of the combining characters should be part of the resulting string. A target density of 1 indicates that all combining characters should be part of the resulting string.
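One way to picture the targetDensity option is the sketch below. It is an assumption-laden illustration rather than the library's algorithm: sketchThin is a hypothetical name, density is assumed to mean kept marks divided by all remaining code points of a word, and "uniformly" is assumed to mean evenly spaced over the original mark positions.

```javascript
// Matches a single code point in the categories Mn or Me.
const MARK = /[\p{Mn}\p{Me}]/u;

// Hypothetical helper: thin out the combining marks of one word so that
// the share of marks in the result does not exceed targetDensity.
function sketchThin(word, targetDensity = 0) {
    const codePoints = [...word.normalize("NFD")];
    const markCount = codePoints.filter(cp => MARK.test(cp)).length;
    const baseCount = codePoints.length - markCount;

    // Largest number of marks to keep with keep / (base + keep) <= targetDensity.
    let keep = markCount;
    while (keep > 0 && keep / (baseCount + keep) > targetDensity) {
        keep--;
    }

    // Keep exactly `keep` marks, spread evenly over the original mark positions.
    let seen = 0;
    return codePoints.filter(cp => {
        if (!MARK.test(cp)) {
            return true; // base characters are always kept
        }
        const kept = Math.floor((seen + 1) * keep / markCount)
            > Math.floor(seen * keep / markCount);
        seen++;
        return kept;
    }).join("").normalize("NFC");
}

console.log(sketchThin("a\u0300\u0301\u0302\u0303b", 0)); // "ab"
```

With a target density of 0.5, the same input keeps two of its four marks, because two marks next to two base characters give a density of exactly 0.5.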

computeScores(string): number[]

Computes a score ∈ [0, 1] for every word in the input string. Each score represents the ratio of Zalgo characters to total characters in a word.

Returns an array of scores where each score describes the Zalgo ratio of a word.

Arguments:

  • string: string The input string for which to compute scores.

isZalgo(string[, detectionThreshold = 0.55]): boolean

Determines if the string consists of Zalgo text. Note that the occurrence of a combining character is not enough to trigger the detection. Instead, it computes a ratio for the input string and checks if it exceeds a given threshold. Thus, internationalized strings aren't automatically classified as Zalgo text.

Returns whether the string is a Zalgo text string.

Arguments:

  • string: string A string for which a Zalgo text check is run.
  • detectionThreshold: number = 0.55 A threshold ∈ [0, 1]. The higher the threshold, the more combining characters are needed for it to be detected as Zalgo text.