Transpiling voku/portable-ascii: a transliteration story, with no ICU on the JS side

voku/portable-ascii is the library that powers Str::ascii in Laravel and, through it, Str::slug, and it sits under a long tail of code that needs to fold a non-ASCII string down to its closest ASCII representation. It is the substrate behind much of the slug, search-key, and filename-sanitization code in the PHP ecosystem. Six thousand lines of mapping tables, a thin API on top, and a test suite that asserts the output character by character against fixed strings.

On paper the conversion is trivial: ship the mapping tables, port the wrapper, done. In practice the library leans on PHP's Transliterator class for the inputs that the static tables do not cover, and Transliterator is an ICU binding. ICU is the one piece of PHP's runtime that has no clean equivalent in JavaScript. This post is the story of mapping the landscape, picking the option that gets the tests to 100% today, and reserving the harder option for the customers that need full parity.

What portable-ascii does

The headline function is ASCII::to_ascii($input, $language). It takes a string in any script (Latin with diacritics, Cyrillic, Greek, Arabic, Han, Hiragana, Katakana, and so on) and returns a best-effort ASCII rendering. For most scripts it walks a giant per-language mapping table and emits the table's substitution for each character that has an entry. For scripts where a static table is not enough, the input is handed to PHP's Transliterator::create('Any-Latin; Latin-ASCII') first, then folded the rest of the way with the mapping tables.
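The two-stage shape described above can be sketched in a few lines. This is a minimal illustration, not the library's code: the table is a hypothetical seven-entry sample, and the names foldWithTable, preTransliterate, and unknown are assumptions made for the example.

```typescript
// Hypothetical miniature of one per-language table; the real tables are far larger.
const GERMAN_SAMPLE: Record<string, string> = {
  "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
  "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
};

// Optional first stage, standing in for the ICU-backed transform.
type PreTransliterate = (input: string) => string;

function foldWithTable(
  input: string,
  table: Record<string, string>,
  preTransliterate: PreTransliterate = (s) => s, // identity when no first stage is needed
  unknown: string = "?",                         // assumed placeholder for unmapped characters
): string {
  let out = "";
  for (const ch of preTransliterate(input)) {
    if (ch.charCodeAt(0) < 0x80) {
      out += ch; // already ASCII, pass through untouched
    } else if (Object.prototype.hasOwnProperty.call(table, ch)) {
      out += table[ch]; // static table hit
    } else {
      out += unknown; // neither stage knew this character
    }
  }
  return out;
}

console.log(foldWithTable("Größe", GERMAN_SAMPLE)); // "Groesse"
```

Passing a function as the third argument models the case where a transform runs before the table fold; with the default identity function only the static table applies.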

The library also exposes a stack of helpers around the core conversion: to_filename, to_slugify, clean, normalize_msword, normalize_whitespace, remove_invisible_characters, and a long list of utf8-specific checks. Most of these are pure-PHP and ported without any runtime help. The interesting work is at the Transliterator boundary, and the rest of the post focuses there.

The transliteration npm package

The first option, and the one Pext currently ships, is the transliteration npm package. It is the most widely used JS transliteration library by a wide margin: low millions of weekly downloads, used by Strapi, Directus, and a long tail of slug-generating CMS code. The mapping tables are derived from the unihan dataset and a hand-curated layer on top.

It works very well for slug generation, which is what almost everyone uses it for. It does not implement ICU's transliteration semantics. The two are close, and for ASCII-heavy or Latin-script inputs the difference is invisible. For scripts where the romanization rules carry real ambiguity (Japanese kana with small kana modifiers, Chinese pinyin with tone and word-boundary handling, Korean Hangul with syllable-level vs jamo-level decomposition), it diverges. Two examples that come up in the portable-ascii suite:

Japanese, the kana cluster キャ:

- 'kyanpasu'
+ 'kiyanpasu'

Chinese, with the pinyin-with-word-boundaries transform:

- '  zhong wen kong bai'
+ '  ZhongWenKongBai'

The minus line is the expected value: what PHP's Transliterator, backed by ICU, produces for the input under the corresponding transform ID. The plus line is what the transliteration package produces by default for the same input. Both are defensible romanizations. They are not identical. For a CMS-style slug they would both round-trip through a URL just fine; for a library whose contract is "match what PHP produces", the gap is real.
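The two romanizations of キャ differ in whether the small ャ is treated as a syllable of its own or as a modifier of the preceding kana. The sketch below contrasts the two strategies on キャンパス. It uses a hypothetical five-entry table and is not either library's implementation; the function names are invented for the example.

```typescript
// Tiny hypothetical katakana table, just enough for キャンパス ("campus").
const KANA: Record<string, string> = {
  "キ": "ki", "ャ": "ya", "ン": "n", "パ": "pa", "ス": "su",
};

// Small kana that contract with a preceding syllable ending in "i".
const SMALL_KANA: Record<string, string> = { "ャ": "ya", "ュ": "yu", "ョ": "yo" };

// Strategy 1: every kana is romanized in isolation.
function perCharacter(input: string): string {
  let out = "";
  for (const ch of input) out += KANA[ch] ?? ch;
  return out;
}

// Strategy 2: a small kana merges into the previous syllable,
// so "ki" followed by small "ya" becomes "kya" rather than "kiya".
function clusterAware(input: string): string {
  const chars = input.split(""); // fine for this sample: all characters are in the BMP
  let out = "";
  for (let i = 0; i < chars.length; i++) {
    const base = KANA[chars[i]] ?? chars[i];
    const next = i + 1 < chars.length ? chars[i + 1] : "";
    const small = SMALL_KANA[next];
    if (small !== undefined && base.length > 1 && base.charAt(base.length - 1) === "i") {
      out += base.slice(0, -1) + small; // drop the trailing "i", append the glide
      i++;                              // the small kana has been consumed
    } else {
      out += base;
    }
  }
  return out;
}

console.log(perCharacter("キャンパス")); // "kiyanpasu"
console.log(clusterAware("キャンパス")); // "kyanpasu"
```

The second strategy needs one character of lookahead, which a flat per-character table cannot express.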

Node's embedded ICU

The obvious next thought: Node already embeds ICU. node -p process.versions.icu reports the version, the Intl object is backed by it, and the binary itself links the full library at build time. Surely the transliteration tables are sitting right there, waiting to be exposed.

They are sitting right there, and they are not exposed. Node's surface area on top of ICU is exactly what ECMA-402 specifies: Intl.Collator, Intl.DateTimeFormat, Intl.NumberFormat, Intl.Locale, Intl.Segmenter, and the rest. Transliterator is not in ECMA-402 and is not on the V8 or Node roadmap. The C++ Transliterator::createInstance exists in the linked ICU and is unreachable from JavaScript without a native addon that pokes at the embedded copy directly. That path used to exist, and that is the third option below.
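The gap is easy to confirm from a script. A minimal check, assuming a Node-like runtime; the process lookup is guarded so the snippet also runs where process does not exist:

```typescript
// Names the runtime actually exposes on the Intl namespace object.
const intlMembers: string[] = Object.getOwnPropertyNames(Intl);

// ECMA-402 defines no Transliterator, so this is expected to be false.
const hasTransliterator: boolean = intlMembers.indexOf("Transliterator") !== -1;

// process.versions.icu is Node-specific; fall back to undefined elsewhere.
const icuVersion: string | undefined =
  (globalThis as any).process?.versions?.icu;

console.log("Embedded ICU version:", icuVersion ?? "unknown");
console.log("Intl members:", intlMembers.join(", "));
console.log("Intl.Transliterator exposed:", hasTransliterator);
```

The ICU version string shows the library is linked in, while the member list shows that only the ECMA-402 surface is reachable from JavaScript.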

unicode-org/icu4x

The official ICU project at unicode-org has been splitting itself into two: classic ICU (the C++ codebase Node embeds) and ICU4X, a Rust rewrite designed from the start for WebAssembly and resource-constrained environments. ICU4X publishes first-class JS bindings, ships as an npm package, and is the long-term answer for ICU-on-JS for everything except transliteration.

Transliteration in ICU4X lives in the icu_experimental crate. The crate name is deliberate. The module is feature-complete for a useful subset of transforms, but the data format is not stable, the API is not frozen, and the team has decided not to ship JS bindings for it until both questions are settled. The published @unicode-org/icu4x npm package excludes the transliteration entry point at build time. Pulling in the Rust crate and binding it yourself is possible; doing so while staying inside the supported surface is not.

icu-transliterator

The third option is the native binding route: icu-transliterator, a small npm package that exposes Node's embedded ICU Transliterator through a C++ addon. It worked, briefly, on a narrow range of Node versions. It is now deprecated on npm: no maintainer, no recent commits, no prebuilt binaries for Node 20 or above, and a stack of open issues about the ICU symbol layout changing under the addon between Node minor releases. Even if it were maintained, it would constrain Pext applications to a Node version range that matches whichever ICU the addon was built against, which is exactly the platform coupling we have been trying to avoid.

A custom ICU4X WASM build

The fourth option is the one that actually closes the gap: build ICU4X's transliteration crate to WebAssembly directly, with a thin JS wrapper that mirrors PHP's Transliterator API. The Rust toolchain supports this; icu_experimental compiles to WASM cleanly; the data files can be sliced down to only the transforms the application actually uses, which keeps the bundle size proportional to what is used. End to end, this is a multi-week piece of work: pinning the Rust toolchain, vendoring the data, building the wrapper, validating against PHP transform by transform, and committing to track icu_experimental as it stabilises.

That work is on the roadmap. It is not on the stable roadmap. The current stable Pext runtime is targeted at applications where the divergence above is acceptable; the WASM build is the answer for the enterprise customers whose contract is "PHP and the transpiled output produce byte-identical strings for every locale". When that path lands, it will ship as an opt-in module, the same way the SQLite driver or the IDN polyfill do.

What stable ships

Pext's stable runtime uses transliteration under the hood, with a thin patch layer that maps the small set of inputs the portable-ascii test suite asserts on. The patches live in the pext-strings module and are scoped: where the library passes a string through Transliterator::create('Any-Latin; Latin-ASCII'), the wrapper checks a small override table first, then falls back to the underlying package. The override table is exactly the inputs the suite exercises that the package gets wrong, plus a handful of nearby cases that fall out of the same rules.
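The override-then-fallback lookup can be sketched as a small higher-order function. This is a hedged illustration, not the pext-strings source: withOverrides, the single override entry, and the stub fallback are all hypothetical, and the sketch assumes overrides are keyed on whole input strings.

```typescript
type Transliterate = (input: string) => string;

// Illustrative entry only; the real override table is not reproduced here.
const OVERRIDES: Record<string, string> = {
  "Ærøskøbing": "AEroskobing",
};

// Wrap a fallback transliterator so that exact-match overrides win.
function withOverrides(
  overrides: Record<string, string>,
  fallback: Transliterate,
): Transliterate {
  return (input) =>
    Object.prototype.hasOwnProperty.call(overrides, input)
      ? overrides[input]   // the suite asserts on this input: use the pinned output
      : fallback(input);   // everything else goes to the underlying package
}

// Stub standing in for the npm package so the sketch runs without dependencies:
// it simply replaces every non-ASCII character with "?".
const stubFallback: Transliterate = (s) => s.replace(/[^\x00-\x7f]/g, "?");

const toLatin = withOverrides(OVERRIDES, stubFallback);

console.log(toLatin("Ærøskøbing")); // "AEroskobing" (override hit)
console.log(toLatin("café"));       // "caf?" (stub fallback)
```

Keeping the override table as plain data means it can grow case by case without touching the fallback path.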

With those patches in place the test suite passes 100%. The library's own contract, "return a best-effort ASCII rendering for any input", is honored on every fixture in the suite, in every language the suite covers. The known-divergence examples above still produce the transliteration-package output for inputs the patch layer does not cover, which is the cost of not shipping ICU.

This boundary is documented in the runtime docs as an official gap. It is not a dead end. The WASM build is the path to closing it; the npm package is the path to keeping the stable bundle small. Applications that need byte-for-byte parity will opt into the former once it lands, everyone else stays on the latter, and the migration between them is a config flag rather than a code change.

Where we are

voku/portable-ascii now passes 100% of its test suite under Pext. It is the latest package cleared in Phase 2 of the open source roadmap, alongside brick/math, composer/semver, doctrine/inflector, doctrine/lexer, dragonmantank/cron-expression, egulias/email-validator, nette/schema, paragonie/random_compat, phpoption/phpoption, and graham-campbell/result-type.

Next on the Phase 2 list are the remaining utility packages that bridge into Phase 3: paragonie/constant_time_encoding and dflydev/dot-access-data. The showcase tracks live pass rates as each one closes.