Transpiling egulias/email-validator: a multibyte string module, DNS, and the road to 100%

egulias/email-validator is the de facto email-validation library in PHP. It is the validator behind symfony/mime, Laravel's email validation rule, and a long tail of frameworks and applications that need to know whether a string is a syntactically valid email address, whether the domain resolves, and whether the address conforms to one of the RFCs that define what "valid" actually means. It is about three thousand lines of PHP: a deep parser with separate strategies for RFC 5322, no-warnings, DNS-checking, spoof-checking, and a strict mode, exercised by a test suite that has accreted every weird edge case the maintainers have run into over the last decade.

From Pext's side this project was less about parser quirks and more about two corners of the runtime that had been deferred until something forced them: PHP's multibyte string functions, and DNS. Both are now packaged as their own modules, and with them in place the validator is at 100%.

pext-multibyte-string and the `mb_*` family

The headline piece of work for this conversion is a new runtime module, pext-multibyte-string, that covers PHP's mb_* family of functions. This is the part of the standard library that exists because PHP's other string functions (strlen, substr, strpos, strtolower, strtoupper, and so on) operate on bytes, not on characters. For ASCII that distinction is invisible. The moment a string contains UTF-8, byte-oriented functions start reporting lengths that are too high, slicing in the middle of multibyte sequences, and changing case for the ASCII portion only. The mb_* family is the encoding-aware counterpart, and any PHP code that processes human-typed input ends up calling into it sooner or later.
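To make the byte/character gap concrete, here is a minimal TypeScript sketch that counts both for the same string. The helper name measureUtf8 is hypothetical and not part of Pext; it only illustrates the two numbers that strlen and mb_strlen would report for UTF-8 input.

```typescript
// Hypothetical helper: count UTF-8 bytes and Unicode code points of a string.
// It walks UTF-16 code units by hand so it does not depend on TextEncoder.
function measureUtf8(s: string): { bytes: number; chars: number } {
  let bytes = 0;
  let chars = 0;
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
    const isPair =
      unit >= 0xd800 && unit <= 0xdbff && next >= 0xdc00 && next <= 0xdfff;
    if (isPair) {
      bytes += 4; // a surrogate pair is one code point, four UTF-8 bytes
      i++; // skip the low surrogate
    } else if (unit < 0x80) {
      bytes += 1; // ASCII
    } else if (unit < 0x800) {
      bytes += 2;
    } else {
      bytes += 3; // rest of the BMP (a lone surrogate is also counted as 3)
    }
    chars++;
  }
  return { bytes, chars };
}

// "jos\u00e9" is "jose" with an accented final letter (U+00E9).
const asciiName = measureUtf8("jose");
const accentedName = measureUtf8("jos\u00e9");
```

For the ASCII string the two counts agree. For the accented one the byte count is one higher than the character count, which is exactly the discrepancy a length check on a local-part has to avoid.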

egulias/email-validator uses mb_* throughout. The local-part of an email address can contain non-ASCII characters under the RFC 6532 internationalised-email extensions, the domain can be an IDN, and the length checks the library performs are checks on character counts, not byte counts. The parser walks the input with mb_substr and mb_strlen, normalises encodings with mb_convert_encoding, sniffs declared encodings with mb_detect_encoding, and uses mb_strtolower for case-insensitive comparisons in the domain. Without a working mb_* surface the suite does not start.

The module covers the lookup-style functions (mb_strlen, mb_strpos, mb_strrpos, mb_stripos, mb_substr, mb_substr_count), the case-mapping functions (mb_strtolower, mb_strtoupper, mb_convert_case), the encoding functions (mb_convert_encoding, mb_detect_encoding, mb_check_encoding, mb_list_encodings), and the configuration surface (mb_internal_encoding, mb_regex_encoding, mb_language, mb_http_input, mb_http_output). The encoding configuration is held in the runtime context, the same context that already carries PHP_INT_SIZE and the locale, so it is per-request rather than per-process.

Two specifics that mattered for the validator. First, the encoding argument is genuinely optional and falls back to the current mb_internal_encoding(); that is the path the validator takes for almost every call, so the default needs to land on UTF-8 the way modern PHP does and not on the old ISO-8859-1 default that some older builds shipped with. Second, mb_strlen on an invalid byte sequence has to behave the same way PHP does, which is "return a length based on the substitution character that mb_substitute_character() currently configures", not "throw" and not "return false". The validator's strict mode produces a lot of these inputs by design, since the whole point is to assert that certain byte patterns are rejected.
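The fallback behaviour can be sketched roughly as follows. This is an illustration under assumptions, not Pext's actual code: MbContext, createMbContext, mbInternalEncoding, and resolveEncoding are hypothetical names, and validation of encoding names is left out.

```typescript
// Hypothetical per-request context; field and function names are illustrative.
interface MbContext {
  internalEncoding: string;
}

function createMbContext(): MbContext {
  // Default to UTF-8, as modern PHP does.
  return { internalEncoding: "UTF-8" };
}

// Mirrors the getter/setter shape of PHP's mb_internal_encoding():
// with no argument it returns the current encoding, with one it sets it.
function mbInternalEncoding(ctx: MbContext, encoding?: string): string | true {
  if (encoding === undefined) return ctx.internalEncoding;
  ctx.internalEncoding = encoding;
  return true;
}

// Resolve the optional encoding argument of an mb_* call: an explicit value
// wins, otherwise fall back to the context's internal encoding.
function resolveEncoding(ctx: MbContext, encoding?: string | null): string {
  return encoding == null ? ctx.internalEncoding : encoding;
}

// Two requests get two contexts, so changing one does not leak into the other.
const requestA = createMbContext();
const requestB = createMbContext();
mbInternalEncoding(requestA, "ISO-8859-1");
```

Keeping the setting on a context object rather than in module-level state is what makes it per-request: requestB still resolves to UTF-8 after requestA has been changed.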

The module sits alongside the existing byte-oriented string module rather than replacing it. PHP keeps both, and a transpiled codebase relies on the distinction being preserved at the call site: strlen($s) still means "bytes", mb_strlen($s) means "characters in the active encoding". This is the same boundary discussed in the note on string encodings from a couple of weeks ago.

Codepoints vs bytes in the parser

Having a working mb_* surface was necessary but not sufficient. A subtler class of bugs lives in the gap between what a function returns and how the caller interprets the result. mb_strpos returns a character offset, not a byte offset, and if the caller then passes that offset to a byte-oriented function the slice will land in the wrong place for any non-ASCII string. The validator does not do this on purpose, but a couple of internal helpers were mixing the two without realising, since on ASCII inputs the bug is invisible.
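A small TypeScript sketch of the mismatch, working directly on UTF-8 bytes. The function charToByteOffset is a hypothetical helper written for this example and assumes valid UTF-8 input; it is not taken from Pext or from the validator.

```typescript
// UTF-8 bytes of "é@x": é is two bytes (0xC3 0xA9), "@" and "x" are one each.
const sampleBytes: number[] = [0xc3, 0xa9, 0x40, 0x78];

// Convert a character offset (what mb_strpos returns) into a byte offset
// (what a byte-oriented substr expects). Assumes valid UTF-8.
function charToByteOffset(utf8: number[], charOffset: number): number {
  let chars = 0;
  for (let i = 0; i < utf8.length; i++) {
    // Continuation bytes look like 10xxxxxx and never start a character.
    const isContinuation = (utf8[i] & 0xc0) === 0x80;
    if (!isContinuation) {
      if (chars === charOffset) return i;
      chars++;
    }
  }
  return utf8.length; // offset at or past the end of the string
}

// "@" is character 1 but byte 2. Using the character offset as a byte offset
// starts the slice on 0xA9, in the middle of "é".
const wrongSlice = sampleBytes.slice(1);
const rightSlice = sampleBytes.slice(charToByteOffset(sampleBytes, 1));
```

On an ASCII-only input both slices would be identical, which is why this class of bug stays hidden until a non-ASCII test case exercises it.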

The fixes here were not in the runtime; they were in the transpiler's handling of how string offsets flow through the call graph. Where a value originates from an mb_* function and feeds a substr-style call, Pext now preserves the distinction so the downstream call gets the right kind of offset. Where the user code is explicitly byte-oriented the byte path is kept. The end result is that the IDN paths in the validator, which walk Punycode-decoded labels character by character, agree with PHP on every test case in the suite.

pext-network and DNS

The second new module, pext-network, covers the DNS-resolution functions: checkdnsrr, dns_get_record, gethostbyname, gethostbynamel, gethostbyaddr, and the related constants (DNS_A, DNS_MX, DNS_TXT, DNS_AAAA, and the rest). The validator's DNSCheckValidation strategy uses these to verify that a domain has at least one MX record, or at least one A/AAAA record as a fallback, before accepting an address as deliverable.
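The MX-then-address fallback can be sketched as below. The names are assumptions made for illustration: CheckDnsRr stands in for a checkdnsrr-style lookup that is injected so the example runs without touching the network, and domainAcceptsMail is a hypothetical function, not the validator's own method.

```typescript
type DnsRecordType = "MX" | "A" | "AAAA";

// A checkdnsrr-style predicate: true if the host has at least one record of
// the given type. Injected so the logic can be exercised without a network.
type CheckDnsRr = (host: string, type: DnsRecordType) => boolean;

// Accept a domain if it has an MX record, or failing that an A or AAAA record.
function domainAcceptsMail(host: string, check: CheckDnsRr): boolean {
  if (check(host, "MX")) return true;
  return check(host, "A") || check(host, "AAAA");
}

// A fake resolver backed by a lookup table, for demonstration only.
const fakeZone: { [host: string]: DnsRecordType[] } = {
  "mx.example": ["MX"],
  "a-only.example": ["A"],
  "v6-only.example": ["AAAA"],
};
const fakeCheck: CheckDnsRr = (host, type) =>
  (fakeZone[host] || []).indexOf(type) !== -1;
```

With this table, mx.example, a-only.example, and v6-only.example are all accepted, while a host that is missing from the table is rejected.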

The implementation wraps Node's built-in dns module. The shape work was the interesting part. PHP's dns_get_record returns an array of associative arrays whose keys depend on the record type (host, class, ttl, type, then ip for A, ipv6 for AAAA, target and pri for MX, txt for TXT, and so on), with the type field as a string rather than an integer. Node returns plain arrays of strings or objects with different field names. The module normalises one to the other, including the trailing-dot convention on FQDNs that PHP carries into the returned host field. checkdnsrr is then a thin wrapper over dns_get_record that returns a boolean.
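As an illustration of the shape work for MX records only, here is a sketch that maps the objects Node's resolveMx yields (exchange and priority) onto the PHP-style keys listed above. The function names and the explicit ttl parameter are assumptions for this example, since resolveMx itself does not report a TTL, and trailing-dot handling is deliberately left out.

```typescript
// Shape of one entry from Node's dns resolveMx.
interface NodeMxRecord {
  exchange: string;
  priority: number;
}

// Shape of one MX entry as PHP's dns_get_record reports it.
interface PhpMxRecord {
  host: string;
  class: "IN";
  ttl: number;
  type: "MX"; // a string, not the DNS_MX integer constant
  pri: number;
  target: string;
}

// Hypothetical normaliser: ttl is passed in by the caller (an assumption).
function toPhpMxRecords(
  host: string,
  ttl: number,
  records: NodeMxRecord[]
): PhpMxRecord[] {
  return records.map((r) => ({
    host,
    class: "IN" as const,
    ttl,
    type: "MX" as const,
    pri: r.priority,
    target: r.exchange,
  }));
}

// A checkdnsrr-style wrapper then only needs to know whether anything came back.
function hasRecords(records: PhpMxRecord[]): boolean {
  return records.length > 0;
}

const normalised = toPhpMxRecords("example.com", 300, [
  { exchange: "mail.example.com", priority: 10 },
]);
```

The other record types would follow the same pattern with their own type-specific keys (ip, ipv6, txt, and so on).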

DNS is one of the few places in the validator where the test suite is not fully hermetic: a handful of integration tests resolve real domains. Those still pass under Pext because the underlying Node lookups go to the same resolver the host system would use; the unit tests, which constitute the bulk of the suite, run against a mocked resolver and exercise the shape-normalisation code rather than the network.

CRLF and folded header handling

The last cluster of failures, the one mentioned in last week's recap as "in flight", was around folded headers. RFC 5322 allows a long header value to be split across multiple physical lines as long as the continuation lines start with whitespace, and the separator is the literal sequence CRLF, two bytes, not just LF. The validator's tokeniser walks the input one character at a time and emits a folded-whitespace token whenever it sees a CR followed by an LF followed by a space or tab.
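The folding rule described above can be sketched as a tiny tokeniser. This is a simplification written for illustration, not the validator's lexer: the token names and tokeniseFolds are hypothetical, and everything that is not CR, LF, or a fold collapses into a single CHAR token type.

```typescript
type FoldToken = "FWS" | "CR" | "LF" | "CHAR";

// Control characters are compared as single code values.
const CR = 0x0d;
const LF = 0x0a;
const SP = 0x20;
const HTAB = 0x09;

// Emit FWS for CR LF followed by a space or tab; otherwise emit one token
// per character, keeping bare CR and bare LF distinguishable.
function tokeniseFolds(input: string): FoldToken[] {
  const tokens: FoldToken[] = [];
  let i = 0;
  while (i < input.length) {
    const c = input.charCodeAt(i);
    const next = input.charCodeAt(i + 1); // NaN past the end, never matches
    const after = input.charCodeAt(i + 2);
    if (c === CR && next === LF && (after === SP || after === HTAB)) {
      tokens.push("FWS");
      i += 3; // consume CR, LF, and the leading whitespace character
    } else {
      tokens.push(c === CR ? "CR" : c === LF ? "LF" : "CHAR");
      i += 1;
    }
  }
  return tokens;
}

const folded = tokeniseFolds("a\r\n b").join(",");
const unfolded = tokeniseFolds("a\r\nb").join(",");
```

A CRLF that is not followed by whitespace does not fold, so the second input yields separate CR and LF tokens, which is the kind of case the mixed line-ending tests are designed to probe.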

On the Pext side the tokeniser was hitting the carriage-return case slightly differently from PHP, because the byte-vs-codepoint fix above had not yet propagated to the constants that the tokeniser was comparing against. PHP keeps "\r" and "\n" as single-byte literals regardless of encoding, since they are control characters, but the matching logic in the transpiled output was going through the mb_* path and comparing codepoints against multi-byte representations of the same characters. The fix was to keep ASCII control characters on the byte path even when the surrounding code is encoding-aware, which is exactly what PHP does internally.

With that landed, the folded-header tests pass, and the validator's "no RFC warnings" strategy agrees with PHP on every header sample in the suite, including the ones that deliberately mix CRLF, bare CR, and bare LF to provoke parser disagreements.

Where we are

egulias/email-validator now passes 100% of its test suite under Pext. It is the fifth package cleared in Phase 2 of the open source roadmap, after brick/math, composer/semver, doctrine/inflector, and doctrine/lexer. As with the others, the work that mattered for this conversion generalises: pext-multibyte-string and pext-network are now part of the runtime for every project, not just this one. Anything in the Laravel dependency tree that touches mb_* (and a lot of it does, since Laravel itself uses mb_strlen and friends for its string helpers) inherits the work the validator paid for.

Next on the Phase 2 list are dragonmantank/cron-expression and nette/schema, both of which have just moved to "converted" status. The showcase tracks live pass rates as each one closes.