How to Master PHP Encoding and Charset Issues for Flawless User Data Preservation


by admin

Why encoding bugs hurt more than we admit

There’s a special kind of silence in the office when an encoding bug hits production.

You know the one.

Support tickets start piling up:
“Why are my customer names full of question marks?”
“Your email shows � instead of Russian characters.”
“Our invoice PDF has broken accents, this looks unprofessional.”

You open the logs.
You open the browser dev tools.
You open the database console.
Somewhere between HTTP request, PHP script, database, template, and response, a character quietly broke in half.

And the worst part?
The app “kind of” works. No fatal error. No stack trace. Just… subtle corruption of real human data.

Friends, colleagues, fellow PHP developers — this is an article about PHP encoding and charset issues. But underneath that, it’s about respect. Respect for users’ names, languages, alphabets, and messages they’ve trusted us with.

Let’s talk about the boring, painful, absolutely essential layer under everything we build: text encoding.

The classic crime scene: mojibake, question marks, broken umlauts

If you’ve worked with PHP and MySQL, you’ve probably met some of these symptoms:

  • “Ã©” instead of “é”
  • “ÐŸÑ€Ð¸Ð²ÐµÑ‚” instead of “Привет”
  • “????” instead of any non-Latin text
  • JSON responses that look correct in PHP but broken in the browser
  • PDFs or CSVs exported from PHP with messed up names or addresses

Technically we call this mojibake — text that’s been encoded in one character set but interpreted as another.
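To make the mechanism concrete, here is a minimal sketch that reproduces two of the symptoms above on purpose. It assumes the mbstring extension is loaded and that the file itself is saved as UTF-8; the variable names are only illustrative.

```php
<?php
$original = "é"; // two bytes in UTF-8: 0xC3 0xA9

// Misread the UTF-8 bytes as Windows-1252, then re-encode them as UTF-8.
// Each original byte becomes its own character: this is mojibake.
$garbled = mb_convert_encoding($original, 'UTF-8', 'Windows-1252');

// Converting to a charset that cannot represent the text loses data:
// by default mbstring substitutes "?" for every unmappable character.
$lossy = mb_convert_encoding("Привет", 'ISO-8859-1', 'UTF-8');

echo bin2hex($original), PHP_EOL; // c3a9
echo bin2hex($garbled), PHP_EOL;  // c383c2a9, which displays as "Ã©"
echo $lossy, PHP_EOL;             // ??????
```

The first case is reversible if you know which wrong encoding was applied. The second is not: once the question marks are stored, the original text is gone.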

In practice, it’s that moment when a client from São Paulo sends a screenshot of their company name, and it looks like someone smashed the keyboard with their elbow.

It stings.

The root of the problem: everyone needs to agree

The painful reality: encoding is not a PHP-only problem.

To get a single user’s name from keyboard to database and back to screen, these players need to agree on the same character set:

  • Browser (or API client)
  • HTTP headers
  • HTML / JSON / XML
  • PHP internal handling
  • Database connection
  • Database column types and collation
  • Templates, PDFs, CSV exports, logs

If anyone along this path uses a different charset, you get corruption.

The modern, sane solution is simple to say, harder to do consistently:

Use UTF-8 everywhere, for everything, without exceptions.

But PHP, MySQL, legacy hosting providers… they all carry history. And history carries defaults that aren’t UTF-8.

So we need to be deliberate.

Step one: decide who you are (encoding-wise)

Think of this as your project’s identity document:

  • We store everything as UTF-8
  • We send everything as UTF-8
  • We expect everything as UTF-8

That means:

  • Source code files are UTF-8 (without BOM)
  • Templates (Twig, Blade, plain PHP) are UTF-8
  • Database is configured for UTF-8 (or better, utf8mb4)
  • HTTP responses declare UTF-8
  • CLI scripts assume UTF-8 for input/output
  • Email headers and bodies use UTF-8

This sounds trivial, but you can feel the difference between a codebase where this decision was made on day one and one where people just “trusted defaults” for years.

PHP and encodings: not as simple as it looks

PHP itself is a bit… agnostic about encoding.

Regular strings in PHP are just sequences of bytes. PHP doesn’t inherently know that they represent UTF-8, ISO-8859-1, or something else. That’s both powerful and dangerous.

Some parts of PHP are binary-safe (they don’t care about encoding):

  • string concatenation
  • substr (but it will slice bytes, not characters)
  • strlen (returns bytes, not characters)
  • file_get_contents, fread, fwrite

Other parts are encoding-aware, but only if you use the right functions or extensions:

  • mb_* functions (mb_strlen, mb_substr, mb_convert_encoding)
  • intl extension (IntlChar, Normalizer, Collator, etc.)

So if you do this:

$name = "Привет"; // UTF-8: 6 characters, 2 bytes each

echo strlen($name), PHP_EOL;             // 12 (bytes)
echo mb_strlen($name, 'UTF-8'), PHP_EOL; // 6 (characters)

These two numbers are not the same. And that matters when you’re validating max lengths, truncating values, or storing to fixed-width fields.
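Truncation is where the byte/character difference bites hardest. The sketch below, assuming mbstring is available and the file is UTF-8, shows how a byte-based cut can leave half a character behind, and two ways to avoid it.

```php
<?php
$name = "Привет"; // 6 characters, 12 bytes in UTF-8

$byBytes = substr($name, 0, 3);             // cuts the second character in half
$byChars = mb_substr($name, 0, 3, 'UTF-8'); // "При"

var_dump(mb_check_encoding($byBytes, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($byChars, 'UTF-8')); // bool(true)

// When the limit really is in bytes (for example a fixed-size field),
// mb_strcut trims to at most that many bytes without splitting a character.
$fits = mb_strcut($name, 0, 5, 'UTF-8'); // "Пр" (4 bytes)
echo $fits, PHP_EOL;
```

Which one you want depends on what the limit means: a "max 20 characters" rule calls for mb_substr, a storage limit in bytes calls for mb_strcut.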

The database trap: utf8 vs utf8mb4 vs “whatever the hosting left”

Many encoding bugs in PHP apps are actually MySQL/MariaDB configuration issues.

Some real-world scenarios you may have seen:

  • Database created in latin1
  • Connection defaults to latin1
  • Application assumes UTF-8
  • Columns are utf8 (the old 3-byte MySQL UTF-8) but you store emojis (which need 4 bytes)
  • Different tables with different collations

If you’re working on a project today, the safest baseline is:

  • Database charset: utf8mb4
  • Tables and columns using utf8mb4_unicode_ci, utf8mb4_unicode_520_ci, or (on MySQL 8.0+) utf8mb4_0900_ai_ci, depending on your server version
  • PHP connection explicitly set to utf8mb4 (don’t rely on “should be fine”)

In PDO, for example:

$pdo = new PDO(
    'mysql:host=localhost;dbname=app;charset=utf8mb4',
    $user,
    $pass,
    [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]
);

If you’re using mysqli:

$mysqli = new mysqli($host, $user, $pass, $db);
$mysqli->set_charset('utf8mb4');

The keyword here is explicit. Encoding issues love ambiguity.
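One way to stay explicit is to build the DSN in a single place and ask the server what it actually negotiated. This is a sketch only: buildMysqlDsn and connectionCharset are hypothetical helper names, and the usage part assumes a reachable MySQL or MariaDB server, so it is left commented out.

```php
<?php
// Hypothetical helpers; the names are illustrative, not from any library.

function buildMysqlDsn(string $host, string $dbName): string
{
    // The charset in the DSN sets the connection character set.
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbName);
}

function connectionCharset(PDO $pdo): string
{
    // Ask the server instead of assuming.
    return (string) $pdo->query('SELECT @@character_set_connection')->fetchColumn();
}

// Usage (needs a real database, credentials are placeholders):
// $pdo = new PDO(buildMysqlDsn('localhost', 'app'), $user, $pass, [
//     PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
// ]);
// if (connectionCharset($pdo) !== 'utf8mb4') {
//     throw new RuntimeException('Unexpected connection charset');
// }
```

A check like this can run once at deploy time or in a health check; it does not need to run on every request.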

HTTP and headers: tell the browser what you mean

I’ve seen production apps that relied on default encodings in HTML and never explicitly declared charset in HTTP headers.

That’s like sending a letter with no language specified and hoping the recipient magically guesses.

At the very least, for HTML responses:

header('Content-Type: text/html; charset=UTF-8');

And in your HTML:

<meta charset="UTF-8">

For JSON APIs:

header('Content-Type: application/json; charset=UTF-8');
echo json_encode($data, JSON_UNESCAPED_UNICODE);

That last flag is important. Without JSON_UNESCAPED_UNICODE, PHP will escape non-ASCII characters like this:

{"name":"\u041f\u0440\u0438\u0432\u0435\u0442"}

This is technically valid, but harder to debug, and log readers will hate you. Using the flag, you get:

{"name":"Привет"}
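JSON is also where invalid bytes stop being silent: json_encode refuses input that is not valid UTF-8. The sketch below shows the three behaviours you can choose from. The sample data is made up; the flags are standard PHP constants (JSON_INVALID_UTF8_SUBSTITUTE needs PHP 7.2+, JSON_THROW_ON_ERROR needs PHP 7.3+).

```php
<?php
$ok  = json_encode(['name' => 'Привет'], JSON_UNESCAPED_UNICODE);
$bad = ['name' => "caf\xE9"]; // Windows-1252 byte for "é", not valid UTF-8

// Default: json_encode() returns false and the response body ends up empty.
$silent = json_encode($bad);

// JSON_THROW_ON_ERROR turns the failure into an exception you can log.
try {
    json_encode($bad, JSON_THROW_ON_ERROR);
    $threw = false;
} catch (JsonException $e) {
    $threw = true;
}

// JSON_INVALID_UTF8_SUBSTITUTE replaces each bad sequence with U+FFFD.
$substituted = json_encode($bad, JSON_INVALID_UTF8_SUBSTITUTE);

echo $ok, PHP_EOL;          // {"name":"Привет"}
var_dump($silent);          // bool(false)
echo $substituted, PHP_EOL; // {"name":"caf\ufffd"}
```

Throwing is usually the better default for an API you control, because it points you at the broken input instead of shipping a replacement character to clients.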

The quiet killer: inconsistent input handling

Real users don’t care about your encoding strategy. They paste text from:

  • Excel
  • Word documents
  • Legacy ERP systems
  • Old websites
  • Email clients

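Some of those still use Windows-1251, Windows-1252, or other legacy encodings. Text pasted into a form is usually converted to UTF-8 by the browser, but file uploads, imports, and API payloads from these systems can arrive as legacy-encoded bytes that your code then treats as UTF-8.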

Sometimes, even within the same field, different parts of the text can be from different source encodings. It’s chaos.

In PHP, you have a few tools:

  • mb_detect_encoding — can guess encodings, but not perfectly
  • iconv — conversion between encodings
  • mb_convert_encoding — similar, more modern, often safer

A defensive approach for legacy-heavy apps might be:

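$raw = $_POST['comment'] ?? '';

// With a list of candidates, mbstring first guesses the source encoding.
// The guess is heuristic: single-byte encodings accept almost any bytes,
// so keep UTF-8 first and the list as short as your data allows.
$comment = mb_convert_encoding($raw, 'UTF-8', [
    'UTF-8',
    'Windows-1251',
    'Windows-1252',
    'ISO-8859-1',
]);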

Not perfect. But much better than silently storing broken bytes that users cannot even retype easily.
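A stricter variant avoids guessing altogether: accept input that is already valid UTF-8 untouched, and convert only the rest from one fallback encoding you have reason to expect. This is a sketch, not a universal fix; toUtf8 is a hypothetical helper name and the Windows-1252 default is an assumption you would replace with whatever your legacy sources really use.

```php
<?php
// Hypothetical helper; the name and the default fallback are assumptions.
function toUtf8(string $raw, string $fallback = 'Windows-1252'): string
{
    // Valid UTF-8 passes through unchanged, so clean input is never re-encoded.
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }

    // Anything else is treated as the single legacy encoding we expect.
    return mb_convert_encoding($raw, 'UTF-8', $fallback);
}

echo toUtf8("Привет"), PHP_EOL;                                   // Привет
echo toUtf8("caf\xE9"), PHP_EOL;                                  // café
echo toUtf8("\xCF\xF0\xE8\xE2\xE5\xF2", 'Windows-1251'), PHP_EOL; // Привет
```

The trade-off is that you must know the fallback per source. For a Russian ERP import that might be Windows-1251, for a Western European one Windows-1252.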

HTML forms and invisible landmines

Even HTML forms can betray you.

By default, browsers submit form data in the page encoding. So if your page is properly declared as UTF-8 and served as UTF-8, you’re mostly safe.

But mix in:

  • old templates saved as Windows-1251
  • server misconfiguration
  • multi-page flows where one step is not UTF-8

And suddenly $_POST can contain not-actually-UTF-8 sequences.

A small but important habit:

  • Keep all templates in UTF-8.
  • Configure your editor to always save in UTF-8 without BOM.
  • Check in .editorconfig or IDE settings to enforce this in the team.
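Editor settings help going forward, but old files can still hide in the repo. A small check like the following could run in CI. It is a sketch: describeEncodingProblem is a hypothetical name, and the templates path in the usage comment is an assumption about your project layout.

```php
<?php
// Hypothetical helper: returns a description of the problem, or null if the
// bytes look like BOM-free UTF-8.
function describeEncodingProblem(string $bytes): ?string
{
    if (strncmp($bytes, "\xEF\xBB\xBF", 3) === 0) {
        return 'starts with a UTF-8 BOM';
    }
    if (!mb_check_encoding($bytes, 'UTF-8')) {
        return 'contains bytes that are not valid UTF-8';
    }
    return null;
}

// Usage sketch (directory name is an assumption):
// foreach (glob(__DIR__ . '/templates/*.php') as $file) {
//     $problem = describeEncodingProblem(file_get_contents($file));
//     if ($problem !== null) {
//         echo $file, ': ', $problem, PHP_EOL;
//     }
// }
```

A template saved as Windows-1251 with Cyrillic text will almost always fail the second check, which is exactly the kind of file you want to find before users do.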

The mbstring extension: friend, not optional

If you’re serious about encoding, mbstring should be non-negotiable in your PHP environment.

The standard string functions do not understand multi-byte characters. mb_* functions do.

Some patterns that save future you from pain:

Instead of:

substr($str, 0, 20);

Use:

mb_substr($str, 0, 20, 'UTF-8');

Instead of:

strlen($str);

Use:

mb_strlen($str, 'UTF-8');

Instead of:

strtolower($str);

Use:

mb_strtolower($str, 'UTF-8');

And early in your app bootstrap, set:

mb_internal_encoding('UTF-8');

This won’t magically fix everything, but it makes the defaults safer.
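Put together, a bootstrap fragment might look like the sketch below. It assumes you have a single entry point where this can run before any output; setting default_charset here is an addition that mirrors the php.ini setting of the same name.

```php
<?php
// Fail early instead of silently falling back to byte-based functions.
if (!extension_loaded('mbstring')) {
    throw new RuntimeException('The mbstring extension is required.');
}

// Charset PHP advertises in its default Content-Type header and uses as the
// default for functions such as htmlspecialchars().
ini_set('default_charset', 'UTF-8');

// Default encoding for mb_* calls that omit the encoding argument.
mb_internal_encoding('UTF-8');

echo mb_strlen("Привет"), PHP_EOL; // 6
```

Passing 'UTF-8' explicitly to mb_* calls is still worth doing in shared libraries, since they cannot assume the host application ran this bootstrap.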

Files, CSVs, PDFs: where encoding goes to die

Some of the ugliest encoding bugs appear when exporting data:

  • CSVs that open fine in LibreOffice, but show nonsense in Excel
  • PDF invoices with broken customer names
  • Generated XML that fails external validation because of stray bytes

For CSV, the painful truth: many versions of Excel on Windows expect Windows-1252 or similar, not UTF-8, unless you jump through some hoops.

Real projects end up doing things like:

$csv = fopen('php://output', 'w');

// Optional BOM for better Excel behavior (fwrite writes the bytes as-is,
// without treating them as a format string):
fwrite($csv, "\xEF\xBB\xBF");

foreach ($rows as $row) {
    fputcsv($csv, $row, ';');
}

That BOM line is controversial, but in some organizations it’s the difference between “works for real users” and “looks technically pure but fails in the real world”.


For PDFs (TCPDF, Dompdf, etc.), you have to make sure:

  • The library knows your input is UTF-8.
  • The font you use supports the characters (CJK, Cyrillic, etc.).

How many late nights have been spent hunting bugs that turned out to be “the font doesn’t support that glyph”?

Collation: sorting, searching, comparing

Encoding is about how bytes map to characters. Collation is about how you compare and sort them.

In PHP, you might compare strings with === or use strcasecmp. But in the database, indexes, uniqueness, and ORDER BY behavior are influenced by collation.

If your collation doesn’t match your language needs, you get:

  • Unexpected sorting orders
  • Case-sensitive behavior where you don’t want it
  • “Duplicates” that users can’t distinguish, but DB treats as different (or vice versa)

With MySQL’s utf8mb4_0900_ai_ci, for example, you get modern Unicode-aware comparisons. In older versions, utf8mb4_unicode_ci is often a good balance.

For application-level sorting in PHP, the intl extension’s Collator can respect language rules much better than naive string comparisons.
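Here is a small sketch of the difference, assuming the intl extension is installed. The word list and the de_DE locale are chosen only for illustration.

```php
<?php
$words = ['Zebra', 'Äpfel', 'Apfel'];

// Byte order: "Ä" starts with byte 0xC3, so it sorts after every ASCII letter.
$byBytes = $words;
sort($byBytes, SORT_STRING);

// Locale-aware order, only if the intl extension is available.
$byLocale = $words;
if (class_exists('Collator')) {
    $collator = new Collator('de_DE'); // locale is an assumption for this demo
    $collator->sort($byLocale);
}

print_r($byBytes);  // Apfel, Zebra, Äpfel
print_r($byLocale); // Apfel, Äpfel, Zebra (when intl is loaded)
```

If the database sorts with one collation and PHP re-sorts with another, users will see lists reorder between pages, so pick one place to sort and stick to it.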

Again, the pattern: be explicit. If you work in a multilingual environment, define your expectations about sorting and comparison, don’t just accept “whatever MySQL does by default”.

Character normalization: when visually identical isn’t equal

There’s another layer that tends to surprise people: Unicode normalization.

Some characters can be represented in more than one way:

  • As a single precomposed character (é)
  • As a base character plus combining diacritic (e + ́)

They look the same, but byte-wise they’re different. So string equality breaks, uniqueness constraints break, and search can behave strangely.

PHP’s Normalizer (from the intl extension) can help:

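// The import is only needed inside a namespaced file; in the global
// namespace, drop this line to avoid a "has no effect" warning.
use Normalizer;

// Returns false if $input is not valid UTF-8, so check the result.
$normalized = Normalizer::normalize($input, Normalizer::FORM_C);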

Especially in systems that receive data from many sources (imports, APIs, copy-paste from different platforms), normalizing to a single form before storing can save a lot of subtle bugs years down the line.
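To see the problem in isolation, the following sketch builds both representations of "é" by code point and compares them before and after normalization. It assumes the intl extension for the second half and skips that part if the class is missing.

```php
<?php
$precomposed = "\u{00E9}";  // é as a single code point (2 bytes in UTF-8)
$decomposed  = "e\u{0301}"; // e followed by a combining acute accent (3 bytes)

var_dump($precomposed === $decomposed); // bool(false)

if (class_exists('Normalizer')) {
    $a = Normalizer::normalize($precomposed, Normalizer::FORM_C);
    $b = Normalizer::normalize($decomposed, Normalizer::FORM_C);
    var_dump($a === $b); // bool(true)
}
```

Text typed or copied on macOS is a common source of the decomposed form, which is why file names and pasted names sometimes fail an exact-match lookup.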

And now, let’s pause here.

Encoding talk tends to feel like plumbing: necessary, but not inspiring.
Yet if you’ve ever seen a user’s own name corrupted in a system you built, you know it’s not just plumbing.

It’s whether you treat their language and identity as first-class citizens in your codebase.

A story about a broken name

A few years ago, I was helping with an old PHP codebase that powered a mid-sized recruitment platform. Not unlike Find PHP, but in a different niche.

One day, a recruiter wrote to support:

“Your system changed the candidate’s last name. This is unacceptable.”

I checked the logs. Technically, the system didn’t “change” it. It corrupted it — half the characters turned into question marks when exporting to PDF, then that PDF was archived in their internal systems as the “official” application file.

Imagine that.
Your name, on a document that might decide your job, looking like someone didn’t care enough to handle your alphabet correctly.

Under the hood, the story was classic:

  • Database: latin1
  • PHP: assumed UTF-8
  • PDF library: partially configured for UTF-8, partially not
  • Export: touched by three layers that all quietly mangled bytes

No exceptions. No crashes. Just quiet disrespect.

We fixed it, of course. Migrated the database to utf8mb4, repaired the schema, forced connection encodings, refactored a bunch of string handling to mbstring, replaced fonts in the PDF library.

But I still remember the feeling:
It wasn’t about being “correct.” It was about being responsible.

Encoding as a professional habit, not a ticket task

On a platform like Find PHP, people look for:

  • Companies that need serious PHP developers.
  • Developers who can be trusted with production systems.
  • Teams who stay on top of the ecosystem, patterns, and traps.

Charset and encoding issues rarely show up as “feature requests”. They sneak in through:

  • “Export data to CSV”
  • “Add support for emojis in chat”
  • “Integrate with this legacy ERP”
  • “Add a public JSON API”
  • “Generate invoices in PDF”

If you’re a PHP developer looking for work, this is one of those areas that subtly separates juniors from seniors:

A junior might say:
“It works on my machine with these test strings.”

A senior quietly asks:
“What happens if the user’s name has accents, Eastern European characters, Chinese characters, or emojis?”
“Are we sure everything is UTF-8 end-to-end?”
“Do we validate or normalize inputs?”
“Can our email templates and PDFs handle multiple languages?”

This mindset doesn’t appear in job descriptions, but teams feel it when they work with you.

Practical checklist: making your PHP app truly UTF-8

Let’s distill all of this into something you can look at next time you touch a codebase.

Source and templates

  • All .php, .twig, .blade.php, etc. in UTF-8 without BOM
  • Editor or .editorconfig configured to enforce UTF-8
  • No ancient templates in Windows-1251 / ISO-8859-1 hiding in the repo

PHP configuration

  • default_charset = "UTF-8" in php.ini (or set via ini_set at bootstrap)
  • mbstring extension enabled
  • mb_internal_encoding('UTF-8'); early in the app lifecycle

Database

  • Database, tables, and columns set to utf8mb4
  • Collation set to something Unicode-aware (utf8mb4_unicode_ci or a newer variant)
  • Connection charset explicitly set (PDO DSN or set_charset('utf8mb4'))

HTTP layer

  • For HTML: Content-Type: text/html; charset=UTF-8
  • For JSON: Content-Type: application/json; charset=UTF-8 and JSON_UNESCAPED_UNICODE
  • <meta charset="UTF-8"> in HTML templates

Validation and transformation

  • Use mb_strlen, mb_substr, mb_strtolower, mb_strtoupper
  • Consider normalizing strings with Normalizer::normalize when relevant
  • For multi-language systems, define a clear strategy for collation and searching

Exports

  • CSV: decide on encoding and possibly BOM based on your user’s tools (Excel vs others)
  • PDF: verify fonts support required character sets; feed the library UTF-8
  • XML: always declare encoding in the XML declaration and ensure bytes match

Logging and debugging

  • Ensure logs are stored as UTF-8
  • When debugging, dump strings carefully: inspect both raw bytes and their interpreted form when something looks off
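For the last point, a tiny helper that prints bytes next to the interpreted form goes a long way. This is a sketch; describeString is a hypothetical name and the output format is arbitrary.

```php
<?php
// Hypothetical debugging helper: shows what the bytes really are.
function describeString(string $value): string
{
    $valid = mb_check_encoding($value, 'UTF-8');

    return sprintf(
        'bytes=%d chars=%s valid_utf8=%s hex=%s',
        strlen($value),
        $valid ? (string) mb_strlen($value, 'UTF-8') : 'n/a',
        $valid ? 'yes' : 'no',
        implode(' ', str_split(bin2hex($value), 2))
    );
}

echo describeString("é"), PHP_EOL;    // bytes=2 chars=1 valid_utf8=yes hex=c3 a9
echo describeString("\xE9"), PHP_EOL; // bytes=1 chars=n/a valid_utf8=no hex=e9
```

Seeing c3 a9 versus e9 versus c3 83 c2 a9 for the same visible "é" usually tells you in seconds which layer misread the text.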

When you’re joining an existing PHP project

Many readers of the Find PHP blog are:

  • stepping into existing legacy projects,
  • doing freelance rescue missions,
  • or being hired to “clean up” an older system.

You won’t always have the luxury of designing encoding strategy from scratch. So what then?

Here’s how I’d approach it if you drop me into a 10-year-old PHP/MySQL project with “mysterious encoding bugs”:

  • Check database charset and collation with SHOW VARIABLES LIKE 'character_set%'; and SHOW VARIABLES LIKE 'collation%';.
  • Inspect a few critical tables: SHOW FULL COLUMNS FROM users;
  • Grab a few records with non-ASCII characters and see how they look in:
    • raw MySQL client
    • PHP var_dump
    • browser responses
  • Search the codebase for SET NAMES, set_charset, iconv, mb_convert_encoding
  • Identify exports (CSV, PDF, XML, JSON) and test them with real sample data from different languages
  • Check email templates and headers for correct charset and encoding

You start to build a mental map of where reality (bytes in DB) differs from intent (what humans think the encoding is).
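One pattern that often turns up during this kind of audit is text that was UTF-8, got read as Windows-1252 somewhere, and was stored again as UTF-8. The heuristic below is a rough sketch for flagging such values in sample records. looksDoubleEncoded is a hypothetical name, Windows-1252 as the intermediate encoding is an assumption, and the check can misfire on text that legitimately contains sequences like "Ã©".

```php
<?php
// Hypothetical heuristic: does undoing one Windows-1252 misreading
// produce a different string that is still valid UTF-8?
function looksDoubleEncoded(string $value): bool
{
    if (!mb_check_encoding($value, 'UTF-8')) {
        return false;
    }

    // Undo the suspected misreading, then redo it to confirm nothing was lost.
    $candidate = mb_convert_encoding($value, 'Windows-1252', 'UTF-8');
    $roundTrip = mb_convert_encoding($candidate, 'UTF-8', 'Windows-1252');

    return $candidate !== $value
        && $roundTrip === $value
        && mb_check_encoding($candidate, 'UTF-8');
}

var_dump(looksDoubleEncoded("Ã©"));     // bool(true)
var_dump(looksDoubleEncoded("é"));      // bool(false)
var_dump(looksDoubleEncoded("Привет")); // bool(false)
var_dump(looksDoubleEncoded("abc"));    // bool(false)
```

Treat a positive result as a reason to inspect the record by hand, not as permission to rewrite data automatically.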

Then comes the hardest part:
Deciding how much you can fix without breaking existing integrations and stored data.

Sometimes you do a full migration.
Sometimes you carefully isolate things and fix them step by step.
Sometimes you add explicit conversion just before output, as a compromise.

There’s no single clean answer. But having a clear mental model of encoding flows makes the decisions less random and less scary.

Encoding bugs and the human side of engineering

At the end of the day, encoding issues are not about bytes. They’re about people.

Behind every corrupted character there’s:

  • someone’s family name from a small town with a rare diacritic,
  • someone’s company in a language we don’t speak,
  • someone’s message written late at night, when they trusted the system would preserve their words.

We often talk about “respecting users’ time” by making fast and reliable systems.
Handling encodings correctly is a quieter version of the same respect.

It tells users:

“Your language is not an afterthought here.”
“We expect you to use your real name, in your real alphabet.”
“We built this with enough care that your text won’t be chewed up along the way.”

If you’re looking for PHP work or hiring PHP developers through a place like Find PHP, this is the kind of detail that never makes for flashy headlines, but it shapes how your software feels to people who live outside the narrow ASCII slice of the world.

So next time you see “Ã©” on screen, maybe don’t just fix that one case and move on.

Trace it. Understand it. Fix the pipeline.
Turn that moment of annoyance into a quiet promise to handle people’s words with the care they deserve.

Because under all the abstractions, patterns, and frameworks, we’re still just humans sending each other text, hoping it arrives intact.