An idempotency bug that only affected posts with em-dashes

There is a button on my admin dashboard labeled "Trigger D1 -> GitHub backup", and right under it a line of copy I wrote myself: "Runs the cron worker on-demand. Idempotent -- no-ops if content is unchanged."

The button worked. The line was a lie. Every single click produced a fresh commit on the backup branch, titled chore(backup): post terraform-pulumi-kubernetes-or-none, even though nothing had changed in D1 between clicks. Someone noticed. I had to go look.

This is the story of the bug, the fix, and the broader point about writing claims down in UI copy before you have tested them.

The architecture, one paragraph

The main worker serves yigittanriverdi.com. A second, tiny worker runs on a cron trigger at 03:00 UTC nightly. It reads the site's D1 tables, renders each post as a Markdown file with YAML frontmatter, serializes projects and settings as JSON, and PUTs each file to the GitHub Contents API on the site's repo. The goal is a human-readable, version-controlled snapshot of everything editable through the admin -- so if D1 ever disappears, the site is recoverable from git.
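For the idempotency check to have any chance of working, the render step has to be deterministic: the same row must always produce the same bytes. A minimal sketch of what that could look like follows. The PostRow fields and the renderPost name are hypothetical, since the real D1 schema and renderer are not shown here.

```typescript
// Hypothetical shape of a post row; the real D1 schema may differ.
interface PostRow {
  slug: string;
  title: string;
  publishedAt: string; // ISO 8601 date string
  body: string;        // Markdown body
}

// Render one post as Markdown with YAML frontmatter.
// The output depends only on the row (no timestamps, no random IDs),
// so identical rows always produce identical text.
function renderPost(p: PostRow): string {
  const frontmatter = [
    "---",
    // JSON string literals are valid YAML double-quoted scalars.
    `slug: ${JSON.stringify(p.slug)}`,
    `title: ${JSON.stringify(p.title)}`,
    `publishedAt: ${JSON.stringify(p.publishedAt)}`,
    "---",
  ].join("\n");
  // Normalize trailing whitespace so the file always ends in one newline.
  return `${frontmatter}\n\n${p.body.trimEnd()}\n`;
}

const example = renderPost({
  slug: "hello",
  title: "A \u2014 B", // em-dash written as an escape
  publishedAt: "2024-01-01",
  body: "Text\n\n",
});
console.log(example);
```

Anything non-deterministic in this step (a "backed up at" timestamp, unordered keys) would defeat the skip logic no matter how the comparison is done.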

The claim that made its way onto the admin button is that this backup is idempotent at the GitHub-commit level. If a file's bytes are identical to what's already on the target branch, the worker should short-circuit and not commit. The alternative is a backup branch that grows by several commits per day with zero information content, which defeats the point of having a readable backup.

What the idempotency check actually did

The cron worker's putFile() helper did this:

const existingRes = await fetch(`${api}?ref=${env.GITHUB_BRANCH}`, { headers });
if (existingRes.ok) {
  const existing = (await existingRes.json()) as { sha: string; content: string };
  // GitHub returns base64 with line wraps -- strip whitespace then decode.
  const existingContent = atob(existing.content.replace(/\s/g, ''));
  if (existingContent === content) {
    return { changed: false, status: 200 };
  }
}

Looks fine. Fetch the existing file, decode base64, string-compare, skip if equal.

It is not fine.

The atob() trap

atob() is one of the oldest functions in the browser API. It takes base64 and returns a string. What most people never internalise is that it returns a latin-1 string -- a sequence where each character has a codepoint between 0 and 255, one character per decoded byte.

If the original file was ASCII, that is indistinguishable from the real content. If the original file contained UTF-8 multibyte sequences -- em-dashes, smart quotes, middle dots, Turkish characters, Unicode punctuation of any kind -- those bytes come back as multiple latin-1 characters, not the single Unicode codepoint you encoded.

Compare that to the freshly-rendered string the worker was generating: a real JavaScript string with those same characters as proper Unicode codepoints. An em-dash in the GitHub-roundtripped string is three chars (the UTF-8 bytes 0xE2 0x80 0x94, each masquerading as a latin-1 char), where the same em-dash in the live string is one char (U+2014).

=== on those two strings never returns true.
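The mismatch is easy to reproduce in isolation. This is a minimal sketch, assuming a runtime with global atob, btoa and TextEncoder (Workers, modern browsers, recent Node). The variable names are illustrative, not the worker's.

```typescript
// An em-dash as a normal JavaScript string: one UTF-16 code unit.
const live = "\u2014";

// What ends up on GitHub: the UTF-8 bytes of that string, base64-encoded.
const utf8 = new TextEncoder().encode(live); // bytes 0xE2 0x80 0x94
let bin = "";
utf8.forEach((b) => {
  bin += String.fromCharCode(b);
});
const b64 = btoa(bin); // "4oCU"

// What the buggy read path compared against: one latin-1 char per byte.
const roundTripped = atob(b64);

console.log(live.length);           // 1
console.log(roundTripped.length);   // 3
console.log(roundTripped === live); // false
```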

The Terraform/Pulumi post has em-dashes in its title. The projects and settings JSON files had been sanitized to ASCII-only during an earlier fight with Windows cp1252, which is why their idempotency check skipped correctly. Only that one file with non-ASCII content was lighting up on every run.

The fix

The cron worker already had a proper b64encodeUtf8() helper on the write path -- it knows you cannot just btoa() a Unicode string. The read path needed the inverse:

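function b64decodeUtf8(s: string): string {
  // atob() yields a "binary string": one char (code 0-255) per decoded byte.
  const bin = atob(s);
  const bytes = new Uint8Array(bin.length);
  for (let i = 0; i < bin.length; i++) bytes[i] = bin.charCodeAt(i);
  // Reinterpret those bytes as UTF-8 to recover the original text.
  return new TextDecoder().decode(bytes);
}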

Base64 -> latin-1 binary string -> bytes -> real Unicode string, decoded as UTF-8. Mirrors the encoder exactly. Swap atob(...) for b64decodeUtf8(...) in the comparison and the check behaves how the UI copy always claimed it did. Two back-to-back runs now show no changes and produce zero commits.

Thirteen lines. One commit.
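For completeness, here is roughly what the write-side half of the pair looks like. The body of b64encodeUtf8() below is a sketch of the general technique, not a copy of the worker's helper, and the sample string is made up. The decoder is repeated so the snippet runs on its own.

```typescript
// Sketch of a UTF-8-safe base64 encoder (illustrative body; the worker's
// own helper may differ in detail).
function b64encodeUtf8(s: string): string {
  const bytes = new TextEncoder().encode(s); // string -> UTF-8 bytes
  let bin = "";
  bytes.forEach((b) => {
    bin += String.fromCharCode(b); // bytes -> latin-1 binary string
  });
  return btoa(bin); // binary string -> base64
}

// The decoder from the fix: the same steps in reverse.
function b64decodeUtf8(s: string): string {
  const bin = atob(s);
  const bytes = new Uint8Array(bin.length);
  for (let i = 0; i < bin.length; i++) bytes[i] = bin.charCodeAt(i);
  return new TextDecoder().decode(bytes);
}

// Made-up sample: em-dash, middle dot, and a Turkish s-cedilla.
const sample = "em\u2014dash, middle\u00b7dot, \u015f";

console.log(b64decodeUtf8(b64encodeUtf8(sample)) === sample); // true
console.log(atob(b64encodeUtf8(sample)) === sample);          // false
```

The second line is the original bug in miniature: a correct encoder paired with a bare atob() on the way back.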

What I take from this

Three things.

One. atob() and btoa() have been latin-1-only since 1995, and for compatibility reasons browsers are never going to change that. If you see either function in code that round-trips user text, look for a UTF-8 bug. The MDN page says so in plain English. The warning still catches people 30 years on, including me.

Two. Idempotency claims are not something you read off the code. They are something you test. The fastest possible test for this class of bug is "click the button twice in a row and assert nothing happened on the second click." I had never done that, because nobody ever clicked the button manually -- the cron runs at 3am. The moment I put a button on the admin page, the test became free to run, and a user ran it for me.

Three. If you ship UI copy that makes a testable assertion -- "idempotent", "encrypted at rest", "zero-downtime" -- the copy is the spec. Someone will notice when reality diverges. Better to notice it yourself, by clicking your own button twice, than to read about it in a Slack message.
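The "click the button twice" test from the second point does not need GitHub at all. The sketch below is hypothetical scaffolding: FakeRepo stands in for the Contents API, and this putFile is a simplified stand-in for the worker's helper that takes the decoder as a parameter so both behaviours can be compared.

```typescript
// Hypothetical in-memory stand-in for the GitHub Contents API.
class FakeRepo {
  files = new Map<string, string>(); // path -> base64 of UTF-8 bytes
  commits = 0;
}

function b64encodeUtf8(s: string): string {
  let bin = "";
  new TextEncoder().encode(s).forEach((b) => {
    bin += String.fromCharCode(b);
  });
  return btoa(bin);
}

function b64decodeUtf8(s: string): string {
  const bin = atob(s);
  const bytes = new Uint8Array(bin.length);
  for (let i = 0; i < bin.length; i++) bytes[i] = bin.charCodeAt(i);
  return new TextDecoder().decode(bytes);
}

// Simplified putFile: skip the write when the decoded content is unchanged.
function putFile(
  repo: FakeRepo,
  path: string,
  content: string,
  decode: (b64: string) => string,
): { changed: boolean } {
  const existing = repo.files.get(path);
  if (existing !== undefined && decode(existing) === content) {
    return { changed: false };
  }
  repo.files.set(path, b64encodeUtf8(content));
  repo.commits++;
  return { changed: true };
}

const post = "title: Terraform \u2014 or none\n"; // made-up content

// With the UTF-8-aware decoder, the second run is a no-op.
const fixed = new FakeRepo();
putFile(fixed, "posts/example.md", post, b64decodeUtf8);
const second = putFile(fixed, "posts/example.md", post, b64decodeUtf8);
console.log(second.changed, fixed.commits); // false 1

// With bare atob(), the same two runs produce two commits.
const buggy = new FakeRepo();
putFile(buggy, "posts/example.md", post, (b) => atob(b));
putFile(buggy, "posts/example.md", post, (b) => atob(b));
console.log(buggy.commits); // 2
```

The assertion that matters is the commit count after the second call. It fails on the old read path and passes on the new one.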

The line on the admin button stays. It is true now.