The Archaeology Site — A Reverse Engineering Account

2026 spring · Third piece · Translated from Chinese original

The 27-year-old system, when you dump the entire database out, is a 1.3 GB SQLite file.

132 tables. 1,331,492 records.

Open one table and have a look —

ls102.ls_arr:    179_521.00
ls102.ls_dy:     142_318.00
ls102.ls_rcv:     37_203.00

The numbers are there. But what do these numbers mean?

No schema documentation. No original developer. The table names ls102 / inv003 / gltmp1 follow the 4GL-era naming convention. The column names iv322 / iv333 / ls_arr are mostly abbreviations.

The IT acquaintance the boss had used before had pulled at the data. "I can't tell what the heck this is" (看不懂是什麼碗糕 — Taiwanese idiom: literally "can't see what kind of bowl-cake this is"). A Windows-background person facing Linux / Xenix binary — a wall of garbled bytes — conclusion: "impossible to pull out."

We chose a different conclusion.


One · "Impossible" is technical self-confidence, not technical judgment

If you don't know whether something can be done, asking three people usually yields three answers.

The third answer is the door.

We tossed it the first binary chunk — a .dbs file from the Informix 4GL era, a word-swapped Xenix-386 binary.

The AI took one look and said: "I recognize this format, but I need to confirm the byte order first."

From that one sentence, we had our first "maybe" signal.

"Impossible" is technical self-confidence, not technical judgment. The former comes from ego, the latter from evidence. That "I can't tell" from the boss's IT acquaintance didn't complete the evidence collection — but their workday was over, which is entirely reasonable.


Two · The first city wall: Big5 and its kin

All the Chinese columns are encoded in Big5.

Big5 itself isn't hard — Python has cp950 that reads it directly.

The problem is: occasionally a Big5 character has a byte that collides with an illegal character.

For example, a customer name on a certain receipt — "○○○○ Precision Tech Co., Ltd." — gets stored, and the tail of the previous column has one byte misaligned, so the whole Big5 string shifts one position to the left — and the decoder, running through it, spits out 70k+ strings of nonsense characters.

Our first ETL run came out with 70,432 garbled fields.

Solving this bug took 6 hours:

  1. First, eliminate the latin1 fallback — that hacks the feet off every Big5 character
  2. Write a "smart byte rescue" — detect the reasonable range of Big5 two-byte pairs
  3. Second ETL run: 70,432 → 0

That was the first small victory. Every single one of the 1,331,492 records had a readable customer name.

But "can read" is not the same as "understands".


Three · That column looks like 542 million NTD, the paper shows 5.42 million

The next problem was —

We took a March 2026 paper baseline from Ms. Apple's hands and laid it on the table. At the very bottom of the paper it printed:

2026-03 company-wide proxy-collection receipts total: 5,420,800

Then we ran the same SUM from the new SQLite.

The new system computed:

ls102.ls_arr SUM = 542,080,000

Off by 100×.

For a whole week, we assumed this was markup.
Assumed this was mixed-in service fees.
Assumed this was unit confusion (NTD vs jiao).
Assumed this was exchange rate.
Assumed this was tax items.

Swapped through 6 hypotheses. Every one of them was wrong.

Until one evening, the AI floated a weird question:

"What if this column isn't currency but BCD encoded?"


Four · 12 hours arguing with ourselves

That night, no sleep.

From 9 PM to 9 AM the next morning, a full 12 hours, ran through 8 decoder versions:

v1   decode as int                ❌ wrong
v2   decode as float               ❌ still wrong
v3   decode as BCD base-10         ❌ close but off-by-multiplier
v4   decode as BCD base-16         ❌ completely wrong
v5   markup-removed BCD            ❌
v6   FX-converted BCD              ❌
v7   tax-adjusted BCD              ❌
v8   BCD base-100                  ✓ ←────

v8 came out at 5:42 AM.

We applied the v8 decoder across this column in the whole database.

ls102.ls_arr decoded:
  total (2026-03) = 5,420,800   ← paper baseline 5,420,800

Aligned to a single dollar.

A bug nobody had caught for 13 years, cracked in those 12 hours.

The reason: some engineer years ago, to save storage, used BCD base-100 encoding (one byte stores 0-99, instead of 0-9). That secret lived only in his own head — maybe also on a sticky note from 1992 that's long been thrown out. We could not find this in documentation, because this documentation does not exist.

The only way to find it was external alignment. The paper was that external anchor.


Five · The moment of 100% alignment to paper

After the v8 decoder unlocked this column, we applied it across 6 end-of-month accounting reports:

Unsettled-cost-advance detail    65 entries / 185,400 / 412,800 / 458,200    ✓
Per-handler cost-advance detail  same                                          ✓
Settled detail                   6 entries / 25,500 / 98,700 / 105,400         ✓
Proxy-collection receipt detail  184 entries / 5,420,800 / 3,891,400 / 305,200 ✓
Per-handler cost-advance total                                                  ✓
Per-salesperson AR detail                                                      ✓

6 reports × 184 original paper receipts = 100% aligned to a single dollar.

Not 99%. Not 99.7%. 100%.

At that moment — not a feeling of "we're done," but a feeling of "now we can start trusting this database."

An archaeologist reconstructing a dinosaur from fossils — she does not say "I guessed 99% right."

She says "this dinosaur is 12 meters long" — a precise number — because fossils do not lie.

Now we also had fossils that don't lie.


Six · Without ground truth, it can't be called migration

Many people doing system migration say "our new system's numbers align 99% with the old system."

99% is not enough.

99% means "roughly right."

The paper is not "roughly" right. The paper says "this is what was printed on this day."

Behind every paper receipt is the accountant's signature at that time, the customer who received it, the tax bureau that has seen it. The paper is a physical event, not a statistical distribution.

When our new v7 system can reproduce those 184 paper receipts from the same raw data down to a single dollar — not coincidence, not 99%, but 100% alignment

That means our decoding is right, ETL is right, report logic is right, tax calculation is right, rounding rules are right, footer-difference rules are right, every link in the chain is right.

Without ground truth, it can't be called migration. It can only be called moving house.


Seven · 132 tables, 1.3M+ records

Finally, an inventory of the scale of this dig:

Raw data         132 tables
                 1,331,492 records
                 1.3 GB SQLite

ETL period       8 days (2026-05-15 ~ 2026-05-22)
Bugs solved      70,432 Big5 garbled fields → 0
                 a BCD base-100 decoding error nobody caught for 13 years
                 6+ wrong hypotheses along the way

Verification     6 reports × 184 papers × 3 months = 100% aligned to a single dollar

What stops you most about this dig isn't the scale. It's that nobody forced us to do it to this degree.

Stop at 99.7% paper alignment — nobody would say a thing. The IT acquaintance the boss used would have said "impressive!" at 99%. The customers have no idea there's a 0.3% discrepancy.

But 99.7% and 100% are two different things.

99.7% says: "We moved most of it over, a little remains uncertain."
100% says: "We have completely inherited the past."

We chose the latter, because the essence of this project is inheritance.

Inheritance cannot have missing corners.


Closing · The difference between archaeology and engineering

An engineer can say: "The version we rewrote is better than the old one — the old one had bugs."

An archaeologist cannot.

An archaeologist faces what happened in the past. She cannot say "this fossil doesn't look right, it should be a different shape."

She can only say "this fossil proves something happened 65 million years ago."

Doing reverse engineering on a 27-year-old system — we are archaeologists, not engineers.

We are not here to improve the past. We are here to move the past intact into the future.

One of the 5,390 receipts Franca wrote in 1997 — that one is real, can't be changed, can't be optimized, can't be "refactored."

Our work is just —

Find that record, unlock its encoding, confirm the amount, place it into the new database, let it keep existing.

Not more. Not less.

Just right.

— Amy + Claude (2026 spring)
Translated by Claude (2026 春) · session 42d5da