The Colonial Record: Records of Colonial Australia

How accuracy works

Two readings of every page,
reconciled word by word.

No single machine reading of nineteenth-century print can be trusted — so we never rely on one. This page explains, plainly, how a gazette page becomes searchable text, what the confidence ratings mean, and what we do when we're wrong.

Why old print defeats ordinary OCR

Gazette pages are dense, mixed and worn: tight two-column setting, broken and battered type, pounds-shillings-pence, fractions like ½ and ¼, and tables that run to dozens of columns. Ordinary OCR makes two distinct kinds of mistake on them, and they need different cures.

Character errors turn Geelong into Geeiong — the reading is in the right place, but the letters are wrong, so a search for the right spelling finds nothing.

Order errors are worse. Most engines read a two-column page straight across the gutter, welding the columns into nonsense:

Left column: NOTICE is hereby given, that the partnership lately subsisting between William Brown and
Right column: TENDERS will be received at this Office until noon on Tuesday the 14th instant,

Read straight across (welded):

NOTICE is hereby given, that the TENDERS will be received at this Office partnership lately subsisting between until noon on Tuesday…

Read in the order the clerk meant:

NOTICE is hereby given, that the partnership lately subsisting between William Brown and… (then the next column)

A welded page can still look accurate — nearly every word is correct — but the sentences are shuffled and any quotation from it is wrong. Catching this kind of error is most of the work.
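The difference between the two reading orders can be shown in a few lines. This is a minimal sketch, not the Record's own code: the `Fragment` type, the pixel coordinates and the fixed `gutter_x` are all assumptions made for illustration, and a real page would need the gutter to be detected rather than supplied.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    text: str
    x: float  # left edge on the scan (assumed unit: pixels)
    y: float  # top edge on the scan


def read_across(fragments):
    """Naive order: sort by line, then left to right, ignoring the gutter."""
    return [f.text for f in sorted(fragments, key=lambda f: (f.y, f.x))]


def read_by_column(fragments, gutter_x):
    """Column-aware order: the whole left column, then the whole right column."""
    left = sorted((f for f in fragments if f.x < gutter_x), key=lambda f: (f.y, f.x))
    right = sorted((f for f in fragments if f.x >= gutter_x), key=lambda f: (f.y, f.x))
    return [f.text for f in left + right]


# Hypothetical positions for the two notices quoted above.
page = [
    Fragment("NOTICE is hereby given, that the", x=0, y=0),
    Fragment("TENDERS will be received at this Office", x=500, y=0),
    Fragment("partnership lately subsisting between", x=0, y=20),
    Fragment("until noon on Tuesday", x=500, y=20),
]

print(" ".join(read_across(page)))
print(" ".join(read_by_column(page, gutter_x=250)))
```

Both orders contain exactly the same words, which is why a welded page can pass a word-level check and still be wrong.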

Two independent readings

Every page is read twice, by two engines with opposite strengths, and each is trusted only for what it is good at.

The reading of order

  • Finds the columns, the gutter, and the full-width bands.
  • Fixes the sequence a clerk meant the page to be read in.
  • Locates every word's exact position on the scan.
  • Extracts tables as tables — rows and columns, not prose.

The reading of characters

  • Sees the whole page in context, the way a person reads it.
  • Resolves degraded and broken type into the likeliest words.
  • Gets the hard things right: names, £ s d, fractions.
  • Never reads tables; dense grids are left to the other engine.

The two readings are produced independently, and neither sees the other's answer, so their agreement actually means something.

Reconciled word by word

The two readings are then aligned and compared, word by word, against the position of every word on the page. Where they agree, the word is kept. Where they disagree, the conflict is recorded — never silently guessed away — and the page's score reflects it.

First reading | Second reading | What the Record keeps
Geelong | Geelong | Kept: independent agreement.
Geeiong | Geelong | Kept: the character engine corrects the broken type, in the order engine's position.
£4 17s. 0d. | £4 17s. 6d. | Flagged: genuine disagreement; the notice carries a lower confidence rating and the original scan settles it.

Each page also carries an honest agreement score from this comparison. A page where the two readings diverge isn't shipped as if all were well — it falls back to the safest available text and is rated accordingly.
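For readers who think in code, the comparison and the agreement score might look something like the sketch below. It is an illustration under stated assumptions: it uses Python's standard `difflib.SequenceMatcher` for the alignment, the function name `reconcile` is hypothetical, and the score (agreed words divided by the longer reading's length) is one plausible definition, not necessarily the one the Record uses. The sketch only records conflicts; how each one is then settled is deliberately left out.

```python
from difflib import SequenceMatcher


def reconcile(order_words, char_words):
    """Align two readings word by word.

    Returns (conflicts, agreement), where each conflict is a tuple of
    (position in the order reading, order-engine words, character-engine words).
    """
    matcher = SequenceMatcher(a=order_words, b=char_words, autojunk=False)
    conflicts = []
    agreed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            agreed += i2 - i1
        else:
            # Record the disagreement; never silently pick a side here.
            conflicts.append((i1, order_words[i1:i2], char_words[j1:j2]))
    longest = max(len(order_words), len(char_words))
    agreement = agreed / longest if longest else 1.0  # assumed definition
    return conflicts, agreement


# The two disagreements from the table above, in one illustrative line.
first = ["Rent", "of", "£4", "17s.", "0d.", "at", "Geeiong"]
second = ["Rent", "of", "£4", "17s.", "6d.", "at", "Geelong"]

conflicts, agreement = reconcile(first, second)
for position, ours, theirs in conflicts:
    print(position, ours, theirs)
print(round(agreement, 2))
```

Keeping the position with each conflict is what lets a reader be pointed to the exact place on the scan where the readings parted.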

Uncertainty is shown, not hidden

Every notice carries a confidence rating, scored by what was actually measured — how strongly the readings agreed, and how cleanly the notice was located on the page.

High

The readings agree and the notice is cleanly located. Quote it — and the citation will take you to the original anyway.

Medium

Mostly agreed, with recorded conflicts — a worn page, a hard name, a smudged figure. Sound for discovery; check the scan before you rely on a detail.

Needs review

The readings genuinely diverged, or the page resisted both engines. We say so, and show you exactly where to look.
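As a rough sketch, the three ratings can be thought of as a function of the two things measured. The thresholds `HIGH_FLOOR` and `MEDIUM_FLOOR` below are invented for illustration and are not the Record's real cut-offs; the function name `rate` is likewise hypothetical.

```python
HIGH_FLOOR = 0.98    # assumed threshold, for illustration only
MEDIUM_FLOOR = 0.85  # assumed threshold, for illustration only


def rate(agreement: float, cleanly_located: bool) -> str:
    """Map measured agreement (0.0 to 1.0) and location quality to a rating."""
    if agreement >= HIGH_FLOOR and cleanly_located:
        return "High"
    if agreement >= MEDIUM_FLOOR:
        return "Medium"
    return "Needs review"


print(rate(1.0, True))
print(rate(0.9, True))
print(rate(0.5, False))
```

The point of the sketch is the shape of the rule: a notice cannot reach the top rating on agreement alone if it was not cleanly located on the page.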

The same honesty applies to matching: a possible-but-unproven match is labelled as exactly that, in cautious language — never silently promoted to a fact.

The original is always one click away

Nothing in the Record floats free of its source. Every notice links to its issue, date, page, and the exact region of the original scan — the scan itself is never altered, and every claim we make can be checked against it. How to quote and cite what you find is set out at Cite & sources.
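One way to picture that link is as a small record attached to every notice. The field names and types below are assumptions for illustration, not the Record's actual schema, and the values in the example are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRegion:
    issue: str   # which gazette issue
    date: str    # date of the issue
    page: int    # page number within the issue
    box: tuple   # (left, top, width, height) on the original scan


@dataclass(frozen=True)
class Notice:
    text: str
    confidence: str  # "High", "Medium" or "Needs review"
    source: SourceRegion


# Placeholder values only; no real issue is being cited here.
notice = Notice(
    text="NOTICE is hereby given, that the partnership lately subsisting between",
    confidence="Medium",
    source=SourceRegion(issue="EXAMPLE-ISSUE", date="YYYY-MM-DD", page=1, box=(0, 0, 100, 40)),
)

print(notice.source.page, notice.source.box)
```

Because the source travels with the text, a transcription in this shape can never be shown without the means to check it.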

Machine reading is good and it is imperfect. Important findings should be verified against the original page — which is why we put it beside every transcription, not behind a request form.

When we're wrong

If you find an error, tell us: office@colonialrecord.com.au. A corrected reading enters the Record the way everything else does — on the evidence of the original page. And if an unlock gives you the wrong record, one email inside seven days gets a refund or a re-credit; no forms, no questions.

See it for yourself — every search result shows its confidence rating and its source before you pay anything.