The previous post covered an address sanitizer that fixes mangled OCR output using multi-strategy matching. It works, but it’s treating a symptom. A smarter OCR step would make most of it unnecessary.
Traditional OCR extracts characters, then downstream code figures out what they mean. A separate pipeline handles structure, validation, error correction. The address sanitizer is part of that pipeline. It exists because the OCR engine doesn’t understand what it’s reading.
Gemini’s vision model inverts this. You send it an image and a prompt describing the output structure you want. It returns structured data directly – fields identified, values extracted, context applied. No intermediate text extraction step.
The prompt is the specification
For Indonesian KTP (identity card) processing, the extraction prompt runs several hundred lines. It covers field identification, spatial layout, character disambiguation, cross-validation rules, and the exact JSON output structure.
The key sections:
# Task: KTP Document Processing
## Character Disambiguation
Handle common OCR-equivalent errors:
- 0 / O / D (zero, letter O, letter D)
- 1 / I / l / | (one, capital I, lowercase l, pipe)
- B / 8 / 3
## Cross-Validation Rules
- NIK first 4 digits must match province/city admin codes
- If place_of_birth is "JAKARTA", state should be "DKI JAKARTA"
- If age derived from date_of_birth is under 17, marital_status should be "BELUM KAWIN"
## Output Format
Return EXACTLY this JSON structure:
{
  "data": {
    "nik": "string (exactly 16 digits or empty string)",
    "full_name": "string (uppercase, no extra spaces)",
    "date_of_birth": "string (DD-MM-YYYY format only)"
  },
  "confidence": {
    "nik": 0-100,
    "full_name": 0-100,
    "date_of_birth": 0-100
  }
}
NEVER use null values. Use empty strings for missing data.
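The NIK prefix rule above can be checked mechanically: the first two digits of a 16-digit NIK encode the province and the next two the city/regency. A minimal sketch, with a sample table standing in for the full admin-code lookup:

```go
package main

import "fmt"

// Sample entries only; the real lookup covers every province and city code.
var provinceCodes = map[string]string{
	"31": "DKI JAKARTA",
	"32": "JAWA BARAT",
}

// nikMatchesProvince checks the first cross-validation rule: the leading
// two digits of a 16-digit NIK must map to the extracted province.
func nikMatchesProvince(nik, province string) bool {
	if len(nik) != 16 {
		return false
	}
	return provinceCodes[nik[:2]] == province
}

func main() {
	fmt.Println(nikMatchesProvince("3171234567890123", "DKI JAKARTA"))
}
```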
This prompt went through 47 iterations before reaching production quality. Most of the iteration was edge cases – rotated documents, damaged cards, faded text, non-standard layouts. Each failure mode got its own section in the prompt.
The prompt is a product specification. The more precise the spec, the more consistent the output.
Extraction
func (g *GeminiExtractor) extractFromFile(ctx context.Context, imageFile *os.File) (*APIResponse, error) {
	imgData, mimeType, err := utils.LoadImageFromFile(imageFile)
	if err != nil {
		return nil, fmt.Errorf("failed to load image: %w", err)
	}

	model := g.client.GenerativeModel("gemini-1.5-flash")
	temperature := float32(0.1)
	model.Temperature = &temperature

	promptParts := []genai.Part{
		genai.Text(extractionPrompt),
		genai.Blob{MIMEType: mimeType, Data: imgData},
	}

	resp, err := model.GenerateContent(ctx, promptParts...)
	if err != nil {
		return nil, fmt.Errorf("generation failed: %w", err)
	}
	return parseStructuredResponse(resp)
}
Low temperature (0.1) for consistency. The model returns its response as text with an embedded JSON block, which gets parsed out:
func parseJSONResponse(responseText string) (*DocumentData, error) {
	jsonPattern := regexp.MustCompile("(?s)```json\\s*(.*?)\\s*```")
	matches := jsonPattern.FindStringSubmatch(responseText)
	if len(matches) < 2 {
		return nil, errors.New("no JSON block found in response")
	}

	var docData DocumentData
	err := json.Unmarshal([]byte(matches[1]), &docData)
	return &docData, err
}
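The DocumentData struct itself isn't shown above. A minimal sketch matching the prompt's JSON contract (field names assumed from the spec; the real struct carries more KTP fields):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Shapes mirroring the prompt's output contract.
type Fields struct {
	NIK         string `json:"nik"`
	FullName    string `json:"full_name"`
	DateOfBirth string `json:"date_of_birth"`
}

type Confidence struct {
	NIK         int `json:"nik"`
	FullName    int `json:"full_name"`
	DateOfBirth int `json:"date_of_birth"`
}

type DocumentData struct {
	Data       Fields     `json:"data"`
	Confidence Confidence `json:"confidence"`
}

// parseDoc unmarshals the JSON block that parseJSONResponse extracts.
func parseDoc(raw string) (*DocumentData, error) {
	var doc DocumentData
	if err := json.Unmarshal([]byte(raw), &doc); err != nil {
		return nil, err
	}
	return &doc, nil
}

func main() {
	doc, err := parseDoc(`{"data":{"nik":"3171234567890123","full_name":"BUDI SANTOSO","date_of_birth":"17-08-1990"},"confidence":{"nik":97,"full_name":92,"date_of_birth":95}}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(doc.Data.FullName, doc.Confidence.NIK)
}
```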
Confidence scoring
Each extracted field gets a confidence score from four weighted factors:
type ConfidenceFactors struct {
	OCRQuality           float64 // 40% -- text clarity, recognition certainty
	ValidationCompliance float64 // 30% -- format correctness, valid ranges
	ContextConsistency   float64 // 20% -- cross-field logical consistency
	SpatialAccuracy      float64 // 10% -- correct field positioning on card
}
The model computes these as part of its response – the prompt specifies the scoring methodology and the model applies it. This works because the confidence assessment is part of the same pass that does extraction. The model knows how clearly it read a field, whether the value passes format validation, and whether it’s consistent with neighboring fields.
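The post gives the weights but not the combining function. Assuming a straight weighted sum over 0-100 factor scores, the per-field score would be:

```go
package main

import "fmt"

type ConfidenceFactors struct {
	OCRQuality           float64
	ValidationCompliance float64
	ContextConsistency   float64
	SpatialAccuracy      float64
}

// overallConfidence applies the 40/30/20/10 weighting as a plain weighted
// sum -- an assumption; the post only specifies the weights.
func overallConfidence(f ConfidenceFactors) float64 {
	return 0.4*f.OCRQuality +
		0.3*f.ValidationCompliance +
		0.2*f.ContextConsistency +
		0.1*f.SpatialAccuracy
}

func main() {
	f := ConfidenceFactors{OCRQuality: 90, ValidationCompliance: 100, ContextConsistency: 80, SpatialAccuracy: 100}
	fmt.Printf("%.1f\n", overallConfidence(f)) // 0.4*90 + 0.3*100 + 0.2*80 + 0.1*100
}
```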
Adaptive processing
Image quality varies. A clean scan needs different handling than a phone photo of a damaged card. The processing config adjusts:
func (g *GeminiExtractor) selectProcessingStrategy(imageQuality float64) ProcessingConfig {
	if imageQuality > 0.9 {
		return ProcessingConfig{Temperature: 0.0, MaxTokens: 1000, DetailLevel: "standard"}
	} else if imageQuality > 0.6 {
		return ProcessingConfig{Temperature: 0.1, MaxTokens: 2000, DetailLevel: "enhanced"}
	}
	return ProcessingConfig{Temperature: 0.2, MaxTokens: 3000, DetailLevel: "maximum"}
}
Higher temperature and token budget for worse images. The “maximum” path also appends additional prompt sections for handling damage, rotation, and faded text.
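The extra sections could be appended at prompt-assembly time. A sketch, with placeholder section text (the production degraded-input instructions aren't shown in the post):

```go
package main

import "fmt"

type ProcessingConfig struct {
	Temperature float64
	MaxTokens   int
	DetailLevel string
}

// buildPrompt appends degraded-input sections only on the "maximum" path.
// The section text here is a placeholder, not the real prompt.
func buildPrompt(base string, cfg ProcessingConfig) string {
	if cfg.DetailLevel != "maximum" {
		return base
	}
	return base + "\n\n## Degraded Input Handling\n" +
		"- Check for rotation; locate fields relative to the card header\n" +
		"- For faded text, lower the confidence score rather than guessing\n" +
		"- Return empty strings for physically damaged regions"
}

func main() {
	cfg := ProcessingConfig{Temperature: 0.2, MaxTokens: 3000, DetailLevel: "maximum"}
	fmt.Println(len(buildPrompt("base prompt", cfg)) > len("base prompt"))
}
```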
Post-processing
Even with Gemini handling extraction, some post-processing remains. Birth information sometimes comes back as a combined field. Gender values need standardization. And addresses still run through the sanitizer – Gemini gets addresses right more often than traditional OCR, but the sanitizer catches what it misses and validates the administrative hierarchy.
func (g *GeminiExtractor) validateAndEnhance(response *DocumentData) {
	if birthInfo := response.Data.PlaceOfBirth; birthInfo != "" {
		place, date := splitBirthInformation(birthInfo)
		response.Data.PlaceOfBirth = place
		response.Data.DateOfBirth = standardizeDateFormat(date)
	}
	if gender := response.Data.Gender; gender != "" {
		response.Data.Gender = standardizeGender(gender)
	}
	if !skipAddressSanitization {
		// Copy the sanitized result back so the caller sees it; assigning
		// to the local pointer alone would discard it.
		*response = *g.addressSanitizer.Sanitize(response)
	}
}
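The helpers above aren't shown. KTP cards print birth info as "PLACE, DD-MM-YYYY", so minimal sketches of splitBirthInformation and standardizeGender might look like this (the real implementations may handle more variants):

```go
package main

import (
	"fmt"
	"strings"
)

// splitBirthInformation splits a combined "PLACE, DD-MM-YYYY" value.
func splitBirthInformation(combined string) (place, date string) {
	parts := strings.SplitN(combined, ",", 2)
	if len(parts) < 2 {
		return strings.TrimSpace(combined), ""
	}
	return strings.TrimSpace(parts[0]), strings.TrimSpace(parts[1])
}

// standardizeGender maps common variants onto the two KTP values.
func standardizeGender(raw string) string {
	switch strings.ToUpper(strings.TrimSpace(raw)) {
	case "LAKI-LAKI", "LAKI", "L", "M", "MALE":
		return "LAKI-LAKI"
	case "PEREMPUAN", "P", "F", "FEMALE":
		return "PEREMPUAN"
	}
	return raw
}

func main() {
	place, date := splitBirthInformation("JAKARTA, 17-08-1990")
	fmt.Println(place, date, standardizeGender("l"))
}
```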
Error handling
API failures, rate limits, and safety filters are part of the operating reality. Retry with exponential backoff, with specific handling for safety filter triggers (those won’t resolve with retries – the document image itself is the issue):
func (g *GeminiExtractor) extractWithRetry(ctx context.Context, image []byte) (*Result, error) {
	const maxRetries = 3
	backoff := time.Second

	var lastErr error
	for attempt := 0; attempt < maxRetries; attempt++ {
		result, err := g.callGeminiAPI(ctx, image)
		if err == nil {
			return result, nil
		}
		lastErr = err

		if isRateLimitError(err) {
			time.Sleep(backoff)
			backoff *= 2 // exponential backoff
			continue
		}
		if isSafetyFilterError(err) {
			// Retrying won't help -- the document image itself triggered the filter.
			return g.handleSafetyFilter(image)
		}
		return nil, err
	}
	return nil, fmt.Errorf("max retries exceeded: %w", lastErr)
}
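isRateLimitError and isSafetyFilterError aren't shown. A sketch that classifies by message content; production code would inspect the SDK's typed errors rather than strings:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Message-based classification -- an assumption for illustration only.
func isRateLimitError(err error) bool {
	msg := err.Error()
	return strings.Contains(msg, "429") || strings.Contains(msg, "rate limit")
}

func isSafetyFilterError(err error) bool {
	return strings.Contains(strings.ToLower(err.Error()), "safety")
}

// classify is a demonstration wrapper over the two predicates.
func classify(msg string) string {
	err := errors.New(msg)
	switch {
	case isRateLimitError(err):
		return "rate_limit"
	case isSafetyFilterError(err):
		return "safety"
	default:
		return "other"
	}
}

func main() {
	fmt.Println(classify("HTTP 429: rate limit exceeded"))
	fmt.Println(classify("blocked by SAFETY filter"))
}
```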
In production
Character accuracy went from 94% to 98.5%. Field extraction completeness from 78% to 94%. Manual review rate dropped from 35% to 8%.
Processing runs at 2.3 seconds per document, $0.003 per call including retries. Caching similar documents (same template, same scan quality) brings a 60% hit rate and cuts the effective cost further.
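The cache key isn't described in detail. One plausible sketch keys on the document template, a bucketed quality score, and the prompt version, so a prompt revision invalidates stale entries (all names and fields here are assumptions):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// cacheKey derives a stable key from template, bucketed scan quality, and
// prompt version. Bucketing quality lets "similar" scans share entries;
// including the prompt version invalidates the cache on prompt revisions.
func cacheKey(templateID string, qualityBucket, promptVersion int) string {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s|q%d|p%d", templateID, qualityBucket, promptVersion)))
	return fmt.Sprintf("%x", sum[:8])
}

func main() {
	fmt.Println(cacheKey("ktp-2016", 2, 47))
}
```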
Accuracy improved. The bigger change was maintenance. The traditional pipeline had hundreds of validation rules, character substitution tables, and format-specific parsers that needed constant updates. The Gemini approach moved most of that logic into the prompt. When a new edge case appears, the fix is a prompt revision, not a code change. The 47th iteration of the prompt handles cases that would have been months of rule engineering in the old system.
The address sanitizer still runs. Gemini reduced its workload – strategy distribution shifted from 30% fuzzy matches to under 10% – but the hierarchy validation catches things the model occasionally misses. The two systems complement rather than replace each other.