Resume Parsing Developer Document

Resume Parsing Developer document

Resume Parsing — Production System Reference

This document describes the Intelletto.ai production pipeline for converting unstructured candidate artifacts into high-fidelity, scoring-ready data. The pipeline is orchestrated by Netflix Conductor with workers deployed on Google Cloud Run. The focus is on auditability, cost-efficiency, and evidence-backed extraction.

Document Overview

  • Module: Resume Parsing Pipeline
  • Version: 1.0.0
  • Last Updated: 2026-04-04
  • Pipeline Version: v3
  • Audience: Backend, Data, and AI Engineering

Core Objectives

The pipeline transforms PDFs, DOCX files, and raw text into an immutable Intermediate Representation (IIR) characterized by:

  • Explainability: Every extracted fact links to a specific source segment or page.
  • Determinism: Predictable behaviors with idempotent processing and explicit retry logic.
  • Cost Management: Stage-level attribution for token usage and model invocations.
  • Auditability: Versioned records with a full provenance trail.

Technical Stack

Component Specification
Primary API FastAPI (Python 3.11+) on Google Cloud Run
AI Runtime Google Gemini 2.5 Flash via google-genai Python SDK with response_schema enforcement
Extraction Method Direct PDF upload to Gemini (not OCR-to-text prompting). Schema constrains output at API decoding level.
OCR / Layout Google Document AI Enterprise Document OCR (fallback for layout evidence)
Database PostgreSQL 16+ on Google Cloud SQL (asia-southeast1) via psycopg3 async connection pool
Storage Google Cloud Storage (intelletto-ai-resume-parse-404886655151)
Pipeline Orchestration Netflix Conductor OSS 3.15.0 on GCE (asia-southeast1-b). Workflows define stage ordering, decision branches, retries, and rate limits. HUMAN task support for recruiter review gates.
Worker Cloud Run service (intelletto-worker), HTTP-based task polling via httpx. Each stage is an async handler registered in worker_registry.py.
Rate Limiting Conductor task-level: 30 RPM / 4 concurrent for Gemini stages. Python-side token bucket as secondary safety net.
Deployment Cloud Run (asia-southeast1). API: 2 vCPU / 2Gi. Worker: always-on, min 1 instance.
Observability Cloud Logging + structured pipeline_phase_event telemetry + Conductor workflow/task history

v3 Architecture

The v3 codebase is domain-driven, replacing the v2 monolith (intelletto_server.py, retained as reference only).

Module Path Responsibility
Entry Point v3/main.py Uvicorn entry, app bootstrap
App Factory v3/src/api/app.py FastAPI app creation, middleware, route mounting
Pipeline Domain v3/src/domains/pipeline/ Conductor adapter (conductor/): client, poller, worker_registry, worker_adapter. Stage runners.
Pipeline Stages v3/src/domains/pipeline/stages/ 12 stage modules: registration, dedup, ocr_layout, data_cleaning, extraction, normalization, enrichment, validation, scoring_inputs, gcs_archive, scorecard, artifacts
Scoring Domain v3/src/domains/scoring/ ScoreService, ConfigService, interview brief, authenticity, bias audit, comp inference, JD calibration, sector classification, cross-JD fit
Intake Domain v3/src/domains/intake/ Resume registration, GCS import, document management
JD Domain v3/src/domains/jd/ Job description management, AI JD generation
Candidates Domain v3/src/domains/candidates/ Candidate profiles, cover letters
Gemini Integration v3/src/integrations/google/gemini.py Gemini 2.5 Flash client with response_schema
GCS Integration v3/src/integrations/google/gcs.py Cloud Storage operations, RESUME-POOL structure
Document AI v3/src/integrations/document_ai/ocr.py OCR and layout extraction
Batch Gemini v3/src/integrations/google/batch_gemini.py Async batch processing (50% cost reduction, 200K requests/job)
Database v3/src/database/pool.py, repository.py psycopg3 async connection pool, generic repository
Job Queue v3/src/database/job_queue.py Async durable job queue for background processing

Pipeline Stage Map (12 Stages)

The pipeline processes resumes through 12 stages orchestrated by Netflix Conductor. Three pipeline modes (A, B, C) are defined as Conductor workflow definitions that control which stages execute. Each stage is a Conductor task polled and executed by the intelletto-worker Cloud Run service. Conductor handles retry logic, rate limiting, timeouts, and decision branching (e.g., dedup stop, gate stop). A HUMAN task type enables recruiter review gates for borderline candidates.

# Stage Key Human Label Code Location
01 RESUME_REGISTERED Registered stages/registration.py
02 DEDUP_CHECK Duplicate Check stages/dedup.py
03 OCR_LAYOUT Reading Document Layout stages/ocr_layout.py
04 DATA_CLEANING Cleaning Extracted Text stages/data_cleaning.py
05 STRUCTURED_EXTRACTION Extracting Structured Resume stages/extraction.py
06 NORMALIZATION Skill Normalization stages/normalization.py
07 DATA_FUSION_ENRICHMENT Data Enrichment stages/enrichment.py
08 VALIDATION_GATES Quality Gates stages/validation.py
09 SCORING_INPUT_BUILD Score Preparation stages/scoring_inputs.py
10 GCS_ARCHIVE_MOVE Move Source to Processed stages/gcs_archive.py
11 SCORECARD_GENERATION Score Against JD stages/scorecard.py
12 RECRUITER_ARTIFACTS Final Output stages/artifacts.py

Pipeline Modes

Mode Name Stages Executed Use Case
A Pool Build 01-06, 08-09 Parse and normalize resumes into the RESUME-POOL without scoring. Terminal state: POOL_READY.
B Pool Activation 07, 10-12 Activate pooled resumes against a JD: enrich, archive, score, build artifacts.
C Direct JD 01-12 Full end-to-end: resume arrives with a JD assignment, all 12 stages run sequentially.

RESUME-POOL GCS Structure

Resumes are organized in GCS under RESUME-POOL/ with JD code directories for assignment:

gs://intelletto-ai-resume-parse-404886655151/
  RESUME-POOL/
    {JD_CODE}/              # e.g., 59A356/
      CV - Name - Title.pdf
      CV - Name - Title.pdf
    unassigned/
      resume.pdf

JD code is detected from the GCS directory path during import. The resume_document_job_map table links documents to job requisitions for scoring.

Contract Principles (Non-Negotiable)

These principles are enforced as hard contracts across parsing, normalization, and scoring. Each bullet links to the implementation clause(s) and the downstream acceptance gates.

IIR schema section

Contract: The Internal Intermediate Representation (IIR) schema is the single source of truth for extraction output. It is versioned, discoverable, and must validate (schema-first; no best-effort).

Implemented in: Stage 05 (STRUCTURED_EXTRACTION) uses direct PDF upload to Gemini 2.5 Flash with response_schema parameter. The schema constrains output at API decoding level, guaranteeing skills always return as {technical: string[], soft: string[]}.

Enforced by: Validation gates (Gate B schema validation as a hard pass/fail gate). Gate B now uses _partition_validation_issues() to separate blocking errors from advisory warnings. Blocking errors stop the pipeline.

Evidence/page_map section

Contract: Every extracted fact must be traceable to immutable source evidence, and every pipeline run must persist a lossless page_map for round-trip regeneration.

Implemented in: Stage 03 (OCR_LAYOUT) via Document AI + Stage 05 evidence span protocol.

Enforced by: Validation gates (Gate A lossless coverage and Gate C evidence integrity). Gate C now uses _validate_extraction_evidence_contract() with a 20% per-section fact density floor.

Deterministic invocation policy

Contract: LLM invocation is deterministic and enforceable: model/version pinning, explicit tool parameters, bounded retries, and idempotent step execution (no hidden variability).

Implemented in: Gemini 2.5 Flash with temperature=0.1, response_schema enforcement, and extraction cached by checksum_sha256 scoped per tenant_id for cost efficiency. Same inputs produce same cached outputs.

Enforced by: Idempotency records (intelletto.idempotency_record) and replay checks within pipeline orchestrator.

Metering dictionary table

Contract: Metering is deterministic and auditable per step. The metering dictionary is pinned, versioned, and used to compute cost-per-resume without drift.

Implemented in: intelletto.pipeline_phase_event records per-stage telemetry including stage, outcome, error_code, error_message, started_at, completed_at, and duration. intelletto.scoring_run_metering records per-scorecard token counts, latency breakdowns, and DB write counts.

Normalization/taxonomy ownership statement

Contract: Ownership is unambiguous: the parsing pipeline produces taxonomy-resolved, scoring-ready identifiers (and the taxonomy_snapshot_id it used). The scoring engine must not "re-normalize" raw text; it consumes resolved IDs + confidence.

Implemented in: Stage 06 (NORMALIZATION) resolves skills against a taxonomy of 55,000+ hard skills, 63,000+ aliases, and 146 soft skills. Match types: EXACT, FUZZY, SEMANTIC. Certification-to-skill expansion via intelletto.certification_skill_map table.

Scoring handoff payload and acceptance gates

Contract: The parser hands off a single, explicit scoring payload that includes identifiers, evidence, and the snapshots used (IIR + taxonomy). Scoring begins only after all required gates pass.

Implemented in: Stage 09 (SCORING_INPUT_BUILD) creates an immutable intelletto.scoring_input_snapshot row. Stage 11 (SCORECARD_GENERATION) resolves published scoring config, loads JD + parsed resume + scoring input, and runs the 8-bucket weighted scoring model.

Minimum payload (scoring_input_snapshot columns):

{
  "scoring_input_id": "uuid",
  "pipeline_run_id": "uuid",
  "parsed_resume_id": "uuid",
  "taxonomy_snapshot_id": "uuid",
  "iir_schema_version": "semver",
  "evidence_pack_sha256": "sha256:...",
  "normalized_skills_json": {...},
  "work_history_json": {...},
  "education_summary_json": {...},
  "certifications_json": {...},
  "languages_json": {...},
  "all_gates_passed": true
}

Section 0: Pipeline Stages (12-Stage Production Pipeline)

1) Intent

Define the exact end-to-end resume parsing pipeline stages (and their boundaries) so developers can understand each step, with deterministic cost metering, schema-first extraction, and evidence traceability as non-negotiable invariants.

2) Why it matters (risks mitigated / dependencies unlocked)

This section prevents the two most common failure modes in AI extraction projects:

  • "Partial extraction" (teams ship a "good enough" model that silently drops content). This is unacceptable because Intelletto requires 100% coverage and resume regeneration.
  • "Cost drift" (prompt changes, retries, or OCR changes quietly double cost-per-resume). Intelletto requires deterministic metering and enforceable gates.

It also unlocks downstream work: skills normalization, evidence-based scoring inputs, audit trails, and regression testing.

3) Definitions for AI-new developers (minimal, practical)

  • Structured extraction: sending the PDF directly to Gemini 2.5 Flash with a response_schema parameter that forces the model to output JSON conforming to the declared schema at the API decoding level.
  • Schema validation gate: a hard pass/fail check (no "best effort"). Gate B uses _partition_validation_issues() to separate blocking errors from advisory warnings.
  • Evidence: a pointer back to the source document. For Intelletto, evidence is page_index + bbox and/or text_span, always linked to a block_id in the lossless layer.
  • Hallucination: model output not supported by evidence in the source. Intelletto mitigates via: (1) response_schema enforcement, (2) evidence required for extracted facts, (3) coverage ledger, (4) temperature 0.1 for factual extraction.

4) Requirements

MUST
  • Process resumes through the 12 stages in order (stage map above), respecting pipeline mode (A, B, or C).
  • Preserve a lossless page/block capture for every page, even if content is not mapped to structured fields.
  • Produce outputs that support round-trip regeneration (at minimum HTML with faithful ordering).
  • Emit per-stage pipeline_phase_event records (stage, outcome, error_code, duration).
  • Be idempotent and retry-safe: re-running a stage must not duplicate artifacts or costs.
SHOULD
  • Use direct PDF upload to Gemini rather than OCR-to-text prompting for extraction.
  • Cache extraction results by checksum_sha256 (tenant-scoped) for cost efficiency.
MAY
  • Add enrichment sources (e.g., public profile links) only when explicitly configured and always as a distinct stage with its own metering.

5) Implementation steps (the 12 stages)

Stage 01 -- RESUME_REGISTERED
  • Create resume_document record with status RECEIVED.
  • Compute checksum_sha256 from PDF bytes.
  • Record GCS object_uri and file metadata (mime_type, byte_size, page_count).
  • Create or link candidate_profile (from filename inference or prior identity).
  • Create parsing_pipeline_run record with status PENDING.
Stage 02 -- DEDUP_CHECK
  • Compute deterministic fingerprints: file hash (bytes) + normalized-text hash + page-map hash.
  • Search for duplicates: exact-match (SHA256) and near-match (candidate signal matching: email, phone, LinkedIn URL).
  • Tenant-scoped queries (dedup is never cross-tenant).
  • If duplicate: pipeline terminates with status DEDUP_SKIPPED and Gate D records the result.
  • If new: pipeline continues to Stage 03.
Stage 03 -- OCR_LAYOUT
  • Call Google Document AI Enterprise Document OCR to produce page-level text + layout coordinates.
  • Build normalized page_map with stable reading order, bbox geometry, and per-block hashes.
  • Persist coverage_ledger recording page_count_total, page_count_processed.
  • Gate A (GATE_A_LOSSLESS) validates: observed_pages == expected_pages, every page has blocks.
  • Graceful degradation when Document AI is unavailable: extraction proceeds with direct PDF to Gemini (OCR evidence is advisory, not blocking).
Stage 04 -- DATA_CLEANING
  • Normalize whitespace, bullet symbols, and hyphenation (without changing meaning).
  • Preserve original text in lossless layer; cleaning produces a "clean view," not replacements.
  • Identify repeated headers/footers and tag them.
  • Pass-through when OCR was skipped (direct PDF mode).
Stage 05 -- STRUCTURED_EXTRACTION
  • Direct PDF upload to Gemini 2.5 Flash with response_schema parameter.
  • The schema constrains output at the API decoding level -- skills always return as {technical: string[], soft: string[]}.
  • Temperature: 0.1 for factual extraction.
  • Extraction cached by checksum_sha256 scoped per tenant_id for cost efficiency.
  • Output: parsed_resume.extraction_json (JSONB) with person, work history, education, skills, certifications, languages, URLs.
  • Gate B (GATE_B_SCHEMA) validates extraction schema. Uses _partition_validation_issues() to separate blocking from advisory. Blocking errors stop pipeline. Status: FIXED (was P0-1: always passed true).
  • Gate C (GATE_C_EVIDENCE) validates evidence spans. Uses _validate_extraction_evidence_contract() with per-section 20% density floor. Status: FIXED (was P0-2: only checked list presence).
Stage 06 -- NORMALIZATION
  • Normalize raw skill terms against canonical taxonomy: 55,000+ hard skills, 63,000+ aliases, 146 soft skills.
  • Match types: EXACT, FUZZY, SEMANTIC.
  • Certification-to-skill expansion via intelletto.certification_skill_map table.
  • Track all normalization decisions with confidence + match_type in intelletto.normalization_result.
  • Do not overwrite raw terms; store canonical mappings separately.
  • Quality alerts written to intelletto.normalization_quality_alert.
Stage 07 -- DATA_FUSION_ENRICHMENT
  • Fetch and parse URLs found in the resume: LinkedIn, GitHub, portfolio, personal websites.
  • SSRF protection: DNS resolve + private-IP block, scheme/port allowlist, redirect validation (32 tests).
  • Enrichment evidence logged with source, URL, HTTP status, character count to intelletto.enrichment_audit.
  • When candidate has websites, they MUST be fetched and parsed. Not parsing is a failure.
  • Enrichment results stored in intelletto.enrichment_payload.
  • If enrichment fails: skip and proceed. Do not block scoring inputs.
Stage 08 -- VALIDATION_GATES

Five gates control pipeline progression. Each gate has a passed boolean written to intelletto.pipeline_gate_result.

Gate Code What it validates Status
Gate A GATE_A_LOSSLESS OCR coverage -- page map completeness vs source page count Working correctly
Gate B GATE_B_SCHEMA Extraction schema validity against IIR schema. Now uses _partition_validation_issues() -- blocking errors stop pipeline. FIXED (was P0-1)
Gate C GATE_C_EVIDENCE Evidence span resolution rate per structured field. Now uses _validate_extraction_evidence_contract() with 20% per-section density floor. FIXED (was P0-2)
Gate D GATE_D_DEDUP Deduplication result -- stops run as DEDUP_SKIPPED if duplicate Working correctly
Gate E GATE_E_ARTIFACTS Meta-gate: URL quality, bbox quality, normalization, work history Working correctly
Stage 09 -- SCORING_INPUT_BUILD
  • Create immutable intelletto.scoring_input_snapshot row.
  • Loads parsed resume, pipeline run, page map, canonical extraction, normalization results, gate status.
  • Hard-blocks if required gates are missing or failed.
  • Snapshot includes: normalized_skills_json, work_history_json, education_summary_json, certifications_json, languages_json.
  • In Mode A (Pool Build): pipeline terminates here with status POOL_READY.
Stage 10 -- GCS_ARCHIVE_MOVE
  • Move source PDF from intake prefix to RESUME-POOL/processed/ in GCS.
  • Update resume_document.object_uri with new location.
Stage 11 -- SCORECARD_GENERATION
  • Resolve published scoring_config_version for the JD.
  • Load JD skill requirements via 3-level fallback: jd_skill_requirement table, extracted_features_json, job_description_version.model_json.
  • Execute 8-bucket weighted scoring model:
    • Hard Skills (skill matching with exact/alias/fuzzy tiers)
    • Domain/Process (title relevance + skill overlap + industry continuity)
    • Scope/Complexity (seniority patterns)
    • Tenure/Recency (years + recency + stability)
    • Soft/Behavioral (12 soft signals, leadership weighted 1.2x)
    • Languages/Communications (English + additional)
    • Compliance/Scheduling (completeness + confidence + gate health)
    • Education/Certifications (tenure decay model: e^(-0.18 * max(0, years - 3)))
  • Gate evaluation: N_OF_M_SKILLS, MIN_VALUE, MAX_VALUE, BOOLEAN_TRUE, TIMEZONE_OVERLAP.
  • Modifiers: data fusion confidence + evidence coverage JD duties.
  • Persist: scorecard_version, bucket_score, modifier_result, scoring_gate_result, scoring_run_metering.
Stage 12 -- RECRUITER_ARTIFACTS
  • Build Intelletto Resume snapshot (HTML/PDF/JSON) from extraction data.
  • Persist to intelletto.intelletto_resume_snapshot.
  • DB fallback for extraction data when orchestrator context is unavailable.
  • In Mode A: terminal state is POOL_READY (not COMPLETED).
  • Pipeline run status set to COMPLETED (Mode B/C) or POOL_READY (Mode A).

6) Artifacts produced/consumed (what the pipeline produces)

Core canonical artifacts (must exist for every resume)
  • intelletto.resume_document -- file metadata, GCS URI, checksum
  • intelletto.parsing_pipeline_run -- run lifecycle, status, stages
  • intelletto.pipeline_phase_event -- per-stage telemetry
  • intelletto.parsed_resume -- extraction_json (JSONB)
  • intelletto.normalization_result -- per-skill canonical mappings
  • intelletto.pipeline_gate_result -- gate pass/fail per run
  • intelletto.scoring_input_snapshot -- scoring-ready normalized data
  • intelletto.coverage_ledger -- page/block coverage proof

7) Validation gates (5-gate framework)

Defined in Stage 08 above. Gates A through E with their current implementation status.

8) Failure modes + recovery paths

  • Schema invalid JSON (Gate B blocks)
    • Recovery: run Repair Prompt once; if still invalid, re-extract with stricter constraints and reduced temperature.
  • Coverage ledger fails (missing blocks/pages)
    • Recovery: re-run Stage 03 (OCR_LAYOUT) with fallback extractor; do not proceed to Stage 05.
  • Evidence density below threshold (Gate C blocks)
    • Recovery: re-extract with "evidence required" enforcement or move content to unmapped_content[].
  • Cost spike (unexpected retries / OCR invoked)
    • Recovery: fail the run with COST_GATE_EXCEEDED status unless cost_gate_override flag is set.
  • Dedup match (Gate D)
    • Recovery: pipeline terminates with DEDUP_SKIPPED. Not a failure -- expected behavior for duplicate resumes.

9) Acceptance tests (how to prove it works)

For any resume, the pipeline MUST prove:

  • All 12 stages complete (or subset per pipeline mode) with telemetry in pipeline_phase_event.
  • All pages present in page_map (N/N).
  • Structured extraction includes major constructs present in the source with evidence links.
  • Skills normalized with canonical IDs and confidence scores.
  • Scoring input snapshot created with all gates passed.
  • Scorecard generated (Mode B/C) with per-bucket scores, evidence, and rubric metadata.

Production verification: 218 documents processed end-to-end, 218/218 scored, 0 failures (April 2026).

Section 1: Architectural Principles

1) Intent

Define the non-negotiable architectural principles that govern every component of the Intelletto resume parsing pipeline -- so developers can implement and extend the system without breaking: (a) schema-first correctness, (b) 100% extraction + regeneration, (c) auditability, and (d) cost-per-resume control.

2) Why it matters (risks mitigated / dependencies unlocked)

These principles prevent the systemic failures that derail AI pipelines:

  • Silent loss of content (partial extraction): breaks the "regenerate without source" requirement.
  • Unexplainable outputs (no evidence chain): breaks trust and auditability, and blocks downstream scoring explainability.
  • Non-deterministic cost (unbounded retries, uncontrolled OCR/enrichment): kills unit economics at scale.
  • Data drift (schema and prompt changes degrade quality): breaks regression confidence.

3) Definitions for AI-new developers (if AI concepts appear)

  • Schema-first: the response_schema parameter is the contract; Gemini's API decoding layer enforces it. Prompts and code must obey it.
  • Deterministic pipeline: same input + same versions = same outputs, barring upstream extractor changes (tracked via checksums).
  • Repair vs re-extract:
    • Repair fixes JSON validity/formatting with minimal deviation (no new facts).
    • Re-extract re-runs the model on source content (higher cost; tightly controlled).
  • Provenance: where a datum came from (source page/block + extraction run id + model id + prompt version).

4) Requirements

MUST (architectural invariants)

MUST-AP-01: Schema governs everything

  • The pipeline MUST use response_schema with Gemini 2.5 Flash to constrain extraction output at the API decoding level.
  • A run MUST fail if Gate B detects blocking schema violations -- no partial "acceptance."

MUST-AP-02: Lossless first, structured second

  • The pipeline MUST persist a lossless page/block layer for every page.
  • Structured fields MUST be derived from lossless content; they MUST NOT replace it.

MUST-AP-03: 100% extraction coverage

  • Every page exists in lossless layer.
  • Every page has blocks with evidence geometry (bbox) and stable reading order.
  • Anything not mapped to structured fields is captured in unmapped_content[] with evidence.

MUST-AP-04: Evidence required for every structured fact

  • Gate C enforces per-section 20% evidence density floor via _validate_extraction_evidence_contract().

MUST-AP-05: Version everything

  • Every output MUST include: schema_version, model_id, pipeline_version, and a deterministic run_id.
  • Hashes for inputs: source_sha256, page_map_sha256, cleaned_text_sha256.

MUST-AP-06: Idempotent, retry-safe by design

  • Each stage MUST check intelletto.idempotency_record before writing.
  • Retries MUST not double-count metering units.
  • Extraction cached by checksum_sha256 + tenant_id.

MUST-AP-07: Cost gates are first-class

  • Each stage MUST emit pipeline_phase_event records.
  • The orchestrator MUST enforce cost_ceiling_usd per resume/run unless overridden.

MUST-AP-08: Separation of concerns (domain-driven)

  • Extraction (Stage 05) MUST NOT do canonicalization/normalization (Stage 06).
  • Enrichment (Stage 07) MUST be isolated and optional.
  • Scoring (Stage 11) consumes the scoring_input_snapshot produced by Stage 09.

MUST-AP-09: Tenant isolation

  • All queries MUST be scoped by tenant_id. No cross-tenant data leakage.
  • Dedup is tenant-scoped. Extraction cache is tenant-scoped.
SHOULD (strong guidance)
  • Prefer direct PDF upload to Gemini over OCR-to-text prompting.
  • Keep temperature low (0.1) and maximize schema constraints.
  • Normalize with deterministic rules and database lookups (not LLM normalization).
MAY (optional but allowed)
  • Add multiple extraction passes only if metering and gates remain strict and predictable.

5) Implementation steps

  • Domain-driven architecture: each pipeline stage is a separate module in v3/src/domains/pipeline/stages/.
  • Orchestrator pattern: v3/src/domains/pipeline/orchestrator.py drives stage execution. event_orchestrator.py provides event-driven alternative with Pub/Sub support.
  • Always persist lossless page map before calling Gemini (Stage 03 before Stage 05).
  • Run structured extraction with response_schema: direct PDF upload, no OCR-to-text prompting.
  • Normalize after extraction: Stage 06 uses database lookups against 55K+ hard skills taxonomy.
  • Enforce gates in Stage 08: all 5 gates evaluated, results persisted to pipeline_gate_result.

Section 2: Technology Stack (Gemini 2.5 Flash, Python, FastAPI, Cloud Run, Cloud SQL)

1) Intent

Document the Google-first toolchain used by Intelletto's v3 production pipeline:

  • Direct PDF extraction via Gemini 2.5 Flash with response_schema enforcement
  • Lossless evidence capture via Document AI OCR + layout page maps
  • Deterministic cost metering via pipeline_phase_event and scoring_run_metering
  • PostgreSQL-first persistence via Cloud SQL (psycopg3 async)
  • Cloud Run deployment for stateless, auto-scaling execution

2) Why it matters

  • Prompt drift without regressions: fixed by response_schema enforcement + golden test corpus (31/36 within 5 pts of v2).
  • Untraceable outputs: fixed by coupling Gemini extraction with Document AI layout page maps.
  • Unbounded cost: fixed by extraction caching (checksum-based) and cost_ceiling_usd enforcement.
  • Schema inconsistency: CRITICAL -- without response_schema, Gemini returns skills in inconsistent formats leading to zero-skill failures.

3) Production Stack

Component Technology Details
LLM Gemini 2.5 Flash google-genai Python SDK. Direct PDF upload with response_schema. Temperature 0.1.
OCR Document AI Enterprise Document OCR Layout extraction for evidence geometry. Processor ID in env var.
API Framework FastAPI (Python 3.11+) Async/await throughout. Uvicorn server.
Database PostgreSQL 16 on Cloud SQL Instance: intelletto-ai, region: asia-southeast1, schema: intelletto, ~169 tables.
DB Driver psycopg3 (async) Connection pool in v3/src/database/pool.py. prepare_threshold=None.
Object Storage Google Cloud Storage Bucket: intelletto-ai-resume-parse-404886655151. RESUME-POOL prefix structure.
Compute Google Cloud Run 2 vCPU / 2Gi, 2 workers, min 1 / max 10. Region: asia-southeast1.
Batch Processing Gemini Batch API v3/src/integrations/google/batch_gemini.py. 50% cost reduction, 200K requests/job, bypasses QPM.
Secrets Google Secret Manager INTELLETTO_DB_URL, GEMINI_API_KEY, API_KEY, GITHUB_TOKEN, GOOGLE_OAUTH_CLIENT_SECRET

4) Gemini Extraction Architecture

Key design decision: Direct PDF upload to Gemini 2.5 Flash with response_schema parameter. This is NOT OCR-to-text-to-LLM prompting.

  • The PDF binary is uploaded directly as a Gemini content part.
  • The response_schema parameter forces Gemini to output JSON conforming to the declared schema at the API decoding level.
  • This eliminates the zero-skill failure mode where skills returned in inconsistent formats.
  • Skills ALWAYS return as {technical: string[], soft: string[]} -- no variation.
  • Extraction is cached by checksum_sha256 scoped per tenant_id.

Gemini call pattern (conceptual):

# v3/src/integrations/google/gemini.py
from google import genai

client = genai.Client(api_key=api_key)
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        {"text": system_instruction},
        {"inline_data": {"mime_type": "application/pdf", "data": pdf_bytes}}
    ],
    config={
        "temperature": 0.1,
        "response_schema": extraction_schema,   # enforced at decode
        "response_mime_type": "application/json"
    }
)

5) Database Connection Pattern

# v3/src/database/pool.py
# psycopg3 async pool with prepare_threshold=None (never prepare)
# DEALLOCATE ALL in reset callback
# All handlers use: async with pool.connection() as conn:

Section 3: Data Stores and Canonical Persistence Model

1) Intent

Define the canonical persistence model for Intelletto's resume parsing pipeline, including:

  • Which artifacts are stored where (GCS vs PostgreSQL)
  • The lossless-first storage pattern required for 100% extraction + regeneration
  • The response_schema extraction contract (direct PDF to Gemini)
  • The Evidence Span Protocol and Page Map Protocol
  • The Per-Step Metering via pipeline_phase_event

2) PostgreSQL Schema (intelletto.*)

All tables live in the intelletto schema on Cloud SQL. Key tables:

Pipeline Tables

Table Purpose Key Columns
resume_document Raw file metadata + status resume_document_id, tenant_id, candidate_id, object_uri, checksum_sha256, status (RECEIVED/PROCESSED/POOL_READY/DUPLICATE_REVIEW/PROCESS_FAILED)
parsing_pipeline_run Per-run orchestration state pipeline_run_id, tenant_id, resume_document_id, status (PENDING/RUNNING/COMPLETED/FAILED/COST_GATE_EXCEEDED/DEDUP_SKIPPED/PARTIAL), current_stage, cost_ceiling_usd
pipeline_phase_event Per-stage telemetry pipeline_run_id, stage, outcome, error_code, error_message, started_at, completed_at
parsed_resume Structured extraction output parsed_resume_id, tenant_id, candidate_id, extraction_json (JSONB), evidence_spans_json, confidence_summary_json
candidate_profile Extracted identity candidate_id, tenant_id, full_name, email, phone, linkedin_url, current_title, seniority
normalization_result Skill canonical mappings raw_term, canonical_skill_id, canonical_skill_name, match_type (EXACT/FUZZY/SEMANTIC), confidence
pipeline_gate_result Gate pass/fail outcomes pipeline_run_id, gate_code (GATE_A through GATE_E), passed (boolean), failure_reason, failure_detail (JSONB)
coverage_ledger Per-page OCR coverage pipeline_run_id, page_count_total, page_count_processed, gate_a_lossless_passed
scoring_input_snapshot Immutable scoring handoff scoring_input_id, pipeline_run_id, parsed_resume_id, normalized_skills_json, work_history_json, all_gates_passed
enrichment_payload Data fusion results pipeline_run_id, enrichment data JSONB
enrichment_audit Fetch attempt log source, url, http_status, char_count, ssrf_blocked
idempotency_record Replay and dedup protection key, created_at

Scoring Tables

Table Purpose Key Columns
scorecard_version Scored output versions scorecard_version_id, scoring_config_version_id, candidate_id, score_total, base_fit, modifier_points, status (SCORED/DISQUALIFIED/PENDING/FAILED)
bucket_score Per-bucket scoring breakdown scorecard_version_id, bucket_id, score, weight, contribution, evidence_json, rubric_applied_json
modifier_result Scoring modifier outputs scorecard_version_id, modifier_id, points_awarded, basis_json
scoring_gate_result Per-gate evaluation scorecard_version_id, gate_code, status (PASS/FAIL/SKIPPED), severity (DISQUALIFY/WARN/INFO)
scoring_config_version Scoring config lifecycle scoring_config_version_id, config_payload (JSONB), status (DRAFT/PUBLISHED/DEPRECATED)
scoring_run_metering Per-scorecard metering scorecard_version_id, gemini_calls, tokens_in/out, latency breakdowns, db_writes

Taxonomy Tables

Table Purpose Scale
skill_hard Canonical hard skills 55,000+ skills with category, is_hot_tech, is_emerging
skill_soft Canonical soft skills 146 skills with source_framework
skill_alias Alias-to-skill mappings 63,000+ aliases with confidence
vendor_certification Certification taxonomy cert_name, cert_type (CERTIFICATION/SPECIALIZATION), vendor
certification_skill_map Cert-to-skill expansion cert_name to skill_name with skill_dimension (HARD/SOFT), confidence_boost

3) Google Cloud Storage Layout

gs://intelletto-ai-resume-parse-404886655151/
  RESUME-POOL/
    {JD_CODE}/                    # Resumes assigned to a JD
      CV - Name - Title.pdf
    unassigned/                   # Resumes without JD assignment
      resume.pdf
    processed/                    # Archived after pipeline completion
      {document_id}.pdf

Immutability rule: Objects are write-once. Archive moves use GCS copy + delete. Source URIs updated in resume_document.object_uri.

Section 4: End-to-End Workflow

1) Intent

Define the end-to-end execution workflow for the v3 pipeline, from ingest through scored output. This section specifies the orchestration, stage dependencies, and mode-specific behavior.

2) Orchestration

Two orchestrator implementations exist:

  • Synchronous orchestrator (v3/src/domains/pipeline/orchestrator.py): drives stages sequentially within a single request. Used for Mode C (Direct JD) and Mode A (Pool Build).
  • Event-driven orchestrator (v3/src/domains/pipeline/event_orchestrator.py): supports Pub/Sub-driven stage execution with MODE_STAGES filtering and retry semantics. Currently PUBSUB_ENABLED=false (event orchestrator has parity but hasn't been live-tested with Pub/Sub).

3) Mode A -- Pool Build (Stages 01-06, 08-09)

  1. Stage 01 RESUME_REGISTERED: Create resume_document (RECEIVED), candidate_profile, pipeline_run (PENDING).
  2. Stage 02 DEDUP_CHECK: Tenant-scoped SHA256 + signal matching. If duplicate, terminate with DEDUP_SKIPPED.
  3. Stage 03 OCR_LAYOUT: Document AI OCR for layout evidence. Gate A validates page coverage.
  4. Stage 04 DATA_CLEANING: Normalize whitespace, tag headers/footers. Pass-through if OCR skipped.
  5. Stage 05 STRUCTURED_EXTRACTION: Direct PDF to Gemini 2.5 Flash with response_schema. Cache by checksum + tenant. Gates B + C validated.
  6. Stage 06 NORMALIZATION: 55K+ hard skills, 63K+ aliases, 146 soft skills. EXACT/FUZZY/SEMANTIC matching.
  7. Stage 08 VALIDATION_GATES: All 5 gates evaluated, results persisted.
  8. Stage 09 SCORING_INPUT_BUILD: Create immutable scoring_input_snapshot. Pipeline terminates with POOL_READY.

4) Mode B -- Pool Activation (Stages 07, 10-12)

  1. Stage 07 DATA_FUSION_ENRICHMENT: Fetch LinkedIn, GitHub, portfolio URLs. SSRF-protected.
  2. Stage 10 GCS_ARCHIVE_MOVE: Move PDF to processed/ prefix.
  3. Stage 11 SCORECARD_GENERATION: 8-bucket weighted scoring against assigned JD.
  4. Stage 12 RECRUITER_ARTIFACTS: Build Intelletto Resume (HTML/PDF/JSON). Status becomes COMPLETED.

5) Mode C -- Direct JD (Stages 01-12)

All 12 stages execute sequentially. Resume arrives with a JD assignment (via GCS directory path or explicit assignment). No intermediate POOL_READY state.

6) Stage Dependencies and Data Flow

01 RESUME_REGISTERED
  |-> resume_document, candidate_profile, pipeline_run
02 DEDUP_CHECK
  |-> dedup_cluster (if match -> DEDUP_SKIPPED, stop)
03 OCR_LAYOUT
  |-> coverage_ledger, page_map (Gate A)
04 DATA_CLEANING
  |-> cleaned page_map
05 STRUCTURED_EXTRACTION
  |-> parsed_resume.extraction_json (Gates B, C)
06 NORMALIZATION
  |-> normalization_result rows
08 VALIDATION_GATES
  |-> pipeline_gate_result rows (all 5 gates)
09 SCORING_INPUT_BUILD
  |-> scoring_input_snapshot (Mode A stops here: POOL_READY)
07 DATA_FUSION_ENRICHMENT
  |-> enrichment_payload, enrichment_audit
10 GCS_ARCHIVE_MOVE
  |-> updated object_uri
11 SCORECARD_GENERATION
  |-> scorecard_version, bucket_score, modifier_result, scoring_gate_result
12 RECRUITER_ARTIFACTS
  |-> intelletto_resume_snapshot (COMPLETED)

Section 5: API Contracts

1) Intent

Define the API contracts for the v3 resume parsing pipeline. The v3 system uses domain-driven route organization.

2) v3 API Routes

Pipeline Endpoints

Method Path Purpose
POST /api/v3/pipeline/process Trigger pipeline processing for a document
POST /api/v3/pipeline/rescore/{document_id} Re-score a document (append-only, new scorecard_version)
POST /api/v3/pipeline/rescore/bulk Bulk re-score multiple documents
GET /api/v3/pipeline/latency-report Pipeline latency SLA dashboard data
GET /api/v3/health Health check (DB connectivity, service status)

Intake Endpoints

Method Path Purpose
POST /api/v1/intake/documents/upload Browser upload
POST /api/v1/intake/documents/import_gcs_prefix Bulk GCS import from RESUME-POOL
GET /api/v1/intake/aggregator/documents List received resumes
GET /api/v1/intake/intelletto_resumes List parsed resumes
POST /api/v1/intake/aggregator/ai_jd_assignment AI-powered JD assignment (detects JD code from filename/GCS path)

Scoring Endpoints

Method Path Purpose
POST /api/v1/scoring/scorecards Generate scorecard for a resume against a JD
GET /api/v3/scoring/interview-brief/{id} Evidence-derived interview probes
GET /api/v3/scoring/comp-inference Compensation inference (P25/P50/P75)
GET /api/v3/scoring/jd-calibration/{id} JD difficulty calibration
POST /api/v3/scoring/decision Recruiter feedback signal loop

Candidate Endpoints

Method Path Purpose
GET /api/v3/candidates/{id}/authenticity Resume authenticity scoring
GET /api/v3/candidates/{id}/sector-profile Industry/sector classification
GET /api/v3/candidates/{id}/bias-audit Bias audit (advisory only)
GET /api/v3/candidates/{id}/cross-jd-fit Multi-JD cross-fit analysis
POST/GET /api/v3/candidates/{id}/cover-letter Cover letter + sentiment analysis

3) Internal Stage Execution

Pipeline stages are NOT exposed as individual HTTP endpoints in v3. Instead, the EventDrivenOrchestrator (or synchronous orchestrator) calls stage functions directly within the same process. Stage results flow through an orchestrator context dictionary.

The v2 internal endpoints (/internal/v1/...) are retained for backward compatibility but all new pipeline execution uses the v3 orchestrator pattern.

Section 6: Cost Model and Metering

1) Intent

Define the metering and cost model for the pipeline so that cost-per-resume is deterministic and auditable.

2) Metering Implementation

Metering is implemented via two tables:

  • intelletto.pipeline_phase_event: per-stage telemetry for every pipeline run. Records stage, outcome, error_code, error_message, started_at, completed_at, and computed duration.
  • intelletto.scoring_run_metering: per-scorecard metering. Records gemini_calls, gemini_tokens_in/out, gate_evaluations, bucket_computations, modifier_computations, db_writes, and latency breakdowns (load_inputs, gates, buckets, modifiers, persist, total).

3) Cost Control

  • Extraction caching: Results cached by checksum_sha256 + tenant_id. Duplicate documents skip Gemini entirely.
  • Cost ceiling: parsing_pipeline_run.cost_ceiling_usd enforced per run. Override via cost_gate_override flag.
  • Batch processing: v3/src/integrations/google/batch_gemini.py provides 50% cost reduction for bulk workloads via Gemini Batch API.
  • Budget stop: If cost_ceiling_usd exceeded, run terminates with status COST_GATE_EXCEEDED. Partial artifacts preserved.

Section 7: Quality Metrics

1) Intent

Define the required quality metrics that prove the pipeline meets the prime directive: strict schema compliance, 100% extraction coverage, evidence traceability, and operational robustness.

2) Core Metrics

  • M1 -- schema_valid: Gate B passes (extraction schema valid, blocking errors absent).
  • M2 -- coverage_complete: Gate A passes (all pages captured in lossless layer).
  • M3 -- evidence_density: Gate C passes (20% per-section fact density floor).
  • M4 -- normalization_coverage: Percentage of raw terms with canonical matches.
  • M5 -- pipeline_success_rate: Ratio of COMPLETED runs to total runs.
  • M6 -- scoring_completion_rate: Ratio of scored documents to parseable documents.

3) Production Results

  • Pipeline success rate: 218/218 documents completed (100%) in April 2026 trial.
  • Scoring completion rate: 218/218 scored (100%), 0 failures.
  • Golden corpus: 31/36 resumes within 5 points of v2 scoring baseline.
  • Test suite: 747 tests passing, 0 failures.

4) Regression Testing

Golden test corpus at v3/goldens/ provides regression baseline. The v3 test suite covers:

  • Contract gate tests (schema validation, evidence contract)
  • Contact URL recall tests
  • Intelletto resume lossless tests
  • Process QA tests
  • Pipeline validation tests
  • Scoring runtime tests
  • Batch Gemini tests

Section 8: Error Handling, Retries, and Safe Degradation

1) Intent

Specify the error-handling contract for the resume pipeline so that runs are idempotent, retry-safe, repair-capable, and degradable (never lose content).

2) Pipeline Run Status Machine

Status Meaning
PENDING Run created, not yet started
RUNNING Active stage execution
COMPLETED All stages passed, scorecard generated
POOL_READY Mode A: parsed and normalized, awaiting JD assignment
FAILED Unrecoverable error
COST_GATE_EXCEEDED Budget exceeded, partial artifacts preserved
DEDUP_SKIPPED Duplicate detected, pipeline stopped (not a failure)
PARTIAL Some stages completed, run interrupted

3) Error Taxonomy

  • Input errors (permanent): PDF unreadable, unsupported format. Recovery: fail fast.
  • Auth errors (permanent until fixed): Gemini API key invalid, IAM denied. Recovery: fail fast.
  • Rate limits (transient): 429 RESOURCE_EXHAUSTED. Recovery: exponential backoff.
  • Service availability (transient): 503/504, timeouts. Recovery: retry.
  • OCR failures: Document AI partial pages. Recovery: retry once; if still partial, extraction proceeds with direct PDF to Gemini (graceful degradation).
  • LLM output failures: JSON invalid or schema mismatch. Recovery: repair once, re-extract once, then fail.
  • Persistence failures: DB connection (transient: retry) or constraint violation (permanent: fail).

4) Idempotency Pattern

# Every pipeline write checks idempotency_record first
existing = await db.fetchrow(
    "SELECT id FROM intelletto.idempotency_record WHERE key = $1",
    idempotency_key
)
if existing:
    return {"status": "already_processed", "id": str(existing["id"])}

5) Concurrency Guard

Pipeline runs use FOR UPDATE SKIP LOCKED with a 30-minute stale threshold to prevent concurrent execution of the same document.

Section 9: Security, Privacy, and Compliance

1) Tenant Isolation

  • All database queries are scoped by tenant_id. No cross-tenant data leakage.
  • Dedup queries are tenant-scoped.
  • Extraction cache is scoped by tenant_id + checksum_sha256.
  • All sensitive v1 routes require tenant_id parameter (422 without).

2) SSRF Protection (Data Fusion)

  • DNS resolve + private-IP block before fetching any candidate URL.
  • Scheme allowlist (http/https only), port allowlist.
  • Redirect validation (no redirects to private IPs).
  • 32 tests covering SSRF protection scenarios.

3) Authentication

  • Pipeline worker requires OIDC Bearer token or API key (not User-Agent).
  • Google OAuth configured for api.intelletto.ai.
  • API key authentication for programmatic access.

4) Data Classification

  • All resume documents classified as PII_HIGH by default.
  • resume_document.pii_classification field tracks classification.
  • retention_policy_id and purge_at fields support data lifecycle management.

Section 10: Scoring Engine (8-Bucket Weighted Model)

1) Intent

Document the scoring engine that evaluates candidates against job descriptions using an 8-bucket weighted model with gates and modifiers.

2) Scoring Flow

  1. Config resolution: Load published scoring_config_version for the JD.
  2. Input loading: Load scoring_input_snapshot + JD skill requirements (3-level fallback).
  3. Gate evaluation: Run gates first. If a DISQUALIFY gate fails, scoring stops.
  4. Bucket scoring: 8 buckets scored independently via ScorerRegistry.
  5. Base fit: Weighted sum of bucket scores (enabled bucket weights sum to 1.0).
  6. Modifier computation: Data fusion confidence + evidence coverage adjust +/- points.
  7. Final score: Base fit + modifier points, clamped 0..100.
  8. Classification: STRONG / BORDERLINE / RISK with ADVANCE / REVIEW / HOLD recommendation.

3) 8 Scoring Buckets

# Bucket Formula
1 Hard Skills required_match_rate * 75% + nice_to_have_rate * 25% + quality_bonus (up to +8). Match tiers: exact=1.0, alias>=0.85, fuzzy>=0.65.
2 Domain/Process title_relevance*50% + skill_domain_overlap*30% + industry_continuity*20%
3 Scope/Complexity seniority_score*60% + scope_signals*40%. Chief/VP=100, Director=90, Senior=80, Manager=75, Junior=35, Intern=20.
4 Tenure/Recency years_score*40% + recency_score*35% + stability_score*25%
5 Soft/Behavioral 12 soft signals with leadership/problem_solving weighted 1.2x
6 Languages/Communications english_score*60% + additional_languages*40%
7 Compliance/Scheduling completeness*50% + norm_confidence*30% + gate_health*20%
8 Education/Certifications Education tenure decay: e^(-0.18 * max(0, years - 3)). Seniority-aware: Junior (degree 60%, skills 40%) to Executive (degree 5%, skills 95%). Cert bonus capped at 0.12. Combined: education 55% + cert matching 35% + cert-expanded skill bonus 10%.

4) Scoring Config Lifecycle

  • DRAFT: editable, not used for scoring.
  • PUBLISHED: immutable, active for scoring. Enabled bucket weights must sum to 1.0.
  • DEPRECATED: archived, not used for new scoring.

5) JD Skill Requirements Loading

Three-level fallback ensures skills are available for scoring:

  1. intelletto.jd_skill_requirement table (structured, preferred)
  2. job_description_version.extracted_features_json (legacy)
  3. job_description_version.model_json (JD Orchestrator)

Section 11: Acceptance Criteria

1) Pipeline Acceptance

  • All 12 stages complete (or mode-appropriate subset) for every input document.
  • All 5 validation gates evaluated and persisted.
  • Pipeline phase events recorded for every executed stage.
  • Scoring input snapshot created with all gates passed.

2) Extraction Acceptance

  • Skills always returned as {technical: string[], soft: string[]} (response_schema enforced).
  • No zero-skill candidates (schema enforcement prevents this).
  • Evidence density >= 20% per structured section (Gate C).

3) Scoring Acceptance

  • 8 bucket scores computed with evidence and rubric metadata.
  • Per-bucket contribution = score * weight.
  • Base fit = sum of contributions.
  • Modifiers applied within budget.
  • Final score clamped 0..100.

4) Lossless Resume Output Constraints

The lossless spec defines what every generated Intelletto resume must contain. Non-negotiable:

  • NEVER produce: "No detailed achievements mapped in this parse." -- this is always a defect.
  • NEVER truncate experience sections for any reason.
  • NEVER omit advisory roles, even those outside the main employment timeline.
  • NEVER omit education or certifications -- if absent from source, write "Not captured in source document".
  • The Intelletto Resume word count must always exceed the source resume word count.

5) Test Suites

Suite Location Count
v3 tests v3/tests/ 101+ tests
v2 contract gates tests/test_contract_gates_unittest.py Gate B, Gate C, validation tests
Contact URL recall tests/test_contact_url_recall_unittest.py URL extraction accuracy
Lossless resume tests/test_intelletto_resume_lossless_unittest.py Output completeness
Process QA tests/test_process_qa_unittest.py End-to-end pipeline QA
Pipeline validation tests/test_pipeline_validation_unittest.py Stage validation
Scoring runtime tests/test_scoring_runtime_unittest.py Scorer registry, bucket dispatch
Batch Gemini tests/test_batch_gemini_unittest.py Batch API integration
Total 747 tests passing

Section 12: Extraction Schema

1) response_schema Enforcement

The v3 pipeline uses Gemini 2.5 Flash with response_schema parameter. This constrains model output at the API decoding level, meaning the JSON structure is guaranteed by the API, not by post-hoc validation.

2) Key Schema Shape

{
  "basics": {
    "name": "string",
    "email": "string",
    "phone": "string",
    "location": "string",
    "urls": ["string"]
  },
  "work_history": [
    {
      "title": "string",
      "company": "string",
      "start_date": "string",
      "end_date": "string",
      "achievements": ["string"]
    }
  ],
  "education": [
    {
      "school": "string",
      "degree": "string",
      "field": "string",
      "graduation_date": "string"
    }
  ],
  "skills": {
    "technical": ["string"],    // ALWAYS this shape
    "soft": ["string"]          // ALWAYS this shape
  },
  "certifications": [
    {
      "name": "string",
      "issuer": "string",
      "date": "string"
    }
  ],
  "languages": [
    {
      "language": "string",
      "proficiency": "string"
    }
  ]
}

Critical: The skills field ALWAYS returns as {technical: string[], soft: string[]}. Without response_schema enforcement, Gemini would return skills in inconsistent formats (flat array, nested objects, comma-separated strings) leading to zero-skill failures in normalization and scoring.

3) parsed_resume.extraction_json

The full extraction output is stored in intelletto.parsed_resume.extraction_json as a JSONB column. This is the canonical extraction artifact for the pipeline run.

Section 13: Development Environment Setup

1) Prerequisites

  • Python 3.11+
  • Google Cloud SDK (gcloud)
  • Access to Intelletto Google Cloud project

2) Local Development

# Clone and setup
cd Intelletto
python -m venv .venv
source .venv/bin/activate
pip install -r v3/requirements.txt

# Start v3 server (port 8080)
cd v3
uvicorn main:app --port 8080 --reload

# Or use the startup script
./start_services.sh

3) Database Access

The database is remote Cloud SQL (not local). Never attempt local postgres commands.

# Cloud SQL instance
Instance: intelletto-ai
Region: asia-southeast1
Schema: intelletto

# Connection via Cloud SQL Proxy (for local dev)
cloud_sql_proxy -instances=<project>:asia-southeast1:intelletto-ai=tcp:5432

# Or set INTELLETTO_DB_URL env var directly

4) GCS Bucket

GCS_BUCKET=intelletto-ai-resume-parse-404886655151

# Resume intake path
gs://intelletto-ai-resume-parse-404886655151/RESUME-POOL/

# Processed archive path
gs://intelletto-ai-resume-parse-404886655151/RESUME-POOL/processed/

5) Deployment

# Build and deploy to Cloud Run
gcloud builds submit --config v3/cloudbuild.yaml
gcloud run deploy intelletto-api \
  --region asia-southeast1 \
  --image <image> \
  --min-instances 1 \
  --max-instances 10 \
  --cpu 2 \
  --memory 2Gi

6) Running Tests

# v2 contract tests (always use .venv/bin/python)
./.venv/bin/python -m pytest tests/test_contract_gates_unittest.py \
  tests/test_contact_url_recall_unittest.py \
  tests/test_intelletto_resume_lossless_unittest.py \
  tests/test_process_qa_unittest.py -q

# v3 tests
./.venv/bin/python -m pytest v3/tests/ -q

# Expected: all tests pass (747 total)

7) Environment Variables

Variable Purpose
INTELLETTO_DB_URL Cloud SQL connection string
GEMINI_API_KEY Google Gemini API key
API_KEY Intelletto API authentication key
GCS_BUCKET GCS bucket name (intelletto-ai-resume-parse-404886655151)
GOOGLE_CLOUD_PROJECT Google Cloud project ID
GOOGLE_DOCUMENT_AI_PROCESSOR_ID Document AI processor ID
GOOGLE_DOCUMENT_AI_LOCATION Document AI processor location
AUTH_ENABLED Enable/disable authentication (true for production)
PUBSUB_ENABLED Enable event-driven orchestrator (currently false)

Section 14: Data Model -- Entities and Relationships

1) Core Entity Relationships

resume_document (1) -----> (N) parsing_pipeline_run
  |                              |
  +-> candidate_profile (1:1)    +-> pipeline_phase_event (1:N)
                                 +-> pipeline_gate_result (1:N)
                                 +-> coverage_ledger (1:1)
                                 +-> parsed_resume (1:1)
                                 |     +-> normalization_result (1:N)
                                 +-> scoring_input_snapshot (1:1)
                                       +-> scorecard_version (1:N)
                                             +-> bucket_score (1:N)
                                             +-> modifier_result (1:N)
                                             +-> scoring_gate_result (1:N)
                                             +-> scoring_run_metering (1:1)

2) Key Foreign Key Chains

  • resume_document -> candidate_profile (via candidate_id)
  • parsing_pipeline_run -> resume_document (via resume_document_id)
  • parsed_resume -> parsing_pipeline_run (via parsing_job_id)
  • scoring_input_snapshot -> parsed_resume (via parsed_resume_id)
  • scorecard_version -> scoring_input_snapshot (via scoring_input_id)
  • resume_document_job_map -> resume_document + job_requisition

3) Key JSONB Shapes

extraction_json (in parsed_resume)

{
  "basics": {"name": "...", "email": "...", "phone": "...", "urls": [...]},
  "work_history": [{"title": "...", "company": "...", "dates": {...}, "achievements": [...]}],
  "education": [{"school": "...", "degree": "...", "field": "..."}],
  "skills": {"technical": ["..."], "soft": ["..."]},
  "certifications": [{"name": "...", "issuer": "...", "date": "..."}],
  "languages": [{"language": "...", "proficiency": "..."}]
}

normalized_skills_json (in scoring_input_snapshot)

[
  {"raw_term": "ReactJS", "canonical_id": "uuid", "canonical_name": "React", "match_type": "EXACT", "confidence": 0.95},
  {"raw_term": "K8s", "canonical_id": "uuid", "canonical_name": "Kubernetes", "match_type": "FUZZY", "confidence": 0.88}
]

config_payload (in scoring_config_version)

{
  "gates": [{"gateId": "...", "name": "...", "severity": "DISQUALIFY|WARN|INFO", "rule": {...}}],
  "buckets": [{"bucketId": "...", "name": "...", "weight": 0.25, "enabled": true}],
  "modifiers": [{"modifierId": "...", "name": "...", "minPoints": -5, "maxPoints": 5}],
  "modifiersBudgetPoints": 10
}

Appendix A: Known Gaps and Remaining Work

A.1 Resolved P0 Gaps

ID Gap Resolution
P0-1 Gate B always passed true FIXED. _partition_validation_issues() separates blocking from advisory. Gate B fails on blocking errors. Intake stops pipeline on failure. 6 new tests.
P0-2 Gate C only checked list presence FIXED. _validate_extraction_evidence_contract() checks per-section fact coverage with 20% density floor. 9 new tests.

A.2 Partial Resolution

ID Gap Status
P0-3 Bucket dispatch + rubric persistence PARTIAL. Dispatch now via ScorerRegistry (exact match, not substring). rubric_applied_json has real scorer_id/input_hash/evidence_hash. Remaining: scorers are inline heuristics, not configurable rubric rules. 16 tests.

A.3 External Dependencies Not Yet Integrated

  • GitHub API integration: Requires GitHub API token for structured repository analysis.
  • Proxycurl for LinkedIn: Requires Proxycurl API key for LinkedIn profile parsing.
  • PII Redaction Pre-LLM: Blocked -- direct PDF to Gemini makes pre-LLM redaction architecturally difficult.

A.4 Enhancement Features Delivered

All delivered in v3, live on Cloud Run:

  • Re-scoring API (single + bulk, append-only)
  • Interview Brief Generator (evidence-derived, no Gemini)
  • Resume Authenticity Scoring
  • Industry/Sector Classification (15 NAICS-style sectors)
  • Bias Audit (advisory only)
  • Compensation Inference (P25/P50/P75)
  • JD Difficulty Calibration
  • Recruiter Feedback Signal Loop
  • Multi-JD Cross-JD Auto-Fit
  • Pipeline Latency SLA Dashboard
  • Cover Letter + Sentiment Analysis
  • AI JD Generator (Gemini-powered)