Resume Parsing Developer Document

Resume Parsing — Production System Reference

This document describes the Intelletto.ai production pipeline for converting unstructured candidate artifacts into high-fidelity, scoring-ready data. The pipeline is orchestrated by Netflix Conductor with workers deployed on Google Cloud Run. The focus is on auditability, cost-efficiency, and evidence-backed extraction.

Document Overview

Module: Resume Parsing Pipeline
Version: 1.0.0
Last Updated: 2026-04-04
Pipeline Version: v3
Audience: Backend, Data, and AI Engineering

Core Objectives

The pipeline transforms PDFs, DOCX files, and raw text into an immutable Internal Intermediate Representation (IIR) characterized by:

Explainability: Every extracted fact links to a specific source segment or page.
Determinism: Predictable behaviors with idempotent processing and explicit retry logic.
Cost Management: Stage-level attribution for token usage and model invocations.
Auditability: Versioned records with a full provenance trail.

Technical Stack

Component	Specification
Primary API	FastAPI (Python 3.11+) on Google Cloud Run
AI Runtime	Google Gemini 2.5 Flash via `google-genai` Python SDK with `response_schema` enforcement
Extraction Method	Direct PDF upload to Gemini (not OCR-to-text prompting). Schema constrains output at API decoding level.
OCR / Layout	Google Document AI Enterprise Document OCR (fallback for layout evidence)
Database	PostgreSQL 16+ on Google Cloud SQL (`asia-southeast1`) via psycopg3 async connection pool
Storage	Google Cloud Storage (`intelletto-ai-resume-parse-404886655151`)
Pipeline Orchestration	Netflix Conductor OSS 3.15.0 on GCE (asia-southeast1-b). Workflows define stage ordering, decision branches, retries, and rate limits. HUMAN task support for recruiter review gates.
Worker	Cloud Run service (`intelletto-worker`), HTTP-based task polling via `httpx`. Each stage is an async handler registered in `worker_registry.py`.
Rate Limiting	Conductor task-level: 30 RPM / 4 concurrent for Gemini stages. Python-side token bucket as secondary safety net.
Deployment	Cloud Run (asia-southeast1). API: 2 vCPU / 2Gi. Worker: always-on, min 1 instance.
Observability	Cloud Logging + structured pipeline_phase_event telemetry + Conductor workflow/task history
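
The Python-side token bucket named in the Rate Limiting row could look roughly like the sketch below. This is a minimal illustration, not the production limiter: the class name `TokenBucket`, its parameters, and the injectable clock are assumptions. Only the 30 RPM figure comes from the table above.

```python
import time
from typing import Callable


class TokenBucket:
    """Hypothetical secondary limiter: refills continuously up to `capacity`."""

    def __init__(self, rate_per_minute: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate_per_sec = rate_per_minute / 60.0
        self.capacity = float(capacity)
        self.tokens = float(capacity)
        self.clock = clock          # injectable so tests do not need to sleep
        self.last = clock()

    def try_acquire(self) -> bool:
        """Take one token if one is available. Never blocks."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill for the elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_sec)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# 30 requests per minute, with a burst size of 4 chosen here only for illustration.
gemini_bucket = TokenBucket(rate_per_minute=30, capacity=4)
```

A caller would check `gemini_bucket.try_acquire()` before each Gemini call and back off when it returns False. Conductor remains the primary limiter.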

v3 Architecture

The v3 codebase is domain-driven, replacing the v2 monolith (intelletto_server.py, retained as reference only).

Module	Path	Responsibility
Entry Point	`v3/main.py`	Uvicorn entry, app bootstrap
App Factory	`v3/src/api/app.py`	FastAPI app creation, middleware, route mounting
Pipeline Domain	`v3/src/domains/pipeline/`	Conductor adapter (`conductor/`): client, poller, worker_registry, worker_adapter. Stage runners.
Pipeline Stages	`v3/src/domains/pipeline/stages/`	12 stage modules: registration, dedup, ocr_layout, data_cleaning, extraction, normalization, enrichment, validation, scoring_inputs, gcs_archive, scorecard, artifacts
Scoring Domain	`v3/src/domains/scoring/`	ScoreService, ConfigService, interview brief, authenticity, bias audit, comp inference, JD calibration, sector classification, cross-JD fit
Intake Domain	`v3/src/domains/intake/`	Resume registration, GCS import, document management
JD Domain	`v3/src/domains/jd/`	Job description management, AI JD generation
Candidates Domain	`v3/src/domains/candidates/`	Candidate profiles, cover letters
Gemini Integration	`v3/src/integrations/google/gemini.py`	Gemini 2.5 Flash client with response_schema
GCS Integration	`v3/src/integrations/google/gcs.py`	Cloud Storage operations, RESUME-POOL structure
Document AI	`v3/src/integrations/document_ai/ocr.py`	OCR and layout extraction
Batch Gemini	`v3/src/integrations/google/batch_gemini.py`	Async batch processing (50% cost reduction, 200K requests/job)
Database	`v3/src/database/pool.py`, `repository.py`	psycopg3 async connection pool, generic repository
Job Queue	`v3/src/database/job_queue.py`	Async durable job queue for background processing

Pipeline Stage Map (12 Stages)

The pipeline processes resumes through 12 stages orchestrated by Netflix Conductor. Three pipeline modes (A, B, C) are defined as Conductor workflow definitions that control which stages execute. Each stage is a Conductor task polled and executed by the intelletto-worker Cloud Run service. Conductor handles retry logic, rate limiting, timeouts, and decision branching (e.g., dedup stop, gate stop). A HUMAN task type enables recruiter review gates for borderline candidates.

#	Stage Key	Human Label	Code Location
01	`RESUME_REGISTERED`	Registered	`stages/registration.py`
02	`DEDUP_CHECK`	Duplicate Check	`stages/dedup.py`
03	`OCR_LAYOUT`	Reading Document Layout	`stages/ocr_layout.py`
04	`DATA_CLEANING`	Cleaning Extracted Text	`stages/data_cleaning.py`
05	`STRUCTURED_EXTRACTION`	Extracting Structured Resume	`stages/extraction.py`
06	`NORMALIZATION`	Skill Normalization	`stages/normalization.py`
07	`DATA_FUSION_ENRICHMENT`	Data Enrichment	`stages/enrichment.py`
08	`VALIDATION_GATES`	Quality Gates	`stages/validation.py`
09	`SCORING_INPUT_BUILD`	Score Preparation	`stages/scoring_inputs.py`
10	`GCS_ARCHIVE_MOVE`	Move Source to Processed	`stages/gcs_archive.py`
11	`SCORECARD_GENERATION`	Score Against JD	`stages/scorecard.py`
12	`RECRUITER_ARTIFACTS`	Final Output	`stages/artifacts.py`

Pipeline Modes

Mode	Name	Stages Executed	Use Case
A	Pool Build	01-06, 08-09	Parse and normalize resumes into the RESUME-POOL without scoring. Terminal state: `POOL_READY`.
B	Pool Activation	07, 10-12	Activate pooled resumes against a JD: enrich, archive, score, build artifacts.
C	Direct JD	01-12	Full end-to-end: resume arrives with a JD assignment, all 12 stages run sequentially.
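
The mode table can be expressed as a small lookup. The sketch below is illustrative only: the names `MODE_STAGE_NUMBERS` and `stages_for_mode` are hypothetical, and in production the stage selection lives in the Conductor workflow definitions.

```python
from typing import Dict, List

# Stage keys in pipeline order, copied from the stage map above.
STAGES: List[str] = [
    "RESUME_REGISTERED", "DEDUP_CHECK", "OCR_LAYOUT", "DATA_CLEANING",
    "STRUCTURED_EXTRACTION", "NORMALIZATION", "DATA_FUSION_ENRICHMENT",
    "VALIDATION_GATES", "SCORING_INPUT_BUILD", "GCS_ARCHIVE_MOVE",
    "SCORECARD_GENERATION", "RECRUITER_ARTIFACTS",
]

# 1-based stage numbers per mode, as listed in the Pipeline Modes table.
MODE_STAGE_NUMBERS: Dict[str, List[int]] = {
    "A": [1, 2, 3, 4, 5, 6, 8, 9],   # Pool Build
    "B": [7, 10, 11, 12],            # Pool Activation
    "C": list(range(1, 13)),         # Direct JD
}


def stages_for_mode(mode: str) -> List[str]:
    """Return the ordered stage keys that a given pipeline mode executes."""
    try:
        numbers = MODE_STAGE_NUMBERS[mode]
    except KeyError:
        raise ValueError(f"unknown pipeline mode: {mode!r}") from None
    return [STAGES[n - 1] for n in numbers]
```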

RESUME-POOL GCS Structure

Resumes are organized in GCS under RESUME-POOL/ with JD code directories for assignment:

gs://intelletto-ai-resume-parse-404886655151/
  RESUME-POOL/
    {JD_CODE}/              # e.g., 59A356/
      CV - Name - Title.pdf
      CV - Name - Title.pdf
    unassigned/
      resume.pdf

JD code is detected from the GCS directory path during import. The resume_document_job_map table links documents to job requisitions for scoring.
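
A minimal sketch of path-based JD detection follows. The function name `detect_jd_code` and the `RESERVED_DIRS` set are assumptions. `unassigned` comes from the layout above, and `processed` is treated as reserved because the archive stage writes there.

```python
from typing import Optional

POOL_PREFIX = "RESUME-POOL/"
# Directories under RESUME-POOL/ that are not JD codes (assumption).
RESERVED_DIRS = {"unassigned", "processed"}


def detect_jd_code(object_name: str) -> Optional[str]:
    """Return the JD code directory for a GCS object name, or None if unassigned."""
    if not object_name.startswith(POOL_PREFIX):
        return None
    parts = object_name[len(POOL_PREFIX):].split("/")
    # Expect at least "<directory>/<file>"; a bare file has no JD directory.
    if len(parts) < 2 or not parts[0] or not parts[-1]:
        return None
    directory = parts[0]
    return None if directory in RESERVED_DIRS else directory
```

For example, `detect_jd_code("RESUME-POOL/59A356/CV - Name - Title.pdf")` yields `"59A356"`, while anything under `unassigned/` yields `None`.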

Contract Principles (Non-Negotiable)

These principles are enforced as hard contracts across parsing, normalization, and scoring. Each bullet links to the implementation clause(s) and the downstream acceptance gates.

IIR schema is definitive and discoverable
Evidence and page_map protocols are formalized
Deterministic LLM invocation is enforceable
Metering dictionary is pinned down
Normalization/taxonomy ownership is unambiguous
Scoring handoff contract is explicit

IIR schema section

Contract: The Internal Intermediate Representation (IIR) schema is the single source of truth for extraction output. It is versioned, discoverable, and must validate (schema-first; no best-effort).

Implemented in: Stage 05 (STRUCTURED_EXTRACTION) uses direct PDF upload to Gemini 2.5 Flash with response_schema parameter. The schema constrains output at API decoding level, guaranteeing skills always return as {technical: string[], soft: string[]}.

Enforced by: Validation gates (Gate B schema validation as a hard pass/fail gate). Gate B now uses _partition_validation_issues() to separate blocking errors from advisory warnings. Blocking errors stop the pipeline.
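
The blocking/advisory split can be pictured as follows. This is not the body of `_partition_validation_issues()`. It is a hypothetical stand-in that assumes each issue already carries a `blocking` flag, and the names `ValidationIssue`, `partition_issues`, and `gate_b_passed` are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ValidationIssue:
    path: str        # location of the offending field in extraction_json
    message: str
    blocking: bool   # assumption: severity is decided when the issue is raised


def partition_issues(
    issues: List[ValidationIssue],
) -> Tuple[List[ValidationIssue], List[ValidationIssue]]:
    """Split issues into (blocking errors, advisory warnings)."""
    blocking = [i for i in issues if i.blocking]
    advisory = [i for i in issues if not i.blocking]
    return blocking, advisory


def gate_b_passed(issues: List[ValidationIssue]) -> bool:
    """Gate B passes only when no blocking error remains; warnings do not block."""
    blocking, _advisory = partition_issues(issues)
    return not blocking
```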

Evidence/page_map section

Contract: Every extracted fact must be traceable to immutable source evidence, and every pipeline run must persist a lossless page_map for round-trip regeneration.

Implemented in: Stage 03 (OCR_LAYOUT) via Document AI + Stage 05 evidence span protocol.

Enforced by: Validation gates (Gate A lossless coverage and Gate C evidence integrity). Gate C now uses _validate_extraction_evidence_contract() with a 20% per-section fact density floor.
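
One plausible reading of the 20% density floor is sketched below. It assumes density means "facts with at least one evidence span divided by all facts in the section" and that an empty section cannot fail. Both are assumptions, as are the function names and the `evidence` key. The real check is `_validate_extraction_evidence_contract()`.

```python
from typing import Dict, Mapping, Sequence

EVIDENCE_DENSITY_FLOOR = 0.20  # per-section floor from the contract above


def evidence_density(facts: Sequence[Mapping]) -> float:
    """Share of facts in one section that carry at least one evidence span."""
    if not facts:
        return 1.0  # assumption: an empty section cannot violate the floor
    backed = sum(1 for fact in facts if fact.get("evidence"))
    return backed / len(facts)


def sections_below_floor(
    sections: Mapping[str, Sequence[Mapping]],
    floor: float = EVIDENCE_DENSITY_FLOOR,
) -> Dict[str, float]:
    """Return {section name: density} for every section under the floor."""
    failing: Dict[str, float] = {}
    for name, facts in sections.items():
        density = evidence_density(facts)
        if density < floor:
            failing[name] = density
    return failing
```

Under this reading, Gate C would pass when `sections_below_floor(...)` returns an empty dict.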

Deterministic invocation policy

Contract: LLM invocation is deterministic and enforceable: model/version pinning, explicit tool parameters, bounded retries, and idempotent step execution (no hidden variability).

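Implemented in: Gemini 2.5 Flash with temperature=0.1 and response_schema enforcement, plus extraction cached by checksum_sha256 scoped per tenant_id for cost efficiency. A low temperature narrows sampling variation but does not remove it, so repeatability rests on the cache: the same inputs return the same cached outputs.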

Enforced by: Idempotency records (intelletto.idempotency_record) and replay checks within pipeline orchestrator.
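
The tenant-scoped cache can be sketched as below. The key format, the in-memory dict, and the names `extraction_cache_key` and `get_or_extract` are assumptions for illustration. The source only states that extraction is cached by checksum_sha256 scoped per tenant_id.

```python
import hashlib
from typing import Callable, Dict


def checksum_sha256(pdf_bytes: bytes) -> str:
    """Hex SHA-256 of the raw file bytes."""
    return hashlib.sha256(pdf_bytes).hexdigest()


def extraction_cache_key(tenant_id: str, pdf_bytes: bytes) -> str:
    # tenant_id is part of the key so cached results never cross tenants.
    return f"{tenant_id}:{checksum_sha256(pdf_bytes)}"


def get_or_extract(
    cache: Dict[str, dict],
    tenant_id: str,
    pdf_bytes: bytes,
    extract: Callable[[bytes], dict],
) -> dict:
    """Return the cached extraction, calling `extract` only on a cache miss."""
    key = extraction_cache_key(tenant_id, pdf_bytes)
    if key not in cache:
        cache[key] = extract(pdf_bytes)  # the only place the model is invoked
    return cache[key]
```

A retry of the same document for the same tenant then returns the stored result without a second model call.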

Metering dictionary table

Contract: Metering is deterministic and auditable per step. The metering dictionary is pinned, versioned, and used to compute cost-per-resume without drift.

Implemented in: intelletto.pipeline_phase_event records per-stage telemetry including stage, outcome, error_code, error_message, started_at, completed_at, and duration. intelletto.scoring_run_metering records per-scorecard token counts, latency breakdowns, and DB write counts.

Normalization/taxonomy ownership statement

Contract: Ownership is unambiguous: the parsing pipeline produces taxonomy-resolved, scoring-ready identifiers (and the taxonomy_snapshot_id it used). The scoring engine must not "re-normalize" raw text; it consumes resolved IDs + confidence.

Implemented in: Stage 06 (NORMALIZATION) resolves skills against a taxonomy of 55,000+ hard skills, 63,000+ aliases, and 146 soft skills. Match types: EXACT, FUZZY, SEMANTIC. Certification-to-skill expansion via intelletto.certification_skill_map table.

Scoring handoff payload and acceptance gates

Contract: The parser hands off a single, explicit scoring payload that includes identifiers, evidence, and the snapshots used (IIR + taxonomy). Scoring begins only after all required gates pass.

Implemented in: Stage 09 (SCORING_INPUT_BUILD) creates an immutable intelletto.scoring_input_snapshot row. Stage 11 (SCORECARD_GENERATION) resolves published scoring config, loads JD + parsed resume + scoring input, and runs the 8-bucket weighted scoring model.

Minimum payload (scoring_input_snapshot columns):

{
  "scoring_input_id": "uuid",
  "pipeline_run_id": "uuid",
  "parsed_resume_id": "uuid",
  "taxonomy_snapshot_id": "uuid",
  "iir_schema_version": "semver",
  "evidence_pack_sha256": "sha256:...",
  "normalized_skills_json": {...},
  "work_history_json": {...},
  "education_summary_json": {...},
  "certifications_json": {...},
  "languages_json": {...},
  "all_gates_passed": true
}

Section 0: Pipeline Stages (12-Stage Production Pipeline)

1) Intent

Define the exact end-to-end resume parsing pipeline stages (and their boundaries) so developers can understand each step, with deterministic cost metering, schema-first extraction, and evidence traceability as non-negotiable invariants.

2) Why it matters (risks mitigated / dependencies unlocked)

This section prevents the two most common failure modes in AI extraction projects:

"Partial extraction" (teams ship a "good enough" model that silently drops content). This is unacceptable because Intelletto requires 100% coverage and resume regeneration.
"Cost drift" (prompt changes, retries, or OCR changes quietly double cost-per-resume). Intelletto requires deterministic metering and enforceable gates.

It also unlocks downstream work: skills normalization, evidence-based scoring inputs, audit trails, and regression testing.

3) Definitions for AI-new developers (minimal, practical)

Structured extraction: sending the PDF directly to Gemini 2.5 Flash with a response_schema parameter that forces the model to output JSON conforming to the declared schema at the API decoding level.
Schema validation gate: a hard pass/fail check (no "best effort"). Gate B uses _partition_validation_issues() to separate blocking errors from advisory warnings.
Evidence: a pointer back to the source document. For Intelletto, evidence is page_index + bbox and/or text_span, always linked to a block_id in the lossless layer.
Hallucination: model output not supported by evidence in the source. Intelletto mitigates via: (1) response_schema enforcement, (2) evidence required for extracted facts, (3) coverage ledger, (4) temperature 0.1 for factual extraction.

4) Requirements

MUST

Process resumes through the 12 stages in order (stage map above), respecting pipeline mode (A, B, or C).
Preserve a lossless page/block capture for every page, even if content is not mapped to structured fields.
Produce outputs that support round-trip regeneration (at minimum HTML with faithful ordering).
Emit per-stage pipeline_phase_event records (stage, outcome, error_code, duration).
Be idempotent and retry-safe: re-running a stage must not duplicate artifacts or costs.

SHOULD

Use direct PDF upload to Gemini rather than OCR-to-text prompting for extraction.
Cache extraction results by checksum_sha256 (tenant-scoped) for cost efficiency.

MAY

Add enrichment sources (e.g., public profile links) only when explicitly configured and always as a distinct stage with its own metering.

5) Implementation steps (the 12 stages)

Stage 01 -- RESUME_REGISTERED

Create resume_document record with status RECEIVED.
Compute checksum_sha256 from PDF bytes.
Record GCS object_uri and file metadata (mime_type, byte_size, page_count).
Create or link candidate_profile (from filename inference or prior identity).
Create parsing_pipeline_run record with status PENDING.

Stage 02 -- DEDUP_CHECK

Compute deterministic fingerprints: file hash (bytes) + normalized-text hash + page-map hash.
Search for duplicates: exact-match (SHA256) and near-match (candidate signal matching: email, phone, LinkedIn URL).
Tenant-scoped queries (dedup is never cross-tenant).
If duplicate: pipeline terminates with status DEDUP_SKIPPED and Gate D records the result.
If new: pipeline continues to Stage 03.
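
A rough sketch of the normalized-text fingerprint and the candidate signal comparison follows. The normalization rules (lowercasing, whitespace collapse, digit-only phones, scheme-stripped LinkedIn URLs) and all function names are assumptions. The production rules may differ.

```python
import hashlib
import re
from typing import Optional, Set


def normalized_text_hash(text: str) -> str:
    """Hash of the text after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def candidate_signals(
    email: Optional[str] = None,
    phone: Optional[str] = None,
    linkedin_url: Optional[str] = None,
) -> Set[str]:
    """Build a set of normalized identity signals for near-match comparison."""
    signals: Set[str] = set()
    if email:
        signals.add("email:" + email.strip().lower())
    if phone:
        digits = re.sub(r"\D", "", phone)  # keep digits only
        if digits:
            signals.add("phone:" + digits)
    if linkedin_url:
        url = linkedin_url.strip().lower()
        url = re.sub(r"^https?://(www\.)?", "", url).rstrip("/")
        signals.add("linkedin:" + url)
    return signals


def is_near_match(a: Set[str], b: Set[str]) -> bool:
    """Two resumes are a near-match when they share any signal."""
    return bool(a & b)
```

Both lookups would be run inside a single tenant's data only, per the tenant-scoping rule above.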

Stage 03 -- OCR_LAYOUT

Call Google Document AI Enterprise Document OCR to produce page-level text + layout coordinates.
Build normalized page_map with stable reading order, bbox geometry, and per-block hashes.
Persist coverage_ledger recording page_count_total, page_count_processed.
Gate A (GATE_A_LOSSLESS) validates: observed_pages == expected_pages, every page has blocks.
Graceful degradation when Document AI is unavailable: extraction proceeds with direct PDF to Gemini (OCR evidence is advisory, not blocking).
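
The two Gate A conditions can be checked as in the sketch below. The page_map shape (a list of pages, each with `page_index` and `blocks`) and the function name are assumptions. The degraded direct-PDF path is not modeled here.

```python
from typing import List, Mapping, Tuple


def gate_a_lossless(expected_pages: int, page_map: List[Mapping]) -> Tuple[bool, List[str]]:
    """Check observed_pages == expected_pages and that every page has blocks.

    Returns (passed, failure reasons).
    """
    reasons: List[str] = []
    observed_pages = len(page_map)
    if observed_pages != expected_pages:
        reasons.append(
            f"page count mismatch: observed {observed_pages}, expected {expected_pages}"
        )
    for page in page_map:
        if not page.get("blocks"):
            reasons.append(f"page {page.get('page_index')} has no blocks")
    return (not reasons, reasons)
```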

Stage 04 -- DATA_CLEANING

Normalize whitespace, bullet symbols, and hyphenation (without changing meaning).
Preserve original text in lossless layer; cleaning produces a "clean view," not replacements.
Identify repeated headers/footers and tag them.
Pass-through when OCR was skipped (direct PDF mode).
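
A minimal sketch of a "clean view" builder is shown below. The bullet character set and the hyphenation rule are assumptions, and `clean_view` is a hypothetical name. The function returns a new string and leaves the raw text untouched, which matches the "not replacements" rule above.

```python
import re

# Bullet glyphs to normalize (assumed set): bullet, black circle, small square,
# white bullet, triangular bullet, middle dot.
_BULLET_RE = re.compile(r"^\s*[\u2022\u25cf\u25aa\u25e6\u2023\u00b7]\s*")
# A word split across a line break with a trailing hyphen.
_HYPHEN_BREAK_RE = re.compile(r"(\w)-\n(\w)")


def clean_view(raw: str) -> str:
    """Return a cleaned copy of `raw`; the original string is never modified."""
    # Rejoin words hyphenated across line breaks. Caveat: this also joins
    # genuinely hyphenated compounds that happen to break at the hyphen.
    text = _HYPHEN_BREAK_RE.sub(r"\1\2", raw)
    cleaned_lines = []
    for line in text.splitlines():
        line = _BULLET_RE.sub("- ", line)             # unify bullet symbols
        line = re.sub(r"[ \t]+", " ", line).strip()   # collapse runs of spaces/tabs
        cleaned_lines.append(line)
    return "\n".join(cleaned_lines)
```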

Stage 05 -- STRUCTURED_EXTRACTION

Direct PDF upload to Gemini 2.5 Flash with response_schema parameter.
The schema constrains output at the API decoding level -- skills always return as {technical: string[], soft: string[]}.
Temperature: 0.1 for factual extraction.
Extraction cached by checksum_sha256 scoped per tenant_id for cost efficiency.
Output: parsed_resume.extraction_json (JSONB) with person, work history, education, skills, certifications, languages, URLs.
Gate B (GATE_B_SCHEMA) validates extraction schema. Uses _partition_validation_issues() to separate blocking from advisory. Blocking errors stop pipeline. Status: FIXED (was P0-1: always passed true).
Gate C (GATE_C_EVIDENCE) validates evidence spans. Uses _validate_extraction_evidence_contract() with per-section 20% density floor. Status: FIXED (was P0-2: only checked list presence).

Stage 06 -- NORMALIZATION

Normalize raw skill terms against canonical taxonomy: 55,000+ hard skills, 63,000+ aliases, 146 soft skills.
Match types: EXACT, FUZZY, SEMANTIC.
Certification-to-skill expansion via intelletto.certification_skill_map table.
Track all normalization decisions with confidence + match_type in intelletto.normalization_result.
Do not overwrite raw terms; store canonical mappings separately.
Quality alerts written to intelletto.normalization_quality_alert.
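
The EXACT and FUZZY tiers can be illustrated with an in-memory alias index. This is a simplified sketch: production uses database lookups against the taxonomy tables, SEMANTIC matching is omitted, and the 0.85 cutoff, the `difflib` similarity measure, and all names here are assumptions.

```python
import difflib
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class NormalizationResult:
    raw_term: str                      # kept as-is; never overwritten
    canonical_skill_id: Optional[str]
    match_type: Optional[str]          # "EXACT", "FUZZY", or None when unresolved
    confidence: float


def normalize_skill(
    raw_term: str,
    alias_index: Dict[str, str],       # lowercase alias -> canonical skill id
    fuzzy_cutoff: float = 0.85,
) -> NormalizationResult:
    key = raw_term.strip().lower()
    if key in alias_index:
        return NormalizationResult(raw_term, alias_index[key], "EXACT", 1.0)
    # Fall back to the closest alias above the similarity cutoff.
    close = difflib.get_close_matches(key, list(alias_index), n=1, cutoff=fuzzy_cutoff)
    if close:
        ratio = difflib.SequenceMatcher(None, key, close[0]).ratio()
        return NormalizationResult(raw_term, alias_index[close[0]], "FUZZY", round(ratio, 3))
    return NormalizationResult(raw_term, None, None, 0.0)
```

Because the raw term travels with the result, canonical mappings stay separate from the extracted text.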

Stage 07 -- DATA_FUSION_ENRICHMENT

Fetch and parse URLs found in the resume: LinkedIn, GitHub, portfolio, personal websites.
SSRF protection: DNS resolve + private-IP block, scheme/port allowlist, redirect validation (32 tests).
Enrichment evidence logged with source, URL, HTTP status, character count to intelletto.enrichment_audit.
When a candidate has websites, a fetch and parse MUST be attempted for each. Not attempting is a failure.
Enrichment results stored in intelletto.enrichment_payload.
If enrichment fails: skip and proceed. Do not block scoring inputs.
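
The SSRF checks listed above (scheme/port allowlist, DNS resolve, private-IP block) can be sketched with the standard library. The allowlist values and the names `is_url_allowed` and `resolve_host` are assumptions. The resolver is injectable so the check can be tested without network access.

```python
import ipaddress
import socket
from typing import Callable, Iterable, List
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}   # assumed allowlist
ALLOWED_PORTS = {80, 443}             # assumed allowlist


def resolve_host(host: str) -> List[str]:
    """Resolve a hostname to all of its IP addresses."""
    return [info[4][0] for info in socket.getaddrinfo(host, None)]


def is_url_allowed(url: str, resolver: Callable[[str], Iterable[str]] = resolve_host) -> bool:
    """Reject URLs with a disallowed scheme/port or any non-public resolved IP."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES or not parts.hostname:
        return False
    try:
        port = parts.port or (443 if parts.scheme == "https" else 80)
    except ValueError:                 # malformed port in the URL
        return False
    if port not in ALLOWED_PORTS:
        return False
    try:
        addresses = list(resolver(parts.hostname))
    except OSError:                    # DNS failure
        return False
    if not addresses:
        return False
    try:
        # Every resolved address must be globally routable.
        return all(ipaddress.ip_address(a).is_global for a in addresses)
    except ValueError:
        return False
```

For redirect validation, the same check would be applied to each redirect target before following it.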

Stage 08 -- VALIDATION_GATES

Five gates control pipeline progression. Each gate has a passed boolean written to intelletto.pipeline_gate_result.

Gate	Code	What it validates	Status
Gate A	`GATE_A_LOSSLESS`	OCR coverage -- page map completeness vs source page count	Working correctly
Gate B	`GATE_B_SCHEMA`	Extraction schema validity against IIR schema. Now uses `_partition_validation_issues()` -- blocking errors stop pipeline.	FIXED (was P0-1)
Gate C	`GATE_C_EVIDENCE`	Evidence span resolution rate per structured field. Now uses `_validate_extraction_evidence_contract()` with 20% per-section density floor.	FIXED (was P0-2)
Gate D	`GATE_D_DEDUP`	Deduplication result -- stops run as `DEDUP_SKIPPED` if duplicate	Working correctly
Gate E	`GATE_E_ARTIFACTS`	Meta-gate: URL quality, bbox quality, normalization, work history	Working correctly

Stage 09 -- SCORING_INPUT_BUILD

Create immutable intelletto.scoring_input_snapshot row.
Loads parsed resume, pipeline run, page map, canonical extraction, normalization results, gate status.
Hard-blocks if required gates are missing or failed.
Snapshot includes: normalized_skills_json, work_history_json, education_summary_json, certifications_json, languages_json.
In Mode A (Pool Build): pipeline terminates here with status POOL_READY.
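
The hard-block rule can be sketched as a check over persisted gate results. Which gates are required for a given mode is not stated here, so the required set is passed in by the caller. The function name and the dict shape are assumptions.

```python
from typing import Dict, Iterable, List


def blocking_gate_problems(required: Iterable[str], results: Dict[str, bool]) -> List[str]:
    """List every required gate that is missing or failed.

    `results` maps gate_code -> passed, as read from pipeline_gate_result.
    """
    problems: List[str] = []
    for code in required:
        if code not in results:
            problems.append(f"{code}: missing")
        elif not results[code]:
            problems.append(f"{code}: failed")
    return problems


def all_gates_passed(required: Iterable[str], results: Dict[str, bool]) -> bool:
    """True only when every required gate is present and passed."""
    return not blocking_gate_problems(required, results)
```

Stage 09 would refuse to write the snapshot whenever the problem list is non-empty.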

Stage 10 -- GCS_ARCHIVE_MOVE

Move source PDF from intake prefix to RESUME-POOL/processed/ in GCS.
Update resume_document.object_uri with new location.

Stage 11 -- SCORECARD_GENERATION

Resolve published scoring_config_version for the JD.
Load JD skill requirements via 3-level fallback: jd_skill_requirement table, extracted_features_json, job_description_version.model_json.
Execute 8-bucket weighted scoring model:
- Hard Skills (skill matching with exact/alias/fuzzy tiers)
- Domain/Process (title relevance + skill overlap + industry continuity)
- Scope/Complexity (seniority patterns)
- Tenure/Recency (years + recency + stability)
- Soft/Behavioral (12 soft signals, leadership weighted 1.2x)
- Languages/Communications (English + additional)
- Compliance/Scheduling (completeness + confidence + gate health)
- Education/Certifications (tenure decay model: e^(-0.18 * max(0, years - 3)))
Gate evaluation: N_OF_M_SKILLS, MIN_VALUE, MAX_VALUE, BOOLEAN_TRUE, TIMEZONE_OVERLAP.
Modifiers: data fusion confidence + evidence coverage of JD duties.
Persist: scorecard_version, bucket_score, modifier_result, scoring_gate_result, scoring_run_metering.
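
The decay expression quoted for the Education/Certifications bucket, and a weighted roll-up across buckets, can be sketched as below. The decay formula is taken from the text. Everything else is an assumption: what `years` measures, that a bucket's contribution is score times weight, and the function names.

```python
import math
from typing import Mapping, Tuple


def tenure_decay(years: float) -> float:
    """e^(-0.18 * max(0, years - 3)): no decay for the first 3 years."""
    return math.exp(-0.18 * max(0.0, years - 3.0))


def weighted_total(buckets: Mapping[str, Tuple[float, float]]) -> float:
    """Sum of score * weight over buckets, given {bucket: (score, weight)}.

    Assumption: this mirrors the contribution column of bucket_score.
    """
    return sum(score * weight for score, weight in buckets.values())
```

With this formula the factor stays at 1.0 through year 3 and falls to roughly 0.70 at year 5, since e^(-0.36) is about 0.698.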

Stage 12 -- RECRUITER_ARTIFACTS

Build Intelletto Resume snapshot (HTML/PDF/JSON) from extraction data.
Persist to intelletto.intelletto_resume_snapshot.
DB fallback for extraction data when orchestrator context is unavailable.
Pipeline run status set to COMPLETED (Mode B/C).
Mode A never reaches this stage: it ends at Stage 09 with terminal state POOL_READY (not COMPLETED).

6) Artifacts produced/consumed (what the pipeline produces)

Core canonical artifacts (must exist for every resume)

intelletto.resume_document -- file metadata, GCS URI, checksum
intelletto.parsing_pipeline_run -- run lifecycle, status, stages
intelletto.pipeline_phase_event -- per-stage telemetry
intelletto.parsed_resume -- extraction_json (JSONB)
intelletto.normalization_result -- per-skill canonical mappings
intelletto.pipeline_gate_result -- gate pass/fail per run
intelletto.scoring_input_snapshot -- scoring-ready normalized data
intelletto.coverage_ledger -- page/block coverage proof

7) Validation gates (5-gate framework)

Defined in Stage 08 above. Gates A through E with their current implementation status.

8) Failure modes + recovery paths

Schema invalid JSON (Gate B blocks)
- Recovery: run Repair Prompt once; if still invalid, re-extract with stricter constraints and reduced temperature.
Coverage ledger fails (missing blocks/pages)
- Recovery: re-run Stage 03 (OCR_LAYOUT) with fallback extractor; do not proceed to Stage 05.
Evidence density below threshold (Gate C blocks)
- Recovery: re-extract with "evidence required" enforcement or move content to unmapped_content[].
Cost spike (unexpected retries / OCR invoked)
- Recovery: fail the run with COST_GATE_EXCEEDED status unless cost_gate_override flag is set.
Dedup match (Gate D)
- Recovery: pipeline terminates with DEDUP_SKIPPED. Not a failure -- expected behavior for duplicate resumes.
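
The cost-spike recovery path above can be sketched as a per-run tracker. `cost_ceiling_usd`, `cost_gate_override`, and the `COST_GATE_EXCEEDED` status come from this document. The class, the exception, and the way stage costs are recorded are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict


class CostGateExceeded(Exception):
    """Raised so the orchestrator can mark the run COST_GATE_EXCEEDED."""


@dataclass
class RunCostTracker:
    cost_ceiling_usd: float
    cost_gate_override: bool = False
    spent_by_stage: Dict[str, float] = field(default_factory=dict)

    @property
    def total_usd(self) -> float:
        return sum(self.spent_by_stage.values())

    def record(self, stage: str, cost_usd: float) -> None:
        """Attribute cost to a stage and enforce the ceiling unless overridden."""
        self.spent_by_stage[stage] = self.spent_by_stage.get(stage, 0.0) + cost_usd
        if self.total_usd > self.cost_ceiling_usd and not self.cost_gate_override:
            raise CostGateExceeded(
                f"run cost {self.total_usd:.4f} USD exceeds ceiling "
                f"{self.cost_ceiling_usd:.4f} USD at stage {stage}"
            )
```

Keeping spend per stage preserves the stage-level attribution needed for metering while the ceiling applies to the run total.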

9) Acceptance tests (how to prove it works)

For any resume, the pipeline MUST prove:

All 12 stages complete (or subset per pipeline mode) with telemetry in pipeline_phase_event.
All pages present in page_map (N/N).
Structured extraction includes major constructs present in the source with evidence links.
Skills normalized with canonical IDs and confidence scores.
Scoring input snapshot created with all gates passed.
Scorecard generated (Mode B/C) with per-bucket scores, evidence, and rubric metadata.

Production verification: 218 documents processed end-to-end, 218/218 scored, 0 failures (April 2026).

Section 1: Architectural Principles

1) Intent

Define the non-negotiable architectural principles that govern every component of the Intelletto resume parsing pipeline -- so developers can implement and extend the system without breaking: (a) schema-first correctness, (b) 100% extraction + regeneration, (c) auditability, and (d) cost-per-resume control.

2) Why it matters (risks mitigated / dependencies unlocked)

These principles prevent the systemic failures that derail AI pipelines:

Silent loss of content (partial extraction): breaks the "regenerate without source" requirement.
Unexplainable outputs (no evidence chain): breaks trust and auditability, and blocks downstream scoring explainability.
Non-deterministic cost (unbounded retries, uncontrolled OCR/enrichment): kills unit economics at scale.
Data drift (schema and prompt changes degrade quality): breaks regression confidence.

3) Definitions for AI-new developers (if AI concepts appear)

Schema-first: the response_schema parameter is the contract; Gemini's API decoding layer enforces it. Prompts and code must obey it.
Deterministic pipeline: same input + same versions = same outputs, barring upstream extractor changes (tracked via checksums).
Repair vs re-extract:
- Repair fixes JSON validity/formatting with minimal deviation (no new facts).
- Re-extract re-runs the model on source content (higher cost; tightly controlled).
Provenance: where a datum came from (source page/block + extraction run id + model id + prompt version).

4) Requirements

MUST (architectural invariants)

MUST-AP-01: Schema governs everything

The pipeline MUST use response_schema with Gemini 2.5 Flash to constrain extraction output at the API decoding level.
A run MUST fail if Gate B detects blocking schema violations -- no partial "acceptance."

MUST-AP-02: Lossless first, structured second

The pipeline MUST persist a lossless page/block layer for every page.
Structured fields MUST be derived from lossless content; they MUST NOT replace it.

MUST-AP-03: 100% extraction coverage

Every page exists in lossless layer.
Every page has blocks with evidence geometry (bbox) and stable reading order.
Anything not mapped to structured fields is captured in unmapped_content[] with evidence.

MUST-AP-04: Evidence required for every structured fact

Gate C enforces per-section 20% evidence density floor via _validate_extraction_evidence_contract().

MUST-AP-05: Version everything

Every output MUST include: schema_version, model_id, pipeline_version, and a deterministic run_id.
Hashes for inputs: source_sha256, page_map_sha256, cleaned_text_sha256.

MUST-AP-06: Idempotent, retry-safe by design

Each stage MUST check intelletto.idempotency_record before writing.
Retries MUST not double-count metering units.
Extraction cached by checksum_sha256 + tenant_id.

MUST-AP-07: Cost gates are first-class

Each stage MUST emit pipeline_phase_event records.
The orchestrator MUST enforce cost_ceiling_usd per resume/run unless overridden.

MUST-AP-08: Separation of concerns (domain-driven)

Extraction (Stage 05) MUST NOT do canonicalization/normalization (Stage 06).
Enrichment (Stage 07) MUST be isolated and optional.
Scoring (Stage 11) consumes the scoring_input_snapshot produced by Stage 09.

MUST-AP-09: Tenant isolation

All queries MUST be scoped by tenant_id. No cross-tenant data leakage.
Dedup is tenant-scoped. Extraction cache is tenant-scoped.

SHOULD (strong guidance)

Prefer direct PDF upload to Gemini over OCR-to-text prompting.
Keep temperature low (0.1) and maximize schema constraints.
Normalize with deterministic rules and database lookups (not LLM normalization).

MAY (optional but allowed)

Add multiple extraction passes only if metering and gates remain strict and predictable.

5) Implementation steps

Domain-driven architecture: each pipeline stage is a separate module in v3/src/domains/pipeline/stages/.
Orchestrator pattern: v3/src/domains/pipeline/orchestrator.py drives stage execution. event_orchestrator.py provides event-driven alternative with Pub/Sub support.
Always persist lossless page map before calling Gemini (Stage 03 before Stage 05).
Run structured extraction with response_schema: direct PDF upload, no OCR-to-text prompting.
Normalize after extraction: Stage 06 uses database lookups against 55K+ hard skills taxonomy.
Enforce gates in Stage 08: all 5 gates evaluated, results persisted to pipeline_gate_result.

Section 2: Technology Stack (Gemini 2.5 Flash, Python, FastAPI, Cloud Run, Cloud SQL)

1) Intent

Document the Google-first toolchain used by Intelletto's v3 production pipeline:

Direct PDF extraction via Gemini 2.5 Flash with response_schema enforcement
Lossless evidence capture via Document AI OCR + layout page maps
Deterministic cost metering via pipeline_phase_event and scoring_run_metering
PostgreSQL-first persistence via Cloud SQL (psycopg3 async)
Cloud Run deployment for stateless, auto-scaling execution

2) Why it matters

Prompt drift without regression checks: fixed by response_schema enforcement + golden test corpus (31/36 within 5 pts of v2).
Untraceable outputs: fixed by coupling Gemini extraction with Document AI layout page maps.
Unbounded cost: fixed by extraction caching (checksum-based) and cost_ceiling_usd enforcement.
Schema inconsistency: CRITICAL -- without response_schema, Gemini returns skills in inconsistent formats leading to zero-skill failures.

3) Production Stack

Component	Technology	Details
LLM	Gemini 2.5 Flash	`google-genai` Python SDK. Direct PDF upload with `response_schema`. Temperature 0.1.
OCR	Document AI Enterprise Document OCR	Layout extraction for evidence geometry. Processor ID in env var.
API Framework	FastAPI (Python 3.11+)	Async/await throughout. Uvicorn server.
Database	PostgreSQL 16 on Cloud SQL	Instance: `intelletto-ai`, region: `asia-southeast1`, schema: `intelletto`, ~169 tables.
DB Driver	psycopg3 (async)	Connection pool in `v3/src/database/pool.py`. `prepare_threshold=None`.
Object Storage	Google Cloud Storage	Bucket: `intelletto-ai-resume-parse-404886655151`. RESUME-POOL prefix structure.
Compute	Google Cloud Run	2 vCPU / 2Gi, 2 workers, min 1 / max 10. Region: asia-southeast1.
Batch Processing	Gemini Batch API	`v3/src/integrations/google/batch_gemini.py`. 50% cost reduction, 200K requests/job, bypasses QPM.
Secrets	Google Secret Manager	INTELLETTO_DB_URL, GEMINI_API_KEY, API_KEY, GITHUB_TOKEN, GOOGLE_OAUTH_CLIENT_SECRET

4) Gemini Extraction Architecture

Key design decision: Direct PDF upload to Gemini 2.5 Flash with response_schema parameter. This is NOT OCR-to-text-to-LLM prompting.

The PDF binary is uploaded directly as a Gemini content part.
The response_schema parameter forces Gemini to output JSON conforming to the declared schema at the API decoding level.
This eliminates the zero-skill failure mode, where skills were returned in inconsistent formats.
Skills ALWAYS return as {technical: string[], soft: string[]} -- no variation.
Extraction is cached by checksum_sha256 scoped per tenant_id.

Gemini call pattern (conceptual):

# v3/src/integrations/google/gemini.py
from google import genai

client = genai.Client(api_key=api_key)
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        {"text": system_instruction},
        {"inline_data": {"mime_type": "application/pdf", "data": pdf_bytes}}
    ],
    config={
        "temperature": 0.1,
        "response_schema": extraction_schema,   # enforced at decode
        "response_mime_type": "application/json"
    }
)

5) Database Connection Pattern

# v3/src/database/pool.py
# psycopg3 async pool with prepare_threshold=None (never prepare)
# DEALLOCATE ALL in reset callback
# All handlers use: async with pool.connection() as conn:

Section 3: Data Stores and Canonical Persistence Model

1) Intent

Define the canonical persistence model for Intelletto's resume parsing pipeline, including:

Which artifacts are stored where (GCS vs PostgreSQL)
The lossless-first storage pattern required for 100% extraction + regeneration
The response_schema extraction contract (direct PDF to Gemini)
The Evidence Span Protocol and Page Map Protocol
The Per-Step Metering via pipeline_phase_event

2) PostgreSQL Schema (intelletto.*)

All tables live in the intelletto schema on Cloud SQL. Key tables:

Pipeline Tables

Table	Purpose	Key Columns
`resume_document`	Raw file metadata + status	resume_document_id, tenant_id, candidate_id, object_uri, checksum_sha256, status (RECEIVED/PROCESSED/POOL_READY/DUPLICATE_REVIEW/PROCESS_FAILED)
`parsing_pipeline_run`	Per-run orchestration state	pipeline_run_id, tenant_id, resume_document_id, status (PENDING/RUNNING/COMPLETED/FAILED/COST_GATE_EXCEEDED/DEDUP_SKIPPED/PARTIAL), current_stage, cost_ceiling_usd
`pipeline_phase_event`	Per-stage telemetry	pipeline_run_id, stage, outcome, error_code, error_message, started_at, completed_at
`parsed_resume`	Structured extraction output	parsed_resume_id, tenant_id, candidate_id, extraction_json (JSONB), evidence_spans_json, confidence_summary_json
`candidate_profile`	Extracted identity	candidate_id, tenant_id, full_name, email, phone, linkedin_url, current_title, seniority
`normalization_result`	Skill canonical mappings	raw_term, canonical_skill_id, canonical_skill_name, match_type (EXACT/FUZZY/SEMANTIC), confidence
`pipeline_gate_result`	Gate pass/fail outcomes	pipeline_run_id, gate_code (GATE_A through GATE_E), passed (boolean), failure_reason, failure_detail (JSONB)
`coverage_ledger`	Per-page OCR coverage	pipeline_run_id, page_count_total, page_count_processed, gate_a_lossless_passed
`scoring_input_snapshot`	Immutable scoring handoff	scoring_input_id, pipeline_run_id, parsed_resume_id, normalized_skills_json, work_history_json, all_gates_passed
`enrichment_payload`	Data fusion results	pipeline_run_id, enrichment data JSONB
`enrichment_audit`	Fetch attempt log	source, url, http_status, char_count, ssrf_blocked
`idempotency_record`	Replay and dedup protection	key, created_at

Scoring Tables

Table	Purpose	Key Columns
`scorecard_version`	Scored output versions	scorecard_version_id, scoring_config_version_id, candidate_id, score_total, base_fit, modifier_points, status (SCORED/DISQUALIFIED/PENDING/FAILED)
`bucket_score`	Per-bucket scoring breakdown	scorecard_version_id, bucket_id, score, weight, contribution, evidence_json, rubric_applied_json
`modifier_result`	Scoring modifier outputs	scorecard_version_id, modifier_id, points_awarded, basis_json
`scoring_gate_result`	Per-gate evaluation	scorecard_version_id, gate_code, status (PASS/FAIL/SKIPPED), severity (DISQUALIFY/WARN/INFO)
`scoring_config_version`	Scoring config lifecycle	scoring_config_version_id, config_payload (JSONB), status (DRAFT/PUBLISHED/DEPRECATED)
`scoring_run_metering`	Per-scorecard metering	scorecard_version_id, gemini_calls, tokens_in/out, latency breakdowns, db_writes

Taxonomy Tables

Table	Purpose	Scale
`skill_hard`	Canonical hard skills	55,000+ skills with category, is_hot_tech, is_emerging
`skill_soft`	Canonical soft skills	146 skills with source_framework
`skill_alias`	Alias-to-skill mappings	63,000+ aliases with confidence
`vendor_certification`	Certification taxonomy	cert_name, cert_type (CERTIFICATION/SPECIALIZATION), vendor
`certification_skill_map`	Cert-to-skill expansion	cert_name to skill_name with skill_dimension (HARD/SOFT), confidence_boost

3) Google Cloud Storage Layout

gs://intelletto-ai-resume-parse-404886655151/
  RESUME-POOL/
    {JD_CODE}/                    # Resumes assigned to a JD
      CV - Name - Title.pdf
    unassigned/                   # Resumes without JD assignment
      resume.pdf
    processed/                    # Archived after pipeline completion
      {document_id}.pdf

Immutability rule: Objects are write-once. Archive moves are implemented as a GCS copy followed by a delete, after which the new URI is written to resume_document.object_uri.
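The layout above implies that a resume's JD assignment can be read from its object path. The following is a minimal sketch of that mapping, assuming object names follow the tree exactly as drawn. The function name `jd_code_from_object_name` and the constants are hypothetical and are not taken from the codebase.

```python
from typing import Optional

POOL_PREFIX = "RESUME-POOL/"
# Directories in the documented layout that are not JD codes.
RESERVED_DIRS = {"unassigned", "processed"}


def jd_code_from_object_name(object_name: str) -> Optional[str]:
    """Return the JD code directory for a pool object, or None if unassigned or archived."""
    if not object_name.startswith(POOL_PREFIX):
        raise ValueError(f"not under {POOL_PREFIX}: {object_name!r}")
    parts = object_name[len(POOL_PREFIX):].split("/")
    # Expect exactly "<directory>/<file>"; anything else is outside the documented layout.
    if len(parts) != 2 or not parts[0] or not parts[1]:
        raise ValueError(f"unexpected layout: {object_name!r}")
    directory = parts[0]
    return None if directory in RESERVED_DIRS else directory
```

A path such as `RESUME-POOL/JD-0042/CV - Name - Title.pdf` would yield `JD-0042` under these assumptions, while anything under `unassigned/` or `processed/` yields `None`.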

Section 4: End-to-End Workflow

1) Intent

Define the end-to-end execution workflow for the v3 pipeline, from ingest through scored output. This section specifies the orchestration, stage dependencies, and mode-specific behavior.

2) Orchestration

Two orchestrator implementations exist:

Synchronous orchestrator (v3/src/domains/pipeline/orchestrator.py): drives stages sequentially within a single request. Used for Mode C (Direct JD) and Mode A (Pool Build).
Event-driven orchestrator (v3/src/domains/pipeline/event_orchestrator.py): supports Pub/Sub-driven stage execution with MODE_STAGES filtering and retry semantics. PUBSUB_ENABLED is currently false: the event orchestrator has feature parity with the synchronous one but has not yet been live-tested with Pub/Sub.

3) Mode A -- Pool Build (Stages 01-06, 08-09)

Stage 01 RESUME_REGISTERED: Create resume_document (RECEIVED), candidate_profile, pipeline_run (PENDING).
Stage 02 DEDUP_CHECK: Tenant-scoped SHA256 + signal matching. If duplicate, terminate with DEDUP_SKIPPED.
Stage 03 OCR_LAYOUT: Document AI OCR for layout evidence. Gate A validates page coverage.
Stage 04 DATA_CLEANING: Normalize whitespace, tag headers/footers. Pass-through if OCR skipped.
Stage 05 STRUCTURED_EXTRACTION: Direct PDF to Gemini 2.5 Flash with response_schema. Cache by checksum + tenant. Gates B + C validated.
Stage 06 NORMALIZATION: 55K+ hard skills, 63K+ aliases, 146 soft skills. EXACT/FUZZY/SEMANTIC matching.
Stage 08 VALIDATION_GATES: All 5 gates evaluated, results persisted.
Stage 09 SCORING_INPUT_BUILD: Create immutable scoring_input_snapshot. Pipeline terminates with POOL_READY.
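The EXACT and FUZZY match types used in Stage 06 can be illustrated with a small sketch. The taxonomy below is a hypothetical three-entry stand-in for the skill_hard and skill_alias tables, treating an alias hit as EXACT is an assumption, and `difflib` is used only as an illustrative similarity measure. SEMANTIC matching is omitted.

```python
import difflib
from typing import Optional

# Hypothetical miniature taxonomy; production uses the skill_hard and skill_alias tables.
CANONICAL = {"react": "React", "kubernetes": "Kubernetes", "postgresql": "PostgreSQL"}
ALIASES = {"reactjs": "React", "postgres": "PostgreSQL"}


def normalize_skill(raw_term: str, cutoff: float = 0.85) -> Optional[dict]:
    """Map a raw term to a canonical skill, or return None when nothing matches."""
    key = raw_term.strip().lower()
    # EXACT: the term is a canonical name or a known alias (assumption).
    name = CANONICAL.get(key) or ALIASES.get(key)
    if name is not None:
        return {"raw_term": raw_term, "canonical_skill_name": name,
                "match_type": "EXACT", "confidence": 1.0}
    # FUZZY: closest canonical name above the similarity cutoff.
    candidates = difflib.get_close_matches(key, list(CANONICAL), n=1, cutoff=cutoff)
    if candidates:
        ratio = difflib.SequenceMatcher(None, key, candidates[0]).ratio()
        return {"raw_term": raw_term, "canonical_skill_name": CANONICAL[candidates[0]],
                "match_type": "FUZZY", "confidence": round(ratio, 2)}
    return None
```

In this sketch a misspelling such as "Kubernets" resolves to Kubernetes as a FUZZY match, and an unrelated term returns `None` so that it can be counted as unmatched.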

4) Mode B -- Pool Activation (Stages 07, 10-12)

Stage 07 DATA_FUSION_ENRICHMENT: Fetch LinkedIn, GitHub, portfolio URLs. SSRF-protected.
Stage 10 GCS_ARCHIVE_MOVE: Move PDF to processed/ prefix.
Stage 11 SCORECARD_GENERATION: 8-bucket weighted scoring against assigned JD.
Stage 12 RECRUITER_ARTIFACTS: Build Intelletto Resume (HTML/PDF/JSON). Status becomes COMPLETED.

5) Mode C -- Direct JD (Stages 01-12)

All 12 stages execute sequentially. Resume arrives with a JD assignment (via GCS directory path or explicit assignment). No intermediate POOL_READY state.

6) Stage Dependencies and Data Flow

01 RESUME_REGISTERED
  |-> resume_document, candidate_profile, pipeline_run
02 DEDUP_CHECK
  |-> dedup_cluster (if match -> DEDUP_SKIPPED, stop)
03 OCR_LAYOUT
  |-> coverage_ledger, page_map (Gate A)
04 DATA_CLEANING
  |-> cleaned page_map
05 STRUCTURED_EXTRACTION
  |-> parsed_resume.extraction_json (Gates B, C)
06 NORMALIZATION
  |-> normalization_result rows
08 VALIDATION_GATES
  |-> pipeline_gate_result rows (all 5 gates)
09 SCORING_INPUT_BUILD
  |-> scoring_input_snapshot (Mode A stops here: POOL_READY)
07 DATA_FUSION_ENRICHMENT
  |-> enrichment_payload, enrichment_audit
10 GCS_ARCHIVE_MOVE
  |-> updated object_uri
11 SCORECARD_GENERATION
  |-> scorecard_version, bucket_score, modifier_result, scoring_gate_result
12 RECRUITER_ARTIFACTS
  |-> intelletto_resume_snapshot (COMPLETED)
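The mode-specific stage subsets can be expressed as a filter over a single execution order. This is a sketch only: the stage numbers and terminal statuses come from the sections above, but the dictionary layout, the function name, and the assumption that Mode C follows the data-flow order (with stage 07 after stage 09) are illustrative and may differ from the real MODE_STAGES.

```python
# Execution order taken from the data-flow listing above; stage 07 runs after stage 09.
PIPELINE_ORDER = ["01", "02", "03", "04", "05", "06", "08", "09", "07", "10", "11", "12"]

# Hypothetical shape for the mode filter.
MODE_STAGES = {
    "A": {"01", "02", "03", "04", "05", "06", "08", "09"},  # Pool Build
    "B": {"07", "10", "11", "12"},                          # Pool Activation
    "C": set(PIPELINE_ORDER),                               # Direct JD
}

# Status a run ends in when every stage of the mode succeeds.
TERMINAL_STATUS = {"A": "POOL_READY", "B": "COMPLETED", "C": "COMPLETED"}


def stages_for_mode(mode: str) -> list[str]:
    """Return the stages a mode executes, in pipeline order."""
    allowed = MODE_STAGES[mode]
    return [stage for stage in PIPELINE_ORDER if stage in allowed]
```

Keeping one ordered list and filtering it per mode means that Mode A followed by Mode B covers exactly the stages of Mode C.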

Section 5: API Contracts

1) Intent

Define the API contracts for the v3 resume parsing pipeline. The v3 system uses domain-driven route organization.

2) v3 API Routes

Pipeline Endpoints

Method	Path	Purpose
POST	`/api/v3/pipeline/process`	Trigger pipeline processing for a document
POST	`/api/v3/pipeline/rescore/{document_id}`	Re-score a document (append-only, new scorecard_version)
POST	`/api/v3/pipeline/rescore/bulk`	Bulk re-score multiple documents
GET	`/api/v3/pipeline/latency-report`	Pipeline latency SLA dashboard data
GET	`/api/v3/health`	Health check (DB connectivity, service status)

Intake Endpoints

Method	Path	Purpose
POST	`/api/v1/intake/documents/upload`	Browser upload
POST	`/api/v1/intake/documents/import_gcs_prefix`	Bulk GCS import from RESUME-POOL
GET	`/api/v1/intake/aggregator/documents`	List received resumes
GET	`/api/v1/intake/intelletto_resumes`	List parsed resumes
POST	`/api/v1/intake/aggregator/ai_jd_assignment`	AI-powered JD assignment (detects JD code from filename/GCS path)

Scoring Endpoints

Method	Path	Purpose
POST	`/api/v1/scoring/scorecards`	Generate scorecard for a resume against a JD
GET	`/api/v3/scoring/interview-brief/{id}`	Evidence-derived interview probes
GET	`/api/v3/scoring/comp-inference`	Compensation inference (P25/P50/P75)
GET	`/api/v3/scoring/jd-calibration/{id}`	JD difficulty calibration
POST	`/api/v3/scoring/decision`	Recruiter feedback signal loop

Candidate Endpoints

Method	Path	Purpose
GET	`/api/v3/candidates/{id}/authenticity`	Resume authenticity scoring
GET	`/api/v3/candidates/{id}/sector-profile`	Industry/sector classification
GET	`/api/v3/candidates/{id}/bias-audit`	Bias audit (advisory only)
GET	`/api/v3/candidates/{id}/cross-jd-fit`	Multi-JD cross-fit analysis
POST/GET	`/api/v3/candidates/{id}/cover-letter`	Cover letter + sentiment analysis

3) Internal Stage Execution

Pipeline stages are NOT exposed as individual HTTP endpoints in v3. Instead, the EventDrivenOrchestrator (or synchronous orchestrator) calls stage functions directly within the same process. Stage results flow through an orchestrator context dictionary.

The v2 internal endpoints (/internal/v1/...) are retained for backward compatibility but all new pipeline execution uses the v3 orchestrator pattern.

Section 6: Cost Model and Metering

1) Intent

Define the metering and cost model for the pipeline so that cost-per-resume is deterministic and auditable.

2) Metering Implementation

Metering is implemented via two tables:

intelletto.pipeline_phase_event: per-stage telemetry for every pipeline run. Records stage, outcome, error_code, error_message, started_at, completed_at, and computed duration.
intelletto.scoring_run_metering: per-scorecard metering. Records gemini_calls, gemini_tokens_in/out, gate_evaluations, bucket_computations, modifier_computations, db_writes, and latency breakdowns (load_inputs, gates, buckets, modifiers, persist, total).

3) Cost Control

Extraction caching: Results cached by checksum_sha256 + tenant_id. Duplicate documents skip Gemini entirely.
Cost ceiling: parsing_pipeline_run.cost_ceiling_usd enforced per run. Override via cost_gate_override flag.
Batch processing: v3/src/integrations/google/batch_gemini.py provides 50% cost reduction for bulk workloads via Gemini Batch API.
Budget stop: If cost_ceiling_usd exceeded, run terminates with status COST_GATE_EXCEEDED. Partial artifacts preserved.
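The cache key and the budget stop described above can be sketched as follows. The column names cost_ceiling_usd and cost_gate_override and the status COST_GATE_EXCEEDED come from this section; the `RunBudget` class, its `record` method, and the per-stage cost figures are hypothetical.

```python
import hashlib
from dataclasses import dataclass


def extraction_cache_key(tenant_id: str, pdf_bytes: bytes) -> tuple[str, str]:
    """Key is (tenant_id, checksum_sha256), so tenants never share cache entries."""
    return (tenant_id, hashlib.sha256(pdf_bytes).hexdigest())


@dataclass
class RunBudget:
    cost_ceiling_usd: float
    cost_gate_override: bool = False
    spent_usd: float = 0.0

    def record(self, stage_cost_usd: float) -> str:
        """Add one stage's cost and return the resulting run status."""
        self.spent_usd += stage_cost_usd
        if self.spent_usd > self.cost_ceiling_usd and not self.cost_gate_override:
            return "COST_GATE_EXCEEDED"
        return "RUNNING"
```

Because the tenant is part of the key, the same PDF uploaded by two tenants produces two cache entries, which is consistent with the tenant-scoped cache described elsewhere in this document.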

Section 7: Quality Metrics

1) Intent

Define the required quality metrics that prove the pipeline meets the prime directive: strict schema compliance, 100% extraction coverage, evidence traceability, and operational robustness.

2) Core Metrics

M1 -- schema_valid: Gate B passes (extraction schema valid, blocking errors absent).
M2 -- coverage_complete: Gate A passes (all pages captured in lossless layer).
M3 -- evidence_density: Gate C passes (20% per-section fact density floor).
M4 -- normalization_coverage: Percentage of raw terms with canonical matches.
M5 -- pipeline_success_rate: Ratio of COMPLETED runs to total runs.
M6 -- scoring_completion_rate: Ratio of scored documents to parseable documents.
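The ratio metrics M4 and M5 reduce to simple counts. The sketch below assumes that an unmatched normalization_result row carries a null canonical_skill_id, which this document does not state; the function names are hypothetical.

```python
def ratio(numerator: int, denominator: int) -> float:
    """Safe division that reports 0.0 for an empty population."""
    return numerator / denominator if denominator else 0.0


def normalization_coverage(results: list[dict]) -> float:
    """M4: share of raw terms that received a canonical match."""
    matched = sum(1 for row in results if row.get("canonical_skill_id") is not None)
    return ratio(matched, len(results))


def pipeline_success_rate(statuses: list[str]) -> float:
    """M5: COMPLETED runs over all runs."""
    return ratio(statuses.count("COMPLETED"), len(statuses))
```

Under the M5 definition as written, runs that end in POOL_READY or DEDUP_SKIPPED count against the rate even though neither is a failure, so the population passed to `pipeline_success_rate` may need to be filtered by mode.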

3) Production Results

Pipeline success rate: 218/218 documents completed (100%) in the April 2026 trial.
Scoring completion rate: 218/218 scored (100%), 0 failures.
Golden corpus: 31/36 resumes within 5 points of v2 scoring baseline.
Test suite: 747 tests passing, 0 failures.

4) Regression Testing

Golden test corpus at v3/goldens/ provides regression baseline. The v3 test suite covers:

Contract gate tests (schema validation, evidence contract)
Contact URL recall tests
Intelletto resume lossless tests
Process QA tests
Pipeline validation tests
Scoring runtime tests
Batch Gemini tests

Section 8: Error Handling, Retries, and Safe Degradation

1) Intent

Specify the error-handling contract for the resume pipeline so that runs are idempotent, retry-safe, repair-capable, and degradable (never lose content).

2) Pipeline Run Status Machine

Status	Meaning
`PENDING`	Run created, not yet started
`RUNNING`	Active stage execution
`COMPLETED`	All stages passed, scorecard generated
`POOL_READY`	Mode A: parsed and normalized, awaiting JD assignment
`FAILED`	Unrecoverable error
`COST_GATE_EXCEEDED`	Budget exceeded, partial artifacts preserved
`DEDUP_SKIPPED`	Duplicate detected, pipeline stopped (not a failure)
`PARTIAL`	Some stages completed, run interrupted

3) Error Taxonomy

Input errors (permanent): PDF unreadable, unsupported format. Recovery: fail fast.
Auth errors (permanent until fixed): Gemini API key invalid, IAM denied. Recovery: fail fast.
Rate limits (transient): 429 RESOURCE_EXHAUSTED. Recovery: exponential backoff.
Service availability (transient): 503/504, timeouts. Recovery: retry.
OCR failures: Document AI partial pages. Recovery: retry once; if still partial, extraction proceeds with direct PDF to Gemini (graceful degradation).
LLM output failures: JSON invalid or schema mismatch. Recovery: repair once, re-extract once, then fail.
Persistence failures: DB connection (transient: retry) or constraint violation (permanent: fail).
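The taxonomy above separates errors that are worth retrying from those that should fail fast. A minimal sketch of that split with capped exponential backoff follows; the error class labels, the base delay, and the cap are hypothetical values chosen for illustration, not settings from the pipeline.

```python
import random

# Hypothetical labels for the categories in the taxonomy above.
TRANSIENT = {"RATE_LIMIT", "SERVICE_UNAVAILABLE", "DB_CONNECTION"}
PERMANENT = {"INPUT_ERROR", "AUTH_ERROR", "DB_CONSTRAINT"}


def is_retryable(error_class: str) -> bool:
    """Only transient errors are retried; permanent ones fail fast."""
    return error_class in TRANSIENT


def backoff_seconds(attempt: int, base: float = 1.0, cap: float = 60.0,
                    jitter: bool = True) -> float:
    """Delay before retry number `attempt` (0-based): base * 2**attempt, capped.

    With jitter enabled the delay is drawn uniformly from [0, capped delay]
    so that many workers do not retry in lockstep.
    """
    delay = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, delay) if jitter else delay
```

With the defaults and jitter disabled, the delays are 1, 2, 4, 8 seconds and so on, up to the 60-second cap.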

4) Idempotency Pattern

# Every pipeline write checks idempotency_record first
existing = await db.fetchrow(
    "SELECT id FROM intelletto.idempotency_record WHERE key = $1",
    idempotency_key
)
if existing:
    return {"status": "already_processed", "id": str(existing["id"])}
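The check above is a read that is followed later by a write, so two workers could in principle both observe no row for the same key. One common way to close that window is to let the key's uniqueness constraint arbitrate. The sketch below uses an in-memory SQLite database purely as a stand-in; the table definition and the `claim` helper are hypothetical, and this document does not state whether the production code works this way. In PostgreSQL the corresponding statement would be `INSERT ... ON CONFLICT (key) DO NOTHING`.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory stand-in for the idempotency_record table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE idempotency_record ("
        "key TEXT PRIMARY KEY, created_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def claim(conn: sqlite3.Connection, idempotency_key: str) -> bool:
    """Return True if this caller claimed the key, False if it already existed."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO idempotency_record (key) VALUES (?)",
        (idempotency_key,),
    )
    conn.commit()
    # rowcount is 1 when the row was inserted and 0 when the insert was ignored.
    return cur.rowcount == 1
```

A caller would proceed with the write only when `claim` returns `True` and would otherwise return the `already_processed` response.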

5) Concurrency Guard

Pipeline runs use FOR UPDATE SKIP LOCKED with a 30-minute stale threshold to prevent concurrent execution of the same document.

Section 9: Security, Privacy, and Compliance

1) Tenant Isolation

All database queries are scoped by tenant_id, so no query returns another tenant's rows.
Dedup queries are tenant-scoped.
Extraction cache is scoped by tenant_id + checksum_sha256.
All sensitive v1 routes require tenant_id parameter (422 without).

2) SSRF Protection (Data Fusion)

DNS resolve + private-IP block before fetching any candidate URL.
Scheme allowlist (http/https only), port allowlist.
Redirect validation (no redirects to private IPs).
32 tests covering SSRF protection scenarios.
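The first three checks above (scheme allowlist, port allowlist, DNS resolution with a private-IP block) can be sketched with the standard library. The function name `is_fetch_allowed` and the port set {80, 443} are assumptions; the production allowlist is not given in this document.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}
ALLOWED_PORTS = {80, 443}  # assumption: the real port allowlist is not documented here


def is_fetch_allowed(url: str) -> bool:
    """Return True only if the URL passes scheme, port, and resolved-address checks."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES or not parts.hostname:
        return False
    try:
        port = parts.port or (443 if parts.scheme == "https" else 80)
    except ValueError:  # malformed or out-of-range port
        return False
    if port not in ALLOWED_PORTS:
        return False
    try:
        infos = socket.getaddrinfo(parts.hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return False
    # Reject if ANY resolved address is loopback, private, link-local, or otherwise non-public.
    for info in infos:
        if not ipaddress.ip_address(info[4][0]).is_global:
            return False
    return True
```

Redirect validation would apply the same check to every hop before following it. A hardened fetcher would also connect to the address that was validated rather than resolving the hostname a second time.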

3) Authentication

The pipeline worker requires an OIDC Bearer token or an API key; the User-Agent header is not accepted as a credential.
Google OAuth configured for api.intelletto.ai.
API key authentication for programmatic access.

4) Data Classification

All resume documents classified as PII_HIGH by default.
resume_document.pii_classification field tracks classification.
retention_policy_id and purge_at fields support data lifecycle management.

Section 10: Scoring Engine (8-Bucket Weighted Model)

1) Intent

Document the scoring engine that evaluates candidates against job descriptions using an 8-bucket weighted model with gates and modifiers.

2) Scoring Flow

Config resolution: Load published scoring_config_version for the JD.
Input loading: Load scoring_input_snapshot + JD skill requirements (3-level fallback).
Gate evaluation: Run gates first. If a DISQUALIFY gate fails, scoring stops.
Bucket scoring: 8 buckets scored independently via ScorerRegistry.
Base fit: Weighted sum of bucket scores (enabled bucket weights sum to 1.0).
Modifier computation: Data fusion confidence + evidence coverage adjust +/- points.
Final score: Base fit + modifier points, clamped 0..100.
Classification: STRONG / BORDERLINE / RISK with ADVANCE / REVIEW / HOLD recommendation.
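The arithmetic of the base fit, modifier, and final score steps can be sketched as below. The bucket fields mirror the config_payload shape used in this document, but the function name and the interpretation of the modifier budget as a symmetric limit on the net adjustment are assumptions. Classification thresholds are not given here, so they are left out.

```python
def clamp(value: float, low: float, high: float) -> float:
    return max(low, min(high, value))


def final_score(buckets: list[dict], modifier_points: list[float], budget: float) -> dict:
    """Combine bucket scores (0..100) and modifier points into a clamped total."""
    enabled = [b for b in buckets if b["enabled"]]
    # Per-bucket contribution = score * weight; base fit = sum of contributions.
    contributions = {b["bucketId"]: b["score"] * b["weight"] for b in enabled}
    base_fit = sum(contributions.values())
    # Assumption: the net modifier adjustment is limited to +/- budget points.
    modifiers = clamp(sum(modifier_points), -budget, budget)
    return {
        "base_fit": base_fit,
        "modifier_points": modifiers,
        "score_total": clamp(base_fit + modifiers, 0.0, 100.0),
        "contributions": contributions,
    }
```

For example, two enabled buckets scoring 80 and 60 with weight 0.5 each give a base fit of 70; modifiers of +4 and +8 against a budget of 10 are limited to +10, for a total of 80.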

3) 8 Scoring Buckets

#	Bucket	Formula
1	Hard Skills	`required_match_rate * 75% + nice_to_have_rate * 25% + quality_bonus (up to +8)`. Match tiers: exact=1.0, alias>=0.85, fuzzy>=0.65.
2	Domain/Process	`title_relevance * 50% + skill_domain_overlap * 30% + industry_continuity * 20%`
3	Scope/Complexity	`seniority_score * 60% + scope_signals * 40%`. Chief/VP=100, Director=90, Senior=80, Manager=75, Junior=35, Intern=20.
4	Tenure/Recency	`years_score * 40% + recency_score * 35% + stability_score * 25%`
5	Soft/Behavioral	12 soft signals with leadership/problem_solving weighted 1.2x
6	Languages/Communications	`english_score * 60% + additional_languages * 40%`
7	Compliance/Scheduling	`completeness * 50% + norm_confidence * 30% + gate_health * 20%`
8	Education/Certifications	Education tenure decay: `e^(-0.18 * max(0, years - 3))`. Seniority-aware: Junior (degree 60%, skills 40%) to Executive (degree 5%, skills 95%). Cert bonus capped at 0.12. Combined: education 55% + cert matching 35% + cert-expanded skill bonus 10%.
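Two of the formulas in the table are worked through below: the Hard Skills score (bucket 1) and the education tenure decay (bucket 8). The sketch assumes that the match rates are fractions in [0, 1], that the bucket score is on a 0..100 scale, and that "years" means years since graduation. None of these is confirmed by the table, and the function names are hypothetical.

```python
import math


def hard_skills_score(required_match_rate: float, nice_to_have_rate: float,
                      quality_bonus: float = 0.0) -> float:
    """Bucket 1: 75% required match + 25% nice-to-have, plus a bonus capped at +8."""
    bonus = min(max(quality_bonus, 0.0), 8.0)
    raw = 100.0 * (0.75 * required_match_rate + 0.25 * nice_to_have_rate) + bonus
    return min(raw, 100.0)


def education_decay(years: float) -> float:
    """Bucket 8: e^(-0.18 * max(0, years - 3)); no decay during the first 3 years."""
    return math.exp(-0.18 * max(0.0, years - 3.0))
```

Under these assumptions a candidate who matches every required skill and no nice-to-have skills scores 75 before any bonus, and a degree earned 13 years ago carries a decay factor of e^(-1.8), roughly 0.165.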

4) Scoring Config Lifecycle

DRAFT: editable, not used for scoring.
PUBLISHED: immutable, active for scoring. Enabled bucket weights must sum to 1.0.
DEPRECATED: archived, not used for new scoring.
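The PUBLISHED precondition (enabled bucket weights sum to 1.0) can be checked with a floating-point tolerance. This is a sketch against the config_payload shape shown in this document; the function name `can_publish` and the tolerance are assumptions.

```python
import math


def can_publish(config_payload: dict) -> bool:
    """True when at least one bucket is enabled and enabled weights sum to 1.0."""
    weights = [
        bucket["weight"]
        for bucket in config_payload.get("buckets", [])
        if bucket.get("enabled")
    ]
    # isclose avoids rejecting sums such as 0.1 * 10 that are off by rounding error.
    return bool(weights) and math.isclose(sum(weights), 1.0, abs_tol=1e-9)
```

Disabled buckets are ignored, so a config can carry a disabled bucket with any weight without affecting the check.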

5) JD Skill Requirements Loading

Three-level fallback ensures skills are available for scoring:

intelletto.jd_skill_requirement table (structured, preferred)
job_description_version.extracted_features_json (legacy)
job_description_version.model_json (JD Orchestrator)
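The three-level fallback amounts to trying each source in order and taking the first one that yields skills. In the sketch below the loaders are placeholder callables standing in for the three sources listed above; the function name and the return shape are hypothetical.

```python
from typing import Callable, Optional

Loader = Callable[[], Optional[list[str]]]


def load_jd_skills(loaders: list[tuple[str, Loader]]) -> tuple[str, list[str]]:
    """Return (source_name, skills) from the first loader with a non-empty result."""
    for source, load in loaders:
        skills = load()
        if skills:
            return source, skills
    return "none", []
```

Returning the source name alongside the skills makes it possible to record which fallback level was actually used for a given scorecard.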

Section 11: Acceptance Criteria

1) Pipeline Acceptance

All 12 stages complete (or mode-appropriate subset) for every input document.
All 5 validation gates evaluated and persisted.
Pipeline phase events recorded for every executed stage.
Scoring input snapshot created with all gates passed.

2) Extraction Acceptance

Skills always returned as {technical: string[], soft: string[]} (response_schema enforced).
No zero-skill candidates (schema enforcement prevents this).
Evidence density >= 20% per structured section (Gate C).

3) Scoring Acceptance

8 bucket scores computed with evidence and rubric metadata.
Per-bucket contribution = score * weight.
Base fit = sum of contributions.
Modifiers applied within budget.
Final score clamped 0..100.

4) Lossless Resume Output Constraints

The lossless spec defines what every generated Intelletto resume must contain. The following rules are non-negotiable:

NEVER produce: "No detailed achievements mapped in this parse." -- this is always a defect.
NEVER truncate experience sections for any reason.
NEVER omit advisory roles, even those outside the main employment timeline.
NEVER omit education or certifications -- if absent from source, write "Not captured in source document".
The Intelletto Resume word count must always exceed the source resume word count.

5) Test Suites

Suite	Location	Count
v3 tests	`v3/tests/`	101+ tests
v2 contract gates	`tests/test_contract_gates_unittest.py`	Gate B, Gate C, validation tests
Contact URL recall	`tests/test_contact_url_recall_unittest.py`	URL extraction accuracy
Lossless resume	`tests/test_intelletto_resume_lossless_unittest.py`	Output completeness
Process QA	`tests/test_process_qa_unittest.py`	End-to-end pipeline QA
Pipeline validation	`tests/test_pipeline_validation_unittest.py`	Stage validation
Scoring runtime	`tests/test_scoring_runtime_unittest.py`	Scorer registry, bucket dispatch
Batch Gemini	`tests/test_batch_gemini_unittest.py`	Batch API integration
Total		747 tests passing

Section 12: Extraction Schema

1) response_schema Enforcement

The v3 pipeline uses Gemini 2.5 Flash with the response_schema parameter. This constrains model output at the API decoding level, so the JSON structure is guaranteed by the API rather than by post-hoc validation.

2) Key Schema Shape

{
  "basics": {
    "name": "string",
    "email": "string",
    "phone": "string",
    "location": "string",
    "urls": ["string"]
  },
  "work_history": [
    {
      "title": "string",
      "company": "string",
      "start_date": "string",
      "end_date": "string",
      "achievements": ["string"]
    }
  ],
  "education": [
    {
      "school": "string",
      "degree": "string",
      "field": "string",
      "graduation_date": "string"
    }
  ],
  "skills": {
    "technical": ["string"],    // ALWAYS this shape
    "soft": ["string"]          // ALWAYS this shape
  },
  "certifications": [
    {
      "name": "string",
      "issuer": "string",
      "date": "string"
    }
  ],
  "languages": [
    {
      "language": "string",
      "proficiency": "string"
    }
  ]
}

Critical: The skills field ALWAYS returns as {technical: string[], soft: string[]}. Without response_schema enforcement, Gemini would return skills in inconsistent formats (flat array, nested objects, comma-separated strings) leading to zero-skill failures in normalization and scoring.
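As an illustration of the skills contract, the fragment below writes the `skills` portion of the schema as a plain dictionary in the notation the Gemini API uses for schemas (type names such as OBJECT, ARRAY, STRING), together with a small shape check that could be used in tests or fixtures. Marking both keys as required is an assumption, and `has_skills_shape` is a hypothetical helper, not part of the pipeline.

```python
def string_array() -> dict:
    """Schema for a list of strings."""
    return {"type": "ARRAY", "items": {"type": "STRING"}}


# Fragment covering only the `skills` field of the extraction schema.
SKILLS_SCHEMA = {
    "type": "OBJECT",
    "properties": {"technical": string_array(), "soft": string_array()},
    "required": ["technical", "soft"],  # assumption: both keys always present
}


def has_skills_shape(extraction: dict) -> bool:
    """True when extraction['skills'] is {technical: string[], soft: string[]}."""
    skills = extraction.get("skills")
    if not isinstance(skills, dict) or set(skills) != {"technical", "soft"}:
        return False
    return all(
        isinstance(skills[key], list) and all(isinstance(s, str) for s in skills[key])
        for key in ("technical", "soft")
    )
```

The check accepts empty lists, since a schema of this form constrains the shape of the field and not the number of skills it contains.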

3) parsed_resume.extraction_json

The full extraction output is stored in intelletto.parsed_resume.extraction_json as a JSONB column. This is the canonical extraction artifact for the pipeline run.

Section 13: Development Environment Setup

1) Prerequisites

Python 3.11+
Google Cloud SDK (gcloud)
Access to Intelletto Google Cloud project

2) Local Development

# Clone and setup
cd Intelletto
python -m venv .venv
source .venv/bin/activate
pip install -r v3/requirements.txt

# Start v3 server (port 8080)
cd v3
uvicorn main:app --port 8080 --reload

# Or use the startup script
./start_services.sh

3) Database Access

The database is a remote Cloud SQL instance; there is no local database. Never run commands against a local PostgreSQL instance.

# Cloud SQL instance
Instance: intelletto-ai
Region: asia-southeast1
Schema: intelletto

# Connection via Cloud SQL Proxy (for local dev)
cloud_sql_proxy -instances=<project>:asia-southeast1:intelletto-ai=tcp:5432

# Or set INTELLETTO_DB_URL env var directly

4) GCS Bucket

GCS_BUCKET=intelletto-ai-resume-parse-404886655151

# Resume intake path
gs://intelletto-ai-resume-parse-404886655151/RESUME-POOL/

# Processed archive path
gs://intelletto-ai-resume-parse-404886655151/RESUME-POOL/processed/

5) Deployment

# Build and deploy to Cloud Run
gcloud builds submit --config v3/cloudbuild.yaml
gcloud run deploy intelletto-api \
  --region asia-southeast1 \
  --image <image> \
  --min-instances 1 \
  --max-instances 10 \
  --cpu 2 \
  --memory 2Gi

6) Running Tests

# v2 contract tests (always use .venv/bin/python)
./.venv/bin/python -m pytest tests/test_contract_gates_unittest.py \
  tests/test_contact_url_recall_unittest.py \
  tests/test_intelletto_resume_lossless_unittest.py \
  tests/test_process_qa_unittest.py -q

# v3 tests
./.venv/bin/python -m pytest v3/tests/ -q

# Expected: all tests pass (747 total)

7) Environment Variables

Variable	Purpose
`INTELLETTO_DB_URL`	Cloud SQL connection string
`GEMINI_API_KEY`	Google Gemini API key
`API_KEY`	Intelletto API authentication key
`GCS_BUCKET`	GCS bucket name (intelletto-ai-resume-parse-404886655151)
`GOOGLE_CLOUD_PROJECT`	Google Cloud project ID
`GOOGLE_DOCUMENT_AI_PROCESSOR_ID`	Document AI processor ID
`GOOGLE_DOCUMENT_AI_LOCATION`	Document AI processor location
`AUTH_ENABLED`	Enable/disable authentication (true for production)
`PUBSUB_ENABLED`	Enable event-driven orchestrator (currently false)

Section 14: Data Model -- Entities and Relationships

1) Core Entity Relationships

resume_document (1) -----> (N) parsing_pipeline_run
  |                              |
  +-> candidate_profile (1:1)    +-> pipeline_phase_event (1:N)
                                 +-> pipeline_gate_result (1:N)
                                 +-> coverage_ledger (1:1)
                                 +-> parsed_resume (1:1)
                                 |     +-> normalization_result (1:N)
                                 +-> scoring_input_snapshot (1:1)
                                       +-> scorecard_version (1:N)
                                             +-> bucket_score (1:N)
                                             +-> modifier_result (1:N)
                                             +-> scoring_gate_result (1:N)
                                             +-> scoring_run_metering (1:1)

2) Key Foreign Key Chains

resume_document -> candidate_profile (via candidate_id)
parsing_pipeline_run -> resume_document (via resume_document_id)
parsed_resume -> parsing_pipeline_run (via parsing_job_id)
scoring_input_snapshot -> parsed_resume (via parsed_resume_id)
scorecard_version -> scoring_input_snapshot (via scoring_input_id)
resume_document_job_map -> resume_document + job_requisition

3) Key JSONB Shapes

extraction_json (in parsed_resume)

{
  "basics": {"name": "...", "email": "...", "phone": "...", "urls": [...]},
  "work_history": [{"title": "...", "company": "...", "dates": {...}, "achievements": [...]}],
  "education": [{"school": "...", "degree": "...", "field": "..."}],
  "skills": {"technical": ["..."], "soft": ["..."]},
  "certifications": [{"name": "...", "issuer": "...", "date": "..."}],
  "languages": [{"language": "...", "proficiency": "..."}]
}

normalized_skills_json (in scoring_input_snapshot)

[
  {"raw_term": "ReactJS", "canonical_id": "uuid", "canonical_name": "React", "match_type": "EXACT", "confidence": 0.95},
  {"raw_term": "K8s", "canonical_id": "uuid", "canonical_name": "Kubernetes", "match_type": "FUZZY", "confidence": 0.88}
]

config_payload (in scoring_config_version)

{
  "gates": [{"gateId": "...", "name": "...", "severity": "DISQUALIFY|WARN|INFO", "rule": {...}}],
  "buckets": [{"bucketId": "...", "name": "...", "weight": 0.25, "enabled": true}],
  "modifiers": [{"modifierId": "...", "name": "...", "minPoints": -5, "maxPoints": 5}],
  "modifiersBudgetPoints": 10
}

Appendix A: Known Gaps and Remaining Work

A.1 Resolved P0 Gaps

ID	Gap	Resolution
P0-1	Gate B always passed true	FIXED. `_partition_validation_issues()` separates blocking from advisory. Gate B fails on blocking errors. Intake stops pipeline on failure. 6 new tests.
P0-2	Gate C only checked list presence	FIXED. `_validate_extraction_evidence_contract()` checks per-section fact coverage with 20% density floor. 9 new tests.

A.2 Partial Resolution

ID	Gap	Status
P0-3	Bucket dispatch + rubric persistence	PARTIAL. Dispatch now via `ScorerRegistry` (exact match, not substring). `rubric_applied_json` has real scorer_id/input_hash/evidence_hash. Remaining: scorers are inline heuristics, not configurable rubric rules. 16 tests.

A.3 External Dependencies Not Yet Integrated

GitHub API integration: Requires GitHub API token for structured repository analysis.
Proxycurl for LinkedIn: Requires Proxycurl API key for LinkedIn profile parsing.
PII Redaction Pre-LLM: Blocked -- direct PDF to Gemini makes pre-LLM redaction architecturally difficult.

A.4 Enhancement Features Delivered

All delivered in v3, live on Cloud Run:

Re-scoring API (single + bulk, append-only)
Interview Brief Generator (evidence-derived, no Gemini)
Resume Authenticity Scoring
Industry/Sector Classification (15 NAICS-style sectors)
Bias Audit (advisory only)
Compensation Inference (P25/P50/P75)
JD Difficulty Calibration
Recruiter Feedback Signal Loop
Multi-JD Cross-JD Auto-Fit
Pipeline Latency SLA Dashboard
Cover Letter + Sentiment Analysis
AI JD Generator (Gemini-powered)