DoD BAA compliance matrix generation in Python

Broad Agency Announcements (BAAs) from the Department of Defense introduce a distinct compliance burden for research institutions. Unlike standardized solicitations, BAAs frequently embed dynamic technical evaluation criteria, phased milestone deliverables, and layered regulatory references spanning the FAR, DFARS, and agency-specific supplements. Manual requirement tracking introduces unacceptable risk during proposal submission and post-award administration. Automating the generation of a compliance matrix requires a deterministic parsing pipeline that prioritizes structural fidelity over heuristic guessing. This workflow operates directly within the DoD BAA Requirement Extraction framework, ensuring that every conditional obligation, reporting cadence, and security mandate is captured, normalized, and mapped to institutional response templates.

1. Deterministic Document Ingestion & Structural Anchoring

DoD BAAs are distributed as unstructured PDFs, HTML portals, or hybrid XML packages. Coordinate-aware text extraction is mandatory to preserve hierarchical section numbering and table boundaries. The ingestion layer must separate raw text recovery from semantic classification, aligning with established Core Architecture & RFP Taxonomy standards.

python
import pdfplumber
import logging
from typing import List, Dict

logger = logging.getLogger(__name__)

def extract_structured_text(pdf_path: str) -> List[Dict]:
    """
    Extracts coordinate-aware words from every page of a PDF.
    Returns a flat list of word dicts with page number and position metadata;
    hierarchy (sections, tables) is reconstructed by later stages.

    Note: pdfplumber's extract_words() returns dicts with keys including
    text, x0, x1, top, bottom, doctop, upright (y0/y1 exist on chars, not words).
    Font size is not included in extract_words() by default.
    """
    extracted_blocks = []
    try:
        with pdfplumber.open(pdf_path) as pdf:
            for page_num, page in enumerate(pdf.pages, start=1):
                blocks = page.extract_words(x_tolerance=3, y_tolerance=3)
                for block in blocks:
                    extracted_blocks.append({
                        "page": page_num,
                        "x0": block["x0"],
                        # Stores pdfplumber's "top": distance from the top of the page.
                        "y0": block["top"],
                        "text": block["text"],
                    })
    except (FileNotFoundError, UnicodeDecodeError):
        # Re-raise unchanged so callers can route these cases explicitly.
        raise
    except Exception as e:
        logger.error("PDF ingestion failed for %s: %s", pdf_path, e)
        raise RuntimeError("Ingestion pipeline halted due to unreadable document structure.") from e

    return extracted_blocks

Implementation Notes:

  • Use x_tolerance and y_tolerance to prevent fragmented word extraction in multi-column layouts.
  • Validate bounding box continuity to detect table headers versus body text.
  • Fail fast on corrupted PDFs to prevent silent data loss in downstream compliance mapping.
  • extract_words() does not expose font size per word by default; pass extra_attrs=["size"] to include it, or use the page.chars list or get_text("dict") via PyMuPDF if font size classification is needed.
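
The bounding-box checks described in these notes can be sketched as a small post-processing step. This is a minimal illustration, not part of the pipeline above: it assumes word dicts shaped like the output of extract_structured_text (page, x0, y0, text), and both the function name group_words_into_lines and the SECTION_NUMBER heading pattern are hypothetical.

```python
import re
from typing import Dict, List

# Assumed heading shape: a leading token such as "1.", "3.2", or "4.1.3".
SECTION_NUMBER = re.compile(r"^\d+(?:\.\d+)*\.?$")


def group_words_into_lines(words: List[Dict], y_tolerance: float = 3.0) -> List[Dict]:
    """Cluster word dicts into lines using their vertical position."""
    lines: List[Dict] = []
    # Visit words top to bottom within each page.
    for word in sorted(words, key=lambda w: (w["page"], w["y0"], w["x0"])):
        last = lines[-1] if lines else None
        if (
            last is not None
            and last["page"] == word["page"]
            and abs(last["y0"] - word["y0"]) <= y_tolerance
        ):
            last["words"].append(word)
        else:
            lines.append({"page": word["page"], "y0": word["y0"], "words": [word]})

    result = []
    for line in lines:
        # Restore left-to-right reading order inside the line.
        ordered = sorted(line["words"], key=lambda w: w["x0"])
        result.append({
            "page": line["page"],
            "y0": line["y0"],
            "text": " ".join(w["text"] for w in ordered),
            # A line counts as a heading candidate if it starts with a section number.
            "is_heading": bool(SECTION_NUMBER.match(ordered[0]["text"])),
        })
    return result
```

A leading number alone is a weak signal (a body line such as "5 copies" would also be flagged), so in practice this check would likely be combined with indentation or font-size evidence.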

2. Obligation Extraction & Conditional Logic

DoD compliance matrices hinge on precise modal verb detection. The extraction engine must isolate mandatory indicators (shall, must, will, required) while filtering permissive language (may, should, encouraged). Conditional requirements need finite-state evaluation to attach activation flags.

python
import re
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ComplianceObligation:
    requirement_id: str
    source_text: str
    modal_verb: str
    is_conditional: bool
    activation_condition: Optional[str]
    regulatory_ref: Optional[str]
    exception_clause: Optional[str]

MANDATORY_PATTERN = re.compile(r'\b(shall|must|will|are required to|is required to)\b', re.IGNORECASE)
CONDITIONAL_PREFIX = re.compile(r'\b(if|when|unless|provided that|subject to)\b', re.IGNORECASE)
# Digits joined by "." or "-" so a sentence-ending period is not captured.
REG_REF_PATTERN = re.compile(r'(?:FAR|DFARS|DoDI|NIST SP)\s*\d+(?:[.\-]\d+)*(?:\(\w+\))?', re.IGNORECASE)
# Stop at a period followed by whitespace or end of text, not at "52.204".
EXCEPTION_PATTERN = re.compile(r'\b(?:unless|except)\b.+?(?=\.(?:\s|$)|$)', re.IGNORECASE | re.DOTALL)

def parse_obligations(text_segments: List[str]) -> List[ComplianceObligation]:
    obligations = []
    for segment in text_segments:
        modal_match = MANDATORY_PATTERN.search(segment)
        if not modal_match:
            continue  # no mandatory language in this segment

        # Run each pattern once and reuse the match object.
        condition_match = CONDITIONAL_PREFIX.search(segment)
        reg_match = REG_REF_PATTERN.search(segment)
        exception_match = EXCEPTION_PATTERN.search(segment)

        obligations.append(ComplianceObligation(
            requirement_id=f"REQ-{len(obligations)+1:04d}",
            source_text=segment.strip(),
            # Lowercase so "Shall" and "shall" normalize to one category.
            modal_verb=modal_match.group(1).lower(),
            is_conditional=condition_match is not None,
            # Stores the trigger word (e.g. "if") that flagged the condition.
            activation_condition=condition_match.group(0) if condition_match else None,
            regulatory_ref=reg_match.group(0) if reg_match else None,
            exception_clause=exception_match.group(0) if exception_match else None,
        ))
    return obligations

Implementation Notes:

  • The regex engine operates on sentence-level segments to prevent cross-clause contamination.
  • Cross-references to external standards should be resolved via a centralized citation lookup table. Official regulatory texts are maintained at https://www.acquisition.gov/.
  • Inline exceptions are preserved verbatim to support Contracting Officer override tracking during post-award audits.
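
The centralized citation lookup mentioned in these notes could look roughly like the sketch below. The names CITATION_TABLE, normalize_citation, and resolve_citation are hypothetical, and the three entries are illustrative only; a production table would be loaded from a maintained source rather than hard-coded.

```python
import re
from typing import Dict, Optional

# Illustrative entries only (hypothetical table, not a complete registry).
CITATION_TABLE: Dict[str, str] = {
    "FAR 52.204-21": "Basic Safeguarding of Covered Contractor Information Systems",
    "DFARS 252.204-7012": "Safeguarding Covered Defense Information and Cyber Incident Reporting",
    "NIST SP 800-171": "Protecting Controlled Unclassified Information in Nonfederal Systems and Organizations",
}

# Case-insensitive lookup keyed by the uppercased citation.
_LOOKUP = {key.upper(): title for key, title in CITATION_TABLE.items()}


def normalize_citation(raw: str) -> str:
    """Collapse whitespace, drop trailing punctuation and paragraph markers."""
    cleaned = re.sub(r"\s+", " ", raw).strip().rstrip(".,;")
    cleaned = re.sub(r"\(\w+\)$", "", cleaned)  # "FAR 52.204-21(b)" -> "FAR 52.204-21"
    return cleaned.upper()


def resolve_citation(raw: Optional[str]) -> Optional[str]:
    """Return the clause title for a raw citation, or None if it is unknown."""
    if raw is None:
        return None
    return _LOOKUP.get(normalize_citation(raw))
```

Returning None for unknown citations, rather than raising, lets the matrix keep the raw reference and flag it for manual resolution.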

3. Schema Serialization & Audit Validation

Extracted obligations must be serialized into a structured DataFrame with strict type enforcement.

python
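import hashlib
from dataclasses import asdict
from datetime import datetime, timezone
from typing import List

import pandas as pd

def serialize_to_matrix(obligations: List[ComplianceObligation]) -> pd.DataFrame:
    if not obligations:
        raise ValueError("No obligations extracted. Verify source document and modal verb patterns.")

    df = pd.DataFrame([asdict(o) for o in obligations])

    # "string" and "boolean" are nullable extension dtypes, so None becomes <NA>.
    df = df.astype({
        "requirement_id": "string",
        "source_text": "string",
        "modal_verb": "category",
        "is_conditional": "boolean",
        "activation_condition": "string",
        "regulatory_ref": "string",
        "exception_clause": "string",
    })

    # Hash the serialized content so later modification can be detected.
    content_hash = hashlib.sha256(df.to_json().encode()).hexdigest()
    df.attrs["audit_hash"] = content_hash
    # datetime.utcnow() is deprecated since Python 3.12; use an aware timestamp.
    df.attrs["generated_utc"] = datetime.now(timezone.utc).isoformat()

    return df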

Implementation Notes:

  • The Optional columns (activation_condition, regulatory_ref, exception_clause) may hold None. The pandas "string" extension dtype is nullable, so casting converts None to <NA>; call fillna("") only if the export target cannot represent missing values.
  • Attach a SHA-256 hash of the serialized JSON payload to df.attrs for tamper detection. df.attrs is in-memory metadata and is not written by to_csv(), so persist the hash separately if it must survive export.
  • Validate against institutional response templates before export to ensure column alignment.
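
The template alignment check in the last note can be sketched without pandas by comparing column names directly. This is an illustrative helper under stated assumptions: AlignmentReport and check_template_alignment are hypothetical names, and the template is assumed to be available as a plain list of expected column headers.

```python
from typing import List, NamedTuple, Sequence


class AlignmentReport(NamedTuple):
    missing_from_matrix: List[str]    # template columns the matrix lacks
    unexpected_in_matrix: List[str]   # matrix columns the template does not know
    order_matches: bool               # shared columns appear in the same order


def check_template_alignment(
    matrix_columns: Sequence[str], template_columns: Sequence[str]
) -> AlignmentReport:
    """Compare matrix columns against the columns a response template expects."""
    matrix_set = set(matrix_columns)
    template_set = set(template_columns)

    missing = [c for c in template_columns if c not in matrix_set]
    unexpected = [c for c in matrix_columns if c not in template_set]

    # Compare only the columns both sides share, each in its own order.
    shared_matrix_order = [c for c in matrix_columns if c in template_set]
    shared_template_order = [c for c in template_columns if c in matrix_set]

    return AlignmentReport(missing, unexpected, shared_matrix_order == shared_template_order)
```

With a DataFrame, the first argument would be list(df.columns); an export step could refuse to proceed whenever missing_from_matrix is non-empty.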

4. Production Error Handling & Fallback Routing

Compliance pipelines must degrade gracefully when encountering malformed documents, missing tables, or unsupported encoding.

python
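import logging
import re
from datetime import datetime, timezone
from pathlib import Path

logger = logging.getLogger(__name__)

# Checks that must pass before a matrix may be exported.
BLOCKING_CHECKS = ("audit_hash_intact", "no_null_requirements")

class CompliancePipelineError(Exception):
    """Custom exception for pipeline-level failures."""

def run_pipeline(pdf_path: str, output_dir: Path) -> Path:
    try:
        words = extract_structured_text(pdf_path)
        # extract_structured_text returns one entry per word, so rebuild the
        # running text and split it into sentence-level segments first.
        # Naive split: abbreviations such as "e.g." will produce extra segments.
        full_text = " ".join(w["text"] for w in words)
        sentences = [s for s in re.split(r"(?<=[.;])\s+", full_text) if s]
        obligations = parse_obligations(sentences)
        matrix_df = serialize_to_matrix(obligations)
        report = validate_matrix(matrix_df)  # logs a warning on any failed check

        if not all(report[name] for name in BLOCKING_CHECKS):
            raise CompliancePipelineError("Blocking validation check failed. Export aborted.")

        output_dir.mkdir(parents=True, exist_ok=True)
        stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
        output_file = output_dir / f"compliance_matrix_{stamp}.csv"
        matrix_df.to_csv(output_file, index=False)
        return output_file

    except CompliancePipelineError:
        raise  # already a pipeline-level error; do not re-wrap
    except FileNotFoundError as e:
        logger.critical("Source document missing: %s", e)
        raise CompliancePipelineError("Document not found. Verify BAA distribution path.") from e
    except UnicodeDecodeError:
        logger.warning("Encoding mismatch detected. Attempting fallback parser.")
        # _fallback_extract is not defined in this article; supply an
        # implementation before enabling this branch.
        return _fallback_extract(pdf_path, output_dir)
    except Exception as e:
        logger.error("Pipeline failure: %s", e)
        raise CompliancePipelineError("Unrecoverable parsing error. Manual review required.") from e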

Implementation Notes:

  • Catch UnicodeDecodeError early and route to a fallback extraction method before halting.
  • Raise explicit CompliancePipelineError instances to trigger automated alerting in grant management systems.
  • Log all fallback activations to maintain transparency during compliance audits.
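
One way to route between a primary parser and fallbacks while logging every activation is an ordered extractor chain. The sketch below is an assumption-laden illustration: extract_with_fallbacks and the Extractor alias are hypothetical names, and each extractor is assumed to be a callable that takes a path and returns a list of dicts.

```python
import logging
from typing import Callable, Dict, List, Sequence, Tuple

logger = logging.getLogger(__name__)

# Hypothetical contract: an extractor takes a file path and returns text blocks.
Extractor = Callable[[str], List[Dict]]


def extract_with_fallbacks(
    pdf_path: str, extractors: Sequence[Tuple[str, Extractor]]
) -> Tuple[str, List[Dict]]:
    """Try each named extractor in order; return the first successful result."""
    errors: List[str] = []
    for position, (name, extractor) in enumerate(extractors):
        try:
            blocks = extractor(pdf_path)
        except FileNotFoundError:
            raise  # a missing file cannot be fixed by switching parsers
        except Exception as exc:
            logger.warning("Extractor %s failed for %s: %s", name, pdf_path, exc)
            errors.append(f"{name}: {exc}")
            continue
        if position > 0:
            # Record fallback use so audits can see which parser produced the data.
            logger.warning("Fallback extractor %s used for %s", name, pdf_path)
        return name, blocks
    raise RuntimeError("All extractors failed: " + "; ".join(errors))
```

Returning the extractor name alongside the blocks lets the caller store it with the matrix, so reviewers can tell which parser produced a given export.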

The diagram below maps the run_pipeline function flow, including error branches and fallback routing.

flowchart TD
  A["run_pipeline start"] --> B["extract_structured_text"]
  B --> C{"Extraction\nsucceeded?"}
  C -->|"no: FileNotFoundError"| D["Raise CompliancePipelineError"]
  C -->|"no: UnicodeDecodeError"| E["Fallback parser\n_fallback_extract"]
  C -->|"yes"| F["parse_obligations"]
  E --> J["Return output file"]
  F --> G["serialize_to_matrix"]
  G --> H["validate_matrix"]
  H --> I["Write CSV to output dir"]
  I --> J
  D --> K["Pipeline halted"]

5. Compliance Validation & Traceability

Audit-safe compliance validation requires bidirectional traceability between source text, extracted obligations, and institutional response fields.

python
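import hashlib
import logging
import pandas as pd
from typing import Dict

logger = logging.getLogger(__name__)

ALLOWED_MODALS = ["shall", "must", "will", "are required to", "is required to"]

def validate_matrix(df: pd.DataFrame) -> Dict[str, bool]:
    # Compare modal verbs case-insensitively; the extraction regex ignores case.
    modal_verbs = df["modal_verb"].astype("string").str.lower()
    # Rows flagged conditional must carry the trigger that was detected.
    conditional_rows = df["is_conditional"].fillna(False).astype(bool)
    # Recompute the hash so a changed frame is detected, not just a missing hash.
    current_hash = hashlib.sha256(df.to_json().encode()).hexdigest()

    # bool() converts numpy.bool_ results to plain Python booleans.
    validation_report = {
        "no_null_requirements": bool(df["requirement_id"].notna().all()),
        "modal_verbs_valid": bool(modal_verbs.isin(ALLOWED_MODALS).all()),
        "conditional_logic_present": bool(df.loc[conditional_rows, "activation_condition"].notna().all()),
        "audit_hash_intact": df.attrs.get("audit_hash") == current_hash,
    }

    if not all(validation_report.values()):
        logger.warning("Matrix validation failed. Review flagged fields before submission.")
    return validation_report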

Implementation Notes:

  • Run validation immediately after serialization. Block export if audit_hash_intact or no_null_requirements fails.
  • Integrate with institutional version control (e.g., Git LFS or SharePoint audit logs) to preserve matrix lineage.
  • Reference official Python documentation for regular expression operations when tuning modal verb patterns for agency-specific phrasing.
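
Bidirectional traceability can be sketched as a pair of indexes built from the same link records. Everything here is hypothetical: TraceLink, build_trace_index, digest, and the response field names are illustrative, and the link records are assumed to be produced when obligations are mapped to template fields.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List


def digest(text: str) -> str:
    """SHA-256 of the exact source sentence, used to detect later edits."""
    return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class TraceLink:
    requirement_id: str
    source_page: int
    source_digest: str
    response_field: str  # hypothetical template field name


def build_trace_index(links: List[TraceLink]) -> Dict[str, Dict[str, List[TraceLink]]]:
    """Index links by requirement and by response field for two-way lookup."""
    by_requirement: Dict[str, List[TraceLink]] = {}
    by_field: Dict[str, List[TraceLink]] = {}
    for link in links:
        by_requirement.setdefault(link.requirement_id, []).append(link)
        by_field.setdefault(link.response_field, []).append(link)
    return {"by_requirement": by_requirement, "by_response_field": by_field}
```

A reviewer can then start from a requirement ID to find the response fields that address it, or start from a response field to find the source sentences (by page and digest) that justify it.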

Automating DoD BAA compliance matrix generation reduces manual tracking risk and makes the workflow deterministic and auditable. By anchoring extraction to structural coordinates, enforcing strict schema validation, and failing fast with explicit fallback routing, research administrators and Python automation builders can produce submission-ready matrices that hold up under pre- and post-award scrutiny.