DoD BAA compliance matrix generation in Python
Broad Agency Announcements (BAAs) from the Department of Defense introduce a distinct compliance burden for research institutions. Unlike standardized solicitations, BAAs frequently embed dynamic technical evaluation criteria, phased milestone deliverables, and layered regulatory references spanning the FAR, DFARS, and agency-specific supplements. Manual requirement tracking introduces unacceptable risk during proposal submission and post-award administration. Automating the generation of a compliance matrix requires a deterministic parsing pipeline that prioritizes structural fidelity over heuristic guessing. This workflow operates directly within the DoD BAA Requirement Extraction framework, ensuring that every conditional obligation, reporting cadence, and security mandate is captured, normalized, and mapped to institutional response templates.
1. Deterministic Document Ingestion & Structural Anchoring
DoD BAAs are distributed as unstructured PDFs, HTML portals, or hybrid XML packages. Coordinate-aware text extraction is mandatory to preserve hierarchical section numbering and table boundaries. The ingestion layer must separate raw text recovery from semantic classification, aligning with established Core Architecture & RFP Taxonomy standards.
import logging
from typing import Dict, List

import pdfplumber

logger = logging.getLogger(__name__)


def extract_structured_text(pdf_path: str) -> List[Dict]:
    """
    Extract coordinate-aware words from every page of a PDF.

    Returns one dict per word with its page number, left edge (x0),
    distance from the top of the page (stored under "y0"), and text.
    Note: pdfplumber's extract_words() returns dicts whose keys include
    text, x0, x1, top, bottom, doctop, and upright.
    Font size is not included in extract_words() by default.
    """
    extracted_blocks = []
    try:
        with pdfplumber.open(pdf_path) as pdf:
            for page_num, page in enumerate(pdf.pages, start=1):
                words = page.extract_words(x_tolerance=3, y_tolerance=3)
                for word in words:
                    extracted_blocks.append({
                        "page": page_num,
                        "x0": word["x0"],
                        "y0": word["top"],  # measured from the top of the page
                        "text": word["text"],
                    })
    except (FileNotFoundError, UnicodeDecodeError):
        # Let these propagate unchanged so the caller can route them
        # (missing file versus fallback parser).
        raise
    except Exception as e:
        logger.error(f"PDF ingestion failed for {pdf_path}: {e}")
        raise RuntimeError("Ingestion pipeline halted due to unreadable document structure.") from e
    return extracted_blocks
Implementation Notes:
- Use x_tolerance and y_tolerance to prevent fragmented word extraction in multi-column layouts.
- Validate bounding box continuity to detect table headers versus body text.
- Fail fast on corrupted PDFs to prevent silent data loss in downstream compliance mapping.
- extract_words() does not return font size by default; pass extra_attrs=["size"], read the page.chars list, or use get_text("dict") via PyMuPDF if font size classification is needed.
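One way to put the word coordinates to work is to regroup words into lines and flag lines that open with a section number. The sketch below is illustrative only: group_into_lines, find_section_anchors, the 3-point vertical tolerance, and the numbering pattern are assumptions rather than part of pdfplumber, and a line such as "12 months after award" would be flagged too, so the result still needs review.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumed heading shape: "3", "3.1", or "3.1.2" followed by a title.
SECTION_NUMBER = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+\S")


def _finish_line(page: int, words: List[Dict]) -> Dict:
    """Order one line's words left to right and join their text."""
    ordered = sorted(words, key=lambda w: w["x0"])
    return {
        "page": page,
        "y0": min(w["y0"] for w in ordered),
        "text": " ".join(w["text"] for w in ordered),
    }


def group_into_lines(words: List[Dict], y_tolerance: float = 3.0) -> List[Dict]:
    """Merge word dicts (page, x0, y0, text) into lines, page by page."""
    by_page: Dict[int, List[Dict]] = defaultdict(list)
    for word in words:
        by_page[word["page"]].append(word)

    lines: List[Dict] = []
    for page in sorted(by_page):
        current: List[Dict] = []
        for word in sorted(by_page[page], key=lambda w: (w["y0"], w["x0"])):
            # Start a new line when the vertical gap exceeds the tolerance.
            if current and abs(word["y0"] - current[0]["y0"]) > y_tolerance:
                lines.append(_finish_line(page, current))
                current = []
            current.append(word)
        if current:
            lines.append(_finish_line(page, current))
    return lines


def find_section_anchors(lines: List[Dict]) -> List[Tuple[str, Dict]]:
    """Return (section_number, line) for every line that starts with a number."""
    anchors = []
    for line in lines:
        match = SECTION_NUMBER.match(line["text"])
        if match:
            anchors.append((match.group(1), line))
    return anchors
```

The input shape matches the dicts returned by extract_structured_text, so the two can be chained directly.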
2. Obligation Extraction & Conditional Logic
DoD compliance matrices hinge on precise modal verb detection. The extraction engine must isolate mandatory indicators (shall, must, will, required) while filtering permissive language (may, should, encouraged). Conditional requirements need finite-state evaluation to attach activation flags.
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ComplianceObligation:
    requirement_id: str
    source_text: str
    modal_verb: str
    is_conditional: bool
    activation_condition: Optional[str]
    regulatory_ref: Optional[str]
    exception_clause: Optional[str]


MANDATORY_PATTERN = re.compile(r'\b(shall|must|will|are required to|is required to)\b', re.IGNORECASE)
CONDITIONAL_PREFIX = re.compile(r'\b(if|when|unless|provided that|subject to)\b', re.IGNORECASE)
# Match the clause number without swallowing a sentence-final period.
REG_REF_PATTERN = re.compile(r'(?:FAR|DFARS|DoDI|NIST SP)\s*\d+(?:[.\-]\d+)*(?:\(\w+\))?', re.IGNORECASE)
# "unless" already covers "unless otherwise directed by ...".
EXCEPTION_PATTERN = re.compile(r'(?:unless|except)\s[^.]+', re.IGNORECASE)


def parse_obligations(text_segments: List[str]) -> List[ComplianceObligation]:
    obligations = []
    for segment in text_segments:
        modal_match = MANDATORY_PATTERN.search(segment)
        if not modal_match:
            continue  # no mandatory indicator in this segment
        condition_match = CONDITIONAL_PREFIX.search(segment)
        reg_match = REG_REF_PATTERN.search(segment)
        exception_match = EXCEPTION_PATTERN.search(segment)
        obligations.append(ComplianceObligation(
            requirement_id=f"REQ-{len(obligations) + 1:04d}",
            source_text=segment.strip(),
            # Lowercase so "Shall" and "SHALL" pass downstream validation.
            modal_verb=modal_match.group(1).lower(),
            is_conditional=condition_match is not None,
            # Records the trigger keyword only (e.g. "if"), not the full clause.
            activation_condition=condition_match.group(0) if condition_match else None,
            regulatory_ref=reg_match.group(0) if reg_match else None,
            exception_clause=exception_match.group(0) if exception_match else None,
        ))
    return obligations
Implementation Notes:
- The regex engine operates on sentence-level segments to prevent cross-clause contamination.
- Cross-references to external standards should be resolved via a centralized citation lookup table. Official regulatory texts are maintained at https://www.acquisition.gov/.
- Inline exceptions are preserved verbatim to support Contracting Officer override tracking during post-award audits.
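The centralized citation lookup mentioned in the notes could start as a dictionary keyed by a normalized citation string. In this sketch the table entries, normalize_citation, and resolve_citation are illustrative assumptions; a real table would be curated and versioned by the institution against the official texts.

```python
import re
from typing import Dict, Optional

# Illustrative entries only; a production table would be maintained separately.
CITATION_TABLE: Dict[str, str] = {
    "FAR 52.204-21": "Basic Safeguarding of Covered Contractor Information Systems",
    "DFARS 252.204-7012": "Safeguarding Covered Defense Information and Cyber Incident Reporting",
    "NIST SP 800-171": "Protecting Controlled Unclassified Information in Nonfederal Systems and Organizations",
}

_CITATION = re.compile(r"^(FAR|DFARS|DoDI|NIST SP)\s*(\d+(?:[.\-]\d+)*)", re.IGNORECASE)
_CANONICAL_PREFIX = {"far": "FAR", "dfars": "DFARS", "dodi": "DoDI", "nist sp": "NIST SP"}


def normalize_citation(raw: str) -> Optional[str]:
    """Collapse whitespace, fix prefix casing, and drop suffixes such as "(b)"."""
    match = _CITATION.match(" ".join(raw.split()))
    if not match:
        return None
    prefix = _CANONICAL_PREFIX[match.group(1).lower()]
    return f"{prefix} {match.group(2)}"


def resolve_citation(raw: str) -> Optional[str]:
    """Return the title for a known citation, or None if it is not in the table."""
    key = normalize_citation(raw)
    return CITATION_TABLE.get(key) if key else None
```

Unresolved citations return None rather than raising, so they can be queued for manual review instead of halting extraction.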
3. Schema Serialization & Audit Validation
Extracted obligations must be serialized into a structured DataFrame with strict type enforcement.
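import hashlib
from dataclasses import asdict
from datetime import datetime, timezone
from typing import List

import pandas as pd


def serialize_to_matrix(obligations: List[ComplianceObligation]) -> pd.DataFrame:
    if not obligations:
        raise ValueError("No obligations extracted. Verify source document and modal verb patterns.")
    df = pd.DataFrame([asdict(o) for o in obligations])
    df = df.astype({
        "requirement_id": "string",
        "source_text": "string",
        "modal_verb": "category",
        "is_conditional": "boolean",
        # Optional fields: the nullable "string" dtype stores None as <NA>.
        "activation_condition": "string",
        "regulatory_ref": "string",
        "exception_clause": "string",
    })
    # Hash the serialized payload so later changes to the matrix can be detected.
    content_hash = hashlib.sha256(df.to_json().encode()).hexdigest()
    df.attrs["audit_hash"] = content_hash
    # datetime.utcnow() is deprecated as of Python 3.12; use an aware timestamp.
    df.attrs["generated_utc"] = datetime.now(timezone.utc).isoformat()
    return df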
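Implementation Notes:
- Columns that may be None (activation_condition, regulatory_ref, exception_clause) can be cast to the pandas "string" dtype directly: it is a nullable extension dtype, so None is stored as <NA>. Use fillna("") only if a response template requires empty strings instead of missing values.
- Attach a SHA-256 hash of the serialized JSON payload to df.attrs for tamper detection. df.attrs is experimental in pandas and is not written by to_csv(), so persist the hash alongside the exported file if it must survive export.
- Validate against institutional response templates before export to ensure column alignment.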
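df.attrs lives only on the in-memory DataFrame and is not written by to_csv(), so one option is to store the hash in a small sidecar manifest next to the exported file and recompute it on read. The sketch below uses only the standard library and plain row dictionaries; compute_audit_hash, write_manifest, verify_manifest, and the manifest layout are assumptions made for illustration, not part of pandas.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, List


def compute_audit_hash(rows: List[Dict]) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, fixed separators)."""
    payload = json.dumps(rows, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def write_manifest(rows: List[Dict], manifest_path: Path) -> str:
    """Write the hash, row count, and UTC timestamp to a JSON sidecar file."""
    digest = compute_audit_hash(rows)
    manifest = {
        "audit_hash": digest,
        "row_count": len(rows),
        "generated_utc": datetime.now(timezone.utc).isoformat(),
    }
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return digest


def verify_manifest(rows: List[Dict], manifest_path: Path) -> bool:
    """Recompute the hash for rows and compare it with the stored value."""
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    return manifest["audit_hash"] == compute_audit_hash(rows)
```

With pandas, the rows could come from df.to_dict(orient="records") once missing values have been converted to None, since pandas missing-value markers are not JSON serializable.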
4. Production Error Handling & Fallback Routing
Compliance pipelines must degrade gracefully when encountering malformed documents, missing tables, or unsupported encoding.
import logging
import re
from datetime import datetime, timezone
from pathlib import Path

logger = logging.getLogger(__name__)


class CompliancePipelineError(Exception):
    """Custom exception for pipeline-level failures."""


def run_pipeline(pdf_path: str, output_dir: Path) -> Path:
    try:
        segments = extract_structured_text(pdf_path)
        # extract_structured_text yields one entry per word, so rebuild
        # sentences before matching (naive split after "." or ";").
        full_text = " ".join(s["text"] for s in segments)
        sentences = [s for s in re.split(r"(?<=[.;])\s+", full_text) if s.strip()]
        obligations = parse_obligations(sentences)
        matrix_df = serialize_to_matrix(obligations)
        report = validate_matrix(matrix_df)
        failed = [name for name, passed in report.items() if not passed]
        if failed:
            logger.warning("Exporting matrix with failed checks: %s", ", ".join(failed))
        timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
        output_file = output_dir / f"compliance_matrix_{timestamp}.csv"
        matrix_df.to_csv(output_file, index=False)
        return output_file
    except FileNotFoundError as e:
        logger.critical(f"Source document missing: {e}")
        raise CompliancePipelineError("Document not found. Verify BAA distribution path.") from e
    except UnicodeDecodeError:
        logger.warning("Encoding mismatch detected. Attempting fallback parser.")
        # _fallback_extract is not defined in this article; supply your own.
        return _fallback_extract(pdf_path, output_dir)
    except Exception as e:
        logger.error(f"Pipeline failure: {e}")
        raise CompliancePipelineError("Unrecoverable parsing error. Manual review required.") from e
Implementation Notes:
- Catch UnicodeDecodeError early and route to a fallback extraction method before halting.
- Raise explicit CompliancePipelineError instances to trigger automated alerting in grant management systems.
- Log all fallback activations to maintain transparency during compliance audits.
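The fallback routing described in these notes can be expressed as an ordered chain of extractors, with every fallback activation logged. This is a sketch under stated assumptions: run_with_fallbacks, ExtractionFailed, and the extractor callables are hypothetical names, and it does not implement the _fallback_extract function referenced in run_pipeline.

```python
import logging
from typing import Callable, List, Sequence, Tuple

logger = logging.getLogger(__name__)

# An extractor takes a file path and returns a list of text segments.
Extractor = Callable[[str], List[str]]


class ExtractionFailed(Exception):
    """Raised when every extractor in the chain has failed."""


def run_with_fallbacks(
    path: str, extractors: Sequence[Tuple[str, Extractor]]
) -> Tuple[str, List[str]]:
    """Try each named extractor in order; return (name, segments) from the first success."""
    errors: List[str] = []
    for position, (name, extractor) in enumerate(extractors):
        try:
            segments = extractor(path)
        except (UnicodeDecodeError, ValueError) as exc:
            # Treated as recoverable: record the failure and try the next one.
            errors.append(f"{name}: {exc!r}")
            logger.warning("Extractor %r failed on %s; trying next fallback.", name, path)
            continue
        if position > 0:
            # Leave an audit trail whenever a fallback produced the result.
            logger.info("Fallback extractor %r succeeded for %s.", name, path)
        return name, segments
    raise ExtractionFailed(f"All extractors failed for {path}: {'; '.join(errors)}")
```

Returning the extractor name with the segments lets the caller record which parser produced a given matrix.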
The diagram below maps the run_pipeline function flow, including error branches and fallback routing.
flowchart TD
A["run_pipeline start"] --> B["extract_structured_text"]
B --> C{"Extraction\nsucceeded?"}
C -->|"no: FileNotFoundError"| D["Raise CompliancePipelineError"]
C -->|"no: UnicodeDecodeError"| E["Fallback parser\n_fallback_extract"]
C -->|"yes"| F["parse_obligations"]
E --> J["Return output file"]
F --> G["serialize_to_matrix"]
G --> H["validate_matrix"]
H --> I["Write CSV to output dir"]
I --> J
D --> K["Pipeline halted"]
5. Compliance Validation & Traceability
Audit-safe compliance validation requires bidirectional traceability between source text, extracted obligations, and institutional response fields.
import logging
from typing import Dict

import pandas as pd

logger = logging.getLogger(__name__)

VALID_MODALS = ["shall", "must", "will", "are required to", "is required to"]


def validate_matrix(df: pd.DataFrame) -> Dict[str, bool]:
    conditional_rows = df["is_conditional"].fillna(False).astype(bool)
    validation_report = {
        "no_null_requirements": bool(df["requirement_id"].notna().all()),
        # Compare case-insensitively so "Shall" and "SHALL" are accepted.
        "modal_verbs_valid": bool(df["modal_verb"].astype("string").str.lower().isin(VALID_MODALS).all()),
        # Every conditional obligation must record what triggers it.
        "conditional_logic_present": bool(df.loc[conditional_rows, "activation_condition"].notna().all()),
        # Presence and length check only; the hash is not recomputed here.
        "audit_hash_intact": "audit_hash" in df.attrs and len(df.attrs["audit_hash"]) == 64,
    }
    if not all(validation_report.values()):
        logger.warning("Matrix validation failed. Review flagged fields before submission.")
    return validation_report
Implementation Notes:
- Run validation immediately after serialization. Block export if audit_hash_intact or no_null_requirements fails.
- Integrate with institutional version control (e.g., Git LFS or SharePoint audit logs) to preserve matrix lineage.
- Reference official Python documentation for regular expression operations when tuning modal verb patterns for agency-specific phrasing.
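The export rule in the notes above can be enforced with a small gate that inspects the report returned by validate_matrix. In this sketch, CRITICAL_CHECKS, ExportBlocked, failed_checks, and assert_exportable are hypothetical names; which checks count as critical is an institutional decision.

```python
from typing import Dict, List

# Checks whose failure should stop export (assumed set, taken from the notes).
CRITICAL_CHECKS = ("audit_hash_intact", "no_null_requirements")


class ExportBlocked(Exception):
    """Raised when a critical validation check fails or is missing."""


def failed_checks(report: Dict[str, bool]) -> List[str]:
    """Names of all checks that did not pass, in sorted order."""
    return sorted(name for name, passed in report.items() if not passed)


def assert_exportable(report: Dict[str, bool]) -> List[str]:
    """Raise on any critical failure; otherwise return the non-critical failures."""
    # A critical check that is absent from the report counts as failed.
    blocking = [name for name in CRITICAL_CHECKS if not report.get(name, False)]
    if blocking:
        raise ExportBlocked(f"Export blocked by failed checks: {', '.join(blocking)}")
    return failed_checks(report)
```

Calling assert_exportable(validate_matrix(matrix_df)) before writing the CSV would stop the export on a critical failure while still surfacing lesser issues as a list for review.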
Automating DoD BAA compliance matrix generation reduces manual tracking risk while enforcing deterministic, auditable workflows. By anchoring extraction to structural coordinates, enforcing strict schema validation, and failing fast with logged fallback routing, research administrators and Python automation builders can deliver submission-ready matrices that withstand rigorous pre- and post-award scrutiny.