DIH Internal
Architecture walkthrough

From an inbox to the right SharePoint folder — automatically, with an audit trail.

DIH Hub watches eight mailboxes and eight OneDrive roots, reads every new attachment and file, and decides where it belongs in the firm's SharePoint taxonomy under /DIH/…. A hybrid engine — deterministic rules first, then a two-stage LLM with a confidence gate — handles the routing. Low-confidence items drop into a human triage queue. Every mutation is audit-logged in the same database transaction.

Module: 2 of DIH Email Ops
Branch: sandbox
As of: 2026-06-06
Audience: Stakeholders, Ops, Audit

8 + 8 · Sources · Mailboxes & OneDrives
9 · Pipeline steps · Read → file → audit
5 min · Live sweep cycle · Near real-time ingestion
100% · Routing decisions audited · Same DB transaction

How a document moves through DIH Hub

A document enters through one of two triggers — the live Delta sweep every five minutes, or an admin-fired Backfill (which optionally re-evaluates already-filed documents via a rerouteFiled flag). From there the pipeline is fixed and auditable; the flow always converges on one of three outcomes: auto-filed, triaged, or merged.

[Flow diagram]
Entry triggers: Delta sweep (every 5 min; 8 mailboxes · 8 OneDrives) or Backfill (admin; historical walk · optional reroute flag)
→ Stage to blob (SHA-256 hash)
→ Dedup (SHA-256 probe)
    Same hash: Duplicate source (silent extra row) or Merge queue (human one-click)
    New: continue
→ Extract text (PDF · Office · Vision)
→ Scope screen (LLM): drop promo · no-reply · scam
→ Deterministic routing, 5 tiers (Deal · Entity · Counterparty · Investor · Function)
    Rule hit: AUTO-FILED (SharePoint + AuditLog)
    Rule miss: continue
→ LLM mini (gpt-4o-mini): JSON sub_path + confidence
→ Confidence gate (≥ 0.9? · spread > 0.15?)
    High: AUTO-FILED (SharePoint + AuditLog)
    Mid: Escalation LLM (gpt-4.1 or claude-sonnet-4-6); agree → file, disagree → triage
    Low / always-triage: TRIAGE QUEUE (human review)
Legend: stage / activity · decision point · filed outcome · human queue · duplicate branch

The SharePoint folder taxonomy

Every document filed by DIH Hub lives under /DIH/… in one of five primary objects — Deals, Entities, Counterparties, Investors, Functions. The five mirror how DIH operates day-to-day. The second level is the lifecycle or category bucket. The third level is the specific deal, entity, firm, investor, or function. The fourth is the document type (Origination, DD, IC, Constitutional, Engagement, etc.). One canonical home per document, with metadata carrying the cross-references.

Columns: Primary object · Buckets (level 2) · Per-item sub-folders · Routing trigger (first match wins)

A · Deals: investment initiatives, origination to exit
    Buckets (level 2): Active, Archive, Evaluation
    Per-item sub-folders: Origination, DD, IC, Execution, Closing
    Routing trigger: Deal codename in subject, e.g. [HEIRLOOM], [SUNFLOWER]. Killed / Exited deals move one-way to /Archive.

B · Entities: legal entities in the DIH perimeter
    Buckets (level 2): Group, Opcos, SPVs, Wind-down
    Per-item sub-folders: Constitutional, Board, Accounts, Regulator, Intercompany
    Routing trigger: No codename; entity match-keyword in subject (e.g. TASC-Infra, DIH-Holdings).

C · Counterparties: external service-providers under contract
    Buckets (level 2): Banks, Law, Audit, Consult, Vendors
    Per-item sub-folders: Engagement, Invoices, KYC
    Routing trigger: No codename; no entity match; sender domain matches a registered counterparty (e.g. @kpmg.com, @morganlewis.com).

D · Investors: external capital, group or deal-level
    Buckets (level 2): LPs, Co-Investors
    Per-item sub-folders: Commitment, KYC, Reports, Capital
    Routing trigger: Sender or recipient domain matches a registered investor (e.g. @blackrock.com, LP list).

E · Functions: cross-cutting internal disciplines
    Buckets (level 2): Finance, Legal, IR, Ops, IT, HR
    Per-item sub-folders: Tax, Treasury, Policies, Templates, Comms, Insurance, Employment, Payroll, Incentives + more
    Routing trigger: No codename, no entity, no counterparty/investor match; function keyword in subject (e.g. VAT → Finance/Tax).

The full tree

/DIH
├── /Deals                        ← PRIMARY OBJECT 1
│   ├── /Active/<codename>/{Origination, DD, IC, Execution, Closing}
│   ├── /Archive/<codename>/...     same shape — auto-moved on Killed/Exited
│   └── /Evaluation                  register only, not email-routed
│
├── /Entities                     ← PRIMARY OBJECT 2
│   ├── /Group           DIH-Holdings, intermediate holds
│   ├── /Opcos           TASC-Infra, TASC-Towers
│   ├── /SPVs            deal-specific SPVs
│   └── /Wind-down       dormant or dissolving
│       └── <entity>/{Constitutional, Board, Accounts, Regulator, Intercompany}
│
├── /Counterparties               ← PRIMARY OBJECT 3
│   ├── /Banks           Citi, UniCredit, PWP
│   ├── /Law             Morgan-Lewis, Dentons
│   ├── /Audit           KPMG, PKF-Littlejohn, SG-LLP
│   ├── /Consult         Detecon
│   └── /Vendors         IT, payroll, brokers, other
│       └── <firm>/{Engagement, Invoices, KYC}
│
├── /Investors                    ← PRIMARY OBJECT 4
│   ├── /LPs/<name>/{Commitment, KYC, Reports, Capital}
│   └── /Co-Investors/<name>/{Commitment, KYC, Reports}
│
└── /Functions                    ← PRIMARY OBJECT 5
    ├── /Finance/{Tax, Treasury, Audit-Coord}
    ├── /Legal/{Templates, Policies, KYC-Framework}
    ├── /IR/{Comms, Fundraising}
    ├── /Ops/{Insurance, Vendors-Master, Asset-Reports}
    ├── /IT/{Policies, Access, DataRoom-Admin}
    └── /HR/{Employment, Payroll, Training, Incentives}

Worked example. An email from @morganlewis.com with subject "[HEIRLOOM] SPA mark-up v3" lands at /DIH/Deals/Active/Heirloom/Execution. Tier 1 (deal codename) fires first, so the Morgan Lewis sender domain is never tested at Tier 3. The counterparty relationship is preserved as metadata on the document, making it findable from the Morgan Lewis angle without duplication. One canonical home per document.

The nine steps in detail

The diagram above is the picture. The numbered steps below are the narrative — what each box actually does, why it exists, and what could divert a document off the happy path.

Step 01 · Ingestion

Read mail & OneDrive

Every internal-team member has a mailbox and a OneDrive root that DIH Hub is allowed to watch — eight of each, listed in seed/mailboxes.csv. Each watched location is a Source, and each Source carries an IngestionCursor that describes what it has and has not read.

Delta sweep (live)

A timer fires every 5 minutes. It only walks Sources in Phase = Delta, using Graph's @odata.deltaLink token to fetch only what is new.

cron · 0 */5 * * * *

Backfill (historical)

An admin presses the Backfill button. The orchestrator pages through items filtered by receivedDateTime ge {horizon}, walking back to the configured horizon (e.g. 2 years).

POST /api/backfill?force&rerouteFiled

Two doors that never collide

Each Source belongs to exactly one phase at a time: Historical (backfill owns it) or Delta (the live sweep owns it). When a backfill exhausts the horizon for a Source, the cursor flips one-way to Delta and the live sweep takes over.
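The one-way handoff between the two doors can be pictured as a very small state machine. The following is a minimal Python sketch for illustration only, not the .NET implementation: the field names on IngestionCursor and the helper functions are assumptions, and the real cursor presumably carries more state (delta token, paging position).

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    HISTORICAL = "Historical"  # backfill owns the Source
    DELTA = "Delta"            # live sweep owns the Source


@dataclass
class IngestionCursor:
    # Hypothetical shape; only the two phase values come from the text above.
    source_id: str
    phase: Phase = Phase.HISTORICAL

    def complete_backfill(self) -> None:
        """Flip Historical -> Delta once the horizon is exhausted (idempotent)."""
        self.phase = Phase.DELTA


def sources_for_sweep(cursors):
    """The live sweep only walks Sources whose cursor is in the Delta phase."""
    return [c.source_id for c in cursors if c.phase is Phase.DELTA]


def sources_for_backfill(cursors):
    """Backfill only walks Sources that are still in the Historical phase."""
    return [c.source_id for c in cursors if c.phase is Phase.HISTORICAL]
```

Because each cursor holds exactly one phase value, the two selections are disjoint by construction, which is what keeps the sweep and the backfill from touching the same Source at the same time.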

Step 02 · Ingestion

Stage to blob & hash

For each new item, DIH Hub downloads the bytes via Microsoft Graph and writes them to its own Azure Blob storage. Access to the blob is short-lived: the SWA preview pane fetches the file only through a 15-minute SAS URL.

While staging, the pipeline computes a SHA-256 hash of the bytes. This single hash is the load-bearing identifier for the next two steps (dedup and merge).
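Hashing while staging means the bytes are read only once. A minimal sketch in Python, assuming the download arrives as an iterable of byte chunks and the staging target is any writable binary stream (both are stand-ins for the Graph download and the blob upload, which are not shown):

```python
import hashlib
import io
from typing import BinaryIO, Iterable


def stage_and_hash(chunks: Iterable[bytes], sink: BinaryIO) -> str:
    """Write each downloaded chunk to the staging sink and return the SHA-256 hex digest."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)  # hash exactly the bytes that are being staged
        sink.write(chunk)
    return digest.hexdigest()


# Usage: an in-memory buffer stands in for the blob.
staged = io.BytesIO()
content_hash = stage_and_hash([b"hello ", b"world"], staged)
```

The digest depends only on the byte content, so the same attachment arriving through two mailboxes yields the same identifier regardless of filename or sender.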

Why blob, not SharePoint, at this stage?

SharePoint is the destination, not the working area. Staging in blob means we can extract, reason, route, and gate before committing anything to SharePoint — and we can cheaply throw the blob away if the document turns out to be out of scope.

Step 03 · Ingestion

Dedup probe (SHA-256)

Before spending compute on extraction or LLM calls, the pipeline asks: have I seen this hash before?

New

No existing document with this hash. Continue to extraction.

Duplicate source

Same hash, same primary object. Silently record an extra source row (e.g. the same email landed in two shared mailboxes).

Merge candidate

Same hash, different primary object. Queue to Merge for a one-click human decision.
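The three branches reduce to one lookup and one comparison. A minimal Python sketch under stated assumptions: the index is a plain dict from hash to the primary object of the already-known document, and the incoming item's primary object is simply passed in, since how it is known at this stage is not described above.

```python
from enum import Enum


class DedupOutcome(Enum):
    NEW = "New"
    DUPLICATE_SOURCE = "Duplicate source"
    MERGE_CANDIDATE = "Merge candidate"


def probe(sha256_hex: str, incoming_primary_object: str, index: dict) -> DedupOutcome:
    """Classify an incoming item against the known hashes.

    index maps a SHA-256 hex digest to the primary object of the existing
    document (a hypothetical in-memory stand-in for the database lookup).
    """
    existing = index.get(sha256_hex)
    if existing is None:
        return DedupOutcome.NEW               # continue to extraction
    if existing == incoming_primary_object:
        return DedupOutcome.DUPLICATE_SOURCE  # record an extra source row
    return DedupOutcome.MERGE_CANDIDATE       # queue for a one-click human decision
```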

Step 04 · Understand

Extract text

The ExtractorRouter picks one of four extractors based on MIME type. Anything unmapped raises UnsupportedMimeTypeException and the document lands in the Skipped admin view.

OpenXml: .docx · .xlsx · .pptx (OpenXml + ClosedXML)

Legacy Office: .xls only (NPOI HSSF); no free .doc/.ppt parser

PDF (digital): Python pdfplumber compiled to a binary by CI; PdfPig fallback

Vision: .png · .jpeg via Azure OpenAI Vision
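The MIME-based dispatch described above might look like the following. This is an illustrative Python sketch, not the ExtractorRouter itself: the extractor labels are placeholders, and the exception class is a Python stand-in for UnsupportedMimeTypeException. The MIME strings are the standard ones for these file types.

```python
class UnsupportedMimeTypeError(Exception):
    """Python stand-in for the UnsupportedMimeTypeException named above."""


# Standard MIME types for the formats listed; the extractor labels are illustrative.
EXTRACTOR_BY_MIME = {
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "OpenXml",    # .docx
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "OpenXml",          # .xlsx
    "application/vnd.openxmlformats-officedocument.presentationml.presentation": "OpenXml",  # .pptx
    "application/vnd.ms-excel": "LegacyOffice",  # .xls
    "application/pdf": "Pdf",
    "image/png": "Vision",
    "image/jpeg": "Vision",
}


def pick_extractor(mime_type: str) -> str:
    """Return the extractor label for a MIME type, ignoring parameters and case."""
    key = mime_type.split(";", 1)[0].strip().lower()
    try:
        return EXTRACTOR_BY_MIME[key]
    except KeyError:
        # Unmapped types surface in the Skipped admin view rather than being guessed at.
        raise UnsupportedMimeTypeError(mime_type) from None
```

Under this table a legacy .doc file (application/msword) is rejected, which matches the note that no free .doc/.ppt parser is available.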

Why ship a Python binary inside a .NET app?

The .NET PDF libraries are either expensive (iText) or noticeably worse than pdfplumber on real-world decks. We keep one Function App, with no separately installed Python runtime, by having CI run PyInstaller on the script and bundle the resulting Linux binary into the .NET deployment artifact.

Output is ExtractedDocument { Text, Method, PageCount, Warning }. Text is truncated to 32 KB at the orchestrator boundary and to 16,000 characters in the LLM prompt (~4,000 tokens), after a sanitiser strips jailbreak attempts.
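The two limits operate on different units: bytes at the orchestrator boundary, characters in the prompt. A minimal Python sketch of both cuts, assuming "32 KB" means 32 × 1024 UTF-8 bytes and that the sanitiser (not shown) has already run before the prompt cut; both function names are hypothetical.

```python
ORCHESTRATOR_LIMIT_BYTES = 32 * 1024  # assumption: 32 KB counted as UTF-8 bytes
PROMPT_LIMIT_CHARS = 16_000           # roughly 4,000 tokens at about 4 characters per token


def truncate_for_orchestrator(text: str) -> str:
    """Cut text to the byte limit without leaving a half-encoded character at the end."""
    raw = text.encode("utf-8")
    if len(raw) <= ORCHESTRATOR_LIMIT_BYTES:
        return text
    # errors="ignore" drops a trailing multi-byte character split by the cut.
    return raw[:ORCHESTRATOR_LIMIT_BYTES].decode("utf-8", errors="ignore")


def truncate_for_prompt(sanitised_text: str) -> str:
    """Cut already-sanitised text to the prompt's character limit."""
    return sanitised_text[:PROMPT_LIMIT_CHARS]
```

For plain ASCII text the character limit is the tighter of the two; for text heavy in multi-byte characters the byte limit can bite first.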

Step 05 · Understand

Scope screen

A small LLM call asks: is this even something DIH Hub should file? Promotional newsletters, no-reply notices, scam attempts, and calendar invitations are dropped before they consume routing cost. Dropped items appear in the Skipped admin view with a reason — they are visible and recoverable, never silently discarded.

Step 06 · Decide

Deterministic routing — the precedence chain

The DeterministicRoutingEngine walks five tiers in strict order, short-circuiting on the first hit. A hit yields AutoFile at confidence 1.0 — no LLM cost. Most volume should land here once registers are warm.

01 · Deal codename: Explicit [CODENAME] in subject, or whole-word match in subject + filename + first 1 KB of text. Minimum 4 chars. → /DIH/Deals/…
02 · Entity keyword: Match against Entity.MatchKeywords[]. → /DIH/Entities/…
03 · Counterparty domain: Sender email domain matches a registered counterparty (e.g. @kpmg.com). → /DIH/Counterparties/…
04 · Investor domain: Parallel check: sender or recipient domain matches an investor. → /DIH/Investors/…
05 · Function keyword: Subject keyword matches a FunctionRoute (e.g. "VAT" → Finance/Tax). → /DIH/Functions/…

What about ambiguity?

Within any tier, one match → route. Multiple matches → the engine refuses to guess and sends the document straight to Triage with reason AmbiguousDeterministic. Zero matches → fall through to the next tier. Zero matches across all five tiers → on to the LLM.
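The precedence chain and its ambiguity rule can be sketched in a few lines. This is a simplified Python illustration, not the DeterministicRoutingEngine: the registers are hypothetical in-memory samples built from names used in this document (whether Sunflower is active and whether BlackRock is an LP are assumptions), matching looks at the subject and domains only, and choosing the document-type folder is left out.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical register snapshots for illustration only.
DEALS = {"HEIRLOOM": "/DIH/Deals/Active/Heirloom", "SUNFLOWER": "/DIH/Deals/Active/Sunflower"}
ENTITIES = {"TASC-Infra": "/DIH/Entities/Opcos/TASC-Infra",
            "DIH-Holdings": "/DIH/Entities/Group/DIH-Holdings"}
COUNTERPARTIES = {"kpmg.com": "/DIH/Counterparties/Audit/KPMG",
                  "morganlewis.com": "/DIH/Counterparties/Law/Morgan-Lewis"}
INVESTORS = {"blackrock.com": "/DIH/Investors/LPs/BlackRock"}
FUNCTIONS = {"VAT": "/DIH/Functions/Finance/Tax"}


@dataclass
class Mail:
    subject: str
    sender_domain: str
    recipient_domains: tuple = ()


@dataclass
class Decision:
    action: str                   # "AutoFile", "Triage" or "NeedsLlm"
    path: Optional[str] = None
    reason: Optional[str] = None


def keyword_hits(text, register):
    """Distinct paths whose keyword appears as a whole word in text (case-insensitive)."""
    return sorted({path for key, path in register.items()
                   if re.search(rf"\b{re.escape(key)}\b", text, re.IGNORECASE)})


def domain_hits(domains, register):
    """Distinct paths whose registered domain appears among the given domains."""
    return sorted({register[d.lower()] for d in domains if d.lower() in register})


# Strict order: Deal, Entity, Counterparty, Investor, Function.
TIERS = [
    lambda m: keyword_hits(m.subject, DEALS),
    lambda m: keyword_hits(m.subject, ENTITIES),
    lambda m: domain_hits([m.sender_domain], COUNTERPARTIES),
    lambda m: domain_hits([m.sender_domain, *m.recipient_domains], INVESTORS),
    lambda m: keyword_hits(m.subject, FUNCTIONS),
]


def route(mail: Mail) -> Decision:
    for tier in TIERS:
        hits = tier(mail)
        if len(hits) == 1:
            return Decision("AutoFile", path=hits[0])  # confidence 1.0, no LLM cost
        if len(hits) > 1:
            return Decision("Triage", reason="AmbiguousDeterministic")
        # zero hits: fall through to the next tier
    return Decision("NeedsLlm")


example = route(Mail("[HEIRLOOM] SPA mark-up v3", "morganlewis.com"))
```

In the usage line, the deal tier fires first, so the Morgan Lewis domain is never consulted; a subject naming two codenames would instead return Triage with reason AmbiguousDeterministic.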

Plus a learning layer

Once a path has accumulated enough learned signals from past human triage decisions, the engine will route to it directly — without paying for an LLM call. This is a reinforcement layer on top of the precedence chain, not a sixth tier; it grows as operators resolve triage items and bakes their judgments back into the routing.

Step 07 · Decide

LLM fallback — two models in series

When the rules can't decide, the document is handed to the LLM tier. Two models, both invoked through the Azure AI Foundry inference SDK:

Mini · gpt-4o-mini

Cheap and fast. Always called first. Returns structured JSON: { primary_object, sub_path, confidence_score, alternative_candidates[], reasoning, is_new_proposal }.

Escalation · gpt-4.1 or claude-sonnet-4-6

Configurable per environment. Called only when mini's confidence is in the uncertain band. Foundry routes non-OpenAI families (claude-, llama-, mistral-, phi-) through the inference endpoint by deployment-name prefix.
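The prefix rule is simple enough to show directly. A small Python sketch, assuming a case-insensitive match on the deployment name; the function name is hypothetical and nothing here calls the Foundry SDK.

```python
# Prefixes listed above for non-OpenAI model families.
NON_OPENAI_PREFIXES = ("claude-", "llama-", "mistral-", "phi-")


def is_non_openai_family(deployment_name: str) -> bool:
    """True when the deployment name starts with one of the non-OpenAI family prefixes."""
    return deployment_name.lower().startswith(NON_OPENAI_PREFIXES)
```

With this rule, claude-sonnet-4-6 is treated as a non-OpenAI family, while gpt-4.1 and gpt-4o-mini are not.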

What the LLM is actually asked

The prompt embeds DIH context (active deals, archive deals, counterparties, entities, investors, functions), the document's metadata (source mailbox, sender, recipients, subject, filename, date), and up to 16,000 characters of sanitised extracted text. The model returns a single sub_path under /DIH/<bucket>/… with a minimum of three segments, plus alternative candidates.
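A response in this shape is worth validating before it reaches the gate. The following Python sketch checks the fields named above; it assumes that sub_path is the full path starting at /DIH, that the three-segment minimum is counted after /DIH, and that confidence lies in [0, 1]. None of these details are confirmed by the text, and the function name is hypothetical.

```python
import json

REQUIRED_FIELDS = {
    "primary_object", "sub_path", "confidence_score",
    "alternative_candidates", "reasoning", "is_new_proposal",
}


def parse_mini_response(raw: str) -> dict:
    """Parse and sanity-check the model's JSON; raise ValueError on anything malformed."""
    data = json.loads(raw)  # json.JSONDecodeError is itself a ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    segments = [s for s in str(data["sub_path"]).split("/") if s]
    if not segments or segments[0] != "DIH":
        raise ValueError("sub_path must sit under /DIH")
    if len(segments) - 1 < 3:  # assumption: segments are counted after /DIH
        raise ValueError("sub_path needs at least three segments under /DIH")
    if not 0.0 <= float(data["confidence_score"]) <= 1.0:
        raise ValueError("confidence_score must be within [0, 1]")
    return data
```

Treating any malformed response as an error keeps a truncated or off-schema answer from being mistaken for a routing decision.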

Step 08 · Decide

Confidence gate — auto, escalate, or triage

Mini's response is graded against a confidence threshold AND a spread between the top two candidates. Three outcomes:

Confidence ≥ 0.9 AND top-2 spread > 0.15
Auto-file: File immediately, audit-log the decision.
Confidence in [0.5, 0.9) OR top-2 spread ≤ 0.15
Escalate: Send to the escalation model.
Confidence < 0.5, disagreement, "always-triage" path, or new register proposal
Triage: Park for a human; LLM Reclassifier may drain it offline.
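The three bands above can be written as one small function. This Python sketch rests on several assumptions: spread is taken as the top candidate's confidence minus the runner-up's, the triage conditions are checked first so they win over the escalation band, and a confidence of exactly 0.9 counts toward auto-file. Disagreement is not handled here because it can only arise after escalation.

```python
AUTO_FILE_MIN = 0.9
ESCALATE_MIN = 0.5
MIN_SPREAD = 0.15


def gate(confidence: float, runner_up: float = 0.0, *,
         always_triage: bool = False, is_new_proposal: bool = False) -> str:
    """Grade the mini model's answer: 'AutoFile', 'Escalate' or 'Triage'."""
    # Assumed ordering: triage conditions take precedence over everything else.
    if always_triage or is_new_proposal or confidence < ESCALATE_MIN:
        return "Triage"
    spread = confidence - runner_up  # assumed definition of the top-2 spread
    if confidence >= AUTO_FILE_MIN and spread > MIN_SPREAD:
        return "AutoFile"
    return "Escalate"
```

A confident answer with a close runner-up still escalates: a top score of 0.95 against a runner-up of 0.90 has a spread of only 0.05.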

After escalation

If both models agree on the path and the primary object, the document is auto-filed at the escalation model's confidence. Disagreement demotes the item to triage with reason Disagreement. Always-triage paths (IC papers, HR incentives, investor commitments) bypass auto-filing entirely no matter the confidence.
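The post-escalation rule compares the two proposals field by field. A minimal Python sketch, assuming each model's answer has been reduced to three fields and that the always-triage check is supplied as a predicate; the predicate and the AlwaysTriage reason label are hypothetical, and only the Disagreement reason comes from the text.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Proposal:
    primary_object: str
    sub_path: str
    confidence_score: float


@dataclass(frozen=True)
class Outcome:
    action: str                         # "AutoFile" or "Triage"
    path: Optional[str] = None
    confidence: Optional[float] = None
    reason: Optional[str] = None


def reconcile(mini: Proposal, escalation: Proposal,
              is_always_triage: Callable[[str], bool]) -> Outcome:
    """Combine the two model answers after escalation."""
    # Always-triage paths never auto-file, whatever the confidence.
    if is_always_triage(mini.sub_path) or is_always_triage(escalation.sub_path):
        return Outcome("Triage", reason="AlwaysTriage")  # hypothetical reason label
    agree = (mini.primary_object == escalation.primary_object
             and mini.sub_path == escalation.sub_path)
    if agree:
        # File at the escalation model's confidence, not the mini model's.
        return Outcome("AutoFile", escalation.sub_path, escalation.confidence_score)
    return Outcome("Triage", reason="Disagreement")
```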

Step 09 · File

File to SharePoint & audit

For an auto-fileable decision, the writer creates any missing folders, uploads the staged blob to the canonical path under /DIH/…, and applies folder defaults (confidentiality tier, retention rule, regulator flags). The DB row is committed and a docorg.AuditLog entry is appended in the same transaction — this is an architectural invariant, enforced by integration tests.
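The same-transaction invariant means a document row can never exist without its audit entry. The sketch below demonstrates the idea with Python's built-in sqlite3 module rather than the project's SQL Server context; the table and column names are hypothetical, and the SharePoint upload is left out.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """In-memory stand-in schema; not the real DocOrg tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE document (id TEXT PRIMARY KEY, path TEXT NOT NULL)")
    conn.execute("CREATE TABLE audit_log (doc_id TEXT NOT NULL, action TEXT NOT NULL, path TEXT NOT NULL)")
    conn.commit()
    return conn


def file_document(conn: sqlite3.Connection, doc_id: str, path: str, action: str) -> None:
    """Insert the document row and its audit entry atomically."""
    # Used as a context manager, the connection commits if the block succeeds
    # and rolls back if any statement raises, so both rows land or neither does.
    with conn:
        conn.execute("INSERT INTO document (id, path) VALUES (?, ?)", (doc_id, path))
        conn.execute(
            "INSERT INTO audit_log (doc_id, action, path) VALUES (?, ?, ?)",
            (doc_id, action, path),
        )


def count(conn: sqlite3.Connection, table: str) -> int:
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```

If the audit insert fails, for example because a required column is missing a value, the document insert is rolled back with it, which is the behaviour an integration test for this invariant would check.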

Outcome A · Auto-filed

Document is in SharePoint under its canonical path. Operators see it in Dashboard → Recent.

Outcome B · Triage queue

Operators pick a path in Triage.tsx. Resolution writes to SharePoint and appends a learned signal so the rule tier catches it next time.

Outcome C · Merge queue

Duplicate-but-different rows go to Merge.tsx for collapse / keep-as-version / keep-both, with a fingerprint recorded for future auto-merge.

Stack

One Function App. One Static Web App. Shared SQL server (new database). Shared AI Foundry account. Everything else — Storage, Key Vault, App Insights — is DIH Hub's own.

Backend

.NET 9, isolated worker, seven backend projects. Durable orchestrations coordinate per-document and per-source work.

Frontend

Vite + React 19 SWA. Types generated from openapi.yaml via pnpm gen:api. Auth via SWA's x-ms-client-principal.

Data

Own DocOrgDbContext in the shared SQL server. No cross-database joins with the sibling Contact-Graph module — only API.

Infra

Bicep at subscription scope. Own Storage, Key Vault, App Insights. Naming <type>-dih-docorg-<env>.

LLM

Foundry inference SDK. Mini deployment + configurable escalation. Polly pipeline for rate-limit, retry, and circuit-breaker.

OSS-only

Free packages only (MIT / Apache 2.0). iText is excluded (AGPL). Aspose is the flagged paid escape hatch — explicit approval required.