Architecture Walkthrough

From mailbox to Atlas — the full pipeline, end to end

How DIH Atlas reads emails from each mailbox, turns external senders and recipients into durable Contact records, classifies them with a two-tier hybrid Lookup→LLM cascade, and associates the right Projects and Tags. Message bodies are never stored — only metadata plus the first 256 characters of bodyPreview for signature heuristics.

3 folders: Inbox · Sent · Archive
100 msgs / page: slim $select, no body
2 tiers: Dictionary → Azure OpenAI
0 message bodies persisted anywhere

The whole process at a glance

One diagram, the entire per-message pipeline as implemented in MailboxBackfillActivities.ProcessMessage — shared by the 5-minute Delta timer, manual /api/sync, and the force-backfill orchestrator.

[Diagram: End-to-end pipeline. Triggered by manual sync or the 5-minute DeltaSyncTimer (0 */5 * * * *). Single mailbox at a time, transactional, idempotent. Legend: stage / activity · decision point · auto-categorized outcome · human-review queue · dropped / silent exit.]

How emails are read & contacts are decided

For each active mailbox, the pipeline pulls messages from the Inbox, Sent Items, and Archive folders in parallel, extracts the people involved, drops bots and promotional senders, and upserts a single Contact per unique email address. Message bodies are never stored.

1.1 Trigger & pre-flight checks

Entry is either a manual POST /api/sync?mailbox=<upn> or the DeltaSyncTimer running on cron 0 */5 * * * * (every 5 minutes). Before any external call we do three local DB reads:

Allowlist: Mailboxes table must contain the UPN with IsActive=1 — security boundary, refuses sync against unknown mailboxes.
Resume cursor: MailboxCursors gives the saved DeltaLink + Phase (Historical or Delta) per (mailbox, folder).
Internal mailbox list: The full DIH mailbox set feeds the extractor so DIH staff never become contacts.

1.2 Graph pull from Inbox, Sent, and Archive

For every mailbox, three Graph delta readers run in parallel — one against Inbox, one against Sent Items, and one against Archive. Each request uses a slim projection: no body — only id, internetMessageId, from / to / cc / bcc, subject, sentDateTime, receivedDateTime, and a 256-character body preview for signature heuristics.

Each (mailbox, folder) pair has its own resume cursor — a failure or cursor invalidation in one folder doesn't reset the others.
Paged at 100 messages per request; throttling and transient errors retry with exponential backoff.
If Microsoft Graph invalidates a folder's cursor (rare), that folder restarts from the configured horizon while siblings continue uninterrupted.
Sent Items adds outbound visibility — contacts our team reached out to but who never replied still land in the graph.
Archive captures filed history — long-running deal threads moved out of Inbox aren't lost to the contact graph.
First-ever sync of a mailbox runs a historical reader (date-filtered) per folder until exhausted, then transitions to incremental delta mode.
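
A minimal sketch of how one folder's first delta request could be assembled, assuming the Microsoft Graph v1.0 endpoint and the well-known folder names inbox, sentitems, and archive. The helper name BuildDeltaUrl, the sample mailbox, and the in-memory cursor map are illustrative and not taken from the codebase. The page size of 100 would typically be requested with a Prefer: odata.maxpagesize=100 header, which is likewise an assumption here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: builds the initial delta URL for one (mailbox, folder) pair.
// The $select list is the slim projection from the text; "body" is deliberately absent.
static string BuildDeltaUrl(string mailboxUpn, string wellKnownFolder)
{
    string[] fields =
    {
        "id", "internetMessageId", "from", "toRecipients", "ccRecipients",
        "bccRecipients", "subject", "sentDateTime", "receivedDateTime", "bodyPreview"
    };
    return "https://graph.microsoft.com/v1.0/users/" + Uri.EscapeDataString(mailboxUpn)
         + "/mailFolders/" + wellKnownFolder
         + "/messages/delta?$select=" + string.Join(",", fields);
}

// One cursor per (mailbox, folder): a saved link wins, otherwise the folder starts fresh.
var cursors = new Dictionary<(string Upn, string Folder), string>();
foreach (var folder in new[] { "inbox", "sentitems", "archive" })
{
    var key = (Upn: "alice@example.com", Folder: folder);
    string url = cursors.TryGetValue(key, out var saved) ? saved : BuildDeltaUrl(key.Upn, folder);
    Console.WriteLine(url);
}
```

Keying the cursor map on the (mailbox, folder) pair is what lets one folder restart from scratch while its siblings keep their saved links.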

1.3 Extracting contact candidates

For every message, ContactExtractor.Extract walks the recipient slots in From → To → Cc → Bcc order (first slot wins on duplicates) and emits a ContactCandidate per unique address that survives a multi-layer noise filter:

Subject filter (whole message): Outlook / Teams meeting auto-replies ("Accepted:", "Declined:", "Tentative:") are dropped — no candidates produced.
Internal-mailbox filter: any address in the DIH allowlist is skipped.
Exact local-part blocklist: info, powerautomatenoreply, 365copilotupdates — concatenated automation senders the regex below misses.
Bounded noise regex on local-part: matches no-?reply, do-?not-?reply, noreply, donotreply, notifications?, newsletter, marketing, reports? — bounded by start / separator / digit on the left so it catches office365reports@, mssecurity-noreply@, etc.
Sub-domain prefix blocklist: domain starts with noreply., no-reply., notifications., notify., updates., alerts. → dropped even if the local-part is innocuous.
Marketing domain blocklist: substring match against a curated list (telecompaper, tmtfinance, intralinks, gsma, docusign, etc.) drops every local-part on that domain.
Display-name cleanup: names containing “via X” or “on behalf of X” are nulled out — the platform is sending, not the human.
Domain normalisation: mail. / email. sub-domain prefixes are stripped → CompanyDomain.
Free-mail flag: IsFreemail = true when CompanyDomain is in the well-known free-mail set (gmail.com, outlook.com, hotmail.com, proton.me, …).
Signature snippet: if bodyPreview contains an em-dash on its own line, or “Best regards,” / “Kind regards,” / “Regards,” / etc., the next 200 characters are extracted as SignatureSnippet.
Hard cap: at most MaxCandidatesPerMessage = 50 per message.
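
The address-level filters above can be sketched roughly as follows. This is a reconstruction under stated assumptions: the lists are abbreviated from the text, the regex is one plausible reading of "bounded by start / separator / digit on the left" (the real pattern is not shown in this document), and the names IsNoiseAddress and NormalizeDomain are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Lists are abbreviated from the text; the regex is a reconstruction, not the real pattern.
var exactLocalParts = new HashSet<string> { "info", "powerautomatenoreply", "365copilotupdates" };
var subDomainPrefixes = new[] { "noreply.", "no-reply.", "notifications.", "notify.", "updates.", "alerts." };
var marketingDomains = new[] { "telecompaper", "tmtfinance", "intralinks", "gsma", "docusign" };

// Left-bounded: a keyword only counts after the start, a separator, or a digit.
var noiseLocalPart = new Regex(
    @"(?:^|[._\-+]|\d)(?:no-?reply|do-?not-?reply|notifications?|newsletter|marketing|reports?)",
    RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

bool IsNoiseAddress(string email)
{
    int at = email.LastIndexOf('@');
    if (at <= 0 || at == email.Length - 1) return true;   // malformed input is dropped (assumption)
    string local = email[..at].ToLowerInvariant();
    string domain = email[(at + 1)..].ToLowerInvariant();

    return exactLocalParts.Contains(local)
        || noiseLocalPart.IsMatch(local)
        || subDomainPrefixes.Any(p => domain.StartsWith(p, StringComparison.Ordinal))
        || marketingDomains.Any(d => domain.Contains(d, StringComparison.Ordinal));
}

// Strips a leading mail. / email. label to obtain CompanyDomain.
static string NormalizeDomain(string domain)
{
    domain = domain.ToLowerInvariant();
    foreach (var prefix in new[] { "mail.", "email." })
        if (domain.StartsWith(prefix, StringComparison.Ordinal))
            return domain[prefix.Length..];
    return domain;
}

Console.WriteLine(IsNoiseAddress("office365reports@contoso.com"));
Console.WriteLine(NormalizeDomain("mail.contoso.com"));
```

The left boundary is why powerautomatenoreply needs the exact blocklist: its "noreply" follows a letter, so the bounded regex alone does not fire.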

1.4 LLM noise classifier: catches anything the regex missed

Each candidate that survives the regex layer is then checked against AzureOpenAIEmailNoiseClassifier, a small LLM call returning a strict-schema { isNoise, reason } response. The classifier targets bot / promo / transactional / newsletter senders that look like real humans (e.g. jira-noreply@…).

Per-email memory cache with a 24-hour sliding window — once an address is classified, repeat sightings short-circuit the LLM call.
LLM unavailability is a soft failure: defaults to isNoise=false (losing a real contact is worse than admitting a few bots — the next sync re-evaluates).
Prompt is conservative: "when uncertain, prefer isNoise=false".
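
The caching and soft-failure behaviour could look roughly like the sketch below. It uses a plain dictionary instead of the real memory cache, takes the LLM call as a delegate (classify), and injects the clock (now) so it can be exercised deterministically. All three are simplifications for illustration, and not caching a failed call is an assumption drawn from "the next sync re-evaluates".

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

var window = TimeSpan.FromHours(24);
var cache = new Dictionary<string, (bool IsNoise, DateTimeOffset LastSeen)>(StringComparer.OrdinalIgnoreCase);

// classify stands in for the real LLM call; now is injected to keep the sketch testable.
async Task<bool> IsNoiseAsync(string email, Func<string, Task<bool>> classify, DateTimeOffset now)
{
    if (cache.TryGetValue(email, out var hit) && now - hit.LastSeen < window)
    {
        cache[email] = (hit.IsNoise, now);   // sliding window: each sighting renews the entry
        return hit.IsNoise;
    }
    try
    {
        bool isNoise = await classify(email);
        cache[email] = (isNoise, now);
        return isNoise;
    }
    catch (Exception)
    {
        // Soft failure: keep the candidate and do not cache, so a later sync asks again.
        return false;
    }
}

Console.WriteLine(await IsNoiseAsync("digest@example.com", _ => Task.FromResult(true), DateTimeOffset.UtcNow));
```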

1.5 The upsert: one transaction per candidate

ContactUpsertService runs a ReadCommitted transaction per candidate. Contacts are keyed on the lowercased PrimaryEmail (filtered unique index):

Hit → bump LastSeenAt, increment InteractionCount, fill Name / IsFreemail if previously empty.
Miss → INSERT new Contact with placeholder sources (CategorySource=Rule, OwnerSource=Unassigned); race fallback catches unique-constraint violations and retries as an update.
Insert one EmailInteraction row with direction (Inbound / Outbound / Cc / Bcc). Composite unique index (MailboxUpn, MessageId, ContactId) guarantees idempotency.
Recompute IsShared: true when ≥ 2 distinct mailboxes have seen the contact.
For brand-new contacts only: a fuzzy-merge check runs against existing same-domain contacts — Levenshtein similarity on the lowercased Name at threshold ≥ 0.85 emits a MergeCandidate for human review.
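
As an illustration of the fuzzy-merge check, the sketch below computes a Levenshtein-based similarity on lowercased names. Normalising the edit distance by the longer name's length is an assumption; the document only states the metric and the 0.85 threshold. The function names are hypothetical.

```csharp
using System;

// Classic two-row dynamic-programming edit distance.
static int Levenshtein(string a, string b)
{
    var prev = new int[b.Length + 1];
    var curr = new int[b.Length + 1];
    for (int j = 0; j <= b.Length; j++) prev[j] = j;
    for (int i = 1; i <= a.Length; i++)
    {
        curr[0] = i;
        for (int j = 1; j <= b.Length; j++)
        {
            int cost = a[i - 1] == b[j - 1] ? 0 : 1;
            curr[j] = Math.Min(Math.Min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
        }
        (prev, curr) = (curr, prev);   // the row just filled becomes the previous row
    }
    return prev[b.Length];
}

// Similarity in [0, 1]: 1 minus distance over the longer length (assumed normalisation).
static double NameSimilarity(string a, string b)
{
    a = a.ToLowerInvariant();
    b = b.ToLowerInvariant();
    int longest = Math.Max(a.Length, b.Length);
    return longest == 0 ? 1.0 : 1.0 - (double)Levenshtein(a, b) / longest;
}

bool isMergeCandidate = NameSimilarity("Jon Smith", "John Smith") >= 0.85;
Console.WriteLine(isMergeCandidate);
```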

1.6 Cursor persistence: resumable per page

The Durable orchestrator persists the cursor via the PersistCursor activity after every page. A mid-stream throttle or timeout resumes from the last persisted nextLink, not from the start. On the final page, the saved cursor flips to @odata.deltaLink for the next incremental sync.
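
The resume behaviour can be illustrated with a small in-memory simulation. The page table, the link strings, and the Drain function are made up for the sketch, and the persisted list stands in for the PersistCursor activity.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Fake Graph responses keyed by request link; a page carries either a nextLink or a deltaLink.
var pages = new Dictionary<string, (int Count, string? Next, string? Delta)>
{
    ["start"] = (100, "page2", null),
    ["page2"] = (100, "page3", null),
    ["page3"] = (37, null, "delta-1"),
};
var persisted = new List<string>();   // stands in for the PersistCursor activity

string Drain(string cursor)
{
    while (true)
    {
        var page = pages[cursor];
        // ... process page.Count messages here ...
        cursor = page.Next ?? page.Delta ?? throw new InvalidOperationException("page without a link");
        persisted.Add(cursor);                   // saved after every page, so a crash resumes here
        if (page.Next is null) return cursor;    // final page: the cursor is now the deltaLink
    }
}

Console.WriteLine(Drain("start"));
```

If the run dies after the second page, the last persisted value is "page3", and calling Drain("page3") finishes the folder without re-reading the first 200 messages.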

Why this design? "Skip internal mailboxes" guarantees DIH staff never become contacts — the graph is strictly external counterparties seen by our mailboxes. Idempotency at every layer (contact unique index, interaction unique index, per-page cursor save) makes re-syncing safe under any failure mode.

How classification (category assignment) works

Every contact gets exactly one category. The cascade is a two-tier hybrid: a deterministic dictionary lookup on the company domain, falling through to Azure OpenAI when the domain is unknown. High-confidence LLM answers extend the dictionary so the next call on the same domain is free.

Tier 1 DictionaryCategorizer

Deterministic · ~0ms

The whole DomainDictionary table is reloaded every 5 minutes into IMemoryCache under a single key. Lookup is an exact, case-insensitive CompanyDomain match.

Hit → return CategoryResult(Category, Confidence, Source = Rule).
Miss → throw CategoryUnknownException; HybridCategorizer catches it and calls Tier 2.

B2B domain counts are bounded (typically < 1k unique domains), so one cache entry covers the whole table with no per-row eviction logic.
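
A minimal sketch of the single-snapshot lookup, assuming a plain dictionary in place of IMemoryCache and a null return in place of CategoryUnknownException. The table rows and category names are invented for the example, and the clock is passed in so the 5-minute reload can be followed by hand.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

var ttl = TimeSpan.FromMinutes(5);
var loadedAt = DateTimeOffset.MinValue;
var snapshot = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
int loads = 0;

// Stand-in for reading the whole DomainDictionary table; rows are made up.
Dictionary<string, string> LoadTable()
{
    loads++;
    return new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
    {
        ["contoso.com"] = "Investor",
        ["fabrikam.com"] = "LegalCounsel",
    };
}

// Returns the category on a hit, or null on a miss (the caller then falls through to Tier 2).
string? Lookup(string companyDomain, DateTimeOffset now)
{
    if (now - loadedAt >= ttl) { snapshot = LoadTable(); loadedAt = now; }
    return snapshot.TryGetValue(companyDomain, out var category) ? category : null;
}

// Called after a high-confidence LLM writeback so the next lookup reloads the table.
void Invalidate() => loadedAt = DateTimeOffset.MinValue;
```

Because the whole table sits behind one timestamp, invalidation is a single assignment rather than per-row eviction.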

Tier 2 AzureOpenAICategorizer

gpt-5-mini · strict JSON

Wrapped in CategorizerResiliencePipeline — a Polly v8 stack with four layers, shared across all Azure OpenAI consumers so the rate budget is global:

RateLimiter

SlidingWindow (6 segments, 1-min window) sized to the AOAI deployment RPM. Shared across categorizer / project-tag / noise / company extractors — overflow rejects with RateLimiterRejectedException, which is mapped to CategorizerUnavailableException so the contact is parked for AiRetryTimer.

CircuitBreaker

5 failures within 2 min open the circuit for 5 min. While it is open, subsequent calls fail fast with BrokenCircuitException.

Retry

3 attempts. DelayGenerator honors the upstream Retry-After header when present (capped at MaxRetryAfterSeconds); falls back to exponential 1s → 2s → 4s otherwise. Triggers on 429 + 5xx + timeout. 400 is never retried — that's a prompt or schema bug.

Timeout

10 seconds per attempt, configurable via AzureOpenAI:TimeoutSeconds. TimeoutRejectedException counts as a failure for both retry and the breaker.
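
The retry layer's delay and trigger rules can be expressed as two pure functions. This is a sketch of the decision logic only, not Polly configuration. The function names are hypothetical and the default cap of 30 seconds is a placeholder, since the document does not give the value of MaxRetryAfterSeconds.

```csharp
using System;

// attempt is 1-based. A Retry-After hint wins but is capped; otherwise 1s, 2s, 4s.
// maxRetryAfterSeconds mirrors the MaxRetryAfterSeconds setting; 30 is a placeholder value.
static TimeSpan NextDelay(int attempt, TimeSpan? retryAfter, int maxRetryAfterSeconds = 30)
{
    if (retryAfter is { } hinted)
    {
        var cap = TimeSpan.FromSeconds(maxRetryAfterSeconds);
        return hinted < cap ? hinted : cap;
    }
    return TimeSpan.FromSeconds(Math.Pow(2, attempt - 1));
}

// Retries on 429, any 5xx, or a timeout. A 400 is a prompt or schema bug and is never retried.
static bool ShouldRetry(int? statusCode, bool timedOut)
    => timedOut || statusCode is 429 or (>= 500 and <= 599);

Console.WriteLine(NextDelay(2, null));
Console.WriteLine(ShouldRetry(400, false));
```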

JSON schema (strict mode):

category is a free-form PascalCase string — the schema deliberately doesn't pin it to an enum. Server-side validation decides whether the slug is known or net-new.
confidence is constrained to [0, 1]; role and company are nullable strings.
isNewCategory + newCategoryDisplayName + newCategoryDescription let the model explicitly signal a new bucket.

Three response branches:

Known slug + confidence ≥ 0.7 → upsert DomainDictionary with Source = LLM, invalidate the Tier-1 cache, return Source = LLM.
Known slug + confidence < 0.7 → no writes, return Source = LowConfidence. Surfaced in the UI as an orange pill for human triage.
Net-new slug OR isNewCategory=true → queue a Pending CategoryProposal and return PreserveExistingCategory = true so the contact's existing category is not overwritten. A flaky LLM call can never blank a good category.
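
The three branches reduce to a small decision function. The sketch below returns descriptive strings rather than the service's real result types, and the known-slug set is invented for illustration.

```csharp
using System;
using System.Collections.Generic;

const double Threshold = 0.7;
var knownSlugs = new HashSet<string> { "Investor", "LegalCounsel" };   // illustrative categories

// Outcome labels are descriptive strings for the sketch, not the service's real types.
string Decide(string category, double confidence, bool isNewCategory)
{
    // A net-new slug or an explicit hint always goes to human review, whatever the confidence.
    if (isNewCategory || !knownSlugs.Contains(category))
        return "QueueProposal+PreserveExistingCategory";

    return confidence >= Threshold
        ? "WriteDictionary+SourceLLM"
        : "SourceLowConfidence";
}

Console.WriteLine(Decide("Investor", 0.91, false));
```

Checking the proposal branch first is what keeps a confident but unknown slug from being written to the dictionary.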

2.1 Manual is sticky: humans always win

The first decision in TryCategoriseAsync is if (contact.CategorySource == Manual) return;. Once a user PATCHes a category through the API, no automated path overwrites it.

CategorySource values: Manual · Rule · LLM · LowConfidence

2.2 Self-extending dictionary: the cascade gets smarter over time

Each successful high-confidence LLM call writes one row to DomainDictionary and removes the Tier-1 cache key. The next contact on the same domain — even within the same sync — gets an instant Tier-1 hit and pays no AI tokens.

2.3 Dynamic taxonomy: categories aren't enum-locked

The category list is read from the Categories table at runtime via CategoryRegistry. The LLM's system prompt is rebuilt from this table on every call, including each category's description. When an admin edits a description in the UI, the registry cache invalidates and the LLM's prompt picks up the new wording on the next call — admins can steer future categorisations without a redeploy.

Confirmed CategoryProposals become new rows in the Categories table; rejected proposals are marked and don't influence future prompts.

2.4 Failure containment: the sync never aborts because of AOAI

Circuit-open, rate-limit rejection, or repeated timeouts are mapped to CategorizerUnavailableException and caught at the call site. The contact is marked AiStatus = Pending with an AiPendingWork bitmask of exactly which AI steps need replay; the AiRetryTimer (and the post-backfill Durable AiDrainOrchestrator) re-runs only the failed steps on an exponential schedule:

Attempt 1 → 60s · 2 → 2m · 3 → 4m · 4 → 8m · 5 → 16m · 6 → 32m · 7 → 1h (capped) · 8 → 1h.
±10s of jitter; total span ~24h before the contact flips to Abandoned.
During Historical-phase backfill, DeferAiClassification=true skips every inline AOAI call and queues all four AI bits (Noise · Categorize · CompanyExtraction · ProjectsAndTags) so the worker stays memory-bounded; AiRetryTimer drains them later at AOAI quota pace.
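
The schedule above is a doubling delay capped at one hour, which the following sketch reproduces. The function names and the bit values chosen for the AiPendingWork flags are assumptions; only the four step names come from the text.

```csharp
using System;

// Base delay for a 1-based attempt number: 60s doubling, capped at one hour.
static TimeSpan BaseDelay(int attempt)
{
    double seconds = 60 * Math.Pow(2, Math.Min(attempt - 1, 10));   // exponent clamped to avoid huge values
    return TimeSpan.FromSeconds(Math.Min(seconds, 3600));
}

// Adds jitter in the range of plus or minus 10 seconds; the Random is injected for reproducibility.
static TimeSpan DelayWithJitter(int attempt, Random rng)
    => BaseDelay(attempt) + TimeSpan.FromSeconds(rng.NextDouble() * 20 - 10);

// Pending-work bitmask: one bit per AI step (bit values are assumed).
const int Noise = 1, Categorize = 2, CompanyExtraction = 4, ProjectsAndTags = 8;
int pending = Noise | Categorize | CompanyExtraction | ProjectsAndTags;   // deferred backfill queues all four
pending &= ~Categorize;                                                   // a step that succeeded is cleared
bool stillNeedsTags = (pending & ProjectsAndTags) != 0;

Console.WriteLine(BaseDelay(7));
Console.WriteLine(stillNeedsTags);
```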

How projects & tags are associated

Same hybrid Lookup → LLM shape as categorization, applied to the message's subject + bodyPreview (not the domain). Multiple projects/tags per contact are allowed by design, but the pipeline applies strict guards at every layer so the table doesn't fill with noise.

Tier 1 LookupProjectTagExtractor

Regex · ~0ms

Reads Project.Name + Tag.Name columns into MemoryCache (5-min TTL). The haystack is subject + "\n" + bodyPreview. Matching uses a word-boundary regex, not raw substring:

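// Escape the name so characters such as '.' or '+' are matched literally.
var pattern = $@"\b{Regex.Escape(name)}\b";
if (Regex.IsMatch(haystack, pattern, RegexOptions.IgnoreCase | RegexOptions.CultureInvariant))
    hits.Add(name);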

Word-boundary anchors stop 1-2 char names like "Q3" or stray "D" from matching inside today / Dear / Email3.
MinNameLength = 2 — single-letter names in the table are ignored as defense-in-depth.
Zero hits across both arrays → ProjectTagMissException → fall through to Tier 2.
The Rule path returns Source = Rule, Confidence = null.
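
Put together, the Tier-1 matcher could be sketched as below. FindNames is a hypothetical name, and the caller is assumed to treat an empty result as the miss that triggers Tier 2.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

const int MinNameLength = 2;

// Returns every known name that appears as a whole word in subject + bodyPreview.
List<string> FindNames(IEnumerable<string> names, string subject, string bodyPreview)
{
    string haystack = subject + "\n" + bodyPreview;
    return names
        .Where(n => n.Length >= MinNameLength)   // single-letter names are ignored
        .Where(n => Regex.IsMatch(haystack, $@"\b{Regex.Escape(n)}\b",
                                  RegexOptions.IgnoreCase | RegexOptions.CultureInvariant))
        .ToList();
}

var hits = FindNames(new[] { "Q3", "D", "MESEC" }, "RE: mesec update", "Dear all, see Email3 from today");
Console.WriteLine(string.Join(", ", hits));
```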

Tier 2 AzureOpenAIProjectTagExtractor

Open arrays · strict prompt

Same Polly stack as categorization. JSON schema is two string arrays + one confidence:

{ "projects": ["..."], "tags": ["..."], "confidence": 0.0-1.0 }

The prompt is heavily constrained — the model is explicitly told what NOT to return:

Allowed projects: stable deal codenames (e.g. "MESEC", "HEIRLOOM", "SUNFLOWER"). Proper nouns the team uses for specific deals.
Allowed tags: deal workstreams or transaction documents that would belong on a CRM deal record (e.g. "NDA", "Term Sheet", "Due Diligence", "Lender Counsel").
Never tag: financial metrics (ARR, EBITDA, IRR, capex), generic acronyms outside a recognised doc list, product / vendor names (DocuSign, Slack, Teams), step labels ("Phase 1"), standalone person or firm names, generic business words ("Strategy", "Update", "Review").
Return empty for entire categories: personal admin, newsletters / digests, marketing, calendar logistics, transactional notifications.
Max 3 tags per call. Model is instructed to rank by relevance and drop the weakest if more than 3 candidates appear.
MinNameLength = 2. Single-letter names are never returned.

Confidence gate (0.7) is applied: if the model's calibrated confidence is below the threshold, both arrays are cleared before persistence. The cleared result is still written to the structured logs, so a regression in model confidence remains visible.

3.1 Persistence, with per-contact caps

ContactProjectTagWriter takes the extractor result and writes links, but with two hard caps:

Max 3 tags per contact (aggregate). Once a contact has 3 ContactTag rows, later extractions can refresh existing links but won't add new ones.
Per-extraction tag cap at 3 (defense-in-depth on top of the prompt).
Idempotent re-links of already-attached tags don't count toward the cap — re-running the same extraction can never push a contact past the ceiling.
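
One way to read these caps is as a filter that decides which extracted tags become new links. The sketch below is an interpretation, not the writer's actual code: TagsToAdd is a hypothetical name, and case-insensitive tag comparison is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

const int MaxTagsPerContact = 3;
const int MaxTagsPerExtraction = 3;

// Returns only the tags that would create new links; already-attached tags are refreshed elsewhere.
List<string> TagsToAdd(ISet<string> attached, IEnumerable<string> extracted)
{
    var fresh = extracted
        .Take(MaxTagsPerExtraction)                      // per-extraction cap
        .Where(t => !attached.Contains(t))               // re-links do not consume budget
        .Distinct(StringComparer.OrdinalIgnoreCase);
    int room = Math.Max(0, MaxTagsPerContact - attached.Count);
    return fresh.Take(room).ToList();                    // aggregate cap per contact
}

var attachedTags = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "NDA", "Term Sheet" };
var toAdd = TagsToAdd(attachedTags, new[] { "nda", "Due Diligence", "Lender Counsel" });
Console.WriteLine(string.Join(", ", toAdd));
```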

3.2 HITL: project names are humans-only, tags aren't

Asymmetric on purpose — projects map to revenue-bearing engagements, so a net-new project name does not auto-create a Project row. Instead a Pending ProjectProposal is recorded (deduplicated on name, occurrence count bumped on every repeat sighting). A reviewer confirms (creates the Project + the ContactProject link) or rejects.

Tags, by contrast, are upserted directly into the Tag table — their taxonomy is intentionally looser and they carry no compliance / billing weight.
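
The asymmetry can be shown with two small handlers over in-memory sets. The names HandleProject and HandleTag, the sample data, and the case-insensitive de-duplication are assumptions made for the sketch.

```csharp
using System;
using System.Collections.Generic;

var projects = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "MESEC" };
var tags = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
var pendingProposals = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);   // name -> occurrence count

// Returns true when the project already exists and can be linked immediately.
bool HandleProject(string name)
{
    if (projects.Contains(name)) return true;
    pendingProposals[name] = pendingProposals.TryGetValue(name, out var seen) ? seen + 1 : 1;
    return false;   // waits for a reviewer to confirm or reject
}

// Tags skip review: unknown names are created on first sight.
void HandleTag(string name) => tags.Add(name);

HandleProject("SUNFLOWER");
HandleTag("NDA");
Console.WriteLine(pendingProposals["SUNFLOWER"]);
```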

3.3 Why this step runs last

Projects & tags are processed only after the contact's category and owner are already persisted. That way a hiccup in the LLM during this step can never roll back already-classified contact state — the worst case is that the project/tag links arrive on the next sweep.

Category vs. Project/Tag — side by side

Same hybrid shape, different cardinality and different match key.

Concern	Category	Project / Tag
Lookup key	Exact match on `CompanyDomain`	Word-boundary regex on `subject + bodyPreview`
LLM output	Free-form PascalCase string + isNewCategory hint	Two open string arrays — model picks names from the message
Confidence granularity	Per result (one category)	One per LLM call — covers the whole response
Confidence threshold (0.7)	Below → `Source=LowConfidence`, no dictionary writeback	Below → both arrays emptied before persistence
Cardinality per contact	Exactly 1	Many projects · max 3 tags (hard cap)
Storage	Scalar columns on `Contact`	`ContactProject` / `ContactTag` join rows
HITL queue	`CategoryProposals` (any net-new slug)	`ProjectProposals` only (tags upsert directly)

Design note: the pipeline is intentionally boring where it can be (relational schema, deterministic rules, an explicit mailbox allowlist, capped cardinality) and expressive where the data demands it (hybrid LLM, dynamic taxonomy, HITL queues). Every layer that could go unbounded (message size, candidates per message, tags per contact, AI retry attempts) has an explicit cap.

Owner resolution — deterministic, zero AI

Every contact is assigned an owning DIH mailbox. This stage uses no AI — it's pure rules, by design.

4.1 Resolution order

1. Manual override: if OwnerSource=Manual and OwnerUpn is set, return immediately. Permanent until a human changes it.
2. OwnershipRule match: cached for 10 min, ordered by Priority desc. First match wins. Rule types: Manual + DomainSpecific (case-insensitive match against CompanyDomain) and CategoryDefault (match against Category). Returns Source=Rule.
3. AutoMailbox: 90-day activity, two-pass with To-precedence:
   - First pass: count EmailInteractions in the last 90 days where direction is Inbound or Outbound only (the mailbox was on From or To). Pick the highest-count mailbox.
   - If no principal interactions exist, fall back to counting every direction including Cc / Bcc, so a CC-only contact still gets an owner.
   - Deterministic tie-break by mailbox UPN alphabetical order.
   - Returns Source=AutoMailbox.
4. Unassigned: nothing matched, OwnerUpn=null, Source=Unassigned. Surfaces in the /contacts/unassigned queue for human assignment.
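
The AutoMailbox step lends itself to a short sketch. The tuple shape is a simplified stand-in for EmailInteraction, the direction values are plain strings, and PickAutoMailbox is a hypothetical name. A null result corresponds to falling through to Unassigned.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Each interaction: owning mailbox, direction, and timestamp.
static string? PickAutoMailbox(
    IReadOnlyList<(string Mailbox, string Direction, DateTimeOffset At)> interactions,
    DateTimeOffset now)
{
    var recent = interactions.Where(i => i.At >= now.AddDays(-90)).ToList();
    var principal = recent.Where(i => i.Direction is "Inbound" or "Outbound").ToList();
    var pool = principal.Count > 0 ? principal : recent;   // second pass also counts Cc / Bcc

    return pool
        .GroupBy(i => i.Mailbox, StringComparer.OrdinalIgnoreCase)
        .OrderByDescending(g => g.Count())
        .ThenBy(g => g.Key, StringComparer.OrdinalIgnoreCase)   // alphabetical tie-break
        .Select(g => g.Key)
        .FirstOrDefault();   // null means no owner was found
}

var today = new DateTimeOffset(2024, 6, 1, 0, 0, 0, TimeSpan.Zero);
var sample = new List<(string, string, DateTimeOffset)>
{
    ("carol@example.com", "Cc", today.AddDays(-1)),
    ("carol@example.com", "Cc", today.AddDays(-2)),
    ("carol@example.com", "Cc", today.AddDays(-3)),
    ("bob@example.com", "Inbound", today.AddDays(-5)),
    ("bob@example.com", "Outbound", today.AddDays(-6)),
    ("alice@example.com", "Inbound", today.AddDays(-7)),
};
Console.WriteLine(PickAutoMailbox(sample, today));
```

In the sample, carol has the most rows but only on Cc, so the first pass ignores her and the owner comes from the From / To counts.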