From mailbox to Atlas — the full pipeline, end to end
How DIH Atlas reads emails from each mailbox, turns external senders and
recipients into durable Contact records, classifies them with a two-tier hybrid
Lookup→LLM cascade, and associates the right Projects and Tags. Message bodies are
never stored — only metadata plus the first 256 characters of bodyPreview
for signature heuristics.
The whole process at a glance
One diagram, the entire per-message pipeline as implemented in
MailboxBackfillActivities.ProcessMessage —
shared by the 5-minute Delta timer, manual /api/sync,
and the force-backfill orchestrator.
End-to-end pipeline
Triggered by manual sync or the 5-minute DeltaSyncTimer
(0 */5 * * * *).
Single-mailbox at a time, transactional, idempotent.
How emails are read & contacts are decided
For each active mailbox, the pipeline pulls messages from the Inbox, Sent Items, and Archive folders in parallel, extracts the people involved, drops bots and promotional senders, and upserts a single Contact per unique email address. Message bodies are never stored.
Trigger & pre-flight checks
Entry is either a manual POST /api/sync?mailbox=<upn> or the
DeltaSyncTimer running on cron
0 */5 * * * * (every 5 minutes). Before any external call we do
three local DB reads:
- Allowlist: the Mailboxes table must contain the UPN with IsActive=1. This is the security boundary: sync is refused against unknown mailboxes.
- Resume cursor: MailboxCursors gives the saved DeltaLink + Phase (Historical or Delta) per (mailbox, folder).
- Internal mailbox list: the full DIH mailbox set feeds the extractor so DIH staff never become contacts.
Graph pull from Inbox, Sent, and Archive
For every mailbox, three Graph delta readers run in parallel — one against Inbox, one against Sent Items, and one against Archive. Each request uses a slim projection: no body — only id, internetMessageId, from / to / cc / bcc, subject, sentDateTime, receivedDateTime, and a 256-character body preview for signature heuristics.
- Each (mailbox, folder) pair has its own resume cursor — a failure or cursor invalidation in one folder doesn't reset the others.
- Paged at 100 messages per request; throttling and transient errors retry with exponential backoff.
- If Microsoft Graph invalidates a folder's cursor (rare), that folder restarts from the configured horizon while siblings continue uninterrupted.
- Sent Items adds outbound visibility — contacts our team reached out to but who never replied still land in the graph.
- Archive captures filed history — long-running deal threads moved out of Inbox aren't lost to the contact graph.
- First-ever sync of a mailbox runs a historical reader (date-filtered) per folder until exhausted, then transitions to incremental delta mode.
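The slim projection and page size described above can be sketched as a request builder. This is a minimal Python illustration, not the service's own code: `build_delta_request`, `SELECT_FIELDS`, and `FOLDERS` are hypothetical names. The Graph property names (`toRecipients`, `ccRecipients`, `bccRecipients`), the well-known folder names, and the `Prefer: odata.maxpagesize` header are standard Microsoft Graph conventions assumed to be what the pipeline uses.

```python
from typing import Dict, Optional, Tuple
from urllib.parse import quote

# Metadata-only projection: "body" is deliberately absent, "bodyPreview" is kept.
SELECT_FIELDS = [
    "id", "internetMessageId", "from", "toRecipients", "ccRecipients",
    "bccRecipients", "subject", "sentDateTime", "receivedDateTime", "bodyPreview",
]
FOLDERS = ["inbox", "sentitems", "archive"]  # Graph well-known folder names
PAGE_SIZE = 100


def build_delta_request(
    mailbox_upn: str, folder: str, cursor: Optional[str] = None
) -> Tuple[str, Dict[str, str]]:
    """Return (url, headers) for one page of a folder's delta query."""
    headers = {"Prefer": f"odata.maxpagesize={PAGE_SIZE}"}
    if cursor:
        # A saved nextLink / deltaLink is an opaque, complete URL: reuse it as-is.
        return cursor, headers
    url = (
        "https://graph.microsoft.com/v1.0/users/"
        f"{quote(mailbox_upn)}/mailFolders/{folder}/messages/delta"
        f"?$select={','.join(SELECT_FIELDS)}"
    )
    return url, headers
```

Because each (mailbox, folder) pair carries its own cursor, a caller would invoke this once per entry in `FOLDERS` and pass that folder's saved link, if any.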
Extracting contact candidates
For every message, ContactExtractor.Extract walks the recipient slots
in From → To → Cc → Bcc order (first slot wins on duplicates)
and emits a ContactCandidate per unique address that survives a
multi-layer noise filter:
- Subject filter (whole message): Outlook / Teams meeting auto-replies ("Accepted:", "Declined:", "Tentative:") are dropped; no candidates are produced.
- Internal-mailbox filter: any address in the DIH allowlist is skipped.
- Exact local-part blocklist: info, powerautomatenoreply, 365copilotupdates (concatenated automation senders the regex below misses).
- Bounded noise regex on local-part: matches no-?reply, do-?not-?reply, noreply, donotreply, notifications?, newsletter, marketing, reports?, bounded by start / separator / digit on the left so it catches office365reports@, mssecurity-noreply@, etc.
- Sub-domain prefix blocklist: a domain starting with noreply., no-reply., notifications., notify., updates., or alerts. is dropped even if the local-part is innocuous (e.g. [email protected]).
- Marketing domain blocklist: substring match against a curated list (telecompaper, tmtfinance, intralinks, gsma, docusign, etc.) drops every local-part on that domain.
- Display-name cleanup: names containing “via X” or “on behalf of X” are nulled out, because the platform is sending, not the human.
- Domain normalisation: mail. / email. sub-domain prefixes are stripped → CompanyDomain.
- Free-mail flag: IsFreemail = true when CompanyDomain is in the well-known free-mail set (gmail.com, outlook.com, hotmail.com, proton.me, …).
- Signature snippet: if bodyPreview contains an em-dash on its own line, or “Best regards,” / “Kind regards,” / “Regards,” / etc., the next 200 characters are extracted as SignatureSnippet.
- Hard cap: at most MaxCandidatesPerMessage = 50 per message.
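The address-level filters in the list above can be approximated as follows. This is a Python sketch under stated assumptions, not the actual ContactExtractor: the separator set in the regex (`.`, `_`, `+`, `-`), the truncated marketing list, and the choice to treat malformed addresses as noise are all guesses, and `is_noise_address` is a hypothetical name.

```python
import re

EXACT_LOCAL_BLOCKLIST = {"info", "powerautomatenoreply", "365copilotupdates"}
NOISE_SUBDOMAIN_PREFIXES = (
    "noreply.", "no-reply.", "notifications.", "notify.", "updates.", "alerts.",
)
# Only the entries named in the text; the real curated list is longer.
MARKETING_DOMAIN_FRAGMENTS = ("telecompaper", "tmtfinance", "intralinks", "gsma", "docusign")

# Left-bounded only: start of string, a separator, or a digit must precede the keyword.
# "no-?reply" also covers "noreply"; "do-?not-?reply" also covers "donotreply".
NOISE_LOCAL_RE = re.compile(
    r"(?:^|[._+\-]|\d)"
    r"(?:no-?reply|do-?not-?reply|notifications?|newsletter|marketing|reports?)",
    re.IGNORECASE,
)


def is_noise_address(address: str) -> bool:
    """Return True when the address should be dropped before any LLM call."""
    local, _, domain = address.lower().rpartition("@")
    if not local or not domain:
        return True  # assumption: malformed addresses are treated as noise
    if local in EXACT_LOCAL_BLOCKLIST:
        return True
    if NOISE_LOCAL_RE.search(local):
        return True
    if domain.startswith(NOISE_SUBDOMAIN_PREFIXES):
        return True
    # Substring match, so every sub-domain of a marketing platform is caught too.
    return any(fragment in domain for fragment in MARKETING_DOMAIN_FRAGMENTS)
```

The left boundary explains why the exact blocklist is still needed: in `powerautomatenoreply` the keyword follows a letter, so the regex alone does not fire.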
LLM noise classifier — catches anything the regex missed
Each candidate that survives the regex layer is then checked against
AzureOpenAIEmailNoiseClassifier — a small LLM call returning a
strict-schema { isNoise, reason } response. The classifier targets
bot / promo / transactional / newsletter senders that look like real humans
(e.g. [email protected], jira-noreply@…).
- Per-email memory cache with a 24-hour sliding window: once an address is classified, repeat sightings short-circuit the LLM call.
- LLM unavailability is a soft failure: defaults to isNoise=false (losing a real contact is worse than admitting a few bots, and the next sync re-evaluates).
- Prompt is conservative: "when uncertain, prefer isNoise=false".
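The cache and soft-failure behaviour can be illustrated with a small wrapper. This is a Python sketch, not AzureOpenAIEmailNoiseClassifier itself: `NoiseVerdictCache` and its injected `classify` callable are hypothetical. Not caching the fallback verdict is an assumption made here so that a later sync can re-evaluate the address.

```python
import time
from typing import Callable, Dict, Tuple


class NoiseVerdictCache:
    """Per-address verdict cache with a sliding expiry window."""

    def __init__(
        self,
        classify: Callable[[str], bool],
        ttl_seconds: float = 24 * 3600,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._classify = classify
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[bool, float]] = {}  # address -> (verdict, last access)

    def is_noise(self, address: str) -> bool:
        key = address.lower()
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[1] < self._ttl:
            self._entries[key] = (hit[0], now)  # sliding window: every hit renews the entry
            return hit[0]
        try:
            verdict = self._classify(key)
        except Exception:
            # Soft failure: admit the contact and do not cache the fallback.
            return False
        self._entries[key] = (verdict, now)
        return verdict
```

Injecting the clock keeps the expiry logic testable without waiting 24 hours.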
The upsert — one transaction per candidate
ContactUpsertService runs a ReadCommitted transaction
per candidate. Contacts are keyed on the lowercased PrimaryEmail
(filtered unique index):
- Hit → bump LastSeenAt, increment InteractionCount, fill Name / IsFreemail if previously empty.
- Miss → INSERT new Contact with placeholder sources (CategorySource=Rule, OwnerSource=Unassigned); a race fallback catches unique-constraint violations and retries as an update.
- Insert one EmailInteraction row with direction (Inbound / Outbound / Cc / Bcc). Composite unique index (MailboxUpn, MessageId, ContactId) guarantees idempotency.
- Recompute IsShared: true when ≥ 2 distinct mailboxes have seen the contact.
- For brand-new contacts only: a fuzzy-merge check runs against existing same-domain contacts. Levenshtein similarity on the lowercased Name at threshold ≥ 0.85 emits a MergeCandidate for human review.
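The fuzzy-merge check in the last bullet can be sketched in Python. The text does not say how edit distance is turned into a similarity score; this sketch assumes the common normalisation `1 - distance / max(len)`. `name_similarity` and `merge_candidates` are hypothetical names.

```python
from typing import List

MERGE_THRESHOLD = 0.85


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] on lowercased names (assumed normalisation)."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def merge_candidates(new_name: str, same_domain_names: List[str]) -> List[str]:
    """Existing same-domain names close enough to flag for human review."""
    return [n for n in same_domain_names
            if name_similarity(new_name, n) >= MERGE_THRESHOLD]
```

Under this normalisation, "Jon Smith" against "John Smith" is one edit over ten characters (0.9), which clears the 0.85 threshold.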
Cursor persistence — resumable per page
The Durable orchestrator persists the cursor via the PersistCursor
activity after every page. A mid-stream throttle or timeout
resumes from the last persisted nextLink, not from the start. On
the final page, the saved cursor flips to @odata.deltaLink for the
next incremental sync.
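The per-page persistence loop can be sketched as below. This is a Python illustration with injected callables, not the Durable orchestrator: `sync_folder`, `fetch_page`, `persist_cursor`, and `process` are hypothetical. The response keys `value`, `@odata.nextLink`, and `@odata.deltaLink` follow the usual Graph delta response shape.

```python
from typing import Any, Callable, Dict


def sync_folder(
    fetch_page: Callable[[str], Dict[str, Any]],
    persist_cursor: Callable[[str], None],
    process: Callable[[Any], None],
    start_cursor: str,
) -> None:
    """Drain one folder, saving the cursor after every page."""
    cursor = start_cursor
    while True:
        page = fetch_page(cursor)  # may raise on throttle or timeout
        for message in page["value"]:
            process(message)
        next_link = page.get("@odata.nextLink")
        if next_link:
            # Persist before fetching the next page, so a crash resumes here.
            persist_cursor(next_link)
            cursor = next_link
            continue
        # Final page: store the deltaLink for the next incremental sync.
        persist_cursor(page["@odata.deltaLink"])
        return
```

If `fetch_page` fails on page three, the saved cursor already points at page three, so a restart with that cursor skips the first two pages. Any message reprocessed after a partial page is absorbed by the idempotent upsert.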
How classification (category assignment) works
Every contact gets exactly one category. The cascade is a two-tier hybrid: a deterministic dictionary lookup on the company domain, falling through to Azure OpenAI when the domain is unknown. High-confidence LLM answers extend the dictionary so the next call on the same domain is free.
The whole DomainDictionary table is loaded once per 5 minutes into
IMemoryCache under a single key. Lookup is an exact, case-insensitive
CompanyDomain match.
- Hit → return CategoryResult(Category, Confidence, Source = Rule).
- Miss → throw CategoryUnknownException; HybridCategorizer catches it and calls Tier 2.
B2B domain counts are bounded (typically < 1k unique domains), so one cache entry covers the whole table with no per-row eviction logic.
Wrapped in CategorizerResiliencePipeline — a Polly v8 stack with
four layers, shared across all Azure OpenAI consumers so the rate budget is
global:
- Rate limiter: a rejection raises RateLimiterRejectedException, which is mapped to CategorizerUnavailableException so the contact is parked for AiRetryTimer.
- Circuit breaker: while the circuit is open, BrokenCircuitException fails fast on subsequent calls.
- Retry: DelayGenerator honors the upstream Retry-After header when present (capped at MaxRetryAfterSeconds); falls back to exponential 1s → 2s → 4s otherwise. Triggers on 429 + 5xx + timeout. 400 is never retried, because that's a prompt or schema bug.
- Timeout: bounded by AzureOpenAI:TimeoutSeconds. TimeoutRejectedException counts as a failure for both retry and the breaker.

JSON schema (strict mode):
- category is a free-form PascalCase string. The schema deliberately doesn't pin it to an enum; server-side validation decides whether the slug is known or net-new.
- confidence is constrained to [0, 1]; role and company are nullable strings.
- isNewCategory + newCategoryDisplayName + newCategoryDescription let the model explicitly signal a new bucket.
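The retry layer's delay rule (honor Retry-After up to a cap, otherwise back off 1s → 2s → 4s) can be sketched in Python. This is not the Polly DelayGenerator itself. The value `30.0` for `MAX_RETRY_AFTER_SECONDS` is a placeholder, since the text does not state the configured cap. `retry_delay` and `should_retry` are hypothetical names.

```python
from typing import Optional

MAX_RETRY_AFTER_SECONDS = 30.0  # placeholder; the real MaxRetryAfterSeconds is not stated
BASE_DELAY_SECONDS = 1.0


def retry_delay(attempt: int, retry_after_seconds: Optional[float] = None) -> float:
    """Delay before retry number `attempt` (0-based)."""
    if retry_after_seconds is not None:
        # Upstream hint wins, but is clamped to the configured cap.
        return min(max(retry_after_seconds, 0.0), MAX_RETRY_AFTER_SECONDS)
    return BASE_DELAY_SECONDS * (2 ** attempt)  # 1s, 2s, 4s


def should_retry(status_code: Optional[int], timed_out: bool = False) -> bool:
    """Retry on 429, 5xx, and timeouts; never on 400-class request bugs."""
    if timed_out:
        return True
    if status_code is None:
        return False
    return status_code == 429 or 500 <= status_code <= 599
```

Capping the upstream hint keeps a single slow call from stalling the sync; longer waits are left to the AiRetryTimer path.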
Three response branches:
- Known slug + confidence ≥ 0.7 → upsert DomainDictionary with Source = LLM, invalidate the Tier-1 cache, return Source = LLM.
- Known slug + confidence < 0.7 → no writes, return Source = LowConfidence. Surfaced in the UI as an orange pill for human triage.
- Net-new slug OR isNewCategory=true → queue a Pending CategoryProposal and return PreserveExistingCategory = true so the contact's existing category is not overwritten. A flaky LLM call can never blank a good category.
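The two-tier cascade and its three response branches can be sketched end to end in Python. This is an illustration, not HybridCategorizer. A plain dict stands in for the cached DomainDictionary and `ask_llm` is an injected callable. The `"Proposal"` source label and the `None` confidence on the rule path are placeholders, because the text does not give those values.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set

CONFIDENCE_THRESHOLD = 0.7


@dataclass
class LlmAnswer:
    category: str
    confidence: float
    is_new_category: bool = False


@dataclass
class CategoryResult:
    category: Optional[str]
    confidence: Optional[float]
    source: str  # "Rule", "LLM", "LowConfidence", or the placeholder "Proposal"
    preserve_existing: bool = False


def categorize(
    domain: str,
    dictionary: Dict[str, str],
    known_slugs: Set[str],
    ask_llm: Callable[[str], LlmAnswer],
    proposals: List[str],
) -> CategoryResult:
    key = domain.lower()
    if key in dictionary:  # Tier 1: exact, case-insensitive lookup
        return CategoryResult(dictionary[key], None, "Rule")
    answer = ask_llm(key)  # Tier 2: only reached on a miss
    if answer.is_new_category or answer.category not in known_slugs:
        proposals.append(answer.category)  # queued for human review
        return CategoryResult(None, answer.confidence, "Proposal", preserve_existing=True)
    if answer.confidence >= CONFIDENCE_THRESHOLD:
        dictionary[key] = answer.category  # self-extending dictionary
        return CategoryResult(answer.category, answer.confidence, "LLM")
    return CategoryResult(answer.category, answer.confidence, "LowConfidence")
```

Calling `categorize` twice for the same domain after a high-confidence answer reaches the LLM only once, because the second call is served from the dictionary.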
Manual is sticky — humans always win
The first decision in TryCategoriseAsync is
if (contact.CategorySource == Manual) return;. Once a user PATCHes a
category through the API, no automated path overwrites it.
Self-extending dictionary — the cascade gets smarter over time
Each successful high-confidence LLM call writes one row to
DomainDictionary and removes the Tier-1 cache key. The next contact
on the same domain — even within the same sync — gets an instant Tier-1 hit and
pays no AI tokens.
Dynamic taxonomy — categories aren't enum-locked
The category list is read from the Categories table at runtime via
CategoryRegistry. The LLM's system prompt is rebuilt from this
table on every call, including each category's description. When an admin edits
a description in the UI, the registry cache invalidates and the LLM's prompt
picks up the new wording on the next call — admins can steer future
categorisations without a redeploy.
Confirmed CategoryProposals become new rows in the
Categories table; rejected proposals are marked and don't influence
future prompts.
Failure containment — the sync never aborts because of AOAI
Circuit-open, rate-limit rejection, or repeated timeouts are mapped to
CategorizerUnavailableException and caught at the call site. The
contact is marked AiStatus = Pending with an AiPendingWork
bitmask of exactly which AI steps need replay; the AiRetryTimer
(and the post-backfill Durable AiDrainOrchestrator) re-runs only
the failed steps on an exponential schedule:
- Attempt 1 → 60s · 2 → 2m · 3 → 4m · 4 → 8m · 5 → 16m · 6 → 32m · 7 → 1h (capped) · 8 → 1h.
- ±10s of jitter; total span ~24h before the contact flips to Abandoned.
- During Historical-phase backfill, DeferAiClassification=true skips every inline AOAI call and queues all four AI bits (Noise · Categorize · CompanyExtraction · ProjectsAndTags) so the worker stays memory-bounded; AiRetryTimer drains them later at AOAI quota pace.
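The replay schedule and the pending-work bitmask can be sketched in Python. The delays follow the list above (doubling from 60s, capped at one hour, ±10s jitter). The numeric bit values in `AiPendingWork` and the helper names are assumptions; only the four step names come from the text.

```python
import random
from enum import IntFlag


class AiPendingWork(IntFlag):
    NONE = 0
    NOISE = 1                # bit values are illustrative assumptions
    CATEGORIZE = 2
    COMPANY_EXTRACTION = 4
    PROJECTS_AND_TAGS = 8


BASE_SECONDS = 60
CAP_SECONDS = 3600
JITTER_SECONDS = 10


def base_delay(attempt: int) -> int:
    """Un-jittered delay for a 1-based attempt number."""
    return min(BASE_SECONDS * 2 ** (attempt - 1), CAP_SECONDS)


def next_delay(attempt: int) -> float:
    """Delay with uniform jitter of plus or minus JITTER_SECONDS."""
    return base_delay(attempt) + random.uniform(-JITTER_SECONDS, JITTER_SECONDS)


def mark_done(pending: AiPendingWork, step: AiPendingWork) -> AiPendingWork:
    """Clear one step's bit so only the still-failed steps are replayed."""
    return pending & ~step
```

Attempt 7 would be 64 minutes uncapped, which is why the list shows it pinned at one hour.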
How projects & tags are associated
Same hybrid Lookup → LLM shape as categorization, applied to the message's subject + bodyPreview (not the domain). Multiple projects/tags per contact are allowed by design, but the pipeline applies strict guards at every layer so the table doesn't fill with noise.
Reads Project.Name + Tag.Name columns into
MemoryCache (5-min TTL). The haystack is
subject + "\n" + bodyPreview. Matching uses a
word-boundary regex, not raw substring:
// Requires: using System.Text.RegularExpressions;
var pattern = $@"\b{Regex.Escape(name)}\b";
if (Regex.IsMatch(haystack, pattern, RegexOptions.IgnoreCase | RegexOptions.CultureInvariant))
    hits.Add(name);
- Word-boundary anchors stop 1-2 char names like "Q3" or stray "D" from matching inside today / Dear / Email3.
- MinNameLength = 2: single-letter names in the table are ignored as defense-in-depth.
- Zero hits across both arrays → ProjectTagMissException → fall through to Tier 2.
- The Rule path returns Source = Rule, Confidence = null.
Same Polly stack as categorization. JSON schema is two string arrays + one confidence:
{ "projects": ["..."], "tags": ["..."], "confidence": 0.0-1.0 }
The prompt is heavily constrained — the model is explicitly told what NOT to return:
- Allowed projects: stable deal codenames (e.g. "MESEC", "HEIRLOOM", "SUNFLOWER"). Proper nouns the team uses for specific deals.
- Allowed tags: deal workstreams or transaction documents that would belong on a CRM deal record (e.g. "NDA", "Term Sheet", "Due Diligence", "Lender Counsel").
- Never tag: financial metrics (ARR, EBITDA, IRR, capex), generic acronyms outside a recognised doc list, product / vendor names (DocuSign, Slack, Teams), step labels ("Phase 1"), standalone person or firm names, generic business words ("Strategy", "Update", "Review").
- Return empty for entire categories: personal admin, newsletters / digests, marketing, calendar logistics, transactional notifications.
- Max 3 tags per call. Model is instructed to rank by relevance and drop the weakest if more than 3 candidates appear.
- MinNameLength = 2. Single-letter names are never returned.
Confidence gate (0.7) IS applied: if the model's calibrated confidence is below the threshold, both arrays are cleared to empty before persistence. A regression is still surfaced in the structured logs.
Persistence — with per-contact caps
ContactProjectTagWriter takes the extractor result and writes
links, but with two hard caps:
- Max 3 tags per contact (aggregate). Once a contact has 3 ContactTag rows, later extractions can refresh existing links but won't add new ones.
- Per-extraction tag cap at 3 (defense-in-depth on top of the prompt).
- Idempotent re-links of already-attached tags don't count toward the cap — re-running the same extraction can never push a contact past the ceiling.
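The interaction between the two caps and idempotent re-links can be sketched in Python. This is not ContactProjectTagWriter: `apply_tag_links` is a hypothetical name, and exact string comparison of tag names is an assumption, since the text does not say whether matching is case-sensitive.

```python
from typing import List

MAX_TAGS_PER_CONTACT = 3
MAX_TAGS_PER_EXTRACTION = 3


def apply_tag_links(existing: List[str], extracted: List[str]) -> List[str]:
    """Return the contact's tag list after one extraction is applied."""
    result = list(existing)
    # Per-extraction cap: the model ranks by relevance, so keep the head.
    for tag in extracted[:MAX_TAGS_PER_EXTRACTION]:
        if tag in result:
            continue  # re-link of an attached tag: refreshed, not counted
        if len(result) < MAX_TAGS_PER_CONTACT:
            result.append(tag)
    return result
```

For example, a contact holding "NDA" and "Term Sheet" that receives "NDA", "Due Diligence", and "Lender Counsel" ends with three tags: the re-link is free, one new tag fits, and the last one is dropped.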
HITL: project names are humans-only, tags aren't
Asymmetric on purpose — projects map to revenue-bearing
engagements, so a net-new project name does not auto-create a Project
row. Instead a Pending ProjectProposal is recorded
(deduplicated on name, occurrence count bumped on every repeat sighting). A
reviewer confirms (creates the Project + the
ContactProject link) or rejects.
Tags, by contrast, are upserted directly into the
Tag table — their taxonomy is intentionally looser and they carry
no compliance / billing weight.
Why this step runs last
Projects & tags are processed only after the contact's category and owner are already persisted. That way a hiccup in the LLM during this step can never roll back already-classified contact state — the worst case is that the project/tag links arrive on the next sweep.
Category vs. Project/Tag — side by side
Same hybrid shape, different cardinality and different match key.
| Concern | Category | Project / Tag |
|---|---|---|
| Lookup key | Exact match on CompanyDomain | Word-boundary regex on subject + bodyPreview |
| LLM output | Free-form PascalCase string + isNewCategory hint | Two open string arrays: model picks names from the message |
| Confidence granularity | Per result (one category) | One per LLM call, covering the whole response |
| Confidence threshold (0.7) | Below → Source=LowConfidence, no dictionary writeback | Below → both arrays emptied before persistence |
| Cardinality per contact | Exactly 1 | Many projects · max 3 tags (hard cap) |
| Storage | Scalar columns on Contact | ContactProject / ContactTag join rows |
| HITL queue | CategoryProposals (any net-new slug) | ProjectProposals only (tags upsert directly) |
Owner resolution — deterministic, zero AI
Every contact is assigned an owning DIH mailbox. This stage uses no AI — it's pure rules, by design.
Resolution order
- Manual override: if OwnerSource=Manual and OwnerUpn is set, return immediately. Permanent until a human changes it.
- OwnershipRule match: cached for 10 min, ordered by Priority desc. First match wins. Rule types: Manual + DomainSpecific (case-insensitive match against CompanyDomain) and CategoryDefault (match against Category). Returns Source=Rule.
- AutoMailbox: 90-day activity, two-pass with To-precedence:
  - First pass: count EmailInteractions in the last 90 days where direction is Inbound or Outbound only (the mailbox was on From or To). Pick the highest-count mailbox.
  - If no principal interactions exist, fall back to counting every direction including Cc / Bcc, so a CC-only contact still gets an owner.
  - Deterministic tie-break by mailbox UPN alphabetical order.
  - Returns Source=AutoMailbox.
- Unassigned: nothing matched, OwnerUpn=null, Source=Unassigned. Surfaces in the /contacts/unassigned queue for human assignment.
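The resolution order can be sketched in Python. This is an illustration, not the production resolver. Rule evaluation is collapsed into a precomputed `rule_owner` argument, and the interactions are assumed to be already filtered to the last 90 days. `Interaction`, `pick_auto_mailbox`, and `resolve_owner` are hypothetical names.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple

PRINCIPAL_DIRECTIONS = {"Inbound", "Outbound"}


@dataclass
class Interaction:
    mailbox_upn: str
    direction: str  # "Inbound", "Outbound", "Cc", or "Bcc"


def pick_auto_mailbox(interactions: List[Interaction]) -> Optional[str]:
    """Two-pass pick: principal directions first, then every direction."""
    def top(rows: List[Interaction]) -> Optional[str]:
        counts = Counter(r.mailbox_upn for r in rows)
        if not counts:
            return None
        # Highest count wins; ties break on alphabetical UPN.
        return min(counts, key=lambda upn: (-counts[upn], upn))

    principal = [r for r in interactions if r.direction in PRINCIPAL_DIRECTIONS]
    return top(principal) or top(interactions)


def resolve_owner(
    owner_source: str,
    owner_upn: Optional[str],
    rule_owner: Optional[str],
    interactions: List[Interaction],
) -> Tuple[Optional[str], str]:
    if owner_source == "Manual" and owner_upn:
        return owner_upn, "Manual"      # sticky human decision
    if rule_owner:
        return rule_owner, "Rule"       # first matching OwnershipRule
    auto = pick_auto_mailbox(interactions)
    if auto:
        return auto, "AutoMailbox"
    return None, "Unassigned"
```

Sorting on the pair `(-count, upn)` makes the tie-break deterministic, so repeated runs over the same interactions always pick the same owner.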