docs/DATA_DICTIONARY.md

Items Hub — Data Dictionary

Purpose. One reference page listing every source, what it contains, where it lands in Neon, what gets extracted from it, and what its scope / retention / access envelope is. If you're writing a consumer that queries items.*, start here.

Canonical truth is items.source_config in Neon — this document is the human-readable mirror of that table plus the per-source schema contracts the extractor produces. When in doubt, run:

SELECT source, display_name, tenant, trigger_env, cadence, status,
       alert_threshold_hours, sort_order, enabled
FROM items.source_config
ORDER BY sort_order;

Legend


The sources (17 rows, active first)

Live, Mac-login-item

| Source | Display | Tenant | Cadence | Scope |
|---|---|---|---|---|
| cloud-storage | Cloud storage (Inbounds/iCloud/Box/GDrive/Dropbox) | mixed | 30m | All files appearing under 5 watched roots. Metadata only (filename, size, provider, mtime, path hash). No content extraction by default. |
| imessage | iMessage | personal | 30m | All messages in the last 30 min, decoded from chat.db including modern attributedBody NSString format. Sender, body, chat id, date. Excludes attachments. |
| pkl | PKL (Personal Knowledge Library) | personal | 30m | Notes/PDFs added or changed under the PKL root, excluding 05_Archives. Metadata + Finder tags (portfolios[], markets[], type). Content extraction deferred. |
| pml | PML (Personal Media Library, counter) | personal | daily | Just a count: how many photos / videos added. No binary content. |

Live, Windows Claude Desktop (via Outlook MCP → Neon public.outlook_*, then Vercel cron picks up)

| Source | Display | Tenant | Cadence | Scope |
|---|---|---|---|---|
| m365-lfi | M365 — LFI (email) | lfi | 2h | Inbox emails for justin@leftfieldinv.com in last 3h window. Subject, from, to, body (HTML stripped), received_at. |
| m365-mosser | M365 — Mosser (email) | mosser | 2h | Inbox emails for jsato@mosserco.com. Same shape. |
| m365-lfi-calendar | M365 — LFI (calendar) | lfi | 2h | LFI calendar events ±7 days. Title, attendees, start/end, location, body. |
| m365-mosser-calendar | M365 — Mosser (calendar) | mosser | 2h | Mosser calendar events ±7 days. Same shape. |
| outlook-contacts | Outlook Contacts (LFI) | lfi | 6h (shared w/ M365) | LFI contact directory passthrough: name, email, company, tags. Mosser contacts intentionally skipped (duplicate set). |
| teams-chat | Teams chat (daily takeaways via email) | mosser | daily | PKM-AI-Workflow daily email with subject pattern Teams Chat - Daily Takeaways YYYY-MM-DD. Subject-routed in /api/ingest/m365 via SUBJECT_ROUTES. |
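
The subject-based routing for teams-chat can be pictured as a small pattern table. This is a minimal sketch, assuming SUBJECT_ROUTES is a list of regex-to-source pairs; its real shape in /api/ingest/m365 is not shown in this document, and `SubjectRoute` and `routeBySubject` are hypothetical names.

```typescript
// Hypothetical shape: the real SUBJECT_ROUTES in /api/ingest/m365 may differ.
interface SubjectRoute {
  pattern: RegExp;
  source: string;
}

const SUBJECT_ROUTES: SubjectRoute[] = [
  // Matches "Teams Chat - Daily Takeaways YYYY-MM-DD" (date captured in group 1).
  { pattern: /^Teams Chat - Daily Takeaways (\d{4}-\d{2}-\d{2})$/, source: "teams-chat" },
];

// Returns the routed source for a subject, or the fallback when no route matches.
function routeBySubject(subject: string, fallback: string): string {
  const trimmed = subject.trim();
  for (const route of SUBJECT_ROUTES) {
    if (route.pattern.test(trimmed)) return route.source;
  }
  return fallback;
}
```

With this shape, an email that does not match any pattern simply keeps the source of the mailbox it arrived in (passed as `fallback`).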

Live, Anthropic cloud

| Source | Display | Tenant | Cadence | Scope |
|---|---|---|---|---|
| granola | Granola — meetings | mixed | 15m | Meetings updated in last 2h. Title, participants, started_at, transcript (truncated to 16K chars), notes. |
| pbi | Power BI — Golden Data Model | mosser | 04:03 daily | Daily operational signal: total AR balance, 30+ day delinquent portion, work orders opened today, leads today, upstream freshness. Computed from Brickston Cloud SQL pbi_* staging tables. |
| sf-open-data | SF Open Data | market | 06:30 daily | 6 civic datasets (permits, violations, evictions, housing complaints, planning, rent-board petitions). Inbox row is a one-line summary of today's activity. |
| craigslist | Craigslist — SF rentals | market | 05:10 daily | Today's new / active listings + median rent / avg bedrooms / neighborhoods covered. Computed from Brickston Cloud SQL market_listings. |

Down (explicitly not pursuing — registry kept for future activation)

| Source | Display | Tenant | Why off |
|---|---|---|---|
| yardi | Yardi — Mosser PM system | mosser | No viable ingress: API not in scope; SSO-gated export is operator-dependent. Revisit when Mosser opens an API or a SOC2 data share. |
| smartsheet | Smartsheet | mixed | Low ROI for the PKM layer: operational data already surfaces via M365 attachments and Teams. |
| docusign | DocuSign | lfi | Low daily volume + signed-artifact-only workflow; every execution event already appears in email. Revisit if LFI deal volume grows. |

Payload contracts (what items.inbox_items.payload looks like)

Every inbox row has source, tenant, received_at, payload jsonb, idempotency_key, processed_at, error. The payload shape varies by source; the extractor dispatches on payload->>'kind'.
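
For a consumer, the row shape and the dispatch on `kind` might look like the following. This is a sketch under assumptions: the column names come from the sentence above and the kind names from the contracts documented below, but the TypeScript types and the names `InboxRow`, `KNOWN_KINDS`, and `payloadKind` are illustrative and not taken from the codebase.

```typescript
// Columns as listed above; the TypeScript types are assumptions about how a driver maps them.
interface InboxRow {
  source: string;
  tenant: string;
  received_at: string;
  payload: unknown; // jsonb, shape depends on the source
  idempotency_key: string;
  processed_at: string | null;
  error: string | null;
}

// The payload kinds documented in this dictionary.
const KNOWN_KINDS = [
  "email", "calendar", "meeting", "chat",
  "contact", "doc", "market_summary", "counter",
] as const;
type PayloadKind = (typeof KNOWN_KINDS)[number];

// Client-side equivalent of payload->>'kind': returns a known kind, or null.
function payloadKind(row: InboxRow): PayloadKind | null {
  const p = row.payload;
  if (typeof p !== "object" || p === null) return null;
  const kind = (p as Record<string, unknown>)["kind"];
  if (typeof kind !== "string") return null;
  return (KNOWN_KINDS as readonly string[]).indexOf(kind) >= 0
    ? (kind as PayloadKind)
    : null;
}
```

Treating an unknown or missing `kind` as `null` lets a consumer skip rows it does not understand instead of failing on them.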

kind = "email" (m365-lfi, m365-mosser)

{
  "kind": "email",
  "id": "<outlook message id>",
  "subject": "…",
  "from": { "name": "Ji Won Choi", "email": "jwchoi@mosserco.com" },
  "to":   [ { "name": "…", "email": "…" }, … ],
  "body": "…plain text…",
  "received_at": "2026-04-22T10:14:00Z",
  "thread_id": "<conversation id>",
  "tenant": "lfi" | "mosser"
}

kind = "calendar" (m365-lfi-calendar, m365-mosser-calendar)

{
  "kind": "calendar",
  "id": "<event id>",
  "subject": "…", "body": "…",
  "start_at": "…", "end_at": "…",
  "attendees": [ { "name": "…", "email": "…", "response": "…" } ],
  "location": "…",
  "tenant": "lfi" | "mosser"
}

kind = "meeting" (granola, teams-chat)

{
  "kind": "meeting",
  "id": "<granola meeting id | teams email id>",
  "title": "…",
  "started_at": "…",
  "participants": ["…","…"],
  "transcript": "…(<=16000 chars)",
  "notes": "…",
  "tenant": "mixed" | "mosser"
}
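
The 16000-character cap on `transcript` can be enforced or checked with a one-line helper. This is a sketch, assuming truncation is a plain cut counted in JavaScript string length (UTF-16 code units); the ingester's actual cut point is not specified here, and `truncateTranscript` is a hypothetical name.

```typescript
// Limit taken from the "<=16000 chars" note in the meeting contract above.
const MAX_TRANSCRIPT_CHARS = 16000;

// Assumption: a plain cut at the limit, measured in UTF-16 code units.
function truncateTranscript(transcript: string): string {
  return transcript.length <= MAX_TRANSCRIPT_CHARS
    ? transcript
    : transcript.slice(0, MAX_TRANSCRIPT_CHARS);
}
```

Consumers should treat a transcript of exactly 16000 characters as possibly incomplete.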

kind = "chat" (imessage)

{
  "kind": "chat",
  "id": "<chat.db rowid>",
  "chat_id": "iMessage;-;+14155551234",
  "handle": "+14155551234",
  "is_from_me": false,
  "text": "…(attributedBody decoded)",
  "sent_at": "…"
}

kind = "contact" (outlook-contacts)

{
  "kind": "contact",
  "id": "<outlook contact id>",
  "name": "…",
  "emails": ["…"],
  "phones": ["…"],
  "company": "…",
  "title": "…",
  "categories": ["…"],
  "tenant": "lfi"
}

kind = "doc" (cloud-storage, pkl, gdrive-fetch)

{
  "kind": "doc",
  "provider": "icloud" | "box" | "dropbox" | "gdrive-lfi" | "gdrive-lfiq" | "pkl" | "inbounds",
  "path": "/absolute/path/on/origin",
  "filename": "2025-LFI-LOI-870-Oak.pdf",
  "size_bytes": 148122,
  "mtime": "…",
  "finder_tags": ["portfolios:Mosser", "markets:SF", "type:LOI"],
  "content_hash": "sha256:…",
  "content": null // filled only when content extraction is enabled
}
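
The `finder_tags` array can be split back into the portfolios / markets / type fields mentioned for the pkl source. This is a minimal sketch, assuming every structured tag follows the `key:value` convention shown in the example payload; `ParsedFinderTags` and `parseFinderTags` are hypothetical names.

```typescript
interface ParsedFinderTags {
  portfolios: string[];
  markets: string[];
  type: string | null;
  other: string[]; // tags that do not match a recognized "key:value" prefix
}

// Assumption: tags use "key:value", as in "portfolios:Mosser" above.
function parseFinderTags(tags: string[]): ParsedFinderTags {
  const out: ParsedFinderTags = { portfolios: [], markets: [], type: null, other: [] };
  for (const tag of tags) {
    const idx = tag.indexOf(":");
    const key = idx >= 0 ? tag.slice(0, idx) : "";
    const value = idx >= 0 ? tag.slice(idx + 1).trim() : "";
    if (key === "portfolios" && value) out.portfolios.push(value);
    else if (key === "markets" && value) out.markets.push(value);
    else if (key === "type" && value) out.type = value; // last type tag wins
    else out.other.push(tag);
  }
  return out;
}
```

Unrecognized tags are kept in `other` rather than dropped, so nothing in the original array is lost.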

kind = "market_summary" (pbi, craigslist, sf-open-data)

{
  "kind": "market_summary",
  "source": "pbi" | "craigslist" | "sf-open-data",
  "tenant": "mosser" | "market",
  "as_of": "2026-04-22",
  "date":  "2026-04-22",
  "title": "pbi daily summary — 2026-04-22",
  "body":  "Mosser ops — 2026-04-22: Total AR $…; delinquent 30+ $…. Work orders opened today: … .",
  "content": "…same as body (legacy alias)…",
  "facts": { "ar_total": …, "work_orders_today": …, … },
  "details": null
}
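
Because `content` is a legacy alias of `body`, a consumer can read the summary text defensively. This is a sketch: the interface covers only a subset of the fields above, marking `body` and `content` as optional is an assumption about older rows, and `summaryText` is a hypothetical helper.

```typescript
// Subset of the market_summary contract above; optionality is an assumption.
interface MarketSummaryPayload {
  kind: "market_summary";
  source: "pbi" | "craigslist" | "sf-open-data";
  tenant: "mosser" | "market";
  as_of: string;
  title: string;
  body?: string | null;
  content?: string | null; // legacy alias of body
  facts: Record<string, unknown>;
}

// Prefer body, fall back to the legacy alias, then to an empty string.
function summaryText(p: MarketSummaryPayload): string {
  return p.body ?? p.content ?? "";
}
```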

kind = "counter" (pml)

{
  "kind": "counter",
  "date": "2026-04-22",
  "added_today": 47,
  "total": 12341
}

What the extractor produces per payload kind

The extractor (app/lib/extractor.ts) dispatches on kind to a family prompt in app/lib/prompts/ and writes normalized rows.

| Inbox kind | Normalized tables written |
|---|---|
| email, meeting, chat | items.tasks, items.commitments, items.decisions, items.entities |
| calendar | items.tasks (prep tasks), items.entities (attendees) |
| contact | items.entities only (no tasks / decisions) |
| doc | items.documents, items.entities (when content is available) |
| market_summary | items.decisions (as a dated market observation), items.entities (geos / properties mentioned) |
| counter | No normalized output; statistical only, renders on /sources/pml |
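
The mapping above can be carried in consumer code as a lookup, which makes questions like "which kinds can produce a task?" easy to answer. This is a sketch that only restates the table; `NORMALIZED_TABLES` and `kindsWriting` are hypothetical names, not exports of app/lib/extractor.ts.

```typescript
type Kind =
  | "email" | "meeting" | "chat" | "calendar"
  | "contact" | "doc" | "market_summary" | "counter";

// Restates the table above: payload kind -> normalized tables written.
const NORMALIZED_TABLES: Record<Kind, readonly string[]> = {
  email: ["items.tasks", "items.commitments", "items.decisions", "items.entities"],
  meeting: ["items.tasks", "items.commitments", "items.decisions", "items.entities"],
  chat: ["items.tasks", "items.commitments", "items.decisions", "items.entities"],
  calendar: ["items.tasks", "items.entities"],
  contact: ["items.entities"],
  doc: ["items.documents", "items.entities"],
  market_summary: ["items.decisions", "items.entities"],
  counter: [], // no normalized output
};

// Which payload kinds can produce rows in a given normalized table?
function kindsWriting(table: string): Kind[] {
  return (Object.keys(NORMALIZED_TABLES) as Kind[]).filter(
    (kind) => NORMALIZED_TABLES[kind].indexOf(table) >= 0,
  );
}
```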

Every extraction run also appends to items.embeddings (see §Embeddings pipeline below).


Embeddings pipeline

Embeddings live in Neon — one table, items.embeddings, pgvector HNSW index on cosine distance. They are the substrate for every "search across everything" feature.
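
For intuition, cosine distance is 1 minus cosine similarity, which is the measure pgvector exposes as the `<=>` operator. The sketch below computes it client-side for two vectors; `cosineDistance` is an illustrative helper, not part of the codebase.

```typescript
// Cosine distance: 1 - (a . b) / (|a| * |b|). Ranges from 0 (same direction) to 2 (opposite).
function cosineDistance(a: readonly number[], b: readonly number[]): number {
  if (a.length !== b.length) {
    throw new Error(`dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  if (normA === 0 || normB === 0) return NaN; // undefined for zero vectors
  return 1 - dot / Math.sqrt(normA * normB);
}
```

Because the measure ignores vector length, two texts embedded at different magnitudes but pointing the same way have distance 0.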

What

Where the text comes from

app/lib/embeddings.ts:canonicalText() builds the string that gets embedded. Current coverage:

| source_db | Tables embedded | Trigger |
|---|---|---|
| neon-items | items.tasks, items.commitments, items.decisions, items.entities, items.documents | After each extractor run that writes a normalized row. |
| neon-public | public.tasks, public.commitments, public.decisions, public.entities | Backfill only, via scripts/embed_backfill.py. Keeps legacy PKM rows searchable until they age out. |
| brickston | (designed; see prompt.md P2.9) Row-level Brickston Cloud SQL rows (AR ledger, work orders, leads, market listings) | Nightly job writes vectors directly into items.embeddings without replicating the source rows. |

When

Who consumes them

PII and redaction

Raw text leaves Neon at embed time and is sent to OpenAI over TLS. Only the 1536-dim vector + the canonical text are retained in items.embeddings. To skip embedding for a sensitive source (e.g. a new banking feed), gate it in canonicalText() — return null and upsertEmbeddings will no-op.
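
The gate described above might look like this. It is a sketch under stated assumptions: the real signatures of canonicalText() and upsertEmbeddings in app/lib/embeddings.ts are not shown in this document, so the row shape, the `NO_EMBED_SOURCES` set, the `banking-feed` source name, and the `countEmbeddable` stand-in are all hypothetical.

```typescript
// Hypothetical row shape; the real input to canonicalText() may differ.
interface EmbeddableRow {
  source: string;
  title: string;
  body: string;
}

// Sources whose text must never leave Neon. "banking-feed" is an illustrative name only.
const NO_EMBED_SOURCES = new Set<string>(["banking-feed"]);

// Returns the text to embed, or null to skip embedding for this row.
function canonicalText(row: EmbeddableRow): string | null {
  if (NO_EMBED_SOURCES.has(row.source)) return null; // gate: nothing is sent out
  return `${row.title}\n${row.body}`.trim();
}

// Stand-in for upsertEmbeddings: counts the rows that would actually be sent for embedding.
function countEmbeddable(rows: EmbeddableRow[]): number {
  return rows.filter((r) => canonicalText(r) !== null).length;
}
```

Keeping the gate inside canonicalText() means every caller inherits it, since a null result never reaches the embedding request.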


Scope / retention / PII


See also