Items Hub — Data Dictionary

Purpose. One reference page listing every source, what it contains, where it lands in Neon, what gets extracted from it, and what its scope / retention / access envelope is. If you're writing a consumer that queries items.*, start here.

Canonical truth is items.source_config in Neon — this document is the human-readable mirror of that table plus the per-source schema contracts the extractor produces. When in doubt, run:

SELECT source, display_name, tenant, trigger_env, cadence, status,
       alert_threshold_hours, sort_order, enabled
FROM items.source_config
ORDER BY sort_order;

Legend

Status: live (actively ingesting), degraded (live but past staleness threshold), broken (recent errors), designed (planned, not built), down (explicitly shut off; see the "Off — not currently ingesting" section on the dashboard).
Tenant: lfi · mosser · personal · market · mixed.
Trigger env: anthropic-cloud (scheduled task in the Anthropic sandbox), windows-claude-desktop (always-on Windows PC at dockerpc.local), mac-login-item (LaunchAgent on Justin's MacBook), none (off / manual only).
Inbox payload kind: the discriminator inside inbox_items.payload->>'kind'. The extractor dispatches per-kind prompts.
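
To make the relationship between `degraded` and `alert_threshold_hours` concrete, here is a minimal sketch. It assumes a live source counts as degraded once its newest inbox row is older than the threshold; how the dashboard actually derives status is not documented here. `SourceConfigRow` covers only some of the columns from the query above, and `effectiveStatus` and `lastReceivedAt` are hypothetical names.

```typescript
type Status = "live" | "degraded" | "broken" | "designed" | "down";

// Hypothetical subset of an items.source_config row.
interface SourceConfigRow {
  source: string;
  status: Status;
  alert_threshold_hours: number;
  enabled: boolean;
}

// Assumption: only "live" sources can be downgraded to "degraded";
// every other status passes through unchanged.
function effectiveStatus(
  row: SourceConfigRow,
  lastReceivedAt: Date,
  now: Date = new Date(),
): Status {
  if (row.status !== "live") return row.status;
  const ageHours = (now.getTime() - lastReceivedAt.getTime()) / 3_600_000;
  return ageHours > row.alert_threshold_hours ? "degraded" : "live";
}
```

A consumer could call this with the newest `received_at` per source to flag stale feeds before trusting their data.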

The sources (17 rows, active first)

Live, Mac-login-item

Source	Display	Tenant	Cadence	Scope
`cloud-storage`	Cloud storage (Inbounds/iCloud/Box/GDrive/Dropbox)	mixed	30m	All files appearing under 5 watched roots. Metadata only (filename, size, provider, mtime, path hash). No content extraction by default.
`imessage`	iMessage	personal	30m	All messages in the last 30 min, decoded from `chat.db` including modern `attributedBody` NSString format. Sender, body, chat id, date. Excludes attachments.
`pkl`	PKL (Personal Knowledge Library)	personal	30m	Notes/PDFs added or changed under the PKL root, excluding `05_Archives`. Metadata + Finder tags (`portfolios[]`, `markets[]`, `type`). Content extraction deferred.
`pml`	PML (Personal Media Library, counter)	personal	daily	Just a count: how many photos / videos added. No binary content.

Live, Windows Claude Desktop (via Outlook MCP → Neon `public.outlook_*`, then Vercel cron picks up)

Source	Display	Tenant	Cadence	Scope
`m365-lfi`	M365 — LFI (email)	lfi	2h	Inbox emails for `justin@leftfieldinv.com` in last 3h window. Subject, from, to, body (HTML stripped), received_at.
`m365-mosser`	M365 — Mosser (email)	mosser	2h	Inbox emails for `jsato@mosserco.com`. Same shape.
`m365-lfi-calendar`	M365 — LFI (calendar)	lfi	2h	LFI calendar events ±7 days. Title, attendees, start/end, location, body.
`m365-mosser-calendar`	M365 — Mosser (calendar)	mosser	2h	Mosser calendar events ±7 days. Same shape.
`outlook-contacts`	Outlook Contacts (LFI)	lfi	6h (shared w/ M365)	LFI contact directory passthrough — name, email, company, tags. Mosser contacts intentionally skipped (duplicate set).
`teams-chat`	Teams chat (daily takeaways via email)	mosser	daily	PKM-AI-Workflow daily email with subject pattern `Teams Chat - Daily Takeaways YYYY-MM-DD`. Subject-routed in `/api/ingest/m365` via `SUBJECT_ROUTES`.
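
The subject routing used for `teams-chat` can be pictured roughly as below. This is a sketch only: the real shape of `SUBJECT_ROUTES` in `/api/ingest/m365` is not shown in this document, so the `SubjectRoute` interface, the regular expression, and `routeBySubject` are assumptions built from the documented subject pattern.

```typescript
// Hypothetical shape; the real SUBJECT_ROUTES may differ.
interface SubjectRoute {
  pattern: RegExp;
  source: string;
}

const SUBJECT_ROUTES: SubjectRoute[] = [
  // Matches e.g. "Teams Chat - Daily Takeaways 2026-04-22".
  { pattern: /^Teams Chat - Daily Takeaways \d{4}-\d{2}-\d{2}$/, source: "teams-chat" },
];

// Return the routed source for a subject, or the fallback
// (the mailbox's own source) when no route matches.
function routeBySubject(subject: string, fallback: string): string {
  const hit = SUBJECT_ROUTES.find((route) => route.pattern.test(subject.trim()));
  return hit ? hit.source : fallback;
}
```

Under this sketch, any other email in the Mosser mailbox keeps its default source, `m365-mosser`.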

Live, Anthropic cloud

Source	Display	Tenant	Cadence	Scope
`granola`	Granola — meetings	mixed	15m	Meetings updated in last 2h. Title, participants, started_at, transcript (truncated to 16K chars), notes.
`pbi`	Power BI — Golden Data Model	mosser	04:03 daily	Daily operational signal: total AR balance, 30+ day delinquent portion, work orders opened today, leads today, upstream freshness. Computed from Brickston Cloud SQL `pbi_*` staging tables.
`sf-open-data`	SF Open Data	market	06:30 daily	6 civic datasets (permits, violations, evictions, housing complaints, planning, rent-board petitions). Inbox row is a one-line summary of today's activity.
`craigslist`	Craigslist — SF rentals	market	05:10 daily	Today's new / active listings + median rent / avg bedrooms / neighborhoods covered. Computed from Brickston Cloud SQL `market_listings`.

Down (explicitly not pursuing — registry kept for future activation)

Source	Display	Tenant	Why off
`yardi`	Yardi — Mosser PM system	mosser	No viable ingress — API not in scope; SSO-gated export is operator-dependent. Revisit when Mosser opens an API or a SOC2 data share.
`smartsheet`	Smartsheet	mixed	Low ROI for the PKM layer — operational data already surfaces via M365 attachments and Teams.
`docusign`	DocuSign	lfi	Low daily volume + signed-artifact-only workflow; every execution event already appears in email. Revisit if LFI deal volume grows.

Payload contracts (what `items.inbox_items.payload` looks like)

Every inbox row has source, tenant, received_at, payload jsonb, idempotency_key, processed_at, error. The payload shape varies by source; the extractor dispatches on payload->>'kind'.
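
For consumers written in TypeScript, the row shape and the discriminator might be modeled as follows. The column names come from the description above and the kind list from the contracts in this section. The TypeScript types, the nullability of `processed_at` and `error`, and the names `InboxRow` and `payloadKind` are assumptions.

```typescript
const KNOWN_KINDS = [
  "email", "calendar", "meeting", "chat",
  "contact", "doc", "market_summary", "counter",
] as const;
type PayloadKind = (typeof KNOWN_KINDS)[number];

// Assumed typing of an items.inbox_items row.
interface InboxRow {
  source: string;
  tenant: string;
  received_at: string; // ISO timestamp
  payload: Record<string, unknown>;
  idempotency_key: string;
  processed_at: string | null; // assumption: null until processed
  error: string | null;
}

// Read the discriminator; null means the kind is missing or unknown.
function payloadKind(row: InboxRow): PayloadKind | null {
  const kind = row.payload["kind"];
  return typeof kind === "string" && (KNOWN_KINDS as readonly string[]).includes(kind)
    ? (kind as PayloadKind)
    : null;
}
```

Returning `null` for unknown kinds lets a consumer skip rows from sources added after it was written, instead of failing on them.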

`kind = "email"` (m365-lfi, m365-mosser)

{
  "kind": "email",
  "id": "<outlook message id>",
  "subject": "…",
  "from": { "name": "Ji Won Choi", "email": "jwchoi@mosserco.com" },
  "to":   [ { "name": "…", "email": "…" }, … ],
  "body": "…plain text…",
  "received_at": "2026-04-22T10:14:00Z",
  "thread_id": "<conversation id>",
  "tenant": "lfi" | "mosser"
}

`kind = "calendar"` (m365-lfi-calendar, m365-mosser-calendar)

{
  "kind": "calendar",
  "id": "<event id>",
  "subject": "…", "body": "…",
  "start_at": "…", "end_at": "…",
  "attendees": [ { "name": "…", "email": "…", "response": "…" } ],
  "location": "…",
  "tenant": "lfi" | "mosser"
}

`kind = "meeting"` (granola, teams-chat)

{
  "kind": "meeting",
  "id": "<granola meeting id | teams email id>",
  "title": "…",
  "started_at": "…",
  "participants": ["…","…"],
  "transcript": "…(<=16000 chars)",
  "notes": "…",
  "tenant": "mixed" | "mosser"
}

`kind = "chat"` (imessage)

{
  "kind": "chat",
  "id": "<chat.db rowid>",
  "chat_id": "iMessage;-;+14155551234",
  "handle": "+14155551234",
  "is_from_me": false,
  "text": "…(attributedBody decoded)",
  "sent_at": "…"
}

`kind = "contact"` (outlook-contacts)

{
  "kind": "contact",
  "id": "<outlook contact id>",
  "name": "…",
  "emails": ["…"],
  "phones": ["…"],
  "company": "…",
  "title": "…",
  "categories": ["…"],
  "tenant": "lfi"
}

`kind = "doc"` (cloud-storage, pkl, gdrive-fetch)

{
  "kind": "doc",
  "provider": "icloud" | "box" | "dropbox" | "gdrive-lfi" | "gdrive-lfiq" | "pkl" | "inbounds",
  "path": "/absolute/path/on/origin",
  "filename": "2025-LFI-LOI-870-Oak.pdf",
  "size_bytes": 148122,
  "mtime": "…",
  "finder_tags": ["portfolios:Mosser", "markets:SF", "type:LOI"],
  "content_hash": "sha256:…",
  "content": null // filled only when content extraction is enabled
}
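
The `finder_tags` strings in the sample above follow a `key:value` pattern. A consumer could split them back into the `portfolios[]`, `markets[]`, and `type` fields with something like the sketch below. `parseFinderTags` and `ParsedTags` are hypothetical names, and ignoring tags with other keys or no colon is an assumption.

```typescript
interface ParsedTags {
  portfolios: string[];
  markets: string[];
  type: string | null;
}

// Split "key:value" tags. Tags with an unrecognized key or no colon are ignored.
function parseFinderTags(tags: string[]): ParsedTags {
  const out: ParsedTags = { portfolios: [], markets: [], type: null };
  for (const tag of tags) {
    const sep = tag.indexOf(":");
    if (sep < 0) continue;
    const key = tag.slice(0, sep);
    const value = tag.slice(sep + 1);
    if (key === "portfolios") out.portfolios.push(value);
    else if (key === "markets") out.markets.push(value);
    else if (key === "type") out.type = value;
  }
  return out;
}
```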

`kind = "market_summary"` (pbi, craigslist, sf-open-data)

{
  "kind": "market_summary",
  "source": "pbi" | "craigslist" | "sf-open-data",
  "tenant": "mosser" | "market",
  "as_of": "2026-04-22",
  "date":  "2026-04-22",
  "title": "pbi daily summary — 2026-04-22",
  "body":  "Mosser ops — 2026-04-22: Total AR $…; delinquent 30+ $…. Work orders opened today: … .",
  "content": "…same as body (legacy alias)…",
  "facts": { "ar_total": …, "work_orders_today": …, … },
  "details": null
}

`kind = "counter"` (pml)

{
  "kind": "counter",
  "date": "2026-04-22",
  "added_today": 47,
  "total": 12341
}

What the extractor produces per payload kind

The extractor (app/lib/extractor.ts) dispatches on kind to a family prompt in app/lib/prompts/ and writes normalized rows.

Inbox kind	Normalized tables written
`email`, `meeting`, `chat`	`items.tasks`, `items.commitments`, `items.decisions`, `items.entities`
`calendar`	`items.tasks` (prep tasks), `items.entities` (attendees)
`contact`	`items.entities` only (no tasks / decisions)
`doc`	`items.documents`, `items.entities` (when content is available)
`market_summary`	`items.decisions` (as a dated market observation), `items.entities` (geos / properties mentioned)
`counter`	No normalized output — statistical only, renders on `/sources/pml`
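
The table above can be expressed as a lookup, which is handy when a consumer needs to know where to look for the output of a given inbox row. This is a sketch, not the extractor's actual code: `NORMALIZED_TABLES` and `tablesFor` are hypothetical names, and the `doc` entry lists both tables even though the table's "when content is available" qualifier may apply to only one of them.

```typescript
type PayloadKind =
  | "email" | "calendar" | "meeting" | "chat"
  | "contact" | "doc" | "market_summary" | "counter";

// Shared by the three conversational kinds.
const CONVERSATION_TABLES = [
  "items.tasks", "items.commitments", "items.decisions", "items.entities",
] as const;

// Mirrors the table above, one entry per inbox kind.
const NORMALIZED_TABLES: Record<PayloadKind, readonly string[]> = {
  email: CONVERSATION_TABLES,
  meeting: CONVERSATION_TABLES,
  chat: CONVERSATION_TABLES,
  calendar: ["items.tasks", "items.entities"],
  contact: ["items.entities"],
  doc: ["items.documents", "items.entities"],
  market_summary: ["items.decisions", "items.entities"],
  counter: [], // statistical only, no normalized output
};

function tablesFor(kind: PayloadKind): readonly string[] {
  return NORMALIZED_TABLES[kind];
}
```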

Every extraction run also appends to items.embeddings (see §Embeddings pipeline below).

Embeddings pipeline

Embeddings live in Neon — one table, items.embeddings, pgvector HNSW index on cosine distance. They are the substrate for every "search across everything" feature.

What

Model: OpenAI text-embedding-3-small, 1536 dims.
Index: pgvector 0.8, HNSW vector_cosine_ops.
Rows (2026-04-23): ~14k, growing roughly at the rate of new normalized rows.
Row shape: { id, source_db, source_schema, source_table, source_id, canonical_text, text_sha1, embedding vector(1536), metadata jsonb, created_at }. text_sha1 is the dedup key — identical text is stored once.

Where the text comes from

app/lib/embeddings.ts:canonicalText() builds the string that gets embedded. Current coverage:

`source_db`	Tables embedded	Trigger
`neon-items`	`items.tasks`, `items.commitments`, `items.decisions`, `items.entities`, `items.documents`	After each extractor run that writes a normalized row.
`neon-public`	`public.tasks`, `public.commitments`, `public.decisions`, `public.entities`	Backfill only, via `scripts/embed_backfill.py`. Keeps legacy PKM rows searchable until they age out.
`brickston`	(designed — see prompt.md P2.9) Row-level Brickston Cloud SQL rows (AR ledger, work orders, leads, market listings)	Nightly job writes vectors directly into `items.embeddings` without replicating the source rows.

When

Write-time: the extractor calls upsertEmbeddings() for each normalized row it produces. First write embeds; re-writes with identical text are deduped by text_sha1.
Backfill: scripts/embed_backfill.py sweeps any row (items.* or public.*) that isn't already in items.embeddings.
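
The dedup behavior described above can be sketched with an in-memory map standing in for `items.embeddings`. It assumes `text_sha1` is the hex SHA-1 of the canonical text with no further normalization, which this document does not confirm. `textSha1`, `upsertEmbedding`, and the `embed` callback are hypothetical names, and the embedding call is kept synchronous for brevity where a real API call would be async.

```typescript
import { createHash } from "node:crypto";

// Hex SHA-1 of the canonical text, used here as the dedup key.
function textSha1(canonicalText: string): string {
  return createHash("sha1").update(canonicalText, "utf8").digest("hex");
}

// In-memory stand-in for items.embeddings, keyed by text_sha1.
const store = new Map<string, { canonical_text: string; embedding: number[] }>();

// Embed and insert only when the hash is new.
// Returns true if a row was written, false if the text was already stored.
function upsertEmbedding(
  canonicalText: string,
  embed: (text: string) => number[],
): boolean {
  const key = textSha1(canonicalText);
  if (store.has(key)) return false; // identical text: skip the embedding call
  store.set(key, { canonical_text: canonicalText, embedding: embed(canonicalText) });
  return true;
}
```

Checking the hash before calling the model means a re-write with unchanged text costs no embedding request.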

Who consumes them

Dashboard /search and the global SearchBar in the top-right of every page: /api/search?q=… runs a single cosine-nearest query across all source_dbs and returns deep-links into the relevant entity / task / decision.
Brickston AI hits /api/items-hub/search — proxied by BRICKSTON_ITEMS_HUB_BASE_URL. Brickston passes X-Ingest-Secret and receives the same ranked results, joined back onto its own entity graph.
Future agent Q&A (designed) — cross-source retrieval that lets Claude answer "what did Ji Won say last week about 870 Oak?" without knowing which source the answer lives in.
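
To show what "cosine-nearest" means for these consumers, the sketch below ranks rows by cosine distance in plain TypeScript. In the database the equivalent is presumably an `ORDER BY embedding <=> $1 LIMIT k` query, since `<=>` is pgvector's cosine-distance operator, but the exact query behind `/api/search` is an assumption. `EmbeddingRow` is a trimmed, hypothetical subset of the row shape, and `nearest` is a hypothetical name.

```typescript
// Cosine distance: 1 - cosine similarity.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Trimmed, hypothetical subset of an items.embeddings row.
interface EmbeddingRow {
  source_db: string;
  source_table: string;
  source_id: string;
  embedding: number[];
}

// Return the k rows closest to the query vector, nearest first.
function nearest(query: number[], rows: EmbeddingRow[], k: number): EmbeddingRow[] {
  return [...rows]
    .sort((x, y) => cosineDistance(query, x.embedding) - cosineDistance(query, y.embedding))
    .slice(0, k);
}
```

Because every `source_db` shares the one table, a single ranking like this covers all sources, and the `source_table` and `source_id` columns are what make the deep-links possible.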

PII and redaction

Raw text leaves Neon at embed time and is sent to OpenAI over TLS. Only the 1536-dim vector + the canonical text are retained in items.embeddings. To skip embedding for a sensitive source (e.g. a new banking feed), gate it in canonicalText() — return null and upsertEmbeddings will no-op.
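
A gate of the kind described above might look like the following sketch. Only the convention (return null from `canonicalText()` and `upsertEmbeddings` skips the row) comes from this document. The `NormalizedRow` shape, the `EMBED_BLOCKLIST` set, the `banking-feed` source name, and the `send` callback are all hypothetical, and the real function signatures may differ.

```typescript
// Hypothetical row shape for illustration.
interface NormalizedRow {
  source: string;
  title: string;
  body: string;
}

// Hypothetical list of sources whose text must never be embedded.
const EMBED_BLOCKLIST = new Set<string>(["banking-feed"]);

// Returns the string to embed, or null to opt the row out of embedding.
function canonicalText(row: NormalizedRow): string | null {
  if (EMBED_BLOCKLIST.has(row.source)) return null;
  return `${row.title}\n${row.body}`.trim();
}

// Sends each non-gated row's text for embedding; returns how many were sent.
function upsertEmbeddings(rows: NormalizedRow[], send: (text: string) => void): number {
  let sent = 0;
  for (const row of rows) {
    const text = canonicalText(row);
    if (text === null) continue; // gated source: no text is sent
    send(text);
    sent++;
  }
  return sent;
}
```

Placing the gate in `canonicalText()` rather than at each call site means the write-time path and the backfill path would both respect it, provided both build their text through that function.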

Scope / retention / PII

Retention: no policy yet. All rows kept forever. items.inbox_items.payload stores raw source content, which means email bodies, meeting transcripts, and chat messages are in Neon indefinitely. Address before sharing DB access externally.
PII: emails, phone numbers, home addresses, financial balances. Treat Neon DATABASE_URL as sensitive.
Tenant isolation: soft — enforced by tenant column, not by row-level security. A consumer with the pooled DATABASE_URL can read everything. If we ever need hard isolation (external viewer, investor share), use a read-only Neon role with CREATE POLICY … USING (tenant = current_setting('app.tenant')) and set the GUC before querying.
Embeddings: see §Embeddings pipeline above for model, retention, and consumers.
Entity merging (scripts/entity_reconcile.py) is bounded by tenant: LFI and Mosser entities never collapse into one row, even when names match.
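
If hard isolation is ever needed, the row-level-security approach mentioned under tenant isolation could start from statements like these. This is a sketch, not a complete migration: the policy name, role name, and helper name are hypothetical, the role would also need schema usage rights, and the identifiers are interpolated on the assumption that they are trusted constants and never user input. `set_config(..., true)` scopes the setting to the current transaction, so on a pooled connection it has to run inside the same transaction as the query.

```typescript
// Build the one-time setup statements for a single table and a read-only role.
// Table and role are assumed to be trusted constants, never user input.
function tenantIsolationSql(table: string, role: string): string[] {
  return [
    `ALTER TABLE ${table} ENABLE ROW LEVEL SECURITY;`,
    `CREATE POLICY tenant_isolation ON ${table} FOR SELECT TO ${role} ` +
      `USING (tenant = current_setting('app.tenant'));`,
    `GRANT SELECT ON ${table} TO ${role};`,
  ];
}

// Run at the start of each transaction, passing the tenant as parameter $1.
const SET_TENANT_SQL = "SELECT set_config('app.tenant', $1, true);";
```

For example, `tenantIsolationSql("items.inbox_items", "external_viewer")` yields the setup statements for the inbox table, with `external_viewer` as a placeholder role name.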
