# Items Hub — Data Dictionary
Purpose. One reference page listing every source, what it contains, where it lands in Neon, what gets extracted from it, and what its scope / retention / access envelope is. If you're writing a consumer that queries `items.*`, start here.
Canonical truth is `items.source_config` in Neon — this document is the human-readable mirror of that table, plus the per-source schema contracts the extractor produces. When in doubt, run:
```sql
SELECT source, display_name, tenant, trigger_env, cadence, status,
       alert_threshold_hours, sort_order, enabled
FROM items.source_config
ORDER BY sort_order;
```
## Legend
- Status: `live` (actively ingesting), `degraded` (live but past staleness threshold), `broken` (recent errors), `designed` (planned, not built), `down` (explicitly shut off — see the "Off — not currently ingesting" section on the dashboard).
- Tenant: `lfi` · `mosser` · `personal` · `market` · `mixed`.
- Trigger env: `anthropic-cloud` (scheduled task in the Anthropic sandbox), `windows-claude-desktop` (always-on Windows PC at `dockerpc.local`), `mac-login-item` (LaunchAgent on Justin's MacBook), `none` (off / manual only).
- Inbox payload kind: the discriminator inside `inbox_items.payload->>'kind'`. The extractor dispatches per-kind prompts.
## The sources (17 rows, active first)
### Live, Mac-login-item
| Source | Display | Tenant | Cadence | Scope |
|---|---|---|---|---|
| cloud-storage | Cloud storage (Inbounds/iCloud/Box/GDrive/Dropbox) | mixed | 30m | All files appearing under 5 watched roots. Metadata only (filename, size, provider, mtime, path hash). No content extraction by default. |
| imessage | iMessage | personal | 30m | All messages in the last 30 min, decoded from chat.db including the modern attributedBody NSString format. Sender, body, chat id, date. Excludes attachments. |
| pkl | PKL (Personal Knowledge Library) | personal | 30m | Notes/PDFs added or changed under the PKL root, excluding 05_Archives. Metadata + Finder tags (portfolios[], markets[], type). Content extraction deferred. |
| pml | PML (Personal Media Library, counter) | personal | daily | Just a count: how many photos / videos were added. No binary content. |
### Live, Windows Claude Desktop (via Outlook MCP → Neon `public.outlook_*`, then picked up by a Vercel cron)
| Source | Display | Tenant | Cadence | Scope |
|---|---|---|---|---|
| m365-lfi | M365 — LFI (email) | lfi | 2h | Inbox emails for justin@leftfieldinv.com in the last 3h window. Subject, from, to, body (HTML stripped), received_at. |
| m365-mosser | M365 — Mosser (email) | mosser | 2h | Inbox emails for jsato@mosserco.com. Same shape. |
| m365-lfi-calendar | M365 — LFI (calendar) | lfi | 2h | LFI calendar events ±7 days. Title, attendees, start/end, location, body. |
| m365-mosser-calendar | M365 — Mosser (calendar) | mosser | 2h | Mosser calendar events ±7 days. Same shape. |
| outlook-contacts | Outlook Contacts (LFI) | lfi | 6h (shared w/ M365) | LFI contact directory passthrough — name, email, company, tags. Mosser contacts intentionally skipped (duplicate set). |
| teams-chat | Teams chat (daily takeaways via email) | mosser | daily | PKM-AI-Workflow daily email with subject pattern "Teams Chat - Daily Takeaways YYYY-MM-DD". Subject-routed in /api/ingest/m365 via SUBJECT_ROUTES. |
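The subject routing mentioned for teams-chat can be sketched as follows. This is illustrative only — the document confirms that `SUBJECT_ROUTES` exists in `/api/ingest/m365`, but the shape of that structure and the `routeBySubject` helper are assumptions:

```typescript
// Hypothetical sketch of subject-pattern routing, as described for /api/ingest/m365.
// The real SUBJECT_ROUTES structure in the app may look different.
type SubjectRoute = { pattern: RegExp; source: string; kind: string };

const SUBJECT_ROUTES: SubjectRoute[] = [
  // Teams daily-takeaways emails are re-labelled as the teams-chat source
  // and carry the "meeting" payload kind (see the payload contracts below).
  { pattern: /^Teams Chat - Daily Takeaways \d{4}-\d{2}-\d{2}$/, source: "teams-chat", kind: "meeting" },
];

function routeBySubject(subject: string, fallbackSource: string): { source: string; kind: string } {
  const hit = SUBJECT_ROUTES.find((r) => r.pattern.test(subject));
  // Anything that doesn't match a route stays a plain email from the polled mailbox.
  return hit ? { source: hit.source, kind: hit.kind } : { source: fallbackSource, kind: "email" };
}
```

A matching subject flips both the `source` label and the payload kind; everything else flows through as ordinary email.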
### Live, Anthropic cloud
| Source | Display | Tenant | Cadence | Scope |
|---|---|---|---|---|
| granola | Granola — meetings | mixed | 15m | Meetings updated in the last 2h. Title, participants, started_at, transcript (truncated to 16K chars), notes. |
| pbi | Power BI — Golden Data Model | mosser | 04:03 daily | Daily operational signal: total AR balance, 30+ day delinquent portion, work orders opened today, leads today, upstream freshness. Computed from Brickston Cloud SQL pbi_* staging tables. |
| sf-open-data | SF Open Data | market | 06:30 daily | 6 civic datasets (permits, violations, evictions, housing complaints, planning, rent-board petitions). Inbox row is a one-line summary of today's activity. |
| craigslist | Craigslist — SF rentals | market | 05:10 daily | Today's new / active listings + median rent / avg bedrooms / neighborhoods covered. Computed from Brickston Cloud SQL market_listings. |
### Down (explicitly not pursuing — registry kept for future activation)
| Source | Display | Tenant | Why off |
|---|---|---|---|
| yardi | Yardi — Mosser PM system | mosser | No viable ingress — API not in scope; SSO-gated export is operator-dependent. Revisit when Mosser opens an API or a SOC2 data share. |
| smartsheet | Smartsheet | mixed | Low ROI for the PKM layer — operational data already surfaces via M365 attachments and Teams. |
| docusign | DocuSign | lfi | Low daily volume + signed-artifact-only workflow; every execution event already appears in email. Revisit if LFI deal volume grows. |
## Payload contracts (what `items.inbox_items.payload` looks like)
Every inbox row has `source`, `tenant`, `received_at`, `payload` (jsonb), `idempotency_key`, `processed_at`, and `error`. The payload shape varies by source; the extractor dispatches on `payload->>'kind'`.
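For a TypeScript consumer, the per-kind contracts naturally model as a discriminated union narrowed on `kind`. A minimal sketch — the field subsets and the `summarize` helper are illustrative, not the app's actual types:

```typescript
// Illustrative union over the payload discriminator. Each member lists only a
// few representative fields from its contract; the real payloads carry more.
type InboxPayload =
  | { kind: "email"; id: string; subject: string; body: string }
  | { kind: "calendar"; id: string; subject: string; start_at: string }
  | { kind: "meeting"; id: string; title: string; transcript: string }
  | { kind: "chat"; id: string; text: string }
  | { kind: "contact"; id: string; name: string }
  | { kind: "doc"; filename: string; path: string }
  | { kind: "market_summary"; source: string; body: string }
  | { kind: "counter"; date: string; added_today: number };

// A consumer narrows on payload.kind the same way the extractor dispatches.
function summarize(p: InboxPayload): string {
  switch (p.kind) {
    case "email":
      return `email: ${p.subject}`;
    case "counter":
      return `counter: +${p.added_today}`;
    default:
      return p.kind;
  }
}
```

TypeScript's control-flow narrowing means each `case` body sees only the fields that kind actually carries.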
### `kind = "email"` (m365-lfi, m365-mosser)
```jsonc
{
  "kind": "email",
  "id": "<outlook message id>",
  "subject": "…",
  "from": { "name": "Ji Won Choi", "email": "jwchoi@mosserco.com" },
  "to": [ { "name": "…", "email": "…" }, … ],
  "body": "…plain text…",
  "received_at": "2026-04-22T10:14:00Z",
  "thread_id": "<conversation id>",
  "tenant": "lfi" | "mosser"
}
```
### `kind = "calendar"` (m365-lfi-calendar, m365-mosser-calendar)
```jsonc
{
  "kind": "calendar",
  "id": "<event id>",
  "subject": "…", "body": "…",
  "start_at": "…", "end_at": "…",
  "attendees": [ { "name": "…", "email": "…", "response": "…" } ],
  "location": "…",
  "tenant": "lfi" | "mosser"
}
```
### `kind = "meeting"` (granola, teams-chat)
```jsonc
{
  "kind": "meeting",
  "id": "<granola meeting id | teams email id>",
  "title": "…",
  "started_at": "…",
  "participants": ["…", "…"],
  "transcript": "…(<=16000 chars)",
  "notes": "…",
  "tenant": "mixed" | "mosser"
}
```
### `kind = "chat"` (imessage)
```jsonc
{
  "kind": "chat",
  "id": "<chat.db rowid>",
  "chat_id": "iMessage;-;+14155551234",
  "handle": "+14155551234",
  "is_from_me": false,
  "text": "…(attributedBody decoded)",
  "sent_at": "…"
}
```
### `kind = "contact"` (outlook-contacts)
```jsonc
{
  "kind": "contact",
  "id": "<outlook contact id>",
  "name": "…",
  "emails": ["…"],
  "phones": ["…"],
  "company": "…",
  "title": "…",
  "categories": ["…"],
  "tenant": "lfi"
}
```
### `kind = "doc"` (cloud-storage, pkl, gdrive-fetch)
```jsonc
{
  "kind": "doc",
  "provider": "icloud" | "box" | "dropbox" | "gdrive-lfi" | "gdrive-lfiq" | "pkl" | "inbounds",
  "path": "/absolute/path/on/origin",
  "filename": "2025-LFI-LOI-870-Oak.pdf",
  "size_bytes": 148122,
  "mtime": "…",
  "finder_tags": ["portfolios:Mosser", "markets:SF", "type:LOI"],
  "content_hash": "sha256:…",
  "content": null // filled only when content extraction is enabled
}
```
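The `finder_tags` array uses namespaced `family:value` strings. A consumer grouping them back into families might look like this — the `parseFinderTags` helper is a hypothetical sketch, not code from the repo:

```typescript
// Sketch: group namespaced Finder tags ("portfolios:Mosser") into a map keyed by
// tag family. Namespaces follow the doc payload example; the helper is illustrative.
function parseFinderTags(tags: string[]): Record<string, string[]> {
  const out: Record<string, string[]> = {};
  for (const tag of tags) {
    const i = tag.indexOf(":");
    if (i < 0) continue; // skip tags without a namespace prefix
    const family = tag.slice(0, i);
    (out[family] ??= []).push(tag.slice(i + 1));
  }
  return out;
}
```

For the example payload above, this yields `{ portfolios: ["Mosser"], markets: ["SF"], type: ["LOI"] }`.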
### `kind = "market_summary"` (pbi, craigslist, sf-open-data)
```jsonc
{
  "kind": "market_summary",
  "source": "pbi" | "craigslist" | "sf-open-data",
  "tenant": "mosser" | "market",
  "as_of": "2026-04-22",
  "date": "2026-04-22",
  "title": "pbi daily summary — 2026-04-22",
  "body": "Mosser ops — 2026-04-22: Total AR $…; delinquent 30+ $…. Work orders opened today: … .",
  "content": "…same as body (legacy alias)…",
  "facts": { "ar_total": …, "work_orders_today": …, … },
  "details": null
}
```
### `kind = "counter"` (pml)
```jsonc
{
  "kind": "counter",
  "date": "2026-04-22",
  "added_today": 47,
  "total": 12341
}
```
## What the extractor produces per payload kind
The extractor (`app/lib/extractor.ts`) dispatches on `kind` to a family prompt in `app/lib/prompts/` and writes normalized rows.
| Inbox kind | Normalized tables written |
|---|---|
| email, meeting, chat | items.tasks, items.commitments, items.decisions, items.entities |
| calendar | items.tasks (prep tasks), items.entities (attendees) |
| contact | items.entities only (no tasks / decisions) |
| doc | items.documents, items.entities (when content is available) |
| market_summary | items.decisions (as a dated market observation), items.entities (geos / properties mentioned) |
| counter | No normalized output — statistical only, renders on /sources/pml |
Every extraction run also appends to `items.embeddings` (see the Embeddings pipeline section below).
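The kind-to-tables mapping can be held as a plain lookup, useful for a consumer deciding which `items.*` tables a given inbox row can fan out into. This is a transcription of the table above into an illustrative constant, not the app's actual dispatch code:

```typescript
// Which normalized tables each inbox kind can produce (from the mapping above).
// Illustrative only — the real dispatch lives in app/lib/extractor.ts.
const TABLES_BY_KIND: Record<string, string[]> = {
  email:          ["items.tasks", "items.commitments", "items.decisions", "items.entities"],
  meeting:        ["items.tasks", "items.commitments", "items.decisions", "items.entities"],
  chat:           ["items.tasks", "items.commitments", "items.decisions", "items.entities"],
  calendar:       ["items.tasks", "items.entities"],
  contact:        ["items.entities"],
  doc:            ["items.documents", "items.entities"],
  market_summary: ["items.decisions", "items.entities"],
  counter:        [], // statistical only — no normalized output
};
```

A consumer can use this to skip querying tables a given kind can never populate.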
## Embeddings pipeline
Embeddings live in Neon — one table, `items.embeddings`, with a pgvector HNSW index on cosine distance. They are the substrate for every "search across everything" feature.
### What
- Model: OpenAI `text-embedding-3-small`, 1536 dims.
- Index: pgvector 0.8, HNSW with `vector_cosine_ops`.
- Rows (as of 2026-04-23): ~14k, growing roughly at the rate of new normalized rows.
- Row shape: `{ id, source_db, source_schema, source_table, source_id, canonical_text, text_sha1, embedding vector(1536), metadata jsonb, created_at }`. `text_sha1` is the dedup key — identical text is stored once.
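The `text_sha1` dedup key can be sketched in a few lines. The helper name is illustrative; only the hashing scheme (SHA-1 of the canonical text) comes from the row shape above:

```typescript
import { createHash } from "node:crypto";

// Sketch of the text_sha1 dedup key: hash the canonical text, and skip the
// insert when that hash already exists in items.embeddings.
function textSha1(canonicalText: string): string {
  return createHash("sha1").update(canonicalText, "utf8").digest("hex");
}
```

Because identical text always produces the same 40-char hex key, a re-extracted row dedupes to a single stored vector.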
### Where the text comes from
`app/lib/embeddings.ts:canonicalText()` builds the string that gets embedded. Current coverage:
| source_db | Tables embedded | Trigger |
|---|---|---|
| neon-items | items.tasks, items.commitments, items.decisions, items.entities, items.documents | After each extractor run that writes a normalized row. |
| neon-public | public.tasks, public.commitments, public.decisions, public.entities | Backfill only, via scripts/embed_backfill.py. Keeps legacy PKM rows searchable until they age out. |
| brickston | (designed — see prompt.md P2.9) Row-level Brickston Cloud SQL rows (AR ledger, work orders, leads, market listings) | Nightly job writes vectors directly into items.embeddings without replicating the source rows. |
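A `canonicalText()` builder of this sort presumably concatenates a row's salient text fields and returns `null` when there is nothing worth embedding. The following is a hypothetical sketch — the field choices are assumptions, and the real logic lives in `app/lib/embeddings.ts`:

```typescript
// Hypothetical shape of a normalized row's text-bearing fields.
type NormalizedRow = {
  table: string;   // e.g. "items.tasks", "items.entities"
  name?: string;   // entities
  title?: string;  // tasks / decisions / documents
  body?: string;   // longer free text, when present
};

// Illustrative canonicalText(): join whatever text the row carries.
// Returning null signals "nothing to embed" and the upsert no-ops.
function canonicalText(row: NormalizedRow): string | null {
  const parts = [row.name, row.title, row.body].filter((p): p is string => !!p);
  return parts.length ? parts.join("\n") : null;
}
```

The `null` return is the same escape hatch the PII section below relies on for gating sensitive sources.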
### When
- Write-time: the extractor calls `upsertEmbeddings()` for each normalized row it produces. The first write embeds; re-writes with identical text are deduped by `text_sha1`.
- Backfill: `scripts/embed_backfill.py` sweeps any row (`items.*` or `public.*`) that isn't already in `items.embeddings`.
### Who consumes them
- Dashboard: `/search` plus the global SearchBar in the top-right of every page — `/api/search?q=…` runs a single cosine-nearest query across all `source_db`s and returns deep-links into the relevant entity / task / decision.
- Brickston AI hits `/api/items-hub/search` — proxied by `BRICKSTON_ITEMS_HUB_BASE_URL`. Brickston passes `X-Ingest-Secret` and receives the same ranked results, joined back onto its own entity graph.
- Future agent Q&A (designed) — cross-source retrieval that lets Claude answer "what did Ji Won say last week about 870 Oak?" without knowing which source the answer lives in.
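Both search paths reduce to one pgvector nearest-neighbor query. A minimal sketch, using the column names from the row shape above (the app's actual query may differ — this is illustrative, not copied from the code):

```sql
-- Sketch of a cosine nearest-neighbor search over items.embeddings.
-- $1 is the 1536-dim query vector produced by embedding the user's query text.
SELECT source_db, source_table, source_id, canonical_text,
       embedding <=> $1 AS distance   -- <=> is pgvector's cosine-distance operator
FROM items.embeddings
ORDER BY embedding <=> $1             -- served by the HNSW vector_cosine_ops index
LIMIT 10;
```

Ordering by the same `<=>` expression that the HNSW index was built on is what lets Postgres use the index instead of a full scan.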
### PII and redaction
Raw text leaves Neon at embed time and is sent to OpenAI over TLS. Only the 1536-dim vector plus the canonical text are retained in `items.embeddings`. To skip embedding for a sensitive source (e.g. a new banking feed), gate it in `canonicalText()` — return `null` and `upsertEmbeddings()` will no-op.
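The gating pattern just described could look roughly like this. Everything here is a sketch — `EMBED_BLOCKLIST`, the `banking-feed` source, and the helper name are hypothetical; only the "return null to skip embedding" mechanism comes from the text above:

```typescript
// Hypothetical gate: sources listed here never have their text embedded.
// "banking-feed" is an invented example of a sensitive source.
const EMBED_BLOCKLIST = new Set<string>(["banking-feed"]);

// Wraps the canonical-text step: blocked sources yield null, and the
// upsert treats null as "nothing to embed" and no-ops.
function gatedCanonicalText(source: string, text: string): string | null {
  return EMBED_BLOCKLIST.has(source) ? null : text;
}
```

The advantage of gating at `canonicalText()` rather than in each caller is that every embedding path (write-time and backfill) inherits the block automatically.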
## Scope / retention / PII
- Retention: no policy yet. All rows are kept forever. `items.inbox_items.payload` stores raw source content, which means email bodies, meeting transcripts, and chat messages sit in Neon indefinitely. Address this before sharing DB access externally.
- PII: emails, phone numbers, home addresses, financial balances. Treat the Neon `DATABASE_URL` as sensitive.
- Tenant isolation: soft — enforced by the `tenant` column, not by row-level security. A consumer with the pooled `DATABASE_URL` can read everything. If we ever need hard isolation (external viewer, investor share), use a read-only Neon role with `CREATE POLICY … USING (tenant = current_setting('app.tenant'))` and set the GUC before querying.
- Embeddings: see the Embeddings pipeline section above for model, retention, and consumers.
- Cross-tenant entity merging (`scripts/entity_reconcile.py`) is bounded by `tenant` — LFI and Mosser entities never collapse into one row even when names match.
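The row-level-security option for hard tenant isolation could be set up roughly as follows. Not currently applied anywhere — this is an illustrative sketch using `items.tasks` as the example table and an assumed policy name:

```sql
-- Sketch: hard tenant isolation via row-level security (not currently in use).
ALTER TABLE items.tasks ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON items.tasks
  USING (tenant = current_setting('app.tenant'));

-- The read-only consumer sets the GUC before querying, e.g.:
--   SET app.tenant = 'lfi';
-- after which SELECTs only return rows where tenant = 'lfi'.
```

Note that table owners and superusers bypass RLS by default, so the consumer must connect as a separate read-only role for the policy to bite.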
## See also
- `docs/CONSUMING_ITEMS_HUB.md` — three access patterns (direct SQL, REST, programmatic) with examples.
- `docs/WINDOWS_CLAUDE_DESKTOP.md` — M365 trigger operator guide.
- `app/lib/sources.ts` — canonical list used to seed `items.source_config`.
- `app/lib/db/schema.ts` — Drizzle schema (source of truth for table shapes).