KOS — FRD: Data Ingestion¶
Parent: KOS PRD
Owner: Digital Office
Status: Phase 1b
Overview¶
The ingestion pipeline connects to external source systems, extracts documents, scans for PII, and writes temporal facts to the FalkorDB knowledge graph via Graphiti.
Functional Requirements¶
FR-ING-01: Connector Interface¶
All connectors MUST implement BaseConnector:
- fetch_documents(since: datetime | None) -> AsyncIterator[Document]
- health_check() -> bool
- Each document carries: id, source_type, document_type, title, content, url, created_at, updated_at, metadata
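The interface above could be sketched roughly as follows. The field types on Document, the choice to make health_check a coroutine, and the InMemoryConnector and collect helpers are assumptions made for illustration; only the method names, signatures, and field names come from the requirement.

```python
from __future__ import annotations

import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, AsyncIterator


@dataclass
class Document:
    """Fields listed in FR-ING-01; the concrete types are assumptions."""
    id: str
    source_type: str
    document_type: str
    title: str
    content: str
    url: str
    created_at: datetime
    updated_at: datetime
    metadata: dict[str, Any] = field(default_factory=dict)


class BaseConnector(ABC):
    @abstractmethod
    def fetch_documents(self, since: datetime | None) -> AsyncIterator[Document]:
        """Yield documents; when `since` is set, only those updated after it."""

    @abstractmethod
    async def health_check(self) -> bool:
        """Return True when the source system is reachable (async is assumed)."""


class InMemoryConnector(BaseConnector):
    """Hypothetical connector used only to exercise the interface."""

    def __init__(self, docs: list[Document]) -> None:
        self._docs = docs

    async def fetch_documents(self, since: datetime | None) -> AsyncIterator[Document]:
        for doc in self._docs:
            if since is None or doc.updated_at > since:
                yield doc

    async def health_check(self) -> bool:
        return True


async def collect(connector: BaseConnector, since: datetime | None = None) -> list[Document]:
    """Drain a connector into a list (illustrative helper)."""
    return [doc async for doc in connector.fetch_documents(since)]
```

Passing since=None performs a full fetch, while a timestamp restricts the result to documents updated after it, which is how the incremental sync column in FR-ING-02 would map onto this interface.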
FR-ING-02: Supported Sources¶
| Connector | Auth Method | Incremental Sync |
|---|---|---|
| Confluence | Basic (user + API token) | updated_on filter |
| Slack | Bot token | oldest timestamp |
| Bitbucket | Basic (user + app password) | updated_on filter |
| Git repos | Filesystem (local) | File mtime |
| Filesystem | Filesystem (local) | File mtime |
| Jira | Basic (email + API token) | JQL updated >= |
| SharePoint/OneDrive | OAuth2 client credentials | lastModifiedDateTime |
FR-ING-03: PII Scanning¶
Before any document is written to the graph:
- Scan content for: phone numbers (6 country formats), credit/debit cards (Luhn), API keys (6 provider patterns), email addresses, private keys
- Redact matches with [REDACTED-{type}]
- Log PII detection event to audit trail
- Never block ingestion on PII: redact and continue
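A minimal sketch of the redact-and-continue behaviour, covering only three of the listed categories (card numbers with a Luhn check, email addresses, private keys). The regular expressions, the lowercase type labels, and the redact and luhn_valid names are assumptions for illustration, not the pipeline's actual patterns; phone numbers and API keys are omitted.

```python
from __future__ import annotations

import re
from typing import Callable, Optional


def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


# Illustrative patterns only; real coverage would need to be broader.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")
PRIVATE_KEY_RE = re.compile(
    r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
    re.DOTALL,
)


def redact(content: str) -> tuple[str, list[str]]:
    """Return redacted content plus the list of detected PII types.

    Never raises on a match: the caller logs the detections and continues.
    """
    found: list[str] = []

    def apply(kind: str, pattern: re.Pattern[str], text: str,
              validator: Optional[Callable[[str], bool]] = None) -> str:
        def repl(match: re.Match[str]) -> str:
            if validator is not None and not validator(match.group(0)):
                return match.group(0)  # looks like PII but fails validation
            found.append(kind)
            return f"[REDACTED-{kind}]"
        return pattern.sub(repl, text)

    content = apply("private_key", PRIVATE_KEY_RE, content)
    content = apply("card", CARD_RE, content,
                    lambda s: luhn_valid(re.sub(r"\D", "", s)))
    content = apply("email", EMAIL_RE, content)
    return content, found
```

The returned list of types is what a caller would write to the audit trail; digit runs that fail the Luhn check are left untouched so that ordinary identifiers are not redacted by accident.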
FR-ING-04: Rate Limiting¶
All HTTP connectors MUST handle 429 responses with exponential backoff (max 3 retries, base delay 1s, jitter).
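One way to read this requirement is sketched below. The send callable (returning a status code and body), the with_backoff name, and the jitter formula (adding up to 100% of the computed delay) are assumptions; the requirement only fixes the retry count, the base delay, and that jitter is applied.

```python
from __future__ import annotations

import asyncio
import random
from typing import Awaitable, Callable


async def with_backoff(
    send: Callable[[], Awaitable[tuple[int, str]]],
    *,
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
    rng: Callable[[], float] = random.random,
) -> tuple[int, str]:
    """Call `send`, retrying on HTTP 429 with exponential backoff and jitter."""
    attempt = 0
    while True:
        status, body = await send()
        if status != 429 or attempt >= max_retries:
            # Either success, a non-rate-limit error, or retries exhausted.
            return status, body
        delay = base_delay * (2 ** attempt)   # 1s, 2s, 4s with the defaults
        delay += rng() * delay                # jitter: assumed 0-100% extra
        await sleep(delay)
        attempt += 1
```

With the defaults this makes at most four requests (one initial attempt plus three retries). The sleep and rng parameters exist only so the timing can be replaced in tests.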
FR-ING-05: Scheduled Re-ingestion¶
A background scheduler re-ingests all sources every N hours (default: 6). The interval is configurable via KOS_INGESTION_INTERVAL_HOURS, and scheduled re-ingestion can be disabled by setting KOS_INGESTION_ENABLED=false.
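As an illustration, the two settings could be read like this. The load_schedule function is hypothetical, and treating any value other than "false" as enabled, accepting fractional hours, and rejecting non-positive intervals are all assumptions.

```python
from __future__ import annotations

import os
from typing import Mapping, Optional


def load_schedule(env: Optional[Mapping[str, str]] = None) -> tuple[bool, float]:
    """Return (enabled, interval_seconds) from the KOS_INGESTION_* variables."""
    env = os.environ if env is None else env
    # Assumption: only the literal "false" (case-insensitive) disables the scheduler.
    enabled = env.get("KOS_INGESTION_ENABLED", "true").strip().lower() != "false"
    hours = float(env.get("KOS_INGESTION_INTERVAL_HOURS", "6"))
    if hours <= 0:
        raise ValueError("KOS_INGESTION_INTERVAL_HOURS must be positive")
    return enabled, hours * 3600
```

With no variables set this yields an enabled scheduler with a 21600-second (6-hour) interval.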
FR-ING-06: Error Isolation¶
A single document failure MUST NOT halt the pipeline. Log the error, increment error counter, continue to next document.
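The per-document isolation described here might look like the following sketch. IngestionStats, ingest_all, the process callable, and the logger name are hypothetical and stand in for whatever the pipeline actually uses.

```python
from __future__ import annotations

import logging
from dataclasses import dataclass
from typing import Any, Callable, Iterable

logger = logging.getLogger("kos.ingestion")  # logger name is an assumption


@dataclass
class IngestionStats:
    processed: int = 0
    errors: int = 0


def ingest_all(documents: Iterable[Any], process: Callable[[Any], None]) -> IngestionStats:
    """Process every document; a failure is logged and counted, never re-raised."""
    stats = IngestionStats()
    for doc in documents:
        try:
            process(doc)
            stats.processed += 1
        except Exception:
            # Log with traceback, bump the error counter, move to the next document.
            logger.exception("Failed to ingest document %r", doc)
            stats.errors += 1
    return stats
```

Catching Exception rather than BaseException keeps the loop going on ordinary failures while still letting interrupts and task cancellation stop the run.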
FR-ING-07: Document Classification¶
Documents are classified by type using source-specific rules:
- Confluence: page labels and title patterns
- Git/Filesystem: filename patterns (ADR-, STD-, -schema.surql, -api.yaml, etc.)
- Jira: issue type (Epic → PLAN, Story/Task/Bug → WIKI_PAGE)
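The Jira and filename rules could be expressed as simple lookup tables, as in this sketch. The Jira mapping is taken from the list above; the document-type labels assigned to filename patterns (ADR, STANDARD, SCHEMA, API_SPEC) and the WIKI_PAGE fallback are placeholders invented for the example, since the requirement does not name them.

```python
from __future__ import annotations

from fnmatch import fnmatchcase

# From FR-ING-07.
JIRA_TYPES = {
    "Epic": "PLAN",
    "Story": "WIKI_PAGE",
    "Task": "WIKI_PAGE",
    "Bug": "WIKI_PAGE",
}

# Patterns come from FR-ING-07; the labels on the right are hypothetical.
FILENAME_RULES = [
    ("ADR-*", "ADR"),
    ("STD-*", "STANDARD"),
    ("*-schema.surql", "SCHEMA"),
    ("*-api.yaml", "API_SPEC"),
]

DEFAULT_TYPE = "WIKI_PAGE"  # assumed fallback for unmatched input


def classify_jira(issue_type: str) -> str:
    return JIRA_TYPES.get(issue_type, DEFAULT_TYPE)


def classify_filename(filename: str) -> str:
    """Return the type of the first matching pattern (case-sensitive)."""
    for pattern, doc_type in FILENAME_RULES:
        if fnmatchcase(filename, pattern):
            return doc_type
    return DEFAULT_TYPE
```

Keeping the rules in ordered tables means a new pattern is a one-line change, and the first match wins when patterns overlap.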
FR-ING-08: Entity Resolution¶
During ingestion, entity names are matched against the canonical catalogue (jurisdictions, systems, APIs) to improve graph entity linking.
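The matching strategy is not specified here, so the following is only one possible reading: a normalized, case-insensitive lookup against canonical names and their aliases. The Catalogue class, the normalize function, and the sample entry are hypothetical.

```python
from __future__ import annotations

import re
from typing import Optional


def normalize(name: str) -> str:
    """Lowercase and collapse punctuation/whitespace so variants compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


class Catalogue:
    """Maps normalized names and aliases to a canonical entity name."""

    def __init__(self, entries: dict[str, list[str]]) -> None:
        self._index: dict[str, str] = {}
        for canonical, aliases in entries.items():
            for name in [canonical, *aliases]:
                self._index[normalize(name)] = canonical

    def resolve(self, name: str) -> Optional[str]:
        """Return the canonical name, or None when there is no match."""
        return self._index.get(normalize(name))
```

For example, with a hypothetical entry {"Payments API": ["PayAPI"]}, the inputs "payments-api", "PAYMENTS_API", and "PayAPI" all resolve to "Payments API", while unknown names return None and can be left for the graph layer to handle.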
Configuration¶
| Env Var | Description | Default |
|---|---|---|
| KOS_CONFLUENCE_URL | Confluence base URL | — |
| KOS_CONFLUENCE_USER | Confluence username | — |
| KOS_CONFLUENCE_API_TOKEN | Confluence API token | — |
| KOS_CONFLUENCE_SPACES | Space keys to ingest (CSV) | all |
| KOS_SLACK_BOT_TOKEN | Slack bot token | — |
| KOS_BITBUCKET_WORKSPACE | Bitbucket workspace slug | — |
| KOS_BITBUCKET_APP_PASSWORD | Bitbucket app password | — |
| KOS_GIT_REPO_PATHS | Comma-separated local paths | — |
| KOS_FILESYSTEM_DIR | Local directory to walk | — |
| KOS_JIRA_URL | Jira Cloud base URL | — |
| KOS_JIRA_API_TOKEN | Jira API token | — |
| KOS_JIRA_PROJECT_KEYS | Project keys to ingest | all |
| KOS_MS_TENANT_ID | Microsoft tenant ID | — |
| KOS_MS_CLIENT_ID | App registration client ID | — |
| KOS_MS_CLIENT_SECRET | App registration secret | — |
| KOS_SHAREPOINT_SITE_IDS | SharePoint site IDs | — |
| KOS_ONEDRIVE_USER_IDS | OneDrive user IDs | — |