KOS — FRD: Data Ingestion

Parent: KOS PRD
Owner: Digital Office
Status: Phase 1b


Overview

The ingestion pipeline connects to external source systems, extracts documents, scans for PII, and writes temporal facts to the FalkorDB knowledge graph via Graphiti.

Functional Requirements

FR-ING-01: Connector Interface

All connectors MUST implement BaseConnector:

  • fetch_documents(since: datetime | None) -> AsyncIterator[Document]

  • health_check() -> bool

  • Each document carries: id, source_type, document_type, title, content, url, created_at, updated_at, metadata
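A minimal sketch of the interface above. The field types, the async `health_check`, and the `InMemoryConnector` class are assumptions made for illustration; only the method names and the document fields come from this requirement.

```python
from __future__ import annotations

import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, AsyncIterator


@dataclass
class Document:
    # Field names follow FR-ING-01; the concrete types are assumptions.
    id: str
    source_type: str
    document_type: str
    title: str
    content: str
    url: str
    created_at: datetime
    updated_at: datetime
    metadata: dict[str, Any] = field(default_factory=dict)


class BaseConnector(ABC):
    @abstractmethod
    def fetch_documents(self, since: datetime | None) -> AsyncIterator[Document]:
        """Yield documents; when `since` is set, only those changed since then."""

    @abstractmethod
    async def health_check(self) -> bool:
        """Return True when the source is reachable (async is an assumption)."""


class InMemoryConnector(BaseConnector):
    """Hypothetical connector over a fixed list, useful for tests."""

    def __init__(self, docs: list[Document]) -> None:
        self._docs = docs

    async def fetch_documents(self, since: datetime | None = None):
        for doc in self._docs:
            # Inclusive comparison is an assumption; the FRD does not specify it.
            if since is None or doc.updated_at >= since:
                yield doc

    async def health_check(self) -> bool:
        return True
```

Passing `since=None` requests a full sync, while a timestamp requests an incremental one.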

FR-ING-02: Supported Sources

Connector | Auth Method | Incremental Sync
Confluence | Basic (user + API token) | updated_on filter
Slack | Bot token | oldest timestamp
Bitbucket | Basic (user + app password) | updated_on filter
Git repos | Filesystem (local) | File mtime
Filesystem | Filesystem (local) | File mtime
Jira | Basic (email + API token) | JQL updated >=
SharePoint/OneDrive | OAuth2 client credentials | lastModifiedDateTime

FR-ING-03: PII Scanning

Before any document is written to the graph:

  • Scan content for: phone numbers (6 country formats), credit/debit cards (Luhn), API keys (6 provider patterns), email addresses, private keys

  • Redact matches with [REDACTED-{type}]

  • Log each PII detection event to the audit trail

  • Never block ingestion on PII: redact and continue
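The redact-and-continue behaviour can be sketched as follows for two of the listed categories, card numbers (with a Luhn check) and email addresses. The regular expressions, the `CARD` and `EMAIL` type labels, and the function names are illustrative assumptions, not the pipeline's actual patterns.

```python
from __future__ import annotations

import re

# Simplified patterns, assumed for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")  # 13 to 19 digits


def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) >= 13 and total % 10 == 0


def redact(content: str) -> tuple[str, list[str]]:
    """Return redacted content plus the PII types found (for the audit log)."""
    found: list[str] = []

    def _card(match: re.Match) -> str:
        # Only redact digit runs that pass the Luhn check.
        if luhn_valid(match.group()):
            found.append("CARD")
            return "[REDACTED-CARD]"
        return match.group()

    def _email(match: re.Match) -> str:
        found.append("EMAIL")
        return "[REDACTED-EMAIL]"

    content = CARD_RE.sub(_card, content)
    content = EMAIL_RE.sub(_email, content)
    return content, found
```

The function never raises on a match, so a caller can log the returned types and carry on writing the redacted text.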

FR-ING-04: Rate Limiting

All HTTP connectors MUST handle 429 responses with exponential backoff (max 3 retries, base delay 1s, jitter).
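One possible shape for this retry policy is shown below. The retry count and base delay come from the requirement; the jitter formula, the `send` callable that returns a `(status, body)` pair, and the `RateLimitExceeded` exception are assumptions for the sake of a self-contained sketch.

```python
import asyncio
import random
from typing import Awaitable, Callable, Tuple

MAX_RETRIES = 3      # from FR-ING-04
BASE_DELAY_S = 1.0   # from FR-ING-04


class RateLimitExceeded(Exception):
    """Raised when the source still answers 429 after the last retry."""


def backoff_delay(attempt: int, base: float = BASE_DELAY_S) -> float:
    # Exponential term doubles per retry: 1s, 2s, 4s for attempts 0, 1, 2.
    # Jitter shape is an assumption: add up to one extra base delay.
    return base * (2 ** attempt) + random.uniform(0, base)


async def request_with_backoff(
    send: Callable[[], Awaitable[Tuple[int, str]]],
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> Tuple[int, str]:
    for attempt in range(MAX_RETRIES + 1):  # first try plus up to 3 retries
        status, body = await send()
        if status != 429:
            return status, body
        if attempt < MAX_RETRIES:
            await sleep(backoff_delay(attempt))
    raise RateLimitExceeded(f"still rate limited after {MAX_RETRIES} retries")
```

Injecting `sleep` keeps the policy testable without real waiting. A production connector might also honour a `Retry-After` header when the source sends one, which this sketch does not do.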

FR-ING-05: Scheduled Re-ingestion

A background scheduler re-ingests all sources every N hours (default: 6). The interval is configurable via KOS_INGESTION_INTERVAL_HOURS, and the scheduler can be disabled by setting KOS_INGESTION_ENABLED=false.
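A sketch of how the two settings might drive a scheduler loop. The environment variable names and the default of 6 hours come from this requirement; treating the scheduler as enabled when the flag is unset, and the `SchedulerSettings`, `run_scheduler`, and `max_runs` names, are assumptions.

```python
import asyncio
import os
from dataclasses import dataclass
from typing import Awaitable, Callable, Mapping, Optional


@dataclass(frozen=True)
class SchedulerSettings:
    enabled: bool
    interval_hours: float


def load_scheduler_settings(env: Optional[Mapping[str, str]] = None) -> SchedulerSettings:
    env = os.environ if env is None else env
    # Assumption: anything other than "false" leaves the scheduler enabled.
    enabled = env.get("KOS_INGESTION_ENABLED", "true").strip().lower() != "false"
    interval = float(env.get("KOS_INGESTION_INTERVAL_HOURS", "6"))
    if interval <= 0:
        raise ValueError("KOS_INGESTION_INTERVAL_HOURS must be positive")
    return SchedulerSettings(enabled=enabled, interval_hours=interval)


async def run_scheduler(
    ingest_all: Callable[[], Awaitable[None]],
    settings: SchedulerSettings,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
    max_runs: Optional[int] = None,  # hypothetical knob so tests can stop the loop
) -> int:
    if not settings.enabled:
        return 0
    runs = 0
    while max_runs is None or runs < max_runs:
        await ingest_all()
        runs += 1
        await sleep(settings.interval_hours * 3600)
    return runs
```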

FR-ING-06: Error Isolation

A single document failure MUST NOT halt the pipeline. Log the error, increment the error counter, and continue with the next document.
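The isolation rule amounts to a try/except around each document. In this sketch the `IngestStats` counter and the `process` callable are hypothetical stand-ins for the real pipeline stages.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Iterable

logger = logging.getLogger("kos.ingestion")  # logger name is an assumption


@dataclass
class IngestStats:
    processed: int = 0
    errors: int = 0


def ingest_documents(docs: Iterable[str], process: Callable[[str], None]) -> IngestStats:
    stats = IngestStats()
    for doc in docs:
        try:
            process(doc)
            stats.processed += 1
        except Exception:
            # Log with traceback, count the failure, and move on.
            logger.exception("ingestion failed for document %r", doc)
            stats.errors += 1
    return stats
```

Catching `Exception` rather than `BaseException` lets cancellation and keyboard interrupts still stop the run.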

FR-ING-07: Document Classification

Documents are classified by type using source-specific rules:

  • Confluence: page labels and title patterns

  • Git/Filesystem: filename patterns (ADR-, STD-, -schema.surql, -api.yaml, etc.)

  • Jira: issue type (Epic → PLAN, Story/Task/Bug → WIKI_PAGE)
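The Jira and filename rules above could be expressed as lookup tables. The Jira mapping is taken from this requirement; the type names assigned to the filename patterns (`ADR`, `STANDARD`, `SCHEMA`, `API_SPEC`) and the `UNKNOWN` fallback are placeholders, since the FRD lists the patterns but not their target types.

```python
from pathlib import PurePosixPath

DEFAULT_TYPE = "UNKNOWN"  # placeholder fallback, not specified in the FRD

# From FR-ING-07.
JIRA_TYPES = {"Epic": "PLAN", "Story": "WIKI_PAGE", "Task": "WIKI_PAGE", "Bug": "WIKI_PAGE"}

# Patterns come from FR-ING-07; the target type names are assumptions.
PREFIX_RULES = {"ADR-": "ADR", "STD-": "STANDARD"}
SUFFIX_RULES = {"-schema.surql": "SCHEMA", "-api.yaml": "API_SPEC"}


def classify_file(path: str) -> str:
    name = PurePosixPath(path).name  # match on the filename, not the directory
    for prefix, doc_type in PREFIX_RULES.items():
        if name.startswith(prefix):
            return doc_type
    for suffix, doc_type in SUFFIX_RULES.items():
        if name.endswith(suffix):
            return doc_type
    return DEFAULT_TYPE


def classify_jira(issue_type: str) -> str:
    return JIRA_TYPES.get(issue_type, DEFAULT_TYPE)
```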

FR-ING-08: Entity Resolution

During ingestion, entity names are matched against the canonical catalogue (jurisdictions, systems, APIs) to improve graph entity linking.
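As a rough illustration, matching against a canonical catalogue can be done with a normalised alias index. The catalogue entries, alias lists, and normalisation rule below are invented sample data; the FRD does not describe the real catalogue format or matching strategy.

```python
from __future__ import annotations

import re

# Hypothetical sample catalogue: category -> canonical name -> aliases.
CATALOGUE = {
    "jurisdiction": {"United Kingdom": ["UK", "Great Britain"]},
    "system": {"Billing Service": ["billing-svc"]},
    "api": {"Payments API": ["payments-api"]},
}


def normalise(name: str) -> str:
    # Lowercase and collapse punctuation/whitespace so "payments_api" == "Payments API".
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def build_index(catalogue: dict[str, dict[str, list[str]]]) -> dict[str, tuple[str, str]]:
    index: dict[str, tuple[str, str]] = {}
    for category, entries in catalogue.items():
        for canonical, aliases in entries.items():
            for label in [canonical, *aliases]:
                index[normalise(label)] = (category, canonical)
    return index


def resolve(name: str, index: dict[str, tuple[str, str]]) -> tuple[str, str] | None:
    """Return (category, canonical name), or None when nothing matches."""
    return index.get(normalise(name))
```

Exact matching on normalised names keeps false positives low; fuzzy matching would be a separate design decision.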

Configuration

Env Var | Description | Default
KOS_CONFLUENCE_URL | Confluence base URL | (none)
KOS_CONFLUENCE_USER | Confluence username | (none)
KOS_CONFLUENCE_API_TOKEN | API token | (none)
KOS_CONFLUENCE_SPACES | Space keys to ingest (CSV) | all
KOS_SLACK_BOT_TOKEN | Slack bot token | (none)
KOS_BITBUCKET_WORKSPACE | Workspace slug | (none)
KOS_BITBUCKET_APP_PASSWORD | App password | (none)
KOS_GIT_REPO_PATHS | Comma-separated local paths | (none)
KOS_FILESYSTEM_DIR | Local directory to walk | (none)
KOS_JIRA_URL | Jira Cloud base URL | (none)
KOS_JIRA_API_TOKEN | Jira API token | (none)
KOS_JIRA_PROJECT_KEYS | Project keys to ingest | all
KOS_MS_TENANT_ID | Microsoft tenant ID | (none)
KOS_MS_CLIENT_ID | App registration client ID | (none)
KOS_MS_CLIENT_SECRET | App registration secret | (none)
KOS_SHAREPOINT_SITE_IDS | SharePoint site IDs | (none)
KOS_ONEDRIVE_USER_IDS | OneDrive user IDs | (none)
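For the comma-separated settings, a small helper along these lines could parse the value. The `csv_setting` name, the use of `None` to mean "no filter", and accepting the literal string `all` are assumptions about how the defaults in the table are represented.

```python
import os
from typing import List, Mapping, Optional


def csv_setting(name: str, env: Optional[Mapping[str, str]] = None) -> Optional[List[str]]:
    """Parse a comma-separated env var; None means unset or "all"."""
    env = os.environ if env is None else env
    raw = env.get(name, "").strip()
    if not raw or raw.lower() == "all":
        return None
    # Drop empty items so "ENG, ,OPS," still parses cleanly.
    return [item.strip() for item in raw.split(",") if item.strip()]
```

For example, `csv_setting("KOS_CONFLUENCE_SPACES", {"KOS_CONFLUENCE_SPACES": "ENG, OPS"})` yields `["ENG", "OPS"]`.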