Ingestion

Ingestion is the first step in Outrun's data synchronization process. After adding a source, Outrun begins collecting and storing raw data from your systems using the most appropriate method for each source type.

📥 Data Collection Strategy

Outrun automatically selects the optimal ingestion method for each source - real-time streams when a source supports them, or scheduled batch jobs when it does not.

Ingestion Methods

Outrun uses two primary methods for data ingestion, automatically selecting the best approach for each source:

Real-Time Streams

When a source supports real-time data streams, Outrun leverages these for immediate data collection:

  • Instant Collection: Data is ingested as soon as it's available
  • Event-Driven: Triggered by actual data changes in the source system
  • Continuous Flow: Maintains persistent connection for ongoing data flow
  • Examples: Webhook-enabled systems, real-time event feeds
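
As a rough illustration, a webhook-style stream reduces to an HTTP endpoint that acknowledges each event quickly and writes the raw payload straight into stream storage. The sketch below uses Express; the endpoint path, payload shape, and saveStreamRecord helper are hypothetical stand-ins, not Outrun's actual API.

import express from "express";

const app = express();
app.use(express.json());

// Hypothetical webhook receiver: store each event immediately, untouched.
app.post("/webhooks/:sourceId", async (req, res) => {
  await saveStreamRecord({
    sourceId: req.params.sourceId,
    externalId: String(req.body.id ?? ""), // original record ID, if present
    record: req.body,                      // raw payload, preserved as-is
  });
  res.sendStatus(202); // acknowledge fast; processing happens downstream
});

// Placeholder for the actual write into stream_data.
async function saveStreamRecord(row: {
  sourceId: string;
  externalId: string;
  record: unknown;
}): Promise<void> {
  /* INSERT INTO stream_data ... */
}

app.listen(3000);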

Batch Jobs

For sources without real-time capabilities, Outrun runs periodic batch jobs:

  • Scheduled Collection: Jobs run at configured intervals
  • Comprehensive Sweep: Aims to import all available data
  • Configurable Timing: Frequency depends on your settings and source capabilities
  • Incremental Updates: Only collects changed data after initial sync
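
A minimal sketch of such an incremental job, assuming a generic REST source that can filter by modification time; loadCursor, saveCursor, fetchChangedRecords, and saveStreamRecord are hypothetical helpers, not Outrun's internals.

// Hypothetical incremental batch job: fetch only records changed since the last run.
async function runBatchJob(sourceId: string): Promise<void> {
  const since = (await loadCursor(sourceId)) ?? new Date(0).toISOString(); // full backfill on first run
  const changed = await fetchChangedRecords(sourceId, since);
  for (const record of changed) {
    await saveStreamRecord(sourceId, record.id, record);
  }
  await saveCursor(sourceId, new Date().toISOString()); // advance the cursor only after success
}

// Placeholder persistence and API helpers.
async function loadCursor(sourceId: string): Promise<string | null> { return null; }
async function saveCursor(sourceId: string, at: string): Promise<void> {}
async function fetchChangedRecords(sourceId: string, since: string): Promise<Array<{ id: string }>> { return []; }
async function saveStreamRecord(sourceId: string, externalId: string, record: unknown): Promise<void> {}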

Real-Time Stream Sources

  • Webhook Systems - Event-driven notifications
  • Change Streams - Database change logs
  • Event APIs - Real-time event feeds
  • Pipedrive Webhooks - Real-time activity updates

Batch Job Sources

  • HubSpot - 60-minute polling intervals
  • Zoho CRM - Conservative rate-limited batches
  • Confluence - Content change detection
  • Google Search Console - Daily analytics collection

Stream Storage Architecture

All ingested data is stored in the stream_data table within each workspace's tenant database, preserving the original format while adding essential metadata.

Stream Data Table

stream_data (per tenant database)
├── source_id    → FK to the source that produced this data
├── external_id  → Original record ID from the source system
├── record       → JSONB column containing the raw API response
├── metadata     → JSONB column with processing metadata
└── created_at   → Ingestion timestamp

All sources write to the same table, differentiated by source_id. Composite indexes on (source_id, created_at) and (source_id, external_id) ensure fast lookups.
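
For reference, a row can be modeled roughly as the following TypeScript shape; the column names mirror the diagram above, while the concrete TypeScript types are assumptions.

// Approximate shape of one stream_data row.
interface StreamDataRow {
  source_id: string;                  // FK to the source that produced this data
  external_id: string;                // original record ID from the source system
  record: Record<string, unknown>;    // raw API response (JSONB)
  metadata: Record<string, unknown>;  // processing metadata (JSONB)
  created_at: string;                 // ingestion timestamp
}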

Data Storage Principles

First-In, First-Out (FIFO)

  • Chronological Order: Data stored in the order it was received
  • Temporal Integrity: Maintains timeline of data changes
  • Audit Trail: Complete history of all data ingestion
  • Processing Order: Ensures consistent processing sequence
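
In practice, FIFO processing can come down to always reading the oldest unprocessed rows first. A sketch using the node-postgres (pg) client, assuming the processed flag lives in the metadata column as shown later on this page:

import { Pool } from "pg";

const pool = new Pool();

// Fetch the oldest unprocessed rows for a source, preserving ingestion order.
async function nextBatch(sourceId: string, limit = 100) {
  const { rows } = await pool.query(
    `SELECT external_id, record, metadata
       FROM stream_data
      WHERE source_id = $1
        AND (metadata->>'processed')::boolean = false
      ORDER BY created_at ASC
      LIMIT $2`,
    [sourceId, limit],
  );
  return rows;
}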

Original Format Preservation

  • Minimal Transformation: Data stored as close to API response format as possible
  • Native Structure: Preserves source system's data structure and relationships
  • Field Names: Original field names and data types maintained
  • Nested Objects: Complex object structures preserved intact

Metadata Enrichment

Outrun stores source data and system metadata in separate columns, keeping the original record untouched:

// record column (JSONB) - original source data preserved
{
  "id": "12345",
  "email": "[email protected]",
  "firstName": "John",
  "lastName": "Doe"
}

// metadata column (JSONB) - system metadata stored separately
{
  "sourceType": "hubspot",
  "apiEndpoint": "/contacts/v1/contact/12345",
  "processed": false,
  "processingAttempts": 0,
  "lastModified": "2024-01-15T09:45:00Z"
}

Metadata Fields

Outrun adds comprehensive metadata to track data lineage and processing:

Source Information

  • sourceId: Unique identifier for the source instance
  • sourceType: Type of source system (hubspot, pipedrive, zoho, etc.)
  • apiEndpoint: Specific API endpoint used for data collection
  • objectType: Native object type from source system

Timing Information

  • ingestedAt: When Outrun received the data
  • lastModified: When the data was last modified in the source system
  • syncJobId: Identifier for the sync job that collected this data
  • batchId: Batch identifier for grouped operations

Processing Status

  • processed: Whether data has been processed into standardized objects
  • processingAttempts: Number of processing attempts
  • processingErrors: Any errors encountered during processing
  • consolidatedAt: When data was moved to consolidation stage

Data Quality

  • dataQuality: Quality score and validation results
  • duplicateOf: Reference to original record if duplicate detected
  • validationErrors: Field-level validation issues
  • enrichmentStatus: Status of any data enrichment processes
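
Taken together, these fields can be sketched as a single TypeScript shape; the optional markers and value types are assumptions inferred from the examples on this page, not a published schema.

// Approximate shape of the metadata column.
interface StreamMetadata {
  // Source information
  sourceId: string;
  sourceType: string;            // "hubspot", "pipedrive", "zoho", ...
  apiEndpoint?: string;
  objectType?: string;
  // Timing information
  ingestedAt: string;
  lastModified?: string;
  syncJobId?: string;
  batchId?: string;
  // Processing status
  processed: boolean;
  processingAttempts: number;
  processingErrors?: string[];
  consolidatedAt?: string;
  // Data quality
  dataQuality?: { score: number };
  duplicateOf?: string;
  validationErrors?: Record<string, string>;
  enrichmentStatus?: string;
}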

Value-Added Services

The metadata-enriched stream data enables powerful value-added services:

Data Lineage Tracking

  • Complete History: Track data from source to destination
  • Change Attribution: Identify what caused data changes
  • Impact Analysis: Understand downstream effects of changes
  • Compliance Auditing: Meet regulatory audit requirements

Data Quality Monitoring

  • Quality Metrics: Track data quality scores over time
  • Validation Reporting: Identify common data quality issues
  • Trend Analysis: Monitor data quality improvements
  • Alert Systems: Notify of significant quality degradation

Performance Analytics

  • Ingestion Rates: Monitor data collection performance
  • Processing Times: Track how long data takes to process
  • Error Rates: Identify and resolve ingestion issues
  • Capacity Planning: Plan for scaling data operations

Advanced Processing

  • Machine Learning: Use historical data for predictive analytics
  • Data Enrichment: Enhance data with external sources
  • Anomaly Detection: Identify unusual data patterns
  • Custom Transformations: Apply business-specific data rules

Ingestion Configuration

Batch Job Settings

Configure how often batch jobs run based on your needs:

  • Polling Interval: How frequently to check for new data
  • Backfill Period: How far back to collect historical data
  • Rate Limiting: Respect source system API limits
  • Error Handling: Retry logic and failure recovery
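
As a concrete illustration, these settings might be expressed as a configuration object like the one below; the key names and default values are hypothetical.

// Hypothetical batch job configuration.
const batchJobConfig = {
  pollingIntervalMinutes: 60,                    // how often to check for new data
  backfillDays: 90,                              // how far back to collect on first sync
  rateLimit: { requestsPerMinute: 100 },         // respect source system API limits
  retry: { maxAttempts: 3, backoffSeconds: 30 }, // retry logic and failure recovery
};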

Real-Time Stream Settings

Configure real-time data collection parameters:

  • Event Filtering: Which events to capture
  • Buffer Settings: How to handle high-volume streams
  • Failover Logic: Fallback to batch jobs if stream fails
  • Monitoring: Health checks and performance metrics
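
And an equally hypothetical counterpart for real-time streams:

// Hypothetical real-time stream configuration.
const streamConfig = {
  events: ["contact.created", "contact.updated"],   // which events to capture
  buffer: { maxSize: 10000, flushIntervalMs: 500 }, // high-volume handling
  fallbackToBatch: true,                            // fail over to batch jobs if the stream drops
  healthCheckIntervalSeconds: 60,                   // monitoring
};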

Best Practices

Initial Setup

  1. Start Small: Begin with recent data to test ingestion
  2. Monitor Performance: Watch for rate limit issues
  3. Validate Data: Ensure data quality meets expectations
  4. Gradual Expansion: Increase scope after successful testing

Ongoing Management

  1. Regular Monitoring: Check ingestion health and performance
  2. Error Review: Address ingestion errors promptly
  3. Capacity Planning: Monitor storage and processing needs
  4. Performance Optimization: Adjust settings based on usage patterns

Data Quality

  1. Source Validation: Ensure source data quality before ingestion
  2. Metadata Review: Use metadata for data quality insights
  3. Error Analysis: Investigate and resolve recurring issues
  4. Continuous Improvement: Refine ingestion based on learnings

Next Steps

Learn About Standardization

Discover how raw ingested data is transformed into standardized objects.

Standardization Process →

Explore Sources

See specific ingestion methods for each supported source.

View Sources →

Ingestion is the foundation of reliable data synchronization. Understanding this process helps you optimize your data collection strategy.