Ingestion
Ingestion is the first step in Outrun's data synchronization process. After adding a source, Outrun begins collecting and storing raw data from your systems using the most appropriate method for each source type.
📥 Data Collection Strategy
Outrun automatically selects the optimal ingestion method for each source - real-time streams when available, or intelligent batch jobs for comprehensive data collection.
Ingestion Methods
Outrun uses two primary methods for data ingestion, automatically selecting the best approach for each source:
Real-Time Streams
When a source supports real-time data streams, Outrun leverages these for immediate data collection:
- Instant Collection: Data is ingested as soon as it's available
- Event-Driven: Triggered by actual data changes in the source system
- Continuous Flow: Maintains persistent connection for ongoing data flow
- Examples: Webhook-enabled systems, real-time event feeds
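For illustration, here is a minimal sketch of a webhook-style receiver, assuming an Express service that writes directly to the stream_data table described under Stream Storage Architecture below. The route, payload shape, and metadata fields are illustrative assumptions, not Outrun's actual implementation.

```typescript
import express from "express";
import { Pool } from "pg";

const app = express();
const pool = new Pool(); // connection settings come from environment variables

app.use(express.json());

// Hypothetical endpoint: one route per source, each incoming event is written
// straight to stream_data with the raw payload preserved in the record column.
app.post("/webhooks/:sourceId", async (req, res) => {
  const event = req.body;

  await pool.query(
    `INSERT INTO stream_data (source_id, external_id, record, metadata, created_at)
     VALUES ($1, $2, $3, $4, NOW())`,
    [
      req.params.sourceId,
      event.id,              // original record ID from the source system
      JSON.stringify(event), // raw payload stored as-is
      JSON.stringify({ processed: false, ingestedAt: new Date().toISOString() }),
    ]
  );

  res.status(204).end(); // acknowledge quickly so the source does not retry
});

app.listen(3000);
```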
Batch Jobs
For sources without real-time capabilities, Outrun runs periodic batch jobs:
- Scheduled Collection: Jobs run at configured intervals
- Comprehensive Sweep: Aims to import all available data
- Configurable Timing: Frequency depends on your settings and source capabilities
- Incremental Updates: Only collects changed data after initial sync
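Conceptually, a batch job is a polling loop that remembers when it last ran and asks the source only for records changed since then. The sketch below shows this incremental pattern; the fetcher, the in-memory cursor store, and storeStreamData are hypothetical stand-ins for Outrun's internals.

```typescript
// A source-specific fetcher returns records changed since a given time.
type Fetcher = (since: Date) => Promise<Array<{ id: string; [key: string]: unknown }>>;

const cursors = new Map<string, Date>(); // in-memory cursor store, for illustration only

async function storeStreamData(sourceId: string, record: { id: string }): Promise<void> {
  // The real pipeline would INSERT into stream_data; logging is enough for the sketch.
  console.log(`ingested ${sourceId}/${record.id}`);
}

async function runBatchSync(sourceId: string, fetchChangedSince: Fetcher): Promise<void> {
  const since = cursors.get(sourceId) ?? new Date(0); // first run sweeps all available data
  const changed = await fetchChangedSince(since);     // incremental: only changed records

  for (const record of changed) {
    await storeStreamData(sourceId, record);
  }

  cursors.set(sourceId, new Date()); // advance the cursor only after a successful sweep
}
```

In production, the loop runs on the configured polling interval and persists its cursor, so a restart never re-imports the full history.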
Real-Time Stream Sources
- Webhook Systems - Event-driven notifications
- Change Streams - Database change logs
- Event APIs - Real-time event feeds
- Pipedrive Webhooks - Real-time activity updates
Batch Job Sources
- HubSpot - 60-minute polling intervals
- Zoho CRM - Conservative rate-limited batches
- Confluence - Content change detection
- Google Search Console - Daily analytics collection
Stream Storage Architecture
All ingested data is stored in the stream_data table within each workspace's tenant database, preserving the original format while adding essential metadata.
Stream Data Table
stream_data (per tenant database)
├── source_id → FK to the source that produced this data
├── external_id → Original record ID from the source system
├── record → JSONB column containing the raw API response
├── metadata → JSONB column with processing metadata
└── created_at → Ingestion timestamp
All sources write to the same table, differentiated by source_id. Composite indexes on (source_id, created_at) and (source_id, external_id) ensure fast lookups.
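As a rough TypeScript rendering, a row in this table and the two index-friendly lookups look like the following; the column names follow the sketch above, while the field types are assumptions.

```typescript
// Assumed shape of a stream_data row.
interface StreamDataRow {
  source_id: string;                 // FK to the source that produced this data
  external_id: string;               // original record ID from the source system
  record: Record<string, unknown>;   // raw API response (JSONB)
  metadata: Record<string, unknown>; // processing metadata (JSONB)
  created_at: Date;                  // ingestion timestamp
}

// Both queries are served by the composite indexes mentioned above.
const bySourceAndExternalId =
  "SELECT * FROM stream_data WHERE source_id = $1 AND external_id = $2";
const recentForSource =
  "SELECT * FROM stream_data WHERE source_id = $1 AND created_at > $2 ORDER BY created_at";
```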
Data Storage Principles
First-In, First-Out (FIFO)
- Chronological Order: Data stored in the order it was received
- Temporal Integrity: Maintains timeline of data changes
- Audit Trail: Complete history of all data ingestion
- Processing Order: Ensures consistent processing sequence
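In query terms, FIFO processing amounts to reading unprocessed rows in created_at order. A minimal sketch, assuming the processed flag lives in the metadata column as shown in the example later on this page:

```typescript
import { Pool } from "pg";

const pool = new Pool();

// Fetch the oldest unprocessed rows for a source, preserving ingestion order.
async function nextBatch(sourceId: string, limit = 100) {
  const { rows } = await pool.query(
    `SELECT external_id, record, metadata
       FROM stream_data
      WHERE source_id = $1
        AND (metadata->>'processed')::boolean = false
      ORDER BY created_at ASC
      LIMIT $2`,
    [sourceId, limit]
  );
  return rows;
}
```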
Original Format Preservation
- Minimal Transformation: Data stored as close to API response format as possible
- Native Structure: Preserves source system's data structure and relationships
- Field Names: Original field names and data types maintained
- Nested Objects: Complex object structures preserved intact
Metadata Enrichment
Outrun stores source data and system metadata in separate columns, keeping the original record untouched:
// record column (JSONB) - original source data preserved
{
  "id": "12345",
  "email": "john.doe@example.com",
  "firstName": "John",
  "lastName": "Doe"
}

// metadata column (JSONB) - system metadata stored separately
{
  "sourceType": "hubspot",
  "apiEndpoint": "/contacts/v1/contact/12345",
  "processed": false,
  "processingAttempts": 0,
  "lastModified": "2024-01-15T09:45:00Z"
}
Metadata Fields
Outrun adds comprehensive metadata to track data lineage and processing:
Source Information
- sourceId: Unique identifier for the source instance
- sourceType: Type of source system (hubspot, pipedrive, zoho, etc.)
- apiEndpoint: Specific API endpoint used for data collection
- objectType: Native object type from the source system
Timing Information
- ingestedAt: When Outrun received the data
- lastModified: When the data was last modified in the source system
- syncJobId: Identifier for the sync job that collected this data
- batchId: Batch identifier for grouped operations
Processing Status
- processed: Whether data has been processed into standardized objects
- processingAttempts: Number of processing attempts
- processingErrors: Any errors encountered during processing
- consolidatedAt: When data was moved to the consolidation stage
Data Quality
- dataQuality: Quality score and validation results
- duplicateOf: Reference to the original record if a duplicate is detected
- validationErrors: Field-level validation issues
- enrichmentStatus: Status of any data enrichment processes
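Taken together, the metadata column can be described with a type along these lines; the field names mirror the lists above, while optionality and exact value types are assumptions.

```typescript
// Assumed shape of the metadata column, mirroring the fields listed above.
interface StreamMetadata {
  // Source information
  sourceId: string;
  sourceType: string;   // "hubspot", "pipedrive", "zoho", ...
  apiEndpoint: string;
  objectType: string;

  // Timing information
  ingestedAt: string;   // ISO 8601 timestamp
  lastModified?: string;
  syncJobId?: string;
  batchId?: string;

  // Processing status
  processed: boolean;
  processingAttempts: number;
  processingErrors?: string[];
  consolidatedAt?: string;

  // Data quality
  dataQuality?: { score: number; issues: string[] };
  duplicateOf?: string;
  validationErrors?: Record<string, string>;
  enrichmentStatus?: string;
}
```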
Value-Added Services
The metadata-enriched stream data enables powerful value-added services:
Data Lineage Tracking
- Complete History: Track data from source to destination
- Change Attribution: Identify what caused data changes
- Impact Analysis: Understand downstream effects of changes
- Compliance Auditing: Meet regulatory audit requirements
Data Quality Monitoring
- Quality Metrics: Track data quality scores over time
- Validation Reporting: Identify common data quality issues
- Trend Analysis: Monitor data quality improvements
- Alert Systems: Notify of significant quality degradation
Performance Analytics
- Ingestion Rates: Monitor data collection performance
- Processing Times: Track how long data takes to process
- Error Rates: Identify and resolve ingestion issues
- Capacity Planning: Plan for scaling data operations
Advanced Processing
- Machine Learning: Use historical data for predictive analytics
- Data Enrichment: Enhance data with external sources
- Anomaly Detection: Identify unusual data patterns
- Custom Transformations: Apply business-specific data rules
Ingestion Configuration
Batch Job Settings
Configure how often batch jobs run based on your needs:
- Polling Interval: How frequently to check for new data
- Backfill Period: How far back to collect historical data
- Rate Limiting: Respect source system API limits
- Error Handling: Retry logic and failure recovery
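As an illustration, a batch source configuration might look like the following; the property names are assumptions, not Outrun's actual settings schema.

```typescript
// Illustrative batch job settings (property names are assumptions).
const batchConfig = {
  pollingIntervalMinutes: 60,                    // how frequently to check for new data
  backfillDays: 90,                              // how far back to collect historical data
  rateLimit: { requestsPerMinute: 100 },         // respect source system API limits
  retry: { maxAttempts: 3, backoffSeconds: 30 }, // retry logic and failure recovery
};
```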
Real-Time Stream Settings
Configure real-time data collection parameters:
- Event Filtering: Which events to capture
- Buffer Settings: How to handle high-volume streams
- Failover Logic: Fallback to batch jobs if stream fails
- Monitoring: Health checks and performance metrics
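And a matching sketch for a real-time source, again with assumed property names.

```typescript
// Illustrative real-time stream settings (property names are assumptions).
const streamConfig = {
  eventFilters: ["contact.created", "contact.updated"], // which events to capture
  buffer: { maxEvents: 1000, flushIntervalSeconds: 5 },  // high-volume handling
  failover: { fallbackToBatch: true, afterFailures: 3 }, // fall back to batch jobs
  monitoring: { healthCheckIntervalSeconds: 60 },        // health checks and metrics
};
```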
Best Practices
Initial Setup
- Start Small: Begin with recent data to test ingestion
- Monitor Performance: Watch for rate limit issues
- Validate Data: Ensure data quality meets expectations
- Gradual Expansion: Increase scope after successful testing
Ongoing Management
- Regular Monitoring: Check ingestion health and performance
- Error Review: Address ingestion errors promptly
- Capacity Planning: Monitor storage and processing needs
- Performance Optimization: Adjust settings based on usage patterns
Data Quality
- Source Validation: Ensure source data quality before ingestion
- Metadata Review: Use metadata for data quality insights
- Error Analysis: Investigate and resolve recurring issues
- Continuous Improvement: Refine ingestion based on learnings
Next Steps
Learn About Standardization
Discover how raw ingested data transforms into standardized objects.
Standardization Process →
Ingestion is the foundation of reliable data synchronization. Understanding this process helps you optimize your data collection strategy.