Ingestion
Ingestion is the first step in Outrun's data synchronization process. After adding a source, Outrun begins collecting and storing raw data from your systems using the most appropriate method for each source type.
Data Collection Strategy
Outrun automatically selects the optimal ingestion method for each source - real-time streams when available, or intelligent batch jobs for comprehensive data collection.
Ingestion Methods
Outrun uses two primary methods for data ingestion, automatically selecting the best approach for each source:
Real-Time Streams
When a source supports real-time data streams, Outrun leverages these for immediate data collection:
- Instant Collection: Data is ingested as soon as it's available
- Event-Driven: Triggered by actual data changes in the source system
- Continuous Flow: Maintains persistent connection for ongoing data flow
- Examples: Salesforce PubSub, webhook-enabled systems (a sample event is sketched below)
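For example, a webhook-enabled source might deliver an event like the one below the moment a record changes. This is an illustrative sketch; the field names do not match any specific vendor's payload format:

{
  "event": "contact.updated",
  "objectId": "12345",
  "occurredAt": "2024-01-15T09:45:00Z",
  "changedFields": ["email", "lastName"]
}

Outrun ingests the referenced record as soon as such an event arrives, rather than waiting for a scheduled job.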
Batch Jobs
For sources without real-time capabilities, Outrun runs periodic batch jobs:
- Scheduled Collection: Jobs run at configured intervals
- Comprehensive Sweep: Aims to import all available data
- Configurable Timing: Frequency depends on your settings and source capabilities
- Incremental Updates: Only collects changed data after the initial sync (see the sketch after this list)
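Conceptually, after the initial sync each batch run only needs to ask the source for records changed since the previous run. Below is a minimal sketch of the state a batch job might track between runs; the keys shown are illustrative, not Outrun's internal schema:

{
  "sourceId": "hubspot_abc123",
  "pollingIntervalMinutes": 60,
  "lastRunAt": "2024-01-15T09:30:00Z",
  "cursor": {
    "modifiedSince": "2024-01-15T09:30:00Z"
  }
}

On each run the cursor advances, so only records modified after the previous run are requested.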
Real-Time Streams
- Salesforce PubSub - Enterprise/Unlimited editions
- Webhook Systems - Event-driven notifications
- Change Streams - Database change logs
- Event APIs - Real-time event feeds
Batch Jobs
- HubSpot - 60-minute polling intervals
- Zoho CRM - Conservative rate-limited batches
- Confluence - Content change detection
- Google Search Console - Daily analytics collection
Stream Storage Architecture
All ingested data is stored in dedicated stream collections that preserve the original format while adding essential metadata.
Stream Collection Naming
[sourceId]_stream
Each source gets its own dedicated stream collection:
- hubspot_abc123_stream - HubSpot source data
- salesforce_def456_stream - Salesforce source data
- zoho_ghi789_stream - Zoho CRM source data
Data Storage Principles
First-In, First-Out (FIFO)
- Chronological Order: Data stored in the order it was received
- Temporal Integrity: Maintains timeline of data changes
- Audit Trail: Complete history of all data ingestion
- Processing Order: Ensures consistent processing sequence
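In practice this means a stream collection can always be read back in arrival order, for example by sorting on the ingestedAt metadata field described later on this page. A minimal sketch of two consecutive records, with the source fields abbreviated:

[
  { "id": "12345", "_outrun": { "ingestedAt": "2024-01-15T10:30:00Z" } },
  { "id": "12346", "_outrun": { "ingestedAt": "2024-01-15T10:30:05Z" } }
]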
Original Format Preservation
- Minimal Transformation: Data stored as close to API response format as possible
- Native Structure: Preserves source system's data structure and relationships
- Field Names: Original field names and data types maintained
- Nested Objects: Complex object structures preserved intact
Metadata Enrichment
Outrun appends system metadata without altering source data:
{
// Original source data (unchanged)
"id": "12345",
"email": "john@example.com",
"firstName": "John",
"lastName": "Doe",
// Outrun metadata (appended)
"_outrun": {
"sourceId": "hubspot_abc123",
"sourceType": "hubspot",
"ingestedAt": "2024-01-15T10:30:00Z",
"apiEndpoint": "/contacts/v1/contact/12345",
"processed": false,
"processingAttempts": 0,
"lastModified": "2024-01-15T09:45:00Z"
}
}
Metadata Fields
Outrun adds comprehensive metadata to track data lineage and processing:
Source Information
- sourceId: Unique identifier for the source instance
- sourceType: Type of source system (hubspot, salesforce, etc.)
- apiEndpoint: Specific API endpoint used for data collection
- objectType: Native object type from source system
Timing Information
- ingestedAt: When Outrun received the data
- lastModified: When the data was last modified in the source system
- syncJobId: Identifier for the sync job that collected this data
- batchId: Batch identifier for grouped operations
Processing Status
- processed: Whether data has been processed into standardized objects
- processingAttempts: Number of processing attempts
- processingErrors: Any errors encountered during processing
- consolidatedAt: When data was moved to the consolidation stage
Data Quality
- dataQuality: Quality score and validation results
- duplicateOf: Reference to the original record if a duplicate is detected
- validationErrors: Field-level validation issues
- enrichmentStatus: Status of any data enrichment processes
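Taken together, a fully populated metadata block for a record that has already been processed might look like the following. The values and exact formats (quality score, status strings, identifiers) are illustrative:

"_outrun": {
  "sourceId": "hubspot_abc123",
  "sourceType": "hubspot",
  "apiEndpoint": "/contacts/v1/contact/12345",
  "objectType": "contact",
  "ingestedAt": "2024-01-15T10:30:00Z",
  "lastModified": "2024-01-15T09:45:00Z",
  "syncJobId": "job_20240115_1030",
  "batchId": "batch_001",
  "processed": true,
  "processingAttempts": 1,
  "processingErrors": [],
  "consolidatedAt": "2024-01-15T10:31:12Z",
  "dataQuality": { "score": 0.97 },
  "duplicateOf": null,
  "validationErrors": [],
  "enrichmentStatus": "complete"
}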
Value-Added Services
The metadata-enriched stream data enables powerful value-added services:
Data Lineage Tracking
- Complete History: Track data from source to destination
- Change Attribution: Identify what caused data changes
- Impact Analysis: Understand downstream effects of changes
- Compliance Auditing: Meet regulatory audit requirements
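Because every stream record carries this metadata, lineage questions become simple queries against a stream collection. Below is an illustrative sketch of a query that reconstructs the full ingestion history of a single source record; the query syntax is not a specific database API:

{
  "collection": "hubspot_abc123_stream",
  "filter": { "id": "12345" },
  "sort": { "_outrun.ingestedAt": "ascending" }
}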
Data Quality Monitoring
- Quality Metrics: Track data quality scores over time
- Validation Reporting: Identify common data quality issues
- Trend Analysis: Monitor data quality improvements
- Alert Systems: Notify of significant quality degradation
Performance Analytics
- Ingestion Rates: Monitor data collection performance
- Processing Times: Track how long data takes to process
- Error Rates: Identify and resolve ingestion issues
- Capacity Planning: Plan for scaling data operations
Advanced Processing
- Machine Learning: Use historical data for predictive analytics
- Data Enrichment: Enhance data with external sources
- Anomaly Detection: Identify unusual data patterns
- Custom Transformations: Apply business-specific data rules
Ingestion Configuration
Batch Job Settings
Configure how batch jobs collect data based on your needs (a sample configuration is sketched after this list):
- Polling Interval: How frequently to check for new data
- Backfill Period: How far back to collect historical data
- Rate Limiting: Respect source system API limits
- Error Handling: Retry logic and failure recovery
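A batch-ingested source might be configured along these lines. This is a sketch assuming a JSON-style settings format; the keys and defaults are illustrative rather than Outrun's exact configuration schema:

{
  "ingestion": {
    "mode": "batch",
    "pollingIntervalMinutes": 60,
    "backfillDays": 90,
    "rateLimit": { "maxRequestsPerMinute": 100 },
    "errorHandling": { "maxRetries": 3, "retryBackoffSeconds": 30 }
  }
}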
Real-Time Stream Settings
Configure real-time data collection parameters (a sample configuration is sketched after this list):
- Event Filtering: Which events to capture
- Buffer Settings: How to handle high-volume streams
- Failover Logic: Fall back to batch jobs if the stream fails
- Monitoring: Health checks and performance metrics
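A stream-based source might be configured similarly, again as an illustrative sketch rather than the exact schema:

{
  "ingestion": {
    "mode": "stream",
    "eventFilter": ["contact.created", "contact.updated"],
    "buffer": { "maxEvents": 10000, "flushIntervalSeconds": 5 },
    "failover": { "fallbackToBatch": true, "pollingIntervalMinutes": 60 },
    "monitoring": { "healthCheckIntervalSeconds": 30 }
  }
}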
Best Practices
Initial Setup
- Start Small: Begin with recent data to test ingestion
- Monitor Performance: Watch for rate limit issues
- Validate Data: Ensure data quality meets expectations
- Gradual Expansion: Increase scope after successful testing
Ongoing Management
- Regular Monitoring: Check ingestion health and performance
- Error Review: Address ingestion errors promptly
- Capacity Planning: Monitor storage and processing needs
- Performance Optimization: Adjust settings based on usage patterns
Data Quality
- Source Validation: Ensure source data quality before ingestion
- Metadata Review: Use metadata for data quality insights
- Error Analysis: Investigate and resolve recurring issues
- Continuous Improvement: Refine ingestion based on learnings
Next Steps
Learn About Standardization
Discover how raw ingested data transforms into standardized objects.
Standardization Process →

Ingestion is the foundation of reliable data synchronization. Understanding this process helps you optimize your data collection strategy.