Data Integration for AI
- Why data integration is the foundation of reliable AI automation
- Standardisation patterns for multi-CRM and multi-tool environments
- How object mapping and field normalisation work in practice
- Strategies for keeping AI workflows fed with fresh, consistent data
AI workflows are only as good as the data they receive. Feed an LLM inconsistent field names, mixed formats, and stale records, and the output is unreliable no matter how good the prompt is. Data integration - getting clean, normalised data from your tools into your AI pipelines - is the foundation that everything else depends on.
The Data Problem in AI Workflows
Most businesses use multiple tools that describe the same concepts differently:
| Concept | Salesforce | HubSpot | Pipedrive | Zoho |
|---|---|---|---|---|
| A person | Contact | Contact | Person | Contact |
| A company | Account | Company | Organization | Account |
| A deal | Opportunity | Deal | Deal | Potential |
| A note | Activity | Note | Note | Note |
| Deal stage | StageName | dealstage | stage_id | Stage |
When your AI workflow processes a deal update, the prompt needs to understand the data structure. If the workflow connects to Salesforce, the deal stage field is StageName. For HubSpot, it's dealstage. For Pipedrive, it's stage_id - and it's a numeric ID that maps to a label through a separate API call.
Without a normalisation layer, every AI prompt must be provider-specific. You end up maintaining four versions of every workflow, or worse, a tangle of if/else conditions in every node.
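That tangle looks something like this in practice - a hypothetical sketch, with field names taken from the table above:

```javascript
// Hypothetical per-provider branching that every node would need
// without a normalisation layer.
function getDealStage(provider, deal) {
  switch (provider) {
    case 'salesforce':
      return deal.StageName;
    case 'hubspot':
      return deal.dealstage;
    case 'pipedrive':
      // stage_id is numeric; resolving it to a label needs a separate API call
      return String(deal.stage_id);
    default:
      throw new Error(`Unknown provider: ${provider}`);
  }
}
```

Every node that touches deal data needs a copy of this branching, and every new provider widens it.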
Standardised Object Models
The solution is a standardisation layer that maps provider-specific schemas to a common object model:
{
"standardObject": "Deal",
"standardFields": {
"id": { "type": "string", "required": true },
"title": { "type": "string", "required": true },
"value": { "type": "number", "required": false },
"currency": { "type": "string", "default": "USD" },
"stage": { "type": "string", "required": true },
"owner": { "type": "string", "required": false },
"company": { "type": "string", "required": false },
"contacts": { "type": "array", "items": "string" },
"createdAt": { "type": "datetime", "required": true },
"updatedAt": { "type": "datetime", "required": true }
}
}
Each provider gets a mapping configuration that transforms its native format to the standard model:
{
"provider": "salesforce",
"object": "Opportunity",
"fieldMappings": {
"id": "Id",
"title": "Name",
"value": "Amount",
"currency": "CurrencyIsoCode",
"stage": "StageName",
"owner": "OwnerId",
"company": "AccountId",
"contacts": "ContactRoles[].ContactId",
"createdAt": "CreatedDate",
"updatedAt": "LastModifiedDate"
}
}
With this mapping in place, your AI workflow works with the standard Deal object regardless of which CRM is connected. The prompt references deal.stage and deal.value - never StageName or Amount.
A standardised object model decouples your AI workflows from specific tool implementations. Write your prompts and conditions against the standard model once, and the mapping layer handles provider-specific translation. This means you can swap CRMs, add new tools, or support multiple tools simultaneously without rewriting any workflow logic.
The Ingestion Pipeline
Data flows from source tools into your standardised model through an ingestion pipeline:
[Source API] → [Extract] → [Transform] → [Validate] → [Store] → [Available for AI]
Extract
Pull data from the source tool via API. Two strategies:
Webhook-driven (real-time). The source tool pushes change events as they happen. You receive a webhook payload, identify the changed object, and process it immediately.
{
"event": "deal.updated",
"provider": "hubspot",
"payload": {
"dealId": "12345",
"properties": {
"dealstage": "closedwon",
"amount": "50000"
}
}
}
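A receiver for payloads like the one above might look like the following minimal sketch - the mapping table and in-memory store are illustrative stand-ins, not a prescribed API:

```javascript
// Minimal sketch of a webhook receiver. The mapping table and in-memory
// store stand in for the real mapping configuration and data store.
const hubspotDealMappings = { stage: 'dealstage', value: 'amount' };
const store = new Map();

function handleDealWebhook(event) {
  if (event.event !== 'deal.updated') return null;
  const standard = { id: event.payload.dealId };
  for (const [standardField, sourceField] of Object.entries(hubspotDealMappings)) {
    standard[standardField] = event.payload.properties[sourceField];
  }
  // The example payload carries the amount as a string; coerce it to the
  // standard model's numeric type
  standard.value = Number(standard.value);
  store.set(standard.id, standard);
  return standard;
}

const standard = handleDealWebhook({
  event: 'deal.updated',
  provider: 'hubspot',
  payload: { dealId: '12345', properties: { dealstage: 'closedwon', amount: '50000' } }
});
// standard => { id: '12345', stage: 'closedwon', value: 50000 }
```

A production receiver would also verify the webhook signature and deduplicate retried deliveries.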
Poll-based (batch). Periodically query the source API for changes since the last sync. This is simpler but introduces latency.
// Timestamp of the last successful sync for this provider/object pair
const lastSync = await getLastSyncTimestamp(provider, object);
// Fetch only records modified since that point
const changes = await provider.getModifiedSince(object, lastSync);
Webhook-driven is preferred for AI workflows because the AI needs current data to make good decisions. A lead classification based on data that is 15 minutes stale might miss a critical update.
Transform
Apply the field mappings to convert the source-specific payload to the standard model:
function transformToStandard(payload, mappings) {
const standardObject = {};
for (const [standardField, sourceField] of Object.entries(mappings)) {
if (sourceField.includes('[].')) {
// Handle array fields (e.g., "ContactRoles[].ContactId")
const [arrayField, subField] = sourceField.split('[].');
standardObject[standardField] = (payload[arrayField] || [])
.map(item => item[subField]);
} else {
standardObject[standardField] = payload[sourceField];
}
}
return standardObject;
}
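Applied to the Salesforce mapping shown earlier, the function behaves like this - the function is repeated so the snippet runs standalone, and the payload values are illustrative:

```javascript
function transformToStandard(payload, mappings) {
  const standardObject = {};
  for (const [standardField, sourceField] of Object.entries(mappings)) {
    if (sourceField.includes('[].')) {
      // Handle array fields (e.g., "ContactRoles[].ContactId")
      const [arrayField, subField] = sourceField.split('[].');
      standardObject[standardField] = (payload[arrayField] || [])
        .map(item => item[subField]);
    } else {
      standardObject[standardField] = payload[sourceField];
    }
  }
  return standardObject;
}

// A subset of the Salesforce mapping, with an illustrative payload
const salesforceMappings = {
  title: 'Name',
  value: 'Amount',
  stage: 'StageName',
  contacts: 'ContactRoles[].ContactId'
};

const payload = {
  Name: 'Acme renewal',
  Amount: 50000,
  StageName: 'Negotiation',
  ContactRoles: [{ ContactId: '003A' }, { ContactId: '003B' }]
};

const deal = transformToStandard(payload, salesforceMappings);
// deal => { title: 'Acme renewal', value: 50000,
//           stage: 'Negotiation', contacts: ['003A', '003B'] }
```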
Validate
Check the transformed object against the standard schema. Missing required fields, type mismatches, and out-of-range values are caught here - not downstream in the AI node.
function validate(standardObject, schema) {
  const errors = [];
  for (const [field, spec] of Object.entries(schema.standardFields)) {
    const value = standardObject[field];
    if (value == null) {
      if (spec.required) errors.push(`Missing required field: ${field}`);
      continue;
    }
    // typeof never returns "array" or "datetime", so check those explicitly
    const typeOk =
      spec.type === 'array' ? Array.isArray(value) :
      spec.type === 'datetime' ? !Number.isNaN(Date.parse(value)) :
      typeof value === spec.type;
    if (!typeOk) {
      errors.push(`Type mismatch on ${field}: expected ${spec.type}`);
    }
  }
  return { valid: errors.length === 0, errors };
}
Relationship consolidation is one of the trickiest parts of data integration. A "Contact" in Salesforce may be linked to an "Account" via an AccountId field, while the same relationship in HubSpot uses an association API. The standardisation layer needs to map both into a consistent contact.company reference. This often requires additional API calls during the transform stage to resolve foreign keys into meaningful references.
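A sketch of that consolidation - the `accountsById` lookup table stands in for data already fetched from the provider, and the HubSpot association shape is an assumption for illustration:

```javascript
// Sketch: consolidate provider-specific relationships into a standard
// contact.company reference. `accountsById` stands in for account data
// fetched from the provider during the transform stage.
function resolveCompany(contact, provider, accountsById) {
  // Salesforce links a contact to a company via an AccountId foreign key;
  // HubSpot exposes the link through its associations API (shape assumed here)
  const accountId =
    provider === 'salesforce' ? contact.AccountId :
    provider === 'hubspot' ? (contact.associations?.companies?.[0] ?? null) :
    null;
  if (!accountId) return null;
  const account = accountsById[accountId];
  // Resolve the foreign key into a meaningful reference
  return { id: accountId, name: account ? account.name : null };
}

const accounts = { '001X': { name: 'Acme Corp' } };
const sf = resolveCompany({ AccountId: '001X' }, 'salesforce', accounts);
const hs = resolveCompany({ associations: { companies: ['001X'] } }, 'hubspot', accounts);
// Both resolve to { id: '001X', name: 'Acme Corp' }
```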
Feeding Data to AI Nodes
Once data is standardised, AI nodes consume it through template references:
- id: analyse-deal
type: ai
config:
prompt: |
Analyse this deal and assess the likelihood of closing:
Deal: {{deal.title}}
Value: {{deal.value}} {{deal.currency}}
Stage: {{deal.stage}}
Owner: {{deal.owner.name}}
Company: {{deal.company.name}}
Days in current stage: {{deal.daysInStage}}
Recent activity:
{{#each deal.recentActivities}}
- {{this.date}}: {{this.type}} - {{this.summary}}
{{/each}}
The AI node doesn't know or care whether this deal came from Salesforce, HubSpot, or Pipedrive. The standardised model provides a consistent interface.
Context Enrichment
AI makes better decisions with more context. Beyond the primary object, enrich the prompt with related data:
context:
- source: deals
filter: "company == {{deal.company.id}} AND status == 'open'"
label: "Other open deals with this company"
- source: activities
filter: "contact in {{deal.contacts}} AND date > now-30d"
label: "Recent touchpoints with deal contacts"
- source: emails
filter: "related_deal == {{deal.id}}"
label: "Email thread history"
This context assembly step happens before the AI node executes. The workflow engine queries the standardised data store, assembles the context, and passes it as part of the prompt.
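That assembly step can be sketched as follows - the in-memory `dataStore` and predicate-style filters are stand-ins for the standardised data store and its query language:

```javascript
// Sketch of context assembly: query related records and render them as
// labelled prompt sections. The in-memory store and predicate filters
// stand in for the standardised data store and its filter syntax.
const dataStore = {
  deals: [
    { id: 'deal_2', company: 'acme', status: 'open', title: 'Acme expansion' }
  ],
  activities: [
    { contact: 'c1', date: '2024-06-01', summary: 'Intro call' }
  ]
};

function assembleContext(sections) {
  return sections
    .map(({ source, filter, label }) => {
      const rows = dataStore[source].filter(filter);
      const body = rows.map(r => `- ${JSON.stringify(r)}`).join('\n');
      return `${label}:\n${body || '- none'}`;
    })
    .join('\n\n');
}

const context = assembleContext([
  { source: 'deals',
    filter: d => d.company === 'acme' && d.status === 'open',
    label: 'Other open deals with this company' },
  { source: 'activities',
    filter: a => a.contact === 'c1',
    label: 'Recent touchpoints with deal contacts' }
]);
```

The resulting string is appended to the AI node's prompt alongside the primary object's fields.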
Bidirectional Sync
AI workflows don't just read data - they write it back. When an AI node classifies a lead, that classification needs to flow back to the CRM. The same standardisation layer handles reverse mapping:
// AI output: update deal stage to "negotiation"
const standardUpdate = {
object: "Deal",
id: "deal_123",
fields: { stage: "negotiation" }
};
// Reverse mapping for Salesforce
const salesforceUpdate = reverseTransform(standardUpdate, salesforceMappings);
// Result: { object: "Opportunity", id: "006xxx", fields: { StageName: "Negotiation" } }
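A minimal `reverseTransform` can invert the same mapping table. This sketch passes field values through unchanged - translating stage values to provider-specific picklist labels, and standard IDs to provider IDs, are separate steps not shown - and it skips nested array paths:

```javascript
// Sketch: invert a field-mapping table so standard-model updates become
// provider-native field names. Array paths ("[].") are skipped for brevity.
function reverseTransform(update, mappings) {
  const fields = {};
  for (const [standardField, value] of Object.entries(update.fields)) {
    const sourceField = mappings.fieldMappings[standardField];
    if (!sourceField || sourceField.includes('[].')) continue;
    fields[sourceField] = value;
  }
  return { object: mappings.object, id: update.id, fields };
}

// A subset of the Salesforce mapping shown earlier
const salesforceMappings = {
  object: 'Opportunity',
  fieldMappings: { stage: 'StageName', value: 'Amount' }
};

const update = { object: 'Deal', id: '006xxx', fields: { stage: 'Negotiation' } };
const sfUpdate = reverseTransform(update, salesforceMappings);
// sfUpdate => { object: 'Opportunity', id: '006xxx',
//               fields: { StageName: 'Negotiation' } }
```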
Bidirectional sync requires conflict resolution. If the AI updates a deal stage at the same moment a rep updates it in the CRM, which update wins? Common strategies include last-write-wins, source-of-truth priority, or flagging conflicts for human resolution.
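A last-write-wins resolver is the simplest of these, comparing the standard model's `updatedAt` timestamps - a minimal sketch:

```javascript
// Sketch of last-write-wins conflict resolution using the standard model's
// updatedAt timestamp. Field-level merging and human review queues are
// alternatives not shown here.
function resolveConflict(local, remote) {
  return new Date(remote.updatedAt) > new Date(local.updatedAt) ? remote : local;
}

const aiUpdate  = { stage: 'negotiation', updatedAt: '2024-06-01T10:00:00Z' };
const repUpdate = { stage: 'closedwon',   updatedAt: '2024-06-01T10:00:05Z' };
const winner = resolveConflict(aiUpdate, repUpdate);
// winner.stage === 'closedwon' (the rep's later write wins)
```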
Outrun handles data integration natively through its standardised data models. Connect any supported CRM or tool, and Outrun maps your data to a common schema automatically. AI workflow nodes reference standard fields - so your workflows work across any connected tool without modification. The MCP server provides tool-agnostic data access for custom integrations.
Data Freshness and Caching
AI workflows need current data, but not every data point needs to be fetched in real-time. Implement a tiered freshness strategy:
| Data Type | Freshness Requirement | Strategy |
|---|---|---|
| Trigger data | Real-time | Webhook-driven |
| Primary object fields | < 1 minute | Cache with short TTL |
| Related objects | < 5 minutes | Cache with medium TTL |
| Historical context | < 1 hour | Cache with long TTL |
| Reference data (stages, owners) | < 24 hours | Daily sync |
Cache standardised objects in your fast data store. When an AI workflow needs deal data, check the cache first. If the cache is fresh, use it. If stale, fetch from source, transform, validate, cache, and then pass to the AI node.
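The read path can be sketched as a tiered cache check, with TTLs taken from the table above; `fetchFromSource` stands in for the extract-transform-validate pipeline:

```javascript
// Sketch of the cache-first read path with per-tier TTLs (values from the
// freshness table above). `fetchFromSource` stands in for the pipeline.
const TTL_MS = { primary: 60_000, related: 300_000, historical: 3_600_000 };
const cache = new Map(); // key -> { value, cachedAt }

function getWithFreshness(key, tier, fetchFromSource, now = Date.now()) {
  const entry = cache.get(key);
  if (entry && now - entry.cachedAt < TTL_MS[tier]) {
    return entry.value; // fresh enough for this tier
  }
  const value = fetchFromSource(key); // stale or missing: refetch
  cache.set(key, { value, cachedAt: now });
  return value;
}

let fetches = 0;
const fetchDeal = () => { fetches++; return { stage: 'open' }; };
const t0 = Date.now();
getWithFreshness('deal_1', 'primary', fetchDeal, t0);           // miss: fetch
getWithFreshness('deal_1', 'primary', fetchDeal, t0 + 30_000);  // fresh: hit
getWithFreshness('deal_1', 'primary', fetchDeal, t0 + 90_000);  // stale: refetch
// fetches === 2
```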
What's Next
When you serve multiple customers or teams from the same platform, data integration becomes more complex. The next guide covers multi-tenant AI patterns - how to keep data isolated while sharing workflow logic across tenants.