Dallas Crilley
Throughline case study
Architecture

Why I Reduced Integration Code by 40× — and Then Stopped There

The four-method connector contract that turned 2,000-line integrations into 50-line entry points — and the restraint required to keep it at four.

The problem

We ran a PR and podcast production company on four disconnected SaaS tools: QuickBooks (billing), Copper (CRM), Basecamp (projects), and PandaDoc (contracts). Reporting meant manual exports. Adding a new data source historically cost roughly 2,000 lines of bespoke Python — auth, pagination, rate-limit handling, field mapping, upsert logic, and the inevitable retry loop that only fails at 2 AM.

After the third integration, I noticed the pattern: every connector solved the same five problems, but each solved them differently. Different auth patterns. Different pagination state machines. Different error-handling philosophies. Different retry strategies that were either too aggressive (rate-limit death spirals) or too passive (data gaps that surface in the QBR).

The real cost wasn't the first 2,000 lines. It was the 400 lines of subtle divergence that made every subsequent connector harder to reason about.

The contract

I reduced the surface area to four methods:

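from __future__ import annotations  # lets the annotations below name types defined elsewhere

from abc import ABC, abstractmethod
from datetime import datetime
from typing import Iterator

# Credentials, AuthContext, Entity, NormalizedRecord, and DestinationType
# are project types that are not shown in this excerpt.
class Connector(ABC):
    @abstractmethod
    def authenticate(self, credentials: Credentials) -> AuthContext: ...

    @abstractmethod
    def fetch_entities(self, entity_type: str, since: datetime | None) -> Iterator[Entity]: ...

    @abstractmethod
    def transform(self, raw: Entity) -> NormalizedRecord: ...

    @abstractmethod
    def supports_destination(self, dest: DestinationType) -> bool: ...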

That's it. No sync(), no upsert(), no health_check(). Those live in the engine, not the connector.
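To make the contract concrete, here is a minimal sketch of a conforming connector. Everything beyond the four method signatures is an assumption for illustration: the fields on the stand-in types, the InMemoryConnector class, and the sample rows are hypothetical and are not taken from the production code.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Any, Iterator


# Stand-in types. The article names them but does not show their fields,
# so these shapes are assumptions.
@dataclass
class Credentials:
    api_key: str


@dataclass
class AuthContext:
    token: str


Entity = dict[str, Any]  # raw payload as returned by the source


@dataclass
class NormalizedRecord:
    source_id: str
    source_system: str
    entity_type: str
    fields: dict[str, Any]


class DestinationType(Enum):
    POSTGRES = "postgres"
    AIRTABLE = "airtable"


class Connector(ABC):
    @abstractmethod
    def authenticate(self, credentials: Credentials) -> AuthContext: ...
    @abstractmethod
    def fetch_entities(self, entity_type: str, since: datetime | None) -> Iterator[Entity]: ...
    @abstractmethod
    def transform(self, raw: Entity) -> NormalizedRecord: ...
    @abstractmethod
    def supports_destination(self, dest: DestinationType) -> bool: ...


class InMemoryConnector(Connector):
    """Hypothetical connector backed by a list instead of a vendor API."""

    def __init__(self, rows: list[Entity]) -> None:
        self._rows = rows

    def authenticate(self, credentials: Credentials) -> AuthContext:
        return AuthContext(token=f"session-for-{credentials.api_key}")

    def fetch_entities(self, entity_type: str, since: datetime | None) -> Iterator[Entity]:
        # since=None means a full sync; otherwise only rows changed after it.
        for row in self._rows:
            if row["type"] == entity_type and (since is None or row["updated_at"] > since):
                yield row

    def transform(self, raw: Entity) -> NormalizedRecord:
        return NormalizedRecord(
            source_id=str(raw["id"]),
            source_system="in_memory",
            entity_type=raw["type"],
            fields={"name": raw["name"]},
        )

    def supports_destination(self, dest: DestinationType) -> bool:
        return dest in {DestinationType.POSTGRES, DestinationType.AIRTABLE}


rows = [
    {"id": 1, "type": "contact", "name": "Ada", "updated_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "type": "contact", "name": "Lin", "updated_at": datetime(2024, 6, 1, tzinfo=timezone.utc)},
]
connector = InMemoryConnector(rows)
cutoff = datetime(2024, 3, 1, tzinfo=timezone.utc)
records = [connector.transform(r) for r in connector.fetch_entities("contact", cutoff)]
```

Note what is absent: the connector never writes anywhere, never retries, and never stores a checkpoint. It only describes its source.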

Why four methods, not three or five?

Three methods (auth, fetch, transform) would have tempted connector authors to embed destination-specific write logic inside transform. I've seen this mistake before: a "generic" ETL where the Salesforce connector returns a dict and the PostgreSQL connector returns a SQLAlchemy model, and the sync engine has to branch on type. That is not abstraction; it is indirection.

Five methods (adding health_check or validate) would have encouraged connector authors to build mini-services. A health check that only pings the API is useless; a health check that validates schema drift against the live API is valuable but expensive to maintain across every connector. I moved health validation into the engine's pre-flight runner, which runs a lightweight fetch-and-discard against each connector on startup.
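A fetch-and-discard pre-flight can be sketched in a few lines. This is an assumption about its shape, not the engine's actual runner: the preflight function, the sample_size parameter, and the two fake connectors are hypothetical. The idea is to pull a small sample through fetch_entities and transform, throw the result away, and record whether anything raised.

```python
from itertools import islice
from typing import Any


def preflight(connectors: dict[str, Any], entity_type: str, sample_size: int = 1) -> dict[str, str]:
    """Fetch a few records per connector, transform them, and discard the output."""
    report: dict[str, str] = {}
    for name, connector in connectors.items():
        try:
            # islice stops after sample_size records, so the check stays cheap.
            for raw in islice(connector.fetch_entities(entity_type, None), sample_size):
                connector.transform(raw)
            report[name] = "ok"
        except Exception as exc:  # pre-flight reports every failure, it does not raise
            report[name] = f"failed: {exc}"
    return report


class GoodConnector:
    """Hypothetical connector whose source is reachable."""

    def fetch_entities(self, entity_type, since):
        yield {"id": 1}

    def transform(self, raw):
        return {"source_id": str(raw["id"])}


class BrokenConnector:
    """Hypothetical connector whose credentials no longer work."""

    def fetch_entities(self, entity_type, since):
        raise PermissionError("token revoked")

    def transform(self, raw):
        return raw


report = preflight({"good": GoodConnector(), "broken": BrokenConnector()}, "contact")
```

Because the check reuses the two methods every connector already has, it needs no fifth method on the contract.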

Four methods force a clean separation: the connector knows the source schema and auth; the engine knows sync semantics, retry policy, checkpointing, and dual-destination routing.

The engine

The sync engine (SyncEngine, 545 LOC) handles the generic loop:

  1. Authenticate — refresh OAuth tokens, validate scopes.
  2. Resolve dependency order — if invoices depends on customers, fetch customers first.
  3. Fetch — incremental mode passes the last checkpoint timestamp; full mode passes None.
  4. Transform — connector-specific normalization to a shared NormalizedRecord schema.
  5. Upsert — engine handles INSERT ... ON CONFLICT for PostgreSQL or batch Airtable API calls.
  6. Write checkpoint — per-connector, per-entity-type metadata in _sync_metadata.
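Steps 2 through 6 can be sketched as follows, using the standard library's graphlib for dependency ordering. This is an illustrative outline under assumptions, not the 545-line SyncEngine: the DEPENDS_ON map, the payments entity, and the function names are hypothetical, and authentication is left out for brevity.

```python
from datetime import datetime, timezone
from graphlib import TopologicalSorter

# Hypothetical dependency map: entity type -> entity types it depends on.
DEPENDS_ON = {
    "customers": set(),
    "invoices": {"customers"},
    "payments": {"invoices"},
}


def resolve_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return entity types so that every dependency comes before its dependents."""
    return list(TopologicalSorter(depends_on).static_order())


def run_sync(fetch, transform, upsert, checkpoints: dict, now: datetime) -> None:
    for entity_type in resolve_order(DEPENDS_ON):
        since = checkpoints.get(entity_type)  # None means full mode
        for raw in fetch(entity_type, since):
            upsert(transform(raw))
        checkpoints[entity_type] = now  # written only after the entity type completes


fetched = []
store = []


def fake_fetch(entity_type, since):
    fetched.append((entity_type, since))
    return [{"type": entity_type, "id": 1}]


last_run = datetime(2024, 1, 1, tzinfo=timezone.utc)
this_run = datetime(2024, 1, 2, tzinfo=timezone.utc)
checkpoints = {"customers": last_run}  # invoices and payments have never synced
run_sync(fake_fetch, lambda raw: raw, store.append, checkpoints, this_run)
order = resolve_order(DEPENDS_ON)
```

In this run customers is fetched incrementally from its stored checkpoint, while invoices and payments fall back to a full fetch because no checkpoint exists for them yet.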

Retry policy

The engine implements exponential backoff with jitter, but with two connector-aware overrides:

  • Rate-limit headers: If the connector exposes Retry-After (or the vendor equivalent), the engine respects it exactly. Generic jitter is a guess; a header is a promise.
  • Permanent vs. transient classification: The engine maintains a registry of known permanent errors (auth revoked, entity not found) and known transient errors (timeout, 503, rate limit). Unknown errors are treated as transient for the first two retries, then escalated to permanent. Connectors can override classification via a classify_error hook, but the default behavior prevents connector authors from having to think about retry taxonomy.
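The two overrides can be sketched like this. The error codes, the function names, and the choice of "full jitter" (a uniform delay between zero and the capped exponential) are assumptions for illustration; the article does not say which jitter variant or which base and cap values the engine uses.

```python
import random

# Hypothetical error codes standing in for the engine's registry.
PERMANENT = {"auth_revoked", "not_found"}
TRANSIENT = {"timeout", "http_503", "rate_limited"}


def classify(error_code: str, retry_number: int) -> str:
    """Classify an error; retry_number is 1 for the first retry."""
    if error_code in PERMANENT:
        return "permanent"
    if error_code in TRANSIENT:
        return "transient"
    # Unknown errors: transient for the first two retries, then permanent.
    return "transient" if retry_number <= 2 else "permanent"


def next_delay(retry_number, retry_after=None, base=1.0, cap=60.0, rng=random.random):
    """Seconds to wait before the next attempt."""
    if retry_after is not None:
        return retry_after  # the vendor's header wins over any guess
    # Full jitter: pick uniformly between 0 and the capped exponential.
    return rng() * min(cap, base * 2 ** retry_number)


header_delay = next_delay(3, retry_after=7.0)
guessed_delay = next_delay(3)
capped_delay = next_delay(10, rng=lambda: 1.0)
```

A classify_error hook on a connector would simply be consulted before the default classify function.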

Dual-destination routing

One requirement I refused to compromise: the ops team wanted Airtable (familiar UI, fast filtering) and the analytics pipeline wanted PostgreSQL (real SQL, BI tooling). Most ETL tools force you to pick one primary destination and treat the other as a slow replica.

I split the write path:

  • PostgreSQL: Full schema, foreign keys, JSONB for extensibility. Used by the Next.js dashboard and Looker Studio.
  • Airtable: Flattened schema, human-readable field names, link fields for relational views. Used by ops for daily triage.

The engine writes to both destinations in parallel. If Airtable rate-limits, PostgreSQL still commits and the checkpoint advances. Airtable catches up on the next sync interval. This means ops might see a 30-second delay, but analytics never stalls because of a UI tool's rate limit.
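One way to get this behavior is to track progress per destination, so a deferred Airtable write does not hold back PostgreSQL. The sketch below assumes that design; the article does not say how the engine records which destination is behind, and the write_all function, the RateLimited exception, and the writer callables are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


class RateLimited(Exception):
    """Hypothetical error raised by a writer when the destination throttles."""


def write_all(batch, writers, checkpoints, batch_id):
    """Write one batch to every destination in parallel.

    writers maps a destination name to a callable that takes the batch.
    A destination's checkpoint advances only if its own write succeeded.
    """
    with ThreadPoolExecutor(max_workers=len(writers)) as pool:
        futures = {name: pool.submit(fn, batch) for name, fn in writers.items()}
    # Leaving the with-block waits for every write to finish.
    results = {}
    for name, future in futures.items():
        try:
            future.result()  # re-raises whatever the writer raised
            checkpoints[name] = batch_id
            results[name] = "committed"
        except RateLimited:
            results[name] = "deferred"  # retried on the next sync interval
    return results


pg_rows = []


def write_postgres(batch):
    pg_rows.extend(batch)


def write_airtable(batch):
    raise RateLimited("429 from Airtable")


checkpoints = {"postgres": 0, "airtable": 0}
results = write_all(
    [{"id": 1}],
    {"postgres": write_postgres, "airtable": write_airtable},
    checkpoints,
    batch_id=1,
)
```

After this call the PostgreSQL checkpoint is at batch 1 and Airtable's is still at 0, which is what lets Airtable pick up the missed batch later.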

What I cut

1. Schema migration in connectors

Connectors do not manage their own schema. The engine owns the NormalizedRecord schema and the migration path. When a connector author adds a new field, they update the transform method and submit a PR. The engine's migration runner (Alembic) handles the rest. This prevents the "each connector has its own migration tool" nightmare I've seen in larger ETL platforms.

2. Real-time streaming

The engine is batch-oriented with configurable intervals (15 min, 1 hr, daily). Real-time streaming would require persistent connections, webhook validation, and operational complexity that this team's size could not support. The 15-minute sync interval is a business-acceptable tradeoff; real-time would be an engineering vanity project.

3. Bidirectional sync

Write-back to source systems is intentionally unsupported. The contract has no push_entities method. Bidirectional sync introduces conflict resolution, field-level locking, and merge semantics that explode complexity. If ops needs to update a Copper contact, they use Copper's UI. The sync engine reads the change on the next interval.

Failure modes I designed for

API schema drift

Copper changed their company_id field from integer to string in 2024. The connector's transform method caught the type mismatch during local testing (the engine runs a dry-run mode that validates transform output against the NormalizedRecord schema). The fix was 3 lines in the Copper connector; the engine required no changes.
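A dry-run check of this kind can be very small. The sketch below is an assumption about its shape: the SCHEMA mapping, the field names other than company_id, and the sample payloads are hypothetical, and the real NormalizedRecord schema is not shown in this post.

```python
# Hypothetical slice of the normalized schema: field name -> expected type.
SCHEMA = {"source_id": str, "company_id": int, "name": str}


def validate(record: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the record conforms."""
    problems = []
    for field, expected in SCHEMA.items():
        if field not in record:
            problems.append(f"{field}: missing")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    return problems


def transform(raw: dict) -> dict:
    # Passes company_id through unchanged, which is how a vendor-side type
    # change would leak into the normalized record.
    return {"source_id": str(raw["id"]), "company_id": raw["company_id"], "name": raw["name"]}


def dry_run(raw_samples: list[dict]) -> dict[int, list[str]]:
    """Transform each sample and collect problems by sample index; nothing is written."""
    report = {}
    for index, raw in enumerate(raw_samples):
        problems = validate(transform(raw))
        if problems:
            report[index] = problems
    return report


before_change = {"id": 7, "company_id": 12345, "name": "Acme"}
after_change = {"id": 7, "company_id": "12345", "name": "Acme"}
report = dry_run([before_change, after_change])
```

Only the second sample is flagged, and the message points at the connector's transform rather than at the engine.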

Auth token expiration mid-sync

The engine checkpoints after every entity batch, not just at the end of a connector run. If a 10,000-record sync fails at record 8,247 because the OAuth token expired, the next run refreshes the token and resumes from the last checkpointed batch rather than from record 1. Records in the interrupted batch may be fetched again, which the upsert absorbs. The checkpoint granularity prevents the "re-sync everything" death spiral.
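The resume behavior can be simulated with the numbers from this scenario. The batch size of 500, the function names, and the in-memory checkpoint are assumptions for illustration. With that batch size, a failure at record 8,247 leaves the checkpoint at the last completed batch boundary (8,000 records written), so the second run starts at record 8,001.

```python
from itertools import islice


class TokenExpired(Exception):
    """Hypothetical error raised when the OAuth token expires mid-sync."""


def batched(items, size):
    """Yield consecutive lists of at most size items."""
    iterator = iter(items)
    while chunk := list(islice(iterator, size)):
        yield chunk


def sync(records, start_offset, batch_size, write_batch, save_checkpoint):
    """Write records from start_offset onward, checkpointing after each batch."""
    offset = start_offset
    for chunk in batched(records[offset:], batch_size):
        write_batch(chunk)       # may raise; the checkpoint then stays where it was
        offset += len(chunk)
        save_checkpoint(offset)  # advance only after the batch is durably written
    return offset


records = list(range(1, 10_001))
written = []
token = {"expired_once": False}
checkpoint = {"offset": 0}


def write_batch(chunk):
    # Simulate the token expiring while the batch containing record 8,247 is written.
    if 8247 in chunk and not token["expired_once"]:
        token["expired_once"] = True
        raise TokenExpired("token expired mid-batch")
    written.extend(chunk)


def save_checkpoint(offset):
    checkpoint["offset"] = offset


try:
    sync(records, checkpoint["offset"], 500, write_batch, save_checkpoint)
except TokenExpired:
    pass  # the next scheduled run picks up from the stored checkpoint

resume_from = checkpoint["offset"]
sync(records, resume_from, 500, write_batch, save_checkpoint)
```

The second run processes 2,000 records instead of 10,000, and the interrupted batch is replayed in full.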

Duplicate detection across partial runs

The upsert logic uses a composite key of (source_id, source_system, entity_type). This means a record from Copper's contacts table and a record from QuickBooks' customers table can both exist without collision, even if they represent the same person. Identity resolution ("this Copper contact is that QuickBooks customer") is a separate, explicit pipeline step — not an implicit assumption baked into the sync engine.
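The composite key can be demonstrated with a runnable upsert. To keep the example self-contained it uses Python's built-in sqlite3 module in place of PostgreSQL; SQLite accepts a similar INSERT ... ON CONFLICT ... DO UPDATE form. The table name, the name column, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE records (
        source_id     TEXT NOT NULL,
        source_system TEXT NOT NULL,
        entity_type   TEXT NOT NULL,
        name          TEXT,
        PRIMARY KEY (source_id, source_system, entity_type)
    )
    """
)

UPSERT = """
INSERT INTO records (source_id, source_system, entity_type, name)
VALUES (?, ?, ?, ?)
ON CONFLICT (source_id, source_system, entity_type)
DO UPDATE SET name = excluded.name
"""


def upsert(row):
    """Insert the row, or update it in place if the composite key already exists."""
    conn.execute(UPSERT, row)


# The same source_id from two systems does not collide.
upsert(("42", "copper", "contact", "Ada L."))
upsert(("42", "quickbooks", "customer", "Ada Lovelace"))
# Replaying a Copper record after a partial run updates it instead of duplicating it.
upsert(("42", "copper", "contact", "Ada Lovelace"))

rows = conn.execute(
    "SELECT source_system, entity_type, name FROM records ORDER BY source_system"
).fetchall()
```

Two rows remain after three writes. Nothing in the table claims the two rows are the same person; that link belongs to the identity-resolution step.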

Honest limits

  • The propagation engine hardcodes platforms. core/propagation_engine.py knows about QuickBooks, Copper, and PandaDoc explicitly. This violates the plugin architecture and is the most important technical-debt item.
  • Observability is weak. No Prometheus, no structured logging pipeline, no alert on sync failure. Failures are visible in the dashboard but not paged.
  • Deployment is manual. The deploy process takes ~20 minutes and rollback coverage is limited.

The numbers

Metric                          Before    After
New connector lines of code     ~2,000    ~50 (entry point)
API calls per day (QuickBooks)  11,548    3,395 (69% reduction via incremental sync)
Time to add a new data source   ~2 weeks  ~2 days
Connectors in production        0         5

The 69% API reduction comes from the project documentation and has not been independently instrumented. The connector LOC reduction is directly verifiable: the connector template and all five production connectors conform to the 50-line target for the entry-point class.

What this proves

This is not a "look how clever my abstraction is" post. The four-method contract is obvious in hindsight. The point is that I shipped it, maintained it across five live integrations, and resisted the temptation to add a sixth method every time a new edge case appeared.

The staff-level signal is not the design. It is the restraint.

Want to see the connector interface, the retry queue, or the dual-destination routing in production code? I will walk you through it on a call.