Coverage Impact is a free, web-based tool created by CoverageBook that empowers PR professionals to transform coverage-tracking spreadsheets into vivid, outcome-focused graphs. It allows users to upload CSV exports from major PR platforms (such as Cision, Talkwalker, and Critical Mention), detect publication dates automatically, and plot coverage volume alongside business outcomes like sales, web traffic, or search-trend data.
Rebuilt the CSV ingestion pipeline for Coverage Impact, a reporting platform that turns third-party media exports into daily coverage analytics. The new importer reads dozens of vendor formats, infers date structures on the fly, and generates contiguous day-level datapoints so comms teams can spot spikes without hand-cleaning spreadsheets.
The Challenge
Incoming CSVs from sources like Cision, Talkwalker, Critical Mention, and internal exports varied wildly: inconsistent headers, BOM/encoding issues, mixed delimiters, multi-language labels, and duplicated rows. Objectives included:
- Automatically detecting the correct header row and date column even when labels changed language/case.
- Normalizing encodings/BOMs to prevent background jobs from crashing.
- Handling US, EU, ISO, and hybrid date formats without manual configuration.
- Guaranteeing contiguous daily datapoints for dashboards while rejecting malformed uploads gracefully.
- Backing every edge case with regression tests tied to real customer fixtures.
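The encoding and BOM normalization called for above can be sketched as follows. This is an illustrative helper, not the production code: the method name `normalize_csv_bytes` is invented, and the exact cleanup rules are assumptions.

```ruby
# Hypothetical sketch: strip a UTF-8 BOM, scrub invalid byte sequences,
# and normalize Windows/old-Mac line endings before CSV parsing.
def normalize_csv_bytes(raw)
  text = raw.dup.force_encoding(Encoding::UTF_8).scrub("")
  text.sub(/\A\uFEFF/, "").gsub(/\r\n?/, "\n")
end
```

Doing this once, up front, means every downstream consumer can assume clean UTF-8 with Unix line endings.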
Technical Architecture
Inference Layer
- Graphs::Import::DateFormatInferrer scores candidate formats across ISO, year-first, US, and RoW “families,” intersecting possibilities cell-by-cell until a single pattern emerges.
- Named-header prioritization (“published on”, “news date”, etc.) plus a fallback search across the first row containing any parsable date ensures resilience when vendors rename columns.
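The cell-by-cell intersection idea can be illustrated with a minimal reimplementation. This is not the actual DateFormatInferrer; the candidate list and the `infer_date_format` helper are invented for the sketch.

```ruby
require "date"

# One representative pattern per "family"; the real inferrer covers many more.
CANDIDATE_FORMATS = [
  "%Y-%m-%d", # ISO / year-first
  "%m/%d/%Y", # US
  "%d/%m/%Y", # rest-of-world
  "%d.%m.%Y"  # EU dotted
].freeze

# Keep only the formats that parse every cell; a unique survivor wins.
def infer_date_format(cells)
  survivors = CANDIDATE_FORMATS.select do |fmt|
    cells.all? do |cell|
      Date.strptime(cell, fmt)
      true
    rescue ArgumentError
      false
    end
  end
  survivors.length == 1 ? survivors.first : nil # nil when ambiguous or no match
end
```

Note how ambiguity resolves itself over multiple rows: "01/02/2024" alone could be US or RoW, but a later "13/05/2024" eliminates the US reading.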
Import Layer
- Graphs::Import::Daily now relies solely on inferred ranges, tallies occurrences per day, clamps them between the earliest allowed import date and today, and raises UnsupportedCsvFormatError if no clean range exists, preventing silent data drift.
- File preprocessing strips BOMs, coerces everything to UTF-8, and normalizes line endings so downstream parsing always sees consistent input.
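The tally-clamp-and-fail behavior can be sketched like this. The `daily_tallies` helper and its keyword arguments are assumptions; only the error class name comes from the description above.

```ruby
require "date"

class UnsupportedCsvFormatError < StandardError; end

# Count occurrences per day, clamp to the allowed window, and fail loudly
# when nothing usable remains rather than emitting a misleading chart.
def daily_tallies(dates, earliest_allowed:, today: Date.today)
  in_range = dates.select { |d| d >= earliest_allowed && d <= today }
  raise UnsupportedCsvFormatError, "no parsable dates in range" if in_range.empty?

  # Contiguous day-level datapoints: zero-fill days with no coverage.
  (in_range.min..in_range.max).to_h { |day| [day, in_range.count(day)] }
end
```

Zero-filling the gaps is what lets dashboards plot a continuous daily series instead of a sparse one.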
Validation Layer
- Expanded fixture library covering oddities such as Cision’s two-digit years, non-standard US strings, multi-word/snake_case headers, foreign-language headings, and files with extra metadata blocks.
- ActiveJob test suite replays each fixture end-to-end, asserting pivot counts, day ranges, and per-day tallies to catch regressions before deployment.
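The fixture-replay pattern looks roughly like the following. The fixture content and expected counts here are invented; the real suite runs whole customer exports through the ActiveJob pipeline rather than a bare helper.

```ruby
require "csv"

# A tiny stand-in for a customer fixture file.
FIXTURE_CSV = <<~CSV
  published on,headline
  2024-01-01,Launch coverage
  2024-01-01,Follow-up piece
  2024-01-02,Analyst note
CSV

# Replay the fixture end-to-end and return per-day tallies for assertion.
def replay_fixture(csv_text)
  rows = CSV.parse(csv_text, headers: true)
  rows.map { |r| r["published on"] }.tally
end
```

Asserting on pivot counts and per-day tallies (rather than raw row counts) is what catches subtle regressions like an off-by-one in date clamping.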
Key Features
- Heuristic date detection that converges within the first few rows, enabling schema-less ingestion.
- Automatic header recovery for files with preambles, merged cells, or localized labels.
- Encoding/BOM sanitation that eliminated a class of job crashes in production.
- Fixture-driven regression tests tied to real customer exports, ensuring confidence when adding new vendors.
- Graceful failure mode with explicit errors when a file can’t be parsed, giving support teams actionable feedback.
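The header-recovery feature can be approximated as a two-pass scan: prefer a row containing a known date-column label, then fall back to the row just above the first parsable date. The `find_header_row` helper and the label list are assumptions for illustration.

```ruby
require "date"

# Known date-column labels, lowercased; the production list is larger
# and covers more languages.
DATE_HEADERS = ["published on", "news date", "date", "datum", "fecha"].freeze

def find_header_row(rows)
  # Pass 1: a row whose cells include a recognized date-column label.
  rows.each_with_index do |row, i|
    labels = row.compact.map { |c| c.to_s.strip.downcase }
    return i if (labels & DATE_HEADERS).any?
  end
  # Pass 2: assume the header sits just above the first parsable date.
  rows.each_with_index do |row, i|
    return [i - 1, 0].max if row.compact.any? { |c| Date.parse(c.to_s) rescue false }
  end
  nil
end
```

This is why files with preamble blocks (report titles, client names) still import: the preamble rows simply fail both checks and are skipped.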
Results
- Reduced CSV ingestion failures tied to date/header issues to effectively zero across the monitored sample set.
- Enabled analysts to onboard new vendor exports without developer intervention: simply drop in the CSV and the importer adapts.
- Cut manual cleanup time for large Cision drops (50k+ rows) from hours to minutes, since daily tallies now align automatically.
- Provided product/CS teams with reproducible fixtures so they can verify fixes against customer data before rolling out.
Technologies Used
- Backend: Ruby on Rails 7.1, ActiveJob, ActiveStorage, CSV standard lib.
- Data Processing: Custom heuristics via Time.strptime, with a regex-based fuzzy parser for fallback scenarios.
- Testing: Minitest with fixture-driven integration coverage.
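The strptime-first, regex-fallback pairing mentioned above can be sketched in a few lines. The `parse_timestamp` name and the fallback pattern are invented; only the Time.strptime-plus-fuzzy-regex split comes from the description.

```ruby
require "time"

# Try the inferred format strictly; on failure, fall back to a fuzzy regex
# that pulls year/month/day out of surrounding text (assumed pattern).
def parse_timestamp(cell, format)
  Time.strptime(cell, format)
rescue ArgumentError
  m = cell.match(/(\d{4})\D(\d{1,2})\D(\d{1,2})/)
  m && Time.new(m[1].to_i, m[2].to_i, m[3].to_i)
end
```

Keeping the strict path first preserves determinism: the fuzzy regex only runs on cells the inferred format cannot explain.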
Takeaway
This project shows how far deterministic heuristics plus thorough fixture coverage can take a data ingestion pipeline. By treating every messy CSV as a first-class citizen—detecting formats, cleaning encodings, and validating behavior end-to-end—we turned an unreliable import job into a self-healing subsystem that scales with whatever data partners throw at it.