The Assumption Embedded in Most IoT Architectures
The standard IoT telemetry architecture assumes that connectivity exists. Data is collected at the device, transmitted over MQTT or HTTP, and ingested by a cloud broker in near real time. The pipeline is designed to handle brief outages — a few seconds, maybe a few minutes — with reconnection logic and an in-memory buffer. When connectivity is gone for hours, the standard architecture drops data.
This assumption holds in Western Europe, urban North America, and coastal China, where 4G coverage is effectively ubiquitous. It doesn't hold in the markets Stima is built for. In Kenya, Uganda, and Nigeria combined, roughly 35% of the land area — where a meaningful fraction of EV motorcycle routes operate — has 2G coverage only, or no mobile data coverage at all. Peri-urban and rural routes in Nairobi's outer zones, the Kampala suburbs, and routes between Lagos and Ibadan pass through coverage gaps that can last 20–40 minutes during a standard working shift.
For a battery health monitoring system, this creates a specific problem: the telemetry data most valuable for degradation modeling comes from the full discharge curve — the complete voltage and temperature trajectory from full to near-empty. If you drop 30% of the discharge data because the vehicle spent part of the cycle in a coverage gap, your degradation estimates become significantly less accurate. The missing segments are not random — they're correlated with specific route segments, which means the bias compounds over time.
Attempt 1: Store-and-Forward MQTT
Our first solution was a standard store-and-forward MQTT setup. The Raspberry Pi prototype stored outbound messages in a local queue (using a SQLite-backed FIFO). When connectivity was detected, the queue drained to the broker in order. When connectivity dropped, new readings appended to the queue.
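The store-and-forward queue can be sketched as follows. This is a minimal illustration of the pattern, not the prototype's actual code: the table schema, class name, and `publish` callback are all assumptions. The important property is that a record is deleted only after the send succeeds, so a crash or dropped link mid-drain never loses data.

```python
import json
import sqlite3
import time

class TelemetryQueue:
    """FIFO queue for outbound telemetry, backed by a local SQLite file.

    Readings are appended while offline and drained in insertion order
    once connectivity returns. Schema and names are illustrative, not
    the actual prototype layout.
    """

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS outbox ("
            "  id INTEGER PRIMARY KEY AUTOINCREMENT,"
            "  payload TEXT NOT NULL,"
            "  queued_at REAL NOT NULL)"
        )

    def append(self, reading: dict) -> None:
        # Always safe, even fully offline: the reading lands on disk first.
        self.db.execute(
            "INSERT INTO outbox (payload, queued_at) VALUES (?, ?)",
            (json.dumps(reading), time.time()),
        )
        self.db.commit()

    def drain(self, publish) -> int:
        """Send queued messages oldest-first. Each row is deleted only
        after publish() reports success, so nothing is lost mid-drain."""
        sent = 0
        for row_id, payload in self.db.execute(
            "SELECT id, payload FROM outbox ORDER BY id"
        ).fetchall():
            if not publish(json.loads(payload)):
                break  # connectivity dropped; stop and retry later
            self.db.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
            self.db.commit()
            sent += 1
        return sent
```

Note that delete-after-ack is exactly what makes the duplicate problem described below possible: if the ack is lost rather than the message, the same payload goes out twice.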
The problem was MQTT's session state management. MQTT QoS 1 (at-least-once delivery) requires the broker to acknowledge each message before the client removes it from its local queue. In conditions where connectivity was present but degraded — 2G with occasional dropped packets — the acknowledgment round-trips were unreliable. The client would send a message, not receive an ack in time, and retry. The broker would receive the same message twice. Duplicate handling at the broker level was possible but added complexity.
More critically, MQTT's persistent session feature — which is supposed to allow a client to resume a session after disconnection — behaved inconsistently across mobile network transitions. When a vehicle moved from 4G to 2G coverage and back, the session handoff caused the broker to drop in-flight QoS 1 messages about 12% of the time in our testing. That 12% loss rate was unacceptable for data that needed to be complete for accurate degradation modeling.
Attempt 2: HTTP Batch Upload
We moved to HTTP batch uploads: the device accumulates readings in a local SQLite file and uploads batches of 500 readings at a time via HTTP POST when connectivity is available. HTTP is stateless — each upload is independent, and the server can acknowledge or reject it cleanly. Duplicate handling is easy: each reading has a UUID, and the server discards known UUIDs.
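Server-side, the UUID deduplication amounts to something like the sketch below. A Python set stands in for what would be a unique-key constraint in the production database; the class and method names are illustrative. The point is that a fully re-sent batch is a no-op, which is what makes the device's retry logic safe.

```python
import uuid

class BatchIngestor:
    """Server-side dedup for the HTTP batch-upload path described above.

    An in-memory set stands in for a database unique-key constraint;
    names and structure are illustrative, not the production service.
    """

    def __init__(self):
        self.seen = set()
        self.stored = []

    def ingest(self, batch):
        # Accept each reading once; silently discard known UUIDs so a
        # re-sent batch is idempotent.
        accepted = duplicates = 0
        for reading in batch:
            if reading["uuid"] in self.seen:
                duplicates += 1
                continue
            self.seen.add(reading["uuid"])
            self.stored.append(reading)
            accepted += 1
        return accepted, duplicates
```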
The HTTP approach worked much better for data integrity. But we discovered a different problem: 500-reading batches on a 2G connection with 200 kbps peak throughput took 8–12 seconds to upload. When a vehicle was moving through intermittent coverage, a 10-second upload window often ended mid-transfer, resulting in a partial upload that the server rejected. The device then re-sent the entire batch. In areas with spotty coverage, this created a retry storm where the device spent 40% of its connectivity window re-sending data it had already partially delivered.
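A back-of-envelope check of the timing above: at a 200 kbps peak link, an 8–12 second upload of 500 readings implies roughly 400–600 bytes per JSON reading. That per-reading size is an inference from the stated numbers, not a measured payload size.

```python
# Consistency check of the 500-reading batch timing on a 2G link.
READINGS = 500
LINK_BPS = 200_000  # 200 kbps peak 2G throughput

def upload_seconds(bytes_per_reading):
    # Total bits on the wire divided by link rate (ignores TCP/HTTP overhead).
    return READINGS * bytes_per_reading * 8 / LINK_BPS
```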
The fix for the retry storm was chunked uploads with range-request style resumption — but implementing that correctly in a resource-constrained embedded environment was more complexity than the problem warranted. We needed an approach that was both reliable and simple enough to run on an MCU with limited RAM and no OS-level TCP stack sophistication.
The 72-Hour Buffer: What We Actually Shipped
The SEM-1's production data pipeline is architecturally simple. The MCU writes telemetry readings to a circular buffer in the 4MB external flash chip, timestamped to UTC second. The buffer holds approximately 72 hours of data at 30-second sampling intervals for 22 monitored channels. When connectivity is available, the modem uploads data in small 50-record JSON payloads. Each payload is idempotent — safe to receive twice, with UUID-based deduplication on the server side.
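The buffer sizing above can be sanity-checked. The implied per-reading byte budget is derived here; the SEM-1's actual flash record layout is not described in this post, so treat the result as an estimate of what the layout must fit within.

```python
# Sizing check for the 72-hour circular buffer in 4MB external flash.
SAMPLE_INTERVAL_S = 30
CHANNELS = 22
BUFFER_HOURS = 72
FLASH_BYTES = 4 * 1024 * 1024

sample_points = BUFFER_HOURS * 3600 // SAMPLE_INTERVAL_S  # samples per channel
total_readings = sample_points * CHANNELS
byte_budget = FLASH_BYTES / total_readings  # bytes available per reading
```

The budget works out to roughly 22 bytes per reading — enough for a compact binary record (timestamp delta, channel ID, value), but not for JSON, which is why the JSON encoding happens at upload time rather than at rest.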
The key design decision: uploads happen in background priority. The MCU's connectivity management loop gives reading collection and local storage highest priority, and upload is a best-effort background task. This means that even on a vehicle with very poor connectivity, the local record is always complete. The cloud record may lag by hours, but it catches up when connectivity improves, and it doesn't miss any data.
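The priority structure can be sketched as a single tick of the loop. All the callables here are illustrative stand-ins for MCU firmware hooks, and a `deque` stands in for the flash-backed circular buffer; what the sketch shows is the ordering guarantee — sampling and local storage always run, upload runs only if the link is up, and a failed upload never blocks the next reading.

```python
from collections import deque

def loop_tick(sample_sensor, buffer, try_upload, link_up, payload_size=50):
    """One tick of the connectivity-management loop described above.

    Sampling and local storage run first, unconditionally. Upload is a
    best-effort step that a dead or marginal link simply skips.
    """
    # Highest priority: take the reading and persist it locally.
    buffer.append(sample_sensor())
    # Best effort: push one small payload if the modem reports a link.
    if link_up() and buffer:
        payload = [buffer[i] for i in range(min(payload_size, len(buffer)))]
        if try_upload(payload):
            for _ in payload:
                buffer.popleft()
```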
The 50-record payload size was chosen to complete transmission within 2 seconds on a 2G connection. A 2-second window is achievable even in marginal coverage conditions. If the upload doesn't complete, the device tries again at the next 30-second polling interval. No partial upload problem — each payload is small enough to complete atomically in almost any realistic connectivity condition.
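The budget behind the 50-record choice: two seconds on a 200 kbps link allows about 50 KB, or roughly 1 KB of JSON per record. The per-record figure is an inference from the stated numbers; since it sits well above the 400–600 bytes per reading implied by the earlier 500-record timing, the 2-second target carries headroom for links running below the 200 kbps peak.

```python
# Payload-size budget for a 2-second upload window on a 2G link.
LINK_BPS = 200_000  # 200 kbps
TARGET_S = 2
RECORDS = 50

budget_bytes = LINK_BPS * TARGET_S // 8     # bytes transferable in the window
per_record_bytes = budget_bytes // RECORDS  # JSON budget per record
```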
In deployment, the average lag between data collection and cloud ingestion is 4.2 minutes across all vehicles in our monitoring fleet. In the worst connectivity scenario we've measured — a route that passes through a known dead zone between Nairobi's Kikuyu area and Tigoni — the maximum observed lag is 6 hours and 14 minutes. All data arrived eventually. None was lost.
Implications for Real-Time Alerting
The offline-first architecture creates a tradeoff for real-time alerting. If a thermal event occurs while the vehicle is in a coverage gap, the cloud alert system won't receive the data until connectivity is restored — which could be hours after the event. For a thermal runaway in progress, hours-delayed cloud alerts are not useful.
This drove us to implement local alerting via Bluetooth to the driver's phone. The SEM-1 broadcasts BLE advertisements when it detects a thermal anomaly above the configured threshold. The driver app — which maintains a persistent BLE connection when the phone is within 10 meters — receives the alert immediately and displays an in-cab notification. This path is independent of mobile data connectivity entirely.
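The detection gate in front of the BLE broadcast reduces to a threshold check. The sketch below adds a hysteresis band, which is an illustrative assumption (it keeps the alert from flapping when the temperature hovers at the threshold); the SEM-1's actual detection logic and BLE advertisement encoding are not described here.

```python
def should_alert(temp_c, threshold_c, hysteresis_c=2.0, was_alerting=False):
    """Thermal-anomaly gate for the local BLE alert path described above.

    The hysteresis band is an illustrative addition, not a documented
    SEM-1 behavior: once alerting, stay latched until the temperature
    falls clearly below the threshold.
    """
    if was_alerting:
        return temp_c > threshold_c - hysteresis_c
    return temp_c > threshold_c
```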
The dispatch-side alert — which tells the fleet manager a vehicle needs attention — is cloud-mediated and thus subject to the data lag. We considered SMS-based alerting as a lower-latency path, but SMS delivery in sub-Saharan Africa is itself unreliable in rural areas and adds per-message cost. The current design uses BLE for driver-immediate alerts and cloud for manager alerts with an average 4-minute lag. That combination handles the safety-critical case (driver immediate notification) while keeping the management case (fleet manager dashboard) within an acceptable operational latency.
Data Completeness Numbers from Production
After three months of production deployment across 200 vehicles in Nairobi and Lagos, data completeness is 99.2% — meaning 99.2% of expected telemetry readings are present in the cloud database with no gaps. The 0.8% gap rate is primarily attributable to hardware issues (module USB power connector coming loose on two vehicles) rather than connectivity loss. No data was lost to connectivity failure during the three-month period.
For context: a standard streaming MQTT deployment tested against the same fleet in simulation, with real connectivity data from the same vehicles, projected a data completeness of approximately 91%. The offline-first design recovered 8 percentage points of completeness — and in degradation modeling, complete discharge curves are not interchangeable with 91% complete ones. The 9% of dropped readings under the streaming architecture would have been the readings during the most challenging operating conditions, which are disproportionately informative for predicting failure.
Filed under: Engineering, IoT Architecture