Time-series data from IoT sensors has 10-30% missing values due to connectivity gaps — and there is no standard way to handle it
You are building a predictive maintenance model for factory equipment using sensor data: temperature, vibration, pressure, and current, sampled every 5 seconds. Your dataset has 50 million rows from 200 sensors over 6 months. But 15% of rows have missing values: a sensor lost Wi-Fi for 3 minutes (36 missing readings), a sensor rebooted (1 hour of data missing), a sensor was replaced with a new one (different calibration baseline). You can interpolate short gaps (linear fill across 30 seconds) but cannot interpolate a 3-hour gap without introducing fake data. You can drop rows with missing values, but then your time series has gaps, which breaks any model that uses temporal features (moving averages, lag values, recurrence).

So what? Every IoT/sensor dataset has significant missing data, and the standard approaches — drop, interpolate, or impute — each introduce a different bias. Dropping removes potentially important events: a sensor going offline might correlate with the failure you are trying to predict. Interpolation fabricates data that looks real but is not. Model-based imputation bakes the imputation model's assumptions into the training data. Most ML practitioners call pandas fillna(method='ffill') (forward fill) and move on — which means they carry the last known value forward as if nothing changed, even across a 3-hour gap.

Why does this persist? Sensor data quality is a hardware/infrastructure problem (connectivity, reliability, calibration) that ML practitioners inherit but cannot fix. The sensors are in the field, maintained by operations teams, and the data pipeline has no quality SLA. There is no standard for "minimum data quality for ML" — how much missing data is too much? No framework answers this question for time-series data.
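One gap-aware middle ground between blind forward fill and dropping rows: interpolate only gaps below a length threshold and leave longer gaps as missing (plus a missingness indicator the model can learn from). A minimal sketch with pandas; the series, the 5-second sampling, and the MAX_GAP threshold are illustrative assumptions, not a standard:

```python
import numpy as np
import pandas as pd

# Hypothetical 5-second temperature series with one short gap (2 samples)
# and one long gap (4 samples, simulating a connectivity loss).
idx = pd.date_range("2024-01-01", periods=12, freq="5s")
temp = pd.Series(
    [20.0, 20.1, np.nan, np.nan, 20.4, 20.5,
     np.nan, np.nan, np.nan, np.nan, 21.0, 21.1],
    index=idx,
)

MAX_GAP = 2  # at 5 s sampling: fill gaps up to 10 s, leave longer ones NaN

# Label each run of consecutive NaNs and measure its length.
is_na = temp.isna()
run_id = (is_na != is_na.shift()).cumsum()
run_len = is_na.groupby(run_id).transform("size")
fillable = is_na & (run_len <= MAX_GAP)

# Interpolate everywhere, then keep interpolated values only in short gaps.
interpolated = temp.interpolate(method="time", limit_area="inside")
filled = temp.where(~fillable, interpolated)

# Missingness indicator, so "sensor was offline" stays visible to the model.
was_missing = temp.isna().astype(int)
```

The short gap between 20.1 and 20.4 is filled linearly (20.2, 20.3); the 4-sample gap stays NaN and is flagged by `was_missing`, instead of being silently bridged.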
Evidence
McKinsey: IoT sensor data quality issues affect 30-40% of industrial ML projects. The most popular imputation method is forward fill (pandas ffill), which has no domain awareness. scikit-learn's IterativeImputer exists but is designed for tabular data, not time series. No mainstream framework handles time-series-specific missingness patterns: sensor reboot gaps, calibration shifts, connectivity losses.
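The IterativeImputer limitation is concrete: it treats rows as exchangeable, so it ignores temporal order entirely. One common workaround is to hand it temporal context explicitly, as lag/lead columns. A sketch under assumed data (a synthetic sine-wave sensor with a short simulated gap); the lag/lead window is an illustrative choice, not a recommendation:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic "sensor" signal with a 3-sample connectivity gap.
rng = np.random.default_rng(0)
t = np.arange(200)
signal = np.sin(t / 10) + rng.normal(0, 0.05, size=t.size)
signal[50:53] = np.nan

df = pd.DataFrame({"x": signal})
# IterativeImputer sees only columns, so encode temporal order as
# lag/lead features; without these it would impute from nothing.
for k in (1, 2):
    df[f"lag{k}"] = df["x"].shift(k)
    df[f"lead{k}"] = df["x"].shift(-k)

imputed = IterativeImputer(random_state=0).fit_transform(df)
filled = imputed[:, 0]  # the imputed original signal
```

This makes the tabular imputer usable on a time series, but it is exactly the kind of manual scaffolding the text argues should not be necessary — and it still cannot distinguish a reboot gap from a calibration shift.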