Missing Data Imputation for LSTM-Based Flood Early Warning System in Jakarta

Abstract

Advances in data storage capacity and computational power have made it possible to record time series at increasingly narrow intervals, producing what is known as high-frequency time series data. Sensor data generated by Internet of Things (IoT) devices is a prominent example, and it is prone to missing values because devices can fail. Both the quantity and the quality of data significantly affect the performance of forecasting models. This study examines the effect of imputing missing data within a forecasting workflow for sensor data that records water levels at four observation sites. The analysis evaluates the forecasts of the IMV-LSTM (Interpretable Multi-Variable Long Short-Term Memory) model trained on data reconstructed with imputation methods. The results indicate that data imputed with the Kalman-Structural method improves forecast accuracy, with a 32% reduction in RMSE relative to the benchmark, a model trained on data without imputation treatment. Data imputed with Kalman ARIMA also improves the IMV-LSTM model, yielding a 29% lower RMSE than the benchmark. For the best-performing model, the forecast water levels deviate from the actual data by only about 0.1%.
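As a rough illustration of Kalman-based imputation with a structural model, the sketch below fills gaps using the smoothed level of a local-level (random walk plus noise) model. It is not the study's implementation: the function names, the fixed variances `obs_var` and `level_var`, and the sample readings are all assumptions made for the example, and in practice the variances would be estimated from the data.

```python
import math


def kalman_impute(series, obs_var=0.01, level_var=1.0):
    """Fill None entries with the smoothed level of a local-level model.

    obs_var and level_var are hypothetical fixed variances; in practice
    they would be estimated from the data (e.g. by maximum likelihood).
    """
    n = len(series)
    first = next(v for v in series if v is not None)
    level, var = first, 1e7  # vague initial state: large prior variance
    filt_level, filt_var = [0.0] * n, [0.0] * n

    # Forward pass: Kalman filter. A missing reading skips the update step,
    # so the state is carried forward and its variance keeps growing.
    for t, y in enumerate(series):
        if y is not None:
            gain = var / (var + obs_var)
            level = level + gain * (y - level)
            var = (1.0 - gain) * var
        filt_level[t], filt_var[t] = level, var
        var = var + level_var  # random-walk prediction for the next step

    # Backward pass: fixed-interval (Rauch-Tung-Striebel) smoother, which
    # lets observations after a gap inform the estimates inside the gap.
    smooth = filt_level[:]
    for t in range(n - 2, -1, -1):
        c = filt_var[t] / (filt_var[t] + level_var)
        smooth[t] = filt_level[t] + c * (smooth[t + 1] - filt_level[t])

    # Keep observed values as they are; replace only the missing ones.
    return [smooth[t] if y is None else y for t, y in enumerate(series)]


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def rmse_reduction_pct(benchmark_rmse, candidate_rmse):
    """Percentage by which candidate_rmse is lower than benchmark_rmse."""
    return 100.0 * (benchmark_rmse - candidate_rmse) / benchmark_rmse


if __name__ == "__main__":
    levels = [1.0, 2.0, None, None, 5.0, 6.0]  # made-up water levels
    print(kalman_impute(levels))
```

The relative RMSE figures quoted in the abstract follow the arithmetic in `rmse_reduction_pct`: with made-up values of 1.0 for the benchmark and 0.68 for a candidate model, the reduction is 32%. These numbers only illustrate the calculation and are not the study's RMSE values.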

Country: Indonesia

1 Akmarina Khairunnisa, 2 Bagus Sartono, 3 Muhammad Nur Aidi

  1. Department of Statistics and Data Science, IPB University, Bogor, Indonesia
  2. Department of Statistics and Data Science, IPB University, Bogor, Indonesia
  3. Department of Statistics and Data Science, IPB University, Bogor, Indonesia

IRJIET, Volume 9, Issue 6, June 2025 pp. 142-148

doi.org/10.47001/IRJIET/2025.906018
