FOSS4G-Asia 2024

Building an Analysis-Ready Cloud Optimized Global Lidar Data (GEDI and ICESat-2) for Earth System Science applications
12-17, 11:15–11:30 (Asia/Bangkok), Auditorium Hall 2

Global Ecosystem Dynamics Investigation (GEDI) and Ice, Cloud, and Land Elevation Satellite 2 (ICESat-2) are NASA Earth observation missions that use Light Detection and Ranging (LiDAR) to construct a three-dimensional model of the Earth's surface in space and time. GEDI and ICESat-2 data are organized by orbit ID, sub-orbit granule, and track, and are distributed in HDF5 format, which is optimized for big data storage. However, this layout is inconvenient for extracting spatio-temporal areas of interest, because each file stores a track that crosses a huge range of latitude and longitude and has no spatial index.

To facilitate random access to small areas of interest, we propose a data reconstruction process based on Apache Parquet. Parquet is an open source, column-oriented data format designed for efficient data storage and retrieval. We sequentially stream the raw data into spatio-temporal partition blocks (5 degree x 5 degree x year). This layout optimizes the number of partitions (n = 3337) and the individual file size (~300 MB). Because the raw data files are independent and the partitioning scheme is predefined, the data can be processed in parallel and updated periodically as new data become available.
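
As a minimal sketch of the partitioning idea described above, the following Python function maps one lidar shot to a 5 degree x 5 degree x year block. The function name `partition_key` and the `year=/lon=/lat=` label layout are hypothetical, chosen only for illustration; the abstract does not state how the published partitions are actually named.

```python
import math
from datetime import datetime

TILE_DEG = 5  # tile edge length in degrees, as stated in the text


def partition_key(lon: float, lat: float, when: datetime) -> str:
    """Return a hypothetical partition label for one lidar shot."""
    if not (-180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0):
        raise ValueError("coordinates out of range")
    # Snap to the south-west corner of the enclosing tile.
    # min() keeps lon == 180 and lat == 90 inside the last tile.
    lon0 = min(math.floor(lon / TILE_DEG) * TILE_DEG, 180 - TILE_DEG)
    lat0 = min(math.floor(lat / TILE_DEG) * TILE_DEG, 90 - TILE_DEG)
    # Assumed label layout, not the published one.
    return f"year={when.year}/lon={lon0:+04d}/lat={lat0:+03d}"


key = partition_key(100.5, 13.7, datetime(2022, 6, 1))
```

Because the key depends only on the shot itself, each raw file can be streamed and assigned to partitions without knowledge of any other file, which is what makes parallel processing and incremental updates straightforward.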

During the reconstruction, we selected essential attributes and applied quality filtering based on the scientific literature. For GEDI, we excluded shots with a Quality Flag equal to 0, a Degradation Flag larger than 0, or a Sensitivity smaller than 0.95. For ICESat-2 ATL08, we first excluded segments where the terrain and/or canopy height is NaN. We then reconstructed individual photons from ATL03 by ph_segment_id, and excluded those classified as noise, as well as segments containing more than 28 photons, following previous research [1].
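
The filtering rules above can be sketched as two small predicates. This is an illustration under assumptions, not the authors' pipeline: the dictionary keys `quality_flag`, `degrade_flag`, `sensitivity`, `h_te_best_fit`, and `h_canopy` are assumed to correspond to the attributes named in the text, and the photon-level ATL03 step is left out.

```python
import math


def keep_gedi_shot(shot: dict) -> bool:
    """Keep a GEDI shot unless it meets any exclusion rule from the text."""
    return (
        shot["quality_flag"] != 0        # exclude Quality Flag equal to 0
        and shot["degrade_flag"] <= 0    # exclude Degradation Flag larger than 0
        and shot["sensitivity"] >= 0.95  # exclude Sensitivity smaller than 0.95
    )


def keep_atl08_segment(seg: dict) -> bool:
    """Keep an ATL08 segment only if terrain and canopy heights are both numbers."""
    # Assumed keys for terrain and canopy height.
    return not (math.isnan(seg["h_te_best_fit"]) or math.isnan(seg["h_canopy"]))


shots = [
    {"quality_flag": 1, "degrade_flag": 0, "sensitivity": 0.97},
    {"quality_flag": 0, "degrade_flag": 0, "sensitivity": 0.99},
    {"quality_flag": 1, "degrade_flag": 3, "sensitivity": 0.99},
    {"quality_flag": 1, "degrade_flag": 0, "sensitivity": 0.90},
]
kept = [s for s in shots if keep_gedi_shot(s)]
```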

Finally, the data are converted to GeoParquet and published on a cloud server under a CC-BY 4.0 license. The GEDI Level 2 dataset totals 1.4 TB and the ICESat-2 ATL08 dataset totals 3.8 TB. GeoParquet supports two levels of predicate pushdown: first at the partition level, and second at the file level. The partitioning of the global LiDAR datasets enables coarse spatial (5 x 5 degree) and temporal (year) filtering. The footer of each GeoParquet file enables spatial filtering via bounding boxes or geometry features, and temporal filtering using the datetime columns. Further attribute filtering is possible.
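
To make the two filtering levels concrete, here is a library-free sketch in Python. It only mimics the logic: a coarse step that lists the 5 x 5 degree tiles a query bounding box touches (partition-level pruning), followed by a fine step that tests individual points (the kind of filter a reader applies inside the selected files). The function names `tiles_for_bbox` and `filter_points` are hypothetical, and no Parquet reader is involved.

```python
import math

TILE_DEG = 5  # tile edge length in degrees, as stated in the text


def tiles_for_bbox(min_lon, min_lat, max_lon, max_lat):
    """List south-west corners of the tiles that a bounding box touches."""
    lon_start = math.floor(min_lon / TILE_DEG) * TILE_DEG
    lat_start = math.floor(min_lat / TILE_DEG) * TILE_DEG
    lon_end = math.floor(max_lon / TILE_DEG) * TILE_DEG
    lat_end = math.floor(max_lat / TILE_DEG) * TILE_DEG
    return [
        (lon0, lat0)
        for lon0 in range(lon_start, lon_end + 1, TILE_DEG)
        for lat0 in range(lat_start, lat_end + 1, TILE_DEG)
    ]


def filter_points(points, bbox):
    """Keep (lon, lat) points that fall inside the bounding box."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return [
        p for p in points
        if min_lon <= p[0] <= max_lon and min_lat <= p[1] <= max_lat
    ]


bbox = (99.0, 12.0, 101.0, 14.0)
tiles = tiles_for_bbox(*bbox)  # coarse step: which partitions to open
points = [(99.5, 13.0), (96.0, 13.0), (100.5, 13.9)]
inside = filter_points(points, bbox)  # fine step: which rows to keep
```

Only two tiles need to be opened for this query, so most of the global dataset is never read; the row-level test then discards the points in those tiles that fall outside the box.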

The concept of Analysis-Ready Cloud Optimized (ARCO) data has been defined and implemented for raster data, using technologies such as Zarr or Cloud Optimized GeoTIFF (COG) [2]. However, corresponding implementations for vector data are scarce. This work delivers two instances of global ARCO vector datasets. It not only adheres to the 4C concept (complete, consistent, current, and correct), but also tackles the challenge of organizing terabyte-scale geospatial vector data.

Reference

1. Milenković, M., Reiche, J., Armston, J., Neuenschwander, A., De Keersmaecker, W., Herold, M., & Verbesselt, J. (2022). Assessing Amazon rainforest regrowth with GEDI and ICESat-2 data. Science of Remote Sensing, 5, 100051.
2. Stern, C., Abernathey, R., Hamman, J., Wegener, R., Lepore, C., Harkins, S., & Merose, A. (2022). Pangeo forge: crowdsourcing analysis-ready, cloud optimized data production. Frontiers in Climate, 3, 782909.


This talk covers canopy and terrain modeling. The data are widely used to model canopy height and the terrain surface. For example, several studies on flooding scenarios for coastal cities benefit from these datasets [1][2][3].

Because the datasets have global coverage, opening them up further to the public sector makes it possible to improve or validate local models anywhere in the world.

Reference
1. Pronk, M., Hooijer, A., Eilander, D., Haag, A., de Jong, T., Vousdoukas, M., ... & Eleveld, M. (2024). DeltaDTM: A global coastal digital terrain model. Scientific Data, 11(1), 273.
2. Kulp, S. A., & Strauss, B. H. (2018). CoastalDEM: A global coastal digital elevation model improved from SRTM using a neural network. Remote Sensing of Environment, 206, 231-239.
3. Dusseau, D., Zobel, Z., & Schwalm, C. R. (2023). DiluviumDEM: Enhanced accuracy in global coastal digital elevation models. Remote Sensing of Environment, 298, 113812.