FOSS4G-Asia 2024

Yu-Feng Ho


Sessions

12-17
11:15
15min
Building an Analysis-Ready Cloud Optimized Global Lidar Data (GEDI and ICESat-2) for Earth System Science applications
Yu-Feng Ho

Global Ecosystem Dynamics Investigation (GEDI) and Ice, Cloud, and Land Elevation Satellite 2 (ICESat-2) are NASA Earth observation missions that use Light Detection and Ranging (LiDAR) to build a three-dimensional model of the Earth's surface in space and time. GEDI and ICESat-2 data are organized by orbit ID, sub-orbit granule, and track, and are distributed in HDF5 format, which is optimized for big data storage. However, this organization makes it inconvenient to extract spatio-temporal areas of interest, because each file stores a track that crosses a wide range of latitudes and longitudes and has no spatial index.

To facilitate random access to small areas of interest, we propose restructuring the data into Apache Parquet, an open-source, column-oriented data format designed for efficient data storage and retrieval. We sequentially stream the raw data into spatio-temporal partition blocks (5 degree x 5 degree x year). This layout balances the number of partitions (n = 3337) against the individual file size (~300 MB). Because the raw data files are independent of one another and the partitioning scheme is predefined, the process can run in parallel and be updated periodically as new data become available.
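As a minimal sketch of how a single shot could be assigned to a 5 degree x 5 degree x year block, the function below snaps a coordinate to the lower-left corner of its tile and appends the acquisition year. The function name `partition_key` and the `lon=/lat=/year=` label format are assumptions made for illustration; the abstract does not specify how the published partitions are named.

```python
import math
from datetime import datetime

TILE_DEG = 5  # tile edge length in degrees, as stated in the abstract


def partition_key(lon: float, lat: float, when: datetime) -> str:
    """Return a hypothetical partition label for one lidar shot."""
    if not (-180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0):
        raise ValueError("coordinates out of range")
    # Snap to the lower-left corner of the enclosing 5-degree tile.
    # Clamp so that lon == 180 or lat == 90 lands in the last tile
    # instead of opening a new, nearly empty one.
    lon0 = min(math.floor(lon / TILE_DEG) * TILE_DEG, 180 - TILE_DEG)
    lat0 = min(math.floor(lat / TILE_DEG) * TILE_DEG, 90 - TILE_DEG)
    return f"lon={lon0:+04d}/lat={lat0:+03d}/year={when.year}"


if __name__ == "__main__":
    print(partition_key(-62.3, -3.1, datetime(2021, 6, 1)))
```

Because the label depends only on the shot itself, each raw file can be processed independently and its rows appended to the matching blocks, which is what makes parallel processing and incremental updates straightforward.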

During the reconstruction, we selected essential attributes and applied quality filtering based on the scientific literature. For GEDI, we excluded shots with a Quality Flag equal to 0, a Degradation Flag larger than 0, or a Sensitivity smaller than 0.95. For ICESat-2 ATL08, we first excluded segments where terrain and/or canopy height is NaN. We then reconstructed individual photons from ATL03 by ph_segment_id and excluded those classified as noise, as well as segments containing more than 28 photons, following previous research [1].
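The filtering rules above can be written as simple predicates. The sketch below is illustrative only: records are plain dictionaries, and the keys `quality_flag`, `degrade_flag`, `sensitivity`, `terrain_height`, and `canopy_height` are assumed names rather than the column names of the published datasets. The photon-level ATL03 steps are not covered here.

```python
import math


def keep_gedi_shot(shot: dict) -> bool:
    """Keep a GEDI shot unless it meets one of the exclusion rules."""
    return (
        shot["quality_flag"] != 0        # exclude Quality Flag == 0
        and shot["degrade_flag"] <= 0    # exclude Degradation Flag > 0
        and shot["sensitivity"] >= 0.95  # exclude Sensitivity < 0.95
    )


def keep_atl08_segment(segment: dict) -> bool:
    """Keep an ATL08 segment only if both heights are valid numbers."""
    return not (
        math.isnan(segment["terrain_height"])
        or math.isnan(segment["canopy_height"])
    )


if __name__ == "__main__":
    shots = [
        {"quality_flag": 1, "degrade_flag": 0, "sensitivity": 0.97},
        {"quality_flag": 0, "degrade_flag": 0, "sensitivity": 0.99},
        {"quality_flag": 1, "degrade_flag": 3, "sensitivity": 0.98},
        {"quality_flag": 1, "degrade_flag": 0, "sensitivity": 0.90},
    ]
    print([keep_gedi_shot(s) for s in shots])
```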

Finally, the data are converted to GeoParquet and published on a cloud server under a CC-BY 4.0 license. The GEDI Level 2 dataset totals 1.4 TB and the ICESat-2 ATL08 dataset 3.8 TB. GeoParquet supports two levels of predicate pushdown: first at the partition level, and second at the file level. The partitioning of the global LiDAR datasets enables coarse spatial (5 x 5 degree) and temporal (year) filtering. The footer of each GeoParquet file enables spatial filtering via bounding boxes or geometry features, and temporal filtering using the datetime columns. Further attribute filtering is also possible.
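To illustrate the partition-level step, the sketch below lists which 5 x 5 degree x year blocks can contain rows for a query bounding box, so that every other file is skipped without being opened. The label format and the names `candidate_partitions` and `row_in_bbox` are hypothetical. In practice the second, file-level step would be delegated to a Parquet reader that uses the footer metadata; `row_in_bbox` only mimics the final row test.

```python
import math

TILE_DEG = 5  # tile edge length in degrees, as stated in the abstract


def _tile_origins(lo: float, hi: float, limit: int) -> range:
    """Lower-left tile origins covering the closed interval [lo, hi]."""
    first = math.floor(lo / TILE_DEG) * TILE_DEG
    last = min(math.floor(hi / TILE_DEG) * TILE_DEG, limit - TILE_DEG)
    return range(first, last + TILE_DEG, TILE_DEG)


def candidate_partitions(bbox, years):
    """Level 1: list partition labels that may hold rows inside bbox.

    bbox is (min_lon, min_lat, max_lon, max_lat) in degrees.
    """
    min_lon, min_lat, max_lon, max_lat = bbox
    return [
        f"lon={lon0:+04d}/lat={lat0:+03d}/year={year}"
        for year in years
        for lon0 in _tile_origins(min_lon, max_lon, 180)
        for lat0 in _tile_origins(min_lat, max_lat, 90)
    ]


def row_in_bbox(lon: float, lat: float, bbox) -> bool:
    """Level 2 stand-in: exact test applied to rows of the surviving files."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


if __name__ == "__main__":
    query = (7.2, 46.1, 11.5, 47.9)
    for label in candidate_partitions(query, [2020, 2021]):
        print(label)
```

For this query box and two years, only four partition labels are produced, so a reader would touch four files out of the several thousand in the dataset before any row-level filtering takes place.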

The concept of Analysis-Ready Cloud Optimized (ARCO) data has been defined and implemented for raster data, using technologies such as Zarr or Cloud Optimized GeoTIFF (COG) [2]. However, corresponding implementations for vector data are scarce. This work delivers two instances of global ARCO vector datasets. It not only adheres to the 4C concept (complete, consistent, current, and correct), but also tackles the challenge of organizing terabyte-scale geospatial vector data.

References

[1] Milenković, M., Reiche, J., Armston, J., Neuenschwander, A., De Keersmaecker, W., Herold, M., & Verbesselt, J. (2022). Assessing Amazon rainforest regrowth with GEDI and ICESat-2 data. Science of Remote Sensing, 5, 100051.
[2] Stern, C., Abernathey, R., Hamman, J., Wegener, R., Lepore, C., Harkins, S., & Merose, A. (2022). Pangeo Forge: crowdsourcing analysis-ready, cloud optimized data production. Frontiers in Climate, 3, 782909.

FOSS4G-Asia 2024 - Abstracts - General Track
Auditorium Hall 2