8. Quality control of Geo-harmonizer datasets

The eumap library provides a set of functions to check the quality of all spatial datasets produced in the Geo-harmonizer project. These are the same functions used by the developers, adapted so that users can run quality checks not only on an entire raster layer (which may be too computationally intensive without proper infrastructure) but also on a subset of it.

These functions are contained in the qc module of the eumap package and can be used to check the accessibility, completeness and consistency of the raster layers. The main component of the qc module is the Test class (see the eumap documentation for full details).

Let’s import the module and create a Test object for a bounding box of interest:

[1]:
from eumap import qc

bounds = (
    4751935,
    2420238,
    4772117,
    2444223,
)

test = qc.Test(
    bounds=bounds,
    crs='EPSG:3035', # optional
    verbose=True,    # optional, defaults to False
)

test
[1]:
<eumap.qc.Test at 0x7f5f707f3ac0>
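
Before running checks on a subset, it can help to estimate how large the subset is. The following is a minimal sketch, not part of eumap: it assumes the bounds tuple is ordered (min_x, min_y, max_x, max_y) in EPSG:3035 metres and uses the 30 m resolution reported in the dataset metadata. The helper name `bbox_size` is hypothetical.

```python
import math

# Bounding box from the cell above, assumed to be ordered
# (min_x, min_y, max_x, max_y) in EPSG:3035 metres.
bounds = (4751935, 2420238, 4772117, 2444223)
PIXEL_SIZE_M = 30  # spatial resolution reported in the dataset metadata


def bbox_size(bounds, pixel_size):
    """Return (width_m, height_m, cols, rows) for a projected bounding box.

    The pixel counts are upper estimates: the exact window read from a raster
    also depends on how the box aligns with the raster grid.
    """
    min_x, min_y, max_x, max_y = bounds
    width_m = max_x - min_x
    height_m = max_y - min_y
    if width_m <= 0 or height_m <= 0:
        raise ValueError("bounds must satisfy min_x < max_x and min_y < max_y")
    cols = math.ceil(width_m / pixel_size)
    rows = math.ceil(height_m / pixel_size)
    return width_m, height_m, cols, rows


width_m, height_m, cols, rows = bbox_size(bounds, PIXEL_SIZE_M)
print(f"{width_m / 1000:.1f} km x {height_m / 1000:.1f} km, about {cols} x {rows} pixels")
```

Under these assumptions the subset is roughly 20 km by 24 km, which is small enough to check without dedicated infrastructure.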

8.1. Accessibility test

First we check whether the datasets we are interested in are accessible (a simple check on the URL that lets users access or download the files). We import the Catalogue object and search our GeoNetwork for the potential natural vegetation (“pnv”) dataset.

For more information on the Catalogue object, refer to the previous tutorial, 7. Access to Geo-harmonizer datasets.

[2]:
from eumap.datasets import Catalogue

cat = Catalogue()

asset = cat.search('pnv')[0]

asset.meta
[2]:
title:    PNV - Probability distribution for Quercus ilex
abstract: Overview:
Potential Natural Vegetation (PNV): potential probability of occurrence for the Holm oak from 2018 to 2020

Traceability (lineage):
    This is an original dataset produced with a machine learning framework which used a combination of point datasets and raster datasets as inputs. Point dataset is a harmonized collection of tree occurrence data, comprising observations from National Forest Inventories (EU-Forest), GBIF and LUCAS. The complete dataset is available on Zenodo. Raster datasets used as input are: monthly time series air and surface temperature and precipitation from a reprocessed version of the Copernicus ERA5 dataset; long term averages of bioclimatic variables from CHELSA; elevation, slope and other elevation-derived metrics and long term monthly averages snow probability. For a more comprehensive list refer to Bonannella et al. (2022) (in review, preprint available at: https://doi.org/10.21203/rs.3.rs-1252972/v1).

Scientific methodology:
    Probability and uncertainty maps were the output of a spatiotemporal ensemble machine learning framework based on stacked regularization. Three base models (random forest, gradient boosted trees and generalized linear models) were first trained on the input dataset and their predictions were used to train an additional model (logistic regression) which provided the final predictions. More details on the whole workflow are available in the listed publication.

Usability:
    Probability maps are particularly useful when compared with existing products of potential distribution of species or when combined with maps of realized distribution: gaps in potential and realized distribution can be identified and used as information for future programs of tree planting or forest restoration.

Uncertainty quantification:
    Uncertainty is quantified by taking the standard deviation of the probabilities predicted by the three components of the spatiotemporal ensemble model.

Data validation approaches:
    Distribution maps were validated using a spatial 5-fold cross validation following the workflow detailed in the listed publication.

Completeness:
    The raster files perfectly cover the entire Geo-harmonizer region as defined by the landmask raster dataset available here.

Consistency:
    Areas which are outside of the calibration area of the point dataset (Iceland, Norway) usually have high uncertainty values. This is not only a problem of extrapolation but also of poor representation in the feature space available to the model of the conditions that are present in this countries.

Positional accuracy:
    The rasters have a spatial resolution of 30m.

Temporal accuracy:
    The maps cover the period 2018 - 2020

Thematic accuracy:
    Both probability and uncertainty maps contain values from 0 to 100: in the case of probability maps, they indicate the probability of occurrence of a single individual of the target species, while uncertainty maps indicate the standard deviation of the ensemble model.
authors:  [{'name': 'Carmelo Bonannella', 'email': 'carmelo.bonannella@opengeohub.org'}]
theme:    Vegetation

From the “pnv” search results we extract the URL of the first raster layer of the dataset, the potential distribution map of Holm oak (Quercus ilex) for the period 2018 - 2020:

[3]:
str(asset) # assets are just strings with metadata so we can use them as a url string
[3]:
'https://s3.eu-central-1.wasabisys.com/eumap/veg/veg_quercus.ilex_pnv.eml_p_30m_0..0cm_2018..2020_eumap_epsg3035_v0.2.tif'
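
The idea of a string that carries metadata can be illustrated with a small `str` subclass. This is only a sketch of the concept: the class name `AssetSketch` is hypothetical and this is not how eumap necessarily implements its assets.

```python
class AssetSketch(str):
    """A URL string that also carries a metadata dictionary (illustration only)."""

    def __new__(cls, url, meta=None):
        # str is immutable, so the value is set in __new__ rather than __init__.
        obj = super().__new__(cls, url)
        obj.meta = dict(meta or {})
        return obj


url = (
    "https://s3.eu-central-1.wasabisys.com/eumap/veg/"
    "veg_quercus.ilex_pnv.eml_p_30m_0..0cm_2018..2020_eumap_epsg3035_v0.2.tif"
)
asset_sketch = AssetSketch(url, meta={"theme": "Vegetation"})

print(asset_sketch.endswith(".tif"))  # behaves like any other string
print(asset_sketch.meta["theme"])     # but also exposes its metadata
```

Because such an object is a string, it can be passed to any function that expects a URL, while the metadata stays attached to it.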

The raster URL and the previously defined bounding box are all the information needed to run the quality control checks. We can now run the accessibility check using the method of the same name:

[4]:
accessible = test.accessibility(asset)

accessible
Dataset accessible:
https://s3.eu-central-1.wasabisys.com/eumap/veg/veg_quercus.ilex_pnv.eml_p_30m_0..0cm_2018..2020_eumap_epsg3035_v0.2.tif
[4]:
True

As we can see, the test result is True, which means the file is available.
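
To make the idea of a "simple check on the URL" concrete, here is a minimal sketch using only the Python standard library. It is an assumption about what such a check could look like, not the eumap implementation; the function name `url_is_accessible` and the `opener` parameter (added so the check can be tried without network access) are hypothetical.

```python
import urllib.request


def url_is_accessible(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return True if a HEAD request to `url` answers with a 2xx status."""
    try:
        # HEAD asks for the headers only, so the raster itself is not downloaded.
        request = urllib.request.Request(url, method="HEAD")
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError):
        # OSError covers URLError, HTTPError and timeouts;
        # ValueError is raised for malformed URLs.
        return False
```

Calling `url_is_accessible(str(asset))` would then return a boolean comparable to the one reported by `test.accessibility(asset)`, although the checks performed by eumap may differ.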

8.2. Completeness test

The second test checks the completeness of the raster layer: every pixel of the selected region of interest is compared with the landmask used for all the layers produced in the Geo-harmonizer project. The main landmask (30 m spatial resolution) is derived from Pflugmacher et al. (2019). We use the raster_land_coverage method: its output is a number between 0 and 1, representing the fraction of landmask pixels in the tested region for which the raster layer holds valid (non-nodata) values. A value of 1.0 therefore means the layer is 100% complete.

[5]:
coverage = test.raster_land_coverage(asset)

coverage
reader using 3 threads
Completeness 100.0% for dataset:
https://s3.eu-central-1.wasabisys.com/eumap/veg/veg_quercus.ilex_pnv.eml_p_30m_0..0cm_2018..2020_eumap_epsg3035_v0.2.tif
[5]:
1.0

By default, the landmask excludes pixels falling in permanent ice/snow and wetlands. If we are interested in these areas, the method lets us include them in the quality control check:

[6]:
coverage = test.raster_land_coverage(
    asset,
    include_ice=True, # include snow and ice in coverage check
)

coverage
reader using 2 threads
Completeness 100.0% for dataset:
https://s3.eu-central-1.wasabisys.com/eumap/veg/veg_quercus.ilex_pnv.eml_p_30m_0..0cm_2018..2020_eumap_epsg3035_v0.2.tif
[6]:
1.0
[7]:
coverage = test.raster_land_coverage(
    asset,
    include_ice=True,
    include_wetlands=True, # include wetlands in coverage check
)

coverage
reader using 3 threads
Completeness 100.0% for dataset:
https://s3.eu-central-1.wasabisys.com/eumap/veg/veg_quercus.ilex_pnv.eml_p_30m_0..0cm_2018..2020_eumap_epsg3035_v0.2.tif
[7]:
1.0
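
The coverage value can be understood as a ratio of pixel counts. The sketch below illustrates the computation with NumPy on small arrays. It is not the eumap implementation: the function name `land_coverage` is hypothetical, and representing land, ice and wetlands as separate boolean masks is an assumption made for illustration.

```python
import numpy as np


def land_coverage(data, nodata, land, ice=None, wetlands=None):
    """Fraction of masked pixels that hold valid (non-nodata) values.

    `land`, `ice` and `wetlands` are boolean-like arrays with the same shape
    as `data`. Ice and wetlands only count when they are passed in.
    """
    mask = np.asarray(land, dtype=bool).copy()
    for extra in (ice, wetlands):
        if extra is not None:
            mask |= np.asarray(extra, dtype=bool)
    n_pixels = int(mask.sum())
    if n_pixels == 0:
        return float("nan")  # nothing to check inside the region
    valid = np.asarray(data) != nodata
    return float((valid & mask).sum() / n_pixels)


# Toy example: 255 marks nodata.
data = np.array([[10, 255, 30], [40, 50, 255]])
land = np.array([[1, 0, 1], [1, 1, 0]])
ice = np.array([[0, 1, 0], [0, 0, 0]])

print(land_coverage(data, 255, land))           # all land pixels are valid
print(land_coverage(data, 255, land, ice=ice))  # the ice pixel is nodata
```

In this toy example the layer fully covers the land pixels, but coverage drops once the ice pixel, which holds nodata, is included in the mask.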

8.3. Consistency test

The last test checks the consistency of the raster layer, that is, whether the metadata of the layer under analysis match those publicly available on GeoNetwork. We use the metadata_consistency method, which reports, for each of title, abstract, theme and authors, whether the corresponding metadata is present for the raster layer:

[8]:
metadata_present = test.metadata_consistency(asset)

metadata_present
All metadata present: True
[8]:
{'title': True, 'abstract': True, 'theme': True, 'authors': True}
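
The shape of this result can be reproduced with a simple presence check over a metadata dictionary. The sketch below is an illustration only: the function name `metadata_fields_present` is hypothetical, and it checks just whether each field is non-empty, whereas the eumap method works against the records published on GeoNetwork.

```python
REQUIRED_FIELDS = ("title", "abstract", "theme", "authors")


def metadata_fields_present(meta, required=REQUIRED_FIELDS):
    """Map each required field to True if it is present and non-empty."""
    return {field: bool(meta.get(field)) for field in required}


# Toy metadata record with a missing abstract.
meta = {
    "title": "PNV - Probability distribution for Quercus ilex",
    "abstract": "",
    "theme": "Vegetation",
    "authors": [{"name": "Carmelo Bonannella"}],
}

report = metadata_fields_present(meta)
print(report)
print("All metadata present:", all(report.values()))
```

A per-field report like this makes it easy to see which piece of metadata is missing when the overall check fails.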