2. Point Overlay

In this tutorial, we will use the eumap package to overlay all the points of a vector layer (geopackage file) on several raster layers (geotiff files), using the SpaceOverlay and SpaceTimeOverlay classes to handle with timeless and temporal layers, respectively.

In our dataset the elevation and slope, based on digital terrain model, are timeless and the landsat composites (7 spectral bands, 4 seasons and 3 percentiles) and night light (VIIRS Night Band) layers are temporal (from 2000 to 2020).

First, let’s import the necessary modules

[1]:
# To work with local eumap code
# import sys
# sys.path.append('../../')

import os
import shutil
from pathlib import Path
import sys

import pandas as pd
import geopandas as gpd

from eumap.mapper import SpaceOverlay, SpaceTimeOverlay
from eumap.misc import find_files

import warnings
warnings.filterwarnings('ignore')

2.1. Dataset

Our dataset refers to one tile, located in Switzerland, extracted from a tiling system created for Continental Europe (7,042 tiles) by GeoHarmonizer Project.

[2]:
from eumap import datasets

# Clean the files from the previous tutorial
#shutil.rmtree(datasets.DATA_ROOT_NAME, ignore_errors=False, onerror=None)

tile = datasets.pilot.TILES[0]
#datasets.pilot.get_data(tile+'_rasters_gapfilled.tar.gz')

data_root = datasets.DATA_ROOT_NAME
tile_dir = Path(os.getcwd()).joinpath(data_root, tile)

For this tile we have a geopackage file containing the points

[3]:
#datasets.pilot.get_data(tile+'_landcover_samples.gpkg')

fn_points = Path(os.getcwd()).joinpath(data_root, tile, tile+'_landcover_samples.gpkg')

points = gpd.read_file(fn_points)
points
[3]:
lucas survey_date confidence tile_id lc_class geometry
0 False 2006-06-30 85 10636 321 POINT (4145221.759 2594636.440)
1 False 2006-06-30 85 10636 321 POINT (4142366.664 2598169.380)
2 False 2006-06-30 85 10636 321 POINT (4140249.007 2596954.755)
3 False 2006-06-30 85 10636 322 POINT (4148638.412 2595538.585)
4 False 2006-06-30 85 10636 332 POINT (4156286.754 2595790.720)
... ... ... ... ... ... ...
1119 False 2006-06-30 85 10636 333 POINT (4141257.016 2584469.100)
1120 False 2006-06-30 85 10636 111 POINT (4141241.654 2581611.485)
1121 False 2006-06-30 85 10636 312 POINT (4140414.076 2582953.315)
1122 False 2006-06-30 85 10636 324 POINT (4140716.302 2580458.165)
1123 False 2012-06-30 85 10636 111 POINT (4141241.654 2581611.485)

1124 rows × 6 columns

… some timeless raster layers

[4]:
dir_timeless_layers = os.path.join(tile_dir, 'timeless')
fn_timeless_layers = find_files(dir_timeless_layers, '*.tif')

print(f'Number of timeless layers: {len(fn_timeless_layers)}')
Number of timeless layers: 2

… and several temporal layers.

[5]:
dir_temporal_layers = os.path.join(tile_dir)
fn_temporal_layers = find_files(dir_temporal_layers, 'landsat*.tif')

print(f'{len(fn_temporal_layers)} temporal layers from 2000 to 2020')
1722 temporal layers from 2000 to 2020

The association between the points and the temporal layers will occurs using the survey_date column

[6]:
col_date = 'survey_date'

print('Number of samples per year:')
pd.to_datetime(points[col_date]).dt.year.value_counts()
Number of samples per year:
[6]:
2006    281
2012    281
2018    281
2000    281
Name: survey_date, dtype: int64

… and the name of temporal directories.

[7]:
dirs = find_files(dir_temporal_layers, '????')
dirs.sort()

print('Temporal directories:')
for dir in dirs:
    year_dir = Path(os.path.join(dir_temporal_layers,dir))
    n_layers = len(find_files(year_dir, 'landsat*.tif'))
    print(f' - {dir.name}: {n_layers} layers')
Temporal directories:
 - 2000: 84 layers
 - 2001: 84 layers
 - 2002: 84 layers
 - 2003: 84 layers
 - 2004: 84 layers
 - 2005: 84 layers
 - 2006: 84 layers
 - 2007: 84 layers
 - 2008: 84 layers
 - 2009: 84 layers
 - 2010: 84 layers
 - 2011: 84 layers
 - 2012: 84 layers
 - 2013: 84 layers
 - 2014: 84 layers
 - 2015: 84 layers
 - 2016: 84 layers
 - 2017: 84 layers
 - 2018: 84 layers
 - 2019: 84 layers
 - 2020: 42 layers

2.2. Space Overlay

The points should be overlayed on all timeless layers, regardless the date information stored in survey_date column. In this case, we will use the SpaceOverlay class passing the arguments:

  • fn_points: the geopackage filepath

  • dir_timeless_layers: the directory containing the timeless raster files

[8]:
spc_overlay = SpaceOverlay(fn_points, dir_layers=dir_timeless_layers, verbose=True)
timeless_data = spc_overlay.run()
[09:05:23] 1/2 dtm_elevation
[09:05:23] 2/2 dtm_slope

Now we have the elevation and slope information for each points:

[9]:
timeless_data
[9]:
lucas survey_date confidence tile_id lc_class geometry overlay_id dtm_elevation dtm_slope
0 False 2006-06-30 85 10636 321 POINT (4145221.759 2594636.440) 1 1948.0 36.313705
1 False 2006-06-30 85 10636 321 POINT (4142366.664 2598169.380) 2 2209.0 7.917305
2 False 2006-06-30 85 10636 321 POINT (4140249.007 2596954.755) 3 1990.0 32.722038
3 False 2006-06-30 85 10636 322 POINT (4148638.412 2595538.585) 4 2142.0 49.800537
4 False 2006-06-30 85 10636 332 POINT (4156286.754 2595790.720) 5 2420.0 27.018671
... ... ... ... ... ... ... ... ... ...
1119 False 2006-06-30 85 10636 333 POINT (4141257.016 2584469.100) 1120 2368.0 21.605083
1120 False 2006-06-30 85 10636 111 POINT (4141241.654 2581611.485) 1121 1249.0 20.821150
1121 False 2006-06-30 85 10636 312 POINT (4140414.076 2582953.315) 1122 1729.0 16.108473
1122 False 2006-06-30 85 10636 324 POINT (4140716.302 2580458.165) 1123 900.0 27.319332
1123 False 2012-06-30 85 10636 111 POINT (4141241.654 2581611.485) 1124 1249.0 20.821150

1124 rows × 9 columns

2.3. Space-Time Overlay

For the temporal layers, the points should be filtered by year and overlaid on the right raster files, perfoming the overlay with all the images from a specific year. The SpaceTimeOverlay class implements this approach using the parameter:

  • timeless_data: The result of SpaceOverlay (GeoPandas DataFrame)

  • col_date: The column that contains the date information (2018-09-13)

  • dir_temporal_layers: The directory where the temporal raster files are stored, organized by year

[10]:
fn_layers = [ str(file).replace('2000', '{year}') for file in find_files(dir_temporal_layers, '2000/*.tif') ]
[11]:
spc_time_Overlay = SpaceTimeOverlay(points=timeless_data, col_date=col_date, \
    fn_layers=fn_layers, verbose=False)
overlayed_data = spc_time_Overlay.run()

Now we have the elevation, slope, landsat and the night light data for each points:

[12]:
overlayed_data[overlayed_data.columns[0:10]]
[12]:
lucas survey_date confidence tile_id lc_class geometry overlay_id dtm_elevation dtm_slope landsat_ard_fall_blue_p25
0 False 2006-06-30 85 10636 321 POINT (4145221.759 2594636.440) 1 1948.0 36.313705 5.0
1 False 2006-06-30 85 10636 321 POINT (4142366.664 2598169.380) 2 2209.0 7.917305 5.0
2 False 2006-06-30 85 10636 321 POINT (4140249.007 2596954.755) 3 1990.0 32.722038 4.0
3 False 2006-06-30 85 10636 322 POINT (4148638.412 2595538.585) 4 2142.0 49.800537 7.0
4 False 2006-06-30 85 10636 332 POINT (4156286.754 2595790.720) 5 2420.0 27.018671 16.0
... ... ... ... ... ... ... ... ... ... ...
1098 False 2000-06-30 85 10636 312 POINT (4140414.076 2582953.315) 277 1729.0 16.108473 6.0
1112 False 2000-06-30 85 10636 332 POINT (4157045.539 2609917.600) 278 2562.0 31.661921 7.0
1115 False 2000-06-30 85 10636 321 POINT (4141237.722 2583848.400) 279 2174.0 15.649096 5.0
1116 False 2000-06-30 85 10636 333 POINT (4141257.016 2584469.100) 280 2368.0 21.605083 28.0
1117 False 2000-06-30 85 10636 322 POINT (4140716.302 2580458.165) 281 900.0 27.319332 13.0

1124 rows × 10 columns

2.4. Save to CSV and GeoPackage files

At last, we need to save the overlaid points to access it in other softwares (QGIS) and eumap tutorial:

[13]:
csv_output = os.path.join(tile_dir, tile + '_landcover_samples_overlayed.csv.gz')

print(f"Saving {csv_output}")
overlayed_data.to_csv(csv_output, compression='gzip')
Saving /home/opengeohub/leandro/Code/eumap/docs/notebooks/eumap_data/10636_switzerland/10636_switzerland_landcover_samples_overlayed.csv.gz
[14]:
import fiona

#See https://github.com/Toblerity/Fiona/issues/977
with fiona.Env(OSR_WKT_FORMAT="WKT2_2018"):
    gpkg_output =  os.path.join(tile_dir, tile + '_landcover_samples_overlayed.gpkg')

    print(f"Saving {gpkg_output}")
    overlayed_data.to_file(gpkg_output,  driver="GPKG")
ERROR:fiona._env:PROJ: proj_as_wkt: DatumEnsemble can only be exported to WKT2:2019
ERROR:fiona._env:PROJ: proj_as_wkt: DatumEnsemble can only be exported to WKT2:2019
Saving /home/opengeohub/leandro/Code/eumap/docs/notebooks/eumap_data/10636_switzerland/10636_switzerland_landcover_samples_overlayed.gpkg

2.5. Overlay Benchmarks

Here we will show the performance of eumap’s overlay method against classic raster sampling methods using rasterio. First, let’s time the overlay executions on the same dataset as in the tutorial above.

[15]:
from pathlib import Path
import geopandas as gpd
import rasterio as rio
import numpy as np
import multiprocessing as mp

import warnings
warnings.filterwarnings('ignore')

from eumap.mapper import SpaceOverlay

max_workers = 8

points = gpd.read_file(fn_points)
print('Sample size:', points.index.size)
Sample size: 1124

Serial sampling with rasterio:

[16]:
def serial_sampling(points, layers_dir):
    sources = [
        rio.open(fn)
        for fn in sorted(layers_dir.glob('**/*.tif'))
    ]

    coordinates = np.c_[
        points.geometry.x,
        points.geometry.y,
    ]

    results = points.copy()
    for src in sources:
        layer_name = Path(src.name).stem
        results[layer_name] = np.stack(src.sample(coordinates)).ravel()

%timeit -n 1 -r 1 serial_sampling(points, tile_dir)
5min 29s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Parallel sampling with rasterio:

[17]:
def sample_one_layer(args):
    fn, coordinates = args
    layer_name = fn.stem
    with rio.open(fn) as src:
        data = np.stack(src.sample(coordinates)).ravel()
    return layer_name, data

def parallel_sampling(points, layers_dir):
    files = sorted(layers_dir.glob('**/*.tif'))

    coordinates = np.c_[
        points.geometry.x,
        points.geometry.y,
    ]

    results = points.copy()

    arg_gen = (
        (fn, coordinates)
        for fn in files
    )

    with mp.Pool(max_workers) as pool:
        for layer_name, data in pool.map(
            sample_one_layer,
            arg_gen,
        ):
            results[layer_name] = data

%timeit -n 1 -r 1 parallel_sampling(points, tile_dir)
34.6 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Parallel sampling with eumap.mapper.SpaceOverlay:

[18]:
def eumap_sampling(points, layers_dir):
    ovr = SpaceOverlay(
        points,
        dir_layers=layers_dir,
        max_workers=max_workers,
        verbose=False,
    )
    data = ovr.run()

%timeit -n 1 -r 1 eumap_sampling(points, tile_dir)
2min 14s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Sampling optimizations done in eumap generate some overhead which outweighs the speedup when used on smaller datasets. However, if we quadruple the sample size:

[19]:
for i in range(2):
    points = points.append(points, ignore_index=True)

print('sample size:', points.index.size)
sample size: 4496

Parallel sampling with rasterio:

[20]:
%timeit -n 1 -r 1 parallel_sampling(points, tile_dir)
2min 16s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Parallel sampling with eumap:

[ ]:
%timeit -n 1 -r 1 eumap_sampling(points, tile_dir)

As seen above, the optimized overlay’s execution time has much more favorable scaling with dataset size than is the case with raw parallelization.