2. Point Overlay¶

In this tutorial, we will use the eumap package to overlay all the points of a vector layer (geopackage file) on several raster layers (geotiff files), using the SpaceOverlay and SpaceTimeOverlay classes to handle with timeless and temporal layers, respectively.

In our dataset the elevation and slope, based on digital terrain model, are timeless and the landsat composites (7 spectral bands, 4 seasons and 3 percentiles) and night light (VIIRS Night Band) layers are temporal (from 2000 to 2020).

First, let’s import the necessary modules

[1]:

# To work with local eumap code
# import sys
# sys.path.append('../../')

import os
import shutil
from pathlib import Path
import sys

import pandas as pd
import geopandas as gpd

from eumap.mapper import SpaceOverlay, SpaceTimeOverlay
from eumap.misc import find_files

import warnings
warnings.filterwarnings('ignore')

2.1. Dataset¶

Our dataset refers to one tile, located in Switzerland, extracted from a tiling system created for Continental Europe (7,042 tiles) by GeoHarmonizer Project.

[2]:

from eumap import datasets

# Clean the files from the previous tutorial
#shutil.rmtree(datasets.DATA_ROOT_NAME, ignore_errors=False, onerror=None)

tile = datasets.pilot.TILES[0]
#datasets.pilot.get_data(tile+'_rasters_gapfilled.tar.gz')

data_root = datasets.DATA_ROOT_NAME
tile_dir = Path(os.getcwd()).joinpath(data_root, tile)

For this tile we have a geopackage file containing the points

[3]:

#datasets.pilot.get_data(tile+'_landcover_samples.gpkg')

fn_points = Path(os.getcwd()).joinpath(data_root, tile, tile+'_landcover_samples.gpkg')

points = gpd.read_file(fn_points)
points

[3]:

	lucas	survey_date	confidence	tile_id	lc_class	geometry
0	False	2006-06-30	85	10636	321	POINT (4145221.759 2594636.440)
1	False	2006-06-30	85	10636	321	POINT (4142366.664 2598169.380)
2	False	2006-06-30	85	10636	321	POINT (4140249.007 2596954.755)
3	False	2006-06-30	85	10636	322	POINT (4148638.412 2595538.585)
4	False	2006-06-30	85	10636	332	POINT (4156286.754 2595790.720)
...	...	...	...	...	...	...
1119	False	2006-06-30	85	10636	333	POINT (4141257.016 2584469.100)
1120	False	2006-06-30	85	10636	111	POINT (4141241.654 2581611.485)
1121	False	2006-06-30	85	10636	312	POINT (4140414.076 2582953.315)
1122	False	2006-06-30	85	10636	324	POINT (4140716.302 2580458.165)
1123	False	2012-06-30	85	10636	111	POINT (4141241.654 2581611.485)

1124 rows × 6 columns

… some timeless raster layers

[4]:

dir_timeless_layers = os.path.join(tile_dir, 'timeless')
fn_timeless_layers = find_files(dir_timeless_layers, '*.tif')

print(f'Number of timeless layers: {len(fn_timeless_layers)}')

Number of timeless layers: 2

… and several temporal layers.

[5]:

dir_temporal_layers = os.path.join(tile_dir)
fn_temporal_layers = find_files(dir_temporal_layers, 'landsat*.tif')

print(f'{len(fn_temporal_layers)} temporal layers from 2000 to 2020')

1722 temporal layers from 2000 to 2020

The association between the points and the temporal layers will occurs using the survey_date column

[6]:

col_date = 'survey_date'

print('Number of samples per year:')
pd.to_datetime(points[col_date]).dt.year.value_counts()

Number of samples per year:

[6]:

2006    281
2012    281
2018    281
2000    281
Name: survey_date, dtype: int64

… and the name of temporal directories.

[7]:

dirs = find_files(dir_temporal_layers, '????')
dirs.sort()

print('Temporal directories:')
for dir in dirs:
    year_dir = Path(os.path.join(dir_temporal_layers,dir))
    n_layers = len(find_files(year_dir, 'landsat*.tif'))
    print(f' - {dir.name}: {n_layers} layers')

Temporal directories:
 - 2000: 84 layers
 - 2001: 84 layers
 - 2002: 84 layers
 - 2003: 84 layers
 - 2004: 84 layers
 - 2005: 84 layers
 - 2006: 84 layers
 - 2007: 84 layers
 - 2008: 84 layers
 - 2009: 84 layers
 - 2010: 84 layers
 - 2011: 84 layers
 - 2012: 84 layers
 - 2013: 84 layers
 - 2014: 84 layers
 - 2015: 84 layers
 - 2016: 84 layers
 - 2017: 84 layers
 - 2018: 84 layers
 - 2019: 84 layers
 - 2020: 42 layers

2.2. Space Overlay¶

The points should be overlayed on all timeless layers, regardless the date information stored in survey_date column. In this case, we will use the SpaceOverlay class passing the arguments:

fn_points: the geopackage filepath
dir_timeless_layers: the directory containing the timeless raster files

[8]:

spc_overlay = SpaceOverlay(fn_points, dir_layers=dir_timeless_layers, verbose=True)
timeless_data = spc_overlay.run()

[09:05:23] 1/2 dtm_elevation
[09:05:23] 2/2 dtm_slope

Now we have the elevation and slope information for each points:

[9]:

timeless_data

[9]:

	lucas	survey_date	confidence	tile_id	lc_class	geometry	overlay_id	dtm_elevation	dtm_slope
0	False	2006-06-30	85	10636	321	POINT (4145221.759 2594636.440)	1	1948.0	36.313705
1	False	2006-06-30	85	10636	321	POINT (4142366.664 2598169.380)	2	2209.0	7.917305
2	False	2006-06-30	85	10636	321	POINT (4140249.007 2596954.755)	3	1990.0	32.722038
3	False	2006-06-30	85	10636	322	POINT (4148638.412 2595538.585)	4	2142.0	49.800537
4	False	2006-06-30	85	10636	332	POINT (4156286.754 2595790.720)	5	2420.0	27.018671
...	...	...	...	...	...	...	...	...	...
1119	False	2006-06-30	85	10636	333	POINT (4141257.016 2584469.100)	1120	2368.0	21.605083
1120	False	2006-06-30	85	10636	111	POINT (4141241.654 2581611.485)	1121	1249.0	20.821150
1121	False	2006-06-30	85	10636	312	POINT (4140414.076 2582953.315)	1122	1729.0	16.108473
1122	False	2006-06-30	85	10636	324	POINT (4140716.302 2580458.165)	1123	900.0	27.319332
1123	False	2012-06-30	85	10636	111	POINT (4141241.654 2581611.485)	1124	1249.0	20.821150

1124 rows × 9 columns

2.3. Space-Time Overlay¶

For the temporal layers, the points should be filtered by year and overlaid on the right raster files, perfoming the overlay with all the images from a specific year. The SpaceTimeOverlay class implements this approach using the parameter:

timeless_data: The result of SpaceOverlay (GeoPandas DataFrame)
col_date: The column that contains the date information (2018-09-13)
dir_temporal_layers: The directory where the temporal raster files are stored, organized by year

[10]:

fn_layers = [ str(file).replace('2000', '{year}') for file in find_files(dir_temporal_layers, '2000/*.tif') ]

[11]:

spc_time_Overlay = SpaceTimeOverlay(points=timeless_data, col_date=col_date, \
    fn_layers=fn_layers, verbose=False)
overlayed_data = spc_time_Overlay.run()

Now we have the elevation, slope, landsat and the night light data for each points:

[12]:

overlayed_data[overlayed_data.columns[0:10]]

[12]:

	lucas	survey_date	confidence	tile_id	lc_class	geometry	overlay_id	dtm_elevation	dtm_slope	landsat_ard_fall_blue_p25
0	False	2006-06-30	85	10636	321	POINT (4145221.759 2594636.440)	1	1948.0	36.313705	5.0
1	False	2006-06-30	85	10636	321	POINT (4142366.664 2598169.380)	2	2209.0	7.917305	5.0
2	False	2006-06-30	85	10636	321	POINT (4140249.007 2596954.755)	3	1990.0	32.722038	4.0
3	False	2006-06-30	85	10636	322	POINT (4148638.412 2595538.585)	4	2142.0	49.800537	7.0
4	False	2006-06-30	85	10636	332	POINT (4156286.754 2595790.720)	5	2420.0	27.018671	16.0
...	...	...	...	...	...	...	...	...	...	...
1098	False	2000-06-30	85	10636	312	POINT (4140414.076 2582953.315)	277	1729.0	16.108473	6.0
1112	False	2000-06-30	85	10636	332	POINT (4157045.539 2609917.600)	278	2562.0	31.661921	7.0
1115	False	2000-06-30	85	10636	321	POINT (4141237.722 2583848.400)	279	2174.0	15.649096	5.0
1116	False	2000-06-30	85	10636	333	POINT (4141257.016 2584469.100)	280	2368.0	21.605083	28.0
1117	False	2000-06-30	85	10636	322	POINT (4140716.302 2580458.165)	281	900.0	27.319332	13.0

1124 rows × 10 columns

2.4. Save to CSV and GeoPackage files¶

At last, we need to save the overlaid points to access it in other softwares (QGIS) and eumap tutorial:

[13]:

csv_output = os.path.join(tile_dir, tile + '_landcover_samples_overlayed.csv.gz')

print(f"Saving {csv_output}")
overlayed_data.to_csv(csv_output, compression='gzip')

Saving /home/opengeohub/leandro/Code/eumap/docs/notebooks/eumap_data/10636_switzerland/10636_switzerland_landcover_samples_overlayed.csv.gz

[14]:

import fiona

#See https://github.com/Toblerity/Fiona/issues/977
with fiona.Env(OSR_WKT_FORMAT="WKT2_2018"):
    gpkg_output =  os.path.join(tile_dir, tile + '_landcover_samples_overlayed.gpkg')

    print(f"Saving {gpkg_output}")
    overlayed_data.to_file(gpkg_output,  driver="GPKG")

ERROR:fiona._env:PROJ: proj_as_wkt: DatumEnsemble can only be exported to WKT2:2019
ERROR:fiona._env:PROJ: proj_as_wkt: DatumEnsemble can only be exported to WKT2:2019

Saving /home/opengeohub/leandro/Code/eumap/docs/notebooks/eumap_data/10636_switzerland/10636_switzerland_landcover_samples_overlayed.gpkg

2.5. Overlay Benchmarks¶

Here we will show the performance of eumap’s overlay method against classic raster sampling methods using rasterio. First, let’s time the overlay executions on the same dataset as in the tutorial above.

[15]:

from pathlib import Path
import geopandas as gpd
import rasterio as rio
import numpy as np
import multiprocessing as mp

import warnings
warnings.filterwarnings('ignore')

from eumap.mapper import SpaceOverlay

max_workers = 8

points = gpd.read_file(fn_points)
print('Sample size:', points.index.size)

Sample size: 1124

Serial sampling with rasterio:

[16]:

def serial_sampling(points, layers_dir):
    sources = [
        rio.open(fn)
        for fn in sorted(layers_dir.glob('**/*.tif'))
    ]

    coordinates = np.c_[
        points.geometry.x,
        points.geometry.y,
    ]

    results = points.copy()
    for src in sources:
        layer_name = Path(src.name).stem
        results[layer_name] = np.stack(src.sample(coordinates)).ravel()

%timeit -n 1 -r 1 serial_sampling(points, tile_dir)

5min 29s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Parallel sampling with rasterio:

[17]:

def sample_one_layer(args):
    fn, coordinates = args
    layer_name = fn.stem
    with rio.open(fn) as src:
        data = np.stack(src.sample(coordinates)).ravel()
    return layer_name, data

def parallel_sampling(points, layers_dir):
    files = sorted(layers_dir.glob('**/*.tif'))

    coordinates = np.c_[
        points.geometry.x,
        points.geometry.y,
    ]

    results = points.copy()

    arg_gen = (
        (fn, coordinates)
        for fn in files
    )

    with mp.Pool(max_workers) as pool:
        for layer_name, data in pool.map(
            sample_one_layer,
            arg_gen,
        ):
            results[layer_name] = data

%timeit -n 1 -r 1 parallel_sampling(points, tile_dir)

34.6 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Parallel sampling with eumap.mapper.SpaceOverlay:

[18]:

def eumap_sampling(points, layers_dir):
    ovr = SpaceOverlay(
        points,
        dir_layers=layers_dir,
        max_workers=max_workers,
        verbose=False,
    )
    data = ovr.run()

%timeit -n 1 -r 1 eumap_sampling(points, tile_dir)

2min 14s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Sampling optimizations done in eumap generate some overhead which outweighs the speedup when used on smaller datasets. However, if we quadruple the sample size:

[19]:

for i in range(2):
    points = points.append(points, ignore_index=True)

print('sample size:', points.index.size)

sample size: 4496

Parallel sampling with rasterio:

[20]:

%timeit -n 1 -r 1 parallel_sampling(points, tile_dir)

2min 16s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Parallel sampling with eumap:

[ ]:

%timeit -n 1 -r 1 eumap_sampling(points, tile_dir)

As seen above, the optimized overlay’s execution time has much more favorable scaling with dataset size than is the case with raw parallelization.