2. Point Overlay¶
In this tutorial, we will use the eumap package to overlay all the points of a vector layer (geopackage file) on several raster layers (geotiff files), using the SpaceOverlay and SpaceTimeOverlay classes to handle with timeless and temporal layers, respectively.
In our dataset the elevation and slope, based on digital terrain model, are timeless and the landsat composites (7 spectral bands, 4 seasons and 3 percentiles) and night light (VIIRS Night Band) layers are temporal (from 2000 to 2020).
First, let’s import the necessary modules
[1]:
# To work with local eumap code
# import sys
# sys.path.append('../../')
import os
import shutil
from pathlib import Path
import sys
import pandas as pd
import geopandas as gpd
from eumap.mapper import SpaceOverlay, SpaceTimeOverlay
from eumap.misc import find_files
import warnings
warnings.filterwarnings('ignore')
2.1. Dataset¶
Our dataset refers to one tile, located in Switzerland, extracted from a tiling system created for Continental Europe (7,042 tiles) by GeoHarmonizer Project.
[2]:
from eumap import datasets
# Clean the files from the previous tutorial
#shutil.rmtree(datasets.DATA_ROOT_NAME, ignore_errors=False, onerror=None)
tile = datasets.pilot.TILES[0]
#datasets.pilot.get_data(tile+'_rasters_gapfilled.tar.gz')
data_root = datasets.DATA_ROOT_NAME
tile_dir = Path(os.getcwd()).joinpath(data_root, tile)
For this tile we have a geopackage file containing the points
[3]:
#datasets.pilot.get_data(tile+'_landcover_samples.gpkg')
fn_points = Path(os.getcwd()).joinpath(data_root, tile, tile+'_landcover_samples.gpkg')
points = gpd.read_file(fn_points)
points
[3]:
lucas | survey_date | confidence | tile_id | lc_class | geometry | |
---|---|---|---|---|---|---|
0 | False | 2006-06-30 | 85 | 10636 | 321 | POINT (4145221.759 2594636.440) |
1 | False | 2006-06-30 | 85 | 10636 | 321 | POINT (4142366.664 2598169.380) |
2 | False | 2006-06-30 | 85 | 10636 | 321 | POINT (4140249.007 2596954.755) |
3 | False | 2006-06-30 | 85 | 10636 | 322 | POINT (4148638.412 2595538.585) |
4 | False | 2006-06-30 | 85 | 10636 | 332 | POINT (4156286.754 2595790.720) |
... | ... | ... | ... | ... | ... | ... |
1119 | False | 2006-06-30 | 85 | 10636 | 333 | POINT (4141257.016 2584469.100) |
1120 | False | 2006-06-30 | 85 | 10636 | 111 | POINT (4141241.654 2581611.485) |
1121 | False | 2006-06-30 | 85 | 10636 | 312 | POINT (4140414.076 2582953.315) |
1122 | False | 2006-06-30 | 85 | 10636 | 324 | POINT (4140716.302 2580458.165) |
1123 | False | 2012-06-30 | 85 | 10636 | 111 | POINT (4141241.654 2581611.485) |
1124 rows × 6 columns
… some timeless raster layers
[4]:
dir_timeless_layers = os.path.join(tile_dir, 'timeless')
fn_timeless_layers = find_files(dir_timeless_layers, '*.tif')
print(f'Number of timeless layers: {len(fn_timeless_layers)}')
Number of timeless layers: 2
… and several temporal layers.
[5]:
dir_temporal_layers = os.path.join(tile_dir)
fn_temporal_layers = find_files(dir_temporal_layers, 'landsat*.tif')
print(f'{len(fn_temporal_layers)} temporal layers from 2000 to 2020')
1722 temporal layers from 2000 to 2020
The association between the points and the temporal layers will occurs using the survey_date column
[6]:
col_date = 'survey_date'
print('Number of samples per year:')
pd.to_datetime(points[col_date]).dt.year.value_counts()
Number of samples per year:
[6]:
2006 281
2012 281
2018 281
2000 281
Name: survey_date, dtype: int64
… and the name of temporal directories.
[7]:
dirs = find_files(dir_temporal_layers, '????')
dirs.sort()
print('Temporal directories:')
for dir in dirs:
year_dir = Path(os.path.join(dir_temporal_layers,dir))
n_layers = len(find_files(year_dir, 'landsat*.tif'))
print(f' - {dir.name}: {n_layers} layers')
Temporal directories:
- 2000: 84 layers
- 2001: 84 layers
- 2002: 84 layers
- 2003: 84 layers
- 2004: 84 layers
- 2005: 84 layers
- 2006: 84 layers
- 2007: 84 layers
- 2008: 84 layers
- 2009: 84 layers
- 2010: 84 layers
- 2011: 84 layers
- 2012: 84 layers
- 2013: 84 layers
- 2014: 84 layers
- 2015: 84 layers
- 2016: 84 layers
- 2017: 84 layers
- 2018: 84 layers
- 2019: 84 layers
- 2020: 42 layers
2.2. Space Overlay¶
The points should be overlayed on all timeless layers, regardless the date information stored in survey_date column. In this case, we will use the SpaceOverlay class passing the arguments:
fn_points: the geopackage filepath
dir_timeless_layers: the directory containing the timeless raster files
[8]:
spc_overlay = SpaceOverlay(fn_points, dir_layers=dir_timeless_layers, verbose=True)
timeless_data = spc_overlay.run()
[09:05:23] 1/2 dtm_elevation
[09:05:23] 2/2 dtm_slope
Now we have the elevation and slope information for each points:
[9]:
timeless_data
[9]:
lucas | survey_date | confidence | tile_id | lc_class | geometry | overlay_id | dtm_elevation | dtm_slope | |
---|---|---|---|---|---|---|---|---|---|
0 | False | 2006-06-30 | 85 | 10636 | 321 | POINT (4145221.759 2594636.440) | 1 | 1948.0 | 36.313705 |
1 | False | 2006-06-30 | 85 | 10636 | 321 | POINT (4142366.664 2598169.380) | 2 | 2209.0 | 7.917305 |
2 | False | 2006-06-30 | 85 | 10636 | 321 | POINT (4140249.007 2596954.755) | 3 | 1990.0 | 32.722038 |
3 | False | 2006-06-30 | 85 | 10636 | 322 | POINT (4148638.412 2595538.585) | 4 | 2142.0 | 49.800537 |
4 | False | 2006-06-30 | 85 | 10636 | 332 | POINT (4156286.754 2595790.720) | 5 | 2420.0 | 27.018671 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1119 | False | 2006-06-30 | 85 | 10636 | 333 | POINT (4141257.016 2584469.100) | 1120 | 2368.0 | 21.605083 |
1120 | False | 2006-06-30 | 85 | 10636 | 111 | POINT (4141241.654 2581611.485) | 1121 | 1249.0 | 20.821150 |
1121 | False | 2006-06-30 | 85 | 10636 | 312 | POINT (4140414.076 2582953.315) | 1122 | 1729.0 | 16.108473 |
1122 | False | 2006-06-30 | 85 | 10636 | 324 | POINT (4140716.302 2580458.165) | 1123 | 900.0 | 27.319332 |
1123 | False | 2012-06-30 | 85 | 10636 | 111 | POINT (4141241.654 2581611.485) | 1124 | 1249.0 | 20.821150 |
1124 rows × 9 columns
2.3. Space-Time Overlay¶
For the temporal layers, the points should be filtered by year and overlaid on the right raster files, perfoming the overlay with all the images from a specific year. The SpaceTimeOverlay class implements this approach using the parameter:
timeless_data: The result of SpaceOverlay (GeoPandas DataFrame)
col_date: The column that contains the date information (2018-09-13)
dir_temporal_layers: The directory where the temporal raster files are stored, organized by year
[10]:
fn_layers = [ str(file).replace('2000', '{year}') for file in find_files(dir_temporal_layers, '2000/*.tif') ]
[11]:
spc_time_Overlay = SpaceTimeOverlay(points=timeless_data, col_date=col_date, \
fn_layers=fn_layers, verbose=False)
overlayed_data = spc_time_Overlay.run()
Now we have the elevation, slope, landsat and the night light data for each points:
[12]:
overlayed_data[overlayed_data.columns[0:10]]
[12]:
lucas | survey_date | confidence | tile_id | lc_class | geometry | overlay_id | dtm_elevation | dtm_slope | landsat_ard_fall_blue_p25 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | False | 2006-06-30 | 85 | 10636 | 321 | POINT (4145221.759 2594636.440) | 1 | 1948.0 | 36.313705 | 5.0 |
1 | False | 2006-06-30 | 85 | 10636 | 321 | POINT (4142366.664 2598169.380) | 2 | 2209.0 | 7.917305 | 5.0 |
2 | False | 2006-06-30 | 85 | 10636 | 321 | POINT (4140249.007 2596954.755) | 3 | 1990.0 | 32.722038 | 4.0 |
3 | False | 2006-06-30 | 85 | 10636 | 322 | POINT (4148638.412 2595538.585) | 4 | 2142.0 | 49.800537 | 7.0 |
4 | False | 2006-06-30 | 85 | 10636 | 332 | POINT (4156286.754 2595790.720) | 5 | 2420.0 | 27.018671 | 16.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1098 | False | 2000-06-30 | 85 | 10636 | 312 | POINT (4140414.076 2582953.315) | 277 | 1729.0 | 16.108473 | 6.0 |
1112 | False | 2000-06-30 | 85 | 10636 | 332 | POINT (4157045.539 2609917.600) | 278 | 2562.0 | 31.661921 | 7.0 |
1115 | False | 2000-06-30 | 85 | 10636 | 321 | POINT (4141237.722 2583848.400) | 279 | 2174.0 | 15.649096 | 5.0 |
1116 | False | 2000-06-30 | 85 | 10636 | 333 | POINT (4141257.016 2584469.100) | 280 | 2368.0 | 21.605083 | 28.0 |
1117 | False | 2000-06-30 | 85 | 10636 | 322 | POINT (4140716.302 2580458.165) | 281 | 900.0 | 27.319332 | 13.0 |
1124 rows × 10 columns
2.4. Save to CSV and GeoPackage files¶
At last, we need to save the overlaid points to access it in other softwares (QGIS) and eumap tutorial:
[13]:
csv_output = os.path.join(tile_dir, tile + '_landcover_samples_overlayed.csv.gz')
print(f"Saving {csv_output}")
overlayed_data.to_csv(csv_output, compression='gzip')
Saving /home/opengeohub/leandro/Code/eumap/docs/notebooks/eumap_data/10636_switzerland/10636_switzerland_landcover_samples_overlayed.csv.gz
[14]:
import fiona
#See https://github.com/Toblerity/Fiona/issues/977
with fiona.Env(OSR_WKT_FORMAT="WKT2_2018"):
gpkg_output = os.path.join(tile_dir, tile + '_landcover_samples_overlayed.gpkg')
print(f"Saving {gpkg_output}")
overlayed_data.to_file(gpkg_output, driver="GPKG")
ERROR:fiona._env:PROJ: proj_as_wkt: DatumEnsemble can only be exported to WKT2:2019
ERROR:fiona._env:PROJ: proj_as_wkt: DatumEnsemble can only be exported to WKT2:2019
Saving /home/opengeohub/leandro/Code/eumap/docs/notebooks/eumap_data/10636_switzerland/10636_switzerland_landcover_samples_overlayed.gpkg
2.5. Overlay Benchmarks¶
Here we will show the performance of eumap
’s overlay method against classic raster sampling methods using rasterio
. First, let’s time the overlay executions on the same dataset as in the tutorial above.
[15]:
from pathlib import Path
import geopandas as gpd
import rasterio as rio
import numpy as np
import multiprocessing as mp
import warnings
warnings.filterwarnings('ignore')
from eumap.mapper import SpaceOverlay
max_workers = 8
points = gpd.read_file(fn_points)
print('Sample size:', points.index.size)
Sample size: 1124
Serial sampling with rasterio
:
[16]:
def serial_sampling(points, layers_dir):
sources = [
rio.open(fn)
for fn in sorted(layers_dir.glob('**/*.tif'))
]
coordinates = np.c_[
points.geometry.x,
points.geometry.y,
]
results = points.copy()
for src in sources:
layer_name = Path(src.name).stem
results[layer_name] = np.stack(src.sample(coordinates)).ravel()
%timeit -n 1 -r 1 serial_sampling(points, tile_dir)
5min 29s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Parallel sampling with rasterio
:
[17]:
def sample_one_layer(args):
fn, coordinates = args
layer_name = fn.stem
with rio.open(fn) as src:
data = np.stack(src.sample(coordinates)).ravel()
return layer_name, data
def parallel_sampling(points, layers_dir):
files = sorted(layers_dir.glob('**/*.tif'))
coordinates = np.c_[
points.geometry.x,
points.geometry.y,
]
results = points.copy()
arg_gen = (
(fn, coordinates)
for fn in files
)
with mp.Pool(max_workers) as pool:
for layer_name, data in pool.map(
sample_one_layer,
arg_gen,
):
results[layer_name] = data
%timeit -n 1 -r 1 parallel_sampling(points, tile_dir)
34.6 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Parallel sampling with eumap.mapper.SpaceOverlay
:
[18]:
def eumap_sampling(points, layers_dir):
ovr = SpaceOverlay(
points,
dir_layers=layers_dir,
max_workers=max_workers,
verbose=False,
)
data = ovr.run()
%timeit -n 1 -r 1 eumap_sampling(points, tile_dir)
2min 14s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Sampling optimizations done in eumap
generate some overhead which outweighs the speedup when used on smaller datasets. However, if we quadruple the sample size:
[19]:
for i in range(2):
points = points.append(points, ignore_index=True)
print('sample size:', points.index.size)
sample size: 4496
Parallel sampling with rasterio
:
[20]:
%timeit -n 1 -r 1 parallel_sampling(points, tile_dir)
2min 16s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Parallel sampling with eumap
:
[ ]:
%timeit -n 1 -r 1 eumap_sampling(points, tile_dir)
As seen above, the optimized overlay’s execution time has much more favorable scaling with dataset size than is the case with raw parallelization.