eumap.mapper.LandMapper

class LandMapper(points, target_col, feat_cols=[], feat_col_prfxs=[], weight_col=None, nodata_imputer=None, estimator=RandomForestClassifier(), estimator_list=None, meta_estimator=LogisticRegression(), hyperpar_selection=None, hyperpar_selection_list=None, hyperpar_selection_meta=None, feature_selection=None, feature_selections_list=None, cv=KFold(n_splits=5, random_state=None, shuffle=False), cv_njobs=1, cv_group_col=None, min_samples_per_class=0, pred_method='predict', verbose=True, apply_corr_factor=False, **autosklearn_kwargs)

Bases: object

Spatial prediction implementation based on supervised machine learning models and point samples.

It is fully compatible with scikit-learn [1] and supports:

  1. Classification models,

  2. Seamless training using point samples,

  3. Data imputation,

  4. Hyper-parameter optimization,

  5. Feature selection,

  6. Ensemble machine learning (EML) and prediction uncertainty,

  7. AutoML through auto-sklearn [2],

  8. Accuracy assessment through cross-validation,

  9. Seamless raster prediction (read and write).

Parameters
  • points (Union[DataFrame, Path]) – Point samples used to train the ML model. Accepts a pandas.DataFrame or a path to a plain CSV (*.csv) or compressed CSV (*.gz) file, both read through pandas.read_csv [3]. Files with any other extension are read by geopandas as GIS vector files [4].

  • target_col (str) – Column name used to retrieve the target values for the training.

  • feat_cols (Optional[List]) – List of column names used to retrieve the features/covariates for the training.

  • feat_col_prfxs (Optional[List]) – List of column prefixes used to derive the feat_cols list, which avoids having to provide dozens or hundreds of column names.

  • weight_col (Optional[str]) – Column name used to retrieve the sample_weight for the training.

  • nodata_imputer (Optional[BaseEstimator]) – Transformer used to impute missing values, filling all np.nan in the point samples. All sklearn.impute classes are supported [1].

  • estimator (Optional[BaseEstimator]) – The ML model used by the class. The default model is a RandomForestClassifier, but all sklearn models are supported [1]. With estimator=None it tries to use auto-sklearn to find the best model and hyper-parameters [2].

  • estimator_list (Optional[List[BaseEstimator]]) – A list of models used by the EML implementation. The outputs of these models feed the meta_estimator model and are used to derive the prediction uncertainty. This argument takes precedence over estimator.

  • meta_estimator (BaseEstimator) – Model used to derive the prediction output in the EML implementation. The default model is a LogisticRegression, but all sklearn models are supported [1].

  • hyperpar_selection (Optional[BaseEstimator]) – Hyper-parameter optimizer used by the estimator model.

  • hyperpar_selection_list (Optional[BaseEstimator]) – A list of hyper-parameter optimizers used by the estimator_list models, provided in the same order. This argument takes precedence over hyperpar_selection.

  • hyperpar_selection_meta (Optional[List[BaseEstimator]]) – Hyper-parameter optimizer used by the meta_estimator model.

  • feature_selection (Optional[BaseEstimator]) – Feature selection algorithm used by the estimator model.

  • feature_selections_list (Optional[BaseEstimator]) – A list of feature selection algorithms used by the estimator_list models, provided in the same order. This argument takes precedence over feature_selection.

  • cv (BaseCrossValidator) – Cross-validation strategy used by all models. The default strategy is a 5-fold CV, but all sklearn cross-validation strategies are supported [1].

  • cv_njobs (int) – Number of CPU cores to be used in parallel during the cross validation.

  • cv_group_col (Optional[str]) – Column name used to split the train/test set during the cross validation. Use this argument to perform a spatial CV by block/tiles.

  • min_samples_per_class (float) – Minimum percentage of samples that a class must have in target_col to be kept in the training.

  • pred_method (str) – Use predict_proba to predict probabilities and uncertainty; otherwise only the dominant class is predicted.

  • apply_corr_factor (bool) – Apply a correction factor (rmse / averaged_sd) in the prediction uncertainty output.

  • verbose (bool) – Use True to print the progress of all steps.

  • **autosklearn_kwargs – Named arguments supported by auto-sklearn [2].
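
The column and class selection arguments can be illustrated without eumap itself. The sketch below is a minimal, standard-library illustration of how feat_col_prfxs might expand into a feature list and how min_samples_per_class might drop rare classes. The helper names (expand_feature_columns, frequent_classes) and the sample column names are hypothetical, and the threshold is assumed to be a fraction between 0 and 1 rather than a value between 0 and 100.

```python
from collections import Counter


def expand_feature_columns(columns, feat_cols=(), feat_col_prfxs=()):
    """Return the explicit feature columns plus every column matching a prefix."""
    selected = list(feat_cols)
    for col in columns:
        if col not in selected and any(col.startswith(p) for p in feat_col_prfxs):
            selected.append(col)
    return selected


def frequent_classes(targets, min_samples_per_class=0.0):
    """Return the classes whose share of samples reaches the threshold.

    Assumption: the threshold is a fraction in [0, 1].
    """
    counts = Counter(targets)
    total = len(targets)
    return sorted(c for c, n in counts.items() if n / total >= min_samples_per_class)


# Hypothetical column names of a point-sample table.
columns = ["lc_class", "landsat_red", "landsat_nir", "dtm_slope", "tile_id"]
features = expand_feature_columns(
    columns, feat_cols=["dtm_slope"], feat_col_prfxs=["landsat_"]
)
print(features)

# Hypothetical target values: 60% forest, 30% water, 10% bare.
targets = ["forest"] * 6 + ["water"] * 3 + ["bare"] * 1
kept = frequent_classes(targets, min_samples_per_class=0.2)
print(kept)
```

With a prefix such as "landsat_", every band column is picked up without being listed by name, which is the convenience feat_col_prfxs is meant to provide.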

For usage examples, see the eumap tutorials [5, 6].
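
The cv_group_col argument keeps all samples from one block or tile on the same side of each train/test split. A minimal sketch of that idea in plain Python follows. It is not eumap's implementation (scikit-learn's GroupKFold provides a tested version of the same idea), and group_folds and the tile identifiers are hypothetical.

```python
def group_folds(groups, n_splits):
    """Yield (train_idx, test_idx) so that no group appears on both sides."""
    unique = sorted(set(groups))
    for fold in range(n_splits):
        # Round-robin assignment of whole groups to the test side of this fold.
        test_groups = set(unique[fold::n_splits])
        test_idx = [i for i, g in enumerate(groups) if g in test_groups]
        train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
        yield train_idx, test_idx


# Hypothetical tile identifier of each point sample.
tile_ids = ["t1", "t1", "t2", "t2", "t3", "t3", "t4"]
folds = list(group_folds(tile_ids, n_splits=2))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Because neighbouring samples in the same tile never straddle the split, the accuracy estimate is less inflated by spatial autocorrelation than with a purely random split.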

References

[1] Sklearn API Reference

[2] Auto-sklearn API

[3] Pandas read_csv function

[4] Geopandas read_file function

[5] Land Cover Mapping

[6] Land Cover Mapping (Advanced)

Methods

  • load_instance – Load a class instance from disk.

  • predict – Predict raster data.

  • predict_multi – Predict multiple raster data.

  • predict_points – Predict point samples.

  • save_instance – Persist the class instance to disk using joblib.dump.

  • train – Train the ML/EML model according to the class arguments.

static load_instance(fn_joblib)

Load a class instance from disk.

Parameters

fn_joblib – Location of the saved instance.

Returns

Class instance

Return type

LandMapper

predict(dirs_layers=[], fn_layers=[], fn_output=None, spatial_win=None, dtype='float32', fill_nodata=False, separate_probs=True, hard_class=True, inmem_calc_func=None, dict_layers_newnames={}, allow_additional_layers=False, n_jobs_io=4, verbose_renaming=True)

Predict raster data. It matches the raster filenames with the input features/covariates used in training.
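
As a rough illustration of this matching step, the sketch below compares raster file stems with the training feature names, applies an optional renaming table, and rejects unknown layers. It is not eumap's implementation: match_layers is a hypothetical helper, the file names are made up, and the renaming table is assumed to map an original raster name to the training column name, which may differ from the convention dict_layers_newnames actually uses.

```python
from pathlib import Path


def match_layers(fn_layers, feature_cols, newnames=None, allow_additional_layers=False):
    """Map each training feature to the raster whose (renamed) stem matches it."""
    newnames = newnames or {}
    matched = {}
    for fn in map(Path, fn_layers):
        # Assumed direction of the renaming table: raster stem -> feature name.
        name = newnames.get(fn.stem, fn.stem)
        if name in feature_cols:
            matched[name] = fn
        elif not allow_additional_layers:
            raise ValueError(f"Layer {fn.name} is not among the training features")
    missing = [c for c in feature_cols if c not in matched]
    if missing:
        raise ValueError(f"No raster found for features: {missing}")
    return matched


feature_cols = ["landsat_red", "landsat_nir"]
rasters = ["/data/2019/landsat_red.tif", "/data/2019/nir_2019.tif"]
layers = match_layers(rasters, feature_cols, newnames={"nir_2019": "landsat_nir"})
print(layers)
```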

Parameters
  • dirs_layers (List) – A list of folders where the raster files are located.

  • fn_layers (List) – A list with the raster paths. If provided, dirs_layers is ignored.

  • fn_output (Optional[str]) – File path where the prediction result is saved. For multiple outputs (probabilities, uncertainty) the same location is used, with specific suffixes added to the provided file path.

  • spatial_win (Optional[Window]) – Read the data and predict according to the spatial window. The default is None, which means all the data is read and predicted.

  • dtype – Convert the read data to the specified dtype. For Float* the nodata values are converted to np.nan.

  • fill_nodata (bool) – Use the nodata_imputer to fill all np.nan values. The default is False because in almost all cases it is preferable to use the eumap.gapfiller module for this task.

  • separate_probs (bool) – Use True to save the predicted probabilities in separate rasters; otherwise they are written as multiple bands of a single raster file. Ignored for pred_method='predict'.

  • hard_class (bool) – When pred_method='predict_proba', use True to save the predicted dominant class (*_hcl.tif) together with the probability (*_hcl_prob.tif) and uncertainty (*_hcl_uncertainty.tif) values of each dominant class.

  • inmem_calc_func (Optional[Callable]) – Function to be executed before the prediction. Use it to derive covariates/features on-the-fly, calculating in memory, for example, an NDVI from the red and NIR bands.

  • dict_layers_newnames (dict) – A dictionary used to change the raster filenames on-the-fly. Use it to match the column names for the point samples with different raster filenames.

  • allow_additional_layers (bool) – Use False to raise an exception if a read raster is not present in feature_cols.

  • n_jobs_io (int) – Number of parallel jobs to read the raster files.

  • verbose_renaming (bool) – Use True to show which raster layers are renamed.
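
The inmem_calc_func hook can be pictured as a function that receives the layers already in memory and returns them with an extra derived layer. The signature eumap expects is not described here, so the sketch below assumes a hypothetical one (a dictionary of layer name to flat pixel list) and uses plain Python lists instead of raster arrays. Only the NDVI formula, (NIR - red) / (NIR + red), is standard.

```python
import math


def add_ndvi(layers, red="landsat_red", nir="landsat_nir", out="ndvi"):
    """Return a copy of `layers` with an NDVI layer derived from red and NIR."""
    ndvi = []
    for r, n in zip(layers[red], layers[nir]):
        total = n + r
        # Propagate nodata (NaN) and avoid dividing by zero.
        if math.isnan(total) or total == 0:
            ndvi.append(math.nan)
        else:
            ndvi.append((n - r) / total)
    return {**layers, out: ndvi}


# Hypothetical reflectance values; NaN marks a nodata pixel.
layers = {
    "landsat_red": [0.1, 0.2, 0.0, math.nan],
    "landsat_nir": [0.5, 0.2, 0.0, 0.4],
}
result = add_ndvi(layers)
print(result["ndvi"])
```

Deriving the index in memory means the NDVI never has to be written to disk as a separate raster, provided the same derived column was present in the training samples.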

Returns

List with all the raster files produced as output.

Return type

List[Path]
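
To make the suffix convention concrete, the sketch below derives the hard-class output names listed for hard_class=True from a single fn_output path. hard_class_outputs is a hypothetical helper, not part of eumap, and it covers only the three suffixes named in the parameter description; the names of the per-class probability files are not specified here and are left out.

```python
from pathlib import Path


def hard_class_outputs(fn_output):
    """Build the *_hcl, *_hcl_prob and *_hcl_uncertainty paths next to fn_output."""
    fn_output = Path(fn_output)
    suffixes = ["_hcl", "_hcl_prob", "_hcl_uncertainty"]
    return [
        fn_output.with_name(f"{fn_output.stem}{s}{fn_output.suffix}")
        for s in suffixes
    ]


# Hypothetical output location.
outputs = hard_class_outputs("predictions/land_cover_2019.tif")
print([p.name for p in outputs])
```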

predict_multi(dirs_layers_list=[], fn_layers_list=[], fn_output_list=[], spatial_win=None, dtype='float32', fill_nodata=False, separate_probs=True, hard_class=True, inmem_calc_func=None, dict_layers_newnames_list=[], allow_additional_layers=False, prediction_strategy_type=PredictionStrategyType.Lazy)

Predict multiple raster data. It matches the raster filenames with the input features/covariates used in training.

Parameters
  • dirs_layers_list (List[List]) – A list of lists containing the folders where the raster files are located.

  • fn_layers_list (List[List]) – A list of lists containing the raster paths. If provided, dirs_layers_list is ignored.

  • fn_output_list (List[List]) – A list of file paths where the prediction results are saved. For multiple outputs (probabilities, uncertainty) the same location is used, with specific suffixes added to the provided file path.

  • spatial_win (Optional[Window]) – Read the data and predict according to the spatial window. The default is None, which means all the data is read and predicted.

  • dtype (str) – Convert the read data to the specified dtype. For Float* the nodata values are converted to np.nan.

  • fill_nodata (bool) – Use the nodata_imputer to fill all np.nan values. The default is False because in almost all cases it is preferable to use the eumap.gapfiller module for this task.

  • separate_probs (bool) – Use True to save the predicted probabilities in separate rasters; otherwise they are written as multiple bands of a single raster file. Ignored for pred_method='predict'.

  • hard_class (bool) – When pred_method='predict_proba', use True to save the predicted dominant class (*_hcl.tif) together with the probability (*_hcl_prob.tif) and uncertainty (*_hcl_uncertainty.tif) values of each dominant class.

  • inmem_calc_func (Optional[Callable]) – Function to be executed before the prediction. Use it to derive covariates/features on-the-fly, calculating in memory, for example, an NDVI from the red and NIR bands.

  • dict_layers_newnames_list – A list of dictionaries used to change the raster filenames on-the-fly. Use it to match the column names for the point samples with different raster filenames.

  • allow_additional_layers – Use False to raise an exception if a read raster is not present in feature_cols.

  • prediction_strategy_type – Strategy used to predict the multiple raster data. The default is Lazy, which loads one year while predicting the other.

Returns

List with all the raster files produced as output.

Return type

List[Path]
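
The lazy strategy described for prediction_strategy_type overlaps reading with predicting. One generic way to get that overlap is a single background worker that prefetches the next input while the current one is processed, as sketched below with the standard library. This is an illustration of the pattern, not eumap's code; load_layers, predict_layers and predict_lazily are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor


def load_layers(year):
    """Stand-in for reading one year's rasters from disk."""
    return [year, year + 1, year + 2]


def predict_layers(data):
    """Stand-in for running the trained model on the loaded data."""
    return sum(data)


def predict_lazily(years):
    """Predict each year while the next year's data loads in the background."""
    results = []
    if not years:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_layers, years[0])
        for next_year in list(years[1:]) + [None]:
            data = pending.result()  # wait for the current year's data
            if next_year is not None:
                # Start reading the next year before predicting the current one.
                pending = pool.submit(load_layers, next_year)
            results.append(predict_layers(data))
    return results


predictions = predict_lazily([2000, 2001, 2002])
print(predictions)
```

Only two inputs are held in memory at any time (the one being predicted and the one being read), which is the usual trade-off of this pattern against loading everything up front.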

predict_points(input_points)

Predict point samples. It uses feature_cols to retrieve the input features/covariates.

Parameters

input_points (DataFrame) – New set of point samples to be predicted.

Returns

The prediction result and the prediction uncertainty (only for EML)

Return type

Tuple[Numpy.array, Numpy.array]
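
For ensemble models, one common way to express prediction uncertainty is the spread of the base learners' predicted probabilities, optionally rescaled by the rmse / averaged_sd factor that apply_corr_factor refers to. The sketch below illustrates that arithmetic with made-up numbers. It assumes the population standard deviation across learners, and it averages the spread over the same three samples purely for illustration; how eumap obtains the RMSE and the averaged standard deviation is not specified here.

```python
from statistics import mean, pstdev

# Probability of one class for three samples, as predicted by three base learners.
base_probs = [
    [0.90, 0.80, 0.85],  # sample 1: learners roughly agree
    [0.20, 0.60, 0.40],  # sample 2: learners disagree
    [0.50, 0.50, 0.50],  # sample 3: identical outputs
]

# Uncertainty per sample: spread of the learners' probabilities.
uncertainty = [pstdev(p) for p in base_probs]

# Hypothetical error summary used to rescale the spread.
rmse = 0.12
averaged_sd = mean(uncertainty)
corr_factor = rmse / averaged_sd
corrected = [u * corr_factor for u in uncertainty]

print([round(u, 3) for u in uncertainty])
print([round(c, 3) for c in corrected])
```

After rescaling, the average corrected uncertainty equals the RMSE, so the spread is expressed on the same scale as the observed prediction error.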

save_instance(fn_joblib, no_train_data=False, compress='lz4')

Persist the class instance to disk using joblib.dump. Use it to run predictions over new raster/point data without retraining the models from scratch.

Parameters
  • fn_joblib (Path) – Location of the output file.

  • no_train_data (bool) – Remove all the training data before persisting the instance to disk.

  • compress (str) – Enable compression.

train()

Train the ML/EML model according to the class arguments.