eumap.mapper.LandMapper
- class LandMapper(points, target_col, feat_cols=[], feat_col_prfxs=[], weight_col=None, nodata_imputer=None, estimator=RandomForestClassifier(), estimator_list=None, meta_estimator=LogisticRegression(), hyperpar_selection=None, hyperpar_selection_list=None, hyperpar_selection_meta=None, feature_selection=None, feature_selections_list=None, cv=KFold(n_splits=5, random_state=None, shuffle=False), cv_njobs=1, cv_group_col=None, min_samples_per_class=0, pred_method='predict', verbose=True, apply_corr_factor=False, **autosklearn_kwargs)
Bases: object
Spatial prediction implementation based on supervised machine learning models and point samples.
It is fully compatible with scikit-learn [1], supporting:
Classification models,
Seamless training using point samples,
Data imputation,
Hyper-parameter optimization,
Feature selection,
Ensemble machine learning (EML) and prediction uncertainty,
AutoML through auto-sklearn [2],
Accuracy assessment through cross-validation,
Seamless raster prediction (read and write).
- Parameters
points (Union[DataFrame,Path]) – Point samples used to train the ML model. It supports a pandas.DataFrame or a path to a plain CSV (*.csv) or compressed CSV (*.gz) file, both read through pandas.read_csv [3]. All other extensions are read by geopandas as GIS vector files [4].
target_col (str) – Column name used to retrieve the target values for the training.
feat_cols (Optional[List]) – List of column names used to retrieve the features/covariates for the training.
feat_col_prfxs (Optional[List]) – List of column prefixes used to derive the feat_cols list, which avoids having to provide dozens or hundreds of column names.
weight_col (Optional[str]) – Column name used to retrieve the sample_weight for the training.
nodata_imputer (Optional[BaseEstimator]) – Transformer used to impute missing values, filling all np.nan in the point samples. All sklearn.impute classes are supported [1].
estimator (Optional[BaseEstimator]) – The ML model used by the class. The default model is a RandomForestClassifier, but all sklearn models are supported [1]. With estimator=None it tries to use auto-sklearn to find the best model and hyper-parameters [2].
estimator_list (Optional[List[BaseEstimator]]) – A list of models used by the EML implementation. The model outputs are used to feed the meta_estimator model and to derive the prediction uncertainty. This argument takes precedence over estimator.
meta_estimator (BaseEstimator) – Model used to derive the prediction output in the EML implementation. The default model is a LogisticRegression, but all sklearn models are supported [1].
hyperpar_selection (Optional[BaseEstimator]) – Hyper-parameter optimizer used by the estimator model.
hyperpar_selection_list (Optional[BaseEstimator]) – A list of hyper-parameter optimizers used by the estimator_list models, provided in the same order. This argument takes precedence over hyperpar_selection.
hyperpar_selection_meta (Optional[List[BaseEstimator]]) – Hyper-parameter optimizer used by the meta_estimator model.
feature_selection (Optional[BaseEstimator]) – Feature selection algorithm used by the estimator model.
feature_selections_list (Optional[BaseEstimator]) – A list of feature selection algorithms used by the estimator_list models, provided in the same order. This argument takes precedence over feature_selection.
cv (BaseCrossValidator) – Cross-validation strategy used by all models. The default strategy is a 5-fold CV, but all sklearn cross-validation strategies are supported [1].
cv_njobs (int) – Number of CPU cores to be used in parallel during the cross-validation.
cv_group_col (Optional[str]) – Column name used to split the train/test sets during the cross-validation. Use this argument to perform a spatial CV by blocks/tiles.
min_samples_per_class (float) – Minimum percentage of samples, according to target_col, required to keep a class in the training.
pred_method (str) – Use predict_proba to predict probabilities and uncertainty, otherwise only the dominant class is predicted.
apply_corr_factor (bool) – Apply a correction factor (rmse / averaged_sd) to the prediction uncertainty output.
verbose (bool) – Use True to print the progress of all steps.
**autosklearn_kwargs – Named arguments supported by auto-sklearn [2].
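To make the relationship between feat_cols and feat_col_prfxs concrete, here is a minimal sketch of prefix-based column selection. The helper name `expand_feature_columns` and the column names are hypothetical, and the matching rule (a plain "starts with" test) is an assumption, not necessarily what LandMapper does internally.

```python
from typing import Iterable, List


def expand_feature_columns(columns: Iterable[str],
                           feat_cols: Iterable[str] = (),
                           feat_col_prfxs: Iterable[str] = ()) -> List[str]:
    """Return the explicitly listed columns plus every column matching a prefix.

    Hypothetical helper for illustration only; the result keeps the order of `columns`.
    """
    explicit = set(feat_cols)
    prefixes = tuple(feat_col_prfxs)
    selected = []
    for col in columns:
        # str.startswith accepts a tuple of prefixes; an empty tuple matches nothing.
        if col in explicit or col.startswith(prefixes):
            selected.append(col)
    return selected


# Made-up column names of a point-sample table.
columns = ["lc_class", "landsat_red", "landsat_nir", "dtm_slope", "dtm_elev", "tile_id"]
feature_columns = expand_feature_columns(
    columns, feat_cols=["tile_id"], feat_col_prfxs=["landsat", "dtm"]
)
print(feature_columns)
```

With two prefixes and one explicit name, the sketch selects five of the six columns and leaves the target column out.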
For usage examples, see the eumap tutorials [5,6].
References
[2] Auto-sklearn API
[4] Geopandas read_file function
[6] Land Cover Mapping (Advanced)
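The effect of min_samples_per_class can be pictured with a small, library-free sketch. It assumes the threshold is a fraction between 0 and 1 and that the comparison is inclusive; the documentation above does not state either detail, and the helper `classes_to_keep` and the class labels are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def classes_to_keep(targets: Sequence[str], min_samples_per_class: float) -> List[str]:
    """Return the classes whose share of all samples reaches the threshold.

    Assumption: the threshold is a fraction in [0, 1], compared inclusively.
    """
    total = len(targets)
    if total == 0:
        return []
    counts = Counter(targets)
    return sorted(cls for cls, n in counts.items() if n / total >= min_samples_per_class)


# Ten made-up samples: 60% forest, 30% water, 10% wetland.
targets = ["forest"] * 6 + ["water"] * 3 + ["wetland"] * 1
kept = classes_to_keep(targets, min_samples_per_class=0.2)
print(kept)
```

Under these assumptions the default value of 0 keeps every class, while 0.2 drops the class that holds only 10% of the samples.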
Methods
Load a class instance from disk.
Predict raster data.
Predict multiple raster data.
Predict point samples.
Persist the class instance to disk using joblib.dump.
Train the ML/EML model according to the class arguments.
- static load_instance(fn_joblib)
Load a class instance from disk.
- Parameters
fn_joblib – Location of the saved instance.
- Returns
Class instance
- Return type
LandMapper
- predict(dirs_layers=[], fn_layers=[], fn_output=None, spatial_win=None, dtype='float32', fill_nodata=False, separate_probs=True, hard_class=True, inmem_calc_func=None, dict_layers_newnames={}, allow_additional_layers=False, n_jobs_io=4, verbose_renaming=True)
Predict raster data. It matches the raster filenames with the input feature/covariates used by training.
- Parameters
dirs_layers (List) – A list of folders where the raster files are located.
fn_layers (List) – A list with the raster paths. If provided, dirs_layers is ignored.
fn_output (Optional[str]) – File path where the prediction result is saved. For multiple outputs (probabilities, uncertainty) the same location is used, adding specific suffixes to the provided file path.
spatial_win (Optional[Window]) – Read the data and predict according to the spatial window. The default is None, which means all the data is read and predicted.
dtype – Convert the read data to a specific dtype. For Float*, the nodata values are converted to np.nan.
fill_nodata (bool) – Use the nodata_imputer to fill all np.nan values. The default is False because in almost all cases it is preferable to use the eumap.gapfiller module to perform this task.
separate_probs (bool) – Use True to save the predicted probabilities in separate rasters, otherwise they are written as multiple bands of a single raster file. Ignored for pred_method='predict'.
hard_class (bool) – When pred_method='predict_proba', use True to save the predicted dominant class (*_hcl.tif), and the probability (*_hcl_prob.tif) and uncertainty (*_hcl_uncertainty.tif) values of each dominant class.
inmem_calc_func (Optional[Callable]) – Function to be executed before the prediction. Use it to derive covariates/features on the fly, calculating in memory, for example, an NDVI from the red and NIR bands.
dict_layers_newnames (set) – A dictionary used to change the raster filenames on the fly. Use it to match the column names of the point samples with different raster filenames.
allow_additional_layers (bool) – Use False to raise an Exception if a read raster is not present in feature_cols.
n_jobs_io (int) – Number of parallel jobs used to read the raster files.
verbose_renaming (bool) – Show which raster layers are renamed.
- Returns
List with all the raster files produced as output.
- Return type
List[Path]
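The hard_class outputs described above can be sketched without any raster library. The example below derives the three suffixed file names from fn_output and picks the dominant class for a single pixel. The helpers `hard_class_outputs` and `dominant_class` are hypothetical, the sketch assumes fn_output ends with a .tif extension, and the 0-based class index is an illustration choice, not the encoding eumap writes to disk.

```python
from pathlib import Path
from typing import List, Sequence, Tuple


def hard_class_outputs(fn_output: str) -> List[Path]:
    """Build the three hard-class paths by adding a suffix before the extension."""
    base = Path(fn_output)
    suffixes = ["_hcl", "_hcl_prob", "_hcl_uncertainty"]
    return [base.with_name(f"{base.stem}{s}{base.suffix}") for s in suffixes]


def dominant_class(probs: Sequence[float],
                   uncertainty: Sequence[float]) -> Tuple[int, float, float]:
    """Return (class index, probability, uncertainty) of the most probable class."""
    idx = max(range(len(probs)), key=lambda i: probs[i])
    return idx, probs[idx], uncertainty[idx]


# Made-up output path and per-class values for one pixel.
outputs = hard_class_outputs("predictions/landcover_2020.tif")
pixel = dominant_class(probs=[0.1, 0.7, 0.2], uncertainty=[0.05, 0.12, 0.08])
print([p.name for p in outputs])
print(pixel)
```

Each hard-class raster therefore holds one value per pixel: the winning class, its probability, and the uncertainty attached to that same class.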
- predict_multi(dirs_layers_list=[], fn_layers_list=[], fn_output_list=[], spatial_win=None, dtype='float32', fill_nodata=False, separate_probs=True, hard_class=True, inmem_calc_func=None, dict_layers_newnames_list=[], allow_additional_layers=False, prediction_strategy_type=PredictionStrategyType.Lazy)
Predict multiple raster data. It matches the raster filenames with the input feature/covariates used by training.
- Parameters
dirs_layers_list (List[List]) – A list of lists containing the folders where the raster files are located.
fn_layers_list (List[List]) – A list of lists containing the raster paths. If provided, dirs_layers_list is ignored.
fn_output_list (List[List]) – A list of file paths where the prediction results are saved. For multiple outputs (probabilities, uncertainty) the same location is used, adding specific suffixes to the provided file path.
spatial_win (Optional[Window]) – Read the data and predict according to the spatial window. The default is None, which means all the data is read and predicted.
dtype (str) – Convert the read data to a specific dtype. For Float*, the nodata values are converted to np.nan.
fill_nodata (bool) – Use the nodata_imputer to fill all np.nan values. The default is False because in almost all cases it is preferable to use the eumap.gapfiller module to perform this task.
separate_probs (bool) – Use True to save the predicted probabilities in separate rasters, otherwise they are written as multiple bands of a single raster file. Ignored for pred_method='predict'.
hard_class (bool) – When pred_method='predict_proba', use True to save the predicted dominant class (*_hcl.tif), and the probability (*_hcl_prob.tif) and uncertainty (*_hcl_uncertainty.tif) values of each dominant class.
inmem_calc_func (Optional[Callable]) – Function to be executed before the prediction. Use it to derive covariates/features on the fly, calculating in memory, for example, an NDVI from the red and NIR bands.
dict_layers_newnames_list – A list of dictionaries used to change the raster filenames on the fly. Use it to match the column names of the point samples with different raster filenames.
allow_additional_layers – Use False to raise an Exception if a read raster is not present in feature_cols.
prediction_strategy_type – Which strategy is used to predict the multiple raster data. The default is Lazy, which loads one year while predicting the other.
- Returns
List with all the raster files produced as output.
- Return type
List[Path]
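The Lazy strategy is described above only as loading one year while predicting the other. A generic way to overlap the two steps is sketched below with a single background thread. This illustrates the idea and is not eumap's PredictionStrategyType implementation; `predict_lazily`, `load` and `predict` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

Item = TypeVar("Item")
Data = TypeVar("Data")
Result = TypeVar("Result")


def predict_lazily(items: Sequence[Item],
                   load: Callable[[Item], Data],
                   predict: Callable[[Data], Result]) -> List[Result]:
    """Predict items in order, loading item i+1 in the background while item i is predicted."""
    results: List[Result] = []
    if not items:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load, items[0])
        for i in range(len(items)):
            data = pending.result()  # wait until the current dataset is loaded
            if i + 1 < len(items):
                pending = pool.submit(load, items[i + 1])  # start loading the next one
            results.append(predict(data))  # predict while the next dataset loads
    return results


# Stand-ins: "loading" builds a small list, "predicting" sums it.
years = [2018, 2019, 2020]
outputs = predict_lazily(years, load=lambda y: [y, y + 1], predict=sum)
print(outputs)
```

Only two datasets are held at any time (the one being predicted and the one being loaded), which keeps memory bounded when many years are processed.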
- predict_points(input_points)
Predict point samples. It uses the feature_cols to retrieve the input feature/covariates.
- Parameters
input_points (DataFrame) – New set of point samples to be predicted.
- Returns
The prediction result and the prediction uncertainty (only for EML)
- Return type
Tuple[Numpy.array, Numpy.array]
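The second element of the returned tuple exists only for EML because it is derived from the outputs of several base models. As a hedged illustration, the sketch below measures uncertainty as the per-class standard deviation of the probabilities predicted by the base models for one sample. The choice of the population standard deviation and the helper `ensemble_uncertainty` are assumptions; the text above only says the uncertainty is derived from the model outputs.

```python
from statistics import pstdev
from typing import List, Sequence


def ensemble_uncertainty(model_probs: Sequence[Sequence[float]]) -> List[float]:
    """Per-class spread of the probabilities across base models for one sample.

    model_probs[m][c] is the probability of class c predicted by base model m.
    Assumption: spread is measured with the population standard deviation.
    """
    n_classes = len(model_probs[0])
    return [pstdev([probs[c] for probs in model_probs]) for c in range(n_classes)]


# Three made-up base models, two classes, one sample.
model_probs = [[0.8, 0.2], [0.6, 0.4], [0.7, 0.3]]
uncertainty = ensemble_uncertainty(model_probs)
print(uncertainty)
```

When all base models agree the spread is zero, and it grows as their probabilities diverge, which is the behaviour expected from an ensemble-based uncertainty estimate.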