eumap.mapper.LandMapper
- class LandMapper(points, target_col, feat_cols=[], feat_col_prfxs=[], weight_col=None, nodata_imputer=None, estimator=RandomForestClassifier(), estimator_list=None, meta_estimator=LogisticRegression(), hyperpar_selection=None, hyperpar_selection_list=None, hyperpar_selection_meta=None, feature_selection=None, feature_selections_list=None, cv=KFold(n_splits=5, random_state=None, shuffle=False), cv_njobs=1, cv_group_col=None, min_samples_per_class=0, pred_method='predict', verbose=True, apply_corr_factor=False, **autosklearn_kwargs)
Bases: object
Spatial prediction implementation based on supervised machine learning models and point samples.
It is fully compatible with scikit-learn [1], supporting:
- Classification models,
- Seamless training using point samples,
- Data imputation,
- Hyper-parameter optimization,
- Feature selection,
- Ensemble machine learning (EML) and prediction uncertainty,
- AutoML through auto-sklearn [2],
- Accuracy assessment through cross-validation,
- Seamless raster prediction (read and write).
- Parameters
points (Union[DataFrame, Path]) – Point samples used to train the ML model. It supports a pandas.DataFrame or a path to a plain CSV (*.csv) or compressed CSV (*.gz) file, which is read through pandas.read_csv [3]. All other extensions are read by geopandas as GIS vector files [4].
target_col (str) – Column name used to retrieve the target values for the training.
feat_cols (Optional[List]) – List of column names used to retrieve the features/covariates for the training.
feat_col_prfxs (Optional[List]) – List of column prefixes used to derive the feat_cols list, which avoids having to provide dozens or hundreds of column names.
weight_col (Optional[str]) – Column name used to retrieve the sample_weight for the training.
nodata_imputer (Optional[BaseEstimator]) – Transformer used to impute missing values, filling all np.nan in the point samples. All sklearn.impute classes are supported [1].
estimator (Optional[BaseEstimator]) – The ML model used by the class. The default model is a RandomForestClassifier, but any sklearn model is supported [1]. For estimator=None it tries to use auto-sklearn to find the best model and hyper-parameters [2].
estimator_list (Optional[List[BaseEstimator]]) – A list of models used by the EML implementation. The model outputs are fed to the meta_estimator model and used to derive the prediction uncertainty. This argument takes precedence over estimator.
meta_estimator (BaseEstimator) – Model used to derive the prediction output in the EML implementation. The default model is a LogisticRegression, but any sklearn model is supported [1].
hyperpar_selection (Optional[BaseEstimator]) – Hyper-parameter optimizer used by the estimator model.
hyperpar_selection_list (Optional[BaseEstimator]) – A list of hyper-parameter optimizers used by the estimator_list models, provided in the same order. This argument takes precedence over hyperpar_selection.
hyperpar_selection_meta (Optional[List[BaseEstimator]]) – Hyper-parameter optimizer used by the meta_estimator model.
feature_selection (Optional[BaseEstimator]) – Feature selection algorithm used by the estimator model.
feature_selections_list (Optional[BaseEstimator]) – A list of feature selection algorithms used by the estimator_list models, provided in the same order. This argument takes precedence over feature_selection.
cv (BaseCrossValidator) – Cross-validation strategy used by all models. The default strategy is a 5-fold CV, but any sklearn cross-validator is supported [1].
cv_njobs (int) – Number of CPU cores used in parallel during the cross-validation.
cv_group_col (Optional[str]) – Column name used to split the train/test sets during the cross-validation. Use this argument to perform a spatial CV by blocks/tiles.
min_samples_per_class (float) – Minimum percentage of samples, according to target_col, required to keep a class in the training.
pred_method (str) – Use predict_proba to predict probabilities and uncertainty; otherwise only the dominant class is predicted.
apply_corr_factor (bool) – Apply a correction factor (rmse / averaged_sd) to the prediction uncertainty output.
verbose (bool) – Use True to print the progress of all steps.
**autosklearn_kwargs – Named arguments supported by auto-sklearn [2].
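The feat_col_prfxs argument can be pictured with a small sketch. It assumes a plain starts-with match against the column names; the helper derive_feat_cols and the column names are hypothetical and not part of eumap, whose internal matching may differ.

```python
def derive_feat_cols(columns, feat_cols=(), feat_col_prfxs=()):
    """Hypothetical helper: keep explicit columns, then add prefix matches.

    Assumption: a column is selected when its name starts with any prefix.
    """
    selected = list(feat_cols)
    for col in columns:
        if col not in selected and any(col.startswith(p) for p in feat_col_prfxs):
            selected.append(col)
    return selected


# Hypothetical column names of a point-sample table
columns = ["lc_class", "landsat_red", "landsat_nir", "dtm_slope", "tile_id"]
feat_cols = derive_feat_cols(columns, feat_col_prfxs=["landsat_", "dtm_"])
print(feat_cols)
```

Two prefixes are enough here to pick every covariate column while leaving the target and grouping columns out.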
For usage examples, see the eumap tutorials [5,6].
References
[2] Auto-sklearn API
[4] Geopandas read_file function
[6] Land Cover Mapping (Advanced)
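As an illustration of the min_samples_per_class argument described in the parameters, the sketch below drops classes whose share of the samples falls below a threshold. It assumes the threshold is expressed as a fraction between 0 and 1 (the documentation only says "percentage"); the helper keep_frequent_classes and the labels are hypothetical.

```python
from collections import Counter


def keep_frequent_classes(labels, min_samples_per_class=0.0):
    """Hypothetical helper: drop samples of classes that are too rare.

    Assumption: the threshold is a fraction of the total number of samples.
    """
    counts = Counter(labels)
    total = len(labels)
    kept = {c for c, n in counts.items() if n / total >= min_samples_per_class}
    return [y for y in labels if y in kept]


# 60% forest, 30% crop, 10% water
labels = ["forest"] * 6 + ["crop"] * 3 + ["water"] * 1
filtered = keep_frequent_classes(labels, min_samples_per_class=0.2)
print(sorted(set(filtered)))
```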
Methods
Load a class instance from disk.
Predict raster data.
Predict multiple raster data.
Predict point samples.
Persist the class instance on disk using joblib.dump.
Train the ML/EML model according to the class arguments.
- static load_instance(fn_joblib)
Load a class instance from disk.
- Parameters
fn_joblib – Location of the saved instance.
- Returns
Class instance
- Return type
LandMapper
- predict(dirs_layers=[], fn_layers=[], fn_output=None, spatial_win=None, dtype='float32', fill_nodata=False, separate_probs=True, hard_class=True, inmem_calc_func=None, dict_layers_newnames={}, allow_additional_layers=False, n_jobs_io=4, verbose_renaming=True)
Predict raster data. It matches the raster filenames with the input feature/covariates used by training.
- Parameters
dirs_layers (List) – A list of folders where the raster files are located.
fn_layers (List) – A list with the raster paths. If provided, dirs_layers is ignored.
fn_output (Optional[str]) – File path where the prediction result is saved. For multiple outputs (probabilities, uncertainty) the same location is used, with specific suffixes added to the provided file path.
spatial_win (Optional[Window]) – Read the data and predict according to the spatial window. Defaults to None, which means all the data is read and predicted.
dtype – Convert the read data to a specific dtype. For Float* types, the nodata values are converted to np.nan.
fill_nodata (bool) – Use the nodata_imputer to fill all np.nan values. Defaults to False because in almost all cases it is preferable to use the eumap.gapfiller module to perform this task.
separate_probs (bool) – Use True to save the predicted probabilities in separate rasters; otherwise they are written as multiple bands of a single raster file. Ignored for pred_method='predict'.
hard_class (bool) – When pred_method='predict_proba', use True to save the predicted dominant class (*_hcl.tif), and the probability (*_hcl_prob.tif) and uncertainty (*_hcl_uncertainty.tif) values of each dominant class.
inmem_calc_func (Optional[Callable]) – Function executed before the prediction. Use it to derive covariates/features on the fly, calculating in memory, for example, an NDVI from the red and NIR bands.
dict_layers_newnames (set) – A dictionary used to change the raster filenames on the fly. Use it to match the column names of the point samples with different raster filenames.
allow_additional_layers (bool) – Use False to raise an Exception if a raster that was read is not present in feature_cols.
n_jobs_io (int) – Number of parallel jobs used to read the raster files.
verbose_renaming (bool) – Show which raster layers are renamed.
- Returns
List with all the raster files produced as output.
- Return type
List[Path]
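The NDVI use case mentioned for inmem_calc_func can be sketched as follows. The callback signature used here (layer names plus per-layer values in, extended names and values out) is an assumption made for illustration, not the signature eumap expects. Plain Python lists stand in for raster arrays, and the layer names red and nir are hypothetical.

```python
def add_ndvi(layernames, data):
    """Hypothetical callback: append an NDVI layer derived in memory.

    Assumption: data holds one list of pixel values per layer, in the
    same order as layernames.
    """
    red = data[layernames.index("red")]
    nir = data[layernames.index("nir")]
    ndvi = [
        (n - r) / (n + r) if (n + r) != 0 else float("nan")  # avoid 0/0
        for r, n in zip(red, nir)
    ]
    return layernames + ["ndvi"], data + [ndvi]


layernames = ["red", "nir"]
data = [[0.1, 0.2, 0.0], [0.5, 0.2, 0.0]]
new_names, new_data = add_ndvi(layernames, data)
print(new_names)
```

The derived layer is appended after the existing ones, so a model trained with an ndvi column could find it by name without a raster file for it existing on disk.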
- predict_multi(dirs_layers_list=[], fn_layers_list=[], fn_output_list=[], spatial_win=None, dtype='float32', fill_nodata=False, separate_probs=True, hard_class=True, inmem_calc_func=None, dict_layers_newnames_list=[], allow_additional_layers=False, prediction_strategy_type=PredictionStrategyType.Lazy)
Predict multiple raster data. It matches the raster filenames with the input feature/covariates used by training.
- Parameters
dirs_layers_list (List[List]) – A list of lists containing the folders where the raster files are located.
fn_layers_list (List[List]) – A list of lists containing the raster paths. If provided, dirs_layers_list is ignored.
fn_output_list (List[List]) – A list of file paths where the prediction results are saved. For multiple outputs (probabilities, uncertainty) the same location is used, with specific suffixes added to the provided file path.
spatial_win (Optional[Window]) – Read the data and predict according to the spatial window. Defaults to None, which means all the data is read and predicted.
dtype (str) – Convert the read data to a specific dtype. For Float* types, the nodata values are converted to np.nan.
fill_nodata (bool) – Use the nodata_imputer to fill all np.nan values. Defaults to False because in almost all cases it is preferable to use the eumap.gapfiller module to perform this task.
separate_probs (bool) – Use True to save the predicted probabilities in separate rasters; otherwise they are written as multiple bands of a single raster file. Ignored for pred_method='predict'.
hard_class (bool) – When pred_method='predict_proba', use True to save the predicted dominant class (*_hcl.tif), and the probability (*_hcl_prob.tif) and uncertainty (*_hcl_uncertainty.tif) values of each dominant class.
inmem_calc_func (Optional[Callable]) – Function executed before the prediction. Use it to derive covariates/features on the fly, calculating in memory, for example, an NDVI from the red and NIR bands.
dict_layers_newnames_list – A list of dictionaries used to change the raster filenames on the fly. Use it to match the column names of the point samples with different raster filenames.
allow_additional_layers – Use False to raise an Exception if a raster that was read is not present in feature_cols.
prediction_strategy_type – Strategy used to predict the multiple raster data. Defaults to Lazy, which loads one year while predicting the other.
- Returns
List with all the raster files produced as output.
- Return type
List[Path]
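The idea behind the Lazy strategy, loading one input while the previous one is being predicted, can be sketched with the standard library. This is not eumap's implementation: predict_lazily, fake_load and fake_predict are hypothetical stand-ins for reading rasters and running the model.

```python
from concurrent.futures import ThreadPoolExecutor


def predict_lazily(inputs, load, predict):
    """Overlap loading of the next input with prediction of the current one."""
    results = []
    if not inputs:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load, inputs[0])
        for nxt in inputs[1:]:
            data = pending.result()           # wait for the current input
            pending = pool.submit(load, nxt)  # start loading the next one
            results.append(predict(data))     # predict while it loads
        results.append(predict(pending.result()))
    return results


def fake_load(year):
    # Stand-in for reading the raster layers of one year
    return [year, year + 1]


def fake_predict(data):
    # Stand-in for the model prediction
    return sum(data)


outputs = predict_lazily([2000, 2001, 2002], fake_load, fake_predict)
print(outputs)
```

Only two inputs are held in memory at any time (the one being predicted and the one being loaded), which is the usual reason to prefer this pattern over loading everything up front.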
- predict_points(input_points)
Predict point samples. It uses the feature_cols to retrieve the input features/covariates.
- Parameters
input_points (DataFrame) – New set of point samples to be predicted.
- Returns
The prediction result and the prediction uncertainty (only for EML).
- Return type
Tuple[Numpy.array, Numpy.array]
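One common way to obtain an ensemble uncertainty alongside the prediction is to measure how much the base models disagree. The sketch below assumes the uncertainty is the population standard deviation of the class probabilities across models; this page does not state how eumap computes it, so the helper ensemble_uncertainty and the probabilities are illustrative only.

```python
from statistics import pstdev


def ensemble_uncertainty(probs_per_model):
    """Hypothetical helper: per-sample, per-class spread across models.

    probs_per_model[m][s][c] is the probability that model m assigns
    to class c for sample s.
    """
    n_samples = len(probs_per_model[0])
    n_classes = len(probs_per_model[0][0])
    return [
        [pstdev([model[s][c] for model in probs_per_model]) for c in range(n_classes)]
        for s in range(n_samples)
    ]


# Three base models, one sample, two classes
probs = [
    [[0.6, 0.4]],
    [[0.8, 0.2]],
    [[0.7, 0.3]],
]
uncertainty = ensemble_uncertainty(probs)
print(uncertainty)
```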