Performing ErosionTask on MASK_TIMELESS.LULC with very few valid pixels


I will play devil’s advocate for a minute and give an example where I have an eopatch containing extremely few valid LULC pixels from which to sample (n = 5).

Since I also perform an ErosionTask on this LULC layer before running PointSamplingTask, I end up erasing all remaining valid pixels, meaning that the resulting LULC_SAMPLED vector is empty and of shape (0,1,1) even though n_samples = 2000. The same happens for FEATURES_SAMPLED.

This has implications for the preparation of labels_training, labels_test, features_training and features_test, so I think a warning or error should be raised. It could be raised at the VectorToRaster step, although that would be difficult because you may decide not to perform an ErosionTask afterwards, in which case the layer remains valid. Alternatively, it could be raised at the ErosionTask step, saying that all valid pixels have eroded away, and sampling on that eopatch could even be skipped altogether to prevent problems later on.
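To make the situation concrete, here is a minimal sketch of the kind of check I have in mind. It uses plain NumPy only: erode_4_neighbours and check_valid_pixels are hypothetical helpers written for this illustration, not part of eo-learn, and the erosion is a simplified binary stand-in for what ErosionTask does.

```python
import warnings

import numpy as np


def erode_4_neighbours(mask):
    """Simplified binary erosion: keep a pixel only if it and its
    four direct neighbours are all valid (hypothetical helper)."""
    padded = np.pad(mask.astype(bool), 1, constant_values=False)
    return (
        padded[1:-1, 1:-1]
        & padded[:-2, 1:-1] & padded[2:, 1:-1]
        & padded[1:-1, :-2] & padded[1:-1, 2:]
    )


def check_valid_pixels(mask, name="LULC"):
    """Return the number of valid pixels and warn if none are left."""
    n_valid = int(np.count_nonzero(mask))
    if n_valid == 0:
        warnings.warn(
            f"All valid {name} pixels have eroded away; "
            "sampling should be skipped for this patch."
        )
    return n_valid


# Five isolated valid pixels in a 10x10 patch
lulc = np.zeros((10, 10), dtype=np.uint8)
lulc[[1, 3, 5, 7, 8], [2, 6, 4, 8, 1]] = 1

eroded = erode_4_neighbours(lulc)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    n_before = check_valid_pixels(lulc)    # 5 valid pixels, no warning
    n_after = check_valid_pixels(eroded)   # 0 valid pixels, warning raised
```

A check like this could run right after the erosion step, and the returned count could decide whether the patch goes on to sampling at all.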




Thanks for raising this issue. Your example is actually very good for illustrating several possible pitfalls:

  • ErosionTask: I would still let the user have full control over how the reference map is eroded (if it is eroded at all). Not performing erosion in patches with few labeled pixels could become problematic as well: single spurious labels can be of lower quality, and not eroding them could increase the label noise in the sampled training/test sets. Please note that at the moment the erosion can be label-specific, so a different type or strength of erosion can be applied, for example, to road and forest classes.

  • PointSamplingTask: your example points to another serious issue. Imagine that the 5 pixels you mentioned above did not get eroded and that in the next step you sampled 2000 pixels. The result of the sampling task would be an array of length 2000 containing only 5 different pixels/points. Each of the 5 pixels would be included multiple times, which could introduce problems in the model development stage. Solving this problem requires changes in the sampling task.
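The duplication effect can be reproduced with a few lines of NumPy. This is only a sketch: it assumes sampling with replacement, which matches the behaviour described above, and it does not reproduce the actual PointSamplingTask code. The array valid_pixel_ids is a made-up stand-in for the positions of the 5 valid pixels.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

valid_pixel_ids = np.arange(5)   # stand-in for the 5 valid pixel positions
n_samples = 2000

# Sampling with replacement succeeds even though only 5 pixels exist,
# so every pixel ends up repeated many times in the output.
sampled = rng.choice(valid_pixel_ids, size=n_samples, replace=True)
unique_ids, counts = np.unique(sampled, return_counts=True)

# Sampling without replacement refuses to draw more samples than pixels.
try:
    rng.choice(valid_pixel_ids, size=n_samples, replace=False)
    raised = False
except ValueError:
    raised = True
```

Here sampled has length 2000 but unique_ids holds at most 5 entries, and counts shows how often each pixel was repeated.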



The explanation of ErosionTask makes perfect sense to me, and it should be kept that way. The task does precisely what is expected of it.

However, for PointSamplingTask, it is inevitable that some eopatches will have fewer available samples than the fixed n_samples, especially when training data are sparse.
I therefore see two solutions here:

  1. Find a way to consolidate FEATURES_SAMPLED and LULC_SAMPLED with a dynamic n_samples that is capped at the number of valid pixels (e.g. n_samples = min(2000, np.count_nonzero(eopatch.mask_timeless['LULC']))). Currently, the way labels_training and features_training are prepared does not allow for a varying length, but I guess finding a solution for this would be relatively trivial.
  2. Gap-fill the rest of the FEATURES_SAMPLED and LULC_SAMPLED values with NaNs instead of repeating the same samples over the entire n_samples length. I don’t know the implications of this solution, but I am assuming it could also easily be dealt with before the model development stage.
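Both options can be sketched roughly as follows. This is an illustration under my own assumptions: sample_capped and pad_with_nan are hypothetical helpers, the arrays are flattened to (n_pixels, n_features) for simplicity, and none of this is existing eo-learn functionality.

```python
import numpy as np


def sample_capped(features, labels, n_samples, rng):
    """Option 1: draw at most as many samples as there are valid
    pixels, without replacement (hypothetical helper)."""
    n = min(n_samples, len(labels))
    idx = rng.choice(len(labels), size=n, replace=False)
    return features[idx], labels[idx]


def pad_with_nan(features, labels, n_samples):
    """Option 2: keep a fixed length of n_samples and fill the
    unused rows with NaN (hypothetical helper)."""
    feat_out = np.full((n_samples,) + features.shape[1:], np.nan, dtype=float)
    lab_out = np.full(n_samples, np.nan, dtype=float)
    n = min(n_samples, len(labels))
    feat_out[:n] = features[:n]
    lab_out[:n] = labels[:n]
    return feat_out, lab_out


rng = np.random.default_rng(0)
features = rng.random((5, 3))        # 5 valid pixels, 3 features each
labels = np.array([1, 1, 2, 3, 2])

f_cap, l_cap = sample_capped(features, labels, 2000, rng)
f_pad, l_pad = pad_with_nan(features, labels, 2000)

# With option 2 the NaN rows must be dropped before model training.
keep = ~np.isnan(l_pad)
f_train, l_train = f_pad[keep], l_pad[keep]
```

With option 1 the per-patch arrays have different lengths, so they would need to be joined with something like np.concatenate rather than stacked into a fixed-shape array. With option 2 the labels become floats because of the NaN padding, so they would need casting back to integers after the NaN rows are removed.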

Interesting stuff in any case. Please keep me (and the community) posted about how you intend to deal with this problem.