S2CLOUDLESS: get_cloud_masks is very slow on large bbox

Hi !

I’m still working on my script for temporal analysis, but I’m stuck because the .get_cloud_masks call is very slow on a large area.

I use BBoxSplitter to split my very large area into bounding boxes under 5000 pixels in width or height:

largeur = int(np.ceil((Xmax_utm - Xmin_utm)/10)) # image width in pixels (10 m resolution)
hauteur = int(np.ceil((Ymax_utm - Ymin_utm)/10)) # image height in pixels (10 m resolution)
print('\nArea pixel size: {0} x {1}'.format(largeur, hauteur))

>>>Area pixel size: 9968 x 7245

L = H = 1 # default: a single cell in each direction
if largeur > 5000 or hauteur > 5000: # if the width or height exceeds 5000 pixels
    if largeur > 5000:
        L = int(np.ceil(largeur/5000))
        print('%s cells wide' % L)
    if hauteur > 5000:
        H = int(np.ceil(hauteur/5000))
        print('%s cells high' % H)

>>>2 cells wide
>>>2 cells high
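The splitting logic above can be wrapped in a small helper (`split_shape` is a hypothetical name, not part of sentinelhub) that returns the number of grid cells in each direction; the result is the kind of (columns, rows) tuple that BBoxSplitter expects as its split shape:

```python
import math

def split_shape(width_px, height_px, max_size=5000):
    """Number of grid cells (cols, rows) so each cell stays under max_size pixels."""
    cols = math.ceil(width_px / max_size)
    rows = math.ceil(height_px / max_size)
    return cols, rows

# The 9968 x 7245 px area from above splits into a 2 x 2 grid:
print(split_shape(9968, 7245))  # → (2, 2)
```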

Here is an illustration: [image omitted]

I’m testing it on only 3 dates and it’s already slow, so I can’t imagine what 3 years would take…
Do you have any idea how to speed it up?

Hi Timothee,

we usually run cloud detection at lower resolution. We found that running cloud detection at 160 m x 160 m resolution gives good results. Of course, the post-processing parameters need to be adjusted accordingly; we usually set average_over=2 and dilation_size=1. If you do this you should see a speed-up of a factor of 256.
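The factor of 256 follows from how pixel count scales with resolution: going from 10 m to 160 m shrinks both image dimensions by 16, so there are 16 × 16 = 256 times fewer pixels to classify. A quick sanity check:

```python
def speedup(old_res_m, new_res_m):
    """Approximate speed-up from coarsening the working resolution:
    pixel count (and hence classification work) scales with the
    square of the resolution ratio."""
    return (new_res_m / old_res_m) ** 2

print(speedup(10, 160))  # → 256.0
print(speedup(10, 20))   # → 4.0
print(speedup(10, 40))   # → 16.0
```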


Ok, maybe that’s a good option. But I’m a little confused, because I want the cloud percentage over agricultural parcels, so 160 m resolution might be too coarse… That’s why I have used 10 m resolution until now.
I’m still going to try this solution.

Cloud detection is in any case a somewhat statistical exercise and is not pixel-accurate. It is perhaps worth exploring several options, e.g. 20 m, 40 m, 80 m and 160 m, to see which one gives the best “price/performance” result. E.g. 20 m will be 4 times as fast, 40 m 16 times as fast…

Thanks for all this information. I’ll try to find the best option.

s2cloudless uses a pretrained random forest model for cloud classification. In the background all of this is handled by the lightgbm package, which is highly optimized for performance (speed and memory). By default it uses all processor cores available on your computer, and it can even run on a GPU.

Therefore, one way to improve speed would be to run your code on a machine with more cores.

Yes, it’ll run on a JupyterHub on a Google server, so we can adjust the CPU power and number of cores. I’ll have to discuss that with my dev team. But does s2cloudless handle multi-threading?

Yes, s2cloudless always uses multiple processors and threads, because lightgbm works that way by default.
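Since lightgbm parallelizes prediction with OpenMP, one common way to cap the number of threads on a shared machine (e.g. a JupyterHub) is to set the OMP_NUM_THREADS environment variable before the libraries are imported. This is a general OpenMP mechanism, not an s2cloudless-specific setting; a minimal sketch:

```python
import os

# OpenMP-based libraries such as lightgbm pick this up,
# so set it before importing s2cloudless / lightgbm.
os.environ["OMP_NUM_THREADS"] = "4"

# Cores available on the machine (lightgbm uses all of them by default):
print(os.cpu_count())
```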

When running cloud detection at lower resolution don’t forget to adjust the post-processing parameters (average_over and dilation_size). At 10m resolution the values that work best are 22 and 11, respectively.

The recommended values are roughly:

Resolution [m]   average_over   dilation_size
10               22             11
20               11             6
40               6              3
80               3              2
160              2              1
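The values in the table scale inversely with resolution from the 10 m baseline of 22 and 11. A small sketch that reproduces them (the rounding rule here is inferred from the table, not taken from s2cloudless itself):

```python
import math

def scaled_params(res_m, base_res_m=10, base_average_over=22, base_dilation_size=11):
    """Scale the 10 m post-processing parameters to a coarser resolution.

    The parameters are in pixels, so halving the pixel count per metre
    roughly halves the window sizes; round up and keep a minimum of 1.
    """
    factor = base_res_m / res_m
    average_over = max(1, math.ceil(base_average_over * factor))
    dilation_size = max(1, math.ceil(base_dilation_size * factor))
    return average_over, dilation_size

for res in (10, 20, 40, 80, 160):
    print(res, scaled_params(res))
```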

For my use case I need to run this at 1 m resolution. Can you recommend any values for that?

Why would you run it at 1 m, if the resolution of Sentinel data is 10 m? You will not get any better results, yet you will use 100 times more compute resources…
(or are you using some other datasource?)

At 10 m the images are way too pixelated…
I am making a Sentinel Hub WCS request with resx and resy of 1 m and retrieving all band data.
Part of this activity is cloud masking.
The bounding box will always be smaller than a zoom level 14 tile.

They might be pixelated, but these are the original data. When making requests with resx/resy = 1 m, you get the 10 m resolution interpolated to 1 m. This is useful in many ways, but you should not assume that the actual resolution is 1 m…

Okay… Thanks… Will keep that in mind…

Is there any way of converting a 10 m image to a 1 m image afterwards? I need it to display the true-color imagery.
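For display purposes a 10 m image can simply be upsampled client-side; the crudest option is nearest-neighbour repetition with numpy (smoother results come from bilinear/bicubic interpolation, e.g. via scipy or PIL). A minimal numpy-only sketch, with an illustrative `upsample_nearest` helper (the image shape is made up for the example):

```python
import numpy as np

def upsample_nearest(img, factor=10):
    """Nearest-neighbour upsampling: repeat each pixel factor x factor times.

    No new information is created; a 10 m pixel just becomes a factor x factor
    block of identical "1 m" pixels.
    """
    return img.repeat(factor, axis=0).repeat(factor, axis=1)

rgb = np.random.rand(100, 120, 3)       # a small 10 m true-color tile (H, W, bands)
big = upsample_nearest(rgb, factor=10)  # "1 m" pixels, same information content
print(big.shape)  # → (1000, 1200, 3)
```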