S2CLOUDLESS: get_cloud_masks is very slow on large bbox


#1

Hi !

I’m still working on my script for temporal analysis, but I’m stuck because of the slowness of the .get_cloud_masks command on a large area.

I use BBoxSplitter to split my very large area so that each bbox is under 5000 pixels in width or height:

import numpy as np

# Xmin_utm, Ymin_utm, Xmax_utm, Ymax_utm are the UTM coordinates of the area's corners
largeur = int(np.ceil((Xmax_utm - Xmin_utm)/10))  # image width in pixels (10 m resolution)
hauteur = int(np.ceil((Ymax_utm - Ymin_utm)/10))  # image height in pixels (10 m resolution)
print('\nArea pixel size: {0} x {1}'.format(largeur, hauteur))

>>>Area pixel size: 9968 x 7245

if largeur > 5000 or hauteur > 5000:  # if the width or the height exceeds 5000 pixels
    if largeur > 5000:
        L = int(np.ceil(largeur/5000))
        print('%s cells wide' % L)
    else:
        L = 1
    if hauteur > 5000:
        H = int(np.ceil(hauteur/5000))
        print('%s cells high' % H)
    else:
        H = 1

>>>2 cells wide
>>>2 cells high
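
I then pass this split shape to sentinelhub's BBoxSplitter, roughly like this (simplified sketch; the area geometry and the CRS below are placeholders):

from sentinelhub import BBoxSplitter, CRS

# area_geometry is a placeholder for the shapely geometry of the full area
bbox_splitter = BBoxSplitter(
    [area_geometry],   # list of shapely geometries covering the area
    CRS.UTM_31N,       # placeholder; use the UTM zone of your area
    (L, H)             # split into L columns x H rows, here (2, 2)
)
bbox_list = bbox_splitter.get_bbox_list()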

Here is an illustration:

I’m testing it on only 3 dates and it’s already slow, so I can’t imagine running it on 3 years of data…
Do you have an idea how to speed it up?


#2

Hi Timothee,

We usually run cloud detection at a lower resolution. We have found that running it at 160 m x 160 m resolution gives good results. Of course, the post-processing parameters need to be adjusted accordingly; we usually set them to average_over=2 and dilation_size=1. If you do this you should see a speed-up of roughly a factor of 256.
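
For example, something along these lines (a minimal sketch; get_bands_at_160m is a placeholder for however you fetch the band stack at the coarser resolution, e.g. a WCS request with resx='160m' and resy='160m'):

from s2cloudless import S2PixelCloudDetector

# post-processing parameters adjusted for 160 m resolution
cloud_detector = S2PixelCloudDetector(
    threshold=0.4,     # default probability threshold
    average_over=2,
    dilation_size=1,
    all_bands=False    # the detector expects the 10 bands the model was trained on
)

bands_160m = get_bands_at_160m()   # placeholder, shape: (n_dates, height, width, n_bands)
cloud_masks = cloud_detector.get_cloud_masks(bands_160m)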


#3

OK, maybe it’s a good option. But I’m a little confused, because I want the cloud percentage on agricultural parcels, so 160 m resolution might be too coarse… That’s why I have used 10 m resolution until now.
I’m still going to try this solution.


#4

Cloud detection is in any case a somewhat statistical exercise and is not accurate down to the pixel. It is perhaps worth exploring several options, e.g. 20 m, 40 m, 80 m and 160 m, to see which one produces the best “price/performance” result. The speed-up scales with the square of the resolution ratio, so 20 m will be about 4 times as fast, 40 m about 16 times as fast, and so on (see the quick check below).
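
As a quick back-of-the-envelope check (just the arithmetic from the paragraph above, not an API call):

# the number of pixels, and thus the work, shrinks with the square of the resolution ratio
for res in (20, 40, 80, 160):
    print('{0} m -> roughly {1}x faster than 10 m'.format(res, (res // 10) ** 2))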


#5

Thanks for all this information. I’ll try to find the best option.


#6

s2cloudless uses a pretrained gradient boosting model for cloud classification. In the background this is all handled by the lightgbm package, which is highly optimized for performance (speed and memory). By default it uses all of the processing cores available on your machine, and it can even run on a GPU.

Therefore, one way to improve speed would be to run your code on a machine with more processor cores.


#7

Yes, it will run on JupyterHub on a Google server, so we can adjust the CPU power and the number of cores. I’ll have to check that with my dev team. But does s2cloudless handle multi-threading?


#8

Yes, s2cloudless always uses multiple processor cores and multiple threads, because that is how lightgbm works by default.
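
If you want to check how many cores are available, or cap the thread count, here is a small sketch (it relies on LightGBM’s usual OpenMP behaviour; the value 8 below is just an example):

import multiprocessing
import os

print('Cores available:', multiprocessing.cpu_count())

# lightgbm parallelizes with OpenMP, so setting this environment variable
# before lightgbm is imported limits (or raises) the number of threads it uses
os.environ['OMP_NUM_THREADS'] = '8'  # example value only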


#9

When running cloud detection at a lower resolution, don’t forget to adjust the post-processing parameters (average_over and dilation_size). At 10 m resolution the values that work best are 22 and 11, respectively.

The recommended values are roughly:

Resolution [m]    average_over    dilation_size
10                22              11
20                11              6
40                6               3
80                3               2
160               2               1
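
For reference, one way to wire these values up (a minimal sketch; the keyword names match S2PixelCloudDetector, and the threshold is just the default):

from s2cloudless import S2PixelCloudDetector

# recommended post-processing parameters per resolution (from the table above)
POSTPROC_PARAMS = {
    10:  {'average_over': 22, 'dilation_size': 11},
    20:  {'average_over': 11, 'dilation_size': 6},
    40:  {'average_over': 6,  'dilation_size': 3},
    80:  {'average_over': 3,  'dilation_size': 2},
    160: {'average_over': 2,  'dilation_size': 1},
}

resolution = 40  # metres; pick the best speed/accuracy trade-off for your use case
detector = S2PixelCloudDetector(threshold=0.4, **POSTPROC_PARAMS[resolution])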