Sentinelhub-py slow downloads due to new rate limiting problem

Since the new sentinelhub-py version was released, I have continued to experience slow download speeds.
After significant investigation, I found that the config parameter “number_of_download_processes”, and the way the library uses it, are to blame.

For a test case downloading several months of NDVI data for an agricultural parcel:
number_of_download_processes = 0, download time = 2-5s
number_of_download_processes = 1, download time = 4-5s
number_of_download_processes = 6, download time = 14s
number_of_download_processes = 30, download time = 66s

This parameter is used in only one place in the library: to calculate the minimum wait time between download requests. The higher the number, the longer the enforced wait. This is completely counter-intuitive, since increasing the number of download processes should reduce download times (up to a point). The actual number of threads used to download requests simultaneously is not set by the library at all; it is chosen automatically by Python’s ThreadPoolExecutor class.
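
To illustrate, the behaviour I observed boils down to something like this (a simplified sketch of the effect, not the library’s actual code):

```python
def enforced_wait(base_wait: float, number_of_download_processes: int) -> float:
    # The parameter only scales the wait between consecutive requests;
    # 0 effectively disables the extra wait, which matches my fastest timing.
    return base_wait * number_of_download_processes
```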

I don’t know whether this feature works as intended for requests over larger areas, but for my use case it is counter-intuitive. On the upside, I can use this config parameter to roughly control the rate at which requests are sent, to match my available processing-unit limits.

Cheers
Sam

Interesting.
This parameter tries to balance the rate limits appropriately across all the processes. E.g. if you have a “60 requests per minute” rate limit and you use one download processor, that processor can make a request every second. If you have 60 processors running, then each of them can only make one request every minute.

That being said, your rate limits are set to 1000 processing units per minute and 2000 requests per minute. So with 30 processors you should still be able to do 66 requests per minute with each of them.
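
To make the arithmetic explicit (a sketch of the balancing idea, not the exact implementation):

```python
def per_process_interval(requests_per_minute: float, num_processes: int) -> float:
    # Minimum seconds each process should wait between its requests so that
    # all processes together stay within the shared rate limit.
    return num_processes / (requests_per_minute / 60.0)

print(per_process_interval(60, 1))     # 1.0  -> one request per second
print(per_process_interval(60, 60))    # 60.0 -> one request per minute each
print(per_process_interval(2000, 30))  # 0.9  -> ~66 requests per minute each
```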

Which instance are you using (just provide the first 3 blocks) so that we can check what is happening in the backend?

CC @maleksandrov @iovsn

As @gmilcinski said, increasing the number_of_download_processes parameter will slow down the download so that you don’t hit the rate limit. If you want to parallelize per thread, use the max_workers parameter, which is passed to ThreadPoolExecutor; when parallelizing this way, the rate-limit information is also shared between threads, so they don’t all separately hit the 429 response. The number_of_download_processes parameter was introduced because it is harder to adjust the rate of issuing requests when you are running the download on multiple processes/computers that use the same Sentinel Hub instance_id.
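
For a setup like that, the parameter is set in the shared config; a minimal sketch (the value 4 is just an example):

```python
from sentinelhub import SHConfig

config = SHConfig()
# Leave this at the default on a single machine. Raise it only when several
# independent processes/machines download with the same instance_id, so each
# of them claims just its share of the shared rate limit.
config.number_of_download_processes = 4
config.save()
```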

There was also another issue where a new OAuth session was created for each request, which slowed down the download process. This is already resolved on sentinelhub-py’s develop branch and will be released in the coming days.
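
Schematically, the fix amounts to fetching the token once and reusing the session, instead of creating a new one per request (an illustration using requests-oauthlib, which backs the library’s auth; treat it as a sketch rather than the actual patch):

```python
from oauthlib.oauth2 import BackendApplicationClient
from requests_oauthlib import OAuth2Session

# Create the OAuth session once and reuse it for every download request,
# instead of fetching a fresh token each time.
client = BackendApplicationClient(client_id="<your-client-id>")
session = OAuth2Session(client=client)
session.fetch_token(
    token_url="https://services.sentinel-hub.com/oauth/token",
    client_id="<your-client-id>",
    client_secret="<your-client-secret>",
)
```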

Ah! That makes sense now!
Although I find the naming confusing - it sounds like it means the number of workers used for downloads on a single machine.
I see the max_workers parameter in the code, but I don’t see a way to set it through the sentinelhub library. It does, however, default to a high enough value that the limiting comes entirely from the rate-limit control mentioned by @iovsn, the download speed itself, and the speed of the back-end.

So if I understand now, I actually want the product of number_of_download_processes and the minimum_wait_time to match my requests-per-second refill if I run something over long periods of time, but I could set it lower if I’m unlikely to run out of the smaller requests quota? In that case, it would make sense for the minimum_wait_time to either be in the config file as well, or be read from the requests bucket properties.

The previous, more complex method seemed fine in theory; it just didn’t work because the response headers only carry the processing-unit quota of one of the buckets, while the code compared both the big and the small bucket quotas to that single number.
I.e. it would think it had an 80k quota, then see 1,999 in the response header, conclude it had used 78,001 since the last request, and respond by returning a very large wait time. It worked fine when I disabled checking the 80k bucket quota, and it would also have worked if that information were provided in the response header and properly processed…
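
Schematically, the mismatch looked like this (illustrative numbers, not the library’s actual code):

```python
# The response header carries the remaining quota of only the small bucket,
# but both bucket totals were compared against that single number.
buckets = {"small_bucket": 2_000, "big_bucket": 80_000}
header_remaining = 1_999  # remaining quota of the small bucket only

for name, total in buckets.items():
    print(name, "apparent usage since last request:", total - header_remaining)
# small_bucket -> 1      (correct)
# big_bucket   -> 78001  (wrong, triggers a very large wait time)
```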

@gmilcinski This work is related to Perceptive Sentinel, so I’m using a different instance id than usual (one for the project). The first three blocks are bb9ba5b3-1d75-4efc. Everything seems to be working as expected now with number_of_download_processes set to 0.
Re your calculation for 30 processors: for the requests I usually make (time series over agricultural parcels), I usually expect a few seconds per “request” (e.g. for 6 months of data), and it is more helpful for each request to be fast than to parallelize my entire request system.

@barrett, we have released sentinelhub-py version 3.0.2, which now handles OAuth credential fetching more efficiently. This should increase download speed. Could you update and check again?

I updated a few weeks ago and haven’t noticed any download speed problems since.
Thanks for the update.
