SI_LULC_pipeline Questions

Hello! I am working on this example and I have a question. I noticed that the blog post says you do a spatial sampling of 40,000 pixels. When I look in the notebook, however, it says:

n_samples = 125000 # half of pixels

So is the sample size 125,000 pixels? Also what is meant by the half of pixels comment? It was my understanding that each eopatch has 1,000,000 pixels per patch. Sorry if I may have overlooked this somewhere.

Hi @ncouch!

Welcome to the SH forum, I’m glad you reached out with your question.

It is indeed confusing, but the fact is that it’s been a while since the blogpost was written and it seems that a few things were updated in the meantime. Additionally, the blogpost talks about the countrywide application, but in the notebook we use only 6 patches, so we can afford to make a larger selection.

The sample size is aimed at 125,000, as you pointed out (it’s better to trust the notebook). The patches in the notebook are 5 km x 5 km, which at 10 m resolution means 500 x 500 = 250,000 px, and half of that goes to sampling. In the end, the sample size per eopatch is probably smaller than that, because a lot of the pixels come from the no_data regions, and these are filtered out.
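For anyone who wants to check the numbers, here is a quick sketch of the arithmetic. It assumes a 10 m pixel size, which is what a 5 km patch with 250,000 px implies; the variable names other than n_samples are just for illustration and are not taken from the notebook.

```python
# Rough check of the sample size; the 10 m pixel size is an assumption.
patch_size_m = 5_000   # 5 km patch edge, as stated in the reply
resolution_m = 10      # assumed pixel size in metres

pixels_per_side = patch_size_m // resolution_m   # pixels along one edge
pixels_per_patch = pixels_per_side ** 2          # total pixels in one patch
n_samples = pixels_per_patch // 2                # "half of pixels"

print(pixels_per_side, pixels_per_patch, n_samples)
```

The number of samples that actually survive per eopatch can be lower, since pixels in no_data regions are filtered out afterwards.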

I created a ticket here to update and sync the blogs and the notebook.

Thanks for raising this!

Regards,
Matic

I did have another question. I am testing different classifiers to see how different methods improve or degrade the results. I noticed that even after the sampling, filtering and interpolation there are quite a few NaN values left in the dataset. I saw another forum post that said that NaN values at the beginning or the end of a row cannot be interpolated. For classifiers that are less forgiving of NaN values than LightGBM, would the best approach be to just crop these NaN values out of the dataset?

I guess you can’t really just crop them out, as LightGBM expects a specific shape of the input array (unless the values are NaN across the full temporal space, in which case it’s best to drop those samples).

What you could do is impute those NaNs after interpolation in some other way, e.g. replace them with the last valid value or something similar.
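A minimal sketch of both ideas, assuming the features for one band have been reshaped into a 2D float array of shape (n_pixels, n_times). The function names drop_all_nan_rows and fill_edge_nans are hypothetical helpers written for this post, not part of eo-learn or the notebook. Trailing NaNs get the last valid value, leading NaNs get the first valid value.

```python
import numpy as np


def drop_all_nan_rows(features, labels):
    """Drop samples whose values are NaN across the whole time axis."""
    keep = ~np.isnan(features).all(axis=1)
    return features[keep], labels[keep]


def fill_edge_nans(features):
    """Forward-fill, then backward-fill NaNs along the time axis (axis 1)."""
    n_pixels, n_times = features.shape
    rows = np.arange(n_pixels)[:, None]
    times = np.arange(n_times)

    # Forward fill: for each time step, index of the last valid value so far.
    valid = ~np.isnan(features)
    idx = np.where(valid, times, 0)
    idx = np.maximum.accumulate(idx, axis=1)
    filled = features[rows, idx]

    # Backward fill: leading NaNs take the index of the next valid value.
    valid = ~np.isnan(filled)
    idx = np.where(valid, times, n_times - 1)
    idx = np.minimum.accumulate(idx[:, ::-1], axis=1)[:, ::-1]
    return filled[rows, idx]


# Toy data: rows are pixels, columns are time steps.
X = np.array([
    [np.nan, 1.0, np.nan, 3.0, np.nan],
    [np.nan, np.nan, np.nan, np.nan, np.nan],
    [2.0, np.nan, np.nan, np.nan, 5.0],
])
y = np.array([0, 1, 2])

X_kept, y_kept = drop_all_nan_rows(X, y)
X_filled = fill_edge_nans(X_kept)
print(X_filled)
```

Rows that are NaN everywhere stay NaN after filling, so they should be dropped first, as above. If the real array has more dimensions (e.g. several bands), the same filling would need to be applied along the time axis for each band.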

Let me know if this helps!

Regards,
Matic

Thank you yes, I am doing some experimentation/testing with the notebook and was confused about a few things. I really appreciate the help. If I have any more questions I will be sure to post back here.