S3 inventory files questions


#1

Hey all, I wanted to confirm that my method for cataloging the Sentinel-2 archive is correct. I’m downloading the latest manifest.json and iterating over the CSVs it lists. While iterating over the CSVs, I look for productInfo.json and parse the path attribute/key. Our ingest system can then take that path value and download the information it needs to catalog the product further.

What I’m finding is that the number of rows containing productInfo.json changes from day to day as I iterate over the CSVs. For instance, I found close to 11 million on one day versus 8 million a few days earlier this month.
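For reference, the counting approach described above could be sketched roughly as follows. This is only a minimal sketch, not the actual ingest code: it assumes the inventory is in gzipped CSV format, that the manifest has the usual `fileSchema` and `files` entries, and that `fetch` is a hypothetical callable you supply (for example a thin wrapper around an S3 download) that returns the raw bytes for an inventory file key. It matches on the exact file name rather than a substring of the row, and counts distinct keys.

```python
import csv
import gzip
import io
import json
from urllib.parse import unquote


def key_column_index(manifest):
    """Find the position of the Key column from the manifest's fileSchema."""
    columns = [name.strip() for name in manifest["fileSchema"].split(",")]
    return columns.index("Key")


def product_info_keys(csv_gz_bytes, key_col):
    """Return the decoded keys in one gzipped inventory CSV that name a productInfo.json."""
    found = set()
    with gzip.open(io.BytesIO(csv_gz_bytes), "rt", encoding="utf-8", newline="") as fh:
        for row in csv.reader(fh):
            key = unquote(row[key_col])  # keys in CSV inventories are URL-encoded
            if key.rsplit("/", 1)[-1] == "productInfo.json":
                found.add(key)
    return found


def count_product_info(manifest_bytes, fetch):
    """Count distinct productInfo.json keys across all files listed in a manifest.

    `fetch` is a hypothetical callable: inventory file key -> gzipped CSV bytes.
    """
    manifest = json.loads(manifest_bytes)
    key_col = key_column_index(manifest)
    seen = set()
    for entry in manifest["files"]:
        seen |= product_info_keys(fetch(entry["key"]), key_col)
    return len(seen)
```

Collecting keys in a set means a key that appears twice is only counted once, at the cost of holding every matching key in memory.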


#2

Hi Brett, this sounds like the correct approach. From the docs, you can see the following about eventual consistency:

All of your objects might not appear in each inventory list. The inventory list provides eventual consistency for PUTs of both new objects and overwrites, and DELETEs. Inventory lists are a rolling snapshot of bucket items, which are eventually consistent (that is, the list might not include recently added or deleted objects).

To validate the state of the object before you take action on the object, we recommend that you perform a HEAD Object REST API request to retrieve metadata for the object, or check the object's properties in the Amazon S3 console. You can also check object metadata with the AWS CLI or the AWS SDKs. For more information, see HEAD Object in the Amazon Simple Storage Service API Reference.

But I don’t think that would account for the big difference you see, especially if both of the dates you looked at were in the past. Can you tell me which dates you looked at when you saw such a big difference?
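The HEAD Object check recommended in the quoted docs might look something like the sketch below. It assumes `s3_client` is a boto3 S3 client (or anything exposing a compatible `head_object` method) and that a missing object surfaces as an exception carrying a botocore-style `response` dict; the function name `object_exists` is just illustrative. Extra keyword arguments such as `RequestPayer="requester"` are passed straight through.

```python
def object_exists(s3_client, bucket, key, **extra):
    """Return True if a HEAD request for the object succeeds, False if it is missing.

    Any error other than "not found" (for example access denied) is re-raised.
    """
    try:
        s3_client.head_object(Bucket=bucket, Key=key, **extra)
        return True
    except Exception as exc:
        # botocore's ClientError exposes the service error under .response
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        if code in ("404", "NoSuchKey", "NotFound"):
            return False
        raise
```

With boto3 installed this would be called as `object_exists(boto3.client("s3"), bucket, key)` before acting on a key taken from the inventory.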


#3

It looks like 9/17 vs 9/27, with the latter containing more productInfo.json entries in the CSV rows.


#4

Would you be able to look at your code again? I just wrote a script to do the same thing as you, and below are the numbers of matches I’m seeing for productInfo.json in the inventory files. My code could be flawed, but it gives me a predictable day-to-day increase, which is what I’d expect to see.

manifest_2018-09-16.json has 90 files.
Number of matches: 696626

manifest_2018-09-17.json has 86 files.
Number of matches: 698808

manifest_2018-09-26.json has 85 files.
Number of matches: 715432

manifest_2018-09-27.json has 85 files.
Number of matches: 718268
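As a quick sanity check on those numbers, the average increase per day between consecutive observations can be computed like this. It is only a small sketch using the four counts listed above; the helper name `daily_growth` is made up for illustration.

```python
from datetime import date

# Match counts copied from the manifests listed above
counts = {
    date(2018, 9, 16): 696626,
    date(2018, 9, 17): 698808,
    date(2018, 9, 26): 715432,
    date(2018, 9, 27): 718268,
}


def daily_growth(counts):
    """Return (start, end, average increase per day) for consecutive observations."""
    days = sorted(counts)
    return [
        (a, b, (counts[b] - counts[a]) / (b - a).days)
        for a, b in zip(days, days[1:])
    ]


for start, end, per_day in daily_growth(counts):
    print(f"{start} -> {end}: {per_day:.0f} per day")
```

A steady positive value in every interval is the pattern to expect from an archive that only grows.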

#5

@jflasher I’m finding that each manifest has 211 files, consistently. Perhaps I’m using the wrong manifest?

For example I’m using the following manifest:
s3://sentinel-inventory/sentinel-s2-l1c/sentinel-s2-l1c-inventory/2018-09-27T08-00Z/manifest.json


#6

@jflasher I just realized you were using the sentinel-s2-l2a manifest rather than sentinel-s2-l1c. Can you confirm that your findings hold for the sentinel-s2-l1c bucket as well?

I confirmed with my code that the sentinel-s2-l2a bucket on 9/17 yields 86 inventory files with a total of 698902 rows containing productInfo.json, so I’m seeing similar results to yours. Sometimes the result differs by +/- 5 productInfo.json matches when I re-run the same code on the same day’s manifest.


#7

After iterating over s3://sentinel-inventory/sentinel-s2-l1c/sentinel-s2-l1c-inventory/2018-09-17T08-00Z/manifest.json, I get a total of 11692584 rows containing productInfo.json across the inventory files listed in that manifest.

Can you confirm you get similar results? The count seems to differ on each run I do, but only by a small percentage.

I also used s3://sentinel-inventory/sentinel-s2-l1c/sentinel-s2-l1c-inventory/2018-09-27T08-00Z/manifest.json
as a test case and got 11904869 rows containing productInfo.json across its inventory files. Can you confirm you get similar results for this one as well? Thanks.

I think I know why I got such varying results on Sept 17th vs Sept 27th. My AWS queue had a retention time of 4 days instead of the 14 I expected, so I was losing messages, which affected the counts.
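In case it helps anyone else, here is a rough sketch of checking and raising the retention period. It assumes the queue is an Amazon SQS queue and that `sqs_client` is a boto3 SQS client (or anything with compatible `get_queue_attributes` and `set_queue_attributes` methods); `ensure_retention` is just an illustrative name. SQS expresses `MessageRetentionPeriod` in seconds, so 4 days is 345600 and 14 days is 1209600.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def ensure_retention(sqs_client, queue_url, days=14):
    """Raise the queue's message retention to `days` if it is currently shorter.

    Returns the retention period in seconds that is in effect after the call.
    """
    wanted = days * SECONDS_PER_DAY
    attrs = sqs_client.get_queue_attributes(
        QueueUrl=queue_url, AttributeNames=["MessageRetentionPeriod"]
    )
    current = int(attrs["Attributes"]["MessageRetentionPeriod"])
    if current < wanted:
        # SQS attribute values are passed as strings
        sqs_client.set_queue_attributes(
            QueueUrl=queue_url, Attributes={"MessageRetentionPeriod": str(wanted)}
        )
        return wanted
    return current
```

Messages that have already expired are not recovered by raising the limit, so the affected counts would still need to be recomputed.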


#8

Yeah, sorry for using the L2A to get the numbers initially. I just reran my script with the manifests you posted and got similar numbers.

As an aside, this is a lot of products!


#9

It is a lot of products; it has been a challenge cataloging it all!