Data Cube and file system

tdrivas · May 7, 2021, 9:17am

Good morning,

as we are trying to set up a data cube based on the Open Data Cube software, I’d like to ask which file system you store your raster data in, so that the load process is as efficient as possible.
Are you

ingesting the files in a geo-db?
parking the files in a simple NFS?
parking the files in a distributed file system such as HDFS?
parking the files in buckets?

Thanks in advance!

gmilcinski · May 7, 2021, 9:08pm

Hi @tdrivas,
note that we are not working with Open Data Cube software.
I suggest you contact them directly:
https://www.opendatacube.org/contacts
Best,
Grega

tdrivas · May 9, 2021, 3:12pm

Dear Grega,
thanks for the reply. This was a higher-level question, not tied to the software used in any particular scenario. It would therefore be quite interesting to know how file storage is architected in the EDC.
Thanks,
Thanassis

gmilcinski · May 9, 2021, 3:35pm

As a general rule, we try not to replicate any data that is already stored in the cloud and is feasible for cloud-native processing.
E.g. core mission data (Sentinel-1,2,3,…) is stored in original formats. We make use of COG-ified Sentinel-1 products (internal tiling, index, etc.); the rest is not changed at all.
In general, we have found Cloud Optimized GeoTIFF, JP2, Zarr and HDF5 to be the most feasible data formats to work with.
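To illustrate why internal tiling matters for this kind of storage, here is a minimal Python sketch. It assumes a raster split into square 512-pixel internal tiles (a common COG block size, but only an assumption here) and works out which tiles a reader has to fetch for a given pixel window. The function names and the example dimensions are hypothetical and not taken from any EDC code.

```python
from math import ceil


def tiles_for_window(col_off, row_off, width, height, tile_size=512):
    """Return (tile_row, tile_col) indices of the internal tiles that
    overlap a pixel window. tile_size=512 is an assumed block size."""
    if width <= 0 or height <= 0:
        raise ValueError("window width and height must be positive")
    first_col = col_off // tile_size
    last_col = (col_off + width - 1) // tile_size
    first_row = row_off // tile_size
    last_row = (row_off + height - 1) // tile_size
    return [
        (r, c)
        for r in range(first_row, last_row + 1)
        for c in range(first_col, last_col + 1)
    ]


def fraction_of_tiles(image_width, image_height, tiles, tile_size=512):
    """Share of the raster's internal tiles that the window touches."""
    total = ceil(image_width / tile_size) * ceil(image_height / tile_size)
    return len(tiles) / total


# Hypothetical request: a 1000 x 1000 pixel window from a
# 10980 x 10980 pixel raster (the size of a Sentinel-2 10 m band).
tiles = tiles_for_window(col_off=2000, row_off=3000, width=1000, height=1000)
share = fraction_of_tiles(10980, 10980, tiles)
print(f"{len(tiles)} tiles needed, {share:.2%} of all tiles")
```

Only the tiles that overlap the window need to be requested, which is why tiled formats can be read in place without first copying whole files.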

Depending on what the follow-up processes require, it might sometimes be useful to pre-process the data into e.g. xcube or eo-patch, but these are typically only stored for the duration of the analysis.

Vector data are stored in geoDB, which is essentially a cloud-hosted PostgreSQL/PostGIS database.

In terms of cloud storage, object storage like S3 or Swift works best for scalability, based on our experience.

Valeri.ba · August 24, 2023, 9:35am

Hey,

I can totally relate to your question, as I’ve been diving into the world of data cubes and file systems lately. It’s always a bit of a puzzle to figure out the most efficient setup. For me, when dealing with Open Data Cube software, I’ve found that a hybrid approach works wonders.

Here’s what I’ve been up to: I store my core mission data, like Sentinel-1, in its original formats, leveraging Cloud Optimized GeoTIFF, JP2, Zarr, and HDF5. If the data is already in the cloud, why duplicate it, right?

For post-processing, I sometimes find it handy to temporarily park data in a distributed file system, like HDFS. But remember, that’s just my preference for my workflows. Your mileage may vary!

And a quick shoutout to tdrivas: Great question! Based on my experience, the choice of file system can significantly impact performance. Considering your use case and the type of analysis you’re running can guide your decision. Don’t hesitate to experiment a bit and find that sweet spot.