
Thoughts on building a FROST-compatible tabular dataset? #11

@lucien-sim

I've been following your ideas regarding FROST and, like you, am really excited about the prospect of having stand-alone, version-controlled, snapshotted ARCO datasets accessible to anyone via a single URL and a hash. That would be incredible for science, but also for anyone else interested in exploring and understanding reality through data. It's a fantastic idea.

However, I was thinking about it and realized I don't know how one would actually build and maintain most datasets like this in practice. I looked around and didn't find much guidance. There's some info from Icechunk/Earthmover regarding gridded data, but that only applies to a small subset of applications. I couldn't find anything regarding tabular or more traditional time series datasets. So I figured I'd learn by doing and then write up the process as an example for others.

I think a decent test case would be the USGS Water Services Instantaneous Values dataset (https://waterservices.usgs.gov/docs/instantaneous-values/), which provides long histories of water flow-related variables at several thousand points. The dataset is big enough and inconvenient enough to access that an ARCO version would be helpful, but not so big that an individual couldn't host and manage it alone. Some googling also didn't turn up evidence that an ARCO version already exists. Unfortunately, I ran into difficulty almost immediately, so I figured I'd reach out to ask for guidance.
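For concreteness, here's roughly what a pull from that service looks like (a minimal sketch; the site and parameter code are just examples, and the JSON path follows the WaterML-JSON layout as I understand it):

```python
import requests

# Example pull: one day of discharge (USGS parameter 00060) for one site.
resp = requests.get(
    "https://waterservices.usgs.gov/nwis/iv/",
    params={
        "format": "json",
        "sites": "01646500",     # Potomac River near Washington, DC (example site)
        "parameterCd": "00060",  # discharge, cubic feet per second
        "period": "P1D",         # the most recent day
    },
    timeout=30,
)
resp.raise_for_status()
series = resp.json()["value"]["timeSeries"]  # one entry per site/parameter pair
```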

My first question is about the target data format. The data is natively tabular, so Iceberg seems like the right choice. But when I read through the docs, I learned that Iceberg needs a catalog hosted on a server to manage concurrent access. Having to maintain a server in addition to a storage bucket sounds contrary to the idea behind FROST, and also seems like a lot to ask of your average small organization or research group. Am I missing something? Is there a way to configure Iceberg so everything needed for concurrency and ACID compliance is saved in the bucket, similar to Icechunk? If so, the path to that configuration wasn't easy to find.
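To make the catalog issue concrete, here's roughly what reader-side access looks like with pyiceberg (a sketch; the catalog properties, bucket, and table name are all hypothetical). Even the "serverless" SQL catalog below just moves the problem: the catalog state lives in a SQLite file outside the bucket, so it can't be distributed as a single URL plus a hash:

```python
from pyiceberg.catalog import load_catalog

# Hypothetical config: a SQL catalog backed by a local SQLite file.
# This avoids running a catalog *server*, but the catalog state still
# sits outside the bucket, so concurrent writers can't share it safely.
catalog = load_catalog(
    "frost-demo",
    **{
        "type": "sql",
        "uri": "sqlite:///catalog.db",
        "warehouse": "s3://example-bucket/warehouse",  # hypothetical bucket
    },
)

table = catalog.load_table("usgs.instantaneous_values")  # hypothetical table
df = table.scan(row_filter="site_id == '01646500'").to_pandas()
```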

I could also force the data into 2D arrays and manage it with Icechunk. This is appealing because I wouldn't need a server, the API is great, and it's familiar ground for me (I have used zarr extensively and played around with Icechunk). However, when I thought more concretely about the workflow, I realized Icechunk doesn't fit super well either. The USGS dataset expands very gradually along the time dimension, so the time chunk size would need to be very small for expansions to always occur through the addition of new chunks. It's also possible that new flow points will be added, which would complicate things if the chunk size is greater than 1 along the point dimension. In addition, I would ideally update the dataset by querying and applying incremental changes to values for specific time/point pairs (the API has an endpoint for this, and it's the most efficient way of keeping the entire dataset up-to-date). But I don't think Icechunk makes that easy, because you can't update chunks without rewriting them completely. None of this seems ideal.
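For what it's worth, here's the shape of the Icechunk workflow I was picturing, which also shows both pain points (a sketch against the icechunk/zarr-v3 Python APIs as I understand them; the path and array sizes are made up):

```python
import numpy as np
import zarr
import icechunk

# Hypothetical local repo; a real deployment would point at object storage.
repo = icechunk.Repository.create(icechunk.local_filesystem_storage("/tmp/usgs-demo"))

# discharge[time, point], chunked (1, n_points) so time-appends add whole chunks.
session = repo.writable_session("main")
root = zarr.group(store=session.store)
discharge = root.create_array("discharge", shape=(0, 5000), chunks=(1, 5000), dtype="f4")
session.commit("create empty discharge array")

# Appending one time step only writes new chunks -- this part works fine.
session = repo.writable_session("main")
discharge = zarr.open_group(store=session.store)["discharge"]
discharge.resize((discharge.shape[0] + 1, 5000))
discharge[-1, :] = np.zeros(5000, dtype="f4")
session.commit("append one time step")

# But applying one retroactive correction rewrites the value's entire chunk.
session = repo.writable_session("main")
discharge = zarr.open_group(store=session.store)["discharge"]
discharge[0, 42] = 1.23  # re-serializes the whole (1, 5000) chunk
session.commit("apply one incremental correction")
```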

Do you have any thoughts? I don't know how to address these issues, but would like to figure it out because it would be helpful to the many people who might try to host tabular ARCO datasets on a FROST-like network in the future.
