LSOPS10-BP02 Store data in a format that works both for archiving and for active use by retaining related metadata
Select a data format that can be queried while the project is ongoing and archived as-is, without conversion. The Apache Iceberg table format is well suited to life sciences projects because it stores metadata alongside the data. It offers data versioning, time travel, and schema evolution, which help maintain data lineage and regulatory adherence for large-scale scientific datasets that change frequently over time.
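As a rough illustration of the versioning, time travel, and schema evolution features mentioned above, the following sketch builds the kind of Spark SQL statements that are commonly used with Iceberg tables. It only constructs the statement text and does not connect to any engine. The catalog, database, table, and column names are hypothetical, and the exact syntax available depends on the query engine and its version.

```python
# Sketch only: builds Spark SQL strings for Iceberg features; it does not run them.
# The table and column names below are hypothetical.


def snapshot_history_query(table: str) -> str:
    """Return a query over the Iceberg snapshots metadata table, which lists table versions."""
    return f"SELECT snapshot_id, committed_at, operation FROM {table}.snapshots"


def time_travel_query(table: str, as_of: str) -> str:
    """Return a query that reads the table as it existed at a given timestamp."""
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{as_of}'"


def add_column_statement(table: str, column: str, column_type: str) -> str:
    """Return a schema evolution statement that adds a column to the table."""
    return f"ALTER TABLE {table} ADD COLUMNS ({column} {column_type})"


TABLE = "glue_catalog.lab.assay_results"  # hypothetical catalog.database.table

statements = [
    snapshot_history_query(TABLE),
    time_travel_query(TABLE, "2024-01-01 00:00:00"),
    add_column_statement(TABLE, "instrument_id", "string"),
]

for statement in statements:
    print(statement)
```

Because snapshots and schema history live in the table's own metadata, an archived copy of the table can still answer questions about earlier versions of the data.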
Desired outcome: Have a portable dataset that contains the data and metadata in a single package.
Level of risk exposed if this best practice is not established: Low
Implementation guidance
Verify that data processing writes its output in portable data formats so that the results can be archived without conversion.
Implementation steps
- Build standard data pipelines using AWS Glue.
- For deep analysis on structured data, use Amazon Redshift.
- Build a data lake of the data in Amazon S3 in Iceberg format.
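The pipeline and data lake steps above can be sketched as configuration. The example below assembles the Spark properties typically used to register an Iceberg catalog backed by the AWS Glue Data Catalog and Amazon S3, plus a table definition. It is a minimal sketch under stated assumptions, not a complete pipeline: the bucket, catalog, database, table, and column names are hypothetical, and the Redshift analysis step is not covered.

```python
# Sketch only: assembles the Spark settings and DDL an Iceberg-on-S3 pipeline might use.
# Bucket, catalog, database, table, and column names are hypothetical.


def iceberg_spark_conf(catalog: str, warehouse: str) -> dict:
    """Return Spark properties that register an Iceberg catalog backed by AWS Glue and S3."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.warehouse": warehouse,
    }


def create_table_ddl(table: str, columns: list, partition_by: str) -> str:
    """Return a CREATE TABLE statement for an Iceberg table with one partition expression."""
    column_list = ", ".join(f"{name} {column_type}" for name, column_type in columns)
    return (
        f"CREATE TABLE IF NOT EXISTS {table} ({column_list}) "
        f"USING iceberg PARTITIONED BY ({partition_by})"
    )


conf = iceberg_spark_conf("glue_catalog", "s3://example-bucket/warehouse/")
ddl = create_table_ddl(
    "glue_catalog.lab.assay_results",
    [("sample_id", "string"), ("collected_at", "timestamp"), ("result", "double")],
    "days(collected_at)",
)

for key, value in conf.items():
    print(f"{key}={value}")
print(ddl)
```

In an AWS Glue Spark job, properties like these would usually be supplied as job configuration and the DDL run through the Spark session. Check the AWS Glue documentation for how your Glue version enables Iceberg support.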
Resources
Related documents:
Related examples:
Related tools: