

When working with data lakes in AWS, it has historically been standard practice to move that data into a warehouse: essentially, you ingest data into S3, manipulate and catalog it, and then load it into Redshift, Snowflake, or some other analytical warehouse of your choice. Recently, however, a newer approach has been growing in popularity: the lakehouse. The approach to building one differs depending on the tool of choice, but ultimately, a lakehouse is about combining your warehouse with your data lake. In many cases, you see tools like Snowflake (with external tables) and Redshift (with Spectrum) try to get closer to the source data to implement this. There is another approach: build a table format directly in S3 and leverage tools such as Athena to perform analysis there. This allows us to maintain the highest flexibility in our data platform by leveraging S3 as our storage layer. In this post, we'll walk through this approach and show how simple and straightforward getting started is.

Before we get into discussing how one goes about building a lakehouse entirely in S3, we should take a moment to discuss what is meant by a table format. Text files contain data in raw format, as incremental points devoid of any guaranteed format, schema, or metadata. For example, a list of json entries is not guaranteed to have any relationship between entries. To process this data, you must review each data point to be able to make guaranteed statements about the whole dataset. Formats in this space are things like csv, json, and xml.
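
To make this concrete, here is a minimal sketch (the records and field names are purely illustrative) of what working with raw, newline-delimited json looks like: the only way to say anything definitive about the dataset is to read every record.

```python
import json

# Illustrative newline-delimited json records; nothing guarantees that
# they share a schema or relate to one another in any way.
raw_lines = [
    '{"user_id": 1, "action": "login"}',
    '{"user_id": 2, "action": "purchase", "amount": 42.5}',
    '{"order_id": "A-17", "status": "shipped"}',
]

# To make a guaranteed statement about the whole dataset (e.g. "every
# record has a user_id"), we have to inspect every single data point.
observed_schemas = set()
for line in raw_lines:
    record = json.loads(line)
    observed_schemas.add(frozenset(record.keys()))

print(f"Distinct record shapes observed: {len(observed_schemas)}")
for keys in observed_schemas:
    print(sorted(keys))
```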

The next evolution of data formats is called file formats. File formats attempt to provide context about the contents of data within a file, so you can make assumptions about those contents simply by looking at some metadata within the file. These formats already start to get much more complicated and come in forms such as, but not limited to, Parquet, Avro, and ORC. These file formats are considerably more efficient for analytical systems to process and understand, but they fall far short of what is needed to perform true analysis with them. With these files alone, we lack the greater context of the collection of files and how they relate.
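
As a rough illustration of the difference, here is a small sketch using pyarrow (the file and column names are made up for the example): a Parquet file carries its own schema and column statistics, so a reader can learn quite a lot about the data without scanning every row.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write a tiny Parquet file so there is something to inspect;
# the file name and columns are illustrative only.
table = pa.table({"user_id": [1, 2, 3], "amount": [10.0, 42.5, 7.25]})
pq.write_table(table, "events.parquet")

# Unlike a raw text file, the Parquet file describes itself: schema,
# row counts, and per-column statistics live in the file's footer.
metadata = pq.read_metadata("events.parquet")
print(metadata.schema)                  # column names and types
print(metadata.num_rows)                # row count without a full scan
print(metadata.row_group(0).column(1))  # min/max stats for "amount"
```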

This is where table formats come into play. Table formats start by leveraging file formats to manage data, but they also come with metadata in the form of manifest files that help analytical systems interpret how all the data relates, in a way that is not exposed to end users. Examples of table formats are Iceberg, Hudi, and Delta. In this article, we will focus our discussion on Iceberg specifically. Iceberg is a table format that provides a number of benefits in this space, and all of them are attainable with just a few AWS services: Athena, S3, and a bit of Glue.
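
As a sketch of how little is involved, the following uses boto3 to ask Athena to create an Iceberg table backed by S3. The bucket, database, and table names are placeholders, and it assumes the Glue Data Catalog is the catalog Athena is using.

```python
import boto3

# Placeholder names: swap in your own bucket, database, and region.
athena = boto3.client("athena", region_name="us-east-1")

create_events_table = """
CREATE TABLE analytics.events (
    user_id   bigint,
    action    string,
    amount    double,
    event_ts  timestamp
)
LOCATION 's3://my-lakehouse-bucket/warehouse/events/'
TBLPROPERTIES ('table_type' = 'ICEBERG')
"""

athena.start_query_execution(
    QueryString=create_events_table,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={
        "OutputLocation": "s3://my-lakehouse-bucket/athena-results/"
    },
)
```
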
With all these great benefits, it is important to point out that a table format gives you the ability to create write-optimized solutions within AWS S3. Write-optimized solutions focus on the ability to update, delete, and generally transform data at rest within S3. For solutions that have significant requirements for read optimization, this solution would need to be extended with another tier of data that focuses on that objective.
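
To give a feel for what write-optimized means in practice, here is a sketch of row-level mutations issued through Athena against the hypothetical Iceberg table from above; with plain Parquet files in S3 you would be rewriting objects yourself, whereas the table format handles that bookkeeping.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")
query_context = {"Database": "analytics"}
results_config = {"OutputLocation": "s3://my-lakehouse-bucket/athena-results/"}

# Row-level UPDATE and DELETE against data at rest in S3.
athena.start_query_execution(
    QueryString="UPDATE events SET action = 'refund' WHERE user_id = 2",
    QueryExecutionContext=query_context,
    ResultConfiguration=results_config,
)
athena.start_query_execution(
    QueryString="DELETE FROM events WHERE event_ts < TIMESTAMP '2023-01-01 00:00:00'",
    QueryExecutionContext=query_context,
    ResultConfiguration=results_config,
)
```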