Tabular Founders: Fireside Chat
Join Ryan Blue, Daniel Weeks, and Jason Reid for a candid fireside chat. They discuss why they decided to build a data platform, what problems Tabular solves, who they are building it for, and—most importantly—why data engineers will love it.
Video Transcription
Jason Reid: [cold start] …and the nice thing is it doesn’t matter where you’re starting from. It could be you're starting from "I have event data I want to stream into something, I don’t know exactly what I want to do with it yet, but I’ve got to collect it somewhere." Or it could be you have data and you want to run simple ad-hoc SQL against it using, you know, Athena or some sort of simple Trino. Or it could be that you’re doing deep ML. No matter where you start, Tabular and Iceberg can support that use case, and then when you’re ready for the other ones, nothing new has to happen. Right? You just plug in that next engine like we talked about. And so you can just start anywhere and be confident that you can grow in any direction as your needs evolve.
[…]
Jason: Ryan, you’re the co-founder and CEO at Tabular, maybe you can start with why did you bring me along on this journey?
Ryan Blue: You know the data engineering side and the technology side and how they should fit together. And I always thought that your perspective at Netflix was invaluable to informing what we needed to build. And that’s exactly what you’re doing now.
Jason: So far so good.
Ryan: Yeah, that worked out amazingly well.
Jason: …just as long as we make data engineering lives easier I’m pretty happy.
Ryan: Exactly! That’s why we exist.
Jason: Dan, Ryan convinced you to come along on this crazy journey with us. What’s the big reason you decided it was time to leave Netflix and do Tabular?
Dan: I went to Tabular to lead engineering and build a platform that people could reuse without all the complexity and the cost. And get the latest state-of-the-art, building on top of Iceberg, in a world where engines can interconnect and you have incredible capability across lots of different platforms.
Jason: I think that one of the appeals of the modern data stack is that it allowed companies to get up and running, doing relatively sophisticated data things at scale, very easily. Right? And if there's any really big complaint about the modern data stack, it is that it was so easy to spin up to do data things that it gets out of control quickly, and now your cost vectors are out of control but you’re likely locked into the architecture that you have for various reasons. And I think what Tabular provides is another avenue there. Where you still get the simplicity to spin up something quickly. Get started, prove value, all of the things that we love about the modern data stack, but something that is also based on an open format and has the cost and scalability mechanics of cloud.
Dan: …a system like this that everyone always pushes to the end is security.
Jason: [off camera] Always.
Dan: Where you think your use cases are driving the growth, but "we’ll take care of security later when we really need it." But as every security expert knows, security isn’t the thing that you do last. You have to do it from the beginning, otherwise it’s going to be really hard to introduce. And so starting with something that has that built in and is easy, and that you can evolve along with your actual practices, means that you’re starting from a much better place and you don't have to reinvent it later.
Jason: So, Tabular represents the next generation of cloud data warehousing. If you had nothing today, the easiest thing to do is I can dump files, JSON files, or Parquet files onto S3 and I can give some sort of semblance of schema to those things, or schema on read, some sort of tool to query them. Maybe I’m using Athena or something similar. And that’s nice because it has very few moving parts. There are a lot of rough edges, there are a lot of problems that you run into over time if you only adopt that as your data warehousing architecture, but I get its appeal. Right? From a cost and flexibility and simplicity perspective. And what I’m really excited about for Tabular is that we can bring all of the solves for all the problems of that architecture and that pattern, with the same simple ease of use that it has. For example, putting data into Tabular and reading it from any engine is easier than it is to just put Parquet files on S3 and read them with Athena. But now that data is also optimized, it's secured …it can be used from other engines simply. So I’ve gotten a bunch of superpowers and haven’t had to trade off complexity on my side. And that’s a huge win for customers.
PyIceberg 0.2.1: Iceberg ❤️ PyArrow & DuckDB
In this video, we demonstrate the new features of PyIceberg 0.2.1. For the demo, we use the docker-spark-iceberg setup that's available here: https://github.com/tabular-io/docker-spark-iceberg
After spinning up the docker-compose setup, the Jupyter notebook will be available at http://localhost:8888/
The notebook PyIceberg - Getting Started.ipynb will guide you through how to read data into PyArrow, and then Pandas. And in the last part, it will demonstrate how to query the Pandas dataset using DuckDB.
For a complete overview of all the installation options, please refer to the documentation: https://py.iceberg.apache.org/
If there are any questions, please reach out using the Iceberg Slack: https://iceberg.apache.org/community/ or open an issue or pull request on GitHub: https://github.com/apache/iceberg
#iceberg #python #pyarrow #duckdb #tabular #datalake
Iceberg 2022: Year In Review
00:00 Intro
00:24 Brian Olsen
05:29 Alex Merced
06:40 Sam Redai
10:40 Ryan Blue
Series: Ask the Iceberg Experts
Guests:
- Brian Olsen, Developer Advocate Trino/Starburst, Iceberg contributor
- Alex Merced, Developer Advocate Dremio
- Sam Redai, Software Engineer Netflix, Iceberg contributor
- Ryan Blue, Tabular CEO, co-creator of Iceberg
Subject: We talk with Iceberg experts around the industry for their thoughts on the highlights of Iceberg evolution and adoption in 2022
iceberg.apache.org
www.dremio.com
www.starburst.io
www.trino.io
www.netflix.com
www.tabular.com
#iceberg #datalake #datalakehouse
Ancestry Implementation Of Iceberg
Series: Ask the Iceberg Experts
Guest: Thomas Cardenas, Senior Software Engineer, Ancestry
Subject: Ancestry implementation of Iceberg
Thomas talks about his recent blog post on implementing and optimizing a 100 billion row table in Apache Iceberg for the Hints database at Ancestry.
https://medium.com/ancestry-product-and-technology/scaling-ancestry-com-how-to-optimize-updates-for-iceberg-tables-with-100-billion-rows-860285922316
www.ancestry.com
iceberg.apache.org
#iceberg #datalake #ancestry #apacheiceberg #dataengineering
How Insider went from Hive to Iceberg
Series: Ask the Iceberg Experts
Guest: Deniz Parmaksiz, Sr. Machine Learning Engineer at Insider
Subject: What was involved in migrating Insider from Hive to Iceberg
Inspired by this blog post:
https://medium.com/insiderengineering/how-we-migrated-our-production-data-lake-to-apache-iceberg-4d6892eca6e6
iceberg.apache.org
www.tabular.io
#iceberg #datalake #insider #hive #tabular
Snowflake Support Of Iceberg
Series: Ask the Iceberg Experts
Guest: Dennis Huo, Principal Software Engineer, Snowflake
Subject: Snowflake support of Iceberg
Dennis talks about Snowflake support of Iceberg, what it was like developing it, what it was like working with the Iceberg community and the Snowflake Catalog.
iceberg.apache.org
#iceberg #datalake #snowflake #tabular
Demonstrating PyIceberg
In this video, we demonstrate how to use the PyIceberg CLI. For the demo, we use the docker-spark-iceberg setup that's available here: https://github.com/tabular-io/docker-spark-iceberg
First, we create a table using Spark through the Jupyter notebook.
Next, we browse the catalog using the `pyiceberg` CLI. We install pyiceberg from pip using `pip install "pyiceberg[pyarrow]"`.
For a complete overview of all the installation options, please refer to the documentation:
https://py.iceberg.apache.org/
Next, we demonstrate several commands, such as list, describe, and files, to retrieve information about the Iceberg tables. At the end, we show how easy it is to accidentally drop a table using the CLI.
If there are any questions, please reach out using the Iceberg Slack: https://iceberg.apache.org/community/
or open an issue or pull request on GitHub: https://github.com/apache/iceberg
Iceberg 101
Series: Ask the Iceberg Experts
Guest: Ryan Blue, co-creator of Iceberg, and co-founder of Tabular
Subject: Introduction to Iceberg and its origins at Netflix (Iceberg 101)
iceberg.apache.org
www.tabular.io
#iceberg #datalake #tabular #ryanblue
Iceberg 102
Series: Ask the Iceberg Experts
Guest: Ryan Blue, co-creator of Iceberg, and co-founder of Tabular
Subject: Introduction to Iceberg and table formats (Iceberg 102)
iceberg.apache.org
www.tabular.io
#iceberg #datalake #tabular #ryanblue
Tabular Office Hours: June 14, 2023
00:00 Intro
05:34 Three main factors affecting cost
11:10 Tabular's optimization techniques
16:34 Illustrate cost savings
22:23 Summary
Series: Tabular Office Hours
Guest: Jason Reid, Tabular co-founder and head of product
Subject: Cost Optimization
Jason reviews how to optimize costs on AWS Object Store and how Tabular does it for you automatically.
AI Generated Summary:
- Jason Reid, co-founder of Tabular, discussed the benefits of automatic table optimization for cloud data warehousing, which can lower costs and improve query performance. He explained the three components of cost for data warehousing (network, storage, and compute) and provided examples of pricing models, such as Amazon Athena's pricing based on data volume.
- Jason discussed the three main factors that affect the cost of running a cloud data warehouse environment: network, storage, and compute. He also explained how Tabular's automated optimization techniques can reduce costs by organizing data more effectively and executing compute in the background.
- Jason discussed how Tabular's platform handles optimization through sorting, compression, and compaction. The platform constantly experiments with different compression settings to find the best combination of size, write performance, and read performance for each table, resulting in a 50-80% reduction in overall data size and significant cost savings.
- Jason discussed how Tabular's table optimization can significantly reduce the cost of data warehousing bills by compacting and organizing data into a smaller set of bytes. He demonstrated this through a demo where a table with 434,000 rows and 175 megabytes worth of data was optimized to just over 100 megabytes, resulting in a 40% overall savings on the cost of that workload.
- Jason explained how sorting and organizing the data helped to significantly reduce the amount of data that needed to be loaded, resulting in faster response times and lower costs. The team was excited about the optimization features in Tabular.
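The savings arithmetic in the summary above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: 175 MB comes from the summary, 105 MB is an assumed stand-in for "just over 100 megabytes", and the per-terabyte price and the number of scans are hypothetical values chosen to show how a pay-per-byte-scanned model responds to a smaller table. None of these functions are part of Tabular or Athena.

```python
# Illustrative arithmetic only. 175 MB is taken from the summary; 105 MB is an
# assumed value for "just over 100 megabytes"; price and scan count are hypothetical.
BYTES_PER_MB = 1_000_000
BYTES_PER_TB = 1_000_000_000_000


def percent_reduction(before_mb: float, after_mb: float) -> float:
    """Return the size reduction as a percentage of the original size."""
    return (before_mb - after_mb) / before_mb * 100.0


def scan_cost_usd(table_mb: float, scans: int, usd_per_tb_scanned: float) -> float:
    """Cost of repeated full-table scans when billing is per byte scanned."""
    scanned_bytes = table_mb * BYTES_PER_MB * scans
    return scanned_bytes / BYTES_PER_TB * usd_per_tb_scanned


before_mb, after_mb = 175.0, 105.0   # after_mb is an assumption
scans_per_month = 10_000             # hypothetical workload
usd_per_tb = 5.0                     # hypothetical price per TB scanned

reduction = percent_reduction(before_mb, after_mb)
cost_before = scan_cost_usd(before_mb, scans_per_month, usd_per_tb)
cost_after = scan_cost_usd(after_mb, scans_per_month, usd_per_tb)

print(f"size reduction: {reduction:.0f}%")
print(f"monthly scan cost before: ${cost_before:.2f}")
print(f"monthly scan cost after:  ${cost_after:.2f}")
```

The sketch assumes every query scans the whole table. As the last bullet notes, sorting lets engines skip data, so the bytes actually scanned per query may shrink further than the file-size reduction alone suggests.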
Iceberg: Copy on Write vs Merge on Read
Series: Ask the Iceberg Experts
Guest: Daniel Weeks, co-creator of Iceberg, and co-founder of Tabular
Subject: Copy on Write vs. Merge on Read
iceberg.apache.org
www.tabular.io
#iceberg #datalake #tabular #danielweeks
Tabular Bits: Connect with Trino
Series: Tabular Bits
Subject: Connecting with and using Trino, with Tabular
www.tabular.io
www.trino.io
#datalake #datalakehouse #trino #tabular #iceberg #apacheiceberg
How to Migrate or Convert from Hive
Series: Ask the Iceberg Experts
Guest: Daniel Weeks, co-creator of Iceberg, and co-founder of Tabular
Subject: How to Migrate or Convert from Hive
iceberg.apache.org
www.tabular.io
#iceberg #datalake #hive #tabular #danielweeks
AWS 2022 Iceberg Integrations
Series: Ask the Iceberg Experts
Guest: Jack Ye, Sr. Software Engineer, Amazon Athena
Subject: AWS 2022 Iceberg Integrations
iceberg.apache.org
www.tabular.io
#iceberg #datalake #AWS #tabular #jackye
PyIceberg: Python Development Setup
This video will walk you through the steps required to set up the Python development environment for PyIceberg. We will set up a local instance of Spark, a REST catalog, and MinIO for querying an actual table. This makes it easy to do interactive development and test everything end to end.
#iceberg #python #pyiceberg #tabular #minio #spark #datalake #datalakehouse
REST Catalog Explained
Series: Ask the Iceberg Experts
Guest: Daniel Weeks, co-creator of Iceberg, and co-founder of Tabular
Subject: REST Catalog Explained
iceberg.apache.org
www.tabular.io
#iceberg #datalake #restcatalog #tabular #danielweeks
Underused Iceberg Features In AWS S3
Series: Ask the Iceberg Experts
Guest: Jack Ye, Sr. Software Engineer, Amazon Athena
Subject: Iceberg features available in AWS S3 that are underused
iceberg.apache.org
www.tabular.io
#iceberg #datalake #AWS #tabular #jackye #s3
Tabular Bits: Create Warehouse
Series: Tabular Bits
Subject: How to create a Warehouse in Tabular in under a minute
www.tabular.io
#datalake #datalakehouse #tabular #iceberg #apacheiceberg
Original UI/UX published as:
https://youtu.be/7ROmcCypj-g
Tabular Bits: Starburst Galaxy Integration
Series: Tabular Bits
Subject: Starburst Galaxy integration
www.tabular.io
www.starburst.io
iceberg.apache.org
www.trino.io
#datalake #datalakehouse #trino #tabular #iceberg #apacheiceberg #starburst
Tabular Explainer Video
An overview of the Tabular platform and what it provides for your Apache Iceberg data lake.
#tabular #datalake #datalakehouse #apacheiceberg #iceberg
What Is Puffin?
Series: Ask the Iceberg Experts
Guest: Ryan Blue, co-creator of Apache Iceberg, and co-founder of Tabular
Subject: What is the Puffin file format, and how does it relate to the Apache Iceberg ecosystem?
A special thanks to the Trino Software Foundation and Piotr Findeisen for their work on this project.
iceberg.apache.org
www.tabular.io
www.trino.io
#iceberg #datalake #datalakehouse #ryanblue #apacheiceberg #dataengineering
Tabular Bits: Drop and restore Iceberg tables
Series: Tabular Bits
Subject: Drop and Restore Tables
Tabular makes it very simple to drop and restore Apache Iceberg tables. This video illustrates the necessary steps.
www.tabular.io
iceberg.apache.org
#datalake #datalakehouse #dataengineering #tabular #iceberg #apacheiceberg
Tabular Solutions: Google Colab
Series: Tabular Solutions
Guest: Jason Reid, Tabular co-founder
Subject: Using Spark in Google Colab to read/write data from Tabular managed Iceberg tables
Jason shows Shawn how to configure Google Colab to use Apache Spark to read/write data in Tabular-managed Apache Iceberg tables.
www.tabular.io
https://colab.google/
#iceberg #datalake #apacheiceberg #datalakehouse #redshift #tabular #apachespark #googlecolab
Catalogs: How to Choose
Series: Ask the Iceberg Experts
Guest: Daniel Weeks, co-creator of Iceberg, and co-founder of Tabular
Subject: Catalogs - How to Choose
iceberg.apache.org
www.tabular.io
#iceberg #datalake #tabular #danielweeks
Hidden Partitioning
Series: Ask the Iceberg Experts
Guest: Ryan Blue, co-creator of Iceberg, and co-founder of Tabular
Subject: Hidden Partitioning
iceberg.apache.org
www.tabular.io
#iceberg #datalake #tabular #ryanblue