Reddit Data Pipeline | AWS End to End Data Engineering

11 months ago

🚀 In this video, we walk you through the integration of Reddit, Airflow, Celery, Postgres, S3, AWS Glue, Athena, and Redshift to create a seamless ETL process. 📊🔍

What You Will Learn 📝:
🌐 How to extract data from Reddit using its API.
🔄 Setting up and orchestrating ETL processes with Apache Airflow and Celery.
📦 Storing data efficiently in Amazon S3 using Airflow.
🧠 Leveraging AWS Glue for data cataloging and ETL jobs.
📜 Querying and transforming data with Amazon Athena.
🏢 Setting up a Redshift cluster and best practices for loading data into Amazon Redshift for analytics.
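
The extract-and-clean steps listed above could look roughly like this in Python. This is a minimal sketch, not the code from the video: the field list `POST_FIELDS`, the helper names `clean_post` and `posts_to_csv`, and the `[deleted]` placeholder are assumptions made for illustration. It only uses the standard library, so it runs without Airflow or AWS access.

```python
import csv
import io
from datetime import datetime, timezone

# Assumed subset of Reddit post fields; the video may keep a different set.
POST_FIELDS = ["id", "title", "score", "num_comments", "author", "created_utc", "over_18"]


def clean_post(raw: dict) -> dict:
    """Normalize one raw post dict into flat, consistently typed values."""
    # Reddit reports creation time as a Unix epoch in seconds (UTC).
    created = datetime.fromtimestamp(float(raw["created_utc"]), tz=timezone.utc)
    return {
        "id": str(raw["id"]),
        # Collapse runs of whitespace so titles stay on one CSV line.
        "title": " ".join(str(raw.get("title", "")).split()),
        "score": int(raw.get("score", 0)),
        "num_comments": int(raw.get("num_comments", 0)),
        # Hypothetical placeholder for posts whose author is missing.
        "author": str(raw.get("author") or "[deleted]"),
        "created_utc": created.strftime("%Y-%m-%d %H:%M:%S"),
        "over_18": bool(raw.get("over_18", False)),
    }


def posts_to_csv(posts: list[dict]) -> str:
    """Clean every post and serialize the result as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=POST_FIELDS, lineterminator="\n")
    writer.writeheader()
    for post in posts:
        writer.writerow(clean_post(post))
    return buffer.getvalue()
```

In a pipeline like the one described, the returned CSV text would typically be written to a file and uploaded to S3 by a later Airflow task.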

⏰ Timestamps:
0:00 Introduction
1:27 Setting up Apache Airflow with a Celery backend and Postgres
9:20 Reddit Data Pipeline with Airflow
41:00 Cleaning and Transforming Reddit Data
50:00 Connecting to AWS from Airflow
1:11:17 AWS Glue data transformation
1:22:13 Querying Data with Athena
1:24:47 Setting up Redshift Data Warehouse
1:27:26 Redshift Data Warehouse Query Tool
1:29:00 Loading Data into Data Warehouse
1:32:25 Charting with Redshift Data Warehouse
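
One common way to load CSV files from S3 into Redshift is the `COPY` command. The sketch below only builds the SQL text and assumes, without confirmation from the video, that the load is done this way. The helper name `build_copy_statement` and all values passed to it are hypothetical.

```python
def build_copy_statement(table: str, s3_uri: str, iam_role_arn: str) -> str:
    """Return a Redshift COPY statement for a CSV file that has a header row.

    The arguments are interpolated directly into the SQL, so they should come
    from trusted configuration, never from user input.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS CSV\n"
        "IGNOREHEADER 1;"
    )


# Placeholder values for illustration only.
statement = build_copy_statement(
    table="reddit_posts",
    s3_uri="s3://example-bucket/transformed/reddit.csv",
    iam_role_arn="arn:aws:iam::123456789012:role/example-redshift-role",
)
print(statement)
```

The IAM role must be attached to the Redshift cluster and allowed to read the bucket, otherwise the statement fails when it is run.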

🔗 Useful Links:
Reddit API Documentation: https://www.reddit.com/wiki/api/
Apache Airflow Official Site: https://airflow.apache.org/docs/
AWS Glue Documentation: https://docs.aws.amazon.com/glue/latest/dg/catalog-and-crawler.html

💬 Let us know in the comments if you have any questions or if there's another topic you'd like us to cover next!

🌟 Don't forget to like, share, and subscribe for more data tutorials! 🌟
