Plumbers of Data Science

By Andreas Kretz

Data Engineering is the plumbing of data science. Almost invisible, but super important and a big mess when done wrong.
We talk about interesting Data Engineering trends and topics. I also teach Data Engineering in my Data Engineering Academy at LearnDataEngineering.com.

#063 Data Engineering At Airbnb Case Study

May 27, 2019 · 01:02:59
#123 Building Fast and Fun Data Projects - with Mehdi Ouazza

In this episode, I sit down with Mehdi Ouazza - data tinkerer, indie hacker, and content creator - who's always up to something interesting in the world of data and AI.

We started with DuckDB but quickly veered off into much more exciting territory: side projects, voice-to-SQL with actual quacks, the power of local models, and why WebGPU might be one of the most underrated browser technologies today.

We also talked about how we teach and learn data engineering in 2025: the importance of fun, interactivity, and why we both dream of creating a data engineering game that’s part "Among Us" and part serious skills training.

Mehdi shares what tools he's using, where he sees GenAI actually helping—not replacing—engineers, and how he's building courses and meetups that inspire creativity in technical work.

Perfect for data folks who like to experiment, educators looking for inspiration, or anyone wondering how far a fun idea can go with the right mix of curiosity and tooling.

Jun 19, 2025 · 01:16:32
#122 Why Writing Is Thinking, and What Data Engineers Can Learn from It - with Simon Späti

In this podcast episode, I’m joined by Simon Späti, long-time BI and data engineering expert turned full-time technical writer and author of the living book Data Engineering Design Patterns.

 We talk about:

  • His 20-year journey from SQL-heavy BI to modern Data Engineering
  • Why switching from employee to full-time author wasn’t planned, but necessary
  • How he uses a “Second Brain” system to manage and publish his knowledge
  • Why writing is a tool for learning, not just sharing
  • The concept of convergent evolution in data tooling: when old and new solve the same problem
  • The underrated power of data modeling and pattern recognition in a hype-driven industry

 

Simon also shares practical advice for building your own public knowledge base, and why Markdown and simplicity still win in the long run.

Whether you're into tools, systems, or lifelong learning, this one’s a thoughtful deep dive.

***

About Simon Späti:

Simon is a Data Engineer and Technical Author with 20+ years of experience in the data field. He's the author of the Data Engineering Blog (ssp.sh), curator of the Data Engineering Vault (vault.ssp.sh), and currently writes a book about Data Engineering Design Patterns (dedp.online). Simon maintains an awareness of open-source data engineering technologies and enjoys sharing his knowledge with the community.

Socials: Bluesky, LinkedIn, Twitter/X, YouTube

Jun 11, 2025 · 01:05:22
#121 From Application Dev to AWS Hero: A Journey in Tech & Impact - with Johannes Koch
May 19, 2025 · 01:03:19
#120 Teaching Data Engineering Like It’s Done on the Job - with Deepak Goyal
May 08, 2025 · 48:34
#119 Recruiting is harder than I thought

In this episode of the Plumbers of Data Science podcast, I dive into the challenges of recruiting today, from overwhelming job application volumes to reaching out directly to recruiters.

I’m testing new strategies to make the process smoother for everyone involved, focusing on fresh job listings and fostering connections with hiring managers who need skilled engineers. My goal? To secure five job placements in Germany by year’s end!

Have thoughts on today’s job market, or tried the Easy Apply feature yourself? Drop a comment below—I’d love to hear your experience!

Nov 04, 2024 · 10:10
#118 Freelancing as a Data Engineer - Hero Talk with the "Seattle Data Guy" Ben Rogojan

In this Hero Talk episode, I had the pleasure of chatting with Ben Rogojan, better known as the "Seattle Data Guy." Ben is a data engineer, YouTuber, and freelancer with a background at Facebook. He's become a go-to expert on freelancing for engineers, particularly in the data space.

We dive into Ben's journey from being a full-time engineer to making the switch to freelancing, how he built his own business, and the unique challenges freelancers face in this space.

We also explore how to break into freelancing, the value of specializing in a specific skill, and practical tips on landing your first freelance clients.

Oct 25, 2024 · 54:33
#117 We Are Starting a Recruiting Service!

In this episode of the Plumbers of Data Science podcast, I'm sharing some exciting updates about the future of Learn Data Engineering and a big new service we’re launching—recruiting!

I explain how this new offering will help engineers find their next career move while connecting companies with top talent. Tune in to hear more about how the Academy, Coaching, and now recruiting fit together into one ecosystem designed to support your career growth.

Let me know your thoughts in the comments—are you excited about this new direction?

Oct 18, 2024 · 11:28
#116 Data Modeling is F***ing Easy!

In this episode of the Plumbers of Data Science podcast, I’m sharing my thoughts on why data modeling isn’t as complicated as people make it out to be. You hear about courses and tutorials that stretch for hours—but is it really that hard?

I’ll break down the two main things you need to focus on when modeling data and explain why, once you’ve got those down, the rest falls into place.

Sep 23, 2024 · 06:14
#115 His Career Started With a Bootcamp & Now He Helps Others Succeed - Hero Talk w/ Mezue Obi-Eyis

In this Hero Talk episode, I talk with Mezue, a seasoned Data Engineer with expertise in Azure Databricks Data Engineering. We cover his journey from Electrical Engineering to Data Engineering and discuss the key skills, like Python, SQL, and Spark, that are essential in the field.

Mezue also shares his experience running an Azure Databricks bootcamp and offers advice on how to break into Data Engineering, especially in Cloud environments. We also touch on the challenges of finding junior roles and how to stand out by working on practical projects.

Sep 20, 2024 · 39:17
#114 Dirty Data & Data Cleaning - Hero Talk with "The Classification Guru" Susan Walsh

In this Hero Talk episode, I chat with Susan Walsh, the “Classification Guru,” known for her expertise in cleaning and classifying messy data.

We dive into her unexpected journey into the data world, starting with a spend analytics job, and how that led to her founding her own business focused on dirty data. Susan shares the unique challenges businesses face with poor data quality, explaining why 99.9% of data problems are actually people problems.

We also explore practical ways to deal with these issues, such as finding those "crappy" data cleaning jobs to gain experience, and the importance of consistent data maintenance to prevent future headaches. From addressing dirty CRM systems to battling fraud, Susan’s stories highlight how critical clean data is for business success.

Sep 16, 2024 · 48:31
#113 A Deep Dive Into APIs, IoT, and Data Storage - Hero Talk with Paolo Lulli

In this Hero Talk episode, I sit down with Paolo Lulli, an experienced Data Engineer, to explore some of the core challenges and decisions in API development and data management. We dive deep into the debate between serverless infrastructure and traditional servers, discussing the pros and cons of both approaches, particularly in the context of scalability, cost, and maintenance.

Paolo also shares his hands-on experience with time series databases, explaining their advantages in handling massive amounts of data from IoT devices. We delve into vendor lock-in issues, highlighting how relying too heavily on cloud providers like AWS or Azure can impact long-term flexibility.

Sep 09, 2024 · 34:27
#112 Why testing data pipelines can be so challenging - and how to tackle it

In this episode of the Plumbers of Data Science podcast, I’m diving into why testing can be so challenging for data engineers. The inspiration for this topic actually came from one of my recent Coaching sessions, where the question of test-driven development (TDD) came up during a Q&A. It stuck with me, so I thought it would be a great topic to dive deeper into.

I’ll explain the key benefits of TDD, like improved code quality and easier refactoring, and why, despite its advantages, it’s not always widely adopted—especially in fast-paced environments where time constraints dominate. We’ll also talk about the specific challenges data engineers face with TDD, such as handling large, unpredictable data, integrating with external systems, and adapting to ever-changing data.

Sep 06, 2024 · 18:41
#111 Is This the Synthetic Data Revolution?! Hero Talk with Mario Scriminaci from Mostly AI

In this Hero Talk episode, we dive deep into the fascinating world of synthetic data, a critical tool for development, testing, and training Machine Learning models. Joining me is Mario Scriminaci, Chief Product Officer at Mostly AI, who shares his expertise on how synthetic data can revolutionize the way we handle sensitive information, particularly in the context of privacy regulations like GDPR and CCPA.

We discuss the real-world applications of synthetic data, how it differs from traditional mock data, and its potential to drive innovation in AI and ML development. Mario also introduces Mostly AI's cutting-edge tools, highlighting how they make it easier than ever to generate realistic, privacy-safe datasets.

Sep 02, 2024 · 46:49
#110 Bootcamps vs Coaching

In this episode of the Plumbers of Data Science podcast, I’m diving into the debate between bootcamps and coaching programs, especially for those looking to advance in Data Engineering.

I’ll break down the pros and cons of each approach, from the structured, intensive nature of bootcamps to the personalized, flexible support of coaching, and share insights to help you choose the right path for your career. I’ll also discuss the experiences of my current coaching students and what I’m focusing on to help them achieve their goals.

Aug 30, 2024 · 25:55
#109 Why your data and goals matter more than tools!

In this episode of the Plumbers of Data Science podcast, I’m diving into what truly matters when building data platforms and pipelines.

As engineers, it’s easy to get caught up in the latest tools, but real success starts with understanding your data sources and defining clear goals. I’ll walk you through the key questions to ask, from data retention to processing speeds and user needs.

Aug 23, 2024 · 16:32
#108 Why Apache Spark Is Such An Essential Skill - Hero Talk with Philipp Brunenberg

In this episode, we explore the essentials of learning and mastering Apache Spark. Joining me is Philipp, an experienced Spark developer and educator, who shares his expert roadmap for becoming proficient in Spark. We discuss why Spark is a crucial tool for data engineers, how to set it up effectively, and the best approaches to start your Spark journey.

Philipp also highlights the importance of understanding Spark's internals, deploying real-world applications, and optimizing performance. He walks us through his six-part roadmap, focusing on hands-on practice and building confidence through real-world projects. We also touch on key topics like the Scala vs. Python debate, Spark's role in machine learning, and how it stands against emerging tools like Beam.

Aug 19, 2024 · 40:25
#107 The Future of Data Observability - Hero Talk with Ryan Yackel

In this Hero Talk episode, we explore the crucial topic of data observability, a field that has become essential for Data Engineers dealing with complex data pipelines. I am joined by my special guest Ryan Yackel from DataBand, who shares his insights and expertise on the subject.

Ryan delves into the concept of data observability and its significance for Data Engineers, addressing common challenges faced in monitoring and maintaining data pipelines. He explains how DataBand helps in monitoring and improving data reliability, ensuring that data flows smoothly from source to destination.

Aug 12, 2024 · 53:56
#106 Should You Move to Germany for a Data Engineering Career?

In this episode of the Plumbers of Data Science podcast, I’m breaking down the real deal of working as a data engineer in Germany. Does it live up to the hype? Sure, we’ve got free education and solid health insurance, but what’s the actual cost of living here, and how much of your salary do you really take home after taxes?

I’ll walk you through the numbers—from what you can expect to earn, to the surprising deductions that quickly eat away at your paycheck. Plus, I’ll explain why companies in Germany struggle with high labor costs and how that impacts your wallet.

Aug 09, 2024 · 11:22
#105 Personal Branding in Data - Hero Talk with Kate Strachnyi

In this Hero Talk episode, we delve into the fascinating world of personal branding in data with our special guest, Kate Strachnyi, founder of DataCated.

Join us as Kate shares her vast expertise in personal branding, drawing from her experience as an author and LinkedIn Learning instructor. We discuss the nuances of building a personal brand, both intentionally and organically, and explore practical strategies for leveraging social media to increase your visibility and credibility in the data space.

Aug 05, 2024 · 54:03
#104 The Secret Why Time Series Databases Are Awesome - Hero Talk with Jeff Tao

In this Hero Talk episode, we explore the dynamic world of time series data and time series databases with a special guest, Jeff Tao, founder and CEO of TDengine.

Join us as Jeff shares his journey from designing smart devices to founding TDengine, a leading time series database. We dive deep into the benefits and unique features of time series databases, practical use cases, and how they handle massive amounts of data generated by IoT devices, smart meters, and more.

Aug 02, 2024 · 55:13
#103 From India to the U.S.: Becoming a Data Engineer at Toyota - Hero Talk with Ayan Tiwari

In this Hero Talk episode, we dive into the inspiring journey of Ayan Tiwari, a Data Engineer at Toyota North America.

Join us as Ayan shares his remarkable transition from being an undergrad in civil engineering in India to working for a major company like Toyota in the U.S.

Ayan walks us through how he made this significant career switch, pursued his master’s degree in the U.S., and the fascinating projects he's currently working on.

Jul 29, 2024 · 30:37
#102 Data Tools & Platforms: Why you should always be skeptical

In this episode of the Plumbers of Data Science podcast, we explore why you should be skeptical of data platforms and tools. Using a LEGO Grogu set from Star Wars as an analogy, I reveal the hidden issues behind flashy exteriors: chaotic scaffolding, empty spaces, and missing features.

I emphasize the importance of trying tools yourself, running benchmarks, and checking if they fit your use case. We also discuss the role of developer communities and frequent updates in improving these tools.

Have you encountered over-promised solutions? Share your experiences and thoughts in the comments ;)

Jul 26, 2024 · 04:50
#101 GenAI from a Data Engineer's perspective - Hero Talk with Vinoth Nageshwaran
Jul 22, 2024 · 43:28
#100 Why Excel should be a go-to tool for data professionals
Jul 19, 2024 · 07:16
#99 Real Talk on GenAI & Large Language Models - Hero Talk with Harpreet Sahota

In this Hero Talk episode we dive into the exciting and evolving world of Generative AI and Large Language Models (LLMs) with a special guest, Harpreet Sahota.


Join us as Harpreet shares his extensive knowledge and experience as a seasoned data scientist. We explore the transformative potential of Generative AI, practical applications, and the challenges that come with integrating these advanced models into various industries.

Jul 15, 2024 · 48:13
#98 Are Job Guarantees a Scam?
Jul 12, 2024 · 09:47
#97 Data Science Career AMA! - Hero Talk with Andrew Jones

In this Hero Talk episode we dive into the world of Data Science careers with a special Ask Me Anything (AMA) session.


Join me as I welcome Andrew Jones, founder of the Data Science Infinity program. Andrew shares his journey from working at top tech companies such as PlayStation to founding his own academy. We discuss the current job market, the role of certifications, and how to build an effective resume and portfolio to stand out in a competitive field.

Jul 05, 2024 · 58:15
#96 Can GenAI be trusted?
Jun 28, 2024 · 10:01
#95 The Perfect CV for Switching Careers
Jun 21, 2024 · 12:57
#94 - Making less money to set yourself up for success?!
Jun 14, 2024 · 07:29
#93 Is the highest paying job the best?
Jun 07, 2024 · 12:19
#92 Is it impossible to get a Data Engineering job as a fresher?
May 31, 2024 · 13:38
#91 A New Beginning & All Successful Students Have This:

Starting up the podcast for another session :)

May 31, 2024 · 07:53
#90 Taylor McGrath - The Future of the Modern Data Stack

Super happy to have Taylor with me on this stream. She is the VP of Data Labs at Rivery and therefore has a lot of experience with data platforms. We'll talk about the modern data stack and where it's going. I'm excited to hear about her experience with the changes that are happening in the data space, and what that means for data engineers & data teams.

Jan 25, 2023 · 47:01
#89 Piyush Sachdeva - Getting Into Google After Eight Rejections from Amazon!

In this video I talk to Piyush who's an engineer at Google and has his own YouTube channel: "Tech Tutorials with Piyush". He's a really good guy and I love how he's dedicated to teaching engineering. We are talking about some awesome topics like: 

  • Is LinkedIn a must for getting a job?
  • Tips for recording yourself 
  • Cloud Engineering vs Data Engineering
  • Which Cloud Platform should you choose right now?
  • The amazing Google work culture explained
  • Everybody should learn how to use Kubernetes
  • How getting rejected over and over at Amazon got him into Google 
  • The hiring process at Google  

Have fun!

You can also check this one out on YouTube: https://youtu.be/FZemVaqQcnM

If you want to get into Data Engineering check out my Academy at https://learndataengineering.com

Jan 16, 2023 · 44:27
#88 - Wouter Trappers - How to Realize a Data Strategy Like a Pro!

I have seen people get data strategy wrong a few times. Luckily Wouter Trappers, who helps companies with this professionally, can help. We talked about the steps you need to take from value proposition to dashboards. Wouter is really knowledgeable, and it was super fun talking with him and hearing his approach.

Apr 12, 2022 · 39:48
#87 - Dhruba Borthakur - From Hadoop to real time analytics

Dhruba Borthakur is CTO at Rockset and a passionate Data Engineer. Before co-founding Rockset, he played a big role in the development of Hadoop HDFS at Yahoo as well as HBase and RocksDB at Facebook. His current project is the serverless Rockset platform, where you can gain real-time analytics insight into your data. I tried it out before our talk and really liked it.

Apr 12, 2022 · 01:05:37
#86 The Ultimate Data Engineering Introduction

The Podcast is back!!!! I promise I am going to keep it up to date this time ;)
In this episode I talk about my newest Data Engineering course. I think it's the ultimate 1 hour 15 minutes introduction to Data Engineering. 
There were also a ton of questions from the chat that I answered. I think you'll really enjoy this.

Jan 14, 2021 · 01:14:35
#085 Big Data and Data Science Landscape plus trying to read Tweets with Nifi

We are looking into the network communication protocol map. I first saw this about 10 years ago and it's awesome.

Then we check out the Big Data and Data Science Landscape image. It shows you all the tools available to do data science, machine learning and data engineering, which is very helpful if you are researching tools to use.

Before using the Twitter API you have to create a developer account, so I show you how I created one. After that I tried to get Nifi to download Tweets, but it did not work.

May 28, 2019 · 43:07
#084 Behind the scenes: Audio podcast, free transcriptions and GitHub

Today's podcast is a bit of a behind the scenes. 

What it takes to do an audio podcast and how you can get audio-to-text transcriptions for free.

Also GitHub questions on how to work with branches on the Cookbook.

May 27, 2019 · 51:21
#083 Data Engineering at OLX Case Study
May 27, 2019 · 01:10:53
#082 Reading Tweets With Apache Nifi & IaaS vs PaaS vs SaaS

In this episode we install the Nifi Docker container and look into how we can extract the Twitter data.

We are also talking about the differences between infrastructure as a service, platform as a service and software as a service.

May 27, 2019 · 01:19:07
#081 How to get tweets from the Twitter API

In this episode we look into the Twitter API documentation, which I love by the way.

How we can get old tweets for certain hashtags, and how to get current live tweets for these hashtags.

May 27, 2019 · 01:09:47
#080 How To Find A Job In Germany & Answering Mails

Tips on how you find a job in Germany and two super interesting mails.

May 27, 2019 · 54:54
#079 Trying to stay true to myself and making the cookbook public on GitHub
May 27, 2019 · 24:34
#078 Cookbook collaboration and updates

Updates of the cookbook and how to collaborate on it

May 27, 2019 · 31:08
#077 Lambda and Kappa Architecture

In this episode we talk about the Lambda Architecture with stream and batch processing, as well as an alternative, the Kappa Architecture, which consists only of streaming. We also cover data engineer vs data scientist and discuss Andrew Ng's AI Transformation Playbook.

May 27, 2019 · 01:22:02
#076 Cloud vs On Premise How To Decide

How do you choose between Cloud and On-Premise? We cover the pros and cons and what you have to think about, because there are good reasons to not go cloud.

Also thoughts on how to choose between the cloud providers by just comparing instance prices. Otherwise the comparison will drive you insane.

May 27, 2019 · 01:15:56
#075 Creating the Course Structure For My Data Engineering Course

In this episode we go over the ideas I have for the data engineering course structure. It was your chance to influence what we put in there.

May 27, 2019 · 53:19
#074 Starting My Data Engineering Online Course

In this video we go over some of the 100+ comments I received on LinkedIn about a data engineering training. 

May 27, 2019 · 01:01:19