Plumbers of Data Science

By Andreas Kretz
Data Engineering is the plumbing of data science. Almost invisible, but super important and a big mess when done wrong.

I talk about trends, tools and techniques around big data and data engineering. I want to help you get started and inspire you to create.

The main podcast is on YouTube. Not all episodes make sense as an audio podcast.
Where to listen: Apple Podcasts, Breaker, Google Play Music, Google Podcasts, Overcast, Pocket Casts, RadioPublic, Spotify

#085 Big Data and Data Science Landscape plus trying to read Tweets with Nifi
We are looking into the network communication protocol map. I first saw this like 10 years ago and it's awesome. Then we check out the Big Data and Data Science Landscape image. It shows you all the tools available to do data science, machine learning and data engineering, which is very helpful if you are researching tools to use. Before using the Twitter API you have to create a developer account, so I show you how I created one. After that I tried to get Nifi to download Tweets, but it is not working.
May 28, 2019
#084 Behind the scenes: Audio podcast, free transcriptions and GitHub
Today's podcast is a bit of a behind the scenes: what it takes to do an audio podcast and how you can get audio to text transcriptions for free. Also GitHub questions on how to work with branches on the Cookbook.
May 27, 2019
#083 Data Engineering at OLX Case Study
Today, a case study about OLX with a guest. It was super fun! Here are the slides Alexey and I talked about:
May 27, 2019
#082 Reading Tweets With Apache Nifi & IaaS vs PaaS vs SaaS
In this episode we install the Nifi Docker container and look into how we can extract the Twitter data. We are also talking about the differences between infrastructure as a service, platform as a service and software as a service.
May 27, 2019
#081 How to get tweets from the Twitter API
In this episode we look into the Twitter API documentation, which I love by the way. How can we get old tweets for certain hashtags, and how do we get current live tweets for these hashtags?
May 27, 2019
#080 How To Find A Job In Germany & Answering Mails
Tips on how you find a job in Germany and two super interesting mails.
May 27, 2019
#079 Trying to stay true to myself and making the cookbook public on GitHub
The cookbook, my YouTube, it will all be free, forever! Check out the data engineering cookbook on GitHub:
May 27, 2019
#078 Cookbook collaboration and updates
Updates to the cookbook and how to collaborate on it.
May 27, 2019
#077 Lambda and Kappa Architecture
In this episode we talk about the lambda architecture with stream and batch processing, as well as an alternative, the Kappa Architecture, that consists only of streaming. Also data engineer vs data scientist, and we discuss Andrew Ng's AI Transformation Playbook.
May 27, 2019
#076 Cloud vs On Premise How To Decide
How do you choose between Cloud vs On-Premise, pros and cons and what you have to think about. Because there are good reasons to not go cloud. Also thoughts on how to choose between the cloud providers by just comparing instance prices. Otherwise the comparison will drive you insane.
May 27, 2019
#075 Creating the Course Structure For My Data Engineering Course
In this episode we go over the ideas I have for the data engineering course structure. It is your chance to influence what we put in there.
May 27, 2019
#074 Starting My Data Engineering Online Course
In this video we go over some of the 100+ comments I received on LinkedIn about a data engineering training. 
May 27, 2019
#073 Data Engineering At LinkedIn Case Study
Let's check out how LinkedIn is processing data
May 27, 2019
#072 Data Engineering At Twitter Case Study
How is Twitter doing Data Engineering? Oh man, they have a lot of cool things to share.
May 27, 2019
#071 Data Engineering At Spotify Case Study
In this episode we are looking at the data engineering at Spotify, my favorite music streaming service. How do they process all that data?
May 27, 2019
#070 The Engineering Culture At Spotify
In this podcast we look at the engineering culture at Spotify, my favorite music streaming service.  The process behind the development of Spotify is really awesome.
May 27, 2019
#069 Data Engineering At Pinterest Case Study
A look into how Pinterest is doing data engineering.
May 27, 2019
#068 A Budget Data Science PC Build
Configuring a sub-1000-dollar PC for data engineering and machine learning. Links to the builds: $900 build: $1500 build:
May 27, 2019
#067 Data Engineering At NASA Case Study
A look into how NASA is doing data engineering.
May 27, 2019
#066 How To Do Data Science From A Data Engineer's Perspective
A simple introduction to doing data science in the context of the Internet of Things.
May 27, 2019
#065 Data Engineering At CERN Case Study
A look into how CERN is doing Data Engineering. They get huge amounts of data from the Large Hadron Collider. Let's check it out.
May 27, 2019
#064 Data Engineering At Case Study
A look into how is doing data engineering.
May 27, 2019
#063 Data Engineering At Airbnb Case Study
A look into how Airbnb is doing Data Engineering.
May 27, 2019
#062 Data Engineering At Netflix Case Study
How Netflix is doing Data Engineering using their Keystone platform
May 27, 2019
#061 Reworking My Cookbook For Data Engineering
I decided to rework the cookbook focusing more on case studies and less on explaining tools. People keep asking me for a path to become a data engineer and, let's be honest, you will never achieve that with just knowledge of the tools. Finding out how companies do data engineering on their data science platforms is way more useful. Over the next weeks we will go over each study on my YouTube channel. The stuff we talk about will then go into the cookbook too.
May 27, 2019
#060 What Is Hadoop And Is Hadoop Still Relevant In 2019?
An introduction to Hadoop HDFS, YARN and MapReduce. Yes, Hadoop is still relevant in 2019, even if you look into serverless tools.
May 27, 2019
#059 A Look Into The Siemens Mindsphere IoT Platform?
The Internet of Things is a huge deal. There are many platforms available. But which one is actually good? Join me on a 50-minute dive into the Siemens Mindsphere online documentation. I have to say I was super unimpressed by what I found. Many limitations, unclear architecture and no pricing available? Not good!
May 27, 2019
#058 Guitars And Data Live Stream
A stream full of mediocre guitar playing and great Q&A about Hadoop. 
May 27, 2019
#057 Introducing The Plumbers Medium Publication
I have created a Medium Publication especially for us Plumbers of Data Science who work in Data Engineering and Big Data. It's called, you guessed it, Plumbers of Data Science.
May 27, 2019
#056 NoSQL Key Value Stores Explained With HBase
What is the difference between SQL and NoSQL? In this episode I use HBase as an example to show you how a key/value store works.
May 27, 2019
#055 Data Warehouse vs Data Lake
On this podcast I talk about data warehouses and data lakes. When do people use which? What are the pros and cons of both? Architecture examples for both and does it make sense to completely move to a data lake?
May 27, 2019
#054 How to Market Yourself in 2019 Student or Professional
In this episode I talk about how you can gain a competitive edge on the job market. It's super simple, you can and should start with it TODAY by putting yourself out there. 
May 27, 2019
#053 The Data Science Depression Is Coming? What You Can Do
The Data Science Hype is still strong. Where's the industry going, towards a cliff? Here's what you can do.
May 27, 2019
#052 Data Engineering Cookbook Live Stream
In this episode I show you the first version of my data engineering cookbook.
May 27, 2019
#051 Five Books To Buy As A Data Engineer & My Book Buying Strategy
Getting a book and reading it cover to cover is useless. In this episode I show you my strategy of buying books complementary to your work, and 5 great books I read over the years that helped me get where I am now.
May 27, 2019
#050 Data Engineer Scientist or Analyst Which One Is For You?
In this podcast we talk about the differences between data scientists, analysts and engineers. These are the three main data science jobs, and all three are super important.
May 27, 2019
#049 I Found A REAL Use For Blockchain, At Least I thought So
After all the BS solutions using Blockchain I thought I finally found one that makes sense. Of all the possibilities it's the EU data protection law GDPR. Well, one problem I overlooked in this podcast is that it is impossible to delete data after it is in the chain. Being able to delete data, however, is a requirement of GDPR. So, I was wrong. Again :D
May 27, 2019
#048 From Wannabe Data Scientist To Engineer My Journey
In this episode Kate Strachnyi interviews me for her humans of data science podcast. We talk about how I found out that I am more into the engineering part of data science. 
May 27, 2019
#047 The Truth About Data Science Salary For Graduates
In this episode I show you how much data science graduates are actually paid in Germany. All over the internet you can find that a Data Science salary is over 100k dollars, Data Engineer or Data Scientist. It's way lower than that. Then I give you a few really good tips on how to choose the right company to work for. Huge corporation, startup or small company? Here's how to choose.
May 27, 2019
#046 How To Use GitHub for LaTeX Version Control
In this podcast I am showing you how I use GitHub to write my Data Engineering Cookbook with LaTeX.
May 27, 2019
#045 Why I Use LaTeX to Write Professionally And You Should Too
What is the best editing tool to write a thesis, a dissertation or a paper? NOT Word or Pages! It's LaTeX. In today's video I show you why I decided to use LaTeX to write my data engineering cookbook. I used it before for my diploma thesis and I am in love again :) Here's the link to the cheatsheet: Check out my Patreon for the Data Engineering Cookbook: Music: "Day One" by Declan DP Attribution 3.0 Unported
December 7, 2018
#044 How to Increase Your Chances for Internships or a Full-time Job
You have certifications or a university degree, but can't find a job? Sharing your ideas and knowledge will increase your chances! Here's how you can do that. Music: "Day One" by Declan DP Attribution 3.0 Unported
November 27, 2018
#041 Agile Development Is Important But Please Don't Do Scrum
I love agile development. People keep telling you to do Scrum, like it's the only and best choice to be agile. It's not. Here's my take on Scrum and my four main beefs with it. Watch out for these issues if you are doing Scrum.
October 18, 2018
#040 Huge Big Data News! Cloudera and Hortonworks Merge
So, Cloudera and Hortonworks merge... In today's Plumbers of Data Science Podcast I talk about what these big data vendors do and how they enable companies, admins and developers to do data science and many more things. If you are interested in the whole Hadoop ecosystem you need to check out this episode. You won't regret it ;)
October 9, 2018
#039 Is ETL Dead For Data Science and Big Data?
Is ETL dead in Data Science and Big Data? In today's podcast I share with you my views on your questions regarding ETL (extract, transform, load). Data lakes and data warehouses, where is the difference? Is ETL still practiced, or did preprocessing and cleansing replace it? What would replace ETL in data engineering? How to become a data engineer? (Check out my Facebook note.) How to get experience training at home? Real-time analytics with RDBMS or HDFS?
October 3, 2018
#038 Morning advice to beginner Data Scientists and Data Engineers
What's the difference between Data Scientists & Data Analysts? What to do to find internships or a full-time job? Data Scientist and Engineer in large and small companies, where's the difference? Are Data Engineers generalists or specialists? Just some questions I go over in this podcast. You sent me over 100 questions, so I finally worked up the guts to start with the Q&A videos. Answering your questions one by one. Turns out it's a lot of fun :)
September 27, 2018
#037 How To Boost Teamwork With Version Control
Without the proper tools and techniques of version control the team's efficiency goes down the drain. In this episode I talk about how tools like Jira enable you to collect bugs, future features or change requests. How they enable you to create and organize versions, add items to a version and assign items to developers. Once this is done, the team can efficiently start coding with the help of source code management systems like GitHub. How does all that work? Check out this episode to find out :)
September 12, 2018
#036 Why Distributed Processing Is Super Important
You need to become comfortable with distributed processing. Data Science or the Internet of Things, the amount of data that is getting produced and processed grows like crazy. In this podcast I talk about what a platform for distributed processing looks like. I talk about the different layers that need parallelization, as well as the tools you can use for on-premise installations or clouds like AWS, Azure or Google Cloud: Big Data tools like Kafka and Spark, or serverless ones like Kinesis or Lambda functions.
September 10, 2018
#035 Learning By Doing Is The Best Thing Ever!
For me, school and university were hard. The lectures, sitting down and getting told how things work. Reading books and learning dry stuff was a drag. I was never good at writing tests. Some people excel at this. I was often envious. Over the years I found out what my problem is. I learn differently. I am a learning by doing guy. What does that mean and how am I dealing with it? Check out this episode. Maybe you have the same problem.
September 6, 2018
#034 Talent Stacks For Data Engineers
Becoming an expert in a single skill is not the way to go for a data engineer. In this episode I talk about which talents go well together, both technical and personal ones, so that you build up a stack of knowledge that will make you a great data engineer.
September 4, 2018
#033 How APIs Rule The World
Strong APIs make a good platform. In this episode I talk about why you need APIs and why Twitter is a great example. Especially JSON APIs are my personal favorite. Because JSON is also important in the Big Data world, for instance in log analytics. How? Check out this episode!
September 3, 2018
#032 How to Design Security Zones and Lambda Architecture
Security is everything! That's why today, I took some time to give you some tips about how to make a good design. The Lambda Architecture with stream and batch processing is one of the cornerstones for Big Data and Data Science. How does that fit into a security zone design? Check out this episode :)
August 30, 2018
#031 IT Networking Infrastructure and Linux
Understanding how information is transported over the network is super important. OS wise you will mostly encounter Linux, so here are some important Linux basics you need to know. Firewalls, ports, IP addresses, routers and switches are only a few of the things I talk about in this podcast. Networking infrastructure also matters for Big Data systems like Hadoop, Kafka and Spark.
August 29, 2018
#030 Why the hardware and the GPU is super important
Knowing the hardware is super important for a data engineer. Even if you are using cloud servers. CPU, RAM, GPU, HDD, SSD... Especially the GPU is a great help to Data Scientists who are doing machine learning.
August 28, 2018
#029 A New Mission
I am bringing the Podcast back! Let's call it season 2. New name, new mission: Helping you become a data engineer. Daily podcast, recorded in my car or my office, getting you up to speed ASAP.
August 27, 2018
4 Vs Of Big Data Are Enough!
8 V's, 10 V's, 12 V's. The best way to explain Big Data is to use the four V's: Volume, Velocity, Variety and Veracity. In this podcast episode I talk about why nobody needs 10 or more V's of big data, and how Big Data is almost a must-have to do data science and especially machine learning. The music in this episode is the song The Quiet Earth by Thomas Barrandon. Check out his awesome music on Bandcamp. He is also on Spotify ;) Have fun!
May 23, 2018
Why Companies Badly Need Data Scientists And Engineers
In this episode I give you my take on why companies badly need data scientists and engineers. Because in this data driven world, you can accomplish a lot with just a few people. All you need is a vision, some sense for business and a lot of skill.
May 18, 2018
What You Need To Know About Data Engineering
This podcast is all about what you as a data engineer really do. From building platforms to collaboration with data scientists and customers. Everything you need to know to get insight into a data engineer's life.
May 16, 2018
I'm a Big Data Engineer and it's Super Awesome!
There is this other data science job called data engineer and it's super important. Because data science does not equal data scientist. In today's podcast I talk about how I finally realized that data engineering is my real passion. Definitely check it out, chances are high that data engineering is your thing too.
May 15, 2018
BI vs Data Science vs Big Data
I have recently been asked: "What is the difference between BI, Data Science and Big Data?" So, I thought I'd make a quick podcast about this for you guys. I think this will help beginners especially.
April 4, 2018
How Much Big Data Do You Need To Learn As A Data Scientist?
Big Data tools are very important. But how deep should you really go as a data scientist? How do you best learn all this stuff? Some questions I try to answer in this episode.
March 20, 2018
Working With Time Series Data And Missing Values
Time series data is tricky. Especially if you have missing data. In this episode I talk about a few things you can do to handle this problem.
March 16, 2018
Don't Be Arrogant The Cloud is Safer Than Your On-Premise
A 5-minute rant on why the cloud is safe.
March 15, 2018
Hadoop For Data Scientists An Introduction
Hey Podcast, in this episode I talk about the core functions of Hadoop.
March 13, 2018
Three Methods of Streaming Data
There are three different methods of streaming: at least once, at most once and exactly once. Listen to why it makes a huge difference which one you use, because not every system or tool supports all three.
March 9, 2018
Dirty Data, Unicorn Scientists
Where can you get dirty data to practice cleaning it? What are unicorn data scientists, and what is THE skill you need if you aren't one?
February 26, 2018
DS Office Hours Nr. 3
How to define a data science problem.
February 9, 2018
NoSQL Vs SQL How To Choose
NoSQL databases like HBase are awesome! But why and when should you use them? How does a key value store like HBase work? Today I am talking about exactly that.
February 9, 2018
Creating A Gaming-AI-Bot
Creating a Gaming-AI with Reinforcement Learning
February 5, 2018
Losing $$ With Data Science
Losing money with data science in the short term does not matter. It's about the long run, not quick sales. This is a story about how this happened in the insurance industry, and how to go about turning this loss into a win.
January 31, 2018
Swish Sweden's Awesome Fintech
Swish, the Swedish Fintech that Blew My Mind in 2017
January 22, 2018
Data Science Office Hours
Chat and Q&A about data science with data scientists
January 18, 2018
Gartner's Hype Cycle Explained
It is very important for me to keep track of emerging technologies and trends. You want to know where the industry is headed. You don't want to miss the next big thing. This is where Gartner's Hype Cycles help a lot. Especially the hype cycle for emerging tech.
January 15, 2018
How to Show That ML and AI Works
How to convince people that machine learning actually works? It's simpler than you think. You have the data. Data doesn't lie!
January 12, 2018
Analytics on Edge Devices
Unavailable cellphone coverage really pissed me off. You cannot transmit data and do cloud-based analytics during that time. That is why edge devices in the field have to get more and more analytics capabilities built in. For instance, on a container ship.
January 11, 2018
BigData and Catastrophic Success
In this episode I talk about the 4 Vs of big data. And how Big Data can save you from catastrophic success.
January 8, 2018
Machine Learning In Production
Doing machine learning in production is very different than for proofs of concept or in education. One of the hardest parts is keeping models updated.
January 6, 2018
Data Science VS Big Data
Choose between Big Data & Data Science
January 3, 2018
Agriculture and DS DailyKayy 004
How Data Science Transforms Agriculture | DailyKayy 004
January 3, 2018
DS Preventing Insurance Fraud
Insurance companies use data science to detect fraudulent cases, saving themselves and their customers money.
December 15, 2017
How Data Transforms Healthcare
How Smartwatches and Fitbits Transform Healthcare
December 12, 2017
Learn Data Science Go Docker!
Docker is so awesome for beginners. Preconfigured images let you start coding in minutes. No annoying dev environment setup.
December 7, 2017
Measure Everything!
Social media, Product development. Get that data and analyze it! Start winning!
December 6, 2017
Prime Video, Tesla and
Prime Video X-Ray Feature, Tesla & Comma AI
November 29, 2017