The Data Life Podcast

By Sanket Gupta

This is a podcast where we talk all about real-life experiences of dealing with data and machine learning tools, techniques and personalities. We cover not just the technical aspects but also the "life" aspects of working in the field.

Note: Opinions expressed are my own and do not express the views or opinions of my employer.

27: Building Open Source Data Startup with Airbyte CEO, Michel Tricot

We talk with Michel Tricot, who is the Founder and CEO of Airbyte, which is an open source data integration Y Combinator startup. It has raised over $30M in capital and has been growing quite fast. It was a great conversation and I think you will also enjoy it. 🎉

We cover lots of things in the podcast including: 

1. Technical aspects of what Airbyte does, how it sits in the ETL/ELT landscape, and how it differs from other tools such as Fivetran, Stitch etc.

2. Data Warehouses being a canonical source of data and how Airbyte helps with bringing the data into the warehouse. 

3. How Airbyte works as an open source data tool. 

4. Life aspects of running a fast growing start-up including raising capital, hiring etc. 


Links to the tools/ services mentioned: 

1. Airbyte: airbyte.io

2. Airbyte Slack where you can talk with the team: slack.airbyte.io 

3. Dbt for transformation in ELT: getdbt.com 

4. Airflow which is a data orchestration tool: https://airflow.apache.org/

5. Astronomer which can host Airflow: https://astronomer.io/ 


Pay as you use data warehouses: 

6. Snowflake Data Warehouse: https://www.snowflake.com/

7. BigQuery Data Warehouse: https://cloud.google.com/bigquery 

Set up your own infrastructure: 

8. Redshift Data Warehouse: https://aws.amazon.com/redshift/ 

Oct 11, 2021 | 44:56
26: Building Data Engineering Pipelines at Scale (with Data Warehouse, Spark and Airflow)

Imagine you are hanging out at a beach, watching the waves come and go and looking at all the shells on the sand. And you get an idea: how about collecting these shells and making necklaces to sell? How would you go about doing this? Maybe you'd collect a few shells, make a small necklace and show it to a friend. This is where we begin our journey of learning about data engineering pipelines.

Using an example of running a necklace business from shells - we learn about the following data engineering concepts: 

1. ETL - Extract Transform Load vs ELT - Extract Load Transform concepts. Why Data Warehouses are great for analytics. 

2. Spark for large data processing and hosting / running

3. Data orchestration using Airflow
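
To make the ELT idea from item 1 concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for a data warehouse. The table names (raw_shells, shell_summary) and the sample rows are made up for illustration and do not come from the episode. The point is only the ordering: raw data is loaded first, then transformed with SQL inside the warehouse.

```python
import sqlite3

# Hypothetical raw records "extracted" at the beach: (shell_id, color, weight_grams)
raw_rows = [
    (1, "white", 4.0),
    (2, "pink", 6.5),
    (3, "white", 5.0),
    (4, "pink", 3.5),
]

# An in-memory SQLite database stands in for the warehouse (an assumption of this sketch).
warehouse = sqlite3.connect(":memory:")

# Load: store the raw data untouched.
warehouse.execute(
    "CREATE TABLE raw_shells (shell_id INTEGER, color TEXT, weight_grams REAL)"
)
warehouse.executemany("INSERT INTO raw_shells VALUES (?, ?, ?)", raw_rows)

# Transform: run SQL inside the warehouse after loading (the "T" comes last in ELT).
warehouse.execute(
    """
    CREATE TABLE shell_summary AS
    SELECT color, COUNT(*) AS shell_count, SUM(weight_grams) AS total_weight
    FROM raw_shells
    GROUP BY color
    """
)

summary = warehouse.execute(
    "SELECT color, shell_count, total_weight FROM shell_summary ORDER BY color"
).fetchall()
print(summary)
```

In an ETL pipeline the aggregation would instead run before the insert (in Python or Spark, for example), so only the summarized rows would reach the warehouse.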


My blog on Towards Data Science about moving from Pandas to Spark: https://towardsdatascience.com/moving-from-pandas-to-spark-7b0b7d956adb 

Great book to learn about Spark: https://www.amazon.com/dp/1492050040/?tag=omnilence-20 

Tools covered in the episode: 

dbt: https://www.getdbt.com/ 

Databricks: https://databricks.com/

EMR: https://aws.amazon.com/emr/

AWS Redshift: https://aws.amazon.com/redshift/

Snowflake: https://www.snowflake.com/

Delta Lake: https://databricks.com/product/delta-lake-on-databricks 

Aug 18, 2021 | 39:30
25: Talking Data Privacy with Jeff Bermant
Aug 04, 2021 | 28:11
24: Promoting Women in Tech - With Rupal Gupta

In this episode, we are talking about women in tech with Rupal Gupta. Rupal, a recent graduate of the Online MS in CS program at Georgia Tech, is a data engineer in the industry and is passionate about promoting women in tech. She also has some great tips and resources for anyone trying to break into data science and tech!

In this episode we talk about things that can help promote women in tech, women in tech conferences such as Grace Hopper, looking for jobs, resources to prepare for the interviews etc. 

If you want to reach out to Rupal for help or to collaborate on her project womenmentors.co, here is her LinkedIn: https://www.linkedin.com/in/rupalgupta15/

FREE Women in Tech Conference by Manning Publications on Oct 13th at 12pm ET on Twitch: https://freecontent.manning.com/livemanning-conferences-women-in-tech/ 🎉 There will be women in tech speakers from Dropbox, Microsoft, Warby Parker and more.

🌟 Programs and conferences covered in the episode:
OMSCS program at Georgia Tech: https://omscs.gatech.edu/
Grace Hopper conference: https://ghc.anitab.org/
Anita Borg Institute: https://anitab.org/

🌟 Interviewing resources:
1. Pramp: https://www.pramp.com/#/
2. Interviewing.io: https://interviewing.io/
3. Educative "Grokking the System Design Interview": https://www.educative.io/courses/grokking-the-system-design-interview
4. AWS Certifications: https://aws.amazon.com/certification/

Disclaimer: All opinions on this podcast are our own and not the views of our employers or organizations.

~Thanks for listening~

Oct 08, 2020 | 15:06
23: Let’s Talk AWS SageMaker for ML Model Deployment
Jun 17, 2020 | 19:46
22: Transfer Learning for NLP - With Paul Azunre

In this episode, we are talking with Paul Azunre. Paul is one of the world’s experts in the area of Transfer Learning for NLP and is also an author of the upcoming book Transfer Learning for NLP published by Manning Publications. In this episode we talk about things such as: 

1) Paul’s background and how his background in maths and optimization as well as fake news detection got him started in transfer learning in NLP.
2) How Paul got started with the book, book writing process as well as tips to the listeners for writing a technical book.
3) High level summary of transfer learning in both computer vision and NLP and why this is the ImageNet moment of NLP.
4) Why ML and NLP practitioners today should be excited about transfer learning (such as how students in Ghana are able to build their own Google Translate using transfer learning)
5) How BERT, ELMo and ALBERT work at the high level and how they differ from traditional techniques like Word2Vec or FastText.
6) Differences between BERT, ELMo and ALBERT.
7) What makes Paul’s new book a must-read for anyone interested in this field. 

✨Paul's Info👇

Paul’s Website: azunre.com (with all social media handles)
Please reach out to Paul if you have any questions about transfer learning in NLP or the book.

✨Chance for one of 2 free copies of Transfer Learning for NLP 🎉

Get a chance to win a free copy of Paul's book! Share this episode on Twitter and add my Twitter handle "sanket107" to it for a chance to win one of 2 free books. My Twitter: https://twitter.com/sanket107

✨Discount Code for all Manning Publications books! 🎊🤩

Special Link to get extra discount for Paul’s book:
https://www.manning.com/books/transfer-learning-for-natural-language-processing?a_aid=Omnilence&a_bid=d53fed17
As a listener of The Data Life Podcast, you can also go to this link http://www.manning.com/?a_aid=Omnilence to get any Manning book at a 40% discount with the code: poddlife20

This will help support this show as well and is much appreciated.

Thank you to Manning Publications, Paul, and the sponsors for making this show a reality.

~Thanks for listening~

Apr 13, 2020 | 46:47
21: Why Scikit-Learn and Keras are Awesome for ML

In this episode, we talk about why the two libraries Scikit-Learn and Keras are great for machine learning. These two libraries combined with Pandas form the 3 core libraries in Python for a data scientist today. 

We cover things like:

1)  Data Exploration and data cleaning - how Pandas and Jupyter notebooks provide a good way to get started here.
2) Data Transformation - how Scikit-Learn provides many useful functions like train_test_split, Scalers, PCA etc.
3) Data Fitting - how Scikit-Learn provides good shallow models and Keras provides great support to quickly get started with neural networks.
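
As a rough illustration of points 2 and 3, the sketch below chains train_test_split, StandardScaler, PCA and a shallow model in Scikit-Learn. The synthetic data, the choice of LogisticRegression and the parameter values are assumptions made for this example, not details from the episode; the Keras step is left out to keep it short.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, made up for illustration: 200 rows, 5 features, binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out 25% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The scaler and PCA are fit on the training split only, then reused on the test split.
model = make_pipeline(StandardScaler(), PCA(n_components=3), LogisticRegression())
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

Wrapping the steps in one pipeline keeps the transformations and the model together, which helps avoid accidentally fitting the scaler on test data.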

We also cover various tidbits on things to note when building ML pipelines and preparing models for deployment in production, so tune into the episode to find out!

Fantastic Resources:
1) Book by head of YouTube DS team Aurélien Géron:
https://www.amazon.com/dp/1492032646/?tag=omnilence-20
This is one of the best books I have read on this topic, as it covers practical tips incl. the Scikit-Learn API etc.
2) Developing Scikit-Learn estimators: https://scikit-learn.org/stable/developers/develop.html
3) Guide to Keras Sequential API: https://keras.io/getting-started/sequential-model-guide/
4) Guide to Keras Functional API: https://keras.io/getting-started/functional-api-guide/
5) My previous episode on Pandas: https://podcasts.apple.com/us/podcast/17-why-pandas-is-the-new-excel/id1453716761?i=1000454831790

Thanks for listening! Please consider supporting this podcast via the link at the end.

Jan 26, 2020 | 19:55
20: Yogi's Guide to Analytics - An Interview with Akshay Kanade

In this episode, we talk with Akshay Kanade. He is a business analyst working in New York City who likes taking a big-picture view of data and has very interesting spiritual views on data analytics and life in general. He is also a handwriting expert: he can read people’s handwriting and recognize a lot about their personalities.

In this interview we will cover several things such as: 
- How has being an analyst influenced Akshay's life?
- Introspection about data and analytics
- Taking a high-level view of data - connecting deep learning with deep thinking
- How people without a background in analytics can use their unique backgrounds for decisions
- Power of consciousness and spirituality at work 
- Hand-writing analysis and whether it is a science or an art

It was a fascinating conversation, and I took a lot away from hearing Akshay's viewpoints. This interview is a must-listen if you deal with data and analytics in your work.
Akshay's hand-writing analysis and mentorship website:
www.pradnyatantra.com (will be live soon)
Reach Akshay on LinkedIn at https://www.linkedin.com/in/akshaykanade06/

Some of Akshay's favorite books:
1. Autobiography of a Yogi https://www.amazon.com/dp/8120725247/?tag=omnilence-20
2. The Monk Who Sold His Ferrari https://www.amazon.com/dp/0062515675/?tag=omnilence-20
3. Mastery https://www.amazon.com/dp/B00A6G9CGG/?tag=omnilence-20

To add to this list, one of my favorite books is:
The Power of Now https://www.amazon.com/dp/B00A6G9CGG/?tag=omnilence-20

If you have any feedback drop me a note at thedatalifepodcast@gmail.com or reach me on LinkedIn at https://www.linkedin.com/in/sanketgupta107/

~ Thanks for listening~ 

Dec 01, 2019 | 35:44
19: Statistics and Data Science- An Interview with Patrick McClory
Nov 22, 2019 | 56:21
18: 5 Things to Consider for Master of Science (MS) in US

Nov 15, 2019 | 18:60
17: Why Pandas is the new Excel

The Data Life Podcast is a podcast where we talk all about real-life experiences with data and data science tools, techniques, models and personalities.

In this episode, we will talk about how Pandas is becoming a tool of choice for many data scientists for doing their data analysis work. We will explore how Pandas wins over Excel in several key areas that are important for businesses today:

1) Large dataset sizes
2) Different kinds of input formats such as JSON, CSV, HTML, SQL etc
3) Complex business logic
4) Linking data analysis work to websites and databases
5) Cost

Pandas has lots of helpful functions such as read_csv, read_json and read_sql that allow easy input of data into DataFrames. DataFrames have several useful methods like "describe", "value_counts", "groupby", "loc" and more that make it easy to understand your dataset. Pandas also supports plotting out of the box with the "plot" method.
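
A short sketch of the functions named above, using a tiny inline CSV so it runs without any external file. The column names and values are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical order data; in practice this would be a file path or URL.
csv_text = """customer,region,amount
alice,east,120
bob,west,80
carol,east,200
dave,west,50
"""

df = pd.read_csv(io.StringIO(csv_text))

# Summary statistics for the numeric columns.
print(df.describe())

# How many orders came from each region.
region_counts = df["region"].value_counts()

# Total order amount per region.
total_by_region = df.groupby("region")["amount"].sum()

# Label-based selection: rows with amount above 100, two columns only.
big_orders = df.loc[df["amount"] > 100, ["customer", "amount"]]

print(region_counts)
print(total_by_region)
print(big_orders)
```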

We also cover how Pandas differs from SQL in things like ease of handling time series data, visualizations and more.
Tune in to the episode to learn more about how Pandas might be the tool for your data analysis needs to take your business to the next level!

Fantastic Resources:
1) Book by Pandas creator Wes McKinney:
https://www.amazon.com/dp/1491957662/?tag=omnilence-20
2) Great workshop video by Kevin Markham at PyCon: https://www.youtube.com/watch?v=0hsKLYfyQZc
3) Input output methods for Pandas:  https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html
4) Comparison of some operations of Pandas with SQL https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_sql.html

Thanks for listening! Please consider supporting this podcast via the link at the end.

Oct 25, 2019 | 16:37
16: Getting Started with Natural Language Processing

Oct 05, 2019 | 19:31
15: Using Flask, REST API and Vue.js to build a Single Page Web Application

As a data scientist, you will work on machine learning models that are deployed on websites, usually wrapped in a REST API; these days this approach is also called a “micro-service”. For this reason it is important to know how backends and frontends work and how to build them. In this episode, we talk about building a note app as a Single Page Application (SPA), using Python's Flask library for the backend and Vue.js for the frontend. We use a REST API to communicate between them.

We cover the following topics in Q and A format:

1. Why should data scientists care about building frontends, backends and REST APIs?

2. What is a single page application?

3. Why Vue.js?

4. Why do we need server side code?

5. What is a REST API?

6. How does Flask help with building a REST API?


Then we go into the exact mechanics of building the SPA:

Step 1: Database setup 

Step 2: Write the REST API in Flask

Step 3: Postman setup and testing of the API

Step 4: Build frontend and write forms to get information 

Step 5: Build routing and login pages 

Step 6: Front end design and UI/UX 

Finally you can deploy both the server and client separately on AWS or Heroku so that other users can see it and use it. 


Dependencies: 

1) Flask to build server side REST APIs

2) SQLAlchemy, an ORM for accessing the database

3) Bcrypt for hashing user passwords to store in your database 

4) Vue for building frontend 

5) Bootstrap-Vue for using bootstrap with Vue.js

6) Axios to communicate via AJAX between client and server 

7) Vue CLI 3 to manage the tooling of the client 
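
Dependency 3 above is about storing password hashes rather than plain passwords. The sketch below shows the general idea of salted hashing and verification. Note that it substitutes PBKDF2 from Python's standard library for Bcrypt so that it runs without extra installs; the function names and the iteration count are assumptions of this example, not code from the episode.

```python
import hashlib
import hmac
import os
from typing import Tuple

ITERATIONS = 200_000  # illustrative work factor, not a value from the episode


def hash_password(password: str) -> Tuple[bytes, bytes]:
    """Return (salt, digest); both would be stored alongside the user record."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(candidate, digest)


salt, digest = hash_password("correct horse")
print(verify_password("correct horse", salt, digest))
print(verify_password("wrong guess", salt, digest))
```

With a Bcrypt library the flow is the same: hash at sign-up, store only the hash, and check the submitted password against it at login.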


Really awesome resources:

1) Learn Vue.JS from scratch by the awesome teacher Net Ninja - YouTube https://www.youtube.com/watch?v=5LYrN_cAJoA&list=PL4cUxeGkcC9gQcYgjhBoeQH7wiAyZNrYa&index=1

2) Building book recording app using Vue and Flask https://testdriven.io/blog/developing-a-single-page-app-with-flask-and-vuejs/#bootstrap-vue

3) Managing state in Vue.js including Vuex and simple global store: https://medium.com/fullstackio/managing-state-in-vue-js-23a0352b1c87

4) Authenticating a Flask API Using JSON Web Tokens - YouTube https://www.youtube.com/watch?v=J5bIPtEbS0Q

5) Really nice tutorial for using databases with Flask by Corey Schafer - YouTube https://www.youtube.com/watch?v=cYWiDiIUxQc&list=PL-osiE80TeTs4UjLw5MM6OjgkjFeUxCYH&index=4


If this has been of value please consider supporting me by buying me a coffee at the Anchor link at the end. If you support, I will provide extra bonus content for you. Thanks for listening!

Sep 16, 2019 | 20:39
14: Building a Character-Based Text Classifier

Ever wonder how to automatically detect language from a script? How does Google do it? 

Ever wonder how Amazon knows whether you are searching for a product or a SKU on its search bar? 

We look into character-based text classifiers in this episode. We cover 2 types of models. The first type is bag-of-words models such as Naive Bayes, logistic regression and a vanilla neural network. The second type is sequence models such as LSTMs. We cover how to prepare your characters for LSTMs, including one-hot encoding, padding and creating character embeddings, and then how to feed these into the LSTMs. We also cover how to set up and compile these sequence models.
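
The preparation steps mentioned above (mapping characters to ids, padding, one-hot encoding) can be sketched in plain Python. The helper names and the convention of reserving id 0 for padding are assumptions of this example; the model itself is not shown.

```python
def build_vocab(texts):
    """Map each character to an integer id; 0 is reserved for padding."""
    chars = sorted({ch for text in texts for ch in text})
    return {ch: idx + 1 for idx, ch in enumerate(chars)}


def encode(text, vocab, max_len):
    """Convert text to ids, truncate to max_len, and right-pad with 0.

    Characters missing from the vocabulary also map to 0 in this sketch.
    """
    ids = [vocab.get(ch, 0) for ch in text[:max_len]]
    return ids + [0] * (max_len - len(ids))


def one_hot(ids, vocab_size):
    """Turn each id into a vector of length vocab_size + 1 (slot 0 is padding)."""
    return [[1 if i == idx else 0 for i in range(vocab_size + 1)] for idx in ids]


texts = ["abc", "ca"]
vocab = build_vocab(texts)
padded = [encode(t, vocab, max_len=4) for t in texts]

print(vocab)
print(padded)
print(one_hot(padded[1], len(vocab)))
```

The padded id sequences are what an embedding layer would typically consume, while the one-hot vectors can be fed to a sequence model directly.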

Thanks for listening, and if you find this content useful, please leave a review and consider supporting this podcast from the link below. 

Aug 07, 2019 | 23:20
13: Statistics of A/B Testing

You and your team might spend a lot of time building a new feature. But how do you know if this feature will be liked by the users? One of the ways to statistically prove this is by using A/B testing. Listen to this episode to get tips, tricks and intuition behind hypothesis testing, alpha, beta, p-values, two-sample t-tests and more. 
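
For a feel of what a two-sample t-test computes, here is a minimal sketch of Welch's t statistic using only the standard library. The sample values are invented, and Welch's variant (which does not assume equal variances) is a choice made for this example rather than something the episode specifies.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the approximate degrees of freedom."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)
    se_a, se_b = var_a / n_a, var_b / n_b  # squared standard errors of each mean
    t_stat = (mean_b - mean_a) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t_stat, dof


# Hypothetical per-user session minutes for control (A) and the new feature (B).
control = [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 12.0]
variant = [13.0, 14.0, 12.0, 15.0, 14.0, 13.0, 14.0, 15.0]

t_stat, dof = welch_t(control, variant)
print(f"t = {t_stat:.2f}, degrees of freedom = {dof:.1f}")
```

Turning t into a p-value requires the t distribution, which the standard library does not provide; SciPy's scipy.stats.ttest_ind(a, b, equal_var=False) runs the same test and returns the p-value to compare against alpha.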

These insights come from experience deploying A/B tests in the field and from talking to experts.

These ideas are typically not covered in traditional A/B testing texts which tend to focus a lot on math without the intuition, and that's why I really wanted to cover it in this podcast episode. Thanks for listening! I'd really appreciate your support for this podcast. Follow the link below. 

Jul 17, 2019 | 21:23
12: Your Users Don't Care How Smart You Are

In this episode, we will talk about the importance of business impact in data science.  "Your users don't care how smart you are" was a quote I read that got me started in thinking about this. 

The right way to do data science is to think of users, revenue impact and business value, and to go for the simplest solution possible. The wrong way to do data science is to pick up a hammer and go looking for nails, rather than starting from the problem.

We will cover all this and more!

Amazon link of Inspired by Marty Cagan (a great read to get better at product thinking): https://www.amazon.com/dp/1119387507/?tag=omnilence-20

Please consider buying me a coffee if you find this content useful. Refer to the link at the bottom.

Jun 25, 2019 | 05:05
11: The Ten Essential Machine Learning Questions

This episode covers the ten essential machine learning questions. Disclaimer: Baseline answers have been provided in the episode for guidance. For complete accuracy, please refer to textbooks or to courses by Andrew Ng on Coursera. 

If this content is useful, please consider buying me a coffee via the link https://anchor.fm/the-data-life-podcast/support 

Resources:
1. Machine Learning Course by Andrew Ng: https://www.coursera.org/learn/machine-learning
2. Deep Learning Course by Andrew Ng: https://www.coursera.org/specializations/deep-learning

Questions:
1. What are underfitting and overfitting? How do you avoid them?
2. What is the difference between batch, SGD and mini-batch gradient descent? When would you use each?
3. How do you choose a machine learning model?
4. How do you improve the latency of a machine learning model in production?
5. If your training and cross-validation accuracies are high but your testing accuracy is lower, how would you debug this?
6. Name 3 hyper-parameters. Why can’t we train them as parameters? Why must humans set them?
7. Which metric should be used to evaluate a classifier? How do you connect it to business value?
8. What prevents someone from selecting a deep learning model for everything?
9. Say you have to classify a lot of data, but you don’t have labelled training examples. How would you begin to solve the problem? How many training data points are needed?
10. Say you have a perfectly working machine learning model. How do you deploy it in production? How do you check whether users will actually like it?
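
As a small companion to question 2, the sketch below fits a one-parameter model y = w * x with gradient descent, where only the batch size changes between the three variants. The toy data, learning rate and epoch count are assumptions chosen so the example converges; this is not the answer given in the episode.

```python
import random


def gradient_descent(xs, ys, batch_size, lr=0.05, epochs=200, seed=0):
    """Fit y = w * x by minimising mean squared error.

    batch_size == len(xs) -> batch gradient descent
    batch_size == 1       -> stochastic gradient descent (SGD)
    anything in between   -> mini-batch gradient descent
    """
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of mean((w * x - y) ** 2) with respect to w over this batch.
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad
    return w


xs = [0.5, 1.0, 1.5, 2.0]
ys = [1.0, 2.0, 3.0, 4.0]  # generated by y = 2x, so w should approach 2

for name, size in [("batch", len(xs)), ("mini-batch", 2), ("sgd", 1)]:
    print(name, round(gradient_descent(xs, ys, batch_size=size), 3))
```

Batch gradient descent makes one update per pass over the data, SGD makes one update per example, and mini-batch sits in between; on noisy real data the smaller batches give noisier but more frequent updates.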


Please leave a review on Apple Podcasts or wherever you listen to this.
Thanks for listening! 


Jun 21, 2019 | 19:09
Mining Twitter Data for Sentiment Analysis of Events
Jun 01, 2019 | 18:44
Don't Be Shy To Pursue Your Interest

In this episode, we will talk about things like Maslow's Hierarchy of Needs, and focussing on higher level needs such as satisfaction and achieving full potential. In the area of tech, data science and software development, admitting your interest could involve "shyness" as the next shiny cool thing is pursued by everyone. But if your interest is in a niche, don't let others stop you from putting in an effort to become great at it. 

Thanks for listening, and please show your support to keep this podcast going! 

May 19, 2019 | 05:00
Review of Udacity Nanodegrees - are they worth it?
May 03, 2019 | 13:03
6 Steps to Transition to Data Science from non-CS background

In this episode we will talk all about the various steps to transition to data science from non-computer-science backgrounds.
One of the main difficulties people from non-CS backgrounds face is how overwhelming the transition to the data science field can be. I talk about my own journey and share the 6 steps which can help you in your own data science career!

00:00 to 02:10: Introduction

02:11 to 06:00: My Background of moving to data science from electrical engineering

06:01 to 10:56: Steps 1 to 3 covering things like using external APIs, already processed datasets and performing full stack data science work

10:57 to 11:55: Break sponsored by Anchor

11:56 to End: Steps 4 to 6 covering things like math and statistics, machine learning pipelines and data structures & algorithms

Some useful links:

1) Andrew Ng Deep Learning Specialization Coursera https://www.coursera.org/specializations/deep-learning

2) Intro to Statistics by Sebastian Thrun https://www.udacity.com/course/intro-to-statistics--st101

3) Aurélien Géron's book on machine learning https://www.amazon.com/dp/1491962291/?tag=omnilence-20

4) Pramp for mock algorithm sessions on video https://www.pramp.com/ 

5) Leetcode for algorithm question datasets https://leetcode.com/ 

Some great datasets to get started in machine learning: 

6) MNIST for hand written digits https://www.kaggle.com/c/digit-recognizer

7) Iris dataset for flower classification http://archive.ics.uci.edu/ml/datasets/iris

8) IMDB movie reviews https://ai.stanford.edu/~amaas/data/sentiment/

Thanks for listening! 


Apr 20, 2019 | 15:59
The Top 5 Data Science Podcasts

Welcome! In this episode, we will cover some of the top data science podcasts, that have helped me a lot in my own journey, and hopefully will be helpful to you as well.
The top 5 podcasts are (linked to my favorite episodes):
1) AI in Industry with Daniel Faggella
2) This week in Machine Learning and AI (TWiML)
3) DataFramed
4) Data Skeptic
5) Talk Python to Me

Listen to the episode for the sixth bonus podcast!
If you think I should mention another podcast here, let me know and I will add it in the show notes!
Thanks for listening!
Apr 10, 2019 | 08:21
What I learnt building a data science course
Mar 30, 2019 | 18:17
Overview of Netflix and Spotify like recommendation engines
Mar 22, 2019 | 13:28
3 Mistakes to Avoid in a Machine Learning Project

You and your team might spend weeks or even months building a model. These are the 3 mistakes to avoid in your next machine learning project! This can save you a lot of time and effort in your next project. 


These tips come from experience deploying ML models in production as well as from hearing from experts in the field.


These tips and mistakes are typically not covered in traditional machine learning texts and courses, and that's why I really wanted to cover them in this podcast episode. I'd really appreciate your support for this podcast. Please visit the podcast webpage and support it, so that I can continue to develop podcast episodes. Thanks for listening!

Mar 15, 2019 | 10:18
Flask is a Great Tool for Full Stack Data Science
Mar 05, 2019 | 10:44
Hello, World!

To kick things off, I talk about the kind of topics you can expect to hear in this podcast. Welcome to The Data Life!

Feb 19, 2019 | 01:59