Networking is the most valuable career advancement skill in data science. And yet, almost paradoxically, most data scientists don’t spend any time on it at all. In some ways, that’s not terribly surprising: data science is a pretty technical field, and technical people often prefer not to go out of their way to seek social interactions. We tend to think of networking with other “primates who code” as a distraction at best, and an anxiety-inducing nightmare at worst.
So how can data scientists overcome that anxiety, tap into the value of network-building, and develop a brand for themselves in the data science community? That’s the question that brings us to this episode of the podcast. To answer it, I spoke with repeat guest Sanyam Bhutani, a top Kaggler, host of the Chai Time Data Science Show, and Machine Learning Engineer and AI Content Creator at H2O.ai, about the unorthodox networking strategies that he’s leveraged to become a fixture in the machine learning community, and to land his current role.
We’ve talked a lot about “full stack” data science on the podcast. To many, going full-stack is one of those long-term goals that we never get to. There are just too many algorithms and data structures and programming languages to know, and not enough time to figure out software engineering best practices around deployment and building app front-ends.
Fortunately, a new wave of data science tooling is now making full-stack data science much more accessible by allowing people with no software engineering background to build data apps quickly and easily. And arguably no company has had more explosive success at building this kind of tooling than Streamlit, which is why I wanted to sit down with Streamlit founder Adrien Treuille and gamification expert Tim Conkling to talk about their journey, and the importance of building flexible, full-stack data science apps.
It’s no secret that data science is an area where brand matters a lot.
In fact, if there’s one thing I’ve learned from A/B testing ways to help job-seekers get hired at SharpestMinds, it’s that blogging, having a good presence on social media, making open-source contributions, podcasting and speaking at meetups are among the best ways to get noticed by employers.
Brand matters. And if there’s one person who has a deep understanding of the value of brand in data science — and how to build one — it’s data scientist and YouTuber Ken Jee. Ken not only has experience as a data scientist and sports analyst, having worked at DraftKings and GE, but he’s also founded a number of companies — and his YouTube channel, with over 60,000 subscribers, is one of his main projects today.
For today’s episode, I spoke to Ken about brand-building strategies in data science, as well as job search tips for anyone looking to land their first data-related role.
If you’re interested in upping your coding game, or your data science game in general, then it’s worth taking some time to understand the process of learning itself.
And if there’s one company that’s studied the learning process more than almost anyone else, it’s Codecademy. With over 65 million users, Codecademy has developed a deep understanding of what it takes to get people to learn how to code, which is why I wanted to speak to their Head of Data Science, Cat Zhou, for this episode of the podcast.
Data science is about much more than Jupyter notebooks, because data science problems are about more than machine learning.
What data should I collect? How good does my model need to be to be “good enough” to solve my problem? What form should my project take for it to be useful? Should it be a dashboard, a live app, or something else entirely? How do I deploy it? How do I make sure something awful and unexpected doesn’t happen when it’s deployed in production?
None of these questions can be answered by importing sklearn and pandas and hacking away in a Jupyter notebook. Data science problems take a unique combination of business savvy and software engineering know-how, and that’s why Emmanuel Ameisen wrote a book called Building Machine Learning Powered Applications: Going from Idea to Product. Emmanuel is a machine learning engineer at Stripe, and formerly worked as Head of AI at Insight Data Science, where he oversaw the development of dozens of machine learning products.
Our conversation was focused on the missing links in most online data science education: business instinct, data exploration, model evaluation and deployment.
Project-building is the single most important activity that you can get up to if you’re trying to keep your machine learning skills sharp or break into data science. But a project won’t do you much good unless you can show it off effectively and get feedback to iterate on it — and until recently, there weren’t many places you could turn to to do that.
A recent open-source initiative called MadeWithML is trying to change that, by creating an easily shareable repository of crowdsourced data science and machine learning projects, and its founder, former Apple ML researcher and startup founder Goku Mohandas, sat down with me for this episode of the TDS podcast to discuss data science projects, his experiences doing research in industry, and the MadeWithML project.
It’s cliché to say that data cleaning accounts for 80% of a data scientist’s job, but it’s directionally true.
That’s too bad, because fun things like data exploration, visualization and modelling are the reason most people get into data science. So it’s a good thing that there’s a major push underway in industry to automate data cleaning as much as possible.
One of the leaders of that effort is Ihab Ilyas, a professor at the University of Waterloo and founder of two companies, Tamr and Inductiv, both of which are focused on the early stages of the data science lifecycle: data cleaning and data integration. Ihab knows an awful lot about data cleaning and data engineering, and has some really great insights to share about the future direction of the space — including what work is left for data scientists, once you automate away data cleaning.
There’s been a lot of talk about the future direction of data science, and for good reason. The space is finally coming into its own, and as the Wild West phase of the mid-2010s well and truly comes to an end, there’s keen interest among data professionals to stay ahead of the curve, and understand what their jobs are likely to look like 2, 5 and 10 years down the road.
And amid all the noise, one trend is clearly emerging, and has already materialized to a significant degree: as more and more of the data science lifecycle is automated or abstracted away, data professionals can afford to spend more time adding value to companies in more strategic ways. One way to do this is to invest your time deepening your subject matter expertise, and mastering the business side of the equation. Another is to double down on technical skills, and focus on owning more and more of the data stack — particularly including the productionization and deployment stages.
My guest for today’s episode of the Towards Data Science podcast has been down both of these paths, first as a business-focused data scientist at Spotify, where he spent his time defining business metrics and evaluating products, and second as a data engineer at Better.com, where his focus has shifted towards productionization and engineering. During our chat, Kenny shared his insights about the relative merits of each approach, and the future of the field.
Reinforcement learning has gotten a lot of attention recently, thanks in large part to systems like AlphaGo and AlphaZero, which have highlighted its immense potential in dramatic ways. And while the RL systems we’ve developed have accomplished some impressive feats, they’ve done so in a fairly naive way. Specifically, they haven’t tended to confront multi-agent problems, which require collaboration and competition. But even when multi-agent problems have been tackled, they’ve been addressed using agents that just assume other agents are an uncontrollable part of the environment, rather than entities with rich internal structures that can be reasoned and communicated with.
That’s all finally changing, with new research into the field of multi-agent RL, led in part by OpenAI, Oxford and Google alum, and current FAIR research scientist Jakob Foerster. Jakob’s research is aimed specifically at understanding how reinforcement learning agents can learn to collaborate better and navigate complex environments that include other agents, whose behavior they try to model. In essence, Jakob is working on giving RL agents a theory of mind.
Data science can look very different from one company to the next, and it’s generally difficult to get a consistent opinion on the question of what a data scientist really is.
That’s why it’s so important to speak with data scientists who apply their craft at different organizations — from startups to enterprises. Getting exposure to the full spectrum of roles and responsibilities that data scientists are called on to execute is the only way to distill data science down to its essence.
That’s why I wanted to chat with Ian Scott, Chief Science Officer at Deloitte Omnia, Deloitte’s AI practice. Ian was doing data science as far back as the late 1980s, when he was applying statistical modeling to data from experimental high energy physics as part of his PhD work at Harvard. Since then, he’s occupied strategic roles at a number of companies, most recently including Deloitte, where he leads significant machine learning and data science projects.
Machine learning in grad school and machine learning in industry are very different beasts. In industry, deployment and data collection become key, and the only thing that matters is whether you can deliver a product that real customers want, fast enough to meet internal deadlines. In grad school, there’s a different kind of pressure, focused on algorithm development and novelty. It’s often difficult to know which path you might be best suited for, but that’s why it can be so useful to speak with people who’ve done both — and bonus points if their academic research experience comes from one of the top universities in the world.
For today’s episode of the Towards Data Science podcast, I sat down with Will Grathwohl, a PhD student at the University of Toronto, student researcher at Google AI, and alum of MIT and OpenAI. Will has seen cutting edge machine learning research in industry and academic settings, and has some great insights to share about the differences between the two environments. He’s also recently published an article on the fascinating topic of energy models in which he and his co-authors propose a unique way of thinking about generative models that achieves state-of-the-art performance in computer vision tasks.
One of the themes that I’ve seen come up increasingly in the past few months is the critical importance of product thinking in data science. As new and aspiring data scientists deepen their technical skill sets and invest countless hours doing practice problems on LeetCode, product thinking has emerged as a pretty serious blind spot for many applicants. That blind spot has become increasingly critical as new tools have emerged that abstract away a lot of what used to be the day-to-day gruntwork of data science, allowing data scientists more time to develop subject matter expertise and focus on the business value side of the product equation.
If there’s one company that’s made a name for itself for leading the way on product-centric thinking in data science, it’s Shopify. And if there’s one person at Shopify who’s spent the most time thinking about product-centered data science, it’s Shopify’s Head of Data Science and Engineering, Solmaz Shahalizadeh. Solmaz has had an impressive career arc, which included joining Shopify in its pre-IPO days, back in 2013, and seeing the Shopify data science team grow from a handful of people to a pivotal organization-wide effort that tens of thousands of merchants rely on to earn a living today.
Machine learning isn’t rocket science, unless you’re doing it at NASA. And if you happen to be doing data science at NASA, you have something in common with David Meza, my guest for today’s episode of the podcast.
David has spent his NASA career focused on optimizing the flow of information through NASA’s many databases, and ensuring that that data is harnessed with machine learning and analytics. His current focus is on people analytics, which involves tracking the skills and competencies of employees across NASA, to detect people who have abilities that could be used in new or unexpected ways to meet needs that the organization has or might develop.
Nick Pogrebnyakov is a Senior Data Scientist at Thomson Reuters, an Associate Professor at Copenhagen Business School, and the founder of Leverness, a marketplace where experienced machine learning developers can find contract work with companies. He’s a busy man, but he agreed to sit down with me for today’s TDS podcast episode, to talk about his day job at Reuters, as well as the machine learning and data science job landscape.
One Thursday afternoon in 2015, I got a spontaneous notification on my phone telling me how long it would take to drive to my favourite restaurant under current traffic conditions. This was alarming, not only because it implied that my phone had figured out what my favourite restaurant was without ever asking explicitly, but also because it suggested that my phone knew enough about my eating habits to realize that I liked to go out to dinner on Thursdays specifically.
As our phones, our laptops and our Amazon Echos collect increasing amounts of data about us — and impute even more — data privacy is becoming a greater and greater concern for research as well as government and industry applications. That’s why I wanted to speak to Harvard PhD student and frequent Towards Data Science contributor Matthew Stewart to get an introduction to some of the key principles behind data privacy. Matthew is a prolific blogger, and his research work at Harvard is focused on applications of machine learning to environmental sciences, a topic we also discuss during this episode.
There’s been a lot of talk in data science circles about techniques like AutoML, which are dramatically reducing the time it takes for data scientists to train and tune models, and create reliable experiments. But that trend towards increased automation, greater robustness and reliability doesn’t end with machine learning: increasingly, companies are focusing their attention on automating earlier parts of the data lifecycle, including the critical task of data engineering.
Today, many data engineers are unicorns: they not only have to understand the needs of their customers, but also how to work with data, and what software engineering tools and best practices to use to set up and monitor their pipelines. Pipeline monitoring in particular is time-consuming and, just as important, isn’t a particularly fun thing to do. Luckily, people like Sean Knapp — a former Googler turned founder of data engineering startup Ascend.io — are leading the charge to make automated data pipeline monitoring a reality.
We had Sean on this latest episode of the Towards Data Science podcast to talk about data engineering: where it’s at, where it’s going, and what data scientists should really know about it to be prepared for the future.
For the last decade, advances in machine learning have come from two things: improved compute power and better algorithms. These two areas have become somewhat siloed in most people’s thinking: we tend to imagine that there are people who build hardware, and people who make algorithms, and that there isn’t much overlap between the two.
But this picture is wrong. Hardware constraints can and do inform algorithm design, and algorithms can be used to optimize hardware. Increasingly, compute and modelling are being optimized together, by people with expertise in both areas.
My guest today is one of the world’s leading experts on hardware/software integration for machine learning applications. Max Welling is a former physicist and currently works as VP Technologies at Qualcomm, a world-leading chip manufacturer, in addition to which he’s also a machine learning researcher with affiliations at UC Irvine, CIFAR and the University of Amsterdam.
Coronavirus quarantines have fundamentally changed the dynamics of learning, and the dynamics of the job search. Just a few months ago, in-person bootcamps, college programs and live networking events where people exchanged handshakes and business cards were the way the world worked. Now, no longer. With that in mind, many aspiring techies are asking themselves how they should adjust their game plan to keep up with learning or land that next job, given the constraints of an ongoing pandemic and an impending economic downturn.
That’s why I wanted to talk to Rubén Harris, CEO and co-founder of Career Karma, a startup that helps aspiring developers find the best coding bootcamp for them. He’s got a great perspective to share on the special psychological and practical challenges of navigating self-learning and the job search, and he was kind enough to make the time to chat with me for this latest episode of the Towards Data Science podcast.
One great way to get ahead in your career is to make good bets on what technologies are going to become important in the future, and to invest time in learning them. If that sounds like something you want to do, then you should definitely be paying attention to graph databases.
Graph databases aren’t exactly new, but they’ve become increasingly important as graph data (data that describe interconnected networks of things) has become more widely available than ever. Social media, supply chains, mobile device tracking, economics and many more fields are generating more graph data than ever before, and buried in these datasets are potential solutions for many of our biggest problems.
That’s why I was so excited to speak with Denise Gosnell and Matthias Broecheler, respectively the Chief Data Officer and Chief Technologist at DataStax, a company specialized in solving data engineering problems for enterprises. Apart from their extensive experience working with graph databases at DataStax, Denise and Matthias have also recently written a book called The Practitioner’s Guide to Graph Data, and were kind enough to make the time for a discussion about the basics of data engineering and graph data for this episode of the Towards Data Science Podcast.
One of the most interesting recent trends in machine learning has been the combination of different types of data in order to be able to unlock new use cases for deep learning. If the 2010s were the decade of computer vision and voice recognition, the 2020s may very well be the decade we finally figure out how to make machines that can see and hear the world around them, making them that much more context-aware and potentially even humanlike.
The push towards integrating diverse data sources has received a lot of attention, from academics as well as companies. One of those companies is Twenty Billion Neurons, whose founder, Roland Memisevic, is our guest for this latest episode of the Towards Data Science podcast. Roland is a former academic who’s been knee-deep in deep learning since well before the hype that was sparked by AlexNet in 2012. His company has been working on deep learning-powered developer tools, as well as an automated fitness coach that combines video and audio data to keep users engaged throughout their workout routines.
If I were to ask you to explain why you’re reading this blog post, you could answer in many different ways.
For example, you could tell me “it’s because I felt like it”, or “because my neurons fired in a specific way that led me to click on the link that was advertised to me”. Or you might go even deeper and relate your answer to the fundamental laws of quantum physics.
The point is, explanations need to be targeted to a certain level of abstraction in order to be effective.
That’s true in life, but it’s also true in machine learning, where explainable AI is getting more and more attention as a way to ensure that models are working properly, in a way that makes sense to us. Understanding explainability and how to leverage it is becoming increasingly important, and that’s why I wanted to speak with Bahador Khaleghi, a data scientist at H2O.ai whose technical focus is on explainability and interpretability in machine learning.
Most of us want to change our identities. And we usually have an idealized version of ourselves that we aspire to become — one who’s fitter, smarter, healthier, more famous, wealthier, more centered, or whatever.
But you can’t change your identity in a fundamental way without also changing what you do in your day-to-day life. You don’t get fitter without working out regularly. You don’t get smarter without studying regularly.
To change yourself, you must first change your habits. But how do you do that?
Recently, books like Atomic Habits and Deep Work have focused on answering that question in general terms, and they’re definitely worth reading. But habit formation in the context of data science, analytics, machine learning, and startups comes with a unique set of challenges, and deserves attention in its own right. And that’s why I wanted to sit down with today’s guest, Russell Pollari.
Russell may now be the CTO of the world’s largest marketplace for income-share mentorships (and the very same company I work at every day!) but he was once — and not too long ago — a physics PhD student with next to no coding ability and a classic case of the grad school blues. To get to where he is today, he’s had to learn a lot, and in his quest to optimize that process, he’s focused a lot of his attention on habit formation and self-improvement in the context of tech, data science and startups.
Revenues drop unexpectedly, and management pulls aside the data science team into a room. The team is given its marching orders: “your job,” they’re told, “is to find out what the hell is going on with our purchase orders.”
That’s a very open-ended question, of course, because revenues could drop for any number of reasons. Prices may have increased. A new user interface might be confusing potential customers. Seasonality effects might have to be considered. The source of the problem could be, well, anything.
That’s often the position data scientists find themselves in: rather than having a clear A/B test to analyze, they frequently are in the business of combing through user funnels to ensure that each stage is working as expected.
It takes a very detail-oriented and business-savvy team to pull off an investigation with that broad a scope, but that’s exactly what Medium has: a group of product-minded data scientists dedicated to investigating anomalies and identifying growth opportunities hidden in heaps of user data. They were kind enough to chat with me and talk about how Medium does data science for this episode of the Towards Data Science podcast.
If you want to know where data science is heading, it helps to know where it’s been. Very few people have that kind of historical perspective, and even fewer combine it with an understanding of cutting-edge tooling that hints at the direction the field might be taking in the future.
Luckily for us, one of them is Cameron Davidson-Pilon, the former Director of Data Science at Shopify. Cameron has been knee-deep in data science and estimation theory since 2012, when the space was still coming into its own. He’s got a great high-level perspective not only on technical issues but also on hiring and team-building, and he was kind enough to join us for today’s episode of the Towards Data Science podcast.
It’s easy to think of data science as a purely technical discipline: after all, it exists at the intersection of a number of genuinely technical topics, from statistics to programming to machine learning.
But there’s much more to data science and analytics than solving technical problems — and there’s much more to the data science job search than coding challenges and Kaggle competitions as well. Landing a job or a promotion as a data scientist calls on a ton of career skills and soft skills that many people don’t spend nearly enough time honing.
On this episode of the podcast, I spoke with Emily Robinson, an experienced data scientist and blogger with a pedigree that includes Etsy and DataCamp, about career-building strategies. Emily’s got a lot to say about the topic, particularly since she just finished authoring a book entitled “Build a Career in Data Science” with her co-author Jacqueline Nolis. The book explores a lot of great, practical strategies for moving data science careers forward, many of which we discussed during our conversation.
Most of us believe that decisions that affect us should be made rationally: they should be reached by following a reasoning process that combines data we trust with a logic that we find acceptable.
As long as human beings are making these decisions, we can probe at that reasoning to find out whether we agree with it. We can ask why we were denied that bank loan, or why a judge handed down a particular sentence, for example.
Today however, machine learning is automating away more and more of these important decisions, and as a result, our lives are increasingly governed by decision-making processes that we can’t interrogate or understand. Worse, machine learning algorithms can exhibit bias or make serious mistakes, so a black-box-ocracy risks becoming more like a dystopia than even the most imperfect human-designed systems we have today.
That’s why AI ethics and AI safety have drawn so much attention in recent years, and why I was so excited to talk to Alayna Kennedy, a data scientist at IBM whose work is focused on the ethics of machine learning, and the risks associated with ML-based decision-making. Alayna has consulted with key players in the US government’s AI effort, and has expertise applying machine learning in industry as well, through previous work on neural network modelling and fraud detection.
In mid-January, China launched an official investigation into a string of unusual pneumonia cases in Hubei province. Within two months, that cluster of cases would snowball into a full-blown pandemic, with hundreds of thousands — perhaps even millions — of infections worldwide, with the potential to unleash a wave of economic damage not seen since the 1918 Spanish influenza or the Great Depression.
The exponential growth that led us from a few isolated infections to where we are today is profoundly counterintuitive. And it poses many challenges for the epidemiologists who need to pin down the transmission characteristics of the coronavirus, and for the policy makers who must act on their recommendations, and convince a generally complacent public to implement life-saving social distancing measures.
With the coronavirus outbreak in full swing, I thought now would be a great time to reach out to Jeremy Howard, co-founder of the incredibly popular Fast.ai machine learning education site. Along with his co-founder Rachel Thomas, Jeremy authored a now-viral report outlining a data-driven case for concern regarding the coronavirus.
It’s easy to think of data scientists as “people who explore and model data”. But in reality, the job description is much more flexible: your job as a data scientist is to solve problems that people actually have with data.
You’ll notice that I wrote “problems that people actually have” rather than “build models”. It’s relatively rare that the problems people have actually need to be solved using a predictive model. Instead, a good visualization or interactive chart is almost always the first step of the problem-solving process, and can often be the last as well.
And you know who understands visualization strategy really, really well? Plotly, that’s who. Plotly is a company that builds a ton of great open-source visualization, exploration and data infrastructure tools (and some proprietary commercial ones, too). Today, their tooling is being used by over 50 million people worldwide, and they’ve developed a number of tools and libraries that are now industry standard. So you can imagine how excited I was to speak with Plotly co-founder and Chief Product Officer Chris Parmer.
Chris had some great insights to share about data science and analytics tooling, including the future direction he sees the space moving in. But as his job title suggests, he’s also focused on another key characteristic that all great data scientists develop early on: product instinct (AKA: “knowing what to build next”).
Most machine learning models are used in roughly the same way: they take a complex, high-dimensional input (like a data table, an image, or a body of text) and return something very simple (a classification or regression output, or a set of cluster centroids). That makes machine learning ideal for automating repetitive tasks that might historically have been carried out only by humans.
But this strategy may not be the most exciting application of machine learning in the future: increasingly, researchers and even industry players are experimenting with generative models that produce much more complex outputs, like images and text, from scratch. These models are effectively carrying out a creative process — and mastering that process hugely widens the scope of what can be accomplished by machines.
My guest today is Xander Steenbrugge, and his focus is on the creative side of machine learning. In addition to consulting with large companies to help them put state-of-the-art machine learning models into production, he’s focused a lot of his work on more philosophical and interdisciplinary questions — including the interaction between art and machine learning. For that reason, our conversation went in an unusually philosophical direction, covering everything from the structure of language, to what makes natural language comprehension more challenging than computer vision, to the emergence of artificial general intelligence, and how all these things connect to the current state of the art in machine learning.
I can’t remember how many times I’ve forgotten something important.
I’m sure it’s a regular occurrence though: I constantly forget valuable life lessons, technical concepts and useful bits of statistical theory. What’s worse, I often forget these things after working bloody hard to learn them, so my forgetfulness is just a giant waste of time and energy.
That’s why I jumped at the chance to chat with Iain Harlow, VP of Science at Cerego — a company that helps businesses build training courses for their employees by optimizing the way information is served to maximize retention and learning outcomes.
Iain knows a lot about learning and has some great insights to share about how you can optimize your own learning, but he’s also got a lot of expertise solving data science problems and hiring data scientists — two things that he focuses on in his work at Cerego. He’s also a veteran of the academic world, and has some interesting observations to share about the difference between research in academia and research in industry.
You train your model. You check its performance with a validation set. You tweak its hyperparameters, engineer some features and repeat. Finally, you try it out on a test set, and it works great!
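The loop described above can be sketched in deliberately toy form: a one-feature “model” whose only hyperparameter is a decision threshold, tuned on a validation split of synthetic data. Everything here (the dataset, the split sizes, the threshold grid) is illustrative, not taken from any real project:

```python
import random

random.seed(0)

# Synthetic 1-D data: the true label is 1 when the feature exceeds a hidden
# threshold of 0.6, with ~10% of low-feature examples mislabeled as noise.
data = [(x, int(x > 0.6 or random.random() < 0.1))
        for x in [random.random() for _ in range(300)]]

# Train / validation / test splits. The test set is held out until the end.
train, val, test = data[:200], data[200:250], data[250:]

def accuracy(threshold, split):
    """Fraction of examples the threshold 'model' classifies correctly."""
    return sum((x > threshold) == bool(y) for x, y in split) / len(split)

# "Tweak hyperparameters and repeat": choose the decision threshold that
# scores best on the validation set, without ever touching the test set.
best = max([t / 20 for t in range(1, 20)], key=lambda t: accuracy(t, val))

print(f"chosen threshold: {best:.2f}")
print(f"test accuracy: {accuracy(best, test):.2f}")
```

The point of the split is exactly what the paragraph above describes: the validation set absorbs the tuning, so the final test score is an honest estimate — and, as the next paragraph argues, even that honest estimate is no longer the end of the job.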
Problem solved? Well, probably not.
Five years ago, your job as a data scientist might have ended here, but increasingly, the data science life cycle is expanding to include the steps after basic testing. This shouldn’t come as a surprise: now that machine learning models are being used for life-or-death and mission-critical applications, there’s growing pressure on data scientists and machine learning engineers to ensure that effects like feature drift are addressed reliably, that data science experiments are replicable, and that data infrastructure is reliable.
This episode’s guest is Luke Marsden, and he’s made these problems the focus of this work. Luke is the founder and CEO of Dotscience, a data infrastructure startup that’s creating a git-like tool for data science version control. Luke has spent most of his professional life working on infrastructure problems at scale, and has a lot to say about the direction data science and MLOps are heading in.
When I think of the trends I’ve seen in data science over the last few years, perhaps the most significant and hardest to ignore has been the increased focus on deployment and productionization of models. Not all companies need models deployed to production, of course, but at those that do, there’s increasing pressure on data science teams to deliver software engineering along with machine learning solutions.
That’s why I wanted to sit down with Adam Waksman, Head of Core Technology at Foursquare. Foursquare is a company built on data and machine learning: they were one of the first fully scaled social media-powered recommendation services that gained real traction, and now help over 50 million people find restaurants and services in countries around the world.
Our conversation covered a lot of ground, from the interaction between software engineering and data science, to what he looks for in new hires, to the future of the field as a whole.
Podcast interview with one of our top data science writers, Will Koehrsen.
Let’s go! Here’s Will’s article about what he learned from writing a data science article every week for a year: https://towardsdatascience.com/what-i-learned-from-writing-a-data-science-article-every-week-for-a-year-201c0357e0ce
This episode was hosted by YK from CS Dojo: https://www.instagram.com/ykdojo/
Getting hired as a data scientist, machine learning engineer or data analyst is hard. And if there’s one person who’s spent a *lot* of time thinking about why that is, and what you can do about it if you’re trying to break into the field, it’s Edouard Harris.
Ed is the co-founder of SharpestMinds, a data science mentorship program that’s free until you get a job. He also happens to be my brother, which makes this our most nepotistic episode yet.
If there’s one trend that not nearly enough data scientists seem to be paying attention to heading into 2020, it’s this: data scientists are becoming product people.
Five years ago, that wasn’t the case at all: data science and machine learning were all the rage, and managers were impressed by fancy analytics and over-engineered predictive models. Today, a healthy dose of reality has set in, and most companies see data science as a means to an end: it’s a way of improving the experience of real users and real, paying customers, not a magical tool whose coolness is self-justifying.
At the same time, as more and more tools continue to make it easier and easier for people who aren’t data scientists to build and use predictive models, data scientists are going to have to get good at new things. And that means two things: product instinct, and data storytelling.
That’s why we wanted to chat with Nate Nichols, a data scientist turned VP of Product Architecture at Narrative Science — a company that’s focused on addressing data communication. Nate is also the co-author of Let Your People Be People, a (free) book on data storytelling.
In this podcast interview, YK (aka CS Dojo) asks Ian Xiao about why he thinks machine learning is more boring than you may think.
Original article: https://towardsdatascience.com/data-science-is-boring-1d43473e353e
The other day, I interviewed Jeremie Harris, a SharpestMinds cofounder, for the Towards Data Science podcast and YouTube channel. SharpestMinds is a startup that helps people who are looking for data science jobs by finding mentors for them.
In my opinion, their system is interesting in that a mentor only gets paid when their mentee lands a data science job. I wanted to interview Jeremie because I had previously spoken to him on a different occasion, and I wanted to personally learn more about his story, as well as his thoughts on today’s data science job market.
One question I’ve been getting a lot lately is whether graduate degrees — especially PhDs — are necessary in order to land a job in data science. Of course, education requirements vary widely from company to company, which is why I think the most informative answers to this question tend to come not from recruiters or hiring managers, but from data scientists with those fancy degrees, who can speak to whether they were actually useful.
That’s far from the only reason I wanted to sit down with Rachael Tatman for this episode of the podcast though. In addition to holding a PhD in computational sociolinguistics, Rachael is a data scientist at Kaggle, and a popular livestreaming coder (check out her Twitch stream here). She has a lot of great insights about breaking into data science, how to get the most out of Kaggle, the future of NLP, and yes, the value of graduate degrees for data science roles.
One thing that you might not realize if you haven’t worked as a data scientist at a very large company is that the problems that arise at enterprise scale (as well as the skills needed to solve them) are completely different from those you’re likely to run into at a startup.
Scale is a great thing for many reasons: it means access to more data sources, and usually more resources for compute and storage. But big companies can take advantage of these things only by fostering successful collaboration between and among large teams (which is really, really hard), and have to contend with unique data sanitation challenges that can’t be addressed without reinventing practically the entire data science life cycle.
So I’d say it’s a good thing we booked Sanjeev Sharma, Vice President of Data Modernization and Strategy at Delphix, for today’s episode. Sanjeev’s specialty is helping huge companies with significant technical debt modernize and upgrade their data pipelines, and he’s seen the ins and outs of data science at enterprise scale for longer than almost anyone.
A few years ago, there really wasn’t much of a difference between data science in theory and in practice: a Jupyter notebook and a couple of imports were all you really needed to do meaningful data science work. Today, as the classroom overlaps less and less with the realities of industry, it’s becoming more and more important for data scientists to develop the ability to learn independently and go off the beaten path.
Few people have done so as effectively as Sanyam Bhutani, who among other things is an incoming ML engineer at H2O.ai, a top-1% Kaggler, popular blogger and host of the Chai Time Data Science Podcast. Sanyam has a unique perspective on the mismatch between what’s taught in the classroom and what’s required in industry: he started doing ML contract work while still in undergrad, and has interviewed some of the world’s top-ranked Kagglers to better understand where the rubber meets the data science road.
The trend towards model deployment, engineering and just generally building “stuff that works” is just the latest step in the evolution of the now-maturing world of data science. It’s almost guaranteed not to be the last one though, and staying ahead of the data science curve means keeping an eye on what trends might be just around the corner. That’s why we asked Ben Lorica, O’Reilly Media’s Chief Data Scientist, to join us on the podcast.
Not only does Ben have a mile-high view of the data science world (he advises about a dozen startups and organizes multiple world-class conferences), but he also has a perspective that spans two decades of data science evolution.
Each week, I have dozens of conversations with people who are trying to break into data science. The main topic of the conversations varies, but it’s rare that I walk away without getting a question like, “Do you think I have a shot in data science given my unusual background in [finance/physics/stats/economics/etc]?”.
From now on, my answer to that question will be to point them to today’s guest, George John Jordan Thomas Aquinas Hayward.
George [names omitted] Hayward’s data science career is a testament to the power of branding and storytelling. After completing a JD/MBA at Stanford and reaching top-ranked status in Hackerrank’s SQL challenges, he went on to work on contract for a startup at Google, and subsequently for a number of other companies. Now, you might be tempted to ask how comedy and law could possibly lead to a data science career.
For today’s podcast, we spoke with someone who is laser-focused on considering this second possibility: the idea that data science is becoming an engineer’s game. Serkan Piantino served as the Director of Engineering for Facebook AI Research, and now runs machine learning infrastructure startup Spell. Their goal is to make dev tools for data scientists that make it as easy to train models on the cloud as it is to train them locally. That experience, combined with his time at Facebook, has given him a unique perspective on the engineering best practices that data scientists should use, and the future of the field as a whole.
I’ve said it before and I’ll say it again: “data science” is an ambiguous job title. People use the term to refer to data science, data engineering, machine learning engineering and analytics roles, and that’s bad enough. But worse still, being a “data scientist” means completely different things depending on the scale and stage of the company you’re working at. A data scientist at a small startup might have almost nothing in common with a data scientist at a massive enterprise company, for example.
So today, we decided to talk to someone who’s seen data science at both scales. Jay Feng started his career working in analytics and data science at Jobr, which was acquired by Monster.com (which was itself acquired by an even bigger company). Among many other things, his story sheds light on a question that you might not have thought about before: what happens to data scientists when their company gets acquired?
Most software development roles are pretty straightforward: someone tells you what to build (usually a product manager), and you build it. What’s interesting about data science is that although it’s a software role, it doesn’t quite follow this rule.
That’s because data scientists are often the only people who can understand the practical business consequences of their work. There’s only one person on the team who can answer questions like, “What does the variance in our cluster analysis tell us about user preferences?” and “What are the business consequences of our model’s ROC score?”, and that person is the data scientist. In that sense, data scientists have a very important responsibility not to leave any insights on the table, and to bring business instincts to bear even when they’re dealing with deeply technical problems.
For today’s episode, we spoke with Rocio Ng, a data scientist at LinkedIn, about the need for strong partnerships between data scientists and product managers, and the day-to-day dynamic between those roles at LinkedIn. Along the way, we also talked about one of the most common mistakes that early career data scientists make: focusing too much on that first role.
If you’ve been following developments in data science over the last few years, you’ll know that the field has evolved a lot since its Wild West phase in the early/mid 2010s. Back then, a couple of Jupyter notebooks with half-baked modeling projects could land you a job at a respectable company, but things have since changed in a big way.
Today, as companies have finally come to understand the value that data science can bring, more and more emphasis is being placed on the implementation of data science in production systems. And as these implementations have required models that can perform on larger and larger datasets in real-time, an awful lot of data science problems have become engineering problems.
That’s why we sat down with Akshay Singh, who among other things has worked in and managed data science teams at Amazon, League and the Chan Zuckerberg Initiative (formerly Meta.com).
It’s easy to think of data science as a technical discipline, but in practice, things don’t really work out that way. If you’re going to be a successful data scientist, people will need to believe that you can add value in order to hire you, people will need to believe in your pet project in order to endorse it within your company, and people will need to make decisions based on the insights you pull out of your data.
Although it’s easy to forget about the human element, managing it is one of the most useful skills you can develop if you want to climb the data science ladder, and land that first job, or that promotion you’re after. And that’s exactly why we sat down with Susan Holcomb, the former Head of Data at Pebble, the world’s first smartwatch company.
When Pebble first hired her, Susan was fresh out of grad school in physics, and had never led a team, or interacted with startup executives. As the company grew, she had to figure out how to get Pebble’s leadership to support her effort to push the company in a more data-driven direction, at the same time as she managed a team of data scientists for the first time.
You import your data. You clean your data. You make your baseline model.
Then, you tune your hyperparameters. You go back and forth from random forests to XGBoost, add feature selection, and tune some more. Your model’s performance goes up, and up, and up.
And eventually, the thought occurs to you: when do I stop?
Most data scientists struggle with this question on a regular basis, and from what I’ve seen working with SharpestMinds, the vast majority of aspiring data scientists get the answer wrong. That’s why we sat down with Tan Vachiramon, a member of the Spatial AI team at Oculus, and former data scientist at Airbnb.
Tan has seen data science applied in two very different industry settings: once, as part of a team whose job it was to figure out how to understand their customer base in the middle of the whirlwind of out-of-control user growth (at Airbnb); and again in a context where he’s had the luxury of conducting far more rigorous data science experiments under controlled circumstances (at Oculus).
My biggest take-home from our conversation was this: if you’re interested in working at a company, it’s worth taking some time to think about their business context, because that’s the single most important factor driving the kind of data science you’ll be doing there. Specifically:
Data science at rapidly growing companies comes with a special kind of challenge that’s not immediately obvious: because they’re growing so fast, no matter where you look, everything looks like it’s correlated with growth! New referral campaign? “That definitely made the numbers go up!” New user onboarding strategy? “Wow, that worked so well!” Because the product is taking off, you need special strategies to ensure that you don’t confuse the effectiveness of a company initiative you’re interested in with the inherent viral growth that the product was already experiencing.
The amount of time you spend tuning or selecting your model, or doing feature selection, entirely depends on the business context. In some companies (like Airbnb in the early days), super-accurate algorithms aren’t as valuable as algorithms that allow you to understand what the heck is going on in your dataset. As long as business decisions don’t depend on getting second-digit-after-the-decimal levels of accuracy, it’s okay (and even critical) to build a quick model and move on. In these cases, even logistic regression often does the trick!
In other contexts, where tens of millions of dollars depend on every decimal point of accuracy you can squeeze out of your model (investment banking, ad optimization), expect to spend more time on tuning/modeling. At the end of the day, it’s a question of opportunity costs: keep asking yourself if you could be creating more value for the business if you wrapped up your model tuning now, to work on something else. If you think the answer could be yes, then consider calling model.save() and walking away.
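As a concrete illustration of the “build a quick model and move on” approach, here’s a minimal sketch (using scikit-learn with synthetic data as a stand-in for a real business dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real business dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A simple, interpretable baseline: logistic regression
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Baseline accuracy: {model.score(X_test, y_test):.2f}")

# The coefficients tell you what's driving predictions --
# often more valuable than a few extra points of accuracy
for i, coef in enumerate(model.coef_[0]):
    print(f"feature_{i}: {coef:+.2f}")
```

The point isn’t the specific model: it’s that something this simple already gives you both a performance number and a readable story about your features, which may be all the business context demands.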
To most data scientists, the Jupyter notebook is a staple tool: it’s where they learned the ropes, it’s where they go to prototype models or explore their data — basically, it’s the default arena for all their data science work.
But Joel Grus isn’t like most data scientists: he’s a former hedge fund manager and former Googler, and author of Data Science From Scratch. He currently works as a research engineer at the Allen Institute for Artificial Intelligence, and maintains a very active Twitter account.
Oh, and he thinks you should stop using Jupyter notebooks. Now.
When you ask him why, he’ll provide many reasons, but a handful really stand out:
Hidden state: let’s say you define a variable like a = 1 in the first cell of your notebook. In a later cell, you assign it a new value, say a = 3. This results in fairly predictable behavior as long as you run your notebook in order, from top to bottom. But if you don’t — or worse still, if you run the a = 3 cell and delete it later — it can be hard, or even impossible, to know from a simple inspection of the notebook what the true state of your variables is.
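A minimal sketch of how hidden state bites (the “cells” are shown as comments; the variable names are illustrative):

```python
# Cell 1 (run first):
a = 1
results = [a * 2]      # results is now [2]

# Cell 2 (run next, then DELETED from the notebook):
a = 3

# Cell 3 (run after Cell 2):
results.append(a * 2)  # results is now [2, 6]

# A reader of the final notebook sees only Cells 1 and 3,
# and would reasonably expect results == [2, 2].
# The kernel's actual state -- [2, 6] -- is invisible on inspection.
print(results)
```

Running this as a script reproduces what the kernel actually held, not what the surviving cells suggest.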
Replicability: one of the most important things to do to ensure that you’re running repeatable data science experiments is to write robust, modular code. Jupyter notebooks implicitly discourage this, because they’re not designed to be modularized (awkward hacks do allow you to import one notebook into another, but they’re, well, awkward). What’s more, to reproduce another person’s results, you need to first reproduce the environment in which their code was run. Vanilla notebooks don’t give you a good way to do that.
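To make the contrast concrete, here’s one hedged sketch of what “modular instead of notebook-bound” looks like: pulling a transformation out of a cell into an importable, testable function (the `clean_prices` name and logic are illustrative, not from the episode):

```python
# prep.py -- logic that once lived in a notebook cell, now importable
def clean_prices(prices: list[float]) -> list[float]:
    """Drop negative entries and round to cents."""
    return [round(p, 2) for p in prices if p >= 0]

# Any script, test, or other module can now reproduce this exact step
# with `from prep import clean_prices` -- no kernel state required.
if __name__ == "__main__":
    print(clean_prices([19.999, -1.0, 5.25]))
```

Pair a module like this with a pinned environment file (e.g. a requirements.txt) and another person can rerun your pipeline end to end, which is exactly what a vanilla notebook doesn’t give you.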
Bad for teaching: Jupyter notebooks make it very easy to write terrible tutorials — you know, the kind where you mindlessly hit “shift-enter” a whole bunch of times, and make your computer do a bunch of stuff that you don’t actually understand? It leads to a lot of frustrated learners, or even worse, a lot of beginners who think they understand how to code, but actually don’t.
Overall, Joel’s objections to Jupyter notebooks seem to come in large part from his somewhat philosophical view that data scientists should follow the same set of best practices that any good software engineer would. For instance, Joel stresses the importance of writing unit tests (even for data science code), and is a strong proponent of using type annotation (if you aren’t familiar with that, you should definitely learn about it here).
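For instance (an illustrative sketch, not Joel’s own code), type annotations plus a unit test for a small data science helper might look like this:

```python
from typing import Sequence

def normalize(values: Sequence[float]) -> list[float]:
    """Scale values linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# A unit test -- cheap insurance, even for "throwaway" analysis code
def test_normalize() -> None:
    assert normalize([2.0, 4.0, 6.0]) == [0.0, 0.5, 1.0]
    assert normalize([5.0, 5.0]) == [0.0, 0.0]

test_normalize()
```

The annotations document what the function expects and returns, and the test pins down its behavior at the edges (here, the constant-input case), which is precisely the discipline notebooks make it easy to skip.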
But even Joel thinks Jupyter notebooks have a place in data science: if you’re poking around at a pandas dataframe to do some basic exploratory data analysis, it’s hard to think of a better way to produce helpful plots on the fly than the trusty ol’ Jupyter notebook.
Whatever side of the Jupyter debate you’re on, it’s hard to deny that Joel makes some compelling points. I’m not personally shutting down my Jupyter kernel just yet, but I’m guessing I’ll be firing up my favorite IDE a bit more often in the future.