Diaries of Social Data Research

By Katherine A. Keith & Lucy Li
Large-scale data has become a major component of research about human behavior and society. But how are interdisciplinary collaborations that use large-scale social data formed and maintained? What obstacles are encountered on the journey from idea conception to publication? In this podcast, we investigate these questions by probing the “research diaries” of scholars in computational social science and adjacent fields. We unmask the research process with the hope of normalizing the challenges of and increasing accessibility in academia.
Music: Jon Gillick.
Where to listen
Apple Podcasts
Google Podcasts
Pocket Casts
RadioPublic
Spotify

18. Gender Patterns in English-Language Fiction and Interrogating Data with Ted Underwood and David Bamman
This episode features Ted Underwood, a professor in the School of Information Sciences and Department of English at the University of Illinois Urbana-Champaign, and David Bamman, an associate professor at UC Berkeley’s School of Information. We discuss their 2018 Cultural Analytics paper co-authored with literary studies PhD student Sabrina Lee, titled “The Transformation of Gender in English-Language Fiction.” We trace how Twitter brought Ted and David together as collaborators, and the email that sparked the beginnings of this project. They describe how this paper uses predictive modeling for an unconventional purpose, and various “means of interrogating data.” They also provide tips for establishing collaborative relationships, and advocate using substantive research questions to motivate learning technical skills.
May 09, 2022
17. Hashtag Network Analysis and Interwoven Research Ethics with Ryan Gallagher and Brooke Foucault Welles
Our guests in this episode are Ryan Gallagher, a PhD Candidate in Network Science at Northeastern University, and Brooke Foucault Welles, an Associate Professor in Communication Studies and the Network Science Institute at Northeastern University. We discuss their 2019 CSCW paper, "Reclaiming Stigmatized Narratives: The Networked Disclosure Landscape of #MeToo" with co-authors Elizabeth Stowell and Andrea G. Parker. We talk about their substantive motivation for focusing on #MeToo, the networked counterpublic, and hashtags' influence on social change. Ryan and Brooke also walk us through the advantages of pairing qualitative and quantitative work, weaving ethics throughout every stage of the research process, dealing with missing Tweets, and taking seriously both the "computational" and "social science" sides of CSS.
April 24, 2022
16. Measuring Uptake in Classroom Conversations and Using NLP to Support Teachers with Dora Demszky
This episode features Dora Demszky, a PhD student in Linguistics at Stanford University. Dora works at the intersection of natural language processing and education. We discuss her ACL 2021 paper titled "Measuring Conversational Uptake: A Case Study on Student-Teacher Interactions", co-authored with Jing Liu, Zid Mancenido, Julie Cohen, Heather Hill, Dan Jurafsky, and Tatsunori Hashimoto. Dora's work is motivated by creating tools that are useful for educators, so her research is not only descriptive or predictive, but also applicable to classrooms. She talks about managing large interdisciplinary teams, approaching research with care, and working with actual teachers to annotate data.
March 20, 2022
15. Race in Computational Disinformation Analysis and Deep Reading with Deen Freelon
Our guest in this episode is Deen Freelon, Associate Professor at the University of North Carolina in the School of Journalism and Media. We chat about his 2020 Social Science Computer Review paper "Black Trolls Matter: Racial and Ideological Asymmetries in Social Media Disinformation" with co-authors Michael Bossetta, Chris Wells, Josephine Lukito, Yiping Xia, and Kirsten Adams. Deen also talks about writing a "behind the scenes" book chapter about the process of making this paper, being one of the first movers in the discipline of computational methods for communication studies, and how he learns programming best when it is connected to the goals of his project. He emphasizes that many of his great research ideas come from reading deeply and recommends devoting at least half a day a week solely to reading.
March 06, 2022
14. The Past Decade of Computational Social Science Research with David Lazer
In this episode, we talk with David Lazer, the University Distinguished Professor of Political Science and Computer Sciences at Northeastern University and the Co-Director of the NULab for Texts, Maps, and Networks. We discuss two seminal papers in computational social science he co-authored a decade apart: "Life in the network: the coming age of computational social science" (Science 2009) and "Computational social science: Obstacles and opportunities" (Science 2020). David shares with us events in his long and distinguished CSS research career. He recounts how, in the early 2000s, he helped gather a small group of people working on new "data streams" and how they intentionally created the term computational social science. He also talks about his own struggles on the academic job market, advice for aspiring CSS researchers, and a wish for better data availability structures.
February 20, 2022
13. Finding (Mis)alignments in Public Opinion and Wisdom in Collaboration Management with Kenneth Joseph and Sarah Shugars
Our guests on this episode are Kenneth Joseph, an assistant professor in Computer Science and Engineering at the University at Buffalo, and Sarah Shugars, a Faculty Fellow at New York University’s Center for Data Science. We discuss the process behind their EMNLP 2021 paper, “(Mis)alignment Between Stance Expressed in Social Media Data and Public Opinion Surveys,” co-authored with Ryan Gallagher, Jon Green, Alexi Quintana Mathé, Zijian An, and David Lazer. Kenneth and Sarah offer tips around communication, collaboration, and project management, especially for papers written during a pandemic. Kenneth talks about “privileging ethics” when making decisions around data privacy and experimental replicability, and Sarah reflects on navigating differences in terminology use in interdisciplinary environments.
February 10, 2022
12. Understanding Conversational Patterns in Police Community Interactions with Vinodkumar Prabhakaran and Camilla Griffiths
Our guests on this episode are Vinodkumar Prabhakaran, who was a computer science postdoc at Stanford and is now a senior research scientist at Google, and Camilla Griffiths, who is a postdoc at Stanford SPARQ (Social Psychological Answers to Real-world Questions). With Hang Su, Prateek Verma, Nelson Morgan, Jennifer Eberhardt, and Dan Jurafsky, they are co-authors on a TACL 2018 paper, "Detecting Institutional Dialog Acts in Police Traffic Stops". Vinod and Camilla share with us how this collaboration formed over a common goal and a deep respect for each other’s disciplines. We discuss the considerations that went into forming community partnerships, handling sensitive police body-camera data, and recognizing the implications of their findings.
January 18, 2022
11. The Effects of Friend-to-Friend Texting on Voter Turnout and Overcoming Project Setbacks with Aaron Schein
This episode features Aaron Schein, a computer scientist and postdoctoral fellow at Columbia University. We discuss his WWW 2021 paper "Assessing the Effects of Friend-to-Friend Texting on Turnout in the 2018 US Midterm Elections", co-authored with Keyon Vafa, Dhanya Sridhar, Victor Veitch, Jeffery Quinn, James Moffet, David Blei, and Donald Green. Aaron shares with us how he collaborated with industry partners, overcame the discovery of a confounder that challenged the experiment’s original design, and responded to public feedback. He also mapped his interdisciplinary journey through linguistics, political science, and computer science, and shared his twist on imposter syndrome.
January 01, 2022
10. Political Discourse and Substantive-Methodological Intersections with Justine Zhang and Arthur Spirling
In this episode, we talk with Justine Zhang and Arthur Spirling. Justine is currently a postdoctoral researcher at Stanford University and Arthur is a Professor of Politics and Data Science at New York University. We discuss their 2017 EMNLP paper, with Cristian Danescu-Niculescu-Mizil, "Asking too much? The rhetorical role of questions in political discourse." Justine and Arthur touch on how collaborations can provide real insight into other disciplines as well as their different paces and writing norms. We also discuss substantive validation for unsupervised learning methods, marinating in "fun" data, the responsibility of studying political institutions that touch all aspects of human life, and a call for administrators to incentivize these kinds of collaborations.
December 10, 2021
9. Reddit Debates and Interdisciplinary Multilingualism with Emaad Manzoor
Our guest on this episode is Emaad Manzoor, an Assistant Professor of Operations and Information Management at the University of Wisconsin-Madison. Along with George H. Chen, Dokyun Lee, and Michael D. Smith, he wrote "Influence via Ethos: On the Persuasive Power of Reputation in Deliberation Online" which is currently under review at Management Science. Emaad illuminates this project's long journey, from manually labeling argumentation schemas, to using observational data from Reddit, to designing experiments. He talks with us about how Economics and NLP can learn from one another and the importance of "interdisciplinary multilingualism" in highlighting different aspects of one's work to different audiences. We also chat about the importance of personal drive in research projects and strategies for developing resilience to "emotional punches."
November 28, 2021
8. The Evolution of Computational Social Science from a Sociology Perspective with Chris Bail
This unique episode centers on a "meta" discussion on interdisciplinary work involving large-scale social data. We interview Chris Bail, a Professor of Sociology and Public Policy at Duke University. Last year, Chris and co-authors Achim Edelman, Tom Wolff, and Danielle Montagne published an overview paper titled "Computational Social Science and Sociology" in the Annual Review of Sociology. We discuss the challenges of defining this large research area, the benefits of making "lateral connections" with potential colleagues as a graduate student, and taking risks in pursuing new research directions. We also highlight the process behind the creation and growth of the Summer Institute in Computational Social Science, which Chris co-founded with Matt Salganik.
September 27, 2021
7. The Power of Birth Stories’ Narratives and Intellectual Generosity with Maria Antoniak and Karen Levy
This episode features Maria Antoniak, a PhD student, and Karen Levy, an assistant professor, who are both in the Department of Information Science at Cornell. Maria, who has a background in computational linguistics, and Karen, who has a background in law and sociology, are co-authors, along with David Mimno, on the CSCW 2019 paper "Narrative Paths and Negotiation of Power in Birth Stories". We discuss the formation of identity in online communities, approaches for protecting the privacy of users, the different submission and review processes in computing venues, and balancing new methodology and applications. Within an interdisciplinary department, Karen and Maria advocate for "learning to lift up each other’s work" and being "intellectually generous" across disciplines.
September 17, 2021
6. Extracting Events from Text and Grad School Memories with Brendan O'Connor and Brandon Stewart
Our guests in this episode are Brendan O'Connor, Associate Professor of Computer Science at UMass Amherst, and Brandon Stewart, Assistant Professor of Sociology at Princeton University. We talk with them about their 2013 ACL paper (with co-author Noah Smith) “Learning to Extract International Relations from Political Context” which presents a probabilistic model for extracting events between countries and international organizations from news articles. Brendan and Brandon also discuss how their collaboration grew from "saying nice things" about each other's work to 30-page written research memos sent back and forth. We also discuss the "ballooning and focusing" scope of research, clunky computer labs in the early 2000s, challenges in incentive structures for interdisciplinary collaborations, and data replicability standards.
September 07, 2021
5. Opioid Use Recovery on Social Media and Mentoring Undergrad Collaborators with Stevie Chancellor
In this episode, we talked to Stevie Chancellor, who is the lead author on a 2019 CHI paper titled "Discovering Alternative Treatments for Opioid Use Recovery in Social Media". Along with Stevie, who is a computer scientist, the team of authors included clinical psychologist and addiction researcher George Nitzburg, Stevie’s advisor Munmun De Choudhury, and two undergraduate students, Andrea Hu and Francisco Zampieri. Stevie shared with us her strategies for successful student mentoring, working with page limits, and using milestones and reflection points in this project’s timeline to help it reach completion.
July 12, 2021
4. COVID-19 Mobility Networks and Post-Publication Scientific Communication with Serina Chang
We discuss the paper "Mobility network models of COVID-19 explain inequities and inform reopening" with first author and Stanford computer science PhD student Serina Chang. This paper's team of interdisciplinary authors includes other computer scientists (Emma Pierson, Pang Wei Koh, and Jure Leskovec), sociologists (Beth Redbird and David Grusky), and an epidemiologist (Jaline Gerardin). Serina shared with us challenges in navigating post-publication scientific communication and translating scientific research into real-world policy tools, as well as the success of grounding research questions in supporting the needs of real people.
June 27, 2021
3. Digital Health Communication and Punk Rock Academics with Ethan Zuckerman
In this episode, we talk to Ethan Zuckerman, associate professor at the University of Massachusetts Amherst, where he teaches public policy, communication, and information. We discuss his paper "Digital Health Communication and Global Public Influence: A Study of the Ebola Epidemic" which was published in the Journal of Health Communication in 2017. His co-authors on this paper include technical and visualization experts (Hal Roberts and Sands Alden Fish II), a global public health expert (Brittany Seymour), and an expert in education policy (Emily Robinson). Ethan talks about creating Media Cloud--an open-source platform for media analysis that tracks millions of stories published online--over the course of two decades and the "fearsome process" of scaling it up. He also discussed with us being an unconventional "punk-rock" academic and advice to "scratch your deep itch" when it comes to choosing which research directions to pursue.
June 14, 2021
2. Analyzing Menstrual Cycle Data and Math Transcending Boundaries with Emma Pierson
We talk with Emma Pierson, PhD in Computer Science from Stanford and incoming assistant professor of Computer Science at Cornell Tech, about her paper "Daily, weekly, seasonal and menstrual cycles in women’s mood, behaviour and vital signs" published in Nature Human Behaviour, 2021. This was joint work with fellow computer scientists (Tim Althoff and Jure Leskovec), the head of data science at a partner company (Daniel Thomas), and a professor of obstetrics and gynecology (Paula Hillard). Emma shared with us strategies for normalizing research on women's health and the menstrual cycle and creating trust with industry partners. She emphasized that math is a universal language that can transcend the boundaries of individuals' personal experiences.
June 11, 2021
1. Abolitionist Newspapers and Maintaining 8-Year Project Momentum with Lauren Klein and Sandeep Soni
This episode features two guests: Lauren Klein, an associate professor of English and Quantitative Theory & Methods at Emory University, and Sandeep Soni, a PhD candidate in Computer Science at Georgia Tech. Their Cultural Analytics paper, "Abolitionist Networks: Modeling Language Change in Nineteenth-Century Activist Newspapers", was published in 2021, with an additional co-author, computational linguist Jacob Eisenstein. Sandeep and Lauren discuss the challenges involved in this project (which began with a conversation between Lauren and Jacob over dim sum), from gathering and cleaning noisy data to maintaining research momentum over the project’s eight-year lifespan.
June 11, 2021