How to curate quality datasets for machine learning – Algorithm Conference

Algorithm 2022

Feb. 10 - 12


Sculpting Data for Machine Learning

Workshop by two ML engineers from Twitter

General info

Sculpting Data for Machine Learning is a workshop led by two machine learning engineers from Twitter, Inc. on Day 1 of Algorithm 2022.


Target audience: Developers, aspiring developers, and technical (project) managers.

Date: Feb. 10, 2022

Time: 8 a.m. – 10 a.m.

Instructors: Jigyasa Grover and Rishabh Misra

Location: TI Auditorium, University of Texas at Dallas.

Workshop summary

In machine learning, data is the new oil. State-of-the-art algorithms can only work their magic when they have access to relevant data. Although large volumes of raw data are available on the web, we still need the ability to identify that data and extract it into meaningful datasets.


This workshop will present the power of one of the most fundamental aspects of machine learning: dataset curation, which often does not get its due despite its importance.

You’ll learn why dataset curation is important in specific industry use cases, and also learn, via hands-on Pythonic examples, how to construct good quality datasets.


The methods and tips shared in this workshop have come in handy for the instructors when publishing high-grade research papers, at their current employment with Twitter, Inc., and at prior engagements in the industry and academia.

What you'll learn

  • Why

    Why the ability to curate quality datasets is important, and its significance in academia and industry

  • Find

    How to search for and identify relevant data sources on the web

  • Extract

    How to extract raw data using Python tools like Beautiful Soup and Selenium via a hands-on example

  • Simplify

    How to make the data extraction process systematic, robust, and efficient

  • Clean

    How to convert a raw data dump into a high-quality dataset for ML, using hands-on examples

  • Grok

    Some personal anecdotes and recommendations for different use cases from working with datasets at Twitter and prior engagements
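The Extract and Clean steps above can be sketched in a few lines of Python. This is only an illustration, not the instructors' workshop code: the workshop uses Beautiful Soup and Selenium, while this sketch uses Python's built-in html.parser as a stand-in so that it runs without extra packages. The HTML snippet, the "headline" class name, and the function names are all hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical raw page; in practice this HTML would be fetched from a web source.
RAW_HTML = """
<ul>
  <li class="headline">  Scientists   discover new   planet </li>
  <li class="headline">Scientists discover new planet</li>
  <li class="headline">   </li>
  <li class="ad">Buy now!</li>
  <li class="headline">Local team wins final</li>
</ul>
"""


class HeadlineParser(HTMLParser):
    """Collect the text of every <li class="headline"> element."""

    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "li" and dict(attrs).get("class") == "headline":
            self.in_headline = True
            self.headlines.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_headline = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current entry.
        if self.in_headline:
            self.headlines[-1] += data


def extract_headlines(html):
    """Extract step: pull the raw headline strings out of the page."""
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()
    return parser.headlines


def clean(records):
    """Clean step: normalize whitespace, drop empty entries, remove duplicates."""
    seen = set()
    cleaned = []
    for text in records:
        normalized = " ".join(text.split())
        key = normalized.lower()
        if normalized and key not in seen:
            seen.add(key)
            cleaned.append(normalized)
    return cleaned


raw = extract_headlines(RAW_HTML)
dataset = clean(raw)
print(dataset)
```

Keeping extraction and cleaning in separate functions means the raw dump can be saved once and re-cleaned later with different rules, without scraping the source again.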

What you'll need

BYOL (Bring your own laptop)

Then complete the following steps on your computer:

1. Install or update to Google Chrome’s latest version (v79).

2. Download the ChromeDriver version that matches your installed Google Chrome version.

3. Install Jupyter Notebook.

4. Install Beautiful Soup and Selenium packages.

5. Ensure that the starter code in curating quality ML datasets works on your computer.
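One way to confirm that the Python packages from the steps above are in place is a short check like the one below. It is a minimal sketch, not part of the workshop's starter code; the check_packages helper is hypothetical. It assumes the packages were installed with pip (as beautifulsoup4, selenium, and notebook) and looks them up by their import names, bs4, selenium, and notebook.

```python
import importlib.util

# Display name -> import name ("bs4" is the import name of Beautiful Soup).
REQUIRED = {
    "Beautiful Soup": "bs4",
    "Selenium": "selenium",
    "Jupyter Notebook": "notebook",
}


def check_packages(required):
    """Return a dict mapping each display name to True if its module can be found."""
    return {
        name: importlib.util.find_spec(module) is not None
        for name, module in required.items()
    }


status = check_packages(REQUIRED)
for name, installed in status.items():
    print(f"{name}: {'found' if installed else 'MISSING'}")
```

This check only confirms that the modules can be imported. It does not verify that ChromeDriver is on your PATH or that its version matches Chrome.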

Workshop instructors

Jigyasa Grover


Machine Learning Engineer, Twitter

Jigyasa Grover is a Machine Learning Engineer at Twitter, co-author of Sculpting Data for ML and an ML Google Developer Expert. Prior experiences include stints at Facebook, Inc., National Research Council of Canada, and Institute of Research & Development France involving Data Science, mathematical modeling, and software engineering.


A Red Hat Women in Open Source Academic Award Winner and a Google Summer of Code alumna, Jigyasa is an ardent open-source contributor. In her quest to help build a powerful community of girls and boys alike, and believing in “we rise by lifting others”, she mentors aspiring developers and Machine Learning enthusiasts in various global programs. She has served as the Director of Women Who Code and Lead of Women Techmakers.


She has a Master’s degree in Computer Science with an AI specialization from the University of California, San Diego, and is currently applying her past experiences and knowledge to Applied Machine Learning in the online advertisements prediction and ranking domain.

Rishabh Misra


Machine Learning Engineer, Twitter

Rishabh Misra is a Machine Learning Engineer at Twitter, Inc., and co-author of the book Sculpting Data for ML. He developed a passion for identifying and tackling novel, practical problems with Machine Learning during his research internships at the Indian Institute of Technology Madras, and explored it further during his Master’s in Computer Science program at the University of California San Diego.


He combines his past engineering experiences in designing large-scale systems, working at Amazon and Arcesium (a D.E. Shaw company), and research experiences in Applied Machine Learning to develop distributed Machine Learning relevance systems at Twitter.


Kaggle recently ranked him as one of the top 20 dataset contributors, and the “Natural Language Processing in TensorFlow” course on Coursera used his Sarcasm Detection dataset for teaching purposes.


Registration for the workshop and for the conference itself is now open. The workshop has a limited number of tickets, so register soon if you want to guarantee yourself a spot. To reserve your ticket(s), click on that big red button.

Register with crypto

Want to register using your favorite cryptocurrency? We’re on your side. Just click that button to email us to begin the process. We’ll get back to you pronto.

Get their book

Grover and Misra, the workshop instructors, are also co-authors of a new book titled Sculpting Data for ML: The first act of machine learning. The book is available on Amazon, and you may also chat with them on the book’s Twitter page by clicking here.
