Algorithm 2022

Feb. 10-12, 2022


How to curate quality datasets for ML

Workshop by two ML engineers from Twitter

General info

How to curate quality datasets for machine learning is the title of a workshop that will be conducted by a couple of machine learning engineers from Twitter, Inc. on Day 1 of Algorithm 2022. 


Target audience: Developers, aspiring developers, and technical (project) managers.

Date: Feb. 10, 2022

Time: 8 a.m. – 10 a.m.

Instructors: Jigyasa Grover and Rishabh Misra

Location: TI Auditorium, University of Texas at Dallas.

Workshop summary

In the contemporary world of machine learning algorithms, data is the new oil. And for state-of-the-art machine learning algorithms to work their magic, it’s important to have access to relevant data. Though volumes of crude data are available on the web, we still need the ability to identify and extract them into meaningful datasets.


This workshop will present the power of dataset curation, one of the most fundamental aspects of machine learning and one that often does not get its due.

You’ll learn why dataset curation is important in specific industry use cases, and also learn, via hands-on Pythonic examples, how to construct good quality datasets.


The methods and tips shared in this workshop have come in handy for the instructors when publishing high-grade research papers, in their current roles at Twitter, Inc., and in prior engagements in industry and academia.

What you'll learn

Workshop schedule

A detailed class schedule will be published shortly. Stay tuned! But don’t let that stop you from getting your tickets before they sell out.

What you'll need

BYOL (Bring your own laptop)

Then complete the following steps on your laptop:

1. Install or update to Google Chrome’s latest version (v79).

2. Download the ChromeDriver version that matches your Google Chrome version from here.

3. Install Jupyter Notebook from here.

4. Install Beautiful Soup and Selenium packages.

5. Ensure that the starter code in curating quality ML datasets works on your computer.
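
As a quick sanity check for steps 3 and 4, the following minimal sketch reports which of the Python packages named above cannot be imported. It is not the workshop's starter code and it does not check Google Chrome or ChromeDriver. The function name missing_packages and the REQUIRED mapping are illustrative assumptions. The pip names shown (beautifulsoup4, selenium, notebook) are the usual ones for these packages.

```python
import importlib.util

# Maps each import name to the name used with `pip install`.
# The two differ for Beautiful Soup (imported as bs4, installed as beautifulsoup4).
# This mapping is an illustrative assumption, not part of the workshop material.
REQUIRED = {
    "bs4": "beautifulsoup4",
    "selenium": "selenium",
    "notebook": "notebook",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of the packages that cannot be found for import."""
    return [
        pip_name
        for import_name, pip_name in required.items()
        if importlib.util.find_spec(import_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages; try: pip install " + " ".join(missing))
    else:
        print("All required packages are importable.")
```

Run it with the same Python interpreter that Jupyter Notebook uses, since packages installed for a different interpreter will not be found.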

Workshop instructors

Jigyasa Grover

Machine Learning Engineer, Twitter

The 2017 Red Hat Women in Open Source Academic Award Winner and Google Summer of Code alumna, I am an ardent open source enthusiast and a budding researcher, with work experience at the San Diego Supercomputer Center; National Research Council of Canada; and the Institute of Research & Development, France. I also briefly worked on anomaly detection frameworks in the ads system at Facebook.


I was the Director of Women Who Code and Lead of Google Women Techmakers for a handful of years. Aside from teaching this workshop with my colleague Rishabh Misra, I’ll also be giving a presentation that sheds some light on an informal taxonomy of machine learning algorithms.

Rishabh Misra

Machine Learning Engineer, Twitter

I am passionate about identifying and tackling novel and practical problems using my machine learning expertise. I also like messing with data. The bigger, the better. The datasets I’ve collected as part of my research have been very well received by the machine learning community, and I’m currently ranked 23 as a dataset contributor on the Kaggle platform. My dataset on Sarcasm Detection has been used in the Natural Language Processing in TensorFlow course on Coursera.


I love explaining convoluted concepts in an accessible manner and have written several articles for the Towards Data Science online publication. I have a master’s degree in computer science from the University of California, San Diego, and currently work as a machine learning engineer at Twitter, Inc.


Registration for the workshop and for the conference itself is now open. The workshop has a limited number of tickets, so hurry and register if you want to guarantee yourself a spot. To reserve your ticket(s), click on that big red button.

Register with crypto

Want to register using your favorite cryptocurrency? We’re on your side. Just click that button to email us to begin the process. We’ll get back to you pronto.

Get their book

Grover and Misra, the workshop instructors, are also co-authors of a new book titled Sculpting Data for ML: The first act of machine learning. The book is available on Amazon, and you may also chat with them on the book’s Twitter page by clicking here.
