Sculpting Data for Machine Learning is a workshop that two machine learning engineers from Twitter, Inc. will conduct on Day 1 of Algorithm 2022.
Target audience: Developers, aspiring developers, and technical (project) managers.
Date: Feb. 10, 2022
Time: 8 a.m. – 10 a.m.
Instructors: Jigyasa Grover and Rishabh Misra
Location: TI Auditorium, University of Texas at Dallas.
In the contemporary world of machine learning algorithms, data is the new oil. And for state-of-the-art machine learning algorithms to work their magic, it’s important to have access to relevant data. Though volumes of crude data are available on the web, we still need the ability to identify that data and extract it into meaningful datasets.
This workshop will present the power of one of the most fundamental aspects of machine learning: dataset curation, which is highly relevant to machine learning but often does not get its due.
You’ll learn why dataset curation is important in specific industry use cases, and also learn, via hands-on Pythonic examples, how to construct good quality datasets.
The methods and tips shared in this workshop have come in handy for the instructors when publishing high-grade research papers, at their current employment with Twitter, Inc., and at prior engagements in the industry and academia.
Why the ability to curate quality datasets is important, and its significance in academia and industry
How to search and identify relevant data sources from the web
How to extract raw data using Python tools like Beautiful Soup and Selenium via a hands-on example
How to make the data extraction process systematic, robust and efficient
How to convert a raw data dump into a high-quality dataset for ML using hands-on examples
Some personal anecdotes and recommendations for different use cases from working with datasets at Twitter and prior engagements
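As a rough preview of the extract-then-curate flow in the list above, here is a minimal sketch using Beautiful Soup. It is not the instructors' workshop code: the inline HTML snippet, the CSS class names, and the helper names `extract_raw` and `clean` are all hypothetical and exist only for illustration. In a live setting the page source would typically come from Selenium or an HTTP client rather than a hard-coded string.

```python
from bs4 import BeautifulSoup

# Hypothetical raw page (assumption): stands in for HTML fetched from a real site.
RAW_HTML = """
<div class="review"><span class="author">ana</span><span class="rating">4.5</span><p class="text">  Great   fit </p></div>
<div class="review"><span class="author">ben</span><span class="rating">n/a</span><p class="text">Too small</p></div>
<div class="review"><span class="author">ana</span><span class="rating">4.5</span><p class="text">Great fit</p></div>
<div class="review"><span class="author">cy</span><p class="text">No rating given</p></div>
"""


def _text(node, selector):
    """Return the text of the first match for selector, or None if absent."""
    found = node.select_one(selector)
    return found.get_text(" ", strip=True) if found else None


def extract_raw(html):
    """Extraction step: pull one raw record per review block, keeping gaps as None."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        {
            "author": _text(node, ".author"),
            "rating": _text(node, ".rating"),
            "text": _text(node, ".text"),
        }
        for node in soup.select("div.review")
    ]


def clean(rows):
    """Curation step: drop incomplete, unparseable, and duplicate records."""
    seen = set()
    dataset = []
    for row in rows:
        if not all(row.values()):  # a field is missing or empty
            continue
        try:
            rating = float(row["rating"])
        except ValueError:  # e.g. "n/a" instead of a number
            continue
        text = " ".join(row["text"].split())  # normalize whitespace
        key = (row["author"], text)
        if key in seen:  # same author and text already kept
            continue
        seen.add(key)
        dataset.append({"author": row["author"], "rating": rating, "text": text})
    return dataset


if __name__ == "__main__":
    raw = extract_raw(RAW_HTML)
    print(f"{len(raw)} raw records, {len(clean(raw))} after cleaning")
```

Keeping extraction and cleaning as separate functions means the raw dump can be saved once and re-cleaned later without scraping the site again, which is one simple way to make the process more systematic.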
Before the workshop, perform the following actions on your computer:
1. Install or update to Google Chrome’s latest version (v79).
2. Download the ChromeDriver version that matches your Google Chrome version from here
3. Install Jupyter Notebook from here
4. Install Beautiful Soup and Selenium packages.
5. Ensure that the starter code in curating quality ML datasets works on your computer.
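One way to double-check the setup steps above is a short script like the following. It is only a sketch under a few assumptions: that Beautiful Soup and Selenium were installed from the `beautifulsoup4` and `selenium` packages (for example with `pip install beautifulsoup4 selenium`), that Jupyter Notebook is importable as the `notebook` module, and that the ChromeDriver executable is named `chromedriver` and sits on your PATH. The function name `check_environment` is hypothetical.

```python
import importlib.util
import shutil

# Importable module name -> human-readable label (module names are assumptions).
REQUIRED_MODULES = {
    "bs4": "Beautiful Soup",
    "selenium": "Selenium",
    "notebook": "Jupyter Notebook",
}


def check_environment(modules=REQUIRED_MODULES, driver_name="chromedriver"):
    """Return a dict mapping each requirement to True (found) or False (missing)."""
    report = {
        label: importlib.util.find_spec(name) is not None
        for name, label in modules.items()
    }
    # shutil.which returns None when the executable is not on PATH.
    report["ChromeDriver on PATH"] = shutil.which(driver_name) is not None
    return report


if __name__ == "__main__":
    for item, ok in check_environment().items():
        print(f"{'OK' if ok else 'MISSING':<8}{item}")
```

The script only confirms that the pieces can be found. It does not verify that the ChromeDriver version matches your installed Chrome version, so that still needs a manual check.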
Jigyasa Grover is a Machine Learning Engineer at Twitter, co-author of Sculpting Data for ML and an ML Google Developer Expert. Prior experiences include stints at Facebook, Inc., National Research Council of Canada, and Institute of Research & Development France involving Data Science, mathematical modeling, and software engineering.
A Red Hat Women in Open Source Academic Award Winner and a Google Summer of Code alumna, Jigyasa is an ardent open-source contributor. In her quest to help build a powerful community of girls and boys alike, and believing in “we rise by lifting others”, she mentors aspiring developers and Machine Learning enthusiasts in various global programs. She has served as the Director of Women Who Code and Lead of Women Techmakers.
She has a Master’s degree in Computer Science with an AI specialization from the University of California, San Diego, and is currently applying her past experiences and knowledge to Applied Machine Learning in the online advertisements prediction and ranking domain.
Rishabh Misra is a Machine Learning Engineer at Twitter, Inc., and co-author of the book Sculpting Data for ML. He developed a passion for identifying and tackling novel and practical problems using Machine Learning during his research internships at the Indian Institute of Technology Madras, which he further explored during his Master’s in Computer Science program at the University of California San Diego.
He combines his past engineering experiences in designing large-scale systems, working at Amazon and Arcesium (a D.E. Shaw company), and research experiences in Applied Machine Learning to develop distributed Machine Learning relevance systems at Twitter.
Kaggle recently ranked him as one of the top 20 dataset contributors, and Deeplearning.ai’s “Natural Language Processing in TensorFlow” course on Coursera used his Sarcasm Detection dataset for teaching purposes.
Registration for the workshop and for the conference itself is now open. The workshop has a limited number of tickets, so hurry and register if you want to guarantee yourself a spot. To reserve your ticket(s), click on that big red button.
Want to register using your favorite cryptocurrency? We’re on your side. Just click that button to email us to begin the process. We’ll get back to you pronto.
Grover and Misra, the workshop instructors, are also co-authors of a new book titled Sculpting Data for ML: The first act of machine learning. The book is available on Amazon, and you may also chat with them on the book’s Twitter page by clicking here.
Want to sponsor Algorithm 2022 or have an exhibit space during the conference? Click that button to view the sponsorship prospectus.