High-quality data is the fuel that keeps the AI engine running — and the machine learning community can’t get enough of it. In the conclusion to our year-end series, Synced spotlights ten datasets that were open-sourced in 2019.
Waymo Open Dataset For Autonomous Driving
Waymo, the self-driving technology development company and Alphabet subsidiary, has been relatively protective of its technology and data since its establishment in 2009. This August, however, it released the Waymo Open Dataset, a high-quality multimodal sensor dataset for autonomous driving.
Waymo principal scientist and head of research Drago Anguelov says the dataset is “one of the largest, richest and most diverse self-driving data sets ever released for research.”
Company vehicles collected over 10 million autonomous driving miles in 25 cities. The dataset covers a wide variety of environments, from dense urban centers to suburban landscapes, as well as data collected during day and night, at dawn and dusk, in sunshine and rain.
Deepfake Detection Challenge
Serious concerns have emerged in the tech industry and society as a whole regarding AI-powered deepfake techniques which generate realistic-looking videos of people appearing to do or say things they did not. Deepfake tech has already been used to produce celebrity and revenge porn and to generate fake speeches and other news.
To address the potential issues with deepfakes, AWS, Facebook, Microsoft, the Partnership on AI’s Media Integrity Steering Committee and several universities united to build the Deepfake Detection Challenge (DFDC), which aims to help global researchers to find innovative methods to detect deepfakes.
Earlier this month, Facebook AI released a new, unique dataset of 100,000-plus videos created specifically for the DFDC. Google has also released a large dataset of visual deepfakes that has been incorporated into the Technical University of Munich and the University Federico II of Naples’ new FaceForensics benchmark.
Google’s Natural Questions for Question-Answering Systems
In January Google AI released its Natural Questions (NQ) large-scale dataset for training and evaluating open-domain question-answering systems. Google researchers Tom Kwiatkowski and Michael Collins say this is the first dataset to “replicate the end-to-end process” in which people find answers to questions.
Natural Questions consists of over 300,000 naturally occurring queries paired with human-annotated answers from Wikipedia pages. It’s designed both to train question-answering systems and to evaluate them.
MIT ObjectNet: Pushing the Limits of Object Recognition Models
Researchers from the MIT Center for Brains, Minds and Machines, the MIT Computer Science & Artificial Intelligence Laboratory and the MIT-IBM Watson AI Lab collected ObjectNet, a large real-world test set for object recognition.
Designed to show objects against a variety of random backgrounds, rotations, and imaging viewpoints, ObjectNet is the same size as the ImageNet test set (50,000 images). To encourage generalization, it does not come paired with a training set.
Libri-Light: The Largest-Ever Open Data Set for Speech Technology
Earlier this month Facebook AI introduced Libri-light: “the largest-ever open source data set for speech technology.” Built entirely from public domain audio and optimized for developing automatic speech recognition (ASR) systems using limited or no supervision, Libri-light is designed to support training settings that are less reliant on labels.
In addition to training and test sets, Libri-light includes metrics and baseline models to help researchers compare different methods for developing ASR systems that require little or no supervision.
Stanford ML Group Releases Chest X-Ray and Knee MRI Datasets
In January the Stanford Machine Learning Group led by Andrew Ng introduced CheXpert, a large dataset of chest X-rays designed for automated interpretation. CheXpert contains 224,316 chest radiographs from 65,240 patients. The data was collected from Stanford Hospital chest radiographic examinations performed between 2002 and 2017 in both inpatient and outpatient centers, along with associated radiology reports.
The researchers also developed an automatic labeler that translates observations into structured labels (positive, negative, or uncertain), which they believe can reach human-expert level.
Three months later, the group released the MRNet dataset, which contains data from 1,370 knee MRI (magnetic resonance imaging) exams performed at the Stanford University Medical Center between 2001 and 2012. Of the MRIs, 1,104 (80.6%) are abnormal cases, with 319 (23.3%) diagnosed as ACL (anterior cruciate ligament) tears and 508 (37.1%) as meniscal tears.
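The percentages quoted for MRNet follow directly from the case counts. As a quick check, the minimal Python sketch below recomputes them; the variable and function names are illustrative only and are not part of the MRNet release.

```python
# Case counts as reported for the MRNet dataset in the text above.
TOTAL_EXAMS = 1370
CASE_COUNTS = {
    "abnormal": 1104,
    "ACL tear": 319,
    "meniscal tear": 508,
}


def share_of_exams(count, total=TOTAL_EXAMS):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Map each label to its share of all exams.
shares = {label: share_of_exams(n) for label, n in CASE_COUNTS.items()}

for label, pct in shares.items():
    print(f"{label}: {pct}%")
```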
Google Releases Open Images V5
Google released its open-source image dataset Open Images V5 in May as an update on last year’s Open Images V4. First introduced in 2016, Open Images is a collaborative release comprising about nine million images annotated with labels covering thousands of object categories.
Open Images V5 features newly added segmentation mask annotations for 2.8 million objects in 350 categories. Unlike bounding boxes, which only identify the general area in which an object is located, these segmentation masks trace the outline of the target object, characterizing its spatial extent with a higher level of detail.
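To make the difference between the two annotation types concrete, here is a small illustrative sketch in Python. The toy grid and the helper names are made up for this example and are not part of any Open Images tooling. It only shows that a mask records exactly which pixels belong to an object, while the tightest bounding box around the same object also covers background pixels.

```python
# Toy 5x6 binary mask: 1 marks a pixel that belongs to the object.
# (Made-up data for illustration; not taken from Open Images.)
MASK = [
    [0, 0, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
]


def bounding_box(mask):
    """Return the tightest box (top, left, bottom, right), all inclusive."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], cols[0], rows[-1], cols[-1]


def mask_area(mask):
    """Count the pixels that the mask assigns to the object."""
    return sum(sum(row) for row in mask)


def box_area(box):
    """Count all pixels inside an inclusive (top, left, bottom, right) box."""
    top, left, bottom, right = box
    return (bottom - top + 1) * (right - left + 1)


box = bounding_box(MASK)
object_pixels = mask_area(MASK)
box_pixels = box_area(box)

print("bounding box:", box)
print("pixels in mask:", object_pixels)
print("pixels in box:", box_pixels)
print("background pixels inside box:", box_pixels - object_pixels)
```

In this toy case the box is almost twice as large as the object it encloses, which is the extra detail a mask captures and a box cannot.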
PartNet Helps Robots Learn What Things Are
The ability to understand various categories of things and use that general understanding to make sense of new things is particularly challenging for robots. Researchers from Stanford University, University of California San Diego, Simon Fraser University, Intel AI, and Facebook AI address that with a consistent, large-scale dataset of 3D objects annotated with 3D part information.
Unveiled at CVPR in June, the PartNet dataset consists of 573,585 part instances over 26,671 3D models covering 24 object categories. PartNet enables and serves as a catalyst for many tasks such as shape analysis, dynamic 3D scene modeling and simulation, affordance analysis, and others. There are also three benchmark tasks for evaluating 3D part recognition: fine-grained semantic segmentation, hierarchical semantic segmentation, and instance segmentation.
IBM Diversity in Faces Dataset
IBM’s new Diversity in Faces (DiF) dataset is the first of its kind, aiming to advance diversity, fairness and accuracy in facial recognition technology in the global research community. The DiF comprises one million annotated human facial images from the publicly available YFCC-100M Creative Commons dataset.
The annotations were generated using ten facial coding schemes that provide human-interpretable quantitative measures of intrinsic facial features.
Google-Landmarks-v2
In May Google announced the release of a new and improved landmark recognition dataset. Google-Landmarks-v2 includes over five million images, doubling the number in the landmark recognition dataset Google released last year. The dataset now covers more than 200,000 landmarks, a sevenfold increase over last year.
Landmarks-v2 supports two research tasks: instance-level recognition and image retrieval. Instance-level recognition identifies specific instances of objects, distinguishing, for example, Toronto Union Station from other urban train stations. Image retrieval, meanwhile, matches a particular object in an input image to all other instances of that object in a catalogue of reference images.
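As a rough illustration of the image retrieval idea, the Python sketch below ranks a tiny catalogue by cosine similarity to a query descriptor. This is a generic nearest-neighbour approach chosen for the example, not the method used by Google or by any Landmarks-v2 baseline. The file names and the three-number descriptors are hypothetical and hand-written; a real system would compute much longer vectors with a trained model.

```python
import math

# Hypothetical image descriptors, hand-written for illustration only.
CATALOGUE = {
    "union_station_front.jpg": [0.9, 0.1, 0.0],
    "union_station_side.jpg": [0.8, 0.2, 0.1],
    "other_station.jpg": [0.1, 0.9, 0.2],
    "city_park.jpg": [0.0, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query, catalogue, top_k=2):
    """Return the top_k catalogue names, most similar to the query first."""
    ranked = sorted(
        catalogue,
        key=lambda name: cosine_similarity(query, catalogue[name]),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical descriptor of a new photo of the same station.
query_descriptor = [0.85, 0.15, 0.05]
matches = retrieve(query_descriptor, CATALOGUE)
print(matches)
```

With these made-up vectors, the two reference images of the same station rank above the other station and the park, which mirrors the goal of returning every catalogue image that shows the queried object.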