SELECT: A Large-Scale Benchmark of Data Curation Strategies for Image Classification

New York University
NeurIPS 2024 (Datasets and Benchmarks)


Abstract

Data curation is the problem of how to collect and organize samples into a dataset that supports efficient learning. Despite the centrality of the task, little work has been devoted to a large-scale, systematic comparison of curation methods. In this work, we take steps towards a formal evaluation of data curation strategies and introduce SELECT, the first large-scale benchmark of curation strategies for image classification.

In order to generate baseline methods for the SELECT benchmark, we create a new dataset, ImageNet++, which constitutes the largest superset of ImageNet-1K to date. Our dataset extends ImageNet with 5 new training-data shifts, each approximately the size of ImageNet-1K, and each assembled using a distinct curation strategy. We evaluate our data curation baselines in two ways: (i) using each training-data shift to train identical image classification models from scratch, and (ii) using it to inspect a fixed, pretrained self-supervised representation.

Our findings show interesting trends, particularly pertaining to recent methods for data curation such as synthetic data generation and lookup based on CLIP embeddings. We show that although these strategies are highly competitive for certain tasks, the curation strategy used to assemble the original ImageNet-1K dataset remains the gold standard. We anticipate that our benchmark can illuminate the path for new methods to further reduce the gap. We release our checkpoints, code, documentation, and a link to our dataset at https://github.com/jimmyxu123/SELECT.

October 8, 2024

Introducing SELECT: A Large-Scale Benchmark for Data Curation in Image Classification

Data curation is a critical yet often overlooked aspect of machine learning. The process of collecting, organizing, and preparing datasets significantly impacts model performance, but until now, there hasn't been a comprehensive way to evaluate different curation strategies. Enter SELECT, a new large-scale benchmark designed to systematically compare various data curation methods for image classification tasks.

What is SELECT?

SELECT (Systematic Evaluation of Large-scale Efficient Curation Techniques) is a benchmark that allows researchers to assess the effectiveness of different data curation strategies. It provides a standardized way to measure how well curated datasets perform across a range of metrics, including:

  1. Base accuracy on ImageNet validation set
  2. Out-of-distribution (OOD) robustness
  3. Performance on downstream tasks
  4. Effectiveness for self-supervised learning
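
To make the first metric concrete: base accuracy is standard top-1 evaluation on the ImageNet validation set. Below is a minimal sketch of such an evaluation loop in PyTorch; the checkpoint and the `val/` directory are placeholder assumptions here, not artifacts shipped with the benchmark.

```python
import torch
import torchvision
from torchvision import datasets, transforms

# Standard ImageNet evaluation preprocessing.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# "val/" is a placeholder path to an ImageNet-style validation folder.
val_set = datasets.ImageFolder("val/", transform=preprocess)
val_loader = torch.utils.data.DataLoader(val_set, batch_size=256, num_workers=8)

# A pretrained ResNet-50 stands in for a model trained on one of the shifts.
model = torchvision.models.resnet50(weights="IMAGENET1K_V2").eval()

correct = total = 0
with torch.no_grad():
    for images, labels in val_loader:
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()

print(f"top-1 accuracy: {correct / total:.4f}")
```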

The benchmark also includes several analytical metrics to help understand dataset properties without requiring model training, such as class imbalance measures and image quality scores.
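
As a stand-in illustration of such a training-free metric, the sketch below scores class imbalance as the normalized entropy of the label histogram. This is a generic formulation for exposition, not necessarily the exact measure used in SELECT.

```python
import numpy as np

def class_balance_score(labels, num_classes=1000):
    """Normalized label entropy: 1.0 = perfectly balanced, 0.0 = a single class."""
    counts = np.bincount(labels, minlength=num_classes).astype(np.float64)
    probs = counts / counts.sum()
    probs = probs[probs > 0]                       # skip empty classes in the log
    entropy = -(probs * np.log(probs)).sum()
    return entropy / np.log(num_classes)

# Example: a heavily skewed label set scores well below 1.0.
rng = np.random.default_rng(0)
balanced = rng.integers(0, 1000, size=100_000)
skewed = rng.zipf(1.5, size=100_000) % 1000
print(class_balance_score(balanced))   # close to 1.0
print(class_balance_score(skewed))     # noticeably lower
```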

[Figure: Overview of the SELECT benchmark]

Introducing ImageNet++

To establish baseline performance for different curation strategies, we created ImageNet++, the largest and most diverse set of ImageNet-1K training set variations to date. ImageNet++ consists of 5 new dataset "shifts" in addition to the original ImageNet-1K:

  1. OI1000: A subset of the OpenImages dataset, assembled via crowdsourced labeling
  2. LA1000 (img2img): Images retrieved from the LAION dataset via image-embedding search (see the retrieval sketch after this list)
  3. LA1000 (txt2img): Another LAION subset, retrieved via text-embedding search
  4. SD1000 (img2img): Synthetic images generated from ImageNet images using Stable Diffusion
  5. SD1000 (txt2img): Synthetic images generated using class names as prompts (see the generation sketch below)
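
A minimal sketch of this style of embedding-based lookup with OpenCLIP follows. The backbone choice, file paths, and top-k value are illustrative assumptions; the actual curation pipeline operates over LAION at a much larger scale.

```python
import torch
from PIL import Image
import open_clip

# A LAION-trained CLIP backbone; the exact model used for curation may differ.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

@torch.no_grad()
def embed_images(paths):
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    feats = model.encode_image(batch)
    return feats / feats.norm(dim=-1, keepdim=True)

@torch.no_grad()
def embed_text(prompts):
    feats = model.encode_text(tokenizer(prompts))
    return feats / feats.norm(dim=-1, keepdim=True)

# Hypothetical candidate pool (a stand-in for LAION) and two kinds of queries.
pool = embed_images(["cand_0.jpg", "cand_1.jpg", "cand_2.jpg"])

query_img = embed_images(["imagenet_seed.jpg"])     # img2img lookup
query_txt = embed_text(["a photo of a goldfish"])   # txt2img lookup

for name, q in [("img2img", query_img), ("txt2img", query_txt)]:
    sims = (q @ pool.T).squeeze(0)                  # cosine similarity
    topk = sims.topk(k=2)                           # keep nearest neighbors
    print(name, topk.indices.tolist(), topk.values.tolist())
```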

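For the synthetic shifts, txt2img generation can be as simple as prompting a text-to-image model with each class name. Here is a hedged sketch using the diffusers library; the checkpoint, prompt template, and sample counts are illustrative assumptions rather than the paper's exact recipe.

```python
import os
import torch
from diffusers import StableDiffusionPipeline

# Any Stable Diffusion checkpoint works here; this particular one is an assumption.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

class_names = ["goldfish", "tench", "great white shark"]  # ImageNet-style classes
images_per_class = 2
os.makedirs("sd1000", exist_ok=True)

for name in class_names:
    for i in range(images_per_class):
        # txt2img: the class name alone drives generation.
        image = pipe(f"a photo of a {name}").images[0]
        image.save(f"sd1000/{name.replace(' ', '_')}_{i}.png")
```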

[Figure: The ImageNet++ distribution shifts]

Key Findings

After extensive experimentation, in which we trained over 130 models on these datasets, we uncovered several important insights:

  1. Expert curation still reigns supreme: Despite advances in AI and data collection methods, no reduced-cost strategy outperformed the original expert-curated ImageNet dataset across all metrics.
  2. Embedding-based search shows promise: Among the reduced-cost methods, embedding-based search (used in LA1000 shifts) consistently outperformed synthetic data generation approaches.
  3. Human curation isn't always best: Surprisingly, the crowdsourced OI1000 dataset often underperformed compared to automated methods like LA1000, likely due to greater label imbalance.
  4. Bigger isn't always better: The smallest dataset, LA1000 (img2img), often outperformed larger datasets, highlighting the importance of curation quality over quantity.
  5. Image-conditioned methods outperform text-based ones: Across different curation strategies, methods that used images as a starting point (img2img) generally performed better than those using only text descriptions (txt2img).

[Figure: Comparative performance of data curation methods, visualized in a radar plot]

Implications and Future Work

The SELECT benchmark and the insights gained from ImageNet++ open up several important avenues for future research:

  1. Improving reduced-cost curation: While no method matched expert curation, the strong performance of embedding-based search suggests promising directions for developing more efficient curation techniques.
  2. Addressing class imbalance: The poor performance of the crowdsourced OI1000 dataset highlights the critical importance of maintaining class balance during data collection.
  3. Developing better quality metrics: Current image and label quality metrics showed little correlation with actual model performance, indicating a need for more sophisticated evaluation methods.
  4. Refining synthetic data generation: While synthetic data underperformed in this study, there's potential to improve these methods to better complement real-world datasets.

We are very excited about the future of SELECT and would like to actively partner with researchers to develop new methods for data curation. If you'd like to contribute or have any questions, please get in touch.

Citation

Feuer, B., Xu, J., Cohen, N., Yubeaton, P., Mittal, G., & Hegde, C. (2024). SELECT: A Large-Scale Benchmark of Data Curation Strategies for Image Classification. arXiv preprint arXiv:2410.05057.