Conceptual Captions is a dataset of (image-URL, caption) pairs designed for training and evaluating machine-learned image captioning systems.
See http://ai.google.com/research/ConceptualCaptions for details.
Automatic image captioning is the task of producing a natural-language utterance (usually a sentence) that correctly reflects the visual content of an image. Up to this point, the resource most used for this task was the MS-COCO dataset, containing around 120,000 images and 5-way image-caption annotations (produced by paid annotators).
Google's Conceptual Captions dataset has more than 3 million images, paired with natural-language captions. In contrast with the curated style of the MS-COCO images, Conceptual Captions images and their raw descriptions are harvested from the web, and therefore represent a wider variety of styles. The raw descriptions are harvested from the Alt-text HTML attribute associated with web images. We developed an automatic pipeline that extracts, filters, and transforms candidate image/caption pairs, with the goal of achieving a balance of cleanliness, informativeness, fluency, and learnability of the resulting captions.
More details are available in this paper (please cite the paper if you use or discuss this dataset in your work):
@inproceedings{sharma2018conceptual,
  title = {Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning},
  author = {Sharma, Piyush and Ding, Nan and Goodman, Sebastian and Soricut, Radu},
  booktitle = {Proceedings of ACL},
  year = {2018},
}

@article{ng2020understanding,
  title = {Understanding Guided Image Captioning Performance across Domains},
  author = {Edwin G. Ng and Bo Pang and Piyush Sharma and Radu Soricut},
  journal = {arXiv preprint arXiv:2012.02339},
  year = {2020}
}
The Conceptual Captions dataset release contains two splits: train (~3.3M examples) and validation (~16K examples). See Table 1 below for more details.
Table 1: Dataset stats.
Split | Examples | Unique Tokens | Tokens per Caption (Mean) | Tokens per Caption (StdDev) | Tokens per Caption (Median) |
---|---|---|---|---|---|
Train | 3,318,333 | 51,201 | 10.3 | 4.5 | 9.0 |
Valid | 15,840 | 10,900 | 10.4 | 4.7 | 9.0 |
Test (Hidden) | 12,559 | 9,645 | 10.2 | 4.6 | 9.0 |
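
For readers who want to compute comparable statistics on their own copy of the captions, the following is a minimal sketch. It assumes that tokens are separated by whitespace (the captions are already tokenized) and uses the population standard deviation; the exact tokenization and standard-deviation variant behind Table 1 are not stated here, so the numbers it produces may differ slightly. The function name `caption_stats` is hypothetical.

```python
import statistics


def caption_stats(captions):
    """Compute per-split statistics for an iterable of tokenized captions.

    Assumption: tokens are whitespace-separated. The standard deviation is
    the population variant (pstdev); the source does not say which was used.
    """
    lengths = []
    vocab = set()
    for caption in captions:
        tokens = caption.split()
        lengths.append(len(tokens))
        vocab.update(tokens)
    return {
        "examples": len(lengths),
        "unique_tokens": len(vocab),
        "mean": statistics.mean(lengths),
        "stddev": statistics.pstdev(lengths),
        "median": statistics.median(lengths),
    }


# Toy captions for illustration only; they are not taken from the dataset.
toy_stats = caption_stats(["a dog runs", "a cat", "the dog sleeps on a mat"])
```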
Hidden Test set
We are not releasing the official test split (~12.5K examples). Instead, we are hosting a competition (see http://ai.google.com/research/ConceptualCaptions) dedicated to supporting submissions and evaluations of model outputs on this blind test set.
We strongly believe that this setup has several advantages: a) it allows the evaluation to be done on a large, unbiased set of images; b) it keeps the test set completely blind and eliminates suspicions of fitting to the test set, cheating, etc.; c) overall, it provides a clean setup for advancing the state of the art (SoTA) on this task, including reporting reproducible results in paper publications.
The image labels are obtained using the Google Cloud Vision API (https://cloud.google.com/vision). Each image label has a machine-generated identifier (MID) corresponding to the label's Google Knowledge Graph entry and a confidence score for its presence in the image. These labels were obtained by running the same model as, and are presented in a fashion similar to, the image labels released with the T2 Guiding dataset at https://github.com/google-research-datasets/T2-Guiding.
The Conceptual Captions training and validation sets are provided as TSV (tab-separated values) text files with the following columns:
Table 2: Columns in Train/Validation TSV files.
Column | Description |
---|---|
1 | Caption. The text has been tokenized and lowercased. |
2 | Image URL |
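
A minimal sketch of reading the two columns from Table 2 is shown below. It assumes the files have no header row and that captions contain no tab characters; neither is stated above, so check your copy of the data. The function name `read_caption_pairs` is hypothetical, and the sample URL is a placeholder. Note that the files contain image URLs, not the images themselves.

```python
import csv
import io


def read_caption_pairs(tsv_file):
    """Yield (caption, image_url) tuples from an open two-column TSV file.

    Assumption: no header row. Quoting is disabled so that quote characters
    inside captions are kept as literal text.
    """
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 2:
            continue  # skip lines that do not have exactly two columns
        caption, image_url = row
        yield caption, image_url


# Illustration with an in-memory sample; for a real file, pass an object
# opened with open(path, encoding="utf-8", newline="").
sample = io.StringIO("a dog runs on the beach\thttp://example.com/1.jpg\n")
pairs = list(read_caption_pairs(sample))
```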
The image labels for a 2.0M subset of the training set are provided as TSV (tab-separated values) text files with the following columns:
Table 3: Columns in Image Labels TSV files.
Column | Description |
---|---|
1 | Caption. The text has been tokenized and lowercased. |
2 | Image URL |
3 | Image labels. Comma-separated list in descending order of confidence. |
4 | MIDs. Comma-separated list corresponding to the image labels list. |
5 | Confidence scores. Comma-separated list corresponding to the image labels list. |
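
The three comma-separated columns in Table 3 are parallel lists, so a row can be unpacked into aligned (label, MID, score) triples. The sketch below assumes that individual labels contain no commas and that scores parse as floats; the names `LabeledImage`, `parse_label_row`, and `labels_above` are hypothetical, and the MIDs in the sample line are made-up placeholders rather than real Knowledge Graph identifiers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LabeledImage:
    caption: str
    url: str
    labels: List[str]
    mids: List[str]
    scores: List[float]


def _split_list(field):
    # An empty field means an empty list rather than one empty string.
    return field.split(",") if field else []


def parse_label_row(line):
    """Parse one line of an image-labels TSV file into a LabeledImage."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, got {len(fields)}")
    caption, url, labels, mids, scores = fields
    record = LabeledImage(
        caption=caption,
        url=url,
        labels=_split_list(labels),
        mids=_split_list(mids),
        scores=[float(s) for s in _split_list(scores)],
    )
    # Columns 3 to 5 are parallel lists, so their lengths must agree.
    if not (len(record.labels) == len(record.mids) == len(record.scores)):
        raise ValueError("label, MID, and score lists differ in length")
    return record


def labels_above(record, threshold):
    """Return the labels whose confidence score is at least `threshold`."""
    return [l for l, s in zip(record.labels, record.scores) if s >= threshold]


# Made-up sample line; the MIDs are placeholders.
row = parse_label_row(
    "a dog on the beach\thttp://example.com/1.jpg\tdog,beach\t/m/aaa,/m/bbb\t0.97,0.81"
)
confident = labels_above(row, 0.9)
```

Filtering by score is only one possible use of the confidence column; the threshold of 0.9 here is arbitrary.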
If you have a technical question regarding the dataset, code or publication, please create an issue in this repository. This is the fastest way to reach us.
If you would like to share feedback or report concerns, please email us at [email protected]