Wikipedia Picture of the Day Dataset

[Receive updates on further data collection here]

Wikipedia has a Picture of the Day section that presents one (usually beautiful) image per day, accompanied by a short paragraph describing the image. This dataset contains the images and text from Nov 1, 2004 to Feb 2, 2011.*

Sample Images 

Download:

Note: Images contained in the dataset are in the public domain. Licenses of individual images may vary. Please refer to the corresponding Wikipedia POTD pages for more details.

Relevant Paper:

In our ICCV paper, we collected the pictures and text descriptions to train a cross-modal latent topic model. Such a topic model learns cross-modal semantics so that images and text are mapped onto a shared latent space, in which content-based retrieval can be carried out. Below is a sample text query and images returned by our algorithm (top row) and Corr-LDA:

Retrieval Results 
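
To make the retrieval step concrete, here is a minimal sketch of ranking images against a text query once both have been mapped into a shared latent space. It is not the model from the paper: it assumes that topic vectors have already been inferred for the query and for each image, and it uses cosine similarity as one plausible way to compare them. The file names, the topic vectors, and the functions `cosine` and `rank_images` are all hypothetical and exist only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_images(query_vec, image_vecs, top_k=3):
    """Return the names of the top_k images most similar to the query vector."""
    scored = [(cosine(query_vec, vec), name) for name, vec in image_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:top_k]]


# Hypothetical latent topic vectors; in practice these would come from a trained model.
image_topics = {
    "bird.jpg": [0.8, 0.1, 0.1],
    "bridge.jpg": [0.1, 0.8, 0.1],
    "nebula.jpg": [0.1, 0.1, 0.8],
}
query_topics = [0.7, 0.2, 0.1]  # assumed topic vector inferred from a text query

print(rank_images(query_topics, image_topics, top_k=2))
# prints ['bird.jpg', 'bridge.jpg']
```

Because the query and the images live in the same space, the same ranking function would work in the other direction (image query, text results) without modification.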

* Our ICCV paper uses only the segment from Nov 1, 2004 to Oct 30, 2010.