Wikipedia Picture of the Day Dataset

Note: the dataset below is the smaller-scale dataset used in our ICCV 2011 paper. For the larger-scale dataset used in our UAI 2012 paper, you can download the bag-of-words feature files here (509 MB). Metadata for the images and pages is not provided yet, as we are looking for a more affordable way to host the image thumbnails.

Wikipedia has a Picture of the Day section that presents one (usually beautiful) image per day, along with a short paragraph describing it. The following dataset contains the images and text from Nov 1, 2004 to Feb 2, 2011.*

Sample Images 

Download:

Note: Images contained in the dataset are in the public domain. Licenses of individual images may vary. Please refer to the corresponding Wikipedia POTD pages for more details.

Relevant Paper:

In our ICCV paper, we collected the pictures and text descriptions to train a cross-modal latent topic model. Such a topic model learns cross-modal semantics so that images and text are mapped onto a shared latent space, in which content-based retrieval can be carried out. Below is a sample text query and images returned by our algorithm (top row) and Corr-LDA:

Retrieval Results 
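
To illustrate the general idea of retrieval in a shared latent space, here is a minimal sketch. It assumes that a text query and each image have already been mapped to topic-proportion vectors of the same length, and it ranks images by cosine similarity to the query. The similarity measure, the function names, and the toy vectors are all assumptions made for illustration; they are not taken from the paper or from the dataset files.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_images(query_topics, image_topics):
    """Return image ids sorted by decreasing similarity to the query.

    query_topics: topic vector for the text query (hypothetical input).
    image_topics: dict mapping image id -> topic vector (hypothetical input).
    """
    scores = {img: cosine(query_topics, vec) for img, vec in image_topics.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy latent representations, invented for this example only.
image_topics = {
    "img_a": [0.8, 0.1, 0.1],
    "img_b": [0.1, 0.8, 0.1],
    "img_c": [0.3, 0.3, 0.4],
}
query_topics = [0.7, 0.2, 0.1]

ranking = rank_images(query_topics, image_topics)
print(ranking)
```

Because both modalities live in the same space, the same ranking function could in principle be used in the other direction (an image query against text descriptions), given topic vectors for the text.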

* Our ICCV paper uses only the segment from Nov 1, 2004 to Oct 30, 2010.