Discovering Word Associations in News Media via Feature Selection and Sparse Classification
Our work was accepted into the proceedings of the 11th ACM SIGMM International Conference on Multimedia Information Retrieval. This page collects supplementary material for the paper.
The paper
You can download a PDF copy of the paper here.
The presentation
You can download a PDF copy of the slides presented at MIR 2010 here.
The data
- NYTWData.txt, [44.6 MB]: A tab-delimited text file encoding the appearance of words across paragraphs. Each paragraph-word pair present in the data occupies one line: Column 1 gives the paragraph ID, Column 2 gives the word ID, and Column 3 gives the number of times the word appears in the paragraph.
- NYTWDict.txt, [1.1 MB]: A tab-delimited text file in which each line provides a word identification number (as used in the matrix above) and its associated plaintext word.
- NYTWStops.txt, [4 KB]: A text file listing words and word IDs deemed a priori uninteresting; these were dropped from the above matrix when conducting our imaging experiments.
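The file layouts described above can be read with a few lines of Python. The following is a minimal sketch, not the authors' code: the function names (`load_counts`, `load_dictionary`, `drop_stop_words`) are hypothetical, the IDs and counts are assumed to be integers, and the stop-word IDs are passed in as a plain set because the exact layout of NYTWStops.txt is not specified here.

```python
from collections import defaultdict


def load_counts(lines):
    """Parse tab-delimited (paragraph ID, word ID, count) lines.

    Returns a nested dict {paragraph_id: {word_id: count}}, i.e. a sparse
    paragraph-by-word matrix. Assumes all three columns are integers.
    """
    counts = defaultdict(dict)
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        par_id, word_id, n = int(fields[0]), int(fields[1]), int(fields[2])
        counts[par_id][word_id] = n
    return dict(counts)


def load_dictionary(lines):
    """Parse tab-delimited (word ID, word) lines into {word_id: word}."""
    vocab = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 2:
            continue
        vocab[int(fields[0])] = fields[1]
    return vocab


def drop_stop_words(counts, stop_ids):
    """Return a copy of the matrix without the columns listed in stop_ids."""
    return {
        par_id: {w: n for w, n in row.items() if w not in stop_ids}
        for par_id, row in counts.items()
    }
```

Assuming the files sit in the working directory, usage might look like `with open("NYTWData.txt") as f: counts = load_counts(f)`. The nested dictionary keeps only the paragraph-word pairs that actually occur, which mirrors the sparse layout of the file itself.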
The experiments
- NYTWQueries.txt, [4 KB]: A text file listing the 47 words and their associated IDs used as labels for our imaging experiments.
- NYTWSplits.zip, [11.3 MB]: A ZIP file containing 47 text files, each corresponding to one of the 47 query words. Each line of each text file gives a paragraph ID together with a designation indicating training set membership (a value of 0) or test set membership (a value of 1) for the particular query.
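A split file of the kind described above can be turned into training and test sets of paragraph IDs as follows. This is a rough sketch under stated assumptions: the function name `load_split` is hypothetical, and each line is assumed to hold the paragraph ID first and the 0/1 designation second, separated by whitespace (the exact delimiter and column order are not documented here).

```python
def load_split(lines):
    """Parse (paragraph ID, flag) lines into (train_ids, test_ids).

    Assumes whitespace-separated columns with the paragraph ID first.
    A flag of 0 marks a training paragraph, 1 marks a test paragraph.
    """
    train_ids, test_ids = set(), set()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        par_id, flag = int(fields[0]), int(fields[1])
        if flag == 0:
            train_ids.add(par_id)
        elif flag == 1:
            test_ids.add(par_id)
        else:
            raise ValueError(f"unexpected split flag {flag} for paragraph {par_id}")
    return train_ids, test_ids
```

Since there is one file per query, the function would be called once for each of the 47 files, giving a separate train/test partition for every query word.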
The results
A PDF report documenting the words selected by each feature selection method across each query will be available for download shortly.
The authors
This work was conducted as part of the StatNews Project.