Supervised Text Region Identification on Historical Documents

Jonathan Eng

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2015-254

December 18, 2015

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-254.pdf

We present multi-column text region identification support for Ocular, the unsupervised historical printed document transcription project of Berg-Kirkpatrick et al. (2013). We use structured prediction with rich features defined on the input document and incorporate a transition model based on prior document layout assumptions. Our model is trained using a structured-SVM objective on a randomly selected set of historical documents from The Proceedings of the Old Bailey corpus. For learning, we use loss-augmented Viterbi decoding with a weighted Hamming loss function. Our suite of features achieves a 37.4 F1 text and 39.4 F1 non-text improvement in text region identification over the Ocular baseline text cropper.
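To make the learning step named in the abstract concrete, the following is a minimal sketch of loss-augmented Viterbi decoding with a weighted Hamming loss. It is not the report's implementation. The two-label setup (0 for non-text, 1 for text), the function names `weighted_hamming` and `loss_augmented_viterbi`, and the per-label weights are all assumptions chosen for illustration. Because the Hamming loss decomposes over positions, it can be folded into the per-position scores before an ordinary Viterbi pass.

```python
from typing import List, Sequence, Tuple


def weighted_hamming(pred: Sequence[int], gold: Sequence[int],
                     weights: Sequence[float]) -> float:
    """Sum weights[g] over positions where the prediction differs from gold label g."""
    return sum(weights[g] for p, g in zip(pred, gold) if p != g)


def loss_augmented_viterbi(unary: Sequence[Sequence[float]],
                           trans: Sequence[Sequence[float]],
                           gold: Sequence[int],
                           weights: Sequence[float]) -> Tuple[List[int], float]:
    """Return the label sequence maximizing model score plus weighted Hamming loss.

    unary[t][y]  : hypothetical feature score of label y at position t
    trans[p][y]  : hypothetical transition score from label p to label y
    gold[t]      : gold label at position t
    weights[g]   : assumed cost of mislabeling a position whose gold label is g
    """
    n, k = len(unary), len(trans)
    # Fold the decomposable loss into the unary scores.
    aug = [[unary[t][y] + (weights[gold[t]] if y != gold[t] else 0.0)
            for y in range(k)] for t in range(n)]

    best = [list(aug[0])]          # best[t][y]: best score of a prefix ending in y
    back: List[List[int]] = []     # back[t-1][y]: best predecessor of y at position t
    for t in range(1, n):
        row, ptr = [], []
        for y in range(k):
            prev = max(range(k), key=lambda p: best[-1][p] + trans[p][y])
            row.append(best[-1][prev] + trans[prev][y] + aug[t][y])
            ptr.append(prev)
        best.append(row)
        back.append(ptr)

    # Recover the argmax path by following back-pointers from the best final label.
    y = max(range(k), key=lambda j: best[-1][j])
    total = best[-1][y]
    path = [y]
    for ptr in reversed(back):
        y = ptr[y]
        path.append(y)
    path.reverse()
    return path, total


if __name__ == "__main__":
    unary = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    trans = [[0.0, 0.0], [0.0, 0.0]]
    gold = [0, 1, 0]
    weights = [2.0, 3.0]  # assumed: missing text costs more than missing non-text
    print(loss_augmented_viterbi(unary, trans, gold, weights))
```

In a structured-SVM training loop, the sequence returned here would serve as the most-violated constraint: the update moves weight toward the gold sequence's features and away from those of the returned sequence. Setting all weights to zero reduces the routine to plain Viterbi decoding.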

Advisor: Daniel Klein


BibTeX citation:

@mastersthesis{Eng:EECS-2015-254,
    Author= {Eng, Jonathan},
    Title= {Supervised Text Region Identification on Historical Documents},
    School= {EECS Department, University of California, Berkeley},
    Year= {2015},
    Month= {Dec},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-254.html},
    Number= {UCB/EECS-2015-254},
    Abstract= {We present multi-column text region identification support for Ocular, the unsupervised historical printed document transcription project of Berg-Kirkpatrick et al. (2013). We use structured prediction with rich features defined on the input document and incorporate a transition model based on prior document layout assumptions. Our model is trained using a structured-SVM objective on a randomly selected set of historical documents from The Proceedings of the Old Bailey corpus. For learning, we use loss-augmented Viterbi decoding with a weighted Hamming loss function. Our suite of features achieves a 37.4 F1 text and 39.4 F1 non-text improvement in text region identification over the Ocular baseline text cropper.},
}

EndNote citation:

%0 Thesis
%A Eng, Jonathan 
%T Supervised Text Region Identification on Historical Documents
%I EECS Department, University of California, Berkeley
%D 2015
%8 December 18
%@ UCB/EECS-2015-254
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-254.html
%F Eng:EECS-2015-254