Electrical Engineering
      and Computer Sciences

Electrical Engineering and Computer Sciences

COLLEGE OF ENGINEERING

UC Berkeley

Selectivity Estimation for Extraction Operators over Text Data

Daisy Zhe Wang, Long Wei, Yunyao Li, Frederick Reiss and Shivakumar Vaithyanathan

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2010-107
July 2, 2010

http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-107.pdf

Recently, there has been increasing interest in extending relational query processing to efficiently support extraction operators, such as dictionaries and regular expressions, over text data. Many text processing queries are sophisticated in that they involve multiple ex- traction and join operators, resulting in many possible query plans. However, there has been little research on building the selectivity or cost estimation for these extraction operators, which is crucial for an optimizer to pick a good query plan. In this paper, we define the problem of selectivity estimation for dictionaries and regular expressions, and propose to develop document synopses over a text corpus, from which the selectivity can be estimated. We develop three classes of document synopses: n-gram synopsis, bloom filter synopsis and roll-up synopsis. We also develop techniques to decompose a complicated regular expression into sub-parts to achieve more effective and accurate estimation. We conduct experiments over the Enron email corpus using both real-world and synthetic workloads to compare the accuracy of the selectivity estimation over different classes and variations of synopses. The results show that, the top-k stratified bloom filter synopsis and the roll-up synopsis is the most accurate in dictionary and regular expression selectivity estimation respectively


BibTeX citation:

@techreport{Wang:EECS-2010-107,
    Author = {Wang, Daisy Zhe and Wei, Long and Li, Yunyao and Reiss, Frederick and Vaithyanathan, Shivakumar},
    Title = {Selectivity Estimation for Extraction Operators over Text Data},
    Institution = {EECS Department, University of California, Berkeley},
    Year = {2010},
    Month = {Jul},
    URL = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-107.html},
    Number = {UCB/EECS-2010-107},
    Abstract = {Recently, there has been increasing interest in extending relational query processing to efficiently support extraction operators, such as dictionaries and regular expressions, over text data. Many text processing queries are sophisticated in that they involve multiple ex-
traction and join operators, resulting in many possible query plans. However, there has been little research on building the selectivity or cost estimation for these extraction operators, which is crucial for an optimizer to pick a good query plan. In this paper, we define the problem of selectivity estimation for dictionaries and regular expressions, and propose to develop document synopses over a text corpus, from which the selectivity can be estimated. We develop three classes of document synopses: n-gram synopsis, bloom filter synopsis and roll-up synopsis. We also develop techniques to decompose a complicated regular expression into sub-parts to achieve
more effective and accurate estimation. We conduct experiments over the Enron email corpus using both real-world and synthetic workloads to compare the accuracy of the selectivity estimation over different classes and variations of synopses. The results show that, the top-k stratified bloom filter synopsis and the roll-up synopsis is the most accurate in dictionary and regular expression selectivity estimation respectively}
}

EndNote citation:

%0 Report
%A Wang, Daisy Zhe
%A Wei, Long
%A Li, Yunyao
%A Reiss, Frederick
%A Vaithyanathan, Shivakumar
%T Selectivity Estimation for Extraction Operators over Text Data
%I EECS Department, University of California, Berkeley
%D 2010
%8 July 2
%@ UCB/EECS-2010-107
%U http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-107.html
%F Wang:EECS-2010-107