We therefore need new solutions to bothproblems mentioned in the previoussections.To
proceed further we observe that estimation of the probability
distribution in the large alphabet setting is a futile task--Kieffer's
result mentioned in Section 1.1
can be used to show this formally. Therefore, what we try to estimate
is the probability of observing just those events already in the data
sample.
In the compression framework, this can be seen to be equivalent to
separating [7] the description of the data strings over large
alphabets into two parts: description of the symbols appearing in the
sequence, and of their pattern, the order in which the symbols
appear. For instance, to describe the sequence ``Sam I am that Sam I
am I do not like that Sam I am'' over the symbol space of English
words, this approach separates the pattern, 123412325674123 from the
dictionary,
1
2
3
4
5
6
7
Sam
I
am
that
do
not
like
.
Patterns have an interesting combinatorial structure [3],
and using Hardy and Ramanujan's celebrated results on the number of
integer partitions [8,9], we show that patterns of i.i.d.
sequences can be universally compressed with vanishing
redundancy [7], see also [10]. These results have
since been extended to more complex distributions as well [11].
These results have their analog in estimation as well. We also
analyze a few variants of the Good Turing estimator, showing that
while they outperform other commonly used estimators, none is optimal
in general. We subsequently develop two provably asymptotically
optimal estimators. These results appeared in
Science [12].