
Parallel Colt 0.7.2  
PREV CLASS NEXT CLASS  FRAMES NO FRAMES  
SUMMARY: NESTED  FIELD  CONSTR  METHOD  DETAIL: FIELD  CONSTR  METHOD 
java.lang.Object cern.colt.PersistentObject hep.aida.tdouble.bin.AbstractDoubleBin hep.aida.tdouble.bin.AbstractDoubleBin1D hep.aida.tdouble.bin.StaticDoubleBin1D hep.aida.tdouble.bin.MightyStaticDoubleBin1D hep.aida.tdouble.bin.QuantileDoubleBin1D
public class QuantileDoubleBin1D
1dimensional nonrebinnable bin holding double elements with scalable quantile operations defined upon; Using little main memory, quickly computes approximate quantiles over very large data sequences with and even without apriori knowledge of the number of elements to be filled; Conceptually a strongly lossily compressed multiset (or bag); Guarantees to respect the worst case approximation error specified upon instance construction. First see the package summary and javadoc tree view to get the broad picture.
Motivation and Problem: Intended to help scale applications requiring
quantile computation. Quantile computation on very large data sequences is
problematic, for the following reasons: Computing quantiles requires sorting
the data sequence. To sort a data sequence the entire data sequence needs to
be available. Thus, data cannot be thrown away during filling (as done by
static bins like StaticDoubleBin1D
and
MightyStaticDoubleBin1D
). It needs to be kept, either in main memory
or on disk. There is often not enough main memory available. Thus, during
filling data needs to be streamed onto disk. Sorting disk resident data is
prohibitively time consuming. As a consequence, traditional methods either
need very large memories (like DynamicDoubleBin1D
) or time consuming
disk based sorting.
This class proposes to efficiently solve the problem, at the expense of producing approximate rather than exact results. It can deal with infinitely many elements without resorting to disk. The main memory requirements are smaller than for any other known approximate technique by an order of magnitude. They get even smaller if an upper limit on the maximum number of elements ever to be added is known apriori.
Approximation error: The approximation guarantees are parametrizable and explicit but probabilistic, and apply for arbitrary value distributions and arrival distributions of the data sequence. In other words, this class guarantees to respect the worst case approximation error specified upon instance construction to a certain probability. Of course, if it is specified that the approximation error should not exceed some number very close to zero, this class will potentially consume just as much memory as any of the traditional exact techniques would do. However, for errors larger than 10^{5}, its memory requirements are modest, as shown by the table below.
Main memory requirements: Given in megabytes, assuming a single element (double) takes 8 byte. The number of elements required is then MB*1024*1024/8.
Parameters:
Required main memory [MB] 

#quantiles  epsilon 
delta  N unknown  N known  
N_{max}=inf  N_{max}=10^{6}  N_{max}=10^{7}  N_{max}=10^{8}  N_{max}=inf  N_{max}=10^{6}  N_{max}=10^{7}  N_{max}=10^{8}  
any  0 
any  infinity  7.6  76  762  infinity  7.6  76  762  
any  10^{ 1} 
0  infinity  0.003  0.005  0.006  0.03  0.003  0.005  0.006  
10^{ 2} 
0.02  0.03  0.05  0.31  0.02  0.03  0.05  
10^{ 3} 
0.12  0.2  0.3  2.7  0.12  0.2  0.3  
10^{ 4} 
0.6  1.2  2.1  26.9  0.6  1.2  2.1  
10^{ 5} 
2.5  6.4  11.6  205  2.5  6.4  11.6  
10^{ 6} 
7.6  25.4  63.6  1758  7.6  25.4  63.6  
100 
10^{ 2} 
10^{ 1}  0.033  0.021  0.03  0.03  0.020  0.020  0.020  0.020  
10^{ 5}  0.038  0.021  0.03  0.04  0.024  0.020  0.020  0.020  
10^{ 3} 
10^{ 1}  0.48  0.12  0.2  0.3  0.32  0.12  0.2  0.3  
10^{ 5}  0.54  0.12  0.2  0.3  0.37  0.12  0.2  0.3  
10^{ 4} 
10^{ 1}  6.6  0.6  1.2  2.1  4.6  0.6  1.2  2.1  
10^{ 5}  7.2  0.6  1.2  2.1  5.2  0.6  1.2  2.1  
10^{ 5} 
10^{ 1}  86  2.5  6.4  11.6  63  2.5  6.4  11.6  
10^{ 5}  94  2.5  6.4  11.6  70  2.5  6.4  11.6  
10000 
10^{ 2} 
10^{ 1}  0.04  0.02  0.03  0.04  0.02  0.02  0.02  0.02  
10^{ 5}  0.04  0.02  0.03  0.04  0.03  0.02  0.03  0.03  
10^{ 3} 
10^{ 1}  0.52  0.12  0.21  0.3  0.35  0.12  0.21  0.3  
10^{ 5}  0.56  0.12  0.21  0.3  0.38  0.12  0.21  0.3  
10^{ 4} 
10^{ 1}  7.0  0.64  1.2  2.1  5.0  0.64  1.2  2.1  
10^{ 5}  7.5  0.64  1.2  2.1  5.4  0.64  1.2  2.1  
10^{ 5} 
10^{ 1}  90  2.5  6.4  11.6  67  2.5  6.4  11.6  
10^{ 5}  96  2.5  6.4  11.6  71  2.5  6.4  11.6  
#quantiles  epsilon  delta  N_{max}=inf  N_{max}=10^{6}  N_{max}=10^{7}  N_{max}=10^{8}  N_{max}=inf  N_{max}=10^{6}  N_{max}=10^{7}  N_{max}=10^{8}  
N unknown  N known  
Required main memory [MB] 
Implementation:
After: Gurmeet Singh Manku, Sridhar Rajagopalan and Bruce G. Lindsay, Random Sampling Techniques for Space Efficient Online Computation of Order Statistics of Large Datasets. Proc. of the 1999 ACM SIGMOD Int. Conf. on Management of Data, Paper available here.
and
Gurmeet Singh Manku, Sridhar Rajagopalan and Bruce G. Lindsay, Approximate Medians and other Quantiles in One Pass and with Limited Memory, Proc. of the 1998 ACM SIGMOD Int. Conf. on Management of Data, Paper available here.
The broad picture is as follows. Two concepts are used: Shrinking and Sampling. Shrinking takes a data sequence, sorts it and produces a shrinked data sequence by picking every kth element and throwing away all the rest. The shrinked data sequence is an approximation to the original data sequence.
Imagine a large data sequence (residing on disk or being generated in memory on the fly) and a main memory block of n=b*k elements ( b is the number of buffers, k is the number of elements per buffer). Fill elements from the data sequence into the block until it is full or the data sequence is exhausted. When the block (or a subset of buffers) is full and the data sequence is not exhausted, apply shrinking to lossily compress a number of buffers into one single buffer. Repeat these steps until all elements of the data sequence have been consumed. Now the block is a shrinked approximation of the original data sequence. Treating it as if it would be the original data sequence, we can determine quantiles in main memory.
Now, the whole thing boils down to the question of: Can we choose b and k (the number of buffers and the buffer size) such that b*k is minimized, yet quantiles determined upon the block are guaranteed to be away from the true quantiles no more than some epsilon? It turns out, we can. It also turns out that the required main memory block size n=b*k is usually moderate (see the table above).
The theme can be combined with random sampling to further reduce main memory requirements, at the expense of probabilistic guarantees. Sampling filters the data sequence and feeds only selected elements to the algorithm outlined above. Sampling is turned on or off, depending on the parametrization.
This quick overview does not go into important details, such as assigning proper weights to buffers, how to choose subsets of buffers to shrink, etc. For more information consult the papers cited above.
Time Performance:
Performance  
Quantiles  Epsilon  Delta  Filling [#elements/sec] 
Quantile computation [#quantiles/sec] 

N unknown, N_{max}=inf 
N known, N_{max}=10^{7} 
N unknown, N_{max}=inf 
N known, N_{max}=10^{7} 

10^{4}  10 ^{ 1}  10 ^{ 1} 
1600000 
1300000  250000  130000  
10 ^{ 2}  360000  1200000  50000  20000  
10 ^{ 3}  150000  200000  3600  3000  
10 ^{ 4}  120000  170000  80  1000 
cern.jet.stat.tdouble.quantile
,
Serialized FormField Summary 

Fields inherited from class cern.colt.PersistentObject 

serialVersionUID 
Constructor Summary  

QuantileDoubleBin1D(boolean known_N,
long N,
double epsilon,
double delta,
int quantiles,
DoubleRandomEngine randomGenerator)
Equivalent to new QuantileBin1D(known_N, N, epsilon, delta, quantiles, randomGenerator, false, false, 2) . 

QuantileDoubleBin1D(boolean known_N,
long N,
double epsilon,
double delta,
int quantiles,
DoubleRandomEngine randomGenerator,
boolean hasSumOfLogarithms,
boolean hasSumOfInversions,
int maxOrderForSumOfPowers)
Constructs and returns an empty bin that, under the given constraints, minimizes the amount of memory needed. 

QuantileDoubleBin1D(double epsilon)
Equivalent to new QuantileBin1D(false, Long.MAX_VALUE, epsilon, 0.001, 10000, new cern.jet.random.engine.DRand(new java.util.Date()) . 
Method Summary  

void 
addAllOfFromTo(DoubleArrayList list,
int from,
int to)
Adds the part of the specified list between indexes from (inclusive) and to (inclusive) to the receiver. 
void 
clear()
Removes all elements from the receiver. 
Object 
clone()
Returns a deep copy of the receiver. 
String 
compareWith(AbstractDoubleBin1D other)
Computes the deviations from the receiver's measures to another bin's measures. 
double 
median()
Returns the median. 
double 
quantile(double phi)
Computes and returns the phiquantile. 
double 
quantileInverse(double element)
Returns how many percent of the elements contained in the receiver are <= element. 
DoubleArrayList 
quantiles(DoubleArrayList phis)
Returns the quantiles of the specified percentages. 
int 
sizeOfRange(double minElement,
double maxElement)
Returns how many elements are contained in the range [minElement,maxElement]. 
MightyStaticDoubleBin1D[] 
splitApproximately(DoubleArrayList percentages,
int k)
Divides (rebins) a copy of the receiver at the given percentage boundaries into bins and returns these bins, such that each bin approximately reflects the data elements of its range. 
MightyStaticDoubleBin1D[] 
splitApproximately(DoubleIAxis axis,
int k)
Divides (rebins) a copy of the receiver at the given interval boundaries into bins and returns these bins, such that each bin approximately reflects the data elements of its range. 
String 
toString()
Returns a String representation of the receiver. 
Methods inherited from class hep.aida.tdouble.bin.MightyStaticDoubleBin1D 

geometricMean, getMaxOrderForSumOfPowers, getMinOrderForSumOfPowers, harmonicMean, hasSumOfInversions, hasSumOfLogarithms, hasSumOfPowers, kurtosis, moment, product, skew, sumOfInversions, sumOfLogarithms, sumOfPowers 
Methods inherited from class hep.aida.tdouble.bin.StaticDoubleBin1D 

add, isRebinnable, max, min, size, sum, sumOfSquares 
Methods inherited from class hep.aida.tdouble.bin.AbstractDoubleBin1D 

addAllOf, buffered, equals, mean, rms, standardDeviation, standardError, trimToSize, variance 
Methods inherited from class hep.aida.tdouble.bin.AbstractDoubleBin 

center, center, error, error, offset, offset, value, value 
Methods inherited from class java.lang.Object 

getClass, hashCode, notify, notifyAll, wait, wait, wait 
Constructor Detail 

public QuantileDoubleBin1D(double epsilon)
public QuantileDoubleBin1D(boolean known_N, long N, double epsilon, double delta, int quantiles, DoubleRandomEngine randomGenerator)
public QuantileDoubleBin1D(boolean known_N, long N, double epsilon, double delta, int quantiles, DoubleRandomEngine randomGenerator, boolean hasSumOfLogarithms, boolean hasSumOfInversions, int maxOrderForSumOfPowers)
known_N
 specifies whether the number of elements over which quantiles
are to be computed is known or not.
N
 if known_N==true, the number of elements over which
quantiles are to be computed. if known_N==false, the
upper limit on the number of elements over which quantiles are
to be computed. In other words, the maximum number of elements
ever to be added. If such an upper limit is apriori unknown,
then set N = Long.MAX_VALUE.
epsilon
 the approximation error which is guaranteed not to be exceeded
(e.g. 0.001) (0 <= epsilon <= 1). To
get exact rather than approximate quantiles, set
epsilon=0.0;
delta
 the allowed probability that the actual approximation error
exceeds epsilon (e.g. 0.0001) (0 <= delta <=
1). To avoid probabilistic answers, set delta=0.0.
For example, delta = 0.0001 is equivalent to a
confidence of 99.99%.
quantiles
 the number of quantiles to be computed (e.g. 100) (
quantiles >= 1). If unknown in advance, set this
number large, e.g. quantiles >= 10000.
randomGenerator
 a uniform random number generator. Set this parameter to
null to use a default generator seeded with the
current time.
The next three parameters specify additional capabilities
unrelated to quantile computation. They are identical to the
one's defined in the constructor of the parent class
MightyStaticDoubleBin1D
.
hasSumOfLogarithms
 Tells whether MightyStaticDoubleBin1D.sumOfLogarithms()
can return meaningful
results. Set this parameter to false if measures of
sum of logarithms, geometric mean and product are not
required.
hasSumOfInversions
 Tells whether MightyStaticDoubleBin1D.sumOfInversions()
can return meaningful
results. Set this parameter to false if measures of
sum of inversions, harmonic mean and sumOfPowers(1) are not
required.
maxOrderForSumOfPowers
 The maximum order k for which
MightyStaticDoubleBin1D.sumOfPowers(int)
can return meaningful results. Set
this parameter to at least 3 if the skew is required, to at
least 4 if the kurtosis is required. In general, if moments
are required set this parameter at least as large as the
largest required moment. This method always substitutes
Math.max(2,maxOrderForSumOfPowers) for the parameter
passed in. Thus, sumOfPowers(0..2) always returns
meaningful results.Method Detail 

public void addAllOfFromTo(DoubleArrayList list, int from, int to)
addAllOfFromTo
in class MightyStaticDoubleBin1D
list
 the list of which elements shall be added.from
 the index of the first element to be added (inclusive).to
 the index of the last element to be added (inclusive).
IndexOutOfBoundsException
 if
list.size()>0 && (from<0  from>to  to>=list.size())
.public void clear()
clear
in class StaticDoubleBin1D
public Object clone()
clone
in class MightyStaticDoubleBin1D
public String compareWith(AbstractDoubleBin1D other)
compareWith
in class MightyStaticDoubleBin1D
other
 the other bin to compare with
public double median()
public double quantile(double phi)
phi
 the percentage for which the quantile is to be computed. phi
must be in the interval (0.0,1.0].
public double quantileInverse(double element)
element
 the element to search for.
public DoubleArrayList quantiles(DoubleArrayList phis)
quantile(double)
various times.
phis
 the percentages for which quantiles are to be computed. Each
percentage must be in the interval (0.0,1.0].
percentages must be sorted ascending.
public int sizeOfRange(double minElement, double maxElement)
minElement
 the minimum element to search for.maxElement
 the maximum element to search for.
public MightyStaticDoubleBin1D[] splitApproximately(DoubleArrayList percentages, int k)
The split(...) methods are particularly well suited for realtime interactive rebinning (the famous "scrolling slider" effect).
Passing equidistant percentages like (0.0, 0.2, 0.4, 0.6, 0.8, 1.0) into this method will yield bins of an equidepth histogram, i.e. a histogram with bin boundaries adjusted such that each bin contains the same number of elements, in this case 20% each. Equidepth histograms can be useful if, for example, not enough properties of the data to be captured are known apriori to be able to define reasonable bin boundaries (partitions). For example, when guesses about minimas and maximas are strongly unreliable. Or when chances are that by focussing too much on one particular area other important areas and characters of a data set may be missed.
Implementation:
The receiver is divided into s = percentages.size()1 intervals (bins). For each interval I, its minimum and maximum elements are determined based upon quantile computation. Further, each interval I is split into k equipercentdistant subintervals (subbins). In other words, an interval is split into subintervals such that each subinterval contains the same number of elements.
For each subinterval S, its minimum and maximum are determined, again, based upon quantile computation. They yield an approximate arithmetic mean am = (min+max)/2 of the subinterval. A subinterval is treated as if it would contain only elements equal to the mean am. Thus, if the subinterval contains, say, n elements, it is assumed to consist of n mean elements (am,am,...,am). A subinterval's sum of elements, sum of squared elements, sum of inversions, etc. are then approximated using such a sequence of mean elements.
Finally, the statistics measures of an interval I are computed by summing up (integrating) the measures of its subintervals.
Accuracy:
Depending on the accuracy of quantile computation and the number of
subintervals per interval (the resolution). Objects of this class compute
exact or approximate quantiles, depending on the parameters used upon
instance construction. Objects of subclasses may always compute
exact quantiles, as is the case for DynamicDoubleBin1D
. Most
importantly for this class QuantileBin1D, a reasonably small
epsilon (e.g. 0.01, perhaps 0.001) should be used upon instance
construction. The confidence parameter delta is less important,
you may find delta=0.00001 appropriate.
The larger the resolution, the smaller the approximation error, up to
some limit. Integrating over only a few subintervals per interval will
yield very crude approximations. If the resolution is set to a reasonably
large number, say 10..100, more small subintervals are integrated,
resulting in more accurate results.
Note that for good accuracy, the number of quantiles computable with the
given approximation guarantees should upon instance construction be
specified, so as to satisfy
quantiles > resolution * (percentages.size()1)
Example:
resolution=2, percentList = (0.0, 0.1, 0.2, 0.5, 0.9, 1.0) means
the receiver is to be split into 5 bins:
The statistics measures for each bin are to be computed at a resolution of 2 subbins per bin. Thus, the statistics measures of a bin are the integrated measures over 2 subbins, each containing the same amount of elements:
Lets concentrate on the subbins of bin 0.
Assume the entire data set consists of N=100 elements.
Finally, the statistics measures of bin 0 are computed by summing up (integrating) the measures of its subintervals: Bin 0 has a size of N*(10%0%)=10 elements (we knew that already), sum of 1625+2250=3875, sum of squares of 528125+1012500=1540625, sum of inversions of 0.015+0.01=0.025, etc. From these follow other measures such as mean=3875/10=387.5, rms = sqrt(1540625 / 10)=392.5, etc. The other bins are computes analogously.
percentages
 the percentage boundaries at which the receiver shall be
split.k
 the desired number of subintervals per interval.public MightyStaticDoubleBin1D[] splitApproximately(DoubleIAxis axis, int k)
splitApproximately(DoubleArrayList,int)
do the real work.
axis
 an axis defining interval boundaries.k
 the desired number of subintervals per interval.public String toString()
toString
in class MightyStaticDoubleBin1D

Parallel Colt 0.7.2  
PREV CLASS NEXT CLASS  FRAMES NO FRAMES  
SUMMARY: NESTED  FIELD  CONSTR  METHOD  DETAIL: FIELD  CONSTR  METHOD 