Class Notes 10/3/01
Last Time:
- in general context of: Heavy tailed durations, Long Range Dependence
- revisited protocol background
- Mice and Elephants graphic, with "one minute split" flows
- introduced simple model
- investigated modelling assumptions
- SiZer analysis: not quite homogeneous intensity
- QQ analysis: not quite exponential interarrivals
A quick overview of Extreme Value Theory
(Soundbite level introduction)
References:
Resnick, S. I. (1987) Extreme values, regular variation and point processes, Springer.
Leadbetter, M. R., Lindgren, G. and Rootzen, H. (1983) Extremes and related properties of random sequences and processes, Springer.
A quick overview of Extreme Value Theory (cont.)
Context:
For i.i.d. random variables X_1, ..., X_n, with cumulative distribution function F,
study the asymptotic (as n → ∞) distribution of the sample maximum
M_n = max(X_1, ..., X_n)
Comments / Corrections welcome!
A quick overview of Extreme Value Theory (cont.)
Analog: the Central Limit Theorem
describes the asymptotic distribution of the sample mean, as n → ∞
Intuitive ideas:
- sample mean "clusters around" population mean
- gets closer as sample size grows
- gets closer at a specific rate (proportional to 1/sqrt(n))
- precise "normalization" gives a limiting distribution
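For reference, the normalization the analogy points to is the standard CLT statement (added here, not spelled out in the original notes):

\[ \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \;\longrightarrow\; N(0,1) \quad \text{in distribution, as } n \to \infty . \]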
A quick overview of Extreme Value Theory (cont.)
Say "F is in the domain of attraction of the distribution G"
when:
there are sequences a_n > 0 and b_n, so that
P( (M_n - b_n) / a_n <= x ) → G(x), as n → ∞
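A minimal simulation sketch of this convergence (my own illustration, not part of the notes; the Exponential(1) example and the normalizing choices a_n = 1, b_n = log n are standard textbook ones):

import numpy as np

rng = np.random.default_rng(0)
n, reps = 1000, 5000
x = rng.exponential(size=(reps, n))
m = x.max(axis=1) - np.log(n)          # (M_n - b_n) / a_n with a_n = 1, b_n = log n

# compare the empirical cdf of the normalized maxima with the Gumbel limit
grid = np.linspace(-2.0, 5.0, 8)
gumbel = np.exp(-np.exp(-grid))
for g, t in zip(grid, gumbel):
    print(f"x = {g:5.2f}   empirical cdf = {(m <= g).mean():.3f}   Gumbel limit = {t:.3f}")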
A quick overview of Extreme Value Theory (cont.)
Limiting distributions, G:
Three types, depending on the "upper end behavior" of F
1. "Extreme Value", i.e. "Gumbel"
- roughly happens for F with "exponential tails"
- e.g. Gaussian, Weibull, log Normal
- F could be bounded (but with "little mass near the end")
2. "Fréchet"
- roughly happens for F with "polynomial tails"
- e.g. Cauchy, Pareto
3. "Negative Weibull"
- roughly happens for F "bounded from above"
  (with reasonable mass near the end point)
- e.g. Uniform, negative of Exponential, Weibull or Pareto
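For reference, the three limit types have the standard forms (as in Leadbetter et al. 1983; this display is added here, not in the original slides):

\[ \text{Gumbel: } G(x) = \exp(-e^{-x}),\ x \in \mathbb{R}; \qquad \text{Fréchet: } G(x) = \exp(-x^{-\alpha}),\ x > 0; \qquad \text{negative Weibull: } G(x) = \exp(-(-x)^{\alpha}),\ x < 0, \]

with shape parameter \( \alpha > 0 \) in the last two cases.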
A quick overview of Extreme Value Theory (cont.)
Notes:
- can apply to minima as well as maxima, since
  min(X_1, ..., X_n) = -max(-X_1, ..., -X_n)
  (see the sketch after this list)
- which is useful for "inter-arrival times"
- since "time to next packet" is a minimum over flows
- thus expect a Weibull distribution for inter-arrivals
- not all F have such a limiting distribution for M_n
- there exist precise mathematical characterizations
- elegant and fun mathematics
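A small sketch of the min/max duality and of the Weibull limit for minima (my own illustration; the Beta(0.9, 1) choice, whose density behaves like 0.9 x^(-0.1) near 0, is a hypothetical example):

import numpy as np

rng = np.random.default_rng(1)

# min/max duality: min(X_1, ..., X_n) = -max(-X_1, ..., -X_n)
x = rng.normal(size=10)
assert np.isclose(x.min(), -np.max(-x))

# minima of variables whose density behaves like c * x^(a-1) near 0 have a
# Weibull(shape a) limit; here X ~ Beta(a, 1), so P(X <= t) = t**a exactly
a, n, reps = 0.9, 500, 20000
y = rng.beta(a, 1.0, size=(reps, n))
mins = n ** (1.0 / a) * y.min(axis=1)          # normalized minima

# the limit cdf is 1 - exp(-t**a): check it at a few empirical quantiles
for q in (0.25, 0.50, 0.75):
    t = np.quantile(mins, q)
    print(f"level {q:.2f}: Weibull limit cdf at the sample quantile = {1 - np.exp(-t ** a):.3f}")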
A quick overview of Extreme Value Theory (cont.)
(careful for next time: need to rethink several things here:
- "usual beta parametrization" is "b-1"
- but probably this is better? (makes "pole" more clear)
- should show beta densities
- better (for later) to reformulate in terms of the min?
- part about "Weibull tails" is not right (since at min not tail)
- can be sorted out by Weibull graphic?
- need to be careful about scale
- perhaps do "median matching"?)
An example:
the beta density:
f(x) = C x^(a-1) (1 - x)^(b-1), for 0 < x < 1
note: allows study of a range of "upper bound behavior"
Can show: the domain of attraction (for the maximum) is the Negative Weibull,
with shape parameter b
(by reflection, the same idea applied near 0 gives a Weibull limit for the
minimum, with shape parameter a; a simulation check follows the cases below)
Interesting cases:
b > 1:
- density is small near 1
- Weibull shape parameter is b > 1
- i.e. has lighter tail than exponential
- sensible, since "few observations near 1"
b = 1:
- density has constant height near 1
- e.g. Uniform or negative of exponential
- Weibull shape parameter is 1
- i.e. domain of attraction of the exponential distribution
- special notes:
  - this is the case of interarrival times of a Poisson Process
  - other cases are departures from this
b < 1:
- density has a pole (infinite peak) near 1
- Weibull shape parameter is b < 1
- i.e. has heavier tail than exponential
- sensible, since "more observations near 1"
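A simulation sketch of the three cases (my own code, using the parametrization above with a = 1; the sample sizes are arbitrary): normalized maxima of Beta(1, b) samples are compared with the negative Weibull limit exp(-(-x)^b).

import numpy as np

rng = np.random.default_rng(2)
n, reps = 500, 20000
for b in (2.0, 1.0, 0.5):                        # density small / constant / pole near 1
    y = rng.beta(1.0, b, size=(reps, n))
    m = n ** (1.0 / b) * (y.max(axis=1) - 1.0)   # normalized maxima, a_n = n**(-1/b), b_n = 1
    t = np.quantile(m, 0.5)                      # empirical median
    print(f"b = {b}: negative Weibull limit cdf at the median = {np.exp(-(-t) ** b):.3f} (target 0.5)")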
Recall checking of assumptions of "simple model"
2 b. Weibull QQ, parameters estimated by quantile matching (sketch after this list)
- very good fit
- very good fit
- shape parameter 0.9 "close to Poisson 1.0"?
- provides a workable approximation????
- now understand potential mechanism:
minimization of Beta r.v.'s with a pole at 0
- but "clustered Poisson" may be more likely??
- since can think of such a mechanism
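A sketch of the Weibull Q-Q construction with parameters chosen by quantile matching (my own code; the matched quantile levels 0.25/0.75 and the synthetic stand-in data are assumptions, not the notes' actual choices):

import numpy as np

def weibull_by_quantile_matching(x, p1=0.25, p2=0.75):
    """Choose (shape, scale) so the Weibull quantile function matches two sample quantiles."""
    q1, q2 = np.quantile(x, [p1, p2])
    # Weibull quantile function: Q(p) = scale * (-log(1 - p)) ** (1 / shape)
    shape = (np.log(-np.log(1 - p2)) - np.log(-np.log(1 - p1))) / (np.log(q2) - np.log(q1))
    scale = q1 / (-np.log(1 - p1)) ** (1.0 / shape)
    return shape, scale

rng = np.random.default_rng(3)
x = rng.weibull(0.9, size=5000) * 0.01            # stand-in for observed inter-arrival times
shape, scale = weibull_by_quantile_matching(x)
print(f"estimated shape = {shape:.2f}, scale = {scale:.4f}")

# Q-Q pairs: fitted Weibull quantiles vs sorted data (plot these to assess the fit)
p = (np.arange(1, x.size + 1) - 0.5) / x.size
qq = np.column_stack([scale * (-np.log(1 - p)) ** (1.0 / shape), np.sort(x)])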
Recall checking of assumptions of "simple model" (cont.)
Recall the poor Q-Q analysis for the peak period:
- visual impression distorted by a few large values
- very similar shape parameter, 0.9
Fix for this (boundary effect) problem:
consider only flows that start after 0.2 of the range (see the sketch below):
- now same result as off peak
- Weibull, with shape parameter ~0.9
- shows the "very large values" seen before were at the beginning (boundary region)
- can reject the hypothesis of exponential (graphic)
- but still maybe OK as a "first pass approximation"
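A small sketch of the boundary fix (my own code, on synthetic stand-in start times; the real analysis would use the recorded flow start times):

import numpy as np

rng = np.random.default_rng(6)
start = np.sort(rng.uniform(0.0, 3600.0, size=20000))   # stand-in flow start times (seconds)

# keep only flows starting after the first 20% of the observation window
lo, hi = start.min(), start.max()
kept = start[start > lo + 0.2 * (hi - lo)]
interarrivals = np.diff(kept)                            # re-do the Weibull Q-Q on these
print(f"kept {kept.size} of {start.size} flows")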
An Open problem: which of the following can explain the strong mean changes observed in the start time SiZer analysis?
(Recall graphics: Off Peak, Peak)
a. Independent Weibull(0.9) interarrivals?
b. Poisson cluster process?
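A simulation sketch of the two candidate mechanisms (entirely my own construction; the window length, rates, and cluster parameters are made-up numbers), comparing binned arrival rates as a crude stand-in for the SiZer analysis:

import numpy as np
from math import gamma

rng = np.random.default_rng(4)
T, mean_gap = 3600.0, 0.05                        # window (s) and target mean inter-arrival (s)

# (a) renewal process with independent Weibull(0.9) inter-arrivals
gaps = rng.weibull(0.9, size=int(2 * T / mean_gap)) * mean_gap / gamma(1 + 1 / 0.9)
arr_a = np.cumsum(gaps)
arr_a = arr_a[arr_a < T]

# (b) Poisson cluster process: Poisson cluster centers, each with a Poisson(10)
# number of arrivals spread exponentially after the center
centers = rng.uniform(0.0, T, size=rng.poisson(T / (10 * mean_gap)))
counts = rng.poisson(10, size=centers.size)
arr_b = np.concatenate([c + rng.exponential(0.5, size=k) for c, k in zip(centers, counts)])
arr_b = np.sort(arr_b[arr_b < T])

# crude binned arrival rates: which mechanism can produce strong mean changes?
bins = np.linspace(0.0, T, 61)
for name, arr in (("Weibull renewal", arr_a), ("Poisson cluster", arr_b)):
    counts_per_bin, _ = np.histogram(arr, bins)
    print(f"{name}: relative variability of binned counts = {counts_per_bin.std() / counts_per_bin.mean():.3f}")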
Recall checking of assumptions of "simple model" (cont.)
An aside on the SiZer analysis:
recall the possible boundary effect observed in the Peak case
- downwards slope
Investigate by starting only after 20% of the range:
- looks much better
- confirms the problem was mostly a boundary effect
- but could still be some non-stationarity
Revisitation of "heavy tails"
interesting "physical explanation" of
log normal file size distributions
from:
Downey, A. B. (2000) The structural cause of file size distributions, Wellesley College Tech. Report CSD-TR25-2000, http://rocky.wellesley.edu/downey/filesize
Main results:
- Studies distributions of file sizes
- On individual computers
- Claims generally log Normal
- not Pareto
Revisitation of "heavy tails" (cont.)
Fundamental Premise:
Most files were created from other files
Copying:
downloads, software installations, backups, ...
Translating and filtering:
changing format, compiling, ...
Editing:
programming, word processing, ...
Major assumption:
Changes affect file sizes multiplicatively
- easy for copying: factor of 1
- very sensible for translating and filtering
- OK for most programming and word processing?
- wrong for:
- concatenated files
- text files "created from scratch"
Revisitation of "heavy tails" (cont.)
Simple stochastic model:
for an original file size s, and a random modification factor f,
the new file size is f · s
Distribution of the factor f?
Perhaps doesn't matter, by a "Central Limit Theorem argument"
(repeated multiplicative changes make log file size a sum of many terms)
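A simulation sketch of the multiplicative mechanism (my own code; the number of files, number of modification steps, original size, and the Uniform(0.5, 2) factor distribution are all made-up choices):

import numpy as np

rng = np.random.default_rng(5)
n_files, n_steps, s0 = 50000, 30, 2000.0
factors = rng.uniform(0.5, 2.0, size=(n_files, n_steps))   # multiplicative modification factors
sizes = s0 * factors.prod(axis=1)                          # size = s0 * f1 * f2 * ... * fk

# CLT on the log scale: log sizes should look roughly Normal regardless of the
# factor distribution; near-zero skewness / excess kurtosis is a crude check
logs = np.log(sizes)
z = (logs - logs.mean()) / logs.std()
print(f"log-size skewness = {np.mean(z ** 3):.3f}, excess kurtosis = {np.mean(z ** 4) - 3:.3f}")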
Revisitation of "heavy tails" (cont.)
Downey's conclusion:
Since the log normal is "light tailed" (e.g. all moments exist),
we then don't have:
Heavy tailed durations
Long Range Dependence
I.e. something else must be creating the apparent LRD
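For reference, "all moments exist" can be made explicit (standard fact, added here for completeness):

\[ X = e^{\mu + \sigma Z},\ Z \sim N(0,1) \quad\Longrightarrow\quad E\,[X^{k}] = e^{k\mu + k^{2}\sigma^{2}/2} < \infty \quad \text{for every } k > 0 . \]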
A "really important" open problem
Can the lognormal lead to Long Range Dependence?
Simplest answer:
No, not for any fixed log-normal
Deeper question:
what if the lognormal changes during the asymptotics?
A precise (actually not very!) question: for the "simple model",
for what sequences of lognormal parameters μ_n and σ_n
do we get classical long range dependence
(in any sense defined in Lecture 9-19-01),
as n → ∞ ????
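One standard way to make "classical long range dependence" precise (my phrasing; the 9-19-01 lecture may state it in an equivalent form) is in terms of the autocovariance of the aggregated traffic counts X_t:

\[ \operatorname{Cov}(X_t, X_{t+k}) \sim c\, k^{2H-2} \ (k \to \infty), \quad \tfrac{1}{2} < H < 1, \qquad \text{equivalently} \qquad \sum_{k=0}^{\infty} \operatorname{Cov}(X_t, X_{t+k}) = \infty , \]

so the open problem asks for which sequences μ_n, σ_n a property of this kind holds in the limit.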