Course  OR 778

Class Notes   10/3/01





Last Time:

    -    in general context of:

Heavy tailed durations  $\Rightarrow$  Long Range Dependence





    -    revisited protocol background
 

    -  Mice and Elephants graphic, with "one minute split" flows
 

    -    introduced simple model
 

    -    investigated modelling assumptions

            -    SiZer analysis:  not quite homogeneous intensity

            -    QQ analysis:  not quite exponential interarrivals
 
 
 


A quick overview of Extreme Value Theory







(Soundbite level introduction)
 
 

References:
 

Resnick, S. I. (1987) Extreme values, regular variation and point processes, Springer.
 

Leadbetter, M. R., Lindgren, G. and Rootzen, H. (1983) Extremes and related properties of random sequences and processes, Springer.
 
 
 


A quick overview of Extreme Value Theory (cont.)





Context:    For i.i.d. random variables $X_1, \ldots, X_n$
 

with cumulative distribution function $F$,
 

study the asymptotic (as $n \to \infty$) distribution of

$\max(X_1, \ldots, X_n)$





Comments / Corrections welcome!
 
 
 


A quick overview of Extreme Value Theory (cont.)





Analog:         Central Limit Theorem
 

describes the asymptotic distribution of the sample mean, as $n \to \infty$





Intuitive ideas:

    -    sample mean $\bar{X}$ "clusters around" population mean $\mu$

    -    gets closer as sample size grows

    -    gets closer at specific rate $n^{-1/2}$

    -    precise "normalization" gives limiting distribution
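
A minimal simulation sketch of these ideas, assuming Exponential(1) data (population mean 1, standard deviation 1); the seed, sample sizes and number of replications are arbitrary choices, not from the lecture. The spread of the sample mean shrinks, while the spread multiplied by $n^{1/2}$ stays roughly constant.

```python
import random
import statistics

def sample_means(n, reps, rng):
    """Return `reps` sample means, each computed from n Exponential(1) draws."""
    return [statistics.fmean(rng.expovariate(1.0) for _ in range(n))
            for _ in range(reps)]

rng = random.Random(0)  # fixed seed (arbitrary) so the run is repeatable
spread = {}
for n in (25, 100, 400):
    spread[n] = statistics.stdev(sample_means(n, 2000, rng))
    # sqrt(n) * spread should stay near the population standard deviation, 1
    print(n, round(spread[n], 4), round(n ** 0.5 * spread[n], 3))
```

The last printed column is the "normalization" in action: without the factor $n^{1/2}$ the spread collapses to 0, with it the spread stabilizes.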
 
 
 


A quick overview of Extreme Value Theory (cont.)





Say "$F$ is in the domain of attraction of the distribution $G$"
 

when:
 

      there are sequences $a_n > 0$ and $b_n$, so that
 
 

$P\{ (\max(X_1, \ldots, X_n) - b_n) / a_n \le x \} \to G(x)$



      as $n \to \infty$
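
A small numerical check of this definition, as a sketch: for the Exponential(1) distribution, the standard choice $a_n = 1$, $b_n = \log n$ gives the Gumbel limit $\exp(-e^{-x})$. Since the maximum of $n$ i.i.d. variables has CDF $F(x)^n$, no simulation is needed. The function names and the evaluation grid are illustrative only.

```python
import math

def exp_cdf(x):
    """CDF of the Exponential(1) distribution."""
    return 1.0 - math.exp(-x) if x > 0 else 0.0

def gumbel_cdf(x):
    """Gumbel limit G(x) = exp(-exp(-x))."""
    return math.exp(-math.exp(-x))

def normalized_max_cdf(x, n):
    """P{(max - b_n) / a_n <= x} = F(a_n * x + b_n)**n, with a_n = 1, b_n = log n."""
    return exp_cdf(x + math.log(n)) ** n

grid = [-2, -1, 0, 1, 2, 4]
gaps = {}
for n in (10, 100, 10000):
    gaps[n] = max(abs(normalized_max_cdf(x, n) - gumbel_cdf(x)) for x in grid)
    print(n, gaps[n])  # largest discrepancy on the grid shrinks as n grows
```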
 
 
 


A quick overview of Extreme Value Theory (cont.)





Limiting distributions, $G$:
 
 

Three types, depending on "upper end behavior" of the underlying distribution $F$





1.    "Extreme Value", i.e. "Gumbel":  $G(x) = \exp(-e^{-x})$

    -    roughly happens for $F$ with "exponential tails"

    -    e.g. Gaussian, Weibull, log Normal

    -    could be bounded (but "little mass near end")
 
 

2.    "Fréchet":  $G(x) = \exp(-x^{-\alpha})$ for $x > 0$

    -    roughly happens for $F$ with "polynomial tails"

    -    e.g. Cauchy, Pareto
 
 

3.    "Negative Weibull":  $G(x) = \exp(-(-x)^{\alpha})$ for $x < 0$

    -    roughly "bounded from above"

(with reasonable mass near end point)

    -    e.g. Uniform, negative of Exponential, Weibull or Pareto
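
A sketch checking the polynomial-tail and bounded cases numerically, using the exact CDF $F^n$ of the maximum. Assumptions for illustration: a Pareto distribution with $P\{X > x\} = x^{-\alpha}$ for $x \ge 1$ (normalized by $a_n = n^{1/\alpha}$, $b_n = 0$; its limit is usually called the Fréchet distribution), and Uniform(0,1) (normalized by $a_n = 1/n$, $b_n = 1$; limit is Negative Weibull with shape 1). The value of $\alpha$ and the grids are arbitrary.

```python
import math

def pareto_cdf(x, alpha):
    """Pareto CDF with P{X > x} = x**(-alpha) for x >= 1."""
    return 1.0 - x ** (-alpha) if x >= 1 else 0.0

def uniform_cdf(x):
    """Uniform(0, 1) CDF."""
    return min(max(x, 0.0), 1.0)

def frechet_cdf(x, alpha):
    """Polynomial-tail limit: exp(-x**(-alpha)) for x > 0."""
    return math.exp(-x ** (-alpha)) if x > 0 else 0.0

def neg_weibull_cdf(x, alpha):
    """Bounded-above limit: exp(-(-x)**alpha) for x < 0, and 1 otherwise."""
    return math.exp(-(-x) ** alpha) if x < 0 else 1.0

alpha, n = 1.5, 100000
# Pareto: P{max / n**(1/alpha) <= x} = F(n**(1/alpha) * x)**n
pareto_gap = max(abs(pareto_cdf(n ** (1 / alpha) * x, alpha) ** n
                     - frechet_cdf(x, alpha)) for x in (0.5, 1, 2, 5))
# Uniform: P{n * (max - 1) <= x} = F(1 + x / n)**n
uniform_gap = max(abs(uniform_cdf(1 + x / n) ** n
                      - neg_weibull_cdf(x, 1.0)) for x in (-3, -1, -0.2, 0))
print(pareto_gap, uniform_gap)  # both are small for large n
```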
 
 
 


A quick overview of Extreme Value Theory (cont.)





Notes:
 

    -    can apply to minima as well as maxima, since

$\min(X_1, \ldots, X_n) = -\max(-X_1, \ldots, -X_n)$



    -    which is useful for "inter-arrival times"
 

    -    since "time to next packet" is a minimum over flows
 

    -    thus expect Weibull distribution for inter-arrivals
 

    -    not all $F$ have such a limiting distribution for $\max(X_1, \ldots, X_n)$
 

    -    there exist precise mathematical characterizations
 

    -    elegant and fun mathematics
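
A short worked equation for the minimum, as a sketch under the i.i.d. assumption: the minimum exceeds $t$ exactly when every observation does. The Exponential($\lambda$) line is an illustrative special case, not taken from the lecture.

```latex
\begin{aligned}
P\{\min(X_1,\ldots,X_n) > t\} &= P\{X_1 > t, \ldots, X_n > t\} = (1 - F(t))^n, \\
\text{Exponential}(\lambda):\quad (1 - F(t))^n &= e^{-n \lambda t}, \quad t \ge 0 .
\end{aligned}
```

So the minimum of $n$ i.i.d. Exponential($\lambda$) variables is Exponential($n\lambda$), and $n$ times the minimum is Exponential($\lambda$) for every $n$. This is the Weibull case with shape parameter 1.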
 
 
 


A quick overview of Extreme Value Theory (cont.)





(careful for next time:
    need to rethink several things here:
    -    "usual beta parametrization" is "b-1"
    -    but probably this is better? (makes "pole" more clear)
    -    should show beta densities
    -    better (for later) to reformulate in terms of the min?
    -    part about "Weibull tails" is not right (since at min not tail)
    -    can be sorted out by Weibull graphic?
    -    need to be careful about scale
    -    perhaps do "median matching"?)

An example:    the beta density, with exponents $a$ and $b$:

$f(x) \propto x^{a} (1 - x)^{b}$,  $0 < x < 1$,  $a, b > -1$

    note:  allows study of range of "upper bound behavior"
 
 

Can show:    domain of attraction is

Negative Weibull 

with shape parameter:  $b + 1$
 
 

Interesting cases:
 

$b > 0$:

        -    density is small near 1,

        -    Weibull shape parameter is $b + 1 > 1$

        -    i.e. has lighter tail than exponential

        -    sensible, since "few observations near 1"
 

$b = 0$:

        -   constant height near 1,

        -    e.g. Uniform or negative of exponential

        -    Weibull shape parameter is 1

        -    i.e. domain of attraction of exponential distribution

        -    special notes:

             -   this is interarrival times of Poisson Process

             -   other cases are departures from this
 

$b < 0$:

        -   has a pole (infinite peak) near 1,

        -    Weibull shape parameter is $b + 1 < 1$

        -    i.e. has heavier tail than exponential

        -    sensible since "more observations near 1"
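
A numerical sketch of the three cases, under a simplifying assumption: instead of a general beta density, take the density $(b+1)(1-x)^b$ on $[0,1]$, which has the same behavior near the upper bound 1 and a closed-form tail $P\{X > 1 - u\} = u^{b+1}$. With $a_n = n^{-1/(b+1)}$ and $b_n = 1$, the normalized maximum should approach the Negative Weibull with shape $b+1$. Function names, the grid and the values of $b$ are illustrative.

```python
import math

def tail_near_one(u, b):
    """P{X > 1 - u} for the density (b + 1) * (1 - x)**b on [0, 1], 0 <= u <= 1."""
    return u ** (b + 1)

def normalized_max_cdf(x, b, n):
    """P{(max - 1) / a_n <= x} for x < 0, with a_n = n**(-1 / (b + 1))."""
    u = -x * n ** (-1.0 / (b + 1))  # distance below the upper bound 1
    return (1.0 - tail_near_one(u, b)) ** n

def neg_weibull_cdf(x, shape):
    """Negative Weibull CDF exp(-(-x)**shape) for x < 0."""
    return math.exp(-(-x) ** shape) if x < 0 else 1.0

n = 10 ** 6
gaps = {}
for b in (1.0, 0.0, -0.5):  # small near 1, constant height, pole at 1
    gaps[b] = max(abs(normalized_max_cdf(x, b, n) - neg_weibull_cdf(x, b + 1))
                  for x in (-2, -1, -0.5, -0.1))
    print(b, b + 1, gaps[b])  # exponent, limiting shape, discrepancy
```

Working with the distance $u$ from the upper bound, rather than with $1 - x$ directly, avoids floating point cancellation when $a_n$ is very small.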
 
 


Recall checking of assumptions of "simple model"





2 b.    Weibull QQ, parameters estimated by quantile matching
 

Offpeak

    -    very good fit

    -    shape parameter 0.9 "close to Poisson 1.0"?

    -    provides a workable approximation????

    -    now understand potential mechanism:

minimization of Beta r.v.'s with a pole at 0

    -    but "clustered Poisson" may be more likely??

    -    since can think of such a mechanism
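
A minimal sketch of Weibull quantile matching and the corresponding Q-Q pairs. The notes do not say which quantiles were matched, so matching the quartiles (0.25 and 0.75) is an assumption here, as are the function names, the plotting positions $(i + 0.5)/n$ and the synthetic data. It uses the Weibull quantile function $Q(p) = \text{scale} \cdot (-\log(1-p))^{1/\text{shape}}$.

```python
import math

def weibull_quantile(p, shape, scale):
    """Quantile function of the Weibull(shape, scale) distribution."""
    return scale * (-math.log(1.0 - p)) ** (1.0 / shape)

def empirical_quantile(sorted_x, p):
    """Order statistic at 1-based position ceil(p * n)."""
    k = max(1, math.ceil(p * len(sorted_x)))
    return sorted_x[k - 1]

def fit_weibull_by_quantiles(x, p1=0.25, p2=0.75):
    """Choose shape and scale so the Weibull p1 and p2 quantiles match the data (x > 0)."""
    xs = sorted(x)
    q1, q2 = empirical_quantile(xs, p1), empirical_quantile(xs, p2)
    c1, c2 = math.log(-math.log(1 - p1)), math.log(-math.log(1 - p2))
    shape = (c2 - c1) / (math.log(q2) - math.log(q1))
    scale = q1 / (-math.log(1 - p1)) ** (1 / shape)
    return shape, scale

def weibull_qq_pairs(x, shape, scale):
    """(theoretical quantile, sorted data value) pairs for a Q-Q plot."""
    xs = sorted(x)
    n = len(xs)
    return [(weibull_quantile((i + 0.5) / n, shape, scale), xs[i]) for i in range(n)]

# Synthetic check: data placed exactly on Weibull(shape 0.9, scale 2) quantiles.
data = [weibull_quantile((i + 0.5) / 1000, 0.9, 2.0) for i in range(1000)]
shape, scale = fit_weibull_by_quantiles(data)
pairs = weibull_qq_pairs(data, shape, scale)
print(round(shape, 3), round(scale, 3))
```

A shape estimate near 1 on real interarrival data would correspond to the exponential (Poisson) case; pairs far from the 45 degree line indicate lack of fit.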
 
 


Recall checking of assumptions of "simple model" (cont.)





Recall poor Q-Q analysis for:
 
 

Peak

    -    visual impression distorted by few large values

    -    very similar shape parameter 0.9
 
 

Fix for this (boundary effect) problem:
 

consider only flows that start after 0.2 of range:
 

Boundary adjusted peak

    -    now same result as off peak

    -    Weibull, with shape parameter ~0.9

    -    shows "very large values" before were at beginning

    -    can reject hypothesis of exponential (graphic)

    -    but still maybe OK as "first pass approximation"
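
A sketch of the boundary adjustment, assuming "range" means the span from the first to the last observed flow start time (the notes do not define it more precisely). The function name and the toy start times are hypothetical.

```python
def boundary_adjusted_interarrivals(start_times, skip_fraction=0.2):
    """Interarrival times using only flows that start after `skip_fraction` of the range."""
    ts = sorted(start_times)
    lo, hi = ts[0], ts[-1]
    cutoff = lo + skip_fraction * (hi - lo)
    kept = [t for t in ts if t > cutoff]  # drop flows starting in the first 20%
    return [later - earlier for earlier, later in zip(kept, kept[1:])]

# Toy start times: the range is 0 to 10, so the cutoff is 2.0.
starts = [0.0, 0.1, 0.3, 2.5, 3.0, 4.5, 5.0, 8.0, 10.0]
adjusted = boundary_adjusted_interarrivals(starts)
print(adjusted)  # [0.5, 1.5, 0.5, 3.0, 2.0]
```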
 
 


An Open problem:  which of the following can explain the

strong mean changes observed

in the start time  SiZer  analysis?





(Recall graphics:   Off Peak               Peak)
 
 

a.    Independent Weibull(0.9) interarrivals?
 

b.    Poisson cluster process?
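
One way to start exploring this is to simulate both candidates and feed the arrival times (or binned counts) to the same analysis. The sketch below does not answer the open problem. All parameter values are hypothetical, and the cluster mechanism (geometric number of extra points, exponential offsets from the cluster center) is only one of many possible Poisson cluster models.

```python
import random

def weibull_renewal(horizon, shape, scale, rng):
    """Arrival times in (0, horizon] with i.i.d. Weibull interarrivals."""
    t, arrivals = 0.0, []
    while True:
        t += rng.weibullvariate(scale, shape)  # stdlib argument order: (scale, shape)
        if t > horizon:
            return arrivals
        arrivals.append(t)

def poisson_cluster(horizon, cluster_rate, p_more, mean_offset, rng):
    """Poisson cluster centers, each followed by a geometric number of extra points."""
    arrivals, center = [], 0.0
    while True:
        center += rng.expovariate(cluster_rate)
        if center > horizon:
            break
        arrivals.append(center)
        while rng.random() < p_more:  # keep adding points with probability p_more
            t = center + rng.expovariate(1.0 / mean_offset)
            if t <= horizon:
                arrivals.append(t)
    return sorted(arrivals)

def bin_counts(arrivals, horizon, n_bins):
    """Number of arrivals in each of n_bins equal-width bins on [0, horizon]."""
    counts = [0] * n_bins
    for t in arrivals:
        counts[min(int(t / horizon * n_bins), n_bins - 1)] += 1
    return counts

rng = random.Random(1)  # arbitrary seed
horizon = 1000.0
a = weibull_renewal(horizon, 0.9, 1.0, rng)
b = poisson_cluster(horizon, 0.5, 0.5, 2.0, rng)
counts_a = bin_counts(a, horizon, 20)
counts_b = bin_counts(b, horizon, 20)
print(counts_a)
print(counts_b)
```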
 
 
 


Recall checking of assumptions of "simple model" (cont.)





An aside on the  SiZer  analysis:

recall the possible boundary effect observed in the Peak case

    -    downwards slope
 

Investigate by starting only after 20% of range:

Boundary Adjusted Peak

    -    looks much better

    -    confirms problem was mostly boundary effect

    -    but could still be some non-stationarity
 
 
 


Revisitation of "heavy tails"




interesting "physical explanation" of

log normal file size distributions

from:
 

Downey, A. B. (2000) The structural cause of file size distributions, Wellesley College Tech. Report CSD-TR25-2000, http://rocky.wellesley.edu/downey/filesize
 
 
 

Main results:

    -    Studies distributions of file sizes

    -    On individual computers

    -    Claims generally log Normal

    -    not Pareto
 
 
 


Revisitation of "heavy tails" (cont.)





Fundamental Premise:
 
 

Most files were created from other files





Copying:

downloads, software installations, backups, ...





Translating and filtering:

changing format, compiling, ...





Editing:

programming, word processing, ...







Major assumption:

Changes affect file sizes multiplicatively







    -    easy for copying: factor of 1

    -    very sensible for translating and filtering

    -    OK for most programming and word processing?

    -    wrong for:

            -    concatenated files

            -    text files "created from scratch"
 
 


Revisitation of "heavy tails" (cont.)





Simple stochastic model:
 

for original file size $S$
 

and a random modification factor $M$
 

new file size is $M \cdot S$
 
 
 

Distribution of $M$?
 

Perhaps doesn't matter by "Central Limit Theorem argument"

(on log scale)
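
A simulation sketch of the Central Limit Theorem argument: after many multiplicative changes, the log of the file size is a sum of i.i.d. log factors, so it is approximately Normal, i.e. the size is approximately log Normal. The factor distribution (log factor Uniform on $(-1, 1)$), the starting size, the number of steps and the seed are all assumptions for illustration, not taken from Downey's report.

```python
import math
import random
import statistics

def simulate_log_sizes(n_files, n_steps, rng, log_start=math.log(1000.0)):
    """Log file sizes after n_steps multiplicative changes, log factor ~ Uniform(-1, 1)."""
    out = []
    for _ in range(n_files):
        log_size = log_start
        for _ in range(n_steps):
            log_size += rng.uniform(-1.0, 1.0)  # multiplying sizes adds logs
        out.append(log_size)
    return out

rng = random.Random(0)  # arbitrary seed
logs = simulate_log_sizes(2000, 50, rng)
mu, sd = statistics.fmean(logs), statistics.stdev(logs)
within_one_sd = sum(abs(v - mu) <= sd for v in logs) / len(logs)
# Theory: variance of Uniform(-1, 1) is 1/3, so sd of the sum is sqrt(50 / 3);
# for a Normal distribution about 68% of values fall within one sd of the mean.
print(round(sd, 2), round(math.sqrt(50 / 3), 2), round(within_one_sd, 3))
```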







Revisitation of "heavy tails" (cont.)





Downey's conclusion:
 

Since log normal is "light tailed"  (e.g. all moments exist)
 

Then don't have:

Heavy tailed durations  $\Rightarrow$  Long Range Dependence






I.e. something else must be creating the apparent LRD
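
For reference, the "all moments exist" statement can be made explicit. The Pareto line is added for contrast, with the tail normalized to start at 1 for simplicity.

```latex
\begin{aligned}
X = e^{\mu + \sigma Z},\ Z \sim N(0,1): \quad
  & E[X^k] = \exp\!\left(k\mu + \tfrac{1}{2} k^2 \sigma^2\right) < \infty
    \ \text{ for all } k > 0, \\
P\{X > x\} = x^{-\alpha},\ x \ge 1: \quad
  & E[X^k] = \frac{\alpha}{\alpha - k}\ \ (k < \alpha), \qquad
    E[X^k] = \infty \ \ (k \ge \alpha).
\end{aligned}
```

So every fixed log Normal has finite variance, while a Pareto with $\alpha \le 2$ does not.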
 
 
 
 


A "really important" open problem





Can the lognormal lead to Long Range Dependence?
 
 

Simplest answer:

No, not for any fixed log-normal





Deeper question:

what if the lognormal changes during the asymptotics?





A precise (actually not very!) question:  for "simple model",

    for what sequences of parameters $\mu_n$ and $\sigma_n$,

    do we get classical long range dependence

(in any sense defined in Lecture9-19-01),

as $n \to \infty$????