Course  OR 778

Class Notes   10/24/01



Last Time:

For new HTTP Response Size data:

    -    Studied Asymptotic Independence

    -    Hill Estimation of tail indices

    -    Used for "power renormalization"

    -    Considered Box-Cox family of power transformations

    -    Explored "ratio Hill estimation"

    -    log-log CCDF Tail Index Estimation
 
 
 


New HTTP Response Data



Data sources:  4 hour blocks of packet headers

        "Morning":   8:00-12:00

        "Afternoon":   13:00-17:00

        "Evening":    19:30-23:30
 
 

Gathered at UNC Main Link

During 7 days in April 2001
 
 
 


Asymptotic Independence



Previous View:   "Axis-hugging"

    -    Best View:  tail index transformed    graphic
 

Another View:

    -    For data "in tails",

    -    Project to (e.g.) "unit circle"

    -    I.e. study distribution of "angles" (between $0$ and $\pi/2$)

    -    Look for "pile up at end points"
 
 
 


Asymptotic Independence (cont.)



Angle Plot:

    -    1st try was very disappointing (so not shown)

    -    Everything was near   (of the order )

    -    Reason:  Scales not commensurate

    -    Refer to Scatterplot above

    -    Response Sizes were of the order   (bytes)

    -    Duration Times were of the order   (sec)

    -    Implies   distance heavily weighted towards Sizes
 
 
 


Asymptotic Independence (cont.)



Improved Angle Plot:

    -    Control for scale

    -    By dividing by marginal medians

    -    Since robust measure of "scale" for Pareto

    -    Note only "change axis labels" in previous scatterplot
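The improved angle computation can be sketched roughly as follows. This is only an illustration: the function name `tail_angles`, the use of Euclidean radius, and the choice of a radius quantile as the "in tails" threshold are assumptions made here, not the code behind the graphics.

```python
import math
import statistics


def tail_angles(sizes, times, radius_quantile=0.9):
    """Angles of median-scaled (size, time) pairs that lie far from the origin.

    Assumption: "in tails" means the scaled Euclidean radius is at least its
    empirical `radius_quantile` quantile; the notes do not fix a threshold.
    All inputs are assumed strictly positive, so angles fall in [0, pi/2].
    """
    med_s = statistics.median(sizes)
    med_t = statistics.median(times)
    # Divide by marginal medians so the two scales are commensurate.
    pts = [(s / med_s, t / med_t) for s, t in zip(sizes, times)]
    radii = sorted(math.hypot(x, y) for x, y in pts)
    cutoff = radii[min(len(radii) - 1, int(radius_quantile * len(radii)))]
    # atan2(y, x) is the angle of the point projected onto the unit circle.
    return [math.atan2(y, x) for x, y in pts if math.hypot(x, y) >= cutoff]
```

Angles near 0 then correspond to points hugging the Size axis and angles near pi/2 to points hugging the Time axis, so "pile up at end points" would show as mass at both ends of a histogram of the returned values.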
 

Resulting SiZer analysis    graphic

    -    Find "axis hugging"?

    -    Spikes are "lines of constant transmission rate"?
 
 
 


Asymptotic Independence (cont.)




Interesting variation:

Also study relationship between Size, Time

and "Rate" = Size/Time



Tail behavior of Rate? Hill Plot

    -    Quite often smaller tail index than either Size or Time

    -    Often have tail index $\alpha$ less than 1

    -    Implications?

    -    Will use anyway in Asy. Indep. Analysis
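A Hill plot for Rate can be sketched as below. This is a minimal version of the standard Hill estimator, not the code used for the graphic; the function names and the synthetic data in the demo are hypothetical, and the choice of which values of k to plot is left to the user.

```python
import math
import random


def hill_alpha(data, k):
    """Hill estimate of the tail index from the k largest observations.

    Uses 1 / mean(log(X_(i) / X_(k+1))) over the k largest values X_(1..k).
    Raises ZeroDivisionError if all k top values are tied with X_(k+1).
    """
    xs = sorted(data, reverse=True)
    if not 0 < k < len(xs):
        raise ValueError("k must satisfy 0 < k < len(data)")
    threshold = xs[k]  # the (k+1)-th largest value
    mean_log_excess = sum(math.log(x / threshold) for x in xs[:k]) / k
    return 1.0 / mean_log_excess


def hill_plot_points(data, ks):
    """(k, estimate) pairs, i.e. the points one would draw in a Hill plot."""
    return [(k, hill_alpha(data, k)) for k in ks]


if __name__ == "__main__":
    # Synthetic stand-ins for Size and Time (NOT the UNC data).
    rng = random.Random(0)
    sizes = [rng.paretovariate(1.2) for _ in range(5000)]
    times = [rng.paretovariate(1.5) for _ in range(5000)]
    rates = [s / t for s, t in zip(sizes, times)]  # "Rate" = Size / Time
    for k, est in hill_plot_points(rates, [50, 100, 200, 400]):
        print(k, round(est, 3))
```

A stable stretch of the plotted estimates across k is what one would read off as the tail index; the lack of such a stretch is itself informative.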
 
 
 


Asymptotic Independence (cont.)



Expectations:
 

Time vs. Rate:  Dependent

    -    Since big flows will feel either more or less packet loss

    -    Thus Time and Rate should be proportionally affected

    -    Since Rate = Size / Time
 

Rate vs. Size:  Independent

    -    Since Rate driven by packet loss

    -    But Size does not appear connected to Time
 
 
 


Asymptotic Independence (cont.)



Results:
 

Time vs. Rate Angle Plots

    -    Independent!?!?    (recall expected dependent)

    -    Suggests above intuition is wrong?

    -    Does method seem OK here???

    -    Hopefully not programming error.....
 

Rate vs. Size Angle Plots

    -    Dependent!?!?    (recall expected independent)

    -    Why should this be reversed???

    -    Expect either both dep. or both indep.???

    -    Need to rethink ratio??
 
 
 


New Response Size Q-Q Plots (revisited)



Another view of New Response Size Data:

Extreme Value Tail Index $\alpha$




Recall Intuition:
 

    -    Shape parameter of Pareto  (polynomial power)
 

    -    Strong relation to Long Range Dependence

            -    in Mice and Elephants plots  (graphic)

            -   in Duration Distributions,

implies Classical LRD in aggregated time series




    -    Strong relation to moments:

            -    for $\alpha < 1$ have infinite mean

            -    for $1 < \alpha < 2$ have finite mean but infinite variance

            -    for $\alpha > 2$ both mean and variance are finite

            -    similar for larger $\alpha$ and higher moments
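
The moment thresholds follow from a one-line integral. As a sketch, write $\alpha$ for the tail index and assume an exact Pareto tail with scale $x_m > 0$ (the symbol $x_m$ and the exact Pareto form are assumptions made here for illustration; real data only needs to look like this far out in the tail):

```latex
P(X > x) = \left(\frac{x}{x_m}\right)^{-\alpha}, \quad x \ge x_m,
\qquad
E[X^k] = \int_{x_m}^{\infty} x^k \,\alpha\, x_m^{\alpha}\, x^{-\alpha-1}\,dx
       = \begin{cases}
           \dfrac{\alpha\, x_m^k}{\alpha - k}, & k < \alpha, \\[1ex]
           \infty, & k \ge \alpha.
         \end{cases}
```

Taking $k = 1$ gives the condition for a finite mean and $k = 2$ the condition for a finite variance; larger $k$ gives the analogous statement for higher moments.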
 
 
 


New Response Size Q-Q Plots (cont.)



Simple, straightforward Estimation of $\alpha$:

Slope of CCDF (i.e. 1 - CDF) on log - log scale
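
A minimal sketch of this estimate, assuming strictly positive data: build the empirical CCDF, take logs of both axes, and fit a least-squares line. The function names are hypothetical, and fitting one line over all points is a simplification; in practice one would restrict the fit to the upper tail.

```python
import math
import random


def loglog_ccdf(data):
    """(log10 x, log10 CCDF(x)) points of the empirical CCDF of positive data."""
    xs = sorted(data)
    n = len(xs)
    pts = []
    for i, x in enumerate(xs):
        if i + 1 < n and xs[i + 1] == x:
            continue  # keep only the last copy of a tied value
        ccdf = (n - (i + 1)) / n  # fraction of observations strictly above x
        if ccdf > 0:  # the maximum has CCDF 0, whose log is undefined
            pts.append((math.log10(x), math.log10(ccdf)))
    return pts


def ls_slope(pts):
    """Ordinary least-squares slope through a list of (x, y) points."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    return sxy / sxx


if __name__ == "__main__":
    rng = random.Random(1)
    sample = [rng.paretovariate(2.0) for _ in range(10000)]  # synthetic data
    print("estimated tail index:", -ls_slope(loglog_ccdf(sample)))
```

For an exact Pareto tail the log-log CCDF is a straight line with slope equal to minus the tail index, which is why the negated slope serves as the estimate.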





Log-log CCDF:  graphic
 

    -    All 21 time blocks appear as thin blue lines

    -    Each Individual labeled and highlighted in thick red

    -    Not very "linear"?

    -    Suggests classical extreme value theory

hasn't "kicked in" yet???

    -    Note "shapes" of curves surprisingly constant

    -    Suggests curvature is not "random phenomenon"!

    -    Instead something systematic about internet traffic?

    -    Point worth deeper statistical confirmation??

    -    Suggests enhancement of current mathematics????

    -    Friday evening an "extreme point"?  (least steep?)

    -    Many Resp. Sizes near 400 bytes???

(also for Friday, Afternoon, nowhere else?)

    -    Worth plotting data between 0.999 quantile and max???

(1,000 to 7,000 of these for each time block....)






New Response Size Q-Q Plots (cont.)



Now estimate "tail index" $\alpha$, by studying:

Slopes:  graphic



    -    Simply use difference quotients from log-log CCDF

    -    Numerical problem:  0 denominators

    -    Reset to bottom of plot

    -    Suggest ignoring those

    -    Could use fancier differentiation (e.g. over bigger range)

    -    But this "raw data" shows interesting structure

    -    "Almost always" have   (interesting for LRD)

    -    But no apparent "tail limit" for $\alpha$?

    -    So do not satisfy "classical heavy tail definition"?

    -    But still clearly "intuitively heavy tailed"?

    -    Worth exploring alternate definitions?
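
The difference-quotient step can be sketched as follows. The function name `ccdf_slopes` is hypothetical, and the input is assumed to be a list of (log x, log CCDF) points sorted by x. Zero denominators are simply skipped here, following the suggestion above to ignore them, rather than reset to the bottom of the plot.

```python
def ccdf_slopes(pts):
    """Difference quotients between consecutive log-log CCDF points.

    Returns (x_left, slope) pairs. Pairs with a zero denominator (repeated
    x values) are skipped, since the quotient is undefined there.
    """
    slopes = []
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        dx = x1 - x0
        if dx == 0:
            continue  # 0 denominator: ignore this pair
        slopes.append((x0, (y1 - y0) / dx))
    return slopes


if __name__ == "__main__":
    demo = [(0.0, 0.0), (1.0, -2.0), (1.0, -3.0), (2.0, -5.0)]
    for x, slope in ccdf_slopes(demo):
        print(x, -slope)  # the negated slope is the local tail index estimate
```

A "fancier" version would difference over a wider window of points, trading the raw structure visible here for less noise.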
 
 
 


New Mathematics for "Heavy Tails"?



Version 1:    For some ,



Open Problem 1:    For the simple Model,

with Version 1 tailed Duration Dist'n,   is

?




(i.e. have index   LRD)
 
 

Version 2:    Reformulate, in terms of:    have

"most of the time"  (in some sense)?
 
 

Open Problem 2:    For the simple model,

with Version 2 tailed Duration Dist'n,

can we still have    (in a suitable sense)?
 

How do we modify version 2 to make this happen?
 
 
 


Is "Long Range Dependence" Real?

Stegeman, A. W. (2001) "Non-stationarity versus long-range
dependence in computer network traffic measurements", unpublished.  Maybe available at A.W.Stegeman@math.rug.nl???
 
 

Main Idea:  analyzed several famous data sets,

suggests that ARIMA(p,1,q) fits better

    -    Recall ARIMA from:

"Time Series Background" Lecture 9/19/01

    -    Nonstationary

    -    Take 1st difference to return to ARMA

    -    Which is short range dependent

Argues that ARIMA(1,1,1) can exhibit "LRD-like properties"
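
To make the ARIMA(1,1,1) structure concrete, here is a small simulation sketch: an ARMA(1,1) series is cumulatively summed to give a nonstationary ARIMA(1,1,1) path, and first differencing recovers the ARMA series. The function names, the parameter values 0.5 and 0.3, and the sign convention for the moving-average term are assumptions chosen for illustration, not values taken from Stegeman's analysis.

```python
import random


def simulate_arma11(n, phi, theta, rng):
    """ARMA(1,1): x[t] = phi * x[t-1] + e[t] + theta * e[t-1], e ~ N(0, 1)."""
    x, prev_x, prev_e = [], 0.0, 0.0
    for _ in range(n):
        e = rng.gauss(0.0, 1.0)
        cur = phi * prev_x + e + theta * prev_e
        x.append(cur)
        prev_x, prev_e = cur, e
    return x


def integrate(x):
    """Cumulative sum: turns an ARMA series into an ARIMA(p,1,q) path."""
    out, total = [], 0.0
    for v in x:
        total += v
        out.append(total)
    return out


def first_difference(y):
    """y[t] - y[t-1]; returns the ARIMA(p,1,q) path to its ARMA increments."""
    return [b - a for a, b in zip(y, y[1:])]


if __name__ == "__main__":
    rng = random.Random(0)
    arma = simulate_arma11(1000, phi=0.5, theta=0.3, rng=rng)
    arima = integrate(arma)          # nonstationary path
    back = first_difference(arima)   # matches arma[1:] up to rounding
    print(max(abs(a - b) for a, b in zip(back, arma[1:])))
```

The integrated path wanders over long stretches, which is the kind of behavior that can be mistaken for long range dependence in a finite sample.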
 
 
 


Is "Long Range Dependence" Real?



How to investigate?

Zooming Periodogram of 1st differences?

    Recall from Lecture 10/15/01

1st Try:  Zooming Periodogram on differences

    -    No "Long Range Dependence"

    -    I.e. no "pole at origin"

    -    But differencing "kills low frequency components"

    -    What is expected?
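
Why differencing suppresses low frequencies can be seen from the standard linear-filter formula for spectral densities. As a sketch, assume the original series $X_t$ is stationary with spectral density $f_X$, and let $Y_t = X_t - X_{t-1}$ (the symbols $f_X$, $f_Y$, $c$ and $\beta$ are notation introduced here, not taken from the lecture):

```latex
f_Y(\omega) = \left|1 - e^{-i\omega}\right|^2 f_X(\omega)
            = 4 \sin^2\!\left(\frac{\omega}{2}\right) f_X(\omega)
            \approx \omega^2 f_X(\omega) \quad (\omega \to 0).
```

So if $f_X(\omega) \sim c\,\omega^{-\beta}$ with $0 < \beta < 1$ (a pole at the origin), then $f_Y(\omega) \sim c\,\omega^{2-\beta} \to 0$. Under that assumption the periodogram of the differences would be expected to dip toward zero at low frequencies rather than show a pole.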
 

Recall direct Zooming Periodogram

    -    Massive pole at 0

    -    So proceeded with the log-log analysis

    -    And used "theoretical normalization"
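
For reference, a plain (non-zooming) periodogram can be sketched as below. This is a direct O(n^2) evaluation at the Fourier frequencies, written for clarity rather than speed; the function name and the 1/(2 pi n) normalization are assumptions, and the "zooming" and "theoretical normalization" steps from the lecture are not reproduced here.

```python
import cmath
import math


def periodogram(x):
    """Raw periodogram of a mean-corrected series at Fourier frequencies.

    Returns (w_j, I(w_j)) for w_j = 2*pi*j/n, j = 1..n//2, with
    I(w) = |sum_t (x_t - mean) * exp(-i*w*t)|^2 / (2*pi*n).
    """
    n = len(x)
    mean = sum(x) / n
    out = []
    for j in range(1, n // 2 + 1):
        w = 2.0 * math.pi * j / n
        s = sum((v - mean) * cmath.exp(-1j * w * t) for t, v in enumerate(x))
        out.append((w, abs(s) ** 2 / (2.0 * math.pi * n)))
    return out


if __name__ == "__main__":
    # A pure cosine at Fourier index 5 should give a single dominant peak.
    series = [math.cos(2.0 * math.pi * 5 * t / 64) for t in range(64)]
    w_peak, i_peak = max(periodogram(series), key=lambda p: p[1])
    print(w_peak, i_peak)
```

Plotting log I(w) against log w for the smallest frequencies is the log-log view in which a pole at 0 appears as an approximately straight line with negative slope.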