Course  OR 778

Class Notes   11/26/01




Last Time (10/31/01):

    -    Studied Cascaded On-Off Model
 

From 10/24/01:

    -    Asymptotic Independence, using SiZer analysis

    -    relationship between Size, Time and "Rate" = Size/Time

    -    Tail index estimation via slope of log-log CCDF

    -    Long Range Dependence vs. ARIMA(1,1,1)
 
 
 


Revisit: Asymptotic Independence




Recall from  10/22/01  and  10/24/01.
 

Main Issue:

Do "large values" (of variables X and Y) "occur together"?




Motivating data set:

HTTP Response Data, from UNC, 2001

(web browsing data transfer)




Variables of interest:
 

    -    Response size (bytes)
 

    -    Response time (secs)
 

    -    Transmission rate, size / time (bytes / sec)
 
 
 


Revisit: Asymptotic Independence (cont.)




Important point:

    for study of joint behavior of (X, Y),

    makes sense to "put on same scale"
 
 

So did Hill estimation, and power transformation

            (details in 10/22/01)

    i.e. study joint distribution of the power-transformed variables
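The details from 10/22/01 are not reproduced in these notes, so the following is only a minimal sketch of one way "put on same scale" could work. It assumes the standard Hill estimator and a power transformation that raises each variable to its estimated tail index (so a Pareto(alpha)-like tail becomes Pareto(1)-like). The names hill_tail_index and power_transform, the sample, and the choice of k are all illustrative assumptions.

```python
import math
import random

def hill_tail_index(data, k):
    """Hill estimate of the tail index alpha, using the k largest values."""
    if not 0 < k < len(data):
        raise ValueError("k must satisfy 0 < k < len(data)")
    top = sorted(data, reverse=True)[:k + 1]   # k largest values plus the (k+1)-th
    baseline = top[k]                          # (k+1)-th largest value
    mean_log_excess = sum(math.log(v / baseline) for v in top[:k]) / k
    return 1.0 / mean_log_excess

def power_transform(data, alpha):
    """Raise each value to the power alpha (assumed form of the transformation)."""
    return [v ** alpha for v in data]

# Simulated stand-in for one heavy-tailed variable (not the UNC data).
rng = random.Random(0)
sample = [rng.paretovariate(1.5) for _ in range(20000)]

alpha_hat = hill_tail_index(sample, k=2000)
rescaled = power_transform(sample, alpha_hat)
alpha_after = hill_tail_index(rescaled, k=2000)
print(round(alpha_hat, 2), round(alpha_after, 2))
```

Applying the same recipe to each variable separately gives two variables with (estimated) tail index 1, which is one sense in which they are "on the same scale".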
 
 

Manifestation of "asymptotic independence":

"axis hugging" in scatterplot




Scatterplot graphic
 

    -    found substantial apparent "axis hugging"
 

    -    but also exceptions
 

    -    interesting "streaks"

(common rates?)

(strongest Sunday morning: most off peak)




    -    how to make more precise?
 
 
 


Revisit: Asymptotic Independence (cont.)




More precise version:
 

    -    study "angles" from polar coordinates
 

    -    "axis hugging" appears as "piling up at ends"
 

    -    study statistical significance using  SiZer
 

    -    need "comparable scale" for "angle" to be sensible

(look carefully at scales in above scatterplot)




    -    so rescale by marginal medians
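The angle computation described above can be sketched as follows. Each variable is divided by its marginal median, and each point is then converted to a radius and an angle in [0, pi/2]. The function name and the five data points are hypothetical, chosen only to show one point hugging each axis.

```python
import math
import statistics

def polar_after_median_rescaling(x, y):
    """Divide each variable by its median, then return (radii, angles)."""
    mx = statistics.median(x)
    my = statistics.median(y)
    radii, angles = [], []
    for xi, yi in zip(x, y):
        u, v = xi / mx, yi / my
        radii.append(math.hypot(u, v))
        angles.append(math.atan2(v, u))   # 0 = along the x axis, pi/2 = along the y axis
    return radii, angles

# Hypothetical values: response sizes in bytes, response times in seconds.
size = [120000.0, 150000.0, 200000.0, 5000000.0, 130000.0]
time = [0.8, 1.0, 30.0, 1.2, 0.9]

radii, angles = polar_after_median_rescaling(size, time)
for r, a in zip(radii, angles):
    print(round(r, 2), round(a, 3))
```

Without the rescaling, sizes (hundreds of thousands of bytes) would dominate times (seconds), and nearly every angle would sit next to the size axis regardless of the dependence structure.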
 
 
 


Revisit: Asymptotic Independence (cont.)



Comparison 1:  Time (duration) vs. Size
 

    -    Expect "dependence", since "bigger files need more time"
 

    -    Saw some "dependence"

            (mostly weekday mornings & afternoons)
 

    -    And some "independence"

            (mostly weekday evenings, weekends)
 

    -    Possible reason for independence:

Packet loss delays, that are indep. of size



    -    Problem with this idea:

Would then expect most independence during high traffic weekdays (but saw the opposite)



    -    Explanation???
 
 
 


Revisit: Asymptotic Independence (cont.)




Comparison 2:  Rate vs. Time (duration)
 

    -    Generally independent (more than above)
 

    -    Consistent with above

(since long times are expected when rates are slower)



    -    But again more so for weekday evenings and weekends
 

    -    Small "dependent bumps" on weekdays
 

    -    Driver of this phenomenon???
 
 
 


Revisit: Asymptotic Independence (cont.)




Comparison 3:  Rate vs. Size
 

    -    Strong dependence
 

    -    biggest files get through fastest?
 

    -    Inconsistent with packet loss explanation

(since larger files should have more loss problems)




    -    strongest for weekdays (often unimodal)
 

    -    weaker for evenings / weekends (at least bimodal)
 

    -    Driver of this phenomenon???
 
 
 


Revisit: Asymptotic Independence (cont.)




Overall Conclusions:

    Packet loss explanation is wrong

    Need a new explanation
 
 

Could do:  explore variable 1 / time
 

    -    since "nonlinear part of rate"
 

    -    dubious, since only saved responses > 100 kbytes
 

    -    so careful treatment needs new data
 
 
 


Large Variable Association




Idea:

        variation of "Asymptotic Dependence"

        that is well defined for finite samples,

        not just in limit as n → ∞
 
 

Goals:
 

    1.    Indicate whether large values are

"more (less) associated than usual".






    2.    Reduce to classical Asymptotic Independence

in limit as n → ∞ (in interesting cases)









Large Variable Association (cont.)




Approach:
 

a.  For "properly adjusted marginals"

        (adjust extreme value distributions as above?)

        (surely "make scales comparable", e.g. divide by median)
 

b.  Consider "polar coordinates" versions (R, θ) of (X, Y),

where X = R cos θ and Y = R sin θ.
 

c.  Transform data to "angularly equally spaced",

    by replacing sorted θs with an equally spaced grid

            (essentially Probability Integral Transform)
 

d.  Study association of "large values", by density of θs

        where corresponding R is large (> threshold)
 

e.  Have "large variable association" when "pile up in middle"
 

f.  Have "large variable disassociation" when "pile up at ends"
 

g.  Use SiZer to study statistical significance
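Steps c through f above can be sketched as below, taking radii and angles as already computed. This is an illustration under assumptions: the grid (rank + 0.5)/n on (0, 1) is one possible "equally spaced grid", binned fractions stand in for a density estimate, and the SiZer significance analysis of step g is not attempted. Function names and the ten data points are hypothetical.

```python
def equally_spaced_angles(angles):
    """Step c: replace the sorted angles with an equally spaced grid on (0, 1)."""
    n = len(angles)
    order = sorted(range(n), key=lambda i: angles[i])
    grid = [0.0] * n
    for rank, i in enumerate(order):
        grid[i] = (rank + 0.5) / n
    return grid

def thresholded_bin_fractions(grid, radii, threshold, bins=5):
    """Step d: fraction of grid angles per bin, among points with radius > threshold."""
    counts = [0] * bins
    kept = 0
    for u, r in zip(grid, radii):
        if r > threshold:
            counts[min(int(u * bins), bins - 1)] += 1
            kept += 1
    if kept == 0:
        raise ValueError("no points exceed the threshold")
    return [c / kept for c in counts]

# Hypothetical angles (radians) and radii: the two large radii have mid-range angles.
angles = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8, 0.9, 1.2, 1.3, 1.4]
radii = [1.0, 1.0, 1.0, 1.0, 9.0, 8.0, 1.0, 1.0, 1.0, 1.0]

grid = equally_spaced_angles(angles)
fractions = thresholded_bin_fractions(grid, radii, threshold=5.0)
print(fractions)
```

Mass concentrated in the middle bins corresponds to step e ("pile up in middle"), and mass in the first and last bins to step f ("pile up at ends"). With no thresholding, every bin receives the same fraction by construction, which is the point of step c.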
 
 
 


Large Variable Association (cont.)




Toy Example 1:    5,000,000 absolute values of Standard Normal
 
 

Scatterplot, thresholded to 10,000
 

    -    Expect association same for large values
 

    -    Full data scatterplot similar, but with "filled in center"
 

    -    no "axis hugging"
 

    -    requires normalization, and much bigger n???
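A short derivation (added here as a check, not taken from the lecture) of why the association should look the same at every threshold in this example. Here Z_1, Z_2 denote the underlying independent standard normals, and (R, θ) the polar coordinates of (|Z_1|, |Z_2|). Both marginals have the same median, so rescaling by the population median does not change the angles.

```latex
% Joint density of (|Z_1|, |Z_2|) for independent standard normals Z_1, Z_2
f(x, y) = \frac{2}{\pi} \exp\!\left(-\frac{x^2 + y^2}{2}\right),
\qquad x, y \ge 0 .

% Polar coordinates x = r\cos\theta, \; y = r\sin\theta \; (Jacobian r)
f(r, \theta) = \underbrace{r \, e^{-r^2/2}}_{\text{density of } R}
               \cdot
               \underbrace{\frac{2}{\pi}}_{\text{density of } \theta},
\qquad r \ge 0, \quad 0 \le \theta \le \frac{\pi}{2} .
```

The density factors, so θ is uniform on [0, π/2] and independent of R. Conditioning on R > threshold therefore leaves the angle distribution uniform, whatever the threshold.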
 
 

Large Variable Association movie (through thresholds)
 

    -    As expected looks very uniform
 

    -    Expected effects for decreasing sample size
 

    -    Some boundary problems in SiZer analysis

(used crude "mirror image" adjustment)

(inadequate for n = 5,000,000)
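The notes do not say exactly how the "mirror image" adjustment was implemented. One common form is reflection of the data about each boundary before kernel smoothing, sketched below for a Gaussian kernel density estimate on [0, 1]. The function names, bandwidth, and sample are illustrative assumptions, and this is not the SiZer code itself.

```python
import math
import random

def gaussian_kde(data, x, h):
    """Plain Gaussian kernel density estimate at point x with bandwidth h."""
    norm = len(data) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - d) / h) ** 2) for d in data) / norm

def mirrored_kde(data, x, h, lo=0.0, hi=1.0):
    """Density estimate on [lo, hi] after reflecting the data about both endpoints."""
    reflected = (list(data)
                 + [2 * lo - d for d in data]
                 + [2 * hi - d for d in data])
    # gaussian_kde divides by 3n here; multiplying by 3 normalises by the
    # original n, so kernel mass that leaked outside [lo, hi] is folded back in.
    return 3.0 * gaussian_kde(reflected, x, h)

# Uniform data on [0, 1]: the true density is 1 everywhere, including the edges.
rng = random.Random(1)
data = [rng.random() for _ in range(5000)]
h = 0.05

plain_at_edge = gaussian_kde(data, 0.0, h)      # loses about half the kernel mass at the edge
mirrored_at_edge = mirrored_kde(data, 0.0, h)
print(round(plain_at_edge, 2), round(mirrored_at_edge, 2))
```

Reflection removes the leading-order loss of mass at the boundary but forces the estimate to be flat there, which is one reason such an adjustment can be called crude.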









Large Variable Association (cont.)




Toy Example 2:    5,000,000 independent Pareto (1.5)
 

Raw data scatterplot, thresholded to 10,000
 

    -    Strong "axis hugging" as expected
 

    -    What is "distribution of angles" (full data set)?
 
 

SiZer analysis of full set of angles
 

    -    Clear "U shape" (very significant)
 

    -    Thus transformation to "angularly equally spaced"

            is critical to concluding Large Variable Association
 
 

Check angular equal spacing, with SiZer analysis
 

    -    looks very uniform
 

    -    still have boundary effect problems
 
 

What is effect on corresponding scatterplot?
 

    -    surprisingly (?) negligible
 

    -    still have very strong axis hugging
 

    -    so thresholding should reveal strong Large Var. Disassoc.
 
 

Large Variable Association movie (through thresholds)
 

    -    As expected shows strong Large Variable Disassociation
 

    -    Strengthens for more thresholding
 

    -    Expected effects for decreasing sample size
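A small-scale simulation sketch of Toy Example 2, for readers who want to reproduce the qualitative behavior. It uses n = 50,000 rather than 5,000,000, and replaces the SiZer movie with a single summary number: the fraction of thresholded points whose equally spaced angle lies within 0.1 of either end. The seed, the sample size, the thresholds, and the helper name outer_fraction are all assumptions made for illustration.

```python
import math
import random
import statistics

rng = random.Random(2001)
n = 50000   # much smaller than the 5,000,000 used in the notes
x = [rng.paretovariate(1.5) for _ in range(n)]
y = [rng.paretovariate(1.5) for _ in range(n)]

# Rescale by marginal medians, then convert to polar coordinates.
mx, my = statistics.median(x), statistics.median(y)
radii = [math.hypot(a / mx, b / my) for a, b in zip(x, y)]
angles = [math.atan2(b / my, a / mx) for a, b in zip(x, y)]

# Replace sorted angles by an equally spaced grid on (0, 1).
order = sorted(range(n), key=lambda i: angles[i])
grid = [0.0] * n
for rank, i in enumerate(order):
    grid[i] = (rank + 0.5) / n

def outer_fraction(keep_top, edge=0.1):
    """Among the keep_top largest radii, fraction with grid angle within edge of 0 or 1."""
    cutoff = sorted(radii, reverse=True)[keep_top - 1]
    kept = [g for g, r in zip(grid, radii) if r >= cutoff]
    return sum(1 for g in kept if g < edge or g > 1 - edge) / len(kept)

# No thresholding, then two increasingly severe thresholds.
for keep_top in (n, 5000, 500):
    print(keep_top, round(outer_fraction(keep_top), 3))
```

With no thresholding the fraction is 0.2 by construction. Values rising toward 1 as fewer points are kept correspond to "pile up at ends", i.e. Large Variable Disassociation that strengthens with more thresholding.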