Course  OR 778

Class Notes   10/22/01






Last Time:
 

    -    In Context of Simple Model:

            -    Investigated "Indep'ce" part of Poisson process starts

            -    Saw similar results to before

            -    I.e. Cluster Poisson sensible, Weibull interarrival not
 

    -    Studied Flow Scatterplots & "asymptotic independence"

            -    Boundary problems of IP flow data were too severe

    -    Introduced new HTTP Response Size data

            -    Looked at summary statistics

            -    Tried first version of Scatterplots
 
 
 


New HTTP Response Data






Data sources:  4 hour blocks of packet headers

        "Morning":   8:00-12:00

        "Afternoon":   13:00-17:00

        "Evening":    19:30-23:30
 
 

Gathered at UNC Main Link

During 7 days in April 2001

(not consecutive days due to hardware problems)

(but each weekday is represented)





Headers filtered to only HTTP packets

(recall subset of  IP  and  TCP)





Updated information:  these are "real responses",

i.e. individual files sent by TCP.



This information is extracted from packet headers.
 

Not simply determined only by addresses

(as done for IP data earlier)






New Flow Scatterplots








Recall Idea:  look for "asymptotic independence"
 

Expectation:  for heavy tailed distributions,

scatterplot should "hug axes"
 
 

Approach:  for new HTTP Response Sizes,

plot "response time duration (sec)" vs. "size of response (bytes)"

Combined Graphic (all 21 time blocks)
 

Personal "axis hugging rating":

Monday Morning:  Good

Monday Afternoon:  Good

Monday Evening:  OK

Tuesday Morning:  Very good

Tuesday Afternoon:  Good

Tuesday Evening:  Poor

Wednesday Morning:  Good

Wednesday Afternoon:  Very good

Wednesday Evening:  OK

Thursday Morning:  OK

Thursday Afternoon:  Good

Thursday Evening:  Very good

Friday Morning:  Poor

Friday Afternoon:  Poor

Friday Evening:  Good

Saturday Morning:  Good

Saturday Afternoon:  OK

Saturday Evening:  OK

Sunday Morning:  Good

Sunday Afternoon:  Poor

Sunday Evening:  OK
 
 
 


New Flow Scatterplots (cont.)








Other Observations:
 

    -    No apparent patterns???    With respect to:

            -    day of week?

            -    time of day?

            -    traffic load?

            -    max response size?
 

    -    Interesting "streaks" at some times

(strongest on Sunday morning)

            -    suggests a few common "transmission rates"

            -    most clear on Sun. morning, since least packet loss
 
 
 


New Flow Scatterplots (cont.)





Extreme Value Rescaled Analysis:
 

Idea:  put everything on "same tail index scale"
 
 

Notation:  let $X$ denote Response Size (bytes)

    and let $Y$ denote Response Duration (sec)
 
 

Approach:  estimate "tail index" $\alpha$

(for both of $X$ and $Y$ separately)




then apply "axis hugging" view to joint distribution of $(X^{\hat\alpha_X}, Y^{\hat\alpha_Y})$
 
 
 


New Flow Scatterplots (cont.)





Estimation of Tail Index, $\alpha$:

Hill Estimator





Reference:   sec. 6.4.2  of:

Embrechts, P., Klueppelberg, C. and Mikosch, T. (1997) Modelling Extremal Events, Springer.
 
 

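For "reverse order statistics", $X_{(1)} \ge X_{(2)} \ge \cdots \ge X_{(n)}$
 

Compare "previous data point" $X_{(k)}$
 

    with the tail $X_{(1)}, \ldots, X_{(k)}$, on a log scale, to get $\ln X_{(j)} - \ln X_{(k)}$, $j = 1, \ldots, k$
 

I.e. define $\hat\alpha = \left( \frac{1}{k} \sum_{j=1}^{k} \ln X_{(j)} - \ln X_{(k)} \right)^{-1}$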
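
A minimal sketch of the Hill estimator in the form given in the cited reference (average log excess over the k-th largest value, inverted). The function name `hill_estimator` and the synthetic Pareto test data are assumptions for illustration only; they are not the HTTP response data from these notes.

```python
import math
import random

def hill_estimator(data, k):
    """Hill estimate of the tail index alpha from the k largest values.

    Computes 1 / ( (1/k) * sum_{j=1..k} [ln X_(j) - ln X_(k)] ),
    where X_(1) >= X_(2) >= ... are the reverse order statistics.
    Assumes all data values are strictly positive.
    """
    if not 2 <= k <= len(data):
        raise ValueError("need 2 <= k <= len(data)")
    top = sorted(data, reverse=True)[:k]      # X_(1), ..., X_(k)
    log_threshold = math.log(top[-1])         # ln X_(k)
    mean_excess = sum(math.log(x) - log_threshold for x in top) / k
    return 1.0 / mean_excess

# Synthetic check only: exact Pareto sample with true alpha = 1.5.
rng = random.Random(0)
sample = [rng.paretovariate(1.5) for _ in range(20000)]
alpha_hat = hill_estimator(sample, 2000)
```

On exact Pareto data the estimate should land near the true index; on real traffic data the answer depends heavily on k.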
 
 
 


New Flow Scatterplots (cont.)





Results of Hill Estimation:  graphic
 
 

    -    Strong dependence on number of data points $k$

    -    estimates mostly in range $1 < \hat\alpha < 2$

(recall infinite variance, but finite mean)

    -    but not always (sometimes above, sometimes below)

    -    Considered $k$ from 100 to 2400

    -    Worth going smaller than 100???

    -    Which to use as an estimate?

    -    Chose the average over 100 to 2000

    -    Lower bound since "don't trust below that"

    -    Upper bound since use largest 2000 in scatterplot

    -    Is the average sensible????
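
The averaging choice above can be sketched as follows: compute the Hill estimate for every k in a range, then average. This is an illustrative sketch on synthetic Pareto data; `hill_path` is a hypothetical helper name, and the range 100 to 2000 simply mirrors the bounds mentioned in the list.

```python
import math
import random

def hill_path(data, k_min, k_max):
    """Return {k: Hill estimate} for k_min <= k <= k_max, using a running sum."""
    if not 2 <= k_min <= k_max <= len(data):
        raise ValueError("need 2 <= k_min <= k_max <= len(data)")
    logs = sorted((math.log(x) for x in data), reverse=True)
    estimates = {}
    running = sum(logs[:k_min - 1])            # sum of ln X_(1..k_min-1)
    for k in range(k_min, k_max + 1):
        running += logs[k - 1]                 # now the sum of ln X_(1..k)
        mean_excess = running / k - logs[k - 1]
        estimates[k] = 1.0 / mean_excess
    return estimates

# Synthetic check only: exact Pareto sample with true alpha = 1.3.
rng = random.Random(1)
sample = [rng.paretovariate(1.3) for _ in range(50000)]
path = hill_path(sample, 100, 2000)
alpha_avg = sum(path.values()) / len(path)            # the "average over k"
spread = max(path.values()) - min(path.values())      # how much k matters
```

The `spread` value gives a rough feel for how strongly the estimate depends on k, which bears on whether the average is a sensible summary.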
 
 


New Flow Scatterplots (cont.)





Now for scatterplot of joint distribution of $(X^{\hat\alpha_X}, Y^{\hat\alpha_Y})$:
 

"Biggest 2000" now chosen by "distance to origin",

avoids potential earlier "biasing towards size" effects
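
A rough sketch of the selection step, under two assumptions that the notes do not spell out: each coordinate is raised to the power of its own estimated tail index, and "distance to origin" means Euclidean distance on the rescaled axes. The function name `rescale_and_select` is hypothetical.

```python
import math

def rescale_and_select(sizes, durations, alpha_size, alpha_dur, n_keep):
    """Raise each coordinate to its tail-index power (assumed rescaling),
    then keep the n_keep points farthest from the origin."""
    points = [(s ** alpha_size, d ** alpha_dur)
              for s, d in zip(sizes, durations)]
    # Sort by Euclidean distance to the origin, largest first.
    points.sort(key=lambda p: math.hypot(p[0], p[1]), reverse=True)
    return points[:n_keep]

# Tiny illustrative input (made-up numbers, not measured data).
selected = rescale_and_select([1, 4, 9, 100], [1, 2, 3, 0.5], 0.5, 2, 2)
```

Selecting on the joint distance rather than on size alone treats both axes symmetrically, which is the point of avoiding the "biasing towards size" effect.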
 
 

Results:

Extreme Value rescaled Scatterplot
 

    -    Not much different from unscaled version?

    -    Since most estimated $\alpha$ near 1??

    -    Thursday Evening has largest relative difference?

    -    Rescaled Thursday Evening has less "axis hugging"?

    -    I.e. this transformation "hurts" axis hugging???

(this case only)

    -    most others have "more axis hugging"??

    -    Only because most powers $\hat\alpha$ are larger than one?

    -    Visual impression ("axis hugging") driven mostly by

few biggest data points???

    -    What about other scales?
 
 


New Flow Scatterplots (cont.)





Joint Distribution of Size and Duration (cont.)
 

Same scatterplot, on Square Root Scale
 

    -    significantly reduces visual impression of "axis hugging"
 

    -    Wednesday Afternoon looks best?
 

    -    "asymptotic independence" depends on "scale"?
 
 
 


New Flow Scatterplots (cont.)





Joint Distribution of Size and Duration (cont.)
 
 

A deeper look at power transformations:

The Box-Cox Power family




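Given a power $\lambda$ modify the usual

"power transformation" $x \mapsto x^{\lambda}$




by a linear transformation:

$x \mapsto \dfrac{x^{\lambda} - 1}{\lambda}$



Reason:  this gives "continuity at 0":

i.e. includes the log transformation as well (the limit as $\lambda \to 0$ is $\ln x$)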
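
A small sketch of the Box-Cox family, showing numerically that the shifted and scaled power transformation approaches the log as the power goes to 0. The function name `box_cox` and the parameter name `lam` are illustrative choices.

```python
import math

def box_cox(x, lam):
    """Box-Cox transformation of a positive value x with power lam.

    lam = 1 is a shift of the original scale; lam = 0 is the log.
    """
    if x <= 0:
        raise ValueError("Box-Cox needs x > 0")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam

# Continuity at 0: a tiny power gives nearly the same value as the log.
near_zero = box_cox(5.0, 1e-8)
at_zero = box_cox(5.0, 0)
```

Without the linear correction, the plain power x^lam would tend to the constant 1 as lam goes to 0, so the family would not connect to the log scale.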
 
 
 


New Flow Scatterplots (cont.)





Box-Cox transformed Joint Distribution of Size and Duration
 
 

Thursday Evening  graphic

    -    Looked best on original scale

    -    Consider the range of $\lambda$

from 1 (no change)  to  0  (log)

    -    Smaller $\lambda$ means less "axis hugging"

    -    "Axis hugging" needs to be defined in terms of scale????

    -    Seem to need:   improved finite sample definition of

"asymptotic indep."??







New Flow Scatterplots (cont.)





Big Problem with Hill Estimation:   Choice of $k$
 

A simplification:

can reduce from choosing two values to one




By estimating the ratio of the two tail indices, using the ratio of the two Hill estimates

(for the same $k$)




Then plot 
 

Tuning of Hill Ratio Estimator:  graphic

    -    Often rather close to 1?

    -    Thursday evening again an "extreme case"

(but recall transforming hurt "axis hugging")

    -    Stronger dependence on $k$?

    -    Ratio version any better than just "using same $k$"?

    -    Worth pursuing???
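
To make the single-k ratio idea concrete, here is a sketch that computes the ratio of two Hill estimates over a grid of k values. The direction of the ratio (first variable over second) is an assumption, as are the helper names `hill` and `hill_ratio_path`; the data are synthetic Pareto samples, not the HTTP responses.

```python
import math
import random

def hill(logs_desc, k):
    """Hill estimate from log-values sorted in decreasing order."""
    return 1.0 / (sum(logs_desc[:k]) / k - logs_desc[k - 1])

def hill_ratio_path(x, y, ks):
    """Return {k: alpha_hat_x(k) / alpha_hat_y(k)}, same k for both variables."""
    lx = sorted(map(math.log, x), reverse=True)
    ly = sorted(map(math.log, y), reverse=True)
    return {k: hill(lx, k) / hill(ly, k) for k in ks}

# Synthetic check only: true tail indices 1.2 and 1.8, so true ratio is 2/3.
rng = random.Random(2)
x = [rng.paretovariate(1.2) for _ in range(20000)]
y = [rng.paretovariate(1.8) for _ in range(20000)]
ratios = hill_ratio_path(x, y, range(100, 2001, 100))
```

Plotting `ratios` against k would be the analogue of the tuning graphic: a flat curve suggests the ratio is less sensitive to k than either estimate alone.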
 
 
 


New Flow Scatterplots (cont.)





Things that could be done next:
 

1.    Follow up on "streaks", i.e. understand those transm'n rates?

    -    Do overlay of important ones on all plots?

    -    If they appear frequently, get expert interpretation?

    -    Apply 2-d  SiZer type analysis to scatterplots?
 

2.    Look at log scale versions?

    -    What do we expect to learn?
 

3.    Get quantitative about "axis hugging"?

    -    Useful finite sample def'n of "asy. indep."???

    -    needs attention to "power - scale" of data??
 
 

How interesting are these???
 
 
 


New Response Size Q-Q Plots





Another view of New Response Size Data:

Extreme Value Tail Index $\alpha$





Recall Intuition:
 

    -    Shape parameter of Pareto  (polynomial power)
 

    -    Strong relation to Long Range Dependence

            -    in Mice and Elephants plots  (graphic)

            -   in Duration Distributions,

implies Classical LRD in aggregated time series




    -    Strong relation to moments:

            -    for $\alpha \le 1$ have infinite mean

            -    for $1 < \alpha \le 2$ have finite mean but infinite variance

            -    for $\alpha > 2$ both mean and variance are finite

            -    similar for larger $\alpha$ and higher moments
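
A worked check of these moment thresholds, assuming the idealized exact Pareto tail $P(X > x) = x^{-\alpha}$ for $x \ge 1$ (an assumption for illustration, not a claim about the HTTP data):

```latex
E[X^p] = \int_1^\infty x^p \, \alpha x^{-\alpha - 1} \, dx
       = \begin{cases}
           \dfrac{\alpha}{\alpha - p}, & p < \alpha, \\[1ex]
           \infty, & p \ge \alpha.
         \end{cases}
```

Taking $p = 1$ gives a finite mean exactly when $\alpha > 1$, and $p = 2$ gives a finite second moment (hence finite variance) exactly when $\alpha > 2$; higher moments follow the same pattern.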
 
 
 


New Response Size Q-Q Plots (cont.)





Simple, straightforward Estimation of $\alpha$:

Slope of CCDF (i.e. 1 - CDF) on log - log scale
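
A sketch of how the log-log CCDF points might be computed, with a least-squares slope through the upper tail as one way to read off a slope. The least-squares fit is an assumption added for illustration (the notes only say "slope"), and `loglog_ccdf` and `tail_slope` are hypothetical names.

```python
import math
import random

def loglog_ccdf(data):
    """Points (ln x, ln P(X > x)) of the empirical CCDF.

    Tied values contribute one point; the maximum is dropped (CCDF = 0 there).
    """
    xs = sorted(data)
    n = len(xs)
    points = []
    for i, x in enumerate(xs):
        last_of_ties = (i == n - 1) or (xs[i + 1] != x)
        remaining = n - i - 1                 # observations strictly above x
        if last_of_ties and remaining > 0:
            points.append((math.log(x), math.log(remaining / n)))
    return points

def tail_slope(points, n_tail):
    """Least-squares slope through the last n_tail points."""
    tail = points[-n_tail:]
    mx = sum(p[0] for p in tail) / len(tail)
    my = sum(p[1] for p in tail) / len(tail)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in tail)
    sxx = sum((p[0] - mx) ** 2 for p in tail)
    return sxy / sxx

# Synthetic check only: exact Pareto with alpha = 1.5, expected slope near -1.5.
rng = random.Random(3)
sample = [rng.paretovariate(1.5) for _ in range(20000)]
slope = tail_slope(loglog_ccdf(sample), 2000)
```

For an exact Pareto tail the log-log CCDF is a straight line with slope equal to minus the tail index, so curvature in the real plots is informative.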





Log-log CCDF:  graphic
 

    -    All 21 time blocks appear as thin blue lines

    -    Each Individual labeled and highlighted in thick red

    -    Not very "linear"?

    -    Suggests classical extreme value theory

hasn't "kicked in" yet???

    -    Note "shapes" of curves surprisingly constant

    -    Suggests curvature is not "random phenomenon"!

    -    Instead something systematic about internet traffic?

    -    Point worth deeper statistical confirmation??

    -    Suggests enhancement of current mathematics????

    -    Friday evening an "extreme point"?  (least steep?)

    -    Many Resp. Sizes near 400 bytes???

(also for Friday Afternoon, nowhere else?)

    -    Worth plotting data between 0.999 quantile and max???

(1,000 to 7,000 of these for each time block....)







New Response Size Q-Q Plots (cont.)





Now estimate "tail index" $\alpha$, by studying:

Slopes:  graphic




    -    Simply use difference quotients from log-log CCDF

    -    Numerical problem:  0 denominators

    -    Reset to bottom of plot

    -    Suggest ignoring those

    -    Could use fancier differentiation (e.g. over bigger range)

    -    But this "raw data" shows interesting structure

    -    "Almost always" have   (interesting for LRD)

    -    But no apparent "tail limit" for $\alpha$?

    -    So do not satisfy "classical heavy tail definition"?

    -    But still clearly "intuitively heavy tailed"?

    -    Worth exploring alternate definitions?
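
The raw difference quotients discussed above can be sketched as follows. Here a 0 denominator (two equal consecutive data values) is marked with `None` so that it can be ignored, which is one reading of the suggestion in the list; the function name `ccdf_slopes` is hypothetical.

```python
import math

def ccdf_slopes(data):
    """Difference quotients of the log-log empirical CCDF between
    consecutive sorted observations; None marks a 0 denominator (tie)."""
    xs = sorted(data)
    n = len(xs)
    slopes = []
    for i in range(n - 2):                    # stop before the CCDF hits 0
        dx = math.log(xs[i + 1]) - math.log(xs[i])
        dy = math.log((n - i - 2) / n) - math.log((n - i - 1) / n)
        slopes.append(dy / dx if dx != 0 else None)
    return slopes

# Tiny illustrative input with one tie (made-up numbers).
slopes = ccdf_slopes([1, 2, 2, 4, 8])
usable = [s for s in slopes if s is not None]
```

Each usable slope is a local estimate of minus the tail index; differencing over a wider window would smooth these at the cost of hiding the fine structure.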
 
 
 


New Mathematics for "Heavy Tails"?





Version 1:    For some ,





Open Problem 1:    For the simple Model,

with Version 1 tailed Duration Dist'n,   is

?




(i.e. have index   LRD)