Lecture11-26-01

Course OR 778

Class Notes 11/26/01

Last Time (10/31/01):

- Studied Cascaded On-Off Model

From 10/24/01:

- Asymptotic Independence, using SiZer analysis

- relationship between Size, Time and "Rate" = Size/Time

- Tail index estimation via slope of log-log CCDF

- Long Range Dependence vs. ARIMA(1,1,1)
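As a reminder of the log-log CCDF idea recalled above, here is a minimal sketch. It assumes positive data and a plain least-squares fit over the largest 10% of observations. The function name, the `upper_fraction` cutoff and the synthetic Pareto sample are illustrative choices, not taken from the lecture.

```python
import math
import random


def ccdf_slope_tail_index(sample, upper_fraction=0.1):
    """Tail index estimate: minus the least-squares slope of
    log(empirical CCDF) against log(x), using the largest observations.
    Assumes all values are positive."""
    xs = sorted(sample)
    n = len(xs)
    start = int(n * (1.0 - upper_fraction))
    # Empirical CCDF taken as P(X >= x_(i)) = (n - i) / n, which stays positive.
    log_x = [math.log(xs[i]) for i in range(start, n)]
    log_ccdf = [math.log((n - i) / n) for i in range(start, n)]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_ccdf) / len(log_ccdf)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_ccdf))
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    return -sxy / sxx


random.seed(0)
# Synthetic check data: classical Pareto with tail index 1.5 (not the HTTP data).
sample = [random.paretovariate(1.5) for _ in range(50000)]
print(ccdf_slope_tail_index(sample))
```

The estimate depends on how much of the upper tail enters the fit, so in practice the log-log plot itself should be inspected before trusting a single slope.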

Revisit: Asymptotic Independence

Recall from 10/22/01 and 10/24/01.

Main Issue:

Do "large values" (of two variables) "occur together"?

Motivating data set:

HTTP Response Data, from UNC, 2001

(web browsing data transfer)

Variables of interest:

- Response size (bytes)

- Response time (secs)

- Transmission rate, size / time (bytes / sec)

Revisit: Asymptotic Independence (cont.)

Important point:

for study of joint behavior of two variables,

makes sense to "put on same scale"

So did Hill estimation, and power transformation

(details in 10/22/01)

i.e. study joint distribution of the power-transformed variables
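The 10/22/01 details are not reproduced here, so the following is only a sketch of the standard Hill estimator, followed by one possible power transformation. The assumption (labeled in the code) is that each variable is raised to its own estimated tail index, which gives the transformed variable a tail index of about 1 and so puts both variables on the same scale. The function names and the choice k = 2000 are hypothetical.

```python
import math
import random


def hill_tail_index(sample, k):
    """Hill estimate of the tail index from the k largest observations.
    Assumes all values are positive."""
    xs = sorted(sample, reverse=True)
    if not 0 < k < len(xs):
        raise ValueError("k must satisfy 0 < k < len(sample)")
    threshold = xs[k]  # the (k + 1)-th largest value
    mean_log_excess = sum(math.log(x / threshold) for x in xs[:k]) / k
    return 1.0 / mean_log_excess


def power_transform(sample, alpha_hat):
    """ASSUMED form of the power transformation: x -> x ** alpha_hat.
    If P(X > x) behaves like x ** -alpha, then X ** alpha has tail index 1."""
    return [x ** alpha_hat for x in sample]


random.seed(0)
# Synthetic check data: classical Pareto with tail index 1.5 (not the HTTP data).
sample = [random.paretovariate(1.5) for _ in range(50000)]
alpha_hat = hill_tail_index(sample, k=2000)
rescaled = power_transform(sample, alpha_hat)
print(alpha_hat, hill_tail_index(rescaled, k=2000))
```

The Hill estimate is sensitive to the choice of k, so a plot of the estimate against k is the usual companion to a single number.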

Manifestation of "asymptotic independence":

"axis hugging" in scatterplot

Scatterplot graphic

- found substantial apparent "axis hugging"

- but also exceptions

- interesting "streaks"

(common rates?)

(strongest Sunday morning: most off peak)

- how to make more precise?

Revisit: Asymptotic Independence (cont.)

More precise version:

- study "angles" from polar coordinates

- "axis hugging" appears as "piling up at ends"

- study statistical significance using SiZer

- need "comparable scale" for "angle" to be sensible

(look carefully at scales in above scatterplot)

- so rescale by marginal medians
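A small sketch of why the rescaling matters for angles. The "sizes" and "times" below are hypothetical synthetic stand-ins (independent Pareto draws on very different scales), not the UNC data, and `polar_angles` is an illustrative helper name.

```python
import math
import random
import statistics


def polar_angles(xs, ys, rescale=True):
    """Angles (radians, in [0, pi/2] for positive data) of the points (x, y),
    optionally after dividing each coordinate by its marginal median."""
    mx = statistics.median(xs) if rescale else 1.0
    my = statistics.median(ys) if rescale else 1.0
    return [math.atan2(y / my, x / mx) for x, y in zip(xs, ys)]


random.seed(1)
# Hypothetical stand-ins: "sizes" in bytes (at least 100,000), "times" in seconds.
sizes = [1e5 * random.paretovariate(1.5) for _ in range(5000)]
times = [random.paretovariate(1.5) for _ in range(5000)]

raw = polar_angles(sizes, times, rescale=False)
scaled = polar_angles(sizes, times, rescale=True)
print(statistics.median(raw), statistics.median(scaled))
```

Without rescaling, nearly every angle sits next to the size axis simply because bytes are numerically much larger than seconds; after dividing by the medians the angles spread over the whole range from 0 to pi/2.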

Revisit: Asymptotic Independence (cont.)

Comparison 1: Time (duration) vs. Size

- Expect "dependence", since "bigger files need more time"

- Saw some "dependence"

(mostly weekday mornings & afternoons)

- And some "independence"

(mostly weekday evenings, weekends)

- Possible reason for independence:

Packet loss delays, that are indep. of size

- Problem with this idea:

Would then expect most independence during high traffic weekdays (opposite of what was seen)

- Explanation???

Revisit: Asymptotic Independence (cont.)

Comparison 2: Rate vs. Time (duration)

- Generally independent (more than above)

- Consistent with above

(since long times are expected when rates are slower)

- But again more so for weekday evenings and weekends

- Small "dependent bumps" on weekdays

- Driver of this phenomenon???

Revisit: Asymptotic Independence (cont.)

Comparison 3: Rate vs. Size

- Strong dependence

- biggest files get through fastest?

- Inconsistent with packet loss explanation

(since larger files should have more loss problems)

- strongest for weekdays (often unimodal)

- weaker for evenings / weekends (at least bimodal)

- Driver of this phenomenon???

Revisit: Asymptotic Independence (cont.)

Overall Conclusions:

Packet loss explanation is wrong

Need a new explanation

Could do: explore variable 1 / time

- since "nonlinear part of rate"

- dubious, since only saved responses > 100 kbytes

- so careful treatment needs new data

Large Variable Association

Idea:

variation of "Asymptotic Dependence"

that is well defined for finite samples,

not just in the limit

Goals:

1. Indicate whether large values are

"more (less) associated than usual".

2. Reduce to classical Asymptotic Independence

in the limit (in interesting cases)

Large Variable Association (cont.)

Approach:

a. For "properly adjusted marginals"

(adjust extreme value distributions as above?)

(surely "make scales comparable", e.g. divide by median)

b. Consider "polar coordinates" versions (radius, angle) of the adjusted pair (x, y),

where radius = sqrt(x^2 + y^2) and angle = arctan(y / x).

c. Transform data to "angularly equally spaced",

by replacing sorted angles with an equally spaced grid

(essentially Probability Integral Transform)

d. Study association of "large values", by density of angles

where corresponding radius is large (> threshold)

e. Have "large variable association" when "pile up in middle"

f. Have "large variable disassociation" when "pile up at ends"

g. Use SiZer to study statistical significance
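Steps a through f can be sketched as follows. This is only one reading of the approach, with these assumptions: the marginal adjustment is just division by the median, "large" means the `keep_largest` points with the biggest radius (equivalent to a radius threshold at that order statistic), and a plain histogram stands in for the SiZer analysis of step g, which is not implemented here. The function name and the test data (independent Pareto-type draws with support starting at 0) are hypothetical.

```python
import math
import random
import statistics


def large_value_angle_counts(xs, ys, keep_largest, bins=10):
    """Histogram counts of grid angles for the points with the largest radius."""
    n = len(xs)
    # (a) make scales comparable: divide each variable by its median
    mx, my = statistics.median(xs), statistics.median(ys)
    u = [x / mx for x in xs]
    v = [y / my for y in ys]
    # (b) polar coordinates
    radius = [math.hypot(a, b) for a, b in zip(u, v)]
    angle = [math.atan2(b, a) for a, b in zip(u, v)]
    # (c) replace the sorted angles by an equally spaced grid on (0, pi/2)
    grid_angle = [0.0] * n
    for rank, i in enumerate(sorted(range(n), key=angle.__getitem__)):
        grid_angle[i] = (rank + 0.5) / n * (math.pi / 2)
    # (d) keep only the points whose radius is among the largest
    largest = sorted(range(n), key=radius.__getitem__, reverse=True)[:keep_largest]
    counts = [0] * bins
    for i in largest:
        counts[min(int(grid_angle[i] / (math.pi / 2) * bins), bins - 1)] += 1
    return counts


random.seed(0)
n = 20000
# Assumed Pareto form with support starting at 0: P(X > x) = (1 + x) ** -1.5.
xs = [random.paretovariate(1.5) - 1.0 for _ in range(n)]
ys = [random.paretovariate(1.5) - 1.0 for _ in range(n)]
counts = large_value_angle_counts(xs, ys, keep_largest=500)
print(counts)
```

Counts concentrated in the first and last bins correspond to (f), "pile up at ends"; counts concentrated in the central bins would correspond to (e), "pile up in middle".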

Large Variable Association (cont.)

Toy Example 1: 5,000,000 absolute values of Standard Normal

Scatterplot, thresholded to 10,000

- Expect association same for large values

- Full data scatterplot similar, but with "filled in center"

- no "axis hugging"

- requires normalization, and much bigger n???
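A scaled-down sketch of this toy example, assuming that "thresholded" means keeping the points with the largest radius (here 2,000 of 50,000 rather than 10,000 of 5,000,000). For independent standard normals the joint density depends only on the radius, so the angle is independent of the radius and uniform on [0, pi/2]; both coordinates have the same distribution, so no median rescaling is applied in this sketch.

```python
import math
import random

random.seed(2)
n, keep, bins = 50000, 2000, 5
points = [(abs(random.gauss(0, 1)), abs(random.gauss(0, 1))) for _ in range(n)]

# Keep the points with the largest radius (assumed meaning of "thresholded").
points.sort(key=lambda p: math.hypot(p[0], p[1]), reverse=True)
largest = points[:keep]

# Histogram of the angles of the retained points over [0, pi/2].
counts = [0] * bins
for x, y in largest:
    angle = math.atan2(y, x)
    counts[min(int(angle / (math.pi / 2) * bins), bins - 1)] += 1
print(counts)
```

Roughly equal counts across the bins are what "same association for large values" looks like in terms of angles.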

Large Variable Association movie (through thresholds)

- As expected looks very uniform

- Expected effects for decreasing sample size

- Some boundary problems in SiZer analysis

(used crude "mirror image" adjustment)

(inadequate for n = 5,000,000)
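The notes do not spell out the "mirror image" adjustment. A common version, assumed here, reflects the data about each end of the interval before kernel smoothing, so that mass which would leak past the boundary is folded back in. The sketch uses uniform data on [0, 1] as a stand-in for the equally spaced angles; the function name and bandwidth are illustrative.

```python
import math
import random


def gaussian_kde_at(t, data, h, reflect_about=()):
    """Gaussian kernel density estimate at t, optionally adding mirror images
    of the data about each boundary point listed in reflect_about."""
    points = list(data)
    for b in reflect_about:
        points.extend(2.0 * b - x for x in data)
    # Normalize by the ORIGINAL sample size, so reflected mass is folded back.
    norm = len(data) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((t - z) / h) ** 2) for z in points) / norm


random.seed(3)
data = [random.random() for _ in range(5000)]  # true density is 1 on [0, 1]
plain_at_edge = gaussian_kde_at(0.0, data, h=0.1)
mirrored_at_edge = gaussian_kde_at(0.0, data, h=0.1, reflect_about=(0.0, 1.0))
print(plain_at_edge, mirrored_at_edge)
```

The unadjusted estimate drops to about half the true height at the boundary, while the reflected one stays near it. Reflection also forces a zero slope at the boundary, which may be part of why the adjustment is described as crude.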

Large Variable Association (cont.)

Toy Example 2: 5,000,000 independent Pareto (1.5)

Raw data scatterplot, thresholded to 10,000

- Strong "axis hugging" as expected

- What is "distribution of angles" (full data set)?

SiZer analysis of full set of angles

- Clear "U shape" (very significant)

- Thus transformation to "angularly equally spaced"

is critical to concluding Large Variable Association
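The "U shape" of the full set of angles can be checked with a histogram in place of SiZer. The lecture does not say which Pareto parameterization was used; this sketch assumes the form with support starting at 0, that is P(X > x) = (1 + x)^(-1.5), and uses 20,000 pairs instead of 5,000,000.

```python
import math
import random
import statistics

random.seed(4)
n, bins = 20000, 10
# Assumed Pareto form: support starting at 0, P(X > x) = (1 + x) ** -1.5.
xs = [random.paretovariate(1.5) - 1.0 for _ in range(n)]
ys = [random.paretovariate(1.5) - 1.0 for _ in range(n)]

# Rescale by marginal medians, then histogram the angles over [0, pi/2].
mx, my = statistics.median(xs), statistics.median(ys)
counts = [0] * bins
for x, y in zip(xs, ys):
    angle = math.atan2(y / my, x / mx)
    counts[min(int(angle / (math.pi / 2) * bins), bins - 1)] += 1
print(counts)
```

Because the full data set already piles up at the ends, a pile-up among the large values alone says little until the angles have been made equally spaced.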

Check angular equal spacing, with SiZer analysis

- looks very uniform

- still have boundary effect problems

What is effect on corresponding scatterplot?

- surprisingly (?) negligible

- still have very strong axis hugging

- so thresholding should reveal strong Large Var. Disassoc.

Large Variable Association movie (through thresholds)

- As expected shows strong Large Variable Disassociation

- Strengthens for more thresholding

- Expected effects for decreasing sample size
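A numerical stand-in for the movie: for a decreasing number of retained points (that is, an increasing radius threshold), compute the share of equally spaced angles that fall in the outer 10% at either end. The same assumptions apply as before: median rescaling only, Pareto draws with support starting at 0, 20,000 pairs, and a hypothetical function name.

```python
import math
import random
import statistics


def end_fraction_by_threshold(xs, ys, keep_counts, end_width=0.1):
    """For each keep count, the fraction of the largest-radius points whose
    equally spaced angle lies within end_width of either end of the grid."""
    n = len(xs)
    mx, my = statistics.median(xs), statistics.median(ys)
    u = [x / mx for x in xs]
    v = [y / my for y in ys]
    radius = [math.hypot(a, b) for a, b in zip(u, v)]
    angle = [math.atan2(b, a) for a, b in zip(u, v)]
    # Position of each angle on the equally spaced grid, as a fraction of pi/2.
    position = [0.0] * n
    for rank, i in enumerate(sorted(range(n), key=angle.__getitem__)):
        position[i] = (rank + 0.5) / n
    by_radius = sorted(range(n), key=radius.__getitem__, reverse=True)
    fractions = []
    for keep in keep_counts:
        at_ends = sum(
            1 for i in by_radius[:keep]
            if position[i] < end_width or position[i] > 1.0 - end_width
        )
        fractions.append(at_ends / keep)
    return fractions


random.seed(5)
n = 20000
xs = [random.paretovariate(1.5) - 1.0 for _ in range(n)]
ys = [random.paretovariate(1.5) - 1.0 for _ in range(n)]
fractions = end_fraction_by_threshold(xs, ys, keep_counts=[20000, 2000, 200])
print(fractions)
```

With all points kept the share is 0.2 by construction; a share that grows as fewer points are kept is the numerical counterpart of disassociation strengthening under more thresholding, with noisier values as the retained sample shrinks.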