Class Notes 11/26/01
Last Time (10/31/01):
- Studied Cascaded On-Off Model
From 10/24/01:
- Asymptotic Independence, using SiZer analysis
- relationship between Size, Time and "Rate" = Size/Time
- Tail index estimation via slope of log-log CCDF
- Long Range Dependence vs. ARIMA(1,1,1)
Revisit: Asymptotic Independence
Recall from 10/22/01 and 10/24/01.
Main Issue:
Do "large values" (of the two variables) "occur together"?
Motivating data set:
HTTP Response Data, from UNC, 2001
(web browsing data transfer)
Variables of interest:
- Response size (bytes)
- Response time (secs)
- Transmission rate, size / time (bytes / sec)
Revisit: Asymptotic Independence (cont.)
Important point:
for study of joint behavior of two variables, it makes sense to "put on same scale"
So did Hill estimation, and power transformation
(details in 10/22/01)
i.e. study joint distribution of the transformed variables
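As a rough illustration of the "put on same scale" step, here is a minimal Python sketch of a Hill tail index estimate followed by a power transformation. The details from 10/22/01 are not reproduced in these notes, so the function names, the choice of k, the target tail index of 1, and the simulated stand-in data are all assumptions made for illustration only.

```python
import numpy as np

def hill_tail_index(x, k):
    """Hill estimate of the tail index alpha, using the k largest values of x (x > 0)."""
    xs = np.sort(np.asarray(x, dtype=float))
    top = xs[-k:]              # the k largest observations
    threshold = xs[-k - 1]     # the (k+1)-th largest observation
    return 1.0 / np.mean(np.log(top / threshold))

def to_common_tail(x, alpha_hat, target_alpha=1.0):
    """Power transform: if P(X > x) ~ x^(-alpha_hat), the result has tail index ~ target_alpha."""
    return np.asarray(x, dtype=float) ** (alpha_hat / target_alpha)

# Simulated stand-in data (not the HTTP responses): classical Pareto, tail index 1.5
rng = np.random.default_rng(0)
x = rng.pareto(1.5, size=200_000) + 1.0
k = 2_000                      # hypothetical number of upper order statistics
alpha_hat = hill_tail_index(x, k)
x_scaled = to_common_tail(x, alpha_hat)
print(alpha_hat, hill_tail_index(x_scaled, k))
```

Applying the same transformation to each variable, each with its own estimated tail index, would give both roughly the same tail behavior, which is one way to read "same scale" here.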
Manifestation of "asymptotic independence":
"axis hugging" in scatterplot
Scatterplot graphic
- found substantial apparent "axis hugging"
- but also exceptions
- interesting "streaks"
(common rates?)
(strongest Sunday morning: most off peak)
- how to make more precise?
Revisit: Asymptotic Independence (cont.)
More precise version:
- study "angles" from polar coordinates
- "axis hugging" appears as "piling up at ends"
- study statistical significance using SiZer
- need "comparable scale" for "angle" to be sensible
(look carefully at scales in above scatterplot)
- so rescale by marginal medians
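The angle construction above can be sketched as follows, assuming positive-valued variables. The function name is hypothetical, and the SiZer step is not included.

```python
import numpy as np

def median_scaled_polar(x, y):
    """Rescale each variable by its marginal median, then convert to polar coordinates."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    u = x / np.median(x)         # both variables now have median 1
    v = y / np.median(y)
    r = np.hypot(u, v)           # radius: overall "largeness" of the pair
    theta = np.arctan2(v, u)     # angle in [0, pi/2] for positive data
    return r, theta

# Tiny example: y is proportional to x, so every point lies on the diagonal
r, theta = median_scaled_polar([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
print(theta)   # all angles equal pi / 4
```

Without the median rescaling, the same data would give angles near pi / 2 only because y is measured on a larger scale, which is why the rescaling matters.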
Revisit: Asymptotic Independence (cont.)
Comparison 1: Time (duration) vs. Size
- Expect "dependence", since "bigger files need more time"
- Saw some "dependence"
(mostly weekday mornings & afternoons)
- And some "independence"
(mostly weekday evenings, weekends)
- Possible reason for independence:
Packet loss delays, that are indep. of size
- Problem with this idea:
Expect most independence during high traffic weekdays
- Explanation???
Revisit: Asymptotic Independence (cont.)
Comparison 2: Rate vs. Time (duration)
- Generally independent (more than above)
- Consistent with above
(since long times are expected when rates are slower)
- But again more so for weekday evenings and weekends
- Small "dependent bumps" on weekdays
- Driver of this phenomenon???
Revisit: Asymptotic Independence (cont.)
Comparison 3: Rate vs. Size
- Strong dependence
- biggest files get through fastest?
- Inconsistent with packet loss explanation
(since larger files should have more loss problems)
- strongest for weekdays (often unimodal)
- weaker for evenings / weekends (at least bimodal)
- Driver of this phenomenon???
Revisit: Asymptotic Independence (cont.)
Overall Conclusions:
Packet loss explanation is wrong
Need a new explanation
Could do: explore variable 1 / time
- since "nonlinear part of rate"
- dubious, since only saved responses > 100 kbytes
- so careful treatment needs new data
Large Variable Association
Idea:
variation of "Asymptotic Dependence" that is well defined for finite samples,
not just in the limit
Goals:
1. Indicate whether large values are "more (less) associated than usual".
2. Reduce to classical Asymptotic Independence in the limit
(in interesting cases)
Large Variable Association (cont.)
Approach:
a. For "properly adjusted marginals"
(adjust extreme value distributions as above?)
(surely "make scales comparable", e.g. divide by median)
b. Consider "polar coordinates" versions (radius and angle) of the adjusted variables.
c. Transform data to "angularly equally spaced",
by replacing the sorted angles with an equally spaced grid
(essentially Probability Integral Transform)
d. Study association of "large values", by density of the angles
where the corresponding radius is large (> threshold)
e. Have "large variable association" when "pile up in middle"
f. Have "large variable disassociation" when "pile up at ends"
g. Use SiZer to study statistical significance
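Steps c through f can be sketched in Python as below. This is only a sketch under stated assumptions: the inputs are radii and angles already computed from median-adjusted positive variables, ties among angles are ignored, the threshold is taken as a quantile of the radius, and a plain histogram stands in for the SiZer analysis of step g. Both function names are hypothetical.

```python
import numpy as np

def equally_spaced_angles(theta):
    """Step c: replace the sorted angles by an equally spaced grid on (0, pi/2)."""
    theta = np.asarray(theta, dtype=float)
    n = theta.size
    ranks = np.argsort(np.argsort(theta))       # rank 0 = smallest angle
    return (ranks + 0.5) / n * (np.pi / 2)      # midpoints of n equal cells

def large_radius_angle_counts(r, theta_eq, quantile=0.99, bins=10):
    """Step d: histogram of the equally spaced angles whose radius exceeds a threshold."""
    r = np.asarray(r, dtype=float)
    theta_eq = np.asarray(theta_eq, dtype=float)
    keep = r > np.quantile(r, quantile)          # "large values" only
    counts, edges = np.histogram(theta_eq[keep], bins=bins, range=(0.0, np.pi / 2))
    return counts, edges

# Tiny example: the angle ranks are 2, 0, 1, so the grid points are reordered to match
theta_eq = equally_spaced_angles([0.3, 0.1, 0.2])
print(theta_eq)
```

In the resulting counts, relatively full middle bins would correspond to step e ("pile up in middle") and relatively full end bins to step f ("pile up at ends"); judging whether such features are statistically significant is what SiZer is used for in the notes.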
Large Variable Association (cont.)
Toy Example 1:
5,000,000 absolute values of Standard Normal
Scatterplot, thresholded to 10,000
- Expect association same for large values
- Full data scatterplot similar, but with "filled in center"
- no "axis hugging"
- requires normalization, and much bigger n???
Large Variable Association movie (through thresholds)
- As expected looks very uniform
- Expected effects for decreasing sample size
- Some boundary problems in SiZer analysis
(used crude "mirror image" adjustment)
(inadequate for n = 5,000,000)
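A small simulation in the spirit of Toy Example 1 is sketched below. For independent standard normals, the angle of the pair of absolute values is uniform on (0, pi/2) and independent of the radius, so the angle counts should look roughly flat at every threshold. The sample size (smaller than the 5,000,000 above, for speed), the seed, the quantile thresholds, and the use of histogram counts in place of SiZer are all assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000                                # assumption: reduced sample size
x = np.abs(rng.standard_normal(n))
y = np.abs(rng.standard_normal(n))

# Make scales comparable (divide by median), then go to polar coordinates
u = x / np.median(x)
v = y / np.median(y)
r = np.hypot(u, v)
theta = np.arctan2(v, u)

def angle_counts_above(q, bins=10):
    """Histogram counts of angles for points whose radius exceeds the q-quantile."""
    keep = r > np.quantile(r, q)
    counts, _ = np.histogram(theta[keep], bins=bins, range=(0.0, np.pi / 2))
    return counts

# Crude "movie through thresholds": one row of counts per threshold
for q in (0.0, 0.5, 0.9, 0.99):
    print(q, angle_counts_above(q))
```

As the threshold rises, fewer points remain, so the rows become noisier; this is the "expected effect for decreasing sample size".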
Large Variable Association (cont.)
Toy Example 2:
5,000,000 independent Pareto (1.5)
Raw data scatterplot, thresholded to 10,000
- Strong "axis hugging" as expected
- What is "distribution of angles" (full data set)?
SiZer analysis of full set of angles
- Clear "U shape" (very significant)
- Thus transformation to "angularly equally spaced"
is critical to concluding Large Variable Association
Check angular equal spacing, with SiZer analysis
- looks very uniform
- still have boundary effect problems
What is effect on corresponding scatterplot?
- surprisingly (?) negligible
- still have very strong axis hugging
- so thresholding should reveal strong Large Var. Disassoc.
Large Variable Association movie (through thresholds)
- As expected shows strong Large Variable Disassociation
- Strengthens for more thresholding
- Expected effects for decreasing sample size
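The Toy Example 2 pipeline can be imitated with the sketch below. The notes do not say which parameterization of Pareto (1.5) was used; this sketch assumes NumPy's `Generator.pareto`, which draws from the Pareto II (Lomax) form with support starting at 0. The reduced sample size, the seed, the 0.99 radius quantile, and the use of histogram counts instead of SiZer are also assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000                                # assumption: reduced sample size
x = rng.pareto(1.5, size=n)                # Pareto II (Lomax), shape 1.5
y = rng.pareto(1.5, size=n)

# Median rescaling and polar coordinates
u = x / np.median(x)
v = y / np.median(y)
r = np.hypot(u, v)
theta = np.arctan2(v, u)

bins = 10
span = (0.0, np.pi / 2)

# Full set of raw angles: look for the "U shape"
raw_counts, _ = np.histogram(theta, bins=bins, range=span)

# Transform to "angularly equally spaced" via ranks (uniform by construction)
ranks = np.argsort(np.argsort(theta))
theta_eq = (ranks + 0.5) / n * (np.pi / 2)
eq_counts, _ = np.histogram(theta_eq, bins=bins, range=span)

# Threshold on radius, then look at where the equally spaced angles fall
keep = r > np.quantile(r, 0.99)
big_counts, _ = np.histogram(theta_eq[keep], bins=bins, range=span)

print("raw angles:      ", raw_counts)
print("equally spaced:  ", eq_counts)
print("large radius only:", big_counts)
```

Comparing the first and last rows separates the two effects: a U shape in the raw angles reflects the marginal distributions as well, while end-heavy counts after equal spacing and thresholding are what the notes call Large Variable Disassociation.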