Statistical Analysis and Modelling of Internet Traffic Data
Course Meetings:
Time: Mon. - Wed. 8:40 - 9:55
Room: Rhodes 471
Course Web Site:
http://www.orie.cornell.edu/~marron/OR778NetworkData/OR778home.html
maybe easier to follow link from:
http://www.orie.cornell.edu/~marron/
Instructor: J. S. (Steve) Marron
Office: Rhodes 234
Office Hours:
Mon. 10 - 11, Tuesday 11 - 12
Phone: (607) 255-9147
Email: marron@stat.unc.edu
Course Email List:
Please add yourself by sending an email with "subscribe" as the subject to: or778-fa01-l-request@orie.cornell.edu
(useful for announcements, such as "notes now posted")
Course Work / Grading
Based on a presentation
Presentations:
- can be either a paper by others (you choose, or I suggest)
- or your own work
- let's discuss soon
Last Time:
- Detailed Q-Q analysis of tail of Response Size Distributions
- Pareto(1.2) gave acceptable (?) fit
- So did Pareto(1.5) ??
- Moving window analysis showed non-stationarity
- log normal also gave decent fit ???
- how should we think about "heavy tails"????
- in context of:
Q-Q analysis revisited, I
Where are the quantiles on the Q-Q curve?
(Note: resimul'n of envelope gives "visual impression of variability")
This can also be understood by relating to the "smooth histogram":
Aside: Q-Q plot suggests HTTP responses of size 1????
There are 4 in the file,
clearly an error in data collection...
Q-Q analysis revisited, II
Restriction to "1st 50,000" seems small for studying tail behavior,
Repeat envelope analysis with the full (n = 734,814) data set?
Pareto quantile match 0.99 & 0.999
- Same good (?) fit as before
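The "quantile match" fit can be sketched in a few lines. This is only an illustration under an assumption the slides do not spell out: it takes the classical Pareto form F(x) = 1 - (x / scale)^(-shape) for x >= scale. The function names are hypothetical, not from the course software.

```python
import math


def empirical_quantile(sorted_data, p):
    """Return the order statistic at (approximately) probability p."""
    n = len(sorted_data)
    index = min(n - 1, max(0, math.ceil(p * n) - 1))
    return sorted_data[index]


def pareto_quantile_match(data, p1=0.99, p2=0.999):
    """Choose Pareto shape and scale so the p1 and p2 quantiles match the data.

    Assumption: F(x) = 1 - (x / scale) ** (-shape) for x >= scale, so that
    log q(p) = log(scale) - log(1 - p) / shape.
    """
    xs = sorted(data)
    q1 = empirical_quantile(xs, p1)
    q2 = empirical_quantile(xs, p2)
    # Difference of the two log-quantile equations isolates the shape.
    shape = (math.log(1 - p1) - math.log(1 - p2)) / (math.log(q2) - math.log(q1))
    # Back-substitute into the p1 equation to get the scale.
    scale = q1 * (1 - p1) ** (1 / shape)
    return shape, scale
```

Calling `pareto_quantile_match(sizes)` on a list of positive response sizes returns the matched (shape, scale) pair; changing `p1` and `p2` gives the other matches discussed in these notes.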
Log Normal Analysis quantile match 0.99 & 0.999
- Looks unacceptably "curved"?
Log Normal Analysis quantile match 0.9 & 0.999
- Better, but still "too curved"?
Log Normal Analysis Max. Lik. Est.
- Good in "body of dist'n", but too poor in tail?
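The two kinds of log normal fit compared above can be sketched as follows. This is a minimal illustration, assuming log(X) is Normal(mu, sigma); the function names are hypothetical.

```python
import math
from statistics import NormalDist, fmean, pstdev


def lognormal_quantile_match(data, p1=0.9, p2=0.999):
    """Choose mu, sigma of log(X) so the p1 and p2 quantiles match the data."""
    xs = sorted(data)
    n = len(xs)
    q1 = xs[min(n - 1, math.ceil(p1 * n) - 1)]
    q2 = xs[min(n - 1, math.ceil(p2 * n) - 1)]
    z1 = NormalDist().inv_cdf(p1)
    z2 = NormalDist().inv_cdf(p2)
    # log q(p) = mu + sigma * z(p): two equations, two unknowns.
    sigma = (math.log(q2) - math.log(q1)) / (z2 - z1)
    mu = math.log(q1) - sigma * z1
    return mu, sigma


def lognormal_mle(data):
    """Maximum likelihood: mean and (population) std. dev. of log(X)."""
    logs = [math.log(x) for x in data]
    return fmean(logs), pstdev(logs)
```

The quantile match uses only two upper order statistics, while the maximum likelihood estimate is driven by the bulk of the data, which is one way to see why the latter can be good in the body and poor in the tail.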
Q-Q analysis revisited, III
Can we get a "decently good fit" from any parametric family?
Weibull Analysis quantile match 0.99 & 0.999
- visually very far away
- large sample size makes more clear
Q-Q analysis revisited, IV
Comparison across plots is slippery with differing edges, so choose range:
Pareto, finite variance boundary
- much easier comparison
- Q-Q curve "shifts to the right"
- envelope covers same range (same theoretical quantiles)
- more variability for heavier tails???
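One way such a simulation envelope could be computed is sketched below: draw repeated samples of the same size from the fitted distribution, sort each, and keep the pointwise range of the sorted values. This again assumes the classical Pareto form; the function names, the default number of simulations and the plotting positions (i + 0.5) / n are illustrative choices, not taken from the course software.

```python
import random


def pareto_sample(n, shape, scale=1.0, rng=random):
    """Draw n Pareto variates by inverse transform (classical form assumed)."""
    return [scale * (1.0 - rng.random()) ** (-1.0 / shape) for _ in range(n)]


def qq_envelope(n, shape, scale=1.0, n_sim=100, seed=0):
    """Simulate n_sim sorted samples and return pointwise (low, high) bounds."""
    rng = random.Random(seed)
    low = [float("inf")] * n
    high = [0.0] * n
    for _ in range(n_sim):
        sim = sorted(pareto_sample(n, shape, scale, rng))
        for i, value in enumerate(sim):
            low[i] = min(low[i], value)
            high[i] = max(high[i], value)
    return low, high


def theoretical_quantiles(n, shape, scale=1.0):
    """Pareto quantiles at the plotting positions (i + 0.5) / n."""
    return [scale * (1.0 - (i + 0.5) / n) ** (-1.0 / shape) for i in range(n)]
```

Plotting `low` and `high` against `theoretical_quantiles(n, shape)` gives the envelope; running it for two shape values (say 1.2 and 2.0) over the same theoretical quantile range is one way to look at the variability question raised above.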
Q-Q analysis revisited, V
Review "moving window of 50,000", showing quantiles
Movie with "nearly light tail" Pareto
- important nonstationarity is between 0.99 and 0.999 quantiles
- cannot completely exclude light tails
- nonstationarity could be "long range dep." or "diurnal effect"
- how to study "dependence"?
- expect better data soon
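The moving window view can be sketched as below. The window length of 50,000 follows the notes; the step size and the function name are hypothetical choices for illustration.

```python
import math


def moving_window_quantiles(data, window=50_000, step=10_000, probs=(0.99, 0.999)):
    """Empirical quantiles within each window of consecutive observations.

    Returns a list of (start_index, [quantile for each p in probs]).
    """
    results = []
    for start in range(0, len(data) - window + 1, step):
        chunk = sorted(data[start:start + window])
        qs = [chunk[min(window - 1, math.ceil(p * window) - 1)] for p in probs]
        results.append((start, qs))
    return results
```

Plotting the 0.99 and 0.999 quantiles against `start_index` shows how much they drift across the trace, which is the nonstationarity referred to above.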
Q-Q analysis revisited, VI
How do parameter est's change as the matched quantiles change?
Q matched Q-Q, q1 = 0.5, movie over q2
Summary plot of parameter estimates
- est'd shape parameters ~ 1.2 - 1.3
Q matched Q-Q, q1 = 0.9, movie over q2
Summary plot of parameter estimates
- est'd shape parameters ~ 1.2 - 1.8
- "spike" where q1 ~ q2
Q matched Q-Q, q1 = 0.99, movie over q2
Summary plot of parameter estimates
- est'd shape parameters ~ 1.2 - 1.8
- "spike" where q1 ~ q2
Q matched Q-Q, q1 = 0.999, movie over q2
Summary plot of parameter estimates
- est'd shape parameters ~ 1.0 - 1.4
- (downwards) "spike" where q1 ~ q2
Q matched Q-Q, q1 = 0.9999, movie over q2
Summary plot of parameter estimates
- est'd shape parameters ~ 1.1 - 1.3
- "spike" where q1 ~ q2
Q-Q analysis revisited, VI (cont.)
Could do:
summarize over q1, q2 "triangle"
Suspected Conclusion:
est'd shape parameters ~ 1.0 - 1.8
Seems like strong case for heavy tails
Could do: formal hypothesis test, to reject
H0: shape parameter = 2
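The "triangle" summary over (q1, q2) suggested above could be tabulated roughly as follows. This is a sketch assuming the classical Pareto form F(x) = 1 - (x / scale)^(-shape) and positive data; the function names and the default probability grid are hypothetical.

```python
import math
from itertools import combinations


def pareto_shape_from_quantiles(sorted_data, p1, p2):
    """Pareto shape whose p1 and p2 quantiles match the sorted data."""
    n = len(sorted_data)
    q1 = sorted_data[min(n - 1, math.ceil(p1 * n) - 1)]
    q2 = sorted_data[min(n - 1, math.ceil(p2 * n) - 1)]
    if q1 == q2:
        # Identical matched quantiles give no information about the shape.
        return float("nan")
    return (math.log(1 - p1) - math.log(1 - p2)) / (math.log(q2) - math.log(q1))


def shape_triangle(data, probs=(0.5, 0.9, 0.99, 0.999, 0.9999)):
    """Shape estimate for every pair p1 < p2 taken from probs."""
    xs = sorted(data)
    return {
        (p1, p2): pareto_shape_from_quantiles(xs, p1, p2)
        for p1, p2 in combinations(probs, 2)
    }
```

The denominator log(q2) - log(q1) shrinks as the two probabilities approach each other, so estimates there become unstable; that is consistent with the "spikes" seen where q1 ~ q2.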
Q-Q analysis revisited, VII
What about other data views?
Overall Review of "Graphical Goodness of Fit"
Reference:
Fisher, N. I. (1983) Graphical Methods in Nonparametric Statistics: A Review and Annotated Bibliography, International Statistical Review, 51, 25-58.
Review of "Graphical Goodness of Fit"
Basis: "Cumulative Distribution Function" (CDF): $F(x) = P\{X \le x\}$
Probability quantile notation:
$p = F(q)$ for "probability" $p$ and "quantile" $q$
Thus $q = F^{-1}(p)$ is called the "quantile function"
Review of "Graphical Goodness of Fit" (cont.)
Two types of CDF:
1. Theoretical
2. Empirical, based on data
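For the empirical version, both the CDF and its inverse come straight from the sorted data. A minimal sketch follows (hypothetical function names; the "smallest observation with empirical CDF >= p" convention for the quantile is one common choice among several).

```python
import bisect
import math


def empirical_cdf(sorted_data, x):
    """Fraction of observations <= x."""
    return bisect.bisect_right(sorted_data, x) / len(sorted_data)


def empirical_quantile(sorted_data, p):
    """Smallest observation q with empirical_cdf(q) >= p, for 0 < p <= 1."""
    index = max(0, math.ceil(p * len(sorted_data)) - 1)
    return sorted_data[index]
```

Both functions expect the data to be sorted once up front, e.g. `xs = sorted(sizes)`.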
Review of "Graphical Goodness of Fit" (cont.)
Direct Visualizations:
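1. CDF - plot $p$ vs. grid of $q$ values
2. Quantile - plot $q$ (= sorted data) vs. grid of $p$ values
Comparison Visualizations: (compare empirical with a theoretical)
3. P-P plot - plot $\hat{p}$ vs. $p$ for a grid of $q$ values
4. Q-Q plot - plot $\hat{q}$ vs. $q$ for a grid of $p$ values
(here $p = F(q)$ is theoretical, and $\hat{p}$, $\hat{q}$ are the empirical versions)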
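The two comparison plots differ only in which grid is used, as the following sketch of their coordinates shows. The plotting positions (i + 0.5) / n and the function names are assumptions made for illustration; the theoretical CDF and quantile function are passed in as callables.

```python
import math


def pp_points(data, theoretical_cdf):
    """(theoretical p, empirical p) pairs on the grid of observed values."""
    xs = sorted(data)
    n = len(xs)
    return [(theoretical_cdf(q), (i + 0.5) / n) for i, q in enumerate(xs)]


def qq_points(data, theoretical_quantile):
    """(theoretical q, empirical q) pairs on the grid p_i = (i + 0.5) / n."""
    xs = sorted(data)
    n = len(xs)
    return [(theoretical_quantile((i + 0.5) / n), q_hat) for i, q_hat in enumerate(xs)]


# Example: compare three observations with the Exponential(1) distribution.
example = [0.1, 0.5, 2.0]
pp = pp_points(example, lambda x: 1.0 - math.exp(-x))
qq = qq_points(example, lambda p: -math.log(1.0 - p))
```

A good fit shows up in either plot as points lying close to the 45 degree line.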
Review of "Graphical Goodness of Fit" (cont.)
A Connection: For the Uniform(0,1) distribution, $p = F(q) = q$,
so:
- CDF is P-P plot against the Uniform(0,1)
- Quantile is a Q-Q plot against the Uniform(0,1)
(these things aren't all that different, just rescalings)
Review of "Graphical Goodness of Fit" (cont.)
Some distributions have special relations to appropriate scalings,
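Can lead to "visual parameter estimation":
E.g. 1: Gaussian, $p = F(q) = \Phi\left(\frac{q - \mu}{\sigma}\right)$,
solving for $q$ gives: $q = \mu + \sigma \Phi^{-1}(p)$,
where $\Phi^{-1}$ is the Standard Normal Quantile.
So Q-Q plot against Standard Normal is linear (any Gaussian),
and $\mu$ is the intercept, and $\sigma$ is the slope.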
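The "visual parameter estimation" idea for the Gaussian can be mimicked numerically by fitting a least-squares line through the Q-Q plot against the Standard Normal. This is only a sketch: the function name and the plotting positions (i + 0.5) / n are assumptions, and a line fitted by eye would of course differ slightly.

```python
from statistics import NormalDist, fmean


def normal_qq_line(data):
    """Least-squares line through the Q-Q plot against the Standard Normal.

    Returns (intercept, slope), which estimate (mean, std. dev.) for
    Gaussian data.
    """
    ys = sorted(data)
    n = len(ys)
    zs = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    z_bar, y_bar = fmean(zs), fmean(ys)
    slope = sum((z - z_bar) * (y - y_bar) for z, y in zip(zs, ys)) / sum(
        (z - z_bar) ** 2 for z in zs
    )
    return y_bar - slope * z_bar, slope
```

For data that are not Gaussian the Q-Q plot is curved, and the fitted line is then only a rough summary.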
Review of "Graphical Goodness of Fit" (cont.)
E.g. 2: Pareto, shape parameter $\alpha$, scale parameter $\sigma$
So get linear function (with slope $-\alpha$), for:
log(1-CDF) vs. log(quantiles)
(essentially CDF on log-log scales)
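The linearity can be checked with a short derivation. The CDF below is the classical Pareto form, which is an assumption here (other parametrizations exist, for which the relation is linear only in the tail); $\alpha$ denotes the shape and $\sigma$ the scale parameter.

```latex
\begin{align*}
p = F(q) &= 1 - \left(\frac{q}{\sigma}\right)^{-\alpha}, \qquad q \ge \sigma, \\
1 - p &= \left(\frac{q}{\sigma}\right)^{-\alpha}, \\
\log(1 - p) &= -\alpha \log q + \alpha \log \sigma .
\end{align*}
```

So under this form a plot of $\log(1 - p)$ against $\log q$ is a straight line with slope $-\alpha$ and intercept $\alpha \log \sigma$.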
Review of "Graphical Goodness of Fit" (cont.)
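E.g. 3: Weibull, shape parameter $\alpha$, scale parameter $\sigma$: $p = F(q) = 1 - \exp\left(-(q/\sigma)^{\alpha}\right)$
solve to get quantile function: $q = \sigma \left(-\log(1-p)\right)^{1/\alpha}$
but $-\log(1-p)$ is the Quantile func'n of the Exponential(1)
so have linear function, $\log q = \log \sigma + \frac{1}{\alpha} \log\left(-\log(1-p)\right)$,
for log-log Q-Q against the Exponential.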
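The Weibull relation can be tried out numerically with a least-squares line through the log-log Q-Q plot against the Exponential(1). This is a sketch assuming the usual parametrization F(q) = 1 - exp(-(q / scale)^shape); the function name and the plotting positions (i + 0.5) / n are hypothetical choices.

```python
import math
from statistics import fmean


def weibull_loglog_qq_line(data):
    """Line for log(sorted data) vs. log(Exponential(1) quantiles).

    For Weibull data the slope estimates 1 / shape and the intercept
    estimates log(scale). Data must be strictly positive.
    """
    ys = [math.log(x) for x in sorted(data)]
    n = len(ys)
    # Exponential(1) quantile at p is -log(1 - p); take its log as well.
    xs = [math.log(-math.log(1.0 - (i + 0.5) / n)) for i in range(n)]
    x_bar, y_bar = fmean(xs), fmean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return intercept, slope
```

The shape and scale are then recovered as `1 / slope` and `math.exp(intercept)`.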
Review of "Graphical Goodness of Fit" (cont.)
Some Toy Examples
Alternate Views of Response Size Data
Downey (2000) - cdf based analyses (not Q-Q plots)
- Clearly wrong scale
- much better
- good connection to smooth histogram
Alternate Views (cont.)
Focus on Pareto view:
log(1-F) calculation
- just an upside-down flip
- Pareto is a line, with -(shape parameter) as slope
Alternate Views (cont.)
Personal Conclusions:
- Prefer Q-Q analysis, since can assess variability
(Completely invalid, because of LRD / diurnal effects???)
(How to assess variability in cdf? bootstrap?)
- Pareto is "reasonable in large regions"
- Lognormal is close, but inadequate
- Weibull is way off
Interesting Open Problem (revisited)
1. Find a "good", precise mathematical definition of:
"heavy tailed" distributions
Some ideas:
- not moment based
- should depend on "range of interest"
- empirical version depends on sample size
- not a number, but a "curve"?
- what will it be used for??