Lecture9-10-01

Course OR 778

Statistical Analysis and Modelling
of Internet Traffic Data

Course Meetings:

Time: Mon. - Wed. 8:40 - 9:55
Room: Rhodes 471

Course Web Site:

http://www.orie.cornell.edu/~marron/OR778NetworkData/OR778home.html

maybe easier to follow link from:

http://www.orie.cornell.edu/~marron/

Instructor: J. S. (Steve) Marron

Office: Rhodes 234
Office Hours: Mon. 10 - 11, Tuesday 11 - 12

Phone: (607) 255-9147
Email: marron@stat.unc.edu

Course Email List: please add yourself,
by sending an email with "subscribe" as the subject,
to: or778-fa01-l-request@orie.cornell.edu

(useful for announcements, such as "notes now posted")

Course Work / Grading

Based on a presentation

Presentations:

- can be either a paper by others (you choose, or I suggest)

- or your own work

- let's discuss soon

Last Time:

- Detailed Q-Q analysis of tail of Response Size Distributions

- Pareto(1.2) gave acceptable (?) fit

- So did Pareto(1.5) ??

- Moving window analysis showed non-stationarity

- log normal also gave decent fit ???

- how should we think about "heavy tails"????

- in context of:

Heavy tailed durations

Long Range Dependence

Q-Q analysis revisited, I

Where are the quantiles on the Q-Q curve?

Movie highlighting quantiles

(Note: re-simulating the envelope gives a "visual impression of variability")

This can also be understood by relating to the "smooth histogram":

SiZer analysis movie

Aside: Q-Q plot suggests HTTP responses of size 1????

There are 4 in the file, clearly an error in data collection...

Q-Q analysis revisited, II

Restriction to "1st 50,000" seems small for studying tail behavior,

Repeat envelope analysis with the full (n = 734,814) data set?

Pareto quantile match 0.99 & 0.999
- Same good (?) fit as before

Log Normal Analysis quantile match 0.99 & 0.999
- Looks unacceptably "curved"?

Log Normal Analysis quantile match 0.9 & 0.999
- Better, but still "too curved"?

Log Normal Analysis Max. Lik. Est.
- Good in "body of dist'n", but too poor in tail?
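A minimal sketch of two-quantile matching for the Pareto, assuming the classical form 1 - F(q) = (q / sigma)^(-alpha), which may differ from the parametrization behind the plots. The function names and the simulated data are hypothetical stand-ins, not the response size file:

```python
import numpy as np

def pareto_quantile(p, alpha, sigma):
    """Quantile function, assuming 1 - F(q) = (q / sigma) ** (-alpha) for q >= sigma."""
    return sigma * (1.0 - np.asarray(p, dtype=float)) ** (-1.0 / alpha)

def pareto_quantile_match(data, p1, p2):
    """Pick (alpha, sigma) so the Pareto p1- and p2-quantiles equal the empirical ones."""
    q1, q2 = np.quantile(data, [p1, p2])
    # log(1 - p) = -alpha * (log q - log sigma): two equations, two unknowns
    alpha = (np.log(1.0 - p1) - np.log(1.0 - p2)) / (np.log(q2) - np.log(q1))
    sigma = q1 * (1.0 - p1) ** (1.0 / alpha)
    return alpha, sigma

# Simulated stand-in data (NOT the HTTP response sizes): Pareto with shape 1.2
rng = np.random.default_rng(0)
data = pareto_quantile(rng.random(200_000), alpha=1.2, sigma=1000.0)
alpha_hat, sigma_hat = pareto_quantile_match(data, 0.99, 0.999)
print(alpha_hat, sigma_hat)
```

By construction the fitted Pareto passes exactly through the two matched empirical quantiles; everything else on the Q-Q plot is then a check of the fit.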

Q-Q analysis revisited, III

Can we get a "decently good fit" from any parametric family?

Weibull Analysis quantile match 0.99 & 0.999

- visually very far away

- large sample size makes this clearer

Q-Q analysis revisited, IV

Comparison across plots is slippery when the axis ranges differ,
so choose a common range:

Pareto quantile match

Pareto, twiddled parameters

Pareto, finite variance boundary

- much easier comparison

- Q-Q curve "shifts to the right"

- envelope covers same range (same theoretical quantiles)

- more variability for heavier tails???
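One way to sketch the simulation envelope: draw many samples from the fitted distribution, sort each, and overlay them against the same theoretical quantiles. This is only an illustration, assuming the classical Pareto form 1 - F(q) = (q / sigma)^(-alpha); `qq_envelope`, the sample size, and the number of simulations are hypothetical choices:

```python
import numpy as np

def pareto_quantile(p, alpha, sigma):
    """Quantile function, assuming 1 - F(q) = (q / sigma) ** (-alpha) for q >= sigma."""
    return sigma * (1.0 - np.asarray(p, dtype=float)) ** (-1.0 / alpha)

def qq_envelope(n, alpha, sigma, n_sim=100, seed=0):
    """Return theoretical quantiles and n_sim sorted simulated samples of size n."""
    rng = np.random.default_rng(seed)
    probs = (np.arange(1, n + 1) - 0.5) / n
    theo = pareto_quantile(probs, alpha, sigma)
    # Inverse-CDF sampling; each sorted row is one simulated Q-Q curve vs. theo
    sims = np.sort(pareto_quantile(rng.random((n_sim, n)), alpha, sigma), axis=1)
    return theo, sims

def log_spread_of_max(alpha):
    """Std. dev. of log10(largest simulated value): envelope width at the far end."""
    _, sims = qq_envelope(n=5000, alpha=alpha, sigma=1.0)
    return np.std(np.log10(sims[:, -1]))

spread_heavy = log_spread_of_max(1.2)   # heavier tail
spread_light = log_spread_of_max(2.0)   # finite variance boundary
print(spread_heavy, spread_light)
```

In this sketch the envelope is wider (on the log scale) for the smaller shape parameter, which is in line with the "more variability for heavier tails" impression, at least under this parametrization.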

Q-Q analysis revisited, V

Review "moving window of 50,000", showing quantiles

Movie with fit Pareto

Movie with "nearly light tail" Pareto

- important nonstationarity is between 0.99 and 0.999 quantiles

(50 - 500 largest data points)

- cannot completely exclude light tails

- nonstationarity could be "long range dep." or "diurnal effect"

- how to study "dependence"?

- expect better data soon
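A rough sketch of the moving window computation, tracking the 0.99 and 0.999 quantiles in each window of 50,000. The step size and the simulated i.i.d. data are assumptions for illustration only (real traces would show the nonstationarity; this toy series will not):

```python
import numpy as np

def moving_window_quantiles(x, window, step, probs):
    """Empirical quantiles at probs within each window x[s : s + window]."""
    x = np.asarray(x, dtype=float)
    starts = np.arange(0, len(x) - window + 1, step)
    quants = np.array([np.quantile(x[s:s + window], probs) for s in starts])
    return starts, quants

# Simulated stand-in series (NOT the response size data)
rng = np.random.default_rng(0)
x = 1000.0 * (1.0 - rng.random(200_000)) ** (-1.0 / 1.2)

# In a window of 50,000 points, the 0.99 quantile sits at about the
# 500th largest value and the 0.999 quantile at about the 50th largest.
starts, quants = moving_window_quantiles(x, window=50_000, step=10_000,
                                         probs=[0.99, 0.999])
print(quants.shape)
```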

Q-Q analysis revisited, VI

How do parameter est's change as the matched quantiles change?

Q matched Q-Q, q1 = 0.5, movie over q2

Summary plot of parameter estimates

- est'd shape parameters ~ 1.2 - 1.3

Q matched Q-Q, q1 = 0.9, movie over q2

Summary plot of parameter estimates

- est'd shape parameters ~ 1.2 - 1.8

- "spike" where q1 ~ q2

Q matched Q-Q, q1 = 0.99, movie over q2

Summary plot of parameter estimates

- est'd shape parameters ~ 1.2 - 1.8

- "spike" where q1 ~ q2

Q matched Q-Q, q1 = 0.999, movie over q2

Summary plot of parameter estimates

- est'd shape parameters ~ 1.0 - 1.4

- (downwards) "spike" where q1 ~ q2

Q matched Q-Q, q1 = 0.9999, movie over q2

Summary plot of parameter estimates

- est'd shape parameters ~ 1.1 - 1.3

- "spike" where q1 ~ q2

Q-Q analysis revisited, VI (cont.)

Could do: summarize over q1, q2 "triangle"

Suspected Conclusion: est'd shape parameters ~ 1.0 - 1.8

Seems like strong case for heavy tails
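The q1, q2 "triangle" summary could be sketched as below, again assuming the classical Pareto form 1 - F(q) = (q / sigma)^(-alpha). The grid of probabilities, the helper name, and the simulated data are hypothetical:

```python
import numpy as np

def shape_from_quantiles(data, p1, p2):
    """Shape estimate from matching the p1- and p2-quantiles (p1 < p2)."""
    q1, q2 = np.quantile(data, [p1, p2])
    return (np.log(1.0 - p1) - np.log(1.0 - p2)) / (np.log(q2) - np.log(q1))

# Simulated stand-in data (NOT the response sizes): Pareto with shape 1.2
rng = np.random.default_rng(0)
data = 1000.0 * (1.0 - rng.random(200_000)) ** (-1.0 / 1.2)

p_grid = [0.5, 0.9, 0.99, 0.999]
# Upper triangle: all pairs with p1 < p2
triangle = {(p1, p2): shape_from_quantiles(data, p1, p2)
            for i, p1 in enumerate(p_grid) for p2 in p_grid[i + 1:]}
for pair, alpha_hat in triangle.items():
    print(pair, round(float(alpha_hat), 3))
```

The estimate is a ratio of two differences that both shrink to zero as q2 approaches q1, with the denominator noisy, which is one plausible reading of the "spike" where q1 ~ q2.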

Could do: formal hypothesis test, to reject

H0: shape parameter = 2

Q-Q analysis revisited, VII

What about other data views?

Overall Review of "Graphical Goodness of Fit"

Reference:

Fisher, N. I. (1983) Graphical Methods in Nonparametric Statistics: A Review and Annotated Bibliography, International Statistical Review, 51, 25-58.

Review of "Graphical Goodness of Fit"

Basis: "Cumulative Distribution Function" (CDF), $F(x) = P\{X \le x\}$

Probability quantile notation:

$p = F(q)$, for "probability" $p$ and "quantile" $q$

Thus $q = F^{-1}(p)$ is called the "quantile function"

Review of "Graphical Goodness of Fit" (cont.)

Two types of CDF:

1. Theoretical: $F(x) = P\{X \le x\}$

2. Empirical, based on data $X_1, \dots, X_n$: $\hat{F}(x) = \#\{i : X_i \le x\} / n$
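As a small illustration, the empirical CDF and its (generalized) inverse can be computed directly from the sorted data. The function names here are hypothetical:

```python
import numpy as np

def ecdf(data, x):
    """Empirical CDF: fraction of data points <= x."""
    s = np.sort(np.asarray(data, dtype=float))
    return np.searchsorted(s, x, side="right") / len(s)

def empirical_quantile(data, p):
    """Smallest data value q with ecdf(data, q) >= p, for 0 < p <= 1."""
    s = np.sort(np.asarray(data, dtype=float))
    n = len(s)
    idx = np.ceil(np.asarray(p, dtype=float) * n).astype(int) - 1
    return s[np.clip(idx, 0, n - 1)]

toy = [3, 1, 4, 1, 5]
print(ecdf(toy, 1), ecdf(toy, 4.5), empirical_quantile(toy, 0.5))
```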

Review of "Graphical Goodness of Fit" (cont.)

Direct Visualizations:

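1. CDF - plot $\hat{F}(x)$ vs. grid of $x$ values

2. Quantile - plot $\hat{F}^{-1}(p)$ (= sorted data) vs. grid of $p$ values

Comparison Visualizations: (compare empirical with a theoretical)

3. P-P plot - plot empirical $\hat{F}(x)$ vs. theoretical $F(x)$ for a grid of $x$ values

4. Q-Q plot - plot empirical $\hat{F}^{-1}(p)$ vs. theoretical $F^{-1}(p)$ for a grid of $p$ values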
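A sketch of how the plotting coordinates for the two comparison views might be computed. The plotting positions (i - 0.5) / n, the helper names, and the Exponential(1) example are assumptions chosen for illustration:

```python
import numpy as np

def pp_points(data, cdf):
    """P-P coordinates: theoretical vs. empirical probability at each data value."""
    s = np.sort(np.asarray(data, dtype=float))
    n = len(s)
    emp_p = (np.arange(1, n + 1) - 0.5) / n
    return cdf(s), emp_p

def qq_points(data, quantile):
    """Q-Q coordinates: theoretical quantiles vs. sorted data."""
    s = np.sort(np.asarray(data, dtype=float))
    n = len(s)
    p = (np.arange(1, n + 1) - 0.5) / n
    return quantile(p), s

# Toy check against the Exponential(1), using simulated Exponential(1) data
exp_cdf = lambda x: 1.0 - np.exp(-x)
exp_quantile = lambda p: -np.log(1.0 - p)
data = np.random.default_rng(0).exponential(size=2000)

theo_p, emp_p = pp_points(data, exp_cdf)
theo_q, emp_q = qq_points(data, exp_quantile)
print(np.max(np.abs(theo_p - emp_p)))
```

When the theoretical distribution is right, both sets of points should lie near the 45 degree line.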

Review of "Graphical Goodness of Fit" (cont.)

A Connection: For the Uniform(0,1) distribution, $F(x) = x$ for $0 \le x \le 1$, i.e. $p = q$,

so:

- CDF is P-P plot against the Uniform(0,1)

- Quantile is a Q-Q plot against the Uniform(0,1)

(these things aren't all that different, just rescalings)

Review of "Graphical Goodness of Fit" (cont.)

Some distributions have special relations to appropriate scalings,

Can lead to "visual parameter estimation":

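E.g. 1: Gaussian, $p = F(q) = \Phi\left(\frac{q - \mu}{\sigma}\right)$

solving for $q$ gives: $q = \mu + \sigma \, \Phi^{-1}(p)$

where $\Phi^{-1}$ is the Standard Normal Quantile.

So Q-Q plot against Standard Normal is linear (any Gaussian),

and $\mu$ is the intercept, and $\sigma$ is the slope.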
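A quick numerical illustration of "visual parameter estimation" for the Gaussian: fit a straight line to the Q-Q points against the Standard Normal and read off intercept and slope. The simulated data and the least squares line are assumptions made for the sketch (by eye one would simply look at the plot):

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
mu, sigma = 5.0, 2.0                      # "true" values for the toy example
data = rng.normal(mu, sigma, size=5000)

n = len(data)
p = (np.arange(1, n + 1) - 0.5) / n       # plotting positions
z = np.array([NormalDist().inv_cdf(pi) for pi in p])  # Standard Normal quantiles

# Q-Q plot is sorted data vs. z; a fitted line gives slope ~ sigma, intercept ~ mu
slope, intercept = np.polyfit(z, np.sort(data), 1)
print(intercept, slope)
```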

Review of "Graphical Goodness of Fit" (cont.)

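E.g. 2: Pareto, shape parameter $\alpha$, scale parameter $\sigma$, e.g. in the classical form $1 - F(q) = (q / \sigma)^{-\alpha}$ for $q \ge \sigma$

So $\log(1 - p) = -\alpha \log q + \alpha \log \sigma$, and get linear function (with slope $-\alpha$), for: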

log(1-CDF) vs. log(quantiles)

(essentially CDF on log-log scales)

Review of "Graphical Goodness of Fit" (cont.)

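E.g. 3: Weibull, shape parameter $\alpha$, scale parameter $\sigma$: $p = F(q) = 1 - \exp\left(-(q / \sigma)^{\alpha}\right)$

solve to get quantile function: $q = \sigma \left(-\log(1 - p)\right)^{1/\alpha}$

but $-\log(1 - p)$ is the Quantile func'n of the Exponential(1)

so have linear function, $\log q = \log \sigma + \frac{1}{\alpha} \log\left(-\log(1 - p)\right)$, for log-log Q-Q against the Exponential.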
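The Weibull relation can be checked numerically: on log-log scales, the Q-Q points against the Exponential(1) should fall near a line with slope 1 / shape and intercept log(scale). This sketch assumes the form F(q) = 1 - exp(-(q / scale)^shape); the simulated data and the least squares line are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
shape, scale = 0.5, 3.0
# numpy's weibull(a) draws the standard (scale 1) Weibull with shape a
data = scale * rng.weibull(shape, size=5000)

n = len(data)
p = (np.arange(1, n + 1) - 0.5) / n
exp_quantiles = -np.log(1.0 - p)          # Exponential(1) quantile function

# log-log Q-Q: log(sorted data) vs. log(Exponential quantiles)
slope, intercept = np.polyfit(np.log(exp_quantiles), np.log(np.sort(data)), 1)
shape_hat, scale_hat = 1.0 / slope, np.exp(intercept)
print(shape_hat, scale_hat)
```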

Review of "Graphical Goodness of Fit" (cont.)

Some Toy Examples

Pareto, varying shape

Pareto, varying scale

Weibull, varying shape

Weibull, varying scale

logNormal, varying mean

logNormal, varying scale

Alternate Views of Response Size Data

Downey (2000) - cdf based analyses (not Q-Q plots)

Direct CDF

- Clearly wrong scale

CDF(log10 data)

- much better

- good connection to smooth histogram

Alternate Views (cont.)

Focus on Pareto view:

log(1-F) calculation

CCDF(log10 data)

- just an upside-down flip

log10 CCDF(log10 data)

- Pareto is a line, with slope equal to minus the shape parameter
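A minimal sketch of the log10 CCDF vs. log10 data view, using simulated classical Pareto data as a stand-in for the response sizes (the data, the plotting positions, and the fitted line are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 1.2
# Stand-in data: 1 - F(q) = (q / 1000) ** (-alpha), NOT the real response sizes
data = 1000.0 * (1.0 - rng.random(100_000)) ** (-1.0 / alpha)

s = np.sort(data)
n = len(s)
# Empirical CCDF at each sorted point; the -0.5 keeps it strictly positive
ccdf = 1.0 - (np.arange(1, n + 1) - 0.5) / n

x = np.log10(s)          # log10 data
y = np.log10(ccdf)       # log10 CCDF
slope, intercept = np.polyfit(x, y, 1)
print(slope)             # near -alpha for Pareto data
```

For real data only part of the curve may look linear, so the range over which a line is fitted matters.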

Alternate Views (cont.)

Personal Conclusions:

- Prefer Q-Q analysis, since can assess variability

(Completely invalid, because of LRD / diurnal effects???)
(How to assess variability in cdf? bootstrap?)

- Pareto is "reasonable in large regions"

- Lognormal is close, but inadequate

- Weibull is way off
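On the "bootstrap?" question above, one naive possibility is a pointwise percentile band for the empirical CDF from i.i.d. resampling. This is only a sketch: it ignores dependence entirely, so the same LRD / diurnal caveat applies, and the function name, band level, and toy data are hypothetical:

```python
import numpy as np

def bootstrap_ecdf_band(data, grid, n_boot=200, level=0.95, seed=0):
    """Pointwise percentile band for the empirical CDF, via i.i.d. resampling."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n = len(data)
    curves = np.empty((n_boot, len(grid)))
    for b in range(n_boot):
        resample = np.sort(rng.choice(data, size=n, replace=True))
        curves[b] = np.searchsorted(resample, grid, side="right") / n
    lo, hi = np.quantile(curves, [(1 - level) / 2, (1 + level) / 2], axis=0)
    return lo, hi

# Toy data (NOT the response sizes)
data = np.random.default_rng(1).lognormal(size=2000)
grid = np.quantile(data, [0.1, 0.5, 0.9])
lo, hi = bootstrap_ecdf_band(data, grid)
print(lo, hi)
```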

Interesting Open Problem (revisited)

1. Find a "good", precise mathematical definition of:

"heavy tailed" distributions

Some ideas:

- not moment based

- should depend on "range of interest"

- empirical version depends on sample size

- not a number, but a "curve"?

- what will it be used for??