Course  OR 778

Statistical Analysis and Modelling
of Internet Traffic Data





Course Meetings:

Time:   Mon. - Wed. 8:40 - 9:55
Room:  Rhodes 471


Course Web Site:

http://www.orie.cornell.edu/~marron/OR778NetworkData/OR778home.html

maybe easier to follow link from:

http://www.orie.cornell.edu/~marron/
 
 
 
 


Instructor:   J. S. (Steve) Marron
 

Office:   Rhodes 234
Office Hours:   Mon. 10 - 11,    Tue. 11 - 12
 

Phone:   (607) 255-9147
Email:   marron@stat.unc.edu
 

Course Email List:  please add yourself,
by sending an email with "subscribe" as the subject,
to:  or778-fa01-l-request@orie.cornell.edu

(useful for announcements, such as "notes now posted")














Course Work / Grading
 
 
 

Based on a presentation
 
 
 
 

Presentations:

    -    can be either a paper by others (you choose, or I suggest)

    -    or your own work

    -    let's discuss soon
 
 
 
 
 
 


Last Time:
 
 

    -    Detailed Q-Q analysis of tail of Response Size Distributions

    -    Pareto(1.2) gave acceptable (?) fit

    -    So did Pareto(1.5)  ??

    -    Moving window analysis showed non-stationarity

    -    log normal also gave decent fit ???

    -    how should we think about "heavy tails"????

    -    in context of:

Heavy tailed durations   =>   Long Range Dependence











Q-Q analysis revisited, I






Where are the quantiles on the Q-Q curve?

Movie highlighting quantiles

(Note: re-simulation of the envelope gives a "visual impression of variability")
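A rough sketch of how a simulated Q-Q envelope of this kind might be built. It assumes a classical Pareto with 1 - F(x) = (x/sigma)^(-alpha) and inverse-CDF sampling; the function names, the pointwise min/max envelope, and the parameter values are illustrative choices, not the course software.

```python
import random

def pareto_sample(n, alpha, sigma, rng):
    """Draw n values from an assumed classical Pareto, 1 - F(x) = (x/sigma)**(-alpha)."""
    # Inverse-CDF sampling; 1 - rng.random() lies in (0, 1], so the power is finite.
    return [sigma * (1.0 - rng.random()) ** (-1.0 / alpha) for _ in range(n)]

def qq_envelope(n, alpha, sigma, n_sim=100, seed=0):
    """Pointwise min and max of each order statistic over n_sim simulated samples."""
    rng = random.Random(seed)
    lower = [float("inf")] * n
    upper = [float("-inf")] * n
    for _ in range(n_sim):
        sim = sorted(pareto_sample(n, alpha, sigma, rng))
        for i, x in enumerate(sim):
            lower[i] = min(lower[i], x)
            upper[i] = max(upper[i], x)
    return lower, upper

def theoretical_quantiles(n, alpha, sigma):
    """Pareto quantiles at the plotting positions (i - 0.5) / n."""
    return [sigma * (1.0 - (i - 0.5) / n) ** (-1.0 / alpha) for i in range(1, n + 1)]

# Illustrative values only: shape 1.2, scale 1, sample size 500.
lower, upper = qq_envelope(n=500, alpha=1.2, sigma=1.0)
theo = theoretical_quantiles(500, 1.2, 1.0)
```

Plotting `lower` and `upper` against `theo` (on log scales, say) gives the band; a data Q-Q curve that leaves the band is more extreme than every simulated sample at that quantile.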






This can also be understood by relating to the "smooth histogram":

SiZer analysis movie
 
 
 

Aside:  Q-Q plot suggests HTTP responses of size 1????

There are 4 in the file, clearly an error in data collection...
 
 
 
 
 
 


Q-Q analysis revisited, II






Restriction to "1st 50,000" seems small for studying tail behavior,

so repeat envelope analysis with the full (n = 734,814) data set:
 
 

Pareto quantile match 0.99 & 0.999
    -    Same good (?) fit as before
 

Log Normal Analysis quantile match 0.99 & 0.999
    -    Looks unacceptably "curved"?
 

Log Normal Analysis quantile match 0.9 & 0.999
    -    Better, but still "too curved"?
 

Log Normal Analysis Max. Lik. Est.
    -    Good in "body of dist'n", but too poor in tail?
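One possible reading of "quantile match" for the log normal, sketched under assumptions: if log X is Normal(mu, s^2), the p-th quantile is exp(mu + s * z_p), so two matched quantiles give two linear equations in (mu, s). The helper names and the simulated data are hypothetical.

```python
import math
import random
from statistics import NormalDist

def sample_quantile(sorted_data, p):
    """Simple empirical quantile: the order statistic at index ceil(n*p) - 1."""
    n = len(sorted_data)
    k = min(n - 1, max(0, math.ceil(n * p) - 1))
    return sorted_data[k]

def lognormal_quantile_match(data, p1, p2):
    """Choose (mu, s) so the log normal p1 and p2 quantiles equal the sample quantiles."""
    xs = sorted(data)
    q1, q2 = sample_quantile(xs, p1), sample_quantile(xs, p2)
    z1, z2 = NormalDist().inv_cdf(p1), NormalDist().inv_cdf(p2)
    s = (math.log(q2) - math.log(q1)) / (z2 - z1)
    mu = math.log(q1) - s * z1
    return mu, s

# Simulated stand-in for response sizes (not the course data).
rng = random.Random(1)
data = [rng.lognormvariate(8.0, 1.5) for _ in range(200_000)]
mu_hat, s_hat = lognormal_quantile_match(data, 0.99, 0.999)
```

Because only two tail quantiles enter, the fit can be good in the tail and poor in the body, the reverse of the maximum likelihood behavior noted above.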
 
 
 
 
 


Q-Q analysis revisited, III







Can we get a "decently good fit" from any parametric family?
 
 
 

Weibull Analysis quantile match 0.99 & 0.999
 
 
 

    -    visually very far away
 

    -    large sample size makes this clearer
 
 
 
 


Q-Q analysis revisited, IV







Comparison across plots is slippery with differing edges,
so choose range:

Pareto quantile match

Pareto, twiddled parameters

Pareto, finite variance boundary

    -    much easier comparison

    -    Q-Q curve "shifts to the right"

    -    envelope covers same range (same theoretical quantiles)

    -    more variability for heavier tails???
 
 
 
 
 


Q-Q analysis revisited, V







Review "moving window of 50,000", showing quantiles

Movie with fit Pareto

Movie with "nearly light tail" Pareto

    -    important nonstationarity is between 0.99 and 0.999 quantiles

(50 - 500 largest data points)

    -    cannot completely exclude light tails

    -    nonstationarity could be "long range dep." or "diurnal effect"

    -    how to study "dependence"?

    -    expect better data soon
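A small sketch of the moving-window idea, assuming simple order-statistic quantiles and a hypothetical step size between windows. It also checks the arithmetic behind "50 - 500 largest data points": in a window of 50,000 the 0.99 and 0.999 quantiles leave 500 and 50 points above them.

```python
import math
import random

def moving_window_quantiles(data, window, step, probs):
    """Sample quantiles of `data` within successive windows of length `window`."""
    results = []
    for start in range(0, len(data) - window + 1, step):
        xs = sorted(data[start:start + window])
        row = [xs[min(window - 1, math.ceil(window * p) - 1)] for p in probs]
        results.append((start, row))
    return results

window = 50_000
# Number of data points above the 0.99 and 0.999 quantiles within one window.
n_above = [round(window * (1 - p)) for p in (0.99, 0.999)]

# Toy stationary Pareto(1.2) data; real traces would replace this.
rng = random.Random(2)
toy = [(1.0 - rng.random()) ** (-1.0 / 1.2) for _ in range(150_000)]
rows = moving_window_quantiles(toy, window, step=25_000, probs=(0.99, 0.999))
```

For stationary toy data the rows differ only by sampling noise, which gives a baseline against which drift in real data can be judged.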
 
 
 
 
 
 


Q-Q analysis revisited, VI








How do parameter est's change as the matched quantiles change?
 

Q matched Q-Q, q1 = 0.5, movie over q2

Summary plot of parameter estimates

    -    est'd shape parameters  ~  1.2 - 1.3
 

Q matched Q-Q, q1 = 0.9, movie over q2

Summary plot of parameter estimates

    -    est'd shape parameters  ~  1.2 - 1.8

    -    "spike" where q1 ~ q2
 

Q matched Q-Q, q1 = 0.99, movie over q2

Summary plot of parameter estimates

    -    est'd shape parameters  ~  1.2 - 1.8

    -    "spike" where q1 ~ q2
 
 

Q matched Q-Q, q1 = 0.999, movie over q2

Summary plot of parameter estimates

    -    est'd shape parameters  ~  1.0 - 1.4

    -    (downwards) "spike" where q1 ~ q2
 

Q matched Q-Q, q1 = 0.9999, movie over q2

Summary plot of parameter estimates

    -    est'd shape parameters  ~  1.1 - 1.3

    -    "spike" where q1 ~ q2
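A sketch of how shape estimates could be traced as the second matched quantile moves, assuming the classical Pareto form 1 - F(x) = (x/sigma)^(-alpha), whose p-th quantile is sigma * (1 - p)^(-1/alpha). The function name, the grid of q2 values, and the simulated data are illustrative.

```python
import math
import random

def pareto_quantile_match(sorted_data, p1, p2):
    """Shape and scale of an assumed classical Pareto whose p1 and p2
    quantiles equal the corresponding sample quantiles."""
    n = len(sorted_data)
    q1 = sorted_data[min(n - 1, math.ceil(n * p1) - 1)]
    q2 = sorted_data[min(n - 1, math.ceil(n * p2) - 1)]
    # log q2 - log q1 = (1/alpha) * (log(1 - p1) - log(1 - p2))
    alpha = (math.log(1 - p1) - math.log(1 - p2)) / (math.log(q2) - math.log(q1))
    sigma = q1 * (1 - p1) ** (1 / alpha)
    return alpha, sigma

# Simulated Pareto(1.2) data standing in for the response sizes.
rng = random.Random(3)
data = sorted((1.0 - rng.random()) ** (-1.0 / 1.2) for _ in range(200_000))

p1 = 0.9
sweep = {}
for p2 in (0.5, 0.7, 0.89, 0.91, 0.99, 0.999, 0.9999):
    alpha_hat, sigma_hat = pareto_quantile_match(data, p1, p2)
    sweep[p2] = alpha_hat
```

When q2 is close to q1 both the numerator and the denominator of the shape estimate are near zero, so small sampling errors are magnified; this is one plausible source of the "spike" in the summary plots.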
 
 
 
 
 
 


Q-Q analysis revisited, VI (cont.)






Could do:    summarize over q1, q2 "triangle"
 
 
 

Suspected Conclusion:    est'd shape parameters  ~  1.0 - 1.8
 
 
 

Seems like strong case for heavy tails
 
 
 

Could do:   formal hypothesis test,  to reject

H0:    shape parameter = 2









Q-Q analysis revisited, VII







What about other data views?
 
 

Overall Review of "Graphical Goodness of Fit"
 
 
 

Reference:

Fisher, N. I. (1983) Graphical Methods in Nonparametric Statistics: A Review and Annotated Bibliography, International Statistical Review, 51, 25-58.
 
 
 
 
 
 


Review of "Graphical Goodness of Fit"






Basis:    "Cumulative Distribution Function"  (CDF)

    F(x) = P{X <= x}

Probability quantile notation:

    p = F(q),    for "probability" p and "quantile" q

    equivalently    q = F^{-1}(p)

Thus F^{-1} is called the "quantile function"
 
 
 
 
 
 


Review of "Graphical Goodness of Fit" (cont.)






Two types of CDF:

1.    Theoretical

    F(x) = P{X <= x},    for a given probability distribution

2.    Empirical,  based on data X_1, ..., X_n

    F_n(x) = #{i : X_i <= x} / n
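A minimal sketch of the empirical CDF and its generalized inverse, the empirical quantile function. The names `make_ecdf` and `make_empirical_quantile` and the tiny data set are made up for illustration.

```python
import bisect
import math

def make_ecdf(data):
    """Return F_n, where F_n(x) = (number of data points <= x) / n."""
    xs = sorted(data)
    n = len(xs)
    return lambda x: bisect.bisect_right(xs, x) / n

def make_empirical_quantile(data):
    """Return the generalized inverse: smallest data value x with F_n(x) >= p."""
    xs = sorted(data)
    n = len(xs)
    def quantile(p):
        if not 0.0 < p <= 1.0:
            raise ValueError("p must lie in (0, 1]")
        return xs[math.ceil(n * p) - 1]
    return quantile

sizes = [310, 1, 4200, 87, 1500, 87, 960, 25000]   # made-up response sizes
F_n = make_ecdf(sizes)
Q_n = make_empirical_quantile(sizes)
```

For this toy data `F_n(87)` is 3/8 and `Q_n(0.5)` is 310, the smallest value at which the empirical CDF reaches one half.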













Review of "Graphical Goodness of Fit" (cont.)





Direct Visualizations:

1.   CDF   -   plot F(x)   vs.    grid of x values

2.   Quantile   -   plot F^{-1}(p)  (= sorted data)    vs.    grid of p values

Comparison Visualizations:    (compare empirical with a theoretical)

3.   P-P plot   -   plot F_n(x)   vs.   F(x)   for a grid of x values

4.   Q-Q plot   -   plot F_n^{-1}(p)   vs.   F^{-1}(p)   for a grid of p values
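The two comparison plots can be sketched as coordinate lists. This assumes plotting positions (i + 0.5)/n for the Q-Q plot and no ties in the data; the Exponential(1) reference and all names are illustrative only.

```python
import math
import random

def pp_points(data, cdf):
    """P-P coordinates: (theoretical prob., empirical prob.) at each sorted data value.
    Uses (i + 1)/n as the empirical CDF, which assumes no ties."""
    xs = sorted(data)
    n = len(xs)
    return [(cdf(x), (i + 1) / n) for i, x in enumerate(xs)]

def qq_points(data, quantile):
    """Q-Q coordinates: (theoretical quantile, empirical quantile) on a grid of p."""
    xs = sorted(data)
    n = len(xs)
    return [(quantile((i + 0.5) / n), x) for i, x in enumerate(xs)]

def exp_cdf(x):
    """CDF of the Exponential(1) distribution."""
    return 1.0 - math.exp(-x)

def exp_quantile(p):
    """Quantile function of the Exponential(1) distribution."""
    return -math.log(1.0 - p)

# Toy check: Exponential(1) data compared with the Exponential(1) model.
rng = random.Random(4)
data = [rng.expovariate(1.0) for _ in range(2000)]
pp = pp_points(data, exp_cdf)
qq = qq_points(data, exp_quantile)
```

Both lists should hug the 45 degree line when the model is right; the P-P plot lives on the unit square, while the Q-Q plot is on the data scale and so shows the tail.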
 
 
 
 
 


Review of "Graphical Goodness of Fit" (cont.)





A Connection:    For the Uniform(0,1) distribution,

    F(x) = x   for 0 <= x <= 1,    i.e.  p = q

so:
 

    -    CDF is P-P plot against the Uniform(0,1)
 

    -    Quantile is a Q-Q plot against the Uniform(0,1)
 
 
 

(these things aren't all that different, just rescalings)
 
 
 
 
 
 


Review of "Graphical Goodness of Fit" (cont.)



Some distributions have special relations to appropriate scalings,

Can lead to "visual parameter estimation":
 
 

E.g. 1:    Gaussian,    p = F(q) = Phi((q - mu) / sigma)

    solving for q gives:    q = mu + sigma * Phi^{-1}(p)

    where Phi^{-1} is the Standard Normal Quantile.

    So    Q-Q plot against Standard Normal  is linear (any Gaussian),

    and mu is the intercept,     and sigma is the slope.
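A sketch of "visual parameter estimation" done numerically: under a Gaussian model the quantiles satisfy q = mu + sigma * z_p, with z_p the standard normal quantile, so a line through the Normal Q-Q plot has intercept near mu and slope near sigma. The least-squares fit is an assumption here (a stand-in for reading the line off by eye), and the simulated data are hypothetical.

```python
import random
from statistics import NormalDist, fmean

def normal_qq_line(data):
    """Least-squares line through the Normal Q-Q plot; returns (intercept, slope)."""
    xs = sorted(data)
    n = len(xs)
    # Standard normal quantiles at plotting positions (i + 0.5) / n.
    z = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    z_bar, x_bar = fmean(z), fmean(xs)
    slope = (sum((a - z_bar) * (b - x_bar) for a, b in zip(z, xs))
             / sum((a - z_bar) ** 2 for a in z))
    intercept = x_bar - slope * z_bar
    return intercept, slope

rng = random.Random(5)
data = [rng.gauss(10.0, 3.0) for _ in range(5000)]
mu_hat, sigma_hat = normal_qq_line(data)
```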
 
 
 
 
 


Review of "Graphical Goodness of Fit" (cont.)




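E.g. 2:    Pareto, shape parameter alpha,    scale parameter sigma

    F(x) = 1 - (x / sigma)^(-alpha),    for x >= sigma

    so    log(1 - F(x)) = -alpha * log(x) + alpha * log(sigma)

So get linear function  (with slope -alpha), for:

log(1-CDF)    vs.  log(quantiles)

(essentially CDF on log-log scales)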
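A quick numerical check of the linearity claim, assuming the classical Pareto form 1 - F(x) = (x/sigma)^(-alpha) for x >= sigma. The parameter values are illustrative only.

```python
import math

def pareto_ccdf(x, alpha, sigma):
    """1 - F(x) for an assumed classical Pareto; valid for x >= sigma."""
    return (x / sigma) ** (-alpha)

alpha, sigma = 1.2, 1000.0           # illustrative values only
x_grid = [sigma * 10 ** k for k in range(0, 5)]
log_x = [math.log10(x) for x in x_grid]
log_ccdf = [math.log10(pareto_ccdf(x, alpha, sigma)) for x in x_grid]

# Slope between successive grid points; each should equal -alpha.
slopes = [(log_ccdf[i + 1] - log_ccdf[i]) / (log_x[i + 1] - log_x[i])
          for i in range(len(x_grid) - 1)]
```

Every entry of `slopes` comes out at -1.2 up to rounding, so on log-log axes the complementary CDF is a straight line whose slope is minus the shape parameter.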








Review of "Graphical Goodness of Fit" (cont.)




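E.g. 3:    Weibull, shape parameter alpha,    scale parameter sigma

    F(x) = 1 - exp(-(x / sigma)^alpha),    for x >= 0

solve to get quantile function:

    q = F^{-1}(p) = sigma * (-log(1 - p))^(1/alpha)

but -log(1 - p) is the Quantile func'n of the Exponential(1)

so have linear function, for log-log Q-Q  against the Exponential:

    log(q) = log(sigma) + (1/alpha) * log(-log(1 - p))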
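A sketch of the Weibull version of visual parameter estimation, assuming F(x) = 1 - exp(-(x/sigma)^alpha), so that log(q) = log(sigma) + (1/alpha) * log(-log(1 - p)). The least-squares line and the simulated data are illustrative assumptions, not the method used for the course plots.

```python
import math
import random

def weibull_loglog_qq_fit(data):
    """Fit a line to log(sorted data) vs. log(Exponential(1) quantiles).
    Under a Weibull model the slope estimates 1/alpha and the intercept log(sigma)."""
    xs = sorted(data)
    n = len(xs)
    u = [math.log(-math.log(1.0 - (i + 0.5) / n)) for i in range(n)]
    v = [math.log(x) for x in xs]
    u_bar, v_bar = sum(u) / n, sum(v) / n
    slope = (sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
             / sum((a - u_bar) ** 2 for a in u))
    intercept = v_bar - slope * u_bar
    return 1.0 / slope, math.exp(intercept)   # (alpha_hat, sigma_hat)

rng = random.Random(6)
# random.weibullvariate takes the scale first and the shape second.
data = [rng.weibullvariate(2000.0, 0.5) for _ in range(5000)]
alpha_hat, sigma_hat = weibull_loglog_qq_fit(data)
```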
 
 
 
 
 


Review of "Graphical Goodness of Fit" (cont.)





Some Toy Examples
 

Pareto, varying shape

Pareto, varying scale

Weibull, varying shape

Weibull, varying scale

logNormal, varying mean

logNormal, varying scale
 
 
 
 
 
 
 
 


Alternate Views of Response Size Data






Downey (2000)  -  cdf based analyses  (not Q-Q plots)
 
 
 

Direct CDF

    -    Clearly wrong scale
 

CDF(log10 data)

    -    much better

    -    good connection to smooth histogram
 
 
 
 
 
 


Alternate Views (cont.)






Focus on Pareto view:
 

log(1-F) calculation
 

CCDF(log10 data)

    -    just an upside-down flip
 
 
 
 

log10 CCDF(log10 data)

    -    Pareto is a line, with slope = -(shape parameter)
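A sketch of the log10 CCDF view on simulated data, with a crude least-squares slope over part of the tail. The range of CCDF values used for the fit, the function names, and the toy Pareto(1.2) sample are all assumptions for illustration.

```python
import math
import random

def log10_ccdf_points(data):
    """(log10 x, log10 of empirical 1 - F) at each sorted data value except the
    largest, where the empirical CCDF is zero and its log is undefined."""
    xs = sorted(data)
    n = len(xs)
    return [(math.log10(x), math.log10(1.0 - (i + 1) / n))
            for i, x in enumerate(xs[:-1])]

def tail_slope(points, min_log_ccdf, max_log_ccdf):
    """Least-squares slope using only points whose log10 CCDF lies in the given range."""
    sel = [(u, v) for u, v in points if min_log_ccdf <= v <= max_log_ccdf]
    m = len(sel)
    u_bar = sum(u for u, _ in sel) / m
    v_bar = sum(v for _, v in sel) / m
    return (sum((u - u_bar) * (v - v_bar) for u, v in sel)
            / sum((u - u_bar) ** 2 for u, _ in sel))

# Toy data: classical Pareto with shape 1.2 and scale 1000 (positive values only).
rng = random.Random(7)
toy = [1000.0 * (1.0 - rng.random()) ** (-1.0 / 1.2) for _ in range(100_000)]
pts = log10_ccdf_points(toy)
slope = tail_slope(pts, min_log_ccdf=-3.0, max_log_ccdf=-1.0)   # CCDF in [0.001, 0.1]
```

The fitted slope should sit near -1.2 for this toy sample; on real data the answer depends on which part of the tail is used, which echoes the dependence on matched quantiles seen in the Q-Q analysis.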
 
 
 
 
 
 
 


Alternate Views (cont.)






Personal Conclusions:
 

    -    Prefer Q-Q analysis, since can assess variability

(Completely invalid, because of LRD / diurnal effects???)
(How to assess variability in cdf?  bootstrap?)
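One way the bootstrap question might be explored, as a sketch only: a pointwise percentile band for the empirical CDF. Resampling with replacement treats the data as i.i.d., which is exactly what the LRD / diurnal concern above calls into question, so this is a starting point rather than an answer. All names, the grid, and the toy data are hypothetical.

```python
import bisect
import random

def ecdf_at(sorted_xs, grid):
    """Empirical CDF of sorted_xs evaluated at each grid point."""
    n = len(sorted_xs)
    return [bisect.bisect_right(sorted_xs, g) / n for g in grid]

def bootstrap_ecdf_band(data, grid, n_boot=200, level=0.95, seed=0):
    """Pointwise percentile band for the empirical CDF on `grid`.
    Resamples with replacement, which treats the data as i.i.d."""
    rng = random.Random(seed)
    n = len(data)
    curves = [ecdf_at(sorted(rng.choices(data, k=n)), grid) for _ in range(n_boot)]
    lo_idx = int((1 - level) / 2 * (n_boot - 1))
    hi_idx = int((1 + level) / 2 * (n_boot - 1))
    lower, upper = [], []
    for j in range(len(grid)):
        col = sorted(c[j] for c in curves)
        lower.append(col[lo_idx])
        upper.append(col[hi_idx])
    return lower, upper

# Toy Pareto(1.2) data with scale 1; the grid points are arbitrary.
rng = random.Random(8)
data = [(1.0 - rng.random()) ** (-1.0 / 1.2) for _ in range(2000)]
grid = [1.5, 3.0, 10.0, 100.0]
lower, upper = bootstrap_ecdf_band(data, grid)
point = ecdf_at(sorted(data), grid)
```

A block bootstrap (resampling runs of consecutive observations) would be a natural variation if dependence is a concern.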





    -    Pareto is "reasonable in large regions"
 

    -    Lognormal is close, but inadequate
 

    -    Weibull is way off
 
 
 
 
 
 
 


Interesting Open Problem (revisited)






1.    Find a "good", precise mathematical definition of:

"heavy tailed" distributions





Some ideas:

    -    not moment based

    -    should depend on "range of interest"

    -    empirical version depends on sample size

    -    not a number, but a "curve"?

    -    what will it be used for??