Course  OR 778

Statistical Analysis and Modelling
of Internet Traffic Data










Course Meetings:

Time:   Mon. - Wed. 8:40 - 9:55
Room:  Rhodes 471


Course Web Site:
???
 
 
 


Instructor:   J. S. (Steve) Marron
 

Office:   Rhodes 234
Office Hours:   Mon. 10 - 11,    Tuesday 11 - 12
 

Phone:   (607) 255-9147
Email:   marron@stat.unc.edu
 

Course Email List:  please add yourself,
by sending an email with "subscribe" as the subject,
to:  or778-fa01-l-request@orie.cornell.edu

(useful for announcements, such as "notes now posted")








Course Work / Grading
 
 
 

Based on a presentation
 
 
 
 

Presentations:

    -    can be either a paper by others (you choose, or I suggest)

    -    or your own work

    -    let's discuss soon
 
 
 
 
 
 


Last Time:
 
 

    -    Big picture of Internet traffic

    -    Studying Response Size Distributions

    -  SiZer analysis showed non-standard dist'n
                    with statistically significant "bumps"

    -    Introduced Q-Q plots, and assessment of variability
                    via simulated envelope

    -    Q-Q plots suggest possible Pareto fits
 
 
 
 
 
 


Investigation I:  Heavy Tails?  (cont.)







Q-Q plot for full 734,814 HTTP Response Sizes:

Response Size Q-Q plot
 

    -    Pareto(1.2) good fit in tails?

    -    surprisingly good?

    -    have added sim’d overlay

    -    nearly no “variability”?   (except at ends)

    -    Shape parameter 1.2,   has finite mean  &  infinite variance

    -    poor fit for main dist’n (as expected)
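
A minimal sketch of a Pareto Q-Q plot with a simulated envelope, to make the "sim'd overlay" idea concrete. Assumptions (not from the slides): the classical Pareto form F(x) = 1 - (x / sigma)^(-alpha) for x >= sigma, plotting positions (i + 0.5) / n, a pointwise min/max envelope, and synthetic data in place of the HTTP response sizes. The function names are hypothetical.

```python
import math
import random

def pareto_quantile(p, alpha, sigma=1.0):
    """Quantile of the assumed classical Pareto, F(x) = 1 - (x / sigma) ** (-alpha)."""
    return sigma * (1.0 - p) ** (-1.0 / alpha)

def qq_points(data, alpha, sigma=1.0):
    """Pair each sorted observation with the theoretical quantile at (i + 0.5) / n."""
    xs = sorted(data)
    n = len(xs)
    return [(pareto_quantile((i + 0.5) / n, alpha, sigma), x) for i, x in enumerate(xs)]

def simulated_envelope(n, alpha, sigma=1.0, n_sim=100, seed=0):
    """Pointwise min and max, per order statistic, over sorted simulated Pareto samples."""
    rng = random.Random(seed)
    lo = [math.inf] * n
    hi = [-math.inf] * n
    for _ in range(n_sim):
        sim = sorted(pareto_quantile(rng.random(), alpha, sigma) for _ in range(n))
        lo = [min(a, b) for a, b in zip(lo, sim)]
        hi = [max(a, b) for a, b in zip(hi, sim)]
    return lo, hi

# Demo on synthetic Pareto(1.2) data (not the course data set).
rng = random.Random(1)
data = [pareto_quantile(rng.random(), 1.2) for _ in range(1000)]
points = qq_points(data, alpha=1.2)
lo, hi = simulated_envelope(len(data), alpha=1.2, n_sim=50)
outside = sum(1 for (_, x), a, b in zip(points, lo, hi) if not a <= x <= b)
print("points outside the simulated envelope:", outside)
```

Plotting the pairs in `points` together with `lo` and `hi` against the same theoretical quantiles gives the overlay; the band is typically narrow in the body and wide at the largest order statistics, in line with the "variability only at ends" remark.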
 
 
 
 
 


Q-Q Plots (cont.)








Pareto parameter estimation:
 

    -    Pareto Quantile matching
 

    -    Conventional estimation methods "tricky"
 

    -    Likelihood requires numerical optimization
 

    -    Method of Moments fails
 

    -    Want "tails", not "body"
 
 
 
 
 


Q-Q Plots (cont.)








Parameter Estimation:    Pareto quantile matching

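   -    Estimate "scale parameter" $\sigma$ and "shape parameter" $\alpha$

   -    Choose $\sigma$ and $\alpha$ to match quantiles

(for two probabilities $0 < p_1 < p_2 < 1$):

Theoretical:  $q_i$,  where  $F(q_i) = p_i$

with $F$ the Pareto cumulative distribution function

Empirical:  $\hat{q}_i$,  where  $\hat{q}_i$ is the sample quantile of the data at probability $p_i$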
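
A minimal sketch of quantile matching, under the assumption of the classical Pareto form F(x) = 1 - (x / sigma)^(-alpha) for x >= sigma (the parameterization used in the course may differ). With this form, setting the two theoretical quantiles equal to the two empirical ones has a closed-form solution. The probabilities 0.5 and 0.99 and the function names are hypothetical, chosen only for illustration.

```python
import math
import random

def empirical_quantile(data, p):
    """Order statistic at position ceil(n * p), clamped to the sample range."""
    xs = sorted(data)
    k = min(len(xs) - 1, max(0, math.ceil(p * len(xs)) - 1))
    return xs[k]

def pareto_quantile_match(data, p1=0.5, p2=0.99):
    """Solve sigma * (1 - p_i) ** (-1 / alpha) = q_i (i = 1, 2) for alpha and sigma."""
    q1 = empirical_quantile(data, p1)
    q2 = empirical_quantile(data, p2)
    alpha = math.log((1.0 - p1) / (1.0 - p2)) / math.log(q2 / q1)
    sigma = q1 * (1.0 - p1) ** (1.0 / alpha)
    return alpha, sigma

# Demo on synthetic Pareto(alpha = 1.5, sigma = 2) data.
rng = random.Random(2)
sample = [2.0 * (1.0 - rng.random()) ** (-1.0 / 1.5) for _ in range(20000)]
alpha_hat, sigma_hat = pareto_quantile_match(sample, p1=0.5, p2=0.99)
print("estimated shape:", round(alpha_hat, 2), " estimated scale:", round(sigma_hat, 2))
```

Moving p1 and p2 further into the upper range puts more weight on the "tails" and less on the "body", at the price of more variable estimates.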












Investigation I:  Heavy Tails?  (cont.)








Study variability for only 1st  50,000:

Q-Q Plot for 1st  50,000 Responses
 

    -    Quantile matched parameter estimates

    -    Small enough to do sim’d overlay

    -    Variation very small over wide range (but not in "tails")

    -    “small wiggles” actually outside the sim’d envelope

    -    Shape parameter changes to 1.5?!?
 
 


Investigation I:  Heavy Tails?  (cont.)








How different are shape parameters  1.2  and 1.5?
 
 

Use parameters estimated for full 734,814:

Q-Q Plot for 1st  50,000, given parameters
 
 

    -    expect big difference in tail behaviors?
 

    -    seems like small difference (over range of interest)?
 

    -    better fit in tails?
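
A small numerical check of how far apart shape parameters 1.2 and 1.5 really are. Assumptions (not from the slides): the classical Pareto form F(x) = 1 - (x / sigma)^(-alpha) and the same scale sigma = 1 for both shapes. Under these assumptions the ratio of the two quantiles at probability p is (1 - p)^(-1/6), which grows slowly.

```python
def pareto_quantile(p, alpha, sigma=1.0):
    """Quantile of the assumed classical Pareto, F(x) = 1 - (x / sigma) ** (-alpha)."""
    return sigma * (1.0 - p) ** (-1.0 / alpha)

def quantile_ratio(p, alpha_heavy=1.2, alpha_light=1.5):
    """Ratio of the Pareto(1.2) quantile to the Pareto(1.5) quantile at probability p."""
    return pareto_quantile(p, alpha_heavy) / pareto_quantile(p, alpha_light)

for p in (0.9, 0.99, 0.999, 0.9999):
    print(f"p = {p}:  Pareto(1.2) quantile / Pareto(1.5) quantile = {quantile_ratio(p):.2f}")
```

Since 1/1.2 - 1/1.5 = 1/6, the quantiles differ by a factor of 10 only at p = 1 - 10^(-6), which may be one way to read "small difference over range of interest".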
 
 
 
 
 


Alternate application of Q-Q Plots








Suggested by change in est’d shape parameters:

Study possible “non-stationarity”
 
 

Approach:   moving window Q-Q plots, each window of 50,000 responses, moving through 1 mil. responses

Moving Window Q-Q plot

    -    Clear non-stationarity

    -    tails sometimes “heavier”, sometimes “lighter”

    -    and is "statistically significant"

    -    diurnal (i.e. "time of day") effect???
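
A rough numerical companion to the moving window idea. Instead of drawing a Q-Q plot per window, this sketch tracks a quantile matched shape estimate per window, which is a swapped-in summary and not the graphic from the slides. Assumptions: classical Pareto form, hypothetical probabilities 0.9 and 0.99, and synthetic data whose shape changes from 1.2 to 1.5 halfway through.

```python
import math
import random

def quantile_match_shape(window, p1=0.9, p2=0.99):
    """Shape estimate from two sample quantiles, assuming a classical Pareto tail."""
    xs = sorted(window)
    n = len(xs)
    q1 = xs[math.ceil(p1 * n) - 1]
    q2 = xs[math.ceil(p2 * n) - 1]
    return math.log((1.0 - p1) / (1.0 - p2)) / math.log(q2 / q1)

def moving_window_shapes(data, width, step):
    """Shape estimate for each window of `width` observations, advanced by `step`."""
    return [quantile_match_shape(data[s:s + width])
            for s in range(0, len(data) - width + 1, step)]

# Synthetic non-stationary series: shape 1.2 for the first half, 1.5 for the second.
rng = random.Random(3)
data = [(1.0 - rng.random()) ** (-1.0 / 1.2) for _ in range(40000)]
data += [(1.0 - rng.random()) ** (-1.0 / 1.5) for _ in range(40000)]
shapes = moving_window_shapes(data, width=10000, step=10000)
print([round(a, 2) for a in shapes])
```

Whether a change in such estimates is "statistically significant" still needs a variability assessment, e.g. via the simulated envelope.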
 
 
 


Alternate application of Q-Q Plots







How non-stationary?
 

Explore shape parameter = 2

(boundary of finite variance)

Moving Q-Q plot at variance edge
 

    -    Similar nonstationarity
 

    -    tails sometimes “this light”
 

    -    only at "systematic time points"???
 
 
 
 


Investigation I:  Heavy Tails?  (cont.)








Recent Controversy:    Downey (2000)
 
 

Downey, A. B. (2000) The structural cause of file size distributions, Wellesley College Tech. Report CSD-TR25-2000, http://rocky.wellesley.edu/downey/filesize
 
 
 

Does log-normal fit better than Pareto?
 

(studied distribution of "file sizes", but not so different)
 
 
 
 
 


Investigation I:  Heavy Tails?  (cont.)








1st reaction:  that is ridiculous!

 - Pareto(1.5) has infinite moments of every order  >= 1.5  (finite mean, infinite variance)

 - log-normal has all moments finite

 - Pareto fits the data  (pretty well?!?)
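
The moment claims can be checked directly. The following is a sketch assuming the classical Pareto form with scale $\sigma$ and shape $\alpha$ (the course parameterization may differ), and a log-normal with log-scale mean $\mu$ and standard deviation $s$ (symbols introduced here for illustration).

```latex
% Pareto, assumed form: F(x) = 1 - (x/\sigma)^{-\alpha}, \quad x \ge \sigma
E X^k = \int_\sigma^\infty x^k \, \alpha \sigma^\alpha x^{-\alpha - 1} \, dx
      = \begin{cases}
          \dfrac{\alpha \sigma^k}{\alpha - k}, & k < \alpha, \\[1ex]
          \infty, & k \ge \alpha.
        \end{cases}

% Log-normal: \log X \sim N(\mu, s^2)
E X^k = \exp\!\left( k \mu + \tfrac{1}{2} k^2 s^2 \right) < \infty
      \quad \text{for every } k.
```

So shape 1.5 gives a finite mean and infinite variance, and shape 2 sits exactly at the boundary of finite variance, while the log-normal has all moments finite no matter how large $s$ is.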
 
 

2nd thoughts:

    -    careful, internet traffic is:

                -   slippery

                -   full of surprises
 
 
 


Investigation I:  Heavy Tails?  (cont.)









Q-Q plots for log-normal (HTTP Response Size data):
 
 

log-Normal Q-Q Plot for 1st  50,000 Responses
 
 

Visually fit log-Normal Q-Q Plot
 
 

Moving window log-Normal Q-Q Plot
 
 
 
 
 


Investigation I:  Heavy Tails?  (cont.)








Q-Q plots for log-normal:
 

    -    surprisingly good fit
 

    -    not as good as Pareto, but quite close?
 

    -    log-normal makes more physical sense??
 

    -    does usual “infinite moment” intuition really make sense???
 

    -    better ways to think about “heavy tails”????
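
A minimal sketch of a log-normal Q-Q plot: take logs and compare the sorted values with quantiles of a fitted normal. Assumptions (not from the slides): the normal parameters are fitted by the mean and standard deviation of the logs (the slides also mention a visual fit), plotting positions are (i + 0.5) / n, and the data are synthetic. The function name is hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def lognormal_qq_points(data):
    """Pairs (fitted normal quantile, sorted log observation) for a log-normal Q-Q plot."""
    logs = sorted(math.log(x) for x in data)
    n = len(logs)
    fitted = NormalDist(mean(logs), stdev(logs))
    return [(fitted.inv_cdf((i + 0.5) / n), y) for i, y in enumerate(logs)]

# Demo on synthetic log-normal data (not the HTTP response sizes).
rng = random.Random(4)
sizes = [rng.lognormvariate(8.0, 1.5) for _ in range(5000)]
qq = lognormal_qq_points(sizes)
mid_x, mid_y = qq[len(qq) // 2]
print("median point:", round(mid_x, 2), round(mid_y, 2))
```

Points close to the 45 degree line indicate a good log-normal fit; applying the same function to Pareto data shows how close the two families can look over a limited range.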
 
 
 
 


Interesting Open Problem






1.    Find a "good", precise mathematical definition of:

"heavy tailed" distributions






Some ideas:

    -    not moment based

    -    should depend on "range of interest"

    -    empirical version depends on sample size

    -    not a number, but a "curve"?

    -    what will it be used for??
 
 
 
 


Why Care About Heavy Tails?





Current Folklore (for aggregated data):
 
 

Heavy tailed durations   =>   Long Range Dependence

Toy Graphics, Exponential Durations

Toy Graphics, Pareto (1.5) Durations

(caused by the “few elephants”, but mice are there, too)
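
A toy simulation in the spirit of the graphics above, offered as a sketch and not as the code behind them. Assumptions: flows arrive as a Poisson process, each flow stays active for a random duration with mean 1, and the aggregated series is the number of flows overlapping each time bin. The Pareto durations use the classical form with shape 1.5. All names, rates, and bin counts are hypothetical.

```python
import random

def exp_duration(rng):
    """Exponential duration with mean 1."""
    return rng.expovariate(1.0)

def pareto_duration(rng, alpha=1.5):
    """Classical Pareto duration, scale chosen so that the mean is 1."""
    sigma = (alpha - 1.0) / alpha
    return sigma * (1.0 - rng.random()) ** (-1.0 / alpha)

def overlap_counts(draw_duration, arrival_rate, n_bins, bin_width, rng):
    """Number of flows overlapping each bin; the system starts empty (warm-up effect)."""
    diff = [0] * (n_bins + 1)
    horizon = n_bins * bin_width
    t = rng.expovariate(arrival_rate)
    while t < horizon:
        d = draw_duration(rng)
        first = min(n_bins - 1, int(t // bin_width))
        last = min(n_bins - 1, int((t + d) // bin_width))
        diff[first] += 1        # flow becomes active in bin `first`
        diff[last + 1] -= 1     # and is gone after bin `last`
        t += rng.expovariate(arrival_rate)
    counts, running = [], 0
    for change in diff[:n_bins]:
        running += change
        counts.append(running)
    return counts

def autocorrelation(xs, lag):
    """Sample autocorrelation of the series at the given lag."""
    m = sum(xs) / len(xs)
    var = sum((x - m) ** 2 for x in xs)
    cov = sum((xs[i] - m) * (xs[i + lag] - m) for i in range(len(xs) - lag))
    return cov / var

rng = random.Random(5)
counts_exp = overlap_counts(exp_duration, 20.0, 5000, 1.0, rng)
counts_par = overlap_counts(pareto_duration, 20.0, 5000, 1.0, rng)
for lag in (1, 10, 100):
    print(lag, round(autocorrelation(counts_exp, lag), 3),
          round(autocorrelation(counts_par, lag), 3))
```

With Pareto durations a few very long flows ("elephants") persist across many bins, which is the mechanism the folklore credits for slowly decaying correlation; a single run of a toy like this is only suggestive.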





    -    Mandelbrot (60's)

    -    Paxson and Floyd (1995)

    -    Feldmann, Gilbert and Willinger (1998)

    -    Riedi and Willinger (1999)
 
 
 


Reference Details

Mandelbrot, B. B. (1969) Long-run linearity, locally Gaussian processes, H-spectra
and infinite variance, International Economic Review, 10, 82-113.

Taqqu, M. and Levy, J. (1986) Using renewal processes to generate LRD and high
variability, in: Progress in probability and statistics, E. Eberlein and M. Taqqu eds. Birkhaeuser, Boston, 73-89.

Paxson, V. and Floyd, S. (1995) Wide Area traffic: the failure of Poisson modeling, IEEE/ACM Transactions on Networking, 3, 226-244.

Feldmann, A., Gilbert, A. C. and Willinger, W. (1998) Data networks as cascades: investigating the multifractal nature of Internet WAN traffic, Computer Communication Review, Proceedings of the ACM/SIGCOMM '98, 28, 42-55.

Riedi, R. and Willinger, W. (1999) Toward an improved understanding of network traffic dynamics, in Self-similar Network Traffic and Performance Evaluation, Wiley, New York.
 
 
 
 
 
 
 


Investigation II:  Long Range Dependence?






Question 1:    Is it really there?
 
 

    -    Early conceptions:   no

(renders classical queueing theory useless?)






    -    Very recent work (Cleveland, et al.):    not important
 
 

    -    Motivates a very careful look
 
 
 
 


Investigation II:  Long Range Dependence?  (cont.)






Time series 1:    Aggregated point process data,

1 million Packet Arrival times (from 1998), over ~ 3 minutes






Simple analysis:    time series of bin counts

(Caution:  different view of data from above Response Sizes)

Toy example Graphic
 
 
 

        10,000 bins,      ~100 obs’s per bin

        Binwidth  ~  0.02 sec
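
The bin arithmetic: 1,000,000 arrivals / 10,000 bins = 100 obs's per bin, and about 180 sec / 10,000 bins = 0.018 sec, i.e. roughly 0.02 sec per bin. A minimal sketch of turning arrival times into a time series of bin counts follows; the function name is hypothetical, bins are taken as equal width between the first and last arrival, and the times are assumed not all identical.

```python
def bin_counts(arrival_times, n_bins):
    """Counts of arrivals in n_bins equal-width bins spanning first to last arrival."""
    start, end = min(arrival_times), max(arrival_times)
    width = (end - start) / n_bins
    counts = [0] * n_bins
    for t in arrival_times:
        # The last arrival sits on the right edge, so clamp it into the final bin.
        k = min(n_bins - 1, int((t - start) / width))
        counts[k] += 1
    return counts, width

# Tiny worked example: five arrivals, two bins of width 0.5.
counts, width = bin_counts([0.0, 0.1, 0.2, 0.9, 1.0], n_bins=2)
print(counts, width)
```

For the packet data this would be called with the 1 million arrival times and n_bins = 10,000, giving the series of bin counts to be analyzed.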