Statistical Analysis and Modelling of Internet Traffic Data
Course Meetings:
Time: Mon. - Wed. 8:40 - 9:55
Room: Rhodes 471
Course Web Site:
???
Instructor: J. S. (Steve) Marron
Office: Rhodes 234
Office Hours:
Mon. 10 - 11, Tuesday 11 - 12
Phone: (607) 255-9147
Email: marron@stat.unc.edu
Course Email List:
please add yourself, by sending an email with "subscribe" as the subject,
to: or778-fa01-l-request@orie.cornell.edu
(useful for announcements, such as "notes now posted")
Course Work / Grading
Based on a presentation
Presentations:
- can be either a paper by others (you choose, or I suggest)
- or your own work
- let's discuss soon
Last Time:
- Big picture of Internet traffic
- Studying Response Size Distributions
- SiZer analysis showed non-standard dist'n with statistically significant "bumps"
- Introduced Q-Q plots, and assessment of variability
via simulated envelope
- Q-Q plots suggest possible Pareto fits
Investigation I: Heavy Tails? (cont.)
Q-Q plot for full 734,814 HTTP Response Sizes:
- Pareto(1.2) good fit in tails?
- surprisingly good?
- have added sim’d overlay
- nearly no “variability”? (except at ends)
- Shape parameter 1.2 has finite mean & infinite variance
- poor fit for main dist’n (as expected)
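The Q-Q plot with simulated overlay can be sketched as follows. This is only a minimal illustration under stated assumptions: it takes the classical Pareto form F(x) = 1 - (x/sigma)^(-alpha) for x >= sigma (the parameterization used in the course may differ), uses plotting positions (i + 0.5)/n, and builds the envelope as the pointwise min and max over simulated samples. The function names are hypothetical.

```python
import random

def pareto_quantile(p, alpha, sigma=1.0):
    """Quantile of the assumed Pareto: F(x) = 1 - (x/sigma)**(-alpha), x >= sigma."""
    return sigma * (1.0 - p) ** (-1.0 / alpha)

def qq_points(data, alpha, sigma=1.0):
    """Pair each sorted observation with the theoretical quantile at (i + 0.5)/n."""
    xs = sorted(data)
    n = len(xs)
    return [(pareto_quantile((i + 0.5) / n, alpha, sigma), x) for i, x in enumerate(xs)]

def simulated_envelope(n, alpha, sigma=1.0, n_sim=100, seed=0):
    """Pointwise min and max of the order statistics over n_sim simulated samples."""
    rng = random.Random(seed)
    lo = [float("inf")] * n
    hi = [0.0] * n
    for _ in range(n_sim):
        # paretovariate draws from the classical Pareto with minimum 1
        sim = sorted(sigma * rng.paretovariate(alpha) for _ in range(n))
        lo = [min(a, b) for a, b in zip(lo, sim)]
        hi = [max(a, b) for a, b in zip(hi, sim)]
    return lo, hi
```

Sorted data that wanders outside [lo[i], hi[i]] over a stretch of indices is a departure larger than the simulated sampling variability.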
Q-Q Plots (cont.)
Pareto parameter estimation:
- Pareto Quantile matching
- Conventional estimation methods "tricky"
- Likelihood requires numerical optimization
- Method of Moments fails
- Want "tails", not "body"
Q-Q Plots (cont.)
Parameter Estimation: Pareto quantile matching
- Estimate "scale parameter" and "shape parameter"
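- Choose scale & shape to match two quantiles
(for probabilities 0 < p1 < p2 < 1):
Theoretical: quantile q(p), where F(q(p)) = p,
with F the Pareto c.d.f.
Empirical: quantile q^(p), where a proportion p of the data is <= q^(p)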
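One way quantile matching could be carried out is sketched below. It assumes the classical Pareto form F(x) = 1 - (x/sigma)^(-alpha), so that q(p) = sigma (1 - p)^(-1/alpha); the probabilities p1 = 0.5 and p2 = 0.99 and all names are illustrative choices, not taken from the course. Equating the two theoretical quantiles to the empirical ones gives two equations that solve in closed form for shape and scale.

```python
import math
import random

def empirical_quantile(sorted_data, p):
    """Simple empirical quantile: the order statistic at index floor(p * (n - 1))."""
    return sorted_data[int(p * (len(sorted_data) - 1))]

def pareto_quantile_match(data, p1=0.5, p2=0.99):
    """Return (alpha, sigma) whose Pareto quantiles at p1 and p2 equal the empirical ones.

    Assumes q(p) = sigma * (1 - p)**(-1 / alpha), i.e.
    log q(p) = log(sigma) - log(1 - p) / alpha.
    """
    xs = sorted(data)
    q1, q2 = empirical_quantile(xs, p1), empirical_quantile(xs, p2)
    # Subtracting the two log-quantile equations eliminates sigma.
    alpha = (math.log(1 - p1) - math.log(1 - p2)) / (math.log(q2) - math.log(q1))
    sigma = q1 * (1 - p1) ** (1 / alpha)
    return alpha, sigma

# Synthetic check: Pareto sample with shape 1.5 and scale 2.0.
rng = random.Random(1)
sample = [2.0 * rng.paretovariate(1.5) for _ in range(100_000)]
alpha_hat, sigma_hat = pareto_quantile_match(sample)
print(f"alpha ~ {alpha_hat:.2f}, sigma ~ {sigma_hat:.2f}")
```

Only two order statistics enter the estimate, so the choice of p1 and p2 decides whether the fit follows the "tails" or the "body".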
Investigation I: Heavy Tails? (cont.)
Study variability for only 1st 50,000:
Q-Q Plot for 1st 50,000 Responses
- Quantile matched parameter estimates
- Small enough to do sim’d overlay
- Variation very small over wide range (but not in "tails")
- “small wiggles” actually outside the sim’d envelope
- Shape parameter changes to 1.5?!?
Investigation I: Heavy Tails? (cont.)
How different are shape parameters 1.2 and 1.5?
Use parameters estimated for full 734,814:
Q-Q Plot for 1st 50,000, given parameters
- expect big difference in tail behaviors?
- seems like small difference (over range of interest)?
- better fit in tails?
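A rough numerical feel for how far apart shapes 1.2 and 1.5 are: under the assumed classical Pareto form with a common scale (an assumption, since the fitted scales need not agree), the ratio of quantiles at probability p is (1 - p)^(-1/6), because 1/1.2 - 1/1.5 = 1/6. The sketch below, with hypothetical function names, prints that ratio at a few probabilities.

```python
def pareto_quantile(p, alpha, sigma=1.0):
    """Quantile of the assumed Pareto: q(p) = sigma * (1 - p)**(-1 / alpha)."""
    return sigma * (1.0 - p) ** (-1.0 / alpha)

def quantile_ratio(p, alpha_heavy=1.2, alpha_light=1.5):
    """Ratio of Pareto quantiles at probability p, for a common scale."""
    return pareto_quantile(p, alpha_heavy) / pareto_quantile(p, alpha_light)

for p in (0.9, 0.99, 0.999, 0.999999):
    print(f"p = {p}: quantile ratio = {quantile_ratio(p):.2f}")
```

The ratio grows only like the sixth root of 1/(1 - p), which is one way to see why the two shapes can look similar over the range the data cover and still differ greatly far out in the tail.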
Alternate application of Q-Q Plots
Suggested by change in est’d shape parameters:
Study possible “non-stationarity”
Approach: moving window Q-Q of 50,000, through 1 mil.
- Clear non-stationarity
- tails sometimes “heavier”, sometimes “lighter”
- and is "statistically significant"
- diurnal (i.e. "time of day") effect???
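A numerical companion to the moving window Q-Q plot might track a shape estimate window by window. The sketch below is a simplification, not the plot itself: it summarizes each window of 50,000 by a quantile-matched shape, assuming the classical Pareto form; the step of 10,000, the probabilities 0.5 and 0.99, and the function names are illustrative assumptions.

```python
import math
import random

def window_shape(window, p1=0.5, p2=0.99):
    """Quantile-matched Pareto shape for one window (assumed classical Pareto)."""
    xs = sorted(window)
    q1 = xs[int(p1 * (len(xs) - 1))]
    q2 = xs[int(p2 * (len(xs) - 1))]
    return (math.log(1 - p1) - math.log(1 - p2)) / (math.log(q2) - math.log(q1))

def moving_window_shapes(data, width=50_000, step=10_000):
    """(start index, shape estimate) for each window data[start:start + width]."""
    return [
        (start, window_shape(data[start:start + width]))
        for start in range(0, len(data) - width + 1, step)
    ]

# Synthetic non-stationary series: the shape switches from 1.2 to 2.0 halfway.
rng = random.Random(0)
synthetic = [rng.paretovariate(1.2) for _ in range(100_000)]
synthetic += [rng.paretovariate(2.0) for _ in range(100_000)]
estimates = moving_window_shapes(synthetic)
for start, shape in estimates:
    print(start, round(shape, 2))
```

The data must stay in arrival order, since the point is to see whether the tail behavior drifts over time.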
Alternate application of Q-Q Plots
How non-stationary?
Explore shape parameter = 2
Moving Q-Q plot at variance edge
- Similar nonstationarity
- tails sometimes “this light”
- only at "systematic time points"???
Investigation I: Heavy Tails? (cont.)
Recent Controversy:
Downey (2000)
Downey, A. B. (2000) The structural cause of file size distributions, Wellesley College Tech. Report CSD-TR25-2000, http://rocky.wellesley.edu/downey/filesize
Does log-normal fit better than Pareto?
(studied distribution of "file sizes", but not so different)
Investigation I: Heavy Tails? (cont.)
1st reaction: that is ridiculous!
- Pareto(1.5) has infinite moments of order >= 1.5 (e.g. infinite variance)
- log-normal has all moments finite
- Pareto fits the data (pretty well?!?)
2nd thoughts:
- careful, internet traffic is:
- slippery
- full of surprises
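The moment claims behind the first reaction can be checked directly. The derivation below assumes the classical Pareto density with scale \sigma and shape \alpha, and a log-normal with \log X \sim N(\mu, s^2); the symbols are chosen here for illustration.

```latex
% Pareto: f(x) = \alpha \sigma^{\alpha} x^{-\alpha-1}, \quad x \ge \sigma
E[X^k] = \int_{\sigma}^{\infty} x^{k} \, \alpha \sigma^{\alpha} x^{-\alpha-1} \, dx
       = \begin{cases}
           \dfrac{\alpha \sigma^{k}}{\alpha - k}, & k < \alpha, \\[1ex]
           \infty, & k \ge \alpha.
         \end{cases}

% log-normal: \log X \sim N(\mu, s^2)
E[X^k] = \exp\!\left( k\mu + \tfrac{1}{2} k^{2} s^{2} \right) < \infty
         \quad \text{for every } k.
```

With \alpha = 1.5 the mean (k = 1) is finite and the variance (k = 2) is infinite, while every log-normal moment is finite, however large.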
Investigation I: Heavy Tails? (cont.)
Q-Q plots for log-normal (HTTP Response Size data):
log-Normal Q-Q Plot for 1st 50,000 Responses
Visually fit log-Normal Q-Q Plot
Moving window log-Normal Q-Q Plot
Investigation I: Heavy Tails? (cont.)
Q-Q plots for log-normal:
- surprisingly good fit
- not as good as Pareto, but quite close?
- log-normal makes more physical sense??
- does usual “infinite moment” intuition really make sense???
- better ways to think about “heavy tails”????
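A log-normal Q-Q plot is a normal Q-Q plot of the log data, and a minimal sketch follows. It fits the normal by the mean and standard deviation of the logs, which is an assumption made here (the plots above include a visual fit); plotting positions (i + 0.5)/n and the function name are also illustrative. All sizes must be strictly positive for the logs to exist.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def lognormal_qq_points(data):
    """Pairs (fitted normal quantile, sorted log observation) at positions (i + 0.5)/n."""
    logs = sorted(math.log(x) for x in data)
    n = len(logs)
    fitted = NormalDist(mean(logs), stdev(logs))
    return [(fitted.inv_cdf((i + 0.5) / n), y) for i, y in enumerate(logs)]

# Synthetic check on genuinely log-normal data: points should hug the 45 degree line.
rng = random.Random(2)
synthetic = [rng.lognormvariate(8.0, 1.5) for _ in range(5000)]
points = lognormal_qq_points(synthetic)
q_mid, y_mid = points[len(points) // 2]
print(f"middle of plot: theoretical {q_mid:.2f}, empirical {y_mid:.2f}")
```

Running the same function on Pareto data gives a curve that bends away from the line in the upper tail, which is the comparison at issue here.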
Interesting Open Problem
1. Find a "good", precise mathematical definition of:
"heavy tailed" distributions
Some ideas:
- not moment based
- should depend on "range of interest"
- empirical version depends on sample size
- not a number, but a "curve"?
- what will it be used for??
Why Care About Heavy Tails?
Current Folklore (for aggregated data):
Heavy tailed durations => Long Range Dependence
Toy Graphics, Exponential Durations
Toy Graphics, Pareto (1.5) Durations
(caused by the “few elephants”, but mice are there, too)
- Mandelbrot (60's)
- Paxson and Floyd (1995)
- Feldmann, Gilbert and Willinger (1998)
- Riedi and Willinger (1999)
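The toy graphics can be imitated with a small simulation. This is a sketch under stated assumptions, not the construction used for the slides: flows arrive as a Poisson process, each stays active for a random duration, and the aggregated series is the number of active flows at integer times. The arrival rate, series length, lags and function names are all hypothetical.

```python
import math
import random

def poisson_arrivals(rate, horizon, rng):
    """Arrival times of a Poisson process with the given rate on [0, horizon)."""
    times, t = [], rng.expovariate(rate)
    while t < horizon:
        times.append(t)
        t += rng.expovariate(rate)
    return times

def active_counts(arrivals, durations, n_steps):
    """Number of flows active at each integer time t (flow covers [a, a + d))."""
    diff = [0] * (n_steps + 1)
    for a, d in zip(arrivals, durations):
        diff[min(math.ceil(a), n_steps)] += 1      # first integer time >= a
        diff[min(math.ceil(a + d), n_steps)] -= 1  # first integer time >= a + d
    counts, running = [], 0
    for t in range(n_steps):
        running += diff[t]
        counts.append(running)
    return counts

def autocorrelation(series, lag):
    """Sample autocorrelation at the given lag."""
    n = len(series)
    m = sum(series) / n
    var = sum((x - m) ** 2 for x in series)
    cov = sum((series[i] - m) * (series[i + lag] - m) for i in range(n - lag))
    return cov / var

def toy_traffic(duration_sampler, rate=5.0, n_steps=20_000, seed=0):
    """Aggregate flows with Poisson arrivals and durations from duration_sampler."""
    rng = random.Random(seed)
    arrivals = poisson_arrivals(rate, n_steps, rng)
    durations = [duration_sampler(rng) for _ in arrivals]
    return active_counts(arrivals, durations, n_steps)

pareto = toy_traffic(lambda rng: rng.paretovariate(1.5))    # mean duration 3
expo = toy_traffic(lambda rng: rng.expovariate(1.0 / 3.0))  # same mean duration
for lag in (1, 10, 50):
    print(lag, round(autocorrelation(pareto, lag), 3), round(autocorrelation(expo, lag), 3))
```

In this kind of model the autocovariance at lag t is proportional to the integrated tail of the duration distribution beyond t, so Pareto(1.5) durations should give correlations that die off like a power of the lag, while exponential durations give geometric decay. A single simulated run is noisy, so the printed values are only suggestive.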
Reference Details
Mandelbrot, B. B. (1969) Long-run linearity, locally Gaussian processes, H-spectra and infinite variance, International Economic Review, 10, 82-113.
Taqqu, M. and Levy, J. (1986) Using renewal processes to generate LRD and high variability, in: Progress in probability and statistics, E. Eberlein and M. Taqqu eds., Birkhaeuser, Boston, 73-89.
Paxson, V. and Floyd, S. (1995) Wide Area traffic: the failure of Poisson modeling, IEEE/ACM Transactions on Networking, 3, 226-244.
Feldmann, A., Gilbert, A. C. and Willinger, W. (1998) Data networks as cascades: investigating the multifractal nature of Internet WAN traffic, Computer Communication Review, Proceedings of the ACM/SIGCOMM '98, 28, 42-55.
Riedi, R. and Willinger, W. (1999) Toward an improved understanding of network traffic dynamics, in Self-similar Network Traffic and Performance Evaluation, Wiley, New York.
Investigation II: Long Range Dependence?
Question 1:
Is it really there?
- Early conceptions: no
(renders classical queueing theory useless?)
- Very recent work (Cleveland, et al.): not important
- Motivates a very careful look
Investigation II: Long Range Dependence? (cont.)
Time series 1: Aggregated point process data,
1 million Packet Arrival times (from 1998), over ~ 3 minutes
Simple analysis: time series of bin counts
(Caution: different view of data from above Response Sizes)
10,000 bins, ~100 obs’s per bin
Binwidth ~ 0.02 sec
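Turning arrival times into a time series of bin counts can be sketched as below; the function name and the choice to span exactly the observed time range are assumptions. As a sanity check on the numbers above, taking ~3 minutes as 180 sec, 1 million arrivals in 10,000 bins gives 100 per bin on average and a binwidth of 0.018 sec, in line with the ~0.02 sec quoted.

```python
def bin_counts(arrival_times, n_bins):
    """Count arrivals in n_bins equal-width bins spanning the observed time range."""
    start, end = min(arrival_times), max(arrival_times)
    width = (end - start) / n_bins
    counts = [0] * n_bins
    for t in arrival_times:
        # The last arrival sits exactly on the right edge; keep it in the final bin.
        k = min(int((t - start) / width), n_bins - 1)
        counts[k] += 1
    return counts, width
```

For example, `bin_counts(times, 10_000)` on the packet arrival times would return the series of counts together with the binwidth in the same time units as the input.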