Statistical Analysis and
of Internet Traffic Data
Course Meetings:
Time: Mon. - Wed. 8:40 - 9:55
Room: Rhodes 471
Course Web Site:
Instructor: J. S. (Steve) Marron
Office: Rhodes
Office Hours:
Mon. 10 - 11, Tuesday 11 - 12
Phone: (607)
Course Email List:
please add yourself,
by sending an email with
"subscribe" as the subject,
(useful for announcements, such as "notes now posted")
Course Work / Grading
Based on a presentation
- can be either a paper by others (you choose, or I suggest)
- or your own work
let's discuss soon
Last Time:
- Big picture of Internet traffic
- Studying Response Size Distributions
analysis showed non-standard dist'n
with statistically significant "bumps"
- Introduced Q-Q plots, and assessment of variability
via simulated envelope
- Q-Q plots suggest possible Pareto fits
Investigation I: Heavy Tails? (cont.)
Q-Q plot for full 734,814 HTTP Response Sizes:
- Pareto(1.2) good fit in tails?
- surprisingly good?
- have added sim’d overlay
- nearly no “variability”? (except at ends)
Shape parameter 1.2 has
poor fit for main dist’n (as expected)
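The Q-Q plot with simulated overlay described above can be sketched as follows. This is only an illustration under assumptions not stated in these notes: a Pareto Type I parametrization, F(x) = 1 - (scale/x)^shape, a log10 axis, and a pointwise min/max envelope over simulated samples. The function names and the synthetic data are hypothetical stand-ins for the HTTP response sizes.

```python
import math
import random

def pareto_quantile(p, shape, scale=1.0):
    """Quantile of Pareto Type I (assumed form): F(x) = 1 - (scale/x)**shape."""
    return scale * (1.0 - p) ** (-1.0 / shape)

def qq_points(data, shape, scale=1.0):
    """Return (theoretical, empirical) quantiles, both on a log10 scale."""
    xs = sorted(data)
    n = len(xs)
    probs = [(i + 0.5) / n for i in range(n)]
    theo = [math.log10(pareto_quantile(p, shape, scale)) for p in probs]
    emp = [math.log10(x) for x in xs]
    return theo, emp

def simulated_envelope(n, shape, scale=1.0, n_sim=100, seed=0):
    """Pointwise min and max of sorted log10 samples over n_sim simulations."""
    rng = random.Random(seed)
    lo = [math.inf] * n
    hi = [-math.inf] * n
    for _ in range(n_sim):
        sim = sorted(math.log10(scale * rng.paretovariate(shape))
                     for _ in range(n))
        lo = [min(a, b) for a, b in zip(lo, sim)]
        hi = [max(a, b) for a, b in zip(hi, sim)]
    return lo, hi

if __name__ == "__main__":
    # Synthetic stand-in data, NOT the response sizes from the lecture.
    rng = random.Random(1)
    data = [rng.paretovariate(1.2) for _ in range(2000)]
    theo, emp = qq_points(data, shape=1.2)
    lo, hi = simulated_envelope(len(data), shape=1.2, n_sim=50)
    outside = sum(1 for e, a, b in zip(emp, lo, hi) if e < a or e > b)
    print(f"{outside} of {len(emp)} points fall outside the envelope")
```

Plotting `emp` against `theo`, with `lo` and `hi` drawn against the same `theo` values, gives the overlay; a narrow band in the middle and a wide band at the ends would correspond to the "nearly no variability (except at ends)" remark.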
Q-Q Plots (cont.)
Pareto parameter estimation:
Pareto Quantile matching
Conventional estimation methods "tricky"
Likelihood requires numerical optimization
Method of Moments fails
Want "tails", not "body"
Q-Q Plots (cont.)
Parameter Estimation: Pareto quantile matching
Estimate "scale parameter"
and "shape parameter"
to match quantiles
(for ):
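A minimal sketch of quantile matching, assuming the Pareto Type I form F(x) = 1 - (scale/x)^shape (the notes do not show which parametrization or which quantile levels were used, so `p1`, `p2` and the function name are hypothetical). With quantile function q(p) = scale * (1 - p)^(-1/shape), two empirical quantiles determine both parameters in closed form.

```python
import math

def pareto_quantile_match(data, p1=0.5, p2=0.99):
    """Match two empirical quantiles to an assumed Pareto Type I model.

    Solves q(p1) = q1 and q(p2) = q2 for (scale, shape), where
    q(p) = scale * (1 - p) ** (-1 / shape). Requires positive data.
    """
    if not 0.0 < p1 < p2 < 1.0:
        raise ValueError("need 0 < p1 < p2 < 1")
    xs = sorted(data)
    n = len(xs)
    q1 = xs[min(n - 1, int(p1 * n))]
    q2 = xs[min(n - 1, int(p2 * n))]
    if q1 <= 0 or q2 <= q1:
        raise ValueError("need 0 < q1 < q2 for a Pareto fit")
    # Ratio of the two quantile equations eliminates the scale.
    shape = math.log((1.0 - p1) / (1.0 - p2)) / math.log(q2 / q1)
    # Back-substitute into q(p1) = q1 to recover the scale.
    scale = q1 * (1.0 - p1) ** (1.0 / shape)
    return scale, shape
```

Choosing both levels in the upper part of the distribution focuses the fit on the "tails" rather than the "body", and no numerical optimization is needed.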
Investigation I: Heavy Tails? (cont.)
Study variability for only 1st 50,000:
Plot for 1st 50,000 Responses
- Quantile matched parameter estimates
- Small enough to do sim’d overlay
- Variation very small over wide range (but not in "tails")
- “small wiggles” actually outside the simulated envelope
Shape parameter changes to 1.5?!?
Investigation I: Heavy Tails? (cont.)
How different are shape parameters
1.2 and 1.5?
Use parameters estimated for full 734,814:
Plot for 1st 50,000, given parameters
expect big difference in tail behaviors?
seems like small difference (over range of interest)?
better fit in tails?
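One rough way to quantify "how different" is to compare the quantiles of the two Pareto models on a log10 scale over the probability range of interest. The sketch below assumes Pareto Type I and uses scale 1 for both models by default; the fitted scale parameters are not given in these notes, so these defaults are placeholders.

```python
import math

def pareto_quantile(p, shape, scale=1.0):
    """Quantile of Pareto Type I (assumed form): F(x) = 1 - (scale/x)**shape."""
    return scale * (1.0 - p) ** (-1.0 / shape)

def log10_quantile_gap(p, shape_a, shape_b, scale_a=1.0, scale_b=1.0):
    """Difference of log10 quantiles at probability p between two Pareto models."""
    qa = pareto_quantile(p, shape_a, scale_a)
    qb = pareto_quantile(p, shape_b, scale_b)
    return math.log10(qa) - math.log10(qb)

if __name__ == "__main__":
    for p in (0.5, 0.9, 0.99, 0.999, 0.9999):
        gap = log10_quantile_gap(p, 1.2, 1.5)
        print(f"p = {p}: log10 quantile gap = {gap:.3f}")
```

With separately fitted scales, the gap can stay small over the central range and only open up far in the tail, which is one way to read the "small difference over range of interest" observation.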
Alternate application of Q-Q Plots
Suggested by change in est’d shape parameters:
Study possible “non-stationarity”
Approach: moving window Q-Q of 50,000, through 1 mil.
- Clear non-stationarity
- tails sometimes “heavier”, sometimes “lighter”
- and is "statistically significant"
diurnal (i.e. "time of day") effect???
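The moving window idea can also be summarized numerically. The sketch below swaps the moving Q-Q plot for a simpler summary: a quantile-matched shape estimate per window, assuming a Pareto Type I model. Window size, step, quantile levels and the synthetic data are all hypothetical choices.

```python
import math
import random

def shape_by_quantile_match(window, p1=0.5, p2=0.99):
    """Pareto Type I shape estimate from two empirical quantiles of a window."""
    xs = sorted(window)
    n = len(xs)
    q1 = xs[int(p1 * n)]
    q2 = xs[int(p2 * n)]
    return math.log((1.0 - p1) / (1.0 - p2)) / math.log(q2 / q1)

def moving_window_shapes(data, window, step):
    """Shape estimate for each window of length `window`, advanced by `step`."""
    return [shape_by_quantile_match(data[s:s + window])
            for s in range(0, len(data) - window + 1, step)]

if __name__ == "__main__":
    # Synthetic non-stationary stand-in: the tail gets lighter halfway through.
    rng = random.Random(0)
    data = ([rng.paretovariate(1.2) for _ in range(20000)]
            + [rng.paretovariate(1.5) for _ in range(20000)])
    for i, a in enumerate(moving_window_shapes(data, window=10000, step=5000)):
        print(f"window {i}: estimated shape {a:.2f}")
```

Estimates that drift by more than their window-to-window sampling noise would point to non-stationarity; judging significance still needs something like the simulated envelope.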
Alternate application of Q-Q Plots
How non-stationary?
Explore shape parameter = 2
Q-Q plot at the edge of finite variance
Similar nonstationarity
tails sometimes “this light”
only at "systematic time points"???
Investigation I: Heavy Tails? (cont.)
Recent Controversy:
Downey (2000)
Downey, A. B. (2000) The structural cause of file size distributions, Wellesley College Tech. Report
Does log-normal fit better
than Pareto?
(studied distribution of
"file sizes", but not so different)
Investigation I: Heavy Tails? (cont.)
1st reaction: that is ridiculous!
- Pareto(1.5) has infinite moments of order >= 1.5 (finite mean, infinite variance)
- log-normal has all moments finite
- Pareto fits the data
(pretty well?!?)
2nd thoughts:
- careful, internet traffic is:
- slippery
- full of surprises
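For reference, the standard moment formulas behind the first reaction, written for a Pareto Type I distribution with shape α and scale x_m (an assumed parametrization) and a log-normal with parameters μ and σ:

```latex
% Pareto (Type I), shape \alpha > 0, scale x_m > 0
E[X^k] =
\begin{cases}
  \dfrac{\alpha\, x_m^{k}}{\alpha - k}, & k < \alpha, \\[1ex]
  \infty, & k \ge \alpha,
\end{cases}
\qquad
% log-normal: \log X \sim N(\mu, \sigma^2)
E[X^k] = \exp\!\left(k\mu + \tfrac{1}{2} k^{2} \sigma^{2}\right) < \infty
\quad \text{for all } k .
```

So for α = 1.5 the mean exists but the variance does not, while every log-normal moment is finite (though it grows very fast in k when σ is large).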
Investigation I: Heavy Tails? (cont.)
Q-Q plots for log-normal
(HTTP Response Size data):
Q-Q Plot for 1st 50,000 Responses
fit log-Normal Q-Q Plot
window log-Normal Q-Q Plot
Investigation I: Heavy Tails? (cont.)
Q-Q plots for log-normal:
surprisingly good fit
not as good as Pareto, but quite close?
log-normal makes more physical sense??
does usual “infinite moment” intuition really make sense???
better ways to think about “heavy tails”????
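A log-normal Q-Q comparison of this kind can be sketched as below. The fit here uses the mean and standard deviation of the log data, which is an assumption; the notes do not say how the log-normal parameters were fitted. The synthetic sample is a stand-in for the response sizes.

```python
import math
import random
from statistics import NormalDist, fmean, stdev

def lognormal_qq(data):
    """Q-Q points (on the log scale) against a fitted log-normal.

    Returns (theoretical, empirical, (mu, sigma)), where mu and sigma are
    the mean and standard deviation of log(data). Requires positive data.
    """
    logs = sorted(math.log(x) for x in data)
    n = len(logs)
    mu, sigma = fmean(logs), stdev(logs)
    fitted = NormalDist(mu, sigma)
    theo = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    return theo, logs, (mu, sigma)

if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.lognormvariate(1.0, 2.0) for _ in range(5000)]
    theo, emp, (mu, sigma) = lognormal_qq(data)
    worst = max(abs(t - e) for t, e in zip(theo, emp))
    print(f"mu = {mu:.2f}, sigma = {sigma:.2f}, max |gap| = {worst:.2f}")
```

Running the same data through a Pareto Q-Q and a log-normal Q-Q, each with its own simulated envelope, is one way to compare the two fits over the range of interest.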
Interesting Open Problem
1. Find a "good", precise mathematical definition of:
"heavy tailed" distributions
Some ideas:
- not moment based
- should depend on "range of interest"
- empirical version depends on sample size
- not a number, but a "curve"?
what will it be used for??
Why Care About Heavy Tails?
Current Folklore (for aggregated traffic):
Heavy tailed durations lead to Long Range Dependence
Toy Graphics, Exponential Durations
Toy Graphics, Pareto (1.5) Durations
(caused by the “few elephants”, but mice are there, too)
- Mandelbrot (60's)
- Paxson and Floyd (1995)
- Feldmann, Gilbert and Willinger (1998)
- Riedi and Willinger (1999)
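A toy simulation in the spirit of the two graphics can be sketched as follows. This is not necessarily how the lecture's pictures were made: it assumes flows that start at uniformly random times, last for either exponential or Pareto(1.5) durations with the same mean (3 time units), and are aggregated by counting active flows at integer time points. All names and parameter values are hypothetical.

```python
import math
import random

def active_counts(starts, durations, n_bins):
    """Number of flows active at each integer time t = 0, ..., n_bins - 1.

    A flow is active at t when start <= t < start + duration.
    """
    diff = [0] * (n_bins + 1)
    for s, d in zip(starts, durations):
        a = max(0, math.ceil(s))            # first integer time covered
        b = min(n_bins, math.ceil(s + d))   # first integer time no longer covered
        if a < b:
            diff[a] += 1
            diff[b] -= 1
    counts, running = [], 0
    for t in range(n_bins):
        running += diff[t]
        counts.append(running)
    return counts

def autocorr(xs, lag):
    """Sample autocorrelation of xs at the given lag."""
    n = len(xs)
    m = sum(xs) / n
    var = sum((x - m) ** 2 for x in xs)
    cov = sum((xs[i] - m) * (xs[i + lag] - m) for i in range(n - lag))
    return cov / var

def simulate(duration_sampler, n_bins=20000, rate=5.0, seed=0):
    """Aggregate flow counts for durations drawn by duration_sampler(rng)."""
    rng = random.Random(seed)
    n_flows = int(rate * n_bins)
    starts = [rng.uniform(0, n_bins) for _ in range(n_flows)]
    durations = [duration_sampler(rng) for _ in range(n_flows)]
    return active_counts(starts, durations, n_bins)

if __name__ == "__main__":
    expo = simulate(lambda rng: rng.expovariate(1 / 3.0))   # mean 3
    heavy = simulate(lambda rng: rng.paretovariate(1.5))    # mean 3
    for lag in (1, 10, 100):
        print(f"lag {lag}: exponential {autocorr(expo, lag):.3f}, "
              f"Pareto(1.5) {autocorr(heavy, lag):.3f}")
```

Comparing how quickly the two autocorrelations decay with lag is a simple check of the folklore: a handful of very long flows (the "elephants") keep the aggregate correlated over long time spans.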
Reference Details
Mandelbrot, B. B. (1969) Long-run linearity, locally Gaussian processes, H-spectra and infinite variance, International Economic Review, 10, 82-113.
Taqqu, M. and Levy, J. (1986) Using renewal processes to generate LRD and high variability, in: Progress in probability and statistics, E. Eberlein and M. Taqqu eds. Birkhaeuser, Boston, 73-89.
Paxson, V. and Floyd, S. (1995) Wide Area traffic: the failure of Poisson modeling, IEEE/ACM Transactions on Networking, 3, 226-244.
Feldmann, A., Gilbert, A. C. and Willinger, W. (1998) Data networks as cascades: investigating the multifractal nature of Internet WAN traffic, Computer Communication Review, Proceedings of the ACM/SIGCOMM '98, 28, 42-55.
Riedi, R. and Willinger, W. (1999) Toward an improved understanding of network traffic dynamics, in Self-similar Network Traffic and Performance Evaluation, Wiley, New York.
Investigation II: Long Range Dependence?
Question 1:
Is it really there?
- Early conceptions: no
(renders classical queueing theory useless?)
- Very recent work (Cleveland, et al.): not important
Motivates a very careful look
Investigation II: Long Range Dependence? (cont.)
Time series 1: Aggregated point process data,
1 million Packet Arrival times (from 1998), over ~ 3 minutes
Simple analysis: time series of bin counts
(Caution: different view of data from above Response Sizes)
10,000 bins, ~100 obs’s per bin
Binwidth ~ 0.02 sec
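The bin-count time series can be formed as sketched below. The arithmetic matches the numbers above: about 3 minutes (180 sec) split into 10,000 bins gives a bin width of about 0.018 sec, and 1 million arrivals give about 100 per bin. The function name and the synthetic Poisson arrivals are illustrative only, not the 1998 trace.

```python
import random

def bin_counts(arrival_times, n_bins):
    """Count arrivals in n_bins equal-width bins spanning the observed range.

    Returns (counts, bin_width). The last arrival is placed in the final bin.
    """
    t0, t1 = min(arrival_times), max(arrival_times)
    if t1 <= t0:
        raise ValueError("arrival times must span a positive interval")
    width = (t1 - t0) / n_bins
    counts = [0] * n_bins
    for t in arrival_times:
        k = min(int((t - t0) / width), n_bins - 1)
        counts[k] += 1
    return counts, width

if __name__ == "__main__":
    # Scaled-down synthetic stand-in: 100,000 Poisson arrivals, 1,000 bins.
    rng = random.Random(0)
    t, times = 0.0, []
    for _ in range(100_000):
        t += rng.expovariate(100_000 / 18.0)  # mean spacing 18 sec / 100,000
        times.append(t)
    counts, width = bin_counts(times, n_bins=1000)
    print(f"bin width ~ {width:.4f} sec, mean count per bin = "
          f"{sum(counts) / len(counts):.1f}")
```

The resulting list of counts is the time series whose dependence structure is examined next.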