Class Notes 10/15/01
Last Time:
- in general context of:
Heavy tailed durations
Long Range Dependence
- revisited simple model assumptions
- considered modifications for start time process
- Weibull process improvement, not compelling
- Cluster Poisson process gave useful improvement?
Zooming Spectral Analysis
Recall "zooming autocorrelation" analysis: graphic
- worked with binned data
- across wide range of scales
- i.e. binwidths 0.0003 - 0.3 secs
- expanding range of 10,000 bins
- looked like "white noise" at fine scales
- apparent Long Range Dependence at larger scales
- surprising "lifting" between scales
- explained by simple calculations
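
A minimal sketch of the binned-data step recalled above, assuming the raw data are event times in seconds. The names `bin_counts` and `autocorrelation` are illustrative, not from the lecture:

```python
def bin_counts(times, binwidth, n_bins):
    """Count events in [k*binwidth, (k+1)*binwidth) for k = 0 .. n_bins-1.

    Events outside the covered window are ignored.
    """
    counts = [0] * n_bins
    for t in times:
        if t < 0:
            continue
        k = int(t / binwidth)
        if k < n_bins:
            counts[k] += 1
    return counts


def autocorrelation(x, max_lag):
    """Sample autocorrelation at lags 0 .. max_lag."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t + h] for t in range(n - h)) / c0
        for h in range(max_lag + 1)
    ]
```

"Zooming" then amounts to repeating `autocorrelation(bin_counts(times, w, 10000), max_lag)` for a sequence of binwidths `w` between 0.0003 and 0.3 secs.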
Zooming Spectral Analysis (cont.)
Bill Cleveland suggestion:
similar analysis, replacing autocorrelation by periodogram
- introduced while studying LRD
- essentially Fourier transform of autocovariance
- more "natural scale"?
- finds "periodicities" in data
- log-log plot is linear near frequency 0 for LRD (slope reflects LRD)
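
A minimal sketch of the periodogram and of a log-log slope near frequency 0. The 1/n normalization, the mean removal, and the names `periodogram` and `loglog_slope` are assumptions made for illustration. A plain DFT is used so the sketch needs no libraries; for long series an FFT would replace the inner sum:

```python
import cmath
import math


def periodogram(x):
    """Raw periodogram |sum_t (x_t - mean) exp(-2 pi i k t / n)|^2 / n
    at Fourier frequencies k/n (cycles per bin), k = 1 .. n//2."""
    n = len(x)
    mean = sum(x) / n
    freqs, power = [], []
    for k in range(1, n // 2 + 1):
        s = sum(
            (x[t] - mean) * cmath.exp(-2j * math.pi * k * t / n)
            for t in range(n)
        )
        freqs.append(k / n)
        power.append(abs(s) ** 2 / n)
    return freqs, power


def loglog_slope(freqs, power, n_low):
    """Least-squares slope of log(power) on log(frequency),
    using only the n_low lowest frequencies."""
    lx = [math.log(f) for f in freqs[:n_low]]
    ly = [math.log(p) for p in power[:n_low]]
    mx = sum(lx) / n_low
    my = sum(ly) / n_low
    sxx = sum((a - mx) ** 2 for a in lx)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    return sxy / sxx
```

How many low frequencies to use (`n_low`) is a judgment call that the notes do not settle.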
Zooming Spectral Analysis (cont.)
1st View: conventional axes
power vs. frequency
- can't see anything at smallest scales (binwidths)
- see a little bit at largest scales
- all for very low frequencies
- clearly not correct view
- "improper use" of graph area
- logs seem indicated
Zooming Spectral Analysis (cont.)
2nd View: log-log view
- stretches both ways
- so now "properly use graph area"
- clear systematic behavior
- explanation?
- where is "time" as zooming is done?
- purple lines show "highest frequency for each scale"
- move to right one step in each frame
- "total power" increases (recall larger data window)
- LRD (slope at 0) increases??
Zooming Spectral Analysis (cont.)
Simple Analysis: study effect of "combining bins" on periodogram (at a given frequency)
Assume underlying spectral density is continuous
(recall periodogram is a crude estimate of spectral density)
Then can show:
- simple multiplicative relationship
- power is doubled
- "effective frequency" is doubled
- interesting to "control by scaling"
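
A small numerical check of the "power is doubled" claim, as a sketch under stated assumptions: Gaussian white noise stands in for the binned counts, and the 1/n periodogram normalization is an illustrative choice. Combining adjacent bins should roughly double the flat periodogram level. A fixed frequency index k also corresponds to twice the per-bin frequency k/n after combining, since n is halved:

```python
import cmath
import math
import random


def combine_bins(x):
    """Sum adjacent pairs of bins (a trailing odd bin is dropped)."""
    return [x[i] + x[i + 1] for i in range(0, len(x) - 1, 2)]


def mean_periodogram(x):
    """Average of the raw periodogram over Fourier frequencies k = 1 .. n//2."""
    n = len(x)
    mean = sum(x) / n
    m = n // 2
    total = 0.0
    for k in range(1, m + 1):
        s = sum(
            (x[t] - mean) * cmath.exp(-2j * math.pi * k * t / n)
            for t in range(n)
        )
        total += abs(s) ** 2 / n
    return total / m


rng = random.Random(0)
fine = [rng.gauss(0.0, 1.0) for _ in range(1024)]  # white noise at the fine scale
coarse = combine_bins(fine)                        # binwidth doubled
ratio = mean_periodogram(coarse) / mean_periodogram(fine)
print(ratio)  # expected to be near 2, up to sampling noise
```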
Zooming Spectral Analysis (cont.)
3rd View: normalized power
- i.e. divide by 2 (vertically) in each frame
- high frequency power stays constant
- as predicted by above theory
- low frequency power grows
- but only as purple lines move across
- i.e. extend into lower and lower frequencies
Zooming Spectral Analysis (cont.)
4th view: normalized power and frequency
- rescale time so purple bars remain constant
- see increase happening at lowest frequencies
- Periodogram seems to follow:
- horizontal line at high frequencies
- sloped line at low frequencies
- consistent with mixture of white noise & LRD process
- could do parameter estimation?
- picture much more clear than zooming periodogram??
- maybe more on this later???
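
One possible way to approach the parameter estimation question, sketched under strong assumptions: the periodogram is modeled as a flat level plus a power law, power ~ c0 + c1 * freq**(-beta). The two-stage fit and the name `fit_flat_plus_power_law` are hypothetical illustrations, not a method from the lecture:

```python
import math


def fit_flat_plus_power_law(freqs, power, n_low, n_high):
    """Crude two-stage fit of power ~ c0 + c1 * freq**(-beta).

    freqs must be sorted in increasing order.
    Stage 1: c0 (white noise level) = mean power over the n_high highest frequencies.
    Stage 2: log-log least squares on (power - c0) over the n_low lowest frequencies,
             skipping points where the excess is not positive.
    Returns (c0, c1, beta).
    """
    c0 = sum(power[-n_high:]) / n_high
    pts = [
        (math.log(f), math.log(p - c0))
        for f, p in zip(freqs[:n_low], power[:n_low])
        if p > c0
    ]
    m = len(pts)
    mx = sum(a for a, _ in pts) / m
    my = sum(b for _, b in pts) / m
    sxx = sum((a - mx) ** 2 for a, _ in pts)
    sxy = sum((a - mx) * (b - my) for a, b in pts)
    slope = sxy / sxx
    c1 = math.exp(my - slope * mx)
    return c0, c1, -slope
```

The flat-level estimate is biased upward whenever the sloped component has not died out at the highest frequencies, so this is only a starting point.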
Revisit Flow Duration Distributions
Recall Mice and Elephants plot:
Studied "distribution of lengths"
Q-Q plots showed Pareto gave reasonable fit?
Interesting point:
Estimated shape parameter
Quite different from earlier HTTP Response Size Analysis,
where had shape parameters
in range
Revisit Flow Duration Distributions (cont.)
An explanation of the difference: Theory of
Residual Life Time Distributions
also known as
Forward Recurrence Time Distributions
Idea: study Mice and Elephants line lengths (flow life times),
conditioned on surviving to a given point
Reason: big problems with "truncation" of flows
i.e. they could have started earlier
Revisit Flow Duration Distributions (cont.)
Theoretical (in tail limit) adjustment:
Reference: Example 3.5.3, page 214, of Resnick, S. I. (1992) Adventures in Stochastic Processes, Birkhäuser.
Soundbite summary: for Pareto distribution, corresponding Residual Life distribution should have shape parameter reduced by 1
- since integrate Pareto tail
- consistent with above values
- but reduction smaller than 1 (~ only 0.6-0.7)??
- driven by boundary effects?
- not full Residual Life setting???
- an artifact of crude parameter estimation?
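
A short worked version of the "integrate Pareto tail" step, as a sketch. It assumes the Pareto form with survival function $(x/x_m)^{-\alpha}$ for $x \ge x_m$, shape $\alpha > 1$ (finite mean), and the standard stationary renewal formula for the residual life; the symbols $x_m$, $\mu$ and $\bar F_e$ are notation introduced here, not taken from the lecture:

```latex
\bar F(x) = \left(\frac{x}{x_m}\right)^{-\alpha}, \quad x \ge x_m,
\qquad
\mu = \int_0^\infty \bar F(u)\,du = \frac{\alpha\, x_m}{\alpha - 1}.

\bar F_e(x) = \frac{1}{\mu} \int_x^\infty \bar F(u)\,du
            = \frac{\alpha - 1}{\alpha\, x_m} \cdot \frac{x_m^{\alpha}\, x^{1-\alpha}}{\alpha - 1}
            = \frac{1}{\alpha} \left(\frac{x}{x_m}\right)^{-(\alpha - 1)},
\quad x \ge x_m.
```

So the residual life tail is again a power law, with exponent $\alpha - 1$ in place of $\alpha$.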
Revisit Flow Duration Distributions (cont.)
Try alternate (more dodgy?) parameter estimates
- goal: see how much difference this can make
- tried: twiddling matched quantiles
- to "really different region"
- result: not too much change in estimated shape parameter
- and went further in support of "residual life" idea
- conclude: some residual life effect is present
(for these data)
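
A minimal sketch of a Pareto fit by matching two quantiles, assuming the Pareto form P(X > x) = (x_min / x)**alpha for x >= x_min. The notes do not say which Pareto parameterization was used, and the name `pareto_from_two_quantiles` is illustrative:

```python
import math


def pareto_from_two_quantiles(q1, p1, q2, p2):
    """Return (alpha, x_min) such that the Pareto distribution with
    P(X > x) = (x_min / x)**alpha has p1-quantile q1 and p2-quantile q2.

    Requires 0 < p1 < p2 < 1 and 0 < q1 < q2.
    """
    # From 1 - p = (x_min / q)**alpha at both quantiles:
    alpha = (math.log(1 - p1) - math.log(1 - p2)) / (math.log(q2) - math.log(q1))
    x_min = q1 * (1 - p1) ** (1 / alpha)
    return alpha, x_min
```

"Twiddling matched quantiles" then corresponds to calling this with different `(p1, p2)` pairs, with `q1` and `q2` read off the empirical quantiles, and comparing the resulting `alpha` values.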
Revisit Flow Duration Distributions (cont.)
Variation: recall Pareto vs. log-normal controversy
e.g. Downey's motivation of log normal
- log normal fit is "about as good" as Pareto fits??
- none are all that good
- expected, because of previously described boundary effects
Revisit Flow Duration Distributions (cont.)
Open problems:
What is the Residual Life Distribution for the log normal?
Again log normal, or different shape?
Can this give new insights regarding the controversy?
If data are log normal, and a Pareto is fit to two quantiles, will the residual life time distribution still have the correspondingly adjusted Pareto fit?
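
A sketch for exploring the first of these questions numerically. It assumes the stationary renewal form of the residual life, P(residual > x) = (1/mean) * integral from x to infinity of P(X > u) du, and uses the standard closed form of E[(X - x)+] for a log normal X. The parameter names `m`, `s` (mean and standard deviation of log X) and all function names are illustrative:

```python
import math


def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def lognormal_sf(x, m, s):
    """P(X > x) for log X ~ Normal(m, s^2), x > 0."""
    return 1.0 - norm_cdf((math.log(x) - m) / s)


def lognormal_residual_sf(x, m, s):
    """P(residual life > x) = E[(X - x)+] / E[X], x > 0.

    Subtracts two similar numbers, so it loses accuracy far out in the tail.
    """
    d1 = (m + s * s - math.log(x)) / s
    d2 = (m - math.log(x)) / s
    return norm_cdf(d1) - x * math.exp(-(m + 0.5 * s * s)) * norm_cdf(d2)


def local_tail_index(sf, x, m, s, step=1.01):
    """Minus the local log-log slope of a survival function at x
    (constant and equal to the shape parameter for an exact Pareto tail)."""
    return -(math.log(sf(x * step, m, s)) - math.log(sf(x, m, s))) / math.log(step)
```

Comparing `local_tail_index(lognormal_sf, x, m, s)` with `local_tail_index(lognormal_residual_sf, x, m, s)` over a range of `x` is one way to see how the residual life adjustment behaves when the underlying data are log normal rather than Pareto.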