Class Notes 9/17/01
Last Time:
- Finished (?) study of heavy tails
- Began study of "Long Range Dependence"
- via correlation analysis (sensible??)
- in context of:
Course Goals:
Explore Internet Traffic from several viewpoints
Highlight interesting open problems
Promote possible joint research
Maximize understanding by all class members
Wednesday's meeting:
(Sound Bite) Introduction
to Time Series Analysis
- Autocorrelation
- ARMA process
- Periodogram
- Partial Correlation
- ARIMA processes
- Long Range dependence
Fractional ARIMA processes
Investigation II: Long Range Dependence?
Question 1:
Is it really there?
- Early conceptions: no
(renders classical queueing theory useless?)
Current thought: yes
Very recent work (Cleveland et al.): not important
Motivated zooming autocorrelation view.
Revealed "both viewpoints correct" depending on scale
Surprised at "how dependence comes in"?
Expected "lump of dependence" coming in from right??
Investigation II: Long Range Dependence? (cont.)
Notion of large lump on right (in autocorr.):
consistent with “periodicities”?
Caution 1: periodicities imply a large lump,
but not clear that a large lump implies periodicities
Caution 2: TCP has its periodicities
An aside about aggregation
A tempting idea:
"packet loss effects will kill independence at small scales"
BUT: aggregated data
say something different
AN EXPLANATION: depends on where loss occurs:
- loss at link where measuring? then YES
far away from measurement point? then NO
Recall simple view of the Internet:
Current situation:
- Backbone is "over-provisioned"
(working at 5-10% capacity)
- Loss occurs mostly at "edges"
(or between backbones)
- Thus aggregation of these could be independent
(since loss is happening at many different places)
Investigation II: Long Range Dependence? (cont.)
Observed effects due to data
Time Series 2: for increasing sequences of 10,000 bins
(table columns: time scale, # obs. / bin, total length)
Major Problem:
assumes “stationarity”
Investigation II: Long Range Dependence? (cont.)
- can’t distinguish from indep. at small scales
- strong dependence at larger scales
- “vertical lifting of dependence”
not “coming in from right”
- looking at too narrow a lag range?
where are “times” in zooming auto-correlation?
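A minimal sketch of the zooming autocorrelation computation, in Python with NumPy. The function names (`sample_autocorr`, `aggregate`, `zooming_autocorr`, `synthetic_counts`) and the toy traffic model (Poisson counts whose rate drifts slowly) are assumptions made for illustration; they are not the data or the code behind the plots discussed here. The sketch combines adjacent bins to reach coarser time scales, keeps a fixed number of bins at each scale, and computes the sample autocorrelation of each resulting series.

```python
import numpy as np

def sample_autocorr(x, max_lag):
    """Sample autocorrelation of x at lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / denom
                     for k in range(max_lag + 1)])

def aggregate(counts, factor):
    """Sum each run of `factor` adjacent bins (any leftover tail is dropped)."""
    counts = np.asarray(counts)
    n = (len(counts) // factor) * factor
    return counts[:n].reshape(-1, factor).sum(axis=1)

def zooming_autocorr(counts, factors, n_bins, max_lag):
    """Autocorrelation of the first n_bins bins at each aggregation factor."""
    return {f: sample_autocorr(aggregate(counts, f)[:n_bins], max_lag)
            for f in factors}

def synthetic_counts(n, seed=0):
    """Hypothetical traffic: Poisson counts with a slowly drifting rate."""
    rng = np.random.default_rng(seed)
    phi, sd = 0.999, 0.2                      # persistence and spread of the rate
    eps = rng.normal(0.0, sd * np.sqrt(1 - phi**2), n)
    z = np.empty(n)
    z[0] = rng.normal(0.0, sd)
    for t in range(1, n):
        z[t] = phi * z[t - 1] + eps[t]        # AR(1) drift in the rate
    return rng.poisson(np.clip(1.0 + z, 0.0, None))

counts = synthetic_counts(2**16)
acfs = zooming_autocorr(counts, factors=[1, 8, 64], n_bins=1000, max_lag=20)
for f, acf in acfs.items():
    print(f"bins combined: {f:3d}   lag-1 autocorr: {acf[1]: .3f}")
```

In this toy model the finest scale is dominated by Poisson noise and is hard to tell from independence, while the coarser scales show clearly positive autocorrelation. The time span covered grows with the aggregation factor because the number of bins is held fixed.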
Investigation II: Long Range Dependence? (cont.)
Larger lag range &
“time markers”
cyan bar shows old lag boundary
yellow bars show how time zooms
vertical lift not completely level
but still doesn’t “move in from right”
instead “lifts first on left”
Investigation II: Long Range Dependence? (cont.)
Zooming Autocorrelation 4: Time invariant view
Rescale to fix yellow time bars
expect “curve follows mountains of dependence”
from “dependence at time scale” model
instead see “dependence increasing with scale”
Explanation: simple cross scale calculation
Hannig, J., Marron, J. S. and Riedi, R. H. (2001) Zooming statistics: Inference across scales,
Journal of the Korean Statistical Society, 30, 327-353.
Idea: Compare autocorrelation
when adjacent bins are combined:
relate a lag at the combined (coarser) scale
to the corresponding lags at the original (finer) scale
Can show:
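One standard form of such a relation, written out under assumptions that are not stated in these notes: the bin counts $X_i$ are weakly stationary with variance $\sigma^2$ and autocorrelation $\rho(k)$, and $Y_j = X_{2j-1} + X_{2j}$ is the series obtained by combining adjacent bins, with autocorrelation $\rho^{(2)}(k)$. The symbols are chosen here for illustration and may differ from those in the cited paper.

```latex
\begin{align*}
\operatorname{Var}(Y_j) &= 2\sigma^2\bigl(1+\rho(1)\bigr),\\
\operatorname{Cov}(Y_j, Y_{j+k}) &= \sigma^2\bigl[\rho(2k-1) + 2\rho(2k) + \rho(2k+1)\bigr], \quad k \ge 1,\\
\rho^{(2)}(k) &= \frac{\rho(2k-1) + 2\rho(2k) + \rho(2k+1)}{2\bigl(1+\rho(1)\bigr)}.
\end{align*}
```

If $\rho$ is identically 0 then so is $\rho^{(2)}$. If $\rho(m) \approx \rho$ for all the lags involved (slow decay), then $\rho^{(2)}(k) \approx 2\rho/(1+\rho)$, which is close to $2\rho$ when $\rho$ is small and close to $\rho$ when $\rho$ is near 1.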
Explanation (cont.)
- when really uncorr’d, always stays at 0
- slight positive autocorr. magnified by about 2
- big lift for small lag one autocorr.
- small lift for large lag one autocorr.
- small scale Poisson model is not correct
but still OK as a fine scale approximation???
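The points above can be checked numerically with a small sketch. It assumes a weakly stationary series with lag-m autocorrelation rho[m], for which summing adjacent pairs of bins gives lag-k autocorrelation (rho[2k-1] + 2 rho[2k] + rho[2k+1]) / (2 (1 + rho[1])). The function name `combined_autocorr` and the flat autocorrelation values used as input are hypothetical, chosen only to mimic slow decay.

```python
def combined_autocorr(rho, k=1):
    """Lag-k autocorrelation after summing adjacent pairs of bins.

    rho[m] is the lag-m autocorrelation of the original series (rho[0] == 1).
    Assumes a weakly stationary series; needs rho up to lag 2k + 1.
    """
    return (rho[2 * k - 1] + 2 * rho[2 * k] + rho[2 * k + 1]) / (2 * (1 + rho[1]))

# Hypothetical flat autocorrelations at lags 1..3, to mimic slow decay.
for r in [0.0, 0.01, 0.05, 0.5, 0.9]:
    rho = [1.0, r, r, r]
    new = combined_autocorr(rho, k=1)
    print(f"lag-1 autocorr {r:.2f} -> {new:.4f} after combining bins")
```

With flat input the output is 2r / (1 + r): exactly 0 when r is 0, nearly double for small r, and only slightly larger than r when r is already large, so the lift is big in relative terms only when the starting autocorrelation is small.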
Investigation III: Zooming SiZer
Idea: Study "dependence" in terms of
"non-stationarity in mean"
Recall SiZer
finds "significant slopes"
Need for zooming: to
view wide range of scales
SiZer Background
settings: scatterplot smoothing and histograms
Fossils data
Incomes data
- Central Question:
Which features are “really there”?
Solution Part I, Scale Space
Solution Part II, SiZer
SiZer Background (cont.)
Smoothing Setting 1: Scatterplots
E.g. Fossil
from T. Bralower, Dept. Geological Sciences, UNC
Strontium Ratio in fossil shells
reflects global sea level
surrogate for climate
over millions of years
SiZer Background (cont.)
Smooths of Fossil Data (details given later)
dotted line: undersmoothed (feels sampling variability)
dashed line: oversmoothed (important features missed?)
solid line: smoothed about right?
Central question: Which
features are “really there”?
SiZer Background (cont.)
My scatterplot smoothing method (others disagree):
local linear smoothing
Main idea: (illustrated by toy example)
use kernel window to “determine neighborhood”
then “fit a line within the window”
then “slide window along”
Window Width, h, is
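A minimal sketch of local linear smoothing as outlined above, assuming a Gaussian kernel window (the notes do not say which kernel is used). The function name `local_linear_smooth` and the synthetic scatterplot are hypothetical; the fossil data are not reproduced here. At each grid point the code weights the data by the kernel window, fits a weighted least squares line, and keeps the fitted value at the window center.

```python
import numpy as np

def local_linear_smooth(x, y, grid, h):
    """Local linear fit of y on x, evaluated at each point of grid.

    h is the window width (standard deviation of the Gaussian kernel).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    grid = np.asarray(grid, dtype=float)
    fitted = np.empty(len(grid))
    for i, x0 in enumerate(grid):
        w = np.exp(-0.5 * ((x - x0) / h) ** 2)       # kernel window around x0
        X = np.column_stack([np.ones_like(x), x - x0])
        WX = X * w[:, None]
        # Weighted least squares line; the intercept is the fit at x0.
        beta = np.linalg.solve(X.T @ WX, WX.T @ y)
        fitted[i] = beta[0]
    return fitted

# Synthetic scatterplot (hypothetical, for illustration only).
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0.0, 10.0, 100))
y = np.sin(x) + rng.normal(0.0, 0.3, 100)
grid = np.linspace(0.0, 10.0, 101)

undersmoothed = local_linear_smooth(x, y, grid, h=0.2)
about_right = local_linear_smooth(x, y, grid, h=1.0)
oversmoothed = local_linear_smooth(x, y, grid, h=5.0)
```

The three window widths are arbitrary choices meant to mimic the dotted, solid and dashed curves; which width is "about right" depends on the data.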
SiZer Background (cont.)
Smoothing Setting 2: Histograms
Family Income Data: British Family Expenditure Survey, 1975
- Distribution of Incomes
~ 7000 families
- Again under- and over- smoothing issues
- Perhaps 2 modes in data?
Histogram Problem 1: Binwidth (well known)
Central question: Which features are “really there”?
(e.g. 2 modes?)
SiZer Background (cont.)
Why not use (conventional) histograms?
Histogram Problem 2: Bin shift (less well known)
- For same binwidth
- get much different impression
by only “shifting grid location"
Solution to binshift problem: average over all shifts
- 1st peak all in one bin: bimodal
1st peak split between bins: unimodal
Smooth histogram provides understanding,
so should use for data analysis
Another name: Kernel Density Estimate
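A small sketch of the bin shift problem and of averaging over shifts, under assumptions made here for illustration: the function names (`shifted_histogram`, `averaged_shifted_histogram`), the number of shifts, and the synthetic two-group sample are all hypothetical, and the incomes data are not used. Averaging over a finite set of equally spaced shifts approximates averaging over all shifts.

```python
import numpy as np

def shifted_histogram(data, grid, binwidth, shift):
    """Histogram density on `grid`, with the bin grid moved right by `shift`.

    Assumes 0 <= shift < binwidth and that `grid` extends at least one
    binwidth beyond the data on both sides.
    """
    edges = np.arange(grid[0] - binwidth + shift, grid[-1] + 2 * binwidth, binwidth)
    heights, _ = np.histogram(data, bins=edges, density=True)
    idx = np.searchsorted(edges, grid, side="right") - 1   # bin holding each grid point
    return heights[idx]

def averaged_shifted_histogram(data, grid, binwidth, n_shifts=20):
    """Average of histograms over n_shifts equally spaced grid shifts."""
    shifts = binwidth * np.arange(n_shifts) / n_shifts
    return np.mean([shifted_histogram(data, grid, binwidth, s) for s in shifts],
                   axis=0)

# Synthetic two-group sample (hypothetical, not the incomes data).
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0.0, 0.5, 300), rng.normal(2.0, 0.7, 700)])
binwidth = 0.8
grid = np.linspace(data.min() - binwidth, data.max() + binwidth, 2001)

hist_a = shifted_histogram(data, grid, binwidth, shift=0.0)
hist_b = shifted_histogram(data, grid, binwidth, shift=binwidth / 2)
ash = averaged_shifted_histogram(data, grid, binwidth)
```

Plotting `hist_a` and `hist_b` against `grid` shows how much the impression can change with the grid location alone, while `ash` does not depend on any single grid location.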
SiZer Background (cont.)
Kernel density estimation
View 1: Smooth histogram
View 2: Distribute probability
mass, according to data
E.g. Chondrite data (from how many sources?)
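A minimal sketch of View 2, assuming a Gaussian kernel: each data point contributes a bump of probability mass 1/n centered at that point, and the estimate is the sum of the bumps. The function name `kde_gaussian` and the small data vector are hypothetical; the chondrite values are not reproduced here.

```python
import numpy as np

def kde_gaussian(data, grid, h):
    """Gaussian kernel density estimate with bandwidth h, evaluated on grid."""
    data = np.asarray(data, dtype=float)
    grid = np.asarray(grid, dtype=float)
    u = (grid[:, None] - data[None, :]) / h          # one column per data point
    bumps = np.exp(-0.5 * u ** 2) / (h * np.sqrt(2.0 * np.pi))
    return bumps.mean(axis=1)                        # each bump carries mass 1/n

# Hypothetical sample, for illustration only.
data = np.array([1.0, 1.3, 2.1, 4.0, 4.2, 4.5])
grid = np.linspace(-5.0, 10.0, 3001)
density = kde_gaussian(data, grid, h=0.5)
```

Because every bump integrates to 1 and carries weight 1/n, the estimate is itself a probability density for any choice of h.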
SiZer Background (cont.)
Kernel density estimation
Central Issue: width of window, i.e. “bandwidth”, h
Controls critical amount of smoothing
Old Approach: data based bandwidth selection
Jones, M. C., Marron, J. S. and Sheather, S. J. (1996) A brief survey of bandwidth selection for density estimation,
Journal of the American Statistical Association, 91.
New Approach: "scale space"
(look at all of them)
SiZer Background (cont.)
“Scale Space” – idea from
Computer Vision
Conceptual basis:
- Oversmoothing = “view from afar” (macroscopic)
Undersmoothing = “zoomed in view” (microscopic)
Main idea: all smooths contain useful information,
so study “full spectrum” (i.e. all smoothing levels)
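A small sketch of the scale space idea for density estimation: compute the whole family of smooths over a range of bandwidths instead of picking one. The Gaussian kernel, the function names (`kde_gaussian`, `count_modes`, `scale_space`), the bandwidth range and the synthetic two-cluster sample are assumptions made here for illustration.

```python
import numpy as np

def kde_gaussian(data, grid, h):
    """Gaussian kernel density estimate with bandwidth h, evaluated on grid."""
    u = (grid[:, None] - data[None, :]) / h
    return (np.exp(-0.5 * u ** 2) / (h * np.sqrt(2.0 * np.pi))).mean(axis=1)

def count_modes(f):
    """Number of interior local maxima of the curve values f."""
    inner = f[1:-1]
    return int(np.sum((inner > f[:-2]) & (inner >= f[2:])))

def scale_space(data, grid, bandwidths):
    """Family of smooths: one row per bandwidth."""
    return np.array([kde_gaussian(data, grid, h) for h in bandwidths])

# Synthetic sample with two clusters (hypothetical, for illustration only).
data = np.concatenate([np.linspace(-1.2, -0.8, 20), np.linspace(0.8, 1.2, 20)])
grid = np.linspace(-6.0, 6.0, 1201)
bandwidths = np.logspace(-0.5, 0.5, 5)     # from about 0.32 to about 3.2

family = scale_space(data, grid, bandwidths)
modes = [count_modes(f) for f in family]
for h, m in zip(bandwidths, modes):
    print(f"h = {h:.2f}: {m} mode(s)")
```

In this toy example the small bandwidths show two modes (the zoomed in view) and the large bandwidths show one (the view from afar); neither answer is wrong, they describe the data at different scales.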