Course  OR 778

Class Notes   9/17/01





Last Time:
 

    -    Finished (?) study of heavy tails

    -    Began study of "Long Range Dependence"

    -    via correlation analysis   (sensible??)

    -    in context of:

Heavy tailed durations   ⟹   Long Range Dependence









Course Goals:
 

    -    Explore Internet Traffic from several viewpoints
 

    -    Highlight interesting open problems
 

    -    Promote possible joint research
 

    -    Maximize understanding by all class members
 
 
 
 
 
 
 
 
 
 


Wednesday's meeting:



(Sound Bite)  Introduction to Time Series Analysis
 

    -    Autocorrelation

    -    ARMA process

    -    Periodogram

    -    Partial Correlation

    -    ARIMA processes

    -    Long Range dependence

    -    Fractional ARIMA processes
 
 
 
 


Investigation II:  Long Range Dependence?





Question 1:    Is it really there?
 
 

    -    Early conceptions:   no

(renders classical queueing theory useless?)






    -    Current thought:   yes
 

    -    Very recent work (Cleveland, et al.):    not important
 

    -    Motivated zooming autocorrelation view.
 

    -    Revealed "both viewpoints correct" depending on scale
 

    -    Surprised at "how dependence comes in"?
 

    -    Expected "lump of dependence" coming in from right??
 
 
 
 
 
 


Investigation II:  Long Range Dependence?  (cont.)






Notion of large lump on right (in autocorr.):

consistent with “periodicities”?






Caution 1:    periodicities   ⟹   large lump,

but not clear that      large lump   ⟹   periodicity







Caution 2:    TCP has its periodicities

Individual TCP connection zooming graphic










An aside about aggregation





A tempting idea:

"packet loss effects will kill independence at small scales"





BUT:  aggregated data say something different
 
 

AN EXPLANATION:  depends on where loss occurs:

    -    loss at link where measuring?    then YES

    -    far away from measurement point?    then NO
 
 

Recall simple view of the Internet:


 
 

Current situation:
 

    -    Backbone is "over-provisioned"

(working at 5-10% capacity)





    -    Loss occurs mostly at "edges"

(or between backbones)





    -    Thus aggregation of these could be independent

(since loss is happening at many different places)








Investigation II:  Long Range Dependence?  (cont.)






Observed effects due to data sparsity?
 
 

Time Series’s 2:    For increasing seq’s of 10,000 bins

    -    time scales increase

    -    # obs’s / bin increases

    -    total length increases    (construction sketched below)
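
A minimal sketch (Python) of this binning construction, assuming the raw data are packet arrival timestamps in seconds; the simulated Poisson arrivals below are only a placeholder for a real trace:

import numpy as np

def binned_counts(timestamps, bin_width, n_bins=10_000):
    """Counts of arrivals in n_bins consecutive bins of the given width,
    starting at the first timestamp."""
    t0 = timestamps.min()
    edges = t0 + bin_width * np.arange(n_bins + 1)
    counts, _ = np.histogram(timestamps, bins=edges)
    return counts

def acf(counts, max_lag):
    """Sample autocorrelations at lags 1..max_lag."""
    x = counts - counts.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / denom for k in range(1, max_lag + 1)])

# Placeholder trace: Poisson arrivals at about 1000 per second for ~2000 seconds
# (a real trace would show the dependence structure discussed above).
rng = np.random.default_rng(1)
arrivals = np.cumsum(rng.exponential(1 / 1000, size=2_000_000))

# Increasing sequence of 10,000-bin series: as the time scale (bin width) grows,
# obs's per bin and the total length covered both grow.
for bin_width in [0.001, 0.01, 0.1]:
    counts = binned_counts(arrivals, bin_width)
    print(bin_width, counts.mean(), acf(counts, 5).round(3))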
 
 

Major Problem:      assumes “stationarity”
 
 
 
 
 


Investigation II:  Long Range Dependence?  (cont.)





Zooming Autocorrelation 2:

    -    can’t distinguish from indep. at small scales

    -    strong dependence at larger scales

    -    “vertical lifting of dependence”

    -    not “coming in from right”
 

Questions:

    -    looking at too narrow a lag range?

    -    where are “times” in zooming auto-correlation?
 
 
 
 


Investigation II:  Long Range Dependence?  (cont.)





Zooming Autocorrelation 3:

Larger lag range  &  “time markers”
 

    -   cyan bar shows old lag boundary
 

    -   yellow bars show how time zooms
 

    -    vertical lift not completely level
 

    -    but still doesn’t “move in from right”
 

    -    instead “lifts first on left”
 
 
 
 


Investigation II:  Long Range Dependence?  (cont.)





Zooming Autocorrelation 4:    Time invariant view

Rescale to fix yellow time bars






    -    expect “curve follows mountains of dependence”
 

    -    from “dependence at time scale” model
 

    -    instead see “dependence increasing with scale”
 
 
 
 
 
 


Explanation:  simple cross scale calculation






Hannig, J., Marron, J. S. and Riedi, R. H. (2001)  Zooming statistics: Inference across scales, Journal of the Korean Statistical Society, 30, 327-353.
 
 

Idea:  Compare autocorr’n when adjacent bins are combined:
 

Relate lag  k  at the combined scale  2h  to lag  2k  at the original scale  h
 

Can show:










Explanation (cont.)





Notes   (a numerical check is sketched after this list):

   -   when really uncorr’d, always stays at 0

   -   slight positive autocorr. magnified by 2

   -   big lift for small lag one autocorr.

   -   small lift for large lag one autocorr.

   -   small scale Poisson model is not correct

Note:   "proven departure from Poisson"
different from:  "not Poisson"

   -   but still OK as a fine scale approximation???
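
A rough numerical check of the notes above (a sketch only, not the formula from the paper above): an AR(1) series stands in for weakly dependent bin counts; adjacent bins are combined, and dependence at the same physical separation is compared before and after combining:

import numpy as np

def acf1(x, lag):
    """Sample autocorrelation at the given lag."""
    x = x - x.mean()
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

def combine_adjacent(x):
    """Move to the next coarser scale by summing adjacent bins."""
    return x[: len(x) // 2 * 2].reshape(-1, 2).sum(axis=1)

rng = np.random.default_rng(2)
n = 200_000
for phi in [0.0, 0.05, 0.5, 0.9]:      # AR(1) coefficient; 0.0 = truly uncorrelated
    eps = rng.normal(size=n)
    x = np.empty(n)
    x[0] = eps[0]
    for t in range(1, n):
        x[t] = phi * x[t - 1] + eps[t]
    y = combine_adjacent(x)
    # lag 2 of the fine series and lag 1 of the combined series span the
    # same physical separation (two original bins)
    print(phi, round(acf1(x, 2), 3), round(acf1(y, 1), 3))

For phi = 0 both stay at 0; for small positive phi the combined-scale value is much larger than the fine-scale value at the same separation (big lift); for phi near 1 the lift is small.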
 
 
 
 
 
 


Investigation III:  Zooming SiZer





Idea:  Study "dependence" in terms of

"non-stationarity in mean"





Recall SiZer finds "significant slopes"
 
 
 

Need for zooming:  to view wide range of scales
 
 
 
 
 
 
 


SiZer Background




    -    Settings: scatterplot smoothing and histograms
 

    -    Fossils data
 

    -    Incomes data
 

    -    Central Question:

Which features are “really there”?



    -    Solution Part I, Scale Space
 

    -    Solution Part II, SiZer
 
 
 
 
 


SiZer Background (cont.)



Smoothing Setting 1: Scatterplots
 

E.g.  Fossil Data
 

    -    from T. Bralower, Dept. Geological Sciences, UNC
 

    -    Strontium Ratio in fossil shells
 

    -    reflects global sea level
 

    -    surrogate for climate
 

    -    over millions of years
 
 
 
 
 


SiZer Background (cont.)




Smooths of Fossil Data (details given later)
 

    -    dotted line: undersmoothed (feels sampling variability)
 

    -    dashed line: oversmoothed (important features missed?)
 

    -    solid line: smoothed about right?
 
 
 

Central question: Which features are “really there”?
 
 
 
 
 


SiZer Background (cont.)




My scatterplot smoothing method (others disagree):

local linear smoothing




Main idea: (illustrated by toy example)

use kernel window to “determine neighborhood”

then “fit a line within the window”

then “slide window along”




Window Width, h, is critical
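
A minimal sketch of the local linear idea on a toy example (Gaussian kernel assumed; data, grid, and bandwidths below are illustrative only, not the fossil data):

import numpy as np

def local_linear(x, y, x_eval, h):
    """Local linear smoother: at each evaluation point, weight the data with a
    Gaussian kernel window of width h ("determine neighborhood"), fit a
    weighted least squares line ("fit a line within the window"), and keep
    its value at that point ("slide window along")."""
    fit = np.empty(len(x_eval))
    for j, x0 in enumerate(x_eval):
        w = np.exp(-0.5 * ((x - x0) / h) ** 2)           # kernel weights
        X = np.column_stack([np.ones_like(x), x - x0])   # local intercept + slope
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
        fit[j] = beta[0]                                  # local line evaluated at x0
    return fit

# Toy example: noisy curve; the window width h controls the amount of smoothing
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(4 * np.pi * x) + rng.normal(0, 0.3, size=200)
grid = np.linspace(0, 1, 101)
undersmoothed = local_linear(x, y, grid, h=0.01)   # wiggly: feels sampling variability
about_right   = local_linear(x, y, grid, h=0.05)
oversmoothed  = local_linear(x, y, grid, h=0.5)    # misses the real oscillation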
 
 
 
 
 


SiZer Background (cont.)



Smoothing Setting 2: Histograms
 

Family Income Data: British Family Expenditure Survey, 1975

    -    Distribution of Incomes

    -    ~ 7000 families
 
 

Histogram Analysis:

    -    Again under- and over- smoothing issues

    -    Perhaps 2 modes in data?

    -    Histogram Problem 1: Binwidth (well known)
 
 

Central question: Which features are “really there”?

(e.g. 2 modes?)








SiZer Background (cont.)




Why not use (conventional) histograms?
 
 

Histogram Problem 2: Bin shift (less well known)

    -    For same binwidth

    -    get much different impression

    -    by only “shifting grid location”
 
 

Solution to bin shift problem: average over all shifts    (sketched below)

    -    1st peak all in one bin: bimodal

    -    1st peak split between bins: unimodal
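
A small sketch of the bin shift effect and of averaging over shifts; the two-mode toy data below only stand in for the incomes sample, and the binwidth is illustrative:

import numpy as np

rng = np.random.default_rng(3)
# Toy stand-in for the incomes data: a mixture with two modes
data = np.concatenate([rng.normal(1.0, 0.25, 4000), rng.normal(2.0, 0.35, 3000)])

h = 0.5        # common binwidth for all shifted histograms
m = 16         # number of grid shifts averaged over
delta = h / m  # width of the fine sub-bins used for the average

# Same binwidth, two different grid origins: the counts (and the visual
# impression of the peaks) change with the shift alone.
for shift in (0.0, h / 2):
    edges = np.arange(data.min() - h + shift, data.max() + h, h)
    counts, _ = np.histogram(data, bins=edges)
    print(shift, counts)

# Averaged shifted histogram: count on a fine grid, then apply the triangular
# weights that result from averaging all m shifted histograms.
fine_edges = np.arange(data.min() - h, data.max() + h, delta)
fine_counts, _ = np.histogram(data, bins=fine_edges)
weights = 1 - np.abs(np.arange(-(m - 1), m)) / m
ash_density = np.convolve(fine_counts, weights, mode="same") / (len(data) * h)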
 
 

Smooth histogram provides understanding,
so it should be used for data analysis




Another name: Kernel Density Estimate
 
 
 
 
 


SiZer Background (cont.)




Kernel density estimation
 
 

View 1: Smooth histogram
 
 

View 2: Distribute probability mass, according to data
 
 

E.g. Chondrite data (from how many sources?)
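
A minimal sketch of View 2 (Gaussian kernel assumed; the toy data below only stand in for the chondrite measurements):

import numpy as np

def kde(data, grid, h):
    """Kernel density estimate: center a Gaussian bump of width h at each
    data point and average the bumps over the evaluation grid."""
    bumps = np.exp(-0.5 * ((grid[:, None] - data[None, :]) / h) ** 2)
    return bumps.mean(axis=1) / (h * np.sqrt(2 * np.pi))

# Toy data with a couple of clusters ("how many sources?" = how many modes?)
rng = np.random.default_rng(4)
data = np.concatenate([rng.normal(21, 1.0, 9), rng.normal(27, 1.5, 13)])
grid = np.linspace(data.min() - 3, data.max() + 3, 400)
density = kde(data, grid, h=1.0)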
 
 
 
 
 
 


SiZer Background (cont.)




Kernel density estimation (cont.)
 

Central Issue: width of window, i.e. “bandwidth”,  h

E.g. Incomes data

Controls critical amount of smoothing




Old Approach: data based bandwidth selection

Jones M. C., Marron, J. S. and Sheather, S. J. (1996) A brief survey of bandwidth selection for density estimation, Journal of the American Statistical Association, 91, 401-407.
 
 

New Approach: "scale space" (look at all of them)
 
 
 
 
 
 


SiZer Background (cont.)




“Scale Space” – idea from Computer Vision
 
 

Conceptual basis:

    -    Oversmoothing = “view from afar” (macroscopic)

    -    Undersmoothing = “zoomed in view” (microscopic)
 
 

Main idea: all smooths contain useful information,

so study “full spectrum” (i.e. all smoothing levels)
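
A sketch of the scale space idea in the density setting: rather than choosing one bandwidth, compute the whole family of smooths over a wide (illustrative) bandwidth grid and examine all of them together:

import numpy as np

def kde(data, grid, h):
    """Gaussian kernel density estimate at bandwidth h."""
    bumps = np.exp(-0.5 * ((grid[:, None] - data[None, :]) / h) ** 2)
    return bumps.mean(axis=1) / (h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(5)
data = rng.lognormal(mean=0.0, sigma=0.6, size=7000)   # toy stand-in for the incomes data
grid = np.linspace(0.0, data.max(), 500)

# "Full spectrum" of smoothing levels: logarithmically spaced bandwidths
bandwidths = np.geomspace(0.02, 2.0, 15)
family = np.vstack([kde(data, grid, h) for h in bandwidths])
# family[0] is the zoomed-in (microscopic) view at the smallest h;
# family[-1] is the view from afar (macroscopic) at the largest h.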