Course  OR 778

Class Notes   9/24/01






Last Time (self contained):

    -    Overview of time series basics

    -    Classical theory

            -    stationarity, autocorrelation, AR(I)MA,

            -    spectral analysis

    -    Long Range Dependence

            -    autocorrelation and spectral characterization

            -    Fractional ARIMA

            -    Hurst scaling (expanding histogram graphic)
 
 
 

Time before last:

    -    in context of:

Heavy tailed durations  =>  Long Range Dependence

    -    Zooming autocorrelation analysis

            -    Bin counts nearly independent at small scales

            -    Heavy dependence at large scales

    -    Heading toward zooming SiZer analysis

    -    1st doing SiZer background
 
 
 
 
 


Philosophical Aside:





What are the basic questions being addressed here?
 

I.e. what do network researchers really want to know?
 
 

Context 1:    Classical Applied Statistics

    -    i.e. statistical consulting

    -    before analyzing data, identify research question

    -    frequently different from what is first asked!

    -    excellent paradigm to avoid "finding what isn't real"

(at level 0.05, 5% of true null H0s are "rejected")
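
A minimal sketch of the "5% of true nulls are rejected" point, not part of the lecture itself. It assumes a valid test, so that under a true null the p-value is Uniform(0, 1), and it assumes independent tests for the familywise calculation. The function names are made up for illustration.

```python
import random


def false_rejection_rate(n_tests=20000, alpha=0.05, seed=0):
    """Fraction of true null hypotheses rejected at level alpha.

    Assumption: under a true null a valid p-value is Uniform(0, 1),
    so the p-values are drawn directly instead of simulating data.
    """
    rng = random.Random(seed)
    rejected = sum(1 for _ in range(n_tests) if rng.random() < alpha)
    return rejected / n_tests


def familywise_error(m, alpha=0.05):
    """P(at least one false rejection) among m independent true nulls."""
    return 1.0 - (1.0 - alpha) ** m


rate = false_rejection_rate()      # close to 0.05
many_looks = familywise_error(20)  # chance of "finding what isn't real" in 20 looks
```

With 20 independent looks at pure noise, the chance of at least one "significant" finding is already well above one half, which is why the research question should be fixed before the analysis.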







Philosophical Aside (cont.)




Context 2:    Challenging and "vague" scientific areas

    -    bioinformatics (genomics, etc.)

    -    data mining   ("finding info" in large data bases)

    -    electronic security,  e.g. "intrusion defense"

    -    mathematical finance???

    -    internet traffic
 
 

Common aspects of such research:

    -    unclear what the basic questions are

    -    they are developed in interaction with analysis

    -    close collaboration is vital

    -    "multiple comparison" issues are

endemic and challenging








Investigation III:  Zooming SiZer





Idea:  Study "dependence" in terms of

"non-stationarity in mean"





Recall SiZer finds "significant slopes"
 
 
 

Need for zooming:  to view wide range of scales
 
 
 


SiZer Background





    -    settings: scatterplot smoothing and histograms
 

    -    Fossils data
 

    -    Incomes data
 

    -    Central Question:

Which features are “really there”?





    -    Solution Part I, Scale Space
 

    -    Solution Part II, SiZer
 
 
 
 
 


SiZer Background (cont.)





Smooths of Fossil Data (local linear)
 

    -    dotted line: undersmoothed (feels sampling variability)
 

    -    dashed line: oversmoothed (important features missed?)
 

    -    solid line: smoothed about right?
 
 

Central question: Which features are “really there”?
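
A rough sketch of local linear smoothing at three bandwidths, to make the under/over-smoothing contrast concrete. The data here are synthetic (not the Fossils data), the Gaussian kernel is an assumption, and the bandwidth values are arbitrary choices for illustration.

```python
import math
import random


def local_linear(x, y, x0, h):
    """Local linear fit at x0 with a Gaussian kernel of bandwidth h.

    Returns (fitted value, fitted slope) at x0, from the weighted
    least squares line through the data, weights K((x_i - x0) / h).
    """
    s0 = s1 = s2 = t0 = t1 = 0.0
    for xi, yi in zip(x, y):
        d = xi - x0
        w = math.exp(-0.5 * (d / h) ** 2)  # kernel weight
        s0 += w
        s1 += w * d
        s2 += w * d * d
        t0 += w * yi
        t1 += w * d * yi
    det = s0 * s2 - s1 * s1
    level = (s2 * t0 - s1 * t1) / det
    slope = (s0 * t1 - s1 * t0) / det
    return level, slope


def smooth(x, y, grid, h):
    """Evaluate the local linear smooth on a grid of points."""
    return [local_linear(x, y, g, h)[0] for g in grid]


# Synthetic stand-in data: a smooth curve plus noise (illustrative only).
rng = random.Random(0)
x = [i / 100 for i in range(101)]
y = [math.sin(2 * math.pi * xi) + rng.gauss(0.0, 0.3) for xi in x]

undersmoothed = smooth(x, y, x, h=0.02)  # follows the sampling noise
about_right = smooth(x, y, x, h=0.08)
oversmoothed = smooth(x, y, x, h=0.50)   # flattens the sine wave
```

A useful check on any such smoother: a local linear fit reproduces exactly linear data, at every bandwidth.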
 
 
 
 
 


SiZer Background (cont.)





Smoothing Setting 2: Histograms
 

Family Income Data: British Family Expenditure Survey, 1975

    -    Distribution of Incomes

    -    ~ 7000 families
 
 

Kernel Density Estimation Analysis:

    -    Again under- and over- smoothing issues

    -    Perhaps 2 modes in data?
 
 

Central question: Which features are “really there”?

(e.g. 2 modes?)










SiZer Background (cont.)





“Scale Space” – idea from Computer Vision
 
 

Conceptual basis:

    -    Oversmoothing = “view from afar” (macroscopic)

    -    Undersmoothing = “zoomed in view” (microscopic)
 
 

Main idea: all smooths contain useful information,

so study “full spectrum” (i.e. all smoothing levels)





Fun views:   Spectrum Overlay & Spectrum Surface
 
 

Note: this viewpoint makes

“data based bandwidth selection”

        much less important (than I once thought….)
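
A small sketch of the "full spectrum" idea for kernel density estimates: compute the whole family of smooths over a log-spaced range of bandwidths, instead of picking one. The toy data, the Gaussian kernel, the bandwidth range and the helper names are all assumptions for illustration.

```python
import math


def kde(data, x, h):
    """Gaussian kernel density estimate at x with bandwidth h."""
    norm = len(data) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - d) / h) ** 2) for d in data) / norm


def scale_space(data, grid, bandwidths):
    """Family of smooths: one curve on the grid per bandwidth."""
    return {h: [kde(data, g, h) for g in grid] for h in bandwidths}


def count_modes(curve):
    """Number of interior local maxima of a curve on a grid."""
    modes = 0
    for i in range(1, len(curve) - 1):
        if curve[i] > curve[i - 1] and curve[i] >= curve[i + 1]:
            modes += 1
    return modes


# Toy data (illustrative only): two tight clusters centred at -2 and +2.
cluster = [0.1 * k for k in range(-5, 6)]
data = [-2.0 + c for c in cluster] + [2.0 + c for c in cluster]
grid = [-5.0 + 0.1 * i for i in range(101)]
bandwidths = [0.3 * 10 ** (j / 4) for j in range(5)]  # log-spaced, 0.3 up to 3.0

family = scale_space(data, grid, bandwidths)
modes_by_h = {h: count_modes(curve) for h, curve in family.items()}
```

Here the smallest bandwidth shows two modes and the largest shows one. Both views describe the same data at different resolutions, which is the point of looking at the family rather than a single "best" bandwidth.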
 
 
 


SiZer Background (cont.)





SiZer:

Significance of Zero crossings,

of the derivative, in scale space





Combines:

    -    needed statistical inference

    -    novel visualization

To get: a powerful exploratory data analysis method
 
 

Chaudhuri, P. and Marron, J. S. (1999) SiZer for exploration of structure in curves, Journal of the American Statistical Association, 94, 807-823.
 
 


SiZer Background (cont.)






Basic idea: a “bump” is characterized by:

an increase, followed by a decrease






Generalization: many “features of interest” captured by

sign of the slope of the smooth





SiZer Basis:

Statistical inference on slopes, over scale space











SiZer Background (cont.)





Visual presentation:
 

    Color map over scale space:
 

    - Blue: slope significantly upwards (deriv. CI above 0)
 

    - Red: slope significantly downwards (deriv. CI below 0)
 

    - Purple: slope insignificant (deriv. CI contains 0)
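
The color rule above can be sketched as follows for scatterplot smoothing. This is a simplified illustration, not the published SiZer procedure. It assumes a Gaussian kernel, a pointwise normal quantile (z = 1.96) where SiZer calibrates its intervals across the map, constant noise variance estimated from successive differences, and a hypothetical effective-sample-size cutoff `min_ess` below which no inference is attempted.

```python
import math


def noise_sd(y):
    """Difference-based noise SD estimate (assumes x sorted, constant variance)."""
    diffs = [(y[i + 1] - y[i]) ** 2 for i in range(len(y) - 1)]
    return math.sqrt(sum(diffs) / (2.0 * len(diffs)))


def sizer_color(x, y, x0, h, z=1.96, min_ess=5.0):
    """Color of the pixel at location x0 and bandwidth h."""
    d = [xi - x0 for xi in x]
    w = [math.exp(-0.5 * (di / h) ** 2) for di in d]
    s0 = sum(w)  # effective sample size, since the kernel has K(0) = 1
    if s0 < min_ess:
        return "gray"  # too little data in the window
    s1 = sum(wi * di for wi, di in zip(w, d))
    s2 = sum(wi * di * di for wi, di in zip(w, d))
    det = s0 * s2 - s1 * s1
    if det <= 0.0:
        return "gray"
    # The local linear slope is linear in y: slope = sum(l_i * y_i).
    l = [wi * (s0 * di - s1) / det for wi, di in zip(w, d)]
    slope = sum(li * yi for li, yi in zip(l, y))
    se = noise_sd(y) * math.sqrt(sum(li * li for li in l))
    if slope - z * se > 0.0:
        return "blue"    # derivative CI above 0
    if slope + z * se < 0.0:
        return "red"     # derivative CI below 0
    return "purple"      # derivative CI contains 0


def sizer_map(x, y, grid, bandwidths):
    """One row of colors per bandwidth, one column per grid location."""
    return [[sizer_color(x, y, g, h) for g in grid] for h in bandwidths]
```

For example, `sizer_map(x, y, grid, bandwidths)` with log-spaced bandwidths gives a coarse text version of the color map, with rows indexed by scale and columns by location.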
 
 
 
 
 


SiZer Background (cont.)






SiZer analysis of Fossils data:

Upper Left: Scatterplot, family of smooths, 1 highlighted
 

Upper Right: Scale space rep’n of family, with SiZer colors
 

Lower Left: SiZer map, easier to view
 

Lower Right: SiCon map – replace "slope" by "curvature"
 

Slider (in movie viewer) highlights different smoothing levels
 
 
 
 
 


SiZer Background (cont.)





SiZer analysis of Fossils data (cont.)
 

Oversmoothed:

    -    Decreases at left, not on right
 

Medium smoothed:

    -    Main valley significant, and leftmost increase

    -    smaller valley not statistically significant
 

Undersmoothed:

    -    “noise wiggles” not significant
 
 

Additional SiZer color: gray = not enough data for inference
 
 
 


SiZer Background (cont.)





SiZer analysis of Fossils data (cont.)
 

Common Question: which is “right”?

    -    decreases on left, then flat

    -    up, then down, then up again

    -    no significant features
 
 

Answer: All are “right”, just different “scales of view”,

i.e. “levels of resolution of data”










SiZer Background (cont.)





SiZer analysis of Incomes data:
 

Oversmoothed: Only one mode
 

Medium smoothed: Two modes statistically significant

Confirmed by PhD dissertation of H. P. Schmitz (U. Bonn):

Schmitz, H. P. and Marron, J. S. (1992) Simultaneous estimation of several size distributions of income, Econometric Theory, 8, 476-488.
 

Undersmoothed: many “noise wiggles”, not significant
 
 
 

Again: all are “correct”, just different “scales”
 
 
 


SiZer Background (cont.)





Simulated example 1: Marron - Wand Trimodal, #9
 

n=100:    only one mode "significant"
 

n=1000:    two modes now "appear from background noise"
 

n=10,000:    finally all 3 modes are "really there"
 
 
 

Simulated example 2: Marron - Wand Discrete Comb, #15

    -    similar lessons to above

    -    someday:  "draw" local bandwidth on SiZer map
 
 
 


SiZer Background (cont.)





Finance "tick data":   (time, price) of single stock transactions
 
 

Idea:  "on line" version of SiZer
for viewing and understanding trends





Notes:

    -    "trends" depend heavily on "scale"

    -    "double points" and more

    -    "background color" transition
 
 
 
 


SiZer Background (cont.)





Usefulness of SiZer in exploratory data analysis:
 

    -    Smoothing experts: saves time
 

    -    Smoothing beginners: avoids terrible mistakes

            -    don’t find things that “aren’t there”

            -    do find important features
 

    -    Directly targets critical scientific question:

Is a deeper analysis worthwhile?









SiZer Background (cont.)





Would you like to try a SiZer analysis?
 
 

Matlab software:

http://www.stat.unc.edu/faculty/marron/marron_software.html
 
 

JAVA version (demo, beta): Follow the SiZer link from the
Wagner Associates home page:

http://www.wagner.com/www.wagner.com/SiZer/
 
 

More details, examples and discussions:

http://www.stat.unc.edu/faculty/marron/DataAnalyses/SiZer_Intro.html
 
 
 
 


Investigation III:  Zooming SiZer (cont.)





Recall time series 1:    Aggregated point process data,

1 million Packet Arrival times (from 1998), over ~ 3 minutes





Recall 1st zooming autocorrelation plot
 

    -    smallest scale nearly uncorr’d   (Cleveland)
 

    -    Correlation “lifts vertically”
 

    -    gets to long range dependence (folklore)
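
A hedged sketch of the computation behind a zooming autocorrelation view: bin the arrival times at several scales and look at the lag-1 autocorrelation of the bin counts at each scale. The arrivals simulated here are a homogeneous Poisson process used as a baseline, not the 1998 packet trace, and the function names and bin numbers are illustrative assumptions.

```python
import random


def bin_counts(arrivals, duration, n_bins):
    """Counts of arrival times in n_bins equal bins over [0, duration]."""
    width = duration / n_bins
    counts = [0] * n_bins
    for t in arrivals:
        idx = min(int(t / width), n_bins - 1)  # put t == duration in last bin
        counts[idx] += 1
    return counts


def lag1_autocorr(series):
    """Sample autocorrelation at lag 1."""
    n = len(series)
    mean = sum(series) / n
    dev = [s - mean for s in series]
    denom = sum(d * d for d in dev)
    if denom == 0:
        return 0.0
    return sum(dev[i] * dev[i + 1] for i in range(n - 1)) / denom


def poisson_arrivals(rate, duration, seed=0):
    """Arrival times of a homogeneous Poisson process on [0, duration)."""
    rng = random.Random(seed)
    t, out = 0.0, []
    while True:
        t += rng.expovariate(rate)  # exponential inter-arrival times
        if t >= duration:
            return out
        out.append(t)


def zooming_autocorr(arrivals, duration, bin_numbers):
    """Lag-1 autocorrelation of bin counts at each number of bins (scale)."""
    return {n: lag1_autocorr(bin_counts(arrivals, duration, n))
            for n in bin_numbers}


arrivals = poisson_arrivals(rate=1000.0, duration=20.0, seed=1)
by_scale = zooming_autocorr(arrivals, 20.0, [2000, 1000, 500])
```

For the Poisson baseline the correlation stays near zero at every scale. The contrast with the packet data is that there the correlation lifts as the bins get coarser.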
 
 
 


Investigation III:  Zooming SiZer (cont.)





Alternate view:   Zooming SiZer
 

    -    local linear smoothing of bincounts

to avoid "edge effects"




    -    across very wide range of scales
 

    -    needs more pixels than screen allows
 

    -    thus do zooming view (zoom in over time)
 

    -    zoom in to yellow bd’ry in next frame
 

    -    readjust vertical axis
 
 
 


Investigation III:  Zooming SiZer (cont.)





Notes on Zooming SiZer:

    -    Coarse scales:  amazing amount of "significant structure"

    -    reminiscent of “self-similar fractal” type process

    -    fewer significant features at small scale

    -    but they exist, so not Poisson process

    -    Poisson approximation OK at small scale???

    -    smooths (top part) "stable" at large scales?

    -    variation diminishes as mean increases?
 
 
 


Investigation III:  Zooming SiZer (cont.)





Is this "significant structure" really important?
 
 

Simple comparison:

SiZer analysis of 1 million i.i.d. uniforms




    -   SiZer map all purple, i.e. no structure

    -    except at edges

    -    due to boundary effects of kernel density estimation

    -    Shows internet data wiggles are statistically significant

    -    But "practically significant"????
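
A small-scale sketch of the uniform comparison, using the density estimation version of the slope test. It is a simplification under stated assumptions: Gaussian kernel, pointwise z = 1.96 intervals, and a much smaller sample than 1 million so that it runs quickly in plain Python. The function name is hypothetical.

```python
import math
import random


def density_slope_color(data, x0, h, z=1.96):
    """Color for the derivative of a Gaussian kernel density estimate at x0."""
    n = len(data)
    terms = []
    for xi in data:
        u = (x0 - xi) / h
        # derivative with respect to x0 of the kernel K_h(x0 - xi)
        terms.append(-u * math.exp(-0.5 * u * u)
                     / (h * h * math.sqrt(2.0 * math.pi)))
    est = sum(terms) / n  # estimated density slope
    var = sum((t - est) ** 2 for t in terms) / (n - 1)
    se = math.sqrt(var / n)
    if est - z * se > 0.0:
        return "blue"
    if est + z * se < 0.0:
        return "red"
    return "purple"


# i.i.d. uniforms on [0, 1]; n is kept small here for speed (assumption).
rng = random.Random(0)
uniforms = [rng.random() for _ in range(5000)]
interior = [0.1 * k for k in range(1, 10)]
interior_colors = [density_slope_color(uniforms, g, 0.05) for g in interior]
left_edge = density_slope_color(uniforms, 0.0, 0.05)
right_edge = density_slope_color(uniforms, 1.0, 0.05)
```

In the interior most pixels should come out purple, with the occasional false positive expected from pointwise intervals. At the two edges the kernel estimate ramps up from, and back down to, zero, so the slope there is flagged as significant even though the underlying density is flat.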