Statistics 6D,   Visualizing Data

Class Notes:  Tuesday 11/12/02
 


    -    Check new material on student pages (from Class Home Page)
 
 


    -    Organize Class Lunch
 
 


Recall important Idea from histograms:    Income Data

"Bin Width"  controls amount of "smoothing":

    -    Too small:    result is too "wiggly",   "feels sampling variation"

    -    Too big:    smooths away important structure

    -    Deep Question:    Are the two bumps "really there"?
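
A minimal sketch of the bin width effect, in plain Python. The data are a synthetic stand-in (two overlapping groups drawn from a seeded random generator), not the income data, and the helper names histogram_counts and count_bumps are made up for illustration. A "bump" here just means a bar (or flat run of bars) higher than its neighbors on both sides.

```python
import math
import random

def histogram_counts(data, binwidth, origin=0.0):
    """Bar heights for bins [origin + k*binwidth, origin + (k+1)*binwidth)."""
    counts = {}
    for x in data:
        k = math.floor((x - origin) / binwidth)
        counts[k] = counts.get(k, 0) + 1
    lo, hi = min(counts), max(counts)
    # Include empty bins between the lowest and highest occupied bin.
    return [counts.get(k, 0) for k in range(lo, hi + 1)]

def count_bumps(heights):
    """Count local maxima; a flat top counts once."""
    collapsed = [0]  # pad with a zero-height bar on the left
    for h in list(heights) + [0]:  # ... and on the right
        if h != collapsed[-1]:
            collapsed.append(h)  # collapse runs of equal heights
    return sum(
        1
        for i in range(1, len(collapsed) - 1)
        if collapsed[i - 1] < collapsed[i] > collapsed[i + 1]
    )

# Synthetic stand-in data (NOT the income data): two overlapping groups.
rng = random.Random(0)
data = [rng.gauss(1.0, 0.3) for _ in range(150)]
data += [rng.gauss(2.5, 0.6) for _ in range(150)]

for binwidth in (0.05, 0.3, 2.0):
    heights = histogram_counts(data, binwidth)
    print(binwidth, len(heights), "bins,", count_bumps(heights), "bumps")
```

The smallest bin width typically reports many bumps (sampling variation), and the largest can merge the two groups into one. The bump count alone cannot say which bumps are "really there".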
 


Surprising point:   Not only is "width" important, so is "location"

Shifting Histogram for Income Data

    -    Uses same binwidth all through

    -    But "slides grid along"

    -    Two bumps turn into one

    -    What is going on?

    -    Lesson:  not only need to worry about binwidth

    -    Location can be important, too

    -    Effects smaller for smaller binwidth
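
The grid shift can be reproduced on a small hand-made example. The 16 values below are toy numbers built for illustration (a narrow cluster near 1 and a wider one near 3), not the income data, and the function names are assumptions of this sketch. The bin width stays at 1.0 throughout; only the grid origin moves.

```python
import math

def histogram_counts(data, binwidth, origin=0.0):
    """Bar heights for bins [origin + k*binwidth, origin + (k+1)*binwidth)."""
    counts = {}
    for x in data:
        k = math.floor((x - origin) / binwidth)
        counts[k] = counts.get(k, 0) + 1
    lo, hi = min(counts), max(counts)
    return [counts.get(k, 0) for k in range(lo, hi + 1)]

def count_bumps(heights):
    """Count local maxima; a flat top counts once."""
    collapsed = [0]
    for h in list(heights) + [0]:
        if h != collapsed[-1]:
            collapsed.append(h)
    return sum(
        1
        for i in range(1, len(collapsed) - 1)
        if collapsed[i - 1] < collapsed[i] > collapsed[i + 1]
    )

# Toy values made up for this sketch (NOT the income data).
toy = [0.8, 0.9, 0.95, 1.05, 1.1, 1.2, 1.7, 2.1,
       2.6, 2.8, 2.9, 3.1, 3.2, 3.4, 3.8, 4.3]

# First cluster centered in a bin: bins start at 0.5, 1.5, 2.5, ...
centered = histogram_counts(toy, binwidth=1.0, origin=0.5)
print(centered, count_bumps(centered))  # [6, 2, 6, 2] 2

# Same bin width, grid slid by half a bin: bins start at 0, 1, 2, ...
split = histogram_counts(toy, binwidth=1.0, origin=0.0)
print(split, count_bumps(split))        # [3, 4, 4, 4, 1] 1
```

Sliding the grid splits the first cluster between two bins, so its bar shrinks and the dip between the clusters fills in. Same data, same bin width, but two bumps turn into one.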
 
 


Explanation:   overlay "average of all shifts" (shown in green)

    -    See two clear peaks

    -    Histo shows 2 bumps, when 1st peak centered in a bin

    -    Histo shows 1 bump, when 1st peak split between two bins

    -    1 or 2 bumps depends on luck of the draw???

    -    Casts doubt on histograms

    -    Better choice:    use green curve for data analysis

    -    Called "kernel density estimate"
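
One way to sketch the "average of all shifts" idea: compute the histogram height at a point for several equally spaced grid shifts, then average. The toy values, the function names, and the choice of 20 shifts are assumptions for illustration, and this is not the code behind the green curve. Each histogram is scaled to have area 1, so the average also has area 1.

```python
import math

def histogram_density(x, data, binwidth, origin):
    """Height at x of the area-1 histogram whose bin grid starts at origin."""
    k = math.floor((x - origin) / binwidth)
    count = sum(1 for d in data if math.floor((d - origin) / binwidth) == k)
    return count / (len(data) * binwidth)

def average_of_shifts(x, data, binwidth, n_shifts=20):
    """Average the histogram height at x over n_shifts equally spaced shifts."""
    origins = [j * binwidth / n_shifts for j in range(n_shifts)]
    total = sum(histogram_density(x, data, binwidth, o) for o in origins)
    return total / n_shifts

# Toy values made up for this sketch (NOT the income data).
toy = [0.8, 0.9, 0.95, 1.05, 1.1, 1.2, 1.7, 2.1,
       2.6, 2.8, 2.9, 3.1, 3.2, 3.4, 3.8, 4.3]

# Compare the averaged curve at the two clusters and in the valley between.
for x in (1.0, 1.9, 3.0):
    print(x, round(average_of_shifts(x, toy, binwidth=1.0), 3))
```

The averaged curve no longer depends on where the grid starts, and here it is higher at both clusters than in the valley between them. As the number of shifts grows, the average approaches a kernel density estimate with a triangle-shaped kernel.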
 
 


Kernel Density Estimation:    Alternate View

    -    Data:    Chondrites

    -    Stony meteorites (meteors that hit the surface of the earth)

    -    Early question:   from how many sources do they come?

    -    Interesting quantity:    % silica

    -    Approach:   make curve with area 1:

            -    tall where there are many data points

            -    low where there are few data points

            -    put small curve with area 1/n near each data point

            -    Add them up to make kernel density estimate

    -    Gives strong impression of 3 sources

    -    This was green curve for income data above
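
The "small curve near each data point" recipe can be sketched as follows. The Gaussian shape of the small curves is an assumption of this sketch (the notes do not say which shape is used). The eight values are made-up numbers, NOT the chondrite measurements, and the names gaussian_bump and kde are hypothetical.

```python
import math

def gaussian_bump(x, center, width, n):
    """Small Gaussian curve centered at one data point, with total area 1/n."""
    z = (x - center) / width
    return math.exp(-0.5 * z * z) / (width * math.sqrt(2 * math.pi)) / n

def kde(x, data, width):
    """Kernel density estimate at x: add up the n small curves."""
    return sum(gaussian_bump(x, d, width, len(data)) for d in data)

# Made-up values for illustration only (NOT the chondrite data).
toy = [21.0, 22.0, 23.0, 27.0, 28.0, 29.0, 33.0, 34.0]

# Tall where there are many data points, low where there are few.
for x in (22.0, 25.0, 28.0):
    print(x, round(kde(x, toy, width=0.8), 4))
```

Each of the n small curves has area 1/n, so the sum has area 1. The estimate is tall near 22 and 28, where points cluster, and low near 25, where there are none.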
 
 


Notes about Kernel Density Estimate:

    -    Still have to deal with "window width"    Income Data

            -    Too small:    curve is too wiggly

            -    Too big:    may smooth away important features

            -    About right:   can find interesting structure

    -    Important Question:    Which "bumps" are "really there"?

    -    I.e., important underlying structure, not sampling variation
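
A rough sketch of the window width effect: evaluate a Gaussian kernel density estimate on a grid and count its bumps for three widths. The Gaussian kernel, the grid, the three widths, and the made-up data values are all assumptions for illustration. They are not the income data and not a recommended way to choose the width.

```python
import math

def kde(x, data, width):
    """Gaussian kernel density estimate at x with the given window width."""
    norm = len(data) * width * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - d) / width) ** 2) for d in data) / norm

def count_bumps(values):
    """Count local maxima of a curve sampled on a grid; flat runs count once."""
    collapsed = [0.0]
    for v in list(values) + [0.0]:
        if v != collapsed[-1]:
            collapsed.append(v)
    return sum(
        1
        for i in range(1, len(collapsed) - 1)
        if collapsed[i - 1] < collapsed[i] > collapsed[i + 1]
    )

def bumps_for_width(data, width, lo, hi, step=0.05):
    """Sample the estimate on [lo, hi] and count its bumps."""
    n_steps = int(round((hi - lo) / step))
    curve = [kde(lo + i * step, data, width) for i in range(n_steps + 1)]
    return count_bumps(curve)

# Made-up values for illustration only: three loose clusters.
toy = [21.0, 22.0, 23.0, 27.0, 28.0, 29.0, 33.0, 34.0]

for width in (0.2, 0.8, 5.0):
    print(width, bumps_for_width(toy, width, lo=10.0, hi=45.0))
```

With these toy values the smallest width gives every data point its own bump (too wiggly), the middle width shows one bump per cluster, and the largest width merges everything into a single bump. Counting bumps this way only describes the curve. It does not settle which bumps are "really there".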
 
 

