Notes 9/10/02

Statistics 6D, Visualizing Data

Class Notes: Thursday 9/10/02

- Check new material on student pages (from Class Home Page)

- Excel construction of histograms (from Computing Tips)

(need to finish "chosen bins" part)

Continue Analysis of Buffalo Snowfall data...

Background: City of Buffalo, N.Y., known for heavy snows

Data: Time series of annual accumulated snowfall (inches)

Recall Excel default histogram constructed in: Toy Example Excel File

Comments:

- Excel chose binwidth = ~14

- Only 8 bins chosen; binwidth too large?

- Too few bins for "serious structure"?

- Note one year unusually small

Binwidth deliberately "too small"

- Tried binwidth = 3

- Requires many bins to include all the data

- Histogram looks "very bumpy"

- Hard to see "large scale features of distribution"

Binwidth "clearly too big"

- Tried binwidth = 30

- 10 times as big as above

- Averages taken over too big a range

- Obscures potential interesting population structure

Binwidth "about right"???

- Tried binwidth = 10

- "In between" the two above?

- large enough to remove "sampling artifacts"?

- Small enough to suggest 3 modes?

- Interesting question: are modes "important underlying structure"???

Again highlights important issue for histograms: choice of binwidth

Recommendation: try several binwidths

Including both too big, and too small
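The recommendation above can be tried outside Excel as well. The following is a minimal Python sketch, not the method used in class: the helper names `histogram_counts` and `print_histogram` are made up for illustration, and the data are synthetic stand-ins, NOT the Buffalo snowfall values. It prints a text histogram at each of the three binwidths tried above (3, 10, 30).

```python
import math
import random
from collections import Counter

def histogram_counts(data, binwidth, start=0.0):
    """Map each bin's left edge to its count.

    Bin k covers [start + k*binwidth, start + (k+1)*binwidth).
    Empty bins between the smallest and largest observation are kept.
    """
    idx = [math.floor((x - start) / binwidth) for x in data]
    counts = Counter(idx)
    return {start + k * binwidth: counts[k]
            for k in range(min(idx), max(idx) + 1)}

def print_histogram(data, binwidth, start=0.0):
    """Print one row of asterisks per bin."""
    print(f"binwidth = {binwidth}")
    for left, n in histogram_counts(data, binwidth, start).items():
        print(f"  {left:6.1f} to {left + binwidth:6.1f} | {'*' * n}")

# Synthetic stand-in data (an assumption, not the real snowfall series).
random.seed(0)
snow = [random.gauss(80, 25) for _ in range(60)]

# Try a binwidth that is too small, one in between, and one too big.
for bw in (3, 10, 30):
    print_histogram(snow, bw)
```

With the small binwidth most bins hold only a few observations (the "bumpy" look); with the large one only a handful of bins remain.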

Third Class Assignment: Explore a new data set with histograms

- Start with data in spreadsheet StudyHabitsIndexData.xls

* Number attempting to quantify "quality of study habits"

* Measured for 18 females and 20 males

* How do the populations compare???

- Address this question by an Excel analysis based on histograms

* Just try something, then we compare and discuss

- Display your results and conclusions on a new web page

* Linked to your home page

* You select format and style of presentation

* But insert some graphics generated by Excel

- Some graphics ideas to consider:

* Look at two separate histos, or some "combined version"???

* I.e. single graphic showing both "together" (experiment with Excel)

* Answers depend on binwidth, how to effectively display several?

- Some additional questions (answer on your web page, w/ discussion):

* Which group "looks better on average"?

* Can you "quantify this idea"? (e.g. give numerical measures)

* Which group "looks more spread" (i.e. has "greater variation")

* Quantify this idea by using the STDEV function in Excel

* Suppose you are an employer who must hire somebody from one of the two groups. Would you hire a female or a male, if:

+ You are forced to choose "at random"

+ You can carefully select from a large group of each type

Why?
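For the "quantify this idea" questions above, here is a small Python sketch of the two numerical measures. The scores are hypothetical values made up for illustration (not taken from StudyHabitsIndexData.xls), and the group labels A and B are placeholders. `statistics.stdev` divides by n - 1, as Excel's STDEV does.

```python
import statistics

# Hypothetical index values, for illustration only.
group_a = [12, 15, 9, 20, 14, 17]
group_b = [5, 25, 11, 19, 8, 22]

for name, scores in (("A", group_a), ("B", group_b)):
    mean = statistics.mean(scores)     # "better on average": like Excel's AVERAGE
    spread = statistics.stdev(scores)  # "more spread": sample std. dev., like Excel's STDEV
    print(f"group {name}: mean = {mean:.2f}, stdev = {spread:.2f}")
```

In this made-up example the two groups have similar means but quite different spreads, which is the kind of contrast the hiring question asks you to think about.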

Fun Questions:

- How should data be gathered?

- Does it make much difference??

- Are larger samples always better???

An interesting historical context:

Political polls for presidential elections

Source (also has additional related information):

1936: Roosevelt vs. Landon

Popular Poll: Literary Digest Magazine

- Correctly called every election since 1916

- Mailed survey to 10 million voters

- Got 2.4 million responses

- Largest political poll ever in history!

Results:

"Landslide" for Roosevelt, but Literary Digest totally missed!

Why???

+ Problem with how the sample was chosen!

+ Who got the survey form? ("selection bias")

- Literary Digest readers

- Addresses from phone books

- Addresses from country club membership lists, ....

+ Who filled out the form? ("nonresponse bias")

- Makes sample "even less representative" of population...

+ When sampling is biased, a bigger sample size doesn't help;

it only repeats the mistake on a larger scale!

Big Lesson: need a sample that is representative of the population
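The point about biased samples can be illustrated with a small simulation. This is a sketch under invented assumptions, not a reconstruction of the 1936 poll: the population size, the 30% share of "listed" voters, and the support rates (40% among listed, 70% among others) are all made up. It compares a large sample drawn only from listed voters with a much smaller random sample drawn from everyone.

```python
import random

random.seed(1)

# Hypothetical population: each voter is (listed, vote),
# with vote = 1 for candidate R and 0 for candidate L.
# "listed" means the voter appears on the mailing lists used by the survey.
N = 200_000
population = []
for _ in range(N):
    listed = random.random() < 0.30          # assumed share of listed voters
    p_support = 0.40 if listed else 0.70     # assumed support rates
    population.append((listed, 1 if random.random() < p_support else 0))

true_share = sum(vote for _, vote in population) / N

# Biased survey: a very large sample, but drawn only from listed voters.
frame = [vote for listed, vote in population if listed]
biased = random.sample(frame, 50_000)
biased_est = sum(biased) / len(biased)

# Random survey: a much smaller sample, drawn from the whole population.
fair = random.sample([vote for _, vote in population], 1_000)
fair_est = sum(fair) / len(fair)

print(f"true share for R:      {true_share:.3f}")
print(f"biased, n = 50,000:    {biased_est:.3f}")
print(f"random, n = 1,000:     {fair_est:.3f}")
```

Increasing the biased sample size only pins down the wrong number more precisely, since the listed voters differ systematically from the rest.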

An alternate survey method: quota sampling

- Done by Gallup poll in 1936

- Idea: try hard to "make sample like population"

- Avoid non-response bias by personal interviews

(today done by telephone)

- Each interviewer has quotas:

                ___% male
                ___% income groups
                ___% religion ....

- Used sample of only 50,000 (<< 2.4 million) to correctly call election

- Also correctly called L. D.'s bad prediction

(by asking: "did you return the L. D. survey?")

- Quota sampling was used successfully until....

1948: Truman vs. Dewey

The polls, and the results:

Famous Picture: Truman smiling with newspaper headlined "Dewey Defeats Truman"

Why??? Problem with quota sampling?

"Unintentional bias" - consequence of "human choice" of pollsters

- E.g. may prefer to search for quota in "nicer neighborhood"

- Always gave 5 - 6% error (also in 1936)

- Only mattered for this close election

Main Lesson: Can't get a "representative sample" by human choice!

Toy E.g: Choose a "random number" from {1,2,3,4}

Interesting Fact: Too many people tend to choose 3

Solution: Choose samples "at random"

Paradoxical terminology: Random sampling is called "scientific sampling"

(has a better sound)

What does "random sampling" mean?

Each member of the population is "equally likely to be in sample".

Toy E.g: Use mechanism where each of 1,2,3,4 is chosen "1/4 of the time"
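One such mechanism, sketched in Python rather than Excel: `random.choice` picks each listed value with equal probability, so over many draws each of 1, 2, 3, 4 should come up close to 1/4 of the time. The second part draws a simple random sample from a hypothetical population of 100 ID numbers (an assumption for illustration), where every member is equally likely to be in the sample.

```python
import random
from collections import Counter

random.seed(2)

# Draw from {1, 2, 3, 4} many times; each value is equally likely.
draws = [random.choice([1, 2, 3, 4]) for _ in range(10_000)]
freq = Counter(draws)
for value in (1, 2, 3, 4):
    print(f"{value}: chosen {freq[value] / len(draws):.3f} of the time")

# Simple random sample of 10 from a hypothetical population of IDs 1..100,
# drawn without replacement.
population = list(range(1, 101))
sample = random.sample(population, 10)
print("sample:", sorted(sample))
```

Unlike people asked to "pick a number", the generator has no preference for 3.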

Note: This motivates study of probability theory

Added payoff to "scientific" (random!) sampling:

Can use Probability Theory to quantify uncertainty!

(learn about this in other statistics courses)

Generate random numbers using Excel,

Part 10, in Computing Tips
