Class Notes: Thursday
9/10/02
- Check new material on student pages (from Class Home Page)
- Excel construction of histograms (from Computing Tips)
(need to finish "chosen bins" part)
Continue Analysis of Buffalo Snowfall data...
Background: City of Buffalo, N.Y., known for heavy snows
Data: Time series of annual accumulated snowfalls (inches)
Recall Excel default histogram
constructed in: Toy Example Excel File
Comments:
- Excel chose binwidth = ~14
- Only 8 bins chosen: is the binwidth too large?
- Too few bins for "serious structure"?
- Note one year unusually small
Binwidth deliberately "too small"
- Tried binwidth = 3
- Requires many bins to include all the data
- Histogram looks "very bumpy"
- Hard to see "large scale features of distribution"
Binwidth "clearly too big"
- Tried binwidth = 30
- 10 times as big as above
- Averages taken over too big a range
- Obscures potentially interesting population structure
Binwidth "about right"???
- Tried binwidth = 10
- "In between" the two above?
- Large enough to remove "sampling artifacts"?
- Small enough to suggest 3 modes?
- Interesting question: are the modes "important underlying structure"???
Again highlights an important issue for histograms: choice of binwidth
Recommendation: try several binwidths, including both too big and too small
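The recommendation above can also be tried outside Excel. The following is a minimal Python sketch, assuming made-up stand-in data (the list `snow` is synthetic and is NOT the Buffalo snowfall series) and a hypothetical helper `histogram_counts`. It prints text histograms for the three binwidths tried in class.

```python
import math
import random

def histogram_counts(data, binwidth, start=None):
    """Count values per bin [start + k*w, start + (k+1)*w)."""
    if start is None:
        # Put the first bin edge on a multiple of the binwidth at or below the minimum.
        start = math.floor(min(data) / binwidth) * binwidth
    nbins = int((max(data) - start) // binwidth) + 1
    counts = [0] * nbins
    for x in data:
        counts[int((x - start) // binwidth)] += 1
    return start, counts

# Synthetic stand-in values, invented for illustration (NOT the Buffalo data).
random.seed(0)
snow = [random.gauss(80, 25) for _ in range(60)]

for w in (3, 10, 30):
    start, counts = histogram_counts(snow, w)
    print(f"binwidth = {w}: {len(counts)} bins")
    for k, c in enumerate(counts):
        print(f"  {start + k * w:6.0f} | " + "*" * c)
```

Running it with the small, medium, and large binwidths side by side makes the "bumpy" versus "oversmoothed" contrast easy to see in one place.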
Third Class Assignment: Explore a new data set with histograms
- Start with data in spreadsheet StudyHabitsIndexData.xls
* Number: an attempt to quantify "quality of study habits"
* Measured for 18 females and 20 males
* How do the populations compare???
- Address this question by an Excel analysis based on histograms
* Just try something, then we compare and discuss
- Display your results and conclusions on a new web page
* Linked to your home page
* You select format and style of presentation
* But insert some graphics generated by Excel
- Some graphics ideas to consider:
* Look at two separate histos, or some "combined version"???
* I.e. single graphic showing both "together" (experiment with Excel)
* Answers depend on binwidth, how to effectively display several?
- Some additional questions (answer on your web page, w/ discussion):
* Which group "looks better on average"?
* Can you "quantify this idea"? (e.g. give numerical measures)
* Which group "looks more spread" (i.e. has "greater variation")?
* Quantify this idea by using the STDEV function in Excel
* Suppose you are an employer who must hire
somebody from one of the two groups.
Would you hire a female or a male, if:
+ You are forced to choose "at random"
+ You can carefully select from a large group of each type
Why?
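For the "quantify this idea" questions above, one possible set of numerical measures is the mean (for "better on average") and the sample standard deviation (for "spread"). The sketch below is in Python rather than Excel and uses invented numbers: `group_a` and `group_b` are hypothetical and are NOT the contents of StudyHabitsIndexData.xls. Python's `statistics.stdev` uses the n - 1 divisor, the same convention as Excel's STDEV.

```python
import statistics

# Hypothetical index values, invented for illustration only.
group_a = [52, 61, 47, 70, 58, 66, 49, 63]
group_b = [40, 75, 55, 82, 35, 68, 59, 44]

def summarize(values):
    """Return (mean, sample standard deviation with n - 1 divisor)."""
    return statistics.mean(values), statistics.stdev(values)

for name, values in (("A", group_a), ("B", group_b)):
    mean, sd = summarize(values)
    print(f"group {name}: n = {len(values)}, mean = {mean:.1f}, stdev = {sd:.1f}")
```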
Fun Questions:
- How should data be gathered?
- Does it make much difference??
- Are larger samples always better???
An interesting historical context:
Political polls for presidential elections
Source (also has additional related information):
1936: Roosevelt vs. Landon
Popular Poll: Literary Digest Magazine
- Correctly called every election since 1916
- Mailed survey to 10 million voters
- Got 2.4 million responses
- Largest political poll ever in history!
Results:
"Landslide" for Roosevelt,
but Literary Digest totally missed!
Why???
+ Problem with how the sample was chosen!
+ Who got the survey form? ("selection bias")
- Literary Digest readers
- Addresses from phone books
- Addresses from country club membership lists, ....
+ Who filled out the form? ("nonresponse bias")
- Makes sample "even less representative" of population...
+ When sampling is biased, bigger sample size doesn't help,
Only repeats the mistake on a larger scale!
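The point that a bigger biased sample only repeats the mistake can be checked with a small simulation. This is a sketch under invented assumptions, not the 1936 figures: a hypothetical population in which 30% of voters are on "mailing lists" and 40% of those back candidate R, while 70% of everyone else backs candidate R. A large sample drawn only from the lists is compared with a small sample drawn from everyone.

```python
import random

random.seed(1936)

# Hypothetical population (all percentages are assumptions for illustration).
def make_voter():
    on_list = random.random() < 0.30
    backs_r = random.random() < (0.40 if on_list else 0.70)
    return on_list, backs_r

population = [make_voter() for _ in range(200_000)]

def share_for_r(sample):
    """Fraction of the sample backing candidate R."""
    return sum(backs for _, backs in sample) / len(sample)

true_share = share_for_r(population)

listed = [v for v in population if v[0]]
biased_big = random.sample(listed, 50_000)        # large, but only from the lists
random_small = random.sample(population, 1_000)   # small, but from everyone

print(f"true share for R:           {true_share:.3f}")
print(f"biased sample, n = 50,000:  {share_for_r(biased_big):.3f}")
print(f"random sample, n = 1,000:   {share_for_r(random_small):.3f}")
```

In this toy setup the large sample settles very precisely on the wrong answer, because it estimates the listed group rather than the whole population.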
Big Lesson:
need a sample that is representative of the population
An alternate survey method: quota sampling
- Done by Gallup poll in 1936
- Idea: try hard to "make sample like population"
- Avoid non-response bias by personal interviews
(today done by telephone)
- Each interviewer has quotas:
___% male
___% income groups
___% religion ....
- Used sample of only 50,000 (<< 2.4 million) to correctly call election
- Also correctly called L. D.'s bad prediction
(by asking: "did you return the L. D. survey?")
- Quota sampling was used successfully until....
1948: Truman vs. Dewey
The polls, and the results:
Famous Picture:
Truman smiling with newspaper saying "Dewey Defeats Truman"
Why??? Problem with quota sampling?
"Unintentional bias" - consequence of "human choice" of pollsters
- E.g. may prefer to search for quota in "nicer neighborhood"
- Always gave 5 - 6% error (also in 1936)
- Only mattered for this close election
Main Lesson:
Can't get a "representative sample" by human choice!
Toy E.g: Choose a "random number" from {1,2,3,4}
Interesting Fact:
Too many people tend to choose 3
Solution:
Choose samples "at random"
Paradoxical terminology: Random sampling is called "scientific sampling"
(has a better sound)
What does "random sampling" mean?
Each member of the population is "equally likely to be in sample".
Toy E.g:
Use mechanism where each of 1,2,3,4 is chosen "1/4 of the time":
Note: This motivates study of probability theory
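The toy mechanism can be imitated with a pseudo-random number generator. The following is a minimal Python sketch (the class itself uses Excel): it draws from {1,2,3,4} many times to check that each value comes up about 1/4 of the time, then takes a simple random sample from a hypothetical population labeled 1 to 100.

```python
import random
from collections import Counter

random.seed(4)

# Each of 1, 2, 3, 4 should be chosen about 1/4 of the time.
draws = [random.choice([1, 2, 3, 4]) for _ in range(10_000)]
freq = Counter(draws)
for value in (1, 2, 3, 4):
    print(value, freq[value] / len(draws))

# Simple random sample: every member is equally likely to be included.
population = list(range(1, 101))   # hypothetical population labeled 1..100
sample = random.sample(population, 10)
print(sorted(sample))
```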
Added payoff to "scientific" (random!) sampling:
Can use Probability Theory to quantify uncertainty!
(learn about this in other statistics courses)
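As a preview of what "quantify uncertainty" can mean, here is a sketch of the usual approximate 95% margin of error for a sample proportion, z * sqrt(p(1 - p) / n) with z = 1.96. It assumes a simple random sample and the normal approximation, so it does not describe the error of a biased or quota sample. The helper name `margin_of_error` is made up for this illustration.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Approximate 95% margin of error for a proportion (simple random sample)."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

for n in (100, 1_000, 50_000):
    moe = margin_of_error(0.5, n)
    print(f"n = {n:>6}: +/- {100 * moe:.1f} percentage points")
```

For a random sample of 1,000 this works out to roughly 3 percentage points, far smaller than the error a biased sample can carry at any size.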
Generate random numbers using Excel,