Class Notes 11/28/01
Last Time:
- Asymptotic Independence, using SiZer analysis
- relationship between Size, Time and "Rate" = Size/Time
- Saw hard-to-explain dependencies
- Introduced modification: "Large Variable Association"
Large Variable Association
Idea:
variation of "Asymptotic Dependence"
that is well defined for finite samples,
not just in limit as n → ∞
Goals:
1. Indicate whether large values are
"more (less) associated than usual".
2. Reduce to classical Asymptotic Independence
in limit as n → ∞
(in interesting cases)
Large Variable Association (cont.)
Approach:
a. For "properly adjusted marginals"
(adjust extreme value distributions as above?)
(surely "make scales comparable", e.g. divide by median)
b. Consider "polar coordinates" versions (R, θ) of (X, Y),
where R = sqrt(X^2 + Y^2)
and θ = arctan(Y / X).
c. Transform data to "angularly equally spaced",
by replacing sorted θs
with an equally spaced grid
(essentially Probability Integral Transform)
d. Study association of "large values", by density of θs
where corresponding R is large (> threshold)
e. Have "large variable association" when "pile up in middle"
f. Have "large variable disassociation" when "pile up at ends"
g. Use SiZer to study statistical significance
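A minimal sketch of steps a. through d., under stated assumptions: scales are made comparable by dividing by the median, the radius is Euclidean, the angle is arctan(Y / X), and "threshold" means keeping the n_top points with largest radius. The function name large_variable_angles and its arguments are hypothetical, and a plain histogram stands in for the SiZer analysis of step g.

```python
import numpy as np

def large_variable_angles(x, y, n_top):
    """Return PIT-transformed angles in (0, 1) for the n_top largest-radius points."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Step a (assumed form): make scales comparable by dividing by the median.
    x = x / np.median(x)
    y = y / np.median(y)
    # Step b: polar coordinates, assuming the Euclidean (L2) radius.
    r = np.hypot(x, y)
    theta = np.arctan2(y, x)
    # Step c: angular PIT, i.e. replace sorted angles by an equally spaced grid.
    ranks = np.argsort(np.argsort(theta, kind="stable"), kind="stable")
    u = (ranks + 0.5) / len(theta)
    # Step d: keep only angles whose radius is among the n_top largest.
    top = np.argsort(r)[-n_top:]
    return u[top]

rng = np.random.default_rng(0)
# Generator.pareto draws from the Lomax form; adding 1 gives classical Pareto(1.5).
x = rng.pareto(1.5, size=100_000) + 1.0
y = rng.pareto(1.5, size=100_000) + 1.0
u_top = large_variable_angles(x, y, n_top=1_000)

# Crude stand-in for step g: histogram of the thresholded angles.
counts, edges = np.histogram(u_top, bins=10, range=(0.0, 1.0))
print(counts)
```

Counts piling up in the middle bins would correspond to step e. (association), counts piling up in the end bins to step f. (disassociation).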
Large Variable Association (cont.)
Toy Example 1: 5,000,000 absolute values of Standard Normal
Large Variable Association movie (through thresholds)
- As expected looks very uniform
Toy Example 2: 5,000,000 independent Pareto (1.5)
Large Variable Association movie (through thresholds)
- As expected shows strong Large Variable Disassociation
Large Variable Association (cont.)
Toy Example 3: 5,000,000 independent Unif(0,1)
Raw data scatterplot, thresholded to 10,000
- Just have small corner left after thresholding
- Opposite effect from Pareto
- I.e. anticipate strong Association this time
Angular equal spacing scatterplot
- above effects diluted only slightly
- since full data angles not far from uniform
- continue to anticipate strong Association
Large Variable Association movie (through thresholds)
- As expected shows strong Large Variable Association
- Interesting "triangular distribution"
- Shape theoretically predictable?
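Toy Examples 1 to 3 can be imitated at much smaller scale. The sketch below is not the code behind the movies: it assumes median rescaling, a Euclidean radius, and a single threshold (the n_top largest radii), and it summarizes each case by one number, the fraction of thresholded PIT angles falling in the middle third. The helper middle_fraction is hypothetical. About 1/3 suggests uniformity, well below 1/3 suggests piling up at the ends, well above 1/3 suggests piling up in the middle.

```python
import numpy as np

def middle_fraction(x, y, n_top):
    """Fraction of the n_top largest-radius points with PIT angle in (1/3, 2/3)."""
    x = x / np.median(x)                      # assumed scale adjustment
    y = y / np.median(y)
    r = np.hypot(x, y)                        # assumed Euclidean radius
    theta = np.arctan2(y, x)
    # Angular PIT: ranks of the angles on an equally spaced grid in (0, 1).
    u = (np.argsort(np.argsort(theta)) + 0.5) / len(theta)
    top = np.argsort(r)[-n_top:]              # indices of the largest radii
    return float(np.mean((u[top] > 1 / 3) & (u[top] < 2 / 3)))

rng = np.random.default_rng(1)
n, n_top = 200_000, 1_000                     # far smaller than the 5,000,000 in the notes
samples = {
    "abs normal": (np.abs(rng.standard_normal(n)), np.abs(rng.standard_normal(n))),
    "Pareto(1.5)": (rng.pareto(1.5, n) + 1.0, rng.pareto(1.5, n) + 1.0),
    "Unif(0,1)": (rng.uniform(size=n), rng.uniform(size=n)),
}
fractions = {name: middle_fraction(x, y, n_top) for name, (x, y) in samples.items()}
for name, frac in fractions.items():
    print(f"{name:12s} middle-third fraction = {frac:.3f}")
```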
Large Variable Association (cont.)
Toy Example 4: 5,000,000 absolute correlated Gaussian
Raw data scatterplot, thresholded to 10,000
- Have outer piece of parabola after thresholding
- Opposite effect from Pareto
- I.e. anticipate strong Association this time
Angular equal spacing scatterplot
- this time get clear "spreading effect"
- but continue to anticipate strong Association
Large Variable Association movie (through thresholds)
- As expected shows strong Large Variable Association
- have a limiting Gaussian shape?
- Shape theoretically predictable?
Large Variable Association (cont.)
Toy Example 5: 5,000,000 independent Exponential
Raw data scatterplot, thresholded to 10,000
- Surprising shape?
- Symmetric with respect to L1 unit ball?
- More sensible to measure "distance" using L1 norm?
Angular equal spacing scatterplot
- very little change
- slight "pinching towards center" (as expected)
Large Variable Association movie (through thresholds)
- As expected shows no Large Variable Association
- have a limiting Gaussian shape?
- Shape theoretically predictable?
Large Variable Association (cont.)
Toy Example 6: 5,000,000 independent log-normal(5.28,2.46)
- Parameters from earlier "Response SiZe QQ fit"
Raw data scatterplot, thresholded to 10,000
- Similar lessons to Pareto
- Note single large observation (> twice others)
Angular equal spacing scatterplot
- very little change, as for Pareto
Large Variable Association movie (through thresholds)
- As for Pareto, shows no Large Variable Association
Large Variable Association (cont.)
Further notes:
- main difference with "Asymptotic Independence" is no
"extreme value normalization"
- makes Large Variable Association more sensible??
(at least much easier to implement)
- what are drawbacks of this approach?
- when are these the same in ???
- perhaps in "heavy tailed case"?
- could use this as a definition of "heavy tailed"??
- how will things go with the real data?
Large Variable Association (cont.)
Real Data Analysis: Response Size Data
Large Variable Association movies (through thresholds)
1. Time vs. Size,
Thursday afternoon (peak)
- Large Value Disassociated?
- Too many angles near 1??
- Again Large Value Disassociated?
- Again too many values near 1??
Sunday morning (off peak)
- Mostly disassociated?
- But some "other peaks near 0 and 1"?
- Again too many values near 1?
2. Rate (Size / Time) vs. Size, movie
Thursday afternoon (peak)
- More strongly disassociated
- This time too many near 0??
- Disassociated?
- Even more near 0
Sunday morning (off peak)
- Strong disassociation for larger threshold
- Too many near 0 for smaller threshold?
3. Inverse Rate (Time / Size) vs. Time, movie
Thursday afternoon (peak)
- Way too many near 1?
- Call this disassociation?
- Similar lessons, with spike not at 1?
Sunday morning (off peak)
- Similar lessons
Large Variable Association (cont.)
Overall Impressions:
- Poor job done of making "axes commensurate"
- Recall used extreme value rescaling before
- "Angular Prob. Int. Trans." hasn't solved this
- Disappointed that now "everything looks independent"
- Recall seemed found interesting associations earlier
- Were those "really there"???
Large Variable Association (cont.)
What about other time blocks?
For threshold fixed at 200,
2. Rate (Size / Time) vs. Size
3. Inverse Rate (Time / Size) vs. Time
- Lessons similar to the above
Large Variable Association (cont.)
How to make axes more commensurate?
There are a number of choices, e.g.
- Rescale (e.g. by median)
- Prob. Int. Trans. on Angles
- Apply Prob. Int. Trans. to axes (Copula trans.)
- Prob. Int. Trans. on radii (polar coord. Copula?)
- mix and match
Problem: hard to keep track of (understand) all of these
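One way to keep track of the choices listed above is to write each as a small function, so "mix and match" becomes function composition. This is only a sketch: the names pit, median_rescale, copula_transform and polar_pit are hypothetical, the data are assumed nonnegative (first quadrant), and mapping the transformed angle back to [0, π/2] and the transformed radius to (0, 1) is an assumption about how a "polar coord. copula" would be drawn.

```python
import numpy as np

def pit(v):
    """Empirical PIT: replace sorted values by an equally spaced grid in (0, 1)."""
    v = np.asarray(v, dtype=float)
    ranks = np.argsort(np.argsort(v, kind="stable"), kind="stable")
    return (ranks + 0.5) / len(v)

def median_rescale(x, y):
    """Rescale each axis by its own median."""
    return x / np.median(x), y / np.median(y)

def copula_transform(x, y):
    """PIT applied to each axis separately (copula transformation)."""
    return pit(x), pit(y)

def polar_pit(x, y):
    """PIT on angles and on radii, mapped back to Cartesian coordinates."""
    r = np.hypot(x, y)
    theta = np.arctan2(y, x)
    r_u = pit(r)                          # radii on an equal grid in (0, 1)
    theta_u = pit(theta) * (np.pi / 2)    # angles on an equal grid in (0, pi/2)
    return r_u * np.cos(theta_u), r_u * np.sin(theta_u)

# Stand-in data with very different scales (not the Response Size data).
rng = np.random.default_rng(2)
size = rng.lognormal(size=10_000) * 1_000.0
time = rng.exponential(size=10_000)

xm, ym = median_rescale(size, time)
u, v = copula_transform(size, time)
px, py = polar_pit(size, time)
# Mix and match: copula transform first, then PIT on angles and radii.
cx, cy = polar_pit(*copula_transform(size, time))
```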
Large Variable Association (cont.)
Approach: richer visualization
Setting: Thursday afternoon
Raw Data:
- Scatterplot shows size numbers far bigger than time
- log-log scatterplot shows selection bias
- Hard question: how to do better???
- Q-Q plots show axes far from commensurate
- not totally surprising, since units completely different
- all angles piled up on 0.
Median Rescaled:
- Now all on time axis, but one huge size
- single outlier has terrible effect...
- log-log Q-Q crosses diagonal at median
- does poor job making axes commensurate
- no multiplicative rescaling can help
- angular density surprisingly uniform
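A short derivation of why no multiplicative rescaling can help on the log-log Q-Q plot. The notation is introduced here for illustration and is not from the notes: q_X(p) is the p-th quantile of X, and a, b > 0 are scale factors. The conclusion assumes the log-log Q-Q curve does not have slope 1, which is presumably what the plot showed.

```latex
\begin{align*}
\text{log-log Q-Q curve of } (X, Y): \quad
  & p \mapsto \bigl(\log q_X(p),\ \log q_Y(p)\bigr), \\
\text{after rescaling to } (aX, bY): \quad
  & p \mapsto \bigl(\log a + \log q_X(p),\ \log b + \log q_Y(p)\bigr).
\end{align*}
```

Rescaling only translates the curve. Median rescaling (a = 1/median(X), b = 1/median(Y)) moves the point p = 1/2 to the origin, hence the crossing at the median, but the slope of the curve is unchanged, so it cannot be brought onto the diagonal.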
Median RS & Angular PIT:
- quite similar to above
- expected since angles already reasonably uniform
Med. RS & A. PIT & top 200:
- discussed above
- still see axes not very commensurate
Copula Transformed (PIT on both axes):
- axes very commensurate
- especially clear in Q-Q plots
- angular density clustered around 0.5
- caused by mapping to unit square
- have more angles in "direction of corner"
- an "angular version of copula" may be preferable
Copula and Angular PIT:
- not a large change is needed for this
- angles not extremely far from uniform
- some rotation away from diagonal
- note little distortion in marginal Q-Q plots
Copula & A. PIT & Top 200:
- Suggests Strong Association!?!
- Opposite of above conclusion
- Scatterplot is "outer rim" of above scatterplot
- all quite close to diagonal
- seems to "choose wrong part of data"
- sensible modification???
Angular PIT:
- some improvement over raw data
- but axes still non-commensurate
- except perhaps on log scale
Angular PIT & Radius PIT:
- best job of commensurate axes
- looks like useful "polar coordinate copula"?
A. PIT & R. PIT & top 200:
- chose "outer rim" of polar coord. copula
- more dense in some regions than others
- marginal Q-Q plots show axes commensurate
- SiZer map shows some disassociation
- Too many towards 0?
- i.e. still dominated by "large size"?
- But opposite direction from above
- Why is R. PIT needed, if only study angles?
A. PIT & top 200:
- same SiZer analysis (since same angles)
- but axes look much less commensurate??
- something wrong with this "commensurate" idea????
- this analysis better than "median rescaled" above???