Lecture11-28-01

Course OR 778

Class Notes 11/28/01

Last Time :

- Asymptotic Independence, using SiZer analysis

- relationship between Size, Time and "Rate" = Size/Time

- Saw hard-to-explain dependencies

- Introduced modification: "Large Variable Association"

Large Variable Association

Idea:

variation of "Asymptotic Dependence"

that is well defined for finite samples,

not just in the limit as n → ∞

Goals:

1. Indicate whether large values are

"more (less) associated than usual".

2. Reduce to classical Asymptotic Independence

in the limit as n → ∞ (in interesting cases)

Large Variable Association (cont.)

Approach:

a. For "properly adjusted marginals"

(adjust extreme value distributions as above?)

(surely "make scales comparable", e.g. divide by median)

b. Consider "polar coordinates" versions (R, θ) of the data (X, Y),

where R is the radius and θ is the angle.

c. Transform data to "angularly equally spaced",

by replacing the sorted angles with an equally spaced grid

(essentially Probability Integral Transform)

d. Study association of "large values", by density of the angles

where the corresponding radius is large (> threshold)

e. Have "large variable association" when the angles "pile up in middle"

f. Have "large variable disassociation" when the angles "pile up at ends"

g. Use SiZer to study statistical significance
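
Steps a to f above can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation behind the movies: the names `large_variable_angles` and `histogram` are hypothetical, angles are assumed to be rescaled from [0, π/2] to [0, 1], "threshold" is assumed to mean keeping the `top_k` points of largest radius, and a plain histogram stands in for the SiZer analysis of step g.

```python
import math
import random
import statistics

def large_variable_angles(x, y, top_k):
    """Angular-PIT angles (in (0, 1)) of the top_k points of largest radius."""
    # a. Make scales comparable: divide each coordinate by its median.
    mx, my = statistics.median(x), statistics.median(y)
    xs = [v / mx for v in x]
    ys = [v / my for v in y]
    # b. Polar coordinates; angle rescaled from [0, pi/2] to [0, 1] (assumption).
    radius = [math.hypot(a, b) for a, b in zip(xs, ys)]
    angle = [math.atan2(b, a) / (math.pi / 2) for a, b in zip(xs, ys)]
    # c. Angular PIT: replace the sorted angles by an equally spaced grid.
    n = len(angle)
    order = sorted(range(n), key=angle.__getitem__)
    pit = [0.0] * n
    for rank, i in enumerate(order):
        pit[i] = (rank + 0.5) / n
    # d. Keep the angles whose radius is among the top_k largest.
    largest = sorted(range(n), key=radius.__getitem__, reverse=True)[:top_k]
    return [pit[i] for i in largest]

def histogram(values, bins=10):
    """Counts of values in equal-width bins on [0, 1] (stand-in for SiZer)."""
    counts = [0] * bins
    for v in values:
        counts[min(int(v * bins), bins - 1)] += 1
    return counts

# Small-scale version of an independent Pareto(1.5) toy example.
random.seed(0)
n = 100_000
x = [random.paretovariate(1.5) for _ in range(n)]
y = [random.paretovariate(1.5) for _ in range(n)]
counts = histogram(large_variable_angles(x, y, top_k=1000))
print(counts)  # e./f.: look for piling up in the middle bins or in the end bins
```

For independent heavy-tailed data one would expect most of the retained angles to fall in the first and last bins, which is the "pile up at ends" pattern of step f.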

Large Variable Association (cont.)

Toy Example 1: 5,000,000 absolute values of Standard Normal

Large Variable Association movie (through thresholds)

- As expected looks very uniform

Toy Example 2: 5,000,000 independent Pareto (1.5)

Large Variable Association movie (through thresholds)

- As expected shows strong Large Variable Disassociation

Large Variable Association (cont.)

Toy Example 3: 5,000,000 independent Unif(0,1)

Raw data scatterplot, thresholded to 10,000

- Just have small corner left after thresholding

- Opposite effect from Pareto

- I.e. anticipate strong Association this time

Angular equal spacing scatterplot

- above effects diluted only slightly

- since full data angles not far from uniform

- continue to anticipate strong Association

Large Variable Association movie (through thresholds)

- As expected shows strong Large Variable Association

- Interesting "triangular distribution"

- Shape theoretically predictable?
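
A heuristic answer to the last question, as a sketch only. It ignores the angular equal spacing step (which changes little here, as noted above) and treats the threshold as a radius cutoff r_0 with 1 < r_0 < √2; the symbols r_0, ε and δ are introduced here purely for illustration. For independent Unif(0,1) coordinates the density in polar coordinates is r inside the unit square, so among points with radius above r_0 the angle density is proportional to the integral of r along the ray from r_0 to the edge of the square:

```latex
f(\theta \mid R > r_0) \;\propto\; \int_{r_0}^{\sec\theta} r \, dr
  \;=\; \tfrac{1}{2}\bigl(\sec^2\theta - r_0^2\bigr)_+ ,
  \qquad 0 \le \theta \le \pi/4 ,
% and symmetrically with \csc\theta for \pi/4 \le \theta \le \pi/2.

% Near the corner, write \theta = \pi/4 - \delta and r_0^2 = 2 - \varepsilon:
\sec^2\theta = 1 + \tan^2(\pi/4 - \delta) \approx 2 - 4\delta
\quad\Longrightarrow\quad
f(\theta \mid R > r_0) \;\propto\; \tfrac{1}{2}\bigl(\varepsilon - 4|\delta|\bigr)_+ .
```

To first order this is linear in |δ| on each side of the diagonal, i.e. a triangular density centered at θ = π/4, whose width shrinks as r_0 approaches √2.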

Large Variable Association (cont.)

Toy Example 4: 5,000,000 absolute correlated Gaussian

Raw data scatterplot, thresholded to 10,000

- Have outer piece of parabola after thresholding

- Opposite effect from Pareto

- I.e. anticipate strong Association this time

Angular equal spacing scatterplot

- this time get clear "spreading effect"

- but continue to anticipate strong Association

Large Variable Association movie (through thresholds)

- As expected shows strong Large Variable Association

- have a limiting Gaussian shape?

- Shape theoretically predictable?

Large Variable Association (cont.)

Toy Example 5: 5,000,000 independent Exponential

Raw data scatterplot, thresholded to 10,000

- Surprising shape?

- Symmetric with respect to L1 unit ball?

- More sensible to measure "distance" using L1 norm?
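
The L1 suggestion can be checked directly for this example. A short worked calculation, assuming unit-rate exponentials (a common rate other than 1 gives the same conclusion); S and W are names introduced here for the L1 "radius" and "angle":

```latex
f_{X,Y}(x, y) = e^{-x}\, e^{-y} = e^{-(x+y)}, \qquad x, y \ge 0 ,
% so the joint density is constant on each line x + y = c,
% i.e. on L1 spheres in the positive quadrant.

S = X + Y, \quad W = \frac{X}{X + Y}, \qquad
x = s w, \;\; y = s (1 - w), \;\; |J| = s ,

f_{S,W}(s, w) = s\, e^{-s} \cdot 1 , \qquad s \ge 0, \;\; 0 \le w \le 1 .
```

The joint density factors, so W is exactly Uniform(0,1) and independent of S: measured in L1 polar coordinates, independent exponentials show no association at any threshold.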

Angular equal spacing scatterplot

- very little change

- slight "pinching towards center" (as expected)

Large Variable Association movie (through thresholds)

- As expected shows no Large Variable Association

- have a limiting Gaussian shape?

- Shape theoretically predictable?

Large Variable Association (cont.)

Toy Example 6: 5,000,000 independent log-normal(5.28,2.46)

- Parameters from earlier "Response SiZe QQ fit"

Raw data scatterplot, thresholded to 10,000

- Similar lessons to Pareto

- Note single large observation (> twice others)

Angular equal spacing scatterplot

- very little change, as for Pareto

Large Variable Association movie (through thresholds)

- As for Pareto, shows no Large Variable Association

Large Variable Association (cont.)

Further notes:

- main difference with "Asymptotic Independence" is no

"extreme value normalization"

- makes Large Variable Association more sensible??

(at least much easier to implement)

- what are drawbacks of this approach?

- when are these the same in ???

- perhaps in "heavy tailed case"?

- could use this as a definition of "heavy tailed"??

- how will things go with the real data?

Large Variable Association (cont.)

Real Data Analysis: Response Size Data

Large Variable Association Movies (through thresholds)

1. Time vs. Size, movie

Thursday afternoon (peak)

- Large Value Disassociated?

- Too many angles near 1??

Thursday evening

- Again Large Value Disassociated?

- Again too many values near 1??

Sunday morning (off peak)

- Mostly disassociated?

- But some "other peaks near 0 and 1"?

- Again too many values near 1?

2. Rate (Size / Time) vs. Size, movie

Thursday afternoon (peak)

- More strongly disassociated

- This time too many near 0??

Thursday evening

- Disassociated?

- Even more near 0

Sunday morning (off peak)

- Strong disassociation for larger threshold

- Too many near 0 for smaller threshold?

3. Inverse Rate (Time / Size) vs. Time, movie

Thursday afternoon (peak)

- Way too many near 1?

- Call this disassociation?

Thursday evening

- Similar lessons, with spike not at 1?

Sunday morning (off peak)

- Similar lessons

Large Variable Association (cont.)

Overall Impressions:

- Poor job done of making "axes commensurate"

- Recall used extreme value rescaling before

- "Angular Prob. Int. Trans." hasn't solved this

- Disappointed that now "everything looks independent"

- Recall we seemed to find interesting associations earlier

- Were those "really there"???

Large Variable Association (cont.)

What about other time blocks?

For threshold fixed at 200,

1. Time vs. Size

2. Rate (Size / Time) vs. Size

3. Inverse Rate (Time / Size) vs. Time

- Lessons similar to the above

Large Variable Association (cont.)

How to make axes more commensurate?

There are a number of choices, e.g.

- Rescale (e.g. by median)

- Prob. Int. Trans. on Angles

- Apply Prob. Int. Trans. to axes (Copula trans.)

- Prob. Int. Trans. on radii (polar coord. Copula?)

- mix and match

Problem: hard to keep track of (understand) all of these
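
Two of the choices above, median rescaling and the copula transformation, can be compared in a small sketch. The function names (`median_rescale`, `rank_pit`, `upper_quantile`) are hypothetical, the PIT is taken to be the rank-based empirical version, and the simulated "size" and "time" variables are stand-ins rather than the real data; reading log-normal(5.28, 2.46) as (mu, sigma) on the log scale is also an assumption.

```python
import random
import statistics

def median_rescale(values):
    """Multiplicative rescaling: divide by the sample median."""
    m = statistics.median(values)
    return [v / m for v in values]

def rank_pit(values):
    """Empirical probability integral transform: value -> (rank + 0.5) / n."""
    n = len(values)
    order = sorted(range(n), key=values.__getitem__)
    out = [0.0] * n
    for rank, i in enumerate(order):
        out[i] = (rank + 0.5) / n
    return out

def upper_quantile(values, p=0.99):
    """Crude empirical quantile (an order statistic), enough for a comparison."""
    s = sorted(values)
    return s[int(p * len(s))]

random.seed(0)
size = [random.lognormvariate(5.28, 2.46) for _ in range(10_000)]  # heavy tailed
time = [random.expovariate(1.0) for _ in range(10_000)]            # light tailed

# Median rescaling matches the medians but not the upper tails.
print(upper_quantile(median_rescale(size)), upper_quantile(median_rescale(time)))

# PIT on both axes (copula transformation): both marginals become the same grid.
u, v = rank_pit(size), rank_pit(time)
print(sorted(u) == sorted(v))
```

The design point is that `median_rescale` is a single multiplicative constant per axis, so it cannot reconcile marginals with different tail behavior, while `rank_pit` forces both marginals onto the same uniform grid at the cost of discarding all marginal information.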

Large Variable Association (cont.)

Approach: richer visualization
Setting: Thursday afternoon
Raw Data:
    -    Scatterplot shows size numbers far bigger than time
    -    log-log scatterplot shows selection bias
    -    Hard question: how to do better???
    -    Q-Q plots show axes far from commensurate
    -    not totally surprising, since units completely different
    -    all angles piled up on 0.
Median Rescaled:
    -    Now all on time axis, but one huge size
    -    single outlier has terrible effect...
    -    log-log Q-Q crosses diagonal at median
    -    does poor job making axes commensurate
    -    no multiplicative rescaling can help
    -    angular density surprisingly uniform
Median RS & Angular PIT:
    -    quite similar to above
    -    expected since angles already reasonably uniform
Med. RS & A. PIT & top 200:
    -    discussed above
    -    still see axes not very commensurate
Copula Transformed (PIT on both axes):
    -    axes very commensurate
    -    especially clear in Q-Q plots
    -    angular density clustered around 0.5
    -    caused by mapping to unit square
    -    have more angles in "direction of corner"
    -    an "angular version of copula" may be preferable
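
The clustering of angles around 0.5 can be seen in the simplest reference case. As a sketch, assume the transformed coordinates were independent Unif(0,1) (a copula only guarantees uniform marginals, so this is an idealization). The angle θ in [0, π/2] then has density:

```latex
f_\Theta(\theta) =
\begin{cases}
  \displaystyle \int_0^{\sec\theta} r \, dr = \tfrac{1}{2}\sec^2\theta , & 0 \le \theta \le \pi/4 , \\[2ex]
  \displaystyle \int_0^{\csc\theta} r \, dr = \tfrac{1}{2}\csc^2\theta , & \pi/4 \le \theta \le \pi/2 .
\end{cases}
```

This rises from 1/2 at θ = 0 to 1 at θ = π/4 and falls back to 1/2 at θ = π/2: rays toward the corner of the unit square are longer, so they carry more points, and on the [0, 1] angle scale the peak sits at 0.5.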
Copula and Angular PIT:
    -    not a large change is needed for this
    -    angles not extremely far from uniform
    -    some rotation away from diagonal
    -    note little distortion in marginal Q-Q plots
Copula & A. PIT & Top 200:
    -    Suggest Strong Association!?!
    -    Opposite of above conclusion
    -    Scatterplot is "outer rim" of above scatterplot
    -    all quite close to diagonal
    -    seems to "choose wrong part of data"
    -    sensible modification???
Angular PIT:
    -    some improvement over raw data
    -    but axes still non-commensurate
    -    except perhaps on log scale
Angular PIT & Radius PIT:
    -    best job of commensurate axes
    -    looks like useful "polar coordinate copula"?
A. PIT & R. PIT & top 200:
    -    chose "outer rim" of polar coord. copula
    -    more dense in some regions than others
    -    marginal Q-Q plots show axes commensurate
    -    SiZer map shows some disassociation
    -    Too many towards 0?
    -    i.e. still dominated by "large size"?
    -    But opposite direction from above
    -    Why is R. PIT needed, if only study angles?
A. PIT & top 200:
    -    same SiZer analysis (since same angles)
    -    but axes look much less commensurate??
    -    something wrong with this "commensurate" idea????
    -    this analysis better than "median rescaled" above???
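
The remark that the A. PIT & top 200 analysis uses the same angles as the A. PIT & R. PIT & top 200 analysis can be checked with a small sketch. The helper names are hypothetical, both PITs are taken to be rank based, the simulated data are stand-ins for the real data, and mapping the transformed (radius, angle) pairs back to the plane is an assumption about how the "polar coordinate copula" scatterplot is drawn.

```python
import math
import random

def rank_pit(values):
    """Empirical probability integral transform: value -> (rank + 0.5) / n."""
    n = len(values)
    order = sorted(range(n), key=values.__getitem__)
    out = [0.0] * n
    for rank, i in enumerate(order):
        out[i] = (rank + 0.5) / n
    return out

def polar(x, y):
    """Radius and angle, with the angle rescaled from [0, pi/2] to [0, 1]."""
    radius = [math.hypot(a, b) for a, b in zip(x, y)]
    angle = [math.atan2(b, a) / (math.pi / 2) for a, b in zip(x, y)]
    return radius, angle

def top_k_indices(radius, k):
    """Indices of the k points of largest radius."""
    order = sorted(range(len(radius)), key=radius.__getitem__, reverse=True)
    return set(order[:k])

random.seed(0)
n = 20_000
x = [random.paretovariate(1.5) for _ in range(n)]
y = [random.expovariate(1.0) for _ in range(n)]

radius, angle = polar(x, y)
angle_pit = rank_pit(angle)    # A. PIT
radius_pit = rank_pit(radius)  # R. PIT

# "Polar coordinate copula": transformed (radius, angle) mapped back to the plane.
x_new = [r * math.cos(t * math.pi / 2) for r, t in zip(radius_pit, angle_pit)]
y_new = [r * math.sin(t * math.pi / 2) for r, t in zip(radius_pit, angle_pit)]

# R. PIT is monotone in the radius, so (absent tied radii) it selects the
# same top 200 points, and hence the same angles, as the untransformed radius.
same = top_k_indices(radius, 200) == top_k_indices(radius_pit, 200)
print(same)
```

Because the radius PIT only relabels the radii in a rank-preserving way, it changes how the scatterplot and marginal Q-Q plots look but not which angles enter the thresholded analysis.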