Class Notes 10/22/01
Last Time:
- In Context of Simple Model:
- Investigated "Indep'ce" part of Poisson process starts
- Saw similar results to before
- I.e. Cluster Poisson sensible, Weibull interarrival not
- Studied Flow Scatterplots & "asymptotic independence"
- Boundary problems of IP flow data were too severe
- Introduced new HTTP Response Size data
- Looked at summary statistics
- Tried first version of Scatterplots
New HTTP Response Data
Data sources: 4 hour blocks of packet headers
"Morning": 8:00-12:00
"Afternoon": 13:00-17:00
"Evening": 19:30-23:30
Gathered at UNC Main Link
During 7 days in April 2001
(not consecutive days due to hardware problems)
(but each weekday is represented)
Headers filtered to only HTTP packets
(recall subset of IP and TCP)
Updated information: these are "real responses",
i.e. individual files sent by TCP.
This information is extracted from packet headers.
Not simply determined only by addresses
(as done for IP data earlier)
New Flow Scatterplots
Recall Idea: look for
"asymptotic independence"
Expectation: for heavy tailed distributions,
scatterplot should "hug axes"
Approach: for new HTTP Response Sizes,
plot "response time duration (sec)" vs. "size of response (bytes)"
Combined Graphic (all 21 time blocks)
Personal "axis hugging rating":
Monday Morning: Good
Monday Afternoon: Good
Monday Evening: OK
Tuesday Morning: Very good
Tuesday Afternoon: Good
Tuesday Evening: Poor
Wednesday Morning: Good
Wednesday Afternoon: Very good
Wednesday Evening: OK
Thursday Morning: OK
Thursday Afternoon: Good
Thursday Evening: Very good
Friday Morning: Poor
Friday Afternoon: Poor
Friday Evening: Good
Saturday Morning: Good
Saturday Afternoon: OK
Saturday Evening: OK
Sunday Morning: Good
Sunday Afternoon: Poor
Sunday Evening: OK
New Flow Scatterplots (cont.)
Other Observations:
- No apparent patterns??? With respect to:
- day of week?
- time of day?
- traffic load?
- max response size?
- Interesting "streaks" at some times
(strongest on Sunday morning)
- suggests a few common "transmission rates"
- most clear on Sun. morning, since least packet loss
New Flow Scatterplots (cont.)
Extreme Value Rescaled Analysis:
Idea: put everything on "same tail index scale"
Quantities: Response Size (bytes) and Response Duration (sec)
Approach: estimate "tail index" (for Size and Duration separately),
then apply "axis hugging" view to joint distribution of the rescaled pair
New Flow Scatterplots (cont.)
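Estimation of Tail Index $\alpha$:
Hill Estimator
Reference: sec. 6.4.2 of:
Embrechts, P., Klueppelberg, C. and Mikosch, T. (1997) Modelling Extremal Events, Springer.
For "reverse order statistics", $X_{(1)} \ge X_{(2)} \ge \dots \ge X_{(n)}$,
compare the "previous data point" $X_{(k)}$ with the tail, on a log scale, to get
$\frac{1}{k} \sum_{j=1}^{k} \log X_{(j)} - \log X_{(k)}$
I.e. define (in the form given in the reference)
$\hat{\alpha}_k = \left( \frac{1}{k} \sum_{j=1}^{k} \log X_{(j)} - \log X_{(k)} \right)^{-1}$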
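A minimal sketch of the Hill estimator in its standard textbook form: sort the data largest first, take the mean log of the k largest points, subtract the log of the k-th largest, and invert. The function name `hill_estimator` and the synthetic Pareto sample are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def hill_estimator(data, k):
    """Hill estimate of the tail index from the k largest observations."""
    x = np.sort(np.asarray(data, dtype=float))[::-1]  # reverse order statistics
    if not 1 < k <= x.size:
        raise ValueError("k must satisfy 1 < k <= len(data)")
    if x[k - 1] <= 0:
        raise ValueError("the k largest observations must be positive")
    log_tail = np.log(x[:k])
    # mean log of the k largest points minus the log of the k-th largest
    gap = log_tail.mean() - log_tail[-1]
    return 1.0 / gap

# Synthetic check (assumed data): classical Pareto with tail index 1.5
rng = np.random.default_rng(0)
sample = rng.pareto(1.5, size=100_000) + 1.0
alpha_hat = hill_estimator(sample, k=2000)
print(alpha_hat)
```

On real response sizes the estimate will depend on the choice of k, so a single call like this is only a starting point.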
New Flow Scatterplots (cont.)
Results of Hill Estimation:
graphic
- Strong dependence on number of data points
- estimates mostly in range 1 to 2
(recall infinite variance, but finite mean)
- but not always (sometimes above, sometimes below)
- Considered number of data points from 100 to 2400
- Worth going smaller than 100???
- Which to use as an estimate?
- Chose the average over 100 to 2000
- Lower bound since "don't trust below that"
- Upper bound since use largest 2000 in scatterplot
- Is the average sensible????
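The averaging choice above can be sketched as follows: compute the Hill estimate for every number of upper data points k between 100 and 2000 in one pass (via cumulative sums), then average. The function name `hill_curve`, its defaults, and the synthetic sample are assumptions for illustration only.

```python
import numpy as np

def hill_curve(data, k_min=100, k_max=2000):
    """Hill estimates for every k in [k_min, k_max], via cumulative sums."""
    x = np.sort(np.asarray(data, dtype=float))[::-1]  # largest first
    if not 2 <= k_min <= k_max <= x.size:
        raise ValueError("need 2 <= k_min <= k_max <= len(data)")
    logs = np.log(x[:k_max])
    ks = np.arange(1, k_max + 1)
    # entry k-1: mean log of the k largest points minus log of the k-th largest
    gaps = np.cumsum(logs) / ks - logs
    keep = ks >= k_min
    return ks[keep], 1.0 / gaps[keep]

# Synthetic data (assumed): classical Pareto with tail index 1.3
rng = np.random.default_rng(1)
sample = rng.pareto(1.3, size=50_000) + 1.0
ks, alphas = hill_curve(sample)
alpha_avg = alphas.mean()  # the "average over 100 to 2000" choice
print(ks[0], ks[-1], alpha_avg)
```

Plotting `alphas` against `ks` gives the usual Hill plot, which makes the dependence on k visible before any averaging.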
New Flow Scatterplots (cont.)
Now for scatterplot of joint distribution of rescaled Size and Duration:
"Biggest 2000" now chosen by "distance to origin",
avoids potential earlier
"biasing towards size" effects
Results:
Extreme Value rescaled Scatterplot
- Not much different from unscaled version?
- Since most estimated tail indices near 1??
- Thursday Evening has largest relative difference?
- Rescaled Thursday Evening has less "axis hugging"?
- I.e. this transformation "hurts" axis hugging???
(this case only)
- most others have "more axis hugging"??
- Only because most powers are larger than one?
- Visual impression ("axis hugging") driven mostly by
few biggest data points???
- What about other scales?
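A rough sketch of the selection step, under two assumptions that the notes do not spell out: that "rescaling" means raising each variable to its estimated tail index, and that "distance to origin" is Euclidean distance in the rescaled coordinates. The function name `largest_by_radius` and its arguments are hypothetical.

```python
import numpy as np

def largest_by_radius(size, duration, alpha_size, alpha_duration, m=2000):
    """Rescale each variable by its tail-index power, keep the m points farthest from 0."""
    u = np.asarray(size, dtype=float) ** alpha_size          # assumed rescaling
    v = np.asarray(duration, dtype=float) ** alpha_duration  # assumed rescaling
    radius = np.hypot(u, v)                                  # distance to origin
    idx = np.argsort(radius)[-m:]                            # indices of the m largest radii
    return u[idx], v[idx]

# Tiny illustration with made-up numbers
u_sel, v_sel = largest_by_radius([1, 4, 9, 16], [1, 1, 1, 1], 0.5, 0.5, m=2)
print(u_sel, v_sel)
```

Selecting on the radius treats both coordinates symmetrically, which is the point of avoiding a "biggest sizes only" selection.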
New Flow Scatterplots (cont.)
Joint Distribution of Size and Duration (cont.)
Same scatterplot, on Square Root Scale
- significantly reduces visual impression of "axis hugging"
- Wednesday Afternoon looks best?
- "asymptotic independence" depends on "scale"?
New Flow Scatterplots (cont.)
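Joint Distribution of Size and Duration (cont.)
A deeper look at power transformations:
The Box-Cox Power family
Given a power $\lambda$, modify the usual "power transformation" $x^{\lambda}$
by a linear transformation:
$\frac{x^{\lambda} - 1}{\lambda}$
Reason: this gives "continuity at 0":
$\lim_{\lambda \to 0} \frac{x^{\lambda} - 1}{\lambda} = \log x$
i.e. includes the log transformation as well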
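A small sketch of the Box-Cox family, to make the "continuity at 0" point concrete: the shifted and scaled power (x^p - 1)/p approaches log x as the power p goes to 0, so the log is used at p = 0. The function name `box_cox` and the sample values are illustrative assumptions.

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox transform: (x**lam - 1) / lam, with log(x) at lam = 0."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox needs strictly positive data")
    if lam == 0:
        return np.log(x)               # the limiting case
    return (x ** lam - 1.0) / lam

x = np.array([1.0, 2.0, 4.0, 100.0])
print(box_cox(x, 1.0))    # power 1: just a shift of the data
print(box_cox(x, 0.5))    # square-root-like scale
print(box_cox(x, 1e-8))   # numerically close to log(x)
print(box_cox(x, 0.0))    # log(x)
```

Because power 1 is only a shift, a scatterplot at power 1 looks the same as on the original scale, while powers between 1 and 0 interpolate toward the log scale.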
New Flow Scatterplots (cont.)
Box-Cox transformed Joint Distribution of Size and Duration
Thursday Evening graphic
- Looked best on original scale
- Consider the range of powers from 1 (no change) to 0 (log)
- Smaller power means less "axis hugging"
- "Axis hugging" needs to be defined in terms of scale????
- Seem to need: improved finite sample definition of
"asymptotic indep."??
New Flow Scatterplots (cont.)
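Big Problem with Hill Estimation:
Choice of number of data points to use
A simplification:
can reduce from choosing two values to one
by estimating the ratio of the two tail indices,
using the ratio of the two Hill estimates
(for the same number of data points)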
Then plot
Tuning of Hill Ratio Estimator: graphic
- Often rather close to 1?
- Thursday evening again an "extreme case"
(but recall transforming hurt "axis hugging")
- Stronger dependence on number of data points?
- Ratio version any better than just "using same number of data points"?
- Worth pursuing???
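One possible reading of the ratio idea, as a sketch: compute the Hill quantity for Size and for Duration with the same number of upper data points k, and take the ratio. Which variable goes in the numerator is an assumption here, as are the names `hill_gap` and `hill_ratio` and the synthetic samples.

```python
import numpy as np

def hill_gap(data, k):
    """Mean log of the k largest points minus log of the k-th largest (1 / Hill estimate)."""
    x = np.sort(np.asarray(data, dtype=float))[::-1]
    logs = np.log(x[:k])
    return logs.mean() - logs[-1]

def hill_ratio(size, duration, k):
    """Estimated (tail index of size) / (tail index of duration), same k for both."""
    # ratio of tail indices = inverse ratio of the gaps
    return hill_gap(duration, k) / hill_gap(size, k)

# Synthetic data (assumed): tail indices 1.2 and 1.8, so the ratio is about 0.67
rng = np.random.default_rng(3)
size = rng.pareto(1.2, size=100_000) + 1.0
duration = rng.pareto(1.8, size=100_000) + 1.0
ratio = hill_ratio(size, duration, k=2000)
print(ratio)
```

The samples here are drawn independently, so this says nothing about how the ratio behaves for dependent Size and Duration data.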
New Flow Scatterplots (cont.)
Things that could be done next:
1. Follow up on "streaks", i.e. understand those transm'n rates?
- Do overlay of important ones on all plots?
- If they appear frequently, get expert interpretation?
- Apply 2-d SiZer type analysis to scatterplots?
2. Look at log scale versions?
- What do we expect to learn?
3. Get quantitative about "axis hugging"?
- Useful finite sample def'n of "asy. indep."???
- needs attention to "power - scale" of data??
How interesting are these???
New Response Size Q-Q Plots
Another view of New Response Size Data:
Extreme Value Tail Index
Recall Intuition:
- Shape parameter of Pareto (polynomial power)
- Strong relation to Long Range Dependence
- in Mice and Elephants plots (graphic)
- tail index $\alpha$ between 1 and 2 in Duration Distributions,
implies Classical LRD in aggregated time series
- Strong relation to moments:
- for $\alpha < 1$ have infinite mean
- for $1 < \alpha < 2$ have finite mean but infinite variance
- for $\alpha > 2$ both mean and variance are finite
- similar for larger $\alpha$ and higher moments
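The relation between tail index and moments can be checked on the classical Pareto distribution, where the p-th moment is finite exactly when p is below the tail index. A minimal sketch; the function name `pareto_moment` and the parameter `x_min` are illustrative assumptions.

```python
import math

def pareto_moment(alpha, p, x_min=1.0):
    """E[X**p] for a Pareto variable with P(X > x) = (x_min / x)**alpha, x >= x_min."""
    if p >= alpha:
        return math.inf                        # moment does not exist
    return alpha * x_min ** p / (alpha - p)

for alpha in (0.8, 1.5, 3.0):
    mean = pareto_moment(alpha, 1)
    second = pareto_moment(alpha, 2)
    # variance is finite only when the second moment is finite
    variance = second - mean ** 2 if math.isfinite(second) else math.inf
    print(alpha, mean, variance)
```

The three values of the tail index cover the three cases listed above: infinite mean, finite mean with infinite variance, and both finite.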
New Response Size Q-Q Plots (cont.)
Simple, straightforward Estimation of tail index:
Slope of CCDF (i.e. 1 - CDF) on log - log scale
Log-log CCDF: graphic
- All 21 time blocks appear as thin blue lines
- Each Individual labeled and highlighted in thick red
- Not very "linear"?
- Suggests classical extreme value theory
hasn't "kicked in" yet???
- Note "shapes" of curves surprisingly constant
- Suggests curvature is not "random phenomenon"!
- Instead something systematic about internet traffic?
- Point worth deeper statistical confirmation??
- Suggests enhancement of current mathematics????
- Friday evening an "extreme point"? (least steep?)
- Many Resp. Sizes near 400 bytes???
(also for Friday Afternoon, nowhere else?)
- Worth plotting data between 0.999 quantile and max???
(1,000 to 7,000 of these for each time block....)
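A minimal sketch of the log-log CCDF view, assuming nothing about the real data: compute the empirical CCDF at each sorted data point, take logs of both axes, and read the tail index off as minus the slope. The function name `log_log_ccdf` and the synthetic Pareto sample are assumptions; a straight-line fit over all points is only meaningful here because the synthetic sample is exactly Pareto.

```python
import numpy as np

def log_log_ccdf(data):
    """Return (log10 x, log10 CCDF) at the sorted data points."""
    x = np.sort(np.asarray(data, dtype=float))
    n = x.size
    ccdf = 1.0 - np.arange(1, n + 1) / n   # 1 - i/n at the i-th smallest point
    keep = (x > 0) & (ccdf > 0)            # drop the maximum, where CCDF = 0
    return np.log10(x[keep]), np.log10(ccdf[keep])

# Synthetic data (assumed): classical Pareto with tail index 1.5
rng = np.random.default_rng(2)
sample = rng.pareto(1.5, size=100_000) + 1.0
log_x, log_c = log_log_ccdf(sample)
slope, intercept = np.polyfit(log_x, log_c, 1)
print(-slope)   # rough tail index estimate
```

For curves that are "not very linear", as observed above, a single fitted slope would hide the curvature, so plotting `log_c` against `log_x` is the more informative step.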
New Response Size Q-Q Plots (cont.)
Now estimate "tail index" by studying:
Slopes: graphic
- Simply use difference quotients from log-log CCDF
- Numerical problem: 0 denominators
- Reset to bottom of plot
- Suggest ignoring those
- Could use fancier differentiation (e.g. over bigger range)
- But this "raw data" shows interesting structure
-
"Almost always" have
(interesting for LRD)
- But no apparent "tail limit" for the tail index?
- So do not satisfy "classical heavy tail definition"?
- But still clearly "intuitively heavy tailed"?
- Worth exploring alternate definitions?
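The difference-quotient slopes described above can be sketched as follows, taking the "suggest ignoring those" option for 0 denominators (tied response sizes) rather than resetting them to the bottom of the plot. The function name `ccdf_slopes` and the toy data are assumptions, and the data are assumed strictly positive.

```python
import numpy as np

def ccdf_slopes(data):
    """Difference quotients of the log-log CCDF, skipping 0 denominators."""
    x = np.sort(np.asarray(data, dtype=float))
    n = x.size
    ccdf = 1.0 - np.arange(1, n + 1) / n
    log_x = np.log10(x[:-1])               # drop the maximum, where CCDF = 0
    log_c = np.log10(ccdf[:-1])
    dx, dc = np.diff(log_x), np.diff(log_c)
    ok = dx > 0                            # tied sizes give 0 denominators: ignore
    return log_x[1:][ok], dc[ok] / dx[ok]

# Toy data with one tie (the repeated 2.0)
positions, slopes = ccdf_slopes([1.0, 2.0, 2.0, 4.0, 8.0, 16.0])
print(10 ** positions, slopes)
```

Minus each slope is a local tail-index value, so plotting `-slopes` against `positions` shows whether the slopes settle toward a limit in the tail.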
New Mathematics for "Heavy Tails"?
Version 1:
For some ,
Open Problem 1: For the simple Model,
with Version 1 tailed Duration Dist'n, is
(i.e. have index
LRD)