Lecture10-22-01

Course OR 778

Class Notes 10/22/01

Last Time:

- In Context of Simple Model:

- Investigated "Indep'ce" part of Poisson process starts

- Saw similar results to before

- I.e. Cluster Poisson sensible, Weibull interarrival not

- Studied Flow Scatterplots & "asymptotic independence"

- Boundary problems of IP flow data were too severe

- Introduced new HTTP Response Size data

- Looked at summary statistics

- Tried first version of Scatterplots

New HTTP Response Data

Data sources: 4 hour blocks of packet headers

"Morning": 8:00-12:00

"Afternoon": 13:00-17:00

"Evening": 19:30-23:30

Gathered at UNC Main Link

During 7 days in April 2001

(not consecutive days due to hardware problems)

(but each weekday is represented)

Headers filtered to only HTTP packets

(recall subset of IP and TCP)

Updated information: these are "real responses",

i.e. individual files sent by TCP.

This information is extracted from packet headers.

Not simply determined only by addresses

(as done for IP data earlier)

New Flow Scatterplots

Recall Idea: look for "asymptotic independence"

Expectation: for heavy tailed distributions,

scatterplot should "hug axes"

Approach: for new HTTP Response Sizes,

plot "response time duration (sec)" vs. "size of response (bytes)"
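A quick simulation can illustrate why independent heavy tailed coordinates "hug axes". The sketch below is a toy under stated assumptions, not the analysis behind the graphics: Pareto samples with tail index 1.5 stand in for the data, and the function names, the top-200 cutoff and the 0.1 ratio threshold are all hypothetical choices.

```python
import numpy as np

def pareto_sample(alpha, size, rng):
    """Draw Pareto(alpha) values on [1, inf) by inverse-CDF sampling."""
    u = rng.random(size)                  # u in [0, 1), so 1 - u is in (0, 1]
    return (1.0 - u) ** (-1.0 / alpha)

def axis_hugging_fraction(x, y, n_top=200, ratio=0.1):
    """Among the n_top points farthest from the origin, return the fraction
    whose smaller coordinate is below `ratio` times the larger one.
    (Hypothetical summary; n_top and ratio are illustrative choices.)"""
    r = np.hypot(x, y)
    top = np.argsort(r)[-n_top:]
    small = np.minimum(x[top], y[top])
    large = np.maximum(x[top], y[top])
    return float(np.mean(small < ratio * large))

rng = np.random.default_rng(0)
n = 100_000
x = pareto_sample(1.5, n, rng)
y_indep = pareto_sample(1.5, n, rng)      # independent of x
y_dep = x * rng.uniform(0.5, 2.0, n)      # strongly dependent on x

frac_indep = axis_hugging_fraction(x, y_indep)
frac_dep = axis_hugging_fraction(x, y_dep)
print("independent:", frac_indep, " dependent:", frac_dep)
```

In the independent case nearly all of the largest points sit close to one axis, while in the dependent case none do, since both coordinates are then large together.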

Combined Graphic (all 21 time blocks)

Personal "axis hugging rating":

Monday Morning: Good

Monday Afternoon: Good

Monday Evening: OK

Tuesday Morning: Very good

Tuesday Afternoon: Good

Tuesday Evening: Poor

Wednesday Morning: Good

Wednesday Afternoon: Very good

Wednesday Evening: OK

Thursday Morning: OK

Thursday Afternoon: Good

Thursday Evening: Very good

Friday Morning: Poor

Friday Afternoon: Poor

Friday Evening: Good

Saturday Morning: Good

Saturday Afternoon: OK

Saturday Evening: OK

Sunday Morning: Good

Sunday Afternoon: Poor

Sunday Evening: OK

New Flow Scatterplots (cont.)

Other Observations:

- No apparent patterns??? With respect to:

- day of week?

- time of day?

- traffic load?

- max response size?

- Interesting "streaks" at some times

(strongest on Sunday morning)

- suggests a few common "transmission rates"

- most clear on Sun. morning, since least packet loss

New Flow Scatterplots (cont.)

Extreme Value Rescaled Analysis:

Idea: put everything on "same tail index scale"

Notation: let $X$ denote Response Size (bytes)

and let $Y$ denote Response Duration (sec)

Approach: estimate "tail index" $\alpha$

(for both of $X$ and $Y$ separately)

then apply "axis hugging" view to joint distribution of rescaled $(X, Y)$

New Flow Scatterplots (cont.)

Estimation of Tail Index, $\alpha$:

Hill Estimator

Reference: sec. 6.4.2 of:

Embrechts, P., Klueppelberg, C. and Mikosch, T. (1997) Modelling Extremal Events, Springer.

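For "reverse order statistics", $X_{(1)} \ge X_{(2)} \ge \cdots \ge X_{(n)}$,

Compare "previous data point" $X_{(k)}$

with the tail $X_{(1)}, \ldots, X_{(k)}$, on a log scale, to get

$\frac{1}{k} \sum_{j=1}^{k} \left( \log X_{(j)} - \log X_{(k)} \right)$

I.e. define

$\hat{\alpha}_k = \left( \frac{1}{k} \sum_{j=1}^{k} \log X_{(j)} - \log X_{(k)} \right)^{-1}$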
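A minimal sketch of a Hill estimator, for concreteness. It assumes the common convention that the estimate is the reciprocal of the mean log-excess of the k largest observations over the k-th largest; other references use the (k+1)-th largest as the threshold, which changes little for large k. The function name and the simulated Pareto check are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def hill_estimate(data, k):
    """Hill estimate of the tail index from the k largest observations.

    Assumed convention: alpha_hat = 1 / (mean(log X_(1..k)) - log X_(k)),
    with X_(1) >= X_(2) >= ... the reverse order statistics.
    """
    x = np.sort(np.asarray(data, dtype=float))[::-1]   # descending order
    if not 2 <= k <= x.size:
        raise ValueError("k must satisfy 2 <= k <= len(data)")
    if x[k - 1] <= 0:
        raise ValueError("the k largest observations must be positive")
    logs = np.log(x[:k])
    mean_log_excess = logs.mean() - logs[-1]
    return 1.0 / mean_log_excess

# Sanity check on simulated Pareto data with known tail index 1.5.
rng = np.random.default_rng(0)
sample = (1.0 - rng.random(100_000)) ** (-1.0 / 1.5)
alpha_hat = hill_estimate(sample, k=2000)
print(alpha_hat)
```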

New Flow Scatterplots (cont.)

Results of Hill Estimation: graphic

- Strong dependence on number of data points, $k$

- estimates mostly in range $1 < \hat{\alpha} < 2$

(recall infinite variance, but finite mean)

- but not always (sometimes above, sometimes below)

- Considered $k$ from 100 to 2400

- Worth going smaller than 100???

- Which $k$ to use as an estimate?

- Chose the average over $k$ from 100 to 2000

- Lower bound since "don't trust below that"

- Upper bound since use largest 2000 in scatterplot

- Is the average sensible????
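The averaging choice above can be sketched as follows: compute the Hill estimate for every number of upper order statistics in a range, then average. This is a sketch under assumptions: the function name is hypothetical, the estimator uses the k-th largest value as threshold, and simulated Pareto data with tail index 1.5 stand in for the response sizes.

```python
import numpy as np

def hill_curve(data, k_min, k_max):
    """Hill estimates for every k in k_min..k_max (one per k).

    Uses running means of the sorted log data, so the whole curve
    costs one sort and one cumulative sum.
    """
    x = np.sort(np.asarray(data, dtype=float))[::-1]   # descending order
    logs = np.log(x[:k_max])
    ks = np.arange(1, k_max + 1)
    running_mean = np.cumsum(logs) / ks
    mean_log_excess = running_mean - logs              # entry k-1 belongs to k
    keep = ks >= k_min
    return ks[keep], 1.0 / mean_log_excess[keep]

rng = np.random.default_rng(1)
sample = (1.0 - rng.random(50_000)) ** (-1.0 / 1.5)    # Pareto, tail index 1.5
ks, alpha_k = hill_curve(sample, k_min=100, k_max=2000)
alpha_bar = alpha_k.mean()                             # the "average over 100 to 2000"
print(alpha_bar, alpha_k.min(), alpha_k.max())
```

Printing the minimum and maximum alongside the average gives a rough feel for how much the estimate moves over the range, which bears on whether the average is sensible.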

New Flow Scatterplots (cont.)

Now for scatterplot of joint distribution of rescaled Response Size and Response Duration:

"Biggest 2000" now chosen by "distance to origin",

avoids potential earlier "biasing towards size" effects
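A sketch of the selection step, assuming that "rescaled" means raising each coordinate to its estimated tail index (an assumption; the exact transformation is not spelled out here). The function name, the simulated data and the two powers are hypothetical.

```python
import numpy as np

def top_by_distance(x, y, n_top=2000, power_x=1.0, power_y=1.0):
    """Rescale each coordinate by a power, then keep the n_top points
    farthest from the origin (rather than the n_top largest in x alone)."""
    xs = np.asarray(x, dtype=float) ** power_x
    ys = np.asarray(y, dtype=float) ** power_y
    r = np.hypot(xs, ys)
    idx = np.argsort(r)[-n_top:]
    return xs[idx], ys[idx]

rng = np.random.default_rng(2)
size = rng.pareto(1.2, 10_000) + 1.0      # stand-in for response sizes
dur = rng.pareto(1.5, 10_000) + 1.0       # stand-in for response durations
xs, ys = top_by_distance(size, dur, n_top=2000, power_x=1.2, power_y=1.5)
print(xs.shape, ys.shape)
```

Selecting on distance treats both coordinates symmetrically, so a point with a very long duration but modest size is kept, which selection on size alone would drop.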

Results:

Extreme Value rescaled Scatterplot

- Not much different from unscaled version?

- Since most estimated $\alpha$ near 1??

- Thursday Evening has largest relative difference?

- Rescaled Thursday Evening has less "axis hugging"?

- I.e. this transformation "hurts" axis hugging???

(this case only)

- most others have "more axis hugging"??

- Only because most powers are larger than one?

- Visual impression ("axis hugging") driven mostly by

few biggest data points???

- What about other scales?

New Flow Scatterplots (cont.)

Joint Distribution of Response Size and Duration (cont.)

Same scatterplot, on Square Root Scale

- significantly reduces visual impression of "axis hugging"

- Wednesday Afternoon looks best?

- "asymptotic independence" depends on "scale"?

New Flow Scatterplots (cont.)

Joint Distribution of Response Size and Duration (cont.)

A deeper look at power transformations:

The Box-Cox Power family

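Given a power $\lambda$ modify the usual

"power transformation" $x \mapsto x^{\lambda}$

by a linear transformation: $x \mapsto \frac{x^{\lambda} - 1}{\lambda}$

Reason: this gives "continuity at 0": $\lim_{\lambda \to 0} \frac{x^{\lambda} - 1}{\lambda} = \log x$

i.e. includes the log transformation as well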
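A short numerical sketch of the Box-Cox family and its continuity at power 0. The function name and the test values are illustrative assumptions.

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox transform: (x**lam - 1) / lam for lam != 0, log(x) for lam == 0."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox needs strictly positive data")
    if lam == 0:
        return np.log(x)
    return (x ** lam - 1.0) / lam

x = np.array([1.0, 10.0, 100.0])
identity_like = box_cox(x, 1.0)       # equals x - 1: a shift only
sqrt_like = box_cox(x, 0.5)           # equals 2 * (sqrt(x) - 1)
near_log = box_cox(x, 1e-6)           # approaches log(x) as lam -> 0
print(identity_like, sqrt_like, near_log, np.log(x))
```

Power 1 only shifts the data, so a scatterplot looks the same as on the original scale, and the square root scale corresponds to power 0.5 up to a linear change of units.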

New Flow Scatterplots (cont.)

Box-Cox transformed Joint Distribution of Response Size and Duration

Thursday Evening graphic

- Looked best on original scale

- Consider the range of powers $\lambda$

from 1 (no change) to 0 (log)

- Smaller $\lambda$ means less "axis hugging"

- "Axis hugging" needs to be defined in terms of scale????

- Seem to need: improved finite sample definition of

"asymptotic indep."??

New Flow Scatterplots (cont.)

Big Problem with Hill Estimation: Choice of $k$ (number of data points)

A simplification:

can reduce from choosing two values to one

By estimating the ratio of the two tail indices, using the ratio of the Hill estimates

(for the same $k$)

Then plot

Tuning of Hill Ratio Estimator: graphic

- Often rather close to 1?

- Thursday evening again an "extreme case"

(but recall transforming hurt "axis hugging")

- Stronger dependence on $k$?

- Ratio version any better than just "using same $k$"?

- Worth pursuing???

New Flow Scatterplots (cont.)

Things that could be done next:

1. Follow up on "streaks", i.e. understand those transm'n rates?

- Do overlay of important ones on all plots?

- If they appear frequently, get expert interpretation?

- Apply 2-d SiZer type analysis to scatterplots?

2. Look at log scale versions?

- What do we expect to learn?

3. Get quantitative about "axis hugging"?

- Useful finite sample def'n of "asy. indep."???

- needs attention to "power - scale" of data??

How interesting are these???

New Response Size Q-Q Plots

Another view of New Response Size Data:

Extreme Value Tail Index

Recall Intuition:

- Shape parameter of Pareto (polynomial power)

- Strong relation to Long Range Dependence

- in Mice and Elephants plots (graphic)

- in Duration Distributions,

implies Classical LRD in aggregated time series

- Strong relation to moments:

- for $\alpha \le 1$ have infinite mean

- for $1 < \alpha \le 2$ have finite mean but infinite variance

- for $\alpha > 2$ both mean and variance are finite

- similar for larger $\alpha$ and higher moments
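The moment relations above can be checked directly in the simplest case. The derivation below assumes an exact Pareto tail with shape parameter $\alpha$ and lower endpoint 1 (an illustrative normalization); a distribution that is only asymptotically Pareto behaves the same way as far as finiteness of moments is concerned.

```latex
% Assumed model: P(X > x) = x^{-\alpha} for x \ge 1, density \alpha x^{-\alpha-1}.
\[
  E\left[X^{p}\right]
  = \int_{1}^{\infty} x^{p} \, \alpha x^{-\alpha-1} \, dx
  = \alpha \int_{1}^{\infty} x^{p-\alpha-1} \, dx
  =
  \begin{cases}
    \dfrac{\alpha}{\alpha - p}, & p < \alpha, \\[2ex]
    \infty, & p \ge \alpha.
  \end{cases}
\]
% p = 1: the mean is finite only when \alpha > 1.
% p = 2: the second moment (hence the variance) is finite only when \alpha > 2.
```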

New Response Size Q-Q Plots (cont.)

Simple, straightforward Estimation of $\alpha$:

Slope of CCDF (i.e. 1 - CDF) on log - log scale

Log-log CCDF: graphic

- All 21 time blocks appear as thin blue lines

- Each Individual labeled and highlighted in thick red

- Not very "linear"?

- Suggests classical extreme value theory

hasn't "kicked in" yet???

- Note "shapes" of curves surprisingly constant

- Suggests curvature is not "random phenomenon"!

- Instead something systematic about internet traffic?

- Point worth deeper statistical confirmation??

- Suggests enhancement of current mathematics????

- Friday evening an "extreme point"? (least steep?)

- Many Resp. Sizes near 400 bytes???

(also for Friday Afternoon, nowhere else?)

- Worth plotting data between 0.999 quantile and max???

(1,000 to 7,000 of these for each time block....)

New Response Size Q-Q Plots (cont.)

Now estimate "tail index" $\alpha$, by studying:

Slopes: graphic

- Simply use difference quotients from log-log CCDF

- Numerical problem: 0 denominators

- Reset to bottom of plot

- Suggest ignoring those

- Could use fancier differentiation (e.g. over bigger range)

- But this "raw data" shows interesting structure

- "Almost always" have (interesting for LRD)

- But no apparent "tail limit" for $\alpha$?

- So do not satisfy "classical heavy tail definition"?

- But still clearly "intuitively heavy tailed"?

- Worth exploring alternate definitions?
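The difference-quotient slopes described above can be sketched as follows. This is a sketch under assumptions: the function name and the tiny data set are hypothetical, the empirical CCDF uses the i/n convention, and zero denominators (tied data values) are simply dropped, which is one way of "ignoring those".

```python
import numpy as np

def loglog_ccdf_slopes(data):
    """Raw difference quotients of the empirical CCDF on the log-log scale.

    Returns (midpoints, slopes); pairs of adjacent order statistics with
    equal values (zero denominator) are skipped.
    """
    x = np.sort(np.asarray(data, dtype=float))          # ascending order
    n = x.size
    ccdf = 1.0 - np.arange(1, n + 1) / n                 # last entry is 0
    log_x = np.log10(x[:-1])                             # drop the point where CCDF = 0
    log_c = np.log10(ccdf[:-1])
    dx = np.diff(log_x)
    dc = np.diff(log_c)
    ok = dx > 0                                          # avoid 0 denominators
    midpoints = 0.5 * (log_x[1:] + log_x[:-1])[ok]
    return midpoints, dc[ok] / dx[ok]

# Tiny example with ties, standing in for response sizes in bytes.
data = [1, 1, 2, 4, 4, 8, 16]
mids, slopes = loglog_ccdf_slopes(data)
print(mids, slopes)
```

Each slope is the negative of a local tail index estimate. The raw quotients are noisy, so differencing over a wider range, as suggested above, would smooth them at the cost of hiding fine structure.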

New Mathematics for "Heavy Tails"?

Version 1: For some ,

Open Problem 1: For the simple Model,

with Version 1 tailed Duration Dist'n, is

(i.e. have index LRD)