Lecture9-26-01

Course OR 778

Class Notes 9/26/01

Last Time

- finished SiZer background

- zooming SiZer analysis

- showed aggregated data not Homogeneous Poisson

- different from aggregation conclusions of Cleveland et al.

Mice and Elephants View

In context of:

Heavy tailed durations => Long Range Dependence

Earlier graphical "toy views":

Exponential Durations

Pareto 1.5 Durations

Showed how "few long connections"

(i.e. heavy tailed duration distributions)

might induce Long Range Dependence
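The toy comparison can be reproduced numerically. The following is a minimal sketch, not the code behind the lecture graphics: it assumes Pareto durations with shape 1.5 and minimum 1 (so the mean is 3), sets the Exponential mean to the same value, and uses an arbitrary seed and sample size.

```python
import random
import statistics

rng = random.Random(0)  # arbitrary seed, for reproducibility only
n = 10_000              # illustrative sample size (an assumption)

# Pareto with shape 1.5 and minimum 1 has mean 1.5 / (1.5 - 1) = 3.
pareto = [rng.paretovariate(1.5) for _ in range(n)]
# Exponential with the same theoretical mean (rate = 1 / mean).
expo = [rng.expovariate(1 / 3) for _ in range(n)]

for name, d in (("Exponential", expo), ("Pareto 1.5", pareto)):
    print(f"{name:12s} median={statistics.median(d):.2f} "
          f"mean={statistics.fmean(d):.2f} max={max(d):.1f}")
```

With equal means, the Pareto sample should show a smaller median (more mice) and a much larger maximum (a few elephants) than the Exponential sample.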

Mice and Elephants View

Basis for "real data" view:

- 5 million IP packets (all packets: TCP & UDP & ...)

- gathered at UNC main link, during:

- peak time - weekday afternoon (< 5 min.)

- off-peak time - Sunday morning (~40 min)

- during April, 2000

- with:

- time stamps

- packet sizes

- index of sending and receiving IP addresses

Mice and Elephants View (cont.)

Data Summarization:

define "IP flow" (connection) as set of packets with same index

- recall toy graphic (same index = same color)

- keep only first packet time and last packet time

Weakness:

multiple visits to server are combined
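The summarization step can be sketched as follows. The record layout (timestamp, size, index) and the function name `summarize_flows` are assumptions made for illustration; the notes do not describe the actual data format or code.

```python
def summarize_flows(packets):
    """Map each flow index to (first packet time, last packet time).

    `packets` is an iterable of (timestamp, size, index) tuples; this
    record layout is an assumption made for the sketch.
    """
    flows = {}
    for timestamp, _size, index in packets:
        if index in flows:
            first, last = flows[index]
            flows[index] = (min(first, timestamp), max(last, timestamp))
        else:
            flows[index] = (timestamp, timestamp)
    return flows


# Toy input: two address pairs; pair 7 is revisited after a long gap.
packets = [(0.0, 40, 7), (0.2, 1500, 7), (0.3, 40, 9), (95.0, 40, 7)]
flows = summarize_flows(packets)
print(flows)  # {7: (0.0, 95.0), 9: (0.3, 0.3)}
```

The toy input also shows the weakness: the late return visit on index 7 is merged into a single 95 second "flow".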

Graphic (same as "simulated durations" above):

- represent as horizontal line (over time)

- vertical height: random, to separate lines

- random sample of 1000 (full 142170 is too much)

Mice and Elephants View (cont.)

Peak time graphic

- Looks more like Pareto 1.5 than Exponential?

- Boundary effects?

- Mean much larger than median

Off-peak graphic

- Similar (but stronger) Mice and Elephants effects

- careful about much longer time span

(needed to gather 5 million packets)

Mice and Elephants View (cont.)

Mitigation of boundary effects: narrower time window

- study for Off Peak data

2. 80% of full time span

- different randomization

- now see flows overlapping edge (as expected)

- "right number" overlapping??

- mean connection is smaller (since smaller window)

- median connection is larger?!?!

(must have eliminated more mice than elephants)

- 1.3% cover full time window
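A narrower-window view can be sketched as below. This assumes that flows are clipped to the window and that flows with no overlap are dropped; the notes do not say exactly how the windowed plots were built, and `window_view` is a hypothetical helper.

```python
import statistics

def window_view(flows, start, end):
    """Clip (first, last) flow intervals to the window [start, end].

    Returns the clipped durations and the fraction of retained flows
    that span the whole window. Flows with no overlap are dropped.
    """
    durations = []
    covering = 0
    for first, last in flows:
        if last < start or first > end:
            continue  # no overlap with the window
        lo, hi = max(first, start), min(last, end)
        durations.append(hi - lo)
        if first <= start and last >= end:
            covering += 1
    frac = covering / len(durations) if durations else 0.0
    return durations, frac


# Toy flows on a 0..100 time span, viewed through the central 80%.
flows = [(0.0, 100.0), (5.0, 5.5), (20.0, 21.0), (50.0, 80.0), (95.0, 95.0)]
durations, frac = window_view(flows, 10.0, 90.0)
print(statistics.fmean(durations), statistics.median(durations), frac)
```

In the toy example the two dropped flows are both mice, which is the kind of effect that could raise the median while the mean falls.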

Mice and Elephants View (cont.)

Mitigation of boundary effects: narrower time window (cont.)

3. 10% of full time span

- many more long lines

- reduced sample from 1000 to 500

- just so one could see something (otherwise "too dark")

- Mean and Median both much larger (more long lines)

- 20% cover full window (~3.5 min.)

- "Length Biased" sampling effect?

Mice and Elephants View (cont.)

Mitigation of boundary effects: narrower time window (cont.)

4. 1% of full time span

- above effects all much stronger

- > 80% cover full window! (< 0.5 min.)

- for decent visual effect, reduced sample to only 200

- lines randomly placed on equally spaced vertical grid

(otherwise "Poisson clumping" is visually distracting)

(this was done in above plots as well)

- Recall full sample median was 0.5 sec.

- Clear "length biased" sampling effect!
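The vertical placement described above (random order on an equally spaced grid, rather than independent random heights) can be sketched as follows. The helper name `grid_heights`, the stand-in flows, and the seed are assumptions for illustration only.

```python
import random

def grid_heights(n_lines, rng):
    """Return one height per line: an equally spaced grid on (0, 1),
    assigned to the lines in random order."""
    grid = [(k + 0.5) / n_lines for k in range(n_lines)]
    rng.shuffle(grid)
    return grid


rng = random.Random(1)  # arbitrary seed
# Stand-in (first, last) flows; real ones would come from the trace.
flows = [(0.1 * i, 0.1 * i + 1.0) for i in range(1000)]
sample = rng.sample(flows, 200)          # reduced sample, as in the 1% view
heights = grid_heights(len(sample), rng)
# Each (first, last) pair in `sample` is drawn as a horizontal line
# at the matching entry of `heights`.
```

Because every grid height is used exactly once, no two lines share a height, so chance clusters of nearby heights ("Poisson clumping") cannot occur.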

Mice and Elephants View (cont.)

Length Biased sampling:

Classical Reference:

Daniels, H. E. (1942) A new technique for the analysis of fibre length distribution in wool, J. Text. Inst., 33, 1209-1211.

Background: sampling from a basket of fibers

- long fibers more likely to be drawn

- creates bias in "population of lengths"

- bias can be precisely calculated

- thus can suitably adjust
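A small simulation illustrates why the bias can be calculated and undone. It assumes the chance of drawing a fiber is proportional to its length, so that a length density f(x) with mean mu becomes g(x) = x f(x) / mu in the sample. The Exponential population, the seed, and the sample sizes are assumptions for the sketch.

```python
import random
import statistics

rng = random.Random(2)  # arbitrary seed
# Hypothetical population of fiber lengths: Exponential with mean 1.
population = [rng.expovariate(1.0) for _ in range(20_000)]

# Length-biased draw: selection probability proportional to length.
biased = rng.choices(population, weights=population, k=20_000)

mu = statistics.fmean(population)
biased_mean = statistics.fmean(biased)
# Theory: the length-biased mean is E[X^2] / E[X] (equal to 2 here).
predicted = statistics.fmean([x * x for x in population]) / mu

# Adjustment: under g, E[1 / X] = 1 / mu, so the harmonic mean of
# the biased sample estimates the unbiased mean mu.
corrected = len(biased) / sum(1 / x for x in biased)

print(f"population mean {mu:.2f}, biased mean {biased_mean:.2f}, "
      f"predicted {predicted:.2f}, corrected {corrected:.2f}")
```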

Open Problem: Use "length biased" sampling and "truncated data" ideas:

- to explore correctness of 80% window view

- to correctly modify smaller window views

- to find "best view" for mice and elephants plots

Variation also involved: "censored" and "truncated" sampling

Good reference:
Turnbull, B. W. (1976) The Empirical Distribution Function with Arbitrarily Grouped, Censored and Truncated Data, Journal of the Royal Statistical Society, Series B, 38, 290-295.

Not pursued more deeply for now

Mice and Elephants View (cont.)

Which distributions "fit"?

Kernel Density Estimation graphic

View 1: SiZer analysis

- ordinary scale is useless

- elephants so big, that many mice get obscured

View 2: log SiZer analysis

- much more useful scale

- large percentage at min (1 packet flows)

- recall: median ~ 0.5 sec, mean ~ 100 sec

- large clump approximately Gaussian?

- overall, mixture of 4 Gaussians??

- recall 1.3% cover full window, i.e. "at max"

- many "small significant bumps"

Mice and Elephants View (cont.)

Which distributions "fit"? (cont.)

Pareto Q-Q graphic

- "many at min" is vertical line at bottom

- "window upper bound" is horizontal line at top

- fit at 0.8 and 0.9 quantiles, since boundary effects

(windowing and length bias) drive larger quantiles

- suggests shape parameter < 1 (infinite mean)

- but very slippery, since fit overall "looks pretty bad"

- true distribution is clearly a "mixture"

(of at least 3, SiZer analysis suggested 4)
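One way to read "fit at 0.8 and 0.9 quantiles" is quantile matching: choose the Pareto parameters so that the theoretical 0.8 and 0.9 quantiles equal the empirical ones. The sketch below assumes that reading; the notes do not give the fitting code, and the function names and test data are hypothetical.

```python
import math
import random

def empirical_quantile(sorted_data, p):
    """Simple order-statistic quantile (no interpolation)."""
    n = len(sorted_data)
    return sorted_data[min(int(p * n), n - 1)]

def fit_pareto_two_quantiles(data, p1=0.8, p2=0.9):
    """Match Pareto quantiles Q(p) = x_min * (1 - p) ** (-1 / alpha)
    to the empirical p1 and p2 quantiles; return (alpha, x_min)."""
    s = sorted(data)
    q1, q2 = empirical_quantile(s, p1), empirical_quantile(s, p2)
    alpha = math.log((1 - p1) / (1 - p2)) / math.log(q2 / q1)
    x_min = q1 * (1 - p1) ** (1 / alpha)
    return alpha, x_min


# Check on synthetic data with known shape 1.5 and minimum 1.
rng = random.Random(3)  # arbitrary seed
data = [rng.paretovariate(1.5) for _ in range(50_000)]
alpha, x_min = fit_pareto_two_quantiles(data)
print(f"alpha={alpha:.2f}, x_min={x_min:.2f}")
```

A fitted alpha below 1 corresponds to the infinite-mean case mentioned above.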

Mice and Elephants View (cont.)

Which distributions "fit"? (cont.)

log Normal Q-Q graphic

- again "many at min" is vertical line at bottom

- again "window upper bound" is horizontal line at top

- again fit at 0.8 and 0.9 quantiles,

because boundary effects drive larger quantiles

- fit distribution has all moments finite.

- again fit overall "looks pretty bad" (comparable to Pareto)

- so inference is very unreliable

- true distribution is clearly a "mixture" (of at least 3 or 4)
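The same two-quantile idea can be sketched for the log Normal, where log Q(p) = mu + sigma * z_p and z_p is the standard Normal quantile. As before, this is an assumed reading of "fit at 0.8 and 0.9 quantiles", with hypothetical names and synthetic test data.

```python
import math
import random
from statistics import NormalDist

def fit_lognormal_two_quantiles(data, p1=0.8, p2=0.9):
    """Solve log Q(p) = mu + sigma * z_p at p1 and p2 for (mu, sigma)."""
    s = sorted(data)
    n = len(s)
    q1 = s[min(int(p1 * n), n - 1)]  # empirical p1 quantile
    q2 = s[min(int(p2 * n), n - 1)]  # empirical p2 quantile
    z1, z2 = NormalDist().inv_cdf(p1), NormalDist().inv_cdf(p2)
    sigma = (math.log(q2) - math.log(q1)) / (z2 - z1)
    mu = math.log(q1) - sigma * z1
    return mu, sigma


# Check on synthetic data with known mu = 0 and sigma = 2.
rng = random.Random(4)  # arbitrary seed
data = [rng.lognormvariate(0.0, 2.0) for _ in range(50_000)]
mu, sigma = fit_lognormal_two_quantiles(data)
print(f"mu={mu:.2f}, sigma={sigma:.2f}")
```

Whatever sigma comes out, the fitted log Normal has all moments finite, in contrast to a Pareto fit with shape below 1.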

Mice and Elephants View (cont.)

Mice and Elephants simulated from fit distributions:

1. Off Peak (recall original 80% window plot)

Use arrival times from real data

Durations simulated from Exponential (same mean)

- still way off

- too few mice and too few elephants

Durations simulated from fit Pareto

- looks "much more like real data"?

- mean bigger by factor of 60?

- median bigger by factor of 12?

- clearly not "all that close" in distribution (as seen in Q-Q)

Durations simulated from fit log Normal

- looks better than Exponential

- but not as good as Pareto?

- not enough elephants?

- but mean and median are closer to truth

- again distributions not all that close (saw from Q-Q)
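The simulated views keep the observed arrival times and replace only the durations. A minimal sketch, assuming stand-in arrival times and hypothetical fitted parameters (only the roughly 100 sec mean comes from the notes):

```python
import random

def simulate_flows(starts, draw_duration):
    """Pair each observed start time with a simulated duration."""
    return [(t, t + draw_duration()) for t in starts]


rng = random.Random(5)  # arbitrary seed
# Stand-in for the real arrival times (assumption: uniform on 40 min).
starts = sorted(rng.uniform(0, 2400) for _ in range(1000))

mean_duration = 100.0    # sec, the approximate mean recalled above
alpha, x_min = 0.9, 0.5  # hypothetical fitted Pareto parameters
mu, sigma = 0.0, 2.5     # hypothetical fitted log Normal parameters

expo_flows = simulate_flows(starts, lambda: rng.expovariate(1 / mean_duration))
pareto_flows = simulate_flows(starts, lambda: x_min * rng.paretovariate(alpha))
lognorm_flows = simulate_flows(starts, lambda: rng.lognormvariate(mu, sigma))
```

Each list of (first, last) pairs can then be windowed and drawn in the same way as the real data, so that the plots are comparable.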

Mice and Elephants View (cont.)

Mice and Elephants simulated from fit distributions (cont.):

2. Peak time (above was "off peak"):

First study 80% windowed version

- mean is smaller than Off Peak

(recall smaller time span)

- median is larger

(length biased sampling???)

Durations simulated from fit Pareto

- looks not bad

Durations simulated from fit log Normal

- this time looks better than Pareto?

- number of elephants seems closer??

- mean is closer?

Caution: dangerous to draw conclusions from "visual effect"

Mice and Elephants View (cont.)

Revisit issue: what is an IP "flow" (connection)?

Above definition:

an "IP flow" is a set of packets with
same sending and receiving IP addresses

Above noted weakness:

multiple visits to server are combined
e.g. several web pages from same site

Potential problem: could be "long lags" that interrupt:

Heavy tailed durations => Long Range Dependence

Mice and Elephants View (cont.)

More protocol background:

a. all packets are IP (Internet Protocol)

b. subsets of IP include TCP, UDP, ...

c. TCP (Transmission Control Protocol) packets

- are "acknowledged", to get "certified transmission"

- thus involve loss recovery mechanisms

- includes HTTP, FTP, Telnet, SMTP, Napster, ...

- ~80% of current traffic

d. UDP (User Datagram Protocol) packets are "sent out only",

- i.e. loss is ignored

- includes streaming music, video, ...

- most of rest of traffic

e. but there are other packets

- not assignable to flows by IP address

f. above analysis includes both TCP and UDP flows