Lecture10-1-01

Course OR 778

Class Notes 10/1/01

Last Time: Mice and Elephants View

- Visualization related to:

Heavy tailed durations => Long Range Dependence

- extracted flows from 5 million packets

- by source and destination addresses

- time windows gave truncation - length biasing

- fit distributions - Pareto, lognormal

- constructed simulated versions

- careful look at IP, TCP, UDP, ...

Mice and Elephants View (cont.)

Revisit protocol background:

a. all packets are IP (Internet Protocol)

b. subsets of IP include TCP, UDP, ...

c. TCP (Transmission Control Protocol) packets

- are "acknowledged", to get "certified transmission"

- thus involve loss recovery mechanisms

- includes HTTP, FTP, Telnet, SMTP, Napster,...

- ~80% of current traffic

d. UDP (User Datagram Protocol) packets are "sent out only",

- i.e. loss is ignored

- includes streaming music, video, ...

- most of rest of traffic

Mice and Elephants View (cont.)

Revisit protocol background (cont.)

Interesting comment:

"I thought the main difference between TCP and UDP

was that TCP is 'network nice' while UDP is not"

Definition of 'network nice':

- sends data at "moderate rate" (TCP windowing)

- deliberately backs off when congestion is detected

- serious acknowledgement of "sharing" of resources

- crude attempt at optimizing bandwidth in "overall" sense

- "greedy behavior" is harder (but not impossible)
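The "backs off when congestion is detected" behavior can be illustrated with a much-simplified sketch of additive-increase / multiplicative-decrease, the scheme used in TCP congestion avoidance. The function name, the round structure, and the parameter values are assumptions for illustration; real TCP also has slow start, timeouts, and more.

```python
def aimd_window(loss_rounds, n_rounds, start=1.0, increase=1.0, decrease=0.5):
    """Return the window size after each round.

    loss_rounds: set of round indices in which congestion (a loss) is detected.
    """
    window = start
    history = []
    for r in range(n_rounds):
        if r in loss_rounds:
            # congestion detected: back off multiplicatively
            window = max(1.0, window * decrease)
        else:
            # no congestion: probe gently for more bandwidth
            window += increase
        history.append(window)
    return history


# Hypothetical run: losses detected in rounds 5 and 9
history = aimd_window(loss_rounds={5, 9}, n_rounds=12)
print(history)
```

The window climbs slowly and is cut in half at each detected loss, which is the "sharing" behavior that UDP senders do not have by default.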

Don Smith's resolution of these impressions:

Both differences are correct; relevance depends on viewpoint:

Protocol Researcher:

"loss free transmission" difference is more fundamental

Network Traffic Researcher:

network behavior difference is more important

Mice and Elephants View (cont.)

Revisit issue: what is an IP "flow" (connection)?

Above definition:

an "IP flow" is a set of packets with
same sending and receiving IP addresses
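A minimal sketch of this flow definition, assuming each packet is recorded as (time in sec, source address, destination address). The record layout, the addresses, and the name `group_flows` are hypothetical, not taken from the course data.

```python
from collections import defaultdict

# Hypothetical packet records: (timestamp_sec, source_ip, destination_ip)
packets = [
    (0.0, "10.0.0.1", "10.0.0.9"),
    (0.25, "10.0.0.2", "10.0.0.9"),
    (0.5, "10.0.0.1", "10.0.0.9"),
    (75.0, "10.0.0.1", "10.0.0.9"),
    (75.5, "10.0.0.2", "10.0.0.9"),
]


def group_flows(packets):
    """Group packet times by the ordered (source, destination) address pair."""
    flows = defaultdict(list)
    for t, src, dst in packets:
        flows[(src, dst)].append(t)
    for times in flows.values():
        times.sort()
    return dict(flows)


flows = group_flows(packets)
# Flow duration = time of last packet minus time of first packet
durations = {key: times[-1] - times[0] for key, times in flows.items()}
print(durations)
```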

Above noted weakness:

multiple visits to server are combined
e.g. several web pages from same site

Potential problem: could be "long lags" that interrupt:

Heavy tailed durations => Long Range Dependence

Mice and Elephants View (cont.)

Quick data view:

study "maximal time gap" for above flows

Peak:

- max time gap = 261 (sec)

- total time window = 267 (sec)

Off Peak:

- max time gap = 2113 (sec)

- total time window = 2199 (sec)

- clearly an important issue!

- related to length bias issues??
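The "maximal time gap" of a flow can be computed as below. This is only a sketch: the packet times are made up so that the largest gap matches the 261 sec peak value, and `max_time_gap` is a hypothetical helper name.

```python
def max_time_gap(times):
    """Largest gap between consecutive packet times of one flow.

    Returns 0.0 for a flow with fewer than two packets.
    """
    ordered = sorted(times)
    return max((b - a for a, b in zip(ordered, ordered[1:])), default=0.0)


# Hypothetical flow: a short burst, a long silence, then another burst
flow_times = [0.0, 0.5, 1.0, 262.0, 262.5]
gap = max_time_gap(flow_times)
print(gap)
```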

Mice and Elephants View (cont.)

Quick fix: re-definition of "flow"

Split above flows, whenever gap between packets is > 60 (sec)

- "very long" TCP loss period

(TCP recovery "usually" 0.01 - 0.1 sec)

- not so long as "think time between browser clicks"
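The re-definition above can be sketched as follows, assuming the packet times of one original flow are available as a list of seconds. The name `split_flow` and the example times are hypothetical.

```python
def split_flow(times, max_gap=60.0):
    """Split one flow into pieces wherever consecutive packets are
    more than max_gap seconds apart."""
    ordered = sorted(times)
    if not ordered:
        return []
    pieces = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > max_gap:
            pieces.append([cur])    # gap too long: start a new flow
        else:
            pieces[-1].append(cur)  # same flow continues
    return pieces


# Hypothetical flow with one 261 sec gap and one 37.5 sec gap
pieces = split_flow([0.0, 0.5, 1.0, 262.0, 262.5, 300.0])
print(pieces)
```

Only the 261 sec gap exceeds the threshold, so the flow splits into two pieces; the 37.5 sec gap stays inside the second piece.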

Mice and Elephants View (cont.)

Some quick summary statistics:

Peak Times:

Original Mean: 20.5

Split Mean: 14.1

Off Peak Times:

Original Mean: 86.2

Split Mean: 13.8

- Conclusion: substantial change in elephants

- especially in longer off peak range

Mice and Elephants View (cont.)

Some quick summary statistics (cont.)

Peak Times:

Original Median: 1.1

Split Median: 0.9

Off Peak Times:

Original Median: 0.5

Split Median: 0.3

- Conclusion: small change in mice

- smaller median suggests many split pieces are small!
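A toy illustration of why splitting moves the mean much more than the median. All numbers below are made up (they are not the course data): 20 mice, plus 2 elephants that are each split into two short pieces once the long gap is removed.

```python
from statistics import mean, median

# Hypothetical durations (sec): 20 mice between 0.1 and 2.0
mice = [i / 10 for i in range(1, 21)]
# Two hypothetical elephants, each really two short bursts with a long gap
elephants = [400.0, 900.0]
pieces_after_split = [0.3, 0.6, 0.3, 0.6]

original = mice + elephants
split = mice + pieces_after_split

print("mean:  ", mean(original), "->", mean(split))
print("median:", median(original), "->", median(split))
```

The mean is dominated by the few elephants, so it collapses after the split, while the median only shifts down a little because the new pieces land among the mice.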

Mice and Elephants View (cont.)

Heavy tail duration views (80% window):

Original Peak (revisited)

Split Peak

- mean, median & full window fraction are slightly smaller

- more mice and fewer elephants

- not a large change

Original Off Peak (revisited)

Split Off Peak

- mean, median & full window fraction are much smaller

- far fewer elephants

- dramatic visual change

- big time of day difference

- explainable by less Sunday morning Napster???

A simple model

Hopes:

- provide additional structure to support ideas.

- e.g. mice and elephants plots

- in particular, illustrate:

Heavy tailed durations => Long Range Dependence

- yield "open problems" that become more accessible

- be reasonably realistic

A simple model (cont.)

- Continuous time (simpler than discrete?)

- Homogeneous Poisson "starting time" (for flows)

- Draw independent "duration time" (for each flow)

- Define N(t) = number of "active flows" at time t

- Queueing name? M/G/∞ (Poisson arrivals, general durations, infinitely many servers). Reference?

- Allows analysis of

Heavy tailed durations => Long Range Dependence

- Good reference?

Cox, D. R. (1984) Long-Range Dependence: A Review, in Statistics: An Appraisal, Proceedings 50th Anniversary Conference, H. A. David and H. T. David (eds.), The Iowa State University Press, 55-74.
(in discrete case)
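The model above can be simulated in a few lines. This is a sketch under assumptions not in the notes: start rate 5 per sec, Pareto durations with tail index 1.5 (finite mean 3, infinite variance), a window of 1000 sec, and hypothetical function names.

```python
import random


def simulate_flows(rate, horizon, draw_duration, rng):
    """Homogeneous Poisson start times on [0, horizon), each with an
    independent duration. Returns a list of (start, end) pairs."""
    flows = []
    t = rng.expovariate(rate)
    while t < horizon:
        flows.append((t, t + draw_duration(rng)))
        t += rng.expovariate(rate)
    return flows


def active_count(flows, t):
    """Number of flows that are active at time t."""
    return sum(1 for start, end in flows if start <= t < end)


def pareto_duration(rng):
    # Assumed heavy tailed duration: Pareto, minimum 1, tail index 1.5
    return rng.paretovariate(1.5)


rng = random.Random(0)
flows = simulate_flows(rate=5.0, horizon=1000.0,
                       draw_duration=pareto_duration, rng=rng)
# Sample the active-flow count every 10 sec, skipping an initial warm-up
counts = [active_count(flows, t) for t in range(100, 1000, 10)]
print(len(flows), sum(counts) / len(counts))
```

In steady state the mean number of active flows is (start rate) x (mean duration), here 5 x 3 = 15; the simulated average sits a little below that because flows that started before time 0 are missing.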

A simple model (cont.)

How reasonable is this?

Simple checks of assumptions:

1. SiZer analysis of homogeneity (constant intensity)

2. QQ investigation of interarrival times

A simple model (cont.)

1. SiZer analysis of homogeneity (constant intensity)

[SiZer figures: peak, off-peak]

- only for flow start times (recall connection (flow) graphic)

- clearly not homogeneous (lots of red & blue regions)

- but not so "rough" as packet level analysis?

- homogeneous Poisson OK as an approximation????

- recall different time scales

- non-stationary type decrease in peak??

- note similar effects over 250 sec in off peak!

- "stationarity" depends on scale??
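SiZer itself is not reproduced here. As a much cruder stand-in for checking constant intensity, one can bin the start times and compare the variance of the bin counts to their mean, since the ratio should be near 1 for a homogeneous Poisson process. Everything below (rates, bin width, function names) is an assumption for illustration, and the data are simulated.

```python
import random
from statistics import mean, pvariance


def poisson_times(rate, start, end, rng):
    """Event times of a constant-rate Poisson process on [start, end)."""
    times = []
    t = start + rng.expovariate(rate)
    while t < end:
        times.append(t)
        t += rng.expovariate(rate)
    return times


def bin_counts(times, horizon, width):
    """Count events in consecutive bins of the given width on [0, horizon)."""
    n_bins = int(horizon / width)
    counts = [0] * n_bins
    for t in times:
        k = int(t / width)
        if k < n_bins:
            counts[k] += 1
    return counts


def dispersion_index(counts):
    """Variance-to-mean ratio of the bin counts."""
    return pvariance(counts) / mean(counts)


rng = random.Random(1)
homogeneous = poisson_times(2.0, 0.0, 1000.0, rng)
# Intensity drops from 3 to 1 per sec halfway through the window
changing = (poisson_times(3.0, 0.0, 500.0, rng)
            + poisson_times(1.0, 500.0, 1000.0, rng))

index_homogeneous = dispersion_index(bin_counts(homogeneous, 1000.0, 10.0))
index_changing = dispersion_index(bin_counts(changing, 1000.0, 10.0))
print(index_homogeneous, index_changing)
```

Unlike SiZer, this gives one number at one bin width, so it says nothing about how "stationarity" depends on scale; repeating it over several widths would be a first step in that direction.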

A simple model (cont.)

1. SiZer analysis of homogeneity (cont.):

Comparison to full packet SiZer:

(recall connection (flow) graphic)

[SiZer figures: all packets off-peak, all packets peak]

- start times have "fewer ups and downs"?

- just due to larger sample size???

- or have "subtracted out Long Range Dependence"???

- ups and downs are correlated?

- suggests "Long Range Dependence" also affects starts?

- stronger correlation on right than on left?

- could be due to start time boundary effects???

A simple model (cont.)

2. QQ investigation of interarrival times

between session starts (recall connection (flow) graphic)

2 a. Exponential QQ, scale est'd by sample mean

Offpeak

- unacceptable fit

- especially slope

- suggesting wrong shape parameter

Offpeak, log scale

- magnifies small values

- still suggests wrong shape

Peak

- a few dramatically large values

- caused by boundary effects??

Peak, log scale

- again magnifies small values

- again suggests wrong shape parameter
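The exponential QQ construction in 2 a. can be sketched as follows: sorted interarrival times are paired with exponential quantiles whose scale is the sample mean. The plotting positions (i + 0.5)/n and the simulated sample are assumptions; the course data are not used.

```python
import math
import random


def exponential_qq(sample):
    """Return (theoretical quantile, data quantile) pairs for an
    exponential QQ plot, with scale estimated by the sample mean."""
    data = sorted(sample)
    n = len(data)
    scale = sum(data) / n
    theoretical = [-scale * math.log(1.0 - (i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, data))


# Simulated interarrival times that really are exponential
rng = random.Random(2)
sample = [rng.expovariate(1.0) for _ in range(2000)]
pairs = exponential_qq(sample)
print(pairs[0], pairs[1000], pairs[-1])
```

For truly exponential data the pairs lie near the 45 degree line; a systematic departure in slope or curvature, as in the plots above, indicates that no choice of scale alone will fix the fit.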

A simple model (cont.)

2. QQ investigation of interarrival times (cont.)

2 b. Weibull QQ, parameters est'd by quantile matching

Offpeak

- very good fit

- shape parameter 0.9 "close to Poisson 1.0"? (Weibull shape 1.0 is the exponential, i.e. Poisson, case)

- provides a workable approximation????

Offpeak, log scale

- similar lessons

Peak

- visual impression distorted by few large values

- very similar shape parameter 0.9

Peak, log scale

- similar lessons
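A sketch of Weibull fitting by quantile matching, followed by the QQ pairs. The notes do not say which quantiles were matched; the quartiles (0.25 and 0.75) used here are an assumption, as are the function names and the simulated sample.

```python
import math
import random


def quantile(sorted_data, p):
    """Empirical quantile by simple index lookup (no interpolation)."""
    k = min(len(sorted_data) - 1, int(p * len(sorted_data)))
    return sorted_data[k]


def fit_weibull_by_quantiles(sample, p1=0.25, p2=0.75):
    """Choose shape and scale so the Weibull quantiles at p1 and p2
    equal the sample quantiles."""
    data = sorted(sample)
    q1, q2 = quantile(data, p1), quantile(data, p2)
    # Weibull quantile function: Q(p) = scale * (-log(1 - p)) ** (1 / shape)
    a1, a2 = -math.log(1.0 - p1), -math.log(1.0 - p2)
    shape = (math.log(a2) - math.log(a1)) / (math.log(q2) - math.log(q1))
    scale = q1 / a1 ** (1.0 / shape)
    return shape, scale


def weibull_qq(sample, shape, scale):
    """Return (theoretical quantile, data quantile) pairs."""
    data = sorted(sample)
    n = len(data)
    theoretical = [scale * (-math.log(1.0 - (i + 0.5) / n)) ** (1.0 / shape)
                   for i in range(n)]
    return list(zip(theoretical, data))


# Simulated interarrival times: Weibull with scale 1.0 and shape 0.9
rng = random.Random(3)
sample = [rng.weibullvariate(1.0, 0.9) for _ in range(5000)]
shape, scale = fit_weibull_by_quantiles(sample)
pairs = weibull_qq(sample, shape, scale)
print(shape, scale)
```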

Consequences of Weibull, with shape parameter 0.9:

- "Weibull bunching near 0" drives above SiZer analysis??

- suggests an improved model????
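"Weibull bunching near 0" can be made concrete by comparing the chance of a very short interarrival time under a Weibull with shape 0.9 and under an exponential with the same mean. The mean of 1 and the cutoff 0.01 are arbitrary choices for illustration.

```python
import math


def weibull_cdf(x, shape, scale):
    """P(X <= x) for a Weibull distribution."""
    return 1.0 - math.exp(-((x / scale) ** shape))


def exponential_cdf(x, mean):
    """P(X <= x) for an exponential distribution with the given mean."""
    return 1.0 - math.exp(-x / mean)


shape = 0.9
# Weibull mean is scale * Gamma(1 + 1/shape); pick scale so the mean is 1
scale = 1.0 / math.gamma(1.0 + 1.0 / shape)

x = 0.01  # a "very short" interarrival time, in units of the mean
p_weibull = weibull_cdf(x, shape, scale)
p_exponential = exponential_cdf(x, 1.0)
ratio = p_weibull / p_exponential
print(p_weibull, p_exponential, ratio)
```

Even with shape as close to 1 as 0.9, very short gaps are noticeably more likely than under the exponential, so start times cluster more than a Poisson process would allow.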