Class Notes 10/1/01
Last Time: Mice and Elephants View
- Visualization related to:
Heavy tailed durations
Long Range Dependence
- extracted flows from 5 million packets
- by source and destination addresses
- time windows gave truncation - length biasing
- fit distributions - Pareto, lognormal
- constructed simulated versions
- careful look at IP, TCP, UDP, ...
Mice and Elephants View (cont.)
Revisit protocol background:
a. all packets are IP (Internet Protocol)
b. subsets of IP include TCP, UDP, ...
c. TCP (Transmission Control Protocol) packets
- are "acknowledged", to get "certified transmission"
- thus involve loss recovery mechanisms
- includes HTTP, FTP, Telnet, SMTP, Napster,...
- ~80% of current traffic
d. UDP (User Datagram Protocol) packets are "sent out only",
- i.e. loss is ignored
- includes streaming music, video, ...
- most of rest of traffic
Mice and Elephants View (cont.)
Revisit protocol background (cont.)
Interesting comment:
"I thought the main difference between TCP and UDP
was that TCP is 'network nice' while UDP is not"
Definition of 'network nice':
- sends data at "moderate rate" (TCP windowing)
- deliberately backs off when congestion is detected
- serious acknowledgement of "sharing" of resources
- crude attempt at optimizing bandwidth in "overall" sense
- "greedy behavior" is harder (but not impossible)
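The "backs off when congestion is detected" behavior is commonly described as additive increase, multiplicative decrease (AIMD). The following is a toy sketch of that idea only, not an actual TCP implementation; the function name `aimd_window`, the loss pattern, and the parameter values are assumptions made for illustration.

```python
def aimd_window(loss_events, start=1.0, increase=1.0, decrease=0.5):
    """Toy AIMD: grow the window by `increase` per round trip,
    multiply it by `decrease` whenever a loss is detected."""
    window = start
    history = []
    for lost in loss_events:
        if lost:
            window = max(1.0, window * decrease)  # back off on congestion
        else:
            window += increase                    # probe for more bandwidth
        history.append(window)
    return history

# Hypothetical loss pattern: a loss on every 5th round trip.
losses = [(i % 5 == 4) for i in range(15)]
print(aimd_window(losses))
```

The resulting "sawtooth" in the window size is the sense in which such a sender shares resources: it keeps probing upward but gives bandwidth back as soon as loss appears.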
Don Smith's resolution of these impressions:
Both differences are correct; relevance depends on viewpoint:
Protocol Researcher:
"loss free transmission" difference is more fundamental
Network Traffic Researcher:
network behavior difference is more important
Mice and Elephants View (cont.)
Revisit issue: what is an IP "flow" (connection)?
Above definition: an "IP flow" is a set of packets with the same sending and receiving IP addresses
Above noted weakness: multiple visits to server are combined
e.g. several web pages from same site
Potential problem: could be "long lags" that interrupt:
Heavy tailed durations
Long Range Dependence
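The flow definition above can be sketched in a few lines. This is a minimal illustration, assuming packets are available as (timestamp, source address, destination address) records; the function name `group_flows` and the sample addresses are hypothetical, not taken from the class data.

```python
from collections import defaultdict

def group_flows(packets):
    """Group (timestamp, src_ip, dst_ip) records into flows keyed by (src, dst).
    Returns a dict mapping each address pair to its sorted packet times."""
    flows = defaultdict(list)
    for t, src, dst in packets:
        flows[(src, dst)].append(t)
    return {key: sorted(times) for key, times in flows.items()}

# Hypothetical packets: the last one is a later visit to the same server,
# which this definition merges into the same flow.
packets = [
    (0.0, "10.0.0.1", "10.0.0.9"),
    (0.4, "10.0.0.2", "10.0.0.9"),
    (1.5, "10.0.0.1", "10.0.0.9"),
    (95.0, "10.0.0.1", "10.0.0.9"),
]
flows = group_flows(packets)
print(flows)
```

The merged later visit is exactly the weakness noted above: the flow appears to last 95 seconds although it is mostly silence.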
Mice and Elephants View (cont.)
Quick data view:
study "maximal time gap" for above flows
Peak:
- max time gap = 261 (sec)
- total time window = 267 (sec)
Off Peak:
- max time gap = 2113 (sec)
- total time window = 2199 (sec)
- clearly an important issue!
- related to length bias issues??
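A "maximal time gap" of this kind is simply the largest difference between consecutive packet times within one flow. A minimal sketch, with the helper name `max_time_gap` and the sample times chosen here for illustration:

```python
def max_time_gap(times):
    """Largest gap between consecutive packet times in one flow
    (0.0 if the flow has fewer than two packets)."""
    ordered = sorted(times)
    return max((b - a for a, b in zip(ordered, ordered[1:])), default=0.0)

# Hypothetical flow: two early packets, then one much later.
print(max_time_gap([0.0, 1.5, 95.0]))
```

A flow whose maximal gap is nearly as long as the whole time window, as in the values above, consists almost entirely of silence.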
Mice and Elephants View (cont.)
Quick fix: re-definition of "flow"
Split above flows whenever gap between packets is > 60 (sec)
- "very long" TCP loss period
(TCP recovery "usually" 0.01 - 0.1 sec)
- not so long as "think time between browser clicks"
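The splitting rule can be sketched as follows. This is an illustrative version only; the function name `split_flow` is an assumption, and the 60 second threshold is passed as a parameter so that other cutoffs can be tried.

```python
def split_flow(times, max_gap=60.0):
    """Split one flow's packet times into pieces wherever two
    consecutive packets are more than `max_gap` seconds apart."""
    ordered = sorted(times)
    if not ordered:
        return []
    pieces = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > max_gap:
            pieces.append([cur])      # gap too long: start a new flow
        else:
            pieces[-1].append(cur)    # same flow continues
    return pieces

# Hypothetical flow with one long silence in the middle.
print(split_flow([0.0, 1.5, 95.0, 96.0]))
```

Each piece then gets its own duration (last time minus first time), so one apparent elephant can become several much shorter flows.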
Mice and Elephants View (cont.)
Some quick summary statistics:
Peak Times:
Original Mean: 20.5
Split Mean: 14.1
Off Peak Times:
Original Mean: 86.2
Split Mean: 13.8
- Conclusion: substantial change in elephants
- especially in longer off peak range
Mice and Elephants View (cont.)
Some quick summary statistics (cont.)
Peak Times:
Original Median: 1.1
Split Median: 0.9
Off Peak Times:
Original Median: 0.5
Split Median: 0.3
- Conclusion: small change in mice
- smaller median suggests many split pieces are small!
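The contrast between a large change in the mean and a small change in the median can be reproduced on a toy example. The durations below are invented for illustration (they are not the class data), and the way the long flow is cut into pieces is likewise an assumption.

```python
from statistics import mean, median

# Hypothetical durations in seconds (NOT the class data): four mice, one elephant.
original = [0.2, 0.5, 1.0, 1.5, 400.0]

# Assume the 60 sec rule cuts the elephant into four pieces, two of them tiny.
split = [0.2, 0.5, 1.0, 1.5] + [30.0, 0.3, 0.1, 20.0]

print("original: mean", mean(original), "median", median(original))
print("split:    mean", mean(split), "median", median(split))
```

Removing the silent gaps from the one long flow drives the mean down sharply, while the median moves only slightly, and moves down because some of the new pieces are themselves small.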
Mice and Elephants View (cont.)
Heavy tail duration views (80% window):
Original Peak (revisited)
- mean, median & full window fraction are slightly smaller
- more mice and fewer elephants
- not a large change
Original Off Peak (revisited)
- mean, median & full window fraction are much smaller
- far fewer elephants
- dramatic visual change
- big time of day difference
- explainable by less Sunday morning Napster???
A simple model
Hopes:
- provide additional structure to support ideas.
- e.g. mice and elephants plots
- in particular, illustrate:
Heavy tailed durations
Long Range Dependence
- yield "open problems" that become more accessible
- be reasonably realistic
A simple model (cont.)
- Continuous time (simpler than discrete?)
- Homogeneous Poisson "starting time" (for flows)
- Draw independent "duration time" (for each flow)
- Define the number of "active flows" (at each time)
- Queueing name??? Reference?
- Allows analysis of
Heavy tailed durations
Long Range Dependence
- Good reference?
Cox (1984) Long-Range Dependence: A Review, in Statistics: An Appraisal, Proceedings 50th Anniversary Conference, H. A. David and H. T. David (eds.), The Iowa State University Press, 55-74. (in discrete case)
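The model above (Poisson flow starts, independent durations, count of active flows) is what queueing theory usually calls an M/G/∞ queue. The following is a minimal simulation sketch; the function names, the start rate, and the Pareto shape and scale are all assumptions chosen for illustration, not values fitted to the class data.

```python
import random

def simulate_flows(rate, duration_sampler, horizon, seed=0):
    """Homogeneous Poisson start times on [0, horizon), each with an
    independently drawn duration. Returns a list of (start, end) pairs."""
    rng = random.Random(seed)
    flows = []
    t = rng.expovariate(rate)            # exponential interarrival times
    while t < horizon:
        flows.append((t, t + duration_sampler(rng)))
        t += rng.expovariate(rate)
    return flows

def active_count(flows, t):
    """Number of flows that have started but not yet ended at time t."""
    return sum(1 for start, end in flows if start <= t < end)

def pareto_duration(rng):
    # Assumed heavy tailed durations: Pareto with shape 1.5
    # (finite mean, infinite variance) and minimum 0.1 sec.
    return 0.1 * rng.paretovariate(1.5)

flows = simulate_flows(rate=5.0, duration_sampler=pareto_duration, horizon=200.0)
counts = [active_count(flows, t) for t in range(0, 200, 10)]
print(counts)
```

Swapping `pareto_duration` for a light tailed sampler (for example an exponential) gives a simple way to compare how the active flow count behaves with and without heavy tailed durations.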
A simple model (cont.)
How reasonable is this?
Simple checks of assumptions:
1. SiZer analysis of homogeneity (constant intensity)
2. QQ investigation of interarrival times
A simple model (cont.)
1. SiZer analysis of homogeneity (constant intensity)
- only for flow start times (recall connection (flow) graphic)
- clearly not homogeneous (lots of red & blue regions)
- but not so "rough" as packet level analysis?
- homogeneous Poisson OK as an approximation????
- recall different time scales
- non-stationary type decrease in peak??
- note similar effects over 250 sec in off peak!
- "stationarity" depends on scale??
A simple model (cont.)
1. SiZer analysis of homogeneity (cont.):
Comparison to full packet SiZer: (recall connection (flow) graphic)
- start times have "fewer ups and downs"?
- just due to larger sample size???
- or have "subtracted out Long Range Dependence"???
- ups and downs are correlated?
- suggests "Long Range Dependence" also affects starts?
- stronger correlation on right than on left?
- could be due to start time boundary effects???
A simple model (cont.)
2. QQ investigation of interarrival times between session starts (recall connection (flow) graphic)
2 a. Exponential QQ, scale est'd by sample mean
- unacceptable fit
- especially slope
- suggesting wrong shape parameter
- magnifies small values
- still suggests wrong shape
- a few dramatically large values
- caused by boundary effects??
- again magnifies small values
- again suggests wrong shape parameter
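An exponential QQ comparison of this kind pairs the sorted interarrival times with exponential quantiles whose scale is the sample mean. A minimal sketch follows; the function name `exponential_qq` and the plotting positions (i + 0.5)/n are assumptions (other conventions exist), and the sample values are invented.

```python
import math

def exponential_qq(data):
    """Return (theoretical, empirical) quantiles for an exponential QQ plot,
    with the exponential scale estimated by the sample mean."""
    xs = sorted(data)
    n = len(xs)
    mean = sum(xs) / n
    probs = [(i + 0.5) / n for i in range(n)]           # plotting positions
    theo = [-mean * math.log(1.0 - p) for p in probs]   # exponential quantiles
    return theo, xs

# Hypothetical interarrival times (sec).
theo, emp = exponential_qq([0.02, 0.5, 0.1, 1.7, 0.3])
for q_t, q_e in zip(theo, emp):
    print(round(q_t, 3), q_e)
```

If the exponential model held, the pairs would lie near the 45 degree line; plotting both axes on a log scale is one way to magnify the small values.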
A simple model (cont.)
2. QQ investigation of interarrival times (cont.)
2 b. Weibull QQ, parameters est'd by quantile matching
- very good fit
- shape parameter 0.9 "close to Poisson 1.0"? (Weibull with shape 1.0 is the exponential, i.e. Poisson interarrivals)
- provides a workable approximation????
- similar lessons
- visual impression distorted by few large values
- very similar shape parameter 0.9
- similar lessons
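Quantile matching for the Weibull has a closed form, because the log of the Weibull quantile is linear in log(-log(1 - p)). The sketch below matches two empirical quantiles; which quantiles were matched in class is not stated in these notes, so the choice of the 25% and 75% points, the function names, and the simple empirical quantile rule are all assumptions.

```python
import math

def weibull_quantile(p, shape, scale):
    """Quantile function of the Weibull distribution."""
    return scale * (-math.log(1.0 - p)) ** (1.0 / shape)

def fit_weibull_by_quantiles(data, p1=0.25, p2=0.75):
    """Choose (shape, scale) so the Weibull quantiles at p1 and p2
    equal the empirical quantiles. Requires positive data."""
    xs = sorted(data)
    n = len(xs)
    q1 = xs[min(n - 1, int(p1 * n))]    # crude empirical quantiles
    q2 = xs[min(n - 1, int(p2 * n))]
    c1 = math.log(-math.log(1.0 - p1))
    c2 = math.log(-math.log(1.0 - p2))
    # log Q(p) = log(scale) + (1/shape) * log(-log(1 - p))
    shape = (c2 - c1) / (math.log(q2) - math.log(q1))
    scale = q1 / (-math.log(1.0 - p1)) ** (1.0 / shape)
    return shape, scale

# Check on synthetic data built from a known Weibull (shape 0.9, scale 2.0).
data = [weibull_quantile((i + 0.5) / 1000, 0.9, 2.0) for i in range(1000)]
print(fit_weibull_by_quantiles(data))
```

The fitted parameters can then be plugged into `weibull_quantile` to produce the theoretical axis of a Weibull QQ plot.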
Consequences of Weibull, with shape parameter 0.9:
- "Weibull bunching near 0" drives above SiZer analysis??
- suggests an improved model????
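The "bunching near 0" can be made concrete by comparing the probability of a very short interarrival time under a Weibull with shape 0.9 and under an exponential (shape 1.0) with the same mean. This is an illustrative calculation; the mean of 1.0 and the cutoff of 1% of the mean are arbitrary assumptions, and the function names are hypothetical.

```python
import math

def weibull_cdf(x, shape, scale):
    """P(X <= x) for a Weibull distribution."""
    return 1.0 - math.exp(-((x / scale) ** shape))

def scale_for_mean(mean, shape):
    """Weibull scale giving the requested mean: mean = scale * Gamma(1 + 1/shape)."""
    return mean / math.gamma(1.0 + 1.0 / shape)

mean = 1.0              # assumed common mean interarrival time
cutoff = 0.01 * mean    # "very short" interarrival, an arbitrary choice
for shape in (1.0, 0.9):
    scale = scale_for_mean(mean, shape)
    print("shape", shape, "P(X <= cutoff) =", weibull_cdf(cutoff, shape, scale))
```

With the mean held fixed, the shape 0.9 case puts noticeably more probability on very short interarrival times than the exponential does, which is the sense in which flow starts cluster more than a homogeneous Poisson process would predict.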