Class Notes 10/1/01
Last Time:   Mice and Elephants View
 
- Visualization related to:
Heavy tailed durations  
Long Range Dependence
- extracted flows from 5 million packets
- by source and destination addresses
- time windows gave truncation - length biasing
- fit distributions - Pareto, lognormal
- constructed simulated versions
   
-    careful look at IP, TCP, UDP, ...
 
 
 
Mice and Elephants View (cont.)
Revisit protocol background:
 
a. all packets are IP (Internet Protocol)

b. subsets of IP include TCP, UDP, ...
 
c. TCP (Transmission Control Protocol) packets
- are "acknowledged", to get "certified transmission"
- thus involve loss recovery mechanisms
- includes HTTP, FTP, Telnet, SMTP, Napster,...
   
-    ~80 % of current traffic
 
d. UDP (User Datagram Protocol) packets are "sent out only",
- i.e. loss is ignored
- includes streaming music, video, ...
   
-    most of rest of traffic
 
 
 
Mice and Elephants View (cont.)
Revisit protocol background (cont.)
 
 
Interesting comment:
"I thought the main difference between TCP and UDP
was that TCP is 'network nice' while UDP is not"
Definition of 'network nice':
- sends data at "moderate rate" (TCP windowing)
- deliberately backs off when congestion is detected
- serious acknowledgement of "sharing" of resources
- crude attempt at optimizing bandwidth in "overall" sense
   
-    "greedy behavior" is harder (but not impossible)
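
The "backs off when congestion is detected" behavior can be illustrated with a toy additive-increase / multiplicative-decrease rule, which is the usual shape of TCP congestion avoidance. The notes do not spell out the mechanism, so the function name, the increment of 1 and the halving factor below are illustrative assumptions, not a model of any particular TCP implementation.

```python
def aimd_window(loss_events, start=1.0, increase=1.0, decrease=0.5):
    """Toy congestion window trace: grow additively, shrink on loss.

    loss_events is a sequence of booleans, one per round trip
    (True = congestion detected in that round). All parameter
    values are illustrative assumptions.
    """
    window = start
    trace = []
    for lost in loss_events:
        if lost:
            # Back off: multiplicative decrease, never below one packet.
            window = max(1.0, window * decrease)
        else:
            # No congestion seen: probe for more bandwidth slowly.
            window += increase
        trace.append(window)
    return trace


# Five clean round trips, one loss, then two more clean round trips.
trace = aimd_window([False] * 5 + [True] + [False] * 2)
print(trace)
```

A sender that ignores loss (the UDP case above) would correspond to never taking the back-off branch.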
 
 
Don Smith's resolution of these impressions:
 
 
Both differences are correct; relevance depends on viewpoint:
Protocol Researcher:
"loss free transmission" difference is more fundamental
Network Traffic Researcher:
network behavior difference is more important
Mice and Elephants View (cont.)
Revisit issue:  what is an IP "flow" (connection)?
 
Above definition:
an "IP flow" is a set of packets with the same sending and receiving IP addresses
Above noted weakness:
multiple visits to the same server are combined, e.g. several web pages from the same site
Potential problem: could be "long lags" that interrupt:
Heavy tailed durations  
Long Range Dependence
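
A minimal sketch of the flow definition above: packets are grouped by their (source, destination) address pair. The record layout `(time_sec, src_addr, dst_addr)`, the function names and the sample packets are all made up for illustration; the notes do not describe the actual trace format.

```python
from collections import defaultdict


def extract_flows(packets):
    """Group packet timestamps by (source, destination) address pair.

    packets: iterable of (time_sec, src_addr, dst_addr) tuples
    (a hypothetical record layout).
    Returns a dict mapping (src, dst) to a sorted list of timestamps.
    """
    flows = defaultdict(list)
    for time_sec, src, dst in packets:
        flows[(src, dst)].append(time_sec)
    return {key: sorted(times) for key, times in flows.items()}


def flow_duration(times):
    """Duration of a flow: last packet time minus first packet time."""
    return times[-1] - times[0]


# Made-up packets: two visits to the same server, about 100 sec apart.
packets = [
    (0.0, "10.0.0.1", "10.0.0.9"),
    (0.4, "10.0.0.1", "10.0.0.9"),
    (100.0, "10.0.0.1", "10.0.0.9"),
    (100.3, "10.0.0.1", "10.0.0.9"),
    (5.0, "10.0.0.2", "10.0.0.9"),
]
flows = extract_flows(packets)
print(flow_duration(flows[("10.0.0.1", "10.0.0.9")]))
```

Under this definition the two visits are merged into one flow whose duration is dominated by the idle lag between them, which is the weakness noted above.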
Mice and Elephants View (cont.)
Quick data view:
study "maximal time gap" for above flows
Peak:
- max time gap = 261 (sec)
- total time window = 267 (sec)

Off Peak:
- max time gap = 2113 (sec)
- total time window = 2199 (sec)
 
 
- clearly an important issue!
   
-    related to length bias issues??
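
The "maximal time gap" of a flow could be computed along the following lines. This is a sketch that assumes each flow is available as a sorted list of packet times in seconds; the function name and the sample values are hypothetical.

```python
def max_time_gap(times):
    """Largest gap between consecutive packet times in one flow.

    times: sorted list of packet timestamps (seconds).
    Returns 0.0 for a flow with fewer than two packets.
    """
    if len(times) < 2:
        return 0.0
    return max(later - earlier for earlier, later in zip(times, times[1:]))


# Made-up flow: two short bursts separated by a long idle period.
print(max_time_gap([0.0, 0.5, 90.5, 91.0]))
```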
 
 
 
Mice and Elephants View (cont.)
Quick fix:  re-definition of "flow"
 
 
Split above flows whenever the gap between packets is > 60 (sec)
 
- 60 sec is "very long" compared with a TCP loss period
(TCP recovery "usually" 0.01 - 0.1 sec)
- not so long as "think time between browser clicks"
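
The re-definition can be sketched as a split of each flow's packet times wherever the gap exceeds the threshold. The function name is hypothetical, and the input is assumed to be a sorted list of times in seconds.

```python
def split_flow(times, max_gap=60.0):
    """Split one flow's sorted packet times wherever a gap exceeds max_gap."""
    if not times:
        return []
    pieces = [[times[0]]]
    for earlier, later in zip(times, times[1:]):
        if later - earlier > max_gap:
            # Gap too long: start a new flow at this packet.
            pieces.append([later])
        else:
            pieces[-1].append(later)
    return pieces


# Made-up flow: two visits about 100 sec apart become two flows.
print(split_flow([0.0, 0.4, 100.0, 100.3]))
```

A gap of exactly 60 sec does not trigger a split here, matching the strict "> 60" in the rule.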
 
 
 
Mice and Elephants View (cont.)
Some quick summary statistics:
 
 
Peak Times:
Original Mean: 20.5
   
Split Mean:         14.1
 
Off Peak Times:
Original Mean: 86.2
   
Split Mean:         13.8
 
- Conclusion: substantial change in elephants
   
-    especially in longer off peak range
 
 
 
Mice and Elephants View (cont.)
Some quick summary statistics (cont.)
 
 
Peak Times:
Original Median: 1.1
   
Split Median:         0.9
 
Off Peak Times:
Original Median: 0.5
   
Split Median:         0.3
 
- Conclusion: small change in mice
   
-    smaller median suggests many split pieces are small!
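
Why the mean reacts strongly to splitting while the median barely moves can be seen on a toy example. The durations below are invented purely for illustration (they are not the class data): one "elephant" is split into one medium piece and two tiny ones.

```python
import statistics

# Made-up durations (sec): four mice and one elephant.
original = [0.2, 0.5, 1.0, 2.0, 400.0]

# After splitting, the elephant becomes three shorter flows.
split = [0.2, 0.5, 1.0, 2.0, 30.0, 0.3, 0.1]

for name, durations in (("original", original), ("split", split)):
    print(name, statistics.mean(durations), statistics.median(durations))
```

The mean is driven by the single largest flow, so it drops sharply, while the median only shifts because the split adds a few more small flows.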
 
 
 
Mice and Elephants View (cont.)
Heavy tail duration views (80% window):
 
 
Original Peak (revisited)
- mean, median & full window fraction are slightly smaller
- more mice and fewer elephants
   
-    not a large change
 
 
Original Off Peak (revisited)
- mean, median & full window fraction are much smaller
- far fewer elephants
- dramatic visual change
- big time of day difference
   
-    explainable by less Sunday morning Napster???
 
 
 
A simple model
Hopes:
 
   
-    provide additional structure to support ideas.
 
   
-    e.g. mice and elephants plots
 
- in particular, illustrate:
Heavy tailed durations  
Long Range Dependence
   
-    yield "open problems" that become more accessible
 
   
-    be reasonably realistic
 
 
 
A simple model (cont.)
   
-    Continuous time (simpler than discrete?)
 
   
-    Homogeneous Poisson "starting time" (for flows)
 
   
-    Draw independent "duration time" (for each flow)
 
   
-    Define the number of "active flows" at each time
 
 
 
 
   
-    Queueing name???  Reference?
 
- Allows analysis of
Heavy tailed durations  
Long Range Dependence
   
-    Good reference?
 
Cox (1984) Long-Range Dependence: A Review, in Statistics: An Appraisal, Proceedings 50th Anniversary Conference, H. A. David, H. T. David (eds.), The Iowa State University Press, 55-74.
(in discrete case)
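
The model described above can be simulated in a few lines. In queueing terms, Poisson arrivals with independent, generally distributed durations and no waiting are usually called an M/G/∞ queue, which may be the name asked for above. The sketch below assumes Pareto durations (one heavy tailed choice); the rate, horizon, tail index and scale are arbitrary illustrative values, and the function names are hypothetical.

```python
import random


def simulate_flows(rate, horizon, shape, scale, seed=0):
    """Simulate (start, end) times of flows.

    Starts: homogeneous Poisson process with intensity `rate` on [0, horizon).
    Durations: independent Pareto draws with tail index `shape`
    and minimum value `scale` (an assumed duration distribution).
    """
    rng = random.Random(seed)
    flows = []
    t = rng.expovariate(rate)  # exponential gaps give Poisson starts
    while t < horizon:
        duration = scale * rng.paretovariate(shape)
        flows.append((t, t + duration))
        t += rng.expovariate(rate)
    return flows


def active_flows(flows, t):
    """Number of flows that have started by time t and not yet ended."""
    return sum(1 for start, end in flows if start <= t < end)


flows = simulate_flows(rate=5.0, horizon=100.0, shape=1.5, scale=0.1)
counts = [active_flows(flows, t) for t in range(0, 100, 10)]
print(counts)
```

A tail index between 1 and 2 gives durations with finite mean but infinite variance, the regime usually associated with long range dependence in the active-flow count.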
 
 
 
A simple model (cont.)
How reasonable is this?
 
 
Simple checks of assumptions:
 
1. SiZer analysis of homogeneity (constant intensity)
 
2. QQ investigation of interarrival times
 
 
 
A simple model (cont.)
1. SiZer analysis of homogeneity (constant intensity)
   
-    only for flow start times (recall connection (flow) graphic)
 
   
-    clearly not homogeneous (lots of red & blue regions)
 
   
-    but not so "rough" as packet level analysis?
 
   
-    homogeneous Poisson OK as an approximation????
 
   
-    recall different time scales
 
   
-    non-stationary type decrease in peak??
 
   
-    note similar effects over 250 sec in off peak!
 
   
-    "stationarity" depends on scale??
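
SiZer itself is not reproduced here. As a much cruder stand-in (a swapped-in technique, not the analysis used in class), one can bin the flow start times and compare the variance of the bin counts with their mean: for a homogeneous Poisson process the ratio should be near 1. The function names and the bin count are assumptions.

```python
import statistics


def bin_counts(start_times, window, n_bins):
    """Count start times falling in each of n_bins equal bins on [0, window)."""
    counts = [0] * n_bins
    width = window / n_bins
    for t in start_times:
        if 0 <= t < window:
            counts[min(int(t / width), n_bins - 1)] += 1
    return counts


def dispersion_index(counts):
    """Variance-to-mean ratio of bin counts; about 1 under homogeneous Poisson."""
    return statistics.variance(counts) / statistics.mean(counts)


# Made-up start times on a 4 sec window, one bin per second.
counts = bin_counts([0.5, 1.5, 1.6, 3.9], window=4.0, n_bins=4)
print(counts, dispersion_index(counts))
```

Changing `n_bins` changes the time scale of the check, which is one simple way to see that the verdict on "stationarity" can depend on scale.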
 
 
 
A simple model (cont.)
1. SiZer analysis of homogeneity (cont.):
 
Comparison to full packet SiZer:
(recall connection (flow) graphic)
 
 
- start times have "fewer ups and downs"?
- just due to larger sample size???
   
-    or have "subtracted out Long Range Dependence"???
 
- ups and downs are correlated?
   
-    suggests "Long Range Dependence" also affects starts?
 
- stronger correlation on right than on left?
   
-    could be due to start time boundary effects???
 
 
 
A simple model (cont.)
2. QQ investigation of interarrival times between session starts (recall connection (flow) graphic)
2 a. Exponential QQ, scale est'd by sample mean
 
- unacceptable fit
- especially slope
   
-    suggesting wrong shape parameter
 
- magnifies small values
   
-    still suggests wrong shape
 
- a few dramatically large values
   
-    caused by boundary effects??
 
- again magnifies small values
   
-    again suggests wrong shape parameter
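
The points of an exponential QQ plot with the scale estimated by the sample mean could be computed as follows. This is a sketch: the plotting positions (i - 0.5) / n are one common convention and an assumption here, as is the function name.

```python
import math


def exponential_qq(data):
    """Return (theoretical, empirical) quantile pairs for an exponential QQ plot.

    The exponential scale is estimated by the sample mean.
    Plotting positions (i - 0.5) / n are an assumed convention.
    """
    xs = sorted(data)
    n = len(xs)
    scale = sum(xs) / n
    pairs = []
    for i, x in enumerate(xs, start=1):
        p = (i - 0.5) / n
        # Exponential quantile function: Q(p) = -scale * ln(1 - p).
        theoretical = -scale * math.log(1.0 - p)
        pairs.append((theoretical, x))
    return pairs
```

If the interarrival times were exponential, the pairs would fall near the 45 degree line; systematic curvature points to a wrong shape rather than a wrong scale.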
 
 
 
A simple model (cont.)
2. QQ investigation of interarrival times (cont.)
 
 
2 b. Weibull QQ, parameters est'd by quantile matching
 
- very good fit
- shape parameter 0.9 "close to Poisson 1.0"?
   
-    provides a workable approximation????
 
   
-    similar lessons
 
 
- visual impression distorted by few large values
   
-    very similar shape parameter 0.9
 
   
-    similar lessons
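
Quantile matching for the Weibull can be sketched by solving for the two parameters from two empirical quantiles, using ln Q(p) = ln(scale) + (1 / shape) ln(-ln(1 - p)). The notes do not say which quantiles were matched; the 25% and 75% quantiles, the simple empirical quantile rule and the function names below are assumptions.

```python
import math


def empirical_quantile(sorted_data, p):
    """Simple empirical quantile: the order statistic at index floor(p * n)."""
    n = len(sorted_data)
    return sorted_data[min(int(p * n), n - 1)]


def weibull_quantile(p, shape, scale):
    """Weibull quantile function Q(p) = scale * (-ln(1 - p)) ** (1 / shape)."""
    return scale * (-math.log(1.0 - p)) ** (1.0 / shape)


def weibull_fit_by_quantiles(data, p_low=0.25, p_high=0.75):
    """Fit Weibull (shape, scale) by matching two empirical quantiles.

    Requires positive data with distinct low and high quantiles.
    The default quantile levels are an assumption.
    """
    xs = sorted(data)
    q_low = empirical_quantile(xs, p_low)
    q_high = empirical_quantile(xs, p_high)
    c_low = math.log(-math.log(1.0 - p_low))
    c_high = math.log(-math.log(1.0 - p_high))
    # Slope of ln Q(p) against ln(-ln(1 - p)) is 1 / shape.
    shape = (c_high - c_low) / (math.log(q_high) - math.log(q_low))
    scale = q_low / (-math.log(1.0 - p_low)) ** (1.0 / shape)
    return shape, scale
```

The Weibull QQ plot then pairs `weibull_quantile(p, shape, scale)` with the sorted data, in the same way as the exponential case; shape 1.0 recovers the exponential.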
 
 
Consequences of Weibull, with shape parameter 0.9:
- "Weibull bunching near 0" drives above SiZer analysis??
   
-    suggests an improved model????
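
One way to read "Weibull bunching near 0": with shape below 1, the Weibull puts more probability on very short interarrival times than an exponential with the same mean. The sketch below compares the two; the mean of 1.0 time unit and the evaluation points are arbitrary assumptions, and the function names are hypothetical.

```python
import math


def weibull_cdf(x, shape, scale):
    """P(X <= x) for a Weibull distribution; shape = 1 is the exponential."""
    return 1.0 - math.exp(-((x / scale) ** shape))


def scale_for_mean(mean, shape):
    """Weibull scale giving the requested mean: mean / Gamma(1 + 1 / shape)."""
    return mean / math.gamma(1.0 + 1.0 / shape)


mean_gap = 1.0  # arbitrary time unit, assumed for illustration
for x in (0.001, 0.01, 0.1):
    p_weibull = weibull_cdf(x, 0.9, scale_for_mean(mean_gap, 0.9))
    p_expon = weibull_cdf(x, 1.0, scale_for_mean(mean_gap, 1.0))
    # Ratio above 1 means more very short gaps than Poisson would give.
    print(x, p_weibull / p_expon)
```

The excess of short gaps grows as x shrinks, which is consistent with start times looking "clustered" at fine scales even when shape 0.9 seems close to 1.0.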