Course  OR 778

Class Notes   10/1/01






Last Time:   Mice and Elephants View
 

    -    Visualization related to:

Heavy tailed durations  =>  Long Range Dependence

    -    extracted flows from 5 million packets

    -    by source and destination addresses

    -    time windows gave truncation - length biasing

    -    fit distributions - Pareto, lognormal

    -    constructed simulated versions

    -    careful look at IP, TCP, UDP, ...
 
 
 


Mice and Elephants View (cont.)






Revisit protocol background:
 

a.    all packets are IP (Internet Protocol)
 

b.    subsets of IP include TCP, UDP, ...
 

c.    TCP (Transmission Control Protocol) packets

    -    are "acknowledged", to get "certified transmission"

    -    thus involve loss recovery mechanisms

    -    includes HTTP, FTP, Telnet, SMTP, Napster,...

    -    ~80 % of current traffic
 

d.    UDP (User Datagram Protocol) packets are "sent out only",

    -    i.e. loss is ignored

    -    includes streaming music, video, ...

    -    most of rest of traffic
 
 
 


Mice and Elephants View (cont.)






Revisit protocol background (cont.)
 
 

Interesting comment:

"I thought the main difference between TCP and UDP

was that TCP is 'network nice' while UDP is not"






Definition of 'network nice':

    -    sends data at "moderate rate" (TCP windowing)

    -    deliberately backs off when congestion is detected

    -    serious acknowledgement of "sharing" of resources

    -    crude attempt at optimizing bandwidth in "overall" sense

    -    "greedy behavior" is harder (but not impossible)
 
 

Don Smith's resolution of these impressions:
 
 

Both differences are correct; relevance depends on viewpoint:






Protocol Researcher:

"loss free transmission" difference is more fundamental






Network Traffic Researcher:

network behavior difference is more important








Mice and Elephants View (cont.)






Revisit issue:  what is an IP "flow" (connection)?
 

Above definition:

an "IP flow" is a set of packets with
same sending and receiving IP addresses
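
A minimal sketch of this flow definition, assuming each packet is recorded as a (timestamp, source address, destination address) tuple. That record layout, the function name and the example addresses are illustrative assumptions, not the format of the actual trace.

```python
from collections import defaultdict

def group_into_flows(packets):
    """Group packets by (source, destination) address pair.

    `packets` is an iterable of (timestamp_sec, src_ip, dst_ip) tuples;
    this layout is assumed for illustration only.
    Returns a dict mapping (src_ip, dst_ip) to a sorted list of timestamps.
    """
    flows = defaultdict(list)
    for timestamp, src, dst in packets:
        flows[(src, dst)].append(timestamp)
    for times in flows.values():
        times.sort()
    return dict(flows)

# Made-up packets: three from one address pair, one from another.
packets = [
    (0.00, "10.0.0.1", "10.0.0.9"),
    (0.05, "10.0.0.2", "10.0.0.9"),
    (0.40, "10.0.0.1", "10.0.0.9"),
    (75.0, "10.0.0.1", "10.0.0.9"),
]
flows = group_into_flows(packets)
```

Note that the packet at 75.0 sec lands in the same flow as the packets at 0.00 and 0.40 sec, since only the address pair is used.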






Above noted weakness:

multiple visits to server are combined
e.g. several web pages from same site






Potential problem:  could be "long lags" that interrupt:

Heavy tailed durations  =>  Long Range Dependence








Mice and Elephants View (cont.)






Quick data view:

study "maximal time gap" for above flows






Peak:

    -    max time gap = 261 (sec)

    -    total time window = 267 (sec)
 

Off Peak:

    -    max time gap = 2113 (sec)

    -    total time window = 2199 (sec)
 
 

    -    clearly an important issue!

    -    related to length bias issues??
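
A small sketch of how a "maximal time gap" could be computed, assuming each flow is held as a list of packet timestamps in seconds. The flow names and times are made up for illustration.

```python
def max_time_gap(times):
    """Largest gap (sec) between consecutive packets of one flow.

    Returns 0.0 for a flow with fewer than two packets.
    """
    ordered = sorted(times)
    return max((b - a for a, b in zip(ordered, ordered[1:])), default=0.0)

# Hypothetical flows: name -> packet timestamps (sec)
flow_times = {"flow A": [0.0, 0.4, 75.0], "flow B": [0.05]}
gaps = {name: max_time_gap(t) for name, t in flow_times.items()}
largest_gap = max(gaps.values())  # maximum over all flows
```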
 
 
 


Mice and Elephants View (cont.)







Quick fix:  re-definition of "flow"
 
 

Split above flows, whenever gap between packets is > 60 (sec)
 

    -    "very long" TCP loss period

(TCP recovery "usually" 0.01 - 0.1 sec)







    -    not so long as "think time between browser clicks"
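
The re-definition can be sketched as follows, assuming a flow is a list of packet timestamps in seconds. The function names and example times are hypothetical; only the 60 sec threshold comes from the notes.

```python
def split_flow(times, max_gap=60.0):
    """Split one flow into pieces wherever consecutive packets
    are more than `max_gap` seconds apart."""
    pieces = []
    for t in sorted(times):
        if pieces and t - pieces[-1][-1] <= max_gap:
            pieces[-1].append(t)   # gap small enough: same piece
        else:
            pieces.append([t])     # first packet, or gap > max_gap
    return pieces

def durations(pieces):
    """Duration (last packet time minus first) of each piece."""
    return [p[-1] - p[0] for p in pieces]

# Made-up flow: bursts near 0, 75 and 200 sec
times = [0.0, 0.4, 1.1, 75.0, 75.3, 200.0]
pieces = split_flow(times)
piece_durations = durations(pieces)
```

Here one flow of duration 200 sec becomes three short pieces, which is the kind of "elephant into mice" change that splitting can produce.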
 
 
 


Mice and Elephants View (cont.)






Some quick summary statistics:
 
 

Peak Times:

    Original Mean:    20.5

    Split Mean:         14.1
 

Off Peak Times:

    Original Mean:    86.2

    Split Mean:         13.8
 

    -    Conclusion: substantial change in elephants

    -    especially in longer off peak range
 
 
 


Mice and Elephants View (cont.)






Some quick summary statistics (cont.)
 
 

Peak Times:

    Original Median:    1.1

    Split Median:         0.9
 

Off Peak Times:

    Original Median:    0.5

    Split Median:         0.3
 

    -    Conclusion: small change in mice

    -    smaller median suggests many split pieces are small!
 
 
 


Mice and Elephants View (cont.)






Heavy tail duration views (80% window):
 
 

Original Peak    (revisited)

Split Peak

    -    mean, median & full window fraction are slightly smaller

    -    more mice and fewer elephants

    -    not a large change
 
 

Original Off Peak    (revisited)

Split Off Peak

    -    mean, median & full window fraction are much smaller

    -    far fewer elephants

    -    dramatic visual change

    -    big time of day difference

    -    explainable by less Sunday morning Napster???
 
 
 


A simple model






Hopes:
 

    -    provide additional structure to support ideas.
 

    -    e.g. mice and elephants plots
 

    -    in particular, illustrate:

Heavy tailed durations  =>  Long Range Dependence






    -    yield "open problems" that become more accessible
 

    -    be reasonably realistic
 
 
 


A simple model (cont.)






    -    Continuous time (simpler than discrete?)
 

    -    Homogeneous Poisson "starting time" (for flows)
 

    -    Draw independent "duration time" (for each flow)
 

    -    Define  X(t)  =  number of "active flows" at time t
 
 
 
 

    -    Queueing name?  M/G/∞ ???   Reference?
 

    -    Allows analysis of

Heavy tailed durations  =>  Long Range Dependence





    -    Good reference?
 

Cox, D. R. (1984). Long-Range Dependence: A Review. In Statistics: An Appraisal, Proceedings 50th Anniversary Conference, H. A. David and H. T. David (eds.), The Iowa State University Press, 55-74.
(in discrete case)
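
A simulation sketch of the model above: homogeneous Poisson flow start times, an independent duration for each flow, and a count of the flows active at a given time. The Pareto duration distribution, its tail index 1.5, the rate and the time horizon are assumptions chosen for illustration, not values from the data.

```python
import random

def simulate_flows(rate, horizon, draw_duration, rng):
    """Return (start, end) pairs: starts form a homogeneous Poisson
    process of the given rate on [0, horizon); each flow gets an
    independent duration from draw_duration(rng)."""
    flows = []
    t = rng.expovariate(rate)          # exponential interarrival times
    while t < horizon:
        flows.append((t, t + draw_duration(rng)))
        t += rng.expovariate(rate)
    return flows

def active_count(flows, t):
    """Number of flows with start <= t < end."""
    return sum(1 for start, end in flows if start <= t < end)

def pareto_duration(rng, alpha=1.5, scale=1.0):
    """Heavy tailed duration: P(D > x) = (scale / x)**alpha for x >= scale."""
    return scale * rng.paretovariate(alpha)

rng = random.Random(0)
flows = simulate_flows(rate=2.0, horizon=1000.0,
                       draw_duration=pareto_duration, rng=rng)
# Active flow counts on a grid, skipping an initial warm-up stretch
counts = [active_count(flows, t) for t in range(200, 1000, 10)]
```

With these assumed values the mean duration is 3, so the long-run mean of the active count is rate x mean duration = 6. With a heavy tailed duration the simulated path can stay away from that level for long stretches.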
 
 
 


A simple model (cont.)






How reasonable is this?
 
 

Simple checks of assumptions:
 

1.  SiZer analysis of homogeneity (constant intensity)
 

2.    QQ investigation of interarrival times
 
 
 


A simple model (cont.)





1.  SiZer analysis of homogeneity (constant intensity)

[Figures: SiZer plots of flow start times; panels: peak, off-peak]





    -    only for flow start times (recall connection (flow) graphic)
 

    -    clearly not homogeneous (lots of red & blue regions)
 

    -    but not so "rough" as packet level analysis?
 

    -    homogeneous Poisson OK as an approximation????
 

    -    recall different time scales
 

    -    non-stationary type decrease in peak??
 

    -    note similar effects over 250 sec in off peak!
 

    -    "stationarity" depends on scale??
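
A much cruder check than SiZer, offered only as an illustration of what "constant intensity" means: bin the start times into equal-width bins and compare the variance of the bin counts with their mean. For a homogeneous Poisson process the counts are independent Poisson variables, so this ratio should be near 1. The rates, bin count and simulated data below are assumptions, not the trace.

```python
import random
import statistics

def poisson_times(rate, start, end, rng):
    """Event times of a homogeneous Poisson process on [start, end)."""
    times = []
    t = start + rng.expovariate(rate)
    while t < end:
        times.append(t)
        t += rng.expovariate(rate)
    return times

def bin_counts(times, start, end, n_bins):
    """Count events in n_bins equal-width bins covering [start, end)."""
    width = (end - start) / n_bins
    counts = [0] * n_bins
    for t in times:
        if start <= t < end:
            counts[min(int((t - start) / width), n_bins - 1)] += 1
    return counts

def dispersion_index(counts):
    """Sample variance of the bin counts divided by their mean."""
    return statistics.variance(counts) / statistics.mean(counts)

rng = random.Random(1)
# Constant intensity 5 per sec over 200 sec
constant = poisson_times(5.0, 0.0, 200.0, rng)
# Intensity jumps from 2 to 8 per sec halfway through
changing = (poisson_times(2.0, 0.0, 100.0, rng)
            + poisson_times(8.0, 100.0, 200.0, rng))
d_constant = dispersion_index(bin_counts(constant, 0.0, 200.0, 50))
d_changing = dispersion_index(bin_counts(changing, 0.0, 200.0, 50))
```

The answer depends on the bin width, which echoes the question of whether "stationarity" depends on scale.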
 
 
 


A simple model (cont.)





1.  SiZer analysis of homogeneity (cont.):
 

Comparison to full packet SiZer:

(recall connection (flow) graphic)
 
 

[Figures: SiZer plots for all packets; panels: off-peak, peak]





    -    start times have "fewer ups and downs"?

    -    just due to larger sample size???

    -    or have "subtracted out Long Range Dependence"???
 

    -    ups and downs are correlated?

    -    suggests "Long Range Dependence" also affects starts?
 

    -    stronger correlation on right than on left?

    -    could be due to start time boundary effects???
 
 
 


A simple model (cont.)






2.    QQ investigation of interarrival times

between session starts (recall connection (flow) graphic)






2 a.    Exponential QQ, scale est'd by sample mean
 

Offpeak

    -    unacceptable fit

    -    especially slope

    -    suggesting wrong shape parameter
 

Offpeak, log scale

    -    magnifies small values

    -    still suggests wrong shape
 

Peak

    -    a few dramatically large values

    -    caused by boundary effects??
 

Peak, log scale

    -    again magnifies small values

    -    again suggests wrong shape parameter
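
A sketch of how the points of an exponential QQ plot can be computed, with the scale estimated by the sample mean as in 2 a. The plotting positions (i - 0.5)/n are one common choice and are an assumption here, as is the simulated sample.

```python
import math
import random

def exponential_qq(sample):
    """Return (theoretical, empirical) quantiles for an exponential
    QQ plot, with the scale estimated by the sample mean."""
    data = sorted(sample)
    n = len(data)
    mean = sum(data) / n
    # Exponential quantile function: Q(p) = -mean * log(1 - p)
    theoretical = [-mean * math.log(1.0 - (i - 0.5) / n)
                   for i in range(1, n + 1)]
    return theoretical, data

rng = random.Random(3)
sample = [rng.expovariate(1.0) for _ in range(2000)]  # simulated interarrivals
theo, emp = exponential_qq(sample)

# Log scale version: magnifies the small values
log_pairs = [(math.log(t), math.log(e)) for t, e in zip(theo, emp) if e > 0]
```

Plotting emp against theo gives points near the 45 degree line when the exponential fits. Because the scale is matched by the mean, a slope that departs from the line points to the wrong distributional shape.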
 
 
 


A simple model (cont.)






2.    QQ investigation of interarrival times (cont.)
 
 

2 b.    Weibull QQ, parameters est'd by quantile matching
 

Offpeak

    -    very good fit

    -    shape parameter 0.9 "close to Poisson 1.0" (shape 1.0 = exponential interarrivals)?

    -    provides a workable approximation????
 

Offpeak, log scale

    -    similar lessons
 
 

Peak

    -    visual impression distorted by few large values

    -    very similar shape parameter 0.9
 

Peak, log scale

    -    similar lessons
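
A sketch of Weibull fitting by quantile matching, using F(x) = 1 - exp(-(x/scale)^shape), so that log Q(p) = log(scale) + (1/shape) log(-log(1 - p)). The notes do not say which quantiles were matched; the 0.25 and 0.75 quantiles used here are an assumption, as is the simulated sample.

```python
import math
import random

def empirical_quantile(sorted_data, p):
    """Order-statistic quantile (no interpolation)."""
    idx = min(int(p * len(sorted_data)), len(sorted_data) - 1)
    return sorted_data[idx]

def fit_weibull_by_quantiles(sample, p1=0.25, p2=0.75):
    """Choose (shape, scale) so the Weibull quantiles at p1 and p2
    equal the sample quantiles. Requires positive data."""
    data = sorted(sample)
    q1, q2 = empirical_quantile(data, p1), empirical_quantile(data, p2)
    w1 = math.log(-math.log(1.0 - p1))
    w2 = math.log(-math.log(1.0 - p2))
    shape = (w2 - w1) / (math.log(q2) - math.log(q1))
    scale = q1 / (-math.log(1.0 - p1)) ** (1.0 / shape)
    return shape, scale

def weibull_qq(sample, shape, scale):
    """Return (theoretical, empirical) quantiles for a Weibull QQ plot."""
    data = sorted(sample)
    n = len(data)
    theoretical = [scale * (-math.log(1.0 - (i - 0.5) / n)) ** (1.0 / shape)
                   for i in range(1, n + 1)]
    return theoretical, data

rng = random.Random(2)
# random.weibullvariate(alpha, beta): alpha is the scale, beta the shape
sample = [rng.weibullvariate(1.0, 0.9) for _ in range(5000)]
shape, scale = fit_weibull_by_quantiles(sample)
theo, emp = weibull_qq(sample, shape, scale)
```

With shape below 1 the Weibull density grows without bound near 0, so very short interarrival times are more common than under the exponential.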
 
 

Consequences of Weibull, with shape parameter 0.9:

    -    "Weibull bunching near 0" drives above SiZer analysis??

    -    suggests an improved model????