Course  OR 778

Class Notes   9/26/01







Last Time
 

    -    finished SiZer background

    -    zooming SiZer analysis

    -    showed aggregated data are not Homogeneous Poisson

    -    different from aggregation conclusions of Cleveland, et al.
 
 
 
 


Mice and Elephants View






In context of:

Heavy tailed durations  =>  Long Range Dependence






Earlier graphical "toy views":

      Exponential Durations

      Pareto 1.5 Durations
 
 

Showed how "few long connections"

            (i.e.  heavy tailed duration distributions)

            might induce Long Range Dependence
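
The contrast between the two toy views can be reproduced numerically. The sketch below is only an illustration: the common mean of 3, the sample size, and the function names are assumptions made here, not taken from the original graphics. It draws durations from each law and reports how much of the total connection time is carried by the longest 1% of connections.

```python
import random
import statistics

def simulate_durations(n, kind, seed=0):
    """Draw n toy connection durations, both laws scaled to have mean 3 (assumed)."""
    rng = random.Random(seed)
    if kind == "exponential":
        # expovariate takes the rate, i.e. 1 / mean
        return [rng.expovariate(1.0 / 3.0) for _ in range(n)]
    if kind == "pareto":
        # paretovariate(1.5) has minimum 1 and mean 1.5 / (1.5 - 1) = 3
        return [rng.paretovariate(1.5) for _ in range(n)]
    raise ValueError("kind must be 'exponential' or 'pareto'")

def share_of_longest(durations, frac=0.01):
    """Fraction of total connection time carried by the longest `frac` of connections."""
    ordered = sorted(durations, reverse=True)
    k = max(1, int(len(ordered) * frac))
    return sum(ordered[:k]) / sum(ordered)

if __name__ == "__main__":
    for kind in ("exponential", "pareto"):
        d = simulate_durations(20000, kind)
        print(kind, statistics.median(d), statistics.mean(d), share_of_longest(d))
```

With equal means, the Pareto sample should show a smaller median and a much larger share for the longest connections, which is the "few long connections" effect.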
 
 
 
 


Mice and Elephants View (cont.)






Basis for "real data" view:
 

    -    5 million IP packets (all packets: TCP & UDP & ...)
 

    -    gathered at UNC main link, during:

                -    peak time - weekday afternoon (< 5 min.)

                -    off-peak time - Sunday morning (~40 min)

                -    during April, 2000
 

    -    with:

                -    time stamps

                -    packet sizes

                -    index of sending and receiving IP addresses
 
 
 


Mice and Elephants View (cont.)






Data Summarization:

define "IP flow" (connection) as set of packets with same index (i.e. same sending and receiving IP addresses)

    -    recall toy graphic    (same index = same color)

    -    keep only first packet time and last packet time
 

Weakness:

multiple visits to server are combined
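
A minimal sketch of this summarization step, assuming each packet record is a tuple (time stamp, packet size, index); the tuple layout and the function name are hypothetical, chosen only for illustration:

```python
def summarize_flows(packets):
    """Reduce packets to flows: {index: (first_packet_time, last_packet_time)}.

    `packets` is an iterable of (timestamp, size, index) tuples (assumed layout).
    Packets need not be sorted by time.
    """
    flows = {}
    for timestamp, _size, index in packets:
        if index in flows:
            first, last = flows[index]
            flows[index] = (min(first, timestamp), max(last, timestamp))
        else:
            # a flow seen once so far starts and ends at the same time
            flows[index] = (timestamp, timestamp)
    return flows

if __name__ == "__main__":
    toy = [(0.0, 40, "a"), (1.5, 1500, "b"), (2.0, 40, "a"), (0.5, 60, "b"), (3.0, 40, "c")]
    print(summarize_flows(toy))
```

Single-packet flows come out with duration 0. The weakness noted above is visible in the code: every packet with a given index lands in one flow, however long the gap between visits.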






Graphic (same as "simulated durations" above):

    -    represent as horizontal line (over time)

    -    vertical height: random, to separate lines

    -    random sample of 1000 (full 142170 is too much)
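
The layout of such a graphic can be sketched as follows. This only computes the line segments (start, end, height) and leaves the drawing to any plotting tool. The function name is hypothetical, and the heights here are equally spaced values assigned in random order, which is one way of getting "random" heights without clumping.

```python
import random

def layout_segments(flows, sample_size=1000, seed=0):
    """Pick a random sample of flows and give each one a distinct vertical height.

    `flows` is a sequence of (first_time, last_time) pairs.
    Returns a list of (first_time, last_time, height) with heights in (0, 1).
    """
    rng = random.Random(seed)
    flows = list(flows)
    chosen = rng.sample(flows, min(sample_size, len(flows)))
    # equally spaced heights, shuffled so that height carries no information
    heights = [(i + 0.5) / len(chosen) for i in range(len(chosen))]
    rng.shuffle(heights)
    return [(start, end, h) for (start, end), h in zip(chosen, heights)]
```

Each returned triple corresponds to one horizontal line from first_time to last_time at the given height.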
 
 
 
 
 


Mice and Elephants View (cont.)







Peak time graphic

    -    Looks more like Pareto 1.5 than Exponential?

    -    Boundary effects?

    -    Mean much larger than median
 

Off-Peak graphic

    -    Similar, (but stronger) Mice and Elephants effects

    -    careful about much longer time span

(needed to gather 5 million packets)








Mice and Elephants View (cont.)






Mitigation of boundary effects:  narrower time window

    -    study for Off Peak data
 
 

2.  80% of full time span
 

    -    different randomization
 

    -    now see flows overlapping edge (as expected)
 

    -    "right number" overlapping??
 

    -    mean connection is smaller (since smaller window)
 

    -    median connection is larger?!?!

(must have eliminated more mice than elephants)





    -    1.3% cover full time window
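
The windowed summaries quoted above (mean, median, percent covering the full window) can be computed along the following lines. This is a sketch under assumptions not stated in the notes: the window is taken to be centered in the full time span, flows that miss the window are dropped, and the remaining flows are clipped to the window edges. The function names are hypothetical.

```python
import statistics

def window_flows(flows, fraction):
    """Clip flows to a centered window covering `fraction` of the full time span."""
    flows = list(flows)
    t_min = min(start for start, _ in flows)
    t_max = max(end for _, end in flows)
    margin = (1.0 - fraction) * (t_max - t_min) / 2.0
    lo, hi = t_min + margin, t_max - margin
    # keep only flows that overlap the window, truncated at its edges
    clipped = [(max(s, lo), min(e, hi)) for s, e in flows if e >= lo and s <= hi]
    return clipped, (lo, hi)

def window_summary(flows, fraction):
    """Mean and median clipped duration, and fraction of flows covering the whole window."""
    clipped, (lo, hi) = window_flows(flows, fraction)
    durations = [e - s for s, e in clipped]
    covering = sum(1 for s, e in clipped if s <= lo and e >= hi)
    return {
        "n": len(clipped),
        "mean": statistics.mean(durations),
        "median": statistics.median(durations),
        "frac_covering": covering / len(clipped),
    }
```

Calling window_summary(flows, 0.8) on the list of (first time, last time) pairs gives the quantities discussed for the 80% window; smaller fractions give the narrower views.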
 
 
 


Mice and Elephants View (cont.)






Mitigation of boundary effects:  narrower time window (cont.)
 
 

3.  10% of full time span
 

    -    many more long lines
 

    -    reduced sample from 1000 to 500
 

    -    just so one could see something (otherwise "too dark")
 

    -    Mean and Median both much larger (more long lines)
 

    -    20% cover full window (~3.5 min.)
 

    -    "Length Biased" sampling effect?
 
 
 


Mice and Elephants View (cont.)






Mitigation of boundary effects:  narrower time window (cont.)
 
 

4.  1% of full time span
 

    -    above effects all much stronger
 

    -    > 80% cover full window!  (< 0.5 min.)
 

    -    for decent visual effect, reduced sample to only 200
 

    -    lines randomly placed on equally spaced vertical grid

(otherwise "Poisson clumping" is visually distracting)

(this was done in above plots as well)





    -    Recall full sample median was 0.5 sec.
 

    -    Clear "length biased" sampling effect!
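
The effect is easy to reproduce in a toy simulation. The sketch below is not the UNC data: it assumes uniform arrival times, Pareto 1.5 durations, and a short window in the middle of the time span, with all numerical settings chosen for illustration. It compares the median duration of all flows with the median (untruncated) duration of the flows that touch the window.

```python
import random
import statistics

def length_bias_demo(n=200000, span=1000.0, window=0.5, seed=0):
    """Return (median of all durations, median of durations seen in window, count seen)."""
    rng = random.Random(seed)
    lo = span / 2.0
    hi = lo + window
    all_durations = []
    seen_durations = []
    for _ in range(n):
        start = rng.uniform(0.0, span)
        duration = rng.paretovariate(1.5)
        all_durations.append(duration)
        # a flow is "seen" if it overlaps the window at all
        if start <= hi and start + duration >= lo:
            seen_durations.append(duration)
    return (statistics.median(all_durations),
            statistics.median(seen_durations),
            len(seen_durations))

if __name__ == "__main__":
    print(length_bias_demo())
```

Long flows are more likely to overlap a short window, so the flows seen in the window should have a clearly larger median than the full population.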
 
 
 


Mice and Elephants View (cont.)






Length Biased sampling:
 

Classical Reference:

Daniels, H. E. (1942) A new technique for the analysis of fibre length distribution in wool, J. Text. Inst., 33, 1209-1211.
 
 

Background:  sampling from a basket of fibers

    -    long fibers more likely to be drawn

    -    creates bias in "population of lengths"

    -    bias can be precisely calculated

    -    thus can suitably adjust
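
For reference, the calculation takes the following standard form under the idealization that an item is drawn with probability proportional to its length. The symbols f, mu and f^* are notation introduced here, not taken from the notes or from the reference above.

```latex
% f   : density of lengths in the population, with finite mean \mu = \int_0^\infty x f(x)\,dx
% f^* : density of the lengths actually observed under length-biased sampling
\[
  f^*(x) = \frac{x\, f(x)}{\mu}, \qquad x > 0 .
\]
% Adjustment: reweight each observed length by 1/x
\[
  f(x) = \frac{\mu\, f^*(x)}{x},
  \qquad
  \frac{1}{\mu} = \int_0^\infty \frac{f^*(x)}{x}\,dx = E^*\!\left[\frac{1}{X}\right].
\]
```

This form requires a finite population mean, so it needs care when the durations are very heavy tailed.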
 
 

Open Problem:  Use "length biased" sampling and "truncated data" ideas:

    -    to explore correctness of 80% window view

    -    to correctly modify smaller window views

    -    to find "best view" for mice and elephants plots
 
 

Variation also involved:  "censored" and "truncated" sampling

Good reference:
Turnbull, B. W. (1976) The Empirical Distribution Function with Arbitrarily Grouped, Censored and Truncated Data, Journal of the Royal Statistical Society, Series B, 38, 290-295.
 
 

Not pursued more deeply for now
 
 
 
 


Mice and Elephants View (cont.)







Which distributions "fit"?

Kernel Density Estimation graphic






View 1:  SiZer analysis

    -    ordinary scale is useless

    -    elephants so big, that many mice get obscured
 

View 2:  log SiZer analysis

    -    much more useful scale

    -    large percentage at min (1 packet flows)

    -    recall:    median ~ 0.5 sec,    mean ~ 100 sec

    -    large clump approximately Gaussian?

    -    overall, mixture of 4 Gaussians??

    -    recall 1.3% cover full window, i.e. "at max"

    -    many "small significant bumps"
 
 
 


Mice and Elephants View (cont.)






Which distributions "fit"?   (cont.)

Pareto Q-Q graphic






    -    "many at min" is vertical line at bottom

    -    "window upper bound" is horizontal line at top

    -    fit at 0.8 and 0.9 quantiles, since boundary effects
         (windowing and length biasedness) drive larger quantiles

    -    suggests shape parameter < 1  (infinite mean)

    -    but very slippery, since fit overall "looks pretty bad"

    -    true distribution is clearly a "mixture"

(of at least 3, SiZer analysis suggested 4)
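
Fitting at two quantiles can be sketched as below. The parameterization is an assumption made here: a Pareto law with survival function (x / x_min)^(-alpha) for x >= x_min, so that log(1 - p) is linear in the log of the p-th quantile. The function names and the simple order-statistic quantile are illustrative choices, not the method used for the graphic.

```python
import math

def empirical_quantile(data, p):
    """Simple order-statistic quantile: smallest value with at least a fraction p at or below it."""
    ordered = sorted(data)
    idx = min(len(ordered) - 1, max(0, math.ceil(p * len(ordered)) - 1))
    return ordered[idx]

def fit_pareto_two_quantiles(data, p1=0.8, p2=0.9):
    """Match the Pareto quantile function to the data at probabilities p1 < p2.

    Returns (alpha, x_min). Both matched quantiles must be positive and distinct.
    """
    q1 = empirical_quantile(data, p1)
    q2 = empirical_quantile(data, p2)
    # slope of log survival probability against log quantile is -alpha
    alpha = (math.log(1.0 - p1) - math.log(1.0 - p2)) / (math.log(q2) - math.log(q1))
    x_min = q1 * (1.0 - p1) ** (1.0 / alpha)
    return alpha, x_min
```

A fitted alpha below 1 corresponds to the infinite mean case. Zero durations (1 packet flows) do no harm as long as they lie below the 0.8 quantile.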








Mice and Elephants View (cont.)







Which distributions "fit"?   (cont.)

log Normal Q-Q graphic






    -    again "many at min" is vertical line at bottom

    -    again "window upper bound" is horizontal line at top

    -    again fit at 0.8 and 0.9 quantiles,
         because boundary effects drive larger quantiles

    -    fit distribution has all moments finite (unlike the Pareto fit with shape parameter < 1)

    -    again fit overall "looks pretty bad" (comparable to Pareto)

    -    so inference is very unreliable

    -    true distribution is clearly a "mixture" (of at least 3 or 4)
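
The same two-quantile matching for the log Normal can be sketched as follows, using the fact that the log of the p-th quantile is mu + sigma * z_p, where z_p is the standard Normal quantile. The function names are hypothetical, and the quantile values q1 and q2 are taken as inputs, however they were estimated.

```python
import math
from statistics import NormalDist

def fit_lognormal_from_quantiles(q1, q2, p1=0.8, p2=0.9):
    """Return (mu, sigma) of the log Normal whose p1 and p2 quantiles are q1 and q2."""
    z1 = NormalDist().inv_cdf(p1)
    z2 = NormalDist().inv_cdf(p2)
    sigma = (math.log(q2) - math.log(q1)) / (z2 - z1)
    mu = math.log(q1) - sigma * z1
    return mu, sigma

def lognormal_mean_median(mu, sigma):
    """Mean and median implied by the fit; the mean is finite for every sigma."""
    return math.exp(mu + 0.5 * sigma ** 2), math.exp(mu)
```

Comparing the implied mean and median with the sample values is a quick check on how far the upper-quantile fit is from the bulk of the data.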
 
 
 


Mice and Elephants View (cont.)






Mice and Elephants simulated from fit distributions:
 
 

1.    Off Peak  (recall original 80% window plot)

Use arrival times from real data






Durations simulated from Exponential (same mean)

    -    still way off

    -    too few mice and too few elephants
 

Durations simulated from fit Pareto

    -    looks "much more like real data"?

    -    mean bigger than real data by factor of 60?

    -    median bigger than real data by factor of 12?

    -    clearly not "all that close" in distribution (as seen in Q-Q)
 

Durations simulated from fit log Normal

    -    looks better than Exponential

    -    but not as good as Pareto?

    -    not enough elephants?

    -    but mean and median are closer to truth

    -    again distributions not all that close (saw from Q-Q)
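
The simulated versions can be generated along these lines: keep the real arrival times and attach a freshly drawn duration to each. This is a sketch only. The parameter names (mean, alpha, x_min, mu, sigma) stand for whatever the fits produced, the Pareto parameterization is the assumed one with minimum x_min, and the notes do not say whether simulated durations were truncated at the window, so no truncation is done here.

```python
import random

def make_samplers(mean, alpha, x_min, mu, sigma, seed=0):
    """Return duration samplers for the three candidate laws (all parameters assumed given)."""
    rng = random.Random(seed)
    return {
        # Exponential with the same mean as the data
        "exponential": lambda: rng.expovariate(1.0 / mean),
        # Pareto by inverse CDF; 1 - random() lies in (0, 1], so no division by zero
        "pareto": lambda: x_min * (1.0 - rng.random()) ** (-1.0 / alpha),
        # log Normal with the fitted log-scale parameters
        "lognormal": lambda: rng.lognormvariate(mu, sigma),
    }

def simulate_flows(arrivals, draw_duration):
    """Attach an independent simulated duration to each real arrival time."""
    return [(t, t + draw_duration()) for t in arrivals]
```

For example, simulate_flows(arrivals, make_samplers(...)["pareto"]) gives (start, end) pairs that can be windowed and plotted exactly like the real flows.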
 
 
 


Mice and Elephants View (cont.)







Mice and Elephants simulated from fit distributions (cont.):
 
 

2.    Peak time (above was "off peak"):
 

First study 80% windowed version
 

    -    mean is smaller than Off Peak

(recall smaller time span)





    -    median is larger

(length biased sampling???)






Durations simulated from fit Pareto

    -    looks not bad
 

Durations simulated from fit log Normal

    -    this time looks better than Pareto?

    -    number of elephants seems closer??

    -    mean is closer?
 
 
 

Caution:  dangerous to draw conclusions from "visual effect"
 
 
 


Mice and Elephants View (cont.)






Revisit issue:  what is an IP "flow" (connection)?
 

Above definition:

an "IP flow" is a set of packets with
same sending and receiving IP addresses





Above noted weakness:

multiple visits to server are combined
e.g. several web pages from same site





Potential problem:  could be "long lags" that interrupt:

Heavy tailed durations  =>  Long Range Dependence








Mice and Elephants View (cont.)






More protocol background:
 

a.    all packets are IP (Internet Protocol)
 

b.    subsets of IP include TCP, UDP, ...
 

c.    TCP (Transmission Control Protocol) packets

    -    are "acknowledged", to get "certified transmission"

    -    thus involve loss recovery mechanisms

    -    includes HTTP, FTP, Telnet, SMTP, Napster,...

    -    ~80% of current traffic
 

d.    UDP (User Datagram Protocol) packets are "sent out only",

    -    i.e. loss is ignored

    -    includes streaming music, video, ...

    -    most of rest of traffic
 

e.    but there are other packets

    -    not assignable to flows by IP address
 

f.    above analysis includes both TCP and UDP flows