Class Notes 9/26/01
Last Time
- finished SiZer background
- zooming SiZer analysis
- showed aggregated data not Homogeneous Poisson
- different from aggregation conclusions of Cleveland, et al.
Mice and Elephants View
In context of:
Heavy tailed durations => Long Range Dependence
Earlier graphical "toy views":
Showed how "few long connections"
(i.e. heavy tailed duration distributions)
might induce Long Range Dependence
Mice and Elephants View
Basis for "real data" view:
- 5 million IP packets (all packets: TCP & UDP & ...)
- gathered at UNC main link, during:
- peak time - weekday afternoon (< 5 min.)
- off-peak time - Sunday morning (~40 min)
- during April, 2000
- with:
- time stamps
- packet sizes
- index of sending and receiving IP addresses
Mice and Elephants View (cont.)
Data Summarization:
define "IP flow" (connection) as set of packets with same index
- recall toy graphic (same index = same color)
- keep only first packet time and last packet time
Weakness:
multiple visits to server are combined
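The summarization step above can be sketched in Python. This is only an illustration, not the code used for the analysis: the record layout (time stamp, packet size, index) and the name summarize_flows are assumptions, and the packet values are made up.

```python
def summarize_flows(packets):
    """Reduce packets to flows: keep first and last packet time per index.

    packets: iterable of (time_stamp, packet_size, pair_index) tuples,
    where pair_index identifies a (sending, receiving) address pair.
    The tuple layout is an assumption made for this sketch.
    Returns {pair_index: (first_time, last_time)}.
    """
    flows = {}
    for t, _size, idx in packets:
        if idx in flows:
            first, last = flows[idx]
            flows[idx] = (min(first, t), max(last, t))
        else:
            flows[idx] = (t, t)
    return flows

# Toy packet trace (made-up numbers, for illustration only)
packets = [
    (0.0, 40, 1), (0.2, 1500, 1), (0.3, 40, 2),
    (5.0, 1500, 1), (7.5, 576, 3),
]
flows = summarize_flows(packets)
durations = {idx: last - first for idx, (first, last) in flows.items()}
```

Note that a one-packet flow gets duration 0 under this definition, and that two separate visits between the same pair of addresses end up in a single flow.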
Graphic (same as "simulated durations" above):
- represent as horizontal line (over time)
- vertical height: random, to separate lines
- random sample of 1000 (full 142170 is too much)
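A possible layout step for this graphic, as a rough sketch: draw a random subsample of flows and give each one a random height. The heights here are a random permutation of an equally spaced grid, which keeps lines from clumping. The function name layout_segments and the seed are assumptions for illustration; the actual plotting call is left out.

```python
import random

def layout_segments(flows, n_sample=1000, seed=0):
    """Pick a random subset of flows and assign each a random vertical height.

    flows: list of (first_time, last_time) pairs.
    Returns a list of (first_time, last_time, height), heights in (0, 1).
    """
    rng = random.Random(seed)
    k = min(n_sample, len(flows))
    sample = rng.sample(flows, k)
    # Equally spaced grid of heights, assigned to the lines in random order.
    heights = [(i + 0.5) / k for i in range(k)]
    rng.shuffle(heights)
    return [(a, b, h) for (a, b), h in zip(sample, heights)]

# Made-up flows, for illustration only
segments = layout_segments([(0.0, 1.0), (2.0, 5.0), (3.0, 3.0)], n_sample=2)
```

Each returned triple would then be drawn as a horizontal segment from first_time to last_time at the given height.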
Mice and Elephants View (cont.)
Peak time graphic
- Looks more like Pareto 1.5 than Exponential?
- Boundary effects?
- Mean much larger than median
Off-Peak graphic
- Similar (but stronger) Mice and Elephants effects
- careful about much longer time span
(needed to gather 5 million packets)
Mice and Elephants View (cont.)
Mitigation of boundary effects: narrower time window
- study for Off Peak data
- different randomization
- now see flows overlapping edge (as expected)
- "right number" overlapping??
- mean connection is smaller (since smaller window)
- median connection is larger?!?!
(must have eliminated more mice than elephants)
- 1.3% cover full time window
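One plausible reading of the windowing step, sketched in Python: drop flows that miss the window, clip the rest at the window edges, and count how many cover the whole window. The notes do not say exactly how the window was applied, so the clipping rule and the name window_flows are assumptions.

```python
def window_flows(flows, w_start, w_end):
    """Clip flows to the window [w_start, w_end].

    flows: list of (first_time, last_time) pairs.
    Flows with no overlap are dropped; the rest are truncated at the edges.
    Returns (clipped, frac_full), where frac_full is the fraction of
    retained flows that cover the entire window.
    """
    clipped = []
    n_full = 0
    for first, last in flows:
        if last < w_start or first > w_end:
            continue  # entirely outside the window
        clipped.append((max(first, w_start), min(last, w_end)))
        if first <= w_start and last >= w_end:
            n_full += 1
    frac_full = n_full / len(clipped) if clipped else 0.0
    return clipped, frac_full

# Made-up flows: one elephant and four mice
flows = [(0, 100), (10, 12), (40, 41), (45, 60), (90, 95)]
clipped, frac_full = window_flows(flows, 40, 50)
```

In this toy case the elephant survives any choice of window while most mice fall outside it, which is the mechanism behind the larger median.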
Mice and Elephants View (cont.)
Mitigation of boundary effects: narrower time window (cont.)
- many more long lines
- reduced sample from 1000 to 500
- just so could see something (otherwise "too dark")
- Mean and Median both much larger (more long lines)
- 20% cover full window (~3.5 min.)
- "Length Biased" sampling effect?
Mice and Elephants View (cont.)
Mitigation of boundary effects: narrower time window (cont.)
- above effects all much stronger
- > 80% cover full window! (< 0.5 min.)
- for decent visual effect, reduced sample to only 200
- lines randomly placed on equally spaced vertical grid
(otherwise "Poisson clumping" is visually distracting)
(this was done in above plots as well)
- Recall full sample median was 0.5 sec.
- Clear "length biased" sampling effect!
Mice and Elephants View (cont.)
Length Biased sampling:
Classical Reference:
Daniels, H. E. (1942) A new technique for the analysis of fibre length distribution
in wool, J. Text. Inst., 33, 1209-1211.
Background: sampling from a basket of fibers
- long fibers more likely to be drawn
- creates bias in "population of lengths"
- bias can be precisely calculated
- thus can suitably adjust
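The standard form of this calculation, stated here as a sketch: assume a fiber is drawn with probability proportional to its length, and that the length density f has finite mean. The symbols f, f* and mu are notation introduced for this illustration, not taken from the notes.

```latex
% Density of the lengths actually drawn (length-biased density):
f^{*}(x) = \frac{x\, f(x)}{\mu},
\qquad \mu = \int_0^\infty x\, f(x)\, dx .

% Adjustment: invert the weighting to recover the population density,
% and recover the population mean from the harmonic mean of the sample.
f(x) = \frac{\mu\, f^{*}(x)}{x},
\qquad
\frac{1}{\mu} = \int_0^\infty \frac{f^{*}(x)}{x}\, dx .
```

The finite-mean assumption matters: for a heavy tailed f with infinite mean, this normalization is not available as written.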
Open Problem: Use "length biased" sampling and "truncated data" ideas:
- to explore correctness of 80% window view
- to correctly modify smaller window views
- to find "best view" for mice and elephants plots
Variation also involved: "censored" and "truncated" sampling
Good reference:
Turnbull, B. W. (1976) The Empirical Distribution Function with Arbitrarily Grouped,
Censored and Truncated Data, Journal of the Royal Statistical Society,
Series B, 38, 290-295.
Not pursued more deeply for now
Mice and Elephants View (cont.)
Which distributions "fit"?
Kernel Density Estimation graphic
View 1: SiZer analysis
- ordinary scale is useless
- elephants so big that many mice get obscured
View 2: log SiZer analysis
- much more useful scale
- large percentage at min (1 packet flows)
- recall: median ~ 0.5 sec, mean ~ 100 sec
- large clump approximately Gaussian?
- overall, mixture of 4 Gaussians??
- recall 1.3% cover full window, i.e. "at max"
- many "small significant bumps"
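To see why the log scale helps, here is a minimal kernel density estimate of log10 durations at a single bandwidth. It is not SiZer, which looks across a whole family of bandwidths and flags statistically significant slopes; the function name gaussian_kde, the bandwidth, and the toy durations are all assumptions.

```python
import math

def gaussian_kde(data, grid, bandwidth):
    """Gaussian kernel density estimate of `data`, evaluated on `grid`."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - x) / bandwidth) ** 2) for x in data)
        for g in grid
    ]

# Toy durations in seconds (made-up, spanning several orders of magnitude)
durations = [0.01, 0.02, 0.5, 0.6, 0.7, 3.0, 40.0, 900.0]
log_durations = [math.log10(d) for d in durations]

grid = [-3.0 + 0.1 * i for i in range(71)]   # log10 seconds, from -3 to 4
density = gaussian_kde(log_durations, grid, bandwidth=0.5)
```

Flows with duration 0 (one packet) need some floor before taking logs; how that was handled in the graphic is not stated in these notes.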
Mice and Elephants View (cont.)
Which distributions "fit"? (cont.)
Pareto Q-Q graphic
- "many at min" is vertical line at bottom
- "window upper bound" is horizontal line at top
- fit at 0.8 and 0.9 quantiles, since boundary effects
- suggests shape parameter < 1 (infinite mean)
- but very slippery, since fit overall "looks pretty bad"
- true distribution is clearly a "mixture"
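Fitting at two quantiles can be sketched as follows, assuming the classical Pareto form with quantile function Q(p) = x_min * (1 - p)^(-1/alpha). Which parameterization was used in the graphic is not stated, and the function name and the input quantiles below are hypothetical.

```python
import math

def fit_pareto_two_quantiles(q1, q2, p1=0.8, p2=0.9):
    """Match a Pareto(alpha, x_min) to two quantiles.

    Uses the Pareto quantile function Q(p) = x_min * (1 - p) ** (-1 / alpha).
    q1, q2: empirical quantiles at probabilities p1 < p2 (need q2 > q1 > 0).
    Returns (alpha, x_min).
    """
    alpha = math.log((1.0 - p1) / (1.0 - p2)) / math.log(q2 / q1)
    x_min = q1 * (1.0 - p1) ** (1.0 / alpha)
    return alpha, x_min

# Hypothetical quantiles in seconds (not taken from the UNC data)
alpha, x_min = fit_pareto_two_quantiles(q1=10.0, q2=25.0)
```

With these made-up inputs the fitted shape comes out below 1, the infinite-mean case mentioned above. Because only two points are matched, the result says nothing about how well the rest of the distribution fits.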
Mice and Elephants View (cont.)
Which distributions "fit"? (cont.)
log Normal Q-Q graphic
- again "many at min" is vertical line at bottom
- again "window upper bound" is horizontal line at top
- again fit at 0.8 and 0.9 quantiles
- fit distribution has all moments finite.
- again fit overall "looks pretty bad" (comparable to Pareto)
- so inference is very unreliable
- true distribution is clearly a "mixture" (of at least 3 or 4)
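The same two-quantile matching for the log Normal, as a sketch: on the log scale the quantile function is mu + sigma * z_p, so two quantiles determine mu and sigma. The function name and the input quantiles are hypothetical.

```python
import math
from statistics import NormalDist

def fit_lognormal_two_quantiles(q1, q2, p1=0.8, p2=0.9):
    """Match a log Normal(mu, sigma) to two quantiles.

    Uses log Q(p) = mu + sigma * z_p, with z_p the standard Normal quantile.
    Returns (mu, sigma) on the natural-log scale.
    """
    z1 = NormalDist().inv_cdf(p1)
    z2 = NormalDist().inv_cdf(p2)
    sigma = (math.log(q2) - math.log(q1)) / (z2 - z1)
    mu = math.log(q1) - sigma * z1
    return mu, sigma

# Hypothetical quantiles in seconds (not taken from the UNC data)
mu, sigma = fit_lognormal_two_quantiles(q1=10.0, q2=25.0)
fitted_mean = math.exp(mu + 0.5 * sigma ** 2)   # finite for any finite sigma
```

Given the same two quantiles, the Pareto fit can have infinite mean while the log Normal fit always has finite moments, so the two fits can imply very different tail behavior.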
Mice and Elephants View (cont.)
Mice and Elephants simulated from fit distributions:
1. Off Peak (recall original 80% window plot)
Use arrival times from real data
Durations simulated from Exponential (same mean)
- still way off
- too few mice and too few elephants
Durations simulated from fit Pareto
- looks "much more like real data"?
- mean bigger by factor of 60?
- median bigger by factor of 12?
- clearly not "all that close" in distribution (as seen in Q-Q)
Durations simulated from fit log Normal
- looks better than Exponential
- but not as good as Pareto?
- not enough elephants?
- but mean and median are closer to truth
- again distributions not all that close (saw from Q-Q)
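The simulation recipe above (real arrival times, simulated durations) might look like this in Python. The arrival times and parameter values are placeholders, not the fitted values from the data, and simulate_flows is a name made up for this sketch.

```python
import random

def simulate_flows(arrivals, dist, rng, **params):
    """Attach a simulated duration to each observed arrival time.

    Returns a list of (first_time, last_time) pairs.
    """
    flows = []
    for t in arrivals:
        if dist == "exponential":
            d = rng.expovariate(1.0 / params["mean"])
        elif dist == "pareto":
            # paretovariate has minimum 1, so rescale by x_min
            d = params["x_min"] * rng.paretovariate(params["alpha"])
        elif dist == "lognormal":
            d = rng.lognormvariate(params["mu"], params["sigma"])
        else:
            raise ValueError(f"unknown distribution: {dist}")
        flows.append((t, t + d))
    return flows

rng = random.Random(2001)
arrivals = [0.0, 0.4, 1.1, 1.5, 3.2]   # stand-in for real arrival times
exp_flows = simulate_flows(arrivals, "exponential", rng, mean=100.0)
par_flows = simulate_flows(arrivals, "pareto", rng, alpha=0.76, x_min=1.2)
logn_flows = simulate_flows(arrivals, "lognormal", rng, mu=0.55, sigma=2.08)
```

For a fair visual comparison, the simulated flows would presumably be windowed and subsampled in the same way as the real ones.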
Mice and Elephants View (cont.)
Mice and Elephants simulated from fit distributions (cont.):
2. Peak time (above was "off peak"):
First study 80% windowed version
- mean is smaller than Off Peak
(recall smaller time span)
- median is larger
(length biased sampling???)
Durations simulated from fit Pareto
- looks not bad
Durations simulated from fit log Normal
- this time looks better than Pareto?
- number of elephants seems closer??
- mean is closer?
Caution:
dangerous to draw conclusions from "visual effect"
Mice and Elephants View (cont.)
Revisit issue: what is an IP "flow" (connection)?
Above definition:
an "IP flow" is a set of packets with same sending and receiving IP addresses
Above noted weakness:
multiple visits to server are combined
e.g. several web pages from same site
Potential problem: could be "long lags" that interrupt:
Heavy tailed durations => Long Range Dependence
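One simple way to look for such long lags, offered as an illustrative diagnostic rather than something done in the analysis above: compute the largest silent gap between consecutive packets within each flow. The record layout and the name largest_gaps are assumptions.

```python
def largest_gaps(packets):
    """For each flow index, return the largest gap between consecutive packets.

    packets: iterable of (time_stamp, pair_index) tuples (layout assumed here).
    Flows with a single packet get a gap of 0.0.
    """
    times = {}
    for t, idx in packets:
        times.setdefault(idx, []).append(t)
    gaps = {}
    for idx, ts in times.items():
        ts.sort()
        gaps[idx] = max((b - a for a, b in zip(ts, ts[1:])), default=0.0)
    return gaps

# Made-up trace: flow 1 is really two short visits about 300 sec apart
packets = [(0.0, 1), (0.2, 1), (300.0, 1), (300.5, 1),
           (1.0, 2), (1.5, 2), (4.0, 3)]
gaps = largest_gaps(packets)
```

A flow whose duration is almost entirely one gap is a candidate for "several visits combined", and so not a genuine elephant.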
Mice and Elephants View (cont.)
More protocol background:
a. all packets are IP (Internet Protocol)
b. subsets of IP include TCP, UDP, ...
c. TCP (Transmission Control Protocol) packets
- are "acknowledged", to get "certified transmission"
- thus involve loss recovery mechanisms
- includes HTTP, FTP, Telnet, SMTP, Napster,...
- ~80% of current traffic
d. UDP (User Datagram Protocol) packets are "sent out only",
- i.e. loss is ignored
- includes streaming music, video, ...
- most of rest of traffic
e. but there are other packets
- not assignable to flows by IP address
f. above analysis includes both TCP and UDP flows
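A traffic breakdown like the one in items c to e could be tabulated from the protocol field of each IP header, where TCP is protocol number 6 and UDP is 17. The sketch below computes shares by bytes; the notes do not say whether "~80%" refers to bytes or packets, and the record layout, function name and toy values are assumptions.

```python
from collections import Counter

PROTOCOL_NAMES = {6: "TCP", 17: "UDP"}   # IP protocol numbers

def protocol_shares(packets):
    """Share of total bytes per protocol.

    packets: iterable of (protocol_number, packet_size) tuples.
    Anything other than TCP or UDP is lumped into "other".
    """
    byte_counts = Counter()
    for proto, size in packets:
        byte_counts[PROTOCOL_NAMES.get(proto, "other")] += size
    total = sum(byte_counts.values())
    return {name: b / total for name, b in byte_counts.items()}

# Made-up packets, for illustration only
packets = [(6, 1500), (6, 1500), (6, 1000), (17, 500), (1, 500)]
shares = protocol_shares(packets)
```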