Statistical Analysis and Modelling of Internet Traffic Data
Course Meetings:
Time: Mon. - Wed. 8:40 - 9:55
Room: Rhodes 471
Course Web Site:
???
Instructor: J. S. (Steve) Marron
Office: Rhodes ???
Office Hours:
Mon. 10 - 11, Tuesday 11 - 12
Phone: (607) 255-9147
Email: marron@stat.unc.edu
Course Work / Grading
S/U: based on a presentation
Full Grade: based on a presentation, plus data analysis project
Presentations:
- can be either a paper by others (you choose, or I suggest)
- or your own work
Course Goals:
- Explore Internet Traffic from several viewpoints
- Highlight interesting open problems
- Promote possible joint research
- Maximize understanding by all class members
Fun Aspects of Internet Traffic:
- Challenging topic
- Will stretch your thoughts and pre-conceptions
- Lots of controversy (about even "basic" notions)
- No models provide a "good fit" (yet???)
- Loads of data
- Interesting phenomena across wide range of "scales"
- "Land of Opportunity" for many types of researchers
What this course is not about:
- Web page content
- Surveys of browsing habits
- Network intrusion / security
- Marketing strategies
Big Picture View of Internet Traffic
(will generally do details, e.g. discuss "TCP/IP", only when needed)
Gigantic (worldwide) Communications Networks:
- Telephone Network
- Internet
Both based on “connection” between 2 points
Fundamental Difference, I
Manner of equipment usage:
- Telephone: each connection has sole use
(of ~2 wires)
Congestion => no connection
- Internet: all connections share resources
(transmissions split into small “packets”)
(packet size <= 1500 bytes)
Congestion => packet loss & delays
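The packet-splitting idea can be sketched in a few lines. This is a toy illustration only: the function name `split_into_packets` and the 4000-byte transfer are made up for the example, and real TCP/IP packets also carry headers inside the size limit.

```python
def split_into_packets(data: bytes, max_size: int = 1500) -> list[bytes]:
    """Split one transmission into consecutive chunks of at most max_size bytes."""
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    return [data[i:i + max_size] for i in range(0, len(data), max_size)]


transmission = bytes(4000)  # a hypothetical 4000-byte transfer (all zero bytes)
packets = split_into_packets(transmission)
packet_sizes = [len(p) for p in packets]
print(packet_sizes)  # two full packets and one partial packet
```

Because each packet travels on shared equipment, congestion shows up as some of these chunks being lost or delayed, not as a refused connection.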
Fundamental Difference, II
Distribution of duration of connections:
- Telephone: exponential distribution
(say something & how long can hold phone?)
- Internet: heavy tailed distributions???
(current controversy, let’s study)
(very long and very short connections!)
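To see what "very long and very short" means, here is a small simulation sketch comparing exponential durations with heavy-tailed (Pareto) durations that have the same mean. The mean of 3.0, the shape 1.5, and the sample size are assumptions chosen for illustration, not values from the data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
mean_duration = 3.0  # hypothetical common mean for both models

# Exponential durations (telephone-style model).
expo = rng.exponential(scale=mean_duration, size=n)

# Pareto durations via the inverse CDF:
# P(X > x) = (x_min / x)**alpha for x >= x_min, mean = alpha * x_min / (alpha - 1).
alpha = 1.5
x_min = mean_duration * (alpha - 1) / alpha  # chosen so both means agree
u = rng.random(n)                            # uniform on [0, 1)
pareto = x_min * (1.0 - u) ** (-1.0 / alpha)

for name, x in [("exponential", expo), ("Pareto(1.5)", pareto)]:
    print(f"{name:12s} median={np.median(x):8.2f}  "
          f"99.9% quantile={np.quantile(x, 0.999):10.2f}  max={x.max():12.2f}")
```

With equal means, the heavy-tailed sample should show a smaller median (many short connections) together with far larger extremes (a few very long ones).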
Fundamental Difference, III
Mathematical Models:
- Telephone: queueing theory
Poisson arrivals, exponential durations
- Internet: ??? (controversial, let’s study)
Early ideas: queueing theory as an approx.
More recent: not appropriate (heavy tails +)
This year: Poisson OK at main link?
Tails truly heavy?
Partial Explanation of Differing Views
Depends on “where measuring is done”
Source of most data considered here:
“Tap” on Main Link at UNC
- Heavy traffic both directions
- 35,000 web browsers
- Sunsite (mirror site for large databases)
- 1998 peak traffic: ~3 minutes for 1 mil. packets
- 2001 peak traffic: ~1 minute for 1 mil. packets
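The two peak figures translate into packet rates by simple arithmetic. The helper name `packets_per_second` is just for this sketch, and the inputs are the rounded values quoted above.

```python
PACKETS = 1_000_000


def packets_per_second(minutes: float, packets: int = PACKETS) -> float:
    """Average packet rate when `packets` packets arrive in `minutes` minutes."""
    return packets / (minutes * 60.0)


rate_1998 = packets_per_second(3)  # ~3 minutes per million packets
rate_2001 = packets_per_second(1)  # ~1 minute per million packets
ratio = rate_2001 / rate_1998
print(round(rate_1998), round(rate_2001), round(ratio, 2))
```

So the quoted peaks correspond to roughly 5,600 packets per second in 1998 and roughly 16,700 packets per second in 2001, about a threefold increase.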
Data "raw material"
Sequence of Packet Headers, with info, such as:
- Arrival Time
- Source & Destination addresses
- Packet Type (request, data, acknowledgement, …)
- Packet Size (40 – 1500 bytes)
- Sequence number
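One possible way to hold such header records in code is sketched below. The class name, field names, time unit, and the three-packet trace are all hypothetical; the actual format of the UNC trace files is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PacketHeader:
    """One packet header record; field names are illustrative only."""
    arrival_time: float   # seconds from start of trace (assumed unit)
    source: str
    destination: str
    packet_type: str      # e.g. "request", "data", "acknowledgement"
    size_bytes: int       # 40 to 1500
    sequence_number: int

    def __post_init__(self) -> None:
        if not 40 <= self.size_bytes <= 1500:
            raise ValueError(f"unexpected packet size: {self.size_bytes}")


# A made-up three-packet trace, using documentation-only IP addresses.
trace = [
    PacketHeader(0.000, "192.0.2.1", "198.51.100.7", "request", 40, 1),
    PacketHeader(0.012, "198.51.100.7", "192.0.2.1", "data", 1500, 1),
    PacketHeader(0.013, "198.51.100.7", "192.0.2.1", "data", 620, 2),
]
response_bytes = sum(h.size_bytes for h in trace if h.packet_type == "data")
print(response_bytes)
```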
Data extraction:
Heavy “database filtering”, by UNC Comp. Sci. folks,
Jeffay, Smith, Ott, Hernandez, Long
Investigation I: Heavy Tails?
Data Set 1: 734,814 HTTP response sizes
(file sizes of browser “requests” (~clicks))
Illustrative graphic: studies “total size of lines”
I.e. “each line is one data point, so ignore packets for now”
Note: “file sizes being used” vs. “file sizes residing”
Distributional shape?
SiZer analysis 1: “smooth histogram” of raw data
Investigation I: Heavy Tails? (cont.)
SiZer analysis 1: raw data (previous graphic)
- everything “at origin”
- heavy tails lead to very few, very big data points
- fashionable terminology: many mice, & few elephants
Better Visualization: use log scale
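The effect of the log scale can be checked on simulated data. The sketch below uses made-up heavy-tailed "sizes" (Pareto, shape 1.2, minimum 100 bytes, all assumptions) in place of the real HTTP responses, and compares 10-bin histograms on the raw and log10 scales.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical heavy-tailed sizes in bytes: Pareto, shape 1.2, minimum 100.
sizes = 100.0 * (1.0 - rng.random(50_000)) ** (-1.0 / 1.2)

raw_counts, _ = np.histogram(sizes, bins=10)
log_counts, _ = np.histogram(np.log10(sizes), bins=10)

raw_first_share = raw_counts[0] / sizes.size
log_max_share = log_counts.max() / sizes.size
print("share of data in first raw bin:   ", raw_first_share)
print("largest share in any log10 bin:   ", log_max_share)
```

On the raw scale nearly all points fall in the first bin ("everything at origin"), while on the log10 scale the mass spreads across bins and the shape becomes visible.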
Investigation I: Heavy Tails? (cont.)
SiZer analysis 2: “smooth histogram” of log data
Movie version of SiZer for log data
- most between 100 & 10,000
- statistically significant “bumps” (banner ads, …)
- not a common, named distribution
- heavy tail? (histograms very poor at this)
SiZer Visual presentation:
Color map over family of smooth histograms:
- Blue: slope significantly upwards (deriv. CI above 0)
- Red: slope significantly downwards (deriv. CI below 0)
- Purple: slope insignificant (deriv. CI contains 0)
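The color rule can be sketched as follows. This is a simplified stand-in, not the SiZer software itself: it assumes a Gaussian kernel and a pointwise normal quantile (z = 1.96), whereas the published method treats the multiple-comparison issue more carefully. The function name `sizer_map` is made up for this sketch.

```python
import numpy as np


def sizer_map(data, grid, bandwidths, z=1.96):
    """Classify the slope of Gaussian kernel density estimates.

    Returns an array of shape (len(bandwidths), len(grid)) with entries
    "blue" (derivative CI above 0), "red" (below 0) or "purple" (contains 0).
    """
    data = np.asarray(data, dtype=float)
    grid = np.asarray(grid, dtype=float)
    n = data.size
    out = np.empty((len(bandwidths), grid.size), dtype=object)
    for i, h in enumerate(bandwidths):
        u = (grid[:, None] - data[None, :]) / h  # shape (len(grid), n)
        # Derivative in x of the kernel (1/h) * phi((x - X_j) / h).
        terms = -u * np.exp(-0.5 * u**2) / (np.sqrt(2.0 * np.pi) * h**2)
        deriv = terms.mean(axis=1)                   # estimated density slope
        se = terms.std(axis=1, ddof=1) / np.sqrt(n)  # its standard error
        out[i] = np.where(deriv - z * se > 0, "blue",
                          np.where(deriv + z * se < 0, "red", "purple"))
    return out


rng = np.random.default_rng(2)
sample = rng.normal(size=2000)  # toy data with a single bump at 0
colors = sizer_map(sample, grid=[-1.0, 1.0], bandwidths=[0.5])
print(colors)
```

For a single bump centered at 0, the slope should be flagged blue to the left of the peak and red to the right. Each row of the result corresponds to one bandwidth, i.e. one member of the family of smooth histograms.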
Investigation I: Heavy Tails? (cont.)
Simple, Powerful, Statistical Tool: Q-Q plot
Idea: plot
Quantiles of empirical distribution
against
Quantiles of theoretical distribution
Toy example, illustrating Q-Q plot
Have “good fit” when close to 45 degree line
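A minimal sketch of how the Q-Q points are computed, using an exponential toy sample. The plotting positions (i - 0.5)/n are one common convention and an assumption here, and the helper names are made up for the sketch.

```python
import numpy as np


def qq_points(data, theoretical_quantile):
    """Return (theoretical, empirical) quantile pairs for a Q-Q plot."""
    x = np.sort(np.asarray(data, dtype=float))  # empirical quantiles
    n = x.size
    p = (np.arange(1, n + 1) - 0.5) / n         # plotting positions in (0, 1)
    return theoretical_quantile(p), x


def exponential_quantile(p, mean=1.0):
    """Quantile function of the exponential distribution."""
    return -mean * np.log1p(-p)


rng = np.random.default_rng(3)
sample = rng.exponential(scale=1.0, size=5000)
theo, emp = qq_points(sample, exponential_quantile)

# "Close to the 45 degree line" means emp is approximately equal to theo.
central = slice(250, 4750)  # ignore the most variable 5% at each end
max_gap = np.max(np.abs(emp[central] - theo[central]))
print(max_gap)
```

Plotting `emp` against `theo` gives the Q-Q plot. Here the data really come from the theoretical distribution, so the points should hug the 45 degree line except at the extreme ends.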
Q-Q Plots
Major Problem: sampling variability?
Which departures from 45 degree line are:
- only acceptable random variation?
- important deviations in distribution?
Graphical Approach:
- Overlay envelope, showing “natural variability”
- simulated from theoretical dist’n
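The envelope idea can be sketched as follows: simulate many samples of the same size from the theoretical distribution, sort each, and take the pointwise minimum and maximum of the order statistics. The number of simulations (100), the sample size, and the helper names are assumptions for this sketch, and the course graphics may build the envelope differently.

```python
import numpy as np


def qq_envelope(sampler, n, n_sim=100, rng=None):
    """Pointwise min/max of sorted samples simulated from the theoretical dist'n."""
    rng = np.random.default_rng() if rng is None else rng
    sims = np.sort(np.stack([sampler(rng, n) for _ in range(n_sim)]), axis=1)
    return sims.min(axis=0), sims.max(axis=0)


def exp_sampler(rng, n):
    """Theoretical distribution for this toy example: exponential with mean 1."""
    return rng.exponential(scale=1.0, size=n)


rng = np.random.default_rng(4)
n = 1000
lower, upper = qq_envelope(exp_sampler, n, n_sim=100, rng=rng)


def fraction_inside(data):
    """Share of the sorted data lying within the simulated envelope."""
    x = np.sort(np.asarray(data, dtype=float))
    return float(np.mean((x >= lower) & (x <= upper)))


good = rng.exponential(scale=1.0, size=n)    # same distribution as the envelope
bad = rng.normal(loc=1.0, scale=1.0, size=n)  # same mean, different shape
inside_good = fraction_inside(good)
inside_bad = fraction_inside(bad)
print(inside_good, inside_bad)
```

Data from the theoretical distribution should stay mostly inside the envelope, while the Gaussian sample should leave it over much of its range. That is the visual criterion for "important deviation" versus "acceptable random variation".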
Q-Q Plots (cont.)
Toy Example 1: Simulated data, from Weibull(1) (exponential)
- Weibull and Pareto “fit”
(QQ stays within envelope)
- Gaussian and LogGaussian don’t “fit”
(QQ goes way outside envelope)
Q-Q Plots (cont.)
Toy Example 2: Simulated data, from Pareto(1.5)
Heavy tails (finite mean, infinite variance)
- Pareto fits, Weibull & Gaussian way off (in tail)
- log Q-Q plots useful (for heavy tails)
- lognormal nearly fits???
Investigation I: Heavy Tails? (cont.)
Q-Q plot for full 734,814 HTTP Response Sizes:
- Pareto(1.2) good fit in tails?
- surprisingly good?
- no sim’d overlay, since sample size too large
- nearly no “variability”? (except at ends)
- Shape parameter 1.2 has finite mean & infinite variance
- poor fit for main dist’n (as expected)
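Which moments of a Pareto distribution are finite follows from a one-line integral. The sketch below assumes the form P(X > x) = x^{-\alpha} for x >= 1, which may differ in location and scale from the parameterization used in the graphics. The finiteness conclusion does not depend on that choice.

```latex
\[
  E X^k \;=\; \int_1^\infty x^k \,\alpha\, x^{-\alpha-1}\,dx
        \;=\;
  \begin{cases}
    \dfrac{\alpha}{\alpha-k}, & k < \alpha,\\[6pt]
    \infty, & k \ge \alpha.
  \end{cases}
\]
```

So any shape parameter with 1 < \alpha <= 2 gives a finite mean and an infinite variance. Under the assumed form, \alpha = 1.2 gives E X = 6 and \alpha = 1.5 gives E X = 3, with E X^2 infinite in both cases.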