Course  OR 778

Statistical Analysis and Modelling
of Internet Traffic Data









Course Meetings:

Time:   Mon. - Wed. 8:40 - 9:55
Room:  Rhodes 471


Course Web Site:
???
 
 
 


Instructor:   J. S. (Steve) Marron
 

Office:   Rhodes ???
Office Hours:   Mon. 10 - 11,    Tuesday 11 - 12
 

Phone:   (607) 255-9147
Email:   marron@stat.unc.edu
 
 
 
 


Course Work / Grading
 

S/U:   based on a presentation
 

Full Grade:   based on a presentation, plus data analysis project
 
 
 
 
 

Presentations:

    -    can be either a paper by others (you choose, or I suggest)

    -    or your own work
 
 
 
 


Course Goals:
 

    -    Explore Internet Traffic from several viewpoints
 

    -    Highlight interesting open problems
 

    -    Promote possible joint research
 

    -    Maximize understanding by all class members
 
 
 
 
 


Fun Aspects of Internet Traffic:

    -    Challenging topic

    -    Will stretch your thoughts and pre-conceptions

    -    Lots of controversy  (about even "basic" notions)

    -    No models provide a "good fit" (yet???)

    -    Loads of data

    -    Interesting phenomena across wide range of "scales"

    -    "Land of Opportunity" for many types of researchers
 
 
 
 


What this course is not about:
 

    -    Web page content
 

    -    Surveys of browsing habits
 

    -    Network intrusion / security
 

    -    Marketing strategies
 
 
 
 


Big Picture View of Internet Traffic

(will generally do details, e.g. discuss "TCP/IP", only when needed)
 
 
 
 
 

Gigantic (worldwide) Communications Networks:
 

    - Telephone Network
 

    - Internet
 

Both based on “connection” between 2 points
 
 
 
 


Fundamental Difference, I

Manner of equipment usage:
 

    - Telephone:   each connection has sole use

(of ~2 wires)

Congestion  =>  no connection








    - Internet:  all connections share resources

(transmissions split into small “packets”)

(packet size  <=  1500 bytes)

Congestion  =>  packet loss & delays
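A rough sketch of the packet splitting described above. It assumes the 1500-byte limit covers the whole packet and that each packet spends 40 bytes on headers; the header size and the name packets_needed are illustrative assumptions, not course material.

```python
MAX_PACKET_BYTES = 1500  # upper limit quoted in the notes
HEADER_BYTES = 40        # assumption: header overhead per packet


def packets_needed(payload_bytes, max_packet=MAX_PACKET_BYTES, header=HEADER_BYTES):
    """Number of packets needed to carry payload_bytes of data."""
    if payload_bytes < 0:
        raise ValueError("payload_bytes must be non-negative")
    per_packet = max_packet - header  # data bytes that fit in one packet
    # Integer ceiling division, so no floating point rounding is involved.
    return (payload_bytes + per_packet - 1) // per_packet


if __name__ == "__main__":
    for size in (1_000, 10_000, 1_000_000):
        print(size, "bytes ->", packets_needed(size), "packets")
```

Under these assumptions even a modest transfer becomes many packets, each of which competes separately for the shared link.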











Fundamental Difference, II







Distribution of duration of connections:
 
 

    - Telephone:   exponential distribution

(say something  &  how long can hold phone?)








    - Internet:  heavy tailed distributions???

(current controversy, let’s study)

(very long and very short connections!)
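To get a feel for "very long and very short" connections, the sketch below compares quantiles of an exponential distribution with those of a Pareto distribution. The parameter choices (mean 1, shape 1.5, minimum value 1) are arbitrary illustrations, not values fitted to any data set.

```python
import math


def exponential_quantile(p, mean=1.0):
    """Quantile function of the exponential distribution with the given mean."""
    return -mean * math.log(1.0 - p)


def pareto_quantile(p, alpha=1.5, x_min=1.0):
    """Quantile function of a Pareto distribution with P(X > x) = (x_min / x)**alpha."""
    return x_min * (1.0 - p) ** (-1.0 / alpha)


if __name__ == "__main__":
    for p in (0.5, 0.9, 0.99, 0.9999):
        print(f"p = {p}:  exponential {exponential_quantile(p):8.2f}"
              f"   Pareto {pareto_quantile(p):8.2f}")
```

The medians are of similar size, but the extreme Pareto quantiles are far larger than the exponential ones, which is the heavy-tail behaviour under discussion.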












Fundamental Difference, III

Mathematical Models:
 
 

    - Telephone:   queueing theory

Poisson arrivals, exponential durations








    - Internet:  ???   (controversial, let’s study)

                Early ideas:  queueing theory as an approx.

                More recent:  not appropriate (heavy tails +)

                This year:  Poisson OK at main link?

                        Tails truly heavy?
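The classical telephone model (Poisson arrivals, exponential durations) is easy to simulate. The sketch below is a minimal version with unlimited lines, so congestion is ignored; the arrival rate, mean duration and function names are illustrative assumptions.

```python
import random


def simulate_calls(rate, mean_duration, horizon, rng):
    """Return (start, end) times of calls arriving in [0, horizon).

    Poisson arrivals are generated by summing exponential inter-arrival gaps.
    """
    calls = []
    t = rng.expovariate(rate)
    while t < horizon:
        duration = rng.expovariate(1.0 / mean_duration)
        calls.append((t, t + duration))
        t += rng.expovariate(rate)
    return calls


def active_at(calls, t):
    """Number of calls in progress at time t."""
    return sum(1 for start, end in calls if start <= t < end)


if __name__ == "__main__":
    rng = random.Random(0)
    calls = simulate_calls(rate=2.0, mean_duration=3.0, horizon=1000.0, rng=rng)
    times = [50 + k for k in range(900)]  # skip the start-up period
    average = sum(active_at(calls, t) for t in times) / len(times)
    print("calls:", len(calls), "  average number in progress:", round(average, 2))
```

With these settings the long-run average number of calls in progress should be near rate x mean duration = 6. Replacing the exponential durations with a heavy-tailed distribution is one way to explore why this model is questioned for Internet traffic.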
 
 
 


Partial Explanation of Differing Views







Depends on “where measuring is done”









Source of most data considered here:







“Tap” on Main Link at UNC

(University of North Carolina - Chapel Hill)







    -    Heavy traffic both directions

        -    35,000 web browsers

    -    Sunsite (mirror site for large data bases)
 

    -    1998 peak traffic:   ~3 minutes for 1 mil. packets
 

    -    2001 peak traffic:    ~1 minute for 1 mil. packets
 
 
 
 


Data "raw material"







Sequence of Packet Headers, with info, such as:

    -    Arrival Time

    -    Source & Destination addresses

    -    Packet Type (request, data, acknowledgement, …)

    -    Packet Size  (40 – 1500 bytes)

    -    Sequence number
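One possible way to hold the header fields listed above in code. The field names, units and the example addresses are hypothetical; the actual trace format used by the UNC group is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PacketHeader:
    arrival_time: float   # seconds since start of trace (assumed unit)
    source: str           # source address
    destination: str      # destination address
    packet_type: str      # e.g. "request", "data", "ack"
    size_bytes: int       # 40 to 1500
    sequence_number: int


def bytes_per_bin(headers, bin_width):
    """Total bytes arriving in each time bin of width bin_width seconds."""
    totals = {}
    for h in headers:
        k = int(h.arrival_time // bin_width)
        totals[k] = totals.get(k, 0) + h.size_bytes
    return totals


if __name__ == "__main__":
    # Made-up headers, for illustration only.
    trace = [
        PacketHeader(0.10, "10.0.0.1", "10.0.0.2", "request", 40, 1),
        PacketHeader(0.35, "10.0.0.2", "10.0.0.1", "data", 1500, 1),
        PacketHeader(1.20, "10.0.0.2", "10.0.0.1", "data", 900, 2),
    ]
    print(bytes_per_bin(trace, bin_width=1.0))
```

Binning bytes (or packet counts) over time in this way turns the header sequence into a time series that can be studied at different scales.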
 

Data extraction:

Heavy “database filtering”, by UNC Comp. Sci. folks,

Jeffay, Smith, Ott, Hernandez, Long











Investigation I:  Heavy Tails?







Data Set 1:    734,814    HTTP response sizes

(file sizes of browser “requests” (~clicks))

Illustrative graphic:   studies “total size of lines”
I.e. each line is one data point, so ignore packets for now
 

Note:   “file sizes being used”    vs.    “file sizes residing”
 
 

Distributional shape?
 

SiZer analysis 1:   “smooth histogram” of raw data

SiZer graphic
 
 
 


Investigation I:  Heavy Tails?  (cont.)







SiZer analysis 1:   raw data   (previous graphic)
 

    -    everything “at origin”
 

    -    heavy tails   lead to   very few, very big data points
 

    -    fashionable terminology:      many mice, &  few elephants
 
 
 

Better Visualization:    use log scale
 
 
 
 


Investigation I:  Heavy Tails?  (cont.)







SiZer analysis 2:   “smooth histogram” of log data

Movie version of SiZer for log data
 

    -    most between  100  &  10,000
 

    -    statistically significant “bumps”  (banner ads, …)
 

    -  not a common, named distribution
 

    -    heavy tail?   (histograms very poor at this)
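The effect of the log scale can be sketched with a plain histogram on log10 bins. This is a crude stand-in for the smooth histograms used by SiZer, and the sizes in the example are made up, not the UNC response sizes.

```python
import math
from collections import Counter


def log10_histogram(sizes, bins_per_decade=2):
    """Count positive values in bins of equal width on the log10 scale.

    Returns a dict mapping the left edge of each bin (in log10 units) to its count.
    """
    counts = Counter()
    for s in sizes:
        if s <= 0:
            continue  # log scale is undefined for non-positive sizes
        counts[math.floor(math.log10(s) * bins_per_decade)] += 1
    return {k / bins_per_decade: counts[k] for k in sorted(counts)}


if __name__ == "__main__":
    sizes = [5, 150, 250, 800, 2_000, 4_500, 300_000]  # made-up byte counts
    for left_edge, count in log10_histogram(sizes, bins_per_decade=1).items():
        print(f"10^{left_edge:.0f} to 10^{left_edge + 1:.0f}: {'#' * count}")
```

On the raw scale the one large value would squeeze everything else against the origin; on the log scale every decade gets equal room.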
 
 
 
 


SiZer







Visual presentation:
 

Color map over family of smooth histograms:
 

    -  Blue:  slope significantly upwards (deriv. CI above 0)
 

    -  Red:  slope significantly downwards (deriv. CI below 0)
 

    -  Purple:  slope insignificant (deriv. CI contains 0)
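The colour rule above can be sketched in a much simplified form. The version below uses a Gaussian kernel estimate of the density derivative with pointwise normal confidence intervals; this is an assumption made for brevity, and actual SiZer implementations differ in the smoother, in adjusting for simultaneous inference, and in flagging regions with too little data.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def slope_color(data, x, h, alpha=0.05):
    """Colour for location x and bandwidth h, from a CI on the density slope."""
    # Each term is the derivative in x of the Gaussian kernel centred at xi.
    c = 1.0 / (h ** 3 * math.sqrt(2.0 * math.pi))
    terms = [-(x - xi) * c * math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data]
    estimate = mean(terms)
    std_err = stdev(terms) / math.sqrt(len(terms))
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    if estimate - z * std_err > 0:
        return "blue"    # CI entirely above 0
    if estimate + z * std_err < 0:
        return "red"     # CI entirely below 0
    return "purple"      # CI contains 0


def sizer_map(data, grid, bandwidths):
    """One row of colours per bandwidth, one column per grid location."""
    return [[slope_color(data, x, h) for x in grid] for h in bandwidths]


if __name__ == "__main__":
    rng = random.Random(1)
    sample = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # simulated data
    grid = [-2.0, -1.0, 0.0, 1.0, 2.0]
    for h, row in zip((0.25, 0.5, 1.0), sizer_map(sample, grid, (0.25, 0.5, 1.0))):
        print(f"h = {h}: {row}")
```

For a single-bump sample like this one, the expected pattern is blue to the left of the peak and red to the right at moderate bandwidths.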
 
 
 
 
 
 
 


Investigation I:  Heavy Tails?  (cont.)







Simple, Powerful, Statistical Tool:  Q-Q plot
 

Idea:  plot

Quantiles of empirical distribution

against

Quantiles of theoretical distribution







Toy example, illustrating Q-Q plot

Have “good fit” when close to 45 degree line
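The Q-Q idea can be written in a few lines. The sketch below pairs the sorted data with theoretical quantiles at probabilities (i + 0.5) / n; that plotting position is one common convention and is an assumption here, as is the choice of the exponential as the example theoretical distribution.

```python
import math


def exponential_quantile(p):
    """Quantile function of the standard exponential distribution."""
    return -math.log(1.0 - p)


def qq_points(data, quantile_fn):
    """Return (theoretical quantile, empirical quantile) pairs for a Q-Q plot."""
    xs = sorted(data)
    n = len(xs)
    return [(quantile_fn((i + 0.5) / n), x) for i, x in enumerate(xs)]


if __name__ == "__main__":
    data = [0.1, 2.3, 0.7, 1.1, 0.4, 3.0, 0.2, 1.6]  # made-up values
    for theoretical, empirical in qq_points(data, exponential_quantile):
        print(f"{theoretical:6.3f}  {empirical:6.3f}")
```

Plotting the second column against the first gives the Q-Q plot; points near the 45 degree line indicate a good fit.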
 
 
 
 


Q-Q Plots







Major Problem:      sampling variability?
 
 

Which departures from 45 degree line are:

    -    only acceptable random variation?

    -    important deviations in distribution?
 
 

Graphical Approach:

    -    Overlay envelope, showing “natural variability”

    -    simulated from theoretical dist’n
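A minimal sketch of the simulated envelope: draw repeated samples of the same size from the theoretical distribution, sort each, and record the pointwise range of the order statistics. The number of simulations and the use of a min/max band (rather than overlaying each simulated curve) are assumptions made for illustration.

```python
import random


def qq_envelope(n, sampler, n_sim=100, seed=0):
    """Pointwise min and max of sorted simulated samples of size n."""
    rng = random.Random(seed)
    lower = [float("inf")] * n
    upper = [float("-inf")] * n
    for _ in range(n_sim):
        sim = sorted(sampler(rng) for _ in range(n))
        for i, value in enumerate(sim):
            lower[i] = min(lower[i], value)
            upper[i] = max(upper[i], value)
    return lower, upper


def fraction_outside(data, lower, upper):
    """Share of sorted data points falling outside the envelope."""
    xs = sorted(data)
    outside = sum(1 for x, lo, hi in zip(xs, lower, upper) if x < lo or x > hi)
    return outside / len(xs)


def exponential_sampler(rng):
    return rng.expovariate(1.0)


if __name__ == "__main__":
    n = 200
    lower, upper = qq_envelope(n, exponential_sampler)
    rng = random.Random(42)
    exp_data = [rng.expovariate(1.0) for _ in range(n)]
    gauss_data = [rng.gauss(1.0, 1.0) for _ in range(n)]
    print("exponential data outside:", fraction_outside(exp_data, lower, upper))
    print("Gaussian data outside:   ", fraction_outside(gauss_data, lower, upper))
```

Data from the theoretical distribution should mostly stay inside the band, while data from a different distribution should leave it over some range of quantiles.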
 
 
 


Q-Q Plots (cont.)







Toy Example 1:

    Simulated data (),  from Weibull(1)  (exponential)
 
 

Q-Q Plot for Weibull(1)

    -    Weibull and Pareto “fit”

(QQ stays within envelope)






    -    Gaussian and LogGaussian don’t “fit”

(QQ goes way outside envelope)









Q-Q Plots (cont.)







Toy Example 2:

    Simulated data (),  from Pareto(1.5)

            Heavy tails ()
 
 

Q-Q Plot for Pareto(1.5)

    -    Pareto fits,  Weibull & Gaussian way off (in tail)

    -    log Q-Q plots useful (for heavy tails)

    -    lognormal nearly fits???
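One reason the log scale helps: for a Pareto distribution with shape alpha, the log of the p-th quantile is linear in -log(1 - p), with slope 1 / alpha. The sketch below fits a least-squares line to the log Q-Q points as a rough diagnostic; it is an illustration only, not the fitting method used for the plots in these notes, and it requires positive data.

```python
import math


def log_qq_slope(data):
    """Least-squares slope of log(sorted data) against exponential quantiles.

    For exactly Pareto data the slope is 1 / alpha.
    """
    xs = sorted(data)
    n = len(xs)
    t = [-math.log(1.0 - (i + 0.5) / n) for i in range(n)]  # exponential quantiles
    y = [math.log(x) for x in xs]                           # log of the data
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    sxy = sum((a - t_mean) * (b - y_mean) for a, b in zip(t, y))
    sxx = sum((a - t_mean) ** 2 for a in t)
    return sxy / sxx


if __name__ == "__main__":
    n = 500
    # Exact Pareto(1.5) quantiles, used instead of random draws so the result is stable.
    pareto = [(1.0 - (i + 0.5) / n) ** (-1.0 / 1.5) for i in range(n)]
    slope = log_qq_slope(pareto)
    print("slope:", round(slope, 4), "  implied alpha:", round(1.0 / slope, 4))
```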
 
 
 
 


Investigation I:  Heavy Tails?  (cont.)






Q-Q plot for full 734,814 HTTP Response Sizes:

Response Size Q-Q plot
 

    -    Pareto(1.2) good fit in tails?

    -    surprisingly good?

    -    no sim’d overlay, since sample size too large

    -    nearly no “variability”?   (except at ends)

    -    Shape parameter 1.2   has   finite mean  &  infinite variance

    -    poor fit for main dist’n (as expected)
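Why a Pareto shape parameter of 1.2 matters for moments can be checked directly. The derivation below assumes the parametrization P(X > x) = x^{-\alpha} for x >= 1; other parametrizations change the constants but not which moments are finite.

```latex
\[
  E\,X^k \;=\; \int_1^\infty x^k \,\alpha\, x^{-\alpha-1}\,dx
  \;=\;
  \begin{cases}
    \dfrac{\alpha}{\alpha-k}, & k < \alpha,\\[2ex]
    \infty, & k \ge \alpha.
  \end{cases}
\]
\[
  \alpha = 1.2:\qquad
  E\,X = \frac{1.2}{0.2} = 6 < \infty,
  \qquad
  E\,X^2 = \infty
  \;\Longrightarrow\;
  \operatorname{var}(X) = \infty .
\]
```

So any shape parameter between 1 and 2 gives a finite mean and an infinite variance.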