Class Notes 10/17/01
Last Time:
- Zooming Spectral Analysis
- showed "more natural scale" for studying LRD
- Revisited Flow Duration Distributions
- from "Residual Life Time Distribution" viewpoint
Recall Simple Model
Intuitive basis: Mice and Elephants plot from before:
Model:
a. "line starts" as homogeneous Poisson process
b. durations (i.e. "line lengths") as i.i.d. from some dist'n
Variations:
- Independent Weibull Interarrival starts
- Clustered Poisson starts
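A minimal simulation sketch of the simple model above, using only the Python standard library. The Pareto duration distribution, its parameters (alpha = 1.5, scale = 0.1), the start rate and the time horizon are all illustrative assumptions; the notes only say the durations come from "some dist'n".

```python
import random

def simulate_simple_model(rate, horizon, draw_duration, seed=0):
    """Return (start, duration) pairs for flows starting in [0, horizon)."""
    rng = random.Random(seed)
    flows = []
    t = rng.expovariate(rate)              # time of the first "line start"
    while t < horizon:
        flows.append((t, draw_duration(rng)))
        t += rng.expovariate(rate)         # exponential gaps give Poisson starts
    return flows

def pareto_duration(rng, alpha=1.5, scale=0.1):
    """Hypothetical heavy-tailed duration: Pareto with minimum `scale`."""
    return scale * rng.paretovariate(alpha)

flows = simulate_simple_model(rate=50.0, horizon=100.0,
                              draw_duration=pareto_duration)
print(len(flows), "flows; longest duration:", max(d for _, d in flows))
```

Swapping the exponential gaps for Weibull gaps would give the first variation listed above.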
Simple Model (cont.)
Model Checks, for Start Time Poisson Process:
Based on properties of Poisson process:
1. Homogeneity of Poisson Intensity
2. Exponential Distribution of Interarrival times
3. Independence of Interarrivals
Simple Model (cont.)
Model Checks, for Start Time Poisson Process (cont.)
1. Homogeneity of Poisson Intensity (done before)
- SiZer analysis: not quite homogeneous intensity
e.g. Off Peak Flow starts
- Could duplicate SiZer structure with Weibull Interarrivals
e.g. Weibull Simulation
- But then got distribution wrong
- Also found Clustered Poisson with good SiZer structure
- And this time found better interarrival distribution
- Although clustering distribution uncertain
(no intuitive explanation of parameters)
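As a much cruder stand-in for the SiZer analysis (this is not SiZer, only a quick screening check), one can bin the start times and compare the variance of the bin counts to their mean. For a homogeneous Poisson process the ratio should be near 1; values well above 1 suggest varying intensity or clustering. The bin count, rate and horizon below are illustrative assumptions.

```python
import random

def bin_counts(times, horizon, n_bins):
    """Count events falling in each of n_bins equal-width bins on [0, horizon)."""
    counts = [0] * n_bins
    width = horizon / n_bins
    for t in times:
        if 0 <= t < horizon:
            counts[min(int(t / width), n_bins - 1)] += 1
    return counts

def dispersion_index(counts):
    """Sample variance of the counts divided by their mean."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return var / mean

# Homogeneous Poisson starts with rate 100 on [0, 100)
rng = random.Random(1)
times, t = [], rng.expovariate(100.0)
while t < 100.0:
    times.append(t)
    t += rng.expovariate(100.0)
poisson_index = dispersion_index(bin_counts(times, 100.0, 100))
print("dispersion index, homogeneous Poisson:", poisson_index)
```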
Simple Model (cont.)
Model Checks, for Start Time Poisson Process (cont.)
2. Exponential Distribution of Interarrival times (done before)
- Not true for simple model (got Weibull(0.9) instead)
e.g. Q-Q analysis
- Motivated Weibull Interarrival model
- Could make disappear with "memory-less" approach
- Suggested Cluster Poisson model
- Also used for parameter estimation
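A sketch of the kind of Q-Q comparison referred to above: empirical quantiles of the interarrival times are paired with quantiles of an exponential distribution fitted by the sample mean. The data here are simulated (not the flow data from class), and the sample size and plotting positions (i + 0.5)/n are assumptions. Only the shape parameter 0.9 comes from the notes.

```python
import math
import random

def exponential_qq(sample):
    """Pairs (theoretical exponential quantile, empirical quantile)."""
    xs = sorted(sample)
    n = len(xs)
    mean = sum(xs) / n                      # exponential fitted by its mean
    return [(-mean * math.log(1.0 - (i + 0.5) / n), x)
            for i, x in enumerate(xs)]

rng = random.Random(2)
n = 20000
expo = [rng.expovariate(1.0) for _ in range(n)]
# random.weibullvariate takes (scale, shape); shape 0.9 as quoted in the notes
weib = [rng.weibullvariate(1.0, 0.9) for _ in range(n)]

for name, sample in (("exponential", expo), ("Weibull(0.9)", weib)):
    theo, emp = exponential_qq(sample)[int(0.99 * n)]
    print(name, "empirical/theoretical at upper 1%:", emp / theo)
```

For exponential data the points stay near the 45 degree line; for Weibull(0.9) the upper tail should bend above it, since shape < 1 gives a heavier than exponential tail.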
Simple Model (cont.)
Model Checks, for Start Time Poisson Process (cont.)
3. Independence of Interarrivals (new material)
Approach: Repeat "memory-less" analysis,
but this time study autocorrelation
(treating sequence of interarrival times as time series)
- substantial dependence for small threshold, a
- essentially independent for large thresholds
- transition range is around log10(a) in (-2,-1)
(as in above memory-less thresholded Q-Q analysis)
- not consistent with Weibull Interarrival Process
- but is consistent with cluster Poisson Process
- strange "bump" in Peak? Driven by "boundary effect"?
Boundary Adjusted Peak (only flows starting after 20%)
- similar lessons
- and strange peak disappears
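A sketch of the autocorrelation check, treating the interarrival times as a time series. It assumes the "memory-less" thresholding means: keep only interarrivals larger than a, and subtract a from each (for exponential data the result is again exponential). That reading, and the i.i.d. exponential test data, are assumptions, not taken from the notes.

```python
import random

def autocorr(x, max_lag):
    """Sample autocorrelations of x at lags 1..max_lag."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x)
    return [sum((x[i] - mean) * (x[i + k] - mean) for i in range(n - k)) / c0
            for k in range(1, max_lag + 1)]

def thresholded(x, a):
    """Excesses over threshold a, in original order (assumed 'memory-less' step)."""
    return [v - a for v in x if v > a]

rng = random.Random(3)
gaps = [rng.expovariate(1.0) for _ in range(5000)]   # i.i.d. reference sequence
for log10_a in (-2.0, -1.0):
    acf = autocorr(thresholded(gaps, 10.0 ** log10_a), max_lag=5)
    print("log10(a) =", log10_a, [round(r, 3) for r in acf])
```

For an independent sequence of length n, values within roughly 2/sqrt(n) of zero are what one expects to see.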
Simple Model (cont.)
Model Checks, for Start Time Poisson Process (cont.)
3. Independence of Interarrivals (cont.)
Final check by simulation:
a. Best simulated Weibull Interarrivals: Autocorr.
- Looks independent as expected
b. Best Cluster Poisson: Autocorr.
- Looks very dependent as expected
c. Best Cluster Poisson: Memory-less Autocorr.
- Similar properties to above
Conclusion: reconfirmation of above ideas:
- Weibull Interarrival Model doesn't work
- Cluster Poisson Model has promise
Possible future work:
Better Cluster Poisson parameter estimation
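A sketch of how such a simulation check could be set up. The cluster construction here (Poisson cluster centers, Poisson cluster sizes, uniform offsets within a fixed width) and all parameter values are illustrative assumptions, not the fitted "best" models from class. How much dependence the cluster process shows depends on these choices, so no particular value should be read into the output.

```python
import random

def poisson(rng, mean):
    """Poisson draw: count rate-1 exponential arrivals in [0, mean)."""
    n, t = 0, rng.expovariate(1.0)
    while t < mean:
        n += 1
        t += rng.expovariate(1.0)
    return n

def cluster_poisson(rng, center_rate, horizon, mean_size, width):
    """Poisson cluster centers, each spawning points at uniform offsets."""
    points = []
    c = rng.expovariate(center_rate)
    while c < horizon:
        points.extend(c + rng.uniform(0.0, width)
                      for _ in range(poisson(rng, mean_size)))
        c += rng.expovariate(center_rate)
    return sorted(p for p in points if p < horizon)

def lag1_autocorr(x):
    """Sample autocorrelation at lag 1."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x)
    return sum((x[i] - mean) * (x[i + 1] - mean) for i in range(n - 1)) / c0

rng = random.Random(4)
starts = cluster_poisson(rng, center_rate=1.0, horizon=500.0,
                         mean_size=20.0, width=5.0)
cluster_gaps = [b - a for a, b in zip(starts, starts[1:])]
# Independent Weibull interarrivals (scale 1, shape 0.9) for comparison
weibull_gaps = [rng.weibullvariate(1.0, 0.9) for _ in range(len(cluster_gaps))]
print("cluster Poisson lag-1 autocorr:", lag1_autocorr(cluster_gaps))
print("Weibull renewal lag-1 autocorr:", lag1_autocorr(weibull_gaps))
```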
Flow Scatterplots
Idea: look for "asymptotic independence"
Expectation: for heavy tailed distributions,
scatterplot should "hug axes"
Approach: for previously analyzed IP flows,
plot "flow time duration" vs. "number of packets in flow"
- Data points hug axes!
- But wrong ones!!!
- Reason is boundary truncation
- Points at top could be "really much higher"
- Then when vertical axis is rescaled, might "hug axes"
- Note: "lower diagonal thresholds"
- driven by "fastest transmission rate"?
- note: a wide mixture of effective rates
Flow Scatterplots (cont.)
Quick fix?
Consider only flows that start after 20% of total time
- Looks much better??
- At least not "hugging top"
- But not "hugging axes" either
- Problem is cutting off largest flows?
- Note: number of packets becomes much smaller
- "Lower thresholds" look better
- but again, are NOT studying tail
- Need better data to properly study this
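A small sketch of the boundary truncation and of the 20% quick fix. The record layout (start, duration), the window endpoints and the toy flows are hypothetical. The notes do not say how flows still running at the end of the window were handled, so the filter below only applies the start-time rule.

```python
def observed_duration(start, duration, window_start, window_end):
    """Portion of a flow that falls inside the observation window."""
    seen_start = max(start, window_start)
    seen_end = min(start + duration, window_end)
    return max(0.0, seen_end - seen_start)

def starts_after_fraction(flows, window_start, window_end, fraction=0.2):
    """Keep flows whose start is after the first `fraction` of the window."""
    cutoff = window_start + fraction * (window_end - window_start)
    return [f for f in flows if f[0] >= cutoff]

# Toy (start, duration) records, observation window [0, 100)
flows = [(-5.0, 20.0), (10.0, 5.0), (30.0, 100.0), (90.0, 50.0)]
for start, duration in flows:
    print(start, duration, "observed:",
          observed_duration(start, duration, 0.0, 100.0))
print("kept by 20% rule:", starts_after_fraction(flows, 0.0, 100.0))
```

The third and fourth toy flows show why the largest flows are the ones hurt most: their true durations (100 and 50) are recorded as 70 and 10.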
New HTTP Response Data
Addresses "boundary effect" problems encountered above
(and allows working with much larger sample sizes)
Data sources: 4 hour blocks of packet headers
"Morning": 8:00-12:00
"Afternoon": 13:00-17:00
"Evening": 19:30-23:30
Gathered at UNC Main Link
During 7 days in April 2001
(not consecutive days due to hardware problems)
(but each weekday is represented)
Headers filtered to only HTTP packets
(recall subset of IP and TCP)
Unsure: flows determined only by addresses?
Or: using other header parts?
New HTTP Response Data (cont.)
Some summaries:
1. Number of Responses
- Clear diurnal effects, and more
- Weekdays generally bigger than weekends
- But Friday evening smaller (students out partying?)
- And Sunday evening larger (students back at work?)
- On most weekdays:
afternoon > morning > evening
- Biggest is Wednesday afternoon (significant?)
- Smallest is Sunday morning
New HTTP Response Data (cont.)
Some summaries (cont.):
2. Total Size (bytes)
- Lessons very similar to above
- Not surprising (?), more responses mean more data
3. Mean Size (bytes/response)
- Surprisingly stable?
- Differences "statistically significant"?
- Perhaps "average size of web pages" quite constant
- at least over this 1 week time scale
- suggests "aggregated user behavior" is constant??
- thus only number of users drives most of traffic???
New HTTP Response Data (cont.)
Some summaries (cont.):
4. Standard Deviation of Response Sizes
- Far less stable than mean
- Reflects "heavy tail" parameter in range (1,2)??
(stable mean, unstable variance)
- Maybe user behavior does change diurnally???
- Or are they "random instabilities"?
(as predicted by heavy tail theory?)
- no obvious pattern (time of day, day of week)?
- No correlation to size of traffic?
- Note: S. D. order of magnitude bigger than mean
- Thus strongly skewed distribution
(since all are positive)
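A simulation sketch of the "stable mean, unstable variance" effect. It draws 21 independent blocks from a Pareto distribution with tail index 1.5 (an illustrative value inside (1,2), where the mean is finite but the variance is infinite) and compares how much the block means and block S. D.s vary. The block size of 5000 and the choice of Pareto are assumptions; the real response size data are not used here.

```python
import math
import random

def mean_and_sd(xs):
    """Sample mean and (population-style) standard deviation."""
    n = len(xs)
    m = sum(xs) / n
    return m, math.sqrt(sum((x - m) ** 2 for x in xs) / n)

rng = random.Random(5)
alpha = 1.5    # illustrative tail index: finite mean, infinite variance
summaries = [mean_and_sd([rng.paretovariate(alpha) for _ in range(5000)])
             for _ in range(21)]           # 21 blocks, like the 21 time blocks
means = [m for m, _ in summaries]
sds = [s for _, s in summaries]
mean_spread = max(means) / min(means)
sd_spread = max(sds) / min(sds)
print("max/min of block means:", mean_spread)
print("max/min of block S.D.s:", sd_spread)
```

Since every block has the same distribution, any instability in the S. D. here is purely the "random instability" of heavy tails, with no diurnal effect behind it.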
New HTTP Response Data (cont.)
Some summaries (cont.):
5. Max Response
- "Instability" of same type as observed for S. D.
- In fact peaks in same locations!
- S. D. seems driven by maxima
- No big surprise, expected from extreme value theory?
- S. D. is not a very useful measure here
- Fortunately stored quantiles as well
(will analyze next time)
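A toy illustration of how a single maximum can drive the S. D.: compute the share of the total squared deviation that comes from the largest observation, and compare with a quantile-based summary that ignores it. The five response sizes are made-up numbers chosen to make the arithmetic easy.

```python
import statistics

def max_share_of_variance(xs):
    """Fraction of the total squared deviation due to the largest value."""
    m = sum(xs) / len(xs)
    total = sum((x - m) ** 2 for x in xs)
    return (max(xs) - m) ** 2 / total

sizes = [1, 1, 1, 1, 100]    # toy response sizes with one huge response
share = max_share_of_variance(sizes)
# About 80% of the squared deviation comes from the one large value
print("share of squared deviation from the max:", share)
# The median is unaffected by how large the maximum is
print("median:", statistics.median(sizes))
```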
New Flow Scatterplots
Recall Idea: look for "asymptotic independence"
Expectation: for heavy tailed distributions,
scatterplot should "hug axes"
Approach: for new HTTP Response Sizes,
plot "response time duration (sec)" vs. "size of response (bytes)"
Combined Graphic (all 21 time blocks)
Personal "axis hugging rating":
Monday Morning: Good
Monday Afternoon: Good
Monday Evening: OK
Tuesday Morning: Very good
Tuesday Afternoon: Good
Tuesday Evening: Poor
Wednesday Morning: Good
Wednesday Afternoon: Very good
Wednesday Evening: OK
Thursday Morning: OK
Thursday Afternoon: Good
Thursday Evening: Very good
Friday Morning: Poor
Friday Afternoon: Poor
Friday Evening: Good
Saturday Morning: Good
Saturday Afternoon: OK
Saturday Evening: OK
Sunday Morning: Good
Sunday Afternoon: Poor
Sunday Evening: OK
New Flow Scatterplots (cont.)
Other Observations:
- No apparent patterns??? With respect to:
- day of week?
- time of day?
- traffic load?
- max response size?
- Interesting "streaks" at some times
(strongest on Sunday morning)
- suggests a few common "transmission rates"
- most clear on Sun. morning, since least packet loss
New Flow Scatterplots (cont.)
Things that could be done:
1. Follow up on "streaks", i.e. understand those transm'n rates?
- Do overlay of important ones on all plots?
- If they appear frequently, get expert interpretation?
- Apply 2-d SiZer type analysis to scatterplots?
2. Look at log scale versions?
- What do we expect to learn?
3. Get quantitative about "axis hugging"?
How interesting are these???
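One possible way to get quantitative about "axis hugging" (a suggestion, not something done in class): among the points whose first coordinate is in its top 5%, find the fraction whose second coordinate is also in its top 5%. Under asymptotic independence this fraction should be small (near 0.05 at this threshold for fully independent data); under strong tail dependence it approaches 1. The function names, the 0.95 threshold and the simulated Pareto data are all assumptions.

```python
import random

def rank_fractions(values):
    """Rank of each value divided by n, so results lie in (0, 1]."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    for position, i in enumerate(order, start=1):
        ranks[i] = position / n
    return ranks

def joint_tail_fraction(xs, ys, u=0.95):
    """Among points with x above its u-quantile, fraction with y above its own."""
    rx, ry = rank_fractions(xs), rank_fractions(ys)
    x_high = [j for j in range(len(xs)) if rx[j] > u]
    both = sum(1 for j in x_high if ry[j] > u)
    return both / len(x_high)

rng = random.Random(6)
durations = [rng.paretovariate(1.5) for _ in range(1000)]
sizes = [rng.paretovariate(1.5) for _ in range(1000)]
independent_score = joint_tail_fraction(durations, sizes)
dependent_score = joint_tail_fraction(durations, [2.0 * d for d in durations])
print("independent heavy tails:", independent_score)
print("perfectly dependent:    ", dependent_score)
```

Working with ranks makes the score unaffected by axis scaling, including the log scale versions mentioned in item 2.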