Course OR 778

Class Notes 10/17/01

Last Time:

- Zooming Spectral Analysis

- showed "more natural scale" for studying LRD

- Revisited Flow Duration Distributions

- from "Residual Life Time Distribution" viewpoint

Recall Simple Model

Intuitive basis: Mice and Elephants plot from before:

(graphic: Mice and Elephants plot)

Model:

a. "line starts" as homogeneous Poisson process

b. durations (i.e. "line lengths") independent of starts, i.i.d. from some dist'n
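The two model ingredients above can be sketched in a few lines. This is only an illustration: the Pareto duration distribution, its tail index 1.5, the rate, and the time horizon are assumptions chosen for the demo, not values from the data.

```python
import random

def simulate_simple_model(rate, horizon, duration_sampler, seed=0):
    """Return (start, end) pairs: Poisson starts on [0, horizon), i.i.d. durations."""
    rng = random.Random(seed)
    flows = []
    t = rng.expovariate(rate)            # first "line start"
    while t < horizon:
        d = duration_sampler(rng)        # "line length", drawn independently of the starts
        flows.append((t, t + d))
        t += rng.expovariate(rate)       # exponential gaps give homogeneous Poisson starts
    return flows

def pareto_duration(rng, alpha=1.5, scale=0.1):
    # ASSUMPTION: heavy-tailed Pareto durations (paretovariate returns values >= 1)
    return scale * rng.paretovariate(alpha)

flows = simulate_simple_model(rate=50.0, horizon=100.0, duration_sampler=pareto_duration)
longest = max(end - start for start, end in flows)
print(len(flows), "flows; longest duration:", round(longest, 2))
```

Plotting each pair as a horizontal line segment at a random height would give a synthetic version of the Mice and Elephants plot.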

Variations:

- Independent Weibull Interarrival starts

- Clustered Poisson starts

Simple Model (cont.)

Model Checks, for Start Time Poisson Process:

Based on properties of Poisson process:

1. Homogeneity of Poisson Intensity

2. Exponential Distribution of Interarrival times

3. Independence of Interarrivals

Simple Model (cont.)

Model Checks, for Start Time Poisson Process (cont.)

1. Homogeneity of Poisson Intensity (done before)

- SiZer analysis: not quite homogeneous intensity

e.g. Off Peak Flow starts

- Could duplicate SiZer structure with Weibull Interarrivals

e.g. Weibull Simulation

- But then got distribution wrong

e.g. corresponding Q-Q plot

- Also found Clustered Poisson with good SiZer structure

e.g. Clustered Poisson Simulation

- And this time found better interarrival distribution

e.g. corresponding Q-Q plot

- Although clustering distribution uncertain

(no intuitive explanation of parameters)

Simple Model (cont.)

Model Checks, for Start Time Poisson Process (cont.)

2. Exponential Distribution of Interarrival times (done before)

- Not true for simple model (got Weibull(0.9) instead)

e.g. Q-Q analysis

- Motivated Weibull Interarrival model

- Could make disappear with "memory-less" approach

e.g. zoom through thresholds

- Suggested Cluster Poisson model

- Also used for parameter estimation
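A minimal sketch of the Q-Q check against the exponential, using simulated gaps rather than the real interarrival data. The shape 0.9 echoes the Weibull fit mentioned above; the scale, sample size, and plotting positions are assumptions for the demo.

```python
import math
import random

def exponential_qq(interarrivals):
    """Pairs (theoretical exponential quantile, empirical quantile).

    The exponential is matched to the data by its sample mean.
    """
    xs = sorted(interarrivals)
    n = len(xs)
    mean = sum(xs) / n
    pairs = []
    for i, x in enumerate(xs):
        p = (i + 0.5) / n                          # plotting position
        pairs.append((-mean * math.log(1.0 - p), x))
    return pairs

rng = random.Random(1)
# random.weibullvariate(scale, shape); scale 1.0 is an arbitrary choice
gaps = [rng.weibullvariate(1.0, 0.9) for _ in range(20000)]
qq = exponential_qq(gaps)

# Under an exponential model the points would sit near the 45-degree line.
for p in (0.5, 0.9, 0.99):
    theo, emp = qq[int(p * len(qq))]
    print(f"p={p}: theoretical={theo:.3f} empirical={emp:.3f}")
```

With shape below 1 the upper empirical quantiles drift above the exponential ones, which is the kind of departure a Q-Q plot makes visible.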

Simple Model (cont.)

Model Checks, for Start Time Poisson Process (cont.)

3. Independence of Interarrivals (new material)

Approach: Repeat "memory-less" analysis,

but this time study autocorrelation

(treating sequence of interarrival times as time series)

Graphics: Off Peak, Peak

- substantial dependence for small threshold, a

- essentially independent for large thresholds

- transition range is around log10(a) in (-2,-1)

(as in above memory-less thresholded Q-Q analysis)

- not consistent with Weibull Interarrival Process

- but is consistent with cluster Poisson Process

- strange "bump" in Peak? Driven by "boundary effect"?

Boundary Adjusted Peak (only flows starting after 20%)

- similar lessons

- and strange peak disappears
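The autocorrelation check can be sketched as follows. One assumption is made loudly here: the "memory-less" thresholding is taken to mean "keep only interarrival times exceeding a, then subtract a", which may differ in detail from what was actually done. The data are simulated i.i.d. exponential gaps, so this shows only the independent baseline.

```python
import random

def autocorr(x, lag):
    """Sample autocorrelation of the sequence x at the given lag."""
    n = len(x)
    m = sum(x) / n
    var = sum((v - m) ** 2 for v in x)
    cov = sum((x[i] - m) * (x[i + lag] - m) for i in range(n - lag))
    return cov / var

def excess_over_threshold(gaps, a):
    # ASSUMPTION: "memory-less" thresholding = keep gaps > a and subtract a.
    # For exponential gaps the excess is again exponential.
    return [g - a for g in gaps if g > a]

rng = random.Random(3)
gaps = [rng.expovariate(1.0) for _ in range(20000)]   # i.i.d. baseline, mean 1

for a in (0.0, 0.5, 1.0, 2.0):
    kept = excess_over_threshold(gaps, a)
    print(f"a={a}: n={len(kept)}, lag-1 autocorr={autocorr(kept, 1):+.3f}")
```

For independent gaps the values should stay within roughly 2/sqrt(n) of zero at every threshold; the real data depart from this for small a.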

Simple Model (cont.)

Model Checks, for Start Time Poisson Process (cont.)

3. Independence of Interarrivals (cont.)

Final check by simulation:

a. Best simulated Weibull Interarrivals: Autocorr.

- Looks independent as expected

b. Best Cluster Poisson: Autocorr.

- Looks very dependent as expected

c. Best Cluster Poisson: Memory-less Autocorr.

- Similar properties to above

Conclusion: reconfirmation of above ideas:

- Weibull Interarrival Model doesn't work

- Cluster Poisson Model has promise
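A toy version of this simulation comparison. The clustered process here (Poisson background starts plus Poisson-located clusters of fixed size and width) and all of its parameters are hypothetical stand-ins; the fitted cluster Poisson model from the analysis is not specified in these notes.

```python
import random

def lag1_autocorr(x):
    n = len(x)
    m = sum(x) / n
    var = sum((v - m) ** 2 for v in x)
    cov = sum((x[i] - m) * (x[i + 1] - m) for i in range(n - 1))
    return cov / var

def weibull_renewal_gaps(n, shape, rng):
    # i.i.d. Weibull gaps: a renewal process, so gaps are independent by construction
    return [rng.weibullvariate(1.0, shape) for _ in range(n)]

def clustered_gaps(horizon, background_rate, cluster_rate,
                   cluster_size, cluster_width, rng):
    """Gaps of a toy clustered start process (hypothetical parametrization)."""
    def poisson_times(rate):
        times, t = [], rng.expovariate(rate)
        while t < horizon:
            times.append(t)
            t += rng.expovariate(rate)
        return times

    starts = poisson_times(background_rate)
    for center in poisson_times(cluster_rate):
        # each cluster adds a burst of starts just after its center
        starts.extend(center + rng.uniform(0.0, cluster_width)
                      for _ in range(cluster_size))
    starts.sort()
    return [b - a for a, b in zip(starts, starts[1:])]

rng = random.Random(2)
r_weibull = lag1_autocorr(weibull_renewal_gaps(20000, 0.9, rng))
r_cluster = lag1_autocorr(clustered_gaps(5000.0, 1.0, 0.1, 20, 0.5, rng))
print(f"Weibull renewal lag-1 autocorr: {r_weibull:+.3f}")
print(f"Toy clustered  lag-1 autocorr: {r_cluster:+.3f}")
```

In this setup the renewal gaps should look uncorrelated, while runs of short gaps inside bursts and longer gaps between them should give the clustered stream a clearly positive lag-1 autocorrelation.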

Possible future work:

Better Cluster Poisson parameter estimation

Flow Scatterplots

Idea: look for "asymptotic independence"

Expectation: for heavy tailed distributions,

scatterplot should "hug axes"

Approach: for previously analyzed IP flows,

plot "flow time duration" vs. "number of packets in flow"

Graphics: Peak, Off Peak

- Data points hug axes!

- But wrong ones!!!

- Reason is boundary truncation

- Points at top could be "really much higher"

- Then when vertical axis is rescaled, might "hug axes"

- Note: "lower diagonal thresholds"

- driven by "fastest transmission rate"?

- note: a wide mixture of effective rates

Flow Scatterplots (cont.)

Quick fix?

Consider only flows that start after 20% of total time

Graphics: Peak, Off Peak

- Looks much better??

- At least not "hugging top"

- But not "hugging axes" either

- Problem is cutting off largest flows?

- Note: number of packets becomes much smaller

- "Lower thresholds" look better

- but again, are NOT studying tail

- Need better data to properly study this

New HTTP Response Data

Addresses "boundary effect" problems encountered above

(and allows working with much larger sample sizes)

Data sources: 4 hour blocks of packet headers

"Morning": 8:00-12:00

"Afternoon": 13:00-17:00

"Evening": 19:30-23:30

Gathered at UNC Main Link

During 7 days in April 2001

(not consecutive days due to hardware problems)

(but each weekday is represented)

Headers filtered to only HTTP packets

(recall subset of IP and TCP)

Unsure: flows determined only by addresses?

Or: using other header parts?

New HTTP Response Data (cont.)

Some summaries:

1. Number of Responses

- Clear diurnal effects, and more

- Weekdays generally bigger than weekends

- But Friday evening smaller (students out partying?)

- And Sunday evening larger (students back at work?)

- On most weekdays:

afternoon > morning > evening

- Biggest is Wednesday afternoon (significant?)

- Smallest is Sunday morning

New HTTP Response Data (cont.)

Some summaries (cont.):

2. Total Size (bytes)

- Lessons very similar to above

- Not surprising (?), more responses mean more data

3. Mean Size (bytes/response)

- Surprisingly stable?

- Differences "statistically significant"?

- Perhaps "average size of web pages" quite constant

- at least over this 1 week time scale

- suggests "aggregated user behavior" is constant??

- thus only number of users drives most of traffic???

New HTTP Response Data (cont.)

Some summaries (cont.):

4. Standard Deviation of Response Sizes

- Far less stable than mean

- Reflects "heavy tail" parameter in range (1,2)??

(stable mean, unstable variance)

- Maybe user behavior does change diurnally???

- Or are they "random instabilities"?

(as predicted by heavy tail theory?)

- no obvious pattern (time of day, day of week)?

- No correlation to size of traffic?

- Note: S. D. order of magnitude bigger than mean

- Thus strongly skewed distribution

(since all are positive)
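The "stable mean, unstable variance" reading can be illustrated by simulation. The sketch below assumes Pareto response sizes with tail index 1.5 (finite mean, infinite variance) and 21 blocks of equal size; these are illustrative choices, not estimates from the HTTP data.

```python
import random

def block_summaries(n_blocks, n_per_block, alpha, seed=0):
    """Mean and S.D. for each of n_blocks independent Pareto(alpha) samples."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_blocks):
        x = [rng.paretovariate(alpha) for _ in range(n_per_block)]
        mean = sum(x) / n_per_block
        sd = (sum((v - mean) ** 2 for v in x) / (n_per_block - 1)) ** 0.5
        out.append((mean, sd))
    return out

# ASSUMPTION: tail index 1.5, inside the (1,2) range discussed above
summaries = block_summaries(n_blocks=21, n_per_block=20000, alpha=1.5)
means = [m for m, _ in summaries]
sds = [s for _, s in summaries]

mean_spread = max(means) / min(means)   # ratio of largest to smallest block mean
sd_spread = max(sds) / min(sds)         # same ratio for the block S.D.s
print(f"mean spread: {mean_spread:.2f}   S.D. spread: {sd_spread:.2f}")
```

Even with identical underlying behavior in every block, the S.D. varies far more across blocks than the mean, so "random instabilities" alone could produce this kind of picture.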

New HTTP Response Data (cont.)

Some summaries (cont.):

5. Max Response

- "Instability" of same type as observed for S. D.

- In fact peaks in same locations!

- S. D. seems driven by maxima

- No big surprise, expected from extreme value theory?

- S. D. is not a very useful measure here

- Fortunately stored quantiles as well

(will analyze next time)
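To see how strongly a single maximum can drive the S.D., one can compute the share of the total squared deviation contributed by the most extreme observation. A rough sketch on simulated data; the Pareto tail index 1.5 and the exponential comparison are assumptions for the demo.

```python
import random

def max_share_of_variance(x):
    """Fraction of the sum of squared deviations due to the single largest one."""
    m = sum(x) / len(x)
    sq = [(v - m) ** 2 for v in x]
    return max(sq) / sum(sq)

rng = random.Random(4)
n = 20000
heavy = [rng.paretovariate(1.5) for _ in range(n)]   # ASSUMPTION: tail index 1.5
light = [rng.expovariate(1.0) for _ in range(n)]     # light-tailed comparison

share_heavy = max_share_of_variance(heavy)
share_light = max_share_of_variance(light)
print(f"heavy tail: largest point carries {share_heavy:.1%} of squared deviations")
print(f"light tail: largest point carries {share_light:.1%} of squared deviations")
```

When one observation carries a sizable share, the S.D. mostly tracks the maximum, which fits the matching peak locations noted above.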

New Flow Scatterplots

Recall Idea: look for "asymptotic independence"

Expectation: for heavy tailed distributions,

scatterplot should "hug axes"

Approach: for new HTTP Response Sizes,

plot "response time duration (sec)" vs. "size of response (bytes)"

Combined Graphic (all 21 time blocks)

Personal "axis hugging rating":

Monday Morning: Good

Monday Afternoon: Good

Monday Evening: OK

Tuesday Morning: Very good

Tuesday Afternoon: Good

Tuesday Evening: Poor

Wednesday Morning: Good

Wednesday Afternoon: Very good

Wednesday Evening: OK

Thursday Morning: OK

Thursday Afternoon: Good

Thursday Evening: Very good

Friday Morning: Poor

Friday Afternoon: Poor

Friday Evening: Good

Saturday Morning: Good

Saturday Afternoon: OK

Saturday Evening: OK

Sunday Morning: Good

Sunday Afternoon: Poor

Sunday Evening: OK

New Flow Scatterplots (cont.)

Other Observations:

- No apparent patterns??? With respect to:

- day of week?

- time of day?

- traffic load?

- max response size?

- Interesting "streaks" at some times

(strongest on Sunday morning)

- suggests a few common "transmission rates"

- most clear on Sun. morning, since least packet loss
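A streak through the origin in a duration vs. size scatterplot corresponds to a constant ratio size/duration, so one way to look for common transmission rates is a histogram of log10(rate). The sketch below uses synthetic responses; the two planted rates (7000 and 1.5e6 bytes/sec) and the noise level are hypothetical and chosen only to make the streaks visible.

```python
import math
import random
from collections import Counter

def rate_histogram(sizes, durations, bin_width=0.1):
    """Counts of responses per bin of log10(size / duration)."""
    counts = Counter()
    for s, d in zip(sizes, durations):
        if s > 0 and d > 0:
            counts[math.floor(math.log10(s / d) / bin_width)] += 1
    return counts

rng = random.Random(5)
sizes, durations = [], []
for planted in (7000.0, 1.5e6):                   # HYPOTHETICAL common rates
    for _ in range(500):
        size = 10 ** rng.uniform(3, 6)            # bytes
        rate = planted * rng.uniform(0.98, 1.02)  # small jitter around the rate
        sizes.append(size)
        durations.append(size / rate)
for _ in range(1000):                             # background with no common rate
    size = 10 ** rng.uniform(3, 6)
    sizes.append(size)
    durations.append(size / 10 ** rng.uniform(2, 7))

hist = rate_histogram(sizes, durations)
for b, count in hist.most_common(2):
    print(f"approx. rate {10 ** ((b + 0.5) * 0.1):.3g} bytes/sec: {count} responses")
```

Modes found this way could be overlaid as lines on the scatterplots and shown to a networking expert for interpretation.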

New Flow Scatterplots (cont.)

Things that could be done:

1. Follow up on "streaks", i.e. understand those transm'n rates?

- Do overlay of important ones on all plots?

- If they appear frequently, get expert interpretation?

- Apply 2-d SiZer type analysis to scatterplots?

2. Look at log scale versions?

- What do we expect to learn?

3. Get quantitative about "axis hugging"?

How interesting are these???
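One possible way to get quantitative about "axis hugging" (a suggestion, not something from the analysis above) is the empirical chance that a point extreme in one coordinate is also extreme in the other. Values near 1 - u suggest points that hug the axes, while values near 1 suggest points along the diagonal. The cutoff u = 0.95 and the simulated data are assumptions.

```python
import random

def tail_cooccurrence(x, y, u=0.95):
    """Empirical P(y in its top (1-u) fraction | x in its top (1-u) fraction)."""
    n = len(x)
    k = max(1, int(round((1.0 - u) * n)))     # number of "extreme" points per axis
    top_x = set(sorted(range(n), key=lambda i: x[i])[-k:])
    top_y = set(sorted(range(n), key=lambda i: y[i])[-k:])
    return len(top_x & top_y) / k

rng = random.Random(6)
n = 20000
x = [rng.paretovariate(1.5) for _ in range(n)]

# Independent heavy-tailed coordinates: large values rarely occur together.
y_indep = [rng.paretovariate(1.5) for _ in range(n)]
# Strongly dependent coordinates: y is x times bounded noise.
y_dep = [v * rng.uniform(0.5, 1.5) for v in x]

chi_indep = tail_cooccurrence(x, y_indep)
chi_dep = tail_cooccurrence(x, y_dep)
print(f"independent: {chi_indep:.3f}   dependent: {chi_dep:.3f}")
```

Being rank based, this number does not change under monotone rescaling of either axis, so it would give the same answer for log scale versions of the plots.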