Mortdecai 3cf815d28b Add Appendix A: When the Metric Is the Product
Explores the case where the unweighted mean is reported directly to the
client, making the metric itself the source of satisfaction. Under this
model the entire paper's conclusion inverts: SPT genuinely maximizes
client satisfaction at zero marginal cost.

Analyzes this as a moral hazard / pooling equilibrium using game theory,
identifies three fragility conditions (client inspects own ticket,
competitor offers per-ticket SLAs, team internalizes the metric), and
maps the pattern across domains (education, healthcare, finance, software).

Concludes: the incentive exists, the equilibrium is real, and it holds
until it doesn't.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 17:29:56 -04:00


Unweighted Average Completion Time Is Not a Fair Metric for Task Scheduling

A mathematical proof that unweighted average task completion time is a biased statistic that incentivizes cherry-picking easy work, and that any scheduling advantage it appears to reveal is an artifact of the metric — not a reflection of genuine throughput or service quality.


1. Definitions

Let there be n tasks with processing times p_1, p_2, \ldots, p_n.

A schedule \sigma is a permutation of \{1, 2, \ldots, n\} assigning tasks to execution order on a single executor.

The completion time of task \sigma(k) under schedule \sigma is:

C_{\sigma(k)} = \sum_{j=1}^{k} p_{\sigma(j)}

The unweighted mean completion time is:

\bar{C}(\sigma) = \frac{1}{n} \sum_{k=1}^{n} C_{\sigma(k)}

The work-weighted mean completion time is:

\bar{C}_w(\sigma) = \frac{\sum_{k=1}^{n} p_{\sigma(k)} \cdot C_{\sigma(k)}}{\sum_{k=1}^{n} p_{\sigma(k)}}
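As a minimal sketch of these definitions (the function names are illustrative and not part of the original text), the following Python computes completion times and both means. It assumes a schedule is passed as the list of processing times already in execution order, i.e. p_{\sigma(1)}, \ldots, p_{\sigma(n)}.

```python
from itertools import accumulate

def completion_times(order):
    """C_{sigma(k)} for k = 1..n: running sum of processing times."""
    return list(accumulate(order))

def unweighted_mean(order):
    """Mean completion time with every task counted equally."""
    c = completion_times(order)
    return sum(c) / len(c)

def work_weighted_mean(order):
    """Mean completion time with each task weighted by its processing time."""
    c = completion_times(order)
    return sum(p * ck for p, ck in zip(order, c)) / sum(order)

order = [2, 3, 5]  # hypothetical processing times, already in execution order
print(completion_times(order))  # [2, 5, 10]
print(unweighted_mean(order))
print(work_weighted_mean(order))
```

Passing the same processing times in a different order changes the first statistic but, as Section 3 proves, not the second.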

2. SPT Is Optimal for the Unweighted Statistic

Theorem 1. The schedule that minimizes \bar{C}(\sigma) is Shortest Processing Time first (SPT): sort tasks so that p_{\sigma(1)} \le p_{\sigma(2)} \le \cdots \le p_{\sigma(n)}.

Proof (exchange argument).

Consider any schedule \sigma in which two adjacent tasks i, j satisfy p_i > p_j with task i scheduled immediately before task j. Let t be the start time of task i.

| | Task i finishes | Task j finishes | Sum |
|---|---|---|---|
| Before swap (i then j) | t + p_i | t + p_i + p_j | 2t + 2p_i + p_j |
| After swap (j then i) | t + p_j + p_i | t + p_j | 2t + p_i + 2p_j |

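Every other task's completion time is unchanged, because the pair still occupies the same time interval. The swap therefore reduces the sum of completion times by:

(2p_i + p_j) - (p_i + 2p_j) = p_i - p_j > 0

Every swap of a longer-before-shorter adjacent pair strictly reduces the total. Any non-SPT schedule contains such a pair, and repeated swaps terminate at SPT. Therefore SPT minimizes \bar{C}(\sigma), uniquely up to the order of equal-length tasks. \blacksquare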
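The exchange argument can be sanity-checked by brute force on small instances. The sketch below is an illustrative check, not a proof; `best_orders` is a hypothetical helper name. It enumerates every ordering and returns those that attain the minimum total completion time.

```python
from itertools import accumulate, permutations

def total_completion(order):
    """Sum of completion times for tasks run in the given order."""
    return sum(accumulate(order))

def best_orders(p):
    """All distinct orderings of p that attain the minimum total completion time."""
    orders = set(permutations(p))
    best = min(total_completion(o) for o in orders)
    return sorted(o for o in orders if total_completion(o) == best)

print(best_orders([4, 1, 3, 2]))  # [(1, 2, 3, 4)]
print(best_orders([2, 1, 2]))     # [(1, 2, 2)]
```

In both cases the only minimizer is the ascending (SPT) order. With repeated processing times, the equal tasks can be exchanged freely without changing the total.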


3. The Work-Weighted Statistic Is Schedule-Invariant

Theorem 2. The work-weighted mean completion time \bar{C}_w(\sigma) is the same for every schedule \sigma.

Proof.

Expand the numerator:

\sum_{k=1}^{n} p_{\sigma(k)} \cdot C_{\sigma(k)} = \sum_{k=1}^{n} p_{\sigma(k)} \sum_{j=1}^{k} p_{\sigma(j)}

Reindex by letting a = \sigma(k) and b = \sigma(j). The double sum counts every ordered pair (a, b) where b is scheduled no later than a:

= \sum_{\substack{a, b \\ b \preceq_\sigma a}} p_a \, p_b

For any pair (a, b) with a \ne b, exactly one of \{b \preceq_\sigma a\} or \{a \prec_\sigma b\} holds. The diagonal terms (a = b) contribute p_a^2 regardless of order. Therefore:

\sum_{\substack{a, b \\ b \preceq_\sigma a}} p_a \, p_b = \sum_{a} p_a^2 + \sum_{\substack{a \ne b \\ b \prec_\sigma a}} p_a \, p_b

Now consider the complementary sum:

\sum_{\substack{a \ne b \\ a \prec_\sigma b}} p_a \, p_b

Together the two off-diagonal sums cover every ordered pair (a, b) with a \ne b exactly once:

\sum_{\substack{a \ne b \\ b \prec_\sigma a}} p_a \, p_b + \sum_{\substack{a \ne b \\ a \prec_\sigma b}} p_a \, p_b = \sum_{a \ne b} p_a \, p_b

The right-hand side is schedule-independent. Relabeling a \leftrightarrow b maps one off-diagonal sum onto the other, and p_a p_b is symmetric, so the two sums are equal:

\sum_{\substack{a \ne b \\ b \prec_\sigma a}} p_a \, p_b = \frac{1}{2} \sum_{a \ne b} p_a \, p_b

Therefore:

\sum_{k=1}^{n} p_{\sigma(k)} \cdot C_{\sigma(k)} = \sum_a p_a^2 + \frac{1}{2} \sum_{a \ne b} p_a \, p_b = \frac{1}{2}\left(\sum_a p_a\right)^2 + \frac{1}{2}\sum_a p_a^2

This expression contains no reference to \sigma. Since the denominator \sum p_a is also schedule-independent:

\bar{C}_w(\sigma) = \frac{\frac{1}{2}\left(\sum p_a\right)^2 + \frac{1}{2}\sum p_a^2}{\sum p_a}

is constant across all schedules. \blacksquare
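A quick numerical check of Theorem 2, using exact rational arithmetic to avoid floating-point noise. This is a sketch on one hypothetical instance; `closed_form` simply evaluates the expression derived above.

```python
from fractions import Fraction
from itertools import accumulate, permutations

def work_weighted_mean(order):
    """Work-weighted mean completion time as an exact fraction."""
    c = list(accumulate(order))
    return Fraction(sum(p * ck for p, ck in zip(order, c)), sum(order))

def closed_form(p):
    """((sum p)^2 + sum p^2) / (2 * sum p), the schedule-free expression."""
    s, sq = sum(p), sum(x * x for x in p)
    return Fraction(s * s + sq, 2 * s)

p = [3, 1, 4, 1, 5]  # hypothetical processing times
values = {work_weighted_mean(o) for o in permutations(p)}
print(values)          # a single value across all 120 orderings
print(closed_form(p))  # 62/7
```

The set of values collapses to one element, and that element matches the closed form.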


4. Concrete Example

Two tasks: A with p_A = 1 hour, B with p_B = 10 hours.

SPT order (A first)

| Task | Completion time (hours) |
|---|---|
| A | 1 |
| B | 11 |

  • Unweighted mean: (1 + 11) / 2 = 6.0
  • Work-weighted mean: (1 \times 1 + 10 \times 11) / 11 = 111/11 \approx 10.09

Reverse order (B first)

| Task | Completion time (hours) |
|---|---|
| B | 10 |
| A | 11 |

  • Unweighted mean: (10 + 11) / 2 = 10.5
  • Work-weighted mean: (10 \times 10 + 1 \times 11) / 11 = 111/11 \approx 10.09

SPT appears 4.5 hours better on the unweighted metric but provides zero improvement on the work-weighted metric. The apparent advantage exists only because the unweighted statistic lets a 1-hour task "vote" equally with a 10-hour task.


5. Connection to Little's Law

Little's Law states L = \lambda W, where L is the time-averaged number of tasks in the system, \lambda is the arrival rate, and W is the average time a task spends in the system.

In a steady-state queueing system with fixed arrival and service rates, \lambda and the long-run service rate are determined by the workload, not by scheduling policy. Little's Law then tells us that L and W are linked, but in the batch case (all n tasks present at time 0), L and W are both schedule-dependent: \bar{C} = W, and L = \sum C_i / \sum p_i, both of which SPT minimizes.
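The batch-case identity can be checked directly. The sketch below treats the batch as a system observed over [0, \sum p_i], so that \lambda = n / \sum p_i. That choice of observation window is an assumption of this illustration, as is the helper name `littles_law_terms`.

```python
from fractions import Fraction
from itertools import accumulate

def littles_law_terms(order):
    """Return (L, lam, W) for a batch of tasks all present at time 0."""
    c = list(accumulate(order))
    n, makespan = len(order), c[-1]
    # Integrate N(t): while the k-th task (0-based) runs, n - k tasks remain in the system.
    area = sum((n - k) * p for k, p in enumerate(order))
    L = Fraction(area, makespan)   # time-averaged number of tasks in the system
    lam = Fraction(n, makespan)    # completions per unit time over the batch
    W = Fraction(sum(c), n)        # mean time in system = unweighted mean completion
    return L, lam, W

for order in ([1, 10], [10, 1]):
    L, lam, W = littles_law_terms(order)
    print(order, L, lam, W, L == lam * W)
```

Both orderings satisfy L = \lambda W, but SPT yields the smaller L and W, which is why Little's Law by itself does not imply schedule invariance.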

The invariance we proved in Theorem 2 is more specific: work-weighted mean completion time \bar{C}_w is constant across schedules. This corresponds to measuring the system from the perspective of "how long does a unit of work wait" rather than "how long does a task wait." The unweighted statistic measures the latter and is gameable precisely because it counts completions rather than work.


6. Consequences

Theorem 3 (Metric Bias). Any scheduling policy that minimizes unweighted mean completion time necessarily maximizes the completion time of the largest task relative to other schedules.

Proof. SPT places the largest task last. Its completion time equals the total processing time \sum p_i, which is the maximum possible completion time for any individual task. Any order that does not place the largest task last (for example, FIFO when the large task arrived early) would let it finish earlier. \blacksquare

This creates a starvation incentive: with ongoing arrivals, rational agents optimizing the unweighted statistic will keep deferring large tasks in favor of newly arrived small ones.

Real-world manifestations

| Domain | Gameable metric | Perverse outcome |
|---|---|---|
| Support desks | Tickets closed / day | Complex issues ignored |
| Sprint planning | Story count velocity | Work split into trivial pieces |
| Emergency rooms | Average wait time | Critical patients deprioritized |
| Academic publishing | Papers per year | Incremental work favored over deep research |

7. Impact on Client Satisfaction and Team Productivity

The preceding theorems are not merely abstract. They have direct, provable consequences for client satisfaction and team productivity when a team adopts unweighted mean completion time as its performance metric.

7.1 Defining Client Satisfaction: The Slowdown Ratio

A client submitting a task of size p_i has an expectation anchored to that size. The natural measure of their experience is the slowdown ratio:

S_i = \frac{C_i}{p_i}

This is the factor by which the client's wait exceeds the task's inherent processing time. A slowdown of 1 means no queuing delay at all. A slowdown of 10 means the client waited 10x longer than the work itself required.

Client satisfaction is inversely related to slowdown: a client who waits 2x their task size is more satisfied than one who waits 20x, regardless of the absolute times involved.

Theorem 4 (SPT Maximizes Completion Time of the Largest Task). SPT assigns the maximum possible completion time (\sum p_i) to the largest task; any schedule that does not place the largest task last completes it strictly earlier.

Proof.

SPT sorts tasks in ascending order of p_i, placing the largest task p_{\max} in the last position. The last task in any schedule has completion time \sum_{i=1}^{n} p_i, which is the maximum completion time any individual task can receive. Therefore, under SPT:

C_{\max\text{-task}}^{\text{SPT}} = \sum_{i=1}^{n} p_i

Under any schedule that does not place p_{\max} last, the largest task completes strictly before \sum p_i. SPT is therefore among the worst schedules for the largest task: no schedule gives it a later completion time.

Note on slowdown: SPT actually compresses slowdown ratios (S_i = C_i / p_i) because larger tasks in later positions have large denominators that absorb the accumulated sum. For example, with tasks [1, 5, 10]:

  • SPT: slowdowns [1, 1.2, 1.6] — low variance
  • LPT: slowdowns [1, 3, 16] — high variance

SPT's harm to large-task clients is not visible in the slowdown ratio. It is visible in absolute completion time: the largest task finishes last, at \sum p_i, while under any ordering that does not place it last it finishes earlier. \blacksquare

Corollary 4.1. A team optimizing unweighted mean completion time will systematically deliver the worst experience to clients with the most complex needs.

This is not a side effect — it is the mechanism by which the metric improves. The only way to lower the unweighted average is to complete more small tasks early, which necessarily means completing large tasks later. The metric improves because high-effort clients are deprioritized.

7.2 The Absolute Delay Burden

The slowdown ratio S_i = C_i / p_i might suggest SPT is fair — it compresses slowdown variance by giving everyone a ratio close to 1. But this obscures the real cost. The correct measure of burden is the absolute delay experienced by each task:

\Delta_i = C_i - p_i

This is the time a task spends waiting for other tasks, independent of its own size. Under any sequential schedule, the total delay across all tasks is schedule-dependent (it equals \sum C_i - \sum p_i), and SPT minimizes this total. But the distribution of delay matters.

Theorem 5 (SPT Concentrates Delay on the Largest Task). Under SPT, the largest task bears the maximum possible absolute delay: strictly more than under any schedule that does not place it last.

Proof. Under SPT, the largest task is in position n with:

\Delta_{\max\text{-task}}^{\text{SPT}} = C_{\sigma(n)} - p_{\sigma(n)} = \sum_{k=1}^{n-1} p_{\sigma(k)}

This is the sum of all other tasks' processing times — the maximum possible delay for any single task. Under any schedule where the largest task is not last, its delay is strictly less than \sum_{i \ne \max} p_i.

Meanwhile, SPT gives the smallest task zero delay (\Delta_1^{\text{SPT}} = 0). The entire queuing burden is shifted from small tasks to large tasks. \blacksquare

The tension is this: SPT minimizes total delay (good for aggregate efficiency) by concentrating delay onto the tasks best able to "absorb" it in slowdown-ratio terms. But in absolute terms — hours spent waiting — the largest task bears the full weight. If that task represents a critical business need, the absolute delay, not the ratio, determines the damage.
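The contrast between slowdown and absolute delay can be made concrete with the [1, 5, 10] instance used earlier. The helper names below are illustrative.

```python
from itertools import accumulate

def per_task_stats(order):
    """For each task in execution order: (size p, completion C, slowdown C/p, delay C - p)."""
    return [(p, c, c / p, c - p) for p, c in zip(order, accumulate(order))]

def summarize(order):
    stats = per_task_stats(order)
    largest = max(stats, key=lambda s: s[0])
    return {
        "total_delay": sum(s[3] for s in stats),
        "max_slowdown": max(s[2] for s in stats),
        "largest_task_delay": largest[3],
    }

for name, order in [("SPT", [1, 5, 10]), ("LPT", [10, 5, 1])]:
    print(name, summarize(order))
```

SPT gives the lower total delay (7 versus 25) and the lower worst slowdown (1.6 versus 16), yet the 10-hour task waits 6 hours under SPT and 0 hours under LPT. Which of those numbers matters depends on what the largest task represents.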

7.3 Productivity Is Not Improved

Theorem 6 (Throughput Invariance). Total work completed over any time horizon T is identical under all scheduling policies.

Proof. The executor processes work at a fixed rate. Over time T, the total work completed is:

W(T) = \sum_{\{i : C_i \le T\}} p_i + \text{(partial progress on current task)}

Because W(T) counts partial progress and the executor is never idle while work remains, W(T) = \min(T, \sum p_i) under every schedule. Only the completed-task portion of the sum depends on which task is in progress at time T. In particular, over any horizon T \ge \sum p_i (i.e., long enough to complete all tasks), the total work done is exactly \sum p_i regardless of order.

For the steady-state case with ongoing arrivals and a backlog that never empties, the long-run throughput equals the service rate \mu and is independent of scheduling:

\lim_{T \to \infty} \frac{W(T)}{T} = \mu \quad \text{for all schedules } \sigma

\blacksquare
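A small sketch of the distinction between work done and tasks closed, assuming a single executor that processes one unit of work per unit of time and is never idle while tasks remain. The function names are hypothetical.

```python
from itertools import accumulate

def work_done(order, T):
    """Work processed by time T, counting partial progress on the task in service."""
    done, start = 0, 0
    for p in order:
        # Overlap of this task's service interval [start, start + p] with [0, T].
        done += max(0, min(T, start + p) - start)
        start += p
    return done

def completed_work(order, T):
    """Work contained in tasks that have fully finished by time T (no partial credit)."""
    return sum(p for p, c in zip(order, accumulate(order)) if c <= T)

T = 5
for order in ([1, 10], [10, 1]):
    print(order, work_done(order, T), completed_work(order, T))
```

At T = 5 both orderings have processed 5 hours of work. SPT has one closed task to show for it and the reverse order has none, which is exactly the difference the unweighted metric rewards.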

Corollary 6.1. A team that switches from any scheduling policy to SPT will observe an improvement in unweighted mean completion time with zero change in actual throughput.

The metric improves. The output does not.

7.4 The Compound Effect: Satisfaction Down, Productivity Flat

Combining Theorems 4, 5, and 6:

| Measure | Effect of optimizing unweighted mean |
|---|---|
| Throughput (work/time) | No change (Theorem 6) |
| Delay for small tasks | Minimized; approaches zero (SPT) |
| Delay for large tasks | Maximized; bears all queuing burden (Theorem 5) |
| Completion time of largest task | Maximum possible: \sum p_i (Theorem 4) |
| Overall perceived quality of service | Net negative (see below) |

The net effect on perceived quality is negative because:

  1. Loss aversion is asymmetric. A client whose 100-hour task is deprioritized to last experiences a large, salient negative. A client whose 1-hour task moves from position 5 to position 1 experiences a small, often unnoticed positive. The absolute dissatisfaction created exceeds the absolute satisfaction gained.

  2. High-effort tasks correlate with high-value clients. Large tasks are disproportionately likely to come from major clients, complex contracts, or critical business needs. Systematically giving these clients the worst experience is anti-correlated with revenue and retention.

  3. Starvation compounds. In a continuous system (Theorem 3), large tasks are not merely delayed — they may be indefinitely deferred as new small tasks keep arriving. The affected client's satisfaction does not merely decrease; it collapses entirely.

Theorem 7 (The Core Result). For a team processing tasks of non-uniform size, adopting unweighted mean completion time as a performance metric:

(a) provides zero productivity gain (Theorem 6), (b) assigns the maximum possible completion time to the largest task (Theorem 4), and (c) concentrates queuing delay onto the largest tasks while eliminating delay for the smallest (Theorem 5).

This is not a tradeoff — there is no compensating benefit on the productivity side. The metric creates a pure transfer of service quality from high-effort clients to low-effort clients, with no net work gained.

A team using unweighted mean completion time as its performance metric will, under rational optimization, simultaneously fail to improve productivity and systematically degrade the experience of its most demanding clients. \blacksquare


8. When Unweighted Mean Completion Time Is Valid

For completeness: the unweighted metric is appropriate if and only if all tasks are approximately equal in size (p_i \approx p_j for all i, j). In this case, the work-weighted and unweighted statistics converge, every ordering is approximately an SPT ordering, and the set of completion times and slowdown ratios is nearly the same under any schedule.

The pathology arises specifically from variance in task size. The greater the variance, the greater the distortion, and the more damage the metric causes when optimized.


9. Complete Breakdown Under Priority Classification

The preceding sections proved that unweighted mean completion time is biased when tasks vary in size. We now show that introducing a priority system — as virtually all real teams use — causes the metric to become not merely biased but actively adversarial to the organization's stated goals.

9.1 Extended Model: Tasks With Priority

Let each task i have processing time p_i and a priority class q_i \in \{1, 2, 3, 4\} where 1 is the highest priority (critical) and 4 is the lowest (cosmetic/enhancement). Assign priority weights:

w(q) = \begin{cases} 8 & q = 1 \text{ (Critical)} \\ 4 & q = 2 \text{ (High)} \\ 2 & q = 3 \text{ (Medium)} \\ 1 & q = 4 \text{ (Low)} \end{cases}

The specific weights are illustrative; the results hold for any strictly decreasing weight function. The key property is that priority is assigned by business impact, not by task size.

9.2 The Metric Contradicts the Priority System

Theorem 8 (Priority-Size Inversion). When priority is independent of task size, the schedule that minimizes unweighted mean completion time (SPT) will, in expectation, complete low-priority tasks before high-priority tasks of greater size.

Proof.

SPT orders tasks by p_i ascending, regardless of q_i. Consider two tasks:

  • Task A: p_A = 40 hours, q_A = 1 (Critical — e.g., server outage)
  • Task B: p_B = 0.5 hours, q_B = 4 (Low — e.g., cosmetic UI fix)

SPT schedules B before A. The unweighted mean completion time for this pair:

\bar{C}^{\text{SPT}} = \frac{0.5 + 40.5}{2} = 20.5

The priority-respecting order (A before B):

\bar{C}^{\text{priority}} = \frac{40 + 40.5}{2} = 40.25

The metric declares SPT nearly twice as good — despite completing a cosmetic fix while a server outage burns for an additional 0.5 hours.

In general, for n tasks where priority q_i is statistically independent of processing time p_i (a reasonable assumption, since priority reflects business impact while processing time reflects technical complexity):

\text{Corr}(p_i, q_i) \approx 0

SPT's ordering is determined entirely by p_i. The expected position of a task in the SPT schedule has zero correlation with its priority. A Critical task is equally likely to be scheduled first or last.

More precisely: the expected fraction of Critical tasks in the bottom half of the SPT schedule equals the fraction of Critical tasks whose processing time exceeds the median. In practice, Critical tasks (outages, security incidents, data loss) often require more work, so this fraction exceeds 50%. The metric is not merely uncorrelated with priority — it is plausibly anti-correlated. \blacksquare

9.3 Dimensionality Collapse

The unweighted mean completion time reduces a three-dimensional task (p_i, q_i, C_i) to a one-dimensional signal (C_i), then averages that signal uniformly. This discards two of the three dimensions:

  1. Priority (q_i) is completely ignored. A critical task and a cosmetic task contribute identically to the mean.
  2. Size (p_i) is implicitly inverted. Small tasks are rewarded with early completion, large tasks are punished — regardless of their importance.

Theorem 9 (Information Destruction). Let I(\sigma) be the mutual information between the schedule's implicit priority ranking (position in schedule) and the actual priority assignment q_i. For SPT:

I(\sigma_{\text{SPT}}) = 0 \quad \text{when } p_i \perp q_i

Proof. SPT assigns positions based solely on p_i. When p_i and q_i are independent, knowing a task's position in the SPT schedule provides zero information about its priority. The schedule is statistically independent of the priority system.

Contrast this with a priority-first schedule, where I > 0 by construction. \blacksquare

Corollary 9.1. A team that optimizes unweighted mean completion time is operating a scheduling system that carries zero information about its own priority classification. The priority field in their ticketing system is, with respect to execution order, decorative.

9.4 Quantifying the Damage: Priority-Weighted Delay Cost

Define the priority-weighted delay cost of a schedule:

D(\sigma) = \sum_{i=1}^{n} w(q_i) \cdot C_i

This measures the total business-impact-weighted time spent waiting.

Theorem 10 (SPT and Priority-Weighted Delay Cost). The optimal schedule for minimizing priority-weighted delay cost D(\sigma) is WSJF: order by w(q_i)/p_i descending. SPT's ordering (by 1/p_i descending) ignores priority entirely and produces strictly higher D than WSJF whenever it runs a task with lower w(q_i)/p_i ahead of one with higher w(q_i)/p_i.

Proof. By the standard exchange argument (as in Theorem 1), swapping adjacent tasks i, j (with i scheduled immediately before j) reduces D by:

\Delta D = w(q_j) \cdot p_i - w(q_i) \cdot p_j

The swap improves D when \Delta D > 0, i.e., when w(q_j)/p_j > w(q_i)/p_i but j is scheduled after i. Therefore the optimal order is decreasing w(q_i)/p_i, which is the WSJF rule.

SPT orders by p_i ascending (equivalently, 1/p_i descending), which corresponds to WSJF only when w(q_i) = \text{const} — i.e., when all tasks have equal priority.

Example. Two tasks: Critical (w = 8, p_H = 10) and Low (w = 1, p_L = 1).

WSJF scores: Critical = 8/10 = 0.8, Low = 1/1 = 1.0.

WSJF places the Low task first (higher w/p), same as SPT. Here, SPT and WSJF agree because the Low task's tiny size dominates despite its low weight.

Now consider: Critical (w = 8, p_H = 3) and Low (w = 1, p_L = 2).

WSJF scores: Critical = 8/3 = 2.67, Low = 1/2 = 0.5.

WSJF places Critical first. SPT places Low first (smaller p). The costs:

  • SPT (Low first): D = 1 \cdot 2 + 8 \cdot 5 = 42
  • WSJF (Critical first): D = 8 \cdot 3 + 1 \cdot 5 = 29

SPT incurs 45% more priority-weighted delay because it ignores the 8x priority weight of the Critical task.

In general, SPT diverges from WSJF, and produces suboptimal D, whenever some task is both shorter and lower in w(q_i)/p_i than another. In practice, Critical tasks tend to be larger (outages, security incidents), making the divergence systematic rather than occasional. \blacksquare
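The two-task example, plus a brute-force check that the w/p ordering attains the minimum D, can be sketched as follows. Tasks are represented as (w, p) pairs. The four-task instance and the helper names are hypothetical and chosen only for illustration.

```python
from itertools import permutations

def delay_cost(tasks):
    """D = sum of w * C for (w, p) pairs listed in execution order."""
    t, total = 0, 0
    for w, p in tasks:
        t += p
        total += w * t
    return total

def spt(tasks):
    """Shortest processing time first; ignores the weight entirely."""
    return sorted(tasks, key=lambda wp: wp[1])

def wsjf(tasks):
    """Weighted shortest job first: descending w / p."""
    return sorted(tasks, key=lambda wp: wp[0] / wp[1], reverse=True)

pair = [(8, 3), (1, 2)]  # Critical (w=8, p=3) and Low (w=1, p=2)
print(delay_cost(spt(pair)), delay_cost(wsjf(pair)))  # 42 29

queue = [(8, 3), (1, 2), (4, 4), (2, 1)]  # hypothetical four-task queue
best = min(delay_cost(order) for order in permutations(queue))
print(best, delay_cost(wsjf(queue)), delay_cost(spt(queue)))
```

On the four-task queue, exhaustive search and WSJF agree on the minimum, while SPT lands above it.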


10. A Proposed Solution: Priority-Weighted Completion Score

10.1 The Metric

Replace unweighted mean completion time with the Priority-Weighted Completion Score (PWCS):

\text{PWCS}(\sigma) = \frac{\sum_{i=1}^{n} w(q_i) \cdot \frac{C_i}{p_i}}{\sum_{i=1}^{n} w(q_i)}

This is the priority-weighted mean slowdown ratio. It measures:

  • How long each task waited relative to its size (the slowdown C_i / p_i), weighted by
  • How much that task mattered (the priority weight w(q_i)).

Lower is better. A PWCS of 1.0 means every task finished with zero queuing delay (C_i = p_i). A PWCS of 3.0 means that, weighted by importance, the average task took 3x its processing time to complete.

10.2 Properties of PWCS

Property 1: Priority-respecting. PWCS penalizes delays to high-priority tasks more heavily than low-priority tasks. A 2-hour delay to a Critical task costs 8x more than the same delay to a Low task of the same size.

Property 2: Size-fair. By using the slowdown ratio C_i / p_i rather than raw completion time C_i, the metric does not inherently penalize large tasks for being large. A 40-hour task that completes after 80 hours contributes the same slowdown (2.0) as a 1-hour task that completes after 2 hours.

Property 3: Not optimized by SPT alone. Because the metric weights by priority, sorting purely by processing time is optimal only when all priorities are equal; in general the best order also depends on w(q_i). The size normalization does still favor short tasks (Section 10.3), so the optimal strategy has to balance priority against size rather than follow size alone.

Property 4: Reduces to unweighted mean when tasks are uniform. If all tasks have equal priority and equal size, PWCS equals the unweighted mean completion time divided by the common task size. It is a strict generalization.

10.3 Optimal Policy for PWCS

Theorem 11. The schedule minimizing the priority-weighted completion time PWCT (defined below) processes tasks in order of decreasing w(q_i) / p_i; within a priority class this means shortest processing time first. PWCS itself is minimized by a different rule, decreasing w(q_i) / p_i^2.

Proof (exchange argument, as in Theorem 1).

Consider adjacent tasks i, j with i before j. Swapping them leaves every other task's completion time unchanged, delays i by p_j, and advances j by p_i. The weighted slowdown sum in the PWCS numerator therefore changes by:

w(q_i) \cdot \frac{p_j}{p_i} - w(q_j) \cdot \frac{p_i}{p_j}

The swap improves PWCS when this quantity is negative, i.e., when:

\frac{w(q_j)}{p_j^2} > \frac{w(q_i)}{p_i^2}

So PWCS is minimized by ordering on w(q_i)/p_i^2, not w(q_i)/p_i: dividing by task size makes the metric favor short tasks even more strongly than WSJF does. The w/p rule belongs to the more practical priority-weighted completion time:

\text{PWCT}(\sigma) = \frac{\sum_{i=1}^{n} w(q_i) \cdot C_i}{\sum_{i=1}^{n} w(q_i)}

For PWCT, the exchange argument gives: swap improves the score when w(q_j) \cdot p_i > w(q_i) \cdot p_j, i.e., when w(q_j)/p_j > w(q_i)/p_i but j is scheduled after i. The optimal order is therefore decreasing w(q_i)/p_i, which is the Weighted Shortest Job First (WSJF) rule:

\text{Schedule by: } \frac{w(q_i)}{p_i} \text{ descending}

This means: within a priority class, do short tasks first; across priority classes, a Critical 8-hour task (w/p = 8/8 = 1.0) ties with a Low 1-hour task (w/p = 1/1 = 1.0) — but a Critical 4-hour task (w/p = 8/4 = 2.0) beats both. \blacksquare
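Because the two exchange conditions in this proof differ (one involves w/p^2, the other w/p), it is worth checking by exhaustive search which ordering each metric actually rewards. The sketch below does so on one hypothetical three-task instance of (w, p) pairs, using exact fractions. The instance was picked so that the w/p, w/p^2, and SPT orders are all different.

```python
from fractions import Fraction
from itertools import permutations

def pwct(tasks):
    """Priority-weighted mean completion time for (w, p) pairs in execution order."""
    t, num = 0, Fraction(0)
    for w, p in tasks:
        t += p
        num += w * t
    return num / sum(w for w, _ in tasks)

def pwcs(tasks):
    """Priority-weighted mean slowdown C / p for (w, p) pairs in execution order."""
    t, num = 0, Fraction(0)
    for w, p in tasks:
        t += p
        num += Fraction(w * t, p)
    return num / sum(w for w, _ in tasks)

def argmin_orders(metric, tasks):
    """Every ordering that attains the minimum of the given metric."""
    best = min(metric(o) for o in permutations(tasks))
    return [o for o in permutations(tasks) if metric(o) == best]

tasks = [(8, 4), (1, 1), (2, 3)]  # hypothetical (w, p) pairs
print(argmin_orders(pwct, tasks))  # [((8, 4), (1, 1), (2, 3))]  descending w/p
print(argmin_orders(pwcs, tasks))  # [((1, 1), (8, 4), (2, 3))]  descending w/p^2
```

On this instance PWCT is minimized by the w/p order and PWCS by the w/p^2 order, and neither coincides with plain SPT, which would run (1, 1), (2, 3), (8, 4).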

10.4 Applied Example: IT Service Desk

Consider an IT team with the following ticket queue on a Monday morning:

| Ticket | Priority | Type | Est. Hours |
|---|---|---|---|
| T1 | P1 (Critical) | Email server down | 6 |
| T2 | P2 (High) | VPN failing for remote team | 4 |
| T3 | P3 (Medium) | New employee laptop setup | 2 |
| T4 | P4 (Low) | Update desktop wallpaper policy | 0.5 |
| T5 | P3 (Medium) | Install software license | 1 |
| T6 | P1 (Critical) | Database backup failing | 3 |
| T7 | P2 (High) | Printer fleet offline | 2 |
| T8 | P4 (Low) | Archive old shared drive folder | 0.25 |

SPT order (optimizing unweighted mean): T8, T4, T5, T3, T7, T6, T2, T1

| Position | Ticket | Priority | Hours | Completion | Slowdown |
|---|---|---|---|---|---|
| 1 | T8 (archive folder) | P4 Low | 0.25 | 0.25 | 1.0 |
| 2 | T4 (wallpaper) | P4 Low | 0.5 | 0.75 | 1.5 |
| 3 | T5 (software) | P3 Med | 1 | 1.75 | 1.75 |
| 4 | T3 (laptop) | P3 Med | 2 | 3.75 | 1.875 |
| 5 | T7 (printers) | P2 High | 2 | 5.75 | 2.875 |
| 6 | T6 (backups) | P1 Crit | 3 | 8.75 | 2.917 |
| 7 | T2 (VPN) | P2 High | 4 | 12.75 | 3.1875 |
| 8 | T1 (email) | P1 Crit | 6 | 18.75 | 3.125 |

  • Unweighted mean completion: (0.25 + 0.75 + 1.75 + 3.75 + 5.75 + 8.75 + 12.75 + 18.75) / 8 = 6.5625 hours
  • PWCT: (1 \cdot 0.25 + 1 \cdot 0.75 + 2 \cdot 1.75 + 2 \cdot 3.75 + 4 \cdot 5.75 + 8 \cdot 8.75 + 4 \cdot 12.75 + 8 \cdot 18.75) / 30 = 306/30 = 10.2 hours
  • Email server is down for 18.75 hours. Database backups fail for 8.75 hours.

WSJF order (optimizing PWCT by w(q)/p descending):

| Ticket | Priority | Hours | w/p |
|---|---|---|---|
| T8 | P4 Low | 0.25 | 1/0.25 = 4.0 |
| T6 | P1 Crit | 3 | 8/3 = 2.667 |
| T5 | P3 Med | 1 | 2/1 = 2.0 |
| T4 | P4 Low | 0.5 | 1/0.5 = 2.0 |
| T7 | P2 High | 2 | 4/2 = 2.0 |
| T1 | P1 Crit | 6 | 8/6 = 1.333 |
| T2 | P2 High | 4 | 4/4 = 1.0 |
| T3 | P3 Med | 2 | 2/2 = 1.0 |

Note that T8 has w/p = 4.0, the highest, so pure WSJF places a Low-priority task first, which feels wrong. This reveals an important practical point: pure WSJF can still be gamed by tiny tasks because their small p inflates the ratio. In practice, this is mitigated by enforcing strict priority class ordering and only applying WSJF within priority classes.

Practical WSJF (priority-class-first, then w/p within class):

| Position | Ticket | Priority | Hours | Completion |
|---|---|---|---|---|
| 1 | T6 (backups) | P1 Crit | 3 | 3 |
| 2 | T1 (email) | P1 Crit | 6 | 9 |
| 3 | T7 (printers) | P2 High | 2 | 11 |
| 4 | T2 (VPN) | P2 High | 4 | 15 |
| 5 | T5 (software) | P3 Med | 1 | 16 |
| 6 | T3 (laptop) | P3 Med | 2 | 18 |
| 7 | T8 (archive) | P4 Low | 0.25 | 18.25 |
| 8 | T4 (wallpaper) | P4 Low | 0.5 | 18.75 |

  • Unweighted mean completion: (3 + 9 + 11 + 15 + 16 + 18 + 18.25 + 18.75) / 8 = 13.625 hours
  • PWCT: (8 \cdot 3 + 8 \cdot 9 + 4 \cdot 11 + 4 \cdot 15 + 2 \cdot 16 + 2 \cdot 18 + 1 \cdot 18.25 + 1 \cdot 18.75) / 30 = 305/30 = 10.167 hours
  • Email server restored in 9 hours. Backups fixed in 3 hours.

Comparison

| Metric | SPT | Practical WSJF | Winner |
|---|---|---|---|
| Unweighted mean completion | 6.5625 hrs | 13.625 hrs | SPT |
| Priority-weighted completion (PWCT) | 10.2 hrs | 10.167 hrs | WSJF |
| Time to fix email server | 18.75 hrs | 9 hrs | WSJF |
| Time to fix database backups | 8.75 hrs | 3 hrs | WSJF |
| Time to fix printers | 5.75 hrs | 11 hrs | SPT |
| Time to update wallpaper | 0.75 hrs | 18.75 hrs | SPT |

The PWCT values are nearly identical (10.2 vs 10.167). Strict class-first ordering is not the PWCT optimum (pure WSJF is, by Theorem 10); it runs the 6-hour email fix ahead of six shorter tickets, which gives back most of the aggregate gain. PWCT is therefore not the right metric for this comparison. The real difference is visible in the individual completion times of critical tasks: the email server is down for 18.75 hours under SPT versus 9 hours under practical WSJF. The database backups fail for 8.75 hours versus 3 hours.

The better comparison metric is the priority-weighted delay cost D = \sum w(q_i) \cdot C_i (not normalized):

  • SPT: D = 306 priority-weighted hours
  • Practical WSJF: D = 305 priority-weighted hours

Again, the aggregate is similar. The damage from SPT is not in the aggregate — it is in the distribution: critical systems burn while cosmetic tasks are polished. A metric that cannot distinguish between these two schedules — despite one leaving the email server down for twice as long — is not measuring what matters.

The unweighted metric, however, confidently reports SPT as more than twice as efficient (6.56 vs 13.63), rewarding the team that updated desktop wallpaper while the email server was on fire.
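The figures in this example can be reproduced with a short script. It is a sketch that encodes the ticket queue and the Section 9.1 weights; the tuple layout and function names are illustrative choices. It also evaluates pure WSJF for comparison, which the tables above do not carry through to completion times.

```python
# (ticket, priority class, estimated hours), in the order listed in the queue table
TICKETS = [("T1", 1, 6), ("T2", 2, 4), ("T3", 3, 2), ("T4", 4, 0.5),
           ("T5", 3, 1), ("T6", 1, 3), ("T7", 2, 2), ("T8", 4, 0.25)]
WEIGHT = {1: 8, 2: 4, 3: 2, 4: 1}  # priority weights from Section 9.1

def evaluate(order):
    """Return (unweighted mean completion, D, completion time per ticket)."""
    t, total_c, d, done = 0.0, 0.0, 0.0, {}
    for name, q, hours in order:
        t += hours
        total_c += t
        d += WEIGHT[q] * t
        done[name] = t
    return total_c / len(order), d, done

spt = sorted(TICKETS, key=lambda x: x[2])                 # by hours only
class_first = sorted(TICKETS, key=lambda x: (x[1], x[2]))  # class, then hours
pure_wsjf = sorted(TICKETS, key=lambda x: WEIGHT[x[1]] / x[2], reverse=True)

for label, order in [("SPT", spt), ("class-first", class_first), ("pure WSJF", pure_wsjf)]:
    mean_c, d, done = evaluate(order)
    print(f"{label:12s} mean={mean_c:.4f}  D={d:.2f}  email={done['T1']}  backups={done['T6']}")
```

Under these inputs pure WSJF reaches D = 273, well below both SPT (306) and class-first (305), but it restores email only at 12.75 hours. That is consistent with the point above: the aggregate and the per-ticket view answer different questions.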

The IT example reveals that even priority-weighted aggregate metrics (PWCT) can fail to distinguish good from bad schedules, because aggregation hides distributional damage. No single metric suffices. A complete measurement system for a priority-based team should track:

| Metric | What it measures | Formula |
|---|---|---|
| Mean completion by priority class | Per-class responsiveness | \bar{C} filtered by q |
| P1 mean time to resolution | Critical incident response | \bar{C} filtered to q = 1 |
| Throughput | Raw work capacity | Work-hours completed / calendar time |
| Aging violations | Starvation prevention | Count of tasks exceeding SLA by priority |
| Max completion time (P1/P2) | Worst-case critical response | \max(C_i) filtered to q \le 2 |

The key insight from our analysis: per-priority-class metrics (rows 1-2, 5) expose scheduling failures that aggregate metrics hide. If P1 mean time to resolution is 14 hours while P4 mean is 0.5 hours, the team is optimizing the wrong metric — regardless of what the aggregate says.
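A per-class report of this kind is straightforward to compute. The sketch below takes (priority class, hours) pairs in execution order and returns the mean and maximum completion time for each class; the two orderings are the SPT and class-first schedules from Section 10.4, and the function name is hypothetical.

```python
from collections import defaultdict

def per_class_report(order):
    """order: (priority class, hours) pairs in execution order.
    Returns {class: (mean completion, max completion)}."""
    t, by_class = 0.0, defaultdict(list)
    for q, hours in order:
        t += hours
        by_class[q].append(t)
    return {q: (sum(c) / len(c), max(c)) for q, c in sorted(by_class.items())}

spt_order = [(4, 0.25), (4, 0.5), (3, 1), (3, 2), (2, 2), (1, 3), (2, 4), (1, 6)]
class_first_order = [(1, 3), (1, 6), (2, 2), (2, 4), (3, 1), (3, 2), (4, 0.25), (4, 0.5)]

print(per_class_report(spt_order))
print(per_class_report(class_first_order))
```

For the SPT schedule the P1 mean is 13.75 hours against a P4 mean of 0.5 hours; for the class-first schedule the P1 mean drops to 6 hours. A single aggregate number hides that gap.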


11. Devil's Advocate: The Case for Unweighted Mean Completion Time

Intellectual honesty requires acknowledging where the preceding argument has limits. The following are genuine counterarguments — not strawmen.

11.1 Simplicity Has Real Value

Argument. The unweighted mean is trivially computable: sum the completion times, divide by the count. It requires no priority weights, no task-size estimates, no calibration. Every alternative proposed in Section 10 requires estimating p_i (task size) before the task is complete — and these estimates are notoriously unreliable.

Assessment: This is true. PWCS and PWCT require inputs (priority weights, size estimates) that introduce their own sources of error. If size estimates are systematically wrong — and in software engineering they often are, with large tasks underestimated and small tasks overestimated — then the weighted metric inherits that noise.

However, the unweighted metric does not avoid this problem — it hides it by implicitly setting all weights to 1 and all sizes to 1. That is not "making no assumptions"; it is making the specific assumption that all tasks are equally important and equally sized, which is demonstrably false in any real system. A known-imprecise estimate of task size is still more informative than the implicit assumption that all sizes are equal.

11.2 Minimizing the Number of People Waiting

Argument. If each task represents one client, then minimizing unweighted mean completion time minimizes the total person-hours spent waiting. SPT is optimal for this because completing short tasks first "frees" the most people from the queue earliest.

Assessment: This is mathematically correct. The sum \sum C_i counts total person-time in the system. SPT genuinely minimizes this quantity. If you run a DMV and every person's time is equally valuable regardless of why they're there, SPT is the right policy.

The argument breaks down when:

  1. Tasks are not 1:1 with clients. In IT, one client may submit tasks of varying size. Across a relationship, SPT systematically fast-tracks their easy requests and starves their hard ones — which is not perceived as good service.

  2. Waiting cost is not uniform. A person waiting for a server outage to be fixed is not equivalent to a person waiting for a wallpaper change. The cost of waiting is proportional to the impact of the unresolved task, which is what priority encodes.

  3. The metric is applied to teams, not DMVs. When a team's performance is measured by unweighted mean, the rational response is to cherry-pick — which is individually rational but collectively destructive.

11.3 SPT as a Triage Heuristic

Argument. In high-volume systems where task sizes cluster tightly (e.g., a call center where most calls are 3-7 minutes), SPT approximates FIFO and the unweighted mean approximates the weighted mean. The pathologies described in this paper only manifest when task sizes span orders of magnitude.

Assessment: This is correct. As shown in Section 8, when task sizes are approximately uniform, all scheduling policies converge and all metrics agree. The coefficient of variation of task size, CV = \sigma_p / \bar{p}, determines the severity of the distortion:

| CV | Task size distribution | Metric distortion |
|---|---|---|
| < 0.3 | Tight (call center) | Negligible |
| 0.3 - 1.0 | Moderate (mixed IT) | Moderate |
| > 1.0 | Wide (typical IT queue) | Severe |

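For a typical IT service desk, task sizes range from 15 minutes (password reset) to 40+ hours (infrastructure migration), a spread of more than two orders of magnitude. With many short tickets and a few very long ones, CV can exceed 2, well inside the severe band. The distortion is not a theoretical edge case; it is the default condition.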
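To see where a given queue falls in these bands, CV can be computed directly from a sample of task sizes. This is a minimal sketch: both samples are invented for illustration, and the use of the population standard deviation is an assumption (the paper does not say which estimator it intends).

```python
from statistics import mean, pstdev

def coefficient_of_variation(sizes):
    """CV = standard deviation of task size / mean task size."""
    return pstdev(sizes) / mean(sizes)

# Hypothetical samples in hours; neither comes from real ticket data.
call_center = [3 / 60, 4 / 60, 5 / 60, 5 / 60, 6 / 60, 7 / 60]
it_queue = [0.25] * 15 + [1, 2, 2, 4, 40]  # many quick tickets, one migration

print("call center CV:", round(coefficient_of_variation(call_center), 2))
print("IT queue CV:   ", round(coefficient_of_variation(it_queue), 2))
```

The tightly clustered sample lands below 0.3, while a single 40-hour task among quick tickets is enough to push the second sample above 2.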

11.4 Gaming Requires Malice

Argument. The theorems show that the metric can be gamed, not that it will be gamed. A well-intentioned team might use the unweighted mean as a rough health indicator without actively optimizing for it, avoiding the pathologies described.

Assessment: This is the strongest counterargument. If the metric is used purely for monitoring — "are we completing things at a reasonable pace?" — and not for performance evaluation, rewards, or scheduling decisions, then the gaming incentive is absent and the metric is relatively harmless.

However, this argument requires the metric to remain purely informational and never influence behavior. In practice, any metric that is reported to management, tied to OKRs, or used in sprint retrospectives will influence behavior — this is Goodhart's Law, and it applies to well-intentioned teams as reliably as to cynical ones. The team need not be gaming the metric consciously; it is sufficient that completing three easy tickets "feels productive" while staring at one hard ticket does not. The metric validates the feeling, and the drift happens organically.

11.5 Summary: When the Unweighted Mean Is Defensible

The unweighted mean completion time is a defensible metric only when all four conditions hold simultaneously:

  1. Task sizes are approximately uniform (CV < 0.3)
  2. There is no priority differentiation (all tasks are equally important)
  3. Each task represents exactly one client
  4. The metric is not used to evaluate, reward, or direct team behavior

In a system satisfying all four conditions — such as a simple FIFO queue with uniform jobs and no priority system — the unweighted mean is adequate, and its simplicity is a genuine advantage.

In any system that violates even one of these conditions — which includes virtually every IT service desk, development team, and support organization — the metric produces the distortions proven in Sections 2-9.

The honest conclusion is not that the unweighted mean is always wrong. It is that the conditions under which it is right are narrow, easily identified, and rarely met in the systems where it is most commonly used.
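The four conditions of Section 11.5 can be written as a single predicate, which makes it explicit that they are conjunctive. This is only a sketch: the function and parameter names are hypothetical, and the 0.3 threshold is the one quoted in condition 1.

```python
def unweighted_mean_is_defensible(
    cv: float,
    has_priority_levels: bool,
    one_client_per_task: bool,
    drives_evaluation: bool,
) -> bool:
    """True only when all four conditions of Section 11.5 hold at once."""
    return (
        cv < 0.3                     # 1. task sizes approximately uniform
        and not has_priority_levels  # 2. no priority differentiation
        and one_client_per_task      # 3. each task is exactly one client
        and not drives_evaluation    # 4. metric does not steer behavior
    )

# A uniform FIFO queue with no priorities passes; a typical service desk does not.
print(unweighted_mean_is_defensible(0.2, False, True, False))
print(unweighted_mean_is_defensible(2.5, True, False, True))
```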


12. Conclusion

The unweighted average completion time is a biased statistic that:

  1. Can be gamed by scheduling policy (Theorem 1), unlike work-weighted completion time which is schedule-invariant (Theorem 2).
  2. Incentivizes starvation of large tasks (Theorem 3).
  3. Contradicts Little's Law unless tasks are uniformly sized.
  4. Degrades client satisfaction with zero compensating productivity gain (Theorem 7).
  5. Actively contradicts priority systems by carrying zero information about business-impact classification (Theorem 9).
  6. Ignores priority entirely in its scheduling recommendation, producing suboptimal priority-weighted delay whenever priority and size are not perfectly inversely correlated (Theorem 10).

A metric that can be improved by reordering work, without doing any additional work, is measuring the scheduling policy, not the system's capacity or effectiveness. When combined with a priority system, the metric does not merely fail to reflect priorities: whenever the highest-priority tasks are also the largest, it recommends the schedule that inflicts the most damage on them.
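The contrast between the two statistics can be reproduced in a few lines, using the definitions of \bar{C}(\sigma) and \bar{C}_w(\sigma) from Section 1. The task sizes below are hypothetical, and the function names are introduced here for illustration.

```python
from itertools import accumulate

def unweighted_mean(order):
    """Mean of completion times C_k for tasks executed in `order`."""
    completions = list(accumulate(order))
    return sum(completions) / len(order)

def work_weighted_mean(order):
    """Completion times weighted by each task's own processing time."""
    completions = accumulate(order)
    return sum(p * c for p, c in zip(order, completions)) / sum(order)

# Hypothetical processing times in hours (assumption, for illustration only).
tasks = [8.0, 0.25, 2.0, 1.0, 4.0]
spt = sorted(tasks)                # shortest first
lpt = sorted(tasks, reverse=True)  # longest first

for name, order in (("SPT", spt), ("LPT", lpt)):
    print(name, unweighted_mean(order), work_weighted_mean(order))
```

The unweighted mean moves substantially between the two orderings even though the same work is done in the same total time, while the work-weighted mean is identical for both, as Theorem 2 states.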

The unweighted mean is defensible only under narrow, identifiable conditions (Section 11.5): uniform task sizes, no priority system, one-to-one client-task mapping, and no behavioral influence from the metric. These conditions are rarely met in practice.

Unweighted average completion time is not a fair or accurate measurement of task execution performance. Its adoption as a team metric will rationally produce starvation of complex work, violation of stated priorities, inequitable client outcomes, and the illusion of productivity where none exists.


Appendix A. When the Metric Is the Product

The preceding twelve sections rest on an implicit assumption: that client satisfaction is a function of experienced service quality — how long their task took, relative to its size and urgency. If this assumption holds, the proof is valid and the unweighted mean is a destructive metric.

But there exists a scenario in which the assumption fails and the entire argument collapses.

A.1 The Self-Referential Metric

Suppose the service provider reports the unweighted mean completion time directly to the client — on a dashboard, in an SLA report, on a marketing page — and the client's satisfaction is derived primarily from that number rather than from their individual experience.

Define client satisfaction as:

U_{\text{client}} = f\!\left(\bar{C}(\sigma)\right), \quad f' < 0

That is: the client sees "Average resolution time: 6.56 hours" and is satisfied, without checking whether their ticket — the critical email outage — took 6.56 hours or 18.75 hours.

Under this model, SPT genuinely maximizes client satisfaction (Theorem 1). The service provider's throughput is unchanged (Theorem 6). The business outcome improves: same work done, happier client.

Every theorem in this paper remains mathematically correct. But the conclusion inverts. The metric is no longer a proxy for service quality that can be gamed — it is the service quality, because the client has agreed to evaluate quality by the aggregate number rather than by their individual experience.

A.2 The Economics

This creates a coherent, stable business equilibrium:

| Actor | Behavior | Outcome |
|---|---|---|
| Provider | Optimizes unweighted mean (SPT) | Metric improves, no extra work |
| Client | Reads dashboard, sees low average | Reports satisfaction |
| Management | Sees satisfied client + good metric | Rewards team |

Throughput is unchanged (Theorem 6), so the same revenue-generating work is completed. The only thing that changed is the order — and therefore the reported number. Real resources were rearranged, no additional value was created, but the business metrics all moved in the right direction.

This is profitable. The provider extracts satisfaction from the client at zero marginal cost, by optimizing a number that the client has accepted as a proxy for quality. The client is no worse off in their own estimation, because they evaluate the aggregate, not their individual experience.

A.3 The Fragility

This equilibrium is stable only as long as the client never inspects their own experience. It breaks the moment any of the following occur:

1. The client checks their own ticket.

A CTO whose email server was down for 18.75 hours will not be reassured by a dashboard reading "Average resolution: 6.56 hours." The aggregate metric and the individual experience diverge maximally for high-priority tasks (Theorem 4). The clients most likely to inspect their own experience are exactly the ones receiving the worst service.

2. A competitor offers per-ticket SLAs.

If an alternative provider guarantees "P1 incidents resolved within 4 hours" instead of "average resolution under 7 hours," the aggregate-metric provider cannot compete for clients with critical needs — which are typically the highest-value clients.

3. The provider's team internalizes the metric.

If the team believes the metric reflects real performance (rather than consciously gaming it), they lose the ability to recognize when critical work is being neglected. The metric becomes an epistemic hazard: it tells the team they are performing well, preventing them from seeing that they are not.

A.4 The General Pattern

This is not unique to task scheduling. The structure is:

  1. A measurable proxy is established for an unmeasured quality.
  2. The proxy is reported as if it were the quality itself.
  3. The proxy is optimized, improving the reported number.
  4. The underlying quality diverges from the proxy, but no one measures the underlying quality because the proxy exists.
  5. The system is stable until an exogenous shock forces inspection of the underlying quality.

This pattern appears across domains:

| Domain | Proxy metric | Underlying quality | Divergence |
|---|---|---|---|
| IT support | Avg. resolution time | Critical system uptime | Server down for 19 hrs, avg says 6.5 |
| Education | Standardized test scores | Actual learning | Teaching to the test, understanding declines |
| Healthcare | Patient throughput | Patient outcomes | Faster discharges, higher readmission rates |
| Finance | Quarterly earnings | Long-term value creation | Cost-cutting inflates EPS, erodes capability |
| Software | Velocity (story points) | Deliverable product quality | Point inflation, features half-finished |

In each case, the proxy is optimized, the number improves, and the system functions — profitably, even — until the moment the underlying quality is tested by reality.

A.5 A Mathematical Note on Equilibrium Stability

Model the system as a game between provider (P) and client (C).

Information structure:

  • P observes individual completion times \{C_i\} and chooses schedule \sigma
  • C observes only the reported aggregate \bar{C}(\sigma)

Payoffs:

  • P's payoff increases with C's satisfaction and is independent of schedule (throughput is invariant)
  • C's reported satisfaction U_C = f(\bar{C}) is maximized by SPT
  • C's actual welfare (if they could observe it) depends on individual C_i values, especially for high-priority tasks

This is a moral hazard problem: P chooses an action (the schedule \sigma) whose effect on the individual completion times C_i is hidden from C. P's optimal strategy is to minimize the observable signal (\bar{C}) regardless of the unobservable distribution, which is exactly SPT.

The equilibrium is a pooling equilibrium: P's schedule looks identical to the client regardless of the underlying priority-weighted performance. A provider with PWCT = 10.2 and a provider with PWCT = 10.167 both report \bar{C} = 6.56 under SPT. The client cannot distinguish between them.
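The pooling effect can be illustrated numerically. The sketch below assumes PWCT is a priority-weighted mean of completion times, \sum w_i C_i / \sum w_i; that definition, the task sizes, and the weights are all assumptions made for this example, and it does not reproduce the 10.2 / 10.167 / 6.56 figures above.

```python
from itertools import accumulate

def spt_completions(sizes):
    """Completion time of each task (in input order) when run under SPT."""
    order = sorted(range(len(sizes)), key=lambda i: sizes[i])
    finish_times = accumulate(sizes[i] for i in order)
    completions = [0.0] * len(sizes)
    for i, c in zip(order, finish_times):
        completions[i] = c
    return completions

def reported_mean(sizes):
    """The aggregate the client sees: unweighted mean completion time."""
    c = spt_completions(sizes)
    return sum(c) / len(c)

def priority_weighted_mean(sizes, weights):
    """Assumed PWCT: completion times weighted by priority weight."""
    c = spt_completions(sizes)
    return sum(w * ci for w, ci in zip(weights, c)) / sum(weights)

sizes = [8.0, 0.25, 2.0, 1.0, 4.0]  # hypothetical hours, same for both providers
weights_a = [1, 1, 1, 1, 1]         # provider A: nothing in the queue is urgent
weights_b = [10, 1, 1, 1, 1]        # provider B: the 8-hour task is critical

print("reported mean (both):", reported_mean(sizes))
print("PWCT, provider A:    ", priority_weighted_mean(sizes, weights_a))
print("PWCT, provider B:    ", priority_weighted_mean(sizes, weights_b))
```

Because the reported mean depends only on task sizes, both providers publish the same number, while the priority-weighted figure for provider B is roughly twice that of provider A.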

This equilibrium is stable in the Nash sense: neither player has a profitable unilateral deviation. C has no incentive to deviate (they have no better information source) and P has no incentive to deviate (any other schedule worsens \bar{C} with zero throughput benefit).

It is unstable under information revelation: if C obtains access to individual C_i values (via a customer portal, a competing vendor's transparency, or a sufficiently painful incident), the pooling equilibrium collapses and C's evaluation shifts to the underlying quality.

A.6 The Uncomfortable Conclusion

The honest answer to "does optimizing the unweighted mean hurt the business?" is: not necessarily, as long as the client never looks behind the number.

The honest answer to "does it hurt the client?" is: only when they have a problem large enough to notice — which is precisely when the metric's distortion is largest (Theorem 4).

The honest answer to "is this sustainable?" is: it is exactly as sustainable as any system in which the seller knows more than the buyer. Such systems are historically stable for extended periods and then collapse rapidly when the information asymmetry is punctured — by a crisis, a competitor, or a regulator.

The mathematical structure is clear: the unweighted mean creates an information asymmetry between the metric and the reality. Optimizing the metric under this asymmetry is locally rational for the provider, locally satisfying for the uninspecting client, and globally fragile for the relationship.

Whether one calls this "efficient market behavior" or "a dystopian consequence of optimizing legible numbers over illegible reality" is not a mathematical question. The math says only this: the incentive exists, the equilibrium is real, and it holds until it doesn't.


This proof was developed conversationally and formalized on 2026-03-28.