
Probability Theory – Problem 68: Prove that $\frac{T_N}{N \ln N}$ converges in probability to 1

Question

Let $X_1, X_2, \ldots$ be independent and uniformly distributed on $\Sigma = \{1, 2, \ldots, N\}$, and define

$T_N = \inf\{n \colon \{X_1, \ldots, X_n\} = \{1, 2, \ldots, N\}\}$

Prove that $\frac{T_N}{N \ln N}$ converges in probability to 1.

(20 points)

Step-by-step solution

Step 1. Let $\tau_k$ denote the number of trials needed to collect the $k$-th new element after $k-1$ distinct elements have already been collected, for $k = 1, 2, \dots, N$. At that stage, the probability of obtaining a new element on each trial is $p_k = \frac{N-(k-1)}{N}$. The random variable $\tau_k$ follows a geometric distribution $\mathrm{Ge}(p_k)$ with probability mass function $P(\tau_k = m) = (1-p_k)^{m-1} p_k$, and $\tau_1, \tau_2, \dots, \tau_N$ are mutually independent. The total number of trials can be expressed as $T_N = \sum_{k=1}^{N} \tau_k$.
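The decomposition can be sanity-checked by simulation. The sketch below (function names are illustrative, not from the source) simulates $T_N$ both directly and via the stage-wise geometric waiting times; both produce samples of the same distribution:

```python
import random

def collect_time(N, rng):
    """Simulate T_N directly: draw uniformly from N symbols
    until every symbol has appeared at least once."""
    seen, t = set(), 0
    while len(seen) < N:
        seen.add(rng.randrange(N))
        t += 1
    return t

def collect_time_stages(N, rng):
    """Equivalent simulation via the decomposition T_N = sum of tau_k:
    stage k waits a Geometric(p_k) time with p_k = (N - k + 1) / N."""
    t = 0
    for k in range(1, N + 1):
        p = (N - k + 1) / N
        while True:  # count trials until the first success
            t += 1
            if rng.random() < p:
                break
    return t
```

Averaging many runs of either function should approach $N \sum_{j=1}^N \frac{1}{j}$, the expectation computed in Step 2.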

Step 2. The expectation of the geometric random variable $\tau_k$ is $E[\tau_k] = \frac{1}{p_k} = \frac{N}{N-k+1}$, and its variance is $\mathrm{Var}(\tau_k) = \frac{1-p_k}{p_k^2}$. By linearity of expectation, $E[T_N] = \sum_{k=1}^{N} E[\tau_k] = \sum_{k=1}^{N} \frac{N}{N-k+1} = N \sum_{j=1}^{N} \frac{1}{j}$. Using the asymptotic expansion of the harmonic series $\sum_{j=1}^{N} \frac{1}{j} = \ln N + \gamma + o(1)$ (where $\gamma$ is the Euler–Mascheroni constant), we obtain $E[T_N] = N \ln N + O(N)$, and hence $\lim_{N \to \infty} \frac{E[T_N]}{N \ln N} = 1$.
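The identity $E[T_N] = N \sum_{j=1}^N \frac{1}{j}$ and the slow drift of the ratio $\frac{E[T_N]}{N \ln N}$ toward 1 can be checked numerically; this is an illustrative sketch, not part of the proof:

```python
import math

def expected_T(N):
    # Exact expectation: E[T_N] = N * H_N, where H_N is the N-th harmonic number.
    return N * sum(1.0 / j for j in range(1, N + 1))

# H_N / ln N = (ln N + gamma + o(1)) / ln N approaches 1 only logarithmically.
ratio = expected_T(10**5) / (10**5 * math.log(10**5))
```

Even at $N = 10^5$ the ratio is still about $1.05$, reflecting the $O(N)$ correction term $\gamma N$ in the expansion.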

Step 3. By independence, $\mathrm{Var}(T_N) = \sum_{k=1}^{N} \mathrm{Var}(\tau_k) = \sum_{k=1}^{N} \frac{1-p_k}{p_k^2} < \sum_{k=1}^{N} \frac{1}{p_k^2} = \sum_{j=1}^{N} \frac{N^2}{j^2} = N^2 \sum_{j=1}^{N} \frac{1}{j^2}$. Since the series $\sum_{j=1}^{\infty} \frac{1}{j^2}$ converges to $\frac{\pi^2}{6}$, there exists a constant $C$ such that $\mathrm{Var}(T_N) \le C N^2$.
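The exact variance sum and the bound $\mathrm{Var}(T_N) \le \frac{\pi^2}{6} N^2$ can likewise be verified numerically (an illustrative sketch; the function name is not from the source):

```python
import math

def var_T(N):
    # Exact variance: Var(T_N) = sum_{k=1}^N (1 - p_k) / p_k^2
    # with p_k = (N - k + 1) / N, using independence of the tau_k.
    total = 0.0
    for k in range(1, N + 1):
        p = (N - k + 1) / N
        total += (1 - p) / p**2
    return total
```

For every $N$, `var_T(N) / N**2` stays below $\frac{\pi^2}{6} \approx 1.645$, so $C = \frac{\pi^2}{6}$ serves as the constant in Step 3.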

Step 4. By Chebyshev's inequality, for any $\epsilon > 0$: $P\left( \left| \frac{T_N - E[T_N]}{N \ln N} \right| \ge \epsilon \right) = P\left( |T_N - E[T_N]| \ge \epsilon N \ln N \right) \le \frac{\mathrm{Var}(T_N)}{\epsilon^2 (N \ln N)^2} \le \frac{C N^2}{\epsilon^2 N^2 (\ln N)^2} = \frac{C}{\epsilon^2 (\ln N)^2}$. As $N \to \infty$, the right-hand side tends to $0$, so $\frac{T_N - E[T_N]}{N \ln N} \xrightarrow{P} 0$. Combined with the conclusion from Step 2 that $\frac{E[T_N]}{N \ln N} \to 1$, we obtain $\frac{T_N}{N \ln N} = \frac{T_N - E[T_N]}{N \ln N} + \frac{E[T_N]}{N \ln N} \xrightarrow{P} 1$.
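A small Monte Carlo check (illustrative only, assuming nothing beyond the problem setup) shows the ratio $\frac{T_N}{N \ln N}$ concentrating near 1:

```python
import math
import random

def ratio_sample(N, rng):
    # One sample of T_N / (N ln N), simulating the collection directly.
    seen, t = set(), 0
    while len(seen) < N:
        seen.add(rng.randrange(N))
        t += 1
    return t / (N * math.log(N))

rng = random.Random(1)
samples = [ratio_sample(2000, rng) for _ in range(20)]
mean_ratio = sum(samples) / len(samples)
```

Individual samples still scatter noticeably around 1 at moderate $N$: the Chebyshev bound decays only like $\frac{1}{(\ln N)^2}$, so the concentration is slow.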

Final answer

QED.

Marking scheme

The following is the marking rubric based on the official solution:

1. Checkpoints (total: 7 pts)

Score exactly one chain: take the maximum subtotal among chains; do not add points across chains.

Chain A: Geometric Distribution Decomposition and Chebyshev's Inequality (official solution)

  • Geometric distribution modeling and decomposition (2 pts) [additive]
    • Decompose the total time as a sum of stage-wise waiting times $T_N = \sum_{k=1}^N \tau_k$ (other notation such as $X_k$ is acceptable), and explicitly state that the $\tau_k$ are mutually independent and geometrically distributed. (1 pt)
    • Correctly write the parameter: state that the success probability in the $k$-th stage (collecting the $k$-th distinct element) is $p_k = \frac{N-k+1}{N}$ or an equivalent form. (1 pt)
  • Expectation computation and asymptotics (2 pts) [additive]
    • Use linearity of expectation to compute $E[T_N] = \sum \frac{1}{p_k} = N \sum_{j=1}^N \frac{1}{j}$. (1 pt)
    • Use the harmonic-series asymptotics $\sum_{j=1}^N \frac{1}{j} \sim \ln N$ to explicitly conclude $\lim_{N \to \infty} \frac{E[T_N]}{N \ln N} = 1$ (or $E[T_N] = N \ln N + O(N)$). (1 pt)
  • Variance order estimate (2 pts) [additive]
    • Use independence to write the variance as a sum: $\mathrm{Var}(T_N) = \sum_{k=1}^N \mathrm{Var}(\tau_k) = \sum \frac{1-p_k}{p_k^2}$ (or $\le \sum \frac{1}{p_k^2}$). (1 pt)
    • Use the convergence of $\sum_{j=1}^\infty \frac{1}{j^2}$ to show that $\mathrm{Var}(T_N)$ is of order at most $N^2$ (i.e., $\mathrm{Var}(T_N) \le C N^2$ or $O(N^2)$). (1 pt)
    • *(Note: If the student does not perform the bounding step but correctly computes the exact expression containing $\sum \frac{1}{j^2}$ and identifies its leading term as $O(N^2)$, full credit is also awarded.)*
  • Convergence in probability proof (1 pt) [additive]
    • Apply Chebyshev's inequality, substitute the variance bound above, and prove $\frac{T_N - E[T_N]}{N \ln N} \xrightarrow{P} 0$. (1 pt)
    • *(Note: If the student has proved that the centered quantity converges in probability to 0, and in the expectation step has shown that the mean ratio converges to 1, this point is awarded directly; the conclusion must rest on the preceding order estimates.)*

Total (max 7)

2. Zero-credit items

  • Merely copying the problem statement, definitions, or formulas without any concrete substitution or derivation specific to this problem.
  • Asserting $T_N \approx N \ln N$ based on intuition alone, without the geometric distribution decomposition or series summation as justification.
  • Listing the Chebyshev inequality formula but not computing $E[T_N]$ or $\mathrm{Var}(T_N)$, rendering the right-hand side of the inequality unanalyzable.
  • Merely citing the "coupon collector problem" result without any derivation.

3. Deductions

  • Independence omission (-1): When computing $\mathrm{Var}(\sum \tau_k) = \sum \mathrm{Var}(\tau_k)$, failing to mention or implicitly use the independence of the $\tau_k$.
  • Parameter logic error (-1): Setting the geometric distribution parameter $p_k$ to something seriously incorrect (e.g., constant, or with the wrong monotonicity in $k$), preventing the subsequent series analysis.
  • Confusing convergence concepts (-1): Proving "convergence of expectations" instead of "convergence in probability" (e.g., only showing $\lim \frac{E[T_N]}{N \ln N} = 1$ without discussing variance or probability concentration).
  • Serious series estimation error (cap at 4/7): Claiming that the harmonic series $\sum \frac{1}{j}$ converges (yielding $E[T_N] = O(N)$), which invalidates the entire asymptotic analysis.
  • Logical gap (-1): In the final step, arguing directly from $\mathrm{Var}(T_N) \to \infty$ (the variance does diverge, but it is the unnormalized variance), or failing to divide the variance by the denominator $(N \ln N)^2$ when applying Chebyshev's inequality.