MathIsimple

Probability Theory – Problem 30: Prove that $\frac{T_N}{N\ln N}$ converges in probability to 1

Question

Let $X_1, X_2, \ldots$ be independent random variables, each uniformly distributed on $\Sigma=\{1,2,\ldots,N\}$, and define

$$T_{N}=\inf\{n\colon\{X_{1},\ldots,X_{n}\}=\{1,2,\ldots,N\}\}$$

Prove that $\frac{T_{N}}{N\ln N}$ converges in probability to 1.

(20 points)

Step-by-step solution

Step 1. Denote by $\tau_k$ the number of trials needed to collect the $k$-th distinct element after $k-1$ distinct elements have already been collected, $k=1,2,\dots,N$. At that stage, each trial yields a new element with probability $p_k = \frac{N-(k-1)}{N}$. The random variable $\tau_k$ therefore follows a geometric distribution $\mathrm{Ge}(p_k)$ with probability mass function $P(\tau_k = m) = (1-p_k)^{m-1}p_k$, and $\tau_1, \tau_2, \dots, \tau_N$ are mutually independent. The total number of trials is $T_N = \sum_{k=1}^{N} \tau_k$.
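The staged decomposition above can be sketched as a short simulation. This is an illustrative aside rather than part of the official solution, and the function names are my own: the two constructions below produce samples with the same distribution, which is exactly the content of Step 1.

```python
import random

def coupon_collector_time(N, rng):
    """Draw uniformly from {1, ..., N} until every value has appeared;
    return the number of draws (the T_N of the problem)."""
    seen = set()
    draws = 0
    while len(seen) < N:
        seen.add(rng.randint(1, N))
        draws += 1
    return draws

def staged_time(N, rng):
    """Equivalent staged construction: T_N = tau_1 + ... + tau_N, where
    tau_k ~ Ge(p_k) with p_k = (N - k + 1) / N, sampled independently."""
    total = 0
    for k in range(1, N + 1):
        p = (N - k + 1) / N
        wait = 1                  # tau_k takes values 1, 2, 3, ...
        while rng.random() >= p:  # each trial succeeds with probability p
            wait += 1
        total += wait
    return total
```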

Step 2. The geometric random variable $\tau_k$ has expectation $E[\tau_k] = \frac{1}{p_k} = \frac{N}{N-k+1}$ and variance $\mathrm{Var}(\tau_k) = \frac{1-p_k}{p_k^2}$. By linearity of expectation, $$E[T_N] = \sum_{k=1}^{N} E[\tau_k] = \sum_{k=1}^{N} \frac{N}{N-k+1} = N \sum_{j=1}^{N} \frac{1}{j}.$$ Using the asymptotic expansion of the harmonic numbers, $\sum_{j=1}^{N} \frac{1}{j} = \ln N + \gamma + o(1)$ (where $\gamma$ is the Euler--Mascheroni constant), we obtain $E[T_N] = N \ln N + O(N)$, and hence $\lim_{N \to \infty} \frac{E[T_N]}{N \ln N} = 1$.
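The limit in Step 2 can be checked numerically; the ratio $E[T_N]/(N\ln N) = H_N/\ln N$ drifts down toward 1, albeit slowly, because of the residual $\gamma/\ln N$ term. A minimal sketch (function name my own):

```python
import math

def expected_T(N):
    """E[T_N] = N * H_N, where H_N = sum_{j=1}^N 1/j is the N-th harmonic number."""
    return N * sum(1.0 / j for j in range(1, N + 1))

# Ratios E[T_N] / (N ln N) for increasing N: each exceeds 1 and decreases toward 1.
ratios = [expected_T(N) / (N * math.log(N)) for N in (10, 1000, 100000)]
```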

Step 3. By independence, the variance of $T_N$ is $$\mathrm{Var}(T_N) = \sum_{k=1}^{N} \mathrm{Var}(\tau_k) = \sum_{k=1}^{N} \frac{1-p_k}{p_k^2} < \sum_{k=1}^{N} \frac{1}{p_k^2} = \sum_{j=1}^{N} \frac{N^2}{j^2} = N^2 \sum_{j=1}^{N} \frac{1}{j^2}.$$ Since the series $\sum_{j=1}^{\infty} \frac{1}{j^2}$ converges to $\frac{\pi^2}{6}$, there exists a constant $C$ (e.g. $C = \pi^2/6$) such that $\mathrm{Var}(T_N) \le C N^2$.
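The variance bound of Step 3 also admits a direct numerical check. Again an illustrative aside with my own naming, computing the exact stage-wise variance sum and comparing it with $\frac{\pi^2}{6}N^2$:

```python
import math

def var_T(N):
    """Exact Var(T_N) = sum_{k=1}^N (1 - p_k) / p_k^2 with p_k = (N - k + 1) / N."""
    total = 0.0
    for k in range(1, N + 1):
        p = (N - k + 1) / N
        total += (1 - p) / p ** 2
    return total

# Step 3's bound: Var(T_N) < (pi^2 / 6) * N^2 for every N.
bound_holds = all(var_T(N) < (math.pi ** 2 / 6) * N ** 2 for N in (10, 100, 1000))
```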

Step 4. Applying Chebyshev's inequality, for any $\epsilon > 0$: $$P\left( \left| \frac{T_N - E[T_N]}{N \ln N} \right| \ge \epsilon \right) = P\left( \left| T_N - E[T_N] \right| \ge \epsilon N \ln N \right) \le \frac{\mathrm{Var}(T_N)}{\epsilon^2 (N \ln N)^2} \le \frac{C N^2}{\epsilon^2 N^2 (\ln N)^2} = \frac{C}{\epsilon^2 (\ln N)^2}.$$ As $N \to \infty$, the right-hand side tends to $0$, so $\frac{T_N - E[T_N]}{N \ln N} \xrightarrow{P} 0$. Combined with the conclusion of Step 2 that $\frac{E[T_N]}{N \ln N} \to 1$, we obtain $$\frac{T_N}{N \ln N} = \frac{T_N - E[T_N]}{N \ln N} + \frac{E[T_N]}{N \ln N} \xrightarrow{P} 1.$$
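The concentration proved in Step 4 can be observed empirically, though the Chebyshev bound $C/(\epsilon^2(\ln N)^2)$ decays only logarithmically in $N$, so the empirical tail probability shrinks slowly. A Monte Carlo sketch under the staged construction of Step 1 (names my own):

```python
import math
import random

def sample_T(N, rng):
    """One sample of T_N via the staged decomposition of Step 1:
    a sum of independent Geometric(p_k) waits, p_k = (N - k + 1) / N."""
    total = 0
    for k in range(1, N + 1):
        p = (N - k + 1) / N
        wait = 1
        while rng.random() >= p:
            wait += 1
        total += wait
    return total

def tail_frequency(N, eps, runs=1000, seed=0):
    """Empirical estimate of P(|T_N / (N ln N) - 1| >= eps)."""
    rng = random.Random(seed)
    scale = N * math.log(N)
    hits = sum(abs(sample_T(N, rng) / scale - 1) >= eps for _ in range(runs))
    return hits / runs
```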

Final answer

Therefore $\frac{T_N}{N \ln N} \xrightarrow{P} 1$. QED.

Marking scheme

The following is the detailed marking rubric based on the official solution:

1. Checkpoints (Total: 7 points)

Score exactly one chain; take the maximum subtotal among chains; do not add points across chains.

Chain A: Geometric distribution decomposition and Chebyshev's inequality (official solution)

  • Geometric distribution modeling and decomposition (2 points) [additive]
      • Decompose the total count $T_N$ into a sum of stage-wise waiting times $T_N = \sum_{k=1}^N \tau_k$ (other notation such as $X_k$ is acceptable), and explicitly state that the $\tau_k$ are mutually independent and geometrically distributed. (1 point)
      • Correctly specify the parameter: state that the success probability in the $k$-th stage (collecting the $k$-th distinct element) is $p_k = \frac{N-k+1}{N}$ or an equivalent form. (1 point)
  • Computation and asymptotics of the expectation (2 points) [additive]
      • Using linearity of expectation, compute $E[T_N] = \sum_k \frac{1}{p_k} = N \sum_{j=1}^N \frac{1}{j}$. (1 point)
      • Using the harmonic-series asymptotics $\sum_{j=1}^N \frac{1}{j} \sim \ln N$, explicitly conclude that $\lim_{N \to \infty} \frac{E[T_N]}{N \ln N} = 1$ (or $E[T_N] = N \ln N + O(N)$). (1 point)
  • Order-of-magnitude estimate for the variance (2 points) [additive]
      • Using independence, write the variance as a sum: $\mathrm{Var}(T_N) = \sum_{k=1}^N \mathrm{Var}(\tau_k) = \sum_k \frac{1-p_k}{p_k^2}$ (or the bound $\le \sum_k \frac{1}{p_k^2}$). (1 point)
      • Using the convergence of $\sum_{j=1}^\infty \frac{1}{j^2}$, show that $\mathrm{Var}(T_N)$ is of order at most $N^2$ (i.e., $\mathrm{Var}(T_N) \le C N^2$ or $O(N^2)$). (1 point)
      *(Note: If the student does not perform the bounding step but correctly computes the exact expression involving $\sum \frac{1}{j^2}$ and identifies its leading term as $O(N^2)$, full credit is awarded.)*
  • Proof of convergence in probability (1 point) [additive]
      • Apply Chebyshev's inequality, substitute the variance bound obtained above, and prove that $\frac{T_N - E[T_N]}{N \ln N} \xrightarrow{P} 0$. (1 point)
      *(Note: If the student has shown that the centered quantity converges in probability to 0, and in the expectation step has shown that the mean ratio converges to 1, this point is awarded directly; the conclusion must be based on the preceding order-of-magnitude estimates.)*

Total (max 7)

2. Zero-credit items

  • Merely copying the problem statement, definitions, or formulas without performing any specific substitution or derivation for this problem.
  • Asserting $T_N \approx N \ln N$ based solely on intuition, without the geometric-distribution decomposition or series summation as justification.
  • Stating Chebyshev's inequality without computing $E[T_N]$ or $\mathrm{Var}(T_N)$, leaving the right-hand side of the inequality unanalyzable in the limit.
  • Citing the conclusion of the "coupon collector problem" without any derivation.

3. Deductions

  • Missing independence (-1): when computing $\mathrm{Var}(\sum_k \tau_k) = \sum_k \mathrm{Var}(\tau_k)$, failing to state, or implicitly justify, the independence of the $\tau_k$.
  • Parameter logic error (-1): the geometric parameter $p_k$ is seriously wrong (e.g., taken constant, or with the wrong monotonicity in $k$), preventing the subsequent series analysis.
  • Confusing modes of convergence (-1): proving only convergence of expectations rather than convergence in probability (e.g., showing $\lim_N E[\frac{T_N}{N \ln N}] = 1$ without discussing variance or probability concentration).
  • Serious series-estimation error (cap at 4/7): claiming that the harmonic series $\sum_j \frac{1}{j}$ converges (hence $E[T_N] = O(N)$), which invalidates the entire asymptotic analysis.
  • Logical gap (-1): in the final step, drawing a conclusion directly from $\mathrm{Var}(T_N) \to \infty$ (true, but unnormalized), or failing to compare the variance against the squared denominator $(N \ln N)^2$.