
Probability Theory – Problem 68: Prove that $\frac{T_N}{N \ln N}$ converges in probability to 1

Question

Let $X_1, X_2, \ldots$ be independent and uniformly distributed on $\Sigma = \{1, 2, \ldots, N\}$, and define

$T_N = \inf\{n \colon \{X_1, \ldots, X_n\} = \{1, 2, \ldots, N\}\}$

Prove that $\frac{T_N}{N \ln N}$ converges in probability to 1.

(20 points)

Step-by-step solution

Step 1. Let $\tau_k$ denote the number of trials needed to collect the $k$-th new element after $k-1$ distinct elements have already been collected, for $k = 1, 2, \dots, N$. At that stage, the probability of obtaining a new element on each trial is $p_k = \frac{N-(k-1)}{N}$. The random variable $\tau_k$ follows a geometric distribution $\mathrm{Ge}(p_k)$ with probability mass function $P(\tau_k = m) = (1-p_k)^{m-1} p_k$, and $\tau_1, \tau_2, \dots, \tau_N$ are mutually independent. The total number of trials can be expressed as $T_N = \sum_{k=1}^{N} \tau_k$.
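The decomposition can be sanity-checked by simulation. The sketch below (function names are illustrative, not from the source) simulates $T_N$ both directly and via the stage-wise geometric waiting times; both produce samples of the same distribution:

```python
import random

def collect_time(N, rng):
    """Simulate T_N directly: draw uniformly from N symbols
    until every symbol has appeared at least once."""
    seen, t = set(), 0
    while len(seen) < N:
        seen.add(rng.randrange(N))
        t += 1
    return t

def collect_time_stages(N, rng):
    """Equivalent simulation via the decomposition T_N = sum of tau_k:
    stage k waits a Geometric(p_k) time with p_k = (N - k + 1) / N."""
    t = 0
    for k in range(1, N + 1):
        p = (N - k + 1) / N
        while True:  # count trials until the first success
            t += 1
            if rng.random() < p:
                break
    return t
```

Averaging many runs of either function should approach $N \sum_{j=1}^N \frac{1}{j}$, the expectation computed in Step 2.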

Step 2. The expectation of the geometric random variable $\tau_k$ is $E[\tau_k] = \frac{1}{p_k} = \frac{N}{N-k+1}$, and its variance is $\mathrm{Var}(\tau_k) = \frac{1-p_k}{p_k^2}$. By linearity of expectation, $E[T_N] = \sum_{k=1}^{N} E[\tau_k] = \sum_{k=1}^{N} \frac{N}{N-k+1} = N \sum_{j=1}^{N} \frac{1}{j}$. Using the asymptotic expansion of the harmonic series $\sum_{j=1}^{N} \frac{1}{j} = \ln N + \gamma + o(1)$ (where $\gamma$ is the Euler–Mascheroni constant), we obtain $E[T_N] = N \ln N + O(N)$, and hence $\lim_{N \to \infty} \frac{E[T_N]}{N \ln N} = 1$.
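The identity $E[T_N] = N \sum_{j=1}^N \frac{1}{j}$ and the slow drift of the ratio $\frac{E[T_N]}{N \ln N}$ toward 1 can be checked numerically; this is an illustrative sketch, not part of the proof:

```python
import math

def expected_T(N):
    # Exact expectation: E[T_N] = N * H_N, where H_N is the N-th harmonic number.
    return N * sum(1.0 / j for j in range(1, N + 1))

# H_N / ln N = (ln N + gamma + o(1)) / ln N approaches 1 only logarithmically.
ratio = expected_T(10**5) / (10**5 * math.log(10**5))
```

Even at $N = 10^5$ the ratio is still about $1.05$, reflecting the $O(N)$ correction term $\gamma N$ in the expansion.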

Step 3. By independence, $\mathrm{Var}(T_N) = \sum_{k=1}^{N} \mathrm{Var}(\tau_k) = \sum_{k=1}^{N} \frac{1-p_k}{p_k^2} < \sum_{k=1}^{N} \frac{1}{p_k^2} = \sum_{j=1}^{N} \frac{N^2}{j^2} = N^2 \sum_{j=1}^{N} \frac{1}{j^2}$. Since the series $\sum_{j=1}^{\infty} \frac{1}{j^2}$ converges to $\frac{\pi^2}{6}$, there exists a constant $C$ such that $\mathrm{Var}(T_N) \le C N^2$.
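The exact variance sum and the bound $\mathrm{Var}(T_N) \le \frac{\pi^2}{6} N^2$ can likewise be verified numerically (an illustrative sketch; the function name is not from the source):

```python
import math

def var_T(N):
    # Exact variance: Var(T_N) = sum_{k=1}^N (1 - p_k) / p_k^2
    # with p_k = (N - k + 1) / N, using independence of the tau_k.
    total = 0.0
    for k in range(1, N + 1):
        p = (N - k + 1) / N
        total += (1 - p) / p**2
    return total
```

For every $N$, `var_T(N) / N**2` stays below $\frac{\pi^2}{6} \approx 1.645$, so $C = \frac{\pi^2}{6}$ serves as the constant in Step 3.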

Step 4. By Chebyshev's inequality, for any $\epsilon > 0$: $P\left( \left| \frac{T_N - E[T_N]}{N \ln N} \right| \ge \epsilon \right) = P\left( |T_N - E[T_N]| \ge \epsilon N \ln N \right) \le \frac{\mathrm{Var}(T_N)}{\epsilon^2 (N \ln N)^2} \le \frac{C N^2}{\epsilon^2 N^2 (\ln N)^2} = \frac{C}{\epsilon^2 (\ln N)^2}$. As $N \to \infty$, the right-hand side tends to $0$, so $\frac{T_N - E[T_N]}{N \ln N} \xrightarrow{P} 0$. Combined with the conclusion from Step 2 that $\frac{E[T_N]}{N \ln N} \to 1$, we obtain $\frac{T_N}{N \ln N} = \frac{T_N - E[T_N]}{N \ln N} + \frac{E[T_N]}{N \ln N} \xrightarrow{P} 1$.
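A small Monte Carlo check (illustrative only, assuming nothing beyond the problem setup) shows the ratio $\frac{T_N}{N \ln N}$ concentrating near 1:

```python
import math
import random

def ratio_sample(N, rng):
    # One sample of T_N / (N ln N), simulating the collection directly.
    seen, t = set(), 0
    while len(seen) < N:
        seen.add(rng.randrange(N))
        t += 1
    return t / (N * math.log(N))

rng = random.Random(1)
samples = [ratio_sample(2000, rng) for _ in range(20)]
mean_ratio = sum(samples) / len(samples)
```

Individual samples still scatter noticeably around 1 at moderate $N$: the Chebyshev bound decays only like $\frac{1}{(\ln N)^2}$, so the concentration is slow.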

Final answer

QED.

Marking scheme

The following is the marking rubric based on the official solution:

1. Checkpoints (total: 7 pts)

Score exactly one chain: take the maximum subtotal among chains; do not add points across chains.

Chain A: Geometric Distribution Decomposition and Chebyshev's Inequality (official solution)

  • Geometric distribution modeling and decomposition (2 pts) [additive]
    • Decompose the total time as a sum of stage-wise waiting times $T_N = \sum_{k=1}^N \tau_k$ (other notation such as $X_k$ is acceptable), and explicitly state that the $\tau_k$ are mutually independent and geometrically distributed. (1 pt)
    • Correctly write the parameter: state that the success probability in the $k$-th stage (collecting the $k$-th distinct element) is $p_k = \frac{N-k+1}{N}$ or an equivalent form. (1 pt)
  • Expectation computation and asymptotics (2 pts) [additive]
    • Use linearity of expectation to compute $E[T_N] = \sum \frac{1}{p_k} = N \sum_{j=1}^N \frac{1}{j}$. (1 pt)
    • Use the harmonic-series asymptotics $\sum_{j=1}^N \frac{1}{j} \sim \ln N$ to explicitly conclude $\lim_{N \to \infty} \frac{E[T_N]}{N \ln N} = 1$ (or $E[T_N] = N \ln N + O(N)$). (1 pt)
  • Variance order estimate (2 pts) [additive]
    • Use independence to write the variance as a sum: $\mathrm{Var}(T_N) = \sum_{k=1}^N \mathrm{Var}(\tau_k) = \sum \frac{1-p_k}{p_k^2}$ (or $\le \sum \frac{1}{p_k^2}$). (1 pt)
    • Use the convergence of $\sum_{j=1}^\infty \frac{1}{j^2}$ to show that $\mathrm{Var}(T_N)$ is of order at most $N^2$ (i.e., $\mathrm{Var}(T_N) \le C N^2$ or $O(N^2)$). (1 pt)
    • *(Note: If the student does not perform the bounding step but correctly computes the exact expression containing $\sum \frac{1}{j^2}$ and identifies its leading term as $O(N^2)$, full credit is also awarded.)*
  • Convergence in probability proof (1 pt) [additive]
    • Apply Chebyshev's inequality, substitute the variance bound above, and prove $\frac{T_N - E[T_N]}{N \ln N} \xrightarrow{P} 0$. (1 pt)
    • *(Note: If the student has proved that the centered quantity converges in probability to 0, and in the expectation step has shown that the mean ratio converges to 1, this point is awarded directly; the conclusion must rest on the preceding order estimates.)*

Total (max 7)

2. Zero-credit items

  • Merely copying the problem statement, definitions, or formulas without any concrete substitution or derivation specific to this problem.
  • Asserting $T_N \approx N \ln N$ based on intuition alone, without the geometric distribution decomposition or series summation as justification.
  • Listing the Chebyshev inequality formula but not computing $E[T_N]$ or $\mathrm{Var}(T_N)$, rendering the right-hand side of the inequality unanalyzable.
  • Merely citing the "coupon collector problem" result without any derivation.

3. Deductions

  • Independence omission (-1): When computing $\mathrm{Var}(\sum \tau_k) = \sum \mathrm{Var}(\tau_k)$, failing to mention or implicitly use the independence of the $\tau_k$.
  • Parameter logic error (-1): Setting the geometric distribution parameter $p_k$ to something seriously incorrect (e.g., constant, or with the wrong monotonicity in $k$), preventing the subsequent series analysis.
  • Confusing convergence concepts (-1): Proving "convergence of expectations" instead of "convergence in probability" (e.g., only showing $\lim \frac{E[T_N]}{N \ln N} = 1$ without discussing variance or probability concentration).
  • Serious series estimation error (cap at 4/7): Claiming that the harmonic series $\sum \frac{1}{j}$ converges (yielding $E[T_N] = O(N)$), which invalidates the entire asymptotic analysis.
  • Logical gap (-1): In the final step, arguing directly from $\mathrm{Var}(T_N) \to \infty$ (the variance does diverge, but it is the unnormalized variance), or failing to divide the variance by the denominator $(N \ln N)^2$ when applying Chebyshev's inequality.