MathIsimple

Probability Theory – Problem 30: Prove that $\frac{T_N}{N\ln N}$ converges in probability to 1

Question

Let $X_1, X_2, \ldots$ be independent random variables, each uniformly distributed on $\Sigma=\{1,2,\ldots,N\}$, and define

$$T_{N}=\inf\{n\colon\{X_{1},\ldots,X_{n}\}=\{1,2,\ldots,N\}\}$$

Prove that $\frac{T_{N}}{N\ln N}$ converges in probability to 1.

(20 points)

Step-by-step solution

Step 1. Denote by $\tau_k$ the number of trials needed to collect the $k$-th distinct element after $k-1$ distinct elements have already been collected, $k=1,2,\dots,N$. At that stage, each trial yields a new element with probability $p_k = \frac{N-(k-1)}{N}$. The random variable $\tau_k$ therefore follows a geometric distribution $\mathrm{Ge}(p_k)$ with probability mass function $P(\tau_k = m) = (1-p_k)^{m-1}p_k$, and $\tau_1, \tau_2, \dots, \tau_N$ are mutually independent. The total number of trials is $T_N = \sum_{k=1}^{N} \tau_k$.
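The staged decomposition above can be sketched as a short simulation. This is an illustrative aside rather than part of the official solution, and the function names are my own: the two constructions below produce samples with the same distribution, which is exactly the content of Step 1.

```python
import random

def coupon_collector_time(N, rng):
    """Draw uniformly from {1, ..., N} until every value has appeared;
    return the number of draws (the T_N of the problem)."""
    seen = set()
    draws = 0
    while len(seen) < N:
        seen.add(rng.randint(1, N))
        draws += 1
    return draws

def staged_time(N, rng):
    """Equivalent staged construction: T_N = tau_1 + ... + tau_N, where
    tau_k ~ Ge(p_k) with p_k = (N - k + 1) / N, sampled independently."""
    total = 0
    for k in range(1, N + 1):
        p = (N - k + 1) / N
        wait = 1                  # tau_k takes values 1, 2, 3, ...
        while rng.random() >= p:  # each trial succeeds with probability p
            wait += 1
        total += wait
    return total
```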

Step 2. The geometric random variable $\tau_k$ has expectation $E[\tau_k] = \frac{1}{p_k} = \frac{N}{N-k+1}$ and variance $\mathrm{Var}(\tau_k) = \frac{1-p_k}{p_k^2}$. By linearity of expectation, $$E[T_N] = \sum_{k=1}^{N} E[\tau_k] = \sum_{k=1}^{N} \frac{N}{N-k+1} = N \sum_{j=1}^{N} \frac{1}{j}.$$ Using the asymptotic expansion of the harmonic numbers, $\sum_{j=1}^{N} \frac{1}{j} = \ln N + \gamma + o(1)$ (where $\gamma$ is the Euler--Mascheroni constant), we obtain $E[T_N] = N \ln N + O(N)$, and hence $\lim_{N \to \infty} \frac{E[T_N]}{N \ln N} = 1$.
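The limit in Step 2 can be checked numerically; the ratio $E[T_N]/(N\ln N) = H_N/\ln N$ drifts down toward 1, albeit slowly, because of the residual $\gamma/\ln N$ term. A minimal sketch (function name my own):

```python
import math

def expected_T(N):
    """E[T_N] = N * H_N, where H_N = sum_{j=1}^N 1/j is the N-th harmonic number."""
    return N * sum(1.0 / j for j in range(1, N + 1))

# Ratios E[T_N] / (N ln N) for increasing N: each exceeds 1 and decreases toward 1.
ratios = [expected_T(N) / (N * math.log(N)) for N in (10, 1000, 100000)]
```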

Step 3. By independence, the variance of $T_N$ is $$\mathrm{Var}(T_N) = \sum_{k=1}^{N} \mathrm{Var}(\tau_k) = \sum_{k=1}^{N} \frac{1-p_k}{p_k^2} < \sum_{k=1}^{N} \frac{1}{p_k^2} = \sum_{j=1}^{N} \frac{N^2}{j^2} = N^2 \sum_{j=1}^{N} \frac{1}{j^2}.$$ Since the series $\sum_{j=1}^{\infty} \frac{1}{j^2}$ converges to $\frac{\pi^2}{6}$, there exists a constant $C$ (e.g. $C = \pi^2/6$) such that $\mathrm{Var}(T_N) \le C N^2$.
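The variance bound of Step 3 also admits a direct numerical check. Again an illustrative aside with my own naming, computing the exact stage-wise variance sum and comparing it with $\frac{\pi^2}{6}N^2$:

```python
import math

def var_T(N):
    """Exact Var(T_N) = sum_{k=1}^N (1 - p_k) / p_k^2 with p_k = (N - k + 1) / N."""
    total = 0.0
    for k in range(1, N + 1):
        p = (N - k + 1) / N
        total += (1 - p) / p ** 2
    return total

# Step 3's bound: Var(T_N) < (pi^2 / 6) * N^2 for every N.
bound_holds = all(var_T(N) < (math.pi ** 2 / 6) * N ** 2 for N in (10, 100, 1000))
```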

Step 4. Applying Chebyshev's inequality, for any $\epsilon > 0$: $$P\left( \left| \frac{T_N - E[T_N]}{N \ln N} \right| \ge \epsilon \right) = P\left( \left| T_N - E[T_N] \right| \ge \epsilon N \ln N \right) \le \frac{\mathrm{Var}(T_N)}{\epsilon^2 (N \ln N)^2} \le \frac{C N^2}{\epsilon^2 N^2 (\ln N)^2} = \frac{C}{\epsilon^2 (\ln N)^2}.$$ As $N \to \infty$, the right-hand side tends to $0$, so $\frac{T_N - E[T_N]}{N \ln N} \xrightarrow{P} 0$. Combined with the conclusion of Step 2 that $\frac{E[T_N]}{N \ln N} \to 1$, we obtain $$\frac{T_N}{N \ln N} = \frac{T_N - E[T_N]}{N \ln N} + \frac{E[T_N]}{N \ln N} \xrightarrow{P} 1.$$
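The concentration proved in Step 4 can be observed empirically, though the Chebyshev bound $C/(\epsilon^2(\ln N)^2)$ decays only logarithmically in $N$, so the empirical tail probability shrinks slowly. A Monte Carlo sketch under the staged construction of Step 1 (names my own):

```python
import math
import random

def sample_T(N, rng):
    """One sample of T_N via the staged decomposition of Step 1:
    a sum of independent Geometric(p_k) waits, p_k = (N - k + 1) / N."""
    total = 0
    for k in range(1, N + 1):
        p = (N - k + 1) / N
        wait = 1
        while rng.random() >= p:
            wait += 1
        total += wait
    return total

def tail_frequency(N, eps, runs=1000, seed=0):
    """Empirical estimate of P(|T_N / (N ln N) - 1| >= eps)."""
    rng = random.Random(seed)
    scale = N * math.log(N)
    hits = sum(abs(sample_T(N, rng) / scale - 1) >= eps for _ in range(runs))
    return hits / runs
```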

Final answer

Therefore $\frac{T_N}{N \ln N} \xrightarrow{P} 1$. QED.

Marking scheme

The following is the detailed marking rubric based on the official solution:

1. Checkpoints (Total: 7 points)

Score exactly one chain; take the maximum subtotal among chains; do not add points across chains.

Chain A: Geometric distribution decomposition and Chebyshev's inequality (official solution)

  • Geometric distribution modeling and decomposition (2 points) [additive]
      • Decompose the total count $T_N$ into a sum of stage-wise waiting times $T_N = \sum_{k=1}^N \tau_k$ (other notation such as $X_k$ is acceptable), and explicitly state that the $\tau_k$ are mutually independent and geometrically distributed. (1 point)
      • Correctly specify the parameter: state that the success probability in the $k$-th stage (collecting the $k$-th distinct element) is $p_k = \frac{N-k+1}{N}$ or an equivalent form. (1 point)
  • Computation and asymptotics of the expectation (2 points) [additive]
      • Using linearity of expectation, compute $E[T_N] = \sum_k \frac{1}{p_k} = N \sum_{j=1}^N \frac{1}{j}$. (1 point)
      • Using the harmonic-series asymptotics $\sum_{j=1}^N \frac{1}{j} \sim \ln N$, explicitly conclude that $\lim_{N \to \infty} \frac{E[T_N]}{N \ln N} = 1$ (or $E[T_N] = N \ln N + O(N)$). (1 point)
  • Order-of-magnitude estimate for the variance (2 points) [additive]
      • Using independence, write the variance as a sum: $\mathrm{Var}(T_N) = \sum_{k=1}^N \mathrm{Var}(\tau_k) = \sum_k \frac{1-p_k}{p_k^2}$ (or the bound $\le \sum_k \frac{1}{p_k^2}$). (1 point)
      • Using the convergence of $\sum_{j=1}^\infty \frac{1}{j^2}$, show that $\mathrm{Var}(T_N)$ is of order at most $N^2$ (i.e., $\mathrm{Var}(T_N) \le C N^2$ or $O(N^2)$). (1 point)
      *(Note: If the student does not perform the bounding step but correctly computes the exact expression involving $\sum \frac{1}{j^2}$ and identifies its leading term as $O(N^2)$, full credit is awarded.)*
  • Proof of convergence in probability (1 point) [additive]
      • Apply Chebyshev's inequality, substitute the variance bound obtained above, and prove that $\frac{T_N - E[T_N]}{N \ln N} \xrightarrow{P} 0$. (1 point)
      *(Note: If the student has shown that the centered quantity converges in probability to 0, and in the expectation step has shown that the mean ratio converges to 1, this point is awarded directly; the conclusion must be based on the preceding order-of-magnitude estimates.)*

Total (max 7)

2. Zero-credit items

  • Merely copying the problem statement, definitions, or formulas without performing any specific substitution or derivation for this problem.
  • Asserting $T_N \approx N \ln N$ based solely on intuition, without the geometric-distribution decomposition or series summation as justification.
  • Stating Chebyshev's inequality without computing $E[T_N]$ or $\mathrm{Var}(T_N)$, leaving the right-hand side of the inequality unanalyzable in the limit.
  • Citing the conclusion of the "coupon collector problem" without any derivation.

3. Deductions

  • Missing independence (-1): when computing $\mathrm{Var}(\sum_k \tau_k) = \sum_k \mathrm{Var}(\tau_k)$, failing to state, or implicitly justify, the independence of the $\tau_k$.
  • Parameter logic error (-1): the geometric parameter $p_k$ is seriously wrong (e.g., taken constant, or with the wrong monotonicity in $k$), preventing the subsequent series analysis.
  • Confusing modes of convergence (-1): proving only convergence of expectations rather than convergence in probability (e.g., showing $\lim_N E[\frac{T_N}{N \ln N}] = 1$ without discussing variance or probability concentration).
  • Serious series-estimation error (cap at 4/7): claiming that the harmonic series $\sum_j \frac{1}{j}$ converges (hence $E[T_N] = O(N)$), which invalidates the entire asymptotic analysis.
  • Logical gap (-1): in the final step, drawing a conclusion directly from $\mathrm{Var}(T_N) \to \infty$ (true, but unnormalized), or failing to compare the variance against the squared denominator $(N \ln N)^2$.