Exploring Genetic Variants with Markov Chain Monte Carlo: A Deep Dive into Metropolis-Hastings Simulations with Python
A stochastic process is a mathematical concept used to describe systems or phenomena that evolve over time in a way that is inherently random or uncertain. It's a collection of random variables, typically indexed by time, representing the evolution of some random value or system state over time. Stochastic processes are fundamental in probability theory and are widely used in various fields like finance, economics, physics, biology, and engineering.
Key Characteristics of Stochastic Processes:
1. Randomness: The core feature of a stochastic process is its inherent randomness. Future states of the process are generally unpredictable and are described in probabilistic terms.
2. Time Dependence: Stochastic processes are defined across time. The time parameter can be continuous (e.g., every fraction of a second) or discrete (e.g., days, months).
3. Index Set: The collection of times at which the process is observed or defined. This could be a discrete set (like integers) or continuous (like real numbers).
4. State Space: The set of possible values that each random variable in the process can take. For example, the stock price at any given time is a real number, so the state space is the set of real numbers.
Examples of Stochastic Processes:
1. Random Walk: A simple stochastic process where each step is determined by a random event, such as flipping a coin (a minimal simulation sketch follows this list).
2. Brownian Motion: A continuous stochastic process often used to model the random motion of particles suspended in a fluid.
3. Poisson Process: Used to model random events that happen independently of each other over time, such as phone calls arriving at a call center.
4. Markov Chains: A stochastic process where the future state depends only on the current state and not on the past states.
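To make the first example concrete, here is a minimal sketch (assuming NumPy and an arbitrary seed) of a symmetric random walk in which each step is a fair coin flip:

```python
import numpy as np

# A symmetric random walk: each step is +1 or -1 with equal probability,
# and the position is the running sum of the steps.
rng = np.random.default_rng(42)          # seed chosen only for reproducibility
steps = rng.choice([-1, 1], size=1000)   # 1000 coin-flip steps
positions = np.cumsum(steps)             # position of the walk after each step

print("Final position after 1000 steps:", positions[-1])
```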
Types of Stochastic Processes:
1. Discrete-Time vs. Continuous-Time: In discrete-time processes, observations are made at specific times (e.g., daily stock prices). Continuous-time processes involve continuous monitoring over time (e.g., Brownian motion).
2. Discrete-State vs. Continuous-State: Discrete-state processes have a finite or countable number of states (e.g., Markov chains), while continuous-state processes have an infinite number of possible states (e.g., Brownian motion).
3. Independent vs. Dependent: Independent processes have increments that are independent of each other (e.g., Poisson process), while dependent processes have increments that are correlated.
Application of MCMC in Stochastic Processes:
Markov Chain Monte Carlo (MCMC) methods are particularly well-suited for dealing with stochastic processes, especially in complex or high-dimensional contexts where traditional analytical or numerical approaches are impractical. Here are some applications:
1. Parameter Estimation: In many stochastic processes, certain parameters may be unknown. MCMC can be used to estimate these parameters by sampling from their posterior distributions, especially in a Bayesian framework.
2. Model Calibration: MCMC can calibrate stochastic models to observed data, ensuring that the model accurately reflects the real-world system it's meant to represent.
3. Prediction and Forecasting: MCMC methods can be used to generate predictive distributions, giving a probabilistic forecast of a stochastic process's future behavior.
4. Uncertainty Quantification: In stochastic modeling, quantifying the uncertainty inherent in predictions is crucial. MCMC can help quantify this uncertainty.
5. Bayesian Inference: MCMC is a key tool in Bayesian inference, which is often used in the context of stochastic processes to update the probability for a hypothesis as more evidence or information becomes available.
6. Complex Systems: In fields like computational biology or finance, where systems can be highly complex and data-intensive, MCMC provides a means to explore the behavior of these systems under uncertainty.
Applications of MCMC in Genetics
MCMC methods have significant applications in biology, particularly in modeling population dynamics, epidemiology, and genetic variation. One detailed example is the use of MCMC in inferring dependent population dynamics from genetic data within the coalescent framework.
In the coalescent framework, the genealogy of a sample of individuals from a population is represented as a timed and rooted binary tree. This model considers the ancestry of these individuals, tracing back in time to their common ancestors. The rate at which pairs of lineages merge into a common ancestor (and hence the coalescent times) depends on the effective population size (EPS), which reflects the level of genetic diversity within the population.
MCMC methods are used to estimate the EPS from genetic data, where Bayesian inference is employed. This involves assuming a prior distribution on the EPS and updating this distribution based on the observed genetic data to obtain a posterior distribution. The posterior distribution provides insights into the population size dynamics over time.
The process involves modeling the genealogies of samples from different populations and accounting for the dependencies between them. This helps in estimating the population sizes and understanding the relationship between different populations.
This approach also incorporates other types of data, like temporal sampling information, to improve the accuracy of estimates and reduce bias. Such advanced methods are crucial in understanding complex biological phenomena, especially in the context of rapidly evolving pathogens and changing population dynamics.
Monte Carlo Simulations
Monte Carlo simulations are a versatile and powerful statistical tool used across various scientific disciplines. They involve using random sampling techniques to model and analyze complex systems and processes that are difficult to solve analytically.
The core principle behind Monte Carlo simulations is the generation of a large number of random samples to represent the possible outcomes of a process or system. By analyzing these samples, researchers can estimate and predict the behavior of complex systems under different conditions. This approach is especially beneficial in situations involving uncertainty and variability.
The applications of Monte Carlo simulations are diverse and cover areas such as financial risk assessment, engineering design, computational physics, and environmental modeling. By providing a way to simulate and analyze complex phenomena, Monte Carlo simulations have become an indispensable tool in modern scientific research and decision-making.
Monte Carlo simulations involve several detailed steps, each crucial for accurate modeling and analysis:
1. Problem Definition: Clearly articulate the problem. Identify the variables and their probabilistic behavior.
2. Generate Random Inputs: Using random number generators, create inputs that reflect the probability distributions of the variables. This randomness emulates the variability inherent in real-world scenarios.
3. Perform Simulations: Apply the model to the random inputs, repeating this process many times. Each run represents a different possible outcome, given the probabilistic nature of the inputs.
4. Aggregate and Analyze Results: Collect the outcomes from all runs and perform statistical analysis. This could include calculating averages, variances, and other statistical measures.
5. Interpretation: Interpret the results in the context of the problem. This step involves understanding what the simulation results mean in real-world terms.
6. Verification and Validation: Ensure the model accurately reflects the system being simulated and that the results are consistent with known data.
For a practical example, consider estimating the value of Pi using a Monte Carlo method. The logic behind this approach involves randomly generating points in a square and determining how many fall within a quarter circle inscribed in the square. The ratio of the number of points inside the circle to the total number of points, multiplied by 4, gives an estimate of Pi.
Below is Python code, using NumPy and Matplotlib, that runs this simulation and produces a graphical output.
import numpy as np
import matplotlib.pyplot as plt
# Define the number of points
num_points = 10000
# Generate random points
x = np.random.rand(num_points)
y = np.random.rand(num_points)
# Calculate the distance from the origin (0,0)
distance = np.sqrt(x**2 + y**2)
# Determine the points within the quarter circle
inside_circle = distance <= 1
# Estimate Pi
pi_estimate = np.sum(inside_circle) / num_points * 4
# Plotting the points
plt.figure(figsize=(8,8))
plt.scatter(x[inside_circle], y[inside_circle], color='blue', marker='.', label='Inside Circle')
plt.scatter(x[~inside_circle], y[~inside_circle], color='red', marker='.', label='Outside Circle')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Monte Carlo Simulation for Estimating Pi')
plt.legend()
plt.axis('equal')
plt.show()
print(f"Estimated value of Pi: {pi_estimate}")  # display the estimate
The Python code above simulates the estimation of Pi using a Monte Carlo method. In the graphical output:
- Blue points represent random points that fall inside the quarter circle.
- Red points are those that fall outside the quarter circle.
The process involves generating random points within a unit square (0 to 1 on both x and y axes) and calculating the proportion of those points that fall within a quarter circle inscribed within the square. This proportion, when multiplied by 4, gives an estimate of Pi.
In this simulation with 10,000 points, the estimated value of Pi is approximately 3.1236. The accuracy of this estimate increases with the number of points used in the simulation. The graph visually depicts how the Monte Carlo method randomly samples points and uses their distribution to estimate a value.
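As a quick check of this convergence, the same estimator can be rerun with increasing sample sizes; the sketch below (sample sizes chosen arbitrarily) simply repeats the calculation and prints each estimate:

```python
import numpy as np

# Repeat the quarter-circle estimator with increasing numbers of points.
# On average, the estimates get closer to Pi as the sample grows.
for n in [100, 1_000, 10_000, 100_000]:
    x = np.random.rand(n)
    y = np.random.rand(n)
    inside = (x**2 + y**2) <= 1
    print(f"n = {n:>7}: Pi estimate = {4 * inside.mean():.4f}")
```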
Markov Chain Monte Carlo (MCMC) is a method used in computational statistics to sample from a probability distribution when direct sampling is challenging. It's particularly useful in high-dimensional spaces and in Bayesian statistics. The logic behind MCMC involves constructing a Markov chain that has the desired distribution as its equilibrium or stationary distribution.
Difference Between MCMC and Traditional Monte Carlo Simulations:
1. Monte Carlo Simulations:
- Directly sample from the probability distribution of interest.
- Work well when it's easy to generate samples from the distribution.
- Often used for problems with independent random variables.
2. Markov Chain Monte Carlo (MCMC):
- Construct a Markov chain whose stationary distribution is the target distribution.
- Each sample depends on the previous one, hence "chain".
- Effective in complex, high-dimensional spaces where direct sampling is difficult.
- Commonly used in Bayesian statistics to compute posterior distributions.
Benefits of MCMC Over Traditional Monte Carlo:
1. Dealing with Complex Distributions: MCMC excels in situations where the probability distribution is complex or high-dimensional, making direct sampling infeasible.
2. Efficiency in High-Dimensional Spaces: MCMC can be more efficient than traditional Monte Carlo in high-dimensional spaces, as it can explore the space more effectively.
3. Bayesian Inference: MCMC is particularly powerful for Bayesian inference, where the goal is to update the probability for a hypothesis as more evidence or information becomes available.
4. Flexibility: MCMC methods are flexible and can be adapted to a wide range of problems.
5. Convergence to True Distribution: MCMC algorithms, given enough time, can produce samples whose distribution is arbitrarily close to the target distribution.
Formulation and Logic Behind MCMC:
Definition of a Markov Chain:
1. Sequence of Random Variables: A Markov chain is a series of random variables X1, X2, X3, …, where each Xi is the state of the system at step i. A step is a discrete time point in this sequence: if we view the chain as a process evolving over time, each step corresponds to a time unit at which a transition (a change of state) can occur. For example, in a simple weather model, Xi could represent the weather condition (like "sunny" or "rainy") on day i. At each step, the system transitions from its current state to a new state, and the probability of this transition depends only on the current state, not on the path the system took to reach it, adhering to the Markov property. As another example, consider a Markov chain modeling a board game where each step is a player's turn: at each turn, the state of the game (such as the player's position on the board) changes according to transition probabilities determined by the roll of the dice.
2. Memoryless Property: The key property of a Markov chain is its "memoryless" nature, meaning the probability of transitioning to the next state depends only on the current state and not on the sequence of events that preceded it. Mathematically, this is written as:
P(Xn+1=x∣Xn=xn,Xn−1=xn−1,…,X1=x1)=P(Xn+1=x∣Xn=xn)
This equation states that the future state Xn+1 depends only on the present state Xn and not on the past states.
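To make this concrete, here is a small sketch of the two-state weather chain mentioned above (the transition probabilities are invented for illustration); each new state is drawn using only the current state, which is exactly the memoryless property:

```python
import numpy as np

states = ["sunny", "rainy"]
# Hypothetical transition matrix: row = current state, column = next state.
# E.g., if today is sunny, tomorrow is sunny with probability 0.8.
P = np.array([[0.8, 0.2],
              [0.4, 0.6]])

rng = np.random.default_rng(0)
state = 0  # start in "sunny"
trajectory = [states[state]]
for _ in range(10):
    # The next state depends only on the current state (Markov property).
    state = rng.choice(2, p=P[state])
    trajectory.append(states[state])

print(" -> ".join(trajectory))
```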
Markov Chain Formulation for MCMC:
1. Target Distribution: In MCMC, the aim is to construct a Markov chain whose equilibrium (or stationary) distribution is the probability distribution from which we want to sample (e.g., a posterior distribution in Bayesian inference).
2. Convergence: The Markov chain is designed to eventually converge to this target distribution, regardless of where it starts. Convergence means that, after a sufficient number of steps, the distribution of the states of the chain approximates the target distribution closely.
Transition Probabilities:
1. Defining Transitions: Transition probabilities are the core of a Markov chain. They define the likelihood of moving from one state to another.
2. Matrix Representation: These probabilities are often represented in a matrix called a transition matrix, where each element Pij represents the probability of moving from state i to state j.
3. Detailed Balance: For MCMC, it's crucial to maintain the detailed balance condition. This condition ensures that the chain will be in equilibrium with the target distribution. The detailed balance condition is given by:
π(i)P(i→j)=π(j)P(j→i)
Here, π(i) is the probability of being in state i in the stationary distribution, and P(i→j) is the transition probability from state i to state j. This equation means that for any two states i and j, the probability of moving from i to j is the same as moving from j to i, when weighted by the stationary probabilities of being in each of those states.
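A small numerical sketch (using an invented two-state chain) shows how the detailed balance condition can be checked once the stationary distribution is known:

```python
import numpy as np

# Hypothetical two-state transition matrix P.
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# Solve pi P = pi with pi summing to 1 (left eigenvector for eigenvalue 1).
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.isclose(eigvals, 1)][:, 0])
pi = pi / pi.sum()

# Check detailed balance: pi(i) * P(i -> j) should equal pi(j) * P(j -> i).
for i in range(2):
    for j in range(2):
        print(i, j, np.isclose(pi[i] * P[i, j], pi[j] * P[j, i]))
```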
Ensuring Convergence to the Target Distribution:
- Ergodicity: The chain must be ergodic; it should be able to reach any state from any other state, given enough time.
- Detailed Balance: A condition where, in the stationary distribution, the probability flux from any state i to any state j equals the probability flux from j back to i.
- Formulation: For a transition from state i to state j, the detailed balance is mathematically expressed as π(i)P(i → j) = π(j)P(j → i), where π is the stationary distribution and P(i → j) is the transition probability.
Sampling From the Markov Chain:
- Burn-In Period: Initially, the chain may not follow the target distribution. Samples collected during this period, called the "burn-in" period, are often discarded.
- Collecting Samples: After the burn-in period, samples are collected from the chain. These samples are used to approximate the target distribution.
Dealing with Autocorrelation:
- Thinning: Sometimes, only every nth sample is taken to reduce autocorrelation.
- Analysis: Autocorrelation can be analyzed using autocorrelation functions and plots.
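A minimal post-processing sketch (with an assumed `samples` array and arbitrary burn-in and thinning settings) looks like this:

```python
import numpy as np

# Assume `samples` is the array returned by an MCMC run.
samples = np.random.randn(10_000)  # placeholder chain for illustration only

burn_in = 1_000  # discard the first 1,000 samples (chosen arbitrarily)
thin = 10        # keep every 10th sample to reduce autocorrelation

kept = samples[burn_in::thin]
print("Samples kept after burn-in and thinning:", kept.size)
```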
Evaluating Convergence:
- Diagnostic Tests: Various statistical tests (like Gelman-Rubin) and visual methods (like trace plots) are used to assess convergence.
- Multiple Chains: Running multiple chains from different starting points can help in assessing convergence.
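For example, a basic Gelman-Rubin statistic (often written R-hat) can be computed from several chains of equal length; the sketch below uses placeholder chains and the standard between-/within-chain variance formula:

```python
import numpy as np

def gelman_rubin(chains):
    """Basic Gelman-Rubin R-hat for a list of equal-length 1-D chains."""
    chains = np.asarray(chains)                    # shape (m chains, n samples)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()          # within-chain variance
    B = n * chain_means.var(ddof=1)                # between-chain variance
    var_hat = (n - 1) / n * W + B / n              # pooled variance estimate
    return np.sqrt(var_hat / W)

# Placeholder chains for illustration; real chains come from separate MCMC runs.
chains = [np.random.randn(5_000) for _ in range(4)]
print("R-hat:", gelman_rubin(chains))  # values close to 1 suggest convergence
```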
Statistical Inference:
- Estimation: Use the samples to estimate quantities of interest (like mean, variance, etc.).
- Posterior Distributions: In Bayesian statistics, MCMC is used to sample from posterior distributions.
Common MCMC Algorithms:
1. Metropolis-Hastings Algorithm:
- Proposes a new state and accepts it with a probability that ensures detailed balance.
- Acceptance probability is min(1, [π(j)Q(j → i)] / [π(i)Q(i → j)]), where π is the target distribution and Q is the proposal distribution.
2. Gibbs Sampling:
- Special case of Metropolis-Hastings.
- Samples each variable in turn, conditional on all other variables.
- Useful in high-dimensional problems (a minimal sketch follows this list).
3. Hamiltonian Monte Carlo:
- Incorporates information about the structure of the space being sampled.
- Reduces random walk behavior and improves efficiency.
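As mentioned above, here is a minimal Gibbs sampling sketch for a bivariate standard normal with correlation rho, where each full conditional is a univariate normal (the value of rho and the sample size are chosen only for illustration):

```python
import numpy as np

# Gibbs sampler for a bivariate standard normal with correlation rho.
# Each variable is updated from its full conditional:
#   x | y ~ N(rho * y, 1 - rho^2)   and   y | x ~ N(rho * x, 1 - rho^2)
rho = 0.8
n_samples = 5_000
rng = np.random.default_rng(1)

x, y = 0.0, 0.0
samples = np.empty((n_samples, 2))
for t in range(n_samples):
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))
    samples[t] = (x, y)

print("Empirical correlation:", np.corrcoef(samples.T)[0, 1])
```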
Python Implementation of MCMC with Metropolis-Hastings Sampling
The Metropolis-Hastings algorithm
The Metropolis-Hastings algorithm creates a Markov chain that has the target probability distribution as its equilibrium distribution. It does this by generating a sequence of sample values in such a way that, even though the initial samples may be drawn from a distribution that is significantly different from the target distribution, as more samples are drawn, the distribution of values converges to the target distribution.
Formulation and Steps:
1. Initial Setup:
- Start with an Initial Guess: Begin with an arbitrary point x0 from the state space.
- Choose a Proposal Distribution: Select a proposal distribution, Q(x,y), which suggests a new state y given the current state x.
2. Iterative Process:
- For each iteration t:
- Generate a Candidate: From the current position xt, generate a candidate y for the next sample using the proposal distribution Q(xt,y).
- Calculate Acceptance Probability: Compute the acceptance probability, A(xt,y), which is the probability of moving to the candidate state y. This is given by:
A(xt, y) = min(1, [P(y) Q(y, xt)] / [P(xt) Q(xt, y)])
Here, P is the target distribution, and Q is the proposal distribution.
- Accept or Reject: Generate a uniform random number u from the interval [0, 1]. If u ≤ A(xt, y), accept the candidate by setting xt+1 = y; otherwise, reject the candidate and set xt+1 = xt.
3. Convergence:
- After many iterations, the distribution of the values xt converges to the target distribution P.
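Before moving to the genetics example, here is a minimal sketch of these steps for a simple one-dimensional target (a standard normal density) with a Gaussian random-walk proposal; the names and settings are chosen only for illustration:

```python
import numpy as np

# Target density P: standard normal (an unnormalized density is sufficient).
def target(x):
    return np.exp(-0.5 * x**2)

rng = np.random.default_rng(0)
x = 0.0        # initial guess x0
samples = [x]
for _ in range(10_000):
    y = x + rng.normal(0, 1)             # candidate from a symmetric proposal Q
    A = min(1.0, target(y) / target(x))  # acceptance probability (Q terms cancel)
    if rng.random() < A:                 # accept with probability A
        x = y
    samples.append(x)

samples = np.array(samples)
print("Sample mean:", samples.mean(), "Sample std:", samples.std())
```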
Key Features:
- Proposal Distribution: Can be any distribution, but its choice significantly affects the efficiency of the algorithm. Common choices include Gaussian distributions centered at the current state.
- Detailed Balance: The algorithm maintains the detailed balance condition, ensuring that the stationary distribution of the Markov chain is indeed the target distribution.
- Flexibility: It's versatile and can be used for a wide range of target distributions.
Challenges:
- Convergence: Determining when the algorithm has converged to the target distribution can be non-trivial.
- Dependence on Proposal Distribution: The efficiency of the algorithm depends heavily on the choice of the proposal distribution.
- Burn-In Period: The initial samples (before the chain converges) might need to be discarded, and determining the length of this burn-in period can be challenging.
To use the Metropolis-Hastings algorithm for simulating variant allele distributions, we consider two distinct scenarios: one for a diseased population and one for a healthy population. We will simulate how a particular genetic variant is distributed in each population and compare the results.
Scenario and Assumptions:
1. Target Distributions:
- Diseased Population: Assume the variant allele has a higher frequency and variation in the diseased population, possibly due to its association with the disease.
- Healthy Population: The variant allele has a lower frequency and less variation in the healthy population.
2. Proposal Distribution:
- We use a normal distribution centered at the current state as the proposal. In principle, its standard deviation could differ between the two populations to reflect their different variability in allele frequency, although the example below uses a common step size.
Formulating Target Distributions:
We will model the count of the variant allele in each population using a Poisson distribution with a different rate parameter (λ) per population; for a Poisson distribution, the mean and variance both equal λ:
- Diseased Population: higher rate (λ = 120), hence a higher mean and variance.
- Healthy Population: lower rate (λ = 80), hence a lower mean and variance.
Implementation:
We implement a log-space Metropolis-Hastings sampler, `metropolis_hastings_log`, then define the target distribution for each population and compare the results.
import numpy as np

def metropolis_hastings_log(p_log, q_sample, x_init, iterations):
    x = x_init
    samples = [x]
    for _ in range(iterations):
        x_candidate = q_sample(x)                       # propose a new state
        acceptance_log = p_log(x_candidate) - p_log(x)  # log acceptance ratio
        if np.log(np.random.rand()) < acceptance_log:   # accept or reject
            x = x_candidate
        samples.append(x)                               # record the current state
    return np.array(samples)
1. p_log: This parameter is a function that calculates the logarithm of the target probability distribution (`p`) at a given point. Since direct computation of probabilities can lead to numerical underflow or overflow for extreme values, using the logarithmic form mitigates these issues.
2. q_sample: This function generates a candidate sample for the next state of the Markov chain, based on the current state. It's part of the proposal distribution mechanism in the Metropolis-Hastings algorithm.
3. x_init: The initial starting point of the Markov chain. This value can be arbitrarily chosen, though the choice can affect the convergence time of the algorithm.
4. iterations: The number of iterations or steps to run the Metropolis-Hastings algorithm. Each iteration represents a move (or potential move) within the Markov chain.
The function starts with the initial sample `x_init` and a list `samples` to store the chain's states.
The main loop runs for the specified number of `iterations`. In each iteration, it performs the following steps:
1. Generate Candidate: `x_candidate` is generated using `q_sample`, which proposes a new state based on the current state `x`.
2. Compute Acceptance Log Probability: `acceptance_log` calculates the logarithm of the acceptance probability. It's the difference between the log probabilities of the candidate state and the current state.
3. Accept or Reject: A random number is drawn, and its logarithm is compared with `acceptance_log`. If the log of the random number is less than `acceptance_log`, the candidate state is accepted (the chain moves to `x_candidate`); otherwise, it remains in the current state.
4. Append to Samples: Whether accepted or not, the state (new or current) is appended to the `samples` list.
5. Return Samples: After completing all iterations, the function returns the `samples` list as a NumPy array, representing the states of the Markov chain.
Key Points:
- Using logarithms for probabilities allows the algorithm to handle cases where probabilities are very small or large, improving numerical stability and preventing issues like underflow or overflow.
- The function effectively simulates a Markov chain whose stationary distribution is the target distribution, useful in various statistical and machine learning applications, especially in scenarios with complex or unknown probability distributions.
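A tiny sketch illustrates the first point: multiplying many small probabilities underflows to zero in floating point, whereas summing their logarithms stays finite:

```python
import numpy as np

# 1,000 small probabilities, each 1e-5.
probs = np.full(1000, 1e-5)

direct_product = np.prod(probs)   # underflows to exactly 0.0
log_sum = np.sum(np.log(probs))   # stays finite: 1000 * log(1e-5)

print("Direct product:", direct_product)
print("Sum of logs:   ", log_sum)
```

With that in mind, the full example below defines the proposal, the two Poisson log-targets, runs the sampler for each population, and plots the results.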
import math
import numpy as np
import matplotlib.pyplot as plt

def proposal_sample(x):
    step = np.random.normal(0, 10)       # random step from a normal distribution
    return max(min(x + step, 2 * N), 0)  # keep the new sample within [0, 2N]

# Logarithm of the target distributions for allele counts
# (Poisson log-probability: x*log(lambda) - lambda - log(x!))
def log_target_disease(x):
    if 0 <= x <= 2 * N:
        return x * math.log(lambda_disease) - lambda_disease - sum(math.log(i) for i in range(1, int(x) + 1))
    else:
        return -float('inf')  # logarithm of 0

def log_target_healthy(x):
    if 0 <= x <= 2 * N:
        return x * math.log(lambda_healthy) - lambda_healthy - sum(math.log(i) for i in range(1, int(x) + 1))
    else:
        return -float('inf')

# Model parameters (assumed for illustration)
N = 100               # maximum allele count is 2N
n_iterations = 10000  # number of MCMC iterations
lambda_disease = 120  # Poisson parameter for the diseased population
lambda_healthy = 80   # Poisson parameter for the healthy population

# Running the log-space Metropolis-Hastings algorithm for each population
samples_disease = metropolis_hastings_log(log_target_disease, proposal_sample, N, n_iterations)
samples_healthy = metropolis_hastings_log(log_target_healthy, proposal_sample, N, n_iterations)

# Plotting the results
plt.figure(figsize=(12, 6))
plt.hist(samples_disease, bins=range(2 * N + 1), density=True, alpha=0.6, color='red', label='Diseased Population')
plt.hist(samples_healthy, bins=range(2 * N + 1), density=True, alpha=0.6, color='blue', label='Healthy Population')
plt.title('Comparison of Variant Allele Counts in Diseased vs Healthy Populations')
plt.xlabel('Allele Count')
plt.ylabel('Density')
plt.legend()
plt.grid(True)
plt.show()
The simulation results compare the distribution of variant allele counts in the diseased and healthy populations:
- Red Histogram (Diseased Population): This shows the distribution of allele counts in the diseased population, with a higher mean reflecting the assumption of a higher prevalence of the variant allele.
- Blue Histogram (Healthy Population): This represents the distribution in the healthy population, with a lower mean allele count, indicating a less frequent occurrence of the variant allele.
### Interpretation:
- The histograms visually contrast the variant allele distributions between the two populations.
- The diseased population histogram, with a higher mean, suggests a potential association between the genetic variant and the disease.
- In contrast, the healthy population histogram indicates a lower prevalence of the variant allele, as expected in the absence of disease association.
### Conclusion:
This simulation exemplifies how MCMC methods, specifically the Metropolis-Hastings algorithm, can be used to model and compare genetic variant distributions between different populations. Such analyses are crucial in genetic epidemiology for understanding the genetic basis of diseases and can inform future research and clinical strategies.