The post Finite Difference Schemes appeared first on Ben's Planet.

The method derives directly from the mathematical definition of a derivative. Say you start for example with the first-order derivative of some function $f(x)$ as the limit

$$

\frac{df}{dx}(x) = f^\prime(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}

$$

Then $f^\prime(x)$ can be approximated by the fraction on the right-hand side, choosing an appropriately small $h$.

$$

f^\prime(x) = \frac{1}{h} (f(x+h) - f(x)) + \mathcal{O}(h)

$$

Since no limit is taken, the difference $f(x+h) - f(x)$ between the function values remains finite, which is why it is called the “Finite Difference Method”.

The example above is a so-called “forward finite difference”, since one of the function values is advanced by $h$ and the corresponding numerical finite difference scheme is named “Euler method”. A variation of this is the “backward finite difference”

$$

f^\prime(x) = \frac{1}{h} (f(x) - f(x-h)) + \mathcal{O}(h)

$$

Both schemes allow for solving differential equations rather quickly but might suffer from numerical instabilities (such as violation of energy conservation) and larger numerical errors of order $\mathcal{O}(h)$.

Therefore, in practice slightly more advanced finite difference schemes are employed such as the central difference scheme

$$

f^\prime(x) = \frac{1}{\tilde{h}} (f(x+\tilde{h}/2) - f(x-\tilde{h}/2)) + \mathcal{O}(\tilde{h}^2)

$$

or, by choosing $h = \frac{\tilde{h}}{2}$, we obtain

$$

f^\prime(x) = \frac{1}{2h} (f(x+h) - f(x-h)) + \mathcal{O}(h^2)

$$

Central differences, despite being slightly more elaborate, have the huge advantage of being symmetric in $\pm h$ and giving a more accurate approximation, with the numerical error being of order $\mathcal{O}(h^2)$.
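These error orders are easy to verify numerically. A minimal Python sketch (function and variable names are our own, and $f(x) = \sin(x)$ is an arbitrary test function):

```python
import math

def forward_diff(f, x, h):
    # forward finite difference, truncation error O(h)
    return (f(x + h) - f(x)) / h

def central_diff(f, x, h):
    # central finite difference, truncation error O(h^2)
    return (f(x + h) - f(x - h)) / (2 * h)

# test function f(x) = sin(x) with exact derivative cos(x)
x, h = 1.0, 1e-4
err_fwd = abs(forward_diff(math.sin, x, h) - math.cos(x))
err_ctr = abs(central_diff(math.sin, x, h) - math.cos(x))
```

For $h = 10^{-4}$ the central scheme turns out several orders of magnitude more accurate, in line with the $\mathcal{O}(h)$ vs. $\mathcal{O}(h^2)$ error estimates.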

What about second-order derivatives? We can approximate those from our first-order approximation via

$$

\begin{aligned}

\frac{d^2f}{dx^2}(x) &= f^{\prime\prime}(x) = \frac{d}{dx}\frac{df}{dx} \approx \frac{d}{dx} \left( \frac{f(x+h) - f(x-h)}{2h} \right) \\

&\approx \frac{1}{2h} \left( \frac{f(x+2h) - f(x)}{2h} - \frac{f(x) - f(x-2h)}{2h} \right) \\ &= \frac{1}{(2h)^2}\left(f(x+2h) - 2f(x) + f(x-2h) \right)

\end{aligned}

$$

or alternatively by redefining $h$ like above

$$

f^{\prime\prime}(x) \approx \frac{1}{h^2} \left(f(x+h) - 2f(x) + f(x-h) \right)

$$
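This three-point stencil can be checked against a function with a known second derivative (a minimal sketch; names and the test function are our own choices):

```python
import math

def second_diff(f, x, h):
    # central three-point stencil for f'', truncation error O(h^2)
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

# f(x) = sin(x) has f''(x) = -sin(x)
x, h = 1.0, 1e-3
err = abs(second_diff(math.sin, x, h) - (-math.sin(x)))
```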

In the same way, mixed derivatives are handled

$$

\frac{\partial^2 f}{\partial x \partial y} \approx \frac{1}{2h} \left(\left(\frac{\partial f}{\partial y}\right)_{x+h, y} - \left(\frac{\partial f}{\partial y}\right)_{x-h, y}\right)

$$

with

$$

\left(\frac{\partial f}{\partial y}\right)_{x+h, y} \approx \frac{f(x+h, y+h) - f(x+h, y-h)}{2 h}

$$

$$

\left(\frac{\partial f}{\partial y}\right)_{x-h, y} \approx \frac{f(x-h, y+h) - f(x-h, y-h)}{2 h}

$$

finally resulting in

$$

\begin{aligned}

\frac{\partial^2 f}{\partial x \partial y} \approx & \frac{1}{4h^2} (f(x+h, y+h) - f(x+h, y-h) \\

& - f(x-h, y+h) + f(x-h, y-h))\end{aligned}

$$
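The mixed-derivative stencil can likewise be verified on a function whose mixed partial is known exactly (a minimal sketch; the test function $f(x,y) = x^2 y^2$ is an arbitrary choice):

```python
def mixed_diff(f, x, y, h):
    # four-point central stencil for d^2 f / (dx dy)
    return (f(x + h, y + h) - f(x + h, y - h)
            - f(x - h, y + h) + f(x - h, y - h)) / (4 * h**2)

# f(x, y) = x^2 y^2 has the mixed derivative 4 x y
f = lambda x, y: x**2 * y**2
err = abs(mixed_diff(f, 1.0, 2.0, 1e-4) - 4 * 1.0 * 2.0)
```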

Last but not least, let’s get a bit fancy and construct a central differences scheme in two dimensions for affine coordinates. These are described by a pair of covariant basis vectors $\vec{e}_1$, $\vec{e}_2$ forming the metric tensor

$$

\hat{g}_{i,j} = \vec{e}_i \cdot \vec{e}_j

$$

which governs all distances in affine coordinates. Its inverse, the dual metric tensor $\hat{g}^{i,j} = (\hat{g}_{i,j})^{-1}$, is employed for the actual coordinate transformation

$$

\begin{pmatrix}

\xi^1 \\ \xi^2

\end{pmatrix} = \hat{g}^{\alpha\beta} \cdot

\begin{pmatrix} x \\ y

\end{pmatrix}

$$

The first-order derivative, given by the gradient in affine coordinates, is analogous to the one in Cartesian coordinates

$$

\nabla f(\xi_1, \xi_2) = \frac{1}{2 h} \begin{pmatrix} f(\xi_1+h,\xi_2) - f(\xi_1-h,\xi_2) \\ f(\xi_1,\xi_2+h) - f(\xi_1,\xi_2-h)\end{pmatrix}

$$

For the Laplacian, the dual metric tensor also needs to be taken into account

$$

\begin{aligned}

\Delta f &= \nabla_i \nabla^i f = g^{i,j} \nabla_i \nabla_j f \\

&= g^{1,1} \frac{\partial^2 f}{\partial \xi_1 \partial \xi_1} +g^{1,2} \frac{\partial^2 f}{\partial \xi_1 \partial \xi_2} +g^{2,1} \frac{\partial^2 f}{\partial \xi_2 \partial \xi_1} +g^{2,2} \frac{\partial^2 f}{\partial \xi_2 \partial \xi_2}

\end{aligned}

$$

resulting in

$$

\begin{aligned}

h^2 \cdot \Delta f &=g^{1,1} (f(\xi_1+h, \xi_2) - 2 f(\xi_1, \xi_2) + f(\xi_1-h, \xi_2)) \\

& + g^{2,2} (f(\xi_1, \xi_2+h) - 2 f(\xi_1, \xi_2) + f(\xi_1, \xi_2-h)) \\

& + \frac{g^{1,2}}{2} (f(\xi_1+h, \xi_2+h) - f(\xi_1+h, \xi_2-h) \\

& - f(\xi_1-h, \xi_2+h) + f(\xi_1-h, \xi_2-h))\end{aligned}

$$

where we have used the symmetry $g^{1,2} = g^{2,1}$.
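As a sanity check, for the Cartesian metric $\hat{g}^{i,j} = \delta^{ij}$ this scheme must reduce to the standard five-point Laplacian. A minimal sketch (function names and test values are our own):

```python
def laplacian_affine(f, x1, x2, g_inv, h):
    # central-difference Laplacian in affine coordinates;
    # g_inv is the (symmetric) dual metric tensor [[g11, g12], [g12, g22]]
    g11, g12 = g_inv[0]
    g22 = g_inv[1][1]
    lap = (g11 * (f(x1 + h, x2) - 2 * f(x1, x2) + f(x1 - h, x2))
           + g22 * (f(x1, x2 + h) - 2 * f(x1, x2) + f(x1, x2 - h))
           + 0.5 * g12 * (f(x1 + h, x2 + h) - f(x1 + h, x2 - h)
                          - f(x1 - h, x2 + h) + f(x1 - h, x2 - h)))
    return lap / h**2

# Cartesian metric and f = x^2 + y^2, whose exact Laplacian is 4
f = lambda a, b: a**2 + b**2
lap = laplacian_affine(f, 0.5, -0.3, [[1.0, 0.0], [0.0, 1.0]], 1e-3)
```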

www.iue.tuwien.ac.at/phd/heinzl/node27.html

www.holoborodko.com/pavel/numerical-methods/numerical-derivative/central-differences/#comment-18109

www.mathematik.uni-dortmund.de/~kuzmin/cfdintro/lecture4.pdf


The post Central Limit Theorem appeared first on Ben's Planet.

*Consider a sequence $x_1, \dots, x_n$ of independent, identically distributed random variables (e.g. Bernoulli trials) with mean $\mu$ and variance $\sigma^2$. Their empirical mean is defined by*

$$

\bar{x} = \frac{1}{n} (x_1 + \dots + x_n)

$$

*Normalizing it to a random variable with expectation value zero in the way*

$$

z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}

$$

*the probability distribution $p(z) \to N(0,1)$ for large $n$ (“convergence in distribution”). Equivalently, the cumulative distribution function of $z$ converges to that of the standard normal, $\mathrm{CDF}(z) \to \Phi(z)$, for all $z \in \mathbb{R}$.*

The marvel of the central limit theorem is the normal distribution emerging from adding up many random variables that themselves don’t have to be normal distributions.

However, they need to be independent (see the previous article on random variables) and they need to be ‘identically distributed’, i.e. drawn from the same probability distribution, with the same, well-defined mean $\mu$ and variance $\sigma^2$.

Let’s first try to understand why the limiting distribution of $z$ has to have mean zero and variance one:

First, since the expectation value of a random variable is linear,

$$

\left\langle \frac{x_1 + \dots + x_n}{n} \right\rangle

= \frac{\left\langle(x_1 + \dots + x_n) \right\rangle}{n}

= n \frac{\left\langle x_1 \right\rangle}{n} = \left\langle x_1 \right\rangle = \mu

$$

the expectation value of the empirical mean equals the mean of the individual distributions and thus the expectation value of $z$ equals zero. For statistical samples, one says that the empirical mean tends toward the population mean.

Second, for independent random variables, the variance adds up according to the Pythagorean theorem

$$

\text{Var}(aX + bY) = a^2 \text{Var}(X) + b^2 \text{Var}(Y)

$$

and thus

$$

\text{Var}\left(\frac{x_1 + \dots + x_n}{n}\right) = \frac{\text{Var}(x_1 + \dots + x_n)}{n^2} = \frac{n \text{Var}(x)}{n^2} = \frac{\text{Var}(x)}{n}

$$

Therefore, dividing by $\frac{\sigma}{\sqrt{n}}$ ensures that $z$ has variance one.
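This normalization is easy to observe in a simulation. A minimal sketch drawing from a uniform distribution (seed and sample sizes are arbitrary choices):

```python
import random

random.seed(0)

def z_statistic(n, mu, sigma, draw):
    # normalized empirical mean of n i.i.d. draws
    xbar = sum(draw() for _ in range(n)) / n
    return (xbar - mu) / (sigma / n**0.5)

# uniform variables on [0, 1]: mu = 1/2, sigma^2 = 1/12
mu, sigma = 0.5, (1 / 12) ** 0.5
zs = [z_statistic(100, mu, sigma, random.random) for _ in range(2000)]
mean_z = sum(zs) / len(zs)
var_z = sum((z - mean_z) ** 2 for z in zs) / len(zs)
```

The sampled $z$ values indeed have mean close to zero and variance close to one; a histogram of `zs` already looks like the Gaussian bell curve.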

The real mystery that needs explanation then is: Why is it a normal distribution at all?

Whereas formal proofs are rather technical, we shortly sketch two ideas that will be further elaborated in follow-on articles:

**Elementary Explanation**

Intuitively, the central limit theorem states that the sum $\frac{x_1 + \dots + x_n}{\sqrt{n}}$ of random variables (with $\mu = 0$ for simplicity) converges to a normal distribution for $n \to \infty$, independent of the distribution of the individual $x_i$.

It does not matter if we mix or replace some of the random variables, as long as they have the same first and second moments. In particular, if we replace each $x_i$ by a Gaussian with the same mean and variance, the stability of the Gaussian guarantees that the sum will again be a Gaussian.

Thus it can be shown that the limiting distribution depends only on the first and second moments of the individual $x_i$.

**The Maximum Entropy Principle**

Following the principle of maximum entropy, the distribution governing $z$ should be the one with the maximum possible entropy consistent with the constraints of fixed mean and variance.

This can be shown employing calculus of variations.

The central limit theorem is closely related to the law of large numbers, stating that the empirical mean $\frac{x_1 + \dots + x_n}{n}$ of identically distributed, independent random variables depends only on $\mu$ for $n \to \infty$.

Similarly, the central limit theorem states that the expression $\frac{x_1 + \dots + x_n}{\sqrt{n}}$ depends only on $\mu$ and $\sigma^2$ for $n \to \infty$.

The normal distribution is just one example of a stable distribution: adding two Gaussians yields again a Gaussian.

If either $\mu$ or $\sigma^2$ (or both) are infinite, the central limit theorem breaks down, but there may still be a limiting distribution that is not a normal distribution: these are generally called stable distributions.

The specifics of when a stable distribution exists and what form it has were elaborated in the 1930s by P. Lévy, A. Khintchine, and others (see for example the book by Paul and Baschnagel).

Wolfgang Paul and Jörg Baschnagel. Stochastic Processes. 2nd Edition 2013 Springer

math.stackexchange.com/questions/12983/intuition-about-the-central-limit-theorem

intuitive visualization: www.cantorsparadise.com/the-central-limit-theorem-why-is-it-so-2ae93edf6e8


The post Random Variables appeared first on Ben's Planet.

*A random variable is a measurable function $f:X \to E$ from a set of possible outcomes $X$ to a measurable space $E$, i.e., it assigns values (from $E$) to the outcomes in $X$ of a random experiment. It can be either discrete (countable image set, e.g. $E = \mathbb{Z}$) or continuous (uncountable image set, e.g. $E = \mathbb{R}$).*

Unlike an algebraic variable, whose value is well-determined by an equation, a random variable takes values in an entire range and cannot be predicted, for example due to uncertainty in the initial conditions or the sheer complexity of the system.

Examples of random variables include

- the measurement outcome itself, i.e. ‘head’ or ‘tail’ for flipping a coin, when $X = E$
- combinations of measurement outcomes: e.g. the sum of rolling two dice
- in practice: the values that stock prices, heart rate, $\dots$ assume

Note that a random variable only assigns values to outcomes and carries no notion of time. The time evolution of a random variable is described by a stochastic process.

An important concept is the independence of two random variables:

Suppose a random experiment yields a random outcome $x_1\in X_1$, while another experiment yields $x_2\in X_2$. We can combine these into a single random variable $x=(x_1,x_2)$ that takes values in the product set $X=X_1 \times X_2$ and consider the probability $P(A)$ defined on subsets $A\subseteq X$.

The probability $P(x_1)$ of $x_1$ is then given by the *marginal* probability $P(x_1)=P(\{x_1\}\times X_2)$, i.e. the probability of the set of all $x=(x_1,x_2)$, where the first component has the definite value $x_1$ and the second component $x_2$ is arbitrary. The probability $P(x_2)$ of $x_2$ is defined analogously.

We say the random variables $x_1$ and $x_2$ are *independent* iff

$$

P(x_1,x_2) = P(x_1)P(x_2) \quad.

$$

Thus, two random variables are independent if the values assumed by one do not influence the values taken by the other one – they convey no information about each other.

Two subsequent tosses of an ideal coin are independent. A counter-example would be drawing from an urn without replacement: the first drawing changes the probabilities for the second drawing.

One can consider random variables as elements of a function vector space, e.g. $L^2$, the space of all square-integrable functions. For centered random variables, being independent of each other then implies that they are orthogonal with respect to the inner product on that space.

Just like the Pythagorean theorem holds for the lengths of orthogonal vectors $x \perp y$

$$

||x + y||^2 = ||x||^2 + ||y||^2

$$

the variance is additive for two independent random variables $X$ and $Y$

$$

\sigma_{X+Y}^2 = \sigma_X^2 + \sigma_Y^2

$$
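A quick numerical illustration of this additivity for two independent variables (a sketch; the chosen distributions and seed are arbitrary):

```python
import random

random.seed(1)

def variance(samples):
    # population variance of a list of samples
    m = sum(samples) / len(samples)
    return sum((s - m) ** 2 for s in samples) / len(samples)

n = 100_000
xs = [random.uniform(-1, 1) for _ in range(n)]   # variance 1/3
ys = [random.gauss(0, 2) for _ in range(n)]      # variance 4
var_sum = variance([x + y for x, y in zip(xs, ys)])
var_parts = variance(xs) + variance(ys)
```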

This article is part of the book project on stochastic processes.


The post Lyapunov Exponents appeared first on Ben's Planet.

Let $\vec{x}(0)$ be the initial state of a chaotic dynamical system in phase space and let $\vec{x}(0) + \vec{\delta}_0$ be a nearby initial state, where $||\vec{\delta}_0||$ is very small (on the order of $||\vec{\delta}_0|| \sim 10^{-15}$, determined by floating-point accuracy). In numerical studies one finds that the distance between such initially close orbits can be described by

$$

||\vec{\delta}(t)|| \sim ||\vec{\delta}_0|| e^{\lambda t}

$$

with $\lambda$ being the Lyapunov exponent, more specifically the largest Lyapunov exponent.

Positive Lyapunov exponents mean that initially close orbits separate exponentially fast and thus indicate sensitive dependence on initial conditions. Negative or zero Lyapunov exponents hint at fixed points or stable, periodic orbits.
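For a one-dimensional map $x_{n+1} = f(x_n)$ the (single) Lyapunov exponent can be computed directly as the time average of $\ln |f^\prime(x_n)|$ along an orbit. A minimal sketch for the logistic map at $r = 4$, where $\lambda = \ln 2$ is known analytically (function name, initial condition, and iteration counts are our own choices):

```python
import math

def lyapunov_logistic(r=4.0, n=100_000, x0=0.3, transient=100):
    # time average of ln |f'(x)| along the orbit of f(x) = r x (1 - x)
    x = x0
    for _ in range(transient):      # discard the initial transient
        x = r * x * (1 - x)
    total = 0.0
    for _ in range(n):
        total += math.log(abs(r - 2 * r * x))   # |f'(x)| = |r - 2 r x|
        x = r * x * (1 - x)
    return total / n

lam = lyapunov_logistic()
```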

But this is just part of the story: for an $N$-dimensional dynamical system, phase space is described by a basis of $N$ orthogonal basis vectors, and thus there are $N$ distinct expanding, contracting, or invariant directions, i.e. an entire spectrum of $N$ distinct Lyapunov exponents.

Let’s consider an infinitesimally small hypersphere of initial states with main axes $\delta x^i_0$ and $i = 1, \dots, N$. Over time this hypersphere will deform into an ellipsoid, due to the presence of expanding and contracting directions, whose main axes are given by $\delta x^i_t$. Now, the Lyapunov exponents can be defined via

$$

\lambda_i = \lim_{t \to \infty} \frac{1}{t} \ln \left( \frac{\delta x^i_t}{\delta x^i_0} \right)

$$

This definition is based on the Multiplicative Ergodic Theorem, also called Oseledec's theorem, as it was first proven by Valery Oseledec in 1965 [Oseledec1968] and was the subject of several theoretical studies later on [Raghunathan1979, Ruelle1979, Johnson1987, Walters1993].

Yet this definition is not very practical when it comes to actually calculating Lyapunov exponents, especially since, due to the nature of a chaotic system, we cannot guarantee that our hypersphere stays infinitesimally small over the time scale needed for the convergence of the Lyapunov spectrum. Therefore, we will follow an alternative numerical approach in the following.

**Standard Method using Gram-Schmidt Orthogonalization**

We are using the standard method to determine the Lyapunov spectrum, originally developed by Benettin et al. [Benettin1980]. For a more pedagogical introduction and an extension to time series data have a look at the work of Wolf et al. [Wolf1985] and recent textbooks, such as [Skokos2010, Strogatz2015, Vallejo2017].

Following [Wolf1985], we will consider the time evolution of one particular initial state according to the non-linear equations of motion, yielding the fiducial trajectory. But instead of an infinitesimally small hypersphere of initial conditions we will consider equivalently the time evolution of an initially orthonormal basis of vectors via the linearized equations, anchored on the fiducial trajectory.

To be specific, let our $N$-dimensional dynamical system be described by the $N$-component, non-linear equation of motion

$$

\dot{\vec{x}} = \vec{f}(\vec{x})

$$

yielding the fiducial trajectory $\vec{x}^\ast(t)$. Then each of the orthonormal basis vectors $\vec{e}_i$ with $i = 1, \dots, N$ evolves according to the linearized equations of motion

$$

\dot{\vec{e}}_i = \hat{A} \vec{e}_i

$$

with $\hat{A} = \frac{d \vec{f}}{d \vec{x}} \big|_{\vec{x} = \vec{x}^\ast}$ evaluated on the fiducial trajectory, yielding $N \times N$ additional equations of motion. Thus, the entire system of $N$ plus $N \times N$ equations is solved simultaneously. But even the linearized time evolution suffers from two numerical issues:

a) With time the vectors will grow exponentially large / small for positive / negative Lyapunov exponents.

b) Over time the vectors will collapse along the direction of greatest expansion.

To overcome these two issues we make repeated use of the Gram-Schmidt orthonormalization procedure on the basis of vectors (or apply alternatively some other QR-decomposition method like a Householder transformation [Geist1990]).

Thus, in practice the system of $N + N \times N$ equations of motion is evolved for a certain time $\Delta t$ (typically of the order of one orbital period), starting with the initial ONB $\vec{e}^k = [\vec{e}^k_1, \dots, \vec{e}^k_N]$ and finally obtaining the set of vectors $\vec{u}^k = [\vec{u}^k_1, \dots, \vec{u}^k_N]$ during the $k$-th iteration.

This system is orthogonalized using Gram-Schmidt, yielding the set of vectors $\vec{v}^{k} = [\vec{v}^{k}_1, \dots, \vec{v}^{k}_N]$. That system is finally normalized and used as the initial ONB for the next round of iteration.

The GS-orthogonalization never alters the direction of the first vector in the system, which therefore — over time — seeks out the most rapidly growing direction (i.e. characterized by the largest Lyapunov exponent).

Due to the GS-orthogonalization procedure, the second vector has its component along the direction of greatest expansion removed. Throughout the iteration process, we are constantly changing its direction, so it is also not free to seek out the second most rapidly expanding direction. But, the vectors $\vec{u}_1$ and $\vec{u}_2$ span the same two-dimensional subspace as the vectors $\vec{v}_1$, $\vec{v}_2$. So despite repeated vector replacements, these two vectors explore the two-dimensional subspace whose area is growing most rapidly. This area is governed by the largest and second-largest Lyapunov exponent and grows according to $e^{(\lambda_1+\lambda_2) t}$ [Benettin1980].

Thus, by monitoring the length of the largest vector, proportional to $e^{\lambda_1 t}$, and the area spanned by the first two vectors, both Lyapunov exponents can be determined. In practice, since the vectors $\vec{v}_1$ and $\vec{v}_2$ are orthogonal, we can determine $\lambda_2$ directly from the mean growth rate of vector $\vec{v}_2$.

This reasoning can be generalized to the $k$-volume spanned by the first $k$ vectors, which grows according to $e^{(\lambda_1 + \dots + \lambda_k) t}$. Accordingly, the mean growth rate of the $k$-th vector provides an estimate for the $k$-th Lyapunov exponent $\lambda_k$: averaging over $n$ iterations of duration $\Delta t$,

$$

\lambda_k \approx \frac{1}{n \Delta t} \sum_{j=1}^{n} \ln |\vec{v}^{\,j}_k|

$$
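The whole procedure is easiest to see for a constant linear tangent map, where the exponents are known to be the logarithms of the eigenvalue moduli. A minimal two-dimensional sketch of the Benettin-style iteration (with $\Delta t = 1$ per step; the matrix and all names are our own illustrative choices):

```python
import math

def lyapunov_spectrum(A, n_iter=200):
    # evolve an orthonormal basis with the 2x2 tangent map A,
    # re-orthonormalize with Gram-Schmidt each step, and
    # average the logarithmic growth rates
    e1, e2 = [1.0, 0.0], [0.0, 1.0]
    s1 = s2 = 0.0
    for _ in range(n_iter):
        u1 = [A[0][0] * e1[0] + A[0][1] * e1[1],
              A[1][0] * e1[0] + A[1][1] * e1[1]]
        u2 = [A[0][0] * e2[0] + A[0][1] * e2[1],
              A[1][0] * e2[0] + A[1][1] * e2[1]]
        n1 = math.hypot(*u1)
        v1 = [u1[0] / n1, u1[1] / n1]
        proj = u2[0] * v1[0] + u2[1] * v1[1]    # Gram-Schmidt projection
        w2 = [u2[0] - proj * v1[0], u2[1] - proj * v1[1]]
        n2 = math.hypot(*w2)
        e1, e2 = v1, [w2[0] / n2, w2[1] / n2]
        s1 += math.log(n1)
        s2 += math.log(n2)
    return s1 / n_iter, s2 / n_iter

# eigenvalues of A are 2 and 1/2, so the exponents should be +-ln(2)
l1, l2 = lyapunov_spectrum([[2.0, 1.0], [0.0, 0.5]])
```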

Upon thousands of iterations, the fiducial trajectory traverses the chaotic sea for $t \to \infty$, and thus the Lyapunov spectrum is independent of the specific initial conditions, as long as they lie within the chaotic sea rather than some regular basin of attraction.

While the largest Lyapunov exponent is an indicator of chaotic dynamics and characterizes single trajectories, the entire Lyapunov spectrum characterizes the dynamical system as a whole e.g. by the rate of information loss. Employing the binary logarithm, the Lyapunov exponents give the rate of information loss (positive exponent) or gain (negative exponent) in bits/second.

A beautiful example is given by [Wolf1985]: If the initial state of say the Lorenz attractor is prepared with an accuracy of 20 bits (one part in a million) and the largest Lyapunov exponent $\lambda_1 = 2.16$ represents the rate of information loss, after about $9 \ \mathrm{s} = 20 \ \mathrm{bits} /(2.16 \ \mathrm{bits}/\mathrm{s})$ the uncertainty about its state has spread over the entire attractor.

The Lyapunov spectrum can also serve to approximate the fractal dimension of a strange attractor according to the Kaplan-Yorke conjecture.


The post Neural-Network Quantum States appeared first on Ben's Planet.

Describing a single spin-$\frac{1}{2}$ particle can be achieved by the simple ansatz

$$

| \psi \rangle = C_{\uparrow} |\uparrow\rangle + C_{\downarrow} | \downarrow\rangle

$$

and the normalization condition $|C_{\uparrow}|^2 + |C_{\downarrow}|^2 = 1$. While this is simple, for $N$ spin-$\frac{1}{2}$ particles the ansatz gets more elaborate

$$

\begin{aligned}| \psi \rangle & = C_{\uparrow \uparrow \dots \uparrow} |\uparrow \uparrow \dots \uparrow\rangle + C_{\uparrow \uparrow \dots \downarrow} |\uparrow \uparrow \dots \downarrow\rangle + \dots + C_{\downarrow \downarrow \dots \downarrow} | \downarrow \downarrow \dots \downarrow\rangle = \sum_k C_k | k \rangle \end{aligned}

$$

where

$$

| k\rangle = | \sigma^z_1 \sigma^z_2 \dots \sigma^z_N\rangle = | \sigma^z_1\rangle \otimes | \sigma^z_2\rangle \otimes \dots \otimes | \sigma^z_N\rangle

$$

And, at the same time, the Hilbert space becomes exponentially large as $k \in [1, 2^N]$.

More generally, a quantum many-body problem is described by a Hamiltonian $\hat{H}$ and our goal is to figure out the ground state

$$

\hat{H} | \psi_0\rangle = E_0 | \psi_0\rangle

$$

and its properties, but $| \psi_0\rangle$ is not known. For a small number of particles $N$, the problem can be solved by means of exact diagonalization of $\hat{H}$, but the computational cost of doing so is at least of the order $\mathcal{O}(2^N)$ for ‘local’ Hamiltonians. Here, a Hamiltonian is considered local if

- it has only $k$-body interactions
- $k$ is not extensive (it does not grow with the system size)

An example for such a local Hamiltonian is the transverse field Ising model

$$

\hat{H} = \sum_{i < j} J_{ij} \hat{\sigma}^z_i \hat{\sigma}^z_j + \sum_i h_i \hat{\sigma}^x_i

$$

Luckily, most interactions in physics are local, and when we fix $k$, the number of non-zero matrix elements,

$$

|H_{k k^\prime}| = |\langle k^\prime | \hat{H} | k\rangle| \neq 0

$$

grows only polynomially, $\mathrm{poly}(N)$. But still, this is only practical up to $N \sim 40$ (for comparison: $2^{166}$ is approximately the number of atoms on Earth).

**Variational Methods**

So in order to simulate larger system sizes one resorts to variational methods, which sample only a tiny sub-space of physically relevant states of the entire Hilbert space (e.g. constructed from the eigenstates $| \psi \rangle$ of the Hamiltonian $\hat{H}$). So we assume (!), that a quantum state $| \psi \rangle$ can be parameterized by a polynomially large set of parameters $w$ in the way

$$

| \psi(w) \rangle = \sum_k C_k(w) | k\rangle

$$

where

$$

C_k = \langle k | \psi \rangle = \psi(k) = \psi(\sigma^z_1, \dots, \sigma^z_N)

$$

To approximate the true ground state $| \psi_0 \rangle$ one can tune the parameters based on the variational principle that the energy expectation value

$$

E(w) = \frac{\langle \psi(w) | \hat{H} | \psi(w)\rangle}{\langle \psi(w) |\psi(w) \rangle} = \frac{\sum_n |b_n|^2 E_n}{\sum_n |b_n|^2} \geq E_0

$$

is minimized. Here, $b_n = \langle \psi_n | \psi(w) \rangle$ are the expansion coefficients of $| \psi(w) \rangle$ in the energy eigenbasis of $\hat{H}$.

**Variational Classes**

**$E(w)$ can be calculated exactly**

There is one class of quantum many-body problems, for which $E(w)$ can be calculated exactly, i.e. apart from rounding errors. One example are **mean-field states**

$$

| \psi(w) \rangle = | \phi_1 \rangle \otimes | \phi_2 \rangle \otimes \dots \otimes | \phi_N \rangle

$$

with the single particle wave functions $| \phi_i \rangle$ and $\langle \phi_x | \phi_y \rangle = \delta_{xy}$. Physical quantities such as the expectation value

$$

\langle \psi(w) | \hat{\sigma}^x_i | \psi(w)\rangle

$$

can be calculated exactly. Another example are matrix product states

$$

| \psi(w) \rangle = \sum_k C_k(w) | k \rangle

$$

where

$$

C_k(w) = \hat{M}_1(\sigma^z_1) \cdot \hat{M}_2(\sigma^z_2) \cdot \dots \cdot \hat{M}_N(\sigma^z_N)

$$

is expressed by $\chi \times \chi$ matrices.

**$E(w)$ can be calculated approximately**

There is another class of quantum many-body problems, where $E(w)$ can only be calculated approximately, but in a controlled way, so that the accuracy can be increased by putting in more computational effort.

Here, the quantum many-body state $| \psi \rangle$ needs to be “computationally tractable”, that is

1) $\langle k | \psi \rangle$ can be computed efficiently

2) one can sample efficiently from $P(k) = |\langle k | \psi \rangle|^2$

where “efficiently” means “polynomially in time” in terms of computational complexity. An example where condition 1) is violated is given by PEPS (projected entangled pair states).

In addition, we also need the theorem by van den Nest that any expectation value of a local operator $\hat{O}$ can be estimated with polynomial accuracy.

**Estimating Operator Properties**

The expectation value of an operator $\hat{O}$ can be estimated the following way:

$$

\begin{aligned}

\langle \hat{O}\rangle &=

\frac{\langle \psi | \hat{O} | \psi \rangle}{\langle \psi | \psi \rangle}

= \frac{\sum_{k k^\prime} \langle \psi | k \rangle \langle k | \hat{O} | k^\prime \rangle \langle k^\prime | \psi \rangle}{\sum_k |\langle \psi | k \rangle|^2} \\

&= \frac{\sum_k |\langle \psi | k \rangle|^2 \sum_{k^\prime} O_{k k^\prime} \frac{\langle k^\prime | \psi \rangle}{\langle k | \psi \rangle}}{\sum_k |\langle \psi | k \rangle|^2} = \sum_k P(k) O^{\mathrm{loc}}(k)

\end{aligned}

$$

with the probability distribution

$$

P(k) = \frac{|\langle \psi | k \rangle|^2}{\sum_{k^\prime} |\langle \psi | k^\prime \rangle|^2}

$$

and the local form of the operator

$$

O^{\mathrm{loc}}(k) = \sum_{k^\prime} O_{k k^\prime} \frac{\langle k^\prime | \psi \rangle}{\langle k | \psi \rangle}

$$

To obtain a stochastic estimate, first values $k^{(1)}$, $k^{(2)}$, $\dots$ $k^{(M)}$ need to be drawn from $P(k)$. Then the expectation value can be estimated via

$$

\langle \hat{O} \rangle \approx \frac{1}{M} \sum_{i=1}^M O^{\mathrm{loc}}(k^{(i)})

$$

The error

$$

\Delta \langle \hat{O} \rangle_M \sim \sqrt{\frac{\mathrm{var}(O^{\mathrm{loc}})}{M}} \overset{M \to \infty}{\longrightarrow} 0

$$

can be improved systematically for larger $M$, provided that the variance $\mathrm{var}(O^{\mathrm{loc}})$ is finite after all.

**Sampling from $P(k)$**

Using Markov chain Monte Carlo (MCMC) one can construct a series of values

$$

k^{(1)} \to k^{(2)} \to k^{(3)} \to \dots \to k^{(M)}

$$

that samples from $P(k)$, if the transition probability $\tau(k^{(i)} \to k^{(i+1)})$ fulfills the detailed balance condition

$$

P(k) \tau(k \to k^\prime) = P(k^\prime) \tau(k^\prime \to k)

$$

Employing the Metropolis-Hastings algorithm one splits the transition probability into a selection probability $\Omega$ and an acceptance ratio $A$

$$

\tau(k \to k^\prime) = \Omega(k \to k^\prime) A(k \to k^\prime)

$$

where the Metropolis acceptance ratio is given by

$$

A(k \to k^\prime) = \min \left(1, \frac{P(k^\prime)}{P(k)} \frac{\Omega(k^\prime \to k)}{\Omega(k \to k^\prime)} \right)

$$
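Putting the sampling machinery together for a toy system: below is a sketch of Metropolis sampling from $P(k) \propto |\psi(k)|^2$ for three spins with a single-spin-flip proposal (symmetric, so $\Omega$ cancels in the acceptance ratio). The amplitude function is an arbitrary positive toy choice, not a physical wave function:

```python
import itertools, math, random

random.seed(2)

def psi(k):
    # toy (unnormalized, positive) amplitude for 3 spins, k_i = +-1
    return math.exp(0.3 * sum(k) + 0.2 * (k[0] * k[1] + k[1] * k[2]))

def metropolis_samples(n_samples, n_spins=3):
    k = tuple(random.choice((-1, 1)) for _ in range(n_spins))
    samples = []
    for _ in range(n_samples):
        i = random.randrange(n_spins)
        k_new = k[:i] + (-k[i],) + k[i + 1:]
        # A = min(1, P(k')/P(k)) with P ~ |psi|^2
        if random.random() < min(1.0, (psi(k_new) / psi(k)) ** 2):
            k = k_new
        samples.append(k)
    return samples

samples = metropolis_samples(200_000)

# compare empirical frequencies with the exactly normalized P(k)
states = list(itertools.product((-1, 1), repeat=3))
Z = sum(psi(k) ** 2 for k in states)
exact = {k: psi(k) ** 2 / Z for k in states}
freq = {k: samples.count(k) / len(samples) for k in states}
```

For this small system the chain's empirical frequencies match the exact distribution closely, illustrating that detailed balance is satisfied.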

**Quick Recap on Machine Learning**

**The Machine**

Generally speaking, machine learning deals with algorithms that approximate functions $F(x_1, \dots, x_N)$, where $x_1, \dots, x_N$ are e.g. the pixel values of an image and the output is a single number, classifying the image as either “dog” or “cat”. So $F(\vec{x})$ is the “machine”, and within the subfield of deep learning it may be a feed-forward deep neural network of the form

$$

F(\vec{x}, \vec{w}) = g^{(D)} \circ w^{(D)} \dots \circ g^{(2)} \circ w^{(2)} \circ g^{(1)} \circ w^{(1)} \vec{x}

$$

It consists of two parts

1) first a weighted sum of the input values $\vec{x}$ with the weights $\vec{w}$

$$

\sum_y w_{iy}^{(1)} x_y = h^{(1)}_i \in \mathbb{R}^{l_1}

$$

2) and secondly the application of a non-linear function $g$ also called the activation function

$$

(g \circ h^{(1)})_i = h^{\prime (1)}_i = g(h^{(1)}_i) \in \mathbb{R}^{l_1}

$$

Here $l_1, l_2, \dots$ represent the widths of the network layers and $D$ the depth (number of layers). The activation function introduces the required non-linearity; possible choices are for example $\tanh(x)$ or $\ln(\cosh(x))$.
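The two ingredients, weighted sum and activation, are only a few lines of code. A minimal sketch of the forward pass (all weights and layer sizes are arbitrary illustrative values):

```python
import math

def layer(x, W, b, activation=math.tanh):
    # one layer: weighted sum of the inputs followed by the activation
    return [activation(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def network(x, layers):
    # feed-forward pass through a stack of (W, b) layers
    for W, b in layers:
        x = layer(x, W, b)
    return x

# depth D = 2: two inputs -> three hidden units -> one output
layers = [
    ([[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]], [0.0, 0.1, -0.1]),
    ([[1.0, -1.0, 0.5]], [0.2]),
]
out = network([0.7, -0.3], layers)
```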

**The Learning**

The “learning” involves optimization of the weights $\vec{w}$ to solve a certain task. There are generally three forms of learning:

1) supervised learning to make predictions based on labeled data

2) unsupervised learning to cluster data and identify patterns without labels

3) reinforcement learning, where a score function (in physics: energy function) is either maximized or minimized without labels

We are going to employ reinforcement learning to optimize the neural network quantum states. Let’s dive in.

**Neural Network Quantum States – Explained**

We represent the many-body wave function of a system of $N$ spin-$\frac{1}{2}$-particles as an N-dimensional function

$$

\langle k | \psi \rangle = \psi(\sigma_1^z, \dots, \sigma_N^z; w)

$$

where $\sigma_i^z \in \{-1,1\}$ are the spin projections along the $z$-axis. Using machine learning, and especially deep learning, this wave function is expressed as a neural network quantum state (NQS) in the form

$$

\psi(\vec{\sigma}) = g^{(D)} \circ w^{(D)} \dots \circ g^{(2)} \circ w^{(2)} \circ g^{(1)} \circ w^{(1)} \vec{\sigma}

$$

Importantly, for such a network the scalar products $\langle \sigma_1^z, \sigma_2^z, \dots, \sigma_N^z | \psi \rangle$ can be computed efficiently, i.e. in polynomial time.

**Why do NQS work?**

The main idea of NQS is to approximate a complicated function $F(\vec{x})$ through simpler basis functions. That this even works is based on two theorems.

First, the **Kolmogorov-Arnold theorem** states that any continuous, $n$-dimensional function $F(\vec{x})$ can be represented in the way

$$

F(\vec{x}) = \sum_{q = 0}^{2n} \Phi \left( \sum_{p=1}^n \lambda_p \phi(x_p + \eta q) + q \right)

$$

where $\Phi$ and $\phi$ are (hard to compute) univariate functions.

In addition, if for neural networks the form of the activation function is fixed, the **Universal Approximation Theorem** tells us

$$

F(\vec{x}) = \sum_{i=1}^N v_i \phi \left( \sum_j w_{ij} x_j + b_i \right)

$$

which essentially reflects the structure of a neural network: a weighted sum of inputs $x_j$ and weights $w_{ij}$, followed by the application of a non-linear activation function $\phi$.

**Reinforcement Learning of Quantum States**

For reinforcement learning we need to perform three steps:

1) obtain a flexible representation of $| \psi \rangle$

2) find an algorithm to compute $\langle \hat{H}\rangle = E(w)$

3) perform the learning: determine the optimal weights $w$ from $\min_w E(w)$

While a flexible representation of $| \psi \rangle$ is given by a deep neural network, for the actual learning we need to estimate the expectation value

$$

E(w) = \frac{\langle \psi_M | \hat{H} | \psi_M \rangle}{\langle \psi_M | \psi_M \rangle} = \frac{\sum_k P(k) E^{\mathrm{loc}}(k)}{\sum_k P(k)} = \langle E^{\mathrm{loc}}(k) \rangle_P

$$

where

$$

E^{\mathrm{loc}}(k) = \sum_{k^\prime} H_{k k^\prime} \frac{\langle k^\prime | \psi \rangle}{\langle k | \psi \rangle} = \frac{\langle k | \hat{H} | \psi \rangle}{\langle k | \psi \rangle}

$$

Here, we see why the requirement of $\hat{H}$ being a local operator is important: now we just have a polynomial number of matrix elements $H_{k k^\prime}$ in the sum and thus $E^{\mathrm{loc}}(k)$ can be computed efficiently.
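For a toy system small enough to sum over the full basis, one can verify explicitly that the $P$-weighted average of $E^{\mathrm{loc}}$ reproduces the direct expectation value. A sketch with an arbitrary symmetric $4 \times 4$ “Hamiltonian” and trial amplitudes (all numbers are made up for illustration):

```python
# 2-spin toy model: basis labels k = 0..3, a sparse symmetric H,
# and an arbitrary (real, positive) trial wave function psi(k)
H = [
    [1.0, 0.5, 0.5, 0.0],
    [0.5, -1.0, 0.0, 0.5],
    [0.5, 0.0, -1.0, 0.5],
    [0.0, 0.5, 0.5, 1.0],
]
psi = [0.8, 0.3, 0.3, 0.1]

norm = sum(c * c for c in psi)
P = [c * c / norm for c in psi]

def E_loc(k):
    # local energy: sum_k' H_{k k'} psi(k') / psi(k)
    return sum(H[k][kp] * psi[kp] for kp in range(4)) / psi[k]

E_direct = sum(psi[k] * H[k][kp] * psi[kp]
               for k in range(4) for kp in range(4)) / norm
E_sampled = sum(P[k] * E_loc(k) for k in range(4))
```

Both routes give the same energy; in a real calculation the sum over $P(k)$ is of course replaced by Monte Carlo samples.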

**Optimization Procedure**

Minimization of $E(w)$ can be achieved via gradient descent, for which we need the gradient $\frac{\partial E(w)}{\partial w_p}$. It can be shown that the following relation holds:

$$

\frac{\partial E(w)}{\partial w_p} = \langle E^{\mathrm{loc}}(k) O_p(k) \rangle_P - \langle E^{\mathrm{loc}}(k)\rangle_P \langle O_p(k)\rangle_P

$$

with

$$

O_p = \frac{\partial}{\partial w_p} \log \langle k | \psi \rangle

$$

Thinking of the gradient $\frac{\partial E(w)}{\partial w_p}$ as the average of a function $G(k)$

$$

\frac{\partial E(w)}{\partial w_P} = \langle G(k)\rangle_P

$$

it can be estimated from stochastic samples

$$

\frac{\partial E(w)}{\partial w_P} \approx \frac{1}{M} \sum_{i=1}^M G(k^{(i)})

$$
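This stochastic estimator can be sketched directly from the covariance formula above (assuming real parameters; `e_loc` and `o` are hypothetical sample arrays):

```python
import numpy as np

def energy_gradient(e_loc, o):
    """Estimate dE/dw_P = <E_loc O_P>_P - <E_loc>_P <O_P>_P from M samples.

    e_loc: (M,)   local energies E_loc(k^(i)),
    o:     (M, P) log-derivatives O_P(k^(i)), assuming real parameters.
    """
    return e_loc @ o / len(e_loc) - e_loc.mean() * o.mean(axis=0)

# Hypothetical samples: M = 1000 configurations, P = 4 parameters.
rng = np.random.default_rng(1)
e_loc = rng.normal(size=1000)
o = rng.normal(size=(1000, 4))
grad = energy_gradient(e_loc, o)
print(grad.shape)  # (4,)
```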

Now the gradient descent can be performed via

$$

w^{(s+1)}_P = w^{(s)}_P – \eta \frac{\partial E}{\partial w_P}

$$

where $\eta$ is the learning rate, which should not be chosen too large. Alternatively, one can use stochastic gradient descent, where Gaussian noise is added to the gradient in order to avoid getting stuck in local minima

$$

\frac{\partial E}{\partial w_p} + \mathrm{Normal} \left( \mu=0,\sigma^2 = \frac{v}{M} \right)

$$

centered around the origin with variance $v$. This is equivalent to the Langevin dynamics of a stochastic process $x_p(t)$

$$

x_p(t + \delta_t) = x_p(t) - \nabla_p v(x(t)) \, \delta_t + \mathrm{Normal}(\mu = 0, \sigma^2 = 2 \delta_t T)

$$

where the probability distribution of $x_p$ tends toward the Boltzmann distribution for $t \to \infty$

$$

\lim_{t \to \infty} P(x) = \frac{e^{-\frac{v(x)}{T}}}{Z}

$$

In our analogy, the learning rate corresponds to the step size, $\eta = \delta_t$, and matching the variance of the noise in the weight update, $\eta^2 \frac{v}{M} = 2 \delta_t T$, yields the effective temperature $T = \frac{\eta v}{2M}$. So the probability distribution of the weights also tends towards a Boltzmann distribution

$$

\lim_{s \to \infty} P(\vec{w}) \sim \frac{e^{-\frac{E(\vec{w})}{T}}}{Z}

$$
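The noisy update described above can be sketched on a toy quadratic "energy" (a minimal sketch; the variance `v`, batch size `m`, and all other knobs are hypothetical):

```python
import numpy as np

def noisy_gd_step(w, grad, eta, v, m, rng):
    """One noisy gradient-descent step: w <- w - eta * (grad + Normal(0, v/m)).

    v is the noise variance and m the batch size (hypothetical knobs here).
    """
    noise = rng.normal(0.0, np.sqrt(v / m), size=w.shape)
    return w - eta * (grad + noise)

# Toy energy landscape E(w) = w^2 / 2 with gradient E'(w) = w.
rng = np.random.default_rng(4)
w = np.array([5.0])
for _ in range(2000):
    w = noisy_gd_step(w, w, eta=0.05, v=0.1, m=10, rng=rng)
print(abs(w[0]))  # fluctuates near the minimum at w = 0
```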

**Time-dependent Problems**

NQS can not only be employed to learn ground states, but also to determine the time evolution of a quantum many-body state variationally, i.e. to solve the Schrödinger equation

$$i \frac{\partial}{\partial t} | \phi(t) \rangle = \hat{H}(t) | \phi(t) \rangle$$

To this end, the true state is approximated by an NQS, $| \phi(t) \rangle \approx | \psi, w(t) \rangle$, where the time dependence is moved into the variational parameters.

**Time-Dependent Variational Principle**

Our goal is to derive a time-dependent variational principle that is based on minimizing the angle between the true time-evolved state $|\phi(t+\delta_t) \rangle$ and the time-evolved NQS $| \psi, w(t+\delta_t) \rangle$. Based on the work of Dirac and Frenkel, the time evolution for a small time step $\delta_t$ is approximated by

$$|\phi(t+\delta_t) \rangle = (\hat{1} – i \delta_t \hat{H}) |\phi(t) \rangle + \mathcal{O}(\delta^2_t)$$

and

$$

|\psi, w(t+\delta_t) \rangle = (\hat{1} + \delta_t \hat{\Delta}) |\psi, w(t) \rangle + \mathcal{O}(\delta^2_t)

$$

The first equation effectively discretizes the time evolution operator $\hat{U} = e^{-i \hat{H} t}$ for $\hbar = 1$. For the second equation we can determine $\hat{\Delta}$ by expressing the time evolution step in the computational basis

$$

\langle k | \psi, w(t+\delta_t) \rangle = \langle k | \psi, w(t) \rangle + \delta_t \sum_P \frac{\partial \langle k | \psi, w(t) \rangle}{\partial w_P} \dot{w}_P + \dots

$$

so that

$$

\hat{\Delta} = \sum_P \frac{\partial \langle k | \psi, w(t) \rangle}{\partial w_P} \frac{\dot{w}_P}{\langle k | \psi, w(t) \rangle} = \sum_P O_P(k) \dot{w}_P

$$

To obtain a time-dependent variational principle we need to match the two states

$$

\begin{aligned}

&|A \rangle = (\hat{1} – i \delta_t \hat{H}) |\phi(t) \rangle \\

&|B \rangle = \left(\hat{1} + \delta_t \sum_P O_P(k) \dot{w}_P \right) |\psi, w(t) \rangle

\end{aligned}

$$

locally in time, and thereby obtain the global time evolution as well. This is achieved by defining the fidelity

$$

F_{AB}(\dot{w}_P) = \frac{|\langle A|B\rangle|^2}{\langle A|A\rangle \langle B|B\rangle}

$$

which, to lowest order, is quadratic in $\dot{w}_P$ and which is to be maximized. So from the time-dependent variational principle

$$

\max_{\dot{w}_P} F_{AB}(\dot{w}_P)

$$

follows the equation of motion for $\dot{w}_P$ and one can show that

$$

i \sum_{P^\prime} \dot{w}_{P^\prime} S_{P P^\prime} = \langle O^*_P E^{\mathrm{loc}}\rangle - \langle O^*_P\rangle \langle E^{\mathrm{loc}}\rangle

$$

with the quantum geometric tensor (also called metric of the variational manifold)

$$

S_{P P^\prime} = \langle O^*_P O_{P^\prime}\rangle – \langle O^*_P\rangle \langle O_{P^\prime}\rangle = \left\langle \frac{\partial \psi}{\partial w_P} \Big| \frac{\partial \psi}{\partial w_{P^\prime}} \right \rangle

$$

**Time-dependent Variational Monte Carlo (t-VMC)**

- sample $P(k,t) = \frac{|\langle k | \psi, w(t) \rangle |^2}{N(t)}$
- measure $S_{P P^\prime}$ and the right-hand side $G_P$ of the equation of motion from the sample
- solve the linear system to find $\dot{w}_P(t)$
- $w_P(t+\delta_t) = w_P(t) + \delta_t \dot{w}_P(t)$ (Euler step), or use a higher-order scheme
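The steps above can be sketched in a few lines, assuming the sampling has already produced arrays `o` and `e_loc` (a minimal sketch with a plain Euler step; the small regularization of $S$ before solving is a common practical trick, not part of the derivation):

```python
import numpy as np

def tvmc_step(w, o, e_loc, dt, eps=1e-6):
    """One t-VMC update from Monte Carlo samples (minimal sketch, Euler step).

    o:     (M, P) log-derivatives O_P(k^(i)) of the sampled configurations,
    e_loc: (M,)  local energies E_loc(k^(i)).
    Solves i S wdot = <O*_P E_loc> - <O*_P><E_loc> for wdot.
    """
    m = len(e_loc)
    o_c = o - o.mean(axis=0)          # centered O_P
    s = o_c.conj().T @ o_c / m        # S_PP' = <O*_P O_P'> - <O*_P><O_P'>
    g = o_c.conj().T @ e_loc / m      # <O*_P E_loc> - <O*_P><E_loc>
    # Regularize S slightly before inverting it.
    wdot = np.linalg.solve(1j * (s + eps * np.eye(len(w))), g)
    return w + dt * wdot

# Hypothetical sample data: M = 500 samples, P = 3 parameters.
rng = np.random.default_rng(2)
o = rng.normal(size=(500, 3)) + 1j * rng.normal(size=(500, 3))
e_loc = rng.normal(size=500)
w = tvmc_step(np.zeros(3, dtype=complex), o, e_loc, dt=1e-3)
print(w.shape)  # (3,)
```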

**Imaginary Time**

Problems in imaginary time, with

$$

| \phi(\tau)\rangle = e^{-\tau \hat{H}} | \phi(0) \rangle

$$

$$

\frac{\partial}{\partial \tau} | \phi(\tau) \rangle = – \hat{H} | \phi(\tau)\rangle

$$

and

$$

\lim_{\tau \to \infty} | \phi(\tau) \rangle = | \psi_0 \rangle

$$

provided $\langle \psi_0 | \phi(0)\rangle \neq 0$, can be treated analogously, leading to

$$

\sum_{P^\prime} \dot{w}_{P^\prime} S_{P P^\prime} = -\langle E^{\mathrm{loc}} O^*_P\rangle + \langle O^*_P\rangle \langle E^{\mathrm{loc}}\rangle

$$

This is equivalent to the “stochastic reconfiguration” optimization scheme [Sorella Book] and very close to “natural gradient descent” [Amari, 1998].

**Learning from Experiments**

Here we perform projective measurements on an actual quantum system, e.g. a gas of atoms, obtaining the probabilities

$$

Q(k) = |\langle k | \phi \rangle|^2

$$

Our goal is to reconstruct $\langle k | \phi \rangle$, i.e. the wave function, from these projective measurements in some basis $|k \rangle$ (therefore also called quantum state reconstruction).

**Variational Principle**

Once again we need a variational principle to perform the learning. Starting with the probability distribution

$$

P(k) = \frac{|\langle k | \psi, w \rangle|^2}{N(w)} = \frac{F(k,w)}{N(w)}

$$

generated from a neural network, and the actually measured exact probabilities

$$

Q(k) = |\langle k| \phi \rangle |^2

$$

one can define as a metric between both probability distributions the Kullback-Leibler-Divergence

$$

D_{KL}(P||Q) = \sum_k Q(k) \left[ \log(Q(k)) – \log(P(k)) \right]

$$

If $P(k) = Q(k)$ we have $D_{KL} = 0$, and thus our variational principle is to minimize $D_{KL}(w)$. It can be shown that the gradient of $D_{KL}(w)$ is given by

$$

\frac{\partial D_{KL}}{\partial w_P} = – \left\langle \frac{\partial}{\partial w_P} \log(F(k)) \right \rangle_Q + \left \langle \frac{\partial}{\partial w_P} \log(F(k)) \right \rangle_P

$$

with

$$

\frac{\partial}{\partial w_P} \log(F(k)) = 2 O_P(k)

$$

Importantly, this gradient can be evaluated efficiently.
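Given the log-derivatives $O_P(k)$ evaluated on measured configurations ($k \sim Q$) and on samples drawn from the model ($k \sim P$), the estimator is essentially a one-liner (a sketch with hypothetical sample arrays):

```python
import numpy as np

def kl_gradient(o_data, o_model):
    """Estimate dD_KL/dw_P = -<2 O_P>_Q + <2 O_P>_P from samples.

    o_data:  (M_Q, P) log-derivatives O_P(k) at measured configurations k ~ Q,
    o_model: (M_P, P) log-derivatives at configurations sampled from P.
    """
    return -2.0 * o_data.mean(axis=0) + 2.0 * o_model.mean(axis=0)

# Hypothetical samples for P = 2 parameters.
rng = np.random.default_rng(3)
grad = kl_gradient(rng.normal(size=(200, 2)), rng.normal(size=(200, 2)))
print(grad.shape)  # (2,)
```

Note that the gradient vanishes when the model samples are statistically indistinguishable from the data, as expected at the minimum of $D_{KL}$.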

**The Phase**

So far, we were able to reconstruct the amplitude of the quantum state, but not its phase. To obtain the phase, we need to perform measurements in several bases, connected to the original basis $|k \rangle$ by a unitary transformation $\hat{U}_B$. Thus,

$$

P_B(k) = \frac{|\langle k | \hat{U}^*_B | \psi, w \rangle|^2}{N_B(w)}

$$

$$

Q_B(k) = |\langle k | \hat{U}^*_B | \phi \rangle|^2

$$

and

$$

D_{KL} = \sum_{k,B} Q_B(k) \left[ \log(Q_B(k)) – \log(P_B(k)) \right]

$$

and the variation can be performed without issues as long as $\hat{U}_B$ is a local rotation.

The post Neural-Network Quantum States appeared first on Ben's Planet.
