LOG#231. Statistical tools.

Today's subject: errors. We will review the formulae used to handle them in experimental data.

Generally speaking, errors can be:

1st. Random. Due to imperfections of measurements or intrinsically random sources.

2nd. Systematic. Due to the procedures used to measure or uncalibrated apparatus.

There is also a distinction of accuracy and precision:

1st. Accuracy is closeness to the true value of a parameter or magnitude. With this definition, it is a measure of systematic bias or error. However, sometimes accuracy is defined (ISO definition) as the combination of systematic and random errors, i.e., accuracy would be the combination of the two observational errors above. High accuracy would require, in this case, both high trueness and high precision.

2nd. Precision. It is a measure of random errors. They can be reduced with further measurements and they measure statistical variability. Precision also requires repeatability and reproducibility.

1. Statistical estimators.

Arithmetic mean:

(1)   \begin{equation*}\boxed{\overline{X}=\dfrac{\displaystyle{\sum_{i=1}^n x_i}}{n}=\dfrac{\left(\mbox{Sum of measurements}\right)}{\left(\mbox{Number of measurements}\right)}}\end{equation*}

Absolute error:

(2)   \begin{equation*}\boxed{ \varepsilon_{a}=\vert x_i-\overline{x}\vert}\end{equation*}

Relative error:

(3)   \begin{equation*}\boxed{\varepsilon_r=\dfrac{\varepsilon_a}{\overline{x}}\cdot 100}\end{equation*}

Average deviation or error:

(4)   \begin{equation*}\boxed{\delta_m=\dfrac{\sum_i\vert x_i-\overline{x}\vert}{n}}\end{equation*}

Variance or average quadratic error or mean squared error:

(5)   \begin{equation*}\boxed{\sigma_x^2=s^2=\dfrac{\displaystyle{\sum_{i=1}^n}\left(x_i-\overline{x}\right)^2}{n-1}}\end{equation*}

This is the unbiased variance, with the Bessel correction (n-1 instead of n in the denominator). It is the correct formula as long as the data are a sample from a larger population. When the sample is the total population, a shift must be done from n-1 to n.

Standard deviation (mean squared error, mean quadratic error):

(6)   \begin{equation*}\boxed{\sigma\equiv\sqrt{\sigma_x^2}=s=\sqrt{\dfrac{\displaystyle{\sum_{i=1}^n}\left(x_i-\overline{x}\right)^2}{n-1}}}\end{equation*}

This is the standard deviation of the sample, i.e., the square root of the unbiased variance estimator. The Bessel correction is assumed whenever our sample is smaller in size than the total population. For the total population, the standard deviation reads, after shifting n-1\rightarrow n:

(7)   \begin{equation*}\boxed{\sigma_n\equiv\sqrt{\sigma_{x,n}^2}=\sqrt{\dfrac{\displaystyle{\sum_{i=1}^n}\left(x_i-\overline{x}\right)^2}{n}}=s_n}\end{equation*}

Mean error or standard error of the mean:

(8)   \begin{equation*}\boxed{\varepsilon_{\overline{x}}=\dfrac{\sigma_x}{\sqrt{n}}=\sqrt{\dfrac{\displaystyle{\sum_{i=1}^n}\left(x_i-\overline{x}\right)^2}{n\left(n-1\right)}}}\end{equation*}

If, instead of the unbiased (sample) standard deviation, we use the total population standard deviation \sigma_n, the corresponding standard error reads

(9)   \begin{equation*}\boxed{\varepsilon_{\overline{x},n}=\dfrac{\sigma_n}{\sqrt{n}}=\sqrt{\dfrac{\displaystyle{\sum_{i=1}^n}\left(x_i-\overline{x}\right)^2}{n^2}}=\dfrac{\sqrt{\displaystyle{\sum_{i=1}^n}\left(x_i-\overline{x}\right)^2}}{n}}\end{equation*}

Variance of the mean quadratic error (variance of the variance):

(10)   \begin{equation*}\boxed{\sigma^2\left(s^2\right)=\sigma^2_{\sigma^2}=\sigma^2\left(\sigma^2\right)=\dfrac{2\sigma^4}{n-1}}\end{equation*}

Standard error of the mean quadratic error (error of the variance):

(11)   \begin{equation*}\boxed{\sigma\left(s^2\right)=\sqrt{\sigma^2_{\sigma^2}}=\sigma\left(\sigma^2\right)=\sigma_{\sigma^2}=\sigma^2\sqrt{\dfrac{2}{n-1}}}\end{equation*}
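The estimators of this section are easy to check numerically. What follows is only a minimal sketch in plain Python: the function name estimators and the five readings are hypothetical, and the error of the variance uses s^2 as a stand-in for the unknown \sigma^2.

```python
import math

def estimators(xs):
    """Basic estimators for a list of repeated measurements xs."""
    n = len(xs)
    mean = sum(xs) / n                              # arithmetic mean, eq. (1)
    avg_dev = sum(abs(x - mean) for x in xs) / n    # average deviation, eq. (4)
    ss = sum((x - mean) ** 2 for x in xs)           # sum of squared deviations
    var = ss / (n - 1)                              # sample variance, eq. (5)
    std = math.sqrt(var)                            # sample standard deviation, eq. (6)
    std_pop = math.sqrt(ss / n)                     # total population form, eq. (7)
    sem = std / math.sqrt(n)                        # standard error of the mean, eq. (8)
    var_err = var * math.sqrt(2.0 / (n - 1))        # eq. (11), with s^2 in place of sigma^2
    return {"mean": mean, "avg_dev": avg_dev, "var": var, "std": std,
            "std_pop": std_pop, "sem": sem, "var_err": var_err}

data = [9.8, 10.1, 10.0, 9.9, 10.2]   # hypothetical readings, arbitrary units
result = estimators(data)
for name, value in result.items():
    print(f"{name:8s} = {value:.4f}")
```

For these made-up readings the mean is 10.0, the sample variance is 0.025 and the standard error of the mean is about 0.071.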

2. Gaussian/normal distribution intervals for a given confidence level (interval width a number of entire sigmas)

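Here we give the probability that a random variable X following a normal distribution takes a value within n\sigma of the mean, i.e., inside an interval of half-width n\sigma (here n counts sigmas, not measurements). The fraction written after \sim is the approximate probability of falling outside the interval.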

1 sigma amplitude (1\sigma).

(12)   \begin{equation*}x\in\left[\overline{x}-\sigma,\overline{x}+\sigma\right]\longrightarrow P\approx 68.3\%\sim\dfrac{1}{3}\end{equation*}

2 sigma amplitude (2\sigma).

(13)   \begin{equation*}x\in\left[\overline{x}-2\sigma,\overline{x}+2\sigma\right]\longrightarrow P\approx 95.4\%\sim\dfrac{1}{22}\end{equation*}

3 sigma amplitude (3\sigma).

(14)   \begin{equation*}x\in\left[\overline{x}-3\sigma,\overline{x}+3\sigma\right]\longrightarrow P\approx 99.7\%\sim\dfrac{1}{370}\end{equation*}

4 sigma amplitude (4\sigma).

(15)   \begin{equation*}x\in\left[\overline{x}-4\sigma,\overline{x}+4\sigma\right]\longrightarrow P\approx 99.994\%\sim\dfrac{1}{15787}\end{equation*}

5 sigma amplitude (5\sigma).

(16)   \begin{equation*}x\in\left[\overline{x}-5\sigma,\overline{x}+5\sigma\right]\longrightarrow P\approx 99.99994\%\sim\dfrac{1}{1744278}\end{equation*}

6 sigma amplitude (6\sigma).

(17)   \begin{equation*}x\in\left[\overline{x}-6\sigma,\overline{x}+6\sigma\right]\longrightarrow P\approx 99.9999998\%\sim\dfrac{1}{506797346}\end{equation*}

For a given two-sided confidence level C.L. (generally 90\%, 95\%, 98\%, 99\%), the interval half-width will be, respectively, 1.645\sigma, 1.96\sigma, 2.326\sigma, 2.576\sigma.
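These numbers can be reproduced with the Python standard library alone. The sketch below uses math.erf and math.erfc for the two-sided probabilities and statistics.NormalDist().inv_cdf for the sigma multipliers. The helper names coverage, tail and sigmas_for are my own choice.

```python
import math
from statistics import NormalDist

SQRT2 = math.sqrt(2.0)

def coverage(k):
    """Probability that a normal variable lies within k sigma of its mean."""
    return math.erf(k / SQRT2)

def tail(k):
    """Probability of falling outside k sigma (both tails together)."""
    return math.erfc(k / SQRT2)

def sigmas_for(cl):
    """Number of sigmas whose two-sided interval has probability cl."""
    return NormalDist().inv_cdf(0.5 * (1.0 + cl))

for k in range(1, 7):
    print(f"{k} sigma: P = {coverage(k):.10f}, about 1 in {1.0 / tail(k):.0f} outside")

for cl in (0.90, 0.95, 0.98, 0.99):
    print(f"C.L. {cl:.0%}: {sigmas_for(cl):.3f} sigma")
```

Using erfc for the tail avoids the loss of precision that 1 - erf would suffer at 5 or 6 sigma.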

3. Error propagation.

Usually, a quantity is not measured directly but computed from other measured quantities, and the errors of the latter propagate to it.

3A. Sum and subtraction.

Let us define x\pm \delta x and y\pm \delta y. Furthermore, define the variable q=x\pm y. The error in q would be:

(18)   \begin{equation*}\boxed{\varepsilon (q)=\delta x+\delta y}\end{equation*}

Example. M_1=540\pm 10 g, M_2=940\pm 20 g. M_1=m_1+liquid, with m_1=72\pm 1g  and M_2=m_2+liquid, with m_2=97\pm 1g. Then, we have:

M=M_1-m_1+M_2-m_2=1311g as liquid mass.

\delta M=\delta M_1+\delta m_1+\delta M_2+\delta m_2=32g, as total liquid error.

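M_0=1311\pm 32 g is the liquid mass together with its error. Since the error already affects the tens digit, the result is usually quoted rounded, e.g., M_0\approx 1310\pm 30 g, with 3 significant digits or figures.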
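The arithmetic of this example can be checked with a few lines of Python. The sketch also prints, for comparison only, the quadrature sum of the same four errors, which is the usual choice when the errors are independent and random. The variable names are mine.

```python
import math

# (value, error) pairs in grams, taken from the example above
M1, M2 = (540.0, 10.0), (940.0, 20.0)   # containers plus liquid
m1, m2 = (72.0, 1.0), (97.0, 1.0)       # empty containers

liquid = M1[0] - m1[0] + M2[0] - m2[0]  # total liquid mass
errors = [M1[1], m1[1], M2[1], m2[1]]

linear = sum(errors)                               # direct sum of errors, eq. (18)
quadrature = math.sqrt(sum(e ** 2 for e in errors))  # assumes independent random errors

print(f"M = {liquid:.0f} g")
print(f"linear sum of errors:     {linear:.0f} g")
print(f"quadrature sum of errors: {quadrature:.1f} g")
```

The linear sum (32 g) is an upper bound. The quadrature sum (about 22 g) is smaller because independent errors partially cancel.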

3B. Products and quotients (errors).


    \[x\pm \delta x=x\left(1\pm \dfrac{\delta x}{x}\right)\]

    \[y\pm \delta y=y\left(1\pm \dfrac{\delta y}{y}\right)\]

then, with q=xy you get

(19)   \begin{equation*}\boxed{\dfrac{\delta q}{\vert q\vert}=\dfrac{\delta x}{\vert x\vert}+\dfrac{\delta y}{\vert y\vert}\quad\Longleftrightarrow\quad\delta q=\vert y\vert\delta x+\vert x\vert\delta y}\end{equation*}

If q=x/y, you obtain essentially the same result:

(20)   \begin{equation*}\boxed{\dfrac{\delta q}{\vert q\vert}=\dfrac{\delta x}{\vert x\vert}+\dfrac{\delta y}{\vert y\vert}\quad\Longleftrightarrow\quad\delta q=\dfrac{\vert y\vert\delta x+\vert x\vert\delta y}{y^2}}\end{equation*}
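A short sketch of where the relative-error rule comes from, assuming small relative errors so that second order terms can be dropped. For the product,

    \[\left(x\pm\delta x\right)\left(y\pm\delta y\right)=xy\left(1\pm\dfrac{\delta x}{x}\right)\left(1\pm\dfrac{\delta y}{y}\right)\approx xy\left(1\pm\dfrac{\delta x}{x}\pm\dfrac{\delta y}{y}\right)\]

and, in the worst case, both signs agree, so the relative errors add. For the quotient, the expansion 1/(1\mp\epsilon)\approx 1\pm\epsilon with \epsilon=\delta y/y gives

    \[\dfrac{x\pm\delta x}{y\mp\delta y}=\dfrac{x}{y}\,\dfrac{1\pm\dfrac{\delta x}{x}}{1\mp\dfrac{\delta y}{y}}\approx\dfrac{x}{y}\left(1\pm\dfrac{\delta x}{x}\pm\dfrac{\delta y}{y}\right)\]

so the same sum of relative errors appears.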

3C. Error in powers.

With x\pm \delta x, q=x^n, then you derive

(21)   \begin{equation*}\dfrac{\delta q}{\vert q\vert}=\vert n\vert \dfrac{\delta x}{\vert x\vert}\quad\Longleftrightarrow\quad\delta q=\vert n\vert \vert x^{n-1}\vert \delta x\end{equation*}

and if q=f(x), with the error of x being \delta x, you get

(22)   \begin{equation*}\boxed{\delta f=\vert\dfrac{df}{dx}\vert\delta x}\end{equation*}

In the case of a function of several variables with independent errors, you apply a generalized Pythagorean theorem to get

(23)   \begin{equation*}\boxed{\delta q=\delta f(x_i)=\sqrt{\displaystyle{\sum_{i=1}^n}\left(\dfrac{\partial f}{\partial x_i}\delta x_i\right)^2}=\sqrt{\left(\dfrac{\partial f}{\partial x_1}\delta x_1\right)^2+\cdots+\left(\dfrac{\partial f}{\partial x_n}\delta x_n\right)^2}}\end{equation*}

or, equivalently, the errors are combined in quadrature (via standard deviations):

(24)   \begin{equation*}\boxed{\delta q=\delta f (x_1,\ldots,x_n)=\sqrt{\left(\dfrac{\partial f}{\partial x_1}\right)^2\left(\delta x_1\right)^2+\cdots+\left(\dfrac{\partial f}{\partial x_n}\right)^2\left(\delta x_n\right)^2}}\end{equation*}


(25)   \begin{equation*}\sigma (X)=\sigma (x_i)=\sqrt{\displaystyle{\sum_{i=1}^n}\sigma_i^2}=\sqrt{\sigma_1^2+\cdots+\sigma_n^2}\end{equation*}

for independent random errors (no correlations). Some simple examples are provided:

1st. q=kx, with x\pm \delta x and k an exact constant, implies \boxed{\delta q=\vert k\vert\delta x}.

2nd. q=\pm x\pm y\pm \cdots, with x_i\pm \delta x_i, implies \boxed{\delta q=\delta x+\delta y+\cdots}.

3rd. q=kx_1^{\alpha_1}\cdots x_n^{\alpha_n} would imply

    \[\boxed{\dfrac{\delta q}{\vert q\vert}=\vert\alpha_1\vert\dfrac{\delta x_1}{\vert x_1\vert}+\cdots +\vert\alpha_n\vert\dfrac{\delta x_n}{\vert x_n\vert}}\]
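The quadrature formula (23) is also easy to apply numerically when the partial derivatives are tedious to write down. The following is just a sketch under stated assumptions: the helper propagate is hypothetical, it estimates each partial derivative with a central finite difference, and it assumes independent errors. The rectangle area is a made-up example.

```python
import math

def propagate(f, values, errors, h=1e-6):
    """Quadrature error of f(*values), eq. (23), with numerical partial derivatives."""
    total = 0.0
    for i, (v, dv) in enumerate(zip(values, errors)):
        step = h * max(abs(v), 1.0)        # finite-difference step for variable i
        up = list(values)
        down = list(values)
        up[i] = v + step
        down[i] = v - step
        dfdx = (f(*up) - f(*down)) / (2.0 * step)   # central difference
        total += (dfdx * dv) ** 2
    return math.sqrt(total)

def area(a, b):
    """Hypothetical example: area of a rectangle with sides a and b."""
    return a * b

dq = propagate(area, [2.0, 3.0], [0.1, 0.2])
print(f"area = {area(2.0, 3.0):.2f} +/- {dq:.2f}")
```

Here the partial derivatives are 3 and 2, so the error is \sqrt{(3\cdot 0.1)^2+(2\cdot 0.2)^2}=0.5.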

When different experiments with measurements \overline{x}_i\pm\sigma_i are provided, the best estimator for the combined mean is a mean weighted by the inverse variances, i.e.,

(26)   \begin{equation*}\overline{X}_{best}=\dfrac{\displaystyle{\sum_{i=1}^n}\dfrac{\overline{x}_i}{\sigma^2_i}}{\displaystyle{\sum_{i=1}^n}\frac{1}{\sigma^2_i}}\end{equation*}

The best standard deviation from the different combined measurements would be:

(27)   \begin{equation*} \dfrac{1}{\sigma^2_{best}}=\displaystyle{\sum_{i=1}^n}\frac{1}{\sigma^2_i} \end{equation*}

This is also the maximum likelihood estimator of the mean, assuming the measurements are independent AND normally distributed. Then, the standard error of the weighted mean would be

(28)   \begin{equation*}\sigma_{\overline{X}_{best}}=\sqrt{\dfrac{1}{\displaystyle{\sum_{i=1}^n}\dfrac{1}{\sigma^2_i}}}\end{equation*}
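As a quick illustration of the weighted mean and its standard error, here is a minimal Python sketch. The function name weighted_mean and the two measurements are hypothetical.

```python
import math

def weighted_mean(means, sigmas):
    """Inverse-variance weighted mean and its standard error, eqs. (26) and (28)."""
    weights = [1.0 / s ** 2 for s in sigmas]
    wsum = sum(weights)
    best = sum(w * m for w, m in zip(weights, means)) / wsum
    sigma_best = math.sqrt(1.0 / wsum)
    return best, sigma_best

# Two hypothetical experiments: 10.0 +/- 1.0 and 12.0 +/- 2.0
best, sigma_best = weighted_mean([10.0, 12.0], [1.0, 2.0])
print(f"combined: {best:.2f} +/- {sigma_best:.2f}")
```

The combined value (10.4) lies closer to the more precise measurement, and its error (about 0.89) is smaller than either individual error.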

Least squares. A linear fit to a set of points using the least squares procedure proceeds as follows. Let (X_i, Y_i), with i=1,\ldots,n, be a set of pairs of numbers from experimental data. Then, the linear function Y=AX+B that best fits the data can be calculated from Y-Y_0=\overline{A}(X-X_0), where

    \[X_0=\overline{X}=\dfrac{\sum X_i}{n}\]

    \[Y_0=\overline{Y}=\dfrac{\sum Y_i}{n}\]

    \[\overline{A}=A=\dfrac{\sum (X_i-\overline{X})(Y_i-\overline{Y})}{\sum (X_i-\overline{X})^2}\]

Moreover, B=Y_0-AX_0.

We can also calculate the standard errors of the fitted A and B. Let the data be modeled as

    \[y_i=\alpha+\beta x_i+\varepsilon_i\]

We want to minimize the sum of the squared errors \varepsilon_i^2, i.e., we need to minimize

    \[Q(\alpha,\beta)=\sum_{i=1}^n\varepsilon_i^2=\sum_{i=1}^n\left(y_i-\alpha-\beta x_i\right)^2\]

    \[\varepsilon_i=y_i-\alpha-\beta x_i\]

Writing y=\alpha+\beta x, the estimates that minimize Q are

(29)   \begin{equation*} \hat{\alpha}=\overline{y}-\hat{\beta}\overline{x} \end{equation*}

(30)   \begin{equation*} \hat{\beta}=\dfrac{\sum_{i=1}^n(x_i-\overline{x})(y_i-\overline{y})}{\sum_{i=1}^n(x_i-\overline{x})^2}=\dfrac{s_{x,y}}{s_x^2}=r_{xy }\dfrac{s_y}{s_x} \end{equation*}

where s_x, s_y are the uncorrected standard deviations of the x, y samples, s_x^2 and s_{x,y} are the sample variance and covariance, and r_{xy} is the sample correlation coefficient. Moreover, the fit parameters have the standard errors

(31)   \begin{equation*} s_{\hat{\beta}}=\sqrt{\dfrac{\frac{1}{n-2}\sum_i\hat{\varepsilon}_i^2}{\sum_{i=1}^n(x_i-\overline{x})^2}} \end{equation*}

(32)   \begin{equation*} s_{\hat{\alpha}}=s_{\hat{\beta}}\sqrt{\dfrac{1}{n}\sum_{i=1}^nx_i^2}=\sqrt{\dfrac{1}{n(n-2)}\left(\sum_{i=1}^n\hat{\varepsilon}_i^2\right)\dfrac{\sum_{i=1}^n x_i^2}{\sum_{i=1}^n(x_i-\overline{x})^2}} \end{equation*}
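Equations (29)-(32) translate almost line by line into code. The sketch below is plain Python. The function name linear_fit and the five data points are hypothetical, chosen to lie close to the line y=2x.

```python
import math

def linear_fit(xs, ys):
    """Least squares line y = alpha + beta*x with standard errors, eqs. (29)-(32)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    beta = sxy / sxx                                   # slope, eq. (30)
    alpha = ybar - beta * xbar                         # intercept, eq. (29)
    rss = sum((y - alpha - beta * x) ** 2 for x, y in zip(xs, ys))  # sum of squared residuals
    s_beta = math.sqrt(rss / (n - 2) / sxx)            # eq. (31)
    s_alpha = s_beta * math.sqrt(sum(x * x for x in xs) / n)        # eq. (32)
    return alpha, beta, s_alpha, s_beta

xs = [1.0, 2.0, 3.0, 4.0, 5.0]      # hypothetical data
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
alpha, beta, s_alpha, s_beta = linear_fit(xs, ys)
print(f"slope     = {beta:.3f} +/- {s_beta:.3f}")
print(f"intercept = {alpha:.3f} +/- {s_alpha:.3f}")
```

For these points the fit gives a slope of 1.99 with standard error about 0.06, and an intercept of 0.05 with standard error about 0.20.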

Alternatively, all the above can be also written as follows. Define

(33)   \begin{eqnarray*} S_x=\sum x_i\\ S_y=\sum y_i\\ S_{xy}=\sum x_iy_i\\ S_{xx}=\sum x_i^2\\ S_{yy}=\sum y_i^2 \end{eqnarray*}

then, for a least squares fit with y=\hat{\alpha}+\hat{\beta}x+\hat{\varepsilon}, we find out that

(34)   \begin{eqnarray*} \hat{\beta}=\dfrac{nS_{xy}-S_{x}S_{y}}{nS_{xx}-S_x^2}\\ \hat{\alpha}=\dfrac{1}{n}S_y-\hat{\beta}\dfrac{1}{n}S_x\\ s_{\varepsilon}^2=\dfrac{1}{n(n-2)}\left[nS_{yy}-S_y^2-\hat{\beta}^2(nS_{xx}-S_x^2)\right]\\ s_{\hat{\beta}}^2=\dfrac{ns^2_{\varepsilon}}{nS_{xx}-S_x^2}\\ s_{\hat{\alpha}}^2=s_{\hat{\beta}}^2\dfrac{1}{n}S_{xx} \end{eqnarray*}

and where the correlation coefficient is

(35)   \begin{equation*} r=\dfrac{nS_{xy}-S_xS_y}{\sqrt{(nS_{xx}-S_x^2)(nS_{yy}-S_y^2)}}=\quad \frac{\sum\limits_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{(n-1)s_x s_y} =\frac{\sum\limits_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})} {\sqrt{\sum\limits_{i=1}^n (x_i-\bar{x})^2 \sum\limits_{i=1}^n (y_i-\bar{y})^2}} \end{equation*}

and where s_x, s_y are the corrected sample standard deviations of x, y. To know what s_{x,y} is in a more general setting, we note that the sample mean vector \mathbf{\bar{x}} is a column vector whose j-th element \bar{x}_j is the average value of the N observations of the j-th variable:

    \[ \bar{x}_{j}=\frac{1}{N}\sum_{i=1}^{N}x_{ij},\quad j=1,\ldots,K.\]

and thus, the sample average or mean vector contains the average of every variable as component, such as

(36)   \begin{equation*} \mathbf{\bar{x}}=\frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i = \begin{bmatrix} \bar{x}_1 \\ \vdots \\ \bar{x}_j \\ \vdots \\ \bar{x}_K \end{bmatrix} \end{equation*}

The sample covariance matrix is a K-by-K matrix

    \[\textstyle \mathbf{Q}=\left[ q_{jk}\right] \]

with entries

    \[q_{jk}=s_{x,y}=\frac{1}{N-1}\sum_{i=1}^{N}\left( x_{ij}-\bar{x}_j \right) \left( x_{ik}-\bar{x}_k \right)\]

where q_{jk} is an estimate of the covariance between the j-th variable and the k-th variable of the population underlying the data. In terms of the observation vectors, the sample covariance is

    \[\mathbf{Q} = s_{x,y}={1 \over {N-1}}\sum_{i=1}^N (\mathbf{x}_i-\mathbf{\bar{x}}) (\mathbf{x}_i-\mathbf{\bar{x}})^\mathrm{T}\]
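A small sketch may help to see the sample mean vector and the covariance matrix at work. It is written in plain Python with nested lists. The function name sample_covariance and the data (N=5 observations of K=2 variables) are hypothetical.

```python
import math

def sample_covariance(rows):
    """Mean vector and K-by-K sample covariance matrix of N observations (rows)."""
    n = len(rows)
    k = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(k)]
    cov = [[sum((row[j] - means[j]) * (row[l] - means[l]) for row in rows) / (n - 1)
            for l in range(k)]
           for j in range(k)]
    return means, cov

# Hypothetical data: each row is one observation (x, y)
rows = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
means, Q = sample_covariance(rows)

# Correlation coefficient from the covariance matrix entries
r = Q[0][1] / math.sqrt(Q[0][0] * Q[1][1])
print("means:", means)
print("Q:", Q)
print(f"r = {r:.5f}")
```

The diagonal entries of Q are the corrected sample variances, and the off-diagonal entry is the sample covariance s_{x,y}. Dividing it by the product of the standard deviations gives the correlation coefficient, here about 0.9986.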

Finally, you can also give, for a chosen confidence level, the intervals where \beta and \alpha lie. For normally distributed errors, the t-value t=(\hat{\beta}-\beta)/s_{\hat{\beta}} has a Student’s t-distribution with n-2 degrees of freedom. Using it, we can construct a confidence interval for \beta:

    \[ \beta \in \left[\widehat\beta - s_{\widehat\beta} t^*_{n - 2},\ \widehat\beta + s_{\widehat\beta} t^*_{n - 2}\right]\]

at confidence level (C.L.) 1-\gamma, where t^*_{n - 2}  is the \left(1 \;-\; \frac{\gamma}{2}\right)\text{-th} quantile of the t_{n-2} distribution. For example, if \gamma=0.05, then the C.L. is 95\%.

Similarly, the confidence interval for the intercept coefficient \hat{\alpha} is given by

    \[\alpha \in \left[ \widehat\alpha - s_{\widehat\alpha} t^*_{n - 2},\ \widehat\alpha + s_{\widehat{\alpha}} t^*_{n - 2}\right]\]

at confidence level (C.L.) 1-\gamma, where, as before,

    \[s_{\widehat\alpha} = s_{\widehat{\beta}}\sqrt{\frac{1}{n} \sum_{i=1}^n x_i^2} = \sqrt{\frac{1}{n(n - 2)} \left(\sum_{i=1}^n \widehat{\varepsilon}_i^{\,2} \right) \frac{\sum_{i=1}^n x_i^2} {\sum_{i=1}^n (x_i - \bar{x})^2} }\]
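As a numerical illustration of these intervals, here is a minimal sketch. The fitted slope, its standard error and n=5 are hypothetical inputs, and the quantile t^*=3.182 (the 0.975 quantile for 3 degrees of freedom) is taken from a standard t-table rather than computed.

```python
def confidence_interval(estimate, std_error, t_star):
    """Two-sided interval estimate +/- std_error * t_star."""
    half_width = std_error * t_star
    return estimate - half_width, estimate + half_width

# Hypothetical fit of n = 5 points: slope 1.99 with standard error 0.0597
beta_hat, s_beta = 1.99, 0.0597

# 0.975 quantile of Student's t with n - 2 = 3 degrees of freedom (table value),
# which corresponds to gamma = 0.05, i.e. a 95% confidence level
t_star = 3.182

low, high = confidence_interval(beta_hat, s_beta, t_star)
print(f"beta in [{low:.2f}, {high:.2f}] at 95% C.L.")
```

With so few points, t^* is much larger than the Gaussian value 1.96, so the interval is correspondingly wider.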

Remark: for non-homogeneous samples (e.g., with outliers or strongly skewed data), the median is often a better estimate of the central value than the arithmetic mean.

See you in another blog post!
