# LOG#231. Statistical tools.

Today's subject: errors. We will review the formulae used to handle them in experimental data.

Generally speaking, errors can be:

1st. Random. Due to imperfections of measurements or intrinsically random sources.

2nd. Systematic. Due to the procedures used to measure or to uncalibrated apparatus.

There is also a distinction between accuracy and precision:

1st. Accuracy is closeness to the true value of a parameter or magnitude. With this definition, it is a measure of systematic bias or error. However, accuracy is sometimes defined (ISO definition) as the combination of systematic and random errors, i.e., of the two observational errors above. High accuracy would then require both high trueness and high precision.

2nd. Precision. It is a measure of random errors, i.e., of statistical variability. Random errors can be reduced with further measurements. Precision also requires repeatability and reproducibility.

1. Statistical estimators.

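Arithmetic mean of $n$ measurements $x_1,\ldots,x_n$:

$$\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_i$$

(1)

Absolute error of the measurement $x_i$ with respect to the mean $\bar{x}$:

$$\varepsilon_a=\vert x_i-\bar{x}\vert$$

(2)

Relative error:

$$\varepsilon_r=\frac{\varepsilon_a}{\vert\bar{x}\vert}$$

(3)

Average deviation or error:

$$\delta_m=\frac{1}{n}\sum_{i=1}^{n}\vert x_i-\bar{x}\vert$$

(4)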

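Variance or average quadratic error or mean squared error:

$$\sigma_x^2=s^2=\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$$

(5)

This is the unbiased variance, correct as long as the data are a sample from a larger population; the denominator $n-1$ is the Bessel correction. When the sample is the total population, a shift must be done from $n-1$ to $n$.

Standard deviation (root mean squared error, or mean quadratic error):

$$\sigma_x=s=\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}$$

(6)

This is the Bessel-corrected estimator of the mean quadratic error, or the standard deviation of the sample. Strictly speaking, it is the square root of the unbiased variance, and is itself still a slightly biased estimator of the standard deviation. The Bessel correction is assumed whenever our sample is smaller than the total population. For the total population, the standard deviation reads, after shifting $n-1\rightarrow n$:

$$\sigma_n=\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2}$$

(7)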

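Mean error or standard error of the mean:

$$\varepsilon_{\bar{x}}=\frac{\sigma_x}{\sqrt{n}}=\sqrt{\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n(n-1)}}$$

(8)

If, instead of the Bessel-corrected mean quadratic error $\sigma_x$, we use the total population error $\sigma_n$ (denominator $n$), the standard error reads

$$\varepsilon_{\bar{x}}=\frac{\sigma_n}{\sqrt{n}}=\frac{1}{n}\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}$$

(9)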

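Variance of the mean quadratic error (variance of the variance), for normally distributed data with variance $\sigma^2$ and sample size $n$:

$$\sigma^2(s^2)=\frac{2\sigma^4}{n-1}$$

(10)

Standard error of the mean quadratic error (error of the variance):

$$\sigma(s^2)=\sigma^2\sqrt{\frac{2}{n-1}}$$

(11)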
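The estimators of this section are easy to check numerically. The following is a minimal sketch using only the Python standard library; the function name `estimators` and the sample readings are hypothetical, chosen only for illustration.

```python
import math

def estimators(xs):
    """Return basic error estimators for a list of repeated measurements."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two measurements")
    mean = sum(xs) / n
    ss = sum((x - mean) ** 2 for x in xs)          # sum of squared deviations
    avg_dev = sum(abs(x - mean) for x in xs) / n   # average deviation
    var_sample = ss / (n - 1)                      # Bessel-corrected variance
    var_pop = ss / n                               # total population variance
    std_sample = math.sqrt(var_sample)
    return {
        "mean": mean,
        "avg_dev": avg_dev,
        "var_sample": var_sample,
        "var_pop": var_pop,
        "std_sample": std_sample,
        "std_pop": math.sqrt(var_pop),
        "sem": std_sample / math.sqrt(n),          # standard error of the mean
    }

# Made-up readings, for illustration only.
data = [9.8, 10.1, 10.0, 9.9, 10.2]
result = estimators(data)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

Note how the only difference between the sample and total population versions is the denominator, $n-1$ versus $n$.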

2. Gaussian/normal distribution intervals for a given confidence level (interval half-width an integer number of sigmas)

Here we provide the probability that a random variable $X$ following a normal distribution, with mean $\mu$ and standard deviation $\sigma$, takes a value inside the interval $\mu\pm k\sigma$.

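1 sigma amplitude ($1\sigma$).

$$P(\mu-\sigma\leq X\leq\mu+\sigma)\approx 0.6827$$

(12)

2 sigma amplitude ($2\sigma$).

$$P(\mu-2\sigma\leq X\leq\mu+2\sigma)\approx 0.9545$$

(13)

3 sigma amplitude ($3\sigma$).

$$P(\mu-3\sigma\leq X\leq\mu+3\sigma)\approx 0.9973$$

(14)

4 sigma amplitude ($4\sigma$).

$$P(\mu-4\sigma\leq X\leq\mu+4\sigma)\approx 0.999937$$

(15)

5 sigma amplitude ($5\sigma$).

$$P(\mu-5\sigma\leq X\leq\mu+5\sigma)\approx 0.99999943$$

(16)

6 sigma amplitude ($6\sigma$).

$$P(\mu-6\sigma\leq X\leq\mu+6\sigma)\approx 0.999999998$$

(17)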

For a given confidence level (generally $90\%$, $95\%$ or $99\%$), the interval will be $\mu\pm 1.645\sigma$, $\mu\pm 1.960\sigma$ or $\mu\pm 2.576\sigma$, respectively.
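These probabilities, and the interval half-width for any confidence level, can be computed rather than looked up. A minimal sketch with the Python standard library follows; the helper names `prob_within` and `z_for_confidence` are hypothetical, and the confidence levels in the loop are just common choices.

```python
import math
from statistics import NormalDist

def prob_within(k):
    """P(|X - mu| < k*sigma) for a normally distributed X."""
    return math.erf(k / math.sqrt(2))

def z_for_confidence(cl):
    """Half-width, in units of sigma, of the central interval with probability cl."""
    return NormalDist().inv_cdf(0.5 + cl / 2)

for k in range(1, 7):
    print(f"{k} sigma: {100 * prob_within(k):.7f} %")

for cl in (0.90, 0.95, 0.99):
    print(f"C.L. {cl:.0%}: +/- {z_for_confidence(cl):.3f} sigma")
```

The first function uses the identity $P(\vert X-\mu\vert<k\sigma)=\mathrm{erf}(k/\sqrt{2})$; the second inverts the normal cumulative distribution at $(1+\mathrm{C.L.})/2$, since the excluded probability is split evenly between the two tails.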

3. Error propagation.

Usually, errors propagate into quantities that are not measured directly, but computed from measured ones.

3A. Sum and subtraction.

Let us define $x\pm\delta x$ and $y\pm\delta y$. Furthermore, define the variable $q=x\pm y$. The error in $q$ would be, at most:

$$\delta q\approx\delta x+\delta y$$

(18)

Example. , . , with   and , with . Then, we have:

as liquid mass.

, as total liquid error.

is the liquid mass and its error, together, with 3 significant digits or figures.

3B. Products and quotients (errors).

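If

$$q=xy$$

then, with $x\pm\delta x$ and $y\pm\delta y$, you get

$$\frac{\delta q}{\vert q\vert}\approx\frac{\delta x}{\vert x\vert}+\frac{\delta y}{\vert y\vert}$$

(19)

If $q=x/y$, you obtain essentially the same result:

$$\frac{\delta q}{\vert q\vert}\approx\frac{\delta x}{\vert x\vert}+\frac{\delta y}{\vert y\vert}$$

(20)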

3C. Error in powers.

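With $x\pm\delta x$, $q=x^n$, then you derive

$$\frac{\delta q}{\vert q\vert}=\vert n\vert\frac{\delta x}{\vert x\vert}$$

(21)

and if $q=f(x)$, with the error of $x$ being $\delta x$, you get

$$\delta q=\left\vert\frac{df}{dx}\right\vert\delta x$$

(22)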

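In the case of a function of several variables, $q=f(x,\ldots,z)$, you apply a generalized Pythagorean theorem to get

$$\delta q=\sqrt{\left(\frac{\partial q}{\partial x}\delta x\right)^2+\cdots+\left(\frac{\partial q}{\partial z}\delta z\right)^2}$$

(23)

or, equivalently, the errors are combined in quadrature (via standard deviations):

$$\sigma_q^2=\left(\frac{\partial q}{\partial x}\right)^2\sigma_x^2+\cdots+\left(\frac{\partial q}{\partial z}\right)^2\sigma_z^2$$

(24)

since

$$\sigma^2(X+Y)=\sigma^2(X)+\sigma^2(Y)$$

(25)

for independent random errors (no correlations). Some simple examples are provided: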

1st. , with , implies .

2nd. , with , implies .

3rd. would imply
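Combination in quadrature can also be sketched numerically. The snippet below is an illustration under stated assumptions, not a reference implementation: the function `propagate` is hypothetical, it approximates the partial derivatives with central finite differences, it assumes independent random errors, and the input values and errors are made up.

```python
import math

def propagate(f, values, errors, h=1e-6):
    """Return f(values) and its error, combining contributions in quadrature.

    Partial derivatives are approximated by central finite differences;
    the inputs are assumed to carry independent random errors.
    """
    total = 0.0
    for i, (v, dv) in enumerate(zip(values, errors)):
        step = h * max(abs(v), 1.0)
        up = list(values)
        down = list(values)
        up[i] = v + step
        down[i] = v - step
        partial = (f(*up) - f(*down)) / (2 * step)
        total += (partial * dv) ** 2
    return f(*values), math.sqrt(total)

# Made-up inputs: x = 3.0 +/- 0.3 and y = 4.0 +/- 0.4.
q_sum, dq_sum = propagate(lambda x, y: x + y, [3.0, 4.0], [0.3, 0.4])
q_prod, dq_prod = propagate(lambda x, y: x * y, [3.0, 4.0], [0.3, 0.4])
print(q_sum, dq_sum)
print(q_prod, dq_prod, dq_prod / q_prod)
```

For the sum, the absolute errors add in quadrature, $\sqrt{0.3^2+0.4^2}=0.5$. For the product, it is the relative errors (10% each here) that add in quadrature, giving a relative error of about 14%.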

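When different experiments with measurements $x_i\pm\sigma_i$ are provided, the best estimator for the combined mean is a mean weighted by the inverse variances, i.e.,

$$\bar{x}_{\text{best}}=\frac{\sum_i x_i/\sigma_i^2}{\sum_i 1/\sigma_i^2}$$

(26)

The best standard deviation from the different combined measurements would be:

$$\frac{1}{\sigma_{\text{best}}^2}=\sum_i\frac{1}{\sigma_i^2}$$

(27)

The weighted mean is also the maximum likelihood estimator of the mean, assuming the measurements are independent AND normally distributed. Then, the standard error of the weighted mean would be

$$\sigma_{\bar{x}}=\sqrt{\frac{1}{\sum_i\sigma_i^{-2}}}$$

(28)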
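A minimal sketch of the inverse-variance weighted mean follows. The function name `weighted_mean` and the two measurements are hypothetical, and the code assumes independent measurements with known standard deviations.

```python
import math

def weighted_mean(values, sigmas):
    """Inverse-variance weighted mean and its standard error."""
    weights = [1.0 / s ** 2 for s in sigmas]
    wsum = sum(weights)
    mean = sum(w * x for w, x in zip(weights, values)) / wsum
    return mean, math.sqrt(1.0 / wsum)

# Made-up results of two experiments: 10.0 +/- 1.0 and 12.0 +/- 2.0.
mean, err = weighted_mean([10.0, 12.0], [1.0, 2.0])
print(f"{mean:.3f} +/- {err:.3f}")
```

The more precise measurement dominates: the combined value lies closer to 10.0 than to 12.0, and the combined error is smaller than either individual error.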

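Least squares. A linear fit to a graph of $n$ points using the least squares procedure proceeds as follows. Let $(x_i,y_i)$, with $i=1,\ldots,n$, be some set of numbers from experimental data. Then, the linear function that is the best fit to the data can be calculated with $y=ax+b$, where

$$a=\frac{n\sum x_iy_i-\sum x_i\sum y_i}{n\sum x_i^2-\left(\sum x_i\right)^2},\qquad b=\frac{\sum x_i^2\sum y_i-\sum x_i\sum x_iy_i}{n\sum x_i^2-\left(\sum x_i\right)^2}$$

Moreover, $b=\bar{y}-a\bar{x}$, so the fitted line passes through the point of means $(\bar{x},\bar{y})$.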

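We can also calculate the standard errors for the fitted intercept $\alpha$ and slope $\beta$. Let the data be

$$y_i=\alpha+\beta x_i+\varepsilon_i$$

We want to minimize the variance, i.e., the squared errors $\varepsilon_i^2$, i.e., we need to minimize

$$Q(\alpha,\beta)=\sum_{i=1}^{n}\varepsilon_i^2=\sum_{i=1}^{n}(y_i-\alpha-\beta x_i)^2$$

Writing $y=\hat{\alpha}+\hat{\beta}x$ for the fitted line, the estimates are rewritten as

$$\hat{\beta}=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}=\frac{s_{xy}}{s_x^2}=r_{xy}\frac{s_y}{s_x}$$

(29)

$$\hat{\alpha}=\bar{y}-\hat{\beta}\bar{x}$$

(30)

where $r_{xy}$ is the sample correlation coefficient, $s_x,s_y$ are the uncorrected standard deviations of the samples, and $s_x^2,s_{xy}$ are the sample variance and covariance. Moreover, the fit parameters have the standard errors

$$s_{\hat{\beta}}=\sqrt{\frac{\frac{1}{n-2}\sum_{i=1}^{n}\hat{\varepsilon}_i^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}}$$

(31)

$$s_{\hat{\alpha}}=s_{\hat{\beta}}\sqrt{\frac{1}{n}\sum_{i=1}^{n}x_i^2}$$

(32)

where $\hat{\varepsilon}_i=y_i-\hat{\alpha}-\hat{\beta}x_i$ are the residuals of the fit.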
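The fit and the standard errors of its parameters can be sketched in a few lines of plain Python. This is an illustration only: the function name `linear_fit` and the data points are hypothetical, and the code assumes the usual simple linear regression model with at least three points.

```python
import math

def linear_fit(xs, ys):
    """Least squares line y = alpha + beta*x with standard errors of both parameters."""
    n = len(xs)
    if n < 3:
        raise ValueError("need at least three points for the standard errors")
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    beta = sxy / sxx                     # slope
    alpha = ybar - beta * xbar           # intercept
    # Residual sum of squares, with n - 2 degrees of freedom.
    rss = sum((y - alpha - beta * x) ** 2 for x, y in zip(xs, ys))
    se_beta = math.sqrt(rss / (n - 2) / sxx)
    se_alpha = se_beta * math.sqrt(sum(x * x for x in xs) / n)
    return alpha, beta, se_alpha, se_beta

# Made-up data points, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
alpha, beta, se_alpha, se_beta = linear_fit(xs, ys)
print(f"slope     = {beta:.3f} +/- {se_beta:.3f}")
print(f"intercept = {alpha:.3f} +/- {se_alpha:.3f}")
```

The divisor $n-2$ reflects the two parameters estimated from the data, which is why at least three points are needed to obtain any error estimate at all.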

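Alternatively, all the above can be also written as follows. Define

$$S_x=\sum x_i,\quad S_y=\sum y_i,\quad S_{xx}=\sum x_i^2,\quad S_{yy}=\sum y_i^2,\quad S_{xy}=\sum x_iy_i$$

(33)

then, for a least squares fit with $y=\hat{\alpha}+\hat{\beta}x$, we find out that

$$\begin{aligned}\hat{\beta}&=\frac{nS_{xy}-S_xS_y}{nS_{xx}-S_x^2}\\ \hat{\alpha}&=\frac{1}{n}S_y-\hat{\beta}\frac{1}{n}S_x\\ s_{\varepsilon}^2&=\frac{1}{n(n-2)}\left[nS_{yy}-S_y^2-\hat{\beta}^2\left(nS_{xx}-S_x^2\right)\right]\\ s_{\hat{\beta}}^2&=\frac{n\,s_{\varepsilon}^2}{nS_{xx}-S_x^2}\\ s_{\hat{\alpha}}^2&=s_{\hat{\beta}}^2\,\frac{1}{n}S_{xx}\end{aligned}$$

(34)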

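and where the correlation coefficient is

$$r_{xy}=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{(n-1)s_xs_y}=\frac{n\sum x_iy_i-\sum x_i\sum y_i}{\sqrt{n\sum x_i^2-\left(\sum x_i\right)^2}\sqrt{n\sum y_i^2-\left(\sum y_i\right)^2}}$$

(35)

and where $s_x,s_y$ are the corrected sample standard deviations of $x,y$. To know what $s_{xy}$ is in a more general setting, we note that the sample mean vector $\bar{\mathbf{x}}$ is a column vector whose $j$-th element $\bar{x}_j$ is the average value of the $N$ observations of the $j$-th variable:

$$\bar{x}_j=\frac{1}{N}\sum_{i=1}^{N}x_{ij},\quad j=1,\ldots,K$$

and thus, the sample average or mean vector contains the average of every variable as a component, such as

$$\bar{\mathbf{x}}=\frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i=\left(\bar{x}_1,\ldots,\bar{x}_K\right)^T$$

(36)

The sample covariance matrix is a “K”-by-“K” matrix

$$\mathbf{Q}=\left[q_{jk}\right]$$

with entries

$$q_{jk}=\frac{1}{N-1}\sum_{i=1}^{N}\left(x_{ij}-\bar{x}_j\right)\left(x_{ik}-\bar{x}_k\right)$$

where $q_{jk}$ is an estimate of the covariance between the $j$-th variable and the $k$-th variable of the population underlying the data. In terms of the observation vectors, the sample covariance is

$$\mathbf{Q}=\frac{1}{N-1}\sum_{i=1}^{N}\left(\mathbf{x}_i-\bar{\mathbf{x}}\right)\left(\mathbf{x}_i-\bar{\mathbf{x}}\right)^T$$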
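As a minimal sketch of the sample mean vector and the sample covariance matrix just described, the following plain Python code treats each observation as a list of $K$ values. The function names and the three observations are hypothetical, and the normalization $N-1$ (Bessel-corrected) is assumed.

```python
def sample_mean_vector(rows):
    """Average of each of the K variables over the N observations."""
    n = len(rows)
    k = len(rows[0])
    return [sum(row[j] for row in rows) / n for j in range(k)]

def sample_covariance(rows):
    """K-by-K sample covariance matrix with the N - 1 normalization."""
    n = len(rows)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sample_mean_vector(rows)
    k = len(mean)
    return [
        [
            sum((row[j] - mean[j]) * (row[l] - mean[l]) for row in rows) / (n - 1)
            for l in range(k)
        ]
        for j in range(k)
    ]

# Made-up data: N = 3 observations of K = 2 variables.
obs = [[1.0, 2.0], [2.0, 4.0], [3.0, 9.0]]
mean_vec = sample_mean_vector(obs)
cov = sample_covariance(obs)
print(mean_vec)
print(cov)
```

The diagonal entries are the sample variances of each variable and the off-diagonal entries are the covariances, so the matrix is symmetric by construction.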

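Finally, you can also provide, at a given confidence level, the intervals where $\alpha,\beta$ lie. The t-value $t=\dfrac{\hat{\beta}-\beta}{s_{\hat{\beta}}}$, where $s_{\hat{\beta}}$ is the standard error of the slope estimate $\hat{\beta}$, has a Student’s t-distribution with $n-2$ degrees of freedom. Using it, we can construct a confidence interval for $\beta$:

$$\beta\in\left[\hat{\beta}-s_{\hat{\beta}}\,t^{*}_{n-2},\ \hat{\beta}+s_{\hat{\beta}}\,t^{*}_{n-2}\right]$$

at confidence level (C.L.) $1-\gamma$, where $t^{*}_{n-2}$ is the $\left(1-\frac{\gamma}{2}\right)$ quantile of the $t_{n-2}$ distribution. For example, if $\gamma=0.05$, then the C.L. is $95\%$.

Similarly, the confidence interval for the intercept coefficient $\alpha$ is given by

$$\alpha\in\left[\hat{\alpha}-s_{\hat{\alpha}}\,t^{*}_{n-2},\ \hat{\alpha}+s_{\hat{\alpha}}\,t^{*}_{n-2}\right]$$

at confidence level (C.L.) $1-\gamma$, where, as before,

$$s_{\hat{\alpha}}=s_{\hat{\beta}}\sqrt{\frac{1}{n}\sum_{i=1}^{n}x_i^2}=\sqrt{\frac{1}{n(n-2)}\left(\sum_{j=1}^{n}\hat{\varepsilon}_j^2\right)\frac{\sum_{i=1}^{n}x_i^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}}$$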

Remark: for non-homogeneous samples, the best estimate of the average is not the arithmetic mean, but the median.

See you in another blog post!
