Statistics plays a key role in research. Good statistical knowledge helps researchers take a systematic approach to data collection and handling, and decisions about the adopted process and parameters are easier to make with statistical support. A few important statistical terms used in research are given below:
Mean: The sum of all the numbers in a set divided by the number of items in the set. For example: (10 + 20 + 30 + 40 + 50) / 5 = 30.
Sum of Squares: Sum of the squared differences between the overall average and the amount of variation explained by that row source.
Degrees of Freedom (df): The number of estimated parameters used to compute the source’s sum of squares.
Mean Square (Variance): The sum of squares divided by the degrees of freedom.
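The three quantities above can be illustrated on the same five numbers used for the mean. This is only a minimal sketch for a single sample, where the one estimated parameter is the mean; in an ANOVA table each source has its own sum of squares and degrees of freedom. The variable names are illustrative assumptions.

```python
from statistics import variance

data = [10, 20, 30, 40, 50]

# Mean: sum of the values divided by the number of values.
data_mean = sum(data) / len(data)

# Sum of squares: squared deviations of each value from the mean.
sum_of_squares = sum((x - data_mean) ** 2 for x in data)

# Degrees of freedom: n - 1, because one parameter (the mean) was estimated.
degrees_of_freedom = len(data) - 1

# Mean square (sample variance): sum of squares divided by degrees of freedom.
mean_square = sum_of_squares / degrees_of_freedom

# Cross-check against the standard library's sample variance.
library_variance = variance(data)

print(data_mean, sum_of_squares, degrees_of_freedom, mean_square)
```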
Model: This row shows how much variation in the response is explained by the model, along with the overall model test for significance.
Terms: The model is separated into individual terms, and each term is tested independently. The terms are built from the input parameters (factors), such as A, B, C, and D.
Linear Model: Sequential sum of squares for the linear terms such as A, B, C, D.
2FI Model: Sequential sum of squares for the two-factor interaction terms such as AB, BC, CD, AD, BD, AC.
Quadratic Model: Sequential sum of squares for the quadratic terms such as A², B², C², D².
Cubic Model: Sequential sum of squares for the cubic terms such as ABC, A²B, A³.
Note: For each of the models above, the F-value tests the significance of adding that model's terms to the terms already in the model. A small p-value (Prob > F less than 0.05) indicates that adding those terms has improved the model.
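As an illustration of the sequential idea, the sketch below adds the linear terms (A, B) and then the two-factor interaction (AB) for a replicated two-level design. The response values are hypothetical, and the shortcut used for each term's sum of squares assumes coded, mutually orthogonal columns; it is not a general regression routine. The F-value here uses the residual mean square of the model that includes the added term, which is one common choice, and software packages may use a different denominator.

```python
# Coded levels (-1/+1) for a replicated 2x2 factorial; responses are hypothetical.
A = [-1, 1, -1, 1, -1, 1, -1, 1]
B = [-1, -1, 1, 1, -1, -1, 1, 1]
y = [13, 21, 15, 31, 15, 19, 17, 29]
n = len(y)

# The interaction column is the element-wise product of A and B.
AB = [a * b for a, b in zip(A, B)]


def term_ss(column):
    """Sum of squares for one term (valid only for orthogonal +/-1 columns)."""
    contrast = sum(c * v for c, v in zip(column, y))
    return contrast ** 2 / n


y_mean = sum(y) / n
ss_total = sum((v - y_mean) ** 2 for v in y)  # corrected total

ss_linear = term_ss(A) + term_ss(B)  # added by the linear terms
ss_2fi = term_ss(AB)                 # added by the interaction term
ss_residual = ss_total - ss_linear - ss_2fi

df_2fi = 1
df_residual = n - 4  # parameters estimated: intercept, A, B, AB

# F-value for adding the 2FI term to the linear model.
f_2fi = (ss_2fi / df_2fi) / (ss_residual / df_residual)

print(ss_linear, ss_2fi, ss_residual, f_2fi)
```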
Residual Error: It shows how much variation in the response is still unexplained. It is also called error variance or unexplained variance. Mathematically,
Residual = Experimental value – Predicted value
Pure Error: It reflects the variability of the observations within each treatment; in other words, it is the amount of difference between replicate runs.
Lack of Fit: The lack of fit (LoF) report shows whether the model fits the data well. The report appears only when it is possible to conduct the test.
- The difference between the error sum of squares (SSE) from the model and the pure error sum of squares (SSPE) is called the lack of fit sum of squares (SSLF).
- If the model is inadequate, the lack of fit variation can be significantly greater than the pure error variation.
- LoF is the amount by which the model predictions miss the observations.
- SSE (sum of squares due to error) therefore splits into two parts:
SSE = SSPE + SSLF
where SSPE = sum of squares due to pure error and SSLF = sum of squares due to lack of fit.
SSPE measures the inherent variability of y which cannot be explained by any model. (Without repeated measurements, SSPE = 0 and hence we cannot conduct the LoF test)
SSLF represents the variability of y that cannot be explained by the given model. This value may be reduced if a better model is used.
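The decomposition SSE = SSPE + SSLF can be checked numerically. The following is a minimal sketch, assuming a straight-line model fitted to hypothetical data with two replicates at each of four x levels; the data were chosen to curve upward so that the straight line shows visible lack of fit. All names are illustrative.

```python
from collections import defaultdict

# Hypothetical data: two replicate runs at each of four x levels.
x = [1, 1, 2, 2, 3, 3, 4, 4]
y = [1.0, 1.2, 3.9, 4.1, 9.1, 8.9, 15.8, 16.2]
n = len(y)

# Least-squares straight line y = b0 + b1 * x.
x_mean = sum(x) / n
y_mean = sum(y) / n
sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
sxx = sum((xi - x_mean) ** 2 for xi in x)
b1 = sxy / sxx
b0 = y_mean - b1 * x_mean


def predict(xi):
    return b0 + b1 * xi


# SSE: squared residuals (observed minus predicted) over all runs.
sse = sum((yi - predict(xi)) ** 2 for xi, yi in zip(x, y))

# Group the replicates by x level.
groups = defaultdict(list)
for xi, yi in zip(x, y):
    groups[xi].append(yi)

# SSPE: variation of replicates around their own group mean.
sspe = sum((yi - sum(ys) / len(ys)) ** 2 for ys in groups.values() for yi in ys)

# SSLF: variation of the group means around the model predictions.
sslf = sum(len(ys) * (sum(ys) / len(ys) - predict(xi)) ** 2
           for xi, ys in groups.items())

df_pure_error = n - len(groups)   # runs minus distinct levels
df_lack_of_fit = len(groups) - 2  # distinct levels minus model parameters

# F-value for the lack of fit test: MS(lack of fit) / MS(pure error).
f_lof = (sslf / df_lack_of_fit) / (sspe / df_pure_error)

print(sse, sspe, sslf, f_lof)
```

A large F-value here means the group means sit much farther from the fitted line than the replicates sit from each other, which is the sign of an inadequate model.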
Sum of Squares (SS): It measures the dispersion of the data and how well the data fit the model. The main types of SS are as follows.
- The total sum of squares (TSS): It shows the variation of the values of a dependent variable from its mean. It can be calculated as:
TSS = Σ (Yi − Ȳ)²
where Yi is the observed value and Ȳ is the mean of the observations.
- Regression sum of squares (SSR): It shows how much of the variation in the data is captured by the regression model. A higher regression sum of squares relative to the total sum of squares indicates that the model fits the data better. It is also called the sum of squares due to regression or the explained sum of squares. It can be calculated as:
SSR = Σ (Ŷi − Ȳ)²
where Ŷi is the value predicted by the model.
- Residual sum of squares (SSE): It measures the amount of variation in a data set that is not explained by the regression model. In other words, it measures the variation of the modeling errors, and it is also known as the sum of squared errors of prediction. It can be calculated as:
SSE = Σ (Yi − Ŷi)²
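The three sums of squares above are linked: for a least-squares model that includes an intercept, TSS = SSR + SSE. The sketch below verifies this on a small hypothetical data set fitted with a straight line; the data and variable names are assumptions made for illustration.

```python
# Hypothetical observations.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(y)

# Least-squares straight line y = b0 + b1 * x.
x_mean = sum(x) / n
y_mean = sum(y) / n
b1 = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
      / sum((xi - x_mean) ** 2 for xi in x))
b0 = y_mean - b1 * x_mean
predicted = [b0 + b1 * xi for xi in x]

# Total: observed values around their mean.
tss = sum((yi - y_mean) ** 2 for yi in y)

# Regression (explained): predicted values around the mean.
ssr = sum((pi - y_mean) ** 2 for pi in predicted)

# Residual (unexplained): observed minus predicted.
sse = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))

# Fraction of the total variation explained by the model.
explained_fraction = ssr / tss

print(tss, ssr, sse, explained_fraction)
```

When the explained fraction is close to 1, SSR is close to TSS and SSE is small, which is what a well-fitting model looks like.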
Cor Total: It is also called the “Corrected Total Sum of Squares” (CTSS). It shows the amount of variation around the mean of the observations. The model explains part of it, and the residual accounts for the rest.
F value: It is used to judge whether a source is statistically significant. The F-value compares the source’s mean square to the residual mean square, and it is calculated by dividing the former by the latter.
P-value: It is known as a probability value. It is the probability, assuming the null hypothesis is true, of obtaining a result at least as extreme as the one observed, and it is used to decide whether to reject the null hypothesis. The p-value is a number that lies between 0 and 1.
- A small p-value (less than 0.05) calls for rejection of the null hypothesis (there are no factor effects).
- If the p-value (Prob > F) is less than 0.05, the source is considered significant.
- If model terms are significant, they probably have a real effect on the response.
- Significant Lack of fit (LoF) shows that the model does not fit the data within the observed replicate variation.
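The F-value and p-value can be worked through on a small case. This sketch assumes a hypothetical one-factor experiment with three treatment groups, so the source has 2 degrees of freedom. The p-value uses a closed-form expression for the upper tail of the F distribution that holds only when the numerator degrees of freedom equal 2; for other cases a statistics library would be needed.

```python
# Hypothetical responses for three treatment groups.
groups = [
    [10, 12, 11, 13, 14],
    [15, 17, 16, 18, 19],
    [11, 13, 12, 14, 15],
]

all_values = [v for g in groups for v in g]
grand_mean = sum(all_values) / len(all_values)

# Source (between-group) sum of squares and degrees of freedom.
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
df_between = len(groups) - 1

# Residual (within-group) sum of squares and degrees of freedom.
ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
df_within = len(all_values) - len(groups)

# F-value: source mean square divided by residual mean square.
f_value = (ss_between / df_between) / (ss_within / df_within)

# Prob > F. This closed form is valid ONLY for numerator df == 2.
if df_between != 2:
    raise ValueError("closed-form p-value below requires numerator df of 2")
p_value = (1 + 2 * f_value / df_within) ** (-df_within / 2)

significant = p_value < 0.05
print(f_value, p_value, significant)
```

With these numbers the p-value falls well below 0.05, so the null hypothesis of no factor effect would be rejected.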
Standard Deviation, Coefficient of Variation and PRESS
Adj R-squared and Pred R-squared