Basic Statistics

 

Population: Population is a set of similar items or events that is of interest for some statistical problem. It can be a group of existing object or a hypothetical or potentially infinite group of items.

Sample: Sample is a selection of individual, events or objects taken from a well-defined population. Thus sample is a subset of population.

Parameters: These are those quantities that summarizes or describes an aspect of the population such as mean, deviation, correlation, etc.

Sampling Error: It is the difference between sample statistics and population parameters. Since, samples does not include all elements of population, its estimate tends to differ from it.

Mode : Mode is the score that occurs most often in a frequency distribution of data.

Mean : Mean is the arithmetic average of numbers in a data set i.e, sum of numbers divided by the total.

Geometric Mean: Geometric mean (GM) is the average change of percentage, ratios, indexes or growth rate over time. If we have a set of n positive numbers, then its GM is defined as the nth root of the product of n values.

\[\text{GM} = \sqrt[n]{(x_{1})(x_{2}) \cdots (x_{n})}\]

It can also be used to get Rate of Increase Over Time. In this case, GM is calculated as:

\[\text{GM} = \sqrt[n]{\frac{\text{Value at the end of period}}{\text{Value at start of period}}} -1\]

Median : Median is the middle score found by arranging a set of numbers from the smallest to the largest (or from largest to smallest). If even number in data point, then median is the average of two middle values.

Range :Range is the difference between the smallest and the largest data value.

Quartiles, Deciles, Percentiles: Quartiles divide a set of observations into 4 equal parts. Deciles divide them into 10 equal parts, whereas Percentiles divide observations into 100 equal parts.

\[\text{Location of a Percentile}, L_{p} = (n+1)\frac{P}{100}\]

Variance : is the average squared deviation of the data values from their mean.

\[\text{variance} (σ ^{2}) = \frac{\sum\left\{x - μ \right\}^2}{N}\]

where \(x\) is an individual data value, \(μ\) is the mean of all data values, and \(N\) is the number of data values.

Covariance

Covariance between two random vairable, \(x\) and \(y\) measures how two variables are related. Positive covariance means the two variables are positively related, and they move in the same direction.

Negative covariance means that the variables are inversely related, or that they move in opposite directions.

\[\text{COV}(x,y) = \frac{\sum_{i=1}^{n} (x -\mu_{x})(y-\mu_{y})}{n-1}\]

where \(\mu_{x}\) is mean of \(x\) and \(μ_{y}\) is mean of \(y\) and \(n\) is the total number of data values.

Standard Deviation

Standard Deviation is the square root of the variance, and can be defined as:

\[\text{Std dev.}=\sqrt{\text{variance}} \text{or} \sqrt{\sigma^{2}} = \sqrt\frac{\sum\left\{x - \mu\right\}^{2}}{N}\]

Standard Deviation provides a measure in standard units of how far the data values fall from the sample mean. In a Normal Distribution, \(68%\) of the data values fall apprx one standard deviation $(1 SD)$ on either side of the mean. $95%$ fall two standard deviation $(2 SD)$ on either side of the mean, and $99%$ of the data values fall approximately three standard deviation $(3 SD)$ on either side of the mean.

Normal Distribution

Normal or Gaussian Distribution is a probability distribution of data symmetric about the mean, (and resembles bell-curve) implying data near the mean are more frequent in occurrence than data far from the mean. Coefficient of skewness and Coefficient of Kurtosis are 0 for normal distribution. It can represented as below:

\[f(x) = \frac{1}{σ \sqrt{2\pi}}e^{-\frac{1}{2}(\frac{x- \mu}{\sigma})^2}\]

Tchebysheff Inequality Theorem

It helps determine proportion of observation expected within a certain number of standard deviation from the mean, even if data is not normally distributed

Given a number \(k \ge 1\), and set of \(n\) measurements, at least, \(1 -\frac{1}{k^{2}}\) of the measurements will lie within \(k\) standard deviation of their mean, and following deduction can be made :

Lower limit = mean - \(k × \text{std.dev}\)

Upper limit = mean - \(k \times \text{std.dev}\)

(Standardized) Moments

Moments are a set of parameters to measure a distribution. Four moments are:

1st moment -> Mean

2nd Moment -> Variance

3rd Moment - Skewness

4th Moment -> Kurtosis

3rd Moment -> Skewness :

It is a measure of the asymmetry of the probability distribution of data and defined as:

\[γ = \frac{1}{N \sigma^{3}}\sum_{i=1}^{n} (x_{i}- \mu)^{3}\]

Generally, If \(\text{Mean} \gt \text{Mode}\), the skewness is positive having tail on the right side, and if \(\text{Mean} \lt \text{Mode}\), the skewness is negative, with tail on the left side of the distribution.

SKEW

Pearson Skewness Coefficient (based on Mode) is defined as: \(\displaystyle \frac{\text{Mean} - \text{Mode}}{\text{stddev}}\) and based on Median, it is \(\displaystyle \frac {3 (Mean - Median)}{\text{stddev}}\)

Kurtosis

It refers to the degree of peakedness of a frequency curve. It tell how tall and sharp the cenral peak is, relative to a standard bell curve of normal distribution. Kurtosis can be described in the following ways:

  • Playkurtic : When \(\text{Kurtosis} \lt 0\). The curve is more flat and wide.

  • Leptokurtic : When \(\text{Kurtosis} \gt 0\). The curve is more peaked.

  • Mesokurtic : When the \(\text{Kurtosis} = 0\) (normal in shape)

KURT

Correlation

It measures the interdependence between two variables and illustrate how closely two variables move together. Correlation value range between \(-1.0\) and \(1.0\). A correlation value of \(-1.0\) represents negative correlation between said variables and they move in opposite direction. A correlation value of \(0\) means no linear relationship at all. A perfect positive correlation value is 1. below given image is to illustrate it.

CorrCoeff

Pearson Correlation Coefficient

It is between two linearly related variables, and required three assumption to be true i.e, 1. interval or ratio level, 2. Bivariable normally distributed 3. Linearly related.
For sample, it is defined as :

\[\rho=\frac{\sum(x_{i} - \bar{x})(y_{i}- \bar{y})} {\sqrt{\sum(x_{i} - \bar{x})^{2} \sum(y_{i} - \bar{y})^2}}\]

where \(ρ =\) correlation coefficient,

\(x_{i} =\)values of the x-variable in sample

\(\bar{x}=\) mean of the values of the x-variable

\(y_{i} =\) valus of the y-variable in a sample

\(\bar{y}=\) mean of the values of the y-variable

For a population, correlation coefficient is defined as:

\[\rho_(X,Y)=\frac{\text{cov}(X,Y)}{\sigma_{X} \sigma_{Y}}\]

where \(\text{cov}(x,y) =\) covariance between $x$ and $y$$

\(\sigma_{x} =\) standard deviation of \(X\) \(\sigma_{Y} =\) standard deviation of \(Y\)

Spearman’s Rank Correlation Coefficient

It requires two assumption to be true i.e, 1. interval or ratio level or ordinal(categorical data) 2. monotonically related. A monotonic function is one that either never increases or never decreases as its independent variable increases. Below given image is to demonstrate that

Monotonic

Spearman’s Rank Correlation is represented by \(r_{s}\) and constrained as \(-1 \le r_{s} \le 1\). For a sample size of \(n\), the \(n\) raw scores \(X_{i}, Y_{i}\) are converted to ranks \(R(X_{i}), R(Y_{i})\) and \(r_{s}\) then computed as:

\[r_{s} = \rho_{R(X),R(Y)} = \frac{cov(R(X),R(Y))}{\sigma_(R(x)) \sigma_(R(Y))}\]

where \(\rho\) denotes Pearson Correlation Coefficient of rank variables, \(cov(R(X),R(Y))\) is covariance of rank variables

\(\sigma_{R(X)}\), and \(\sigma_{R(Y)}\) are standard deviations of rank variables.

If all \(n\) ranks are distinct integers, it can be computed as:

\[r_{s} = 1 - \frac{6 ∑ d_{i}^2}{n(n^2 - 1)}\]

where \(d_{i} = R(X_{i}) - R(Y_{i})\) is the difference between the two ranks of each obsevation, and \(n\) is the number of observations.

Z-score

It is the number of standard deviation by which the value of a raw score is above or below the mean. Raw score above mean value positive z-score (standard score) and those below mean value have negative z-score (standard score). It is defined as :

\[z = \frac{x - \mu}{\sigma}\]

where \(\mu\) is the mean of the population and \(\sigma\) is Standard Deviation of the population. And, for sample,

\[z = \frac{ x - \bar{x}}{S}\]

where \(\bar{x}\) is the mean of the sample, and \(S\) is the standard deviation of the sample.