Glossary of Terms

I will be using this page as a gradually built-up glossary that you can use whenever you need to look up a term quickly. Simply use Ctrl+F to find the term you're looking for. If a term is not here, feel free to suggest new terms by email to barry@discoveringdata.org.

Because I am not an outright expert in all of the terms, there will usually be a link to more information elsewhere (Wikipedia, more often than not). If you spot a discrepancy in the definitions, feel free to drop a line to the email address above.

This list is not and, likely, never will be exhaustive.

Average
The average value of a one-dimensional collection of numeric data is most usually thought of as the sum of the terms divided by the number of terms used in the sum. For example, the average for the collection of numbers 2, 5 and 8 is:
\frac{2 + 5 + 8}{3} = \frac{15}{3} = 5
This average is also referred to as the arithmetic mean. For the purposes of this site the mean is a term with the same meaning as average as described here.
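As a rough sketch, the calculation above can be written in Python. The function name average is my own choice for illustration and does not come from any library.

```python
def average(values):
    """Arithmetic mean: the sum of the terms divided by the number of terms."""
    if not values:
        raise ValueError("the average of an empty collection is undefined")
    return sum(values) / len(values)


print(average([2, 5, 8]))  # 15 / 3 = 5.0
```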
Average, Mean, Median, Mode
The terms average, mean, median and mode are related: each is a way to identify a single value that represents the centre of an entire collection. Each will usually yield a value near the centre of the collection, with the mode being the exception in certain skewed collections. Further solidifying the relationship, when a data set, once modeled, follows a normal distribution, the mean, median and mode will all be the same value.
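To illustrate, here is a small sketch using Python's standard statistics module. The two data sets are made up for the example: one is symmetric around its centre (it is not a true normal distribution, only shaped a little like one) and the other is skewed to the right.

```python
import statistics

# Invented example data: one symmetric set and one right-skewed set.
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]
skewed = [1, 1, 1, 2, 3, 4, 12]

for name, data in (("symmetric", symmetric), ("skewed", skewed)):
    print(
        name,
        "mean:", statistics.mean(data),
        "median:", statistics.median(data),
        "mode:", statistics.mode(data),
    )
```

For the symmetric set all three measures agree on 3. For the skewed set they separate, with the mode staying at the low end and the mean pulled toward the large value.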
Gauss
German mathematician and physicist. Full name Johann Carl Friedrich Gauss. Gauss is said to be ranked among history’s most influential mathematicians.
Mean
Mean is a different name for average as far as this site is concerned. There is another accepted way to think of the mean: the lowest value added to the highest value, with that sum then divided by 2. For the purposes of this site we will refer to that as the midpoint. Statistically speaking, the mean is also referred to as the expected value.
Median
The median is the very centre term of a collection of numbers when they are sorted. If the collection has an even number of terms, the median is calculated as the average of the centre two terms.
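A minimal sketch of that rule in Python follows. The function name median is chosen for illustration (the standard library offers statistics.median for real use).

```python
def median(values):
    """Centre term of the sorted values; average of the two centre terms if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty collection is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # odd count: a single centre term
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the centre two


print(median([8, 2, 5]))      # sorted: 2, 5, 8 -> 5
print(median([11, 2, 8, 5]))  # sorted: 2, 5, 8, 11 -> (5 + 8) / 2 = 6.5
```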
Midpoint
The midpoint of a list of values is the lowest value added to the highest value, with that sum then divided by 2.
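As a short sketch in Python (the function name midpoint is my own, used only for illustration):

```python
def midpoint(values):
    """Lowest value plus highest value, divided by 2."""
    return (min(values) + max(values)) / 2


print(midpoint([2, 5, 8]))   # (2 + 8) / 2 = 5.0
print(midpoint([1, 2, 12]))  # (1 + 12) / 2 = 6.5
```

The second example shows that the midpoint depends only on the two extreme values, so it can sit far from the average (which is 5 for that collection).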
Mode
The mode of a collection of numbers is the value that appears the most times in the collection. If every number appears only once then there is no mode for the collection. Where two terms occur an equal number of times, and more often than any other term, the collection is said to be bimodal. Where that happens for more than two terms the collection is said to be multimodal.
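The sketch below follows the definition above, including the convention that a collection with no repeated value has no mode. The function name modes is hypothetical, and returning a list is my own design choice so that bimodal and multimodal collections can be reported in full.

```python
from collections import Counter


def modes(values):
    """Return every value tied for the highest count, or [] if nothing repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # every value appears once: no mode, per the definition above
    return sorted(value for value, count in counts.items() if count == highest)


print(modes([1, 2, 2, 3]))     # [2]
print(modes([1, 1, 2, 2, 3]))  # [1, 2] (bimodal)
print(modes([1, 2, 3]))        # [] (no mode)
```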
Quartiles
Quartiles divide a sorted data set into four parts. There are three quartiles, each marking the separator between two adjacent parts: the lower quartile, the median and the upper quartile. In order to identify the lower and upper quartiles you must start by identifying the median. The set is then divided in two, with the lower half being all the numbers from the start of the set up to, but not including, the median. The upper half is all the numbers starting from, but not including, the median to the end of the set. The median is then identified for the lower half and the upper half of the set. The median for the lower half becomes the lower quartile. The median for the upper half becomes the upper quartile.
Note: This is not the only method of identifying quartiles. See: Computing Methods.
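The method described above can be sketched in Python as follows. The function names are my own, and I am assuming that for a set with an even number of terms the two halves are simply the first and last half of the sorted set, since the median then falls between two terms. Other computing methods will give different results for some sets.

```python
def _median_sorted(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (lower quartile, median, upper quartile), excluding the median from both halves."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("at least two values are needed to form two halves")
    half = n // 2
    lower = ordered[:half]
    # For an odd count, skip the centre term so the median is in neither half.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return _median_sorted(lower), _median_sorted(ordered), _median_sorted(upper)


print(quartiles([1, 2, 3, 4, 5, 6, 7]))     # (2, 4, 6)
print(quartiles([1, 2, 3, 4, 5, 6, 7, 8]))  # (2.5, 4.5, 6.5)
```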
Range
The range of a data set is the difference between the highest and lowest values of the data set. For a set x, the range is max(x) – min(x).
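In Python this is a one-line calculation. The name value_range is hypothetical, chosen to avoid clashing with Python's built-in range.

```python
def value_range(values):
    """Difference between the highest and lowest values."""
    return max(values) - min(values)


print(value_range([2, 5, 8]))  # 8 - 2 = 6
```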
Standard Deviation
The standard deviation is a measure of the amount of variation in the data. A low value for standard deviation indicates that the data is gathered close around the mean, with the contrary being the case for a high valued standard deviation, indicating a wider spread of values. The standard deviation is represented in formulae by the Greek letter sigma (\sigma) (for the population standard deviation) and as ‘s’ (for the sample standard deviation, a sample being a subset of an entire population). The standard deviation is the square root of the variance in the data.
The equation for the standard deviation of an entire population is: \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2},
where \{x_1, x_2, \ldots, x_N\} are the observed values of the data set items, \bar{x} is the mean value of these observations, and N is the number of observations in the population.
Where the standard deviation is being calculated for a sample of the population, the equation is modified slightly: N-1 is used as the divisor instead of N.
The sample standard deviation is calculated using the formula: s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2}, where \{x_1, x_2, \ldots, x_N\} are the observed values of the sample items, \bar{x} is the mean value of these observations, and N is the number of observations in the sample.
(Source: Wikipedia Standard Deviation article)
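The two formulas can be sketched in Python as below. The function names and the example data are my own, chosen so the arithmetic is easy to follow by hand; for real work the standard library's statistics.pstdev and statistics.stdev do the same job.

```python
import math


def population_std(values):
    """Population standard deviation: divide the summed squared deviations by N."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / n)


def sample_std(values):
    """Sample standard deviation: divide the summed squared deviations by N - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("a sample needs at least two values")
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))


# Invented example data: mean is 5 and the squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_std(data))  # sqrt(32 / 8) = 2.0
print(sample_std(data))      # sqrt(32 / 7), a little larger than 2
```

Dividing by N-1 rather than N always gives the slightly larger value, and the gap between the two shrinks as the number of observations grows.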
Variance
“Variance is the expectation of the squared deviation of a random variable from its mean. Informally, it measures how far a set of (random) numbers are spread out from their average value.” (Source: Wikipedia)
It is represented by standard deviation squared: {\sigma^2} for a population, or {s^2} for a sample.