Descriptive Statistics: Measure of Central Tendency, Variation, and Position

Measurements in descriptive statistics and how to calculate them using Python

7 min read · Jul 30, 2022

Photo by Nataliya Vaitkevich from Pexels

Introduction

Data is the core of statistics. The observational data we collect must be analyzed before it can be used, and when a dataset is very large, summarizing it is an essential first step. A good summary makes it much easier to analyze the data and extract insights from it.

Descriptive statistics are essential because they let us summarize data into numbers and graphs. More fully, descriptive statistics is a set of methods for collecting, processing (summarizing and presenting), describing, and analyzing data. Its main purpose is to communicate data in the form of information and to support reasoning about the data.

There are three key types of measurement in descriptive statistics: measures of central tendency, measures of variation, and measures of position. In this article, we’ll dig into each of them and calculate them using Python.

You can access the completed code we use here

The Measure of Central Tendency

The measure of central tendency is a value that represents the center of a data set. In statistics, there are three common ways to measure central tendency: the mean (average), the median (middle value), and the mode (the value that occurs most frequently).

Type of distribution in statistics — Source: https://www.quora.com/What-does-SKEWED-DISTRIBUTION-mean

Mean

The mean or average is the sum of all values in the dataset divided by the number of values in the dataset. In general, when we talk about the mean, we refer to the arithmetic mean. The mean for a population and for a sample is calculated in the same way.

Calculating arithmetic mean using Python (Image by author).
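A minimal sketch of the arithmetic mean, assuming a small made-up dataset (the values below are illustrative and not from the original notebook). It compares the manual calculation with `statistics.mean` from the standard library:

```python
import statistics

# Hypothetical sample data, used only for illustration.
data = [4, 8, 15, 16, 23, 42]

# Manual calculation: sum of all values divided by the number of values.
manual_mean = sum(data) / len(data)

# The standard library computes the same quantity.
library_mean = statistics.mean(data)

print(manual_mean)   # 18.0
print(library_mean)  # 18
```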

Additionally, the weighted mean is a generalization of the arithmetic mean. In the weighted mean, each value has a certain weight, so to calculate it we multiply each value by its respective weight, sum the products, and divide by the sum of the weights. When all weights are equal, the weighted mean reduces to the arithmetic mean.

The formula for calculating the weighted mean (Image by author)

Calculating weighted mean using Python (Image by author).
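The weighted mean can be sketched in plain Python as follows. The scores and weights are hypothetical, chosen only to make the arithmetic easy to follow:

```python
# Hypothetical exam scores and their weights (illustrative values only).
values = [80, 90, 70]
weights = [2, 3, 5]

# Multiply each value by its weight and add up the products.
weighted_sum = sum(v * w for v, w in zip(values, weights))

# Divide by the sum of the weights to get the weighted mean.
weighted_mean = weighted_sum / sum(weights)

print(weighted_mean)  # 78.0
```

Because the score of 70 carries the largest weight, the result is pulled below the unweighted average of 80.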

Then, there is also the geometric mean. The geometric mean is calculated by multiplying all the values in the dataset and then taking the n-th root of the product, where n is the number of values in the dataset. It is typically used for positive values such as growth rates.

The formula for calculating the geometric mean (Image by author)

Calculating geometric mean using Python (Image by author).
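A short sketch of the geometric mean, assuming Python 3.8 or later (for `math.prod` and `statistics.geometric_mean`) and a hypothetical list of positive values:

```python
import math
import statistics

# Hypothetical positive values (illustrative only).
data = [1, 3, 9]

# Manual calculation: n-th root of the product of the n values.
product = math.prod(data)
manual_geometric_mean = product ** (1 / len(data))

# Standard-library equivalent.
library_geometric_mean = statistics.geometric_mean(data)

# Both results equal 3 for this data, up to floating-point rounding.
print(manual_geometric_mean, library_geometric_mean)
```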

In addition, there is the harmonic mean. The harmonic mean is calculated by dividing the number of values in the dataset by the sum of the reciprocals of each value in the dataset.

The formula for calculating the harmonic mean (Image by author)

Calculating harmonic mean using Python (Image by author)
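As a sketch, the harmonic mean can be computed by hand or with `statistics.harmonic_mean`. The two speeds below are a hypothetical example (a trip driven at 40 km/h one way and 60 km/h back over the same distance):

```python
import statistics

# Hypothetical speeds in km/h over two legs of equal distance.
speeds = [40, 60]

# Manual calculation: number of values divided by the sum of reciprocals.
manual_harmonic_mean = len(speeds) / sum(1 / s for s in speeds)

# Standard-library equivalent.
library_harmonic_mean = statistics.harmonic_mean(speeds)

# Both are 48 (up to floating-point rounding), lower than the
# arithmetic mean of 50.
print(manual_harmonic_mean, library_harmonic_mean)
```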

Median

The median is the middle value of a data set. To find the median, we must first sort all the values in the dataset (from the smallest to the largest), then look for the midpoint or middle value of the dataset. If the number of values in the dataset is even, the median is the average of the two middle values.

Calculating median using Python (Image by author)
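The sort-then-pick-the-middle procedure can be sketched like this. `manual_median` is a hypothetical helper written for illustration; `statistics.median` is the standard-library function it is checked against:

```python
import statistics


def manual_median(values):
    """Return the median by sorting and picking the middle value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value.
        return ordered[mid]
    # Even count: average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical datasets with an odd and an even number of values.
odd_data = [7, 1, 5, 3, 9]
even_data = [7, 1, 5, 3]

print(manual_median(odd_data), statistics.median(odd_data))    # 5 5
print(manual_median(even_data), statistics.median(even_data))  # 4.0 4.0
```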

Mode

The mode is the value that occurs most frequently in a dataset. When every value in the dataset occurs with the same frequency, there is no mode. Meanwhile, if two values share the highest frequency of occurrence, the dataset is called bimodal.

Calculating mode using Python (Image by author)
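A sketch of the three cases described above, assuming Python 3.8 or later for `statistics.multimode`. The helper `find_modes` is hypothetical. It encodes the "no mode when all frequencies are equal" convention used in this article, which not every textbook follows:

```python
import statistics


def find_modes(values):
    """Return the most frequent values, or [] when all values tie."""
    modes = statistics.multimode(values)
    # multimode returns every distinct value when all frequencies are equal,
    # which this article treats as "no mode".
    if len(modes) > 1 and len(modes) == len(set(values)):
        return []
    return modes


# Hypothetical datasets, one per case.
single_mode = [1, 2, 2, 3]
bimodal = [1, 2, 2, 3, 3, 4]
uniform = [1, 2, 3]

print(find_modes(single_mode))  # [2]
print(find_modes(bimodal))      # [2, 3]
print(find_modes(uniform))      # []
```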

The Measure of Variation

The measure of variation or dispersion is a value that represents how spread out the data is. With this measure, we can determine how the data spreads from the smallest to the largest value, or how far the data lies from the center of the overall distribution. When the measure of variation is zero, all values in the data are identical.

Range

The range is the difference between the largest value and the smallest value in the dataset. However, the range has a disadvantage: it uses only two values, so a single extreme value can change it dramatically.

Calculating range using Python (Image by author)
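A minimal sketch of the range, using hypothetical data, that also shows how strongly a single extreme value affects it:

```python
# Hypothetical data (illustrative only).
data = [4, 8, 15, 16, 23, 42]

# Range: largest value minus smallest value.
data_range = max(data) - min(data)

# Adding one extreme value changes the range dramatically.
data_with_outlier = data + [1000]
range_with_outlier = max(data_with_outlier) - min(data_with_outlier)

print(data_range)          # 38
print(range_with_outlier)  # 996
```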

Interquartile Range (IQR)

The interquartile range, or range between quartiles, is calculated by subtracting the first quartile (Q1) from the third quartile (Q3). Because it depends only on the middle 50% of the data, the IQR is largely unaffected by extreme values (outliers).

The formula for calculating interquartile range (Image by author)

Calculating interquartile range using Python (Image by author)
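The IQR can be sketched with `statistics.quantiles` (Python 3.8 or later). The data is hypothetical, and the function's default `"exclusive"` method is assumed. Other quartile conventions, such as NumPy's default interpolation, can give slightly different values:

```python
import statistics


def interquartile_range(values):
    """Return Q3 - Q1 using the default 'exclusive' quantile method."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


# Hypothetical data, with and without an extreme value.
data = [1, 2, 3, 4, 5, 6, 7]
data_with_outlier = [1, 2, 3, 4, 5, 6, 700]

print(interquartile_range(data))               # 4.0
print(interquartile_range(data_with_outlier))  # 4.0
```

Replacing 7 with 700 changes the range from 6 to 699, while the IQR stays the same.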

Variance

Variance is a measure of how far a set of values is spread around the mean value. It is the average of the squared differences from the mean. The primary disadvantage of variance is that it is expressed in squared units, so it no longer has the same scale as the values in the dataset. The standard deviation solves this disadvantage.

Calculating variance using Python (Image by author)
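The text does not say whether the population or the sample variance is meant, so this sketch shows both, using hypothetical data. The population variance divides by n, and the sample variance divides by n - 1:

```python
import statistics

# Hypothetical data (illustrative only); its mean is 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(data) / len(data)
squared_deviations = [(x - mean) ** 2 for x in data]

# Population variance: average of the squared deviations.
population_variance = sum(squared_deviations) / len(data)

# Sample variance: divide by n - 1 instead of n.
sample_variance = sum(squared_deviations) / (len(data) - 1)

print(population_variance, statistics.pvariance(data))  # both equal 4
print(sample_variance, statistics.variance(data))       # both about 4.57
```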

Standard deviation

The standard deviation measures the spread of the data and shows how close the data is to the mean value. To calculate it, we take the difference between each value and the mean, square each difference, average the squared differences, and finally take the square root.

Calculating standard deviation using Python (Image by author)
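The steps described above can be sketched as follows, using hypothetical data. The manual version averages over all n values, so it is the population standard deviation (`statistics.pstdev`). `statistics.stdev` gives the sample version, which divides by n - 1:

```python
import math
import statistics

# Hypothetical data (illustrative only); its mean is 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Step 1: difference between each value and the mean.
mean = sum(data) / len(data)
differences = [x - mean for x in data]

# Step 2: square each difference.
squared = [d ** 2 for d in differences]

# Step 3: average the squared differences (this is the variance).
variance = sum(squared) / len(data)

# Step 4: take the square root.
population_std = math.sqrt(variance)

print(population_std)           # 2.0
print(statistics.pstdev(data))  # 2.0
print(statistics.stdev(data))   # sample version, slightly larger
```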

The data is more dispersed from the mean value when the standard deviation is high. In contrast, if the standard deviation is small, the data distribution will be close to the mean value.

Standard deviation comparison in a normal distribution (Image by author)

This graph illustrates how the peak of the normal distribution curve decreases as the standard deviation increases. In contrast, the peak of the curve increases as the standard deviation decreases.

The Measure of Position

The measure of position is a measurement used to determine the relative position of a data point within a dataset. This measurement can indicate whether a value is high, low, or average.

Quartile

Quartiles are values that divide an ordered data set into four equal parts. There are three quartiles: Q1, Q2, and Q3. The value of the second quartile is equal to the median.

Calculating quartiles using Python (Image by author).
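A sketch of the three quartiles with `statistics.quantiles` (Python 3.8 or later), using hypothetical data and the function's default `"exclusive"` method. Other tools may use a different interpolation rule and return slightly different Q1 and Q3:

```python
import statistics

# Hypothetical data (illustrative only); quantiles() sorts it internally.
data = [3, 7, 8, 5, 12, 14, 21, 13, 18]

# n=4 splits the data into four parts and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4)

print(q1, q2, q3)               # 6.0 12.0 16.0
print(statistics.median(data))  # 12, the same as Q2
```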

Decile

Deciles are values that divide an ordered data set into 10 equal parts. The values are named the first decile (D1), the second decile (D2), and so on up to the ninth decile (D9).

Calculating deciles using Python (Image by author).
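Deciles can be sketched the same way by asking `statistics.quantiles` for 10 parts. The data is hypothetical (the integers 1 to 19) and the default `"exclusive"` method is assumed:

```python
import statistics

# Hypothetical data: the integers 1 through 19.
data = list(range(1, 20))

# n=10 returns the nine cut points D1 ... D9.
deciles = statistics.quantiles(data, n=10)

print(len(deciles))  # 9
print(deciles[0])    # D1 = 2.0
print(deciles[4])    # D5 = 10.0, equal to the median
print(deciles[8])    # D9 = 18.0
```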

Percentile

Percentiles are values that divide an ordered data set into one hundred equal parts. There are 99 percentile values: P1, P2, …, P99. Percentiles can be used to flag potential outliers. One simple rule of thumb treats a value below the 5th percentile (P5) or above the 95th percentile (P95) as a candidate outlier, although this rule always flags roughly 10% of the data, so the flagged values still need to be checked.
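A sketch of the percentile rule described above, using `statistics.quantiles` with `n=100` (Python 3.8 or later). The data is hypothetical (the integers 1 to 99), and the P5/P95 cut-offs are the rule of thumb from this article rather than a universal definition of an outlier:

```python
import statistics

# Hypothetical data: the integers 1 through 99.
data = list(range(1, 100))

# n=100 returns the 99 cut points P1 ... P99.
percentiles = statistics.quantiles(data, n=100)

# The list is zero-indexed, so P5 is at index 4 and P95 at index 94.
p5 = percentiles[4]
p95 = percentiles[94]

# Flag values outside the P5..P95 band as candidate outliers.
flagged = [x for x in data if x < p5 or x > p95]

print(p5, p95)   # 5.0 95.0
print(flagged)   # [1, 2, 3, 4, 96, 97, 98, 99]
```

Even this perfectly regular dataset has values flagged at both ends, so the flagged values are better treated as candidates to inspect than as confirmed outliers.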

Conclusion

Descriptive statistics are very useful for summarizing data. When our data is very large, summarizing it makes it much easier to understand. Measures of central tendency identify the center or midpoint of a dataset, measures of variation describe its spread, and measures of position show where a value sits relative to the rest of the dataset.

