Hands-on Statistics with NumPy in Python

Learn how to calculate measures of central tendency like mean, median, and weighted mean, and measures of spread like range, variance, and standard deviation using the NumPy module in Python.

Understanding data distribution is key to deriving correct insights while analyzing data. To analyze data distribution, we use measures of central tendency and measures of spread. NumPy provides many functions for calculating measures of central tendency, such as mean, median, and weighted mean. It also provides functions to calculate measures of spread, such as range, variance, and standard deviation.

Let’s discuss how to compute measures of central tendency and measures of spread for both 1D and 2D NumPy arrays in Python. We will also discuss handling missing values while calculating these statistical measures.

What are Measures of Central Tendency?

Measures of central tendency are statistical measures that summarize a data set by identifying the central or typical value around which other data points cluster. The three main measures of central tendency are mean, median, and mode.

Measures of central tendency help us provide a summary of the data’s distribution. These measures are used in various statistical analyses to understand the typical or central behavior of the dataset.

What are Measures of Spread?

Measures of spread, also known as measures of variability, describe the extent to which data points in a dataset differ from the central value. They provide insight into how spread out or concentrated the data is. The main measures of spread are range, variance, and standard deviation.

Measures of spread help us assess the consistency and reliability of the data by indicating how concentrated or dispersed the values are relative to the central tendency.

Calculating Mean Using NumPy Arrays in Python

Mean or average is a measure of central tendency calculated as the sum of all values in a dataset divided by the number of values. It gives a quick snapshot of the typical value in the data.

How to Calculate Mean for 1D NumPy Arrays?

For a 1D NumPy array, we can calculate the mean using the mean() function. When we pass a NumPy array to the mean() function, it returns the mean of all the elements in the array.

For example, if we have a list of marks of a student for five different subjects, we can find the mean of the marks scored by the student in all the subjects using the mean() function, as shown below:

import numpy as np
marks = np.array([80,95,88,72, 99])
mean_value = np.mean(marks)
print("The array is:", marks)
print("The mean value is:", mean_value)

The output for the code is the following:

The array is: [80 95 88 72 99] 
The mean value is: 86.8 

Missing data points are a common issue in real-world datasets. In NumPy, we represent the missing values as nan, which stands for “not a number”.
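Before computing statistics, it can help to check whether an array contains missing values at all. The following sketch, which assumes the same illustrative marks used in this article, uses np.isnan() to locate and count nan entries. Note that nan never compares equal to anything, including itself, so an equality test against np.nan does not find missing values.

```python
import numpy as np

marks = np.array([80, 95, np.nan, 72, 99])

# Boolean mask that is True wherever a value is missing
missing_mask = np.isnan(marks)

# Number of missing values in the array
missing_count = int(missing_mask.sum())

# nan is not equal to anything, including itself, so this finds nothing
equality_hits = int((marks == np.nan).sum())

print("Missing mask:", missing_mask)
print("Number of missing values:", missing_count)
print("Matches found with ==:", equality_hits)
```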

If a NumPy array contains one or more nan values, the mean() function returns nan after execution. For instance, if the student in our example misses one of the tests, the array of their marks will have one nan value. In such a case, calculating the mean of the marks using the mean() function will also give us a nan value.

import numpy as np
marks = np.array([80,95,np.nan,72, 99])
mean_value = np.mean(marks)
print("The array is:",marks)
print("The mean value is:",mean_value)

Output:

The array is: [80. 95. nan 72. 99.] 
The mean value is: nan 

NumPy can ignore the nan values and calculate the mean for the rest of the values. For this, NumPy provides us with the nanmean() function. When we pass a NumPy array to the nanmean() function, it ignores the nan values and returns the mean of the remaining values.

import numpy as np
marks = np.array([80,95,np.nan,72, 99])
mean_value = np.nanmean(marks)
print("The array is:",marks)
print("The mean value is:",mean_value)

Output:

The array is: [80. 95. nan 72. 99.] 
The mean value is: 86.5 

In addition to 1D arrays, we can also calculate the mean for 2D NumPy arrays using the mean() function. Let’s see how we can do that.

How to Calculate Mean for 2D NumPy Arrays?

A 2D NumPy array, as the name implies, has two dimensions.

  • The first dimension corresponds to the rows of the 2D array and is represented by axis 0.
  • The second dimension corresponds to the columns of the 2D array and is represented by axis 1.

If we have a 2D array containing marks for students in a class, we can calculate the mean in three different ways.

  1. We can calculate the mean of the marks scored by all the students in the class in all the subjects.
  2. We can calculate the mean of marks scored by all the students in a particular subject.
  3. We can calculate the mean of the marks scored by a particular student in all the subjects.

Suppose we have the following two-dimensional array containing marks data for six students and five subjects. In the array, each row represents the marks obtained by an individual student, and each column represents marks scored in a particular subject.

class_marks = np.array([[80,95,88,72, 99],
[77,99,83,76, 68],
[97,91,88,92, 77],
[95,99,98,92, 94],
[72,85,86,71, 89],
[81,85,89,91, 99]])

To calculate the mean of the marks scored by all the students in the class in all the subjects, we need to calculate the mean of all the values in the above 2D array. For this, we can directly pass the array to the mean() function like so:

import numpy as np
class_marks = np.array([[80,95,88,72, 99],
[77,99,83,76, 68],
[97,91,88,92, 77],
[95,99,98,92, 94],
[72,85,86,71, 89],
[81,85,89,91, 99]])
mean_value = np.mean(class_marks)
print("The mean value is:",mean_value)

Output:

The mean value is: 86.93333333333334 

Now, we want to calculate the mean of the marks obtained by a particular student in all the subjects. For this, we need to calculate the mean of the values in each row in the 2D array, as each row represents the marks scored by an individual student.

To calculate the mean of each row in a 2D array, we set the axis parameter in the mean() function to 1.

  • By default, the axis parameter in the mean() function is set to None. In this case, the mean is calculated over all the values in the array, and we get a single value as output, as shown in the previous example.
  • When the axis parameter in the mean() function is set to 1, the calculation is done across the horizontal direction. The mean is calculated for the values in each row, and we get an array as the output of the mean() function. The length of the output array is the same as the number of rows in the input array.
  • When the axis parameter is set to 0, the calculation is done in the vertical direction. The mean is calculated for the values in each column, and we get an array as the output of the mean() function. In this case, the length of the output array is the same as the number of columns in the input array.
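To make the effect of the axis parameter concrete, here is a small sketch that prints the result and its shape for each setting. The 2×3 array is a made-up example chosen only to keep the output short.

```python
import numpy as np

# Hypothetical 2x3 array: 2 rows, 3 columns
data = np.array([[1, 2, 3],
                 [4, 5, 6]])

# axis=None (default): one value for the whole array
overall = np.mean(data)
# axis=0: one value per column, so 3 values here
per_column = np.mean(data, axis=0)
# axis=1: one value per row, so 2 values here
per_row = np.mean(data, axis=1)

print("Overall mean:", overall)
print("Per-column means:", per_column, "shape:", per_column.shape)
print("Per-row means:", per_row, "shape:", per_row.shape)
```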

We can get the mean of the marks scored by each of the six students using the mean() function and the axis parameter as shown below:

import numpy as np
class_marks = np.array([[80,95,88,72, 99],
[77,99,83,76, 68],
[97,91,88,92, 77],
[95,99,98,92, 94],
[72,85,86,71, 89],
[81,85,89,91, 99]])
# Calculate the mean for each row separately
mean_values = np.mean(class_marks, axis=1)
print("The mean marks for the 6 students are:",mean_values)

Output:

The mean marks for the 6 students are: [86.8 80.6 89.  95.6 80.6 89. ] 

As there are six students, i.e., six rows in the input array, you can observe that we get six values in the output array.

If we want the mean of the marks for each subject instead of each student, we need to calculate the mean for each column. For this, we set the axis parameter to 0 in the mean() function.

import numpy as np
class_marks = np.array([[80,95,88,72, 99],
[77,99,83,76, 68],
[97,91,88,92, 77],
[95,99,98,92, 94],
[72,85,86,71, 89],
[81,85,89,91, 99]])
# Calculate the mean for each column separately
mean_values = np.mean(class_marks, axis=0)
print("The mean marks for each subject are:",mean_values)

Output:

The mean marks for each subject are: [83.66666667 92.33333333 88.66666667 82.33333333 87.66666667] 

As the input array has five subjects, i.e., five columns, we get five values in the output array.

If the 2D array passed to the mean() function contains nan values, it returns a nan value.

import numpy as np
class_marks=np.array([[80,95,np.nan,72,99],
[77,99,83,76, 68],
[97,91,88,92, 77],
[95,99,98,92, 94],
[72,85,86,71, 89],
[81,85,89,91, 99]])
mean_value=np.mean(class_marks)
print("The mean value is:",mean_value)

Output:

The mean value is: nan 

Notice that the input array in the above example contains a single nan value, and that is enough for the mean() function to return nan.

When the axis parameter is set to 1, and the input array contains nan values, we get nan values in the output only for the rows containing nan values. For instance, if the first row of the input array contains nan, the output array of the mean() function will contain nan as its first element.

Similarly, if the axis parameter is set to 0, the output array returned by the mean() function contains nan in the respective positions for the columns that contain nan values. For example, if the input array contains a nan value in the third column, the third value in the output array will be nan.

import numpy as np
class_marks = np.array([[80,95,np.nan,72, 99],
[77,99,83,76, 68],
[97,91,88,92, 77],
[95,99,98,92, 94],
[72,85,86,71, 89],
[81,85,89,91, 99]])
# Calculate mean for each row
mean_for_each_row = np.mean(class_marks, axis=1)
# Calculate mean for each column
mean_for_each_column = np.mean(class_marks, axis=0)
print("The mean for each row is:",mean_for_each_row)
print("The mean for each column is:",mean_for_each_column)

Output:

The mean for each row is: [ nan 80.6 89.  95.6 80.6 89. ] 
The mean for each column is: [83.66666667 92.33333333         nan 82.33333333 87.66666667] 

To ignore nan values in 2D arrays and calculate the mean for rows or columns having nan values, we can use the nanmean() function instead of the mean() function, like so:

import numpy as np
class_marks = np.array([[80,95,np.nan,72, 99],
[77,99,83,76, 68],
[97,91,88,92, 77],
[95,99,98,92, 94],
[72,85,86,71, 89],
[81,85,89,91, 99]])
mean_value_by_rows = np.nanmean(class_marks, axis=1)
mean_value_by_columns = np.nanmean(class_marks, axis=0)
print("The mean for each row is:",mean_value_by_rows)
print("The mean for each column is:",mean_value_by_columns)

Output:

The mean for each row is: [86.5 80.6 89.  95.6 80.6 89. ] 
The mean for each column is: [83.66666667 92.33333333 88.8        82.33333333 87.66666667] 

Note that we used the nanmean() function in the above example. Hence, the nan values are ignored while calculating the mean for a given row or column.

Calculating Median Using NumPy Arrays

The median of a dataset is the middle value of the sorted data. If we have a dataset of N numbers, we calculate the median by first sorting the numbers. If N is odd, the median is the number at position (N+1)/2. If N is even, the median is the mean of the numbers at positions N/2 and (N/2)+1.
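The sorting-based definition of the median can be sketched directly in code. The helper manual_median() below is a hypothetical function written for illustration, not part of NumPy. It sorts the values and returns the middle one for an odd count, or the mean of the two middle ones (positions N/2 and N/2 + 1, counting from 1) for an even count.

```python
import numpy as np

def manual_median(values):
    """Hypothetical helper: median by sorting, for illustration only."""
    ordered = np.sort(np.asarray(values, dtype=float))
    n = ordered.size
    middle = n // 2
    if n % 2 == 1:
        # Odd N: the single middle value (1-based position (N+1)/2)
        return ordered[middle]
    # Even N: mean of the two middle values (1-based positions N/2 and N/2 + 1)
    return (ordered[middle - 1] + ordered[middle]) / 2

odd_result = manual_median([80, 95, 88, 72, 99])
even_result = manual_median([80, 95, 72, 99])

print("Median for odd N:", odd_result)
print("Median for even N:", even_result)
```

In practice the built-in median() function is the better choice; the helper only makes the two cases visible.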

How to Calculate Median for 1D NumPy Arrays?

To calculate the median value for the elements in a 1D NumPy array, we can use the median() function like so:

import numpy as np
marks=np.array([80,95,88,72, 99])
median_value=np.median(marks)
print("The array is:",marks)
print("The median value is:",median_value)

Output:

The array is: [80 95 88 72 99] 
The median value is: 88.0 

If the input array passed to the median() function contains nan values, the output value will also be nan.

import numpy as np
marks=np.array([80,95,np.nan,72, 99])
median_value=np.median(marks)
print("The array is:",marks)
print("The median value is:",median_value)

Output:

The array is: [80. 95. nan 72. 99.] 
The median value is: nan 

To ignore the nan values and calculate the median for the rest of the values in the array, we can use the nanmedian() function, as shown below:

import numpy as np
marks = np.array([80,95,np.nan,72, 99])
median_value = np.nanmedian(marks)
print("The array is:",marks)
print("The median value is:",median_value)

Output:

The array is: [80. 95. nan 72. 99.] 
The median value is: 87.5 

How to Calculate Median for 2D NumPy Arrays?

Like the mean, we can also calculate the median for 2D arrays in three ways using the median() function. To calculate the median of all the values in a given 2D array, we pass the array to the median() function.

import numpy as np
class_marks = np.array([[80,95,88,72, 99],
[77,99,83,76, 68],
[97,91,88,92, 77],
[95,99,98,92, 94],
[72,85,86,71, 89],
[81,85,89,91, 99]])
median_value = np.median(class_marks)
print("The median value is:",median_value)

Output:

The median value is: 88.5 

To calculate the median value for each row in the input array, we can set the axis parameter to 1 in the median() function. To calculate the median for each column, we set the axis parameter to 0. You can observe this in the following example:

import numpy as np
class_marks = np.array([[80,95,88,72, 99],
[77,99,83,76, 68],
[97,91,88,92, 77],
[95,99,98,92, 94],
[72,85,86,71, 89],
[81,85,89,91, 99]])
# Set axis parameter to 1 for calculation across rows
median_value_for_each_row = np.median(class_marks, axis=1)
# Set axis parameter to 0 for calculation across columns
median_value_for_each_column = np.median(class_marks, axis=0)
print("The median value for each row is:",median_value_for_each_row)
print("The median value for each column is:",median_value_for_each_column)

Output:

The median value for each row is: [88. 77. 91. 95. 85. 89.] 
The median value for each column is: [80.5 93.  88.  83.5 91.5] 

If the 2D array contains nan values, the output of the median() function is also nan.

import numpy as np
class_marks = np.array([[80,95,np.nan,72, 99],
[77,99,83,76, 68],
[97,91,88,92, 77],
[95,99,98,92, 94],
[72,85,86,71, 89],
[81,85,89,91, 99]])
median_value = np.median(class_marks)
print("The median value is:",median_value)

Output:

The median value is: nan 

If the axis parameter is set to 1, the median is calculated for each row. In such a case, we get nan values in the output array for each row with nan values in the input 2D array. Similarly, when the axis parameter is set to 0, the median is calculated for each column. Hence, we get nan values in the output array for each column with nan values in the input 2D array.

import numpy as np
class_marks=np.array([[80,95,np.nan,72, 99],
[77,99,83,76, 68],
[97,91,88,92, 77],
[95,99,98,92, 94],
[72,85,86,71, 89],
[81,85,89,91, 99]])
median_value_for_rows = np.median(class_marks, axis=1)
median_value_for_columns = np.median(class_marks, axis=0)
print("The median value for each row is:",median_value_for_rows)
print("The median value for each column is:",median_value_for_columns)

Output:

The median value for each row is: [nan 77. 91. 95. 85. 89.] 
The median value for each column is: [80.5 93.   nan 83.5 91.5] 

In the above code, the first row in the input array contains a nan value. Due to this, the output array has nan in the first position. Also, the third column in the input array contains a nan value. Hence, the output array for the median of columns contains nan at the third position.

To ignore nan values and calculate the median for the rest of the values in the rows or columns having nan, we can use the nanmedian() function, as shown below:

import numpy as np
class_marks = np.array([[80,95,np.nan,72, 99],
[77,99,83,76, 68],
[97,91,88,92, 77],
[95,99,98,92, 94],
[72,85,86,71, 89],
[81,85,89,91, 99]])
median_value_by_rows = np.nanmedian(class_marks, axis=1)
median_value_by_columns = np.nanmedian(class_marks, axis=0)
print("The median value for each row is:",median_value_by_rows)
print("The median value for each column is:",median_value_by_columns)

Output:

The median value for each row is: [87.5 77.  91.  95.  85.  89. ] 
The median value for each column is: [80.5 93.  88.  83.5 91.5] 

In this example, you can observe that we don’t get nan values in the outputs. This is because the nan values have been ignored while calculating the median for rows or columns having such values.

Calculating Weighted Mean for NumPy Arrays

Weighted mean is a type of mean where some data points have more influence on the result than others. We calculate the weighted mean by assigning a weight to each data point, reflecting its importance in the dataset.

For example, suppose we have an array with five elements, say [80, 95, 88, 72, 99], and the weights for the elements are [1, 2, 1, 3, 2]. Then, the weighted mean of the array is calculated as follows:

(80×1)+(95×2)+(88×1)+(72×3)+(99×2)
───────────────────────────────
1+2+1+3+2
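The same arithmetic can be written out step by step. The sketch below uses only element-wise multiplication and sums, so the numerator and denominator of the fraction stay visible.

```python
import numpy as np

marks = np.array([80, 95, 88, 72, 99])
weights = np.array([1, 2, 1, 3, 2])

# Numerator: each value multiplied by its weight, then summed
weighted_sum = np.sum(marks * weights)

# Denominator: sum of the weights
total_weight = np.sum(weights)

manual_weighted_mean = weighted_sum / total_weight

print("Weighted sum:", weighted_sum)
print("Total weight:", total_weight)
print("Weighted mean:", manual_weighted_mean)
```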

How to Calculate the Weighted Mean for 1D NumPy Arrays?

We use the average() function to calculate the weighted mean for NumPy arrays. The average() function takes the array containing the elements as its first input argument and a list of the corresponding weights through its weights parameter, and returns the weighted mean of the elements in the input array.

import numpy as np
marks = np.array([80,95,88,72, 99])
weighted_mean = np.average(marks, weights=[1,2,1,3,2])
print("The array is:",marks)
print("Weighted mean is:",weighted_mean)

Output:

The array is: [80 95 88 72 99]
Weighted mean is: 85.77777777777777

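While calculating the weighted mean, the number of weights passed to the weights parameter must be the same as the number of elements in the array containing the actual dataset. Otherwise, the call fails with an exception. When the axis parameter is not specified, as in the example above, NumPy raises a TypeError. When an axis is specified and the number of weights does not match the length of that axis, NumPy raises a ValueError instead.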
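Here is a short sketch of what happens when the lengths differ. Because the exact exception type can depend on how average() is called, and possibly on the NumPy version, the example catches both TypeError and ValueError. The three-element weights list is a deliberately wrong, made-up input.

```python
import numpy as np

marks = np.array([80, 95, 88, 72, 99])
too_few_weights = [1, 2, 1]  # Only three weights for five marks

error_name = None
try:
    np.average(marks, weights=too_few_weights)
except (TypeError, ValueError) as error:
    # Both exception types are caught because the one raised may vary
    # with the call signature and the installed NumPy version.
    error_name = type(error).__name__
    print("average() failed with", error_name, ":", error)
```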

If the input array passed to the average() function contains nan values, the output becomes nan.

import numpy as np
marks = np.array([80,95,np.nan,72, 99])
weighted_mean = np.average(marks, weights=[1,2,1,3,2])
print("The array is:",marks)
print("Weighted mean is:",weighted_mean)

Output:

The array is: [80. 95. nan 72. 99.] 
Weighted mean is: nan 

How to Calculate the Weighted Mean for 2D NumPy Arrays?

We can also calculate the weighted mean for 2D NumPy arrays. To calculate the weighted mean of elements in each row in a 2D NumPy array, we must set the axis parameter in the average() function to 1 and pass a list of weights having the same number of elements as the number of columns in the array.

import numpy as np
class_marks = np.array([[80,95,88,72, 99],
[77,99,83,76, 68],
[97,91,88,92, 77],
[95,99,98,92, 94],
[72,85,86,71, 89],
[81,85,89,91, 99]])
weighted_mean = np.average(class_marks,axis=1,weights=[1,2,1,3,2])
print("Weighted mean for each row is:",weighted_mean)

Output:

Weighted mean for each row is: [85.77777778 80.22222222 88.55555556 95.         79.88888889 90.11111111] 

To calculate the weighted mean across columns in a 2D array, we set the axis parameter in the average() function to 0 and pass a list of weights with the same number of elements as the number of rows in the input array.

import numpy as np
class_marks = np.array([[80,95,88,72, 99],
[77,99,83,76, 68],
[97,91,88,92, 77],
[95,99,98,92, 94],
[72,85,86,71, 89],
[81,85,89,91, 99]])
weighted_mean = np.average(class_marks,axis=0,weights=[1,2,1,3,2,4])
print("Weighted mean for each column is:",weighted_mean)

Output:

Weighted mean for each column is: [83.38461538 91.61538462 89.53846154 84.46153846 89.84615385] 

If the 2D array passed to the average() function contains nan values, the output values corresponding to the rows or columns in which the input array contains nan will also be nan.

import numpy as np
class_marks = np.array([[80,95,np.nan,72, 99],
[77,99,83,76, 68],
[97,91,88,92, 77],
[95,99,98,92, 94],
[72,85,86,71, 89],
[81,85,89,91, 99]])
weighted_mean_for_rows = np.average(class_marks,axis=1,weights=[1,2,1,3,2])
print("Weighted mean of elements in each row is:",weighted_mean_for_rows)
weighted_mean_for_cls = np.average(class_marks,axis=0,weights=[1,2,1,3,2,4])
print("Weighted mean of elements in each column is:",weighted_mean_for_cls)

Output:

Weighted mean of elements in each row is: [        nan 80.22222222 88.55555556 95.         79.88888889 90.11111111] 
Weighted mean of elements in each column is: [83.38461538 91.61538462         nan 84.46153846 89.84615385] 

In this example, the input 2D array contains a nan value in the first row and the third column. Hence, we get a nan value at the first position in the output array while calculating the weighted mean for elements in each row. Similarly, we get a nan value at the third position in the output array while calculating the weighted mean for elements in each column.

NumPy doesn’t have a function like nanaverage() to ignore nan values and calculate the weighted mean for the rest of the values in the input arrays.
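One possible workaround, shown here as a sketch rather than an official recipe, is to mask out the nan positions ourselves and pass only the remaining values and their matching weights to average(). The variable names below are illustrative.

```python
import numpy as np

marks = np.array([80, 95, np.nan, 72, 99])
weights = np.array([1, 2, 1, 3, 2])

# Keep only the positions where a mark is present
valid = ~np.isnan(marks)

# Weighted mean over the remaining marks and their matching weights
nan_weighted_mean = np.average(marks[valid], weights=weights[valid])

print("Weighted mean ignoring nan:", nan_weighted_mean)
```

Dropping a missing mark also drops its weight, so the remaining weights are effectively renormalized. Whether that is appropriate depends on the dataset.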

We have discussed how to calculate measures of central tendency like mean, median, and weighted mean for 1D and 2D NumPy arrays. However, NumPy doesn’t provide a function for calculating one of the popular measures of central tendency, mode.
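If we still need the mode, one way to sketch it with NumPy alone is to count how often each distinct value occurs using np.unique() with return_counts=True. The marks below are made-up values with a repeated score. When several values tie for the highest count, this approach returns the smallest of them.

```python
import numpy as np

marks = np.array([80, 95, 88, 95, 72, 99, 95, 88])

# Distinct values (sorted) and how often each one occurs
values, counts = np.unique(marks, return_counts=True)

# Value with the highest count; ties resolve to the smallest value
mode_value = values[np.argmax(counts)]
mode_count = counts.max()

print("Mode:", mode_value)
print("Occurrences:", mode_count)
```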

Now, let’s discuss the measures of spread and how to calculate them using the functions provided in the NumPy module.

Calculating Range for NumPy Arrays in Python

Measures of spread describe the variability in a dataset. They help us understand the degree to which data points differ from each other and from a measure of central tendency such as the mean, median, or mode. In the following sections, we will discuss three measures of spread: range, variance, and standard deviation.

The range for a dataset is defined as the difference between the maximum and minimum values in the dataset. By definition, range is always a non-negative value.

How to Calculate the Range for 1D NumPy Arrays?

We can calculate the range for a 1D NumPy array using the ptp() function. The ptp() function takes the array as its input and returns the range of the array, as shown below:

import numpy as np
marks = np.array([80,95,88,72, 99])
range_value = np.ptp(marks)
print("The array is:",marks)
print("The range is:",range_value)

Output:

The array is: [80 95 88 72 99] 
The range is: 27 

If the 1D NumPy array passed to the ptp() function contains nan values, it returns nan.

import numpy as np
marks = np.array([80,95,np.nan,72, 99])
range_value = np.ptp(marks)
print("The array is:",marks)
print("The range is:",range_value)

Output:

The array is: [80. 95. nan 72. 99.] 
The range is: nan 

How to Calculate the Range for 2D NumPy Arrays?

When we pass a 2D NumPy array to the ptp() function, it calculates the range of the array as if the elements of the array are present in a 1D array and returns the output.

import numpy as np
class_marks = np.array([[80,95,88,72, 99],
[77,99,83,76, 68],
[97,91,88,92, 77],
[95,99,98,92, 94],
[72,85,86,71, 89],
[81,85,89,91, 99]])
range_value = np.ptp(class_marks)
print("The range is:",range_value)

Output:

The range is: 31 

In the above example, the ptp() function calculates the range by considering all the values in the array as if the array were a flattened 1D array. However, we can also calculate the range across rows and columns separately.

  • To calculate the range of values in each column in a 2D NumPy array, we can set the axis parameter to 0 in the ptp() function. In this case, the number of values in the output array equals the number of columns in the input array. Each value in the output array represents the range of the column at the same position in the input 2D array.
  • To calculate the range of values in each row in a given 2D array, we can set the axis parameter to 1 in the ptp() function. In this case, the number of values in the output array equals the number of rows in the input array. Each value in the output array represents the range of the respective row in the input 2D array.

You can observe this in the following example.

import numpy as np
class_marks = np.array([[80,95,88,72, 99],
[77,99,83,76, 68],
[97,91,88,92, 77],
[95,99,98,92, 94],
[72,85,86,71, 89],
[81,85,89,91, 99]])
range_for_each_column = np.ptp(class_marks, axis=0)
range_for_each_row = np.ptp(class_marks, axis=1)
print("The range for each column is:",range_for_each_column)
print("The range for each row is:",range_for_each_row)

Output:

The range for each column is: [25 14 15 21 31] 
The range for each row is: [27 31 20  7 18 18] 

If we pass a 2D array containing a nan value to the ptp() function and leave the axis parameter at its default value of None, it returns nan.

import numpy as np
class_marks = np.array([[80,95,np.nan,72, 99],
[77,99,83,76, 68],
[97,91,88,92, 77],
[95,99,98,92, 94],
[72,85,86,71, 89],
[81,85,89,91, 99]])
range_value = np.ptp(class_marks)
print("The range is:",range_value)

Output:

The range is: nan 

When we set the axis parameter to 0 in the ptp() function, only the values in the output array corresponding to columns having nan values in the input array are nan. Similarly, when the axis parameter is set to 1, the values in the output array corresponding to rows having nan values in the input array are nan.

import numpy as np
class_marks = np.array([[80,95,np.nan,72, 99],
[77,99,83,76, 68],
[97,91,88,92, 77],
[95,99,98,92, 94],
[72,85,86,71, 89],
[81,85,89,91, 99]])
range_value_for_columns = np.ptp(class_marks,axis=0)
range_value_for_rows = np.ptp(class_marks,axis=1)
print("The range for each row is:",range_value_for_rows)
print("The range for each column is:",range_value_for_columns)

Output:

The range for each row is: [nan 31. 20.  7. 18. 18.] 
The range for each column is: [25. 14. nan 21. 31.] 
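NumPy has no nan-ignoring variant of ptp() that we are aware of. One possible workaround, sketched below, is to subtract the result of nanmin() from the result of nanmax(), since both functions skip nan values and accept the axis parameter.

```python
import numpy as np

class_marks = np.array([[80, 95, np.nan, 72, 99],
                        [77, 99, 83, 76, 68],
                        [97, 91, 88, 92, 77],
                        [95, 99, 98, 92, 94],
                        [72, 85, 86, 71, 89],
                        [81, 85, 89, 91, 99]])

# Range = largest value minus smallest value, skipping nan entries
range_for_rows = np.nanmax(class_marks, axis=1) - np.nanmin(class_marks, axis=1)
range_for_columns = np.nanmax(class_marks, axis=0) - np.nanmin(class_marks, axis=0)

print("The range for each row is:", range_for_rows)
print("The range for each column is:", range_for_columns)
```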

Calculating Variance for NumPy Arrays in Python

The variance of a dataset is the mean of the squared differences between each data point and the mean. It measures how far the values deviate from the mean on average. We can calculate variance using the following formula:

σ² = Σ(xᵢ - μ)² / N

Here,

  • xᵢ is a data point in the dataset.
  • μ is the mean for the dataset.
  • N is the number of data points in the dataset.
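The formula can be checked by hand with a few array operations. The sketch below also shows the ddof parameter of var(): with the default ddof=0 the sum of squared differences is divided by N, as in the formula, while ddof=1 divides by N - 1, which is the usual choice when the data is a sample drawn from a larger population.

```python
import numpy as np

marks = np.array([80, 95, 88, 72, 99])

# Variance by the formula: mean of squared differences from the mean
mean_value = marks.mean()
manual_variance = np.sum((marks - mean_value) ** 2) / marks.size

# NumPy equivalents
population_variance = np.var(marks)       # divides by N (ddof=0, the default)
sample_variance = np.var(marks, ddof=1)   # divides by N - 1

print("Manual variance:", manual_variance)
print("Variance with ddof=0:", population_variance)
print("Variance with ddof=1:", sample_variance)
```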

How to Calculate Variance for 1D NumPy Arrays?

To calculate the variance for a 1D NumPy array, we can use the var() function. When we pass the array to the var() function, it returns the variance for the input array.

import numpy as np
marks = np.array([80,95,88,72, 99])
variance = np.var(marks)
print("The array is:",marks)
print("The variance is:",variance)

Output:

The array is: [80 95 88 72 99] 
The variance is: 96.55999999999999 

If the input array passed to the var() function contains nan values, the output is also nan.

import numpy as np
marks = np.array([80,95,np.nan,72, 99])
variance = np.var(marks)
print("The array is:",marks)
print("The variance is:",variance)

Output:

The array is: [80. 95. nan 72. 99.] 
The variance is: nan 

To ignore the nan values and calculate the variance for the rest of the values in the input array, we can use the nanvar() function.

import numpy as np
marks = np.array([80,95,np.nan,72, 99])
variance = np.nanvar(marks)
print("The array is:",marks)
print("The variance is:",variance)

Output:

The array is: [80. 95. nan 72. 99.] 
The variance is: 120.25 

How to Calculate Variance for 2D NumPy Arrays?

If we pass a 2D NumPy array to the var() function, it treats the 2D array as a flattened 1D array and returns the calculated variance.

import numpy as np
class_marks = np.array([[80,95,88,72, 99],
[77,99,83,76, 68],
[97,91,88,92, 77],
[95,99,98,92, 94],
[72,85,86,71, 89],
[81,85,89,91, 99]])
variance = np.var(class_marks)
print("The variance is:",variance)

Output:

The variance is: 85.46222222222221 

We can also calculate variance across rows or columns using the var() function. To calculate the variance for each row in the input array, we set the axis parameter to 1 in the var() function. To do the same for each column in the input array, we set the axis parameter to 0.

import numpy as np
class_marks = np.array([[80,95,88,72, 99],
[77,99,83,76, 68],
[97,91,88,92, 77],
[95,99,98,92, 94],
[72,85,86,71, 89],
[81,85,89,91, 99]])
variance_for_each_row = np.var(class_marks, axis=1)
variance_for_each_column = np.var(class_marks, axis=0)
print("The variance for each row is:",variance_for_each_row)
print("The variance for each column is:",variance_for_each_column)

Output:

The variance for each row is: [ 96.56 107.44  44.4    6.64  57.04  36.8 ] 
The variance for each column is: [ 84.55555556  34.22222222  21.22222222  89.55555556 133.22222222] 

If the axis parameter in the var() function is set to 1 and the input array contains nan values, we get nan values in the output for the rows that contain nan values. For instance, if the first row of the input array contains nan, the output of the var() function will also contain nan as its first element.

Similarly, if the axis parameter is set to 0, the output array returned by the var() function contains nan in the positions corresponding to the columns that contain nan values in the input array. For instance, if the input array contains a nan value in the third column, the third value in the output array will be nan.

import numpy as np
class_marks = np.array([[80,95,np.nan,72, 99],
[77,99,83,76, 68],
[97,91,88,92, 77],
[95,99,98,92, 94],
[72,85,86,71, 89],
[81,85,89,91, 99]])
variance_for_each_row = np.var(class_marks, axis=1)
variance_for_each_column = np.var(class_marks, axis=0)
print("The variance for each row is:",variance_for_each_row)
print("The variance for each column is:",variance_for_each_column)

Output:

The variance for each row is: [  nan 107.44  44.4    6.64  57.04  36.8 ] 
The variance for each column is: [ 84.55555556  34.22222222          nan  89.55555556 133.22222222] 

To avoid nan values and calculate variance for the rest of the values in the rows or columns having nan values, we can use the nanvar() function.

import numpy as np
class_marks = np.array([[80,95,np.nan,72, 99],
[77,99,83,76, 68],
[97,91,88,92, 77],
[95,99,98,92, 94],
[72,85,86,71, 89],
[81,85,89,91, 99]])
variance_for_each_row = np.nanvar(class_marks, axis=1)
variance_for_each_column = np.nanvar(class_marks, axis=0)
print("The variance for each row is:",variance_for_each_row)
print("The variance for each column is:",variance_for_each_column)

Output:

The variance for each row is: [120.25 107.44  44.4    6.64  57.04  36.8 ] 
The variance for each column is: [ 84.55555556  34.22222222  25.36        89.55555556 133.22222222] 

Calculating Standard Deviation for NumPy Arrays

Standard deviation is the square root of variance. It measures spread in the same units as the original data. We can calculate the standard deviation for a given dataset using the following formula:

σ = √( Σ(xᵢ - μ)² / N )

Here,

  • xᵢ is a data point in the dataset.
  • μ is the mean for the dataset.
  • N is the number of data points in the dataset.
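As a quick sanity check of the relationship between the two measures, the following sketch takes the square root of the variance and compares it with the standard deviation computed by NumPy. It also shows the ddof=1 setting, which divides by N - 1 instead of N.

```python
import numpy as np

marks = np.array([80, 95, 88, 72, 99])

# Standard deviation as the square root of the variance
std_from_variance = np.sqrt(np.var(marks))

# Standard deviation computed directly by NumPy (divides by N by default)
numpy_std = np.std(marks)

# Sample standard deviation (divides by N - 1)
sample_std = np.std(marks, ddof=1)

print("Square root of variance:", std_from_variance)
print("Standard deviation:", numpy_std)
print("Standard deviation with ddof=1:", sample_std)
```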

How to Calculate Standard Deviation for 1D NumPy Arrays?

To calculate the standard deviation for a 1D NumPy array, we can use the std() function, as shown below:

import numpy as np
marks = np.array([80,95,88,72, 99])
standard_deviation = np.std(marks)
print("The array is:",marks)
print("The standard deviation is:",standard_deviation)

Output:

The array is: [80 95 88 72 99] 
The standard deviation is: 9.826494797230596 

If the input array contains nan values, the std() function will also return nan.

import numpy as np
marks = np.array([80,95,np.nan,72, 99])
standard_deviation = np.std(marks)
print("The array is:",marks)
print("The standard deviation is:",standard_deviation)

Output:

The array is: [80. 95. nan 72. 99.] 
The standard deviation is: nan 

To ignore the nan values and calculate the standard deviation for the rest of the values in the input array, we can use the nanstd() function. The nanstd() function works like the std() function, except that it ignores the nan values while calculating the standard deviation.

import numpy as np
marks = np.array([80,95,np.nan,72, 99])
standard_deviation = np.nanstd(marks)
print("The array is:",marks)
print("The standard deviation is:",standard_deviation)

Output:

The array is: [80. 95. nan 72. 99.] 
The standard deviation is: 10.965856099730654 
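
One way to see what nanstd() is doing is to remove the nan values ourselves and then call std() on what is left. The sketch below assumes the same marks array as above and uses np.isnan() to build the filter; it is meant as an illustration of the behavior, not a description of how nanstd() is implemented internally.

```python
import numpy as np

marks = np.array([80, 95, np.nan, 72, 99])

# Keep only the entries that are not nan.
valid_marks = marks[~np.isnan(marks)]

std_after_filtering = np.std(valid_marks)
std_with_nanstd = np.nanstd(marks)

print("Values used:", valid_marks)
print("std() on filtered values:", std_after_filtering)
print("nanstd() on original array:", std_with_nanstd)
```

Both approaches use the four remaining values, so N in the formula becomes 4 rather than 5.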

Calculate Standard Deviation for 2D NumPy Arrays

We can also calculate the standard deviation for 2D arrays using the std() function. When we pass a 2D array to the std() function without any other arguments, it treats the 2D array as a flattened 1D array and returns a single standard deviation computed over all of its elements.

import numpy as np
class_marks = np.array([[80,95,88,72, 99],
[77,99,83,76, 68],
[97,91,88,92, 77],
[95,99,98,92, 94],
[72,85,86,71, 89],
[81,85,89,91, 99]])
standard_deviation = np.std(class_marks)
print("The standard deviation is:",standard_deviation)

Output:

The standard deviation is: 9.244577990488382 

We can use the axis parameter to calculate standard deviation across rows or columns using the std() function. To calculate the standard deviation for each row in the input array, we set the axis parameter to 1. To calculate the standard deviation for each column in the input array, we set the axis parameter to 0.

import numpy as np
class_marks = np.array([[80,95,88,72, 99],
[77,99,83,76, 68],
[97,91,88,92, 77],
[95,99,98,92, 94],
[72,85,86,71, 89],
[81,85,89,91, 99]])
std_for_each_row = np.std(class_marks, axis=1)
std_for_each_column = np.std(class_marks, axis=0)
print("The standard deviation for each row is:",std_for_each_row)
print("The standard deviation for each column is:",std_for_each_column)

Output:

The standard deviation for each row is: [ 9.8264948  10.36532682  6.6633325   2.57681975  7.55248304  6.06630036] 
The standard deviation for each column is: [ 9.19540948  5.84997626  4.60675832  9.46337971 11.54219313] 
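
To make the meaning of the axis parameter more concrete, the following sketch compares the axis-based results with explicit loops over rows and columns. The loops are shown only for illustration and the variable names are our own; in practice, the axis parameter is the more concise option.

```python
import numpy as np

class_marks = np.array([[80, 95, 88, 72, 99],
                        [77, 99, 83, 76, 68],
                        [97, 91, 88, 92, 77],
                        [95, 99, 98, 92, 94],
                        [72, 85, 86, 71, 89],
                        [81, 85, 89, 91, 99]])

# axis=1 collapses the columns, leaving one value per row.
std_rows = np.std(class_marks, axis=1)
# axis=0 collapses the rows, leaving one value per column.
std_columns = np.std(class_marks, axis=0)

# Equivalent explicit loops, shown only to illustrate what axis does.
std_rows_loop = np.array([np.std(row) for row in class_marks])
std_columns_loop = np.array([np.std(column) for column in class_marks.T])

print("Output shapes:", std_rows.shape, std_columns.shape)
print("Rows match:", np.allclose(std_rows, std_rows_loop))
print("Columns match:", np.allclose(std_columns, std_columns_loop))
```

The input has 6 rows and 5 columns, so axis=1 yields 6 values and axis=0 yields 5.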

If the input 2D array contains nan values and the axis parameter is set to 1 in the std() function, we get nan in the output array for the rows that contain nan values. For example, if the first row of the input array contains nan, the output array of the std() function will also contain nan as its first element.

Similarly, if the axis parameter is set to 0, the output array returned by the std() function contains nan in the respective positions for the columns that contain nan values in the input. For instance, if the input array contains a nan value in the third column, the third value in the output array will be nan.

import numpy as np
class_marks = np.array([[80,95,np.nan,72, 99],
[77,99,83,76, 68],
[97,91,88,92, 77],
[95,99,98,92, 94],
[72,85,86,71, 89],
[81,85,89,91, 99]])
standard_deviation_for_rows = np.std(class_marks, axis=1)
standard_deviation_for_columns = np.std(class_marks, axis=0)
print("The standard deviation for each row is:",standard_deviation_for_rows)
print("The standard deviation for each column is:",standard_deviation_for_columns)

Output:

The standard deviation for each row is: [        nan 10.36532682  6.6633325   2.57681975  7.55248304  6.06630036] 
The standard deviation for each column is: [ 9.19540948  5.84997626         nan  9.46337971 11.54219313] 

To ignore nan values and calculate the standard deviation for the rest of the values in the rows or columns having nan values, we can use the nanstd() function, as shown below:

import numpy as np
class_marks = np.array([[80,95,np.nan,72, 99],
[77,99,83,76, 68],
[97,91,88,92, 77],
[95,99,98,92, 94],
[72,85,86,71, 89],
[81,85,89,91, 99]])
std_for_each_row = np.nanstd(class_marks, axis=1)
std_for_each_column = np.nanstd(class_marks, axis=0)
print("The standard deviation for each row is:",std_for_each_row)
print("The standard deviation for each column is:",std_for_each_column)

Output:

The standard deviation for each row is: [10.9658561  10.36532682  6.6633325   2.57681975  7.55248304  6.06630036] 
The standard deviation for each column is: [ 9.19540948  5.84997626  5.03587132  9.46337971 11.54219313] 

Conclusion

Understanding and calculating statistical measures like mean, median, weighted mean, range, variance, and standard deviation is essential for data analysis. The following are the key concepts that we discussed:

  1. We use the mean() and nanmean() functions to calculate the mean of the elements in a NumPy array.
  2. To calculate the median, we use the median() and nanmedian() functions.
  3. To calculate the weighted mean, we use the average() function.
  4. We use the ptp() function to calculate the range of a NumPy array.
  5. We use the var() and nanvar() functions to calculate the variance.
  6. To calculate the standard deviation for elements in a NumPy array, we use the std() and nanstd() functions.
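
The functions listed above can be seen side by side in one short recap. This is a minimal sketch, assuming the marks array used throughout the standard deviation examples; the weights array is hypothetical and chosen only to demonstrate average().

```python
import numpy as np

marks = np.array([80, 95, 88, 72, 99])
marks_with_nan = np.array([80, 95, np.nan, 72, 99])
weights = np.array([1, 2, 3, 2, 1])  # hypothetical weights, for illustration only

mean_value = np.mean(marks)
median_value = np.median(marks)
weighted_mean = np.average(marks, weights=weights)
value_range = np.ptp(marks)          # maximum minus minimum
variance = np.var(marks)
standard_deviation = np.std(marks)

# nan-aware counterparts skip the missing value instead of returning nan.
nan_mean = np.nanmean(marks_with_nan)
nan_median = np.nanmedian(marks_with_nan)
nan_variance = np.nanvar(marks_with_nan)
nan_standard_deviation = np.nanstd(marks_with_nan)

print("Mean:", mean_value, "| ignoring nan:", nan_mean)
print("Median:", median_value, "| ignoring nan:", nan_median)
print("Weighted mean:", weighted_mean)
print("Range:", value_range)
print("Variance:", variance, "| ignoring nan:", nan_variance)
print("Standard deviation:", standard_deviation, "| ignoring nan:", nan_standard_deviation)
```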

To learn more about statistics in NumPy and try out a few projects, you can go through this free course on statistics with NumPy.

Author

Codecademy Team
