Hands-on Statistics with NumPy in Python
Understanding data distribution is key to deriving correct insights while analyzing data. To analyze data distribution, we use measures of central tendency and measures of spread. NumPy provides many functions for calculating measures of central tendency, such as mean, median, and weighted mean. It also provides functions to calculate measures of spread, such as range, variance, and standard deviation.
Let’s discuss how to compute measures of central tendency and measures of spread for both 1D and 2D NumPy arrays in Python. We will also discuss handling missing values while calculating these statistical measures.
What are Measures of Central Tendency?
Measures of central tendency are statistical measures that summarize a data set by identifying the central or typical value around which other data points cluster. The three main measures of central tendency are mean, median, and mode.
Measures of central tendency help us provide a summary of the data’s distribution. These measures are used in various statistical analyses to understand the typical or central behavior of the dataset.
What are Measures of Spread?
Measures of spread, also known as measures of variability, describe the extent to which data points in a dataset differ from the central value. They provide insight into how spread out or concentrated the data is. The main measures of spread are range, variance, and standard deviation.
Measures of spread help us assess the consistency and reliability of the data by indicating how concentrated or dispersed the values are relative to the central tendency.
Calculating Mean Using NumPy Arrays in Python
Mean or average is a measure of central tendency calculated as the sum of all values in a dataset divided by the number of values. It gives a quick snapshot of the typical value in the data.
How to Calculate Mean for 1D NumPy Arrays?
For a 1D NumPy array, we can calculate the mean using the mean() function. When we pass a NumPy array to the mean() function, it returns the mean of all the elements in the array.
For example, if we have a list of marks of a student for five different subjects, we can find the mean of the marks scored by the student in all the subjects using the mean() function, as shown below:
import numpy as np
marks = np.array([80, 95, 88, 72, 99])
mean_value = np.mean(marks)
print("The array is:", marks)
print("The mean value is:", mean_value)
The output for the code is the following:
The array is: [80 95 88 72 99]
The mean value is: 86.8
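As a quick check on the definition, the following sketch computes the same mean by hand, as the sum of the values divided by their count, and compares it with np.mean(). The variable names manual_mean and numpy_mean are illustrative only.

```python
import numpy as np

marks = np.array([80, 95, 88, 72, 99])

# Mean by definition: sum of the values divided by the number of values
manual_mean = marks.sum() / marks.size
numpy_mean = np.mean(marks)

print("Manual mean:", manual_mean)
print("NumPy mean:", numpy_mean)
print("Do they match?", np.isclose(manual_mean, numpy_mean))
```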
Missing data points are a common issue in real-world datasets. In NumPy, we represent the missing values as nan, which stands for “not a number”.
If a NumPy array contains one or more nan values, the mean() function returns nan after execution. For instance, if the student in our example misses one of the tests, the array of their marks will have one nan value. In such a case, calculating the mean of the marks using the mean() function will also give us a nan value.
import numpy as np
marks = np.array([80, 95, np.nan, 72, 99])
mean_value = np.mean(marks)
print("The array is:", marks)
print("The mean value is:", mean_value)
Output:
The array is: [80. 95. nan 72. 99.]
The mean value is: nan
NumPy can ignore the nan values and calculate the mean for the rest of the values. For this, NumPy provides us with the nanmean() function. When we pass a NumPy array to the nanmean() function, it ignores the nan values and returns the mean of the remaining values.
import numpy as np
marks = np.array([80, 95, np.nan, 72, 99])
mean_value = np.nanmean(marks)
print("The array is:", marks)
print("The mean value is:", mean_value)
Output:
The array is: [80. 95. nan 72. 99.]
The mean value is: 86.5
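To see what ignoring nan values amounts to, the following sketch counts the missing entries with np.isnan() and averages only the valid ones using a boolean mask. For a simple 1D float array like this one, the masked mean should agree with nanmean(). The names missing_mask and valid_marks are illustrative, not part of NumPy.

```python
import numpy as np

marks = np.array([80, 95, np.nan, 72, 99])

# True where a value is missing
missing_mask = np.isnan(marks)
# Keep only the entries that are not nan
valid_marks = marks[~missing_mask]

print("Number of missing values:", missing_mask.sum())
print("Mean of the valid values:", np.mean(valid_marks))
print("Result of nanmean():", np.nanmean(marks))
```

Counting the missing values first is a useful habit: a mean computed from four marks out of five is less representative than one computed from all five.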
In addition to 1D arrays, we can also calculate the mean for 2D NumPy arrays using the mean() function. Let’s see how we can do that.
How to Calculate Mean for 2D NumPy Arrays?
A 2D NumPy array, as the name implies, has two dimensions.
- The first dimension corresponds to the rows of the 2D array and is represented by axis 0.
- The second dimension corresponds to the columns of the 2D array and is represented by axis 1.
If we have a 2D array containing marks for students in a class, we can calculate the mean by three different approaches.
- We can calculate the mean of the marks scored by all the students in the class in all the subjects.
- We can calculate the mean of marks scored by all the students in a particular subject.
- We can calculate the mean of the marks scored by a particular student in all the subjects.
Suppose we have the following two-dimensional array containing marks data for six students and five subjects. In the array, each row represents the marks obtained by an individual student, and each column represents marks scored in a particular subject.
class_marks = np.array([
    [80, 95, 88, 72, 99],
    [77, 99, 83, 76, 68],
    [97, 91, 88, 92, 77],
    [95, 99, 98, 92, 94],
    [72, 85, 86, 71, 89],
    [81, 85, 89, 91, 99],
])
To calculate the mean of the marks scored by all the students in the class in all the subjects, we need to calculate the mean of all the values in the above 2D array. For this, we can directly pass the array to the mean() function like so:
import numpy as np
class_marks = np.array([
    [80, 95, 88, 72, 99],
    [77, 99, 83, 76, 68],
    [97, 91, 88, 92, 77],
    [95, 99, 98, 92, 94],
    [72, 85, 86, 71, 89],
    [81, 85, 89, 91, 99],
])
mean_value = np.mean(class_marks)
print("The mean value is:", mean_value)
Output:
The mean value is: 86.93333333333334
Now, we want to calculate the mean of the marks obtained by a particular student in all the subjects. For this, we need to calculate the mean of the values in each row in the 2D array, as each row represents the marks scored by an individual student.
To calculate the mean of each row in a 2D array, we set the axis parameter in the mean() function to 1.
- By default, the axis parameter in the mean() function is set to None. In this case, the mean is calculated over all the values in the array, and we get a single value as output, as shown in the previous example.
- When the axis parameter in the mean() function is set to 1, the calculation is done across the horizontal direction. The mean is calculated for the values in each row, and we get an array as the output of the mean() function. The length of the output array is the same as the number of rows in the input array.
- When the axis parameter is set to 0, the calculation is done in the vertical direction. The mean is calculated for the values in each column, and we get an array as the output of the mean() function. In this case, the length of the output array is the same as the number of columns in the input array.
We can get the mean of the marks scored by each of the six students using the mean() function and the axis parameter as shown below:
import numpy as np
class_marks = np.array([
    [80, 95, 88, 72, 99],
    [77, 99, 83, 76, 68],
    [97, 91, 88, 92, 77],
    [95, 99, 98, 92, 94],
    [72, 85, 86, 71, 89],
    [81, 85, 89, 91, 99],
])
# Calculate the mean for each row separately
mean_values = np.mean(class_marks, axis=1)
print("The mean marks for the 6 students are:", mean_values)
Output:
The mean marks for the 6 students are: [86.8 80.6 89. 95.6 80.6 89. ]
As there are six students, i.e., six rows in the input array, you can observe that we get six values in the output array.
If we want to calculate the mean of the marks for each subject instead of each student, we need to calculate the mean for each column. We can set the axis parameter to 0 in the mean() function for this.
import numpy as np
class_marks = np.array([
    [80, 95, 88, 72, 99],
    [77, 99, 83, 76, 68],
    [97, 91, 88, 92, 77],
    [95, 99, 98, 92, 94],
    [72, 85, 86, 71, 89],
    [81, 85, 89, 91, 99],
])
# Calculate the mean for each column separately
mean_values = np.mean(class_marks, axis=0)
print("The mean of the marks scored in each subject are:", mean_values)
Output:
The mean of the marks scored in each subject are: [83.66666667 92.33333333 88.66666667 82.33333333 87.66666667]
As the input array has five subjects, i.e., five columns, we get five values in the output array.
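The effect of the axis parameter on the shape of the output can be checked directly. The following sketch uses a small hypothetical 2x3 array (two students, three subjects) instead of the class data, so that the results are easy to verify by hand. The name small_marks is illustrative.

```python
import numpy as np

# Hypothetical data: 2 students (rows) x 3 subjects (columns)
small_marks = np.array([[80, 90, 100],
                        [60, 70, 80]])

overall_mean = np.mean(small_marks)          # axis=None: one value
row_means = np.mean(small_marks, axis=1)     # one value per row
column_means = np.mean(small_marks, axis=0)  # one value per column

print("Overall mean:", overall_mean)
print("Row means:", row_means, "shape:", row_means.shape)
print("Column means:", column_means, "shape:", column_means.shape)
```

A handy way to remember the convention is that the axis we pass is the one that disappears: axis=1 collapses the columns and leaves one value per row, while axis=0 collapses the rows and leaves one value per column.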
If the 2D array passed to the mean() function contains nan values, it returns a nan value.
import numpy as np
class_marks = np.array([
    [80, 95, np.nan, 72, 99],
    [77, 99, 83, 76, 68],
    [97, 91, 88, 92, 77],
    [95, 99, 98, 92, 94],
    [72, 85, 86, 71, 89],
    [81, 85, 89, 91, 99],
])
mean_value = np.mean(class_marks)
print("The mean value is:", mean_value)
Output:
The mean value is: nan
In the above example, a single nan value in the input array is enough to make the mean() function return nan.
When the axis parameter is set to 1, and the input array contains nan values, we get nan values in the output only for the rows containing nan values. For instance, if the first row of the input array contains nan, the output array of the mean() function will contain nan as its first element.
Similarly, if the axis parameter is set to 0, the output array returned by the mean() function contains nan in the respective positions for the columns that contain nan values. For example, if the input array contains a nan value in the third column, the third value in the output array will be nan.
import numpy as np
class_marks = np.array([
    [80, 95, np.nan, 72, 99],
    [77, 99, 83, 76, 68],
    [97, 91, 88, 92, 77],
    [95, 99, 98, 92, 94],
    [72, 85, 86, 71, 89],
    [81, 85, 89, 91, 99],
])
# Calculate mean for each row
mean_for_each_row = np.mean(class_marks, axis=1)
# Calculate mean for each column
mean_for_each_column = np.mean(class_marks, axis=0)
print("The mean for each row is:", mean_for_each_row)
print("The mean for each column is:", mean_for_each_column)
Output:
The mean for each row is: [ nan 80.6 89. 95.6 80.6 89. ]
The mean for each column is: [83.66666667 92.33333333 nan 82.33333333 87.66666667]
To ignore nan values in 2D arrays and calculate the mean for rows or columns having nan values, we can use the nanmean() function instead of the mean() function, like so:
import numpy as np
class_marks = np.array([
    [80, 95, np.nan, 72, 99],
    [77, 99, 83, 76, 68],
    [97, 91, 88, 92, 77],
    [95, 99, 98, 92, 94],
    [72, 85, 86, 71, 89],
    [81, 85, 89, 91, 99],
])
mean_value_by_rows = np.nanmean(class_marks, axis=1)
mean_value_by_columns = np.nanmean(class_marks, axis=0)
print("The mean for each row is:", mean_value_by_rows)
print("The mean for each column is:", mean_value_by_columns)
Output:
The mean for each row is: [86.5 80.6 89. 95.6 80.6 89. ]
The mean for each column is: [83.66666667 92.33333333 88.8 82.33333333 87.66666667]
Note that we used the nanmean() function in the above example. Hence, the nan values are ignored while calculating the mean for a given row or column.
Calculating Median Using NumPy Arrays
The median of a dataset is the middle value of the sorted data. If we have a dataset of N numbers, we calculate the median by first sorting the numbers and then selecting the number at position (N+1)/2 if N is odd. If N is even, we calculate the mean of the numbers at positions N/2 and (N/2)+1 and consider this value as the median of the dataset.
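The definition can be turned into a short function and compared against np.median(). This is a minimal sketch for 1D data without nan values; manual_median is a hypothetical helper written for illustration, not a NumPy function.

```python
import numpy as np

def manual_median(values):
    """Median by definition: middle value(s) of the sorted data."""
    ordered = np.sort(np.asarray(values, dtype=float))
    n = ordered.size
    middle = n // 2
    if n % 2 == 1:
        # Odd N: the single middle value (position (N+1)/2, counting from 1)
        return ordered[middle]
    # Even N: mean of the two middle values (positions N/2 and N/2 + 1)
    return (ordered[middle - 1] + ordered[middle]) / 2

odd_marks = [80, 95, 88, 72, 99]
even_marks = [80, 95, 72, 99]

print("Odd N:", manual_median(odd_marks), np.median(odd_marks))
print("Even N:", manual_median(even_marks), np.median(even_marks))
```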
How to Calculate Median for 1D NumPy Arrays?
To calculate the median value for the elements in a 1D NumPy array, we can use the median() function like so:
import numpy as np
marks = np.array([80, 95, 88, 72, 99])
median_value = np.median(marks)
print("The array is:", marks)
print("The median value is:", median_value)
Output:
The array is: [80 95 88 72 99]
The median value is: 88.0
If the input array passed to the median() function contains nan values, the output value will also be nan.
import numpy as np
marks = np.array([80, 95, np.nan, 72, 99])
median_value = np.median(marks)
print("The array is:", marks)
print("The median value is:", median_value)
Output:
The array is: [80. 95. nan 72. 99.]
The median value is: nan
To ignore the nan values and calculate the median for the rest of the values in the array, we can use the nanmedian() function, as shown below:
import numpy as np
marks = np.array([80, 95, np.nan, 72, 99])
median_value = np.nanmedian(marks)
print("The array is:", marks)
print("The median value is:", median_value)
Output:
The array is: [80. 95. nan 72. 99.]
The median value is: 87.5
How to Calculate Median for 2D NumPy Arrays?
Like the mean, we can also calculate the median for 2D arrays in three ways using the median() function. To calculate the median of all the values in a given 2D array, we pass the array to the median() function.
import numpy as np
class_marks = np.array([
    [80, 95, 88, 72, 99],
    [77, 99, 83, 76, 68],
    [97, 91, 88, 92, 77],
    [95, 99, 98, 92, 94],
    [72, 85, 86, 71, 89],
    [81, 85, 89, 91, 99],
])
median_value = np.median(class_marks)
print("The median value is:", median_value)
Output:
The median value is: 88.5
To calculate the median value for each row in the input array, we can set the axis parameter to 1 in the median() function. To calculate the median for each column, we set the axis parameter to 0. You can observe this in the following example:
import numpy as np
class_marks = np.array([
    [80, 95, 88, 72, 99],
    [77, 99, 83, 76, 68],
    [97, 91, 88, 92, 77],
    [95, 99, 98, 92, 94],
    [72, 85, 86, 71, 89],
    [81, 85, 89, 91, 99],
])
# Set axis parameter to 1 to get one median per row
median_value_for_each_row = np.median(class_marks, axis=1)
# Set axis parameter to 0 to get one median per column
median_value_for_each_column = np.median(class_marks, axis=0)
print("The median value for each row is:", median_value_for_each_row)
print("The median value for each column is:", median_value_for_each_column)
Output:
The median value for each row is: [88. 77. 91. 95. 85. 89.]
The median value for each column is: [80.5 93. 88. 83.5 91.5]
If the 2D array contains nan values, the output of the median() function is also nan.
import numpy as np
class_marks = np.array([
    [80, 95, np.nan, 72, 99],
    [77, 99, 83, 76, 68],
    [97, 91, 88, 92, 77],
    [95, 99, 98, 92, 94],
    [72, 85, 86, 71, 89],
    [81, 85, 89, 91, 99],
])
median_value = np.median(class_marks)
print("The median value is:", median_value)
Output:
The median value is: nan
If the axis parameter is set to 1, the median is calculated for each row. In such a case, we get nan values in the output array for each row with nan values in the input 2D array. Similarly, when the axis parameter is set to 0, the median is calculated for each column. Hence, we get nan values in the output array for each column with nan values in the input 2D array.
import numpy as np
class_marks = np.array([
    [80, 95, np.nan, 72, 99],
    [77, 99, 83, 76, 68],
    [97, 91, 88, 92, 77],
    [95, 99, 98, 92, 94],
    [72, 85, 86, 71, 89],
    [81, 85, 89, 91, 99],
])
median_value_for_rows = np.median(class_marks, axis=1)
median_value_for_columns = np.median(class_marks, axis=0)
print("The median value for each row is:", median_value_for_rows)
print("The median value for each column is:", median_value_for_columns)
Output:
The median value for each row is: [nan 77. 91. 95. 85. 89.]
The median value for each column is: [80.5 93. nan 83.5 91.5]
In the above code, the first row in the input array contains a nan value. Due to this, the output array has nan in the first position. Also, the third column in the input array contains a nan value. Hence, the output array for the median of columns contains nan at the third position.
To ignore nan values and calculate the median for the rest of the values in the rows or columns having nan, we can use the nanmedian() function, as shown below:
import numpy as np
class_marks = np.array([
    [80, 95, np.nan, 72, 99],
    [77, 99, 83, 76, 68],
    [97, 91, 88, 92, 77],
    [95, 99, 98, 92, 94],
    [72, 85, 86, 71, 89],
    [81, 85, 89, 91, 99],
])
median_value_by_rows = np.nanmedian(class_marks, axis=1)
median_value_by_columns = np.nanmedian(class_marks, axis=0)
print("The median value for each row is:", median_value_by_rows)
print("The median value for each column is:", median_value_by_columns)
Output:
The median value for each row is: [87.5 77. 91. 95. 85. 89. ]
The median value for each column is: [80.5 93. 88. 83.5 91.5]
In this example, you can observe that we don’t get nan values in the outputs. This is because the nan values have been ignored while calculating the median for rows or columns having such values.
Calculating Weighted Mean for NumPy Arrays
Weighted mean is a type of mean where some data points have more influence on the mean than other data points. We calculate the weighted mean by assigning a weight to each data point, reflecting its importance in the dataset.
For example, suppose we have an array with five elements, [80, 95, 88, 72, 99], and the weights for the elements are [1, 2, 1, 3, 2]. Then, the weighted mean of the array is calculated as follows:
((80×1) + (95×2) + (88×1) + (72×3) + (99×2)) / (1 + 2 + 1 + 3 + 2) = 772 / 9 ≈ 85.78
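The same arithmetic can be reproduced in NumPy as the sum of the value-weight products divided by the sum of the weights. The following sketch compares that manual calculation with np.average(); the names manual_weighted_mean and numpy_weighted_mean are illustrative.

```python
import numpy as np

marks = np.array([80, 95, 88, 72, 99])
weights = np.array([1, 2, 1, 3, 2])

# Weighted mean by definition: sum(value * weight) / sum(weight)
manual_weighted_mean = np.sum(marks * weights) / np.sum(weights)
numpy_weighted_mean = np.average(marks, weights=weights)

print("Manual weighted mean:", manual_weighted_mean)
print("np.average result:", numpy_weighted_mean)
```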
How to Calculate the Weighted Mean for 1D NumPy Arrays?
We use the average() function to calculate the weighted mean for NumPy arrays. The average() function takes an array containing the elements as its first input argument and a list containing their weights through the weights parameter, and returns the weighted mean of the elements in the input array.
import numpy as np
marks = np.array([80, 95, 88, 72, 99])
weighted_mean = np.average(marks, weights=[1, 2, 1, 3, 2])
print("The array is:", marks)
print("Weighted mean is:", weighted_mean)
Output:
The array is: [80 95 88 72 99]
Weighted mean is: 85.77777777777777
While calculating the weighted mean, the number of elements passed to the weights parameter must be the same as the number of elements in the array containing the actual dataset. Otherwise, when the axis parameter is not specified, the average() function raises a TypeError exception.
If the input array passed to the average() function contains nan values, the output becomes nan.
import numpy as np
marks = np.array([80, 95, np.nan, 72, 99])
weighted_mean = np.average(marks, weights=[1, 2, 1, 3, 2])
print("The array is:", marks)
print("Weighted mean is:", weighted_mean)
Output:
The array is: [80. 95. nan 72. 99.]
Weighted mean is: nan
How to Calculate the Weighted Mean for 2D NumPy Arrays?
We can also calculate the weighted mean for 2D NumPy arrays. To calculate the weighted mean of elements in each row in a 2D NumPy array, we must set the axis parameter in the average() function to 1 and pass a list of weights having the same number of elements as the number of columns in the array.
import numpy as np
class_marks = np.array([
    [80, 95, 88, 72, 99],
    [77, 99, 83, 76, 68],
    [97, 91, 88, 92, 77],
    [95, 99, 98, 92, 94],
    [72, 85, 86, 71, 89],
    [81, 85, 89, 91, 99],
])
weighted_mean = np.average(class_marks, axis=1, weights=[1, 2, 1, 3, 2])
print("Weighted mean for each row is:", weighted_mean)
Output:
Weighted mean for each row is: [85.77777778 80.22222222 88.55555556 95. 79.88888889 90.11111111]
To calculate the weighted mean for each column in a 2D array, we set the axis parameter in the average() function to 0 and pass a list of weights with the same number of elements as the number of rows in the input array.
import numpy as np
class_marks = np.array([
    [80, 95, 88, 72, 99],
    [77, 99, 83, 76, 68],
    [97, 91, 88, 92, 77],
    [95, 99, 98, 92, 94],
    [72, 85, 86, 71, 89],
    [81, 85, 89, 91, 99],
])
weighted_mean = np.average(class_marks, axis=0, weights=[1, 2, 1, 3, 2, 4])
print("Weighted mean for each column is:", weighted_mean)
Output:
Weighted mean for each column is: [83.38461538 91.61538462 89.53846154 84.46153846 89.84615385]
If the 2D array passed to the average() function contains nan values, the output values corresponding to the rows or columns in which the input array contains nan will also be nan.
import numpy as np
class_marks = np.array([
    [80, 95, np.nan, 72, 99],
    [77, 99, 83, 76, 68],
    [97, 91, 88, 92, 77],
    [95, 99, 98, 92, 94],
    [72, 85, 86, 71, 89],
    [81, 85, 89, 91, 99],
])
weighted_mean_for_rows = np.average(class_marks, axis=1, weights=[1, 2, 1, 3, 2])
print("Weighted mean of elements in each row is:", weighted_mean_for_rows)
weighted_mean_for_cls = np.average(class_marks, axis=0, weights=[1, 2, 1, 3, 2, 4])
print("Weighted mean of elements in each column is:", weighted_mean_for_cls)
Output:
Weighted mean of elements in each row is: [ nan 80.22222222 88.55555556 95. 79.88888889 90.11111111]
Weighted mean of elements in each column is: [83.38461538 91.61538462 nan 84.46153846 89.84615385]
In this example, the input 2D array contains a nan value in the first row and the third column. Hence, we get a nan value at the first position in the output array while calculating the weighted mean for elements in each row. Similarly, we get a nan value at the third position in the output array while calculating the weighted mean for elements in each column.
NumPy doesn’t have a function like nanaverage() to ignore nan values and calculate the weighted mean for the rest of the values in the input arrays.
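One possible workaround is to mask out the nan entries ourselves, dropping both the missing value and its weight before dividing. The following is a minimal sketch under that assumption; nan_weighted_mean is a hypothetical helper, not a NumPy function, and it expects weights that broadcast against the values (for example, one weight per column when axis=1).

```python
import numpy as np

def nan_weighted_mean(values, weights, axis=None):
    """Weighted mean that skips nan entries (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    # Give every element its own weight so values and weights can be masked together
    weights = np.broadcast_to(np.asarray(weights, dtype=float), values.shape)
    valid = ~np.isnan(values)
    # Missing entries contribute 0 to both the numerator and the denominator
    weighted_sum = np.sum(np.where(valid, values * weights, 0.0), axis=axis)
    weight_total = np.sum(np.where(valid, weights, 0.0), axis=axis)
    return weighted_sum / weight_total

marks = np.array([80, 95, np.nan, 72, 99])
result = nan_weighted_mean(marks, [1, 2, 1, 3, 2])
print("Weighted mean ignoring nan:", result)
```

Here the remaining values 80, 95, 72, and 99 keep their weights 1, 2, 3, and 2, so the result is 684 / 8 = 85.5. Note that a row or column consisting only of nan values would lead to a division by zero in this sketch.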
We have discussed how to calculate measures of central tendency like mean, median, and weighted mean for 1D and 2D NumPy arrays. However, NumPy doesn’t provide a function for calculating one of the popular measures of central tendency, mode.
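Although there is no dedicated mode function in NumPy, the mode can be derived from np.unique(), which returns the distinct values and, with return_counts=True, how often each occurs. The following sketch uses hypothetical marks with repeated values; the names mode_value and mode_count are illustrative.

```python
import numpy as np

# Hypothetical data with repeated values
marks = np.array([80, 95, 88, 95, 72, 88, 95])

# Distinct values (sorted) and the number of times each one occurs
values, counts = np.unique(marks, return_counts=True)

# The mode is the value with the highest count
mode_value = values[np.argmax(counts)]
mode_count = counts.max()

print("The mode is:", mode_value)
print("It occurs", mode_count, "times")
```

If several values share the highest count, np.argmax() picks the first one, which is the smallest such value because np.unique() returns sorted values.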
Now, let’s discuss the measures of spread and how to calculate them using the functions provided in the NumPy module.
Calculating Range for NumPy Arrays in Python
Measures of spread describe the variability in a dataset. They help us understand the degree to which data points differ from each other and from the measures of central tendency (mean, median, or mode). In the following sections, we will discuss three measures of spread: range, variance, and standard deviation.
The range for a dataset is defined as the difference between the maximum and minimum values in the dataset. By definition, range is always a non-negative value.
How to Calculate the Range for 1D NumPy Arrays?
We can calculate the range for a 1D NumPy array using the ptp() function. The ptp() function takes the array as its input and returns the range of the array, as shown below:
import numpy as np
marks = np.array([80, 95, 88, 72, 99])
range_value = np.ptp(marks)
print("The array is:", marks)
print("The range is:", range_value)
Output:
The array is: [80 95 88 72 99]
The range is: 27
If the 1D NumPy array passed to the ptp() function contains nan values, it returns nan.
import numpy as np
marks = np.array([80, 95, np.nan, 72, 99])
range_value = np.ptp(marks)
print("The array is:", marks)
print("The range is:", range_value)
Output:
The array is: [80. 95. nan 72. 99.]
The range is: nan
How to Calculate the Range for 2D NumPy Arrays?
When we pass a 2D NumPy array to the ptp() function, it calculates the range of the array as if the elements of the array are present in a 1D array and returns the output.
import numpy as np
class_marks = np.array([
    [80, 95, 88, 72, 99],
    [77, 99, 83, 76, 68],
    [97, 91, 88, 92, 77],
    [95, 99, 98, 92, 94],
    [72, 85, 86, 71, 89],
    [81, 85, 89, 91, 99],
])
range_value = np.ptp(class_marks)
print("The range is:", range_value)
Output:
The range is: 31
In the above example, the ptp() function calculates the range by considering all the values in the array as if the array were a flattened 1D array. However, we can also calculate the range across rows and columns separately.
- To calculate the range of values in each column in a 2D NumPy array, we can set the axis parameter to 0 in the ptp() function. In this case, the number of values in the output array equals the number of columns in the input array. Each value in the output array represents the range of the column at the same position in the input 2D array.
- To calculate the range of values in each row in a given 2D array, we can set the axis parameter to 1 in the ptp() function. In this case, the number of values in the output array equals the number of rows in the input array. Each value in the output array represents the range of the respective row in the input 2D array.
You can observe this in the following example.
import numpy as np
class_marks = np.array([
    [80, 95, 88, 72, 99],
    [77, 99, 83, 76, 68],
    [97, 91, 88, 92, 77],
    [95, 99, 98, 92, 94],
    [72, 85, 86, 71, 89],
    [81, 85, 89, 91, 99],
])
range_for_each_column = np.ptp(class_marks, axis=0)
range_for_each_row = np.ptp(class_marks, axis=1)
print("The range for each column is:", range_for_each_column)
print("The range for each row is:", range_for_each_row)
Output:
The range for each column is: [25 14 15 21 31]
The range for each row is: [27 31 20 7 18 18]
If we pass a 2D array containing a nan value to the ptp() function without setting the axis parameter, it returns nan.
import numpy as np
class_marks = np.array([
    [80, 95, np.nan, 72, 99],
    [77, 99, 83, 76, 68],
    [97, 91, 88, 92, 77],
    [95, 99, 98, 92, 94],
    [72, 85, 86, 71, 89],
    [81, 85, 89, 91, 99],
])
range_value = np.ptp(class_marks)
print("The range is:", range_value)
Output:
The range is: nan
When we set the axis parameter to 0 in the ptp() function, only the values in the output array corresponding to columns having nan values in the input array are nan. Similarly, when the axis parameter is set to 1, the values in the output array corresponding to rows having nan values in the input array are nan.
import numpy as np
class_marks = np.array([
    [80, 95, np.nan, 72, 99],
    [77, 99, 83, 76, 68],
    [97, 91, 88, 92, 77],
    [95, 99, 98, 92, 94],
    [72, 85, 86, 71, 89],
    [81, 85, 89, 91, 99],
])
range_value_for_columns = np.ptp(class_marks, axis=0)
range_value_for_rows = np.ptp(class_marks, axis=1)
print("The range for each row is:", range_value_for_rows)
print("The range for each column is:", range_value_for_columns)
Output:
The range for each row is: [nan 31. 20. 7. 18. 18.]
The range for each column is: [25. 14. nan 21. 31.]
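NumPy does not appear to offer a nan-ignoring counterpart to ptp(), but since the range is just the maximum minus the minimum, one workaround is to combine nanmax() and nanmin(). The following sketch assumes that approach and uses a shortened two-row version of the marks data to keep the output easy to check; the variable names are illustrative.

```python
import numpy as np

# Shortened example data: two students, one missing mark
class_marks = np.array([[80, 95, np.nan, 72, 99],
                        [77, 99, 83, 76, 68]])

# Range = maximum - minimum, with nan values ignored
range_per_row = np.nanmax(class_marks, axis=1) - np.nanmin(class_marks, axis=1)
range_per_column = np.nanmax(class_marks, axis=0) - np.nanmin(class_marks, axis=0)

print("The range for each row is:", range_per_row)
print("The range for each column is:", range_per_column)
```

In this sketch the third column has only one valid value left, so its range is 0. A row or column made up entirely of nan values would still produce nan, along with a runtime warning.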
Calculating Variance for NumPy Arrays in Python
The variance of a dataset is the mean of the squared differences between each data point and the mean. It indicates how far the values typically deviate from the mean. We can calculate variance using the following formula:
σ² = Σ(xᵢ - μ)² / N
Here,
- xᵢ is a data point in the dataset.
- μ is the mean for the dataset.
- N is the number of data points in the dataset.
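The formula can be checked step by step in NumPy. The following sketch computes the variance by hand and compares it with np.var(), which divides by N by default. It also shows the ddof parameter of np.var(), which is not covered elsewhere in this article: setting ddof=1 divides by N-1 instead and gives the sample variance. The variable names are illustrative.

```python
import numpy as np

marks = np.array([80, 95, 88, 72, 99])

# Variance by definition: mean of the squared deviations from the mean
mean_value = marks.mean()
manual_variance = np.mean((marks - mean_value) ** 2)

population_variance = np.var(marks)      # default ddof=0: divide by N
sample_variance = np.var(marks, ddof=1)  # ddof=1: divide by N - 1

print("Manual variance:", manual_variance)
print("np.var (divide by N):", population_variance)
print("np.var with ddof=1 (divide by N - 1):", sample_variance)
```

For these five marks the squared deviations add up to 482.8, so dividing by N gives 96.56 and dividing by N-1 gives 120.7.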
How to Calculate Variance for 1D NumPy Arrays?
To calculate the variance for a 1D NumPy array, we can use the var() function. When we pass the array to the var() function, it returns the variance for the input array.
import numpy as np
marks = np.array([80, 95, 88, 72, 99])
variance = np.var(marks)
print("The array is:", marks)
print("The variance is:", variance)
Output:
The array is: [80 95 88 72 99]
The variance is: 96.55999999999999
If the input array passed to the var() function contains nan values, the output is also nan.
import numpy as np
marks = np.array([80, 95, np.nan, 72, 99])
variance = np.var(marks)
print("The array is:", marks)
print("The variance is:", variance)
Output:
The array is: [80. 95. nan 72. 99.]
The variance is: nan
To ignore the nan values and calculate the variance for the rest of the values in the input array, we can use the nanvar() function.
import numpy as np
marks = np.array([80, 95, np.nan, 72, 99])
variance = np.nanvar(marks)
print("The array is:", marks)
print("The variance is:", variance)
Output:
The array is: [80. 95. nan 72. 99.]
The variance is: 120.25
How to Calculate Variance for 2D NumPy Arrays?
If we pass a 2D NumPy array to the var() function, it treats the 2D array as a flattened 1D array and returns the calculated variance.
import numpy as np
class_marks = np.array([
    [80, 95, 88, 72, 99],
    [77, 99, 83, 76, 68],
    [97, 91, 88, 92, 77],
    [95, 99, 98, 92, 94],
    [72, 85, 86, 71, 89],
    [81, 85, 89, 91, 99],
])
variance = np.var(class_marks)
print("The variance is:", variance)
Output:
The variance is: 85.46222222222221
We can also calculate variance across rows or columns using the var() function. To calculate the variance for each row in the input array, we set the axis parameter to 1 in the var() function. To do the same for each column in the input array, we set the axis parameter to 0.
import numpy as np
class_marks = np.array([
    [80, 95, 88, 72, 99],
    [77, 99, 83, 76, 68],
    [97, 91, 88, 92, 77],
    [95, 99, 98, 92, 94],
    [72, 85, 86, 71, 89],
    [81, 85, 89, 91, 99],
])
variance_for_each_row = np.var(class_marks, axis=1)
variance_for_each_column = np.var(class_marks, axis=0)
print("The variance for each row is:", variance_for_each_row)
print("The variance for each column is:", variance_for_each_column)
Output:
The variance for each row is: [ 96.56 107.44 44.4 6.64 57.04 36.8 ]
The variance for each column is: [ 84.55555556 34.22222222 21.22222222 89.55555556 133.22222222]
If the axis parameter in the var() function is set to 1 and the input array contains nan values, we get nan values in the output for the rows that contain nan values. For instance, if the first row of the input array contains nan, the output of the var() function will also contain nan as its first element.
Similarly, if the axis parameter is set to 0, the output array returned by the var() function contains nan in the positions corresponding to the columns that contain nan values in the input array. For instance, if the input array contains a nan value in the third column, the third value in the output array will be nan.
import numpy as np
class_marks = np.array([
    [80, 95, np.nan, 72, 99],
    [77, 99, 83, 76, 68],
    [97, 91, 88, 92, 77],
    [95, 99, 98, 92, 94],
    [72, 85, 86, 71, 89],
    [81, 85, 89, 91, 99],
])
variance_for_each_row = np.var(class_marks, axis=1)
variance_for_each_column = np.var(class_marks, axis=0)
print("The variance for each row is:", variance_for_each_row)
print("The variance for each column is:", variance_for_each_column)
Output:
The variance for each row is: [ nan 107.44 44.4 6.64 57.04 36.8 ]
The variance for each column is: [ 84.55555556 34.22222222 nan 89.55555556 133.22222222]
To ignore nan values and calculate variance for the rest of the values in the rows or columns having nan values, we can use the nanvar() function.
import numpy as np
class_marks = np.array([
    [80, 95, np.nan, 72, 99],
    [77, 99, 83, 76, 68],
    [97, 91, 88, 92, 77],
    [95, 99, 98, 92, 94],
    [72, 85, 86, 71, 89],
    [81, 85, 89, 91, 99],
])
variance_for_each_row = np.nanvar(class_marks, axis=1)
variance_for_each_column = np.nanvar(class_marks, axis=0)
print("The variance for each row is:", variance_for_each_row)
print("The variance for each column is:", variance_for_each_column)
Output:
The variance for each row is: [120.25 107.44 44.4 6.64 57.04 36.8 ]
The variance for each column is: [ 84.55555556 34.22222222 25.36 89.55555556 133.22222222]
Calculating Standard Deviation for NumPy Arrays
Standard deviation is the square root of variance. It measures spread in the same units as the original data. We can calculate the standard deviation for a given dataset using the following formula:
σ = √( Σ(xᵢ - μ)² / N )
Here,
- xᵢ is a data point in the dataset.
- μ is the mean for the dataset.
- N is the number of data points in the dataset.
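Because the standard deviation is the square root of the variance, the two NumPy functions can be checked against each other. The following sketch does that and, as an aside not covered elsewhere in this article, shows that np.std() accepts the same ddof parameter as np.var(): ddof=1 divides by N-1 and gives the sample standard deviation. The variable names are illustrative.

```python
import numpy as np

marks = np.array([80, 95, 88, 72, 99])

# Standard deviation as the square root of the variance
std_from_variance = np.sqrt(np.var(marks))

population_std = np.std(marks)      # default ddof=0: divide by N
sample_std = np.std(marks, ddof=1)  # ddof=1: divide by N - 1

print("Square root of the variance:", std_from_variance)
print("np.std (divide by N):", population_std)
print("np.std with ddof=1 (divide by N - 1):", sample_std)
```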
Calculate Standard Deviation for 1D NumPy Arrays
To calculate the standard deviation for a 1D NumPy array, we can use the std() function, as shown below:
import numpy as np
marks = np.array([80, 95, 88, 72, 99])
standard_deviation = np.std(marks)
print("The array is:", marks)
print("The standard deviation is:", standard_deviation)
Output:
The array is: [80 95 88 72 99]
The standard deviation is: 9.826494797230596
If the input array contains nan values, the std() function will also return nan.
import numpy as np
marks = np.array([80, 95, np.nan, 72, 99])
standard_deviation = np.std(marks)
print("The array is:", marks)
print("The standard deviation is:", standard_deviation)
Output:
The array is: [80. 95. nan 72. 99.]
The standard deviation is: nan
To ignore the nan values and calculate the standard deviation for the rest of the values in the input array, we can use the nanstd() function. The nanstd() function works like the std() function, except that it ignores the nan values while calculating the standard deviation.
import numpy as np
marks = np.array([80, 95, np.nan, 72, 99])
standard_deviation = np.nanstd(marks)
print("The array is:", marks)
print("The standard deviation is:", standard_deviation)
Output:
The array is: [80. 95. nan 72. 99.]
The standard deviation is: 10.965856099730654
Calculate Standard Deviation for 2D NumPy Arrays
We can also calculate the standard deviation for 2D arrays using the std() function. When we pass a 2D array to the std() function, it treats the 2D array as a flattened 1D array and returns the standard deviation.
import numpy as np
class_marks = np.array([
    [80, 95, 88, 72, 99],
    [77, 99, 83, 76, 68],
    [97, 91, 88, 92, 77],
    [95, 99, 98, 92, 94],
    [72, 85, 86, 71, 89],
    [81, 85, 89, 91, 99],
])
standard_deviation = np.std(class_marks)
print("The standard deviation is:", standard_deviation)
Output:
The standard deviation is: 9.244577990488382
We can use the axis parameter to calculate standard deviation across rows or columns using the std() function. To calculate the standard deviation for each row in the input array, we set the axis parameter to 1. To calculate the standard deviation for each column in the input array, we set the axis parameter to 0.
import numpy as np
class_marks = np.array([[80, 95, 88, 72, 99],
                        [77, 99, 83, 76, 68],
                        [97, 91, 88, 92, 77],
                        [95, 99, 98, 92, 94],
                        [72, 85, 86, 71, 89],
                        [81, 85, 89, 91, 99]])
std_for_each_row = np.std(class_marks, axis=1)
std_for_each_column = np.std(class_marks, axis=0)
print("The standard deviation for each row is:", std_for_each_row)
print("The standard deviation for each column is:", std_for_each_column)
Output:
The standard deviation for each row is: [ 9.8264948 10.36532682 6.6633325 2.57681975 7.55248304 6.06630036]
The standard deviation for each column is: [ 9.19540948 5.84997626 4.60675832 9.46337971 11.54219313]
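A quick way to confirm which axis gives which result is to look at the shape of the output. The sketch below uses the same 6 x 5 array; it also shows the keepdims parameter of std(), which keeps the reduced axis as a dimension of length 1. The variable names are illustrative.

```python
import numpy as np

class_marks = np.array([[80, 95, 88, 72, 99],
                        [77, 99, 83, 76, 68],
                        [97, 91, 88, 92, 77],
                        [95, 99, 98, 92, 94],
                        [72, 85, 86, 71, 89],
                        [81, 85, 89, 91, 99]])

# axis=1 collapses the columns: one value per row
row_std = np.std(class_marks, axis=1)

# axis=0 collapses the rows: one value per column
col_std = np.std(class_marks, axis=0)

# keepdims=True keeps the collapsed axis with length 1
col_std_2d = np.std(class_marks, axis=0, keepdims=True)

print("Input shape:", class_marks.shape)
print("Shape with axis=1:", row_std.shape)
print("Shape with axis=0:", col_std.shape)
print("Shape with axis=0 and keepdims=True:", col_std_2d.shape)
```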
If the input 2D array contains nan values and the axis parameter is set to 1 in the std() function, we get nan in the output array for the rows that contain nan values. For example, if the first row of the input array contains nan, the output array of the std() function will also contain nan as its first element.
Similarly, if the axis parameter is set to 0, the output array returned by the std() function contains nan in the respective positions for the columns that contain nan values in the input. For instance, if the input array contains a nan value in the third column, the third value in the output array will be nan.
import numpy as np
class_marks = np.array([[80, 95, np.nan, 72, 99],
                        [77, 99, 83, 76, 68],
                        [97, 91, 88, 92, 77],
                        [95, 99, 98, 92, 94],
                        [72, 85, 86, 71, 89],
                        [81, 85, 89, 91, 99]])
standard_deviation_for_rows = np.std(class_marks, axis=1)
standard_deviation_for_columns = np.std(class_marks, axis=0)
print("The standard deviation for each row is:", standard_deviation_for_rows)
print("The standard deviation for each column is:", standard_deviation_for_columns)
Output:
The standard deviation for each row is: [ nan 10.36532682 6.6633325 2.57681975 7.55248304 6.06630036]
The standard deviation for each column is: [ 9.19540948 5.84997626 nan 9.46337971 11.54219313]
To ignore nan values and calculate the standard deviation of the remaining values in the rows or columns that contain them, we can use the nanstd() function, as shown below:
import numpy as np
class_marks = np.array([[80, 95, np.nan, 72, 99],
                        [77, 99, 83, 76, 68],
                        [97, 91, 88, 92, 77],
                        [95, 99, 98, 92, 94],
                        [72, 85, 86, 71, 89],
                        [81, 85, 89, 91, 99]])
std_for_each_row = np.nanstd(class_marks, axis=1)
std_for_each_column = np.nanstd(class_marks, axis=0)
print("The standard deviation for each row is:", std_for_each_row)
print("The standard deviation for each column is:", std_for_each_column)
Output:
The standard deviation for each row is: [10.9658561 10.36532682 6.6633325 2.57681975 7.55248304 6.06630036]
The standard deviation for each column is: [ 9.19540948 5.84997626 5.03587132 9.46337971 11.54219313]
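Because nanstd() skips nan values, the results for different rows or columns may be based on different numbers of values. As a rough check, we can count the non-nan entries along each axis using isnan(). The names valid_per_row and valid_per_column in this sketch are illustrative.

```python
import numpy as np

class_marks = np.array([[80, 95, np.nan, 72, 99],
                        [77, 99, 83, 76, 68],
                        [97, 91, 88, 92, 77],
                        [95, 99, 98, 92, 94],
                        [72, 85, 86, 71, 89],
                        [81, 85, 89, 91, 99]])

# True where a value is present, False where it is nan
is_valid = ~np.isnan(class_marks)

# Number of values that nanstd() can use in each row and column
valid_per_row = is_valid.sum(axis=1)
valid_per_column = is_valid.sum(axis=0)

print("Valid values in each row:", valid_per_row)
print("Valid values in each column:", valid_per_column)
```

Here, the first row and the third column each have one value fewer than the others, so their standard deviations are computed from the remaining values only.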
Conclusion
Understanding and calculating statistical measures like mean, median, weighted mean, range, variance, and standard deviation is essential for data analysis. The following are the key concepts that we discussed:
- We use the mean() and nanmean() functions to calculate the mean of the elements in a NumPy array.
- To calculate the median, we use the median() and nanmedian() functions.
- To calculate the weighted mean, we use the average() function.
- We use the ptp() function to calculate the range of a NumPy array.
- We use the var() and nanvar() functions to calculate the variance.
- To calculate the standard deviation for elements in a NumPy array, we use the std() and nanstd() functions.
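As a compact recap, the sketch below calls each of these functions on the marks data used in this article. The weights array is a hypothetical example introduced only for this illustration.

```python
import numpy as np

marks = np.array([80, 95, 88, 72, 99])
marks_with_nan = np.array([80, 95, np.nan, 72, 99])
weights = np.array([1, 2, 3, 2, 1])  # hypothetical weights for illustration

summary = {
    "mean": np.mean(marks),
    "mean (ignoring nan)": np.nanmean(marks_with_nan),
    "median": np.median(marks),
    "median (ignoring nan)": np.nanmedian(marks_with_nan),
    "weighted mean": np.average(marks, weights=weights),
    "range": np.ptp(marks),
    "variance": np.var(marks),
    "variance (ignoring nan)": np.nanvar(marks_with_nan),
    "standard deviation": np.std(marks),
    "standard deviation (ignoring nan)": np.nanstd(marks_with_nan),
}

# Print each measure on its own line
for name, value in summary.items():
    print(name + ":", value)
```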
To learn more about statistics in NumPy and try out a few projects, you can go through this free course on statistics with NumPy.
Author
'The Codecademy Team, composed of experienced educators and tech experts, is dedicated to making tech skills accessible to all. We empower learners worldwide with expert-reviewed content that develops and enhances the technical skills needed to advance and succeed in their careers.'