Probability and Statistics: Descriptive Analysis

Probability and statistics are fundamental building blocks of data science: they provide the mathematical foundation for understanding and analyzing data. Key topics commonly covered in data science include Descriptive Statistics, Probability Theory, Random Variables and Probability Distributions, Sampling and Sampling Distributions, Statistical Inference, Bayesian Statistics, Regression Analysis, and Statistical Modelling and Machine Learning.
This blog covers only Descriptive Statistics.

Descriptive Statistics: Descriptive statistics summarize a dataset with a small number of measures that describe its center, spread, and shape. These measures include Mean, Median, Mode, Variance and Standard Deviation, Range and Interquartile Range (IQR), Percentiles and Quartiles, and Skewness and Kurtosis.

Mean: The mean, also known as the average, is calculated by adding up all the values in a dataset and dividing by the number of values. It is written μ (mu) for a population and x̄ (x-bar) for a sample. Because every value contributes to the sum, the mean is sensitive to outliers: a single extreme value can shift it noticeably.

Median: The median is the middle value of a dataset sorted in ascending order. If there is an even number of data points, the median is the average of the two middle values. Because it depends only on the middle of the sorted data, the median is less affected by outliers than the mean, which makes it a more robust measure of central tendency.

Mode: The mode is the value that appears most frequently in a dataset. A dataset can have no mode (if all values are unique), one mode (unimodal), or multiple modes (multimodal).
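
To make the outlier sensitivity described above concrete, here is a minimal sketch using Python's standard `statistics` module. It reuses the Age values from the example dataset in this post; the extra value 95 is a hypothetical outlier added purely for illustration and is not part of the original data.

```python
import statistics

# Age values from the example dataset used in this post
ages = [25, 30, 35, 28, 22, 31, 45, 40, 27, 29]

# Hypothetical extreme value, added only to illustrate the effect
ages_with_outlier = ages + [95]

mean_before = statistics.mean(ages)
mean_after = statistics.mean(ages_with_outlier)
median_before = statistics.median(ages)
median_after = statistics.median(ages_with_outlier)

print(f"Mean:   {mean_before:.1f} -> {mean_after:.1f}")
print(f"Median: {median_before:.1f} -> {median_after:.1f}")
```

With this one extra value the mean moves from 31.2 to 37.0, while the median only moves from 29.5 to 30.0.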

import pandas as pd
import numpy as np
from scipy import stats
data = {
    'Age': [25, 30, 35, 28, 22, 31, 45, 40, 27, 29],
    'Income': [50000, 60000, 75000, 55000, 48000, 72000, 85000, 90000, 60000, 62000],
    'Height': [165, 170, 175, 168, 160, 172, 180, 178, 163, 169]
}

df = pd.DataFrame(data)
df

mean = {}
median = {}
mode = {}

for column in df.columns:
    # Mean
    mean[column] = sum(df[column]) / len(df[column])

    # Median
    sorted_values = sorted(df[column])
    n = len(sorted_values)
    if n % 2 == 0:
        median[column] = (sorted_values[n // 2 - 1] + sorted_values[n // 2]) / 2
    else:
        median[column] = sorted_values[n // 2]

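    # Mode
    # Count how often each value occurs
    value_counts = {}
    for value in df[column]:
        if value in value_counts:
            value_counts[value] += 1
        else:
            value_counts[value] = 1
    # max() keeps only the first value that reaches the highest count, so a
    # column with ties or with all-unique values still reports a single value
    mode[column] = max(value_counts, key=value_counts.get)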


print("1. Mean:")
print(mean)
print("\n2. Median:")
print(median)
print("\n3. Mode:")
print(mode)

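Output:

1. Mean:
{'Age': 31.2, 'Income': 65700.0, 'Height': 170.0}

2. Median:
{'Age': 29.5, 'Income': 61000.0, 'Height': 169.5}

3. Mode:
{'Age': 25, 'Income': 60000, 'Height': 165}

Age and Height contain no repeated values, so the 25 and 165 reported as modes are simply the first values in those columns.

# Mean, median and mode using built-in functions
df.mean()
df.median()
df.mode()  # returns every value tied for the highest count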
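
The hand-written mode above keeps a single value per column, even though a dataset can have no mode or several. As a rough sketch of how ties could be handled, the standard library's `statistics.multimode` returns every value tied for the highest count. The helper name `modes_or_none` is hypothetical, and treating an all-unique column as having no mode is an assumption that follows the definition given earlier.

```python
import statistics

# Columns from the example dataset used in this post
age = [25, 30, 35, 28, 22, 31, 45, 40, 27, 29]
income = [50000, 60000, 75000, 55000, 48000, 72000, 85000, 90000, 60000, 62000]

def modes_or_none(values):
    """Return all most-frequent values, or [] if every value is unique.

    Hypothetical helper for illustration only.
    """
    candidates = statistics.multimode(values)  # all values tied for the top count
    # If every value is tied, each one occurs exactly once: treat as "no mode"
    if len(values) > 1 and len(candidates) == len(values):
        return []
    return candidates

print("Age modes:", modes_or_none(age))        # no repeated values
print("Income modes:", modes_or_none(income))  # 60000 occurs twice
```

Returning a list keeps the unimodal, multimodal, and no-mode cases distinguishable, which a single scalar cannot do.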

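Variance: Variance measures how spread out the values in a dataset are around the mean. It is the average of the squared differences between each data point and the mean: the population variance divides the sum of squared differences by the number of values N, while the sample variance divides by n - 1 to correct for estimating the mean from the same data. A higher variance indicates greater dispersion, while a lower variance indicates that the data points lie closer to the mean.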

Standard Deviation: The standard deviation is the square root of the variance, so it is expressed in the same units as the data. It measures the typical distance between a data point and the mean. A smaller standard deviation implies that the data points are closer to the mean, while a larger one suggests more variability.

# Variance and standard deviation without using built-in functions
variance = {}
std_deviation = {}

for column in df.columns:
    mean_val = mean[column]  # mean computed in the previous step
    squared_deviations = [(x - mean_val) ** 2 for x in df[column]]
    # Sample variance: divide by n - 1, matching the default of df.var()
    variance[column] = sum(squared_deviations) / (len(df[column]) - 1)
    std_deviation[column] = variance[column] ** 0.5

print("\n4. Variance:")
print(variance)
print("\n5. Standard Deviation:")
print(std_deviation)

Output:

4. Variance:
{'Age': 48.84444444444444, 'Income': 204677777.7777778, 'Height': 41.333333333333336}

5. Standard Deviation:
{'Age': 6.988880056521534, 'Income': 14306.564149989954, 'Height': 6.429100507328637}

# Variance and standard deviation using built-in functions
df.var()
df.std()
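
The manual calculation divides by n - 1, which is the sample variance and matches the pandas default. The sketch below, using only the Age column, shows how the `ddof` argument switches between the sample and population versions. Note that NumPy's `np.var` defaults to the population version, so the two libraries can disagree unless `ddof` is set explicitly.

```python
import numpy as np
import pandas as pd

# Age values from the example dataset used in this post
ages = pd.Series([25, 30, 35, 28, 22, 31, 45, 40, 27, 29])

sample_var = ages.var()              # pandas default: ddof=1, divides by n - 1
population_var = ages.var(ddof=0)    # divides by n
numpy_var = np.var(ages.to_numpy())  # NumPy default: ddof=0, divides by n

print("Sample variance (n - 1):", sample_var)
print("Population variance (n):", population_var)
print("NumPy default variance: ", numpy_var)
```

For small datasets like this one the difference is visible (roughly 48.84 versus 43.96); it shrinks as n grows.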

Range and Interquartile Range

Range: The range is the difference between the maximum and minimum values in a dataset. While it provides a quick measure of the spread, it is sensitive to outliers and might not be representative of the central spread of the majority of the data.
Interquartile Range (IQR): The IQR is a measure of the spread of the middle 50% of the data. It is calculated as the difference between the third quartile (Q3) and the first quartile (Q1). Quartiles divide the dataset into four equal parts, and the IQR is less affected by extreme values compared to the range.

# Range
data_range = df.max() - df.min()
data_range

# Interquartile Range
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3 - q1
iqr
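
To illustrate the claim that the IQR is less affected by extreme values than the range, here is a small sketch on the Age column. The value 95 is a hypothetical outlier added only for the comparison, and `spread` is an illustrative helper name, not part of pandas.

```python
import pandas as pd

# Age values from the example dataset used in this post
ages = pd.Series([25, 30, 35, 28, 22, 31, 45, 40, 27, 29])

# Hypothetical extreme value, added only to illustrate the effect
ages_with_outlier = pd.concat([ages, pd.Series([95])], ignore_index=True)

def spread(values):
    """Return (range, IQR) for a pandas Series."""
    value_range = values.max() - values.min()
    iqr = values.quantile(0.75) - values.quantile(0.25)
    return value_range, iqr

range_before, iqr_before = spread(ages)
range_after, iqr_after = spread(ages_with_outlier)

print(f"Range: {range_before} -> {range_after}")
print(f"IQR:   {iqr_before} -> {iqr_after}")
```

The range more than triples (23 to 73) because it depends entirely on the two extreme values, while the IQR changes far less (6.75 to 10.0).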

Percentiles and Quartiles

Percentiles: Percentiles are values that divide a dataset into 100 equal parts. The pth percentile represents the value below which p% of the data falls. For example, the 25th percentile (Q1) is the value below which 25% of the data falls.

Quartiles: Quartiles divide the dataset into four equal parts. The first quartile (Q1) is the 25th percentile, the second quartile (Q2) is the median (50th percentile), and the third quartile (Q3) is the 75th percentile. Quartiles help in understanding the spread and central tendency of the data.

percentile_25 = df.quantile(0.25)
percentile_25
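
In keeping with the earlier hand-written calculations, the sketch below computes a percentile without built-in quantile functions. It assumes linear interpolation between the two nearest sorted values, which is the default method of `quantile` in pandas; other interpolation rules exist and give slightly different answers on small datasets. The function name `percentile_linear` is illustrative.

```python
import math

import pandas as pd

def percentile_linear(values, q):
    """q-th quantile (0 <= q <= 1) by linear interpolation between sorted values."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)  # fractional index into the sorted data
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

# Age values from the example dataset used in this post
ages = [25, 30, 35, 28, 22, 31, 45, 40, 27, 29]

manual_q1 = percentile_linear(ages, 0.25)
pandas_q1 = pd.Series(ages).quantile(0.25)

print("Manual 25th percentile:", manual_q1)
print("pandas 25th percentile:", pandas_q1)
```

Because the result is interpolated, it need not be an actual data point: here the 25th percentile of Age falls between 27 and 28.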

Skewness and Kurtosis

Skewness: Skewness measures the asymmetry in the distribution of data. A positive skew indicates that the tail of the distribution is longer on the right (right-skewed or positively skewed), while a negative skew indicates a longer left tail (left-skewed or negatively skewed). A symmetric distribution has skewness equal to zero.

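Kurtosis: Kurtosis measures the “tailedness” of a distribution, that is, how much of its variability comes from extreme values, compared to a normal distribution. A normal distribution has a kurtosis of 3, so results are usually reported as excess kurtosis (kurtosis minus 3), which is what df.kurtosis() returns. Positive excess kurtosis (leptokurtic) indicates heavier tails, often accompanied by a sharper peak, while negative excess kurtosis (platykurtic) indicates lighter tails and often a flatter peak. Mesokurtic distributions, such as the normal distribution, have excess kurtosis equal to zero.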

skewness = df.skew()
skewness
kurtosis = df.kurtosis()
kurtosis
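
For readers who want to see what sits behind `skew()` and `kurtosis()`, the following sketch computes both by hand. It assumes the bias-corrected sample estimators (adjusted Fisher-Pearson skewness and excess kurtosis), which is what pandas is understood to use by default; the function names are illustrative, and the result is compared against pandas rather than taken on trust.

```python
import pandas as pd

def sample_skewness(values):
    """Bias-corrected (adjusted Fisher-Pearson) skewness; needs n >= 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values)  # sum of squared deviations
    m3 = sum((x - mean) ** 3 for x in values)  # sum of cubed deviations
    return n * (n - 1) ** 0.5 / (n - 2) * m3 / m2 ** 1.5

def sample_excess_kurtosis(values):
    """Bias-corrected excess kurtosis (0 for a normal distribution); needs n >= 4."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values)
    m4 = sum((x - mean) ** 4 for x in values)  # sum of fourth-power deviations
    numerator = n * (n + 1) * (n - 1) * m4
    denominator = (n - 2) * (n - 3) * m2 ** 2
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return numerator / denominator - correction

# Age values from the example dataset used in this post
ages = [25, 30, 35, 28, 22, 31, 45, 40, 27, 29]

manual_skew = sample_skewness(ages)
manual_kurt = sample_excess_kurtosis(ages)
pandas_skew = pd.Series(ages).skew()
pandas_kurt = pd.Series(ages).kurtosis()

print("Skewness:", manual_skew, "vs pandas:", pandas_skew)
print("Excess kurtosis:", manual_kurt, "vs pandas:", pandas_kurt)
```

With only ten observations these estimates are noisy, so they are best read as a rough indication of shape rather than a precise measurement.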
