The Basic Probability and Statistics for Machine Learning


Machine Learning is a multidisciplinary field that employs statistics, probability, and algorithms to learn from data and generate insights that can be used to build intelligent applications. In this essay, we will explore the basic probability and statistics concepts that machine learning relies on.

Probability and statistics are related branches of mathematics that analyze the relative frequency of occurrences.

Probability deals with predicting the likelihood of future events, while statistics involves the analysis of the frequency of past events.


The majority of people have an intuitive sense of degrees of probability, which is why we use terms like “probably” and “unlikely” in everyday speech, but we will discuss how to make quantitative statements about those degrees.

In probability theory, an event is a set of outcomes of an experiment to which a probability is assigned. If E is an event, P(E) is the probability that E will occur. A trial is a situation in which E may occur (success) or may not occur (failure).

An experiment may consist of anything, such as flipping a coin, rolling a die, or drawing a colored ball from a bag. In these instances the outcome is random, hence the variable that represents this outcome is known as a random variable.

Consider a simple example of coin tossing. If the coin is fair, there is an equal chance that it will land on heads or tails. In other words, if we tossed the coin many times, we would expect approximately half of the outcomes to be heads and the other half to be tails. In this case, the probability of tossing a head is one half, or 0.5.

The empirical probability of an event is calculated by dividing the number of occurrences of the event by the total number of observed instances. If n trials are conducted and we see s successes, the empirical probability of success is s/n. In the preceding example, any particular series of coin flips may contain more or fewer than 50% heads.

The theoretical probability is calculated by dividing the number of ways an event can occur by the total number of possible outcomes. A head can occur in one way and there are two possible outcomes (head, tail), so the theoretical probability of a head is 1/2.
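To see how the empirical probability s/n behaves next to the theoretical value of 1/2, here is a minimal sketch that simulates tosses of a fair coin. The function name `empirical_probability` and the fixed seed are illustrative choices, not part of any standard recipe.

```python
import random

def empirical_probability(n_trials, seed=0):
    """Estimate P(heads) for a fair coin as s / n from simulated tosses."""
    rng = random.Random(seed)  # seeded so that the run is repeatable
    successes = sum(rng.random() < 0.5 for _ in range(n_trials))
    return successes / n_trials

theoretical = 1 / 2  # one favourable outcome out of two possible outcomes

for n in (10, 1_000, 100_000):
    print(n, empirical_probability(n), theoretical)
```

With only a few tosses the empirical value can be far from 0.5, while with many tosses it tends to settle close to the theoretical probability.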

Joint Probability

The probability that both events A and B occur is denoted by P(A and B) or P(A ∩ B). If A and B are independent, P(A ∩ B) = P(A) * P(B). This holds only when A and B are independent, meaning that the occurrence of A has no effect on the probability of B and vice versa.
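The product rule for independent events can be checked by counting outcomes. The sketch below assumes a hypothetical experiment of rolling two fair six-sided dice, with A = "the first die shows 6" and B = "the second die is even"; neither event appears in the text above.

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered outcomes of rolling two fair six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Theoretical probability: favourable outcomes / possible outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def a(o):
    return o[0] == 6       # A: the first die shows 6

def b(o):
    return o[1] % 2 == 0   # B: the second die is even

p_a, p_b = prob(a), prob(b)
p_a_and_b = prob(lambda o: a(o) and b(o))

print(p_a, p_b, p_a_and_b)     # 1/6 1/2 1/12
print(p_a_and_b == p_a * p_b)  # True, because the two dice do not affect each other
```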

Conditional Probability

Now suppose that A and B are not independent, for example because the chance of B increases if A occurs. When A and B are not independent, it is often useful to compute the conditional probability P(A|B), which is the probability of A given that B has occurred:

P(A|B) = P(A∩B)/P(B)

Similarly, P(B|A) = P(A ∩ B)/P(A). We can therefore write the joint probability of A and B as P(A ∩ B) = P(A) * P(B|A), which means: “The chance of both things happening is the chance that the first one happens, times the chance of the second one given that the first happened.”
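As a small illustration of these formulas, the sketch below again assumes two fair dice, this time with dependent events: A = "the total is at least 10" and B = "the first die shows 6". The events and the helper `p` are hypothetical and chosen only for this example.

```python
from fractions import Fraction
from itertools import product

space = set(product(range(1, 7), repeat=2))  # two fair six-sided dice
A = {o for o in space if sum(o) >= 10}       # A: the total is at least 10
B = {o for o in space if o[0] == 6}          # B: the first die shows 6

def p(event):
    """Probability of an event given as a set of equally likely outcomes."""
    return Fraction(len(event), len(space))

p_a_given_b = p(A & B) / p(B)  # P(A|B) = P(A ∩ B) / P(B)
p_b_given_a = p(A & B) / p(A)  # P(B|A) = P(A ∩ B) / P(A)

print(p(A), p_a_given_b)                # 1/6 1/2
print(p(A & B) == p(A) * p_b_given_a)   # True
```

Knowing that the first die shows 6 raises the probability of a total of at least 10 from 1/6 to 1/2, which is exactly what it means for A and B not to be independent.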

Bayes’ Theorem

Bayes’ theorem relates the conditional probabilities of two events. For instance, if we wish to determine the probability of selling ice cream on a hot and sunny day, Bayes’ theorem enables us to use prior information about the chance of selling ice cream on any other sort of day (rainy, windy, snowy, etc.).

For two events H and E, Bayes’ theorem states:

P(H|E) = P(E|H) * P(H) / P(E)

P(H|E) is the conditional probability that event H occurs given that event E has occurred. The probability P(H) in the equation is essentially a frequency analysis: given our past data, what is the probability that the event will occur? P(E|H) is referred to as the likelihood and represents the probability of observing the evidence E given that H is true. P(E) is the overall (marginal) probability of observing the evidence.

Let H represent the event that we sell ice cream, and let E be the type of weather. Given the type of weather, we may then ask how likely we are to sell ice cream on a given day. This is expressed mathematically as P(H = ice cream sale | E = type of weather), which corresponds to the left side of the equation. The term P(H) on the right-hand side is known as the prior, since we may already know the marginal probability of ice cream sales. This is P(H = ice cream sale) in our case, or the probability of selling ice cream regardless of the weather. For instance, I may examine statistics indicating that 30 out of 100 potential customers purchased ice cream from a store. Before I know anything about the weather, my P(H = ice cream sale) = 30/100 = 0.3. Bayes’ Theorem enables us to incorporate prior information in this manner [2].

Interpreting clinical tests is a classic use of Bayes’ theorem. Suppose your doctor notifies you during a routine medical examination that you have tested positive for a rare condition. You are also aware that the results of these tests involve some uncertainty. Assume the test has a sensitivity (also known as the true positive rate) of 95% for patients with the condition and a specificity (also known as the true negative rate) of 95% for healthy people.

If “+” and “−” represent a positive and a negative test result, respectively, then the test accuracies correspond to the conditional probabilities P(+|disease) = 0.95 and P(−|healthy) = 0.95.

In Bayesian terminology, we wish to compute P(disease|+), the probability of disease given a positive test: P(disease|+) = P(+|disease) * P(disease) / P(+)

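How should P(+), the probability of a positive result, be evaluated? A positive result can arise in two ways: as a true positive, with probability P(+|disease), or as a false positive, with probability P(+|healthy). Weighting each case by how common it is gives P(+) = P(+|disease) * P(disease) + P(+|healthy) * P(healthy). The false positive rate is the complement of the true negative rate, so P(+|healthy) = 1 − P(−|healthy) = 0.05.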

Importantly, Bayes’ theorem demonstrates that in order to compute the conditional probability that you have the disease given a positive test result, you must know the “prior” probability that you have the disease, P(disease), given no other information. That is, you must know the disease’s prevalence in the population to which you belong. Assuming these tests are performed on a population in which the actual prevalence of the condition is 0.5%, P(disease) = 0.005 and P(healthy) = 0.995.

So, P(disease|+) = 0.95 * 0.005 / (0.95 * 0.005 + 0.05 * 0.995) ≈ 0.087

In other words, despite the apparent accuracy of the test, the probability that you have the condition is less than 9 percent. A positive result does raise the probability that you have the condition well above the 0.5% prevalence. However, the 95% test accuracy should not be interpreted as the probability that you have the condition.
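The calculation for the clinical test can be sketched in a few lines. The function name `posterior` and its parameter names are illustrative; the numbers are the sensitivity, specificity, and prevalence used in this example.

```python
def posterior(prior, sensitivity, specificity):
    """P(disease | +) from Bayes' theorem for a binary test."""
    false_positive_rate = 1 - specificity  # P(+ | healthy)
    # P(+): true positives plus false positives, weighted by prevalence
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

p_rare = posterior(prior=0.005, sensitivity=0.95, specificity=0.95)
print(round(p_rare, 3))    # 0.087

# The same test applied where half of the population has the condition
p_common = posterior(prior=0.5, sensitivity=0.95, specificity=0.95)
print(round(p_common, 3))  # 0.95
```

The second call shows how strongly the answer depends on the prior: the same positive result is far more convincing when the condition is common.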

Descriptive Statistics

Descriptive statistics refers to techniques for summarizing and organizing the information in a data set. We will use the table below to explain some statistical concepts.

The entities for which information is gathered are referred to as elements. The elements in the table above are the ten applicants. Elements are sometimes known as subjects or cases.

A characteristic of an element is referred to as a variable. It can take different values for different elements. The variables in the table include, for example, marital status, mortgage, income, rank, year, and risk. Attributes are another name for variables.

Variables might be either qualitative or quantitative.

A qualitative variable permits the classification or categorization of elements based on some characteristic. The qualitative variables are marital status, mortgage, rank, and risk. Qualitative variables are sometimes termed categorical variables.

A quantitative variable accepts numeric values and allows meaningful arithmetic operations to be done on it. The quantitative variables are year and income. Quantitative variables are sometimes termed numerical variables.

Discrete Variable: A discrete variable is a numerical variable that can take a finite or countable number of values, each of which can be plotted as a separate point on the number line with space between the points. The variable “year” is an example of a discrete variable.

Continuous Variable: A continuous variable is a numerical variable that can take infinitely many values, whose possible values form a continuous interval on the number line. An example of a continuous variable is “income.”

A population is the collection of all elements of interest for a particular problem. A parameter is a characteristic of a population.

A sample is a subset of the entire population. A statistic is a property of a sample.

A random sample is one in which each element has an equal probability of being picked.

Measures of Center: Mean, Median, Mode, Mid-range

Measures of center indicate where on the number line the central part of the data is located.

A data set’s mean is its arithmetic average. To determine the mean, sum the values and divide by the total number of values. The sample mean, denoted x̄ (“x-bar”), is the arithmetic average of a sample. The population mean is the arithmetic average of a population and is represented by the Greek letter μ (“mu”).

When there are an odd number of data values that have been sorted in ascending order, the median is the middle value. If the number is even, the median is the average of the two middle data values. When the income statistics are arranged in ascending order, the two middle values are $32,100 and $32,200, with a mean of $32,150 representing the median income.

The mode is the data value with the highest frequency of occurrence. Modes are applicable to both quantitative and categorical variables, whereas means and medians apply only to quantitative variables. Since each income value occurs only once, income has no mode. The mode of year is 2010, with a frequency of 4.

A data set’s mid-range is the average of its maximum and minimum values. The mid-range of income is: mid-range(income) = (max(income) + min(income))/2 = (48,000 + 24,000)/2 = $36,000
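The four measures of center can be computed with Python's standard `statistics` module. The values in `data` below are hypothetical and are not the applicant incomes from the table.

```python
import statistics

data = [2, 3, 3, 5, 7, 10]  # hypothetical values, not the applicant table

mean = statistics.mean(data)             # sum of the values / number of values
median = statistics.median(data)         # average of the two middle values (n is even)
mode = statistics.mode(data)             # most frequent value
mid_range = (max(data) + min(data)) / 2  # average of the maximum and the minimum

print(mean, median, mode, mid_range)
```

Here the mean is 5, the median is 4, the mode is 3, and the mid-range is 6, which shows that the four measures need not agree, especially when one large value pulls the mean and mid-range upward.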

Measures of Variability: Range, Variance, Standard Deviation

Measures of variability quantify the amount of variation, spread, or dispersion present in the data.

The range of a variable equals the difference between the maximum and minimum values. The range of income is: range(income) = max(income) − min(income) = 48,000 − 24,000 = $24,000

The range reflects only the difference between the largest and smallest observations; it says nothing about how the rest of the data are spread around the center.

Population variance is defined as the average of the squared deviations from the mean, denoted as 𝜎² (“sigma-squared”): 𝜎² = Σ (xᵢ − μ)² / N, where N is the population size.

A larger variance means the data are more spread out.

The sample variance s² is approximately the mean of the squared deviations from the sample mean, with N replaced by n − 1: s² = Σ (xᵢ − x̄)² / (n − 1). The divisor n − 1 is used because the sample mean is only an approximation of the true population mean.

Standard Deviation
The standard deviation or sd of a bunch of numbers tells you how much the individual numbers tend to differ from the mean.

The sample standard deviation is the square root of the sample variance: sd = √s². For example, incomes typically deviate from their mean by about $7,201.

The population standard deviation is the square root of the population variance: sd = √𝜎².

The smaller the standard deviation, the closer the data points are to the mean and the narrower the peak of the distribution. The further the data points are from the mean, the greater the standard deviation.
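The difference between dividing by N and dividing by n − 1 is easy to see in code. This sketch uses hypothetical values (not the applicant incomes) and compares the hand-written formulas with the corresponding functions in Python's `statistics` module.

```python
import math
import statistics

data = [2, 3, 3, 5, 7, 10]  # hypothetical values, not the applicant table
n = len(data)
mean = sum(data) / n

squared_deviations = [(x - mean) ** 2 for x in data]

value_range = max(data) - min(data)                  # maximum minus minimum
population_variance = sum(squared_deviations) / n    # divide by N
sample_variance = sum(squared_deviations) / (n - 1)  # divide by n - 1
population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

# Cross-check against the standard library
print(math.isclose(population_variance, statistics.pvariance(data)))  # True
print(math.isclose(sample_variance, statistics.variance(data)))       # True
print(value_range, round(population_sd, 2), round(sample_sd, 2))
```

The sample variance is always a little larger than the population variance computed from the same numbers, because the same sum is divided by a smaller number.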

Measures of Position: Percentile, Z-score, Quartiles

Measures of position indicate the relative position of a particular data value in the data distribution.

The pth percentile of a data set is the value at or below which p percent of the values in the data set fall. The median is the 50th percentile: for instance, 50% of the income values fall at or below the median income of $32,150.

Percentile rank
The percentile rank of a data value is the percentage of values in the data set that are equal to or below that value. For instance, the percentile rank of Applicant 1’s $38,000 income is 90%, as 90% of all incomes are equal to or less than $38,000.
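Following the definition above, a percentile rank is just a count of the values at or below the value of interest. The helper `percentile_rank` and the ten values below are hypothetical and serve only as an illustration.

```python
def percentile_rank(values, x):
    """Percentage of values that are less than or equal to x."""
    at_or_below = sum(1 for v in values if v <= x)
    return 100 * at_or_below / len(values)

data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]  # hypothetical values

print(percentile_rank(data, 90))  # 90.0: nine of the ten values are <= 90
print(percentile_rank(data, 50))  # 50.0
```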

Interquartile Range (IQR)
The first quartile (Q1) of a data set corresponds to the 25th percentile, the second quartile (Q2) corresponds to the median (50th percentile), and the third quartile (Q3) corresponds to the 75th percentile.

The IQR is the difference between the 75th and 25th percentiles: IQR = Q3 − Q1.

A data value x is an outlier if either x ≤ Q1 − 1.5 (IQR) or x ≥ Q3 + 1.5 (IQR).
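The 1.5 × IQR rule can be sketched as follows. Note that there are several conventions for computing quartiles, so Q1 and Q3 may differ slightly between tools; this example assumes the "inclusive" method of Python's `statistics.quantiles`, and both the data and the helper `find_outliers` are hypothetical.

```python
import statistics

def find_outliers(values):
    """Return (q1, q3, iqr, outliers) using the 1.5 * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [x for x in values if x <= lower_fence or x >= upper_fence]
    return q1, q3, iqr, outliers

data = [12, 15, 14, 10, 13, 16, 11, 40]  # hypothetical values
q1, q3, iqr, outliers = find_outliers(data)

print(q1, q3, iqr)  # 11.75 15.25 3.5
print(outliers)     # [40]
```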

The Z-score for a specific data value indicates how many standard deviations above or below the mean the value lies: Z-score = (x − x̄)/s, where x̄ is the mean and s is the standard deviation.

If z is positive, the value is above the mean; if z is negative, the value is below the mean. The Z-score for Applicant 6 is (24,000 − 32,540)/7,201 ≈ −1.2, which indicates that Applicant 6’s income is 1.2 standard deviations below the mean.
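The Z-score for Applicant 6 can be reproduced from the figures quoted in this article: an income of $24,000, a mean income of $32,540, and a standard deviation of $7,201. The function name `z_score` is an illustrative choice.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

z = z_score(24_000, mean=32_540, sd=7_201)
print(round(z, 1))  # -1.2
```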

Uni-variate Descriptive Statistics
Patterns in univariate data can be described using central tendency: mean, mode, and median; and dispersion: range, variance, maximum, minimum, quartiles, and standard deviation.

Bar Charts, Histograms, Pie Charts, etc., are examples of the different plots used to display univariate data.

Bi-variate Descriptive Statistics
Bi-variate analysis examines two variables in order to determine the empirical relationship between them. Typically, scatter plots and box plots are used to depict bivariate data.

Scatter Plots
A scatter plot is the easiest way to depict the relationship between x and y, and it is a common graph for two continuous variables. Each (x, y) point is plotted on a Cartesian plane with the x axis horizontal and the y axis vertical. Sometimes, scatter plots are referred to as correlation plots since they illustrate the correlation between two variables.

A correlation is a statistical measure intended to assess the degree of association between two variables. The correlation coefficient r measures the strength and direction of the linear relationship between two quantitative variables. The definition of the correlation coefficient is:

r = Σ (xᵢ − x̄)(yᵢ − ȳ) / ((n − 1) * sx * sy)

where sx represents the standard deviation of the x-variable and sy represents the standard deviation of the y-variable. The coefficient always satisfies −1 ≤ r ≤ 1.

If r is positive and statistically significant, x and y are said to be positively correlated. A rise in x tends to be accompanied by a rise in y.

If r is negative and statistically significant, x and y are said to be negatively correlated. A rise in x tends to be accompanied by a decline in y.
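Here is a minimal sketch of the correlation coefficient, assuming the common sample definition r = Σ (xᵢ − x̄)(yᵢ − ȳ) / ((n − 1) * sx * sy) with sample standard deviations. The function `correlation` and the three small data sets are hypothetical.

```python
import statistics

def correlation(xs, ys):
    """Sample correlation coefficient r between two equally long sequences."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # sample standard deviations
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * s_x * s_y)

x = [1, 2, 3, 4, 5]
y_up = [2, 4, 5, 4, 5]      # tends to rise with x
y_down = [10, 8, 6, 4, 2]   # falls exactly linearly with x

r_up = correlation(x, y_up)
r_down = correlation(x, y_down)
print(round(r_up, 3), round(r_down, 3))
```

The first result is positive but below 1 because the points only roughly follow a rising line, while the second is −1 (up to floating-point rounding) because the points lie exactly on a falling line. Judging statistical significance would need a separate test that this sketch does not cover.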

Box Plots
A box plot, often known as a box and whisker plot, is used to depict the distribution of numerical values. It is commonly used when one variable is categorical and the other is continuous. A box plot divides the data values into four portions called quartiles. You begin by locating the center or median value. The median divides the values of the data set in half. By determining the median of each half of the data, the quartiles are created.

Each box on the box plot displays the range of values from the median of the lower half of the values (Q1) at the bottom to the median of the upper half of the values (Q3) at the top. A line in the center of the box corresponds to the median of all data values. The whiskers then indicate the largest and smallest data values.

The five-number summary of a data set consists of the minimum, Q1, the median, Q3, and the maximum.

The left whisker extends to the smallest value that is not an outlier. The right whisker extends to the largest value that is not an outlier. The distribution is skewed to the left when the left whisker is longer than the right whisker, and skewed to the right in the opposite case. When the whiskers are roughly the same length, the distribution is approximately symmetric.
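The numbers that a box plot draws can be computed directly. This sketch uses hypothetical data and assumes the "inclusive" quartile method of Python's `statistics.quantiles`; plotting libraries may use a slightly different quartile convention, so their boxes can differ a little.

```python
import statistics

data = sorted([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])  # hypothetical values

q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
five_number_summary = (min(data), q1, median, q3, max(data))

# Whiskers stop at the most extreme values that are not outliers
iqr = q3 - q1
inliers = [x for x in data if q1 - 1.5 * iqr < x < q3 + 1.5 * iqr]
left_whisker, right_whisker = min(inliers), max(inliers)

print(five_number_summary)          # (1, 3.25, 5.5, 7.75, 50)
print(left_whisker, right_whisker)  # 1 9
```

The maximum of 50 appears in the five-number summary, but the right whisker ends at 9 because 50 falls beyond the outlier threshold and would be drawn as a separate point.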

Thank you for reading.