Statistics is the foundation of business analytics. Many people dismiss statistics as dry and irrelevant, but even a little statistics goes a long way in analytics. Statistics is really the field of “evidence-based decision making.”
You can think of statistics as a way of converting data into meaningful information. Rather than reporting every number in the data, we can report a statistic that summarizes it. “Descriptive statistics” are statistics that simply describe the data. It is important to understand the data you have before you start using it; this is the basis of descriptive analytics, the first of the major categories of analytics.
Every field in a dataset has a distribution. A distribution is the shape of the data: where the values are mostly grouped and how spread out they are. Two key aspects of a distribution are central tendency and dispersion. Here are some of the statistics associated with each.
Measures of central tendency
- Mean (the average of the distribution)
- Median (the 50th percentile of a distribution)
- Mode (the most common value of the distribution)
If a distribution is symmetric, then mean = median. If the distribution is skewed to the right, then mean > median; if it is skewed to the left, then mean < median. Income is a classic example of a variable that is skewed to the right, so mean population income > median population income.
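To make this concrete, here is a minimal Python sketch, using only the standard library’s statistics module, that computes all three measures for a small, made-up set of right-skewed incomes:

```python
import statistics

# Hypothetical, right-skewed incomes (one high earner pulls the mean up)
incomes = [32_000, 35_000, 35_000, 41_000, 48_000, 52_000, 250_000]

print(statistics.mean(incomes))    # ~70,429 -- pulled up by the outlier
print(statistics.median(incomes))  # 41,000 -- the 50th percentile
print(statistics.mode(incomes))    # 35,000 -- the most common value
```

Note that the mean is well above the median, which is exactly the signature of a right-skewed distribution.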
Measures of dispersion
- Variance
- Standard deviation
- Min and Max
Informally, variance measures how far a set of numbers is spread out from its average value. Low variance indicates that the numbers are tightly clustered around the mean; high variance indicates that they are widely spread out. Standard deviation is the square root of the variance, which puts the measure of spread back in the same units as the data itself.
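Here is a similar sketch for the dispersion measures, again with made-up numbers, showing how a wider spread produces a larger variance and standard deviation:

```python
import statistics

low_spread = [48, 49, 50, 51, 52]   # tightly clustered around 50
high_spread = [10, 30, 50, 70, 90]  # same mean, much more spread out

# Sample variance, and standard deviation (the square root of the variance)
print(statistics.variance(low_spread), statistics.stdev(low_spread))    # 2.5, ~1.58
print(statistics.variance(high_spread), statistics.stdev(high_spread))  # 1000.0, ~31.62
```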
The normal distribution is the “bell curve” that describes many distributions occurring in nature, and it is often used to model random variables. It is a symmetric distribution, and about 95% of its values lie within two standard deviations of the mean (between -2 and +2 standard deviations). It is often helpful to create a histogram of a data variable to see whether or not it resembles a normal distribution.
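As an illustration, the sketch below (assuming NumPy and Matplotlib are available) simulates normally distributed values, checks that roughly 95% fall within two standard deviations of the mean, and plots a histogram:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=42)
values = rng.normal(loc=100, scale=15, size=10_000)  # simulated: mean 100, sd 15

mean, sd = values.mean(), values.std()
within_2sd = np.mean(np.abs(values - mean) <= 2 * sd)
print(f"Share within 2 standard deviations: {within_2sd:.1%}")  # roughly 95%

plt.hist(values, bins=50)  # bell-shaped if the data resemble a normal distribution
plt.show()
```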
Some fields in a dataset are related to other fields. In other words, the values in the two fields tend to move together.
Measures of relationship
- Correlation
Correlation measures whether two variables “move together.” A zero correlation means they are unrelated. A positive correlation means they move in the same direction (they rise and fall together). A negative correlation means they move in opposite directions (when one rises, the other falls). Correlation takes values between -1 and 1: +1 indicates a perfect positive linear relationship, and -1 indicates a perfect inverse linear relationship.
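Here is a minimal sketch of measuring correlation, assuming NumPy is available; the ad spend and sales figures are made up, with sales constructed to rise with spend:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
ad_spend = rng.uniform(10, 100, size=50)           # hypothetical monthly ad spend
sales = 5 * ad_spend + rng.normal(0, 40, size=50)  # sales that tend to rise with spend

# Pearson correlation coefficient: a value between -1 and 1
r = np.corrcoef(ad_spend, sales)[0, 1]
print(f"Correlation between ad spend and sales: {r:.2f}")  # positive, close to 1
```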
Categorical variables
One of the key distinctions in data types is categorical versus continuous variables. A continuous variable can take on a large number of possible values, like the dollar amount of monthly sales. In contrast, a categorical variable has only a few possible values, like a survey response from 1 to 5. Most of us are more familiar with continuous variables, but working with categorical variables is a little different. Categorical variables are also called discrete variables, because they take on a few discrete values.

One important type of categorical variable is the binary 0/1 variable. Binary is the basis of computer operations (think 0’s and 1’s in the Matrix), and it creates a helpful “Yes” or “No” distinction. We could measure a company’s monthly profitability as a continuous variable, or we could ask, “Was the company profitable in that month?” This is a yes-or-no question. Any variable that takes on two values (like “yes” and “no”) can be converted into a 0/1 variable, where 1 reflects one of the two possible values. Converting binary variables into 0/1 variables is helpful because the variable can then be analyzed numerically. For instance, the proportion of “yes” answers can be calculated as the mean of a variable in which “yes” = 1 and “no” = 0. This type of 0/1 variable is often referred to as a dummy variable. One common use of dummy variables is to segment data into two groups for comparison. Categorical variables are the basis of segmentation, which is very common in marketing analytics.
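For instance, here is a minimal sketch of converting a hypothetical yes/no variable into a dummy variable and using its mean as the proportion of “yes” answers:

```python
# Hypothetical monthly records: was the company profitable each month?
responses = ["yes", "no", "yes", "yes", "no", "yes", "yes", "no"]

# Convert the binary variable into a 0/1 dummy variable (1 = "yes")
profitable = [1 if r == "yes" else 0 for r in responses]

# The mean of a 0/1 variable is the proportion of 1s:
# here, the share of profitable months
share_profitable = sum(profitable) / len(profitable)
print(f"Proportion of profitable months: {share_profitable:.1%}")  # 5 of 8 = 62.5%
```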
Categorical variables are also closely related to probability. Probability is the likelihood of one outcome versus another. A lot of statistics is about estimating probabilities, because it is a way of gaining insights about the future. Knowing whether an outcome is likely or unlikely can be extremely helpful. A 50-50 chance is very different from a 90% probability. No one likes to leave things totally up to chance. In this way, business analytics helps move decision-making from possibilities to probabilities.
Measures of statistical significance
- Confidence intervals
- P-values
- T-test
In science, we test hypotheses using experiments. There is a control group and a treatment group; the treatment is some external change applied to one of the two groups. A hypothesis is a statement about a predicted outcome for the treatment group. For example, we could hypothesize that “living cells grow larger when given glucose.” Testing a hypothesis involves comparing the control group and the treatment group: is the treatment group different from the control group after receiving the treatment? In the example, are the cells larger after receiving glucose?
Statistics is useful for comparing measurements of the treatment and control groups when testing a hypothesis. For instance, is the mean measurement of the treatment group greater than that of the control group? Descriptive statistics like the mean are useful for this comparison, but we also need statistics to tell us whether the means of the two samples are statistically different. If the two distributions are sufficiently similar, the difference may not be “statistically significant.”
The T-test is a good introduction to hypothesis testing. It tests whether the means of two groups are statistically different, and it does more than just compare the means: it also incorporates the variance of the two distributions. The t-value measures the size of the difference relative to the variation in the sample data. If the variance is large, the means of the two groups may not be statistically different; two means can be different without being statistically different, typically because wide distributions overlap heavily. We could set up a hypothesis that the means of two groups are the same (a difference of zero) and then test this hypothesis with a T-test. If you are going to recommend a business decision based on a measured difference between two groups, you want to make sure that the difference is statistically significant.
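Here is a minimal sketch of a two-sample T-test, assuming SciPy and NumPy are available; the two store groups and the promotion scenario are made up for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
# Hypothetical daily sales for two store groups (the treatment group ran a promotion)
control = rng.normal(loc=500, scale=80, size=40)
treatment = rng.normal(loc=540, scale=80, size=40)

# Two-sample t-test of the hypothesis that the two group means are equal
t_stat, p_value = stats.ttest_ind(treatment, control)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# A common convention: p < 0.05 means the difference is statistically significant
if p_value < 0.05:
    print("The difference in means is statistically significant.")
else:
    print("The difference is not statistically significant.")
```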
Multivariate regression is a powerful way of analyzing the factors related to an outcome of interest.
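As a preview, here is a minimal sketch of a multivariate regression fit by ordinary least squares with NumPy; the factors and their coefficients are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=2)
n = 100
ad_spend = rng.uniform(10, 100, n)  # hypothetical factor 1: monthly ad spend
num_reps = rng.integers(1, 10, n)   # hypothetical factor 2: number of sales reps
sales = 3 * ad_spend + 20 * num_reps + rng.normal(0, 30, n)  # the outcome

# Ordinary least squares: estimate the coefficient on each factor plus an intercept
X = np.column_stack([np.ones(n), ad_spend, num_reps])
coefs, *_ = np.linalg.lstsq(X, sales, rcond=None)
print(coefs)  # roughly [intercept near 0, ~3, ~20]
```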