Statistical Terms

Data-Driven Analytics vs User-Driven Analytics: The former is based purely on the data, with no prior knowledge of what you are after, or with a proposed business goal that the data may or may not support. The latter starts with conceiving ideas and then turns to the data to see whether those ideas have merit, would stand up to testing, and are supported by the data. The two approaches can be complementary, and each has its advantages and disadvantages.

Data-driven analytics is best suited for large datasets because it is hard for human beings to wrap their minds around huge amounts of data. The analysis is broad and does not concern itself with a specific search or with validating a preconceived idea. This approach to analytics can be viewed as a broad, open-ended form of data mining.

User-driven analytics requires not only strategic thinking but also enough in-depth knowledge of the business domain to back up the strategizing. Vision and intuition can be very helpful here; you are looking at how the data lends specific support to ideas you deemed important and strategic. This approach to predictive analytics is defined (and limited) by the scope of the ideas you are probing. Decision-making becomes easier when the data supports your ideas.

Normalization vs Standardization

Maximum Likelihood Estimation (MLE): a method of estimating the parameters of a statistical model from observed data. It chooses the parameter values that maximize the likelihood function, that is, the values under which the observed data are most probable.
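As a small illustration, here is a Python sketch of MLE for the simplest case, assuming the data are independent draws from a normal distribution. In that case the maximizing values have a closed form. The function names and the sample values are made up for this example.

```python
import math

def normal_mle(data):
    """Closed-form MLE of (mean, variance) for i.i.d. normal data."""
    n = len(data)
    mu_hat = sum(data) / n
    # The MLE of the variance divides by n (not n - 1), so it is slightly biased.
    var_hat = sum((x - mu_hat) ** 2 for x in data) / n
    return mu_hat, var_hat

def normal_log_likelihood(data, mu, var):
    """Log-likelihood of the data under a Normal(mu, var) model."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2 * math.pi * var) - squared_error / (2 * var)

sample = [4.8, 5.1, 5.3, 4.9, 5.4]  # made-up measurements
mu_hat, var_hat = normal_mle(sample)
best = normal_log_likelihood(sample, mu_hat, var_hat)
```

Any other choice of mean or variance gives a log-likelihood no larger than `best`, which is what "maximum likelihood" means here.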

Akaike Information Criterion (AIC): a measure of the relative quality of statistical models for a given set of data.

Bayesian Information Criterion (BIC): a criterion for model selection among a finite set of models; the model with the lowest BIC is preferred. It is based, in part, on the likelihood function and is closely related to the Akaike information criterion (AIC). When fitting models, it is possible to increase the likelihood by adding parameters, but doing so may result in overfitting. Both BIC and AIC address this problem by introducing a penalty term for the number of parameters in the model; the penalty term is larger in BIC than in AIC whenever log(n) > 2, that is, for samples of 8 or more observations.

AIC and BIC are both penalized-likelihood criteria. The AIC or BIC for a model is usually written in the form -2 log(L) + k*p, where L is the maximized value of the likelihood function, p is the number of parameters in the model, n is the number of observations, and k is 2 for AIC and log(n) for BIC.
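The penalized-likelihood form translates directly into code. The Python sketch below assumes you already have the maximized log-likelihood from a fitted model. The two "fits" are hypothetical numbers chosen only to show how the penalty can reverse a ranking based on likelihood alone.

```python
import math

def aic(log_likelihood, num_params):
    """AIC = -2 log(L) + 2 p."""
    return -2.0 * log_likelihood + 2.0 * num_params

def bic(log_likelihood, num_params, n_obs):
    """BIC = -2 log(L) + log(n) p, using the natural logarithm."""
    return -2.0 * log_likelihood + math.log(n_obs) * num_params

n_obs = 100
# Hypothetical fitted models: the larger one has a slightly higher likelihood.
small_model = {"log_likelihood": -120.0, "num_params": 2}
large_model = {"log_likelihood": -118.5, "num_params": 5}

aic_small = aic(**small_model)
aic_large = aic(**large_model)
bic_small = bic(n_obs=n_obs, **small_model)
bic_large = bic(n_obs=n_obs, **large_model)
```

With these numbers both criteria prefer the smaller model (lower value), even though the larger model has the higher likelihood.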

Standard Deviation vs Variance: the standard deviation is the square root of the variance. Variance is defined as the average of the squared differences from the mean.
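A quick numerical check in Python, using the standard library's `statistics` module on made-up values. Note that `pvariance` averages over n (the definition given above), while `variance` divides by n - 1, the usual choice when the data are a sample.

```python
import math
import statistics

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up data, mean is 5

population_variance = statistics.pvariance(values)  # average squared deviation (divide by n)
population_sd = statistics.pstdev(values)           # square root of the population variance
sample_variance = statistics.variance(values)       # divides by n - 1 instead of n

sd_is_root_of_variance = math.isclose(population_sd, math.sqrt(population_variance))
```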

Z-test: used for testing the mean of a population versus a standard, or comparing the means of two populations, with large (n ≥ 30) samples whether you know the population standard deviation or not. It is also used for testing the proportion of some characteristic versus a standard proportion, or comparing the proportions of two populations. Example: Comparing the average engineering salaries of men versus women.
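A minimal Python sketch of a two-sample z-test for means, assuming both samples are large enough that the sample standard deviations can stand in for the population values. The function name and the data are illustrative only.

```python
import math
from statistics import NormalDist, mean, stdev

def two_sample_z_test(a, b):
    """Two-sided z-test of H0: both populations have the same mean."""
    # Standard error of the difference in means, using the sample SDs.
    se = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    z = (mean(a) - mean(b)) / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value

# Made-up salary figures (in thousands) for two groups of 40 people each.
group_a = [50 + (i % 7) for i in range(40)]
group_b = [52 + (i % 5) for i in range(40)]

z, p_value = two_sample_z_test(group_a, group_b)
```

A small p-value (below the chosen significance level) would lead you to reject the hypothesis that the two population means are equal.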

T-test: used to test whether two samples have the same mean, that is, to examine the difference between the means of two groups. It assumes that the samples are drawn from normal distributions. For example, in an experiment you may want to compare the overall mean of the group on which the manipulation took place with that of a control group. A t-test is used for testing the mean of one population against a standard, or comparing the means of two populations, when you do not know the populations’ standard deviations and have a limited sample (n < 30). Example: Measuring the average diameter of shafts from a certain machine when you have a small sample. See here for examples of t-tests (one-sample, two-sample, paired).
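For the one-sample case, the t statistic is simple to compute by hand. The Python sketch below uses made-up shaft diameters and a hypothetical target of 10.0. It stops at the statistic and its degrees of freedom, because converting t into a p-value needs the t-distribution, which Python's standard library does not provide (a statistics package such as SciPy does).

```python
import math
from statistics import mean, stdev

def one_sample_t(sample, mu0):
    """Return (t statistic, degrees of freedom) for H0: population mean = mu0."""
    n = len(sample)
    standard_error = stdev(sample) / math.sqrt(n)  # stdev uses n - 1
    t = (mean(sample) - mu0) / standard_error
    return t, n - 1

# Made-up diameters (mm) from a small sample of shafts; target diameter is 10.0.
diameters = [10.02, 9.98, 10.05, 10.01, 9.97, 10.03]
t_stat, dof = one_sample_t(diameters, mu0=10.0)
```

The statistic is then compared against the critical value of the t-distribution with `dof` degrees of freedom at the chosen significance level.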

However, if you have more than two groups, you shouldn’t just use multiple t-tests, as the error adds up (see familywise error) and you increase your chances of finding an effect when there really isn’t one (i.e. a type I error). Therefore, when you have more than two groups to compare, e.g. in a drug trial with a high-dose, a low-dose and a placebo group (so 3 groups), you use ANOVA to examine whether there are any differences between the groups.

F-test: used to test whether two samples have the same variance, that is, to compare the variances of two populations. The same assumptions hold as for the t-test. The samples can be any size. It is the basis of Analysis of Variance (ANOVA). Example: Comparing the variability of bolt diameters from two machines.
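The F statistic itself is just the ratio of the two sample variances. A short Python sketch with made-up bolt diameters follows. The function name is illustrative, and the comparison against an F critical value is left out because it needs the F-distribution.

```python
from statistics import variance

def f_statistic(a, b):
    """Return (F, (df numerator, df denominator)) for comparing two sample variances."""
    return variance(a) / variance(b), (len(a) - 1, len(b) - 1)

# Made-up bolt diameters (mm) from two machines.
machine_a = [5.01, 4.98, 5.03, 4.97, 5.02]
machine_b = [5.00, 5.01, 4.99, 5.00, 5.01]

f_value, degrees_of_freedom = f_statistic(machine_a, machine_b)
```

A value of F far from 1 suggests that the two machines differ in variability.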

Z-Distribution vs t-Distribution: The standard normal distribution (or Z-distribution) is the most commonly used normal distribution, with a mean of 0 and a standard deviation of 1. The t-distribution can be thought of as a cousin of the standard normal distribution: it looks similar in that it’s centered at zero and has a basic bell shape, but it’s shorter and flatter around the center than the Z-distribution. Its standard deviation is larger than that of the Z-distribution, which is why you see the fatter tails on each side. See plot here.

The t-distribution is typically used to study the mean of a population, rather than to study the individuals within a population. In particular, it is used in many cases when you use data to estimate the population mean — for example, using the sample mean of 20 homes to estimate the average price of all the new homes in California.

The connection between the normal distribution and the t-distribution is that the t-distribution is often used for analyzing the mean of a population if the population has a normal distribution (or fairly close to it). Its role is especially important if your data set is small or if you don’t know the standard deviation of the population (which is often the case).

When statisticians use the term t-distribution, they aren’t talking about just one individual distribution. There is an entire family of specific t-distributions, depending on what sample size is being used to study the population mean. Each t-distribution is distinguished by what statisticians call its degrees of freedom. In situations where you have one population and your sample size is n, the degrees of freedom for the corresponding t-distribution is n – 1. For example, a sample of size 10 uses a t-distribution with 10 – 1, or 9, degrees of freedom.

Hypothesis Testing: The process of hypothesis testing can seem to be quite varied with a multitude of test statistics. But the general process is the same. Hypothesis testing involves the statement of a null hypothesis, and the selection of a level of significance. The null hypothesis is either true or false, and represents the default claim for a treatment or procedure. For example, when examining the effectiveness of a drug, the null hypothesis would be that the drug has no effect on a disease.

After formulating the null hypothesis and choosing a level of significance, we acquire data through observation. Statistical calculations tell us whether or not we should reject the null hypothesis.

In an ideal world we would always reject the null hypothesis when it is false, and we would not reject the null hypothesis when it is indeed true. But there are two other scenarios that are possible, each of which will result in an error.

Type I Error: The first kind of error that is possible involves the rejection of a null hypothesis that is actually true. This kind of error is called a type I error, and is sometimes called an error of the first kind.

Type I errors are equivalent to false positives. Let’s go back to the example of a drug being used to treat a disease. If we reject the null hypothesis in this situation, then our claim is that the drug does in fact have some effect on a disease. But if the null hypothesis is true, then in reality the drug does not combat the disease at all. The drug is falsely claimed to have a positive effect on a disease.

Type I errors can be controlled. The value of alpha, the level of significance that we selected, has a direct bearing on type I errors: alpha is the maximum probability of a type I error. For a 95% confidence level, the value of alpha is 0.05. This means that, when the null hypothesis is true, there is a 5% probability that we will reject it. In the long run, about one out of every twenty tests performed at this level on a true null hypothesis will result in a type I error.
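The long-run reading of alpha can be checked with a small simulation. The Python sketch below repeatedly draws samples for which the null hypothesis is true (mean 0, known standard deviation 1) and counts how often a two-sided z-test rejects it. The function name, sample size, number of repetitions and seed are arbitrary choices for this illustration.

```python
import math
import random
from statistics import NormalDist, mean

def false_positive_rate(n_tests=2000, n=30, alpha=0.05, seed=0):
    """Fraction of tests that reject a null hypothesis that is actually true."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    rejections = 0
    for _ in range(n_tests):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]  # H0 is true: mean is 0
        z = mean(sample) * math.sqrt(n)  # z statistic with known sigma = 1
        if abs(z) > z_crit:
            rejections += 1  # a type I error
    return rejections / n_tests

rate = false_positive_rate()
```

The observed rate should land close to alpha, though not exactly on it, since it is itself a random quantity.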

Type II Error: The other kind of error that is possible occurs when we do not reject a null hypothesis that is false. This sort of error is called a type II error, and is also referred to as an error of the second kind.

Type II errors are equivalent to false negatives. If we think back again to the scenario in which we are testing a drug, what would a type II error look like? A type II error would occur if we accepted that the drug had no effect on a disease, but in reality it did.

The probability of a type II error is given by the Greek letter beta. This number is related to the power or sensitivity of the hypothesis test, denoted by 1 – beta.
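For a simple case, beta and power can be computed directly. The Python sketch below assumes a one-sided z-test with known standard deviation. `effect` is a hypothetical true shift of the mean away from the null value, and all names and numbers are illustrative.

```python
import math
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Power (1 - beta) of a one-sided z-test when the true mean is mu0 + effect."""
    z_crit = NormalDist().inv_cdf(1 - alpha)  # rejection threshold under H0
    shift = effect * math.sqrt(n) / sigma     # how far the true mean moves the z statistic
    return 1.0 - NormalDist().cdf(z_crit - shift)

power = z_test_power(effect=0.5, sigma=1.0, n=30)
beta = 1.0 - power  # probability of a type II error
```

Increasing the sample size or the true effect raises the power and lowers beta. With no effect at all, the "power" reduces to alpha.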

See also a pictorial view of Type I and Type II errors. Read also Confusion Matrix, ROC Curves and Area Under the Curve (AUC).

Confidence Interval (CI): An interval (range of values) computed from sample data that is likely to cover the true parameter of interest. A 95% confidence level means that, if the sampling were repeated many times, about 95% of the intervals computed this way would include the parameter. See here for an example, and for using R to find the CI, example#1 & example#2.
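A minimal Python sketch of a confidence interval for a mean, assuming a sample large enough for the normal approximation (for small samples a t critical value would be used instead of z). The function name and the price data are made up.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    half_width = z * stdev(sample) / math.sqrt(len(sample))
    center = mean(sample)
    return center - half_width, center + half_width

# Made-up home prices (in thousands) for 36 homes.
prices = [300 + 5 * (i % 9) for i in range(36)]

low_95, high_95 = mean_confidence_interval(prices, confidence=0.95)
low_99, high_99 = mean_confidence_interval(prices, confidence=0.99)
```

Asking for a higher confidence level widens the interval: the 99% interval contains the 95% one.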

ID3, which is a very simple decision tree algorithm (Quinlan, 1986), uses information gain as its splitting criterion. The growing stops when all instances belong to a single value of the target feature or when the best information gain is not greater than zero. ID3 does not apply any pruning procedures, nor does it handle numeric attributes or missing values.

Entropy, which in a decision tree measures how mixed the class labels are, characterizes the (im)purity of an arbitrary collection of examples. If the data is completely homogeneous, the entropy is 0; if the data is evenly divided between two classes (50%-50%), the entropy is 1.

Given a collection S, containing positive and negative examples of some target concept, the entropy of S relative to this boolean classification is
    Entropy(S) = -(p+) * log2(p+) - (p-) * log2(p-)
where p+ is the proportion of positive examples in S and p- is the proportion of negative examples in S. In all calculations involving entropy we define 0 * log2(0) to be 0.
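The boolean entropy formula can be written as a few lines of Python. The function name is illustrative. The guard on `p > 0` implements the convention that a zero proportion contributes nothing to the sum.

```python
import math

def boolean_entropy(p_pos):
    """Entropy of a collection whose proportion of positive examples is p_pos."""
    p_neg = 1.0 - p_pos
    total = 0.0
    for p in (p_pos, p_neg):
        if p > 0:  # skip zero proportions: 0 * log2(0) is defined as 0
            total -= p * math.log2(p)
    return total

even_split = boolean_entropy(0.5)      # maximally mixed collection
pure = boolean_entropy(1.0)            # completely homogeneous collection
nine_of_14 = boolean_entropy(9 / 14)   # e.g. 9 positive and 5 negative examples
```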

Information Gain is the decrease in entropy when a node is split, that is, the expected reduction in entropy caused by partitioning the examples according to a given attribute. More precisely, the information gain, Gain(S,A), of an attribute A relative to a collection of examples S is defined as
    Gain(S,A) = Entropy(S) - sum over v in Values(A) of (|Sv| / |S|) * Entropy(Sv)
where Values(A) is the set of possible values of A and Sv is the subset of S for which A has value v. The attribute with the maximum information gain is selected for splitting, that is, the attribute whose split leaves the lowest weighted entropy in the resulting subsets.
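As a sketch, information gain can be computed in Python by subtracting from the entropy of the whole collection the entropy of each subset produced by the split, weighted by the subset's share of the examples. The function names, the attribute name `wind` and the 14 examples are made up for this illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels (works for any number of classes)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy of the whole set minus the size-weighted entropy of each subset."""
    n = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

# Illustrative data: 8 "weak" rows (6 yes, 2 no) and 6 "strong" rows (3 yes, 3 no).
rows = [{"wind": "weak"}] * 8 + [{"wind": "strong"}] * 6
labels = ["yes"] * 6 + ["no"] * 2 + ["yes"] * 3 + ["no"] * 3

gain_wind = information_gain(rows, labels, "wind")
```

Among several candidate attributes, ID3 would evaluate this quantity for each one and split on the attribute with the largest value.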

Limitations: This method is biased towards selecting attributes with many outcomes. Also, since ID3 performs no backtracking in its search, it may converge to a locally optimal solution that is not globally optimal. Additionally, ID3 does not handle missing data values well, and is not robust in dealing with noisy data sets. ID3 does not directly deal with attributes that have continuous ranges, that is, numerical attributes are not supported.

See a numerical example which builds a decision tree based on Entropy and Information Gain. A tutorial on Decision Tree, with indices to measure degree of impurity using Entropy, Gini Index and Classification Error, is also attached here.

C4.5, an extension of the ID3 algorithm, addresses the limitations of ID3. C4.5 accounts for unavailable values, continuous attribute value ranges, pruning of decision trees and rule derivation. Its criterion is the normalized information gain (Gain Ratio) that results from choosing an attribute for splitting the data. The attribute with the highest normalized information gain (Gain Ratio) is chosen to make the decision. See here for the formula and numerical calculation of Gain Ratio.
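A Python sketch of the gain ratio, using its usual definition: information gain divided by the "split information", which is the entropy of the partition sizes themselves. The function names and data are illustrative. The second attribute is a hypothetical ID-like column with a unique value per example, included to show how the normalization penalizes attributes with many outcomes.

```python
import math
from collections import Counter

def entropy(items):
    """Entropy of the value distribution in a list."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())

def gain_ratio(values, labels):
    """values[i] is the attribute's value for example i; labels[i] is its class."""
    n = len(labels)
    subsets = {}
    for value, label in zip(values, labels):
        subsets.setdefault(value, []).append(label)
    gain = entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets.values())
    split_info = entropy(values)  # grows with the number of distinct outcomes
    return gain / split_info if split_info > 0 else 0.0

labels = ["yes"] * 6 + ["no"] * 2 + ["yes"] * 3 + ["no"] * 3
wind = ["weak"] * 8 + ["strong"] * 6   # two outcomes
row_id = list(range(14))               # unique value per example

ratio_wind = gain_ratio(wind, labels)
ratio_id = gain_ratio(row_id, labels)
```

The ID-like attribute has the largest possible raw information gain (every subset is pure), but dividing by its large split information cuts that value down considerably.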

The coefficient of variation (CV) is a measure of relative variability, defined as the ratio of the standard deviation to the mean. Unlike measures of absolute variability, the CV is unitless, so it can be used to compare the dispersion of two distributions measured in different units. The higher the coefficient of variation, the greater the level of dispersion around the mean. It is generally expressed as a percentage. In R, it can be written as a simple function: cv <- function(vect){return(sd(vect)/mean(vect))}