HYPOTHESIS TESTING

Hypothesis testing means making an assumption (hypothesis) about a population and then evaluating that assumption using a sample statistic

Null hypothesis (Ho): It states the hypothesized value of the parameter before sampling (µN = µo)

Alternative hypothesis (Ha): It states all possible alternatives other than the null hypothesis

µN (new mean) – the sample mean, computed from a sample drawn from the population; it is used as an estimate of the population mean

µo (old mean) – refers to the population mean

Type I error – False Positive

Type II error – False Negative

α (level of significance) = probability (P) [rejecting Ho when Ho is true] => Type I error

β = P [not rejecting Ho when Ho is false] => Type II error; the power of the test is 1 − β
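The relationship between α, β and power can be sketched numerically. The snippet below is a minimal illustration for an upper-tailed z-test; the numbers (hypothesized mean 100, true mean 105, σ = 15, n = 36) and the function name type_ii_error are assumptions made up for this example, not values from the text.

```python
from math import sqrt
from statistics import NormalDist

def type_ii_error(mu0, mu_true, sigma, n, alpha):
    """Return (beta, power) for an upper-tailed z-test of Ho: mu = mu0."""
    se = sigma / sqrt(n)  # standard error of the sample mean
    # Sample means above this cutoff lead to rejecting Ho.
    cutoff = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    # beta = P(sample mean <= cutoff) when the true mean is mu_true.
    beta = NormalDist(mu_true, se).cdf(cutoff)
    return beta, 1 - beta

# Hypothetical values chosen only for illustration.
beta, power = type_ii_error(mu0=100, mu_true=105, sigma=15, n=36, alpha=0.05)
print(f"beta = {beta:.3f}, power = {power:.3f}")
```

Lowering α moves the cutoff further into the tail, which makes a Type I error less likely but increases β for the same sample size.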

Depending on where the critical region lies, there are two types of test:

One Tail Test

Two Tail Test

A one tail test is a hypothesis test in which the critical region lies entirely in one tail of the probability distribution; we reject the null hypothesis (Ho) if the sample mean falls in that tail

A two tail test rejects the null hypothesis if the sample mean is significantly higher or lower than the hypothesized mean; the critical region is split between both tails
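To see how the choice of tails changes the rejection cutoff, here is a small sketch that looks up the standard normal critical value for a given α. The function name critical_value is only illustrative, and the sketch assumes a z-based test.

```python
from statistics import NormalDist

def critical_value(alpha, tails):
    """Standard normal cutoff for a one tail (tails=1) or two tail (tails=2) test."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    if tails == 1:
        # The whole of alpha sits in a single tail.
        return z.inv_cdf(1 - alpha)
    # alpha is split equally between the two tails.
    return z.inv_cdf(1 - alpha / 2)

print(round(critical_value(0.05, tails=1), 3))
print(round(critical_value(0.05, tails=2), 3))
```

For the same α, the two tail cutoff is larger in absolute value, so a two tail test needs a more extreme sample mean before it rejects Ho.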

There are various statistical tests which are used to perform hypothesis testing:

Parametric Tests

Parametric tests make assumptions about the population parameters and distribution. They are used for quantitative data and continuous variables

Z- Test

Student T-Test

P-Test

ANOVA Test

Z-Test

Conditions for Z-Test

1. The sample must be randomly selected

2. The sample size should be large

3. Data should follow a normal distribution

z = (x − µ) / σ
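When the quantity being tested is a sample mean rather than a single observation, the standard deviation in the denominator is replaced by the standard error σ/√n. The sketch below computes that z statistic; the function name and the input values are hypothetical.

```python
from math import sqrt

def z_statistic(sample_mean, mu0, sigma, n):
    """z = (sample mean - hypothesized mean) / (sigma / sqrt(n))."""
    standard_error = sigma / sqrt(n)
    return (sample_mean - mu0) / standard_error

# Hypothetical inputs: sample mean 52 from n = 36 observations,
# hypothesized mean 50, known population standard deviation 6.
z = z_statistic(sample_mean=52.0, mu0=50.0, sigma=6.0, n=36)
print(z)
```

The resulting z is then compared with the critical value for the chosen level of significance.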

CENTRAL LIMIT THEOREM

In statistics, the problem of estimation (finding an approximate value for a population parameter), whether point estimation or interval estimation, can be solved using the Z statistic.

Estimates can be based on a confidence interval, which describes the amount of uncertainty associated with a sample estimate of the population parameter.

If we know the sample mean (x̄), the population standard deviation (σ), and the sample size (n), we can build a confidence interval for the population mean:

x̄ ± Z(α/2) * σ/√n

where (α) is the level of significance (critical region)

If a random sample of size n is taken from an N(µ, σ) distribution with σ known, then a 90% confidence interval for µ is given by:

estimate ± table value * SE (estimator)

where SE is the standard error, that is, the standard deviation of the sampling distribution (SE = σ/√n), and the margin of error is E = Z(α/2) * σ/√n

The confidence interval provides a range of plausible values for the population parameter(s)

The table value reflects how much certainty (confidence) we want in our estimate

SE (estimator) measures the error in our estimate, that is, the error due to sampling variability
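The recipe "estimate ± table value * SE" can be sketched as follows. This is a minimal example assuming σ is known and a two-sided interval; the function name and the sample figures are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.90):
    """Two-sided confidence interval for the mean when sigma is known."""
    alpha = 1 - confidence
    # Table value: standard normal quantile leaving alpha/2 in each tail.
    table_value = NormalDist().inv_cdf(1 - alpha / 2)
    margin_of_error = table_value * sigma / sqrt(n)
    return sample_mean - margin_of_error, sample_mean + margin_of_error

# Hypothetical inputs, chosen only for illustration.
low, high = z_confidence_interval(sample_mean=52.0, sigma=6.0, n=36)
print(f"90% CI: ({low:.3f}, {high:.3f})")
```

Raising the confidence level increases the table value and widens the interval; increasing n shrinks the standard error and narrows it.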

If the population is skewed or otherwise not normally distributed, we can still draw many random samples, each of size n > 30, from the population; the mean of the sample means will then be close to the population mean.

Also, the distribution of the sample means will be approximately normal, with standard deviation σ/√n.
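A quick simulation can illustrate this behaviour. The sketch below draws repeated samples from an exponential distribution (strongly right-skewed, with mean 1 and standard deviation 1) and summarizes the sample means; the choice of distribution, the sample size of 40, and the seed are assumptions made for the demonstration.

```python
import random
from math import sqrt
from statistics import mean, stdev

def simulate_sample_means(num_samples, n, seed=0):
    """Draw num_samples samples of size n from Exponential(1); return their means."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        mean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(num_samples)
    ]

means = simulate_sample_means(num_samples=2000, n=40)
print("mean of sample means:", round(mean(means), 3))   # close to the population mean 1
print("sd of sample means:  ", round(stdev(means), 3))  # close to sigma / sqrt(n)
print("sigma / sqrt(n):     ", round(1.0 / sqrt(40), 3))
```

Plotting a histogram of the values in means would show a roughly bell-shaped curve even though the underlying population is skewed.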

If the level of confidence (1-α) is 90%, it means we are 90% confident that the interval contains the population mean (µ), and the critical region (α) = 10% (5% in each tail)

Student’s T Distribution

It is symmetrical about zero, bell-shaped, but more spread out than the normal distribution.

Using a T-test, we can compare the means of two samples.

Conditions for Student T-Test

1) The sample must be randomly selected and continuous

2) Use when the sample size is small

3) Use when the population variance or standard deviation is not known

4) The observation should be independent of one another

5) The data should not contain outliers

T-test can be used even for skewed distributions when the sample is large (greater than or equal to 30).

As the sample size grows, the distribution of the sample means tends to normality and the sample standard deviation (s) tends towards the population standard deviation (σ)

As the degrees of freedom increase, the t-distribution tends towards the standard normal distribution
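The convergence of the t-distribution to the standard normal can be checked directly from the density formulas. The sketch below implements the Student t density with math.lgamma (to avoid overflow for large degrees of freedom) and compares it with the normal density at zero; t_pdf is a helper written for this example, not a standard library function.

```python
from math import exp, lgamma, log, pi
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    # Log of the normalizing constant: Gamma((df+1)/2) / (sqrt(df*pi) * Gamma(df/2)).
    log_norm = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_norm - (df + 1) / 2 * log(1 + x * x / df))

normal_at_zero = NormalDist().pdf(0.0)
for df in (1, 5, 30, 1000):
    gap = abs(t_pdf(0.0, df) - normal_at_zero)
    print(f"df = {df:>4}: |t_pdf(0) - normal_pdf(0)| = {gap:.4f}")
```

The gap shrinks as df grows, while for small df the t density stays higher in the tails, which is the "more spread out" behaviour.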

P-Value is the smallest level of significance at which we can reject the null hypothesis.

Reject Ho if P-value ≤ α (cut off)
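The decision rule can be sketched in a few lines. Because the Python standard library has no t-distribution CDF, this example uses the standard normal as the reference distribution, which is an assumption that is reasonable only for a z statistic or a large sample; the function name and the test statistic value are hypothetical.

```python
from statistics import NormalDist

def p_value_from_z(z, tails=2):
    """P-value for a z statistic under the standard normal distribution.

    For tails=1 this assumes the statistic lies in the direction
    stated by the alternative hypothesis.
    """
    upper_tail = 1 - NormalDist().cdf(abs(z))
    return 2 * upper_tail if tails == 2 else upper_tail

alpha = 0.05
p = p_value_from_z(2.0, tails=2)  # hypothetical test statistic
decision = "reject Ho" if p <= alpha else "fail to reject Ho"
print(f"p-value = {p:.4f}: {decision}")
```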

One sample t-test: It checks whether the mean of the sample (x) differs from the population mean (µ)

Two sample t-test: It checks whether the mean of one sample (x1) differs from the mean of another sample (x2)
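The two comparisons described here correspond to two different t statistics. The sketch below computes both, using the pooled-variance form for the two-sample case, which assumes the two groups share a common variance; the function names and the data are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance

def one_sample_t(sample, mu0):
    """t statistic and degrees of freedom for comparing a sample mean with mu0."""
    n = len(sample)
    standard_error = sqrt(variance(sample) / n)  # uses the sample variance (n - 1)
    return (mean(sample) - mu0) / standard_error, n - 1

def two_sample_t(a, b):
    """Pooled-variance t statistic; assumes both groups share one variance."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    standard_error = sqrt(pooled * (1 / na + 1 / nb))
    return (mean(a) - mean(b)) / standard_error, na + nb - 2

# Hypothetical measurements.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0]
group_b = [4.2, 4.8, 4.5, 5.0, 4.4]
print(one_sample_t(group_a, mu0=5.0))
print(two_sample_t(group_a, group_b))
```

Each statistic is then compared with the t-table value for its degrees of freedom and the chosen α.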

Some guidelines about P-values

ANOVA Test (Analysis of variance)

It is used when we want to compare the means of more than two samples to check whether they differ from one another.

Conditions

a) The sample must be randomly selected and independent

b) The distribution of each group should be normal

c) The spread of the data around each group mean should be similar

d) The variance must be the same across the groups (homogeneity of variance)

One-way ANOVA testing:

It is used to compare the means of two or more independent groups based upon a single factor (variable). The null hypothesis states that all the group means are equal.

Two-way ANOVA testing:

It is used to compare the means of two or more independent groups based upon two factors (variables).

K way ANOVA testing:

It is used to compare the means of two or more independent groups based upon k factors (variables).

Partitioning of Variance in the ANOVA

F- Distribution

F-Test (variance ratio test): It is used to compare the variances of two populations. In ANOVA, it tests whether the group means differ by taking the ratio of the variance between groups to the variance within groups.

Characteristics

The F-distribution is always skewed to the right

F values are never negative; F ≥ 0

The shape of the F-distribution is determined by:

=> DOF (degrees of freedom) of the numerator (k-1), where k is the number of groups

=> DOF (degrees of freedom) of the denominator (n-k), where n is the total number of observations

From the above example: Number of groups (k) = 3

Total number of observations (n) = 9

F-table represented as:

Let’s assume level of significance (α) = 0.05

df1 = (k-1) = (3-1) = 2

df2 = (n-k) = (9-3) = 6

Source: http://jukebox.esc13.net/untdeveloper/RM/Stats_Module_4/mobile_pages/Stats_Module_410.html
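A one-way ANOVA F statistic can be computed by hand by partitioning the variance into a between-group part and a within-group part. The sketch below does this for hypothetical data with 3 groups and 9 observations, so the degrees of freedom come out as 2 and 6; the data are invented for the example. The standard F-table value for α = 0.05 with (2, 6) degrees of freedom is 5.14.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    k = len(groups)                       # number of groups
    n = sum(len(g) for g in groups)       # total number of observations
    grand_mean = mean([x for g in groups for x in g])
    group_means = [mean(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    df_between, df_within = k - 1, n - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within

# Hypothetical data: 3 groups of 3 observations each.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_value, df_between, df_within = one_way_anova(groups)
F_CRITICAL = 5.14  # F-table value for alpha = 0.05, df = (2, 6)
print(f_value, df_between, df_within)
print("reject Ho" if f_value > F_CRITICAL else "fail to reject Ho")
```

If the computed F exceeds the table value, the variation between the groups is too large relative to the variation within them to be attributed to chance at the chosen α.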

Non-Parametric Test

· Unlike the parametric test, it makes no assumption about the population parameters or distribution

· It does not rely on data belonging to a particular distribution

· Works for categorical data

Chi-square test:

1. It checks whether two categorical variables are independent of each other or not (test of independence)

2. It is based on frequencies and is independent of parameters like mean and standard deviation.

3. It tells how closely the distribution of a categorical variable matches an expected distribution (goodness of fit).
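Both uses of the chi-square test rely on the same statistic, the sum of (observed − expected)² / expected over all cells. The sketch below computes it for a goodness-of-fit check and for a test of independence on a contingency table; the function names and the counts are hypothetical. The standard chi-square table value for α = 0.05 with 1 degree of freedom is 3.841.

```python
def chi_square_goodness_of_fit(observed, expected):
    """Chi-square statistic comparing observed counts with expected counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_square_independence(table):
    """Chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df

# Hypothetical counts for illustration.
print(chi_square_goodness_of_fit([18, 22, 20], [20, 20, 20]))
statistic, df = chi_square_independence([[10, 20], [20, 10]])
CHI2_CRITICAL = 3.841  # chi-square table value for alpha = 0.05, df = 1
print(statistic, df, "reject Ho" if statistic > CHI2_CRITICAL else "fail to reject Ho")
```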

There are some differences between Parametric tests and Non Parametric tests

Parametric Tests:

1. It requires complete information about the population

2. The null hypothesis makes assumptions based upon parameters (mean and standard deviation) of the population distribution

3. It makes specific assumptions regarding the population

4. It is based upon the normal probability distribution

5. More powerful than non-parametric tests when its assumptions hold

Non Parametric Tests:

1. It does not require information about the population

2. The null hypothesis is free from parameters

3. It does not make any assumption regarding the population

4. The test statistic is arbitrary

5. Less powerful than parametric tests