ANOVA, short for Analysis of Variance and also called AOV, is a statistical method mainly used for hypothesis testing. The most common use case for ANOVA is an experiment in which your outcome variable is numeric and your explanatory variable is categorical with three or more classes. If you are not yet familiar with hypothesis testing, I strongly recommend reading this article first.

An example is a trial for a new agricultural crop growth product in which you measure the performance of two new treatments and a control group. You measure a numerical outcome (for example, kilograms of harvest) in the three groups (treatment 1, treatment 2, and control). For statistical validity, you need to apply each treatment multiple times. Imagine that you divide your agricultural land into 15 subplots: you apply treatment 1 to 5 subplots, treatment 2 to 5 subplots, and leave 5 subplots untreated (control).

You then compute the average kilograms of harvest for each treatment, and you observe differences between the averages. However, you need to determine whether those differences are large enough to state that the outcomes are significantly different, rather than just the result of random variation. This is what ANOVA is made for. Of course, many studies in many domains follow the exact same setup (three or more independent groups and one continuous outcome variable). You can check out this article for a more detailed read on ANOVA and its advanced options.

MANOVA is a multivariate version of the ANOVA model. Multivariate here means that there are multiple dependent variables instead of just one. The goal of a MANOVA analysis is still to detect whether there is a treatment effect, that is, whether the groups differ. However, this effect is now measured across multiple continuous variables rather than just one.
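Going back to the crop example for a moment, the following is a minimal sketch of what a one-way ANOVA computes for the 15-subplot setup. The harvest numbers and the names `harvest_kg` and `one_way_anova_f` are made up for illustration; they are not data or code from this article. The sketch only computes the F statistic by hand with the Python standard library. Turning it into a p-value requires the F distribution, which a statistics package would normally provide.

```python
from statistics import mean

# Hypothetical harvest weights in kg: 5 subplots per group (15 in total).
# These values are invented for illustration only.
harvest_kg = {
    "treatment_1": [20.1, 21.4, 19.8, 22.0, 20.7],
    "treatment_2": [23.5, 24.1, 22.8, 25.0, 23.9],
    "control": [18.2, 19.0, 17.5, 18.8, 18.1],
}


def one_way_anova_f(groups):
    """Return (F statistic, between-group df, within-group df)."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = mean(all_values)

    # Variation of the group averages around the overall average.
    ss_between = sum(
        len(values) * (mean(values) - grand_mean) ** 2
        for values in groups.values()
    )
    # Variation of the individual subplots around their own group average.
    ss_within = sum(
        (v - mean(values)) ** 2
        for values in groups.values()
        for v in values
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)

    # F compares the between-group variance to the within-group variance.
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


f_stat, df_between, df_within = one_way_anova_f(harvest_kg)
print(f"F({df_between}, {df_within}) = {f_stat:.2f}")
```

A large F value means the group averages differ by much more than the subplot-to-subplot noise within each group would explain, which is exactly the question ANOVA answers.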
One MANOVA vs multiple ANOVAs

You could do a separate ANOVA for each of the dependent variables and often get a result that is not very different from the MANOVA approach. However, it is quite possible for MANOVA to find a significant treatment effect that the individual ANOVAs on each dependent variable would have missed, since MANOVA also uses the correlations between the dependent variables.

MANOVA: a part of Multivariate Statistics

Rather than seeing MANOVA as a multivariate alternative to ANOVA, one could also describe MANOVA as a tool in the domain of Multivariate Statistics. Other methods in the Multivariate Statistics family include Structural Equation Modeling, Multidimensional Scaling, Principal Coordinates Analysis, Canonical Correlation Analysis, and Factor Analysis.

What these methods have in common is that they are all used to make sense of many variables and to summarize them into one or a few learnings. This is very different from hypothesis testing (often used in experimental studies), which focuses on finding a clear-cut answer (a decision based on significance) to a very precise hypothesis.

Both views apply to MANOVA, but it is important to note that "regular" hypothesis testing with one dependent variable generally has rather different applications than multivariate statistics. Think about the goal of your study when choosing the method.

An example use case for MANOVA

Let’s start working on a MANOVA example. In this case, let’s do a study in which the goal is to show that different plant growth products lead to significantly different plant growth. Therefore, we will have three treatments:
We will use three measurements for defining plant growth:
Having three outcome variables is relatively few compared to what one may encounter in multivariate statistics. However, it is well suited for following along with this MANOVA example.

MANOVA in R

Let’s start by doing a MANOVA analysis in R.

Getting the MANOVA data in R

I have uploaded the data in an S3 bucket. You can use the following code in R to obtain the data:

The data looks as follows:

Univariate description of the data

To get a quick insight into the effect of treatment on the three dependent variables, you can use the following code to create box plots:

[Code listing: Create boxplots of the MANOVA plant growth data]

You will obtain the following plots:

What you can see in these plots is that the plants receiving treatment 1 have the lowest heights, widths, and weights. Plants receiving treatment 3 have the highest of everything. There is some overlap in places, but we could reasonably expect that treatment 3 is the best overall product for plant growth.

Multivariate description of the data

As we are doing a multivariate analysis, it is important to look at the relationships between the dependent variables as well. Let’s start by looking at the correlations between the dependent variables using the following code:

[Code listing: Compute correlations between the dependent variables]

You will obtain the following result:

There are strong correlations between each pair of the three variables. The relation between Height and Weight is the strongest.

Including treatment in the multivariate analysis

By making scatter plots, you can see the individual data points. If you then encode the treatment as the marker shape, you can see the correlations and the treatments together in one plot. You can use the following code to do so:

[Code listing: Create MANOVA scatter plots]

You will obtain the following plots:

Fitting the MANOVA in R

Now, rather than looking at plots, we want an objective answer to the question of whether the treatment significantly improves plant growth.
[Code listing: Fitting the MANOVA in R]

You will obtain the following result:

Understanding the output of the MANOVA in R

If you are not familiar with hypothesis testing, I recommend reading this article first. The first things to look at in hypothesis testing outputs are generally the test statistic and the p-value.

The test statistic shown in this output is Pillai’s trace. It is never negative, and its upper bound is the smaller of the number of dependent variables and the number of groups minus one (so 2 in this example). Larger values indicate a stronger treatment effect.

The p-value, as always, is what you interpret to conclude on significance. At the conventional 5% significance level, a p-value below 0.05 indicates that there is a significant effect of treatment on the outcome. In the current case, we can conclude that treatment has a significant effect on plant growth.

MANOVA in Python

Let’s now see how to do the same analysis in Python using the same steps.

Getting the MANOVA data in Python

You can import the same data set in Python as follows:

[Code listing: Importing the MANOVA data in Python]

It will look as follows:

Fitting the MANOVA in Python

You can fit a MANOVA in Python using statsmodels. You can use the following code to do so:

[Code listing: Fitting a MANOVA in Python]

You will get the following output:

Understanding the output of the MANOVA in Python

In Python, the output shows the analysis using several different test statistics. The second one, Pillai’s trace, is the one that we saw in the R output as well. Pillai’s trace is known to be relatively conservative: it gives a significant result less easily (the differences have to be bigger to obtain a significant output).

Wilks’ Lambda is another often-used test statistic. Hotelling-Lawley trace and Roy’s greatest root are alternative options. There is no absolute consensus in the statistical literature as to which test statistic should be preferred.

The p-values are shown in the right column and are all below 0.05, which confirms that treatment has an impact on plant growth.

Assumptions of MANOVA

As in all statistical models, there are a few assumptions to take into account. In MANOVA, the assumptions are:
If you want to rely on your MANOVA conclusions, you’ll need to make sure that these assumptions are met.

Conclusion

In this article, you have learned what MANOVA is, when you should use it, and how to apply it in R and Python on an example use case about plant growth.

I hope that this article was useful for you! Thanks for reading and don’t hesitate to stay tuned for more stats, math, and data science content.
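As a closing illustration, here is a minimal sketch of where the four test statistics in the Python output (Wilks’ lambda, Pillai’s trace, Hotelling-Lawley trace, and Roy’s greatest root) come from. It is not the statsmodels implementation and does not use the article’s data set: the data are synthetic, and the names `groups`, `group_means`, and `manova_statistics` are assumptions made for this example. The sketch builds the between-group matrix H and the within-group matrix E with NumPy, then derives all four statistics from the eigenvalues of E⁻¹H. It does not compute p-values.

```python
import numpy as np

# Synthetic data, invented for illustration: 3 treatments x 10 plants,
# each plant measured on 3 variables (height, width, weight).
rng = np.random.default_rng(seed=0)
group_means = {
    "treatment_1": [10.0, 5.0, 2.0],
    "treatment_2": [11.0, 5.5, 2.4],
    "treatment_3": [12.5, 6.0, 3.0],
}
groups = {
    name: rng.normal(loc=means, scale=1.0, size=(10, 3))
    for name, means in group_means.items()
}


def manova_statistics(groups):
    """Compute the four classical one-way MANOVA test statistics."""
    all_rows = np.vstack(list(groups.values()))
    grand_mean = all_rows.mean(axis=0)
    n_vars = all_rows.shape[1]

    H = np.zeros((n_vars, n_vars))  # between-group (hypothesis) SSCP matrix
    E = np.zeros((n_vars, n_vars))  # within-group (error) SSCP matrix
    for rows in groups.values():
        group_mean = rows.mean(axis=0)
        diff = (group_mean - grand_mean).reshape(-1, 1)
        H += rows.shape[0] * (diff @ diff.T)
        centered = rows - group_mean
        E += centered.T @ centered

    # Eigenvalues of E^-1 H; clip tiny negative values caused by rounding.
    eigvals = np.linalg.eigvals(np.linalg.solve(E, H)).real
    eigvals = np.clip(eigvals, 0.0, None)

    return {
        "wilks_lambda": float(np.prod(1.0 / (1.0 + eigvals))),
        "pillai_trace": float(np.sum(eigvals / (1.0 + eigvals))),
        "hotelling_lawley_trace": float(np.sum(eigvals)),
        "roy_greatest_root": float(np.max(eigvals)),
    }


stats = manova_statistics(groups)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

Because all four statistics are functions of the same eigenvalues, they usually point in the same direction. They differ in how they weight the largest eigenvalue against the others, which is one reason the literature does not settle on a single preferred statistic.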