Exploring Your Data with Ecological Statistics!

Learn how to defend your data with the power of statistics!

Chance Yan | July 11th, 2026

Term | Definition
--- | ---
Covariate | These are the X variables in your model, also called the independent variables. For instance, if you wanted to understand the effect of fish size on swimming speed, the size of the fish would be a covariate. There can be multiple covariates in a study.
Response Variable | This is the Y variable in your model: the outcome variable in the experiment. In the fish example from the covariate definition, this would be swimming speed. Typically there is only one response variable per experiment; however, there can be more than one.
Model | Ecological models are a mathematical way to represent something from the environment. There are many different models, depending on the types and number of covariates and response variables you use.
Linear Model | A type of model that assumes a straight-line relationship between the covariate and the response variable. It can also be thought of as the basic slope-intercept equation from algebra, Y = mX + b.
Generalized Linear Model | A more flexible model that can be used when certain model assumptions are violated. Very common in ecology because of its flexibility.

Table 1. Key terms used in this article. Please refer back to it when these terms come up in the text.

After thinking of a question, forming a hypothesis, carrying out an experiment, and collecting data, the next step is the dreaded analysis. Analysis is a crucial part of the experiment because it converts your data into a meaningful story. It’s also a difficult and complex process that’s often overlooked. The analysis itself can be broken down into three steps: data exploration, model generation, and model validation. This article focuses on data exploration, which can be further broken down into eight steps [1]. In each step, you create plots to look for patterns in your data. If your data violates the assumptions of the model being used, the model can be considered invalid and your story has no evidence to support it. Having a general idea of what your data looks like also helps you build your model.

Outliers for X and Y: Outliers are data points that have an extremely large or small value compared to the rest of the dataset and can skew relationships. In Fig. 1, the circled data points at temperatures of 60-75 °F are clear examples of outliers: their ozone concentrations are nowhere near those of other points at the same temperatures. With the outliers included, the fitted line has a slightly higher slope (orange line) than the line fitted without them (dark blue line) (Fig. 1).

A graph showing 5 different outlier points on a scatterplot.

Figure 1. The relationship between temperature and the concentration of ozone with 5 outlier points highlighted (Air Quality in R [2], created and edited by Chance Yan). 
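
One way to make this check concrete is sketched below in Python (the figures in this article were made in R). It flags points using the common boxplot rule, meaning values more than 1.5 times the interquartile range beyond the quartiles, and then compares the fitted slope with and without them. The numbers are made up for illustration and are not the Air Quality data; the helper names `iqr_outliers` and `slope` are hypothetical, not part of any library.

```python
from statistics import mean, quantiles


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles (boxplot rule)."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


def slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical temperature (F) and ozone readings, made up for illustration.
temp = [58, 61, 64, 67, 70, 73, 76, 79, 82, 85]
ozone = [12, 18, 20, 23, 28, 30, 35, 39, 122, 45]

flagged = iqr_outliers(ozone)
kept = [(t, o) for t, o in zip(temp, ozone) if o not in flagged]

slope_all = slope(temp, ozone)
slope_clean = slope([t for t, _ in kept], [o for _, o in kept])
print("flagged:", flagged)
print("slope with outliers:", round(slope_all, 2))
print("slope without outliers:", round(slope_clean, 2))
```

A flagged point is not automatically wrong. It is a prompt to check whether the value is a recording error or a genuine observation before deciding what to do with it.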

Relationships between X and Y: An important but easily overlooked step is to look at the relationship between each covariate and the response variable. Graphing petal length against petal width for iris flowers, for example, shows a strong linear relationship (Fig. 2). This suggests that a linear model is an appropriate choice.

A graph showing a strong linear relationship between the X and Y variables.

Figure 2. The relationship between petal length and width of the iris flower (Iris dataset in R [2], created and edited by Chance Yan).
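
As a rough numerical companion to a plot like this, the sketch below fits the straight line Y = mX + b by least squares and reports R², the share of variation in Y that the line explains. It is written in Python with made-up petal measurements (not the real iris data), and `fit_line` is a hypothetical helper name.

```python
from statistics import mean


def fit_line(x, y):
    """Return slope m, intercept b and R^2 of the least-squares line y = m*x + b."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    m = sxy / sxx
    b = my - m * mx
    ss_res = sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return m, b, 1 - ss_res / ss_tot


# Hypothetical petal measurements in cm, made up for illustration.
petal_length = [1.4, 1.5, 4.0, 4.5, 4.7, 5.1, 5.9, 6.0]
petal_width = [0.2, 0.3, 1.3, 1.5, 1.4, 1.9, 2.1, 2.5]

m, b, r_squared = fit_line(petal_length, petal_width)
print(f"width = {m:.2f} * length + {b:.2f}, R^2 = {r_squared:.2f}")
```

An R² close to 1 agrees with what the scatterplot shows, but the plot should still come first: a curved relationship can also produce a fairly high R².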

Response Variable Normality: A common assumption of many modeling approaches is that your data is normally distributed. This means that your data, when graphed as a histogram, should look like a bell curve (the dark red line in Fig. 3). Since the orange bars do not match the expected bell curve in this example, we would need to explore our data further to see whether we missed a covariate or failed to measure something in our experiment.

A histogram showing a bi-modal relationship of the response variable with a bell curve line overlapping the graph.

Figure 3. Histogram of a bimodal response variable (orange bars) with the expected bell curve overlaid (dark red line) (Faithful dataset in R [2], created and edited by Chance Yan).
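
The comparison between bars and bell curve can also be done with numbers. The Python sketch below counts how many values fall in each bin and how many a normal curve with the same mean and standard deviation would predict. The values are made up to be bimodal (they are not the Faithful data), the bin edges are an arbitrary choice, and `observed_vs_expected` is a hypothetical helper name.

```python
from statistics import NormalDist, mean, stdev


def observed_vs_expected(values, edges):
    """Per bin: observed count and the count predicted by a matching normal curve."""
    curve = NormalDist(mean(values), stdev(values))
    rows = []
    for lower, upper in zip(edges, edges[1:]):
        observed = sum(lower <= v < upper for v in values)
        expected = len(values) * (curve.cdf(upper) - curve.cdf(lower))
        rows.append((lower, upper, observed, expected))
    return rows


# Hypothetical bimodal measurements (two clusters), made up for illustration.
values = [1.8, 1.9, 2.0, 2.0, 2.1, 2.2, 4.0, 4.1, 4.3, 4.4, 4.5, 4.6]

rows = observed_vs_expected(values, edges=[1, 2.5, 3.5, 5])
for lower, upper, observed, expected in rows:
    print(f"{lower}-{upper}: observed {observed}, expected {expected:.1f}")
```

In this toy example the middle bin is empty even though a bell curve would put several values there, which is the same mismatch the histogram shows visually.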

Collinearity of X variables: If you’re using multiple covariates in your model, you may come across collinearity. Collinearity occurs when two or more of your covariates are correlated. Typically, collinearity comes from variables that are related, such as petal length and width (when a flower’s petals are long, they tend to be wide too) (Fig. 2). A model with collinear covariates can give confusing results, because the apparent importance of one covariate changes depending on which others are included, obscuring which variables actually matter [1].
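
A quick way to screen for collinearity is to compute the correlation between pairs of covariates. The Python sketch below also reports the variance inflation factor, VIF = 1 / (1 - R²), which for just two covariates reduces to using the squared correlation. The petal measurements are made up for illustration, and the helper names are hypothetical.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5


def vif_two_covariates(x1, x2):
    """VIF = 1 / (1 - R^2); with only two covariates, R^2 is the squared correlation."""
    r = pearson_r(x1, x2)
    return 1 / (1 - r ** 2)


# Hypothetical petal measurements in cm, used here as two covariates.
petal_length = [1.4, 1.5, 4.0, 4.5, 4.7, 5.1, 5.9, 6.0]
petal_width = [0.2, 0.3, 1.3, 1.5, 1.4, 1.9, 2.1, 2.5]

r = pearson_r(petal_length, petal_width)
vif = vif_two_covariates(petal_length, petal_width)
print(f"correlation = {r:.2f}, VIF = {vif:.1f}")
```

A correlation near 1 (and the large VIF that comes with it) suggests the two covariates carry nearly the same information, so keeping only one of them is a common remedy.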

Response Variable Homogeneity: In addition to checking whether your data is normal, you must check whether the spread (variance) of the response variable is similar across all levels of your covariates. Graphing the food-intake rate of Godwits (Fig. 4) shows that the spread is similar across sex and migration period (Fig. 5). The upper and lower whiskers of the boxplots (thin dotted lines) are roughly the same length, which suggests that the groups are roughly symmetric and have similar variance. Inconsistent variance would violate one of the assumptions of many models, including the linear model, and needs to be addressed before modeling.

A brownish bird with a very slight upturned orange and black beak standing in water.

Figure 4. A Hudsonian Godwit standing in water (Source: Adobe Stock Photos)

A boxplot showing equal variance across all groups.

Figure 5. The food-intake rate of Hudsonian Godwits across sex and migration period [1].
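
Alongside the boxplots, the spread in each group can be summarized directly. The Python sketch below computes the variance within each group and the ratio of the largest variance to the smallest. The group labels and intake rates are made up for illustration (they are not the Godwit data), and `variance_ratio` is a hypothetical helper name.

```python
from statistics import variance


def variance_ratio(groups):
    """Return per-group sample variances and the ratio of the largest to the smallest."""
    variances = {name: variance(values) for name, values in groups.items()}
    return variances, max(variances.values()) / min(variances.values())


# Hypothetical intake rates per group, made up for illustration.
similar = {
    "female, summer": [0.10, 0.14, 0.12, 0.16, 0.13],
    "female, winter": [0.20, 0.24, 0.22, 0.26, 0.23],
    "male, summer": [0.11, 0.15, 0.13, 0.17, 0.14],
    "male, winter": [0.21, 0.25, 0.23, 0.27, 0.24],
}

# Same groups, but one of them is far more variable than the rest.
unequal = dict(similar)
unequal["male, winter"] = [0.05, 0.45, 0.20, 0.60, 0.30]

_, ratio_similar = variance_ratio(similar)
_, ratio_unequal = variance_ratio(unequal)
print(f"similar groups: ratio = {ratio_similar:.1f}")
print(f"unequal groups: ratio = {ratio_unequal:.1f}")
```

A ratio near 1 matches boxplots of similar height, while a very large ratio points to the kind of inconsistent variance that needs attention. No fixed cutoff is assumed here; how large is too large is a judgment call.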

Independent Observations: Most statistical techniques assume that the observations of the response variable are independent of each other [3]. For instance, measuring only the iris flowers you could reach would violate this rule: flowers growing close together may influence each other or be more likely to share traits (Fig. 2), which would bias the data. In this instance, you would need to return to the field and measure flowers beyond the ones within reach.

Response Variable Zeros: Having too many zeros in the dataset can violate the normality assumption. For example, many bird pairs producing no eggs at all would create a “zero-inflated” dataset (Fig. 6A). Without the zero values, you would have a normally distributed dataset and could use a generalized linear model (Fig. 6B). If a large proportion of your dataset is zeros, you may want to use a specialized generalized linear model, such as a zero-inflated model, which essentially splits your data into zeros and non-zero values.

Two histograms with one showing a normal distribution except for the large influx of zeros. The other histogram shows a typical normally distributed dataset.

Figure 6. Simulated data for the number of eggs per pair of birds. Some pairs do not have nests, which produces a large number of zeros (A); without the zeros, the data is normally distributed (B) (data generated in R [2], created and edited by Chance Yan).
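
A simple first check for excess zeros is to compare the share of zeros you observe with the share a plain count model would predict. The Python sketch below uses a Poisson distribution as that reference, for which the expected share of zeros is exp(-mean). Using Poisson here is an assumption made for the example, the egg counts are made up, and `zero_summary` is a hypothetical helper name.

```python
from math import exp
from statistics import mean


def zero_summary(counts):
    """Return the observed share of zeros and the share a Poisson with the same mean predicts."""
    observed = counts.count(0) / len(counts)
    expected_poisson = exp(-mean(counts))
    return observed, expected_poisson


# Hypothetical eggs per pair: 8 pairs without a nest, 12 pairs with eggs.
eggs = [0] * 8 + [2, 3, 3, 4, 4, 4, 5, 5, 6, 3, 4, 5]

observed, expected = zero_summary(eggs)
print(f"observed zeros: {observed:.0%}, expected under Poisson: {expected:.0%}")
```

When the observed share is several times the expected share, as in this toy example, a zero-inflated model is worth considering.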

Interactions: An interaction occurs when the effect of one covariate on your response variable depends on another covariate. For example, the effect of temperature (one covariate) on ozone concentration might depend on the month (another covariate) (Fig. 7). Depending on the question you want to answer, you may need to account for an interaction in your model.

A scatterplot that has colored in the different groups by month. 

Figure 7. Similar to Figure 1, but with the points grouped by month; the groups show different relationships between temperature and ozone concentration (Ozone dataset in R [2], created and edited by Chance Yan).
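
One informal way to look for an interaction is to fit a separate line within each group and compare the slopes. The Python sketch below does this for temperature and ozone by month. The records are made up for illustration (they are not the R dataset), and `slope` and `slopes_by_group` are hypothetical helper names.

```python
from collections import defaultdict
from statistics import mean


def slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


def slopes_by_group(records):
    """Fit a separate slope for each group in (group, x, y) records."""
    grouped = defaultdict(lambda: ([], []))
    for group, x, y in records:
        grouped[group][0].append(x)
        grouped[group][1].append(y)
    return {group: slope(xs, ys) for group, (xs, ys) in grouped.items()}


# Hypothetical (month, temperature in F, ozone) records, made up for illustration.
records = [
    ("May", 60, 10), ("May", 65, 12), ("May", 70, 14), ("May", 75, 16),
    ("August", 75, 40), ("August", 80, 60), ("August", 85, 80), ("August", 90, 100),
]

slopes = slopes_by_group(records)
for month, value in slopes.items():
    print(f"{month}: ozone changes by {value:.1f} per degree")
```

Clearly different slopes between groups, as in this toy example, suggest that an interaction term between temperature and month may be worth including in the model.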

After exploring your data, you can build your model and assess how accurate it is with diagnostic tools such as DHARMa. There are many different models (linear regression, generalized linear models, classification trees, principal component analysis, etc.), each with its own ways to assess accuracy; however, data exploration is a process common to all of them! Exploring your data is only the beginning, yet it can be a massive headache. Understanding the tools of data analysis can help you be more confident in the data you collect and use.

References

[1] Zuur, Alain F., Elena N. Ieno, and Chris S. Elphick. “A Protocol for Data Exploration to Avoid Common Statistical Problems.” Methods in Ecology and Evolution 1, no. 1 (November 13, 2009): 3–14. https://doi.org/10.1111/j.2041-210x.2009.00001.x.

[2] R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing, 2026. https://www.R-project.org/.

[3] Hurlbert, Stuart H. “Pseudoreplication and the Design of Ecological Field Experiments.” Ecological Monographs 54, no. 2 (June 1984): 187–211. https://doi.org/10.2307/1942661.

[4] Wickham, Hadley. ggplot2: Elegant Graphics for Data Analysis. New York: Springer-Verlag, 2016.
