Chapter 4 Data visualizations

The jamovi Descriptives options offer several visualizations. As before, your main textbook by Navarro & Foxcroft (2019) dedicates a chapter to data visualization in jamovi: Chapter 5 (Drawing graphs). And, as before, the best way to get through this chapter is to open up jamovi and follow along at each step. There are no practice exercises for this chapter.

NOTE: There is one important thing you should understand about the visualizations in this chapter (e.g., histograms, boxplots, Q-Q plots). Namely, researchers use these visualizations of descriptive statistics mostly to help them make decisions about subsequent analyses. These are NOT typically visualizations that are reported in the Results sections of journal articles. In jamovi, the plots that are used in Results sections are included within the menu for the statistical analysis itself. For example, there is a box called Descriptives plots that you can check under the T-Tests tab. This will give you a dot plot with confidence intervals (to be explained later in Navarro & Foxcroft, 2019). The exception below is the bar plot for frequency data (section 4.5). Such plots are, in fact, often used in Results sections of papers.

To illustrate these various visualizations, we will use the Chico2.omv sample data set from Navarro & Foxcroft (2019). To get to this, just go to ( \(\equiv\) ) > Open > Data Library > Chico 2. The data consist of grades on some test (grade_test), which is a continuous variable, along with a time-of-test variable, time, which is a 2-level factor, where 1 represents time 1 and 2 represents time 2.

NOTE: None of the Plots from the Descriptives menu will work if the variable you are trying to visualize is not Continuous. Go back to the Data tab to check this if you are having trouble getting jamovi to depict your data visually.

4.1 Histograms

The first visualization is the well-known histogram, where the frequencies of particular responses are either “binned” if they are non-discrete (e.g., values with decimals), or possibly “stacked” if they are discrete integers (though such a variable may be binned as well).

What does this mean? For certain variables, like reaction times, values almost never repeat. It would be very strange for any two people to get the exact same reaction time of 632.44 milliseconds (ms). Therefore, you can’t really “stack” these observations in a histogram. What you can do, however, is bin them. You can create various bins of, say, 50 milliseconds, and place the relevant values in them. For instance, there could be a bin for reaction times between 350-400 ms, an adjacent one to its right for reaction times between 400-450 ms, then one for 450-500 ms, and so on. You can change the width of the bin.
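The binning idea can be sketched in a few lines of Python. This is only an illustration, not how jamovi builds its histograms: the reaction times are made up, the helper name bin_start is hypothetical, and the sketch assumes that a value sitting exactly on a boundary (e.g., 400 ms) goes into the bin to its right.

```python
from collections import Counter

# Hypothetical reaction times in milliseconds (made up for illustration).
reaction_times = [352.1, 398.7, 401.3, 432.44, 447.9, 455.0, 463.2, 499.9, 512.6]

BIN_WIDTH = 50  # bin width in ms; change this to get wider or narrower bins


def bin_start(value, width=BIN_WIDTH):
    """Return the lower edge of the bin that contains value.

    Assumption: bins are closed on the left, so 400.0 falls in 400-450.
    """
    return int(value // width) * width


# Count how many observations fall into each bin.
counts = Counter(bin_start(rt) for rt in reaction_times)

# Print a sideways text histogram, one row per bin.
for start in sorted(counts):
    print(f"{start}-{start + BIN_WIDTH} ms: {'#' * counts[start]}")
```

Changing BIN_WIDTH to 25 or 100 shows how the same observations can produce a noticeably different-looking histogram.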

For other variables, it’s entirely possible to stack them, sometimes. For instance, if age (in years) is recorded as a simple integer (e.g., 18, 19, 20), it would be very easy to stack the 18s, the 19s, the 20s, the 21s, and so on, as long as your age range is relatively limited (e.g., typical college students). Note that if you had a really wide range of ages, you might bin them (1-5, 6-10, 11-15, 16-20, … 61-65, etc.).
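Stacking versus binning integer values can be sketched the same way. The ages below are invented, and five_year_bin is a hypothetical helper that reproduces the 1-5, 6-10, 11-15 grouping mentioned above.

```python
from collections import Counter

# Hypothetical ages (in years) for a small group of college students.
ages = [18, 19, 19, 20, 18, 21, 19, 20, 18, 19]

# Stacking: one stack per distinct integer value.
stacks = Counter(ages)
for age in sorted(stacks):
    print(f"{age}: {'#' * stacks[age]}")


def five_year_bin(age):
    """Return the (low, high) five-year bin for an integer age: 1-5, 6-10, ..."""
    low = ((age - 1) // 5) * 5 + 1
    return low, low + 4


# Binning: with a wide age range, group the integers into five-year bins instead.
binned = Counter(five_year_bin(age) for age in ages)
for low, high in sorted(binned):
    print(f"{low}-{high}: {'#' * binned[(low, high)]}")
```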

Often paired with the histogram is the density plot, which uses math to estimate what a population distribution with many, many observations would look like based on the sample data you have.
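To give a rough idea of the "math" involved, here is a minimal sketch of one common approach, a Gaussian kernel density estimate. It is an assumption that this resembles what jamovi does; jamovi chooses its own smoothing settings, and the grades, the function name gaussian_kde, and the bandwidth of 3.0 below are all made up for illustration.

```python
import math


def gaussian_kde(sample, bandwidth):
    """Return a function that estimates the density at any point x.

    Each observation contributes a small normal curve centred on its value;
    the estimate at x is the average height of those curves at x.
    """
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )

    return density


# Hypothetical test grades (not the Chico 2 data).
grades = [48.0, 52.5, 55.0, 56.5, 57.0, 58.0, 60.5, 62.0, 66.0]

# The bandwidth controls smoothness; 3.0 is an arbitrary choice for this sketch.
density = gaussian_kde(grades, bandwidth=3.0)

# Print a rough sideways density curve from 45 to 70 in steps of 5.
for x in range(45, 71, 5):
    print(f"{x}: {'#' * round(density(x) * 400)}")
```

A larger bandwidth gives a smoother, flatter curve, and a smaller one gives a bumpier curve that follows the individual observations more closely.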

To get these for the Chico2.omv data, we moved grade_test over to the Variables box, and time over to the Split by box. Then we checked Histogram and Density under Histograms. You can see the parameters that need to be set in Figure 4.1 below.

Figure 4.1: Parameters for histograms and density plots for test grade split by time of test (with the data set ‘Chico 2’ from Navarro & Foxcroft, 2019).

The results of this are below.

 DESCRIPTIVES

 Descriptives
 ────────────────────────────────────────────
                         time    grade_test
 ────────────────────────────────────────────
   N                     1               20
                         2               20
   Mean                  1         56.98000
                         2         58.38500
   Standard deviation    1         6.616137
                         2         6.405612
 ────────────────────────────────────────────

4.2 Boxplots

One of the drawbacks of histograms and density plots is that, although they are handy with respect to depicting the informal shape of a distribution, they are not so good at showing particular statistics about any given variable. This is where the boxplot comes in. Together with violin plots and dotplots (which can all be superimposed), you can offer your reader/listener a great deal of information about your variables.

We will use the Harpo data set provided by Navarro & Foxcroft (2019); we see this data again in Chapter 11 of Navarro & Foxcroft (2019). The data are the grades of 33 fictional students in “Dr. Harpo’s” statistics class. Dr. Harpo also has two tutors: Anastasia and Bernadette. Go to ( \(\equiv\) ) > Open > Data Library > Harpo.

NOTE: When you open the Harpo data set, you will need to change the grade variable to continuous, which you learned how to do in Section 2.1. You can leave the tutor variable alone.

TIP: Save this file as a .omv file. Not only are there a lot of analyses here that you will be carrying out, but we also re-visit this data in Chapter 6, when we work on t-tests.

Once you have addressed the note above, go to Exploration > Descriptives and simply slide the grade variable into the Variables box, and then (since you already know how to do this) slide the tutor variable into the Split by box. Finally, click the Box plot box under the Box Plots heading. Also check the following boxes, which will help you understand the boxplot: Quartiles , Median , Minimum , and Maximum . You can see these settings in Figure 4.2 below.

Figure 4.2: Parameter settings for boxplots along with related descriptive statistics.

NOTE: If we had chosen two separate variables to slide into the Variables box, and then clicked Box plot , you would have seen two separate plots, each with one boxplot. That is, to get the side-by-side plots, you need to have a Split by variable, which in turn must be nominal (e.g., time , Gender , tutor ), not continuous.

Along with the statistics, two side-by-side plots should appear in the output window. You can see all this below. But because of the length of the explanation (spanning several paragraphs), we are going to start explaining the boxplot above the descriptives and figure below, so that it is easier for you to scroll back and forth.

NOTE: The canonical orientation of boxplots is vertical, with the outcome variable along the y-axis. It is also possible to present boxplots as horizontal, with the outcome variable on the x-axis. However, the explanations below assume the canonical, vertical orientation.

Looking at the boxplots below, you are probably asking yourself: “What do the different elements of the box mean?” You have probably also noticed that there isn’t just a box in the plot. Rather, the box is divided into two by a horizontal line, and the box has vertical lines extending from the top and the bottom. Finally, there’s a strange dot at the bottom of the boxplot for the tutor named “Bernadette.”

First: the box itself. The box encloses the data from the 1st to the 3rd quartiles. As explained in the glossary (Section 3.5.2.5), the first quartile is the same as the 25th percentile, or the single score that lies just above the bottom 25% of the scores (ordered from lowest to highest). The first quartile is the bottom horizontal line of the box. The 3rd quartile is equivalent to the 75th percentile, or the score that lies just above the bottom 75% of scores. This is the top of the box. So if your grade was at the third quartile, then 75% of the class would have scores below yours. To verify this, you can compare the boxplots to the descriptives above them. The 25th percentile for Anastasia is 69. This corresponds to the bottom of “her” box. Likewise, the 75th percentile for her data is 79. This is the top of her box. You can verify this for Bernadette’s box as well, on your own.

All this also suggests that half of all the scores lie between the 1st and 3rd quartiles. So, in effect, the box shows you where the middle half of your data is. This also entails that half of the data is outside the confines of the box. The distance between the upper and lower edges of the box is what is known as the Interquartile Range, or IQR (see Navarro & Foxcroft (2019), Section 4.2.2, for more information).
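As a small worked example, the quartiles and the IQR can be computed with Python's standard statistics module. The grades below are invented, and the "inclusive" interpolation rule is an assumption made for this sketch; jamovi's percentiles may follow a slightly different convention, so do not expect identical decimals on every data set.

```python
import statistics

# Hypothetical grades, already sorted for readability (not the Harpo data).
grades = [55, 60, 65, 69, 70, 72, 74, 76, 78, 79, 81, 85, 90]

# n=4 cuts the data into quarters and returns the three cut points:
# 1st quartile, median (2nd quartile), 3rd quartile.
q1, q2, q3 = statistics.quantiles(grades, n=4, method="inclusive")

# The IQR is the height of the box: 3rd quartile minus 1st quartile.
iqr = q3 - q1

# Scores that fall inside the box (roughly the middle half of the data).
inside_box = [g for g in grades if q1 <= g <= q3]

print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {iqr}")
print(f"{len(inside_box)} of {len(grades)} scores lie inside the box")
```

With these 13 made-up scores, the box runs from 69 to 79, so the IQR is 10.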

So now, what does the horizontal line in the middle of the boxes mean? That line represents the median (also explained in the glossary, Section 3.5.1.2). Working from the discussion directly above, the median is simply the 2nd quartile, or the 50th percentile. This can be verified by looking at the values for the median and the 50th percentile; they are identical values (for Anastasia and Bernadette, respectively).

 DESCRIPTIVES

 Descriptives
 ────────────────────────────────────────────────
                         tutor            grade
 ────────────────────────────────────────────────
   Median                Anastasia           76
                         Bernadette    69.00000
   Standard deviation    Anastasia     8.998942
                         Bernadette    5.774918
   Minimum               Anastasia           55
                         Bernadette          56
   Maximum               Anastasia           90
                         Bernadette          79
   25th percentile       Anastasia     69.00000
                         Bernadette    66.25000
   50th percentile       Anastasia     76.00000
                         Bernadette    69.00000
   75th percentile       Anastasia     79.00000
                         Bernadette    73.00000
 ────────────────────────────────────────────────

In other words, the median is the middle value of your data, if you ordered your data from smallest to largest. That is, half of the data points should lie below the median, and half above (and if there is an even number of data points, the median is the average of the middle two values).
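The odd-versus-even rule just described can be written out as a short Python sketch. The two samples are made up, and the hand-written median function is only there to show the logic; Python's built-in statistics.median applies the same rule.

```python
import statistics


def median(values):
    """Return the middle value of the data once sorted.

    With an even number of values, return the average of the middle two.
    """
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical grades: one sample with an odd count, one with an even count.
odd_sample = [56, 69, 73, 66, 79]
even_sample = [56, 69, 73, 66, 79, 70]

print(median(odd_sample))   # sorted: 56, 66, 69, 73, 79, so the middle value is 69
print(median(even_sample))  # sorted: 56, 66, 69, 70, 73, 79, so (69 + 70) / 2 = 69.5

# The standard library function gives the same answers.
print(statistics.median(odd_sample), statistics.median(even_sample))
```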

The next feature is the pair of vertical lines extending from the top and bottom of the box, respectively. These are often called whiskers. Traditionally, each whisker extends from its end of the box out to the most extreme observation that lies within 1.5 times the interquartile range (the value at the 3rd quartile minus the value at the 1st quartile) of that end, a calculation we can attribute to Tukey (1977). The idea here is to capture all the data that is not considered potentially extreme within the ends of the whiskers.

Anything outside the whiskers, then, is considered possibly extreme, and could have an unexpectedly high influence on the calculation of the mean. This brings us to the final element of the boxplot: the dots (sometimes small circles). These appear beyond the whiskers and represent individual observations whose values are probably a bit extreme, given what we’d expect from a normal distribution.

Another name for such observations is outliers, or potential outliers, to be precise. Determining what is vs. isn’t an outlier is a little bit involved and goes well beyond boxplots. Your main textbook (Navarro & Foxcroft, 2019) covers this issue with respect to boxplots in Section 5.2.3, though the general concept is introduced in Section 4.2.1.

Note that in the boxplot figure there is one dot on the plot, which is located at the bottom of Bernadette’s group of students. This is a potential outlier. If you see any dots at all, then the top-most and/or bottom-most of these will be the maximum and/or minimum values in your data. For the person with the low grade who has Bernadette as a tutor, you can see that the minimum in her group is 56, which is exactly where the dot is at the bottom of Bernadette’s box. Boxplots provide no representations of data beyond these dots.
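You can check this dot by hand with the 1.5 times IQR rule, using only the quartiles, minimum, and maximum reported in the descriptives above. The Python sketch below does the arithmetic; it assumes jamovi applies the traditional 1.5 multiplier to the reported 25th and 75th percentiles, and the function name tukey_fences is hypothetical.

```python
# Values copied from the jamovi descriptives for the Harpo data shown above.
reported = {
    "Anastasia": {"q1": 69.0, "q3": 79.0, "min": 55, "max": 90},
    "Bernadette": {"q1": 66.25, "q3": 73.0, "min": 56, "max": 79},
}


def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) limits beyond which a point is a potential outlier.

    Assumption: the traditional multiplier k = 1.5 is used.
    """
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


flags = {}
for tutor, stats in reported.items():
    lower, upper = tukey_fences(stats["q1"], stats["q3"])
    # If the minimum (or maximum) lies beyond a fence, at least one dot appears there.
    flags[tutor] = {
        "low_outlier": stats["min"] < lower,
        "high_outlier": stats["max"] > upper,
    }
    print(f"{tutor}: fences = ({lower}, {upper}), {flags[tutor]}")
```

For Bernadette, the IQR is 73 - 66.25 = 6.75, so the lower limit is 66.25 - 1.5 × 6.75 = 56.125. Her group's minimum of 56 falls just below that, which is why it is drawn as a dot. For Anastasia, the limits are 54 and 94, so her minimum of 55 and maximum of 90 both fall within the whiskers.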

Overall, the boxplot is a quick way of visually determining critical characteristics of your data. Here are some guidelines: