For family home evening, I worked on this project with my lovely wife Ivana and my fantastic assistant Blakely, our daughter, and we wanted to answer one fun question together: do paper weight and airplane fold change how far a paper airplane travels? The response variable was flight distance in inches (in), measured from the launch line to first contact with the ground. The two factors were paper weight with levels light and heavy, and airplane design with levels Basic Dart, Lock Bottom, and Lift Off. Light paper was white printer paper at about 75-80 gsm, and heavy paper was Pacon construction paper at about 120 gsm; color was part of those paper categories and was not treated as a separate factor. I have also fallen in love with the Sandstone theme in R Markdown, and I plan to keep using it in my analyses from here on out. Overall, this was a cheerful way to make the three design choices about what to vary, what to measure, and how to keep every throw fair through randomization and control.
The significance level for all ANOVA tests was set to \(\alpha = 0.05\).
For paper weight: \(H_0: \mu_{Light\cdot} = \mu_{Heavy\cdot}\) versus \(H_A: \mu_{Light\cdot} \ne \mu_{Heavy\cdot}\).
For fold design: \(H_0: \mu_{\cdot Basic} = \mu_{\cdot Lock} = \mu_{\cdot Lift}\) versus \(H_A\): at least one fold-design mean differs.
For interaction: \(H_0\): no paper-by-fold interaction, versus \(H_A\): a paper-by-fold interaction is present.
This was a completely randomized \(2 \times 3\) factorial study with six treatment combinations. The realized sample sizes were Test 1 = 10, Test 2 = 10, Test 3 = 8, Test 4 = 9, Test 5 = 12, and Test 6 = 11 throws, for 60 throws in total. By fold design, that gives Basic Dart = 19 throws, Lock Bottom = 22, and Lift Off = 19. Throw order was randomized so that practice or fatigue could not systematically favor one design.
kable(c.tb, digits = 0, col.names = c("Paper", "Fold", "n")) |>
kable_styling(full_width = TRUE)
| Paper | Fold | n |
|---|---|---|
| Light | Basic Dart | 10 |
| Light | Lock Bottom | 10 |
| Light | Lift Off | 8 |
| Heavy | Basic Dart | 9 |
| Heavy | Lock Bottom | 12 |
| Heavy | Lift Off | 11 |
The biggest extra sources of variation are throw strength, launch angle, small air currents, plane wear, and measurement error. Using one thrower, one launch line, one testing space, and randomized treatment order helps keep those issues from lining up with a specific treatment. One clear weakness is that the final cell sizes are not equal in the file, with counts ranging from 8 to 12, so the design is slightly unbalanced and the ANOVA must use sums of squares that do not depend on the order in which terms enter the model.
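As a rough illustration of how a randomized throw order for this kind of design could be generated, here is a minimal sketch. It assumes a planned 10 throws per cell (the realized counts differed), an arbitrary seed, and a hypothetical data frame named `plan`; only the column names `wt`, `pl`, and `od` mirror the ones used in this report.

```r
# Hypothetical plan: 10 throws per cell (the realized counts ranged from 8 to 12).
set.seed(2024)                      # arbitrary seed so the order can be reproduced
n_per_cell <- 10

# One row per planned throw, covering all 2 x 3 treatment combinations.
plan <- expand.grid(
  wt  = c("Light", "Heavy"),
  pl  = c("Basic Dart", "Lock Bottom", "Lift Off"),
  rep = seq_len(n_per_cell),
  stringsAsFactors = FALSE
)

# Shuffle the rows and record each row's position as the throw order.
plan <- plan[sample(nrow(plan)), ]
plan$od <- seq_len(nrow(plan))
rownames(plan) <- NULL

head(plan)
table(plan$wt, plan$pl)             # every cell should show 10 planned throws
```

Shuffling the full list of planned throws, rather than throwing one design at a time, is what keeps warm-up, fatigue, and plane wear from piling up on a single treatment.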
The highest sample mean came from Light paper with the Lift Off fold at 179.25 in, while the lowest came from Heavy paper with the Lock Bottom fold at 156.33 in. To keep comparisons easy and friendly to read, this section pairs numerical summaries (by cell, by paper weight, and by fold design) with matching plots. The spreads inside groups are fairly wide, with cell standard deviations of roughly 22 to 31 in, so visible mean gaps should be treated carefully.
favstats(y ~ interaction(wt, pl), data = d.df) |>
kable(digits = 2) |>
kable_styling(full_width = TRUE)
| interaction(wt, pl) | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| Light.Basic Dart | 119 | 144.00 | 161.5 | 170.50 | 207 | 159.00 | 24.48 | 10 | 0 |
| Heavy.Basic Dart | 135 | 147.00 | 164.0 | 172.00 | 203 | 163.00 | 22.53 | 9 | 0 |
| Light.Lock Bottom | 145 | 159.75 | 173.0 | 198.00 | 221 | 179.20 | 25.85 | 10 | 0 |
| Heavy.Lock Bottom | 102 | 138.00 | 155.0 | 179.00 | 214 | 156.33 | 31.06 | 12 | 0 |
| Light.Lift Off | 143 | 160.00 | 183.0 | 195.25 | 216 | 179.25 | 24.11 | 8 | 0 |
| Heavy.Lift Off | 126 | 145.50 | 160.0 | 185.50 | 201 | 163.36 | 24.94 | 11 | 0 |
favstats(y ~ wt, data = d.df) |>
kable(digits = 2) |>
kable_styling(full_width = TRUE)
| wt | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| Light | 119 | 154.00 | 169.5 | 195.00 | 221 | 172.00 | 25.89 | 28 | 0 |
| Heavy | 102 | 142.25 | 159.5 | 183.25 | 214 | 160.62 | 26.18 | 32 | 0 |
favstats(y ~ pl, data = d.df) |>
kable(digits = 2) |>
kable_styling(full_width = TRUE)
| pl | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| Basic Dart | 119 | 144.0 | 164.0 | 171.50 | 207 | 160.89 | 23.01 | 19 | 0 |
| Lock Bottom | 102 | 146.5 | 165.5 | 185.75 | 221 | 166.73 | 30.45 | 22 | 0 |
| Lift Off | 126 | 150.5 | 162.0 | 191.50 | 216 | 170.05 | 25.23 | 19 | 0 |
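Because the cells are unequal in size, the one-factor means in the tables above are weighted averages of the cell means. The sketch below, which simply retypes the cell means and counts from the six-cell summary table into a hypothetical data frame called `cells`, shows how the weighted and unweighted versions differ.

```r
# Cell means (inches) and counts copied from the six-cell summary table.
cells <- data.frame(
  wt   = c("Light", "Heavy", "Light", "Heavy", "Light", "Heavy"),
  pl   = c("Basic Dart", "Basic Dart", "Lock Bottom",
           "Lock Bottom", "Lift Off", "Lift Off"),
  mean = c(159.00, 163.00, 179.20, 156.33, 179.25, 163.36),
  n    = c(10, 9, 10, 12, 8, 11)
)

# Weighted by n: reproduces the means in the paper-weight favstats table.
weighted_wt <- sapply(split(cells, cells$wt),
                      function(g) weighted.mean(g$mean, g$n))

# Unweighted: each fold design counts equally, whatever its sample size.
unweighted_wt <- tapply(cells$mean, cells$wt, mean)

round(weighted_wt, 2)
round(unweighted_wt, 2)
```

The weighted means come out at 172.00 (Light) and 160.62 (Heavy), matching the table, while the unweighted means are about 172.48 and 160.90. The gap between the two versions is small here, which is one sign that the imbalance is mild.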
boxplot(y ~ wt, data = d.df,
main = "Paper Airplane Flight Distance by Paper Weight",
xlab = "Paper type", ylab = "Distance (inches)")
stripchart(y ~ wt, data = d.df, add = TRUE, vertical = TRUE,
method = "jitter", pch = 16, col = "gray35")
boxplot(y ~ pl, data = d.df,
main = "Paper Airplane Flight Distance by Fold Design",
xlab = "Fold design", ylab = "Distance (inches)")
stripchart(y ~ pl, data = d.df, add = TRUE, vertical = TRUE,
method = "jitter", pch = 16, col = "gray35")
interaction.plot(d.df$pl, d.df$wt, d.df$y, fun = mean,
main = "Paper Airplane Mean Flight Distance: Weight by Fold Design",
xlab = "Fold design", ylab = "Distance (inches)",
trace.label = "Paper", lwd = 2, pch = 19,
col = c("#1f77b4", "#d62728"))
points(as.numeric(d.df$pl) + ifelse(d.df$wt == "Light", -0.06, 0.06),
d.df$y,
pch = 16,
col = ifelse(d.df$wt == "Light", "#1f77b4", "#d62728"))
The lines in the interaction plot are not parallel, which hints at a possible interaction, but the overlaid points show substantial overlap across groups. This agrees with the ANOVA interaction test: some pattern differences are visible, but they are not strong enough to be statistically significant in this sample. The plot is still fun to compare.
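One way to put numbers on "not parallel" is to compute the Light-minus-Heavy gap within each fold design. The following is a small sketch that retypes the cell means from the six-cell summary table; the object names `m` and `gap` are placeholders of my own.

```r
# Cell means copied from the six-cell summary table (inches).
m <- matrix(
  c(159.00, 179.20, 179.25,    # Light
    163.00, 156.33, 163.36),   # Heavy
  nrow = 2, byrow = TRUE,
  dimnames = list(wt = c("Light", "Heavy"),
                  pl = c("Basic Dart", "Lock Bottom", "Lift Off"))
)

# Light-minus-Heavy gap within each fold design.
# Perfectly parallel lines in the interaction plot would mean three equal gaps.
gap <- m["Light", ] - m["Heavy", ]
round(gap, 2)
```

The gaps are -4.00 in for Basic Dart, 22.87 in for Lock Bottom, and 15.89 in for Lift Off. They differ in size and even in sign, but each one is smaller than the typical within-cell standard deviation of about 22 to 31 in.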
For the raw-distance ANOVA model, I checked normality, constant variance, and independence using visual diagnostics. Normality looked reasonable because the Q-Q residual plot stayed close to the line with only mild tail departures, and constant variance looked acceptable because the residual spread was fairly similar across fitted values without a strong funnel shape. The residuals-versus-throw-order plot did not show a clear run-order trend, so independence looked reasonable as well.
op <- par(mfrow = c(1, 2))
plot(i.lm, which = 1)
plot(i.lm, which = 2)
par(op)
plot(d.df$od, resid(i.lm),
main = "Paper Airplane ANOVA Residuals by Throw Order",
xlab = "Throw order", ylab = "Residual (inches)",
pch = 16)
abline(h = 0, lty = 2)
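As a numerical companion to the visual check of constant variance, a common rule of thumb compares the largest and smallest group standard deviations. This sketch is an addition rather than part of the original diagnostics, and the cutoff of 2 is a convention, not a formal test. The SDs are retyped from the six-cell summary table.

```r
# Cell standard deviations copied from the six-cell summary table (inches).
cell_sd <- c(
  "Light.Basic Dart"  = 24.48,
  "Heavy.Basic Dart"  = 22.53,
  "Light.Lock Bottom" = 25.85,
  "Heavy.Lock Bottom" = 31.06,
  "Light.Lift Off"    = 24.11,
  "Heavy.Lift Off"    = 24.94
)

# Rule of thumb (an assumption here, not part of the original analysis):
# constant variance is usually considered tenable when the largest group SD
# is less than about twice the smallest.
sd_ratio <- max(cell_sd) / min(cell_sd)
round(sd_ratio, 2)
sd_ratio < 2
```

The ratio is about 1.38, comfortably under 2, which lines up with the residual plot showing no strong funnel shape.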
Because the treatment counts are uneven across cells, Type III sums of squares are the better choice here. That lets each effect be tested after accounting for the others instead of letting the order of entry steer the result. None of the three F-tests reached the 0.05 level: paper weight \(F(1, 54) = 2.94\), \(p = 0.092\); fold design \(F(2, 54) = 0.77\), \(p = 0.469\); and the interaction \(F(2, 54) = 1.43\), \(p = 0.248\).
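As a worked check, each F statistic is the effect's mean square divided by the residual mean square, using the sums of squares and degrees of freedom reported in the Type III table:

\[
MS_E = \frac{SS_E}{df_E} = \frac{36364.312}{54} \approx 673.41
\]

\[
F_{wt} = \frac{1978.809 / 1}{673.41} \approx 2.94, \qquad
F_{pl} = \frac{1033.598 / 2}{673.41} \approx 0.77, \qquad
F_{wt:pl} = \frac{1926.535 / 2}{673.41} \approx 1.43
\]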
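i.lm <- lm(y ~ wt * pl, data = d.df,
  contrasts = list(wt = "contr.sum", pl = "contr.sum"))  # sum-to-zero coding, so Type III main effects compare averages over the other factor
i.a3 <- as.data.frame(car::Anova(i.lm, type = 3))  # Type III tests for the unbalanced design
i.a3$term <- rownames(i.a3)  # move term names into a regular column
rownames(i.a3) <- NULL
i.a3 <- i.a3 |>
  dplyr::select(term, `Sum Sq`, Df, `F value`, `Pr(>F)`)
kable(i.a3, digits = 3, col.names = c("Term", "Sum Sq", "Df", "F", "p")) |>
  kable_styling(full_width = TRUE)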
| Term | Sum Sq | Df | F | p |
|---|---|---|---|---|
| (Intercept) | 1638876.328 | 1 | 2433.686 | 0.000 |
| wt | 1978.809 | 1 | 2.938 | 0.092 |
| pl | 1033.598 | 2 | 0.767 | 0.469 |
| wt:pl | 1926.535 | 2 | 1.430 | 0.248 |
| Residuals | 36364.312 | 54 | NA | NA |
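The table can be cross-checked with two pieces of arithmetic. The sketch below recomputes the p-values from the reported F statistics, then rebuilds the paper-weight sum of squares as a contrast between unweighted averages of the cell means, which is what the Type III test for a main effect compares. All inputs are retyped from the tables in this report, and the object names are placeholders of my own.

```r
# F statistics and degrees of freedom copied from the Type III table.
f_tab <- data.frame(
  term = c("wt", "pl", "wt:pl"),
  f    = c(2.938, 0.767, 1.430),
  df1  = c(1, 2, 2),
  df2  = 54
)

# The upper-tail area of the F distribution gives each p-value.
f_tab$p <- pf(f_tab$f, f_tab$df1, f_tab$df2, lower.tail = FALSE)
f_tab$significant <- f_tab$p < 0.05
print(f_tab, digits = 3)

# Cell means and counts from the six-cell summary table,
# ordered Light (Basic, Lock, Lift) then Heavy (Basic, Lock, Lift).
cell_mean <- c(159.00, 179.20, 179.25, 163.00, 156.33, 163.36)
cell_n    <- c(10, 10, 8, 9, 12, 11)

# Contrast: unweighted Light average minus unweighted Heavy average.
w <- c(1, 1, 1, -1, -1, -1) / 3
est   <- sum(w * cell_mean)
ss_wt <- est^2 / sum(w^2 / cell_n)   # sum of squares for a 1-df contrast
round(c(estimate = est, ss_wt = ss_wt), 2)
```

The recomputed p-values round to 0.092, 0.469, and 0.248. The contrast sum of squares comes out near 1980, which agrees with the table's 1978.809 up to the rounding of the cell means to two decimals.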
Because the overall ANOVA tests were not significant at \(\alpha = 0.05\), I did not run pairwise contrasts. Skipping post-hoc comparisons here keeps the interpretation honest and avoids over-interpreting noise when there is not enough evidence of a mean difference in the first place.
A practical question people may still ask is which setup looked best, even though nothing was significant. On raw means alone, Light paper with the Lift Off fold landed farthest on average (179.25 in), essentially tied with Light paper with the Lock Bottom fold (179.20 in), but the within-group variation was large enough that this edge did not hold up as a reliable difference.
In this dataset, paper weight, fold design, and their interaction all failed to show a statistically significant effect on flight distance at \(\alpha = 0.05\). The study still gives a useful and upbeat starting point, and the next round would be stronger with tighter control of throwing mechanics, equal replication in every cell, and possibly more trials so small real differences are easier to spot.
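To get a rough feel for what "more trials" might mean, here is a hedged planning sketch using `power.t.test()` from base R. It rests on several assumptions that are mine rather than results of the study: the true Light-versus-Heavy gap equals the observed gap in marginal means, the error SD equals the residual SD from the ANOVA table, and a two-sample t-test is an adequate stand-in for the paper-weight main effect. Planning from an observed effect is noisy, so the answer is only a ballpark.

```r
# Assumptions (illustrative only, not results from the study):
#   - the true Light-vs-Heavy gap equals the observed gap in marginal means
#   - the error SD equals the residual SD from the fitted ANOVA model
#   - a two-sample t-test approximates the paper-weight main effect
delta <- 172.00 - 160.62          # observed difference in marginal means (inches)
sigma <- sqrt(36364.312 / 54)     # residual SD from the ANOVA table (inches)

# Solve for the per-group sample size giving 80% power at alpha = 0.05.
pw <- power.t.test(delta = delta, sd = sigma, sig.level = 0.05, power = 0.80)
n_per_paper <- ceiling(pw$n)      # throws needed for each paper weight
n_per_paper
```

Under these assumptions the required sample size lands at roughly 80 throws per paper weight, well above the 28 and 32 collected here, so a true difference of about 11 in could easily go undetected in a study of this size.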