Introduction

For family home evening, I worked on this project with my lovely wife Ivana and my fantastic assistant Blakely, our daughter. We wanted to answer one fun question together: do paper weight and fold design change how far a paper airplane travels? The response variable was flight distance in inches (in), measured from the launch line to the point of first contact with the ground. The two factors were paper weight, with levels Light and Heavy, and airplane design, with levels Basic Dart, Lock Bottom, and Lift Off. Light paper was white printer paper at about 75-80 gsm, and heavy paper was Pacon construction paper at about 120 gsm; color came along with those paper categories and was not treated as a separate factor. I have also fallen in love with the Sandstone theme in R Markdown, and I plan to keep using it in my analyses from here on out. Overall, this was a cheerful way to practice three design choices: what to vary, what to measure, and how to keep every throw fair through randomization and control.

Blakely helping with family home evening data collection

Methods

Hypotheses

The significance level for all ANOVA tests was set to \(\alpha = 0.05\).

For paper type: \(H_0: \mu_{Light\cdot} = \mu_{Heavy\cdot}\) \(H_A: \mu_{Light\cdot} \ne \mu_{Heavy\cdot}\)

For fold design: \(H_0: \mu_{\cdot Basic} = \mu_{\cdot Lock} = \mu_{\cdot Lift}\) \(H_A\): at least one fold-design mean differs.

For interaction: \(H_0\): no paper-by-fold interaction. \(H_A\): a paper-by-fold interaction is present.
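The three hypothesis pairs above can be read against the usual two-way effects model. The symbols \(\alpha_i\), \(\beta_j\), and \((\alpha\beta)_{ij}\) below are notation introduced here for illustration and do not appear elsewhere in this report; \(i\) indexes paper weight, \(j\) indexes fold design, and \(k\) indexes throws within a cell.

```latex
y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk},
\qquad \varepsilon_{ijk} \sim N(0, \sigma^2)

H_0: (\alpha\beta)_{ij} = 0 \ \text{for all } i, j
\qquad
H_A: (\alpha\beta)_{ij} \ne 0 \ \text{for at least one } (i, j)
```

Under this sketch, the interaction null says the gap between Light and Heavy paper is the same for every fold design, which is what parallel lines in an interaction plot would look like.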

Design

This was a completely randomized \(2 \times 3\) factorial study with six treatment combinations. The realized sample sizes were Test 1 = 10, Test 2 = 10, Test 3 = 8, Test 4 = 9, Test 5 = 12, and Test 6 = 11 throws, for 60 throws in total. By fold design, that gives Basic Dart = 19 throws, Lock Bottom = 22, and Lift Off = 19; by paper weight, Light = 28 and Heavy = 32. Throw order was randomized so that practice over the session would be spread fairly across treatments and no single design would be systematically favored.

kable(c.tb, digits = 0, col.names = c("Paper", "Fold", "n")) |>
  kable_styling(full_width = TRUE)
Paper  Fold          n
Light  Basic Dart   10
Light  Lock Bottom  10
Light  Lift Off      8
Heavy  Basic Dart    9
Heavy  Lock Bottom  12
Heavy  Lift Off     11

Procedure

  1. Fold planes into the three designs using the same paper size, keeping the only planned material change as light printer paper versus heavy construction paper.
  2. Label the six treatment combinations as 1 through 6 exactly as used in the file, then use Excel to randomize the throw order before starting.
  3. Launch every plane from the same marked line in the same indoor space so wind and floor layout stay as steady as possible.
  4. Use the same thrower and the same general throwing motion for all trials, and measure the straight-line distance from the launch line to the first landing point.
  5. Record the treatment code and distance for each throw immediately after the trial to reduce mix-ups.
  6. Refold or replace planes as needed when wear becomes obvious, because bent noses and soft creases can change flight even when the treatment label stays the same.
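The randomization in step 2 was done in Excel. As a rough R equivalent, the sketch below shuffles a run sheet that reproduces the realized cell counts. The seed, the object names (`cells`, `plan`), and the assumption that codes 1 through 6 follow the order of the sample-size table are all illustrative and not taken from the original file.

```r
# Hypothetical seed so the shuffled order can be reproduced
set.seed(2024)

# Six treatment combinations; fold varies fastest so rows match codes 1..6
# (assumed to follow the order of the sample-size table)
cells <- expand.grid(
  fold  = c("Basic Dart", "Lock Bottom", "Lift Off"),
  paper = c("Light", "Heavy"),
  stringsAsFactors = FALSE
)

# Realized number of throws per treatment code
n_per_cell <- c(10, 10, 8, 9, 12, 11)

# One row per throw, then shuffle the rows to get a random throw order
plan <- data.frame(treatment = rep(1:6, times = n_per_cell))
plan$paper <- cells$paper[plan$treatment]
plan$fold  <- cells$fold[plan$treatment]
plan <- plan[sample(nrow(plan)), ]
plan$order <- seq_len(nrow(plan))
rownames(plan) <- NULL

head(plan)
```

Shuffling whole rows keeps each throw attached to its treatment label, so the run sheet can be followed top to bottom during data collection.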

Plane Designs

Figures: Basic Dart design, Lock Bottom design, and Lift Off design.

Variance And Limits

The biggest extra sources of variation are throw strength, launch angle, small air currents, plane wear, and measurement error. Using one thrower, one launch line, one testing space, and randomized treatment order helps keep those issues from lining up with a specific treatment. One clear weakness is that the final cell sizes are not equal in the file, with counts ranging from 8 to 12, so the design is slightly unbalanced and the ANOVA needs to respect that.

Analysis

Data Summary

The highest sample mean came from Light paper with the Lift Off fold at 179.25 in, while the lowest came from Heavy paper with the Lock Bottom fold at 156.33 in. To keep comparisons easy and friendly to read, this section shows numerical summaries by treatment cell, by paper weight, and by fold design, followed by a plot for each factor and an interaction plot. The spreads inside groups are fairly wide (standard deviations of roughly 22 to 31 in), so visible mean gaps should be treated carefully.

favstats(y ~ interaction(wt, pl), data = d.df) |>
  kable(digits = 2) |>
  kable_styling(full_width = TRUE)
interaction(wt, pl)  min  Q1      median  Q3      max  mean    sd     n   missing
Light.Basic Dart     119  144.00  161.5   170.50  207  159.00  24.48  10  0
Heavy.Basic Dart     135  147.00  164.0   172.00  203  163.00  22.53   9  0
Light.Lock Bottom    145  159.75  173.0   198.00  221  179.20  25.85  10  0
Heavy.Lock Bottom    102  138.00  155.0   179.00  214  156.33  31.06  12  0
Light.Lift Off       143  160.00  183.0   195.25  216  179.25  24.11   8  0
Heavy.Lift Off       126  145.50  160.0   185.50  201  163.36  24.94  11  0
favstats(y ~ wt, data = d.df) |>
  kable(digits = 2) |>
  kable_styling(full_width = TRUE)
wt     min  Q1      median  Q3      max  mean    sd     n   missing
Light  119  154.00  169.5   195.00  221  172.00  25.89  28  0
Heavy  102  142.25  159.5   183.25  214  160.62  26.18  32  0
favstats(y ~ pl, data = d.df) |>
  kable(digits = 2) |>
  kable_styling(full_width = TRUE)
pl           min  Q1     median  Q3      max  mean    sd     n   missing
Basic Dart   119  144.0  164.0   171.50  207  160.89  23.01  19  0
Lock Bottom  102  146.5  165.5   185.75  221  166.73  30.45  22  0
Lift Off     126  150.5  162.0   191.50  216  170.05  25.23  19  0
boxplot(y ~ wt, data = d.df,
        main = "Paper Airplane Flight Distance by Paper Weight",
        xlab = "Paper type", ylab = "Distance (inches)")
stripchart(y ~ wt, data = d.df, add = TRUE, vertical = TRUE,
           method = "jitter", pch = 16, col = "gray35")

boxplot(y ~ pl, data = d.df,
        main = "Paper Airplane Flight Distance by Fold Design",
        xlab = "Fold design", ylab = "Distance (inches)")
stripchart(y ~ pl, data = d.df, add = TRUE, vertical = TRUE,
           method = "jitter", pch = 16, col = "gray35")

interaction.plot(d.df$pl, d.df$wt, d.df$y, fun = mean,
                 main = "Paper Airplane Mean Flight Distance: Weight by Fold Design",
                 xlab = "Fold design", ylab = "Distance (inches)",
                 trace.label = "Paper", lwd = 2, pch = 19,
                 col = c("#1f77b4", "#d62728"))
points(as.numeric(d.df$pl) + ifelse(d.df$wt == "Light", -0.06, 0.06),
       d.df$y,
       pch = 16,
       col = ifelse(d.df$wt == "Light", "#1f77b4", "#d62728"))

The lines are not perfectly parallel, and the overlaid points show substantial overlap across groups. This agrees with the ANOVA interaction test: some differences in pattern are visible, but they are not strong enough to be statistically significant in this sample. The plot is still fun to compare.

Assumptions

For the raw-distance ANOVA model, I checked normality, constant variance, and independence using visual diagnostics. Normality looked reasonable because the Q-Q residual plot stayed close to the line with only mild tail departures, and constant variance looked acceptable because the residual spread was fairly similar across fitted values without a strong funnel shape. The residuals-versus-throw-order plot did not show a clear run-order trend, so independence looked reasonable as well.

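# i.lm is the wt * pl interaction model fit to d.df; it must exist before this chunk runs
op <- par(mfrow = c(1, 2))
plot(i.lm, which = 1)  # residuals versus fitted values
plot(i.lm, which = 2)  # normal Q-Q plot of residuals
par(op)

plot(d.df$od, resid(i.lm),
     main = "Paper Airplane ANOVA Residuals by Throw Order",
     xlab = "Throw order", ylab = "Residual (inches)",
     pch = 16)
abline(h = 0, lty = 2)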
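As a numeric companion to the visual checks, the sketch below uses only the cell standard deviations and counts from the Data Summary table. It computes the ratio of the largest to the smallest cell standard deviation (a common rule of thumb treats a ratio under 2 as acceptable for ANOVA) and the pooled within-cell variance, which should land close to the residual mean square from the ANOVA table. The rule of thumb is a convention brought in here, not part of the original analysis.

```r
# Cell standard deviations and counts, in the row order of the cell summary table
cell_sd <- c(24.48, 22.53, 25.85, 31.06, 24.11, 24.94)
cell_n  <- c(10, 9, 10, 12, 8, 11)

# Rule-of-thumb check for constant variance: largest sd over smallest sd
sd_ratio <- max(cell_sd) / min(cell_sd)

# Pooled within-cell variance, weighting each cell by its degrees of freedom
pooled_var <- sum((cell_n - 1) * cell_sd^2) / sum(cell_n - 1)

# Residual mean square implied by the ANOVA table (Sum Sq / Df)
mse_table <- 36364.312 / 54

round(c(sd_ratio = sd_ratio, pooled_var = pooled_var, mse_table = mse_table), 2)
```

The pooled variance and the residual mean square differ only by rounding in the reported standard deviations, which is a quick consistency check between the summary table and the fitted model.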

ANOVA

Because the treatment counts are uneven across cells, Type III sums of squares are the better choice here. That lets each effect be tested after accounting for the others instead of letting the order of entry steer the result. None of the three F-tests reached the 0.05 level: paper weight \(F(1, 54) = 2.94\), \(p = 0.092\); fold design \(F(2, 54) = 0.77\), \(p = 0.469\); and the interaction \(F(2, 54) = 1.43\), \(p = 0.248\).
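To see why unequal cell counts matter, the sketch below rebuilds the paper-weight means two ways from the cell summary table: weighted by cell size, which reproduces the marginal means reported earlier, and unweighted, which gives each fold design equal say. Type III tests with sum-to-zero contrasts compare the unweighted version. The object names are illustrative.

```r
# Cell means and counts from the cell summary table (rows = paper, columns = fold)
dn <- list(paper = c("Light", "Heavy"),
           fold  = c("Basic Dart", "Lock Bottom", "Lift Off"))
cell_mean <- matrix(c(159.00, 179.20, 179.25,
                      163.00, 156.33, 163.36),
                    nrow = 2, byrow = TRUE, dimnames = dn)
cell_n <- matrix(c(10, 10,  8,
                    9, 12, 11),
                 nrow = 2, byrow = TRUE, dimnames = dn)

# Weighted marginal means: every throw counts equally
weighted_paper <- rowSums(cell_mean * cell_n) / rowSums(cell_n)

# Unweighted marginal means: every fold design counts equally
unweighted_paper <- rowMeans(cell_mean)

round(rbind(weighted = weighted_paper, unweighted = unweighted_paper), 2)
```

The two versions are close here because the imbalance is mild, but they are not identical, so the choice of sums of squares can shift the F-statistics slightly.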

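# Type III tests of main effects are only meaningful with sum-to-zero contrasts,
# so they are set explicitly here rather than relying on a global option
i.lm <- lm(y ~ wt * pl, data = d.df,
           contrasts = list(wt = contr.sum, pl = contr.sum))
i.a3 <- as.data.frame(car::Anova(i.lm, type = 3))
# Move the term names out of the row names so they print as a column
i.a3$term <- rownames(i.a3)
rownames(i.a3) <- NULL
i.a3 <- i.a3 |>
  dplyr::select(term, `Sum Sq`, Df, `F value`, `Pr(>F)`)
kable(i.a3, digits = 3, col.names = c("Term", "Sum Sq", "Df", "F", "p")) |>
  kable_styling(full_width = TRUE)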
Term         Sum Sq       Df  F         p
(Intercept)  1638876.328   1  2433.686  0.000
wt              1978.809   1     2.938  0.092
pl              1033.598   2     0.767  0.469
wt:pl           1926.535   2     1.430  0.248
Residuals      36364.312  54        NA     NA
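Each F-statistic in the table is the effect's mean square divided by the residual mean square, and the p-value is the upper tail of the matching F distribution. The sketch below redoes that arithmetic from the printed sums of squares, as a way to check the table by hand; the object names are illustrative.

```r
# Sums of squares and degrees of freedom copied from the Type III table
anova_tab <- data.frame(
  term = c("wt", "pl", "wt:pl"),
  ss   = c(1978.809, 1033.598, 1926.535),
  df   = c(1, 2, 2)
)
ss_resid <- 36364.312
df_resid <- 54

# Residual mean square (the estimate of within-cell variance)
mse <- ss_resid / df_resid

# F = (effect mean square) / (residual mean square)
anova_tab$f_value <- (anova_tab$ss / anova_tab$df) / mse

# p = upper-tail probability of the F distribution
anova_tab$p_value <- pf(anova_tab$f_value, anova_tab$df, df_resid,
                        lower.tail = FALSE)

print(anova_tab, digits = 3)
```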

Follow-Up

Because the overall ANOVA tests were not significant at \(\alpha = 0.05\), I did not run pairwise contrasts. Skipping post-hoc comparisons here keeps the interpretation honest and avoids over-interpreting noise when there is not enough evidence of a mean difference in the first place.

The practical question people may still ask is which setup looked best even if nothing was significant. On raw means alone, Light paper with the Lift Off fold landed farthest on average, but the within-group variation was large enough that this edge did not hold up as a reliable difference.

Conclusion

In this dataset, paper weight, fold design, and their interaction all failed to show a statistically significant effect on flight distance at \(\alpha = 0.05\). The study still gives a useful and upbeat starting point, and the next round would be stronger with tighter control of throwing mechanics, equal replication in every cell, and possibly more trials so small real differences are easier to spot.
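For a rough sense of how many more trials "more trials" might mean, the sketch below uses `power.t.test()` from base R to approximate the paper-weight comparison as a two-sample t-test. It treats the observed gap in marginal means as if it were the true difference and uses the residual standard deviation from the ANOVA table; both are strong assumptions, and the 80% power target is a convention chosen here rather than part of the original study.

```r
# Observed gap between Light and Heavy marginal means (inches)
delta_obs <- 172.00 - 160.62

# Residual standard deviation from the ANOVA table: sqrt(Sum Sq / Df)
sd_resid <- sqrt(36364.312 / 54)

# Throws per paper weight needed to detect delta_obs with 80% power at alpha = 0.05,
# under a simplified two-sample t-test (ignores the factorial structure)
pw <- power.t.test(delta = delta_obs, sd = sd_resid,
                   sig.level = 0.05, power = 0.80)
n_per_group <- ceiling(pw$n)

n_per_group
```

Because the observed gap is itself noisy, the result is best read as an order-of-magnitude guide for planning the next round, not a precise target.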