Introduction

For family home evening, I worked on this project with my lovely wife Ivana and my fantastic assistant Blakely, our daughter. We wanted to answer one fun question together: do paper weight and airplane fold change how far a paper airplane travels? The response variable was flight distance in inches (in), measured from the launch line to the point of first contact with the ground. The two factors were paper weight, with levels Light and Heavy, and airplane design, with levels Basic Dart, Lock Bottom, and Lift Off. Light paper was white printer paper at about 75-80 gsm, and heavy paper was Pacon construction paper at about 120 gsm; color came along with each paper category and was not treated as a separate factor. I have also fallen in love with the Sandstone theme in R Markdown, and I plan to keep using it in my analyses from here on out. Overall, this was a cheerful way to practice three design choices: what to vary, what to measure, and how to keep every throw fair through randomization and control.

Blakely helping with family home evening data collection

Methods

Hypotheses

The significance level for all ANOVA tests was set to \(\alpha = 0.05\).

For paper type: \(H_0: \mu_{Light\cdot} = \mu_{Heavy\cdot}\) \(H_A: \mu_{Light\cdot} \ne \mu_{Heavy\cdot}\)

For fold design: \(H_0: \mu_{\cdot Basic} = \mu_{\cdot Lock} = \mu_{\cdot Lift}\) \(H_A\): at least one fold-design mean differs.

For interaction: \(H_0\): no paper-by-fold interaction. \(H_A\): a paper-by-fold interaction is present.
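One way to tie the three hypothesis pairs together is the usual two-factor effects model. The symbols \(\alpha_i\), \(\beta_j\), and \((\alpha\beta)_{ij}\) are notation added here for illustration and are not used elsewhere in this report:

\[
y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}
\]

Here \(i\) indexes paper weight (Light, Heavy), \(j\) indexes fold design (Basic Dart, Lock Bottom, Lift Off), \(k\) indexes throws within a cell, and the errors \(\varepsilon_{ijk}\) are assumed independent and normal with a common variance. Under this sketch, the paper null says every \(\alpha_i = 0\), the fold null says every \(\beta_j = 0\), and the interaction null says every \((\alpha\beta)_{ij} = 0\).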

Design

This was a completely randomized \(2 \times 3\) factorial study with six treatment combinations. The realized sample sizes were Test 1 = 10, Test 2 = 10, Test 3 = 8, Test 4 = 9, Test 5 = 12, and Test 6 = 11 throws, for 60 throws in all. That gives Basic Dart = 19 throws, Lock Bottom = 22, and Lift Off = 19. Throw order was randomized so that practice effects would not systematically favor one design.

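library(knitr); library(kableExtra)  # kable() and kable_styling()
# c.tb: one row per paper-by-fold cell with its throw count
kable(c.tb, digits = 0, col.names = c("Paper", "Fold", "n")) |>
  kable_styling(full_width = TRUE)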

Paper	Fold	n
Light	Basic Dart	10
Light	Lock Bottom	10
Light	Lift Off	8
Heavy	Basic Dart	9
Heavy	Lock Bottom	12
Heavy	Lift Off	11

Procedure

1. Fold planes into the three designs using the same paper size, so that the only planned material change is light printer paper versus heavy construction paper.
2. Label the six treatment combinations 1 through 6, exactly as used in the data file, then randomize the throw order in Excel before starting.
3. Launch every plane from the same marked line in the same indoor space so wind and floor layout stay as steady as possible.
4. Use the same thrower and the same general throwing motion for all trials, and measure the straight-line distance from the launch line to the first landing point.
5. Record the treatment code and distance for each throw immediately after the trial to reduce mix-ups.
6. Refold or replace planes as needed when wear becomes obvious, because bent noses and soft creases can change flight even when the treatment label stays the same.
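The randomization step in the procedure above was done in Excel. For anyone who would rather stay in R, the same idea can be sketched as follows. This is only an illustration: the seed, the object names `n_throws`, `codes`, and `run_sheet`, and the column name `throw_order` are all made up here, and the counts are the realized throws per treatment code from the Design section.

```r
# Hypothetical sketch: randomize throw order in R instead of Excel
set.seed(2024)  # arbitrary seed, used only so the order can be reproduced

# Realized throws per treatment code (Tests 1 through 6) from the Design section
n_throws <- c(10, 10, 8, 9, 12, 11)

# One entry per throw, labeled with its treatment code
codes <- rep(1:6, times = n_throws)

# Shuffle the codes so treatments are spread across the whole session
run_sheet <- data.frame(
  throw_order = seq_along(codes),
  code        = sample(codes)
)

head(run_sheet)
```

Shuffling the full list of labels, rather than throwing all of one treatment before moving to the next, is what keeps practice effects from lining up with a single design.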

Plane Designs

Figures: Basic Dart design, Lock Bottom design, Lift Off design.

Variance And Limits

The biggest extra sources of variation are throw strength, launch angle, small air currents, plane wear, and measurement error. Using one thrower, one launch line, one testing space, and randomized treatment order helps keep those issues from lining up with a specific treatment. One clear weakness is that the final cell sizes are not equal in the file, with counts ranging from 8 to 12, so the design is slightly unbalanced and the ANOVA needs to respect that.

Analysis

Data Summary

The highest sample mean came from Light paper with the Lift Off fold at 179.25 in (just ahead of Light paper with the Lock Bottom fold at 179.20 in), while the lowest came from Heavy paper with the Lock Bottom fold at 156.33 in. To keep comparisons easy and friendly to read, this section pairs numerical summaries (by cell, by paper weight, and by fold design) with matching plots. The spreads inside groups are fairly wide, with cell standard deviations of roughly 23 to 31 in, so visible mean gaps should be treated carefully.

library(mosaic)  # favstats()
# Five-number summary, mean, sd, and n for each paper-by-fold cell
favstats(y ~ interaction(wt, pl), data = d.df) |>
  kable(digits = 2) |>
  kable_styling(full_width = TRUE)

interaction(wt, pl)	min	Q1	median	Q3	max	mean	sd	n
Light.Basic Dart	119	144.00	161.5	170.50	207	159.00	24.48	10
Heavy.Basic Dart	135	147.00	164.0	172.00	203	163.00	22.53	9
Light.Lock Bottom	145	159.75	173.0	198.00	221	179.20	25.85	10
Heavy.Lock Bottom	102	138.00	155.0	179.00	214	156.33	31.06	12
Light.Lift Off	143	160.00	183.0	195.25	216	179.25	24.11	8
Heavy.Lift Off	126	145.50	160.0	185.50	201	163.36	24.94	11

favstats(y ~ wt, data = d.df) |>
  kable(digits = 2) |>
  kable_styling(full_width = TRUE)

wt	min	Q1	median	Q3	max	mean	sd	n	missing
Light	119	154.00	169.5	195.00	221	172.00	25.89	28	0
Heavy	102	142.25	159.5	183.25	214	160.62	26.18	32	0

favstats(y ~ pl, data = d.df) |>
  kable(digits = 2) |>
  kable_styling(full_width = TRUE)

pl	min	Q1	median	Q3	max	mean	sd	n
Basic Dart	119	144.0	164.0	171.50	207	160.89	23.01	19
Lock Bottom	102	146.5	165.5	185.75	221	166.73	30.45	22
Lift Off	126	150.5	162.0	191.50	216	170.05	25.23	19
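Because the cells are unequal, the marginal means in the tables above are n-weighted averages of the cell means, which is not quite the same as giving each cell an equal vote. Here is a minimal sketch of the difference for paper weight, using only the cell means and counts already reported. The object names `cells`, `weighted`, and `unweighted` are made up for this illustration.

```r
# Cell means (inches) and counts copied from the six-cell summary table
cells <- data.frame(
  wt   = c("Light", "Heavy", "Light", "Heavy", "Light", "Heavy"),
  pl   = c("Basic Dart", "Basic Dart", "Lock Bottom",
           "Lock Bottom", "Lift Off", "Lift Off"),
  mean = c(159.00, 163.00, 179.20, 156.33, 179.25, 163.36),
  n    = c(10, 9, 10, 12, 8, 11)
)

# n-weighted average of cell means: matches the favstats(y ~ wt) means
weighted <- sapply(split(cells, cells$wt),
                   function(g) sum(g$mean * g$n) / sum(g$n))

# Simple average of cell means: every fold design counts equally
unweighted <- sapply(split(cells, cells$wt),
                     function(g) mean(g$mean))

round(rbind(weighted, unweighted), 2)
```

The weighted row reproduces the 172.00 (Light) and 160.62 (Heavy) shown in the paper-weight table. The unweighted row is the version a Type III main-effect test compares when sum-to-zero contrasts are used, which is why the distinction matters in an unbalanced design like this one.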

boxplot(y ~ wt, data = d.df,
        main = "Paper Airplane Flight Distance by Paper Weight",
        xlab = "Paper type", ylab = "Distance (inches)")
stripchart(y ~ wt, data = d.df, add = TRUE, vertical = TRUE,
           method = "jitter", pch = 16, col = "gray35")

boxplot(y ~ pl, data = d.df,
        main = "Paper Airplane Flight Distance by Fold Design",
        xlab = "Fold design", ylab = "Distance (inches)")
stripchart(y ~ pl, data = d.df, add = TRUE, vertical = TRUE,
           method = "jitter", pch = 16, col = "gray35")

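# Mean distance per cell; ylim is widened to the raw data range so the
# overlaid throws are not clipped to the range of the cell means
interaction.plot(d.df$pl, d.df$wt, d.df$y, fun = mean,
                 main = "Paper Airplane Mean Flight Distance: Weight by Fold Design",
                 xlab = "Fold design", ylab = "Distance (inches)",
                 ylim = range(d.df$y),
                 trace.label = "Paper", lwd = 2, pch = 19,
                 col = c("#1f77b4", "#d62728"))
# Overlay individual throws, nudged left (Light) or right (Heavy) of each fold
points(as.numeric(d.df$pl) + ifelse(d.df$wt == "Light", -0.06, 0.06),
       d.df$y,
       pch = 16,
       col = ifelse(d.df$wt == "Light", "#1f77b4", "#d62728"))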

The lines are not perfectly parallel, and the overlaid points show substantial overlap across groups. This supports the same conclusion as the ANOVA interaction test: some pattern differences are visible, but they are not strong enough to be statistically significant in this sample, even though the plot is still fun to compare.

Assumptions

For the raw-distance ANOVA model, I checked normality, constant variance, and independence using visual diagnostics. Normality looked reasonable because the Q-Q residual plot stayed close to the line with only mild tail departures, and constant variance looked acceptable because the residual spread was fairly similar across fitted values without a strong funnel shape. The residuals-versus-throw-order plot did not show a clear run-order trend, so independence looked reasonable as well.
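As a quick numerical companion to the visual check of constant variance, the cell standard deviations from the Data Summary can be compared directly. This is an extra check added for illustration, not part of the diagnostics I originally ran. The "ratio under about 2" guideline is a common rule of thumb, not a formal test, and the names `cell_sd` and `sd_ratio` are made up here.

```r
# Cell standard deviations (inches) copied from the six-cell summary table
cell_sd <- c(
  "Light.Basic Dart"  = 24.48, "Heavy.Basic Dart"  = 22.53,
  "Light.Lock Bottom" = 25.85, "Heavy.Lock Bottom" = 31.06,
  "Light.Lift Off"    = 24.11, "Heavy.Lift Off"    = 24.94
)

# Ratio of the largest to the smallest cell standard deviation
sd_ratio <- max(cell_sd) / min(cell_sd)
round(sd_ratio, 2)
```

The ratio comes out near 1.38, comfortably under 2, which agrees with what the residual plot suggested.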

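# Residuals vs fitted (left) and normal Q-Q (right) for the interaction
# model i.lm, which is fit with lm() in the ANOVA section
op <- par(mfrow = c(1, 2))
plot(i.lm, which = 1)
plot(i.lm, which = 2)
par(op)  # restore the previous plotting layout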

plot(d.df$od, resid(i.lm),
     main = "Paper Airplane ANOVA Residuals by Throw Order",
     xlab = "Throw order", ylab = "Residual (inches)",
     pch = 16)
abline(h = 0, lty = 2)

ANOVA

Because the treatment counts are uneven across cells, Type III sums of squares are the better choice here. That lets each effect be tested after accounting for the others instead of letting the order of entry steer the result. None of the three F-tests reached the 0.05 level: paper weight \(F(1, 54) = 2.94\), \(p = 0.092\); fold design \(F(2, 54) = 0.77\), \(p = 0.469\); and the interaction \(F(2, 54) = 1.43\), \(p = 0.248\).

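# Type III tests of main effects are only meaningful with sum-to-zero
# contrasts, so they are set explicitly rather than relying on a global option
i.lm <- lm(y ~ wt * pl, data = d.df,
           contrasts = list(wt = "contr.sum", pl = "contr.sum"))
i.a3 <- as.data.frame(car::Anova(i.lm, type = 3))
i.a3$term <- rownames(i.a3)  # keep the term names as a regular column
rownames(i.a3) <- NULL
i.a3 <- i.a3 |>
  dplyr::select(term, `Sum Sq`, Df, `F value`, `Pr(>F)`)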

kable(i.a3, digits = 3, col.names = c("Term", "Sum Sq", "Df", "F", "p")) |>
  kable_styling(full_width = TRUE)

Term	Sum Sq	Df	F	p
(Intercept)	1638876.328	1	2433.686	0.000
wt	1978.809	1	2.938	0.092
pl	1033.598	2	0.767	0.469
wt:pl	1926.535	2	1.430	0.248
Residuals	36364.312	54	NA	NA
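To show where the F and p columns come from, the table can be rebuilt by hand from its own sums of squares and degrees of freedom. This is a small sketch for illustration only, and the object names (`ss`, `df`, `ms_res`, `f_stat`, `p_val`) are made up here.

```r
# Sums of squares and degrees of freedom copied from the Type III table
ss <- c(wt = 1978.809, pl = 1033.598, "wt:pl" = 1926.535)
df <- c(wt = 1, pl = 2, "wt:pl" = 2)
ss_res <- 36364.312
df_res <- 54

ms_res <- ss_res / df_res     # residual mean square
f_stat <- (ss / df) / ms_res  # F = effect mean square / residual mean square

# Upper-tail probability from the F distribution
p_val <- pf(f_stat, df, df_res, lower.tail = FALSE)

round(cbind(F = f_stat, p = p_val), 3)
```

Each F is the effect mean square divided by the residual mean square (about 673.4), so a single shared error estimate drives all three tests.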

Follow-Up

Because the overall ANOVA tests were not significant at \(\alpha = 0.05\), I did not run pairwise contrasts. Skipping post-hoc comparisons here keeps the interpretation honest and avoids over-interpreting noise when there is not enough evidence of a mean difference in the first place.

The practical question people may still ask is which setup looked best even if nothing was significant. On raw means alone, Light paper with the Lift Off fold landed farthest on average, but the within-group variation was large enough that this edge did not hold up as a reliable difference.

Case Study 2: Fly away birdy - Paper Airplane Report

Wil Jones