Edexcel FS2 2022 June — Question 1 7 marks

Exam Board: Edexcel
Module: FS2 (Further Statistics 2)
Year: 2022
Session: June
Marks: 7
Topic: Linear regression
Type: Hypothesis test for regression slope
Difficulty: Standard +0.3. This is a straightforward FS2 question testing standard linear regression concepts: prediction from a regression line (simple substitution), definition of a residual (recall), calculating the PMCC from given summary statistics (formula application), and interpreting a correlation coefficient and a residual plot (standard textbook interpretation). All parts are routine applications of learned techniques with no novel problem-solving required, making it slightly easier than average.
Spec: 5.08a Pearson correlation: calculate pmcc; 5.09a Dependent/independent variables; 5.09b Least squares regression: concepts; 5.09c Calculate regression line; 5.09d Linear coding: effect on regression; 5.09e Use regression: for estimation in context

1. Kwame is investigating a possible relationship between average March temperature, \(t^{\circ}\mathrm{C}\), and tea yield, \(y\) kg/hectare, for tea grown in a particular location. He uses 30 years of past data to produce the following summary statistics for a linear regression model, with tea yield as the dependent variable.

$$\begin{aligned} & \text{Residual Sum of Squares (RSS)} = 1666567 \quad \mathrm{S}_{tt} = 52.0 \quad \mathrm{S}_{yy} = 1774155 \\ & \text{least squares regression line:} \quad \text{gradient} = 45.5 \quad y\text{-intercept} = 2080 \end{aligned}$$

(a) Use the regression model to predict the tea yield for an average March temperature of \(20^{\circ}\mathrm{C}\)

He also produces the following residual plot for the data.

\includegraphics[max width=\textwidth, alt={}, center]{d139840b-16ec-42ce-8501-f79c263c8017-02_663_880_868_589}

(b) Explain what you understand by the term residual.

(c) Calculate the product moment correlation coefficient between \(t\) and \(y\)

(d) Explain why the linear model may not be a good fit for the data

    (i) with reference to your answer to part (c)

    (ii) with reference to the residual plot.

Kwame also collects data on total March rainfall, \(w\) mm, for each of these 30 years. For a linear regression model of \(w\) on \(t\) the following summary statistic is found.

$$\text{Residual Sum of Squares (RSS)} = 86754$$

Kwame concludes that since this model has a smaller RSS, there must be a stronger linear relationship between \(w\) and \(t\) than between \(y\) and \(t\) (where RSS \(= 1666567\)).

(e) State, giving a reason, whether or not you agree with the reasoning that led to Kwame's conclusion.

# Question 1:

## Part (a)
$[20 \times 45.5 + 2080] = 2990$ [kg/ha] | B1 (1 mark) | cao

## Part (b)
(A residual is the) difference between the observed value (oe) and the predicted value (oe) (of the dependent variable) | B1 (1 mark) | Correct definition. Allow equivalent wording. Distance from regression line on its own is B0, but allow if vertical distance or $y$ is referenced.
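
In symbols, one way to write this definition for the model in the question (the notation $e_i$ and $\hat{y}_i$ is an assumption made here for illustration, not taken from the paper):

$$e_i = y_i - \hat{y}_i, \qquad \hat{y}_i = 2080 + 45.5\,t_i$$

A positive residual means the observed yield lies above the regression line at that temperature, and a negative residual means it lies below.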

## Part (c)
$1666567 = 1774155(1-r^2)$ | M1 | Use of correct expression for $r$ or $r^2$. Allow use of $S_{ty} = 45.5 \times 52.0\ [= 2366]$, or of RSS: $1774155 - \frac{(S_{ty})^2}{52} = 1666567 \rightarrow [S_{ty} = 2365.2\ldots]$, and then $r = \frac{\text{awrt } 2365 \text{ or awrt } 2366}{\sqrt{52.0 \times 1774155}}$

$r = 0.246\ldots$ | A1 (2 marks) | awrt 0.246 ($-0.246$ or $\pm 0.246$ scores M1A0).
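
The two routes to $r$ allowed above can be checked numerically. The following is a minimal sketch using only the summary statistics quoted in the question; the variable names are chosen here for illustration. The positive root is taken because the gradient of the regression line is positive.

```python
import math

# Summary statistics quoted in the question.
RSS = 1666567
S_tt = 52.0
S_yy = 1774155
gradient = 45.5

# Route 1: RSS = S_yy * (1 - r^2), rearranged for r^2.
r_squared = 1 - RSS / S_yy
r_from_rss = math.sqrt(r_squared)  # positive root, since the gradient is positive

# Route 2: S_ty = gradient * S_tt, then r = S_ty / sqrt(S_tt * S_yy).
S_ty = gradient * S_tt
r_from_sty = S_ty / math.sqrt(S_tt * S_yy)

print(round(r_from_rss, 3), round(r_from_sty, 3))
```

Both routes agree to three decimal places; they differ slightly beyond that because the gradient 45.5 is itself a rounded value.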

## Part (d)(i)
Since $r$ is close to 0/weak correlation | B1 (1 mark) | Correct explanation

## Part (d)(ii)
e.g. (For $t > 20\ldots$) the residuals do not appear randomly scattered about 0. | B1 (1 mark) | Correct evaluation of the fit of the model's residuals (e.g. variance either side of $t = 20$ does not appear to be the same). 'Residuals not randomly scattered' on its own is B0.

## Part (e)
Kwame's conclusion cannot be supported using RSS since the two values of RSS do not have the same units. | B1 (1 mark) | Correct assessment of the conclusion involving the units/size of the variables used to calculate the RSS
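
The units point can be illustrated numerically. The sketch below uses made-up data (not Kwame's) and a hypothetical helper `fit_stats`. It records the same yields in kg and in tonnes and fits a line to each. RSS is measured in squared units of the dependent variable, so it changes by a factor of $1000^2$, while $r$ is unchanged.

```python
def fit_stats(x, y):
    """Return (RSS, r) for the least squares regression line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    rss = s_yy - s_xy ** 2 / s_xx      # residual sum of squares
    r = s_xy / (s_xx * s_yy) ** 0.5    # product moment correlation coefficient
    return rss, r


# Illustrative, invented data: temperatures and yields for six years.
t = [16.0, 17.5, 18.0, 19.5, 20.0, 21.5]
y_kg = [2700.0, 2900.0, 2750.0, 3100.0, 2950.0, 3050.0]
y_tonnes = [v / 1000 for v in y_kg]  # same yields, different units

rss_kg, r_kg = fit_stats(t, y_kg)
rss_tonnes, r_tonnes = fit_stats(t, y_tonnes)

print(rss_kg / rss_tonnes)  # ratio of the two RSS values
print(r_kg, r_tonnes)       # the two correlation coefficients
```

Because RSS depends on the units and size of the dependent variable, a smaller RSS for $w$ on $t$ says nothing by itself about the strength of the linear relationship; a unit-free measure such as $r$ is needed for that comparison.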

**Total: 7 marks**
\begin{enumerate}
  \item Kwame is investigating a possible relationship between average March temperature, $t ^ { \circ } \mathrm { C }$, and tea yield, $y \mathrm {~kg} /$ hectare, for tea grown in a particular location. He uses 30 years of past data to produce the following summary statistics for a linear regression model, with tea yield as the dependent variable.
\end{enumerate}

$$\begin{aligned}
& \text { Residual Sum of Squares } ( \mathrm { RSS } ) = 1666567 \quad \mathrm {~S} _ { t t } = 52.0 \quad \mathrm {~S} _ { y y } = 1774155 \\
& \text { least squares regression line: } \quad \text { gradient } = 45.5 \quad y \text {-intercept } = 2080
\end{aligned}$$

(a) Use the regression model to predict the tea yield for an average March temperature of $20 ^ { \circ } \mathrm { C }$

He also produces the following residual plot for the data.\\
\includegraphics[max width=\textwidth, alt={}, center]{d139840b-16ec-42ce-8501-f79c263c8017-02_663_880_868_589}\\
(b) Explain what you understand by the term residual.\\
(c) Calculate the product moment correlation coefficient between $t$ and $y$\\
(d) Explain why the linear model may not be a good fit for the data\\
(i) with reference to your answer to part (c)\\
(ii) with reference to the residual plot.

Kwame also collects data on total March rainfall, $w \mathrm {~mm}$, for each of these 30 years. For a linear regression model of $w$ on $t$ the following summary statistic is found.

$$\text { Residual Sum of Squares (RSS) = } 86754$$

Kwame concludes that since this model has a smaller RSS, there must be a stronger linear relationship between $w$ and $t$ than between $y$ and $t$ (where RSS $= 1666567$ )\\
(e) State, giving a reason, whether or not you agree with the reasoning that led to Kwame's conclusion.

\hfill \mbox{\textit{Edexcel FS2 2022 Q1 [7]}}