5.09c Calculate regression line

235 questions

OCR MEI Further Statistics Minor 2022 June Q2
13 marks Moderate -0.8
2 A forester is investigating the relationship between the diameter and the height of young beech trees. She selects a random sample of 15 young beech trees in a forest and records their diameters, \(d \mathrm {~cm}\), and their heights, \(h \mathrm {~m}\). The data are illustrated in the scatter diagram. \includegraphics[max width=\textwidth, alt={}, center]{e8624e9b-5143-49d2-9683-cc3a1082694e-3_649_1116_386_230}
  1. State whether either or both of the variables \(d\) and \(h\) are random variables. Summary data for the diameters and heights are as follows. $$n = 15 \quad \sum d = 84.9 \quad \sum h = 124.7 \quad \sum d ^ { 2 } = 624.55 \quad \sum h ^ { 2 } = 1230.57 \quad \sum d h = 866.63$$
  2. Find the equation of the regression line of \(h\) on \(d\). Give your answer in the form \(h = a d + b\), giving the values of \(a\) and \(b\) correct to \(\mathbf { 2 }\) decimal places.
  3. Use the regression line to predict the heights of beech trees with the following diameters.
    Comment on this in relation to your regression line.
  4. State the coordinates of the point at which the regression line of \(d\) on \(h\) meets the line which you calculated in part (b).
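A minimal sketch of parts 2 and 4, assuming the standard least-squares formulas \(S_{dd} = \sum d^2 - (\sum d)^2 / n\) and \(S_{dh} = \sum dh - \sum d \sum h / n\). The helper name `regression_from_sums` is illustrative and not part of the question.

```python
def regression_from_sums(n, sum_x, sum_y, sum_xx, sum_xy):
    """Return (gradient, intercept) of the least-squares line of y on x."""
    s_xx = sum_xx - sum_x ** 2 / n
    s_xy = sum_xy - sum_x * sum_y / n
    gradient = s_xy / s_xx
    intercept = sum_y / n - gradient * sum_x / n
    return gradient, intercept

# Summary data from the question; d is the explanatory variable.
a, b = regression_from_sums(15, 84.9, 124.7, 624.55, 866.63)
print(f"h = {a:.2f}d + {b:.2f}")

# Both regression lines (h on d, and d on h) pass through the mean point.
mean_point = (84.9 / 15, 124.7 / 15)
print(mean_point)
```

The mean point is the only quantity needed for part 4, so the line of \(d\) on \(h\) never has to be calculated.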
OCR MEI Further Statistics Minor 2023 June Q5
8 marks Moderate -0.8
5 An ornithologist is investigating the link between the wing length and the mass of small birds, in order to try to predict the mass from the wing length without having to weigh birds. The ornithologist takes a random sample of 9 birds and measures their wing lengths \(w \mathrm {~mm}\) and their masses \(m \mathrm {~g}\). The spreadsheet below shows the data, together with a scatter diagram which illustrates the data. \includegraphics[max width=\textwidth, alt={}, center]{72215d69-c3e6-492d-bb3e-bdc28aeb4613-5_719_1424_495_246}
  1. Find the equation of the regression line of \(m\) on \(w\), giving the coefficients correct to \(\mathbf { 3 }\) significant figures.
  2. Use the equation which you found in part (a) to estimate the mass for each of the following wing lengths.
    Comment on this suggestion.
OCR MEI Further Statistics Minor 2021 November Q2
9 marks Moderate -0.8
2 A road transport researcher is investigating the link between the age of a person, \(a\) years, and the distance, \(d\) metres, at which the person can read a large road sign. The researcher selects 13 individuals of different ages between 20 and 80 and measures the value of \(d\) for each of them. The spreadsheet below shows the data which the researcher obtained, together with a scatter diagram which illustrates the data. \includegraphics[max width=\textwidth, alt={}, center]{691e8b55-e9a1-4fff-b9ee-a71ff1f73ead-3_725_1566_495_251}
  1. Explain which of the two variables \(a\) and \(d\) is the independent variable.
  2. Find the equation of the regression line of \(d\) on \(a\).
  3. Use the regression line to predict the average distance at which a 60-year-old person can read the road sign.
  4. Explain why it might not be sensible to use the regression line to predict the average distance at which a 5 -year-old child can read the road sign.
  5. Determine the value of the residual for \(a = 40\).
  6. Explain why it would not be useful to find the equation of the regression line of \(a\) on \(d\).
OCR MEI Further Statistics Major 2019 June Q6
18 marks Moderate -0.8
6
  1. A researcher is investigating the date of the 'start of spring' at different locations around the country.
    A suitable date (measured in days from the start of the year) can be identified by checking, for example, when buds first appear for certain species of trees and plants, but this is time-consuming and expensive. Satellite data, measuring microwave emissions, can alternatively be used to estimate the date that land-based measurements would give. The researcher chooses a random sample of 12 locations, and obtains land-based measurements for the start of spring date at each location, together with relevant satellite measurements. The scatter diagram in Fig. 6.1 shows the results; the land-based measurements are denoted by \(x\) days and the corresponding values derived from satellite measurements by \(y\) days. \begin{figure}[h]
    \includegraphics[alt={},max width=\textwidth]{3a89edc4-ac93-4691-ade8-4d4665b55202-06_732_1342_781_333} \captionsetup{labelformat=empty} \caption{Fig. 6.1}
    \end{figure} Fig. 6.2 shows part of a spreadsheet used to analyse the data. Some rows of the spreadsheet have been deliberately omitted. \begin{table}[h]
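    \begin{tabular}{|c|c|c|c|c|c|c|}
    \hline
     & A & B & C & D & E & F \\
    \hline
    1 &  & \(x\) & \(y\) & \(x ^ { 2 }\) & \(y ^ { 2 }\) & \(xy\) \\
    \hline
    2 &  & 90 & 102 & 8100 & 10404 & 9180 \\
    \hline
    3 &  &  &  &  &  &  \\
    \hline
    10 &  &  &  &  &  &  \\
    \hline
    11 &  &  &  &  &  &  \\
    \hline
    12 &  & 94 & 97 & 8836 & 9409 & 9118 \\
    \hline
    13 &  & 99 & 101 & 9801 & 10201 & 9999 \\
    \hline
    14 & Sum & 1131 & 1227 & 107783 & 126725 & 116724 \\
    \hline
    15 &  &  &  &  &  &  \\
    \hline
    \end{tabular}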
    \captionsetup{labelformat=empty} \caption{Fig. 6.2}
    \end{table}
    1. Calculate the equation of a regression line suitable for estimating the land-based date of the start of spring from satellite measurements.
    2. Using this equation, estimate the land-based date of the start of spring for the following dates from satellite measurements.
      • 95 days
      • 60 days
    3. Comment on the reliability of each of your estimates.
  2. The researcher is also investigating whether there is any correlation between the average temperature during a month in spring and the total rainfall during that month at a particular location. The average temperatures in degrees Celsius and total rainfall in mm for a random selection, over several years, of 10 spring months at this location are as follows.
      \begin{tabular}{|l|c|c|c|c|c|c|c|c|c|c|}
      \hline
      Temperature & 4.2 & 7.1 & 5.6 & 3.5 & 8.6 & 6.5 & 2.7 & 5.9 & 6.7 & 4.1 \\
      \hline
      Rainfall & 18 & 26 & 42 & 76 & 15 & 43 & 84 & 53 & 66 & 36 \\
      \hline
      \end{tabular}
      The researcher plots the scatter diagram shown in Fig. 6.3 to check which type of test to carry out. \begin{figure}[h]
      \includegraphics[alt={},max width=\textwidth]{3a89edc4-ac93-4691-ade8-4d4665b55202-07_693_880_1174_338} \captionsetup{labelformat=empty} \caption{Fig. 6.3}
      \end{figure}
      1. Explain why the researcher might come to the conclusion that a test based on Pearson's product moment correlation coefficient may be valid.
      2. Find the value of Pearson's product moment correlation coefficient.
      3. Carry out a test at the \(5 \%\) significance level to investigate whether there is any correlation between temperature and rainfall.
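A hedged sketch of the first regression calculation in this question. Because the land-based date \(x\) is to be estimated from the satellite value \(y\), the sketch assumes the line of \(x\) on \(y\) is the suitable one and uses only the totals from the Sum row of Fig. 6.2. The function name `land_based_estimate` is illustrative.

```python
n = 12
sum_x, sum_y = 1131, 1227            # totals of x and y from the Sum row
sum_yy, sum_xy = 126725, 116724      # totals of y^2 and xy from the Sum row

# x on y: divide by S_yy, not S_xx, because y is the explanatory variable here.
s_yy = sum_yy - sum_y ** 2 / n
s_xy = sum_xy - sum_x * sum_y / n
gradient = s_xy / s_yy
intercept = sum_x / n - gradient * sum_y / n

def land_based_estimate(y):
    """Estimate the land-based date x from a satellite value y."""
    return intercept + gradient * y

for satellite_days in (95, 60):
    print(satellite_days, round(land_based_estimate(satellite_days), 1))
```

Whether each estimate is an interpolation or an extrapolation has to be judged from the scatter diagram, since the spreadsheet omits most of the data rows.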
OCR MEI Further Statistics Major 2022 June Q5
11 marks Moderate -0.3
5 A motorist is investigating the relationship between tyre pressure and temperature. As the temperature increases during a hot day, she records the pressure (measured in bars) of one of her car tyres at specific temperatures of \(20 ^ { \circ } \mathrm { C } , 22 ^ { \circ } \mathrm { C } , \ldots , 36 ^ { \circ } \mathrm { C }\). The results are shown in Table 5.1. \begin{table}[h]
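\begin{tabular}{|l|c|c|c|c|c|c|c|c|c|}
\hline
Temperature \(\left( t ^ { \circ } \mathrm { C } \right)\) & 20 & 22 & 24 & 26 & 28 & 30 & 32 & 34 & 36 \\
\hline
Tyre pressure (\(P\) bar) & 2.012 & 2.036 & 2.065 & 2.074 & 2.114 & 2.140 & 2.149 & 2.176 & 2.192 \\
\hline
\end{tabular}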
\captionsetup{labelformat=empty} \caption{Table 5.1}
\end{table}
  1. Calculate the equation of the regression line of pressure on temperature. Give your answer in the form \(P = a t + b\), giving the values of \(a\) and \(b\) to \(\mathbf { 4 }\) significant figures.
  2. Table 5.2 shows the residuals for most of the data values. Complete the copy of the table in the Printed Answer Booklet. \begin{table}[h]
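    \begin{tabular}{|l|c|c|c|c|c|c|c|c|c|}
    \hline
    Temperature & 20 & 22 & 24 & 26 & 28 & 30 & 32 & 34 & 36 \\
    \hline
    Residual tyre pressure & \(-0.003\) & \(-0.002\) & 0.004 & \(-0.010\) &  & 0.011 & \(-0.003\) & 0.001 &  \\
    \hline
    \end{tabular}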
    \captionsetup{labelformat=empty} \caption{Table 5.2}
    \end{table}
  3. With reference to the values of the residuals, comment on the goodness of fit of the regression line.
  4. Use your answer to part (a) to calculate an estimate of the pressure in the tyre at each of the following temperatures, giving your answers to \(\mathbf { 3 }\) decimal places.
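A short sketch of parts 1 and 2, working directly from the data in Table 5.1. It assumes the usual definitions \(a = S_{tP} / S_{tt}\) and residual = observed pressure minus predicted pressure. The variable names are illustrative.

```python
temps = [20, 22, 24, 26, 28, 30, 32, 34, 36]
pressures = [2.012, 2.036, 2.065, 2.074, 2.114, 2.140, 2.149, 2.176, 2.192]

n = len(temps)
t_bar = sum(temps) / n
p_bar = sum(pressures) / n
s_tt = sum((t - t_bar) ** 2 for t in temps)
s_tp = sum((t - t_bar) * (p - p_bar) for t, p in zip(temps, pressures))

a = s_tp / s_tt           # gradient of P on t
b = p_bar - a * t_bar     # intercept

# Residual = observed - fitted, one value per temperature.
residuals = {t: p - (a * t + b) for t, p in zip(temps, pressures)}
for t, res in residuals.items():
    print(t, f"{res:+.3f}")
```

Residuals that are small relative to the pressures and show no obvious pattern in sign are what part 3 is asking the reader to look for.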
OCR MEI Further Statistics Major 2023 June Q2
5 marks Easy -1.2
2 A student is investigating the link between temperature and electricity consumption in the winter months. The student finds the average minimum temperature, \(x ^ { \circ } \mathrm { C }\), from across the country on a day. The student then finds the total electricity consumption for that day, \(y \mathrm { GWh }\). The scatter diagram below shows the values of \(x\) and \(y\) obtained from a random sample of 10 winter days. It also shows the equation of the regression line of \(y\) on \(x\) and the value of \(r ^ { 2 }\), where \(r\) is the product moment correlation coefficient. \includegraphics[max width=\textwidth, alt={}, center]{c692fb20-436f-4bc1-89bd-10fdba41ceba-03_776_1043_609_244}
  1. Use the regression line to estimate the electricity consumption at each of the following average minimum temperatures.
OCR MEI Further Statistics Major 2024 June Q8
14 marks Moderate -0.3
8 An estate agent collects data for a random selection of 13 flats in order to investigate the link between the floor areas of flats and their price. The scatter diagram shows the floor areas, \(x \mathrm {~m} ^ { 2 }\), and prices, \(\pounds y\) thousand, of the 13 flats. \includegraphics[max width=\textwidth, alt={}, center]{bab116b3-6e5f-44db-ac86-670e4040d649-07_613_1246_386_242}
  1. The estate agent notes that two of the data points are outliers. One is Flat A which has a large floor area but is in poor condition. The other is Flat B which has a balcony with a desirable view overlooking the sea. Label these two data points on the copy of the scatter diagram in the Printed Answer Booklet. The estate agent decides to remove these two data points from the analysis. Summary statistics for the remaining 11 flats are as follows. $$\sum x = 652.5 \quad \sum y = 5067 \quad \sum x ^ { 2 } = 41987.35 \quad \sum y ^ { 2 } = 2456813 \quad \sum x y = 315928.2$$
  2. In this question you must show detailed reasoning. Calculate the equation of a regression line which is suitable for estimating the price of a flat from its floor area.
  3. Use the regression line to estimate the price for the following floor areas.
    Comment briefly on the estate agent's idea.
OCR MEI Further Statistics Major 2020 November Q5
13 marks Moderate -0.3
5 A hearing expert is investigating whether web-based hearing tests can be used instead of hearing tests in a hearing laboratory. The expert selects a random sample of 16 people with normal hearing. Each of them is given two hearing tests, one in the laboratory and one web-based. The scores in the laboratory-based test, \(x\), and the web-based test, \(y\), are both measured in the same suitable units.
  1. Half of the participants do the laboratory-based test first and the other half do the web-based test first. Explain why the expert adopts this approach. The scatter diagram in Fig. 5 shows the data that the expert collected. \begin{figure}[h]
    \includegraphics[alt={},max width=\textwidth]{8d36bc92-07ac-40c3-9e75-26f2bc9d2fcc-05_785_1360_1009_242} \captionsetup{labelformat=empty} \caption{Fig. 5}
    \end{figure} Summary statistics for these data are as follows. $$\Sigma x = 198.0 \quad \Sigma x ^ { 2 } = 2936.92 \quad \Sigma y = 188.7 \quad \Sigma y ^ { 2 } = 2605.35 \quad \Sigma x y = 2554.87$$
  2. Calculate the equation of the regression line suitable for estimating web-based scores from laboratory-based scores.
  3. Estimate the web-based scores of people whose laboratory-based scores were as follows.
    Stating the approximate coordinates of the outlier, suggest what the expert should do.
OCR MEI Further Statistics Major 2021 November Q8
16 marks Standard +0.3
8
  1. \(\mathrm { VO } _ { 2 \max }\) is a measure of athletic fitness. Since \(\mathrm { VO } _ { 2 \max }\) is fairly time-consuming and expensive to measure, an exercise scientist wants to predict \(\mathrm { VO } _ { 2 \max }\) from data such as times for running different distances. The scientist uses these data for a random sample of 15 athletes to predict their \(\mathrm { VO } _ { 2 \max }\) values, denoted by \(y\), in suitable units. She also obtains accurate measurements of the \(\mathrm { VO } _ { 2 \max }\) values, denoted by \(x\), in the same units. The scatter diagram in Fig. 8.1 shows the values of \(x\) and \(y\) obtained, together with the equation of the regression line of \(y\) on \(x\) and the value of \(r ^ { 2 }\). \begin{figure}[h]
    \includegraphics[alt={},max width=\textwidth]{ce557137-f9eb-4c09-a7e3-e4ec626109dc-08_750_1324_660_317} \captionsetup{labelformat=empty} \caption{Fig. 8.1}
    \end{figure}
    1. Use the regression line to estimate the predicted \(\mathrm { VO } _ { 2 \max }\) of an athlete whose accurately measured \(\mathrm { VO } _ { 2 \max }\) is 50.
    2. Comment on the reliability of your estimate.
    3. The equation of the regression line of \(x\) on \(y\) is \(x = 0.7565 y + 10.493\). Find the coordinates of the point at which the two regression lines meet.
    4. State what the point you found in part (iii) represents.
  2. It is known that there is negative correlation between \(\mathrm { VO } _ { 2 \max }\) and marathon times in very good runners (those whose best marathon times are under 3 hours). The exercise scientist wishes to know whether the same applies to runners who take longer to run a marathon. She selects a random sample of 20 runners whose best marathon times are between \(3 \frac { 1 } { 2 }\) hours and \(4 \frac { 1 } { 2 }\) hours and accurately measures their \(\mathrm { VO } _ { 2 \max }\). Fig. 8.2 is a scatter diagram of accurately measured \(\mathrm { VO } _ { 2 \max }\), \(v\) units, against best marathon time, \(t\) hours, for these runners. \begin{figure}[h]
    \includegraphics[alt={},max width=\textwidth]{ce557137-f9eb-4c09-a7e3-e4ec626109dc-09_671_1064_648_319} \captionsetup{labelformat=empty} \caption{Fig. 8.2}
    \end{figure}
    1. Explain why the exercise scientist comes to the conclusion that a test based on Pearson's product moment correlation coefficient may be valid. Summary statistics for the 20 runners are as follows. $$\sum t = 80.37 \quad \sum v = 970.86 \quad \sum t ^ { 2 } = 324.71 \quad \sum v ^ { 2 } = 47829.24 \quad \sum t v = 3886.53$$
    2. Find the value of Pearson's product moment correlation coefficient.
    3. Carry out a test at the \(5 \%\) significance level to investigate whether there is negative correlation between accurately measured \(\mathrm { VO } _ { 2 \max }\) and best marathon time for runners whose best marathon times are between \(3 \frac { 1 } { 2 }\) hours and \(4 \frac { 1 } { 2 }\) hours.
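A minimal sketch of the correlation coefficient calculation for the 20 runners, assuming the standard formula \(r = S_{tv} / \sqrt{S_{tt} S_{vv}}\) and the summary statistics quoted above. The critical value for the one-tailed test is not computed here and would come from statistical tables.

```python
from math import sqrt

n = 20
sum_t, sum_v = 80.37, 970.86
sum_tt, sum_vv, sum_tv = 324.71, 47829.24, 3886.53

s_tt = sum_tt - sum_t ** 2 / n
s_vv = sum_vv - sum_v ** 2 / n
s_tv = sum_tv - sum_t * sum_v / n

r = s_tv / sqrt(s_tt * s_vv)
print(round(r, 4))
```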
WJEC Further Unit 2 2019 June Q6
6 marks Moderate -0.3
6. The University of Arizona surveyed a large number of households. One purpose of the survey was to determine if annual household income could be predicted from size of family home. The graph of Annual household income, \(y\), versus Size of family home, \(x\), is shown below. \includegraphics[max width=\textwidth, alt={}, center]{4ecf99c5-c4b3-41b7-a8df-a7c2ca7fcd6a-5_616_1257_566_365}
  1. State the limitations of using the regression line above with reference to the scatter diagram. The data for size of family homes between 2000 and 3000 square feet are shown in the diagram below. \includegraphics[max width=\textwidth, alt={}, center]{4ecf99c5-c4b3-41b7-a8df-a7c2ca7fcd6a-5_652_1244_1516_360} Summary statistics for these data are as follows. $$\begin{array} { r c c } \sum x = 93160 & \sum y = 3907142 & n = 37 \\ S _ { x x } = 2869673.03 & S _ { y y } = 44312797167 & S _ { x y } = 348512820.6 \end{array}$$
  2. Calculate the equation of the least squares regression line to predict Annual household income from Size of family home for these data.
WJEC Further Unit 2 2022 June Q7
7 marks Moderate -0.3
7. Data from a large dataset shows the percentage of children enrolled in secondary education and the percentage of the adult population who are literate. The following graphs show data from 30 randomly selected regions from each of the Arab World, Africa and Asia. In each case, the least squares regression line of '\% Literacy' on '\% Enrolled in Secondary Education' is shown. \includegraphics[max width=\textwidth, alt={}, center]{77fd7ad7-f5a3-4947-afc6-e5ef45bef7a8-6_682_1200_584_395} \begin{figure}[h]
\captionsetup{labelformat=empty} \caption{Africa} \includegraphics[alt={},max width=\textwidth]{77fd7ad7-f5a3-4947-afc6-e5ef45bef7a8-6_623_1191_1548_397}
\end{figure} \includegraphics[max width=\textwidth, alt={}, center]{77fd7ad7-f5a3-4947-afc6-e5ef45bef7a8-7_665_1200_331_434}
  1. Calculate the equation of the least squares regression line of '\% Literacy' ( \(y\) ) on '\% Enrolled in Secondary Education' ( \(x\) ) for Asia, given the following summary statistics. $$\begin{array} { l l l } \sum x = 2850.836 & \sum y = 2738.656 & S _ { x x } = 88.42142 \\ S _ { y y } = 204.733 & S _ { x y } = 96.60984 & n = 30 \end{array}$$
  2. The Arab World, Africa and Asia each contain a region where \(70 \%\) are enrolled in secondary education. The three regression lines are used to estimate the corresponding \% Literacy. Which of these estimates is likely to be the most reliable? Clearly explain your reasoning.
WJEC Further Unit 2 2024 June Q4
12 marks Standard +0.8
4. An author poses the following question: Does using cash for transactions affect people's financial behaviour?
She collects data on 'Cash transactions as a \% of all transactions' and 'Household debt as a \(\%\) of net disposable income' from a random sample of 25 countries. The table below shows the data she collected. There are missing values, \(p\) and \(q\), for Malta and Denmark respectively.
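\begin{tabular}{|l|c|c|l|c|c|}
\hline
Country & Cash transactions as a \% of all transactions \(x\) & Household debt as a \% of net disposable income \(y\) & Country & Cash transactions as a \% of all transactions \(x\) & Household debt as a \% of net disposable income \(y\) \\
\hline
Malta & 92 & \(p\) & France & 68 & 120 \\
\hline
Mexico & 90 & \(-14\) & Luxembourg & 64 & 177 \\
\hline
Greece & 88 & 107 & Belgium & 63 & 113 \\
\hline
Spain & 87 & 110 & Finland & 54 & 137 \\
\hline
Italy & 86 & 87 & Estonia & 48 & 82 \\
\hline
Austria & 85 & 91 & The Netherlands & 45 & 247 \\
\hline
Portugal & 81 & 131 & UK & 42 & 147 \\
\hline
Slovenia & 80 & 56 & Australia & 37 & 214 \\
\hline
Germany & 80 & 95 & USA & 32 & 109 \\
\hline
Ireland & 79 & 154 & Sweden & 20 & 187 \\
\hline
Slovakia & 78 & 74 & South Korea & 14 & 182 \\
\hline
Lithuania & 75 & 46 & Denmark & \(q\) & 261 \\
\hline
Latvia & 71 & 43 &  &  &  \\
\hline
\end{tabular}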
The summary statistics and scatter diagram below are for the other 23 countries. \begin{figure}[h]
\captionsetup{labelformat=empty} \caption{Household debt versus Cash transactions} \includegraphics[alt={},max width=\textwidth]{1538fa56-5b61-40ec-bb02-cf1ed9da5eb0-13_664_1296_511_379}
\end{figure} $$\begin{gathered} \sum x = 1467 \quad \sum y = 2695 \quad \sum x ^ { 2 } = 105073 \quad \sum y ^ { 2 } = 394453 \quad \sum x y = 152999 \\ S _ { x x } = 11503.91304 \quad S _ { y y } = 78669.30435 \quad S _ { x y } = - 18895.13043 \end{gathered}$$
  1. Using the summary statistics for the 23 countries, calculate and interpret Pearson's product moment correlation coefficient.
  2. Calculate the equation of the least squares regression line of Household debt as a \% of net disposable income \(( y )\) on Cash transactions as a \% of all transactions \(( x )\). The regression line \(x\) on \(y\) is given below. $$x = - 0.24 y + 91.92$$
  3. By selecting the appropriate regression line in each case, estimate the values of \(p\) and \(q\) in the table.
  4. Comment on the reliability of your answers in part (c).
  5. Interpret the negative value of \(y\) for Mexico.
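A hedged sketch of parts 1 to 3, using the summary statistics for the 23 countries. It assumes that \(p\) (Malta, where \(x = 92\) is known) is estimated from the line of \(y\) on \(x\), and that \(q\) (Denmark, where \(y = 261\) is known) is estimated from the given line of \(x\) on \(y\).

```python
from math import sqrt

n = 23
sum_x, sum_y = 1467, 2695
s_xx, s_yy, s_xy = 11503.91304, 78669.30435, -18895.13043

r = s_xy / sqrt(s_xx * s_yy)

# y on x: x is known for Malta, so y is the quantity being estimated.
b = s_xy / s_xx
a = sum_y / n - b * sum_x / n
p = a + b * 92

# x on y (given in the question): y is known for Denmark.
q = -0.24 * 261 + 91.92

print(round(r, 3), round(a, 2), round(b, 3), round(p, 1), round(q, 1))
```

The value \(x = 92\) lies just above the largest \(x\) in the other 23 countries and \(y = 261\) just above the largest \(y\), which is relevant to the reliability comment in part 4.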
Edexcel FS2 AS 2018 June Q1
11 marks Moderate -0.3
  1. The scores achieved on a maths test, \(m\), and the scores achieved on a physics test, \(p\), by 16 students are summarised below.
$$\sum m = 392 \quad \sum p = 254 \quad \sum p ^ { 2 } = 4748 \quad \mathrm {~S} _ { m m } = 1846 \quad \mathrm {~S} _ { m p } = 1115$$
  1. Find the product moment correlation coefficient between \(m\) and \(p\)
  2. Find the equation of the linear regression line of \(p\) on \(m\) Figure 1 shows a plot of the residuals. \begin{figure}[h]
    \includegraphics[alt={},max width=\textwidth]{0fcb4d83-9763-4edd-8006-93f75a44c596-02_808_1222_997_429} \captionsetup{labelformat=empty} \caption{Figure 1}
    \end{figure}
  3. Calculate the residual sum of squares (RSS). For the person who scored 30 marks on the maths test,
  4. find the score on the physics test. The data for the person who scored 20 on the maths test is removed from the data set.
  5. Suggest a reason why. The product moment correlation coefficient between \(m\) and \(p\) is now recalculated for the remaining 15 students.
  6. Without carrying out any further calculations, suggest how you would expect this recalculated value to compare with your answer to part (a).
    Give a reason for your answer.
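A minimal sketch of parts 1 to 3 of this question, assuming the standard results \(S_{pp} = \sum p^2 - (\sum p)^2 / n\) and \(\mathrm{RSS} = S_{pp} - S_{mp}^2 / S_{mm}\). The later parts depend on reading a residual from Figure 1, so they are not attempted here.

```python
from math import sqrt

n = 16
sum_m, sum_p, sum_pp = 392, 254, 4748
s_mm, s_mp = 1846, 1115

s_pp = sum_pp - sum_p ** 2 / n

r = s_mp / sqrt(s_mm * s_pp)          # product moment correlation coefficient
b = s_mp / s_mm                       # gradient of p on m
a = sum_p / n - b * sum_m / n         # intercept of p on m
rss = s_pp - s_mp ** 2 / s_mm         # residual sum of squares

print(round(r, 3), round(a, 3), round(b, 3), round(rss, 2))
```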
Edexcel FS2 AS 2019 June Q3
11 marks Standard +0.3
  3. Two students, Jim and Dora, collected data on the mean annual rainfall, \(w \mathrm {~cm}\), and the annual yield of leeks, \(l\) tonnes per hectare, for 10 years.
Jim summarised the data as follows $$\mathrm { S } _ { w l } = 42.786 \quad \mathrm {~S} _ { w w } = 9936.9 \quad \sum l ^ { 2 } = 26.2326 \quad \sum l = 16.06$$
  1. Find the product moment correlation coefficient between \(l\) and \(w\) Dora decided to code the data first using \(s = w - 6\) and \(t = l - 20\)
  2. Write down the value of the product moment correlation coefficient between \(s\) and \(t\). Give a justification for your answer. Dora calculates the equation of the regression line of \(t\) on \(s\) to be \(t = 0.00431 s - 18.87\)
  3. Find the equation of the regression line of \(l\) on \(w\) in the form \(l = a + b w\), giving the values of \(a\) and \(b\) to 3 significant figures.
  4. Use your equation to estimate the yield of leeks when \(w\) is 100 cm .
  5. Calculate the residual sum of squares. The graph shows the residual for each value of \(l\) \includegraphics[max width=\textwidth, alt={}, center]{7e46e14a-0f5a-4d02-8f00-a92bc4def6d7-08_716_1594_1594_239}
    1. State whether this graph suggests that the use of a linear regression model is suitable for these data. Give a reason for your answer.
    2. Other than collecting more data, suggest how to improve the fit of the model in part (c) to the data.
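The decoding step for the regression line of \(l\) on \(w\) can be sketched by substituting \(s = w - 6\) and \(t = l - 20\) back into Dora's line. Only values given in the question are used.

$$\begin{aligned} l - 20 &= 0.00431 ( w - 6 ) - 18.87 \\ l &= 20 - 18.87 - 0.00431 \times 6 + 0.00431 w \\ l &= 1.10414 + 0.00431 w \approx 1.10 + 0.00431 w \end{aligned}$$

The gradient is unchanged because the coding only shifts each variable. As a check, Jim's summary gives \(S_{wl} / S_{ww} = 42.786 / 9936.9 \approx 0.00431\).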
Edexcel FS2 AS 2020 June Q4
14 marks Standard +0.3
  4. Some students are investigating the strength of wire by suspending a weight at the end of the wire. They measure the diameter of the wire, \(d \mathrm {~mm}\), and the weight, \(w\) grams, when the wire fails. Their results are given in the following table.
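\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|}
\hline
 & \multicolumn{14}{c|}{These 14 points are plotted on page 13} & \multicolumn{4}{c|}{Not yet plotted} \\
\hline
\(d\) & 0.5 & 0.6 & 0.7 & 0.8 & 0.9 & 1.1 & 1.3 & 1.6 & 2 & 2.4 & 2.8 & 3.3 & 3.5 & 3.9 & \(\mathbf { 4.5 }\) & \(\mathbf { 4.6 }\) & \(\mathbf { 4.8 }\) & \(\mathbf { 5.4 }\) \\
\hline
\(w\) & 1.2 & 1.7 & 2.3 & 3.0 & 3.8 & 5.6 & 7.7 & 11.6 & 18 & 25.9 & 34.9 & 47.4 & 52.7 & 63.9 & \(\mathbf { 81 }\) & \(\mathbf { 83.6 }\) & \(\mathbf { 89.9 }\) & \(\mathbf { 109.4 }\) \\
\hline
\end{tabular}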
The first 14 points are plotted on the axes on page 13.
  1. On the axes on page 13, complete the scatter diagram for these data.
  2. Use your calculator to write down the equation of the regression line of \(w\) on \(d\).
  3. With reference to the scatter diagram, comment on the appropriateness of using this linear regression model to make predictions for \(w\) for different values of \(d\) between 0.5 and 5.4 The product moment correlation coefficient for these data is \(r = 0.987\) (to 3 significant figures).
  4. Calculate the residual sum of squares (RSS) for this model. Robert, one of the students, suggests that the model could be improved and intends to find the equation of the line of regression of \(w\) on \(u\), where \(u = d ^ { 2 }\) He finds the following statistics $$\mathrm { S } _ { w u } = 5721.625 \quad \mathrm {~S} _ { u u } = 1482.619 \quad \sum u = 157.57$$
  5. By considering the physical nature of the problem, give a reason to support Robert's suggestion.
  6. Find the equation of the regression line of \(w\) on \(u\).
  7. Find the residual sum of squares (RSS) for Robert's model.
  8. State, giving a reason based on these calculations, which of these models better describes these data.
    1. Hence estimate the weight at which a piece of wire with diameter 3 mm will fail. \begin{figure}[h]
      \captionsetup{labelformat=empty} \caption{Question 4 continued} \includegraphics[alt={},max width=\textwidth]{fbd7b196-5372-4956-8d38-92f05c92a5f7-13_2315_1363_301_358}
      \end{figure}
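A hedged sketch comparing the two models in this question from the raw data: the line of \(w\) on \(d\), and Robert's line of \(w\) on \(u = d^2\). The helper `fit` is illustrative; it uses \(\mathrm{RSS} = S_{yy} - S_{xy}^2 / S_{xx}\), which is an assumption about the intended method rather than something stated in the question.

```python
d = [0.5, 0.6, 0.7, 0.8, 0.9, 1.1, 1.3, 1.6, 2, 2.4, 2.8, 3.3, 3.5, 3.9,
     4.5, 4.6, 4.8, 5.4]
w = [1.2, 1.7, 2.3, 3.0, 3.8, 5.6, 7.7, 11.6, 18, 25.9, 34.9, 47.4, 52.7,
     63.9, 81, 83.6, 89.9, 109.4]

def fit(x, y):
    """Least-squares line of y on x: returns (intercept, gradient, RSS)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    gradient = s_xy / s_xx
    return y_bar - gradient * x_bar, gradient, s_yy - s_xy ** 2 / s_xx

u = [di ** 2 for di in d]
linear = fit(d, w)        # w on d
quadratic = fit(u, w)     # w on u = d^2 (Robert's model)

# Estimate for a 3 mm wire using Robert's model.
estimate = quadratic[0] + quadratic[1] * 3 ** 2
print(linear, quadratic, round(estimate, 1))
```

The model with the smaller RSS fits these data more closely, which is the comparison asked for in part 8.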
Edexcel FS2 AS 2022 June Q3
10 marks Standard +0.3
  3. Gabriela is investigating a particular type of fish, called bream. She wants to create a model to predict the weight, \(w\) grams, of bream based on their length, \(x \mathrm {~cm}\).
For a sample of 27 bream, some summary statistics are given below. $$\begin{gathered} \bar { x } = 31.07 \quad \bar { w } = 628.59 \quad \sum w ^ { 2 } = 11386134 \\ \mathrm {~S} _ { x w } = 13082.3 \quad \mathrm {~S} _ { x x } = 260.8 \end{gathered}$$
  1. Find the value of the product moment correlation coefficient between \(x\) and \(w\)
  2. Explain whether the answer to part (a) is consistent with a linear model for these data.
  3. Find the equation of the regression line of \(w\) on \(x\) in the form \(w = a + b x\) A residual plot for these data is shown below. \includegraphics[max width=\textwidth, alt={}, center]{128c408d-3e08-4f74-8f19-d33ecd5c882f-06_931_1790_1107_139} One of the bream in the sample has a length of 32 cm .
  4. Find its weight.
  5. With reference to the residual plot, comment on the model for bream with lengths above 33 cm .
Edexcel FS2 AS 2023 June Q3
10 marks Standard +0.3
  3. Pat is investigating the relationship between the height of professional tennis players and the speed of their serve. Data from 9 randomly selected professional male tennis players were collected. The variables recorded were the height of each player, \(h\) metres, and the maximum speed of their serve, \(v \mathrm {~km} / \mathrm { h }\).
Pat summarised these data as follows $$\sum h = 17.63 \quad \sum v = 2174.9 \quad \sum v ^ { 2 } = 526407.8 \quad S _ { h h } = 0.0487 \quad S _ { h v } = 5.1376$$
  1. Calculate the product moment correlation coefficient between \(h\) and \(v\)
  2. Explain whether the answer to part (a) is consistent with a linear model for these data.
  3. Find the equation of the regression line of \(v\) on \(h\) in the form \(v = a + b h\) where \(a\) and \(b\) are to be given to one decimal place. Pat calculated the sum of the residuals for the 9 tennis players as 1.04
  4. Without doing a calculation, explain how you know Pat has made a mistake. Pat made one mistake in the calculation. For the tennis player of height 1.96 m Pat misread the residual as 2.27
  5. Find the maximum speed of serve, in km/h, for the tennis player of height 1.96 m
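A minimal sketch of parts 3 to 5. It relies on the fact that the residuals of a least-squares regression line sum to zero, so the whole of Pat's reported total (1.04) is attributed to the one misread residual. Using the 1 d.p. coefficients from part 3 in the final step is an assumption about the intended working.

```python
n = 9
sum_h, sum_v = 17.63, 2174.9
s_hh, s_hv = 0.0487, 5.1376

b = s_hv / s_hh
a = sum_v / n - b * sum_h / n

# Correct residuals sum to 0; Pat's sum of 1.04 contains one misread value (2.27).
true_residual = 2.27 - 1.04

# Observed speed = fitted value + residual, with coefficients rounded as in part 3.
a1, b1 = round(a, 1), round(b, 1)
speed = a1 + b1 * 1.96 + true_residual
print(a1, b1, round(true_residual, 2), round(speed, 2))
```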
Edexcel FS2 AS 2024 June Q5
8 marks Standard +0.3
  5. A random sample of 24 adults is taken. The height, \(h\) metres, and the arm span, \(s\) metres, for each adult are recorded.
These data are summarised below. $$\mathrm { S } _ { h h } = 0.377 \quad \mathrm {~S} _ { s h } = 0.352 \quad \bar { s } = 1.70 \quad \bar { h } = 1.68$$ The least squares regression line of \(h\) on \(s\) is $$h = a + 0.919 s$$ where \(a\) is a constant.
  1. Calculate the product moment correlation coefficient. A doctor uses the least squares regression line of \(h\) on \(s\) as a model to predict a person's height based on their arm span.
  2. Use the model to predict the height of an adult with arm span 1.79 metres. Ewan has an arm span of 1.70 metres and a height of 1.75 metres. His information is added to the sample as the 25th adult.
  3. Explain how the gradient of the regression line for the sample of 25 adults compares with the gradient of the regression line for the original sample of 24 adults.
    Give a reason for your answer.
Edexcel FS2 AS Specimen Q3
11 marks Standard +0.3
  3. A scientist wants to develop a model to describe the relationship between the average daily temperature, \(x ^ { \circ } \mathrm { C }\), and a household's daily energy consumption, \(y \mathrm {~kWh}\), in winter.
A random sample of the average temperature and energy consumption are taken from 10 winter days and are summarised below. $$\begin{gathered} \sum x = 12 \quad \sum x ^ { 2 } = 24.76 \quad \sum y = 251 \quad \sum y ^ { 2 } = 6341 \quad \sum x y = 284.8 \\ S _ { x x } = 10.36 \quad S _ { y y } = 40.9 \end{gathered}$$
  1. Find the product moment correlation coefficient between \(y\) and \(x\).
  2. Find the equation of the regression line of \(y\) on \(x\) in the form \(y = a + b x\)
  3. Use your equation to estimate the daily energy consumption when the average daily temperature is \(2 ^ { \circ } \mathrm { C }\)
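  4. Calculate the residual sum of squares (RSS). The table shows the residual for each value of \(x\).
    \begin{tabular}{|l|c|c|c|c|c|c|c|c|c|c|}
    \hline
    \(x\) & \(-0.4\) & \(-0.2\) & 0.3 & 0.8 & 1.1 & 1.4 & 1.8 & 2.1 & 2.5 & 2.6 \\
    \hline
    Residual & \(-0.63\) & \(-0.32\) & \(-0.52\) & \(-0.73\) & 0.74 & 2.22 & 1.84 & 0.32 & \(f\) & \(-1.88\) \\
    \hline
    \end{tabular}
  5. Find the value of \(f\).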
  6. By considering the signs of the residuals, explain whether or not the linear regression model is a suitable model for these data.
Edexcel FS2 2019 June Q2
10 marks Standard +0.3
2 A large field of wheat is split into 8 plots of equal area. Each plot is treated with a different amount of fertiliser, \(f\) grams \(/ \mathrm { m } ^ { 2 }\). The yield of wheat, \(w\) tonnes, from each plot is recorded. The results are summarised below. $$\sum f = 28 \quad \sum w = 303 \quad \sum w ^ { 2 } = 13447 \quad \mathrm {~S} _ { f f } = 42 \quad \mathrm {~S} _ { f w } = 269.5$$
  1. Calculate the product moment correlation coefficient between \(f\) and \(w\)
  2. Interpret the value of your product moment correlation coefficient.
  3. Find the equation of the regression line of \(w\) on \(f\) in the form \(w = a + b f\)
  4. Using your equation, estimate the decrease in yield when the amount of fertiliser decreases by 0.5 grams \(/ \mathrm { m } ^ { 2 }\) The residuals of the data recorded are calculated and plotted on the graph below. \includegraphics[max width=\textwidth, alt={}, center]{67df73d4-6ce4-45f7-8a69-aa94292ea814-04_1232_1294_1169_301}
  5. With reference to this graph, comment on the suitability of the model you found in part (c).
  6. Suggest how you might be able to refine your model.
Edexcel FS2 2021 June Q4
10 marks Standard +0.3
  4. A researcher is investigating the relationship between elevation, \(x\) metres, and annual mean temperature, \(t ^ { \circ } \mathrm { C }\).
From a random sample of 20 weather stations in Switzerland, the following results were obtained $$\mathrm { S } _ { x x } = 8820655 \quad \mathrm {~S} _ { t t } = 444.7 \quad \sum x = 28130 \quad \sum t = 94.62$$ The product moment correlation coefficient for these data is found to be \(- 0.959\).
  1. Interpret the value of this correlation coefficient.
  2. Show that the equation of the regression line of \(t\) on \(x\) can be written as $$t = 14.3 - 0.00681 x$$ The random variable \(W\) represents the elevations of the weather stations in kilometres.
  3. Write down the equation of the regression line of \(t\) on \(w\) for these 20 weather stations in the form \(t = a + b w\)
  4. Show that the residual sum of squares (RSS) for the model for \(t\) and \(x\) is 35.7 correct to one decimal place. One of the weather stations in the sample had a recorded elevation of 1100 metres and an annual mean temperature of \(1.4 ^ { \circ } \mathrm { C }\)
    1. Calculate this weather station's contribution to the residual sum of squares. Give your answer as a percentage
    2. Comment on the data for this weather station in light of your answer to part (e)(i).
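Two steps of this question can be sketched as worked equations, assuming \(w\) is the elevation in kilometres (so \(x = 1000 w\)) and taking the quoted line and RSS as given.

$$t = 14.3 - 0.00681 x = 14.3 - 0.00681 ( 1000 w ) = 14.3 - 6.81 w$$

For the station at 1100 metres with annual mean temperature \(1.4 ^ { \circ } \mathrm { C }\):

$$\begin{aligned} \text { residual } &= 1.4 - ( 14.3 - 0.00681 \times 1100 ) = 1.4 - 6.809 = - 5.409 \\ \frac { ( - 5.409 ) ^ { 2 } } { 35.7 } &= \frac { 29.257 } { 35.7 } \approx 0.82 \end{aligned}$$

On these figures, this single station accounts for roughly \(82 \%\) of the residual sum of squares.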
Edexcel FS2 2022 June Q1
7 marks Standard +0.3
  1. Kwame is investigating a possible relationship between average March temperature, \(t ^ { \circ } \mathrm { C }\), and tea yield, \(y \mathrm {~kg} /\) hectare, for tea grown in a particular location. He uses 30 years of past data to produce the following summary statistics for a linear regression model, with tea yield as the dependent variable.
$$\begin{aligned} & \text { Residual Sum of Squares } ( \mathrm { RSS } ) = 1666567 \quad \mathrm {~S} _ { t t } = 52.0 \quad \mathrm {~S} _ { y y } = 1774155 \\ & \text { least squares regression line: } \quad \text { gradient } = 45.5 \quad y \text {-intercept } = 2080 \end{aligned}$$
  1. Use the regression model to predict the tea yield for an average March temperature of \(20 ^ { \circ } \mathrm { C }\). He also produces the following residual plot for the data. \includegraphics[max width=\textwidth, alt={}, center]{d139840b-16ec-42ce-8501-f79c263c8017-02_663_880_868_589}
  2. Explain what you understand by the term residual.
  3. Calculate the product moment correlation coefficient between \(t\) and \(y\)
  4. Explain why the linear model may not be a good fit for the data
    1. with reference to your answer to part (c)
    2. with reference to the residual plot. Kwame also collects data on total March rainfall, \(w \mathrm {~mm}\), for each of these 30 years. For a linear regression model of \(w\) on \(t\) the following summary statistic is found. $$\text { Residual Sum of Squares (RSS) = } 86754$$ Kwame concludes that since this model has a smaller RSS, there must be a stronger linear relationship between \(w\) and \(t\) than between \(y\) and \(t\) (where RSS \(= 1666567\) )
  5. State, giving a reason, whether or not you agree with the reasoning that led to Kwame's conclusion.
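The numerical parts of the question above can be sketched in Python as follows. This assumes the identity \(\mathrm{RSS} = S_{yy}(1 - r^2)\) and takes the sign of \(r\) from the sign of the gradient; the variable names are illustrative only.

```python
import math

# Summary statistics quoted in the question
rss, s_tt, s_yy = 1666567, 52.0, 1774155
gradient, intercept = 45.5, 2080

# Predicted yield (kg/hectare) at an average March temperature of 20 deg C
predicted = intercept + gradient * 20

# RSS = S_yy * (1 - r^2), so r^2 = 1 - RSS / S_yy
r_squared = 1 - rss / s_yy
# r takes the sign of the gradient, which is positive here
r = math.copysign(math.sqrt(r_squared), gradient)

print(f"predicted yield = {predicted:.0f} kg/hectare, r = {r:.3f}")
```

Because RSS is measured in the squared units of the dependent variable, the values for \(y\) and \(w\) are not directly comparable; a scale-free measure such as \(r\) would be needed to compare the strength of the two linear relationships.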
Edexcel FS2 2023 June Q1
7 marks Easy -1.2
  1. Baako is investigating the times taken by children to run a 100 m race, \(x\) seconds, and a 500 m race, \(y\) seconds. For a sample of 20 children, Baako obtains the time taken by each child to run each race.
Here are Baako's summary statistics. $$\begin{gathered} \mathrm { S } _ { x x } = 314.55 \quad \mathrm {~S} _ { y y } = 9026 \quad \mathrm {~S} _ { x y } = 1610 \\ \bar { x } = 19.65 \quad \bar { y } = 108 \end{gathered}$$
  1. Calculate the product moment correlation coefficient between the times taken to run the 100 m race and the times taken to run the 500 m race.
  2. Show that the equation of the regression line of \(y\) on \(x\) can be written as $$y = 5.12 x + 7.42$$ where the gradient and \(y\) intercept are given to 3 significant figures. The child who completed the 100 m race in 20 seconds took 104 seconds to complete the 500 m race.
  3. Find the residual for this child. The table below shows the signs of the residuals for the 20 children in order of finishing time for the 100 m race.
    Sign of residual: + + + + - - + - - - - - - - - + + + + +
  4. Explain what the signs of the residuals show about the model's predictions of the 500 m race times for the children who are fastest and slowest over the 100 m race.
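A minimal Python sketch of the calculations in the question above, assuming the usual formulas \(r = S_{xy} / \sqrt{S_{xx} S_{yy}}\), \(b = S_{xy} / S_{xx}\) and \(a = \bar{y} - b \bar{x}\). The residual is taken as observed minus predicted and uses the rounded line stated in the question; the variable names are illustrative.

```python
import math

# Baako's summary statistics
s_xx, s_yy, s_xy = 314.55, 9026, 1610
x_bar, y_bar = 19.65, 108

# Product moment correlation coefficient
r = s_xy / math.sqrt(s_xx * s_yy)

# Regression line of y on x: y = b*x + a
b = s_xy / s_xx
a = y_bar - b * x_bar

# Residual (observed - predicted) for the child with x = 20, y = 104,
# using the rounded line y = 5.12x + 7.42 given in the question
residual = 104 - (5.12 * 20 + 7.42)

print(f"r = {r:.3f}, y = {b:.2f}x + {a:.2f}, residual = {residual:.2f}")
```

Using the unrounded coefficients instead of the rounded line would shift the residual slightly, so the value depends on which form of the line is used.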
Edexcel FS2 2024 June Q1
9 marks Standard +0.3
  1. Two students are experimenting with some water in a plastic bottle. The bottle is filled with water and a hole is put in the bottom of the bottle. The students record the time, \(t\) seconds, it takes for the water level to fall to each of 10 given values of the height, \(h \mathrm {~cm}\), above the hole.
Student \(A\) models the data with an equation of the form \(t = a + b \sqrt { h }\). The data is coded using \(v = t - 40\) and \(w = \sqrt { h }\) and the following information is obtained. $$\sum v = 626 \quad \sum v ^ { 2 } = 64678 \quad \sum w = 22.47 \quad \mathrm {~S} _ { w w } = 4.52 \quad \mathrm {~S} _ { v w } = - 338.83$$
  1. Find the equation of the regression line of \(t\) on \(\sqrt { h }\) in the form \(t = a + b \sqrt { h }\). The time it takes the water level to fall to a height of 9 cm above the hole is 47 seconds.
  2. Calculate the residual for this data point. Give your answer to 2 decimal places. Given that the residual sum of squares (RSS) for the model of \(t\) on \(\sqrt { h }\) is the same as the RSS for the model of \(v\) on \(w\),
  3. calculate the RSS for these 10 data points. Student \(B\) models the data with an equation of the form \(t = c + d h\). The regression line of \(t\) on \(h\) is calculated and the residual sum of squares (RSS) is found to be 980 to 3 significant figures.
  4. With reference to part (c) state, giving a reason, whether Student B's model or Student A's model is the more suitable for these data.
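The coded-data calculation in the question above can be sketched in Python as below. It assumes the regression of \(v\) on \(w\) is found first and then decoded with \(t = v + 40\), and that \(\mathrm{RSS} = S_{vv} - S_{vw}^2 / S_{ww}\). Unrounded coefficients are used throughout, which is an assumption: rounding \(a\) and \(b\) before computing the residual changes it noticeably. The variable names are illustrative.

```python
# Coded summary statistics (v = t - 40, w = sqrt(h)), n = 10 readings
n = 10
sum_v, sum_v2, sum_w = 626, 64678, 22.47
s_ww, s_vw = 4.52, -338.83

# Regression of v on w
b = s_vw / s_ww
a_coded = sum_v / n - b * sum_w / n

# Decode: t = v + 40, so only the intercept shifts
a = a_coded + 40

# Residual (observed - predicted) at h = 9 cm, t = 47 s
residual = 47 - (a + b * 9 ** 0.5)

# RSS for v on w (stated in the question to equal the RSS for t on sqrt(h))
s_vv = sum_v2 - sum_v ** 2 / n
rss = s_vv - s_vw ** 2 / s_ww

print(f"t = {a:.2f} + ({b:.2f})*sqrt(h), residual = {residual:.2f}, RSS = {rss:.1f}")
```

On these figures the RSS for Student A's model is far smaller than the 980 quoted for Student B's, and both models have \(t\) as the dependent variable, so the two RSS values are directly comparable.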
Edexcel FS2 Specimen Q6
12 marks Standard +0.3
  1. A random sample of 10 female pigs was taken. The number of piglets, \(x\), born to each female pig and their average weight at birth, \(m \mathrm {~kg}\), was recorded. The results were as follows:
| Number of piglets, \(x\) | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Average weight at birth, \(m \mathrm{~kg}\) | 1.50 | 1.20 | 1.40 | 1.40 | 1.23 | 1.30 | 1.20 | 1.15 | 1.25 | 1.15 |
(You may use \(\mathrm { S } _ { x x } = 82.5\) and \(\mathrm { S } _ { m m } = 0.12756\) and \(\mathrm { S } _ { x m } = - 2.29\) )
  1. Find the equation of the regression line of \(m\) on \(x\) in the form \(m = a + b x\) as a model for these results.
  2. Show that the residual sum of squares (RSS) is 0.064 to 3 decimal places.
  3. Calculate the residual values.
  4. Write down the outlier.
    1. Comment on the validity of ignoring this outlier.
    2. Ignoring the outlier, produce another model.
    3. Use this model to estimate the average weight at birth if \(x = 15\)
    4. Comment, giving a reason, on the reliability of your estimate.
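A short Python sketch of the first four parts of the question above, using the tabulated data and the quoted values of \(S_{xx}\), \(S_{mm}\) and \(S_{xm}\). It assumes \(\mathrm{RSS} = S_{mm} - S_{xm}^2 / S_{xx}\) and, as a simple working rule rather than a formal test, treats the point with the largest absolute residual as the outlier. The variable names are illustrative.

```python
# Data from the table: number of piglets x and average birth weight m (kg)
xs = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
ms = [1.50, 1.20, 1.40, 1.40, 1.23, 1.30, 1.20, 1.15, 1.25, 1.15]
s_xx, s_mm, s_xm = 82.5, 0.12756, -2.29

n = len(xs)
x_bar = sum(xs) / n
m_bar = sum(ms) / n

# Regression line of m on x: m = a + b*x
b = s_xm / s_xx
a = m_bar - b * x_bar

# Residual sum of squares from the summary statistics
rss = s_mm - s_xm ** 2 / s_xx

# Residuals (observed - predicted) for each sow
residuals = [m - (a + b * x) for x, m in zip(xs, ms)]

# Candidate outlier: the point with the largest absolute residual
outlier = max(zip(xs, ms, residuals), key=lambda p: abs(p[2]))

print(f"m = {a:.3f} + ({b:.4f})x, RSS = {rss:.3f}")
print(f"largest |residual| at x = {outlier[0]}, m = {outlier[1]}")
```

Removing the flagged point and repeating the same steps on the remaining nine points would give the second model asked for in the later parts; note that \(x = 15\) lies outside the observed range of 4 to 13, so any estimate there is an extrapolation.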