5.09e Use regression: for estimation in context

129 questions

OCR MEI Further Statistics A AS 2022 June Q6
10 marks Moderate -0.8
6 Tom has read in a newspaper that you can tell the air temperature by counting how often a cricket chirps in a period of 20 seconds. (A cricket is a type of insect.) He wants to know exactly how the temperature can be predicted. On 8 randomly selected days, when Tom can hear crickets chirping, he records the number of chirps, \(x\), made by a cricket in a 20-second interval, and also the temperature, \(y ^ { \circ } \mathrm { C }\), at that time. The data are summarised as follows. \(n = 8 \quad \sum x = 268 \quad \sum y = 141.9 \quad \sum x ^ { 2 } = 9618 \quad \sum y ^ { 2 } = 2630.55 \quad \sum \mathrm { xy } = 5009.1\) These data are illustrated below. \includegraphics[max width=\textwidth, alt={}, center]{8f1e0c68-a334-4657-823e-386ab0994c02-5_661_1035_699_242}
  1. Determine the equation of the regression line of \(y\) on \(x\). Give your answer in the form \(\mathrm { y } = \mathrm { ax } + \mathrm { b }\), giving the values of \(a\) and \(b\) correct to \(\mathbf { 3 }\) significant figures.
  2. Use the equation of the regression line to predict the temperature for the following values of \(x\).
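A minimal sketch of how the summary statistics in this question lead to the regression line of \(y\) on \(x\), assuming the standard least-squares formulas \(S_{xx} = \sum x^2 - (\sum x)^2/n\) and \(S_{xy} = \sum xy - \sum x \sum y / n\). The function name `regression_from_sums` is hypothetical and used only for illustration.

```python
def regression_from_sums(n, sum_x, sum_y, sum_x2, sum_xy):
    """Return (gradient, intercept) of the least-squares line of y on x."""
    s_xx = sum_x2 - sum_x ** 2 / n          # corrected sum of squares of x
    s_xy = sum_xy - sum_x * sum_y / n       # corrected sum of products
    gradient = s_xy / s_xx
    intercept = sum_y / n - gradient * sum_x / n  # line passes through the means
    return gradient, intercept


# Summary statistics quoted in the question (n = 8 days)
a, b = regression_from_sums(8, 268, 141.9, 9618, 5009.1)
print(f"y = {a:.3g}x + {b:.3g}")  # y = 0.399x + 4.37
```

A prediction for a given chirp count is then `a * x + b`; it is only reliable for values of \(x\) within the range of the observed data.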
OCR MEI Further Statistics A AS 2024 June Q4
10 marks Standard +0.3
4 A chemist is conducting an experiment in which the concentration of a certain chemical, A , is supposed to be recorded at the start of the experiment and then every 30 seconds after the start. The time after the start is denoted by \(t \mathrm {~s}\) and the concentration by \(\mathrm { z } \mathrm { mg } \mathrm { cm } ^ { - 3 }\). The collected data are shown in the table below. Note that the concentration at \(t = 90\) was not recorded.
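\begin{tabular}{|l|c|c|c|c|c|}
\hline
Time, \(t\) & 0 & 30 & 60 & 120 & 150 \\
\hline
Concentration of A, \(z\) & 40.0 & 31.3 & 27.5 & 12.8 & 11.4 \\
\hline
\end{tabular}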
The chemist wishes to plot the data on a graph.
  1. Explain why \(t\) should be plotted on the horizontal axis. You are given that the summary statistics for the data are as follows. \(n = 5 \quad \sum t = 360 \quad \sum z = 123.0 \quad \sum t ^ { 2 } = 41400 \quad \sum z ^ { 2 } = 3629.74 \quad \sum tz = 5835\) The regression line of \(z\) on \(t\) is given by \(z = a + b t\) and is used to model the concentration of chemical A for \(t \geqslant 0\).
    1. Use the summary statistics to determine the value of \(a\) and the value of \(b\).
    2. Find the value of the residual at each of the following values of \(t\).
      • \(t = 60\)
      • \(t = 120\)
        1. Use the equation of the regression line to estimate the value of the concentration at 90 seconds.
        2. With reference to your answers to part (b)(ii), comment on the reliability of your answer to part (c)(i).
      Further experiments indicate that the model is reasonably reliable for times greater than 150 seconds up to about 200 seconds.
  2. Show that the model cannot be valid beyond a time of about 200 seconds.
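The parts of this question can be sketched numerically from the five recorded data points. This is an illustrative calculation, not a mark scheme: it fits \(z = a + bt\) by least squares, evaluates the residuals (observed minus predicted), interpolates at \(t = 90\), and finds the time at which the fitted concentration reaches zero. The variable names are assumptions.

```python
t = [0, 30, 60, 120, 150]
z = [40.0, 31.3, 27.5, 12.8, 11.4]
n = len(t)

t_bar = sum(t) / n
z_bar = sum(z) / n
s_tt = sum((ti - t_bar) ** 2 for ti in t)
s_tz = sum((ti - t_bar) * (zi - z_bar) for ti, zi in zip(t, z))

b = s_tz / s_tt          # gradient (negative: concentration falls with time)
a = z_bar - b * t_bar    # intercept


def predict(time):
    """Fitted concentration at a given time."""
    return a + b * time


# Residual = observed value minus fitted value
residuals = {ti: zi - predict(ti) for ti, zi in zip(t, z)}
z_at_90 = predict(90)
t_zero = -a / b          # time at which the fitted line gives z = 0

print(f"z = {a:.2f} + ({b:.4f})t")
print(f"residual at 60: {residuals[60]:.2f}, at 120: {residuals[120]:.2f}")
print(f"estimate at 90 s: {z_at_90:.1f}; line reaches zero at t = {t_zero:.0f} s")
```

The fitted line crosses zero at roughly 198 seconds, and a concentration cannot be negative, which is why the model cannot hold much beyond 200 seconds. The residuals either side of \(t = 90\) have opposite signs and are not small, so the interpolated value should be treated with some caution.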
OCR MEI Further Statistics A AS 2020 November Q5
8 marks Moderate -0.3
5 A doctor is investigating the relationship between the levels in the blood of a particular hormone and of calcium in healthy adults. The levels of the hormone and of calcium, each measured in suitable units, are denoted by \(x\) and \(y\) respectively. The doctor selects a random sample of 14 adults and measures the hormone and calcium levels in each of them. The spreadsheet in Fig. 5 shows the values obtained, together with a scatter diagram which illustrates the data. The equation of the regression line of \(y\) on \(x\) is shown on the scatter diagram, together with the value of the square of the product moment correlation coefficient. \begin{figure}[h]
\includegraphics[alt={},max width=\textwidth]{ba3fcd3c-6834-4116-be0e-d5b27aed0a7e-5_801_1644_646_255} \captionsetup{labelformat=empty} \caption{Fig. 5}
\end{figure}
  1. Use the equation of the regression line to estimate the mean calcium level of people with the following hormone levels.
OCR MEI Further Statistics A AS 2021 November Q6
11 marks Moderate -0.3
6 A health researcher is investigating the relationship between age and maximum heart rate. A commonly quoted formula states that 'maximum heart rate \(= 220\) - age in years'. The researcher wants to check if this formula is a satisfactory model for people who work in the large hospital where she is employed. The researcher selects a random sample of 20 people who work in her hospital, and measures their maximum heart rates.
  1. Explain why the researcher selects a sample, rather than using all of the people who work in the hospital. The ages, \(x\) years, and maximum heart rates, \(y\) beats per minute, of the people in the researcher's sample are summarised as follows. \(n = 20 \quad \sum x = 922 \quad \sum y = 3638 \quad \sum x ^ { 2 } = 47250 \quad \sum y ^ { 2 } = 664610 \quad \sum x y = 164998\) These data are illustrated below. \includegraphics[max width=\textwidth, alt={}, center]{5be067ff-4668-48d6-8ed2-b8dfa3e678f7-5_758_1246_1027_244}
    1. Draw the line which represents the formula 'maximum heart rate \(= 220 -\) age in years' on the copy of the scatter diagram in the Printed Answer Booklet.
    2. Comment on how well this model fits the data.
  2. Determine the equation of the regression line of maximum heart rate on age.
  3. Use the equation of the regression line to predict the values of the maximum heart rate for each of the following ages.
OCR MEI Further Statistics Minor 2022 June Q2
13 marks Moderate -0.8
2 A forester is investigating the relationship between the diameter and the height of young beech trees. She selects a random sample of 15 young beech trees in a forest and records their diameters, \(d \mathrm {~cm}\), and their heights, \(h \mathrm {~m}\). The data are illustrated in the scatter diagram. \includegraphics[max width=\textwidth, alt={}, center]{e8624e9b-5143-49d2-9683-cc3a1082694e-3_649_1116_386_230}
  1. State whether either or both of the variables \(d\) and \(h\) are random variables. Summary data for the diameters and heights are as follows. $$\mathrm { n } = 15 \quad \sum \mathrm {~d} = 84.9 \quad \sum \mathrm {~h} = 124.7 \quad \sum \mathrm {~d} ^ { 2 } = 624.55 \quad \sum \mathrm {~h} ^ { 2 } = 1230.57 \quad \sum \mathrm { dh } = 866.63$$
  2. Find the equation of the regression line of \(h\) on \(d\). Give your answer in the form \(h = a d + b\), giving the values of \(a\) and \(b\) correct to \(\mathbf { 2 }\) decimal places.
  3. Use the regression line to predict the heights of beech trees with the following diameters.
    Comment on this in relation to your regression line.
  4. State the coordinates of the point at which the regression line of \(d\) on \(h\) meets the line which you calculated in part (b).
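As a check on the final part, the sketch below computes both regression lines from the summary data and solves them simultaneously. It assumes the usual least-squares formulas; the point of intersection should coincide with \((\bar d, \bar h)\), because both lines pass through the means. Variable names are illustrative.

```python
n = 15
sum_d, sum_h = 84.9, 124.7
sum_d2, sum_h2, sum_dh = 624.55, 1230.57, 866.63

d_bar, h_bar = sum_d / n, sum_h / n
s_dd = sum_d2 - sum_d ** 2 / n
s_hh = sum_h2 - sum_h ** 2 / n
s_dh = sum_dh - sum_d * sum_h / n

# Regression line of h on d:  h = a*d + b
a = s_dh / s_dd
b = h_bar - a * d_bar

# Regression line of d on h:  d = c*h + e
c = s_dh / s_hh
e = d_bar - c * h_bar

# Substitute d = c*h + e into h = a*d + b and solve for h
h_meet = (a * e + b) / (1 - a * c)
d_meet = c * h_meet + e

print(f"h = {a:.2f}d + {b:.2f}")
print(f"lines meet at ({d_meet:.2f}, {h_meet:.2f}); means are ({d_bar:.2f}, {h_bar:.2f})")
```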
OCR MEI Further Statistics Minor 2023 June Q5
8 marks Moderate -0.8
5 An ornithologist is investigating the link between the wing length and the mass of small birds, in order to try to predict the mass from the wing length without having to weigh birds. The ornithologist takes a random sample of 9 birds and measures their wing lengths \(w \mathrm {~mm}\) and their masses \(m \mathrm {~g}\). The spreadsheet below shows the data, together with a scatter diagram which illustrates the data. \includegraphics[max width=\textwidth, alt={}, center]{72215d69-c3e6-492d-bb3e-bdc28aeb4613-5_719_1424_495_246}
  1. Find the equation of the regression line of \(m\) on \(w\), giving the coefficients correct to \(\mathbf { 3 }\) significant figures.
  2. Use the equation which you found in part (a) to estimate the mass for each of the following wing lengths.
    Comment on this suggestion.
OCR MEI Further Statistics Minor 2021 November Q2
9 marks Moderate -0.8
2 A road transport researcher is investigating the link between the age of a person, \(a\) years, and the distance, \(d\) metres, at which the person can read a large road sign. The researcher selects 13 individuals of different ages between 20 and 80 and measures the value of \(d\) for each of them. The spreadsheet below shows the data which the researcher obtained, together with a scatter diagram which illustrates the data. \includegraphics[max width=\textwidth, alt={}, center]{691e8b55-e9a1-4fff-b9ee-a71ff1f73ead-3_725_1566_495_251}
  1. Explain which of the two variables \(a\) and \(d\) is the independent variable.
  2. Find the equation of the regression line of \(d\) on \(a\).
  3. Use the regression line to predict the average distance at which a 60-year-old person can read the road sign.
  4. Explain why it might not be sensible to use the regression line to predict the average distance at which a 5 -year-old child can read the road sign.
  5. Determine the value of the residual for \(a = 40\).
  6. Explain why it would not be useful to find the equation of the regression line of \(a\) on \(d\).
OCR MEI Further Statistics Major 2019 June Q6
18 marks Moderate -0.8
6
  1. A researcher is investigating the date of the 'start of spring' at different locations around the country.
    A suitable date (measured in days from the start of the year) can be identified by checking, for example, when buds first appear for certain species of trees and plants, but this is time-consuming and expensive. Satellite data, measuring microwave emissions, can alternatively be used to estimate the date that land-based measurements would give. The researcher chooses a random sample of 12 locations, and obtains land-based measurements for the start of spring date at each location, together with relevant satellite measurements. The scatter diagram in Fig. 6.1 shows the results; the land-based measurements are denoted by \(x\) days and the corresponding values derived from satellite measurements by \(y\) days. \begin{figure}[h]
    \includegraphics[alt={},max width=\textwidth]{3a89edc4-ac93-4691-ade8-4d4665b55202-06_732_1342_781_333} \captionsetup{labelformat=empty} \caption{Fig. 6.1}
    \end{figure} Fig. 6.2 shows part of a spreadsheet used to analyse the data. Some rows of the spreadsheet have been deliberately omitted. \begin{table}[h]
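    \begin{tabular}{|c|c|c|c|c|c|c|}
    \hline
     & A & B & C & D & E & F \\
    \hline
    1 &  & \(x\) & \(y\) & \(x ^ { 2 }\) & \(y ^ { 2 }\) & \(xy\) \\
    \hline
    2 &  & 90 & 102 & 8100 & 10404 & 9180 \\
    \hline
    3 &  &  &  &  &  &  \\
    \hline
    10 &  &  &  &  &  &  \\
    \hline
    11 &  &  &  &  &  &  \\
    \hline
    12 &  & 94 & 97 & 8836 & 9409 & 9118 \\
    \hline
    13 &  & 99 & 101 & 9801 & 10201 & 9999 \\
    \hline
    14 & Sum & 1131 & 1227 & 107783 & 126725 & 116724 \\
    \hline
    15 &  &  &  &  &  &  \\
    \hline
    \end{tabular}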
    \captionsetup{labelformat=empty} \caption{Fig. 6.2}
    \end{table}
    1. Calculate the equation of a regression line suitable for estimating the land-based date of the start of spring from satellite measurements.
    2. Using this equation, estimate the land-based date of the start of spring for the following dates from satellite measurements.
      • 95 days
      • 60 days
    3. Comment on the reliability of each of your estimates.
  2. The researcher is also investigating whether there is any correlation between the average temperature during a month in spring and the total rainfall during that month at a particular location. The average temperatures in degrees Celsius and total rainfall in mm for a random selection, over several years, of 10 spring months at this location are as follows.
      \begin{tabular}{|l|c|c|c|c|c|c|c|c|c|c|}
      \hline
      Temperature & 4.2 & 7.1 & 5.6 & 3.5 & 8.6 & 6.5 & 2.7 & 5.9 & 6.7 & 4.1 \\
      \hline
      Rainfall & 18 & 26 & 42 & 76 & 15 & 43 & 84 & 53 & 66 & 36 \\
      \hline
      \end{tabular}
      The researcher plots the scatter diagram shown in Fig. 6.3 to check which type of test to carry out. \begin{figure}[h]
      \includegraphics[alt={},max width=\textwidth]{3a89edc4-ac93-4691-ade8-4d4665b55202-07_693_880_1174_338} \captionsetup{labelformat=empty} \caption{Fig. 6.3}
      \end{figure}
      1. Explain why the researcher might come to the conclusion that a test based on Pearson's product moment correlation coefficient may be valid.
      2. Find the value of Pearson's product moment correlation coefficient.
      3. Carry out a test at the \(5 \%\) significance level to investigate whether there is any correlation between temperature and rainfall.
OCR MEI Further Statistics Major 2022 June Q5
11 marks Moderate -0.3
5 A motorist is investigating the relationship between tyre pressure and temperature. As the temperature increases during a hot day, she records the pressure (measured in bars) of one of her car tyres at specific temperatures of \(20 ^ { \circ } \mathrm { C } , 22 ^ { \circ } \mathrm { C } , \ldots , 36 ^ { \circ } \mathrm { C }\). The results are shown in Table 5.1. \begin{table}[h]
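\begin{tabular}{|l|c|c|c|c|c|c|c|c|c|}
\hline
Temperature \(\left( t ^ { \circ } \mathrm { C } \right)\) & 20 & 22 & 24 & 26 & 28 & 30 & 32 & 34 & 36 \\
\hline
Tyre pressure \(( P\) bar \()\) & 2.012 & 2.036 & 2.065 & 2.074 & 2.114 & 2.140 & 2.149 & 2.176 & 2.192 \\
\hline
\end{tabular}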
\captionsetup{labelformat=empty} \caption{Table 5.1}
\end{table}
  1. Calculate the equation of the regression line of pressure on temperature. Give your answer in the form \(P = a t + b\), giving the values of \(a\) and \(b\) to \(\mathbf { 4 }\) significant figures.
  2. Table 5.2 shows the residuals for most of the data values. Complete the copy of the table in the Printed Answer Booklet. \begin{table}[h]
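    \begin{tabular}{|l|c|c|c|c|c|c|c|c|c|}
    \hline
    Temperature & 20 & 22 & 24 & 26 & 28 & 30 & 32 & 34 & 36 \\
    \hline
    Residual tyre pressure & \(- 0.003\) & \(- 0.002\) & 0.004 & \(- 0.010\) &  & 0.011 & \(- 0.003\) & 0.001 &  \\
    \hline
    \end{tabular}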
    \captionsetup{labelformat=empty} \caption{Table 5.2}
    \end{table}
  3. With reference to the values of the residuals, comment on the goodness of fit of the regression line.
  4. Use your answer to part (a) to calculate an estimate of the pressure in the tyre at each of the following temperatures, giving your answers to \(\mathbf { 3 }\) decimal places.
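A short sketch of parts (a) and (b), assuming the data in Table 5.1 and the usual least-squares formulas. It fits \(P = at + b\) and computes every residual as observed pressure minus fitted pressure, so the two entries missing from Table 5.2 can be read off. Variable names are illustrative.

```python
temps = list(range(20, 37, 2))  # 20, 22, ..., 36 degrees C
pressures = [2.012, 2.036, 2.065, 2.074, 2.114, 2.140, 2.149, 2.176, 2.192]
n = len(temps)

t_bar = sum(temps) / n
p_bar = sum(pressures) / n
s_tt = sum((t - t_bar) ** 2 for t in temps)
s_tp = sum((t - t_bar) * (p - p_bar) for t, p in zip(temps, pressures))

a = s_tp / s_tt          # gradient
b = p_bar - a * t_bar    # intercept

# Residual = observed pressure minus pressure predicted by the line
residuals = {t: p - (a * t + b) for t, p in zip(temps, pressures)}

print(f"P = {a:.4g}t + {b:.4g}")
for t, res in residuals.items():
    print(t, f"{res:+.3f}")
```

All nine residuals are at most about 0.01 bar in magnitude and change sign with no obvious pattern, which is the kind of evidence part (c) asks for.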
OCR MEI Further Statistics Major 2023 June Q2
5 marks Easy -1.2
2 A student is investigating the link between temperature and electricity consumption in the winter months. The student finds the average minimum temperature, \(x ^ { \circ } \mathrm { C }\), from across the country on a day. The student then finds the total electricity consumption for that day, \(y \mathrm { GWh }\). The scatter diagram below shows the values of \(x\) and \(y\) obtained from a random sample of 10 winter days. It also shows the equation of the regression line of \(y\) on \(x\) and the value of \(r ^ { 2 }\), where \(r\) is the product moment correlation coefficient. \includegraphics[max width=\textwidth, alt={}, center]{c692fb20-436f-4bc1-89bd-10fdba41ceba-03_776_1043_609_244}
  1. Use the regression line to estimate the electricity consumption at each of the following average minimum temperatures.
OCR MEI Further Statistics Major 2024 June Q8
14 marks Moderate -0.3
8 An estate agent collects data for a random selection of 13 flats in order to investigate the link between the floor areas of flats and their price. The scatter diagram shows the floor areas, \(x \mathrm {~m} ^ { 2 }\), and prices, \(\pounds y\) thousand, of the 13 flats. \includegraphics[max width=\textwidth, alt={}, center]{bab116b3-6e5f-44db-ac86-670e4040d649-07_613_1246_386_242}
  1. The estate agent notes that two of the data points are outliers. One is Flat A which has a large floor area but is in poor condition. The other is Flat B which has a balcony with a desirable view overlooking the sea. Label these two data points on the copy of the scatter diagram in the Printed Answer Booklet. The estate agent decides to remove these two data points from the analysis. Summary statistics for the remaining 11 flats are as follows. $$\sum x = 652.5 \quad \sum y = 5067 \quad \sum x ^ { 2 } = 41987.35 \quad \sum y ^ { 2 } = 2456813 \quad \sum x y = 315928.2$$
  2. In this question you must show detailed reasoning. Calculate the equation of a regression line which is suitable for estimating the price of a flat from its floor area.
  3. Use the regression line to estimate the price for the following floor areas.
    Comment briefly on the estate agent's idea.
Edexcel FS2 AS 2018 June Q1
11 marks Moderate -0.3
  1. The scores achieved on a maths test, \(m\), and the scores achieved on a physics test, \(p\), by 16 students are summarised below.
$$\sum m = 392 \quad \sum p = 254 \quad \sum p ^ { 2 } = 4748 \quad \mathrm {~S} _ { m m } = 1846 \quad \mathrm {~S} _ { m p } = 1115$$
  1. Find the product moment correlation coefficient between \(m\) and \(p\)
  2. Find the equation of the linear regression line of \(p\) on \(m\) Figure 1 shows a plot of the residuals. \begin{figure}[h]
    \includegraphics[alt={},max width=\textwidth]{0fcb4d83-9763-4edd-8006-93f75a44c596-02_808_1222_997_429} \captionsetup{labelformat=empty} \caption{Figure 1}
    \end{figure}
  3. Calculate the residual sum of squares (RSS). For the person who scored 30 marks on the maths test,
  4. find the score on the physics test. The data for the person who scored 20 on the maths test is removed from the data set.
  5. Suggest a reason why. The product moment correlation coefficient between \(m\) and \(p\) is now recalculated for the remaining 15 students.
  6. Without carrying out any further calculations, suggest how you would expect this recalculated value to compare with your answer to part (a).
    Give a reason for your answer.
Edexcel FS2 AS 2019 June Q3
11 marks Standard +0.3
  1. Two students, Jim and Dora, collected data on the mean annual rainfall, \(w \mathrm {~cm}\), and the annual yield of leeks, \(l\) tonnes per hectare, for 10 years.
Jim summarised the data as follows $$\mathrm { S } _ { w l } = 42.786 \quad \mathrm {~S} _ { w w } = 9936.9 \quad \sum l ^ { 2 } = 26.2326 \quad \sum l = 16.06$$
  1. Find the product moment correlation coefficient between \(l\) and \(w\) Dora decided to code the data first using \(s = w - 6\) and \(t = l - 20\)
  2. Write down the value of the product moment correlation coefficient between \(s\) and \(t\). Give a justification for your answer. Dora calculates the equation of the regression line of \(t\) on \(s\) to be \(t = 0.00431 s - 18.87\)
  3. Find the equation of the regression line of \(l\) on \(w\) in the form \(l = a + b w\), giving the values of \(a\) and \(b\) to 3 significant figures.
  4. Use your equation to estimate the yield of leeks when \(w\) is 100 cm.
  5. Calculate the residual sum of squares. The graph shows the residual for each value of \(l\) \includegraphics[max width=\textwidth, alt={}, center]{7e46e14a-0f5a-4d02-8f00-a92bc4def6d7-08_716_1594_1594_239}
    1. State whether this graph suggests that the use of a linear regression model is suitable for these data. Give a reason for your answer.
    2. Other than collecting more data, suggest how to improve the fit of the model in part (c) to the data.
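Undoing Dora's coding is a matter of substituting \(s = w - 6\) and \(t = l - 20\) into her line and collecting terms. The sketch below does this for a general shift coding; the helper `decode_line` is a hypothetical name, and it assumes the coding is a pure translation (no scaling), as in this question.

```python
def decode_line(gradient, intercept, x_shift, y_shift):
    """Convert a coded line  t = intercept + gradient * s,
    where s = x - x_shift and t = y - y_shift,
    into the uncoded form  y = a + b * x.  Returns (a, b)."""
    b = gradient                                   # a translation leaves the gradient unchanged
    a = intercept + y_shift - gradient * x_shift   # collect the constant terms
    return a, b


# Dora's line: t = 0.00431 s - 18.87 with s = w - 6 and t = l - 20
a, b = decode_line(0.00431, -18.87, x_shift=6, y_shift=20)
estimate = a + b * 100   # yield estimate when w = 100

print(f"l = {a:.3g} + {b:.3g}w")
print(f"estimated yield at w = 100: {estimate:.2f}")
```

The same reasoning explains part (b): a translation of either variable does not change the product moment correlation coefficient.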
Edexcel FS2 AS 2022 June Q3
10 marks Standard +0.3
  1. Gabriela is investigating a particular type of fish, called bream. She wants to create a model to predict the weight, \(w\) grams, of bream based on their length, \(x \mathrm {~cm}\).
For a sample of 27 bream, some summary statistics are given below. $$\begin{gathered} \bar { x } = 31.07 \quad \bar { w } = 628.59 \quad \sum w ^ { 2 } = 11386134 \\ \mathrm {~S} _ { x w } = 13082.3 \quad \mathrm {~S} _ { x x } = 260.8 \end{gathered}$$
  1. Find the value of the product moment correlation coefficient between \(x\) and \(w\)
  2. Explain whether the answer to part (a) is consistent with a linear model for these data.
  3. Find the equation of the regression line of \(w\) on \(x\) in the form \(w = a + b x\) A residual plot for these data is shown below. \includegraphics[max width=\textwidth, alt={}, center]{128c408d-3e08-4f74-8f19-d33ecd5c882f-06_931_1790_1107_139} One of the bream in the sample has a length of 32 cm .
  4. Find its weight.
  5. With reference to the residual plot, comment on the model for bream with lengths above 33 cm .
Edexcel FS2 AS 2023 June Q3
10 marks Standard +0.3
  1. Pat is investigating the relationship between the height of professional tennis players and the speed of their serve. Data from 9 randomly selected professional male tennis players were collected. The variables recorded were the height of each player, \(h\) metres, and the maximum speed of their serve, \(v \mathrm {~km} / \mathrm { h }\).
Pat summarised these data as follows $$\sum h = 17.63 \quad \sum v = 2174.9 \quad \sum v ^ { 2 } = 526407.8 \quad S _ { h h } = 0.0487 \quad S _ { h v } = 5.1376$$
  1. Calculate the product moment correlation coefficient between \(h\) and \(v\)
  2. Explain whether the answer to part (a) is consistent with a linear model for these data.
  3. Find the equation of the regression line of \(v\) on \(h\) in the form \(v = a + b h\) where \(a\) and \(b\) are to be given to one decimal place. Pat calculated the sum of the residuals for the 9 tennis players as 1.04
  4. Without doing a calculation, explain how you know Pat has made a mistake. Pat made one mistake in the calculation. For the tennis player of height 1.96 m Pat misread the residual as 2.27
  5. Find the maximum speed of serve, in km/h, for the tennis player of height 1.96 m
Edexcel FS2 AS 2024 June Q5
8 marks Standard +0.3
  1. A random sample of 24 adults is taken. The height, \(h\) metres, and the arm span, \(s\) metres, for each adult are recorded.
These data are summarised below. $$\mathrm { S } _ { h h } = 0.377 \quad \mathrm {~S} _ { s h } = 0.352 \quad \bar { s } = 1.70 \quad \bar { h } = 1.68$$ The least squares regression line of \(h\) on \(s\) is $$h = a + 0.919 s$$ where \(a\) is a constant.
  1. Calculate the product moment correlation coefficient. A doctor uses the least squares regression line of \(h\) on \(s\) as a model to predict a person's height based on their arm span.
  2. Use the model to predict the height of an adult with arm span 1.79 metres. Ewan has an arm span of 1.70 metres and a height of 1.75 metres. His information is added to the sample as the 25th adult.
  3. Explain how the gradient of the regression line for the sample of 25 adults compares with the gradient of the regression line for the original sample of 24 adults.
    Give a reason for your answer.
Edexcel FS2 AS Specimen Q3
11 marks Standard +0.3
  1. A scientist wants to develop a model to describe the relationship between the average daily temperature, \(x ^ { \circ } \mathrm { C }\), and a household's daily energy consumption, \(y\) kWh, in winter.
A random sample of the average temperature and energy consumption are taken from 10 winter days and are summarised below. $$\begin{gathered} \sum x = 12 \quad \sum x ^ { 2 } = 24.76 \quad \sum y = 251 \quad \sum y ^ { 2 } = 6341 \quad \sum x y = 284.8 \\ S _ { x x } = 10.36 \quad S _ { y y } = 40.9 \end{gathered}$$
  1. Find the product moment correlation coefficient between \(y\) and \(x\).
  2. Find the equation of the regression line of \(y\) on \(x\) in the form \(y = a + b x\)
  3. Use your equation to estimate the daily energy consumption when the average daily temperature is \(2 ^ { \circ } \mathrm { C }\)
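  4. Calculate the residual sum of squares (RSS). The table shows the residual for each value of \(x\).
    \begin{tabular}{|l|c|c|c|c|c|c|c|c|c|c|}
    \hline
    \(x\) & \(- 0.4\) & \(- 0.2\) & 0.3 & 0.8 & 1.1 & 1.4 & 1.8 & 2.1 & 2.5 & 2.6 \\
    \hline
    Residual & \(- 0.63\) & \(- 0.32\) & \(- 0.52\) & \(- 0.73\) & 0.74 & 2.22 & 1.84 & 0.32 & \(f\) & \(- 1.88\) \\
    \hline
    \end{tabular}
  5. Find the value of \(f\).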
  6. By considering the signs of the residuals, explain whether or not the linear regression model is a suitable model for these data.
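The numerical parts of this question can be sketched as follows, assuming the standard results \(r = S_{xy}/\sqrt{S_{xx}S_{yy}}\), \(b = S_{xy}/S_{xx}\) and \(\mathrm{RSS} = S_{yy} - S_{xy}^2/S_{xx}\), together with the fact that the residuals of a least-squares line sum to zero. Variable names are illustrative.

```python
import math

n = 10
sum_x, sum_y, sum_xy = 12, 251, 284.8
s_xx, s_yy = 10.36, 40.9

s_xy = sum_xy - sum_x * sum_y / n
r = s_xy / math.sqrt(s_xx * s_yy)      # product moment correlation coefficient
b = s_xy / s_xx                        # gradient of y on x
a = sum_y / n - b * sum_x / n          # intercept
estimate_at_2 = a + b * 2              # energy use when the temperature is 2 degrees C
rss = s_yy - s_xy ** 2 / s_xx          # residual sum of squares

# Residuals of a least-squares line sum to zero, so f balances the other nine
known_residuals = [-0.63, -0.32, -0.52, -0.73, 0.74, 2.22, 1.84, 0.32, -1.88]
f = -sum(known_residuals)

print(f"r = {r:.3f}, y = {a:.1f} + ({b:.2f})x, RSS = {rss:.2f}, f = {f:.2f}")
```

The signs of the residuals run negative, then positive, then negative again as \(x\) increases, a pattern that points towards a non-linear relationship.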
Edexcel FS2 2019 June Q2
10 marks Standard +0.3
2 A large field of wheat is split into 8 plots of equal area. Each plot is treated with a different amount of fertiliser, \(f\) grams \(/ \mathrm { m } ^ { 2 }\). The yield of wheat, \(w\) tonnes, from each plot is recorded. The results are summarised below. $$\sum f = 28 \quad \sum w = 303 \quad \sum w ^ { 2 } = 13447 \quad \mathrm {~S} _ { f f } = 42 \quad \mathrm {~S} _ { f w } = 269.5$$
  1. Calculate the product moment correlation coefficient between \(f\) and \(w\)
  2. Interpret the value of your product moment correlation coefficient.
  3. Find the equation of the regression line of \(w\) on \(f\) in the form \(w = a + b f\)
  4. Using your equation, estimate the decrease in yield when the amount of fertiliser decreases by 0.5 grams \(/ \mathrm { m } ^ { 2 }\) The residuals of the data recorded are calculated and plotted on the graph below. \includegraphics[max width=\textwidth, alt={}, center]{67df73d4-6ce4-45f7-8a69-aa94292ea814-04_1232_1294_1169_301}
  5. With reference to this graph, comment on the suitability of the model you found in part (c).
  6. Suggest how you might be able to refine your model.
Edexcel FS2 2020 June Q3
6 marks Standard +0.3
3 Below are 3 sketches from some students of the residuals from their linear regressions of \(y\) on \(x\). \includegraphics[max width=\textwidth, alt={}, center]{54bf68ab-7934-432a-890f-20093082ab07-06_252_704_342_660} \includegraphics[max width=\textwidth, alt={}, center]{54bf68ab-7934-432a-890f-20093082ab07-06_266_718_625_660} \includegraphics[max width=\textwidth, alt={}, center]{54bf68ab-7934-432a-890f-20093082ab07-06_248_599_936_660} \section*{III} For each sketch you should state, giving your reason,
  1. whether or not the sketch is feasible
    and if it is feasible
  2. whether or not the sketch suggests a linear or a non-linear relationship between \(y\) and \(x\).
Edexcel FS2 2022 June Q1
7 marks Standard +0.3
  1. Kwame is investigating a possible relationship between average March temperature, \(t ^ { \circ } \mathrm { C }\), and tea yield, \(y \mathrm {~kg} /\) hectare, for tea grown in a particular location. He uses 30 years of past data to produce the following summary statistics for a linear regression model, with tea yield as the dependent variable.
$$\begin{aligned} & \text { Residual Sum of Squares } ( \mathrm { RSS } ) = 1666567 \quad \mathrm {~S} _ { t t } = 52.0 \quad \mathrm {~S} _ { y y } = 1774155 \\ & \text { least squares regression line: } \quad \text { gradient } = 45.5 \quad y \text {-intercept } = 2080 \end{aligned}$$
  1. Use the regression model to predict the tea yield for an average March temperature of \(20 ^ { \circ } \mathrm { C }\) He also produces the following residual plot for the data. \includegraphics[max width=\textwidth, alt={}, center]{d139840b-16ec-42ce-8501-f79c263c8017-02_663_880_868_589}
  2. Explain what you understand by the term residual.
  3. Calculate the product moment correlation coefficient between \(t\) and \(y\)
  4. Explain why the linear model may not be a good fit for the data
    1. with reference to your answer to part (c)
    2. with reference to the residual plot. Kwame also collects data on total March rainfall, \(w \mathrm {~mm}\), for each of these 30 years. For a linear regression model of \(w\) on \(t\) the following summary statistic is found. $$\text { Residual Sum of Squares (RSS) = } 86754$$ Kwame concludes that since this model has a smaller RSS, there must be a stronger linear relationship between \(w\) and \(t\) than between \(y\) and \(t\) (where RSS \(= 1666567\) )
  5. State, giving a reason, whether or not you agree with the reasoning that led to Kwame's conclusion.
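Parts (a) and (c) can be sketched from the summary statistics alone, assuming the identity \(\mathrm{RSS} = S_{yy}(1 - r^2)\) for a least-squares line. The sign of \(r\) is taken from the sign of the gradient. Variable names are illustrative.

```python
import math

rss_y, s_tt, s_yy = 1666567, 52.0, 1774155
gradient, intercept = 45.5, 2080

prediction_at_20 = intercept + gradient * 20   # predicted yield at 20 degrees C

# RSS = S_yy * (1 - r^2)  =>  r^2 = 1 - RSS / S_yy
r_squared = 1 - rss_y / s_yy
r = math.copysign(math.sqrt(r_squared), gradient)  # r has the same sign as the gradient

print(f"predicted yield: {prediction_at_20:.0f} kg/hectare")
print(f"r^2 = {r_squared:.4f}, r = {r:.3f}")
```

The same identity bears on part (e): RSS depends on the scale of the response variable through \(S_{yy}\), so the RSS for rainfall cannot be compared directly with the RSS for tea yield. Comparing the two fits would need \(S_{ww}\), which the question does not give.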
Edexcel FS2 2024 June Q1
9 marks Standard +0.3
  1. Two students are experimenting with some water in a plastic bottle. The bottle is filled with water and a hole is put in the bottom of the bottle. The students record the time, \(t\) seconds, it takes for the water level to fall to each of 10 given values of the height, \(h \mathrm {~cm}\), above the hole.
Student \(A\) models the data with an equation of the form \(t = a + b \sqrt { h }\) The data is coded using \(v = t - 40\) and \(w = \sqrt { h }\) and the following information is obtained. $$\sum v = 626 \quad \sum v ^ { 2 } = 64678 \quad \sum w = 22.47 \quad \mathrm {~S} _ { w w } = 4.52 \quad \mathrm {~S} _ { v w } = - 338.83$$
  1. Find the equation of the regression line of \(t\) on \(\sqrt { h }\) in the form \(t = a + b \sqrt { h }\) The time it takes the water level to fall to a height of 9 cm above the hole is 47 seconds.
  2. Calculate the residual for this data point. Give your answer to 2 decimal places. Given that the residual sum of squares (RSS) for the model of \(t\) on \(\sqrt { h }\) is the same as the RSS for the model of \(v\) on \(w\),
  3. calculate the RSS for these 10 data points. Student \(B\) models the data with an equation of the form \(t = c + d h\) The regression line of \(t\) on \(h\) is calculated and the residual sum of squares (RSS) is found to be 980 to 3 significant figures.
  4. With reference to part (c) state, giving a reason, whether Student B's model or Student A's model is the more suitable for these data.
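A sketch of Student A's calculation, assuming the usual least-squares formulas applied to the coded variables \(v = t - 40\) and \(w = \sqrt{h}\), then decoded by adding 40 to the intercept. Variable names are illustrative.

```python
import math

n = 10
sum_v, sum_v2, sum_w = 626, 64678, 22.47
s_ww, s_vw = 4.52, -338.83

# Regression of v on w, then undo the coding v = t - 40
b = s_vw / s_ww                            # gradient is unchanged by the shift
a_coded = sum_v / n - b * sum_w / n        # intercept of v on w
a = a_coded + 40                           # intercept of t on sqrt(h)

# Residual for the observation h = 9, t = 47 (observed minus predicted)
predicted_t = a + b * math.sqrt(9)
residual = 47 - predicted_t

# RSS for v on w, which the question states equals the RSS for t on sqrt(h)
s_vv = sum_v2 - sum_v ** 2 / n
rss = s_vv - s_vw ** 2 / s_ww

print(f"t = {a:.1f} + ({b:.1f})*sqrt(h)")
print(f"residual at h = 9: {residual:.2f}, RSS = {rss:.1f}")
```

An RSS of about 91 for the square-root model against 980 for the linear model in \(h\), on the same response variable \(t\), indicates that Student A's model fits these data more closely.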
Edexcel FS2 Specimen Q6
12 marks Standard +0.3
  1. A random sample of 10 female pigs was taken. The number of piglets, \(x\), born to each female pig and their average weight at birth, \(m \mathrm {~kg}\), was recorded. The results were as follows:
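\begin{tabular}{|l|c|c|c|c|c|c|c|c|c|c|}
\hline
Number of piglets, \(x\) & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 & 13 \\
\hline
Average weight at birth, \(m \mathrm {~kg}\) & 1.50 & 1.20 & 1.40 & 1.40 & 1.23 & 1.30 & 1.20 & 1.15 & 1.25 & 1.15 \\
\hline
\end{tabular}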
(You may use \(\mathrm { S } _ { x x } = 82.5\) and \(\mathrm { S } _ { m m } = 0.12756\) and \(\mathrm { S } _ { x m } = - 2.29\) )
  1. Find the equation of the regression line of \(m\) on \(x\) in the form \(m = a + b x\) as a model for these results.
  2. Show that the residual sum of squares (RSS) is 0.064 to 3 decimal places.
  3. Calculate the residual values.
  4. Write down the outlier.
    1. Comment on the validity of ignoring this outlier.
    2. Ignoring the outlier, produce another model.
    3. Use this model to estimate the average weight at birth if \(x = 15\)
    4. Comment, giving a reason, on the reliability of your estimate.
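The first four parts can be sketched directly from the table. This is an illustrative calculation under the usual least-squares formulas; taking the outlier to be the point with the largest absolute residual is an assumption made here for the sketch, not a rule stated in the question.

```python
x = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
m = [1.50, 1.20, 1.40, 1.40, 1.23, 1.30, 1.20, 1.15, 1.25, 1.15]
n = len(x)

x_bar = sum(x) / n
m_bar = sum(m) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xm = sum((xi - x_bar) * (mi - m_bar) for xi, mi in zip(x, m))

b = s_xm / s_xx            # gradient of m on x
a = m_bar - b * x_bar      # intercept

# Residual = observed average weight minus fitted average weight
residuals = {xi: mi - (a + b * xi) for xi, mi in zip(x, m)}
rss = sum(res ** 2 for res in residuals.values())

# Assumption for this sketch: the outlier is the largest residual in magnitude
outlier_x = max(residuals, key=lambda xi: abs(residuals[xi]))

print(f"m = {a:.3f} + ({b:.4f})x, RSS = {rss:.3f}")
print(f"largest residual at x = {outlier_x}: {residuals[outlier_x]:+.3f}")
```

Any estimate at \(x = 15\) lies outside the observed range of 4 to 13 piglets, so it is an extrapolation whichever model is used.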
OCR FS1 AS 2017 December Q5
8 marks Moderate -0.5
5 A shop manager recorded the maximum daytime temperature \(T ^ { \circ } \mathrm { C }\) and the number \(C\) of ice creams sold on 9 summer days. The results are given in the table and illustrated in the scatter diagram.
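\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|}
\hline
\(T\) & 17 & 21 & 25 & 26 & 27 & 27 & 29 & 30 & 30 \\
\hline
\(C\) & 21 & 16 & 20 & 38 & 32 & 37 & 35 & 39 & 42 \\
\hline
\end{tabular}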
\includegraphics[max width=\textwidth, alt={}]{64d7ed6d-fadd-4c59-afb0-97d1788ba369-3_661_1189_1320_431}
$$n = 9 , \Sigma t = 232 , \Sigma c = 280 , \Sigma t ^ { 2 } = 6130 , \Sigma c ^ { 2 } = 9444 , \Sigma t c = 7489$$
  1. State, with a reason, whether one of the variables \(C\) or \(T\) is likely to be dependent upon the other.
  2. Calculate Pearson's product-moment correlation coefficient \(r\) for the data.
  3. State with a reason what the value of \(r\) would have been if the temperature had been measured in \({ } ^ { \circ } \mathrm { F }\) rather than \({ } ^ { \circ } \mathrm { C }\).
  4. Calculate the equation of the least squares regression line of \(c\) on \(t\).
  5. The regression line is drawn on the copy of the scatter diagram in the Printed Answer Booklet. Use this diagram to explain what is meant by "least squares".
Edexcel S1 2022 January Q6
13 marks Moderate -0.8
  1. Students on a psychology course were given a pre-test at the start of the course and a final exam at the end of the course. The teacher recorded the number of marks achieved on the pre-test, \(p\), and the number of marks achieved on the final exam, \(f\), for 34 students and displayed them on the scatter diagram. \includegraphics[max width=\textwidth, alt={}, center]{fa1cb8a2-dab9-4133-b7a1-9108888c37d7-22_1121_1136_447_438}
The equation of the least squares regression line for these data is found to be $$f = 10.8 + 0.748 p$$ For these students, the mean number of marks on the pre-test is 62.4
  1. Use the regression model to find the mean number of marks on the final exam.
  2. Give an interpretation of the gradient of the regression line. Considering the equation of the regression line, Priya says that she would expect someone who scored 0 marks on the pre-test to score 10.8 marks on the final exam.
  3. Comment on the reliability of Priya's statement.
  4. Write down the number of marks achieved on the final exam for the student who exceeded the expectation of the regression model by the largest number of marks.
  5. Find the range of values of \(p\) for which this regression model, \(f = 10.8 + 0.748 p\), predicts a greater number of marks on the final exam than on the pre-test. Later the teacher discovers an error in the recorded data. The student who achieved a score of 98 on the pre-test, scored 92 not 29 on the final exam. The summary statistics used for the model \(f = 10.8 + 0.748 p\) are corrected to include this information and a new least squares regression line is found. Given the original summary statistics were, $$n = 34 \quad \sum p = 2120 \quad \sum p f = 133486 \quad \mathrm {~S} _ { p p } = 15573.76 \quad \mathrm {~S} _ { p f } = 11648.35$$
  6. calculate the gradient of the new regression line. Show your working clearly.
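Parts (e) and (f) can be sketched as below. The update for part (f) relies on the identity \(S_{pf} = \sum (p - \bar p) f\): correcting a single \(f\) value by \(\delta\) at pre-test score \(p_0\) changes \(S_{pf}\) by \(\delta (p_0 - \bar p)\) and leaves \(S_{pp}\) unchanged. This is one possible route and not necessarily the one the mark scheme follows; variable names are illustrative.

```python
intercept, gradient = 10.8, 0.748

# Part (e): 10.8 + 0.748p > p  <=>  p < 10.8 / (1 - 0.748)
p_threshold = intercept / (1 - gradient)

# Part (f): original summary statistics
n, sum_p = 34, 2120
s_pp, s_pf = 15573.76, 11648.35
p_bar = sum_p / n

p0 = 98                 # pre-test score of the student whose mark was misrecorded
delta_f = 92 - 29       # correction to that student's final exam mark

new_s_pf = s_pf + delta_f * (p0 - p_bar)   # S_pp is unaffected by a change in f
new_gradient = new_s_pf / s_pp

print(f"model predicts f > p for p < {p_threshold:.1f}")
print(f"new gradient = {new_gradient:.3f}")
```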
Edexcel S1 2017 June Q5
15 marks Moderate -0.3
  1. Tomas is studying the relationship between temperature and hours of sunshine in Seapron. He records the midday temperature, \(t ^ { \circ } \mathrm { C }\), and the hours of sunshine, \(s\) hours, for a random sample of 9 days in October. He calculated the following statistics
$$\sum s = 15 \quad \sum s ^ { 2 } = 44.22 \quad \sum t = 127 \quad \mathrm {~S} _ { t t } = 10.89$$
  1. Calculate \(\mathrm { S } _ { s s }\) Tomas calculated the product moment correlation coefficient between \(s\) and \(t\) to be 0.832 correct to 3 decimal places.
  2. State, giving a reason, whether or not this correlation coefficient supports the use of a linear regression model to describe the relationship between midday temperature and hours of sunshine.
  3. State, giving a reason, why the hours of sunshine would be the explanatory variable in a linear regression model between midday temperature and hours of sunshine.
  4. Find \(\mathrm { S } _ { s t }\)
  5. Calculate a suitable linear regression equation to model the relationship between midday temperature and hours of sunshine.
  6. Calculate the standard deviation of \(s\) Tomas uses this model to estimate the midday temperature in Seapron for a day in October with 5 hours of sunshine.
  7. State the value of Tomas' estimate. Given that the values of \(s\) are all within 2 standard deviations of the mean,
  8. comment, giving your reason, on the reliability of this estimate.
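The chain of calculations in this question can be sketched as follows. Two assumptions are made and should be checked against the relevant specification: the standard deviation uses divisor \(n\), and \(S_{st}\) is recovered from the rounded correlation coefficient 0.832, so later figures are approximate. Variable names are illustrative.

```python
import math

n = 9
sum_s, sum_s2, sum_t = 15, 44.22, 127
s_tt = 10.89
r = 0.832                                   # given to 3 decimal places

s_ss = sum_s2 - sum_s ** 2 / n              # part (a)
s_st = r * math.sqrt(s_ss * s_tt)           # part (d), from r = S_st / sqrt(S_ss * S_tt)

# Part (e): regression of t on s, with sunshine as the explanatory variable
b = s_st / s_ss
a = sum_t / n - b * sum_s / n

sd_s = math.sqrt(s_ss / n)                  # part (f), divisor n assumed
estimate = a + b * 5                        # part (g)

# Part (h): largest s consistent with "within 2 standard deviations of the mean"
upper_limit = sum_s / n + 2 * sd_s

print(f"S_ss = {s_ss:.2f}, S_st = {s_st:.2f}, t = {a:.1f} + {b:.3f}s")
print(f"sd of s = {sd_s:.2f}, estimate at s = 5: {estimate:.1f}")
print(f"largest possible observed s: {upper_limit:.2f}")
```

Since 5 hours exceeds the largest value that \(s\) could have taken in the sample, the estimate is an extrapolation and may be unreliable.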