Assess validity of predictions

A question is of this type if and only if it asks whether a regression line provides reliable estimates, whether extrapolation is appropriate, or for a comment on the validity of using the model for prediction.

11 questions

OCR MEI S2 2007 January Q1
1 In a science investigation into energy conservation in the home, a student is collecting data on the time taken for an electric kettle to boil as the volume of water in the kettle is varied. The student's data are shown in the table below, where \(v\) litres is the volume of water in the kettle and \(t\) seconds is the time taken for the kettle to boil (starting with the water at room temperature in each case). Also shown are summary statistics and a scatter diagram on which the regression line of \(t\) on \(v\) is drawn.
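\begin{tabular}{|c|c|c|c|c|c|}
\hline
\(v\) & 0.2 & 0.4 & 0.6 & 0.8 & 1.0 \\
\hline
\(t\) & 44 & 78 & 114 & 156 & 172 \\
\hline
\end{tabular}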
$$n = 5 , \Sigma v = 3.0 , \Sigma t = 564 , \Sigma v ^ { 2 } = 2.20 , \Sigma v t = 405.2 .$$ \includegraphics[max width=\textwidth, alt={}, center]{7ba30ff3-af90-4741-aab1-576efcbcb0b2-2_563_1376_742_386}
  1. Calculate the equation of the regression line of \(t\) on \(v\), giving your answer in the form \(t = a + b v\).
  2. Use this equation to predict the time taken for the kettle to boil when the amount of water which it contains is
    (A) 0.5 litres,
    (B) 1.5 litres.
    Comment on the reliability of each of these predictions.
  3. In the equation of the regression line found in part (i), explain the role of the coefficient of \(v\) in the relationship between time taken and volume of water.
  4. Calculate the values of the residuals for \(v = 0.8\) and \(v = 1.0\).
  5. Explain how, on a scatter diagram with the regression line drawn accurately on it, a residual could be measured and its sign determined.
    (a) A farmer grows Brussels sprouts. The diameter of sprouts in a particular batch, measured in mm , is Normally distributed with mean 28 and variance 16. Sprouts that are between 24 mm and 33 mm in diameter are sold to a supermarket.
  6. Find the probability that the diameter of a randomly selected sprout will be within this range.
  7. The farmer sells the sprouts in this range to the supermarket for 10 pence per kilogram. The farmer sells sprouts under 24 mm in diameter to a frozen food factory for 5 pence per kilogram. Sprouts over 33 mm in diameter are thrown away. Estimate the total income received by the farmer for the batch, which weighs 25000 kg .
  8. By harvesting sprouts earlier, the mean diameter for another batch can be reduced to \(k \mathrm {~mm}\). Find the value of \(k\) for which only \(5 \%\) of the sprouts will be above 33 mm in diameter. You may assume that the variance is still 16 .
    (b) The farmer also grows onions. The weight in kilograms of the onions is Normally distributed with mean 0.155 and variance 0.005 . He is trying out a new variety, which he hopes will yield a higher mean weight. In order to test this, he takes a random sample of 25 onions of the new variety and finds that their total weight is 4.77 kg . You should assume that the weight in kilograms of the new variety is Normally distributed with variance 0.005 .
  9. Write down suitable null and alternative hypotheses for the test in terms of \(\mu\). State the meaning of \(\mu\) in this case.
  10. Carry out the test at the \(1 \%\) level.
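For the kettle data in Q1 (the regression line of \(t\) on \(v\)), the line, the two predictions and the two residuals can be checked numerically. The sketch below is an illustration only, not a mark scheme; the helper name `fit_line` and the in-range check are assumptions introduced here.

```python
# Kettle data from the table in Q1 (v in litres, t in seconds).
v = [0.2, 0.4, 0.6, 0.8, 1.0]
t = [44, 78, 114, 156, 172]


def fit_line(xs, ys):
    """Least-squares line y = a + b*x; returns (a, b). Hypothetical helper."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum(x * x for x in xs) - n * x_bar ** 2
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    b = s_xy / s_xx
    return y_bar - b * x_bar, b


a, b = fit_line(v, t)
print(f"t = {a:.1f} + {b:.1f} v")

for volume in (0.5, 1.5):
    # A prediction is interpolation only if it lies inside the observed range of v.
    kind = "interpolation" if min(v) <= volume <= max(v) else "extrapolation"
    print(f"v = {volume}: predicted t = {a + b * volume:.1f} s ({kind})")

# Residual = observed value minus the value given by the line.
residuals = {x: y - (a + b * x) for x, y in zip(v, t)}
print(f"residual at 0.8: {residuals[0.8]:.1f}, at 1.0: {residuals[1.0]:.1f}")
```

With these data the sketch gives approximately \(t = 12.6 + 167v\). The prediction at 0.5 litres is an interpolation, while 1.5 litres lies beyond the largest volume observed, which is the usual basis for doubting the second prediction.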
CAIE FP2 2017 Specimen Q9
9 A random sample of 8 students is chosen from those sitting examinations in both Mathematics and French. Their marks in Mathematics, \(x\), and in French, \(y\), are summarised as follows. $$\Sigma x = 472 \quad \Sigma x ^ { 2 } = 29950 \quad \Sigma y = 400 \quad \Sigma y ^ { 2 } = 21226 \quad \Sigma x y = 24879$$ Another student scored 72 marks in the Mathematics examination but was unable to sit the French examination.
  1. Estimate the mark that this student would have obtained in the French examination.
  2. Test, at the \(5 \%\) significance level, whether there is non-zero correlation between marks in Mathematics and marks in French.
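A worked sketch of the arithmetic for this question, using only the summary statistics given above. It is an illustration rather than a mark scheme; the notation \(S_{xx}, S_{yy}, S_{xy}\) is introduced here and the final values are rounded.

```latex
\begin{align*}
S_{xx} &= 29950 - \frac{472^2}{8} = 2102, \qquad
S_{yy} = 21226 - \frac{400^2}{8} = 1226, \qquad
S_{xy} = 24879 - \frac{472 \times 400}{8} = 1279 \\
b &= \frac{S_{xy}}{S_{xx}} = \frac{1279}{2102} \approx 0.608, \qquad
\hat{y}(72) = \bar{y} + b\,(72 - \bar{x}) = 50 + 0.608 \times 13 \approx 57.9 \\
r &= \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{1279}{\sqrt{2102 \times 1226}} \approx 0.797
\end{align*}
```

For the test, \(r\) would then be compared with the two-tailed 5\% critical value for \(n = 8\) taken from the tables supplied with the examination (about 0.707).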
OCR MEI AS Paper 2 2022 June Q6
6 The pre-release material contains information about employment rates in London boroughs. The graph shows employment rates for Westminster between 2006 and 2019. \begin{figure}[h]
\captionsetup{labelformat=empty} \caption{Employment rate in Westminster} \includegraphics[alt={},max width=\textwidth]{e0b502a8-c742-4d78-993c-8c0c7329ec9c-05_641_1465_406_242}
\end{figure} A local politician stated that the diagram shows that more than \(60 \%\) of seventy-year-olds were in employment throughout the period from 2006 to 2019.
  1. Use your knowledge of the pre-release material to explain whether there is any evidence to support this statement.
    In order to estimate the employment rate in 2020, two different models were proposed using the LINEST function in a spreadsheet.
    Model 1 (using all the data from 2006 onwards): \(\mathrm { Y } = 0.549 \mathrm { x } - 1040\),
    Model 2 (using data from 2017 onwards): \(\mathrm { Y } = 2.65 \mathrm { x } - 5280\),
    where \(Y =\) employment rate and \(x =\) calendar year.
    It was subsequently found that the employment rate in Westminster in 2020 was 68.4\%.
  2. Determine which of the two models provided the better estimate for the employment rate in Westminster in 2020.
  3. Use your knowledge of the pre-release material to explain whether it would be appropriate to use either model to estimate the employment rate in 2020 in other London boroughs.
  4. What does model 2 predict for employment rates in Westminster in the long term?
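The comparison of the two Westminster models can be sketched as below. The coefficients are those printed in the question, which are rounded, so the outputs are approximate; the function names are assumptions introduced here.

```python
# Models quoted in the question (coefficients as printed, i.e. rounded).
def model_1(year):
    return 0.549 * year - 1040


def model_2(year):
    return 2.65 * year - 5280


ACTUAL_2020 = 68.4  # employment rate (%) reported in the question

errors = {
    "Model 1": abs(model_1(2020) - ACTUAL_2020),
    "Model 2": abs(model_2(2020) - ACTUAL_2020),
}
better = min(errors, key=errors.get)
print(errors, "->", better)

# Long-term behaviour of Model 2: first calendar year with a predicted rate above 100%.
year = 2020
while model_2(year) <= 100:
    year += 1
print("Model 2 exceeds 100% in", year)
```

A predicted employment rate above 100\% is impossible, which is one way to argue that Model 2 cannot hold in the long term.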
OCR Further Statistics 2019 June Q1
1 A set of bivariate data ( \(X , Y\) ) is summarised as follows.
\(n = 25 , \sum x = 9.975 , \sum y = 11.175 , \sum x ^ { 2 } = 5.725 , \sum y ^ { 2 } = 46.200 , \sum x y = 11.575\)
  1. Calculate the value of Pearson's product-moment correlation coefficient.
  2. Calculate the equation of the regression line of \(y\) on \(x\).
    It is desired to know whether the regression line of \(y\) on \(x\) will provide a reliable estimate of \(y\) when \(x = 0.75\).
  3. State one reason for believing that the estimate will be reliable.
  4. State what further information is needed in order to determine whether the estimate is reliable.
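The two calculations in this question can be reproduced from the summary statistics alone. The following is a minimal sketch, not a mark scheme; the variable names are assumptions introduced here.

```python
from math import sqrt

# Summary statistics quoted in the question.
n = 25
sum_x, sum_y = 9.975, 11.175
sum_x2, sum_y2, sum_xy = 5.725, 46.200, 11.575

s_xx = sum_x2 - sum_x ** 2 / n
s_yy = sum_y2 - sum_y ** 2 / n
s_xy = sum_xy - sum_x * sum_y / n

r = s_xy / sqrt(s_xx * s_yy)   # product-moment correlation coefficient
b = s_xy / s_xx                # gradient of y on x
a = sum_y / n - b * sum_x / n  # intercept

print(f"r = {r:.3f}")
print(f"y = {a:.2f} + {b:.2f} x")
print(f"estimate at x = 0.75: {a + b * 0.75:.2f}")
```

The summary statistics give the mean of \(x\) (0.399) but not its range, so they cannot show whether \(x = 0.75\) lies inside the data.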
OCR Further Statistics 2023 June Q2
2 The director of a concert hall wishes to investigate if the price of the most expensive concert tickets affects attendance. The director collects data about the price, \(\pounds P\), of the most expensive tickets and the number of people in the audience, \(H\) hundred (rounded to the nearest hundred), for 20 concerts. For each price there are several different concerts. The results are shown in the table.
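\begin{tabular}{|l|c|c|c|c|c|}
\hline
\(P\) (£) & 75 & 65 & 55 & 45 & 35 \\
\hline
\multirow[t]{5}{*}{\(H\) (hundred)} & 27 & 27 & 27 & 26 & 15 \\
 & 27 & 27 & 20 & 21 & 12 \\
 &  & 22 & 18 & 16 & 9 \\
 &  & 19 & 18 & 13 &  \\
 &  & 12 & 16 & 9 &  \\
\hline
\end{tabular}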
\(\mathrm { n } = 20 \quad \sum \mathrm { p } = 1080 \quad \sum \mathrm {~h} = 381 \quad \sum \mathrm { p } ^ { 2 } = 61300 \quad \sum \mathrm {~h} ^ { 2 } = 8011 \quad \sum \mathrm { ph } = 21535\)
  1. Calculate the equation of the regression line of \(h\) on \(p\).
  2. State what change, if any, there would be to your answer to part (a) if \(H\) had been measured in thousands (to 1 decimal place) rather than in hundreds.
    For a special charity concert, the most expensive tickets cost \(\pounds 50\).
  3. Use your answer to part (b) to estimate the expected size of the audience for this concert. Give your answer correct to \(\mathbf { 1 }\) decimal place.
  4. Comment on the reliability of your answer to part (c). You should refer to
    • the value of the product-moment correlation coefficient for the data, which is 0.642
    • the value of \(\pounds 50\)
    • any one other relevant factor that should be taken into account.
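The effect of changing the units of \(H\) can be explored from the summary statistics. The sketch below assumes that measuring in thousands simply divides every \(h\) value by 10; the helper name `line_from_sums` is hypothetical.

```python
def line_from_sums(n, sum_x, sum_y, sum_x2, sum_xy):
    """Return (a, b) for the least-squares line y = a + b*x from summary sums."""
    s_xx = sum_x2 - sum_x ** 2 / n
    s_xy = sum_xy - sum_x * sum_y / n
    b = s_xy / s_xx
    return sum_y / n - b * sum_x / n, b


# Summary statistics from the question (h in hundreds).
a_hundreds, b_hundreds = line_from_sums(20, 1080, 381, 61300, 21535)

# Same data with h in thousands: each sum involving h is divided by 10.
a_thousands, b_thousands = line_from_sums(20, 1080, 381 / 10, 61300, 21535 / 10)

print(f"h = {a_hundreds:.2f} + {b_hundreds:.3f} p   (hundreds)")
print(f"h = {a_thousands:.3f} + {b_thousands:.4f} p   (thousands)")
print(f"estimate at p = 50: {a_hundreds + b_hundreds * 50:.1f} hundred")
```

Both coefficients scale by the same factor as \(h\), so the fitted line describes the same relationship in different units.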
OCR MEI Further Statistics A AS 2021 November Q6
6 A health researcher is investigating the relationship between age and maximum heart rate. A commonly quoted formula states that 'maximum heart rate \(= 220\) - age in years'. The researcher wants to check if this formula is a satisfactory model for people who work in the large hospital where she is employed. The researcher selects a random sample of 20 people who work in her hospital, and measures their maximum heart rates.
  1. Explain why the researcher selects a sample, rather than using all of the people who work in the hospital.
    The ages, \(x\) years, and maximum heart rates, \(y\) beats per minute, of the people in the researcher's sample are summarised as follows.
    \(n = 20 \quad \sum x = 922 \quad \sum y = 3638 \quad \sum x ^ { 2 } = 47250 \quad \sum y ^ { 2 } = 664610 \quad \sum x y = 164998\)
    These data are illustrated below.
    \includegraphics[max width=\textwidth, alt={}, center]{5be067ff-4668-48d6-8ed2-b8dfa3e678f7-5_758_1246_1027_244}
    1. Draw the line which represents the formula 'maximum heart rate \(= 220 -\) age in years' on the copy of the scatter diagram in the Printed Answer Booklet.
    2. Comment on how well this model fits the data.
  2. Determine the equation of the regression line of maximum heart rate on age.
  3. Use the equation of the regression line to predict the values of the maximum heart rate for each of the following ages.
    • 40 years
    • 5 years
  4. Comment on the reliability of your predictions in part (d).
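A minimal sketch comparing the quoted formula with the regression line fitted from the summary statistics. The function names are assumptions introduced here, and the printed values are rounded.

```python
# Summary statistics quoted in the question (x = age, y = maximum heart rate).
n = 20
sum_x, sum_y = 922, 3638
sum_x2, sum_xy = 47250, 164998

s_xx = sum_x2 - sum_x ** 2 / n
s_xy = sum_xy - sum_x * sum_y / n
b = s_xy / s_xx
a = sum_y / n - b * sum_x / n


def regression(age):
    """Prediction from the fitted line of maximum heart rate on age."""
    return a + b * age


def quoted_formula(age):
    """The commonly quoted model: 220 minus age in years."""
    return 220 - age


print(f"y = {a:.1f} + ({b:.3f}) x")
for age in (40, 5):
    print(f"age {age}: regression {regression(age):.1f}, formula {quoted_formula(age)}")
```

The sample consists of people who work in a hospital, so an age of 5 years is presumably far outside the data and that prediction is an extrapolation.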
OCR MEI Further Statistics A AS Specimen Q6
6 A motorist decides to check the fuel consumption, \(y\) miles per gallon, of her car at particular speeds, \(x \mathrm { mph }\), on flat roads. She carries out the check on a suitable stretch of motorway. Fig. 6 shows her results. \begin{figure}[h]
\includegraphics[alt={},max width=\textwidth]{880026ad-1cd3-40bb-bc87-8dcc94bd9bbd-4_707_1091_1320_477} \captionsetup{labelformat=empty} \caption{Fig. 6}
\end{figure}
  1. Explain why it would not be appropriate to carry out a hypothesis test for correlation based on the product moment correlation coefficient.
  2. (A) One of the results is an outlier. Circle the outlier on the copy of Fig. 6 in the Printed Answer Booklet.
    (B) Suggest one possible reason for the outlier in part (ii) (A) not being used in any analysis. The motorist decides to remove this item of data from any analysis. The table below shows part of a spreadsheet that was used to analyse the 14 remaining data items (with the outlier removed). Some rows of the spreadsheet have been deliberately omitted.
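    \begin{tabular}{|c|c|c|c|c|c|}
    \hline
    Data item & \(x\) & \(y\) & \(x ^ { 2 }\) & \(y ^ { 2 }\) & \(x y\) \\
    \hline
    1 & 50 & 53.6 & 2500 & 2872.96 & 2680 \\
    2 & 50 & 53.3 & 2500 & 2840.89 & 2665 \\
    \ldots & \ldots & \ldots & \ldots & \ldots & \ldots \\
    13 & 70 & 44.8 & 4900 & 2007.04 & 3136 \\
    14 & 70 & 44.2 & 4900 & 1953.64 & 3094 \\
    \hline
    Sum & 840 & 686 & 51150 & 33779.7 & 40812 \\
    \hline
    \end{tabular}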
  3. Calculate the equation of the regression line of \(y\) on \(x\).
  4. Use the equation of the regression line to predict the fuel consumption of the car at
    (A) 58 mph ,
    (B) 30 mph .
  5. Comment on the reliability of your predictions in part (iv).
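A worked sketch of the fuel-consumption regression, assuming the Sum row of the spreadsheet reads \(\sum x = 840\), \(\sum y = 686\), \(\sum x^2 = 51150\), \(\sum xy = 40812\) with \(n = 14\). It is an illustration, not a mark scheme, and the predictions are rounded.

```latex
\begin{align*}
S_{xx} &= 51150 - \frac{840^2}{14} = 750, \qquad
S_{xy} = 40812 - \frac{840 \times 686}{14} = -348 \\
b &= \frac{-348}{750} = -0.464, \qquad
a = \bar{y} - b\,\bar{x} = 49 + 0.464 \times 60 = 76.84 \\
y &= 76.84 - 0.464x: \qquad x = 58 \Rightarrow y \approx 49.9, \qquad x = 30 \Rightarrow y \approx 62.9
\end{align*}
```

The visible rows of the spreadsheet show speeds of 50 and 70 mph, so 58 mph appears to lie within the data, whereas 30 mph appears to lie well below the speeds tested.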
OCR MEI Further Statistics Major 2019 June Q6
6
  1. A researcher is investigating the date of the 'start of spring' at different locations around the country.
    A suitable date (measured in days from the start of the year) can be identified by checking, for example, when buds first appear for certain species of trees and plants, but this is time-consuming and expensive. Satellite data, measuring microwave emissions, can alternatively be used to estimate the date that land-based measurements would give. The researcher chooses a random sample of 12 locations, and obtains land-based measurements for the start of spring date at each location, together with relevant satellite measurements. The scatter diagram in Fig. 6.1 shows the results; the land-based measurements are denoted by \(x\) days and the corresponding values derived from satellite measurements by \(y\) days. \begin{figure}[h]
    \includegraphics[alt={},max width=\textwidth]{3a89edc4-ac93-4691-ade8-4d4665b55202-06_732_1342_781_333} \captionsetup{labelformat=empty} \caption{Fig. 6.1}
    \end{figure} Fig. 6.2 shows part of a spreadsheet used to analyse the data. Some rows of the spreadsheet have been deliberately omitted. \begin{table}[h]
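    \begin{tabular}{|c|c|c|c|c|c|c|}
    \hline
     & A & B & C & D & E & F \\
    \hline
    1 &  & \(x\) & \(y\) & \(x ^ { 2 }\) & \(y ^ { 2 }\) & \(xy\) \\
    2 &  & 90 & 102 & 8100 & 10404 & 9180 \\
    3 &  &  &  &  &  &  \\
    10 &  &  &  &  &  &  \\
    11 &  &  &  &  &  &  \\
    12 &  & 94 & 97 & 8836 & 9409 & 9118 \\
    13 &  & 99 & 101 & 9801 & 10201 & 9999 \\
    14 & Sum & 1131 & 1227 & 107783 & 126725 & 116724 \\
    15 &  &  &  &  &  &  \\
    \hline
    \end{tabular}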
    \captionsetup{labelformat=empty} \caption{Fig. 6.2}
    \end{table}
    1. Calculate the equation of a regression line suitable for estimating the land-based date of the start of spring from satellite measurements.
    2. Using this equation, estimate the land-based date of the start of spring for the following dates from satellite measurements.
      • 95 days
      • 60 days
    3. Comment on the reliability of each of your estimates.
  2. The researcher is also investigating whether there is any correlation between the average temperature during a month in spring and the total rainfall during that month at a particular location. The average temperatures in degrees Celsius and total rainfall in mm for a random selection, over several years, of 10 spring months at this location are as follows.
    \begin{tabular}{|l|c|c|c|c|c|c|c|c|c|c|}
    \hline
    Temperature & 4.2 & 7.1 & 5.6 & 3.5 & 8.6 & 6.5 & 2.7 & 5.9 & 6.7 & 4.1 \\
    \hline
    Rainfall & 18 & 26 & 42 & 76 & 15 & 43 & 84 & 53 & 66 & 36 \\
    \hline
    \end{tabular}
    The researcher plots the scatter diagram shown in Fig. 6.3 to check which type of test to carry out. \begin{figure}[h]
    \includegraphics[alt={},max width=\textwidth]{3a89edc4-ac93-4691-ade8-4d4665b55202-07_693_880_1174_338} \captionsetup{labelformat=empty} \caption{Fig. 6.3}
    \end{figure}
    1. Explain why the researcher might come to the conclusion that a test based on Pearson's product moment correlation coefficient may be valid.
    2. Find the value of Pearson's product moment correlation coefficient.
    3. Carry out a test at the \(5 \%\) significance level to investigate whether there is any correlation between temperature and rainfall.
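For the start-of-spring data in the first part of this question, the land-based date \(x\) is to be estimated from the satellite value \(y\), which suggests the regression line of \(x\) on \(y\). The sketch below takes that approach, reading the sums from row 14 of Fig. 6.2; the names `c` and `d` for the intercept and gradient are assumptions introduced here.

```python
# Sums read from row 14 of the spreadsheet in Fig. 6.2 (n = 12 locations).
n = 12
sum_x, sum_y = 1131, 1227          # land-based, satellite
sum_y2, sum_xy = 126725, 116724

# Estimating x from y, so regress x on y: the gradient uses S_yy, not S_xx.
s_yy = sum_y2 - sum_y ** 2 / n
s_xy = sum_xy - sum_x * sum_y / n
d = s_xy / s_yy
c = sum_x / n - d * sum_y / n

print(f"x = {c:.2f} + {d:.3f} y")
for satellite_day in (95, 60):
    print(f"y = {satellite_day}: estimated x = {c + d * satellite_day:.1f}")
```

The visible rows of the spreadsheet show satellite values of about 97 to 102 days, so an estimate for 60 days is likely to be an extrapolation; the scatter diagram is needed to confirm the full range.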
WJEC Further Unit 2 2019 June Q6
6. The University of Arizona surveyed a large number of households. One purpose of the survey was to determine if annual household income could be predicted from size of family home. The graph of Annual household income, \(y\), versus Size of family home, \(x\), is shown below.
\includegraphics[max width=\textwidth, alt={}, center]{4ecf99c5-c4b3-41b7-a8df-a7c2ca7fcd6a-5_616_1257_566_365}
  1. State the limitations of using the regression line above with reference to the scatter diagram.
    The data for size of family homes between 2000 and 3000 square feet are shown in the diagram below.
    \includegraphics[max width=\textwidth, alt={}, center]{4ecf99c5-c4b3-41b7-a8df-a7c2ca7fcd6a-5_652_1244_1516_360}
    Summary statistics for these data are as follows.
    $$\begin{array} { r c c } \sum x = 93160 & \sum y = 3907142 & n = 37 \\ S _ { x x } = 2869673.03 & S _ { y y } = 44312797167 & S _ { x y } = 348512820.6 \end{array}$$
  2. Calculate the equation of the least squares regression line to predict Annual household income from Size of family home for these data.
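A minimal sketch of the calculation from the summary statistics, assuming the printed value of \(S_{xy}\) is the decimal 348512820.6. The function name `predict_income` and the two trial sizes are hypothetical, chosen to show what happens outside the fitted range.

```python
# Summary statistics quoted in the question for homes of 2000 to 3000 square feet.
n = 37
sum_x, sum_y = 93160, 3907142
s_xx = 2869673.03
s_xy = 348512820.6  # assumes the value printed in the question is a decimal

b = s_xy / s_xx                # gradient: income per extra square foot
a = sum_y / n - b * sum_x / n  # intercept


def predict_income(size_sq_ft):
    """Predicted annual household income for a given home size."""
    return a + b * size_sq_ft


print(f"y = {a:.0f} + {b:.1f} x")

# The fit only describes homes between 2000 and 3000 square feet.
for size in (2500, 1500):
    label = "within fitted range" if 2000 <= size <= 3000 else "outside fitted range"
    print(size, round(predict_income(size)), label)
```

The negative prediction for a 1500 square foot home illustrates why the line should not be used outside the range of sizes it was fitted to.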
Edexcel FS2 2020 June Q3
3 Below are 3 sketches from some students of the residuals from their linear regressions of \(y\) on \(x\).
\includegraphics[max width=\textwidth, alt={}, center]{54bf68ab-7934-432a-890f-20093082ab07-06_252_704_342_660}
\includegraphics[max width=\textwidth, alt={}, center]{54bf68ab-7934-432a-890f-20093082ab07-06_266_718_625_660}
\includegraphics[max width=\textwidth, alt={}, center]{54bf68ab-7934-432a-890f-20093082ab07-06_248_599_936_660}
III
For each sketch you should state, giving your reason,
  1. whether or not the sketch is feasible
    and if it is feasible
  2. whether or not the sketch suggests a linear or a non-linear relationship between \(y\) and \(x\).
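One fact that is useful when judging whether a residual sketch is feasible: for a least-squares line with an intercept, the residuals sum to zero. The sketch below illustrates this with hypothetical data chosen only for the demonstration; it does not reproduce any of the three sketches in the question.

```python
# Hypothetical data, chosen only for illustration (not taken from the question).
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 2.9, 4.2, 6.8, 9.9, 14.3]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
b = s_xy / s_xx
a = y_bar - b * x_bar

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# For a least-squares line with an intercept the residuals always sum to zero,
# so a sketch showing every residual on one side of the axis cannot be correct.
print("sum of residuals:", round(sum(residuals), 10))

# A systematic pattern of signs (rather than random scatter) suggests a non-linear relationship.
print("signs:", "".join("+" if e > 0 else "-" for e in residuals))
```

Here the underlying data are curved, so the residuals are positive at both ends and negative in the middle, even though they still sum to zero.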
OCR MEI Further Statistics Major Specimen Q3
3 A researcher is investigating factors that might affect how many hours per day different species of mammals spend asleep. First she investigates human beings. She collects data on body mass index, \(x\), and hours of sleep, \(y\), for a random sample of people. A scatter diagram of the data is shown in Fig. 3.1 together with the regression line of \(y\) on \(x\). \begin{figure}[h]
\includegraphics[alt={},max width=\textwidth]{e6ee3a4a-3e76-4422-9a78-17b64b458f83-04_885_1584_598_274} \captionsetup{labelformat=empty} \caption{Fig. 3.1}
\end{figure}
  1. Calculate the residual for the data point which has the residual with the greatest magnitude.
  2. Use the equation of the regression line to estimate the mean number of hours spent asleep by a person with body mass index
    (A) 26,
    (B) 16,
    commenting briefly on each of your predictions.
    The researcher then collects additional data for a large number of species of mammals and analyses different factors for effect size. Definitions of the variables measured for a typical animal of the species, the correlations between these variables, and guidelines often used when considering effect size are given in Fig. 3.2.
    \begin{tabular}{|l|p{10cm}|}
    \hline
    Variable & Definition \\
    \hline
    Body mass & Mass of animal in kg \\
    Brain mass & Mass of brain in g \\
    Hours of sleep/day & Number of hours per day spent asleep \\
    Life span & How many years the animal lives \\
    Danger & A measure of how dangerous the animal's situation is when asleep, taking into account predators and how protected the animal's den is: higher value indicates greater danger. \\
    \hline
    \end{tabular}
    \begin{tabular}{|l|c|c|c|c|c|}
    \hline
    Correlations (pmcc) & Body Mass & Brain Mass & Hours of sleep/day & Life span & Danger \\
    \hline
    Body Mass & 1.00 &  &  &  &  \\
    Brain Mass & 0.93 & 1.00 &  &  &  \\
    Hours of sleep/day & -0.31 & -0.36 & 1.00 &  &  \\
    Life span & 0.30 & 0.51 & -0.41 & 1.00 &  \\
    Danger & 0.13 & 0.15 & -0.59 & 0.06 & 1.00 \\
    \hline
    \end{tabular}
    \begin{table}[h]
    \begin{tabular}{|c|c|}
    \hline
    Product moment correlation coefficient & Effect size \\
    \hline
    0.1 & Small \\
    0.3 & Medium \\
    0.5 & Large \\
    \hline
    \end{tabular}
    \captionsetup{labelformat=empty} \caption{Fig. 3.2}
    \end{table}
  3. State two conclusions the researcher might draw from these tables, relevant to her investigation into how many hours mammals spend asleep.
    One of the researcher's students notices the high correlation between body mass and brain mass and produces a scatter diagram for these two variables, shown in Fig. 3.3 below.
    \begin{figure}[h]
    \includegraphics[alt={},max width=\textwidth]{e6ee3a4a-3e76-4422-9a78-17b64b458f83-05_675_698_1802_735} \captionsetup{labelformat=empty} \caption{Fig. 3.3}
    \end{figure}
  4. Comment on the suitability of a linear model for these two variables.
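The effect-size guidelines in Fig. 3.2 can be applied to the correlations with hours of sleep as sketched below. Treating 0.1, 0.3 and 0.5 as lower bounds for each label is an assumption made here, as is the `negligible` label; the correlations are as read from the pmcc table.

```python
# Correlations with hours of sleep/day, read from the pmcc table in Fig. 3.2.
sleep_correlations = {
    "Body mass": -0.31,
    "Brain mass": -0.36,
    "Life span": -0.41,
    "Danger": -0.59,
}


def effect_size(r):
    """Label |r| using 0.1 / 0.3 / 0.5 as lower bounds (an assumption)."""
    size = abs(r)
    if size >= 0.5:
        return "large"
    if size >= 0.3:
        return "medium"
    if size >= 0.1:
        return "small"
    return "negligible"


for variable, r in sleep_correlations.items():
    print(f"{variable}: r = {r:+.2f} -> {effect_size(r)}")
```

A correlation coefficient only measures linear association, so a high value such as the 0.93 between body mass and brain mass does not by itself show that a linear model is suitable; the scatter diagram still has to be inspected.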