Calculate regression line then predict

A question is this sub-type if and only if the student must first calculate the regression line equation from summary statistics (using formulas for gradient and intercept) before making a prediction.
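
For reference, the standard least squares formulas for the regression line \(y = a + b x\) from summary statistics are sketched here. The symbols \(n\), \(S _ { x x }\) and \(S _ { x y }\) follow the notation used in the questions; this is a reminder of the usual textbook results, not a quotation from any mark scheme.

$$S _ { x x } = \Sigma x ^ { 2 } - \frac { ( \Sigma x ) ^ { 2 } } { n } , \quad S _ { x y } = \Sigma x y - \frac { \Sigma x \, \Sigma y } { n } , \quad b = \frac { S _ { x y } } { S _ { x x } } , \quad a = \bar { y } - b \bar { x }$$

A prediction at a given \(x _ { 0 }\) is then \(a + b x _ { 0 }\). For a regression line of \(x\) on \(y\), the roles of \(x\) and \(y\) are swapped throughout.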

9 questions

OCR S1 2005 January Q9
9 Five observations of bivariate data produce the following results, denoted as ( \(x _ { i } , y _ { i }\) ) for \(i = 1,2,3,4,5\). $$\begin{aligned} & ( 13,2.7 ) \\ & \left[ \Sigma x = 90 , \Sigma y = 15.0 , \Sigma x ^ { 2 } = 1720 , \Sigma y ^ { 2 } = 46.86 , \Sigma x y = 264.0 . \right] \end{aligned}$$
  1. Show that the regression line of \(y\) on \(x\) has gradient \(-0.06\), and find its equation in the form \(y = a + b x\).
  2. The regression line is used to estimate the value of \(y\) corresponding to \(x = 20\), but the value \(x = 20\) is accurate only to the nearest whole number. Calculate the difference between the largest and the smallest values that the estimated value of \(y\) could take. The numbers \(e _ { 1 } , e _ { 2 } , e _ { 3 } , e _ { 4 } , e _ { 5 }\) are defined by $$e _ { i } = a + b x _ { i } - y _ { i } \quad \text { for } i = 1,2,3,4,5$$
  3. The values of \(e _ { 1 } , e _ { 2 }\) and \(e _ { 3 }\) are \(0.6 , - 0.7\) and 0.2 respectively. Calculate the values of \(e _ { 4 }\) and \(e _ { 5 }\).
  4. Calculate the value of \(e _ { 1 } ^ { 2 } + e _ { 2 } ^ { 2 } + e _ { 3 } ^ { 2 } + e _ { 4 } ^ { 2 } + e _ { 5 } ^ { 2 }\) and explain the relevance of this quantity to the regression line found in part (i).
  5. Find the mean and the variance of \(e _ { 1 } , e _ { 2 } , e _ { 3 } , e _ { 4 } , e _ { 5 }\).
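As a hedged check on parts 1 and 2 of the OCR S1 2005 question, the gradient and intercept can be recomputed from the quoted sums. This is a minimal sketch rather than the mark scheme's working; the function name `line_from_sums` is an illustrative assumption.

```python
def line_from_sums(n, sum_x, sum_y, sum_xx, sum_xy):
    """Return (a, b) for the least squares line y = a + b x from summary sums."""
    s_xx = sum_xx - sum_x ** 2 / n       # S_xx = sum(x^2) - (sum x)^2 / n
    s_xy = sum_xy - sum_x * sum_y / n    # S_xy = sum(xy) - (sum x)(sum y) / n
    b = s_xy / s_xx                      # gradient
    a = sum_y / n - b * sum_x / n        # intercept = y_bar - b * x_bar
    return a, b

# Summary statistics quoted in the question (n = 5).
a, b = line_from_sums(5, 90, 15.0, 1720, 264.0)
print(f"y = {a:.2f} + ({b:.2f}) x")

# Part 2: x = 20 to the nearest whole number means 19.5 <= x < 20.5,
# so the estimate can vary over an interval of width |b| * 1.
spread = abs(b) * (20.5 - 19.5)
print(f"difference between extreme estimates: {spread:.2f}")
```

The computed gradient agrees with the value the question asks candidates to show.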
CAIE FP2 2011 June Q9
9 The marks achieved by a random sample of 15 college students in a Physics examination ( \(x\) ) and in a General Studies examination ( \(y\) ) are summarised as follows. $$\Sigma x = 752 \quad \Sigma x ^ { 2 } = 38814 \quad \Sigma y = 773 \quad \Sigma y ^ { 2 } = 45351 \quad \Sigma x y = 40236$$
  1. Find the mean values, \(\bar { x }\) and \(\bar { y }\).
  2. Another college student achieved a mark of 56 in the General Studies examination, but was unable to take the Physics examination. Use the equation of a suitable regression line to estimate the mark that the student would have obtained in the Physics examination.
  3. Find the product moment correlation coefficient for the given data.
  4. Stating your hypotheses, test at the \(5 \%\) level of significance whether there is a non-zero product moment correlation coefficient between examination marks in Physics and in General Studies achieved by college students.
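For part 2 of the CAIE FP2 2011 question, one reading of "a suitable regression line" is the line of \(x\) on \(y\), since a Physics mark (\(x\)) is being estimated from a General Studies mark (\(y\)). The sketch below works under that assumption; the function name `regress_x_on_y` is hypothetical.

```python
def regress_x_on_y(n, sum_x, sum_y, sum_yy, sum_xy):
    """Return (c, d) for the regression line x = c + d y from summary sums."""
    s_yy = sum_yy - sum_y ** 2 / n       # S_yy
    s_xy = sum_xy - sum_x * sum_y / n    # S_xy
    d = s_xy / s_yy                      # gradient of x on y
    c = sum_x / n - d * sum_y / n        # intercept = x_bar - d * y_bar
    return c, d

# Summary statistics quoted in the question (n = 15).
c, d = regress_x_on_y(15, 752, 773, 45351, 40236)

# Estimated Physics mark for a General Studies mark of 56.
physics_estimate = c + d * 56
print(f"x = {c:.2f} + {d:.4f} y, estimate at y = 56: {physics_estimate:.1f}")
```

Using the line of \(y\) on \(x\) rearranged for \(x\) would generally give a different estimate, which is why the choice of line matters here.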
CAIE FP2 2015 November Q9
9 A random sample of 8 students is chosen from those sitting examinations in both Mathematics and French. Their marks in Mathematics, \(x\), and in French, \(y\), are summarised as follows. $$\Sigma x = 472 \quad \Sigma x ^ { 2 } = 29950 \quad \Sigma y = 400 \quad \Sigma y ^ { 2 } = 21226 \quad \Sigma x y = 24879$$ Another student scored 72 marks in the Mathematics examination but was unable to sit the French examination. Estimate the mark that this student would have obtained in the French examination. Test, at the \(5 \%\) significance level, whether there is non-zero correlation between marks in Mathematics and marks in French.
OCR Further Statistics AS 2023 June Q3
3 An insurance company collected data concerning the age, \(x\) years, of policy holders and the average size of claim, \(\pounds y\) thousand. The data is summarised as follows.
\(n = 32 \quad \sum x = 1340 \quad \sum y = 612 \quad \sum x ^ { 2 } = 64282 \quad \sum y ^ { 2 } = 13418 \quad \sum x y = 27794\)
  1. Find the variance of \(x\).
  2. Find the equation of the regression line of \(y\) on \(x\).
  3. Hence estimate the expected size of claim from a policy holder of age 48. Tom is aged 48. He claims that the range of the data probably does not include people of his age because the mean age for the data is 41.875, and 48 is not close to this.
  4. Use your answer to part (a) to determine how likely it is that Tom's claim is correct.
  5. Comment on the reliability of your estimate in part (c). You should refer to the value of the product-moment correlation coefficient for the data, which is 0.579 correct to 3 significant figures.
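The summary statistics in the OCR Further Statistics AS 2023 question can be checked with a short sketch. It assumes the variance in part 1 uses divisor \(n\) (the question does not state which divisor is intended), and the variable names are illustrative.

```python
import math

# Summary statistics quoted in the question.
n = 32
sum_x, sum_y = 1340, 612
sum_xx, sum_yy, sum_xy = 64282, 13418, 27794

s_xx = sum_xx - sum_x ** 2 / n
s_yy = sum_yy - sum_y ** 2 / n
s_xy = sum_xy - sum_x * sum_y / n

var_x = s_xx / n                     # assumption: divisor n, not n - 1
b = s_xy / s_xx                      # gradient of y on x
a = sum_y / n - b * sum_x / n        # intercept
claim_at_48 = a + b * 48             # estimated claim in thousands of pounds

# Product-moment correlation coefficient, for comparison with the quoted 0.579.
r = s_xy / math.sqrt(s_xx * s_yy)
print(f"var(x) = {var_x:.1f}, y = {a:.2f} + {b:.4f} x, r = {r:.3f}")
print(f"estimate at x = 48: {claim_at_48:.2f}")
```

The correlation coefficient recovered this way matches the value quoted in part 5 to 3 significant figures, which is a useful consistency check on the sums.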
Edexcel S1 2018 October Q1
1. The heights above sea level ( \(h\) hundred metres) and the temperatures ( \(t ^ { \circ } \mathrm { C }\) ) at 12 randomly selected places in France, at 7 am on July 31st, were recorded. The data are summarised as follows. $$\sum h = 112 \quad \sum t = 136 \quad \sum t ^ { 2 } = 1828 \quad S _ { h t } = - 236 \quad S _ { h h } = 297$$
  1. Find the value of \(S _ { t t }\).
  2. Calculate the product moment correlation coefficient for these data.
  3. Interpret the relationship between \(t\) and \(h\).
  4. Find an equation of the regression line of \(t\) on \(h\). At 7 am on July 31st Yinka is on holiday in South Africa. He uses the regression equation to estimate the temperature when the height above sea level is 500 m.
  5. Find the estimated temperature Yinka calculates.
  6. Comment on the validity of your answer in part (e).
Edexcel S1 2004 November Q2
2. An experiment carried out by a student yielded pairs of \(( x , y )\) observations such that $$\bar { x } = 36 , \quad \bar { y } = 28.6 , \quad S _ { x x } = 4402 , \quad S _ { x y } = 3477.6$$
  1. Calculate the equation of the regression line of \(y\) on \(x\) in the form \(y = a + b x\). Give your values of \(a\) and \(b\) to 2 decimal places.
  2. Find the value of \(y\) when \(x = 45\).
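When the means and the \(S\) values are given directly, as in the Edexcel S1 2004 question, the line follows in two steps. A minimal sketch with illustrative variable names:

```python
# Values quoted in the question.
x_bar, y_bar = 36, 28.6
s_xx, s_xy = 4402, 3477.6

b = s_xy / s_xx            # gradient
a = y_bar - b * x_bar      # intercept

# Coefficients to 2 decimal places, as the question requests.
a_2dp, b_2dp = round(a, 2), round(b, 2)
print(f"y = {a_2dp} + {b_2dp} x")

# Part 2: predicted y at x = 45, using the unrounded coefficients.
y_at_45 = a + b * 45
print(f"y at x = 45: {y_at_45:.2f}")
```

Whether to predict with rounded or unrounded coefficients is a choice; here both give the same answer to 2 decimal places.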
AQA S1 2008 June Q1
1 The table shows the times taken, \(y\) minutes, for a wood glue to dry at different air temperatures, \(x ^ { \circ } \mathrm { C }\).
$$\begin{array}{c|ccccccccc}
\boldsymbol { x } & 10 & 12 & 15 & 18 & 20 & 22 & 25 & 28 & 30 \\ \hline
\boldsymbol { y } & 42.9 & 40.6 & 38.5 & 35.4 & 33.0 & 30.7 & 28.0 & 25.3 & 22.6
\end{array}$$
  1. Calculate the equation of the least squares regression line \(y = a + b x\).
  2. Estimate the time taken for the glue to dry when the air temperature is \(21 ^ { \circ } \mathrm { C }\).
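For the AQA S1 2008 question the raw data are given rather than sums, so \(S _ { x x }\) and \(S _ { x y }\) have to be built first. The sketch below assumes the table rows read as nine \((x, y)\) pairs, which is an interpretation of the table as extracted; the variable names are illustrative.

```python
# Data as read from the table (assumed split into nine pairs).
xs = [10, 12, 15, 18, 20, 22, 25, 28, 30]
ys = [42.9, 40.6, 38.5, 35.4, 33.0, 30.7, 28.0, 25.3, 22.6]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Sums of squares and products about the means.
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

b = s_xy / s_xx            # gradient (negative: warmer air, shorter drying time)
a = y_bar - b * x_bar      # intercept

# Part 2: estimated drying time at 21 degrees C.
estimate = a + b * 21
print(f"y = {a:.2f} + ({b:.3f}) x, estimate at x = 21: {estimate:.1f} minutes")
```

Since 21 lies inside the range of observed temperatures, this is an interpolation rather than an extrapolation.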
Edexcel FS2 AS 2022 June Q3
  1. Gabriela is investigating a particular type of fish, called bream. She wants to create a model to predict the weight, \(w\) grams, of bream based on their length, \(x \mathrm {~cm}\).
For a sample of 27 bream, some summary statistics are given below. $$\begin{gathered} \bar { x } = 31.07 \quad \bar { w } = 628.59 \quad \sum w ^ { 2 } = 11386134 \\ \mathrm {~S} _ { x w } = 13082.3 \quad \mathrm {~S} _ { x x } = 260.8 \end{gathered}$$
  1. Find the value of the product moment correlation coefficient between \(x\) and \(w\).
  2. Explain whether the answer to part (a) is consistent with a linear model for these data.
  3. Find the equation of the regression line of \(w\) on \(x\) in the form \(w = a + b x\). A residual plot for these data is shown below.
    \includegraphics[max width=\textwidth, alt={}, center]{128c408d-3e08-4f74-8f19-d33ecd5c882f-06_931_1790_1107_139} One of the bream in the sample has a length of 32 cm .
  4. Find its weight.
  5. With reference to the residual plot, comment on the model for bream with lengths above 33 cm .
Edexcel S1 Q6
6. A local authority is investigating the cost of reconditioning its incinerators. Data from 10 randomly chosen incinerators were collected. The variables monitored were the operating time \(x\) (in thousands of hours) since last reconditioning and the reconditioning cost \(y\) (in \(\pounds 1000\) ). None of the incinerators had been used for more than 3000 hours since last reconditioning. The data are summarised below, $$\Sigma x = 25.0 , \Sigma x ^ { 2 } = 65.68 , \Sigma y = 50.0 , \Sigma y ^ { 2 } = 260.48 , \Sigma x y = 130.64$$
  1. Find \(\mathrm { S } _ { x x } , \mathrm {~S} _ { x y } , \mathrm {~S} _ { y y }\).
  2. Calculate the product moment correlation coefficient between \(x\) and \(y\).
  3. Explain why this value might support the fitting of a linear regression model of the form \(y = a + b x\).
  4. Find the values of \(a\) and \(b\).
  5. Give an interpretation of \(a\).
  6. Estimate
    1. the reconditioning cost for an operating time of 2400 hours,
    2. the financial effect of an increase of 1500 hours in operating time.
  7. Suggest why the authority might be cautious about making a prediction of the reconditioning cost of an incinerator which had been operating for 4500 hours since its last reconditioning.
    Edexcel GCE Statistics S1 (New Syllabus), Advanced/Advanced Subsidiary. Paper Reference 6683. Tuesday 12 June 2001 - Afternoon. Time: 1 hour 30 minutes.
    1. Each of the 25 students on a computer course recorded the number of minutes \(x\), to the nearest minute, spent surfing the internet during a given day. The results are summarised below.
    $$\Sigma x = 1075 , \Sigma x ^ { 2 } = 46625$$
  8. Find \(\mu\) and \(\sigma\) for these data. Two other students surfed the internet on the same day for 35 and 51 minutes respectively.
  9. Without further calculation, explain the effect on the mean of including these two students.
    2. On a particular day in summer 1993 at 0800 hours the height above sea level, \(x\) metres, and the temperature, \(y ^ { \circ } \mathrm { C }\), were recorded in 10 Mediterranean towns. The following summary statistics were calculated from the results. $$\Sigma x = 7300 , \Sigma x ^ { 2 } = 6599600 , S _ { x y } = - 13060 , S _ { y y } = 140.9 .$$
  10. Find \(S _ { x x }\).
  11. Calculate, to 3 significant figures, the product moment correlation coefficient between \(x\) and \(y\).
  12. Give an interpretation of your coefficient.
    3. The continuous random variable \(Y\) is normally distributed with mean 100 and variance 256 .
  13. Find \(\mathrm { P } ( Y < 80 )\).
  14. Find \(k\) such that \(\mathrm { P } ( 100 - k \leq Y \leq 100 + k ) = 0.516\).
    4. The discrete random variable \(X\) has the probability function shown in the table below.
    $$\begin{array}{c|cccccc}
    x & -2 & -1 & 0 & 1 & 2 & 3 \\ \hline
    \mathrm { P } ( X = x ) & 0.1 & \alpha & 0.3 & 0.2 & 0.1 & 0.1
    \end{array}$$
    Find
  15. \(\alpha\),
  16. \(\mathrm { P } ( - 1 < X \leq 2 )\),
  17. \(\mathrm { F } ( - 0.4 )\),
  18. \(\mathrm { E } ( 3 X + 4 )\),
  19. \(\operatorname { Var } ( 2 X + 3 )\).
    5. A market researcher asked 100 adults which of the three newspapers \(A , B , C\) they read. The results showed that 30 read \(A\), 26 read \(B\), 21 read \(C\), 5 read both \(A\) and \(B\), 7 read both \(B\) and \(C\), 6 read both \(C\) and \(A\) and 2 read all three.
  20. Draw a Venn diagram to represent these data. One of the adults is then selected at random.
    Find the probability that she reads
  21. at least one of the newspapers,
  22. only \(A\),
  23. only one of the newspapers,
  24. \(A\) given that she reads only one newspaper.
    6. Three swimmers Alan, Diane and Gopal record the number of lengths of the swimming pool they swim during each practice session over several weeks. The stem and leaf diagram below shows the results for Alan.
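    $$\begin{array}{r|lr}
    \text{Lengths} & 2 \mid 0 \text{ means } 20 & \\ \hline
    2 & 0 \; 1 \; 2 \; 2 & ( 4 ) \\
    2 & 5 \; 5 \; 6 \; 7 \; 7 \; 8 \; 9 & ( 7 ) \\
    3 & 0 \; 1 \; 2 \; 2 \; 4 & ( 5 ) \\
    3 & 5 \; 6 \; 6 \; 7 \; 9 & ( 5 ) \\
    4 & 0 \; 1 \; 3 \; 3 \; 3 \; 3 \; 3 \; 4 \; 4 \; 4 & ( 10 ) \\
    4 & 5 \; 5 \; 6 \; 6 \; 6 \; 7 \; 7 \; 8 \; 8 \; 9 \; 9 \; 9 & ( 12 ) \\
    5 & 0 \; 0 \; 0 & ( 3 )
    \end{array}$$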
  25. Find the three quartiles for Alan's results. The table below summarises the results for Diane and Gopal.
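    $$\begin{array}{l|cc}
    & \text{Diane} & \text{Gopal} \\ \hline
    \text{Smallest value} & 35 & 25 \\
    \text{Lower quartile} & 37 & 34 \\
    \text{Median} & 42 & 42 \\
    \text{Upper quartile} & 53 & 50 \\
    \text{Largest value} & 65 & 57
    \end{array}$$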
  26. Using the same scale and on the same sheet of graph paper draw box plots to represent the data for Alan, Diane and Gopal.
  27. Compare and contrast the three box plots.