5.09b Least squares regression: concepts

144 questions

OCR S1 2005 January Q9
15 marks Standard +0.3
9 Five observations of bivariate data produce the following results, denoted as ( \(x _ { i } , y _ { i }\) ) for \(i = 1,2,3,4,5\). $$\begin{aligned} & ( 13,2.7 ) \\ & { \left[ \Sigma x = 90 , \Sigma y = 15.0 , \Sigma x ^ { 2 } = 1720 , \Sigma y ^ { 2 } = 46.86 , \Sigma x y = 264.0 . \right] } \end{aligned}$$
  1. Show that the regression line of \(y\) on \(x\) has gradient \(-0.06\), and find its equation in the form \(y = a + b x\).
  2. The regression line is used to estimate the value of \(y\) corresponding to \(x = 20\), but the value \(x = 20\) is accurate only to the nearest whole number. Calculate the difference between the largest and the smallest values that the estimated value of \(y\) could take.
    The numbers \(e _ { 1 } , e _ { 2 } , e _ { 3 } , e _ { 4 } , e _ { 5 }\) are defined by $$e _ { i } = a + b x _ { i } - y _ { i } \quad \text { for } i = 1,2,3,4,5$$
  3. The values of \(e _ { 1 } , e _ { 2 }\) and \(e _ { 3 }\) are \(0.6 , - 0.7\) and 0.2 respectively. Calculate the values of \(e _ { 4 }\) and \(e _ { 5 }\).
  4. Calculate the value of \(e _ { 1 } ^ { 2 } + e _ { 2 } ^ { 2 } + e _ { 3 } ^ { 2 } + e _ { 4 } ^ { 2 } + e _ { 5 } ^ { 2 }\) and explain the relevance of this quantity to the regression line found in part (i).
  5. Find the mean and the variance of \(e _ { 1 } , e _ { 2 } , e _ { 3 } , e _ { 4 } , e _ { 5 }\).
OCR S1 2007 January Q5
8 marks Moderate -0.8
5 A chemical solution was gradually heated. At five-minute intervals the time, \(x\) minutes, and the temperature, \(y ^ { \circ } \mathrm { C }\), were noted.

| \(x\) | 0 | 5 | 10 | 15 | 20 | 25 | 30 | 35 |
|---|---|---|---|---|---|---|---|---|
| \(y\) | 0.8 | 3.0 | 6.8 | 10.9 | 15.6 | 19.6 | 23.4 | 26.7 |

$$\left[ n = 8 , \Sigma x = 140 , \Sigma y = 106.8 , \Sigma x ^ { 2 } = 3500 , \Sigma y ^ { 2 } = 2062.66 , \Sigma x y = 2685.0 . \right]$$
  1. Calculate the equation of the regression line of \(y\) on \(x\).
  2. Use your equation to estimate the temperature after 12 minutes.
  3. It is given that the value of the product moment correlation coefficient is close to +1. Comment on the reliability of using your equation to estimate \(y\) when
    1. \(x = 17\),
    2. \(x = 57\).
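The calculation behind parts 1 and 2 of this question can be sketched in Python. This is a minimal illustration rather than a mark-scheme solution: it assumes the standard results \(S _ { x x } = \Sigma x ^ { 2 } - ( \Sigma x ) ^ { 2 } / n\), \(S _ { x y } = \Sigma x y - \Sigma x \Sigma y / n\), \(b = S _ { x y } / S _ { x x }\) and \(a = \bar { y } - b \bar { x }\), and the helper name `regression_from_sums` is hypothetical.

```python
def regression_from_sums(n, sum_x, sum_y, sum_xx, sum_xy):
    """Return (a, b) for the least squares line y = a + b x.

    Hypothetical helper: works from summary statistics only.
    """
    s_xx = sum_xx - sum_x ** 2 / n        # corrected sum of squares of x
    s_xy = sum_xy - sum_x * sum_y / n     # corrected sum of products
    b = s_xy / s_xx                       # gradient
    a = sum_y / n - b * sum_x / n         # line passes through (x-bar, y-bar)
    return a, b


# Summary statistics quoted in the question (n = 8).
a, b = regression_from_sums(8, 140, 106.8, 3500, 2685.0)
estimate_12 = a + b * 12                  # interpolated estimate at x = 12

print(f"y = {a:.3f} + {b:.4f} x")
print(f"estimate at x = 12: {estimate_12:.2f}")
```

With these sums the gradient is \(816 / 1050 \approx 0.777\) and the intercept is \(-0.25\), so the estimate at 12 minutes is about 9.1 °C. The code says nothing about reliability: \(x = 17\) lies inside the observed range of 0 to 35 minutes, while \(x = 57\) lies well outside it.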
OCR S1 2008 January Q9
11 marks Moderate -0.8
9 It is thought that the pH value of sand (a measure of the sand's acidity) may affect the extent to which a particular species of plant will grow in that sand. A botanist wished to determine whether there was any correlation between the pH value of the sand on certain sand dunes, and the amount of each of two plant species growing there. She chose random sections of equal area on each of eight sand dunes and measured the pH values. She then measured the area within each section that was covered by each of the two species. The results were as follows.

| Dune | \(A\) | \(B\) | \(C\) | \(D\) | \(E\) | \(F\) | \(G\) | \(H\) |
|---|---|---|---|---|---|---|---|---|
| pH value, \(x\) | 8.5 | 8.5 | 9.5 | 8.5 | 6.5 | 7.5 | 8.5 | 9.0 |
| Area, \(y \mathrm {~cm} ^ { 2 }\), covered by species \(P\) | 150 | 150 | 575 | 330 | 45 | 15 | 340 | 330 |
| Area, \(y \mathrm {~cm} ^ { 2 }\), covered by species \(Q\) | 170 | 15 | 80 | 230 | 75 | 25 | 0 | 0 |

The results for species \(P\) can be summarised by $$n = 8 , \quad \Sigma x = 66.5 , \quad \Sigma x ^ { 2 } = 558.75 , \quad \Sigma y = 1935 , \quad \Sigma y ^ { 2 } = 711275 , \quad \Sigma x y = 17082.5 .$$
  1. Give a reason why it might be appropriate to calculate the equation of the regression line of \(y\) on \(x\) rather than \(x\) on \(y\) in this situation.
  2. Calculate the equation of the regression line of \(y\) on \(x\) for species \(P\), in the form \(y = a + b x\), giving the values of \(a\) and \(b\) correct to 3 significant figures.
  3. Estimate the value of \(y\) for species \(P\) on sand where the pH value is 7.0.
    The values of the product moment correlation coefficient between \(x\) and \(y\) for species \(P\) and \(Q\) are \(r _ { P } = 0.828\) and \(r _ { Q } = 0.0302\).
  4. Describe the relationship between the area covered by species \(Q\) and the pH value.
  5. State, with a reason, whether the regression line of \(y\) on \(x\) for species \(P\) will provide a reliable estimate of the value of \(y\) when the pH value is
    1. 8,
    2. 4.
  6. Assume that the equation of the regression line of \(y\) on \(x\) for species \(Q\) is also known. State, with a reason, whether this line will provide a reliable estimate of the value of \(y\) when the pH value is 8.
OCR S1 2016 June Q2
10 marks Moderate -0.3
2
  1. The table shows the amount, \(x\), in hundreds of pounds, spent on heating and the number of absences, \(y\), at a factory during each month in 2014.
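
    | Amount, \(x\), spent on heating (£ hundreds) | 21 | 23 | 19 | 15 | 14 | 5 | 2 | 10 | 9 | 20 | 18 | 23 |
    |---|---|---|---|---|---|---|---|---|---|---|---|---|
    | Number of absences, \(y\) | 23 | 25 | 18 | 18 | 12 | 10 | 4 | 9 | 11 | 15 | 20 | 26 |
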
    \(n = 12 \quad \Sigma x = 179 \quad \Sigma x ^ { 2 } = 3215 \quad \Sigma y = 191 \quad \Sigma y ^ { 2 } = 3565 \quad \Sigma x y = 3343\)
    1. Calculate \(r\), the product moment correlation coefficient, showing that \(r > 0.92\).
    2. A manager says, 'The value of \(r\) shows that spending more money on heating causes more absences, so we should spend less on heating.' Comment on this claim.
    3. The months in 2014 were numbered \(1,2,3 , \ldots , 12\). The output, \(z\), in suitable units was recorded along with the month number, \(n\), for each month in 2014. The equation of the regression line of \(z\) on \(n\) was found to be \(z = 0.6 n + 17\).
      (a) Use this equation to explain whether output generally increased or decreased over these months.
      (b) Find the mean of \(n\) and use the equation of the regression line to calculate the mean of \(z\).
      (c) Hence calculate the total output in 2014.
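As a rough check on the arithmetic in this question, the sketch below computes \(r\) from the quoted summary statistics and then uses the fact that a least squares regression line passes through the point of means to get the mean and total of \(z\). The function name `pmcc_from_sums` is a hypothetical helper, and the formula \(r = S _ { x y } / \sqrt { S _ { x x } S _ { y y } }\) is the standard one rather than anything specific to this paper.

```python
from math import sqrt


def pmcc_from_sums(n, sum_x, sum_y, sum_xx, sum_yy, sum_xy):
    """Product moment correlation coefficient from summary statistics."""
    s_xx = sum_xx - sum_x ** 2 / n
    s_yy = sum_yy - sum_y ** 2 / n
    s_xy = sum_xy - sum_x * sum_y / n
    return s_xy / sqrt(s_xx * s_yy)


# Heating spend x and absences y, n = 12 months.
r = pmcc_from_sums(12, 179, 191, 3215, 3565, 3343)

# Month numbers 1..12; the line z = 0.6 n + 17 passes through (mean n, mean z).
months = range(1, 13)
mean_n = sum(months) / 12
mean_z = 0.6 * mean_n + 17
total_output = 12 * mean_z

print(f"r = {r:.4f}")
print(f"mean n = {mean_n}, mean z = {mean_z:.1f}, total = {total_output:.1f}")
```

The value of \(r\) comes out a little above 0.92, the mean month number is 6.5, the mean output is 20.9 and the total is 250.8. A large \(r\) measures linear association only, so it does not by itself support the manager's causal claim.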
OCR S1 Specimen Q8
13 marks Moderate -0.8
8 An experiment was conducted to see whether there was any relationship between the maximum tidal current, \(y \mathrm {~cm} \mathrm {~s} ^ { - 1 }\), and the tidal range, \(x\) metres, at a particular marine location. [The tidal range is the difference between the height of high tide and the height of low tide.] Readings were taken over a period of 12 days, and the results are shown in the following table.
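
| \(x\) | 2.0 | 2.4 | 3.0 | 3.1 | 3.4 | 3.7 | 3.8 | 3.9 | 4.0 | 4.5 | 4.6 | 4.9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| \(y\) | 15.2 | 22.0 | 25.2 | 33.0 | 33.1 | 34.2 | 51.0 | 42.3 | 45.0 | 50.7 | 61.0 | 59.2 |
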
$$\left[ \Sigma x = 43.3 , \Sigma y = 471.9 , \Sigma x ^ { 2 } = 164.69 , \Sigma y ^ { 2 } = 20915.75 , \Sigma x y = 1837.78 . \right]$$ The scatter diagram below illustrates the data. \includegraphics[max width=\textwidth, alt={}, center]{2fb25fc5-0445-44fa-a23e-647d14b1a376-4_462_793_1464_644}
  1. Calculate the product moment correlation coefficient for the data, and comment briefly on your answer with reference to the appearance of the scatter diagram.
  2. Calculate the equation of the regression line of maximum tidal current on tidal range.
  3. Estimate the maximum tidal current on a day when the tidal range is 4.2 m, and comment briefly on how reliable you consider your estimate is likely to be.
  4. It is suggested that the equation found in part (ii) could be used to predict the maximum tidal current on a day when the tidal range is 15 m. Comment briefly on the validity of this suggestion.
OCR MEI S2 2008 January Q1
18 marks Moderate -0.3
1 A biology student is carrying out an experiment to study the effect of a hormone on the growth of plant shoots. The student applies the hormone at various concentrations to a random sample of twelve shoots and measures the growth of each shoot. The data are illustrated on the scatter diagram below, together with the summary statistics for these data. The variables \(x\) and \(y\), measured in suitable units, represent concentration and growth respectively. \includegraphics[max width=\textwidth, alt={}, center]{20fc4222-95c6-4b59-8e89-913dd988eb44-2_693_897_534_625} $$n = 12 , \Sigma x = 30 , \Sigma y = 967.6 , \Sigma x ^ { 2 } = 90 , \Sigma y ^ { 2 } = 78926 , \Sigma x y = 2530.3 .$$
  1. State which of the two variables \(x\) and \(y\) is the independent variable and which is the dependent variable. Briefly explain your answers.
  2. Calculate the equation of the regression line of \(y\) on \(x\).
  3. Use the equation of the regression line to calculate estimates of shoot growth for concentrations of
    (A) 1.2,
    (B) 4.3.
    Comment on the reliability of each of these estimates.
  4. Calculate the value of the residual for the data point where \(x = 3\) and \(y = 80\).
  5. In further experiments, the student finds that using concentration \(x = 6\) results in shoot growths of around \(y = 20\). In the light of all the available information, what can be said about the relationship between \(x\) and \(y\) ?
OCR MEI S2 2005 June Q3
17 marks Standard +0.3
3 In a triathlon, competitors have to swim 600 metres, cycle 40 kilometres and run 10 kilometres. To improve her strength, a triathlete undertakes a training programme in which she carries weights in a rucksack whilst running. She runs a specific course and notes the total time taken for each run. Her coach is investigating the relationship between time taken and weight carried. The times taken with eight different weights are illustrated on the scatter diagram below, together with the summary statistics for these data. The variables \(x\) and \(y\) represent weight carried in kilograms and time taken in minutes respectively. \includegraphics[max width=\textwidth, alt={}, center]{be463718-caf7-4bc8-b838-143ab4681d6e-4_627_1536_630_281} Summary statistics: \(n = 8 , \Sigma x = 36 , \Sigma y = 214.8 , \Sigma x ^ { 2 } = 204 , \Sigma y ^ { 2 } = 5775.28 , \Sigma x y = 983.6\).
  1. Calculate the equation of the regression line of \(y\) on \(x\).
    On one of the eight runs, the triathlete was carrying 4 kilograms and took 27.5 minutes. On this run she was delayed when she tripped and fell over.
  2. Calculate the value of the residual for this weight.
  3. The coach decides to recalculate the equation of the regression line without the data for this run. Would it be preferable to use this recalculated equation or the equation found in part (i) to estimate the delay when the triathlete tripped and fell over? Explain your answer.
    The triathlete's coach claims that there is positive correlation between cycling and swimming times in triathlons. The product moment correlation coefficient of the times of twenty randomly selected competitors in these two sections is 0.209.
  4. Carry out a hypothesis test at the \(5 \%\) level to examine the coach's claim, explaining your conclusions clearly.
  5. What distributional assumption is necessary for this test to be valid? How can you use a scatter diagram to decide whether this assumption is likely to be true?
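The residual asked for in part 2 of this question can be sketched as follows. The sketch assumes the convention residual = observed value minus value predicted by the line, which is the usual one but is not stated in the question; the variable names are illustrative.

```python
# Summary statistics quoted in the question (weight x in kg, time y in minutes).
n, sum_x, sum_y, sum_xx, sum_xy = 8, 36, 214.8, 204, 983.6

s_xx = sum_xx - sum_x ** 2 / n          # 204 - 162 = 42
s_xy = sum_xy - sum_x * sum_y / n       # 983.6 - 966.6 = 17
b = s_xy / s_xx                         # gradient of y on x
a = sum_y / n - b * sum_x / n           # intercept

# Run on which the triathlete fell: x = 4 kg, observed time 27.5 minutes.
predicted = a + b * 4
residual = 27.5 - predicted             # assumed convention: observed - predicted

print(f"y = {a:.3f} + {b:.4f} x")
print(f"predicted = {predicted:.3f}, residual = {residual:.3f}")
```

The residual is positive (about 0.85 minutes), meaning the observed time lies above the fitted line. Because the unusual run is itself included in the fit, it pulls the line towards it, which is the point behind part 3.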
Edexcel S1 2016 January Q3
15 marks Moderate -0.3
3. A publisher collects information about the amount spent on advertising, \(\pounds x\), and the sales, \(y\) books, for some of her publications. She collects information for a random sample of 8 textbooks and codes the data using \(v = \frac { x + 50 } { 200 }\) and \(s = \frac { y } { 1000 }\) to give

| \(v\) | 0.60 | 8.10 | 4.30 | 0.40 | 1.60 | 6.40 | 2.50 | 5.10 |
|---|---|---|---|---|---|---|---|---|
| \(s\) | 1.84 | 6.73 | 5.95 | 1.30 | 2.45 | 7.46 | 4.82 | 6.25 |

[You may use: \(\sum v = 29 \quad \sum s = 36.8 \quad \sum s ^ { 2 } = 209.72 \quad \sum v s = 177.311 \quad \mathrm {~S} _ { v v } = 55.275\)]
  1. Find \(\mathrm { S } _ { v s }\) and \(\mathrm { S } _ { s s }\)
  2. Calculate the product moment correlation coefficient for these data.
    The publisher believes that a linear regression model may be appropriate to describe these data.
  3. State, giving a reason, whether or not your answer to part (b) supports the publisher's belief.
  4. Find the equation of the regression line of \(s\) on \(v\), giving your answer in the form \(s = a + b v\)
  5. Hence find the equation of the regression line of \(y\) on \(x\) for the sample of textbooks, giving your answer in the form \(y = c + d x\).
    The publisher calculated the regression line for a sample of novels and obtained the equation $$y = 3100 + 1.2 x$$ She wants to increase the sales of books by spending more money on advertising.
  6. State, giving your reasons, whether the publisher should spend more money on advertising textbooks or novels.
Edexcel S1 2017 January Q3
17 marks Moderate -0.3
  1. A scientist measured the salinity of water, \(x \mathrm {~g} / \mathrm { kg }\), and recorded the temperature at which the water froze, \(y ^ { \circ } \mathrm { C }\), for 12 different water samples. The summary statistics are listed below.
$$\begin{gathered} \sum x = 504 \quad \sum y = - 27 \quad \sum x ^ { 2 } = 22842 \quad \sum y ^ { 2 } = 62.98 \\ \sum x y = - 1190.7 \quad \mathrm {~S} _ { x x } = 1674 \quad \mathrm {~S} _ { y y } = 2.23 \end{gathered}$$
  1. Find the mean and variance of the recorded temperatures.
    Priya believes that the higher the salinity of water, the higher the temperature at which the water freezes.
    1. Calculate the product moment correlation coefficient between \(x\) and \(y\)
    2. State, with a reason, whether or not this value supports Priya's belief.
  2. Find the least squares regression line of \(y\) on \(x\) in the form \(y = a + b x\) Give the value of \(a\) and the value of \(b\) to 3 significant figures.
  3. Estimate the temperature at which water freezes when the salinity is \(32 \mathrm {~g} / \mathrm { kg }\).
    The coding \(w = 1.8 y + 32\) is used to convert the recorded temperatures from \({ } ^ { \circ } \mathrm { C }\) to \({ } ^ { \circ } \mathrm { F }\).
  4. Find an equation of the least squares regression line of \(w\) on \(x\) in the form \(w = c + d x\)
  5. Find
    1. the variance of the recorded temperatures when converted to \({ } ^ { \circ } \mathrm { F }\)
    2. the product moment correlation coefficient between \(w\) and \(x\)
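The effect of the linear coding \(w = 1.8 y + 32\) in this question can be illustrated with a short Python sketch. It assumes the variance is taken with divisor \(n\) (that is, \(S _ { y y } / n\)), and it relies on the standard facts that a linear coding with positive scale factor multiplies the gradient by the scale factor, multiplies the variance by its square, and leaves the product moment correlation coefficient unchanged. Variable names are illustrative.

```python
from math import sqrt

# Summary statistics quoted in the question (12 water samples).
n = 12
sum_x, sum_y, sum_xy = 504, -27, -1190.7
s_xx, s_yy = 1674, 2.23

s_xy = sum_xy - sum_x * sum_y / n       # -1190.7 + 1134 = -56.7
b = s_xy / s_xx                         # gradient of y on x
a = sum_y / n - b * sum_x / n           # intercept of y on x
r_xy = s_xy / sqrt(s_xx * s_yy)         # correlation between x and y

# Coding w = shift + scale * y converts degrees C to degrees F.
shift, scale = 32, 1.8
c = shift + scale * a                   # intercept of w on x
d = scale * b                           # gradient of w on x
var_y = s_yy / n                        # assumed divisor n
var_w = scale ** 2 * var_y              # variance scales by scale squared
r_xw = r_xy                             # unchanged because scale > 0

print(f"y = {a:.3f} + {b:.4f} x,  r = {r_xy:.3f}")
print(f"w = {c:.2f} + {d:.4f} x,  var_w = {var_w:.3f}")
```

The correlation is about \(-0.928\), which is strongly negative, so it points the opposite way to the belief that higher salinity goes with a higher freezing temperature.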
Edexcel S1 2018 January Q3
8 marks Moderate -0.8
3. Martin is investigating the relationship between a person's daily caffeine consumption, \(c\) milligrams, and the amount of sleep they get, \(h\) hours, per night. He collected this information from 20 people and the results are summarised below. $$\begin{array} { c c } \sum c = 3660 \quad \sum h = 126 \quad \sum c ^ { 2 } = 973228 \\ \sum c h = 20023.4 \quad S _ { c c } = 303448 \quad S _ { c h } = - 3034.6 \end{array}$$ Martin calculates the product moment correlation coefficient for these data and obtains \(-0.833\).
  1. Give a reason why this value supports a linear relationship between \(c\) and \(h\).
    The amount of sleep per night is the response variable.
  2. Explain what you understand by the term 'response variable'.
    Martin says that for each additional 100 mg of caffeine consumed, the expected number of hours of sleep decreases by 1.
  3. Determine, by calculation, whether or not the data support this statement.
  4. Use the data to calculate an estimate for the expected number of hours of sleep per night when no caffeine is consumed.
Edexcel S1 2018 January Q5
12 marks Moderate -0.3
5. Franca is the manager of an accountancy firm. She is investigating the relationship between the salary, \(\pounds x\), and the length of commute, \(y\) minutes, for employees at the firm. She collected this information from 9 randomly selected employees. The salary of each employee was then coded using \(w = \frac { x - 20000 } { 1000 }\) The table shows the values of \(w\) and \(y\) for the 9 employees.
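
| \(w\) | 6 | 8 | 8 | \(-1\) | 25 | 15 | 3 | \(-2\) | 19 |
|---|---|---|---|---|---|---|---|---|---|
| \(y\) | 45 | 50 | 35 | 65 | 25 | 40 | 50 | 75 | 20 |
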
(You may use \(\sum w = 81 \quad \sum y = 405 \quad \sum w y = 2490 \quad S _ { w w } = 660 \quad S _ { y y } = 2500\) )
  1. Calculate the salary of the employee with \(w = - 2\)
  2. Show that, to 3 significant figures, the value of the product moment correlation coefficient between \(w\) and \(y\) is \(-0.899\).
  3. State, giving a reason, the value of the product moment correlation coefficient between \(x\) and \(y\).
    The least squares regression line of \(y\) on \(w\) is \(y = 60.75 - 1.75 w\).
  4. Find the equation of the least squares regression line of \(y\) on \(x\) giving your answer in the form \(y = a + b x\)
  5. Estimate the length of commute for an employee with a salary of \(\pounds 21000\).
    Franca uses the regression line to estimate the length of commute for employees with salaries between \(\pounds 25000\) and \(\pounds 40000\).
  6. State, giving a reason, whether or not these estimates are reliable.
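Undoing the coding in parts 4 and 5 of this question amounts to substituting \(w = \frac { x - 20000 } { 1000 }\) into \(y = 60.75 - 1.75 w\). A minimal Python sketch of that substitution follows; `decode_line` is a hypothetical helper name, not something from the question.

```python
def decode_line(a_w, b_w, shift, scale):
    """Rewrite y = a_w + b_w * w, where w = (x - shift) / scale, as y = a + b * x."""
    b = b_w / scale                      # gradient per unit of x
    a = a_w - b_w * shift / scale        # constant after expanding the bracket
    return a, b


# Coded line y = 60.75 - 1.75 w with w = (x - 20000) / 1000.
a, b = decode_line(60.75, -1.75, 20000, 1000)
estimate = a + b * 21000                 # commute estimate for a salary of 21000

print(f"y = {a:.2f} + ({b:.5f}) x")
print(f"estimated commute: {estimate:.1f} minutes")
```

This gives \(y = 95.75 - 0.00175 x\) and an estimate of 59 minutes, which agrees with putting \(w = 1\) straight into the coded line. Salaries from £25000 to £40000 correspond to \(w\) from 5 to 20, which can be compared with the coded values in the sample when judging reliability.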
Edexcel S1 2015 June Q2
13 marks Moderate -0.3
2. Paul believes there is a relationship between the value and the floor size of a house. He takes a random sample of 20 houses and records the value, \(\pounds v\), and the floor size, \(s \mathrm {~m} ^ { 2 }\) The data were coded using \(x = \frac { s - 50 } { 10 }\) and \(y = \frac { v } { 100000 }\) and the following statistics obtained. $$\sum x = 441.5 , \quad \sum y = 59.8 , \quad \sum x ^ { 2 } = 11261.25 , \quad \sum y ^ { 2 } = 196.66 , \quad \sum x y = 1474.1$$
  1. Find the value of \(S _ { x y }\) and the value of \(S _ { x x }\)
  2. Find the equation of the least squares regression line of \(y\) on \(x\) in the form \(y = a + b x\).
    The least squares regression line of \(v\) on \(s\) is \(v = c + d s\).
  3. Show that \(d = 1020\) to 3 significant figures and find the value of \(c\)
  4. Estimate the value of a house of floor size \(130 \mathrm {~m} ^ { 2 }\)
  5. Interpret the value \(d\).
    Paul wants to increase the value of his house. He decides to add an extension to increase the floor size by \(31 \mathrm {~m} ^ { 2 }\).
  6. Estimate the increase in the value of Paul's house after adding the extension.
Edexcel S1 2004 January Q1
13 marks Moderate -0.8
  1. An office has the heating switched on at 7.00 a.m. each morning. On a particular day, the temperature of the office, \(t { } ^ { \circ } \mathrm { C }\), was recorded \(m\) minutes after 7.00 a.m. The results are shown in the table below.

| \(m\) | 0 | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|---|
| \(t\) | 6.0 | 8.9 | 11.8 | 13.5 | 15.3 | 16.1 |

  1. Calculate the exact values of \(S _ { m t }\) and \(S _ { m m }\).
  2. Calculate the equation of the regression line of \(t\) on \(m\) in the form \(t = a + b m\).
  3. Use your equation to estimate the value of \(t\) at 7.35 a.m.
  4. State, giving a reason, whether or not you would use the regression equation in (b) to estimate the temperature
    1. at 9.00 a.m. that day,
    2. at 7.15 a.m. one month later.
OCR S1 2011 June Q7
6 marks Moderate -0.8
7 The diagram shows the results of an experiment involving some bivariate data. The least squares regression line of \(y\) on \(x\) for these results is also shown. \includegraphics[max width=\textwidth, alt={}, center]{48ffcd44-d933-40e0-818a-20d6db607298-5_748_919_390_612}
  1. Given that the least squares regression line of \(y\) on \(x\) is used for an estimation, state which of \(x\) or \(y\) is treated as the independent variable.
  2. Use the diagram to explain what is meant by 'least squares'.
  3. State, with a reason, the value of Spearman's rank correlation coefficient for these data.
  4. What can be said about the value of the product moment correlation coefficient for these data?
OCR MEI S2 2009 January Q1
20 marks Moderate -0.3
1 A researcher is investigating whether there is a relationship between the population size of cities and the average walking speed of pedestrians in the city centres. Data for the population size, \(x\) thousands, and the average walking speed of pedestrians, \(y \mathrm {~m} \mathrm {~s} ^ { - 1 }\), of eight randomly selected cities are given in the table below.

| \(x\) | 18 | 43 | 52 | 94 | 98 | 206 | 784 | 1530 |
|---|---|---|---|---|---|---|---|---|
| \(y\) | 1.15 | 0.97 | 1.26 | 1.35 | 1.28 | 1.42 | 1.32 | 1.64 |

  1. Calculate the value of Spearman's rank correlation coefficient.
  2. Carry out a hypothesis test at the \(5 \%\) significance level to determine whether there is any association between population size and average walking speed.
    In another investigation, the researcher selects a random sample of six adult males of particular ages and measures their maximum walking speeds. The data are shown in the table below, where \(t\) years is the age of the adult and \(w \mathrm {~m} \mathrm {~s} ^ { - 1 }\) is the maximum walking speed. Also shown are summary statistics and a scatter diagram on which the regression line of \(w\) on \(t\) is drawn.

    | \(t\) | 20 | 30 | 40 | 50 | 60 | 70 |
    |---|---|---|---|---|---|---|
    | \(w\) | 2.49 | 2.41 | 2.38 | 2.14 | 1.97 | 2.03 |

    $$n = 6 \quad \Sigma t = 270 \quad \Sigma w = 13.42 \quad \Sigma t ^ { 2 } = 13900 \quad \Sigma w ^ { 2 } = 30.254 \quad \Sigma t w = 584.6$$ \includegraphics[max width=\textwidth, alt={}, center]{77b97142-afb6-41d6-8fec-e982b7a7501b-2_728_1091_1379_529}
  3. Calculate the equation of the regression line of \(w\) on \(t\).
  4. (A) Use this equation to calculate an estimate of maximum walking speed of an 80-year-old male.
    (B) Explain why it might not be appropriate to use the equation to calculate an estimate of maximum walking speed of a 10-year-old male.
OCR MEI S2 2010 January Q1
19 marks Moderate -0.3
1 A pilot records the take-off distance for his light aircraft on runways at various altitudes. The data are shown in the table below, where \(a\) metres is the altitude and \(t\) metres is the take-off distance. Also shown are summary statistics for these data.

| \(a\) | 0 | 300 | 600 | 900 | 1200 | 1500 | 1800 |
|---|---|---|---|---|---|---|---|
| \(t\) | 635 | 704 | 776 | 836 | 923 | 1008 | 1105 |

$$n = 7 \quad \Sigma a = 6300 \quad \Sigma t = 5987 \quad \Sigma a ^ { 2 } = 8190000 \quad \Sigma t ^ { 2 } = 5288931 \quad \Sigma a t = 6037800$$
  1. Draw a scatter diagram to illustrate these data.
  2. State which of the two variables \(a\) and \(t\) is the independent variable and which is the dependent variable. Briefly explain your answer.
  3. Calculate the equation of the regression line of \(t\) on \(a\).
  4. Use the equation of the regression line to calculate estimates of the take-off distance for altitudes
    (A) 800 metres,
    (B) 2500 metres.
    Comment on the reliability of each of these estimates.
  5. Calculate the value of the residual for the data point where \(a = 1200\) and \(t = 923\), and comment on its sign.
OCR MEI S2 2013 January Q1
19 marks Standard +0.3
1 A manufacturer of playground safety tiles is testing a new type of tile. Tiles of various thicknesses are tested to estimate the maximum height at which people would be unlikely to sustain injury if they fell onto a tile. The results of the test are as follows.

| Thickness \(( t \mathrm {~mm} )\) | 20 | 40 | 60 | 80 | 100 |
|---|---|---|---|---|---|
| Maximum height \(( h \mathrm {~m} )\) | 0.72 | 1.09 | 1.62 | 1.97 | 2.34 |

  1. Draw a scatter diagram to illustrate these data.
  2. State which of the two variables is the independent variable, giving a reason for your answer.
  3. Calculate the equation of the regression line of maximum height on thickness.
  4. Use the equation of the regression line to calculate estimates of the maximum height for thicknesses of
    (A) 70 mm,
    (B) 120 mm.
    Comment on the reliability of each of these estimates.
  5. Calculate the value of the residual for the data point at which \(t = 40\).
  6. In a further experiment, the manufacturer tests a tile with a thickness of 200 mm and finds that the corresponding maximum height is 2.96 m. What can be said about the relationship between tile thickness and maximum height?
OCR MEI S2 2011 June Q1
18 marks Easy -1.2
1 An experiment is performed to determine the response of maize to nitrogen fertilizer. Data for the amount of nitrogen fertilizer applied, \(x \mathrm {~kg} / \mathrm { hectare }\), and the average yield of maize, \(y\) tonnes/hectare, in 5 experimental plots are given in the table below.

| \(x\) | 0 | 30 | 60 | 90 | 120 |
|---|---|---|---|---|---|
| \(y\) | 0.5 | 2.5 | 4.7 | 6.2 | 7.4 |

  1. Draw a scatter diagram to illustrate these data.
  2. Calculate the equation of the regression line of \(y\) on \(x\).
  3. Draw your regression line on your scatter diagram and comment briefly on its fit.
  4. Calculate the value of the residual for the data point where \(x = 30\) and \(y = 2.5\).
  5. Use the equation of the regression line to calculate estimates of average yield with nitrogen fertilizer applications of
    (A) \(45 \mathrm {~kg} / \mathrm { hectare }\),
    (B) \(150 \mathrm {~kg} /\) hectare.
  6. In a plot where \(150 \mathrm {~kg} /\) hectare of nitrogen fertilizer is applied, the average yield of maize is 8.7 tonnes/hectare. Comment on this result.
OCR MEI S2 2015 June Q1
17 marks Moderate -0.5
1 A random sample of wheat seedlings is planted and their growth is measured. The table shows their average growth, \(y \mathrm {~mm}\), at half-day intervals.

| Time \(t\) days | 0 | 0.5 | 1 | 1.5 | 2 | 2.5 | 3 |
|---|---|---|---|---|---|---|---|
| Average growth \(y \mathrm {~mm}\) | 0 | 7 | 21 | 33 | 45 | 56 | 62 |

  1. Draw a scatter diagram to illustrate these data.
  2. Calculate the equation of the regression line of \(y\) on \(t\).
  3. Calculate the value of the residual for the data point at which \(t = 2\).
  4. Use the equation of the regression line to calculate an estimate of the average growth after 5 days for wheat seedlings. Comment on the reliability of this estimate.
    It is suggested that it would be better to replace the regression line by a line which passes through the origin. You are given that the equation of such a line is \(y = a t\), where \(a = \frac { \sum y t } { \sum t ^ { 2 } }\).
  5. Find the equation of this line and plot the line on your scatter diagram.
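The two fits in this question, the ordinary regression line of \(y\) on \(t\) and the line through the origin with \(a = \frac { \sum y t } { \sum t ^ { 2 } }\), can be compared in a short Python sketch. It assumes the table reads \(t = 0, 0.5, 1, 1.5, 2, 2.5, 3\) and \(y = 0, 7, 21, 33, 45, 56, 62\); the variable names are illustrative.

```python
# Data as read from the table (assumed parse of the seven pairs).
t = [0, 0.5, 1, 1.5, 2, 2.5, 3]
y = [0, 7, 21, 33, 45, 56, 62]
n = len(t)

sum_ty = sum(ti * yi for ti, yi in zip(t, y))
sum_tt = sum(ti ** 2 for ti in t)

# Line through the origin, y = a_origin * t, using the formula given in the question.
a_origin = sum_ty / sum_tt

# Ordinary least squares line y = a + b t, for comparison.
s_tt = sum_tt - sum(t) ** 2 / n
s_ty = sum_ty - sum(t) * sum(y) / n
b = s_ty / s_tt
a = sum(y) / n - b * sum(t) / n

residual_at_2 = 45 - (a + b * 2)         # assumed convention: observed - predicted

print(f"through origin: y = {a_origin:.2f} t")
print(f"ordinary line:  y = {a:.1f} + {b:.1f} t, residual at t = 2: {residual_at_2:.1f}")
```

On this reading of the data the ordinary line is \(y = -1 + 22 t\) and the line through the origin has gradient \(490 / 22.75 \approx 21.5\). Either line gives an estimate for \(t = 5\) only by extrapolating beyond the observed range of 0 to 3 days.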
CAIE FP2 2015 June Q7
11 marks Standard +0.8
7 For a random sample of 10 observations of pairs of values \(( x , y )\), the equation of the regression line of \(y\) on \(x\) is \(y = 3.25 x - 4.27\). The sum of the ten \(x\) values is 15.6 and the product moment correlation coefficient for the sample is 0.56. Find the equation of the regression line of \(x\) on \(y\). Test, at the \(5 \%\) significance level, whether there is evidence of non-zero correlation between the variables.
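One route to the regression line of \(x\) on \(y\) here uses two standard facts: both regression lines pass through \(( \bar { x } , \bar { y } )\), and the product of the two gradients equals \(r ^ { 2 }\). The Python sketch below follows that route; it is an illustration under those assumptions, not the examiners' method, and the variable names are illustrative.

```python
# Given: y = 3.25 x - 4.27, n = 10, sum of x = 15.6, r = 0.56.
b_yx, c_yx = 3.25, -4.27
n, sum_x, r = 10, 15.6, 0.56

x_bar = sum_x / n                        # mean of x
y_bar = b_yx * x_bar + c_yx              # y on x passes through the means

b_xy = r ** 2 / b_yx                     # product of the gradients equals r squared
c_xy = x_bar - b_xy * y_bar              # x on y also passes through the means

print(f"means: ({x_bar:.2f}, {y_bar:.2f})")
print(f"x = {b_xy:.4f} y + {c_xy:.3f}")
```

This gives means of 1.56 and 0.80 and the line \(x \approx 0.0965 y + 1.483\). The significance test in the second half of the question needs a critical value from tables for \(n = 10\), which the sketch does not attempt.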
CAIE FP2 2019 June Q10
11 marks Standard +0.3
10 The values from a random sample of five pairs \(( x , y )\) taken from a bivariate distribution are shown below.

| \(x\) | 3 | 4 | 4 | 6 | 8 |
|---|---|---|---|---|---|
| \(y\) | 5 | 7 | \(q\) | 6 | 7 |

The equation of the regression line of \(x\) on \(y\) is given by \(x = \frac { 5 } { 4 } y + c\).
  1. Given that \(q\) is an integer, find its value.
  2. Find the value of \(c\).
  3. Find the value of the product moment correlation coefficient.
CAIE FP2 2011 November Q10 OR
Standard +0.8
The regression line of \(y\) on \(x\) obtained from a random sample of five pairs of values of \(x\) and \(y\) is $$y = 2.5 x - 1.5$$ The data is given in the following table.

| \(x\) | 1 | 2 | 4 | 2 | 6 |
|---|---|---|---|---|---|
| \(y\) | 2 | 3 | 6 | \(p\) | \(q\) |

  1. Show that \(p + q = 19\).
  2. Find the values of \(p\) and \(q\).
  3. Determine the value of the product moment correlation coefficient for this sample.
  4. It is later discovered that the values of \(x\) given in the table have each been divided by 10 (that is, the actual values are \(10,20,40,20,60\) ). Without any further calculation, state
    1. the equation of the actual regression line of \(y\) on \(x\),
    2. the value of the actual product moment correlation coefficient.
CAIE FP2 2012 November Q8
11 marks Moderate -0.8
8 The yield of a particular crop on a farm is thought to depend principally on the amount of sunshine during the growing season. For a random sample of 8 years, the average yield, \(y\) kilograms per square metre, and the average amount of sunshine per day, \(x\) hours, are recorded. The results are given in the following table.

| \(x\) | 12.2 | 10.4 | 5.2 | 6.3 | 11.8 | 10.0 | 14.2 | 2.3 |
|---|---|---|---|---|---|---|---|---|
| \(y\) | 15 | 9 | 10 | 7 | 8 | 11 | 12 | 6 |

$$\left[ \Sigma x = 72.4 , \Sigma x ^ { 2 } = 769.9 , \Sigma y = 78 , \Sigma y ^ { 2 } = 820 , \Sigma x y = 761.3 . \right]$$
  1. Find the equation of the regression line of \(y\) on \(x\).
  2. Find the product moment correlation coefficient.
  3. Test, at the \(5 \%\) significance level, whether there is positive correlation between the average yield and the average amount of sunshine per day.
CAIE FP2 2012 November Q10
10 marks Moderate -0.3
10 Delegates who travelled to a conference were asked to report the distance, \(y \mathrm {~km}\), that they had travelled and the time taken, \(x\) minutes. The values reported by a random sample of 8 delegates are given in the following table.
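
| Delegate | \(A\) | \(B\) | \(C\) | \(D\) | \(E\) | \(F\) | \(G\) | \(H\) |
|---|---|---|---|---|---|---|---|---|
| \(x\) | 90 | 46 | 72 | 98 | 52 | 65 | 105 | 82 |
| \(y\) | 90 | 55 | 69 | 85 | 45 | 50 | 110 | 74 |
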
$$\left[ \Sigma x = 610 , \Sigma x ^ { 2 } = 49682 , \Sigma y = 578 , \Sigma y ^ { 2 } = 45212 , \Sigma x y = 47136 . \right]$$ Find the equations of the regression lines of \(y\) on \(x\) and of \(x\) on \(y\). Estimate the time taken by a delegate who travelled 100 km to the conference. Calculate the product moment correlation coefficient for this sample.
CAIE FP2 2017 November Q11 OR
Moderate -0.3
A large number of people attended a course to improve the speed of their logical thinking. The times taken to complete a particular type of logic puzzle at the beginning of the course and at the end of the course are recorded for each person. The time taken, in minutes, at the beginning of the course is denoted by \(x\) and the time taken, in minutes, at the end of the course is denoted by \(y\). For a random sample of 9 people, the results are summarised as follows. $$\Sigma x = 45.3 \quad \Sigma x ^ { 2 } = 245.59 \quad \Sigma y = 40.5 \quad \Sigma y ^ { 2 } = 195.11 \quad \Sigma x y = 218.72$$ Ken attended the course, but his time to complete the puzzle at the beginning of the course was not recorded. His time to complete the puzzle at the end of the course was 4.2 minutes.
  1. By finding, showing all necessary working, the equation of a suitable regression line, find an estimate for the time that Ken would have taken to complete the puzzle at the beginning of the course.
    The values of \(x - y\) for the sample of 9 people are as follows. $$\begin{array} { l l l l l l l l l } 0.2 & 0.8 & 0.5 & 1.0 & 0.2 & 0.6 & 0.2 & 0.5 & 0.8 \end{array}$$ The organiser of the course believes that, on average, the time taken to complete the puzzle decreases between the beginning and the end of the course by more than 0.3 minutes.
  2. Stating suitable hypotheses and assuming a normal distribution, test the organiser's belief at the \(2 \frac { 1 } { 2 } \%\) significance level.