5.09b Least squares regression: concepts

144 questions

Edexcel S1 2023 June Q2
13 marks Moderate -0.3
Two students, Olive and Shan, collect data on the weight, \(w\) grams, and the tail length, \(t\) cm, of 15 mice. Olive summarised the data as follows: \(S_{tt} = 5.3173\) \quad \(\sum w^2 = 6089.12\) \quad \(\sum tw = 2304.53\) \quad \(\sum w = 297.8\) \quad \(\sum t = 114.8\)
  1. Calculate the value of \(S_{ww}\) and the value of \(S_{tw}\) [3]
  2. Calculate the value of the product moment correlation coefficient between \(w\) and \(t\) [2]
  3. Show that the equation of the regression line of \(w\) on \(t\) can be written as $$w = -16.7 + 4.77t$$ [3]
  4. Give an interpretation of the gradient of the regression line. [1]
  5. Explain why it would not be appropriate to use the regression line in part (c) to estimate the weight of a mouse with a tail length of 2 cm. [2]
Shan decided to code the data using \(x = t - 6\) and \(y = \frac{w}{2} - 5\)
  1. Write down the value of the product moment correlation coefficient between \(x\) and \(y\) [1]
  2. Write down an equation of the regression line of \(y\) on \(x\). You do not need to simplify your equation. [1]
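As a rough check on the first three parts, the summary statistics quoted above can be pushed through the standard formulas \(S_{ww} = \sum w^2 - (\sum w)^2/n\), \(S_{tw} = \sum tw - \sum t \sum w / n\), \(r = S_{tw}/\sqrt{S_{tt}S_{ww}}\), \(b = S_{tw}/S_{tt}\) and \(a = \bar{w} - b\bar{t}\). The sketch below is illustrative only; the variable names are assumptions of this sketch, not part of the question.

```python
import math

# Summary statistics quoted in the question (n = 15 mice).
n = 15
S_tt = 5.3173
sum_w2 = 6089.12
sum_tw = 2304.53
sum_w = 297.8
sum_t = 114.8

# Corrected sums of squares and products.
S_ww = sum_w2 - sum_w**2 / n
S_tw = sum_tw - sum_t * sum_w / n

# Product moment correlation coefficient between w and t.
r = S_tw / math.sqrt(S_tt * S_ww)

# Regression line of w on t, written as w = a + b t.
b = S_tw / S_tt
a = sum_w / n - b * sum_t / n

print(f"S_ww = {S_ww:.3f}, S_tw = {S_tw:.4f}")
print(f"r = {r:.3f}")
print(f"w = {a:.1f} + {b:.2f} t")
```

Shan's coding \(x = t - 6\), \(y = \frac{w}{2} - 5\) is linear with a positive scale factor, so it leaves the correlation coefficient unchanged.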
Edexcel S1 2010 January Q6
18 marks Moderate -0.8
The blood pressures, \(p\) mmHg, and the ages, \(t\) years, of 7 hospital patients are shown in the table below.
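| Patient | A | B | C | D | E | F | G |
|---|---|---|---|---|---|---|---|
| \(t\) | 42 | 74 | 48 | 35 | 56 | 26 | 60 |
| \(p\) | 98 | 130 | 120 | 88 | 182 | 80 | 135 |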
[\(\sum t = 341\), \(\sum p = 833\), \(\sum t^2 = 18181\), \(\sum p^2 = 106397\), \(\sum tp = 42948\)]
  1. Find \(S_{tt}\), \(S_{pp}\) and \(S_{tp}\) for these data. [4]
  2. Calculate the product moment correlation coefficient for these data. [3]
  3. Interpret the correlation coefficient. [1]
  4. On the graph paper on page 17, draw the scatter diagram of blood pressure against age for these 7 patients. [2]
  5. Find the equation of the regression line of \(p\) on \(t\). [4]
  6. Plot your regression line on your scatter diagram. [2]
  7. Use your regression line to estimate the blood pressure of a 40 year old patient. [2]
Edexcel S1 2011 June Q7
12 marks Moderate -0.8
A teacher took a random sample of 8 children from a class. For each child the teacher recorded the length of their left foot, \(f\) cm, and their height, \(h\) cm. The results are given in the table below.
| \(f\) | 23 | 26 | 23 | 22 | 27 | 24 | 20 | 21 |
|---|---|---|---|---|---|---|---|---|
| \(h\) | 135 | 144 | 134 | 136 | 140 | 134 | 130 | 132 |
(You may use \(\sum f = 186 \quad \sum h = 1085 \quad S_{ff} = 39.5 \quad S_{hh} = 139.875 \quad \sum fh = 25291\))
  1. Calculate \(S_{fh}\) [2]
  2. Find the equation of the regression line of \(h\) on \(f\) in the form \(h = a + bf\). Give the value of \(a\) and the value of \(b\) correct to 3 significant figures. [5]
  3. Use your equation to estimate the height of a child with a left foot length of 25 cm. [2]
  4. Comment on the reliability of your estimate in (c), giving a reason for your answer. [2]
The left foot length of the teacher is 25 cm.
  1. Give a reason why the equation in (b) should not be used to estimate the teacher's height. [1]
Edexcel S1 2002 November Q5
12 marks Standard +0.3
An agricultural researcher collected data, in appropriate units, on the annual rainfall \(x\) and the annual yield of wheat \(y\) at 8 randomly selected places. The data were coded using \(s = x - 6\) and \(t = y - 20\) and the following summations were obtained. \(\Sigma s = 48.5\), \(\Sigma t = 65.0\), \(\Sigma s^2 = 402.11\), \(\Sigma t^2 = 701.80\), \(\Sigma st = 523.23\)
  1. Find the equation of the regression line of \(t\) on \(s\) in the form \(t = p + qs\). [7]
  2. Find the equation of the regression line of \(y\) on \(x\) in the form \(y = a + bx\), giving \(a\) and \(b\) to 3 decimal places. [3]
The value of the product moment correlation coefficient between \(s\) and \(t\) is 0.943, to 3 decimal places.
  1. Write down the value of the product moment correlation coefficient between \(x\) and \(y\). Give a justification for your answer. [2]
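One way to see how the coded line converts back, sketched here with the symbols \(p\), \(q\), \(a\) and \(b\) from the question: substitute \(s = x - 6\) and \(t = y - 20\) into \(t = p + qs\).

$$y - 20 = p + q(x - 6) \quad \Longrightarrow \quad y = (20 + p - 6q) + qx$$

So \(b = q\) and \(a = 20 + p - 6q\). Because the coding only shifts each variable by a constant, the product moment correlation coefficient is not affected by it.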
Edexcel S1 Specimen Q4
14 marks Moderate -0.3
A drilling machine can run at various speeds, but in general the higher the speed the sooner the drill needs to be replaced. Over several months, 15 pairs of observations relating to speed, \(s\) revolutions per minute, and life of drill, \(h\) hours, are collected. For convenience the data are coded so that \(x = s - 20\) and \(y = h - 100\) and the following summations obtained. \(\Sigma x = 143; \Sigma y = 391; \Sigma x^2 = 2413; \Sigma y^2 = 22441; \Sigma xy = 484\).
  1. Find the equation of the regression line of \(h\) on \(s\). [10]
  2. Interpret the slope of your regression line. [2]
  3. Estimate the life of a drill revolving at 30 revolutions per minute. [2]
Edexcel S1 Q7
15 marks Moderate -0.3
The following data was collected for seven cars, showing their engine size, \(x\) litres, and their fuel consumption, \(y\) km per litre, on a long journey.
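| Car | \(A\) | \(B\) | \(C\) | \(D\) | \(E\) | \(F\) | \(G\) |
|---|---|---|---|---|---|---|---|
| \(x\) | 0.95 | 1.20 | 1.37 | 1.76 | 2.25 | 2.50 | 2.875 |
| \(y\) | 21.3 | 17.2 | 15.5 | 19.1 | 14.7 | 11.4 | 9.0 |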
\(\sum x = 12.905\), \(\sum x^2 = 26.8951\), \(\sum y = 108.2\), \(\sum y^2 = 1781.64\), \(\sum xy = 183.176\).
  1. Calculate the equation of the regression line of \(x\) on \(y\), expressing your answer in the form \(x = ay + b\). [6 marks]
  2. Calculate the product moment correlation coefficient between \(y\) and \(x\) and give a brief interpretation of its value. [4 marks]
  3. Use the equation of the regression line to estimate the value of \(x\) when \(y = 12\). State, with a reason, how accurate you would expect this estimate to be. [3 marks]
  4. Comment on the use of the line to find values of \(x\) as \(y\) gets very small. [2 marks]
Edexcel S1 Q5
13 marks Standard +0.3
The following marks out of 50 were given by two judges to the contestants in a talent contest:
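| Contestant | \(A\) | \(B\) | \(C\) | \(D\) | \(E\) | \(F\) | \(G\) | \(H\) |
|---|---|---|---|---|---|---|---|---|
| Judge 1 (\(x\)) | 43 | 32 | 40 | 21 | 47 | 11 | 29 | 38 |
| Judge 2 (\(y\)) | 39 | 25 | 40 | 22 | 36 | 13 | 27 | 32 |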
Given that \(\sum x = 261\), \(\sum x^2 = 9529\) and \(\sum xy = 8373\),
  1. calculate the product-moment correlation coefficient between the two judges' marks [5 marks]
  2. Find an equation of the regression line of \(x\) on \(y\). [4 marks]
Contestant \(I\) was awarded 45 marks by Judge 2.
  1. Estimate the mark that this contestant would have received from Judge 1. [2 marks]
  2. Comment, with explanation, on the probable accuracy of your answer. [2 marks]
Edexcel S1 Q6
21 marks Standard +0.3
A missile was fired vertically upwards and its height above ground level, \(h\) metres, was found at various times \(t\) seconds after it was released. The results are given in the following table:
| \(t\) | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| \(h\) | 68 | 126 | 174 | 216 | 240 | 252 | 266 |
It is thought that this data can be fitted to the formula \(h = pt - qt^2\).
  1. Show that this equation can be written as \(\frac{h}{t} = p - qt\). [1 mark]
  2. Plot a scatter diagram of \(\frac{h}{t}\) against \(t\). [5 marks]
Given that \(\sum h = 1342\), \(\sum \frac{h}{t} = 371\) and \(\sum \frac{h^2}{t^2} = 20385\),
  1. find the equation of the regression line of \(\frac{h}{t}\) on \(t\) and hence write down the values of \(p\) and \(q\). [8 marks]
  2. Use your equation to find the value of \(h\) when \(t = 10\). Comment on the implication of your answer. [3 marks]
  3. Find the product-moment correlation coefficient between \(\frac{h}{t}\) and \(t\) and state the significance of its value. [4 marks]
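The linearisation in this question can be sketched numerically. The block below treats \(z = h/t\) as the response, fits the least squares line of \(z\) on \(t\), and reads off \(p\) as the intercept and \(q\) as minus the gradient. The name `z` and the other variable names are assumptions of this sketch; the sums are computed from the table and agree with the totals quoted in the question.

```python
import math

t = [1, 2, 3, 4, 5, 6, 7]
h = [68, 126, 174, 216, 240, 252, 266]
n = len(t)

# Linearised response z = h / t, so the model h = p t - q t^2 becomes z = p - q t.
z = [hi / ti for hi, ti in zip(h, t)]

sum_t = sum(t)
sum_t2 = sum(ti**2 for ti in t)
sum_z = sum(z)                                 # 371, as quoted
sum_z2 = sum(zi**2 for zi in z)                # 20385, as quoted
sum_tz = sum(ti * zi for ti, zi in zip(t, z))  # equals sum of h = 1342

S_tt = sum_t2 - sum_t**2 / n
S_tz = sum_tz - sum_t * sum_z / n
S_zz = sum_z2 - sum_z**2 / n

# Regression line of z on t.
slope = S_tz / S_tt
intercept = sum_z / n - slope * sum_t / n
p, q = intercept, -slope

# Extrapolated height at t = 10 and the correlation between z and t.
h_at_10 = p * 10 - q * 10**2
r = S_tz / math.sqrt(S_tt * S_zz)

print(f"p = {p:.2f}, q = {q:.2f}")
print(f"h(10) = {h_at_10:.1f}")
print(f"r = {r:.4f}")
```

Note that \(t = 10\) lies outside the observed range of 1 to 7 seconds, so the value printed for `h_at_10` is an extrapolation.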
Edexcel S1 Q6
15 marks Standard +0.3
The marks out of 75 obtained by a group of ten students in their first and second Statistics modules were as follows:
| Student | \(A\) | \(B\) | \(C\) | \(D\) | \(E\) | \(F\) | \(G\) | \(H\) | \(I\) | \(J\) |
|---|---|---|---|---|---|---|---|---|---|---|
| Module 1 \((x)\) | 54 | 33 | 42 | 71 | 60 | 27 | 39 | 46 | 59 | 64 |
| Module 2 \((y)\) | 50 | 22 | 44 | 58 | 42 | 19 | 35 | 46 | 55 | 60 |
  1. Find \(\sum x\) and \(\sum y\). [2 marks]
Given that \(\sum x^2 = 26353\) and \(\sum xy = 22991\),
  1. obtain the equation of the regression line of \(y\) on \(x\). [5 marks]
  2. Estimate the Module 2 result of a student whose mark in Module 1 was (i) 65, (ii) 5. Explain why one of these estimates is less reliable than the other. [4 marks]
The equation of the regression line of \(x\) on \(y\) is \(x = 0.921y + 9.81\).
  1. Deduce the product moment correlation coefficient between \(x\) and \(y\), and briefly interpret its value. [4 marks]
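The last part relies on the identity \(r^2 = b_{yx} \times b_{xy}\), where \(b_{yx}\) and \(b_{xy}\) are the gradients of the two regression lines. A minimal sketch under that assumption follows; the names `b_yx`, `a_yx` and `b_xy` are hypothetical labels chosen for this illustration.

```python
import math

x = [54, 33, 42, 71, 60, 27, 39, 46, 59, 64]  # Module 1 marks
y = [50, 22, 44, 58, 42, 19, 35, 46, 55, 60]  # Module 2 marks
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_x2 = 26353  # given in the question
sum_xy = 22991  # given in the question

S_xx = sum_x2 - sum_x**2 / n
S_xy = sum_xy - sum_x * sum_y / n

# Regression line of y on x, written as y = a_yx + b_yx x.
b_yx = S_xy / S_xx
a_yx = sum_y / n - b_yx * sum_x / n

# Gradient of the x-on-y line quoted in the question.
b_xy = 0.921

# r^2 is the product of the two gradients; both are positive here,
# so the positive square root is taken.
r = math.sqrt(b_yx * b_xy)

print(f"y = {a_yx:.2f} + {b_yx:.3f} x")
print(f"r = {r:.3f}")
```

A Module 1 mark of 5 lies well below the smallest observed mark (27), which is why that estimate is the less reliable of the two.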
Edexcel S1 Q6
13 marks Standard +0.3
Two variables \(x\) and \(y\) are such that, for a sample of ten pairs of values, $$\sum x = 104.5, \quad \sum y = 113.6, \quad \sum x^2 = 1954.1, \quad \sum y^2 = 2100.6.$$ The regression line of \(x\) on \(y\) has gradient 0.8. Find
  1. \(\sum xy\), [4 marks]
  2. the equation of the regression line of \(y\) on \(x\), [5 marks]
  3. the product moment correlation coefficient between \(y\) and \(x\). [3 marks]
  4. Describe the kind of correlation indicated by your answer to (c). [1 mark]
OCR S1 2010 January Q6
7 marks Standard +0.3
  1. A student calculated the values of the product moment correlation coefficient, \(r\), and Spearman's rank correlation coefficient, \(r_s\), for two sets of bivariate data, \(A\) and \(B\). His results are given below. $$A: \quad r = 0.9 \text{ and } r_s = 1$$ $$B: \quad r = 1 \quad \text{and } r_s = 0.9$$ With the aid of a diagram where appropriate, explain why the student's results for \(A\) could both be correct but his results for \(B\) cannot both be correct. [3]
  2. An old research paper has been partially destroyed. The surviving part of the paper contains the following incomplete information about some bivariate data from an experiment. \includegraphics{figure_6} The mean of \(x\) is 4.5. The equation of the regression line of \(y\) on \(x\) is \(y = 2.4x + 3.7\). The equation of the regression line of \(x\) on \(y\) is \(x = 0.40y\) + [missing constant]. Calculate the missing constant at the end of the equation of the second regression line. [4]
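A sketch of the reasoning for the second part, assuming only the standard fact that both regression lines pass through the mean point \((\bar{x}, \bar{y})\). The symbol \(c\) for the missing constant is introduced here for illustration.

$$\bar{y} = 2.4(4.5) + 3.7 = 14.5$$

$$4.5 = 0.40(14.5) + c \quad \Longrightarrow \quad c = 4.5 - 5.8 = -1.3$$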
OCR S1 2013 January Q3
12 marks Moderate -0.3
The Gross Domestic Product per Capita (GDP), \(x\) dollars, and the Infant Mortality Rate per thousand (IMR), \(y\), of 6 African countries were recorded and summarised as follows. \(n = 6\) \quad \(\sum x = 7000\) \quad \(\sum x^2 = 8700000\) \quad \(\sum y = 456\) \quad \(\sum y^2 = 36262\) \quad \(\sum xy = 509900\)
  1. Calculate the equation of the regression line of \(y\) on \(x\) for these 6 countries. [4]
The original data were plotted on a scatter diagram and the regression line of \(y\) on \(x\) was drawn, as shown below. \includegraphics{figure_3}
  1. The GDP for another country, Tanzania, is 1300 dollars. Use the regression line in the diagram to estimate the IMR of Tanzania. [1]
  2. The GDP for Nigeria is 2400 dollars. Give two reasons why the regression line is unlikely to give a reliable estimate for the IMR for Nigeria. [2]
  3. The actual value of the IMR for Tanzania is 96. The data for Tanzania (\(x = 1300, y = 96\)) is now included with the original 6 countries. Calculate the value of the product moment correlation coefficient, \(r\), for all 7 countries. [4]
  4. The IMR is now redefined as the infant mortality rate per hundred instead of per thousand, and the value of \(r\) is recalculated for all 7 countries. Without calculation state what effect, if any, this would have on the value of \(r\) found in part (iv). [1]
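For the part that adds Tanzania to the original 6 countries, one approach is to update each summary total with the new point and then apply the usual formula for \(r\). The sketch below does this; the variable names are assumptions of this illustration.

```python
import math

# Summary statistics for the original 6 countries.
n = 6
sum_x, sum_x2 = 7000, 8_700_000
sum_y, sum_y2 = 456, 36_262
sum_xy = 509_900

# Include Tanzania (x = 1300, y = 96) by updating every total.
x_new, y_new = 1300, 96
n += 1
sum_x += x_new
sum_x2 += x_new**2
sum_y += y_new
sum_y2 += y_new**2
sum_xy += x_new * y_new

S_xx = sum_x2 - sum_x**2 / n
S_yy = sum_y2 - sum_y**2 / n
S_xy = sum_xy - sum_x * sum_y / n

r = S_xy / math.sqrt(S_xx * S_yy)
print(f"r = {r:.3f}")
```

Rescaling \(y\) from per thousand to per hundred divides every \(y\) value by 10, a linear change of scale that leaves \(r\) unchanged.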
OCR S1 2009 June Q3
8 marks Moderate -0.3
In an agricultural experiment, the relationship between the amount of water supplied, \(x\) units, and the yield, \(y\) units, was investigated. Six values of \(x\) were chosen and for each value of \(x\) the corresponding value of \(y\) was measured. The results are shown in the table.
| \(x\) | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| \(y\) | 3 | 6 | 8 | 8 | 11 | 10 |
These results, together with the regression line of \(y\) on \(x\), are plotted on the graph. \includegraphics{figure_1}
  1. Give a reason why the regression line of \(x\) on \(y\) is not suitable in this context. [1]
  2. Explain the significance, for the regression line of \(y\) on \(x\), of the distances shown by the vertical dotted lines in the diagram. [2]
  3. Calculate the value of the product moment correlation coefficient, \(r\). [3]
  4. Comment on your value of \(r\) in relation to the diagram. [2]
OCR S1 2010 June Q3
10 marks Moderate -0.8
  1. Some values, \((x, y)\), of a bivariate distribution are plotted on a scatter diagram and a regression line is to be drawn. Explain how to decide whether the regression line of \(y\) on \(x\) or the regression line of \(x\) on \(y\) is appropriate. [2]
  2. In an experiment the temperature, \(x\) °C, of a rod was gradually increased from 0 °C, and the extension, \(y\), was measured nine times at 50 °C intervals. The results are summarised below. \(n = 9\) \quad \(\Sigma x = 1800\) \quad \(\Sigma y = 14.4\) \quad \(\Sigma x^2 = 510000\) \quad \(\Sigma y^2 = 32.6416\) \quad \(\Sigma xy = 4080\)
    1. Show that the gradient of the regression line of \(y\) on \(x\) is 0.008 and find the equation of this line. [4]
    2. Use your equation to estimate the temperature when the extension is 2.5 mm. [1]
    3. Use your equation to estimate the extension for a temperature of \(-50\) °C. [1]
    4. Comment on the meaning and the reliability of your estimate in part (c). [2]
Edexcel S1 Q6
17 marks Moderate -0.3
Penshop have stores selling stationery in each of 6 towns. The population, \(P\), in tens of thousands and the monthly turnover, \(T\), in thousands of pounds for each of the shops are as recorded below.

| Town | Abberton | Bember | Claster | Deller | Edgeton | Figland |
|---|---|---|---|---|---|---|
| \(P\) (10 000s) | 3.2 | 7.6 | 5.2 | 9.0 | 8.1 | 4.8 |
| \(T\) (£000s) | 11.1 | 12.4 | 13.3 | 19.3 | 17.9 | 11.8 |
  1. Represent these data on a scatter diagram with \(T\) on the vertical axis. [4]
    1. Which town's shop might appear to be underachieving given the populations of the towns?
    2. Suggest two other factors that might affect each shop's turnover. [3]
You may assume that $$\Sigma P = 37.9, \quad \Sigma T = 85.8, \quad \Sigma P^2 = 264.69, \quad \Sigma T^2 = 1286, \quad \Sigma PT = 574.25.$$
  1. Find the equation of the regression line of \(T\) on \(P\). [7]
  2. Estimate the monthly turnover that might be expected if a shop were opened in Gratton, a town with a population of 68 000. [2]
  3. Why might the management of Penshop be reluctant to use the regression line to estimate the monthly turnover they could expect if a shop were opened in Haggin, a town with a population of 172 000? [1]
Edexcel S1 Q4
12 marks Standard +0.3
The owner of a mobile burger-bar believes that hot weather reduces his sales. To investigate the effect on his business he collected data on his daily sales, \(£P\), and the maximum temperature, \(T\)°C, on each of 20 days. He then coded the data, using \(x = T - 20\) and \(y = P - 300\), and calculated the summary statistics given below. $$\Sigma x = 57, \quad \Sigma y = 2222, \quad \Sigma x^2 = 401, \quad \Sigma y^2 = 305576, \quad \Sigma xy = 3871.$$
  1. Find an equation of the regression line of \(P\) on \(T\). [9 marks]
The owner of the bar doesn't believe it is profitable for him to run the bar if he takes less than £460 in a day.
  1. According to your regression line at what maximum daily temperature, to the nearest degree Celsius, does it become unprofitable for him to run the bar? [3 marks]
OCR MEI S2 2007 January Q1
18 marks Moderate -0.8
In a science investigation into energy conservation in the home, a student is collecting data on the time taken for an electric kettle to boil as the volume of water in the kettle is varied. The student's data are shown in the table below, where \(v\) litres is the volume of water in the kettle and \(t\) seconds is the time taken for the kettle to boil (starting with the water at room temperature in each case). Also shown are summary statistics and a scatter diagram on which the regression line of \(t\) on \(v\) is drawn.
| \(v\) | 0.2 | 0.4 | 0.6 | 0.8 | 1.0 |
|---|---|---|---|---|---|
| \(t\) | 44 | 78 | 114 | 156 | 172 |
\(n = 5\), \(\Sigma v = 3.0\), \(\Sigma t = 564\), \(\Sigma v^2 = 2.20\), \(\Sigma vt = 405.2\). \includegraphics{figure_1}
  1. Calculate the equation of the regression line of \(t\) on \(v\), giving your answer in the form \(t = a + bv\). [5]
  2. Use this equation to predict the time taken for the kettle to boil when the amount of water which it contains is
    1. 0.5 litres,
    2. 1.5 litres.
    Comment on the reliability of each of these predictions. [4]
  3. In the equation of the regression line found in part (i), explain the role of the coefficient of \(v\) in the relationship between time taken and volume of water. [2]
  4. Calculate the values of the residuals for \(v = 0.8\) and \(v = 1.0\). [4]
  5. Explain how, on a scatter diagram with the regression line drawn accurately on it, a residual could be measured and its sign determined. [3]
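The regression and residual calculations in this question can be sketched as follows, taking a residual to be the observed value minus the value predicted by the line. The helper `predict` and the other names are hypothetical, introduced only for this illustration.

```python
v = [0.2, 0.4, 0.6, 0.8, 1.0]   # volume in litres
t = [44, 78, 114, 156, 172]     # time to boil in seconds

# Summary statistics quoted in the question.
n = 5
sum_v, sum_t = 3.0, 564
sum_v2, sum_vt = 2.20, 405.2

S_vv = sum_v2 - sum_v**2 / n
S_vt = sum_vt - sum_v * sum_t / n

# Regression line of t on v, written as t = a + b v.
b = S_vt / S_vv
a = sum_t / n - b * sum_v / n


def predict(volume):
    """Predicted boiling time (seconds) for a given volume (litres)."""
    return a + b * volume


# Residual = observed t minus predicted t, keyed by volume.
residuals = {vi: ti - predict(vi) for vi, ti in zip(v, t)}

print(f"t = {a:.1f} + {b:.0f} v")
print(f"predicted t at 0.5 l: {predict(0.5):.1f}")
print(f"predicted t at 1.5 l: {predict(1.5):.1f}")
print(f"residual at 0.8 l: {residuals[0.8]:.1f}")
print(f"residual at 1.0 l: {residuals[1.0]:.1f}")
```

A positive residual corresponds to a data point lying above the line and a negative residual to one lying below it. The prediction at 1.5 litres is an extrapolation beyond the observed volumes.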
WJEC Unit 2 2018 June Q05
6 marks Easy -1.2
A baker is aware that the pH of his sourdough, \(y\), and the hydration, \(x\), affect the taste and texture of the final product. The hydration is measured in ml of water per 100 g of flour (ml/100 g). The baker researches how the pH of his sourdough changes as the hydration changes. The results of his research are shown in the diagram below. \includegraphics{figure_5}
  1. Describe the relationship between pH and hydration. [2]
  2. The equation of the regression line for \(y\) on \(x\) is $$y = 5.4 - 0.02x.$$
    1. Interpret the gradient and intercept of the regression line in this context.
    2. Estimate the pH of the sourdough when the hydration is 20 ml/100 g. Comment on the reliability of this estimate. [4]
WJEC Further Unit 2 2018 June Q7
7 marks Moderate -0.3
A university professor conducted some research into factors that affect job satisfaction. The four factors considered were Interesting work, Good wages, Job security and Appreciation of work done. The professor interviewed workers at 14 different companies and asked them to rate their companies on each of the factors. The workers' ratings were averaged to give each company a score out of 5 on each factor. Each company was also given a score out of 100 for Job satisfaction. The following graph shows the part of the research concerning Job Satisfaction versus Interesting work. \includegraphics{figure_2}
  1. Calculate the equation of the least squares regression line of Job satisfaction (\(y\)) on Interesting work (\(x\)), given the following summary statistics. \(\sum x = 46.2\), \quad \(\sum y = 898\), \quad \(S_{xx} = 3.48\), \quad \(S_{xy} = 49.45\), \quad \(S_{yy} = 1437.714\), \quad \(n = 14\) [5]
  2. Give two reasons why it would be inappropriate for the professor to use this equation to calculate the score for Interesting work from a Job satisfaction score of 90. [2]