5.09b Least squares regression: concepts

144 questions

WJEC Further Unit 2 2019 June Q6
6 marks Moderate -0.3
6. The University of Arizona surveyed a large number of households. One purpose of the survey was to determine if annual household income could be predicted from size of family home. The graph of Annual household income, \(y\), versus Size of family home, \(x\), is shown below. \includegraphics[max width=\textwidth, alt={}, center]{4ecf99c5-c4b3-41b7-a8df-a7c2ca7fcd6a-5_616_1257_566_365}
  1. State the limitations of using the regression line above with reference to the scatter diagram. The data for size of family homes between 2000 and 3000 square feet are shown in the diagram below. \includegraphics[max width=\textwidth, alt={}, center]{4ecf99c5-c4b3-41b7-a8df-a7c2ca7fcd6a-5_652_1244_1516_360} Summary statistics for these data are as follows. $$\begin{array} { r c c } \sum x = 93160 & \sum y = 3907142 & n = 37 \\ S _ { x x } = 2869673.03 & S _ { y y } = 44312797167 & S _ { x y } = 348512820.6 \end{array}$$
  2. Calculate the equation of the least squares regression line to predict Annual household income from Size of family home for these data.
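A minimal sketch of the calculation that questions of this type call for, assuming the standard identities \(b = S_{xy} / S_{xx}\) and \(a = \bar{y} - b \bar{x}\) for the line \(y = a + bx\). The function name `regression_line` and its parameter names are illustrative only and do not come from the question paper.

```python
def regression_line(n, sum_x, sum_y, s_xx, s_xy):
    """Return (a, b) for the least squares line y = a + b*x.

    Hypothetical helper: works from summary statistics only.
    """
    b = s_xy / s_xx                    # gradient
    a = sum_y / n - b * (sum_x / n)    # the line passes through (x-bar, y-bar)
    return a, b


# Summary statistics quoted in the question (homes of 2000 to 3000 square feet).
a, b = regression_line(n=37, sum_x=93160, sum_y=3907142,
                       s_xx=2869673.03, s_xy=348512820.6)
print(f"y = {a:.0f} + {b:.1f} x")
```

The same helper applies to any of the questions in this section that supply \(n\), the two sums, \(S_{xx}\) and \(S_{xy}\).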
WJEC Further Unit 2 2022 June Q7
7 marks Moderate -0.3
7. Data from a large dataset shows the percentage of children enrolled in secondary education and the percentage of the adult population who are literate. The following graphs show data from 30 randomly selected regions from each of the Arab World, Africa and Asia. In each case, the least squares regression line of '\% Literacy' on '\% Enrolled in Secondary Education' is shown. \includegraphics[max width=\textwidth, alt={}, center]{77fd7ad7-f5a3-4947-afc6-e5ef45bef7a8-6_682_1200_584_395} \begin{figure}[h]
\captionsetup{labelformat=empty} \caption{Africa} \includegraphics[alt={},max width=\textwidth]{77fd7ad7-f5a3-4947-afc6-e5ef45bef7a8-6_623_1191_1548_397}
\end{figure} \includegraphics[max width=\textwidth, alt={}, center]{77fd7ad7-f5a3-4947-afc6-e5ef45bef7a8-7_665_1200_331_434}
  1. Calculate the equation of the least squares regression line of '\% Literacy' ( \(y\) ) on '\% Enrolled in Secondary Education' ( \(x\) ) for Asia, given the following summary statistics. $$\begin{array} { l l l } \sum x = 2850.836 & \sum y = 2738.656 & S _ { x x } = 88.42142 \\ S _ { y y } = 204.733 & S _ { x y } = 96.60984 & n = 30 \end{array}$$
  2. The Arab World, Africa and Asia each contain a region where \(70 \%\) are enrolled in secondary education. The three regression lines are used to estimate the corresponding \% Literacy. Which of these estimates is likely to be the most reliable? Clearly explain your reasoning.
WJEC Further Unit 2 2024 June Q4
12 marks Standard +0.8
4. An author poses the following question: Does using cash for transactions affect people's financial behaviour?
She collects data on 'Cash transactions as a \% of all transactions' and 'Household debt as a \(\%\) of net disposable income' from a random sample of 25 countries. The table below shows the data she collected. There are missing values, \(p\) and \(q\), for Malta and Denmark respectively.
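\begin{tabular}{|l|c|c|l|c|c|}
\hline
Country & Cash transactions as a \% of all transactions \(\boldsymbol { x }\) & Household debt as a \% of net disposable income \(\boldsymbol { y }\) & Country & Cash transactions as a \% of all transactions \(\boldsymbol { x }\) & Household debt as a \% of net disposable income \(\boldsymbol { y }\) \\
\hline
Malta & 92 & \(p\) & France & 68 & 120 \\
Mexico & 90 & \(-14\) & Luxembourg & 64 & 177 \\
Greece & 88 & 107 & Belgium & 63 & 113 \\
Spain & 87 & 110 & Finland & 54 & 137 \\
Italy & 86 & 87 & Estonia & 48 & 82 \\
Austria & 85 & 91 & The Netherlands & 45 & 247 \\
Portugal & 81 & 131 & UK & 42 & 147 \\
Slovenia & 80 & 56 & Australia & 37 & 214 \\
Germany & 80 & 95 & USA & 32 & 109 \\
Ireland & 79 & 154 & Sweden & 20 & 187 \\
Slovakia & 78 & 74 & South Korea & 14 & 182 \\
Lithuania & 75 & 46 & Denmark & \(q\) & 261 \\
Latvia & 71 & 43 & & & \\
\hline
\end{tabular}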
The summary statistics and scatter diagram below are for the other 23 countries. \begin{figure}[h]
\captionsetup{labelformat=empty} \caption{Household debt versus Cash transactions} \includegraphics[alt={},max width=\textwidth]{1538fa56-5b61-40ec-bb02-cf1ed9da5eb0-13_664_1296_511_379}
\end{figure} $$\begin{gathered} \sum x = 1467 \quad \sum y = 2695 \quad \sum x ^ { 2 } = 105073 \quad S _ { x x } = 11503.91304 \quad S _ { y y } = 78669.30435 \\ \sum y ^ { 2 } = 394453 \quad \sum x y = 152999 \quad S _ { x y } = - 18895.13043 \end{gathered}$$
  1. Using the summary statistics for the 23 countries, calculate and interpret Pearson's product moment correlation coefficient.
  2. Calculate the equation of the least squares regression line of Household debt as a \% of net disposable income \(( y )\) on Cash transactions as a \% of all transactions \(( x )\). The regression line of \(x\) on \(y\) is given below. $$x = - 0.24 y + 91.92$$
  3. By selecting the appropriate regression line in each case, estimate the values of \(p\) and \(q\) in the table.
  4. Comment on the reliability of your answers in part (c).
  5. Interpret the negative value of \(y\) for Mexico.
Edexcel FS2 AS 2018 June Q1
11 marks Moderate -0.3
  1. The scores achieved on a maths test, \(m\), and the scores achieved on a physics test, \(p\), by 16 students are summarised below.
$$\sum m = 392 \quad \sum p = 254 \quad \sum p ^ { 2 } = 4748 \quad \mathrm {~S} _ { m m } = 1846 \quad \mathrm {~S} _ { m p } = 1115$$
  1. Find the product moment correlation coefficient between \(m\) and \(p\)
  2. Find the equation of the linear regression line of \(p\) on \(m\) Figure 1 shows a plot of the residuals. \begin{figure}[h]
    \includegraphics[alt={},max width=\textwidth]{0fcb4d83-9763-4edd-8006-93f75a44c596-02_808_1222_997_429} \captionsetup{labelformat=empty} \caption{Figure 1}
    \end{figure}
  3. Calculate the residual sum of squares (RSS). For the person who scored 30 marks on the maths test,
  4. find the score on the physics test. The data for the person who scored 20 on the maths test is removed from the data set.
  5. Suggest a reason why. The product moment correlation coefficient between \(m\) and \(p\) is now recalculated for the remaining 15 students.
  6. Without carrying out any further calculations, suggest how you would expect this recalculated value to compare with your answer to part (a).
    Give a reason for your answer.
Edexcel FS2 2019 June Q2
10 marks Standard +0.3
2 A large field of wheat is split into 8 plots of equal area. Each plot is treated with a different amount of fertiliser, \(f\) grams \(/ \mathrm { m } ^ { 2 }\). The yield of wheat, \(w\) tonnes, from each plot is recorded. The results are summarised below. $$\sum f = 28 \quad \sum w = 303 \quad \sum w ^ { 2 } = 13447 \quad \mathrm {~S} _ { f f } = 42 \quad \mathrm {~S} _ { f w } = 269.5$$
  1. Calculate the product moment correlation coefficient between \(f\) and \(w\)
  2. Interpret the value of your product moment correlation coefficient.
  3. Find the equation of the regression line of \(w\) on \(f\) in the form \(w = a + b f\)
  4. Using your equation, estimate the decrease in yield when the amount of fertiliser decreases by 0.5 grams \(/ \mathrm { m } ^ { 2 }\) The residuals of the data recorded are calculated and plotted on the graph below. \includegraphics[max width=\textwidth, alt={}, center]{67df73d4-6ce4-45f7-8a69-aa94292ea814-04_1232_1294_1169_301}
  5. With reference to this graph, comment on the suitability of the model you found in part (c).
  6. Suggest how you might be able to refine your model.
Edexcel FS2 2021 June Q4
10 marks Standard +0.3
  1. A researcher is investigating the relationship between elevation, \(x\) metres, and annual mean temperature, \(t ^ { \circ } \mathrm { C }\).
From a random sample of 20 weather stations in Switzerland, the following results were obtained. $$\mathrm { S } _ { x x } = 8820655 \quad \mathrm {~S} _ { t t } = 444.7 \quad \sum x = 28130 \quad \sum t = 94.62$$ The product moment correlation coefficient for these data is found to be \(-0.959\)
  1. Interpret the value of this correlation coefficient.
  2. Show that the equation of the regression line of \(t\) on \(x\) can be written as $$t = 14.3 - 0.00681 x$$ The random variable \(W\) represents the elevations of the weather stations in kilometres.
  3. Write down the equation of the regression line of \(t\) on \(w\) for these 20 weather stations in the form \(t = a + b w\)
  4. Show that the residual sum of squares (RSS) for the model for \(t\) and \(x\) is 35.7 correct to one decimal place. One of the weather stations in the sample had a recorded elevation of 1100 metres and an annual mean temperature of \(1.4 ^ { \circ } \mathrm { C }\)
    1. Calculate this weather station's contribution to the residual sum of squares. Give your answer as a percentage
    2. Comment on the data for this weather station in light of your answer to part (e)(i).
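When \(S_{xt}\) is not given, the gradient and the residual sum of squares can still be recovered from the correlation coefficient. The sketch below assumes the standard identities \(b = r \sqrt{S_{tt} / S_{xx}}\) and \(\mathrm{RSS} = S_{tt}(1 - r^2)\); the function names are illustrative and not part of the question.

```python
import math


def slope_from_r(r, s_xx, s_tt):
    """Gradient of the regression line of t on x, from r and the S values."""
    return r * math.sqrt(s_tt / s_xx)


def rss_from_r(r, s_tt):
    """Residual sum of squares for the regression of t on x."""
    return s_tt * (1 - r ** 2)


# Values quoted in the question for the 20 Swiss weather stations.
n = 20
s_xx, s_tt = 8820655, 444.7
sum_x, sum_t = 28130, 94.62
r = -0.959

b = slope_from_r(r, s_xx, s_tt)
a = sum_t / n - b * sum_x / n      # intercept: line passes through the means
rss = rss_from_r(r, s_tt)

print(f"t = {a:.1f} + ({b:.5f}) x")
print(f"RSS = {rss:.1f}")
```

Because \(r\) is quoted to three decimal places only, values computed this way are approximate; they agree with the figures the question asks to be shown at the stated precision.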
Edexcel FS2 2022 June Q1
7 marks Standard +0.3
  1. Kwame is investigating a possible relationship between average March temperature, \(t ^ { \circ } \mathrm { C }\), and tea yield, \(y \mathrm {~kg} /\) hectare, for tea grown in a particular location. He uses 30 years of past data to produce the following summary statistics for a linear regression model, with tea yield as the dependent variable.
$$\begin{aligned} & \text { Residual Sum of Squares } ( \mathrm { RSS } ) = 1666567 \quad \mathrm {~S} _ { t t } = 52.0 \quad \mathrm {~S} _ { y y } = 1774155 \\ & \text { least squares regression line: } \quad \text { gradient } = 45.5 \quad y \text {-intercept } = 2080 \end{aligned}$$
  1. Use the regression model to predict the tea yield for an average March temperature of \(20 ^ { \circ } \mathrm { C }\) He also produces the following residual plot for the data. \includegraphics[max width=\textwidth, alt={}, center]{d139840b-16ec-42ce-8501-f79c263c8017-02_663_880_868_589}
  2. Explain what you understand by the term residual.
  3. Calculate the product moment correlation coefficient between \(t\) and \(y\)
  4. Explain why the linear model may not be a good fit for the data
    1. with reference to your answer to part (c)
    2. with reference to the residual plot. Kwame also collects data on total March rainfall, \(w \mathrm {~mm}\), for each of these 30 years. For a linear regression model of \(w\) on \(t\) the following summary statistic is found. $$\text { Residual Sum of Squares (RSS) = } 86754$$ Kwame concludes that since this model has a smaller RSS, there must be a stronger linear relationship between \(w\) and \(t\) than between \(y\) and \(t\) (where RSS \(= 1666567\) )
  5. State, giving a reason, whether or not you agree with the reasoning that led to Kwame's conclusion.
Edexcel FS2 Specimen Q6
12 marks Standard +0.3
  1. A random sample of 10 female pigs was taken. The number of piglets, \(x\), born to each female pig and their average weight at birth, \(m \mathrm {~kg}\), was recorded. The results were as follows:
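\begin{tabular}{|l|c|c|c|c|c|c|c|c|c|c|}
\hline
Number of piglets, \(\boldsymbol { x }\) & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 & 13 \\
\hline
Average weight at birth, \(\boldsymbol { m } \mathbf { ~ k g }\) & 1.50 & 1.20 & 1.40 & 1.40 & 1.23 & 1.30 & 1.20 & 1.15 & 1.25 & 1.15 \\
\hline
\end{tabular}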
(You may use \(\mathrm { S } _ { x x } = 82.5\) and \(\mathrm { S } _ { m m } = 0.12756\) and \(\mathrm { S } _ { x m } = - 2.29\) )
  1. Find the equation of the regression line of \(m\) on \(x\) in the form \(m = a + b x\) as a model for these results.
  2. Show that the residual sum of squares (RSS) is 0.064 to 3 decimal places.
  3. Calculate the residual values.
  4. Write down the outlier.
    1. Comment on the validity of ignoring this outlier.
    2. Ignoring the outlier, produce another model.
    3. Use this model to estimate the average weight at birth if \(x = 15\)
    4. Comment, giving a reason, on the reliability of your estimate.
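Residuals and the residual sum of squares can be checked directly from the raw data. The sketch below is one way to do this, assuming the ten data pairs from the table in the question and the quoted values of \(S_{xx}\) and \(S_{xm}\); the variable names are illustrative.

```python
# Data pairs as given in the question: number of piglets and average birth weight (kg).
x = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
m = [1.50, 1.20, 1.40, 1.40, 1.23, 1.30, 1.20, 1.15, 1.25, 1.15]

s_xx, s_xm = 82.5, -2.29      # quoted in the question
n = len(x)

b = s_xm / s_xx                       # gradient of m on x
a = sum(m) / n - b * sum(x) / n       # intercept

# Residual = observed value minus value predicted by the line.
residuals = [mi - (a + b * xi) for xi, mi in zip(x, m)]
rss = sum(e ** 2 for e in residuals)

for xi, e in zip(x, residuals):
    print(f"x = {xi:2d}  residual = {e:+.4f}")
print(f"RSS = {rss:.3f}")
```

The residuals of a least squares line always sum to zero (up to rounding), which gives a quick check on the arithmetic.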
OCR FS1 AS 2017 December Q5
8 marks Moderate -0.5
5 A shop manager recorded the maximum daytime temperature \(T ^ { \circ } \mathrm { C }\) and the number \(C\) of ice creams sold on 9 summer days. The results are given in the table and illustrated in the scatter diagram.
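\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|}
\hline
\(T\) & 17 & 21 & 25 & 26 & 27 & 27 & 29 & 30 & 30 \\
\hline
\(C\) & 21 & 16 & 20 & 38 & 32 & 37 & 35 & 39 & 42 \\
\hline
\end{tabular}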
\includegraphics[max width=\textwidth, alt={}]{64d7ed6d-fadd-4c59-afb0-97d1788ba369-3_661_1189_1320_431}
$$n = 9 , \Sigma t = 232 , \Sigma c = 280 , \Sigma t ^ { 2 } = 6130 , \Sigma c ^ { 2 } = 9444 , \Sigma t c = 7489$$
  1. State, with a reason, whether one of the variables \(C\) or \(T\) is likely to be dependent upon the other.
  2. Calculate Pearson's product-moment correlation coefficient \(r\) for the data.
  3. State with a reason what the value of \(r\) would have been if the temperature had been measured in \({ } ^ { \circ } \mathrm { F }\) rather than \({ } ^ { \circ } \mathrm { C }\).
  4. Calculate the equation of the least squares regression line of \(c\) on \(t\).
  5. The regression line is drawn on the copy of the scatter diagram in the Printed Answer Booklet. Use this diagram to explain what is meant by "least squares".
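The phrase "least squares" refers to the line that minimises the sum of the squared vertical distances from the data points to the line. The sketch below illustrates this numerically with the nine data pairs from the question: it computes the regression line of \(c\) on \(t\) and compares its sum of squares with that of a few nearby lines. The function name `sum_sq_vertical` and the chosen perturbations are illustrative assumptions.

```python
# Temperature (deg C) and ice creams sold, as given in the question.
T = [17, 21, 25, 26, 27, 27, 29, 30, 30]
C = [21, 16, 20, 38, 32, 37, 35, 39, 42]


def sum_sq_vertical(a, b):
    """Sum of squared vertical distances from the points to c = a + b*t."""
    return sum((c - (a + b * t)) ** 2 for t, c in zip(T, C))


n = len(T)
s_tt = sum(t * t for t in T) - sum(T) ** 2 / n
s_tc = sum(t * c for t, c in zip(T, C)) - sum(T) * sum(C) / n

b = s_tc / s_tt                      # gradient of c on t
a = sum(C) / n - b * sum(T) / n      # intercept
best = sum_sq_vertical(a, b)

print(f"c = {a:.2f} + {b:.3f} t, sum of squares = {best:.1f}")

# Any other line gives a larger sum of squares.
for da, db in [(1, 0), (-1, 0), (0, 0.1), (0, -0.1)]:
    other = sum_sq_vertical(a + da, b + db)
    print(f"shift a by {da:+}, b by {db:+}: sum of squares = {other:.1f}")
```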
OCR Further Statistics 2018 March Q9
8 marks Challenging +1.2
9 The values of a set of bivariate data \(\left( x _ { i } , y _ { i } \right)\) can be summarised by $$n = 50 , \sum x = 1270 , \sum y = 5173 , \sum x ^ { 2 } = 42767 , \sum y ^ { 2 } = 701301 , \sum x y = 173161 .$$ Ten independent observations of \(Y\) are obtained, all corresponding to \(x = 20\). It may be assumed that the variance of \(Y\) is 1.9, independently of the value of \(x\). Find a \(95 \%\) confidence interval for the mean \(\bar { Y }\) of the 10 observations of \(Y\).
OCR Further Statistics 2018 September Q1
4 marks Moderate -0.8
1 An experiment involves releasing a coin on a sloping plane so that it slides down the slope and then slides along a horizontal plane at the bottom of the slope before coming to rest. The angle \(\theta ^ { \circ }\) of the sloping plane is varied, and for each value of \(\theta\), the distance \(d \mathrm {~cm}\) the coin slides on the horizontal plane is recorded. A scatter diagram to illustrate the results of the experiment is shown below, together with the least squares regression line of \(d\) on \(\theta\). \includegraphics[max width=\textwidth, alt={}, center]{28c6a0d9-09a6-4743-af0e-fe2e43e256c9-2_639_972_561_548}
  1. State which two of the following correctly describe the variable \(\theta\).
    \begin{tabular}{|l|l|}
    \hline
    Controlled variable & Correlation coefficient \\
    \hline
    Dependent variable & Independent variable \\
    \hline
    Response variable & Regression coefficient \\
    \hline
    \end{tabular}
    The least squares regression line of \(d\) on \(\theta\) has equation \(d = 1.96 + 0.11 \theta\).
  2. Use the diagram in the Printed Answer Booklet to explain the term "least squares".
  3. State what difference, if any, it would make to the equation of the regression line if \(d\) were measured in inches rather than centimetres. ( 1 inch \(\approx 2.54 \mathrm {~cm}\) ).
Edexcel S1 2022 January Q6
13 marks Moderate -0.8
  1. Students on a psychology course were given a pre-test at the start of the course and a final exam at the end of the course. The teacher recorded the number of marks achieved on the pre-test, \(p\), and the number of marks achieved on the final exam, \(f\), for 34 students and displayed them on the scatter diagram. \includegraphics[max width=\textwidth, alt={}, center]{fa1cb8a2-dab9-4133-b7a1-9108888c37d7-22_1121_1136_447_438}
The equation of the least squares regression line for these data is found to be $$f = 10.8 + 0.748 p$$ For these students, the mean number of marks on the pre-test is 62.4
  1. Use the regression model to find the mean number of marks on the final exam.
  2. Give an interpretation of the gradient of the regression line. Considering the equation of the regression line, Priya says that she would expect someone who scored 0 marks on the pre-test to score 10.8 marks on the final exam.
  3. Comment on the reliability of Priya's statement.
  4. Write down the number of marks achieved on the final exam for the student who exceeded the expectation of the regression model by the largest number of marks.
  5. Find the range of values of \(p\) for which this regression model, \(f = 10.8 + 0.748 p\), predicts a greater number of marks on the final exam than on the pre-test. Later the teacher discovers an error in the recorded data. The student who achieved a score of 98 on the pre-test, scored 92 not 29 on the final exam. The summary statistics used for the model \(f = 10.8 + 0.748 p\) are corrected to include this information and a new least squares regression line is found. Given the original summary statistics were, $$n = 34 \quad \sum p = 2120 \quad \sum p f = 133486 \quad \mathrm {~S} _ { p p } = 15573.76 \quad \mathrm {~S} _ { p f } = 11648.35$$
  6. calculate the gradient of the new regression line. Show your working clearly.
Edexcel S1 2017 June Q5
15 marks Moderate -0.3
  1. Tomas is studying the relationship between temperature and hours of sunshine in Seapron. He records the midday temperature, \(t ^ { \circ } \mathrm { C }\), and the hours of sunshine, \(s\) hours, for a random sample of 9 days in October. He calculated the following statistics
$$\sum s = 15 \quad \sum s ^ { 2 } = 44.22 \quad \sum t = 127 \quad \mathrm {~S} _ { t t } = 10.89$$
  1. Calculate \(\mathrm { S } _ { s s }\) Tomas calculated the product moment correlation coefficient between \(s\) and \(t\) to be 0.832 correct to 3 decimal places.
  2. State, giving a reason, whether or not this correlation coefficient supports the use of a linear regression model to describe the relationship between midday temperature and hours of sunshine.
  3. State, giving a reason, why the hours of sunshine would be the explanatory variable in a linear regression model between midday temperature and hours of sunshine.
  4. Find \(\mathrm { S } _ { s t }\)
  5. Calculate a suitable linear regression equation to model the relationship between midday temperature and hours of sunshine.
  6. Calculate the standard deviation of \(s\) Tomas uses this model to estimate the midday temperature in Seapron for a day in October with 5 hours of sunshine.
  7. State the value of Tomas' estimate. Given that the values of \(s\) are all within 2 standard deviations of the mean,
  8. comment, giving your reason, on the reliability of this estimate.
Edexcel S1 2017 October Q5
13 marks Moderate -0.8
  1. A company wants to pay its employees according to their performance at work. Last year's performance score \(x\) and annual salary \(y\), in thousands of dollars, were recorded for a random sample of 10 employees of the company.
The performance scores were $$\begin{array} { l l l l l l l l l l } 15 & 24 & 32 & 39 & 41 & 18 & 16 & 22 & 34 & 42 \end{array}$$ (You may use \(\sum x ^ { 2 } = 9011\) )
  1. Find the mean and the variance of these performance scores. The corresponding \(y\) values for these 10 employees are summarised by $$\sum y = 306.1 \quad \text { and } \quad \mathrm { S } _ { y y } = 546.3$$
  2. Find the mean and the variance of these \(y\) values. The regression line of \(y\) on \(x\) based on this sample is $$y = 12.0 + 0.659 x$$
  3. Find the product moment correlation coefficient for these data.
  4. State, giving a reason, whether or not the value of the product moment correlation coefficient supports the use of a regression line to model the relationship between performance score and annual salary. The company decides to use this regression model to determine future salaries.
  5. Find the proposed annual salary, in dollars, for an employee who has a performance score of 35
Edexcel S1 2021 October Q2
12 marks Moderate -0.5
2. A large company is analysing how much money it spends on paper in its offices each year. The number of employees in the office, \(x\), and the amount spent on paper in a year, \(p\) (\$ hundreds), in each of 12 randomly selected offices were recorded. The results are summarised in the following statistics. $$\sum x = 93 \quad \mathrm {~S} _ { x x } = 148.25 \quad \sum p = 273 \quad \sum p ^ { 2 } = 6602.72 \quad \sum x p = 2347$$
  1. Show that \(\mathrm { S } _ { x p } = 231.25\)
  2. Find the product moment correlation coefficient for these data.
  3. Find the equation of the regression line of \(p\) on \(x\) in the form \(p = a + b x\)
  4. Give an interpretation of the gradient of your regression line. The director of the company wants to reduce the amount spent on paper each year. He wants each office to aim for a model of the form \(p = \frac { 4 } { 5 } a + \frac { 1 } { 2 } b x\), where \(a\) and \(b\) are the values found in part (c). Using the data for the 93 employees from the 12 offices,
  5. estimate the percentage saving in the amount spent on paper each year by the company using the director's model.
Edexcel S1 2003 June Q7
16 marks Moderate -0.8
  1. Eight students took tests in mathematics and physics. The marks for each student are given in the table below where \(m\) represents the mathematics mark and \(p\) the physics mark.
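\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|}
\hline
\multicolumn{2}{|c|}{Student} & \(A\) & \(B\) & \(C\) & \(D\) & \(E\) & \(F\) & \(G\) & \(H\) \\
\hline
\multirow{2}{*}{Mark} & \(m\) & 9 & 14 & 13 & 10 & 7 & 8 & 20 & 17 \\
\cline{2-10}
 & \(p\) & 11 & 23 & 21 & 15 & 19 & 10 & 31 & 26 \\
\hline
\end{tabular}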
A science teacher believes that students' marks in physics depend upon their mathematical ability. The teacher decides to investigate this relationship using the test marks.
  1. Write down which is the explanatory variable in this investigation.
  2. Draw a scatter diagram to illustrate these data.
  3. Showing your working, find the equation of the regression line of \(p\) on \(m\).
  4. Draw the regression line on your scatter diagram. A ninth student was absent for the physics test, but she sat the mathematics test and scored 15.
  5. Using this model, estimate the mark she would have scored in the physics test.
AQA S1 2005 January Q3
12 marks Moderate -0.8
3 [Figure 1, printed on the insert, is provided for use in this question.]
A parcel delivery company has a depot on the outskirts of a town. Each weekday, a van leaves the depot to deliver parcels across a nearby area. The table below shows, for a random sample of 10 weekdays, the number, \(x\), of parcels to be delivered and the total time, \(y\) minutes, that the van is out of the depot.
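\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|c|}
\hline
\(\boldsymbol { x }\) & 9 & 16 & 22 & 11 & 19 & 26 & 14 & 10 & 11 & 17 \\
\hline
\(\boldsymbol { y }\) & 79 & 127 & 172 & 109 & 152 & 214 & 131 & 80 & 94 & 148 \\
\hline
\end{tabular}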
  1. On Figure 1, plot a scatter diagram of these data.
  2. Calculate the equation of the least squares regression line of \(y\) on \(x\) and draw your line on Figure 1.
  3. Use your regression equation to estimate the total time that the van is out of the depot when delivering:
    1. 15 parcels;
    2. 35 parcels. Comment on the likely reliability of each of your estimates.
  4. The time that the van is out of the depot delivering parcels may be thought of as the time needed to travel to and from the area plus an amount of time proportional to the number of parcels to be delivered. Given that the regression line of \(y\) on \(x\) is of the form \(y = a + b x\), give an interpretation, in context, for each of your values of \(a\) and \(b\).
    (2 marks)
AQA S1 2007 January Q7
15 marks Moderate -0.8
7 [Figure 1, printed on the insert, is provided for use in this question.]
Stan is a retired academic who supplements his pension by mowing lawns for customers who live nearby. As part of a review of his charges for this work, he measures the areas, \(x \mathrm {~m} ^ { 2 }\), of a random sample of eight of his customers' lawns and notes the times, \(y\) minutes, that it takes him to mow these lawns. His results are shown in the table.
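\begin{tabular}{|c|c|c|c|c|c|c|c|c|}
\hline
Customer & \(\mathbf { A }\) & \(\mathbf { B }\) & \(\mathbf { C }\) & \(\mathbf { D }\) & \(\mathbf { E }\) & \(\mathbf { F }\) & \(\mathbf { G }\) & \(\mathbf { H }\) \\
\hline
\(\boldsymbol { x }\) & 360 & 140 & 860 & 600 & 1180 & 540 & 260 & 480 \\
\hline
\(\boldsymbol { y }\) & 50 & 25 & 135 & 70 & 140 & 90 & 55 & 70 \\
\hline
\end{tabular}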
  1. On Figure 1, plot a scatter diagram of these data.
  2. Calculate the equation of the least squares regression line of \(y\) on \(x\). Draw your line on Figure 1.
  3. Calculate the value of the residual for Customer H and indicate how your value is confirmed by your scatter diagram.
  4. Given that Stan charges \(\pounds 12\) per hour, estimate the charge for mowing a customer's lawn that has an area of \(560 \mathrm {~m} ^ { 2 }\).
AQA S1 2010 January Q3
8 marks Moderate -0.3
3 The table shows, for each of a random sample of 7 weeks, the number of customers, \(x\), who purchased fuel from a filling station, together with the total volume, \(y\) litres, of fuel purchased by these customers.
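\begin{tabular}{|c|c|c|c|c|c|c|c|}
\hline
\(\boldsymbol { x }\) & 230 & 184 & 165 & 147 & 241 & 174 & 210 \\
\hline
\(\boldsymbol { y }\) & 4551 & 3410 & 3252 & 3756 & 3787 & 4024 & 4254 \\
\hline
\end{tabular}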
  1. Calculate the equation of the least squares regression line of \(y\) on \(x\).
  2. Estimate the volume of fuel sold during a week in which 200 customers purchase fuel.
  3. Comment on the likely reliability of your estimate in part (b), given that, for the regression line calculated in part (a), the values of the 7 residuals lie between approximately \(-415\) litres and \(+430\) litres.
AQA S1 2015 June Q4
15 marks Moderate -0.3
4 Stephan is a roofing contractor who is often required to replace loose ridge tiles on house roofs. In order to help him to quote more accurately the prices for such jobs in the future, he records, for each of 11 recently repaired roofs, the number of ridge tiles replaced, \(x _ { i }\), and the time taken, \(y _ { i }\) hours. His results are shown in the table.
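\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|c|c|}
\hline
Roof \(( \boldsymbol { i } )\) & \(\mathbf { 1 }\) & \(\mathbf { 2 }\) & \(\mathbf { 3 }\) & \(\mathbf { 4 }\) & \(\mathbf { 5 }\) & \(\mathbf { 6 }\) & \(\mathbf { 7 }\) & \(\mathbf { 8 }\) & \(\mathbf { 9 }\) & \(\mathbf { 10 }\) & \(\mathbf { 11 }\) \\
\hline
\(\boldsymbol { x } _ { \boldsymbol { i } }\) & 8 & 11 & 14 & 14 & 16 & 20 & 22 & 22 & 25 & 27 & 30 \\
\hline
\(\boldsymbol { y } _ { \boldsymbol { i } }\) & 5.0 & 5.2 & 6.3 & 7.2 & 8.0 & 8.8 & 10.6 & 11.0 & 11.8 & 12.1 & 13.0 \\
\hline
\end{tabular}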
  1. The pairs of data values for roofs 1 to 7 are plotted on the scatter diagram shown on the opposite page. Plot the 4 pairs of data values for roofs 8 to 11 on the scatter diagram.
    1. Calculate the equation of the least squares regression line of \(y _ { i }\) on \(x _ { i }\), and draw your line on the scatter diagram.
    2. Interpret your values for the gradient and for the intercept of this regression line.
  2. Estimate the time that it would take Stephan to replace 15 loose ridge tiles on a house roof.
  3. Given that \(r _ { i }\) denotes the residual for the point representing roof \(i\) :
    1. calculate the value of \(r _ { 6 }\);
    2. state why the value of \(\sum _ { i = 1 } ^ { 11 } r _ { i }\) gives no useful information about the connection between the number of ridge tiles replaced and the time taken.
      [1 mark]
      \includegraphics[max width=\textwidth, alt={}]{6fbb8891-e6de-42fe-a195-ea643552fdcf-11_2385_1714_322_155}
OCR S1 Q4
8 marks Moderate -0.3
4 The table shows the latitude, \(x\) (in degrees correct to 3 significant figures), and the average rainfall \(y\) (in cm correct to 3 significant figures) of five European cities.
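\begin{tabular}{|l|c|c|}
\hline
City & \(x\) & \(y\) \\
\hline
Berlin & 52.5 & 58.2 \\
Bucharest & 44.4 & 58.7 \\
Moscow & 55.8 & 53.3 \\
St Petersburg & 60.0 & 47.8 \\
Warsaw & 52.3 & 56.6 \\
\hline
\end{tabular}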
$$\left[ n = 5 , \Sigma x = 265.0 , \Sigma y = 274.6 , \Sigma x ^ { 2 } = 14176.54 , \Sigma y ^ { 2 } = 15162.22 , \Sigma x y = 14464.10 . \right]$$
  1. Calculate the product moment correlation coefficient.
  2. The values of \(y\) in the table were in fact obtained from measurements in inches and converted into centimetres by multiplying by 2.54. State what effect it would have had on the value of the product moment correlation coefficient if it had been calculated using inches instead of centimetres.
  3. It is required to estimate the annual rainfall at Bergen, where \(x = 60.4\). Calculate the equation of an appropriate line of regression, giving your answer in simplified form, and use it to find the required estimate. \section*{June 2005}
Edexcel AS Paper 2 2018 June Q1
3 marks Moderate -0.8
  1. A company is introducing a job evaluation scheme. Points ( \(x\) ) will be awarded to each job based on the qualifications and skills needed and the level of responsibility. Pay ( \(\pounds y\) ) will then be allocated to each job according to the number of points awarded.
Before the scheme is introduced, a random sample of 8 employees was taken and the linear regression equation of pay on points was \(y = 4.5 x - 47\)
  1. Describe the correlation between points and pay.
  2. Give an interpretation of the gradient of this regression line.
  3. Explain why this model might not be appropriate for all jobs in the company.
Edexcel S1 2024 October Q2
Moderate -0.8
  1. A biologist records the length, \(y \mathrm {~cm}\), and the weight, \(w \mathrm {~kg}\), of 50 rabbits. The following summary statistics are calculated from these data.
$$\sum y = 2015 \quad \sum y ^ { 2 } = 81938.5 \quad \sum w = 125 \quad \mathrm {~S} _ { w w } = 72.25 \quad \mathrm {~S} _ { y w } = 219.55$$
    1. Show that \(\mathrm { S } _ { y y } = 734\)
    2. Calculate the product moment correlation coefficient for these data. Give your answer to 3 decimal places.
  1. Interpret your value of the product moment correlation coefficient. The biologist believes that a linear regression model may be appropriate to describe these data.
  2. State, with a reason, whether or not your value of the product moment correlation coefficient is consistent with the biologist’s belief.
  3. Find the equation of the regression line of \(w\) on \(y\), giving your answer in the form \(w = a + b y\) Jeff has a pet rabbit of length 45 cm.
  4. Use your regression equation to estimate the weight of Jeff's rabbit.
Pre-U Pre-U 9794/3 2013 June Q3
12 marks Moderate -0.8
3 At a local athletics club, data on the ages of the members and their times to run a 10 km course are recorded. For a random sample of 25 club members aged between 20 and 60, their ages ( \(x\) years) and times ( \(y\) minutes) are summarised as follows. $$n = 25 \quad \Sigma x = 1002 \quad \Sigma x ^ { 2 } = 43508 \quad \Sigma y = 1865 \quad \Sigma y ^ { 2 } = 142749 \quad \Sigma x y = 77532$$
  1. Calculate the product moment correlation coefficient for these data.
  2. Show that the equation of the least squares regression line of \(y\) on \(x\) is \(y = 0.83 x + 41.28\), where the coefficients are given correct to 2 decimal places.
  3. Use the equation given in part (ii) to estimate the time taken by someone who is
    1. 50 years old,
    2. 65 years old. Comment on the validity of each of these estimates.
Pre-U Pre-U 9794/1 Specimen Q13
9 marks Moderate -0.3
13 A seed company investigated how well African Marigold seeds germinated when the seeds were past their sell-by date. The table shows the average number of seeds which germinated per packet, \(y\), and the number of months past their sell-by date, \(t\).
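\begin{tabular}{|c|c|c|c|c|c|}
\hline
\(t\) & 10 & 20 & 30 & 40 & 50 \\
\hline
\(y\) & 24.5 & 24.0 & 21.7 & 18.6 & 12.4 \\
\hline
\end{tabular}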
The summary data for the investigation were as follows. $$\Sigma t = 150 \quad \Sigma t ^ { 2 } = 5500 \quad \Sigma y = 101.2 \quad \Sigma y ^ { 2 } = 2146.86 \quad \Sigma t y = 2740$$
  1. Calculate the equation of the regression line of \(y\) on \(t\).
  2. Use your regression line to calculate \(y\) when \(t = 10\). Compare your answer with the value of \(y\) when \(t = 10\) in the table and comment on the result.
  3. Use your regression line to calculate \(y\) when \(t = 100\). Comment on the validity of this result.
  4. Suggest with reasons whether the regression line provides a good model for predicting the germination of seeds past their sell-by date.