Linear regression

219 questions · 30 question types identified

Calculate y on x from raw data table

Questions that provide raw bivariate data in a table and ask to find the regression line of y on x.

65 Moderate -0.6
29.7% of questions
Show example »
A random sample of 5 pairs of values \((x, y)\) is given in the following table.
\(x\) | 1 | 2 | 4 | 5 | 8
\(y\) | 7 | 5 | 8 | 6 | 4
  1. Find, showing all necessary working, the equation of the regression line of \(y\) on \(x\). [4]
  2. Find, showing all necessary working, the value of the product moment correlation coefficient for this sample. [3]
  3. Test, at the 10% significance level, whether there is evidence of non-zero correlation between the variables. [4]
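As a rough illustration of the working this question type calls for, the sketch below computes \(S_{xx}\), \(S_{yy}\) and \(S_{xy}\) from the five pairs in the example table, then the line of \(y\) on \(x\) and the product moment correlation coefficient. The helper name `s_sums` is an illustrative choice, not something taken from the question.

```python
from math import sqrt

# Data from the example table above.
x = [1, 2, 4, 5, 8]
y = [7, 5, 8, 6, 4]

def s_sums(xs, ys):
    """Return (Sxx, Syy, Sxy) using the standard sum formulas."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(v * v for v in xs) - sx * sx / n
    syy = sum(v * v for v in ys) - sy * sy / n
    sxy = sum(a * b for a, b in zip(xs, ys)) - sx * sy / n
    return sxx, syy, sxy

sxx, syy, sxy = s_sums(x, y)
b = sxy / sxx                               # gradient of y on x
a = sum(y) / len(y) - b * sum(x) / len(x)   # line passes through (x_bar, y_bar)
r = sxy / sqrt(sxx * syy)                   # product moment correlation coefficient

print(f"y = {a:.2f} + ({b:.2f})x, r = {r:.3f}")
```

For this table \(S_{xx} = 30\), \(S_{xy} = -9\) and \(S_{yy} = 10\), so the line is \(y = 7.2 - 0.3x\) and \(r \approx -0.520\).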
Easiest question Easy -1.2 »
3. The table shows data on the number of visitors to the UK in a month, \(v\) (1000s), and the amount of money they spent, \(m\) ( \(\pounds\) millions), for each of 8 months.
Number of visitors \(v\) (1000s) | 2450 | 2480 | 2540 | 2420 | 2350 | 2290 | 2400 | 2460
Amount of money spent \(m\) (\(\pounds\) millions) | 1370 | 1350 | 1400 | 1330 | 1270 | 1210 | 1330 | 1350
You may use \(S _ { v v } = 42587.5 \quad S _ { v m } = 31512.5 \quad S _ { m m } = 25187.5 \quad \sum v = 19390 \quad \sum m = 10610\)
  1. Find the product moment correlation coefficient between \(m\) and \(v\).
  2. Give a reason to support fitting a regression model of the form \(m = a + b v\) to these data.
  3. Find the value of \(b\) correct to 3 decimal places.
  4. Find the equation of the regression line of \(m\) on \(v\).
  5. Interpret your value of \(b\).
  6. Use your answer to part (d) to estimate the amount of money spent when the number of visitors to the UK in a month is 2500000
  7. Comment on the reliability of your estimate in part (f). Give a reason for your answer.
Hardest question Standard +0.8 »
9 A random sample of 10 pairs of values of \(x\) and \(y\) is given in the following table.
\(x\) | 4 | 6 | 6 | 8 | 2 | 7 | 12 | 14 | 9 | 5
\(y\) | 2 | 4 | 6 | 8 | 6 | 10 | 9 | 8 | 6 | 5
  1. Find the equation of the regression line of \(y\) on \(x\).
  2. Find the product moment correlation coefficient for the sample.
  3. Find the estimated value of \(y\) when \(x = 10\), and comment on the reliability of this estimate.
  4. Another sample of \(N\) pairs of data from the same population has the same product moment correlation coefficient as the first sample given. A test, at the \(1 \%\) significance level, on this second sample indicates that there is sufficient evidence to conclude that there is positive correlation. Find the set of possible values of \(N\).
Calculate y on x from summary statistics

Questions that provide summary statistics (sums, means, variances, Sxx, Sxy, etc.) and ask to find the regression line of y on x.

12 Moderate -0.3
5.5% of questions
Show example »
10 The means and variances for a random sample of 8 pairs of values of \(x\) and \(y\) taken from a bivariate distribution are given in the following table.
 | Mean | Variance
\(x\) | 3.3125 | 3.3086
\(y\) | 6.7375 | 7.9473
The product moment correlation coefficient for the sample is 0.5815 , correct to 4 decimal places.
  1. Find the equation of the regression line of \(y\) on \(x\).
  2. Test at the \(5 \%\) significance level whether there is evidence of positive correlation between \(x\) and \(y\). [4]
  3. Calculate an estimate of \(y\) when \(x = 6.0\) and comment on the reliability of your estimate.
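When only means, variances and \(r\) are given, the gradient can be obtained from \(b = r \sqrt{S_{yy} / S_{xx}}\). The sketch below applies this to the example above, assuming both variances in the table use the same divisor so that it cancels in the ratio. The function name `predict` is illustrative.

```python
from math import sqrt

# Values from the example table above.
mean_x, var_x = 3.3125, 3.3086
mean_y, var_y = 6.7375, 7.9473
r = 0.5815

# b = Sxy / Sxx = r * sqrt(Syy / Sxx); the n (or n - 1) divisor cancels,
# assuming both variances were computed with the same divisor.
b = r * sqrt(var_y / var_x)
a = mean_y - b * mean_x   # the line passes through the mean point

def predict(x):
    """Estimate y from the fitted line y = a + b x."""
    return a + b * x

print(f"y = {a:.3f} + {b:.3f}x; estimate at x = 6.0: {predict(6.0):.2f}")
```

This gives a gradient of roughly 0.90 and an intercept of roughly 3.75.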
Easiest question Moderate -0.8 »
  1. Stuart is investigating the relationship between Gross Domestic Product (GDP) and the size of the population for a particular country.
    He takes a random sample of 9 years and records the size of the population, \(t\) millions, and the GDP, \(g\) billion dollars for each of these years.
The data are summarised as $$n = 9 \quad \sum t = 7.87 \quad \sum g = 144.84 \quad \sum g ^ { 2 } = 3624.41 \quad S _ { t t } = 1.29 \quad S _ { t g } = 40.25$$
  1. Calculate the product moment correlation coefficient between \(t\) and \(g\)
  2. Give an interpretation of your product moment correlation coefficient.
  3. Find the equation of the least squares regression line of \(g\) on \(t\) in the form \(g = a + b t\)
  4. Give an interpretation of the value of \(b\) in your regression line.
    1. Use the regression line from part (c) to estimate the GDP, in billions of dollars, for a population of 7000000
    2. Comment on the reliability of your answer in part (i). Give a reason, in context, for your answer. Using the regression line from part (c), Stuart estimates that for a population increase of \(x\) million there will be an increase of 0.1 billion dollars in GDP.
  5. Find the value of \(x\)
Hardest question Standard +0.3 »
2 The director of a concert hall wishes to investigate if the price of the most expensive concert tickets affects attendance. The director collects data about the price, \(\pounds P\), of the most expensive tickets and the number of people in the audience, \(H\) hundred (rounded to the nearest hundred), for 20 concerts. For each price there are several different concerts. The results are shown in the table.
\(P\) (£) | 75 | 65 | 55 | 45 | 35
\(H\) (hundred) | 27 | 27 | 27 | 26 | 15
 | 27 | 27 | 20 | 21 | 12
 | | 22 | 18 | 16 | 9
 | | 19 | 18 | 13 |
 | | 12 | 16 | 9 |
\(\mathrm { n } = 20 \quad \sum \mathrm { p } = 1080 \quad \sum \mathrm {~h} = 381 \quad \sum \mathrm { p } ^ { 2 } = 61300 \quad \sum \mathrm {~h} ^ { 2 } = 8011 \quad \sum \mathrm { ph } = 21535\)
  1. Calculate the equation of the regression line of \(h\) on \(p\).
  2. State what change, if any, there would be to your answer to part (a) if \(H\) had been measured in thousands (to 1 decimal place) rather than in hundreds. For a special charity concert, the most expensive tickets cost \(\pounds 50\).
  3. Use your answer to part (b) to estimate the expected size of the audience for this concert. Give your answer correct to \(\mathbf { 1 }\) decimal place.
  4. Comment on the reliability of your answer to part (c). You should refer to
Convert regression equation between coded and original

Questions that require finding the regression equation in coded variables and then converting it to original variables, or vice versa, using the coding transformations.

12 Moderate -0.1
5.5% of questions
Show example »
  1. A farmer collected data on the annual rainfall, \(x \mathrm {~cm}\), and the annual yield of peas, \(p\) tonnes per acre.
The data for annual rainfall was coded using \(v = \frac { x - 5 } { 10 }\) and the following statistics were found. $$S _ { v v } = 5.753 \quad S _ { p v } = 1.688 \quad S _ { p p } = 1.168 \quad \bar { p } = 3.22 \quad \bar { v } = 4.42$$
  1. Find the equation of the regression line of \(p\) on \(v\) in the form \(p = a + b v\).
  2. Using your regression line estimate the annual yield of peas per acre when the annual rainfall is 85 cm .
Easiest question Moderate -0.8 »
  1. A farmer collected data on the annual rainfall, \(x \mathrm {~cm}\), and the annual yield of peas, \(p\) tonnes per acre.
The data for annual rainfall was coded using \(v = \frac { x - 5 } { 10 }\) and the following statistics were found. $$S _ { v v } = 5.753 \quad S _ { p v } = 1.688 \quad S _ { p p } = 1.168 \quad \bar { p } = 3.22 \quad \bar { v } = 4.42$$
  1. Find the equation of the regression line of \(p\) on \(v\) in the form \(p = a + b v\).
  2. Using your regression line estimate the annual yield of peas per acre when the annual rainfall is 85 cm .
Hardest question Standard +0.3 »
6. In a survey for a computer magazine, the times \(t\) seconds taken by eight laser printers to print a page of text were compared with the prices \(\pounds p\) of the printers. The data were coded using the equations \(x = t - 10\) and \(y = p - 150\), and it was found that $$\sum x = 42 \cdot 4 , \quad \sum x ^ { 2 } = 314 \cdot 5 , \quad \sum y = 560 , \quad \sum y ^ { 2 } = 60600 , \quad \sum x y = 1592 .$$
  1. Find the mean time and the mean price for the eight printers.
  2. Find the variance of the times.
  3. Find the equation of the regression line of \(p\) on \(t\).
  4. Estimate the price of a printer which takes 11.3 seconds to print the page.
Interpret regression line parameters

A question is this type if and only if it asks to interpret the meaning of the gradient, intercept, or other feature of a regression line in context.

11 Moderate -0.6
5.0% of questions
Show example »
  1. The relationship between two variables \(p\) and \(t\) is modelled by the regression line with equation $$p = 22 - 1.1 t$$
    The model is based on observations of the independent variable, \(t\), between 1 and 10.
  1. Describe the correlation between \(p\) and \(t\) implied by this model.
    Given that \(p\) is measured in centimetres and \(t\) is measured in days,
  2. state the units of the gradient of the regression line.
    Using the model,
  3. calculate the change in \(p\) over a 3-day period.
    Tisam uses this model to estimate the value of \(p\) when \(t = 19\).
  4. Comment, giving a reason, on the reliability of this estimate.
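The interpretation points in this example can be checked numerically. The sketch below treats the gradient as the change in \(p\) per day and flags any \(t\) outside the observed range as extrapolation. The names `p_model` and `is_extrapolation` are illustrative.

```python
GRADIENT = -1.1          # cm per day, from p = 22 - 1.1 t
T_MIN, T_MAX = 1, 10     # observed range of t stated in the question

def p_model(t):
    """Regression model from the example: p = 22 - 1.1 t."""
    return 22 + GRADIENT * t

def is_extrapolation(t):
    """True when t lies outside the range of the data behind the model."""
    return not (T_MIN <= t <= T_MAX)

change_over_3_days = 3 * GRADIENT   # gradient times the change in t
print(change_over_3_days, is_extrapolation(19))
```

A negative gradient corresponds to negative correlation, the model predicts a decrease of 3.3 cm over 3 days, and \(t = 19\) lies outside the range 1 to 10.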
Easiest question Easy -1.8 »
The pre-release material contains data concerning the death rate per thousand people and the birth rate per thousand people in all the countries of the world. The diagram in Fig. 11.1 was generated using a spreadsheet and summarises the birth rates for all the countries in Africa. \includegraphics{figure_11_1} Fig. 11.1
  1. Identify two respects in which the presentation of the data is incorrect. [2]
Fig. 11.2 shows a scatter diagram of death rate, \(y\), against birth rate, \(x\), for a sample of 55 countries, all of which are in Africa. A line of best fit has also been drawn. \includegraphics{figure_11_2} Fig. 11.2 The equation of the line of best fit is \(y = 0.15x + 4.72\).
    1. What does the diagram suggest about the relationship between death rate and birth rate? [1]
    2. The birth rate in Togo is recorded as 34.13 per thousand, but the data on death rate has been lost. Use the equation of the line of best fit to estimate the death rate in Togo. [1]
    3. Explain why it would not be sensible to use the equation of the line of best fit to estimate the death rate in a country where the birth rate is 5.5 per thousand. [1]
    4. Explain why it would not be sensible to use the equation of the line of best fit to estimate the death rate in a Caribbean country where the birth rate is known. [1]
    5. Explain why it is unlikely that the sample is random. [1]
Including Togo there were 56 items available for selection.
  1. Describe how a sample of size 14 from this data could be generated for further analysis using systematic sampling. [2]
Hardest question Standard +0.3 »
3 In a triathlon, competitors have to swim 600 metres, cycle 40 kilometres and run 10 kilometres. To improve her strength, a triathlete undertakes a training programme in which she carries weights in a rucksack whilst running. She runs a specific course and notes the total time taken for each run. Her coach is investigating the relationship between time taken and weight carried. The times taken with eight different weights are illustrated on the scatter diagram below, together with the summary statistics for these data. The variables \(x\) and \(y\) represent weight carried in kilograms and time taken in minutes respectively. \includegraphics[max width=\textwidth, alt={}, center]{be463718-caf7-4bc8-b838-143ab4681d6e-4_627_1536_630_281} Summary statistics: \(n = 8 , \Sigma x = 36 , \Sigma y = 214.8 , \Sigma x ^ { 2 } = 204 , \Sigma y ^ { 2 } = 5775.28 , \Sigma x y = 983.6\).
  1. Calculate the equation of the regression line of \(y\) on \(x\). On one of the eight runs, the triathlete was carrying 4 kilograms and took 27.5 minutes. On this run she was delayed when she tripped and fell over.
  2. Calculate the value of the residual for this weight.
  3. The coach decides to recalculate the equation of the regression line without the data for this run. Would it be preferable to use this recalculated equation or the equation found in part (i) to estimate the delay when the triathlete tripped and fell over? Explain your answer. The triathlete's coach claims that there is positive correlation between cycling and swimming times in triathlons. The product moment correlation coefficient of the times of twenty randomly selected competitors in these two sections is 0.209 .
  4. Carry out a hypothesis test at the \(5 \%\) level to examine the coach's claim, explaining your conclusions clearly.
  5. What distributional assumption is necessary for this test to be valid? How can you use a scatter diagram to decide whether this assumption is likely to be true?
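Parts 1 and 2 of this question can be sketched from the quoted sums alone. The residual below uses the convention observed minus predicted, which is an assumption about the intended sign.

```python
# Summary statistics quoted in the question.
n = 8
sum_x, sum_y = 36, 214.8
sum_x2, sum_xy = 204, 983.6

S_xx = sum_x2 - sum_x ** 2 / n
S_xy = sum_xy - sum_x * sum_y / n
b = S_xy / S_xx                   # gradient of y on x
a = sum_y / n - b * sum_x / n     # intercept, via the mean point

# Residual = observed y minus the value predicted by the line (assumed convention).
x_obs, y_obs = 4, 27.5
residual = y_obs - (a + b * x_obs)
print(f"y = {a:.2f} + {b:.3f}x, residual = {residual:.2f}")
```

Here \(S_{xx} = 42\) and \(S_{xy} = 17\), so the line is approximately \(y = 25.03 + 0.405x\) and the residual for the 4 kg run is about 0.85 minutes.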
Identify response/explanatory variables

A question is this type if and only if it asks to identify which variable is the independent/explanatory/controlled variable and which is the dependent/response variable.

11 Moderate -0.9
5.0% of questions
Show example »
As part of a study into the effects of alcohol, volunteers have their reaction times measured after they have consumed various fixed amounts of alcohol. For a random sample of 12 volunteers the following information was collected.
Units of alcohol consumed | 2 | 3 | 3 | 4 | 4.5 | 5.5 | 6 | 6 | 7 | 8 | 8 | 9
Reaction time (seconds) | 1 | 2 | 5 | 5 | 3.8 | 5.5 | 4.8 | 8.5 | 7.2 | 6.8 | 9 | 8
  1. Which is the independent variable in this experiment? [1]
  2. Find the least squares regression line of \(y\) (Reaction time) on \(x\) (Units of alcohol), and use it to estimate the reaction time of someone who has consumed 5 units of alcohol. [5]
Easiest question Easy -1.3 »
  1. The resting heart rate, \(h\) beats per minute (bpm), and average length of daily exercise, \(t\) minutes, of a random sample of 8 teachers are shown in the table below.
\(t\) | 20 | 35 | 40 | 25 | 45 | 70 | 75 | 90
\(h\) | 88 | 85 | 77 | 75 | 71 | 66 | 60 | 54
  1. State, with a reason, which variable is the response variable. The equation of the least squares regression line of \(h\) on \(t\) is $$h = 93.5 - 0.43 t$$
  2. Give an interpretation of the gradient of this regression line.
  3. Find the value of \(\bar { t }\) and the value of \(\bar { h }\)
  4. Show that the point \(( \bar { t } , \bar { h } )\) lies on the regression line.
  5. Estimate the resting heart rate of a teacher with an average length of daily exercise of 1 hour.
  6. Comment, giving a reason, on the reliability of the estimate in part (e). The resting heart rate of teachers is assumed to be normally distributed with mean 73 bpm and standard deviation 8 bpm . The middle \(95 \%\) of resting heart rates of teachers lies between \(a\) and \(b\)
  7. Find the value of \(a\) and the value of \(b\).
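The numerical parts of this question can be sketched as follows: check that the mean point lies on the given line, estimate for 60 minutes of exercise, and find the central 95% interval of the stated normal model. Using `statistics.NormalDist` for the normal quantile is a choice made here, not something the question prescribes.

```python
from statistics import NormalDist, mean

# Data from the table in the question.
t = [20, 35, 40, 25, 45, 70, 75, 90]
h = [88, 85, 77, 75, 71, 66, 60, 54]

def h_line(minutes):
    """Given regression line h = 93.5 - 0.43 t."""
    return 93.5 - 0.43 * minutes

t_bar, h_bar = mean(t), mean(h)
on_line = abs(h_line(t_bar) - h_bar) < 1e-9   # mean point lies on the line
estimate_1_hour = h_line(60)                  # 1 hour = 60 minutes, within 20..90

# Middle 95% of N(73, 8^2): mean plus or minus z * sd with z = Phi^{-1}(0.975).
z = NormalDist().inv_cdf(0.975)
lower, upper = 73 - z * 8, 73 + z * 8
print(t_bar, h_bar, on_line, estimate_1_hour, round(lower, 1), round(upper, 1))
```

The mean point is \((50, 72)\), the estimate for one hour is 67.7 bpm, and the interval is roughly 57.3 to 88.7 bpm.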
Hardest question Moderate -0.3 »
5 A hearing expert is investigating whether web-based hearing tests can be used instead of hearing tests in a hearing laboratory. The expert selects a random sample of 16 people with normal hearing. Each of them is given two hearing tests, one in the laboratory and one web-based. The scores in the laboratory-based test, \(x\), and the web-based test, \(y\), are both measured in the same suitable units.
  1. Half of the participants do the laboratory-based test first and the other half do the web-based test first. Explain why the expert adopts this approach.
    The scatter diagram in Fig. 5 shows the data that the expert collected.
    [Fig. 5: scatter diagram]
    Summary statistics for these data are as follows. $$\Sigma x = 198.0 \quad \Sigma x ^ { 2 } = 2936.92 \quad \Sigma y = 188.7 \quad \Sigma y ^ { 2 } = 2605.35 \quad \Sigma x y = 2554.87$$
  2. Calculate the equation of the regression line suitable for estimating web-based scores from laboratory-based scores.
  3. Estimate the web-based scores of people whose laboratory-based scores were as follows.
    Stating the approximate coordinates of the outlier, suggest what the expert should do.
Calculate regression line then predict

A question is this sub-type if and only if the student must first calculate the regression line equation from summary statistics (using formulas for gradient and intercept) before making a prediction.

11 Moderate -0.2
5.0% of questions
Show example »
2. An experiment carried out by a student yielded pairs of \(( x , y )\) observations such that $$\bar { x } = 36 , \quad \bar { y } = 28.6 , \quad S _ { x x } = 4402 , \quad S _ { x y } = 3477.6$$
  1. Calculate the equation of the regression line of \(y\) on \(x\) in the form \(y = a + b x\). Give your values of \(a\) and \(b\) to 2 decimal places.
  2. Find the value of \(y\) when \(x = 45\).
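A short sketch of the calculate-then-predict pattern for this example. Whether to predict with the rounded or the unrounded coefficients is a judgment call; the version below uses the 2 decimal place values the question asks for.

```python
# Summary statistics from the example above.
x_bar, y_bar = 36, 28.6
S_xx, S_xy = 4402, 3477.6

b = S_xy / S_xx
a = y_bar - b * x_bar

# The question asks for a and b to 2 decimal places.
a2, b2 = round(a, 2), round(b, 2)
y_at_45 = a2 + b2 * 45
print(f"y = {a2} + {b2}x, y(45) = {y_at_45:.2f}")
```

This gives \(y = 0.16 + 0.79x\) and an estimate of 35.71 at \(x = 45\).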
Easiest question Moderate -0.8 »
  1. The heights above sea level ( \(h\) hundred metres) and the temperatures ( \(t ^ { \circ } \mathrm { C }\) ) at 12 randomly selected places in France, at 7 am on July 31st, were recorded.
    The data are summarised as follows $$\sum h = 112 \quad \sum t = 136 \quad \sum t ^ { 2 } = 1828 \quad S _ { h t } = - 236 \quad S _ { h h } = 297$$
    1. Find the value of \(S _ { t t }\)
    2. Calculate the product moment correlation coefficient for these data.
    3. Interpret the relationship between \(t\) and \(h\).
    4. Find an equation of the regression line of \(t\) on \(h\).
    At 7 am on July 31st Yinka is on holiday in South Africa. He uses the regression equation to estimate the temperature when the height above sea level is 500 m .
  2. Find the estimated temperature Yinka calculates.
  3. Comment on the validity of your answer in part (e).
Hardest question Standard +0.3 »
9 Five observations of bivariate data produce the following results, denoted as ( \(x _ { i } , y _ { i }\) ) for \(i = 1,2,3,4,5\). $$\begin{aligned} & ( 13,2.7 ) \\ & { \left[ \Sigma x = 90 , \Sigma y = 15.0 , \Sigma x ^ { 2 } = 1720 , \Sigma y ^ { 2 } = 46.86 , \Sigma x y = 264.0 . \right] } \end{aligned}$$
  1. Show that the regression line of \(y\) on \(x\) has gradient - 0.06 , and find its equation in the form \(y = a + b x\).
  2. The regression line is used to estimate the value of \(y\) corresponding to \(x = 20\), but the value \(x = 20\) is accurate only to the nearest whole number. Calculate the difference between the largest and the smallest values that the estimated value of \(y\) could take. The numbers \(e _ { 1 } , e _ { 2 } , e _ { 3 } , e _ { 4 } , e _ { 5 }\) are defined by $$e _ { i } = a + b x _ { i } - y _ { i } \quad \text { for } i = 1,2,3,4,5$$
  3. The values of \(e _ { 1 } , e _ { 2 }\) and \(e _ { 3 }\) are \(0.6 , - 0.7\) and 0.2 respectively. Calculate the values of \(e _ { 4 }\) and \(e _ { 5 }\).
  4. Calculate the value of \(e _ { 1 } ^ { 2 } + e _ { 2 } ^ { 2 } + e _ { 3 } ^ { 2 } + e _ { 4 } ^ { 2 } + e _ { 5 } ^ { 2 }\) and explain the relevance of this quantity to the regression line found in part (i).
  5. Find the mean and the variance of \(e _ { 1 } , e _ { 2 } , e _ { 3 } , e _ { 4 } , e _ { 5 }\).
Calculate PMCC from summary statistics

Questions that provide summary statistics (such as Sxx, Syy, Sxy, sums of x, y, x², y², xy) and require calculating the product moment correlation coefficient using these given values.

10 Moderate -0.2
4.6% of questions
Show example »
A set of \(20\) pairs of bivariate data \((x, y)\) is summarised by $$\Sigma x = 200, \quad \Sigma x^2 = 2125, \quad \Sigma y = 240, \quad \Sigma y^2 = 8245.$$ The product moment correlation coefficient is \(-0.992\).
  1. What does the value of the product moment correlation coefficient indicate about a scatter diagram of the data points? [1]
  2. Find the equation of the regression line of \(y\) on \(x\). [6]
  3. The equation of the regression line of \(x\) on \(y\) is \(x = a' + b'y\). Find the value of \(b'\). [2]
Easiest question Easy -1.2 »
  1. Baako is investigating the times taken by children to run a 100 m race, \(x\) seconds, and a 500 m race, \(y\) seconds. For a sample of 20 children, Baako obtains the time taken by each child to run each race.
Here are Baako's summary statistics. $$\begin{gathered} \mathrm { S } _ { x x } = 314.55 \quad \mathrm {~S} _ { y y } = 9026 \quad \mathrm {~S} _ { x y } = 1610 \\ \bar { x } = 19.65 \quad \bar { y } = 108 \end{gathered}$$
  1. Calculate the product moment correlation coefficient between the times taken to run the 100 m race and the times taken to run the 500 m race.
  2. Show that the equation of the regression line of \(y\) on \(x\) can be written as $$y = 5.12 x + 7.42$$ where the gradient and \(y\) intercept are given to 3 significant figures. The child who completed the 100 m race in 20 seconds took 104 seconds to complete the 500 m race.
  3. Find the residual for this child. The table below shows the signs of the residuals for the 20 children in order of finishing time for the 100 m race.
    Sign of residual | + | + | + | + | - | - | + | - | - | - | - | - | - | - | - | + | + | + | + | +
  4. Explain what the signs of the residuals show about the model's predictions of the 500 m race times for the children who are fastest and slowest over the 100 m race.
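The first three parts of this question can be sketched directly from Baako's summary statistics. The residual is taken as observed minus predicted, using the 3 significant figure line quoted in the question; that sign convention is an assumption.

```python
from math import sqrt

# Baako's summary statistics.
S_xx, S_yy, S_xy = 314.55, 9026, 1610
x_bar, y_bar = 19.65, 108

r = S_xy / sqrt(S_xx * S_yy)   # product moment correlation coefficient
b = S_xy / S_xx                # gradient of y on x
a = y_bar - b * x_bar          # intercept of y on x

# Child with x = 20 s and y = 104 s, using y = 5.12x + 7.42 as quoted.
residual = 104 - (5.12 * 20 + 7.42)
print(f"r = {r:.3f}, y = {b:.2f}x + {a:.2f}, residual = {residual:.2f}")
```

The unrounded values agree with the quoted line \(y = 5.12x + 7.42\), with \(r \approx 0.956\) and a residual of \(-5.82\) seconds.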
Hardest question Standard +0.3 »
  1. Two students, Jim and Dora, collected data on the mean annual rainfall, \(w \mathrm {~cm}\), and the annual yield of leeks, \(l\) tonnes per hectare, for 10 years.
Jim summarised the data as follows $$\mathrm { S } _ { w l } = 42.786 \quad \mathrm {~S} _ { w w } = 9936.9 \quad \sum l ^ { 2 } = 26.2326 \quad \sum l = 16.06$$
  1. Find the product moment correlation coefficient between \(l\) and \(w\) Dora decided to code the data first using \(s = w - 6\) and \(t = l - 20\)
  2. Write down the value of the product moment correlation coefficient between \(s\) and \(t\). Give a justification for your answer. Dora calculates the equation of the regression line of \(t\) on \(s\) to be \(t = 0.00431 s - 18.87\)
  3. Find the equation of the regression line of \(l\) on \(w\) in the form \(l = a + b w\), giving the values of \(a\) and \(b\) to 3 significant figures.
  4. Use your equation to estimate the yield of leeks when \(w\) is 100 cm .
  5. Calculate the residual sum of squares. The graph shows the residual for each value of \(l\) \includegraphics[max width=\textwidth, alt={}, center]{7e46e14a-0f5a-4d02-8f00-a92bc4def6d7-08_716_1594_1594_239}
    1. State whether this graph suggests that the use of a linear regression model is suitable for these data. Give a reason for your answer.
    2. Other than collecting more data, suggest how to improve the fit of the model in part (c) to the data.
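Part 3 turns on substituting the coding equations \(s = w - 6\) and \(t = l - 20\) into Dora's line. A sketch of that substitution, using only values quoted in the question:

$$\begin{aligned} l - 20 &= 0.00431 ( w - 6 ) - 18.87 \\ l &= 0.00431 w + ( 20 - 18.87 - 0.00431 \times 6 ) \\ l &\approx 1.10 + 0.00431 w \end{aligned}$$

The gradient is unchanged because both codings are pure shifts. As a check, Jim's statistics give \(S_{wl} / S_{ww} = 42.786 / 9936.9 \approx 0.00431\).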
Find unknown values from regression

A question is this type if and only if it requires finding unknown data values given the regression line equation and some of the data points.

9 Standard +0.7
4.1% of questions
Show example »
The values from a random sample of five pairs \((x, y)\) taken from a bivariate distribution are shown below.
\(x\) | 3 | 4 | 4 | 6 | 8
\(y\) | 5 | 7 | \(q\) | 6 | 7
The equation of the regression line of \(x\) on \(y\) is given by \(x = \frac{5}{4}y + c\).
  1. Given that \(q\) is an integer, find its value. [5]
  2. Find the value of \(c\). [3]
  3. Find the value of the product moment correlation coefficient. [3]
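One way to check an answer to this question is to trial integer values of \(q\) until the gradient \(S_{xy} / S_{yy}\) of the line of \(x\) on \(y\) equals \(\frac{5}{4}\). The sketch below does this with exact fractions. The search range and the name `gradient_x_on_y` are arbitrary choices made for illustration, and an exam solution would instead solve the resulting quadratic.

```python
from fractions import Fraction

x = [3, 4, 4, 6, 8]

def gradient_x_on_y(q):
    """Gradient Sxy / Syy of the regression line of x on y for a trial q."""
    y = [5, 7, q, 6, 7]
    n = len(x)
    s_xy = sum(a * b for a, b in zip(x, y)) - Fraction(sum(x) * sum(y), n)
    s_yy = sum(v * v for v in y) - Fraction(sum(y) ** 2, n)
    return s_xy / s_yy

# Trial integer values of q; the range -20..20 is an arbitrary assumption.
solutions = [q for q in range(-20, 21) if gradient_x_on_y(q) == Fraction(5, 4)]
q = solutions[0]

# The line of x on y passes through the mean point, which fixes c.
y = [5, 7, q, 6, 7]
c = Fraction(sum(x), 5) - Fraction(5, 4) * Fraction(sum(y), 5)
print(solutions, c)
```

Algebraically the condition reduces to \(2q^2 - 23q + 65 = 0\), whose roots are 5 and 6.5, so the integer solution is \(q = 5\) and then \(c = -\frac{5}{2}\).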
Easiest question Standard +0.3 »
10 The values from a random sample of five pairs \(( x , y )\) taken from a bivariate distribution are shown below.
\(x\) | 3 | 4 | 4 | 6 | 8
\(y\) | 5 | 7 | \(q\) | 6 | 7
The equation of the regression line of \(x\) on \(y\) is given by \(x = \frac { 5 } { 4 } y + c\).
  1. Given that \(q\) is an integer, find its value.
  2. Find the value of \(c\).
  3. Find the value of the product moment correlation coefficient.
Hardest question Challenging +1.2 »
8 For a random sample of 6 observations of pairs of values \(( x , y )\), the equation of the regression line of \(y\) on \(x\) is \(y = b x + 1.306\), where \(b\) is a constant. The corresponding equation of the regression line of \(x\) on \(y\) is \(x = 0.6331 y + d\), where \(d\) is a constant. The values of \(x\) from the sample are $$\begin{array} { l l l l l l } 2.3 & 2.8 & 3.7 & p & 6.1 & 6.4 \end{array}$$ and the sum of the values of \(y\) is 46.5 . The product moment correlation coefficient is 0.9797 .
  1. Find the value of \(b\) correct to 3 decimal places.
  2. Find the value of \(p\).
  3. Use the equation of the regression line of \(x\) on \(y\) to estimate the value of \(x\) when \(y = 8.5\).
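This question relies on two facts: the gradients of the two regression lines multiply to \(r^2\), and both lines pass through \((\bar{x}, \bar{y})\). A sketch of the arithmetic, with variable names chosen here for illustration:

```python
# Values quoted in the question.
n = 6
r = 0.9797
b_x_on_y = 0.6331
intercept_y_on_x = 1.306
known_x = [2.3, 2.8, 3.7, 6.1, 6.4]
sum_y = 46.5

# The two gradients multiply to r^2, so b = r^2 / b'.
b = round(r ** 2 / b_x_on_y, 3)

# The y-on-x line passes through the mean point, which fixes x_bar and hence p.
y_bar = sum_y / n
x_bar = (y_bar - intercept_y_on_x) / b
p = n * x_bar - sum(known_x)

# The x-on-y line passes through the same point, which fixes d.
d = x_bar - b_x_on_y * y_bar
x_at_8_5 = b_x_on_y * 8.5 + d
print(b, round(p, 1), round(x_at_8_5, 2))
```

This gives \(b = 1.516\), \(p \approx 4.2\), and an estimate of about 4.73 for \(x\) when \(y = 8.5\).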
Interpret features of scatter diagram

A question is this sub-type if and only if it provides a scatter diagram and requires interpretation of its features such as correlation strength, outliers, or relationship patterns without requiring drawing.

9 Moderate -0.6
4.1% of questions
Show example »
2 A road transport researcher is investigating the link between the age of a person, a years, and the distance, \(d\) metres, at which the person can read a large road sign. The researcher selects 13 individuals of different ages between 20 and 80 and measures the value of \(d\) for each of them. The spreadsheet below shows the data which the researcher obtained, together with a scatter diagram which illustrates the data. \includegraphics[max width=\textwidth, alt={}, center]{691e8b55-e9a1-4fff-b9ee-a71ff1f73ead-3_725_1566_495_251}
  1. Explain which of the two variables \(a\) and \(d\) is the independent variable.
  2. Find the equation of the regression line of \(d\) on \(a\).
  3. Use the regression line to predict the average distance at which a 60-year-old person can read the road sign.
  4. Explain why it might not be sensible to use the regression line to predict the average distance at which a 5 -year-old child can read the road sign.
  5. Determine the value of the residual for \(a = 40\).
  6. Explain why it would not be useful to find the equation of the regression line of \(a\) on \(d\).
Easiest question Easy -1.8 »
Fig. 16.1, Fig. 16.2 and Fig. 16.3 show some data about life expectancy, including some from the pre-release data set. \includegraphics{figure_16_1} \includegraphics{figure_16_2} \includegraphics{figure_16_3}
  1. Comment on the shapes of the distributions of life expectancy at birth in 2014 and 1974. [2]
    1. The minimum value shown in the box plot is negative. What does a negative value indicate? [1]
    2. What feature of Fig 16.3 suggests that a Normal distribution would not be an appropriate model for increase in life expectancy from one year to another year? [1]
    3. Software has been used to obtain the values in the table in Fig. 16.3. Decide whether the level of accuracy is appropriate. Justify your answer. [1]
    4. John claims that for half the people in the world their life expectancy has improved by 10 years or more. Explain why Fig. 16.3 does not provide conclusive evidence for John's claim. [1]
  2. Decide whether the maximum increase in life expectancy from 1974 to 2014 is an outlier. Justify your answer. [3]
Here is some further information from the pre-release data set.
Country | Life expectancy at birth in 2014
Ethiopia | 60.8
Sweden | 81.9
    1. Estimate the change in life expectancy at birth for Ethiopia between 1974 and 2014.
    2. Estimate the change in life expectancy at birth for Sweden between 1974 and 2014.
    3. Give one possible reason why the answers to parts (i) and (ii) are so different. [4]
Fig. 16.4 shows the relationship between life expectancy at birth in 2014 and 1974. \includegraphics{figure_16_4} A spreadsheet gives the following linear model for all the data in Fig 16.4. (Life expectancy at birth 2014) = 30.98 + 0.67 × (Life expectancy at birth 1974) The life expectancy at birth in 1974 for the region that now constitutes the country of South Sudan was 37.4 years. The value for this country in 2014 is not available.
    1. Use the linear model to estimate the life expectancy at birth in 2014 for South Sudan. [2]
    2. Give two reasons why your answer to part (i) is not likely to be an accurate estimate for the life expectancy at birth in 2014 for South Sudan. You should refer to both information from Fig 16.4 and your knowledge of the large data set. [2]
  1. In how many of the countries represented in Fig. 16.4 did life expectancy drop between 1974 and 2014? Justify your answer. [3]
Hardest question Standard +0.3 »
3 In a triathlon, competitors have to swim 600 metres, cycle 40 kilometres and run 10 kilometres. To improve her strength, a triathlete undertakes a training programme in which she carries weights in a rucksack whilst running. She runs a specific course and notes the total time taken for each run. Her coach is investigating the relationship between time taken and weight carried. The times taken with eight different weights are illustrated on the scatter diagram below, together with the summary statistics for these data. The variables \(x\) and \(y\) represent weight carried in kilograms and time taken in minutes respectively. \includegraphics[max width=\textwidth, alt={}, center]{d138173d-c70c-46db-b9b9-d5f19334c5f1-04_627_1536_630_281} Summary statistics: \(n = 8 , \Sigma x = 36 , \Sigma y = 214.8 , \Sigma x ^ { 2 } = 204 , \Sigma y ^ { 2 } = 5775.28 , \Sigma x y = 983.6\).
  1. Calculate the equation of the regression line of \(y\) on \(x\). On one of the eight runs, the triathlete was carrying 4 kilograms and took 27.5 minutes. On this run she was delayed when she tripped and fell over.
  2. Calculate the value of the residual for this weight.
  3. The coach decides to recalculate the equation of the regression line without the data for this run. Would it be preferable to use this recalculated equation or the equation found in part (i) to estimate the delay when the triathlete tripped and fell over? Explain your answer. The triathlete's coach claims that there is positive correlation between cycling and swimming times in triathlons. The product moment correlation coefficient of the times of twenty randomly selected competitors in these two sections is 0.209 .
  4. Carry out a hypothesis test at the \(5 \%\) level to examine the coach's claim, explaining your conclusions clearly.
  5. What distributional assumption is necessary for this test to be valid? How can you use a scatter diagram to decide whether this assumption is likely to be true?
Relate two regression lines

A question is this type if and only if it asks to find the correlation coefficient or other relationship given both regression line equations (y on x and x on y).

7 Standard +0.6
3.2% of questions
Show example »
7 For a random sample of 10 observations of pairs of values \(( x , y )\), the equation of the regression line of \(y\) on \(x\) is \(y = 3.25 x - 4.27\). The sum of the ten \(x\) values is 15.6 and the product moment correlation coefficient for the sample is 0.56 . Find the equation of the regression line of \(x\) on \(y\). Test, at the \(5 \%\) significance level, whether there is evidence of non-zero correlation between the variables.
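A sketch of the first part, writing \(b\) for the gradient of \(y\) on \(x\) and \(b'\) for the gradient of \(x\) on \(y\) (notation introduced here), and using \(b b' = r^2\) together with the fact that both lines pass through \((\bar{x}, \bar{y})\):

$$\begin{aligned} b' &= \frac { r ^ { 2 } } { b } = \frac { 0.56 ^ { 2 } } { 3.25 } \approx 0.0965 \\ \bar { x } &= \frac { 15.6 } { 10 } = 1.56 , \quad \bar { y } = 3.25 \times 1.56 - 4.27 = 0.80 \\ x - 1.56 &= 0.0965 ( y - 0.80 ) \quad \Rightarrow \quad x \approx 0.0965 y + 1.48 \end{aligned}$$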
Calculate from summary statistics

A question is this sub-type if and only if it provides summary statistics (such as Σx, Σy, Σx², Σy², Σxy, n) and asks to calculate Sxx, Syy, or Sxy using the standard formulas.

7 Moderate -0.3
3.2% of questions
Show example »
  1. The production cost, \(\pounds c\) million, of a film and the total ticket sales, \(\pounds t\) million, earned by the film are recorded for a sample of 40 films.
Some summary statistics are given below. $$\sum c = 1634 \quad \sum t = 1361 \quad \sum t ^ { 2 } = 82873 \quad \sum c t = 83634 \quad \mathrm {~S} _ { c c } = 28732.1$$
  1. Find the exact value of \(\mathrm { S } _ { t t }\) and the exact value of \(\mathrm { S } _ { c t }\)
  2. Calculate the value of the product moment correlation coefficient for these data.
  3. Give an interpretation of your answer to part (b)
  4. Show that the equation of the linear regression line of \(t\) on \(c\) can be written as $$t = - 5.84 + 0.976 c$$ where the values of the intercept and gradient are given to 3 significant figures.
  5. Find the expected total ticket sales for a film with a production cost of \(\pounds 90\) million.
  6. Using the regression line in part (d), find the range of values of the production cost of a film for which the total ticket sales are less than \(80 \%\) of its production cost.
View full question →
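The standard summary-statistic formulas behind this question type can be sketched in Python as follows. The figures are those quoted in the example above; the variable names are assumptions chosen for readability.

```python
from math import sqrt

# Summary statistics quoted in the example (c = cost, t = ticket sales)
n = 40
sum_c, sum_t = 1634, 1361
sum_t2, sum_ct = 82873, 83634
S_cc = 28732.1

# S_tt = sum(t^2) - (sum t)^2 / n and S_ct = sum(ct) - (sum c)(sum t) / n
S_tt = sum_t2 - sum_t**2 / n
S_ct = sum_ct - sum_c * sum_t / n

r = S_ct / sqrt(S_cc * S_tt)        # product moment correlation coefficient
b = S_ct / S_cc                     # gradient of t on c
a = sum_t / n - b * sum_c / n       # intercept, from the mean point

print(f"S_tt = {S_tt:.3f}, S_ct = {S_ct:.2f}, r = {r:.3f}")
print(f"t = {a:.2f} + {b:.3f} c")
```

To 3 significant figures the intercept and gradient agree with the line \(t = -5.84 + 0.976c\) stated in part (d).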
Linearize non-linear relationships

A question is this type if and only if it involves transforming a non-linear relationship (e.g., y = Ca^x) into linear form by taking logarithms or other transformations to enable linear regression.

6 Standard +0.4
2.7% of questions
Show example »
  1. Show that there is a linear relationship between \(Y\) and \(X\).
  2. The graph of \(Y\) against \(X\) is shown in the diagram. \includegraphics[max width=\textwidth, alt={}, center]{cf9337b9-b766-4ce5-967c-5d7522e2aa42-4_748_858_849_593} Find the value of \(n\) and the value of \(a\).
View full question →
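For the exponential model named in the description, taking logarithms gives a straight line. The power-law case is shown for comparison, with \(k\) as an illustrative constant; which model the example above uses cannot be recovered from the extract.

```latex
\begin{aligned}
y = C a^{x} \;&\Longrightarrow\; \ln y = \ln C + (\ln a)\, x
  && \text{(plot } \ln y \text{ against } x\text{)} \\
y = k x^{n} \;&\Longrightarrow\; \ln y = \ln k + n \ln x
  && \text{(plot } \ln y \text{ against } \ln x\text{)}
\end{aligned}
```

In each case the gradient and intercept of the fitted line give the constants after exponentiating where needed.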
Find means from regression lines

A question is this type if and only if it asks to find the mean values of x and y given the equations of both regression lines (using the fact that both pass through the mean point).

5 Moderate -0.1
2.3% of questions
Show example »
8 The equations of the regression lines for a random sample of 25 pairs of data \(( x , y )\) from a bivariate population are $$\begin{array} { c c } y \text { on } x : & y = 1.28 - 0.425 x , \\ x \text { on } y : & x = 1.05 - 0.516 y . \end{array}$$
  1. Find the sample means, \(\bar { x }\) and \(\bar { y }\).
  2. Find the product moment correlation coefficient for the sample.
  3. Test at the \(5 \%\) significance level whether the population correlation coefficient differs from zero.
View full question →
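Because both regression lines pass through \((\bar{x}, \bar{y})\), the means are the solution of the two equations taken simultaneously. A short Python sketch using the coefficients from the example above (variable names are illustrative):

```python
from math import sqrt

# y on x: y = a1 + b1 x ;  x on y: x = a2 + b2 y  (coefficients from the example)
a1, b1 = 1.28, -0.425
a2, b2 = 1.05, -0.516

# Substitute the first line into the second and solve for x_bar.
x_bar = (a2 + b2 * a1) / (1 - b1 * b2)
y_bar = a1 + b1 * x_bar

# r^2 = b1 * b2, and r takes the common sign of the two gradients.
r = -sqrt(b1 * b2) if b1 < 0 else sqrt(b1 * b2)

print(f"x_bar = {x_bar:.3f}, y_bar = {y_bar:.3f}, r = {r:.3f}")
```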
Calculate PMCC from raw data

Questions that provide raw bivariate data in a table and require calculating the product moment correlation coefficient directly from the individual data values.

5 Moderate -0.5
2.3% of questions
Show example »
3 An investor obtains data about the profits of 8 randomly chosen investment accounts over two one-year periods. The profit in the first year for each account is \(p \%\) and the profit in the second year for each account is \(q \%\). The results are shown in the table and in the scatter diagram.
$$\begin{array} { l | c c c c c c c c } \text { Account } & \mathrm { A } & \mathrm { B } & \mathrm { C } & \mathrm { D } & \mathrm { E } & \mathrm { F } & \mathrm { G } & \mathrm { H } \\ \hline p & 1.6 & 2.1 & 2.4 & 2.7 & 2.8 & 3.3 & 5.2 & 8.4 \\ q & 1.6 & 2.3 & 2.2 & 2.2 & 3.1 & 2.9 & 7.6 & 4.8 \end{array}$$
\(n = 8 \quad \sum p = 28.5 \quad \sum q = 26.7 \quad \sum p ^ { 2 } = 136.35 \quad \sum q ^ { 2 } = 116.35 \quad \sum p q = 116.70\)
\includegraphics[max width=\textwidth, alt={}, center]{bf1468d1-e02e-47d2-bf41-5bc8f5b4d7c4-3_782_1280_998_242}
  1. State which, if either, of the variables \(p\) and \(q\) is independent.
  2. Calculate the equation of the regression line of \(q\) on \(p\).
    1. Use the regression line to estimate the value of \(q\) for an investment account for which \(p = 2.5\).
    2. Give two reasons why this estimate could be considered reliable.
  3. Comment on the reliability of using the regression line to predict the value of \(q\) when \(p = 7.0\).
View full question →
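Working directly from the individual values, the calculation can be sketched in Python as below. The eight \((p, q)\) pairs are read from the example's table and reproduce its quoted sums; the helper name `s` is an assumption made for this sketch.

```python
from math import sqrt

# Profits for accounts A to H, read from the example's table
p = [1.6, 2.1, 2.4, 2.7, 2.8, 3.3, 5.2, 8.4]
q = [1.6, 2.3, 2.2, 2.2, 3.1, 2.9, 7.6, 4.8]
n = len(p)

def s(u, v):
    """Return S_uv = sum(u*v) - sum(u)*sum(v)/n."""
    return sum(ui * vi for ui, vi in zip(u, v)) - sum(u) * sum(v) / len(u)

S_pp, S_qq, S_pq = s(p, p), s(q, q), s(p, q)
r = S_pq / sqrt(S_pp * S_qq)          # product moment correlation coefficient

b = S_pq / S_pp                       # gradient of q on p
a = sum(q) / n - b * sum(p) / n       # intercept of q on p

print(f"r = {r:.3f}; q = {a:.3f} + {b:.3f} p")
```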
Hypothesis test for zero correlation

Questions that require testing whether the population correlation coefficient is zero (or equivalently, whether there is significant correlation) using the product moment correlation coefficient and t-distribution or critical value tables.

5 Standard +0.2
2.3% of questions
Show example »
9 A random sample of 8 students is chosen from those sitting examinations in both Mathematics and French. Their marks in Mathematics, \(x\), and in French, \(y\), are summarised as follows. $$\Sigma x = 472 \quad \Sigma x ^ { 2 } = 29950 \quad \Sigma y = 400 \quad \Sigma y ^ { 2 } = 21226 \quad \Sigma x y = 24879$$ Another student scored 72 marks in the Mathematics examination but was unable to sit the French examination.
  1. Estimate the mark that this student would have obtained in the French examination.
  2. Test, at the \(5 \%\) significance level, whether there is non-zero correlation between marks in Mathematics and marks in French.
View full question →
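One way to carry out this test is through the statistic \(t = r\sqrt{(n-2)/(1-r^2)}\) on \(n-2\) degrees of freedom; exam mark schemes often compare \(r\) with a tabulated critical value instead, which is equivalent. The Python sketch below uses the sums from the example above. The critical value 2.447 is the two-tailed 5% point of \(t\) with 6 degrees of freedom from standard tables, and the variable names are illustrative.

```python
from math import sqrt

n = 8
sum_x, sum_x2 = 472, 29950
sum_y, sum_y2 = 400, 21226
sum_xy = 24879

S_xx = sum_x2 - sum_x**2 / n
S_yy = sum_y2 - sum_y**2 / n
S_xy = sum_xy - sum_x * sum_y / n
r = S_xy / sqrt(S_xx * S_yy)

# H0: rho = 0 against H1: rho != 0
t_stat = r * sqrt((n - 2) / (1 - r**2))
T_CRIT = 2.447                      # t(6), two-tailed 5% (standard tables)
reject_h0 = abs(t_stat) > T_CRIT

# Part 1: estimate y at x = 72 from the y-on-x line through the mean point
b = S_xy / S_xx
estimate = sum_y / n + b * (72 - sum_x / n)

print(f"r = {r:.3f}, t = {t_stat:.2f}, reject H0: {reject_h0}")
print(f"estimated French mark: {estimate:.1f}")
```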
Comment on reliability/validity of prediction

Questions that ask whether a regression line provides reliable estimates for a given value, whether extrapolation is appropriate, or to comment on the validity/reliability of using the model for a specific prediction.

5 Moderate -0.4
2.3% of questions
Show example »
1 A set of bivariate data ( \(X , Y\) ) is summarised as follows. \(n = 25 , \sum x = 9.975 , \sum y = 11.175 , \sum x ^ { 2 } = 5.725 , \sum y ^ { 2 } = 46.200 , \sum x y = 11.575\)
  1. Calculate the value of Pearson's product-moment correlation coefficient.
  2. Calculate the equation of the regression line of \(y\) on \(x\).
It is desired to know whether the regression line of \(y\) on \(x\) will provide a reliable estimate of \(y\) when \(x = 0.75\).
  3. State one reason for believing that the estimate will be reliable.
  4. State what further information is needed in order to determine whether the estimate is reliable.
View full question →
Explain least squares concept

A question is this type if and only if it asks to explain what is meant by 'least squares' in the context of regression, typically requiring reference to minimizing sum of squared residuals.

4 Moderate -0.3
1.8% of questions
Show example »
7 The diagram shows the results of an experiment involving some bivariate data. The least squares regression line of \(y\) on \(x\) for these results is also shown. \includegraphics[max width=\textwidth, alt={}, center]{48ffcd44-d933-40e0-818a-20d6db607298-5_748_919_390_612}
  1. Given that the least squares regression line of \(y\) on \(x\) is used for an estimation, state which of \(x\) or \(y\) is treated as the independent variable.
  2. Use the diagram to explain what is meant by 'least squares'.
  3. State, with a reason, the value of Spearman's rank correlation coefficient for these data.
  4. What can be said about the value of the product moment correlation coefficient for these data?
View full question →
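In symbols, 'least squares' for the regression line of \(y\) on \(x\) means choosing \(a\) and \(b\) to minimise the sum of the squared vertical distances from the points to the line. A compact statement, using the standard \(S_{xx}\), \(S_{xy}\) notation:

```latex
\text{minimise}\quad \sum_{i=1}^{n} \bigl( y_i - a - b x_i \bigr)^2
\qquad\Longrightarrow\qquad
b = \frac{S_{xy}}{S_{xx}}, \qquad a = \bar{y} - b\,\bar{x}.
```

The residual \(y_i - a - b x_i\) is measured parallel to the \(y\)-axis, which is why \(x\) is treated as the independent variable for this line.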
Calculate x on y regression line

Questions that ask to find the regression line of x on y (the reverse regression), either from summary statistics or raw data.

4 Standard +0.1
1.8% of questions
Show example »
For a random sample of 10 observations of pairs of values \((x, y)\), the equation of the regression line of \(y\) on \(x\) is \(y = 1.1664 + 0.4604x\). It is given that $$\Sigma x^2 = 1419.98 \quad \text{and} \quad \Sigma y^2 = 439.68.$$ The mean value of \(y\) is 6.24.
  1. Find the equation of the regression line of \(x\) on \(y\). [6]
  2. Find the product moment correlation coefficient. [2]
  3. Test at the 5\% significance level whether there is evidence of positive correlation between the two variables. [4]
View full question →
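A Python sketch of one possible route through the example above: recover \(\bar{x}\) from the given line, build \(S_{xx}\), \(S_{yy}\) and \(S_{xy}\), then form the line of \(x\) on \(y\). The variable names are assumptions for this sketch.

```python
from math import sqrt

n = 10
a, b = 1.1664, 0.4604               # y on x: y = a + b x (from the example)
sum_x2, sum_y2 = 1419.98, 439.68
y_bar = 6.24

x_bar = (y_bar - a) / b             # the line passes through the mean point
S_xx = sum_x2 - n * x_bar**2
S_yy = sum_y2 - n * y_bar**2
S_xy = b * S_xx                     # because b = S_xy / S_xx

d = S_xy / S_yy                     # gradient of x on y
c = x_bar - d * y_bar               # intercept of x on y
r = S_xy / sqrt(S_xx * S_yy)

print(f"x on y: x = {c:.3f} + {d:.3f} y, r = {r:.3f}")
```

Note that the \(x\) on \(y\) line is not obtained by rearranging the \(y\) on \(x\) equation; the two coincide only when \(r = \pm 1\).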
Interpret correlation strength/direction

A question is this type if and only if it asks to describe, interpret, or comment on the type, strength, or direction of correlation from a given correlation coefficient or scatter diagram.

3 Moderate -0.8
1.4% of questions
Show example »
1 For each of the last five years the number of tourists, \(x\) thousands, visiting Sackton, and the average weekly sales, \(\pounds y\) thousands, in Sackton Stores were noted. The table shows the results.
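$$\begin{array} { l | c c c c c } \text { Year } & 2007 & 2008 & 2009 & 2010 & 2011 \\ \hline x & 250 & 270 & 264 & 290 & 292 \\ y & 4.2 & 3.7 & 3.2 & 3.5 & 3.0 \end{array}$$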
  1. Calculate the product moment correlation coefficient \(r\) between \(x\) and \(y\).
  2. It is required to estimate the average weekly sales at Sackton Stores in a year when the number of tourists is 280000. Calculate the equation of an appropriate regression line, and use it to find this estimate.
  3. Over a longer period the value of \(r\) is \(-0.8\). The mayor says, "This shows that having more tourists causes sales at Sackton Stores to decrease." Give a reason why this statement is not correct.
View full question →
Draw scatter diagram from data

A question is this sub-type if and only if it explicitly requires the student to draw or plot a scatter diagram from given data values.

3 Moderate -0.9
1.4% of questions
Show example »
3. A manufacturer stores drums of chemicals. During storage, evaporation takes place. A random sample of 10 drums was taken and the time in storage, \(x\) weeks, and the evaporation loss, \(y \mathrm { ml }\), are shown in the table below.
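$$\begin{array} { l | c c c c c c c c c c } x & 3 & 5 & 6 & 8 & 10 & 12 & 13 & 15 & 16 & 18 \\ \hline y & 36 & 50 & 53 & 61 & 69 & 79 & 82 & 90 & 88 & 96 \end{array}$$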
  1. On graph paper, draw a scatter diagram to represent these data.
  2. Give a reason to support fitting a regression model of the form \(y = a + b x\) to these data.
  3. Find, to 2 decimal places, the value of \(a\) and the value of \(b\). $$\text { (You may use } \Sigma x ^ { 2 } = 1352 , \Sigma y ^ { 2 } = 53112 \text { and } \Sigma x y = 8354 \text {.) }$$
  4. Give an interpretation of the value of \(b\).
  5. Using your model, predict the amount of evaporation that would take place after
    1. 19 weeks,
    2. 35 weeks.
  6. Comment, with a reason, on the reliability of each of your predictions.
View full question →
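The regression and prediction steps in the example above can be sketched in Python. The ten \((x, y)\) pairs are read from the example's table and reproduce the quoted \(\Sigma x^2\), \(\Sigma y^2\) and \(\Sigma xy\); the function name `predict` and its range flag are assumptions made for illustration.

```python
# Storage time (weeks) and evaporation loss (ml), from the example's table
xs = [3, 5, 6, 8, 10, 12, 13, 15, 16, 18]
ys = [36, 50, 53, 61, 69, 79, 82, 90, 88, 96]
n = len(xs)

x_bar, y_bar = sum(xs) / n, sum(ys) / n
S_xx = sum(x * x for x in xs) - n * x_bar**2
S_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar

b = S_xy / S_xx           # gradient: extra ml lost per extra week
a = y_bar - b * x_bar     # intercept

def predict(x):
    """Return (estimate, inside_range) for storage time x in weeks."""
    return a + b * x, min(xs) <= x <= max(xs)

print(f"y = {a:.2f} + {b:.2f} x")
for weeks in (19, 35):
    estimate, inside = predict(weeks)
    label = "interpolation" if inside else "extrapolation"
    print(f"{weeks} weeks: {estimate:.1f} ml ({label})")
```

Both 19 and 35 weeks lie outside the observed range of 3 to 18 weeks, but 19 is only just beyond it, so that estimate is usually judged the more reliable of the two.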
Calculate from raw data

A question is this sub-type if and only if it provides raw data values and asks to calculate Sxx, Syy, or Sxy directly from those values.

3 Moderate -0.6
1.4% of questions
Show example »
  1. The percentage oil content, \(p\), and the weight, \(w\) milligrams, of each of 10 randomly selected sunflower seeds were recorded. These data are summarised below.
$$\sum w ^ { 2 } = 41252 \quad \sum w p = 27557.8 \quad \sum w = 640 \quad \sum p = 431 \quad \mathrm {~S} _ { p p } = 2.72$$
  1. Find the value of \(\mathrm { S } _ { w w }\) and the value of \(\mathrm { S } _ { w p }\)
  2. Calculate the product moment correlation coefficient between \(p\) and \(w\)
  3. Give an interpretation of your product moment correlation coefficient.
The equation of the regression line of \(p\) on \(w\) is given in the form \(p = a + b w\)
  4. Find the equation of the regression line of \(p\) on \(w\)
  5. Hence estimate the percentage oil content of a sunflower seed which weighs 60 milligrams.
View full question →
Assess model appropriateness from context

Questions that ask students to assess whether a linear regression model is appropriate based on contextual factors, scatter diagrams, or theoretical considerations (not residual plots).

3 Moderate -0.1
1.4% of questions
Show example »
6. The University of Arizona surveyed a large number of households. One purpose of the survey was to determine if annual household income could be predicted from size of family home. The graph of Annual household income, \(y\), versus Size of family home, \(x\), is shown below. \includegraphics[max width=\textwidth, alt={}, center]{4ecf99c5-c4b3-41b7-a8df-a7c2ca7fcd6a-5_616_1257_566_365}
  1. State the limitations of using the regression line above with reference to the scatter diagram.
The data for size of family homes between 2000 and 3000 square feet are shown in the diagram below. \includegraphics[max width=\textwidth, alt={}, center]{4ecf99c5-c4b3-41b7-a8df-a7c2ca7fcd6a-5_652_1244_1516_360}
Summary statistics for these data are as follows. $$\begin{array} { r c c } \sum x = 93160 & \sum y = 3907142 & n = 37 \\ S _ { x x } = 2869673.03 & S _ { y y } = 44312797167 & S _ { x y } = 348512820.6 \end{array}$$
  2. Calculate the equation of the least squares regression line to predict Annual household income from Size of family home for these data.
View full question →
Minimize sum of squared residuals

A question is this type if and only if it involves algebraically minimizing an expression for the sum of squared residuals to derive regression line parameters.

2 Moderate -0.1
0.9% of questions
Show example »
7 The coordinates of a set of 10 points are denoted by ( \(\mathrm { x } _ { \mathrm { i } } , \mathrm { y } _ { \mathrm { i } }\) ) for \(i = 1,2 , \ldots , 10\). For a particular set of values of ( \(\mathrm { x } _ { \mathrm { i } } , \mathrm { y } _ { \mathrm { i } }\) ) and any constants \(a\) and \(b\) it can be shown that \(\Sigma \left( y _ { i } - a - b x _ { i } \right) ^ { 2 } = 10 ( 11 - a - 6 b ) ^ { 2 } + 126 \left( b - \frac { 83 } { 42 } \right) ^ { 2 } + \frac { 139 } { 14 }\).
    1. Explain why \(\sum \left( \mathrm { y } _ { \mathrm { i } } - \mathrm { a } - \mathrm { bx } _ { \mathrm { i } } \right) ^ { 2 }\) is minimised by taking \(b = \frac { 83 } { 42 }\) and \(\mathrm { a } = 11 - 6 \mathrm {~b}\).
    2. Hence explain why the equation of the regression line of \(y\) on \(x\) for these points is given by the corresponding values of \(a\) and \(b\) (so that the equation is \(\mathrm { y } = \frac { 83 } { 42 } \mathrm { x } - \frac { 6 } { 7 }\) ).
  2. State which of the following terms cannot apply to the variable \(X\) if the regression line of \(y\) on \(x\) can be used for estimating values of \(Y\). Dependent Independent Controlled Response
  3. Use the regression line to estimate the value of \(y\) corresponding to \(x = 8\).
  4. State what must be true of the value \(x = 8\) if the estimate in part (c) is to be reliable.
  5. Variables \(u\) and \(v\) are related to \(x\) and \(y\) by the following relationships. \(u = 2 + 4 x \quad v = 8 - 2 y\) Show that the gradient of the regression line of \(v\) on \(u\) is very close to \(-1\).
View full question →
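The completed-square argument in the example above can be checked with exact arithmetic. The Python sketch below evaluates the identity quoted in the question; the function name `rss` is an assumption, and the last step uses the standard effect of linear coding on \(S_{xy}\) and \(S_{xx}\).

```python
from fractions import Fraction

def rss(a, b):
    """Sum of squared residuals, in the completed-square form from the example."""
    return (10 * (11 - a - 6 * b) ** 2
            + 126 * (b - Fraction(83, 42)) ** 2
            + Fraction(139, 14))

b_hat = Fraction(83, 42)      # makes the second square zero
a_hat = 11 - 6 * b_hat        # makes the first square zero
min_rss = rss(a_hat, b_hat)   # only the constant term remains

# Final part: u = 2 + 4x and v = 8 - 2y scale S_xy by 4 * (-2) and S_xx by 4^2,
# so the gradient of v on u is (-8 / 16) times the gradient of y on x.
grad_v_on_u = Fraction(4 * -2, 4 ** 2) * b_hat

print(f"a = {a_hat}, b = {b_hat}, minimum = {min_rss}")
print(f"gradient of v on u = {grad_v_on_u} = {float(grad_v_on_u):.3f}")
```

Since both squared terms are non-negative, any other choice of \(a\) or \(b\) gives a larger total than \(\frac{139}{14}\).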
Direct prediction from given regression line

A question is this sub-type if and only if the regression line equation is already provided in the question and the task is simply to substitute a value to make a prediction.

2 Moderate -1.0
0.9% of questions
Show example »
2 A student is investigating the link between temperature and electricity consumption in the winter months. The student finds the average minimum temperature, \(x ^ { \circ } \mathrm { C }\), from across the country on a day. The student then finds the total electricity consumption for that day, \(y \mathrm { GWh }\). The scatter diagram below shows the values of \(x\) and \(y\) obtained from a random sample of 10 winter days. It also shows the equation of the regression line of \(y\) on \(x\) and the value of \(r ^ { 2 }\), where \(r\) is the product moment correlation coefficient. \includegraphics[max width=\textwidth, alt={}, center]{c692fb20-436f-4bc1-89bd-10fdba41ceba-03_776_1043_609_244}
  1. Use the regression line to estimate the electricity consumption at each of the following average minimum temperatures.
View full question →
Calculate variance from summations

A question is this type if and only if it asks to calculate the variance of x or y from given summary statistics like Σx, Σx², and n.

1 Moderate -0.8
0.5% of questions
Show example »
  1. A company wants to pay its employees according to their performance at work. Last year's performance score \(x\) and annual salary \(y\), in thousands of dollars, were recorded for a random sample of 10 employees of the company.
The performance scores were $$\begin{array} { l l l l l l l l l l } 15 & 24 & 32 & 39 & 41 & 18 & 16 & 22 & 34 & 42 \end{array}$$ (You may use \(\sum x ^ { 2 } = 9011\).)
  1. Find the mean and the variance of these performance scores.
The corresponding \(y\) values for these 10 employees are summarised by $$\sum y = 306.1 \quad \text { and } \quad \mathrm { S } _ { y y } = 546.3$$
  2. Find the mean and the variance of these \(y\) values.
The regression line of \(y\) on \(x\) based on this sample is $$y = 12.0 + 0.659 x$$
  3. Find the product moment correlation coefficient for these data.
  4. State, giving a reason, whether or not the value of the product moment correlation coefficient supports the use of a regression line to model the relationship between performance score and annual salary.
The company decides to use this regression model to determine future salaries.
  5. Find the proposed annual salary, in dollars, for an employee who has a performance score of 35
View full question →
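A Python sketch of the variance calculation from summations, using the figures in the example above. It assumes the divisor-\(n\) definition of variance, \(\sigma^2 = \frac{\sum x^2}{n} - \bar{x}^2\); a specification that uses divisor \(n - 1\) would give a different value. Variable names are illustrative.

```python
from math import sqrt

scores = [15, 24, 32, 39, 41, 18, 16, 22, 34, 42]
n = len(scores)
sum_x = sum(scores)
sum_x2 = sum(x * x for x in scores)     # matches the quoted 9011

mean_x = sum_x / n
var_x = sum_x2 / n - mean_x**2          # divisor n (assumed convention)
S_xx = sum_x2 - sum_x**2 / n

sum_y, S_yy = 306.1, 546.3
mean_y = sum_y / n
var_y = S_yy / n                        # same convention: S_yy / n

b = 0.659                               # gradient of the given line
r = b * sqrt(S_xx / S_yy)               # since b = S_xy / S_xx
salary = (12.0 + b * 35) * 1000         # y is in thousands of dollars

print(f"mean_x = {mean_x}, var_x = {var_x:.2f}")
print(f"mean_y = {mean_y:.2f}, var_y = {var_y:.2f}, r = {r:.3f}")
print(f"proposed salary = ${salary:,.0f}")
```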
Prediction with confidence or prediction intervals

A question is this sub-type if and only if it requires constructing a confidence interval or prediction interval around the predicted value, involving variance and distributional assumptions.

1 Challenging +1.2
0.5% of questions
Show example »
9 The values of a set of bivariate data \(\left( x _ { i } , y _ { i } \right)\) can be summarised by $$n = 50 , \sum x = 1270 , \sum y = 5173 , \sum x ^ { 2 } = 42767 , \sum y ^ { 2 } = 701301 , \sum x y = 173161 .$$ Ten independent observations of \(Y\) are obtained, all corresponding to \(x = 20\). It may be assumed that the variance of \(Y\) is 1.9, independently of the value of \(x\). Find a \(95 \%\) confidence interval for the mean \(\bar { Y }\) of the 10 observations of \(Y\).
View full question →
Hypothesis test for regression slope

Questions that require testing whether the regression slope coefficient is significantly different from zero using regression output, standard errors, and t-tests to assess the significance of the relationship.

1 Standard +0.3
0.5% of questions
Show example »
  1. Kwame is investigating a possible relationship between average March temperature, \(t ^ { \circ } \mathrm { C }\), and tea yield, \(y \mathrm {~kg} /\) hectare, for tea grown in a particular location. He uses 30 years of past data to produce the following summary statistics for a linear regression model, with tea yield as the dependent variable.
$$\begin{aligned} & \text { Residual Sum of Squares } ( \mathrm { RSS } ) = 1666567 \quad \mathrm {~S} _ { t t } = 52.0 \quad \mathrm {~S} _ { y y } = 1774155 \\ & \text { least squares regression line: } \quad \text { gradient } = 45.5 \quad y \text {-intercept } = 2080 \end{aligned}$$
  1. Use the regression model to predict the tea yield for an average March temperature of \(20 ^ { \circ } \mathrm { C }\).
He also produces the following residual plot for the data. \includegraphics[max width=\textwidth, alt={}, center]{d139840b-16ec-42ce-8501-f79c263c8017-02_663_880_868_589}
  2. Explain what you understand by the term residual.
  3. Calculate the product moment correlation coefficient between \(t\) and \(y\)
  4. Explain why the linear model may not be a good fit for the data
    1. with reference to your answer to part (c)
    2. with reference to the residual plot.
Kwame also collects data on total March rainfall, \(w \mathrm {~mm}\), for each of these 30 years. For a linear regression model of \(w\) on \(t\) the following summary statistic is found. $$\text { Residual Sum of Squares (RSS) = } 86754$$
Kwame concludes that since this model has a smaller RSS, there must be a stronger linear relationship between \(w\) and \(t\) than between \(y\) and \(t\) (where RSS \(= 1666567\)).
  5. State, giving a reason, whether or not you agree with the reasoning that led to Kwame's conclusion.
View full question →
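A Python sketch of the quantities involved, using the summary statistics from the example above. The \(t\)-test on the gradient is the technique named in the type description rather than a part of this particular question; the critical value 2.048 is the two-tailed 5% point of \(t\) with 28 degrees of freedom from standard tables, and the variable names are assumptions.

```python
from math import sqrt

n = 30
RSS, S_tt, S_yy = 1666567, 52.0, 1774155
gradient, intercept = 45.5, 2080

prediction = intercept + gradient * 20      # kg/hectare at 20 degrees C

# r^2 = 1 - RSS / S_yy; r takes the sign of the gradient (positive here)
r_squared = 1 - RSS / S_yy
r = sqrt(r_squared)

# t-test for H0: slope = 0, with standard error sqrt(RSS / (n - 2) / S_tt)
se_gradient = sqrt(RSS / (n - 2) / S_tt)
t_stat = gradient / se_gradient
T_CRIT = 2.048                              # t(28), two-tailed 5% (standard tables)
significant = abs(t_stat) > T_CRIT

print(f"prediction = {prediction:.0f}, r = {r:.3f}")
print(f"t = {t_stat:.2f}, significant at 5%: {significant}")
```

The small value of \(r\) and the non-significant gradient both point the same way: on these figures the linear model explains only about 6% of the variation in yield.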
Interpret residual plots

Questions that provide residual plots and ask students to interpret them or assess model appropriateness based on residual patterns.

1 Standard +0.3
0.5% of questions
Show example »
3 Below are 3 sketches from some students of the residuals from their linear regressions of \(y\) on \(x\). \includegraphics[max width=\textwidth, alt={}, center]{54bf68ab-7934-432a-890f-20093082ab07-06_252_704_342_660} \includegraphics[max width=\textwidth, alt={}, center]{54bf68ab-7934-432a-890f-20093082ab07-06_266_718_625_660} \includegraphics[max width=\textwidth, alt={}, center]{54bf68ab-7934-432a-890f-20093082ab07-06_248_599_936_660}
For each sketch you should state, giving your reason,
  1. whether or not the sketch is feasible
    and if it is feasible
  2. whether or not the sketch suggests a linear or a non-linear relationship between \(y\) and \(x\).
View full question →
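One feasibility check that may be relevant to questions like the one above: for a least squares line with an intercept, the residuals always sum to zero and are uncorrelated with \(x\), so a sketch with every residual on one side of zero cannot be genuine. The Python sketch below illustrates this on a small hypothetical data set, invented purely for illustration and not taken from any question.

```python
# Hypothetical data, invented for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)

x_bar, y_bar = sum(xs) / n, sum(ys) / n
S_xx = sum((x - x_bar) ** 2 for x in xs)
S_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
b = S_xy / S_xx
a = y_bar - b * x_bar

# Residual = observed y minus fitted y
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

sum_resid = sum(residuals)                                   # zero up to rounding
sum_x_resid = sum(x * e for x, e in zip(xs, residuals))      # zero up to rounding

print([round(e, 2) for e in residuals])
```

A curved pattern in the residuals, by contrast, is feasible and suggests a non-linear relationship.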
Convert correlation or regression statistics between scales

Questions that require converting summary statistics (like Sxx, Sxy, Syy, or correlation coefficient) between coded and original variables using properties of linear transformations.

0
0.0% of questions
Calculate with one S-value given

A question is this sub-type if and only if it provides summary statistics where one of Sxx, Syy, or Sxy is already given and asks to calculate one or both of the remaining S-values.

0
0.0% of questions