5.08a Pearson correlation: calculate pmcc

246 questions

OCR MEI Further Statistics A AS 2024 June Q5
10 marks Moderate -0.3
5 A student is investigating possible association between the amount of coffee that an adult drinks each day and the number of hours that they remain awake each day. In an initial investigation, a random sample of 8 adults is selected. The student obtains the following information from each of these adults: the amount of coffee that they drink each day and the number of hours that they remain awake each day. The student analyses the data and finds that the associated product moment correlation coefficient is 0.6030.
  1. State one assumption that must be made for a hypothesis test based on the product moment correlation coefficient to be carried out. For the remainder of this question you may assume that this assumption is true.
  2. Carry out a test at the \(5 \%\) significance level to investigate whether there is any correlation between amount of coffee drunk and number of hours awake. The student conducts a second investigation which is similar to the first but this time based on a random sample of 30 adults. The product moment correlation coefficient for the new data is 0.5487. The student carries out an equivalent hypothesis test to the one carried out in part (b), again using a \(5 \%\) significance level.
  3. Identify any differences between the two tests and their results. You do not need to restate the hypotheses or explain the conclusion in context.
  4. You may assume the following guidelines for considering effect size.
    \begin{tabular}{|c|c|}
    \hline
    Product moment correlation coefficient & Effect size \\
    \hline
    0.1 & Small \\
    0.3 & Medium \\
    0.5 & Large \\
    \hline
    \end{tabular}
    Explain briefly why the results of the student's second investigation are likely to be more reliable than the results of the initial investigation.
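The difference between the two tests in this question comes from how the critical value shrinks as the sample size grows. The following is a minimal sketch of that effect, assuming the standard equivalence between a test on \(r\) and a \(t\)-test with \(n - 2\) degrees of freedom. The function name `critical_r` is hypothetical, and the two-tailed 5% \(t\)-values (2.447 for 6 degrees of freedom, 2.048 for 28) are taken from standard \(t\)-tables, not from the question, which expects the printed critical-value table for \(r\).

```python
import math


def critical_r(t_crit: float, n: int) -> float:
    """Critical |r| for sample size n, derived from a t critical value with n - 2 d.f."""
    df = n - 2
    return t_crit / math.sqrt(t_crit ** 2 + df)


# Two-tailed 5% t critical values from standard tables (assumed, not given in the question).
T_CRIT = {8: 2.447, 30: 2.048}

for n, r in [(8, 0.6030), (30, 0.5487)]:
    r_crit = critical_r(T_CRIT[n], n)
    verdict = "reject H0" if abs(r) > r_crit else "do not reject H0"
    print(f"n={n}: r={r}, critical value ~ {r_crit:.4f} -> {verdict}")
```

Under these assumptions the smaller sample has the larger coefficient but fails to reach its critical value, while the larger sample exceeds its much lower one.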
OCR MEI Further Statistics A AS 2020 November Q2
12 marks Standard +0.3
2 A researcher is investigating the concentration of bacteria and fungi in the air in buildings. The researcher selects a random sample of 12 buildings and measures the concentrations of bacteria, \(x\), and fungi, \(y\), in the air in each building. Both concentrations are measured in the same standard units. Fig. 2 illustrates the data collected. The researcher wishes to test for a relationship between \(x\) and \(y\). \begin{figure}[h]
\includegraphics[alt={},max width=\textwidth]{ba3fcd3c-6834-4116-be0e-d5b27aed0a7e-3_595_844_513_255} \captionsetup{labelformat=empty} \caption{Fig. 2}
\end{figure}
  1. Explain why a test based on the product moment correlation coefficient is likely to be appropriate for these data. Summary statistics for the data are as follows. \(n = 12 \quad \sum x = 18030 \quad \sum y = 15550 \quad \sum x ^ { 2 } = 31458700 \quad \sum y ^ { 2 } = 21980500 \quad \sum x y = 25626800\)
  2. In this question you must show detailed reasoning. Calculate the product moment correlation coefficient between \(x\) and \(y\).
  3. Carry out a test at the \(5 \%\) significance level based on the product moment correlation coefficient to investigate whether there is any correlation between concentrations of bacteria and fungi.
  4. Explain why, in order for proper inference to be undertaken, the sample should be chosen randomly.
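The calculation asked for in part (b) can be checked numerically. This is a sketch only, using the usual definitions \(S_{xx} = \sum x^2 - (\sum x)^2 / n\), \(S_{yy} = \sum y^2 - (\sum y)^2 / n\), \(S_{xy} = \sum xy - \sum x \sum y / n\) and \(r = S_{xy} / \sqrt{S_{xx} S_{yy}}\); the function name `pmcc_from_sums` is hypothetical.

```python
import math


def pmcc_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Pearson's r from n and the five raw sums."""
    s_xx = sum_x2 - sum_x ** 2 / n
    s_yy = sum_y2 - sum_y ** 2 / n
    s_xy = sum_xy - sum_x * sum_y / n
    return s_xy / math.sqrt(s_xx * s_yy)


# Summary statistics quoted in the question (n = 12 buildings).
r = pmcc_from_sums(12, 18030, 15550, 31458700, 21980500, 25626800)
print(round(r, 4))
```

With these figures the coefficient comes out at roughly 0.80. A "detailed reasoning" answer would still need the intermediate values of \(S_{xx}\), \(S_{yy}\) and \(S_{xy}\) written out.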
OCR MEI Further Statistics A AS 2021 November Q3
9 marks Standard +0.3
3 A student is investigating the link between temperature (in degrees Celsius) and electricity consumption (in Gigawatt-hours) in the country in which he lives. The student has read that there is strong negative correlation between daily mean temperature over the whole country and daily electricity consumption during a year. He wonders if this applies to an individual season. He therefore obtains data on the mean temperature and electricity consumption on ten randomly selected days in the summer. The spreadsheet output below shows the data, together with a scatter diagram to illustrate the data. \includegraphics[max width=\textwidth, alt={}, center]{5be067ff-4668-48d6-8ed2-b8dfa3e678f7-3_798_1593_639_251}
  1. Calculate Pearson's product moment correlation coefficient between daily mean temperature and daily electricity consumption. The student decides to carry out a hypothesis test to investigate whether there is negative correlation between daily mean temperature and daily electricity consumption during the summer.
  2. Explain why the student decides to carry out a test based on Pearson's product moment correlation coefficient.
  3. Show that the test at the \(5 \%\) significance level does not result in the null hypothesis being rejected.
  4. The student concludes that there is no correlation between the variables in the summer months. Comment on the student's conclusion.
OCR MEI Further Statistics Minor 2019 June Q5
16 marks Standard +0.3
5 A student wants to know if there is a positive correlation between the amounts of two pollutants, sulphur dioxide and PM10 particulates, on different days in the area of London in which he lives; these amounts, measured in suitable units, are denoted by \(s\) and \(p\) respectively.
He uses a government website to obtain data for a random sample of 15 days on which the amounts of these pollutants were measured simultaneously. Fig. 5.1 is a scatter diagram showing the data. Summary statistics for these 15 values of \(s\) and \(p\) are as follows. \(\sum s = 155.4 \quad \sum p = 518.9 \quad \sum s ^ { 2 } = 2322.7 \quad \sum p ^ { 2 } = 21270.5 \quad \sum s p = 6009.1\) \begin{figure}[h]
\includegraphics[alt={},max width=\textwidth]{4a4d5816-5b53-49a1-b72f-f8bcf3b4e8bc-4_935_1134_683_260} \captionsetup{labelformat=empty} \caption{Fig. 5.1}
\end{figure}
  1. Explain why the student might come to the conclusion that a test based on Pearson's product moment correlation coefficient may be valid.
  2. Find the value of Pearson's product moment correlation coefficient.
  3. Carry out a test at the \(5 \%\) significance level to investigate whether there is positive correlation between the amounts of sulphur dioxide and PM10 particulates.
  4. Explain why the student made sure that the sample chosen was a random sample. The student also wishes to model the relationship between the amounts of nitrogen dioxide \(n\) and PM10 particulates \(p\).
    He takes a random sample of 54 values of the two variables, both measured at the same times. Fig. 5.2 is a scatter diagram which shows the data, together with the regression line of \(n\) on \(p\), the equation of the regression line and the value of \(r ^ { 2 }\). \begin{figure}[h]
    \includegraphics[alt={},max width=\textwidth]{4a4d5816-5b53-49a1-b72f-f8bcf3b4e8bc-5_824_1230_495_258} \captionsetup{labelformat=empty} \caption{Fig. 5.2}
    \end{figure}
  5. Predict the value of \(n\) for \(p = 150\).
  6. Discuss the reliability of your prediction in part (e).
OCR MEI Further Statistics Minor 2024 June Q3
13 marks Standard +0.3
3 The scatter diagram below illustrates data concerning average annual income per person, \(\$ x\), and average life expectancy, \(y\) years, for 45 randomly selected cities. \includegraphics[max width=\textwidth, alt={}, center]{464c80be-007b-4d5a-9fe5-2f35100bdea6-3_860_1465_354_244}
  1. State whether neither variable, one variable or both variables can be considered to be random in this situation. A student is researching possible positive association between average annual income and average life expectancy. The student decides that the data point labelled A on the scatter diagram is an outlier.
  2. Describe the apparent relationship between average annual income and average life expectancy for this data point relative to the rest of the data. The data for point A is removed. The student now wishes to carry out a hypothesis test using the product moment correlation coefficient for the remaining 44 data points to investigate whether there is positive correlation between average annual income and average life expectancy.
  3. Explain why this type of hypothesis test is appropriate in this situation. Justify your answer. The summary statistics for these 44 data points are as follows. \(\sum x = 751120 \quad \sum y = 2397.1 \quad \sum x ^ { 2 } = 14363849200 \quad \sum y ^ { 2 } = 133014.63 \quad \sum x y = 42465962\)
  4. Determine the value of the product moment correlation coefficient.
  5. Carry out the test at the 1\% significance level.
OCR MEI Further Statistics Minor 2021 November Q4
14 marks Standard +0.3
4 A scientist is investigating sea salinity (the level of salt in the sea) in a particular area. She wishes to check whether satellite measurements, \(y\), of salinity are similar to those directly measured, \(x\). Both variables are measured in parts per thousand in suitable units. The scientist obtains a random sample of 10 values of \(x\) and the related values of \(y\). Below is a screenshot of a scatter diagram to illustrate the data. She decides to carry out a hypothesis test to check if there is any correlation between direct measurement, \(x\), and satellite measurement, \(y\). \includegraphics[max width=\textwidth, alt={}, center]{691e8b55-e9a1-4fff-b9ee-a71ff1f73ead-5_830_837_589_246}
  1. Explain why the scientist might decide to carry out a test based on the product moment correlation coefficient. Summary statistics for \(x\) and \(y\) are as follows. \(n = 10 \quad \sum x = 351.9 \quad \sum y = 350.0 \quad \sum x ^ { 2 } = 12384.5 \quad \sum y ^ { 2 } = 12251.2 \quad \sum x y = 12317.2\)
  2. In this question you must show detailed reasoning. Calculate the product moment correlation coefficient.
  3. Carry out a hypothesis test at the \(5 \%\) significance level to investigate whether there is positive correlation between directly measured and satellite measured salinity levels.
  4. Explain why it would be preferable to use a larger sample. The scientist is also interested in whether there is any correlation between salinity and numbers of a particular species of shrimp in the water. She takes a large sample and finds that the product moment correlation coefficient for this sample is 0.165. The result of a test based on this sample is to reject the null hypothesis and conclude that there is correlation between salinity and numbers of shrimp.
  5. Comment on the outcome of the hypothesis test with reference to the effect size of 0.165.
OCR MEI Further Statistics Major 2019 June Q6
18 marks Moderate -0.8
6
  1. A researcher is investigating the date of the 'start of spring' at different locations around the country.
    A suitable date (measured in days from the start of the year) can be identified by checking, for example, when buds first appear for certain species of trees and plants, but this is time-consuming and expensive. Satellite data, measuring microwave emissions, can alternatively be used to estimate the date that land-based measurements would give. The researcher chooses a random sample of 12 locations, and obtains land-based measurements for the start of spring date at each location, together with relevant satellite measurements. The scatter diagram in Fig. 6.1 shows the results; the land-based measurements are denoted by \(x\) days and the corresponding values derived from satellite measurements by \(y\) days. \begin{figure}[h]
    \includegraphics[alt={},max width=\textwidth]{3a89edc4-ac93-4691-ade8-4d4665b55202-06_732_1342_781_333} \captionsetup{labelformat=empty} \caption{Fig. 6.1}
    \end{figure} Fig. 6.2 shows part of a spreadsheet used to analyse the data. Some rows of the spreadsheet have been deliberately omitted. \begin{table}[h]
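    \begin{tabular}{|c|c|c|c|c|c|c|}
    \hline
     & A & B & C & D & E & F \\
    \hline
    1 & & \(\boldsymbol { x }\) & \(\boldsymbol { y }\) & \(\boldsymbol { x } ^ { \mathbf { 2 } }\) & \(\boldsymbol { y } ^ { \mathbf { 2 } }\) & \(\boldsymbol { x y }\) \\
    2 & & 90 & 102 & 8100 & 10404 & 9180 \\
    3 & & & & & & \\
    10 & & & & & & \\
    11 & & & & & & \\
    12 & & 94 & 97 & 8836 & 9409 & 9118 \\
    13 & & 99 & 101 & 9801 & 10201 & 9999 \\
    14 & Sum & 1131 & 1227 & 107783 & 126725 & 116724 \\
    15 & & & & & & \\
    \hline
    \end{tabular}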
    \captionsetup{labelformat=empty} \caption{Fig. 6.2}
    \end{table}
    1. Calculate the equation of a regression line suitable for estimating the land-based date of the start of spring from satellite measurements.
    2. Using this equation, estimate the land-based date of the start of spring for the following dates from satellite measurements.
      • 95 days
      • 60 days
    3. Comment on the reliability of each of your estimates.
  2. The researcher is also investigating whether there is any correlation between the average temperature during a month in spring and the total rainfall during that month at a particular location. The average temperatures in degrees Celsius and total rainfall in mm for a random selection, over several years, of 10 spring months at this location are as follows.
      \begin{tabular}{|l|c|c|c|c|c|c|c|c|c|c|}
      \hline
      Temperature & 4.2 & 7.1 & 5.6 & 3.5 & 8.6 & 6.5 & 2.7 & 5.9 & 6.7 & 4.1 \\
      \hline
      Rainfall & 18 & 26 & 42 & 76 & 15 & 43 & 84 & 53 & 66 & 36 \\
      \hline
      \end{tabular}
      The researcher plots the scatter diagram shown in Fig. 6.3 to check which type of test to carry out. \begin{figure}[h]
      \includegraphics[alt={},max width=\textwidth]{3a89edc4-ac93-4691-ade8-4d4665b55202-07_693_880_1174_338} \captionsetup{labelformat=empty} \caption{Fig. 6.3}
      \end{figure}
      1. Explain why the researcher might come to the conclusion that a test based on Pearson's product moment correlation coefficient may be valid.
      2. Find the value of Pearson's product moment correlation coefficient.
      3. Carry out a test at the \(5 \%\) significance level to investigate whether there is any correlation between temperature and rainfall.
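For part (a) of this question, the quantity being estimated is the land-based date \(x\) from the satellite value \(y\), so the natural choice is the regression line of \(x\) on \(y\). Here is a sketch under that reading, using the "Sum" row of the spreadsheet in Fig. 6.2 with \(n = 12\); the function name `regression_x_on_y` is hypothetical.

```python
def regression_x_on_y(n, sum_x, sum_y, sum_y2, sum_xy):
    """Return (a, b) for the least squares line x = a + b*y."""
    s_yy = sum_y2 - sum_y ** 2 / n
    s_xy = sum_xy - sum_x * sum_y / n
    b = s_xy / s_yy                      # gradient uses S_yy because y is the predictor
    a = sum_x / n - b * sum_y / n        # the line passes through (y-bar, x-bar)
    return a, b


# Column sums from row 14 of the spreadsheet: x, y, y^2, xy.
a, b = regression_x_on_y(12, 1131, 1227, 126725, 116724)

for y in (95, 60):
    print(f"satellite date {y} -> estimated land-based date {a + b * y:.1f}")
```

Dividing by \(S_{yy}\), not \(S_{xx}\), is the only difference from the more familiar \(y\) on \(x\) line. The code makes no judgement about whether either satellite value lies inside the range of the data, which is what part (iii) asks the reader to consider.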
OCR MEI Further Statistics Major 2023 June Q6
12 marks Standard +0.3
6 A student wonders if there is any correlation between download and upload speeds of data to and from the internet. The student decides to carry out a hypothesis test to investigate this and so measures the download speed \(x\) and upload speed \(y\) in suitable units on 20 randomly chosen occasions. The scatter diagram below illustrates the data which the student collected. \includegraphics[max width=\textwidth, alt={}, center]{c692fb20-436f-4bc1-89bd-10fdba41ceba-07_824_1411_440_246}
  1. Explain why the student decides to carry out a test based on the product moment correlation coefficient. Summary statistics for the 20 occasions are as follows. $$\sum x = 342.10 \quad \sum y = 273.65 \quad \sum x ^ { 2 } = 5989.53 \quad \sum y ^ { 2 } = 3919.53 \quad \sum x y = 4713.62$$
  2. In this question you must show detailed reasoning. Calculate the product moment correlation coefficient.
  3. Carry out a hypothesis test at the \(5 \%\) significance level to investigate whether there is any correlation between download speed and upload speed.
  4. Both of the variables, download speed and upload speed, are random. Explain why, if download speed had been a non-random variable, the student could not have carried out the hypothesis test to investigate whether there was any correlation between download speed and upload speed.
OCR MEI Further Statistics Major 2020 November Q6
10 marks Standard +0.3
6 A pollution control officer is investigating a possible link between the levels of various pollutants in the air and the speed of the wind at various sites. A random sample of 60 values of the wind-speed together with the levels of a variety of pollutants is taken at a particular site. The product moment correlation coefficient between wind-speed and nitrogen dioxide level is 0.3231.
  1. Carry out a hypothesis test at the \(10 \%\) significance level to investigate whether there is any correlation between wind-speed and nitrogen dioxide level.
  2. State the condition required for the test carried out in part (a) to be valid. Table 6.1 shows the values of the product moment correlation coefficient between 5 different measures of pollution and also wind-speed for a very large random sample of values at another site. Those correlations that are significant at the \(10 \%\) level are denoted by a * after the value of the correlation. \begin{table}[h]
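    \begin{tabular}{|l|c|c|c|c|c|c|}
    \hline
    Correlations & PM10 & SPEED & \(\mathrm { NO } _ { 2 }\) & \(\mathrm { O } _ { 3 }\) & PM25 & \(\mathrm { SO } _ { 2 }\) \\
    \hline
    PM10 & 1.00 & & & & & \\
    SPEED & 0.08* & 1.00 & & & & \\
    \(\mathrm { NO } _ { 2 }\) & 0.59* & 0.25* & 1.00 & & & \\
    \(\mathrm { O } _ { 3 }\) & -0.05* & -0.04* & -0.30* & 1.00 & & \\
    PM25 & 0.85* & -0.01 & 0.56* & -0.02 & 1.00 & \\
    \(\mathrm { SO } _ { 2 }\) & 0.42* & 0.15* & 0.73* & -0.63* & 0.40* & 1.00 \\
    \hline
    \end{tabular}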
    \captionsetup{labelformat=empty} \caption{Table 6.1}
    \end{table} \begin{table}[h]
    \captionsetup{labelformat=empty} \caption{Table 6.2 shows standard guidelines for effect sizes.}
    \begin{tabular}{|c|c|}
    \hline
    Product moment correlation coefficient & Effect size \\
    \hline
    0.1 & Small \\
    0.3 & Medium \\
    0.5 & Large \\
    \hline
    \end{tabular}
    \end{table} The officer analyses these data for effect size.
  3. Explain how the very large sample size relates to the interpretation of the correlation coefficients shown in Table 6.1.
  4. Comment briefly on what the pollution control officer might conclude from these tables, relevant to her investigation into wind-speed and pollutant levels.
OCR MEI Further Statistics Major 2021 November Q8
16 marks Standard +0.3
8
  1. \(\mathrm { VO } _ { 2 \max }\) is a measure of athletic fitness. Since \(\mathrm { VO } _ { 2 \max }\) is fairly time-consuming and expensive to measure, an exercise scientist wants to predict \(\mathrm { VO } _ { 2 \max }\) from data such as times for running different distances. The scientist uses these data for a random sample of 15 athletes to predict their \(\mathrm { VO } _ { 2 \max }\) values, denoted by \(y\), in suitable units. She also obtains accurate measurements of the \(\mathrm { VO } _ { 2 \max }\) values, denoted by \(x\), in the same units. The scatter diagram in Fig. 8.1 shows the values of \(x\) and \(y\) obtained, together with the equation of the regression line of \(y\) on \(x\) and the value of \(r ^ { 2 }\). \begin{figure}[h]
    \includegraphics[alt={},max width=\textwidth]{ce557137-f9eb-4c09-a7e3-e4ec626109dc-08_750_1324_660_317} \captionsetup{labelformat=empty} \caption{Fig. 8.1}
    \end{figure}
    1. Use the regression line to estimate the predicted \(\mathrm { VO } _ { 2 \max }\) of an athlete whose accurately measured \(\mathrm { VO } _ { 2 \max }\) is 50.
    2. Comment on the reliability of your estimate.
    3. The equation of the regression line of \(x\) on \(y\) is \(x = 0.7565 y + 10.493\). Find the coordinates of the point at which the two regression lines meet.
    4. State what the point you found in part (iii) represents.
  2. It is known that there is negative correlation between \(\mathrm { VO } _ { 2 \max }\) and marathon times in very good runners (those whose best marathon times are under 3 hours). The exercise scientist wishes to know whether the same applies to runners who take longer to run a marathon. She selects a random sample of 20 runners whose best marathon times are between \(3 \frac { 1 } { 2 }\) hours and \(4 \frac { 1 } { 2 }\) hours and accurately measures their \(\mathrm { VO } _ { 2 \max }\). Fig. 8.2 is a scatter diagram of accurately measured \(\mathrm { VO } _ { 2 \max }\), \(v\) units, against best marathon time, \(t\) hours, for these runners. \begin{figure}[h]
    \includegraphics[alt={},max width=\textwidth]{ce557137-f9eb-4c09-a7e3-e4ec626109dc-09_671_1064_648_319} \captionsetup{labelformat=empty} \caption{Fig. 8.2}
    \end{figure}
    1. Explain why the exercise scientist comes to the conclusion that a test based on Pearson's product moment correlation coefficient may be valid. Summary statistics for the 20 runners are as follows. $$\sum t = 80.37 \quad \sum v = 970.86 \quad \sum t ^ { 2 } = 324.71 \quad \sum v ^ { 2 } = 47829.24 \quad \sum t v = 3886.53$$
    2. Find the value of Pearson's product moment correlation coefficient.
    3. Carry out a test at the \(5 \%\) significance level to investigate whether there is negative correlation between accurately measured \(\mathrm { VO } _ { 2 \max }\) and best marathon time for runners whose best marathon times are between \(3 \frac { 1 } { 2 }\) hours and \(4 \frac { 1 } { 2 }\) hours.
WJEC Further Unit 2 2022 June Q2
11 marks Standard +0.3
2. An economist suggested the rate of unemployment and the rate of wage inflation are independent. Amy sets about investigating this suggestion. She collects unemployment data and wage inflation data from a random sample of regions in the UK and decides that it is appropriate to carry out a significance test on Pearson's product moment correlation coefficient. Amy's summary statistics for percentage unemployment, \(x\), and percentage wage inflation, \(y\), are shown below. $$\begin{array} { l l l } \sum x = 62 \cdot 8 & \sum y = 19 \cdot 4 & n = 10 \\ \sum x ^ { 2 } = 413 \cdot 44 & \sum y ^ { 2 } = 46 \cdot 16 & \sum x y = 113 \cdot 16 \end{array}$$
  1. Calculate Pearson's product moment correlation coefficient for these data.
  2. Carry out Amy's test at the \(5 \%\) level of significance and state whether the economist's suggestion is reasonable. Amy also collects unemployment data and wage inflation data from a random sample of 10 regions in Spain and calculates Pearson's product moment correlation coefficient to be -0.2525.
  3. Should this change Amy's opinion on the economist's suggestion above? What could she do to improve her investigation?
  4. What assumption has Amy made in deciding that it is appropriate to carry out a significance test on Pearson's product moment correlation coefficient?
WJEC Further Unit 2 2024 June Q4
12 marks Standard +0.8
4. An author poses the following question: Does using cash for transactions affect people's financial behaviour?
She collects data on 'Cash transactions as a \% of all transactions' and 'Household debt as a \(\%\) of net disposable income' from a random sample of 25 countries. The table below shows the data she collected. There are missing values, \(p\) and \(q\), for Malta and Denmark respectively.
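\begin{tabular}{|l|c|c|l|c|c|}
\hline
Country & Cash transactions as a \% of all transactions \(\boldsymbol { x }\) & Household debt as a \% of net disposable income \(\boldsymbol { y }\) & Country & Cash transactions as a \% of all transactions \(\boldsymbol { x }\) & Household debt as a \% of net disposable income \(\boldsymbol { y }\) \\
\hline
Malta & 92 & \(p\) & France & 68 & 120 \\
Mexico & 90 & -14 & Luxembourg & 64 & 177 \\
Greece & 88 & 107 & Belgium & 63 & 113 \\
Spain & 87 & 110 & Finland & 54 & 137 \\
Italy & 86 & 87 & Estonia & 48 & 82 \\
Austria & 85 & 91 & The Netherlands & 45 & 247 \\
Portugal & 81 & 131 & UK & 42 & 147 \\
Slovenia & 80 & 56 & Australia & 37 & 214 \\
Germany & 80 & 95 & USA & 32 & 109 \\
Ireland & 79 & 154 & Sweden & 20 & 187 \\
Slovakia & 78 & 74 & South Korea & 14 & 182 \\
Lithuania & 75 & 46 & Denmark & \(q\) & 261 \\
Latvia & 71 & 43 & & & \\
\hline
\end{tabular}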
The summary statistics and scatter diagram below are for the other 23 countries. \begin{figure}[h]
\captionsetup{labelformat=empty} \caption{Household debt versus Cash transactions} \includegraphics[alt={},max width=\textwidth]{1538fa56-5b61-40ec-bb02-cf1ed9da5eb0-13_664_1296_511_379}
\end{figure} $$\begin{gathered} \sum x = 1467 \quad \sum y = 2695 \quad \sum x ^ { 2 } = 105073 \quad S _ { x x } = 11503 \cdot 91304 \quad S _ { y y } = 78669 \cdot 30435 \\ \sum y ^ { 2 } = 394453 \quad \sum x y = 152999 \quad S _ { x y } = - 18895 \cdot 13043 \end{gathered}$$
  1. Using the summary statistics for the 23 countries, calculate and interpret Pearson's product moment correlation coefficient.
  2. Calculate the equation of the least squares regression line of Household debt as a \% of net disposable income \(( y )\) on Cash transactions as a \% of all transactions ( \(x\) ). The regression line \(x\) on \(y\) is given below. $$x = - 0 \cdot 24 y + 91 \cdot 92$$
  3. By selecting the appropriate regression line in each case, estimate the values of \(p\) and \(q\) in the table.
  4. Comment on the reliability of your answers in part (c).
  5. Interpret the negative value of \(y\) for Mexico.
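Parts (a) to (c) of this question turn on which regression line matches which unknown. A hedged sketch, using only the summary statistics and the \(x\) on \(y\) line quoted above: \(y\) is estimated from the \(y\) on \(x\) line when \(x\) is known (Malta), and \(x\) from the given \(x\) on \(y\) line when \(y\) is known (Denmark). All variable names are illustrative.

```python
# Summary statistics quoted in the question (23 countries).
n = 23
sum_x, sum_y = 1467, 2695
s_xx, s_yy, s_xy = 11503.91304, 78669.30435, -18895.13043

# Part (a): product moment correlation coefficient.
r = s_xy / (s_xx * s_yy) ** 0.5

# Part (b): least squares line of y on x, y = a_yx + b_yx * x.
b_yx = s_xy / s_xx
a_yx = sum_y / n - b_yx * sum_x / n

# Part (c): Malta has x = 92 and unknown y, so use y on x.
p_estimate = a_yx + b_yx * 92

# Denmark has y = 261 and unknown x, so use the x on y line as given.
q_estimate = -0.24 * 261 + 91.92

print(round(r, 3), round(p_estimate, 1), round(q_estimate, 1))
```

As a consistency check, \(S_{xy} / S_{yy}\) from the same statistics gives a gradient of about \(-0.24\), which agrees with the \(x\) on \(y\) line printed in the question.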
Edexcel FS2 AS 2018 June Q1
11 marks Moderate -0.3
  1. The scores achieved on a maths test, \(m\), and the scores achieved on a physics test, \(p\), by 16 students are summarised below.
$$\sum m = 392 \quad \sum p = 254 \quad \sum p ^ { 2 } = 4748 \quad \mathrm {~S} _ { m m } = 1846 \quad \mathrm {~S} _ { m p } = 1115$$
  1. Find the product moment correlation coefficient between \(m\) and \(p\)
  2. Find the equation of the linear regression line of \(p\) on \(m\) Figure 1 shows a plot of the residuals. \begin{figure}[h]
    \includegraphics[alt={},max width=\textwidth]{0fcb4d83-9763-4edd-8006-93f75a44c596-02_808_1222_997_429} \captionsetup{labelformat=empty} \caption{Figure 1}
    \end{figure}
  3. Calculate the residual sum of squares (RSS). For the person who scored 30 marks on the maths test,
  4. find the score on the physics test. The data for the person who scored 20 on the maths test is removed from the data set.
  5. Suggest a reason why. The product moment correlation coefficient between \(m\) and \(p\) is now recalculated for the remaining 15 students.
  6. Without carrying out any further calculations, suggest how you would expect this recalculated value to compare with your answer to part (a).
    Give a reason for your answer.
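Parts (a) and (c) of this question both need \(S_{pp}\), which is not given directly. A minimal sketch, assuming the standard identity \(\mathrm{RSS} = S_{pp} - S_{mp}^2 / S_{mm}\) for the regression line of \(p\) on \(m\); variable names are illustrative.

```python
# Summary statistics quoted in the question (16 students).
n = 16
sum_p, sum_p2 = 254, 4748
s_mm, s_mp = 1846, 1115

# S_pp has to be built from the raw sums first.
s_pp = sum_p2 - sum_p ** 2 / n

# Part (a): product moment correlation coefficient.
r = s_mp / (s_mm * s_pp) ** 0.5

# Part (c): residual sum of squares for the line of p on m.
rss = s_pp - s_mp ** 2 / s_mm

print(s_pp, round(r, 3), round(rss, 2))
```

The same RSS can be written as \(S_{pp}(1 - r^2)\), which is a quick way to cross-check the two answers against each other.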
Edexcel FS2 AS 2019 June Q3
11 marks Standard +0.3
  1. Two students, Jim and Dora, collected data on the mean annual rainfall, \(w \mathrm {~cm}\), and the annual yield of leeks, \(l\) tonnes per hectare, for 10 years.
Jim summarised the data as follows $$\mathrm { S } _ { w l } = 42.786 \quad \mathrm {~S} _ { w w } = 9936.9 \quad \sum l ^ { 2 } = 26.2326 \quad \sum l = 16.06$$
  1. Find the product moment correlation coefficient between \(l\) and \(w\) Dora decided to code the data first using \(s = w - 6\) and \(t = l - 20\)
  2. Write down the value of the product moment correlation coefficient between \(s\) and \(t\). Give a justification for your answer. Dora calculates the equation of the regression line of \(t\) on \(s\) to be \(t = 0.00431 s - 18.87\)
  3. Find the equation of the regression line of \(l\) on \(w\) in the form \(l = a + b w\), giving the values of \(a\) and \(b\) to 3 significant figures.
  4. Use your equation to estimate the yield of leeks when \(w\) is 100 cm.
  5. Calculate the residual sum of squares. The graph shows the residual for each value of \(l\) \includegraphics[max width=\textwidth, alt={}, center]{7e46e14a-0f5a-4d02-8f00-a92bc4def6d7-08_716_1594_1594_239}
    1. State whether this graph suggests that the use of a linear regression model is suitable for these data. Give a reason for your answer.
    2. Other than collecting more data, suggest how to improve the fit of the model in part (c) to the data.
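Part (c) of this question is an exercise in undoing the coding. A sketch of the substitution, assuming only the coding \(s = w - 6\), \(t = l - 20\) and Dora's line \(t = 0.00431 s - 18.87\) as stated in the question; variable names are illustrative.

```python
# Dora's coded line: t = b_coded * s + a_coded, with s = w - 6 and t = l - 20.
b_coded, a_coded = 0.00431, -18.87
w_shift, l_shift = 6, 20

# Substitute: l - l_shift = b_coded * (w - w_shift) + a_coded
# A pure shift of origin leaves the gradient unchanged.
b = b_coded
a = l_shift + a_coded - b_coded * w_shift

# Part (d): estimated yield when w = 100.
estimate_at_100 = a + b * 100

print(f"l = {a:.3g} + {b:.3g} w; estimate at w = 100: {estimate_at_100:.3f}")
```

The unchanged gradient can be cross-checked against Jim's statistics, since \(S_{wl} / S_{ww} = 42.786 / 9936.9\) is also about 0.00431.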
Edexcel FS2 AS 2020 June Q4
14 marks Standard +0.3
  1. Some students are investigating the strength of wire by suspending a weight at the end of the wire. They measure the diameter of the wire, \(d \mathrm {~mm}\), and the weight, \(w\) grams, when the wire fails. Their results are given in the following table.
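\begin{tabular}{|c|cccccccccccccc|cccc|}
\hline
 & \multicolumn{14}{c|}{These 14 points are plotted on page 13} & \multicolumn{4}{c|}{Not yet plotted} \\
\hline
\(d\) & 0.5 & 0.6 & 0.7 & 0.8 & 0.9 & 1.1 & 1.3 & 1.6 & 2 & 2.4 & 2.8 & 3.3 & 3.5 & 3.9 & \(\mathbf { 4.5 }\) & \(\mathbf { 4.6 }\) & \(\mathbf { 4.8 }\) & \(\mathbf { 5.4 }\) \\
\hline
\(w\) & 1.2 & 1.7 & 2.3 & 3.0 & 3.8 & 5.6 & 7.7 & 11.6 & 18 & 25.9 & 34.9 & 47.4 & 52.7 & 63.9 & \(\mathbf { 81 }\) & \(\mathbf { 83.6 }\) & \(\mathbf { 89.9 }\) & \(\mathbf { 109.4 }\) \\
\hline
\end{tabular}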
The first 14 points are plotted on the axes on page 13.
  1. On the axes on page 13, complete the scatter diagram for these data.
  2. Use your calculator to write down the equation of the regression line of \(w\) on \(d\).
  3. With reference to the scatter diagram, comment on the appropriateness of using this linear regression model to make predictions for \(w\) for different values of \(d\) between 0.5 and 5.4 The product moment correlation coefficient for these data is \(r = 0.987\) (to 3 significant figures).
  4. Calculate the residual sum of squares (RSS) for this model. Robert, one of the students, suggests that the model could be improved and intends to find the equation of the line of regression of \(w\) on \(u\), where \(u = d ^ { 2 }\) He finds the following statistics $$\mathrm { S } _ { w u } = 5721.625 \quad \mathrm {~S} _ { u u } = 1482.619 \quad \sum u = 157.57$$
  5. By considering the physical nature of the problem, give a reason to support Robert's suggestion.
  6. Find the equation of the regression line of \(w\) on \(u\).
  7. Find the residual sum of squares (RSS) for Robert's model.
  8. State, giving a reason based on these calculations, which of these models better describes these data.
    1. Hence estimate the weight at which a piece of wire with diameter 3 mm will fail. \begin{figure}[h]
      \captionsetup{labelformat=empty} \caption{Question 4 continued} \includegraphics[alt={},max width=\textwidth]{fbd7b196-5372-4956-8d38-92f05c92a5f7-13_2315_1363_301_358}
      \end{figure}
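The comparison of the two models in parts (d) to (h) can be reproduced from the 18 data pairs in the table. This is a sketch, not a model answer: the helper names `s` and `rss` are hypothetical, and the RSS uses the standard identity \(\mathrm{RSS} = S_{yy} - S_{xy}^2 / S_{xx}\) for a least squares line of \(y\) on \(x\).

```python
# The 18 (d, w) pairs from the table in the question.
d = [0.5, 0.6, 0.7, 0.8, 0.9, 1.1, 1.3, 1.6, 2, 2.4, 2.8, 3.3, 3.5, 3.9,
     4.5, 4.6, 4.8, 5.4]
w = [1.2, 1.7, 2.3, 3.0, 3.8, 5.6, 7.7, 11.6, 18, 25.9, 34.9, 47.4, 52.7, 63.9,
     81, 83.6, 89.9, 109.4]


def s(xs, ys):
    """S_xy = sum(xy) - sum(x) * sum(y) / n (gives S_xx when xs is ys)."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n


def rss(xs, ys):
    """Residual sum of squares for the least squares line of ys on xs."""
    return s(ys, ys) - s(xs, ys) ** 2 / s(xs, xs)


# Robert's transformed variable u = d^2.
u = [x ** 2 for x in d]
s_uu, s_wu = s(u, u), s(w, u)

rss_linear = rss(d, w)     # model: w on d
rss_squared = rss(u, w)    # Robert's model: w on u

print(round(s_uu, 3), round(s_wu, 3))
print(round(rss_linear, 1), round(rss_squared, 1))
```

The computed \(S_{uu}\), \(S_{wu}\) and \(\sum u\) agree with the statistics Robert is said to have found, which suggests the table has been read correctly. The model with the smaller RSS is the one that fits these data more closely.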
Edexcel FS2 AS 2022 June Q3
10 marks Standard +0.3
  1. Gabriela is investigating a particular type of fish, called bream. She wants to create a model to predict the weight, \(w\) grams, of bream based on their length, \(x \mathrm {~cm}\).
For a sample of 27 bream, some summary statistics are given below. $$\begin{gathered} \bar { x } = 31.07 \quad \bar { w } = 628.59 \quad \sum w ^ { 2 } = 11386134 \\ \mathrm {~S} _ { x w } = 13082.3 \quad \mathrm {~S} _ { x x } = 260.8 \end{gathered}$$
  1. Find the value of the product moment correlation coefficient between \(x\) and \(w\)
  2. Explain whether the answer to part (a) is consistent with a linear model for these data.
  3. Find the equation of the regression line of \(w\) on \(x\) in the form \(w = a + b x\) A residual plot for these data is shown below. \includegraphics[max width=\textwidth, alt={}, center]{128c408d-3e08-4f74-8f19-d33ecd5c882f-06_931_1790_1107_139} One of the bream in the sample has a length of 32 cm .
  4. Find its weight.
  5. With reference to the residual plot, comment on the model for bream with lengths above 33 cm .
Edexcel FS2 AS 2023 June Q3
10 marks Standard +0.3
  1. Pat is investigating the relationship between the height of professional tennis players and the speed of their serve. Data from 9 randomly selected professional male tennis players were collected. The variables recorded were the height of each player, \(h\) metres, and the maximum speed of their serve, \(v \mathrm {~km} / \mathrm { h }\).
Pat summarised these data as follows $$\sum h = 17.63 \quad \sum v = 2174.9 \quad \sum v ^ { 2 } = 526407.8 \quad S _ { h h } = 0.0487 \quad S _ { h v } = 5.1376$$
  1. Calculate the product moment correlation coefficient between \(h\) and \(v\)
  2. Explain whether the answer to part (a) is consistent with a linear model for these data.
  3. Find the equation of the regression line of \(v\) on \(h\) in the form \(v = a + b h\) where \(a\) and \(b\) are to be given to one decimal place. Pat calculated the sum of the residuals for the 9 tennis players as 1.04
  4. Without doing a calculation, explain how you know Pat has made a mistake. Pat made one mistake in the calculation. For the tennis player of height 1.96 m Pat misread the residual as 2.27
  5. Find the maximum speed of serve, in km/h, for the tennis player of height 1.96 m
Edexcel FS2 AS 2024 June Q5
8 marks Standard +0.3
  1. A random sample of 24 adults is taken. The height, \(h\) metres, and the arm span, \(s\) metres, for each adult are recorded.
These data are summarised below. $$\mathrm { S } _ { h h } = 0.377 \quad \mathrm {~S} _ { s h } = 0.352 \quad \bar { s } = 1.70 \quad \bar { h } = 1.68$$ The least squares regression line of \(h\) on \(s\) is $$h = a + 0.919 s$$ where \(a\) is a constant.
  1. Calculate the product moment correlation coefficient. A doctor uses the least squares regression line of \(h\) on \(s\) as a model to predict a person's height based on their arm span.
  2. Use the model to predict the height of an adult with arm span 1.79 metres. Ewan has an arm span of 1.70 metres and a height of 1.75 metres. His information is added to the sample as the 25th adult.
  3. Explain how the gradient of the regression line for the sample of 25 adults compares with the gradient of the regression line for the original sample of 24 adults.
    Give a reason for your answer.
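One way to approach parts (a) and (b) is sketched below in Python. \(\mathrm{S}_{ss}\) is not given, so the sketch recovers it from the stated gradient, which for a regression of \(h\) on \(s\) equals \(\mathrm{S}_{sh} / \mathrm{S}_{ss}\). The variable names are illustrative and this is not an official solution.

```python
from math import sqrt

S_hh, S_sh = 0.377, 0.352
s_bar, h_bar = 1.70, 1.68
b = 0.919  # given gradient of the regression line of h on s

# gradient = S_sh / S_ss, so S_ss can be recovered from the gradient
S_ss = S_sh / b

# (a) product moment correlation coefficient
r = S_sh / sqrt(S_ss * S_hh)

# (b) the line passes through (s_bar, h_bar), which fixes the intercept
a = h_bar - b * s_bar
predicted_height = a + b * 1.79

print(f"r = {r:.3f}")
print(f"predicted height = {predicted_height:.2f} m")
```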
Edexcel FS2 AS Specimen Q3
11 marks Standard +0.3
  1. A scientist wants to develop a model to describe the relationship between the average daily temperature, \(x ^ { \circ } \mathrm { C }\), and a household's daily energy consumption, \(y \mathrm {~kWh}\), in winter.
A random sample of the average temperature and energy consumption are taken from 10 winter days and are summarised below. $$\begin{gathered} \sum x = 12 \quad \sum x ^ { 2 } = 24.76 \quad \sum y = 251 \quad \sum y ^ { 2 } = 6341 \quad \sum x y = 284.8 \\ S _ { x x } = 10.36 \quad S _ { y y } = 40.9 \end{gathered}$$
  1. Find the product moment correlation coefficient between \(y\) and \(x\).
  2. Find the equation of the regression line of \(y\) on \(x\) in the form \(y = a + b x\)
  3. Use your equation to estimate the daily energy consumption when the average daily temperature is \(2 ^ { \circ } \mathrm { C }\)
  4. Calculate the residual sum of squares (RSS). The table shows the residual for each value of \(x\).
    | \(x\) | -0.4 | -0.2 | 0.3 | 0.8 | 1.1 | 1.4 | 1.8 | 2.1 | 2.5 | 2.6 |
    |---|---|---|---|---|---|---|---|---|---|---|
    | Residual | -0.63 | -0.32 | -0.52 | -0.73 | 0.74 | 2.22 | 1.84 | 0.32 | \(f\) | -1.88 |
  5. Find the value of \(f\).
  6. By considering the signs of the residuals, explain whether or not the linear regression model is a suitable model for these data.
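The numerical parts of this question can be sketched in Python as below. It is a hedged illustration with illustrative variable names, not a mark scheme. It uses \(\mathrm{RSS} = S_{yy} - S_{xy}^2 / S_{xx}\) and the fact that least squares residuals sum to zero.

```python
from math import sqrt

n = 10
sum_x, sum_y, sum_xy = 12, 251, 284.8
S_xx, S_yy = 10.36, 40.9

# S_xy is not given, so build it from the raw sums.
S_xy = sum_xy - sum_x * sum_y / n

# (a) product moment correlation coefficient
r = S_xy / sqrt(S_xx * S_yy)

# (b) regression line of y on x: y = a + b x
b = S_xy / S_xx
a = sum_y / n - b * sum_x / n

# (c) estimated consumption at 2 degrees C
estimate_at_2 = a + b * 2

# (d) residual sum of squares
rss = S_yy - S_xy**2 / S_xx

# (e) the nine known residuals plus f must sum to zero
known_residuals = [-0.63, -0.32, -0.52, -0.73, 0.74, 2.22, 1.84, 0.32, -1.88]
f = -sum(known_residuals)

print(f"r = {r:.3f}, y = {a:.1f} + ({b:.2f}) x")
print(f"estimate at x = 2: {estimate_at_2:.1f} kWh, RSS = {rss:.2f}, f = {f:.2f}")
```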
Edexcel FS2 2019 June Q2
10 marks Standard +0.3
2 A large field of wheat is split into 8 plots of equal area. Each plot is treated with a different amount of fertiliser, \(f\) grams \(/ \mathrm { m } ^ { 2 }\). The yield of wheat, \(w\) tonnes, from each plot is recorded. The results are summarised below. $$\sum f = 28 \quad \sum w = 303 \quad \sum w ^ { 2 } = 13447 \quad \mathrm {~S} _ { f f } = 42 \quad \mathrm {~S} _ { f w } = 269.5$$
  1. Calculate the product moment correlation coefficient between \(f\) and \(w\)
  2. Interpret the value of your product moment correlation coefficient.
  3. Find the equation of the regression line of \(w\) on \(f\) in the form \(w = a + b f\)
  4. Using your equation, estimate the decrease in yield when the amount of fertiliser decreases by 0.5 grams \(/ \mathrm { m } ^ { 2 }\) The residuals of the data recorded are calculated and plotted on the graph below. \includegraphics[max width=\textwidth, alt={}, center]{67df73d4-6ce4-45f7-8a69-aa94292ea814-04_1232_1294_1169_301}
  5. With reference to this graph, comment on the suitability of the model you found in part (c).
  6. Suggest how you might be able to refine your model.
Edexcel FS2 2021 June Q4
10 marks Standard +0.3
  1. A researcher is investigating the relationship between elevation, \(x\) metres, and annual mean temperature, \(t ^ { \circ } \mathrm { C }\).
From a random sample of 20 weather stations in Switzerland, the following results were obtained $$\mathrm { S } _ { x x } = 8820655 \quad \mathrm {~S} _ { t t } = 444.7 \quad \sum x = 28130 \quad \sum t = 94.62$$ The product moment correlation coefficient for these data is found to be \(-0.959\)
  1. Interpret the value of this correlation coefficient.
  2. Show that the equation of the regression line of \(t\) on \(x\) can be written as $$t = 14.3 - 0.00681 x$$ The random variable \(W\) represents the elevations of the weather stations in kilometres.
  3. Write down the equation of the regression line of \(t\) on \(w\) for these 20 weather stations in the form \(t = a + b w\)
  4. Show that the residual sum of squares (RSS) for the model for \(t\) and \(x\) is 35.7 correct to one decimal place. One of the weather stations in the sample had a recorded elevation of 1100 metres and an annual mean temperature of \(1.4 ^ { \circ } \mathrm { C }\)
    1. Calculate this weather station's contribution to the residual sum of squares. Give your answer as a percentage
    2. Comment on the data for this weather station in light of your answer to part (e)(i).
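Parts (b) to (e)(i) can be checked numerically with a short Python sketch. Here \(\mathrm{S}_{xt}\) is not given, so it is recovered from \(r = \mathrm{S}_{xt} / \sqrt{\mathrm{S}_{xx} \mathrm{S}_{tt}}\), and the RSS uses \(\mathrm{S}_{tt}(1 - r^2)\). The variable names are illustrative, and the percentage is computed from unrounded coefficients, so it may differ slightly from a value based on the rounded line.

```python
from math import sqrt

n = 20
S_xx, S_tt = 8820655, 444.7
sum_x, sum_t = 28130, 94.62
r = -0.959

# Recover S_xt from the definition of r.
S_xt = r * sqrt(S_xx * S_tt)

# (b) regression line of t on x: t = a + b x
b = S_xt / S_xx
a = sum_t / n - b * sum_x / n

# (c) elevation in km is w = x / 1000, so the gradient is scaled by 1000
b_km = 1000 * b

# (d) residual sum of squares
rss = S_tt * (1 - r**2)

# (e)(i) contribution of the station at x = 1100, t = 1.4
residual = 1.4 - (a + b * 1100)
share_percent = 100 * residual**2 / rss

print(f"t = {a:.1f} + ({b:.5f}) x")
print(f"t = {a:.1f} + ({b_km:.2f}) w")
print(f"RSS = {rss:.1f}, contribution = {share_percent:.0f}%")
```

With these inputs the single station accounts for roughly 82% of the RSS.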
Edexcel FS2 2022 June Q1
7 marks Standard +0.3
  1. Kwame is investigating a possible relationship between average March temperature, \(t ^ { \circ } \mathrm { C }\), and tea yield, \(y \mathrm {~kg} /\) hectare, for tea grown in a particular location. He uses 30 years of past data to produce the following summary statistics for a linear regression model, with tea yield as the dependent variable.
$$\begin{aligned} & \text { Residual Sum of Squares } ( \mathrm { RSS } ) = 1666567 \quad \mathrm {~S} _ { t t } = 52.0 \quad \mathrm {~S} _ { y y } = 1774155 \\ & \text { least squares regression line: } \quad \text { gradient } = 45.5 \quad y \text {-intercept } = 2080 \end{aligned}$$
  1. Use the regression model to predict the tea yield for an average March temperature of \(20 ^ { \circ } \mathrm { C }\) He also produces the following residual plot for the data. \includegraphics[max width=\textwidth, alt={}, center]{d139840b-16ec-42ce-8501-f79c263c8017-02_663_880_868_589}
  2. Explain what you understand by the term residual.
  3. Calculate the product moment correlation coefficient between \(t\) and \(y\)
  4. Explain why the linear model may not be a good fit for the data
    1. with reference to your answer to part (c)
    2. with reference to the residual plot. Kwame also collects data on total March rainfall, \(w \mathrm {~mm}\), for each of these 30 years. For a linear regression model of \(w\) on \(t\) the following summary statistic is found. $$\text { Residual Sum of Squares (RSS) = } 86754$$ Kwame concludes that since this model has a smaller RSS, there must be a stronger linear relationship between \(w\) and \(t\) than between \(y\) and \(t\) (where RSS \(= 1666567\) )
  5. State, giving a reason, whether or not you agree with the reasoning that led to Kwame's conclusion.
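Parts (a) and (c) can be sketched in Python using the identity \(\mathrm{RSS} = \mathrm{S}_{yy}(1 - r^2)\), with the sign of \(r\) taken from the sign of the gradient. This is an illustrative sketch with illustrative variable names, not the official solution.

```python
from math import sqrt, copysign

rss, S_yy = 1666567, 1774155
gradient, intercept = 45.5, 2080

# (a) predicted tea yield at an average March temperature of 20 degrees C
predicted_yield = intercept + gradient * 20

# (c) RSS = S_yy (1 - r^2), so r^2 = 1 - RSS / S_yy
r_squared = 1 - rss / S_yy

# r has the same sign as the gradient of the regression line
r = copysign(sqrt(r_squared), gradient)

print(f"predicted yield = {predicted_yield:.0f} kg/hectare")
print(f"r = {r:.3f}")
```

The same identity bears on part (e): RSS is measured in the squared units of the dependent variable, so a comparison of the two models would need \(r\) for \(w\) and \(t\), which in turn requires \(\mathrm{S}_{ww}\), and that is not given.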
Edexcel FS2 2023 June Q1
7 marks Easy -1.2
  1. Baako is investigating the times taken by children to run a 100 m race, \(x\) seconds, and a 500 m race, \(y\) seconds. For a sample of 20 children, Baako obtains the time taken by each child to run each race.
Here are Baako's summary statistics. $$\begin{gathered} \mathrm { S } _ { x x } = 314.55 \quad \mathrm {~S} _ { y y } = 9026 \quad \mathrm {~S} _ { x y } = 1610 \\ \bar { x } = 19.65 \quad \bar { y } = 108 \end{gathered}$$
  1. Calculate the product moment correlation coefficient between the times taken to run the 100 m race and the times taken to run the 500 m race.
  2. Show that the equation of the regression line of \(y\) on \(x\) can be written as $$y = 5.12 x + 7.42$$ where the gradient and \(y\) intercept are given to 3 significant figures. The child who completed the 100 m race in 20 seconds took 104 seconds to complete the 500 m race.
  3. Find the residual for this child. The table below shows the signs of the residuals for the 20 children in order of finishing time for the 100 m race.
    Sign of residual: + + + + - - + - - - - - - - - + + + + +
  4. Explain what the signs of the residuals show about the model's predictions of the 500 m race times for the children who are fastest and slowest over the 100 m race.
Edexcel FS2 Specimen Q7
8 marks Standard +0.8
  1. Over a period of time, researchers took 10 blood samples from one patient with a blood disease. For each sample, they measured the levels of serum magnesium, \(s \mathrm { mg } / \mathrm { dl }\), in the blood and the corresponding level of the disease protein, \(d \mathrm { mg } / \mathrm { dl }\). One of the researchers coded the data for each sample using \(x = 10 s\) and \(y = 10 ( d - 9 )\) but spilt ink over his work.
The following summary statistics and unfinished scatter diagram are the only remaining information. $$\sum d ^ { 2 } = 1081.74 \quad \mathrm {~S} _ { d s } = 59.524$$ and $$\sum y = 64 \quad \mathrm {~S} _ { x x } = 2658.9$$ \(d \mathrm { mg } / \mathrm { dl }\) \includegraphics[max width=\textwidth, alt={}, center]{e777c787-0d39-4d84-a0f9-fc4a6712184f-22_983_1534_840_303}
  1. Use the formula for \(\mathrm { S } _ { x x }\) to show that \(\mathrm { S } _ { s s } = 26.589\)
  2. Find the value of the product moment correlation coefficient between \(s\) and \(d\).
  3. With reference to the unfinished scatter diagram, comment on your result in part (b).
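The decoding in parts (a) and (b) can be sketched in Python as below. Since \(x = 10s\), \(\mathrm{S}_{xx} = 10^2 \mathrm{S}_{ss}\), and since \(y = 10(d - 9)\), each \(d = y/10 + 9\). The variable names are illustrative and this is not an official solution.

```python
from math import sqrt

n = 10
sum_d2, S_ds = 1081.74, 59.524
sum_y, S_xx = 64, 2658.9

# (a) x = 10 s, so S_xx = 100 * S_ss
S_ss = S_xx / 10**2

# (b) d = y / 10 + 9 for each of the n samples, so sum the decoded values
sum_d = sum_y / 10 + 9 * n
S_dd = sum_d2 - sum_d**2 / n

# product moment correlation coefficient between s and d
r = S_ds / sqrt(S_ss * S_dd)

print(f"S_ss = {S_ss:.3f}, S_dd = {S_dd:.3f}, r = {r:.3f}")
```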
OCR FS1 AS 2017 December Q5
8 marks Moderate -0.5
5 A shop manager recorded the maximum daytime temperature \(T ^ { \circ } \mathrm { C }\) and the number \(C\) of ice creams sold on 9 summer days. The results are given in the table and illustrated in the scatter diagram.
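| \(T\) | 17 | 21 | 25 | 26 | 27 | 27 | 29 | 30 | 30 |
|---|---|---|---|---|---|---|---|---|---|
| \(C\) | 21 | 16 | 20 | 38 | 32 | 37 | 35 | 39 | 42 |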
\includegraphics[max width=\textwidth, alt={}]{64d7ed6d-fadd-4c59-afb0-97d1788ba369-3_661_1189_1320_431}
$$n = 9 , \Sigma t = 232 , \Sigma c = 280 , \Sigma t ^ { 2 } = 6130 , \Sigma c ^ { 2 } = 9444 , \Sigma t c = 7489$$
  1. State, with a reason, whether one of the variables \(C\) or \(T\) is likely to be dependent upon the other.
  2. Calculate Pearson's product-moment correlation coefficient \(r\) for the data.
  3. State with a reason what the value of \(r\) would have been if the temperature had been measured in \({ } ^ { \circ } \mathrm { F }\) rather than \({ } ^ { \circ } \mathrm { C }\).
  4. Calculate the equation of the least squares regression line of \(c\) on \(t\).
  5. The regression line is drawn on the copy of the scatter diagram in the Printed Answer Booklet. Use this diagram to explain what is meant by "least squares".
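Because the raw data are given, parts (b) to (d) can be checked directly. The Python sketch below uses a hypothetical helper, `pmcc`, written for this illustration. It recomputes \(r\) after converting the temperatures to \({ } ^ { \circ } \mathrm { F }\) to show that a linear change of units leaves \(r\) unchanged. It is not an official solution.

```python
from math import sqrt

T = [17, 21, 25, 26, 27, 27, 29, 30, 30]  # temperature, degrees C
C = [21, 16, 20, 38, 32, 37, 35, 39, 42]  # ice creams sold


def pmcc(xs, ys):
    """Pearson's product-moment correlation coefficient (hypothetical helper)."""
    n = len(xs)
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    s_yy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    return s_xy / sqrt(s_xx * s_yy)


# (b) r for the data as recorded
r_celsius = pmcc(T, C)

# (c) the same data with temperature in degrees F (a linear rescaling)
T_fahrenheit = [9 * t / 5 + 32 for t in T]
r_fahrenheit = pmcc(T_fahrenheit, C)

# (d) least squares regression line of c on t: c = a + b t
n = len(T)
S_tt = sum(t * t for t in T) - sum(T) ** 2 / n
S_tc = sum(t * c for t, c in zip(T, C)) - sum(T) * sum(C) / n
b = S_tc / S_tt
a = sum(C) / n - b * sum(T) / n

print(f"r (C) = {r_celsius:.4f}, r (F) = {r_fahrenheit:.4f}")
print(f"c = {a:.1f} + {b:.2f} t")
```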