Interpret features of scatter diagram

A question is this sub-type if and only if it provides a scatter diagram and requires interpretation of its features such as correlation strength, outliers, or relationship patterns without requiring drawing.

9 questions · Moderate -0.6

5.09a Dependent/independent variables · 5.09b Least squares regression: concepts · 5.09c Calculate regression line · 5.09e Use regression: for estimation in context
OCR MEI S2 2008 January Q1
18 marks Moderate -0.3
1 A biology student is carrying out an experiment to study the effect of a hormone on the growth of plant shoots. The student applies the hormone at various concentrations to a random sample of twelve shoots and measures the growth of each shoot. The data are illustrated on the scatter diagram below, together with the summary statistics for these data. The variables \(x\) and \(y\), measured in suitable units, represent concentration and growth respectively. \includegraphics[max width=\textwidth, alt={}, center]{20fc4222-95c6-4b59-8e89-913dd988eb44-2_693_897_534_625} $$n = 12 , \Sigma x = 30 , \Sigma y = 967.6 , \Sigma x ^ { 2 } = 90 , \Sigma y ^ { 2 } = 78926 , \Sigma x y = 2530.3 .$$
  1. State which of the two variables \(x\) and \(y\) is the independent variable and which is the dependent variable. Briefly explain your answers.
  2. Calculate the equation of the regression line of \(y\) on \(x\).
  3. Use the equation of the regression line to calculate estimates of shoot growth for concentrations of
    (A) 1.2,
    (B) 4.3.
    Comment on the reliability of each of these estimates.
  4. Calculate the value of the residual for the data point where \(x = 3\) and \(y = 80\).
  5. In further experiments, the student finds that using concentration \(x = 6\) results in shoot growths of around \(y = 20\). In the light of all the available information, what can be said about the relationship between \(x\) and \(y\) ?
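The calculations in parts 2 to 4 can be sketched from the summary statistics alone. The following is a minimal illustration, not a mark scheme: the helper names `regression_from_sums` and `predict` are hypothetical, and the residual is taken as observed minus predicted.

```python
# Sketch: regression line of y on x from summary statistics.
# `regression_from_sums` and `predict` are hypothetical helper names.

def regression_from_sums(n, sum_x, sum_y, sum_x2, sum_xy):
    """Return (intercept, gradient) of the least squares line of y on x."""
    s_xx = sum_x2 - sum_x ** 2 / n        # spread of x about its mean
    s_xy = sum_xy - sum_x * sum_y / n     # co-variation of x and y
    gradient = s_xy / s_xx
    intercept = sum_y / n - gradient * sum_x / n
    return intercept, gradient

# Summary statistics quoted in the question.
a, b = regression_from_sums(n=12, sum_x=30, sum_y=967.6, sum_x2=90, sum_xy=2530.3)

def predict(x):
    """Estimated growth at concentration x."""
    return a + b * x

estimate_A = predict(1.2)
estimate_B = predict(4.3)
residual = 80 - predict(3)   # observed y minus predicted y at x = 3

print(f"y = {a:.2f} + {b:.2f}x")          # y = 62.08 + 7.42x
print(f"{estimate_A:.1f} {estimate_B:.1f} {residual:.2f}")
```

Whether each estimate counts as interpolation or extrapolation depends on the range of \(x\) shown in the scatter diagram, which the code cannot see.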
OCR MEI S2 Q3
18 marks Standard +0.3
3 In a triathlon, competitors have to swim 600 metres, cycle 40 kilometres and run 10 kilometres. To improve her strength, a triathlete undertakes a training programme in which she carries weights in a rucksack whilst running. She runs a specific course and notes the total time taken for each run. Her coach is investigating the relationship between time taken and weight carried. The times taken with eight different weights are illustrated on the scatter diagram below, together with the summary statistics for these data. The variables \(x\) and \(y\) represent weight carried in kilograms and time taken in minutes respectively. \includegraphics[max width=\textwidth, alt={}, center]{d138173d-c70c-46db-b9b9-d5f19334c5f1-04_627_1536_630_281} Summary statistics: \(n = 8 , \Sigma x = 36 , \Sigma y = 214.8 , \Sigma x ^ { 2 } = 204 , \Sigma y ^ { 2 } = 5775.28 , \Sigma x y = 983.6\).
  1. Calculate the equation of the regression line of \(y\) on \(x\). On one of the eight runs, the triathlete was carrying 4 kilograms and took 27.5 minutes. On this run she was delayed when she tripped and fell over.
  2. Calculate the value of the residual for this weight.
  3. The coach decides to recalculate the equation of the regression line without the data for this run. Would it be preferable to use this recalculated equation or the equation found in part (i) to estimate the delay when the triathlete tripped and fell over? Explain your answer. The triathlete's coach claims that there is positive correlation between cycling and swimming times in triathlons. The product moment correlation coefficient of the times of twenty randomly selected competitors in these two sections is 0.209 .
  4. Carry out a hypothesis test at the \(5 \%\) level to examine the coach's claim, explaining your conclusions clearly.
  5. What distributional assumption is necessary for this test to be valid? How can you use a scatter diagram to decide whether this assumption is likely to be true?
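Parts 2 and 3 compare the line fitted to all eight runs with the line refitted without the run on which she fell. As a rough sketch, the refit can be done by subtracting that one point from each summary sum; the helper name `fit` is hypothetical, and the residual is taken as observed minus predicted.

```python
# Sketch: effect of removing one data point from a regression fitted
# from summary statistics. `fit` is a hypothetical helper name.

def fit(n, sum_x, sum_y, sum_x2, sum_xy):
    """Return (intercept, gradient) of the regression line of y on x."""
    s_xx = sum_x2 - sum_x ** 2 / n
    s_xy = sum_xy - sum_x * sum_y / n
    gradient = s_xy / s_xx
    return sum_y / n - gradient * sum_x / n, gradient

# Summary statistics from the question (sum of y squared is not needed here).
n, sum_x, sum_y, sum_x2, sum_xy = 8, 36, 214.8, 204, 983.6
x0, y0 = 4, 27.5   # the run on which the triathlete tripped

# Line using all eight runs, and the residual for the delayed run.
a_all, b_all = fit(n, sum_x, sum_y, sum_x2, sum_xy)
residual_all = y0 - (a_all + b_all * x0)

# Line with the delayed run removed from every sum.
a_wo, b_wo = fit(n - 1, sum_x - x0, sum_y - y0, sum_x2 - x0 ** 2, sum_xy - x0 * y0)
delay_estimate = y0 - (a_wo + b_wo * x0)

print(f"{residual_all:.2f} {delay_estimate:.2f}")   # 0.85 0.98
```

The refitted line gives the larger figure because the delayed run no longer pulls the line towards itself, which is one argument for preferring the recalculated equation when estimating the delay.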
OCR MEI Paper 2 2020 November Q11
10 marks Moderate -0.8
11 The pre-release material contains information concerning median house prices over the period 2004-2015. A spreadsheet has been used to generate a time series graph for two areas: the London borough of "Barking and Dagenham" and "North West". This is shown together with the raw data in Fig. 11.1. \begin{figure}[h]
\includegraphics[alt={},max width=\textwidth]{cea67565-8074-4703-8e1a-09b98e380baf-12_572_1751_447_159} \captionsetup{labelformat=empty} \caption{Fig. 11.1}
\end{figure} Dr Procter suggests that it is unusual for median house prices in a London borough to be consistently higher than those in other parts of the country.
  1. Use your knowledge of the large data set to comment on Dr Procter's suggestion. Dr Procter wishes to predict the median house price in Barking and Dagenham in 2016. She uses the spreadsheet function LINEST to find the equation of the line of best fit for the given data. She obtains the equation \(P = 4897 Y - 9657847\), where \(P\) is the median house price in pounds and \(Y\) is the calendar year, for example 2015.
  2. Use Dr Procter's equation to predict the median house price in Barking and Dagenham in
    Professor Jackson uses a simpler model by using the data from 2014 and 2015 only to form a straight-line model.
  3. Find the equation Professor Jackson uses in her model.
  4. Use Professor Jackson's equation to predict the median house price in Barking and Dagenham in
    Professor Jackson carries out some research online. She finds some information about median house prices in Barking and Dagenham, which is shown in Fig. 11.2. \begin{table}[h]
    | 2016 | 2017 |
    | \(\pounds 290000\) | \(\pounds 300000\) |
    \captionsetup{labelformat=empty} \caption{Fig. 11.2}
    \end{table}
  5. Comment on how well
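Dr Procter's model can be evaluated directly. The sketch below assumes the years of interest are 2016 and 2017 (the question text is truncated) and that Fig. 11.2 pairs 2016 with £290000 and 2017 with £300000; the function name `procter_price` is hypothetical. Professor Jackson's model is not attempted because the 2014 and 2015 values appear only in Fig. 11.1.

```python
# Sketch: evaluating Dr Procter's line of best fit, P = 4897Y - 9657847.
# `procter_price` is a hypothetical name; the years below are assumptions.

def procter_price(year):
    """Predicted median house price in pounds for a calendar year."""
    return 4897 * year - 9657847

# Assumed reading of Fig. 11.2: year -> reported median price in pounds.
reported = {2016: 290000, 2017: 300000}

for year, actual in reported.items():
    predicted = procter_price(year)
    # Print the prediction and how far it falls short of the reported value.
    print(year, predicted, actual - predicted)
# 2016 214505 75495
# 2017 219402 80598
```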
OCR MEI Further Statistics A AS 2020 November Q5
8 marks Moderate -0.3
5 A doctor is investigating the relationship between the levels in the blood of a particular hormone and of calcium in healthy adults. The levels of the hormone and of calcium, each measured in suitable units, are denoted by \(x\) and \(y\) respectively. The doctor selects a random sample of 14 adults and measures the hormone and calcium levels in each of them. The spreadsheet in Fig. 5 shows the values obtained, together with a scatter diagram which illustrates the data. The equation of the regression line of \(y\) on \(x\) is shown on the scatter diagram, together with the value of the square of the product moment correlation coefficient. \begin{figure}[h]
\includegraphics[alt={},max width=\textwidth]{ba3fcd3c-6834-4116-be0e-d5b27aed0a7e-5_801_1644_646_255} \captionsetup{labelformat=empty} \caption{Fig. 5}
\end{figure}
  1. Use the equation of the regression line to estimate the mean calcium level of people with the following hormone levels.
OCR MEI Further Statistics Minor 2023 June Q5
8 marks Moderate -0.8
5 An ornithologist is investigating the link between the wing length and the mass of small birds, in order to try to predict the mass from the wing length without having to weigh birds. The ornithologist takes a random sample of 9 birds and measures their wing lengths \(w \mathrm {~mm}\) and their masses \(m \mathrm {~g}\). The spreadsheet below shows the data, together with a scatter diagram which illustrates the data. \includegraphics[max width=\textwidth, alt={}, center]{72215d69-c3e6-492d-bb3e-bdc28aeb4613-5_719_1424_495_246}
  1. Find the equation of the regression line of \(m\) on \(w\), giving the coefficients correct to \(\mathbf { 3 }\) significant figures.
  2. Use the equation which you found in part (a) to estimate the mass for each of the following wing lengths.
    Comment on this suggestion.
OCR MEI Further Statistics Minor 2021 November Q2
9 marks Moderate -0.8
2 A road transport researcher is investigating the link between the age of a person, \(a\) years, and the distance, \(d\) metres, at which the person can read a large road sign. The researcher selects 13 individuals of different ages between 20 and 80 and measures the value of \(d\) for each of them. The spreadsheet below shows the data which the researcher obtained, together with a scatter diagram which illustrates the data. \includegraphics[max width=\textwidth, alt={}, center]{691e8b55-e9a1-4fff-b9ee-a71ff1f73ead-3_725_1566_495_251}
  1. Explain which of the two variables \(a\) and \(d\) is the independent variable.
  2. Find the equation of the regression line of \(d\) on \(a\).
  3. Use the regression line to predict the average distance at which a 60-year-old person can read the road sign.
  4. Explain why it might not be sensible to use the regression line to predict the average distance at which a 5 -year-old child can read the road sign.
  5. Determine the value of the residual for \(a = 40\).
  6. Explain why it would not be useful to find the equation of the regression line of \(a\) on \(d\).
OCR MEI Further Statistics Major 2019 June Q6
18 marks Moderate -0.8
6
  1. A researcher is investigating the date of the 'start of spring' at different locations around the country.
    A suitable date (measured in days from the start of the year) can be identified by checking, for example, when buds first appear for certain species of trees and plants, but this is time-consuming and expensive. Satellite data, measuring microwave emissions, can alternatively be used to estimate the date that land-based measurements would give. The researcher chooses a random sample of 12 locations, and obtains land-based measurements for the start of spring date at each location, together with relevant satellite measurements. The scatter diagram in Fig. 6.1 shows the results; the land-based measurements are denoted by \(x\) days and the corresponding values derived from satellite measurements by \(y\) days. \begin{figure}[h]
    \includegraphics[alt={},max width=\textwidth]{3a89edc4-ac93-4691-ade8-4d4665b55202-06_732_1342_781_333} \captionsetup{labelformat=empty} \caption{Fig. 6.1}
    \end{figure} Fig. 6.2 shows part of a spreadsheet used to analyse the data. Some rows of the spreadsheet have been deliberately omitted. \begin{table}[h]
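    |     | A   | B       | C       | D         | E         | F        |
    | 1   |     | \(x\)   | \(y\)   | \(x^2\)   | \(y^2\)   | \(xy\)   |
    | 2   |     | 90      | 102     | 8100      | 10404     | 9180     |
    | 3   |     |         |         |           |           |          |
    | 10  |     |         |         |           |           |          |
    | 11  |     |         |         |           |           |          |
    | 12  |     | 94      | 97      | 8836      | 9409      | 9118     |
    | 13  |     | 99      | 101     | 9801      | 10201     | 9999     |
    | 14  | Sum | 1131    | 1227    | 107783    | 126725    | 116724   |
    | 15  |     |         |         |           |           |          |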
    \captionsetup{labelformat=empty} \caption{Fig. 6.2}
    \end{table}
    1. Calculate the equation of a regression line suitable for estimating the land-based date of the start of spring from satellite measurements.
    2. Using this equation, estimate the land-based date of the start of spring for the following dates from satellite measurements.
      • 95 days
      • 60 days
    3. Comment on the reliability of each of your estimates.
  2. The researcher is also investigating whether there is any correlation between the average temperature during a month in spring and the total rainfall during that month at a particular location. The average temperatures in degrees Celsius and total rainfall in mm for a random selection, over several years, of 10 spring months at this location are as follows.
      | Temperature | 4.2 | 7.1 | 5.6 | 3.5 | 8.6 | 6.5 | 2.7 | 5.9 | 6.7 | 4.1 |
      | Rainfall    | 18  | 26  | 42  | 76  | 15  | 43  | 84  | 53  | 66  | 36  |
      The researcher plots the scatter diagram shown in Fig. 6.3 to check which type of test to carry out. \begin{figure}[h]
      \includegraphics[alt={},max width=\textwidth]{3a89edc4-ac93-4691-ade8-4d4665b55202-07_693_880_1174_338} \captionsetup{labelformat=empty} \caption{Fig. 6.3}
      \end{figure}
      1. Explain why the researcher might come to the conclusion that a test based on Pearson's product moment correlation coefficient may be valid.
      2. Find the value of Pearson's product moment correlation coefficient.
      3. Carry out a test at the \(5 \%\) significance level to investigate whether there is any correlation between temperature and rainfall.
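Two calculations in this question can be sketched from the data given. Because the land-based date \(x\) is to be estimated from the satellite value \(y\), the first part below fits the regression line of \(x\) on \(y\) using the column sums in Fig. 6.2 with \(n = 12\). The second part computes Pearson's product moment correlation coefficient from the temperature and rainfall values. The names `land_based_estimate` and `pearson_r` are hypothetical, and no critical value is included because that comes from statistical tables.

```python
from math import sqrt

# Regression of x on y from the column sums in Fig. 6.2 (n = 12 locations).
n = 12
sum_x, sum_y = 1131, 1227
sum_y2, sum_xy = 126725, 116724

s_yy = sum_y2 - sum_y ** 2 / n
s_xy = sum_xy - sum_x * sum_y / n
gradient = s_xy / s_yy                       # change in x per unit change in y
intercept = sum_x / n - gradient * sum_y / n

def land_based_estimate(y):
    """Estimated land-based date (days) from a satellite value y (days)."""
    return intercept + gradient * y

print(f"x = {intercept:.2f} + {gradient:.3f}y")   # x = 6.96 + 0.854y

# Pearson's coefficient for the temperature and rainfall data.
temperature = [4.2, 7.1, 5.6, 3.5, 8.6, 6.5, 2.7, 5.9, 6.7, 4.1]
rainfall = [18, 26, 42, 76, 15, 43, 84, 53, 66, 36]

def pearson_r(u, v):
    """Product moment correlation coefficient of two equal-length lists."""
    m = len(u)
    mean_u, mean_v = sum(u) / m, sum(v) / m
    s_uv = sum((p - mean_u) * (q - mean_v) for p, q in zip(u, v))
    s_uu = sum((p - mean_u) ** 2 for p in u)
    s_vv = sum((q - mean_v) ** 2 for q in v)
    return s_uv / sqrt(s_uu * s_vv)

r = pearson_r(temperature, rainfall)
print(f"r = {r:.3f}")
```

An estimate for a satellite value of 60 days would call `land_based_estimate(60)` in the same way, although whether 60 lies inside the observed range of \(y\) has to be judged from Fig. 6.1.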
OCR MEI Further Statistics Major 2024 June Q8
14 marks Moderate -0.3
8 An estate agent collects data for a random selection of 13 flats in order to investigate the link between the floor areas of flats and their price. The scatter diagram shows the floor areas, \(x \mathrm {~m} ^ { 2 }\), and prices, \(\pounds y\) thousand, of the 13 flats. \includegraphics[max width=\textwidth, alt={}, center]{bab116b3-6e5f-44db-ac86-670e4040d649-07_613_1246_386_242}
  1. The estate agent notes that two of the data points are outliers. One is Flat A which has a large floor area but is in poor condition. The other is Flat B which has a balcony with a desirable view overlooking the sea. Label these two data points on the copy of the scatter diagram in the Printed Answer Booklet. The estate agent decides to remove these two data points from the analysis. Summary statistics for the remaining 11 flats are as follows. $$\sum x = 652.5 \quad \sum y = 5067 \quad \sum x ^ { 2 } = 41987.35 \quad \sum y ^ { 2 } = 2456813 \quad \sum x y = 315928.2$$
  2. In this question you must show detailed reasoning. Calculate the equation of a regression line which is suitable for estimating the price of a flat from its floor area.
  3. Use the regression line to estimate the price for the following floor areas.
    Comment briefly on the estate agent's idea.
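The detailed reasoning asked for in part 2 might run along the following lines. This is an illustrative working only: it assumes the regression line of \(y\) on \(x\) is the one intended, uses \(n = 11\), and introduces the notation \(S_{xx}\) and \(S_{xy}\), which does not appear in the question.

$$S_{xy} = \sum xy - \frac{\sum x \sum y}{n} = 315928.2 - \frac{652.5 \times 5067}{11} \approx 15362.97$$

$$S_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n} = 41987.35 - \frac{652.5^2}{11} \approx 3282.24$$

$$b = \frac{S_{xy}}{S_{xx}} \approx 4.6806 , \quad a = \bar{y} - b \bar{x} \approx 460.636 - 4.6806 \times 59.318 \approx 182.99$$

This gives approximately \(y = 182.99 + 4.68 x\), with \(y\) in thousands of pounds and \(x\) in square metres.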
OCR MEI Paper 2 Specimen Q16
20 marks Easy -1.8
Fig. 16.1, Fig. 16.2 and Fig. 16.3 show some data about life expectancy, including some from the pre-release data set. \includegraphics{figure_16_1} \includegraphics{figure_16_2} \includegraphics{figure_16_3}
  1. Comment on the shapes of the distributions of life expectancy at birth in 2014 and 1974. [2]
    1. The minimum value shown in the box plot is negative. What does a negative value indicate? [1]
    2. What feature of Fig 16.3 suggests that a Normal distribution would not be an appropriate model for increase in life expectancy from one year to another year? [1]
    3. Software has been used to obtain the values in the table in Fig. 16.3. Decide whether the level of accuracy is appropriate. Justify your answer. [1]
    4. John claims that for half the people in the world their life expectancy has improved by 10 years or more. Explain why Fig. 16.3 does not provide conclusive evidence for John's claim. [1]
  2. Decide whether the maximum increase in life expectancy from 1974 to 2014 is an outlier. Justify your answer. [3]
Here is some further information from the pre-release data set.
| Country | Life expectancy at birth in 2014 |
| Ethiopia | 60.8 |
| Sweden | 81.9 |
    1. Estimate the change in life expectancy at birth for Ethiopia between 1974 and 2014.
    2. Estimate the change in life expectancy at birth for Sweden between 1974 and 2014.
    3. Give one possible reason why the answers to parts (i) and (ii) are so different. [4]
Fig. 16.4 shows the relationship between life expectancy at birth in 2014 and 1974. \includegraphics{figure_16_4} A spreadsheet gives the following linear model for all the data in Fig 16.4. (Life expectancy at birth 2014) = 30.98 + 0.67 × (Life expectancy at birth 1974) The life expectancy at birth in 1974 for the region that now constitutes the country of South Sudan was 37.4 years. The value for this country in 2014 is not available.
    1. Use the linear model to estimate the life expectancy at birth in 2014 for South Sudan. [2]
    2. Give two reasons why your answer to part (i) is not likely to be an accurate estimate for the life expectancy at birth in 2014 for South Sudan. You should refer to both information from Fig 16.4 and your knowledge of the large data set. [2]
  1. In how many of the countries represented in Fig. 16.4 did life expectancy drop between 1974 and 2014? Justify your answer. [3]
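The South Sudan estimate is a direct substitution into the spreadsheet's linear model. A minimal sketch follows; the function names `model_2014` and `dropped` are hypothetical, and `dropped` simply encodes the idea that life expectancy fell wherever the 2014 value is below the 1974 value, which on Fig. 16.4 means a point below the line \(y = x\).

```python
# Sketch: using the linear model quoted for Fig. 16.4.
# `model_2014` and `dropped` are hypothetical helper names.

def model_2014(le_1974):
    """Modelled life expectancy at birth in 2014 from the 1974 value."""
    return 30.98 + 0.67 * le_1974

def dropped(le_1974, le_2014):
    """True if life expectancy fell between 1974 and 2014."""
    return le_2014 < le_1974

south_sudan_2014 = model_2014(37.4)
print(f"{south_sudan_2014:.1f}")   # 56.0
```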