Interpret features of scatter diagram

A question is this sub-type if and only if it provides a scatter diagram and requires interpretation of its features such as correlation strength, outliers, or relationship patterns without requiring drawing.

8 questions

OCR MEI S2 2008 January Q1
1 A biology student is carrying out an experiment to study the effect of a hormone on the growth of plant shoots. The student applies the hormone at various concentrations to a random sample of twelve shoots and measures the growth of each shoot. The data are illustrated on the scatter diagram below, together with the summary statistics for these data. The variables \(x\) and \(y\), measured in suitable units, represent concentration and growth respectively.
\includegraphics[max width=\textwidth, alt={}, center]{20fc4222-95c6-4b59-8e89-913dd988eb44-2_693_897_534_625}
$$n = 12, \quad \Sigma x = 30, \quad \Sigma y = 967.6, \quad \Sigma x^{2} = 90, \quad \Sigma y^{2} = 78926, \quad \Sigma xy = 2530.3$$
  1. State which of the two variables \(x\) and \(y\) is the independent variable and which is the dependent variable. Briefly explain your answers.
  2. Calculate the equation of the regression line of \(y\) on \(x\).
  3. Use the equation of the regression line to calculate estimates of shoot growth for concentrations of
    (A) 1.2,
    (B) 4.3.
    Comment on the reliability of each of these estimates.
  4. Calculate the value of the residual for the data point where \(x = 3\) and \(y = 80\).
  5. In further experiments, the student finds that using concentration \(x = 6\) results in shoot growths of around \(y = 20\). In the light of all the available information, what can be said about the relationship between \(x\) and \(y\)?
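As a rough check on parts 2 to 4 of the question above, the regression line of \(y\) on \(x\) can be computed directly from the quoted summary statistics. The sketch below is illustrative only and is not a mark scheme; the names `regression_from_sums` and `predict` are assumptions introduced here.

```python
def regression_from_sums(n, sum_x, sum_y, sum_x2, sum_xy):
    """Return (slope, intercept) of the least-squares line of y on x.

    The arguments are the raw sums n, sum x, sum y, sum x^2 and sum xy.
    """
    s_xx = sum_x2 - sum_x * sum_x / n   # S_xx = sum x^2 - (sum x)^2 / n
    s_xy = sum_xy - sum_x * sum_y / n   # S_xy = sum xy - (sum x)(sum y) / n
    slope = s_xy / s_xx
    # The least-squares line passes through the mean point (x-bar, y-bar).
    intercept = sum_y / n - slope * sum_x / n
    return slope, intercept


# Summary statistics quoted in the question.
slope, intercept = regression_from_sums(12, 30, 967.6, 90, 2530.3)


def predict(x):
    """Fitted growth for concentration x on the regression line."""
    return intercept + slope * x


estimate_a = predict(1.2)
estimate_b = predict(4.3)

# Residual = observed value minus fitted value at the same x.
residual = 80 - predict(3)
```

Under these assumptions the line is roughly \(y = 7.42x + 62.1\) and the residual at \((3, 80)\) is about \(-4.34\). Whether each estimate is an interpolation or an extrapolation has to be judged from the scatter diagram, which is not reproduced in the code.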
OCR S1 2013 January Q3
3 The Gross Domestic Product per Capita (GDP), \(x\) dollars, and the Infant Mortality Rate per thousand (IMR), \(y\), of 6 African countries were recorded and summarised as follows. $$n = 6 \quad \sum x = 7000 \quad \sum x ^ { 2 } = 8700000 \quad \sum y = 456 \quad \sum y ^ { 2 } = 36262 \quad \sum x y = 509900$$
  1. Calculate the equation of the regression line of \(y\) on \(x\) for these 6 countries.
    The original data were plotted on a scatter diagram and the regression line of \(y\) on \(x\) was drawn, as shown below.
    \includegraphics[max width=\textwidth, alt={}, center]{13d8d940-fd63-4b62-bd7a-aa7174f6af4b-3_721_1246_680_408}
  2. The GDP for another country, Tanzania, is 1300 dollars. Use the regression line in the diagram to estimate the IMR of Tanzania.
  3. The GDP for Nigeria is 2400 dollars. Give two reasons why the regression line is unlikely to give a reliable estimate for the IMR for Nigeria.
  4. The actual value of the IMR for Tanzania is 96. The data for Tanzania \(( x = 1300 , y = 96 )\) is now included with the original 6 countries. Calculate the value of the product moment correlation coefficient, \(r\), for all 7 countries.
  5. The IMR is now redefined as the infant mortality rate per hundred instead of per thousand, and the value of \(r\) is recalculated for all 7 countries. Without calculation, state what effect, if any, this would have on the value of \(r\) found in part 4.
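For parts 4 and 5 of the question above, one possible approach is to update each summary statistic with the Tanzania point and then apply the usual formula \(r = S_{xy} / \sqrt{S_{xx} S_{yy}}\). The sketch below follows that approach and also checks numerically that dividing every \(y\) value by 10 leaves \(r\) unchanged; the helper name `pmcc_from_sums` is an assumption introduced here, and the numerical check is not a substitute for the "without calculation" argument the question asks for.

```python
from math import sqrt


def pmcc_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Product moment correlation coefficient from raw sums."""
    s_xx = sum_x2 - sum_x * sum_x / n
    s_yy = sum_y2 - sum_y * sum_y / n
    s_xy = sum_xy - sum_x * sum_y / n
    return s_xy / sqrt(s_xx * s_yy)


# Sums for the original 6 countries, as quoted in the question.
n, sum_x, sum_x2 = 6, 7000, 8700000
sum_y, sum_y2, sum_xy = 456, 36262, 509900

# Add Tanzania (x = 1300, y = 96) to every sum.
x_new, y_new = 1300, 96
n7 = n + 1
sum_x7 = sum_x + x_new
sum_y7 = sum_y + y_new
sum_x2_7 = sum_x2 + x_new ** 2
sum_y2_7 = sum_y2 + y_new ** 2
sum_xy7 = sum_xy + x_new * y_new

r7 = pmcc_from_sums(n7, sum_x7, sum_y7, sum_x2_7, sum_y2_7, sum_xy7)

# Rescale y from "per thousand" to "per hundred": each y is divided by 10,
# so sum y and sum xy scale by 0.1 and sum y^2 scales by 0.01.
k = 0.1
r7_rescaled = pmcc_from_sums(
    n7, sum_x7, k * sum_y7, sum_x2_7, k * k * sum_y2_7, k * sum_xy7
)
```

With these sums, `r7` comes out at about \(-0.606\), and `r7_rescaled` agrees with it to floating-point accuracy, which is consistent with \(r\) being unaffected by a positive linear change of scale.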
OCR MEI S2 Q3
3 In a triathlon, competitors have to swim 600 metres, cycle 40 kilometres and run 10 kilometres. To improve her strength, a triathlete undertakes a training programme in which she carries weights in a rucksack whilst running. She runs a specific course and notes the total time taken for each run. Her coach is investigating the relationship between time taken and weight carried. The times taken with eight different weights are illustrated on the scatter diagram below, together with the summary statistics for these data. The variables \(x\) and \(y\) represent weight carried in kilograms and time taken in minutes respectively.
\includegraphics[max width=\textwidth, alt={}, center]{d138173d-c70c-46db-b9b9-d5f19334c5f1-04_627_1536_630_281}
Summary statistics: \(n = 8, \Sigma x = 36, \Sigma y = 214.8, \Sigma x^{2} = 204, \Sigma y^{2} = 5775.28, \Sigma xy = 983.6\).
  1. Calculate the equation of the regression line of \(y\) on \(x\).
    On one of the eight runs, the triathlete was carrying 4 kilograms and took 27.5 minutes. On this run she was delayed when she tripped and fell over.
  2. Calculate the value of the residual for this weight.
  3. The coach decides to recalculate the equation of the regression line without the data for this run. Would it be preferable to use this recalculated equation or the equation found in part 1 to estimate the delay when the triathlete tripped and fell over? Explain your answer.
    The triathlete's coach claims that there is positive correlation between cycling and swimming times in triathlons. The product moment correlation coefficient of the times of twenty randomly selected competitors in these two sections is 0.209.
  4. Carry out a hypothesis test at the \(5 \%\) level to examine the coach's claim, explaining your conclusions clearly.
  5. What distributional assumption is necessary for this test to be valid? How can you use a scatter diagram to decide whether this assumption is likely to be true?
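For part 4 of the question above, an exam answer would normally compare \(r = 0.209\) with a tabulated critical value for the PMCC. The sketch below uses an equivalent route through the \(t\) distribution instead, so it is a swapped-in technique rather than the expected method. It assumes a bivariate Normal population, and the critical value 1.734 is an assumption taken from standard \(t\) tables (one-tailed, 5%, 18 degrees of freedom), not from the question.

```python
from math import sqrt

r = 0.209   # sample PMCC quoted in the question
n = 20      # number of competitors

# H0: rho = 0, H1: rho > 0 (one-tailed, matching the coach's claim).
# Under H0 this statistic follows a t distribution with n - 2 degrees of freedom.
t_stat = r * sqrt(n - 2) / sqrt(1 - r * r)

# ASSUMPTION: one-tailed 5% critical value of t with 18 degrees of freedom,
# copied from standard statistical tables.
t_crit = 1.734

# The same critical value expressed on the r scale, for comparison with
# a table of PMCC critical values.
r_crit = t_crit / sqrt(t_crit ** 2 + (n - 2))

reject_h0 = t_stat > t_crit
```

Here `t_stat` is about 0.907 and `r_crit` is about 0.378, so `reject_h0` is `False`: on this sketch there is insufficient evidence at the 5% level to support the claim of positive correlation.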
OCR MEI Paper 2 2020 November Q11
11 The pre-release material contains information concerning median house prices over the period 2004-2015. A spreadsheet has been used to generate a time series graph for two areas: the London borough of "Barking and Dagenham" and "North West". This is shown together with the raw data in Fig. 11.1.
\begin{figure}[h]
\includegraphics[alt={},max width=\textwidth]{cea67565-8074-4703-8e1a-09b98e380baf-12_572_1751_447_159}
\captionsetup{labelformat=empty} \caption{Fig. 11.1}
\end{figure}
Dr Procter suggests that it is unusual for median house prices in a London borough to be consistently higher than those in other parts of the country.
  1. Use your knowledge of the large data set to comment on Dr Procter's suggestion.
    Dr Procter wishes to predict the median house price in Barking and Dagenham in 2016. She uses the spreadsheet function LINEST to find the equation of the line of best fit for the given data. She obtains the equation \(P = 4897 Y - 9657847\), where \(P\) is the median house price in pounds and \(Y\) is the calendar year, for example 2015.
  2. Use Dr Procter's equation to predict the median house price in Barking and Dagenham in
    • 2016
    • 2017.
    Professor Jackson uses a simpler model by using the data from 2014 and 2015 only to form a straight-line model.
  3. Find the equation Professor Jackson uses in her model.
  4. Use Professor Jackson’s equation to predict the median house price in Barking and Dagenham in
    • 2016
    • 2017.
    Professor Jackson carries out some research online. She finds some information about median house prices in Barking and Dagenham, which is shown in Fig. 11.2.
    \begin{table}[h]
    \begin{tabular}{|c|c|}
    \hline
    2016 & 2017 \\
    \hline
    \(\pounds 290000\) & \(\pounds 300000\) \\
    \hline
    \end{tabular}
    \captionsetup{labelformat=empty} \caption{Fig. 11.2}
    \end{table}
  5. Comment on how well
    • Dr Procter’s model fits the data,
    • Professor Jackson’s model fits the data.
  6. Explain which, if any, of the models is likely to be more reliable for predicting median house prices in Barking and Dagenham in 2020.
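The predictions from Dr Procter's equation in the question above, and their differences from the Fig. 11.2 values, can be sketched as follows. The function names are assumptions introduced here. Professor Jackson's model needs the 2014 and 2015 prices, which appear only in Fig. 11.1 and are not reproduced in the text, so `line_through` is defined but deliberately not called with any data.

```python
def procter_price(year):
    """Dr Procter's LINEST model from the question: P = 4897 Y - 9657847."""
    return 4897 * year - 9657847


def line_through(point1, point2):
    """Return (gradient, intercept) of the line through two (year, price) points.

    This is the form of model Professor Jackson uses; the two points would be
    the 2014 and 2015 values read from Fig. 11.1.
    """
    (year1, price1), (year2, price2) = point1, point2
    gradient = (price2 - price1) / (year2 - year1)
    intercept = price1 - gradient * year1
    return gradient, intercept


# Median prices reported in Fig. 11.2.
actual = {2016: 290000, 2017: 300000}

# Actual value minus predicted value for each year.
procter_errors = {year: actual[year] - procter_price(year) for year in actual}
```

Dr Procter's equation gives 214505 for 2016 and 219402 for 2017, so it underestimates the Fig. 11.2 values by 75495 and 80598 pounds respectively.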
OCR MEI Further Statistics A AS 2020 November Q5
5 A doctor is investigating the relationship between the levels in the blood of a particular hormone and of calcium in healthy adults. The levels of the hormone and of calcium, each measured in suitable units, are denoted by \(x\) and \(y\) respectively. The doctor selects a random sample of 14 adults and measures the hormone and calcium levels in each of them. The spreadsheet in Fig. 5 shows the values obtained, together with a scatter diagram which illustrates the data. The equation of the regression line of \(y\) on \(x\) is shown on the scatter diagram, together with the value of the square of the product moment correlation coefficient. \begin{figure}[h]
\includegraphics[alt={},max width=\textwidth]{ba3fcd3c-6834-4116-be0e-d5b27aed0a7e-5_801_1644_646_255} \captionsetup{labelformat=empty} \caption{Fig. 5}
\end{figure}
  1. Use the equation of the regression line to estimate the mean calcium level of people with the following hormone levels.
    • 150
    • 250
  2. Explain which of your two estimates is likely to be more reliable.
  3. Comment on the goodness of fit of the regression line.
  4. Explain whether it would be appropriate to plot the scatter diagram the other way around with calcium level on the horizontal axis and hormone level on the vertical axis.
  5. Calculate the equation of a regression line which would be suitable for estimating the mean hormone level of people with a known calcium level.
OCR MEI Further Statistics Minor 2023 June Q5
5 An ornithologist is investigating the link between the wing length and the mass of small birds, in order to try to predict the mass from the wing length without having to weigh birds. The ornithologist takes a random sample of 9 birds and measures their wing lengths \(w \mathrm {~mm}\) and their masses \(m \mathrm {~g}\). The spreadsheet below shows the data, together with a scatter diagram which illustrates the data.
\includegraphics[max width=\textwidth, alt={}, center]{72215d69-c3e6-492d-bb3e-bdc28aeb4613-5_719_1424_495_246}
  1. Find the equation of the regression line of \(m\) on \(w\), giving the coefficients correct to \(\mathbf { 3 }\) significant figures.
  2. Use the equation which you found in part 1 to estimate the mass for each of the following wing lengths.
    • 99 mm
    • 110 mm
  3. Comment on the reliability of your estimates.
  4. The equation of the regression line of \(w\) on \(m\) is \(w = 0.473 m + 87.5\). A friend of the ornithologist suggests that this equation could also be used to estimate the masses of birds from their wing lengths.
    Comment on this suggestion.
OCR MEI Further Statistics Minor 2021 November Q2
2 A road transport researcher is investigating the link between the age of a person, \(a\) years, and the distance, \(d\) metres, at which the person can read a large road sign. The researcher selects 13 individuals of different ages between 20 and 80 and measures the value of \(d\) for each of them. The spreadsheet below shows the data which the researcher obtained, together with a scatter diagram which illustrates the data.
\includegraphics[max width=\textwidth, alt={}, center]{691e8b55-e9a1-4fff-b9ee-a71ff1f73ead-3_725_1566_495_251}
  1. Explain which of the two variables \(a\) and \(d\) is the independent variable.
  2. Find the equation of the regression line of \(d\) on \(a\).
  3. Use the regression line to predict the average distance at which a 60-year-old person can read the road sign.
  4. Explain why it might not be sensible to use the regression line to predict the average distance at which a 5-year-old child can read the road sign.
  5. Determine the value of the residual for \(a = 40\).
  6. Explain why it would not be useful to find the equation of the regression line of \(a\) on \(d\).
OCR MEI Further Statistics Major 2024 June Q8
8 An estate agent collects data for a random selection of 13 flats in order to investigate the link between the floor areas of flats and their price. The scatter diagram shows the floor areas, \(x \mathrm {~m} ^ { 2 }\), and prices, \(\pounds y\) thousand, of the 13 flats.
\includegraphics[max width=\textwidth, alt={}, center]{bab116b3-6e5f-44db-ac86-670e4040d649-07_613_1246_386_242}
  1. The estate agent notes that two of the data points are outliers. One is Flat A which has a large floor area but is in poor condition. The other is Flat B which has a balcony with a desirable view overlooking the sea. Label these two data points on the copy of the scatter diagram in the Printed Answer Booklet.
    The estate agent decides to remove these two data points from the analysis. Summary statistics for the remaining 11 flats are as follows.
    $$\sum x = 652.5 \quad \sum y = 5067 \quad \sum x^{2} = 41987.35 \quad \sum y^{2} = 2456813 \quad \sum xy = 315928.2$$
  2. In this question you must show detailed reasoning. Calculate the equation of a regression line which is suitable for estimating the price of a flat from its floor area.
  3. Use the regression line to estimate the price for the following floor areas.
    • \(40 \mathrm {~m} ^ { 2 }\)
    • \(110 \mathrm {~m} ^ { 2 }\)
  4. Given that the value of the product moment correlation coefficient for these 11 data items is 0.765, comment on the reliability of your estimates.
  5. The estate agent thinks that he can predict the floor area of a flat from its price, using the equation of the regression line found in part 2.
    Comment briefly on the estate agent's idea.
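The final part of the question above turns on the difference between the regression line of \(y\) on \(x\) and the line of \(x\) on \(y\). As a minimal sketch under the quoted summary statistics, the code below computes both lines and compares the gradients they imply; all variable names are assumptions introduced here, and the result is illustrative rather than a model answer.

```python
from math import sqrt

# Summary statistics for the 11 remaining flats, as quoted in the question.
n = 11
sum_x, sum_y = 652.5, 5067
sum_x2, sum_y2, sum_xy = 41987.35, 2456813, 315928.2

s_xx = sum_x2 - sum_x ** 2 / n
s_yy = sum_y2 - sum_y ** 2 / n
s_xy = sum_xy - sum_x * sum_y / n

# Regression of y on x (price from floor area): minimises vertical residuals.
b_yx = s_xy / s_xx
a_yx = sum_y / n - b_yx * sum_x / n

# Regression of x on y (floor area from price): minimises horizontal residuals.
b_xy = s_xy / s_yy
a_xy = sum_x / n - b_xy * sum_y / n

# Product moment correlation coefficient, for comparison with the quoted 0.765.
r = s_xy / sqrt(s_xx * s_yy)

# Gradient dy/dx implied by each line; the two agree only when |r| = 1.
gradient_from_y_on_x = b_yx
gradient_from_x_on_y = 1 / b_xy
```

The computed `r` is about 0.765, matching the value given in the question, and the product `b_yx * b_xy` equals `r ** 2`. Because \(|r| < 1\), rearranging the \(y\) on \(x\) line does not reproduce the \(x\) on \(y\) line, so the two gradients differ noticeably.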