OCR S1 — Question 8

Exam Board: OCR
Module: S1 (Statistics 1)
Topic: Bivariate data
Type: Calculate r from raw bivariate data

8 The table shows the population, \(x\) million, of each of nine countries in Western Europe together with the population, \(y\) million, of its capital city.
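| | Germany | United Kingdom | France | Italy | Spain | The Netherlands | Portugal | Austria | Switzerland |
|---|---|---|---|---|---|---|---|---|---|
| \(x\) | 82.1 | 59.2 | 59.1 | 56.7 | 39.2 | 15.9 | 9.9 | 8.1 | 7.3 |
| \(y\) | 3.5 | 7.0 | 9.0 | 2.7 | 2.9 | 0.8 | 0.7 | 1.6 | 0.1 |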
$$\left[ n = 9 , \Sigma x = 337.5 , \Sigma x ^ { 2 } = 18959.11 , \Sigma y = 28.3 , \Sigma y ^ { 2 } = 161.65 , \Sigma x y = 1533.76 . \right]$$
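The bracketed totals are enough to evaluate the product moment correlation coefficient without returning to the table. The sketch below is illustrative only: the function name `pmcc_from_sums` is an assumption introduced here, and it relies on the standard identities \(S_{xx} = \Sigma x^2 - (\Sigma x)^2 / n\), \(S_{yy} = \Sigma y^2 - (\Sigma y)^2 / n\) and \(S_{xy} = \Sigma xy - \Sigma x \Sigma y / n\).

```python
import math

def pmcc_from_sums(n, sum_x, sum_x2, sum_y, sum_y2, sum_xy):
    """Product moment correlation coefficient from summary totals.

    Hypothetical helper for checking working; not part of the question paper.
    """
    s_xx = sum_x2 - sum_x ** 2 / n
    s_yy = sum_y2 - sum_y ** 2 / n
    s_xy = sum_xy - sum_x * sum_y / n
    return s_xy / math.sqrt(s_xx * s_yy)

# Totals quoted in square brackets in the question.
r = pmcc_from_sums(9, 337.5, 18959.11, 28.3, 161.65, 1533.76)
print(round(r, 3))
```

Working from the totals avoids re-entering the nine data pairs, which is usually where slips occur.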
  1. (a) Calculate Spearman's rank correlation coefficient, \(r _ { s }\).
    (b) Explain what your answer indicates about the populations of these countries and their capital cities.
  2. Calculate the product moment correlation coefficient, \(r\).
    The data are illustrated in the scatter diagram.
    \includegraphics[max width=\textwidth, alt={}, center]{11316ea6-3999-4003-b77d-bee8b547c1da-09_936_881_1162_632}
  3. By considering the diagram, state the effect on the value of the product moment correlation coefficient, \(r\), if the data for France and the United Kingdom were removed from the calculation.
  4. In a certain country in Africa, most people live in remote areas and hence the population of the country is unknown. However, the population of the capital city is known to be approximately 1 million. An official suggests that the population of this country could be estimated by using a regression line drawn on the above scatter diagram.
    (a) State, with a reason, whether the regression line of \(y\) on \(x\) or the regression line of \(x\) on \(y\) would need to be used.
    (b) Comment on the reliability of such an estimate in this situation.

1 Some observations of bivariate data were made and the equations of the two regression lines were found to be as follows.
$$\begin{array} { c c } y \text { on } x : & y = - 0.6 x + 13.0 \\ x \text { on } y : & x = - 1.6 y + 21.0 \end{array}$$
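Both regression lines pass through the point \(( \bar { x } , \bar { y } )\), and the product of their gradients equals \(r ^ { 2 }\). A minimal sketch of those two facts applied to the lines above follows; the function name `intersection` and its parameter names are assumptions made for illustration.

```python
def intersection(a, b, c, d):
    """Intersection of y = a*x + b and x = c*y + d (requires a*c != 1).

    Hypothetical helper: substitutes the first line into the second.
    """
    x = (c * b + d) / (1 - a * c)
    y = a * x + b
    return x, y

# y on x: y = -0.6x + 13.0 ; x on y: x = -1.6y + 21.0
x_bar, y_bar = intersection(-0.6, 13.0, -1.6, 21.0)

# Product of the two gradients gives r squared; the sign of r matches the gradients.
r_squared = (-0.6) * (-1.6)
print(round(x_bar, 6), round(y_bar, 6), r_squared)
```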
  5. State, with a reason, whether the correlation between \(x\) and \(y\) is negative or positive.
  6. Neither variable is controlled. Calculate an estimate of the value of \(x\) when \(y = 7.0\).
  7. Find the values of \(\bar { x }\) and \(\bar { y }\).

2 A bag contains 5 black discs and 3 red discs. A disc is selected at random from the bag. If it is red it is replaced in the bag. If it is black, it is not replaced. A second disc is now selected at random from the bag. Find the probability that
  8. the second disc is black, given that the first disc was black,
  9. the second disc is black,
  10. the two discs are of different colours.

3 Each of the 7 letters in the word DIVIDED is printed on a separate card. The cards are arranged in a row.
  11. How many different arrangements of the letters are possible?
  12. In how many of these arrangements are all three Ds together?
    The 7 cards are now shuffled and 2 cards are selected at random, without replacement.
  13. Find the probability that at least one of these 2 cards has D printed on it.

4
  14. The random variable \(X\) has the distribution \(\mathrm { B } ( 25,0.2 )\). Using the tables of cumulative binomial probabilities, or otherwise, find \(\mathrm { P } ( X \geqslant 5 )\).
  15. The random variable \(Y\) has the distribution \(\mathrm { B } ( 10,0.27 )\). Find \(\mathrm { P } ( Y = 3 )\).
  16. The random variable \(Z\) has the distribution \(\mathrm { B } ( n , 0.27 )\). Find the smallest value of \(n\) such that \(\mathrm { P } ( Z \geqslant 1 ) > 0.95\).

5 The probability distribution of a discrete random variable, \(X\), is given in the table.
    | \(x\) | 0 | 1 | 2 | 3 |
    |---|---|---|---|---|
    | \(\mathrm { P } ( X = x )\) | \(\frac { 1 } { 3 }\) | \(\frac { 1 } { 4 }\) | \(p\) | \(q\) |
    It is given that the expectation, \(\mathrm { E } ( X )\), is \(1 \frac { 1 } { 4 }\).
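The two conditions on \(p\) and \(q\) (probabilities sum to 1, and \(\mathrm { E } ( X ) = 1 \frac { 1 } { 4 }\)) form a pair of simultaneous linear equations. The sketch below solves them with exact fractions; the variable names are assumptions chosen for this illustration.

```python
from fractions import Fraction

p0, p1 = Fraction(1, 3), Fraction(1, 4)   # P(X=0), P(X=1) from the table
mean = Fraction(5, 4)                     # E(X) = 1 1/4

# p + q   = 1 - p0 - p1      (probabilities sum to 1)
# 2p + 3q = mean - 1 * p1    (definition of E(X))
total = 1 - p0 - p1
weighted = mean - p1
q = weighted - 2 * total      # subtract twice the first equation from the second
p = total - q

# Var(X) = E(X^2) - [E(X)]^2, then the standard deviation.
variance = (1 * p1 + 4 * p + 9 * q) - mean ** 2
sd = float(variance) ** 0.5
print(p, q, variance, round(sd, 3))
```

Using `Fraction` keeps the intermediate values exact, so rounding only enters at the final square root.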
  17. Calculate the values of \(p\) and \(q\).
  18. Calculate the standard deviation of \(X\).

\section*{June 2006}

6 The table shows the total distance travelled, in thousands of miles, and the amount of commission earned, in thousands of pounds, by each of seven sales agents in 2005.
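    | Agent | \(A\) | \(B\) | \(C\) | \(D\) | \(E\) | \(F\) | \(G\) |
    |---|---|---|---|---|---|---|---|
    | Distance travelled | 18 | 15 | 12 | 14 | 16 | 24 | 13 |
    | Commission earned | 18 | 45 | 19 | 24 | 27 | 22 | 23 |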
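Spearman's coefficient for data like these depends only on the rank order within each row. A minimal sketch, assuming no tied values (true for this table) and using the formula \(r _ { s } = 1 - \frac { 6 \Sigma d ^ { 2 } } { n ( n ^ { 2 } - 1 ) }\), is given below; the helper names `ranks` and `spearman` are assumptions.

```python
def ranks(values):
    """Rank distinct values from 1 (smallest) upwards; assumes no ties."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman(xs, ys):
    """Spearman's rank correlation coefficient via the sum of squared rank differences."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

distance = [18, 15, 12, 14, 16, 24, 13]     # agents A to G
commission = [18, 45, 19, 24, 27, 22, 23]
r_s = spearman(distance, commission)
print(round(r_s, 3))
```

Because only ranks enter the calculation, changing a value without changing its position in the ordering leaves `ranks` unchanged.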
  19. (a) Calculate Spearman's rank correlation coefficient, \(r _ { s }\), for these data.
    (b) Comment briefly on your value of \(r _ { s }\) with reference to this context.
    (c) After these data were collected, agent \(A\) found that he had made a mistake. He had actually travelled 19000 miles in 2005. State, with a reason, but without further calculation, whether the value of Spearman's rank correlation coefficient will increase, decrease or stay the same.
    The agents were asked to indicate their level of job satisfaction during 2005. A score of 0 represented no job satisfaction, and a score of 10 represented high job satisfaction. Their scores, \(y\), together with the data for distance travelled, \(x\), are illustrated in the scatter diagram below.
    \includegraphics[max width=\textwidth, alt={}, center]{11316ea6-3999-4003-b77d-bee8b547c1da-11_680_972_1235_589}
  20. For this scatter diagram, what can you say about the value of
    (a) Spearman's rank correlation coefficient,
    (b) the product moment correlation coefficient?

7 In a UK government survey in 2000, smokers were asked to estimate the time between their waking and their having the first cigarette of the day. For heavy smokers, the results were as follows.
    | Time between waking and first cigarette | 1 to 4 minutes | 5 to 14 minutes | 15 to 29 minutes | 30 to 59 minutes | At least 60 minutes |
    |---|---|---|---|---|---|
    | Percentage of smokers | 31 | 27 | 19 | 14 | 9 |
    Times are given correct to the nearest minute.
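Estimates of the mean and standard deviation from a grouped table of this kind use class midpoints weighted by frequency. The sketch below is one way to set that up, under two labelled assumptions: class boundaries are taken at the half-minute (for example 0.5 to 4.5) because times are correct to the nearest minute, and the open-ended class is closed at 240 minutes as in the first part of this question, giving boundaries 59.5 to 239.5. The function name `grouped_mean_sd` is hypothetical.

```python
import math

# (lower boundary, upper boundary, percentage); boundaries are an assumption
# based on times being recorded to the nearest minute.
classes = [
    (0.5, 4.5, 31),
    (4.5, 14.5, 27),
    (14.5, 29.5, 19),
    (29.5, 59.5, 14),
    (59.5, 239.5, 9),   # "at least 60" closed at 240 minutes (assumed)
]

def grouped_mean_sd(classes):
    """Estimate mean and standard deviation using class midpoints."""
    total = sum(f for _, _, f in classes)
    mids = [((lo + hi) / 2, f) for lo, hi, f in classes]
    mean = sum(m * f for m, f in mids) / total
    mean_sq = sum(m * m * f for m, f in mids) / total
    return mean, math.sqrt(mean_sq - mean ** 2)

mean, sd = grouped_mean_sd(classes)
print(round(mean, 1), round(sd, 1))
```

Widening the last class moves only its midpoint, which is why changing its upper limit affects some summary measures and not others.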
  21. Assuming that 'At least 60 minutes' means 'At least 60 minutes but less than 240 minutes', calculate estimates for the mean and standard deviation of the time between waking and first cigarette for these smokers.
  22. Find an estimate for the interquartile range of the time between waking and first cigarette for these smokers. Give your answer correct to the nearest minute.
  23. The meaning of 'At least 60 minutes' is now changed to 'At least 60 minutes but less than 480 minutes'. Without further calculation, state whether this would cause an increase, a decrease or no change in the estimated value of
    (a) the mean,
    (b) the standard deviation,
    (c) the interquartile range.

8 Henry makes repeated attempts to light his gas fire. He makes the modelling assumption that the probability that the fire will light on any attempt is \(\frac { 1 } { 3 }\). Let \(X\) be the number of attempts at lighting the fire, up to and including the successful attempt.
  24. Name the distribution of \(X\), stating a further modelling assumption needed.
    In the rest of this question, you should use the distribution named in part (i).
  25. Calculate
    (a) \(\mathrm { P } ( X = 4 )\),
    (b) \(\mathrm { P } ( X < 4 )\).
  26. State the value of \(\mathrm { E } ( X )\).
  27. Henry has to light the fire once a day, starting on March 1st. Calculate the probability that the first day on which fewer than 4 attempts are needed to light the fire is March 3rd.

1 Part of the probability distribution of a variable, \(X\), is given in the table.
    | \(x\) | 0 | 1 | 2 | 3 |
    |---|---|---|---|---|
    | \(\mathrm { P } ( X = x )\) | | \(\frac { 3 } { 10 }\) | \(\frac { 1 } { 5 }\) | \(\frac { 2 } { 5 }\) |
  28. Find \(\mathrm { P } ( X = 0 )\).
  29. Find \(\mathrm { E } ( X )\).

2 The table contains data concerning five households selected at random from a certain town.
    | Number of people in the household | 2 | 3 | 3 | 5 | 7 |
    |---|---|---|---|---|---|
    | Number of cars belonging to people in the household | 1 | 1 | 3 | 2 | 4 |
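With only five pairs and no totals supplied, the product moment correlation coefficient can be computed straight from the raw values. The following is a sketch rather than a prescribed method; the function name `pmcc` is an assumption, and it uses deviations from the means instead of the summed-squares form.

```python
import math

def pmcc(xs, ys):
    """Product moment correlation coefficient from raw paired data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

people = [2, 3, 3, 5, 7]   # number of people in each household
cars = [1, 1, 3, 2, 4]     # number of cars in each household
r = pmcc(people, cars)
print(round(r, 3))
```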
  30. Calculate the product moment correlation coefficient, \(r\), for the data in the table.
  31. Give a reason why it would not be sensible to use your answer to draw a conclusion about all the households in the town.

3 The digits 1, 2, 3, 4 and 5 are arranged in random order, to form a five-digit number.
  32. How many different five-digit numbers can be formed?
  33. Find the probability that the five-digit number is
    (a) odd,
    (b) less than 23000.

4 Each of the variables \(W , X , Y\) and \(Z\) takes eight integer values only. The probability distributions are illustrated in the following diagrams.
    \includegraphics[max width=\textwidth, alt={}, center]{11316ea6-3999-4003-b77d-bee8b547c1da-14_423_385_404_287}
    \includegraphics[max width=\textwidth, alt={}, center]{11316ea6-3999-4003-b77d-bee8b547c1da-14_419_376_406_687}
    \includegraphics[max width=\textwidth, alt={}, center]{11316ea6-3999-4003-b77d-bee8b547c1da-14_419_378_406_1082}
    \includegraphics[max width=\textwidth, alt={}, center]{11316ea6-3999-4003-b77d-bee8b547c1da-14_419_376_406_1482}
  34. For which one or more of these variables is
    (a) the mean equal to the median,
    (b) the mean greater than the median?
  35. Give a reason why none of these diagrams could represent a geometric distribution.
  36. Which one of these diagrams could not represent a binomial distribution? Explain your answer briefly.

5 A chemical solution was gradually heated. At five-minute intervals the time, \(x\) minutes, and the temperature, \(y ^ { \circ } \mathrm { C }\), were noted.
    | \(x\) | 0 | 5 | 10 | 15 | 20 | 25 | 30 | 35 |
    |---|---|---|---|---|---|---|---|---|
    | \(y\) | 0.8 | 3.0 | 6.8 | 10.9 | 15.6 | 19.6 | 23.4 | 26.7 |
    $$\left[ n = 8 , \Sigma x = 140 , \Sigma y = 106.8 , \Sigma x ^ { 2 } = 3500 , \Sigma y ^ { 2 } = 2062.66 , \Sigma x y = 2685.0 . \right]$$
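The regression line of \(y\) on \(x\) has gradient \(b = S_{xy} / S_{xx}\) and passes through \(( \bar { x } , \bar { y } )\). A short sketch using the bracketed totals is shown below; the function name `regression_y_on_x` is an assumption, and \(\Sigma y ^ { 2 }\) is not needed for this line.

```python
def regression_y_on_x(n, sum_x, sum_y, sum_x2, sum_xy):
    """Return (a, b) for the least-squares line y = a + b*x from summary totals."""
    s_xx = sum_x2 - sum_x ** 2 / n
    s_xy = sum_xy - sum_x * sum_y / n
    b = s_xy / s_xx
    a = sum_y / n - b * sum_x / n   # line passes through the mean point
    return a, b

# Totals quoted in square brackets in the question.
a, b = regression_y_on_x(8, 140, 106.8, 3500, 2685.0)

# Estimated temperature at x = 12 minutes (inside the observed range 0 to 35).
estimate_12 = a + b * 12
print(round(a, 3), round(b, 3), round(estimate_12, 1))
```

The same function evaluated at a value of \(x\) far outside 0 to 35 would be an extrapolation, which the line itself cannot justify.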
  37. Calculate the equation of the regression line of \(y\) on \(x\).
  38. Use your equation to estimate the temperature after 12 minutes.
  39. It is given that the value of the product moment correlation coefficient is close to \(+1\). Comment on the reliability of using your equation to estimate \(y\) when
    (a) \(x = 17\),
    (b) \(x = 57\).

6 A coin is biased so that the probability that it will show heads on any throw is \(\frac { 2 } { 3 }\). The coin is thrown repeatedly. The number of throws up to and including the first head is denoted by \(X\). Find
  40. \(\mathrm { P } ( X = 4 )\),
  41. \(\mathrm { P } ( X < 4 )\),
  42. \(\mathrm { E } ( X )\).

7 A bag contains three 1p coins and seven 2p coins. Coins are removed at random one at a time, without replacement, until the total value of the coins removed is at least 3p. Then no more coins are removed.
  43. Copy and complete the probability tree diagram.
    First coin
    \includegraphics[max width=\textwidth, alt={}, center]{11316ea6-3999-4003-b77d-bee8b547c1da-15_350_317_1279_568}
    Find the probability that
  44. exactly two coins are removed,
  45. the total value of the coins removed is 4p.

\section*{Jan 2007}

8 In the 2001 census, the household size (the number of people living in each household) was recorded. The percentages of households of different sizes were then calculated. The table shows the percentages for two wards, Withington and Old Moat, in Manchester.
    | Household size | 1 | 2 | 3 | 4 | 5 | 6 | 7 or more |
    |---|---|---|---|---|---|---|---|
    | Withington | 34.1 | 26.1 | 12.7 | 12.8 | 8.2 | 4.0 | 2.1 |
    | Old Moat | 35.1 | 27.1 | 14.7 | 11.4 | 7.6 | 2.8 | 1.3 |
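For discrete data given as percentages, the median and quartiles can be read from cumulative percentages. The sketch below assumes one common convention (the smallest household size at which the cumulative percentage reaches 25, 50 or 75); the function name `quantile` is hypothetical, and the class "7 or more" is represented by 7 purely as a label, since no quartile falls in it.

```python
sizes = [1, 2, 3, 4, 5, 6, 7]   # 7 stands for the class "7 or more"
withington = [34.1, 26.1, 12.7, 12.8, 8.2, 4.0, 2.1]
old_moat = [35.1, 27.1, 14.7, 11.4, 7.6, 2.8, 1.3]

def quantile(sizes, percentages, target):
    """Smallest size whose cumulative percentage reaches `target` (assumed convention)."""
    cumulative = 0.0
    for size, pct in zip(sizes, percentages):
        cumulative += pct
        if cumulative >= target:
            return size
    return sizes[-1]

median = quantile(sizes, withington, 50)
iqr = quantile(sizes, withington, 75) - quantile(sizes, withington, 25)
print(median, iqr)
```

Applied to the Old Moat row, the same convention gives a median of 2 and an interquartile range of 2, which agrees with the results quoted for that ward later in the question.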
  46. Calculate the median and interquartile range of the household size for Withington.
  47. Making an appropriate assumption for the last class, which should be stated, calculate the mean and standard deviation of the household size for Withington. Give your answers to an appropriate degree of accuracy.
    The corresponding results for Old Moat are as follows.
    | Median | Interquartile range | Mean | Standard deviation |
    |---|---|---|---|
    | 2 | 2 | 2.4 | 1.5 |
  48. State one advantage of using the median rather than the mean as a measure of the average household size.
  49. By comparing the values for Withington with those for Old Moat, explain briefly why the interquartile range may be less suitable than the standard deviation as a measure of the variation in household size.
  50. For one of the above wards, the value of Spearman's rank correlation coefficient between household size and percentage is \(-1\). Without any calculation, state which ward this is. Explain your answer.