OCR MEI Further Statistics Major 2020 November

Question 1
1 In a game at a fair, players choose 4 countries from a list of 10 countries. The names of all 10 countries are then put in a box and the player selects 4 of them at random. The random variable \(X\) represents the number of countries that match those which the player originally chose.
  1. Show that the probability that a randomly selected player matches all 4 countries is \(\frac { 1 } { 210 }\). Table 1 shows the probability distribution of \(X\). \begin{table}[h]
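    \begin{tabular}{|c|c|c|c|c|c|}
    \hline
    \(r\) & 0 & 1 & 2 & 3 & 4 \\
    \hline
    \(\mathrm { P } ( X = r )\) & \(\frac { 1 } { 14 }\) & \(\frac { 8 } { 21 }\) & \(\frac { 3 } { 7 }\) & \(\frac { 4 } { 35 }\) & \(\frac { 1 } { 210 }\) \\
    \hline
    \end{tabular}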
    \captionsetup{labelformat=empty} \caption{Table 1}
    \end{table}
  2. Find each of the following.
    • \(\mathrm { E } ( X )\)
    • \(\operatorname { Var } ( X )\)
  3. A player has to pay \(\pounds 1\) to play the game. The player gets 40 pence back for every country which is matched. Find the mean and standard deviation of the player's loss per game.
  4. In order to try to attract more customers, the rules will be changed as follows. The game will still cost \(\pounds 1\) to play. The player will get 25 pence back for every country which is matched, plus an additional bonus of \(\pounds 100\) if all four countries are matched. Find the player's mean gain or loss per game with these new rules.
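The entries of Table 1 can be checked numerically. The sketch below is not part of the paper; it assumes that \(X\) follows a hypergeometric model (4 chosen countries, 4 drawn from 10), and the function name `match_distribution` is hypothetical.

```python
from fractions import Fraction
from math import comb


def match_distribution(chosen=4, total=10):
    """Return [P(X = 0), ..., P(X = chosen)] as exact fractions.

    Assumes the player's `chosen` countries are fixed and `chosen`
    countries are then drawn at random, without replacement, from `total`.
    """
    ways = comb(total, chosen)  # number of equally likely draws
    return [
        Fraction(comb(chosen, r) * comb(total - chosen, chosen - r), ways)
        for r in range(chosen + 1)
    ]


probs = match_distribution()
mean = sum(r * p for r, p in enumerate(probs))
variance = sum(r * r * p for r, p in enumerate(probs)) - mean ** 2

print("P(X = r):", probs)
print("E(X) =", mean, " Var(X) =", variance)
```

Working in exact fractions avoids rounding, so the output can be compared directly with the fractions printed in Table 1.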
Question 2
2 On average 1 in 4000 people have a particular antigen in their blood (an antigen is a molecule which may cause an adverse reaction).
  1. A random sample of 1200 people is selected. The random variable \(X\) represents the number of people in the sample who have this antigen in their blood. Explain why you could use either a binomial distribution or a Poisson distribution to model the distribution of \(X\).
  2. Use either a binomial or a Poisson distribution to calculate each of the following probabilities.
    • \(\mathrm { P } ( X = 3 )\)
    • \(\mathrm { P } ( X > 3 )\)
  3. A researcher needs to find 2 people with the antigen. Find the probability that at most 5000 people have to be tested in order to achieve this.
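To illustrate why the two models in Question 2 are interchangeable here, the following sketch (not part of the paper) compares the \(\mathrm{B}(1200, \frac{1}{4000})\) and \(\mathrm{Po}(0.3)\) probabilities for small counts. The helper names are hypothetical.

```python
from math import comb, exp, factorial

n, p = 1200, 1 / 4000
lam = n * p  # Poisson mean, equal to the binomial mean np


def binomial_pmf(k):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def poisson_pmf(k):
    """P(X = k) for X ~ Po(lam)."""
    return exp(-lam) * lam ** k / factorial(k)


for k in range(5):
    print(k, binomial_pmf(k), poisson_pmf(k))

# P(X > 3) under each model, via the complement.
binomial_tail = 1 - sum(binomial_pmf(k) for k in range(4))
poisson_tail = 1 - sum(poisson_pmf(k) for k in range(4))
print(binomial_tail, poisson_tail)
```

With \(n\) large and \(p\) small the two columns agree to several decimal places, which is the usual justification for the Poisson approximation.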
Question 3
3 A supermarket sells cashew nuts in three different sizes of bag: small, medium and large. The weights in grams of the nuts in each type of bag are modelled by independent Normal distributions as shown in Table 3. \begin{table}[h]
\begin{tabular}{|l|c|c|}
\hline
Bag size & Mean & Standard deviation \\
\hline
Small & 51.5 & 1.1 \\
Medium & 100.7 & 1.6 \\
Large & 201.3 & 1.7 \\
\hline
\end{tabular}
\captionsetup{labelformat=empty} \caption{Table 3}
\end{table}
  1. Find the probability that the mean weight of two randomly selected large bags is at least 200 g.
  2. Find the probability that the total weight of eight randomly selected small bags is greater than the total weight of two randomly selected medium bags and one randomly selected large bag.
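Both parts of Question 3 rely on the fact that sums of independent Normal variables are Normal, with means and variances adding. The sketch below is not part of the paper; it uses `statistics.NormalDist` from the Python standard library and the parameters of Table 3, and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

small = NormalDist(51.5, 1.1)
medium = NormalDist(100.7, 1.6)
large = NormalDist(201.3, 1.7)

# Mean of two independent large bags: same mean, variance sigma^2 / 2.
mean_two_large = NormalDist(large.mean, large.stdev / sqrt(2))
p_mean_at_least_200 = 1 - mean_two_large.cdf(200)

# D = (total of 8 small) - (total of 2 medium + 1 large).
# Means combine with signs; variances of independent terms always add.
diff_mean = 8 * small.mean - (2 * medium.mean + large.mean)
diff_var = 8 * small.variance + 2 * medium.variance + large.variance
difference = NormalDist(diff_mean, sqrt(diff_var))
p_small_total_greater = 1 - difference.cdf(0)

print(p_mean_at_least_200, p_small_total_greater)
```

Note that the eight small bags are treated as eight independent variables, so their variance is \(8 \sigma^2\) rather than \(64 \sigma^2\).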
Question 4
4 An amateur meteorologist records the total rainfall at her home each day using a traditional rain gauge. This means that she has to go out each day at 9 am to read the rain gauge and then to empty it. She wants to save time by using a digital rain gauge, but she also wants to ensure that the readings from the digital gauge are similar to those of her traditional gauge. Over a period of 100 days, she uses both gauges to measure the rainfall. The meteorologist uses software to produce a 95\% confidence interval for the difference between the two readings (the traditional gauge reading minus the digital gauge reading). The output from the software is shown in Fig. 4. Although rainfall was measured over a period of 100 days, there was no rain on 40 of those days and so the sample size in the software output is 60 rather than 100. \begin{table}[h]
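\begin{tabular}{|l|l|}
\hline
\multicolumn{2}{|l|}{Z Estimate of a Mean} \\
\hline
Confidence Level & 0.95 \\
\hline
\multicolumn{2}{|l|}{Sample} \\
\hline
Mean & 0.1173 \\
\hline
\multicolumn{2}{|l|}{Result} \\
\hline
\multicolumn{2}{|l|}{Z Estimate of a Mean} \\
\hline
Mean & 0.1173 \\
\(\sigma\) & 0.5766 \\
SE & 0.07444 \\
N & 60 \\
Lower Limit & \(-0.0286\) \\
Upper Limit & 0.2632 \\
Interval & \(0.1173 \pm 0.1459\) \\
\hline
\end{tabular}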
\captionsetup{labelformat=empty} \caption{Fig. 4}
\end{table}
  1. Explain why this confidence interval can be calculated even though nothing is known about the distribution of the population of differences.
  2. State the confidence interval which the software gives in the form \(a < \mu < b\).
  3. Show how the value 0.07444 (labelled SE) was calculated.
  4. Comment on whether you think that the confidence interval suggests that the two different methods of measurement are broadly in agreement.
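The figures in Fig. 4 are internally consistent, which can be confirmed with a short calculation. The sketch below is not part of the paper; it assumes the software uses the standard Normal interval \(\bar{x} \pm z \, \sigma / \sqrt{n}\), and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mean, sigma, n, level = 0.1173, 0.5766, 60, 0.95

se = sigma / sqrt(n)                        # standard error of the mean
z = NormalDist().inv_cdf(0.5 + level / 2)   # two-sided critical value
margin = z * se
lower, upper = mean - margin, mean + margin

print(round(se, 5), round(margin, 4), round(lower, 4), round(upper, 4))
```

Rounded to the precision shown in Fig. 4, these reproduce the SE, the half-width of the interval and both limits.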
Question 5
5 A hearing expert is investigating whether web-based hearing tests can be used instead of hearing tests in a hearing laboratory. The expert selects a random sample of 16 people with normal hearing. Each of them is given two hearing tests, one in the laboratory and one web-based. The scores in the laboratory-based test, \(x\), and the web-based test, \(y\), are both measured in the same suitable units.
  1. Half of the participants do the laboratory-based test first and the other half do the web-based test first. Explain why the expert adopts this approach. The scatter diagram in Fig. 5 shows the data that the expert collected. \begin{figure}[h]
    \includegraphics[alt={},max width=\textwidth]{8d36bc92-07ac-40c3-9e75-26f2bc9d2fcc-05_785_1360_1009_242} \captionsetup{labelformat=empty} \caption{Fig. 5}
    \end{figure} Summary statistics for these data are as follows. $$\Sigma x = 198.0 \quad \Sigma x ^ { 2 } = 2936.92 \quad \Sigma y = 188.7 \quad \Sigma y ^ { 2 } = 2605.35 \quad \Sigma x y = 2554.87$$
  2. Calculate the equation of the regression line suitable for estimating web-based scores from laboratory-based scores.
  3. Estimate the web-based scores of people whose laboratory-based scores were as follows.
    • 12
    • 25
  4. Comment on the reliability of each of your estimates.
  5. A colleague of the expert suggests that the regression line is not valid because one of the data values is an outlier. Stating the approximate coordinates of the outlier, suggest what the expert should do.
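The regression line of \(y\) on \(x\) in Question 5 can be obtained directly from the summary statistics. The sketch below is not part of the paper; it assumes the usual least squares formulas \(b = S_{xy} / S_{xx}\) and \(a = \bar{y} - b \bar{x}\), and the function name `predict_web_score` is hypothetical.

```python
n = 16
sum_x, sum_x2 = 198.0, 2936.92
sum_y, sum_xy = 188.7, 2554.87

# Corrected sums of squares and products.
s_xx = sum_x2 - sum_x ** 2 / n
s_xy = sum_xy - sum_x * sum_y / n

b = s_xy / s_xx                  # gradient of the line of y on x
a = sum_y / n - b * sum_x / n    # intercept, using the point (x-bar, y-bar)


def predict_web_score(lab_score):
    """Estimate the web-based score y from a laboratory-based score x."""
    return a + b * lab_score


print(f"y = {a:.3f} + {b:.4f} x")
print(predict_web_score(12), predict_web_score(25))
```

The code only produces the point estimates; whether each value of \(x\) lies inside the range of the data in Fig. 5 still has to be judged from the scatter diagram.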
Question 6
6 A pollution control officer is investigating a possible link between the levels of various pollutants in the air and the speed of the wind at various sites. A random sample of 60 values of the wind-speed together with the levels of a variety of pollutants is taken at a particular site. The product moment correlation coefficient between wind-speed and nitrogen dioxide level is 0.3231.
  1. Carry out a hypothesis test at the \(10 \%\) significance level to investigate whether there is any correlation between wind-speed and nitrogen dioxide level.
  2. State the condition required for the test carried out in part (a) to be valid. Table 6.1 shows the values of the product moment correlation coefficient between 5 different measures of pollution and also wind-speed for a very large random sample of values at another site. Those correlations that are significant at the \(10 \%\) level are denoted by a * after the value of the correlation. \begin{table}[h]
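    \begin{tabular}{|l|c|c|c|c|c|c|}
    \hline
    Correlations & PM10 & SPEED & \(\mathrm { NO } _ { 2 }\) & \(\mathrm { O } _ { 3 }\) & PM25 & \(\mathrm { SO } _ { 2 }\) \\
    \hline
    PM10 & 1.00 & & & & & \\
    SPEED & 0.08* & 1.00 & & & & \\
    \(\mathrm { NO } _ { 2 }\) & 0.59* & 0.25* & 1.00 & & & \\
    \(\mathrm { O } _ { 3 }\) & \(-0.05\)* & \(-0.04\)* & \(-0.30\)* & 1.00 & & \\
    PM25 & 0.85* & \(-0.01\) & 0.56* & \(-0.02\) & 1.00 & \\
    \(\mathrm { SO } _ { 2 }\) & 0.42* & 0.15* & 0.73* & \(-0.63\)* & 0.40* & 1.00 \\
    \hline
    \end{tabular}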
    \captionsetup{labelformat=empty} \caption{Table 6.1}
    \end{table} Table 6.2 shows standard guidelines for effect sizes. \begin{table}[h]
    \begin{tabular}{|c|l|}
    \hline
    Product moment correlation coefficient & Effect size \\
    \hline
    0.1 & Small \\
    0.3 & Medium \\
    0.5 & Large \\
    \hline
    \end{tabular}
    \captionsetup{labelformat=empty} \caption{Table 6.2}
    \end{table} The officer analyses these data for effect size.
  3. Explain how the very large sample size relates to the interpretation of the correlation coefficients shown in Table 6.1.
  4. Comment briefly on what the pollution control officer might conclude from these tables, relevant to her investigation into wind-speed and pollutant levels.
Question 7 10 marks
7 The lengths in mm of a random sample of 6 one-year-old fish of a particular species are as follows.
\(\begin{array} { l l l l l l } 271 & 293 & 306 & 287 & 264 & 290 \end{array}\)
  1. State an assumption required in order to find a confidence interval for the mean length of one-year-old fish of this species. Fig. 7 shows a Normal probability plot for these data. \begin{figure}[h]
    \includegraphics[alt={},max width=\textwidth]{8d36bc92-07ac-40c3-9e75-26f2bc9d2fcc-07_599_753_646_246} \captionsetup{labelformat=empty} \caption{Fig. 7}
    \end{figure}
  2. Explain why the Normal probability plot suggests that the assumption in part (a) may be valid.
  3. In this question you must show detailed reasoning. Assuming that this assumption is true, find a 95\% confidence interval for the mean length of one-year-old fish of this species.
Question 8 10 marks
8 In this question you must show detailed reasoning. On the manufacturer's website, it is claimed that the average daily electricity consumption of a particular model of fridge is 1.25 kWh (kilowatt hours). A researcher at a consumer organisation decides to check this figure. A random sample of 40 fridges is selected. Summary statistics for the electricity consumption \(x \mathrm { kWh }\) of these fridges, measured over a period of 24 hours, are as follows.
\(\Sigma x = 51.92 \quad \Sigma x ^ { 2 } = 70.57\)
Carry out a test at the \(5 \%\) significance level to investigate the validity of the claim on the website. [10]
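The test statistic for Question 8 can be computed from the summary statistics alone. The sketch below is not part of the paper and does not replace the detailed reasoning the question asks for; it assumes a two-tailed test of \(\mathrm{H}_0 : \mu = 1.25\) using the Normal distribution, justified by the sample size of 40, with the population variance estimated by the unbiased sample variance. The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n, sum_x, sum_x2 = 40, 51.92, 70.57
mu_0, alpha = 1.25, 0.05

x_bar = sum_x / n
s2 = (sum_x2 - sum_x ** 2 / n) / (n - 1)   # unbiased sample variance

z = (x_bar - mu_0) / sqrt(s2 / n)          # test statistic
critical = NormalDist().inv_cdf(1 - alpha / 2)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"x_bar = {x_bar:.4f}, s^2 = {s2:.5f}, z = {z:.4f}")
print(f"critical value = {critical:.3f}, p-value = {p_value:.4f}")
print("reject H0" if abs(z) > critical else "do not reject H0")
```

Under these assumptions the statistic lies inside the critical region boundaries, so the sample gives insufficient evidence to doubt the claimed figure.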
Question 9
9 A supermarket sells trays of peaches. Each tray contains 10 peaches. Often some of the peaches in a tray are rotten. The numbers of rotten peaches in a random sample of 150 trays are shown in Table 9.1. \begin{table}[h]
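\begin{tabular}{|l|c|c|c|c|c|c|c|c|}
\hline
Number of rotten peaches & 0 & 1 & 2 & 3 & 4 & 5 & 6 & \(\geqslant 7\) \\
\hline
Frequency & 39 & 39 & 33 & 19 & 8 & 8 & 4 & 0 \\
\hline
\end{tabular}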
\captionsetup{labelformat=empty} \caption{Table 9.1}
\end{table} A manager at the supermarket thinks that the number of rotten peaches in a tray may be modelled by a binomial distribution.
  1. Use these data to estimate the value of the parameter \(p\) for the binomial model \(\mathrm { B } ( 10 , p )\). The manager decides to carry out a goodness of fit test to investigate further. The screenshot in Fig. 9.2 shows part of a spreadsheet to assess the goodness of fit of the distribution \(\mathrm { B } ( 10 , p )\), using the value of \(p\) estimated from the data. \begin{table}[h]
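    \begin{tabular}{|c|c|c|c|c|c|}
    \hline
     & A & B & C & D & E \\
    \hline
    1 & Number of rotten peaches & Observed frequency & Binomial probability & Expected frequency & Chi-squared contribution \\
    \hline
    2 & 0 & 39 & & & \\
    3 & 1 & 39 & & & 1.4229 \\
    4 & 2 & 33 & 0.2941 & 44.1167 & 2.8012 \\
    5 & 3 & 19 & 0.1629 & 24.4383 & 1.2102 \\
    6 & \(\geqslant 4\) & 20 & 0.0769 & 11.5311 & 6.2199 \\
    7 & & & & & \\
    \hline
    \end{tabular}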
    \captionsetup{labelformat=empty} \caption{Fig. 9.2}
    \end{table}
  2. Calculate the missing values in each of the following cells.
    • C2
    • D2
    • E2
  3. Explain why the numbers for 4, 5, 6 and at least 7 rotten peaches have been combined into the single category of at least 4 rotten peaches, as shown in the spreadsheet.
  4. Carry out the test at the \(1 \%\) significance level.
  5. Using the values of the contributions, comment on the results of the test.
Question 10
10 The discrete random variables \(X\) and \(Y\) have distributions as follows: \(X \sim \mathrm { B } ( 20 , 0.3 )\) and \(Y \sim \operatorname { Po } ( 3 )\). The spreadsheet in Fig. 10 shows a simulation of the distributions of \(X\) and \(Y\). Each of the 20 rows below the heading row consists of a value of \(X\), a value of \(Y\), and the value of \(X - 2 Y\). \begin{table}[h]
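\begin{tabular}{|c|c|c|c|}
\hline
 & A & B & C \\
\hline
1 & \(X\) & \(Y\) & \(X - 2 Y\) \\
\hline
2 & 6 & 6 & \(-6\) \\
3 & 5 & 4 & \(-3\) \\
4 & 8 & 1 & 6 \\
5 & 6 & 5 & \(-4\) \\
6 & 6 & 3 & 0 \\
7 & 8 & 1 & 6 \\
8 & 6 & 4 & \(-2\) \\
9 & 5 & 4 & \(-3\) \\
10 & 7 & 4 & \(-1\) \\
11 & 8 & 3 & 2 \\
12 & 6 & 2 & 2 \\
13 & 5 & 1 & 3 \\
14 & 6 & 1 & 4 \\
15 & 5 & 4 & \(-3\) \\
16 & 7 & 2 & 3 \\
17 & 5 & 2 & 1 \\
18 & 4 & 4 & \(-4\) \\
19 & 5 & 0 & 5 \\
20 & 5 & 1 & 3 \\
21 & 4 & 2 & 0 \\
\hline
\end{tabular}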
\captionsetup{labelformat=empty} \caption{Fig. 10}
\end{table}
  1. Use the spreadsheet to estimate each of the following.
    • \(\mathrm { P } ( X - 2 Y > 0 )\)
    • \(\mathrm { P } ( X - 2 Y > 1 )\)
  2. How could the estimates in part (a) be improved?
  The mean of 50 values of \(X - 2 Y\) is denoted by the random variable \(W\).
  3. Calculate an estimate of \(\mathrm { P } ( W > 1 )\).
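The estimates in Question 10 come from relative frequencies in the simulated rows, and the probability for \(W\) from a Normal approximation. The sketch below is not part of the paper; the pairs are transcribed from Fig. 10, and the treatment of \(W\) assumes that \(X\) and \(Y\) are independent and that the Central Limit Theorem applies to a mean of 50 values. The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# (X, Y) pairs transcribed from rows 2 to 21 of Fig. 10.
pairs = [(6, 6), (5, 4), (8, 1), (6, 5), (6, 3), (8, 1), (6, 4), (5, 4),
         (7, 4), (8, 3), (6, 2), (5, 1), (6, 1), (5, 4), (7, 2), (5, 2),
         (4, 4), (5, 0), (5, 1), (4, 2)]
values = [x - 2 * y for x, y in pairs]

# Relative-frequency estimates from the 20 simulated values.
est_gt_0 = sum(v > 0 for v in values) / len(values)
est_gt_1 = sum(v > 1 for v in values) / len(values)

# Theoretical mean and variance of X - 2Y (independence assumed).
mean_d = 20 * 0.3 - 2 * 3              # E(X) - 2 E(Y)
var_d = 20 * 0.3 * 0.7 + 4 * 3         # Var(X) + 4 Var(Y)

# W is the mean of 50 values, so approximately N(mean_d, var_d / 50).
w = NormalDist(mean_d, sqrt(var_d / 50))
p_w_gt_1 = 1 - w.cdf(1)

print(est_gt_0, est_gt_1, p_w_gt_1)
```

With only 20 rows the relative frequencies are coarse (multiples of 0.05), which is why a longer simulation gives better estimates.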
Question 11
11 The length of time in minutes for which a particular geyser erupts is modelled by the continuous random variable \(T\) with cumulative distribution function given by
\(\mathrm { F } ( t ) = \begin{cases} 0 & t \leqslant 2 , \\
k \left( 8 t ^ { 2 } - t ^ { 3 } - 24 \right) & 2 < t < 4 , \\
1 & t \geqslant 4 , \end{cases}\)
where \(k\) is a positive constant.
  1. Show that \(k = \frac { 1 } { 40 }\).
  2. Find the probability that a randomly selected eruption time lies between 2.5 and 3.5 minutes.
  3. Show that the median \(m\) of the distribution satisfies the equation \(m ^ { 3 } - 8 m ^ { 2 } + 44 = 0\).
  4. Verify that the median eruption time is 2.95 minutes, correct to 2 decimal places. The mean and standard deviation of \(T\) are denoted by \(\mu\) and \(\sigma\) respectively.
  5. Find \(\mathrm { P } ( \mu - \sigma < T < \mu + \sigma )\).
  6. Sketch the graph of the probability density function of \(T\).
  7. A Normally distributed random variable \(X\) has the same mean and standard deviation as \(T\). By considering the shape of the Normal distribution, and without doing any calculations, explain whether \(\mathrm { P } ( \mu - \sigma < X < \mu + \sigma )\) will be greater than, equal to or less than the probability that you calculated in part (e).
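Several of the stated results in Question 11 can be checked numerically from the cumulative distribution function. The sketch below is not part of the paper; it takes \(k = \frac{1}{40}\) as given in part (a), and the function names `cdf` and `median_cubic` are hypothetical.

```python
def cdf(t, k=1 / 40):
    """Cumulative distribution function F(t) of the eruption time T."""
    if t <= 2:
        return 0.0
    if t >= 4:
        return 1.0
    return k * (8 * t ** 2 - t ** 3 - 24)


def median_cubic(m):
    """Left-hand side of m^3 - 8m^2 + 44 = 0, obtained from F(m) = 1/2."""
    return m ** 3 - 8 * m ** 2 + 44


# Continuity at t = 4 requires k * (8*16 - 64 - 24) = 1.
print(8 * 4 ** 2 - 4 ** 3 - 24)

# P(2.5 < T < 3.5) = F(3.5) - F(2.5).
print(cdf(3.5) - cdf(2.5))

# A sign change between 2.945 and 2.955 places the root at 2.95 to 2 d.p.
print(median_cubic(2.945), median_cubic(2.955))
```

The sign-change check mirrors the standard verification method: the cubic is evaluated at the two bounds of the rounding interval rather than solved exactly.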