Clean or interpret large data set structure

Questions that ask students to explain data cleaning needs, identify variable types, state units, or describe structural features of the large data set without performing calculations.

9 questions

Edexcel AS Paper 2 2019 June Q4
4. Joshua is investigating the daily total rainfall in Hurn for May to October 2015
Using the information from the large data set, Joshua wishes to calculate the mean of the daily total rainfall in Hurn for May to October 2015
  1. Using your knowledge of the large data set, explain why Joshua needs to clean the data before calculating the mean.
Using the information from the large data set, he produces the grouped frequency table below.
| Daily total rainfall (\(r\) mm) | Frequency | Midpoint (\(x\) mm) |
| --- | --- | --- |
| \(0 \leqslant r < 0.5\) | 121 | 0.25 |
| \(0.5 \leqslant r < 1.0\) | 10 | 0.75 |
| \(1.0 \leqslant r < 5.0\) | 24 | 3.0 |
| \(5.0 \leqslant r < 10.0\) | 12 | 7.5 |
| \(10.0 \leqslant r < 30.0\) | 17 | 20.0 |
    $$\text { You may use } \sum \mathrm { f } x = 539.75 \text { and } \sum \mathrm { f } x ^ { 2 } = 7704.1875$$
  2. Use linear interpolation to calculate an estimate for the upper quartile of the daily total rainfall.
  3. Calculate an estimate for the standard deviation of the daily total rainfall in Hurn for May to October 2015
    1. State the assumption involved with using class midpoints to calculate an estimate of a mean from a grouped frequency table.
    2. Using your knowledge of the large data set, explain why this assumption does not hold in this case.
    3. State, giving a reason, whether you would expect the actual mean daily total rainfall in Hurn for May to October 2015 to be larger than, smaller than or the same as an estimate based on the grouped frequency table.
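The grouped-table estimates asked for above can be checked numerically. The following is a minimal Python sketch, assuming the class boundaries and frequencies read from the grouped frequency table and assuming the upper quartile is located at the 0.75n-th value (mark schemes sometimes accept a slightly different position). The names `classes` and `interpolate_quantile` are illustrative only.

```python
from math import sqrt

# (lower bound, upper bound, frequency) for each rainfall class, in mm
classes = [
    (0.0, 0.5, 121),
    (0.5, 1.0, 10),
    (1.0, 5.0, 24),
    (5.0, 10.0, 12),
    (10.0, 30.0, 17),
]

n = sum(f for _, _, f in classes)
# Sums based on class midpoints, matching the "You may use" values
sum_fx = sum(f * (lo + hi) / 2 for lo, hi, f in classes)
sum_fx2 = sum(f * ((lo + hi) / 2) ** 2 for lo, hi, f in classes)

mean = sum_fx / n
sd = sqrt(sum_fx2 / n - mean ** 2)


def interpolate_quantile(classes, position):
    """Linear interpolation inside the class that contains the cumulative position."""
    cumulative = 0
    for lo, hi, f in classes:
        if cumulative + f >= position:
            return lo + (position - cumulative) / f * (hi - lo)
        cumulative += f
    raise ValueError("position exceeds total frequency")


upper_quartile = interpolate_quantile(classes, 0.75 * n)

print(n, sum_fx, sum_fx2)
print(round(mean, 2), round(sd, 2), round(upper_quartile, 2))
```

The total frequency of 184 matches the number of days from May to October, which is a quick consistency check on the table.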
OCR Stats 1 2018 December Q10
10 Using the 2001 UK census results and some software, Javid intended to calculate the mean number of people who travelled to work by underground, metro, light rail or tram (UMLT) for all 348 Local Authorities. However, Javid noticed that for one LA the entry in the UMLT column is a dash, rather than a 0. See the extract below.
Data extract for one LA in 2001
| Work mainly at or from home | UMLT | Train | Bus, minibus or coach |
| --- | --- | --- | --- |
| 295 | - | 4 | 4 |
Javid felt that it was not clear how this LA was to be treated so he decided to omit it from his calculation.
  1. Explain how the omission of this LA affects Javid's calculation of the mean.
The value of the mean that Javid obtained was 2046.3.
  2. Calculate the value of the mean when this LA is not removed.
Javid finds that the corresponding mean for all Local Authorities for 2011 is 2860.8. In order to compare the means for the two years, Javid also finds the total number of employees in each of these years. His results are given below.
| Year | 2001 | 2011 |
| --- | --- | --- |
| Total number of employees | 23627753 | 26526336 |
  3. Show that a higher proportion of employees used the metro to travel to work in 2011 than in 2001.
  4. Suggest a reason for this increase.
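The arithmetic behind the mean and the proportion comparison can be sketched in Python. This is a rough check, not a model answer: it assumes the dash stands for 0 UMLT commuters in that LA, that Javid's 2001 mean was taken over the remaining 347 LAs, and that the employee totals are 23627753 (2001) and 26526336 (2011). All variable names are illustrative.

```python
N_AUTHORITIES = 348

mean_2001_omitting_one = 2046.3  # Javid's value, taken over 347 LAs
mean_2011 = 2860.8               # taken over all 348 LAs
employees = {2001: 23_627_753, 2011: 26_526_336}

# Assumption: the dash means 0, so the omitted LA adds nothing to the total.
total_2001 = mean_2001_omitting_one * (N_AUTHORITIES - 1)
mean_2001_all = total_2001 / N_AUTHORITIES

total_2011 = mean_2011 * N_AUTHORITIES

# Proportion of all employees who travelled by UMLT in each year
proportion = {
    2001: total_2001 / employees[2001],
    2011: total_2011 / employees[2011],
}

print(round(mean_2001_all, 1))
print({year: round(p, 4) for year, p in proportion.items()})
```

Comparing proportions rather than means accounts for the growth in the total number of employees between the two censuses.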
OCR H240/02 2022 June Q10
10 The table shows the age structure of usual residents of 18 Local Authorities (LAs) in the North West region of the UK in 2011.
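Percentage of residents
| Local Authority | Age 0 to 17 | Age 18 to 24 | Age 25 to 64 | Age 65 and over |
| --- | --- | --- | --- | --- |
| A | 26.20\% | 9.06\% | 51.81\% | 12.92\% |
| B | 23.32\% | 8.99\% | 52.32\% | 15.37\% |
| C | 22.24\% | 8.96\% | 52.56\% | 16.23\% |
| D | 22.67\% | 8.10\% | 53.27\% | 15.96\% |
| E | 20.70\% | 7.77\% | 54.77\% | 16.76\% |
| F | 18.14\% | 6.51\% | 51.13\% | 24.21\% |
| G | 18.96\% | 14.20\% | 48.51\% | 18.33\% |
| H | 19.06\% | 14.79\% | 52.12\% | 14.04\% |
| I | 25.15\% | 9.04\% | 51.16\% | 14.65\% |
| J | 22.93\% | 8.81\% | 52.22\% | 16.04\% |
| K | 21.48\% | 13.98\% | 50.82\% | 13.73\% |
| L | 23.98\% | 9.20\% | 52.26\% | 14.56\% |
| M | 21.67\% | 11.19\% | 52.94\% | 14.19\% |
| N | 17.82\% | 6.01\% | 51.93\% | 24.23\% |
| O | 22.83\% | 7.30\% | 53.86\% | 16.01\% |
| P | 21.76\% | 8.28\% | 54.03\% | 15.93\% |
| Q | 21.42\% | 8.43\% | 53.90\% | 16.25\% |
| R | 18.61\% | 7.33\% | 49.35\% | 24.71\% |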
  1. Without reference to any other columns, explain how you would use only the columns for the age ranges 0 to 17 and 18 to 24 to decide whether an LA might be one of the following.
    1. An LA that includes a university
    2. An LA that attracts young couples to live
    3. An LA that attracts retired people to live
  2. Using your answers to part (a), identify the following.
    1. Four LAs that might include a university
    2. Three LAs that might be attractive to retired people
  3. Explain why your answer to part (b)(ii), based only on the columns for the age ranges 0 to 17 and 18 to 24, may not be reliable.
  4. The lower quartile, median and upper quartile of the percentages in the column "Age 65 and over" are \(14.56 \% , 15.99 \%\) and \(16.76 \%\) respectively. Use this information to comment on your answers to part (b)(ii) and part (c).
In a magazine article, a councillor plans to describe a typical LA in the North West region. He wants to quote the average percentage of residents aged 65 or over.
  5. The mean of the percentages in the column "Age 65 and over" is \(16.90 \%\). Use this information, and the information given in part (d), to explain whether the median or the mean better represents the data in the column "Age 65 and over".
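The quoted summary statistics for the "Age 65 and over" column can be reproduced from the table, which also shows why the mean sits above the median. The sketch below is a Python check under one stated assumption: values more than 1.5 interquartile ranges beyond a quartile are treated as unusually extreme. That rule is a common convention and is not given in the question; the quartiles are the ones quoted in part (d).

```python
from statistics import mean, median

# "Age 65 and over" percentages, keyed by Local Authority letter
aged_65_plus = {
    "A": 12.92, "B": 15.37, "C": 16.23, "D": 15.96, "E": 16.76, "F": 24.21,
    "G": 18.33, "H": 14.04, "I": 14.65, "J": 16.04, "K": 13.73, "L": 14.56,
    "M": 14.19, "N": 24.23, "O": 16.01, "P": 15.93, "Q": 16.25, "R": 24.71,
}

column_mean = mean(aged_65_plus.values())
column_median = median(aged_65_plus.values())

# Quartiles as quoted in the question
q1, q3 = 14.56, 16.76
iqr = q3 - q1

# Assumed convention: flag values beyond 1.5 * IQR from the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
flagged = sorted(
    la for la, pct in aged_65_plus.items()
    if pct < lower_fence or pct > upper_fence
)

print(round(column_mean, 2), round(column_median, 2), flagged)
```

Under that convention three LAs lie well above the upper fence, and these pull the mean above the median.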
AQA AS Paper 2 2023 June Q12
12 The mass of a bag of nuts produced by a company is known to have a mean of 40 grams and a standard deviation of 3 grams. The company produces five different flavours of nuts.
The bags of nuts are packed in large boxes.
Given the information above, identify the continuous variable from the options below.
Tick ( \(\checkmark\) ) one box.
□ The flavours of the bags of nuts
□ The known standard deviation of the mass of a bag of nuts
□ The mass of an individual bag of nuts
□ The number of bags of nuts in a large box
AQA AS Paper 2 2024 June Q11
11 The table below shows the daily salt intake, \(x\) grams, and the daily Vitamin C intake, \(y\) milligrams, for a group of 10 adults.
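| Adult | A | B | C | D | E | F | G | H | I | J |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| \(x\) | 5.3 | 6.2 | 3.6 | 10.4 | 2.4 | 9.4 | 6 | 5 | 7.1 | 11.2 |
| \(y\) | 90 | 145 | 88 | 48 | 114 | 44 | 80 | 95 | 55 | 41 |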
A scatter diagram of the data is shown below.
\includegraphics[max width=\textwidth, alt={}, center]{f5e0d980-4c50-4735-aea7-1bdf448a58f7-17_675_1150_1110_431}
One of the adults is an outlier. Identify the letter of the adult that is the outlier.
Circle your answer below.
A
B
E
J
Which one of the following is not a measure of spread?
Circle your answer.
median
range
standard deviation
variance
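For the outlier part of this question, the intended method is to read the scatter diagram, but the same idea can be checked numerically. The sketch below is one possible approach, not the method the paper expects: it fits a least-squares line and reports the adult whose point lies furthest from it. The split of the flattened data row into ten \(x\) values and ten \(y\) values is an assumption, and all names are illustrative.

```python
adults = "ABCDEFGHIJ"
# Assumed reading of the table: salt intake x (g) and Vitamin C intake y (mg)
x = [5.3, 6.2, 3.6, 10.4, 2.4, 9.4, 6, 5, 7.1, 11.2]
y = [90, 145, 88, 48, 114, 44, 80, 95, 55, 41]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares regression line of y on x
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Vertical distance of each point from the fitted line
residuals = {
    adult: yi - (intercept + slope * xi)
    for adult, xi, yi in zip(adults, x, y)
}
furthest = max(residuals, key=lambda adult: abs(residuals[adult]))

print(furthest, round(residuals[furthest], 1))
```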
AQA AS Paper 2 2024 June Q16
2 marks
16
  1. (ii) where \(n\) is the total number of cars which had a measured hydrocarbon emission in the Large Data Set.
  2. Find the mean of \(X\)
    [1 mark]
  3. (ii) State one type of emission where more than 80\% of the data is known for cars in the entire UK Department for Transport Stock Vehicle Database.
    [1 mark]
AQA AS Paper 2 Specimen Q14
3 marks
14 In the Large Data Set, the emissions of carbon dioxide are measured in what units?
Circle your answer.
[1 mark]
mg/litre
g/litre
g/km
mg/km
A school took 225 children on a trip to a theme park.
After the trip the children had to write about their favourite ride at the park from a choice of three. The table shows the number of children who wrote about each ride.
| Year group / Ride written about | The Drop | The Beanstalk | The Giant | Total |
| --- | --- | --- | --- | --- |
| Year 7 | 24 | 45 | 23 | 92 |
| Year 8 | 36 | 17 | 22 | 75 |
| Year 9 | 20 | 13 | 25 | 58 |
| Total | 80 | 75 | 70 | 225 |
Three children were randomly selected from those who went on the trip.
Calculate the probability that one wrote about 'The Drop', one wrote about 'The Beanstalk' and one wrote about 'The Giant'.
[2 marks]
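The selection here is without replacement, and the three rides can turn up in any order. A short Python sketch using exact fractions makes both points explicit; it assumes the column totals 80, 75 and 70 from the table, and the variable names are illustrative.

```python
from fractions import Fraction
from math import factorial

# Column totals: number of children who wrote about each ride
totals = {"The Drop": 80, "The Beanstalk": 75, "The Giant": 70}
n_children = sum(totals.values())  # 225

# Probability of one particular order (Drop, then Beanstalk, then Giant),
# choosing without replacement
one_order = (
    Fraction(totals["The Drop"], n_children)
    * Fraction(totals["The Beanstalk"], n_children - 1)
    * Fraction(totals["The Giant"], n_children - 2)
)

# Every one of the 3! orders has the same probability
probability = factorial(3) * one_order

print(probability, float(probability))
```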
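AQA AS Paper 2 Specimen Q17
6 marks
17 The table below is an extract from the Large Data Set.
| Make | Region | Engine size | Mass | CO2 | CO |
| --- | --- | --- | --- | --- | --- |
| VAUXHALL | South West | 1398 | 1163 | 118 | 0.463 |
| VOLKSWAGEN | London | 999 | 1055 | 106 | 0.407 |
| VAUXHALL | South West | 1248 | 1225 | 85 | 0.141 |
| BMW | South West | 2979 | 1635 | 194 | 0.139 |
| TOYOTA | South West | 1995 | 1650 | 123 | 0.274 |
| BMW | South West | 2979 | 0 | 244 | 0.447 |
| FORD | South West | 1596 | 0 | 165 | 0.518 |
| TOYOTA | South West | 1299 | 1050 | 144 | |
| VAUXHALL | London | 1398 | 1361 | 140 | 0.695 |
| FORD | North West | 4951 | 1799 | 299 | 0.621 |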
  1. (i) Calculate the standard deviation of the engine sizes in the table.
    [1 mark]
    (ii) The mean of the engine sizes is 2084
    Any value more than 2 standard deviations from the mean can be identified as an outlier. Using this definition of an outlier, show that the sample of engine sizes has exactly one outlier. Fully justify your answer.
    [3 marks]
  2. Rajan calculates the mean of the masses of the cars in this extract and states that it is 1094 kg. Use your knowledge of the Large Data Set to suggest what error Rajan is likely to have made in his calculation.
    [1 mark]
  3. Rajan claims there is an error in the data recorded in the table for one of the Toyotas from the South West, because there is no value for its carbon monoxide emissions. Use your knowledge of the Large Data Set to comment on Rajan's claim.
    [1 mark]
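The outlier rule and Rajan's mean can both be checked with a few lines of Python. This is a sketch under stated assumptions: the engine sizes and masses are read from the extract above, the population standard deviation (`pstdev`) is used, and a mass of 0 is taken to mean that no mass was recorded. That last point is an interpretation of the Large Data Set, not something stated in the question.

```python
from statistics import mean, pstdev

engine_sizes = [1398, 999, 1248, 2979, 1995, 2979, 1596, 1299, 1398, 4951]
masses = [1163, 1055, 1225, 1635, 1650, 0, 0, 1050, 1361, 1799]

size_mean = mean(engine_sizes)
size_sd = pstdev(engine_sizes)  # population standard deviation (assumed)

# Outlier rule from the question: more than 2 standard deviations from the mean
low = size_mean - 2 * size_sd
high = size_mean + 2 * size_sd
outliers = [size for size in engine_sizes if size < low or size > high]

# Mean mass with the zero entries included, then with them excluded
mean_mass_with_zeros = mean(masses)
mean_mass_recorded = mean([m for m in masses if m > 0])

print(round(size_mean), round(size_sd), outliers)
print(round(mean_mass_with_zeros), mean_mass_recorded)
```

Using the sample standard deviation instead widens the limits slightly but leaves the same single engine size outside them.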
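AQA Paper 3 2021 June Q13
2 marks
13 The table below is an extract from the Large Data Set.
| Propulsion Type | Region | Engine Size | Mass | \(\mathrm { CO } _ { 2 }\) | Particulate Emissions |
| --- | --- | --- | --- | --- | --- |
| 2 | London | 1896 | 1533 | 154 | 0.04 |
| 2 | North West | 1896 | 1423 | 146 | 0.029 |
| 2 | North West | 1896 | 1353 | 138 | 0.025 |
| 2 | South West | 1998 | 1547 | 159 | 0.026 |
| 2 | London | 1896 | 1388 | 138 | 0.025 |
| 2 | South West | 1896 | 1214 | 130 | 0.011 |
| 2 | South West | 1896 | 1480 | 146 | 0.029 |
| 2 | South West | 1896 | 1413 | 146 | 0.024 |
| 2 | South West | 2496 | 1695 | 192 | 0.034 |
| 2 | South West | 1422 | 1251 | 122 | 0.025 |
| 2 | South West | 1995 | 2075 | 175 | 0.034 |
| 2 | London | 1896 | 1285 | 140 | 0.036 |
| 2 | North West | 1896 | 0 | 146 | |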
  1. (i) Calculate the mean and standard deviation of \(\mathrm { CO } _ { 2 }\) emissions in the table.
    [2 marks]
    (ii) Any value more than 2 standard deviations from the mean can be identified as an outlier. Determine, using this definition of an outlier, if there are any outliers in this sample of \(\mathrm { CO } _ { 2 }\) emissions. Fully justify your answer.
  2. Maria claims that the last line in the table must contain two errors. Use your knowledge of the Large Data Set to comment on Maria's claim.
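The two-standard-deviation rule for the \(\mathrm{CO}_2\) column can be wrapped in a small helper so that both common definitions of standard deviation are tried. The Python sketch below assumes the thirteen \(\mathrm{CO}_2\) values read from the extract above; `two_sd_outliers` is a hypothetical helper name, not part of any library.

```python
from statistics import mean, pstdev, stdev

co2 = [154, 146, 138, 159, 138, 130, 146, 146, 192, 122, 175, 140, 146]


def two_sd_outliers(values, sd_function):
    """Return the values lying more than 2 standard deviations from the mean."""
    centre = mean(values)
    spread = sd_function(values)
    return [v for v in values if abs(v - centre) > 2 * spread]


co2_mean = mean(co2)
co2_sd_population = pstdev(co2)  # divides by n
co2_sd_sample = stdev(co2)       # divides by n - 1

outliers_population = two_sd_outliers(co2, pstdev)
outliers_sample = two_sd_outliers(co2, stdev)

print(round(co2_mean, 1), round(co2_sd_population, 2), round(co2_sd_sample, 2))
print(outliers_population, outliers_sample)
```

Both definitions lead to the same conclusion for this sample, so the choice between them does not change the answer to the outlier part.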
14 \(A\) and \(B\) are two events such that
$$\begin{aligned} & \mathrm { P } ( A \cap B ) = 0.1 \\ & \mathrm { P } \left( A ^ { \prime } \cap B ^ { \prime } \right) = 0.2 \\ & \mathrm { P } ( B ) = 2 \mathrm { P } ( A ) \end{aligned}$$