Assignment I for Biostats Course VHM 801 at AVC - Fall semester 2024

The assignment is worth 10% of the final course mark. Please be aware that by handing in the home assignment you implicitly acknowledge that you have read and accepted the instructions for home assignments as described on the VHM 801 homepage.

This assignment is based on data collected as part of the Framingham heart study. The data comprise 200 persons, and no additional information is available on how this subgroup was selected from the study, except that they appear to belong to the original cohort. For the purpose of this assignment we will assume they constitute a random sample among the study participants. The listing below briefly describes the variables included in the data.

The dataset is available in Minitab format and as a comma-separated file for import into other statistical software.

The home assignment has four questions, all of which should be answered.

  1. Carry out separate descriptive analyses of the cholesterol values for men and women. Choose the graphical representation and the statistics you find most useful to show each of the distributions, and comment specifically on each distribution's center, spread and shape, as well as potential outliers (including whether these should be considered truly outlying observations). Include a table (or tabular display) to compare the most important characteristics of the distributions between gender groups, and comment briefly on your findings. Note: you are not expected to carry out statistical inference to compare the gender groups.

  2. According to IPS Supplementary Exercises 1.127 and 1.128, the cholesterol values in specific age and gender groups may be considered to follow approximately normal distributions. Consider here the age group 35-44 years, and examine for the men and women separately whether it would seem reasonable to assume the data to be normally distributed. Describe carefully the tools you use for this, and how you arrive at your conclusions. If you conclude that a variable is not normally distributed, describe how its distribution seems to differ from a normal distribution. (Minitab hint: The Data-Subset Worksheet menu allows you to create a new worksheet containing a subset of the data.)

  3. The reference quoted in IPS Supplementary Exercise 1.127 states the following statistics for the 35-44 age group (in a comparable time period), separated by gender groups: mean cholesterol value and proportion of high (>= 240 mg/dL) cholesterol values.

    Gender   Mean cholesterol (mg/dL)   High cholesterol
    men      227                        0.339 (or 33.9%)
    women    214                        0.231 (or 23.1%)

    Compare descriptively (without carrying out any statistical inference) and separately for the men and women the mean and proportion of high cholesterol values in the table with those you can estimate from the actual data. Note that a proportion of high cholesterol values can be estimated from data in two ways: with or without assuming a normal distribution. It is sufficient to carry out one calculation for each gender group to estimate the proportion of high cholesterol values, but you should use the method most appropriate for the data at hand and justify your choice of method. Comment also briefly on your findings, e.g. how closely the values in the table agree with those estimated from the data and how the agreement compares between women and men.

  4. The reference quoted in IPS Supplementary Exercise 1.127 apparently does not actually state that the cholesterol values follow normal distributions with specified means and standard deviations; nor are standard deviations included in its table for cholesterol values. This suggests that the standard deviations in the exercise were computed from the stated mean and proportion of high cholesterol values. For one of the gender groups (of your own choice), determine the standard deviation that in a normal distribution with the mean from the table would give the same proportion of high cholesterol values as in the table. Describe carefully your method (a correct answer without a description of the method will not be awarded a full score for this question). Discuss any limitations or problems you see with this method of estimating the standard deviation.
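
For readers working from the comma-separated file in software other than Minitab, the two ways of estimating a proportion mentioned in question 3 can be sketched as follows. This is only an illustration under stated assumptions: the data values are invented and are not taken from the assignment dataset, the function names are hypothetical, and Python's standard library is just one possible tool. The sketch does not indicate which method is more appropriate for the actual data.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical cholesterol values (mg/dL), invented for illustration only;
# they are NOT taken from the assignment dataset.
values = [182, 195, 201, 210, 214, 223, 230, 238, 245, 262]
THRESHOLD = 240  # "high" cholesterol cutoff (>= 240 mg/dL) from question 3


def empirical_proportion(data, threshold):
    """Share of observations at or above the threshold (no distributional assumption)."""
    return sum(x >= threshold for x in data) / len(data)


def normal_proportion(data, threshold):
    """Upper-tail area of a normal distribution fitted with the sample mean and SD."""
    fitted = NormalDist(mean(data), stdev(data))
    return 1 - fitted.cdf(threshold)


print(f"without normal assumption: {empirical_proportion(values, THRESHOLD):.3f}")
print(f"with normal assumption:    {normal_proportion(values, THRESHOLD):.3f}")
```

The first function simply counts observations, so it depends only on the data at hand; the second relies entirely on the fitted normal curve, so its result is only as trustworthy as the normality assumption examined in question 2.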

Henrik Stryhn (hstryhn@upei.ca) 2024-09-25