Assignment I for Biostats Course VHM 801 at AVC - Fall semester 2024
The assignment is worth 10% of the final course mark. Please be aware that by handing
in the home assignment you implicitly acknowledge that you have read and accepted
the instructions for home assignments as described
on the VHM 801 homepage.
This assignment is based on data collected as part of the Framingham heart study.
The data comprise 200 persons, and no additional information is available on how this subgroup was selected
from the study, except that they appear to belong to the original
cohort. For the purpose of this assignment we will assume they constitute a random sample among the
study participants. The listing below briefly describes the variables
included in the data.
- person: an arbitrarily coded running person number (1-200),
- chol: blood (serum) cholesterol level (in mg/dL),
- gender: person's gender (0=female; 1=male),
- age: person's age (in years) at the time of blood cholesterol measurement.
The dataset is available
in Minitab format and as a comma-separated file
for import into other statistical software.
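For those working outside Minitab, the following is a minimal Python sketch of how the comma-separated file might be read and subset by gender and age. The file name and the exact column headers are not stated in this text, so the header row (`person,chol,gender,age`) and the four data rows below are made up purely for illustration; replace `io.StringIO(SAMPLE_CSV)` with an opened copy of the real file.

```python
import csv
import io

# ASSUMPTION: the header row matches the variable names listed above.
# The data rows are invented for illustration and are NOT from the dataset.
SAMPLE_CSV = """person,chol,gender,age
1,221,0,38
2,254,1,41
3,198,0,52
4,240,1,36
"""


def read_rows(handle):
    """Read the CSV into a list of dicts with numeric values."""
    reader = csv.DictReader(handle)
    return [
        {
            "person": int(r["person"]),
            "chol": float(r["chol"]),
            "gender": int(r["gender"]),  # 0 = female, 1 = male
            "age": float(r["age"]),
        }
        for r in reader
    ]


def subset(rows, gender, age_min=None, age_max=None):
    """Keep rows for one gender, optionally within an inclusive age range."""
    kept = []
    for r in rows:
        if r["gender"] != gender:
            continue
        if age_min is not None and r["age"] < age_min:
            continue
        if age_max is not None and r["age"] > age_max:
            continue
        kept.append(r)
    return kept


rows = read_rows(io.StringIO(SAMPLE_CSV))
men_35_44 = subset(rows, gender=1, age_min=35, age_max=44)
print([r["chol"] for r in men_35_44])
```

The `subset` helper plays the same role as the Data-Subset Worksheet menu in Minitab: it leaves the full data untouched and returns a new list for the group of interest.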
The home assignment has four questions, all of which should be answered.
- Carry out separate descriptive analyses for the cholesterol values for men and women.
Choose the graphical representation and the statistics you find most
useful to show each of the distributions, and comment specifically on each distribution's center, spread and shape,
as well as potential outliers (including whether these should be considered
truly outlying observations).
Include a table (or tabular display) to compare the most important
characteristics of the distributions
between gender groups, and comment briefly on your findings; note: you are not expected to carry out statistical inference to compare the gender groups.
- According to IPS Supplementary Exercises 1.127 and 1.128, the cholesterol
values in specific age and gender groups may be considered to follow approximately normal
distributions. Consider here the age group 35-44 years, and examine for the men and women separately
whether it would seem reasonable to assume the data to be normally distributed.
Describe carefully the tools you use for this, and how you arrive at
your conclusions. If you conclude that a variable is not normally distributed, describe how its
distribution seems to differ from a normal distribution. (Minitab hint: The Data-Subset Worksheet
menu allows you to create a new worksheet containing a subset of the data.)
- The reference quoted in IPS Supplementary Exercise 1.127 states the following statistics
for the 35-44 age group (in a comparable time period), separated by gender groups: mean cholesterol value and proportion of high
(>= 240 mg/dL) cholesterol values.
| Gender | Mean cholesterol (mg/dL) | Proportion with high cholesterol (>= 240 mg/dL) |
|---|---|---|
| men | 227 | 0.339 (or 33.9%) |
| women | 214 | 0.231 (or 23.1%) |
Compare descriptively (without carrying out any statistical inference) and separately for the men and women
the mean and proportion of high cholesterol values in the table with
those you can estimate from the actual data. Note that a proportion of high cholesterol
values can be estimated from data in two ways: with or without
assuming a normal distribution. It is sufficient to carry out one
calculation for each gender group to estimate the proportion of high cholesterol values, but you should use the method most appropriate for the data
at hand and justify your choice of method. Comment also briefly on your findings, e.g. how closely the values in the table
agree with those estimated from the data and how the agreement compares between women and men.
- The reference quoted in IPS Supplementary Exercise 1.127
apparently does not actually state that the cholesterol values follow normal
distributions with specified means and standard deviations; nor are
standard deviations included in its table for cholesterol values.
This suggests that the standard deviations in the exercise were
computed from the stated mean and proportion of high cholesterol values.
For one of the gender groups (of your own choice), determine the standard
deviation that in a normal distribution with the mean from the table
would give the same proportion of high cholesterol values as in the table.
Describe carefully your method (a correct answer without a description
of the method will not be awarded a full score for this question).
Discuss any limitations or problems you see with this method of
estimating the standard deviation.
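The calculations referred to in the last two questions can be sketched in Python using only the standard library. This is an illustration under stated assumptions, not a model answer: the function names are hypothetical, the data values are invented, and the mean and proportion passed to the last function (200 mg/dL and 0.10) are deliberately not taken from the table. The sketch relies on the relation that, for a normal distribution with mean mu and standard deviation sigma, P(X >= c) = p is equivalent to (c - mu) / sigma = z, where z is the standard normal quantile at 1 - p.

```python
from statistics import NormalDist, mean, stdev

CUTOFF = 240.0  # mg/dL, the "high cholesterol" threshold used in the assignment


def empirical_proportion(values, cutoff=CUTOFF):
    """Proportion at or above the cutoff, counted directly from the data."""
    return sum(v >= cutoff for v in values) / len(values)


def normal_proportion(values, cutoff=CUTOFF):
    """Proportion at or above the cutoff under a normal distribution
    with the sample mean and sample standard deviation."""
    fitted = NormalDist(mean(values), stdev(values))
    return 1.0 - fitted.cdf(cutoff)


def sd_from_mean_and_tail(mu, tail_prob, cutoff=CUTOFF):
    """Standard deviation of the normal distribution with mean mu for which
    P(X >= cutoff) equals tail_prob."""
    z = NormalDist().inv_cdf(1.0 - tail_prob)  # standard normal quantile
    if z == 0 or (cutoff - mu) / z <= 0:
        # No positive sigma exists, e.g. tail_prob = 0.5 with mu != cutoff.
        raise ValueError("no normal distribution matches these inputs")
    return (cutoff - mu) / z


# Invented values for illustration only; NOT taken from the dataset.
illustrative = [180, 195, 210, 222, 231, 240, 255, 268]
print(empirical_proportion(illustrative))  # direct count: 3 of 8 values
print(normal_proportion(illustrative))     # normal-based estimate

# Hypothetical inputs (not the table values): mean 200 mg/dL, 10% above cutoff.
sigma = sd_from_mean_and_tail(200.0, 0.10)
print(sigma)
print(1.0 - NormalDist(200.0, sigma).cdf(CUTOFF))  # check: recovers about 0.10
```

Note that the back-calculated standard deviation is determined entirely by two numbers and the assumption of normality, so it is only as trustworthy as that assumption.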
Henrik Stryhn
(hstryhn@upei.ca) 2024-09-25