Cover Page

To my children:

Mark, Helen and Stephen

Mathematical Models and Methods in Reliability Set

coordinated by
Nikolaos Limnios and Bo Henry Lindqvist

Volume 4

First Hitting Time Regression Models

Lifetime Data Analysis Based on Underlying Stochastic Processes

Chrysseis Caroni

image

Preface

The analysis of lifetime (or time-to-event) data is one of the busiest fields of statistics. This is partly due to the interesting theoretical issues that come up, which tend to be rather different from those that are seen in other areas, and, above all, due to the enormous range of its practical applications. One major field of application is in engineering and technology, where the subject is usually known as reliability. Reliability is concerned with how long a machine or a machine component will operate before it requires replacement or repair. The other major field of application is in biostatistics. A classic example of this field of application is the study of patients’ survival after therapy. Other important fields of application include finance and banking, as well as other areas.

In biostatistics, information about the individuals in the study may be available in the form of covariates (e.g. age and gender). Exploiting this information may improve prediction of lifetimes (a younger patient will generally do better than an older one). Covariates are usually included in statistical models in the form of regression models. Lifetime data analysis contains a rich range of regression models, which will be presented in this book. Many biostatistical texts restrict their attention almost exclusively to a famous regression model introduced by D.R. Cox; this is the model that predominates in statistical practice. However, there is much to be said for using models that are not simply empirical (as Cox’s model is) but are substantive in the sense that they provide a description of the underlying process. The emphasis of this book is on models of this type, in particular the one based on the concept that a lifetime ends when an underlying stochastic process reaches a boundary for the first time.

It is a pleasure to acknowledge my debt to several people who helped to inspire my interest in lifetime data analysis, through their own work, their willingness to discuss matters with me and respond to my queries, and their invitations to participate in conferences and visit their institutions. In this respect, I wish to record my gratitude especially to Mei-Ling Ting Lee, Alex Whitmore and Nikolaos Limnios. In addition, I thank Alan Kimber and Martin Crowder who first got me interested in lifetime data. Finally, Polychronis Economou and Dimitris Stogiannis, first as doctoral students and then as collaborators, have aided my work in many ways.

Chrysseis CARONI

May 2017

1
Introduction to Lifetime Data and Regression Models

1.1. Basics

The principal aim of this book is to present a particular approach to the regression analysis of lifetime data in detail, and to discuss its features and advantages in comparison to older-established approaches. This objective will first involve undertaking a general review of the various regression methods that can be found in the literature.

This opening chapter provides a brief review of the basic features of lifetime data and its modeling. First of all, what is meant by lifetime data? A great deal of statisticians’ activity goes into studying whether or not an event occurs and how various factors influence its occurrence. If we move on from the simple yes/no fact of occurrence to also examining how long it takes until the event occurs, we enter the realm of “time to event” or “lifetime” data. Basic examples include how long a machine operates until it breaks down (the event is the breakdown) and how long a patient lives after undergoing heart transplantation (the event is death). These examples show two major areas of application. One is in engineering and technology (where the subject is usually known as reliability modeling) and the other is in biomedical sciences (known as survival analysis). However, other areas of application include all those in which statistics is used - in other words, in virtually every science. As the two examples of a machine’s breakdown and a patient’s survival suggest, applications of lifetime data analysis can have immense practical importance. Well-known textbooks with wide coverage of lifetime data analysis include Lawless [LAW 03], Collett [COL 14], Kalbfleisch and Prentice [KAL 02] and Klein and Moeschberger [KLE 03]. A brief review was given by Hougaard [HOU 99]. Other papers reviewing the analysis of lifetime data include Kiefer [KIE 88] in economics and Chung et al. [CHU 91] in criminology.

In mathematical terms, lifetime is denoted by T. No two machines are identical or operate under identical conditions; no two people are quite alike. Consequently, we treat T as a random variable, which follows some distribution in the relevant population of machines or people (in general, units). We note that T must be non-negative. Furthermore, in this book, we will follow the vast majority of the literature in treating the time scale as continuous. Consequently, we suppose that Tf(t), t > 0 for some probability density function (pdf) f(.), and hence that image. The functions that present particular interest are the following:

  1. – the survival function image image
  2. – the hazard function h(t) = f(t)/S(t).

The former is P (T > t), the probability of survival for at least time t - the probability that the machine is still operating, or that the patient is still alive, after this time. In engineering and technological applications, this probability is called reliability and the notation R(t) is usually used instead of S(t). The term hazard function is replaced by failure rate. Other terms in use for the same function include force of mortality (in demography) and intensity.

The survival function or reliability P (T > t) is a quantity of basic scientific and practical importance. For example, in medical settings, a patient’s prognosis might be expressed as his or her five-year survival probability, and in manufacturing, reliability is obviously related to how long a guarantee period can be offered for a product. The hazard function can be interpreted as the instantaneous rate of failure at time t, given that the unit has survived that long, and hence the term failure rate. However, it is important to remember that the hazard refers to failure conditionally on survival to that time (the unconditional failure rate is of course given by the pdf of the lifetime distribution). More precisely, the hazard function gives the conditional probability of failure in the next short interval of time (t, t + δt], for a unit that is still functioning at time t:

[1.1]image

Also useful and important is the cumulative hazard, H(t) = image (u)du = − ln S(t) as well as the mean residual life, given by

image

which is the expected lifetime still to come for a unit that has already survived until time t.

The functions f, F, S, h and H are all equivalent, in the sense that knowing any one of them enables all the others to be deduced. Complete details can be found in standard references (e.g. Lawless [LAW 03]), which also present detailed expressions for the more widely used parametric lifetime distributions f(.), such as the exponential, Weibull, gamma, log-normal and others. In the following section, we present the best known parametric lifetime distribution as an example, the Weibull distribution. Some details of another parametric distribution - the inverse Gaussian distribution - can be found in section 2.5 and elsewhere in the text. Some general properties of lifetime distributions are presented briefly by Olkin [OLK 16] and at length by Marshall and Olkin [MAR 07].

Some aspects of these distributions that have major importance in the analysis of lifetime data, such as the hazard function, present little interest in other fields of statistics. Conversely, some properties of distributions that have great general importance do not concern us much in lifetime data analysis. The prime example is the mean of the distribution. This is because most lifetime distributions are highly skew, with a long tail to the right. For a distribution of this shape, the median is usually reported rather than the mean. Furthermore, in practice, the restricted duration of a study may make it difficult to estimate the mean accurately (see comments in section 1.9). However, sometimes a restricted mean survival time (RMST) can be used. By the definition of the mean μ of a distribution, and assuming that it exists, we have

image

Adapting this, we define the RMST up to time t as

image

This can also be written as μ(t) = E [min(T, t)]. The RMST thus represents the expected duration of survival up to time t: in other words, how much of the interval (0, t) an individual will survive, on average. RMSTs can be computed for different values of t and compared between groups of subjects; this will be especially useful if the relation between survival in the groups is not simple (e.g. the first group does better than the second initially, but later on, the second group has the lower hazard). See for example A’hern [AHE 16] and references therein.

Two further general comments about lifetime data must be made. First, it is a characteristic feature of such data that not all the units under study will actually experience the event during the study. Some patients will still be alive when the medical researcher closes the data file for analysis; some machines will still be functioning when the time allotted to the study runs out. The lifetimes of these units are said to be right censored at the times when they were observed. They provide information that must be taken into account in the analysis even though this information takes the different form T > t rather than T = t. This can only be done easily if the censoring process is uninformative about the lifetime (see section 3.1). Other types of censoring as well as the less common phenomenon of truncation are discussed in standard references (see, for example, Lawless [LAW 03, Chapter 2]).

The second additional comment is the observation that a “lifetime” need not correspond to clock time, or even be measured in units of time at all. For a machine, the relevant time may be the time for which it is actually operating, excluding periods when it is turned off or is idle. For a car, the operating “time” would probably be measured better by how many kilometers it has covered rather than by the calendar age of the vehicle, because this will be the more important factor as far as wear and tear is concerned. Sometimes there may be several alternatives: for an aircraft, for example, calendar age, flight hours and number of landings could all be relevant measures of lifetime (see Duchesne and Lawless [DUC 00]). The question of the appropriate time scale is also discussed by Farewell and Cox [FAR 79], Oakes [OAK 95] and Kordonsky and Gertsbakh [KOR 93, KOR 97] as well as others. Later on, we will see cases where overall “time” is a weighted sum of the durations of the periods of time spent in different states (e.g. the movement of an employee through different jobs with varying exposure to health risks).

Finally, we observe that the concept of a non-negative random variable describing the point at which an event occurs can be adapted to cases where the variable is not a time at all, but, for example, the load placed on a structure. The load is increased until the structure fails.

1.2. The classic lifetime distribution: the Weibull distribution

Here, for the purpose of illustration, we provide details of the Weibull distribution (named after the Swede Waloddi Weibull), which is the most widely used parametric model for lifetime data. Empirically, it has been found to fit well to data of many kinds, and in fact, its use with lifetime data can be justified by theoretical arguments (see below).

The pdf of the Weibull distribution in one common parameterization is

image

where α > 0 is the scale parameter and η > 0 is the shape parameter. The special case η = 1 gives the exponential distribution. The survival or reliability function is

image

and therefore the hazard function is

image

The behavior of the hazard function is as follows:

image

This means that the Weibull distribution is quite flexible when it comes to describing lifetime data. However, it is unable to capture various features that are sometimes observed in hazard functions in real life, such as when the hazard increases to a peak and then falls, or when it falls to a minimum and then increases.

Figure 1.1 presents examples of the shapes of the Weibull pdf, survival function and hazard function for various values of the parameters of the distribution. Note that the distribution is skewed to the right, which is a characteristic feature of lifetime distributions.

The expected value and the variance of the lifetime T can be found using the following expression for the r-th moment of the distribution:

image

where image is the gamma function. Setting r = 1 and 2 gives

image

and hence, the variance of the lifetime T is

image
image

Figure 1.1. Plots of the pdf (upper diagram), survival function (middle diagram) and hazard function (lower diagram) of the Weibull distribution for selected values of η, with α = 1

The following alternative parameterization of the Weibull distribution is also often seen in the literature:

image

Countless examples of applications of the Weibull distribution can be found in the literature, which mostly concern the lifetimes or strengths of materials. One application of a different kind is by McDonald et al. [MCD 96] to the lifetimes of a species of bird. By fitting a Weibull distribution to lifetimes and finding that the shape parameter is greater than one, they concluded that mortality rates increase with age - so-called actuarial senescence. This was claimed to be the first demonstration of the phenomenon in an unmanipulated, natural population and thus constituted the first empirical evidence against a long-held assumption that mortality of birds is generally independent of age.

The theoretical argument for the Weibull distribution’s widespread use in practical situations is the following “weakest link” argument. Many of the units that we study can be regarded as being made up of smaller components or parts, and it may be reasonable to suppose that the durability or strength of the whole is equal to the durability or strength of the weakest part, just as a chain is made up of links and the chain’s strength is given by the strength of its weakest link. Given this structure, the distribution of the unit’s lifetime is determined by the distribution of the minimum of the set of random variables that represent the lifetimes of the unit’s components. Statistical theory demonstrates that only certain distributions have the necessary properties to represent such a minimum. One of these extreme value distributions is the Weibull.

The literature contains many lifetime distributions, most of which do not see any practical application. For example, many extensions of the Weibull distribution have been devised (see, for example, Caroni [CAR 14a]). In gaining extra flexibility, these extensions lose appealing properties of the Weibull distribution, such as the extreme value interpretation and the properties of the regression models that will be discussed later on in this chapter.

1.3. Regression models for lifetimes

Although the fact that no two units can be identical means that there will always be a random component in the lifetime, in part it may be possible to predict the lifetime from the factors or covariates that describe the unit or the conditions under which it has been operating. A patient’s prognosis after an operation, for example, is likely to depend to some degree on his or her age, on medical history and on the variables that describe the state of health at the time of the operation. An older patient, in poor condition and with a long history of ill health will be expected to have a shorter time-to-event (death, relapse) than a younger patient who started out in better shape. Car tires would be expected to wear out quicker if the vehicle is often driven off-road.

The concept of introducing the dependence of an outcome variable such as the duration of a lifetime on the values of covariates is familiar from the multiple linear regression model

image

where x = (x0 , x1 , … , xp) is the vector of covariates, with x0 ≡ 1. The standard model takes the distribution of the random error term as image, in which case the model for the dependent variable can be written as

[1.2]image

This expression suggests one way of extending a regression model to situations where it is not reasonable to assume a normally distributed dependent variable: select a more appropriate distribution (e.g. Poisson with parameter μ) and link its parameters in some way to the linear predictor βx formed from the covariates (e.g. ln μ = βx is often an appropriate choice in combination with the Poisson distribution). In this way, we obtain the class of generalized linear models (GLM) in which the mean parameter is related to the linear predictor (see McCullagh and Nelder [MCC 89]). For example, the GLM version of [1.2] when the dependent variable Y is a count of the number of events and therefore might follow the Poisson distribution is

image

In the much wider class of generalized additive models for location, scale and shape (GAMLSS), as many as four parameters of the distribution can depend on covariates (see Rigby and Stasinopoulos [RIG 05]).

The same approach can be taken to distributions that are often used in modeling lifetime data. For example, the inverse Gaussian distribution and the gamma distribution both belong to the exponential family that is modeled in the standard framework of generalized linear models. The inverse Gaussian distribution will be mentioned in this context in section 3.2. However, there are other ways of approaching the matter in the context of lifetime data, which give rise to the general classes of models that will be considered in the following sections.

Parametric lifetime regression models are usually fitted by direct maximization of a likelihood function using numerical methods. Given a sample of n independent observations {(ti, xi, δi), i = 1, … , n}, where unit i with covariates xi has lifetime ti and censoring indicator δi (=1 if ti is an observed failure time, 0 if ti is a right censored observation time), the likelihood is

[1.3]image

where the parameter vector θ includes the regression coefficients. Using the relationships between the probability density function, hazard function and survival function, this likelihood can be written in various alternative forms, if desired. For example, using h(t) = f(t)/S(t) to substitute for f(t), the likelihood can be written in terms of the hazard and survival functions as

image

which may sometimes be convenient.

1.4. Proportional hazards models

To illustrate one of the main approaches to the regression modeling of lifetime data, we begin with the widely used Weibull distribution. Note that this does not fall within the framework of generalized linear models. Its survival function is

image

We introduce the effect of covariates x on the parameters of the model, giving

image

The scale parameter α now depends on x. (This is the usual form of model, although it is possible to allow the parameter η to depend on x instead, or to let both the parameters depend on covariates at the same time. In the latter case, the covariates affecting α and η do not need to be the same. A recent paper by Burke and MacKenzie discusses the general approach where both the parameters depend on the covariates for the Weibull distribution and in general [BUR 16a]. See section 3.2.)

Let image or simply image since the constant α can be absorbed into the exponent. (Once again, this is the usual form of the model, although not the only possibility.) Note that the function image is positive, a restriction that is necessary here.

The hazard function is readily obtained from image as

image

where θ = −ηβ. Now compare the hazard functions of the two units with covariate vectors x1 and x2. Their ratio is

image

which does not depend on time. In other words, the hazard function of one unit remains in constant proportion to the hazard of the other. This is the proportional hazards (PH) model, which applies for any non-negative α(x), not just image A particular version of the PH model - Cox’s semi-parametric PH regression model - has virtually become the standard model for evaluating biomedical lifetime data. This model will be described in section 1.14.

Another example of a model that possesses the PH property is obtained from the Gompertz distribution, which is described most simply by its hazard function

image

which is decreasing in t for ø < 1, increasing for ø > 1 and constant for ø = 1, in which case it reduces to the exponential distribution. The PH model modifies the distribution into another Gompertz distribution with a different value of μ but the same ø (see Hougaard [HOU 99]). The use of the Gompertz distribution is restricted mainly to demography and actuarial science, where it has a long history. It has been used, for example, to describe mortality among adults. In these contexts, ø > 1 (increasing hazard - i.e. mortality - at older ages).

Now consider what happens if ø < 1. In the survival function

image

the term øt tends to zero, therefore the limit of S(t) as t tends to infinity is not zero. In fact, with μ > 0 as before, the limit of S(t) is image where ξ = μ/ ln ø < 0 and hence 0 < S(∞) < 1. This version of the Gompertz distribution is called the negative Gompertz distribution by Marshall and Olkin [MAR 07, Chapter 10]). Because S(∞) < 1, it is an improper distribution or defective distribution. However, this feature is not necessarily a defect as far as using the distribution as a statistical model goes. The existence of a positive probability mass at infinity can be interpreted to mean that the corresponding proportion of the population will never die. This assumption is clearly meaningless in the actuarial study of human mortality, but could possibly be very realistic in a shorter-term study of mortality from a disease after a treatment. In the latter case, those who “never die” (at least, from the disease under study) are those who have been cured of the disease. Thus, the apparent defect becomes an asset of the model in its ability to model data. Examples of the application of the Gompertz distribution that exploit this characteristic include Cantor and Shuster [CAN 92] and Gieser et al. [GIE 98].

This feature will be mentioned (under the name of cured fraction or long-term survivors) quite often in this book, because it is shared by the inverse Gaussian distribution, which, as will appear in due course, is a central topic.

Both the basic examples of a distribution that possess the PH property, the Weibull and the Gompertz distributions, have hazard functions that are monotonic in t. This is not a necessary condition for a PH distribution. If we write the PH property in its general form

image

where h0(t) is a baseline hazard function, it is obvious that if the baseline hazard function h0(t) has a maximum or a minimum at a value t0, then the hazard functions h(t|x) for every x likewise have maxima or minima as the case may be at this same value t0. For example, if the hazard function falls to a minimum and thereafter increases - often claimed to be a realistic form in various situations - then that minimum would have to occur at the same time irrespective of the values of the covariates. This seems unlikely to be true in practice.

Bagdonavičius and Nikulin [BAG 99] proposed an extension of the PH model to the generalized PH model. The hazard function can be written as

image

where r and q are positive functions. Thus, the hazard rate at time t depends not only on the current values of the covariates (as in PH) but also on their history as expressed by the cumulative hazard H(t). One special case is the generalized linear PH model, in which image as usual and

image

so that

image

Thus, the cumulative hazard up to this moment in time is treated as an additional, unknown covariate. This model is examined further by Bagdonavičius et al. [BAG 05].

1.5. Checking the proportional hazards assumption

The theory that was outlined in the preceding paragraphs requires the assumption of PH. If this assumption is inappropriate for the data, then it is meaningless to fit this particular regression model. How can we check that the assumption is appropriate?

The hazard function

image

gives the survival function

image

where H0(t) is the cumulative hazard function corresponding to the baseline hazard function h0(t).

Consequently,

image

which means that the curves

image

for different values of x are simply the horizontally displaced versions of the curve ln H0(t) when plotted against t. Consequently, all the curves image for different xi are parallel to each other.

This observation suggests a simple way of checking for PH:

  1. – compute non-parametric Kaplan-Meier estimates of the survivor function image for selected x;
  2. – plot image against t for each selected x.

If all the lines for the various x are indeed parallel to each other, then the assumption of the proportional hazards is correct. This idea applies to any PH model, but does not tell us which distribution is the appropriate one if we are to carry out a parametric regression. Also, it does not require that the proportionality be expressed by the multiplicative factor image ; any non-negative function g(x) would do.

However, we could plot against the appropriate function of time to help determine the distribution. For example, if the lifetime distribution is thought to be Weibull, then from image, it follows that

image

The plot of image against ln t should give a straight line. It is usually easier to see that straight lines are parallel rather than arbitrary curves.

The weakness of this procedure is that the estimates image will only be satisfactory for this purpose if they are based on sufficiently large numbers of observations; otherwise, the sampling variability will be so large that it might be hard to say whether the curves are parallel or not. This means that there must be a reasonably large number of observations that share the same value of the covariates. For this reason, the method can only be applied if the covariates are few, or by carrying out suitable grouping of values of the covariates.

EXAMPLE 1.1.– Table 1.1 provides McCool’s data on hardened steel specimens tested until failure at four different levels of stress [MCC 80].

Table 1.1. McCool’s data on hardened steel specimens tested until failure at four different levels of stress [MCC 80]

Stress (106 psi) Ordered lifetimes
.87 : 1.67 2.20 2.51 3.00 3.90 4.70 7.53 14.70 27.80 37.40
.99 : 0.80 1.00 1.37 2.25 2.95 3.70 6.07 6.65 7.05 7.37
1.09 : 0.012 0.18 0.20 0.24 0.26 0.32 0.32 0.42 0.44 0.88
1.18 : 0.073 0.098 0.117 0.135 0.175 0.262 0.270 0.350 0.386 0.456

Figure 1.2Figure 1.3