Note: The terms event and failure are used interchangeably in this seminar, as are time to event and failure time.
In this seminar we will be analyzing the data of 500 subjects of the Worcester Heart Attack Study (referred to henceforth as WHAS500, distributed with Hosmer & Lemeshow (2008)). Understanding the mechanics behind survival analysis is aided by facility with the distributions used, which can be derived from the probability density function (pdf) and cumulative distribution function (cdf) of survival times. As an example, we can use the cdf to determine the probability of observing a survival time of up to 100 days. In the graph above we can see that the probability of surviving 200 days or fewer is near 50%.
The survivor function, $S(t)$, describes the probability of surviving past time $t$, or $Pr(Time > t)$. The hazard function, then, describes the relative likelihood of the event occurring at time $t$ ($f(t)$), conditional on the subject's survival up to that time $t$ ($S(t)$). As we have seen before, the hazard appears to be greatest at the beginning of follow-up time and then rapidly declines and finally levels off. Also useful to understand is the cumulative hazard function, which as the name implies, cumulates hazards over time.
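The relationships among these functions can be written compactly. The following are the standard definitions for a continuous survival time; the symbols $f(t)$ and $F(t)$ for the pdf and cdf follow the usual convention and are assumed here rather than taken from the output being discussed.

$$S(t) = Pr(Time > t) = 1 - F(t), \qquad h(t) = \frac{f(t)}{S(t)}, \qquad H(t) = \int_0^t h(u)\,du, \qquad S(t) = e^{-H(t)}$$

Because $H(t)$ can only increase over time, $S(t)$ can only decrease, and $S(0) = 1$ corresponds to $H(0) = 0$.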
Let us again think of the hazard function, $h(t)$, as the rate at which failures occur at time $t$. From these equations we can see that the cumulative hazard function $H(t)$ and the survival function $S(t)$ have a simple monotonic relationship, such that when the survival function is at its maximum at the beginning of analysis time, the cumulative hazard function is at its minimum. We can estimate the cumulative hazard function using proc lifetest, the results of which we send to proc sgplot for plotting. This seminar covers both proc lifetest and proc phreg, and data can be structured in one of two ways for survival analysis.
A second way to structure the data that only proc phreg accepts is the "counting process" style of input that allows multiple rows of data per subject. This structuring allows the modeling of time-varying covariates, or explanatory variables whose values change across follow-up time. Any serious endeavor into data analysis should begin with data exploration, in which the researcher becomes familiar with the distributions and typical values of each variable individually, as well as relationships between pairs or sets of variables. We see in the table above that the typical subject in our dataset is more likely male, 70 years of age, with a BMI of 26.6 and heart rate of 87. Looking at the table of "Product-Limit Survival Estimates" below, for the first interval, from 1 day to just before 2 days, $n_i = 500$, $d_i = 8$, so $\hat S(1) = \frac{500 - 8}{500} = 0.984$. Survival analysis often begins with examination of the overall survival experience through non-parametric methods, such as Kaplan-Meier (product-limit) and life-table estimators of the survival function.
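The product-limit arithmetic above can be sketched in a few lines of Python. This is a minimal illustration of the Kaplan-Meier calculation, not the code SAS runs; the function name `kaplan_meier` and the reconstructed data (8 failures on day 1 among 500 subjects, with placeholder times for everyone else) are assumptions made for the example.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Return (time, n_at_risk, deaths, S_hat) for each distinct failure time.

    times  : follow-up time for each subject
    events : 1 if the subject failed at that time, 0 if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    leaving = Counter(times)          # failures and censored subjects both leave the risk set
    n_at_risk = len(times)
    surv = 1.0
    rows = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d > 0:
            # conditional probability of surviving past t, given survival up to t
            surv *= (n_at_risk - d) / n_at_risk
            rows.append((t, n_at_risk, d, surv))
        n_at_risk -= leaving[t]
    return rows

# Hypothetical reconstruction of the first interval only: 8 failures on day 1,
# the other 492 subjects given a placeholder censored time of 2 days.
times = [1] * 8 + [2] * 492
events = [1] * 8 + [0] * 492
t, n, d, s = kaplan_meier(times, events)[0]
print(t, n, d, round(s, 3))  # 1 500 8 0.984
```

Censored subjects never trigger a drop in the estimate; they only reduce the number at risk for later failure times.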
At a minimum proc lifetest requires specification of a failure time variable, here lenfol, on the time statement.
Without further specification, SAS will assume all times reported are uncensored, true failures.
We also specify the option atrisk on the proc lifetest statement to display the number at risk in our sample at various time points. Above we see the table of Kaplan-Meier estimates of the survival function produced by proc lifetest. From "LENFOL"=368 to 376, we see that there are several records where it appears no events occurred. By default, proc lifetest graphs the Kaplan-Meier estimate, even without the plots= option on the proc lifetest statement, so we could have used the same code from above that produced the table of Kaplan-Meier estimates to generate the graph. However, we would like to add confidence bands and the number at risk to the graph, so we add plots=survival(atrisk cb). The step function form of the survival function is apparent in the graph of the Kaplan-Meier estimate. Because of its simple relationship with the survival function, $S(t)=e^{-H(t)}$, the cumulative hazard function can be used to estimate the survival function. The Nelson-Aalen estimator is requested in SAS through the nelson option on the proc lifetest statement. Researchers are often interested in estimates of survival time at which 50% or 25% of the population have died or failed.
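As a rough illustration of how a cumulative hazard estimate turns into a survival estimate, the sketch below computes the Nelson-Aalen estimate, which adds $d_i / n_i$ at each failure time, and then applies $S(t)=e^{-H(t)}$. The function name `nelson_aalen` and the five-subject dataset are hypothetical; this is not the proc lifetest implementation.

```python
import math
from collections import Counter

def nelson_aalen(times, events):
    """Return (time, H_hat, exp(-H_hat)) for each distinct failure time."""
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    leaving = Counter(times)
    n_at_risk = len(times)
    cum_hazard = 0.0
    rows = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d > 0:
            cum_hazard += d / n_at_risk          # hazard increment at this failure time
            rows.append((t, cum_hazard, math.exp(-cum_hazard)))
        n_at_risk -= leaving[t]
    return rows

# Hypothetical data: failures at 2, 3 and 5; censoring at 3 and 8.
for t, h, s in nelson_aalen([2, 3, 3, 5, 8], [1, 1, 0, 1, 0]):
    print(t, round(h, 3), round(s, 3))
```

In a sample this small the transformed estimate differs visibly from the Kaplan-Meier estimate (for example, about 0.819 versus 0.8 after the first failure); the gap shrinks as the number at risk grows.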
Suppose that you suspect that the survival function is not the same among some of the groups in your study (some groups tend to fail more quickly than others). When provided with a grouping variable in a strata statement in proc lifetest, SAS will produce graphs of the survival function (unless other graphs are requested) stratified by the grouping variable as well as tests of equality of the survival function across strata. In the graph of the Kaplan-Meier estimator stratified by gender below, it appears that females generally have a worse survival experience. In the output we find three Chi-square based tests of the equality of the survival function over strata, which support our suspicion that survival differs between genders.
Whereas with non-parametric methods we are typically studying the survival function, with regression methods we examine the hazard function, $h(t)$. In regression models for survival analysis, we attempt to estimate parameters which describe the relationship between our predictors and the hazard rate.
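For the Cox proportional hazards model fitted with proc phreg in this seminar, that relationship takes the following standard form. The notation ($h_0(t)$ for the baseline hazard, $x_1, \dots, x_p$ for the predictors, $\beta_1, \dots, \beta_p$ for the coefficients) is an assumption introduced here for illustration.

$$h(t \mid x) = h_0(t)\exp(\beta_1 x_1 + \cdots + \beta_p x_p), \qquad \frac{h(t \mid x_1 + 1, x_2, \dots, x_p)}{h(t \mid x_1, x_2, \dots, x_p)} = e^{\beta_1}$$

The second expression shows why an exponentiated coefficient is read as a hazard ratio: a one-unit increase in $x_1$ multiplies the hazard by $e^{\beta_1}$ at every time $t$, whatever the baseline hazard is.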
Cox models are typically fitted by maximum likelihood methods, which estimate the regression parameters that maximize the probability of observing the given set of survival times. The probability of observing subject $j$ fail out of all $R_j$ remaining at-risk subjects, then, is the proportion of the sum total of hazard rates of all $R_j$ subjects that is made up by subject $j$'s hazard rate. We also would like survival curves based on our model, so we add plots=survival to the proc phreg statement, although as we shall see this specification is probably insufficient for what we want. On the model statement, on the left side of the equation, we provide the follow up time variable, lenfol, and the censoring variable, fstat, with all censoring values listed in parentheses. Model Fit Statistics: Displays fit statistics which are typically used for model comparison and selection.
Analysis of Maximum Likelihood Estimates: Displays model coefficients, tests of significance, and the exponentiated coefficients as hazard ratios. When only plots=survival is specified on the proc phreg statement, SAS will produce one graph, a "reference curve" of the survival function at the reference level of all categorical predictors and at the mean of all continuous predictors. In this model, this reference curve is for males at age 69.845947. Usually, we are interested in comparing survival functions between groups, so we will need to provide SAS with some additional instructions to get these graphs. Acquiring more than one curve, whether survival or hazard, after Cox regression in SAS requires use of the baseline statement in conjunction with the creation of a small dataset of covariate values at which to estimate our curves of interest. This expanded dataset can be named and then viewed with the out= option, but obtaining the out= dataset is not at all necessary to generate the survival plots. Both survival and cumulative hazard curves are available using the plots= option on the proc phreg statement, with the keywords survival and cumhaz, respectively. Let's get survival curves (cumulative hazard curves are also available) for males and females at the mean age of 69.845947 in the manner we just described. We request survival plots that are overlaid with the plots(overlay)=(survival) specification on the proc phreg statement.
We also add the rowid= option on the baseline statement, which tells SAS to label the curves on our graph using the variable gender. The survival curve for females is slightly higher than the curve for males, suggesting that the survival experience is possibly slightly better (if significant) for females, after controlling for age. In our previous model we examined the effects of gender and age on the hazard rate of dying after being hospitalized for heart attack.
In the code below we fit a Cox regression model in which we examine the effects of gender, age, BMI, and heart rate on the hazard rate. The questions of interest in survival analysis are questions like: What is the probability that a participant survives 5 years? In the first instance, the participant's observed time is less than the length of the follow-up and in the second, the participant's observed time is equal to the length of the follow-up period.
A small prospective study is run and follows ten participants for the development of myocardial infarction (MI, or heart attack) over a period of 10 years. During the study period, three participants suffer myocardial infarction (MI), one dies, two drop out of the study (for unknown reasons), and four complete the 10-year follow-up without suffering MI. Based on this data, what is the likelihood that a participant will suffer an MI over 10 years?
This is called non-informative censoring and essentially assumes that the participants whose data are censored would have the same distribution of failure times (or times to event) if they were actually observed. Notice here that, once again, three participants suffer MI, one dies, two drop out of the study, and four complete the 10-year follow-up without suffering MI.

In survival analysis we analyze not only the numbers of participants who suffer the event of interest (a dichotomous indicator of event status), but also the times at which the events occur. Time zero, or the time origin, is the time at which participants are considered at-risk for the outcome of interest. In survival analysis, we use information on event status and follow up time to estimate a survival function. The horizontal axis represents time in years, and the vertical axis shows the probability of surviving or the proportion of people surviving. The figure below shows Kaplan-Meier curves for the cumulative risk of dementia among elderly persons who frequently played board games such as chess, checkers, backgammon, or cards at baseline as compared with subjects who rarely played such games. We focus here on two nonparametric methods, which make no assumptions about how the probability that a person develops the event changes over time. One way of summarizing the experiences of the participants is with a life table, or an actuarial table. To construct a life table, we first organize the follow-up times into equally spaced intervals.
For the first interval, 0-4 years: At time 0, the start of the first interval (0-4 years), there are 20 participants alive or at risk. This table uses the actuarial method to construct the follow-up life table where the time is divided into equally spaced intervals.
An issue with the life table approach shown above is that the survival probabilities can change depending on how the intervals are organized, particularly with small samples.
Appropriate use of the Kaplan-Meier approach rests on the assumption that censoring is independent of the likelihood of developing the event of interest and that survival probabilities are comparable in participants who are recruited early and later into the study. In the survival curve shown above, the symbols represent each event time, either a death or a censored time. These estimates of survival probabilities at specific times and the median survival time are point estimates and should be interpreted as such. Some investigators prefer to generate cumulative incidence curves, which show the cumulative probabilities of experiencing the event of interest, as opposed to survival curves. From this figure we can estimate the likelihood that a participant dies by a certain time point. We are often interested in assessing whether there are differences in survival (or cumulative incidence of event) among different groups of participants. The log rank test is a popular test of the null hypothesis of no difference in survival between two or more independent groups. A small clinical trial is run to compare two combination treatments in patients with advanced gastric cancer.
Six participants in the chemotherapy before surgery group die over the course of follow-up as compared to three participants in the chemotherapy after surgery group. The survival probabilities for the chemotherapy after surgery group are higher than the survival probabilities for the chemotherapy before surgery group, suggesting a survival benefit. The observed and expected numbers of events are computed at each event time and summed for each comparison group. To compute the test statistic we need the observed and expected number of events at each event time. To generate the expected numbers of events we organize the data into a life table with rows representing each event time, regardless of the group in which the event occurred.
You can square the z-score from the table below to get the chi-square values shown in the text. This study examined several factors, such as age, gender and BMI, that may influence survival time after heart attack.
That is, for some subjects we do not know when they died after heart attack, but we do know at least how many days they survived.
Thus, each term in the product is the conditional probability of survival beyond time $t_i$, meaning the probability of surviving beyond time $t_i$, given the subject has survived up to time $t_i$. Each row of the table corresponds to an interval of time, beginning at the time in the "LENFOL" column for that row, and ending just before the time in the "LENFOL" column in the first subsequent row that has a different "LENFOL" value. When a subject dies at a particular time point, the step function drops, whereas in between failure times the graph remains flat. SAS will output both Kaplan-Meier estimates of the survival function and Nelson-Aalen estimates of the cumulative hazard function in one table.
In a nutshell, these statistics sum the weighted differences between the observed number of failures and the expected number of failures for each stratum at each timepoint, assuming the same survival function of each stratum.
From the plot we can see that the hazard function indeed appears higher at the beginning of follow-up time and then decreases until it levels off at around 500 days and stays low and mostly constant. Are there differences in survival between groups (e.g., between those assigned to a new versus a standard drug in a clinical trial)? True survival time (sometimes called failure time) is not known because the study ends or because a participant drops out of the study before experiencing the event. The most common is called right censoring and occurs when a participant does not have the event of interest during the study and thus their last observed follow-up time is less than their time to event. Participants are recruited into the study over a period of two years and are followed for up to 10 years.
Three of 10 participants suffer MI over the course of follow-up, but 30% is probably an underestimate of the true percentage as two participants dropped out and might have suffered an MI had they been observed for the full 10 years. The fact that all participants are often not observed over the entire follow-up period makes survival data unique.
Specifically, we assume that censoring is independent or unrelated to the likelihood of developing the event of interest.
However, the events (MIs) occur much earlier, and the drop outs and death occur later in the course of follow-up.
Consider a 20 year prospective study of patient survival following a myocardial infarction. There are a number of popular parametric methods that are used to model survival data, and they differ in terms of the assumptions that are made about the distribution of survival times in the population. Using nonparametric methods, we estimate and plot the survival distribution or the survival curve.
The study involves 20 participants who are 65 years of age and older; they are enrolled over a 5 year period and are followed for up to 24 years until they die, the study ends, or they drop out of the study (lost to follow-up). Life tables are often used in the insurance industry to estimate life expectancy and to set premiums. In the table above we have a maximum follow-up of 24 years, and we consider 5-year intervals (0-4, 5-9, 10-14, 15-19 and 20-24 years).
The proportion surviving past each subsequent interval is computed using principles of conditional probability introduced in the module on Probability. The Kaplan-Meier approach, also called the product-limit approach, is a popular approach which addresses this issue by re-estimating the survival probability each time an event occurs.
When comparing several groups, it is also important that these assumptions are satisfied in each comparison group and that for example, censoring is not more likely in one group than another. At Time=0 (baseline, or the start of the study), all participants are at risk and the survival probability is 1 (or 100%).
From the survival curve, we can also estimate the probability that a participant survives past 10 years by locating 10 years on the X axis and reading up and over to the Y axis. There are formulas to produce standard errors and confidence interval estimates of survival probabilities that can be generated with many statistical computing packages. The Kaplan-Meier survival curve is shown as a solid line, and the 95% confidence limits are shown as dotted lines. Cumulative incidence, or cumulative failure probability, is computed as $1 - S_t$ and can be computed easily from the life table using the Kaplan-Meier approach.
For example, in a clinical trial with a survival outcome, we might be interested in comparing survival between participants receiving a new drug as compared to a placebo (or standard therapy). The test compares the entire survival experience between groups and can be thought of as a test of whether the survival curves are identical (overlapping) or not.
Twenty participants with stage IV gastric cancer who consent to participate in the trial are randomly assigned to receive chemotherapy before surgery or chemotherapy after surgery.

Other participants in each group are followed for varying numbers of months, some to the end of the study at 48 months (in the chemotherapy after surgery group). There are several forms of the test statistic, and they vary in terms of how they are computed.
The log rank statistic has degrees of freedom equal to k-1, where k represents the number of comparison groups.
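The chi-square form of the log rank statistic, which sums $(O_j - E_j)^2 / E_j$ over the comparison groups, can be sketched as follows. At each event time the expected count for a group is the number at risk in that group multiplied by the overall proportion of events at that time. The function name `log_rank_chi_square` and the four-participant dataset are hypothetical, and this version can differ from the variance-based statistic that statistical packages report.

```python
def log_rank_chi_square(times, events, groups):
    """Return (observed, expected, chi_square, df) using sum((O - E)^2 / E)."""
    labels = sorted(set(groups))
    observed = {g: 0.0 for g in labels}
    expected = {g: 0.0 for g in labels}
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in event_times:
        at_risk = {g: 0 for g in labels}
        died = {g: 0 for g in labels}
        for ti, ei, gi in zip(times, events, groups):
            if ti >= t:                     # still under observation at time t
                at_risk[gi] += 1
                if ti == t and ei == 1:
                    died[gi] += 1
        n_t = sum(at_risk.values())         # total at risk at time t
        o_t = sum(died.values())            # total events at time t
        for g in labels:
            observed[g] += died[g]
            expected[g] += at_risk[g] * o_t / n_t
    chi_sq = sum((observed[g] - expected[g]) ** 2 / expected[g] for g in labels)
    return observed, expected, chi_sq, len(labels) - 1

# Hypothetical data: group 1 has events at months 1 and 3;
# group 2 is censored at month 2 and has an event at month 4.
obs, exp, chi_sq, df = log_rank_chi_square(
    times=[1, 3, 2, 4], events=[1, 1, 0, 1], groups=[1, 1, 2, 2]
)
print(obs, exp, chi_sq, df)
```

The statistic is then compared with a chi-square distribution with k-1 degrees of freedom, which is 1 for two groups.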
The table below contains the information needed to conduct the log rank test to compare the survival curves above. Statistical analysis of time to event variables requires different techniques than those described thus far for other types of outcomes because of the unique features of time to event variables. How do certain personal, behavioral or clinical characteristics affect participants' chances of survival? For example, in a study assessing time to relapse in high risk patients, the majority of events (relapses) may occur early in the follow up with very few occurring later. What we know is that the participant's survival time is greater than their last observed follow-up time.
This can occur when a participant drops out before the study ends or when a participant is event free at the end of the observation period.
The graphic below indicates when they enrolled and what subsequently happened to them during the observation period.
In this small example, participant 4 is observed for 4 years and over that period does not have an MI. Should these differences in participants' experiences affect the estimate of the likelihood that a participant suffers an MI over 10 years? In a prospective cohort study evaluating time to incident stroke, investigators may recruit participants who are 55 years of age and older as the risk for stroke prior to that age is very low.
In this study, the outcome is all-cause mortality and the survival function (or survival curve) might be as depicted in the figure below. Some popular distributions include the exponential, Weibull, Gompertz and log-normal distributions [2]. Perhaps the most popular is the exponential distribution, which assumes that a participant's likelihood of suffering the event of interest is independent of how long that person has been event-free. We focus on a particular type of life table used widely in biostatistical analysis called a cohort life table or a follow-up life table. The proportion of participants surviving past 10 years is 84%, and the proportion of participants surviving past 20 years is 68%. Survival curves are estimated for each group, considered separately, using the Kaplan-Meier method and compared statistically using the log rank test.
The primary outcome is death and participants are followed for up to 48 months (4 years) following enrollment into the trial. Using the procedures outlined above, we first construct life tables for each treatment group using the Kaplan-Meier approach.
Group 1 represents the chemotherapy before surgery group, and group 2 represents the chemotherapy after surgery group. Statistical analysis of these variables is called time to event analysis or survival analysis even though the outcome is not always death. On the other hand, in a study of time to death in a community based sample, the majority of events (deaths) may occur later in the follow up. In a prospective cohort study evaluating time to incident cardiovascular disease, investigators may recruit participants who are 35 years of age and older. Time is shown on the X-axis and survival (proportion of people at risk) is shown on the Y-axis.
The follow-up life table summarizes the experiences of participants over a pre-defined follow-up period in a cohort study or in a clinical trial until the time of the event of interest or the end of the study, whichever comes first.
The probability that a participant survives past interval 2 means that they had to survive past interval 1 and through interval 2: S2 = P(survive past interval 2) = P(survive through interval 2)*P(survive past interval 1), or S2 = p2*S1.
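The chaining of interval probabilities can be sketched as follows. The sketch assumes the usual actuarial convention that participants censored during an interval count as at risk for half of it, so the adjusted number at risk is the number entering the interval minus half the number censored. The function name `actuarial_life_table` and the death and censoring counts are illustrative assumptions, chosen only to be consistent with the 20 participants at risk at time 0 and the survival proportions quoted in the text.

```python
def actuarial_life_table(n_start, intervals):
    """Build a follow-up life table.

    n_start   : number at risk at the start of the first interval
    intervals : list of (label, deaths, censored) in time order
    Returns rows of (label, n_at_risk, n_adjusted, q, p, S).
    """
    rows = []
    n_at_risk = n_start
    surv = 1.0
    for label, deaths, censored in intervals:
        n_adjusted = n_at_risk - censored / 2   # average number at risk in the interval
        q = deaths / n_adjusted                 # proportion dying during the interval
        p = 1 - q                               # proportion surviving the interval
        surv *= p                               # S_t = p_t * S_(t-1)
        rows.append((label, n_at_risk, n_adjusted, q, p, surv))
        n_at_risk -= deaths + censored
    return rows

# Illustrative counts (assumed): 2 deaths and 1 censored in years 0-4,
# then 1 death and 2 censored in years 5-9.
for label, n, n_adj, q, p, s in actuarial_life_table(20, [("0-4", 2, 1), ("5-9", 1, 2)]):
    print(label, n, n_adj, round(p, 3), round(s, 3))
```

With these counts the first interval gives a survival proportion of about 0.897 and the second brings the cumulative proportion to about 0.841.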
Note that the calculations using the Kaplan-Meier approach are similar to those using the actuarial life table approach.
The median survival is estimated by locating 0.5 on the Y axis and reading over and down to the X axis. The null hypothesis is that there is no difference in survival between the two groups or that there is no difference between the populations in the probability of death at any point. We multiply these estimates by the number of participants at risk at that time in each of the comparison groups (N1t and N2t for groups 1 and 2 respectively). Additionally, another variable counts the number of events occurring in each interval (either 0 or 1 in Cox regression, same as the censoring variable).
Other nonparametric tests using other weighting schemes are available through the test= option on the strata statement. Instead, we need only assume that whatever the baseline hazard function is, covariate effects multiplicatively shift the hazard function and these multiplicative shifts are constant over time. In each of these studies, a minimum age might be specified as a criterion for inclusion in the study. More details on parametric methods for survival analysis can be found in Hosmer and Lemeshow and Lee and Wang [1,3].
Note that the percentage of participants surviving does not always represent the percentage who are alive (which assumes that the outcome of interest is death). The remaining 11 have fewer than 24 years of follow-up due to enrolling late or loss to follow-up. The probability that a participant survives past 4 years, or past the first interval (using the upper limit of the interval to define the time) is S4 = p4 = 0.897. We present one version here that is linked closely to the chi-square test statistic and compares observed to expected numbers of events at each time point over the follow-up period. The log rank test is a non-parametric test and makes no assumptions about the survival distributions.
As an example, imagine subject 1 in the table above, who died at 2,178 days, was in a treatment group of interest for the first 100 days after hospital admission. The red curve representing the lowest BMI category is truncated on the right because the last person in that group died long before the end of followup time.
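For the subject who died at 2,178 days and was in the treatment group for the first 100 days, the counting process layout would need two rows: one for each interval over which the covariate is constant. The sketch below shows the idea; the function name `split_at_change`, the tuple layout, and the 1/0 coding of the treatment indicator are hypothetical and are not taken from the dataset.

```python
def split_at_change(subject_id, stop_time, died, change_time, before, after):
    """Split one subject's follow-up into (id, start, stop, event, covariate) rows.

    The event indicator is 1 only on the row whose interval ends in the failure.
    """
    if change_time >= stop_time:
        # The covariate never changes during follow-up: a single row suffices.
        return [(subject_id, 0, stop_time, died, before)]
    return [
        (subject_id, 0, change_time, 0, before),          # no event in the first interval
        (subject_id, change_time, stop_time, died, after),
    ]

# Subject 1: in treatment (coded 1) for days 0-100, out of treatment (coded 0) afterwards.
for row in split_at_change(1, 2178, 1, 100, before=1, after=0):
    print(row)
```

Each row then contributes to the analysis only during its own interval, with the covariate value that applied at that time.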
Nonparametric procedures could be invoked except for the fact that there are additional issues. Survival analysis techniques make use of this information in the estimate of the probability of event. Follow up time is measured from time zero (the start of the study or from the point at which the participant is considered to be at risk) until the event occurs, the study ends or the participant is lost, whichever comes first. The calculations of the survival probabilities are detailed in the first few rows of the table. Specifically, complete data (actual time to event data) is not always available on each participant in a study. Here we see the estimated pdf of survival times in the whas500 set, from which all censored observations were removed to aid presentation and explanation. In very large samples the Kaplan-Meier estimator and the transformed Nelson-Aalen (Breslow) estimator will converge. In many studies, participants are enrolled over a period of time (months or years) and the study ends on a specific calendar date.
Patients often enter or are recruited into cohort studies and clinical trials over a period of several calendar months or years. Thus, participants who enroll later are followed for a shorter period than participants who enroll early. Thus, it is important to record the entry time so that the follow up time is accurately measured. For participants who do not suffer the event of interest we measure follow up time which is less than time to event, and these follow up times are censored.