Sankhyā : The Indian Journal of Statistics Special Issue on Statistics in Biology and Health Sciences 2007, Volume 69, Part 3, pp. 494-513 c 2007, Indian Statistical Institute ° Semiparametric Estimation of Hazard Function for Cancer Patients Ivana Horová, Zdeněk Pospı́šil and Jiřı́ Zelinka Masaryk University, Brno, Czech Republic Abstract The main aim of this paper is to model and study the survival of cancer patients. First, a parametric form of hazard function is proposed. This model results from a recent model of cancer cells population dynamics given in the paper Kozusko and Bajzer (Mathematical Biosciences, 2003) and depends on several parameters. The method of estimating such parameters is described as well. On the other hand, the nonparametric methods seem to be appropriate for survival data. Among them, the methods of kernel estimation of hazard functions are very effective ones. But there is a serious difficulty with them, namely the choice of a smoothing parameter. We propose an alternative method for the bandwidth selection based on the aforementioned parametric method. The theory developed is applied to four data sets. AMS (2000) subject classification. Primary 62F99, 62G99, 62P10. Keywords and phrases. Hazard function, kernel estimation, Gompertzian growth, parameter estimation. 1 Introduction The goal of the paper is to analyse survival data on cancer patients by both parametric and nonparametric methods. We propose a parametric form of the hazard function for cancer patients based on a simple and acceptable assumption that the hazard depends on proliferation speed of cancer cells. We also describe a method for parameter estimation. The nonparametric methods seem to be more appropriate for survival analysis because there are rarely sound reasons for choosing a particular parametric model. A very effective nonparametric method is kernel estimation of the hazard function. The properties of these estimates have been Semiparametric estimation of hazard function 495 investigated by many authors with different techniques; see e.g. RamlauHansen (1983), Tanner and Wong (1983), Tanner and Wong (1984), Yandell (1983), Tanner (1983), Uzunogullari and Wang (1992), Müller and Wang (1990a), Müller and Wang (1990b), Müller and Wang (1994), Jiang and Marron (2003). Nevertheless, there is a serious difficulty with kernel method, namely the determination of the smoothing parameter (bandwidth). Some of the proposed methods for optimal bandwidth selection depend on the hazard function being estimated. Therefore, one needs an “initial” estimation of the unknown hazard function and its second derivative, and the parametric method developed can serve this purpose. The organization of the paper is as follows. The next section reviews the notions of survival analysis from probabilistic, deterministic and statistical points of view. The third section introduces a parametric form of the hazard function along with interpretation of parameters and methods for their estimation. The fourth section summarizes a kernel estimate of the hazard function, and the fifth section presents a new method for estimating the optimal bandwidth. The theory developed is applied to four data sets in Section 6. A brief discussion and suggestions for further research conclude the paper in Section 7. 2 Survivor and Hazard Functions Survival analysis involves analysis of survival time or lifetime described by the random variable T , which is interpreted as the time from the beginning of follow-up to the death (or, to any event under consideration like development of a particular symptom, relapse after remission of a disease, see Collett (2003), Hougaard (2001)). Let us denote the cumulative distribution function of T by F , i.e., F (x) = P(T < x). A survival process can also be characterized by the survival function (e.g., Chaubey and Sen, 1996) F̄ = F̄ (x) = P(T ≥ x) = 1 − F (x), which is the probability that an individual survives for a time greater or equal to x, or by the hazard function λ = λ(x) describing the probability that an individual survives at most a short time interval ∆x under the condition that s/he survived till time x, that is expressed by the formula P(x ≤ T < x + ∆x|T ≥ x) = λ(x)∆x. 496 I. Horová, Z. Pospı́šil, J. Zelinka Dividing the last equality by ∆x and taking the limit as ∆x → 0, we obtain λ(x) = − F̄ 0 (x) , F̄ (x) where 0 denotes the derivative of the function indicated. Since F̄ (0) = 1, the survival function can be expressed by the formula ½ Z x ¾ F̄ (x) = exp − λ(t)dt . (2.1) 0 A slightly different approach for a survival process is described in what follows. Let us imagine a cohort of size N0 and suppose that the cohort dies out with the time dependent death rate µ = µ(x), i.e., the size of the cohort N = N (x) at time x evolves according to the initial value problem for ordinary differential equation N 0 = −µ(x)N, Thus N (0) = N0 . ½ Z N (x) = N0 exp − x ¾ µ(t)dt . 0 (2.2) The survival function F̄ can be considered as the quotient of surviving individuals from the cohort F̄ (x) = N (x) . N0 (2.3) Now, comparing the equations (2.1), (2.2) and (2.3), we can see that the hazard function λ corresponds to the death rate µ, i.e., λ(x) = − N 0 (x) . N (x) (2.4) The relationship between survival analysis and population dynamics are discussed in detail in Pospı́šil (2004). Now, let T1 , T2 , . . . , Tn be independent and identically distributed lifetimes with distribution function F . Let C1 , . . . , Cn be independent and identically distributed censoring times with distribution function G, which are usually assumed to be independent of lifetimes. In the random censorship model, we observe (Xi , δi ), i = 1, . . . , n, where Xi = min(Ti , Ci ), and Semiparametric estimation of hazard function 497 δi = 1{Xi =Ti } indicates whether the ith observation is censored or not. It follows that the {Xi } is independent and identically distributed with survival function L̄ satisfying L̄(x) = F̄ (x)Ḡ(x). The likelihood function, which depends on the hazard function λ and on the censoring process is given by the expression ½ Z Xi ¾ n ³ ´δi Y ¡ ¢1−δi ¡ ¢δ L(λ) = λ (Xi ) exp − G(Xi ) i (2.5) λ(t)dt G0 (Xi ) 0 i=1 according to Hougaard (2001). A more convenient form of it is the logarithmic one given by − log L(λ) n µZ X = i=1 0 Xi ¶ X ¶ n µ G(Xi ) 0 δi log 0 λ(t)dt − δi log λ(Xi ) − + log G (Xi ) G (Xi ) i=1 = `(λ) + const, (2.6) where `(λ) and const represent the first and the second sum, respectively. Hence, the dependence on the censoring process is expressed just by an additive constant. Kaplan and Meier (1958) proposed the so-called product-limit estimate of the survival function: Y µ n − j ¶δ(j) ˆ , (2.7) F̄ (x) = n−j+1 X(j) <x where X(j) denotes the jth order statistics of X1 , . . . , Xn , and δ(j) is the corresponding indicator of the censoring status. Let x1 , x2 , . . . , xm be all of the discontinuity points of F̄ˆ such that x0 = 0 < x1 < x2 < · · · < xm ≤ X(n) . Collett (2003) introduced the Kaplan-Meier type estimate of the hazard function P δ(i) λ̂(x) = X(i) =xj n P (xj+1 −xj ) , for x ∈ [xj , xj+1 ), j = 0, 1, . . . , m − 1; 1{X(i) ≥xj } i=1 (2.8) and Nelson (1972) proposed the modified empirical survival function of observation times by n 1 X 1{Xi ≤x} . L̄n (x) = 1 − n+1 i=1 (2.9) 498 I. Horová, Z. Pospı́šil, J. Zelinka 3 A Hazard Function for Cancer Patients We start with an assumption that the risk depends on the changes in the organism rather than on its status, i.e., the hazard is proportional to a rate of proliferation of cancer cells. In particular, denoting by y = y(x), a time dependent size of cancer cell population (number of cells, tumour volume or something similar), we assume λ(x) = βy 0 (x), (3.1) where β denotes the positive rate of proportionality. There are many models of tumour growth; for a recent survey and a reference list, see Araujo and McElwain (2004). Kozusko and Bajzer (2003) showed that under not very restrictive conditions, the classical Gompertzian model of tumour growth represents a consequence of a more sophisticated model of cancer cell proliferation. Another advantage of the Gompertzian model is its tractability. These reasons represent the basis for choosing the Gompertz function to be a model of cancer growth. The function is the solution of the following initial value problem for the ordinary differential equation: y y(0) = y0 . y 0 = −ay log , b The parameters y0 and b denote the initial and the maximal possible size of the population of cancer cells, respectively. The parameter a represents an analogy of intrinsic growth rate appearing in population dynamics models; it can be interpreted as a maximal possible increase of a tumour size in a unit of time. The solution of the initial value problem together with the assumption (3.1) yields the hazard function in the form o n ∗ (3.2) λ(x) = λ(x, a, t∗ , λ∗ ) = λ∗ exp 1 − a(x − t∗ ) − e−a(x−t ) , where µ ¶ 1 b t = log log . a y0 βab λ = , e ∗ ∗ Since all of the parameters a, b, y0 , β are positive and b À y0 , the parameters λ∗ and t∗ are also positive. Straightforward calculations imply 0 = lim λ(x) ≤ lim λ(x) < max {λ(x) : x > 0} = λ∗ = λ (t∗ ) , x→∞ x→0+ Z 0 ∞ λ(t)dt < ∞. Semiparametric estimation of hazard function 499 These formulae can be interpreted in the following way. A majority of deaths does not occur immediately after including a patient into a study. The maximal risk λ∗ comes at some time t∗ after the diagnosis (initial time). All of the patients need not die due to cancer causes (since lim F̄ (x) > 0 according x→∞ to (2.1)). These properties are in conformity with clinical observations. The parameters a, t∗ , λ∗ of the function λ given by (3.2) can be estimated by the maximum likelihood method, i.e., by a maximization of the function (2.5) or, equivalently, by a minimization of the function `(λ) = `(a, t∗ , λ∗ ) (3.3) defined in (2.6). The minimum can be achieved by an iterative method. For doing it, we need initial estimates ã, t˜∗ , λ˜∗ of the corresponding parameters. We start with the Kaplan-Meier type estimate λ̂ of the hazard function, cf. (2.8). We split the interval [0, X(n) ) into several subintervals I1 , . . . , Ip of equal length. Let J be the subinterval with the maximal number of deaths, i.e., X X δi = max δi : j = 1, 2, . . . , p ; Xi ∈J Xi ∈Ij the symbols X(n) , Xi , δi were defined in the Introduction. The empirical Kaplan-Meier type hazard function λ̂ can be approximated over the interval J by the quadratic regression function λ̂(x) ≈ Ax2 + Bx + C. The power expansion of the function (3.2) up to the quadratic term near the maximizing value t∗ is µ ¶ a2 ∗ ∗ 2 λ(x) ≈ λ 1 − (x − t ) . 2 Roughly speaking, these two quadratic functions should coincide on the interval J. Hence, comparison of the coefficients and some algebra yield initial estimates of the parameters, which can be used for starting the iterative minimization of the function (3.3). The actual hazard function can be more complicated. There need not be unique cause of death in a group of patients. For example, cell populations of local recurrence and distant failure may grow with different rates (i.e., different values of the parameter a) or they may have different limit sizes (i.e., different values of the parameter b and, consequently, of the maximal hazard λ∗ ), patients may be subjected to a treatment in different stages of the disease (different y0 ’s, i.e., different times of maximal hazard t∗ ). To deal 500 I. Horová, Z. Pospı́šil, J. Zelinka with such situations, we suppose that the cohort of patients be split up into k sub-cohorts with sizes N1 , N2 ,. . . , Nk , and that each sub-cohort evolves with its specific dynamics: ¾ ½ Z x λj (t)dt , j = 1, 2, . . . , k, Nj (x) = αj N0 exp − 0 Pk where αj > 0, j=1 αj = 1, cf. (2.1), (2.3). Consequently, the survival and the hazard functions for the complete cohort are given by the formulae F̄c (x) = k X j=1 and λc (x) = − N 0 (x) N (x) = ½ Z αj exp − k P j=1 x λj (t)dt 0 ¾ © Rx ª αj λj (x) exp − 0 λj (t)dt k P j=1 (3.4) , (3.5) © Rx ª αj exp − 0 λj (t)dt respectively; λc (x) is called composed hazard function. We suppose that each of the “partial” hazard functions λj is expressed by (3.2), i.e., λj (x) = λ(x, aj , t∗j , λ∗j ). Therefore, λc (x) = λc (x, α1 , a1 , t∗1 , λ∗1 , . . . , αk , ak , t∗k , λ∗k ) (3.6) and k four-tuples of parameters αj , aj , λ∗j , t∗j should be estimated. A splitting-up of patients into sub-cohorts (i.e., estimation of parameters αj ) may be carried out according to some clinical indications, and the parameters for each of the sub-cohorts may be estimated by minimization of the function (3.3). The estimates obtained in this way are used as initial approximations for iterative minimization of the logarithmic likelihood function `c (α1 , a1 , t∗1 , λ∗1 , . . . , αk , ak , t∗k , λ∗k ), (3.7) cf. (2.6). Let us denote the minimizing values of respective parameters by α̂j , âj , t̂∗j , λ̂∗j , j = 1, 2, . . . , k, and λ̂c (x) = λc (x, α̂1 , â1 , t̂∗1 , λ̂∗1 , . . . , α̂k , âk , t̂∗k , λ̂∗k ). (3.8) The described method of parameter estimation was tested on simulated data. The simulated cohort included 250 patients with lifetimes generated Semiparametric estimation of hazard function 501 by using the hazard function (3.5) with k = 2 and parameters shown in Table 1; the censoring times were taken from uniform distribution on (10, 250). One hundred data sets were generated, and for each of them, the described parameter estimation was applied. The procedure yielded results in 99 cases; the minimum of the logarithmic likelihood function was not achieved in only one case. The convergence depends on precision of initial approximation and suitable initial values can be found by a non-algorithmic “manual trialand-error” method, which cannot be implemented for a huge number of data sets. The estimated parameters are compared with their true values in Table 1. We can see that the parameters a2 , t∗2 , λ∗2 for the second cohort lie in the corresponding estimated interquartile range, while a1 , t∗1 , λ∗1 do not. This phenomenon may be caused by the fact that the size of the first cohort is four times less than the one of the second cohort (see the values of α1 and α2 in Table 1) and that the minimization of the likelihood function for the composed hazard function (3.7) may suppress an accuracy of parameter estimation for a minor cohort. Nevertheless, the present method allows to estimate a smooth function approximating the hazard function, which is the main aim of the paper. Table 1. Parameters used for simulations and characteristics of estimated parameters Estimates 4 Parameter Median Q0.25 Q0.75 α1 = 0.2 α2 = 0.8 a1 = 0.02 a2 = 0.06 t∗1 = 70 t∗2 = 100 λ∗1 = 0.1 λ∗2 = 0.005 0.20 0.80 0.004 0.065 528 99 42.3 0.0051 0.19 0.79 0.0011 0.060 256 97 4.3 0.0043 0.21 0.80 0.0046 0.072 587 101 121.1 0.0059 Kernel Estimates of the Hazard Function Let [0, T ], T > 0, be such an interval for which L(T ) < 1, where L is a cumulative distribution function of the Xi ’s. Let λ be a twice continuously differentiable function on [0, T ]. The family of such functions is denoted by C 2 [0, T ]. 502 I. Horová, Z. Pospı́šil, J. Zelinka The usual kernel estimate of the hazard function λ is defined by n 1X K λ̂h,K (x) = h i=1 µ x − X(i) h ¶ δ(i) n−i+1 (4.1) (see Müller and Wang, 1990a), Müller and Wang, 1990b, Müller and Wang, 1994), where K is a kernel and h = h(n) is a sequence of non-random positive numbers; h is called a smoothing parameter or a bandwidth. Usually, K is a real valued function on R, which is symmetric and integrated to unity. A well-known choice for K is the Epanechnikov kernel given by the formula 3 K(x) = (1 − x2 )1[−1,1] . 4 Concerning the choice of the bandwidth, modified cross-validation methods can be applied (see e.g. Marron and Padgett, 1987, Uzunogullari and Wang, 1992, Nielsen and Linton, 1995). The rate of convergence for the cross-validation method is rather slow. This motivates a search for improved methods of bandwidth selection. In Gonzales et al. (1996), methods of bootstrap bandwidth selection have been developed. Other methods can also be found in Sarda and Vieu (1991), Patil (1993a), Patil (1993b), Patil et al. (1994). Our idea of bandwidth selection is related to the idea of plug-in rule studied in Müller and Wang (1990b), Müller and Wang (1994). But our approach is different from previously published approaches because we use only “one stage plug-in”, and the parametric method makes it possible to estimate an optimal bandwidth. We begin with some notation and definitions. Let us denote V (K) = Z1 2 K (x)dx, D2 = 0 −1 Λ= ZT 0 ZT ³ λ(x) dx, β2 = L̄(x) Z1 ´2 λ00 (x) dx, (4.2) x2 K(x)dx. −1 The global quality of the estimate (4.1) can be described by the Mean Integrated Square Error (M ISE), ´ Z ³ M ISE λ̂h,K = 0 T ³ ´2 E λ̂h,K (x) − λ(x) dx, Semiparametric estimation of hazard function 503 where E denotes the expectation. It can be shown (Müller and Wang, 1990a) that the leading term M ISE(λ̂h,K ) of M ISE(λ̂h,K ) takes the form ´ 1 ³ V (K)Λ M ISE λ̂h,K = h4 β22 D2 + . (4.3) 4 nh Under assumptions that h → 0, nh → ∞ as n → ∞ and λ ∈ C 2 [0, T ], p the consistency of the estimate λ̂h,K can be proved, i.e. λ̂h,K → λ on [0, T ] (Müller and Wang, 1990a). Minimizing M ISE(λ̂h,K ), with respect to h, yields what we define as an asymptotically optimal bandwidth hopt . We easily find that µ ¶ ΛV (K) 1/5 . (4.4) hopt = n−1/5 β22 D2 An obvious difficulty of finding such an optimal bandwidth is that hopt depends on the unknowns Λ and D2 . This problem will be addressed in the next section. Nevertheless, some useful conclusions can be drawn. Firstly, the optimal bandwidth will converge to zero as the sample size increases, but at a very slow rate. Secondly, since the term D2 measures, in a sense, the rapidity of fluctuations in the hazard function λ, it can be seen from (4.4) that smaller values of h will be appropriate for more rapidly fluctuating hazard functions. Substituting the value of hopt from (4.4) back into the formula (4.3), we obtain (see, e.g., Horová et al., 2006) ´ 5 ³ ¡ ¢1/5 M ISE λ̂h,K = n−4/5 T (K) D2 Λ4 , (4.5) 4 where the functional T (K) is defined by 2/5 T (K) = β2 V (K)4/5 . For instance, the value of T (K) for the Epanechnikov kernel is equal to p . 5 81/15625 = 0.3491. The formula (4.5) shows the rate of the convergence for the optimal value of the bandwidth. Following Müller and Wang (1990a), the asymptotic (1 − α) confidence interval is given by ( )1/2 ³ λ̂h,K (x)V (K) α´ λ̂h,K (x) ± , (4.6) Φ−1 1 − (1 − Ln (x))hn 2 where Φ is the normal cumulative distribution function. 504 I. Horová, Z. Pospı́šil, J. Zelinka 5 A New Method for Estimating Optimal Bandwidth Let us draw attention to the formula (4.4). If we find suitable estimates of D2 and Λ, we can have an estimator ĥopt of hopt . Since L̄n (x) → L̄(x) in probability, 1/5 ZT 1 V (K) λ(x) h̃opt = n− 5 2 dx (5.1) β2 D2 L̄n (x) 0 is some estimate of hopt . The essence of the proposed method consists in evaluating quantities D2 = ZT 0 ¡ ¢2 λ00 (x) dx, Λn = ZT 0 λ(x) dx L̄n (x) by means of the parametric method investigated in Section 3. Let λ̂ denote a parametric estimate of λ and D̂2 = ZT ³ 0 ´2 λ̂ (x) dx, 00 Λ̂n = ZT 0 λ̂(x) dx, L̄n (x) Λ̂ = ZT λ̂(x)dx. (5.2) 0 Substituting these computed quantities back into formula (5.1), we arrive at the estimate ĥopt of hopt ĥopt = n − 51 Ã V (K)Λ̂n β22 D̂2 !1/5 . (5.3) Since D̂2 → D2 , Λ̂n → Λ, λ̂(x) → λ(x), for x ∈ [0, T ], in probability (see Lehmann and Casella, 1998) as n → ∞, the formula (5.3) is a consistent estimate of hopt . From the definition (2.9) of Ln , it is clear that 1 ≤ L̄n (x) ≤ 1 on [0, T ], n+1 and then Λ̂ ≤ Λ̂n ≤ (n + 1)Λ̂ . Thus (5.3) provides estimates of lower and upper bounds for the set of acceptable bandwidths given by ĥl ≤ ĥopt ≤ ĥu , where Semiparametric estimation of hazard function ĥl = n −1/5 Ã V (K)Λ̂ β22 D̂2 !1/5 , and ĥu = µ n+1 n ¶1/5 Ã V (K)Λ̂ β22 D̂2 !1/5 505 . (5.4) Let the hazard function λ be of the form (3.5) with known parameters and the censoring times originate from the uniform distribution on a known interval. Then, we are able to evaluate D2 , Λ and, consequently, to evaluate hopt by the formula (4.4). This gives the optimal bandwidth for the simulated data described in Section 3 as hopt = 17.41. For each out of the 99 generated data sets with successfully estimated parameters, the estimate ĥopt was evaluated. The median value of the estimates is 16.7, lower and upper quartiles are 15.7 and 17.7, arithmetic and geometric means are 20.3 and 18.3. Hence, the procedure of optimal bandwidth estimation appears to be working well. Remark. When estimating near 0 or T , the boundary effects can occur because the “effective support” [x−h, x+h] of K{(x−u)/h} is not contained in [0, T ]. This can lead to negative estimates of the hazard function near endpoints. In such cases, it may be reasonable to truncate λ̂h,K below at 0, that is to consider λ̂h,K (x) = max(λ̂h,K (x), 0). Similar considerations can be made for the confidence intervals. The boundary effects can be avoided by using kernels with asymmetric supports (Müller and Wang, 1990a). 6 Applications In this section, applications of the procedure described above will be presented. The data we are going to deal with have been kindly provided by the Masaryk Memorial Cancer Institute in Brno, Czech Republic (Soumarová et al., 2002a, Soumarová et al., 2002b, Horová et al., 2004) and by the Radiotherapy Department, Hospital of České Budějovice, Czech Republic (Dolečková et al., 2006). The period from the time origin to the death of a patient is the survival time; survival times for those, who have been alive at the end of study, are right censored. The first data set (BRB) involves 236 patients with breast carcinoma. The study has been carried out based on the records of women, who had received the breast conservative surgical treatment and radiotherapy as well 506 I. Horová, Z. Pospı́šil, J. Zelinka at the Masaryk Memorial Cancer Institute in Brno in the period 1983–1994. The patients with clinical stages I and II breast cancer carcinoma only have been included in this study. Of the complete set of 236 patients, there are 47 (19.9%) deaths from cancer. The second data set (BRCB) involves 152 patients with the same diagnosis and treatment, but patients were treated at the Hospital of České Budějovice in the period 1990 – 1996. Of the complete set of patients, there are 32 (21.1%) deaths from cancer. The third study (USB) is concerned with the retrospective study of 49 patients with uterine sarcoma treated at the Masaryk Memorial Cancer Institute from 1990 till 1999. All of the patients had a surgical treatment, and 21 patients (42.9%) died from cancer. The last data set (UCB) involves 222 patients with uterine carcinoma. These patients were treated during the period 1980–1998 at the Masaryk Memorial Cancer Institute. All of the patients had a surgical treatment and only patients of the first clinical stage have been included in this study. Of the complete set of patients, there are 27 (12.2%) deaths from cancer. The characteristics of treated data sets are summarized in Table 2. Table 2. Characteristics of the data sets; n = number of patients, T = maximal follow up in months, nd = number of deaths, pd = percentage of deaths Set of patients BRB BRCB USB UCB n 236 152 49 222 T 220 172 149 279 nd 47 32 21 27 pd 19.9 21.1 42.9 12.2 The first step of each analysis consists of estimation of parameters for the hazard function λ̂c (3.8) by the method described in Section 3. The data sets were split into sub-cohorts by presence/absence of local recurrence and by the clinical stage of disease. This stratification of data helps to obtain initial values of parameters α1 and α2 for numerical minimization of the likelihood function (3.7); it need not have clinical significance. The parameters of the function (3.2) were estimated for each sub-cohort of different data sets. These estimates serve as initial values for the estimates of parameters of the composed hazard function λc (3.5). All of the computations in this step have been conducted using the R software (R Development, 2003). The estimated parameters for each of the data sets are summarized in Table 3. Semiparametric estimation of hazard function 507 Table 3. Estimated parameters of hazard function λ̂c (3.5) with their standard errors (in brackets) for different data sets Data set BRB BRCB USB UCB Sub-cohort (i) αi ai t∗i λ∗i 1 0.1304 (0.0332) 0.02383 (0.0621) 80.07 (231) 0.1168 (0.64) 2 0.8696 (0.0332) 0.06274 (0.0188) 99.57 (7.11) 0.003313 (0.001) 1 0.9832 (0.0424) 0.02477 (0.0138) 44.14 (15.6) 0.002589 (0.000731) 2 0.0168 (0.0424) 0.05778 ( ) 106.5 ( ) 0.06499 ( ) 1 0.4949 (0.635) 0.1387 (0.0632) 9.673 (4.63) 0.05111 (0.0955) 2 0.1598 (0.386) 0.2958 (0.18) 44.94 (3.31) 0.09520 (0.329) 3 0.3453 (0.574) 0.2752 (0.172) 73.63 (2.84) 0.02493 (0.062) 1 0.5697 (1.71) 0.04587 (0.0169) 24.88 (8.25) 0.003575 (0.0116) 2 0.4303 (1.71) 0.03768 (0.0181) 125.7 (19.4) 0.004030 (0.018) The Hessian matrix for the BRCB set is nearly singular, the minimum of the logarithmic likelihood function is apparently flat, and, consequently, standard errors for the estimates of a2 , t∗2 , λ∗2 could not be evaluated. This is one more illustration of the fact that the estimation of hazard function parameters for a non-abundant cohort is quite a hazardous task. Consider the formula (4.1) with the Epanechnikov kernel. The optimal bandwidth is given by the formula (4.4), and the interval Hn = [hl , hu ] of suitable bandwidths is given by the inequalities (5.4). Simple calculations give V (K) = 0.6, β2 = 0.2. The values D2 , Λn and Λ can be approximated by Z T Z T Z T ¡ 00 ´2 λ̂c (x) λ̂c (x) dx, Λ̂n = D̂2 = dx, and Λ̂ = λ̂c (x)dx (6.1) 0 L̄n (x) 0 0 (cf. (5.2)). All of the integrals and derivatives have been computed using the MAPLE software. The lower bounds hl , upper bounds hu and estimates ĥopt of hopt have been evaluated for each data sets by using formulae (5.3) 508 I. Horová, Z. Pospı́šil, J. Zelinka 3 x 10 3.5 λ 3 2.5 2 1.5 1 0.5 0 60 50 40 30 20 h 10 0 20 40 60 80 100 120 140 160 180 200 220 months Figure 1: The dependence of the estimate λ̂h,K on bandwidth h. Patients set is BRB and hl = 18.0, hu = 53.7, ĥopt = 20.3. and (5.4). These computed values have been used to obtain the estimate λ̂h,K by formula (4.1) and the corresponding 0.95 confidence limits by (4.6); all of the computations have been done using the Matlab software. Figure 1 illustrates the effect of the bandwidth on the shape of the estimate λ̂h,K for BRB data. When x varies in an interval I = [0, T ] of the real line, the family of estimates {λ̂h,K , h ∈ [hl , hu ]} can be considered as a surface in 3D-space. The estimate corresponding to ĥopt is highlighted in the figure. The figure shows a very wide range of smoothing, from nearly the raw data (very wiggly thin lines) to the over-smoothed estimates. This family of smoothed curves models different features of the data visible at different levels of smoothing. We may conclude that the estimate corresponding to ĥopt yields a suitable smooth estimate. The optimal bandwidth lies close to the lower bound because the estimate of the upper bound is rather rough. Semiparametric estimation of hazard function 1 5 509 x 10 3 4.5 0.95 4 3.5 0.9 3 2.5 0.85 2 1.5 0.8 1 0.5 0.75 0 50 100 150 200 0 250 0 50 100 150 200 250 Figure 2: Kaplan-Meier and parametric estimates of survival function (left), parametric λ̂c and kernel estimate λ̂h,K of the hazard function (right). Patients set is BRB, ĥopt = 20.3. 3 1 6 0.95 5 0.9 4 0.85 3 0.8 2 0.75 1 0.7 0 20 40 60 80 100 120 140 160 180 200 0 x 10 0 20 40 60 80 100 120 140 160 180 200 Figure 3: Kaplan-Meier and parametric estimates of survival function (left), parametric λ̂c and kernel estimate λ̂h,K of the hazard function (right). Patients set is BRCB, ĥopt = 41.1. Figures 2–5 show the shapes of the estimates of survival functions and hazard functions for each data set with the following features. (i) In the left panel, the Kaplan-Meier estimate (dashed lines) of the survival function is presented together with the estimate obtained by using the parametric model (solid line) (3.8). 510 I. Horová, Z. Pospı́šil, J. Zelinka (ii) In the right panel, the estimate λ̂h,K of λ with bandwidths ĥopt (dashed line) is presented. In order to compare the kernel estimate with the estimate of the hazard function λ̂c (solid line) derived from the dynamical model, we include both the estimates in one figure. Moreover, it is seen that the parametric function lies within the domain of confidence interval (dotted line) obtained by using (4.6). 1 0.035 0.95 0.03 0.9 0.025 0.85 0.8 0.02 0.75 0.7 0.015 0.65 0.01 0.6 0.005 0.55 0.5 0 50 100 0 150 0 50 100 150 Figure 4: Kaplan-Meier and parametric estimates of survival function (left), parametric λ̂c and kernel estimate λ̂h,K of the hazard function (right). Patients set is USB, ĥopt = 6.5. x 10 3.5 1 3 3 0.95 2.5 0.9 2 1.5 0.85 1 0.8 0.5 0 0.75 0 50 100 150 200 250 300 0 50 100 150 200 250 300 Figure 5: Kaplan-Meier and parametric estimates of survival function (left), parametric λ̂c and kernel estimate λ̂h,K of the hazard function (right). Patients set is UCB, ĥopt = 38.4. Semiparametric estimation of hazard function 7 511 Discussion The estimates of the hazard function obtained by the parametric and the nonparametric methods presented in sections 3, 4 and 5 are close to each other. Namely, the parametric estimate of the hazard function lies within the confidence limits of the kernel-based nonparametric estimate. This observation indicates that the described method may provide a suitable tool to evaluate death risk of cancer patients. Parametric estimation of the hazard function presents an alternative way of estimating an optimal bandwidth for kernel estimates; this has been done in Section 5. The main advantage of this approach is the fact that an optimal bandwidth can be estimated even in the case of a small number of observations. On the other hand, a nonparametric estimate can be a basis for estimation of hazard function parameters; one can search for parameters minimizing a distance between a function estimated by the kernel method and the parametric version of it. This idea suggests a type of iterative method alternating between parametric estimation giving an optimal bandwidth for kernel method and more precise approximation of parameters using the kernel estimate. Biological or clinical relevance of parameters appearing in the functions (3.2) and (3.5) should be a subject of further discussion with oncologists. The parameters may serve as a heuristic tool for classification of various cancer types or for comparison of different treatment methods. A remarkable fact is that the estimates of the parameters a1 and a2 in the data sets BRB and BRCB for patients with the same disease are not very different; thus these parameters may characterize a disease. Acknowledgements. Research of the first and the third authors was partially supported by MŠMT, LC06024. Research of the second author was partially supported by the Grant No. 201/01/0079 of the Grant Agency of the Czech Republic. References Araujo, R.P. and McElwain, D.L.S. (2004). A history of the study of solid tumour growth: the contribution of mathematical modelling. Bull. Mathematical Biology, 66, 1039–1091. Chaubey, Y.P. and Sen, P.K. (1996). On smooth estimation of survival and density functions. Statist. Decisions, 14, 11–22. 512 I. Horová, Z. Pospı́šil, J. Zelinka Collett, D. (2003). Modelling Survival Data in Medical Research. Chapman & Hall/CRC, Boca Raton. Dolečková, M., Horová, I., Budı́ková, M. and Hon, Z. (2006). Breast carcinoma: statistical evaluation (in Czech). Proceedings of XXX. BOD, 41–45. Gonzales-Mantiega, W., Cao, R. and Marron, J.S. (1996). Bootstrap selection of the smoothing parameter in nonparametric hazard rate estimation. J. Amer. Statist. Assoc., 91, 1130–1140. Horová, H., Růžičková, J., Ježková, B., Horová, I., Budı́ková, M. and Dörr, A. (2004). Adjuvant radiotherapy in the treatment of the carcinoma of uterus of the first clinical stage (in Czech, English summary). Česká a slovenská gynekologie, 5, 366–371. Horová, I., Zelinka, J. and Budı́ková, M. (2006). Estimates of hazard functions for carcinoma data sets. Environmetrics, 17, 239–255. Hougaard, P. (2001). Analysis of Multivariate Data. Springer-Verlag, Heidelberg. Jiang, J. and Marron, J.S. (2003). SiZer for censored density and hazard estimation. preprint. Kaplan, E.L. and Meier, P. (1958). Nonparametric estimation from incomplete observations. J. Amer. Statist. Assoc., 53, 457–481. Kozusko, F and Bajzer, Ž. (2003). Combining Gompertzian growth and cell population dynamics. Mathematical Biosciences, 185, 153–167. Lehmann, E.L. and Casella, G. (1998). Theory of Point Estimation. Springer, New York. Marron, J.S. and Padgett, W.J. (1987). Asymptotically optimal bandwidth selection for kernel density estimators from randomly right-censored samples. Ann. Statist., 15, 1520–1535. Müller, H.G. and Wang, J.L. (1990a). Nonparametric analysis of changes in hazard rates for censored survival data: an alternative to change-point models. Biometrika, 77, 305–314. Müller, H.G. and Wang, J.L. (1990b). Nonparametric locally adaptive hazard smoothing. Probab. Theory Related Fields, 85, 523–538. Müller, H.G. and Wang, J.L. (1994). Hazard rate estimation under random censoring with varying kernels and bandwidths. Biometrics, 50, 61–76. Nelson, W. (1972). Theory and applications of hazard plotting for censored data. Technometrics, 14, 945–966. Nielsen, J.P. and Linton, O.B. (1995). Kernel estimation in a nonparametric marker dependent hazard model. Ann. Statist., 23, 1735–1748. Patil, P.N. (1993). Bandwidth choice for nonparametric hazard rate estimation. J. Statist. Plann. Inference, 35, 15–30. Patil, P.N. (1993). On the least squares cross-validation bandwidth in hazard rate estimation. Ann. Statist., 21, 1792–1810. Patil, P.N., Wells, M.T. and Marron, J.S. (1994). Some heuristics of kernel based estimators of ratio functions. J. Nonparametr. Statist., 4, 203–209. Pospı́šil, Z. (2004). Hazard function: a deterministic point of view. Summer School Datastat03, Proceedings, Folia Fac. Sci. Nat. Univ. Masaryk. Brunensis, Mathematica 15, I. Horová, ed., Masaryk University, Brno, 315–324. Semiparametric estimation of hazard function 513 R Development Core Team. (2003) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, http://www.R-project.org/ . Ramlau-Hansen, H. (1983). Counting processes intensities by means of kernel functions. Ann. Statist., 11, 453–466. Sarda, P. and Vieu, P. (1991). Smoothing parameter selection in hazard estimation. Statist. Probab. Lett., 11, 429–434. Soumarová, R., Horová, H., Růžičková, J., Čoupek, P., Šlampa, P., Šeneklová, Z., Petráková, K., Budı́ková, M. and Horová, I. (2002). Local and distant failure in patients with stage I and II carcinoma of the breast treated with breastconserving surgery and radiation therapy (in Czech, English summary). Radiačnı́ onkologie, 2, 17–24. Soumarová, R., Horová, H., Šeneklová, Z., Růžičková, J., Horová, I., Budı́ková, M., Šlampa, P. and Kalábová, R. (2002). Treatment of uterine sarcoma. Arch. Gynecol. Obstetrics, 266, 92–95. Tanner, M.A. (1983). A note on the variable kernel estimator of the hazard function from randomly censored data. Ann. Statist., 11, 994–998. Tanner, M.A. and Wong, W.H. (1983). The estimation of the hazard function from randomly censored data by the kernel method. Ann. Statist., 11, 989–993. Tanner, M.A. and Wong, W.H. (1984). Data-based nonparametric estimation of the hazard function with applications to model diagnostics and exploratory analysis. J. Amer. Statist. Assoc., 79, 174–182. Uzunogullari, U. and Wang, J.L. (1992) A comparison of hazard rate estimators for left truncated and right censored data. Biometrika, 79, 297–310. Yandell, B.S. (1983). Nonparametric inference for rates with censored survival data. Ann. Statist., 11, 1119–1135. Ivana Horová, Zdeněk Pospı́šil and Jiřı́ Zelinka Department of Mathematics and Statistics Faculty of Science Masaryk University Kotlářská, CZ 611 37 Brno, Czech Republic E-mail: [email protected] [email protected] [email protected] Paper received April 2007; revised April 2008.
© Copyright 2024