Statistical Analysis of data from Breast Cancer patients MSc Dissertation By Thomas Hunter Smith School of Mathematical Sciences University of Nottingham I have read and understood the School and University Guidelines on plagiarism. I confirm that this work is my own, apart from acknowledged references Abstract This report uses Survival Analysis techniques to analyse data on Breast Cancer patients and attempts to determine factors which are associated with survival. Using the Nottingham Prognostic Index as a guideline and applying Kaplan-Meier, Log-Rank and Cox Proportional Hazard Models; four main prognostic factors were identified to be directly related to survival. Contents 1 Introduction 1.1 Background . . . . . . . . . . . . . . . . . . . . . . 1.1.1 How did this particular project come about? 1.2 Methodology . . . . . . . . . . . . . . . . . . . . . 1.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . 1.4 Breast Cancer and the NPI . . . . . . . . . . . . . 1.5 How it works..... . . . . . . . . . . . . . . . . . . . . 1.6 Geography of Breast Cancer . . . . . . . . . . . . . 1.7 The Evolution of Statistics in Medicine . . . . . . . 1.8 The Nottingham Prognostic Index . . . . . . . . . . 2 Data and Variables 2.1 Introduction to the Data . . . . . . . . . 2.1.1 A note on computing..... . . . . . 2.2 Missing Data . . . . . . . . . . . . . . . 2.2.1 Still some missing data though!! . 2.2.2 Complete Case Analysis . . . . . 2.2.3 Single Imputation . . . . . . . . . 2.2.4 Multiple Imputation . . . . . . . 2.2.5 Maximum Likelihood Estimation 2.2.6 Dealing with the missing data . . 2.3 Exploratory Analysis . . . . . . . . . . . 2.3.1 Relationships between Variables . 2.3.2 External Factors . . . . . . . . . 2.3.3 Method of Detection . . . . . . . 2.4 Summary of this Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 6 6 7 7 8 9 10 10 12 . . . . . . . . . . . . . . 19 19 20 20 21 22 22 23 24 25 29 37 38 41 43 3 Survival Analysis 44 3.1 Survival Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.2 The Hazard Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 3.3 Applications of Survival Analysis . . . . . . . . . . . . . . . . . . . . . . . 47 1 3.4 3.5 3.6 3.7 Censoring . . . . . . . . . . . . . Purpose and Presumption . . . . Non-parametric Survival Analysis 3.6.1 Kaplan-Meier Estimates . 3.6.2 Greenwoods formula . . . 3.6.3 Life Table Method . . . . 3.6.4 Nelson-Aalen Estimators . Log-Rank Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 51 51 52 55 61 61 62 . . . . . . . 66 67 69 70 70 78 78 78 Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 82 83 85 4 Cox Proportional Hazard Models 4.0.1 Log Likelihood test . . . . . . . . . 4.0.2 Other Methods of Model Checking 4.1 Model Building using R . . . . . . . . . . 4.1.1 Model Building . . . . . . . . . . . 4.2 Index . . . . . . . . . . . . . . . . . . . . . 4.3 Model Checking . . . . . . . . . . . . . . . 4.3.1 Schoenfeld Residuals . . . . . . . . 5 Summary, Conclusion & Further 5.1 Summary and Conclusion . . . 5.2 Further Research . . . . . . . . 5.3 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Appendices 86 A Some further Interesting findings 87 2 List of Figures 1.1 Lymphatic System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 2.10 2.11 2.12 2.13 2.14 2.15 2.16 2.17 Lymphatic System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Estimated Survival Curves of Chemo-status amongst patients . . . . . Scatter Plot of NPI against Survival . . . . . . . . . . . . . . . . . . . . Scatter Plot of NPI against Survival with Fitted Survival Lines . . . . . Scatter Plot of NPI for patients dying from Breast Cancer . . . . . . . Scatter Plot of NPI for patients leaving the Study for Other Causes . . Distribution of Survival for Patients Dying from Breast Cancer . . . . . Distribution of Survival for Patients leaving the Study for Other Causes Distribution of Survival for Patients still Alive at the end of the Study Distribution of NPI scores across Centres (1-6) . . . . . . . . . . . . . . Distribution of Tumour Size for Centre 1 . . . . . . . . . . . . . . . . . Percentages of different Types of Tumour found in Centre 1 . . . . . . Scatterplot of Tumour Size against Survival with fitted linear model . . Histogram of age of Patients at 1st Operation . . . . . . . . . . . . . . Histogram of age of Patients at 1st Operation . . . . . . . . . . . . . . Distribution of Tumour Size for patients detected by screening . . . . . Distribution of Tumour Size for patients not detected by screening . . . 3.1 3.2 3.3 Kaplan-Meier Curve computed from Table 3.6.1 . . . . . . . . . . . . . . Kaplan-Meier Curves of Uncategorised NPI Scores . . . . . . . . . . . . Kaplan-Meier Curves of NPI Groups with 5-year and 10-year Survival Times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kaplan-Meier Curves of Tumour Size (cm) categorised into 3 groups . . . Kaplan-Meier Curves of different types of Tumour . . . . . . . . . . . . . Kaplan-Meier Curves for different age groups . . . . . . . . . . . . . . . . Kaplan-Meier Curves for Endocrine Therapy . . . . . . . . . . . . . . . Kaplan-Meier Curves for Menopausal Status . . . . . . . . . . . . . . . . 3.4 3.5 3.6 3.7 3.8 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 28 30 32 33 34 34 35 35 36 37 38 39 40 41 42 42 . 54 . 56 . . . . . . 57 58 59 60 62 64 4.1 4.2 4.3 Kaplan-Meier Curves of Histological Grades of Tumour, as well as of the three factors involved in calculating the grade: Pleomorphism P, MitosisM and Tubule Formation T . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 Estimated Survival curves for 7 groups of new variable Total=T+P+M . . 72 Plots of scaled Schoenfeld residuals against transformed time for each covariate in the final CPH model . . . . . . . . . . . . . . . . . . . . . . . . . 80 A.1 Histogram of Survival for patients dying within 10 years of prognosis with predicted survival of 85% . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 A.2 Kaplan-Meier Curves of NPI with 95% CI . . . . . . . . . . . . . . . . . . 89 A.3 Kaplan-Meier estimates of Estrogen Receptor Status . . . . . . . . . . . . 90 4 List of Tables 1.1 1.2 NPI Scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 Co-variates and their characteristics . . . . . . . . . . . . . . . . . . . . . . 18 2.1 2.2 2.3 Overview of the Breast Cancer Centres . . . . . . . . . . . . . . . . . . . . 21 Different Combinations of NPI . . . . . . . . . . . . . . . . . . . . . . . . . 29 NPI 5-Year Scores against actual Scores from Centre 1 . . . . . . . . . . . 30 3.1 3.2 Censoring Percentages across all centres . . . . . . . . . . . . . . . . . . . 51 Computation of Kaplan-Meier curve from section of Survival . . . . . . . . 53 . 5 Chapter 1 Introduction ”The Nottingham Prognostic Index (NPI) is an accepted prognostication tool in the management of breast Cancers. The latest overall 5-year and 10-year Breast Cancer survival has been reported to be 85% and 77%”, - BreastCancerResearch.com 1.1 Background The aim of this report was to develop a thorough understanding of the branch of statistics known as Survival Analysis and its application in Primary breast Cancer research. This would be approached by evaluating the creation, function and application of the Nottingham Prognostic Index as well as applying methods used in Survival Analysis to a set of data on Breast Cancer patients in an attempt to identify factors associated with survival. 1.1.1 How did this particular project come about? Having not been particularly inspired by any of the dissertation topics kindly provided by the University, I took it upon myself to seek an alternative subject. My passion for medical research prompted me to look within the NHS. I managed to make contact with Dr. Chan, a consultant at the Oncology Department at Nottingham City Hospital and Professor Graham Ball, head Biostatistician at the John Van Geest Cancer research centre, Nottingham. I was lucky enough to come to an arrangement which would give me the chance to explore new statistical techniques on a real data set with information on 16,965 Breast Cancer patients. I hope to take full advantage of this great opportunity and deliver a final report which will include a comprehensive analysis of the data, and perhaps include findings which may be of use in future Breast Cancer research. 6 1.2 Methodology One of the first decisions made was to cut down the analysis to just one section of the data which contained information on 2,203 patients. A comprehensive analysis of the full 16,965 patients with the fluctuating inclusion and exclusion of certain co-variates and data was unrealistic given the time frame of the report. Several different methods were considered when dealing with the missing data, Little & Rubin (2002), including imputation and complete-case analysis - with the final consensus being that single imputation is the most appropriate approach, implemented by an algorithm constructed in statistial software package, R. The analysis of histograms was used to investigate distributions of co-variates, and other visualisation techniques were used to examine the behaviour of the data. Kaplan-Meier curves were used to compare survival times amongst the different outcomes of a variable and the existence of significant differences between groups was tested using log-rank. Proportional Hazard Models, Cox & Oakes (1982), were used to identify which factors are significant in predicting a patients survival. Model checking techniques included the Log-Liklihood ratio test, Kendall (1973), and the use of residuals, Schoenfeld (1981) to identify violation of Proportional Hazards. 1.3 Results Following the analysis, four prognostic factors remained significant in determining survival: Tumour Size, Histological Grade, Lymph Nodes (all of which make up the original prognostic index) and Lymphovascular Invasion (LVI). Lymph Node involvement was deemed to be the most important factor in determining prognosis for Breast Cancer. Despite prior thoughts to the contrary, this report found that a patients menopausal status remains an insignificant factor, as was proven through both the log-rank test and Proportional Hazard models. The same was also true for age, although a re-categorised age variable hinted there is evidence to suggest that these factors are not entirely insignificant in determining prognosis. Time dependent factors such as recurrence and metastasis were influential in determining survival, but were not considered while generating a Prognostic Index using Proportional Hazard models, as they are unlikely to be present as prognostic factors at the time of diagnosis. The same can be said of therapy variables. The presence of a certain type of therapy on an individual is dictated by the biology of the tumour, thus a patient’s survival chances must also be dictated through the biology of tumour. Rather this than a patient’s prognosis being based on a set of information which is based on another set of information, potentially leading to bias, loss of information and confusion. It appears that the methods which are used to measure tumour size aren’t always fully accurate. While measurements were being made to the nearest millimetre, an unusual number of measurements were rounded to the nearest 0.5 of a centimetre. While this 7 rounding is for the most part compensated by the rescaling of tumour size in the Nottingham Prognostic Index, it did not fully remove the inevitability that on occasion there will be rounding which will effect the prognostic index in a way that will alter the predicted survival of a patient. Amongst the most significant findings was that a patient was up to 20% more likely to survive from Breast Cancer if they have regular screening than if they don’t. The 10 different types of tumour which existed, within the set of patients analysed, did not prove to be an important factor in survival, although a patient with Tumour Type 1 seemed to have smaller survival % than a patient with one of the other nine types. 1.4 Breast Cancer and the NPI ”It is already the top Cancer in women both in the developed and the developing world,and in many countries, it’s becoming more common.”, World Health Organization, 2011 Breast Cancer makes up almost 1/5 of all Cancers found in women. In 2010, in the UK alone, over 11,000 women lost their lives as a result of invasive Breast Cancer. As such, the UK government continues to invest millions of pounds each year to aid and contribute to various different areas of research, which share the common goal of prolonging a diagnosed patient’s life for as long, and as pain free, as possible. Statistics is one of these areas. • In 2010, 49,564 women and 397 men in the UK were diagnosed with invasive Breast Cancer. • In 2010 there were 11,633 deaths from invasive Breast Cancer in the UK. • The most common symptom of Breast Cancer in men is a hard, painless lump that develops on one of the breasts. • In 2005-2009, 85 percent of women in England survived their invasive Breast Cancer for five years or more. • Breast Cancer is often thought of as only affecting women. Men can also develop it, although it’s much less common. It affects just one in every 100,000 men in England. • In 2010, 5,765 women and 26 men in the UK were diagnosed with non-invasive Breast Cancer. 8 1.5 How it works..... Everybody is well aware of Cancer and the devastating effects it can have on people and their lives, but surprisingly little people know HOW the Cancer can become fatal. In tumours that grow inside of some vital organs, the very presence of the tumour itself can interfere enough with the organ’s function that it eventually imparts lethal effects on that organ. A good example of this would be a brain tumour. A brain tumour causes tremendous amounts of localised swelling and it doesn’t have to be a large thing in order to begin to crush vital brain tissue, leading to coma and death. Tumours of the lung can plug off airways in a manner that leads to respiratory compromise, failure, and then death. Cancers of the colon can obstruct the intestine or perforate, and both of these problems can be fatal to the patient. Some tumours that arise from hormone secreting tissue such as the adrenal glands or thyroid can actually cause such a great release of hormones that it overwhelms the patient, leading to a fatal state. Tumour growth requires a tremendous amount of bodily energy expenditure. This is fast growing tissue and, as a result, uses up all of the normal energy stores and building blocks for healthy tissue such as protein. Cancer risk increases with age: Most Cancers are caused by DNA damage that accumulates over a person’s lifetime. Cancers that are directly caused by specific genetic faults inherited from a parent are rare. But we all have subtle variations in our genes that may increase or reduce our risk of cancer by a small amount - World Health Organisation Breast Cancer, isn’t going to cause death due to the presence of the tumour in the breast because the breast isn’t a vital organ. Breast Cancer can mestastasize to other organs such as the liver and cause enough local destruction there to kill by way of organ damage, but even this is rare. Breast Cancer most often kills by spreading to many sites in the body and creating enough tumour burden so that the patient is physiologically overwhelmed, and they become cachectic and are susceptible to secondary illnesses such as infections which ultimately lead to their demise. As the total mass of cancerous tissue spreads throughout and the body accumulates multiple kilograms, it doesn’t matter if it’s interfering with specific organ functions any more. The tumour burden itself starts to make the patient ill. They begin to waste away. At that point, it’s a lot like having AIDS or other such devastating diseases. The body becomes unable to combat even the most minor illnesses it becomes exposed to. In these cases, patients succumb to secondary processes like pneumonia or other infections. Not an experience that I would wish upon anybody. So there are several different ways in which a Breast Cancer can develop and harm the body. 9 1.6 Geography of Breast Cancer There are surprisingly large differences in the incidence of Breast Cancer across different parts of the world. Belgium has the highest rates at 109 out of 100,000 people (0.109%), with other developed countries such as Australia with a 0.084% incidence rate. Compare these to developing countries such as India (0.023%) and Mongolia which has as little as 8 people per 100,000 diagnosed. No one has an answer to why this is the case one train of thought is that women have children and begin breast feeding at an earlier age in developing countries which is thought to reduce a womens chances of developing breast cancer. But this still does not explain relatively large variations amongst countries where the average age a women gives birth are very similar; Lifestyle, inheritance patterns, there are many potential reasons. On the other hand, recovery rates are higher in developed countries. ”Breast Cancer survival rates [range] from 80% or over in North America, Sweden and Japan to around 60% in middle-income countries and below 40% in lowincome countries,” according to the WHO. The aim of this report is to draw upon the medical data taken from 16,000 patients diagnosed with breast cancer and, through the use of Survival Analysis and Proportional Hazard models, construct a model which will accurately model a patients 5-year survival score. This is created with the intention of providing the patient with as much accurate information as possible regarding their Cancer, to help them decide upon the best course of action for themselves. There does, of course, already exist a model which does exactly this. 1.7 The Evolution of Statistics in Medicine It wasnt until around the 17th century that statistics was used in medicine. The idea of statistical analysis itself was a relatively fresh and emerging concept. And indeed its use in medicine would continuously breathe new life into medical research up right up until the present day. Prior to this very successful marriage, both concepts had lived very different lives. Statistics had only really been around for a couple hundred years, while medicine itself has been a primal instinct of humans ever since the dawn of civilisation. Prehistoric man treated illness as a spiritual event, and would perform a whole manner of ritualistic methods to try and remove these demons. Popular techniques included Trephination (piercing holes on the skull to ventilate these spirits) and the application of herbal medicines. For some reason the former of these has fallen out of favour with most modern healthcare systems, while the latter has proved more popular, and is still widely used in areas such as Homeopathy. The evolution of medical practises was specific to the part of world it orginated, with Europe, China, India and Native America all having their own remedies to medical issues. It was not until the introduction of Medical Schools that medicine really began to develop as a science, the first one of these is claimed to have been in 10 Salerno, Italy in the 13th Century. This allowed brilliant minds to work together, and the renaissance led to revolutionary changes in the theory of medicine. In the fteenth century, Vesalius repudiated Galens incorrect anatomical theories and Paracelsus advocated the use of chemicals instead of vegetable medicines. From the sixteenth-century development of the microscope by Janssen and Galileo to the seventeenth century theory of the circulation of blood (Harvey), scientists learned about the actual functioning of the human body. The eighteenth century saw the development of modern medicines with the isolation of foxglove (digitalis) by Withering, the use of inoculation (against smallpox) by Jenner, and the postulation of the existence of vitamins (vitamin C, antiscorbutic factor) by Lind. Our knowledge (or indeed the invention and development) of Statistics on the other hand, endured a far slower and less eventful growth spiral. Of course medical care has always been an important part of our existence, but up until a few hundred years ago no-one would have thought that mathematical theory could be used to help sustain and improve human life. One of the birth places of statistics and probability was said to have been born around 300AD when ancient greeks wanted to describe the outcome of a game of dice between Zeus, Hades and Poseidon. Its theory developed through to the 13th century with the book of Abacus by Fibonacci and the 15th century with Luca Pacccioli with his basic principles of probability. Cardano, Galileo, Pascal, Fermat, Bernoulli (and his nephew) and Gataker all had significant inputs. These ideas were further refined by scientists including Huygens (1657), Leibniz (1662), and Englishman John Graunt (1660) wrote further on norms of statistics, including the relation of personal choice and judgment to statistical probability. Graunt in particular provided some of the first evidence of Statistics being employed in medicine when he categorized the cause of death of the London populace using statistical sampling. James Lind, a Royal Navy surgeon, carried out the rst recorded clinical trial in 1747. In looking for a cure for scurvy, he fed sailors aficted with scurvy six different treatments and determined that a factor in limes and oranges (subsequently found to be vitamin C) cured the disease while other foods did not. His study was not blinded, but as a result (although not for 40 years)limes were stocked on all ships of the Royal Navy, and scurvy among sailors (limeys) became a problem of the past. By the 19th century medical research received a boom with the fast development of technology to aid the understanding of the human anatomy. This gave birth to Case studies which would use statistics to prove whether or not a treatment was beneficial. Most research done before the twentieth century was more anecdotal than systematic, consisting of descriptions of patients or pathological ndings. Many if not most of the Statistical tests which we are familiar with today were established in the 20th century by Sir Ronald Fisher, Archie Cochrane and Austin Bradford Hill, the first of whom is considered by many to be the father of modern statistics. Epidemiological studies and statistical modelling became increasingly popular, and lead to continued advancements 11 such as accelerated time failure and proportional hazard models. Statistics developed for the purposes of medical research became commonplace across medicine, including of course, Cancer research. The publication by J. Mayer (2002) has a particularly fascinating take on the evolution of medicine and statistics. 1.8 The Nottingham Prognostic Index The Nottingham Prognostic Index (NPI) is used to determine prognosis following surgery for Breast Cancer. Its value is calculated using three pathological criteria: the size of the lesion; the number of involved lymph nodes; and the grade of the tumour. The formula is shown in(1.1) NPI = N + G + {0.2 × S} (1.1) Where: • S is the size of the index lesion in centimetres (ie the Size of the Tumour) • N is the number of lymph nodes involved: 0 = 1, 1 − 3 = 2, > 3 = 3 • G is the grade of tumour: Grade I =1, Grade II =2, Grade III =3 The resulting scores from this formula are then divided up into scores which would determine a patients predicted 5-year survival score. These scores are displayed in Table 1.8 These scores would be taken in to account when determining the patients prognosis, and Score 5-Year Survival (%) ≥ 2.0 to ≤ 2.4 > 2.4 to ≤ 3.4 > 3.4 to ≤ 5.4 > 5.4 93 85 70 50 Table 1.1: NPI Scores provide them with accurate information on their cancer, and their survival prospects. The Nottingham Prognostic Index was originally derived from clinico-pathological correlation of a large series of symptomatic Breast Cancer patients. It is likely that it will need to be revised in the near future to allow its effective application to screening patients. - Breast Pathology on the Web 12 4 NPI 5 6 7 Figure 1.1: Lymphatic System ● ●● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ●●●●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●●● ●● ● ●● ●●● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ●●● ●● ● ● ● ● ● ● ● ●● ●●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ●● ● ● ● ● ● ● ● ●● ● ●●● ●● ●●● ●● ● ● ● ●● ●●●● ● ●● ● ● ● ● ● ● ●● ● ●●●●●● ●● ●●●●● ● ● ●● ● ● ●●● ●● ● ● ● ●●● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ●●● ● ● ● ●●●●● ●●●●● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ●● ● ●●●● ● ●● ●●● ●●●● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ●● ●●● ● ●● ●●● ● ● ●● ●● ●●●●● ● ●● ● ● ● ● ● ●● ● ● ●● ●● ● ●●● ●● ● ●●●●● ●● ● ● ●●● ● ● ● ● ● ● ● ● ●● ●●●● ●● ●● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● 3 ● ● ●●●● ●● ● ● ● ● ● ● ● 2 ● 0 ● ● ● ● ● ● ● ● ●● 50 ● ● ●● ● ●● ● ● ● ● ●● ● ● ●●● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ●●● ● ● ● ● ●● ● ● ● ● ● 100 150 ●● ● ● 200 Survival (months) Originally formulated over 30 years ago, many Cancer centres across the world have used the above 5-year survival score derived from the NPI as one of their primary sources to determine prognosis following surgery. While this index has proved very successful over the years in terms of pairing up a patient to the correct surgery, scientists and statisticians are unanimous in their opinion that there is plenty of room for improvement. An example of such an updated version would be the development of NPI+; a new updated version of the old index. One fact which is widely accepted across all prognostic index’s is that the most important factor in determining prognosis is the patients Lymph Node Status. This status is based on the number of nodes from the lymphatic system involved in the tumour (pictured) While this revised index is considered by most to be a more proficient at prescribing the correct surgery, there is, as always, continued room for improvement. While conducting research for this report, many online forums were encountered regarding the correct prognostic index, with many sufferers received contrasting answers from different places. It seems that there are still large variations in the methods used to determine prognosis out there, even with in the UK. Unfortunately, no unique and widely accepted method has been established and this leads to a great deal of worry across women diagnosed with Breast Cancer. This uncertainty could be detrimental to patients chances of 13 survival due to the delays it will cause in their choice of treatment. It seems ridiculous, but this amount of disparity draws comparisons to the inconsistencies of prehistoric medical practise mentioned earlier. Hopefully one day, our improved knowledge of the biology of Breast Cancer and the advancement of ever more sophisticated statistical techniques will see a Unified Prognostic Index developed which will optimise a patients survival time. The best way of understanding mathematical function is to see it in action, so let us have look at some example of a patients Nottingham Prognostic Index taken from our study: Patient ID 4231 had a size of 2.1cm, a histological grade of 3 and no Lymph nodes involved. The simple calculation of 0.2 × 2.1 + 1 + 3 yields a prognostic index of 4.42. From the index, this provides a an estimated 5-year survival score of 70%. This would have provided the patient and the NHS with valuable information to aid them when deciding the next best approach for treating the Cancer. As the 5-year survival score was relatively middle ground in relation with other NPI scores, there might have been some difficult choices with regards to the kind of therapy to go with. As it turned out, the patient underwent Radio Therapy and was still alive at the age of 60, almost 9 and a half years after the Prognostic Index was assigned. This is an example of the NPI working well; the score would have suggested a relatively intensive treatment to maximise the patients survival, and it paid off. Patient ID 4612 had a tumour size of 4.8cm, 4 lymph nodes involved and a histological grade of 3, yielding a very high prognostic index of 6.96. This comprehensively gave the patient a 5-year survival score of 50 percent. This would mean that the patient would require chemotherapy or run a high risk of not surviving more than 5 years. This index score once again worked well, as the poor prognosis it administered prompted the patient to go ahead with chemotherapy and was still alive 7 and a half years after the study began at the age of 65. Patient ID 2271 was looking in reasonably good shape at the beginning of the study and was assigned a Prognostic Index score of 4.4, which is equivalent (according to the NPI) to a 5-year survival score of 70 percent. The patient was told she stood a good chance of living past 5 years, and could perhaps decide against undergoing chemotherapy. As a result, the patient did not undergo any kind of chemotherapy treatment and unfortunately died at 58 only 3 years after the prognosis. This is an example of the Index going wrong the patient may well have been misled with her prognosis and skipped therapy which while being severe, may well have saved her life. The final patient (ID 2758) provides an example that the Nottingham Prognostic Index was used by the patient to make a concious decision about how to proceed. In the three examples above patients were not forced into any kind of therapy assigned from the NPI, 14 the index merely provides them with information on their chances of survival and recommends them with an appropriate plan of action which will hopefully prologue their lives for as long as possible. This particular patient was given a very poor prognosis with an index score of 6.9, rating their 5-year survival at 50% or less. This NPI score would normally imply a patient undergoes Chemotherapy. But on this occasion data suggests the patient weighed up the options and decided to stay away from the very undesirable effects of chemotherapy and live out the rest of her live in as stress free a manner as possible. She ended up dying 3 years later at the age of 41. These moral issues are very hard to comprehend at the best of times, let alone attempting to explain them using mathematics. So this report shall for the main part focus away from these delicate matters and focus on delivering a prognostic index which will give the patient as much accurate information as possible and make their choices based on well-grounded, truthful medical evidence. Many more patients would need to be run through the NPI to develop a good idea of how well the index works, but already its usefulness is evident. The application of the index, and its effectiveness will be looked at in greater detail later on. The process of constructing a prognostic index is an arduous one. First and foremost the statistician would be looking for as large and complete a data set as possible, and of course with the appropriate information. Information on the age of the patient at first diagnosis and their current status (living or dead) is simply not enough. In the case of this analysis, as much information on the biological and surgical status of each patient is essential. The current Nottingham prognostic index considers only 3 factors associated with Breast Cancer, but it has long been known that it can be improved, most likely with the addition information from other areas. The dataset analysed in this study ticked most of these boxes: It had information on more than 16,000 patients suffering from invasive Breast Cancer. There did, inevitably exist sections of missing data, but nothing that could not be managed. There was information on all biological factors suspected to relate to breast cancer as well as information on all of the surgery and therapy the patients went through. The information was taken over a set period of time, so the patients would not be monitored until death but only to the end of the study. It is wholly unfeasible to conduct this level of monitoring over a long period of time for more than a few patients. Thankfully there are techniques available to deal with this. With the exception of a few small cases of missing, there is as much information as one could hope for at this stage. In addition to these are Time Dependent Variables: Distant Metastases, Local Recurrence, Contralateral Breast Cancer and Regional Recurrence. These were omitted from the Model building process, the reason for which shall be explained later. With all of this information at hand the analysis could begin. This would roughly follow the process displayed below: 15 1. Conduct a thorough exploratory analysis of the data. Through the use of graphical analysis and statistical tests, develop a thorough understanding of the co-variates and their behaviour eg. using tests to determine significance and histograms to understand distributions 2. Once the basic knowledge of the data has been established, begin to explore intereactions and correlations within the data set, which will provide useful information for later on when constructing a statistical model. Pearson correlation,chi-squared test 3. Use appropriate techniques to solve the problems regarding missing data, these techniques are context dependent but the main aim is to maintain as much of the data as possible while minimizing the bias in the process 4. Begin applying survival analysis to the data, such as calculating and plotting KaplanMeier curves, to understand differences in survival amongst variables and their outcomes Begin to sub categorise variables using log ranks tests 5. Fit Cox Proportional Hazard models to the data and slowly but surely remove variables which have little to no impact on survival, and eventually have a model which only contains variables (and any potential interactions and necessary transformations) which effect survival from breast cancer 6. Assess the validity and accuracy of the model by running new data through it, and measure its expected outcomes against observed outcomes. Plot residuals and test to ensure model does not not reject the assumption of proportional hazards Wald, Likelihood and Score. Check diagnostics: for violation of assumption of proportional hazards, for influential data and for non-linearity in the relationship between the log hazard and the co-variates 16 7. Using the coefficients from the model, construct a new prognostics index with new 5-year survival scores In order to complete as effective an analysis as possible we must be able to weigh up all potential factors effecting prognosis, and whenever possible, disregard any prior beliefs which may distort potential truths which spill out from the study. It is often the case (and indeed was the case at several points during the study) that a piece of information pops up which completely contradicts previous inferences on the topic. More often than not, this is a mistake on the statisticians part or perhaps a mis-collection of data. But every so often this isn’t the case, and the findings might be very interesting and open new windows for discovery. Data analysis should be entered with as open a mind as possible (within reason), and should never disregard findings which lie in the face of those previous, unless they can be categorically disproved. Employ tried and tested methods but also try new approaches- History tells us that people have been blinded to the truth for centuries, simply by not even questioning a long standing and respected statement. Breast Cancer is one of the foremost areas of medical research, and is always at the tip of a biomedical statisticians tongue. This report may not be that elusive cure but hopefully will add a little piece to the jigsaw which scientists from all over the world are trying to complete. 17 Variable (R-code) Description Menopausal Status meno 1=Pre-Menopause, 2=Post-Menopause 63% patients post menopausal Lymph Nodes lymphnode 1={0 Lymph Nodes Involved}, 2 = {1 − 3}, 3 = {3 <} Tumour Type type Nominal Categorical 10 different types Age at first Operation age Range from 18-70 Mean age 54 Tumour Size(cm) size Continuous Variable Range={0.1, ..., 4.9}, Mean=1.85cm Method of Detection detection 1=Screening 2=Other Screening→ regular checkups Lymphovascular Invasion LVI 1=Cancer cells in Blood vessels 2=Not present in Blood Vessels Tubule formation T > 75% =1, 10 − 75% =2, < 10% =3 Mitosis M Mitotic Count: 1=0-9, 2=10-19,3=20+ Nuclear Cells 1=Ok 2=Bad, 3=Worse 1 = {3, 4, 5}, 2 = {6, 7} 3 = {8, 9} Indicator Variable 1=Therapy 2=No Therapy ”” ”” Pleomorphism P Histological Grade (T+P+M) grade Chemotherapy chemo Radio Therapy radio Axillary Surgery axillary Endocrine Therapy endo ”” ”” ”” ”” Table 1.2: Co-variates and their characteristics 18 Chapter 2 Data and Variables 2.1 Introduction to the Data The dataset provided by the NHS contains information on 16,965 women diagnosed with Breast Cancer taken from 10 centres around the world. This data set provides a set of information on each diagnosed patient, mainly the biological and surgical points of interest which are considered to be relevant in this study. A brief overview of the variables and what is going on is displayed in Table 1.8. A detailed description of these variables, in the context of Breast Cancer is provided in the glossary towards the end of this report: When dealing with such a large data set, it is very easy to become overwhelmed by the sheer size of the task which lies ahead. It is very difficult to know where to start, which way you should approach the analysis, what goals you should be aiming for, and so on. - A thorough understanding of Survival Analysis and the Nottingham Prognostic Index would be needed before beginning any serious analysis. Since R would be used extensively throughout the analysis, the anticipation and interpretation of its output would be crucial. Analysis is often a trial and error game, but that is all for nothing if you cannot tell between a meaningful result and a meaningless result - If a dataset of this size was ever to be analysed with some useful findings under the given time frame, some compromises would need to be made. Would a thorough analysis of all 16000+ patients be possible? Would all of the variables be maintained ; some of which surpassing 50 percent missingness? - The data needed to be as complete as possible, so techniques would need to be identified to handle the missing data. Ideally these would maintain as much information as possible without introducing any serious bias - Efficiency in the choice of methodology, it is no use spending weeks using one technique, then deciding that it is no good; I would need to be strategic with my selection 19 - In addition, there are never any guarantees than any real breakthroughs will come about, so knowing when to stop is important-no research is ever pointless! 2.1.1 A note on computing..... The analysis for this research project was conducted solely on Microsoft Excel and R, both of which are excellent packages. In Excel, the built in package ”Data Analysis” was used extensively for the exploratory analysis, and was very useful for summarising and producing basic plots of the data. The R package ”survival” was used throughout the Survival Analysis. This has many features, including inbuilt functions which can quickly produce survival curves as well as fit and check Proportional Hazard Models. 2.2 Missing Data It is never a massive surprise to come across missing data when dealing with large datasets in any field; be it engineering or analytics or medicine, for whatever reason, the existence of missing data is almost a forgone conclusion. Having considered the magnitude of the task ahead, and the great volumes of analysis to be performed, it was decided early on that the focus would be on just one centre. Given the time frame of this study, a comprehensive study across all 10 centres, all of which with having vast differences in the information provided (such as the variables described and missing data), would realistically be infeasible. Looking at the varying information on the patients provided across all 10 centres, it was important to choose a centre with minimal amounts of missing data. This goes without saying, but perhaps more important consideration , would be to choose a centre where the missing data (each one of the 10 centres had varying counts of missing values, spread across all variables), it was insignificant enough to not distort the analysis in any significant way. After careful consideration and preliminary analysis, it was finally decided that the study would be focused on Centre (1). The main reasons being as follows: • The Number of patients was large enough for there not be any doubt over the accuracy of the study (Which may have been an issue if, say, centre (2) was looked at) and was digestible enough for R to handle the analysis, which again may have been an issue if ,say, centre (9) was looked at. • Using prior information on the variables which may play an important role in survival, centre 1 has an almost full data set, with almost complete information across all variables considered important for the purposes of constructing an accurate and effective CPH model 20 Centre 1 2 3 4 5 6 7 8 9 10 No. Patients 2203 1308 562 2391 1321 604 1539 1491 4136 1410 Mean NPI 4.07 3.23 3.83 3.45 3.66 3.92 3.63 3.83 3.45 4.12 Table 2.1: Overview of the Breast Cancer Centres • The information the censoring of patients is almost complete, as a pose to most of the other centres, where there are vast gaps in the information, which would of course make matters a lot more complicated 2.2.1 Still some missing data though!! One of the main reasons Centre (1) was chosen as the focal point of the study was the relatively small ammounts of missing information it had when compared to other centres. Survival analyses based on a data collection process which the researcher has little control over are often plagued by problems of missing data. Deleting cases with any missing data will result in information loss and usually results in bias, while many analytic procedures that retain this information in some form underestimate the resulting uncertainty in parameter estimates and other output. Despite this, there were still some gaps in the information on many of the key variables: • Several of the variables (eg. Parafin Block, Frozen tissue) which were monitored in other centres were not monitored in centre (1). For the purposes of this study, these variables shall be disregarded. In perhaps future studies, these variables can be analysed across the other centres, and cross referenced with the results from centre (1), to unearth their potential importance • There some missing points in most of the remaining variables. Ideally the solution to this problem would not involve completely removing patients with these missing points from the study. Several papers were covered in the search for the best approaches to handle missingness in a survival analysis dataset. The correct/most effective approach is obviously context dependent but Little and Rubin (2002) stood out for its clarity and general content. 21 These approaches are more applicable for higher levels of missingness, and only one of the variables had higher than 5% missingness. So some amount of exploration into different methods would be useful. Such an exploration would (i) undoubtedly be useful for future data analysis, such as going on to analyse the other centres with greater chunks of information missing, (ii) provide further insight into the data itself; it is not unusual to make interesting discoveries as bi-products of other analysis. There were 4 main approaches: (a) Complete Case analysis (CC) (b) Single Imputation (SI) (c) Multiple Imputation (MI) (d) Maximum Likelihood Estimation (MLE) Like almost all statistical techniques, these have their benefits and drawbacks. 2.2.2 Complete Case Analysis A direct approach to missing data is to exclude them. In the regression context, this usually means complete-case analysis: excluding all units for which the outcome or any of the inputs are missing. There are a few different ways to do this; list wise deletion (an entire patient would be removed from the study if they had any missing data points) and pairwise deletion ( the patient would be removed if there were data missing from any of the variables used in the study). Immediately these strike as nave solutions for a study with as many covariates as this. Generally speaking, there are two main problems arise with complete-case analysis: 1. If the units with missing values dier systematically from the completely observed cases, this could bias the complete-case analysis 2. If many variables are included in a model, there may be very few complete cases, so that most of the data would be discarded for the sake of a simple analysis The small amount of missing values are well spread across the data, and the application of CC would see almost 11 % of patients being removed from the study, so this approach has not been deemed viable. 2.2.3 Single Imputation Albeit a relatively simple approach, single imputation is a very easy and efficient way to replace missing values in a data set. Once again there are several paths one can go down with this methodology: 1. Matching: This involves identifying a variable which correlates well with the one with missing points, and comparing data points, so for example we know that menopausal status has a strong relationship with age so a missing value in either could be inferred from the other 22 2. Hot-deck imputation involves sorting a dataset according to any of a number of variables, thus creating an ordered dataset. The technique then finds the first missing value and uses the cell value immediately prior to the data that are missing to impute the missing value. Another method is known as cold-deck imputation, which substitutes a value (with similar correlations) taken from another cold dataset of the same nature 3. Rth nearest neighbour: a similar approach to matching and is essentially an algorithm for imputing missing data. Rather than comparing variables of single observations, it looks at numerous observations to conjure a value to impute. This is very useful when looking at genetic arrays where there are 10s of thousands of missing points 4. Mean substitution: substitutes the mean of the known data of the variable for the missing data A consideration for these methods is dealing with categorical data such as deciding which value to take when that calculated lies in between the categories. For example when testing mean substitution on the variable Did the patient receive Endocrine Therapy? the mean value from the known data was 1.57 (1=Yes, 2=No) so how would we decide which answer to give to the 52 patients with no information? Around 80% of the variables in the dataset are indicator or categorical, so this is of obvious concern. In addition, mean imputation ruins the correlations among variables because it changes the way two variables co-vary. That is, if two variables move in tandem then using mean imputation to fill in the missing values destroys the natural relationship between the predictor and outcome variables since the imputed values will all be the mean of the predictor and remains static for all missing values while the outcome still varies independently. 2.2.4 Multiple Imputation Becoming increasingly popular with the continued development of computing power, multiple imputation methods are the most popular and generally most effective method of dealing with missing data. In order to deal with the problem of increased noise due to imputation, Rubin and Little(1987) developed a method for averaging the outcomes across multiple imputed data sets to account for this. The way this works is that imputation processes similar to stochastic regression are run on the same data set multiple times and the imputed data sets are saved for later analysis. Each imputed data set is analyzed separately and the results are averaged except for the standard error term (SE). The SE is constructed by the within variance of each data set as well as the variance between imputed items on each data set. These two variances are added together and the square 23 root of them determines the SE, thus the noise due to imputation as well as the residual variance are introduced to the regression model. *statistical noise is a term that refers to the unexplained variation or randomness that is found within a given data sample or formula There are many, many different techniques when implementing multiple imputation, sometimes they can be tailored to suit a set of data exclusively and sometimes they can be used across a whole range of datasets. Below are a few examples: (i) multiple imputation without inclusion of the outcome (MI-) (ii) multiple imputation with inclusion of the outcome (MI+) (iii) Multiple Imputation using regression switching with predictive mean matching (MICE) (iv) multiple imputation using regression switching with predictive mean matching (MICE PMM) (v) multiple imputation using flexible additive imputation models 2.2.5 Maximum Likelihood Estimation Maximum likelihood involves the estimation of parameters based on the observed data using Bayesian estimation techniques. So MLE can use information already known about a dataset, and use that to estimate a parameter of interest, say missing data. A likelihood function is the probability or probability density for the occurrence of a sample configuration x0 , x1 , ..., xn given that the probability density f (x | θ) with parameter θ is known. L(θ | x) = f (x1 | θ)f (x2 | θ); ...f (xn | θ) n Y = f (xi | θ) (2.1) i=0 The Maximum Likelihood Estimator (MLE) θˆ can be obtained by maximising L(θ | x) as a function of θ. As such it is usually far more convenient to work with the log-likelihood: l(θ | x) = logL(θ | x) n 1X f (xi ; θ) = n i=1 (2.2) ∂ ˆ There are also more developed methods l(θ; x) = 0 To obtain a value for θ. Then set ∂θ which use iterative procedures for parameter estimation, namely the E-M Algorithm. 24 The iterations alternate between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. The ebook by Millar (2011) gives a more modern take of MLE and how it has evolved along with computing. Maximum likelihood estimation is equivalent to multiple imputation in most cases though maximum likelihood is easier to implement than multiple imputation thus usually more preferred, but multiple imputation has slightly more situations in which it is applicable. 2.2.6 Dealing with the missing data Having already disregarded Complete Case analysis, the choice remained between single and multiple imputation. When necessary, multiple imputation can be very effective but it can also be very complicated. It was decided that in this instance it was not necessary, as all but one of the variables are less than 5 % missingness. Any potential extra bias which would be brought on by using single imputation will be minimal, and would not effect the outcome and accuracy of the analysis. In the Endocrine therapy example mentioned above, rather than replace the 52 missing values with the value 2 as the average was 1.57, a function in R was used to randomly insert the values based on the values already taken for this variable. So in this example the random.impfunction randomly inserted 32Nos and 20 Yess into the missing data points. random.imp <- function (a){ missing <- is.na(a) n.missing <- sum(missing) a.obs <- a[!missing] imputed <- a imputed[missing] <- sample (a.obs, n.missing, replace=TRUE) return (imputed) } This algorithm worked extremely well for the majority of variables missing points, however there were a few which required more attention: Internal Mammary Nodes Of the 1564 patients accounted for in this variable, 98.15% 0 Internal Mammary Nodes, with 1.66 % having 1 and 0.29 % having 2. With around 30 % missing, this was the variable had by far the most missing information (with all others being less than 5 percent 25 Figure 2.1: Lymphatic System 700 120.00% 600 100.00% 500 # of Patients 80.00% 400 60.00% 300 40.00% 200 20.00% 100 0 0.00% 3 4 5 6 7 8 9 More T+P+M missingness), imputing new values would be a relatively arduous task. With this being the only variable with substantial missing information, conditions are ideal for non response weighting to be applied. This would be a long and relatively fruitless task, considering the fact that it correlates strongly with number of Axillary nodes positive. Due to this high level of correlation, the decision was made to Tubile Formation, Pleomorphism and Mitosis These three factors are used to calculate the final Histological Grade of the tumour. Twenty five patients had all three of these missing at once, and a further 5 had just Mitosis missing. Fortunately, every single patient had their final histological grade, so it wasn’t difficult to impute values such that they didn’t violate the criteria. For example, one patient had information on the final grade, as well as Tubile formation and Pleomorphism both being graded 3. The histological grade was 2, so the only value Mitosis could take would be 1, otherwise there would be inconsistency in the model. The sub grade information for T, M, P for those patients with all three missing was imputed under the assumption that: • those with a histological grade of three would automatically have a grade of 3 for Tubile formation and Pleomorphism 26 • those with a histological grade of one would automatically have a grade of 1 for Mitosis and Pleomorphism Having manually filled in the points to match this criterion, the random.imp function could be used to randomly insert the rest of the missing data for T,P and M. Recurrence The problem with these recurrence variables is that their potentially hazardous effects don’t usually become present until some time after the patient has had initial surgery. This means recurrence is a time-dependent covariate, which has the potential to make the model building process far more complex. It is clear from the data that the recurrence variables have a large say in the patient’s survival, so it is important that they are taken into account when modelling survival. Unfortunately the fact that they remain anonymous until sometime until after the patient’s first prognosis means that they will essentially be useless in a Prognostic Index as it is unlikely that any sign of them will appear until some months and years after the patients first 5-year survival score is given. There does exist methods which accommodate Time dependent variables into a model, but in this situation they are not required: while it will be possible how the existence of recurrence effects survival, it is unlikely that a model will exist which can accurately predict if recurrence will occur. As such the time dependent variables which involve recurrence were removed from the study. Chemotherapy and the 9s.... An interesting problem arose when it was unveiled that 59 (2.7%) patients had observation which contained 9s in place of where the standard 1’s and 0s would be for Chemotherapy. Figure 2.2 shows that this third outcome has even lower survival times than those who underwent chemotherapy. One possible answer to this is that patients who had this outcome did undergo chemotherapy, but ceased the treatment as their cancer had developed to a point that survival seemed unlikely. The choice was made to keep these data points as they are, and treat chemotherapy as an ordinal categorical variable. This decision makes sense in that there is a clear difference in the survival times for these three outcomes. Later research using the log-rank test confirmed a clear separation in survival between the three outcomes. Summing up this section, there was a substantial amount of lists and a,b,c-ing which explains the great variety of methods which exist for handling missing data. As was mentioned earlier, the right approach is always context dependent but, even in matters regarding a single issue there are often several approaches to take. Another major con27 0.2 0.4 ^ S(t) 0.6 0.8 1.0 Figure 2.2: Estimated Survival Curves of Chemo-status amongst patients 0.0 Chemotherapy No Chemotherapy Unknown Therapy 0 50 100 150 t 28 200 sideration are the time restraints put in place. In the case of this report the choice of approach was not complicated by this factor due to the low levels of missingness which required only the simple matter of single imputation. The application of methods of multiple (regression based )imputation would probably have resulted in reduced biased, but give the low levels of missing data and the given time restraints, single imputation was deemed adequate. 2.3 Exploratory Analysis A long time was spent exploring variables, their relationships with each other, distributions and effect on survival. Considering the behaviour of the original NPI against survival provided a reasonably good starting point. We can see that the prognostic index scores are segregated into five clear groups, which makes good sense if we break down the combinations of the score in Table (2.3). S = 0.2×Tumour Size ≈ 0.36 ± 0.12, based on the median values for tumour size, is assumed be consistent for each combination. This is reflected in Figure 2.3, with the patients clustered around the NPI regions of Group (1) (2) (2) (3) (3) (3) (4) (4) (5) Lymph Node Status 1 1 2 2 3 1 3 2 3 Histological Grade NPI Score 1 ≈ 2.36 ± 0.12 2 ≈ 3.36 ± 0.12 1 ”” ”” 2 ≈ 4.36 ± 0.12 1 3 ”” ”” 2 ≈ 5.36 ± 0.12 3 ”” ”” 3 ≈ 6.36 ± 0.12 ”” ”” Table 2.2: Different Combinations of NPI (2.24,2.48), (3.24,3.48), (4.24,4.48),(5.24.,5.48),(6.24,6.48) and is even more evident in the distribution of the NPI scores in Figure2.10. The size of these regions is down to the first and third Quartiles of Size multiplied by the 0.2 as noted in the index. The data points lying outside of these regions were patients with either exceptionally small or exceptionally large tumour sizes. The majority of the observations lie in the groups (2) and (3) of the index. What may be of interest now, and potentially useful for looking at constructing a new index later on is to see how these five groups differ in survival. For groups (1) and (2), the majority 29 4 2 3 NPI 5 6 7 Figure 2.3: Scatter Plot of NPI against Survival ●●● ● ● ●● ●● ● ● ●● ● ● ● ● ● ●● ● ●●● ● ● ●● ●● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ●●●●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ●●● ●●● ● ● ● ●● ● ● ●● ● ● ●● ●●● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ●● ●● ● ● ● ● ● ●●● ● ●● ● ●● ●● ● ●● ● ●●● ● ● ●●● ●● ● ●●●● ●● ● ● ● ●●● ●● ● ●● ●● ●● ●● ●●●●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ●●● ● ●● ●● ●● ● ● ●●● ●● ●● ●● ●● ●●● ● ● ● ●● ● ●● ●●● ● ●●● ● ●● ● ● ● ● ● ●●● ● ● ●● ● ●● ● ● ● ● ●● ●● ● ● ● ●● ● ● ● ●●● ●●● ● ●●●● ●● ● ● ● ●● ●● ● ● ●● ●●● ●● ●●●● ●●● ● ● ● ●●● ● ● ●● ● ● ● ● ● ●● ● ●●● ● ● ● ●● ● ● ● ●●● ●● ●● ●● ● ●● ●● ●●● ● ● ● ●●●● ● ● ●●● ●●●● ●● ●●● ●●●●●●● ●● ● ● ● ●●●● ● ● ● ● ● ● ● ● ●● ● ●● ●● ● ●● ●●● ● ● ●● ●●●● ●● ●● ●● ●● ● ●● ● ● ● ● ● ● ●● ●● ● ● ● ●●●● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ●● ●● ●● ●● ●●● ● ●● ●● ● ● ● ●● ●● ● ● ●● ● ● ● ●● ● ● ● ● ●● ● ●● ● ●● ●●●● ● ● ● ●● ●● ● ●● ●● ● ● ●● ● ● ●● ●● ●● ●● ●● ● ●● ● ● ●● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ●●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●●● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ●●● ● ●●● ●● ● ●● ● ● ● ● ● ● ●● ●●● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ●●● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ●● ●●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ●●● ● ● ● ● ● ●● ● ●●●● ●● ● ●● ● ● ● ● ●●●● ●● ●●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●●●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ●● ● ●● ● ●● ● ●● ●● ● ●● ● ● ● ●●●●● ●● ●● ●● ● ● ● ● ● ● ● ● ● ●● ●●●●● ● ● ● ● ● ●●● ● ●●● ●● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ●● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ●● ● ● ●● ● ● ● ● ● ●● ●● ●● ●● ● ●● ● ● ● ● ● ● ● ●●● ●●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●●● ● ● ●● ●●●●●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ●● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ●● ●●● ● ●● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ●● ●●●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●●● ● ● ●● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●●● ●● ● ● ● ● ● ● ● ●●●● ●● ● ●● ●● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●●● ●● ● ●● ●● ● ●● ● ●● ● ● ● ●●● ● ● ●● ● ●● ● ● ● 0 50 100 ● 150 200 Survival (months) of the patients had a survival time of more than 80 months, and while the same applies for (3), there is an clear increase of patients surviving < 80 months. A measure of the accuracy and performance of the NPI against the 2203 patients from Centre (1) can be easily obtained. Table (2.3) shows the 4 different categories of NPI scores, the number of patients and the actual survival percentages of each group in Centre (1). It is immediately transparent that there are huge differences between the index score 5-Year Survival (NPI Score) 90% (≥ 2− ≤ 2.4) 85% (> 2.4− ≤ 3.4) 70% (> 3.4− ≤ 5.4) 50% (≥ 5.4) Number of Patients % Survival Actual 321 96.4% (307 patients) 468 95.1% (445 patients) 1107 86.2% (954 patients) 308 61% (188 patients) Table 2.3: NPI 5-Year Scores against actual Scores from Centre 1 and the actual score - is this intentional? Looking at the different categories of 5-year survival, these patients do not match very well with the predicted survival percentages. The lower scores (< 2.4)for the NPI matches well with insignificant difference 30 (pairedt − test, α = 0.05) however the three subsequent categories are less encouraging. For the next two stages, the prognostic predicted survival differs significantly(α = 0.05) with 10% and 11% difference respectively, with more patients surviving beyond the 5-year mark than the index anticipated. And the final category anticipated 10% more people surviving then actually did. Additionally 85% survive longer than 5 years, which emphasises how the vast majority of patients have low range NPI scores. It is initially surprising to see that less than 50% of the patients survive for less than 10 years, but this is likely to be a figure severely distorted by censoring, meaning that the survival times for this data take into account patients who were still alive at the end of the study, from which point their age cannot be monitored. Thus it that age is used for the analysis. Since many of these studies will last less than 10 years, the true percentage survival at 10 years is tricky to establish. The 5-year survival is far easier to clarify, as almost all studies of this nature will last at least 5 years. So we have seen there are significant inconsistencies, and you would expect something as developed as the NPI to be more accurate there are several possible reasons for this. Were possibly not enough patients tested on? There seem to be different levels of survival across centres Centre (1) seems to have reasonably high levels Perhaps the index is designed to be under predict survival by around 10% to make patients think more cautiously about their choices. I would be interesting to hear the truth, and if indeed this is the truth, is it morally right for NHS to distort information for the patients in such a way? Or perhaps the NPI is that inaccurate and in desperate need to revising? Interesting thought......when counting up the patients surviving less than 5 years after diagnosis there was quite a high intensity of patients dying slightly after the 5-year survival time, especially 6-7 years as can be seen in Figure A.1. This shows the distribution of time of death for patients with NPI score > 2.4to3.4 ≤ dying before 10 years. This is one of a few examples where 5 years as time point of reference for survival seems inappropriate. On this evidence, were this particular prognostic index to be used, either the survival % for each group should be increased by 10% or the time point be increased to 7 years. If this was to be performed to the scores from centre 1, the actual scores would be insignificantly different (α = 0.05) to the NPI scores. Figure 2.4 demonstrates how the predefined categories segment the data. The vertical red line represents the 5 year mark, and the blue horizontal lines mark the different categories which determine survival. So as we are moving upwards through the sections we are expecting the number of points on the right to decrease and on the left to increase (in propotional terms for that particular category). While this is the case, it does not fit in with the levels of survival stated by the NPI. The green lines represent the actual times which would need to be assigned to this dataset if the prognostic index was to maintain the survival percentages for its 31 4 3 2 NPI 5 6 7 Figure 2.4: Scatter Plot of NPI against Survival with Fitted Survival Lines ●●● ● ● ●● ●● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●●● ●● ●● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ●● ●●● ● ● ● ●● ● ● ●● ●●● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ●●● ● ●● ● ●● ●● ● ●● ● ●●● ● ● ●●● ●● ● ●●●● ● ● ● ● ●●● ●● ●● ● ●● ●● ●● ● ●●● ●● ● ●● ● ●● ● ● ●● ● ● ●● ● ● ●● ●●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ●●●●● ● ●●● ● ● ●● ● ●● ● ●● ●●● ● ● ● ● ●● ●●● ● ●● ● ●●● ● ● ● ● ●● ● ● ●● ● ●● ● ● ●● ● ● ●● ● ● ● ●●● ●●● ● ●●●● ●● ●●●● ● ●● ● ●● ●● ●● ● ● ●● ● ● ●●● ● ● ●● ● ● ● ● ● ●● ● ●● ● ●●● ● ● ● ●● ● ● ●● ●● ● ● ●● ●●● ● ● ● ●●●● ● ● ●●● ●●●● ●● ●●● ●●●●●●● ●● ●● ●● ●● ● ● ● ●●●● ● ● ●● ● ● ●● ● ● ● ● ●● ● ●● ●● ●● ● ● ●● ●●●● ●● ●● ●● ●● ● ●● ● ● ● ● ● ●● ●● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●●● ● ● ●● ●● ●● ●● ●●● ● ●● ●● ● ● ●● ●● ● ● ● ●● ● ● ● ●● ● ● ● ● ●● ●● ● ●● ●●●● ● ● ● ●●● ●● ● ● ●● ●●● ● ● ● ●● ● ● ●● ●● ●● ●● ●● ● ●● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ●● ● ●● ●● ●●●●●● ●●● ●●● ● ● ● ●●● ●● ●● ● ● ● ● ● ● ● ● ●● ● ●● ●● ● ● ●● ● ●●●● ●● ● ●● ● ● ● ● ● ● ● ● ●● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ●● ● ●● ● ● ● ● ●● ● ● ●● ● ● ●● ● ● ● ● ● ●● ●● ●●● ● ● ●● ● ● ● ● ● ●● ●●● ● ● ● ●● ● ●●● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ●●● ●●● ● ●●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●●● ● ●●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ●● ●●●●●●● ● ●● ● ● ●● ● ●● ● ● ● ● ● ● ●●● ● ● ● ●●● ● ● ● ● ●● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ●● ●● ● ●● ● ● ● ● ● ●● ● ● ● ●● ●● ● ●● ● ● ● ●●●●● ●● ●● ● ●● ●● ● ● ● ●●● ●● ● ●● ●●●●● ● ● ● ● ● ●●● ● ●● ● ●● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●●● ●●● ● ● ●● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ●●● ● ● ●●● ● ● ● ●● ● ●● ●● ●● ●● ● ●● ● ● ● ● ● ● ● ● ● ●● ●● ●● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ●● ●●● ●● ● ● ● ● ●● ● ●●● ●●●● ●●● ●● ● ● ● ●● ● ●●●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ●●● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ●● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ●● ●●●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ●●● ● ● ● ●● ●● ● ●●● ● ● ●● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ●●● ● ● ● ●● ● ● ●●● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ●● ●● ●● ● ●●● ●●● ● ●● ● ● ● ● ● ●● ●●● ●●● ●● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●●● ● ● ● ● ● ●● ● ● ● ●● ● ● 0 50 ● 100 survival 32 150 200 4 NPI 5 6 7 Figure 2.5: Scatter Plot of NPI for patients dying from Breast Cancer ● ●● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ●●●●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●●● ●● ● ●● ●●● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ●●● ●● ● ● ● ● ● ● ● ●● ●●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ●● ● ● ● ● ● ● ● ●● ● ●●● ●● ●●● ●● ● ● ● ●● ●●●● ● ●● ● ● ● ● ● ● ●● ● ●●●●●● ●● ●●●●● ● ● ●● ● ● ●●● ●● ● ● ● ●●● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ●●● ● ● ● ●●●●● ●●●●● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ●● ● ●●●● ● ●● ●●● ●●●● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ●● ●●● ● ●● ●●● ● ● ●● ●● ●●●●● ● ●● ● ● ● ● ● ●● ● ● ●● ●● ● ●●● ●● ● ●●●●● ●● ● ● ●●● ● ● ● ● ● ● ● ● ●● ●●●● ●● ●● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● 3 ● ● ●●●● ●● ● ● ● ● ● ● ● 2 ● 0 ● ● ● ● ● ● ● ● ●● 50 ● ● ●● ● ●● ● ● ● ● ●● ● ● ●●● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ●●● ● ● ● ● ●● ● ● ● ● ● 100 150 ●● ● ● 200 Survival (months) set categories. From this visual demonstration we see that we would need to squeeze or retract the blue lines enough so that the green line matches the red line. This would require altering the margins to something more or less than 1 (or 2), and obviously. The nature of this obviously requires on the distribution of the points. Were they uniformly distributed, it would be easier to see how the index would needed to be altered to fit the data. In the case of centre 1, the data are gathered in 5 groups, but there are only 4 categories for these to fall into. There is no immediate explanation for this, but it is a reminder that the model building is just part of the process, and the subsequent index requires some amount of visualising. Figures 2.5 and 2.6 represent the difference in NPI scores for patients dying from breast cancer those leaving the study from other causes. These can help visualize the possibility of informative censoring. After looking at the age distributions for Death from Cancer in Figure (2.7), leaving the study other causes (Figure 2.8) and still alive at the end of the study (Figure 2.9), they all have strong suggestion that they belong to a family of normal distributions. But by 33 Figure 2.6: Scatter Plot of NPI for patients leaving the Study for Other Causes ● ● ● ● ● 6 ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ●● ● ● 5 ● ● ● ● ● ● ● ● ● ● 4 NPI ● ●●●● ● ● ● ● ●● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● 3 ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● 2 ● ● 0 ● ● ● ● ● ● ● ●●● ● ● ●● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ●● ● ● ●● 50 ● ● ● ● ● ● 100 ● ● ● ● ● ● ● ● ●● ● ● ● ● ●●● ●● ● ● 150 ● 200 Survival (months) Figure 2.7: Distribution of Survival for Patients Dying from Breast Cancer 485 patients dying of from breast cancer 40 35 Frequency 30 25 20 15 10 5 0 Age 34 Figure 2.8: Distribution of Survival for Patients leaving the Study for Other Causes 173 patients dying from other causes 25 Frequency 20 15 10 5 0 40 44 48 52 56 60 64 68 72 76 80 84 More Age Figure 2.9: Distribution of Survival for Patients still Alive at the end of the Study 1545 patients still alive at the end of the study 140 120 Frequency 100 80 60 40 20 0 Age 35 Figure 2.10: Distribution of NPI scores across Centres (1-6) Distribution of Nottigham Prognostic Index Scores across patients from Centres 1-6 800 700 Frequency 600 500 400 300 200 100 0.1 0.3 0.5 0.7 0.9 1.1 1.3 1.5 1.7 1.9 2.1 2.3 2.5 2.7 2.9 3.1 3.3 3.5 3.7 3.9 4.1 4.3 4.5 4.7 4.9 5.1 5.3 5.5 5.7 5.9 6.1 6.3 6.5 6.7 6.9 More 0 NPI comparing these histograms, information can be gathered which may give more insight into the nature of the censoring. The similar patterns all three groups follow follow suggest that the reasons for those who left the study for other causes, may be linked to the breast cancer. While these deaths might not come directly from breast cancer, they may be directly associated with it. Spreading across other centres, Figure (2.10) looks at the NPI of over 8,000 patients. It can be seen that the patterns of the histograms follow a similar shape to the other groups looked at. But Immediately it is evident that something is not quite right as the graph is displaying a number of patients with NPI scores of less than the base rate of 2. This will be the result of either missing information on the Tumour Grade or the Number of Lymph Nodes involved. Looking at Figure (2.11), once again a normal distribution comes to mind. On closer inspection it becomes evident that there is a disproportionate frequency of patients with a Tumours Sizes with factor 0.5, implying that for more than a few patients, this measurement is being rounded up/down to the nearest factor of 0.5, which is surprising. The histogram suggests that it is possible to accurately measure the size of a tumour to the nearest mm, yet it is obvious that less accuracy has been used in many instances. Given that the size of the tumour plays a relatively large role in determining prognosis, hence the choice of surgery, hence the chances of a patients survival, perhaps more time and 36 Figure 2.11: Distribution of Tumour Size for Centre 1 Distribution of Tumour Size amongst patients in Centre 1 250 Frequency 200 150 100 50 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 2.2 2.4 2.6 2.8 3 3.2 3.4 3.6 3.8 4 4.2 4.4 4.6 4.8 5 0 Tumour Size (cm) effort could be invested in measuring the tumour size as accurately as possible. Having said that, the potential dangers of this measurement strategy will be limited substantially be the fact that Tumour Size is only 20% present in the final NPI. It is quite possible that those involved in the measurement process were well aware of this fact. Were Tumour Size to carry more power within a Prognostic Index, perhaps there would be more emphasis on the accuracy. Additionally it seems more accuracy is used for measuring smaller tumours (< 1.5cm), and as the size increases, so does the amount of rounding. There were 9 different types of tumour found in the patients. Figure (2.12) shows that the majority are of type 1, with type 11d also making up a large share of the patients. With such a range of tumour types, instinct dictates that differentiation between types may influence survival in some way. A pie chart is no way to measure this but later on, differences in survival between types will be investigated using Kaplan-Meier estimators. 2.3.1 Relationships between Variables With the range of types of variable in the dataset, there does not exist a single generalised method to detect relationships. Since there exists continuous data (eg Survival), nominal categorical (eg Tumour type), ordinal categorical (eg Histological Grade) as well as binary/indicator variables (eg Endocrine therapy; a great deal of care must be taken in choosing the correct method for each situation. A rigorous procedure of comparing 37 Figure 2.12: Percentages of different Types of Tumour found in Centre 1 % Spread of Tumour Types T12 1.54% Missing 0.50% T11d 24.33% T8 0.22% T1 58.24% T7 5.44% T3 6.72% T6 0.18% T5 1.50% T4 1.32% variables is not a necessity but in an ideal world a good working knowledge of the main relationships would be attained before throwing the variables into a statistical model. Having already had a look at the behaviour of the NPI against survival, it might be of interest to explore its relationship against another continuous variable, say size, as shown in Figure 2.13 While not being immediately striking, the data points are mainly populated around about 90-170 months in survival with Tumour size (< 2.5cm. And there is a trend appearing after about 90 months of survival, the smaller the tumour size the longer the survival as might expect. This is echoed by the line, which is the linear relationship between the two variables. While this is heading in the right direction, it is certainly not a true relationship between the variables as it ignores several key attributes to the data such as censoring. 2.3.2 External Factors There is always the possibility that there are factors which are not included in the study which could affect a patients survival. Censoring (which shall be looked at in greater depth later on) takes into account situations where the patients survival has been compromised by factors unrelated to Breast Cancer. But there are times where the collection of information for variables maybe inconsistent amongst patients. For example the time 38 5 Figure 2.13: Scatterplot of Tumour Size against Survival with fitted linear model ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ●● ● ●●● ● ● ●● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ●●● ●● ● ● ● ●● ● ●●●●● ● ●● ● ● ● ● ●●● ● ●●●●● ● ● ● ● ●● ● ● ● ● ●● ● ●● ● ●● ● ●●● ●● ● ● ● ●● ● ●● ● ● ● ●● ● ● ● ● ● ●● ● ● ●●●● ● ●● ● ● ●●● ● ● ● ● ● ● ● ●● ●●● ●● ●● ● ●●●● ● ● ●●● ●●● ●● ● ●● ●● ●● ●● ●●● ●● ●●●● ● ● ●● ●● ● ● ●● ● ● ●● ●●● ●● ● ●● ● ● ●●● ●● ●●●● ● ● ●● ● ●●●●●● ●● ●● ● ● ● ● ● ● ●● ●●● ● ●● ●●● ● ●● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ●● ● ● ●● ●● ● ● ● ● ● ● ● ●● ●● ●● ● ●●● ● ● ● ● ●● ● ●● ● ● ● ● ●●● ● ●● ●●●●● ●● ●● ● ●●● ● ●● ●●●● ● ●●●● ●● ●●●●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●●● ● ● ●● ● ●●● ●● ● ● ● ● ● ● ● ●●● ● ● ● ● ●●●● ● ● ●●● ●● ● ● ●● ●●● ● ● ●●●● ●●● ●●●●●●● ● ●● ●●● ●●● ●● ● ● ● ●●●●●● ●● ● ●●● ●●● ● ● ●● ● ● ●● ● ●● ●● ●●● ● ●● ●●● ●● ● ● ●●● ●● ●●● ● ● ● ● ●● ●●●● ●●●● ● ●●●● ● ● ● ●● ●●● ● ● ● ● ●● ● ●● ● ● ●●● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ●● ●●● ● ●●●● ● ●● ●● ●●●● ●● ●● ● ● ●●● ● ●● ●● ●● ●● ● ● ●● ● ● ●●● ● ● ● ●● ● ● ●● ● ●●● ● ●● ●● ● ●● ● ●● ● ● ● ● ● ●●● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ●● ● ● ●● ● ●●● ●● ● ● ● ●● ●● ●● ● ●●●●●● ●●● ● ● ● ●●● ●●● ● ●●● ● ● ●● ●● ● ● ● ● ● ●●●● ●●●●●● ●●● ● ● ● ● ● ● ● ● ● ●● ● ●●● ●●●● ● ● ● ●● ●● ●● ● ● ●●● ●●●● ● ●● ● ●●● ●● ●● ● ●● ●● ● ● ● ●●●● ● ●● ● ●●●●● ● ●● ●● ● ● ●●● ●● ● ●●●●● ●● ● ● ● ●● ●●●●● ● ●● ●● ●● ● ● ●● ● ●● ● ● ● ●● ●●●● ● ● ●●●●●● ●● ●●● ● ●●● ● ● ●●● ● ●●● ● ● ● ● ● ● ●● ● ● ● ●●● ● ●●● ●●● ● ●● ●● ● ● ● ●● ● ● ● ● ●● ●● ● ●● ●●● ● ●●●● ●●●● ● ● ●● ●●● ●● ● ● ● ● ●● ● ●● ●● ● ●●● ●● ● ● ● ●●●●● ●●● ●● ●● ●●●● ●● ● ●● ●● ● ● ●● ●● ● ● ●●● ●● ● ●●● ● ● ● ●●● ● ●● ● ● ● ●●● ● ●●● ● ●●● ● ● ● ● ●● ●●● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ●● ●● ● ●● ● ●● ●●● ●● ● ●● ● ● ●●● ●● ● ●●● ● ● ●● ●●●●● ● ● ● ● ●●● ●● ● ● ● ●●●●●● ● ● ● ● ●● ● ● ●● ● ● ●● ● ● ● ●● ● ●● ● ●● ●● ● ● ● ●●●● ● ● ● ● ● ●●● ● ● ● ● ●●● ●● ● ● ●●●● ● ● ●●● ● ●● ●● ●●●●● ● ●●● ● ●● ● ● ●●●● ● ● ●● ●●● ● ●● ●● ● ● ●●●● ●●● ● ●●●● ●● ●● ●●● ● ● ● ●● ●● ● ● ● ● ● ●● ●●● ●●●● ● ●●● ● ●● ● ●●● ●● ● ● ● ●● ● ● ●● ●● ●● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●●● ● ●●● ● ●● ● ●● ● ● ● ●● ●●● ● ●● ●● ●● ●● ●●● ●● ● ●● ●● ● ● ●●● ● ●●●●● ● ● ●● ●● ●● ● ● ●● ●●●●●● ●● ● ● ● ●● ● ● ● ●●●● ● ●● ●● ●●● ● ● ● ● ● ● ● ●●● ●● ● ● ● ●● ●● ● ● ●● ● ● ●● ● ● ●●●● ●●● ●● ● ●●●●● ● ● ●●● ●● ●●● ● ● ●● ● ●● ● ●● ● ● ● ● ●●● ●●● ● ●● ●● ● ●● ● ● ●●●●●● ●●● ● ● ● ●●● ● ● ● ● ● ●● ● ● ●● ● ● ●●● ● ● ● ● ● ● ●● ● ●● ● ●●● ● ●●● ●●●● ● ● ● ● ●● ● ●● ●● ● ● ●● ● ● ●●●● ● ● ●● ●● ● ● ● ● ● ●● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●●● ● ●●● ● ● ● ●●● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ●●● ● ● ●● ●● ●● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● 0 1 2 Size(cm) 3 4 ● ●●● ● 0 50 100 Survival 39 150 200 Figure 2.14: Histogram of age of Patients at 1st Operation Frequency Patients age at 1st Operation 200 180 160 140 120 100 80 60 40 20 0 Age of Patient at 1st Operation between a person having cycles of chemotherapy may be fluctuate between patients due to different waiting times in hospitals. Or the time taken between initial diagnosis and the first operation may be affected by a factor not looked at in this study, such as delays in readily available equipment/medication or the mental state of a patient. One of the variables that may be an important factor is the date of diagnosis; however information on this is unavailable. The closest information relevant to this is the date of the 1st operation, the distribution of which is displayed in Figure 2.14. While there might be time variations between the date of diagnosis and the date of first operation for each patient, it should be noted that the cancers progression will be dictated by the characteristics of the tumour, all of which are taken at the same point for each patient. Figure 2.14. One variable which provides insight into the stage of cancer for each patient going into the study is Method of Detection This refers to the circumstances in which the tumour was first detected and has two possible outcomes: Detected through Screening or Not detected through screening. The patients who had detected a tumour through screening would have, on average, a less progressed breast cancer. On the other those who were diagnosed out with screening are likely to have further progressed cancer. This makes intuitive sense, as the regular frequency of screening would help ensure that the tumour was identified as early as possible. Patients not having regular screening check ups run the risk of a tumour sitting unnoticed for a long period of time, bringing with it far greater 40 0.2 0.4 ^ S(t) 0.6 0.8 1.0 Figure 2.15: Histogram of age of Patients at 1st Operation 0.0 Screening Other Method 0 50 100 150 200 t risk. A way of measuring the effect that late diagnosis can have on a patient is to plot a Kaplan-Meier curve showing survival. Figure 2.15 shows the different percentage survival rates of patients over time for patients who’s tumour was detected by screening against those who was not. This striking difference in survival for these two methods of detection may suggest that it may well be a factor in determining survival, but then again cancers severity is dictated by its biological characteristics, which represent themselves regardless of the time they were detected. 2.3.3 Method of Detection Exploring the difference in methods against other variables provides interesting results. Take Figures (2.16) and (2.17). Comparing the histograms on tumour size of Screening against not screening roughly show two normal distributions, however the tumour size for screening is centred around 1.2cm, while the histogram for the patients who did not undergo screening was centred around 1.8cm. So if a patients tumour was first caught out in screening, they can expect to have a tumour 60mm less in size than if their tumour was detected another way. It is certainly true that there will be a relationship between the therapy a patient undergoes and their subsequent prognosis. This does not necessarily 41 Figure 2.16: Distribution of Tumour Size for patients detected by screening Tumour Size Histogram - Method One (722 Patients) 70 60 Frequency 50 40 30 20 10 5 4.8 4.6 4.4 4 4.2 3.8 3.6 3.4 3 3.2 2.8 2.6 2.4 2 2.2 1.8 1.6 1.4 1 1.2 0.8 0.6 0.4 0 0.2 0 Tumour Size (cm) Figure 2.17: Distribution of Tumour Size for patients not detected by screening Tumour Size Histogram - Method Two (1481 Patients) 140 120 80 60 40 20 0 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 2.2 2.4 2.6 2.8 3 3.2 3.4 3.6 3.8 4 4.2 4.4 4.6 4.8 5 Frequency 100 Tumour Size (cm) 42 depend on the success of the therapy, but on the nature of the therapy. For example, a patient undergoing Chemotherapy treatment is more likely to have a poor prognosis than a patient undergoing hormonal treatment. These are two different processes likely to be prescribed at different developmental stages of the cancer. Chemotherapy is considered the most severe treatment, and is only used if all other treatments are not successful. So a patient undergoing chemotherapy is likely to have worse survival chances than patients undergoing other treatments such as radiotherapy and Endocrine therapy. 2.4 Summary of this Chapter Exploratory analysis is a fairly systematic exercise, and the results displayed in this chapter are amongst the most interesting from many, many comparisons, investigations and plots. With the frequent use of Scatter plots and Histograms; visualisation is often the way to unravel any secrets lurking in the data. While the exploratory Analysis is more about trial, error and instinct, handlign the missing data is a far more subtle exercise. In addition to Little & Rubin (2002), Allison,P (2001) contains a well refined list of methods. Now the missing data has been dealt with, and any hidden secrets within the variables has been unearthed, now is as good a time as any to move on with the Survival Analysis. 43 Chapter 3 Survival Analysis The term ”survival analysis” is a statistical approach designed to take into account the amount of time an experimental unit contributes to a study. That is, it is the study of time between entry into observation and a subsequent event. Less formally, a unit may be a particular patient under study or a structure under observation while the event would be the focal point of the study such as death. Survival Analysis made an immediate impact in medical research in particular. When observing units as unpredictable as humans over a certain time frame, problems would arise such as patients leaving before the timeframe of the study had elapsed. This could be for a number of reasons, such as quitting the study, moving too far away to continue monitoring or dying from reasons unrelated to the topic under investigation. As a result of these unpredictable and uncontrollable events, the researcher was forced to withdraw the units and all associated data from the study. This would have been extremely frustrating as units which would have provided valuable information to the study, have to be completely disregarded. As a result these studies could never be entirely regarded as trust worthy, as there was simply too much information missing. Thankfully techniques called censoring enabled researchers to analyse incomplete data due to delayed entry or withdrawal from the study. Survival analysis meant that studies could start without all experimental units enrolled at start of observation time and could end before all experimental units had experienced an event. This allowed each and every experimental unit to contribute all of the information possible to the study considering the amount of time the researcher was able to observe the unit. In a similar fashion to those involved in the construction of the Nottingham Prognostic Index, this project shall focus on the use of Survival Analysis. 44 3.1 Survival Function Survival Analysis relies on the Survival Function S(t). This function is a property of any random variable that maps a set of events, usually associated with mortality or failure of some system, onto time. It captures the probability that the system will survive beyond a specified time. Let T be a non-negative random variable representing the waiting time until the occurrence of an event. For simplicity this report shall adopt the terminology of survival analysis, referring to the event of interest as ’death’ and to the waiting time as ’survival’ time, but the techniques to be studied have much wider applicability. Assuming for that T is a continuous random variable with probability density function f (t) and cumulative distribution function F (t) = P r{T ≤ t} giving the probability that the event has occurred by duration t. <t+4t) f (t) = lim4t→0 P (t≤T4t (3.1) The Survival Function is often referred to as the Complementary cumulative distribution function. This can be seen below, with the Survival Function conventionally denoted S(t), for some time t and T (usually representing time of death or failure) be a continuous random variable with cumulative distribution function F (t) on the interval [0, ∞). Its survival function is S(t) = P (T > t) Z ∞ f (u)du = 0 = 1 − F (t) (3.2) Simply put, (3.2) shows the probability that the time of death is later than some specified time t. This has the following properties: • Every survival function S(t) is monotonically decreasing.The survival function must be non-increasing: S(u) ≤ S(t)if u ≥ t. This property follows directly from F (t) = 1 − S(t) being the integral of a non-negative function. This reflects the notion that survival to a later age is only possible if all younger ages are attained. Given this property, the lifetime distribution function and event density (F and f below) are well-defined; the survival function is usually assumed to approach zero as age increases without bound, i.e., S(t) → 0 as t → inf, although the limit could be greater than zero if eternal life is possible. • The survivor function is smooth and right continuous • The time, t = 0, represents some origin, typically the beginning of a study or the start of operation of some system. S(0) is commonly unity but can be less to represent the probability that the system fails immediately upon operation. In other 45 words one would usually assume S(0) = 1, although it could be less than 1 if there is the possibility of immediate death or failure. This function would be used for two main reasons. Firstly determining a patients probability of surviving to time t and secondly determining the % which survive survive to time t. The roots of survival analysis can be traced back to mortality tables from centuries ago, however it was not until the second world war that the Survival Analysis of the modern day began rapid development. Ironically, his new era was stimulated by interest in reliability (or failure time) of military equipment. It seems that it was the divine incentive to kill rather that to save which really pushed forward this breakthrough. At the end of the war these newly developed statistical methods emerging from strict mortality data research to failure time research, quickly spread through private industry as customers became more demanding of safer, more reliable products. As the uses of survival analysis grew, parametric models gave way to non-parametric and semi parametric approaches (all of which I shall discuss in greater detail later on) for their appeal in dealing with the ever-growing field of clinical trials in medical research. 3.2 The Hazard Function The Hazard function h(t) is, in short, used to measure the failure rate for very small intervals (or continuous) time. This provides a crucial link between Survival analysis and Proportional Hazard models, and assesses the instantaneous risk of demise at time t, conditional on survival to that time: h(t) = lim4t→0 P rd{t ≤ T ≤ t + 4t} | T ≥ te/4t (3.3) In words: the probability that if a unit survives to t, that unit will succumb to the event in the next instant.In terms of the survival function: f (t) s(t) d = − lnS(t) dt = slope of − lnS(t) h(t) = = how many (in %) die at time t (3.4) This can be interpreted by considering an individual alive at time t. The chances of dying in a small interval [t, t + 4t) are then given by: qt = S(t) − S(t + 4t) ≈ h(t)4t S(t) 46 (3.5) Which can be alternatively written as: h(t) = N o of deaths in [t, t + 4t) N o person-years at risk in[t, t + t) (3.6) It should be noted that, unlike the survival function S(t), the hazard function is NOT a probability so can take values of less than 0 and more than 1. Also, unlike the survival function, the hazard function can instantaneously decrease OR increase (OR both) over time.The modelling of survival data usually employs a hazard function or the log-hazard. For example, assuming a constant hazard, h(t) = v, implies an exponential distribution of survival times, with density function p(t) = ve−vt . In essence the Survival increases with t as the Hazard decreases with t and vice versa. 3.3 Applications of Survival Analysis Despite the assumptions that one may make from its name, Survival Analysis is not only restricted to modelling cases concerning the actual transition of an organism from living to death. It has a much broader application, using essentially the same methodology across various different fields. Some examples of this include event-history analysis used in sociology, reliability theory used in Engineering and duration analysis in economics. It is not all that difficult to imagine how this topic is so easily transferable across various, and in many ways very different areas. . So a question posed in Medical research such as What proportion of a group suffering from an undesirable condition will survive past a certain point? could be answered with exactly the same statistical techniques as a question asked by an incompetent Engineer What proportion of these houses will collapse before a certain point? Or a psychiatrist looking for an answer to the question What effect does electroshock therapy have on a patients mental stability? would need only go as far as the answer to an economists question What effect does David Cameron have on the United Kingdoms economic stability? or perhaps even a combination of the two What effect does electro-shock therapy have on David Cameron? Perhaps these examples are a little far fetched, but never the less give you a taste of the diversity and transferability of Survival Analysis. As such, it is used increasingly in areas such as Political science, where these approaches can be extremely informative. – In international relations; war duration, alliance longevity, peace duration, length of trade agreements, etc – In generalized politics; Government durations, length of civil wars, , confirmation durations, political careers, bill co-sponsorship, policy innovation/adoption coalition durability, etc 47 – In America, where the political structure is a lot more complex than simply Democratics Vs. Republicans; Campaign contributions (which, on the evidence of last years US elections , may well be worth Billions of dollars), Congressional overrides of Supreme Court decisions, etc Or in other problems such as economics, sociology, anthropology, – Length of a career (within a single role, including redundancies and promotions etc), Strike durations – Issues dealing with criminals and their employability etc, success in life based on marital/childhood circumstances, in music, the chances of a band being able to make a record label a sizeable profit – Various health and development related issues More generally, survival models are useful for any issue in which the phenomenon of interest is a duration, and/or the response is the occurrence of a discrete event in time. As with any statistical approach, process, method or model, there are certain rules which apply, and with these rules often come problematic issues. But first lets have a look at the characteristics of Time-To-Event Data 1) Take place over time 2) Discrete events (i.e., not continuous) 3) the unit may not experience the event (i.e., possibility of censoring) Criteria for Survival Analysis Unfortunately these characteristics bring with them several problems and boxes which must be ticked if they are to function in the correct manner. Bringing the concept of time into any scenario makes things a whole lot trickier as is the case with Survival Analysis; - all of the duration data is strictly non-negative (unfortunately time travel has still eluded even the sharpest of scientists) - Also, the data are conditional: to survive to some time t, one must necessarily have survived up to t-1 as well - Additionally, observations are regularly encountered which have not failed yet (i.e., censored data) Thankfully there does exist well exercised methods of dealing with these problems, which shall be covered in greater detail in the next section. 48 3.4 Censoring Censoring is a form of missing data problem which is common in survival analysis. A thorough understanding of censoring and its role are crucial to completing a comprehensive analysis of the survival data, and hence fitting the proportional hazard models. • Left censoring a data point is below a certain value but it is unknown by how much • Interval censoring a data point is somewhere on an interval between two values • Right censoring a data point is above a certain value but it is unknown by how much • Type I censoring occurs if an experiment has a set number of subjects or items and stops the experiment at a predetermined time, at which point any subjects remaining are right-censored • Type II censoring occurs if an experiment has a set number of subjects or items and stops the experiment when a predetermined number are observed to have failed; the remaining subjects are then right-censored • Random (or non-informative) censoring is when each subject has a censoring time that is statistically independent of their failure time. The observed value is the minimum of the censoring and failure times; subjects whose failure time is greater than their censoring time are right-censored Ideally, in a medical scenario (involving patient monitoring based on life and death) such as this, both the birth and death dates of a subject are known, in which case the lifetime is known. If it is known only that the date of death is after some date, this is called right censoring. Right censoring will occur for those subjects whose birth date is known but who are still alive when they are lost to follow-up or when the study ends. If a subject’s lifetime is known to be less than a certain duration, the lifetime is said to be left-censored. It may also happen that subjects with a lifetime less than some threshold may not be observed at all: this is called truncation. Note that truncation is different from left censoring, since for a left censored datum, we know the subject exists, but for a truncated dataset, we may be completely unaware of the subject. Truncation is also common. In a so-called delayed entry study, subjects are not observed at all until they have reached a certain age. Left censoring of data can occur when a person’s survival time becomes incomplete on the left side of the follow-up period for the person. As an example, we may follow up a patient for breast cancer from the time of them being diagnosed. We may never know the exact time that said patient will have had the tumour for. 49 Looking at the censoring in the context of our dataset, the ”birth” of our patient is now considered the point at which they were first diagnosed with Breast Cancer. At the end of the study, there are three considered and categorised outcomes which each patient shall face: They shall either - Have survived beyond the period of study (Often the most likely scenario, as realistically studies cannot extend to decades, and the expected survival time of any diagnosed patient far surpasses this) - Have died as a direct or indirect result of having breast cancer - Have left before the end of the study The latter of these outcomes can be from a variety of reasons; the patient may have decided (through poor health or otherwise) that they no longer wished to be monitored, they might have (as a direct or indirect result of breast cancer) suffered other serious illnesses which became more serious etc. The table 3.1 provides more of an insight into the nature of the censoring in this particular study, where 0 = The patient was still alive at the final point of study 1 = The patient died as a result of Breast Cancer 2 = The patient has left the study for other reasons For obvious reasons, 3.1 provides further justification of the decision to focus the study on Centre (1). The nature of this categorisation may seem relatively straightforward, however its application and legitimacy divides opinions of statisticians around the world. The main point of discussion being how to draw the line between ”related” and ”unrelated” deaths? How can one be certain that the ”unrelated” deaths are not as an indirect result of having a tumour ? For example, following surgery a patient may become more susceptible to other potentially fatal infections. Lets say a patient has undergone chemotherapy, which can significantly repress a bodies immune system, and they later are diagnosed and killed with pneumonia. For the case of the data monitoring and collection, this death would usually be classified as death by other causes. Naturally these questions open up a whole new range of problems, and unfortunately cannot be fully addressed in this report (due to limited information on the patients, time constraints etc). As a result, this study shall neglect this eventuality, and concentrate on analysing the censored data in the form it is presented. While it is likely that left censoring is present within this dataset, and that some of the patients may have had had the tumour for some ammount of time before they were diagnosed, for the purposes of this project, all patients will be considered to be right censored. 50 Centre 1 1-5 6-10 Patients 2203 7785 9180 0 70.1% 73.2% 72.8% 1 22% 12.8% 11.4% 2 7.9% 3.8% 3.4% Missing info on Censoring 0% 10.2% 12.4% Table 3.1: Censoring Percentages across all centres The vast majority cases in Centre (1), and indeed the entire dataset are Right Censored. 3.5 Purpose and Presumption The 3 main goals of Survival Analysis are: 1. To estimate and interpret survival characteristics, for example looking a Kaplan Meier plots of our variables to see how they behave against survival 2. Compare survival for different groups, for example using the log-rank test to see how different outcomes of a variable effect survival 3. To asses relationship of Explanatory variables to survival time using a Cox Proportional Hazards model These all run under two main assumptions: 1. Independent Observations 2. Independent Censoring Independent observations, speaks for itself, the outcome of one patient will in no way affect the outcome of another. Independent censoring: Consider two random individuals both alive at time t, such that the first individual is not censored and the second individual is censored at some point before t. Censoring is called independent if these individuals have equal chances of survival. 3.6 Non-parametric Survival Analysis Once the time to event data has been collected, the first task is to describe it usually this is done graphically using a survival curve. Visualisation plays a large role in Survival Analysis and allows the appreciation of temporal patterns in the data. It also helps us to identify an appropriate distributional form for the data. If the data are consistent with a parametric distribution, then parameters can be derived to efficiently describe the survival pattern and statistical inference can be based on the chosen distribution. Non-parametric 51 methods are used when no theoretical distribution adequately ts the data. There are three non-parametric methods for describing time to event data: • the Kaplan-Meier method • the Life table method • the Nelson-Aalen method All three of these methods have their place in modern day Survival Analysis, although Kaplan-Meier is generally accepted to be the most effective technique. 3.6.1 Kaplan-Meier Estimates Kaplan-Meier estimators shall play a crucial role in determining the varying effects which the variables have on survival. The Kaplan-Meier curves which these estimators yield show a plot of percentage survival against time. They are often a used in survival analysis provide a very digestible graphical output, which can be easily interpreted by those less familiar with the statistical concepts at work. The Kaplan-Meier method is based on individual survival times and assumes that censoring is independent of survival time (that is, the reason an observation is censored is unrelated to the cause of failure). Kaplan-Meier estimator of survival at time t is shown in (3.7). S(ti ) = Q 0≤j≤n P r(T > ti | T ≥ ti ) = S(tj−1 )P r(T > tj | T ≥ tj ) nj − mj for 0 ≤ t ≤ T = S(tj−1 ) nj (3.7) Here tj , j = 1, 2, ....n is the total set of failure times recorded, T is the failure time with distribution F and density f , mj is the number of failures at time tj , and nj is the number of individuals at risk at time tj . Note that: 1. for each time period the number of individuals present at the start of the period is adjusted according to the number of individuals censored and the number of individuals who experienced the event of interest in the previous time period, and 2. for ties between failures and censored observations, the failures are assumed to occur first A note on Kaplan and Meier....... Both were students of the famous John Tukey. In 1952, Paul Meier started working on the duration of cancer while at Johns Hopkins University, Chicago, USA. Edward Kaplan later started working on the lifetime of vacuum tubes in the repeaters of sub-oceanic telephone cables while at Bell labs. They independently submitted their research on survival times to the Journal of the American 52 Statistical Association, whose editor encouraged them to submit a joint paper, which they did: Kaplan & Meier (1958). Here is an example of survival times (in months) taken directly from our dataset, the +s indicate censoring has occurred: 88 66 179+ 188+ 121 136 179 177+ So the uncensored survival times show that the patient has deceased from breast cancer that long after the study began. The censored times indicate that the patient was still alive and at that age once the study period had expired OR that they had died from other causes during the same period. The table 3.6.1 below shows how a Kaplan-Meier would be computed from the above survival times by taking into account censoring. The resulting curve is shown in Figure 3.1 - notice the red line’s corresponding to the values of S(tj ). • t(j) = Ordered failure times • mj = N o of patients at tj ) • nj = # failure times at t(j) • qj = # censored in [tj , tj+1 ) • S(tj ) = #survivingpastt(j) n(j) t(j) mj nj qj S(tj ) nj − mj t(0) = 0 m(0) = 8 n(0) = 0 q(0) = 0 S(t(j) ) = S(tj−1 ) nj 8−7 × 1 = 0.875 66 8 1 0 8 88 7 1 0 0.75 121 6 1 0 0.625 136 5 1 0 0.5 177 4 0 1 0.5 179 3 1 1 0.335 188 3 0 1 0.335 Table 3.2: Computation of Kaplan-Meier curve from section of Survival 53 0.0 0.2 0.4 ^ S(t) 0.6 0.8 1.0 Figure 3.1: Kaplan-Meier Curve computed from Table 3.6.1 0 50 100 t 54 150 3.6.2 Greenwoods formula The KaplanMeier estimator is a statistic, and several estimators are used to approximate its variance. Once such approximation is: X qj ˆ j )) = S(tj )2 (3.8) V ˆar(S(t n (nj − qj ) t <t j j This result in (3.2) is known as Greenwoods formula. Its derivation may be questioned because it conditions on the nj which are random variables, but the result is in the spirit of likelihood theory, conditioning on all observed quantities, and has been justied ˆ rigourously. Peterson (1977) has shownq that the K-M estimator S(t) is consistent, and ˆ − S(t)) converges in law to a Gaussian Breslow and Crowley (1974) show that n(S(t) process with expectation 0 and a variance-covariance function that may be approximated using Greenwoods formula. As already been seen, Kaplan-Meier curves are very good at giving an immediate idea of the behaviour of different groups with survival, especially when the variable of interest can only take 2-4 values. One of the good things about K-M curves are that they are very quick and easy to interpret, and as such one can skim through large amounts of variables/outcomes and inference behaviour in a small space of time. Since NPI is a continuous variable and can take many values, there are many survival curves on the Plot 3.2, which is quite frankly a mess. Not much information at all can be taken from this so a way of organising the NPI data must be found such that it will not result in the Survival curves looking like a box of string. If the NPI is intuitively broken down into groups, then making sense of its relationship with survival is a possibility. Spitting a continuous variable up into a categorical variable can be a delicate issue as there are several points which need to be addressed: - How many categories will the variable be group up into? Are there any trends in the variable which can provide clues on the best way to do this? - Shall the categories be of even interval or shall they be adjusted to cater for specific behaviour in its continuous form? - Are there methods in place to help with these decisions? In the case of Figure 3.2, the variables were separated into 4 groups with the same intervals as the index (In Table 1.8). These intervals are not of even length, so the scientists behind the NPI decided that there were patterns in the Prognostic Indexs relationship with survival which suggested that intervals of equal length were not appropriate. Figure 55 0.0 0.2 0.4 ^ S(t) 0.6 0.8 1.0 Figure 3.2: Kaplan-Meier Curves of Uncategorised NPI Scores 0 50 100 150 t 56 200 0.2 0.4 ^ S(t) 0.6 0.8 1.0 Figure 3.3: Kaplan-Meier Curves of NPI Groups with 5-year and 10-year Survival Times 0.0 5−Year Survival 10−year Survival 0 50 100 150 200 t 3.3 and shows the survival curves of the 4 groups of the NPI. There are methods of checking whether these categories are significantly different from one another but from just looking at the plots we can deduce information about the NPI scores from this set of 2203 patients. The curves are in the right ordering as we might expect, with 2 < S ≤ 2.4 being the group with highest survival and S > 5.4 being at the bottom, where S is the a patients NPI score at the beginning of the study. There is a small amount of overlap between 2 < S ≤ 2.4 and 2 < S ≤ 3.4 categories of the index, but the different curves seem to reflect the chosen intervals of index well. If you refer to A.2 in the appendix, there is minimal overlap of CIs once again with exception to the top two. The 5-year survival times from this are displayed in Table 2.3. This shows that the NPI was miles off with its 5-year survival times. Coincidently, it seems like the actual 10-year survival time is closer to the 5-year survival times postulated by the index. Another continuous variable which might have more predictive power when split into categories and is also currently useless in a KM-plot is tumour size. History suggests that 57 0.2 0.4 ^ S(t) 0.6 0.8 1.0 Figure 3.4: Kaplan-Meier Curves of Tumour Size (cm) categorised into 3 groups 0.0 Low to 1.5cm 1.51 to 2.5cm 2.51cm+ 0 50 100 150 200 t tumour size certainly is a factor in determining survival, so some time evaluating different ways of using it to predict survival would be well spent. A variable of intrigue which was touched on earlier was the tumour type, no emphasis had been put on this variable as a potential prognostic factor but if an object has enough variation biologically to be placed in to categories, then will its behaviour against survival still have enough variation to be categorised? Eight of these types all cross over and follow very similar patterns of survival, however there are two types which stand out. One has visibly better survival time then the other types, but there are so few cases of this that we cannot statistically prove any significant difference. And secondly there is a type which has visibly lower survival times than the others. Now, this variable has more points, so there may be a chance that this type is significantly different to the others. The method which would usually be used to deduce this plays a large role in Survival Analysis and is known as the Log-Rank Test. Interesting finding: While age does not seem to play an important part in determining survival based on Proportional Hazards (p=0.21); it may be of interest to see how age 58 0.0 0.2 0.4 ^ S(t) 0.6 0.8 1.0 Figure 3.5: Kaplan-Meier Curves of different types of Tumour 0 50 100 150 t 59 200 0.2 0.4 ^ S(t) 0.6 0.8 1.0 Figure 3.6: Kaplan-Meier Curves for different age groups 0.0 Under 40 In their 40's In their 50's 60 and over 0 50 100 150 200 t behaves as an ordinal categorical variable as opposed to a continuous one. Splitting the age at first operation up into 4 groups: ≤ 40 (185 patients), 40 < to ≤ 50 (554 patients), 50 < to≤ 60 (761 patients) and > 60 (703 patients). As shall be seen later, no significance was found in R through Cox Proportional Hazards(p=0.34). Every so often however, intrigue persists, and the exploration of this idea continued. Figure 3.6 shows ploted survival estimates of the 4 age categories formed. Immediately, there was graphical evidence that those diagnosed with breast cancer at younger than 40 would have significantly less percentage survival than other age groups (especially between 5-8 years where the survival is a full 10% less than the other groups, which themselves do not differ between each other in any significant way). A possible reason for this is that more people at a younger age left the study (censored = 2) or another reason could be the smaller numbers in that group. Kaplan-Meier estimates are invaluable in modern day survival analysis, but there are a few restrictions which can occasionally make it difficult to use them effectively: 1. Requires categorical predictors, as we saw earlier we had to order the continuous variables into groups which can often lead to significant loss of information 60 2. Cannot accommodate time dependent variables; the cumulative survival approach the method takes means that variables which vary over time will not be explained. Thankfully the variables we are looking at are independent of time, so this is not an issue 3. The don’t control for covariates 4. Descriptive, they are very useful at providing insight on the data but no real statistically significance can be formally derived 3.6.3 Life Table Method The life table method (also known as the actuarial or Cutler Ederer method) is an approximation of the Kaplan-Meier method. It is based on grouped survival times and is suitable for large data sets. The life table method assumes that subjects are withdrawn randomly throughout each interval therefore, on average they are withdrawn half way through the interval. This is not an important issue when the time intervals are short, but bias may introduced when time intervals are long. This method also assumes that the rate of failure within an interval is the same for all subjects and is independent of the probability of survival at other time periods. Life tables are produced from large scale population surveys (e.g. death registers) and are used less-frequently these days. Generally speaking, the Kaplan-Meier method is preferred because it is less prone to bias. 3.6.4 Nelson-Aalen Estimators The Nelson - Aalen estimator can be used to provide a non-parametric estimate of the cumulative hazard rate function. Instantaneous hazard is dened as the proportion of the population present at time t that fail per unit time. The cumulative hazard at time t, H(t) is the summed hazard for all time up to time t. The relationship between cumulative hazard and survival is as follows: H(t) = ln[S(t)] or S(t) = eH(t) The Nelson-Aalen estimator of cumulative hazard at time t is dened as: X H(t) = dj /rj , for 0 ≤ t ≤ t+ j:tj ≤t Intuitively, this expression is estimating the hazard at each distinct time of death tj as the ratio of the number of deaths to the number exposed.The cumulative hazard up to time t is simply the sum of the hazards at all death times up to t, and has a nice interpretation as the expected number of deaths in (0, t] per unit at risk. This estimator has a strong justification in terms of the theory of counting processes. 61 0.2 0.4 ^ S(t) 0.6 0.8 1.0 Figure 3.7: Kaplan-Meier Curves for Endocrine Therapy 0.0 With Therapy Without Therapy 0 50 100 150 200 t 3.7 Log-Rank Test Previously, this report was exploring ways of clarifying whether there were significant differences between groups in terms of survival. There was often strong suggestion from the Kaplan-Meier curves that there were significant difference, but there was no way of proving this using a simple plot. The Log-Rank test can be used perform comparisons of survival curves using hypothesis tests. It is always of interest to measure the effect a kind of therapy has on patients survival; The Kaplan-Meier plot in 3.7 shows that there are obvious differences in the survival of patients who have had Endocrine (Hormonal) therapy and those who have not. But the Log-Rank can be used to formally test if there is significant between different groups, Rodriguez (2001). In strange way, making any inferences of the performances of the therapies going on here is unlikely. These are not Clinical trials, so are not comparing people who are not on the therapy against people in the same condition who are not on the treatment (placebo). The patients who are not on one kind of therapy will probably be on a range of other therapies. It is very difficult to compare treatments for something as severe as cancer as there are many other underlying factors such as the unpredictability of the tumour and other therapies the patient is on. For example the situation in 3.7 had the people who were not on endocrine therapy 62 surviving longer than those who were on it, one might be thinking: Is this treatment really working? In reality it is difficult to gauge a performance of this therapy as those who are not using hormonal therapy may have further developed Cancer and using harsher treatments such as Chemotherapy. Despite having a more intense treatment, their survival chances will be less than those who do not require that treatment at that moment in time. While its application in the above example is not in very useful for the purposes of our study, there were some issues which were left earlier on which can be resolved using this method. One of these was determining if there was significant difference in groups of tumour type and also between the groups we created for the size of tumour. Using the ”survdiff” function from the R survival package, the following output ensues from Tumour Type: Call: survdiff(formula = surv ~ type, data = data) n=2195, 8 observations deleted due to missingness. N Observed Expected (O-E)^2/E (O-E)^2/V type=1 1286 346 267.91 22.759 51.054 type=3 148 25 34.32 2.530 2.729 type=4 29 2 6.86 3.444 3.502 type=5 33 3 7.39 2.609 2.656 type=6 4 0 1.03 1.028 1.033 type=7 120 1 30.29 28.323 30.304 type=8 5 0 1.46 1.455 1.463 type=11 536 103 128.39 5.021 6.854 type=12 34 5 7.35 0.753 0.766 Chisq= 68.2 on 8 degrees of freedom, p= 1.12e-11 The p−value* shows straight away that there is significant difference between the groups. Looking back at Figure 3.6, the younger age group seemed to have a slightly worse survival time than the others - but at that time there was no concrete way to test these suspicions. Step in the Log-Rank test (and R!) * The p-value is the probability of obtaining a result as extreme or more extreme than the one observed based on the current sample, given the null hypothesis is true. Call: survdiff(formula = surv ~ age.cat) 63 0.2 0.4 ^ S(t) 0.6 0.8 1.0 Figure 3.8: Kaplan-Meier Curves for Menopausal Status 0.0 Pre−Menopause Post−Menopause 0 50 100 150 200 t age.cat=1 age.cat=2 age.cat=3 age.cat=4 N Observed Expected (O-E)^2/E (O-E)^2/V 185 54 39.6 5.243 5.723 554 123 128.1 0.206 0.281 761 155 168.4 1.063 1.632 703 153 148.9 0.113 0.164 Chisq= 6.6 on 3 degrees of freedom, p= 0.0842 As p= 0.0842 > 0.05, there is marginally insignificant difference in survival between these age groups. That is not to say that there is NO difference, and there may well be other factors contributing to these differences such as censoring. Menopause causes a great deal of debate amongst scientists n breast cancer research. The original NPI did not consider it an import factor, as have many other since but looking at the K-M plot 3.8, it seems like there may be slight contrasts in survival between those who have/havent gone through menopause. Call: survdiff(formula = surv ~ meno, data = data) 64 N Observed Expected (O-E)^2/E (O-E)^2/V meno=1 817 190 185 0.1448 0.235 meno=2 1386 295 300 0.0892 0.235 Chisq= 0.2 on 1 degrees of freedom, p= 0.628 The log rank test above shows that there are no significant differences between the survival of the two groups. Despite this, the menopausal status of a patient is still something which could be looked into - it is obviously an indicator of age, but we have already seen that age is not a large factor in survival. The K-M plot 3.8 shows that while there is substantial cross over between the two groups, it is post-menopausal women who tend to have the slight edge over those younger. Could this be something to do with the fluctuating hormonal levels of a Pre-menopausal women? Perhaps there are some women who have more volatile menstrual cycles, which can affect a women’s Oestrogen levels: See the Appendix A.3. However, it should be remembered that Kaplan-Meier curves are only informative and do not provide solid evidence of significance. Looking back to the geography of Breast Cancer, it seemed that those who gave birth at an early age in countries such as Mongolia were much less susceptible to breast cancer. These are all interesting trends all of which could be relevant to breast cancer survival, and hopefully continued research will reveal new information on Pre-Menopausal women and their association with breast cancer. 65 Chapter 4 Cox Proportional Hazard Models Cox Proportional Hazard Models are a class of survival models in statistics, first proposed by Sir David Cox. Survival models relate the time that passes before some event occurs to one or more covariates that may be associated with that quantity. In a proportional hazards model, the unique effect of a unit increase in a covariate is multiplicative with respect to the hazard rate. For example, taking a drug may halve one’s hazard rate for a stroke occurring, or, changing the material from which a manufactured component is constructed may double its hazard rate for failure. Other types of survival models such as accelerated failure time models do not exhibit proportional hazards. These models could describe a situation such as a drug that reduces a subject’s immediate risk of having a stroke, but where there is no reduction in the hazard rate after one year for subjects who do not have a stroke in the first year of analysis. As mentioned, survival analysis typically examines the relationship of the survival distribution to covariates. Most commonly, this examination entails a specification of a linear-like model for the log-hazard. For example, a parametric model based on the exponential distribution may be written as loghi (t) = α + β1 xi1 + β2 xi2 + ..... + βk xik (4.1) or, equivalently hi (t) = exp(α + β1 xi1 + β2 xi2 + ..... + βk xik ) (4.2) that is, as a linear model for the log-hazard or as a multiplicative model for the hazard. Here, i is a subscript for the observations, while the x’s are the co-variates. The constant n this model represents a kind of log-baseline hazard, since loghi (t) = α when all of the x’s are equal to zero. 66 The Cox model, in contrast leaves the baseline hazard function α(t) = logh0 (t) unspecified: loghi (t) = α(t) + β1 xi1 + β2 xi2 + ..... + βk xik (4.3) or, again equivalently, hi (t) = h0 (t)exp(β1 xi1 + β2 xi2 + ..... + βk xik ) (4.4) This model is semi parametric because while the baseline hazard rate can take any form, the covariates enter the model linearly. Now, if looking at two different observations (or patients), say, i and i0 that differ in their x-values, with corresponding linear predictors ηi = β1 xi1 + β2 xi2 + ..... + βk xik (4.5) and also ηi0 = β1 xi0 1 + β2 xi0 2 + ..... + βk xi0 k (4.6) The Hazard ratio for these two observations, hi (t) h0 (t)eηi = hi0 (t) h0 (t)eηi0 eηi = η0 ei ηi −ηi0 =e (4.7) which is independent of time t. Consequently, the Cox model is a proportional hazards model. Even though the baseline hazard is unspecified, the cox model can still be estimated by the method of partial likelihood, developed by Cox in the same paper in which he developed his original Proportional Hazards model. 4.0.1 Log Likelihood test The Log Likelihood test will play a major part in the model building process. Its main function is to compare two models, using hypothesis testing. The main criterion for this test is that one model is nested in the other, with the nested model being the null hypothesis. An example would be comparing a model with 3 variables a, b, c against its 67 nested model with just two variables a, b to determine whether or not a models performance improves with the inclusion of the variable c. A test would then be performed to determine whether or not the model is significantly better than the nested model. The Log-Likelihood for the nested model is always going to be smaller than the alternative model, but the p-value resting from the test will determine whether this significant or not: If p < 0.05 then the null hypothesis is rejected and the alternative model is deemed to be significantly better than the null model, conversely if p > 0.05 then the alternative model is rejected so stick with the original model. Very conveniently, this test is performed automatically when one is fitting models using R; but to have s better idea of how it works, next is an example of the Log-Likelihood test being used on the data. Given a Proportional Hazards model with m variables h(t | x1 , ...., xm ) = h0 (t)eβ1 x1 +β2 x2 +.....+βm xm (4.8) where xi are the co-variates such as Tumour Size, Menopausal Status etc. with βi their respective coefficients and baseline hazard function h0 (t). The goal is to test whether or not the model h(t | x) is improved with the inclusion of x1 =Tumour Grade. Beginning with the null hypothesis H0 : β1 = 0 (4.9) H1 : β1 6= 0 (4.10) With composite Hypothesis This is essentially testing whether a proposed model h(t | x1 ) = h0 (t)eβ1 x1 (4.11) is significantly (α = 0.05) at least as accurate at predicting survival than the null model. Using the test statistic L(x0 | t, β) D = −2ln L(x1 | t, β) = −2ln(Likelihood of Null Model) + 2ln(Likelihood of Composite Model) = 182.2 (4.12) 68 Reject H0 if D > χ21 at α = 0.05. Since comparing the difference of one parameter in the model, there is 1 degree of freedom for the χ2 statistic, which is equal to 3.84 with 95% confidence. So D > 3.84 and reject H0 , concluding that x1 =Tumour Grade is an important factor in determining survival. 4.0.2 Other Methods of Model Checking Two alternative approaches for testing models is the Wald and Score statistics (which are once again provided in the output of a model summary in R), which both also use the likelihood to assess the models fit Wald Statistic.... Again, say testing for β1 6= 0. The Wald test is that if { β1 2 } > χ21 SEβ1 (4.13) then include x1 in the model. This is far simpler than the above test, but for data that produce large estimates of the coefficient, the standard error is often inflated, resulting in a lower Wald statistic, and therefore the explanatory variable may be incorrectly assumed to be unimportant in the model. Score Statistic.... uses a simple null hypothesis that β is equal to some particular value β1 .The statistic to test H0 : β = β1 is S(β1 ) = U (β1 ) I(β1 ) (4.14) (4.15) where U (β1 ) = ∂logL(β1 | x) ∂β1 (4.16) and fishers information I(β1 ) = −E ∂2 logf (X; β1 ) | β1 ∂ 2 β1 (4.17) The score test once again uses S(β1 ) v χ21 to test the hypothesis. The main advantage of the Score-test is that it does not require an estimate of the information under the alternative hypothesis or unconstrained maximum likelihood. 69 4.1 Model Building using R With this recently established understanding of the Proportional Hazards Model and how it works, its now time to impute covariates in to the model in R and determine their effectiveness by comparing them against the null model using the log likelihood ratio test. The results are compared against a chi squared distribution at 95% confidence. This is 3.84 for 1 degree of freedom. NB Care needs to be taken to not to confuse the log likelihoods when comparing models in R; the output from a model automatically shows the log likelihood ratio against the GLOBAL NULL model. The output will provide p-values for the test of each co-variate being included/exclude in the model. Three paths......remembering back, Histological Grade is made up of three components; Mitosis, Tubular Formation and Pleomorphism. Perhaps one of these sub-variables is more influential in predicting survival and can incorporated into the model. Figure 4.1 provides some insight. Tubular formation and Mitosis have very few patients of grade 1. Grades 2 and 3 have good separation in Mitosis and Pleomorphism, less so with Tubular formation, although the log-rank test shows significant difference (p= 4.29e-13). It might be of interest to have a look at the survival of the summation of these grades in Figure 4.2, which shows the 7 groups. Remember the Tumour grade used in the current index is calculated by the summation of these three factors, but can only take three values, so perhaps a new variable of this form may maintain more information on the tumour. So initially the model building process will be ran three times: 1. with the Histological Grade as a complete variable 2. with the Histological Grade removed and replaced by its components as three new variables (M,P,T) 3. with the summation of the three component values (P+M+T)as a single variable 4.1.1 Model Building The following shows the results of the full model (3) being run through the data in R. The Variables associated with these R names can be seen in Table 1.8. Call: coxph(formula = surv ~ lymphnode + size + type + ER + ERc + radio + axradio + chemo + endo + endotype + Op.type + detection + axillary + total + LVI + age + nodespositive, data = data) n= 2202, number of events= 485 70 Figure 4.1: Kaplan-Meier Curves of Histological Grades of Tumour, as well as of the three factors involved in calculating the grade: Pleomorphism P, MitosisM and Tubule Formation T 100 0.0 150 200 0 100 150 200 t t T Histological Grade 0.0 0.4 ^ S(t) 0.4 0.0 ^ S(t) 50 0.8 50 0.8 0 0.4 ^ S(t) 0.4 Grade I Grade II Grade III 0.0 ^ S(t) 0.8 M 0.8 P 0 50 100 150 200 0 t 50 100 150 t 71 200 0.4 ^ S(t) 0.6 0.8 1.0 Figure 4.2: Estimated Survival curves for 7 groups of new variable Total=T+P+M 0.0 0.2 T+M+P=3 =4 =5 =6 =7 =8 =9 0 50 100 150 t 72 200 coef exp(coef) se(coef) z lymphnode 0.321584 1.379311 0.102964 3.123 size 0.186466 1.204983 0.057591 3.238 type 0.019496 1.019687 0.013005 1.499 ER -0.001122 0.998879 0.001072 -1.047 ERc -0.105815 0.899591 0.185732 -0.570 radio -0.104542 0.900737 0.077054 -1.357 axradio 0.081222 1.084611 0.068158 1.192 chemo 0.222189 1.248808 0.060984 3.643 endo 0.038994 1.039764 0.135264 0.288 endotype -0.008773 0.991266 0.034322 -0.256 Op.type 0.112873 1.119490 0.107827 1.047 detection 0.280256 1.323469 0.121690 2.303 axillary 0.097829 1.102774 0.120183 0.814 total 0.346648 1.414319 0.044155 7.851 LVI -0.413924 0.661051 0.103112 -4.014 age 0.012384 1.012461 0.005477 2.261 nodespositive 0.129774 1.138571 0.025323 5.125 --Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 lymphnode size type ER ERc radio axradio chemo endo endotype Op.type detection axillary total LVI age Pr(>|z|) 0.001788 0.001205 0.133856 0.295277 0.568869 0.174867 0.233389 0.000269 0.773131 0.798264 0.295190 0.021277 0.415646 4.11e-15 5.96e-05 0.023756 2.98e-07 1 exp(coef) exp(-coef) lower .95 upper .95 1.3793 0.7250 1.1272 1.6877 1.2050 0.8299 1.0764 1.3490 1.0197 0.9807 0.9940 1.0460 0.9989 1.0011 0.9968 1.0010 0.8996 1.1116 0.6251 1.2946 0.9007 1.1102 0.7745 1.0476 1.0846 0.9220 0.9490 1.2396 1.2488 0.8008 1.1081 1.4074 1.0398 0.9618 0.7976 1.3554 0.9913 1.0088 0.9268 1.0602 1.1195 0.8933 0.9062 1.3829 1.3235 0.7556 1.0426 1.6800 1.1028 0.9068 0.8713 1.3957 1.4143 0.7071 1.2971 1.5422 0.6611 1.5127 0.5401 0.8091 1.0125 0.9877 1.0017 1.0234 73 ** ** *** * *** *** * *** nodespositive 1.1386 0.8783 1.0834 Concordance= 0.77 (se = 0.014 ) Rsquare= 0.182 (max possible= 0.961 ) Likelihood ratio test= 442.5 on 17 df, Wald test = 478.5 on 17 df, Score (logrank) test = 614.4 on 17 df, 1.1965 p=0 p=0 p=0 After fitting the three full models, it is evident that all are very similar so there is not much to go on but initial model construction shall begin with choice (3) with total (T+P+M). The Appendix contains the summary of the R output of the other two models. The likelihood ratio test shows that models (1), (2) and (3) have likelihood ratio scores of 433.6, 445.1 and 442.5 respectively. This is very little to go on at this point, and on first glance the second model would be chosen. But remembering that this has two extra variables, and operating on 19 df, so χ219 = 30.14 > χ217 = 27.59 which pushes the choice into the favour of (3). In addition the second model has Pleomorpism(P) as insignificant, to remove it at this early stage might be slightly ruthless. LVI = x7 , chemo = x4 and total = x6 are all highly significant(***). lymphnode= x1 and size = x2 are significant(**) terms while age = x8 and detection = x5 are marginally significant(*). While Tumour Type = x3 is not significant at this point, it is not far off and the analysis on it earlier suggest that it might yet play a role. So the first model with 8 co-variates is: h(t | xi ) = h0 (t)exp(β1 x1 + β2 x2 + β3 x3 + β4 x4 + β5 x5 + β6 x6 + β7 x7 + β8 x8 ) (4.18) Based on earlier analysis, Tumour Type 1 had significantly lower survival than the other types, so type shall be re-categorise to be new.type where (1=Tumour Type 1, 2= Any other tumour type). The same will be done for age and will be recategorised into a new variable age.cat with 4 distinct categories 40 >,40 − 50,> 50 − 60 and 60 <. Call: coxph(formula = surv ~ lymphnode + size + new.type + chemo + detection + total + LVI + age.cat, data = data) n= 2195, number of events= 485 lymphnode coef exp(coef) se(coef) 0.65777 1.93048 0.06761 74 z Pr(>|z|) 9.729 < 2e-16 *** size 0.19650 1.21714 0.05515 3.563 0.000367 new.type 0.01430 1.01440 0.11937 0.120 0.904648 chemo 0.18015 1.19740 0.02856 6.307 2.84e-10 detection 0.33580 1.39906 0.12175 2.758 0.005812 total 0.35735 1.42954 0.04110 8.696 < 2e-16 LVI -0.34246 0.71002 0.09974 -3.434 0.000596 age.cat 0.14389 1.15476 0.05005 2.875 0.004043 --Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 *** *** ** *** *** ** exp(coef) exp(-coef) lower .95 upper .95 lymphnode 1.930 0.5180 1.6909 2.2040 size 1.217 0.8216 1.0924 1.3561 new.type 1.014 0.9858 0.8028 1.2818 chemo 1.197 0.8351 1.1322 1.2663 detection 1.399 0.7148 1.1021 1.7761 total 1.430 0.6995 1.3189 1.5494 LVI 0.710 1.4084 0.5839 0.8633 age.cat 1.155 0.8660 1.0469 1.2738 Concordance= 0.761 (se = 0.014 ) Rsquare= 0.168 (max possible= 0.961 ) Likelihood ratio test= 404.3 on 8 df, Wald test = 390.9 on 8 df, Score (logrank) test = 441.6 on 8 df, p=0 p=0 p=0 lymphnode has now become the most significant variable; this is because there were other variables which followed similar patterns and their removal has increased Lymph Nodes power in the model. The re-categorised variable age.cat has seen it become significant rather than marginally significant while new.type is still insignificant so it shall be removed from the model. LVI is the only negative term in the model, as it is numbered 1=Yes and 2=No; unlike all the other variables, if the value of LVI increases so does the survival time for that patient. This is quite confusing, but it is merely a case of the renumbering the variable. It was decided that chemo would be removed from the model. Yes, it is a significant factor in determining survival, as one would expect: A patient who undergoes Chemotherapy is likely to be in worse condition than someone who does nt, thus it is a good indicator of survival. But this whole process is about establishing a Prognostic Index which can weigh up all the current information on a patient soon after they are diagnosed, and it is unlikely that they will have had any kind of Chemotherapy by that point. Ideally a model would be constructed based around the information that 75 will be readily available for a patient as soon as possible so an Index score can be determined and the correct course of action can be taken. It does not make sense to be waiting to see whether or no a patient is having Chemotherapy before telling them their survival prospects. The status of age.cat has also been reconsidered. Having been marginally insignificant without categorisation, based on the exploratory analysis it was never going to be largely significant in predicting survival. Yet, at this late stage, there it sits pretty in the model as highly significant (***). This is not true, and after running the log-rank test on this new variable it came to light that there were no significant differences between the 4 groups (p=0.0842). So the second model is; h(t | xi ) = h0 (t)exp(β1 x1 + β2 x2 + β5 x5 + β6 x6 + β7 x7 ) (4.19) Call: coxph(formula = surv ~ lymphnode + size + +detection + total + LVI, data = data) n= 2203, number of events= 485 coef exp(coef) se(coef) z Pr(>|z|) lymphnode 0.63653 1.88991 0.06813 9.343 < 2e-16 size 0.17292 1.18877 0.05481 3.155 0.001607 detection 0.18522 1.20349 0.11644 1.591 0.111660 total 0.33184 1.39353 0.03488 9.514 < 2e-16 LVI -0.35262 0.70285 0.09856 -3.578 0.000347 --Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 exp(coef) exp(-coef) lower .95 upper .95 lymphnode 1.8899 0.5291 1.6537 2.1599 size 1.1888 0.8412 1.0677 1.3236 detection 1.2035 0.8309 0.9579 1.5120 total 1.3935 0.7176 1.3014 1.4921 LVI 0.7028 1.4228 0.5794 0.8526 Concordance= 0.754 (se = 0.014 ) Rsquare= 0.152 (max possible= 0.961 ) Likelihood ratio test= 362.7 on 5 df, p=0 Wald test = 341 on 5 df, p=0 76 *** ** *** *** Score (logrank) test = 379.4 on 5 df, p=0 Now, immediately with the removal of age.cat, the detection goes from being highly significant to being not significant at all. This could perhaps be explained by a possible relationship between older people going for screening more often than younger. In some ways this makes sense, as regardless of whether a person has been screened or not, their survival is always going to be determined by the progression and nature of the tumour. The fact that someone has had their tumour detected earlier on through screening may well reduce the potential risks which the factors can have, but any such changes will be explained by the variables themselves. So there is no need to include method of detection as all the information it can bring is already provided through the 4 remaining variables. Call: coxph(formula = surv ~ lymphnode + size + total + LVI, data = data) n= 2203, number of events= 485 coef exp(coef) se(coef) z Pr(>|z|) lymphnode 0.63797 1.89264 0.06815 9.361 < 2e-16 size 0.18506 1.20330 0.05407 3.423 0.000620 total 0.34047 1.40561 0.03455 9.855 < 2e-16 LVI -0.36264 0.69584 0.09856 -3.679 0.000234 --Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 *** *** *** *** exp(coef) exp(-coef) lower .95 upper .95 lymphnode 1.8926 0.5284 1.6560 2.1631 size 1.2033 0.8311 1.0823 1.3378 total 1.4056 0.7114 1.3136 1.5041 LVI 0.6958 1.4371 0.5736 0.8441 Concordance= 0.752 (se = 0.014 ) Rsquare= 0.151 (max possible= 0.961 ) Likelihood ratio test= 360.1 on 4 df, Wald test = 341.6 on 4 df, Score (logrank) test = 378.9 on 4 df, p=0 p=0 p=0 This leaves the final 4 factors. All of these are highly significant with the model, 3 of which make up the original NPI. Notice that unlike the others, the coefficient for LVI is negative. This is due to the fact that the variable was imputed with outcomes 1=Yes and 2=No, and moving in a positive direction numerically moves in a positive way in terms 77 of survival (No LVI −→ Better Survival), whereas as the other three variables move in a positive direction, the survival moves in a negative direction. Which leaves the third and final model: h(t | xi ) = h0 (t)exp(β1 x1 + β2 x2 + β6 x6 + β7 x7 ) = h0 (t)exp(0.64x1 + 0.19x2 + 0.34x6 − 0.36x7 ) (4.20) 4.2 Index Using the coefficients from the final model, an Index can be derived. I = 0.64x1 + 0.19x2 + 0.34x6 − 0.36x7 (4.21) Where x1 = Lymph Node Involvement x2 = Tumour Size (cm) x6 = Mitosis+Pleomorphism+Tubular x7 = Lymphovascular Invasion (4.22) 4.3 Model Checking After fitting an appropriate PH model for the given data, it is vital to check the goodnessof-fit and the residuals of the fitted model. The goodness-of-fit of a PH model mainly focuses on checking the validity of the assumption of the proportional hazards (i.e. whether the effects of covariates on risk remain constant over time). Violations of PH assumption can be resolved by: • Stratification • Adding time*covariate interaction • Adding other time-dependent version of the covariate 4.3.1 Schoenfeld Residuals For a more general Coxs regression model, the Proportional Hazards property is one of the restrictions to using the model with time-fixed covariates. It assumes that the hazard 78 ratio between two sets of covariates is constant over time, because the common baseline hazard function cancels out in the ratio of the two hazards. However, the impact of time-varying covariates leads the hazard to vary over time, thus violating the assumption of PH in the model. To test this PH assumption of Coxs regression model, Schoenfeld (1980) introduced a dummy time-dependent covariate. A Schoenfeld Residual...... is defined as the covariate value for the individual that failed minus its expected values and yields residuals for each person that failed for each variable OR is the expected value of each covariate at time ti = a weighted average of the covariate, weighted by the likelihood of failure for each individual in the risk set R at time tj Schoenfeld residual = xik − Pj∈R(ti ) i=1 xjk pj (4.23) pj = Probability of event now for ith person R(ti ) = Risk set K = Covariate of Interest Xik = Covariate-value for the person i Plots of these residuals are useful in detecting non proportionality of predicted hazards of the fitted model over the covariate space for each covariate. Under the proportional hazards assumption, the Schoenfeld residuals have the sample path of a random walk; therefore, they are useful in assessing time trend or lack of proportionality. Due to time dependent covariates the generalized linear regression of the Schoenfeld residuals on functions of time gives a non-zero slope.Thus, a non-zero slope is an indication of a violation of the proportional hazard assumption. As with any regression it is recommended to look at the graph of the regression in addition to performing the tests of non-zero slopes. There are certain types of non-proportionality that will not be detected by the tests of non-zero slopes alone but that might become obvious when looking at the graphs of the residuals such as nonlinear relationship (i.e., a quadratic fit) between the residuals. In R, the cox.zph function (also found in the survival package) calculates tests of the proportional hazards assumption for each covariate, by correlating the corresponding set of Schoenfeld residuals with an appropriate transformation of time. rho chisq p lymphnode -0.057036 1.77e+00 1.84e-01 79 19 52 100 −2 0 2 4 ● ● ● ● ●● ●● ● ● ●● ● ●● ● ●●●● ●●● ● ● ● ● ●● ●● ● ● ● ● ●● ● ●● ●● ● ● ● ●● ● ● ●● ● ● ●● ●● ● ●●● ● ● ● ● ● ● ●● ● ● ●● ● ● ●●● ● ● ●● ●●●● ● ●●● ●●● ● ●● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ●● ● ●● ● ● ●●● ● ●●●● ● ● ● ● ●●● ● ●●● ● ● ●● ● ● ● ● ● ●● ● ● ● ●● ● ●●●● ● ● ● ●● ●●● ● ●● ●● ● ●● ●● ● ●● ●● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ●●●● ●● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ●● ● ●● ●● ● ●● ● ●● ●● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ●● ●● ● ● ● ●● ● ● ●●● ●● ● ●●●● ● ●● ● ● ●● ●● ●●● ●●●● ● ● ●●●● ●● ●● ● ●●●● ●●● ● ● ●●● ●● ● ● ● ●●● ● ● ● ● ● ● ●● ●● ● Beta(t) for size 4 2 0 −2 Beta(t) for lymphnode Figure 4.3: Plots of scaled Schoenfeld residuals against transformed time for each covariate in the final CPH model 180 ● ● ● ●● ● ● ● ● ● ● ● ●●● ●●●● ● ●● ● ●● ● ●● ● ●● ●● ● ●● ● ● ● ● ●● ● ● ● ●● ● ●● ● ● ●● ● ● ●●●● ●● ● ●●● ● ● ● ●●● ● ●● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●●● ● ● ● ● ●● ●● ●● ●● ● ● ● ●●●● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ●●● ● ●●●●● ●●●● ● ●● ● ●● ●● ●● ● ●● ● ● ● ● ● ●●●●●●● ● ●●●● ● ● ● ● ● ●● ●●● ● ●● ● ●● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ●● ● ●● ● ●● ● ●● ● ● ●● ● ● ● ● ● ● ● ●● ● ●●●● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●●● ● ● ●● ● ●● ● ● ● ●● ● ● ● ● ●● ● ● ●● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ●●● ●● ● ● ● ● ●● ● ●● ● ●●● ● ●● ●●● ● ● ● ● ● ● ●● ●●●●● ● ● ●●● ● ● ● ●●●●● ●● ●● ● ●● ● ● ● ● ● ●● ●● ● ● ●● ● ● ●● ● ● ●● ● ● ●● ● ● ●● ● ● ● 19 52 52 100 0.005465 -0.344318 -0.000182 NA 4 2 0 −4 180 ● ● ●● ● ● ● ● ● ● ● ● ● ●●●● ● ●● ●● ● ● ● ● ● ●● ● ● ●●● ●● ● ● ● ● ● ●● ● ●● ●●●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ●● ●●●●●●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ●● ● ●● ● ● ●●● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ●●● ●●● ●●●● ●●●● ● ● ●●● ● ● ●● ●●●●● ● ● ● ●●●● ●● ●● ● ● ●●● ● ● ● ● ● ● ●● ●●● ● ● ●● ● ●● ● ● ● ● ● ● ● ●● ● ●● ● ● ●●● ● ● ● ●●● ●● ●●● ● ● ●● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ●● ●● ● ● ●● ● ● ● ● ● ●●● ● ● ●●● ●● ● ●● ● ● ● ●● ● ● ● ● ● ●●● ●● ●● ●● ● ● ● ●● ● ● ● ● ● ● ●● ●●● ● ● ● ● ● ●● ● ●●●●● ●● ● ● ● ●● ●● ● ●● ● ● ● ●● ● ●●● ● ● ●●●● ●●●●●●● ●●● ●● ●● ●● ● ● ●●● ● ● ● ● ● ● ● ● ●●●● ●●● ● ● ● ● ● ●● ● ● ● 19 Time size total LVI GLOBAL −2 ● ●●● ● ● ●● ● ●●●● ● ● ● ● ● ● ●● ●● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ●● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ●●●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ●● ● ● ● ● ●● ● ● ● ● ●● ●● ●● ●● ●● ●●● ● ●●● ● ● ●● ● ●●● ● ● ●● ● ● ● ●● ● ● ● ●● ● ●●● ● ● ●●●●● ● ● ●●● ●●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●●●● ● ● ● ●●●● ● ● ● ● ● ● ● ● ● ●●●● ● ● ● ●● ● ●●●●●● ●● ● ● ● ●● ● ● ●●● ● ●● ●● ●● ● ●● ●●● ● ● ●●● ● ● ● ● ●● ● ● ● ● ● ● ●●●●● ● ● ●● ● ●●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ●● ● ● ● ●●● ●● ●● ● ●● ● ● ●● ●● ●● ● ● ●● ● ●●●● ●● ● ●● ● ●● ●●●● ●● ●● ● ● ● ●● ● ● ● ● ●● ● ●● ●● ●● ● ● ● ● ● ● ● ● ● 19 180 Time Beta(t) for LVI −0.5 0.5 −2.0 Beta(t) for total 1.5 Time 100 52 100 180 Time 1.41e-02 5.17e+01 1.67e-05 5.84e+01 9.06e-01 6.53e-13 9.97e-01 6.32e-12 The output R code above shows us that there is strong evidence of Non-proportional hazards for the variable total. The object returned by cox.zph produces graphs of the scaled Schoenfeld residuals against transformed time. Figure 4.3 shows the Schoenfeld residuals, which are very useful in checking for violation of proportional hazards. The solid line is a smoothing-spline fit to the plot, with the broken lines representing a ±2 standard error band around the fitted line. As expected from the original R output there seems to be more of a trend in the residual plot of total than the other variables. One method, Fox (2002) of perhaps attempting to remove this violation of proportional hazards is to introduce an interaction term into the model. In this case, the model included an interaction term between total and survival to try and remove the violation of PH. As can be seen in the plot below this did not work out as planned, and now LVI and lymphnode violate PH. rho chisq p 80 lymphnode 0.09865 5.4777 size 0.02947 0.3935 total 0.00914 0.0807 LVI -0.01023 0.0595 total:survival 0.31385 98.0970 GLOBAL NA 455.8821 0.0193 0.5305 0.7763 0.8073 0.0000 0.0000 There is a lot of work going on by statisticians around the world in trying to refine the model checking techniques for proportional hazards. The nature of these means that more criteria have to be met than Linear and Generalized Linear models, such as proportional hazards, which makes it difficult to construct a fully working model. For a first attempt, having only one variable violating Proportional hazards is not bad. As such, the prognostic index might well stand a chance of being a useful tool in predicting survival. The interaction term will be swiftly removed and the final model is shown in 4.24. The higher the Index, the worse the Prognosis. I = 0.64LN + 0.19T S + 0.34M P T − 0.36LV I (4.24) Where LN = Lymph Node Involvement TS = Tumour Size (cm) M P T = Mitosis+Pleomorphism+Tubular LV I = Lymphovascular Invasion 81 Chapter 5 Summary, Conclusion & Further Research 5.1 Summary and Conclusion As with the all previous studies, Lymph Nodes was the most significant factor in determining survival (β1 = 0.64). The Histological Grade of the tumour was once again a significant factor, although rather than be used represented as a three point categorisation of Tubular Formation (T), Mitosis (M) and Pleomorphism (P), it would instead be represented as the 7-point categorisation of the summation (total in R), to prevent loss of information. The 7-point categorisation seemed to work slightly better within the model, with (β6 = 0.34 in the final Index. A third model was used which included T,P & M each as individual variables, but it did not fair as well as the other two models. One interesting finding did come from this model in that Pleomorphism was insignificant within the model (p > 0.05) where as Mitosis and Tubular formation we very significant. Lymphovascular Invasion (LVI) was the the fourth variable to come out significant after the model building process, and was present in the final index with β7 = −0.36. The fact that this variable is negative is purely down to the fact it was labelled different to the other factors. Care should be taken if this Index is used on other datasets to make sure that the co-variates are labelled the correct way around. The recategorising of age into the variable(age.total) improved its predictive power within the model, to a level of significance. Although this variable was eventually excluded thorough lack of certainty over it, perhaps in the future more time can be exploring methods to include it in a prognostic index. The same can be said of a patients menopausal status, the Kaplan-Meier estimates showed clear differentiation between pre-menopausal and post-menopausal patients. The log-rank test rendered this difference marginally insignificant, but there is surely scope to find a way of incorporating these difference, in however small a way, in a prognostic index. The fact that menopausal status is directly 82 correlated with age perhaps suggests the possibility that only one of them has a marginal effect on survival, which is reflected in the other. Further interesting points to come from exploratory analysis: - The accuracy of measurement for tumour size seems questionable, with large numbers represented for the factors of 0.5cm, specially for tumour sizes > 1.5cm. With tumour size being a significant factor in the Nottingham Prognostic Index and in the Index generated from this report (β2 = 0.19), this apparent rounding of tumour size effect the 5-year survival scores. It should be noted the knock on effect of this rounding will be minimized in the NPI with Tumour Size weighted with a factor of 0.2 compared with 1 for Lymph node involvement and Histological grade - There is a significant difference in survival for patients having their cancer detected through screening against those who have their tumour detected through other methods. Patients who have their cancer detected through screening generally identify the tumour far earlier, hence improved survival. Figure 2.15 shows clear difference between the survival % ’s, and Log-rank test shows clear difference between the groups (p < 0.05) After deliberation, it was decided Method of detection should not be used in a prognostic index because any decrease in the patients survival chances which result from late detection would already be represented by the biological attributes of the tumour. Those patients who do have regular screening have on average 10% better survival, yet only 32.8% have regular checks. This is perhaps the most striking statistic to come from this report, and findings like this should act towards encouraging the female population to attend regular check ups As well as covering all of the main areas of survival analysis, this report has touched upon less well known techniques. Kaplan-Meier curves proved the most useful method of gaining insight into the behaviour by providing very readable visualisations of the relationship of variables with survival. These visual plots are also beneficial in that they can be used to explain findings to those less familiar with the statistical techniques behind this research. Cox-Proportional Hazard models provide a very efficient way of establishing significant variables, and using R one can very quickly single out important pieces of information on their behaviour within the model. The correct approach to take for checking a model remains an issue, although this report demonstrated their usefulness in determining violations of Non proportionality. 5.2 Further Research This has been a fascinating study to be a part of, and unfortunately there was simply not enough time to achieve all that was hoped from the many points covered in this 83 project, each of which can open the door for further research. Such is the nature of these modern statistical techniques, new questions are always arising and time would be well spent seeking the answer to them. 1. The Prognostic Index designed (4.24) in Chapter 4 currently has no function other than providing information on the factors involved with survival, and their comparative significance through their β-values. With further research, this Index could be configured against data to separate the Index into groups and associate these with % survival scores for different gradings constructed. With this Index and associated Survival Scores, this model could be validated by running patients though the score and comparing their predicted survival with actual survival (and indeed that obtained from the NPI). One point which was touched upon earlier in the report was the possibility of rather than using 5-year scores, using a 7-year survival score. This could also be explored 2. The four factors which were established as being important in predicting survival came from a proportional Hazards model which violated the assumption of proportional hazards. More time could be invested in finding a way to augment the model and the variables inside of it to improve its accuracy and goodness of fit. This report explored Schoenfeld residuals as a method for testing a fit of a CPH model, (which in themselves are many different areas to study) but there are different methods out there such as Martingale residuals which could also be employed in the model checking stages. 3. More of the same research looking at Centres other than (1), as was seen figure 1.8 there are quite significant differences in the average Nottingham Prognostic Index scores. This could be for a number of reasons such as different methods of treatment, or perhaps specialized for specific kinds of breast cancer, all of which would be very interesting to look at. 4. Continue with the scrutiny of the NPI, and continue to gain insight into its shortfalls and what can be learnt from it 5. Read further into the most effective way the three components of Histological Grade can be used as predictive elements. Pleomorphism was insignificant when isolated as a single variable, but more research could go into whether it is wholly insignificant or if it is significant when used with other factors 6. Expand upon the interesting findings on the measurement of tumour size, and deduce whether this could truely be an issue in misrepresentation of a patient within a prognostic index 84 5.3 Acknowledgements I would like to thank Professor Andy Wood of Nottingham University for his continued help and support throughout this eventful project. Special thanks must also go to Dr. Steven Chan of Nottingham City Hospital and Professor Graham Balls for giving me the opportunity to work on a dissertation of this nature. 85 Appendices 86 Appendix A Some further Interesting findings R Output of Full model (1) with Tumour Grade chosen. Call: coxph(formula = surv ~ lymphnode + size + grade + type + ER + ERc + radio + axradio + chemo + endo + endotype + Op.type + detection + axillary + LVI + age + nodespositive, data = data) n= 2202, number of events= 485 (1 observation deleted due to missingness) coef exp(coef) lymphnode 0.299822 1.349618 size 0.197924 1.218870 grade 0.714215 2.042582 type 0.015806 1.015932 ER -0.001346 0.998655 ERc -0.155817 0.855716 radio -0.105538 0.899840 axradio 0.083471 1.087053 chemo 0.221175 1.247542 endo 0.037318 1.038023 endotype -0.012185 0.987889 Op.type 0.122273 1.130062 detection 0.291767 1.338791 axillary 0.092484 1.096896 LVI -0.407942 0.665017 age 0.011634 1.011702 nodespositive 0.132179 1.141312 --- se(coef) 0.103543 0.056937 0.097810 0.012853 0.001070 0.184748 0.074691 0.065329 0.060331 0.135088 0.034601 0.107681 0.121626 0.120481 0.102926 0.005479 0.025645 87 z 2.896 3.476 7.302 1.230 -1.257 -0.843 -1.413 1.278 3.666 0.276 -0.352 1.136 2.399 0.768 -3.963 2.124 5.154 Pr(>|z|) 0.003784 0.000509 2.83e-13 0.218791 0.208654 0.399004 0.157659 0.201357 0.000246 0.782356 0.724723 0.256162 0.016445 0.442710 7.39e-05 0.033712 2.55e-07 ** *** *** *** * *** * *** Figure A.1: Histogram of Survival for patients dying within 10 years of prognosis with predicted survival of 85% 10 120.00% 9 100.00% 8 Number of Patients 7 80.00% 6 5 60.00% Frequency 4 Cumulative % 40.00% 3 2 20.00% 1 0 0.00% NPI Scores Signif. codes: lymphnode size grade type ER ERc radio axradio chemo endo endotype Op.type detection axillary LVI age 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 exp(coef) exp(-coef) lower .95 upper .95 1.3496 0.7410 1.1017 1.6533 1.2189 0.8204 1.0902 1.3628 2.0426 0.4896 1.6863 2.4742 1.0159 0.9843 0.9907 1.0419 0.9987 1.0013 0.9966 1.0008 0.8557 1.1686 0.5958 1.2291 0.8998 1.1113 0.7773 1.0417 1.0871 0.9199 0.9564 1.2355 1.2475 0.8016 1.1084 1.4041 1.0380 0.9634 0.7966 1.3527 0.9879 1.0123 0.9231 1.0572 1.1301 0.8849 0.9150 1.3956 1.3388 0.7469 1.0548 1.6992 1.0969 0.9117 0.8662 1.3891 0.6650 1.5037 0.5435 0.8137 1.0117 0.9884 1.0009 1.0226 88 0.0 0.2 0.4 ^ S(t) 0.6 0.8 1.0 Figure A.2: Kaplan-Meier Curves of NPI with 95% CI 0 50 100 150 t 89 200 0.2 0.4 ^ S(t) 0.6 0.8 1.0 Figure A.3: Kaplan-Meier estimates of Estrogen Receptor Status 0.0 ER Code Negative ER Code Positive 0 50 100 150 t 90 200 nodespositive 1.1413 0.8762 1.0854 1.2001 Concordance= 0.772 (se = 0.014 ) Rsquare= 0.179 (max possible= 0.961 ) Likelihood ratio test= 433.6 on 17 df, p=0 Wald test = 467.1 on 17 df, p=0 Score (logrank) test = 606 on 17 df, p=0 R summary of full model (2) with individual variables T,P,M included Call: coxph(formula = surv ~ lymphnode + size + type + ER + ERc + radio + axradio + chemo + endo + endotype + Op.type + detection + axillary + T + P + M + LVI + age + nodespositive, data = data) n= 2202, number of events= 485 (1 observation deleted due to missingness) coef exp(coef) se(coef) z lymphnode 0.321164 1.378732 0.102950 3.120 size 0.181484 1.198996 0.057674 3.147 type 0.019126 1.019310 0.013274 1.441 ER -0.001077 0.998924 0.001075 -1.002 ERc -0.105176 0.900166 0.185757 -0.566 radio -0.099991 0.904846 0.076269 -1.311 axradio 0.076435 1.079432 0.067495 1.132 chemo 0.221688 1.248182 0.061133 3.626 endo 0.045326 1.046369 0.135799 0.334 endotype -0.006510 0.993511 0.034355 -0.189 Op.type 0.113824 1.120555 0.107772 1.056 detection 0.282151 1.325979 0.121720 2.318 axillary 0.088392 1.092417 0.120346 0.734 T 0.384513 1.468898 0.106985 3.594 P 0.154848 1.167480 0.126398 1.225 M 0.424319 1.528548 0.074003 5.734 LVI -0.414804 0.660470 0.103185 -4.020 age 0.011928 1.012000 0.005483 2.176 nodespositive 0.130908 1.139863 0.025316 5.171 --Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 91 Pr(>|z|) 0.001811 0.001651 0.149618 0.316378 0.571255 0.189844 0.257448 0.000287 0.738552 0.849709 0.290897 0.020447 0.462651 0.000326 0.220545 9.82e-09 5.82e-05 0.029579 2.33e-07 1 ** ** *** * *** *** *** * *** exp(coef) exp(-coef) lower .95 upper .95 lymphnode 1.3787 0.7253 1.1268 1.6870 size 1.1990 0.8340 1.0708 1.3425 type 1.0193 0.9811 0.9931 1.0462 ER 0.9989 1.0011 0.9968 1.0010 ERc 0.9002 1.1109 0.6255 1.2955 radio 0.9048 1.1052 0.7792 1.0507 axradio 1.0794 0.9264 0.9457 1.2321 chemo 1.2482 0.8012 1.1072 1.4071 endo 1.0464 0.9557 0.8018 1.3655 endotype 0.9935 1.0065 0.9288 1.0627 Op.type 1.1206 0.8924 0.9072 1.3841 detection 1.3260 0.7542 1.0445 1.6832 axillary 1.0924 0.9154 0.8629 1.3830 T 1.4689 0.6808 1.1910 1.8116 P 1.1675 0.8565 0.9113 1.4957 M 1.5285 0.6542 1.3222 1.7671 LVI 0.6605 1.5141 0.5395 0.8085 age 1.0120 0.9881 1.0012 1.0229 nodespositive 1.1399 0.8773 1.0847 1.1978 Concordance= 0.771 (se = 0.014 ) Rsquare= 0.183 (max possible= 0.961 ) Likelihood ratio test= 445.1 on 19 df, p=0 Wald test = 483 on 19 df, p=0 Score (logrank) test = 624.8 on 19 df, p=0 92 Bibliography [1] D.R.Cox & D.Oakes, 1984, Analysis of Survival Data, Chapman & Hall [2] G. Rodrguez, 2001, Non-Parametric Estimation in Survival Models, Princeton Press [3] J.Fox, 2002, Cox Proportional-Hazards Regression for Survival Data [4] R.J.Little & D.B.Rubin, 2002, Statistical Analysis with Missing Data, WileyInterscience [5] M.Galea,R.Blamey,C.Elston,I.O.Ellis, 1992,The Nottingham Prognostic Index in Primary breast Cancer Kluwer Academic publishers [6] N.E Breslow, 1976, Analysis of Survival Data under the Proportional Hazards Model,International Statistical Review [7] Schoenfeld, D , 1981, The asymptotic properties of nonparametric tests for comparing survival distributions, Biometrika [8] ,Kaplan,E.L.& P.Meier, 1958. Non-parametric Estimation from Incomplete Observations. J. Am. Stat. Assoc., 53:457481 [9] Allison,P D, 2001 Missing Data (1st ed.), Thousand Oaks: Sage Publications, [10] D. Mayer, 2004, Essential Evidence Based Medicine, Cambridge Press Publication [11] R.B Millar, 2011, Maximum Likelihood Estimation and Inference: With Examples in R, SAS and ADMB,John Wiley & Sons, [12] Kendall, M.G., Stuart, A. 1973, The Advanced Theory of Statistics, Volume 2 Griffin. [13] N. Breslow & J. Crowley, 1974, A Large Sample Study of the Life Table and Product Limit Estimates Under Random CensorshipAnn. Statist. Volume 2, Number 3 93
© Copyright 2025