Please read the following study:
Bonomi, L., Jiang, X., & Ohno-Machado, L. (2020). Protecting patient privacy in survival analyses. Journal of the American Medical Informatics Association, 27(3), 366–375. https://doi.org/10.1093/jamia/ocz195
Discuss your response to this survival analysis study. Do you have the same concerns as the researchers regarding the patient privacy issues when presenting actuarial/survival analysis tables? Do you have other suggestions regarding protecting patient privacy within a study?
Be sure to support your statements with logic and argument, and use at least two peer-reviewed articles, citing them to support your statements.
Research and Applications
Protecting patient privacy in survival analyses
Luca Bonomi1, Xiaoqian Jiang2, and Lucila Ohno-Machado1,3
1Department of Biomedical Informatics, UC San Diego Health, University of California, San Diego, La Jolla, California, USA, 2School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA, and 3Division of
Health Services Research and Development, VA San Diego Healthcare System, La Jolla, California, USA
Corresponding Author: Luca Bonomi, PhD, UCSD Health Department of Biomedical Informatics, University of California
San Diego, 9500 Gilman Dr., La Jolla, California 92093, USA; [email protected]
Received 15 July 2019; Revised 9 September 2019; Editorial Decision 6 October 2019; Accepted 18 October 2019
ABSTRACT
Objective: Survival analysis is the cornerstone of many healthcare applications in which the "survival" probability (eg, time free from a certain disease, time to death) of a group of patients is computed to guide clinical decisions. It is widely used in biomedical research and healthcare applications. However, frequent sharing of exact survival curves may reveal information about the individual patients, as an adversary may infer the presence of a person of interest as a participant of a study or of a particular group. Therefore, it is imperative to develop methods to protect patient privacy in survival analysis.
Materials and Methods: We develop a framework based on the formal model of differential privacy, which provides provable privacy protection against a knowledgeable adversary. We show the performance of privacy-protecting solutions for the widely used Kaplan-Meier nonparametric survival model.
Results: We empirically evaluated the usefulness of our privacy-protecting framework and the reduced privacy risk for a popular epidemiology dataset and a synthetic dataset. Results show that our methods significantly reduce the privacy risk when compared with their nonprivate counterparts, while retaining the utility of the survival curves.
Discussion: The proposed framework demonstrates the feasibility of conducting privacy-protecting survival analyses. We discuss future research directions to further enhance the usefulness of our proposed solutions in biomedical research applications.
Conclusion: The results suggest that our proposed privacy-protection methods provide strong privacy protections while preserving the usefulness of survival analyses.
Key words: data privacy, survival analysis, data sharing, Kaplan-Meier, actuarial
INTRODUCTION
Survival analysis aims at computing the "survival" probability (ie, how long it takes for an event to happen) for a group of observations that contain information about individuals, including time to event. In medical research, the primary interest of survival analysis is in the computation and comparison of survival probabilities across patient groups (eg, standard of care vs intervention), in which survival may refer, for example, to the time free from the onset of a certain disease, time free from recurrence, and time to death. Survival analysis provides important insights, among other things, on the effectiveness of treatments, identification of risk, biomarker utility, and hypothesis testing.1–10 Survival curves aggregate information from groups of interest and are easy to generate, interpret, compare, and publish online. Although aggregate data can be protected by different approaches, such as rounding,11,12 binning,13 and perturbation,14 survival analysis models have special characteristics that warrant the development of customized methods. Before describing our proposed solutions, we briefly review how survival curves are derived and what their vulnerabilities are from a privacy perspective.
Survival analysis methods and privacy
Methods for survival analysis can be divided into 3 main categories: parametric, semiparametric, and nonparametric models. Parametric models rely on known probability distributions (eg, the Weibull distribution) to learn a statistical model. These models are less frequently used than semi- or nonparametric methods, as their parametric assumptions hardly apply in practice. Even though the released curves exhibit a natural "smoothing," studies have shown that the parameters of the model may reveal sensitive information.15 Semiparametric methods are extremely popular for multivariate analyses and can be used to identify important risk factors for the event of interest. As an example, the Cox proportional hazards model16 only assumes a proportional relationship between the baseline hazard and the hazard attributed to a specific group (ie, it does not assume that survival follows a known distribution, as is the case with parametric models). Nonparametric models are frequently used to describe the survival probability over time, without requiring assumptions on the underlying data distribution. Among those models, the Kaplan-Meier (KM) product-limit estimators are frequently used in the biomedical literature. As an example, a search for PubMed articles using the term Kaplan-Meier retrieves more than 8000 articles each year, from 2013 to 2018. A search for the term actuarial returns about 500 articles per year. In this article, we focus on the KM estimator and present results for the actuarial model in the Supplementary Appendix. The KM method generates a survival curve in which each event can be seen as a corresponding drop in the probability of survival. For example, Foldvary et al4 used the KM method to analyze seizure outcomes for patients who underwent temporal lobectomy for epilepsy. In contrast, in the actuarial method,17,18 the survival probability is computed over prespecified periods of time (eg, 1 week, 1 month). For example, Balsam et al19 used actuarial curves to describe the long-term survival for valve surgery in an elderly population.
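To make the estimator concrete, the following is a minimal sketch of the Kaplan-Meier product-limit computation in Python; the function name and the toy data are illustrative assumptions, not material from the study.

```python
# Minimal Kaplan-Meier product-limit sketch (illustrative only, not the
# authors' code). Each observation is (time, observed): observed=True for an
# uncensored event, False for a censored one.
from collections import Counter

def kaplan_meier(observations):
    """Return [(t, S(t))] at the distinct event/censoring times."""
    deaths = Counter(t for t, observed in observations if observed)
    exits = Counter(t for t, _ in observations)   # events + censorings
    at_risk = len(observations)
    survival, curve = 1.0, []
    for t in sorted(exits):
        u = deaths.get(t, 0)                      # uncensored events at t
        if at_risk > 0 and u > 0:
            survival *= 1.0 - u / at_risk         # product-limit update
        curve.append((t, survival))
        at_risk -= exits[t]                       # remove events and censorings
    return curve

# Toy data in the paper's notation style: 2, 4, 4, 5*, 6, 8* (* = censored)
data = [(2, True), (4, True), (4, True), (5, False), (6, True), (8, False)]
print(kaplan_meier(data))
```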
It is surprising that relatively little attention has been given so far to the protection of individual privacy in survival analysis. Survival analyses generate aggregated results that are unlikely to directly reveal identifying information (eg, name, SSN).20 However, a knowledgeable adversary, who observes survival analysis results over time, may be able to determine whether a targeted individual participated in the study and even if the individual belongs to a particular subgroup in the study, thus learning sensitive phenotypes. Several previous privacy studies have shown that sharing aggregated results may lead to this privacy risk.15,21,22 For example, small values of counts (eg, <11) may reveal identifiable information about patients and their demographics.11,23 As survival analyses rely on statistical primitives (eg, counts of time to events), they share similar privacy risks. In fact, each patient is responsible for a drop or step in the survival curve. Therefore, the released curves may reveal, in combination with personal or public knowledge, sensitive information about a single patient. For example, an adversary who (1) has knowledge of the time to events of individuals in various groups at a certain time (eg, previously released survival curves for different groups) and (2) knows that a person of interest joined the study may infer the presence of such an individual in a specific group (eg, patients in the hepatitis B subgroup) as the released curves are updated. Specifically, an adversary can construct a survival curve based on their auxiliary knowledge and can infer whether the person of interest is in the group by comparing such a curve with the one from a group, as illustrated with the curves s1' and s2' in Figure 1 (left panel). The differences between the exact curves and those obtained by the adversary disclose the participation of the person of interest in a group (ie, the patient with time to event at time unit 61 contributed to the curve s2', thus the individual of interest was in group 2). This scenario is realistic for dashboards of "aggregate" results, where tools for data exploration (eg, web interfaces and application programming interfaces) may enable users to obtain frequent fine-grained releases, and it certainly is not limited to survival analysis, applying also to counts, histograms, proportions (when accompanied by information on the total number of participants), and other seemingly harmless "aggregate" data.
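As a concrete illustration of the difference attack described above, the following hypothetical sketch reuses the kaplan_meier() helper from the previous snippet; the group compositions and the person of interest's time to event are invented for illustration and do not come from the paper.

```python
# Hypothetical difference attack: the adversary knows the other participants'
# times to event in each group and the person of interest's time to event,
# and checks which released curve is consistent with that person's presence.
group1_others = [(3, True), (7, True), (9, False)]
group2_others = [(2, True), (4, True), (4, True), (5, False), (8, False)]
person = (6, True)  # group membership unknown to the adversary

released = {
    "group 1": kaplan_meier(group1_others),             # person did not join group 1
    "group 2": kaplan_meier(group2_others + [person]),  # person joined group 2
}

for name, others in [("group 1", group1_others), ("group 2", group2_others)]:
    if released[name] == kaplan_meier(others + [person]):
        print("released curve for", name, "is consistent with the person's presence")
```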
It is imperative to develop privacy solutions to protect the individual presence in the released survival curves. In this work, we consider the formal and provable notion of differential privacy,24 in which the released statistics are perturbed with carefully calibrated random noise. Specifically, differential privacy ensures that the output statistics are "roughly the same" regardless of the presence or absence of any individual, thus providing plausible deniability. In fact, the differences between the differentially private survival curves s1'-dp and s2'-dp and those obtained with the adversarial knowledge in Figure 1 (right panel) do not reveal information about the presence of any individual in either group, as opposed to the original curves (left panel).
Objective
Current research in survival analysis includes the development of accurate prediction models, under the assumption that sharing aggregate survival data does not compromise privacy. For example, deep neural networks have been recently used to learn the relationship between a patient's covariates and the survival distribution predictions.25–28 Another example by Lu et al29 describes a decentralized method for learning a distributed Cox proportional hazards model without sharing individual patient-level data. Those solutions disclose exact results that may enable privacy attacks by untrusted users.15,22,30
Several approaches have been proposed for privacy-protecting survival analyses.20,31–33 However, they do not provide provable privacy guarantees. O'Keefe et al20 discussed privacy techniques based on data suppression (eg, removal of censored events), smoothing, and data perturbation. Yu et al32 proposed a method based on affine projections for the Cox model. Similarly, Fung et al33 developed a privacy solution using random linear kernel approaches. Despite promising results, these solutions do not provide provable privacy protection and may be vulnerable in the presence of an adversary who has auxiliary information (eg, knowledge of time-to-event data [hospitalization, death, etc] or of previously published survival curves).
We developed a privacy framework, based on the notion of differential privacy, that provides formal and provable privacy protection against a knowledgeable adversary who aims at determining the presence of an individual of interest in a particular group. Intuitively, our framework transforms the data before the release, similarly to previous methods based on generalization (eg, smoothing) and truncation (eg, censoring aggregate counts below a threshold).20,23 In our case, privacy is protected with the injection of calibrated noise. We show how this framework can be used to release differentially private survival analyses for the KM estimator (see the Supplementary Appendix for the actuarial method). Furthermore, we define an empirical privacy risk that measures how well an informed adversary may reconstruct the temporal information of time to event of an individual who participated in the study. Our evaluations show that an adversary can reconstruct the time to event with a small error from the observed nonprivate survival curves, thus indicating high privacy risk (eg, potential reidentification by linking the exact time intervals with external data). Our proposed methods significantly reduce privacy risks while retaining the usefulness of the survival curves. We must emphasize that an ideal privacy protection mechanism should not rely on specific assumptions about what background knowledge the adversary has, as violations in the adversary's knowledge may make privacy protection invalid. Thanks to differential privacy, our methods do not require such assumptions and thus provide protection regardless of how much information the adversary has.
MATERIALS AND METHODS
Nonparametric survival models
Nonparametric survival models estimate the survival probability of a group of individuals by analyzing the temporal distribution of the recorded events during the study. Typically, each individual has a single temporal event, which may represent the development of a symptom, disease, or death. Some of these events may be only partially known (eg, the subject drops out of the study, no follow-up)17,34 and therefore are denoted as censored events. We assume a study of N individuals over a period of T time units (eg, days, months). Furthermore, u_i denotes the number of uncensored patients (known recorded event [eg, death]) at time t_i, c_i denotes the number of censored patients at time t_i, and r_i represents those remaining before t_i (excluding any individual censored previously). Table 1 summarizes the nonparametric models considered in this article. Additional details are reported in the Supplementary Appendix.
Differential privacy
Differential privacy24 enables the release of statistical information about a group of participants while providing strong and provable privacy protection. Specifically, differential privacy ensures that the probability distribution of the released statistics is "roughly the same" regardless of the presence or absence of any individual, thus providing plausible deniability. Differential privacy has been successfully applied in a variety of settings,14,35 such as data publication (eg, 1-time data release),36–40 iterative query answering,41–43 continual data release (eg, results are published over time),44–50 and in combination with various machine learning models.30,51–53 Among those works, we are inspired by the differentially private model proposed for continual data release,46–49 as survival analyses estimate the survival function at time t using the time to events up to t. In our setting, we consider an event stream S = (e_1, e_2, ..., e_T), where each event e_i = (c_i, u_i, t_i) reports the number of censored and uncensored events recorded at time t_i, and the events are in chronological order (ie, t_i < t_{i+1}). For example, consider a study over a period of T = 10 units of time (eg, months) comprising a total of N = 6 individuals with times to event of 2, 4, 4, 5*, 6, 8*, where a time marked with * corresponds to a censoring event (ie, a participant was lost to follow-up). Under our notation, we have an event stream S = (0, 0, 1), (0, 1, 2), (0, 0, 3), (0, 2, 4), (1, 0, 5), (0, 1, 6), (0, 0, 7), (1, 0, 8), (0, 0, 9), (0, 0, 10), where (0, 0, 3) indicates that no events were observed at time 3.
We assume a trusted data curator who wishes to release an estimate of the survival probability s(t) at each time stamp 1 ≤ t ≤ T using the information in the stream of events up to time t, namely the prefix stream S_t = (e_1, e_2, ..., e_t).
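A small sketch of this event-stream representation follows; the code is illustrative (not the authors') and simply reproduces the stream for the example above.

```python
# Map time-to-event data onto the stream representation described above:
# each time unit t in 1..T becomes (c_t, u_t, t), the number of censored and
# uncensored events recorded at t.
def to_event_stream(event_times, censored_times, T):
    stream = []
    for t in range(1, T + 1):
        c = censored_times.count(t)   # censored events at time t
        u = event_times.count(t)      # uncensored events at time t
        stream.append((c, u, t))
    return stream

# Times to event 2, 4, 4, 5*, 6, 8* over T = 10 time units (* = censored)
print(to_event_stream([2, 4, 4, 6], [5, 8], 10))
# [(0, 0, 1), (0, 1, 2), (0, 0, 3), (0, 2, 4), (1, 0, 5),
#  (0, 1, 6), (0, 0, 7), (1, 0, 8), (0, 0, 9), (0, 0, 10)]
```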
Neighboring streams of time to events
Two streams of time to events S_t and S'_t are neighboring streams if there exists at most 1 t_i ∈ [1, ..., t] such that |c_i − c'_i| + |u_i − u'_i| ≤ 1 (ie, they differ by at most 1 event).
Using this notion, we present the definition of differential privacy considered in our work as follows.
Differential privacy
Let M(·) be a randomized algorithm that takes as input a stream S, and let O be the set of all possible outputs of M(·). Then, we say that M(·) satisfies ε-differential privacy if, for all sets of outputs O ⊆ O, all neighboring streams S_t and S'_t, and all t, it holds that:
Pr[M(S_t) ∈ O] ≤ e^ε · Pr[M(S'_t) ∈ O]
Intuitively, the notion of differential privacy ensures that neighboring streams should be indistinguishable by an adversary who observes the output of the mechanism M(·) at any time. In differentially private models, ε denotes the privacy parameter (also known as the privacy budget). Lower values indicate higher indistinguishability, thus providing stronger privacy. Determining the right value for ε is a challenging problem, as specific values depend on the application (ie, risk tolerance).54 Typically, ε assumes values in the range [1/1000, 10]. As an example, consider ε = 1: the probability of a stream S_t being mapped to a particular output is then no greater than e^1 ≈ 2.7 times the probability that any of its neighboring streams is mapped to the same output. Perfect privacy can be achieved with ε = 0 (ie, neighboring streams are equally likely to produce the same output); however, it obviously leads to no utility in the released curve, as the mechanism has to completely ignore each individual record in its input. The guarantee of indistinguishability between neighboring streams protects the presence of the individual in the released statistics because, in survival analysis, an individual can contribute at most once to the stream.
Figure 1. Survival curves obtained using the Kaplan-Meier method. (Left panel) An adversary observes 2 exact curves s1' (group 1) (eg, consisting of patients without hepatitis B) and s2' (group 2) (patients with hepatitis B) and compares them with the curves constructed with knowledge of s1 and s2 (eg, previously released curves). The adversary knows that the person of interest had an event at time 61 and thus can learn from the change in s2' that this individual contributed to group 2. This is an example of a difference attack. (Right panel) When the curves are generated using differential privacy (s1'-dp and s2'-dp), their difference does not reveal individual time-to-event information. The data on this plot were obtained from a publicly available repository (http://lib.stat.cmu.edu/datasets/veteran). Here, we only report on the first 80 time units (days) to highlight the difference between the survival curves.
Typically, differential privacy is achieved via output perturbation, in which the released statistics are perturbed with calibrated random noise to hide the presence of individuals (details are reported in the Supplementary Appendix). Intuitively, the noise perturbation "generalizes" the aggregated time to events, similarly to traditional ad hoc techniques in which the released aggregated counts are obtained by binning and thresholding (eg, reporting counts as "less than 10").
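As a reference point, here is a minimal sketch of output perturbation with the Laplace mechanism, the standard construction behind differential privacy; the sensitivity of 1 and ε = 1 are illustrative assumptions, and the authors' exact calibration is described in their Supplementary Appendix.

```python
# Minimal Laplace-mechanism sketch (illustrative, not the authors' code):
# release a statistic with noise scaled to sensitivity / epsilon.
import numpy as np

def laplace_release(true_value, epsilon, sensitivity=1.0):
    noise = np.random.default_rng().laplace(0.0, sensitivity / epsilon)
    return true_value + noise

print(laplace_release(42, epsilon=1.0))  # eg, a perturbed event count
```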
Our framework for privacy-protecting survival analyses
Publishing survival values s(t) may pose significant privacy challenges, as the event for an individual at time t_0 will affect the survival curve at time t_0 (ie, a step) as well as subsequent values. Therefore, an adversary who observes these changes may gain knowledge about the individual associated with such an event. To mitigate these risks, traditional differential privacy methods perturb each released survival value s(t). However, these methods may lead to overly perturbed results when the study spans a long period of time. To this end, we propose a framework that compresses the stream of events into partitions in which the survival probabilities can be accurately computed over time using an input perturbation strategy. Overall, our framework (Figure 2) comprises 3 main steps: (1) data partitioning, (2) survival curve computation, and (3) postprocessing.
In the data partitioning step, the time to events are grouped into partitions generated in a differentially private manner. The idea is to compress the stream, so that the privacy cost for computing the survival curve can be reduced while retaining the distribution of the events. In the survival computation step, we estimate the number of censored and uncensored events over time using a binary tree decomposition. This step reduces the perturbation noise in the estimation of the events, which are then used to compute the survival probability. Specifically, we use an input perturbation approach in which privacy is achieved by perturbing the counts of the events rather than the output of the survival function, thus improving the utility compared with standard output perturbation techniques. Because the noise perturbation may disrupt the shape of the survival curve, we perform a postprocessing step, in which we enforce consistency in the released curve (ie, monotonically decreasing survival probabilities). For brevity, in the following we describe the instantiation of our framework for the KM method. The private solution for the actuarial method follows the same steps, except for the fact that partitioning is performed over fixed intervals (see Supplementary Appendix).
Data partitioning
Our partitioning strategy takes as input the stream of events S_t and produces a stream of partitions as output, in which multiple events are grouped. We compress the stream into partitions of variable length with the goal of retaining the distribution of the events. Our method processes 1 event at a time and keeps an active partition, which is sealed when more than H time to events are observed. Intuitively, this approach produces a coarser representation of the stream, in which each event is grouped with at least H − 1 others, by varying the interval of time used to publish survival for a group of events. In this process, we perturb the count of the events in the stream and the threshold H with calibrated noise. As a result, both the events and the sizes of the partitions are protected, thus providing an additional level of protection compared with other privacy methods that rely on binning (ie, rounding to the nearest 10). The privacy budget ε_1 dedicated to this step is equally divided between the threshold and event count perturbations. As any neighboring streams may differ by at most 1 segment, these perturbations ensure that the partitions returned by the algorithm satisfy ε_1-differential privacy.55
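One possible reading of this partitioning step is sketched below. It follows the noisy-threshold idea described in the text, but the exact mechanism and its calibration are those of the work the authors cite, so treat H, ε_1, and the noise scales here as illustrative assumptions rather than the authors' algorithm.

```python
# Sketch of noisy-threshold partitioning over a stream of (c, u, t) events.
import numpy as np

def partition_stream(stream, H, epsilon1, rng=None):
    """Group (c, u, t) events into partitions holding roughly >= H events each."""
    rng = rng or np.random.default_rng()
    eps_threshold, eps_count = epsilon1 / 2.0, epsilon1 / 2.0   # split the budget
    noisy_H = H + rng.laplace(0.0, 1.0 / eps_threshold)
    partitions, current, running = [], [], 0.0
    for (c, u, t) in stream:
        current.append((c, u, t))
        running += c + u + rng.laplace(0.0, 1.0 / eps_count)    # noisy event count
        if running > noisy_H:                                   # seal the partition
            partitions.append(current)
            current, running = [], 0.0
            noisy_H = H + rng.laplace(0.0, 1.0 / eps_threshold) # fresh noisy threshold
    if current:
        partitions.append(current)
    return partitions
```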
Survival curve computation
In this step, we determine the survival probability at time t using an input perturbation strategy. The idea is to estimate the number of uncensored and censored events in the partitions in a differentially private manner and then use those values to compute the survival curve up to t. One could estimate these events by perturbing the counts over the partitions processed so far. However, this simple process leads to high perturbation noise, as the magnitude of the noise grows linearly with the number of partitions. To this end, we use a binary tree counting approach with privacy parameter ε_2, where leaves represent the original partitions and internal nodes denote partitions obtained by merging the partitions of their children. In Figure 2, for example, the internal node associated with the count C14 comprises the events over the partitions P1, P2, P3, and P4. This binary mechanism is very effective in reducing the overall impact of perturbation noise.46,47 With this mechanism, the differentially private numbers of uncensored û(i) and censored ĉ(i) events in the stream can be estimated with a perturbation noise that grows only logarithmically with the number of partitions in the stream.
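The binary (dyadic) counting idea can be sketched as follows, using the standard continual-release construction that the paper cites; the per-node noise scale and the toy counts are illustrative assumptions rather than the authors' implementation.

```python
# Dyadic-tree counting sketch: noisy prefix sums whose error grows only
# logarithmically in the number of partitions.
import math
import numpy as np

def dyadic_prefix_counts(counts, epsilon2, rng=None):
    """Return noisy prefix sums of `counts` under the binary mechanism."""
    rng = rng or np.random.default_rng()
    n = len(counts)
    levels = max(1, math.ceil(math.log2(n)) + 1)     # tree height
    scale = levels / epsilon2                        # each count falls in <= `levels` nodes
    # noisy_node[(level, start)] = noisy sum over a dyadic block of size 2**level
    noisy_node = {}
    for level in range(levels):
        size = 2 ** level
        for start in range(0, n, size):
            block = sum(counts[start:start + size])
            noisy_node[(level, start)] = block + rng.laplace(0.0, scale)
    prefixes = []
    for i in range(1, n + 1):                        # noisy sum of counts[0:i]
        total, pos = 0.0, 0
        for level in reversed(range(levels)):
            size = 2 ** level
            if pos + size <= i:
                total += noisy_node[(level, pos)]
                pos += size
        prefixes.append(total)
    return prefixes

print(dyadic_prefix_counts([1, 0, 2, 1, 0, 1], epsilon2=1.0))
```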
Table 1. Nonparametric models for survival analysis considered

Actuarial model (see Supplementary Appendix):
• time to events are grouped into intervals of fixed length (l)
• survival computed on the set of intervals {I_1, I_2, ..., I_T} of length l
• the censored patients are assumed to withdraw from the study at random during the interval
• survival function at each interval I_i: s_i = ∏_{j=1}^{i} (1 − u_j / (r_j − c_j/2))

Kaplan-Meier model:
• survival function computed on each time unit
• survival function at time t: s(t) = ∏_{t_i ≤ t} (1 − u_i / r_i)
To compute the privacy-protecting survival curve for the KM method, denoted ŝ_KM(i), we rewrite the KM survival curve formulation as follows:

ŝ_KM(i) = ŝ_KM(i − 1) × (N − û(i) − ĉ(i − 1)) / (N − û(i − 1) − ĉ(i − 1))

where û(i) and ĉ(i) represent the total numbers of uncensored and censored events up to the time of partition i, respectively. At the end of this step, we obtain a step function representing the survival probability of the patients over time that remains constant within each partition.
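Applied to noisy cumulative counts (eg, those produced by the binary mechanism), the recurrence above can be computed as in this sketch; N and the example counts are made up, and the guard on the denominator is an added assumption for robustness to noise.

```python
# Differentially private KM curve from noisy cumulative counts u_hat and c_hat.
def dp_km_curve(N, u_hat, c_hat):
    """u_hat[i], c_hat[i]: noisy cumulative uncensored/censored counts up to partition i."""
    curve = []
    s_prev, u_prev, c_prev = 1.0, 0.0, 0.0
    for u_i, c_i in zip(u_hat, c_hat):
        denom = N - u_prev - c_prev
        s_i = s_prev * (N - u_i - c_prev) / denom if denom > 0 else s_prev
        curve.append(s_i)
        s_prev, u_prev, c_prev = s_i, u_i, c_i
    return curve

print(dp_km_curve(6, u_hat=[1.2, 2.9, 4.1], c_hat=[0.1, 1.0, 2.2]))
```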
Data postprocessing
A survival curve satisfies the following properties: (1) it assumes values in the range [0, 1] and (2) it monotonically decreases with time (ie, ŝ(t) ≥ ŝ(t + 1) for 1 ≤ t < T). While our solution ensures that the released curve satisfies differential privacy, the noise perturbation may violate properties 1 and 2. To this end, we propose a postprocessing step, in which we compute the survival curve ŝ*(t) that satisfies these properties and best resembles ŝ(t). Similarly to previous work,56,57 we solve this optimization problem with isotonic regression methods (details in the Supplementary Appendix).
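A minimal sketch of this postprocessing step follows, using scikit-learn's isotonic regression as a stand-in for the solver detailed in the Supplementary Appendix (an assumption about tooling, not the authors' code).

```python
# Project a noisy curve onto the set of nonincreasing curves bounded in [0, 1].
import numpy as np
from sklearn.isotonic import IsotonicRegression

def postprocess_curve(noisy_curve):
    t = np.arange(len(noisy_curve))
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, increasing=False)
    return iso.fit_transform(t, noisy_curve)   # closest nonincreasing curve in [0, 1]

print(postprocess_curve([1.02, 0.81, 0.86, 0.55, 0.58, 0.20]))
```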
An illustrative example of our postprocessing step is repor