Please read the following study:
Bonomi, L., Jiang, X., & Ohno-Machado, L. (2020). Protecting patient privacy in survival analyses. Journal of the American Medical Informatics Association, 27(3), 366–375. https://doi.org/10.1093/jamia/ocz195
Discuss your response to this survival analysis study. Do you have the same concerns as the researchers regarding the patient privacy issues when presenting actuarial/survival analysis tables? Do you have other suggestions regarding protecting patient privacy within a study?
Be sure to support your statements with logic and argument, and use at least two peer-reviewed articles, citing them to support your statements.
Research and Applications
Protecting patient privacy in survival analyses
Luca Bonomi1, Xiaoqian Jiang2, and Lucila Ohno-Machado1,3
1Department of Biomedical Informatics, UC San Diego Health, University of California, San Diego, La Jolla, California, USA, 2School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA, and 3Division of
Health Services Research and Development, VA San Diego Healthcare System, La Jolla, California, USA
Corresponding Author: Luca Bonomi, PhD, UCSD Health Department of Biomedical Informatics, University of California
San Diego, 9500 Gilman Dr., La Jolla, California 92093, USA; [email protected]
Received 15 July 2019; Revised 9 September 2019; Editorial Decision 6 October 2019; Accepted 18 October 2019
ABSTRACT
Objective: Survival analysis is the cornerstone of many healthcare applications in which the "survival" probability (eg, time free from a certain disease, time to death) of a group of patients is computed to guide clinical decisions. It is widely used in biomedical research and healthcare applications. However, frequent sharing of exact survival curves may reveal information about the individual patients, as an adversary may infer the presence of a person of interest as a participant of a study or of a particular group. Therefore, it is imperative to develop methods to protect patient privacy in survival analysis.
Materials and Methods: We develop a framework based on the formal model of differential privacy, which provides provable privacy protection against a knowledgeable adversary. We show the performance of privacy-protecting solutions for the widely used Kaplan-Meier nonparametric survival model.
Results: We empirically evaluated the usefulness of our privacy-protecting framework and the reduced privacy risk for a popular epidemiology dataset and a synthetic dataset. Results show that our methods significantly reduce the privacy risk when compared with their nonprivate counterparts, while retaining the utility of the survival curves.
Discussion: The proposed framework demonstrates the feasibility of conducting privacy-protecting survival analyses. We discuss future research directions to further enhance the usefulness of our proposed solutions in biomedical research applications.
Conclusion: The results suggest that our proposed privacy-protection methods provide strong privacy protections while preserving the usefulness of survival analyses.
Key words: data privacy, survival analysis, data sharing, Kaplan-Meier, actuarial
INTRODUCTION
Survival analysis aims at computing the "survival" probability (ie, how long it takes for an event to happen) for a group of observations that contain information about individuals, including time to event. In medical research, the primary interest of survival analysis is in the computation and comparison of survival probabilities across patient groups (eg, standard of care vs intervention), in which survival may refer, for example, to the time free from the onset of a certain disease, time free from recurrence, and time to death. Survival analysis provides important insights, among other things, on the effectiveness of treatments, identification of risk, biomarker utility, and hypothesis testing.1–10 Survival curves aggregate information from groups of interest and are easy to generate, interpret, compare, and publish online. Although aggregate data can be protected by different approaches, such as rounding,11,12 binning,13 and perturbation,14 survival analysis models have special characteristics that warrant the development of customized methods. Before describing our proposed solutions, we briefly review how survival curves are derived and what their vulnerabilities are from a privacy perspective.
Survival analysis methods and privacy
Methods for survival analysis can be divided into 3 main categories: parametric, semiparametric, and nonparametric models. Parametric models rely on known probability distributions (eg, the Weibull distribution) to learn a statistical model. These models are less frequently used than semi- or nonparametric methods, as their parametric assumptions hardly apply in practice. Even though the released curves exhibit a natural "smoothing," studies have shown that the parameters of the model may reveal sensitive information.15 Semiparametric methods are extremely popular for multivariate analyses and can be used to identify important risk factors for the event of interest. As an example, the Cox proportional hazards model16 only assumes a proportional relationship between the baseline hazard and the hazard attributed to a specific group (ie, it does not assume that survival follows a known distribution, as is the case with parametric models). Nonparametric models are frequently used to describe the survival probability over time, without requiring assumptions on the underlying data distribution. Among those models, the Kaplan-Meier (KM) product-limit estimators are frequently used in the biomedical literature. As an example, a search for PubMed articles using the term Kaplan-Meier retrieves more than 8000 articles each year, from 2013 to 2018. A search for the term actuarial returns about 500 articles per year. In this article, we focus on the KM estimator and present results for the actuarial model in the Supplementary Appendix. The KM method generates a survival curve in which each event can be seen as a corresponding drop in the probability of survival. For example, Foldvary et al4 used the KM method to analyze seizure outcomes for patients who underwent temporal lobectomy for epilepsy. In contrast, in the actuarial method,17,18 the survival probability is computed over prespecified periods of time (eg, 1 week, 1 month). For example, Balsam et al19 used actuarial curves to describe the long-term survival for valve surgery in an elderly population.
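To make the estimator concrete, the following is a minimal sketch of the Kaplan-Meier product-limit computation in Python; the function name and the toy data are illustrative assumptions, not material from the study.

```python
# Minimal Kaplan-Meier product-limit sketch (illustrative only, not the
# authors' code). Each observation is (time, observed): observed=True for an
# uncensored event, False for a censored one.
from collections import Counter

def kaplan_meier(observations):
    """Return [(t, S(t))] at the distinct event/censoring times."""
    deaths = Counter(t for t, observed in observations if observed)
    exits = Counter(t for t, _ in observations)   # events + censorings
    at_risk = len(observations)
    survival, curve = 1.0, []
    for t in sorted(exits):
        u = deaths.get(t, 0)                      # uncensored events at t
        if at_risk > 0 and u > 0:
            survival *= 1.0 - u / at_risk         # product-limit update
        curve.append((t, survival))
        at_risk -= exits[t]                       # remove events and censorings
    return curve

# Toy data in the paper's notation style: 2, 4, 4, 5*, 6, 8* (* = censored)
data = [(2, True), (4, True), (4, True), (5, False), (6, True), (8, False)]
print(kaplan_meier(data))
```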
It is surprising that relatively little attention has been given so far to the protection of individual privacy in survival analysis. Survival analyses generate aggregated results that are unlikely to directly reveal identifying information (eg, name, SSN).20 However, a knowledgeable adversary, who observes survival analysis results over time, may be able to determine whether a targeted individual participated in the study and even if the individual belongs to a particular subgroup in the study, thus learning sensitive phenotypes. Several previous privacy studies have shown that sharing aggregated results may lead to this privacy risk.15,21,22 For example, small values of counts (eg, <11) may reveal identifiable information about patients and their demographics.11,23 As survival analyses rely on statistical primitives (eg, counts of time to events), they share similar privacy risks. In fact, each patient is responsible for a drop or step in the survival curve. Therefore, the released curves may reveal, in combination with personal or public knowledge, sensitive information about a single patient. For example, an adversary who (1) has knowledge of the time to events of individuals in various groups at a certain time (eg, previously released survival curves for different groups) and (2) knows that a person of interest joined the study may infer the presence of such an individual in a specific group (eg, patients in the hepatitis B subgroup) as the released curves are updated. Specifically, an adversary can construct a survival curve based on their auxiliary knowledge and can infer whether the person of interest is in the group by comparing such a curve with the one from a group, as illustrated with the curves s1' and s2' in Figure 1 (left panel). The differences between the exact curves and those obtained by the adversary disclose the participation of the person of interest in a group (ie, the patient with time to event at time unit 61 contributed to the curve s2', thus the individual of interest was in group 2). This scenario is realistic for dashboards of "aggregate" results, where tools for data exploration (eg, web interfaces and application programming interfaces) may enable users to obtain frequent fine-grained releases, and it certainly is not limited to survival analysis, applying also to counts, histograms, proportions (when accompanied by information on the total number of participants), and other seemingly harmless "aggregate" data.
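As a concrete illustration of the difference attack described above, the following hypothetical sketch reuses the kaplan_meier() helper from the previous snippet; the group compositions and the person of interest's time to event are invented for illustration and do not come from the paper.

```python
# Hypothetical difference attack: the adversary knows the other participants'
# times to event in each group and the person of interest's time to event,
# and checks which released curve is consistent with that person's presence.
group1_others = [(3, True), (7, True), (9, False)]
group2_others = [(2, True), (4, True), (4, True), (5, False), (8, False)]
person = (6, True)  # group membership unknown to the adversary

released = {
    "group 1": kaplan_meier(group1_others),             # person did not join group 1
    "group 2": kaplan_meier(group2_others + [person]),  # person joined group 2
}

for name, others in [("group 1", group1_others), ("group 2", group2_others)]:
    if released[name] == kaplan_meier(others + [person]):
        print("released curve for", name, "is consistent with the person's presence")
```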
It is imperative to develop privacy solutions to protect the individual presence in the released survival curves. In this work, we consider the formal and provable notion of differential privacy,24 in which the released statistics are perturbed with carefully calibrated random noise. Specifically, differential privacy ensures that the output statistics are "roughly the same" regardless of the presence or absence of any individual, thus providing plausible deniability. In fact, the differences between the differentially private survival curves s1'-dp and s2'-dp and those obtained with the adversarial knowledge in Figure 1 (right panel) do not reveal information about the presence of any individual in either group, as opposed to the original curves (left panel).
Objective
Current research in survival analysis includes the development of accurate prediction models, under the assumption that sharing aggregate survival data does not compromise privacy. For example, deep neural networks have been recently used to learn the relationship between a patient's covariates and the survival distribution predictions.25–28 Another example by Lu et al29 describes a decentralized method for learning a distributed Cox proportional hazards model without sharing individual patient-level data. Those solutions disclose exact results that may enable privacy attacks by untrusted users.15,22,30
Several approaches have been proposed for privacy-protecting survival analyses.20,31–33 However, they do not provide provable privacy guarantees. O'Keefe et al20 discussed privacy techniques based on data suppression (eg, removal of censored events), smoothing, and data perturbation. Yu et al32 proposed a method based on affine projections for the Cox model. Similarly, Fung et al33 developed a privacy solution using random linear kernel approaches. Despite promising results, these solutions do not provide provable privacy protection and may be vulnerable in the presence of an adversary who has auxiliary information (eg, knowledge of time-to-event data [hospitalization, death, etc] or of previously published survival curves).
We developed a privacy framework, based on the notion of differential privacy, that provides formal and provable privacy protection against a knowledgeable adversary who aims at determining the presence of an individual of interest in a particular group. Intuitively, our framework transforms the data before the release, similarly to previous methods based on generalization (eg, smoothing) and truncation (eg, censoring aggregate counts below a threshold).20,23 In our case, privacy is protected with the injection of calibrated noise. We show how this framework can be used to release differentially private survival analyses for the KM estimator (see the Supplementary Appendix for the actuarial method). Furthermore, we define an empirical privacy risk that measures how well an informed adversary may reconstruct the temporal information of time to event of an individual who participated in the study. Our evaluations show that an adversary can reconstruct the time to event with a small error from the observed nonprivate survival curves, thus indicating high privacy risk (eg, potential reidentification by linking the exact time intervals with external data). Our proposed methods significantly reduce privacy risks while retaining the usefulness of the survival curves. We must emphasize that an ideal privacy protection mechanism should not rely on specific assumptions about what background knowledge the adversary has, as violations in the adversary's knowledge may make privacy protection invalid. Thanks to differential privacy, our methods do not require such assumptions and thus provide protection regardless of how much information the adversary has.
MATERIALS AND METHODS
Nonparametric survival models
Nonparametric survival models estimate the survival probability of a group of individuals by analyzing the temporal distribution of the recorded events during the study. Typically, each individual has a single temporal event, which may represent the development of a symptom, disease, or death. Some of these events may be only partially known (eg, the subject drops out of the study, no follow-up)17,34 and therefore are denoted as censored events. We assume a study of N individuals over a period of T time units (eg, days, months). Furthermore, u_i denotes the number of uncensored patients (known recorded event [eg, death]) at time t_i, c_i denotes the number of censored patients at time t_i, and r_i represents those remaining before t_i (excluding any individual censored previously). Table 1 summarizes the nonparametric models considered in this article. Additional details are reported in the Supplementary Appendix.
Differential privacy
Differential privacy24 enables the release of statistical information about a group of participants while providing strong and provable privacy protection. Specifically, differential privacy ensures that the probability distribution of the released statistics is "roughly the same" regardless of the presence or absence of any individual, thus providing plausible deniability. Differential privacy has been successfully applied in a variety of settings,14,35 such as data publication (eg, 1-time data release),36–40 iterative query answering,41–43 continual data release (eg, results are published over time),44–50 and in combination with various machine learning models.30,51–53 Among those works, we are inspired by the differentially private model proposed for continual data release,46–49 as survival analyses estimate the survival function at time t using the time to events up to t. In our setting, we consider an event stream S = (e_1, e_2, ..., e_T), where each event e_i = (c_i, u_i, t_i) reports the number of censored and uncensored events recorded at time t_i, and the events are in chronological order (ie, t_i < t_{i+1}). For example, consider a study over a period of T = 10 units of time (eg, months) comprising a total of N = 6 individuals with times to event of 2, 4, 4, 5*, 6, 8*, where a time marked with * corresponds to a censoring event (ie, a participant was lost to follow-up). Under our notation, we have an event stream S = (0, 0, 1), (0, 1, 2), (0, 0, 3), (0, 2, 4), (1, 0, 5), (0, 1, 6), (0, 0, 7), (1, 0, 8), (0, 0, 9), (0, 0, 10), where (0, 0, 3) indicates that no events were observed at time 3.
We assume a trusted data curator who wishes to release an estimate of the survival probability s(t) at each time stamp 1 ≤ t ≤ T using the information in the stream of events up to time t, namely the prefix stream S_t = (e_1, e_2, ..., e_t).
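A small sketch of this event-stream representation follows; the code is illustrative (not the authors') and simply reproduces the stream for the example above.

```python
# Map time-to-event data onto the stream representation described above:
# each time unit t in 1..T becomes (c_t, u_t, t), the number of censored and
# uncensored events recorded at t.
def to_event_stream(event_times, censored_times, T):
    stream = []
    for t in range(1, T + 1):
        c = censored_times.count(t)   # censored events at time t
        u = event_times.count(t)      # uncensored events at time t
        stream.append((c, u, t))
    return stream

# Times to event 2, 4, 4, 5*, 6, 8* over T = 10 time units (* = censored)
print(to_event_stream([2, 4, 4, 6], [5, 8], 10))
# [(0, 0, 1), (0, 1, 2), (0, 0, 3), (0, 2, 4), (1, 0, 5),
#  (0, 1, 6), (0, 0, 7), (1, 0, 8), (0, 0, 9), (0, 0, 10)]
```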
Neighboring streams of time to events
Two streams of time to events S_t and S'_t are neighboring streams if there exists at most 1 t_i ∈ [1, ..., t] such that |c_i − c'_i| + |u_i − u'_i| ≤ 1 (ie, they differ by at most 1 event).
Using this notion, we present the definition of differential privacy considered in our work as follows.
Differential privacy
Let M(·) be a randomized algorithm that takes as input a stream S, and let O be the set of all possible outputs of M(·). Then, we say that M(·) satisfies ε-differential privacy if, for all sets of outputs O ⊆ O, all neighboring streams S_t and S'_t, and all t, it holds that:
Pr[M(S_t) ∈ O] ≤ e^ε · Pr[M(S'_t) ∈ O]
Intuitively, the notion of differential privacy ensures that neighboring streams should be indistinguishable by an adversary who observes the output of the mechanism M(·) at any time. In differentially private models, ε denotes the privacy parameter (also known as the privacy budget). Lower values indicate higher indistinguishability, thus providing stronger privacy. Determining the right value for ε is a challenging problem, as specific values depend on the application (ie, risk tolerance).54 Typically, ε assumes values in the range [1/1000, 10]. As an example, consider ε = 1: the probability of a stream S_t being mapped to a particular output is then no greater than e^1 ≈ 2.7 times the probability that any of its neighboring streams is mapped to the same output. Perfect privacy can be achieved with ε = 0 (ie, neighboring streams are equally likely to produce the same output); however, it obviously leads to no utility in the released curve, as the mechanism has to completely ignore each individual record in its input. The guarantee of indistinguishability between neighboring streams protects the presence of the individual in the released statistics because, in survival analysis, an individual can contribute at most once to the stream.
Figure 1. Survival curves obtained using the Kaplan-Meier method. (Left panel) An adversary observes 2 exact curves s1' (group 1) (eg, consisting of patients without hepatitis B) and s2' (group 2) (patients with hepatitis B) and compares them with the curves constructed with knowledge of s1 and s2 (eg, previously released curves). The adversary knows that the person of interest had an event at time 61 and thus can learn from the change in s2' that this individual contributed to group 2. This is an example of a difference attack. (Right panel) When the curves are generated using differential privacy (s1'-dp and s2'-dp), their difference does not reveal individual time-to-event information. The data on this plot were obtained from a publicly available repository (http://lib.stat.cmu.edu/datasets/veteran). Here, we only report on the first 80 time units (days) to highlight the difference between the survival curves.
Typically, differential privacy is achieved via output perturbation, in which the released statistics are perturbed with calibrated random noise to hide the presence of individuals (details are reported in the Supplementary Appendix). Intuitively, the noise perturbation "generalizes" the aggregated time to events, similarly to traditional ad hoc techniques in which the released aggregated counts are obtained by binning and thresholding (eg, reporting counts as "less than 10").
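As a reference point, here is a minimal sketch of output perturbation with the Laplace mechanism, the standard construction behind differential privacy; the sensitivity of 1 and ε = 1 are illustrative assumptions, and the authors' exact calibration is described in their Supplementary Appendix.

```python
# Minimal Laplace-mechanism sketch (illustrative, not the authors' code):
# release a statistic with noise scaled to sensitivity / epsilon.
import numpy as np

def laplace_release(true_value, epsilon, sensitivity=1.0):
    noise = np.random.default_rng().laplace(0.0, sensitivity / epsilon)
    return true_value + noise

print(laplace_release(42, epsilon=1.0))  # eg, a perturbed event count
```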
Our framework for privacy-protecting survival analyses
Publishing survival values s(t) may pose significant privacy challenges, as the event for an individual at time t_0 will affect the survival curve at time t_0 (ie, a step) as well as subsequent values. Therefore, an adversary who observes these changes may gain knowledge about the individual associated with such an event. To mitigate these risks, traditional differential privacy methods perturb each released survival value s(t). However, these methods may lead to overly perturbed results when the study spans a long period of time. To this end, we propose a framework that compresses the stream of events into partitions in which the survival probabilities can be accurately computed over time using an input perturbation strategy. Overall, our framework (Figure 2) comprises 3 main steps: (1) data partitioning, (2) survival curve computation, and (3) postprocessing.
In the data partitioning step, the time to events are grouped into partitions generated in a differentially private manner. The idea is to compress the stream, so that the privacy cost for computing the survival curve can be reduced while retaining the distribution of the events. In the survival computation step, we estimate the number of censored and uncensored events over time using a binary tree decomposition. This step reduces the perturbation noise in the estimation of the events, which are then used to compute the survival probability. Specifically, we use an input perturbation approach in which privacy is achieved by perturbing the counts of the events rather than the output of the survival function, thus improving the utility compared with standard output perturbation techniques. Because the noise perturbation may disrupt the shape of the survival curve, we perform a postprocessing step, in which we enforce consistency in the released curve (ie, monotonically decreasing survival probabilities). For brevity, in the following we describe the instantiation of our framework for the KM method. The private solution for the actuarial method follows the same steps, except for the fact that partitioning is performed over fixed intervals (see Supplementary Appendix).
Data partitioning
Our partitioning strategy takes as input the stream of events S_t and produces a stream of partitions as output, in which multiple events are grouped. We compress the stream into partitions of variable length with the goal of retaining the distribution of the events. Our method processes 1 event at a time and keeps an active partition, which is sealed when more than H time to events are observed. Intuitively, this approach produces a coarser representation of the stream, in which each event is grouped with at least H − 1 others, by varying the interval of time used to publish survival for a group of events. In this process, we perturb the count of the events in the stream and the threshold H with calibrated noise. As a result, both the events and the sizes of the partitions are protected, thus providing an additional level of protection compared with other privacy methods that rely on binning (ie, rounding to the nearest 10). The privacy budget ε_1 dedicated to this step is equally divided between the threshold and event count perturbations. As any neighboring streams may differ by at most 1 segment, these perturbations ensure that the partitions returned by the algorithm satisfy ε_1-differential privacy.55
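One possible reading of this partitioning step is sketched below. It follows the noisy-threshold idea described in the text, but the exact mechanism and its calibration are those of the work the authors cite, so treat H, ε_1, and the noise scales here as illustrative assumptions rather than the authors' algorithm.

```python
# Sketch of noisy-threshold partitioning over a stream of (c, u, t) events.
import numpy as np

def partition_stream(stream, H, epsilon1, rng=None):
    """Group (c, u, t) events into partitions holding roughly >= H events each."""
    rng = rng or np.random.default_rng()
    eps_threshold, eps_count = epsilon1 / 2.0, epsilon1 / 2.0   # split the budget
    noisy_H = H + rng.laplace(0.0, 1.0 / eps_threshold)
    partitions, current, running = [], [], 0.0
    for (c, u, t) in stream:
        current.append((c, u, t))
        running += c + u + rng.laplace(0.0, 1.0 / eps_count)    # noisy event count
        if running > noisy_H:                                   # seal the partition
            partitions.append(current)
            current, running = [], 0.0
            noisy_H = H + rng.laplace(0.0, 1.0 / eps_threshold) # fresh noisy threshold
    if current:
        partitions.append(current)
    return partitions
```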
Survival curve computation
In this step, we determine the survival probability at time t using an input perturbation strategy. The idea is to estimate the number of uncensored and censored events in the partitions in a differentially private manner and then use those values to compute the survival curve up to t. One could estimate these events by perturbing the counts over the partitions processed so far. However, this simple process leads to high perturbation noise, as the magnitude of the noise grows linearly with the number of partitions. To this end, we use a binary tree counting approach with privacy parameter ε_2, where leaves represent the original partitions and internal nodes denote partitions obtained by merging the partitions of their children. In Figure 2, for example, the internal node associated with the count C14 comprises the events over the partitions P1, P2, P3, and P4. This binary mechanism is very effective in reducing the overall impact of perturbation noise.46,47 With this mechanism, the differentially private numbers of uncensored û(i) and censored ĉ(i) events in the stream can be estimated with a perturbation noise that grows only logarithmically with the number of partitions in the stream.
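The binary (dyadic) counting idea can be sketched as follows, using the standard continual-release construction that the paper cites; the per-node noise scale and the toy counts are illustrative assumptions rather than the authors' implementation.

```python
# Dyadic-tree counting sketch: noisy prefix sums whose error grows only
# logarithmically in the number of partitions.
import math
import numpy as np

def dyadic_prefix_counts(counts, epsilon2, rng=None):
    """Return noisy prefix sums of `counts` under the binary mechanism."""
    rng = rng or np.random.default_rng()
    n = len(counts)
    levels = max(1, math.ceil(math.log2(n)) + 1)     # tree height
    scale = levels / epsilon2                        # each count falls in <= `levels` nodes
    # noisy_node[(level, start)] = noisy sum over a dyadic block of size 2**level
    noisy_node = {}
    for level in range(levels):
        size = 2 ** level
        for start in range(0, n, size):
            block = sum(counts[start:start + size])
            noisy_node[(level, start)] = block + rng.laplace(0.0, scale)
    prefixes = []
    for i in range(1, n + 1):                        # noisy sum of counts[0:i]
        total, pos = 0.0, 0
        for level in reversed(range(levels)):
            size = 2 ** level
            if pos + size <= i:
                total += noisy_node[(level, pos)]
                pos += size
        prefixes.append(total)
    return prefixes

print(dyadic_prefix_counts([1, 0, 2, 1, 0, 1], epsilon2=1.0))
```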
Table 1. Nonparametric models for survival analysis considered

Actuarial model (see Supplementary Appendix):
• time to events are grouped into intervals of fixed length (l)
• survival computed on the set of intervals {I_1, I_2, ..., I_T} of length l
• the censored patients are assumed to withdraw from the study at random during the interval
• survival function at each interval I_i: s_i = ∏_{j=1}^{i} (1 − u_j / (r_j − c_j/2))

Kaplan-Meier model:
• survival function computed on each time unit
• survival function at time t: s(t) = ∏_{t_i ≤ t} (1 − u_i / r_i)
To compute the privacy-protecting survival curve for the KM method, denoted ŝ_KM(i), we rewrite the KM survival curve formulation as follows:

ŝ_KM(i) = ŝ_KM(i − 1) × (N − û(i) − ĉ(i − 1)) / (N − û(i − 1) − ĉ(i − 1))

where û(i) and ĉ(i) represent the total numbers of uncensored and censored events up to the time of partition i, respectively. At the end of this step, we obtain a step function representing the survival probability of the patients over time that remains constant within each partition.
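Applied to noisy cumulative counts (eg, those produced by the binary mechanism), the recurrence above can be computed as in this sketch; N and the example counts are made up, and the guard on the denominator is an added assumption for robustness to noise.

```python
# Differentially private KM curve from noisy cumulative counts u_hat and c_hat.
def dp_km_curve(N, u_hat, c_hat):
    """u_hat[i], c_hat[i]: noisy cumulative uncensored/censored counts up to partition i."""
    curve = []
    s_prev, u_prev, c_prev = 1.0, 0.0, 0.0
    for u_i, c_i in zip(u_hat, c_hat):
        denom = N - u_prev - c_prev
        s_i = s_prev * (N - u_i - c_prev) / denom if denom > 0 else s_prev
        curve.append(s_i)
        s_prev, u_prev, c_prev = s_i, u_i, c_i
    return curve

print(dp_km_curve(6, u_hat=[1.2, 2.9, 4.1], c_hat=[0.1, 1.0, 2.2]))
```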
Data postprocessing
A survival curve satisfies the following properties: (1) it assumes values in the range [0, 1] and (2) it monotonically decreases with time (ie, ŝ(t) ≥ ŝ(t + 1) for 1 ≤ t < T). While our solution ensures that the released curve satisfies differential privacy, the noise perturbation may violate properties 1 and 2. To this end, we propose a postprocessing step, in which we compute the survival curve ŝ*(t) that satisfies these properties and best resembles ŝ(t). Similarly to previous work,56,57 we solve this optimization problem with isotonic regression methods (details in the Supplementary Appendix).
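A minimal sketch of this postprocessing step follows, using scikit-learn's isotonic regression as a stand-in for the solver detailed in the Supplementary Appendix (an assumption about tooling, not the authors' code).

```python
# Project a noisy curve onto the set of nonincreasing curves bounded in [0, 1].
import numpy as np
from sklearn.isotonic import IsotonicRegression

def postprocess_curve(noisy_curve):
    t = np.arange(len(noisy_curve))
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, increasing=False)
    return iso.fit_transform(t, noisy_curve)   # closest nonincreasing curve in [0, 1]

print(postprocess_curve([1.02, 0.81, 0.86, 0.55, 0.58, 0.20]))
```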
An illustrative example of our postprocessing step is repor