1 Introduction
Distributed machine learning (ML) has received significant attention in recent years due to the growing complexity of ML models and the increasing computational resources required to train them [11, 33]. One of the most popular distributed ML settings is the parameter server architecture, wherein multiple machines (called workers) jointly learn a single large model on their collective dataset with the help of a trusted server running the stochastic gradient descent (SGD) algorithm [7]. In this scheme, the server maintains an estimate of the model parameters, which is iteratively updated using stochastic gradients computed by the workers.
Compared to its centralized counterpart, distributed SGD is more susceptible to security threats. One of them is the violation of data privacy by an honest-but-curious server [43]. Another is malfunctioning due to the so-called Byzantine behavior of workers [21, 5]. In the past, significant progress has been made in addressing these issues separately. In the former case, differential privacy (DP) has become a dominant standard for preserving privacy in ML, especially when considering neural networks [13, 1]. In the latter case, Byzantine resilience has emerged as the principal notion for demonstrating the Byzantine resilience (BR) of distributed SGD [5, 14]. Since DP and BR are two crucial pillars of distributed machine learning, practitioners will inevitably have to build systems satisfying both these requirements. It is thus natural to ask the question: Can we simultaneously ensure DP and BR in distributed ML?
In this paper, we take a first step towards a positive answer to this question by studying the resilience of the renowned DP-SGD algorithm [1] against Byzantine workers. More precisely, we consider distributed SGD where, in each learning step, the honest workers inject Gaussian noise into their gradients to ensure DP, while the server updates the parameters by applying a BR aggregation rule to the received gradients (to protect against Byzantine workers). Upon analyzing the convergence of this algorithm, we show that DP and BR can indeed be combined, but doing so is non-trivial. Our key contributions are summarized below.
1. Inapplicability of existing results from the BR literature.
We start by highlighting an inherent incompatibility between the supporting theory of BR and the Gaussian mechanism used in DP-SGD. Specifically, we show (in Section 3.2) that the variance-to-norm (VN) condition, critical to guarantee BR, cannot be satisfied when honest workers enforce DP via Gaussian noise injection. Hence, existing results on the resilience of distributed SGD to Byzantine workers are not applicable when considering DP-SGD. More generally, this highlights limitations of many existing Byzantine resilient techniques in settings where the stochasticity of the gradients is non-trivial.
2. Adapting the theory of BR to account for DP.
To overcome the aforementioned shortcoming, we introduce a relaxation of the VN condition (in Section 3.3), namely the approximated VN condition. By doing so, we (1) generalize existing results from the BR literature and (2) demonstrate approximate convergence of DP-SGD under Byzantine faults. Our convergence result can be roughly stated as follows.
Theorem (Informal).
Let be the loss function of the learning model, and be the parameter vector obtained after steps of our algorithm. If the approximated VN condition holds true, then where , and denotes the Euclidean norm.
As the aforementioned result suggests, a smaller ensures better convergence. To quantify this convergence guarantee, we present (in Section 3.4) necessary and sufficient conditions for the approximated VN condition to hold. Specifically, we show that the condition holds only if where , , and denote the model size, batch size, and dataset size respectively. This showcases an important interplay between DP and BR, e.g., larger and lead to stronger resilience to Byzantine workers at the expense of weaker privacy.
3. From theoretical insights to practical convergence.
Importantly, our result (in Section 3.4) provides key insights on how to better integrate standard approaches to DP and BR using hyperparameter optimization (HPO), e.g., by increasing the batch size , or choosing an appropriate aggregation rule. The improvement is illustrated by a snippet of our experimental results in Figure 1. This finding is particularly interesting as these parameters have very little impact in most settings when considering DP or BR separately. We validate our theoretical insights in Section 4 through an exhaustive set of experiments on MNIST and Fashion-MNIST using neural networks.
Closely related prior works
There has been a long line of research on the interplay between DP and other notions of robustness in ML [12, 24, 31, 30, 34, 23, 27]. However, previous approaches do not apply to our setting for two main reasons: (1) they do not address the privacy of the dataset against an honest-but-curious server, and (2) their underlying notions of robustness are either weaker than or orthogonal to BR. Furthermore, recent works on the combination of privacy and BR in distributed learning either study a weaker privacy model than DP or provide only elementary analyses [9, 19, 29, 18]. We refer the interested reader to Appendix A for an in-depth discussion of prior works. In short, we believe the present paper to be the first to provide an in-depth analysis with practical relevance on the integration of DP and BR in distributed learning.
2 Problem Setting and Background
Let be the space of data points. We consider the parameter server architecture with workers owning a common dataset of points. The workers seek to collaboratively compute a parameter vector that minimizes the empirical loss function defined as follows:
(1) 
where is a pointwise loss function. We assume that function is differentiable and admits a nontrivial local minimum. In other words, admits a critical point, but it is not null everywhere. We also make the following standard assumptions.
Assumption 1 (Bounded norm).
There exists a finite real such that for all ,
Assumption 2 (Bounded variance).
There exists a real value such that for all ,
Assumption 3 (Smoothness).
There exists a real value such that for all ,
Assumptions 2 and 3 are classical to most optimization problems in machine learning [6]. Assumption 1 is merely used to avoid unnecessary technicalities, especially when studying differential privacy. In practice, it can be easily enforced by gradient clipping [1].
In an ideal setting, when all the workers are honest (i.e., non-Byzantine) and data privacy is not an issue, a standard approach to solving the above learning problem is the distributed implementation of the stochastic gradient descent (SGD) method. In this algorithm, the server maintains an estimate of the parameter vector which is updated iteratively by using the average of the gradient estimates sent by the workers. However, this algorithm is vulnerable to both privacy and security threats.
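The ideal-setting procedure described above can be sketched as follows. This is a minimal illustration and not the implementation used in the paper: the toy quadratic loss, the function names (`distributed_sgd`, `sq_grad`) and all numeric values are assumptions chosen only to make the example self-contained.

```python
import numpy as np

def distributed_sgd(grad_fn, theta0, data, n_workers=4, batch_size=8,
                    lr=0.1, steps=200, seed=0):
    """Parameter-server SGD: workers send minibatch gradients, server averages."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(steps):
        grads = []
        for _ in range(n_workers):
            # Each worker samples a minibatch from the common dataset.
            idx = rng.choice(len(data), size=batch_size, replace=False)
            grads.append(grad_fn(theta, data[idx]))
        # Server update: plain average of the workers' gradient estimates.
        theta -= lr * np.mean(grads, axis=0)
    return theta

def sq_grad(theta, batch):
    # Gradient of the batch average of 0.5 * ||theta - x||^2.
    return np.mean(theta - batch, axis=0)

# Toy problem (assumed): the minimiser of the empirical loss is the data mean.
data = np.random.default_rng(1).normal(loc=3.0, size=(256, 2))
theta_hat = distributed_sgd(sq_grad, np.zeros(2), data)
```

Because the server trusts a plain average, a single worker sending a very large vector can move `theta` arbitrarily, and the raw gradients are visible to the server; these are the two threats discussed next.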
Threat model.
We consider the server to be honest-but-curious, and that some of the workers are Byzantine. An honest-but-curious server follows the prescribed algorithm correctly, but may infer sensitive information about workers’ data using their gradients and any other additional information that can be gathered during the learning as demonstrated by [43]. On the other hand, Byzantine workers need not follow the prescribed algorithm correctly and can send arbitrary gradients. For instance, they may either crash or even send adversarial gradients to prevent convergence of the algorithm [5].
2.1 Distributed SGD with Differential Privacy
Over the last decade, differential privacy (DP) has become a gold standard in privacy-preserving data analysis [13]. Intuitively, a randomized algorithm is said to preserve DP if its executions on two adjacent datasets are indistinguishable. More formally, two datasets and are said to be adjacent if they differ by at most one sample. Then, DP is defined as follows.
Definition 1 (DP).
Let , and an arbitrary output space. A randomized algorithm is differentially private if for any two adjacent datasets , and any possible set of outputs ,
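The inequality of Definition 1 did not survive in this copy. For reference, the standard form of the differential privacy inequality (see, e.g., [13]) is given below. The symbols are notational assumptions made here: $\mathcal{A}$ for the randomized algorithm, $D$ and $D'$ for the two adjacent datasets, $O$ for the set of outputs, and $\epsilon$, $\delta$ for the privacy parameters.

```latex
\mathbb{P}\left[\mathcal{A}(D) \in O\right] \;\le\; e^{\epsilon}\, \mathbb{P}\left[\mathcal{A}(D') \in O\right] + \delta
```

Smaller values of $\epsilon$ and $\delta$ make the two output distributions harder to distinguish, i.e., they correspond to a stronger privacy guarantee.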
By far, the most widely used approach to ensure DP in machine learning is to use the differentially private version of SGD, called DP-SGD [32, 4, 1]. The distributed implementation of this scheme against an honest-but-curious server consists, at every step, in making the honest workers add Gaussian noise with variance to their stochastic gradients before sending them to the server. When is chosen appropriately (e.g., see Theorem 1), each learning step satisfies DP at the worker level. Finally, the privacy guarantee of the overall learning procedure is obtained by using the composition property of DP [20, 1, 37]. However, we are mainly interested in studying the impact of per-step and per-worker privacy budget on the resilience of the algorithm to Byzantine workers.
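A worker-side step of this kind can be sketched as below. This is a hedged illustration only: the per-sample clipping order, the function name `private_gradient`, and the parameters `clip_norm` and `noise_std` are assumptions introduced here, and the noise level is passed in directly instead of being derived from a privacy budget.

```python
import numpy as np

def private_gradient(per_sample_grads, clip_norm, noise_std, rng):
    """Clip each per-sample gradient, average, then add Gaussian noise."""
    g = np.asarray(per_sample_grads, dtype=float)
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    # Scale down any gradient whose norm exceeds clip_norm (gradient clipping).
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    mean_clipped = (g * scale).mean(axis=0)
    # Isotropic Gaussian noise; noise_std is the per-coordinate standard deviation.
    return mean_clipped + rng.normal(0.0, noise_std, size=mean_clipped.shape)

rng = np.random.default_rng(0)
grads = rng.normal(size=(32, 5)) * 10.0   # 32 samples, 5 parameters (toy values)
noiseless = private_gradient(grads, clip_norm=1.0, noise_std=0.0, rng=rng)
noisy = private_gradient(grads, clip_norm=1.0, noise_std=0.5, rng=rng)
```

Clipping bounds the norm of the averaged gradient by `clip_norm`, which is what makes a finite noise level sufficient; the vector sent to the server is `noisy`.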
2.2 Byzantine Resilience of Distributed SGD
In the presence of Byzantine workers, the server can no longer rely on the average of workers’ gradients to update the model parameters. Instead, it uses a gradient aggregation rule (GAR) that is resilient to incorrect gradients that may be sent by at most Byzantine workers. A standard notion for defining this resilience is Byzantine resilience stated below, which was originally proposed by [5].
Definition 2 (Byzantine resilience).
Let , and . Consider random vectors among which at least are i.i.d. from a common distribution . Let be a random vector characterizing this distribution. A GAR is said to be Byzantine resilient for if its output satisfies the following two properties:


, and

for any , is upper bounded by a linear combination of where .
This condition has been shown critical to ensure convergence of the distributed SGD algorithm in the presence of up to Byzantine workers [5, 14]. Thus, it serves as an excellent starting point for studying the Byzantine resilience of distributed DP-SGD. Consequently, we consider the algorithm where the server implements a Byzantine robust GAR while the honest workers follow instructions prescribed in DP-SGD.
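To make the notion of a GAR concrete, the sketch below implements Minimum-Diameter Averaging (MDA), the rule considered later in Section 3.4, in its usual formulation: among all subsets of n - f received gradients, pick the one with the smallest diameter and average it. The function name `mda` and the toy vectors are assumptions, and the brute-force subset search is meant for illustration only since its cost grows combinatorially with the number of workers.

```python
from itertools import combinations
import numpy as np

def mda(gradients, f):
    """Average the (n - f)-subset of gradients with the smallest diameter."""
    g = np.asarray(gradients, dtype=float)
    n = len(g)
    # Pairwise Euclidean distances between all submitted gradients.
    dist = np.linalg.norm(g[:, None, :] - g[None, :, :], axis=2)
    best_subset, best_diam = None, np.inf
    for subset in combinations(range(n), n - f):
        idx = list(subset)
        diam = dist[np.ix_(idx, idx)].max()   # largest distance inside the subset
        if diam < best_diam:
            best_subset, best_diam = idx, diam
    return g[best_subset].mean(axis=0)

honest = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 1.2]])
byzantine = np.array([[100.0, -100.0]])
aggregate = mda(np.vstack([honest, byzantine]), f=1)
```

In this toy case the outlier is excluded and `aggregate` equals the average of the four honest vectors, whereas a plain average would be dominated by the Byzantine vector.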
3 Combining Differential Privacy and Byzantine Resilience
Algorithm 1, described below, combines the standard techniques for DP and BR in distributed SGD. Given a GAR and a noise injection parameter , Algorithm 1 computes steps of distributed DP-SGD with as an aggregation rule at the server to guarantee BR.
(2) 
(3) 
(4) 
Note that when , i.e., when no noise is injected, Algorithm 1 reduces to a classical Byzantine resilient distributed SGD algorithm as presented in prior works such as Blanchard et al. [5] and El Mhamdi et al. [14]. Furthermore, when and is the average function, it reduces to a distributed implementation of the well-known DP-SGD scheme, e.g., from [1].
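The listing of Algorithm 1 is not reproduced in this copy. The following toy loop sketches the structure described in the text: honest workers send noisy gradients, and the server applies an aggregation rule before a gradient step. Everything specific here is an assumption made for illustration: the quadratic loss, the coordinate-wise median used as a stand-in GAR, the constant vectors sent by the Byzantine workers, and all numeric values. Gradient clipping and momentum are omitted.

```python
import numpy as np

def run_sketch(gar, noise_std, n=7, f=2, dim=3, batch=16, lr=0.2, steps=300, seed=0):
    """Toy loop: honest workers send noisy gradients, Byzantine ones send junk."""
    rng = np.random.default_rng(seed)
    data = rng.normal(loc=2.0, scale=0.5, size=(512, dim))
    theta = np.zeros(dim)
    for _ in range(steps):
        sent = []
        for _ in range(n - f):  # honest workers
            x = data[rng.choice(len(data), size=batch, replace=False)]
            grad = np.mean(theta - x, axis=0)   # gradient of 0.5 * ||theta - x||^2
            sent.append(grad + rng.normal(0.0, noise_std, size=dim))  # injected noise
        for _ in range(f):      # Byzantine workers: arbitrary vectors
            sent.append(np.full(dim, 1e3))
        theta = theta - lr * gar(np.array(sent))   # server-side update
    return theta, data.mean(axis=0)

def coordinate_median(grads):
    # Stand-in aggregation rule, chosen only for brevity.
    return np.median(grads, axis=0)

theta_med, target = run_sketch(coordinate_median, noise_std=0.1)
theta_avg, _ = run_sketch(lambda g: g.mean(axis=0), noise_std=0.1)
```

With the median, the iterate stays close to the minimiser `target` despite the two Byzantine workers, although with a small bias caused by the one-sided attack; with plain averaging it is driven far away. Setting `noise_std=0.0` recovers the noise-free case mentioned above.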
3.1 Differential Privacy Guarantee
Intuitively, Algorithm 1 should inherit the privacy guarantees of DP-SGD. Indeed, the privacy-preserving scheme applied at the worker level is the same and will not be altered by the GAR thanks to the post-processing property of DP [13]. Then, owing to previous works, we can easily show that Algorithm 1 satisfies DP at each step and for each honest worker when . Furthermore, as shown in Theorem 1, we can obtain a much tighter analysis using advanced analytical tools such as privacy amplification via subsampling [2].
Theorem 1.
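For orientation, the classical calibration of the Gaussian mechanism from [13] can be computed as below. This is the textbook bound, valid for a privacy parameter in (0, 1), and not the tighter statement of Theorem 1, which is not reproduced in this copy. The sensitivity expression `2 * clip / batch` and all names and numeric values are assumptions made for this sketch.

```python
import math

def gaussian_noise_std(epsilon, delta, l2_sensitivity):
    """Classical Gaussian-mechanism calibration, valid for 0 < epsilon < 1."""
    if not (0.0 < epsilon < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("this bound assumes epsilon and delta in (0, 1)")
    return math.sqrt(2.0 * math.log(1.25 / delta)) * l2_sensitivity / epsilon

# Assumed sensitivity of an average of `batch` gradients clipped to norm `clip`:
# replacing one sample moves the average by at most 2 * clip / batch.
clip, batch = 1.0, 50
std = gaussian_noise_std(epsilon=0.5, delta=1e-5, l2_sensitivity=2 * clip / batch)
std_large_batch = gaussian_noise_std(epsilon=0.5, delta=1e-5,
                                     l2_sensitivity=2 * clip / (2 * batch))
```

Under this assumed sensitivity, doubling the batch size halves the required noise standard deviation, which gives a first intuition for why the batch size matters once noise is injected.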
3.2 Inapplicability of Existing Results from the BR Literature
As discussed in Section 2, prior works on BR can demonstrate the convergence of Algorithm 1 if the GAR is Byzantine resilient during the entire learning process. However, verifying the validity of BR is nearly impossible as the condition depends upon the gradients of the Byzantine workers that can be arbitrary [5]. The only verifiable condition known in the literature to guarantee BR is the variance-to-norm (VN) condition, which is defined as follows [14].
Definition 3 (VN Condition).
For a parameter vector , let denote the random vector characterizing the gradients sent by the honest workers to the server at . A GAR satisfies the VN condition if for any such that has a nonzero mean,
where is the multiplicative constant of GAR that depends on and . (Footnote 1: Precise values of for most popular GARs can be found in Appendix B.)
This condition means that for a GAR to guarantee convergence for the procedure, the distribution of the gradient estimates at parameter must be "well-behaved". For instance, if the norm of the expected stochastic gradients converges to zero then so should the variance. Note that in the case of Algorithm 1, from (2) we obtain that for any ,
(5) 
where is a set of data points sampled randomly without replacement from , and . Thus, the VN condition can no longer be satisfied whenever , i.e., workers follow instructions prescribed in DP-SGD. We show this formally in Proposition 1 below.
Proposition 1.
Note that when and are nonzero, we will have as explained in Section 3.1. Accordingly, Proposition 1 means that prior results on the convergence of existing Byzantine resilient GARs, including the works by Blanchard et al. [5] and El Mhamdi et al. [14], are no longer valid when enforcing any nonzero level of DP. Although the VN condition is only a sufficient one, due to the lack of necessary conditions in the literature, it is the most widely used tool for proving BR, e.g., see Blanchard et al. [5], El Mhamdi et al. [14], Xie et al. [40], El-Mhamdi et al. [15], Boussetta et al. [8]. Hence, Proposition 1 highlights an inherent limitation of the theory of BR, especially when simultaneously enforcing DP via noise injection.
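The intuition behind this incompatibility can be checked numerically. The sketch below estimates the ratio between the variance of a noisy gradient and the squared norm of its mean, for a fixed noise level and a shrinking true gradient. The exact inequality of Definition 3 is not available in this copy, so the ratio used here is only an assumed proxy for the quantity the VN condition bounds; the function name and all values are likewise illustrative.

```python
import numpy as np

def variance_to_norm_ratio(true_grad, noise_std, n_samples=20000, seed=0):
    """Monte Carlo estimate of E||G - E[G]||^2 / ||E[G]||^2 for G = grad + noise."""
    rng = np.random.default_rng(seed)
    g = np.asarray(true_grad, dtype=float)
    samples = g + rng.normal(0.0, noise_std, size=(n_samples, g.size))
    mean = samples.mean(axis=0)
    variance = np.mean(np.sum((samples - mean) ** 2, axis=1))
    return variance / np.sum(mean ** 2)

direction = np.ones(10) / np.sqrt(10.0)   # unit vector in 10 dimensions
ratios = [variance_to_norm_ratio(scale * direction, noise_std=0.1)
          for scale in (1.0, 0.1, 0.01)]
```

The injected noise contributes a variance of roughly the dimension times `noise_std` squared regardless of the true gradient, so the ratio grows without bound as the gradient norm shrinks near a critical point. Any fixed upper bound on such a ratio is therefore eventually violated.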
3.3 Adapting the Theory of BR to Account for DP
To circumvent the aforementioned limitation, we propose a relaxation of the theory of BR by relaxing the original VN condition to the approximated VN condition defined below.
Definition 4 (approximated VN condition).
Let denote the random vector characterizing the gradients sent by the honest workers to the server at parameter vector . For , a GAR satisfies the approximated VN condition if for all such that ,
where is the multiplicative constant of GAR that depends on and .
Definition 4 relaxes the initial VN condition by allowing a subset of (possible) parameter vectors to violate the inequality in Definition 3. In particular, as , when the gradients are sufficiently close to a local minimum, or , the inequality need not be satisfied. While the approximated VN condition is a natural extension of Definition 3, it enables us to study cases where the distribution of the gradients at is nontrivial, e.g., the variance of need not vanish when approaches . Consequently, we can utilize this new criterion to analyze the convergence of Algorithm 1 for different GARs and levels of privacy. Assuming approximated VN condition, we show in Theorem 2 the approximate convergence of Algorithm 1.
Theorem 2.
According to Theorem 2, Algorithm 1 can compute a parameter for which in expectation with a rate of . In other words, when the loss function is regularized (see, e.g., Bottou et al. [6]), it finds an approximate local minimum with an error proportional to . Note that, when (i.e., when DP is not enforced), the above result encapsulates the existing convergence results from the BR literature, e.g., Blanchard et al. [5], El Mhamdi et al. [14].
Remark 1.
For generality, we do not provide the exact values for parameters and in Theorem 2. These two constants depend on the learning scheme that is applied, in particular the resilience properties of the GAR used. However, since these parameters are constant throughout the learning procedure, keeping them to be generic does not affect our conclusions on the asymptotic error.
3.4 Studying the Interplay between DP and BR
The value of is intrinsically linked to the amount of noise that workers inject into the procedure. In a way, it represents the impact of per-worker DP on the resilience of Algorithm 1 to Byzantine workers. To quantify this impact, we present in Proposition 2 sufficient and necessary conditions for a GAR to satisfy the approximated VN condition in the context of Algorithm 1.
Proposition 2.
Let , . Consider Algorithm 1 with privacy budget and GAR with multiplicative constant . Then, the following assertions hold true.

The above result, in conjunction with Theorem 2, presents a convergence guarantee that can be obtained by distributed DP-SGD under Byzantine faults. In particular, we have the following corollary of Theorem 2 and Proposition 2.
Corollary 1.
Corollary 1 quantifies the impact of different parameters on the convergence of the algorithm. For instance, we observe that larger values of and , i.e., weaker DP guarantees, imply smaller worst-case convergence error and therefore, better guarantee of learning. But importantly, it also shows how the convergence guarantee of the algorithm depends upon other hyperparameters, namely the batch size , the number of parameters , and the multiplicative constant of the GAR. Let us for example take the case of the batch size below.
Impact of batch size. We consider the specific GAR of Minimum-Diameter Averaging (MDA) for which [14]. Then, from Corollary 1, we obtain that
From above, we note that when parameters and are in the interval and , i.e., both DP and BR are enforced, then increasing the batch size indeed reduces the asymptotic convergence error of the algorithm. However, this is not the case when we consider DP and BR separately. When all workers are honest, , which implies . The algorithm then asymptotically converges to a local minimum regardless of the batch size used. On the other hand, when the workers do not obfuscate their gradients (), the approximated VN condition holds true for . Then, the asymptotic convergence error of the algorithm is again independent of the batch size. To conclude, the batch size plays a crucial role in improving the learning accuracy when enforcing DP and BR simultaneously, but it should have little influence when considering them individually.
Remark 2.
Although Corollary 1 provides some useful insights on improving the accuracy of the learning algorithm combining DP and BR, it need not be tight as it only provides an upper bound relying on a sufficient condition, namely the approximated VN condition. It turns out that providing a non-trivial lower bound for distributed SGD in the presence of Byzantine faults remains an open problem, even without DP. In spite of this, we show the practical relevance of the insights obtained from Corollary 1 through an exhaustive set of experiments in the subsequent section.
4 Numerical Experiments
The goal of our experiments is to investigate whether our theoretical insights are actually applicable in practice and whether hyperparameter optimization (HPO) can improve the integration of DP and BR. Accordingly, we assess the impact of varying different hyperparameters on the training losses and top-1 cross-accuracies of a neural network under DP and attacks from Byzantine workers over a maximum of learning steps.
4.1 Experimental Setup
Datasets.
Architecture and fixed hyperparameters.
We consider a feed-forward neural network composed of two fully-connected linear layers of respectively 784 and 100 inputs (for a total of parameters) and terminated by a softmax layer of 10 dimensions. ReLU is used between the two linear layers. We use the Cross Entropy loss, a total number of workers , Polyak’s momentum of at the workers, a constant learning rate of , and a clipping parameter . We also add an regularization factor of . Note that some of these constants are reused from the literature on BR, especially from Baruch et al. [3], Xie et al. [41], El-Mhamdi et al. [16].
Varying hyperparameters for HPO.
For both datasets, we vary the batch size within {25, 50, 150, 300, 500, 750, 1000, 1250, 1500}, the per-step and per-worker privacy parameter in ( is fixed to ), the number of Byzantine workers in as well as the attack they implement (little from Baruch et al. [3] and empire from Xie et al. [41]). We also vary the Byzantine resilient GAR in . Note that due to its large computational cost, we only use the Bulyan aggregation rule when .
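As a rough illustration of the two attacks named above, the sketch below follows their usual descriptions: little shifts the honest mean by a multiple of the coordinate-wise standard deviation [3], and empire sends a negatively rescaled honest mean [41]. The function names, the parameters `z` and `eps`, their default values, and the toy gradients are assumptions; the exact variants and attack strengths used in the experiments are given in the appendices and are not reproduced here.

```python
import numpy as np

def little_attack(honest_grads, z=1.0):
    """Shift the honest mean by z coordinate-wise standard deviations."""
    g = np.asarray(honest_grads, dtype=float)
    return g.mean(axis=0) - z * g.std(axis=0)

def empire_attack(honest_grads, eps=1.1):
    """Send a negatively rescaled copy of the honest mean."""
    g = np.asarray(honest_grads, dtype=float)
    return -eps * g.mean(axis=0)

honest = np.array([[1.0, 2.0], [3.0, 2.0], [2.0, 5.0]])
little_vec = little_attack(honest)
empire_vec = empire_attack(honest)
```

Both attacks assume that the Byzantine workers can estimate the statistics of the honest gradients. The little attack stays within the spread of honest gradients, which makes it harder for distance-based aggregation rules to filter out.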
Each of the 432 possible combinations of these hyperparameters is run 5 times using seeds from 1 to 5 (for reproducibility purposes), totalling 2160 runs. Each run satisfies DP at every step under attacks from Byzantine workers. To assess the impact of the privacy noise alone, we also run the experiments specified above with the averaging GAR and without Byzantine workers (denoted by “No attack”). These experiments account for another 27 combinations, totalling 135 additional runs. Overall, we performed a comprehensive set of 2295 runs for which we provide a brief summary below. More details on the experimental setup and results can be found in Appendices D and E.
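The counts in this setup can be checked with a few lines of arithmetic. The parameter count below assumes that the network described above is a 784 -> 100 -> 10 fully-connected network with bias terms in both layers; that reading of the architecture is an assumption, since the total is missing from this copy.

```python
# Parameter count of a 784 -> 100 -> 10 fully-connected network with biases
# (assumed reading of the architecture described in the setup).
layer_sizes = [(784, 100), (100, 10)]
n_params = sum(n_in * n_out + n_out for n_in, n_out in layer_sizes)

# Run counts reported in the text: combinations times number of seeds.
seeds = 5
attack_runs = 432 * seeds
no_attack_runs = 27 * seeds
total_runs = attack_runs + no_attack_runs
```

Under that assumption the model has 79510 parameters, and the run totals reproduce the figures stated above (2160, 135 and 2295).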
4.2 Experimental Results
In Figure 2, we give a snapshot of our results by showcasing 4 characteristic outcomes encountered. Below, we present further characterization of them. Besides validating our theoretical insights on the impact of the batch size and the GAR selection on the convergence of Algorithm 1, these plots also showcase the threat scenarios in which hyperparameter optimization (HPO) has the most impact. Note that the little attack was more damaging than empire in our experiments; hence in our discussion below, we consider little to be a stronger threat than empire, ceteris paribus.


Strongest threat scenario (top left). We consider little with and , i.e., the strongest level of attack and privacy we implemented. In this stringent scenario, the algorithm fails to deliver good learning accuracy under Byzantine attacks. Although increasing the batch size helps improve the convergence, the accuracy remains quite poor (well below , even when ).

Relaxed threat scenario (bottom left). Here, we keep and , but we trade the attack for a weaker one (empire). This scenario validates our intuition on the advantage of increasing the batch size, but it mostly highlights the impact of GAR selection. Different GARs differ significantly in their maximum cross-accuracies, with MDA performing the best.

Mild threat scenario (top right): We now consider and , i.e., a weaker privacy guarantee and fewer Byzantine workers. However, we revert to the little attack. We see that, for all GARs, increasing the batch size significantly improves the maximum cross-accuracy. The choice of GAR also impacts the performance, with Bulyan being the best.

Weakest threat scenario (bottom right): We consider empire with and . The threat is so weak that all GARs perform almost the same. Although HPO still helps to obtain a better accuracy, it is not critical in this setting.
Main Takeaway.
Our empirical results show that training a feed-forward neural network under both DP and BR is possible but expensive in some settings. Indeed, in the non-trivial threat scenarios, to achieve the same maximum cross-accuracy as DP-SGD with , we need a per-worker batch size , i.e., times larger than in the Byzantine-free setting. Moreover, depending upon the setting, the selection of the GAR might be more influential than the batch size. Finally, note that in the Byzantine-free setting, the DP-SGD algorithm obtains reasonable cross-accuracies (close to ) for most batch sizes considered. This validates our theoretical findings (discussed in Section 3.4) that the batch size has a more significant impact when combining DP and BR compared to when enforcing DP alone. Similar observations on the negligible impact of the batch size in the privacy-free setting (but under Byzantine attacks) can be found in Appendix E.
5 Conclusion & Open Problems
In this paper, we have studied the integration of standard approaches to DP and BR, namely the distributed implementation of the popular DP-SGD protocol in conjunction with BR GARs. Upon highlighting the limitations of the existing theory of BR when applied to this algorithm, we have proposed a generalization of this theory. By doing so, we have (1) quantified the impact of DP on BR, and (2) proposed an HPO scheme to effectively combine DP and BR. Our results have shown that DP and BR can be combined but at the expense of computational cost in some settings.
Our generalization of the theory of BR is also of independent interest. Specifically, we have proposed a relaxation of the VN condition, the approximated VN condition. Although the VN condition is quite stringent and only sufficient, it is consistently relied upon to design and study different Byzantine resilient GARs [5, 14, 15, 16, 8]. Hence, our convergence result, obtained using the relaxed approximated VN condition, supersedes many existing results in the literature of BR.
Interestingly, we have observed through our experiments that even when the relaxed approximated VN condition is violated, the algorithm obtains reasonable learning accuracy. This observation opens two interesting problems expounded below.


A theoretical problem: The VN condition (either approximated or not) is not tight enough to fully characterize BR. That is, in some cases, a GAR may be BR without satisfying the VN condition. Furthermore, the theory of BR focuses on "worst-case" attacks that, for now, might not be achievable in practice. Hence, the question on the tightness of the VN condition for any specific attack, even without DP, remains open.

An empirical problem: The practice of BR focuses on state-of-the-art realizable attacks. These attacks are arguably suboptimal, which explains why we can obtain reasonable learning accuracy despite the violation of the VN condition. This also calls for designing better (or stronger) attacks.
Finally, while we have focused on adapting the theory of BR to make it more compatible with the standard DP-SGD algorithm, an alternate future direction could be to investigate other DP mechanisms that may comply better with classical approaches to BR, while preserving DP guarantees.
References
 [1] (2016) Deep learning with differential privacy. pp. 308–318. Cited by: Appendix A, §D.2, §1, §1, §2.1, §2, §3.
 [2] (2018) Privacy amplification by subsampling: tight analyses via couplings and divergences. Red Hook, NY, USA, pp. 6280–6290. Cited by: §C.1, §3.1, Lemma 2.
 [3] (2019) A little is enough: circumventing defenses for distributed learning. Cited by: §D.3, Figure 1, §4.1, §4.1.
 [4] (2014) Private empirical risk minimization: efficient algorithms and tight error bounds. pp. 464–473. External Links: Document Cited by: §2.1.
 [5] (2017) Machine learning with adversaries: byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 119–129. External Links: Link Cited by: Appendix A, §B.1, §C.4, §1, §2, §2.2, §2.2, §3.2, §3.2, §3.3, §3, §5.
 [6] (2018) Optimization methods for large-scale machine learning. SIAM Review 60 (2), pp. 223–311. Cited by: §2, §3.3.
 [7] (2010) Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, Y. Lechevallier and G. Saporta (Eds.), Heidelberg, pp. 177–186. External Links: ISBN 9783790826043 Cited by: §1.
 [8] (2021) AKSEL: fast byzantine sgd. Cited by: Appendix A, §3.2, §5.
 [9] (2018) When machine learning meets blockchain: a decentralized, privacy-preserving and secure design. pp. 1178–1187. Cited by: Appendix A, §1.

 [10] (2021-05) Differentially private stochastic coordinate descent. Proceedings of the AAAI Conference on Artificial Intelligence 35 (8), pp. 7176–7184. External Links: Link Cited by: Appendix A.
 [11] (2012) Large scale distributed deep networks. External Links: Link Cited by: §1.
 [12] (2009) Differential privacy and robust statistics. New York, NY, USA, pp. 371–380. External Links: ISBN 9781605585062, Link, Document Cited by: Appendix A, §1.
 [13] (2014) The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9 (3–4), pp. 211–407. Cited by: §1, §2.1, §3.1, Lemma 1.
 [14] (2018, 10–15 Jul) The hidden vulnerability of distributed learning in Byzantium. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 3521–3530. External Links: Link Cited by: Appendix A, §B.2, §B.4, §C.4, §1, §2.2, §3.2, §3.2, §3.3, §3.4, §3, §5.
 [15] (2020) Genuinely distributed byzantine machine learning. In Proceedings of the 39th Symposium on Principles of Distributed Computing, PODC ’20, New York, NY, USA. External Links: ISBN 9781450375825, Link, Document Cited by: Appendix A, §B.2, §3.2, §5.
 [16] (2021) Distributed momentum for byzantine-resilient stochastic gradient descent. External Links: Link Cited by: item 1, §4.1, §5.
 [17] (2018) Privacy-preserving distributed learning via obfuscated stochastic gradients. pp. 184–191. External Links: Document Cited by: Appendix A.
 [18] (2021) Differential privacy and byzantine resilience in sgd: do they add up?. New York, NY, USA, pp. 391–401. External Links: ISBN 9781450385480, Link, Document Cited by: Appendix A, §1.
 [19] (2020) Secure byzantine-robust machine learning. External Links: 2006.04747 Cited by: Appendix A, §1.
 [20] (2015, 07–09 Jul) The composition theorem for differential privacy. Lille, France, pp. 1376–1385. External Links: Link Cited by: §D.2, §2.1.
 [21] (1982-07) The byzantine generals problem. ACM Trans. Program. Lang. Syst. 4 (3), pp. 382–401. External Links: ISSN 0164-0925, Link, Document Cited by: §1.
 [22] (2010) MNIST handwritten digit database. Note: http://yann.lecun.com/exdb/mnist/ External Links: Link Cited by: §4.1.
 [23] (2019) Certified robustness to adversarial examples with differential privacy. pp. 656–672. External Links: Document Cited by: Appendix A, §1.
 [24] (201907) Data poisoning against differentiallyprivate learners: attacks and defenses. pp. 4732–4738. External Links: Document, Link Cited by: Appendix A, §1.
 [25] (2020) Toward robustness and privacy in federated learning: experimenting with local and central differential privacy. ArXiv abs/2009.03561. Cited by: Appendix A.

 [26] (2021) Opacus PyTorch library. Note: Available from opacus.ai Cited by: §D.2.
 [27] (2019) A unified view on differential privacy and robustness to adversarial examples. arXiv preprint arXiv:1906.07982. Cited by: Appendix A, §1.
 [28] (2015) Privacy-preserving deep learning. pp. 909–910. External Links: Document Cited by: Appendix A.
 [29] (2020) Byzantine-resilient secure federated learning. External Links: 2007.11115 Cited by: Appendix A, §1.
 [30] (2019) Membership inference attacks against adversarially robust deep learning models. pp. 50–56. External Links: Document Cited by: Appendix A, §1.
 [31] (2019) Privacy risks of securing machine learning models against adversarial examples. New York, NY, USA, pp. 241–257. External Links: ISBN 9781450367479, Document Cited by: Appendix A, §1.
 [32] (2013) Stochastic gradient descent with differentially private updates. pp. 245–248. External Links: Document Cited by: Appendix A, §2.1.
 [33] (2015) Training very deep networks. External Links: Link Cited by: §1.
 [34] (2019) Can you really backdoor federated learning?. CoRR abs/1911.07963. External Links: Link, 1911.07963 Cited by: Appendix A, §1.
 [35] (2019-04) Privacy-preserving distributed deep learning via homomorphic re-encryption. Electronics 8, pp. 411. External Links: Document Cited by: Appendix A.
 [36] (2020) Attack of the tails: yes, you really can backdoor federated learning. External Links: Link Cited by: Appendix A.

 [37] (2019, 16–18 Apr) Subsampled renyi differential privacy and analytical moments accountant. pp. 1226–1235. External Links: Link Cited by: §2.1.
 [38] (2017-08-28) Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: §4.1.
 [39] (2018) Generalized byzantine-tolerant sgd. External Links: 1802.10116 Cited by: Appendix A.
 [40] (2018) Phocas: dimensional byzantine-resilient stochastic gradient descent. External Links: 1805.09682 Cited by: Appendix A, §3.2.
 [41] (2019) Fall of empires: breaking byzantine-tolerant SGD by inner product manipulation. pp. 83. Cited by: §D.3, §4.1, §4.1.
 [42] (2018, 10–15 Jul) Byzantine-robust distributed learning: towards optimal statistical rates. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 5650–5659. External Links: Link Cited by: Appendix A, §B.3.
 [43] (2019) Deep leakage from gradients. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 14774–14784. External Links: Link Cited by: §1, §2.
Appendix A Related Work
Privacy.
In the past, significant attention has been given to protecting data privacy for both centralized [32, 10, 1] and distributed SGD [28, 25]. Although several techniques for data protection exist, such as the encryption [35] or obfuscation [17] of gradients, the most standard approach consists in adding DP noise to the gradients computed by the workers [1, 32, 28, 25], which is what we consider. However, these works only consider a fault-free setting where all workers are assumed to be honest.
Byzantine resilience.
In a separate line of research, several other works have designed Byzantine-resilient schemes for distributed SGD in the parameter-server architecture [5, 14, 42, 39, 40, 15, 8]. Nevertheless, in these papers, the training data is not protected, meaning that their methods do not consider the privacy threat associated with sharing unencrypted gradients with the server.
Combining privacy and BR.
Although scarce, there has been some work on tackling the problem of combining privacy and BR. For instance, He et al. [19] consider this problem for a different framework that includes two honest-but-curious non-colluding servers, a strong assumption that does not always hold in practice. Furthermore, their additive secret sharing scheme is rendered ineffective in our setting where there is a single honest-but-curious server that obtains information from all the workers. In the context of privacy, the single-server setting generalizes the multi-server setting with colluding servers. Another related work, the BREA framework, proposes the use of verifiable secret sharing amongst workers [29]. However, the presented privacy scheme scales more poorly than DP mechanisms, and is infeasible in most distributed ML settings with no inter-worker communication. Chen et al. [9] propose the LearningChain framework that is claimed to combine DP and BR. However, LearningChain is an experimental method, and Chen et al. do not provide any formal guarantees either on the resilience or on the convergence of the proposed algorithm.
Recently, Guerraoui et al. [18] studied the problem of satisfying both DP and BR in a single-server distributed SGD framework. While they demonstrate the computational hardness of this problem in practice, we go further by showing an inherent incompatibility between the supporting theory of BR and the Gaussian mechanism from DP. Moreover, our approximate convergence result generalizes the prior works on BR. This generalization is critical to quantifying the interplay between DP and BR. Importantly, while Guerraoui et al. [18] only give an elementary analysis explaining the difficulty of the problem, we show that a careful analysis can help combine DP and BR.
Studying the interplay between DP and other notions of robustness.
There has been a long line of work studying the interplay and mutual benefits of DP and robustness to data corruption in the centralized learning setting [12, 24]. However, these works do not consider the problem of a distributed scenario with an honest-but-curious server, and they are not applicable to our setting. Furthermore, data corruption is actually a weaker threat than BR as the adversary cannot select its gradients online to disrupt the learning process.
Recently, there has been some work on the interplay between DP and robustness to evasion attacks (a.k.a. adversarial examples). Interestingly, some findings in that line of research are similar to ours. DP and robustness to adversarial examples have been demonstrated to be very close from a high-level theoretical point of view even if their semantics are very different [23, 27]. However, some recent works have pointed out that these two notions might be conflicting in some settings [30, 31]. It is however worth noting that BR and robustness to adversarial examples are two orthogonal concepts. In particular, the robustness of a model (at testing time) to evasion attacks does not provide any guarantee on the robustness (at training time) to Byzantine behaviors. Similarly, as BR focuses on the training (optimization) procedure, we can always train models using a Byzantine-resilient aggregation rule but without obtaining robustness to evasion attacks. The connection between these two notions of robustness remains an open problem.
Appendix B Standard GARs With Associated Multiplicative Constants
In this section, we present the different GARs used in our experiments, along with their associated VN conditions (Definition 3) and multiplicative constants .
B.1 Krum
Krum is an aggregation rule introduced under the assumption that . It consists in selecting the gradient which has the smallest mean squared distance, where the mean is computed over its closest gradients [5]. Formally, let be the gradients received by the parameter server. For any and , we denote by the fact that is amongst the closest vectors (in distance) to within the submitted gradients. Krum assigns to each a score
(6) 
and outputs the gradient with the lowest score. Blanchard et al. [5] prove that Krum is Byzantine resilient, assuming that the following VN condition is satisfied:
(7) 
Therefore, the multiplicative constant for Krum is
(8) 
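The selection step described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `krum`, the argument `f` (the assumed number of Byzantine workers), the requirement n >= 2f + 3, and the choice of n - f - 2 nearest neighbours are assumptions taken from the usual description of Krum in Blanchard et al. [5], since the corresponding symbols are missing from the text here.

```python
import numpy as np


def krum(gradients, f):
    """Return the submitted gradient with the lowest Krum score.

    gradients: array-like of shape (n, d), one row per worker.
    f: assumed upper bound on the number of Byzantine workers.
    """
    g = np.asarray(gradients, dtype=float)
    n = g.shape[0]
    if n < 2 * f + 3:
        # Assumption: the usual Krum requirement on n and f.
        raise ValueError("this sketch assumes n >= 2f + 3")
    k = n - f - 2  # assumed number of closest neighbours per score

    # Pairwise squared Euclidean distances, shape (n, n).
    diff = g[:, None, :] - g[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)

    scores = np.empty(n)
    for i in range(n):
        others = np.delete(sq_dist[i], i)  # drop the distance to itself
        # Mean squared distance to the k closest other gradients.
        scores[i] = np.sort(others)[:k].mean()
    return g[np.argmin(scores)]
```

With four clustered gradients and one far outlier, the function returns one of the clustered gradients, because the outlier's nearest neighbours are all far away from it.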
B.2 Minimum-Diameter Averaging (MDA)
MDA is an aggregation rule introduced under the assumption that . It outputs the average of the most clumped gradients among the received ones [14, 15]. Formally, let be the set of gradients received by the parameter server and let be the set of all subsets of of cardinality . MDA chooses the set
(9) 
and outputs the average of the vectors in . El Mhamdi et al. [14] prove that MDA is Byzantine resilient, assuming that the following VN condition holds true:
Therefore, the multiplicative constant for MDA is
(10) 
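A brute-force sketch of MDA follows, written under stated assumptions: the subset cardinality is taken to be n - f and the requirement to be n >= 2f + 1, as in the usual description of MDA [14, 15], because the symbols are missing from the text above. The name `mda` is hypothetical. The enumeration is exponential in n and is only meant to illustrate the definition.

```python
from itertools import combinations

import numpy as np


def mda(gradients, f):
    """Average the subset of n - f gradients with the smallest diameter."""
    g = np.asarray(gradients, dtype=float)
    n = g.shape[0]
    if n < 2 * f + 1:
        # Assumption: the usual MDA requirement on n and f.
        raise ValueError("this sketch assumes n >= 2f + 1")

    # Pairwise Euclidean distances, shape (n, n).
    dist = np.linalg.norm(g[:, None, :] - g[None, :, :], axis=-1)

    best_subset, best_diameter = None, np.inf
    for subset in combinations(range(n), n - f):
        idx = np.array(subset)
        # Diameter = largest pairwise distance inside the subset.
        diameter = dist[np.ix_(idx, idx)].max()
        if diameter < best_diameter:
            best_subset, best_diameter = idx, diameter
    return g[best_subset].mean(axis=0)
```

On the same toy input as before (four clustered gradients and one outlier, f = 1), the minimum-diameter subset is exactly the four clustered gradients, so the output is their plain average.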
B.3 Median
Yin et al. [42] introduce the Median aggregation rule under the assumption that . When using Median, the parameter server outputs the coordinate-wise median of the submitted gradients. We recall that every submitted gradient , where is the number of parameters of the model. Formally, Median is defined as follows
(11) 
where is the th coordinate of , and median is the real-valued median. In other words,
where . The VN condition for Median is the following:
Therefore, the multiplicative constant for Median is
(12) 
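The coordinate-wise median maps directly onto a single NumPy call. The sketch below is only illustrative, and the function name `coordinate_wise_median` is hypothetical.

```python
import numpy as np


def coordinate_wise_median(gradients):
    """Return the vector whose k-th entry is the median of the k-th
    coordinates of all submitted gradients."""
    g = np.asarray(gradients, dtype=float)  # shape (n, d)
    return np.median(g, axis=0)             # shape (d,)
```

For an odd number of workers, each output coordinate is one of the submitted values for that coordinate, so a single outlier cannot move it arbitrarily far.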
B.4 Bulyan
Bulyan is an aggregation rule defined under the assumption that .
It is actually not an aggregation rule in the conventional sense, but rather an iterative method that repeatedly uses an existing GAR [14]. In this paper, we use Bulyan on top of Krum defined above. Formally, Bulyan uses Krum times iteratively, each time discarding the highest-scoring gradient. After that, the parameter server is left with a set of the "lowest-scoring" gradients selected by Krum, as mentioned in Appendix B.1. Bulyan then outputs the average of the closest gradients to the coordinate-wise median of the (selected) gradients .
The VN condition for Bulyan is the same as that of Krum (i.e., equation 7). Therefore, the multiplicative constant for Bulyan is
(13) 
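The two phases of Bulyan can be sketched as follows. Several details are assumptions, since the symbols are missing from the text above: the sketch follows the original Bulyan description [14], in which each Krum winner is moved from the pool into the selection set, with n - 2f selected gradients, n - 4f values kept per coordinate, and the requirement n >= 4f + 3. The helper names `krum_index` and `bulyan` are hypothetical, and the neighbour count is clamped to at least 1 so that the score stays defined on small pools.

```python
import numpy as np


def krum_index(g, f):
    """Index of the Krum winner among the rows of g."""
    n = g.shape[0]
    k = max(1, n - f - 2)  # clamp: an assumption for very small pools
    sq = np.sum((g[:, None, :] - g[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(sq, np.inf)  # never count a gradient as its own neighbour
    scores = np.sort(sq, axis=1)[:, :k].sum(axis=1)
    return int(np.argmin(scores))


def bulyan(gradients, f):
    g = np.asarray(gradients, dtype=float)
    n = g.shape[0]
    if n < 4 * f + 3:
        raise ValueError("this sketch assumes n >= 4f + 3")
    theta = n - 2 * f     # assumed number of gradients selected by Krum
    beta = theta - 2 * f  # assumed number of values averaged per coordinate

    # Phase 1: run Krum repeatedly, moving each winner to the selection set.
    pool, selected = list(range(n)), []
    for _ in range(theta):
        winner = krum_index(g[pool], f)
        selected.append(pool.pop(winner))

    # Phase 2: per coordinate, average the beta values closest to the median.
    s = g[selected]                                    # shape (theta, d)
    med = np.median(s, axis=0)
    order = np.argsort(np.abs(s - med), axis=0)[:beta]
    closest = np.take_along_axis(s, order, axis=0)     # shape (beta, d)
    return closest.mean(axis=0)
```

The second phase operates coordinate by coordinate, which is what limits the influence a single extreme coordinate can have on the output.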
Appendix C Proofs omitted from the main paper
C.1 Technical background on privacy
Before proving Theorem 1, we recall some classical tools from the DP literature. Below, we recall the definition of sensitivity, the privacy guarantee of the Gaussian noise injection, and the notion of privacy amplification by subsampling.
Definition 5 (Sensitivity).
Let . The sensitivity of , denoted by , is the maximum norm of the difference between the outcomes of when applied on any two adjacent datasets, i.e.,
where denote the adjacency between the databases and from .
Using this notion of sensitivity, we can demonstrate that the Gaussian noise injection scheme (a.k.a. the Gaussian mechanism) satisfies DP for a well-chosen noise injection parameter.
Lemma 1 ([13]).
Let , , and . The scheme that takes as input, and outputs
satisfies DP if .
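As an illustration of the Gaussian mechanism, the sketch below calibrates the noise with the classical bound from the DP literature [13]: for epsilon in (0, 1), a standard deviation of at least sensitivity * sqrt(2 ln(1.25 / delta)) / epsilon suffices for (epsilon, delta)-DP. The exact constant used in Lemma 1 is not legible in the text above, so this calibration should be read as an assumption. The function names are hypothetical.

```python
import math

import numpy as np


def gaussian_sigma(sensitivity, epsilon, delta):
    """Classical noise calibration for the Gaussian mechanism."""
    if not (0.0 < epsilon < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("the classical bound assumes epsilon, delta in (0, 1)")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def gaussian_mechanism(value, sensitivity, epsilon, delta, rng=None):
    """Add isotropic Gaussian noise to a vector-valued query output."""
    rng = np.random.default_rng() if rng is None else rng
    value = np.asarray(value, dtype=float)
    sigma = gaussian_sigma(sensitivity, epsilon, delta)
    return value + rng.normal(0.0, sigma, size=value.shape)
```

The required noise grows linearly with the sensitivity and inversely with epsilon, and only logarithmically as delta shrinks.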
Finally, let us introduce the concept of privacy amplification by subsampling. Here, we study subsampling without replacement defined as follows.
Definition 6.
(Subsampling) Given a dataset and a constant , the procedure selects points at random and without replacement from .
This subsampling procedure has been widely studied in the privacy preserving literature and is known to provide privacy amplification. In particular, Balle et al. [2] demonstrated that it satisfies the following privacy amplification lemma.
Lemma 2 (Balle et al. [2]).
Let , , , and be an arbitrary output space. Let be an DP algorithm and defined as . Then is DP, with and .
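The amplified parameters can be computed directly. The sketch assumes the standard form of the result of Balle et al. [2] for subsampling without replacement, namely epsilon' = log(1 + q (e^epsilon - 1)) and delta' = q * delta with sampling ratio q = batch_size / dataset_size. The expressions are not legible in the lemma as printed above, and the names `batch_size` and `dataset_size` are hypothetical.

```python
import math


def amplify_by_subsampling(epsilon, delta, batch_size, dataset_size):
    """Privacy parameters after subsampling without replacement."""
    if not 0 < batch_size <= dataset_size:
        raise ValueError("need 0 < batch_size <= dataset_size")
    q = batch_size / dataset_size  # sampling ratio
    eps_prime = math.log(1.0 + q * (math.exp(epsilon) - 1.0))
    delta_prime = q * delta
    return eps_prime, delta_prime
```

When the batch is the whole dataset (q = 1) the parameters are unchanged, and for small q the amplified epsilon is roughly q * (e^epsilon - 1).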
C.2 Proof of Theorem 1
Theorem 1.
Proof.
Let be an arbitrary step of Algorithm 1 and the parameter at step . Let us consider an arbitrary honest worker . Note that the batch on which computes its gradient estimate is constituted of points randomly sampled without replacement from . Hence we can write . We now denote by the function that evaluates the mean gradient at using . Specifically,
(14) 
We denote by the noise injection scheme, i.e., for any ,
(15) 
Following the above notation, at step , the honest worker computes the noisy gradient estimate . Hence, it suffices to show that satisfies DP to conclude the proof.
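The composition described in this proof (subsample a batch without replacement, average the per-example gradients, then inject Gaussian noise) can be sketched as below. This is a hedged illustration, not Algorithm 1 itself: the function `noisy_gradient`, the callback `per_example_grad`, and the argument names are hypothetical, and any gradient clipping or bounding step needed to control the sensitivity is deliberately left out.

```python
import numpy as np


def noisy_gradient(per_example_grad, w, dataset, batch_size, sigma, rng):
    """One honest worker's noisy gradient estimate at parameter w.

    per_example_grad(w, x) -> gradient of the loss at w on data point x.
    sigma: per-coordinate standard deviation of the injected noise.
    """
    # Subsampling step: batch_size points, without replacement.
    idx = rng.choice(len(dataset), size=batch_size, replace=False)
    # Mean gradient over the sampled batch.
    grads = np.stack([per_example_grad(w, dataset[i]) for i in idx])
    mean_grad = grads.mean(axis=0)
    # Gaussian noise injection.
    return mean_grad + rng.normal(0.0, sigma, size=mean_grad.shape)
```

Setting `sigma` to zero recovers the plain mini-batch gradient, which is a convenient sanity check when wiring this into a training loop.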
C.3 Proof of Proposition 1
Proposition 1.
Proof.
Let us consider an arbitrary GAR with multiplicative constant . We denote the set of critical points of by . While considering Algorithm 1, the random variable that characterizes the gradients sent by the honest workers at a given parameter vector is defined as follows, for all ,
where is a set of points randomly sampled without replacement from (denoted ) and . To show that the VN condition (in Definition 3) does not hold true, we show that there exists such that
To do so, we first observe that for any , is an unbiased estimator of , i.e., . Furthermore, note that the injected noise is independent of the stochasticity of the gradient estimate . Hence, for all ,
(16) 
As admits nontrivial minima, we know that . Accordingly, there exists and . Without loss of generality, we can always take and such that
where is the constant defined in Assumption 3. Thus, using Assumption 3 we get
(17) 
Furthermore, thanks to (16) we know that
(18) 
Finally, using (17) and (18) we obtain that
The above concludes the proof. ∎
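The key step of this argument is that independent, zero-mean noise adds its own variance to that of the gradient estimate. Assuming isotropic Gaussian noise with per-coordinate standard deviation sigma in dimension d, the extra term is d * sigma^2, which does not shrink as the true gradient approaches zero. The Monte Carlo sketch below checks this identity numerically. The function name, the Gaussian model for the gradient's own stochasticity, and all numeric values are illustrative assumptions.

```python
import numpy as np


def empirical_noisy_variance(mean_grad, grad_std, sigma, num_samples, rng):
    """Estimate E||G - mean_grad||^2 for G = g + noise, where g has
    per-coordinate standard deviation grad_std around mean_grad and the
    noise is independent N(0, sigma^2) on each coordinate."""
    d = mean_grad.shape[0]
    g = mean_grad + rng.normal(0.0, grad_std, size=(num_samples, d))
    noisy = g + rng.normal(0.0, sigma, size=(num_samples, d))
    return float(np.mean(np.sum((noisy - mean_grad) ** 2, axis=1)))
```

With d = 50, grad_std = 0.1 and sigma = 0.5, the estimate should be close to 50 * 0.01 + 50 * 0.25 = 13. Since this quantity stays bounded away from zero while the norm of the true gradient can be made arbitrarily small near a critical point, the ratio between the two can exceed any fixed multiplicative constant.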
C.4 Proof of Theorem 2
Before we prove the theorem, we note the following implication of Assumption 2.
Lemma 3.
Under Assumption 2, for a given parameter ,
where, as before, is a batch of data points chosen randomly from dataset .
Proof.
Consider an arbitrary . Then,
By the triangle inequality, and the fact that is a convex function, we obtain that
Recall that is a set of points randomly sampled without replacement from , which we denote by . Thus, given ,
Therefore, from above we obtain that
(19) 
Note that
Finally, substituting above from Assumption 2 we obtain that
Substitution from above in (19) concludes the proof. ∎
We now present the proof of Theorem 2, which is restated below for convenience.