Differences
This shows you the differences between two versions of the page.
education [2019/05/24 16:59] fablpd
education [2019/05/27 12:08] fablpd
Line 31:
  * **Distributed computing using RDMA and/or NVRAM**: contact [[https://people.epfl.ch/igor.zablotchi|Igor Zablotchi]] for more information.
-  * **[[Distributed ML|Distributed Machine Learning]]**
+  * **[[Distributed ML|Distributed Machine Learning]]**: contact [[http://people.epfl.ch/georgios.damaskinos|Georgios Damaskinos]] for more information.
-  * **Distributed and Fault-tolerant algorithms**: projects here would consist of designing failure detection mechanisms suited for large-scale systems, real-time systems, and systems with unreliable communication or partial synchrony. This task also involves implementing, evaluating, and simulating the performance of the developed mechanisms to verify the achievable guarantees; please contact [[http://people.epfl.ch/david.kozhaya|David Kozhaya]] to get more information.
+  * **Robust Distributed Machine Learning**: With the proliferation of big datasets and models, Machine Learning is becoming distributed. Following the standard parameter server model, the learning phase is carried out by two categories of machines: parameter servers and workers. Any of these machines could behave arbitrarily (i.e., be Byzantine), affecting the convergence of the model during the learning phase. Our goal in this project is to build a system that is robust against Byzantine behavior of both parameter servers and workers (an illustrative sketch of a robust aggregation rule is given at the end of this page). Our first prototype, AggregaThor (https://www.sysml.cc/doc/2019/54.pdf), is the first scalable robust Machine Learning framework; it fixed a severe vulnerability in TensorFlow and showed how to make TensorFlow even faster while remaining robust. Contact [[https://people.epfl.ch/arsany.guirguis|Arsany Guirguis]] or [[https://people.epfl.ch/sebastien.rouault|Sébastien Rouault]] for more information.
-  * **Consistency in global-scale storage systems**: We offer several projects in the context of storage systems, ranging from implementation of social applications (similar to [[http://retwis.redis.io/|Retwis]], or [[https://github.com/share/sharejs|ShareJS]]) to recommender systems, static content storage services (à la [[https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf|Facebook's Haystack]]), or experimenting with well-known cloud serving benchmarks (such as [[https://github.com/brianfrankcooper/YCSB|YCSB]]); please contact [[http://people.epfl.ch/dragos-adrian.seredinschi|Adrian Seredinschi]] for further information.
+  * **Stochastic gradient: (artificial) reduction of the variance/norm ratio for adversarial distributed SGD**:
+ One computationally efficient and non-intrusive line of defense for adversarial distributed SGD (e.g. one parameter server distributing the gradient estimation to several, possibly adversarial, workers) relies on the honest workers sending back gradient estimations with sufficiently low variance, an assumption which is sometimes hard to satisfy in practice.
+ One solution could be to (drastically) increase the batch size at the workers, but doing so may defeat the very purpose of distributing the computation.
-  * **Distributed database algorithms**: a project here would consist of implementing and evaluating protocols that run in today's database systems, e.g., [[https://en.wikipedia.org/wiki/Two-phase_commit_protocol|2PC]], and comparing them with protocols that could potentially be used in future database systems; please contact [[http://people.epfl.ch/jingjing.wang|Jingjing Wang]] to get more information.
+ In this project, we propose two approaches that you can choose to explore (you may also propose a different approach) to (artificially) reduce the variance/norm ratio of the stochastic gradients while keeping the benefits of distributing the computation.
+ The first proposed approach, which is speculative, boils down to "intelligent" coordinate selection.
+ The second makes use of some kind of "momentum" at the workers (a toy sketch is given at the end of this page).
+
+ [1] "Machine Learning with Adversaries: Byzantine Tolerant Gradient Descent" (https://papers.nips.cc/paper/6617-machine-learning-with-adversaries-byzantine-tolerant-gradient-descent)
+ [2] "Federated Learning: Strategies for Improving Communication Efficiency" (https://arxiv.org/abs/1610.05492)
+
+  * **Consistency in global-scale storage systems**: We offer several projects in the context of storage systems, ranging from implementation of social applications (similar to [[http://retwis.redis.io/|Retwis]], or [[https://github.com/share/sharejs|ShareJS]]) to recommender systems, static content storage services (à la [[https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf|Facebook's Haystack]]), or experimenting with well-known cloud serving benchmarks (such as [[https://github.com/brianfrankcooper/YCSB|YCSB]]); please contact [[http://people.epfl.ch/dragos-adrian.seredinschi|Adrian Seredinschi]] or [[https://people.epfl.ch/karolos.antoniadis|Karolos Antoniadis]] for further information.
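
For the **Robust Distributed Machine Learning** project above, here is a minimal, purely illustrative sketch of one classic Byzantine-robust aggregation rule (coordinate-wise median) at the parameter server, assuming a single server collecting one gradient per worker. It is not the rule AggregaThor itself implements; see the linked paper for that.

<code python>
# Illustrative sketch only: coordinate-wise median aggregation at the parameter
# server. One classic Byzantine-robust rule, not AggregaThor's own rule.
import numpy as np

def robust_aggregate(worker_gradients):
    """Aggregate one gradient per worker, tolerating a minority of Byzantine workers.

    worker_gradients: list of 1-D numpy arrays of equal length.
    The coordinate-wise median, unlike the plain average, cannot be dragged
    arbitrarily far by a minority of arbitrarily corrupted gradients.
    """
    stacked = np.stack(worker_gradients)   # shape: (n_workers, dim)
    return np.median(stacked, axis=0)

# Toy usage: 4 honest workers around the true gradient [1, 2, 3], 1 Byzantine worker.
honest = [np.array([1.0, 2.0, 3.0]) + 0.1 * np.random.randn(3) for _ in range(4)]
byzantine = [np.array([1e9, -1e9, 1e9])]
print(robust_aggregate(honest + byzantine))   # stays close to [1, 2, 3]
</code>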
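For the **Stochastic gradient** project, the toy sketch below illustrates the second proposed direction ("momentum" at the workers), under the assumption that each worker smooths its stochastic gradients with an exponential moving average before sending them to the server; the class and parameter names are placeholders, not part of any existing codebase.

<code python>
# Illustrative sketch only: hypothetical worker-side exponential moving average
# ("momentum"), meant to lower the variance of what a worker sends to the server
# without increasing its batch size. All names here are placeholders.
import numpy as np

class MomentumWorker:
    def __init__(self, beta=0.9):
        self.beta = beta      # smoothing factor; higher = stronger averaging
        self.buffer = None    # running average of past stochastic gradients

    def send_gradient(self, stochastic_gradient):
        """Return the smoothed gradient this worker would send to the server."""
        g = np.asarray(stochastic_gradient, dtype=float)
        if self.buffer is None:
            self.buffer = g
        else:
            self.buffer = self.beta * self.buffer + (1.0 - self.beta) * g
        return self.buffer

# Toy usage: the smoothed estimates fluctuate less than the raw noisy gradients.
worker = MomentumWorker(beta=0.9)
true_gradient = np.array([1.0, -2.0])
for _ in range(5):
    noisy = true_gradient + np.random.randn(2)
    print(worker.send_gradient(noisy))
</code>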