LADC 2013: Latin-American Symposium on Dependable Computing

Tutorial #1: Checkpointing and Debugging in Distributed Computing

Michel Raynal
Institut Universitaire de France & IRISA, Université de Rennes 1
Campus de Beaulieu, 35042 Rennes Cedex, France
raynal@irisa.fr

Duration: 1.5 hours

Abstract:

This tutorial addresses two complementary topics: checkpointing and debugging (detection of properties and replay) in distributed message-passing computations.

The first part, on checkpointing, will introduce the notions of local and global checkpoints and present a theorem stating a necessary and sufficient condition for a set of local checkpoints to belong to the same consistent global checkpoint. It will then consider two consistency conditions that can be associated with a distributed computation enriched with local checkpoints. The first condition (called Z-cycle-freedom) ensures that every local checkpoint taken by a process belongs to a consistent global checkpoint. The second condition (called rollback-dependency trackability) is stronger: it states that a consistent global checkpoint can be associated with any local checkpoint on the fly (i.e., without additional communication). The tutorial will discuss these consistency conditions and present algorithms that, once superimposed on a distributed execution, ensure that the corresponding condition is satisfied. It will also present a message logging algorithm suited to uncoordinated checkpointing.

The second part of the tutorial will be devoted to the debugging of asynchronous message-passing computations. Distributed programs are more difficult to design than sequential programs. This is mainly due to asynchrony, which makes the notion of a global state difficult to master and exploit. It follows that debugging distributed programs is much more difficult than debugging sequential ones. Two fundamental issues arise in distributed debugging: the replay of distributed executions and the detection of global properties of these executions, especially when these properties are unstable (a stable property, once true, remains true forever; an unstable property need not). The tutorial will discuss these two issues and present several approaches and algorithms that have been proposed to solve them.

Tutorial #2: Architecting Resilience in Self-Adaptive Systems

Javier Cámara
Department of Informatics Engineering, University of Coimbra
3030-290 Coimbra, Portugal
jcmoreno@dei.uc.pt

Rogério de Lemos
School of Computing, University of Kent
Canterbury, Kent CT2 7NF, UK
r.delemos@kent.ac.uk

Duration: 1.5 hours

Abstract:

An important concern for modern software systems is to become more cost-effective, while being versatile, flexible, resilient, dependable, energy-efficient, customisable, configurable and self-optimising when reacting to run-time changes that may occur in the system itself, its environment, or its requirements. One of the most promising approaches to achieving such properties is to equip software systems with self-managing capabilities using self-adaptation mechanisms. Despite recent advances in this area, one key aspect of self-adaptive systems that remains to be tackled in depth is assurances: that is, the provision of evidence that the system satisfies its stated functional and non-functional requirements during its operation in the presence of self-adaptation.

The goal of this tutorial is to discuss the fundamental principles, models, methods, techniques, mechanisms, state of the art, and challenges in providing assurances for self-adaptive software systems.

Tutorial #3: Holistic Optimization of Distribution Automation Smart-Grid Designs using Survivability Modeling

Alberto Avritzer
Siemens Corporate Research
alberto.avritzer@siemens.com

Daniel Menasché
Universidade Federal do Rio de Janeiro (UFRJ)

Duration: 1.5 hours

Abstract:

Smart grids are fostering a paradigm shift in the realm of power distribution systems. Whereas different components of the power distribution system have traditionally been provided and analyzed by different teams, smart grids require a unified and holistic approach that takes into consideration the interplay of distributed generation, distribution automation topology, intelligent features, and other factors.

In this tutorial, we describe the use of transient survivability metrics to create better distribution automation network designs. Our approach combines survivability analysis and power flow analysis to assess the survivability of the distribution power grid network.

We conclude the tutorial by presenting practical optimization results based on real distribution grids.