The work carried out during my thesis aimed to design and implement a platform for building reliable applications in distributed environments: STAR (a fault-tolerant system for distributed applications). The main goal is to make faults transparent by automatically restarting failed processes without affecting the valid ones. The originality of this approach is that it provides fault tolerance at a low cost during normal operation, in an environment where faults are uncommon. STAR handles the crash failure model for applications composed of deterministic communicating processes. To detect failures efficiently, hosts are organized in a virtual ring in which each host checks only its neighbour. When a failure is detected, each process that was running on the faulty host is transparently restarted on a valid one. Processes periodically save their state to reliable storage, so that a restarted process rolls back only to its last saved state. These checkpoints are taken independently, which reduces the checkpoint cost by avoiding synchronization between processes. To preserve the global consistency of the system, a recovering process does not interact with the others. This principle is implemented by means of message logging. A first version of STAR has been implemented and runs on a set of SPARC workstations. The cost of fault tolerance was evaluated for several parallel applications covering a broad class of behavior in terms of communication, memory requirements and execution time.
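The ring-based detection described above can be illustrated with a minimal sketch. It is not STAR's actual code: the function names, the choice that each host checks its successor in list order, and the policy of restarting orphaned processes on the detecting host are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple


def monitoring_ring(hosts: List[str], failed: Set[str]) -> Dict[str, str]:
    """Map each live host to the next live host in ring order (the one it checks).

    Assumption: the ring follows the order of `hosts`, skipping failed hosts.
    """
    alive = [h for h in hosts if h not in failed]
    if len(alive) < 2:
        return {}  # a single host has no neighbour to check
    return {h: alive[(i + 1) % len(alive)] for i, h in enumerate(alive)}


def recover(
    hosts: List[str],
    failed: Set[str],
    placement: Dict[str, str],
    crashed: str,
) -> Tuple[str, Set[str], Dict[str, str]]:
    """Handle the crash of `crashed`.

    Returns the detecting host, the updated set of failed hosts, and the new
    process placement. Hypothetical policy: every process of the crashed host
    is restarted on the host that detected the crash.
    """
    ring = monitoring_ring(hosts, failed)
    # Only the host whose neighbour is `crashed` notices the failure.
    detector = next(watcher for watcher, target in ring.items() if target == crashed)
    new_failed = failed | {crashed}
    new_placement = {
        proc: (detector if host == crashed else host)
        for proc, host in placement.items()
    }
    return detector, new_failed, new_placement


hosts = ["a", "b", "c", "d"]
placement = {"p1": "b", "p2": "c", "p3": "b"}
detector, failed, placement = recover(hosts, set(), placement, "b")
ring_after = monitoring_ring(hosts, failed)  # the ring closes around the failed host
```

In this sketch each host monitors exactly one other host, so the number of checks grows linearly with the number of hosts, and the ring is rebuilt around a failed host without any global coordination.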
The GATOSTAR prototype is the successor of STAR. It integrates powerful load-distribution tools from GATOS, a system developed by Bertil Folliot at the MASI laboratory. GATOS places processes judiciously according to the state of the network. In GATOSTAR, processes are allocated dynamically according to load criteria, which also improves the performance of fault recovery. We developed process migration algorithms that rely on the checkpoint mechanism provided by STAR. A version of GATOSTAR is now operational, and a first performance evaluation shows that migration is effective at reducing response times.
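As a rough illustration of how load-based placement and checkpoint-based migration fit together, the sketch below moves a process to the least loaded host by saving its state and restoring it elsewhere. The load metric, the `threshold` parameter, the JSON serialization and all function names are assumptions for this example; they do not describe the actual criteria or formats used by GATOS or STAR.

```python
import json
from typing import Dict, Tuple


def least_loaded(loads: Dict[str, float]) -> str:
    """Return the host with the smallest load value (hypothetical metric)."""
    return min(loads, key=loads.get)


def checkpoint(state: dict) -> bytes:
    """Serialize a process state; stands in for saving it to reliable storage."""
    return json.dumps(state).encode("utf-8")


def restore(blob: bytes) -> dict:
    """Rebuild a process state from a checkpoint."""
    return json.loads(blob.decode("utf-8"))


def migrate(
    state: dict,
    current: str,
    loads: Dict[str, float],
    threshold: float = 1.0,
) -> Tuple[dict, str]:
    """Move a process to the least loaded host if the gain is large enough.

    Migration reuses the checkpoint path: the state is saved on the current
    host and restored on the target. `threshold` is an assumed minimum load
    difference that keeps processes from bouncing between similar hosts.
    """
    target = least_loaded(loads)
    if target == current or loads[current] - loads[target] < threshold:
        return state, current  # not worth moving
    return restore(checkpoint(state)), target


loads = {"h1": 3.0, "h2": 0.5, "h3": 1.2}
state = {"total": 7}
new_state, new_host = migrate(state, "h1", loads)
```

Because the same save and restore path serves both recovery and migration in this sketch, a process restarted after a crash can be placed with the same load criterion as a migrated one.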
With Philippe Cadinot, we are currently studying the extension of GATOSTAR
to large-scale networks. While much research addresses the management of
communications in wide area networks, few systems offer a real resource
management policy on a large scale. The idea is to adapt the algorithms
developed within GATOSTAR to the particular constraints of large networks:
high failure rates and low, highly variable bandwidth. A first prototype,
based on mobile agents and providing reliable access to a set of Web
servers, will be implemented.