High availability for parallel computers
journal.info.unlp.edu.ar
Abstract
Fault tolerance has become an important issue for parallel applications in the last few years. Users of parallel systems want them to be reliable along two main dimensions: availability and data consistency. Availability can be provided with solutions such as RADIC, a fault ...