Key research themes
1. How can systems maintain continuous high availability despite sensor or component faults in layered cloud and edge computing environments?
This research theme explores fault detection and fault tolerance mechanisms at the sensor and component level within multi-layered cloud and edge computing systems. It focuses on maintaining high availability (HA) even when the sensors themselves fail, since a failed sensor can otherwise disable fault detection. The theme matters because complex infrastructures are composed of multiple interdependent layers, and uninterrupted service delivery requires handling both hardware and software component failures without human intervention.
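The idea that a failed sensor should not silently disable fault detection can be illustrated with redundant sensors and a freshness check. The following is a minimal sketch under stated assumptions, not a method taken from the surveyed work: the `SensorReading` type, the `component_status` function, and the `max_age` staleness limit are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class SensorReading:
    sensor_id: str
    timestamp: float  # seconds; when the sensor last reported
    healthy: bool     # the sensor's verdict on the monitored component


def component_status(readings, now, max_age=5.0):
    """Return 'healthy', 'faulty', or 'unknown' for one monitored component.

    Assumption: a sensor that has not reported within max_age seconds is
    treated as failed and its last verdict is ignored.
    """
    live = [r for r in readings if now - r.timestamp <= max_age]
    if not live:
        # Every sensor is silent: detection itself has failed, so report it
        # instead of assuming the component is fine.
        return "unknown"
    votes_healthy = sum(1 for r in live if r.healthy)
    votes_faulty = len(live) - votes_healthy
    if votes_healthy > votes_faulty:
        return "healthy"
    if votes_faulty > votes_healthy:
        return "faulty"
    return "unknown"  # a tie gives no majority verdict
```

Returning "unknown" rather than "healthy" when no sensor is live is the design choice that matters here: it lets a higher layer react to the loss of monitoring rather than continue on stale information.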
2. What proactive and coordinated fault tolerance mechanisms effectively preserve reliability and minimize downtime for cloud virtual machines and parallel applications?
This theme focuses on proactive fault tolerance strategies that anticipate failures from system health indicators in cloud infrastructures hosting parallel applications across virtual machines (VMs). It addresses coordinated fault tolerance that combines VM migration with resource optimization to prevent failures, reduce checkpoint frequency, and minimize downtime, improving reliability in cloud data centers that run large-scale parallel workloads.
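A proactive policy of this kind can be sketched as a planner that moves VMs off hosts whose health indicators look risky before a failure occurs. This is an illustrative sketch only: the `Host` type, the `plan_migrations` function, the use of CPU temperature as the single health indicator, and the `TEMP_LIMIT_C` threshold are assumptions made for the example, not details from the surveyed work.

```python
from dataclasses import dataclass, field

TEMP_LIMIT_C = 85.0  # hypothetical threshold above which a host counts as at risk


@dataclass
class Host:
    name: str
    cpu_temp_c: float  # assumed health indicator
    free_slots: int    # how many more VMs this host can accept
    vms: list = field(default_factory=list)


def plan_migrations(hosts):
    """Return (vm, source, target) moves away from at-risk hosts."""
    at_risk = [h for h in hosts if h.cpu_temp_c >= TEMP_LIMIT_C]
    # Prefer the coolest safe hosts as migration targets.
    safe = sorted(
        (h for h in hosts if h.cpu_temp_c < TEMP_LIMIT_C),
        key=lambda h: h.cpu_temp_c,
    )
    capacity = {h.name: h.free_slots for h in safe}
    plan = []
    for src in at_risk:
        for vm in src.vms:
            target = next((h for h in safe if capacity[h.name] > 0), None)
            if target is None:
                # No spare capacity left; remaining VMs would have to rely
                # on checkpointing instead of migration.
                return plan
            capacity[target.name] -= 1
            plan.append((vm, src.name, target.name))
    return plan
```

The planner only produces a plan; actually executing live migration, and coordinating it with the checkpoint schedule of a parallel job, is outside the scope of this sketch.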
3. How can distributed database systems and cloud orchestrations be architected to achieve strong consistency, fault tolerance, and global high availability?
This area investigates architectural design patterns, replication protocols, and orchestration mechanisms that support high availability in distributed data storage and cloud infrastructure. It includes hybrid replication protocols for ensuring data consistency and availability, strategies for global database replication with failover capabilities, and container orchestration tools for resilient service deployment and scaling. These insights aim to guide the design of systems that support geo-distributed workloads with minimal downtime and strong data guarantees.
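The tension between consistency and availability in replicated storage can be made concrete with the classic quorum overlap rule. This is a general textbook technique, not the hybrid replication protocol referred to above; the function names are hypothetical. With N replicas, a read quorum of R and a write quorum of W, every read is guaranteed to see the latest acknowledged write when R + W > N.

```python
def quorums_overlap(n, r, w):
    """True if every read quorum intersects every write quorum."""
    return r + w > n


def tolerated_failures(n, r, w):
    """Replica failures that still leave both reads and writes available."""
    return n - max(r, w)
```

For example, N = 3 with R = W = 2 keeps the overlap guarantee while tolerating one failed replica, whereas R = W = 1 tolerates two failures but gives up the guarantee.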
4. How can AI-driven techniques improve disaster recovery, fault tolerance, and high availability in dynamic cloud systems?
This research theme evaluates the application of artificial intelligence (AI) and machine learning methods to enhance cloud reliability. AI enables predictive failure detection, automated fault management, intelligent load balancing, and self-healing capabilities, overcoming the limitations of static rule-based resilience methods. Insights cover how AI supports adaptive resource optimization, reduces downtime, and improves recovery speed, while acknowledging challenges such as model bias and data privacy.
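The contrast with static rules can be illustrated by a detector whose threshold adapts to the recent behavior of a metric. The sketch below uses an exponentially weighted moving average (EWMA) as a simple statistical stand-in for a learned model; it is not an AI method from the surveyed work, and the function name and the `alpha` and `threshold` parameters are assumptions chosen for the example.

```python
def ewma_alerts(samples, alpha=0.3, threshold=3.0):
    """Return indices of samples that deviate strongly from the running EWMA.

    A sample is flagged when it lies more than `threshold` running standard
    deviations from the running mean, so the limit adapts to the metric
    instead of being a fixed number.
    """
    if not samples:
        return []
    mean = samples[0]
    var = 0.0
    alerts = []
    for i, x in enumerate(samples[1:], start=1):
        std = var ** 0.5
        if std > 0 and abs(x - mean) > threshold * std:
            alerts.append(i)
        # Update the running mean and variance after the check.
        diff = x - mean
        mean += alpha * diff
        var = (1 - alpha) * (var + alpha * diff * diff)
    return alerts
```

A call such as `ewma_alerts([10, 10.5, 9.8, 10.2, 10.1, 25.0, 10.0])` flags only the spike at index 5. In a self-healing setup, an alert like this would feed an automated action such as restarting a service or draining a node.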