Key research themes
1. How can the Hadoop ecosystem be optimized for scalable and efficient big data storage and processing?
This research area focuses on the architectural and configuration aspects of Hadoop and its core components, the Hadoop Distributed File System (HDFS) and MapReduce, covering fault-tolerance mechanisms, data locality, replication strategies, and overall performance in large-scale distributed storage and computation environments. Understanding these optimizations is crucial for enabling Hadoop to reliably process petabyte-scale datasets on commodity hardware with high throughput.
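To make the configuration dimension of this theme concrete, the sketch below is a minimal illustration (not taken from the surveyed work) of how a Hadoop client can override two of the knobs mentioned above, the replication factor and the block size, when writing to HDFS. It assumes the Hadoop client libraries are on the classpath and a NameNode is reachable at the placeholder address `hdfs://namenode:8020`.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class HdfsTuningSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Client-side overrides of the cluster defaults from hdfs-site.xml:
        conf.set("dfs.replication", "3");        // three replicas per block for fault tolerance
        conf.set("dfs.blocksize", "268435456");  // 256 MB blocks to reduce NameNode metadata load

        // "hdfs://namenode:8020" is a placeholder for an assumed cluster address.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Write a small file; which DataNodes receive each replica is governed by
        // the cluster's rack-awareness and data-locality policies.
        try (FSDataOutputStream out = fs.create(new Path("/tmp/tuning-demo.txt"))) {
            out.writeUTF("hello hdfs");
        }
        fs.close();
    }
}
```

Cluster-wide values for the same properties would normally live in `hdfs-site.xml`; the per-client override shown here is simply the smallest self-contained way to demonstrate the trade-off between replication (fault tolerance, storage overhead) and block size (throughput, NameNode memory).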
2. What advances in SQL and query engines have improved interactive and high-performance analytics on Hadoop?
This theme investigates the integration and performance of SQL-on-Hadoop engines designed to enable interactive, low-latency, high-concurrency analytics directly on Hadoop data. Focusing on systems such as Impala, this research area is significant because traditional batch-oriented frameworks such as Apache Hive cannot meet the latency and concurrency requirements of many BI and analytic workloads. Improvements in front-end optimizers, execution engines, and resource management are key enablers of scalable SQL processing over big data stored in Hadoop.
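As a minimal sketch of the interactive-analytics workflow this theme describes, the example below issues a BI-style aggregation against Impala over JDBC. It is illustrative only: it assumes an Impala daemon reachable at the placeholder host `impala-host` on its HiveServer2-compatible port (21050) with no authentication, the Hive JDBC driver on the classpath, and a hypothetical `sales` table.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaQuerySketch {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string for an assumed, unauthenticated deployment.
        String url = "jdbc:hive2://impala-host:21050/default;auth=noSasl";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             // An interactive aggregation executed directly on data stored in Hadoop.
             ResultSet rs = stmt.executeQuery(
                 "SELECT category, COUNT(*) AS n FROM sales "
                     + "GROUP BY category ORDER BY n DESC LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("category") + "\t" + rs.getLong("n"));
            }
        }
    }
}
```

The point of the sketch is the access pattern, not the engine internals: the same short, ad hoc query that a batch engine would turn into a multi-minute job is expected to return in interactive time when the optimizer, execution engine, and resource manager are designed for low latency and high concurrency.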
3. Which distributed computing frameworks beyond MapReduce are promising for overcoming big data analysis challenges in Hadoop environments?
This research area reviews the limitations of MapReduce-based frameworks such as Hadoop MapReduce in handling contemporary big data analysis tasks, particularly those requiring complex, iterative, or memory-efficient computations. It also investigates alternative distributed computing frameworks that can reduce I/O overhead, scale beyond single-node memory constraints, and support inherently serial algorithms. Exploring such frameworks is vital for evolving big data analytics to keep pace with ever-growing data volume and complexity.
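To illustrate the iterative workloads that motivate this theme, the sketch below uses Apache Spark, chosen here purely as a representative in-memory alternative (the theme itself does not single out a specific engine). It runs a toy iterative refinement over a cached dataset in local mode; the cached reuse across iterations is exactly the step that a chain of MapReduce jobs would pay for with disk I/O on every pass.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.ArrayList;
import java.util.List;

public class IterativeSparkSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("iterative-sketch").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Toy dataset standing in for records that would normally be read from HDFS.
            List<Double> values = new ArrayList<>();
            for (int i = 1; i <= 100_000; i++) values.add((double) i);

            // cache() keeps the partitions in executor memory, so each iteration
            // below reuses them instead of re-reading input from disk.
            JavaRDD<Double> data = sc.parallelize(values).cache();

            double estimate = 0.0;
            for (int iter = 0; iter < 10; iter++) {
                final double current = estimate;
                // Each pass refines the estimate using the cached data
                // (here it simply converges toward the mean of the values).
                double correction =
                    data.map(v -> v - current).reduce(Double::sum) / values.size();
                estimate = current + 0.5 * correction;
            }
            System.out.println("estimate after 10 iterations: " + estimate);
        }
    }
}
```

The algorithm itself is deliberately trivial; what matters for the comparison is the loop structure, where intermediate state lives in memory across iterations rather than being materialized to HDFS between successive MapReduce jobs.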