On Processing Extreme Data
2016, Scalable Computing: Practice and Experience
https://doi.org/10.12694/SCPE.V16I4.1134
Abstract
Extreme Data is an incarnation of the Big Data concept, distinguished by massive amounts of data that must be queried, communicated and analyzed in near real time using a very large number of memory or storage elements and exascale computing systems. Immediate examples are scientific data produced at rates of hundreds of gigabits per second that must be stored, filtered and analyzed, millions of images per day that must be analyzed in parallel, and one billion social data posts queried in real time on an in-memory components database. Traditional disks and commercial storage cannot currently handle the extreme scale of such application data. Given the need to improve current concepts and technologies, this paper focuses on the needs of data-intensive applications running on systems composed of up to millions of computing elements (exascale systems). We propose a methodology to advance the state of the art. The starting point is the definition of new programming paradigms, APIs, runtime tools and methodologies for expressing data-intensive tasks on exascale systems. This will pave the way for exploiting massive parallelism over a simplified model of the system architecture, thus promoting high performance and efficiency and offering powerful operations and mechanisms for processing extreme data sources at high speed and/or in real time.