Papers by Hubertus Franke

arXiv (Cornell University), Oct 23, 2022
Zero Trust is a novel cybersecurity model that focuses on continually evaluating trust to prevent the initiation and horizontal spreading of attacks. A cloud-native Service Mesh is an example of Zero Trust Architecture that can filter out external threats. However, the Service Mesh does not shield the Application Owner from internal threats, such as a rogue administrator of the cluster where their application is deployed. In this work, we enhance the Service Mesh to allow the definition and enforcement of a Verifiable Configuration that is defined and signed off by the Application Owner. Backed by automated digital signing solutions and confidential computing technologies, the Verifiable Configuration changes the trust model of the Service Mesh: the data plane moves from fully trusting the control plane to only partially trusting it. This lets the application benefit from all the functions provided by the Service Mesh (resource discovery, traffic management, mutual authentication, access control, observability), while ensuring that the Cluster Administrator cannot change the state of the application in a way that was not intended by the Application Owner.
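The core idea can be sketched in a few lines: the Application Owner signs a configuration, and the data plane verifies the signature before applying any update pushed by the only partially trusted control plane. This is an illustrative sketch, not the paper's implementation; the paper relies on automated digital-signing solutions and confidential computing, whereas a shared-secret HMAC and the key name `OWNER_KEY` below are assumptions standing in for a real signature scheme.

```python
import hmac
import hashlib
import json

# Placeholder for the Application Owner's signing key (hypothetical).
OWNER_KEY = b"application-owner-secret"

def sign_config(config: dict, key: bytes) -> str:
    """Owner side: sign a canonical serialization of the configuration."""
    payload = json.dumps(config, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

def verify_config(config: dict, signature: str, key: bytes) -> bool:
    """Data-plane side: accept a config only if the owner's signature checks out."""
    payload = json.dumps(config, sort_keys=True).encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)

config = {"route": "svc-a", "mtls": True}
sig = sign_config(config, OWNER_KEY)
assert verify_config(config, sig, OWNER_KEY)                       # legitimate update accepted
assert not verify_config({"route": "rogue", "mtls": False}, sig, OWNER_KEY)  # tampered update rejected
```

A rogue Cluster Administrator who alters the routing rules cannot produce a valid signature, so the data plane refuses the change even though it arrived through the usual control-plane channel.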

2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2020
The need for higher energy efficiency has resulted in the proliferation of accelerators across platforms, with custom and reconfigurable accelerators adopted in both edge devices and cloud servers. However, existing solutions fall short in providing accelerators with low-latency, high-bandwidth access to the working set, and suffer from the high latency and energy cost of data transfers. Such costs can severely limit the smallest granularity of the tasks that can be accelerated and thus the applicability of the accelerators. In this work, we present FReaC Cache, a novel architecture that natively supports reconfigurable computing in the last-level cache (LLC), thereby giving energy-efficient accelerators low-latency, high-bandwidth access to the working set. By leveraging the cache's existing dense memory arrays, buses, and logic folding, we construct a reconfigurable fabric in the LLC with minimal changes to the system, processor, cache, and memory architecture. FReaC Cache is a low-latency, low-cost, and low-power alternative to off-die/off-chip accelerators, and a flexible, low-cost alternative to fixed-function accelerators. We demonstrate an average speedup of 3X and Perf/W improvements of 6.1X over an edge-class multi-core CPU, and add 3.5% to 15.3% area overhead per cache slice.

Optimization of Genomics Analysis Pipeline for Scalable Performance in a Cloud Environment
2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2018
Cost-effective and scalable analysis of the human genome is crucial for the democratization of precision medicine. The new version of the Genome Analysis Toolkit (GATK4), an industry-standard end-to-end tool for variant discovery analysis in next-generation sequencing (NGS) data, introduces Apache Spark support to improve scaling for both local multithreading and cluster-wide parallelization, as well as facilitate deployment on cloud infrastructures. In this paper, we evaluate the performance and scalability of GATK4-Spark running on a next-generation cloud platform. After identifying bottlenecks and scaling challenges, we optimize the software stack, which includes an optimized JVM, enhancements to Spark, and targeted configuration tuning, in turn enabling more effective use of the underlying computing resources. We demonstrate the effectiveness of our comprehensive optimization techniques on a reference Single Nucleotide Polymorphisms (SNPs) pipeline, achieving ≤1 hr computation time for whole human genome analysis.

Platform and applications for massive-scale streaming network analytics
IBM Journal of Research and Development, 2013
The ability to analyze massive amounts of network traffic data in real time is becoming increasingly important for communication service providers, as it enables them to optimize use of their service infrastructure and develop innovative revenue-generating opportunities. In particular, the real-time analysis of perishable user traffic (which is not stored because of privacy, regulatory, and other constraints) can provide insights into the use of applications and services by telecommunication subscribers. In this paper, we describe the design and implementation of a novel system for real-time analysis of network traffic based on IBM InfoSphere® Streams, a scalable stream-processing platform, which provides access and analysis with respect to the data objects and communication patterns of users at the application layer, in contrast to the simple packet- and flow-based analysis that most current systems provide. We discuss our design considerations for such a system and further describe analytics applications developed to showcase its capabilities: online identification of most-frequent objects, online social network discovery, and real-time sentiment analysis. We also present performance results from a pilot deployment of this platform and its applications that analyzed Internet traffic generated by users at a large corporate research lab.

MPI Programming Environment for IBM SP1/SP2
In this paper we discuss an implementation of the Message Passing Interface standard (MPI) for the IBM Scalable POWERparallel 1 and 2 (SP1, SP2). Key to a reliable and efficient implementation of a message-passing library on these machines is the careful design of a UNIX-socket-like layer in user space with controlled access to the communication adapters and with adequate recovery and flow control. The performance of this implementation is at the same level as the IBM-proprietary message passing library (MPL). We also show that on the IBM SP1 and SP2 we achieve integrated tracing, where both system events, such as context switches and page faults, and MPI-related activities are traced with minimal overhead to the application program, thus presenting application programmers with a trace of all the events that ultimately affect the efficiency of a parallel program.

arXiv (Cornell University), Mar 24, 2022
Systems-on-Chips (SoCs) that power autonomous vehicles (AVs) must meet stringent performance and safety requirements prior to deployment. With increasing complexity in AV applications, the system needs to meet the stringent real-time demands of multiple safety-critical applications simultaneously. A typical AV-SoC is a heterogeneous multiprocessor consisting of accelerators supported by general-purpose cores. Such heterogeneity, while needed for power-performance efficiency, complicates the art of task (process) scheduling. In this paper, we demonstrate that hardware heterogeneity impacts the scheduler's effectiveness and that optimizing for only the real-time aspect of applications is not sufficient in AVs. Therefore, a more holistic approach is required: one that considers global Quality-of-Mission (QoM) metrics, as defined in the paper. We then propose HetSched, a multi-step scheduler that leverages dynamic runtime information about the underlying heterogeneous hardware platform, along with the applications' real-time constraints and the task traffic in the system, to optimize overall mission performance. HetSched proposes two scheduling policies, MS_stat and MS_dyn, and scheduling optimizations such as task pruning, hybrid heterogeneous ranking, and rank update. HetSched improves overall mission performance on average by 4.6×, 2.6×, and 2.6× when compared against CPATH, ADS, and 2lvl-EDF (state-of-the-art real-time schedulers built for heterogeneous systems), respectively, and achieves on average 53.3% higher hardware utilization, while meeting 100% of critical deadlines for real-world applications of autonomous driving and aerial vehicles.
Furthermore, when used as part of an SoC design space exploration loop, in comparison to the prior schedulers, HetSched reduces the number of processing elements required by an SoC to safely complete AV missions by 35% on average, while achieving a 2.7× lower energy-mission time product. (Note that offline application profiling is a common approach across most of the schedulers considered in this work.)

IEEE Transactions on Parallel and Distributed Systems, 2018
Recent research trends exhibit a growing imbalance between the demands of tenants' software applications and the provisioning of hardware resources. Misalignment of demand and supply gradually hinders workloads from being efficiently mapped to fixed-sized server nodes in traditional data centers. The incurred resource holes not only lower infrastructure utilization but also cripple the capability of a data center to host large-sized workloads. This deficiency motivates the development of a new rack-wide architecture referred to as the composable system. The composable system transforms traditional server racks of static capacity into a dynamic compute platform. Specifically, this novel architecture aims to link up all compute components that are traditionally distributed on individual server boards, such as central processing units (CPUs), random access memory (RAM), storage devices, and other application-specific processors. By doing so, a logically giant compute platform is created, and this platform is more resilient against the variety of workload demands because it breaks the resource boundaries among traditional server boards. In this paper, we introduce the concepts of this reconfigurable architecture and design a framework of the composable system for cloud data centers. We then develop mathematical models to describe the resource usage patterns on this platform and enumerate some types of workloads that commonly appear in data centers. From the simulations, we show that the composable system sustains up to 1.6 times stronger workload intensity than traditional systems and is insensitive to the distribution of workload demands. This demonstrates that the composable system is indeed an effective solution to support cloud data center services.
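The "resource holes" effect is easy to see in a toy placement model: fixed-size nodes strand capacity that a disaggregated pool of the same total size can still use. This is an illustrative sketch only; the capacities, demands, and first-fit policy below are assumptions for the example, not the paper's mathematical model.

```python
# Four fixed servers vs. one composable pool of identical total capacity.
servers = [{"cpu": 16, "ram": 64} for _ in range(4)]   # traditional rack
pool = {"cpu": 64, "ram": 256}                          # same totals, disaggregated
workloads = [{"cpu": 10, "ram": 20} for _ in range(6)]  # demands straddle node sizes

def fit_fixed(servers, workloads):
    """First-fit onto fixed nodes: a workload must fit entirely on one node."""
    placed, free = 0, [dict(s) for s in servers]
    for w in workloads:
        for s in free:
            if s["cpu"] >= w["cpu"] and s["ram"] >= w["ram"]:
                s["cpu"] -= w["cpu"]
                s["ram"] -= w["ram"]
                placed += 1
                break
    return placed

def fit_pool(pool, workloads):
    """Composable pool: only the aggregate capacity matters."""
    placed, free = 0, dict(pool)
    for w in workloads:
        if free["cpu"] >= w["cpu"] and free["ram"] >= w["ram"]:
            free["cpu"] -= w["cpu"]
            free["ram"] -= w["ram"]
            placed += 1
    return placed

print(fit_fixed(servers, workloads))  # 4: each node strands 6 unusable CPUs
print(fit_pool(pool, workloads))      # 6: the pool absorbs the stranded capacity
```

Each 16-CPU node hosts only one 10-CPU workload, leaving a 6-CPU hole per node; the pooled platform places all six workloads within the very same total capacity.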

IBM Journal of Research and Development, 2010
In this paper, we examine two network-processing appliances, i.e., the IBM Proventia® Network Intrusion Prevention System and the IBM WebSphere® DataPower® service-oriented architecture appliance, and the specific requirements they pose on emerging heterogeneous multicore-processor systems. We first describe the function and architecture of these applications. Next, we describe the computational requirements imposed on the applications as a result of the expectation that they operate at the maximum transmission rate on high-speed networks (i.e., on networks at speeds greater than 10 Gb/s) with minimal latency. Given that next-generation systems will provide on-chip and off-chip hardware acceleration functions, we identify and quantify the functions that can be offloaded onto hardware accelerators to provide latency reduction by more efficient execution, increased concurrency, or both. Referring to models of specific hardware accelerators, we estimate and quantify the impact on the performance of the applications. We conclude with a discussion of the modifications to these applications that are required to exploit the large number of hardware threads and accelerators available on emerging multicore-processor systems.
DRPM
Proceedings of the 30th annual international symposium on Computer architecture - ISCA '03, 2003
Dezentrale Produktionsablaufplanung mittels Agentensimulationen (Decentralized production scheduling using agent simulations)
Proceedings of the 7th ACM international conference on Distributed event-based systems - DEBS '13

IEEE Transactions on Parallel and Distributed Systems, 2003
Effective scheduling strategies to improve response times, throughput, and utilization are an important consideration in large supercomputing environments. Parallel machines in these environments have traditionally used space-sharing strategies to accommodate multiple jobs at the same time by dedicating the nodes to a single job until it completes. This approach, however, can result in low system utilization and large job wait times. This paper discusses three techniques that can be used beyond simple space-sharing to improve the performance of large parallel systems. The first technique we analyze is backfilling, the second is gang-scheduling, and the third is migration. The main contribution of this paper is an analysis of the effects of combining the above techniques. Using extensive simulations based on detailed models of realistic workloads, the benefits of combining the various techniques are shown over a spectrum of performance criteria.
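The backfilling idea admits a compact sketch: a waiting job may jump ahead of the queue head only if starting it now cannot delay the head job's reservation. This is a minimal EASY-backfilling-style admission check, assuming the scheduler already computed the head job's reservation time and node requirement; it is illustrative, not the paper's simulator.

```python
def can_backfill(free_nodes, now, head_start, head_nodes, job_nodes, job_runtime):
    """Decide whether a waiting job may start now without delaying the
    queue-head job, which holds a reservation for `head_nodes` nodes at
    time `head_start`."""
    if job_nodes > free_nodes:
        return False  # not enough idle nodes right now
    if now + job_runtime <= head_start:
        return True   # job finishes before the head job needs its nodes
    # Job would still be running at the reservation: it must leave
    # enough idle nodes for the head job to start on time.
    return free_nodes - job_nodes >= head_nodes

# 8 idle nodes; head job reserved to start at t=10 on all 8 of them.
assert can_backfill(8, 0, 10, 8, 4, 10)       # 4 nodes, done exactly at t=10: OK
assert not can_backfill(8, 0, 10, 8, 4, 12)   # would run past the reservation
assert can_backfill(8, 0, 10, 4, 4, 12)       # runs past it, but 4 nodes remain free
```

Backfilled short jobs fill the idle "holes" in front of large reservations, which is where the utilization gains over pure space-sharing come from.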
Preface: Software defined environments
IBM Journal of Research and Development, 2014

Software defined infrastructures
IBM Journal of Research and Development, 2014
A fundamental component of any large-scale computer system is infrastructure. Cloud computing has completely changed the way infrastructure is viewed, offering more simplicity, flexibility, and monetary benefits compared to a traditional view of infrastructure. At the core of this transformation is the notion of virtualization of infrastructure as a whole, with providers offering infrastructure-as-a-service (IaaS) to consumers. However, just offering IaaS alone is insufficient for software defined environments (SDEs). This paper examines infrastructure in the context of SDE and discusses what we believe are some of the fundamental characteristics required of such infrastructure—called software defined infrastructure (SDI)—and how it fits into the larger landscape of cloud computing environments and SDEs. Various components of SDI are discussed, including core intelligence, monitoring pieces, and management, in addition to a brief discussion on silos such as compute, network, and storage. Consumer and provider points of view are also presented, along with infrastructure-level service-level agreements (SLAs). Also presented are the design principles and high-level architectural design of the infrastructure intelligence controller, which constantly transforms infrastructure to honor consumer requirements (SLAs) amidst provider constraints (costs). We believe that the insights presented in this paper can be used for better design of SDE architectures and of data-center systems software in general.


Proceedings of the 2nd European Workshop on Machine Learning and Systems, 2022
Serverless Function-as-a-Service (FaaS) is an emerging cloud computing paradigm that frees application developers from infrastructure management tasks such as resource provisioning and scaling. To reduce the tail latency of functions and improve resource utilization, recent research has focused on applying online learning algorithms such as reinforcement learning (RL) to manage resources. Compared to existing heuristics-based resource management approaches, RL-based approaches eliminate humans in the loop and avoid the painstaking generation of heuristics. In this paper, we show that the state-of-the-art single-agent RL algorithm (S-RL) suffers up to 4.6× higher function tail latency degradation on multi-tenant serverless FaaS platforms and is unable to converge during training. We then propose and implement a customized multi-agent RL algorithm based on Proximal Policy Optimization, i.e., multi-agent PPO (MA-PPO). We show that in multi-tenant environments, MA-PPO enables each agent to be trained until convergence and provides online performance comparable to S-RL in single-tenant cases with less than 10% degradation. Moreover, MA-PPO provides a 4.4× improvement over S-RL performance (in terms of function tail latency) in multi-tenant cases.

Toward building highly available and scalable OpenStack clouds
IBM Journal of Research and Development, 2016
OpenStack® (the leading open source platform for public and private infrastructure-as-a-service clouds) is composed of a set of loosely coupled and rapidly evolving projects that support a wide set of technologies and configuration options. Deciding how to combine and configure such projects is the determining factor in the overall quality of the cloud, in terms of performance, scalability, and availability. In this paper, we present a methodical framework and empirical analysis to help both cloud providers and users optimize their design and deployment decisions. Cloud providers can rely on this framework to select an appropriate configuration of their cloud for a given service-level agreement. Users developing and running applications on a cloud can better fit virtual resources to their workloads. We demonstrate the power of this framework using several scenarios collected by our CloudBench® tool using application benchmarks running on actual clouds.
Internet-oriented optimization schemes for joint compression and encryption
China Communications, 2015
Compression and encryption are widely used in network traffic in order to improve the efficiency and security of some systems. We propose a scheme to concatenate both functions and run them in a parallel pipelined fashion, demonstrating both a hardware and a software implementation. With minor modifications to the hardware accelerators, latency can be cut in half. Furthermore, we also propose a seminal and more efficient scheme, in which we integrate encryption into the compression algorithm. Our new integrated optimization scheme reaches a speedup of 1.6X using the parallel software scheme. However, the security level of our new scheme is lower than that of the previous ones. Fortunately, we prove that this does not affect the applicability of our schemes.
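The concatenated compress-then-encrypt pipeline can be sketched with two stages connected by a queue, so that encryption of chunk i overlaps compression of chunk i+1. This is an illustrative sketch, not the paper's scheme: zlib stands in for the compressor, and a keyed SHA-256 keystream XOR is an assumed placeholder for a real stream cipher.

```python
import hashlib
import queue
import threading
import zlib

def keystream(key: bytes, n: bytes.__class__ == bytes and int) -> bytes:
    """Derive n pseudo-random bytes from key (placeholder cipher, NOT secure)."""
    out, counter = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def xor_encrypt(data: bytes, key: bytes) -> bytes:
    """XOR with a keystream; applying it twice decrypts."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

def pipeline(chunks, key):
    """Stage 1 compresses chunks while stage 2 encrypts them concurrently."""
    q, results = queue.Queue(), []

    def compressor():
        for c in chunks:
            q.put(zlib.compress(c))
        q.put(None)  # end-of-stream sentinel

    def encryptor():
        while (c := q.get()) is not None:
            results.append(xor_encrypt(c, key))

    t1 = threading.Thread(target=compressor)
    t2 = threading.Thread(target=encryptor)
    t1.start(); t2.start(); t1.join(); t2.join()
    return results

key = b"demo-key"
chunks = [b"hello world " * 100, b"abc " * 200]
ciphertexts = pipeline(chunks, key)
# Reverse path: decrypt, then decompress, recovering the original chunks.
assert [zlib.decompress(xor_encrypt(c, key)) for c in ciphertexts] == chunks
```

With independent hardware units per stage, this overlap is what allows the end-to-end latency of the concatenated functions to be roughly halved.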

IEEE Transactions on Computers, 2001
A new memory subsystem, called Memory Xpansion Technology (MXT), has been built for compressing main memory contents. MXT effectively doubles the physically available memory transparently to the CPUs, input/output devices, device drivers, and application software. An average compression ratio of two or greater has been observed for many applications. Since the compressibility of memory contents varies dynamically, the size of the memory managed by the operating system is not fixed. In this paper, we describe operating system techniques that can deal with such dynamically changing memory sizes. We also demonstrate the performance impact of memory compression using the SPEC CPU2000 and SPECweb99 benchmarks. Results show that the hardware compression of memory has a negligible performance penalty compared to a standard memory for many applications. For memory-starved applications and benchmarks such as SPECweb99, memory compression improves the performance significantly. Results also show that the memory contents of many applications can be compressed, usually by a factor of two to one.
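The key operating-system challenge, that compressibility (and hence effective memory size) depends on content, is easy to demonstrate. This sketch is illustrative only: zlib stands in for the MXT hardware compressor, and the page contents are assumed examples.

```python
import os
import zlib

def ratio(buf: bytes) -> float:
    """Uncompressed size over compressed size (higher = more compressible)."""
    return len(buf) / len(zlib.compress(buf))

zeroed_page = bytes(4096)                        # freshly zeroed page
text_page = (b"the quick brown fox ") * 205      # structured, repetitive data
random_page = os.urandom(4096)                   # encrypted/random data

assert ratio(zeroed_page) > 2.0   # compresses far beyond 2:1
assert ratio(text_page) > 2.0     # repetitive text also exceeds 2:1
assert ratio(random_page) < 1.1   # incompressible: no expansion headroom
```

A workload that fills memory with random-looking data shrinks the effective memory pool, which is why the OS must be prepared to reclaim pages when the compression ratio drops, exactly the dynamic-size problem the paper's techniques address.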