Papers by Zinovi Rabinovich

We investigate equilibrium strategies for bidding agents that participate in multiple, simultaneous second-price auctions with perfect substitutes. For this setting, previous research has shown that it is a best response for a bidder to participate in as many such auctions as are available, provided that other bidders only participate in a single auction. In contrast, in this paper we consider equilibrium behaviour where all bidders participate in multiple auctions. For this new setting we consider mixed-strategy Nash equilibria where bidders can bid high in one auction and low in all others. By discretising the bid space, we are able to use smooth fictitious play to compute approximate solutions. Specifically, we find that the results do indeed converge to ε-Nash mixed equilibria and, therefore, we are able to locate equilibrium strategies in such complex games where no known solutions previously existed.
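For readers who want a concrete handle on the solver named above, here is a minimal sketch of smooth fictitious play on a discretised strategy space with logit (smoothed) best responses. The payoff matrices A and B are toy stand-ins; the paper's actual payoffs would be derived from the simultaneous second-price auction model.

```python
import numpy as np

def smooth_fictitious_play(A, B, steps=5000, temp=0.1, rng=None):
    """Two-player smooth fictitious play with logit best responses.

    A[i, j]: row player's payoff; B[i, j]: column player's payoff.
    The empirical play frequencies approximate a mixed equilibrium
    when the process converges.
    """
    rng = rng or np.random.default_rng(0)
    n, m = A.shape
    counts_row, counts_col = np.ones(n), np.ones(m)   # smoothed play counts
    for _ in range(steps):
        belief_col = counts_col / counts_col.sum()    # row's belief about column
        belief_row = counts_row / counts_row.sum()    # column's belief about row
        # Logit ("smooth") best responses to the current beliefs.
        u_row, u_col = A @ belief_col, B.T @ belief_row
        p_row = np.exp(u_row / temp); p_row /= p_row.sum()
        p_col = np.exp(u_col / temp); p_col /= p_col.sum()
        counts_row[rng.choice(n, p=p_row)] += 1
        counts_col[rng.choice(m, p=p_col)] += 1
    return counts_row / counts_row.sum(), counts_col / counts_col.sum()
```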

Hierarchical reinforcement learning (HRL) is a promising approach to solving tasks with long time horizons and sparse rewards. It is often implemented as a high-level policy assigning subgoals to a low-level policy. However, it suffers from the high-level non-stationarity problem, since the low-level policy is constantly changing. The non-stationarity also leads to a data-efficiency problem: policies need more data at non-stationary states to stabilize training. To address these issues, we propose a novel HRL method: Interactive Influence-based Hierarchical Reinforcement Learning (I²HRL). First, inspired by agent modeling, we enable interaction between the low-level and high-level policies, i.e., the low-level policy sends its policy representation to the high-level policy. The high-level policy makes decisions conditioned on the received low-level policy representation as well as the state of the environment. Second, we stabilize the training of the high-level policy via an information-theoretic regularization that minimizes its dependence on the changing low-level policy. Third, we propose influence-based exploration to more frequently visit the non-stationary states where more transition data is needed. We experimentally validate the effectiveness of the proposed solution on several tasks in MuJoCo domains, demonstrating that our approach can significantly boost learning performance and accelerate learning compared with state-of-the-art HRL methods.
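A hedged sketch (not the authors' code) of the interaction mechanism described above: the high-level policy consumes an embedding of the current low-level policy alongside the environment state when selecting subgoals. All dimensions and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class HighLevelPolicy(nn.Module):
    def __init__(self, state_dim=17, low_repr_dim=32, subgoal_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + low_repr_dim, 128), nn.ReLU(),
            nn.Linear(128, subgoal_dim),
        )

    def forward(self, state, low_policy_repr):
        # Concatenate the state with the received low-level policy
        # representation, so subgoal selection can adapt as the
        # low-level policy changes during training.
        return self.net(torch.cat([state, low_policy_repr], dim=-1))
```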
We present the Behaviosite paradigm, a new approach to affecting the behavior of distributed agents in a multi-agent system, inspired by biological parasites with behavior-manipulation properties. Behaviosites are special kinds of agents that "infect" a system composed of agents operating in an environment. The behaviosites facilitate behavioral changes in agents to achieve altered, potentially improved, performance of the overall system. Behaviosites need to be designed so that they are intimately familiar with the internal workings of the environment and of the agents operating within it, and they apply this knowledge through various infection and manipulation strategies. To demonstrate and test this paradigm, we implemented a version of the El Farol problem using behaviosites.

Adaptive Agents and Multi-Agent Systems, May 8, 2017
In recent years, there has been increasing interest within the computational social choice community in models where voters are biased towards specific behaviors or have secondary preferences. An important representative example of this approach is the model of truth bias, where voters prefer to be honest about their preferences unless they are pivotal. This model has been demonstrated to be an effective tool for controlling the set of pure Nash equilibria in a voting game, which otherwise lacks predictive power. However, in the models that have been used thus far, the bias is binary, i.e., the final utility of a voter depends on whether they cast a truthful vote or not, independently of the type of lie. In this paper, we introduce a more robust framework that eliminates this limitation by investigating truth-biased voters with variable bias strength. Namely, we assume that even when voters face incentives to lie towards a better outcome, the ballot's distortion from their truthful preference incurs a cost, measured by a distance function. We study various such distance-based cost functions and explore their effect on the set of Nash equilibria of the underlying game. Intuitively, one might expect that such distance metrics would induce similar behavior. To our surprise, we show that the presented metrics exhibit quite different equilibrium behavior.
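To make the variable-strength bias concrete, here is a small illustrative sketch in which a voter's utility is the outcome utility minus a distance-based cost for deviating from the truthful ballot. The swap (Kendall tau) distance and the weight lam are example choices, not necessarily the cost functions studied in the paper.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Number of candidate pairs ordered differently by the two rankings."""
    pos_a = {c: i for i, c in enumerate(rank_a)}
    pos_b = {c: i for i, c in enumerate(rank_b)}
    return sum(
        (pos_a[x] < pos_a[y]) != (pos_b[x] < pos_b[y])
        for x, y in combinations(rank_a, 2)
    )

def biased_utility(outcome_utility, ballot, truthful_ballot, lam=0.1):
    # The further the cast ballot is from the truthful one,
    # the larger the cost of lying.
    return outcome_utility - lam * kendall_tau(ballot, truthful_ballot)
```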

arXiv (Cornell University), May 26, 2022
We investigate model-free multi-agent reinforcement learning (MARL) in environments where off-beat actions are prevalent, i.e., all actions have pre-set execution durations. During execution durations, the environment changes are influenced by, but not synchronised with, action execution. Such a setting is ubiquitous in many real-world problems. However, most MARL methods assume actions are executed immediately after inference, which is often unrealistic and can lead to catastrophic failure for multi-agent coordination with off-beat actions. In order to fill this gap, we develop an algorithmic framework for MARL with off-beat actions. We then propose a novel episodic memory, LeGEM, for model-free MARL algorithms. LeGEM builds agents' episodic memories by utilizing agents' individual experiences. It boosts multi-agent learning by addressing the challenging temporal credit assignment problem raised by off-beat actions via our novel reward redistribution scheme, alleviating the issue of non-Markovian reward. We evaluate LeGEM on various multi-agent scenarios with off-beat actions, including the Stag-Hunter Game, Quarry Game, Afforestation Game, and StarCraft II micromanagement tasks. Empirical results show that LeGEM significantly boosts multi-agent coordination and achieves leading performance and improved sample efficiency.

arXiv (Cornell University), Aug 9, 2021
Recent studies in multi-agent communicative reinforcement learning (MACRL) have demonstrated that multi-agent coordination can be greatly improved by allowing communication between agents. Meanwhile, adversarial machine learning (ML) has shown that ML models are vulnerable to attacks. Despite the increasing concern about the robustness of ML algorithms, how to achieve robust communication in multi-agent reinforcement learning has been largely neglected. In this paper, we systematically explore the problem of adversarial communication in MACRL. Our main contributions are threefold. First, we propose an effective method to perform attacks in MACRL, by learning a model to generate optimal malicious messages. Second, we develop a defence method based on message reconstruction, to maintain multi-agent coordination under message attacks. Third, we formulate the adversarial communication problem as a two-player zero-sum game and propose a game-theoretical method ℜ-MACRL to improve the worst-case defending performance. Empirical results demonstrate that many state-of-the-art MACRL methods are vulnerable to message attacks, and our method can significantly improve their robustness.

arXiv (Cornell University), Nov 28, 2019
Constructive election control considers the problem of an adversary who seeks to sway the outcome of an electoral process in order to ensure that their favored candidate wins. We consider the computational problem of constructive election control via issue selection. In this problem, a party decides which political issues to focus on to ensure victory for the favored candidate. We also consider a variation in which the goal is to maximize the number of voters supporting the favored candidate. We present strong negative results, showing, for example, that the latter problem cannot be approximated to within any constant factor. On the positive side, we show that when issues are binary, the problem becomes tractable in several cases, and admits a 2-approximation in the two-candidate case. Finally, we develop integer programming and heuristic methods for these problems.

Manipulation can be performed when intermediate voting results are known; voters might attempt to vote strategically and try to manipulate the results during an iterative voting process. When only partial voting preferences are available, preference elicitation is necessary. In this paper, we combine two iterative approaches, iterative preference elicitation and iterative voting, and study the outcome and performance of a setting where manipulative voters submit partial preferences. We provide practical algorithms for manipulation under the Borda voting rule and evaluate them using different voting centers: a Careful voting center that tries to avoid manipulation and a Naive voting center. We show that in practice, manipulation happens in a low percentage of the settings and has a low impact on the final outcome. The Careful voting center reduces manipulation even further.
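For reference, a minimal implementation of the Borda rule that the study's manipulators target (the paper's Careful and Naive voting centers and its partial-preference machinery are not reproduced here):

```python
def borda_winner(profile, candidates):
    """profile: list of full rankings, best candidate first.

    Each candidate earns m - 1 - position points per ballot;
    the highest-scoring candidate wins.
    """
    m = len(candidates)
    scores = {c: 0 for c in candidates}
    for ranking in profile:
        for pos, c in enumerate(ranking):
            scores[c] += m - 1 - pos
    # Break ties lexicographically for determinism.
    return max(sorted(candidates), key=lambda c: scores[c])
```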

Centralized training with decentralized execution (CTDE) has become an important paradigm in multi-agent reinforcement learning (MARL). Current CTDE-based methods rely on restrictive decompositions of the centralized value function across agents, decomposing the global Q-value into individual Q-values to guide individuals' behaviours. However, such expected, i.e., risk-neutral, Q-value decomposition is not sufficient even with CTDE, due to the randomness of rewards and the uncertainty in environments, which causes these methods to fail to train coordinating agents in complex environments. To address these issues, we propose RMIX, a novel cooperative MARL method with the Conditional Value at Risk (CVaR) measure over the learned distributions of individuals' Q-values. Our main contributions are threefold: (i) we first learn the return distributions of individuals to analytically calculate CVaR for decentralized execution; (ii) we then propose a dynamic risk-level predictor for CVaR calculation to handle the temporal nature of stochastic outcomes during execution; (iii) we finally propose a risk-sensitive Bellman equation along with Individual-Global-Max (IGM) for MARL training. Empirically, we show that our method significantly outperforms state-of-the-art methods on many challenging StarCraft II tasks, demonstrating significantly enhanced coordination and high sample efficiency.
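As a concrete anchor for the risk measure used above, here is a simple empirical CVaR computation. RMIX itself calculates CVaR analytically from learned return distributions, which this sketch does not reproduce.

```python
import numpy as np

def cvar(returns, alpha=0.1):
    """Conditional Value at Risk: the mean of the worst
    alpha-fraction of returns (lower returns are worse)."""
    returns = np.sort(np.asarray(returns, dtype=float))
    k = max(1, int(np.ceil(alpha * len(returns))))
    return returns[:k].mean()
```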

Trembling hand (TH) equilibria were introduced by Selten in 1975. Intuitively, these are Nash equilibria that remain stable when players assume that there is a small probability that other players will choose off-equilibrium strategies. This concept is useful for equilibrium refinement, i.e., selecting the most plausible Nash equilibria when the set of all Nash equilibria can be very large, as is the case, for instance, for Plurality voting with strategic voters. In this paper, we analyze TH equilibria of Plurality voting. We provide an efficient algorithm for computing a TH best response and establish many useful properties of TH equilibria in Plurality voting games. On the negative side, we provide an example of a Plurality voting game with no TH equilibria, and show that it is NP-hard to check whether a given Plurality voting game admits a TH equilibrium where a specific candidate is among the election winners.
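An illustrative sketch of the trembling-hand perturbation in a Plurality game: each voter casts the intended ballot with probability 1 - eps and trembles uniformly to another candidate otherwise. Both eps and the uniform tremble are illustrative assumptions, not the paper's exact model.

```python
def expected_plurality_scores(intended_votes, candidates, eps=0.01):
    """Expected Plurality scores when every voter trembles with
    probability eps, uniformly over the other candidates."""
    m = len(candidates)
    scores = {c: 0.0 for c in candidates}
    for v in intended_votes:
        for c in candidates:
            scores[c] += (1 - eps) if c == v else eps / (m - 1)
    return scores
```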

Haste makes waste
Proceedings of the International Conference on Web Intelligence, Aug 23, 2017
Voting is a common way to reach a group decision. When possible, voters will attempt to vote strategically in order to optimize their satisfaction with the outcome. Previous research has modelled how rational voter agents (bots) vote to maximize their personal utility in an iterative voting process that has a deadline (a timeout). However, it remains an open question whether human beings behave rationally when faced with the same settings. The focus of this paper is therefore to examine how the deadline factor affects manipulative behavior in real-world scenarios where humans are required to reach a decision before a deadline. An online platform was built to enable voting games by all types of users: agents (bots), humans, and mixed games with both humans and agents. We compare the results of human behavior and bot behavior and conclude that it might be wise to allow bots to make (certain) decisions on our behalf.

Robotics and Autonomous Systems, Sep 1, 2010
In this paper we extend the control methodology based on Extended Markov Tracking (EMT) by providing the control algorithm with capabilities to calibrate and even partially reconstruct the environment's model. This enables us to resolve the problem of performance deterioration due to model incoherence, a problem faced by all model-based control methods. The new algorithm, Ensemble Actions EMT (EA-EMT), utilises the initial environment model as a library of state transition functions and applies a variation of prediction with experts to assemble and calibrate a revised model. In so doing, it becomes the first hybrid control algorithm that enables on-line adaptation within the egocentric control framework, which dictates the control of an agent's perceptions rather than of an agent's environment state. In our experiments, we performed a range of tests with increasing model incoherence induced by three types of exogenous environment perturbations: catastrophic (the environment becomes completely inconsistent with the model), deviating (some aspect of the environment's behaviour diverges from that specified in the model), and periodic (the environment alternates between several possible divergences). The results show that EA-EMT resolved model incoherence and significantly outperformed its EMT predecessor by up to 95%.
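A minimal sketch of the "prediction with experts" building block that EA-EMT adapts: each candidate transition model in the library is treated as an expert and reweighted multiplicatively according to its prediction loss. The update rule and loss scaling are generic, not the paper's exact variant.

```python
import numpy as np

def reweigh_experts(weights, losses, eta=0.5):
    """One multiplicative-weights update; losses in [0, 1], one per model."""
    w = np.asarray(weights) * np.exp(-eta * np.asarray(losses))
    return w / w.sum()

def mixture_prediction(weights, model_predictions):
    """Ensemble next-state prediction as the weighted mix of the library.

    model_predictions: array of shape (n_models, state_dim)."""
    return np.average(model_predictions, axis=0, weights=weights)
```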

arXiv (Cornell University), Apr 24, 2023
This paper investigates policy resilience to training-environment poisoning attacks on reinforcement learning (RL) policies, with the goal of recovering the deployment performance of a poisoned RL policy. Because policy resilience is an add-on concern for RL algorithms, it should be resource-efficient, time-conserving, and widely applicable without compromising the performance of the RL algorithms themselves. This paper proposes such a policy-resilience mechanism based on the idea of knowledge sharing. We organize policy resilience into three stages: preparation, diagnosis, and recovery. Specifically, we design the mechanism as a federated architecture coupled with meta-learning, pursuing efficient extraction and sharing of environment knowledge. With the shared knowledge, a poisoned agent can quickly identify the deployment condition and accordingly recover its policy performance. We empirically evaluate the resilience mechanism for both model-based and model-free RL algorithms, showing its effectiveness and efficiency in restoring the deployment performance of a poisoned policy.

arXiv (Cornell University), Feb 7, 2023
Recent advances in multi-agent reinforcement learning (MARL) allow agents to coordinate their behaviors in complex environments. However, common MARL algorithms still suffer from scalability and sparse-reward issues. One promising approach to resolving them is automatic curriculum learning (ACL). ACL involves a student (curriculum learner) training on tasks of increasing difficulty controlled by a teacher (curriculum generator). Despite its success, ACL's applicability is limited by (1) the lack of a general student framework for dealing with the varying number of agents across tasks and the sparse-reward problem, and (2) the non-stationarity of the teacher's task due to ever-changing student strategies. To remedy these issues, we introduce a novel automatic curriculum learning framework, Skilled Population Curriculum (SPC), which adapts curriculum learning to multi-agent coordination. Specifically, we endow the student with population-invariant communication and a hierarchical skill set, allowing it to learn cooperation and behavior skills from distinct tasks with varying numbers of agents. In addition, we model the teacher as a contextual bandit conditioned on student policies, enabling a team of agents to change its size while still retaining previously acquired skills. We also analyze the inherent non-stationarity of this multi-agent automatic curriculum teaching problem and provide a corresponding regret bound. Empirical results show that our method improves performance, scalability and sample efficiency in several MARL environments.
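To illustrate the teacher-as-contextual-bandit view, here is a standard LinUCB learner in which the context would summarise the current student policies and each arm is a candidate task. This is a generic stand-in for the bandit component, not the SPC algorithm itself.

```python
import numpy as np

class LinUCBTeacher:
    """Disjoint LinUCB contextual bandit; arms are candidate training tasks."""
    def __init__(self, n_tasks, ctx_dim, alpha=1.0):
        self.A = [np.eye(ctx_dim) for _ in range(n_tasks)]    # per-arm Gram matrix
        self.b = [np.zeros(ctx_dim) for _ in range(n_tasks)]  # per-arm reward sums
        self.alpha = alpha                                    # exploration width

    def select(self, ctx):
        """Pick the task with the highest upper confidence bound for this context."""
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b  # ridge-regression reward estimate
            scores.append(theta @ ctx + self.alpha * np.sqrt(ctx @ A_inv @ ctx))
        return int(np.argmax(scores))

    def update(self, task, ctx, reward):
        """Incorporate the student's measured progress on the chosen task."""
        self.A[task] += np.outer(ctx, ctx)
        self.b[task] += reward * ctx
```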

Group Decision and Negotiation, Sep 20, 2019
A voting center is in charge of collecting and aggregating voter preferences. In an iterative process, the center sends comparison queries to voters, requesting them to submit their preference between two items. Voters might discuss the candidates among themselves, figuring out during the elicitation process which candidates stand a chance of winning and which do not. Consequently, strategic voters might attempt to manipulate the outcome by deviating from their true preferences, instead submitting a different response in order to maximize their profit. We provide a practical algorithm for strategic voters which computes the best manipulative vote and maximizes the voter's selfish outcome when such a vote exists. We also provide a careful voting center which is aware of the possible manipulations and avoids manipulative queries when possible. In an empirical study on four real-world domains, we show that in practice manipulation occurs in a low percentage of settings and has a low impact on the final outcome. The careful voting center reduces manipulation even further, thus allowing for a non-distorted group decision process to take place. We thus provide a core technology study of a voting process that can be adopted in opinion or information aggregation systems and in crowdsourcing applications, e.g., peer grading in Massive Open Online Courses (MOOCs).

arXiv (Cornell University), Nov 16, 2019
We consider the problem of limited-bandwidth communication for multi-agent reinforcement learning, where agents cooperate with the assistance of a communication protocol and a scheduler. The protocol and scheduler jointly determine which agent communicates what message, and to whom. Under the limited-bandwidth constraint, a communication protocol is required to generate informative messages. Meanwhile, unnecessary communication connections should not be established, because they occupy limited resources in vain. In this paper, we develop an Informative Multi-Agent Communication (IMAC) method to learn efficient communication protocols as well as scheduling. First, from the perspective of communication theory, we prove that the limited-bandwidth constraint requires low-entropy messages throughout the transmission. Then, inspired by the information bottleneck principle, we learn a valuable and compact communication protocol and a weight-based scheduler. To demonstrate the efficiency of our method, we conduct extensive experiments in various cooperative and competitive multi-agent tasks with different numbers of agents and different bandwidths. We show that IMAC converges faster and leads to efficient communication among agents under limited bandwidth as compared to many baseline methods.
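A hedged sketch of the kind of low-entropy message regulariser suggested by the information-bottleneck view above: for a Gaussian message, a KL penalty toward a unit Gaussian keeps messages compact. The exact objective used in IMAC may differ.

```python
import torch

def gaussian_message_kl(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)) per message: an information-bottleneck
    style penalty that keeps messages compact under limited bandwidth."""
    return 0.5 * torch.sum(mu.pow(2) + log_var.exp() - 1.0 - log_var, dim=-1)
```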

In this paper we extend the control methodology based on Extended Markov Tracking (EMT) by providing the control algorithm with capabilities to calibrate and even partially reconstruct the environment's model. This enables us to resolve the problem of performance deterioration due to model incoherence, a problem faced by all model-based control methods. The new algorithm, Ensemble Actions EMT (EA-EMT), utilises the initial environment model as a library of state transition functions and applies a variation of prediction with experts to assemble and calibrate a revised model. In so doing, it becomes the first control algorithm that enables on-line adaptation within the Dynamics Based Control (DBC) framework. In our experiments, we performed a range of tests with increasing model incoherence induced by three types of exogenous environment perturbations: catastrophic, periodic, and deviating. The results show that EA-EMT resolved model incoherence and significantly outperformed the best currently available DBC solution by up to 95%.

arXiv (Cornell University), Apr 23, 2015
Most models of Stackelberg security games assume that the attacker only knows the defender's mixed strategy, but is not able to observe (even partially) the instantiated pure strategy. Such partial observation of the deployed pure strategy, an issue we refer to as information leakage, is a significant concern in practical applications. While previous research on patrolling games has considered the attacker's real-time surveillance, our setting, and therefore our models and techniques, are fundamentally different. More specifically, after describing the information leakage model, we start with an LP formulation to compute the defender's optimal strategy in the presence of leakage. Perhaps surprisingly, we show that a key subproblem in solving this LP (more precisely, the defender oracle) is NP-hard even for the simplest of security game models. We then approach the problem from three possible directions: efficient algorithms for restricted cases, approximation algorithms, and heuristic algorithms for sampling that improve upon the status quo. Our experiments confirm the necessity of handling information leakage and the advantage of our algorithms.

arXiv (Cornell University), May 30, 2019
The Distributed Constraint Optimization Problem (DCOP) formulation is a powerful tool to model multi-agent coordination problems that are distributed by nature. The formulation is suitable for problems where variables are discrete and constraint utilities are represented in tabular form. However, many real-world applications have variables that are continuous, and tabular forms thus cannot accurately represent their constraint utilities. To overcome this limitation, researchers have proposed the Functional DCOP (F-DCOP) model, which extends DCOPs to continuous variables. But existing approaches usually come with restrictions on the form of constraint utilities and offer no quality guarantees. Therefore, in this paper, we (i) propose exact algorithms to solve a specific subclass of F-DCOPs; (ii) propose approximation methods with quality guarantees to solve general F-DCOPs; and (iii) empirically show that our algorithms outperform existing state-of-the-art F-DCOP algorithms on randomly generated instances when given the same communication limitations.
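To make the tabular-versus-functional distinction concrete, here is a toy contrast between the two constraint representations; both utility definitions are invented for illustration and are not taken from the paper.

```python
# Tabular DCOP constraint over two binary variables: every joint
# assignment gets an explicitly enumerated utility.
table_utility = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 5.0}

def functional_utility(x, y):
    """F-DCOP constraint over continuous variables: algorithms optimise
    the function directly instead of enumerating a utility table."""
    return -(x - 1.0) ** 2 - (y + 0.5) ** 2 + x * y
```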