Data-Dependent Random Features for Improved Generalization in Supervised Learning
At the heart of many machine learning problems, kernel methods (such as support vector machine) describe the nonlinear representation of data via mapping the features to a high-dimensional feature space. Without recourse to explicit form of the feature maps, one can compute their inner products inexpensively using a kernel function, an idea known as the “kernel trick”. However, unfortunately, methods using kernel matrices are not applicable to large-scale machine learning as they incur a massive computational cost on large datasets. This observation motivated researchers to consider kernel approximation using random features and extend the idea to train shallow architectures. A natural concern is then the stochastic oracle from which the features are sampled. Recently, it has been shown that employing data-dependent randomization improves the performance in terms of the required number of random features. Following this line of research in [AAAI-2018], we are concerned with the randomized-feature approach in supervised learning for accurate prediction (generalization). We propose the Energy-based Exploration of Random Features (EERF) algorithm which relies on a data-dependent score function that explores the set of possible features and exploits the promising regions. We theoretically prove that the proposed score function with high probability recovers the spectrum of the best fit within the model class. We have also applied EERF to several practical datasets (e.g. MNIST for digit recognition). Our empirical results verify that our method requires smaller number of random features to achieve a certain generalization error compared to the state-of-the-art while introducing negligible pre-processing overhead and requiring no additional tuning parameters.
Relevant work: [AAAI-2018]
Online Learning Using Tactile Sensors with Application to Object Recognition
In variety of engineering applications, the relevant information about a complex task can be acquired from multiple resources. The obtained data is often in different “modes”, i.e., it is provided by various types of sensors and detectors, or processed by several algorithms and experiments, motivated by the fact that a single mode cannot provide comprehensive knowledge of the task. In this project, we consider online fusion of multimodal data, where the goal is to combine information from multiple modes to identify an element in a large dictionary. We address this problem in the context of object recognition by focusing on tactile sensing as a mode. Using the tactile glove with seven sensors (shown in the picture), we grasp different objects to obtain a 7-dimensional time series, each component of which represents the pressures applied to one sensor. We then construct a dictionary using the vector times series corresponding to various objects grasped by different individuals. The problem is to match a streaming vector time series from grasping an unknown object to a dictionary object. We propose an algorithm that starts with prior knowledge provided by other modes. It then receives grasp samples sequentially, uses a dissimilarity metric to modify the prior, and forms a time-varying probability distribution calculating the likelihood of each dictionary object being grasped. When the dictionary objects are not similar in shape, our proposed method identifies the unknown object even with a uniform prior. When the shape of the unknown object is similar to another dictionary object, our method can only rule out the dissimilar objects, and for unique identification, it needs to incorporate the prior (other modes) to the algorithm. For each case, we show that our online algorithm maintains a similar performance to standard classification methods (e.g. SVM) with a significantly lower computational time.
Relevant work: [TCAS-II]
Best-Arm Identification in Multi-Armed Bandits
Multi-armed bandits is a well-known framework for modeling sequential decision making. A player explores a finite set of arms, and drawing each arm yields a reward drawn independently from a fixed, unknown distribution. In this context, the best arm is the one whose associated distribution has the largest expected value. The classical setting deals with the exploration-exploitation dilemma, i.e., the goal is to find the best arm in the pool (explore) and to pull it as many times as possible (exploit). However, there are applications for which the exploitation phase is not relevant. As an example, for commercializing a set of products, one only aims to identify the most popular product. The customers are not needed to test the products as often as they can, so the exploitation phase does not exist. Thus, in recent years, the pure-exploration setting has also received a great interest. We consider such problem in [TSP-2017], where the player is given a fixed budget to explore the arms and to find the best one as quickly as possible. We propose a general framework to unify sequential elimination algorithms, where the arms are dismissed iteratively until a unique arm is left. Our analysis reveals a novel performance measure expressed in terms of the sampling mechanism and number of eliminated arms at each round. Based on this result, we develop an algorithm that divides the budget according to a nonlinear function of remaining arms at each round. We provide theoretical guarantees for the algorithm, characterizing the suitable nonlinearity for different problem environments in terms of competitive arms. Matching the theoretical results, our experiments show that the nonlinear algorithm outperforms the state-of-the-art. We also extend our results to the side-observation model in which pulling an arm reveals the rewards of its related arms. Finally, in [Allerton-2017], we consider the best-M-arm identification problem, where the goal is to identify the top M arms in terms of expected rewards and extend the nonlinear budget allocation idea to this setting.
Online Optimization in Dynamic Environments
Online optimization is a popular area of interest in machine learning due to its modern, promising applications such as online advertisement placement and online web ranking. The problem is modeled as a game between a learner and an adversary where the learner sequentially chooses an action at each round, and the adversary in turn reveals a loss to the learner. Typically, the goal is to minimize the static regret defined with respect to the best single action in hindsight. In other words, the static regret is the difference between the accumulated loss versus the smallest possible loss (achieved with one single action) had the learner been aware of the entire loss sequence a priori. The literature has witnessed a host of works developing no-(static)regret algorithms. Perhaps less well-known, is the notion of “dynamic regret”, where the algorithm competes against the best action of each round. Indeed, aiming for this stringent benchmark is only possible under certain regularity conditions. In [AISTATS-2015], we proposed an algorithm whose dynamic regret can be bounded in terms of temporal variability of the loss sequence and regularity in the pattern of minima sequence. The algorithm is fully adaptive in the sense that the learner needs no prior knowledge of the environment. Importantly, combining both notions allows the dynamic regret to adapt to the smaller quantity and select the best of both worlds. More recently, we improved the regret bounds for the strongly convex losses [CDC-2016]. We further studied the problem of networked online optimization in the dynamic setting [TAC-2018], where instead of a single learner, we have a group of agents aiming to track the minimizer of a global time-varying function. These agents receive only local information about the global function, and they must communicate with each other to track the global minimizer collaboratively. We characterized the network and tracking errors in the context of networked dynamic regret, and we showed that our theoretical results are a valid generalization of existing results on the static setting.
Identification of Networks Using Power Spectral Analysis
Networks of interconnected systems are ubiquitous in engineering, biology, finance, and physics. The reconstruction of these networks is an inverse problem, and it has attracted considerable attention in various engineering domains (e.g. bio-engineering, circuits, and many more). We propose a methodology for reconstruction of directed networks using power spectral analysis. In particular, when inputs are wide-sense stationary noises, we derive a relationship between the power spectral densities of inputs and outputs, which encapsulates the network topology. Then, using a so-called node-knockout procedure, we develop a method to recover the Boolean structure of general directed networks. In addition, for undirected networks and non-reciprocal networks, we develop algorithms with less computation cost.
Distributed Detection and Estimation: Finite-time Analysis
Decentralized estimation, detection, and observational social learning has been an intense focus of research over the past three decades with applications ranging from sensor networks to social and economic networks. In these frameworks the computation burden is distributed among agents of a network, allowing them to achieve a “global” task using “local” information. Developments in distributed optimization have opened new venues to investigate principled distributed estimation. Viewing with an optimization lens, one can think of the problem as minimizing a network loss considered as a sum of individual losses. We considered in [TAC-2016] a stochastic optimization framework with linear losses, where the problem can be cast as a distributed detection. Observing private signals (individual gradients), agents use purely local interactions to detect the true state of the world which belongs to a finite state space. Unlike many existing approaches in the literature, we provided a “finite-time” analysis of the problem. We defined a cost for the algorithm (with respect to its centralized counterpart) which can be bounded uniformly in time in terms of relative entropy of signals, agents’ centralities, and network spectral gap. Benefiting from this fact, we show how one can speed up learning by designing an optimal network. We previously provided the asymptotic rate of a gossip variant of this problem in [CDC-2013]. On the other hand, using quadratic losses, we developed a distributed scheme in [NIPS-2013] where agents’ goal is to track a dynamic parameter. We proved both asymptotic and finite-time results about the algorithm highlighting the role of network structure in convergence quality.
Disclaimer: Any photo (from external source) used in this page is courtesy of the link associated to it.