SERVICE RATES IN CODED DISTRIBUTED STORAGE
Coding has traditionally been used in transmission and storage of data to provide reliability in a more effcient way than simple replication. The traditional performance indicators of codes are the minimum distance and the code rate. More recently, special codes have been developed that also provide efficient maintenance of storage under node failures and/or data updates. In addition to the traditional metrics, the properties of codes that matter in theses scenarios are, e.g., the code locality, availability, and update efficiency. Emerging applications, such as distributed learning and fog computing, are adding yet another use for coding. In these applications, the goal is to maximize the number of users that can be simultaneously served by the system.
SERVICE RATES OF CODES
There are k files (say, elements of some fnite field) to be stored (redundantly) across n servers, n ≥ k. We assume wlog that the time to download a file from server m is exponential with rate μm, 1 ≤ m ≤ n, and that requests to download file i arrive at rate λi, 1 ≤ i ≤ k. Each server can store a linear combination of a (sub)set of files (encoded file), and thus can be used to help fulfill a request for any file in that subset. The achievable service rate region of such a system is the set of request rate vectors (λ1,λ2,...,λk) that can be served so that no node is sent requests in excess of its service rate. Interestingly, the best schemes (those that maximize the achievable service rate region) often combine replication and coding. We started looking into this problem in order to find when some complex queues that arise in distributed storage are stable. The problem turned out to be independently rich and connected to several other problems. Some aspects of the problem generalize batch coding, and some give rise to interesting questions in (fractional) graph theory.
CODED STORAGE ALLOCATIONS FOR FAST ACCESS AND ROBUST SERVICE
The allocation of (redundant) file chunks throughout a distributed storage system affects important performance metrics, such as the probability of file recovery, data download time, or the service rate of the system, under a given data access model. Although these metrics seem very related, perhaps counterintuitively, they are optimized by different allocations. The questions associated with this problem are hard to answer even under the assumption of quasi-uniform storage allocations. Often, only general conclusions exist. For example, the service rate under the assumption that the stored data is large and its download time is not negligible is optimized by allocations that depend on different system parameters. This is not the case under the assumption that the download time does not scale with the size of data, where the minimal spreading allocation is universally optimal. Some questions are related to an old conjecture by Erdős on the maximum number of edges in a uniform hypergraph.
HIDING THE META DATA OF COMMUNICATIONS
In numerous circumstances, the very act of communicating needs to be hidden. In war time, not only the content of messages but also the volume of communication to or from suspected parties can alert the adversary. In everyday life, revealing the identity of communicating parties, not only the exchanged information, affects the increasingly important anonymity and privacy. One line of our research explores covert passing of information between (multiple) source/sink pairs in mobile IoT networks. Another line of our research is concerned with the anonymity mix routers, which are in some form present in essentially all anonymity platforms.
STRAGGLING FOR COVERT MESSAGE PASSING ON GRAPHS
NSF Grant No. SaTC-1816404
This line of work explores covert passing of information between (multiple) source/sink pairs by incremental and possibly redundant storage and retrieval of data in mobile networks hosting a multitude of IoT objects capable of storing, sending and receiving data. We recently proposed a protocol, involving random walks on graphs, that increases communication covertness at the expense of an increased delay of information transfer. We showed how two forms of redundancy can be used as an effective tool to control the tradeoff between the covertness and the delay.
ANONYMITY MIXES AS (PARTIAL) ASSEMBLY LIKE QUEUES
An anonymity (threshold) Mix is a sophisticated message router that receives and holds packets from message sources and forwards them in a batch to their respective destinations only when it accumulates messages from some prescribed number of sources. Because of such simultaneous transmissions of messages, the identities of communicating pairs remain hidden to possible adversaries that seek to link message sources and destinations. The price of achieving anonymity in this way is delay, because messages are held at the Mix until a batch of a certain size is formed. We explored two promising ideas about how to compute the delay, and have some preliminary results. One idea is to model batch mixes as generalized assembly-like queues and develop an approximate queuing analysis of these objects. The other idea is to model the source/destination channels as urns and messages as balls, and compute the channels' queues occupancy, and the time it takes to accumulate enough messages to have a departure, as a variant of the coupon collection problem.
STRAGGLER MITIGATION AT SCALE
Large-scale sharing of computing resources causes random fluctuations in service times. When a large job is split into many small tasks for parallel execution, even small-variance random service times will, with high probability, result in a non-negligible number of straggling tasks with long completion time. Redundancy, in the form of simple task replication, and more recently, erasure coding, has emerged as a potentially powerful way to curtail the variability in service. However, with added redundancy, the scale of resource sharing becomes even larger, and consequently, the fluctuations in service time (that redundancy is trying to counteract) increase. Is it then good to add redundancy? How much? When? We have begun to answer some of these questions analytically.
WHICH CLONES SHOULD ATTACK AND WHEN?
We consider systems where redundant tasks are introduced in executing jobs with many tasks at some time Δ ≥ 0. We are concerned with the latency reduction offered by using simple replication or erasure coding to create redundant tasks as well as with the computing costs of running such schemes. We show that the tail heaviness of the PDFs modeling task execution times is a decisive parameter for the cost vs. latency tradeoff. We find that coding achieves more favorable cost vs. latency tradeoff, and can yield reduction in both cost and latency under less heavy tailed execution times. We show that delaying redundancy is not effective in reducing cost.
DELAYED CANCELING AND RELAUNCHING OF (REDUNDANT) TASKS
Two properties of heavy tailed task execution times suggest that canceling & relaunching tasks still running at some time ∆ is a promising straggler mitigation straggler mitigation strategy. First, the longer a task has taken to execute the longer its residual lifetime is likely to be. Second, the majority of the mass in a set of samples drawn from a heavy tailed distribution is contributed by only a few of them. Consequently, among all tasks within a job, just a few are expected to be stragglers with much longer completion time compared to the non-stragglers, and thus should be canceled & relaunched. But at what time ∆ should the tasks that are still running be canceled & relaunched? We show that there is an optimal ∆, below which we would be canceling non-stragglers. and above which, the straggling tasks would increase both the latency and the cost of the job execution.
IMPACT OF REDUNDANCY ON THE SYSTEM
The impact of redundancy on the computing system is usually ignored in research papers on the topic (including some of ours), except for the computing cost in to run redundant tasks. However, redundant tasks exert extra load on the system, which is likely to aggravate the existing contention for the system resources. Given that resource contention is the primary cause of runtime variability, the added redundant tasks are likely to further increase the variability in task execution times. We show that redundancy should not be used beyond a certain level, and that that level is lower for replication than it is for coding.
CODES FOR STORAGE WITH QUEUES FOR ACCESS
NSF Grant No. CIF-1717314
DOWNLOAD TIME OF AVAILABILITY CODES
In distributed storage systems, a special sub-class of locally repairable codes, referred to as availability codes, has been proposed to enable recovery of each data block from one of its repair groups. A repair group typically contains a small number of nodes and does not overlap with any other repair group of the same data block. Availability codes have several important benefits, including high degree of fault tolerance, efficient recovery from failures, and efficient access to data by multiple users. We study the availability codes from a queuing-theoretical perspective in low and high traffic regime. Specifically, we analyze the average time necessary to download a block of data under the Poisson request arrival model. Our results indicate that availability codes can minimize the download time of data blocks in some settings, but are not always optimal. This is a notable difference from MDS codes which always reduce download time of complete files.
(n,k) FORK - JOIN SYSTEMS
We study how coding in distributed storage reduces expected download time, in addition to providing reliability against disk failures. The expected download time is reduced because when a content file is encoded with redundancy and distributed across multiple disks, reading only a subset of the disks is sufficient for content reconstruction. For the same total storage used, coding exploits the diversity in storage better than simple replication, and hence gives faster download. We introduce a novel fork-join queueing framework to model multiple users requesting the content simultaneously, and derive bounds on the expected download time. Our system model and results are a novel generalization of the fork-join system that is studied in queueing theory literature. Our results demonstrate the fundamental trade-off between the expected download time and the amount of storage space. This trade-off can be used for design of the amount of redundancy required to meet the delay constraints on content delivery.