Nihar B. Shah - Research



Since the second half of my PhD, I have been working in the areas of machine learning and game theory, with a particular focus on applications to crowdsourcing. Here is a brief description of some of my work.
Statistical inference from crowdsourced data

Data in the form of pairwise comparisons arises in many applications. Pairwise comparisons are free from several biases and are much faster and easier to make than numeric scores. Given noisy comparisons between various pairs of items, the question that arises is how to draw meaningful inferences from this data. While prior literature mostly focuses on quite restrictive "parametric" models, in my work I instead consider models that go beyond these parametric notions and are much more flexible. I develop estimation algorithms and fundamental theoretical guarantees for these models.


"Unique" mechanisms for obtaining high-quality data from crowdsourcing

A major problem with the data obtained from crowdsourcing is that it is extremely noisy. My work designs novel data collection mechanisms that lead to the collection of higher-quality data. The proposed mechanisms are rooted in fundamental theory: we show that our mechanisms are unique in that they are the only mechanisms that can satisfy a natural 'no-free-lunch' axiom. Our mechanisms have a simple and interesting "multiplicative" form.


Fun and games with rank aggregation

And finally, just for fun, you may like playing this game that I've made.
Here are the instructions.
Here is the algorithm underlying the game.


I have previously worked on problems in the area of coding and information theory. More specifically, my prior research focused on constructing (error-correcting) codes for distributed storage systems, which need to be highly reliable, make the data widely available, handle node failures efficiently, and use resources such as storage space and bandwidth optimally.

The video on the left gives a gentle introduction to an aspect of our work along with a toy example of one of our code constructions. More details are provided below.


Today's large-scale distributed storage systems comprise thousands of nodes, storing hundreds of petabytes of data. In these systems, component failures are common, and this makes it essential to store the data in a redundant fashion to ensure reliability. The most common way of adding redundancy is replication. However, replication is highly inefficient in terms of storage utilization, and hence many distributed storage systems are now turning to Reed-Solomon (erasure) codes. While these codes are optimal in terms of storage utilization, they perform poorly in terms of network bandwidth: during repair of a failed node, they require download of the entire data to recover the small fraction that was stored in the failed node. Recovering data stored in a particular node is of interest not only during repair but also for many other applications.

Regenerating codes are a new class of error-correcting codes for distributed storage systems that are optimal with respect to both storage space and network bandwidth. They come in two flavors: (i) Minimum Storage Regenerating (MSR) codes, which minimize bandwidth usage while storing an optimal amount of data, and (ii) Minimum Bandwidth Regenerating (MBR) codes, which further minimize bandwidth usage by allowing for slightly higher storage space.
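
For a rough sense of scale, the sketch below evaluates the standard storage and repair-bandwidth expressions for the MSR and MBR points from the regenerating-codes literature. These formulas, the function names, and the parameters B (file size), k (nodes needed to reconstruct the file), and d (helper nodes contacted during repair) are not stated on this page; they are assumptions of this illustration.

```python
from fractions import Fraction

def rs_repair_download(B):
    """Naive Reed-Solomon repair: download the whole file of size B."""
    return Fraction(B)

def msr_point(B, k, d):
    """(storage per node, total repair download) at the MSR point."""
    alpha = Fraction(B, k)                  # minimum possible storage per node
    beta = Fraction(B, k * (d - k + 1))     # download from each of d helpers
    return alpha, d * beta

def mbr_point(B, k, d):
    """(storage per node, total repair download) at the MBR point."""
    beta = Fraction(2 * B, k * (2 * d - k + 1))
    gamma = d * beta
    return gamma, gamma                     # at MBR, storage equals repair download

# Hypothetical toy parameters, chosen only for illustration.
B, k, d = 60, 3, 4
print("RS  repair download:", rs_repair_download(B))
print("MSR (storage, repair):", msr_point(B, k, d))
print("MBR (storage, repair):", mbr_point(B, k, d))
```

With these toy parameters, MSR keeps the per-node storage at B/k while downloading less than the full file during repair, and MBR lowers the repair download further at the cost of storing somewhat more per node.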

Here is a description of some of our codes.
Product-Matrix Codes

Product-Matrix codes are the most general class of regenerating code constructions available in the literature, spanning both MBR and MSR, and providing a complete solution to the high-redundancy regime. These codes are the first and the only explicit construction of regenerating codes that allow the number of nodes in the system to scale irrespective of other system parameters. This attribute allows one to vary the number of nodes in the system on the fly, which is very useful, for example, in provisioning resources based on demand. Further, these codes can be implemented using existing Reed-Solomon encoders and decoders as building blocks.

Under these codes, each node is associated with an encoding vector. The source data is assembled in the form of a matrix, and each node stores the projection of this matrix along its encoding vector. To recover data stored in a node, all helper nodes pass a projection of the data stored in them along the direction of the encoding vector of the failed node.
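
The projection-based storage and repair described above can be sketched as follows. This is a minimal illustration of the repair step only, under assumptions not stated on this page: arithmetic over the small prime field GF(257), Vandermonde encoding vectors, and a symmetric message matrix (as in the MBR variant). The names `project`, `solve_mod`, and `repair` are hypothetical.

```python
P = 257  # prime field size; an assumption of this sketch

def encoding_vector(x, d):
    """Vandermonde row [1, x, ..., x^(d-1)] mod P (assumed choice of vectors)."""
    return [pow(x, j, P) for j in range(d)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v)) % P

def project(M, psi):
    """psi^T M: the d symbols stored by the node with encoding vector psi."""
    d = len(psi)
    return [sum(psi[r] * M[r][c] for r in range(d)) % P for c in range(d)]

def solve_mod(A, b):
    """Solve A x = b over GF(P) by Gauss-Jordan elimination (A invertible)."""
    n = len(A)
    aug = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = pow(aug[col][col], -1, P)  # modular inverse, Python 3.8+
        aug[col] = [(v * inv) % P for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col]:
                f = aug[r][col]
                aug[r] = [(vr - f * vc) % P for vr, vc in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

d, n = 3, 5
M = [[5, 7, 11],      # symmetric message matrix holding the source data
     [7, 13, 17],
     [11, 17, 0]]
psis = [encoding_vector(x, d) for x in range(1, n + 1)]
stored = [project(M, psi) for psi in psis]

def repair(failed, helpers):
    """Rebuild the failed node's contents from d helper nodes."""
    psi_f = psis[failed]
    # Each helper sends ONE symbol: its stored data projected along psi_f.
    symbols = [dot(stored[h], psi_f) for h in helpers]
    # symbols = Psi_helpers (M psi_f); invert to obtain M psi_f.
    m_psi_f = solve_mod([psis[h] for h in helpers], symbols)
    # M is symmetric, so (M psi_f)^T equals psi_f^T M, the lost contents.
    return m_psi_f

print(repair(0, [1, 2, 4]) == stored[0])
```

Each helper transmits a single symbol rather than its full contents, which is where the bandwidth saving comes from in this sketch.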

We have also optimized these codes with respect to other aspects of distributed storage systems, such as security, error correction, and scaling via ratelessness.
Twin Codes

This is a simple, yet powerful, framework which allows one to use any erasure code and still remain bandwidth efficient. Under this framework, nodes are partitioned into two types, and the data is encoded using two (possibly different) codes. The data is encoded in a manner such that the problem of repairing nodes of one type is reduced to that of erasure-decoding of the code employed by the other type. Here, one can choose the constituent codes based on the properties required; for instance, employing LDPC codes allows for low-complexity decoding algorithms.
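
The reduction of repair to erasure decoding can be sketched with a toy instance. The details below are assumptions made for illustration and are not given on this page: the message is arranged as a k x k matrix, one node type encodes the matrix and the other encodes its transpose, both constituent codes are small Vandermonde (Reed-Solomon-like) codes over GF(257), and the helper names are hypothetical.

```python
P = 257  # prime field size; an assumption of this sketch
k = 2

def vandermonde_row(x):
    return [pow(x, j, P) for j in range(k)]

def encode(vec, M):
    """Row vector vec^T M: the k symbols stored by one node."""
    return [sum(vec[r] * M[r][c] for r in range(k)) % P for c in range(k)]

def erasure_decode(rows, symbols):
    """Recover the k message symbols of a code from k coded symbols
    (Gauss-Jordan elimination over GF(P))."""
    aug = [row[:] + [s] for row, s in zip(rows, symbols)]
    for col in range(k):
        pivot = next(r for r in range(col, k) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = pow(aug[col][col], -1, P)  # modular inverse, Python 3.8+
        aug[col] = [(v * inv) % P for v in aug[col]]
        for r in range(k):
            if r != col and aug[r][col]:
                f = aug[r][col]
                aug[r] = [(vr - f * vc) % P for vr, vc in zip(aug[r], aug[col])]
    return [row[k] for row in aug]

M0 = [[3, 8],
      [21, 34]]                          # message matrix (k*k symbols)
M1 = [list(col) for col in zip(*M0)]     # its transpose, used by type-1 nodes

g0 = [vandermonde_row(x) for x in (1, 2, 3)]     # code used by type-0 nodes
g1 = [vandermonde_row(x) for x in (4, 5, 6, 7)]  # code used by type-1 nodes
type0 = [encode(g, M0) for g in g0]
type1 = [encode(g, M1) for g in g1]

def repair_type0(failed, helpers):
    """Repair a type-0 node using k helper nodes of type 1."""
    gf = g0[failed]
    # Each type-1 helper sends one symbol: its data projected along gf.
    symbols = [sum(a * b for a, b in zip(type1[j], gf)) % P for j in helpers]
    # These symbols form a partial codeword of the type-1 code whose
    # message is exactly the lost data, so erasure decoding recovers it.
    return erasure_decode([g1[j] for j in helpers], symbols)

print(repair_type0(1, [0, 3]) == type0[1])
```

In this sketch the replacement node never touches the type-0 code during repair; it only runs the decoder of the type-1 code, so any code with a convenient decoder could be swapped in for `g1`.
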
Repair-by-Transfer Codes

These are MBR codes which perform repair-by-transfer: the data stored in any node can be recovered by mere transfer of data from other nodes, without requiring any computation. This minimizes the disk I/O, since each node reads only the data that it transfers, and also permits the use of "dumb" nodes. The animated video above presents the working of these codes.
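
One way to picture repair-by-transfer is a layout in which every pair of nodes shares exactly one coded symbol, so that a failed node can be rebuilt by having each surviving node forward the symbol it shares with it. The sketch below assumes this pairwise layout and uses placeholder strings for the coded symbols (in a real scheme they would come from an outer erasure code, not shown here); all names are hypothetical.

```python
from itertools import combinations

n = 5
pairs = list(combinations(range(n), 2))   # one coded symbol per pair of nodes

# Placeholder coded symbols; an outer erasure code is assumed, not implemented.
coded = {pair: f"c{idx}" for idx, pair in enumerate(pairs)}

# Node i stores the n-1 symbols belonging to the pairs that include i.
nodes = {i: {p: coded[p] for p in pairs if i in p} for i in range(n)}

def repair(failed):
    """Rebuild a failed node by plain copying from every other node."""
    rebuilt = {}
    for helper in range(n):
        if helper == failed:
            continue
        pair = tuple(sorted((helper, failed)))
        # The helper reads exactly one stored symbol and forwards it as-is.
        rebuilt[pair] = nodes[helper][pair]
    return rebuilt

print(repair(2) == nodes[2])
```

No arithmetic appears anywhere in `repair`: each helper reads and sends a single stored symbol, which is why such nodes can be "dumb".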

Interference Alignment based MISER Codes

Interference Alignment is a concept that was proposed in the field of wireless communication to efficiently handle multiple interfering communications. We show that this concept necessarily arises during repair in MSR codes. Using these insights we construct codes operating at the MSR point.
Non-achievability Results

It has been shown that there is a fundamental tradeoff between the two resources: storage space and network bandwidth, and MSR and MBR are the two extreme points of this tradeoff. While the codes described above operate at these extreme points, we have also shown that there are no codes which can achieve the interior points on this tradeoff.