Monday, November 24, 2008

Map-Reduce Performance

So this is Andy and Matei's fantastic OSDI work on improving Hadoop's speculative execution. The fix turns out to be pretty simple, since Hadoop does everything in a very naive way. Good work, and great problem selection. Lastly, great acronym selection (LATE); Matei is great at that.
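
As I understand the paper, the core of LATE ("Longest Approximate Time to End") is roughly the following. This is a sketch in my own words with made-up task fields, not Hadoop's actual scheduler code: estimate each running task's time to completion from its progress rate, and speculate on the slow task expected to finish last, rather than on any task whose progress is merely below average.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    start_time: float      # seconds since the job started (assume now > start_time)
    progress_score: float  # Hadoop-style progress estimate in (0, 1]

def estimated_time_left(task, now):
    """Time remaining, assuming the task keeps its current progress rate."""
    elapsed = now - task.start_time
    rate = task.progress_score / elapsed        # progress per second
    return (1.0 - task.progress_score) / rate

def pick_speculative_task(tasks, now, slow_fraction=0.25):
    """Among the slowest-progressing fraction of tasks, re-execute the one
    expected to finish furthest in the future."""
    if not tasks:
        return None
    by_rate = sorted(tasks, key=lambda t: t.progress_score / (now - t.start_time))
    slow = by_rate[: max(1, int(len(by_rate) * slow_fraction))]
    return max(slow, key=lambda t: estimated_time_left(t, now))
```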

Firstly, this work lies under the huge shadow of Google. No one really knows what their system does today. They published the MapReduce paper at OSDI 2004, and Hadoop is just about catching up to what their system was then. It's a solid academic area, but that lag leaves an open question about the novelty of the work.

This particular work is likely pretty novel, though, as Google runs a fairly homogeneous network; I think they have only about three types of machines in service at any point.

Another question revolves around MapReduce itself. I think it's telling how few research efforts actually USE the system, rather than work on the system. I've found that there's a set of problems it solves, and it's really not a huge set: pretty much the uses we've found in the RAD Lab, like log searching. It doesn't build services. It doesn't scale databases (yet). The Par Lab counts MapReduce as one of its "dwarfs", a catalog of solved parallel patterns. It's not the key one, as some in systems seem to think.
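
To be concrete about what "log searching" means here: a Hadoop Streaming-style mapper and reducer that count error lines per service. The log layout (service name first, then level) is an assumption for illustration:

```python
import sys

def mapper():
    """Emit (service, 1) for each ERROR line on stdin."""
    for line in sys.stdin:
        fields = line.split()
        if len(fields) >= 2 and fields[1] == "ERROR":
            print(f"{fields[0]}\t1")

def reducer():
    """Sum the counts for each service key."""
    counts = {}
    for line in sys.stdin:
        service, n = line.rsplit("\t", 1)
        counts[service] = counts.get(service, 0) + int(n)
    for service, total in sorted(counts.items()):
        print(f"{service}\t{total}")
```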

So the key, in my mind, is to expand the set of problems MapReduce solves. My understanding is that MSR is doing this, and Yahoo as well. Turning it into a non-consistent database is interesting; so is generalizing the model to encompass other parallel problems. However, the original MapReduce paper never would have been published without Google backing it. It took the experience of its deployment to get attention. How many great parallel architectures sit in dusty conference proceedings? How do we change this research area so that the simplicity of the MapReduce model is seen as an advantage?

Lastly, I think you should have people read the MapReduce paper and Stonebraker's nagging critique of it.

2 comments:

Ari Rabkin said...

An awful lot of practical problems look like MapReduce -- it's basically a generalization of indexing.

I think we don't know all the [non-computer systems] research that uses MapRed as a computing platform. My sense is that there's a fair bit out there, and they just don't emphasize it.
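
(To make Ari's indexing point concrete: the classic inverted index is a couple of lines of map/reduce. A sketch with plain generators, not the actual Hadoop API:)

```python
def map_index(doc_id, text):
    """Emit (word, doc_id) for every word in the document."""
    for word in text.lower().split():
        yield (word, doc_id)

def reduce_index(word, doc_ids):
    """Collect the deduplicated posting list for each word."""
    yield (word, sorted(set(doc_ids)))
```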

Matei Zaharia said...

In terms of MapReduce-like programming models, DryadLINQ is pretty amazing. I think you saw the talk at Intel Research. That kind of integration with tools would really help with ease of development.

For applications, the other word I want to throw out there is "data warehousing". (Okay, it's two words.) This is basically what Facebook uses MapReduce for - you have a lot of data, you put it together and then you can ask interesting questions about it. This is obviously hard to do as a researcher. I think this is the use case beyond indexing that will get more industry interest. The point of this workload is not necessarily that the individual computations you run are complex (they are mostly SQL-like), but that you just have lots of them because you have all this data in one place, so performance of the system matters.
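
(Putting Matei's "SQL-like" point in map/reduce terms: a warehousing query like SELECT country, AVG(session_seconds) FROM sessions GROUP BY country is a single map and reduce pass. The record layout here is assumed for illustration:)

```python
def map_sessions(record):
    """record: e.g. {"user": ..., "country": ..., "session_seconds": ...}"""
    yield (record["country"], record["session_seconds"])

def reduce_sessions(country, durations):
    """The AVG aggregate: one output row per group."""
    durations = list(durations)
    yield (country, sum(durations) / len(durations))
```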