How Fast Can Big Data Go?

Computer Science
November 17, 2015

In the era of Big Data, the storage and processing demands of a computation can exceed the capabilities of any single computer. The primary infrastructure currently in use for such computations is called Hadoop MapReduce. MapReduce splits multi-terabyte data sets into independent chunks that are processed by “map” tasks on separate computers. The outputs are then sorted and combined by “reduce” tasks to produce the final results.
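To make the map, shuffle, and reduce stages concrete, here is a minimal single-machine Python sketch of the classic word-count example. The function names and the toy input are illustrative only; a real Hadoop job distributes the map and reduce tasks across many machines.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit (word, 1) pairs for one chunk of input text."""
    for word in chunk.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle/sort: group intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values recorded for one key."""
    return (key, sum(values))

if __name__ == "__main__":
    # Toy "chunks" standing in for independent splits of a large data set.
    chunks = ["the quick brown fox", "the lazy dog", "the fox jumps"]
    intermediate = [pair for chunk in chunks for pair in map_phase(chunk)]
    grouped = shuffle(intermediate)
    results = [reduce_phase(k, v) for k, v in sorted(grouped.items())]
    print(results)  # e.g. [('brown', 1), ('dog', 1), ('fox', 2), ...]
```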


For computations at this scale, we cannot simply “run the program” and measure how long it takes, so it is increasingly important to develop ways to understand the factors that govern and predict performance. Working with Dr. Tom Bressoud, Jessica built a set of models, using a combination of experimentation and analysis, to help predict the performance of parallel applications running on Hadoop MapReduce. For much of her work, Jessica used the 32-node Denison Beowulf cluster.
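As a hypothetical illustration of this experimentation-plus-analysis approach (not the actual models from the paper), one might fit a simple linear model T(w) = a·w + b to measured job runtimes, where w is the input size assigned to each node. The measurements below are invented for the sketch.

```python
def fit_linear(xs, ys):
    """Ordinary least-squares fit of y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

if __name__ == "__main__":
    # Hypothetical measurements: per-node input size (GB) vs. job runtime (s).
    per_node_gb = [1, 2, 4, 8, 16]
    runtime_s = [95, 160, 290, 560, 1090]
    a, b = fit_linear(per_node_gb, runtime_s)
    # Predict the runtime of a 512 GB job spread over a 32-node cluster (16 GB/node).
    print(f"T = {a:.1f} * w + {b:.1f}; predicted runtime: {a * 16 + b:.0f} s")
```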

Jessica's paper describing this work was accepted to the Midstates Conference for Undergraduate Research in Computer Science and Mathematics (MCURCSM 2015). She presented her work at the conference in November 2015, alongside students from Ohio State, Bowling Green, Duquesne, and The College of Wooster, among others. Jessica earned the Best Paper award, evaluated on the quality of both her paper and her presentation, and was awarded a $100 prize.

Jessica is continuing her work as part of a year-long Senior Research Project, examining additional performance factors, including fault tolerance and MapReduce computing in the cloud. After she graduates this year, Jessica will be able to apply her Big Data expertise in her new role at Amazon.com.
