The Ohio State University Office of International Affairs

RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems

Research Scholar

Rubao Li, Computer Science and Engineering (China)
Yongqiang He, Co-Researcher
Yin Huai, Co-Researcher
Shao Zheng, Co-Researcher
Namit Jain, Co-Researcher
Zhiwei Xu, Co-Researcher
Robert M. Critchfield, Faculty Mentor

Biography

Rubao Li has been a postdoctoral researcher in the Department of Computer Science and Engineering at Ohio State since September 2008. His research advisor is Professor Xiaodong Zhang. He is a member of the High Performance Computing and Software Laboratory (HPCS), and his research interests include database systems, distributed systems, and operating systems. Li received a Ph.D. in computer science from the Institute of Computing Technology, Chinese Academy of Sciences, in 2008. He received his master's degree in computer science from Beijing University of Technology in 2003. He has published more than 10 papers in major computer science conferences, workshops, and journals.

What is the issue or problem addressed in your research?

We have entered an era of big data, driven by the rapid development of online stores, search services, and social network services (e.g., Facebook). Big data is critically valuable. For example, one critical task at Facebook, the largest social network in the world with 700 million users, is to quickly understand trends in user behavior from large data sets that record heavy user activity. To manage big data, a data warehouse system stores the data and executes user queries on it. One critical component of a data warehouse system is the data placement structure, which controls how the data is stored in the warehouse.

Designing and implementing an efficient structure for big data analytics is a challenging research problem because the structure must satisfy four critical requirements: (1) fast data loading, (2) fast query processing, (3) highly efficient storage space utilization, and (4) strong adaptivity to highly dynamic workload patterns. Through theoretical analysis and extensive experiments, we found that existing data placement structures cannot meet all four goals.

What methodology did you use in your research?

We therefore designed and implemented RCFile, a novel data placement structure optimized for big data analytics. First, RCFile's data loading speed and workload adaptivity are comparable to those of the best existing structure. Second, RCFile is read-optimized: it avoids unnecessary operations during query execution, and it outperforms other structures in most cases. Third, RCFile uses column-wise compression and thus provides efficient storage space utilization.
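
As a rough illustration of the idea, and not RCFile's actual on-disk format, the sketch below splits a table into horizontal row groups and stores each group column by column, compressing every column independently. The row-group step follows the published RCFile design rather than anything stated on this page. The function names, the newline-joined text encoding, and the use of zlib are assumptions made for the example.

```python
import zlib


def to_row_groups(rows, group_size):
    """Split rows into row groups and store each group column by column.

    Hypothetical helper: returns a list of row groups, each a list holding
    one compressed byte string per column.
    """
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        # Transpose the chunk so that values of the same column sit together.
        columns = [list(col) for col in zip(*chunk)]
        # Compress each column independently (column-wise compression).
        compressed = [
            zlib.compress("\n".join(map(str, col)).encode("utf-8"))
            for col in columns
        ]
        groups.append(compressed)
    return groups


def read_column(groups, col_index):
    """Decompress only the requested column from every row group."""
    values = []
    for compressed in groups:
        text = zlib.decompress(compressed[col_index]).decode("utf-8")
        values.extend(text.split("\n"))
    return values


rows = [(1, "alice", "US"), (2, "bob", "CN"), (3, "carol", "US")]
groups = to_row_groups(rows, group_size=2)
names = read_column(groups, 1)  # touches only the second column
```

A query that needs one column decompresses only that column in each row group, which is one way to read "avoiding unnecessary operations during query execution." Keeping whole rows inside a single row group is what keeps loading close to a plain row-oriented write.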

What are the purpose/rationale and implications of your research?

RCFile has been widely used in large-scale production data warehouse systems deployed at Facebook, Yahoo!, and others. It will continue to play an important role as a data placement standard for big data analytics. Further information can be found on its Wikipedia page.