My research interests are in data management and information retrieval, including data-intensive computing, databases, text processing, search, and large-scale analytics and visualization. My PhD thesis at Stanford was on data integration, covering both theoretical and practical aspects. My recent research, especially after a few quarters at Google and a few years running a startup as its founder and CTO, strongly emphasizes engineering and building open source systems. I believe “Computer Science” is a “Science” that supports great engineering, and that we need to build systems to stay relevant in this fast-paced IT era. My recent research projects are closely related to social media data analytics because of its increasing importance in many disciplines.

Current Projects

  • Apache AsterixDB: An open source parallel database system for Big Data.
  • Cloudberry: Big Data visualization.
  • Texera: Text analytics as a service using declarative workflows.

The following figure illustrates one scenario in which these projects are integrated to support the management of social media data. Combined with machine learning techniques, they can cover the full lifecycle of data analytics. Each system is independent and general purpose.

[Figure: Research overview]

Past Projects

Released Prototypes and Source Code Packages