Research

My research interests are in data management, data science, and data analytics related to machine learning, including databases, data-intensive computing, search, and large-scale analytics and visualization. My PhD thesis at Stanford was on data integration, with an emphasis on both theoretical and practical aspects. My recent research, especially after spending a few quarters at Google and a few years as the founder and CTO of a startup, places a strong emphasis on engineering and open source system building. I believe “Computer Science” is a “Science” that supports great engineering, and we need to build systems to stay relevant in this fast-paced IT era. My recent research projects are closely related to social media data analytics due to its increasing importance in many disciplines.

Current Projects

  • Texera: Collaborative data analytics using interactive workflows.

Current Funding

  • NSF Award 2107150: III: Medium: Collaborative Research: Collaborative Machine-Learning-Centric Data Analytics at Scale
  • NSF Award 2200274: PIPP Phase I: An End-to-End Pandemic Early Warning System by Harnessing Open-source Intelligence
  • An ICS Research award
  • A research grant from the Orange County Health Care Agency (OCHCA)

Selected Past Projects

  • Flamingo: A project on data cleaning and string similarity queries.
  • iPubMed: Efficient instant search on large amounts of data.
  • Apache AsterixDB: An open source parallel database system for Big Data.
  • Cloudberry: A middleware system for Big Data visualization.

Released Prototypes and Source Code Packages