Latest News

(4/21) Finished a new work called Snipe on progressive visualization of large networks.
(2/21) Co-chairing the VLDB 2021 Industrial Track.
(2/21) A joint paper with Informatics colleagues titled "Why Do People Oppose Mask Wearing? A Comprehensive Analysis of US Tweets During the COVID-19 Pandemic" accepted by JAMIA.
(1/21) Co-teaching Stats170A titled "Project in Data Science".
(10/20) Together with Prof. David Timberlake of Public Health, received a grant from TRDRP on social media analysis on tobacco.
(10/20) Gave a keynote talk titled "Collaborative Interdisciplinary ML-Centric Data Analytics at Scale" at NDBC 2020.
(9/20) Check out our Amber video and Texera demo video at VLDB 2020.
(7/20) Paper titled "Tempura: A General Cost-Based Optimizer Framework for Incremental Data Processing" accepted by VLDB 2021.
(7/20) Became the Faculty Director of the ICS Master of Computer Science Program.
(6/20) Received an NSF RAPID grant with Profs. Gloria Mark and Suellen Hopfer on COVID-19 analysis using social media.
(6/20) Our paper titled "Demonstration of Interactive Runtime Debugging of Distributed Dataflows in Texera" has been accepted by VLDB 2020.


My research interests are in the fields of data management and text analytics, including data-intensive computing, databases, text processing, search, and large-scale analytics and visualization. My PhD thesis at Stanford was on data integration, with an emphasis on both theoretical and practical aspects. My recent research, especially after spending a few quarters at Google and a few years running a startup as its founder and CTO, places a strong emphasis on engineering and open-source system building. I believe “Computer Science” is a “Science” to support great engineering, and we need to build systems to stay relevant in this fast-paced IT era. My recent research projects are closely related to social media data analytics because of its increasing importance in many disciplines.

Current Projects

  • Apache AsterixDB: An open source parallel database system for Big Data.
  • Cloudberry: A middleware system for Big Data visualization.
  • Texera: Collaborative data analytics using interactive workflows.

Past Projects

  • FLAMINGO: A project on data cleaning and string similarity queries.
  • IPUBMED: Efficient instant search on large amounts of data.
  • Family Reunification: Help people find their loved ones during or after a disaster.

Released Prototypes and Source Code Packages