Latest News

(10/20) Together with Prof. David Timberlake of Public Health, received a grant from TRDRP for social media analysis related to tobacco.
(10/20) Gave a keynote talk titled "Collaborative Interdisciplinary ML-Centric Data Analytics at Scale" at NDBC 2020.
(9/20) Check out our Amber video and Texera demo video at VLDB 2020.
(7/20) Paper titled "Beanstalk: A General Cost-Based Optimizer Framework for Incremental Data Processing" accepted by VLDB 2021.
(7/20) Became the Faculty Director of the ICS Master of Computer Science Program.
(6/20) Received an NSF RAPID grant with Profs. Gloria Mark and Suellen Hopfer on COVID-19 analysis using social media.
(6/20) Our paper titled "Demonstration of Interactive Runtime Debugging of Distributed Dataflows in Texera" has been accepted by VLDB 2020.
(4/20) Teaching CS122B ("Projects in Databases and Web Applications") and STATS ("Project in Data Science") this quarter.
(3/20) Check out the CoronavirusTwitterMap our team is developing to visualize coronavirus-related tweets.
(3/20) Our paper titled "Marviq: Quality-Aware Geospatial Visualization of Range-Selection Queries Using Materialization" has been accepted by ACM SIGMOD 2020.


My research interests are in the fields of data management and text analytics, including data-intensive computing, databases, text processing, search, and large-scale analytics and visualization. My PhD thesis at Stanford was on data integration, with an emphasis on both theoretical and practical aspects. My recent research, especially after spending a few quarters at Google and a few years running a startup as its founder and CTO, has a strong emphasis on engineering and open-source system building. I believe “Computer Science” is a “Science” to support great engineering, and we need to build systems to stay relevant in this fast-paced IT era. My recent research projects are closely related to social media data analytics because of its increasing importance in many disciplines.

Current Projects

  • Apache AsterixDB: An open source parallel database system for Big Data.
  • Cloudberry: A middleware system for Big Data visualization.
  • Texera: Big data analytics using interactive workflows.

Past Projects

  • FLAMINGO: A project on data cleaning and string similarity queries.
  • IPUBMED: Efficient instant search on large amounts of data.
  • Family Reunification: Help people find their loved ones during or after a disaster.

Released Prototypes and Source Code Packages