My research interests are in the field of data management, including data-intensive computing, databases, text processing, and large-scale analytics and visualization. My PhD thesis at Stanford was on data integration, with an emphasis on both theoretical and practical aspects. My recent research, especially after spending a few quarters at Google and a few years running a startup as its founder and CTO, has a strong emphasis on engineering and open source system building. I believe “Computer Science” is a “Science” to support great engineering, and we need to build systems to stay relevant in this fast-paced IT era. My recent research projects are closely related to social media data analytics due to its increasing importance in many disciplines.
- Apache AsterixDB: A scalable, open source Big Data Management System (BDMS).
- Cloudberry: Supports interactive analytics and visualization on big data sets (e.g., sub-second queries on billions of records).
- Texera: Supports cloud-based declarative text analytics by allowing users to formulate workflows using a Web service.
- FLAMINGO: A project on data cleaning and string similarity queries.
- IPUBMED: Efficient instant search on large amounts of data. It started as a joint research project with Tsinghua University on efficient auto-complete and type-ahead search on large data sets.
- Family Reunification: Helps people find their loved ones during or after a disaster.
- The Raccoon Project on Data Integration and Sharing: I started this project several years ago, and it is in its final stage. I still have some ongoing research related to it, but it is less active than the first two projects.
- Data sets of the history of data objects collected from 6 web sites over 1.5 years.
Released Prototypes and Source Code Packages
- Flamingo Package: C++ package to do approximate string queries.
- Fuzzy keyword search on maps
- qSpell: Spelling Correction of Web Search Queries (won the 3rd Prize in Microsoft’s speller challenge in 2011)
- Lightweight In-Memory Implementation of R*-Tree (maintained by Sattam Alsubaiee).
- iPubMed: Instant fuzzy search on more than 20 million medical publications from MEDLINE.
- Instant fuzzy search for learning.
- Location-based instant fuzzy search.
- Location-based approximate keyword search.
- CHIME: Error-tolerant Chinese input method.
- PSearch: Instant fuzzy search on the UCI directory.
- Efficient Parallel Set-Similarity Joins Using MapReduce.
- Haiti family reunification: Instant fuzzy search on records about people affected by the Haiti earthquake.
- DNAzip: DNA sequence compression using a reference genome.
- Hobbes: genome sequence mapping.