My research interests are in the fields of data management and information retrieval, including data-intensive computing, databases, text processing, search, and large-scale analytics and visualization. My PhD thesis at Stanford was on data integration, with an emphasis on both theoretical and practical aspects. My recent research, especially after I spent a few quarters at Google and a few years running a startup as its founder and CTO, places a strong emphasis on engineering and open source system building. I believe “Computer Science” is a “Science” that supports great engineering, and we need to build systems to stay relevant in this fast-paced IT era. My recent research projects are closely related to social media data analytics due to its increasing importance in many disciplines.
- Apache AsterixDB: An open source parallel database system for Big Data.
- Cloudberry: Interactive analytics and visualization on Big Data.
- Texera: Cloud-based text analytics using declarative workflows for Natural Language Processing (NLP) and Machine Learning (ML).
The following figure illustrates one scenario where these projects are integrated to support the management of social media data. Combined with machine learning techniques, they can cover the full lifecycle of data analytics. Each system is independent and general purpose.
- FLAMINGO: A project on data cleaning and string similarity queries.
- IPUBMED: Efficient instant search on large amounts of data. It started as a joint research project with Tsinghua University on efficient auto-complete and type-ahead search on large data sets.
- Family Reunification: Helping people find their loved ones during or after a disaster.
- The Raccoon Project on Data Integration and Sharing. I started this project several years ago, and it’s in its final stage. I still have some ongoing research related to it, but compared to the first two projects, this one is less active.
- Data sets recording the history of data objects collected from 6 web sites over 1.5 years.
Released Prototypes and Source Code Packages
- Flamingo Package: C++ package for approximate string queries.
- Fuzzy keyword search on maps
- qSpell: Spelling Correction of Web Search Queries (won the 3rd Prize in Microsoft’s speller challenge in 2011)
- Lightweight In-Memory Implementation of R*-Tree (maintained by Sattam Alsubaiee).
- iPubMed: Instant fuzzy search on more than 20 million medical publications from MEDLINE.
- Instant fuzzy search for learning.
- Location-based instant fuzzy search.
- Location-based approximate keyword search.
- CHIME: Error-tolerant Chinese input method.
- PSearch: Instant fuzzy search on the UCI directory.
- Efficient Parallel Set-Similarity Joins Using MapReduce.
- Haiti family reunification: Instant fuzzy search on records about people affected by the Haiti earthquake.
- DNAzip: DNA sequence compression using a reference genome.
- Hobbes: Genome sequence mapping.