Research

Chen Li

My research interests are in data management, data science, and data analytics related to machine learning, including databases, data-intensive computing, search, and large-scale analytics and visualization. My PhD thesis at Stanford was on data integration, with an emphasis on both theoretical and practical aspects. My recent research, especially after spending a few quarters at Google and a few years running a startup as its founder and CTO, places a strong emphasis on engineering and open-source system building. I believe “Computer Science” is a “Science” to support great engineering, and we need to build systems to stay relevant in this fast-paced IT era. I was a co-PI of the Apache AsterixDB project. Since 2016, our team has been developing the Texera open-source system to support cloud-based collaborative data science, AI, and ML using workflows.

Current Projects

Texera: An open-source system to support cloud-based collaborative data science, AI, and ML using workflows.

Current Funding

NIH NIDDK Award “dkNET Coordinating Unit: Harnessing the Power of AI and Data Science for Collaborative Discovery and Sharing in the DK Community”
NSF Award 2107150: III: Medium: Collaborative Research: Collaborative Machine-Learning-Centric Data Analytics at Scale
NSF Award 2200274: PIPP Phase I: An End-to-End Pandemic Early Warning System by Harnessing Open-source Intelligence
An ICS Research award.
A research grant from the Orange County Health Care Agency (OCHCA)

Selected Past Projects

Flamingo: A project on data cleaning and string similarity queries.
iPubMed: Efficient instant search on large amounts of data.
Apache AsterixDB: An open-source parallel database system for Big Data.
Cloudberry: A middleware system for Big Data visualization.

Released Prototypes and Source Code Packages

Flamingo Package: A C++ package for approximate string queries.
ASTERIX.
Fuzzy keyword search on maps
qSpell: Spelling Correction of Web Search Queries (won the 3rd Prize in Microsoft’s speller challenge in 2011)
Lightweight In-Memory Implementation of R*-Tree (maintained by Sattam Alsubaiee).
iPubMed: Instant fuzzy search on more than 20 million medical publications from MEDLINE.
Instant fuzzy search for learning.
Location-based instant fuzzy search.
Location-based approximate keyword search.
CHIME: Error-tolerant Chinese input method.
PSearch: Instant fuzzy search on the UCI directory.
Efficient Parallel Set-Similarity Joins Using MapReduce.
Haiti family reunification: Instant fuzzy search on records about people affected by the Haiti earthquake.
DNAzip: DNA sequence compression using a reference genome.
Hobbes: Genome sequence mapping.
