Bringing science to society
Learning from disparate data sources may become more manageable thanks
to the work of computer science's Vasant Honavar
The amount of data available to researchers and the general public is
mind-boggling.
Development of high-throughput data acquisition technologies, together
with advances in computing and communications, has resulted in an explosive
growth in the number, size and diversity of potentially useful information
sources.
Examples of such data repositories in the biological sciences include GenBank
(a database of genome sequences) and the Protein Data Bank (a database of
protein structures). NASA maintains large repositories of data gathered
from satellites, while the U.S. Census Bureau and the Environmental Protection
Agency maintain information that is accessible to the public.
In principle, scientists or interested laypersons should be able to use
such data to explore specific scientific questions. But in practice, our
ability to exploit disparate, autonomously maintained data sources is
hindered by the massive size of the data repositories and unavoidable
semantic differences among them.
"If you are a scientist, you don't want to spend months writing code,"
said Vasant Honavar, professor of computer science. "If you have
to spend months writing code in order to extract the data that you need
in the form that you want from existing data repositories before you can
analyze the data, it hinders your ability to use available data effectively
to explore scientific hypotheses. If you had the right tools, you could
potentially pose a question and get an answer in 30 seconds instead of
two years.
"And because of the large amount of data, ideally you would want
to do the analysis where the data and computational resources are available,
instead of retrieving huge amounts of data when all you are interested
in are results of analysis."
Honavar is currently conducting research to make that task a little easier.
He has received funding from several sources, including a four-year, $1
million grant from the National Institutes of Health to develop and use
computational tools for data-driven characterization of protein sequence-structure-function
relationships (in collaboration with Iowa State faculty members Robert
Jernigan and Drena Dobbs) and a three-year, $210,000 Information Technology
Research grant from the National Science Foundation (NSF) to develop some
of the necessary algorithms and software.
Honavar's research over the past several years has been supported by a
number of sources including the NSF, the Department of Defense, the Carver
Foundation, Pioneer Hi-Bred, IBM, John Deere, and Iowa State.
"Our research is aimed at overcoming some of the challenges in data-driven
scientific discovery through the design, analysis and implementation of
algorithms and software for knowledge acquisition from heterogeneous distributed
data," Honavar said. "The challenge is to extract, integrate,
and learn from semantically heterogeneous data."
But it's not just researchers and scientists that Honavar hopes to be
able to assist with the new algorithms and software.
"A longer term goal would be for a layperson, such as a high school
student or a journalist, to examine if certain findings are supported
by data," he said. "We want to develop the software infrastructure
that can engage all interested individuals in discovery.
"This type of technology can make scientific data and analysis tools
available not only to specialists but to anyone who is interested,"
he said.
Honavar and his research group are planning to customize information extraction
agents that can effectively exploit domain or context-specific ontologies
supplied by the users to extract the information needed for learning from
distributed data sources. They hope to accomplish this regardless of differences
in query capabilities, interfaces, and ontologies, and under privacy constraints.
"We think you can do this in restricted access settings, such as
hospital records," Honavar said. "There may be data available
in hospital records that researchers could use to analyze any number of
diseases, but they can't get to it because of privacy issues."
Honavar would like to develop privacy-preserving data-mining algorithms
for applications in such areas.
Honavar is working with a team of nine graduate students and two undergraduate
students and collaborators from several other disciplines on these projects.
His team is working with Dobbs and Jernigan on developing a test bed for knowledge
acquisition from heterogeneous distributed data in computational molecular
biology, aimed at the discovery of protein sequence-structure-function relationships.
Honavar's group collaborates with computer science faculty members Johnny
Wong, Les Miller and Robyn Lutz on applications in security informatics.
He is also working with Iowa State faculty members Heather Greenlee and
Jan Buss on applications in gene expression analysis, and with James McCalley
on applications in power systems.
Honavar leads the Computational Intelligence, Learning, and Discovery
(CILD) Program, which aims to foster cross-disciplinary research on applications
of artificial intelligence, and machine learning in particular, in scientific
discovery.
He will discuss his team's work in an Institute of Science and Society
seminar on Tuesday, Feb. 24, in 302 Catt Hall at 12:10 p.m.
Around LAS
February 23 to March 7, 2004