Data-driven discovery

Liberal Arts and Sciences thrives in data-rich environments

The old adage “you can be anything you want to be” isn’t exactly true.

Well, at least not for honey bees.

That’s because their genes won’t allow it. After a certain age, each bee is either a worker or a queen, and there is no room for advancement. Amy Toth has spent 14 years studying how this affects their social behavior. Honey bees are extremely social – they are helpful, communicative and, in most cases, content to be workers their entire lives.

Toth explores why honey bees are so community-minded by researching their genetic makeup.

“I study how they became social,” said Toth, an assistant professor of ecology, evolution and organismal biology at Iowa State University. “But studying the honey bee alone doesn’t fully answer that question.”

To understand how the honey bee’s unusual sociality evolved, she had to compare it with another species, and she chose the paper wasp – an aggressive, independent insect. The year was 2006. Scientists had just assembled the honey bee’s genome sequence, and a big advancement in big data was about to have a big effect on Toth’s research.

“Next-Generation Sequencing” was a fundamentally different approach to sequencing. It allowed researchers to sequence genes thousands of times faster than before, providing mountains of information about DNA and even RNA. It had taken Toth five months to clone and sequence a single, particularly troublesome paper wasp gene. With Next-Generation Sequencing, she obtained partial sequences for 3,000 genes in 24 hours.

“Literally, it changed my whole game plan overnight,” she said. “You could get information on thousands of genes from any species you wanted. The winds of change were blowing.”

Predicting outcomes

It’s not how much data we collect that has everyone abuzz. It’s how much we can do with it. The quadrillions of bytes of data supercomputers can handle are mind-boggling, but the interdisciplinary strengths of the College of Liberal Arts and Sciences iron out the wrinkles between data collection and its use.

“In the last 20 to 30 years data has become much easier to collect and store,” said Arne Hallam, LAS associate dean and professor of economics. “Now, we can develop models that predict what will happen in the real world.”

Our bioinformatics, computational, math and statistics programs have helped ease the challenges that large data sets can pose. “When you go from having one piece of data to a billion sequences, it is beyond valuable to have such strong departments to collaborate with,” Toth said.

Much as Amazon uses statistical techniques to create models that predict a shopper’s buying habits, researchers at Iowa State University create models from their large data sets to predict outcomes. These models analyze everything from genomics to astronomy to linguistics. It’s data-driven discovery, guided by improved statistical and computational methods, and it allows researchers to skip much of the preliminary trial-and-error, since analytics weed out impossible answers first.

“With sophisticated computational statistics and efficient implementation, our graduate students can manage huge clusters of data on their cell phones,” Kris De Brabanter, assistant professor of statistics, said.

De Brabanter teaches “machine learning,” a set of tools for modeling and understanding big data sets. He said that in addition to being “big,” these data sets are complex and can be difficult to understand. Enrollment is full in his machine learning course (a high-level course for advanced undergraduates or graduate students), a sign that the challenges big data brings to science, marketing, finance and business put graduates with machine learning skills in demand.

Revolutionary research

Iowa State’s STEM departments are leaders in the field of large data set analysis, and our humanities, communications and social sciences programs have all made moves to hire faculty with proficiency in writing code and analyzing information.

Science and research in LAS are revolutionary because they are data-driven. The tools we develop to analyze data connect disciplines across campus while introducing departments to the complexities of large data sets. The beauty of big data analytics is that it doesn’t always give you answers – many times, it sparks more questions before helping to provide answers.

Martin Spalding is an LAS associate dean and a professor of genetics, development and cell biology. His research focuses on understanding photosynthetic metabolism, with the aim of improving the productivity of algae as a possible source of renewable biofuels. Sequencing multiple genomes requires comparing billions of genome sequences to identify the genes responsible for certain characteristics.

“We need to understand relationships amongst the genes and the data because the amount of data is too large to look at manually and make any sense out of it,” Spalding said.

A bridge across disciplines

David Oakey is an assistant professor of English who researches applied linguistics – specifically “corpus linguistics,” the study of large, systematically organized collections of texts. His office is lined with books, encyclopedias and every volume of The Oxford English Dictionary dating back to the original 1933 edition. He even keeps a small paper diary to jot down his appointments.

Amidst the nostalgia of pulp and ink, Oakey sits in front of two computer screens that display detailed analytics of the opening few paragraphs of Lewis Carroll’s “Alice’s Adventures in Wonderland.” He scrolls through numerous examples of how a computer can display the written text: digitally encoded, in word frequency lists and charts, clustered in word clouds, in concept maps. Each tells a different story. Each shows a different pattern.

“Looking at English on a computer in this way shows you things which are impossible to discover if you only rely on your own personal knowledge of the language,” he said. “In this way, you are able to answer research questions which no one would have ever thought about asking without computers.”

Oakey studies “lexico-grammar,” a theoretical account of how English form and function are connected. He uses corpora of academic writing to study how new word meanings emerge in interdisciplinary research collaboration.

The assumption that science-related data and human-related communication do not correlate reflects a 20th-century epistemological view, Oakey says. For the millennial generation of “digital natives” who grew up with digital social media, electronic language data is not merely a record of communication that has been digitized. Instead, digital language data actually is the communication.

“If we look at data as a human phenomenon rather than an extrapolation of one, then we can say something about emerging patterns of communication and interaction that were not there before people started doing the majority of their communication through digital devices,” he said.

Language is not just about the meaning of words on their own; it’s about their potential to mean something in a particular situation. “In the mid-20th century, linguistics was about studying an individual’s mental grammar rules, irrespective of use,” Oakey said. “Corpus linguistic research was criticized as studying ‘usage’ rather than ‘language,’ but the current big data approach has made linguists more aware of the shared contribution of grammar and vocabulary towards making meaning in different situations.”

A quantitative campus

The algorithms that our statistics and computer science departments are constantly improving have allowed research across campus to become more quantitative. From political science to chemistry, the exploration of large data sets has created a culture of curiosity and collaboration on campus.

“There are many researchers on campus who are in the same boat – all looking to make sense of their data and share ideas,” Toth said. “Besides figuring out what to do with our data, learning how to interpret it opens doors for some fun collaboration.”

No longer a dry topic, data is a hip, motivating, attention-grabbing field. It drives our research. It is valuable for our students. It engages the public and provides tangible benefits that keep Iowa State University at the forefront of data-driven discovery.

We have the numbers to prove it.