A team of Auburn University faculty members from the colleges of Agriculture, Engineering, and Sciences and Mathematics placed first in the national Coleridge Initiative Food for Thought Data Challenge, held in association with the U.S. Department of Agriculture.
The team comprises Wenying Li, assistant professor of agricultural economics; Jingyi Zheng, assistant professor of mathematics and statistics; Shubhra Kanti Karmaker, assistant professor of computer science and software engineering; and his doctoral students, Naman Bansal and Alex Knipper.
The Coleridge Initiative Food for Thought Data Challenge asks data scientists to use natural language processing and machine learning to link food and nutrition databases on a large scale.
Auburn’s team, named “Auburn Big Data,” spent more than 1,000 hours developing a model predicting links between scanner data and the Food and Nutrient Database for Dietary Studies.
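As a rough illustration of what predicting such links can involve, the sketch below matches one product description against a few reference entries by counting shared words (Jaccard similarity). It is not the Auburn team's model, which the article does not describe; the function names, the product description and the reference entries are all made up for this example.

```python
import re


def tokens(text):
    """Lowercase a description and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Share of word tokens two descriptions have in common (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def best_link(product, reference_entries):
    """Return the reference entry most similar to the product, with its score."""
    scored = ((entry, jaccard(product, entry)) for entry in reference_entries)
    return max(scored, key=lambda pair: pair[1])


# Hypothetical data, invented for illustration only.
reference_entries = [
    "Milk, whole",
    "Yogurt, Greek, plain, nonfat",
    "Bread, whole wheat",
]
product = "BRAND X WHOLE WHEAT BREAD 20 OZ"

match, score = best_link(product, reference_entries)
print(match, round(score, 3))
```

Word overlap of this kind is only a baseline. A competitive system would need to handle abbreviations, brand names and package sizes, which is where natural language processing and machine learning come in.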
“In the field of food, health and nutrition economics, a dataset known as scanner data is widely used,” Li said. “The scanner data are derived from over 120,000 households who report what food products they purchased, when they shopped, and where they shopped. These households also report demographic and health information. The household purchase data can also be linked to product characteristics, such as the brand, which shows the type of products households are purchasing.”
Over the last 10 years, the USDA has been developing a larger data resource: the Purchase to Plate Crosswalk, which combines scanner data with the USDA Food and Nutrient Database for Dietary Studies. This crosswalk provides a comprehensive picture of the healthfulness of household purchases, according to Li, allowing agencies to assess USDA Food Plan costs and measure the quality of Americans’ diets.
The goal of this competition, Li said, is to provide the USDA with innovative ways to compile this crosswalk using natural language processing and machine learning. Zheng first involved Bansal and Knipper in the competition after teaching them in her data science classes. They worked on the competition as a course project and continued working on it after the class ended.
“Along the way of being data scientists, it is extremely important for students to work on real data and solve real-world problems,” Zheng said. “Beyond class and research, there are various problems one can encounter when handling real data.”
Zheng said one challenge the team encountered was the secure platform they were required to use when analyzing the data.
“The platform imposes large and restrictive constraints, which heavily influenced our process,” she said. “The students, especially Alex, spent lots of time and effort working on the platform and testing various algorithms, which contributes the most to our success.”
The team won first place and $10,000 in the first interim phase of the challenge in September. Their model generated the most links with the highest accuracy among the 12 teams of researchers from across the nation.
Then in November, the team took first place and $10,000 again in the competition’s second interim challenge.
In the third and final round, the team focused on hyperparameter tuning, an essential part of optimizing machine learning algorithms, to fine-tune both of its models and see whether further performance improvements were possible.
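Hyperparameter tuning means trying different settings that are fixed before a model runs and keeping the combination that scores best on labeled examples. The sketch below shows the idea as a small grid search over two hypothetical settings, a character n-gram size and a similarity threshold, for a toy text matcher. The matcher, the grid values and the labeled pairs are assumptions made up for this example and are not the team's models or data.

```python
from itertools import product


def char_ngrams(text, n):
    """Set of overlapping character n-grams in a lowercased string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def similarity(a, b, n):
    """Jaccard similarity between the character n-gram sets of two strings."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


def accuracy(threshold, ngram, labeled_pairs):
    """Fraction of pairs where 'similarity >= threshold' agrees with the label."""
    correct = 0
    for a, b, is_link in labeled_pairs:
        predicted = similarity(a, b, ngram) >= threshold
        correct += predicted == is_link
    return correct / len(labeled_pairs)


def grid_search(grid, labeled_pairs):
    """Try every combination in the grid and keep the first best-scoring one."""
    best_params, best_score = None, -1.0
    for threshold, ngram in product(grid["threshold"], grid["ngram"]):
        score = accuracy(threshold, ngram, labeled_pairs)
        if score > best_score:
            best_params = {"threshold": threshold, "ngram": ngram}
            best_score = score
    return best_params, best_score


# Hypothetical validation pairs: (product text, reference text, is a true link).
labeled_pairs = [
    ("whole wheat bread", "bread, whole wheat", True),
    ("greek yogurt plain", "yogurt, greek, plain", True),
    ("whole milk", "bread, whole wheat", False),
    ("orange juice", "yogurt, greek, plain", False),
]
grid = {"threshold": [0.1, 0.3, 0.5], "ngram": [2, 3]}

best_params, best_score = grid_search(grid, labeled_pairs)
print(best_params, best_score)
```

In practice the settings are scored on data held out from training, so the chosen combination reflects performance on unseen examples rather than on the examples used to build the model.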
The results of the challenge’s final round were announced Wednesday, Dec. 14. The team earned $30,000 for its first-place finish.
Zheng and Karmaker said they hope their success motivates more students to participate in competitions that use real data and to solve real-world problems.
“We’re talking about real money,” Karmaker said. “We’re talking about human effort. We’re talking about people’s health. Our students are getting their hands dirty with real data. You cannot gain this type of experience in a simulated environment.”
Li said he hopes the accomplishment advances two major goals moving forward. First, he hopes to see greater collaboration among colleges like the three participating in this project.
“Every researcher specializes in their own field, but when we get together, we have so many ideas for combining our skill sets and conducting research,” he said. “I can’t believe how lucky I am to have such a great team with Santu and Jingyi, and I am continually impressed by the results Alex and Naman produce. It is amazing how the two Ph.D. students are always able to overcome any obstacle thrown their way.”
He also hopes to see fruitful uses of the data garnered from the challenge.
“It is not always meaningful to know what products people buy,” he said. “But there is meaning in knowing the healthfulness of a purchase. This is providing basically a new research field with this data, and we’re at the front of it.”