Ellie Burton’s summer job might be described as “dental detective.”
Using 3-D images of bones, she and teammates Kevin Kuo and GiSeok Choi are teaching a computer to calculate similarities between the fine bumps, grooves and ridges on teeth from dozens of lemurs, chimps and other animals.
They were among more than 50 students — majoring in everything from political science to engineering — who gathered on the third floor of Gross Hall this week for a lunch to share status updates on some unusual summer jobs.
The budding data scientists included 40 students selected for a summer research program at Duke called Data+. For ten weeks from mid-May to late July, students work in small teams on projects using real-world data.
Another group of students is working as high-tech weather forecasters.
Using a method called “topological data analysis,” Joy Patel and Hans Riess are trying to predict the trajectory and intensity of tropical cyclones based on data from Hurricane Isabel, a deadly hurricane that struck the eastern U.S. in 2003.
The student teams are finding that extracting useful information from noisy and complex data is no simple feat.
Some of the datasets are so large and sprawling that just loading them onto their computers is a challenge.
“Each of our hurricane datasets is a whopping five gigabytes,” said Patel, pointing to an ominous cloud of points representing variables such as wind speed and pressure.
They encounter other challenges along the way, such as how to deal with missing data.
Andy Cooper, Haoyang Gu and Yijun Li are analyzing data from Duke’s massive open online courses (MOOCs), not-for-credit courses available for free on the Internet.
Duke has offered dozens of MOOCs since launching the online education initiative in 2012. But when the students started sifting through the data, there was just one problem: “A lot of people drop out,” Li said. “They log on and never do anything again.”
Some of the datasets also contain sensitive information, such as salaries or student grades. These require the students to apply special privacy or security measures to their code, or to use a special data repository called the SSRI Protected Research Data Network (PRDN).
Lucy Lu and Luke Raskopf are working on a project to gauge the success of job development programs in North Carolina.
One of the things they want to know is whether counties that receive financial incentives to help businesses relocate or expand in their area experience bigger wage boosts than those that don’t.
To find out, they’re analyzing data on more than 450 grants awarded between 2002 and 2012 to hundreds of companies, from Time Warner Cable to Ann’s House of Nuts.
Another group of students is analyzing people’s charitable giving behavior.
By looking at past giving history, YunChu Huang, Mike Gao and Army Tunjaicon are developing algorithms similar to those used by Netflix to help donors identify other nonprofits that might interest them (e.g., “If you care about Habitat for Humanity, you might also be interested in supporting Heifer International”).
One of the cool things about the experience is that if the students get stuck, they already know other students using the same programming language whom they can turn to for help, said Duke mathematician Paul Bendich, who coordinates the program.
…
The other students in the 2015 Data+ program are Sachet Bangia, Nicholas Branson, David Clancy, Arjun Devarajan, Christine Delp, Bridget Dou, Spenser Easterbrook, Manchen (Mercy) Fang, Sophie Guo, Tess Harper, Brandon Ho, Alex Hong, Christopher Hong, Ethan Levine, Yanmin (Mike) Ma, Sharrin Manor, Hannah McCracken, Tianyi Mu, Kang Ni, Jeffrey Perkins, Molly Rosenstein, Raghav Saboo, Kelsey Sumner, Annie Tang, Aharon Walker, Kehan Zhang and Wuming Zhang.
Data+ is sponsored by the Information Initiative at Duke, the Social Sciences Research Institute and Bass Connections. Additional funding was provided by the National Science Foundation via a grant to the departments of mathematics and statistical science.
Writing by Robin Smith; video by Christine Delp and Hannah McCracken