Duke Research Blog

Following the people and events that make up the research community at Duke.


The Art of Asking Questions at DataFest 2016


Students engaged in intense collaboration during DataFest 2016, a stats and data analysis competition held from April 1-3 at Duke. Image courtesy of Rita Lo.

On Saturday night, while most students were fast asleep or out partying, Duke junior Callie Mao stayed up until the early hours of the morning pushing and pulling a real-world data set to see what she could make of it — for fun. Callie and her team had planned for months in advance to take part in DataFest 2016, a statistical analysis competition that occurred from April 1 to April 3.

A total of 277 students, hailing from schools as disparate as Duke, UNC Chapel Hill, NCSU, Meredith College, and even one high school, the North Carolina School of Science and Mathematics, gathered in the Edge to extract insight from a mystery data set. The camaraderie was palpable, as students animatedly sketched out their ideas on whiteboard walls and chatted while devouring mountains of free food.


Duke junior Callie Mao ponders which aspects of the data to include in her analysis.

Callie observed that the challenges the students faced at DataFest were unique: “The most difficult part of DataFest is coming up with an idea. In class, we get specific problems, but at DataFest, we are thrown a massive data set and must figure out what to do with it. We originally came up with a lot of ideas, but the data set just didn’t have enough information to fully visualize though.”

At the core, Callie and her team, instead of answering questions posed in class, had to come up with innovative and insightful questions to pose themselves. With virtually no guidance, the team chose which aspects of the data to include and which to exclude.

Another principal consideration across all categories was which tools to use to represent the data quickly and clearly. Callie and her team used R to parse the relevant data, converted the data they wanted into JSON files, and used D3, a JavaScript library, to code graphics to visualize it. Other groups, however, used Tableau, a drag-and-drop interface that offered a faster way to create polished graphics.
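For readers curious what the "parse, then convert to JSON" step can look like, here is a minimal sketch. It is written in Python rather than the R the team actually used, and the column names (`category`, `visits`) and sample rows are invented for illustration; the real DataFest data and the team's code are not shown in this post.

```python
import csv
import io
import json
from collections import defaultdict


def rows_to_d3_json(csv_text, group_field, value_field):
    """Sum value_field per group_field and return a JSON array string."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row[group_field]] += float(row[value_field])
    # An array of {name, value} objects is a shape many D3 charts can bind to.
    records = [{"name": key, "value": totals[key]} for key in sorted(totals)]
    return json.dumps(records)


# Hypothetical sample data, not taken from the competition data set.
SAMPLE_CSV = """category,visits
shoes,3
books,5
shoes,2
"""

print(rows_to_d3_json(SAMPLE_CSV, "category", "visits"))
```

The resulting string could be saved to a `.json` file and loaded by a D3 script in the browser. Aggregating before export keeps the file small, which matters when the raw table has millions of rows.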


Mentors assisted participants with formulating insights and presenting their results. Image courtesy of Rita Lo.

On Sunday afternoon, students presented their findings to their attentive peers and to a panel of judges composed of industry professionals, statistics professors from various universities, and representatives from Data and Visualization Services at Duke Libraries. Judges commended projects for aspects such as the incorporation of other data sources, like Google AdWords; the comprehensibility of the data presentation; and the applicability of findings in a real industry setting.

Students competed in four categories: best use of outside data, best data insight, best visualization, and best recommendation. The Baeesians, pictured below, took first place in best use of outside data, the SuperANOVA team won best data insight, the Standard Normal team won best visualization, and the Sample Solution team won best recommendation. The winning presentations will be available to view by May 2 at http://www2.stat.duke.edu/datafest/.


The Baeasians, winner of the Best Outside Data category at DataFest 2016: Rahul Harikrishnan, Peter Shi, Qian Wang, Abhishek Upadhyaya. (Not pictured: Justin Wang.) Image courtesy of Rita Lo.

 

By student writer Olivia Zhu

Got Data? 200+ Crunch Numbers for Duke DataFest

Photos by Rita Lo; Writing by Robin Smith

While many students’ eyes were on the NCAA Tournament this weekend, a different kind of tournament was taking place at the Edge. Students from Duke and five other area schools set up camp amidst a jumble of laptops, power cords and whiteboards for DataFest, a 48-hour stats competition with real-world data. Now in its fourth year at Duke, the event has grown from roughly two dozen students to more than 220 participants.

Teams of two to five students had 48 hours to make sense of a single data set. The data was kept secret until the start of the competition Friday night. Consisting of visitor info from a popular comparison shopping site, it was spread across five tables and several million rows.

“The size and complexity of the data set took me by surprise,” said junior David Clancy.

For many, it was their first experience with real-world data. “In most courses, the problems are guided and it is very clear what you need to accomplish and how,” said Duke junior Tori Hall. “DataFest is much more like the real world, where you’re given data and have to find your own way to produce something meaningful.”

“I didn’t expect the challenge to be so open-ended,” said Duke junior Greg Poore. “The stakeholder literally ended their ‘pitch’ to the participants with the company’s goals and let us loose from there.”

As they began exploring the data, the Poke.R team discovered that 1 in 4 customers spend more than they planned. The team then set about finding ways of helping the company identify these “dream customers” ahead of time based on their demographics and web browsing behavior — findings that won them first place in the “best insight” category.

“On Saturday afternoon, after 24 hours of working, we found all the models we tried failed miserably,” said team member Hong Xu. “But we didn’t give up and brainstormed and discussed our problems with the VIP consultants. They gave us invaluable insights and suggestions.”

Consultants from businesses and area schools stayed on hand until midnight on both Friday and Saturday to answer questions. Finally, on Sunday afternoon the teams presented their ideas to the judges.

Seniors Matt Tyler and Justin Yu of the Type 3 Errors team combined the assigned data set with outside data on political preferences to find out if people from red or blue cities were more likely to buy eco-friendly products.

“I particularly enjoyed DataFest because it encouraged interdisciplinary collaboration, not only between members from fields such as statistics, math, and engineering, but also economics, sociology, and, in our case, political science,” Yu said.

The Bayes’ Anatomy team won the best visualization category by illustrating trends in customer preferences with a flow diagram and a network graph aimed at improving the company’s targeted advertising.

“I was just very happily surprised to win!” said team member and Duke junior Michael Lin.

Sign Up For Datafest 2014 to Work on Mystery Big Data



Heads up Duke undergrads and graduate students — here’s an opportunity to hang out in the beautifully renovated Gross Hall, get creative with your friends using big data and compete for cash prizes and statistics fame.

Datafest, a data analysis competition that started at UCLA, is in its third year in the Triangle. Every year, a mystery client provides a dataset that teams can analyze, tinker with and visualize however they’d like over the course of a weekend. Think hackathon, but for data junkies.

“The datasets are bigger and more complex than what you’ll see in a classroom, but they’re of general interest,” said organizer Mine Çetinkaya-Rundel, an assistant professor of the practice in the Duke statistics department. “We want to encourage students from all levels.”

Last year’s mystery client was online dating website eHarmony (you can read about it here), and teams investigated everything from heightism to Myers-Briggs personality matches in online dating. In 2012, the dataset came from Kiva, the microlending site.

This year’s dataset provider will be revealed on the first day of Datafest. Sign up ends Monday, March 10, so assemble your team and register here!

 

Can’t Decide What Clubs to Join Outside of Class? There’s a Web App for That

With 400-plus student organizations to choose from, Duke has more co-curriculars than you could ever hope to take advantage of in one college career. Navigating the sheer number of options can be overwhelming. So how do you go about finding your niche on campus?

Now there’s a Web app for that: the Duke CoCurricular Eadvisor. With just a few clicks it comes up with a personalized ranked list of student clubs and programs based on your interests and past participation compared to others.

“We want it to be like the activity fair, but online,” said Duke computer science major Dezmanique Martin, who was part of a team of Duke undergrads in the Data+ summer research program who developed the “recommendation engine.”

“The goal is to make a web app that recommends activities like Netflix recommends movies,” said team member Alec Ashforth.

The project is still in the testing stage, but you can try it out for yourself, or add your student organization to the database, at https://eadvisorduke.shinyapps.io/login/

A “co-curricular” can be just about any learning experience that takes place outside of class and doesn’t count for credit, be it a student magazine, Science Olympiad or community service. Research shows that students who get involved on campus are more likely to graduate and thrive in the workplace post-graduation.

For the pilot version, the team compiled a list of more than 150 student programs related to technology. Each program was tagged with certain attributes.

Students start by entering a Net ID, major, and expected graduation date. Then they enter all the programs they have participated in at Duke so far, submit their profile, and hit “recommend.”

The e-advisor algorithm generates a ranked list of activities recommended just for the user.

The e-advisor might recognize that a student who did DataFest and HackDuke in their first two years likes computer science, research, technology and competitions. Based on that, the Duke Robotics Club might be highly recommended, while the Refugee Health Initiative would be ranked lower.
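To make the idea concrete, here is a minimal sketch of tag-based ranking. It is not the e-advisor's actual algorithm, which the post does not describe in detail; the tag sets below are hypothetical, and the scoring rule (count how often each of a program's tags appears in the student's history) is just one simple choice.

```python
from collections import Counter

# Hypothetical tags; the real app's attribute list is not given in the post.
PROGRAM_TAGS = {
    "DataFest": {"computer science", "research", "competition"},
    "HackDuke": {"computer science", "technology", "competition"},
    "Duke Robotics Club": {"computer science", "technology", "competition"},
    "Refugee Health Initiative": {"health", "community service"},
}


def recommend_by_content(history, program_tags):
    """Rank programs the student has not tried by tag overlap with their history."""
    # Build an interest profile: how many past programs carried each tag.
    profile = Counter(tag for name in history for tag in program_tags[name])
    scores = {
        name: sum(profile[tag] for tag in tags)
        for name, tags in program_tags.items()
        if name not in history
    }
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda name: (-scores[name], name))


print(recommend_by_content(["DataFest", "HackDuke"], PROGRAM_TAGS))
```

With these made-up tags, a student whose history is DataFest and HackDuke sees the Duke Robotics Club ranked above the Refugee Health Initiative, mirroring the example in the text.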

A new student can just indicate general interests by selecting a set of keywords from a drop-down menu. Whether it’s literature and humanities, creativity, competition, or research opportunities, the student and her advisor won’t have to puzzle over the options — the e-advisor does it for them.

The tool comes up with its recommendations using a combination of approaches. One, called content-based filtering, finds activities you might like based on what you’ve done in the past. The other, collaborative filtering, looks for other students with similar histories and tastes, and recommends activities they tried.
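The collaborative half can be sketched in a few lines as well. This is an illustration under stated assumptions, not the team's implementation: the student IDs and histories are invented, and Jaccard similarity (shared activities divided by all activities either student has done) is one common way to measure "similar histories."

```python
def jaccard(a, b):
    """Similarity of two activity sets: size of overlap over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_collaborative(target, histories):
    """Suggest activities tried by students whose histories resemble the target's."""
    mine = histories[target]
    scores = {}
    for other, theirs in histories.items():
        if other == target:
            continue
        weight = jaccard(mine, theirs)
        if weight == 0:
            continue  # ignore students with nothing in common
        for activity in theirs - mine:
            scores[activity] = scores.get(activity, 0.0) + weight
    return sorted(scores, key=lambda a: (-scores[a], a))


# Hypothetical students and histories, for illustration only.
HISTORIES = {
    "student_a": {"DataFest", "HackDuke"},
    "student_b": {"DataFest", "HackDuke", "Duke Robotics Club"},
    "student_c": {"Refugee Health Initiative", "Science Olympiad"},
}

print(recommend_collaborative("student_a", HISTORIES))
```

Here student_a is matched with student_b, who shares two activities, so the one activity student_b has tried that student_a has not is recommended. A combined recommender could blend these scores with content-based ones.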

This could be a useful tool for advisors, too, noted Vice Provost for Interdisciplinary Studies Edward Balleisen, while learning about the EAdvisor team at this year’s Data+ Poster Session.

“With sole reliance on the app, there could be a danger of some students sticking with well-trodden paths, at the expense of going outside their comfort zone or trying new things,” Balleisen said.

But thinking through app recommendations along with a knowledgeable advisor “might lead to more focused discussions, greater awareness about options, and better decision-making,” he said.

Led by statistics Ph.D. candidate Lindsay Berry, so far the team has collected data from more than 80 students. Moving forward they’d like to add more co-curriculars to the database, and incorporate more features, such as an upvote/downvote system.

“It will be important for the app to include inputs about whether students had positive, neutral, or negative experiences with extra-curricular activities,” Balleisen added.

The system also doesn’t take into account a student’s level of engagement. “If you put Duke machine learning, we don’t know if you’re president of the club, or just a member who goes to events once a year,” said team member Vincent Liu, a rising sophomore majoring in computer science and statistics.

Ultimately, the hope is to “make it a viable product so we can give it to freshmen who don’t really know what they want to do, or even sophomores or juniors who are looking for new things,” said Brooke Keene, a rising junior majoring in computer science and electrical and computer engineering.

Video by Paschalia Nsato and Julian Santos; writing by Robin Smith

Data+ is sponsored by Bass Connections, the Information Initiative at Duke, the Social Science Research Institute, the departments of Mathematics and Statistical Science and MEDx. This project team was also supported by the Duke Office of Information Technology.

Other Duke sponsors include DTECH, Duke Health, Sanford School of Public Policy, Nicholas School of the Environment, Development and Alumni Affairs, Energy Initiative, Franklin Humanities Institute, Duke Forge, Duke Clinical Research, Office of Information Technology and the Office of the Provost, as well as the departments of Electrical & Computer Engineering, Computer Science, Biomedical Engineering, Biostatistics & Bioinformatics and Biology.

Government funding comes from the National Science Foundation.

Outside funding comes from Lenovo, Power for All and SAS.

Community partnerships, data and interesting problems come from the Durham Police and Sheriff’s Department, Glenn Elementary PTA, and the City of Durham.

Data Geeks Go Head to Head

For North Carolina college students, “big data” is becoming a big deal. The proof: signups for DataFest, a 48-hour number-crunching competition held at Duke last weekend, set a record for the third time in a row this year.

DataFest 2017

More than 350 data geeks swarmed Bostock Library this weekend for a 48-hour number-crunching competition called DataFest. Photo by Loreanne Oh, Duke University.

Expected turnout was so high that event organizer and Duke statistics professor Mine Çetinkaya-Rundel was even required by state fire code to sign up for “crowd manager” safety training. Her certificate of completion is still proudly displayed on her Twitter feed.

Nearly 350 students from 10 schools across North Carolina, California and elsewhere flocked to Duke’s West Campus from Friday, March 31 to Sunday, April 2 to compete in the annual event.

Teams of two to five students worked around the clock over the weekend to make sense of a single real-world data set. “It’s an incredible opportunity to apply the modeling and computing skills we learn in class to actual business problems,” said Duke junior Angie Shen, who participated in DataFest for the second time this year.

The surprise dataset was revealed Friday night. Just taming it into a form that could be analyzed was a challenge. Containing millions of data points from an online booking site, it was too large to open in Excel. “It was bigger than anything I’ve worked with before,” said NC State statistics major Michael Burton.


The mystery data set was revealed Friday night in Gross Hall. Photo by Loreanne Oh.

Because of its size, even simple procedures took a long time to run. “The dataset was so large that we actually spent the first half of the competition fixing our crashed software and did not arrive at any concrete finding until late afternoon on Saturday,” said Duke junior Tianlin Duan.
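One common way around a file too big to open at once is to stream it, processing one row at a time and keeping only a running summary in memory. The sketch below shows the idea; the `destination` column and the sample rows are invented, since the actual data set is not described beyond coming from an online booking site, and this is not a claim about what any team did.

```python
import csv
import io
from collections import Counter


def count_by_column(lines, column):
    """Count values of one column while reading one row at a time."""
    counts = Counter()
    for row in csv.DictReader(lines):
        counts[row[column]] += 1
    return counts


# Hypothetical sample; in practice `lines` would be an open file object,
# e.g. open("bookings.csv", newline=""), where the file name is assumed.
SAMPLE = io.StringIO("user_id,destination\n1,NC\n2,CA\n3,NC\n")

print(count_by_column(SAMPLE, "destination"))
```

Because only the counter is held in memory, the same function can work on a file with millions of rows, though a full pass over the data can still take a while.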

The organizers of DataFest don’t specify research questions in advance. Participants are given free rein to analyze the data however they choose.

“We were overwhelmed with the possibilities. There was so much data and so little time,” said NCSU psychology major Chandani Kumar.

“While for the most part data analysis was decided by our teachers before now, this time we had to make all of the decisions ourselves,” said Kumar’s teammate Aleksey Fayuk, a statistics major at NCSU.

As a result, these budding data scientists don’t just write code. They form theories, find patterns, test hunches. Before the weekend is over they also visualize their findings, make recommendations and communicate them to stakeholders.

This year’s participants came from more than 10 schools, including Duke, UNC, NC State and North Carolina A&T. Students from UC Davis and UC Berkeley also made the trek. Photo by Loreanne Oh.

“The most memorable moment was when we finally got our model to start generating predictions,” said Duke neuroscience and computer science double major Luke Farrell. “It was really exciting to see all of our work come together a few hours before the presentations were due.”

Consultants were available throughout the weekend to help with any questions participants had. Recruiters from both start-ups and well-established companies were also on site for participants looking to network or share their resumes.

“Even as late as 11 p.m. on Saturday we were still able to find a professor from the Duke statistics department at the Edge to help us,” said Duke junior Yuqi Yun, whose team presented their results in a winning interactive visualization. “The organizers treat the event not merely as a contest but more of a learning experience for everyone.”

Caffeine was critical. “By 3 a.m. on Sunday morning, we ended initial analysis with what we had, hoped for the best, and went for a five-hour sleep in the library,” said NCSU’s Fayuk, whose team DataWolves went on to win best use of outside data.

By Sunday afternoon, every surface of The Edge in Bostock Library was littered with coffee cups, laptops, nacho crumbs, pizza boxes and candy wrappers. Whiteboards were covered in scribbles from late-night brainstorming sessions.

“My team encouraged everyone to contribute ideas. I loved how everyone was treated as a valuable team member,” said Duke computer science and political science major Pim Chuaylua. She decided to sign up when a friend asked if she wanted to join their team. “I was hesitant at first because I’m the only non-stats major in the team, but I encouraged myself to get out of my comfort zone,” Chuaylua said.

“I learned so much from everyone since we all have different expertise and skills that we contributed to the discussion,” said Shen, whose teammates were majors in statistics, computer science and engineering. Students majoring in math, economics and biology were also well represented.

At the end, each team was allowed four minutes and at most three slides to present their findings to a panel of judges. Prizes were awarded in several categories, including “best insight,” “best visualization” and “best use of outside data.”

Duke is among more than 30 schools hosting similar events this year, coordinated by the American Statistical Association (ASA). The winning presentations and mystery data source will be posted on the DataFest website in May after all events are over.

The registration deadline for the next Duke DataFest will be in March 2018.


Bleary-eyed contestants pose for a group photo at Duke DataFest 2017. Photo by Loreanne Oh.


Post by Robin Smith
