Duke Research Blog

Following the people and events that make up the research community at Duke.

Category: Data (Page 1 of 5)

What Happens When Data Scientists Crunch Through Three Centuries of Robinson Crusoe?

Reading 1,400-plus editions of “Robinson Crusoe” in one summer is impossible. So one team of students tried to train computers to do it for them.

Since Daniel Defoe’s shipwreck tale “Robinson Crusoe” was first published nearly 300 years ago, thousands of editions and spinoff versions have been published, in hundreds of languages.

A research team led by Grant Glass, a Ph.D. student in English and comparative literature at the University of North Carolina at Chapel Hill, wanted to know how the story changed as it went through various editions, imitations and translations, and to see which parts stood the test of time.

Reading through them all at a pace of one a day would take years. Instead, the researchers are training computers to do it for them.

This summer, Glass’ team in the Data+ summer research program used computer algorithms and machine learning techniques to sift through 1,482 full-text versions of Robinson Crusoe, compiled from online archives.

“A lot of times we think of a book as set in stone,” Glass said. “But a project like this shows you it’s messy. There’s a lot of variance to it.”

“When you pick up a book it’s important to know what copy it is, because that can affect the way you think about the story,” Glass said.

Just getting the texts into a form that a computer could process proved half the battle, said undergraduate team member Orgil Batzaya, a Duke double major in math and computer science.

The books were already scanned and posted online, so the students used software to download the scans from the internet, via a process called “scraping.” But processing the scanned pages of old printed books, some of which had smudges, specks or worn type, and converting them to a machine-readable format proved trickier than they thought.

The software struggled to decode the strange spellings (“deliver’d,” “wish’d,” “perswasions,” “shore” versus “shoar”), different typefaces between editions, and other quirks.

Special characters unique to 18th century fonts, such as the curious f-shaped version of the letter “s,” make even humans read “diftance” and “poffible” with a mental lisp.
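To give a flavor of the cleanup involved, here is a minimal, hypothetical normalization pass. The rules and spelling map are illustrative stand-ins, not the team's actual pipeline, which would need far more corpus-specific handling:

```python
import re

# Illustrative spelling map; a real pipeline would need a much larger,
# corpus-specific dictionary of 18th-century variants.
SPELLING = {"shoar": "shore", "perswasions": "persuasions"}

def normalize(text: str) -> str:
    """Normalize a few 18th-century printing quirks."""
    text = text.replace("\u017f", "s")          # long s (ſ) -> s
    text = re.sub(r"(\w+)'d\b", r"\1ed", text)  # deliver'd -> delivered
    return " ".join(SPELLING.get(w, w) for w in text.lower().split())

print(normalize("deliver'd on the ſhore"))  # delivered on the shore
```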

Their first attempts came up with gobbledygook. “The resulting optical character recognition was completely unusable,” said team member and Duke senior Gabriel Guedes.

At a Data+ poster session in August, Guedes, Batzaya and history and computer science double major Lucian Li presented their initial results: a collection of colorful scatter plots, maps, flowcharts and line graphs.

Guedes pointed to clusters of dots on a network graph. “Here, the red editions are American, the blue editions are from the U.K.,” Guedes said. “The network graph recognizes the similarity between all these editions and clumps them together.”

Once they turned the scanned pages into machine-readable texts, the team fed them into a machine learning algorithm that measures the similarity between documents.

The algorithm takes in chunks of texts — sentences, paragraphs, even entire novels — and converts them to high-dimensional vectors.

Creating this numeric representation of each book, Guedes said, made it possible to perform mathematical operations on them. The team averaged the vectors for all the editions and looked to see which edition was closest to that “average” edition. It turned out to be a version of Robinson Crusoe published in Glasgow in 1875.
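As a sketch of the idea — not the team's actual method, which used a learned high-dimensional embedding — a simple bag-of-words vector is enough to make the "closest to the average" computation concrete:

```python
import math
from collections import Counter

def vectorize(text: str, vocab: list[str]) -> list[float]:
    """Bag-of-words stand-in for the learned high-dimensional embedding."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in vocab]

def closest_to_mean(editions: dict[str, str]) -> str:
    """Average all edition vectors, return the edition nearest the mean."""
    vocab = sorted({w for t in editions.values() for w in t.lower().split()})
    vecs = {name: vectorize(t, vocab) for name, t in editions.items()}
    mean = [sum(v[i] for v in vecs.values()) / len(vecs)
            for i in range(len(vocab))]
    return min(vecs, key=lambda name: math.dist(vecs[name], mean))
```

Run over the real corpus of 1,482 vectorized editions, this kind of calculation is what singled out the 1875 Glasgow printing.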

They also analyzed the importance of specific plot points in determining a given edition’s closeness to the “average” edition: what about the moment when Crusoe spots a footprint in the sand and realizes that he’s not alone? Or the time when Crusoe and Friday, after leaving the island, battle hungry wolves in the Pyrenees?

The team’s results might be jarring to those unaccustomed to seeing 300 years of publishing reduced to a bar chart. But by using computers to compare thousands of books at a time, “digital humanities” scholars say it’s possible to trace large-scale patterns and trends that humans poring over individual books can’t.

“This is really something only a computer can do,” Guedes said, pointing to a time-lapse map showing how the Crusoe story spread across the globe, built from data on the place and date of publication for 15,000 editions.

“It’s a form of ‘distant reading’,” Guedes said. “You use this massive amount of information to help draw conclusions about publication history, the movement of ideas, and knowledge in general across time.”

This project was organized in collaboration with Charlotte Sussman (English) and Astrid Giugni (English, ISS). Check out the team’s results at https://orgilbatzaya.github.io/pirating-texts-site/

Data+ is sponsored by Bass Connections, the Information Initiative at Duke, the Social Science Research Institute, the departments of Mathematics and Statistical Science and MEDx. This project team was also supported by the Duke Office of Information Technology.

Other Duke sponsors include DTECH, Duke Health, Sanford School of Public Policy, Nicholas School of the Environment, Development and Alumni Affairs, Energy Initiative, Franklin Humanities Institute, Duke Forge, Duke Clinical Research, Office for Information Technology and the Office of the Provost, as well as the departments of Electrical & Computer Engineering, Computer Science, Biomedical Engineering, Biostatistics & Bioinformatics and Biology.

Government funding comes from the National Science Foundation.

Outside funding comes from Lenovo, Power for All and SAS.

Community partnerships, data and interesting problems come from the Durham Police and Sheriff’s Department, Glenn Elementary PTA, and the City of Durham.

Videos by Paschalia Nsato and Julian Santos; writing by Robin Smith

Can’t Decide What Clubs to Join Outside of Class? There’s a Web App for That

With 400-plus student organizations to choose from, Duke has more co-curriculars than you could ever hope to take advantage of in one college career. Navigating the sheer number of options can be overwhelming. So how do you go about finding your niche on campus?

Now there’s a Web app for that: the Duke CoCurricular Eadvisor. With just a few clicks it comes up with a personalized ranked list of student clubs and programs based on your interests and past participation compared to others.

“We want it to be like the activity fair, but online,” said Duke computer science major Dezmanique Martin, part of a team of Duke undergraduates in the Data+ summer research program that developed the “recommendation engine.”

“The goal is to make a web app that recommends activities like Netflix recommends movies,” said team member Alec Ashforth.

The project is still in the testing stage, but you can try it out for yourself, or add your student organization to the database, at https://eadvisorduke.shinyapps.io/login/

A “co-curricular” can be just about any learning experience that takes place outside of class and doesn’t count for credit, be it a student magazine, Science Olympiad or community service. Research shows that students who get involved on campus are more likely to graduate and thrive in the workplace post-graduation.

For the pilot version, the team compiled a list of more than 150 student programs related to technology. Each program was tagged with certain attributes.

Students start by entering a Net ID, major, and expected graduation date. Then they enter all the programs they have participated in at Duke so far, submit their profile, and hit “recommend.”

The e-advisor algorithm generates a ranked list of activities recommended just for the user.

The e-advisor might recognize that a student who did DataFest and HackDuke in their first two years likes computer science, research, technology and competitions. Based on that, the Duke Robotics Club might be highly recommended, while the Refugee Health Initiative would be ranked lower.

A new student can just indicate general interests by selecting a set of keywords from a drop-down menu. Whether it’s literature and humanities, creativity, competition, or research opportunities, the student and her advisor won’t have to puzzle over the options — the e-advisor does it for them.

The tool comes up with its recommendations using a combination of approaches. One, called content-based filtering, finds activities you might like based on what you’ve done in the past. The other, collaborative filtering, looks for other students with similar histories and tastes, and recommends activities they tried.
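A toy sketch of those two strategies makes the distinction concrete. The club names, tags and student histories below are invented for illustration, not drawn from the team's database:

```python
# Content-based filtering: score unseen clubs by how much their tags
# overlap with the tags of clubs the student has already joined.
def content_based(past: set[str], clubs: dict[str, set[str]]) -> list[str]:
    liked_tags = set().union(*(clubs[c] for c in past))
    candidates = [c for c in clubs if c not in past]
    return sorted(candidates, key=lambda c: -len(clubs[c] & liked_tags))

# Collaborative filtering: find the student with the most similar
# history and recommend the activities they tried that you haven't.
def collaborative(mine: set[str], others: dict[str, set[str]]) -> set[str]:
    peer = max(others, key=lambda s: len(others[s] & mine))
    return others[peer] - mine
```

A real recommender would blend both scores and weight by recency and popularity, but the core ideas are just these two lookups.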

This could be a useful tool for advisors, too, noted Vice Provost for Interdisciplinary Studies Edward Balleisen, while learning about the EAdvisor team at this year’s Data+ Poster Session.

“With sole reliance on the app, there could be a danger of some students sticking with well-trodden paths, at the expense of going outside their comfort zone or trying new things,” Balleisen said.

But thinking through app recommendations along with a knowledgeable advisor “might lead to more focused discussions, greater awareness about options, and better decision-making,” he said.

Led by statistics Ph.D. candidate Lindsay Berry, the team has so far collected data from more than 80 students. Moving forward, they’d like to add more co-curriculars to the database and incorporate more features, such as an upvote/downvote system.

“It will be important for the app to include inputs about whether students had positive, neutral, or negative experiences with extra-curricular activities,” Balleisen added.

The system also doesn’t take into account a student’s level of engagement. “If you put down Duke Machine Learning, we don’t know if you’re president of the club, or just a member who goes to events once a year,” said team member Vincent Liu, a rising sophomore majoring in computer science and statistics.

Ultimately, the hope is to “make it a viable product so we can give it to freshmen who don’t really know what they want to do, or even sophomores or juniors who are looking for new things,” said Brooke Keene, a rising junior majoring in computer science and electrical and computer engineering.

Video by Paschalia Nsato and Julian Santos; writing by Robin Smith

Becoming the First: Nick Carnes

Editor’s Note: In the “Becoming the First” series,  first-generation college student and Rubenstein Scholar Lydia Goff explores the experiences of Duke researchers who were the first in their families to attend college.

A portrait of Duke Professor Nick Carnes

Should we care that we are governed by professionals and millionaires? This is one of the questions Nick Carnes, an assistant professor in the Sanford School of Public Policy, seeks to answer with his research. He explores unequal social class representation in the political process and how it affects policy making. But do any real differences even exist between politicians from lower socioeconomic classes and those from the upper classes? Carnes believes they do, not only because of his research but also because of his personal experiences.

When Carnes entered Princeton University as a political science graduate student, he was the only member of his cohort who had done restaurant, construction or factory work. While obtaining his undergraduate degree from the University of Tulsa, he worked twenty hours a week during the school year and clocked in at sixty to seventy hours a week between two jobs during the summer. He considered himself and his classmates “similar on paper,” just as politicians from a variety of socioeconomic classes can appear comparable. However, Carnes noticed that he approached some problems differently than his classmates did, and wondered why. He attributed his distinct approach to his working class background, without the benefit of established college-graduate family members (his mother did go to college while he was growing up), and from there began developing his current research interests.

Carnes considers “challenging the negative stereotypes about working class people” the most important aspect of his research. When he entered college, his first meeting with his advisor was filled with confusion as he tried to decipher what a syllabus was. While his working class status did restrict his knowledge of college norms, he overcame these limitations. He is now a researcher, writer, and professor who considers his job “the best in the world” and whose own story shows that working class individuals can succeed in positions more often occupied by the privileged. As Carnes states, “There’s no good reason to not have working class people in office.” His research seeks to reinforce that.

His biggest challenge is that the data he needs to analyze often does not exist in a well-documented form. Much of his research involves gathering data so that he can generate results. His published book, White-Collar Government: The Hidden Role of Class in Economic Policy Making, and his book coming out in September, The Cash Ceiling: Why Only the Rich Run for Office–and What We Can Do About It, contain the data and results he has produced. Presently, he is beginning a project on transnational governments because “cash ceilings exist in every advanced democracy.” Carnes’ research suggests we should care that professionals and millionaires run our government. Through his story, he exemplifies that students who come from families without generations of college graduates can still succeed.


Post by Lydia Goff


Artificial Intelligence Knows How You Feel

Ever wondered how Siri works? Afraid that super smart robots might take over the world soon?

On April 3rd, researchers from Duke, NCSU and UNC came together for Triangle Machine Learning Day to spark curiosity about the complex field that is Artificial Intelligence. A.I. is an overarching term for smart technologies, ranging from self-driving cars to targeted advertising. We can arrive at artificial intelligence through what’s known as “machine learning”: instead of explicitly programming a machine with the capabilities we want it to have, we write code that is flexible and adapts to the information it is presented with. Its knowledge grows as we train it. In other words, we’re teaching a computer to learn.

Matthew Philips is working with Kitware to get computers to “see,” also known as “machine vision.” By providing thousands and thousands of images, a computer with the right coding can learn to actually make sense of what an image is beyond different colored pixels.

Machine vision has numerous applications. An effective way to search satellite imagery for arbitrary objects could be huge in the advancement of space technology – a satellite could potentially identify obscure objects or potential lifeforms that stick out in those images. This is something we as humans can’t do ourselves just because of the sheer amount of data there is to go through. Similarly, we could teach a machine to identify cancerous or malignant cells in an image, thus giving us a quick diagnosis if someone is at risk of developing a disease.

The problem is, how do you teach a computer to see? Machines don’t easily understand things like similarity, depth or orientation — things that we as humans process automatically without even thinking about them. That’s exactly the type of problem Kitware has been tackling.

One hugely successful piece of Artificial Intelligence you may be familiar with is IBM’s Watson. Labeled as “A.I. for professionals,” Watson was featured on 60 Minutes and even played Jeopardy! on live television. Watson has visual recognition capabilities, can work as a translator, and can even understand things like tone, personality or emotional state. And obviously it can answer crazy hard questions. What’s even cooler is that it doesn’t matter how you ask the question – Watson will know what you mean. Watson is basically Siri on steroids, and the world got a taste of its power after watching it smoke its competitors on Jeopardy! However, Watson is not a single physical supercomputer. It is a collection of technologies that can be used in many different ways, depending on how you train it. This is what makes Watson so astounding: through machine learning, its knowledge adapts to the context it’s being used in.

Source: CBS News.

IBM has been able to develop such a powerful tool thanks to data. Stacy Joines from IBM noted, “Data has transformed every industry, profession, and domain.” From our smart phones to fitness devices, data is being collected about us as we speak (see: digital footprint). While it’s definitely pretty scary, the point is that a lot of data is out there. The more data you feed Watson, the smarter it is. IBM has utilized this abundance of data combined with machine learning to produce some of the most sophisticated AI out there.

Sure, it’s a little creepy how much data is being collected on us. Sure, there are tons of movies and theories out there about how intelligent robots will one day outsmart humans and take over. But A.I. isn’t a thing to be scared of. It surpasses the capabilities of even the most advanced purely pre-programmed system. It’s joining the health care system to save lives, advising businesses, and could potentially help find a new habitable planet. What we choose to do with A.I. is entirely up to us.

Post by Will Sheehan

How a Museum Became a Lab

Encountering and creating art may be among humankind’s most complex experiences. Art, not just visual art but also dance and song, requires the brain to understand an object or performance presented to it and then to associate it with memories, facts, and emotions.

A piece in Dario Robleto’s exhibit titled “The Heart’s Knowledge Will Decay” (2014)

In an ongoing experiment, Jose “Pepe” Contreras-Vidal and his team set up in artist Dario Robleto’s exhibit “The Boundary of Life Is Quietly Crossed” at the Menil Collection near downtown Houston. They then asked visitors if they were willing to have their trips through the museum and their brain activities recorded. Robleto’s work was displayed from August 16, 2014 to January 4, 2015. By engaging museum visitors, Contreras-Vidal and Robleto gathered brain activity data while also educating the public, combining research and outreach.

“We need to collect data in a more natural way, beyond the lab” explained Contreras-Vidal, an engineering professor at the University of Houston, during a talk with Robleto sponsored by the Nasher Museum.

More than 3,000 people have participated in this experiment, and the number is growing.

To measure brain activity, the volunteers wear EEG caps which record the electrical impulses that the brain uses for communication. EEG caps are noninvasive because they are just pulled onto the head like swim caps. The caps allow the museum goers to move around freely so Contreras-Vidal can record their natural movements and interactions.

By watching individuals interact with art, Contreras-Vidal and his team can find patterns between their experiences and their brain activity. They also asked the volunteers to reflect on their visit, adding a first person perspective to the experiment. These three sources of data showed them what a young girl’s favorite painting was, how she moved and expressed her reaction to this painting, and how her brain activity reflected this opinion and reaction.

The volunteers can also watch the recordings of their brain signals, giving them an opportunity to ask questions and engage with the science community. For most participants, this is the first time they’ve seen recordings of their brain’s electrical signals. In one trip, these individuals learned about art, science, and how the two can interact. Throughout this entire process, every member of the audience forms a unique opinion and learns something about both the world and themselves as they interact with and make art.

Children with EEG caps explore art.

Contreras-Vidal is especially interested in the gestures people make when exposed to the various stimuli in a museum and hopes to apply this information to robotics. In the future, he wants someone with a robotic arm to not only be able to grab a cup but also to be able to caress it, grip it, or snatch it. For example, you probably can tell if your mom or your best friend is approaching you by their footsteps. Contreras-Vidal wants to restore this level of individuality to people who have prosthetics.

Contreras-Vidal thinks science can benefit art just as much as art can benefit science. Both he and Robleto hope that their research can reduce many artists’ distrust of science and help advance both fields through collaboration.

Post by Lydia Goff

High as a Satellite — Integrating Satellite Data into Science

Professor Tracey Holloway researches air quality at the University of Wisconsin-Madison.

Satellite data are contributing more and more to understanding air quality trends, and professor Tracey Holloway wants the world to know.

As a professor in the Department of Atmospheric and Oceanic Sciences at the University of Wisconsin-Madison and the current team lead of the NASA Health and Air Quality Applied Sciences Team (HAQAST), she not only helps with the science related to satellites, but also with communicating findings to larger audiences.

Historically, ground-based monitors have provided estimates on changes in concentrations of air pollutants, Holloway explained in her March 2, 2018 seminar, “Connecting Science with Stakeholders,” organized by Duke’s Earth and Ocean Sciences department.

Despite the valuable information ground-based monitors provide, however, factors like high costs limit their widespread use. For example, only about 400 ground-based monitors for nitrogen dioxide currently exist, with many states in the U.S. entirely lacking even a single one. Before satellites came into the picture, therefore, almost no information on nitrogen dioxide levels existed for much of the country.

To close the gap, HAQAST employed earth-observing, polar-orbiting satellites — with fruitful results. Not only have they provided enough data to make more comprehensive maps of nitrogen dioxide distributions and concentrations, but they have also detected formaldehyde, a known carcinogen, in our atmosphere for the first time.

Satellites have additional long-term benefits. They can help determine potential monitoring sites before actually having to invest large amounts of resources. In the case of formaldehyde, satellite-generated information located areas of higher concentrations — or formaldehyde “hotspots” —  in which HAQAST can now prioritize placing a ground-based monitor. Once established, the site can evaluate air dispersion models, provide air quality information to the public and add to scientific research.

A slide from Holloway’s presentation, in the LSRC A building on March 2, explaining the purposes of a monitoring site.

Holloway underscored the importance of effectively communicating science. She explained that many policymakers don’t have strong science backgrounds and therefore need quick, friendly explanations of research from scientists.

Perhaps more significant, though, is the fact that some people don’t even realize that information exists. Specifically, people don’t realize that more satellites are producing new information every day; Holloway has made it a personal goal to have more one-on-one conversations with stakeholders to increase transparency.

Breakthroughs in science aren’t made by individuals: science and change are collaborative. And for Holloway, stakeholders also include the general public. She founded the Earth Science Women’s Network, with one of her goals being to change the vision of what a “scientist” looks like. Through photo campaigns and other communication and engagement activities, she has engaged adults and children alike. Making science more appealing, she argues, makes it easier to inspire new discussions and continue old ones, create a more diverse research environment, and open the field to all.

Professor Tracey Holloway, air quality researcher at University of Wisconsin-Madison, presented her research at Duke on March 2, 2018.

Post by Stella Wang, class of 2019

What is a Model?

When you think of the word “model,” what do you think?

As an Economics major, the first thing that comes to my mind is a statistical model, modeling phenomena such as the effect of class size on student test scores. A car connoisseur’s mind might go straight to a model of their favorite vintage Aston Martin. Someone studying fashion might even imagine a runway model. The point is, the term “model” is used in popular discourse incredibly frequently, but are we even sure what it implies?

Annabel Wharton, a professor of Art, Art History, and Visual Studies at Duke, gave a talk entitled “Defining Models” at the Visualization Friday Forum. The forum is a place “for faculty, staff and students from across the university (and beyond Duke) to share their research involving the development and/or application of visualization methodologies.” Wharton’s goal was to answer the complex question, “what is a model?”

Wharton began the talk by defining the term “model,” knowing that it can often be ambiguous. She observed that models are “a prolific class of things,” from architectural models, to video game models, to runway models. Some of these things seem unrelated, but Wharton, throughout her talk, pointed out the similarities between them and ultimately tied them all together as models.

The word “model” itself has become a heavily loaded term. According to Wharton, the dictionary definition of “model” runs nine columns of text. Wharton then stressed that a model “is an autonomous agent.” This implies that models must be independent of the world and of theory, as well as independent of their makers and consumers. For example, architecture, once built, becomes independent of its architect.

Next, Wharton outlined different ways to model. These include modeling iconically, in which the model resembles the actual thing, such as how the video game Assassin’s Creed models historical architecture. Another way to model is indexically, in which parts of the model are always ordered the same way, such as the order of utensils at a traditional place setting. The final way to model is symbolically, in which a model symbolizes the mechanism of what it is modeling, as in a mathematical equation.

Wharton then discussed the difference between a “strong model” and a “weak model.” A strong model is defined as a model that determines its weak object, such as an architect’s model or a runway model. On the other hand, a “weak model” is a copy that is always less than its archetype, such as a toy car. These different classifications include examples we are all likely aware of, but weren’t able to explicitly classify or differentiate until now.

Wharton finally transitioned to discussing one of her favorite models of all time: a model of the Hagia Sophia in Istanbul, a former Greek Orthodox church and later imperial mosque. She detailed how the model that provides the best sense of the building without actually being there is found in a surprising place: an Assassin’s Creed video game. This model not only closely resembles the actual Hagia Sophia, but is also experiential and immersive. Wharton joked that, even better, the model allows explorers to avoid tourists, unlike the actual Hagia Sophia.

Wharton described why the Assassin’s Creed model is a highly effective agent. Not only does the model closely resemble the actual architecture, but it also engages history by being surrounded by a historical fiction plot. Further, Wharton mentioned how the perceived freedom of the game is illusory, because the course of the game actually limits players’ autonomy with code and algorithms.

After Wharton’s talk, it’s clear that models are indeed “a prolific class of things.” My big takeaway is that so many things in our everyday lives are models, even if we don’t classify them as such. Duke’s East Campus is a model of the University of Virginia’s campus, subtraction is a model of the loss of an entity, and an academic class is a model of an actual phenomenon in the world. Leaving my first Visualization Friday Forum, I am even more certain that models are powerful, and that they stretch far beyond the statistical models in my Economics classes.

By Nina Cervantes

Game-Changing App Explores Conservation’s Future

In the first week of February, students, experts and conservationists from across the country came together for the second annual Duke Blueprint symposium. Focused on the theme of “Nature and Progress,” the conference aimed to harness the power of diversity and interdisciplinary collaboration to develop solutions to some of the world’s most pressing environmental challenges.

Scott Loarie spoke at Duke’s Mary Duke Biddle Trent Semans Center.

One of the most exciting parts of this symposium’s first night was without a doubt its all-star cast of keynote speakers. The experiences and advice each of these researchers had to offer were far too diverse for any single blog post to capture, but one particularly interesting presentation was that of National Geographic fellow Scott Loarie—co-director of the game-changing iNaturalist app.

iNat, as Loarie explained, is a collaborative citizen scientist network with aspirations of developing a comprehensive mapping of all terrestrial life. Any time they go outside, users of this app can photograph and upload pictures of any wildlife they encounter. A network of scientists and experts from around the world then helps the users identify their finds, generating data points on an interactive, user-generated map of various species’ ranges.

Simple, right? Multiply that by 500,000 users worldwide, though, and it’s easy to see why researchers like Loarie are excited by the possibilities an app like this can offer. The software first went live in 2008, and since then its user base has roughly doubled each year. This has meant the generation of over 8 million data points of 150,000 different species, including one-third of all known vertebrate species and 40% of all known species of mammal. Every day, the app catalogues around 15 new species.

“We’re slowly ticking away at the tree of life,” Loarie said.

Through iNaturalist, researchers are able to analyze and connect to data in ways never before thought possible. Changes to environments and species’ distributions can be observed or modeled in real time and with unheard-of collaborative opportunities.

To demonstrate the power of this connectedness, Loarie recalled one instance of a citizen scientist in Vietnam who took a picture of a snail. This species had never been captured, never been photographed, hadn’t been observed in over a century. One of iNat’s users recognized it anyway. How? He’d seen it in one of the journals from Captain James Cook’s 18th-century voyage to circumnavigate the globe.

It’s this kind of interconnectivity that demonstrates not just the potential of apps like iNaturalist, but also the power of collaboration and the possibilities symposia like Duke Blueprint offer. Bridging gaps, tearing down boundaries, building up bonds—these are the heart of conservationism’s future. Nature and Progress, working together, pulling us forward into a brighter world.

Post by Daniel Egitto



Duke Scholars Bridge Disciplines to Tackle Big Questions


This visualization, created by James Moody and the team at the Duke Network Analysis Center, links faculty from across schools and departments who serve together on Ph.D. committees. An interactive version is available here.

When the next big breakthrough in cancer treatment is announced, no one will ask whether the researchers are pharmacologists, oncologists or cellular biologists – and chances are, the team will represent all three.

In the second annual Scholars@Duke Visualization Challenge, Duke students explored how scholars across campus are drawing from multiple academic disciplines to tackle big research questions.

“I’m often amazed at how gifted Duke faculty are and how they can have expertise in multiple fields, sometimes even fields that don’t seem to overlap,” said Julia Trimmer, Director of Faculty Data Systems and Analysis at Duke.

In last year’s challenge, students dug into Scholars@Duke publication data to explore how Duke researchers collaborate across campus. This year, they were provided with additional data on Ph.D. dissertation committees and asked to focus on how graduate education and scholarship are reaching across departmental boundaries.

“The idea was to see if certain units or disciplines contributed faculty committee members across disciplines, or if there’s a lot of discipline ‘overlap,’” Trimmer said.

The winning visualization, created by graduate student Matthew Epland, examines how Ph.D. committees span different fields. In this interactive plot, each marker represents an academic department. The closer together markers are, the more likely it is that a faculty member from one department will serve on the committee of a student in the other department.

Epland says he was intrigued to see the tight-knit community of neuroscience-focused departments that span different schools, including Psychology and Neuroscience; Neurobiology; Neurology; and Psychiatry and Behavioral Sciences. Not surprisingly, many of the faculty in these departments are members of the Duke Institute for Brain Sciences (DIBS).
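The proximity layout Epland describes can be grounded in simple co-occurrence counts; a sketch with hypothetical committee data might look like the following (this is an illustration of the general idea, not Epland’s actual code). Higher co-service counts would then translate into smaller distances in the plot.

```python
from itertools import combinations
from collections import Counter

# Hypothetical committees, each listing its members' home departments
committees = [
    ["Neurobiology", "Psychology and Neuroscience", "Neurology"],
    ["Neurobiology", "Psychology and Neuroscience"],
    ["Physics", "Mathematics"],
]

# Count how often each pair of departments serves on the same committee
co_service = Counter()
for committee in committees:
    for a, b in combinations(sorted(set(committee)), 2):
        co_service[(a, b)] += 1

print(co_service.most_common(1))  # the most tightly linked department pair
```

A layout algorithm such as multidimensional scaling could then place frequently co-serving departments close together, which is what produces the neuroscience cluster Epland noticed.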


Aghil Abed Zadeh and Varda F. Hagh analyzed publication data to visualize the extent to which faculty at different Duke schools collaborate with one another. The size of each dot represents the number of publications from each school, and the thickness of each line represents the number of faculty collaborations between the connected schools.

Sociology Professor James Moody and the team at the Duke Network Analysis Center took a similar approach, creating a network of individual faculty members who are linked by shared students. Faculty who sit on committees in only one field are bunched together, highlighting researchers who bridge different disciplines. The size of each marker represents the extent to which each researcher sits “between” two fields.

The map shows a set of strong ties within the natural sciences and within the humanities, but few links between the two groups. Moody points out that philosophy is a surprising exception to this rule, lying closer to the natural sciences cluster than to the humanities cluster.

“At Duke, the strong emphasis on philosophy of science creates a natural link between philosophy and the natural sciences,” Moody said.

Duke graduate student Aghil Abed Zadeh teamed up with Varda F. Hagh, a student at Arizona State University, to create elegant maps linking schools and departments by shared authorship. The size of each marker represents the number of publications in that school or department, and the thickness of the connecting lines indicates the number of shared authorships.

“It is interesting to see how connected the law school and the public policy school are. They collaborate with many of the sciences as well, which is a surprising fact,” Zadeh said. “On the other hand, we see the Divinity School, one of the oldest at Duke, which is isolated and not connected to the others at all.”

The teams presented their visualizations Jan. 20 at the Duke Research Computing Symposium.

Post by Kara Manke


Generating Winning Sports Headlines

What if there were a scientific way to come up with the most interesting sports headlines? With the development of computational journalism, this could be possible very soon.

Dr. Jun Yang is a database and data-intensive computing researcher and professor of Computer Science at Duke. One of his latest projects is computational journalism, in which he and other computer science researchers are considering how they can contribute to journalism with new technological advances and the ever-increasing availability of data.

An exciting and very relevant part of his project is based on raw data from Duke men’s basketball games. With computational journalism, Yang and his team of researchers have been able to generate diverse player or team factoids using the statistics of the games.

Grayson Allen headed for the hoop.


An example factoid might be that, in the first 8 games of this season, Duke has won 100% of its games when Grayson Allen has scored over 20 points. While this fact is obvious, since Duke is undefeated so far this season, Yang’s programs will also be able to generate very obscure factoids about each and every player that could lead to unique and unprecedented headlines.
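A conditional win-rate factoid like this is straightforward to compute from box-score data. The sketch below uses invented numbers and is not Yang’s actual pipeline:

```python
# Hypothetical per-game box-score rows: (player_points, team_won)
games = [(22, True), (25, True), (18, True), (30, True),
         (21, True), (19, True), (24, True), (27, True)]

def factoid_win_rate(rows, threshold):
    """Win rate in games where the player scored more than `threshold` points.

    Returns None when no game qualifies, so callers can discard empty factoids.
    """
    qualifying = [won for points, won in rows if points > threshold]
    if not qualifying:
        return None
    return sum(qualifying) / len(qualifying)

rate = factoid_win_rate(games, 20)
print(f"Team won {rate:.0%} of games when the player scored over 20")  # 100%
```

Sweeping this kind of query over every player, statistic and threshold is what lets the system surface obscure factoids no human editor would think to check.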

While these statistics relating player and team success can only imply correlation, and not necessarily causation, they definitely have potential to be eye-catching sports headlines.

Extracting factoids hasn’t been a particularly challenging part of the project, but developing heuristics to choose which factoids are the most relevant and usable has been more difficult.

So far, developing these heuristics has meant crafting scoring criteria based on what the researchers find intuitively impressive. Another way to evaluate a factoid’s strength is to rank the types of headlines that draw the most views. Using this method, heuristics could, in theory, be based on past successes rather than on one researcher’s intuition.
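Such a heuristic could be as simple as a weighted score combining how many games a factoid covers with how surprising its condition is. The fields and weights below are hypothetical, standing in for the hand-tuned criteria the researchers describe:

```python
# Hypothetical factoids: "support" is the number of games covered, "surprise"
# a 0-1 rating of how unexpected the condition is.
factoids = [
    {"text": "100% wins when Allen scores 20+", "support": 8, "surprise": 0.2},
    {"text": "100% wins when bench scores 40+", "support": 3, "surprise": 0.9},
]

def score(factoid, support_weight=1.0, surprise_weight=10.0):
    """Weighted score: broad coverage is good, but surprise is weighted heavily."""
    return (support_weight * factoid["support"]
            + surprise_weight * factoid["surprise"])

ranked = sorted(factoids, key=score, reverse=True)
print(ranked[0]["text"])  # the most headline-worthy factoid under these weights
```

If headline view counts were logged, the weights could later be fit to past successes instead of being set by hand, exactly the shift from intuition to data the team envisions.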

Something else to consider is which types of factoids are more powerful. For example, what’s better: a bolder claim in a shorter period of time, or a less bold claim but over many games or even seasons?

The goal of this project is to continue analyzing data from Duke men’s basketball games, generating interesting factoids, and publishing them on a public website within 10 to 15 minutes of each game.

Looking forward, computational journalism has huge potential for Duke men’s basketball, sports in general, and even for generating other news factoids. Even further, computational journalism and its scientific methodology might lead to the ability to quickly fact-check political claims.

Right now, however, it is fascinating to know that computer science has the potential to touch our lives in some pretty unexpected ways. As the men’s basketball team’s season-opening winning streak continues, who knows what unprecedented factoids Jun Yang and his team are coming up with?

By Nina Cervantes
