Duke Research Blog

Following the people and events that make up the research community at Duke.

Category: Visualization

What Happens When Data Scientists Crunch Through Three Centuries of Robinson Crusoe?

Reading 1,400-plus editions of “Robinson Crusoe” in one summer is impossible. So one team of students tried to train computers to do it for them.

Since Daniel Defoe’s shipwreck tale “Robinson Crusoe” was first published nearly 300 years ago, thousands of editions and spinoff versions have been published, in hundreds of languages.

A research team led by Grant Glass, a Ph.D. student in English and comparative literature at the University of North Carolina at Chapel Hill, wanted to know how the story changed as it went through various editions, imitations and translations, and to see which parts stood the test of time.

Reading through them all at a pace of one a day would take years. Instead, the researchers are training computers to do it for them.

This summer, Glass’ team in the Data+ summer research program used computer algorithms and machine learning techniques to sift through 1,482 full-text versions of Robinson Crusoe, compiled from online archives.

“A lot of times we think of a book as set in stone,” Glass said. “But a project like this shows you it’s messy. There’s a lot of variance to it.”

“When you pick up a book it’s important to know what copy it is, because that can affect the way you think about the story,” Glass said.

Just getting the texts into a form that a computer could process proved half the battle, said undergraduate team member Orgil Batzaya, a Duke double major in math and computer science.

The books were already scanned and posted online, so the students used software to download the scans from the internet, via a process called “scraping.” But processing the scanned pages of old printed books, some of which had smudges, specks or worn type, and converting them to a machine-readable format proved trickier than they thought.

The software struggled to decode the strange spellings (“deliver’d,” “wish’d,” “perswasions,” “shore” versus “shoar”), different typefaces between editions, and other quirks.

Special characters unique to 18th century fonts, such as the curious f-shaped version of the letter “s,” make even humans read “diftance” and “poffible” with a mental lisp.

Their first attempts came up with gobbledygook. “The resulting optical character recognition was completely unusable,” said team member and Duke senior Gabriel Guedes.

At a Data+ poster session in August, Guedes, Batzaya and history and computer science double major Lucian Li presented their initial results: a collection of colorful scatter plots, maps, flowcharts and line graphs.

Guedes pointed to clusters of dots on a network graph. “Here, the red editions are American, the blue editions are from the U.K.,” Guedes said. “The network graph recognizes the similarity between all these editions and clumps them together.”

Once they turned the scanned pages into machine-readable texts, the team fed them into a machine learning algorithm that measures the similarity between documents.

The algorithm takes in chunks of texts — sentences, paragraphs, even entire novels — and converts them to high-dimensional vectors.

Creating this numeric representation of each book, Guedes said, made it possible to perform mathematical operations on them. They added up the vectors for each book to find their sum, calculated the mean, and looked to see which edition was closest to the “average” edition. It turned out to be a version of Robinson Crusoe published in Glasgow in 1875.
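For readers who want to see the arithmetic, here is a minimal Python sketch of the "average edition" idea. It is not the team's code: the three-number vectors are made up (real document vectors have hundreds of dimensions), the edition labels other than the Glasgow one are hypothetical, the toy values were chosen so that the Glasgow edition wins, and cosine similarity is an assumed choice of closeness measure, since the article does not say which one the team used.

```python
import math

# Hypothetical low-dimensional stand-ins for the high-dimensional document
# vectors described above. The numbers are invented for illustration.
edition_vectors = {
    "London 1719": [0.9, 0.1, 0.3],
    "Glasgow 1875": [0.6, 0.4, 0.5],
    "New York 1900": [0.2, 0.8, 0.6],
}


def mean_vector(vectors):
    """Add the vectors component by component, then divide by their count."""
    vectors = list(vectors)
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# The "average" edition is the mean of all edition vectors.
average = mean_vector(edition_vectors.values())

# The closest real edition is the one most similar to that mean.
closest = max(
    edition_vectors,
    key=lambda name: cosine_similarity(edition_vectors[name], average),
)
print(closest)
```

The same two steps, averaging and then ranking by similarity, would apply unchanged to vectors of any length.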

They also analyzed the importance of specific plot points in determining a given edition’s closeness to the “average” edition: what about the moment when Crusoe spots a footprint in the sand and realizes that he’s not alone? Or the time when Crusoe and Friday, after leaving the island, battle hungry wolves in the Pyrenees?

The team’s results might be jarring to those unaccustomed to seeing 300 years of publishing reduced to a bar chart. But by using computers to compare thousands of books at a time, “digital humanities” scholars say it’s possible to trace large-scale patterns and trends that humans poring over individual books can’t.

“This is really something only a computer can do,” Guedes said, pointing to a time-lapse map showing how the Crusoe story spread across the globe, built from data on the place and date of publication for 15,000 editions.

“It’s a form of ‘distant reading’,” Guedes said. “You use this massive amount of information to help draw conclusions about publication history, the movement of ideas, and knowledge in general across time.”

This project was organized in collaboration with Charlotte Sussman (English) and Astrid Giugni (English, ISS). Check out the team’s results at https://orgilbatzaya.github.io/pirating-texts-site/

Data+ is sponsored by Bass Connections, the Information Initiative at Duke, the Social Science Research Institute, the departments of Mathematics and Statistical Science and MEDx. This project team was also supported by the Duke Office of Information Technology.

Other Duke sponsors include DTECH, Duke Health, Sanford School of Public Policy, Nicholas School of the Environment, Development and Alumni Affairs, Energy Initiative, Franklin Humanities Institute, Duke Forge, Duke Clinical Research, Office for Information Technology and the Office of the Provost, as well as the departments of Electrical & Computer Engineering, Computer Science, Biomedical Engineering, Biostatistics & Bioinformatics and Biology.

Government funding comes from the National Science Foundation.

Outside funding comes from Lenovo, Power for All and SAS.

Community partnerships, data and interesting problems come from the Durham Police and Sheriff’s Department, Glenn Elementary PTA, and the City of Durham.

Videos by Paschalia Nsato and Julian Santos; writing by Robin Smith

Researcher Turns Wood Into Larger-Than-Life Insects

Duke biologist Alejandro Berrio creates larger-than-life insect sculptures. This wooden mantis was exhibited at the Art Science Gallery in Austin, Texas in 2013.

On a recent spring morning, biologist Alejandro Berrio took a break from running genetic analyses on a supercomputer to talk about an unusual passion: creating larger-than-life insect sculptures.

Berrio is a postdoctoral associate in professor Greg Wray’s lab at Duke. He’s also a woodcarver, having exhibited his shoebox-sized models of praying mantises, wasps, crickets and other creatures in museums and galleries in his hometown and in Austin, Texas, where he earned his Ph.D.

The Colombia-born scientist started carving wood in his early teens, when he got interested in model airplanes. He built them out of pieces of lightweight balsa wood that he bought in craft shops.

When he got to college at the University of Antioquia in Medellín, Colombia’s second-largest city, he joined an entomology lab. “One of my first introductions to science was watching insects in the lab and drawing them,” Berrio said. “One day I had an ‘aha’ moment and thought: I can make this. I can make an insect with wings the same way I used to make airplanes.”

Beetle carved by Duke biologist Alejandro Berrio.

His first carvings were of mosquitoes — the main insect in his lab — hand carved from soft balsa wood with an X-Acto knife.

Using photographs for reference, he would sketch the insects from different positions before he started carving.

He worked at his kitchen table, shaping the body from balsa wood or basswood. “I might start with a power saw to make the general form, and then with sandpaper until I started getting the shape I wanted,” Berrio said.

He used metal to join and position the segments in the legs and antennae, then set the joints in place with glue.

“People loved them,” Berrio said. “Scientists were like: Oh, I want a fly. I want a beetle. My professors were giving them to their friends. So I started making them for people and selling them.”

Soon Berrio was carving wooden fungi, dragons, turtles, a snail. “Whatever people wanted me to make,” Berrio said.

He earned just enough money to pay for his lunch, or the bus ride to school.

Duke biologist Alejandro Berrio carved this butterfly using balsa wood for the body and legs, and paper for the wings.

His pieces can take anywhere from a week to two months to complete. “This butterfly was the most time-consuming,” he said, pointing to a model with translucent veined wings.

Since moving to Durham in 2016, he has devoted less time to his hobby than he once did. “Last year I made a crab for a friend who studies crustaceans,” Berrio said. “She got married and that was my wedding gift.”

Still no apes, or finches, or prairie voles — all subjects of his current research. “But I’m planning to restart,” Berrio said. “Every time I go home to Colombia I bring back some wood, or my favorite glue, or one of my carving tools.”

Insect sculptures by Duke biologist Alejandro Berrio.

Explore more of Berrio’s sculpture and photography at https://www.flickr.com/photos/alejoberrio/.

by Robin Smith

“I Heart Tech Fair” Showcases Cutting-Edge VR and More

Duke’s tech game is stronger than you might think.

OIT held an “I Love Tech Fair” in the Technology Engagement Center / Co-Lab on Feb. 6, open to anyone who wanted to check out things like 3D printers and augmented reality while munching on some Chick-fil-A and cookies. There was a raffle for some sweet prizes, too.

I got a full demonstration of the 3D printing process, and it’s so easy! It requires some really expensive software called Fusion, but thankfully Duke is awesome and students can get it for free. You can make some killer stuff with 3D printing; the technology is so advanced now. I’ve seen all kinds of things: models of my friend’s head, a doorstop made out of someone’s name … one guy even made a working ukulele, apparently!

One of the cooler things at the fair was Augmented Reality books. These books look like ordinary picture books, but looking at a page through your phone’s camera, the image suddenly comes to life in 3D with tons of detail and color, seemingly floating above the book! All you have to do is download an app and get the right book. Augmented reality is only getting better as time goes on and will soon be a primary tool in education and gaming, which is why Duke Digital Initiative (DDI) wanted to show it off.

By far my favorite exhibit at the tech fair was virtual reality. Throw on a headset and some bulky goggles, grab a controller in each hand, and suddenly you’re in another world. The guy running the station, Mark McGill, had actually hand-built the machine that ran it all. Very impressive guy. He told me the machine is the most expensive and important part, since it accounts for how smooth the immersion is. The smoother the immersion, the more realistic the experience. And boy, was it smooth. A couple years ago I experienced virtual reality at my high school and thought it was cool (I did get a little nauseous), but after Mark set me up with the “HTC Vive” connected to his sophisticated machine, it blew me away (with no nausea, too).

I smiled the whole time playing “Super Hot,” where I killed incoming waves of people in slow motion with ninja stars, guns, and rocks. Mark had tons of other games too, all downloaded from Steam, for both entertainment and educational purposes. One called “Organon” lets you examine human anatomy inside and out, and you can even upload your own MRIs. VR offers an unbelievable number of possibilities. You could conquer your fear of public speaking by standing in front of a simulated crowd, or realistically tour “the VR Museum of Fine Art.” Games like these just aren’t the same on, say, an Xbox, because a console simply doesn’t give you that key feeling of being there. In Fallout 4, your heart pounds fast in your chest as you blast away Feral Ghouls and Super Mutants right in front of you. But in reality, you’re just standing in a green room with stupid-looking goggles on. Awesome!

There’s another place on campus — the Bolt VR in Edens residence hall — that also has a cutting-edge VR setup going. Mark explained to me that Duke wants people to get experience with VR, as it will soon be a huge part of our lives. Having exposure now could give Duke graduates a very valuable head start in their career (while also making Duke look good). Plus, it’s nice to have on campus for offering students a fun break from all the hard work we put in.

If you’re bummed you missed out, or even if you don’t “love tech,” I recommend checking out the Tech Fair next time — February 13, from 6-8pm. See you there.

Post By Will Sheehan

Duke Scholars Bridge Disciplines to Tackle Big Questions

A visualization showing faculty as dots that are connected by lines

This visualization, created by James Moody and the team at the Duke Network Analysis Center, links faculty from across schools and departments who serve together on Ph.D. committees. An interactive version is available here.

When the next big breakthrough in cancer treatment is announced, no one will ask whether the researchers are pharmacologists, oncologists or cellular biologists – and chances are, the team will represent all three.

In the second annual Scholars@Duke Visualization Challenge, Duke students explored how scholars across campus are drawing from multiple academic disciplines to tackle big research questions.

“I’m often amazed at how gifted Duke faculty are and how they can have expertise in multiple fields, sometimes even fields that don’t seem to overlap,” said Julia Trimmer, Director of Faculty Data Systems and Analysis at Duke.

In last year’s challenge, students dug into Scholars@Duke publication data to explore how Duke researchers collaborate across campus. This year, they were provided with additional data on Ph.D. dissertation committees and asked to focus on how graduate education and scholarship are reaching across departmental boundaries.

“The idea was to see if certain units or disciplines contributed faculty committee members across disciplines or if there’s a lot of discipline ‘overlap,’” Trimmer said.

The winning visualization, created by graduate student Matthew Epland, examines how Ph.D. committees span different fields. In this interactive plot, each marker represents an academic department. The closer together markers are, the more likely it is that a faculty member from one department will serve on the committee of a student in the other department.

Epland says he was intrigued to see the tight-knit community of neuroscience-focused departments that span different schools, including psychology and neuroscience, neurobiology, neurology, and psychiatry and behavioral sciences. Not surprisingly, many of the faculty in these departments are members of the Duke Institute for Brain Sciences (DIBS).

Duke schools appear as dots and are connected by lines of different thicknesses

Aghil Abed Zadeh and Varda F. Hagh analyzed publication data to visualize the extent to which faculty at different Duke schools collaborate with one another. The size of each dot represents the number of publications from each school, and thickness of each line represents the number of faculty collaborations between the connected schools.

Sociology Professor James Moody and the team at the Duke Network Analysis Center took a similar approach, creating a network of individual faculty members who are linked by shared students. Faculty who sit on committees in only one field are bunched together, highlighting researchers who bridge different disciplines. The size of each marker represents the extent to which each researcher sits “between” two fields.

The map shows a set of strong ties within the natural sciences and within the humanities, but few links between the two groups. Moody points out that philosophy is a surprising exception to this rule, lying closer to the natural sciences cluster than to the humanities cluster.

“At Duke, the strong emphasis on philosophy of science creates a natural link between philosophy and the natural sciences,” Moody said.

Duke graduate student Aghil Abed Zadeh teamed up with Varda F. Hagh, a student at Arizona State University, to create elegant maps linking schools and departments by shared authorship. The size of each marker represents the number of publications in that school or department, and the thickness of each connecting line indicates the number of shared authorships.

“It is interesting to see how connected law school and public policy school are. They collaborate with many of the sciences as well, which is a surprising fact,” Zadeh said. “On the other hand, we see Divinity school, one the oldest at Duke, which is isolated and not connected to others at all.”
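The two quantities behind a map like this, marker size and line thickness, come down to counting. Here is a rough Python sketch under stated assumptions: the publication records, titles and author-school lists are invented for illustration, and this is not the students' code or the Scholars@Duke data format.

```python
from collections import Counter
from itertools import combinations

# Hypothetical publication records: each lists the schools of its authors.
publications = [
    {"title": "Paper A", "schools": ["Law", "Public Policy"]},
    {"title": "Paper B", "schools": ["Law", "Public Policy", "Medicine"]},
    {"title": "Paper C", "schools": ["Medicine"]},
    {"title": "Paper D", "schools": ["Divinity"]},
]

node_size = Counter()    # publications per school, drawn as marker size
edge_weight = Counter()  # shared authorships per pair, drawn as line thickness

for pub in publications:
    # Sort and de-duplicate so each pair of schools has one canonical key.
    schools = sorted(set(pub["schools"]))
    node_size.update(schools)
    edge_weight.update(combinations(schools, 2))

print(node_size)
print(edge_weight)
```

A school that never shares a paper with another, like the made-up "Divinity" record here, ends up with a marker but no lines, which is how an isolated school would appear on the map.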

The teams presented their visualizations Jan. 20 at the Duke Research Computing Symposium.

Post by Kara Manke

Researchers Get Superman’s X-ray Vision

X-ray vision just got cooler. A technique developed in recent years boosts researchers’ ability to see through the body and capture high-resolution images of animals inside and out.

This special type of 3-D scanning reveals not only bones, teeth and other hard tissues, but also muscles, blood vessels and other soft structures that are difficult to see using conventional X-ray techniques.

Researchers have been using the method, called diceCT, to visualize the internal anatomy of dozens of different species at Duke’s Shared Materials Instrumentation Facility (SMIF).

There, the specimens are stained with an iodine solution that helps soft tissues absorb X-rays, then placed in a micro-CT scanner, which takes thousands of X-ray images from different angles while the specimen spins around. A computer then stitches the scans into digital cross sections and stacks them, like slices of bread, to create a virtual 3-D model that can be rotated, dissected and measured as if by hand.
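To make the "slices of bread" idea concrete, here is a toy Python sketch of stacking cross sections into a 3-D volume and measuring part of it. Everything in it is an assumption for illustration: the tiny 2-by-2 slices, the brightness threshold of 128 and the 10-micron voxel edge are made up, and real reconstruction software is far more involved.

```python
# Each slice is a small 2-D grid of invented brightness values (0-255).
# Brighter values stand in for iodine-stained tissue.
slices = [
    [[0, 200], [0, 0]],
    [[0, 210], [190, 0]],
    [[0, 0], [220, 0]],
]


def stack_slices(slices):
    """Stack 2-D cross sections into a volume indexed as volume[z][y][x]."""
    height, width = len(slices[0]), len(slices[0][0])
    for s in slices:
        if len(s) != height or any(len(row) != width for row in s):
            raise ValueError("all slices must have the same dimensions")
    return [[row[:] for row in s] for s in slices]


def stained_volume(volume, threshold, voxel_edge):
    """Count voxels at or above the threshold and convert to cubic units."""
    count = sum(
        1
        for plane in volume
        for row in plane
        for value in row
        if value >= threshold
    )
    return count * voxel_edge ** 3


volume = stack_slices(slices)
# Hypothetical settings: threshold 128, voxel edge of 10 microns.
print(stained_volume(volume, threshold=128, voxel_edge=10))
```

Measuring a structure "as if by hand" then becomes a matter of counting the voxels that belong to it and multiplying by the size of one voxel.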

Here’s a look at some of the images they’ve taken:

See-through shrimp

If you get flushed after a workout, you’re not alone — the Caribbean anemone shrimp does too.

Recent Duke Ph.D. Laura Bagge was scuba diving off the coast of Belize when she noticed the transparent shrimp Ancylomenes pedersoni turn from clear to cloudy after rapidly flipping its tail.

To find out why exercise changes the shrimp’s complexion, Bagge, Duke professor Sönke Johnsen and colleagues used diceCT to compare the shrimp’s internal anatomy before and after physical exertion.

In the shrimp cross sections in this video, blood vessels are colored blue-green, and muscle is orange-red. The researchers found that more blood flowed to the tail after exercise, presumably to deliver more oxygen-rich blood to working muscles. The increased blood flow between muscle fibers causes light to scatter or bounce in different directions, which is why the normally see-through shrimp lose their transparency.

Peer inside the leg of a mouse

Duke cardiologist Christopher Kontos, M.D., and MD/PhD student Hasan Abbas have been using the technique to visualize the inside of a mouse’s leg.

The researchers hope the images will shed light on changes in blood vessels in people, particularly those with peripheral artery disease, in which plaque buildup in the arteries reduces blood flow to the extremities such as the legs and feet.

The micro-CT scanner at Duke’s Shared Materials Instrumentation Facility made it possible for Abbas and Kontos to see structures as small as 13 microns, or a fraction of the width of a human hair, including muscle fibers and even small arteries and veins in 3-D.

Take a tour through a tree shrew

DiceCT imaging allows Heather Kristjanson at the Johns Hopkins School of Medicine to digitally dissect the chewing muscles of animals such as this tree shrew, a small mammal from Southeast Asia that looks like a cross between a mouse and a squirrel. By virtually zooming in and measuring muscle volume and the length of muscle fibers, she hopes to see how strong the muscles are. Studying such clues in modern mammals helps Kristjanson and colleagues reconstruct similar features in the earliest primates that lived millions of years ago.

Try it for yourself

Students and instructors who are interested in trying the technique in their research are eligible to apply for vouchers to cover SMIF fees. People at Duke University and elsewhere are encouraged to apply. For more information visit https://smif.pratt.duke.edu/Funding_Opportunities, or contact Dr. Mark Walters, Director of SMIF, via email at mark.walters@duke.edu.

Located on Duke’s West Campus in the Fitzpatrick Building, the SMIF is a shared use facility available to Duke researchers and educators as well as external users from other universities, government laboratories or industry through a partnership called the Research Triangle Nanotechnology Network. For more info visit http://smif.pratt.duke.edu/.

Post by Robin Smith, News and Communications

Panic in the Poster Session!

For their recent retreat, Regeneration Next tried something a little different for the time-honored poster session.

Rather than simply un-tubing that poster they took to the American Association of Whatever a few months ago, presenters were asked to DRAW their poster fresh and hot on a plain sheet of white paper in 15 minutes, using nothing more than an idea and a couple of markers.

Concerns were shared, shall we say, with the leadership of the regenerative medicine initiative when the rules were announced.

“People are always nervous about something they haven’t tried before,” said Regeneration Next Executive Director Sharlini Sankaran. “There was a lot of anxiety about the new format and how they would explain their research without charts and graphs.”

There was palpable poster panic as the retreat moved to the wide open fifth floor of the Trent Semans Center in the late afternoon. Administrative coordinator Tiffany Casey had spread out a rainbow of brand-new Sharpies, and the moveable bulletin boards stood in neat, numbered ranks with plain white sheets of giant Post-it paper.

After some nervous laughter and a few attempts at color-swapping, the trainees and junior faculty got down to drawing their science on the wobbly tackboards.

And then, it worked! It totally worked. “I think I saw a lot more interactivity and conversation,” Sankaran said.

A fistful of colorful Sharpies gave Valentina Cigliola a lively launching point for some good conversations about spinal cord repair, rather than just standing there mutely while visitors read and read and read.

Louis-Jan Pilaz used the entire height of the giant Post-it notes to draw a beautifully detailed neuron, with labeled parts explaining how the RNA-binding protein FMRP does some neat tricks during development of the cortex.

Delisa Clay’s schematics of fruit fly cells with too many chromosomes made her research easier to explain. Well, that and maybe a glass of wine.

Jamie Garcia used her cell-by-cell familiarity with the zebrafish to make a bold, clear illustration of notochord development and the fish’s amazing powers of self-repair.

Don’t you think Lihua Wang’s schematic of experimental results is so much clearer than a bunch of panels of tiny text and bar charts?

In the post-retreat survey, Sankaran said people either absolutely loved the draw-your-poster or hated it, but the Love group was much larger.

“Those who hated it felt they couldn’t represent data accurately with hand-drawn charts and graphs,” Sankaran said. “Or that their artistic skills were ‘being judged’.”

A few folks also pointed out that the drawing approach might work against people with a disability of some sort – a concern Sankaran said they will try to address next time.

There WILL be a next time, she added. “I had a few trainees come up to me to say they weren’t sure how it was going to go, but then they said they had fun!”

Post and pix by Karl Leif Bates, whose hand-drawn poster on working with the news office contained no data and was largely ignored.

Durham Traffic Data Reveal Clues to Safer Streets

Ghost bikes are a haunting sight. The white-painted bicycles, often decorated with flowers or photographs, mark the locations where cyclists have been hit and killed on the street.

A white-painted bike next to a street.

A Ghost Bike located in Chapel Hill, NC.

Four of these memorials currently line the streets of Durham, and the statistics on non-fatal crashes in the community are equally sobering. According to data gathered by the North Carolina Department of Transportation, Durham County averaged 23 bicycle and 116 pedestrian crashes per year between 2011 and 2015.

But a team of Duke researchers says these grim crash data may also reveal clues to how to make Durham’s streets safer for bikers, walkers, and drivers.

This summer, a team of Duke students partnered with Durham’s Department of Transportation to analyze and map pedestrian, bicycle and motor vehicle crash data as part of the 10-week Data+ summer research program.

In the Ghost Bikes project, the team created an interactive website that allows users to explore how different factors such as the time of day, weather conditions, and sociodemographics affect crash risk. Insights from the data also allowed the team to develop policy recommendations for improving the safety of Durham’s streets.

“Ideally this could help make things safer, help people stay out of hospitals and save lives,” said Lauren Fox, a Duke cultural anthropology major who graduated this spring, and a member of the Data+ Ghost Bikes team.

A map of Durham County with dots showing the locations of bicycle crashes

A heat map from the team’s interactive website shows areas with the highest density of bicycle crashes, overlaid with the locations of individual bicycle crashes.

The final analysis showed some surprising trends.

“For pedestrians the most common crash isn’t actually happening at intersections, it is happening at what is called mid-block crossings, which happen when someone is crossing in the middle of the road,” Fox said.

To mitigate the risks, the team’s Executive Summary includes recommendations to install crosswalks, median islands and bike lanes on roads with a high density of crashes.

They also found that males, who make up about two-thirds of bicycle commuters over the age of 16, are involved in 75% of bicycle crashes.

“We found that male cyclists over age 16 actually are hit at a statistically higher rate,” said Elizabeth Ratliff, a junior majoring in statistical science. “But we don’t know why. We don’t know if this is because males are riskier bikers, if it is because they are physically bigger objects to hit, or if it just happens to be a statistical coincidence of a very unlikely nature.”
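One common way to check a claim like this is a binomial test, sketched below in Python. This is an assumed method, not necessarily the one the team used, and the crash counts are hypothetical numbers chosen only to match the percentages quoted above. The question it asks: if males were hit in proportion to their two-thirds share of bicycle commuters, how often would chance alone put them in 75 percent or more of crashes?

```python
from math import comb


def binomial_upper_tail(successes, trials, p):
    """Return P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p ** k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical counts for illustration; the article does not give raw totals.
total_crashes = 115
male_crashes = 86  # roughly 75 percent of the hypothetical total
male_share_of_commuters = 2 / 3  # "about two-thirds", per the article

p_value = binomial_upper_tail(male_crashes, total_crashes, male_share_of_commuters)
print(f"{p_value:.3f}")
```

A small result means the gap is unlikely to be a fluke under that assumption, though, as Ratliff notes, it says nothing about why the gap exists.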

To build their website, the team integrated more than 20 sets of crash data from a wide variety of different sources, including city, county, regional and state reports, and in an array of formats, from maps to Excel spreadsheets.

“They had to fit together many different data sources that don’t necessarily speak to each other,” said faculty advisor Harris Solomon, an associate professor of cultural anthropology and global health at Duke. The Ghost Bikes project arose out of Solomon’s research on traffic accidents in India, supported by the National Science Foundation Cultural Anthropology Program.

In Solomon’s Spring 2017 anthropology and global health seminar, students explored the role of the ghost bikes as memorials in the Durham community. The Data+ team approached the same issues from a more quantitative angle, Solomon said.

“The bikes are a very concrete reminder that the data are about lives and deaths,” Solomon said. “By visiting the bikes, the team was able to think about the very human aspects of data work.”

“I was surprised to see how many stakeholders there are in biking,” Fox said. For example, she added, the simple act of adding a bike lane requires balancing the needs of bicyclists, nearby residents concerned with home values or parking spots, and buses or ambulances who require access to the road.

“I hadn’t seen policy work that closely in my classes, so it was interesting to see that there aren’t really simple solutions,” Fox said.

Video: https://www.youtube.com/watch?v=YHIRqhdb7YQ

Data+ is sponsored by Bass Connections, the Information Initiative at Duke, the Social Science Research Institute, the departments of Mathematics and Statistical Science and MEDx.

Other Duke sponsors include DTECH, Duke Health, Sanford School of Public Policy, Nicholas School of the Environment, Development and Alumni Affairs, Energy Initiative, Franklin Humanities Institute, Duke Institute for Brain Sciences, Office for Information Technology and the Office of the Provost, as well as the departments of Electrical & Computer Engineering, Computer Science, Biomedical Engineering, Biostatistics & Bioinformatics and Biology.

Government funding comes from the National Science Foundation. Outside funding comes from Accenture, Academic Analytics, Counter Tools and an anonymous donation.

Community partnerships, data and interesting problems come from the Durham Police Department, Durham Neighborhood Compass, Cary Institute of Ecosystem Studies, Duke Marine Lab, Center for Child and Family Policy, Northeast Ohio Medical University, TD Bank, Epsilon, Duke School of Nursing, University of Southern California, Durham Bicycle and Pedestrian Advisory Commission, Duke Surgery, MyHealth Teams, North Carolina Museum of Art and Scholars@Duke.

Writing by Kara Manke; video by Lauren Mueller and Summer Dunsmore

Pinpointing Where Durham’s Nicotine Addicts Get Their Fix

DURHAM, N.C. — It’s been five years since Durham expanded its smoking ban beyond bars and restaurants to include public parks, bus stops, even sidewalks.

While smoking in the state overall may be down, 19 percent of North Carolinians still light up, particularly the poor and those without a high school or college diploma.

Among North Carolina teens, consumption of electronic cigarettes in particular more than doubled between 2013 and 2015.

Now, new maps created by students in the Data+ summer research program show where nicotine addicts can get their fix.

Studies suggest that tobacco retailers are disproportionately located in low-income neighborhoods.

Living in a neighborhood with easy access to stores that sell tobacco makes it easier to start young and harder to quit.

The end result is that smoking, secondhand smoke exposure, and smoking-related diseases such as lung cancer, are concentrated among the most socially disadvantaged communities.

If you’re poor and lack a high school or college diploma, you’re more likely to live near a store that sells tobacco. Photo from Pixabay.

Where stores that sell tobacco are located matters for health, but for many states such data are hard to come by, said Duke statistics major James Wang.

Tobacco products bring in more than a third of in-store sales revenue at U.S. convenience stores — more than food, beverages, candy, snacks or beer. Despite big profits, more than a dozen states don’t require businesses to get a special license or permit to sell tobacco. North Carolina is one of them.

For these states, there is no convenient spreadsheet from the local licensing agency identifying all the businesses that sell tobacco, said Duke undergraduate Nikhil Pulimood. Previous attempts to collect such data in Virginia involved searching for tobacco retail stores by car.

“They had people physically drive across every single road in the state to collect the data. It took three years,” said team member and Duke undergraduate Felicia Chen.

Led by UNC PhD student in epidemiology Mike Dolan Fliss, the Duke team tried to come up with an easier way.

Instead of collecting data on the ground, they wrote an automated web-crawler program to extract the data from the Yellow Pages websites, using a technique called Web scraping.

By telling the software the type of business and location, they were able to create a database that included the names, addresses, phone numbers and other information for 266 potential tobacco retailers in Durham County and more than 15,500 statewide, including chains such as Family Fare and Circle K.
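
To give a flavor of what such a crawler does once it has downloaded a results page, here is a minimal Python sketch. The HTML snippet, the class names (`listing`, `name`, `address`, `phone`) and the store records are all made up for illustration. The article does not describe the team's actual code, and a real directory site will use different markup.

```python
from html.parser import HTMLParser

# Hypothetical markup and records, invented for this example.
SAMPLE_PAGE = """
<div class="listing">
  <span class="name">Corner Mart</span>
  <span class="address">101 Main St, Durham, NC</span>
  <span class="phone">919-555-0101</span>
</div>
<div class="listing">
  <span class="name">Quick Stop</span>
  <span class="address">202 Oak Ave, Durham, NC</span>
  <span class="phone">919-555-0102</span>
</div>
"""

FIELDS = {"name", "address", "phone"}


class ListingParser(HTMLParser):
    """Collect one dict per listing block found in the page."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._field = None  # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "div" and css_class == "listing":
            self.records.append({})      # start a new store record
        elif tag == "span" and css_class in FIELDS:
            self._field = css_class      # the next text belongs to this field

    def handle_data(self, data):
        if self._field and self.records:
            self.records[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def parse_listings(html):
    parser = ListingParser()
    parser.feed(html)
    return parser.records


records = parse_listings(SAMPLE_PAGE)
for record in records:
    print(record["name"], "|", record["address"], "|", record["phone"])
```

Repeating this for every business type and location of interest, and appending the rows to one table, is roughly how a scraped database of this kind grows.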

Map showing the locations of tobacco retail stores in Durham County, North Carolina.

When they compared their web-scraped data with a pre-existing dataset for Durham County, compiled by a nonprofit called Counter Tools, hundreds of previously hidden retailers emerged on the map.
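
A comparison like this boils down to matching records across two lists. The sketch below assumes, purely for illustration, that stores are matched on a cleaned-up name and street address. The records are invented, and the article does not say how the team actually matched entries.

```python
def normalize(record):
    """Build a comparison key from a store's name and street address."""
    name = " ".join(record["name"].lower().split())
    address = " ".join(record["address"].lower().replace(".", "").split())
    return (name, address)


def newly_found(scraped, existing):
    """Return scraped stores that do not appear in the existing dataset."""
    known = {normalize(r) for r in existing}
    return [r for r in scraped if normalize(r) not in known]


# Hypothetical records, not real data.
existing = [{"name": "Corner Mart", "address": "101 Main St."}]
scraped = [
    {"name": "corner  mart", "address": "101 Main St"},   # same store, messier text
    {"name": "Quick Stop", "address": "202 Oak Ave"},     # missing from existing list
]

new_stores = newly_found(scraped, existing)
print([r["name"] for r in new_stores])
```

In practice, name and address matching is messier than this (abbreviations, suite numbers, renamed stores), so a real comparison would likely need fuzzier rules.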

To determine which stores actually sold tobacco, they fed a computer algorithm data from more than 19,000 businesses outside North Carolina so it could learn how to distinguish, say, convenience stores from grocery stores. When the algorithm received store names from North Carolina, it predicted tobacco retailers correctly 85 percent of the time.

“For example, we could predict that if a store has the word ‘7-Eleven’ in it, it probably sells tobacco,” Chen said.
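
The article does not say which algorithm the team used. As one possible illustration of learning from store names, here is a small word-count naive Bayes classifier in Python. The training names and their labels are hypothetical, and a real model would be trained on thousands of labeled businesses rather than eight.

```python
import math
from collections import Counter


def tokenize(name):
    """Split a store name into lowercase words ("7-Eleven" -> "7", "eleven")."""
    return name.lower().replace("-", " ").split()


def train(examples):
    """examples: list of (store_name, sells_tobacco) pairs."""
    word_counts = {True: Counter(), False: Counter()}
    class_counts = Counter()
    for name, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokenize(name))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, class_counts, vocab


def predict(model, name):
    """Return True if the name looks more like a tobacco retailer than not."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label in (True, False):
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(name):
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return scores[True] > scores[False]


# Hypothetical training data: names and labels are made up for the example.
TRAINING = [
    ("7-Eleven Store 1203", True),
    ("Circle K", True),
    ("Main Street Food Mart", True),
    ("7-Eleven", True),
    ("Green Leaf Organic Grocery", False),
    ("Sunrise Bakery", False),
    ("Downtown Farmers Market", False),
    ("Riverside Organic Bakery", False),
]

model = train(TRAINING)
print(predict(model, "7-Eleven #88"))
print(predict(model, "Oak Hill Bakery"))
```

Checking predictions like these against stores whose status is already known is how an accuracy figure such as the 85 percent above would be measured.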

As a final step, they crosschecked their results using a crowdsourcing service called Amazon Mechanical Turk. They paid people a small fee to search for the stores online to verify that they exist, and to call the stores to ask whether they actually sell tobacco.

Ultimately, the team hopes their methods will help map the more than 336,000 tobacco retailers nationwide.

“With a complete dataset for tobacco retailers around the nation, public health experts will be able to see where tobacco retailers are located relative to parks and schools, and how store density changes from one neighborhood to another,” Wang said.

The team presented their work at the Data+ Final Symposium on July 28 in Gross Hall.

Data+ is sponsored by Bass Connections, the Information Initiative at Duke, the Social Science Research Institute, the departments of mathematics and statistical science and MEDx. This project team was also supported by Counter Tools, a non-profit based in Carrboro, NC.

Writing by Robin Smith; video by Lauren Mueller and Summer Dunsmore

Sizing Up Hollywood's Gender Gap

DURHAM, N.C. — A mere seven-plus decades after she first appeared in comic books in the early 1940s, Wonder Woman finally has her own movie.

In the two months since it premiered, the film has brought in more than $785 million worldwide, making it the highest grossing movie of the summer.

But if Hollywood has seen a number of recent hits with strong female leads, from “Wonder Woman” and “Atomic Blonde” to “Hidden Figures,” it doesn’t signal a change in how women are depicted on screen — at least not yet.

Those are the conclusions of three students who spent ten weeks this summer compiling and analyzing data on women’s roles in American film, through the Data+ summer research program.

The team relied on a measure called the Bechdel test, first depicted by the cartoonist Alison Bechdel in 1985.

The “Bechdel test” asks whether a movie features at least two women who talk to each other about anything besides a man. Surprisingly, a lot of films fail. Art by Srravya [CC0], via Wikimedia Commons.

To pass the Bechdel test, a movie must satisfy three basic requirements: it must have at least two named women in it, they must talk to each other, and their conversation must be about something other than a man.

It’s a low bar. The female characters don’t have to have power, or purpose, or buck gender stereotypes.

Even a movie in which two women speak to each other only briefly in one scene, about nail polish, as was the case with “American Hustle,” gets a passing grade.

And yet more than 40 percent of all U.S. films fail.

The team used data from the bechdeltest.com website, a user-compiled database of over 7,000 movies where volunteers rate films based on the Bechdel criteria. The number of criteria a film passes adds up to its Bechdel score.
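
As a rough sketch of how such a score can be computed, the Python function below counts the criteria in order and stops at the first one a film misses. Treating the criteria as cumulative is an assumption made here, on the grounds that each one only makes sense if the previous one holds. The function and parameter names are hypothetical.

```python
def bechdel_score(two_named_women, talk_to_each_other, about_something_besides_a_man):
    """Count how many Bechdel criteria a film meets, in order (0 to 3)."""
    score = 0
    for met in (two_named_women, talk_to_each_other, about_something_besides_a_man):
        if not met:
            break          # later criteria cannot count once one is missed
        score += 1
    return score


def passes(score):
    """A film passes the test only if it meets all three criteria."""
    return score == 3


# Two named women who talk to each other, but only about a man.
example = bechdel_score(True, True, False)
print(example, passes(example))
```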

“Spider Man,” “The Jungle Book,” “Star Trek Beyond” and “The Hobbit” all fail by at least one of the criteria.

Films are more likely to pass today than they were in the 1970s, according to a 2014 study by FiveThirtyEight, the data journalism site created by Nate Silver.

The authors of that study analyzed 1,794 movies released between 1970 and 2013. They found that the number of passing films rose steadily from 1970 to 1995 but then began to stall.

In the past two decades, the proportion of passing films hasn’t budged.

Since the mid-1990s, the proportion of films that pass the Bechdel test has flatlined at about 50 percent.

The Duke team was also able to obtain data from a 2016 study of the gender breakdown of movie dialogue in roughly 2,000 screenplays.

Men played two out of three top speaking roles in more than 80 percent of films, according to that study.

Using data from the screenplay study, the students plotted the relationship between a movie’s Bechdel score and the number of words spoken by female characters. Perhaps not surprisingly, films with higher Bechdel scores were also more likely to achieve gender parity in terms of speaking roles.

“The Bechdel test doesn’t really tell you if a film is feminist,” but it’s a good indicator of how much women speak, said team member Sammy Garland, a Duke sophomore majoring in statistics and Chinese.

Previous studies suggest that men do twice as much talking in most films, a proportion that has remained largely unchanged since 1995. The reason, researchers say, is not that male characters are individually more talkative, but that there are simply more male roles.

“To close the gap of speaking time, we just need more female characters,” said team member Selen Berkman, a sophomore majoring in math and computer science.

Achieving that, they say, ultimately comes down to who writes the script and chooses the cast.

The team did a network analysis of patterns of collaboration among 10,000 directors, writers and producers. Two people are joined whenever they worked together on the same movie. The 13 most influential and well-connected people in the American film industry were all men, whose films had average Bechdel scores ranging from 1.5 to 2.6. Because a film needs a score of 3 to pass, that means no top producer is regularly making films that pass the Bechdel test.
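
Here is a minimal Python sketch of that kind of collaboration network. The people, films and scores are invented, and ranking by number of distinct collaborators is an assumption: the article does not say which measure of influence the team used.

```python
from collections import defaultdict
from itertools import combinations


def build_network(credits):
    """credits: mapping of film title -> list of people credited on it."""
    neighbors = defaultdict(set)
    for people in credits.values():
        # Join every pair of people who worked on the same film.
        for a, b in combinations(sorted(set(people)), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def rank_by_connections(neighbors):
    """Most-connected people first; ties broken alphabetically."""
    return sorted(neighbors, key=lambda p: (-len(neighbors[p]), p))


def average_score(person, credits, scores):
    """Mean Bechdel score of the films a person is credited on."""
    films = [title for title, people in credits.items() if person in people]
    return sum(scores[title] for title in films) / len(films)


# Hypothetical data for illustration only.
CREDITS = {
    "Film A": ["Avery", "Blake", "Casey"],
    "Film B": ["Avery", "Devon"],
    "Film C": ["Blake", "Casey"],
}
SCORES = {"Film A": 3, "Film B": 1, "Film C": 2}

network = build_network(CREDITS)
ranking = rank_by_connections(network)
print(ranking)
print(average_score(ranking[0], CREDITS, SCORES))
```

An average below 3 for the top-ranked person, as in this toy example, is the pattern the team describes: well-connected, but not consistently making films that pass.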

“What this tells us is there is no one big influential producer who is moving the needle. We have no champion,” Garland said.

Men and women were equally represented in fewer than 10 percent of production crews.

But assembling a more gender-balanced production team in the early stages of a film can make a difference, research shows. Films with more women in top production roles have female characters who speak more too.

“To better represent women on screen you need more women behind the scenes,” Garland said.

Dollar for dollar, making an effort to close the Hollywood gender gap can mean better returns at the box office too. Films that pass the Bechdel test earn $2.68 for every dollar spent, compared with $2.45 for films that fail — a 23-cent better return on investment, according to FiveThirtyEight.
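
The arithmetic behind that comparison is simple enough to check in a few lines of Python. The two return figures come from the article; the relative gain is derived from them here and is not a number FiveThirtyEight reported.

```python
# Returns per dollar spent, in cents, as quoted in the article.
PASS_RETURN_CENTS = 268
FAIL_RETURN_CENTS = 245

difference_cents = PASS_RETURN_CENTS - FAIL_RETURN_CENTS
relative_gain = difference_cents / FAIL_RETURN_CENTS  # gain relative to failing films

print(f"{difference_cents} cents more per dollar spent")
print(f"{relative_gain:.1%} higher return than films that fail")
```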

Other versions of the Bechdel test have been proposed to measure race and gender in film more broadly. The advantage of analyzing the Bechdel data is that thousands of films have already been scored, said English major and Data+ team member Aaron VanSteinberg.

“We tried to watch a movie a week, but we just didn’t have time to watch thousands of movies,” VanSteinberg said.

A new report on diversity in Hollywood from the University of Southern California suggests the same lack of progress is true for other groups as well. In nearly 900 top-grossing films from 2007 to 2016, disabled, Latino and LGBTQ characters were consistently underrepresented relative to their makeup in the U.S. population.

Berkman, Garland and VanSteinberg were among more than 70 students selected for the 2017 Data+ program, which included data-driven projects on photojournalism, art restoration, public policy and more.

They presented their work at the Data+ Final Symposium on July 28 in Gross Hall.

Data+ is sponsored by Bass Connections, the Information Initiative at Duke, the Social Science Research Institute, the departments of mathematics and statistical science and MEDx. 

Writing by Robin Smith; video by Lauren Mueller and Summer Dunsmore

Mapping Electricity Access for a Sixth of the World's People

DURHAM, N.C. — Most Americans can charge their cell phones, raid the fridge or boot up their laptops at any time without a second thought.

Not so for the 1.2 billion people — roughly 16 percent of the world’s population — with no access to electricity.

Despite improvements over the past two decades, an estimated 780 million people will still be without power by 2030, especially in rural parts of sub-Saharan Africa, Asia and the Pacific.

To get power to these people, officials first need to locate them. But for much of the developing world, reliable, up-to-date data on electricity access is hard to come by.

Researchers say remote sensing can help.

For ten weeks from May through July, a team of Duke students in the Data+ summer research program worked on developing ways to assess electricity access automatically, using satellite imagery.

“Ground surveys take a lot of time, money and manpower,” said Data+ team member Ben Brigman. “As it is now, the only way to figure out if a village has electricity is to send someone out there to check. You can’t call them up or put out an online poll, because they won’t be able to answer.”

Satellite image of India at night. Large parts of the Indian countryside still aren’t connected to the grid, but remote sensing and machine learning could help pinpoint people living without power. Credits: NASA Earth Observatory images by Joshua Stevens, using Suomi NPP VIIRS data from Miguel Román, NASA’s Goddard Space Flight Center

Led by researchers in the Energy Data Analytics Lab and the Sustainable Energy Transitions Initiative, “the initial goal was to create a map of India, showing every village or town that does or does not have access to electricity,” said team member Trishul Nagenalli.

Electricity makes it possible to pump groundwater for crops, refrigerate food and medicines, and study or work after dark. But in parts of rural India, where Nagenalli’s parents grew up, many households use kerosene lamps to light homes at night, and wood or animal dung as cooking fuel.

Fires from overturned kerosene lamps are not uncommon, and indoor air pollution from cooking with solid fuels contributes to low birth weight, pneumonia and other health problems.

In 2005, the Indian government set out to provide electricity to all households within five years. Yet a quarter of India’s population still lives without power.

Ultimately, the goal is to create a machine learning algorithm — basically a set of instructions for a computer to follow — that can recognize power plants, irrigated fields and other indicators of electricity in satellite images, much like the algorithms that recognize your face on Facebook.

Rather than being programmed with specific instructions, machine learning algorithms “learn” from large amounts of data.

This summer the researchers focused on the unsung first step in the process: preparing the training data.

Satellite image of a power plant in Phoenix, Arizona

Fellow Duke students Gouttham Chandrasekar, Shamikh Hossain and Boning Li were also part of the effort. First they compiled publicly available satellite images of U.S. power plants. Rather than painstakingly framing and labeling the plants in each photo themselves, they tapped the powers of the Internet to outsource the task and hired other people to annotate the images for them, using a crowdsourcing service called Amazon Mechanical Turk.

So far, they have collected more than 8,500 image annotations of different kinds of power plants, including oil, natural gas, hydroelectric and solar.

The team also compiled firsthand observations of the electrification rate for more than 36,000 villages in the Indian state of Bihar, which has one of the lowest electrification rates in the country. For each village, they also gathered satellite images showing light intensity at night, along with density of green land and other indicators of irrigated farms, as proxies for electricity consumption.

Using these data sets, the goal is to develop a computer algorithm that, through machine learning, teaches itself to detect similar features in unlabeled images and to distinguish towns and villages that are connected to the grid from those that aren’t.
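
The final algorithm has not been built yet, so the Python sketch below is only a toy illustration of what learning from labeled examples means. It uses a nearest-centroid rule on two made-up features per village (night-light intensity and greenness, each scaled from 0 to 1). The data, the features and the choice of method are all assumptions for the example, not the team's approach.

```python
def centroid(rows):
    """Average each feature across a group of villages."""
    return tuple(sum(col) / len(rows) for col in zip(*rows))


def train(villages):
    """villages: list of ((night_light, greenness), electrified) pairs.

    Both labels need at least one example for the centroids to be meaningful.
    """
    groups = {True: [], False: []}
    for features, electrified in villages:
        groups[electrified].append(features)
    return {label: centroid(rows) for label, rows in groups.items()}


def predict(centroids, features):
    """Assign the label whose average village is closest to these features."""
    def squared_distance(center):
        return sum((x - c) ** 2 for x, c in zip(features, center))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Hypothetical labeled villages: (night_light, greenness), electrified?
TRAINING = [
    ((0.8, 0.6), True),
    ((0.7, 0.7), True),
    ((0.1, 0.2), False),
    ((0.2, 0.1), False),
]

centroids = train(TRAINING)
print(predict(centroids, (0.9, 0.5)))   # bright at night, fairly green
print(predict(centroids, (0.05, 0.3)))  # dark at night
```

A system working directly from satellite images would be far more involved, but the principle is the same: labeled examples go in, and a rule for labeling new cases comes out.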

“We would like to develop our final algorithm to essentially go into a developing country and analyze whether or not a community there has access to electricity, and if so what kind,” Chandrasekar said.

The proportion of households connected to the grid in more than 36,000 villages in Bihar, India

The project is far from finished. During the 2017-2018 school year, a Bass Connections team will continue to build on their work.

The summer team presented their research at the Data+ Final Symposium on July 28 in Gross Hall.

Data+ is sponsored by Bass Connections, the Information Initiative at Duke, the Social Science Research Institute, the departments of mathematics and statistical science and MEDx. This project team was also supported by the Duke University Energy Initiative.

Writing by Robin Smith; video by Lauren Mueller and Summer Dunsmore
