Duke Research Blog

Following the people and events that make up the research community at Duke.

Category: Data (Page 1 of 5)

Game-Changing App Explores Conservation’s Future

In the first week of February, students, experts and conservationists from across the country were brought together for the second annual Duke Blueprint symposium. Focused on the theme of “Nature and Progress,” this conference hoped to harness the power of diversity and interdisciplinary collaboration to develop solutions to some of the world’s most pressing environmental challenges.

Scott Loarie spoke at Duke’s Mary Duke Biddle Trent Semans Center.

One of the most exciting parts of this symposium’s first night was without a doubt its all-star cast of keynote speakers. The experiences and advice each of these researchers had to offer were far too diverse for any single blog post to capture, but one particularly interesting presentation (full video below) was that of National Geographic fellow Scott Loarie—co-director of the game-changing iNaturalist app.

iNat, as Loarie explained, is a collaborative citizen-science network with aspirations of developing a comprehensive mapping of all terrestrial life. Any time they go outside, users of this app can photograph and upload pictures of any wildlife they encounter. A network of scientists and experts from around the world then helps the users identify their finds, generating data points on an interactive, user-generated map of various species’ ranges.

Simple, right? Multiply that by 500,000 users worldwide, though, and it’s easy to see why researchers like Loarie are excited by the possibilities an app like this can offer. The software first went live in 2008, and since then its user base has roughly doubled each year. This has meant the generation of over 8 million data points of 150,000 different species, including one-third of all known vertebrate species and 40% of all known species of mammal. Every day, the app catalogues around 15 new species.

“We’re slowly ticking away at the tree of life,” Loarie said.

Through iNaturalist, researchers are able to analyze and connect to data in ways never before thought possible. Changes to environments and species’ distributions can be observed or modeled in real time and with unheard-of collaborative opportunities.

To demonstrate the power of this connectedness, Loarie recalled one instance of a citizen scientist in Vietnam who took a picture of a snail. This species had never been captured, never been photographed, hadn’t been observed in over a century. One of iNat’s users recognized it anyway. How? He’d seen it in one of the journals from Captain James Cook’s 18th-century voyage to circumnavigate the globe.

It’s this kind of interconnectivity that demonstrates not just the potential of apps like iNaturalist, but also the power of collaboration and the possibilities symposia like Duke Blueprint offer. Bridging gaps, tearing down boundaries, building up bonds—these are the heart of conservationism’s future. Nature and Progress, working together, pulling us forward into a brighter world.

Post by Daniel Egitto

 

 

Duke Scholars Bridge Disciplines to Tackle Big Questions

A visualization showing faculty as dots that are connected by lines

This visualization, created by James Moody and the team at the Duke Network Analysis Center, links faculty from across schools and departments who serve together on Ph.D. committees. An interactive version is available here.

When the next big breakthrough in cancer treatment is announced, no one will ask whether the researchers are pharmacologists, oncologists or cellular biologists – and chances are, the team will represent all three.

In the second annual Scholars@Duke Visualization Challenge, Duke students explored how scholars across campus are drawing from multiple academic disciplines to tackle big research questions.

“I’m often amazed at how gifted Duke faculty are and how they can have expertise in multiple fields, sometimes even fields that don’t seem to overlap,” said Julia Trimmer, Director of Faculty Data Systems and Analysis at Duke.

In last year’s challenge, students dug into Scholars@Duke publication data to explore how Duke researchers collaborate across campus. This year, they were provided with additional data on Ph.D. dissertation committees and asked to focus on how graduate education and scholarship are reaching across departmental boundaries.

“The idea was to see if certain units or disciplines contributed faculty committee members across disciplines or if there’s a lot of discipline ‘overlap,’” Trimmer said.

The winning visualization, created by graduate student Matthew Epland, examines how Ph.D. committees span different fields. In this interactive plot, each marker represents an academic department. The closer together markers are, the more likely it is that a faculty member from one department will serve on the committee of a student in the other department.

Epland says he was intrigued to see the tight-knit community of neuroscience-focused departments that span different schools, including psychology and neuroscience, neurobiology, neurology, and psychiatry and behavioral sciences. Not surprisingly, many of the faculty in these departments are members of the Duke Institute for Brain Sciences (DIBS).

Duke schools appear as dots and are connected by lines of different thicknesses

Aghil Abed Zadeh and Varda F. Hagh analyzed publication data to visualize the extent to which faculty at different Duke schools collaborate with one another. The size of each dot represents the number of publications from each school, and thickness of each line represents the number of faculty collaborations between the connected schools.

Sociology Professor James Moody and the team at the Duke Network Analysis Center took a similar approach, creating a network of individual faculty members who are linked by shared students. Faculty who sit on committees in only one field are bunched together, highlighting researchers who bridge different disciplines. The size of each marker represents the extent to which each researcher sits “between” two fields.

The map shows a set of strong ties within the natural sciences and within the humanities, but few links between the two groups. Moody points out that philosophy is a surprising exception to this rule, lying closer to the natural sciences cluster than to the humanities cluster.

“At Duke, the strong emphasis on philosophy of science creates a natural link between philosophy and the natural sciences,” Moody said.
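The idea of a researcher sitting “between” fields can be made concrete with a small sketch. The committee data below is invented, and betweenness centrality (computed here with Brandes' algorithm) is only one plausible way to measure it; the post does not say which measure the Network Analysis Center actually used.

```python
from collections import defaultdict, deque
from itertools import combinations

# Hypothetical committee data: each student maps to the faculty on their committee.
committees = {
    "student_1": ["A", "B", "C"],
    "student_2": ["B", "C"],
    "student_3": ["C", "D"],
    "student_4": ["D", "E"],
}

def build_graph(committees):
    """Link two faculty members whenever they share a student."""
    graph = defaultdict(set)
    for members in committees.values():
        for u, v in combinations(members, 2):
            graph[u].add(v)
            graph[v].add(u)
    return graph

def betweenness(graph):
    """Betweenness centrality for an unweighted, undirected graph (Brandes)."""
    score = {v: 0.0 for v in graph}
    for source in graph:
        stack = []
        preds = {v: [] for v in graph}
        paths = {v: 0 for v in graph}   # number of shortest paths from source
        dist = {v: -1 for v in graph}
        paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:                    # breadth-first search from source
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        while stack:                    # accumulate dependencies, farthest first
            w = stack.pop()
            for v in preds[w]:
                delta[v] += paths[v] / paths[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Every undirected pair was counted from both endpoints.
    return {v: s / 2 for v, s in score.items()}

scores = betweenness(build_graph(committees))
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, scores[name])
```

In this toy network, faculty member C connects the A-B cluster to D and E, so C gets the highest score and would be drawn with the largest marker.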

Duke graduate student Aghil Abed Zadeh teamed up with Varda F. Hagh, a student at Arizona State University, to create elegant maps linking schools and departments by shared authorship. The size of each marker represents the number of publications in that school or department, and the thickness of the connecting lines indicates the number of shared authorships.

“It is interesting to see how connected law school and public policy school are. They collaborate with many of the sciences as well, which is a surprising fact,” Zadeh said. “On the other hand, we see Divinity school, one of the oldest at Duke, which is isolated and not connected to others at all.”

The teams presented their visualizations Jan. 20 at the Duke Research Computing Symposium.

Post by Kara Manke

 

Generating Winning Sports Headlines

What if there were a scientific way to come up with the most interesting sports headlines? With the development of computational journalism, this could be possible very soon.

Dr. Jun Yang is a database and data-intensive computing researcher and professor of Computer Science at Duke. One of his latest projects is computational journalism, in which he and other computer science researchers are considering how they can contribute to journalism with new technological advances and the ever-increasing availability of data.

An exciting and very relevant part of his project is based on raw data from Duke men’s basketball games. With computational journalism, Yang and his team of researchers have been able to generate diverse player or team factoids using the statistics of the games.

Grayson Allen headed for the hoop.


An example factoid might be that, in the first 8 games of this season, Duke has won 100% of its games when Grayson Allen has scored over 20 points. While this fact is obvious, since Duke is undefeated so far this season, Yang’s programs will also be able to generate very obscure factoids about each and every player that could lead to unique and unprecedented headlines.

While these statistics relating player and team success can only imply correlation, and not necessarily causation, they definitely have potential to be eye-catching sports headlines.
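A factoid of this kind boils down to a conditional win rate. The sketch below shows one way it might be computed; the box-score numbers, field names and function names are all made up for illustration and are not taken from Yang's system.

```python
# Hypothetical box scores for eight games; the point totals are invented.
games = [
    {"allen_points": 22, "won": True},
    {"allen_points": 12, "won": True},
    {"allen_points": 25, "won": True},
    {"allen_points": 18, "won": True},
    {"allen_points": 31, "won": True},
    {"allen_points": 21, "won": True},
    {"allen_points": 9, "won": True},
    {"allen_points": 23, "won": True},
]

def conditional_win_rate(games, stat, threshold):
    """Return (wins, games) over the games where `stat` exceeded `threshold`."""
    subset = [g for g in games if g[stat] > threshold]
    if not subset:
        return None
    return sum(g["won"] for g in subset), len(subset)

def factoid(games, player, stat, threshold):
    """Turn the conditional win rate into a headline-style sentence."""
    result = conditional_win_rate(games, stat, threshold)
    if result is None:
        return None
    wins, played = result
    return (f"Duke has won {wins / played:.0%} of its {played} games "
            f"when {player} scored over {threshold} points")

print(factoid(games, "Grayson Allen", "allen_points", 20))
```

Looping the same function over many players, statistics and thresholds would produce a large pool of candidate factoids, which is why ranking them becomes the harder problem.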

Extracting factoids hasn’t been a particularly challenging part of the project, but developing heuristics to choose which factoids are the most relevant and usable has been more difficult.

So far, the heuristics have relied on scoring criteria based on what seems intuitively impressive to the researcher. Another way to evaluate the strength of a factoid would be to rank the types of headlines that are viewed most. With this method, heuristics could, in theory, be based on past successes rather than on one researcher’s intuition.

Something else to consider is which types of factoids are more powerful. For example, what’s better: a bolder claim in a shorter period of time, or a less bold claim but over many games or even seasons?

The goal of this project is to continue to analyze data from the Duke men’s basketball team, generate interesting factoids, and put them on a public website about 10-15 minutes after each game.

Looking forward, computational journalism has huge potential for Duke men’s basketball, sports in general, and even for generating other news factoids. Even further, computational journalism and its scientific methodology might lead to the ability to quickly fact-check political claims.

Right now, however, it is fascinating to know that computer science has the potential to touch our lives in some pretty unexpected ways. As our current men’s basketball beginning-of-season winning streak continues, who knows what unprecedented factoids Jun Yang and his team are coming up with.

By Nina Cervantes

Duke’s Researchers Are 1 Percent of the Top 1 Percent

This year’s listing of the world’s most-cited researchers is out from Clarivate Analytics, and Duke has 34 names on the list of 3,400 researchers from 21 fields of science and social science.

Having your publication cited in a paper written by other scientists is a sign that your work is significant and advances the field. The highly-cited list includes the top 1 percent of scientists cited by others in the years 2005 to 2015.

“Citations by other scientists are an acknowledgement that the work our faculty has published is significant to their fields,” said Vice Provost for Research Lawrence Carin. “In research, we often talk about ‘standing on the shoulders of giants,’ as a way to explain how one person’s work builds on another’s. For Duke to have so many of our people in the top 1 percent indicates that they are leading their fields and their work is indeed something upon which others can build.”

In addition to the Durham researchers, Duke-NUS, our medical school in Singapore, claims another 13 highly cited scientists.

The highly-cited scientists on the Durham campus are:

Barton Haynes

CLINICAL MEDICINE
Robert Califf
Christopher Granger
Kristin Newby
Christopher O’Connor
Erik Magnus Ohman
Manesh Patel
Michael Pencina
Eric Peterson

ECONOMICS AND BUSINESS
Dan Ariely
John Graham
Campbell Harvey

Drew Shindell

ENVIRONMENT/ECOLOGY
John Terborgh
Mark Wiesner

GEOSCIENCES
Drew Shindell

IMMUNOLOGY
Barton Haynes

MATHEMATICS
James Berger

Georgia Tomaras


MICROBIOLOGY
Bryan Cullen
Barton Haynes
David Montefiori
Georgia Tomaras

PHARMACOLOGY & TOXICOLOGY
Robert Lefkowitz

PHYSICS
David R. Smith

PLANT AND ANIMAL SCIENCE
Philip Benfey

Terrie Moffitt


PSYCHIATRY & PSYCHOLOGY
Adrian Angold
Avshalom Caspi
William E. Copeland
E. J. Costello
Geraldine Dawson
Richard S.E. Keefe
Joseph P. McEvoy
Terrie E. Moffitt

SOCIAL SCIENCES (GENERAL)
Deverick Anderson
Kelly Brownell
Michael Pencina

Opportunities at the Intersection of Technology and Healthcare

What’d you do this Halloween?

I attended a talk on the intersection of technology and healthcare by Dr. Erich Huang, who is an assistant professor of Biostatistics & Bioinformatics and Assistant Dean for Biomedical Informatics. He’s also the new co-director of Duke Forge, a health data science research group.

This was not a conventional Halloween activity by any means, but I felt lucky to be exposed to this impactful research surrounded by views of the Duke forest in fall in Penn Pavilion at IBM-Duke Day.


Erich Huang, M.D., Ph.D., is the co-director of Duke Forge, our new health data effort.

Dr. Huang began his talk with a statistic: only six out of 53 landmark cancer biology research papers are reproducible. This fact was shocking (and maybe a little bit scary?), considering that these papers serve as the foundation for saving cancer patients’ lives. Dr. Huang said that it’s time to raise standards for cancer research.

What is his proposed solution? Using data provenance, which is essentially a historical record of data and its origins, when dealing with important biomedical data.

He mentioned Duke Data Service (DukeDS), which is an information technology service that features data provenance for scientific workflows. With DukeDS, researchers are able to share data with approved team members across campus or across the world.

Next, Dr. Huang demonstrated the power of data science in healthcare by describing an example patient. Mr. Smith is 63 years old with a history of heart attacks and diabetes. He has been having trouble sleeping and his feet have been red and puffy. Mr. Smith meets the criteria for heart failure and appropriate interventions, such as a heart pump and blood thinners.

A problem that many patients at risk of heart failure face is forgetting to take their blood thinners. Using smart pill bottles with automatic tracking from a company called Pillsy, we could track Mr. Smith’s medication taking and record this information on a blockchain, which stores blocks of information that are linked together so that each block points to an older version of that information. This type of technology might allow for the recalculation of dosage so that Mr. Smith could take the appropriate amount after a missed dose of a blood thinner.
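To see what “each block points to an older version” means in practice, here is a minimal sketch of a hash-linked log of dose records. It is not Pillsy's or Duke's system, the record fields are hypothetical, and it leaves out everything that makes a real blockchain distributed (networking, consensus). It only shows why linked hashes make past records tamper-evident.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first block

def block_hash(record, prev_hash):
    """Hash a record together with the hash of the block before it."""
    payload = json.dumps({"record": record, "prev_hash": prev_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def append_record(chain, record):
    """Add a block that points back to the current end of the chain."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    chain.append({
        "record": record,
        "prev_hash": prev_hash,
        "hash": block_hash(record, prev_hash),
    })

def verify_chain(chain):
    """Recompute every hash; any edited record or broken link fails the check."""
    prev_hash = GENESIS_HASH
    for block in chain:
        if block["prev_hash"] != prev_hash:
            return False
        if block["hash"] != block_hash(block["record"], block["prev_hash"]):
            return False
        prev_hash = block["hash"]
    return True

# Hypothetical dose events for the example patient.
chain = []
append_record(chain, {"patient": "Mr. Smith", "day": 1, "taken": True})
append_record(chain, {"patient": "Mr. Smith", "day": 2, "taken": False})
append_record(chain, {"patient": "Mr. Smith", "day": 3, "taken": True})

print(verify_chain(chain))
```

If someone later edits the day 2 record to hide the missed dose, its stored hash no longer matches and `verify_chain` returns `False`, which is the traceability property the talk was pointing to.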

These uses of data science, and specifically blockchain and data provenance, show great opportunity at the intersection of technology and healthcare. Having access to secure and traceable data can lead to research being more reproducible and therefore reliable.

At the end of his presentation, Dr. Huang suggested as much collaboration in research between IBM and Duke as possible, especially in his field. Seeing that the Research Triangle Park location of IBM is the largest IBM development site in the world and is conveniently located near one of the best research universities in the nation, his suggestion makes complete sense.

By Nina Cervantes

Who Gets Sick and Why?

During his presentation as part of the Chautauqua lecture series, Duke sociologist Dr. Tyson Brown explained his research exploring the ways racial inequalities affect a person’s health later in life. His project mainly looks at the Baby Boomer generation, Americans born between 1946 and 1964.

With incredible increases in life expectancy, from 47 years in 1900 to 79 today, elderly people are beginning to form a larger percentage of the population. However, among black people, the average life expectancy is three and a half years shorter.

“Many of you probably do not think that three and a half years is a lot,” Brown said. “But imagine how much less time that is with your family and loved ones. In the end, I think all of us agree we want those extra three and a half years.”

Not only does the black population in America have shorter lives on average but they also tend to have sicker lives with higher blood pressures, greater chances of stroke, and higher probability of diabetes. In total, the number of deaths that would be prevented if African-American people had the same life expectancy as white people is 880,000 over a nine-year span. Now, the question Brown has challenged himself with is “Why does this discrepancy occur?”

Brown said he first concluded that health habits and behaviors do not create this life expectancy gap because white and black people have similar rates of smoking, drinking, and illegal drug use. He then decided to explore socioeconomic status. He discovered that as education increases, mortality decreases. And as income increases, self-rated health increases. He said that for every dollar a white person makes, a black person makes 59 cents.

This inequality in income points to the possible cause for the racial inequality in health, he said. Additionally, in terms of wealth instead of income, a black person has 6 cents compared to the white person’s dollar. Possibly even more concerning than this disparity is the fact that it has gotten worse, not better, over time. Before the 2006 recession, blacks had 10-12 cents of wealth for every white person’s dollar.

Brown believes that this financial stress is one of many stressors in black lives, including chronic stressors, everyday discrimination, traumatic events, and neighborhood disorder, all of which affect health.

Over time, these stressors create something called physiological dysregulation, otherwise known as wear and tear, through repeated activation of the stress response, he said. Recognition of the prevalence of these stressors in black lives has led to Brown’s next focus: the extent of the effect of stressors on health. For his data, he uses the Health and Retirement Study and self-rated health (proven to predict mortality better than physician evaluations). For his methods, he employs structural equation modeling. Racial inequalities in socioeconomic resources, stressors and biomarkers of physiological dysregulation collectively explain 87% of the health gap, leaving the remaining 13% to any number of other causes.

Brown said his next steps include using longitudinal and macro-level data on structural inequality to understand how social inequalities “get under the skin” over a person’s lifetime. He suggests that the next steps for society, organizations, and the government to decrease this racial discrepancy rest in changing economic policy, increasing wages, guaranteeing work, and reducing residential segregation.

Post by Lydia Goff

Smoking Weed: the Good, Bad and Ugly

DURHAM, N.C. — Research suggests that the earlier someone is exposed to weed, the worse it is for them.

Very early on in our life, we develop basic motor and sensory functions. In adolescence, our teenage years, we start developing more complex functions — cognitive, social and emotional functions. These developments differ based on one’s experience growing up — their family, their school, their relationships — and are fundamental to our growth as healthy human beings.

This process has been shown to be impaired when marijuana is introduced, according to Dr. Diana Dow-Edwards of SUNY Downstate Medical Center.

Sure, a lot of people may think marijuana isn’t so bad…but think again. At an Oct. 11 seminar at Duke’s Center on Addiction & Behavior Change, Dow-Edwards enlightened those who attended with correlations between smoking the reefer and things like IQ, psychosis and memory.

(https://media.makeameme.org/created/Littering-and-SMOKIN.jpg)

Dow-Edwards is currently a professor of physiology and pharmacology and clearly knows her stuff. She was throwing complicated graphs and large studies at us, all backing up her primary claim: the “dose-response relationship.” Basically the more you smoke (“dose”), the more of a biological effect it will have on you (“response”).

Studies of pot users who started after adolescence showed that occasional smoking did not cause a big change in IQ, and frequent smoking affected IQ a little. However, smoking during adolescence correlated with a huge drop in adult IQ, around 7 points for infrequent smokers and 10 points for frequent smokers. Here we see how both age and frequency play a role in weed’s effect on cognition. So if you are going to make the choice to light up, maybe wait until your executive functions mature around 24 years old.

Smoking weed earlier in life also showed a strong correlation with an earlier onset of psychosis, a very serious mental disorder in which you start to lose sense of reality. Definitely not good. I’m not trynna get diagnosed with psychosis any time soon!

One perhaps encouraging finding for you smokers out there was that marijuana really had no effect on long-term memory. Non-smokers were better at verbal learning than heavy smokers…until after a three-week abstinence break, when the heavy smokers’ memories recovered to match the control group’s. So while smoking weed when you have a test coming up maybe isn’t the best idea, there’s not necessarily a need to fear in the long run.

(Hanson et al, 2010)

A similar study showed that signs of depression and anxiety also normalized after 28 days of not smoking. Don’t get too hyped though, because even after the abstinence period, there was still “persistent impulsivity and reduced reward responses,” as well as a drop in attention accuracy.

A common belief about weed is that it is not addictive, but it actually is. What happens is that after repeatedly smoking, feeling high no longer equates to feeling better than normal, but rather being sober equates to feeling worse than normal. This can lead to irritability, reduced appetite, and sleeplessness. Up to half of teens who smoke pot daily become dependent, and in broader terms, 9 percent of people who just experiment become dependent.

In summary, “marijuana interferes with normal brain development and maturation.” While it’s not going to kill you, it does affect your cognitive functions. Plus, you are at a higher risk for mental disorders like psychosis and future dependence. So choose wisely, my friends.

By Will Sheehan


Students Bring Sixty Years of Data to Life on the Web

For fields like environmental science, collecting data is hard.

Fall colors by Mariel Carr

Fall colors in the Hubbard Brook Experimental Forest, in New Hampshire’s White Mountains.

Gathering results on a single project can mean months of painstaking measurements, observations and notes, likely in limited conditions, hopefully to be published in a highly specialized journal with a target audience made up mostly of just other specialists in the field.

That’s why when, this past summer, Duke students Devri Adams, Camila Restrepo and Annie Lott set out with graduate students Richard Marinos, Matt Ross and Professor Emily Bernhardt to combine over six decades of data on the Hubbard Brook Experimental Forest into a workable, aesthetically pleasing visualization website, they were really breaking new ground in the way the public can appreciate this truly massive store of information.

The site’s navigation shows users what kinds of data they might explore in beautiful fashion.

Spanning some 8,000 acres of New Hampshire’s sprawling White Mountain National Forest, Hubbard Brook has captured the thoughts and imaginations of generations of environmental researchers. Over 60 years of study and authorized experimentation in the region have brought us some of the longest continuous environmental data sets ever collected, tracking changes across a variety of factors for the second half of the 20th century.

Now, for the first time ever, this data has been brought together into a comprehensive, agile interface available to specialists and students alike. This website is developed with the user constantly in mind. At once in-depth and flexible, each visualization is designed so that a casual viewer can instantly grasp a variety of factors all at the same time—pH, water source, molecule size and more all made clearly evident from the structures of the graphs.

Additionally, this website’s axes can be as flexible as you need them to be; users can manipulate them to compare any two variables they want, allowing for easy study of all potential correlations.

All code used to build this website has been made entirely open source, and a large chunk of the site was developed with undergrads and high schoolers in mind. The team hopes to supplement textbook material with a series of five “data stories” exploring different studies done on the forest. The effects of acid rain, deforestation, dilutification, and calcium experimentation all come alive on the website’s interactive graphs, demonstrating the challenges and changes this forest has faced since studies on it first began.

The team hopes to have created a useful and user-friendly interface that’s easy for anyone to use. By bringing data out of the laboratory and onto the webpage, this project brings us one step further in the movement to make research accessible to and meaningful for the entire world.

Post by Daniel Egitto

Durham Traffic Data Reveal Clues to Safer Streets

Ghost bikes are a haunting sight. The white-painted bicycles, often decorated with flowers or photographs, mark the locations where cyclists have been hit and killed on the street.

A white-painted bike next to a street.

A Ghost Bike located in Chapel Hill, NC.

Four of these memorials currently line the streets of Durham, and the statistics on non-fatal crashes in the community are equally sobering. According to data gathered by the North Carolina Department of Transportation, Durham County averaged 23 bicycle and 116 pedestrian crashes per year between 2011 and 2015.

But a team of Duke researchers says these grim crash data may also reveal clues for how to make Durham’s streets safer for bikers, walkers, and drivers.

This summer, a team of Duke students partnered with Durham’s Department of Transportation to analyze and map pedestrian, bicycle and motor vehicle crash data as part of the 10-week Data+ summer research program.

In the Ghost Bikes project, the team created an interactive website that allows users to explore how different factors such as the time-of-day, weather conditions, and sociodemographics affect crash risk. Insights from the data also allowed the team to develop policy recommendations for improving the safety of Durham’s streets.

“Ideally this could help make things safer, help people stay out of hospitals and save lives,” said Lauren Fox, a Duke cultural anthropology major who graduated this spring, and a member of the Data+ Ghost Bikes team.

A map of Durham County with dots showing the locations of bicycle crashes

A heat map from the team’s interactive website shows areas with the highest density of bicycle crashes, overlaid with the locations of individual bicycle crashes.

The final analysis showed some surprising trends.

“For pedestrians the most common crash isn’t actually happening at intersections, it is happening at what is called mid-block crossings, which happen when someone is crossing in the middle of the road,” Fox said.

To mitigate the risks, the team’s Executive Summary includes recommendations to install crosswalks, median islands and bike lanes on roads with a high density of crashes.

They also found that males, who make up about two-thirds of bicycle commuters over the age of 16, are involved in 75% of bicycle crashes.

“We found that male cyclists over age 16 actually are hit at a statistically higher rate,” said Elizabeth Ratliff, a junior majoring in statistical science. “But we don’t know why. We don’t know if this is because males are riskier bikers, if it is because they are physically bigger objects to hit, or if it just happens to be a statistical coincidence of a very unlikely nature.”
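The comparison Ratliff describes can be sketched as a simple proportion check: if males are about two-thirds of riders, are they over-represented at 75% of crashes? The crash count below is an assumption (23 crashes a year over the five years 2011 to 2015 would give roughly 115), and the post does not say which test or sample the team actually used. Commuter share is also only a rough stand-in for time spent on the road.

```python
from math import comb

def binomial_upper_tail(k, n, p):
    """P(X >= k) when X counts successes in n independent trials with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def relative_crash_rate(share_of_crashes, share_of_riders):
    """Crashes per rider for one group, relative to everyone else."""
    group = share_of_crashes / share_of_riders
    others = (1 - share_of_crashes) / (1 - share_of_riders)
    return group / others

male_share_of_riders = 2 / 3    # from the post: about two-thirds of commuters
male_share_of_crashes = 0.75    # from the post: 75% of bicycle crashes
assumed_crashes = 115           # ASSUMPTION: not stated in the post

male_crashes = round(male_share_of_crashes * assumed_crashes)
p_value = binomial_upper_tail(male_crashes, assumed_crashes, male_share_of_riders)
ratio = relative_crash_rate(male_share_of_crashes, male_share_of_riders)

print(f"relative crash rate: {ratio:.2f}")
print(f"one-sided p-value under the assumed sample size: {p_value:.3f}")
```

Under these figures, male riders are involved in crashes at about 1.5 times the per-rider rate of everyone else. Whether that gap counts as statistically significant depends heavily on how many crashes are in the sample, which is why the assumed count matters.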

To build their website, the team integrated more than 20 sets of crash data from a wide variety of different sources, including city, county, regional and state reports, and in an array of formats, from maps to Excel spreadsheets.

“They had to fit together many different data sources that don’t necessarily speak to each other,” said faculty advisor Harris Solomon, an associate professor of cultural anthropology and global health at Duke. The Ghost Bikes project arose out of Solomon’s research on traffic accidents in India, supported by the National Science Foundation Cultural Anthropology Program.

In Solomon’s Spring 2017 anthropology and global health seminar, students explored the role of the ghost bikes as memorials in the Durham community. The Data+ team approached the same issues from a more quantitative angle, Solomon said.

“The bikes are a very concrete reminder that the data are about lives and deaths,” Solomon said. “By visiting the bikes, the team was able to think about the very human aspects of data work.”

“I was surprised to see how many stakeholders there are in biking,” Fox said. For example, she added, the simple act of adding a bike lane requires balancing the needs of bicyclists, nearby residents concerned with home values or parking spots, and buses or ambulances that require access to the road.

“I hadn’t seen policy work that closely in my classes, so it was interesting to see that there aren’t really simple solutions,” Fox said.

Video: https://www.youtube.com/watch?v=YHIRqhdb7YQ

 

Data+ is sponsored by Bass Connections, the Information Initiative at Duke, the Social Science Research Institute, the departments of Mathematics and Statistical Science and MEDx.

Other Duke sponsors include DTECH, Duke Health, Sanford School of Public Policy, Nicholas School of the Environment, Development and Alumni Affairs, Energy Initiative, Franklin Humanities Institute, Duke Institute for Brain Sciences, Office for Information Technology and the Office of the Provost, as well as the departments of Electrical & Computer Engineering, Computer Science, Biomedical Engineering, Biostatistics & Bioinformatics and Biology.

Government funding comes from the National Science Foundation. Outside funding comes from Accenture, Academic Analytics, Counter Tools and an anonymous donation.

Community partnerships, data and interesting problems come from the Durham Police Department, Durham Neighborhood Compass, Cary Institute of Ecosystem Studies, Duke Marine Lab, Center for Child and Family Policy, Northeast Ohio Medical University, TD Bank, Epsilon, Duke School of Nursing, University of Southern California, Durham Bicycle and Pedestrian Advisory Commission, Duke Surgery, MyHealth Teams, North Carolina Museum of Art and Scholars@Duke.

Writing by Kara Manke; video by Lauren Mueller and Summer Dunsmore

Pinpointing Where Durham’s Nicotine Addicts Get Their Fix

DURHAM, N.C. — It’s been five years since Durham expanded its smoking ban beyond bars and restaurants to include public parks, bus stops, even sidewalks.

While smoking in the state overall may be down, 19 percent of North Carolinians still light up, particularly the poor and those without a high school or college diploma.

Among North Carolina teens, consumption of electronic cigarettes in particular more than doubled between 2013 and 2015.

Now, new maps created by students in the Data+ summer research program show where nicotine addicts can get their fix.

Studies suggest that tobacco retailers are disproportionately located in low-income neighborhoods.

Living in a neighborhood with easy access to stores that sell tobacco makes it easier to start young and harder to quit.

The end result is that smoking, secondhand smoke exposure, and smoking-related diseases such as lung cancer, are concentrated among the most socially disadvantaged communities.


If you’re poor and lack a high school or college diploma, you’re more likely to live near a store that sells tobacco. Photo from Pixabay.

Where stores that sell tobacco are located matters for health, but for many states such data are hard to come by, said Duke statistics major James Wang.

Tobacco products bring in more than a third of in-store sales revenue at U.S. convenience stores — more than food, beverages, candy, snacks or beer. Despite big profits, more than a dozen states don’t require businesses to get a special license or permit to sell tobacco. North Carolina is one of them.

For these states, there is no convenient spreadsheet from the local licensing agency identifying all the businesses that sell tobacco, said Duke undergraduate Nikhil Pulimood. Previous attempts to collect such data in Virginia involved searching for tobacco retail stores by car.

“They had people physically drive across every single road in the state to collect the data. It took three years,” said team member and Duke undergraduate Felicia Chen.

Led by UNC PhD student in epidemiology Mike Dolan Fliss, the Duke team tried to come up with an easier way.

Instead of collecting data on the ground, they wrote an automated web-crawler program to extract the data from the Yellow Pages website, using a technique called web scraping.

By telling the software the type of business and location, they were able to create a database that included the names, addresses, phone numbers and other information for 266 potential tobacco retailers in Durham County and more than 15,500 statewide, including chains such as Family Fare, Circle K and others.
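For readers curious what web scraping looks like, here is a minimal sketch that pulls names, addresses and phone numbers out of a page of listings. The HTML snippet, its class names and the store names are invented for illustration; a real directory page is laid out differently, and the team's actual crawler is not shown in the post. The sketch parses a string rather than fetching a live page.

```python
from html.parser import HTMLParser

# Hypothetical search-results markup; real directory pages will differ.
SAMPLE_HTML = """
<div class="listing">
  <span class="name">Corner Mart</span>
  <span class="address">100 Example St, Durham, NC</span>
  <span class="phone">919-555-0100</span>
</div>
<div class="listing">
  <span class="name">Main Street Tobacco</span>
  <span class="address">200 Sample Ave, Durham, NC</span>
  <span class="phone">919-555-0101</span>
</div>
"""

class ListingParser(HTMLParser):
    """Collect one dictionary per listing from the assumed markup."""

    FIELDS = {"name", "address", "phone"}

    def __init__(self):
        super().__init__()
        self.listings = []
        self._field = None  # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "div" and css_class == "listing":
            self.listings.append({})
        elif tag == "span" and css_class in self.FIELDS and self.listings:
            self._field = css_class

    def handle_data(self, data):
        if self._field:
            current = self.listings[-1]
            current[self._field] = current.get(self._field, "") + data

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None

def parse_listings(html):
    parser = ListingParser()
    parser.feed(html)
    parser.close()
    return parser.listings

for listing in parse_listings(SAMPLE_HTML):
    print(listing["name"], "|", listing["address"], "|", listing["phone"])
```

A crawler would repeat this parsing step for each business type and location it searches, then merge the results into one table.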

Map showing the locations of tobacco retail stores in Durham County, North Carolina.


When they compared their web-scraped data with a pre-existing dataset for Durham County, compiled by a nonprofit called Counter Tools, hundreds of previously hidden retailers emerged on the map.

To determine which stores actually sold tobacco, they fed a computer algorithm data from more than 19,000 businesses outside North Carolina so it could learn how to distinguish, say, convenience stores from grocery stores. When the algorithm received store names from North Carolina, it predicted tobacco retailers correctly 85 percent of the time.

“For example, we could predict that if a store has the word ‘7-Eleven’ in it, it probably sells tobacco,” Chen said.
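Chen's example suggests a classifier that learns which words in a store name go with selling tobacco. The post does not name the algorithm the team used, so the sketch below swaps in a small naive Bayes model as a stand-in, trained on a handful of invented store names rather than the 19,000 real businesses.

```python
import math
from collections import Counter

# Invented training examples: (store name, sells tobacco?)
TRAINING = [
    ("7-Eleven #101", True),
    ("7-Eleven Store", True),
    ("Quick Stop Food Mart", True),
    ("Main Street Tobacco", True),
    ("Fresh Valley Grocery", False),
    ("Green Leaf Grocery", False),
    ("Oak Street Bakery", False),
    ("Family Hardware Store", False),
]

def tokenize(name):
    return name.lower().split()

def train(examples):
    """Count how often each word appears in each class of store name."""
    word_counts = {True: Counter(), False: Counter()}
    class_counts = Counter()
    for name, sells in examples:
        class_counts[sells] += 1
        word_counts[sells].update(tokenize(name))
    vocabulary = set(word_counts[True]) | set(word_counts[False])
    return word_counts, class_counts, vocabulary

def predict(model, name):
    """Return True if the name looks more like a tobacco retailer than not."""
    word_counts, class_counts, vocabulary = model
    total = sum(class_counts.values())
    scores = {}
    for label in (True, False):
        score = math.log(class_counts[label] / total)
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for word in tokenize(name):
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((word_counts[label][word] + 1) / denominator)
        scores[label] = score
    return scores[True] > scores[False]

model = train(TRAINING)
for store in ["7-Eleven #2041", "Maple Grocery"]:
    print(store, "->", "likely sells tobacco" if predict(model, store) else "likely does not")
```

With so few examples the model is only a toy, but it shows the mechanism: a name containing "7-Eleven" leans toward the tobacco class because that word appeared only in tobacco-selling stores during training.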

As a final step, they also crosschecked their results by paying people a small fee to search for the stores online to verify that they exist, and call them to ask if they actually sell tobacco, using a crowdsourcing service called Amazon Mechanical Turk.

Ultimately, the team hopes their methods will help map the more than 336,000 tobacco retailers nationwide.

“With a complete dataset for tobacco retailers around the nation, public health experts will be able to see where tobacco retailers are located relative to parks and schools, and how store density changes from one neighborhood to another,” Wang said.

The team presented their work at the Data+ Final Symposium on July 28 in Gross Hall.

Data+ is sponsored by Bass Connections, the Information Initiative at Duke, the Social Science Research Institute, the departments of mathematics and statistical science and MEDx. This project team was also supported by Counter Tools, a non-profit based in Carrboro, NC.

Writing by Robin Smith; video by Lauren Mueller and Summer Dunsmore
