For half an hour this rainy Wednesday, October 6th, I logged on to a LinkedIn Live series webinar with Dr. Jiaming Xu from the Fuqua School of Business. I sat inside the bridge between Perkins and Bostock, my laptop connected to DukeBlue wifi. I had Instagram open on my phone and was tapping through friends’ stories while I waited for the broadcast to start. I had Google Docs open in another tab to take notes.
The title of the webinar was “Can Anyone Truly Be Anonymous Online?”
Xu spoke about “network privacy,” which is “the intersection of network analysis and data privacy.” When you make an account, connect to wifi, share your location, search something online, or otherwise hint at your personal information, you are creating a “user profile”: a network of personal data that hints at your identity.
You are probably familiar with how social media companies track your decisions to curate a more engaging experience for you (i.e., the reason I scroll through TikTok for 5 minutes, then 30 minutes, then… Oh no! Two hours have gone by). Other companies track other kinds of data, and that data isn’t always just for algorithmic manipulation or creepy-accurate Amazon ads (e.g., “Hey! I was just thinking about buying cat litter. How did Mr. Bezos know?”). Your name, work history, date of birth, address, location, and other critical identifying factors can be collected even if you think your profile is scrubbed clean. In a rather on-the-nose anecdote to his LinkedIn audience on Wednesday, Xu explained that in April 2021, over 500 million user profiles on LinkedIn were hacked. Valuable, “sensitive, work-related data,” he noted, was made vulnerable.
So, what do you have to worry about? I know I tend to not worry about my personal information online; letting companies collect my data benefits me. I can get targeted Google ads about things I’m interested in and cool filters on Snapchat. In a medical setting, Xu said, prediction algorithms may help patients’ health in the long run. But even anonymized and sanitized data can be traced back to you. For further reading: in an essay published in July 2021, philosophers Evan Selinger and Judy Rhee elaborate on the dangers of “normalizing surveillance.”
The meat of Xu’s talk was how your data can be traced back to you. Xu gave three examples.
The first was a study conducted by researchers at the University of Texas at Austin attempting to identify users submitting “anonymous” reviews for movies on Netflix (keep in mind this was 2007, so picture the red Netflix logo on the DVD box accordingly). To achieve this, they cross-referenced the network of reviews published by Netflix with the network of individuals signed up on IMDb; they matched those who reviewed movies similarly on both platforms with their public profiles on IMDb. You can read more about that specific study here. (For those unafraid of the full research paper, click here.)
Let’s take a pause to learn a new vocab word! “Signatures.” In this example, the signature was users’ movie ratings. See if you can name the signature in the other two examples.
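To make the idea of a signature a little more concrete, here is a minimal Python sketch of matching one “anonymous” set of ratings against public profiles. Everything in it is made up for illustration: the names, the ratings, the `similarity` and `best_match` functions, and the 0.5 threshold are my own assumptions, not the researchers’ actual method, which I would expect to be far more careful.

```python
# Toy illustration only: all names, ratings, and the threshold are hypothetical.

def similarity(anon_ratings, public_ratings):
    """Share of the anonymous user's movies that the public user rated identically."""
    if not anon_ratings:
        return 0.0
    agree = sum(
        1 for movie, stars in anon_ratings.items()
        if public_ratings.get(movie) == stars
    )
    return agree / len(anon_ratings)


def best_match(anon_ratings, public_profiles, threshold=0.5):
    """Return the public profile whose ratings look most like the anonymous ones,
    or None if nobody clears the (assumed) threshold."""
    best_name, best_score = None, 0.0
    for name, ratings in public_profiles.items():
        score = similarity(anon_ratings, ratings)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


# An "anonymous" reviewer and two public profiles (all invented).
anonymous_user = {"Alien": 5, "Big": 2, "Casablanca": 4}
public_profiles = {
    "alice": {"Alien": 5, "Big": 2, "Casablanca": 4, "Dune": 3},
    "bob": {"Alien": 1, "Big": 2},
}

print(best_match(anonymous_user, public_profiles))  # prints: alice
```

Even in this tiny example, three matching ratings are enough to single out one profile, which is the uncomfortable point of the study.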
The second example was conducted by the same researchers; to identify users on Twitter who shared their data anonymously, it was simply a matter of cross-referencing the network of Twitter users with Flickr users. If you know a guy who knows a guy who knows a guy who knows a guy, you and that group of people are likely to initiate that same chain of following each other on every social media platform you have (it may remind you of the theory that you are connected by “six degrees of separation” from every person on the planet, which, as it turns out, is also supported by social media data). The researchers were able to identify the correct users 30.8% of the time.
Time for another vocab break! Those users who connect groups of people who know a guy who knows a guy who knows a guy are called “seeds.” Speaking of which, did you identify the signature in this example?
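For the curious, here is a rough Python sketch of how a handful of seeds could snowball into more matches. It rests on an assumption of mine that was not spelled out in the talk: that the seeds are a few users whose accounts on both networks are already known. The `propagate` function, the `min_witnesses` parameter, and both toy networks are hypothetical, and the real attack is surely more elaborate.

```python
# Toy illustration only: the graphs, names, and parameters are hypothetical.

def propagate(anon_graph, named_graph, seeds, min_witnesses=2):
    """Grow a mapping from anonymous IDs to named users.

    Each graph is a dict of node -> set of neighbors. `seeds` maps a few
    anonymous IDs to names that are assumed to be known in advance.
    """
    mapping = dict(seeds)
    used = set(mapping.values())
    changed = True
    while changed:
        changed = False
        for u in anon_graph:
            if u in mapping:
                continue
            # Where do u's already-identified neighbors sit in the named graph?
            known = {mapping[n] for n in anon_graph[u] if n in mapping}
            best, best_count, tie = None, 0, False
            for v in named_graph:
                if v in used:
                    continue
                count = len(known & named_graph[v])
                if count > best_count:
                    best, best_count, tie = v, count, False
                elif count == best_count and count > 0:
                    tie = True
            # Accept only a clear winner backed by enough shared neighbors.
            if best is not None and best_count >= min_witnesses and not tie:
                mapping[u] = best
                used.add(best)
                changed = True
    return mapping


# "Anonymous" network (numbered IDs) and public network (names), same shape.
anon_network = {1: {2, 4}, 2: {1, 4, 5}, 3: {5}, 4: {1, 2}, 5: {2, 3}}
named_network = {
    "ann": {"ben", "dee"},
    "ben": {"ann", "dee", "eve"},
    "cal": {"eve"},
    "dee": {"ann", "ben"},
    "eve": {"ben", "cal"},
}
seeds = {1: "ann", 2: "ben", 3: "cal"}

print(propagate(anon_network, named_network, seeds))
```

Starting from three known users, the sketch pins down the other two purely from who they are connected to.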
The third and final example was my personal favorite because it was the funkiest and most creative. Facebook user data, also “scrubbed clean” before being sold to third-party advertisers, was overlaid with LinkedIn user data to reveal a network of connections that are repeated. How did they match up those networks, you ask? First, the algorithm assigned a computed score to every user based on how many Facebook friends they have, and another to every user based on how many LinkedIn connections they have. Then, each user was assigned a list of integers based on their friends’ popularity scores. Bet you weren’t expecting that.
This method improves upon the Twitter/Flickr example: in addition to overlaying networks and chains of users, it better matches who is who. You are likely to know a guy who knows a guy who knows a guy, but you are also likely to know all of those guys down the line, so following specific chains does not always accurately convey who is who. Unlike the seeds signature, the friends’ popularity signature was able to correctly re-identify users most of the time.
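Here is a small Python sketch of the friends’ popularity idea as I understood it: score each user by their number of connections, give each user the sorted list of their friends’ scores, then pair up users across the two networks whose lists are closest. The pairing rule, the `distance` function, and both toy networks are my own assumptions for illustration, and the two networks are identical in shape, which real data never would be.

```python
# Toy illustration only: the graphs, names, and matching rule are hypothetical.

def signatures(graph):
    """Map each user to the sorted list of their friends' connection counts."""
    popularity = {user: len(friends) for user, friends in graph.items()}
    return {
        user: sorted(popularity[friend] for friend in friends)
        for user, friends in graph.items()
    }


def distance(sig_a, sig_b):
    """Compare two signatures position by position, padding the shorter with zeros."""
    size = max(len(sig_a), len(sig_b))
    a = [0] * (size - len(sig_a)) + sig_a
    b = [0] * (size - len(sig_b)) + sig_b
    return sum(abs(x - y) for x, y in zip(a, b))


def match(anon_graph, named_graph):
    """Pair each anonymous user with the named user whose signature is closest."""
    anon_sigs, named_sigs = signatures(anon_graph), signatures(named_graph)
    return {
        u: min(named_sigs, key=lambda v: distance(anon_sigs[u], named_sigs[v]))
        for u in anon_sigs
    }


# "Scrubbed" network (numbered IDs) and public network (names), same shape.
facebook = {1: {2, 3, 4, 5}, 2: {1, 3}, 3: {1, 2, 6}, 4: {1, 6}, 5: {1}, 6: {3, 4}}
linkedin = {
    "ann": {"ben", "cal", "dee", "eve"},
    "ben": {"ann", "cal"},
    "cal": {"ann", "ben", "fay"},
    "dee": {"ann", "fay"},
    "eve": {"ann"},
    "fay": {"cal", "dee"},
}

print(signatures(facebook)[1])  # prints: [1, 2, 2, 3]
print(match(facebook, linkedin))
```

Notice that no seeds are needed here: in this toy network every user’s list of friends’ scores happens to be unique, so the list alone gives each person away.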
Sitting in the bridge Wednesday, I was connected to many networks that I wouldn’t think could be used to identify me through my limited public data. Now, I’m not so sure.
So, what’s the lesson here? At the least, it was fun to learn about, even if the ultimate realization leaves us powerless against big data analytics. Your data has monetary value, and it is not as secure as you think. But it may be worth asking whether we even have the ability to protect our anonymity.