When it comes to artificial intelligence and machine learning, there is a growing understanding that rather than constantly striving for more data, data scientists should be striving for better data when creating AI models.
Laura Norén, director of research at Obsidian Security, spoke about data science ethics at Black Hat USA 2018, and discussed the potential pitfalls of not having quality data, including AI bias learned from the people training the model.
Norén also looked forward to the data science ethics questions that have yet to be asked around what should happen to a person's data after they die.
Editor's note: This is part two of our talk with Norén and it has been edited for length and clarity.
What do you think about how companies go about AI and machine learning right now?
Laura Norén: I think some of them are getting smarter. At a very large scale, it's not noise, but you get a lot of data that you don't really need to store forever. And frankly it costs money to store data. It costs money to have lots and lots and lots of variable features in your model. If you get a more robust model and you're aware of where your signal is coming from, you may also decide not to store particular kinds of data because it's actually inefficient at some point.
For instance, astronomers have this problem. They've been building telescopes that are generating so much data, it cripples the system. They've had seven years of planning just to figure out which data to keep, because they can't keep it all.
There's a myth out there that in order to develop really great machine learning systems you need to have everything, especially at the outset, when you don't really know what the predictive features are going to be. It's nontrivial to do the math and to use the existing data and tests and simulations to figure out what you really need to store and what you don't need to capture in the first place. It's part of the hoarding mythology that somehow we need all of the data all of the time for all time for every person.
How does data science ethics relate to issues of AI bias caused by the data that's fed in?
Norén: That is such a great, great question. I absolutely know that it's going to be important. We're aware of that, we're watching for it, we're monitoring for it so we can test for bias, in this case against Russians. Because it's cybersecurity, that's a bias we might have. You can test for that kind of thing. And so we're building tests for those kinds of predictable biases we might have.
I wish I had a great story of how we discovered that we're biased against Russians or North Koreans or something like that. But I don't have that yet because it would just be wrong to kind of run into some of the great stories that I'm sure we're going to run into soon enough.
How do you identify what could be an AI bias that you need to worry about when first building the system?
Norén: When you have low data or your models are kind of all over the place because it's the very beginning, you might be able to use social science to help you look for early biases. All of the data that we're feeding into these systems are generated by humans, and humans are inherently biased; that's how we've evolved. That turns out to be really strong, evolutionarily speaking, and then not so great in advanced evolution.
You can test for things that you think might have a known bias, which is where it helps to know your history. Like I said, in cybersecurity you might worry about being biased specifically against particular regions. So you may have a higher false-positive rate for Russians or for Russian language content or Chinese language content, or something like that. You could specifically test for those because you went in knowing that you might have a bias. It's a little bit more technical and difficult to unearth biases that you were not expecting. We're using technical solutions and data social science to try to help surface those.
I think social science has been kind of the sleeper hit in data science. It turns out it really helps if you know your domain really well. In our case, that's social science because we're dealing with humans. In other cases, it might help to be a really good biologist if you're starting to do genomics at a predictive level. In general, the strongest data scientists we see are people who have both very high technical skills in the data science vertical but also deep knowledge of their domain.
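As an illustration of the kind of test Norén describes for a bias you already suspect, the sketch below compares false-positive rates across groups such as content language. It is a minimal example under stated assumptions, not Obsidian Security's method: the record layout, the function names `false_positive_rates` and `rate_gaps`, and the sample data are all hypothetical.

```python
from collections import defaultdict

def false_positive_rates(records):
    """Return {group: false-positive rate}, computed over benign records only.

    Each record is a (group, is_malicious, flagged) tuple. This layout is
    an assumption made for illustration.
    """
    benign = defaultdict(int)
    false_pos = defaultdict(int)
    for group, is_malicious, flagged in records:
        if is_malicious:
            continue  # a false positive can only occur on benign activity
        benign[group] += 1
        if flagged:
            false_pos[group] += 1
    return {group: false_pos[group] / benign[group] for group in benign}

def rate_gaps(rates, reference):
    """Difference between each group's rate and the reference group's rate."""
    return {g: r - rates[reference] for g, r in rates.items() if g != reference}

# Hypothetical labeled sample: (content language, truly malicious?, model flagged it?)
records = [
    ("en", False, False), ("en", False, False), ("en", False, True),
    ("en", False, False), ("en", True, True),
    ("ru", False, True), ("ru", False, True), ("ru", False, False),
    ("ru", False, False), ("ru", True, True),
]

rates = false_positive_rates(records)
gaps = rate_gaps(rates, reference="en")
print(rates)  # per-group false-positive rates
print(gaps)   # a positive gap means more false alarms than the reference group
```

On real data, a gap like this would only be a starting point; each group needs enough benign samples, and some significance check, before the difference can be read as evidence of bias.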
It sounds like a lot of the potential mitigations for AI bias and data science issues boil down to being more proactive rather than reactive. In that spirit, what is an issue that you think will become a bigger topic of discussion in the next five years?
Norén: I do actually think it's going to be very interesting just how people feel about what happens to their data as more and more companies have more and more data about people forever, and their data are going to outlive them. There have been some people who are already working on that kind of thing.
Say you have a best friend and your best friend dies, but you have all these emails and chats, texts, back-and-forth with your best friend. Someone is developing a chatbot that mimics your best friend by being trained on all those actual conversations you had and will then live on past your best friend. So you can continue to talk with your best friend even though your best friend is dead. That's an interesting, kind of provocative, almost artistic take on that point.
But I think it's going to be a much bigger topic of conversation to try to understand what it means to have yourself, profiles and data live out beyond the end of your own life and be able to extend to places that you're not actually in. It will drive decisions about you that you will have no agency over. The dead best friend has no agency over that chatbot.
Indefinite data storage will become much, much more topical in conversation and we'll also start to see then why the right to be forgotten is an insufficient response to that kind of thing because it assumes that you know where to go as your agency, or that you even have agency at all. You're dead; you obviously don't have any agency. Maybe you should, maybe you shouldn't. That's an interesting ethical question.
Users are already finding they don't always have agency over their data even when alive, aren't they?
Norén: Even if you're alive, if you don't really know who holds your data, you may have no agency to get rid of it. I can't call up Equifax and tell them to delete my data. I'm an American, but I don't have that. I know they're stewards of it but there's nothing I could do about that.
We'll probably favor a conversation that is a lot more about being good guardians of data rather than talking about it in terms of something that we own or don't own; it will be about stewardship and guardianship. That's a language that I'm borrowing from medical ethics because they're using that type of language to deal with DNA.
Can someone else own your DNA? They've decided no. DNA is such an intrinsic part of a person's identity and a person's physicality that it can't be owned in whole by someone else. But that someone else, like a hospital or a research lab, could take guardianship of it.
The language is out there, but we haven't really seen it move all the way through the field of data science. It's kind of stuck over in genomics and the Henrietta Lacks story. She was a woman who had [cervical] cancer, and she died. But her cells, her cancer cells, were really robust. They worked really well in research settings and they lived on well past Henrietta's life. Her family was unaware of this. There's this beautiful book written about what it means to find out that part of your family -- this deceased family member that you cared about a lot -- is still alive and is still fueling all this research when you didn't even know anything about it. That's kind of where that conversation got started, but I see a lot of parallels there between data science and what people think of when they think of DNA.
One of the things that's so different about data science is that we now can actually have a much more complete record of an individual than we have ever been able to have. It's not just a different iteration on the same kind of thing. You used to be able to have some sort of dossier on you that has your birthdate and your Social Security number, your name and whether you were married. That's such a small amount of information compared to every single interaction that you've had with a piece of software, with another person, with a communication, every medical record, everything that we might know about your DNA. And our knowledge will continue to get deeper and deeper and deeper as science progresses. And we don't really know what that's going to do to the concept of individuality and finiteness.
I think about these things very deeply. We're going to see that in terms of, 'Wow, what does it mean that your data is so complete and it exists in places and times that you could never exist and will never exist?' That's why I think that decay by design thing is so important.