Interview with Jay Jacobs, senior data scientist at BitSight Technologies (photograph by Marcus Ranum).
How did you get into being a numbers guy focused on security data analysis? What's your background?
Jay Jacobs: Information technology. Early on, I got interested in security through breaking stuff, and that led me (of course!) to asking: How do we actually stop people from breaking stuff? So I got into fixing things, and then what dawned on me was that cryptography is the ultimate solution: You can encrypt things; you can throw them on the internet and they're protected. So I started learning all I could about cryptography, and then I realized that the math in the cryptography is the easy stuff.
There's always a key somewhere. What do I do with it? Do you try to put it in a config file? You can get a hardware security module and put the key in there, but then you have to authenticate to that. Where do you put that authentication, in the config file? Those are all questions of risk: Is setting file permissions on the config file that holds the key good enough?
I've noticed cryptographers follow this path where they start off thinking cryptography is the answer, and then they realize, 'Wait a minute, we've got to go past systems integrity before we can get there.' They go through this maturing process.
Jacobs: Same with me! What I realized is that when you start talking about risk, what is good enough? That is an incredibly complex problem. For me the answer lies in data: How do we make sense out of what has happened in the past? How do we make sense of the breaches that have occurred, the events, the tickets that get opened at the help desk? How do we make sense of these things in order to learn what can probably happen tomorrow?
So that dragged you into it. How did you learn about it? What was your path?
Jacobs: I started out teaching myself and doing a whole lot of reading, and then I went back to school in a master's program for applied statistics. That was actually a pretty bad experience because school was horribly outdated -- 10 to 15 years behind the times. For example, in the night course, I learned a technique and the next day I went to an R [statistical programming language] user group and they were talking about linear regression, which is what I was studying. They were talking about all these techniques that I had never heard of before. So I raised my hand and asked, 'Hey, you didn't mention this one technique,' and [the group leader] laughed at me. He said, 'Nobody does that technique anymore.' And I had just learned it in class the night before. Then he said, 'No, people are doing this, that and the other thing.' And those were things that were not taught in my class. So I realized the education system is behind the times; they aren't keeping current with statistics.
You know that, in security, it's very difficult to keep up with everything because of how fast things are changing. It's the same in statistics; there's a lot of research … with machine learning and so forth. So I dropped out and decided to invest my tuition in books from Amazon.com.
I started to learn techniques and get to know people in the local community who are statisticians. I did not turn to the security community to learn security data analysis -- I went to the statistics community instead.
Well, that's good. The security [community] doesn't know anything about security data analysis! You don't want to ask us because we're behind the times, too.
What would you say are your primary sources of information? If someone wanted to follow along your path and learn on their own, what would you tell them to do?
Jacobs: What school did teach me was how to read and research. The [teachers] would assign me different pieces of the puzzle, and it taught me the underlying language. For example, when you see n, you know that's a sample size. And sigma and mu and all these notations -- [they're] hurdles. It might be a good idea to take a class or two but not try to do a whole program.
Have you ever looked at any of the online learning options, like The Great Courses [college-level coursework] or something like that?
Jacobs: Anything that can structure the learning is good.
My experience with learning something is that you want a mixture of practical and theoretical so that, as you switch back and forth, the practice is explained by the theory and the theory is informed by the practicalities. What do you use in practice? What does your statistician's tool bag have in it? Big data?
Marcus Ranum, chief of security, Tenable Network Security
Jacobs: It's all R. There's a package in R called ggplot [ggplot2], which I render a lot of things with. It's all code-based. It's not point-and-click -- you'd be very comfortable with it. There's an integrated development environment [IDE] that lets you write your code, and when you're done, you click a button and it gives you a chart.
So you're not using SPSS-X or gnuplot and AWK scripts like I learned back in the dark ages of computing?
Jacobs: Nope, not anymore. SPSS is still out there; it's kind of like COBOL. The R environment I'm working with is all open source. There's a statistical programming language called S, and someone decided to make an open source version with extensions and they called it R. You can download RStudio, which is the IDE I use.
One of the challenges in taking a course or teaching yourself is they'll give you a set of data to work with, and it's always good, clean data. 'Here's the problem, and here's the data you can use to practice the problem.'
In the real world, you get messy, unbelievable, god-awful junk thrown at you. Most of my time is spent taking this really ugly garbage and trying to extract meaning from it before I can do statistical analysis on it.
There's a lot of creativity in this. It's not step 1, step 2, step 3 and you get the answer. You have to figure out the steps, and they have to be the right steps or you get things horribly wrong.
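As a rough illustration of the cleanup work described above, here is a minimal Python sketch. It is not Jacobs' actual tooling (he works in R), and the record layout, field order and date formats are all assumptions invented for the example. It shows the kind of unglamorous steps that come before any statistics: trimming whitespace, reconciling mixed date formats, normalizing labels and discarding rows that cannot be salvaged.

```python
from datetime import datetime

# Hypothetical raw incident records (date, source IP, category).
# The mess is deliberate: stray whitespace, two date formats,
# inconsistent casing and one unusable row.
RAW = [
    " 2015-03-01 , 10.0.0.5 ,  Malware ",
    "03/02/2015,10.0.0.7,PHISHING",
    "n/a,,",
    "2015-03-04,10.0.0.5,malware",
]

# Date formats we assume may appear in the feed.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_date(text):
    """Return a date for the first format that matches, else None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def clean(rows):
    """Normalize usable rows and silently skip the rest."""
    cleaned = []
    for row in rows:
        fields = [field.strip() for field in row.split(",")]
        if len(fields) != 3:
            continue
        day, ip, category = fields
        parsed = parse_date(day)
        if parsed is None or not ip or not category:
            continue
        cleaned.append((parsed, ip, category.lower()))
    return cleaned


records = clean(RAW)
for record in records:
    print(record)
```

In real work the skipped rows would probably be counted and inspected rather than dropped silently, since what gets thrown away can bias the analysis that follows.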
So are you a fan of Tufte [Edward Tufte, author of The Visual Display of Quantitative Information] or are you anti-Tufte?
Jacobs: He is more on the art side of the field, and he has a bit of a reputation in the field as being difficult to have a one-on-one with. So a lot of people use him more as a reference than a resource.
There are times when I am presenting a chart and I see 20 different things you can point out in it, but most of the people looking at it are just seeing a couple bars there. There are stories in the information that you've just got to figure out how to tell.
When someone says, 'Sure, it's a bad statistic, but it's useful,' what do you say?
Jacobs: I think that there's some truth that bad statistics are useful, but I think people put more stock in them than they should. Sometimes, when data is wrong, you can say, 'There might be something wrong with this data, but it's still useful.' Even if the data has bias, you can look at it and, sometimes, you can see where the bias came in. You don't correct it; you explain it.
Sometimes, you have to look at it and say, 'There's no value in that.' But sometimes, if you have something that's very biased, you can put it in your back pocket and then look for other data that might back it up, and that might give it value. You don't know yet.
I noticed that the data you tend to present is about how things were. You're recording a thing that happened, and then you present data about that thing. That way, you're going to have less of a chance that you're actually dealing with people's opinions about something -- measure outcomes and experiences.
Jacobs: Surveys are more problematic. When you're talking about people's opinions about security, it's always tricky because security is incredibly complex. There is no opinion that you are going to find that will be accurate.
What's your next step?
Jacobs: I'm having a tremendous amount of fun where I am. We get enormous quantities of data, and we get to correlate it to specific entities because we've got a whole team that's been mapping IP addresses to companies. We have mapped 50,000-plus addresses to companies, and that enables us to set up something like a sinkhole and a scanning result set and ask, 'What is the connection between this data?' Now, we can start to say, 'How good are these controls?' If someone has every port open in a scan, we can infer they don't have a firewall, and then we can look at the sinkhole and see they have a lot of botnet traffic. It's very interesting.
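The correlation Jacobs describes can be sketched roughly as follows, in Python. This is an illustrative assumption, not BitSight's method: the company names, IP addresses, probed port list and the rule "every probed port open suggests no firewall" are all made up for the example. The point is only that once IP addresses are mapped to companies, two unrelated data sets (scan results and sinkhole traffic) can be rolled up per company and compared.

```python
from collections import Counter

# Everything below is invented sample data for illustration.
IP_TO_COMPANY = {
    "192.0.2.10": "ExampleCorp",
    "192.0.2.11": "ExampleCorp",
    "198.51.100.5": "SampleInc",
}

# Ports the hypothetical scanner probed on every host.
PROBED_PORTS = {22, 23, 80, 443, 3389}

# Open ports observed per IP address.
SCAN_RESULTS = {
    "192.0.2.10": {22, 23, 80, 443, 3389},
    "192.0.2.11": {80, 443},
    "198.51.100.5": {443},
}

# Source addresses seen contacting a sinkhole (one entry per hit).
SINKHOLE_HITS = ["192.0.2.10", "192.0.2.10", "192.0.2.11", "203.0.113.9"]


def summarize(ip_to_company, scan_results, sinkhole_hits, probed_ports):
    """Roll scan and sinkhole observations up to the company level."""
    summary = {}
    for ip, company in ip_to_company.items():
        entry = summary.setdefault(
            company, {"wide_open_hosts": 0, "sinkhole_hits": 0}
        )
        # A host with every probed port open hints at a missing firewall.
        if scan_results.get(ip, set()) >= probed_ports:
            entry["wide_open_hosts"] += 1
    for ip, count in Counter(sinkhole_hits).items():
        company = ip_to_company.get(ip)
        if company is not None:  # ignore addresses we cannot attribute
            summary[company]["sinkhole_hits"] += count
    return summary


report = summarize(IP_TO_COMPANY, SCAN_RESULTS, SINKHOLE_HITS, PROBED_PORTS)
for company, stats in sorted(report.items()):
    print(company, stats)
```

A table like this only suggests a relationship between exposed hosts and botnet traffic; establishing how good a control is would take far more data and a proper statistical test.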