Most organizations want to leverage their existing authentication and user-management systems for big data, if at all possible. No one wishes to re-create user accounts for a new database environment, and most people balk at having to manage multiple identity stores.
Every user who interacts with NoSQL clusters and distributed processing frameworks, such as open source Apache Hadoop, should be required to authenticate their identity. Hadoop by default doesn't authenticate any of its users or services. In 2010, Yahoo, an early adopter of the distributed computing platform, began to address Hadoop security holes by implementing Kerberos authentication. Originally created by the Massachusetts Institute of Technology (MIT), the open source Kerberos network-authentication protocol is increasingly used to validate users and client/server nodes in Hadoop clusters.
Kerberos authentication uniquely identifies users, requiring them to prove their identities before they can connect to the cluster. Once validated, users are issued a time-limited ticket, which they present to cluster services as proof of identity; those services then decide what resources each user can access. Kerberos is designed to offer strong authentication over insecure networks. It also provides security benefits beyond user login, such as a trusted identity on which resource authorization can be based and resistance to eavesdropping and replay attacks. And system administrators can control which nodes are allowed to join the secure cluster.
Many developers dislike Kerberos because it is difficult to set up. There is simply no way to tap dance around it: Kerberos is complex, and its protocols are not well understood. But it's far better to use the security tools built into the Hadoop framework than to roll your own authentication and user-management system.
Many Hadoop variants offer fully integrated Kerberos out of the box, with facilities to ease setup and link to your existing identity repository. That means you can employ existing infrastructure and still get high-security identity tokens for authentication. And most NoSQL clusters offer integration with LDAP (Lightweight Directory Access Protocol) or Microsoft Windows Active Directory so you can use your existing identity store.
To configure Hadoop to use Active Directory, you'll first have to set up the Kerberos authentication server and key store, known as the Key Distribution Center (KDC), within the cluster's network-addressable space. Next, you'll need to configure the nodes as "service principals" so they can authenticate with the Kerberos server and handle incoming client requests. Finally, you'll have to set up the Active Directory store as a trusted realm for the cluster, allowing Kerberos to link to the repository. With this method, there is no need to create service principal names in Active Directory, yet users (who are Active Directory principals) can still be authenticated to Hadoop.
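The service-principal step is the most repetitive, because each service on each node needs its own principal and keytab. Here is a minimal sketch of how it might be scripted in Python, assuming an MIT Kerberos KDC administered with kadmin. The realm name, host names, service names and keytab directory are placeholders for illustration, not values from any particular Hadoop distribution. The script only builds the command text; reviewing it and running it through kadmin is left to the administrator.

```python
from typing import Iterable, List


def service_principal_commands(
    hosts: Iterable[str],
    services: Iterable[str],
    realm: str,
    keytab_dir: str = "/etc/security/keytabs",  # assumed location, adjust as needed
) -> List[str]:
    """Build kadmin commands that create one principal and keytab per service per host."""
    services = list(services)  # allow reuse across the outer loop
    commands = []
    for host in hosts:
        for service in services:
            # Kerberos service principals follow the form service/host@REALM.
            principal = f"{service}/{host}@{realm}"
            keytab = f"{keytab_dir}/{service}.{host}.keytab"
            # Create the principal with a random key (no password to manage).
            commands.append(f"addprinc -randkey {principal}")
            # Export the key to a keytab file the service can read at startup.
            commands.append(f"ktadd -k {keytab} {principal}")
    return commands


if __name__ == "__main__":
    # Hypothetical two-node cluster in a hypothetical realm.
    for line in service_principal_commands(
        hosts=["node1.example.com", "node2.example.com"],
        services=["hdfs", "yarn"],
        realm="HADOOP.EXAMPLE.COM",
    ):
        print(line)
```

Generating the commands as plain text, rather than executing them directly, keeps the script easy to audit and lets the same output be replayed when nodes are added later.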
There is not enough room in this tip to delve into all the complexities, but Kerberos is a commonly used tool for Hadoop security, with lots of community resources to offer guidance. You'll likely want to script the setup process to simplify ongoing administration. More importantly, you'll want failover capabilities for the Kerberos server; if it goes down without a backup, no one can log in and the cluster becomes inaccessible.
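On the client side, one common way to prepare for failover is to list more than one KDC for the realm in each node's krb5.conf, so that clients can fall back to a second server when the first is unreachable. The sketch below, in Python, renders such a realm stanza. It assumes MIT Kerberos configuration syntax, and the realm and host names are hypothetical. It covers only how clients locate the servers; keeping the backup KDC's database in sync with the primary is a separate task.

```python
from typing import List


def render_realm_stanza(realm: str, kdcs: List[str], admin_server: str) -> str:
    """Render a krb5.conf [realms] stanza listing every KDC for the realm."""
    if not kdcs:
        raise ValueError("at least one KDC host is required")
    lines = ["[realms]", f"    {realm} = {{"]
    # One 'kdc =' line per server, primary first, backups after it.
    lines += [f"        kdc = {host}" for host in kdcs]
    lines.append(f"        admin_server = {admin_server}")
    lines.append("    }")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical primary and backup KDC hosts.
    print(
        render_realm_stanza(
            realm="HADOOP.EXAMPLE.COM",
            kdcs=["kdc1.example.com", "kdc2.example.com"],
            admin_server="kdc1.example.com",
        )
    )
```

Generating the stanza from a single list of hosts means every node receives the same server list, which helps avoid the situation where some nodes know about the backup KDC and others do not.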