Big data tutorial: Everything you need to know
It would seem everyone is taking notice of the phenomenon known as big data, even the federal government. The Federal Trade Commission (FTC) late last year issued orders to nine companies in the data broker industry, requiring them to provide information on their consumer data collection and usage practices. The FTC action is a clear signal that while the emerging uses of big data offer promising business benefits, there are significant privacy implications.
Why Use Big Data?
Big data is different from past data warehousing efforts because it performs analytics on almost any type of data file or format, including images, videos, and data gathered from social media. Another characteristic of big data is that it does not have a “one to one” relationship of server to data storage. Instead, it relies on a virtualization architecture that can draw from large content stores and archives as a single global resource.
Among corporate executives and line-of-business managers, the compelling motivation in using big data is to formulate more accurate, detailed forecasts or predictions that can offer highly probable advantages to an organization. Examples of business benefits cover a wide spectrum ranging from new product development and enhancements to optimum pricing, to screening job applications and designing effective marketing campaigns. In fact, political campaigns have already entered into big data analytics: The Obama 2012 campaign utilized big data analytics to identify likely voters and not only influence them, but also zero in on them in an effort to raise campaign funds and get out the vote, ultimately a key strategy in its victory.
Big Data Privacy Concerns
The FTC’s recent action is specific to data brokers: companies that collect and analyze specific consumer behavioral data and then sell the results to other companies looking to improve their consumer marketing and sales efforts. However, it is important to acknowledge that growing privacy concerns about the use of big data are not limited to these conventional data brokers. The Economist Intelligence Unit, an independent business within the Economist Group, has published a study of leaders in the use of big data that spanned 19 industry sectors including manufacturing, IT and technology, financial services, professional services, healthcare, pharmaceuticals and biotechnology, and consumer goods.  There can be no doubt that the big data revolution has begun.
In light of the characteristics of big data and the business motivations for pursuing it, one of the most critical privacy aspects is simply the quality or accuracy of the data, and how an enterprise’s use of that data might negatively affect an individual when decisions are made. For example, how accurate is personal information obtained from social media? Should information from social media or other Web-enabled sources be used to screen or rank job applications, or increase the price of medical insurance? Basic profile data, such as age, marital status, education or employment, is typically unverified. A similar lack of verification is common in free email services in which the account holder, by accepting terms and conditions, has agreed to relinquish some degree of privacy for data aggregation purposes.
Another quality issue is the way that Internet search terms or phrases can be misinterpreted when this type of data is collected. Examples of poor enterprise use of big data would include using Internet search terms to evaluate product pricing or, perhaps, to target potential customers. There can be multiple users on a household computer, and there are many reasons why someone might research a subject on the Internet that is not directly relevant to them. This type of data collection, analysis and usage can produce flawed analytic results that lead to bad decisions, a lose-lose scenario for both individuals and the organizations acting upon that data. This lack of big data quality control points to another well-established privacy principle, which is to collect personal data that is consistent and appropriate for the intended purpose.
Best Practices for Big Data Privacy
Enterprise best practices for working with big data are still emerging, but there are already lessons that can help move this promise of innovation forward without sacrificing the privacy of personal data.
The first step in effective use of big data is to become highly competent in procuring and managing cloud services, which are considered a prerequisite for big data to be cost effective: most organizations can’t or won’t make the IT infrastructure investments necessary to support a big data initiative, and instead rely on cloud-based applications, infrastructure and processing power. Further, even those willing to make the commitment will find it difficult to proceed without the added flexibility the cloud provides. Yet this represents a weak spot for many organizations in that the competency required to ensure the security and privacy of data in the cloud is generally lacking. It is not enough to implement standard general security contractual clauses. There must be well-defined responsibilities for both the cloud services provider and the cloud services user regarding specific data privacy controls that are required. There must also be ongoing monitoring and audits of cloud services along with any relevant metrics that indicate levels of data integrity, confidentiality and availability. An excellent data protection resource for using cloud computing services is the Cloud Security Alliance, which publishes guidance documents and makes them available on its website.
Experience to date indicates that, when deploying cloud services, it is best to prototype big data work on a public cloud and then move to a private cloud. Why? A public cloud deployment, by definition, is hosted by a third party and may be accessed by “untrusted” parties. Private cloud deployments are directly controlled and managed by an organization or enterprise even though data computing facilities may be located off-premises; private cloud deployments can be accessed only by trusted parties.
The next tactic to enable better use of big data is to implement converged storage. Converged storage is more efficient and will reduce the likelihood of errors that influence data quality or accuracy. A critical characteristic of converged storage that relates to data quality and accuracy is data de-duplication, although it has cost efficiency benefits as well. 
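To make the de-duplication idea concrete, here is a minimal sketch in Python. It assumes whole-object de-duplication keyed on a SHA-256 content hash; the `DedupStore` class and the file names are hypothetical illustrations, not part of any converged storage product, and real systems typically de-duplicate at the block or chunk level.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: identical content is kept only once."""

    def __init__(self):
        self._blocks = {}  # SHA-256 hex digest -> stored bytes
        self._files = {}   # file name -> digest of its content

    def put(self, name: str, data: bytes) -> str:
        """Store data under a name; reuse the existing copy if content matches."""
        digest = hashlib.sha256(data).hexdigest()
        self._blocks.setdefault(digest, data)  # no second copy for duplicates
        self._files[name] = digest
        return digest

    def get(self, name: str) -> bytes:
        """Return the content that was stored under a name."""
        return self._blocks[self._files[name]]

    @property
    def stored_bytes(self) -> int:
        """Bytes physically held, counting each distinct content once."""
        return sum(len(block) for block in self._blocks.values())


# Hypothetical usage: two files with identical content share one stored copy.
store = DedupStore()
export = b"customer_id,zip\n1001,30301\n"
store.put("export_monday.csv", export)
store.put("export_tuesday.csv", export)
print(store.stored_bytes == len(export))
```

Because both names point at one copy, a correction only has to be made in one place, which is the data quality benefit described above.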
Another best practice is to properly sanitize data, as it helps avoid a number of the aforementioned privacy issues. “Apply filtering, cleansing, pruning, conforming, matching, joining, and diagnosing at the earliest touch points possible,” said Amy Dean, a data warehouse specialist with Emory University in Atlanta. Dean recommends that varied and disparate data sources be weighted or scored in terms of data quality so that quality can be factored into the analytics. Dean also suggests that data sources be linked or available for reference so any data element in question can be traced back to its source.
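One way to picture source weighting and traceability is the following Python sketch. Everything in it is an assumption made for illustration: the source names, the quality scores in `SOURCE_QUALITY`, and the `Observation` and `weighted_estimate` helpers are hypothetical and do not come from the practices cited above.

```python
from dataclasses import dataclass

# Hypothetical quality scores per source (0.0 = untrusted, 1.0 = verified).
SOURCE_QUALITY = {"crm": 0.9, "web_form": 0.6, "social_media": 0.3}


@dataclass(frozen=True)
class Observation:
    value: float    # the data element, e.g. an estimated age
    source: str     # which system supplied the value
    record_id: str  # identifier within that source, kept for traceability


def weighted_estimate(observations):
    """Average the values, weighting each by the quality score of its source.

    Returns the estimate together with the lineage, a list of
    (source, record_id) pairs, so the result can be traced back.
    """
    weights = [SOURCE_QUALITY.get(o.source, 0.0) for o in observations]
    total = sum(weights)
    if total == 0:
        raise ValueError("no observation comes from a scored source")
    estimate = sum(o.value * w for o, w in zip(observations, weights)) / total
    lineage = [(o.source, o.record_id) for o in observations]
    return estimate, lineage


# Hypothetical usage: a verified CRM value outweighs a social media value.
age, lineage = weighted_estimate([
    Observation(40.0, "crm", "crm-1001"),
    Observation(30.0, "social_media", "sm-77"),
])
print(round(age, 1), lineage)
```

Keeping the lineage next to the estimate means a disputed value can be followed back to the record that produced it.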
Ultimately, the best safety net for the accuracy of personal data (and, in turn, for better data privacy practices) is to encourage and invite, not just provision, a process for consumers to access, review and correct information that has been collected about them. Further, the consumer review process needs to be easy to use and free to the consumer. This is daunting to many early adopters of big data because they often collect large volumes of data they never even use. There may be a fear of letting consumers see just how much detailed personal data has been collected about them, but this level of transparency is the best way to achieve consumer trust and confidence in decisions being made using big data. Credit reporting entities have long offered consumer data access, review and correction procedures, and doing so is a U.S. regulatory requirement for that industry. Similarly, privacy notices, statements or disclosures on websites, which include contact details for questions or concerns, are another best practice that improves transparency and provides a way to address incorrect data.
The Big Data Conundrum
One of the most contentious privacy concepts for enterprises is the idea of obtaining consent or permission to collect and use personal data. If it were possible to turn the clock back and start over, this would be an ideal ground rule. However, asking individuals for their consent to collect their personal data may no longer be adequate, due to the sheer volumes of personal data that have already been collected and extensively shared. The hard truth is that it is impossible to identify every organization that may have collected information about an individual.
A practice that can help individuals restore “control” of their personal data would be to allow them to have their data removed and purged altogether. Of course, big data users are not inclined to offer this feature and it is the “acid test” of whether consumers would recognize and buy in to the advantages of allowing their data to be used. The ability to have data removed is certainly a requirement that regulators might consider in the interests of protecting consumer privacy. As big data uses continue to evolve, the functional capability to allow deletions or removal of specific data fields related to an individual should be planned for in the technical design and architecture stage of any enterprise big data implementation.
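As a rough illustration of planning for deletion at design time, the Python sketch below indexes every record by the individual it describes, so that a purge or a field-level removal does not require scanning the whole data set. The `PersonalDataStore` class, its method names and the sample identifiers are hypothetical; a real big data platform would also need to handle backups, replicas and derived data sets.

```python
from collections import defaultdict


class PersonalDataStore:
    """Toy in-memory store that tracks which records belong to which person."""

    def __init__(self):
        self._records = {}                   # record_id -> dict of fields
        self._by_subject = defaultdict(set)  # subject_id -> set of record_ids

    def add(self, record_id, subject_id, fields):
        """Add one record and index it under the individual it describes."""
        if record_id in self._records:
            raise KeyError(f"record {record_id!r} already exists")
        self._records[record_id] = {**fields, "subject_id": subject_id}
        self._by_subject[subject_id].add(record_id)

    def remove_field(self, subject_id, field):
        """Drop one field from all of a person's records; return count changed."""
        if field == "subject_id":
            raise ValueError("use purge_subject to remove an individual")
        changed = 0
        for record_id in self._by_subject.get(subject_id, ()):
            if field in self._records[record_id]:
                del self._records[record_id][field]
                changed += 1
        return changed

    def purge_subject(self, subject_id):
        """Delete every record about one person; return how many were removed."""
        record_ids = self._by_subject.pop(subject_id, set())
        for record_id in record_ids:
            del self._records[record_id]
        return len(record_ids)

    def __len__(self):
        return len(self._records)


# Hypothetical usage: remove one field, then purge the individual entirely.
store = PersonalDataStore()
store.add("r1", "alice", {"zip": "30301", "search_term": "insurance"})
store.add("r2", "alice", {"purchase": "book"})
store.add("r3", "bob", {"zip": "30322"})
print(store.remove_field("alice", "zip"), store.purge_subject("alice"), len(store))
```

The design choice worth noting is the per-individual index: without it, honoring a removal request means searching every record, which is what makes deletion hard to retrofit.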
Similarly, an intriguing option to make the use of personal data more palatable to individuals is to perform “anonymization” on any personal data. Unfortunately, anonymization, which refers to removing any identifying fields or attributes, has not proven to be viable. As far back as 2000, Latanya Sweeney, Ph.D., who is now a professor of Government and Technology in Residence at Harvard University, demonstrated that 87% of all Americans could be uniquely identified with only three pieces of information: ZIP code, birthdate, and gender, all of which can be found in public records. Considering these research findings, even with an anonymization system in place, the risk of re-identification would be high for nearly any individual consumer living in the United States.
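The re-identification risk can be checked on a data set with a few lines of Python. The sketch below counts how many records are unique on the combination of ZIP code, birthdate and gender; the sample records and the `unique_fraction` helper are made up for illustration and are not Sweeney's data or method.

```python
from collections import Counter

# Hypothetical "anonymized" records: names removed, quasi-identifiers kept.
records = [
    {"zip": "30301", "birthdate": "1975-03-14", "gender": "F", "note": "A"},
    {"zip": "30301", "birthdate": "1975-03-14", "gender": "F", "note": "B"},
    {"zip": "30301", "birthdate": "1982-11-02", "gender": "M", "note": "C"},
    {"zip": "30322", "birthdate": "1990-07-21", "gender": "F", "note": "D"},
]

QUASI_IDENTIFIERS = ("zip", "birthdate", "gender")


def unique_fraction(rows, keys=QUASI_IDENTIFIERS):
    """Return the fraction of rows whose combination of key values is unique."""
    if not rows:
        return 0.0
    combos = [tuple(row[k] for k in keys) for row in rows]
    counts = Counter(combos)
    unique = sum(1 for combo in combos if counts[combo] == 1)
    return unique / len(rows)


fraction = unique_fraction(records)
print(f"{fraction:.0%} of records are unique on {', '.join(QUASI_IDENTIFIERS)}")
```

Any record that is unique on these three fields could be matched to a named person by anyone holding a public record with the same values, even though no name appears in the data.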
So with all these issues and tactics in mind, the common denominator to enable privacy in the burgeoning era of enterprise big data use is to ensure reliable and accurate personal data and to interpret it appropriately. Businesses that integrate the privacy principles described above into their development and use of big data will be the ones to experience the strongest outcomes, or, perhaps, the least amount of consumer pushback.
 “Big Data: Lessons from the Leaders”, Economist Intelligence Unit, August 2012
 OECD Privacy Principles: http://oecdprivacy.org
 http://www.cloudsecurityalliance.org, see Security Guide and Cloud Controls Matrix documents
 “Newly Emerging Best Practices for Big Data”, a Kimball Group Whitepaper by Ralph Kimball, Nov. 12, 2012
 NIST Special Publication 800-145: “The NIST Definition of Cloud Computing”
 “(IDG) Converged Storage: A Next Gen Storage Strategy for Big Data”, Hewlett-Packard Co., Nov. 12, 2012
 “Anonymized” data really isn’t—and here’s why not” by Nate Anderson, Ars Technica, September 8, 2009, http://arstechnica.com/tech-policy/2009/09/your-secrets-live-online-in-databases-of-ruin/
About the author:
Lynn Goodendorf, CIPP, CISSP, leads the data privacy practice at VerSprite, LLC, a security risk advisory firm. Contact VerSprite at versprite.com or on Twitter @VerSprite. Send comments on this column to firstname.lastname@example.org.