Big Data Security?

With increased customer adoption, there are now hundreds of organizations attempting to capitalize on the “big data” movement. One could make a full-time job just out of attending all of the conferences: IEEE, TechCon, Strata, Structure, Government, BioMedicine, to name just a few.

The most popular big data application, Hadoop, began as part of Yahoo’s search engine in 2006 and has since become the primary method for building a warehouse of unstructured data. These data warehouses can start with just a few servers, even running as virtual machines, and scale to hundreds of petabytes running on thousands of CPU cores.

Batch-oriented analytic applications are best suited for big data solutions; jobs are scheduled to run as resources allow, taking anywhere from a few minutes to a few days depending on the amount of data being processed and the processing power of the cluster in use. While the data warehouse was originally designed to process large amounts of public information, demand is increasing for it to house more sensitive, business-critical data, and security is a growing concern.
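
To make the batch model concrete, here is a minimal sketch of a Hadoop MapReduce job in Java that counts events per source IP across a set of log files. The class name, paths, and the assumption that the first whitespace-separated field of each log line is an IP address are illustrative, not taken from any particular deployment.

```java
// Minimal sketch of a batch Hadoop MapReduce job that counts events per
// source IP in whitespace-separated log lines. The first-field-is-IP layout
// and all names here are illustrative assumptions, not a real schema.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class EventsPerIp {

  public static class IpMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text ip = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\\s+");
      if (fields.length > 0 && !fields[0].isEmpty()) {
        ip.set(fields[0]); // assumed: first field is the source IP
        ctx.write(ip, ONE);
      }
    }
  }

  public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text ip, Iterable<LongWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      long total = 0;
      for (LongWritable c : counts) {
        total += c.get();
      }
      ctx.write(ip, new LongWritable(total)); // one (ip, count) pair per IP
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "events-per-ip");
    job.setJarByClass(EventsPerIp.class);
    job.setMapperClass(IpMapper.class);
    job.setCombinerClass(SumReducer.class); // pre-aggregate on each mapper
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Submitted with something like `hadoop jar events.jar EventsPerIp /logs/in /logs/out`, the job emits one (IP, count) pair per line; on a small cluster over a few gigabytes it finishes in minutes, while the same code scales up to the multi-day jobs described above.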

There are two aspects to consider when we talk about big data security:

  • How to use big data solutions to process information in a manner that improves the security posture of an organization.
  • How best to protect big data infrastructure, and thus ensure the security of the information contained within.

Security Applications Using Big Data:

More and more organizations are pulling activity data (logs) into Hadoop to analyze it more effectively. Correlating disparate log data has traditionally been difficult, expensive, or both; a sketch of one such correlation job follows the list below.

  • The University of Montana now offers classes to educate students on cyber security, including how to use big data effectively.
  • The IRS uses Hadoop with data from eBay and Facebook, combined with banking data, to detect tax evasion.
  • The NSA pulls assorted surveillance data into a Hadoop data warehouse and processes it to detect suspicious activity.
  • Twitter, Facebook, Netflix, Yahoo, Amazon, and many others use Hadoop to track everything you do.
  • HP’s Fortify CloudScan and its new HAVEn framework both leverage Hadoop to process large amounts of data and improve an organization’s security posture.
  • Enterprises can use the commercial Splunk Hadoop Connect for bidirectional interoperability with a data warehouse.
  • The open source project Flume is fast becoming a standard way to pull log data from web servers and other applications into Hadoop, where other tools can perform analysis.
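
To illustrate the correlation job mentioned above, here is a hedged sketch of a reduce-side join in Java across two log feeds that a tool like Flume might land in separate HDFS directories (say, web and VPN logs). The field positions, directory layout, and the choice of user name as the join key are assumptions made for the example.

```java
// Hedged sketch of reduce-side correlation across two log feeds (directory
// names, field positions, and the user-name join key are all assumptions).
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogCorrelation {

  // Tags each record with the feed it came from, keyed by user, so one
  // reducer sees a given user's activity from every source.
  public static class TagMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text line, Context ctx)
        throws IOException, InterruptedException {
      // Feed name inferred from the HDFS directory the file landed in,
      // e.g. /flume/web or /flume/vpn.
      String feed = ((FileSplit) ctx.getInputSplit()).getPath()
          .getParent().getName();
      String[] f = line.toString().split("\\s+");
      if (f.length >= 2) {
        ctx.write(new Text(f[1]), new Text(feed)); // assumed: field 1 = user
      }
    }
  }

  // Emits users that appear in more than one feed: the cross-source
  // correlation that is awkward with per-silo log tools.
  public static class CorrelateReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text user, Iterable<Text> feeds, Context ctx)
        throws IOException, InterruptedException {
      Set<String> seen = new HashSet<>();
      for (Text feed : feeds) {
        seen.add(feed.toString());
      }
      if (seen.size() > 1) {
        ctx.write(user, new Text(String.join(",", seen)));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "log-correlation");
    job.setJarByClass(LogCorrelation.class);
    job.setMapperClass(TagMapper.class);
    job.setReducerClass(CorrelateReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0])); // e.g. /flume/web
    FileInputFormat.addInputPath(job, new Path(args[1])); // e.g. /flume/vpn
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Grouping by user means a single reducer sees that user's records from every feed, which is exactly the cross-silo view that per-source log tools struggle to provide.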

Securing Big Data:

Some very sensitive data will end up being processed, and potentially available, in a shared environment. Traditional security controls such as authorization, encryption, tokenization, and data classification were not built into Hadoop from the outset. An entire ecosystem of vendors, such as Vormetric, has sprung up to help protect these sensitive big data clusters. There is potentially a higher level of uncertainty here than is seen with cloud technology: combining cloud technology and big data involves significant risk, since things can go wrong and data can be inadvertently exposed to unauthorized access. The following are examples of teams working to secure big data environments:

  • The Cloud Security Alliance (CSA) Big Data Security working group, with leadership from Fujitsu, Verizon, and eBay, is researching the areas of encryption, infrastructure, data analysis, taxonomy, governance, and privacy.
  • Organizations that were early users of Hadoop had to make modifications to improve its internal security controls. Accumulo, for example, was developed and open sourced by the NSA to provide fine-grained control over how data can be accessed.
  • When a large quantity of sensitive data sits in one place, it must be scrubbed so that the data warehouse stores only the data required to solve current business problems. Failure to do this raises both the risk of exposure and the cost of protection. Netflix encountered this while gathering customer data during a 2009 contest. Tokenization is a challenging problem to solve but important to get right; a minimal sketch follows this list.
  • Encryption might seem like a simple solution, despite the inherent performance penalty or “security tax,” but with large data sets and many stakeholders, key management becomes an overwhelming concern. Solutions such as High Cloud Security encrypt data at rest and offer the ability to shred data when needed.
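
To make the tokenization point concrete, here is a minimal sketch in Java of deterministic, keyed-hash tokenization applied to a sensitive field before it is loaded into a shared cluster. The key handling, field choice, and token format are illustrative only; a production system would add a token vault (for detokenization), key rotation, and strict access controls.

```java
// Minimal sketch of deterministic tokenization for a sensitive field before
// it is loaded into a shared cluster. The HMAC key, field choice, and token
// format are illustrative assumptions, not a production design.
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class Tokenizer {
  private final Mac mac;

  public Tokenizer(byte[] secretKey) throws Exception {
    mac = Mac.getInstance("HmacSHA256");
    mac.init(new SecretKeySpec(secretKey, "HmacSHA256"));
  }

  /** Deterministic token: the same account number always maps to the same
   *  token, so joins and counts still work on the scrubbed data.
   *  Synchronized because Mac instances are not thread-safe. */
  public synchronized String tokenize(String accountNumber) {
    byte[] digest = mac.doFinal(accountNumber.getBytes(StandardCharsets.UTF_8));
    return "tok_" + Base64.getUrlEncoder().withoutPadding()
        .encodeToString(digest).substring(0, 24);
  }

  public static void main(String[] args) throws Exception {
    // Demo key only; real keys belong in a key management system.
    Tokenizer t = new Tokenizer(
        "demo-key-not-for-production".getBytes(StandardCharsets.UTF_8));
    // The raw value stays outside the cluster; only the token is loaded.
    System.out.println(t.tokenize("4111-1111-1111-1111"));
  }
}
```

Because the mapping is deterministic, joins and distinct counts still work on scrubbed data, which is usually the point of tokenizing rather than simply redacting fields.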

Many of Hadoop’s improved security controls will be available out of the box in upcoming versions; however, many existing applications will need significant effort to be ported over. As adoption of big data increases, a common approach has been to rotate out older compute clusters as new, upgraded clusters become available. This causes a lot of churn, but it ensures that the newest features, with improved security and faster processing power, are always available. Gone are the days of installing a cluster of thousands of servers and then expecting it to remain in use for years without interruption. Stay tuned as we cover the latest trends in big data security.

Follow me on Twitter: @iben.

