The world is now entering a new era of scale in which the amount of data processed and stored by enterprises is breaking existing architectural constructs. A major reason for this breakdown is that data are growing along two axes: volume and usage.

The industry has coined the expression “big data” to describe this, and although there is no universally accepted definition, International Data Corp. defines big data as “a new generation of technologies and architectures designed to economically extract value from very large volumes of a wide variety of data by enabling high-velocity capture, discovery, and/or analysis.”

The massive increase in scale is occurring for a number of reasons. The one with the most impact on the oil and gas industry is the enormous growth in machine- and user-generated data such as microseismic, multicomponent 4-C, and time-lapse seismic data from distributed control and surveillance systems; downhole sensors; mobile devices; and myriad other sources. Cost pressures also are driving consolidation of data centers; enterprises can no longer afford to let each business unit run its own IT infrastructure. Planned moves to cloud computing and the demands of hundreds of thousands of users on fewer centralized systems all contribute to the need for new thinking to accommodate this increased scale.

Big Data will challenge companies to master their ABCs – analytics, bandwidth, and content. (Image courtesy of NetApp)

A change in mindset

For decades, the industry has been living on the frontier of data storage, compute, and visualization technologies in areas such as seismic imaging, real-time data, and reservoir modeling. Companies manage petabytes of data, and their data volumes are growing between 30% and 70% a year.

Storing large amounts of data is not the problem; the real challenge is acquiring, processing, and managing such voluminous data, turning them into insight, making the information available to the right people, and doing it all in much shorter time frames. Together, these forces are putting enormous pressure on existing infrastructures, from compute and applications to the network and, especially, the data storage platform. Traditional approaches cannot scale to ingest all of the data and to analyze, deliver, and store them at the speeds required in the new big data era. Big data is breaking today's storage infrastructure along three major axes:

Complexity. Data are no longer just text and numbers; data deal with real-time events and shared infrastructure. The information is now linked, it is high-fidelity, and it consists of multiple data types. Applying normal algorithms for search, storage, and categorization is becoming much more complex and inefficient;

Speed. High-definition video, sensor data, seismic data – all of these have very high effective ingestion rates. Businesses have to keep up with the data flow both to make the information useful and to drive ever faster business outcomes; and

Volume. All collected data must be stored in a location that is secure and always available. With such high volumes of data and such large files, IT teams have to make decisions about how, where, and how long to store data without increasing operational complexity. This can cause the infrastructure to quickly break on the axis of volume.

NetApp has divided the solution sets for managing data at scale into three main areas called the “Big Data ABCs”: analytics, bandwidth, and content. Each area has its own specific challenges and unique infrastructure requirements.

Providing efficient analytics for extremely large datasets is critical. Companies are laying the foundations for the digital oil field, and monitoring technologies are streaming hundreds of gigabytes of information a day for a single field. With new computational approaches such as Hadoop and next-generation data warehouses from vendors such as Teradata, companies will be able to gain increased insight from these data, predict future performance, and solve problems in real time.
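As a rough illustration of how such analytics might be structured, the sketch below shows a Hadoop Streaming-style reducer written in Python. The input format is purely hypothetical: it assumes an upstream mapper has emitted tab-separated lines of well ID and pressure reading, already grouped and sorted by well ID (as Hadoop Streaming guarantees), and it computes an average pressure per well.

```python
#!/usr/bin/env python3
# Minimal Hadoop Streaming-style reducer (sketch): average pressure per well.
# Assumes hypothetical mapper output of the form "<well_id>\t<pressure>",
# with records already sorted/grouped by well_id.
import sys

current_well = None
total = 0.0
count = 0

for line in sys.stdin:
    well_id, value = line.rstrip("\n").split("\t")
    if well_id != current_well:
        if current_well is not None:
            print(f"{current_well}\t{total / count:.2f}")
        current_well, total, count = well_id, 0.0, 0
    total += float(value)
    count += 1

if current_well is not None:
    print(f"{current_well}\t{total / count:.2f}")
```

The same aggregation could just as easily be expressed in a SQL data warehouse; the point is that the heavy lifting is spread across many nodes rather than pulled back to a single server.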

Bandwidth solutions focus on obtaining better performance for very fast workloads like seismic imaging. Companies are acquiring higher resolution and higher density wide-azimuth datasets and are using much more complex processing algorithms to meet the reservoir challenges around deep water, subsalt, and presalt. Trace densities now are in the millions of traces per square mile, and channel counts are in the hundreds of thousands. All of this is pushing past the limits of today’s processing facilities and infrastructures. In these environments it is common to talk about data throughputs in the tens of gigabytes per second and storage densities approaching two petabytes in the space of a single computer room floor tile.
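To put those figures in perspective, a back-of-envelope calculation (using assumed, illustrative survey parameters rather than figures from any specific project) shows how quickly trace densities translate into raw data volume and how long even one full pass over that volume takes at a given throughput:

```python
# Back-of-envelope sizing for a wide-azimuth survey.
# All parameters below are illustrative assumptions, not measured values.
traces_per_sq_mile = 2_000_000     # "millions of traces per square mile"
survey_area_sq_miles = 1_000       # assumed survey footprint
samples_per_trace = 2_500          # e.g. a 10-s record at 4-ms sampling
bytes_per_sample = 4               # 32-bit floating-point amplitudes

total_bytes = (traces_per_sq_mile * survey_area_sq_miles
               * samples_per_trace * bytes_per_sample)
print(f"Raw prestack traces: {total_bytes / 1e12:.0f} TB")

throughput = 10e9                  # sustained 10 GB/s, per the text
print(f"One full pass at 10 GB/s: {total_bytes / throughput / 3600:.1f} hours")
```

Multiply that raw volume by repeated processing passes, intermediate products, and multiple surveys, and the multipetabyte environments described above follow quickly.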

The content solution area focuses on the need to provide boundless, secure, scalable data storage. Content solutions must enable storing virtually unlimited amounts of data so enterprises can store as much data as they want, find them when they need them, and keep them forever. It is estimated that over the next five years, digital archive capacity will grow nearly 10 times. This is ushering in new object-based storage solutions and access methods such as the cloud data management interface to address the need for petabyte-scale, globally distributed repositories spanning multiple sites. Enterprise-level efficiency features such as data deduplication and lossless compression also are helping to stem the tide of capacity growth; they can reduce the storage requirements for both pre- and post-stack seismic data by as much as 50% – significant savings for the multipetabyte environments typical in upstream oil and gas.
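The deduplication idea itself is straightforward, even though production storage systems implement it far more elaborately. The sketch below is a simplified, fixed-block illustration (not how any particular product works): it hashes fixed-size blocks and reports how many repeat blocks already seen.

```python
# Simplified fixed-block deduplication sketch; real systems typically use
# variable-length chunking, fingerprint indexes, and inline or post-process
# pipelines. Parameters and sample data are illustrative only.
import hashlib

def duplicate_fraction(data: bytes, block_size: int = 4096) -> float:
    """Fraction of fixed-size blocks that repeat an earlier block."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    unique = {hashlib.sha256(block).hexdigest() for block in blocks}
    return 1 - len(unique) / len(blocks)

# A synthetic, highly repetitive "dataset" dedupes almost entirely.
sample = (b"HEADER" + b"\x00" * 4090) * 1000 + bytes(range(256)) * 16
print(f"Duplicate blocks: {duplicate_fraction(sample):.1%}")
```

Real seismic volumes are rarely this repetitive, which is why deduplication is paired with lossless compression to reach reductions in the range cited above.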

Data life cycle

The Big Data ABCs should not be seen as distinct silos but rather as stages in a data life cycle through which data flow. For example, an operator might see massive amounts of data coming from seismic imaging (Big Bandwidth); those data then need to be made available to researchers all over the globe to be analyzed (Big Analytics) and certainly need to be kept for a very long period of time (Big Content).

The trend toward bigger datasets offers opportunities to spark innovation, deliver new insights, and solve much bigger problems than before. However, many of today's legacy systems cannot effectively scale to support the new techniques required to create value from these data. Companies need to deploy new technology stacks and new design methodologies to overcome these barriers.