Engineering simulations are essential in reducing costs and speeding time to production. As a result, oil and gas companies have empowered their engineers with the most advanced high performance computing (HPC) resources.

HPC solutions minimize the time required for many oil and gas industry tasks, including processing massive amounts of data and performing complex simulations. Today, reservoir simulation models are being used in new-field development, as well as in developed fields where production forecasts are needed to help make investment decisions, identify the number of wells required, improve oil recovery, identify opportunities to increase production in heavy oil deposits, and many more critical aspects of production.


This figure plots the elapsed time in seconds, as measured by wall clock time, for a variable number of nodes on three network types: 20 Gb/second InfiniBand, 10 GigE, and 1 GigE. (Images courtesy of HPC Advisory Council)

Here, “productivity” means executing the maximum number of jobs per day. Measured this way, overall throughput matters more than the wall clock time of any single application job, because getting the most work from the system is the key to solving problems and getting the best return on investment.
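As a rough illustration of this definition, the sketch below computes jobs per day from a job's wall-clock time and the number of jobs running concurrently. The times used are made up for illustration and are not measurements from the study.

```c
#include <stdio.h>

/* Jobs per day for a given per-job wall-clock time and number of
 * concurrent jobs; the values below are illustrative only. */
static double jobs_per_day(double wall_clock_seconds, int concurrent_jobs)
{
    return (86400.0 / wall_clock_seconds) * concurrent_jobs;
}

int main(void)
{
    /* One job using the whole cluster vs. several smaller jobs in parallel. */
    printf("1 job on the cluster:  %.1f jobs/day\n", jobs_per_day(3600.0, 1));
    printf("8 jobs in parallel:    %.1f jobs/day\n", jobs_per_day(7200.0, 8));
    return 0;
}
```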

Maximizing productivity in today’s cluster platforms requires using enhanced data messaging techniques even on a single-server platform. These techniques also help with parallel simulations by using appropriate cluster interconnects.

One of HPC’s strengths is its ability to achieve the best sustained performance by driving CPU performance toward its limits. Compute clusters have become the most used hardware solution for HPC simulations. Cluster productivity and flexibility are two of the most important factors for cluster hardware and software configuration.

High-speed interconnect technology


This figure shows productivity results on the physical cluster for different job placements. “1 Job/Node” means a single job uses all of the cores on each of its nodes; with “2 Jobs/Node,” each job uses two cores per socket on each server; and with “8 Jobs/Node,” each job uses a single core on each node it runs on.
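The article does not describe the mechanism used to place jobs on cores. On Linux clusters, one common building block is pinning a job's processes to specific cores; the hypothetical sketch below pins the calling process to two cores with sched_setaffinity, in the spirit of the “2 Jobs/Node” placement. The core numbers are assumptions, not details from the study.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* Pin the calling process to two cores, e.g. one pair per socket.
 * Core numbers are hypothetical and depend on the node's topology. */
int main(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);   /* first core of socket 0 (assumed numbering) */
    CPU_SET(1, &mask);

    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pid %d pinned to cores 0-1\n", (int)getpid());
    return 0;
}
```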

Choosing the right interconnect technology is essential for maximizing HPC system efficiency. Slower interconnects can delay data transfers, causing poor use of compute resources, slow simulation execution, and low productivity.

An interconnect that requires CPU cycles as part of the networking process reduces the compute resources available to the application and hurts productivity. Using the CPU in the data-passing path also increases system jitter, which in turn limits the cluster’s scalability. Jitter here is the inability of compute nodes to synchronize their communication, forcing nodes to wait to send or receive data.
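A simple way to see this effect, independent of any particular application, is to time how long ranks wait in a barrier. The minimal MPI sketch below (not part of the study) reports the worst-case barrier wait across ranks as a rough proxy for jitter.

```c
#include <mpi.h>
#include <stdio.h>

/* Rough jitter probe: time spent waiting in MPI_Barrier indicates how
 * far out of step the ranks are. Illustrative sketch only. */
int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    MPI_Barrier(MPI_COMM_WORLD);
    double wait = MPI_Wtime() - t0;

    double max_wait;
    MPI_Reduce(&wait, &max_wait, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("worst barrier wait: %f s\n", max_wait);

    MPI_Finalize();
    return 0;
}
```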

InfiniBand has become the most widely deployed high-speed interconnect for large-scale simulations, replacing proprietary or lower-performance solutions. The InfiniBand Architecture is an industry-standard fabric designed to provide high-bandwidth, low-latency computing that scales to tens of thousands of nodes with multiple CPU cores per server platform, while making efficient use of compute processing resources. Current InfiniBand solutions can deliver up to 40 Gb/second of bandwidth between servers and up to 120 Gb/second between switches. InfiniBand is designed to be fully offloaded, meaning all communications are handled within the interconnect without involving the CPU, which enables systems to scale up with near-linear performance.
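As a basic sanity check on such a fabric, the standard libibverbs API can be used to list the InfiniBand devices a node exposes. This small sketch is an illustration only and not part of the study’s methodology.

```c
#include <infiniband/verbs.h>
#include <stdio.h>

/* List the InfiniBand devices visible on this node via libibverbs. */
int main(void)
{
    int num = 0;
    struct ibv_device **devices = ibv_get_device_list(&num);
    if (!devices) {
        perror("ibv_get_device_list");
        return 1;
    }
    for (int i = 0; i < num; i++)
        printf("device %d: %s\n", i, ibv_get_device_name(devices[i]));
    ibv_free_device_list(devices);
    return 0;
}
```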

Evaluating performance
Schlumberger’s Eclipse Reservoir Engineering software was selected for performance investigation because it is a widely used oil and gas reservoir numerical simulation suite. Like many other HPC applications, it can run on a complex combination of hardware and software components.


A look at Eclipse MPI message distribution profiling shows message distribution as a function of the number of nodes for the primary data-exchange MPI functions. The split between small and large network messages remains roughly constant as the number of nodes per job increases. Small network messages (less than 128 bytes) handle synchronization between the multiple application processes and server nodes, and account for 40% of the messages.


Maximizing Eclipse performance requires a deep understanding of how each component impacts the overall solution. The benchmark case, provided by Schlumberger, was a 4,000,000-cell (2,048 × 200 × 10) three-phase black-oil model with approximately 800 wells. Performance and profiling analysis was carried out as part of the HPC Advisory Council research activities using the organization’s High-Performance Center.


This chart shows the total network throughput from a single node per job. For eight nodes in the job, each node uses almost 400 MB/second of network bandwidth (top graph). In a 16-node cluster configuration (middle graph), each node uses around 800 MB/second of network bandwidth. For the 24-node configuration, each node reaches almost 1,200 MB/second of network bandwidth. This trend indicates the need for the fastest possible cluster interconnect to maintain application scalability.

For the evaluation, the test bed cluster consisted of 24 Dell PowerEdge SC1435 servers, each with two quad-core AMD Opteron 2382 processors at 2.6 GHz and 16 GB of memory (eight 2 GB 800 MHz registered DDR2 DIMMs). The operating system was Red Hat Enterprise Linux 5 Update 1, and the Message Passing Interface (MPI) library was Platform MPI 5.6.5. Servers were connected via Mellanox MT25408 ConnectX 20 Gb/second InfiniBand, using InfiniBand drivers from the OpenFabrics Enterprise Distribution 1.3 software stack.

Because InfiniBand provides higher bandwidth and lower latency, it enables higher performance and scalability for Eclipse when compared to 1 GigE and 10 GigE. With Ethernet (either 1 GigE or 10 GigE), the wall clock time actually increased beyond eight nodes. With InfiniBand, on the other hand, there was a continuous reduction in run time with increasing number of nodes.
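Bandwidth differences of this kind are usually verified with a microbenchmark rather than a full simulation run. The two-rank MPI ping-pong sketch below, with an arbitrary message size and iteration count, shows one common way to measure sustained point-to-point bandwidth on a given interconnect; it is an illustration, not the benchmark used in the study.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Two-rank ping-pong: a common way to compare interconnect bandwidth.
 * Run with exactly two ranks, e.g. one per node. */
int main(int argc, char **argv)
{
    enum { BYTES = 1 << 20, ITERS = 100 };   /* arbitrary choices */
    int rank, size;
    char *buf = malloc(BYTES);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "run with two ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double elapsed = MPI_Wtime() - t0;

    if (rank == 0)
        printf("bandwidth: %.1f MB/s\n", 2.0 * ITERS * BYTES / elapsed / 1e6);

    MPI_Finalize();
    free(buf);
    return 0;
}
```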

Cluster productivity
For the test, productivity was defined as the number of application jobs that could be completed in a given time, in this case, one day. Because good scaling has been shown when using the InfiniBand interconnect between nodes, the question became, “How much scalability will occur if a single job is limited to running within each cluster node using all of the cores?”

With the increased complexity of high-performance applications, a single job consuming all cluster resources can create bottlenecks. The test would allow bottlenecks and productivity to be explored by comparing a single job run on the entire cluster vs. several jobs run in parallel.

In the productivity test, CPU-to-CPU communication within each node was reduced, while CPU-to-CPU communication over the network increased. The net result of this strategy was an increase in productivity of up to a factor of three when running eight jobs in parallel.

Profiling the application is essential to understanding its performance dependency on the other cluster components because profiling can provide useful information for choosing the most efficient interconnect and MPI library. It also helps to identify critical communication sensitivity points that can influence the application’s performance, scalability, and productivity. Profiling also can help explain performance differences between platforms and between physical and virtualized environments.
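The article does not name the profiling tool used. One portable approach is the MPI standard’s PMPI profiling interface, which lets a wrapper intercept calls such as MPI_Send. The hypothetical sketch below tallies small (under 128 bytes) versus large messages, mirroring the message-size split discussed earlier; it is a sketch of the general technique, not the study’s tooling.

```c
#include <mpi.h>
#include <stdio.h>

/* Hypothetical PMPI wrapper: count small (<128 B) vs. large messages
 * sent through MPI_Send. Link this file ahead of the MPI library. */
static long small_msgs = 0, large_msgs = 0;

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    int size;
    MPI_Type_size(datatype, &size);
    if ((long)count * size < 128)
        small_msgs++;
    else
        large_msgs++;
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}

int MPI_Finalize(void)
{
    /* Each rank prints its own tally before shutting MPI down. */
    printf("small messages: %ld, large messages: %ld\n",
           small_msgs, large_msgs);
    return PMPI_Finalize();
}
```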

This study showed that a high-speed low-latency interconnect is essential for high performance and scalability. The InfiniBand interconnect outperformed Ethernet by more than 400% and showed better scalability than Ethernet. With InfiniBand, Eclipse performance improved as additional compute nodes were added, while Ethernet showed few performance gains after 16 nodes.

The study also established that running multiple jobs in parallel across the cluster increases productivity.