GVRS Performance for Data Reading, Writing, and Compression

Introduction

The primary design goal for Gridfour's GVRS module was to create an API that would provide a simple and reliable interface for working with grid-based data sources. And, in designing the GVRS file format, we sought to develop a specification that would be stable across future versions, portable across different operating environments, and suitable for long-term archiving of data.

None of that would matter if the GVRS software did not provide acceptable performance.

This document describes the results of performance tests for the GVRS software that were aimed at assessing the usefulness of the implementation. The tests involved performing various patterns of data access on two medium-to-large raster data products. The tests were performed on a medium-quality laptop with 16 GB of installed memory and a standard solid-state drive (SSD).

In evaluating the Java implementation for GVRS, we were interested in two kinds of performance: speed of access and data compression ratios. Statistics for these performance areas were collected using multiple tests on large, publicly available data sets. The speed-of-access tests included the time to write new files and the time to read their entire content. The data-compression tests focused on the effectiveness of data compression for the reduction of storage size. They also considered the processing overhead required for applying data compression and decompression.

The results of the testing are discussed in the sections that follow.

The Test Data

We evaluated GVRS performance using two well-known and easily obtained geophysical data sets: ETOPO1 and GEBCO_2019. These products provide world-wide elevation and ocean bottom depth information. Because they describe a phenomenon that is readily observed in ordinary experience, they have the advantage of familiarity and immediacy. The data sets are large enough to present a significant processing load for a test series. And their content is varied enough to thoroughly exercise the GVRS function set.

The following table gives physical details for the test data files. Grid dimensions are given in number of rows by number of columns.

Product      File Size   Grid Dimensions   # Values        Bits/Value
ETOPO1       890 MB      10800x21600       > 233 Million   32.01
GEBCO 2019   11180 MB    43200x86400       > 3.7 Billion   25.18

Both of the test data sets are available free-of-charge on the web. Both are available in different formats. For this test, we selected versions of the files that were stored in the well-known NetCDF data format. The ETOPO1 data is stored as 4-byte integers giving depth/elevation in meters. Its values range from -10803 meters depth to 8333 meters height. The GEBCO_2019 data is stored as floating-point numbers in a moderately compact form that requires just over 3 bytes for each data value. Its values range from -10880.588 meters to 8613.156 meters.
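The value counts quoted for the two products follow directly from their grid dimensions. The sketch below shows that arithmetic. The class and method names are illustrative and are not part of the GVRS API. The one point worth noting is that the GEBCO_2019 cell count exceeds the range of a 32-bit int, so the product is computed as a long.

```java
// Illustrative helper, not part of the GVRS API.
class GridCellCount {

    // Multiply as long values: 43200 x 86400 would overflow a 32-bit int.
    static long cellCount(int nRows, int nColumns) {
        return (long) nRows * (long) nColumns;
    }

    public static void main(String[] args) {
        System.out.println("ETOPO1 cells:     " + cellCount(10800, 21600));
        System.out.println("GEBCO_2019 cells: " + cellCount(43200, 86400));
    }
}
```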

Data-Access Time

Performance at a Glance

We'll begin our discussion of GVRS performance by providing some of our most useful information first. The table below is intended to give a sense of how fast GVRS can read from or write to a data file. The information in the table should provide a basis for making a rough estimate of the kind of performance that can be expected for other raster-based data sources used in Java applications.

Transfer rates in millions of grid cells per second (M/s)
Operation   ETOPO1      ETOPO1 Compressed   GEBCO 2019   GEBCO 2019 Compressed
Read        106.1 M/s   49.9 M/s            68.0 M/s     39.9 M/s
Write       50.7 M/s    3.58 M/s            31.0 M/s     5.92 M/s

Not surprisingly, the transfer rates for compressed data are substantially slower than those for uncompressed. Data compression reduces the number of bytes that have to be read from or written to an external storage medium (such as a disk drive), but requires additional processing to do so. The rate information in the table above reflects the overhead for converting data between its compressed and uncompressed forms.
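As a rough illustration of how the rates in the table might be used for estimation, the sketch below divides a cell count by a transfer rate. The class name, method name, and the example raster size are hypothetical. Actual times will depend on hardware, tile layout, and data content, so the result should be treated as an order-of-magnitude guide only.

```java
// Illustrative helper, not part of the GVRS API.
class TransferEstimate {

    // Estimated seconds to transfer nCells grid cells at a rate given
    // in millions of grid cells per second (M/s).
    static double estimateSeconds(long nCells, double rateMillionsPerSecond) {
        return nCells / (rateMillionsPerSecond * 1.0e6);
    }

    public static void main(String[] args) {
        // Hypothetical raster of 20000 rows by 30000 columns.
        long nCells = 20000L * 30000L;
        // Rates taken from the ETOPO1 columns of the table above.
        System.out.printf("Read, uncompressed: %.1f sec%n", estimateSeconds(nCells, 106.1));
        System.out.printf("Write, compressed:  %.1f sec%n", estimateSeconds(nCells, 3.58));
    }
}
```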

Multi-threading

To reduce the processing time for compressed data, some of the overhead required for compression and decompression can be shared across multiple threads. The GVRS API includes support for the optional use of multiple threads when reading compressed data. It also supports multiple threads when writing data with the integer-based compression used for the ETOPO1 data set. At this time, GVRS does not implement an effective multi-threaded enhancement for the data compression used when writing floating-point formats.

Transfer rates using multiple threads
Operation   ETOPO1 Compressed   GEBCO 2019 Compressed
Read        62.4 M/s            53.8 M/s
Write       4.89 M/s            N/A

More details on the use of multiple threads when reading and writing data are available at the Gridfour wiki page "Using Multiple Threads to Speed Processing".
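The general idea of sharing decompression work across threads can be sketched with the standard java.util.zip classes and an ExecutorService. This is a minimal illustration of the technique under stated assumptions, not the GVRS implementation: the class name, the fixed tile length, and the use of plain Deflate on raw tile bytes are all assumptions made for the example.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Illustrative only: not the GVRS implementation or API.
class ParallelTileInflate {

    // Compress the bytes of one tile using Deflate.
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress one tile whose uncompressed length is known in advance.
    static byte[] inflate(byte[] packed, int tileLength) throws DataFormatException {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(packed);
            byte[] tile = new byte[tileLength];
            int offset = 0;
            while (offset < tileLength) {
                int n = inflater.inflate(tile, offset, tileLength - offset);
                if (n == 0) {
                    throw new DataFormatException("Compressed tile is truncated or corrupt");
                }
                offset += n;
            }
            return tile;
        } finally {
            inflater.end();
        }
    }

    // Decompress all tiles, sharing the work across a fixed pool of threads.
    static List<byte[]> inflateAll(List<byte[]> packedTiles, int tileLength, int nThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            List<Future<byte[]>> pending = new ArrayList<>();
            for (byte[] packed : packedTiles) {
                pending.add(pool.submit(() -> inflate(packed, tileLength)));
            }
            List<byte[]> tiles = new ArrayList<>();
            for (Future<byte[]> result : pending) {
                tiles.add(result.get()); // results are collected in tile order
            }
            return tiles;
        } catch (InterruptedException | ExecutionException ex) {
            throw new IllegalStateException("Tile decompression failed", ex);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int tileLength = 90 * 120 * 2; // hypothetical tile of 2-byte values
        List<byte[]> packed = new ArrayList<>();
        for (int i = 0; i < 16; i++) {
            byte[] tile = new byte[tileLength];
            Arrays.fill(tile, (byte) i);
            packed.add(deflate(tile));
        }
        List<byte[]> tiles = inflateAll(packed, tileLength, 4);
        System.out.println("Decompressed " + tiles.size() + " tiles");
    }
}
```

Because each tile is an independent compressed unit in this sketch, the tasks need no coordination beyond collecting the results in order.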

Writing the Data

The GVRS API can be used to simplify storage operations in data collection or analysis applications. In those applications, it serves as a small part of a larger workflow. So it is reasonable to require that the API does not make an undue contribution to the processing time for the overall job. Therefore, efficiency considerations were an important part of the GVRS design.

The time required for GVRS to write the test data sets was measured using the PackageData application that is included in the Gridfour software distribution. That application transcribed the original NetCDF-formatted data into a matching GVRS-formatted file. Naturally, reading a NetCDF file carries its own processing overhead. The time required to read data from the NetCDF source files was evaluated using a separate program and removed from the data-writing statistics shown in the table below.

Test Set     Grid Dimensions   GVRS File Size   Time to Write
ETOPO1       10800x21600       456 MB           4.6 sec.
GEBCO_2019   43200x86400       13.9 GB          120.2 sec.

Here, one clarification is in order. In the table above, the file size for the non-compressed ETOPO1 is about half that of the original NetCDF-formatted file (which was quoted as 890 MB above). The authors of the original data file chose to store their data using 4-byte integers. But the range of depths and elevations for the data fits within the capacity of a two-byte, short integer. So we chose to store the data in that form. NetCDF also supports short-integer formats. The selection of the larger or smaller representation was a matter of convenience and does not reflect on the capabilities of either product.
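The reasoning behind the choice of a two-byte representation can be checked with a few lines of arithmetic. The sketch below tests whether a value range fits in a Java short and computes the resulting raw storage size. The class and method names are illustrative only.

```java
// Illustrative helper, not part of the GVRS API.
class ShortRangeCheck {

    // True if every value in [min, max] can be stored in a 2-byte short.
    static boolean fitsInShort(int min, int max) {
        return min >= Short.MIN_VALUE && max <= Short.MAX_VALUE;
    }

    // Raw storage, in bytes, for nValues samples at a fixed width.
    static long storageBytes(long nValues, int bytesPerValue) {
        return nValues * bytesPerValue;
    }

    public static void main(String[] args) {
        long nValues = 10800L * 21600L; // ETOPO1 grid
        System.out.println("Fits in short: " + fitsInShort(-10803, 8333));
        System.out.println("Bytes as int:   " + storageBytes(nValues, 4));
        System.out.println("Bytes as short: " + storageBytes(nValues, 2));
    }
}
```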

Reading the Data

Both of the test data products are distributed in the well-known NetCDF data format. The data in the products is stored in row-major order, so the most efficient pattern of access for reading them is one-row-at-a-time.

GVRS is intended to support efficient data access regardless of what pattern is used to read or write the data. Like the source data, the GVRS test files used for this evaluation were written in row-major order. So accessing the GVRS data one grid point at a time in row-major order would traverse the file in the order that it was laid out. Not surprisingly, sequential access of the GVRS files tends to be the fastest pattern of access.

In practice, real-world applications tend to process data from large grids in blocks of neighboring data points. So they usually focus on a small subset of the overall collection. The test patterns used for this evaluation were based on that assumption. Except for the "Tile Load" case, they involved 100 percent data retrieval. The "block" tests used the GVRS block API to read multiple values at once (either an entire row or an entire tile). Other tests read values one-at-a-time.

The following table lists timing results for various patterns of access when reading all data in the indicated files. Times are given in seconds. Results are given for both the standard (non-compressed) variations of the GVRS files and the compressed forms.

Read operations using different access patterns (times in seconds)
Pattern               ETOPO1   ETOPO1 Comp   GEBCO_2019   GEBCO_2019 Comp
Row-Major             2.22     4.67          56.3         95.4
Column-Major          2.49     5.11          75.8         110
Blocks of One Row     0.53     3.50          65.3         69.6
Blocks of Full Tile   0.65     3.52          29.4         67.9
Tile Load             0.27     2.71          20.8         60.3

Reading Data in Row-Major Order versus Blocks

The Row-Major test reads the data one grid cell at a time, traversing the grid in row-major order. The following code snippet illustrates the process. In this example, a GvrsElement named "z" is obtained from the GvrsFile and then used to read values from the file. Because there are 233 million grid cells in ETOPO1, this operation requires 233 million calls to the readValueInt() method.


                GvrsElement zElement = gvrs.getElement("z");
                for (int iRow = 0; iRow < nRowsInRaster; iRow++) {
                    for (int iCol = 0; iCol < nColumnsInRaster; iCol++) {
                        int sample = zElement.readValueInt(iRow, iCol);
                    }
                }

The Row Blocks test treats each row of data as a "block" of grid values. It reads each row of data in the source grid in a single operation. There are 10800 rows of data in the ETOPO1 product and 21600 columns. Each block-read operation returns 21600 values, one for each column. This pattern of access is illustrated by the following code snippet.


                GvrsElement zElement = gvrs.getElement("z");
                for (int iRow = 0; iRow < nRowsInRaster; iRow++) {
                    // read one full row: 1 row by nColumnsInRaster columns
                    int[] block = zElement.readBlock(iRow, 0, 1, nColumnsInRaster);
                }

The advantage of the row block operation is that it performs far fewer data access operations than would be required for reading values from the source one grid cell at a time. However, it is worth noting that the GvrsFile is designed to support both patterns of access efficiently. Internally, it maintains a data cache that minimizes the frequency with which it must read data from disk. The single-cell read pattern entails more overhead than the block-read operation, but only because it involves so many more method calls.
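To illustrate why a cache keeps single-cell reads from turning into repeated disk reads, here is a minimal least-recently-used tile cache built on java.util.LinkedHashMap. It is a sketch of the general technique only. The class name, the loader function standing in for a disk read, and the eviction policy are assumptions for illustration and do not describe the actual GVRS cache.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntFunction;

// Illustrative LRU tile cache; not the GVRS implementation.
class TileCacheSketch {
    private final int capacity;
    private final IntFunction<int[]> loader; // stands in for a read from disk
    private final LinkedHashMap<Integer, int[]> tiles;
    private int loadCount;

    TileCacheSketch(int capacity, IntFunction<int[]> loader) {
        this.capacity = capacity;
        this.loader = loader;
        // accessOrder = true keeps the least recently used entry first
        this.tiles = new LinkedHashMap<Integer, int[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, int[]> eldest) {
                return size() > TileCacheSketch.this.capacity;
            }
        };
    }

    // Return the tile, loading it only if it is not already cached.
    int[] getTile(int tileIndex) {
        int[] tile = tiles.get(tileIndex);
        if (tile == null) {
            tile = loader.apply(tileIndex);
            loadCount++;
            tiles.put(tileIndex, tile);
        }
        return tile;
    }

    int getLoadCount() {
        return loadCount;
    }

    public static void main(String[] args) {
        int tileSize = 100 * 100; // hypothetical 100 x 100 tile
        TileCacheSketch cache = new TileCacheSketch(4, index -> new int[tileSize]);
        long sum = 0;
        // Many single-cell reads that all fall within tile 0
        for (int i = 0; i < tileSize; i++) {
            sum += cache.getTile(0)[i];
        }
        System.out.println("Cell reads: " + tileSize
                + ", simulated disk reads: " + cache.getLoadCount());
    }
}
```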

The Column-Major pattern is similar to the Row-Major pattern, but it uses columns as the outer loop rather than rows.


                GvrsElement zElement = gvrs.getElement("z");
                for (int iCol = 0; iCol < nColumnsInRaster; iCol++) {
                    for (int iRow = 0; iRow < nRowsInRaster; iRow++) {
                      int sample = zElement.readValueInt(iRow, iCol);
                    }
                }

Referring to the table above, we see that the Column-Major pattern required more time than the Row-Major operation. The difference in data-fetching time between the Row-Major and Column-Major access patterns is due to the fact that the tiles stored in the particular GVRS files used for this test were themselves written to the file in row-major order (rows of tiles). So two tiles that were adjacent in a single row were also adjacent in their file locations. However, two tiles that were adjacent in a single column were not adjacent in the file. Thus the Column-Major pattern required more file seek-and-fetch operations than the Row-Major alternative.
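The effect of storing tiles in rows of tiles can be seen from the index arithmetic. In the sketch below, the tile dimensions (90 rows by 120 columns) are a hypothetical choice made for illustration, and the class and method names are not part of the GVRS API. The step between horizontally adjacent tiles is 1, while the step between vertically adjacent tiles is the number of tiles in a full row of tiles.

```java
// Illustrative layout arithmetic; the tile size used here is an assumption.
class TileLayoutSketch {

    // Position of a tile in the file's sequence of tiles when tiles are
    // written in rows of tiles (row-major order of tiles).
    static int tileIndex(int row, int col, int tileRows, int tileCols, int nColumnsInRaster) {
        int nTilesAcross = (nColumnsInRaster + tileCols - 1) / tileCols;
        return (row / tileRows) * nTilesAcross + (col / tileCols);
    }

    public static void main(String[] args) {
        int tileRows = 90;   // hypothetical tile height
        int tileCols = 120;  // hypothetical tile width
        int nCols = 21600;   // ETOPO1 column count
        int here = tileIndex(0, 0, tileRows, tileCols, nCols);
        int right = tileIndex(0, tileCols, tileRows, tileCols, nCols);
        int below = tileIndex(tileRows, 0, tileRows, tileCols, nCols);
        System.out.println("Step to neighbor in same row of tiles:    " + (right - here));
        System.out.println("Step to neighbor in same column of tiles: " + (below - here));
    }
}
```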

The Cost of Reading Data from the File

The Tile Load test retrieves just one point per tile. Thus the run time for the test is essentially just the time required to read the individual tiles from the data file. The Tile Block test (listed in the table above as "Blocks of Full Tile") follows the same pattern as Tile Load, but loads the entire content of the tile using the readBlock() method. So the difference in time between the two tests is just the overhead of indexing and copying the data from the tile to the result array.

At present, the Tile Block read time for the uncompressed GEBCO_2019 file (29.4 seconds) is significantly lower than that of the Row Block test (65.3 seconds). This result suggests that there may be opportunities to optimize the access code for the read routines.

Access Times for GVRS versus NetCDF

As mentioned above, the source data used for the performance tests described in these notes is distributed in a format called NetCDF. GVRS is not intended to compete with NetCDF. The two data formats are intended for different purposes and to operate in different environments. But the Java API for NetCDF is a well-written module and, as such, provides a good standard for speed of access. Any raster-based file implementation should be able to operate with performance that is at least as good as NetCDF.

The table below gives access times for reading the entire data set for ETOPO1 and GEBCO_2019, using the Java API for NetCDF and GVRS file formats. Both files can be accessed in random order, but the most efficient way of retrieving their content is by following a pattern of access established by their underlying data definition. The NetCDF distributions of the ETOPO1 and GEBCO products can be accessed most efficiently in row-major order (one complete row at a time). The GVRS format can be accessed most efficiently one complete tile at a time. So performance statistics for those two patterns are included in the table below. Because the NetCDF version of GEBCO_2019 is stored in a semi-compressed format, the timing values for the compressed version of GVRS are more relevant than the uncompressed version. Finally, to clarify the differences in behavior between the two APIs, the table repeats the GVRS timing data for row-major access.

Read operations for different data formats (times in seconds)
Format         ETOPO1   GEBCO_2019 Compressed   GEBCO_2019
NetCDF         2.6      132.2                   N/A
GVRS (rows)    2.22     95.4                    56.3
GVRS (tiles)   0.65     67.9                    29.4

The differences in timing reflect the differences in the intention of the two packages. NetCDF is a widely used standard for distributing data (it also has support for a number of languages besides Java, including C/C++, C#, and Python). GVRS is intended more for back-end processing for large raster data sets in cases where the pattern-of-access is arbitrary or unpredictable. When using NetCDF (at least the Java version), it is often helpful if the pattern of data access sticks to the underlying organization of the file. For example, when accessing the NetCDF file in column-major order, the full-grid ETOPO1 read operation took 1184.3 seconds (19 minutes, 44 seconds). The GVRS format took 2.49 seconds.

Of course, we need to be a little candid here. It is possible to devise access patterns that defeat both file formats. The authors of the ETOPO1 NetCDF file specified a layout that was optimized for row-major access and poorly suited to the column-major pattern. Thus, the Java-based NetCDF API had to perform thousands of redundant file-read operations for the test. Even the GVRS API is not immune to an incompatible access pattern. Had we accessed the GVRS and NetCDF files in a completely random pattern for a very large number of data points, we could have achieved equally bad performance for both.

Why GVRS does not use Memory Mapped File Access

The GVRS API does not use Java's memory-mapped file access. Instead, it uses old-school file read-and-write operations. The reasons for this are based on known limitations in Java. First, a single memory-mapped buffer in Java cannot cover more than about 2.1 gigabytes (Integer.MAX_VALUE bytes), and raster files of that size or larger are common in scientific applications. Additionally, there are reports that Java does not always close memory-mapped files and clean up resources when running under Windows. Thus, for the initial implementation, we decided to avoid memory-mapped files.
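The size limit can be observed directly with the standard java.nio classes. The 2.1 gigabyte figure corresponds to Integer.MAX_VALUE bytes, which is the largest region that a single call to FileChannel.map accepts. The following sketch requests a mapping one byte larger than that and reports whether the request is rejected. The class and method names are illustrative, and the temporary file exists only to provide an open channel.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative check of the mapping size limit; not part of the GVRS API.
class MappedSizeLimit {

    // Returns true if FileChannel.map refuses a mapping of the requested size.
    static boolean mapRejectsSize(long size) {
        try {
            Path path = Files.createTempFile("mapped-limit", ".bin");
            try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
                channel.map(FileChannel.MapMode.READ_ONLY, 0L, size);
                return false;
            } catch (IllegalArgumentException ex) {
                return true; // size is outside the range that map() accepts
            } finally {
                Files.deleteIfExists(path);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        long tooLarge = (long) Integer.MAX_VALUE + 1L;
        System.out.println("Mapping of " + tooLarge + " bytes rejected: "
                + mapRejectsSize(tooLarge));
    }
}
```

An application can work around the limit by mapping a large file as several smaller regions, at the cost of extra bookkeeping.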

Data Compression

Although there are many good general-purpose data compression utilities available to the software community, raster-based data files tend to compress only moderately well. For example, consider the compression results for ETOPO1 using two well-known compression formats, Zip and 7z. Since the elevation values in ETOPO1 range from -10803 to 8333, they can be stored comfortably in a two-byte short integer. There are just over 233 million sample points in ETOPO1. Storing them as short integers leads to a storage size of 466,560,000 bytes. Using Zip and 7z compressors on that data yields the following results.

Product             Size (bytes)   Relative Size   Bits/Value
ETOPO1 (standard)   466,560,000    100.0 %         16.00
ETOPO1 Zip          309,100,039    66.3 %          10.60
ETOPO1 7z           201,653,141    43.2 %          6.92

It is also worth noting that general-purpose compression has the disadvantage that in order to access any of the data in the file, it is necessary to decompress the whole thing. For many grid-based applications, only a small portion of a raster file is needed at any particular time.

The GVRS API uses standard compression techniques (Huffman coding, Deflate), but transforms the data so that it is more readily compressed and yields better compression ratios. Each tile is compressed individually, so an application that wants only part of the data in the file can obtain it without decompressing everything else.
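The benefit of transforming data before compressing it can be illustrated with a simple experiment. The sketch below uses a differencing transform (each value is replaced by its difference from the previous value) followed by Deflate. This is an assumption chosen for illustration: it is one simple kind of transform and is not claimed to be the one GVRS applies. The synthetic profile and all names in the code are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.zip.Deflater;

// Illustrative experiment; not the GVRS compression code.
class PredictorSketch {

    // Size in bytes of the Deflate output for an array of 2-byte values.
    static int deflatedSize(short[] values) {
        ByteBuffer bytes = ByteBuffer.allocate(values.length * 2);
        for (short v : values) {
            bytes.putShort(v);
        }
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(bytes.array());
        deflater.finish();
        byte[] buffer = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    // Replace each value with its difference from the previous value.
    static short[] differences(short[] values) {
        short[] diffs = new short[values.length];
        short prior = 0;
        for (int i = 0; i < values.length; i++) {
            diffs[i] = (short) (values[i] - prior);
            prior = values[i];
        }
        return diffs;
    }

    // Invert the differencing transform, recovering the original values.
    static short[] restore(short[] diffs) {
        short[] values = new short[diffs.length];
        short prior = 0;
        for (int i = 0; i < diffs.length; i++) {
            prior = (short) (prior + diffs[i]);
            values[i] = prior;
        }
        return values;
    }

    // A smooth synthetic elevation profile (made-up data).
    static short[] syntheticProfile(int n) {
        short[] values = new short[n];
        for (int i = 0; i < n; i++) {
            double z = 2000.0 * Math.sin(i / 97.0) + 500.0 * Math.sin(i / 13.0);
            values[i] = (short) Math.round(z);
        }
        return values;
    }

    public static void main(String[] args) {
        short[] profile = syntheticProfile(100000);
        System.out.println("Deflate on raw values:  " + deflatedSize(profile) + " bytes");
        System.out.println("Deflate on differences: "
                + deflatedSize(differences(profile)) + " bytes");
    }
}
```

Because the transform is exactly invertible, the smaller output is obtained without any loss of information.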

The GVRS file format supports 3 different data formats:

  1. Integer (4-byte integers)
  2. Float (4-byte, 32-bit IEEE-754 standard floating-point values)
  3. Integer-coded Float (floating-point values scaled and stored as integers)

The GEBCO data is expressed in a non-integral form. While the data can be stored as integer-coded-floats, the integer coding requires some loss of precision. So, to store the data in a lossless form, it needs to be represented using the 4-byte floating-point format. While the floating-point representation of the data preserves all the precision in the original, it requires a different approach to data compression than the integer forms.
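The loss of precision in an integer-coded float can be shown with a short example. The sketch below assumes a simple scale-and-round encoding. The class name, method names, and the exact rounding rule are assumptions for illustration and may differ from the GVRS implementation. Under this assumption, the error introduced by encoding is at most one half divided by the scaling factor.

```java
// Illustrative scale-and-round encoding; the GVRS rule may differ.
class IntegerCodedFloat {

    // Scale the value and round to the nearest integer.
    static int encode(float value, float scale) {
        return (int) Math.round((double) value * scale);
    }

    // Recover an approximation of the original value.
    static float decode(int coded, float scale) {
        return (float) (coded / (double) scale);
    }

    public static void main(String[] args) {
        float depth = -10880.588f; // minimum value quoted for GEBCO_2019
        for (float scale : new float[] {1f, 2f, 4f}) {
            int coded = encode(depth, scale);
            float decoded = decode(coded, scale);
            System.out.println("scale " + scale + ": coded=" + coded
                    + ", decoded=" + decoded
                    + ", error=" + Math.abs(decoded - depth));
        }
    }
}
```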

Unfortunately, the GVRS floating-point implementation does not achieve quite as favorable compression ratios as the integer-based compressors. Floating-point numbers simply do not present the kind of redundancy in their form that facilitates data compression. On the other hand, GVRS's lossless floating-point compression is as good as any solution currently to be found.

The table below shows the relative storage size required for different products and storage options. The entries marked GEBCO x 1, GEBCO x 2, and GEBCO x 4 are scaled integer representations of the original sample points. There are two reasons that the scaled versions compress more readily than the floating-point. First, the GVRS integer compressors are more powerful than the floating-point compressor. Second, the scaling operation truncates some of the fractional part of the original values and, thus, discards some of the information in the original product.

Product          Size (bits/sample)   Number of Samples   Time to Compress (sec)
ETOPO1           4.46                 233,280,000         68.3
GEBCO x 1        2.89                 3,732,480,000       1252.1
GEBCO x 2        3.56                 3,732,480,000       1210.7
GEBCO x 4        4.35                 3,732,480,000       1193.7
GEBCO (floats)   15.41                3,732,480,000       748.2

ETOPO1 data is stored in the form of integers with a small amount of overhead for metadata and file-management elements. As the table above shows, GVRS data compression reduces the raw data for ETOPO1 from 16 bits per sample to 4.46 bits per sample. So the storage required for the data is reduced to about 27.9 % of the original size.
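The relative-size figure is a matter of simple arithmetic on the bits-per-sample values. The sketch below reproduces it and also estimates the storage implied by a given bits-per-sample figure. The names are illustrative, and the byte estimate ignores the metadata and file-management overhead mentioned above.

```java
// Illustrative arithmetic, not part of the GVRS API.
class CompressionRatio {

    // Compressed size as a percentage of the raw size.
    static double relativeSizePercent(double bitsPerSample, double rawBitsPerSample) {
        return 100.0 * bitsPerSample / rawBitsPerSample;
    }

    // Approximate storage in bytes, excluding any file overhead.
    static long estimatedBytes(long nSamples, double bitsPerSample) {
        return Math.round(nSamples * bitsPerSample / 8.0);
    }

    public static void main(String[] args) {
        long nSamples = 233280000L; // ETOPO1
        System.out.printf("Relative size: %.1f %%%n", relativeSizePercent(4.46, 16.0));
        System.out.println("Estimated bytes: " + estimatedBytes(nSamples, 4.46));
    }
}
```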

When deciding whether to use the native floating-point representation of the GEBCO data or the more compact integer-scaled float representation, the main consideration is how much precision is required for the application that uses it. One can reasonably ask just how much precision the GEBCO data really requires. For example, a scaling factor of 1 essentially rounds the floating-point values to their nearest integers. In effect, this process reduces the data to depth and elevation given in integer meters. For ocean data with 1/2 kilometer sample spacing, an accuracy of 1 meter may be more than is truly required. Even on land, it is unlikely that many of the points in the data set are accurate to within 1 meter of the actual surface elevation. However, the extra precision in the floating-point format (or that achieved with a larger scaling factor) may help reduce data artifacts due to quantization noise. Also, for purposes of this test, we were interested in how well the algorithm would work across varying degrees of precision in the converted data. So we applied various scaling factors as shown in the results above.

Future Work

When data compression is applied to a GVRS file, it adds a considerable amount of processing overhead to the data-writing operation. As noted in the Multi-threading section, GVRS does not yet implement an effective multi-threaded enhancement for the compression used when writing floating-point formats. In future work, it may be possible to improve the throughput of those write operations by using a multi-threaded implementation to handle data compression. The structure of the internal data cache could be adapted to support this without major revision.

One item that was not fully explored in this article is the effect of disk speed in read operations. The testing was, after all, conducted using a system equipped with a fast solid-state drive. A future investigation will use a USB external disk drive to see how much extra access time a slower disk requires.

Some of the test results above suggest that there may still be opportunities to improve the access speeds, especially for the block-based file access routines. Future work will involve a careful review of the code used for the readBlock() method.

Conclusion

The figures cited in this article provide a rough guide to estimating the kind of performance an application designer can expect using the GVRS API. They also give an indication of the resources that would be needed to process and store data in GVRS format.

As with any API, the design of GVRS makes assumptions about the way it will be accessed. Our hope is that those assumptions are broad enough that GVRS will provide acceptable performance for most applications.