PDF Version
RealTime Normal Map DXT Compression
J.M.P. van Waveren 
Ignacio Castaño 
February 7th 2008
© 2008, id Software, Inc.
Abstract
Using today's graphics hardware, normal maps can be stored in several compressed formats, that are decompressed on the fly in hardware during rendering. Several objectspace and tangentspace normal map compression techniques using existing texture compression formats are evaluated. While decompression from these formats happens in realtime in hardware during rendering, compression to these formats may take a considerable amount of time using existing compressors. Two highly optimized tangentspace normal map compression algorithms are presented that can be used to achieve realtime performance on both the CPU and GPU. 
1. Introduction
Bump mapping uses a texture to perturb the surface normal to give objects a more geometrically complex appearance without increasing the number of geometric primitives. Bump mapping, as originally described by Blinn [1], uses the gradient of a bump map heightfield to perturb the interpolated surface normal in the direction of the surface derivatives (tangent vectors), before calculating the illumination of the surface. By changing the surface normal, the surface is lit as if it has more detail, and as a result is also perceived to have more detail than the geometric primitives used to describe the surface.
Normal mapping is an application of bump mapping, and was introduced by Peercy et al. [2]. While bump mapping perturbs the existing surface normals of an object, normal mapping replaces the normals entirely. A normal map is a texture that stores normals. These normals are usually stored as unitlength vectors with three components: X, Y and Z. Normal mapping has significant performance benefits over bump mapping, in that far fewer operations are required to calculate the surface lighting.
Normal mapping is usually found in two varieties: objectspace and tangentspace normal mapping. They differ in coordinate systems in which the normals are measured and stored. Objectspace normal maps store normals relative to the position and orientation of a whole object. Tangentspace normals are stored relative to the interpolated tangentspace of the triangle vertices. While objectspace normals can be anywhere on the unitsphere, tangentspace normals are only on the unithemisphere at the front of the surface, because the normals always point out of the surface.
Example of an objectspace normal map (left), and the same normal map in tangentspace (right). 
A normal does not necessarily have to be stored as a vector with the components X, Y and Z. However, rendering from other representations usually comes at a performance cost. A normal could, for instance, be stored as an angle pair (pitch, yaw). However, this representation has the problem that interpolation or filtering does not work properly, because there are orientations in which there may not exist a simple change to the angles to represent a local rotation. Before interpolating, filtering, or calculating the surface illumination for that matter, the angle pair has to be converted to a different representation like a vector, which requires expensive trigonometric functions.
Although a normal map can be stored as a floatingpoint texture, a normal map is typically stored as a signed or unsigned integer texture, because the components of normal vectors take values within a well defined range (usually [1, +1]), and there is a benefit to having the same precision across the whole range without wasting any bits for a floatingpoint exponent. For instance, to store a normal map as an unsigned integer texture with 8 bits per component, the X, Y and Z components are rescaled from real values in the range [1, +1] to integer values in the range [0, 255]. As such, the realvalued vector [0, 0, 1] is converted to the integer vector [128, 128, 255], which, when interpreted as a point in RGB space, is the purple/blue color that is predominant in tangentspace normal maps. To render a normal map stored as an unsigned integer texture, the vector components are first mapped from an integer value to the floatingpoint range [0, +1] in hardware. For instance, in the case of a texture with 8 bits per component, the integer range [0, 255] is mapped to the floatingpoint range [0, +1] by division with 255. Then the components are typically mapped from the [0, +1] range to the [1, +1] range during rendering in a fragment program by subtracting 1 after multiplication with 2. When a signed integer texture is used, the mapping from an integer value to the floatingpoint range [1, +1] is performed directly in hardware.
Whether using a signed or unsigned integer texture, a fundamental problem is that it is not possible to derive a linear mapping from binary integer numbers to the floatingpoint range [1, +1], such that the values 1, 0, and +1 are represented exactly. The mapping in hardware of signed integer textures, used in earlier NVIDIA implementations, does not exactly represent +1. For an nbit unsigned integer component, the integer 0 maps to 1, the integer 2^{n1} maps to 0, and the maximum integer value 2^{n}1 maps to 1  2^{1n}. In other words, the values 1 and 0 are represented exactly, but the value +1 is not. The mapping used for DirectX 10 class hardware is nonlinear. For an nbit signed integer component, the integer 2^{n1} maps to 1, the integer 2^{n1}+1 also maps to 1, the integer 0 maps to 0, and the integer 2^{n1}1 maps to +1. In other words, the values 1, 0 and +1 are all represented exactly, but the value 1 is represented twice.
Signed textures are not supported on older hardware. Furthermore, the mapping from binary integers to the range [1, +1] may be hardware specific. Some implementations may choose to not represent +1 exactly, whereas the conventional OpenGL mapping specifies that 1 and +1 can be represented exactly, but 0 can not. Other implementations may choose a nonlinear mapping, or allow values outside the range [1, +1], such that all three values 1, 0 and +1 can be represented exactly. To cover the widest range of hardware without any hardware specific dependencies, all normal maps used here are assumed to be stored as unsigned integer textures. The mapping from the range [0, +1] to [1, +1] is performed in a fragment program by subtracting 1 after multiplication with 2. This may result in an additional fragment program instruction, which can be trivially removed when a signed texture is used. The mapping used here is the same as the conventional OpenGL mapping which results in an exact representation of the values 1 and +1, but not 0.
An integer normal map texture can typically be stored with 16 (5:6:5), 24 (8:8:8), 48 (16:16:16) or 96 (32:32:32) bits per normal vector. Most of today's normal maps, however, are stored with no more than 24 (8:8:8) bits per normal vector. It is important to realize there are relatively few 8:8:8 bit vectors that are actually close to unitlength. For instance, the integer vector [0, 0, 64], which is dark blue in RGB space, does not represent a unitlength normal vector (the length is 0.5 as opposed to 1.0). The following figure shows the percentage of representable 8:8:8 bit vectors that are less than a certain percentage off from being unitlength.
For instance, if it is not considered acceptable for normal vectors to be more than 5% from unitlength, then only about 15% of all representable 8:8:8 bit vectors can be used to represent normal vectors. Going to fewer bits of precision, like 5:6:5 bits, the number of representable vectors that are close to unitlength decreases quickly.
To significantly increase the number of vectors that can be used, each normal vector can be stored as a direction that is not necessarily unitlength. This direction then needs to be normalized in a fragment program. However, there is still some waste because only 83% of the all 8:8:8 bit vectors represent unique directions. For instance, the integer vectors [0, 0, 32], [0, 0, 64] and [0, 0, 96] all specify the exact same direction (they are multiples of each other). Furthermore, the unique normalized directions are not uniformly distributed over the unitsphere. There are more representations for directions close to the four diagonals of the bounding box of the [1, +1] x [1, +1] x [1, +1] vector space, than there are representations for directions close to the coordinate axes. For instance, there are three times more directions represented within a 15 degrees radius around the vector [1, 1, 1], than there are directions represented within a 15 degrees radius around the vector [0, 0, 1]. The figure below shows the distribution of all representable 8:8:8 bit vectors projected onto the unitsphere. The areas with a low density of vectors are green, and the areas with a high density are red.
distribution of 8:8:8 bit vectors projected on the unitsphere 
On today's graphics hardware, normal maps can also be stored in several compressed formats, that are decompressed in realtime during rendering. Compressed normal maps do not only require significantly less memory on the graphics card, but also generally render faster than uncompressed normal maps, due to reduced bandwidth requirements. Various different ways to exploit existing texture compression formats for normal map compression, have been suggested in literature [7, 8, 9]. Several of these normal map compression techniques, and extensions to them, are evaluated in section 2 and 3.
While decompression from these formats is done realtime in hardware, compression to these formats may take a considerable amount of time. Existing compressors are designed for highquality offline compression, not realtime compression [20, 21, 22]. However, realtime compression is quite useful for transcoding normal maps from a different format, compression of dynamically generated normal maps, and for compressed normal map render targets. In sections 4 and 5 two highly optimized tangentspace normal map compression algorithms are presented, that can be used to achieve realtime performance on both the CPU and GPU.
2. ObjectSpace Normal Maps
Objectspace normal maps store normals relative to the position and orientation of a whole object. A normal in objectspace can be anywhere on the full unitsphere, and is typically stored as a vector with three components: X, Y and Z. Objectspace normal maps can be stored using regular color texture compression techniques, but these techniques may not be as effective, because normal map textures do not have the same properties as color textures.
2.1 ObjectSpace DXT1
DXT1 [3, 4], also known as BC1 in DirectX 10 [5], is a lossy compression format for color textures, with a fixed compression ratio of 8:1. The DXT1 format is designed for realtime decompression in hardware on the graphics card during rendering. DXT1 compression is a form of Block Truncation Coding (BTC) [6] where an image is divided into nonoverlapping blocks, and the pixels in each block are quantized to a limited number of values. The color values of pixels in a 4x4 pixel block are approximated with equidistant points on a line through RGB color space. This line is defined by two endpoints, and for each pixel in the 4x4 block a 2bit index is stored to one of the equidistant points on the line. The endpoints of the line through color space are quantized to 16bit 5:6:5 RGB format and either one or two intermediate points are generated through interpolation. The DXT1 format allows a 1bit alpha channel to be encoded, by switching to a different mode based on the order of the end points, where only one intermediate point is generated and one additional color is specified, which is black and fully transparent.
Although the DXT1 format is designed for color textures this format can also be used to store normal maps. To compress a normal map to DXT1 format, the X, Y and Z components of the normal vectors are mapped to the RGB channels of a color texture. In particular for DXT1 compression each normal vector component is mapped from the range [1, +1] to the integer range [0, 255]. The DXT1 format is decompressed in hardware during rasterization, and the integer range [0, 255] is mapped to the floating point range [0, 1] in hardware. In a fragment program the range [0, 1] will have to be mapped back to the range [1, +1] to perform lighting calculations with the normal vectors. The following fragment program shows how this conversion can be implemented using a single instruction.
# input.x = normal.x [0, 1] # input.y = normal.y [0, 1] # input.z = normal.z [0, 1] # input.w = 0 MAD normal, input, 2.0, 1.0
Compressing a normal map to DXT1 format generally results in rather poor quality. There are noticeable blocking and banding artifacts. Only four distinct normal vectors can be encoded per 4x4 block, which is typically not enough to accurately represent all original normal vectors in a block. Because the normals in each block are approximated with equidistance points on a line, it is also impossible to encode four distinct normal vectors per 4x4 block that are all unitlength. Only two normal vectors per 4x4 block can be close to unitlength at a time, and usually a compressor selects a line through vector space which minimizes some error metric, such that, none of the vectors are actually close to unitlength.
The DXT1 compressed normal map on the right shows noticeable blocking artifacts compared to the original normal map on the left. 
To improve the quality, each normal vector can be encoded as a direction that is not necessarily unitlength. This direction then has to be renormalized in a fragment program. The following fragment program shows how a normal vector can be renormalized.
# input.x = normal.x [0, 1] # input.y = normal.y [0, 1] # input.z = normal.z [0, 1] # input.w = 0 MAD normal, input, 2.0, 1.0 DP3 scale, normal, normal RSQ scale.x, scale.x MUL normal, normal, scale.x
Encoding directions gives the compressor more freedom, because the compressor does not have to worry about the magnitude of the vectors, and a much larger percentage of all representable vectors can be used for the end points of the line through normal space. However, this increased freedom makes compression a much harder problem.
The DXT1 compressed normal map with renormalization on the right compared to the original normal map on the left. 
The above images show that, although the quality is a little bit better, the quality is generally still rather poor. Whether renormalizing in a fragment program or not, the quality of DXT1 compressed objectspace normal maps is generally not considered to be acceptable.
2.2 ObjectSpace DXT5
The DXT5 format [3, 4], also known as BC3 in DirectX 10 [5], stores three color channels the same way DXT1 does, but without 1bit alpha channel. Instead of the 1bit alpha channel, the DXT5 format stores a separate alpha channel which is compressed similarly to the DXT1 color channels. The alpha values in a 4x4 block are approximated with equidistant points on a line through alpha space. The endpoints of the line through alpha space are stored as 8bit values, and based on the order of the endpoints either 4 or 6 intermediate points are generated through interpolation. For the case with 4 intermediate points, two additional points are generated, one for fully opaque and one for fully transparent. For each pixel in a 4x4 block a 3bit index is stored to one of the equidistant points on the line through alpha space, or one of the two additional points for fully opaque or fully transparent. The same number of bits are used to encode the alpha channel as the three DXT1 color channels. As such, the alpha channel is stored with higher precision than each of the color channels, because the alpha space is onedimensional, as opposed to the threedimensional color space. Furthermore, there are a total of 8 samples to represent the alpha values in a 4x4 block, as opposed to 4 samples to represent the color values. Because of the additional alpha channel, the DXT5 format consumes twice the amount of memory of the DXT1 format.
The DXT5 format is designed for color textures with a smooth alpha channel. However, this format can also be used to store objectspace normal maps. In particular, better quality normal map compression can be achieved by using the DXT5 format and moving one of the components to the alpha channel. By moving one of the components to the alpha channel this component is stored with more precision. Furthermore, by encoding only two components in the DXT1 block of the DXT5 format, the accuracy with which these components are stored typically improves as well. For objectspace normal maps there is no clear benefit to moving any particular component to the alpha channel, because the normal vectors may point in any direction, and all values can occur with similar frequencies for all components. When an objectspace normal map does have most vectors in a specific direction, then there is clearly a benefit to mapping the axis most orthogonal to that direction to the alpha channel. However, in general it is not practical to change the encoding on a per normal map basis, because a different fragment program is required for each encoding. The following fragment program assumes the Z component is moved to the alpha channel. The fragment program shows how the components are mapped from the range [0, 1] to the range [1, +1], while the Z component is also moved back in place from the alpha channel.
# input.x = normal.x [0, 1] # input.y = normal.y [0, 1] # input.z = 0 # input.w = normal.z [0, 1] MAD normal, input.xywz, 2.0, 1.0
Just like DXT1 without renormalization, this format results in minimal overhead in a fragment programs. The quality is significantly better than DXT1 compression of objectspace normal maps. However, there are still noticeable blocking and banding artifacts.
The DXT5 compressed normal map on the right compared to the original normal map on the left. 
Using the third channel to store a scale factor like done for the YCoCgDXT5 compression from [24] does not help much to improve the quality. The dynamic range of the individual components is typically too large, or the different components span different ranges that are far apart, while there is only one scale factor for the combined dynamic range.
Just like DXT1 compression of objectspace normal maps, the quality can be improved by encoding a normal vector as a direction that is not necessarily unitlength. The following fragment program shows how to perform the swizzle and renormalization.
# input.x = normal.x [0, 1] # input.y = normal.y [0, 1] # input.z = 0 # input.w = normal.z [0, 1] MAD normal, input.xywz, 2.0, 1.0 DP3 scale, normal, normal RSQ scale.x, scale.x MUL normal, normal, scale.x
Encoding directions gives the compressor a lot more freedom, because the compressor can ignore the magnitude of the vectors, and a much larger percentage of all representable vectors can be used for the end points of the lines through normal space. The normal vectors are encoded using both the DXT1 block of the DXT5 format and the alpha channel, where the end points of the alpha channel are stored without quantization. As such, the potential search space for the end points of the lines can be very large, and high quality compression may take a considerable amount of time.
The DXT5 compressed normal map with renormalization on the right compared to the original normal map on the left. 
On current hardware, the DXT5 format with renormalization in a fragment program results in the best quality compression of objectspace normal maps.
3. TangentSpace Normal Maps
Tangentspace normal vectors are stored relative to the interpolated tangentspace of the triangle vertices. Compression of tangentspace normal maps generally works better than compression of objectspace normal maps, because the dynamic range is lower. The vectors are only on the unithemisphere at the front of the surface (the normal vectors never point into the object). Furthermore, most normal vectors are close to the tip of the unithemisphere with Z close to 1.
Using tangentspace normal maps in itself can be considered a form of compression compared to using objectspace normal maps. A local transform is used to change the frequency domain of the vector components which reduces their storage requirements. The transform does require tangent vectors to be stored at the triangle vertices and, as such, comes at a cost. However, the storage requirements for the tangent vectors is relatively very small compared to the storage requirements for normal maps.
The compression of tangentspace normal maps can be improved by only storing the X and Y components of unitlength normal vectors, and deriving the Z components. The normal vectors are always pointing up out of the surface and the Z is always positive. Furthermore, the normal vectors are unitlength and, as such, the Z can be derived as follows.
Z = sqrt( 1  X * X  Y * Y )
The problem with reconstructing Z from X and Y is that it is a nonlinear operation, and breaks down under bilinear filtering. The problem is most noticeable when interpolating between two normals in the XYplane. Ideally a normal map is scaled up using spherical interpolation of the normal vectors, where the interpolated samples follow the shortest great arc on the unit sphere at a constant speed. Bilinear filtering of a three component normal map, with renormalization in the fragment program, does not result in spherical interpolation at a constant speed, but at least the interpolated samples follow the shortest great arc. With a twocomponent normal map, however, where the Z is derived from the X and Y, the interpolated samples no longer necessarily follow the shortest great arc on the unit sphere. For instance, interpolation between the two vectors in the figure below is expected to follow the dotted line. Instead, however, the interpolated samples are on the arc that goes up on the unit sphere.
Fortunately, realworld normal maps usually do not have many sharp normal boundaries with adjacent vectors close to the XYplane, and most of the normals point straight up. As such, there are usually no noticeable artifacts when bilinearly or trilinearly filtering a two component normal map before deriving the Z components.
Only storing the X and Y components is in essence an orthographic projection of the normal vectors along the Zaxis onto the XYplane. To reconstruct an original normal vector, a projection back onto the unithemisphere is used, by deriving the Z component from the X and Y. Instead of this orthographic projection, a stereographic projection can be used as well. For the stereographic projection the X and Y components are divided by one plus Z as follows, where (pX, pY) is the projection of the normal vector.
pX = X / ( 1 + Z ) pY = Y / ( 1 + Z )
The original normal vector is reconstructed by projecting the stereographically projected vector back onto the unithemisphere as follows.
denom = 2 / ( 1 + pX * pX + pY * pY ) X = pX * denom Y = pY * denom Z = denom  1
The advantage of using the stereographic projection is that the interpolated normal vectors behave better under bilinear or trilinear filtering. The interpolated normal vectors are still not on the shortest great arc, but they are closer, and have less of a tendency to go up on the unithemisphere.
The stereographic projection also causes a more even distribution of the pX and pY components with the angle on the unithemisphere. Although this may seem desirable, it is actually not, because most tangentspace normal vectors are close to the tip of the unithemisphere. As such, there is actually an advantage to using the orthographic projection which results in more representations of vectors with Z close to 1. The compression techniques discussed below use the orthographic projection because for most normal maps it results in better quality compression.
3.1 TangentSpace DXT1
Using tangentspace normal maps only the X and Y components have to be stored in the DXT1 format, and the Z component can be derived in a fragment program. The following fragment program shows how the Z can be derived from the X and Y.
# input.x = normal.x [0, 1] # input.y = normal.y [0, 1] # input.z = 0 # input.w = 0 MAD normal, input, 2.0, 1.0 DP4_SAT normal.z, normal, normal; MAD normal, normal, { 1, 1, 1, 0 }, { 0, 0, 1, 0 }; RSQ temp, normal.z; MUL normal.z, temp;
The following images show a XY_ DXT1 compressed normal map on the right, next to the original normal map on the left. The DXT1 compressed normal map shows noticeable blocking and banding artifacts.
XY_ DXT1 compressed normal map on the right compared to the original normal map on the left. 
Although at first it may seem this kind of compression should produce superior quality, better quality compression can generally be achieved by storing all three components and renormalizing in a fragment program, just like for objectspace normal maps. When only the X and Y components are stored in the DXT1 format, the reconstructed normal vectors are automatically normalized by deriving the Z component. When the X and Y components are distorted due to the DXT1 compression, where all points are placed on a straight line through XYspace, the error in the derived Z can be quite large.
The fragment program shown below for renormalizing the DXT1 compressed normals, is the same as the one used for DXT1 compressed objectspace normal maps with renormalization.
# input.x = normal.x [0, 1] # input.y = normal.y [0, 1] # input.z = normal.z [0, 1] # input.w = 0 MAD normal, input, 2.0, 1.0 DP3 scale, normal, normal RSQ scale.x, scale.x MUL normal, normal, scale.x
The following images show a DXT1 compressed normal map with renormalization on the right, next to the original normal map on the left.
DXT1 compressed normal map with renormalization on the right compared to the original normal map on the left. 
Either way, whether only storing two components in the DXT1 and deriving the Z, or storing all three components in the DXT1 format with renormalization in the fragment program, the quality is rather poor.
3.2 TangentSpace DXT5
Just like for objectspace normal maps, all three components can be stored in the DXT5 format. The best results are usually achieved when storing _YZX data. In other words the X component is moved to the alpha channel. This technique is also known as RxGB compression, and was employed in the computer game DOOM III. By moving the X component to the alpha channel, the X and Y components are encoded separately. This improves the quality because the X and Y components are most independent with the largest dynamic range. The Z is always positive and typically close to 1 and, as such, storing the Z component with the Y component in the DXT1 part of the DXT5 format causes little distortion of the Y component. Storing all three components results in minimal overhead in a fragment program as shown below.
# input.x = 0 # input.y = normal.y [0, 1] # input.z = normal.z [0, 1] # input.w = normal.x [0, 1] MAD normal, input.wyzx, 2.0, 1.0
The following images show that, although the quality is better than DXT1 compression, there are still noticeable banding artifacts.
DXT5 compressed normal map on the right compared to the original normal map on the left. 
Just like for objectspace normal maps the quality can be improved by storing directions that are not necessarily unitlength. The best quality is typically achieved by also moving the X component to the DXT5 alpha channel. The following fragment program shows how the directions are renormalized after moving the X component back in place from the alpha channel.
# input.x = normal.x [0, 1] # input.y = normal.y [0, 1] # input.z = 0 # input.w = normal.z [0, 1] MAD normal, input.wyzx, 2.0, 1.0 DP3 scale, normal, normal RSQ scale.x, scale.x MUL normal, normal, scale.x
The following images show that encoding directions with renormalization in a fragment program reduces the banding artifacts, but they are still quite noticeable.
The DXT5 compressed normal map with renormalization on the right compared to the original normal map on the left. 
For most tangentspace normal maps better quality compression can be achieved by only storing the X and Y components in the DXT5 format and deriving the Z. This is also known as DXT5nm compression, and is most popular in today's computer games. The following fragment program shows how the Z is derived from the X and Y components.
# input.x = 0 # input.y = normal.y [0, 1] # input.z = 0 # input.w = normal.x [0, 1] MAD normal, input.wyzx, 2.0, 1.0 DP4_SAT normal.z, normal, normal; MAD normal, normal, { 1, 1, 1, 0 }, { 0, 0, 1, 0 }; RSQ temp, normal.z; MUL normal.z, temp;
The following images show that only storing the X and Y and deriving the Z, further reduces the banding artifacts.
DXT5 compressed normal map storing only X and Y on the right compared to the original normal map on the left. 
When using XY_ DXT1, _YZX DXT5 or _Y_X DXT5 compression for tangentspace normal maps, there is at least one spare channel that can be used to store a scale factor, which can be used to counter quantization errors similar to what the YCoCgDXT5 compressor from [24] does. However, trying to upscale the components to counter quantization errors does not improve the quality much (typically a PSNR improvement of less than 0.1 dB). The components can only be scaled up when they have a low dynamic range. Although most normals point straight up, and the magnitude of most XY vectors is relatively small, the dynamic range of the XY components is actually still quite large. Even if all normals never deviate more than 45 degrees from straight up, then each X or Y component may still map to the range [cos( 45° ), +cos( 45° )], where cos( 45° ) ≅ 0.707. In other words even with a deviation of less than 45 degrees from straight up, which is 50% of the angular range, each component may still cover more than 70% of the maximum dynamic range. On one hand, this is a good thing, because for the components of tangentspace normal vectors this means the largest part of the dynamic range covers the most frequently occurring values. On the other hand this means it is hard to upscale the components because of a relatively large dynamic range.
In the case of the _Y_X DXT5 compression of tangentspace normal maps there are two unused channels, and one of these channels can be used to also store a bias to center the dynamic range. This significantly increases the number of 4x4 blocks for which the values can be scaled up (such that typically more than 75% of all 4x4 blocks use a scale factor of at least 2). However, even using a bias to increase the number of scaled 4x4 blocks does not help much to improve the quality. The real problem is that the four sample points of the DXT1 block are simply not enough to accurately represent all the Y components of the normals in a 4x4 block. Introducing more sample points would significantly improve the quality but this is obviously not possible within the DXT5 format.
Instead of storing a bias and scale, one of the spare channels can also be used to store a rotation of the normal vectors in a 4x4 block about the Zaxis, as suggested in [11, 12]. Such a rotation can be used to find a much tighter bounding box of the XY vectors. In particular using _Y_X DXT5 compression such a rotation can be used to make sure that the axis with the largest dynamic range maps to the alpha channel, which, as such, is compressed with more precision. To be able to map the axis with the largest dynamic range to the alpha channel, a rotation of up to 180 degrees may be required. This rotation can be stored as a constant value over the whole 4x4 block in one of the 5bit channels. Instead of storing the angle of rotation, the cosine of the angle can be stored, such that the cosine does not have to be calculated in a fragment program where the vectors need to be rotated back to their original positions. The sine for a rotation in the range [0, 180] degrees is always positive and can, as such, trivially be derived from the cosine in a fragment program as follows.
sine = sqrt( 1  cosine * cosine )
The PSNR improvement from rotating the normals in a 4x4 block is significant and typically in the range 2 to 3 dB. Unfortunately adjacent 4x4 blocks may need vastly different rotations, and under bilinear or trilinear filtering noticeable artifacts may appear for filtered texel samples at borders between two 4x4 blocks with different rotations. The X, Y and rotation are filtered separately before the rotation is applied to the X and Y components. As such, a filtered rotation is applied to filtered X and Y components, which is not the same as filtering X and Y components that are first rotated back to their original position. In other words, unless the normal map is only point sampled, using a rotation is also not an option to improve the quality of DXT1 or DXT5 normal map compression.
Of course a denormalization value can still be stored in one of the spare channels as described in [8]. The denormalization value is used to scale down the normal vectors for lower mip levels, such that specular highlights fade with distance to alleviate aliasing artifacts.
3.3 TangentSpace 3Dc
The 3Dc format [10] is specifically designed for tangentspace normal map compression and produces much better quality than DXT1 or DXT5 normal map compression. The 3Dc format stores only two channels and, as such, cannot be used for objectspace normal maps. The format basically consists of two DXT5 alpha blocks for each 4x4 block of normals. In other words for each 4x4 block there are 8 samples for the X components and also 8 independent samples for the Y components. The Z components have to be derived in a fragment program.
The 3Dc format is also known as BC5 in DirectX 10 [5]. The same format can be loaded in OpenGL as LATC or RGTC. Using the LATC format the luminance is replicated in all three RGB channels. This can be particularly convenient, because this way the same swizzle (and fragment program code) can be used for both LATC and _Y_X DXT5 (DXT5nm) compressed normal maps. In other words the same fragment program can be used on hardware that does, and does not support 3Dc. The following fragment program shows how the Z is derived from the X and Y components when the normal map is stored in RGTC format.
# normal.x = x [0, 1] # normal.y = y [0, 1] # normal.z = 0 # normal.w = 0 MAD normal, normal, 2.0, 1.0 DP4 normal.z, normal, normal; MAD normal, normal, { 1.0, 1.0, 1, 0 }, { 0, 0, 1, 0 }; RSQ temp, normal.z; MUL normal.z, temp;
The following images show how 3Dc compression of normal maps, results in significantly less banding compared to _Y_X DXT5 (DXT5nm).
3Dc compressed normal map on the right compared to the original normal map on the left. 
Several extensions to 3Dc are proposed in [11] and a new format specifically designed for improved normal map compression is presented in [12]. However, these formats are not available in current graphics hardware. On all DirectX 10 compatible hardware the 3Dc (or BC5) format results in the best quality tangentspace normal map compression. On older hardware which does not implement 3Dc the best quality is generally achieved using _Y_X DXT5 (DXT5nm).
4. RealTime Compression on the CPU
While decompression from the formats described in the previous sections is done realtime in hardware, compression to these formats may take a considerable amount of time. Existing compressors are designed for highquality offline compression, not realtime compression [20, 21, 22]. However, realtime compression is quite useful to compress normal maps that are stored on disk in a different (more space efficient) format, and to compress dynamically generated normal maps.
In today's rendering engines, tangentspace normal maps are far more popular than objectspace normal maps. On current hardware there are no compression formats available for objectspace normal maps that work really well. The objectspace normal map compression techniques described in section 2 all result in noticeable artifacts, or the compression is exceedingly expensive.
An objectspace normal map can also not be used on an animated object. While the object surface animates the objectspace normal vectors stay pointing in the same objectspace direction. Tangentspace normal maps on the other hand, store normals relative to the tangentspace at the triangle vertices. When the surface of an object animates and the tangent vectors (stored at the triangle vertices) are transformed with the surface, the tangentspace normal vectors that are stored relative to these tangent vectors will also animate with the surface. As such the focus here is on realtime compression of tangentspace normal maps.
On hardware where the 3Dc (BC5 or LATC) format is not available, the _Y_X DXT5 (DXT5nm) format generally results in the best quality tangentspace normal map compression. The realtime _Y_X DXT5 compressor is very similar to the realtime DXT5 compressor from [23].
First the bounding box of XY normal space is calculated. The two lines that are used to approximate the X and Yvalues go from the minimums to the maximums of this bounding box. To improve the Mean Square Error (MSE), the bounding box is inset on either end with a quarter the distance between the sample points on the lines. The Y components are stored in the "green" channel and there are 4 sample points on the line through "color" space. As such, the minimum and maximum Y values are inset with 1/16th of the range. The X components are stored in the "alpha" channel and there are 8 sample points on the line through "alpha" space. As such, the minimum and maximum X values are inset with 1/32nd of the range. The inset is implemented such that the minimum and maximum values are rounded outwards just like the YCoCgDXT5 compressor from [24] does.
Only a single channel of the "color" channels is used to store the Y components of the normal vectors. Using this knowledge, the realtime DXT5 compressor from [23] can be optimized further specifically for _Y_X DXT5 compression. The best matching points on the line through Yspace can be found in a similar way the best matching points on the line through "alpha" space are found in the DXT5 compressor from [23]. First a set of crossover points are calculated where a Y value goes from being closest to one sample point to another.
byte mid = ( max  min ) / ( 2 * 3 ); byte gb1 = max  mid; byte gb2 = ( 2 * max + 1 * min ) / 3  mid; byte gb3 = ( 1 * max + 2 * min ) / 3  mid;
A Y value can then be tested for being greaterequal to each of the crossover points, and the results of these comparisons (0 for false and 1 for true) can be added together to calculate an index. This results in the following order where index 0 through 3 go from the minimum to the maximum.

However, the "color" sample points are ordered differently in the DXT5 format as follows.

Subtracting the results of the comparisons from four, and wrapping the result with a bitwise logical AND with 3, results in the following order.

The order is close to correct, but the min and max are still swapped. The following code shows how the Y values are compared to the crossover points, and how the indices are calculated from the results of the comparisons, where index 0 and 1 are swapped at the end by XORing with the result of the comparison ( 2 > index ).
unsigned int result = 0; for ( int i = 15; i >= 0; i ) { result <<= 2; byte g = block[i*4]; int b1 = ( g >= gb1 ); int b2 = ( g >= gb2 ); int b3 = ( g >= gb3 ); int index = ( 4  b1  b2  b3 ) & 3; index ^= ( 2 > index ); result = index; }
Using SIMD instructions each byte comparison results in a byte with either all zero bits (when the expression is false), or all one bits (when the expression is true). When interpreted as a signed (two'scomplements) integer, the result of a byte comparison is equal to either the number 0 (for false) or the number 1 (for true). Instead of explicitly subtracting a 1 for a comparison that results in true, the actual result of the comparison can simply be added to the value four as a signed integer.
The calculation of the indices for the "alpha" channel is very similar to the calculation used in the realtime DXT5 compressor from [23]. However, the calculation can be optimized further by also selecting the best matching sample points with subtraction as opposed to addition. First a set of crossover points are calculated where an X value goes from being closest to one sample point to another.
byte mid = ( max  min ) / ( 2 * 7 ); byte ab1 = max  mid; byte ab2 = ( 6 * max + 1 * min ) / 7  mid; byte ab3 = ( 5 * max + 2 * min ) / 7  mid; byte ab4 = ( 4 * max + 3 * min ) / 7  mid; byte ab5 = ( 3 * max + 4 * min ) / 7  mid; byte ab6 = ( 2 * max + 5 * min ) / 7  mid; byte ab7 = ( 1 * max + 6 * min ) / 7  mid;
An X value can then be tested for being greaterequal to each of the crossover points, and the results of these comparisons (0 for false and 1 for true) can be subtracted from 8 and wrapped using a bitwise logical AND with 7 to calculate the index. The first two indices are also swapped by xoring with the result of the comparison ( 2 > index ) as shown in the following code.
byte indices[16]; for ( int i = 0; i < 16; i++ ) { byte a = block[i*4]; int b1 = ( a >= ab1 ); int b2 = ( a >= ab2 ); int b3 = ( a >= ab3 ); int b4 = ( a >= ab4 ); int b5 = ( a >= ab5 ); int b6 = ( a >= ab6 ); int b7 = ( a >= ab7 ); int index = ( 8  b1  b2  b3  b4  b5  b6  b7 ) & 7; indices[i] = index ^ ( 2 > index ); }
The full implementation of the realtime _Y_X DXT5 compressor can be found in appendix A. MMX and SSE2 implementations of this realtime compressor can be found in appendix B and C respectively.
Where available, the 3Dc (BC5 or LATC) format results in the best quality tangentspace normal map compression. The realtime 3Dc compressor first calculates the bounding box of XY normal space just like the _Y_X DXT5 compressor does. The two lines that are used to approximate the X and Yvalues go from the minimums to the maximums of this bounding box. To improve the Mean Square Error (MSE), the bounding box is inset on either end with a quarter the distance between the sample points on the lines. The 3Dc format basically stores two DXT5 alpha channels both with the same encoding and 8 sample points. As such, on both axes the bounding box is inset on either end with 1/32th of the range. The same code as used for the _Y_X DXT5 compression, is used here as well to calculate the "alpha" channel indices, except that it is used twice. The full implementation of the realtime 3Dc compressor can be found in appendix A. MMX and SSE2 implementations of this realtime compressor can be found in appendix B and C respectively.
5. RealTime Compression on the GPU
Realtime compression of tangentspace normal maps can also be performed on the GPU. This is possible thanks to new features available on DX10class graphics hardware that enable rendering to integer textures and the use of bitwise and arithmetic integer operations..
To compress a normal map, a fragment program is used for each block of 4x4 texels by rendering a quad over the entire destination surface. The result of this fragment program is a compressed DXT block that is written to the texels of an integer texture. Both, DXT5 and 3Dc blocks are 128 bits, which is equal to one RGBA texel with 32 bits per component. As such, an unsigned integer RGBA texture is used as the render target when compressing a normal map to either format. The contents of this render target are then copied to the corresponding DXT texture by using Pixel Buffer Objects. This process is very similar to the one used for YCoCgDXT5 compression that is described in more detail in [24].
3Dc compressed textures are exposed in OpenGL through two different extensions: GL_EXT_texture_compression_latc [25], and GL_EXT_texture_compression_rgtc [26]. The former maps the X and Y components to the luminance and alpha channels, while the latter maps the X and Y components to red and green respectively, where the remaining channels are set to 0.
In the implementation described here the LATC format is used. This is slightly more convenient, because it allows sharing the same shader code used for the normal reconstruction:
N.xy = 2 * tex2D(image, texcoord).wy  1; N.z = sqrt(saturate(1  N.x * N.x  N.y * N.y));
When using LATC the luminance is replicated in the RGB channels, so the WY swizzle maps the luminance and alpha components to X and Y. Similarly, when using _Y_X DXT5, the WY swizzle maps the green and alpha components to X and Y.
The same code as used in [24] to encode the alpha channel for YCoCgDXT5 compression, can also be used to encode the X and Y components for 3Dc compression, and the X component for _Y_X DXT5 compression. As shown in Section 4, the _Y_X DXT5 compressor can also be optimized to compute the DXT1 block by fitting only the Y component. However, as noted in [23], the alpha space is a onedimensional space and the points on the line through alpha space are equidistant, which allows the closest point for each original alpha value to be calculated through division. On the CPU this requires a rather slow scalar integer division, because there are no MMX or SSE2 instructions available for integer division. The division can be implemented as an integer multiplication with a shift. However, the divisor is not a constant which means a lookup table is required to get the multiplier. Multiplication also increases the dynamic range which limits the amount of parallelism that can be exploited through a SIMD instruction set. On the CPU there is a clear benefit to exploiting maximum parallelism by using simple operations on the smallest possible elements (bytes) without increasing the dynamic range. However, on the GPU, scalar floating point math is used, and a division and/or multiplication is relatively cheap. As such, the X and Y components can be mapped to the respective indices by applying only a scale and a bias. The CG code for the index calculation of the Y component for the _Y_X DXT5 format is as follows:
const int GREEN_RANGE = 3; float bias = maxGreen + (maxGreen  minGreen) / (2.0 * GREEN_RANGE); float scale = 1.0f / (maxGreen  minGreen); // Compute indices uint indices = 0; for (int i = 0; i < 16; i++) { uint index = saturate((bias  block[i].y) * scale) * GREEN_RANGE; indices = index << (i * 2); } uint i0 = (indices & 0x55555555); uint i1 = (indices & 0xAAAAAAAA) >> 1; indices = ((i0 ^ i1) << 1)  i1;
The same can be done for the X component of the _Y_X DXT5 format, and for both the X and Y component of the 3Dc format:
const int ALPHA_RANGE = 7; float bias = maxAlpha + (maxAlpha  minAlpha) / (2.0 * ALPHA_RANGE); float scale = 1.0f / (maxAlpha  minAlpha); uint2 indices = 0; for (int i = 0; i < 6; i++) { uint index = saturate((bias  block[i].x) * scale) * ALPHA_RANGE; indices.x = index << (3 * i); } for (int i = 6; i < 16; i++) { uint index = saturate((bias  block[i].x) * scale) * ALPHA_RANGE; indices.y = index << (3 * i  18); } uint2 i0 = (indices >> 0) & 0x09249249; uint2 i1 = (indices >> 1) & 0x09249249; uint2 i2 = (indices >> 2) & 0x09249249; i2 ^= i0 & i1; i1 ^= i0; i0 ^= (i1  i2); indices.x = (i2.x << 2)  (i1.x << 1)  i0.x; indices.y = (((i2.y << 2)  (i1.y << 1)  i0.y) << 2)  (indices.x >> 16); indices.x <<= 16;
The full Cg 2.0 implementations of the realtime _Y_X DXT5 (DXT5nm) normal map compressor, and the realtime 3Dc (BC5 or LATC) normal map compressor, can be found in appendix D.
6. Compression on the CPU vs. GPU
As shown in the previous sections high performance normal map compression can be implemented on both the CPU and the GPU. Whether the compression is best implemented on the CPU or the GPU is application dependent.
Realtime compression on the CPU is useful for normal maps that are dynamically created on the CPU. Compression on the CPU is also particularly useful for transcoding normal maps that are streamed from disk in a format that cannot be used for rendering. For example, a normal map or a height map may be stored in JPEG format on disk and, as such, cannot be used directly for rendering. Only some parts of the JPEG decompression algorithm can currently be implemented efficiently on the GPU. Memory can be saved on the graphics card, and rendering performance can be improved, by decompressing the original data and recompressing it to DXT format. The advantage of recompressing the texture data on the CPU is that the amount of data uploaded to the graphics card is minimal. Furthermore, when the compression is performed on the CPU, the full GPU can be used for rendering work as it does not need to perform any compression. With a definite trend to a growing number of cores on today's CPUs, there are typically free cores laying around that can easily be used for texture compression.
Realtime compression on the GPU may be less useful for transcoding, because of increased bandwidth requirements for uploading uncompressed texture data and because the GPU may already be tasked with expensive rendering work. However, realtime compression on the GPU is very useful for compressed render targets. The compression on the GPU can be used to save memory when rendering to a texture. Furthermore, such compressed render targets can improve the performance if the data from the render target is used for further rendering. The render target is compressed once, while the resulting data may be accessed many times during rendering. The compressed data results in reduced bandwidth requirements during rasterization and can, as such, significantly improve performance.
7. Results
7.1 ObjectSpace
The objectspace normal map compression techniques have been tested with the objectspace normal maps shown below.
ObjectSpace Normal Maps  
1. arcade  2. tentacle  3. chest  4. face 
The Peak Signal to Noise Ratio (PSNR) has been calculated over the unweighted X, Y and Z values, stored as 8bit unsigned integers.
PSNR  
image  XYZ DXT1 
renormalized XYZ DXT1 
XY_Z DXT5 
renormalized XY_Z DXT5  

01_arcade  30.90  32.95  34.02  37.23  
02_tentacle  36.68  38.29  41.04  41.62  
03_chest  39.24  40.79  42.22  43.47  
04_face  37.38  38.99  41.03  42.60 
7.2 TangentSpace
The tangentspace normal map compression techniques have been tested with the tangentspace normal maps shown below.
TangentSpace Normal Maps  
1. dot1  2. dot2  3. dot3  4. dot4 
5. lumpy  6. voronoi  7. turtle  8. normalmap 
9. metal  10. skin  11. onetile  12. barrel 
13. arcade  14. tentacle  15. chest  16. face 
The Peak Signal to Noise Ratio (PSNR) has been calculated over the unweighted X, Y and Z values, stored as 8bit unsigned integers.
PSNR  
image  XY_ DXT1 
renormalized XYZ DXT1 
_YZX DXT5 
renormalized _YZX DXT5 
_Y_X DXT5 
3Dc  

01_dot1  27.61  29.51  32.00  35.16  35.07  40.15  
02_dot2  25.39  26.45  29.55  32.92  32.68  36.70  
03_dot3  21.88  23.05  27.34  30.77  30.02  34.13  
04_dot4  23.18  24.46  29.16  32.81  31.38  35.80  
05_lumpy  30.54  31.13  34.70  37.15  37.73  41.92  
06_voronoi  37.53  38.16  41.72  42.16  43.93  48.23  
07_turtle  36.12  37.06  38.74  39.93  41.22  45.76  
08_normalmap  35.57  36.36  37.78  38.95  40.00  44.49  
09_metal  41.65  41.99  46.37  46.55  49.03  54.10  
10_skin  28.95  29.48  34.68  36.20  36.83  41.37  
11_onetile  29.08  29.82  34.17  35.98  36.76  41.14  
12_barrel  29.93  31.67  33.15  36.79  37.03  40.20  
13_arcade  32.31  33.63  36.86  39.24  39.81  44.61  
14_tentacle  39.03  40.47  40.30  41.39  43.23  47.82  
15_chest  38.92  41.03  41.64  42.29  42.87  46.52  
16_face  38.27  39.58  41.59  42.55  43.71  48.61 
The following graph uses the 3Dc format to show the quality difference between the orthographic and stereographic projections. The stereographic projection results in more consistent results but for most normal maps the quality is significantly lower.
The following graph is only of theoretical interest, in that it shows the quality improvement from rotating the normals in a 4x4 block, and storing the rotation in one of the unused channels in the _Y_X DXT5 format. The graph shows the quality improvement for normal maps that are only point sampled, because filtering causes noticeable artifacts for texel samples between 4x4 blocks with different rotations.
7.3 RealTime TangentSpace
The realtime tangentspace normal map compressors have been tested with the same tangentspace normal maps shown above. The Peak Signal to Noise Ratio (PSNR) has been calculated over the unweighted X, Y and Z values, stored as 8bit unsigned integers.
PSNR  
image  offline _Y_X DXT5 
realtime _Y_X DXT5 
offline 3Dc 
realtime 3Dc  

01_dot1  35.07  33.36  40.15  37.99  
02_dot2  32.68  31.67  36.70  35.67  
03_dot3  30.02  29.03  34.13  33.22  
04_dot4  31.38  30.49  35.80  34.89  
05_lumpy  37.73  36.63  41.92  40.63  
06_voronoi  43.93  42.99  48.23  46.99  
07_turtle  41.22  40.30  45.76  44.50  
08_normalmap  40.00  38.99  44.49  43.26  
09_metal  49.03  47.60  54.10  52.45  
10_skin  36.83  35.69  41.37  40.20  
11_onetile  36.76  35.67  41.14  39.92  
12_barrel  37.03  35.51  40.20  39.11  
13_arcade  39.81  38.05  44.61  42.18  
14_tentacle  43.23  41.90  47.82  46.31  
15_chest  42.87  41.95  46.52  45.38  
16_face  43.71  42.85  48.61  47.53 
The performance of the SIMD optimized realtime compressors has been tested on an Intel® 2.8 GHz dualcore Xeon® ("Paxville" 90nm NetBurst microarchitecture) and an Intel® 2.9 GHz Core™2 Extreme ("Conroe" 65nm Core 2 microarchitecture). Only a single core of these processors was used for the compression. Since the texture compression is block based, the compression algorithms can easily use multiple threads to utilize all cores of these processors. When using multiple cores there is an expected linear speed up with the number of available cores. The performance of the Cg 2.0 implementations has also been tested on a NVIDIA GeForce 8600 GTS and a NVIDIA GeForce 8800 GTX.
The following figure shows the number of Mega Pixels that can be compressed to the _Y_X DXT5 format per second (higher MP/s = better).
The following figure shows the number of Mega Pixels that can be compressed to the 3Dc format per second (higher MP/s = better).
8. Conclusion
Existing color texture compression formats can also be used to store normal maps, but the results vary. The latest graphics hardware also implements formats specifically designed for normal map compression. While decompression from these formats happens in realtime in hardware during rendering, compression to these formats may take a considerable amount of time. Existing compressors are designed for highquality offline compression, not realtime compression. However, at the cost of a little quality, normal maps can also be compressed realtime on both the CPU and GPU, which is useful for transcoding normal maps from a different format and compression of dynamically generated normal maps.
9. References
1.  Simulation of Wrinkled Surfaces. James F. Blinn In Proceedings of SIGGRAPH, vol. 12, #3, pp. 286292, 1978 Available Online: //portal.acm.org/citation.cfm?id=507101 
2.  Efficient Bump Mapping Hardware. Mark Peercy, John Airey, Brian Cabral Computer Graphics, vol. 31, pp. 303306, 1997 Available Online: //citeseer.ist.psu.edu/peercy97efficient.html 
3.  S3 Texture Compression Pat Brown NVIDIA Corporation, November 2001 Available Online: //oss.sgi.com/projects/oglsample/registry/EXT/texture_compression_s3tc.txt 
4.  Compressed Texture Resources (Direct3D 9) Microsoft Developer Network MSDN, November 2007 Available Online: //msdn2.microsoft.com/enus/library/bb204843.aspx 
5.  Block Compression (Direct3D 10) Microsoft Developer Network MSDN, November 2007 Available Online: //msdn2.microsoft.com/enus/library/bb694531.aspx 
6.  Image Coding using Block Truncation Coding E.J. Delp, O.R. Mitchell IEEE Transactions on Communications, vol. 27(9), pp. 13351342, September 1979 Available Online: //scholarsmine.umr.edu/post_prints/01094560_09007dcc8030cc78.html 
7.  Bump Map Compression Simon Green NVIDIA Technical Report, October 2001 Available Online: //developer.nvidia.com/object/bump_map_compression.html 
8.  Normal Map Compression ATI Technologies Inc ATI, August 2003 Available Online: //www.ati.com/developer/NormalMapCompression.pdf 
9.  Normal Map Compression Jakub Klarowicz Shader X2: Shader Programming Tips & Tricks with DirectX 9 
10.  3Dc White Paper ATI Technologies Inc ATI, April 2004 Available Online: //ati.de/products/radeonx800/3DcWhitePaper.pdf 
11.  High Quality Normal Map Compression Jacob Munkberg, Tomas AkenineMöller, Jacob Ström Graphics Hardware 2006 Available Online: //graphics.cs.lth.se/research/papers/normals2006/ 
12.  Tight Frame Normal Map Compression Jacob Munkberg, Ola Olsson, Jacob Ström, Tomas AkenineMöller Graphics Hardware 2007 Available Online: //graphics.cs.lth.se/research/papers/2007/tightframe/ 
13.  Fast and Efficient Normal Map Compression Based on Vector Quantization T. Yamasaki, K. Aizawa In Proceedings of ICASSP (2006), vol. 2, pp. 212 Available Online: //www.ee.columbia.edu/~dpwe/LabROSA/proceeds/icassp/2006/pdfs/0200009.pdf 
14.  A Hybrid Adaptive Normal Map Texture Compression Algorithm B. Yang, Z. Pan In International Conference on Artificial Reality and Telexistence (2006), IEEE Computer Society, pp. 349354 Available Online: //doi.ieeecomputersociety.org/10.1109/ICAT.2006.11 
15.  Mathematical error analysis of normal map compression based on unity condition Toshihiko Yamasaki, Kazuya Hayase, Kiyoharu Aizawa IEEE International Conference on Image Processing, vol. 2, pp. 2536, September 2005 Available Online: //ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1530039 
16.  Mathematical PSNR Prediction Model Between Compressed Normal Maps and Rendered 3D Images Toshihiko Yamasaki, Kazuya Hayase, Kiyoharu Aizawa Pacific Rim Conference on Multimedia (PCM2005) LNCS 3768, pp. 584594, Jeju Island, Korea, Nov. 1316, 2005 Available Online: //www.springerlink.com/content/f3707080w8553g3l/ 
17.  Realtime rendering of normal maps with discontinuities Evgueni Parilov, Ilya Rosenberg, Denis Zorin CIMS Technical Report, TR2005872, August 2005 Available Online: //csdocs.cs.nyu.edu/Dienst/UI/2.0/Describe/ncstrl.nyu_cs%2FTR2005872t 
18.  Mipmapping normal maps M. Toksvig Journal of Graphics Tools 10, 3, 6571. 2005 Available Online: //developer.nvidia.com/object/mipmapping_normal_maps.html 
19.  Frequency Domain Normal Map Filtering Charles Han, Bo Sun, Ravi Ramamoorthi, Eitan Grinspun SIGGRAPH 2007 Available Online: //www.cs.columbia.edu/cg/normalmap/normalmap.pdf 
20.  ATI Compressonator Library Seth Sowerby, Daniel Killebrew ATI Technologies Inc, The Compressonator version 1.27.1066, March 2006 Available Online: //www.ati.com/developer/compressonator.html 
21.  NVIDIA DDS Utilities NVIDIA NVIDIA DDS Utilities, April 2006 Available Online: //developer.nvidia.com/object/nv_texture_tools.html 
22.  NVIDIA Texture Tools NVIDIA NVIDIA Texture Tools, September 2007 Available Online: //developer.nvidia.com/object/texture_tools.html 
23.  RealTime DXT Compression J.M.P. van Waveren Intel Software Network, October 2006 Available Online: //www.intel.com/cd/ids/developer/asmona/eng/324337.htm 
24.  RealTime YCoCgDXT Compression J.M.P. van Waveren, Ignacio Castaño NVIDIA, October 2007 Available Online: //news.developer.nvidia.com/2007/10/realtimeycocg.html 
25.  GL_EXT_texture_compression_latc Available Online: //www.opengl.org/registry/specs/EXT/texture_compression_latc.txt 
26.  GL_EXT_texture_compression_rgtc Available Online: //www.opengl.org/registry/specs/EXT/texture_compression_rgtc.txt 
Appendix A
/* RealTime Normal Map Compression (C++) Copyright (C) 2008 Id Software, Inc. Written by J.M.P. van Waveren This code is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version. This code is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details. */ typedef unsigned char byte; typedef unsigned short word; typedef unsigned int dword; #define INSET_COLOR_SHIFT 4 // inset color channel #define INSET_ALPHA_SHIFT 5 // inset alpha channel #define C565_5_MASK 0xF8 // 0xFF minus last three bits #define C565_6_MASK 0xFC // 0xFF minus last two bits byte *globalOutData; void EmitByte( byte b ) { globalOutData[0] = b; globalOutData += 1; } void EmitWord( word s ) { globalOutData[0] = ( s >> 0 ) & 255; globalOutData[1] = ( s >> 8 ) & 255; globalOutData += 2; } void EmitDoubleWord( dword i ) { globalOutData[0] = ( i >> 0 ) & 255; globalOutData[1] = ( i >> 8 ) & 255; globalOutData[2] = ( i >> 16 ) & 255; globalOutData[3] = ( i >> 24 ) & 255; globalOutData += 4; } word NormalYTo565( byte y ) { return ( ( y >> 2 ) << 5 ); } void ExtractBlock( const byte *inPtr, const int width, byte *block ) { for ( int j = 0; j < 4; j++ ) { memcpy( &block[j*4*4], inPtr, 4*4 ); inPtr += width * 4; } } void GetMinMaxNormalsBBox( const byte *block, byte *minNormal, byte *maxNormal ) { minNormal[0] = minNormal[1] = 255; maxNormal[0] = maxNormal[1] = 0; for ( int i = 0; i < 16; i++ ) { if ( block[i*4+0] < minNormal[0] ) { minNormal[0] = block[i*4+0]; } if ( block[i*4+1] < minNormal[1] ) { minNormal[1] = block[i*4+1]; } if ( block[i*4+0] > maxNormal[0] ) { maxNormal[0] = block[i*4+0]; } if ( block[i*4+1] > maxNormal[1] ) { maxNormal[1] = block[i*4+1]; } } } void InsetNormalsBBoxDXT5( byte *minNormal, byte *maxNormal ) { int inset[4]; int mini[4]; int maxi[4]; inset[0] = ( maxNormal[0]  minNormal[0] )  ((1<<(INSET_ALPHA_SHIFT1))1); inset[1] = ( maxNormal[1]  minNormal[1] )  ((1<<(INSET_COLOR_SHIFT1))1); mini[0] = ( ( minNormal[0] << INSET_ALPHA_SHIFT ) + inset[0] ) >> INSET_ALPHA_SHIFT; mini[1] = ( ( minNormal[1] << INSET_COLOR_SHIFT ) + inset[1] ) >> INSET_COLOR_SHIFT; maxi[0] = ( ( maxNormal[0] << INSET_ALPHA_SHIFT )  inset[0] ) >> INSET_ALPHA_SHIFT; maxi[1] = ( ( maxNormal[1] << INSET_COLOR_SHIFT )  inset[1] ) >> INSET_COLOR_SHIFT; mini[0] = ( mini[0] >= 0 ) ? mini[0] : 0; mini[1] = ( mini[1] >= 0 ) ? mini[1] : 0; maxi[0] = ( maxi[0] <= 255 ) ? maxi[0] : 255; maxi[1] = ( maxi[1] <= 255 ) ? maxi[1] : 255; minNormal[0] = mini[0]; minNormal[1] = ( mini[1] & C565_6_MASK )  ( mini[1] >> 6 ); maxNormal[0] = maxi[0]; maxNormal[1] = ( maxi[1] & C565_6_MASK )  ( maxi[1] >> 6 ); } void InsetNormalsBBox3Dc( byte *minNormal, byte *maxNormal ) { int inset[4]; int mini[4]; int maxi[4]; inset[0] = ( maxNormal[0]  minNormal[0] )  ((1<<(INSET_ALPHA_SHIFT1))1); inset[1] = ( maxNormal[1]  minNormal[1] )  ((1<<(INSET_ALPHA_SHIFT1))1); mini[0] = ( ( minNormal[0] << INSET_ALPHA_SHIFT ) + inset[0] ) >> INSET_ALPHA_SHIFT; mini[1] = ( ( minNormal[1] << INSET_ALPHA_SHIFT ) + inset[1] ) >> INSET_ALPHA_SHIFT; maxi[0] = ( ( maxNormal[0] << INSET_ALPHA_SHIFT )  inset[0] ) >> INSET_ALPHA_SHIFT; maxi[1] = ( ( maxNormal[1] << INSET_ALPHA_SHIFT )  inset[1] ) >> INSET_ALPHA_SHIFT; mini[0] = ( mini[0] >= 0 ) ? mini[0] : 0; mini[1] = ( mini[1] >= 0 ) ? mini[1] : 0; maxi[0] = ( maxi[0] <= 255 ) ? maxi[0] : 255; maxi[1] = ( maxi[1] <= 255 ) ? maxi[1] : 255; minNormal[0] = mini[0]; minNormal[1] = mini[1]; maxNormal[0] = maxi[0]; maxNormal[1] = maxi[1]; } void EmitAlphaIndices( const byte *block, const int offset, const byte minAlpha, const byte maxAlpha ) { byte mid = ( maxAlpha  minAlpha ) / ( 2 * 7 ); byte ab1 = maxAlpha  mid; byte ab2 = ( 6 * maxAlpha + 1 * minAlpha ) / 7  mid; byte ab3 = ( 5 * maxAlpha + 2 * minAlpha ) / 7  mid; byte ab4 = ( 4 * maxAlpha + 3 * minAlpha ) / 7  mid; byte ab5 = ( 3 * maxAlpha + 4 * minAlpha ) / 7  mid; byte ab6 = ( 2 * maxAlpha + 5 * minAlpha ) / 7  mid; byte ab7 = ( 1 * maxAlpha + 6 * minAlpha ) / 7  mid; block += offset; byte indices[16]; for ( int i = 0; i < 16; i++ ) { byte a = block[i*4]; int b1 = ( a >= ab1 ); int b2 = ( a >= ab2 ); int b3 = ( a >= ab3 ); int b4 = ( a >= ab4 ); int b5 = ( a >= ab5 ); int b6 = ( a >= ab6 ); int b7 = ( a >= ab7 ); int index = ( 8  b1  b2  b3  b4  b5  b6  b7 ) & 7; indices[i] = index ^ ( 2 > index ); } EmitByte( (indices[ 0] >> 0)  (indices[ 1] << 3)  (indices[ 2] << 6) ); EmitByte( (indices[ 2] >> 2)  (indices[ 3] << 1)  (indices[ 4] << 4)  (indices[ 5] << 7) ); EmitByte( (indices[ 5] >> 1)  (indices[ 6] << 2)  (indices[ 7] << 5) ); EmitByte( (indices[ 8] >> 0)  (indices[ 9] << 3)  (indices[10] << 6) ); EmitByte( (indices[10] >> 2)  (indices[11] << 1)  (indices[12] << 4)  (indices[13] << 7) ); EmitByte( (indices[13] >> 1)  (indices[14] << 2)  (indices[15] << 5) ); } void EmitGreenIndices( const byte *block, const int offset, const byte minGreen, const byte maxGreen ) { byte mid = ( maxGreen  minGreen ) / ( 2 * 3 ); byte gb1 = maxGreen  mid; byte gb2 = ( 2 * maxGreen + 1 * minGreen ) / 3  mid; byte gb3 = ( 1 * maxGreen + 2 * minGreen ) / 3  mid; block += offset; unsigned int result = 0; for ( int i = 15; i >= 0; i ) { result <<= 2; byte g = block[i*4]; int b1 = ( g >= gb1 ); int b2 = ( g >= gb2 ); int b3 = ( g >= gb3 ); int index = ( 4  b1  b2  b3 ) & 3; index ^= ( 2 > index ); result = index; } EmitUInt( result ); } void CompressNormalMapDXT5( const byte *inBuf, byte *outBuf, int width, int height, int &outputBytes ) { byte block[64]; byte normalMin[4]; byte normalMax[4]; globalOutData = outBuf; for ( int j = 0; j < height; j += 4, inBuf += width * 4*4 ) { for ( int i = 0; i < width; i += 4 ) { ExtractBlock( inBuf + i * 4, width, block ); GetMinMaxNormalsBBox( block, normalMin, normalMax ); InsetNormalsBBoxDXT5( normalMin, normalMax ); // Write out Nx into alpha channel. EmitByte( normalMax[0] ); EmitByte( normalMin[0] ); EmitAlphaIndices( block, 0, normalMin[0], normalMax[0] ); // Write out Ny into green channel. EmitUShort( NormalYTo565( normalMax[1] ) ); EmitUShort( NormalYTo565( normalMin[1] ) ); EmitGreenIndices( block, 1, normalMin[1], normalMax[1] ); } } outputBytes = outData  outBuf; } void CompressNormalMap3Dc( const byte *inBuf, byte *outBuf, int width, int height, int &outputBytes ) { byte block[64]; byte normalMin[4]; byte normalMax[4]; globalOutData = outBuf; for ( int j = 0; j < height; j += 4, inBuf += width * 4*4 ) { for ( int i = 0; i < width; i += 4 ) { ExtractBlock( inBuf + i * 4, width, block ); GetMinMaxNormalsBBox( block, normalMin, normalMax ); InsetNormalsBBox3Dc( normalMin, normalMax ); // Write out Nx as an alpha channel. EmitByte( normalMax[0] ); EmitByte( normalMin[0] ); EmitAlphaIndices( block, 0, normalMin[0], normalMax[0] ); // Write out Ny as an alpha channel. EmitByte( normalMax[1] ); EmitByte( normalMin[1] ); EmitAlphaIndices( block, 1, normalMin[1], normalMax[1] ); } } outputBytes = outData  outBuf; }
Appendix B
/* RealTime Normal Map Compression (MMX) Copyright (C) 2008 Id Software, Inc. Written by J.M.P. van Waveren This code is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version. This code is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details. */ #define ALIGN16( x ) __declspec(align(16)) x #define R_SHUFFLE_D( x, y, z, w ) (( (w) & 3 ) << 6  ( (z) & 3 ) << 4  ( (y) & 3 ) << 2  ( (x) & 3 )) ALIGN16( static dword SIMD_MMX_dword_byte_mask[2] ) = { 0x000000FF, 0x000000FF }; ALIGN16( static dword SIMD_MMX_dword_alpha_bit_mask0[2] ) = { 7<<0, 0 }; ALIGN16( static dword SIMD_MMX_dword_alpha_bit_mask1[2] ) = { 7<<3, 0 }; ALIGN16( static dword SIMD_MMX_dword_alpha_bit_mask2[2] ) = { 7<<6, 0 }; ALIGN16( static dword SIMD_MMX_dword_alpha_bit_mask3[2] ) = { 7<<9, 0 }; ALIGN16( static dword SIMD_MMX_dword_alpha_bit_mask4[2] ) = { 7<<12, 0 }; ALIGN16( static dword SIMD_MMX_dword_alpha_bit_mask5[2] ) = { 7<<15, 0 }; ALIGN16( static dword SIMD_MMX_dword_alpha_bit_mask6[2] ) = { 7<<18, 0 }; ALIGN16( static dword SIMD_MMX_dword_alpha_bit_mask7[2] ) = { 7<<21, 0 }; ALIGN16( static word SIMD_MMX_word_0[4] ) = { 0x0000, 0x0000, 0x0000, 0x0000 }; ALIGN16( static word SIMD_MMX_word_div_by_3[4] ) = { (1<<16)/3+1, (1<<16)/3+1, (1<<16)/3+1, (1<<16)/3+1 }; ALIGN16( static word SIMD_MMX_word_div_by_7[4] ) = { (1<<16)/7+1, (1<<16)/7+1, (1<<16)/7+1, (1<<16)/7+1 }; ALIGN16( static word SIMD_MMX_word_div_by_14[4] ) = { (1<<16)/14+1, (1<<16)/14+1, (1<<16)/14+1, (1<<16)/14+1 }; ALIGN16( static word SIMD_MMX_word_scale654[4] ) = { 6, 5, 4, 0 }; ALIGN16( static word SIMD_MMX_word_scale123[4] ) = { 1, 2, 3, 0 }; ALIGN16( static word SIMD_MMX_word_insetNormalDXT5Round[4] ) = { ((1<<(INSET_ALPHA_SHIFT1))1), ((1<<(INSET_COLOR_SHIFT1))1), 0, 0 }; ALIGN16( static word SIMD_MMX_word_insetNormalDXT5Mask[4] ) = { 0xFFFF, 0xFFFF, 0x0000, 0x0000 }; ALIGN16( static word SIMD_MMX_word_insetNormalDXT5ShiftUp[4] ) = { 1 << INSET_ALPHA_SHIFT, 1 << INSET_COLOR_SHIFT, 1, 1 }; ALIGN16( static word SIMD_MMX_word_insetNormalDXT5ShiftDown[4] ) = { 1 << ( 16  INSET_ALPHA_SHIFT ), 1 << ( 16  INSET_COLOR_SHIFT ), 0, 0 }; ALIGN16( static word SIMD_MMX_word_insetNormalDXT5QuantMask[4] ) = { 0xFF, C565_6_MASK, 0xFF, 0xFF }; ALIGN16( static word SIMD_MMX_word_insetNormalDXT5Rep[4] ) = { 0, 1 << ( 16  6 ), 0, 0 }; ALIGN16( static word SIMD_MMX_word_insetNormal3DcRound[4] ) = { ((1<<(INSET_ALPHA_SHIFT1))1), ((1<<(INSET_ALPHA_SHIFT1))1), 0, 0 }; ALIGN16( static word SIMD_MMX_word_insetNormal3DcMask[4] ) = { 0xFFFF, 0xFFFF, 0x0000, 0x0000 }; ALIGN16( static word SIMD_MMX_word_insetNormal3DcShiftUp[4] ) = { 1 << INSET_ALPHA_SHIFT, 1 << INSET_ALPHA_SHIFT, 1, 1 }; ALIGN16( static word SIMD_MMX_word_insetNormal3DcShiftDown[4] ) = { 1 << ( 16  INSET_ALPHA_SHIFT ), 1 << ( 16  INSET_ALPHA_SHIFT ), 0, 0 }; ALIGN16( static byte SIMD_MMX_byte_0[8] ) = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 }; ALIGN16( static byte SIMD_MMX_byte_1[8] ) = { 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01 }; ALIGN16( static byte SIMD_MMX_byte_2[8] ) = { 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02 }; ALIGN16( static byte SIMD_MMX_byte_7[8] ) = { 0x07, 0x07, 0x07, 0x07, 0x07, 0x07, 0x07, 0x07 }; ALIGN16( static byte SIMD_MMX_byte_8[8] ) = { 0x08, 0x08, 0x08, 0x08, 0x08, 0x08, 0x08, 0x08 }; ALIGN16( static byte SIMD_MMX_byte_not[8] ) = { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF }; void ExtractBlock_MMX( const byte *inPtr, int width, byte *block ) { __asm { mov esi, inPtr mov edi, block mov eax, width shl eax, 2 movq mm0, qword ptr [esi+0] movq qword ptr [edi+ 0], mm0 movq mm1, qword ptr [esi+8] movq qword ptr [edi+ 8], mm1 movq mm2, qword ptr [esi+eax+0] movq qword ptr [edi+16], mm2 movq mm3, qword ptr [esi+eax+8] movq qword ptr [edi+24], mm3 movq mm4, qword ptr [esi+eax*2+0] movq qword ptr [edi+32], mm4 movq mm5, qword ptr [esi+eax*2+8] add esi, eax movq qword ptr [edi+40], mm5 movq mm6, qword ptr [esi+eax*2+0] movq qword ptr [edi+48], mm6 movq mm7, qword ptr [esi+eax*2+8] movq qword ptr [edi+56], mm7 emms } } void GetMinMaxNormalsBBox_MMX( const byte *block, byte *minNormal, byte *maxNormal ) { __asm { mov eax, block mov esi, minNormal mov edi, maxNormal pshufw mm0, qword ptr [eax+ 0], R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm1, qword ptr [eax+ 0], R_SHUFFLE_D( 0, 1, 2, 3 ) pminub mm0, qword ptr [eax+ 8] pmaxub mm1, qword ptr [eax+ 8] pminub mm0, qword ptr [eax+16] pmaxub mm1, qword ptr [eax+16] pminub mm0, qword ptr [eax+24] pmaxub mm1, qword ptr [eax+24] pminub mm0, qword ptr [eax+32] pmaxub mm1, qword ptr [eax+32] pminub mm0, qword ptr [eax+40] pmaxub mm1, qword ptr [eax+40] pminub mm0, qword ptr [eax+48] pmaxub mm1, qword ptr [eax+48] pminub mm0, qword ptr [eax+56] pmaxub mm1, qword ptr [eax+56] pshufw mm6, mm0, R_SHUFFLE_D( 2, 3, 2, 3 ) pshufw mm7, mm1, R_SHUFFLE_D( 2, 3, 2, 3 ) pminub mm0, mm6 pmaxub mm1, mm7 movd dword ptr [esi], mm0 movd dword ptr [edi], mm1 emms } } void InsetNormalsBBoxDXT5_MMX( byte *minNormal, byte *maxNormal ) { __asm { mov esi, minNormal mov edi, maxNormal movd mm0, dword ptr [esi] movd mm1, dword ptr [edi] punpcklbw mm0, SIMD_MMX_byte_0 punpcklbw mm1, SIMD_MMX_byte_0 movq mm2, mm1 psubw mm2, mm0 psubw mm2, SIMD_MMX_word_insetNormalDXT5Round pand mm2, SIMD_MMX_word_insetNormalDXT5Mask pmullw mm0, SIMD_MMX_word_insetNormalDXT5ShiftUp pmullw mm1, SIMD_MMX_word_insetNormalDXT5ShiftUp paddw mm0, mm2 psubw mm1, mm2 pmulhw mm0, SIMD_MMX_word_insetNormalDXT5ShiftDown pmulhw mm1, SIMD_MMX_word_insetNormalDXT5ShiftDown pmaxsw mm0, SIMD_MMX_word_0 pmaxsw mm1, SIMD_MMX_word_0 pand mm0, SIMD_MMX_word_insetNormalDXT5QuantMask pand mm1, SIMD_MMX_word_insetNormalDXT5QuantMask movq mm2, mm0 movq mm3, mm1 pmulhw mm2, SIMD_MMX_word_insetNormalDXT5Rep pmulhw mm3, SIMD_MMX_word_insetNormalDXT5Rep por mm0, mm2 por mm1, mm3 packuswb mm0, mm0 packuswb mm1, mm1 movd dword ptr [esi], mm0 movd dword ptr [edi], mm1 emms } } void InsetNormalsBBox3Dc_MMX( byte *minNormal, byte *maxNormal ) { __asm { mov esi, minNormal mov edi, maxNormal movd mm0, dword ptr [esi] movd mm1, dword ptr [edi] punpcklbw mm0, SIMD_MMX_byte_0 punpcklbw mm1, SIMD_MMX_byte_0 movq mm2, mm1 psubw mm2, mm0 psubw mm2, SIMD_MMX_word_insetNormal3DcRound pand mm2, SIMD_MMX_word_insetNormal3DcMask pmullw mm0, SIMD_MMX_word_insetNormal3DcShiftUp pmullw mm1, SIMD_MMX_word_insetNormal3DcShiftUp paddw mm0, mm2 psubw mm1, mm2 pmulhw mm0, SIMD_MMX_word_insetNormal3DcShiftDown pmulhw mm1, SIMD_MMX_word_insetNormal3DcShiftDown pmaxsw mm0, SIMD_MMX_word_0 pmaxsw mm1, SIMD_MMX_word_0 packuswb mm0, mm0 packuswb mm1, mm1 movd dword ptr [esi], mm0 movd dword ptr [edi], mm1 emms } } void EmitAlphaIndices_MMX( const byte *block, const int channelBitOffset, const int minAlpha, const int maxAlpha ) { ALIGN16( byte alphaBlock[16] ); ALIGN16( byte ab1[8] ); ALIGN16( byte ab2[8] ); ALIGN16( byte ab3[8] ); ALIGN16( byte ab4[8] ); ALIGN16( byte ab5[8] ); ALIGN16( byte ab6[8] ); ALIGN16( byte ab7[8] ); __asm { movd mm7, channelBitOffset mov esi, block movq mm0, qword ptr [esi+ 0] movq mm5, qword ptr [esi+ 8] movq mm6, qword ptr [esi+16] movq mm4, qword ptr [esi+24] psrld mm0, mm7 psrld mm5, mm7 psrld mm6, mm7 psrld mm4, mm7 pand mm0, SIMD_MMX_dword_byte_mask pand mm5, SIMD_MMX_dword_byte_mask pand mm6, SIMD_MMX_dword_byte_mask pand mm4, SIMD_MMX_dword_byte_mask packuswb mm0, mm5 packuswb mm6, mm4 packuswb mm0, mm6 movq alphaBlock+0, mm0 movq mm0, qword ptr [esi+32] movq mm5, qword ptr [esi+40] movq mm6, qword ptr [esi+48] movq mm4, qword ptr [esi+56] psrld mm0, mm7 psrld mm5, mm7 psrld mm6, mm7 psrld mm4, mm7 pand mm0, SIMD_MMX_dword_byte_mask pand mm5, SIMD_MMX_dword_byte_mask pand mm6, SIMD_MMX_dword_byte_mask pand mm4, SIMD_MMX_dword_byte_mask packuswb mm0, mm5 packuswb mm6, mm4 packuswb mm0, mm6 movq alphaBlock+8, mm0 movd mm0, maxAlpha pshufw mm0, mm0, R_SHUFFLE_D( 0, 0, 0, 0 ) movq mm1, mm0 movd mm2, minAlpha pshufw mm2, mm2, R_SHUFFLE_D( 0, 0, 0, 0 ) movq mm3, mm2 movq mm4, mm0 psubw mm4, mm2 pmulhw mm4, SIMD_MMX_word_div_by_14 movq mm5, mm0 psubw mm5, mm4 packuswb mm5, mm5 movq ab1, mm5 pmullw mm0, SIMD_MMX_word_scale654 pmullw mm1, SIMD_MMX_word_scale123 pmullw mm2, SIMD_MMX_word_scale123 pmullw mm3, SIMD_MMX_word_scale654 paddw mm0, mm2 paddw mm1, mm3 pmulhw mm0, SIMD_MMX_word_div_by_7 pmulhw mm1, SIMD_MMX_word_div_by_7 psubw mm0, mm4 psubw mm1, mm4 pshufw mm2, mm0, R_SHUFFLE_D( 0, 0, 0, 0 ) pshufw mm3, mm0, R_SHUFFLE_D( 1, 1, 1, 1 ) pshufw mm4, mm0, R_SHUFFLE_D( 2, 2, 2, 2 ) packuswb mm2, mm2 packuswb mm3, mm3 packuswb mm4, mm4 movq ab2, mm2 movq ab3, mm3 movq ab4, mm4 pshufw mm2, mm1, R_SHUFFLE_D( 2, 2, 2, 2 ) pshufw mm3, mm1, R_SHUFFLE_D( 1, 1, 1, 1 ) pshufw mm4, mm1, R_SHUFFLE_D( 0, 0, 0, 0 ) packuswb mm2, mm2 packuswb mm3, mm3 packuswb mm4, mm4 movq ab5, mm2 movq ab6, mm3 movq ab7, mm4 pshufw mm0, alphaBlock+0, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm1, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm2, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm3, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm4, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm5, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm6, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm7, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pmaxub mm1, ab1 pmaxub mm2, ab2 pmaxub mm3, ab3 pmaxub mm4, ab4 pmaxub mm5, ab5 pmaxub mm6, ab6 pmaxub mm7, ab7 pcmpeqb mm1, mm0 pcmpeqb mm2, mm0 pcmpeqb mm3, mm0 pcmpeqb mm4, mm0 pcmpeqb mm5, mm0 pcmpeqb mm6, mm0 pcmpeqb mm7, mm0 pshufw mm0, SIMD_MMX_byte_8, R_SHUFFLE_D( 0, 1, 2, 3 ) paddsb mm0, mm1 paddsb mm2, mm3 paddsb mm4, mm5 paddsb mm6, mm7 paddsb mm0, mm2 paddsb mm4, mm6 paddsb mm0, mm4 pand mm0, SIMD_MMX_byte_7 pshufw mm1, SIMD_MMX_byte_2, R_SHUFFLE_D( 0, 1, 2, 3 ) pcmpgtb mm1, mm0 pand mm1, SIMD_MMX_byte_1 pxor mm0, mm1 pshufw mm1, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm2, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm3, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm4, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm5, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm6, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm7, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) psrlq mm1, 8 3 psrlq mm2, 16 6 psrlq mm3, 24 9 psrlq mm4, 3212 psrlq mm5, 4015 psrlq mm6, 4818 psrlq mm7, 5621 pand mm0, SIMD_MMX_dword_alpha_bit_mask0 pand mm1, SIMD_MMX_dword_alpha_bit_mask1 pand mm2, SIMD_MMX_dword_alpha_bit_mask2 pand mm3, SIMD_MMX_dword_alpha_bit_mask3 pand mm4, SIMD_MMX_dword_alpha_bit_mask4 pand mm5, SIMD_MMX_dword_alpha_bit_mask5 pand mm6, SIMD_MMX_dword_alpha_bit_mask6 pand mm7, SIMD_MMX_dword_alpha_bit_mask7 por mm0, mm1 por mm2, mm3 por mm4, mm5 por mm6, mm7 por mm0, mm2 por mm4, mm6 por mm0, mm4 mov esi, globalOutData movd [esi+0], mm0 pshufw mm0, alphaBlock+8, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm1, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm2, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm3, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm4, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm5, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm6, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm7, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pmaxub mm1, ab1 pmaxub mm2, ab2 pmaxub mm3, ab3 pmaxub mm4, ab4 pmaxub mm5, ab5 pmaxub mm6, ab6 pmaxub mm7, ab7 pcmpeqb mm1, mm0 pcmpeqb mm2, mm0 pcmpeqb mm3, mm0 pcmpeqb mm4, mm0 pcmpeqb mm5, mm0 pcmpeqb mm6, mm0 pcmpeqb mm7, mm0 pshufw mm0, SIMD_MMX_byte_8, R_SHUFFLE_D( 0, 1, 2, 3 ) paddsb mm0, mm1 paddsb mm2, mm3 paddsb mm4, mm5 paddsb mm6, mm7 paddsb mm0, mm2 paddsb mm4, mm6 paddsb mm0, mm4 pand mm0, SIMD_MMX_byte_7 pshufw mm1, SIMD_MMX_byte_2, R_SHUFFLE_D( 0, 1, 2, 3 ) pcmpgtb mm1, mm0 pand mm1, SIMD_MMX_byte_1 pxor mm0, mm1 pshufw mm1, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm2, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm3, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm4, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm5, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm6, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm7, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) psrlq mm1, 8 3 psrlq mm2, 16 6 psrlq mm3, 24 9 psrlq mm4, 3212 psrlq mm5, 4015 psrlq mm6, 4818 psrlq mm7, 5621 pand mm0, SIMD_MMX_dword_alpha_bit_mask0 pand mm1, SIMD_MMX_dword_alpha_bit_mask1 pand mm2, SIMD_MMX_dword_alpha_bit_mask2 pand mm3, SIMD_MMX_dword_alpha_bit_mask3 pand mm4, SIMD_MMX_dword_alpha_bit_mask4 pand mm5, SIMD_MMX_dword_alpha_bit_mask5 pand mm6, SIMD_MMX_dword_alpha_bit_mask6 pand mm7, SIMD_MMX_dword_alpha_bit_mask7 por mm0, mm1 por mm2, mm3 por mm4, mm5 por mm6, mm7 por mm0, mm2 por mm4, mm6 por mm0, mm4 movd dword ptr [esi+3], mm0 emms } globalOutData += 6; } void EmitGreenIndices_MMX( const byte *block, const int channelBitOffset, const int minGreen, const int maxGreen ) { ALIGN16( byte greenBlock[16] ); __asm { movd mm7, channelBitOffset mov esi, block movq mm0, qword ptr [esi+ 0] movq mm5, qword ptr [esi+ 8] movq mm6, qword ptr [esi+16] movq mm4, qword ptr [esi+24] psrld mm0, mm7 psrld mm5, mm7 psrld mm6, mm7 psrld mm4, mm7 pand mm0, SIMD_MMX_dword_byte_mask pand mm5, SIMD_MMX_dword_byte_mask pand mm6, SIMD_MMX_dword_byte_mask pand mm4, SIMD_MMX_dword_byte_mask packuswb mm0, mm5 packuswb mm6, mm4 packuswb mm0, mm6 movq greenBlock+0, mm0 movq mm0, qword ptr [esi+32] movq mm5, qword ptr [esi+40] movq mm6, qword ptr [esi+48] movq mm4, qword ptr [esi+56] psrld mm0, mm7 psrld mm5, mm7 psrld mm6, mm7 psrld mm4, mm7 pand mm0, SIMD_MMX_dword_byte_mask pand mm5, SIMD_MMX_dword_byte_mask pand mm6, SIMD_MMX_dword_byte_mask pand mm4, SIMD_MMX_dword_byte_mask packuswb mm0, mm5 packuswb mm6, mm4 packuswb mm0, mm6 movq greenBlock+8, mm0 movd mm2, maxGreen pshufw mm2, mm2, R_SHUFFLE_D( 0, 0, 0, 0 ) movq mm1, mm2 movd mm3, minGreen pshufw mm3, mm3, R_SHUFFLE_D( 0, 0, 0, 0 ) movq mm4, mm2 psubw mm4, mm3 pmulhw mm4, SIMD_MMX_word_div_by_6 psllw mm2, 1 paddw mm2, mm3 pmulhw mm2, SIMD_MMX_word_div_by_3 psubw mm2, mm4 packuswb mm2, mm2 // gb2 psllw mm3, 1 paddw mm3, mm1 pmulhw mm3, SIMD_MMX_word_div_by_3 psubw mm3, mm4 packuswb mm3, mm3 // gb3 psubw mm1, mm4 packuswb mm1, mm1 // gb1 pshufw mm0, greenBlock+0, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm5, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm6, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm7, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pmaxub mm5, mm1 pmaxub mm6, mm2 pmaxub mm7, mm3 pcmpeqb mm5, mm0 pcmpeqb mm6, mm0 pcmpeqb mm7, mm0 pshufw mm0, SIMD_MMX_byte_4, R_SHUFFLE_D( 0, 1, 2, 3 ) paddsb mm0, mm5 paddsb mm6, mm7 paddsb mm0, mm6 pand mm0, SIMD_MMX_byte_3 pshufw mm4, SIMD_MMX_byte_2, R_SHUFFLE_D( 0, 1, 2, 3 ) pcmpgtb mm4, mm0 pand mm4, SIMD_MMX_byte_1 pxor mm0, mm4 pshufw mm4, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm5, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm6, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm7, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) psrlq mm4, 8 2 psrlq mm5, 16 4 psrlq mm6, 24 6 psrlq mm7, 32 8 pand mm4, SIMD_MMX_dword_color_bit_mask1 pand mm5, SIMD_MMX_dword_color_bit_mask2 pand mm6, SIMD_MMX_dword_color_bit_mask3 pand mm7, SIMD_MMX_dword_color_bit_mask4 por mm5, mm4 por mm7, mm6 por mm7, mm5 pshufw mm4, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm5, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm6, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) psrlq mm4, 4010 psrlq mm5, 4812 psrlq mm6, 5614 pand mm0, SIMD_MMX_dword_color_bit_mask0 pand mm4, SIMD_MMX_dword_color_bit_mask5 pand mm5, SIMD_MMX_dword_color_bit_mask6 pand mm6, SIMD_MMX_dword_color_bit_mask7 por mm4, mm5 por mm0, mm6 por mm7, mm4 por mm7, mm0 mov esi, gobalOutPtr movd [esi+0], mm7 pshufw mm0, greenBlock+8, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm5, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm6, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm7, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pmaxub mm5, mm1 pmaxub mm6, mm2 pmaxub mm7, mm3 pcmpeqb mm5, mm0 pcmpeqb mm6, mm0 pcmpeqb mm7, mm0 pshufw mm0, SIMD_MMX_byte_4, R_SHUFFLE_D( 0, 1, 2, 3 ) paddsb mm0, mm5 paddsb mm6, mm7 paddsb mm0, mm6 pand mm0, SIMD_MMX_byte_3 pshufw mm4, SIMD_MMX_byte_2, R_SHUFFLE_D( 0, 1, 2, 3 ) pcmpgtb mm4, mm0 pand mm4, SIMD_MMX_byte_1 pxor mm0, mm4 pshufw mm4, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm5, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm6, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm7, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) psrlq mm4, 8 2 psrlq mm5, 16 4 psrlq mm6, 24 6 psrlq mm7, 32 8 pand mm4, SIMD_MMX_dword_color_bit_mask1 pand mm5, SIMD_MMX_dword_color_bit_mask2 pand mm6, SIMD_MMX_dword_color_bit_mask3 pand mm7, SIMD_MMX_dword_color_bit_mask4 por mm5, mm4 por mm7, mm6 por mm7, mm5 pshufw mm4, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm5, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) pshufw mm6, mm0, R_SHUFFLE_D( 0, 1, 2, 3 ) psrlq mm4, 4010 psrlq mm5, 4812 psrlq mm6, 5614 pand mm0, SIMD_MMX_dword_color_bit_mask0 pand mm4, SIMD_MMX_dword_color_bit_mask5 pand mm5, SIMD_MMX_dword_color_bit_mask6 pand mm6, SIMD_MMX_dword_color_bit_mask7 por mm4, mm5 por mm0, mm6 por mm7, mm4 por mm7, mm0 movd [esi+2], mm7 emms } globalOutData += 4; } void CompressNormalMapDXT5_MMX( const byte *inBuf, byte *outBuf, int width, int height, int &outputBytes ) { ALIGN16( byte block[64] ); ALIGN16( byte normalMin[4] ); ALIGN16( byte normalMax[4] ); globalOutData = outBuf; for ( int j = 0; j < height; j += 4, inBuf += width * 4*4 ) { for ( int i = 0; i < width; i += 4 ) { ExtractBlock_MMX( inBuf + i * 4, width, block ); GetMinMaxNormalsBBox_MMX( block, normalMin, normalMax ); InsetNormalsBBoxDXT5_MMX( normalMin, normalMax ); // Write out Nx into alpha channel. EmitByte( normalMax[0] ); EmitByte( normalMin[0] ); EmitAlphaIndices_MMX( block, 0*8, normalMin[0], normalMax[0] ); // Write out Ny into green channel. EmitUShort( NormalYTo565( normalMax[1] ) ); EmitUShort( NormalYTo565( normalMin[1] ) ); EmitGreenIndices_MMX( block, 1*8, normalMin[1], normalMax[1] ); } } outputBytes = outData  outBuf; } void CompressNormalMap3Dc_MMX( const byte *inBuf, byte *outBuf, int width, int height, int &outputBytes ) { ALIGN16( byte block[64] ); ALIGN16( byte normalMin[4] ); ALIGN16( byte normalMax[4] ); globalOutData = outBuf; for ( int j = 0; j < height; j += 4, inBuf += width * 4*4 ) { for ( int i = 0; i < width; i += 4 ) { ExtractBlock_MMX( inBuf + i * 4, width, block ); GetMinMaxNormalsBBox_MMX( block, normalMin, normalMax ); InsetNormalsBBox3Dc_MMX( normalMin, normalMax ); // Write out Nx as an alpha channel. EmitByte( normalMax[0] ); EmitByte( normalMin[0] ); EmitAlphaIndices_MMX( block, 0*8, normalMin[0], normalMax[0] ); // Write out Ny as an alpha channel. EmitByte( normalMax[1] ); EmitByte( normalMin[1] ); EmitAlphaIndices_MMX( block, 1*8, normalMin[1], normalMax[1] ); } } outputBytes = outData  outBuf; }
Appendix C
/* RealTime Normal Map Compression (SSE2) Copyright (C) 2008 Id Software, Inc. Written by J.M.P. van Waveren This code is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version. This code is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details. */ #define ALIGN16( x ) __declspec(align(16)) x #define R_SHUFFLE_D( x, y, z, w ) (( (w) & 3 ) << 6  ( (z) & 3 ) << 4  ( (y) & 3 ) << 2  ( (x) & 3 )) ALIGN16( static dword SIMD_SSE2_dword_byte_mask[4] ) = { 0x000000FF, 0x000000FF, 0x000000FF, 0x000000FF }; ALIGN16( static dword SIMD_SSE2_dword_alpha_bit_mask0[4] ) = { 7<<0, 0, 7<<0, 0 }; ALIGN16( static dword SIMD_SSE2_dword_alpha_bit_mask1[4] ) = { 7<<3, 0, 7<<3, 0 }; ALIGN16( static dword SIMD_SSE2_dword_alpha_bit_mask2[4] ) = { 7<<6, 0, 7<<6, 0 }; ALIGN16( static dword SIMD_SSE2_dword_alpha_bit_mask3[4] ) = { 7<<9, 0, 7<<9, 0 }; ALIGN16( static dword SIMD_SSE2_dword_alpha_bit_mask4[4] ) = { 7<<12, 0, 7<<12, 0 }; ALIGN16( static dword SIMD_SSE2_dword_alpha_bit_mask5[4] ) = { 7<<15, 0, 7<<15, 0 }; ALIGN16( static dword SIMD_SSE2_dword_alpha_bit_mask6[4] ) = { 7<<18, 0, 7<<18, 0 }; ALIGN16( static dword SIMD_SSE2_dword_alpha_bit_mask7[4] ) = { 7<<21, 0, 7<<21, 0 }; ALIGN16( static word SIMD_SSE2_word_0[8] ) = { 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000 }; ALIGN16( static word SIMD_SSE2_word_div_by_3[8] ) = { (1<<16)/3+1, (1<<16)/3+1, (1<<16)/3+1, (1<<16)/3+1, (1<<16)/3+1, (1<<16)/3+1, (1<<16)/3+1, (1<<16)/3+1 }; ALIGN16( static word SIMD_SSE2_word_div_by_7[8] ) = { (1<<16)/7+1, (1<<16)/7+1, (1<<16)/7+1, (1<<16)/7+1, (1<<16)/7+1, (1<<16)/7+1, (1<<16)/7+1, (1<<16)/7+1 }; ALIGN16( static word SIMD_SSE2_word_div_by_14[8] ) = { (1<<16)/14+1, (1<<16)/14+1, (1<<16)/14+1, (1<<16)/14+1, (1<<16)/14+1, (1<<16)/14+1, (1<<16)/14+1, (1<<16)/14+1 }; ALIGN16( static word SIMD_SSE2_word_scale66554400[8] ) = { 6, 6, 5, 5, 4, 4, 0, 0 }; ALIGN16( static word SIMD_SSE2_word_scale11223300[8] ) = { 1, 1, 2, 2, 3, 3, 0, 0 }; ALIGN16( static word SIMD_SSE2_word_insetNormalDXT5Round[8] ) = { ((1<<(INSET_ALPHA_SHIFT1))1), ((1<<(INSET_COLOR_SHIFT1))1), 0, 0, 0, 0, 0, 0 }; ALIGN16( static word SIMD_SSE2_word_insetNormalDXT5Mask[8] ) = { 0xFFFF, 0xFFFF, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000 }; ALIGN16( static word SIMD_SSE2_word_insetNormalDXT5ShiftUp[8] ) = { 1 << INSET_ALPHA_SHIFT, 1 << INSET_COLOR_SHIFT, 1, 1, 1, 1, 1, 1 }; ALIGN16( static word SIMD_SSE2_word_insetNormalDXT5ShiftDown[8] ) = { 1 << ( 16  INSET_ALPHA_SHIFT ), 1 << ( 16  INSET_COLOR_SHIFT ), 0, 0, 0, 0, 0, 0 }; ALIGN16( static word SIMD_SSE2_word_insetNormalDXT5QuantMask[8] ) = { 0xFF, C565_6_MASK, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF }; ALIGN16( static word SIMD_SSE2_word_insetNormalDXT5Rep[8] ) = { 0, 1 << ( 16  6 ), 0, 0, 0, 0, 0, 0 }; ALIGN16( static word SIMD_SSE2_word_insetNormal3DcRound[8] ) = { ((1<<(INSET_ALPHA_SHIFT1))1), ((1<<(INSET_ALPHA_SHIFT1))1), 0, 0, 0, 0, 0, 0 }; ALIGN16( static word SIMD_SSE2_word_insetNormal3DcMask[8] ) = { 0xFFFF, 0xFFFF, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000 }; ALIGN16( static word SIMD_SSE2_word_insetNormal3DcShiftUp[8] ) = { 1 << INSET_ALPHA_SHIFT, 1 << INSET_ALPHA_SHIFT, 1, 1, 1, 1, 1, 1 }; ALIGN16( static word SIMD_SSE2_word_insetNormal3DcShiftDown[8] ) = { 1 << ( 16  INSET_ALPHA_SHIFT ), 1 << ( 16  INSET_ALPHA_SHIFT ), 0, 0, 0, 0, 0, 0 }; ALIGN16( static byte SIMD_SSE2_byte_0[16] ) = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 }; ALIGN16( static byte SIMD_SSE2_byte_1[16] ) = { 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01 }; ALIGN16( static byte SIMD_SSE2_byte_2[16] ) = { 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02 }; ALIGN16( static byte SIMD_SSE2_byte_7[16] ) = { 0x07, 0x07, 0x07, 0x07, 0x07, 0x07, 0x07, 0x07, 0x07, 0x07, 0x07, 0x07, 0x07, 0x07, 0x07, 0x07 }; void ExtractBlock_SSE2( const byte *inPtr, int width, byte *block ) { __asm { mov esi, inPtr mov edi, block mov eax, width shl eax, 2 movdqa xmm0, [esi] movdqa xmmword ptr [edi+ 0], xmm0 movdqa xmm1, xmmword ptr [esi+eax] movdqa xmmword ptr [edi+16], xmm1 movdqa xmm2, xmmword ptr [esi+eax*2] add esi, eax movdqa xmmword ptr [edi+32], xmm2 movdqa xmm3, xmmword ptr [esi+eax*2] movdqa xmmword ptr [edi+48], xmm3 } } void GetMinMaxNormalsBBox_SSE2( const byte *block, byte *minNormal, byte *maxNormal ) { __asm { mov eax, block mov esi, minNormal mov edi, maxNormal movdqa xmm0, xmmword ptr [eax+ 0] movdqa xmm1, xmmword ptr [eax+ 0] pminub xmm0, xmmword ptr [eax+16] pmaxub xmm1, xmmword ptr [eax+16] pminub xmm0, xmmword ptr [eax+32] pmaxub xmm1, xmmword ptr [eax+32] pminub xmm0, xmmword ptr [eax+48] pmaxub xmm1, xmmword ptr [eax+48] pshufd xmm3, xmm0, R_SHUFFLE_D( 2, 3, 2, 3 ) pshufd xmm4, xmm1, R_SHUFFLE_D( 2, 3, 2, 3 ) pminub xmm0, xmm3 pmaxub xmm1, xmm4 pshuflw xmm6, xmm0, R_SHUFFLE_D( 2, 3, 2, 3 ) pshuflw xmm7, xmm1, R_SHUFFLE_D( 2, 3, 2, 3 ) pminub xmm0, xmm6 pmaxub xmm1, xmm7 movd dword ptr [esi], xmm0 movd dword ptr [edi], xmm1 } } void InsetNormalsBBoxDXT5_SSE2( byte *minNormal, byte *maxNormal ) { __asm { mov esi, minNormal mov edi, maxNormal movd xmm0, dword ptr [esi] movd xmm1, dword ptr [edi] punpcklbw xmm0, SIMD_SSE2_byte_0 punpcklbw xmm1, SIMD_SSE2_byte_0 movdqa xmm2, xmm1 psubw xmm2, xmm0 psubw xmm2, SIMD_SSE2_word_insetNormalDXT5Round pand xmm2, SIMD_SSE2_word_insetNormalDXT5Mask pmullw xmm0, SIMD_SSE2_word_insetNormalDXT5ShiftUp pmullw xmm1, SIMD_SSE2_word_insetNormalDXT5ShiftUp paddw xmm0, xmm2 psubw xmm1, xmm2 pmulhw xmm0, SIMD_SSE2_word_insetNormalDXT5ShiftDown pmulhw xmm1, SIMD_SSE2_word_insetNormalDXT5ShiftDown pmaxsw xmm0, SIMD_SSE2_word_0 pmaxsw xmm1, SIMD_SSE2_word_0 pand xmm0, SIMD_SSE2_word_insetNormalDXT5QuantMask pand xmm1, SIMD_SSE2_word_insetNormalDXT5QuantMask movdqa xmm2, xmm0 movdqa xmm3, xmm1 pmulhw xmm2, SIMD_SSE2_word_insetNormalDXT5Rep pmulhw xmm3, SIMD_SSE2_word_insetNormalDXT5Rep por xmm0, xmm2 por xmm1, xmm3 packuswb xmm0, xmm0 packuswb xmm1, xmm1 movd dword ptr [esi], xmm0 movd dword ptr [edi], xmm1 } } void InsetNormalsBBox3Dc_SSE2( byte *minNormal, byte *maxNormal ) { __asm { mov esi, minNormal mov edi, maxNormal movd xmm0, dword ptr [esi] movd xmm1, dword ptr [edi] punpcklbw xmm0, SIMD_SSE2_byte_0 punpcklbw xmm1, SIMD_SSE2_byte_0 movdqa xmm2, xmm1 psubw xmm2, xmm0 psubw xmm2, SIMD_SSE2_word_insetNormal3DcRound pand xmm2, SIMD_SSE2_word_insetNormal3DcMask pmullw xmm0, SIMD_SSE2_word_insetNormal3DcShiftUp pmullw xmm1, SIMD_SSE2_word_insetNormal3DcShiftUp paddw xmm0, xmm2 psubw xmm1, xmm2 pmulhw xmm0, SIMD_SSE2_word_insetNormal3DcShiftDown pmulhw xmm1, SIMD_SSE2_word_insetNormal3DcShiftDown pmaxsw xmm0, SIMD_SSE2_word_0 pmaxsw xmm1, SIMD_SSE2_word_0 packuswb xmm0, xmm0 packuswb xmm1, xmm1 movd dword ptr [esi], xmm0 movd dword ptr [edi], xmm1 } } void EmitAlphaIndices_SSE2( const byte *block, const int channelBitOffset, const int minAlpha, const int maxAlpha ) { __asm { movd xmm7, channelBitOffset mov esi, block movdqa xmm0, xmmword ptr [esi+ 0] movdqa xmm5, xmmword ptr [esi+16] movdqa xmm6, xmmword ptr [esi+32] movdqa xmm4, xmmword ptr [esi+48] psrld xmm0, xmm7 psrld xmm5, xmm7 psrld xmm6, xmm7 psrld xmm4, xmm7 pand xmm0, SIMD_SSE2_dword_byte_mask pand xmm5, SIMD_SSE2_dword_byte_mask pand xmm6, SIMD_SSE2_dword_byte_mask pand xmm4, SIMD_SSE2_dword_byte_mask packuswb xmm0, xmm5 packuswb xmm6, xmm4 movd xmm5, maxAlpha pshuflw xmm5, xmm5, R_SHUFFLE_D( 0, 0, 0, 0 ) pshufd xmm5, xmm5, R_SHUFFLE_D( 0, 0, 0, 0 ) movdqa xmm7, xmm5 movd xmm2, minAlpha pshuflw xmm2, xmm2, R_SHUFFLE_D( 0, 0, 0, 0 ) pshufd xmm2, xmm2, R_SHUFFLE_D( 0, 0, 0, 0 ) movdqa xmm3, xmm2 movdqa xmm4, xmm5 psubw xmm4, xmm2 pmulhw xmm4, SIMD_SSE2_word_div_by_14 movdqa xmm1, xmm5 psubw xmm1, xmm4 packuswb xmm1, xmm1 // ab1 pmullw xmm5, SIMD_SSE2_word_scale66554400 pmullw xmm7, SIMD_SSE2_word_scale11223300 pmullw xmm2, SIMD_SSE2_word_scale11223300 pmullw xmm3, SIMD_SSE2_word_scale66554400 paddw xmm5, xmm2 paddw xmm7, xmm3 pmulhw xmm5, SIMD_SSE2_word_div_by_7 pmulhw xmm7, SIMD_SSE2_word_div_by_7 psubw xmm5, xmm4 psubw xmm7, xmm4 pshufd xmm2, xmm5, R_SHUFFLE_D( 0, 0, 0, 0 ) pshufd xmm3, xmm5, R_SHUFFLE_D( 1, 1, 1, 1 ) pshufd xmm4, xmm5, R_SHUFFLE_D( 2, 2, 2, 2 ) packuswb xmm2, xmm2 // ab2 packuswb xmm3, xmm3 // ab3 packuswb xmm4, xmm4 // ab4 packuswb xmm0, xmm6 pshufd xmm5, xmm7, R_SHUFFLE_D( 2, 2, 2, 2 ) pshufd xmm6, xmm7, R_SHUFFLE_D( 1, 1, 1, 1 ) pshufd xmm7, xmm7, R_SHUFFLE_D( 0, 0, 0, 0 ) packuswb xmm5, xmm5 // ab5 packuswb xmm6, xmm6 // ab6 packuswb xmm7, xmm7 // ab7 pmaxub xmm1, xmm0 pmaxub xmm2, xmm0 pmaxub xmm3, xmm0 pcmpeqb xmm1, xmm0 pcmpeqb xmm2, xmm0 pcmpeqb xmm3, xmm0 pmaxub xmm4, xmm0 pmaxub xmm5, xmm0 pmaxub xmm6, xmm0 pmaxub xmm7, xmm0 pcmpeqb xmm4, xmm0 pcmpeqb xmm5, xmm0 pcmpeqb xmm6, xmm0 pcmpeqb xmm7, xmm0 movdqa xmm0, SIMD_SSE2_byte_8 paddsb xmm0, xmm1 paddsb xmm2, xmm3 paddsb xmm4, xmm5 paddsb xmm6, xmm7 paddsb xmm0, xmm2 paddsb xmm4, xmm6 paddsb xmm0, xmm4 pand xmm0, SIMD_SSE2_byte_7 movdqa xmm1, SIMD_SSE2_byte_2 pcmpgtb xmm1, xmm0 pand xmm1, SIMD_SSE2_byte_1 pxor xmm0, xmm1 movdqa xmm1, xmm0 movdqa xmm2, xmm0 movdqa xmm3, xmm0 movdqa xmm4, xmm0 movdqa xmm5, xmm0 movdqa xmm6, xmm0 movdqa xmm7, xmm0 psrlq xmm1, 8 3 psrlq xmm2, 16 6 psrlq xmm3, 24 9 psrlq xmm4, 3212 psrlq xmm5, 4015 psrlq xmm6, 4818 psrlq xmm7, 5621 pand xmm0, SIMD_SSE2_dword_alpha_bit_mask0 pand xmm1, SIMD_SSE2_dword_alpha_bit_mask1 pand xmm2, SIMD_SSE2_dword_alpha_bit_mask2 pand xmm3, SIMD_SSE2_dword_alpha_bit_mask3 pand xmm4, SIMD_SSE2_dword_alpha_bit_mask4 pand xmm5, SIMD_SSE2_dword_alpha_bit_mask5 pand xmm6, SIMD_SSE2_dword_alpha_bit_mask6 pand xmm7, SIMD_SSE2_dword_alpha_bit_mask7 por xmm0, xmm1 por xmm2, xmm3 por xmm4, xmm5 por xmm6, xmm7 por xmm0, xmm2 por xmm4, xmm6 por xmm0, xmm4 mov esi, globalOutData movd [esi+0], xmm0 pshufd xmm1, xmm0, R_SHUFFLE_D( 2, 3, 0, 1 ) movd [esi+3], xmm1 } globalOutData += 6; } void EmitGreenIndices_SSE2( const byte *block, const int channelBitOffset, const int minGreen, const int maxGreen ) { __asm { movd xmm7, channelBitOffset mov esi, block movdqa xmm0, xmmword ptr [esi+ 0] movdqa xmm5, xmmword ptr [esi+16] movdqa xmm6, xmmword ptr [esi+32] movdqa xmm4, xmmword ptr [esi+48] psrld xmm0, xmm7 psrld xmm5, xmm7 psrld xmm6, xmm7 psrld xmm4, xmm7 pand xmm0, SIMD_SSE2_dword_byte_mask pand xmm5, SIMD_SSE2_dword_byte_mask pand xmm6, SIMD_SSE2_dword_byte_mask pand xmm4, SIMD_SSE2_dword_byte_mask packuswb xmm0, xmm5 packuswb xmm6, xmm4 movd xmm2, maxGreen pshuflw xmm2, xmm2, R_SHUFFLE_D( 0, 0, 0, 0 ) pshufd xmm2, xmm2, R_SHUFFLE_D( 0, 0, 0, 0 ) movdqa xmm1, xmm2 movd xmm3, minGreen pshuflw xmm3, xmm3, R_SHUFFLE_D( 0, 0, 0, 0 ) pshufd xmm3, xmm3, R_SHUFFLE_D( 0, 0, 0, 0 ) movdqa xmm4, xmm2 psubw xmm4, xmm3 pmulhw xmm4, SIMD_SSE2_word_div_by_6 psllw xmm2, 1 paddw xmm2, xmm3 pmulhw xmm2, SIMD_SSE2_word_div_by_3 psubw xmm2, xmm4 packuswb xmm2, xmm2 // gb2 psllw xmm3, 1 paddw xmm3, xmm1 pmulhw xmm3, SIMD_SSE2_word_div_by_3 psubw xmm3, xmm4 packuswb xmm3, xmm3 // gb3 psubw xmm1, xmm4 packuswb xmm1, xmm1 // gb1 packuswb xmm0, xmm6 pmaxub xmm1, xmm0 pmaxub xmm2, xmm0 pmaxub xmm3, xmm0 pcmpeqb xmm1, xmm0 pcmpeqb xmm2, xmm0 pcmpeqb xmm3, xmm0 movdqa xmm0, SIMD_SSE2_byte_4 paddsb xmm0, xmm1 paddsb xmm2, xmm3 paddsb xmm0, xmm2 pand xmm0, SIMD_SSE2_byte_3 movdqa xmm4, SIMD_SSE2_byte_2 pcmpgtb xmm4, xmm0 pand xmm4, SIMD_SSE2_byte_1 pxor xmm0, xmm4 movdqa xmm4, xmm0 movdqa xmm5, xmm0 movdqa xmm6, xmm0 movdqa xmm7, xmm0 psrlq xmm4, 8 2 psrlq xmm5, 16 4 psrlq xmm6, 24 6 psrlq xmm7, 32 8 pand xmm4, SIMD_SSE2_dword_color_bit_mask1 pand xmm5, SIMD_SSE2_dword_color_bit_mask2 pand xmm6, SIMD_SSE2_dword_color_bit_mask3 pand xmm7, SIMD_SSE2_dword_color_bit_mask4 por xmm5, xmm4 por xmm7, xmm6 por xmm7, xmm5 movdqa xmm4, xmm0 movdqa xmm5, xmm0 movdqa xmm6, xmm0 psrlq xmm4, 4010 psrlq xmm5, 4812 psrlq xmm6, 5614 pand xmm0, SIMD_SSE2_dword_color_bit_mask0 pand xmm4, SIMD_SSE2_dword_color_bit_mask5 pand xmm5, SIMD_SSE2_dword_color_bit_mask6 pand xmm6, SIMD_SSE2_dword_color_bit_mask7 por xmm4, xmm5 por xmm0, xmm6 por xmm7, xmm4 por xmm7, xmm0 mov esi, globalOutData movd [esi+0], xmm7 pshufd xmm6, xmm7, R_SHUFFLE_D( 2, 3, 0, 1 ) movd [esi+2], xmm6 } globalOutData += 4; } bool CompressNormalMapDXT5_SSE2( const byte *inBuf, byte *outBuf, int width, int height, int &outputBytes ) { ALIGN16( byte block[64] ); ALIGN16( byte normalMin[4] ); ALIGN16( byte normalMax[4] ); globalOutData = outBuf; for ( int j = 0; j < height; j += 4, inBuf += width * 4*4 ) { for ( int i = 0; i < width; i += 4 ) { ExtractBlock_SSE2( inBuf + i * 4, width, block ); GetMinMaxNormalsBBox_SSE2( block, normalMin, normalMax ); InsetNormalsBBoxDXT5_SSE2( normalMin, normalMax ); // Write out Nx into alpha channel. EmitByte( normalMax[0] ); EmitByte( normalMin[0] ); EmitAlphaIndices_SSE2( block, 0*8, normalMin[0], normalMax[0] ); // Write out Ny into green channel. EmitUShort( NormalYTo565( normalMax[1] ) ); EmitUShort( NormalYTo565( normalMin[1] ) ); EmitGreenIndices_SSE2( block, 1*8, normalMin[1], normalMax[1] ); } } outputBytes = outData  outBuf; } void CompressNormalMap3Dc_SSE2( const byte *inBuf, byte *outBuf, int width, int height, int &outputBytes ) { ALIGN16( byte block[64] ); ALIGN16( byte normalMin[4] ); ALIGN16( byte normalMax[4] ); globalOutData = outBuf; for ( int j = 0; j < height; j += 4, inBuf += width * 4*4 ) { for ( int i = 0; i < width; i += 4 ) { ExtractBlock_SSE2( inBuf + i * 4, width, block ); GetMinMaxNormalsBBox_SSE2( block, normalMin, normalMax ); InsetNormalsBBox3Dc_SSE2( normalMin, normalMax ); // Write out Nx as an alpha channel. EmitByte( normalMax[0] ); EmitByte( normalMin[0] ); EmitAlphaIndices_SSE2( block, 0*8, normalMin[0], normalMax[0] ); // Write out Ny as an alpha channel. EmitByte( normalMax[1] ); EmitByte( normalMin[1] ); EmitAlphaIndices_SSE2( block, 1*8, normalMin[1], normalMax[1] ); } } outputBytes = outData  outBuf; }
Appendix D
/* Realtime DXT1 & YCoCg DXT5 compression (Cg 2.0) Copyright (c) NVIDIA Corporation. Written by: Ignacio CastanoThanks to JMP van Waveren, Simon Green, Eric Werness, Simon Brown Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. */ // vertex program void compress_vp(float4 pos : POSITION, float2 texcoord : TEXCOORD0, out float4 hpos : POSITION, out float2 o_texcoord : TEXCOORD0 ) { o_texcoord = texcoord; hpos = pos; } typedef unsigned int uint; typedef unsigned int2 uint2; typedef unsigned int4 uint4; void ExtractColorBlockXY(out float2 col[16], sampler2D image, float2 texcoord, float2 imageSize) { #if 0 float2 texelSize = (1.0f / imageSize); texcoord = texelSize * 2; for (int i = 0; i < 4; i++) { for (int j = 0; j < 4; j++) { col[i*4+j] = tex2D(image, texcoord + float2(j, i) * texelSize).rg; } } #else // use TXF instruction (integer coordinates with offset) // note offsets must be constant //int4 base = int4(wpos*42, 0, 0); int4 base = int4(texcoord * imageSize  1.5, 0, 0); col[0] = tex2Dfetch(image, base, int2(0, 0)).rg; col[1] = tex2Dfetch(image, base, int2(1, 0)).rg; col[2] = tex2Dfetch(image, base, int2(2, 0)).rg; col[3] = tex2Dfetch(image, base, int2(3, 0)).rg; col[4] = tex2Dfetch(image, base, int2(0, 1)).rg; col[5] = tex2Dfetch(image, base, int2(1, 1)).rg; col[6] = tex2Dfetch(image, base, int2(2, 1)).rg; col[7] = tex2Dfetch(image, base, int2(3, 1)).rg; col[8] = tex2Dfetch(image, base, int2(0, 2)).rg; col[9] = tex2Dfetch(image, base, int2(1, 2)).rg; col[10] = tex2Dfetch(image, base, int2(2, 2)).rg; col[11] = tex2Dfetch(image, base, int2(3, 2)).rg; col[12] = tex2Dfetch(image, base, int2(0, 3)).rg; col[13] = tex2Dfetch(image, base, int2(1, 3)).rg; col[14] = tex2Dfetch(image, base, int2(2, 3)).rg; col[15] = tex2Dfetch(image, base, int2(3, 3)).rg; #endif } // find minimum and maximum colors based on bounding box in color space void FindMinMaxColorsBox(float2 block[16], out float2 mincol, out float2 maxcol) { mincol = block[0]; maxcol = block[0]; for (int i = 1; i < 16; i++) { mincol = min(mincol, block[i]); maxcol = max(maxcol, block[i]); } } void InsetNormalsBBoxDXT5(in out float2 mincol, in out float2 maxcol) { float2 inset; inset.x = (maxcol.x  mincol.x) / 32.0  (16.0 / 255.0) / 32.0; // ALPHA scalebias. inset.y = (maxcol.y  mincol.y) / 16.0  (8.0 / 255.0) / 16; // GREEN scalebias. mincol = saturate(mincol + inset); maxcol = saturate(maxcol  inset); } void InsetNormalsBBoxLATC(in out float2 mincol, in out float2 maxcol) { float2 inset = (maxcol  mincol) / 32.0  (16.0 / 255.0) / 32.0; // ALPHA scalebias. mincol = saturate(mincol + inset); maxcol = saturate(maxcol  inset); } uint EmitGreenEndPoints(in out float ming, in out float maxg) { uint c0 = round(ming * 63); uint c1 = round(maxg * 63); ming = float((c0 << 2)  (c0 >> 4)) * (1.0 / 255.0); maxg = float((c1 << 2)  (c1 >> 4)) * (1.0 / 255.0); return (c0 << 21)  (c1 << 5); } #if 1 uint EmitGreenIndices(float2 block[16], float minGreen, float maxGreen) { const int GREEN_RANGE = 3; float bias = maxGreen + (maxGreen  minGreen) / (2.0 * GREEN_RANGE); float scale = 1.0f / (maxGreen  minGreen); // Compute indices uint indices = 0; for (int i = 0; i < 16; i++) { uint index = saturate((bias  block[i].y) * scale) * GREEN_RANGE; indices = index << (i * 2); } uint i0 = (indices & 0x55555555); uint i1 = (indices & 0xAAAAAAAA) >> 1; indices = ((i0 ^ i1) << 1)  i1; // Output indices return indices; } #else uint EmitGreenIndices(float2 block[16], float minGreen, float maxGreen) { const int GREEN_RANGE = 3; float mid = (maxGreen  minGreen) / (2 * GREEN_RANGE); float yb1 = minGreen + mid; float yb2 = (2 * maxGreen + 1 * minGreen) / GREEN_RANGE + mid; float yb3 = (1 * maxGreen + 2 * minGreen) / GREEN_RANGE + mid; // Compute indices uint indices = 0; for (int i = 0; i < 16; i++) { float y = block[i].y; uint index = (y <= yb1); index += (y <= yb2); index += (y <= yb3); indices = index << (i * 2); } uint i0 = (indices & 0x55555555); uint i1 = (indices & 0xAAAAAAAA) >> 1; indices = ((i0 ^ i1) << 1)  i1; // Output indices return indices; } #endif uint EmitAlphaEndPoints(float mincol, float maxcol) { uint c0 = round(mincol * 255); uint c1 = round(maxcol * 255); return (c0 << 8)  c1; } uint2 EmitAlphaIndices(float2 block[16], float minAlpha, float maxAlpha) { const int ALPHA_RANGE = 7; float bias = maxAlpha + (maxAlpha  minAlpha) / (2.0 * ALPHA_RANGE); float scale = 1.0f / (maxAlpha  minAlpha); uint2 indices = 0; for (int i = 0; i < 6; i++) { uint index = saturate((bias  block[i].x) * scale) * ALPHA_RANGE; indices.x = index << (3 * i); } for (int i = 6; i < 16; i++) { uint index = saturate((bias  block[i].x) * scale) * ALPHA_RANGE; indices.y = index << (3 * i  18); } uint2 i0 = (indices >> 0) & 0x09249249; uint2 i1 = (indices >> 1) & 0x09249249; uint2 i2 = (indices >> 2) & 0x09249249; i2 ^= i0 & i1; i1 ^= i0; i0 ^= (i1  i2); indices.x = (i2.x << 2)  (i1.x << 1)  i0.x; indices.y = (((i2.y << 2)  (i1.y << 1)  i0.y) << 2)  (indices.x >> 16); indices.x <<= 16; return indices; } uint2 EmitLuminanceIndices(float2 block[16], float minAlpha, float maxAlpha) { const int ALPHA_RANGE = 7; float bias = maxAlpha + (maxAlpha  minAlpha) / (2.0 * ALPHA_RANGE); float scale = 1.0f / (maxAlpha  minAlpha); uint2 indices = 0; for (int i = 0; i < 6; i++) { uint index = saturate((bias  block[i].y) * scale) * ALPHA_RANGE; indices.x = index << (3 * i); } for (int i = 6; i < 16; i++) { uint index = saturate((bias  block[i].y) * scale) * ALPHA_RANGE; indices.y = index << (3 * i  18); } uint2 i0 = (indices >> 0) & 0x09249249; uint2 i1 = (indices >> 1) & 0x09249249; uint2 i2 = (indices >> 2) & 0x09249249; i2 ^= i0 & i1; i1 ^= i0; i0 ^= (i1  i2); indices.x = (i2.x << 2)  (i1.x << 1)  i0.x; indices.y = (((i2.y << 2)  (i1.y << 1)  i0.y) << 2)  (indices.x >> 16); indices.x <<= 16; return indices; } // compress a 4x4 block to DXT5nm format // integer version, renders to 4 x int32 buffer uint4 compress_NormalDXT5_fp(float2 texcoord : TEXCOORD0, uniform sampler2D image, uniform float2 imageSize = { 512.0, 512.0 } ) : COLOR { // read block float2 block[16]; ExtractColorBlockXY(block, image, texcoord, imageSize); // find min and max colors float2 mincol, maxcol; FindMinMaxColorsBox(block, mincol, maxcol); InsetNormalsBBoxDXT5(mincol, maxcol); uint4 output; // Output X in DXT5 green channel. output.z = EmitGreenEndPoints(mincol.y, maxcol.y); output.w = EmitGreenIndices(block, mincol.y, maxcol.y); // Output Y in DXT5 alpha block. output.x = EmitAlphaEndPoints(mincol.x, maxcol.x); uint2 indices = EmitAlphaIndices(block, mincol.x, maxcol.x); output.x = indices.x; output.y = indices.y; return output; } // compress a 4x4 block to LATC format // integer version, renders to 4 x int32 buffer uint4 compress_NormalLATC_fp(float2 texcoord : TEXCOORD0, uniform sampler2D image, uniform float2 imageSize = { 512.0, 512.0 } ) : COLOR { //imageSize = tex2Dsize(image, texcoord); // read block float2 block[16]; ExtractColorBlockXY(block, image, texcoord, imageSize); // find min and max colors float2 mincol, maxcol; FindMinMaxColorsBox(block, mincol, maxcol); InsetNormalsBBoxLATC(mincol, maxcol); uint4 output; // Output Ny as an alpha block. output.x = EmitAlphaEndPoints(mincol.y, maxcol.y); uint2 indices = EmitLuminanceIndices(block, mincol.y, maxcol.y); output.x = indices.x; output.y = indices.y; // Output Nx as an alpha block. output.z = EmitAlphaEndPoints(mincol.x, maxcol.x); indices = EmitAlphaIndices(block, mincol.x, maxcol.x); output.z = indices.x; output.w = indices.y; return output; } uniform float3 lightDirection; uniform bool reconstructNormal = true; uniform bool displayNormal = true; uniform bool displayError = false; uniform float errorScale = 4.0f; uniform sampler2D image : TEXUNIT0; uniform sampler2D originalImage : TEXUNIT1; float3 shadeNormal(float3 N) { float3 L = normalize(lightDirection); float3 R = reflect(float3(0, 0, 1), N); float diffuse = saturate(dot (N, L)); float specular = pow(saturate(dot(R, L)), 12); return 0.7 * diffuse + 0.5 * specular; } // Draw reconstructed normals. float4 display_fp(float2 texcoord : TEXCOORD0) : COLOR { float3 N; if (reconstructNormal) { N.xy = 2 * tex2D(image, texcoord).wy  1; N.z = sqrt(saturate(1  N.x * N.x  N.y * N.y)); } else { N = normalize(2 * tex2D(image, texcoord).xyz  1); } if (displayError) { float3 originalNormal = normalize(2 * tex2D(originalImage, texcoord).xyz  1); if (displayNormal) { float3 diff = (N  originalNormal) * errorScale; return float4(diff, 1); } else { float3 diff = abs(shadeNormal(N)  shadeNormal(originalNormal)) * errorScale; return float4(diff, 1); } } else { if (displayNormal) { return float4(0.5 * N + 0.5, 1); } else { return float4(shadeNormal(N), 1); } } } // Draw geometry normals. uniform float4x4 mvp : ModelViewProjection; uniform float3x3 mvit : ModelViewInverseTranspose; void display_object_vp(float4 pos : POSITION, float3 normal : NORMAL, out float4 hpos : POSITION, out float3 o_normal : TEXCOORD0) { hpos = mul(pos, mvp); o_normal = mul(normal, mvit); } float4 display_object_fp(float3 N : TEXCOORD0) : COLOR { N = normalize(N); return float4(0.5 * N + 0.5, 1); }