
General FAQ

 
 

This FAQ is a collection of questions that developers often ask us. We hope that you will find it useful. If you have questions that you would like to suggest, please send them to us. We will review all submissions and periodically add new entries to the FAQ.

General

Performance

Texturing

Programmable Shading

Cg


General

My Direct3D app does not work. I think your drivers are broken. What's going on?

Have you tried the following?

  • Run your app with the DirectX debug run-time. Does it run clean, i.e., does it generate any errors or warnings? If you do get warnings and errors, the problem may be in your app -- try fixing everything the debug runtime complains about.
  • Run your app with the reference rasterizer device. It is slow, but if it generates the same output as running on the HAL device, then your app is doing something other than what you intended.
  • If the reference rasterizer results differ from the HAL results, then please get in touch with us. We will fix the problem as soon as possible. However, we will need the following information:
    • Operating system
    • Driver version
    • Graphics card
    • A sample application that clearly demonstrates the problem. We do not need source in general, and we are happy with reasonably complex apps (i.e., no need to narrow the problem down as long as it is clearly demonstrated). What is most important is that the problem is easy to reproduce and that the application is self-contained. An application that goes straight to the problem is best (i.e., a "single-click" application that doesn't require navigation to get to the problem).

I'm confused about vertex buffers in Direct3D. What do the various parameters mean when allocating and locking vertex buffers?

When allocating vertex buffers:

  • If the buffer is created as STATIC (D3DUSAGE_DYNAMIC not specified), it goes into video memory, period. These vertex buffers are meant to be written once and not read back.

  • The D3DUSAGE_WRITEONLY flag has a meaning only when the D3DUSAGE_DYNAMIC flag is set.

  • If the buffer is created as DYNAMIC (D3DUSAGE_DYNAMIC specified) and D3DUSAGE_WRITEONLY is specified, it goes into video memory.

  • If the buffer is created as DYNAMIC (D3DUSAGE_DYNAMIC specified) and D3DUSAGE_WRITEONLY is not specified, it goes into AGP memory. (The CPU reads faster from AGP memory than from video memory, though not as fast as from system memory, so this is a good compromise.)

  • If there is no more video memory, the buffer goes into AGP memory; if there is no more AGP memory, creation fails unless the buffer’s allocation pool is POOL_MANAGED, in which case the RUNTIME frees as many other buffers as necessary until creation succeeds and keeps a copy of the vertex buffer in system memory; the freed buffers are re-created whenever they are next used. Note that the POOL_MANAGED flag is not passed to the driver.

  • For NV1X GPUs only, any vertex buffer located in video memory may get demoted to AGP memory in the case of multi-stream rendering: if the APPLICATION renders from several vertex buffers and one of them is in AGP memory, every other one that is in video memory will be demoted to AGP as well.

When locking vertex buffers:

  • If the vertex buffer’s allocation pool is POOL_DEFAULT:

    • If no flag is specified: the lock may block because it forces a sync between the GPU and the app -> bad.

    • If the D3DLOCK_NOOVERWRITE flag is specified, the RUNTIME reuses the locks because the APPLICATION promises not to overwrite data that is being rendered. The DRIVER receives the lock calls, but returns immediately -> efficient.

    • If the D3DLOCK_DISCARD flag is specified, the APPLICATION promises to refresh the whole content of the buffer. This allows the DRIVER to rename the buffer -> efficient.

  • If the vertex buffer’s allocation pool is POOL_MANAGED:

    • The DRIVER never sees the locks on these ones.

    • For those buffers, the RUNTIME has created a cache in system memory (used to restore the buffer after a device loss) and updates the DRIVER’s vertex buffer in video memory with the "buffer blit" method right before it is used by a rendering operation.
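
To make the flag combinations above concrete, here is a minimal Direct3D 9 sketch of the dynamic-buffer pattern. CUSTOMVERTEX, D3DFVF_CUSTOMVERTEX, and the sizes are illustrative placeholders, and pDevice is assumed to be an existing IDirect3DDevice9*:

// DYNAMIC + WRITEONLY: the buffer is a candidate for video memory (see above).
IDirect3DVertexBuffer9* pVB = NULL;
const UINT vbSize = 4096 * sizeof(CUSTOMVERTEX);
pDevice->CreateVertexBuffer( vbSize,
                             D3DUSAGE_DYNAMIC | D3DUSAGE_WRITEONLY,
                             D3DFVF_CUSTOMVERTEX,
                             D3DPOOL_DEFAULT, &pVB, NULL );

// Appending data: NOOVERWRITE while there is room, DISCARD when wrapping,
// so the driver never has to stall on data that is still being rendered.
// 'offset', 'bytes', and 'vertices' are placeholders tracked by the app.
DWORD lockFlags = ( offset + bytes <= vbSize ) ? D3DLOCK_NOOVERWRITE
                                               : D3DLOCK_DISCARD;
if ( lockFlags == D3DLOCK_DISCARD )
    offset = 0;
void* pData = NULL;
pVB->Lock( offset, bytes, &pData, lockFlags );
memcpy( pData, vertices, bytes );
pVB->Unlock();
offset += bytes;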

Render-to-texture seems to really slow down my Direct3D application, even when I do it infrequently and render very little geometry to a small surface.  Is this normal?

Make sure you're not creating your texture in the D3DPOOL_MANAGED pool.  The Direct3D runtime needs a local copy of all MANAGED textures to restore them on mode switch or for texture management, so rendering to these implies that a readback from local video memory to system memory will occur, dramatically hurting performance.  For render targets, use D3DPOOL_DEFAULT instead.
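
For example, a render-target texture would typically be created along these lines (a minimal sketch; the size and format are placeholders, and pDevice is an existing IDirect3DDevice9*):

IDirect3DTexture9* pRenderTarget = NULL;
pDevice->CreateTexture( 256, 256, 1,
                        D3DUSAGE_RENDERTARGET,   // render targets...
                        D3DFMT_A8R8G8B8,
                        D3DPOOL_DEFAULT,         // ...belong in D3DPOOL_DEFAULT
                        &pRenderTarget, NULL );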

How should I handle input lag (that is, apparent mouse and keyboard response time) in situations where the CPU is getting too far ahead of the GPU?

The most obvious solution would be to lock the back buffer for each frame in Direct3D (analogous to calling glFinish() in OpenGL). This ensures that all pending graphics commands are completed by the GPU before the CPU moves on. However, this completely removes any potential for asynchronous processing, as the CPU is unable to process the next frame until the current frame has finished rendering.

A better solution is double-buffered texture locking, a generalization of locking the back buffer. At the end of your frame you render a single triangle to a tiny (2x2) texture, then read the contents of that texture. So far this is equivalent to locking the back buffer, and suffers the same kind of stalls; it ensures that the CPU never gets more than one frame ahead of the GPU.

Now generalize it: use two tiny textures and alternately render to them and alternately lock them:

  • Render frame 1

  • Render a triangle to texture 0

  • Lock and read texture 1

  • Render frame 2

  • Render a triangle to texture 1

  • Lock and read texture 0

  • Render frame 3

  • Render triangle to texture 0

  • Lock and read texture 1

  • Render frame 4

  • Render a triangle to texture 1

  • Lock and read texture 0

  • ...

Now the GPU is not stalled, and the CPU never gets more than two frames ahead of it. Lag is up to one frame, but overall efficiency is higher since the GPU stays busy (if you are GPU bound). You can further generalize this to triple-buffered textures, and you may even be able to insert multiple sync points per frame to get finer control over lag.

A second solution is to use DirectX 9's asynchronous query functionality (analogous to using fences in OpenGL). At the end of your frame, insert a D3DQUERYTYPE_EVENT query into your rendering stream. You can then poll whether the GPU has reached this event by calling GetData. As with the double-buffered texture approach, you can thus ensure (by busy-waiting on the CPU) that the CPU never gets more than 2 frames ahead of the GPU, while the GPU is never idle. Similarly, it is conceivable to insert multiple queries per frame to get finer control over lag.
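
A minimal sketch of the query approach (error handling omitted; pDevice is an existing IDirect3DDevice9*):

IDirect3DQuery9* pEventQuery = NULL;
pDevice->CreateQuery( D3DQUERYTYPE_EVENT, &pEventQuery );

// At the end of the frame, insert the event into the command stream.
pEventQuery->Issue( D3DISSUE_END );

// Later (e.g., two frames on), poll until the GPU has consumed the event.
BOOL bDone = FALSE;
while ( pEventQuery->GetData( &bDone, sizeof(BOOL), D3DGETDATA_FLUSH ) == S_FALSE )
{
    // Optionally do useful CPU work here instead of spinning.
}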

I upgraded my application to DirectX9 and my hardware shadow maps no longer work!  What's up?  

We've changed the behavior of hardware shadow maps between the DirectX8 and DirectX9 interfaces.  In DirectX8, you're required to scale the interpolated z component (that will be compared with the value in the shadow map) by the bit depth of the shadow map itself.  Starting with the DirectX9 interfaces, we've changed this behavior to no longer require this scale, so the z value should be in the range [0..1], regardless of bit depth.  Basically, we wanted to implement this new cleaner behavior, but didn't want to break shipping apps that rely on the old behavior, so we changed it only for the new DX9 interfaces.
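
For illustration, here is a hedged sketch of how the texture-matrix setup differs; the matrix layout follows the usual hardware shadow map recipe, and the variable names and shadow map size are illustrative:

// Sketch: build the matrix that maps light-space clip coordinates to
// shadow-map texture coordinates. Assumes a 24-bit depth shadow map.
float fOffsetX = 0.5f + 0.5f / (float)shadowMapWidth;
float fOffsetY = 0.5f + 0.5f / (float)shadowMapHeight;

// DirectX 8 interfaces: z had to be scaled by the shadow map's bit depth.
float fZScale = 16777215.0f;   // 2^24 - 1
// DirectX 9 interfaces: no scale -- z stays in [0..1].
// float fZScale = 1.0f;

D3DXMATRIX matTexAdj( 0.5f,      0.0f,      0.0f,    0.0f,
                      0.0f,     -0.5f,      0.0f,    0.0f,
                      0.0f,      0.0f,      fZScale, 0.0f,
                      fOffsetX,  fOffsetY,  0.0f,    1.0f );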

How do I force FSAA on within my application?

In OpenGL, use the ARB_multisample extension. In Direct3D, set the MultiSampleType member of the D3DPRESENT_PARAMETERS struct passed into IDirect3D9::CreateDevice().  For example, this code snippet enables 4x antialiasing:

D3DPRESENT_PARAMETERS d3dPP;
ZeroMemory( &d3dPP, sizeof( d3dPP ) );
d3dPP.Windowed = FALSE;
d3dPP.SwapEffect = D3DSWAPEFFECT_DISCARD;
d3dPP.MultiSampleType = D3DMULTISAMPLE_4_SAMPLES;

pD3D->CreateDevice( D3DADAPTER_DEFAULT, D3DDEVTYPE_HAL, hWnd,
                    D3DCREATE_SOFTWARE_VERTEXPROCESSING,
                    &d3dPP, &d3dDevice );
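
Before requesting a given multisample type, it is worth confirming that the device supports it. A minimal sketch using IDirect3D9::CheckDeviceMultiSampleType (the back-buffer format here is an assumption):

DWORD qualityLevels = 0;
if ( SUCCEEDED( pD3D->CheckDeviceMultiSampleType(
         D3DADAPTER_DEFAULT, D3DDEVTYPE_HAL,
         D3DFMT_X8R8G8B8,                 // assumed back-buffer format
         FALSE,                           // full-screen
         D3DMULTISAMPLE_4_SAMPLES, &qualityLevels ) ) )
{
    d3dPP.MultiSampleType = D3DMULTISAMPLE_4_SAMPLES;
}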

 

What new resolutions should I test my applications with?

When testing applications for quality assurance, remember to test some of the newer panel formats: SXGA+ (1400 x 1050), WUXGA (1920 x 1200), and WXGA (1280 x 800).  The first format uses a 4:3 aspect ratio, while the other two use a 16:10 widescreen aspect ratio.  Even if you don't have a suitable monitor to display these resolutions, they are worth testing because notebook manufacturers and new displays are now using them.

 

What floating-point surface formats does the GeForce FX family support in DirectX?

Under DirectX 9.0, all GeForce FX-class GPUs support floating-point surfaces using the multi-element texture mechanism.  Multi-element textures, as described in the DXSDK documentation, allow you to render to 4 simultaneous outputs, each 32 bits wide, for a total of 128 bits.  Each of these 32-bit components can be written as 4x8, 2x16fp, or 1x32fp.  To use these surfaces, create them using the following FourCC formats:

  • NVE0: NV_SURFACE_FORMAT_IMAGE_MET_4_4_4_4
  • NVE1: NV_SURFACE_FORMAT_IMAGE_MET_4_4_4_2
  • NVE2: NV_SURFACE_FORMAT_IMAGE_MET_4_4_4_1
  • NVE3: NV_SURFACE_FORMAT_IMAGE_MET_4_4_2_2
  • NVE4: NV_SURFACE_FORMAT_IMAGE_MET_4_4_2_1
  • NVE5: NV_SURFACE_FORMAT_IMAGE_MET_4_4_1_1
  • NVE6: NV_SURFACE_FORMAT_IMAGE_MET_4_2_2_2
  • NVE7: NV_SURFACE_FORMAT_IMAGE_MET_4_2_2_1
  • NVE8: NV_SURFACE_FORMAT_IMAGE_MET_4_2_1_1
  • NVE9: NV_SURFACE_FORMAT_IMAGE_MET_4_1_1_1
  • NVEa: NV_SURFACE_FORMAT_IMAGE_MET_2_2_2_2
  • NVEb: NV_SURFACE_FORMAT_IMAGE_MET_2_2_2_1
  • NVEc: NV_SURFACE_FORMAT_IMAGE_MET_2_2_1_1
  • NVEd: NV_SURFACE_FORMAT_IMAGE_MET_2_1_1_1

Here, 4=4xFX8, 2=2xFP16, and 1=1xFP32.  So, for example, NVE7 has 8-bit components at element index 0, 16-bit FP at indices 1 and 2, and 32-bit FP at index 3.  As you can see, all formats have 4 “chunks”, and each chunk is always 32 bits. MET_1_1_1_1 is coming soon.
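
The exact creation path is an assumption in the sketch below: the FourCC code is cast to a D3DFORMAT and passed where a format is expected, after checking support with IDirect3D9::CheckDeviceFormat (the display format and surface size are placeholders):

const D3DFORMAT fmtNVE7 = (D3DFORMAT)MAKEFOURCC( 'N', 'V', 'E', '7' );
if ( SUCCEEDED( pD3D->CheckDeviceFormat( D3DADAPTER_DEFAULT, D3DDEVTYPE_HAL,
                                         D3DFMT_X8R8G8B8,        // assumed display format
                                         D3DUSAGE_RENDERTARGET,
                                         D3DRTYPE_TEXTURE, fmtNVE7 ) ) )
{
    IDirect3DTexture9* pMET = NULL;
    pDevice->CreateTexture( 256, 256, 1, D3DUSAGE_RENDERTARGET,
                            fmtNVE7, D3DPOOL_DEFAULT, &pMET, NULL );
}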

 


Performance

I'm using glDrawPixels and glReadPixels in OpenGL. I'm seeing poor performance. What should I do?

BGRA is and always has been the fastest format to use. (There are some cases where RGBA is OK, and usually BGR is better than RGB, but in general, BGRA is the safest mode.)

The fastest readback performance you'll get is approximately 160-180 MB/s (~45 MPix/s) for RGBA/BGRA, which is the GPU hardware limit (due to PCI reads on the memory interface). This is on a P4 1.5 GHz or better class system. The readback rate doesn't change significantly with the GeForce FX family. Note that you'll get the highest performance when you read back large areas as opposed to small ones.

For glDrawPixels(), performance depends on the GPU, the path, and the image size, but for NV28GL we achieve ~130 MPix/s RGBA (520 MB/s) and for Quadro FX we achieve ~170 MPix/s RGBA (680 MB/s) for images larger than 128x128.

These numbers will vary based on your particular system, so you should consider measuring them yourself using GLperf.

Using the pixel data range extension on an AGP 8x system, we can achieve writes at ~1.7 GB/s, and ~960 MB/s on an AGP 4x system. More information is available at: http://www.nvidia.com/dev_content/nvopenglspecs/GL_NV_pixel_data_range.txt.
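
A minimal sketch of a BGRA readback (the region size and buffer management are assumptions):

// Read the lower-left 512x512 region of the framebuffer as BGRA.
// GL_BGRA requires OpenGL 1.2+ or the EXT_bgra extension.
unsigned char* pixels = (unsigned char*)malloc( 512 * 512 * 4 );
glPixelStorei( GL_PACK_ALIGNMENT, 1 );
glReadPixels( 0, 0, 512, 512, GL_BGRA, GL_UNSIGNED_BYTE, pixels );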

My application is slow. How can I figure out what's causing the slowdown?

The key is to identify your application's bottleneck. There are several ways to do this:

  • Eliminate all file accesses. Any hard disk accesses will surely kill your frame rate. This is easy enough to detect--just take a look at your computer's "hard disk in use" light.
  • Run identical GPUs on different speed CPUs.  If the frame rate varies, your application is CPU-limited.
  • Decrease your AGP speed from your system BIOS. If the frame rate varies, your application is AGP bandwidth-limited.
  • Reduce your GPU's core clock. If the slower core clock reduces performance, then your application is limited by the vertex shader, rasterization, or the fragment shader (i.e., shader-limited).
  • Reduce your GPU's memory clock. If the slower memory clock affects performance, your application is limited by texture or frame-buffer bandwidth (GPU bandwidth-limited).

Ok, now I know what my application's bottleneck is. How do I get rid of it and make my application run faster?

  • If you are CPU-limited: Try running VTune or a similar performance tool to find out where most of your time is being spent. Note that the graphics driver is a potential CPU consumer, particularly if you are using the GPU in non-standard ways. One common way to lose parallelism between the CPU and the GPU is locking resources (vertex buffers or textures), or reading back data from the GPU to the CPU.
  • If you are AGP-bandwidth-limited: Make sure your AGP settings are maximized.  Transfer less data per frame to the GPU.  Today, we see very few applications that are AGP-bandwidth-limited.
  • If you are shader-limited: Make sure you've balanced the workload between the vertex and fragment programs. For example, calculations that can be linearly interpolated belong in the vertex shader, not in the fragment shader. Use only the amount of precision that you need (choose between float, half, and fixed data types prudently). Try encoding functions in textures.
  • If you are GPU bandwidth-limited: Try reducing the size of your textures. You may also be performing too many blending operations, which are costly.

How do I time my rendering code? (How do I know how long it takes the GPU to render something?)

The wrong answer is to time all Direct3D or OpenGL calls. This simply times how long it takes the driver to submit the rendering request to the push-buffer. The actual rendering work is done asynchronously and later. There is no direct way for you to measure how long the GPU takes to process a particular rendering call.


Texturing

I've heard that it's a crime to not use mipmaps. Why is this?

Whether you're using mipmapping or not, you should never use "nearest" filtering, which causes texture swimming artifacts when textures are minified. Using mipmaps allows you to create accurately minified versions of your textures, so that they look as good as possible in the distance. Mipmapped textures only require 33% more memory than their non-mipmapped counterparts, so it is wise to trade off the small amount of extra memory for the vastly improved image quality. In addition, today's GPUs are optimized to handle mipmaps very efficiently, so the performance impact of using mipmaps is minimal.
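
For example, in OpenGL a trilinear-filtered, mipmapped texture can be set up along these lines (the texture object, image data, and dimensions are assumptions):

// Upload a texture with a full mipmap chain and trilinear filtering.
glBindTexture( GL_TEXTURE_2D, texId );
glTexParameteri( GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR );
glTexParameteri( GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR );
gluBuild2DMipmaps( GL_TEXTURE_2D, GL_RGBA8, width, height,
                   GL_BGRA, GL_UNSIGNED_BYTE, imageData );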

Can you tell me a little about the DXT texture formats?

DXT is a format for compressing textures. Using texture compression helps to reduce bandwidth between system memory and the GPU, as well as to save precious texture memory in applications that use heavy texturing.

DXT1 uses 16-bit interpolation on GeForce2 and GeForce3. Most textures will compress fine with DXT1 on these platforms. Textures with smooth gradients may show artifacts, which can often be fixed by adding dithering during compression. DXT3 and DXT5 are always interpolated at 32 bits and can be used if DXT1 has problems.

DXT1 is 4 bpp; DXT3 and DXT5 are 8 bpp.

So the short answer is use DXT1 where you can and DXT3 or DXT5 when you need to.
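
In Direct3D, a compressed texture is simply created with a DXT format; a minimal sketch (the size is a placeholder, and most applications load pre-compressed .dds files rather than compressing at run time):

IDirect3DTexture9* pTex = NULL;
pDevice->CreateTexture( 512, 512, 0,            // 0 = full mipmap chain
                        0, D3DFMT_DXT1,
                        D3DPOOL_MANAGED, &pTex, NULL );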

Our Photoshop plug-in lets you preview in 3D what the textures will look like when rendered in hardware. The plug-in also adds dithering as well as a host of other options. It is available at: http://developer.nvidia.com/view.asp?IO=ps_texture_compression_plugin

Can I use tex2D in a vertex shader?

tex2D will be supported in the vs_3_0 vertex profile and beyond in HLSL and in Cg, as it is a requirement of Vertex Shaders 3.0 and beyond. Similar functionality will be available in Cg's OpenGL profiles.

What's the difference between displacement maps and bump maps?

Displacement maps actually change geometric position at the per-vertex or per-fragment level. This means that object silhouettes actually change, based on the displacement map.

In contrast, bump maps keep the geometric position the same, but perturb the normal. Therefore, bump mapping is useful only when lighting changes or when the viewpoint moves (in the case of specular highlights). Given a static screenshot, it is often difficult to tell if a surface is bump-mapped or displaced.

I've set up my Direct3D texture stage states to perform an operation, but it's not working! What's wrong? The caps report 8 stages are supported and I'm only using 4.

Although 8 stages may be reported by the caps, that doesn't mean that all combinations of operations are supported. D3D8 added the ValidateDevice function to help with this. So before you start your game, you should run through the techniques you would like to use, and test them with ValidateDevice. To support older cards like GeForce4 MX and GeForce2 you should provide a final fallback that supports only two texture stages. If you use OpenGL, you may be able to access greater functionality through the register combiners extension.
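
A minimal sketch of such a check (Direct3D 9 shown; the state setup itself is omitted):

// After setting all the texture stage states for a technique,
// ask the driver whether it can render them in a single pass.
DWORD numPasses = 0;
HRESULT hr = pDevice->ValidateDevice( &numPasses );
if ( FAILED( hr ) || numPasses > 1 )
{
    // Fall back to a simpler technique (e.g., two texture stages only).
}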

The value of GL_MAX_TEXTURE_UNITS is 4 for GeForce FX and GeForce 6 Series GPUs. Why is that, since those GPUs have 16 texture units?

GL_MAX_TEXTURE_UNITS refers to the number of conventional texture units as defined by the old GL_ARB_multitexture extension, for which texture coordinate sets and texture image units have to correspond. It doesn’t make sense to carry this design forward when, as is the case with the GeForce FX and GeForce 6800 families, the number of texture coordinate sets and the number of texture image units differ, both are greater than 4, and you can arbitrarily use any texture coordinate to sample any texture image. So, we have intentionally chosen not to extend conventional texturing with more texture units; programmers who need more than 4 texture units should use the GL_ARB_fragment_program and GL_NV_fragment_program extensions instead. Fragment programs are far more general, flexible, efficient, and forward-looking.

The GL_MAX_TEXTURE_UNITS limit has thus been kept at 4 and the GL_ARB_fragment_program and GL_NV_fragment_program extensions define two new limits:

  • GL_MAX_TEXTURE_IMAGE_UNITS_ARB/GL_MAX_TEXTURE_IMAGE_UNITS_NV for the number of texture image units, which is 16 for the GeForce FX and GeForce 6800 families,
  • GL_MAX_TEXTURE_COORDS_ARB/GL_MAX_TEXTURE_COORDS_NV for the number of texture coordinate sets, which is 8 for the GeForce FX and GeForce 6800 families.

For GPUs that do not support GL_ARB_fragment_program and GL_NV_fragment_program, those two limits are set equal to GL_MAX_TEXTURE_UNITS.
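
A minimal sketch of querying all three limits (the ARB constant names are used here; the NV aliases have the same values):

GLint maxUnits = 0, maxImageUnits = 0, maxCoords = 0;
glGetIntegerv( GL_MAX_TEXTURE_UNITS, &maxUnits );                 // conventional units (4)
glGetIntegerv( GL_MAX_TEXTURE_IMAGE_UNITS_ARB, &maxImageUnits );  // texture image units (16)
glGetIntegerv( GL_MAX_TEXTURE_COORDS_ARB, &maxCoords );           // texture coordinate sets (8)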

It’s also important to understand which of those three limits apply to which texture-related states:

  • GL_MAX_TEXTURE_UNITS applies to:
    • glEnable/glDisable(GL_TEXTURE_xxx)
    • glTexEnv/glGetTexEnv(GL_TEXTURE_ENV_MODE, ...)
    • glTexEnv/glGetTexEnviv(GL_TEXTURE_SHADER_NV, ...)
    • glFinalCombinerInput()
    • glCombinerInput()
    • glCombinerOutput()
  • GL_MAX_TEXTURE_IMAGE_UNITS_ARB/GL_MAX_TEXTURE_IMAGE_UNITS_NV applies to:
    • glTexImageXXX(...)
    • glGetTexImage(...)
    • glTexSubImageXXX(...)
    • glCopyTexImageXXX(...)
    • glCopySubTexImageXXX(...)
    • glTexParameter/glGetTexParameter(...)
    • glColorTable/glGetColorTable(GL_TEXTURE_xxx/GL_PROXY_TEXTURE_xxx, ...)
    • glCopyColorTable/glCopyColorSubTable(GL_TEXTURE_xxx/GL_PROXY_TEXTURE_xxx, ...)
    • glGetColorTableParameter(GL_TEXTURE_xxx/GL_PROXY_TEXTURE_xxx, ...)
    • glTexEnv/glGetTexEnv(GL_TEXTURE_FILTER_CONTROL_EXT, ...)
    • glGetTextureLevelParameter(...)
    • glBindTexture(...)
  • GL_MAX_TEXTURE_COORDS_ARB/GL_MAX_TEXTURE_COORDS_NV applies to:
    • glEnable/glDisable(GL_TEXTURE_GEN_xxx)
    • glTexGen(...)
    • glMatrixMode(GL_TEXTURE)/glLoadIdentity/glLoadMatrix/glRotate/...
    • glPointParameter(GL_POINT_SPRITE_R_MODE_NV, ...)
    • glTexEnv/glGetTexEnv(GL_POINT_SPRITE_NV, ...)
    • glClientActiveTexture(...)
    • glMultiTexCoord(...)

Incidentally, the classification above also tells you that all the calls corresponding to GL_MAX_TEXTURE_UNITS are useless when using fragment programs and are actually ignored by the driver (so making those calls has no adverse performance impact). For example, you don't need to call glEnable/glDisable(GL_TEXTURE_xxx), since the texture target is passed as a parameter to the TEX/TXP/etc. instructions inside the fragment program. Making those calls will actually generate errors if the active texture index is above GL_MAX_TEXTURE_UNITS. glBindProgramARB does all the work of state configuration and texture enabling/disabling; one glBindProgramARB call basically replaces a dozen or more glActiveTexture, glEnable/glDisable, and glTexEnv calls.


Programmable Shading

Can I use pixel shaders without vertex shaders, and vice versa?

Most definitely. It's entirely up to you. If you don't use a shader, fixed-function processing will be used instead for that part of the pipeline. In addition, you can express your shaders either in a higher-level language or with assembly code.

Which level of vertex and pixel shading capability do the various NVIDIA GPUs support?

The table uses DirectX nomenclature (i.e., Vertex Shader and Pixel Shader version numbers). For GeForce FX and newer GPUs that are capable of flow control, the static instruction count represents the number of instructions in a program as it is compiled. The dynamic instruction count represents the number of instructions actually executed. In practice, the dynamic count can be vastly higher than the static count due to looping and subroutine calls.

GPU                  Vertex Shading Capability             Pixel Shading Capability
Riva 128             None                                  None
TNT                  None                                  None
TNT2                 None                                  None
GeForce 256          None                                  None
GeForce2 GTS         None                                  Register Combiners
GeForce3             1.0                                   1.1
GeForce4 Ti          1.0                                   1.3
GeForce FX           2.0+ (256 static /                    2.0+ (512 static /
                     65,535 dynamic instructions)          512 dynamic instructions)
GeForce 6 Series     3.0 (512 static /                     3.0 (2,048 static /
                     65,535 dynamic instructions)          65,535 dynamic instructions)

I'm using multiple passes, but with the exact same vertex data and transformations in each pass.  Why am I getting z-fighting in my scene?

Make sure you're not using vertex shaders to do transforms in one pass and fixed function transforms in another. These two paths, even given the exact same data, will not generally produce the exact same values, and this can cause z-fighting when rendering multiple passes with an EQUAL depth test. The solution is to stick to one method for all passes over an object, or use an OpenGL extension such as ARB_vertex_program with the ARB_position_invariant option.
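
For illustration, the OpenGL route looks roughly like this: a minimal, do-nothing vertex program using the position-invariant option (assumes the ARB_vertex_program entry points have already been obtained and a program object is generated and bound):

// With ARB_position_invariant, position is computed exactly as in the
// fixed-function pipeline, so mixed passes stay depth-consistent.
static const char vp[] =
    "!!ARBvp1.0\n"
    "OPTION ARB_position_invariant;\n"
    "MOV result.color, vertex.color;\n"
    "END\n";

glProgramStringARB( GL_VERTEX_PROGRAM_ARB, GL_PROGRAM_FORMAT_ASCII_ARB,
                    (GLsizei)strlen( vp ), vp );
glEnable( GL_VERTEX_PROGRAM_ARB );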


Cg

Please see our Cg FAQ.

 



 
 