**************************
WW05 - 1999
Date: 01/25/99 - 01/31/99
**************************
From: Hsien-Hsin Lee
Advisor: Professor Gary S. Tyson

Research:
=========
* Did some profiling using VTune on some Direct3D games. (These games use
  Direct3D only as a rasterization library. They should have their own
  proprietary geometry engines.)
* Built viewperf using the Mesa 3D library (an alternative to OpenGL) and am
  able to run the provided dataset on my Windows 98 and Linux setups. For
  some reason, it runs much faster on Win98 than on Linux. No idea what's
  going on yet, since Linux activates 2 processors whereas only one processor
  is running under Win98.
  (Viewperf is a 3D CPU benchmark released by the SPECbench consortium. Its
  implementation is based on OpenGL. Mesa 3D was written by Brian Paul while
  he was doing his master's at Wisconsin-Madison. Mesa can be considered
  compatible with OpenGL.)
* Built the Mesa 3D library (all the source code trees are publicly
  available) and ran some 3D spinning-box demos. This could be a good
  resource for us: we can instrument and modify the geometry source code and
  use it as a vehicle for performance studies of 3D apps.
* Tweaked the build environment and re-built the Mesa 3D lib with the -pg
  option in order to gather profiling info with gprof. Now we are able to
  profile all OpenGL-compliant apps using gprof.
* Doing profiling (gprof) on the Awadvs working set on OpenGL, trying to
  identify the bottleneck of viewperf.

Reading:
========
------------------------------------------------------------------------------
Effective HW-based Data Prefetching for High-performance Processors
by Tien-Fu Chen and Jean-Loup Baer, IEEE Transactions on Computers, May 1995

This paper describes 3 HW-based schemes for data prefetching.
1) basic  2) look-ahead  3) correlation

1) Basic: use the current load to trigger a prefetch of the next address if
   the stride of the access pattern is steady.
   (Drawback: the prefetch is always only one iteration ahead.)
2) Lookahead: employ an LA-PC (lookahead program counter) that runs ahead of
   the PC.
   Use a "times" field to calculate the address to be prefetched
   (= prev_addr + stride * times).
   Use an LA-limit as a maximum distance, so that the LA-PC is stalled when
   this limit is reached or when the ORL (outstanding request list) is full.
   The ORL acts like a bus queue.
   The LA-PC is updated with the help of a branch predictor.

These 2 schemes share a drawback: they de-pipeline the memory during the
first couple of iterations and at nested-loop boundaries.
The paper does not describe how it handles excess prefetches that hit in the
cache but still consume extra machine resources.
-----------------------------------------------------------------------------
Performance Limits of Trace Caches,
Matt Postiff, Gary Tyson and Trevor Mudge. UM Tech-Report, Sept. 1998

Interesting observations. No public comments before publication.
-----------------------------------------------------------------------------
Multi-Level Texture Caching for 3D Graphics Hardware,
by Michael Cox, Narendra Bhandari, and Mike Shantz, ISCA98 (revisited)

Proposes a multi-level texture cache that uses AGP texturing on a graphics
platform (in fact, the PC platform).
------------------------------------------------------------------------------
Improving the Accuracy and Performance of Memory Communication Through
Renaming
by Gary Tyson and Todd Austin, MICRO-30

The basic idea is to adopt a load-store cache with a value file in order to
speculate loaded results before the effective address is calculated. The
load-store cache indexes into the entries of the value file. If the producer
and the consumer refer to the same datum, they index the same entry in the
value file. The result of the load speculation must be checked against the
actual value loaded from the D-cache. If the speculation is wrong, the
correct processor state must be restored.
The paper also shows that invariant values are common in SPEC95 and thus
very predictable.
The performance advantage of memory renaming diminishes when the issue width
is narrower.
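The load-store cache and value file described above can be sketched roughly
as follows. This is only a toy model under my own assumptions, not the
paper's actual design: the class name, the method names, and in particular
the addr_cache helper (which remembers the value-file entry of the last store
to each address) are hypothetical, and all tables are unbounded dictionaries.

```python
class MemoryRenamer:
    """Toy model of a PC-indexed load/store cache pointing into a value file."""

    def __init__(self):
        self.ls_cache = {}    # instruction PC -> value-file index
        self.addr_cache = {}  # data address -> value-file index of last store
                              # (assumed helper, not named in the notes)
        self.value_file = []  # speculative values shared by producer/consumer

    def _entry_for(self, pc):
        # Allocate a value-file entry the first time this PC is seen.
        if pc not in self.ls_cache:
            self.value_file.append(None)
            self.ls_cache[pc] = len(self.value_file) - 1
        return self.ls_cache[pc]

    def store(self, pc, addr, value):
        # The producer writes its value into its value-file entry and
        # records which entry last wrote this address.
        idx = self._entry_for(pc)
        self.value_file[idx] = value
        self.addr_cache[addr] = idx

    def predict_load(self, pc):
        # Speculate using only the load's PC, before its address is known.
        idx = self.ls_cache.get(pc)
        return None if idx is None else self.value_file[idx]

    def resolve_load(self, pc, addr, actual, predicted):
        # Once the address and the D-cache value are known, check the
        # speculation and train the load/store cache for next time.
        idx = self.addr_cache.get(addr)
        if idx is not None:
            self.ls_cache[pc] = idx       # consumer shares producer's entry
        else:
            idx = self._entry_for(pc)     # no known producer: keep last value
            self.value_file[idx] = actual
        return predicted is not None and predicted == actual


def demo():
    r = MemoryRenamer()
    results = []
    # Two iterations: the store at PC 0x10 feeds the load at PC 0x20,
    # with a different address and a different value each time.
    for addr, value in [(0x1000, 7), (0x2000, 9)]:
        r.store(0x10, addr, value)
        guess = r.predict_load(0x20)
        results.append((guess, r.resolve_load(0x20, addr, value, guess)))
    return results
```

In demo(), the first load has no prediction and only trains the tables. The
second load is predicted correctly from its PC alone, even though both the
address and the value changed, because it now shares the store's value-file
entry.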
The scheme will mis-speculate in some cases, such as when a loop-carried
dependency is present (because it uses the instruction's IP address to index
the dependent values).
The paper shows that recovery (re-executing only the data-dependent
instructions), although a much more complex design, outperforms squash
schemes. (The observation is that the recovery scheme executes only about
1/3 as many instructions as the squash scheme.)

Comments:
There is some similarity between memory renaming and dynamic instruction
reuse. The advantage of renaming is that the loaded data is available even
before the effective address is calculated. The advantage of dynamic
instruction reuse is that it breaks data dependencies when the same
dependency chain was recorded earlier and the dependent data values are the
same.
There is also a similarity between value prediction and renaming, but
renaming carries much more confidence than value prediction because it
depends on the relationship between producer and consumer.
------------------------------------------------------------------------------

Ideas: (to be shared)
======

Other Techy Interests:
======================
NVidia has IPO-ed, finally.
NVidia's RIVA TNT is considered the performance leader in the desktop 3D card
market. This position might not be challenged until 3Dfx introduces its
Voodoo3.
Voodoo3 will not support AGP texturing. (So AGP will not be so useful
anymore, even though AGP-4 supports 1GB/sec bandwidth.)
Voodoo3 will have its own texture compression so that textures can be stored
in the local memory of the graphics card.
ATI Rage Fury is out and seems to beat NVidia's TNT.
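
Follow-up to the Reading notes on the Chen and Baer prefetching paper: the
address calculation (= prev_addr + stride * times) can be sketched as below.
This is a minimal sketch under my own assumptions. The class name, the
method name, and the simple "same stride twice" steadiness test are
hypothetical simplifications (the paper's per-entry state handling is not
reproduced here). times=1 corresponds to the basic scheme and times>1 to the
lookahead scheme.

```python
class StridePrefetchTable:
    """Toy per-load table: load PC -> (prev_addr, stride, steady flag)."""

    def __init__(self):
        self.table = {}

    def access(self, pc, addr, times=1):
        """Record one execution of the load at `pc` touching `addr`.

        Returns the address to prefetch, or None if the stride is not
        steady yet.
        """
        entry = self.table.get(pc)
        if entry is None:
            # First time this load is seen: nothing to predict from.
            self.table[pc] = (addr, 0, False)
            return None
        prev_addr, old_stride, _ = entry
        stride = addr - prev_addr
        # Assumed rule: steady means the same non-zero stride twice in a row.
        steady = stride != 0 and stride == old_stride
        self.table[pc] = (addr, stride, steady)
        if not steady:
            return None
        # prev_addr in the notes is the address just recorded for this load.
        return addr + stride * times


def demo():
    t = StridePrefetchTable()
    out = []
    for addr in (100, 108, 116):
        out.append(t.access(0x40, addr))           # basic: one stride ahead
    out.append(t.access(0x40, 124, times=3))       # lookahead: 3 strides ahead
    return out
```

demo() returns [None, None, 124, 148]: the first two accesses only establish
the stride of 8, the third prefetches one iteration ahead, and the last one
prefetches three iterations ahead.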