SWAR

SIMD within a register (SWAR), also known as "packed SIMD",[1] is a technique for performing parallel operations on data contained in a processor register. SIMD stands for single instruction, multiple data. Flynn's 1972 taxonomy categorises SWAR as "pipelined processing".[2]

Many modern general-purpose computer processors have some provisions for SIMD, in the form of a group of registers and instructions to make use of them. SWAR refers to the use of those registers and instructions, as opposed to using specialized processing engines designed to be better at SIMD operations. It also refers to the use of SIMD with general-purpose registers and instructions that were not originally designed for it, by way of various novel software tricks.[3]
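As an illustration of such a software trick, the following minimal sketch counts the set bits of a 32-bit word using only ordinary shifts, masks, additions and one subtraction. It treats the word successively as sixteen 2-bit fields, eight 4-bit fields and four 8-bit fields, summing neighbouring fields in parallel at each step. The function name popcount32 is chosen here for illustration, and because Python integers are unbounded the input is masked to 32 bits to imitate a register.

```python
def popcount32(x):
    """Count the 1 bits in a 32-bit word with field-parallel additions."""
    x &= 0xFFFFFFFF                                  # imitate a 32-bit register
    x = x - ((x >> 1) & 0x55555555)                  # sixteen 2-bit counts (0..2)
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333)   # eight 4-bit counts (0..4)
    x = (x + (x >> 4)) & 0x0F0F0F0F                  # four 8-bit counts (0..8)
    x = x + (x >> 8)                                 # fold the byte counts together
    x = x + (x >> 16)
    return x & 0x3F                                  # the total is at most 32
```

No step uses an instruction designed for parallel work; the parallelism comes entirely from choosing masks so that no field can overflow into its neighbour.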

SWAR architectures

A SWAR architecture is one that includes instructions explicitly intended to perform parallel operations across data that is stored in the independent subwords or fields of a register. A SWAR-capable architecture is one that includes a set of instructions that is sufficient to allow data stored in these fields to be treated independently even though the architecture does not include instructions that are explicitly intended for that purpose.
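To make the idea of a SWAR-capable instruction set concrete, here is a hedged sketch of adding four 8-bit fields held in one 32-bit word using only AND, XOR and an ordinary integer add. The helper names (pack_bytes, unpack_bytes, swar_add_u8) are assumptions made for this example, and the 32-bit register is imitated by masking, since Python integers have no fixed width.

```python
MASK32 = 0xFFFFFFFF   # imitate a 32-bit general-purpose register
LOW7 = 0x7F7F7F7F     # low seven bits of every 8-bit field
HIGH1 = 0x80808080    # top bit of every 8-bit field


def pack_bytes(values):
    """Pack four integers in 0..255 into one word, first value in the lowest field."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFF) << (8 * i)
    return word


def unpack_bytes(word):
    """Split a 32-bit word back into its four 8-bit fields."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


def swar_add_u8(x, y):
    """Add four 8-bit fields at once, each wrapping modulo 256."""
    # Adding only the low 7 bits of each field cannot carry into the next field.
    low = (x & LOW7) + (y & LOW7)
    # The top bit of each field is the XOR of the two top bits and the carry
    # that arrived from the low 7 bits; any carry out of the field is dropped.
    return (low ^ ((x ^ y) & HIGH1)) & MASK32
```

For example, unpack_bytes(swar_add_u8(pack_bytes([250, 1, 128, 255]), pack_bytes([10, 2, 128, 1]))) yields [4, 3, 0, 0]: every field wraps on its own, and no carry leaks into a neighbouring field.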

An early example of a SWAR architecture was the Intel Pentium with MMX, which implemented the MMX extension set. The original Intel Pentium, by contrast, did not include such instructions, but was still SWAR-capable: it could perform the same kind of processing through careful hand-coding or compiler techniques.

Early SWAR architectures include DEC Alpha MVI, Hewlett-Packard's PA-RISC MAX, Silicon Graphics Incorporated's MIPS MDMX, and Sun's SPARC V9 VIS. Like MMX, many of the SWAR instruction sets are intended for faster video coding.[4]

History of the SWAR programming model

Wesley A. Clark introduced partitioned subword data operations in the 1950s. This can be seen as a very early predecessor to SWAR. Leslie Lamport presented SWAR techniques in his paper titled "Multiple byte processing with full-word instructions"[5] in 1975.

With the introduction of Intel's MMX multimedia instruction set extensions in 1996, desktop processors with SIMD parallel processing capabilities became common. Early on, these instructions could only be used via hand-written assembly code.

In the fall of 1996, Professor Hank Dietz was the instructor for the undergraduate Compiler Construction course at Purdue University's School of Electrical and Computer Engineering. For this course, he assigned a series of projects in which the students would build a simple compiler targeting MMX. The input language was a subset dialect of MasPar's MPL called NEMPL (Not Exactly MPL).

During the course of the semester, it became clear to the course teaching assistant, Randall (Randy) Fisher, that there were a number of issues with MMX that would make it difficult to build the back-end of the NEMPL compiler. For example, MMX has an instruction for multiplying 16-bit data but not multiplying 8-bit data. The NEMPL language did not account for this problem, allowing the programmer to write programs that required 8-bit multiplies.
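The kind of work this pushes onto a compiler back-end can be sketched as follows. This is a hypothetical illustration, not the approach taken in the NEMPL compiler: it assumes a 64-bit register and models the available 16-bit multiply with a stand-in function, mul_lanes_u16, that keeps the low 16 bits of each product. An 8-bit multiply (modulo 256) is then synthesized by handling the even-numbered and odd-numbered bytes in two passes.

```python
LANES16 = 0x00FF00FF00FF00FF  # low byte of each of four 16-bit fields


def mul_lanes_u16(a, b):
    """Model of a hardware multiply on four 16-bit fields of a 64-bit word.

    Each field keeps the low 16 bits of its product. The loop only stands in
    for a single instruction; it is not itself SWAR code.
    """
    out = 0
    for shift in (0, 16, 32, 48):
        product = ((a >> shift) & 0xFFFF) * ((b >> shift) & 0xFFFF)
        out |= (product & 0xFFFF) << shift
    return out


def mul_u8x8(a, b):
    """Multiply eight 8-bit fields, each modulo 256, via the 16-bit multiply."""
    # Even-numbered bytes already sit in the low half of each 16-bit field.
    even = mul_lanes_u16(a & LANES16, b & LANES16) & LANES16
    # Shift odd-numbered bytes down into the same position, multiply, then
    # move the low byte of each product back up to its original place.
    odd = mul_lanes_u16((a >> 8) & LANES16, (b >> 8) & LANES16) & LANES16
    return even | (odd << 8)
```

One missing 8-bit operation thus costs two multiplies plus several masks and shifts, which is the sort of overhead a source language would need to expose or a compiler would need to hide.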

Intel's x86 architecture was not the only architecture to include SIMD-like parallel instructions. Sun's VIS, SGI's MDMX, and other multimedia instruction sets had been added to other manufacturers' existing instruction set architectures to support so-called new media applications. These extensions had significant differences in the precision of data and types of instructions supported.

Dietz and Fisher began developing the idea of a well-defined parallel programming model that would allow the programmer to target the model without knowing the specifics of the target architecture. This model would become the basis of Fisher's dissertation. The acronym "SWAR" was coined by Dietz and Fisher one day in Dietz's office in the MSEE building at Purdue University.[6] It refers to this form of parallel processing, to architectures that are designed to natively perform this type of processing, and to the general-purpose programming model described in Fisher's dissertation.

The problem of compiling for these widely varying architectures was discussed in a paper presented at LCPC98.[4]

Some applications of SWAR

SWAR processing has been used in image processing,[7] cryptographic pairings,[8] raster processing,[9] computational fluid dynamics,[10] and communications.[11]

References

[1] Miyaoka, Y.; Choi, J.; Togawa, N.; Yanagisawa, M.; Ohtsuki, T. (2002). An algorithm of hardware unit generation for processor core synthesis with packed SIMD type instructions. Asia-Pacific Conference on Circuits and Systems. Vol. 1. pp. 171–176. doi:10.1109/APCCAS.2002.1114930. hdl:2065/10689.
[2] Flynn, Michael J. (September 1972). "Some Computer Organizations and Their Effectiveness" (PDF). IEEE Transactions on Computers. C-21 (9): 948–960. doi:10.1109/TC.1972.5009071.
[3] Fisher, Randall J. (2003). General-Purpose SIMD Within A Register: Parallel Processing on Consumer Microprocessors (PDF) (Ph.D.). Purdue University.
[4] Fisher, Randall J.; Henry G. Dietz (August 1998). S. Chatterjee; J. F. Prins; L. Carter; J. Ferrante; Z. Li; D. Sehr; P.-C. Yew (eds.). "Compiling for SIMD Within A Register". Proceedings of the 11th International Workshop on Languages and Compilers for Parallel Computing.
[5] Lamport, Leslie (August 1975). "Multiple byte processing with full-word instructions". Communications of the ACM. 18 (8): 471–475. doi:10.1145/360933.360994. S2CID 1593593.
[6] Dietz, Hank. "The Aggregate Magic Algorithms".
[7] Padua, Flavio L. C.; Pereira, Guilherme A. S.; Neto, Jose P. de Queiroz; Campos, Mario F. M.; Fernandes, Antonio O. (January 2001). Improving processing time of large images by instruction level parallelism (PDF). Chilean Computing Week, V Workshop on Parallel and Distributed Systems. Punta Arenas. Archived from the original (PDF) on 2007-02-25.
[8] Grabher, Philipp; Johann Großschädl; Dan Page (2009). "On Software Parallel Implementation of Cryptographic Pairings". Selected Areas in Cryptography. Lecture Notes in Computer Science. Vol. 5381. pp. 35–50. doi:10.1007/978-3-642-04159-4_3. ISBN 978-3-642-04158-7.
[9] Persada, Onil Nazra; Thierry Goubier (12–14 September 2004). "Accelerating Raster Processing with Fine and Coarse Grain Parallelism in GRASS". Proceedings of the FOSS/GRASS Users Conference 2004.
[10] Hauser, Thomas; T. I. Mattox; R. P. LeBeau; H. G. Dietz; P. G. Huang (April 2003). "Code Optimizations for Complex Microprocessors Applied to CFD Software". SIAM Journal on Scientific Computing. 25 (4): 1461–1477. doi:10.1137/S1064827502410530. ISSN 1064-8275.
[11] Spracklen, Lawrence A. (2001). SWAR Systems and Communications Applications (PDF) (Ph.D.). University of Aberdeen.

External links

The Aggregate - SWAR: SIMD Within A Register

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.
