Aligning data on boundaries can help performance. The Intel® compiler attempts to align data on boundaries for you. However, as in all areas of optimization, coding practices can either help or hinder the compiler and can lead to performance problems. Always attempt to optimize using compiler options first. See Optimization Options Summary for more information.
To avoid performance problems, keep the following guidelines in mind; they are grouped by architecture:
IA-32, Intel® EM64T, Intel® Itanium® architectures:
Do not access or create data at large intervals that are separated by exactly 2^n bytes (for example, 1 KB, 2 KB, 4 KB, 16 KB, 32 KB, 64 KB, 128 KB, 512 KB, 1 MB, 2 MB, 4 MB, 8 MB, etc.).
Align data so that memory accesses do not cross cache-line boundaries (for example, 32 bytes, 64 bytes, 128 bytes).
Use _mm_malloc(size, alignment) to force allocated structures to follow the rules above.
Use the Application Binary Interface (ABI) for the Itanium® compiler to ensure that ITP pointers are 16-byte aligned.
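The first three guidelines above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the 64-byte cache-line size, the 1 KB threshold, and the helper names padded_row_bytes and alloc_cache_aligned are hypothetical, and C11 aligned_alloc is used as a portable stand-in for the Intel allocator mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64u  /* assumed cache-line size in bytes; check your target */

/* Return nonzero if n is a power of two (1, 2, 4, ...). */
static int is_pow2(size_t n) { return n != 0 && (n & (n - 1)) == 0; }

/* If a row of row_bytes bytes is a large power of two, add one cache line
   of padding so that consecutive rows do not sit exactly 2^n bytes apart.
   The 1 KB threshold is an assumption taken from the smallest example size. */
static size_t padded_row_bytes(size_t row_bytes)
{
    if (row_bytes >= 1024 && is_pow2(row_bytes))
        return row_bytes + CACHE_LINE;
    return row_bytes;
}

/* Allocate at least `size` bytes starting on a cache-line boundary.
   C11 aligned_alloc is a portable stand-in; release the block with free(). */
static void *alloc_cache_aligned(size_t size)
{
    /* aligned_alloc expects the size to be a multiple of the alignment. */
    size_t rounded = (size + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
    return aligned_alloc(CACHE_LINE, rounded);
}
```

For a matrix whose rows are 4096 bytes long, padded_row_bytes would suggest a row pitch of 4160 bytes, so that walking down a column no longer touches addresses exactly 4 KB apart.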
IA-32 and Intel® EM64T architectures:
Align data to match the sizes of the SIMD or Streaming SIMD Extensions registers.
Use either __assume_aligned() or #pragma vector aligned to tell the compiler that the data is aligned.
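As a rough sketch of the two hints above, the loop below promises the compiler that both arrays are 16-byte aligned, which matches the 128-bit Streaming SIMD Extensions registers. The ASSUME_ALIGNED wrapper and the function name saxpy_aligned are hypothetical; the GCC/Clang fallback through __builtin_assume_aligned is an assumption added only so that the example builds with other compilers.

```c
#include <assert.h>
#include <stddef.h>

/* __assume_aligned is specific to the Intel compiler. The fallbacks are
   assumptions made so that the sketch also builds elsewhere. */
#if defined(__INTEL_COMPILER)
#  define ASSUME_ALIGNED(p, n) __assume_aligned((p), (n))
#elif defined(__GNUC__)
#  define ASSUME_ALIGNED(p, n) ((p) = __builtin_assume_aligned((p), (n)))
#else
#  define ASSUME_ALIGNED(p, n) ((void)0)
#endif

/* Compute dst[i] += a * x[i]. Both arrays must really be 16-byte aligned;
   if the promise is false, the behavior is undefined. */
static void saxpy_aligned(float *dst, const float *x, float a, size_t n)
{
    ASSUME_ALIGNED(dst, 16);
    ASSUME_ALIGNED(x, 16);
#if defined(__INTEL_COMPILER)
#pragma vector aligned
#endif
    for (size_t i = 0; i < n; ++i)
        dst[i] += a * x[i];
}
```

Either hint alone is normally enough; both are shown here only to illustrate where each one goes. The caller is responsible for the alignment itself, for example by declaring the arrays with _Alignas(16) or by obtaining them from an aligned allocator.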
Itanium® architecture:
Avoid using packed structures.
Avoid casting pointers of small data elements to pointers of large data elements.
Do computations on unpacked data, then repack data if necessary, to correctly output the data.
Use the __unaligned keyword on pointers to unaligned data to cause the data to be accessed one byte at a time. This is a slow alternative.
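One portable way to follow the "unpack, compute, repack" advice, and to avoid casting a byte pointer to a pointer to a larger type, is to copy the value into an aligned local with memcpy. This is a minimal sketch; the helper names load_u32 and store_u32 are hypothetical and are not part of any Intel API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read a 32-bit value that may start at any byte offset in buf.
   Casting (buf + offset) to uint32_t* could create a misaligned pointer;
   copying into an aligned local avoids that. */
static uint32_t load_u32(const unsigned char *buf, size_t offset)
{
    uint32_t value;
    memcpy(&value, buf + offset, sizeof value);
    return value;
}

/* Write a 32-bit value back at any byte offset in buf. */
static void store_u32(unsigned char *buf, size_t offset, uint32_t value)
{
    memcpy(buf + offset, &value, sizeof value);
}
```

A typical use is store_u32(buf, 1, load_u32(buf, 1) + 1u): the value is unpacked, the arithmetic runs on an aligned local, and the result is repacked at the same offset.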
In general, keeping data in cache improves performance more than keeping the data aligned. Try to use techniques that conform to the rules listed above.
When structures are packed with the pack pragma, pointers to interior members of the structure can cause unaligned accesses. By default, an unaligned access causes an application on Itanium®-based systems to terminate prematurely. You can work around this limitation by calling the Win32 API function SetErrorMode (for example, with the SEM_NOALIGNMENTFAULTEXCEPT flag) so that the system handles the unaligned access. The condition is then not fatal, only less efficient.
For the Itanium® compiler, packed structures are smaller but much slower to access. You will get a software exception almost every time you access unaligned data.
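The hazard with packed structures can be illustrated with a short sketch. The structure and function names below are hypothetical. The point is that a member accessed through the structure is handled by the compiler, while a plain pointer to that member loses the information that it may be misaligned, so the sketch copies the member to an aligned local before passing its address on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct natural_rec {      /* the compiler inserts padding after tag */
    char     tag;
    uint32_t value;
};

#pragma pack(push, 1)
struct packed_rec {       /* no padding: value starts at byte offset 1 */
    char     tag;
    uint32_t value;
};
#pragma pack(pop)

/* A routine that expects a naturally aligned pointer. */
static void add_one(uint32_t *p) { *p += 1; }

/* Update the packed member without handing out a pointer to it. */
static void bump_value(struct packed_rec *r)
{
    uint32_t tmp = r->value;  /* member access: compiler knows it is packed */
    add_one(&tmp);            /* aligned local, safe to pass by pointer */
    r->value = tmp;           /* repack the result */
}
```

Calling add_one(&r->value) directly is the pattern to avoid, because the callee would dereference a pointer that is not 4-byte aligned.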