Defining a SIMD Interface: Redux

Original Author: Don Olmstead

In a previous article I discussed implementing a SIMD interface which could then be used to build a math library. By abstracting the underlying SIMD architecture the math library can work on multiple platforms without requiring a platform specific rewrite. When moving to a different platform the interface is the only portion that needs to be satisfied and the rest of the mathematics code should just work.

Math library diagram

Math library diagram

Now that all sounds good on paper, but how does it perform? This was the most glaring omission of the previous article, and one I hope to rectify in this article by providing an actual performance comparison.

One of the articles I referenced, GitHub.

SIMD Implementations

The performance tests compare five different SIMD vector implementations. The underlying SIMD code for each implementation is equivalent, they all use the same intrinsics for each function/method. The actual comparison is on how well the compiler generates the assembly for the various levels of indirection.

Implementation #1 – VMath

 
  namespace VMATH
 
  {
 
  	typedef __m128	Vec4;
 
  
 
  	inline Vec4	VLoad(float *pVec)
 
  	{
 
  		return(_mm_load_ps(pVec));
 
  	};
 
  
 
  	inline Vec4 VReplicate(float f)
 
  	{
 
  		return _mm_set_ps1(f);
 
  	}
 
  
 
  	inline Vec4 VLoad(float x, float y, float z, float w)
 
  	{
 
  		return(_mm_set_ps(x, y, z, w));
 
  	}
 
  
 
  	inline Vec4 VAdd(Vec4 va, Vec4 vb)
 
  	{
 
  		return(_mm_add_ps(va, vb));
 
  	};
 
  
 
  	inline Vec4 VSub(Vec4 va, Vec4 vb)
 
  	{
 
  		return(_mm_sub_ps(va, vb));
 
  	};
 
  
 
  	inline Vec4 VMul(Vec4 va, Vec4 vb)
 
  	{
 
  		return(_mm_mul_ps(va, vb));
 
  	};
 
  
 
  	inline Vec4 VDiv(Vec4 va, Vec4 vb)
 
  	{
 
  		return(_mm_div_ps(va, vb));
 
  	};
 
  
 
  	inline void VStore(float *pVec, Vec4 v)
 
  	{
 
  		_mm_store_ps(pVec, v);
 
  	};
 
  
 
  	inline Vec4 VBc(Vec4 v)
 
  	{
 
  		return(_mm_shuffle_ps(v, v, _MM_SHUFFLE(3,3,3,3)));
 
  	}
 
  
 
  	inline Vec4 Dot(Vec4 va, Vec4 vb)
 
  	{
 
  		Vec4 t0 = _mm_mul_ps(va, vb);
 
  		Vec4 t1 = _mm_shuffle_ps(t0, t0, _MM_SHUFFLE(1,0,3,2));
 
  		Vec4 t2 = _mm_add_ps(t0, t1);
 
  		Vec4 t3 = _mm_shuffle_ps(t2, t2, _MM_SHUFFLE(2,3,0,1));
 
  		Vec4 dot = _mm_add_ps(t3, t2);
 
  		return (dot);
 
  	}
 
  
 
  	inline Vec4 Sqrt(Vec4 v)
 
  	{
 
  		return(_mm_sqrt_ps(v));
 
  	}
 
  
 
  	inline void GetX(float *p, Vec4 v)
 
  	{
 
  		_mm_store_ss(p, v);
 
  	}
 
  }
 
  

This is Gustavo’s vector implementation. The SIMD type is declared purely, via a typedef. It also forgoes operator overloading, preferring to use a procedural interface.

Implementation #2 – XNAMath

This is the vector implementation that comes with the DirectX SDK, and can be found in the include directory within the SDK. Like VMath it declares the SIMD type purely. It contains both a procedural interface and operator overloading.

Implementation #3 – VClass

 
  namespace VCLASS
 
  {
 
  	class Vec4
 
  	{
 
  	public:
 
  		inline Vec4() {}
 
  
 
  		inline Vec4(float *pVec)
 
  			: xyzw(_mm_load_ps(pVec))
 
  		{ }
 
  
 
  		inline Vec4(float f)
 
  			: xyzw(_mm_set_ps1(f))
 
  		{ }
 
  
 
  		inline Vec4(const __m128& qword)
 
  			: xyzw(qword)
 
  		{ }
 
  
 
  		inline Vec4(float x, float y, float z, float w)
 
  			: xyzw(_mm_set_ps(x, y, z, w))
 
  		{ }
 
  
 
  		inline Vec4(const Vec4& copy)
 
  			: xyzw(copy.xyzw)
 
  		{ }
 
  
 
  		inline Vec4& operator= (const Vec4& copy)
 
  		{
 
  			xyzw = copy.xyzw;
 
  
 
  			return *this;
 
  		}
 
  
 
  		inline Vec4& operator+=(const Vec4 &rhs)
 
  		{
 
  			xyzw = _mm_add_ps(xyzw, rhs.xyzw);
 
  			return *this;
 
  		}
 
  
 
  		inline Vec4& operator-=(const Vec4 &rhs)
 
  		{
 
  			xyzw = _mm_sub_ps(xyzw, rhs.xyzw);
 
  			return *this;
 
  		}
 
  
 
  		inline Vec4& operator*=(const Vec4 &rhs)
 
  		{
 
  			xyzw = _mm_mul_ps(xyzw, rhs.xyzw);
 
  			return *this;
 
  		}
 
  
 
  		inline Vec4 operator+(const Vec4 &rhs) const
 
  		{
 
  			return Vec4(_mm_add_ps(xyzw, rhs.xyzw));
 
  		}
 
  
 
  		inline Vec4 operator*(const Vec4 &rhs) const
 
  		{
 
  			return Vec4(_mm_mul_ps(xyzw, rhs.xyzw));
 
  		}
 
  
 
  		inline Vec4 operator-(const Vec4 &rhs) const
 
  		{
 
  			return Vec4(_mm_sub_ps(xyzw, rhs.xyzw));
 
  		}
 
  
 
  		inline Vec4 operator/(const Vec4 &rhs) const
 
  		{
 
  			return Vec4(_mm_div_ps(xyzw, rhs.xyzw));
 
  		}
 
  
 
  		inline void Store(float *pVec) const
 
  		{
 
  			_mm_store_ps(pVec, xyzw);
 
  		}
 
  
 
  		inline void Bc()
 
  		{
 
  			xyzw = _mm_shuffle_ps(xyzw, xyzw, _MM_SHUFFLE(3,3,3,3));
 
  		}
 
  
 
  		static inline Vec4 Dot(const Vec4& va, const Vec4& vb)
 
  		{
 
  			const __m128 t0 = _mm_mul_ps(va.xyzw, vb.xyzw);
 
  			const __m128 t1 = _mm_shuffle_ps(t0, t0, _MM_SHUFFLE(1,0,3,2));
 
  			const __m128 t2 = _mm_add_ps(t0, t1);
 
  			const __m128 t3 = _mm_shuffle_ps(t2, t2, _MM_SHUFFLE(2,3,0,1));
 
  
 
  			return Vec4(_mm_add_ps(t3, t2));
 
  		}
 
  
 
  		static inline Vec4 Sqrt(const Vec4& va)
 
  		{
 
  			return Vec4(_mm_sqrt_ps(va.xyzw));
 
  		}
 
  
 
  		static inline void GetX(float *p, const Vec4& v)
 
  		{
 
  			_mm_store_ss(p, v.xyzw);
 
  		}
 
  
 
  	private:
 
  		__m128	xyzw;
 
  	};
 
  }
 
  

This is a vector implementation where the SIMD type is wrapped in a class. It uses operator overloading.

As an aside for those who have read Gustavo’s article the reason the original VClass implementation performed so poorly was because of the copy constructor being generated by Visual Studio. It was not being inlined and resulted in the performance deficiency. The copy constructor was added as well as the assignment operator. The operators were also modified to not create a temporary instance.

Implementation #4 – VClassTypedef

 
  namespace VCLASS_TYPEDEF
 
  {
 
  	///////////////////////////////////////////
 
  	// SIMD TYPEDEF
 
  	///////////////////////////////////////////
 
  
 
  	typedef __m128 simd_type;
 
  	typedef const __m128& simd_param;
 
  
 
  	inline simd_type VLoad(float *pVec)
 
  	{
 
  		return _mm_load_ps(pVec);
 
  	}
 
  
 
  	inline simd_type VLoad(float f)
 
  	{
 
  		return _mm_set_ps1(f);
 
  	}
 
  
 
  	inline simd_type VLoad(float x, float y, float z, float w)
 
  	{
 
  		return _mm_set_ps(x, y, z, w);
 
  	}
 
  
 
  	inline simd_type VBAdd(simd_param va, simd_param vb)
 
  	{
 
  		return _mm_add_ps(va, vb);
 
  	}
 
  
 
  	inline simd_type VBSub(simd_param va, simd_param vb)
 
  	{
 
  		return _mm_sub_ps(va, vb);
 
  	}
 
  
 
  	inline simd_type VBMul(simd_param va, simd_param vb)
 
  	{
 
  		return _mm_mul_ps(va, vb);
 
  	}
 
  
 
  	inline simd_type VBDiv(simd_param va, simd_param vb)
 
  	{
 
  		return _mm_div_ps(va, vb);
 
  	}
 
  
 
  	inline void VBStore(float *pVec, simd_param v)
 
  	{
 
  		return _mm_store_ps(pVec, v);
 
  	}
 
  
 
  	inline simd_type VBc(simd_param v)
 
  	{
 
  		return _mm_shuffle_ps(v, v, _MM_SHUFFLE(3,3,3,3));
 
  	}
 
  
 
  	inline simd_type VBDot(simd_param va, simd_param vb)
 
  	{
 
  		const simd_type t0 = _mm_mul_ps(va, vb);
 
  		const simd_type t1 = _mm_shuffle_ps(t0, t0, _MM_SHUFFLE(1,0,3,2));
 
  		const simd_type t2 = _mm_add_ps(t0, t1);
 
  		const simd_type t3 = _mm_shuffle_ps(t2, t2, _MM_SHUFFLE(2,3,0,1));
 
  
 
  		return _mm_add_ps(t3, t2);
 
  	}
 
  
 
  	inline simd_type VBSqrt(simd_param v)
 
  	{
 
  		return _mm_sqrt_ps(v);
 
  	}
 
  
 
  	inline void VBGetX(float *p, simd_param v)
 
  	{
 
  		_mm_store_ss(p, v);
 
  	}
 
  
 
  	///////////////////////////////////////////
 
  	// Vec4
 
  	///////////////////////////////////////////
 
  
 
  	template <typename Real, typename Rep>
 
  	class vector4
 
  	{
 
  	public:
 
  		inline vector4() { }
 
  
 
  		inline vector4(Real *pVec)
 
  			: _rep(VLoad(pVec))
 
  		{ }
 
  
 
  		inline vector4(Real f)
 
  			: _rep(VLoad(f))
 
  		{ }
 
  
 
  		inline vector4(Real x, Real y, Real z, Real w)
 
  			: _rep(VLoad(x, y, z, w))
 
  		{ }
 
  
 
  		inline vector4(const simd_type& rep)
 
  			: _rep(rep)
 
  		{ }
 
  
 
  		inline vector4(const vector4& copy)
 
  			: _rep(copy._rep)
 
  		{ }
 
  
 
  		inline vector4& operator= (const vector4& copy)
 
  		{
 
  			_rep = copy._rep;
 
  
 
  			return *this;
 
  		}
 
  
 
  		inline vector4& operator+= (const vector4& rhs)
 
  		{
 
  			_rep = VBAdd(_rep, rhs._rep);
 
  
 
  			return *this;
 
  		}
 
  
 
  		inline vector4& operator-= (const vector4& rhs)
 
  		{
 
  			_rep = VBSub(_rep, rhs._rep);
 
  
 
  			return *this;
 
  		}
 
  
 
  		inline vector4& operator*= (const vector4& rhs)
 
  		{
 
  			_rep = VBMul(_rep, rhs._rep);
 
  
 
  			return *this;
 
  		}
 
  
 
  		inline vector4 operator+ (const vector4& rhs) const
 
  		{
 
  			return vector4(VBAdd(_rep, rhs._rep));
 
  		}
 
  
 
  		inline vector4 operator- (const vector4& rhs) const
 
  		{
 
  			return vector4(VBSub(_rep, rhs._rep));
 
  		}
 
  
 
  		inline vector4 operator* (const vector4& rhs) const
 
  		{
 
  			return vector4(VBMul(_rep, rhs._rep));
 
  		}
 
  
 
  		inline vector4 operator/ (const vector4& rhs) const
 
  		{
 
  			return vector4(VBDiv(_rep, rhs._rep));
 
  		}
 
  
 
  		inline void Store(Real *pVec) const
 
  		{
 
  			VBStore(pVec, _rep);
 
  		}
 
  
 
  		inline void Bc()
 
  		{
 
  			VBc(_rep);
 
  		}
 
  
 
  		static inline vector4 Dot(const vector4& va, const vector4& vb)
 
  		{
 
  			return vector4(VBDot(va._rep, vb._rep));
 
  		}
 
  
 
  		static inline vector4 Sqrt(const vector4& va)
 
  		{
 
  			return vector4(VBSqrt(va._rep));
 
  		}
 
  
 
  		static inline void GetX(Real *p, const vector4& v)
 
  		{
 
  			VBGetX(p, v._rep);
 
  		}
 
  
 
  	private:
 
  		Rep _rep;
 
  	} ;
 
  
 
  	typedef vector4<float, simd_type> Vec4;
 
  }
 
  

This is a vector implementation where the SIMD type is passed as a template parameter for the class. The SIMD type being passed is declared purely using a procedural interface.

Implementation #5 – VClassSIMDType

 
  namespace VCLASS_SIMDTYPE
 
  {
 
  	///////////////////////////////////////////
 
  	// SIMD CLASS (Same as VCLASS)
 
  	///////////////////////////////////////////
 
  
 
  	class simd_type
 
  	{
 
  	public:
 
  		inline simd_type() {}
 
  
 
  		inline simd_type(float *pVec)
 
  			: xyzw(_mm_load_ps(pVec))
 
  		{ }
 
  
 
  		inline simd_type(float f)
 
  			: xyzw(_mm_set_ps1(f))
 
  		{ }
 
  
 
  		inline simd_type(const __m128& qword)
 
  			: xyzw(qword)
 
  		{ }
 
  
 
  		inline simd_type(float x, float y, float z, float w)
 
  			: xyzw(_mm_set_ps(x, y, z, w))
 
  		{ }
 
  
 
  		inline simd_type(const simd_type& copy)
 
  			: xyzw(copy.xyzw)
 
  		{ }
 
  
 
  		inline simd_type& operator= (const simd_type& copy)
 
  		{
 
  			xyzw = copy.xyzw;
 
  
 
  			return *this;
 
  		}
 
  
 
  		inline simd_type& operator+=(const simd_type &rhs)
 
  		{
 
  			xyzw = _mm_add_ps(xyzw, rhs.xyzw);
 
  			return *this;
 
  		}
 
  
 
  		inline simd_type& operator-=(const simd_type &rhs)
 
  		{
 
  			xyzw = _mm_sub_ps(xyzw, rhs.xyzw);
 
  			return *this;
 
  		}
 
  
 
  		inline simd_type& operator*=(const simd_type &rhs)
 
  		{
 
  			xyzw = _mm_mul_ps(xyzw, rhs.xyzw);
 
  			return *this;
 
  		}
 
  
 
  		inline simd_type operator+(const simd_type &rhs) const
 
  		{
 
  			return simd_type(_mm_add_ps(xyzw, rhs.xyzw));
 
  		}
 
  
 
  		inline simd_type operator*(const simd_type &rhs) const
 
  		{
 
  			return simd_type(_mm_mul_ps(xyzw, rhs.xyzw));
 
  		}
 
  
 
  		inline simd_type operator-(const simd_type &rhs) const
 
  		{
 
  			return simd_type(_mm_sub_ps(xyzw, rhs.xyzw));
 
  		}
 
  
 
  		inline simd_type operator/(const simd_type &rhs) const
 
  		{
 
  			return simd_type(_mm_div_ps(xyzw, rhs.xyzw));
 
  		}
 
  
 
  		inline void Store(float *pVec) const
 
  		{
 
  			_mm_store_ps(pVec, xyzw);
 
  		}
 
  
 
  		inline void Bc()
 
  		{
 
  			xyzw = _mm_shuffle_ps(xyzw, xyzw, _MM_SHUFFLE(3,3,3,3));
 
  		}
 
  
 
  		static inline simd_type Dot(const simd_type& va, const simd_type& vb)
 
  		{
 
  			const __m128 t0 = _mm_mul_ps(va.xyzw, vb.xyzw);
 
  			const __m128 t1 = _mm_shuffle_ps(t0, t0, _MM_SHUFFLE(1,0,3,2));
 
  			const __m128 t2 = _mm_add_ps(t0, t1);
 
  			const __m128 t3 = _mm_shuffle_ps(t2, t2, _MM_SHUFFLE(2,3,0,1));
 
  				
 
  			return simd_type(_mm_add_ps(t3, t2));
 
  		}
 
  
 
  		static inline simd_type Sqrt(const simd_type& va)
 
  		{
 
  			return simd_type(_mm_sqrt_ps(va.xyzw));
 
  		}
 
  
 
  		static inline void GetX(float *p, const simd_type& v)
 
  		{
 
  			_mm_store_ss(p, v.xyzw);
 
  		}
 
  
 
  	private:
 
  		__m128	xyzw;
 
  	};
 
  
 
  	///////////////////////////////////////////
 
  	// Vec4
 
  	///////////////////////////////////////////
 
  
 
  	template <typename Real, typename Rep>
 
  	class vector4
 
  	{
 
  	public:
 
  		inline vector4() { }
 
  
 
  		inline vector4(Real *pVec)
 
  			: _rep(pVec)
 
  		{ }
 
  
 
  		inline vector4(Real f)
 
  			: _rep(f)
 
  		{ }
 
  
 
  		inline vector4(Real x, Real y, Real z, Real w)
 
  			: _rep(x, y, z, w)
 
  		{ }
 
  
 
  		inline vector4(const simd_type& rep)
 
  			: _rep(rep)
 
  		{ }
 
  
 
  		inline vector4(const vector4& copy)
 
  			: _rep(copy._rep)
 
  		{ }
 
  
 
  		inline vector4& operator= (const vector4& copy)
 
  		{
 
  			_rep = copy._rep;
 
  
 
  			return *this;
 
  		}
 
  
 
  		inline vector4& operator+= (const vector4& rhs)
 
  		{
 
  			_rep += rhs._rep;
 
  
 
  			return *this;
 
  		}
 
  
 
  		inline vector4& operator-= (const vector4& rhs)
 
  		{
 
  			_rep -= rhs._rep;
 
  
 
  			return *this;
 
  		}
 
  
 
  		inline vector4& operator*= (const vector4& rhs)
 
  		{
 
  			_rep *= rhs._rep;
 
  
 
  			return *this;
 
  		}
 
  
 
  		inline vector4 operator+ (const vector4& rhs) const
 
  		{
 
  			return vector4(_rep + rhs._rep);
 
  		}
 
  
 
  		inline vector4 operator- (const vector4& rhs) const
 
  		{
 
  			return vector4(_rep - rhs._rep);
 
  		}
 
  
 
  		inline vector4 operator* (const vector4& rhs) const
 
  		{
 
  			return vector4(_rep * rhs._rep);
 
  		}
 
  
 
  		inline vector4 operator/ (const vector4& rhs) const
 
  		{
 
  			return vector4(_rep / rhs._rep);
 
  		}
 
  
 
  		inline void Store(Real *pVec) const
 
  		{
 
  			_rep.Store(pVec);
 
  		}
 
  
 
  		inline void Bc()
 
  		{
 
  			_rep.Bc();
 
  		}
 
  
 
  		static inline vector4 Dot(const vector4& va, const vector4& vb)
 
  		{
 
  			return vector4(simd_type::Dot(va._rep, vb._rep));
 
  		}
 
  
 
  		static inline vector4 Sqrt(const vector4& va)
 
  		{
 
  			return vector4(simd_type::Sqrt(va._rep));
 
  		}
 
  
 
  		static inline void GetX(Real *p, const vector4& v)
 
  		{
 
  			simd_type::GetX(p, v._rep);
 
  		}
 
  
 
  	private:
 
  		Rep _rep;
 
  	} ;
 
  
 
  	typedef vector4<float, simd_type> Vec4;
 
  }
 
  

This is a vector implementation where the SIMD type is passed as a template parameter for the class. The SIMD type is wrapped in a class.

Test Setup

The application was built using Visual Studio 2010 utilizing the June 2010 DirectX SDK. The project file was converted from the original Visual Studio 2008 file with all the options kept intact.

The computer running the application is a MacBook Pro running Windows 7 with a 2.2 GHz Core 2 Duo with a Nvidia GeForce 8600M GT. The number of calculations done per frame is set to the maximum per each test.

Results

Cloth test settings

Cloth test settings

Implementation Average Time (ms)
VMath 77.66
XNAMath 75.85
VClass 74.90
VClassTypedef 74.89
VClassSIMDType 74.98

For the cloth simulation the VClass and VClassTypedef fight for the #1 slot. Followed by the VClassSIMDType. The XNAMath implementation isn’t too far behind at the #4 slot. Coming up in the rear is the VMath library.

3-band equalizer settings

3-band equalizer settings

Implementation Average Time (ms)
VMath 18.15
XNAMath 25.34
VClass 24.28
VClassTypedef 23.55
VClassSIMDType 25.25

For the 3-band equalizer the VMath implementation trounces the competition. Once again the VClass and VClassTypedef keep exchanging the #2 and #3 slots. Next up is the VClassSIMDType, and at the end is XNAMath.

Conclusion

The VClass implementations perform better than the XNAMath implementation for both tests, and better than the VMath implementation in the cloth simulation. Because of this the SIMD interface appears to be a viable approach.

I plan on creating a math library using this technique, with the SIMD type implemented using a procedural interface. I’ve started a repository at GitHub and will be developing it into a full fledged solution.

Feel free to follow its development!