Loads and stores for SIMD vector types

Submitted by Matthias on Tue, 08/09/2016 - 08:21

Loads and stores are the main bridge between the vector and scalar worlds. (Gather, scatter, and subscripting are the small and slow bridges.) Using a load instruction you can efficiently copy W_T scalar objects (stored contiguously in memory) to one vector register, where W_T is the number of elements of type T that fit into one vector. Store instructions do the reverse. This is a really simple concept, so why bother with a discussion?

API design issue

Loads and stores in standard C++ are never explicit. The compiler decides when to copy a value from register to memory or the other way. (Back in the day there was the register keyword, but it has been useless for a long time.) What is visible in C++ is copies, and an explicit SIMD load/store is, in principle, a copy. So, would it be possible to model the load/store API after the copy API in standard C++?

 1  std::vector<float> data(1024);  // allocates memory for 1024 floats
 2  data[0] = 1.f;      // copies the constant 1 to memory (mov instruction)
 3  float x = data[0];  // copies the first float element from memory to variable x (mov instruction)
 5  datapar<float> v;
 6  v = data[0];
 7  v = datapar<float>(&data[0]);
 8  v.copy_from(&data[0]);
 9  v = data[datapar<float>::index_type(0)];
10  data[datapar<float>::index_type(0)] = v + 1.f;

The first three lines above show standard C++ code you know and immediately understand. Line 5 declares a SIMD vector object (using the API of the current C++ proposal for SIMD vector types - substitute with Vc::Vector if you use the Vc library).

Line 6 doesn't do a vector load. It is a straightforward reuse of the scalar copy syntax seen in line 3. I don't think anyone wants to argue that this syntax should result in a load of W_float float objects starting from &data[0]. The right hand side of the assignment clearly shows a subscript to a single element. Magically turning that into a larger load through assignment is confusing, especially since assigning a scalar of the element type to a vector already has a meaning: it broadcasts that one value into all elements of the vector. So after line 6 all elements of v hold the value of data[0].
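To make the broadcast behaviour concrete, here is a minimal sketch. ToyVector is a hypothetical stand-in written for this example (a plain array of four lanes), not the datapar or Vc implementation, and the width of 4 is an assumption.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a SIMD vector type. A real implementation would
// map to a vector register instead of a std::array.
template <typename T, std::size_t W = 4>
struct ToyVector {
    std::array<T, W> lanes{};

    ToyVector() = default;

    // Broadcast: a single scalar is replicated into every lane.
    ToyVector(T scalar) { lanes.fill(scalar); }

    T operator[](std::size_t i) const { return lanes[i]; }
};

// Mirrors line 6 of the listing: the right hand side is one scalar, so the
// assignment broadcasts it instead of loading W contiguous values.
inline ToyVector<float> broadcast_first(const std::vector<float>& data) {
    assert(!data.empty());
    ToyVector<float> v;
    v = data[0];  // implicit float -> ToyVector<float> conversion, then copy
    return v;
}
```

With data holding 1, 2, 3, 4, every lane of the result is 1; the values 2, 3, and 4 are never read.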

Line 7 shows a load constructor. The constructor call takes a pointer to the element type and implicitly advances the pointer W_float - 1 times to access all elements in memory for the vector load. This is the classic load API used in Vc and also carried into the datapar proposal. I honestly don't like it too much: the expression on the right hand side implicitly accesses memory that is not addressed explicitly in the code. Still, the implicit size is obvious enough, so I'm OK with having this load syntax.

Line 8 shows the same load expressed as a member function. This has been going back and forth a bit in the proposal. Should it be a non-member, member, or static member function? Note that there's precedent for load and store functions in std::atomic. However, the meaning of load and store is exactly reversed for atomic: atomic's load reads the value out of the object and store writes into it, whereas a vector load writes into the vector object. Ugh. Because of this naming confusion, and because the names load and store are not 100% intuitive for newcomers, I looked for alternatives and am currently going with copy_from(const T *) for loads and copy_to(T *) for stores. (My criticism of the syntax in line 7 applies just the same.)
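The load constructor and the copy_from/copy_to pair can be sketched as follows. Only the names copy_from and copy_to come from the text above; ToyVector, its width of 4, and the helper increment_first_chunk are assumptions made for this illustration, and the scalar loops stand in for what would be single vector load/store instructions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a SIMD vector type with W lanes.
template <typename T, std::size_t W = 4>
struct ToyVector {
    std::array<T, W> lanes{};

    ToyVector() = default;

    // Load constructor (line 7 of the listing): reads mem[0] .. mem[W-1].
    explicit ToyVector(const T* mem) { copy_from(mem); }

    // Load (line 8): the caller must guarantee W readable elements at mem.
    void copy_from(const T* mem) {
        for (std::size_t i = 0; i < W; ++i) lanes[i] = mem[i];
    }

    // Store: writes all W lanes to mem[0] .. mem[W-1].
    void copy_to(T* mem) const {
        for (std::size_t i = 0; i < W; ++i) mem[i] = lanes[i];
    }
};

// Adds 1 to the first four floats of data: load, per-lane add, store.
inline void increment_first_chunk(std::vector<float>& data) {
    assert(data.size() >= 4);
    ToyVector<float> v;
    v.copy_from(data.data());
    for (float& lane : v.lanes) lane += 1.f;
    v.copy_to(data.data());
}
```

Note how neither call site mentions the number of elements touched; the width is implied by the vector type, which is exactly the implicitness criticized above.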

Line 9 shows an idea that I have not gotten around to testing yet. The index passed to the subscript operator of std::vector is not of integral type but of a class type that carries the chunk size for the load (or store) in its type. Thus, the subscript operator has all the information it needs to execute the load to datapar<float>. Consider, though, that the same subscript expression should also work on the left hand side to enable stores (line 10). In that case the subscript operator itself may not execute the load but can only return a "subscript expression", which is either converted to datapar<float> or assigned from a datapar<float>. I'd rather avoid such expression types, because we don't have template deduction guidelines (e.g. decay rules) for such types yet.
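A rough sketch of such a "subscript expression" follows. Every name in it (ToyVector, ChunkIndex, ChunkRef, ToyBuffer, demo_round_trip) is hypothetical, the width of 4 is assumed, and a custom container is used because std::vector has no such overload.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kWidth = 4;  // assumed vector width

// Hypothetical stand-in for datapar<float>.
struct ToyVector {
    std::array<float, kWidth> lanes{};
};

// Index type: being a distinct class type is what selects the chunked overload.
struct ChunkIndex {
    std::size_t first;
};

// The "subscript expression": creating it performs no memory access.
class ChunkRef {
    float* mem_;
public:
    explicit ChunkRef(float* mem) : mem_(mem) {}

    // Conversion to a vector executes the load.
    operator ToyVector() const {
        ToyVector v;
        for (std::size_t i = 0; i < kWidth; ++i) v.lanes[i] = mem_[i];
        return v;
    }

    // Assignment from a vector executes the store.
    ChunkRef& operator=(const ToyVector& v) {
        for (std::size_t i = 0; i < kWidth; ++i) mem_[i] = v.lanes[i];
        return *this;
    }
};

// Hypothetical container offering both scalar and chunked subscripting.
class ToyBuffer {
    std::vector<float> storage_;
public:
    explicit ToyBuffer(std::size_t n) : storage_(n) {}
    float& operator[](std::size_t i) { return storage_[i]; }
    ChunkRef operator[](ChunkIndex i) {
        assert(i.first + kWidth <= storage_.size());
        return ChunkRef(&storage_[i.first]);
    }
};

// Loads elements 0..3, adds 1 to each lane, stores the result to 4..7.
inline float demo_round_trip() {
    ToyBuffer buf(8);
    for (std::size_t i = 0; i < 8; ++i) buf[i] = static_cast<float>(i);
    ToyVector v = buf[ChunkIndex{0}];         // load via conversion
    for (float& lane : v.lanes) lane += 1.f;
    buf[ChunkIndex{4}] = v;                   // store via assignment
    return buf[7];                            // scalar subscript
}
```

The sketch also exposes the decay problem: writing auto x = buf[ChunkIndex{0}]; deduces ChunkRef rather than ToyVector, so x would still refer to the buffer's memory instead of holding a loaded copy.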


Finding an intuitive, no-surprises syntax is not trivial. The resulting code should be concise and easy to understand (even for newcomers who have just started maintaining some existing vector code). My work on Vc has taught me that there are a lot of options, but none of them is perfect. The best way out is a more high-level syntax that hides loads and stores.

