Traditionally, the semiconductor industry was able to increase the clock speed of computing devices while doubling the number of transistors with each new generation. Unfortunately, due to power constraints, clock frequency growth stalled in the past decade, and major microprocessor manufacturers shifted their product portfolios toward many-core architectures. Modern GPUs go even further, currently integrating hundreds of simple cores within a reasonable power budget. As a result, software companies have been struggling to adapt algorithms conceived for serial execution to these new devices.

This parallelization task is not as easy as one might think. Humans naturally tend to think about their actions in a sequential fashion. Are you capable of quickly envisioning hundreds of simultaneous actions (some of them even with interdependencies) in parallel? Most of the best minds and programmers in the world aren't, and that is reasonable. In our daily lives, we tend to be overwhelmed at work if we have to deal with nine or ten important things at the same time. Put simply, our brains are not designed for multitasking: the more tasks we try to do in parallel (for example, answering an email while taking a phone call), the more likely we are to make an error due to insufficient attention.

That is why writing optimized and scalable parallel code that targets devices such as GPUs is a long and tedious process. Code for these architectures is usually error-prone and difficult to debug. It is not uncommon to spend months writing efficient parallel code for a particular kernel operation whose serial version could be programmed and tested in a couple of days.
In this new era of parallel computing, a company that wants to release a software product that fully exploits the capabilities of the latest parallel hardware on the market has to invest more money and resources than were required in the past. We at Herta Security adopted parallelization in our products from the very beginning, and soon realized that biometric algorithms can greatly benefit from fine-grained parallelization and data-parallel vector computations. Multiple stages of the traditional biometric pipeline, such as the feature extraction process, are inherently parallel. However, the complexity of the memory access patterns of some features makes strategies such as data reuse, caching, and prefetching challenging. With the advent of GPU computing, the programmer now has to deal with millions of concurrent threads, synchronize them, and efficiently exploit the heterogeneous cache architecture of these devices in order to reduce latency as much as possible. Even though these tasks are time-consuming from a research and development perspective, they enable us to deliver high-quality products that scale nicely with the latest advances in hardware while providing an unbeatable user experience.
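To make the "inherently parallel" claim concrete, here is a minimal CUDA sketch of a feature extraction kernel: one thread per pixel computes an 8-neighbor local binary pattern (LBP) code, a texture descriptor commonly used in face analysis. This is an illustrative example, not our actual production code; the kernel name, tile size, and the choice of LBP itself are assumptions made for the sake of the sketch. The shared-memory tile with a one-pixel halo is the kind of explicit data reuse mentioned above.

// lbp_demo.cu -- illustrative sketch only: a data-parallel feature
// extraction kernel (8-neighbor local binary patterns) with a
// shared-memory tile for data reuse. Names and sizes are hypothetical.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define TILE 16  // block edge; each block stages a (TILE+2)^2 haloed tile

__global__ void lbpKernel(const unsigned char *img, unsigned char *out,
                          int width, int height)
{
    __shared__ unsigned char tile[TILE + 2][TILE + 2];

    // Top-left corner of this block's haloed tile in image coordinates.
    int x0 = blockIdx.x * TILE - 1;
    int y0 = blockIdx.y * TILE - 1;

    // Cooperative load: the TILE*TILE threads of the block stage the
    // (TILE+2)^2 tile into shared memory, clamping reads at the borders.
    for (int i = threadIdx.y * TILE + threadIdx.x;
         i < (TILE + 2) * (TILE + 2); i += TILE * TILE) {
        int ty = i / (TILE + 2), tx = i % (TILE + 2);
        int gx = min(max(x0 + tx, 0), width - 1);
        int gy = min(max(y0 + ty, 0), height - 1);
        tile[ty][tx] = img[gy * width + gx];
    }
    __syncthreads();  // every load must finish before any neighbor read

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x >= width || y >= height) return;

    // One thread per pixel: compare the center against its 8 neighbors
    // and pack the results into a single byte (the LBP code).
    int lx = threadIdx.x + 1, ly = threadIdx.y + 1;
    unsigned char c = tile[ly][lx], code = 0;
    code |= (tile[ly - 1][lx - 1] >= c) << 7;
    code |= (tile[ly - 1][lx    ] >= c) << 6;
    code |= (tile[ly - 1][lx + 1] >= c) << 5;
    code |= (tile[ly    ][lx + 1] >= c) << 4;
    code |= (tile[ly + 1][lx + 1] >= c) << 3;
    code |= (tile[ly + 1][lx    ] >= c) << 2;
    code |= (tile[ly + 1][lx - 1] >= c) << 1;
    code |= (tile[ly    ][lx - 1] >= c) << 0;
    out[y * width + x] = code;
}

int main()
{
    const int w = 640, h = 480;
    unsigned char *dImg, *dOut;
    cudaMalloc(&dImg, w * h);
    cudaMalloc(&dOut, w * h);

    // A synthetic gradient image stands in for a real face crop.
    unsigned char *hImg = (unsigned char *)malloc(w * h);
    for (int i = 0; i < w * h; ++i) hImg[i] = (unsigned char)(i % 251);
    cudaMemcpy(dImg, hImg, w * h, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((w + TILE - 1) / TILE, (h + TILE - 1) / TILE);
    lbpKernel<<<grid, block>>>(dImg, dOut, w, h);
    cudaDeviceSynchronize();

    unsigned char sample;
    cudaMemcpy(&sample, dOut + (h / 2) * w + w / 2, 1, cudaMemcpyDeviceToHost);
    printf("LBP code at image center: %u\n", sample);

    cudaFree(dImg); cudaFree(dOut); free(hImg);
    return 0;
}

Without the shared-memory tile, each pixel would be fetched from global memory up to nine times by neighboring threads; staging the haloed tile once turns those redundant fetches into fast on-chip reads. This is exactly the sort of reuse that complex memory access patterns can make hard to achieve for some features.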