AI and ML, and their use in facial recognition systems, have received a lot of attention recently. Much of the negative commentary has highlighted the lack of accuracy that is sometimes seen. If you’ve tried this out, you quickly realize that an effective system with reasonable accuracy requires multiple models. For example, in the setup I describe here there are different models for Face Detection (finding faces in an image) and Face Matching (measuring how close two images of a face are).
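To make the Face Matching step a little more concrete, here is a minimal sketch of one common way to score "how close" two faces are. It assumes each detected face has already been converted into a fixed-length embedding vector by some model; the function names, the cosine measure, and the 0.8 threshold are illustrative assumptions, not details of the setup described here.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def is_match(a, b, threshold=0.8):
    """Treat two faces as a match when similarity reaches the (assumed) threshold."""
    return cosine_similarity(a, b) >= threshold
```

In practice the threshold would be tuned against labelled data, since it trades false matches against missed matches.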
Also, the models I’m using are good for frontal face images – for a sophisticated system there would likely be other models that could deal with images from different angles (side, up, down, etc.) or that could interpolate an image from one angle to match an image from another. If you want to do this in real time, you have only a short window in which to execute each of the models. The result is also likely to be an aggregate across the results of many models – giving a probabilistic rather than absolute determination.
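As a rough illustration of what aggregating across models can look like, the sketch below combines per-model match scores into a single confidence value using a weighted average. The model names, the weights, and the choice of a weighted average are assumptions made for the example; a real system might use a very different combination rule.

```python
def aggregate_scores(scores, weights=None):
    """Combine per-model scores in [0, 1] into one weighted-average confidence.

    scores  -- dict mapping a (hypothetical) model name to its match score
    weights -- optional dict mapping model name to a non-negative weight;
               every model is weighted equally when omitted
    """
    if not scores:
        raise ValueError("need at least one model score")
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(scores[name] * weights[name] for name in scores)
    return weighted_sum / total_weight


# Example: trust the frontal-face model three times as much as the profile model.
confidence = aggregate_scores(
    {"frontal": 0.9, "profile": 0.5},
    weights={"frontal": 3.0, "profile": 1.0},
)
```

The output is a confidence between 0 and 1 rather than a yes/no answer, which is what makes the final determination probabilistic.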
Bringing all this together quickly leads to a need for a streaming architecture – particularly when you consider that a practical application is likely to be distributed in nature. You might want to take a look at the Rendezvous Architecture described in Machine Learning Logistics by Ted Dunning and Ellen Friedman for an overview of how this might work: https://mapr.com/ebook/machine-learning-logistics/
A possible scenario is depicted here:
So how does MapR simplify this setup?
For the real-time communication aspect, MapR Streams provides an implementation based on the Kafka API. Could I not use Apache Kafka for that? Yes, but we’re asking about simplification here, so let’s come back to that.
For execution of the Face Detection model, the MapR REST API and Kafka Connect module provide methods for direct injection of the image into the communication system. For widely distributed systems, MapR’s Global Data Fabric and MapR Edge provide ways of easily implementing the system on a global scale. In terms of simplification, the key thing is that nothing changes when moving from one machine, to running on a cluster, to running in the cloud, to running across clusters or across clouds – it all stays the same.
The multi-model execution and result aggregation for the Rendezvous implementation are very well served by MapR Streams, as described in the referenced text.
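For readers who have not seen the Rendezvous idea before, here is a deliberately simplified, in-memory sketch of the selection step: several models score the same request, and for each request we keep the result from the most preferred model that answered within a deadline. There are no streams here; the class, the field names, the model names, and the deadline are all hypothetical, and the book describes the actual design in far more detail.

```python
from dataclasses import dataclass


@dataclass
class ModelResult:
    request_id: str
    model: str         # hypothetical model name
    score: float       # match score produced by that model
    latency_ms: float  # how long after the request the result arrived


def rendezvous(results, preference, deadline_ms):
    """Pick one result per request.

    For each request, choose the result from the most preferred model
    (earliest in `preference`) among those that arrived within `deadline_ms`.
    Results from unknown models or arriving after the deadline are ignored.
    """
    rank = {model: i for i, model in enumerate(preference)}
    chosen = {}
    for r in results:
        if r.latency_ms > deadline_ms or r.model not in rank:
            continue
        best = chosen.get(r.request_id)
        if best is None or rank[r.model] < rank[best.model]:
            chosen[r.request_id] = r
    return chosen


results = [
    ModelResult("a", "primary", 0.95, 120.0),  # too slow for a 100 ms deadline
    ModelResult("a", "fallback", 0.80, 20.0),
    ModelResult("b", "primary", 0.60, 50.0),
    ModelResult("b", "fallback", 0.55, 10.0),
]
picked = rendezvous(results, preference=["primary", "fallback"], deadline_ms=100.0)
```

Here request "a" falls back to the faster model because the preferred one missed the deadline, while request "b" uses the preferred model. In a streaming deployment the results would arrive on a topic rather than in a list.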
The image database and metadata database are well served by MapR-DB (Enterprise NoSQL) and/or the MapR File System (MapR-FS). Could I not use Apache Hadoop for this? Well, yes and no. At a certain scale this would be fine; however, at some point you’ll run into issues with the number of files that Hadoop can handle, particularly if you are storing, say, the results of each match for later analysis. Are there ways around this?
Possibly, but we’re trying for real-time and again we’re talking about this in the context of simplification – why not just use a file system that doesn’t have the problem?
Lastly, let’s consider portability. We want our code to be able to run on a laptop for easy development and on a cluster, so we want to write it such that it could execute anywhere. The changes required to move from local to cluster execution are – none! No changes are required, because of MapR’s API compliance: code that runs in a standard computer environment also runs on MapR. This makes portability easy and opens access to all the libraries and repositories of shared code available – they just work (consider all that exists for Python, R, TensorFlow, …). What that meant here was that I was able to pull down example code from the web and just follow it without having to change it – vastly simplifying the development and giving me “execute anywhere” for free.
What else is simplified?
So far we’ve been looking at functionality; however, in a production world we need to include operational aspects such as security and auditing. Coming back to the question of Apache Kafka: yes, the system could be built with that. It could also have used HBase, Cassandra, or MongoDB as the NoSQL database, and HDFS as the file system. However, in such an architecture I have to consider operational aspects in multiple places – for example, how to implement security across Kafka, Cassandra, and HDFS. Can it be done? Yes. Is it simple? No. If I want to operate at scale I also have to deploy dedicated infrastructure for each functional area and monitor each separately. With MapR I deploy one thing – MapR. I define security at the data level and that applies to my files, my database records, and my streaming data – much simpler.