Large-scale Machine Learning
Machine Learning, the ability of computers to learn without being explicitly programmed, has been around for a long time and is well understood. What is different is the relatively recent emergence of general-purpose tools, such as Apache Spark, that enable the processing of very large datasets. Additionally, data scientists can now collaborate and rapidly deliver high-impact, high-value business assets without worrying about managing compute resources, security, or data replication.
A classic example of Machine Learning is detecting fraudulent login attempts. Instead of explicitly specifying every rule and every possible fraud case, the machine learns by being presented with thousands of labeled examples of fraudulent and normal activity. From these examples a model is created that can then be used to detect irregular activity. The advantage is that once the initial model has been created, it can continuously evolve (what is known as online learning) and self-improve as it is presented with new examples of fraud. And with larger datasets, these models can become more accurate.
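The idea of learning a fraud detector from labeled examples, and updating it online, can be sketched in a few lines. Everything below is illustrative: the features (failed attempts per minute, distance from the last login) and the synthetic data are hypothetical, and a simple perceptron stands in for whatever model a production system would use.

```python
# Hypothetical features per login event: [failed_attempts_per_minute, km_from_last_login]
# Label 1 = fraudulent, 0 = normal. All data here is synthetic, for illustration only.
examples = [([0.1, 0.2], 0), ([0.2, 0.1], 0),
            ([0.9, 0.8], 1), ([0.8, 0.9], 1)]

weights = [0.0, 0.0]
bias = 0.0
lr = 0.1  # learning rate

def predict(x):
    """Flag a login as fraudulent (1) or normal (0)."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else 0

def update(x, label):
    """One online-learning step: adjust the model only when it errs.
    In production, this same step runs continuously as newly labeled
    fraud cases arrive, so the model keeps evolving."""
    global bias
    error = label - predict(x)
    if error:
        for i, xi in enumerate(x):
            weights[i] += lr * error * xi
        bias += lr * error

# Initial training: several passes over the labeled examples.
for _ in range(20):
    for x, label in examples:
        update(x, label)

print(predict([0.85, 0.9]))  # → 1 (flagged as likely fraud)
```

The same `update` call that fits the initial model also performs the online updates, which is why such a system can keep improving after deployment.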
A subset of Machine Learning, Deep Learning, uses structures loosely inspired by the neural connections of the human brain. It has been generating a lot of buzz thanks to excellent results on specific narrow tasks, such as cancer pre-screening (Watson), retina scanning (Google), auto-labeling of images (Google, Baidu), and language processing (Alexa, Siri, Google Now). Again, a large part of the success of Deep Learning is due to recent improvements in graphics cards (Nvidia, AMD) and to the release of new and very popular frameworks such as TensorFlow, MXNet, Torch, Theano, DeepLearning4J, CaffeOnSpark, and most recently, TensorFlowOnSpark.
IBM, Google, Baidu, Facebook, Microsoft, and Amazon have shown excellent results in applying Deep Learning technology. Much of their success can be attributed to large in-house teams of AI experts, heavy investment in compute resources, long training times (weeks or months), and millions of high-quality labeled training examples. It is also worth noting that scaling the human-intensive process of creating all these training examples has been made possible by crowdsourcing services from Amazon and CrowdFlower, among others.
Fortunately, raw processing costs are falling rapidly, lowering the barrier to entry for everyone, and many pre-trained (downloadable) deep neural network components are available for reuse, allowing companies to significantly shorten model training time and focus on optimizing their networks for their specific use cases: for example, recognizing and classifying objects in aerial photography to predict crop yield and highlight areas that require attention, or recognizing street signs, people, and other vehicles from onboard cameras in autonomous cars.
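The pattern of reusing a pre-trained component looks roughly like this: keep a downloaded feature extractor frozen and fit only a small task-specific piece on your own labeled data. The sketch below is entirely hypothetical; a stub function stands in for downloaded network layers, the aerial-patch data is synthetic, and a nearest-centroid rule stands in for a real trainable head.

```python
# Hypothetical sketch of transfer learning with a frozen pre-trained component.

def pretrained_features(x):
    """Stands in for downloaded, frozen network layers. Here it just
    returns simple summary statistics (mean and range) of the raw input."""
    return (sum(x) / len(x), max(x) - min(x))

# Synthetic aerial-patch examples: label 1 = "needs attention", 0 = "healthy".
train = [([0.9, 0.8, 0.7], 1), ([0.8, 0.9, 0.9], 1),
         ([0.2, 0.1, 0.2], 0), ([0.1, 0.2, 0.1], 0)]

# "Training" the small head: a nearest-centroid rule over the frozen
# features. Only this part is fitted; the extractor is never modified.
centroids = {}
for label in (0, 1):
    feats = [pretrained_features(x) for x, y in train if y == label]
    centroids[label] = tuple(sum(c) / len(c) for c in zip(*feats))

def classify(x):
    """Classify a new patch by its closest class centroid in feature space."""
    f = pretrained_features(x)
    def dist(c):
        return sum((fi - ci) ** 2 for fi, ci in zip(f, c))
    return min(centroids, key=lambda label: dist(centroids[label]))

print(classify([0.85, 0.8, 0.9]))  # → 1 (needs attention)
```

Because the expensive part (the extractor) is reused rather than retrained, only a handful of task-specific labeled examples are needed, which is exactly why pre-trained components shorten training time so dramatically.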
New initiatives, such as those at OpenAI, attempt to counter the need for such large labeled datasets by working on generative models. These models “are forced to discover and efficiently internalize the essence of the data in order to generate it.” One interesting outcome of this work will be the ability to generate plausible images and videos.
Check out our white paper, Four Tenets of Machine Learning and Data Science, to learn how data science plays a vital role in unlocking the potential of enterprise data.