Data Science: Production AI Systems
This article is not a “build your own AI model in two lines of code” tutorial. It offers initial guidance on Data Science best practices for reliable production AI systems.
Training a machine learning model on your personal computer is not enough to make it ready for the public and for daily users. Over the years, large industries have learned that any system, AI or otherwise, is accepted in production only after it passes several iterations of validation that establish it as a reliable real-life system.
Arguably, it is no longer necessary to reinvent the wheel or build the whole system in-house. Instead, you can use existing, sustainable and reliable platforms that speed up the process while maintaining the quality of what you are building.
This article is a building block for upcoming articles that will give the reader a closer look at building AI models and improving systems in production.
What is Databricks?
Databricks is a data and Data Science platform that has caught the attention of aspiring, intermediate and experienced Data Scientists alike. Its ability to take a project from Proof-of-Concept to robust production in minimal time is a strong value proposition.
Furthermore, building robust machine learning infrastructure in-house can take considerable time, which is why startups and top tech companies are starting to use Databricks on a daily basis to automate their Machine Learning Operations (MLOps).
Databricks was founded by the original creators of Apache Spark. Spark itself is an open-source project maintained under the Apache Software Foundation, a community-led organization known for its large contribution to the field of Big Data. Companies such as Google, Microsoft and Amazon use its open-source projects in their own products, which explains its impact on the field. The main advantage of Databricks is that it is built on Spark (a distributed Big Data processing engine), which gives it an edge in performance for Big Data processing and manipulation.
Has programming become redundant? Will those platforms replace programmers with simple drag-and-drop ML models?
The answer to both questions is a firm no. Data Science platforms exist to make the life of a Data Engineer or Data Scientist easier by letting them focus on what matters and stop reinventing the wheel, which is a well-known challenge for both roles.
Databricks, MLOps and Cloud Services
Databricks focuses on delivering MLOps-optimized pipelines built around a data-driven methodology (improving the data rather than the machine learning algorithm) throughout the machine learning model life-cycle.
A data-driven approach is a methodology advocated by leading machine learning scientists to simplify machine learning pipelines, especially when it comes to improving models. It can be summarized as follows:
- A machine learning model can be improved by improving the quality of the data.
- Always explore the data before looking at the architecture of your model.
- A state-of-the-art architecture cannot always solve your problem.
- A single extra (highly correlated) feature can improve your model noticeably.
- Remember: garbage in, garbage out. Your data matters.
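As a minimal sketch of the points above, the following Python snippet uses purely synthetic data and hypothetical names (`signal`, `unrelated`, `target`), none of which come from a real project. It first explores how strongly each candidate feature correlates with the target, then compares a no-feature baseline with a simple one-feature linear model. It is an illustration of the idea under these assumptions, not a Databricks workflow.

```python
import random
from math import sqrt


def mean(values):
    """Arithmetic mean of a sequence of numbers."""
    return sum(values) / len(values)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for a single feature."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    slope = cov / var_x
    return slope, my - slope * mx


def mse(predictions, ys):
    """Mean squared error between predictions and true values."""
    return mean([(p - y) ** 2 for p, y in zip(predictions, ys)])


# Synthetic data (assumption): the target depends on `signal` only.
rng = random.Random(0)
n = 200
signal = [rng.uniform(0, 10) for _ in range(n)]
unrelated = [rng.uniform(0, 10) for _ in range(n)]
target = [3.0 * s + rng.gauss(0, 1.0) for s in signal]

# Step 1: explore the data before touching any model architecture.
corr_signal = pearson(signal, target)
corr_unrelated = pearson(unrelated, target)

# Step 2: compare a baseline that always predicts the mean of the target
# with a model that uses the single highly correlated feature.
baseline_mse = mse([mean(target)] * n, target)
slope, intercept = fit_line(signal, target)
feature_mse = mse([slope * s + intercept for s in signal], target)

print(f"correlation(signal, target)    = {corr_signal:.3f}")
print(f"correlation(unrelated, target) = {corr_unrelated:.3f}")
print(f"baseline MSE (no features)     = {baseline_mse:.2f}")
print(f"MSE with the correlated feature = {feature_mse:.2f}")
```

The comparison is deliberately simple: the model itself never changes, only the data it is given, which is the core of the data-driven argument.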
Advantages and Disadvantages of Out-of-the-box ML Tools
Many practitioners prefer to build their tech in-house with out-of-the-box tools for several reasons, including control over hardware cost. Yet out-of-the-box ML tools have both advantages and disadvantages, summarized below.
Advantages:
- Quick to apply
- Readable documentation
- Installing the native tool and its dependencies
- Adopted by most small-medium businesses
- Supported by a large community of developers
Disadvantages:
- Opaque/limited understanding of the tool.
- Black-box modeling.
- Difficult to reproduce.
- Mostly not production friendly unless it is used with a wrapper.
Databricks seems to solve most (if not all) of these issues, resulting in more production-ready models that fit your Data Science project.
For instance, the Databricks documentation is among the most transparent available, and good docs lead to game-changing products.
It is quite rare to need Stack Overflow for help, since Databricks offers a reliable forum that answers both common and uncommon questions, which has been truly beneficial for many Data Engineers and Data Scientists.
Such platforms are usually not known for being cost-friendly, yet following best practices can reduce the monthly bill by as much as 70%.
Databricks partners with Microsoft Azure, Amazon Web Services (AWS) and Google Cloud Platform (GCP) to manage your cloud computations effectively. Everything runs in the cloud, with nothing to manage locally on your machine.
The choice of cloud provider is up to you and depends on your DevOps team and their preferences. Comparing the cloud providers is like comparing apples to oranges: each has features that the others do not.
The possibilities of the Data Science field in production are significantly greater than they were five years ago.
The growth of data sources and the increasing accessibility of data tools will be a major factor in improving people's lives in the near future, which may be closer than we think, for example once Tesla solves the self-driving car problem.
From another perspective, such platforms are great to have on board, because the knowledge gained from working with them is extremely valuable in today's tech market.
In the next article in this series we'll explore the common dependencies and the model lifecycle in more detail. Stay tuned for Part 2!