The IBM Data and Analytics Reference Architecture defines possible components at an abstract level. The purpose of this article is to help you select the necessary components and map concrete technologies to them. The article supplements the “Lightweight IBM Cloud Garage Method for data science” article.
The method is highly iterative, so any finding along the way can trigger a change to earlier architectural decisions. However, there are no wrong architectural decisions, because each decision reflects all the knowledge available at the time it was made. That said, it is important to record the reason for making each decision. The following figure shows the IBM Reference Architecture for Cloud Analytics.
The following sections provide guidelines for each component, explaining which technology to choose and whether the component needs to be included.
The data source is an internal or external source of data that includes relational databases; web pages; CSV, JSON, or text files; and video and audio data.
Architectural decision guidelines
Regarding the data source, there is usually little to evaluate because, in most cases, the type and structure of a data source are already defined and controlled by stakeholders. However, if there is any control over the process, the following principles should be considered:
What is the delivery point?
Corporate data resides primarily in relational databases that serve OLTP systems. Directly accessing these systems, even in read-only mode, is usually not a good practice, because ETL processes run SQL queries against them, which can degrade their performance. An exception is IBM Db2 Workload Manager, which uses intelligent planning and prioritization mechanisms to let OLAP and ETL workloads run in parallel with an OLTP workload without degrading the performance of OLTP queries. You can read more about IBM Db2 Workload Manager.
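One common way to keep ETL load on an OLTP system low is incremental, batched extraction: pull only rows newer than a stored watermark, in small batches, rather than scanning whole tables. The sketch below illustrates the pattern with Python's built-in `sqlite3` as a stand-in for the corporate database (a real Db2 pipeline would use a Db2 client driver instead); the `orders` table and column names are hypothetical.

```python
import sqlite3

def extract_increment(conn, last_watermark, batch_size=500):
    """Pull only rows newer than the last watermark, in small batches,
    to limit the load placed on the OLTP system."""
    cur = conn.cursor()
    cur.execute(
        "SELECT id, payload, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    )
    while True:
        rows = cur.fetchmany(batch_size)
        if not rows:
            break
        yield rows

# Demo with an in-memory stand-in for the corporate database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, payload TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "a", "2023-01-01"), (2, "b", "2023-01-02"), (3, "c", "2023-01-03")],
)
# Only rows updated after the watermark are extracted.
extracted = [row for batch in extract_increment(conn, "2023-01-01") for row in batch]
```

After each successful run, the ETL job would persist the largest `updated_at` value it saw as the next watermark, so repeated runs never rescan old rows.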
Does real-time data need to be considered?
Real-time data exists in a variety of formats and delivery methods. The most prominent include MQTT telemetry and sensor data (for example, data from the IBM Watson IoT Platform), a simple HTTP REST endpoint that needs to be polled, or a TCP or UDP socket. If no real-time downstream processing is required, this data can simply be staged (for example, in IBM Cloud Object Storage). If real-time downstream processing is required, read the section on streaming analytics later in this article.
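When no real-time processing is needed, staging can be as simple as appending raw events as newline-delimited JSON and periodically uploading the file to object storage. The sketch below shows only the append step; the sensor readings are fabricated, and the `StringIO` buffer stands in for an object-storage upload stream (a real pipeline would receive events over MQTT or by polling a REST endpoint, and would write via an object-storage client).

```python
import io
import json

def stage_events(events, sink):
    """Append raw events as newline-delimited JSON so they can later be
    uploaded to object storage as-is, with no real-time processing."""
    count = 0
    for event in events:
        sink.write(json.dumps(event, sort_keys=True) + "\n")
        count += 1
    return count

# Simulated sensor readings, as they might arrive from an IoT platform.
readings = [
    {"device": "pump-1", "temp_c": 71.5, "ts": "2023-05-01T12:00:00Z"},
    {"device": "pump-1", "temp_c": 72.1, "ts": "2023-05-01T12:00:10Z"},
]
buffer = io.StringIO()  # stand-in for an object-storage upload stream
staged = stage_events(readings, buffer)
```

Keeping the events untouched at this stage preserves the option to reprocess them later with different logic, which fits the iterative nature of the method.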
Cloud-based solutions tend to extend the corporate data model. Therefore, it may be necessary to continuously transfer subsets of corporate data to the cloud or access it in real time through an API gateway over a VPN.
Architectural decision guidelines
Moving corporate data to the cloud can be expensive, so it should only be considered if there is a real need. For example, if user data is managed in the cloud, it can be sufficient to store an anonymized primary key. If transferring corporate data to the cloud is unavoidable, privacy concerns and regulations must be addressed. In this case, there are two forms of access:
Batch synchronization from a corporate data center to the cloud.
Real-time access to subsets of data through a VPN and an API gateway.
The Secure Gateway service allows cloud applications to access specified hosts and ports in a private data center over an outbound connection, so no inbound access from outside is required. For more information, see the Secure Gateway service on IBM Cloud.