Architectural thinking in the Wild West of data science


Acknowledgment: Thanks to Kevin Turner for reviewing this document several times and for his valuable contributions.

Data scientists tend to use ad hoc approaches. We see many creative script hacks in different programming languages and different machine learning frameworks, scattered everywhere, on servers and on client machines alike. I am not complaining about the way such professionals work. I have often found myself working in the same extremely creative way while getting something meaningful done.

Having complete freedom of choice among programming languages, tools and frameworks improves creative thinking and evolution. However, at the end of the day, data scientists must get their assets fully into shape before delivery, because there can be many pitfalls if they don’t. I describe these pitfalls below.

Technological blindness

Among data scientists, it is commonly assumed that, from a functional point of view, technology does not matter much, because the models and algorithms used are defined mathematically. On this view, the only source of truth is the mathematical definition of the algorithm. For non-functional requirements, however, this view does not hold. For example, the availability and cost of specialists for a particular programming language and technology vary widely. When it comes to maintenance, the technology chosen has a major impact on the success of a project.

Data scientists tend to use the programming languages and frameworks in which they are most skilled. This starts with open source technologies such as R and RStudio, with its unmanageable universe of packages and libraries and its inelegant, hard-to-maintain syntax. The runner-up is Python, with its well-structured and well-organized syntax and the associated Pandas and Scikit-Learn frameworks. At the other end of the tool spectrum are fully visual open source “low-code / no-code” tools, such as Node-RED, KNIME, RapidMiner and Weka, and commercial offerings, such as SPSS Modeler.

“The technology I know best” is suitable for a proof of concept (PoC), a hackathon (an event that brings together programmers and other professionals in the field of software development) or a start-up style project. However, when it comes to industry- and enterprise-scale projects, some architectural guidance on the use of technology should be available, in whatever form it takes.

