The vision of AI/ML has not been realised in many organisations because data scientists work with tools that are not generally used by the rest of the development community. For example, data scientists often work with Python and its many libraries, or with Julia or R and RStudio, so their models are not easily accessible to development teams that use C#, Java, Scala and so on. This has created an impedance between the data scientists who create models and the data engineers who must get them into production; indeed, it is claimed that 60-80% of models developed by data scientists are never actually used. The references below explain that technical issues are one of the main factors. Human factors, such as the usual communication problems that occur in software development (building management's understanding of the problem, coordinating different IT teams), are of course extremely important as well.
https://analyticsindiamag.com/why-majority-of-data-science-projects-never-make-it-to-production/
The ability to iterate on models is key, as explained here:
https://venturebeat.com/2019/07/19/why-do-87-of-data-science-projects-never-make-it-into-production/
https://towardsdatascience.com/why-ai-models-rarely-make-it-to-production-2f3935f73e27
Thus, if the business can select its own data sets, time frames and parameters for models, and if the generation and serving of models is fully automated, the business should be able to realise the true value of ML. We have developed a framework that allows business users to do exactly this. Models are generated using AutoML, so a data scientist is usually not required to specify or test specific models, and the best models are found as the data changes. This is achieved using a highly generic framework that abstracts the generation of data, the loading of data and AutoML model building from data at high speed. Models that have been generated by data scientists or migrated from other systems can still be used. Also key to this architecture is that it does not use a CI/CD process to build each model. Typically in the Python world a CI/CD pipeline is used: data scientists create a model and the model must then go through the pipeline steps to reach production. This process is generally slow and does not scale well, especially if containerisation is used to build or serve the models. The framework also allows any required technology to be used for clients to connect to the model servers and for hosting the model servers themselves. The model servers can run in Azure App Services, Azure Functions, containers, kernels and so on, or in AWS EC2, Lambda or other cloud providers' services, and the system supports message bus, REST, gRPC, WebSocket and named pipe clients.
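To illustrate the transport- and host-agnostic design described above, here is a minimal sketch with hypothetical names (IModelScorer, ITransportAdapter and AutoMlScorer are illustrative, not the framework's actual API): the scoring contract is defined once, and any transport or hosting technology plugs in around it.

```csharp
// Minimal sketch (hypothetical names): the model-serving core is defined once,
// and transport adapters (REST, gRPC, message bus, named pipes, ...) and hosts
// (Azure Functions, App Services, containers, AWS Lambda/EC2, ...) plug in around it.
using System.Threading.Tasks;

public record ScoringRequest(string ModelName, double[] Features);
public record ScoringResult(string ModelName, double Prediction);

// The core contract: how a model is scored, independent of transport or host.
public interface IModelScorer
{
    Task<ScoringResult> ScoreAsync(ScoringRequest request);
}

// A transport adapter only translates its protocol to the core contract.
public interface ITransportAdapter
{
    Task StartAsync(IModelScorer scorer);
}

// A trivial in-process scorer standing in for an AutoML-generated model.
public sealed class AutoMlScorer : IModelScorer
{
    public Task<ScoringResult> ScoreAsync(ScoringRequest request)
    {
        // A real implementation would load the latest AutoML model for request.ModelName.
        double prediction = 0.0;
        foreach (var f in request.Features) prediction += f; // placeholder logic only
        return Task.FromResult(new ScoringResult(request.ModelName, prediction));
    }
}
```

Because each transport is just an adapter over the same contract, adding a new client technology or host does not require rebuilding or redeploying the models themselves.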
Also, for business to really take advantage of AI/ML there may need to be a change in the way data is generated. We have experience with the latest modern data warehouse concepts. These avoid the traditional ETL approach so that data can be partitioned and processed on the fly as real-time streams, allowing decisions on the data to be made in real time. Any static data can also be turned into a real-time stream using Reactive Extensions from Microsoft. We developed a reusable framework for data processing at Transport NSW that achieves these requirements using Azure Data Factory, and it allows the concepts of Domain Driven Design to be applied to data processing. SSRS can still be used to generate reports, but reports can also be generated from Azure Data Factory, so SSIS is not required.
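As a small example of turning static data into a stream with Reactive Extensions, the sketch below (illustrative data only, using the System.Reactive NuGet package) exposes an already-loaded data set as a push-based observable so the same downstream operators can handle batch and real-time sources alike.

```csharp
// Static data exposed as a push-based stream via Reactive Extensions.
using System;
using System.Collections.Generic;
using System.Reactive.Linq;

public static class StaticToStreamDemo
{
    public static void Main()
    {
        // Static data, e.g. rows read from a batch extract (values are illustrative).
        IEnumerable<(string Sensor, double Value)> staticRows = new[]
        {
            ("sensor-1", 12.5),
            ("sensor-2", 19.1),
            ("sensor-1", 13.0)
        };

        // ToObservable turns the pull-based IEnumerable into a push-based IObservable,
        // so it can be merged, filtered and windowed like any live stream.
        IObservable<(string Sensor, double Value)> stream = staticRows.ToObservable();

        stream
            .Where(r => r.Value > 12.8) // same operators as a live feed
            .Subscribe(r => Console.WriteLine($"{r.Sensor}: {r.Value}"));
    }
}
```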
This link explains the move from ETL to real-time stream processing:
https://qconsf.com/sf2016/system/files/keynotes-slides/etl_is_dead_long-live_streams.pdf
If the Domain Driven Design approach is used, events arise naturally in the business domain code. No extra streaming framework is required; when such frameworks are adopted for their own sake, events are sometimes created artificially just to add event processing because it is seen as the new thing. Frameworks like Kafka make it easier to implement distributed event handling, but they can be plugged into the Domain Driven Design architecture without affecting the business code or tying it to the event handling framework (a sketch of this pluggable approach follows below), and any other event handling framework can be plugged in as required. New tools for data warehouse management are being developed all the time, such as the new Synapse tool from Microsoft. Refer to:
https://docs.microsoft.com/en-us/azure/architecture/solution-ideas/articles/modern-data-warehouse
This tool could provide the hub for event stream data generated by batch ETL operations and for streamed data generated by domain events fired from domain business objects.
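The following sketch illustrates the pluggable approach mentioned above. The names (IDomainEventPublisher, TripCompleted, Trip) are hypothetical: the domain code raises events through a small abstraction, so Kafka or any other event framework is only an adapter behind the interface and never leaks into the business code.

```csharp
// Minimal sketch (hypothetical names): domain events raised through an abstraction,
// with the event handling framework kept behind an adapter.
using System;
using System.Threading.Tasks;

public record TripCompleted(Guid TripId, DateTime CompletedAtUtc); // illustrative domain event

public interface IDomainEventPublisher
{
    Task PublishAsync<TEvent>(TEvent domainEvent);
}

// Domain code depends only on the abstraction, never on a specific framework.
public sealed class Trip
{
    private readonly IDomainEventPublisher _events;
    public Trip(IDomainEventPublisher events) => _events = events;

    public Task CompleteAsync(Guid tripId) =>
        _events.PublishAsync(new TripCompleted(tripId, DateTime.UtcNow));
}

// One possible adapter; a Kafka-backed implementation could replace it
// without changing the Trip class at all.
public sealed class ConsoleEventPublisher : IDomainEventPublisher
{
    public Task PublishAsync<TEvent>(TEvent domainEvent)
    {
        Console.WriteLine($"Event published: {domainEvent}");
        return Task.CompletedTask;
    }
}
```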
This also removes the impedance between business requirements for data and the current data architecture. Hence today it is common to hear of a role such as the “Data Gopher”, whose job is to map the existing data structures into the structures required by the business.
If data is generated by observing domain events, any custom business data requirement can be satisfied on the fly, making this role redundant. The domain events can be observed to create denormalised views of the data, or the normalised data models that may be required for offline back-end data analysis, as sketched below.
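Here is a minimal sketch of that idea, with hypothetical names (FareCharged, RevenueByRouteView): a read-model builder subscribes to domain events and maintains a denormalised view on the fly, instead of a person re-mapping the source schema for every new business request.

```csharp
// Minimal sketch (hypothetical names): a denormalised view maintained from domain events.
using System;
using System.Collections.Generic;

public record FareCharged(Guid TripId, string Route, decimal Amount); // illustrative event

// Denormalised view: revenue per route, ready for dashboards or offline analysis.
public sealed class RevenueByRouteView
{
    private readonly Dictionary<string, decimal> _revenue = new();

    // Called for every FareCharged event observed on the stream.
    public void Apply(FareCharged e)
    {
        _revenue.TryGetValue(e.Route, out var total);
        _revenue[e.Route] = total + e.Amount;
    }

    public IReadOnlyDictionary<string, decimal> Snapshot() => _revenue;
}

public static class ProjectionDemo
{
    public static void Main()
    {
        var view = new RevenueByRouteView();

        // In a real system these events would arrive from the event stream.
        view.Apply(new FareCharged(Guid.NewGuid(), "T1", 4.50m));
        view.Apply(new FareCharged(Guid.NewGuid(), "T1", 4.50m));
        view.Apply(new FareCharged(Guid.NewGuid(), "T4", 6.20m));

        foreach (var (route, total) in view.Snapshot())
            Console.WriteLine($"{route}: {total}");
    }
}
```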
Visualisation of model data and results is also key to the success of an AI/ML solution. We have the capability to provide visualisations with many open source tools as well as with tools such as Power BI.
However, in all of the above, the business always needs to be taken on a journey whose value they can see. Too often today, IT experts do not explain the options available to business stakeholders, particularly in regard to their non-functional requirements. If the business only wants to make incremental changes, the ideas above do not need to be delivered as a big bang.