
Machine Learning Operations

With the wave of new artificial intelligence systems released over the last couple of years, companies are accelerating the production of data science models. What used to be confined to companies with niche data strategies is now diffusing across the wider ecosystem. Companies are investing in platforms, processes and methodologies, feature stores, machine learning operations (MLOps) systems, and other tools to increase productivity and deployment rates. MLOps systems monitor the status of machine learning models and detect whether they are still predicting accurately; if they are not, the models may need to be retrained on new data. I will go into more detail on how companies use machine learning in another post. Many of these capabilities come from agencies or external vendors that provide a platform on which other, more nascent companies can train and deploy machine learning models, although some organisations are now developing their own platforms. Although automation goes a long way towards bolstering productivity and broadening participation in data science, the most notable achievement here is that companies are able to reuse and build on existing methodologies, datasets, and even entire models.
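As a rough sketch of that monitoring-and-retraining loop (not any particular MLOps product's API), the snippet below assumes a scikit-learn-style model; the threshold and the `retrain_fn` supplied by the team are hypothetical placeholders:

```python
from sklearn.metrics import accuracy_score

ACCURACY_THRESHOLD = 0.85  # hypothetical acceptance bar set by the team

def monitor_model(model, X_recent, y_recent, retrain_fn):
    """Check accuracy on recent labelled data; retrain when it degrades."""
    accuracy = accuracy_score(y_recent, model.predict(X_recent))
    if accuracy < ACCURACY_THRESHOLD:
        # Performance has drifted below the bar: refit on fresh data
        model = retrain_fn(X_recent, y_recent)
    return model, accuracy
```

Real platforms wrap this loop in scheduling, alerting, and model versioning, but the core check-then-retrain pattern is the same.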

Distributed SQL Engines

Databases and relational database management systems (RDBMSs) allow companies to leverage their data and efficiently create, read, update and delete (CRUD) it. A study by McKinsey in 2022 showed that among financial-services leaders, only 13 percent had half or more of their IT footprint in the cloud. As companies use more and more data, the processes that allow the data to be used efficiently will no doubt need to be optimised. This optimisation may come in the form of distributed SQL engines, which allow processing and retrieval of data at scale.

Distributed SQL engines derive from the concept of parallelisation, a high-performance computing method whereby a program or system breaks a problem into smaller pieces to be solved simultaneously and independently by discrete computing resources. They increase compute power by linking multiple database servers under the hood of one RDBMS. This allows companies to prioritise the scalability, reliability, and usability of the orchestrating ecosystem while maintaining the robust, ACID-compliant transactions of a traditional RDBMS. Under this hood sit (a) virtualisation and (b) an abstraction layer. The abstraction layer allows users and developers to interact with virtualised resources without needing to understand the intricacies of the underlying hardware; crucially in this context, it gives data scientists and analysts access across disparate data sources. This means you can query relational and non-relational data together in a scale-out fashion for better query performance. As such, “distributed” refers not just to the query itself but also to storage and compute.

Companies wanting to carry out analytics on terabytes of data will opt for technologies using distributed query engines to optimise performance. These engines are primarily used for intensive OLAP queries and avoid the fragility and inconsistent performance seen when non-distributed engines are pushed to that scale. Early, well-known technologies such as Hadoop use parallel processing engines to query and analyse data stored on the Hadoop Distributed File System (HDFS). Many subsequent distributed query engines are based on Hadoop and are used for batch-style data processing. Each engine varies, with some breaking SQL queries into multiple stages and storing intermediate results on disk, and others taking advantage of in-memory processing and caching. Viewed holistically, however, these technologies are based on MapReduce, a framework for processing parallelisable problems across large datasets using many computers. Collectively, these computers are referred to as a cluster (when all nodes sit on the same local network and use similar hardware) or a grid (when the nodes are spread across geographically and administratively distributed systems and use more heterogeneous hardware). Processing can occur on data stored either in a filesystem (unstructured), such as HDFS, or in a database (structured). MapReduce can take advantage of data locality, processing data near where it is stored in order to minimise communication overhead.
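To make the MapReduce pattern concrete, here is a minimal, single-machine sketch of the map and reduce phases using word counting, the canonical example. A real engine would distribute the map tasks across cluster nodes and shuffle the intermediate pairs between them; the function names here are illustrative only.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit an intermediate (key, value) pair for every word
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle/reduce: group intermediate pairs by key and aggregate
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

documents = ["the quick brown fox", "the lazy dog", "the fox"]
intermediate = [pair for doc in documents for pair in map_phase(doc)]
print(reduce_phase(intermediate))  # {'the': 3, 'quick': 1, 'fox': 2, ...}
```

In a distributed setting, the shuffle step (grouping pairs by key across machines) is what dominates network traffic, which is exactly why the data locality mentioned above matters.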
Some of the best-known technology companies leverage significant compute power to provide products and services, taking advantage of the cloud and distributed architectures, including the separation of compute and storage. Netflix, for example, is known for having microservices that use different kinds of databases based on each database's capabilities. Some of these microservices rely on data sources such as Hadoop or AWS S3, or draw on several sources at once. Crucially, a distributed SQL query engine allows data from a variety of these sources to be queried within a single query. Example query engines include Presto, Apache Drill, and Apache Spark; companies such as Netflix and Uber use them to drive analysis across disparate datasets. The number of companies creating new and innovative services and products to suit business needs continues to increase, and this will no doubt make it easier for companies of any size, in any industry, to leverage their data.
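As a concrete illustration of querying disparate sources in one statement, here is a hedged sketch using PySpark's SQL interface, one of the engines mentioned above. The bucket path, table names, and connection details are hypothetical placeholders, and a real deployment would also need the relevant storage and JDBC connectors on the classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("federated-query").getOrCreate()

# Event logs stored as Parquet files in object storage (path is hypothetical)
events = spark.read.parquet("s3a://example-bucket/events/")
events.createOrReplaceTempView("events")

# Customer records held in a relational database (connection details are hypothetical)
customers = (spark.read.format("jdbc")
             .option("url", "jdbc:postgresql://db.example.com/crm")
             .option("dbtable", "customers")
             .option("user", "analyst")
             .option("password", "...")
             .load())
customers.createOrReplaceTempView("customers")

# One SQL query spanning both sources; Spark plans and executes it in parallel
spark.sql("""
    SELECT c.segment, COUNT(*) AS event_count
    FROM events e
    JOIN customers c ON e.customer_id = c.id
    GROUP BY c.segment
""").show()
```

The point is that the analyst writes ordinary SQL; the engine handles where each table actually lives and how the work is spread across the cluster.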

ML and Business Strategy

There are no doubt numerous techniques by which data can be collected, and even more numerous data sources that store the data itself. However, once data engineers have extracted, transformed and loaded this data into a data store, it needs to be analysed. Many companies are choosing to use artificial intelligence and machine learning (ML) to glean insights from their data. The insights supplement decision-making or, in some cases, completely automate it. Healthcare offers one example: in the triage process, machine learning is used to categorise and prioritise incoming patients. Providing nursing staff with this technology helps speed up diagnostic and therapeutic assessments and also enables remote triaging. The machine learning models most frequently used here are based on XGBoost and deep neural networks; other models, such as logistic regression, have been considered, but performance decreases.

One of the main industries using ML is financial services, where it is increasingly used to aid decision-making and automate processes. According to the Bank of England, 72% of UK financial services firms are developing machine learning applications, with the insurance and banking sectors setting the pace for absolute usage. Despite this, firms are aware of constraints on machine learning deployment arising from a lack of clarity in the Prudential Regulation Authority’s and Financial Conduct Authority’s existing regulations. An important point to raise here is that, despite this lack of clarity, regulatory authorities need to ensure that regulation supports safe and responsible adoption of machine learning. Within financial services, the main business areas in which ML is deployed are customer engagement and risk management and compliance. Customer engagement has the highest percentage of post-deployment applications and appears at various stages throughout the customer lifecycle. Many other business areas use ML too, but these are not within the scope of this blog.

Firms deploy ML applications for a variety of uses, and implementations can be internal, external, or a combination of the two. Applications can be implemented externally by third-party vendors, or co-implemented with third-party vendors providing any of the services in the vertical stack: cloud storage, the ML models themselves, software packages, or input data. These applications need to be monitored and tested to validate performance. The most common method is outcome monitoring against a benchmark, in which the performance and outputs of the model are compared against historical data. The historical data used varies by business area and industry but is most commonly profitability, customer satisfaction, or pricing. The next most common method is data-quality validation, which is used to detect errors, biases and risks in the data.

Once model performance is validated and models are deployed, the results can be used to aid decision-making. In financial services, ML models are most commonly used in pricing and underwriting, with complex models used for credit pricing and insurance underwriting. These models are at an advanced stage of deployment and are used in expected-loss accounting, claims accounting, monitoring for insider trading or market manipulation, directing queries within customer interfaces, compliance/AML/KYC checks, trading strategy and execution, and payments authorisation.
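As a rough illustration of outcome monitoring against a benchmark, the sketch below compares a model's realised outcomes for a review period against a historical figure and flags the model for review when it falls too far short. The metric, tolerance, and numbers are hypothetical and would differ by business area.

```python
import pandas as pd

TOLERANCE = 0.05  # hypothetical: maximum acceptable relative shortfall vs benchmark

def outcome_monitoring(outcomes: pd.Series, historical_benchmark: float) -> dict:
    """Compare realised outcomes under the model against a historical benchmark.

    `outcomes` holds the model-driven results for the review period
    (e.g. profitability per customer); `historical_benchmark` is the
    equivalent figure computed from past data.
    """
    current = outcomes.mean()
    shortfall = (historical_benchmark - current) / historical_benchmark
    return {
        "current": current,
        "benchmark": historical_benchmark,
        "flagged": shortfall > TOLERANCE,  # escalate for review if it lags the benchmark
    }

# Example: monthly profitability per customer under an ML pricing model
outcomes = pd.Series([102.0, 97.5, 110.2, 95.8])
print(outcome_monitoring(outcomes, historical_benchmark=105.0))
```

Data-quality validation would sit alongside a check like this, catching problems in the inputs before they show up as degraded outcomes.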
ML has a variety of use cases and is being adopted by more and more firms, not just in financial services but also in other industries such as healthcare. Most businesses use it to aid customer engagement, classifying, predicting or optimising based on customer behaviour, and feed the results into strategic decision-making. Businesses will no doubt continue to deploy more complex ML models, and as these processes become more common and better understood, they will trickle down to smaller firms.

Background

This website and its various pages highlight the expertise I have gained working as a data and research specialist in the finance industry. My previous experience spans multiple industries, including healthcare big data and bioinformatics.