Follow Up from our Predictive Analytics Webinar

During an October ACFCS webinar, I shared RDC's experience using predictive analytics in our KYC efforts, and what we have learned about model governance. In short, I argued that more efficient and effective AML programs can be built with probabilistic models for predictive analytics, and that a model governance framework with internal controls makes us better stewards and fiduciaries to all stakeholders. After the presentation, the audience asked highly pertinent questions, most of which are the genesis of this post.

The questions can be grouped into three broad categories: data, model governance, and education/professional development.

The data questions

During my presentation, an in-webinar poll showed that many in the audience were thinking about how to deal with data silos, unstructured data such as media articles, a lack of labelled data, and the use of synthetic data. Each topic deserves a lengthy article of its own, but let me briefly share my thoughts here. We have experienced all of the above when it comes to data. Both Bureau van Dijk and RDC hold data about corporations, but each possesses different information that could be complementary if there were a single source of truth. How do we arrive at that single version of the truth? The answer is Master Data Management (MDM), which helps ensure that one version of the truth is maintained. Technology plays a role, but people and processes also need to be in place: assigning data stewards and implementing internal controls specifically for MDM can help ensure success.
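To make the MDM idea concrete, here is a deliberately simple sketch of a "golden record" merge with a survivorship rule. The field names, source priorities and LEI value are all hypothetical for illustration, not RDC's or Bureau van Dijk's actual schema:

```python
# Toy Master Data Management sketch: build one golden record from
# several source records. For each field, keep the value from the
# highest-priority source that actually has it (a survivorship rule).

def merge_records(records, field_priority):
    golden = {}
    # Sort sources so higher-priority (lower number) sources win.
    ordered = sorted(records, key=lambda r: field_priority[r["source"]])
    for record in ordered:
        for field, value in record.items():
            if field != "source" and value and field not in golden:
                golden[field] = value
    return golden

# Two hypothetical, complementary views of the same company.
source_a = {"source": "A", "name": "Acme Corp", "country": "US", "lei": None}
source_b = {"source": "B", "name": "ACME Corporation", "country": None,
            "lei": "5493001KJTIIGC8Y1R12"}  # made-up identifier

golden = merge_records([source_a, source_b], field_priority={"A": 0, "B": 1})
print(golden)  # {'name': 'Acme Corp', 'country': 'US', 'lei': '5493001KJTIIGC8Y1R12'}
```

In practice the matching step (deciding that the two records describe the same entity) is the hard part, and that is exactly where data stewards and internal controls earn their keep.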

Generating insight from unstructured data by way of human effort alone can be frustrating. When I first joined RDC, the operations team accepted only 17% of the daily articles submitted. Assuming an average volume of 10,000 news articles and an average of one minute spent determining whether an article is relevant, the remaining 83% amounts to roughly 8,300 rejected articles, or about 138 hours per day spent filtering out irrelevant content.
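The back-of-the-envelope arithmetic behind that figure:

```python
# Workload estimate for manually filtering irrelevant articles.
daily_articles = 10_000
accept_rate = 0.17             # share of articles the operations team kept
minutes_per_article = 1

rejected = daily_articles * (1 - accept_rate)           # 8,300 articles/day
hours_filtering = rejected * minutes_per_article / 60   # ~138 hours/day
print(round(hours_filtering, 1))  # 138.3
```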

RDC’s data science team has created modular applications that serve Natural Language Processing (NLP) model predictions via RESTful APIs. One module de-duplicates intra-month news articles, another classifies relevance based on article text, and a third performs Information Extraction (IE) and Named Entity Recognition (NER) to help semi-automate the media ingestion process.
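As a minimal sketch of what the de-duplication module does conceptually, consider near-duplicate detection via word-level Jaccard similarity. This is not RDC's implementation; production systems would use techniques such as shingling and MinHash plus language-aware normalization:

```python
# Toy intra-batch news de-duplication: drop an article if its word set
# overlaps too heavily with an article we have already kept.

def jaccard(a: str, b: str) -> float:
    """Similarity of two texts as the overlap of their word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def deduplicate(articles, threshold=0.8):
    """Keep an article only if it is not too similar to one already kept."""
    kept = []
    for text in articles:
        if all(jaccard(text, seen) < threshold for seen in kept):
            kept.append(text)
    return kept

batch = [
    "Regulator fines bank over AML lapses",
    "Regulator fines bank over AML lapses again",  # near-duplicate
    "New sanctions list published today",
]
print(len(deduplicate(batch)))  # 2
```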

There are open-source tools such as Stanford NER, spaCy and AllenNLP, as well as more advanced Transformer-based models such as the GPT-n series by OpenAI. Even with current state-of-the-art NLP techniques, there is much room for improvement before models match the level of human comprehension [1]. However, we can certainly use existing NLP techniques if we redefine the problem so that it becomes manageable. This is where humans and machines can cooperate, each focusing on their comparative advantage.
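To show the shape of the NER task without requiring a trained model download, here is a toy rule-based illustration: flag runs of capitalized words as candidate entities. The tools above use trained statistical models and are far more accurate; this is only a caricature of the task:

```python
# Toy "NER" via a regular expression: runs of two or more capitalized
# words are treated as candidate names. Real NER models also assign
# entity types (PERSON, ORG, GPE, ...) and handle far messier text.
import re

def candidate_entities(text: str):
    """Return runs of two or more capitalized words as candidate names."""
    return re.findall(r"(?:[A-Z][a-z]+\s)+[A-Z][a-z]+", text)

sentence = "John Smith of Acme Holdings was fined by the regulator."
print(candidate_entities(sentence))  # ['John Smith', 'Acme Holdings']
```

A rule like this is exactly the kind of brittle heuristic that trained models replace, but it makes clear why redefining the problem (e.g., "surface candidates for a human to confirm") turns an unsolved comprehension problem into a manageable one.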

With data privacy issues and a lack of quality labelled data, synthetic data is receiving more attention from the machine learning (ML) community, and companies are being created to provide quality labelled training data. This is an active area of research with many techniques available, including Generative Adversarial Networks (GANs), but we need to be mindful of model transparency, which is covered next.
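For intuition only, here is the simplest possible synthetic-data generator: sampling labelled records from hand-specified distributions. Serious approaches such as GANs learn these distributions from real data; every field, rate and value below is made up for illustration:

```python
# Deliberately naive synthetic labelled data: fake transactions sampled
# from hand-picked distributions, with a label attached to each record.
import random

random.seed(42)  # fix the seed so the dataset is reproducible

def synthetic_transaction():
    """Generate one fake, labelled transaction record (illustrative only)."""
    suspicious = random.random() < 0.05  # assumed ~5% suspicious base rate
    amount = random.lognormvariate(8, 1) * (3 if suspicious else 1)
    return {
        "amount": round(amount, 2),
        "country": random.choice(["US", "GB", "SG", "KY"]),
        "label": "suspicious" if suspicious else "normal",
    }

dataset = [synthetic_transaction() for _ in range(1000)]
print(len(dataset))  # 1000
```

Even a generator this crude raises the transparency question: a model trained on it inherits every assumption baked into the sampling rules.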

Model governance

The second most frequently asked question was on model governance. What level of transparency and what internal controls need to be in place in highly regulated industries? Let us say that a white-box model is one where we can explain why the model made a specific prediction, whereas a black-box model is one where we cannot. There are grey areas, but let's focus on the extremes. With a linear regression model such as this one:

F(x) = a + bx

we can clearly explain why, given input x, the model made a specific prediction, since we learned a and b by fitting the model to data. Now consider the NASA ST5 spacecraft flight antenna, which was designed and developed using evolutionary algorithms [3].
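The white-box claim can be made tangible: after fitting F(x) = a + bx by ordinary least squares, a and b are directly inspectable, so every prediction decomposes into "intercept plus slope times input". A minimal sketch with toy data:

```python
# Fitting F(x) = a + bx by closed-form least squares on toy data,
# then explaining a prediction in terms of the fitted a and b.

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [5.0, 8.0, 11.0, 14.0, 17.0]  # generated from y = 2 + 3x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least-squares estimates: slope b, then intercept a.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

print(a, b)  # 2.0 3.0

# The prediction for x = 10 is fully explainable: intercept + slope * input.
prediction = a + b * 10
print(prediction)  # 32.0
```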

NASA can’t explain why their models are able to produce an evolved antenna that creates the best radiation pattern for the mission specifications, but testing confirms that it works. The paper describes evolutionary algorithms as “a family of stochastic search methods, inspired by natural biological evolution, that operate on a population of potential solutions using the principle of survival of the fittest to produce better and better approximations to a solution” [4].

Deep learning models are also stochastic. Artificial neural networks use randomness while being fit to a dataset: weights are initialized randomly, and data is shuffled at each training epoch during stochastic gradient descent (or another optimization algorithm) in search of a good local minimum. Hence, when we attempt to look under the hood of a deep learning model, we have a harder time answering why it did what it did.
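The role of the random seed is easy to demonstrate without any deep learning framework. Two "runs" initialize weights differently unless the seed is fixed; the sketch below uses the standard library only:

```python
# Illustrating the randomness baked into neural network training:
# weight initialization differs between runs unless the seed is fixed.
import random

def init_weights(n, seed=None):
    """Draw n small Gaussian initial weights from a dedicated RNG."""
    rng = random.Random(seed)
    return [rng.gauss(0, 0.1) for _ in range(n)]

run1 = init_weights(4)            # unseeded: differs from run to run
run2 = init_weights(4)
print(run1 == run2)               # almost certainly False

seeded1 = init_weights(4, seed=7)
seeded2 = init_weights(4, seed=7)
print(seeded1 == seeded2)         # True: fixing the seed reproduces the init
```

Seeding restores reproducibility for audit purposes, but it does not by itself make the trained model explainable, which is why the governance question goes beyond reproducibility.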

Almost any family of stochastic models will be more challenging to explain than, say, a linear regression model. However, we should keep in mind that scientists and engineers are actively researching ways to explain deep neural network decisions, with frameworks such as Layer-wise Relevance Propagation [5][6]. The jury is still out on what constitutes a satisfactory level of explanation, so it is important that we ask model transparency and other model governance questions early in the process and engage with regulators where necessary. Metaphysical questions such as ‘What is model transparency?’ can only be answered when all the stakeholders work together.

As far as internal controls to support a model governance framework are concerned, they fall into three broad areas: controls around data, controls around model development, and controls around model performance and testing.

Education and professional development

The third most frequently asked question was on education and professional development. Many asked for recommendations on learning more about ML, NLP, artificial intelligence and their applications within FinTech. This is a difficult question, since not everyone starts from the same place or has the same objectives. I recommend reading The Master Algorithm by Pedro Domingos, which gives a high-level overview of ML and the philosophy behind each ‘tribe’, as he refers to the different schools of thought. With such an overview in hand, I suggest taking a topical approach and going deeper down the rabbit hole: studying neural networks within the connectionism/cognitive science camp, for example, leads naturally to learning about backpropagation. Hopefully, you’ll pick up the necessary thought processes, math and programming along the way. If you desire more structure and tangible skillsets, I recommend MOOCs (e.g. edX, Coursera, Udacity, Udemy) and resources created by prominent contributors such as Andrew Ng, Wes McKinney (pandas), Hadley Wickham (R) and others.

Each topic deserves more coverage, but I hope my thoughts were enough to steer you in the right direction for further research.

You can watch the webinar here.