How NerdWallet Dialed Machine Learning up to 11
NerdWallet's mission is to provide clarity for all of life's financial decisions. We do this through helpful content, tools that compare financial products, and financial insights and actions such as credit report monitoring and automated money transfers. While all of this information is useful, "clarity" is impossible to achieve if the experience is not personalized and consumable. As a result, personalizing our content, tools, insights, and actions is necessary for us to achieve our mission.
Our first attempt at personalization was a Likelihood of Approval model. A majority of our members actively use us to monitor their credit reports, and we also had a sizable data set of users clicking "apply now," filling out a form on a partner's site, and either getting rejected or approved. As a result, we had enough information to accurately assess a member's odds of getting approved for a particular card. Knowing this, our data science team set out to build the best model they could. They wrote a query, downloaded their datasets, and used Jupyter notebooks to build a model. Once completed, they handed off a serialized version of that model to be integrated into a web service. While seemingly simple, this workflow had a few fatal flaws.
Reproducibility: When model bugs arose, it was impossible for us to reproduce the model and figure out why it had failed because we didn’t have the exact dataset, the exact code that ran it, or the configurations that generated it. This meant that the integrity of the system was too suspect to be put into production.
Modularity: Because code lived in behemoth Jupyter notebook cells, it was difficult to test, impossible for anyone other than the owner to extend, and hard to monitor. This increased the risk of bugs, which made the cost of non-reproducibility even higher.
Scale: Our data science team quickly realized that their jobs were becoming impossible to perform on a laptop. Their datasets were too large and their tuning jobs were taking too long. They needed a way to easily run ML code in remote environments without also having to become DevOps engineers.
We learned that we needed to bridge a deep chasm between data science exploratory analysis and our engineering systems. To do this, we had to answer one core question: what does production code mean for a data science team, and how do we build the right infrastructure and enact the right workflows to enable it?
After feeling the pain of putting our first few models into production, and faced with the demands of a growing data science team, it was becoming evident that expecting individuals to master the combined complexity of Data Science, DevOps, Engineering, and Product was not going to scale. For these reasons, we embarked on building a platform that reduced the cost of building and shipping machine learning models to production and made it easier to maintain those models once they went live.
Design Considerations
After working on a user spec with our fellow data scientists, we came up with the following user stories that arise when developing and launching a model into production:
1. I want to have a standardized dev environment where I can code up an ML pipeline and test it on some data stored on my laptop to make sure things work as expected.
2. I want to easily execute this model on a remote server with much larger data or with a more involved search algorithm.
3. I want to be able to tie my remote model training jobs to a code version, a dataset version, a configuration version, logs, and a model version so that I can effectively debug when things inevitably go wrong.
4. I want to be able to launch my versioned models into transactional web services and have confidence that I can roll back changes with little turnaround time.
5. I want to be able to periodically generate offline predictions for a particular model version on an input dataset.
6. I want to be able to periodically retrain my models by executing a versioned ML pipeline executable on current data.
Given these high-level user requirements, we boiled them down to a set of system requirements. The following is the list of systems we landed on, along with the user stories (by number) that each system addresses.
Dev Workflow: We need a workflow that gives data science a stable development environment that mirrors remote execution. [1]
Build System: We need a version control and build system to manage our data science workflows. [2, 3, 4]
Remote Pipeline Execution: We need a system that enables ML Pipeline orchestration and execution on the cloud. [2, 3]
Storage: We need a place to store our versioned models, pipelines, and datasets. [2, 3, 5]
Scheduler: We need a system that acts as a scheduler. [4, 5, 6]
Manager: We need a service that ties all of this together and helps store all of our metadata. [2, 3]
UI: We need a front end to work with the manager so that our experience is usable by humans. [2]
Given these system requirements, we quickly realized that building and maintaining bespoke solutions for all of them would require a large amount of resourcing. A guiding meta-requirement, therefore, was to leverage existing solutions and infrastructure whenever and wherever possible. Because we had heavily invested in build/deploy tooling, we wanted our model build/deploy system to plug into it as much as we could. We landed on solutions like GitHub and our existing build tooling for version control and builds, Artifactory for versioned artifacts, Amazon SageMaker for remote pipeline execution, S3 for storage, Airflow for scheduling, and a Metadata Manager service with a UI to tie it all together.
Utilizing these technologies allowed us to focus on laying out a coherent user experience for our data scientists rather than on rebuilding machine learning on the cloud.
The Whole Hog Experience
As we were thinking about the whole experience, we defined a few different workflows: building your first iteration of a model, iterating on your model until you are satisfied, periodically retraining your model with fresh data, integrating that model into a periodic batch-predict pipeline, and integrating that model into an API that makes predictions on request. Performing these workflows requires creating an ML Deployable, which is code that is responsible for every part of the model life cycle. Materially, this is a Python library that stores configurations and a set of entry points that implement the following interface (a rough sketch of which follows the list below):
Train: Consume a dataset and configurations and output a trained machine learning model object.
Predict: Consume an input and a model object and output a prediction.
Batch-Predict: Consume an input data set, a model object, and an output directory. Then perform batch predictions and load results into the output directory and finally into an S3 bucket for folks to consume.
Save Model: Consume a model object and an output directory and write the serialized model file(s) to the output directory.
Load Model: Consume a directory containing the serialized model file(s) and output the model object.
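To make the shape of an ML Deployable concrete, here is a minimal sketch of what these entry points could look like. The function names, signatures, and choice of pandas, scikit-learn, and joblib are illustrative assumptions rather than our actual interface.

```python
# Hypothetical sketch of an ML Deployable's entry points.
# Names, signatures, and libraries (pandas, scikit-learn, joblib) are
# illustrative assumptions, not the actual NerdWallet interface.
import os

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression


def train(dataset_path: str, config: dict) -> LogisticRegression:
    """Consume a dataset and configurations; output a trained model object."""
    df = pd.read_csv(dataset_path)
    X, y = df.drop(columns=[config["label"]]), df[config["label"]]
    model = LogisticRegression(**config.get("hyperparameters", {}))
    return model.fit(X, y)


def predict(model: LogisticRegression, features: pd.DataFrame) -> pd.Series:
    """Consume an input and a model object; output a prediction."""
    return pd.Series(model.predict_proba(features)[:, 1], index=features.index)


def batch_predict(model: LogisticRegression, input_path: str, output_dir: str) -> None:
    """Consume an input dataset, a model, and an output directory; write predictions."""
    features = pd.read_csv(input_path)
    predict(model, features).to_csv(os.path.join(output_dir, "predictions.csv"))


def save_model(model: LogisticRegression, output_dir: str) -> None:
    """Serialize the model file(s) into the output directory."""
    joblib.dump(model, os.path.join(output_dir, "model.joblib"))


def load_model(model_dir: str) -> LogisticRegression:
    """Load the serialized model file(s) and return the model object."""
    return joblib.load(os.path.join(model_dir, "model.joblib"))
```

The important part is not the specific libraries but that every model exposes the same five entry points, which is what lets the rest of the platform treat models uniformly.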
Now that we have the pieces let’s go through each workflow and explicitly outline how everything comes together.
Building Your First Model
A data scientist uses an Initialization CLI to initialize and configure a template Python library. This template contains:
A configuration file which contains [hyper]parameters, remote execution configuration, and dataset locations (e.g., the S3 URI of the full dataset).
Code stubs for the entry points discussed above.
A data directory where a data scientist can throw in a small dataset.
A Control CLI which executes the code in the train entry point and the code in the batch-predict entry point. Depending on the environment it's executed in (local or remote), it passes in the right arguments; a minimal sketch of such a CLI follows this list.
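As a rough illustration, assuming the entry points sketched earlier, such a Control CLI might be built on argparse along these lines; the command names, flags, and defaults are hypothetical.

```python
# Hypothetical Control CLI sketch built on argparse; command names, flags,
# and defaults are assumptions for illustration only.
import argparse
import json

import my_ml_deployable as pipeline  # hypothetical ML Deployable package


def main() -> None:
    parser = argparse.ArgumentParser(description="Control CLI for an ML Deployable")
    subparsers = parser.add_subparsers(dest="command", required=True)

    train_cmd = subparsers.add_parser("train")
    train_cmd.add_argument("--data", default="data/train.csv")
    train_cmd.add_argument("--config", default="config.json")
    train_cmd.add_argument("--model-dir", default="artifacts/")

    batch_cmd = subparsers.add_parser("batch-predict")
    batch_cmd.add_argument("--data", default="data/batch.csv")
    batch_cmd.add_argument("--model-dir", default="artifacts/")
    batch_cmd.add_argument("--output-dir", default="predictions/")

    args = parser.parse_args()
    if args.command == "train":
        with open(args.config) as f:
            config = json.load(f)
        model = pipeline.train(args.data, config)
        pipeline.save_model(model, args.model_dir)
    else:  # batch-predict
        model = pipeline.load_model(args.model_dir)
        pipeline.batch_predict(model, args.data, args.output_dir)


if __name__ == "__main__":
    main()
```

Locally, the defaults would point at the small dataset in the data directory; when the same CLI runs remotely, the arguments would point at the full, mounted dataset instead, which is what keeps local and remote executions symmetric.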
The data scientist will load a small dataset into the data directory, fill in the stubs, write tests, and use the control CLI to make sure the training job is running successfully with the local dataset. Since we guarantee that the execution of code on their laptop will be as close as humanly possible to the execution of code in a remote environment, this should give the data scientist confidence that their pipeline will run successfully on the cloud.
Iterating On Your Model
Once the data scientist is confident that things are working as intended, they will now want to train and tune their model on the real dataset in a remote environment. We decided to leverage the PR workflow to help the data scientist do this easily. More specifically, they would create a branch in git, commit their code, push those commits to that branch, and then open up a PR against master.
From then on, any time they push commits up to their branch, a GitHub webhook triggers a build job. Among other things, the build job will:
Package the code and publish the package into Artifactory.
Pass the artifact name to the Metadata Manager, which uses the ML Deployable's configuration file to make the right asynchronous call to Amazon SageMaker (a sketch of what such a call might look like appears after this list).
SageMaker then does some orchestration, booting a container onto specified hardware, downloading the dataset from S3, mounting the data onto the hardware, downloading the code to run, etc.
SageMaker then calls the Control CLI’s train command which trains the model, serializes it via save-model and publishes the versioned model’s artifact to Artifactory.
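For illustration only, that asynchronous call and the subsequent status polling might look roughly like the boto3 sketch below; the job name, container image, role, S3 paths, and instance settings are placeholder assumptions rather than our actual configuration.

```python
# Hypothetical sketch of kicking off and polling a SageMaker training job with
# boto3. Job names, image URIs, roles, and S3 paths are placeholder assumptions.
import time

import boto3

sagemaker = boto3.client("sagemaker")

sagemaker.create_training_job(
    TrainingJobName="likelihood-of-approval-abc123",  # e.g. derived from the commit hash
    AlgorithmSpecification={
        "TrainingImage": "1234.dkr.ecr.us-west-2.amazonaws.com/ml-deployable:abc123",
        "TrainingInputMode": "File",
    },
    RoleArn="arn:aws:iam::1234:role/sagemaker-training-role",
    InputDataConfig=[{
        "ChannelName": "train",
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://example-bucket/datasets/approvals/2019-01-01/",
            "S3DataDistributionType": "FullyReplicated",
        }},
    }],
    OutputDataConfig={"S3OutputPath": "s3://example-bucket/models/"},
    ResourceConfig={"InstanceType": "ml.m5.xlarge", "InstanceCount": 1, "VolumeSizeInGB": 50},
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
)

# Poll the job status so progress can be reported back (e.g. as a PR comment).
while True:
    status = sagemaker.describe_training_job(
        TrainingJobName="likelihood-of-approval-abc123"
    )["TrainingJobStatus"]
    if status in ("Completed", "Failed", "Stopped"):
        break
    time.sleep(60)
```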
While their training job is running, the Metadata Manager polls SageMaker and comments back on the GitHub PR, which lets data scientists know what's going on with their job. Finally, when the code review is complete and a model is successfully trained, they can merge their branch into master and the latest versioned model artifact becomes the latest official model version.
Not only does this give data science a nifty development environment for remote pipeline execution, it also gives us a complete audit of every model: EVERY versioned model can be tied back to the commit hashes of the code/config, immutable data snapshots, and logs from that execution!
Model Retraining
When the data scientist is satisfied with their ML pipeline, we would like to schedule periodic retrainings with fresh data on an automated release cadence. This is made easy with Airflow. The data scientist writes an Airflow DAG whose operators run scripts that take the Artifactory link created during the latest official build and kick off a SageMaker training job; a rough sketch of such a DAG is shown below. This means versioned model artifacts are dropped into Artifactory on a periodic basis and can be released on a manual or automated schedule.
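As a sketch only, assuming a modern Airflow import path, a retraining DAG could look like the following; the DAG id, schedule, and the launch_sagemaker_retraining helper are hypothetical stand-ins for the actual scripts.

```python
# Hypothetical Airflow retraining DAG; the dag_id, schedule, and the
# launch_sagemaker_retraining helper are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def launch_sagemaker_retraining(**context):
    """Placeholder for a script that resolves the latest official pipeline
    artifact in Artifactory and kicks off a SageMaker training job with it."""
    ...


with DAG(
    dag_id="likelihood_of_approval_retraining",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    retrain = PythonOperator(
        task_id="launch_sagemaker_training_job",
        python_callable=launch_sagemaker_retraining,
    )
```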
Deploying Model Predictions
There are two production use cases: offline batch predictions and transactional services. Deploying models to services is easy because these standardized ML pipelines are semantically versioned Python libraries that implement both a load-model and a predict entry point. In practice, this means a Flask app declares both the trained model and the ML pipeline as dependencies. Once those two requirements are met, the app calls the ML pipeline's load-model function to deserialize the trained model on app startup, and the ML pipeline's predict function on each request; a minimal sketch of such an app is shown below.
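Here is a minimal sketch of what that could look like, assuming the entry points from earlier; the package name, model directory, route, and payload shape are assumptions for illustration.

```python
# Hypothetical Flask service wrapping an ML Deployable; the package name,
# model directory, route, and payload shape are illustrative assumptions.
import pandas as pd
from flask import Flask, jsonify, request

import my_ml_deployable as pipeline  # hypothetical ML Deployable package

app = Flask(__name__)

# Deserialize the versioned, trained model once at app startup.
model = pipeline.load_model("/opt/models/likelihood_of_approval/")


@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON object of feature name -> value for a single member.
    features = pd.DataFrame([request.get_json()])
    score = pipeline.predict(model, features).iloc[0]
    return jsonify({"likelihood_of_approval": float(score)})
```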
Finally, running periodic batch prediction jobs works almost the same way as periodic retraining jobs. You create a DAG that uses a custom operator we built, which takes a versioned ML Deployable, the Artifactory link to the versioned model artifact, and the input dataset location, and runs the batch-predict entry point; a sketch of what using that operator might look like is shown below. The job uploads the batch prediction results to an S3 bucket which anyone can consume from.
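Purely as an illustration, using such an operator in a DAG might look like this; the BatchPredictOperator name, its import path, and its parameters are assumptions, since the real operator is internal.

```python
# Hypothetical usage of a custom batch-predict operator in an Airflow DAG;
# the BatchPredictOperator class and its parameters are assumptions.
from datetime import datetime

from airflow import DAG

from nerdwallet_ml.operators import BatchPredictOperator  # hypothetical internal operator

with DAG(
    dag_id="likelihood_of_approval_batch_predict",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    batch_predict = BatchPredictOperator(
        task_id="run_batch_predictions",
        ml_deployable="likelihood-of-approval==1.4.0",  # versioned ML Deployable
        model_artifact="https://artifactory.example.com/models/loa-1.4.0.tar.gz",
        input_dataset="s3://example-bucket/datasets/approvals/daily/",
        output_path="s3://example-bucket/predictions/approvals/",
    )
```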
What did we get out of this?
Our ML Platform allows us to significantly decrease the friction in building and releasing machine learning models, for three big reasons. First, since our data scientists don't have to reinvent the wheel for every project, they can spend their time testing more interesting hypotheses. Second, providing access to standardized model pipelines makes it easier for our data scientists to write production-quality code. Finally, and most importantly, the machine learning platform gives engineering and data science a common system from which to chip away at friction. Once we on-boarded our data scientists, we saw the compounding leverage we gained after a month of work first hand: models that previously took a month to train, test, analyze, and ship to production now take two days on our platform.
Hopefully you now have a clearer picture of what the Machine Learning Engineering and Data Science story looks like at NerdWallet! The user stories we defined above are not completely exhaustive and will continue to evolve as our machine learning and data science problems grow increasingly complex. Are there use cases you'd like to see supported? Would you like more detailed insight into how we built some of the systems mentioned above? Feel free to comment below, and if you see yourself working on improving the engineering and data story at NerdWallet, we encourage you to apply.