Delivering value for patients hinges on predicting their risks and suggesting Next Best Actions, and we've found no tool better at doing this at scale than AI. The catch is that AI requires reliable access to data, and integrating healthcare data with CRM and CDP systems is notoriously complex and messy. Even once the data is usable, there are thousands of diagnoses, procedures, and events to potentially predict.
The common approach has been to go to great expense building Enterprise Data Warehouses (EDWs) and then develop one-off models or queries on request, each of which joins a long, labor-intensive IT backlog. The faster hospitals can deploy Next Best Action models, the faster they can save lives and create value. Knowing this, we invested heavily in data pipelines and data science, even opening a data science lab near Carnegie Mellon University in Pittsburgh.
Can the right infrastructure reduce IT’s backlog and deliver the thousands of models needed to manage complex care?
The TL;DR version of this blog post, in Galaxy Brain meme format, answers the question…
We’ll explain in more detail below.
Our History: Next Gen Data Pipelines
We started using Airflow to build data pipelines almost four years ago. At the time, Airflow was a next-generation data workflow tool compared to earlier options. It let us build multi-step workflows in Python to process data and then run them on various schedules, which was a big win for the data processing infrastructure we'd created to better serve patients.
When we looked at our objectives for expanding our data science efforts, speed and scalability were chief among them. We have an incredibly ambitious view of data science in healthcare: we see our health system clients using huge numbers of predictions to engage patients and manage relationships. (The "RM" in SymphonyRM stands for Relationship Management.)
For us, that means supporting not just a few niche machine learning models but models for a large number of Next Best Actions relevant to health systems and their patients: everything from cancer screenings to orthopedics, neurology to bariatrics, and on and on. We therefore need infrastructure that facilitates training and testing thousands of models and tuning their hyperparameters. On top of that, since iteration and training on large data sets are key to improving models, we need to do all of that fast, definitely without waiting days or weeks for results.
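To make that scale concrete, here is a minimal sketch in plain Python of how a model-per-action grid multiplies out. The service lines, model types, and hyperparameter values are hypothetical examples; the real catalog is far larger.

```python
from itertools import product

# Hypothetical examples -- the real Next Best Action catalog is far larger.
service_lines = ["mammography", "colonoscopy", "orthopedics", "neurology", "bariatrics"]
model_types = ["logistic_regression", "gradient_boosting"]
hyperparams = [{"learning_rate": lr, "max_depth": d}
               for lr, d in product([0.01, 0.1], [3, 6])]

# One candidate model per (service line, model type, hyperparameter setting):
candidates = [
    {"service_line": s, "model": m, "params": p}
    for s, m, p in product(service_lines, model_types, hyperparams)
]

print(len(candidates))  # 5 * 2 * 4 = 40 candidates, even from this tiny grid
```

Scale the lists up to a realistic catalog of actions and settings, and the candidate count lands in the thousands quickly; that entire grid is what the infrastructure has to train, test, and compare.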
Enter Prefect: The Next Gen’s Next Gen
In the spy thriller TV series Alias, Marshall Flinkman was an affable tech genius who often saved Sydney and her team with an innovative spying device or a complex computer hack. In the show, he has a great line about a MEMS (Micro-Electro-Mechanical Systems) listening device being "the next gen's next gen," i.e., so advanced that it's not just next generation, but the technology generation beyond that.
If Airflow was a next gen system, Prefect is the next gen’s next gen data workflow platform. It was conceived and built by a long-time Airflow committer, with an eye toward keeping the great things about Airflow — configuration as code, DAG (directed acyclic graph) workflow structure, etc. — while adding new advances like dramatically faster execution, scalability through mapping tasks over data, first-class support for passing data between tasks, parameterized flows, and many more. (For those interested in a more in-depth comparison, see Why Not Airflow?)
Using Prefect’s open source Core package, we initially built parameterized data flows that performed model training and testing for thousands of healthcare predictive models. We then ran those flows on a 100-node cluster. Experiments that would have taken over two days to complete serially finished in 30 minutes! This also vastly accelerates data ingestion projects, where compute resources need to scale in order to intake, process, and match millions of healthcare event records to patients.
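The rough arithmetic behind that speedup is worth a sanity check. This sketch assumes the experiments are independent and scaling is near-linear, and deliberately ignores scheduling and I/O overhead:

```python
serial_hours = 48   # roughly two days of back-to-back experiments
workers = 100       # Dask worker nodes in the cluster

# With independent experiments, wall-clock time divides by the worker count
# (ignoring scheduler overhead, which Dask keeps in the millisecond range).
parallel_minutes = serial_hours * 60 / workers
print(parallel_minutes)  # 28.8 -- in line with the ~30 minutes we observed
```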
We now also use Prefect’s Cloud platform for full workflow orchestration: scheduling, task retries/checkpointing/caching, resource throttling, log aggregation, environment labels/flow affinity, and monitoring of our production flows. As we add more production flows and data engineering pipelines, another aspect of Prefect becomes critical: negative engineering, i.e., handling what happens when things go wrong, such as error conditions and missing or malformed data.
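As a rough illustration of what "negative engineering" means in practice, here is a stdlib-only sketch (all names are hypothetical, not Prefect's API) of the retry logic that Prefect handles declaratively, so we don't have to hand-roll it around every task:

```python
import time
from functools import wraps

def with_retries(max_retries=3, delay_seconds=0.0):
    """Re-run a flaky task up to max_retries times before giving up."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_retries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_retries:
                        raise
                    time.sleep(delay_seconds)  # back off before retrying
        return wrapper
    return decorator

calls = {"n": 0}

@with_retries(max_retries=3)
def flaky_extract():
    """Hypothetical task that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient source failure")  # e.g., a dropped connection
    return "records"

print(flaky_extract())  # "records", after two transient failures
```

Multiply this boilerplate by every task in every flow, add checkpointing and alerting, and the value of having the orchestration layer own it becomes obvious.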
One way to summarize our experience with Prefect: we came for the parallel execution, we stayed for the workflow automation.
Sample workflow automation in Prefect
Dask: Easy Cluster Computing for Python
Talking about Prefect without mentioning Dask would be a major omission, particularly for running experiments in parallel. Dask is a Python-based cluster computing platform that easily scales from running on your laptop to thousand-node clusters in cloud infrastructure and even to supercomputers. Dask’s scheduler supports coordinating distributed task execution with very low (millisecond) latency.
For data scientists used to working with Python tools like numpy, pandas, and scikit-learn/Keras/PyTorch, Dask provides a familiar interface yet can run at large scale by distributing memory and computation across a cluster of machines. For those familiar with Spark, Dask is in some ways a Python-based alternative. (For a more detailed comparison, see Comparison to Spark.)
Dask allows us to train machine learning models on data sets containing tens of millions of patient interactions, scaling those efforts in parallel across clusters of hundreds of machines to finish faster. That alone was a big win for us, and much easier than asking our data scientists to switch from pandas and scikit-learn to PySpark (the Python interface to Spark) for large-scale experiments. Dask rides on top of technologies like Kubernetes (see the dask-kubernetes project) to easily scale computing infrastructure up and down. We use Amazon’s Elastic Kubernetes Service (EKS) to dynamically scale up 100+ Dask worker nodes in minutes, and to scale back down just as easily.
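Our production pipelines do this with Dask DataFrames, but the underlying split-apply-combine pattern can be sketched with nothing but the standard library. The matching function and record shapes here are hypothetical stand-ins for the real event-to-patient matching logic:

```python
from concurrent.futures import ThreadPoolExecutor

def match_events_to_patients(chunk):
    """Hypothetical stand-in: pair each event with its patient id."""
    return [(event["patient_id"], event["code"]) for event in chunk]

# Toy event records; the real pipelines process tens of millions of these.
events = [{"patient_id": i % 4, "code": f"PROC-{i}"} for i in range(12)]

# Split the records into chunks, apply the matcher in parallel, combine results.
chunks = [events[i:i + 4] for i in range(0, len(events), 4)]
with ThreadPoolExecutor(max_workers=3) as pool:
    matched = [pair for part in pool.map(match_events_to_patients, chunks)
               for pair in part]

print(len(matched))  # 12 matched (patient, event) pairs
```

Dask applies the same idea across a cluster: partition the data, run the same function on every partition in parallel, and combine, while also spilling partitions that don't fit in any one machine's memory.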
Letting Data Scientists Do Data Science
Our data scientists and engineers could have worked directly on top of Dask to build some of the basics that Prefect provides, but they would have been reinventing the wheel and missing out on Prefect’s more sophisticated capabilities. Their time is much better spent on our business objectives and client priorities than on cleanup and ETL plumbing. By executing flows on Dask, Prefect lets data scientists and data engineers think in terms of simple functions that combine into a larger workflow, without worrying about low-level Dask details, distributed execution, error handling, retries, and so on.
In fact, Prefect tasks are plain Python functions; they just happen to get distributed onto a cluster for execution by a Dask worker. Prefect tasks can return data and pass it to downstream tasks, and a parameter to a Prefect task might even be a reference to a huge feature matrix in the form of a Dask DataFrame distributed over 1 TB of RAM on a large cluster.
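A minimal sketch of that mental model, using plain functions instead of real Prefect tasks (the task names and data here are hypothetical):

```python
# Plain functions standing in for tasks. Prefect wires equivalents into a DAG
# and ships them to Dask workers, but the mental model is just this.
def extract():
    return [0.2, 0.9, 0.4]          # e.g., risk scores pulled from a store

def threshold(scores, cutoff):
    return [s for s in scores if s >= cutoff]

def summarize(flagged):
    return {"flagged": len(flagged)}

# Downstream tasks receive upstream return values directly.
scores = extract()
flagged = threshold(scores, cutoff=0.5)
result = summarize(flagged)
print(result)  # {'flagged': 1}
```

In an actual flow, Prefect derives the DAG from these data dependencies, so the code data scientists write stays this simple while the execution happens on a cluster.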
Matthew Rocklin, the creator of Dask, has shared his own description of Prefect.
In our view, Prefect’s decision to use Dask as its primary execution engine was a great choice and will give Prefect users easy scalability via a powerful cluster computing platform. On top of that, both Prefect and Dask offer a very Pythonic developer experience: engineers familiar with Python can run the same flows locally on their laptops or on clusters.
The New Stack for Data Science
At SymphonyRM, we believe this combination of technologies, Prefect for workflows and Dask for execution and scalability, is likely to emerge as the popular tech stack for data science workflows in Python. It’s perhaps a bold assertion, but we see the advantages as so substantial, and our results as so positive, that we expect many others will soon pick up on these technologies. (We also haven’t gone into significant depth in this brief overview; future posts could cover specific areas in far more detail.)
We think a lot about giving our data scientists “superpowers,” i.e., arming them with the best tools so that we can deliver on our ambitious vision of providing clients with models to support a wide variety of Next Best Actions. Prefect and Dask have been key enabling technologies behind those “Galaxy Brain” superpowers. The speed and scale these platforms bring to data pipeline and modeling operations will make a major impact on our ability to engage patients, identify Next Best Actions, and help eliminate some of the massive IT backlog.