Distributed Machine Learning Can Bring Healthcare Breakthroughs
AI, Healthcare, and Machine Learning
Many thanks to the Massachusetts Institute of Technology (mit.edu) for its contributions and research. Several papers are hyperlinked in this article.
Nicholas Mitsakos
Who Needs a PC When You Have a VAX?
In 1980, while I was studying computer science, there was a debate about whether computer processing should be “distributed” or “centralized.” It was the dawn of the PC revolution, and we were still unsure whether people wanted processing power to be local (working at your desk) or centralized, with a powerful computer doing the processing and delivering the results to you via a “dumb terminal.” Personal, distributed processing won, and the PC revolution was born.
Of course, cloud computing represents a fundamental return to centralized processing. Because so much data is now collected centrally, there is renewed debate about whether processing should happen at the network’s edge or in the cloud. In both cases, accessing data of the quality and quantity needed to develop truly effective machine learning remains challenging.
Perhaps the most impactful application of machine learning, and the one with the greatest societal benefit, is health care. The benefits of success are clear, but getting there is not easy. Health data is an emotional hot button, and it raises many concerns about privacy and access to appropriate data.
A solution may be available.
We Had a Revolution Once Before
Today, machine learning faces the same debate about whether processing should be centralized or distributed. The standard approach requires data to be centralized in one place, on the theory that this lets the software learn most quickly and effectively. However, a newer approach allows a model to learn from data sources distributed across multiple devices.
Google has developed a technology that trains its predictive text model on messages sent and received by Android users. The most interesting thing about Google’s approach is that it preserves privacy: the system never actually reads the messages or extracts them from users’ devices.
This distributed, privacy-preserving approach to machine learning could overcome the greatest obstacle facing AI adoption in health care today. We may no longer need to choose between patient privacy and the utility of the data to society. Both can be achieved together.
Google to the Rescue. That’s Right, Read It Again If You Have To…
Over the last decade, the dramatic rise of deep learning has led to stunning transformations in dozens of industries. It has powered our pursuit of self-driving cars, fundamentally changed the way we interact with our devices, and reinvented our approach to cybersecurity.
In health care, however, despite many studies showing its promise for detecting and diagnosing diseases, progress in using deep learning to help real patients has been agonizingly slow.
Current state-of-the-art algorithms require immense amounts of data to learn — the more data the better. Hospitals and research institutions need to combine their data reserves if they want a pool of data that is large and diverse enough to be useful. But in the US and other developed countries, the idea of centralizing reams of sensitive medical information in the hands of tech companies has repeatedly — and unsurprisingly — proved intensely unpopular.
As a result, research on diagnostic uses of AI has stayed narrow in scope and applicability. A breast cancer detection model can’t be deployed around the world when it’s only been trained on a few thousand patients from the same hospital.
Enter Google’s distributed learning technology.
Time to Distribute, Centralize, and Distribute Again
All this could change with distributed learning. The technique can train a model using data stored at multiple different hospitals without that data ever leaving a hospital’s premises or touching a tech company’s servers. It does this by first training separate models at each hospital with the local data available and then sending those models to a central server to be combined into a master model. As each hospital acquires more data over time, it can download the latest master model, update it with the new data, and send it back to the central server. Throughout the process, raw data is never exchanged. Only the models are, and those cannot easily be reverse-engineered to reveal that data.
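The train-locally, combine-centrally loop described above can be sketched in a few lines. This is a minimal illustration, not Google’s implementation: it assumes a toy two-parameter linear model, synthetic data standing in for hospital records, and a size-weighted average as the combining step. The names `local_update`, `federated_average`, and `make_hospital_data` are hypothetical.

```python
import random

def local_update(weights, data, lr=0.1, epochs=20):
    """Run a few gradient-descent passes on one hospital's private data."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in data:
            err = (w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return (w, b)

def federated_average(models, sizes):
    """Combine local models, weighting each by its number of records."""
    total = sum(sizes)
    w = sum(m[0] * n for m, n in zip(models, sizes)) / total
    b = sum(m[1] * n for m, n in zip(models, sizes)) / total
    return (w, b)

def make_hospital_data(n, rng):
    # Synthetic stand-in for private records (assumption): y = 2x + 1 plus noise.
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, 2.0 * x + 1.0 + rng.gauss(0, 0.05)) for x in xs]

rng = random.Random(0)
hospitals = [make_hospital_data(n, rng) for n in (50, 80, 120)]
sizes = [len(data) for data in hospitals]

master = (0.0, 0.0)
for _ in range(30):
    # Each hospital downloads the master model and trains on its own data.
    local_models = [local_update(master, data) for data in hospitals]
    # Only the model parameters reach the server, never the raw records.
    master = federated_average(local_models, sizes)

print(f"w={master[0]:.2f}, b={master[1]:.2f}")
```

Only `local_update` ever touches the records. The server sees parameter tuples and record counts, which is the property the paragraph above relies on.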
Distributed learning is not without its challenges:
- Combining separate models risks creating a master model that’s actually worse than each of its parts. Researchers are now working on refining existing techniques to make sure that doesn’t happen.
- Distributed learning requires every hospital to have the infrastructure and personnel capabilities for training machine-learning models.
- Standardizing data collection across all hospitals can be an enormous task.
But these challenges are not insurmountable, and the potential upside makes them worth tackling.
We Can Solve This
Effective privacy-first distributed learning techniques have been developed in response to these challenges. One, split learning, has each hospital start by training a separate model, but only halfway. The partially trained models are then sent to the central server to be combined and to finish training. The main benefit is that this alleviates some of the computational burden on the hospitals. The technique is still mainly a proof of concept, but in early testing it produced a master model nearly as accurate as one trained on a centralized pool of data.
A handful of companies, including IBM Research, are now working on using Google’s distributed learning to advance real-world AI applications for health care. Others are using it to predict patients’ resistance to different treatments and drugs, as well as their survival rates with certain diseases. Several cancer research centers in the US and Europe allow their data to be used for such models. The collaborations have already resulted in a new model that appears to predict survival odds for a rare form of cancer on the basis of a patient’s pathology images. Attempts to validate the benefits of this and other techniques in real-world settings are increasing. The biggest barrier in oncology today is knowledge. We may now have the power to extract that knowledge and make medical breakthroughs.
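One common reading of split learning is that the model itself is cut in two: the hospital runs the first layers and the server runs the rest, so only intermediate activations and their gradients cross the boundary. The sketch below illustrates that idea under several assumptions: a toy two-layer linear model, a single hospital for brevity, synthetic data, and labels that are visible to the server. The class names `HospitalHalf` and `ServerHalf` are hypothetical.

```python
import random

class HospitalHalf:
    """Client-side layer: h = a * x. Raw x never leaves this object."""
    def __init__(self):
        self.a = 0.5

    def forward(self, xs):
        self.xs = xs
        return [self.a * x for x in xs]

    def backward(self, grad_h, lr):
        # Chain rule: dL/da is the batch mean of (dL/dh_i * x_i).
        grad_a = sum(g * x for g, x in zip(grad_h, self.xs)) / len(self.xs)
        self.a -= lr * grad_a

class ServerHalf:
    """Server-side layer: y_hat = c * h + d."""
    def __init__(self):
        self.c, self.d = 0.5, 0.0

    def train_step(self, hs, ys, lr):
        n = len(hs)
        errs = [self.c * h + self.d - y for h, y in zip(hs, ys)]
        # Per-sample gradient w.r.t. the activations, returned to the hospital.
        grad_h = [e * self.c for e in errs]
        self.c -= lr * sum(e * h for e, h in zip(errs, hs)) / n
        self.d -= lr * sum(errs) / n
        loss = sum(e * e for e in errs) / (2 * n)
        return grad_h, loss

rng = random.Random(1)
# Synthetic stand-in for private records (assumption): y = 2x + 1 plus noise.
xs = [rng.uniform(-1, 1) for _ in range(100)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 0.05) for x in xs]

hospital, server = HospitalHalf(), ServerHalf()
losses = []
for _ in range(500):
    hs = hospital.forward(xs)                      # activations go to the server
    grad_h, loss = server.train_step(hs, ys, 0.1)  # server finishes the pass
    hospital.backward(grad_h, 0.1)                 # gradients come back
    losses.append(loss)

print(f"first loss={losses[0]:.3f}, last loss={losses[-1]:.5f}")
```

The hospital computes only its own layer, which is where the reduced computational burden comes from. Sharing labels with the server is a simplification here and may not be acceptable in a real clinical deployment.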
This Could Be a Big Deal
Distributed learning could also extend far beyond health care to any industry where people don’t want to share their data, which is probably most of them. In distributed, trustless environments, this can be very powerful.