Machine Learning and Cancer

If you ran a daily web search for “Machine Learning and Cancer”, you would likely find more companies each day using supervised machine learning to perform predictive analytics on cancer data sets. Examples include Tempus, here in Chicago, IL; Berghealth in Framingham, MA; and FlatIron Health in New York, NY. So how do these companies get their data? They appear to be well funded, and they likely pay good money to partner with health organizations. As partners, they act as Business Associates and comply with HIPAA in order to access patient information. These companies are doing great work!

As a small consulting group, we have been able to get data from the All Payor/All Claims databases of certain states (This site lists the status of participation in the All Payor/All Claims process for many states in the U.S.). However, the data we get is not very useful, because the ability to track a patient’s disease is virtually nil. In the initial stages of analysis, I thought I could follow a patient from the first to the last service date of their medical encounters. The states have anonymized the patient IDs (a good thing), but they go overboard and don’t give you enough information to piece together the correct sequence of service dates for a given patient. Furthermore, a patient can be tracked only within a single insurance carrier. What if the person changes insurance during treatment (a very likely occurrence)? You just can’t get the data needed to describe a person’s full episode of the diagnosis being studied. The All Payor/All Claims database program was a great idea, begun during the Obama Administration, but it hasn’t gone far enough to provide useful, quality data.
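To make the carrier problem concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the row layout, the carrier names, the anonymized IDs, and the dates are made up, and real state extracts are laid out differently. It assumes the anonymized ID is only meaningful within one carrier, so the only safe grouping key is the pair (carrier, anonymized ID).

```python
from collections import defaultdict
from datetime import date

# Hypothetical claim rows: (carrier, anonymized_patient_id, service_date, diagnosis_code).
# All values are invented for illustration; they are not from any real extract.
claims = [
    ("CarrierA", "a91f", date(2016, 1, 12), "C50.9"),
    ("CarrierA", "a91f", date(2016, 2, 3), "C50.9"),
    ("CarrierB", "7c2e", date(2016, 6, 20), "C50.9"),
    ("CarrierB", "7c2e", date(2016, 8, 1), "C50.9"),
]


def build_episodes(rows):
    """Group claims by (carrier, anonymized id); return (first, last) service date per group."""
    dates_by_key = defaultdict(list)
    for carrier, patient_id, service_date, _diagnosis in rows:
        dates_by_key[(carrier, patient_id)].append(service_date)
    return {key: (min(dates), max(dates)) for key, dates in dates_by_key.items()}


episodes = build_episodes(claims)
for key, (start, end) in sorted(episodes.items()):
    print(key, start, end)
```

If the two IDs above actually belong to one person who switched insurers in the spring, the grouping still reports two unrelated short episodes, and nothing in the data lets you join them into the single course of treatment you wanted to study.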

In fact, there is no easy way for me to get good patient data anywhere, even data in which patient identifiers and names are anonymized and dates of birth are null. The reason is that I am neither a well-funded Business Associate in a contractual relationship with a Covered Entity nor a university researcher approved by an Institutional Review Board who also holds Business Associate status.

The problem with both of these barriers is that they take a lot of money to overcome; as a result, the analysis of such data sets is limited to a select few researchers and companies.

I believe that, just like universal free education and open source software, open healthcare data (anonymized to protect any specific patient, but with accurate diagnostic information and patient characteristics such as age, sex, county, city, and state) would give anyone the chance to build useful models, for example to predict whether a tumor is benign or malignant.

Overseeing such an open healthcare data program would be a great function for the federal government. Many states are already trying to do something similar with All Payor/All Claims. The federal government can learn from the state efforts, create such a program, build the quality data that is needed, and make it available in one place.

Comments on this are welcome…