David Talby is an accomplished NLP and Machine Learning researcher and engineer and currently the CTO at John Snow Labs, where he leads the development of the leading open source Java based Spark NLP library, as well as the products Spark NLP for Healthcare, Spark OCR, the Annotation Lab, and the Healthcare AI Platform. We talked to him about his career so far and how applications of AI in Healthcare differ in so many important ways from other industries.
Hi David. Thanks for joining us for this interview. Tell us a bit of your personal story. How did you get into AI and what got you hooked on it?
Thank you for inviting me here. So, I did a PhD in computer science and an MBA. Eventually, I started working for Amazon in 2006, and I was looking at financial systems and building very large scale backend systems. After that, I moved to Microsoft Bing in 2009 and that's where I really got started with machine learning, doing a lot of work on automated text classification, a lot of data quality pipelines to filter out bad content, and of course some ranking and relevance algorithms. This was 2009, and at the time, if we wanted to train a classifier, we had to code it from scratch: you read the paper, and if you wanted to add regularization, it wasn't a parameter like it is in PyTorch; you would actually write the code. So I have seen a lot of the challenges. After Microsoft I joined a startup, and we did some NLP work on medical records, trying to do automated clinical coding. At the time, really, we just tried everything that was out there. We tried spaCy, Stanford NLP, cTAKES, NegEx, NLTK, OpenNLP, and a bunch of other libraries, and for us this was just not working very well. So a few years and a couple of companies after that, when the opportunity came and we got connected to Databricks at a technical conference, we started speaking, and I told the Databricks guy: “Listen, I really need a good NLP library on top of Spark, that enables me to do this simply at scale with good pipelines. So just tell us who is building the open source project, we can join and contribute”. And they said: “well, no one is doing that, you should do it”. My initial reaction was – No way! That's a big responsibility and a multi-year investment, but then a few months later there was actually an opportunity to do just that in one of my projects. I started it, launched the Spark NLP library, and now have a team that just celebrated four years of releasing software every two weeks for the community.
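For readers who have only ever set regularization through a framework parameter, the following is a minimal sketch of what "writing the code" for it can look like. It is not code from the interview: the function names, the toy data, and the hyperparameters are all assumptions chosen for illustration, and it uses plain batch gradient descent on a logistic regression classifier with an L2 penalty added to the weight update by hand.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, l2=0.0, lr=0.5, epochs=500):
    """Batch gradient descent for logistic regression with a manual L2 term.

    All names and defaults here are illustrative assumptions.
    """
    n, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # derivative of the log loss w.r.t. the score
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        for j in range(n_features):
            # Hand-written regularization: the gradient of (l2 / 2) * ||w||^2
            # is l2 * w, so it is simply added to the data gradient.
            w[j] -= lr * (grad_w[j] / n + l2 * w[j])
        b -= lr * grad_b / n  # the bias is conventionally left unregularized
    return w, b


def norm(w):
    return math.sqrt(sum(wj * wj for wj in w))


# Toy, linearly separable data (hypothetical values).
X = [[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]]
y = [1, 1, 0, 0]

w_plain, b_plain = train_logistic(X, y, l2=0.0)
w_reg, b_reg = train_logistic(X, y, l2=0.1)

print("weight norm without L2:", round(norm(w_plain), 3))
print("weight norm with L2:   ", round(norm(w_reg), 3))
```

On separable data like this, the unregularized weights keep growing with more epochs, while the L2 term pulls them toward a finite solution, which is the behavior a modern framework exposes as a single weight-decay style parameter.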
So is that how John Snow Labs got started?
Well, not started, but I think that was the big break for the company, because that library was the first one to significantly take off in terms of downloads, and in terms of community. There was a real need to have something that is production grade and scalable, so it can deal with a lot of text, be mature in terms of code quality, and support multiple platforms. The other thing that happened in the last four years is that NLP completely exploded and changed, but when we started the main concern was that we wanted to be able to scale things like tokenization and part-of-speech and named entity recognition.
The reason Spark NLP really took off is because deep learning and transfer learning happened to NLP, in the same way they happened to computer vision a few years earlier. Almost overnight, all of the libraries that we all grew up on and know and love in NLP became significantly outdated in terms of accuracy. We were among the first ones who really took the new deep learning approach, and productized it. We had production grade NER and text classification on BERT out five months after the first paper. Then we started training and tuning models from that, and kept expanding with sentence embeddings and language models. Then we took to combining different kinds of algorithms and embeddings together in one pipeline, and that involved a lot of data science work to make it trainable and tuneable. But there was also a lot of engineering work involved, because what people do is: here's a cool paper and here's my code on GitHub, and here's a copy of my model, and that's all great. But if you want to have something that actually is production grade, usable, and trainable, it’s very hard to take those outside academia today.
It’s not just engineering, there's a lot of data science work as well. We’re a healthcare focused company, so when we say we deliver state-of-the-art NLP, we mean we have peer reviewed papers that show that especially on the biomedical and clinical side, we deliver better accuracy on competitive public benchmarks than others. The main thing people care about is accuracy. Yes, they do want it to be scalable in production and all of that but really, they will sacrifice a lot for accuracy. So our main promise to the community and to the industry is: yes, we'll give you the most accurate results you can get at any time. A lot of what we do is to keep reimplementing things, tuning models and training models so that out of the box you get state-of-the-art accuracy.
On the business side, here's the model: Spark NLP is open source, Apache 2.0 license, and we have now more than 4,000 open-source models that we train, support, and test. We then have two commercial products on top of it: one is Spark NLP for Healthcare for clinical and biomedical NLP, and the other is Spark OCR which exists to handle visual documents. You have a smaller number of enterprise companies which are basically subsidizing this much larger open source community, to the benefit of both, and that's the whole model. In an industry survey last month it was shown that among healthcare and life science practitioners, 59% of them use Spark NLP. So Spark NLP is absolutely and by far the most common library used in healthcare and life sciences.
What made you specialize in Healthcare and Life Sciences? That's a very regulated and conservative industry, and it seems like a big risk to go into that.
I am very happy you think that way, because that's one of the reasons I like it. So yeah, healthcare is very hard. It's hard for three reasons. First is that the language is very different. If you look at movie reviews, you can often debug your model just by looking at results. Right? But if you look at radiology reports and pathology reports, unless you went to med school… very often you will have data scientists that say the model is great, and a pathologist will look at it and say: Oh, this is junk. You and I can tell that when the model says a man is pregnant, it's false, but you have things that are just as obvious to an oncologist who can say: oh, no, you cannot take this medicine this way while you're on this course of treatment, what the hell are you talking about? So you need to actually know medicine – we have medical doctors on the team. The second challenge is that it's technically harder. I came from Microsoft through Bing. I did work on text, and I thought what a lot of people think when they go into healthcare: Yeah, I can just replay what I did for ecommerce or travel. But it just does not work when you try the same approach on medical data. What you see in clinical and biomedical NLP is really a different set of journals, a different kind of workshops, and even different benchmarks. It's a very interesting technical problem. The third challenge is regulatory. It's very hard to get data. You cannot crowdsource any labeling, because first of all, it's mostly illegal to even show it to someone, but even then you need experts, real medical experts, to do this.
There are two ways to look at these challenges. Yes, it's a good way to scare people off. Healthcare is one of those areas where Microsoft, Google, and Apple all had to scale back their initiatives and timelines, after they really tried for years and years. So that's the one vertical everybody has difficulty with. From a business perspective, there's a very high barrier to entry, which works well for us.
Though for me, really, here's why I got into it. I've done ecommerce, finance, retail, marketing-tech, and a few other things. But once you work in healthcare, and you understand that your models, not always, but every once in a while, can actually save some people, and some people are still walking around because of something you did, then you just can't go back to anything else.
What are, in your view, some of the biggest breakthroughs that we can expect in the next few years from AI in Healthcare and Life Sciences?
That's the big question. I think it's gonna be very gradual. Advancements in healthcare take decades, they don't take months or quarters. The big bet with the COVID vaccine was exceptional, in the sense that it happened during one year. This is something that normally will take ten years. Here's a completely new technology that actually went from the lab, to billions of people getting vaccinated.
Last year, we saw the first IPO of a company that developed a drug via AI, and I think much more of that will happen. Drug discovery used to be PhDs in biomedicine reading papers, and trying to find this protein or this molecule, understanding biological mechanisms and recommending what should go to a first clinical trial. Now, a lot of that becomes pooling all of the academic papers ever written, and all of the known ontologies that were curated, using NLP to extract information from the text, using graph neural networks to do link prediction, and then you find new drug candidates that completely change this industry. That is happening very, very quickly right now.
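The idea of extracting relations from text and then predicting missing links can be illustrated with a deliberately simplified sketch. It is not the approach of any company mentioned here, and it does not use a graph neural network: it substitutes a plain common-neighbors heuristic so the example stays self-contained. The triples, entity names, and function names are all hypothetical.

```python
from collections import defaultdict
from itertools import product

# Hypothetical (subject, relation, object) triples, standing in for facts
# that an NLP pipeline might extract from papers and curated ontologies.
triples = [
    ("drug_A", "inhibits", "protein_X"),
    ("drug_A", "inhibits", "protein_Y"),
    ("drug_B", "inhibits", "protein_Z"),
    ("protein_X", "implicated_in", "disease_1"),
    ("protein_Y", "implicated_in", "disease_1"),
    ("protein_Y", "implicated_in", "disease_2"),
    ("protein_Z", "implicated_in", "disease_2"),
    ("drug_B", "treats", "disease_2"),
]


def build_graph(triples):
    """Build an undirected adjacency map, ignoring relation types."""
    neighbors = defaultdict(set)
    for subject, _relation, obj in triples:
        neighbors[subject].add(obj)
        neighbors[obj].add(subject)
    return neighbors


def rank_candidate_links(triples, drugs, diseases):
    """Score every unlinked drug-disease pair by its shared neighbors.

    A common-neighbors score is a classic link-prediction baseline; a graph
    neural network would learn a scoring function instead of counting.
    """
    neighbors = build_graph(triples)
    scored = []
    for drug, disease in product(drugs, diseases):
        if disease in neighbors[drug]:
            continue  # link already known, nothing to predict
        shared = neighbors[drug] & neighbors[disease]
        scored.append((len(shared), drug, disease, sorted(shared)))
    # Highest score first, ties broken alphabetically for stable output.
    scored.sort(key=lambda item: (-item[0], item[1], item[2]))
    return scored


ranked = rank_candidate_links(
    triples, drugs=["drug_A", "drug_B"], diseases=["disease_1", "disease_2"]
)
for score, drug, disease, evidence in ranked:
    print(f"{drug} -> {disease}: score={score}, via {evidence}")
```

In this toy graph the top candidate is the drug whose known protein targets overlap most with the proteins implicated in a disease. In a real setting a prediction like that would only be a starting hypothesis for experts to review.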
The other thing that I think is a big change underway, is on real world data and real world evidence. The way we develop drugs is that we do clinical trials, and it's a nice form of A/B testing, but it's skewed by definition. First of all, when you do a clinical trial for a new drug, you say I want only adults, no one who is pregnant, no people who are currently undergoing cancer treatment, or have hepatitis, or have one of a long list of no’s. Also, by definition, you're biased towards people who have access to hospitals, because the clinical trial sites are usually within large research hospitals. So a lot of the population is excluded from clinical trials. You can then say: this is a great drug, everyone should get it, and then everyone gets it. Then the question is: how does it actually behave in all the environments you haven't tested it in? What happens for people who are taking this drug but they are pregnant, or they are type one diabetic, or who have one of the 700 other common things that people experience in real life? Now that we have actually digitized the data for the first time in history in electronic records, why don't we look? Now we can look at things that just are not statistically significant in a trial, or that you will not be doing in a trial because it can be dangerous – like drug interactions, interactions with other comorbidities, ethnicity, and other factors that we didn’t properly cover before. Things that also single clinicians cannot do, because you only see so many patients, so that's a big game changer in the industry. When you use structured data, there's only so much information about the patient, like age, gender, insurance, and prescriptions. If you diagnose something, you have diagnosis codes which are often not exact, but more importantly there are many things that only exist in text, and that we really need to deal with medically. 
Things like undiagnosed depression, undiagnosed diabetes, a kidney disease that's evolving before it becomes acute, and a lot of things that aren’t strictly clinical – social determinants of health. NLP is playing that role. We have a number of public case studies already underway where NLP is reading medical records at scale, in oncology, cardiology, pathology, and in mental health.
What are some of the main challenges that your healthcare customers are seeing in adopting AI?
This really is an emerging industry, so we need to educate the customers on how this actually works. What is doable versus what is science fiction? How would this work in production? How do you correctly set it up in a highly regulated setting? These are some of the things we do very well. For example, we don't deliver models as SaaS; we can deploy on a customer’s infrastructure. We also deploy in regulated environments where the code is hardened in many different ways. Definitely one of the challenges in healthcare, and one of the reasons it moves slowly, is that it's pretty much illegal for you to work the way you work in other industries. You cannot just download libraries and install them, and download data sets and other models from TensorFlow Hub and use them. That’s often just illegal.
Another big challenge is MLOps. In other industries, people often start without it – you deploy your model and then see what happens. In healthcare, you just don't do that. One of the things that I'm impressed with in healthcare and in pharma is that there's a lot of focus on ethics from the get go. Compared to the “Ethical AI” discussions that are just starting now, the medical profession has been arguing about how to safely use new technology for literally thousands of years.
Doctors get it and will tell you outright: It's nice that you have this new software system, but I have real patients, those are actual people who trust me. So if you think that you're going to just throw something on the screen, like this person should switch from 20 to 80 milligrams, that’s just not happening, not on my watch. And I agree with this, because some days, I'm the patient, right? Every one of us is. Another thing in healthcare is that it's not okay that the model works really well for most of the population. You cannot say that this model would heal most people, but it kills pregnant people, but that’s only 2% of people, so don't worry about it. That works in a marketing campaign, when you send promotional emails. In healthcare, it's not alright at all. First, do no harm.
How do you guys do research and what kind of topics do you pick?
In the first place, I read a lot of papers and I look at a lot of open source projects and so does my team. NLP is really fun in the sense that it's a very fast moving area right now. Today, if you publish a paper and you obtained new state-of-the-art accuracy for a question answering challenge, you stay at the top for maybe eight weeks before another paper outperforms you. That's also one of the things we are seeing every time when people have “this one idea that changes everything”. Because after you wait two months, there'll be yet another game changer. Every time. So we look at new papers, we look at conferences, we look at a lot of open source.
One thing we try to go a bit out of our way to support is multilingual models – for things that are outside the USA. One of the interesting things that happened in the past two or three years is that we can now build very large multilingual models. Before, most of the work on NLP was in either English or Chinese, but now we can provide state-of-the-art support for more than 200 languages. So we made it a goal to provide state-of-the-art language understanding to all of them, in a free and open-source way.
Another interesting area is in OCR technology. Beyond just text extraction from scanned documents, the focus right now is on the union of computer vision and text. A lot of the work we're doing right now is trying different deep learning approaches to do form understanding, or models that do table extraction using both visual signals and text signals.
How do you, especially with a distributed team, keep up with these very fast advances? How do you make sure that you efficiently spot what's going on, what's relevant, and what's new?
We do not have anything magical there. We're out there, we read the papers, we go to workshops. We look at open source, and we speak at conferences. We are on Reddit, we always go on GitHub, and we have an open Slack channel. A lot of people ask us questions and ask our opinion – so usually when something new comes we know about it quickly. Customers will come and say that this new startup features this thing – so do you support it? I don't think there's any magic other than being really embedded in the community.
I will have to challenge you on that point later, David! Thanks for the interview.