Building Data Science teams
We interview Alessandro Pregnolato, an experienced leader of Data Science, Analytics, and ML teams.
Alessandro is the VP of Data at Preply, a 1-on-1 online language tutoring marketplace. He’s been successful at building and managing several Data & Analytics teams within Barcelona's tech scene.
He fell into data by chance when he joined Adobe as a production planner and learned basic BI skills out of a need to automate Excel reports. After some consulting work, he switched to the start-up world by joining Softonic, where he developed an interest in Data Science and Machine Learning. As a result, he moved to King as a Data Scientist. Guess what, he even completed Andrew Ng’s legendary Coursera ML course! Since then, he has built and scaled Data teams and infrastructures at Typeform (a form-building SaaS startup), Marfeel, Paack, Moonpay, and, most recently, Preply in Barcelona, where he’s currently based.
What's the most challenging aspect of building and running a team of Data Scientists, Analytics, and ML people? What do you spend the most time worrying about?
I think the most important thing is hiring. If you hire well, most problems will solve themselves. I am a very strong supporter of Management 3.0. Hence, I believe that performance comes with accountability, which comes with enablement, which comes with delegation, which comes with trust, which comes with people you can trust.
Basically, the idea is that all recipes fail if implemented outside the very specific contexts in which they were designed. If you manage by recipes, you’ll end up micromanaging. Your people will get depressed or, at the very best, switch to a reactive mode. They're going to stop thinking and just do what they're told. You won’t retain any talent with that, just mediocrity. The solution is creating some boundaries and communicating them to the team very clearly. At the beginning, I'll be mildly prescriptive with them while explaining the principles behind my reasoning.
If I do things right, within a few months, they’ll no longer ask me anything. Simply because they'll know what my answer’s going to be. By then they’ll know, for example, that I am a data governance freak. So if they plan to undermine it, they won’t ask for my approval but simply reconsider. Same goes for going top down on their reports, or allowing stakeholders to treat them like data monkeys. Not within the principles. Not acceptable. The beauty of such an approach is that, within the boundaries of these principles, the team are entirely free to operate. No need for moderation.
You probably have a lot of interaction with business stakeholders and the data team, so you act as an interface there. Have you identified some common misconceptions that business leadership folks have around data? For example, things that are often demanded but are simply wrong or not feasible?
The most harmful thing is a tendency of some people to go to data scientists with "solutions". The typical scenario is a stakeholder who asks for a data point while providing no context. That's the most frustrating and heartbreaking example. “What is the correlation between variable x and y?” “What is the churn rate of people who look like y and z?”
That’s a symptom of somebody who has been doing all the thinking already. Because everybody loves to do the thinking and figure out the solution in their head. Which is often wrong, either because they don't have the skills or because they tend to oversimplify.
Firstly, that’s a recipe to screw it all up. Secondly, it makes the data person in question feel like a monkey. Excluded from the thinking process, merely providing random pieces of data, day in, day out. Plus, this won’t ever scale. When people come in with a question, you’ll give them an answer. Of course, a single data point won’t solve the problem. So they're gonna come back with 10 more questions. You're gonna give them 10 answers, and they're gonna come with 100 more... We could put 100 analysts there. They're gonna produce 100 answers, and receive 1000 questions in return, etc.
In comparison, a healthy situation is one in which a stakeholder is mature enough to simply outline the problem they’re trying to solve. What is the ultimate question that we're trying to answer?
This way, we can help them refine it. We can be involved in the process of defining the best way of addressing it, which, in 99% of cases, is not the data point that they’d be asking for. Maybe because there's a better way of doing it. Sometimes, because whatever they asked didn't make sense in the first place.
Let's move on to talk a bit about modern Machine Learning. You've been vocal about your skepticism towards how useful ML is and how overhyped it is. Can you speak about that?
Well, first of all, I must be humble and admit that I’m not a machine learning expert. I am someone who has a fairly clear, functional understanding of ML and its applications in a typical business environment. Some of my statements won’t generalize to other contexts. I've come to see some impressive applications of ML, like generating text, for instance. Automatic product descriptions are fairly advanced. Also creative writing. I've seen some innovation. So I'm not saying that there is nothing new. But when we think of our expectations back in 2014, they have been completely frustrated.
All of us were expecting ML to enable applications that were not possible or even conceivable before. Such as self-driving cars, for example, which we're still far away from achieving. They do exist but they're not reliable enough. You don't hear that much about them anymore. What was the latest breakthrough?
At Typeform for instance, we had a vision for forms to write themselves one day. Or understand the users and adapt to their mood. This didn't happen. What was possible already (regression, classification and anomaly detection) improved somewhat and became a commodity instead. How boring.
It's interesting to hear your perspective, and it makes a lot of sense for a lot of existing companies and products. However, I see a lot of companies building a business model from scratch on new advancements in research. Language Models are one example: companies are emerging to provide them as a service, which is a very interesting direction for ML.
When it comes to making sure expectations from ML engineers are aligned with reality, what's your approach? For instance, imagine an engineer that works for you comes to you very excited about this new model and how it could be used to prototype this crazy cool feature. How do you manage your skepticism level?
Unfortunately, it doesn't happen as often as you might think. Data scientists are limited by a roadmap that is set by a product owner. They don’t have as much room to play as I would like them to. Yet, if they came to me with an idea, any idea, I would receive it with enthusiasm. Again, we have principles in place, and such principles act as boundaries.
In this case, boundary number one would be a strong connection with the business and the customers. One big danger with data scientists, ML people and any technical folks is that they tend to fall in love with technical solutions. Thinking: oh, that’s so cool. This library is amazing. That’d be fun to build. I can write a paper about it, etc. I’m thinking: what problem does it solve? How does it impact the top line?
From a product perspective, I’d encourage them to build a proof of concept. I don't even want to call it an MVP. The smallest thing we can do to validate (1) whether customers find this valuable, and (2) how difficult and viable it is. Ideally, this first iteration would last one day. Two hours if possible. Okay, let's make it a week. By no means should we invest more before getting some strong positive signals.
That makes sense. Could you share a story that shows a big failure to meet expectations in an ML project? A specific feature for which you were super excited and ended up flopping.
It was at Typeform, the first time I tried to apply some machine learning at scale. It seemed like a very simple task: classifying typeforms by use case. Which ones were an "order form", a "feedback form", a "contact form", and so on and so forth. Our data scientist proposed an unsupervised approach, using Latent Dirichlet Allocation (LDA) to create clusters of topics.
We didn't put much thought into it, and we just went for it. A few weeks later, we produced something that was completely unusable. It had zero value. Why? There were two big problems. Firstly, we classified forms by topics. It turned out that a topic and a use case are not quite the same thing. We found, for instance, a lot of forms about food but we had no indication of the underlying job-to-be-done. Menu? Delivery? Feedback? Restaurant reviews? No idea.
Secondly, use cases are not black or white. There might be a form that’s asking for feedback, and then your contact details. This form would be like 50/50.
After this first iteration, I picked 20 forms and asked everyone in my team to classify them. To my horror, I saw that each person produced completely different results.
If only we’d done so at the beginning. It was a 15-minute test. Enough to reach the conclusion that the task is too ambiguous and, therefore, cannot be accomplished.
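Editor's sketch: one way such a 15-minute labeling test could be scored is with an inter-rater agreement statistic such as Fleiss' kappa, which is about 0 when raters agree no more than chance and 1 when they agree perfectly. The interview does not say how the results were compared, so the function name, the label names, and the sample data below are all hypothetical and only illustrate the idea.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels (one per rater).

    Assumes every item was labeled by the same number of raters (at least 2).
    """
    if not ratings:
        raise ValueError("need at least one rated item")
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(labels) != n_raters for labels in ratings):
        raise ValueError("every item needs the same number (>= 2) of raters")

    label_totals = Counter()  # how often each label was used overall
    agreement_sum = 0.0       # sum of per-item agreement scores
    for labels in ratings:
        counts = Counter(labels)
        label_totals.update(counts)
        # Share of rater pairs that agree on this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += agreeing / (n_raters * (n_raters - 1))

    observed = agreement_sum / n_items
    # Agreement expected by chance, given how often each label is used.
    expected = sum((t / (n_items * n_raters)) ** 2 for t in label_totals.values())
    if expected == 1.0:
        return 1.0  # only one label was ever used, so agreement is trivial
    return (observed - expected) / (1.0 - expected)


# Hypothetical sample: 4 forms, each labeled by 3 teammates.
sample = [
    ["feedback", "contact", "feedback"],
    ["order", "order", "order"],
    ["contact", "feedback", "order"],
    ["feedback", "feedback", "contact"],
]
print(f"Fleiss' kappa: {fleiss_kappa(sample):.2f}")
```

A low score on a small sample like this, before any modeling starts, is a cheap signal that the label definitions are too ambiguous for humans, let alone for an algorithm.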
Now, the data scientist in question did a great job with the algorithm, from a technical standpoint. Such technical data scientists tend to think in terms of input and output. You cannot expect them to think from a business perspective, because that’s not in their blood. Exceptions apply, of course. Those are unicorns.
Yeah, the question was somewhat ill-defined from the get-go. That is such a good lesson indeed. All right, this marks the end of our interview. Thank you for your time!
Thanks for having me!