As an operating advisor at Monk’s Hill Ventures, I talk to startup founders about how they can do that. These startups range from early-stage companies to more mature companies in the pre- or post-IPO stage. Some companies already have established data science teams and are keen to scale them out, while others are figuring out how to hire their first data scientists, or even deciding what kind of data they should be collecting. Nevertheless, a few common themes keep coming up, and I hope to address some of them here.
AI or Machine Learning?
AI and machine learning are buzzwords these days, and we've seen many companies pitch themselves as applying AI to X or using machine learning for Y. The truth is, there is a spectrum along which companies lie.
At one end of the spectrum, data and machine learning themselves are the product. The company's value proposition is applying AI to a particular vertical such as fintech or digital health. Often, these are B2B companies that provide a service to existing players by helping them make sense of their data. For example, in fintech, this could include building a credit scoring model for their customers. Other examples are AI companies that help with speech recognition, computer vision, or conversational bots for customer service.
Somewhere in the middle of the spectrum, we have companies where data and machine learning algorithms are not the product, but are still crucial in enabling the company to provide a superior product experience for its customers. These companies can be either B2B or B2C. For example, NinjaVan is a B2B company that helps e-commerce companies with their delivery needs, and it is able to offer a superior customer experience because of its route-optimization algorithms.
Finally, at the other end of the spectrum are companies whose main business model does not depend on data. Nonetheless, being able to harness their data can greatly help them understand their customers and their business. This can even apply beyond technology startups to brick-and-mortar companies. The business may already be doing quite well, but with the added ability to predict what its customers would buy at a particular price, or to segment its customers better, it could significantly boost its sales. These all happen to be classic machine learning problems.
Thus, it is crucial to figure out where in the spectrum your company lies as a first step toward figuring out your company's data strategy.
Should I be hiring my first data scientist or data engineer?
Often, when companies want to start building out their data science teams, they think of looking for a data scientist. What often happens is the data scientist expects to build data models and produce insights with the data, only to find out that the data is not logged, or is incomplete and unreliable. The data pipelines, if they exist at all, are unstable and scheduled jobs fail.
Other times, the company may already have an enterprise data warehouse, but access to the data may be limited by a business intelligence tool that’s optimised for simple SQL queries rather than more computationally intensive ML workloads.
In these cases, the company actually needs to hire a data engineer first, instead of a data scientist.
A data engineer's job is to implement the data infrastructure. This involves logging and storing the data, as well as writing and scheduling the batch jobs that aggregate these logs into data tables, which act as the basis for further data insights. The data engineer makes important technical decisions early on, such as whether to adopt third-party solutions or build the technology in-house. Only when the data infrastructure is stable will hiring a data scientist be productive.
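To make the idea of a batch job that aggregates logs into a data table more concrete, here is a minimal sketch in Python. The log fields (`ts`, `user_id`, `event`, `amount`) and the function name `aggregate_daily_orders` are hypothetical, chosen only for illustration; a real pipeline would read from log storage and write to a warehouse table on a schedule.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw event logs; the field names are assumptions for illustration.
RAW_LOGS = [
    {"ts": "2023-01-01T09:15:00", "user_id": "u1", "event": "order", "amount": 12.5},
    {"ts": "2023-01-01T17:40:00", "user_id": "u2", "event": "order", "amount": 7.5},
    {"ts": "2023-01-02T08:05:00", "user_id": "u1", "event": "order", "amount": 20.0},
    {"ts": "2023-01-02T11:30:00", "user_id": "u1", "event": "page_view", "amount": None},
]


def aggregate_daily_orders(logs):
    """Roll raw order events up into one summary row per calendar day."""
    totals = defaultdict(lambda: {"orders": 0, "revenue": 0.0, "users": set()})
    for row in logs:
        if row["event"] != "order":
            continue  # only order events contribute to this table
        day = datetime.fromisoformat(row["ts"]).date().isoformat()
        totals[day]["orders"] += 1
        totals[day]["revenue"] += row["amount"]
        totals[day]["users"].add(row["user_id"])
    # Emit rows sorted by day, as they might be written to a summary table.
    return [
        {
            "day": day,
            "orders": t["orders"],
            "revenue": t["revenue"],
            "unique_users": len(t["users"]),
        }
        for day, t in sorted(totals.items())
    ]


daily_orders = aggregate_daily_orders(RAW_LOGS)
for row in daily_orders:
    print(row)
```

A summary table like this is the kind of stable, well-defined input that lets a data scientist start on analysis rather than on plumbing.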
That's not to say that one shouldn't hire a data scientist early on. Rather, what we've seen is that the earlier the stage of the company, and the less mature the data infrastructure, the more "full stack" the data scientist should be. That is, it will be immensely helpful for the early data scientist to also be well versed in other aspects of the data stack, including data engineering, front-end work, data visualization, and database administration.
It is also important that the early data scientist comes in with the right set of expectations.
A good data scientist knows that much of the job involves cleaning and wrangling data. This part of the job is not always the most glamorous, but it is an important part of the process of deriving insights from the data, and should never be outsourced.
That’s because data cleaning involves quite a bit of judgement. It isn’t just a mechanical task: deciding how to replace dirty data, whether to omit outliers, or how to deal with missing data requires an understanding of how the data will be used. The data scientist who is hired early on into the company will inevitably have to grapple with this more.
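As a rough illustration of why these are judgement calls, the sketch below uses made-up order values and three hypothetical helpers (`drop_missing`, `fill_missing_with_median`, `flag_outliers_iqr`). It assumes a simple interquartile-range rule for outliers, which is only one of many possible conventions; whether the flagged value is an error or a genuine large order depends on how the data will be used.

```python
import statistics

# Made-up order values for illustration; None marks a missing record.
order_values = [20.0, 22.0, None, 19.0, 21.0, 500.0, None, 23.0]


def drop_missing(values):
    """Option 1: discard missing records entirely."""
    return [v for v in values if v is not None]


def fill_missing_with_median(values):
    """Option 2: keep every record, replacing gaps with the observed median."""
    median = statistics.median(drop_missing(values))
    return [median if v is None else v for v in values]


def flag_outliers_iqr(values, k=1.5):
    """Flag values more than k * IQR outside the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


observed = drop_missing(order_values)
outliers = flag_outliers_iqr(observed)
without_outliers = [v for v in observed if v not in outliers]

print("flagged as outliers:", outliers)
print("mean with outliers:", statistics.mean(observed))
print("mean without outliers:", statistics.mean(without_outliers))
print("median-filled series:", fill_missing_with_median(order_values))
```

The two means differ substantially, so the choice of whether to keep the flagged value changes the answer. That choice cannot be made mechanically; it needs someone who knows what the number will be used for.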
How do I scale my data science team?
For a company that already has a data science team set up, how should they scale it as the business scales? We've seen several workable ways to do this.
One is by specialization. As the team scales, it is possible to move from general data science hires to experts in different parts of the data science stack. These roles include data visualization scientist, machine learning engineer, and business analyst. This allows the company to draw on diverse talents that may be more effectively applied to different problem sets. For example, a machine learning engineer has expertise in backtesting and productionizing algorithms, and can work well alongside the engineering team; a business analyst may have business or finance experience that they can apply to business strategy problems.
Another way to scale the team is by function. We've seen successful models where teams are split by function, including business intelligence (serving business teams and management who want a good, legible view of the business) and product data science (embedded with product teams to help with product metrics, experimentation, etc.). As the product team scales, the product data science team can be further split to serve different product verticals.
What makes for an effective data scientist?
These days, with many online resources on data science, and university courses specializing in data science and machine learning, it is not difficult to find people who are familiar with the tools and applications of specific data science techniques.
In our experience, though, what makes a data scientist really effective is the ability to apply domain expertise and use the tools to solve a real problem that impacts the business. A great data scientist is thus one who is able to break down a problem and ask the right questions, in addition to answering them.
These skills come with experience. We’ve found that junior and mid-career data scientists who come from bigger tech companies, and have cut their teeth learning from the best there, often make for good hires. With the tech ecosystem in Singapore and broader Southeast Asia burgeoning, we are also beginning to see more of such data science talent emerge.
Written by Linus Lee, operating advisor at Monk’s Hill Ventures. He was the Head of Data Science at Twitter Singapore and is the co-founder of Basis AI.
Many thanks to Silvanus Lee and Feng-Yuan Liu for reviewing this post and offering suggestions.