The biggest wastes in data science and machine learning don’t stem from inefficient code, random bugs, or incorrect analysis. They stem from flaws in planning and communication. Execution mistakes can cost a day or two to fix, but planning mistakes can take weeks to months to set right. Here are five ways you can avoid making those mistakes in the first place:
1. Set the right objective (function)
Mathematician and data analysis pioneer John Tukey argued that an approximate answer to the right question is far better than an exact answer to the wrong question. Machine learning solutions work by optimizing towards an objective function: a mathematical formula for the quantity the model tries to maximize or minimize. One of the most basic examples is a profit function: Profit = Revenue – Costs.
While machine learning algorithms excel at finding the optimal solution, they can’t tell you if you’re maximizing the right thing at the right time. Periodically make sure that your objective function reflects your current priorities and values. For example, an early-stage company may not worry as much about profitability; instead it may want to maximize revenue in order to increase market share. A company preparing for an IPO may want to demonstrate profitability, so it may focus on minimizing costs while maintaining the same level of market share. If you capture only the metric that matters today (revenue, say) at fixed points in time (quarterly, say), you will struggle to model a different objective (such as profitability) over a different time frame later.
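To make the point concrete, here is a minimal sketch of how the same decision changes when the objective changes. The prices, expected unit sales, and unit cost are made-up numbers for illustration only, and the function names are hypothetical.

```python
# Hypothetical pricing options: price -> expected units sold (made-up numbers).
OPTIONS = {8: 1000, 10: 900, 12: 700, 15: 500}
UNIT_COST = 6  # assumed cost to deliver one unit


def revenue(price, units):
    return price * units


def profit(price, units, unit_cost=UNIT_COST):
    # Profit = Revenue - Costs
    return revenue(price, units) - unit_cost * units


def best_price(objective):
    """Return the price that maximizes the given objective function."""
    return max(OPTIONS, key=lambda p: objective(p, OPTIONS[p]))


print(best_price(revenue))  # revenue-maximizing price
print(best_price(profit))   # profit-maximizing price
```

With these illustrative numbers, the revenue objective picks a price of 10 while the profit objective picks 15. The optimizer is equally "right" in both cases; only the question changed.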
Along those lines, data scientists can also fall into the trap of optimizing model metrics rather than business metrics. For example, data scientists may use the area under a precision-recall curve or a receiver-operating-characteristic curve to evaluate overall model performance, but those curves don’t necessarily translate to business success. Instead, an objective like “Minimize false positives while maintaining a total false negative rate of X%” can be specific to your current business conditions, and can be used to weigh the specific costs of false positives and false negatives. Capturing raw, event-level data before it is aggregated, and periodically re-examining the problem you’re trying to solve, will keep you moving in the right direction instead of optimizing for the wrong problem.
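An objective such as “minimize false positives while keeping the false negative rate at or below X%” can be turned into a simple threshold search. The sketch below is one possible way to do it, assuming a binary classifier that outputs a score per example and predicts positive when the score meets the threshold; the function name and the sample data are hypothetical.

```python
def pick_threshold(scores, labels, max_fnr):
    """Find the score threshold with the fewest false positives whose
    false negative rate stays at or below max_fnr.

    An example is predicted positive when score >= threshold.
    Returns (threshold, false_positives, false_negative_rate),
    or None if no candidate threshold satisfies the constraint.
    """
    positives = sum(labels)
    best = None
    for t in sorted(set(scores)):
        fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < t)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        fnr = fn / positives if positives else 0.0
        if fnr <= max_fnr and (best is None or fp < best[1]):
            best = (t, fp, fnr)
    return best


# Made-up scores and ground-truth labels for illustration.
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

print(pick_threshold(scores, labels, max_fnr=0.25))
```

Changing `max_fnr` is how the business condition enters the model: a stricter limit on missed positives pushes the threshold down and accepts more false positives in exchange.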
2. Get on the same page
To your business stakeholders, there’s a huge difference between “We saw a 100 point increase in accuracy in the test set of 100,000 examples” and “If we had these improvements in place, we would have saved $20,000 in the last business quarter.” “100,000 examples” and “100 point increase” are hard to visualize, whereas “$20,000” and “last business quarter” tend to be a lot easier for business stakeholders to grasp. Standardize your units of analysis so that your team and the business leaders spend less time translating and more time ideating.
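One way to do that translation is to attach a dollar cost to each kind of error and compare models over the same business period. The following is a rough sketch under stated assumptions: the error counts and per-error costs are invented for illustration, and in practice they would come from your own finance or operations data.

```python
def error_cost(counts, cost_fp, cost_fn):
    """Total dollar cost of a model's errors over one evaluation period."""
    return counts["fp"] * cost_fp + counts["fn"] * cost_fn


def dollars_saved(baseline, candidate, cost_fp, cost_fn):
    """Dollars the candidate model would have saved versus the baseline."""
    return (error_cost(baseline, cost_fp, cost_fn)
            - error_cost(candidate, cost_fp, cost_fn))


# Hypothetical error counts for last quarter's data.
baseline = {"fp": 400, "fn": 150}
candidate = {"fp": 300, "fn": 100}

# Assumed costs: $50 per false positive, $300 per false negative.
print(dollars_saved(baseline, candidate, cost_fp=50, cost_fn=300))
```

With these assumed inputs the candidate model comes out $20,000 cheaper for the quarter, which is the kind of statement a stakeholder can act on without knowing what a confusion matrix is.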
The points in time that matter can also differ by business stakeholder. A sales or customer success practitioner may need weekly, monthly, or event-based measures (e.g. first subscription, renewal, or support request events), while a revenue leader may need models per business segment, sales rep, or product line on a quarterly or yearly basis. Collect data at the event level to support these different computation windows as they arise.
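The benefit of event-level data is that the same records can be rolled up to whatever window a stakeholder asks for. A minimal sketch, assuming each event is a (date, event type, amount) tuple; the event types, amounts, and helper names are hypothetical.

```python
from collections import defaultdict
from datetime import date


def month_key(d):
    return (d.year, d.month)


def quarter_key(d):
    return (d.year, (d.month - 1) // 3 + 1)


def roll_up(events, period_key):
    """Sum event amounts per (period, event type)."""
    totals = defaultdict(float)
    for day, kind, amount in events:
        totals[(period_key(day), kind)] += amount
    return dict(totals)


# Made-up event log for illustration.
events = [
    (date(2021, 1, 15), "subscription", 100.0),
    (date(2021, 2, 3), "renewal", 80.0),
    (date(2021, 2, 20), "subscription", 120.0),
    (date(2021, 4, 9), "renewal", 80.0),
]

monthly = roll_up(events, month_key)      # view for a customer success lead
quarterly = roll_up(events, quarter_key)  # view for a revenue leader
print(monthly)
print(quarterly)
```

Had only quarterly totals been stored, the monthly view could not be reconstructed; going from events to either window is a one-line change.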
We’ve been on teams where the train and test sets were left to the whims of each data scientist. Our analyses weren’t comparable to one another, and the model metrics we used were incomprehensible to the stakeholder. Once we standardized on business metrics and on time frames meaningful to the business (e.g. all deals from last quarter, subscription activity in the last month), it became easier to compare models internally and externally, and easier to present impactful business cases for using our models.
3. Allow room for discovery
Data science is an inherently creative endeavor, and advancements in models often come from unexpected places. The biggest breakthroughs come from exploring new avenues and new opportunities. One of the beautiful things about data science is that it takes ideas and methods from a broad array of scientific disciplines. Algorithms developed for genetics are used to analyze literature, and methods for analyzing literature can be adapted to make romantic matches on a dating app or provide recommendations for a vacation.
Advances in solutions often come from looking at the same problem from a different angle or frame of reference. For example, some of the first models didn’t take demographic information into account. For a long time now, data scientists have understood that including demographic data may help ads reach the right person or measure unintended bias. Then, when the frame of psychology was introduced, data scientists began looking at the problem from a psychographic angle: Can demographics and demonstrated interest improve results? For example, adding in data about what someone shared on social media could provide a link to what they are likely to buy. Recently, near-real-time, event-based behavioral data has entered the space, bringing both new information and time into the picture. A few very small gas station purchases followed minutes later by a very large TV purchase may signal a stolen credit card.
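The stolen-card pattern above shows why event timing matters: the signal is in the sequence, not in any single purchase. A toy rule-based sketch follows; the dollar thresholds and the ten-minute window are arbitrary assumptions for illustration, not values from any real fraud system.

```python
from datetime import datetime, timedelta


def flag_small_then_large(purchases, small_max=5.0, large_min=500.0,
                          window=timedelta(minutes=10)):
    """Return large purchases that follow a small purchase within `window`.

    purchases: list of (timestamp, amount) tuples for a single card.
    Thresholds and window are illustrative assumptions.
    """
    ordered = sorted(purchases)
    flagged = []
    for i, (ts, amount) in enumerate(ordered):
        if amount < large_min:
            continue
        # Look back at earlier purchases for a small "test" charge.
        for prev_ts, prev_amount in ordered[:i]:
            if prev_amount <= small_max and ts - prev_ts <= window:
                flagged.append((ts, amount))
                break
    return flagged


# Made-up purchase history for one card.
t0 = datetime(2021, 7, 1, 12, 0)
purchases = [
    (t0, 2.0),
    (t0 + timedelta(minutes=3), 1.5),
    (t0 + timedelta(minutes=8), 899.0),
    (t0 + timedelta(hours=2), 650.0),
]

print(flag_small_then_large(purchases))
```

Only the 899.0 purchase is flagged here, because the 650.0 purchase has no small charge in the preceding ten minutes. A daily or monthly total of the same data would hide that distinction entirely.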
While you don’t want to spend all your time running down rabbit holes and chasing down wild geese, setting aside time to try new and creative solutions or explore different angles will pay off in the long run in new capabilities, better models, and faster time to results. Whether it’s setting aside time every week to chase down new leads or try new things, or allowing exploration tasks into your workflow, in the long run y