Lessons from The Black Swan – A Follow-up


In the original article, “Qualitative Data in Surveys: Lessons from The Black Swan,” we discussed how qualitative data differ from quantitative data, especially in their distribution, and how that can affect the gathering of useful insights. Specifically, unlike with quantitative data, a single outlier comment (or a handful of them) can turn out to be managerially very useful, even if it is statistically an anomaly.

But the problem lies in getting to those few outlier needles in the haystack. The argument then was that manual coding would likely be inadequate to the task at hand. Primarily, this is because manual coding focuses on categorizing comments into major categories and usually slots one-off comments into an “Other” category that may not receive sufficient attention. And a coder (especially in an outsourced process) may not be knowledgeable enough to see the relevance of a stray comment.

So, the recommendation was for a knowledgeable analyst to read through every comment. In most cases, though, that is an impractical recommendation. Further, manual coding itself is used far less frequently, given the time, expense and difficulty involved. How do we get around these problems?

In the time since the first article was written, technology has advanced enough to offer a solution.

Recent evolution of text analytics

The original article was somewhat dismissive of text analytics as a potential solution to the problem. This was because, at the time, text analytics had not advanced sufficiently to help in this type of situation.

Digital computers work directly on numbers; even complex calculations are built up from numerical operations that additional computing power and better algorithms can handle. Text data are qualitatively different: computers cannot directly process words. So text has to be analyzed differently from numerical data for meaning to be derived.

Early approaches provided elementary information in the form of word frequencies. While these had some uses, their limitations quickly became apparent. Later, word co-occurrences started being used to derive more meaning from textual data. This is broadly called the bag-of-words approach: as the name implies, the ordering of words is ignored, but the proximity of words to each other is used. With this approach, it is possible to get at categories or topics in textual data, similar to the nets that manual coders build. While clearly better than the earlier generation, it still had significant limitations.
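The progression from word frequencies to bag-of-words co-occurrence can be sketched in a few lines of Python. The comments below are invented toy data, and real implementations tokenize and normalize text far more carefully than simple whitespace splitting:

```python
from collections import Counter
from itertools import combinations

# Toy open-ended responses (invented for illustration).
comments = [
    "the battery drains too fast",
    "battery life is too short",
    "great screen but weak battery",
]

# Generation 1: simple word frequencies.
freq = Counter(word for c in comments for word in c.split())

# Generation 2: bag-of-words co-occurrence -- word order is ignored,
# but words appearing in the same comment are counted as a pair.
cooc = Counter()
for c in comments:
    for pair in combinations(sorted(set(c.split())), 2):
        cooc[pair] += 1

print(freq["battery"])            # "battery" appears once in each of the three comments
print(cooc[("battery", "too")])   # "battery" and "too" co-occur in two comments
```

Even this toy version shows both the appeal and the limitation: co-occurrence hints that complaints cluster around the battery, but nothing here captures what the words actually mean.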

The big breakthrough occurred a few years ago through a process called word embedding, which ascribes meaning to words by using massive data sources to contextualize them. For instance, imagine a perceptual map of brands: those that are close together are more similar than those that are farther apart. Similarly, embedding locates words in a multidimensional vector space in which words that are similar in meaning sit closer together (a placement derived through exposure to gigantic datasets of words). “Closer together” is interesting because the embedding process is sophisticated enough to place substitutes and complements together; hence it provides useful marketing insights.

And, importantly, each word now has a unique numerical signature that allows the application of familiar statistical techniques to manipulate it. This combination of approaches allows us to rapidly understand meaning in text.
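As a concrete illustration, here is a minimal Python sketch of that idea, with tiny hand-made vectors standing in for real embeddings (which are learned from gigantic corpora and typically have hundreds of dimensions):

```python
import math

# Hand-made toy "embeddings" for illustration only -- real embeddings
# are learned from huge text corpora, not assigned by hand.
embeddings = {
    "coffee": [0.90, 0.80, 0.10],
    "tea":    [0.85, 0.75, 0.15],
    "laptop": [0.10, 0.20, 0.90],
}

def cosine(u, v):
    """Cosine similarity: higher means the words sit closer in the vector space."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

# Words similar in meaning ("coffee", "tea") score higher than unrelated ones.
print(cosine(embeddings["coffee"], embeddings["tea"]))     # close to 1.0
print(cosine(embeddings["coffee"], embeddings["laptop"]))  # much lower
```

Once every word (or whole comment) is a vector like this, the familiar statistical toolkit, clustering, classification, distance measures, applies directly.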

The human-machine hybrid

The main problem we encountered previously was that we needed a knowledgeable analyst to comb through every open-ended response to identify potential gems – a practically infeasible approach unless the number of responses was very small. Instead, the new technology allows us to use a human-machine hybrid approach in a very efficient and effective manner. The superiority of using a human-machine hybrid approach over one or the other has been advocated in multiple academic papers. (See references for articles from researchers at MIT and Columbia University).

We see this pretty easily in practical applications. Imagine a study with several hundred open-ended responses. Leave aside the human-only approach (manual coding along with its issues of time, budget and expertise). While a machine-only approach will be fast, the informational content from the data is at a surface level and insights are almost non-existent.

In a human-machine hybrid approach, each component can focus on what it does best. The machine learning algorithm is fast, cheap and does not tire. Through the process described above, it can quickly and rigorously work through the data quantitatively and summarize information for the human analyst – even in the form of the “nets” with which consumers of manual coding are familiar. Rather than wasting time on tedious tasks, the human is now free to bring fresh, higher-level thinking to the summarized information. It also becomes easier for the analyst to hunt for the elusive outlier comments that could be managerially useful, including potential red flags.
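One way the machine's side of that outlier hunt can work: score each response vector against the centroid of the data and surface the ones that sit far away for a human to read. The vectors and threshold below are toy assumptions; in practice the vectors would come from an embedding model:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

# Toy 2-D vectors standing in for embedded open-ended responses.
responses = {
    "love the taste":                 [0.90, 0.10],
    "great flavor":                   [0.85, 0.20],
    "tastes good":                    [0.80, 0.15],
    "package caught fire in transit": [0.05, 0.95],  # the rare "needle"
}

# Machine step: compute the centroid of all response vectors.
dims = len(next(iter(responses.values())))
centroid = [sum(v[i] for v in responses.values()) / len(responses) for i in range(dims)]

# Flag responses far from the centroid for the analyst to read
# (the 0.8 cutoff is an illustrative choice, not a recommended value).
flagged = [text for text, vec in responses.items() if cosine(vec, centroid) < 0.8]
print(flagged)  # only the unusual comment is surfaced
```

The machine never decides that the flagged comment matters; it only guarantees the analyst sees it instead of it vanishing into an “Other” net.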

This process works especially well with a two-person team (a research analyst and a data science expert) working iteratively and rapidly. At the first level, the data science expert analyzes the text data algorithmically and rapidly, creating categories or topics. With knowledge of the study objectives, this can be a partially guided approach; we have seen interesting examples of generated categories that a human coder may not have considered developing. At the second level, the research analyst reviews the information, validating categories, closely scanning the outliers, and providing feedback for further iteration.

In practice, we find a significant reduction in time and cost, as well as an enhanced ability to gain superior insights. When dealing with continuously generated tracking data, a machine learning model can automatically classify responses from new waves while allowing the analyst to focus on interesting outlier comments. Since a machine learning model can improve with more data, the chances of identifying interesting outliers increase over time.
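For tracking data, the auto-classification step might look like the following nearest-centroid sketch, where category centroids learned from earlier waves classify each new response and low-confidence responses are routed to the analyst. All vectors, category names and the threshold here are illustrative assumptions:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

# Category centroids learned from earlier waves (toy 2-D stand-ins).
centroids = {
    "taste":     [0.9, 0.1],
    "packaging": [0.1, 0.9],
}

def classify(vec, threshold=0.8):
    """Assign a new response vector to its nearest category,
    or route it to the human analyst when no category fits well."""
    name, score = max(
        ((name, cosine(vec, c)) for name, c in centroids.items()),
        key=lambda pair: pair[1],
    )
    return name if score >= threshold else "ANALYST REVIEW"

print(classify([0.85, 0.20]))  # clearly a taste comment
print(classify([0.60, 0.60]))  # ambiguous -> routed to the analyst
```

As new waves arrive, the centroids can be re-estimated with the accumulated data, which is one simple sense in which the model "becomes better with more data."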

The future

Is it possible for technology to advance to the point where the hybrid system is no longer needed and a machine-only process will suffice? In theory, yes, but there is no indication of that happening any time soon. Current systems are good at providing continuous dashboard information at scale, but not the type of deep-dive insights that only human intervention can provide. That may be something to consider when confronted with text data.

References

  • Karlinsky-Sichor, Yael and Oded Netzer (2020), “Automating the B2B Salesperson Pricing Decisions: A Human-Machine Hybrid Approach,” under review at Marketing Science.
  • Timoshenko, Artem and John Hauser (2019), “Identifying Customer Needs from User-Generated Content,” Marketing Science, Vol. 39, Jan.
  • Sambandam, Rajan (2013), “Qualitative Data in Surveys: Lessons from The Black Swan,” TRC White Paper.
  • Taleb, Nassim Nicholas (2007), The Black Swan: The Impact of the Highly Improbable, Random House.