The laborious (and valuable) task of data preparation

In my last two analytical posts (the past two Mondays), I emphasized the challenges inherent in designing the analyses necessary to prove or disprove our ideas. My overarching point was that rarely do data and analysis align perfectly in a single convincing proof, and that as a result, we need to be creative and resourceful in designing multiple analyses which, when taken together, provide sufficient insight to claim some idea, explanation, or path forward as wise and sound. This is essential to great analytical work, and easier said than done.

In “Scorecasting,” an entertaining book that critically explores a number of sports axioms, I recently came across a great example of another valuable element of the analytical process: adding our own unique categorization to the data we collect.

The authors set out to quantify the value of a blocked shot in basketball. This led to a better question: Are all blocked shots equal in value? To answer this, they created their own categorization of blocked shots: a block of a shot that was unlikely to have been made, a block of a shot that was very likely to have been made, a block of a shot tipped directly to a teammate that started a fast break the other way, and so on. This categorization was essential to accurately value each blocked shot. There was only one problem: nowhere did this data exist.

How many times do we come across this problem? Gee, I’d love to know how aggressive our sales team was on each sales call last month. And I’d sure like to know the educational level of each of our customers. And experience level. And wouldn’t it be great to know how much higher a price each customer would have actually been willing to pay?

If only that data existed. But it doesn’t.

And sometimes that’s the end of the story. The data truly doesn’t exist.

But other times, the data actually does exist – or some reasonable approximation of it – only we choose not to put forth the thought and effort to obtain it. Yes, we might have to review and assign each and every record to a particular category. And yes, we might have to first perform some additional research, or conduct a long series of interviews, before classifying our data. And yes, all this can be incredibly labor intensive. But it is possible. The data actually does exist.

In “Scorecasting,” the authors viewed seven years of video to categorize every blocked shot into the classifications they established. Without question, that took a lot of time! But it also enabled them to better value the true impact of each player’s shot blocking, information that might be very useful in coaching shot-blocking techniques or in negotiating a player’s next contract.
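To make the payoff of that hand labeling concrete, here is a minimal sketch in Python. It is not the authors' method: the category names, the point values assigned to each category, and the sample records are all hypothetical, invented purely for illustration. It only shows how a manually added category lets us compare a raw count of blocks against a category-weighted value.

```python
from collections import defaultdict

# Hypothetical categories and point values, invented for illustration only;
# they are not figures from "Scorecasting".
CATEGORY_VALUE = {
    "unlikely_to_score": 0.5,
    "likely_to_score": 1.5,
    "started_fast_break": 2.0,
}

# Each record is one blocked shot, labeled by hand after reviewing it.
# The players and labels below are made-up sample data.
labeled_blocks = [
    {"player": "Player A", "category": "unlikely_to_score"},
    {"player": "Player A", "category": "unlikely_to_score"},
    {"player": "Player A", "category": "likely_to_score"},
    {"player": "Player B", "category": "started_fast_break"},
    {"player": "Player B", "category": "likely_to_score"},
]


def summarize(blocks, values):
    """Return {player: (raw block count, category-weighted value)}."""
    counts = defaultdict(int)
    totals = defaultdict(float)
    for block in blocks:
        counts[block["player"]] += 1
        totals[block["player"]] += values[block["category"]]
    return {player: (counts[player], totals[player]) for player in counts}


summary = summarize(labeled_blocks, CATEGORY_VALUE)
for player, (count, value) in summary.items():
    print(f"{player}: {count} blocks, weighted value {value:.1f}")
```

In this made-up sample, Player A has more blocks but Player B's blocks carry more weighted value, which is exactly the kind of distinction a raw count hides. The hard part, of course, is not the arithmetic but the hours spent assigning each record to its category.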

Collecting, cleaning, categorizing and otherwise preparing data often requires large amounts of creativity, resourcefulness and time. That’s why very few people do it well. And why those who do stand out.
