Tag Archives: lean

What (Really) is a Data Scientist?

Drew Conway's very popular Data Science Venn Diagram. From http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

What is a data scientist? What makes for a good (or great!) data scientist? It’s been challenging enough to determine what a data scientist really is (several people have proposed ways to look at this). The Guardian (a UK publication) said, however, that a true data scientist is as “rare as a unicorn”.

I believe that the data scientist “unicorn” is hidden right in front of our faces; the purpose of this post is to help you find it. First, we’ll take a look at some models, and then I’ll present my version of what a data scientist is (and how this person can become “great”).

#1 Drew Conway’s popular “Data Science Venn Diagram”, created in 2010, characterizes the data scientist as a person with some combination of skills and expertise in three categories (and preferably, depth in all of them): 1) Hacking, 2) Math and Statistics, and 3) Substantive Expertise (also called “domain knowledge”).

Later, he added that there was a critical missing element in the diagram: that effective storytelling with data is fundamental. The real value-add, he says, is being able to construct actionable knowledge that facilitates effective decision making. How to get the “actionable” part? Be able to communicate well with the people who have the responsibility and authority to act.

“To me, data plus math and statistics only gets you machine learning, which is great if that is what you are interested in, but not if you are doing data science. Science is about discovery and building knowledge, which requires some motivating questions about the world and hypotheses that can be brought to data and tested with statistical methods. On the flip-side, substantive expertise plus math and statistics knowledge is where most traditional researcher falls. Doctoral level researchers spend most of their time acquiring expertise in these areas, but very little time learning about technology. Part of this is the culture of academia, which does not reward researchers for understanding technology. That said, I have met many young academics and graduate students that are eager to bucking that tradition.” (Drew Conway, March 26, 2013)

#2 In 2013, Harlan Harris (along with his two colleagues, Sean Patrick Murphy and Marck Vaisman) published a fantastic study where they surveyed approximately 250 professionals who self-identified with the “data science” label. Each person was asked to rank their proficiency in each of 22 skills (for example, Back-End Programming, Machine Learning, and Unstructured Data). Using clustering, they identified four distinct “personality types” among data scientists: Data Businesspeople, Data Creatives, Data Developers, and Data Researchers.

As a manager, you might try to cut corners by hiring all Data Creatives(*). But then, you won’t benefit from the ultra-awareness that theorists provide. They can help you avoid choosing techniques that are inappropriate, if (say) your data violates the assumptions of the methods. This is a big deal! You can generate completely bogus conclusions by using the wrong tool for the job. You would not benefit from the stress relief that the Data Developers will provide to the rest of the data science team. You would not benefit from the deep domain knowledge that the Data Businessperson can provide… that critical tacit and explicit knowledge that can save you from making a potentially disastrous decision.

Most analysts and researchers who screw up their analyses do so innocently, by stumbling into misuses of statistical techniques, but some unscrupulous folks might mislead others on purpose; for an extreme case, see “I Fooled Millions Into Thinking Chocolate Helps Weight Loss”.

Their complete results are published as a 30-page report (available in print or on Kindle).
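To make the clustering step in #2 a little more concrete, here is a minimal sketch. The post does not say which clustering algorithm the study used, so plain k-means is a stand-in choice here, and everything else is an assumption too: three made-up skills instead of 22, three clusters instead of four, six hypothetical respondents, and a hypothetical `kmeans` helper. It illustrates the general idea only, not the authors' method.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means (Lloyd's algorithm) with deterministic starting centroids."""
    # Assumption: seed the centroids with the first k points so runs are reproducible.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each respondent to the nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical self-ratings (1 = low, 5 = high) on three invented skills:
# (programming, statistics, business).
respondents = [
    (5, 2, 1),
    (2, 5, 1),
    (1, 2, 5),
    (5, 1, 2),
    (1, 5, 2),
    (2, 1, 5),
]

labels, centroids = kmeans(respondents, k=3)

if __name__ == "__main__":
    for person, label in zip(respondents, labels):
        print(person, "-> cluster", label)
```

Each cluster's centroid can then be read as a skill profile (for example, high programming and low business), which is roughly how a “personality type” label could be attached to a group of respondents.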

#3 The Guardian is, in my opinion, a little more rooted in realistic expectations:

“The data scientist’s skills – advanced analytics, data integration, software development, creativity, good communications skills and business acumen – often already exist in an organisation. Just not in a single person… likely to be spread over different roles, such as statisticians, bio-chemists, programmers, computer scientists and business analysts. And they’re easier to find and hire than data scientists.”

They cite British Airways as an exemplar:

“[British Airways] believes that data scientists are more effective and bring more value to the business when they work within teams. Innovation has usually been found to occur within team environments where there are multiple skills, rather than because someone working in isolation has a brilliant idea, as often portrayed in TV dramas.”

Their position is you can’t get all those skills in one person, so don’t look for it. Just yesterday I realized that if I learn one new amazing thing in R every single day of my life, by the time I die, I will probably be an expert in about 2% of the package (assuming it’s still around).

#4 Others have chimed in on this question and provided outlines of skill sets, such as:

  • Six Qualities of a Great Data Scientist: statistical thinking, technical acumen, multi-modal communication skills, curiosity, creativity, grit
  • The Udacity blog: basic tools (R, Python), software engineering, statistics, machine learning, multivariate calculus, linear algebra, data munging, data visualization and communication, and the ultimately nebulous “thinking like a data scientist”
  • IBM: “part analyst, part artist” skilled in “computer science and applications, modeling, statistics, analytics and math… [and] strong business acumen, coupled with the ability to communicate findings to both business and IT leaders in a way that can influence how an organization approaches a business challenge.”
  • SAS: “a new breed of analytical data expert who have the technical skills to solve complex problems – and the curiosity to explore what problems need to be solved. They’re part mathematician, part computer scientist and part trend-spotter.” (Doesn’t that sound exciting?)
  • DataJobs.Com: well, these guys just took Drew Conway’s Venn diagram and relabeled it.

#5 My Answer to “What is a Data Scientist?”:  A data scientist is a sociotechnical boundary spanner who helps convert data and information into actionable knowledge.

Based on all of the perspectives above, I’d like to add that the data scientist must have an awareness of the context of the problems being solved: social, cultural, economic, political, and technological. Who are the stakeholders? What’s important to them? How are they likely to respond to the actions we take in response to the new knowledge data science brings our way? What’s best for everyone involved so that we can achieve sustainability and the effective use of our resources? And what’s with the word “helps” in the definition above? This is intended to reflect that in my opinion, a single person can’t address the needs of a complex data science challenge. We need each other to be “great” at it.

A data scientist is someone who can effectively span the boundaries between

1) understanding social+ context, 

2) correctly selecting and applying techniques from math and statistics,

3) leveraging hacking skills wherever necessary,

4) applying domain knowledge, and

5) creating compelling and actionable stories and connections that help decision-makers achieve their goals.

This person has a depth of knowledge and technical expertise in at least one of these five areas, and a high level of familiarity with each of the other areas (commensurate with Harris’ T-model). They are able to work productively within a small team whose deep skills span all five areas.

It’s data-driven decision making embedded in a rich social, cultural, economic, political, and technological context… where the challenges may be complex, and the stakes (and ultimately, the benefits) may be high. 


(*) Disclosure: I am a Data Creative!

(**) Quality professionals (like Six Sigma Black Belts) have been doing this for decades. How can we enhance, expand, and leverage our skills to address the growing need for data scientists?

My New Favorite Statistics & Data Analysis Book Using R

NOTE: The 2nd Edition (Red Swan) was released in 2017. There is a companion book that presents end-to-end examples of each of the methods.


As of today, I now have a NEW FAVORITE introductory statistics textbook… the one I’ve always dreamed of having. I’ve been looking for a book to use in my classes for undergraduate sophomores and juniors, but none of the textbooks I considered over the past three years (and I’ve looked at over a hundred!) had all of the things I really, really wanted. So I had to go make it happen myself. These things are:

download the preview here (first ~100 pages)

1) An integrated treatment of theory and practice. All of my stats textbooks have a lot of formulas, and no information about how to do what the formulas do in the R statistical software. All of my R textbooks have a lot of information about how to run the commands, but not really much information about what formulas are being used. I wanted a book that would show how to solve problems analytically (using the equations), and then show how they’re done in R. If there were discrepancies between the stats textbook answers and the R answers, I wanted to know why. A lot of times, the developers of R packages use very sophisticated adjustments and corrections, which I only became aware of because my analytical solutions didn’t match the R output. At first, I thought I was wrong. But later, I realized I was right, and R was right: we were just doing different things. I wanted my students to know what was going on under the hood, and have an awareness of exactly which methods R was using at every moment.

2) An easy way to develop research questions for observational studies and organize the presentation of results. We always do small research projects in my classes, and in my opinion, this is the best way for students to get a strong grasp of the fundamental statistical concepts. But they always have the same questions: Which statistical test should I use? How should I phrase my research question? What should I include in my report? I wanted a book that made developing statistical research questions easy. In fact, I know a lot of people I went to PhD school with who would have loved to have this book while they were proposing, conducting, and defending their dissertations.

3) A confidence interval cookbook. This is probably one of the most important things I want my students to leave my class remembering: that from whatever sample you collect, you can construct a confidence interval that will give you an idea of what the true population parameter should be. You don’t even need to do a hypothesis test! But it can be difficult to remember which formula to use… so I wanted an easy reference where I’d be able to look things up, and find out really easily how to use R to construct those confidence intervals for me. Furthermore, some of the confidence intervals that everyone is taught in an introductory statistics course are wildly inaccurate – and statisticians know this. But they hesitate to scare away novice data analysts with long, scary looking equations, and so students keep learning those inaccurate methods and believing they’re good. Since so many people never get beyond introductory statistics and still turn into researchers in other fields, I thought this was horrible. I want to make sure my students know the best way to do each confidence interval in their first class… even if the equations are not as friendly.
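One illustration of the “friendly but inaccurate” problem, offered as an assumption rather than as the book's own example, is the interval for a proportion. The simple Wald formula taught in many introductory courses can produce limits below 0 or collapse to zero width with small samples, while the longer Wilson score formula does not. The sketch below uses Python instead of R only to stay dependency-free; the function names and the sample counts are hypothetical.

```python
import math

Z_95 = 1.959963984540054  # standard normal quantile for a two-sided 95% interval


def wald_interval(successes, n, z=Z_95):
    """Textbook (Wald) interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = successes / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def wilson_interval(successes, n, z=Z_95):
    """Wilson score interval: a longer formula that behaves better for small n."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)
    )
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Hypothetical sample: 1 success in 20 trials.
    print("Wald:  ", wald_interval(1, 20))    # lower limit falls below 0
    print("Wilson:", wilson_interval(1, 20))  # stays inside (0, 1)
```

With zero successes the Wald interval degenerates to a single point at 0, which claims far more certainty than a small sample can support. As far as I know, R's `prop.test` is based on a score-type interval (with a continuity correction by default), which may be one source of the mismatch between hand calculations and R output described in item 1.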

4) An inference test cookbook. I wanted a book that stepped me through each of the primary parametric inference tests analytically (using the equations), and then showed me how it was done in R. If there were discrepancies, I wanted to know why. I wanted an easy way to remember the assumptions for each test, and when to use a pooled standard deviation versus an unpooled one. There’s a lot to keep track of! I wanted a reference that would make it easy to keep track of all of it: assumptions, tests for assumptions, equations, R code, and diagnostic plots.
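The pooled-versus-unpooled choice can be sketched for the two-sample t test as follows. This is an illustration under stated assumptions, not code from the book: it is written in Python rather than R to stay dependency-free, and the function names and data are hypothetical.

```python
import math
from statistics import mean, variance


def pooled_t(x, y):
    """Two-sample t statistic with a pooled variance (assumes equal variances)."""
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    # Pooled variance: sample variances weighted by their degrees of freedom.
    sp2 = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / df
    t = (mean(x) - mean(y)) / math.sqrt(sp2 * (1 / nx + 1 / ny))
    return t, df


def welch_t(x, y):
    """Two-sample t statistic with unpooled variances and Welch-Satterthwaite df."""
    nx, ny = len(x), len(y)
    vx, vy = variance(x) / nx, variance(y) / ny
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx**2 / (nx - 1) + vy**2 / (ny - 1))
    return t, df


if __name__ == "__main__":
    # Hypothetical samples of equal size but clearly different spread.
    a = [1, 2, 3, 4, 5]
    b = [2, 4, 6, 8, 10]
    print("pooled:", pooled_t(a, b))
    print("Welch: ", welch_t(a, b))
```

With equal sample sizes the two t statistics coincide, but the Welch degrees of freedom are smaller, so the p-values and confidence intervals still differ. R's `t.test` uses the unpooled (Welch) version unless `var.equal = TRUE` is passed, which is one common reason a hand-computed pooled answer does not match the R output.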

5) No step left behind. It’s really frustrating to me how so many R books assume you can do a psychic fill-in-the-blank for missing code. Since I’ve been using R for several years now, I’ve gotten to the point where my psychic abilities are pretty good, and at least 60% of the time I can figure out the missing pieces. But wow, what a waste of time! So I wanted a book that had all of the steps for each example. Even if it was a little repetitive. I may have missed this in a few places, but I think beginners will have a much easier time with this book. Also, I put all my data and functions on GitHub for people to run the examples with. I’m growing this slowly, but I don’t want people to be left in the lurch.

6) An easy way to produce any of the charts and graphs in the book. One of my pet peeves about R books is that the authors generate beautiful charts and graphs, and then you’re reading through the book and say “Yes!! Yes!! That’s the chart I need for my report… I want to do that… how did they do that?” and they don’t tell you anywhere how they did it. I did not want there to be any secrets in this book. If I generated a page of interesting looking simulated distributions, I wanted you to know how I did it (just in case you want to do it later).

GRANTED… I am sure it will not be perfect – no book is. (For example, Google Forms changes a lot and there are a couple examples that use it that will probably be outdated when the book gets to press… and I just found out this morning that you don’t need the source_https trick in R 3.2.0 and beyond.) [Note: data access has been fully updated in the 2nd Edition.] However, I will keep updating my blog with posts about useful things as they evolve.

In any case, I hope you enjoy my book as much as I’ve been enjoying using it as a reference for myself… it really is all my most important notes, neatly organized into just over 500 pages of everything I want to remember. And everything I want to make sure my students take with them after they leave my class.

[Note: Any errors and omissions from earlier printings (which have been taken care of in later printings) are being recorded at https://qualityandinnovation.com/errata/.]

Quality and Innovation in the Counterculture

Inside the Temple of Grace at Burning Man 2014. Image Credit: John David Tupper (photographerinfocus.com)

This week, I was the guest blogger at the American Society for Quality’s “View from the Q” where I shared some anecdotes about encountering quality tools and concepts at Burning Man this past August.

Check it out and learn what’s so great about “MOOP”.

Google Docs to Markdown Converter: A Gateway Drug to Getting Your Books on LeanPub

Image Credit: Doug Buckley of http://hyperactive.to

Following in the footsteps of fellow ASQ Influential Voice John Hunter (who published Management Matters on LeanPub), I’ve intended for the past couple of years to write my next book using LeanPub too.

There’s only one problem: LeanPub requires that you prepare and format your book in Markdown. I know Markdown is not that hard, but in order to move forward with it, I would have to find at least a couple days without distractions to get my head into it and start flowing with that approach. With work and kid’s-school-schedule and my travel schedule, this has been darn near next to impossible.

Until today! I found an article that shows how you can convert Google Docs to Markdown using a simple script.

I haven’t tried it yet, but I am convinced this huge productivity booster will be the gateway drug to getting my books onto LeanPub.

 

Moving Beyond Profit: Support the WHY

It’s amazing how sometimes, just a tiny TINY little stir-of-the-consciousness can yield amazing insights.

That’s what just happened to me a few minutes ago. While scanning this morning’s Twitter feed, I saw this one:

It reminded me of an article I posted in early 2011 titled “Is Profit Waste?” where I posed the question of whether profit was just one of many kinds of waste – that is, overproduction of revenue. When companies talk about a desire to grow, usually they mean they need to figure out a way to grow their revenue stream (and often this means growing the organization, expanding the scope, or adding to product lines and service offerings). In fact, one of the strongest drivers for pushing innovation is that desire to grow.

But WHY? Why do you want to grow? It’s that question that the tweet above answered for me in less than 140 characters.

Here’s a company that’s not trying to sell you on the WHAT that they do. It’s a new company, so obviously they’re trying to get started, but they’re immediately clear about WHY they want to grow… they want to get more women into technology! And the clear outward sign of successful growth will be getting more women into technology. And oh – by the way – in order for us to pursue our PURPOSE of getting women into technology, we need to make some money, and to do this we’ve written our first app. And won’t you please buy it… because if you do, you can help us work to get more women into technology!


I love this. I think more of us should approach our business stories this way! Don’t focus on business growth or profit growth, focus on WHY you’re working in the first place and WHAT you want more revenue to spend on. If we support your mission, we’re more likely to support your product, even if it doesn’t meet all our needs. Furthermore, we’re more likely to want to work with you to enhance the products, expand your reach, and collaborate to achieve higher levels of quality and serendipitous innovation.

An Unorthodox Tip for Improving Productivity and Eliminating Writer’s Block: Listen to the Earworm

(Image Credit: Doug Buckley of http://hyperactive.to)

The other day I read a news article or blog post (or something; I can’t remember) that explained one reason we get irritating songs stuck in our heads. The post was based on a research paper by Williamson et al. (2011) in the journal Psychology of Music. Usually, when we catch one of these “earworms” because we’ve heard a snippet of a catchy and familiar song, we’ll walk away or turn off the song in the beginning or the middle of it.

The tune, however, like a rapid flesh-eating organism invading our very soul, continues without compunction. Because we stopped the song in the middle, our unconscious becomes fixed on the task of finishing it. And so it continues, on and on, all day!

The solution, we’re told, is to listen to the annoying song until it’s over… our unconscious, at that point, will be content that the tune is complete and will be happy to move on to other topics.

I didn’t think too much of this piece of trivia until I was reading an interview with Erik Larson, author of the fantastic 2003 book The Devil in the White City. It provides an amazing account of the technology development and social context that went into organizing the 1893 World’s Fair in Chicago – it’s a totally satisfying read. When asked about his discipline for writing, and for avoiding writer’s block, he described a method that might actually leverage the same hold on the unconscious that earworms have:

And I try to write a couple of pages. I’m not firm. I don’t have a specific goal. But the one thing I always adhere to is that I stop while I’m ahead. If I’m going to take that break for breakfast, I may stop in the middle of the sentence or the middle of the paragraph. Something I know how to finish. Because as any writer knows, it’s — that’s what kills you is when you just don’t know what to do when you come back. And all the demons accumulate. And then you go out for a cappuccino, that kind of thing.

If you want to avoid writer’s block, leave your unconscious a hook – an easy way back in to your writing productivity!

If you want to avoid ramp-up time (or context switching time) to get your head back into a problem – which has been estimated, for software development at least, to be on average a full 15 minutes for every interruption – leave your unconscious an easy way back in to productivity! A half written module or subroutine… or a half written sentence on your notepad!

These are just hypotheses, but they’re definitely testable. I’m going to try testing this out in my own life immediately.

Quality Soup: Too Many Quality Improvement Acronyms

Note: This post is NOT about soup. If you’re searching for really good soup to eat, you will not find it here.

This post is, in contrast, about something that @ASQ tweeted earlier today: “QP Perspectives Column: Is the quality profession undermining ISO 9000?”

In this February 2012 column, author Bob Kennedy reflected on a heated discussion at a gathering of senior-level quality practitioners regarding the merit of various tools, methodologies and themes in the context of the quality body of knowledge – what I refer to as “quality soup”. These paragraphs sum up the dilemma captured at that meeting:

Next came the bombshell from a very senior quality consultant: “No one is interested in ISO 9000 anymore; they all want lean.” In hindsight, I think he was speaking from a consultant’s perspective. In other words, there’s no money to be made peddling ISO 9000, but there is with lean and LSS.

I was appalled at this blatant undermining of a fundamental bedrock of quality that is employed by more than 1 million organizations representing nearly every country in the world. The ISO 9000 series is Quality 101, and as quality practitioners, we should never forget it.

If we don’t believe this and promote it, we undermine the impact and importance of ISO 9000. We must ask ourselves, “Am I interested in ISO 9000 anymore?”

When I see articles like this, and other articles or books that question whether a tool or technique is just a passing fad (e.g. there’s a whole history of them presented in Cole’s 1999 book) my visceral reaction is always the same. How can so many quality professionals not see that each of these “things we do” satisfies a well-defined and very distinct purpose? (I quickly and compassionately recall that it only took me 6 years to figure this out, 4 of which were spent in a PhD program focusing on quality systems – so don’t feel bad if I just pointed a finger at you, because I’d actually be pointing it at past-me as well, and I’m still in the process of figuring all of this stuff out.)

In a successful and high-performing organization, I would expect to see SEVERAL of these philosophies, methodologies and techniques applied. For example:

  • The Baldrige Criteria provide a general framework to align an organization’s strategy with its operations in a way that promotes continuous improvement, organizational learning, and social responsibility. (In addition to the Criteria booklet itself, Latham & Vinyard’s users guide is also pretty comprehensive and accessible in case you want to learn more.)
  • ISO 9000 provides eight categories of quality standards to make sure we’re setting up the framework for a process-driven quality management system. (Cianfrani, Tsiakals & West are my three heroes of this system, because it wasn’t until I read their book that I realized what ISO 9001:2000, specifically, was all about.)
  • Thus you could very easily have ISO 9000 compliant processes and operations in an organization whose strategy, structure, and results orientation are guided by the Baldrige Criteria.
  • Six Sigma helps us reduce defects in any of those processes that we may or may not be managing via an ISO 9000 compliant system. (It also provides us with a couple of nifty methodologies, DMAIC and DMADV, that can help us structure improvement projects that might focus on improving another parameter that describes system performance OR design processes that tend not to yield defectives.)
  • The Six Sigma “movement” also provides a management philosophy that centers around the tools and technologies of Six Sigma, but really emphasizes the need for data-driven decision making that stimulates robust conclusions and recommendations.
  • Lean helps us continuously improve processes to obtain greater margins of value. It won’t help you reduce defects like Six Sigma will (unless your waste WAS those defects, or you’re consciously mashing the two up and applying Lean Six Sigma). It won’t help you explore alternative designs or policies like Design of Experiments, part of the Six Sigma DMAIC “Improve” phase, might do. It won’t help you identify which processes are active in your organization, or the interactions and interdependencies between those processes, like an ISO 9000 system will (certified or not).
  • ISO 9000 only guarantees that you know your processes, and you’re reliably doing what you say you’re supposed to be doing. It doesn’t help you do the right thing – you could be doing lots of wrong things VERY reliably and consistently, while keeping perfect records, and still be honorably ISO certified. The Baldrige process is much better for designing the right processes to support your overall strategy.
  • Baldrige, ISO 9000, and lean will not help you do structured problem-solving of the kind that’s needed for continuous improvement to occur. PDSA, and possibly Six Sigma methodologies, will help you accomplish this.

Are you starting to see how they all fit together?

So yeah, let’s GET LEAN and stop wasting our energy on the debate about whether one approach is better than another, or whether one should be put out to pasture. We don’t dry our clothes in the microwave, and we don’t typically take baths in our kitchen sink, but it is very easy to apply one quality philosophy, methodology or set of practices and expect a result that is much better generated by another.

Bob Kennedy comes to the same conclusion at the end of his column, one which I fully support:

All quality approaches have a place in our society. Their place is in the supportive environment of an ISO 9000-based QMS, regardless of whether it’s accredited. Otherwise, these approaches will operate in a vacuum and fail to deliver the improvements they promise.
