
Big data pitfalls for engineers from the Financial Times

March 28, 2014
by Andrew Botros

When Expressive Engineering teaches statistics, our clients are typically engineers with access to vast amounts of data from their project domains. Software developers, for example, can easily run their programs on thousands or even millions of test files from their archives.

As Tim Harford of the Financial Times points out in his March feature on big data, the opportunities provided by large datasets can be elusive. It’s an insightful article, and I’ll distil its warnings about large datasets into two key points.

Firstly, the article warns against assuming that data from entire populations can be captured. When software developers run their programs against large test sets, it’s tempting to assume that their results accurately reflect the outcomes that will be observed in the real world.

But sampling bias should always concern the data analyst. The Financial Times article cites the 1936 Literary Digest poll fiasco as a case in point (Expressive Engineering has used the same example in its courses for the last couple of years). The poll of millions of subscribers was far less accurate than the Gallup poll of 3,000, and incorrectly predicted a victory for Republican Alfred Landon over President Roosevelt. The reason? The Literary Digest sample was skewed towards those who owned automobiles and telephones.
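To make the point concrete, here is a minimal simulation sketch of a huge biased poll against a small random one. Every number in it is hypothetical and chosen only for illustration: the share of voters who own a car or telephone, and the level of support for the incumbent in each group, are assumptions rather than 1936 figures. It also simplifies the biased poll to sample owners only.

```python
import random

# Hypothetical population parameters (illustrative assumptions, not 1936 data).
OWNER_SHARE = 0.30      # fraction of voters who own a car or telephone
SUPPORT_OWNERS = 0.40   # incumbent support among owners
SUPPORT_OTHERS = 0.65   # incumbent support among everyone else

# True support across the whole electorate.
TRUE_SUPPORT = OWNER_SHARE * SUPPORT_OWNERS + (1 - OWNER_SHARE) * SUPPORT_OTHERS


def draw_voter(rng, owners_only):
    """Sample one voter and return True if they support the incumbent."""
    is_owner = owners_only or rng.random() < OWNER_SHARE
    support = SUPPORT_OWNERS if is_owner else SUPPORT_OTHERS
    return rng.random() < support


def poll(rng, n, owners_only):
    """Return the fraction of n sampled voters who support the incumbent."""
    return sum(draw_voter(rng, owners_only) for _ in range(n)) / n


rng = random.Random(1936)  # fixed seed so the run is repeatable
big_biased = poll(rng, 1_000_000, owners_only=True)   # huge, but owners only
small_random = poll(rng, 3_000, owners_only=False)    # small, but representative

print(f"true support:      {TRUE_SUPPORT:.3f}")
print(f"biased poll (1M):  {big_biased:.3f}")
print(f"random poll (3k):  {small_random:.3f}")
```

Under these assumptions, the million-response poll converges very precisely on the wrong answer (support among owners), while the 3,000-response random poll lands close to the true figure. More data shrinks the random error but does nothing to the bias.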

Closer to 2014, the Financial Times notes: “In 2013, US-based Twitter users were disproportionately young, urban or suburban, and black.” In my own analysis for Cochlear Limited, I found that clinical datasets of physiological data can be unrepresentative of everyday experience, even when large. Data gatherers have a tendency to filter out measurements that don’t look ‘normal’, and as an analyst I had to compensate for this with my own data collection.

Secondly, the article warns against dismissing the value of finding causal relationships, as opposed to mere correlations. It cites Google Flu Trends as the latest high-profile victim. In 2009, Google claimed that flu-related searches on its search engine correlated strongly with the actual spread of flu. But four years later, as Nature News reported in 2013, Google Flu Trends “had drastically overestimated peak flu levels”.

I think the Financial Times summarises the problem quite well, so it’s worth quoting directly: “The problem was that Google did not know—could not begin to know—what linked the search terms with the spread of flu. Google’s engineers weren’t trying to figure out what caused what. They were merely finding statistical patterns in the data. … The claim that causation has been ‘knocked off its pedestal’ is fine if we are making predictions in a stable environment but not if the world is changing (as with Flu Trends) or if we ourselves hope to change it.”
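The quoted warning can be illustrated with a toy model. The sketch below is not how Google Flu Trends worked. It assumes a hypothetical world in which each flu case triggers about two searches, fits a straight line from searches to cases during a stable period, and then applies that line after something else (here, an assumed burst of news coverage) starts driving searches as well. All names and numbers are invented for illustration.

```python
import random

rng = random.Random(2009)  # fixed seed so the run is repeatable


def searches_for(cases, media_boost=0.0):
    """Hypothetical link: about 2 searches per flu case, plus noise,
    plus extra searches driven by news coverage rather than illness."""
    return 2.0 * cases + media_boost + rng.gauss(0, 5)


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - b * mean_x, b


# Stable period: searches track flu cases closely, so the fit looks excellent.
train_cases = [rng.uniform(50, 150) for _ in range(200)]
train_searches = [searches_for(c) for c in train_cases]
a, b = fit_line(train_searches, train_cases)

# Changed world: same flu levels, but coverage adds 100 searches per period.
test_cases = [rng.uniform(50, 150) for _ in range(200)]
test_searches = [searches_for(c, media_boost=100.0) for c in test_cases]
predicted = [a + b * s for s in test_searches]

mean_actual = sum(test_cases) / len(test_cases)
mean_predicted = sum(predicted) / len(predicted)
overestimate = mean_predicted - mean_actual

print(f"fitted slope:        {b:.2f}")
print(f"mean actual cases:   {mean_actual:.1f}")
print(f"mean predicted:      {mean_predicted:.1f}")
```

The fitted line has no way of knowing why searches rose, so it attributes the extra searches to flu and overestimates. A model that encoded the cause of the searches could at least flag that the environment had changed.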

For those of us engineers who are in the business of predictive modelling, the era of big data is undoubtedly exciting. Heeding these two timeless warnings, however, will make you a better data analyst and, ultimately, a better engineer.

About the Author
Andrew Botros is the Founder and Director of Expressive Engineering.
© 2026 Expressive Engineering Pty Ltd | ABN 35 600 451 670