There is a growing body of evidence, at least in text processing, that of … data, features, [and] algorithms … data probably matters the most. Superficial word-level features coupled with simple models in most cases trump sophisticated models over deeper features and less data. But why can’t we have our cake and eat it too? Why not both sophisticated models and deep features applied to lots of data? Because inference over sophisticated models and extraction of deep features are often computationally intensive, they don’t scale well.
Training data is fairly easy to come by—we can just gather a large corpus of texts and assume that most writers make correct choices (the training data may be noisy, since people make mistakes, but no matter). In 2001, Banko and Brill [1] published what has become a classic paper in natural language processing exploring the effects of training data size on classification accuracy, using this task [confusion set disambiguation, e.g., choosing among {to, too, two}] as the specific example. They explored several classification algorithms (the exact ones aren’t important, as we shall see), and not surprisingly, found that more data led to better accuracy.
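An aside that is not part of the excerpt: to make the “training data is easy to come by” point concrete, here is a minimal sketch of how such data might be harvested for a confusion set like {to, too, two}. Every naturally occurring sentence containing a confusable word yields a labeled example: the surrounding words are the features, and the word the writer actually chose is the label. The extract_examples helper below is purely illustrative, not code from the paper or the book.

```python
# Illustrative sketch: mining "free" training examples from raw text by
# trusting that writers usually chose the right word.
import re

CONFUSION_SET = {"to", "too", "two"}  # hypothetical example set

def extract_examples(sentences, window=3):
    """Return (context words, observed word) pairs for each confusable occurrence."""
    examples = []
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        for i, token in enumerate(tokens):
            if token in CONFUSION_SET:
                # Surrounding words become features; the writer's choice is the label.
                context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
                examples.append((context, token))
    return examples

corpus = ["She gave the book to her sister.",
          "The two of them were too tired to argue."]
for context, label in extract_examples(corpus):
    print(label, context)
```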
Across many different algorithms, the increase in accuracy was approximately linear in the log of the size of the training data. Furthermore, with increasing amounts of training data, the accuracy of different algorithms converged, such that pronounced differences in effectiveness observed on smaller datasets basically disappeared at scale. This led to a somewhat controversial conclusion (at least at the time): machine learning algorithms really don’t matter, all that matters is the amount of data you have. This resulted in an even more controversial recommendation, delivered somewhat tongue-in-cheek: we should just give up working on algorithms and simply spend our time gathering data (while waiting for computers to become faster so we can process the data).
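Restated as a formula (my paraphrase of the trend, not an equation from the paper): each algorithm’s learning curve looked roughly like accuracy(N) ≈ a + b·log N, where N is the number of training examples and a and b are constants fit per algorithm, with the gaps between the curves narrowing as N grew.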
In 2007, Brants et al. [2] described language models trained on up to two trillion words. Their experiments compared a state-of-the-art approach known as Kneser-Ney smoothing with another technique the authors affectionately referred to as “stupid backoff”. Not surprisingly, stupid backoff didn’t work as well as Kneser-Ney smoothing on smaller corpora. However, it was simpler and could be trained on more data, which ultimately yielded better language models. That is, a simpler technique on more data beat a more sophisticated technique on less data.
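Another aside, again not from the excerpt: the sketch below shows the gist of stupid backoff as I understand it from Brants et al. An observed n-gram is scored by its relative frequency; an unobserved one backs off to the shorter context with a fixed multiplicative penalty (the paper suggests 0.4). The scores are never normalized into probabilities, which is precisely what makes the method cheap to compute over trillions of words; the counts and names here are illustrative.

```python
# Minimal sketch of stupid backoff scoring; counts and names are illustrative.
from collections import Counter

ALPHA = 0.4  # fixed backoff penalty suggested by Brants et al.

def stupid_backoff(ngram, counts, total_tokens, alpha=ALPHA):
    """Score S(w | context) for ngram = context + (w,).
    Relative frequency if the ngram was observed; otherwise back off to a
    shorter context and apply the penalty. Scores are not probabilities."""
    if len(ngram) == 1:
        return counts[ngram] / total_tokens  # base case: unigram frequency
    if counts[ngram] > 0:
        return counts[ngram] / counts[ngram[:-1]]
    return alpha * stupid_backoff(ngram[1:], counts, total_tokens, alpha)

# Toy usage with counts from a tiny corpus.
tokens = "the cat sat on the mat the cat ran".split()
counts = Counter()
for n in (1, 2, 3):
    counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

print(stupid_backoff(("the", "cat", "sat"), counts, len(tokens)))  # seen trigram -> 0.5
print(stupid_backoff(("on", "the", "cat"), counts, len(tokens)))   # backs off -> 0.4 * 2/3
```

Kneser-Ney smoothing, by contrast, requires careful discounting and normalization; the point of the comparison in the paper is that this much cruder score, fed far more data, won out.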
Recently, three Google researchers summarized this data-driven philosophy in an essay titled The Unreasonable Effectiveness of Data [3]. Why is this so? It boils down to the fact that language in the wild, just like human behavior in general, is messy.
[1] Michele Banko and Eric Brill. Scaling to very very large corpora for natural language disambiguation. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL 2001), pages 26–33, Toulouse, France, 2001.
[2] Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), pages 858–867, Prague, Czech Republic, 2007.
[3] Alon Halevy, Peter Norvig, and Fernando Pereira. The unreasonable effectiveness of data. IEEE Intelligent Systems, 24(2):8–12, 2009.
Jimmy Lin & Chris Dyer, Data-Intensive Text Processing with MapReduce (2010)
This is the best summary of the “pro big-data” argument I’ve found. I still haven’t made up my mind on this topic. But my prior is closer to @johndcook’s comment #4 on http://www.johndcook.com/blog/2010/12/15/big-data-is-not-enough/.