Since people liked my last opinion piece on
#big data, here’s another one.
Imagine there was a technology that allowed me to record the position of every atom in a small room, thereby generating some ridiculous amount of data (Avogadro's number is 𝒪(10²³), so some prefix around that order of magnitude, e.g. yottabytes). And also imagine that there was a way for other scientists to decode and view all of that. (Maybe latency and bandwidth would still be restricted even though capacity, resolution, fidelity, and coverage of the measurement are not. That won't be relevant to my thought experiment, but it would make things feel "like today", where MapReduce is required.)
Let's say I am running some behavioural economics experiment, because I like those. What fraction of the data am I going to make use of in building my model? I submit that the psychometric model might be exactly the same size as it is today. If I'm interested in decision theory, then I'm going to be looking to verify or falsify some high-level hypothesis like "expected utility" or "Hebbian learning". The evidence for or against that idea is going to be so far above the atomic level, so far above the neuron level, that I will basically still be looking at what I look at now:
- Did the decisions they ended up making (measured by maybe 𝒪(100), maybe even 𝒪(1) numbers in a table) correspond to the theory?
- For example, if I elicit their probability assessments and a utility ranking, did I then get them to make choices that violate those?
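That check really does come down to a handful of numbers. Here is a minimal sketch of what "testing expected utility" looks like on 𝒪(1) data; the outcomes, probabilities, and utilities are entirely made up for illustration:

```python
# Hypothetical check of one choice trial against expected-utility theory.
# All outcomes, probabilities, and utility values below are invented.

def expected_utility(lottery, utility):
    """Expected utility of a lottery given as [(probability, outcome), ...]."""
    return sum(p * utility[outcome] for p, outcome in lottery)

# Elicited utility ranking over outcomes (higher = preferred).
utility = {"nothing": 0.0, "mug": 1.0, "cash": 2.5}

# The subject chose lottery A over lottery B.
lottery_a = [(0.5, "cash"), (0.5, "nothing")]
lottery_b = [(1.0, "mug")]

eu_a = expected_utility(lottery_a, utility)  # 1.25
eu_b = expected_utility(lottery_b, utility)  # 1.0

# Choosing A violates the theory only if A's expected utility is lower.
violates_eu = eu_a < eu_b
print(eu_a, eu_b, violates_eu)  # 1.25 1.0 False
```

The entire evidential content of a trial like this is three numbers and a boolean, no matter how many atoms were recorded while it happened.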
If I've recorded every atom in the room, then with some work I can get up to a coarser resolution and make myself an MRI. (Imagine working with tick-level stock data when you are really only interested in monthly price movements, but in 3-D.) (I guess I wrote myself into even more of a corner here: if we have atomic-level data then it's quantum, meaning you really have to do some work to get it to the fMRI scale!)

But say I've gotten to fMRI-level data; then what am I going to do with it? I don't know how brains work. I could look up some theories of what lighting-up in different areas of the brain means (and what about 16-way dynamical correlations of messages passing between brain areas? I don't think anatomy books have gotten there yet). So I would have all this fMRI data and basically not know what to do with it. I could start my next research project to look at numerically or mathematically obvious properties of this dataset, but that doesn't seem like it would yield up a Master Answer of the Experiment, because there's no interplay between theories of the brain and trying different experiments to test them; I'm just looking at "one single cross section", which is my one behavioural econ experiment. Might squeeze some juice but who knows.
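The tick-data analogy above can be made concrete. This is a hedged sketch of coarse-graining with invented numbers: we generate a fake "tick-level" price path and keep only the monthly closes, the same way you would average atomic data up to a coarser scale:

```python
# Hypothetical coarse-graining: fake tick-level prices reduced to monthly
# closes. All figures are synthetic; nothing here is real market data.
import numpy as np

rng = np.random.default_rng(0)
ticks_per_month = 10_000
months = 12

# Fake tick-level price path: 120,000 numbers from a random walk.
ticks = 100 + np.cumsum(rng.normal(0, 0.01, ticks_per_month * months))

# Coarse-grain: keep only the last tick of each month -> 12 numbers.
monthly_closes = ticks.reshape(months, ticks_per_month)[:, -1]

print(ticks.size, monthly_closes.size)  # 120000 12
```

Four orders of magnitude of data vanish in one `reshape`, and for a monthly question nothing of interest was lost.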
Then let’s talk about people critiquing my research paper. I would post all the atomic-level data online of course, because that’s what Jesus would do. But would the people arguing against my paper be able to use that granular data effectively?
I don't really think so. I think they would look at the very high-level 𝒪(100) or 𝒪(1) data that I mentioned before, the same place I would be looking.
- They might argue about my interpretation of the numbers or statistical methods.
- They might say that what I count as evidence doesn’t really count as evidence because my reasoning was bad.
- They couldn’t argue that the experiment isn’t replicable because I imagined a perfect-fidelity machine here.
- They could go one or two levels deeper and find that my experimental setup was imperfect: the administrator of the questions didn't speak them in exactly the same tone of voice each time; her face was at a slightly different angle; she wore a different-coloured shirt the second day. But in my imaginary world with perfect instruments, those kinds of errors would be so easy to see everywhere that nobody would take such a criticism seriously. (And of course, because I am the author of this fantasy, there actually aren't significant implementation errors in the experiment.)
Now think about the scientists 100 years later, or imagine we had such perfect-fidelity recordings of some famous historical experiment. Let's say it's Michelson and Morley. Then it would be interesting just to watch the video from all angles (full resolution still not necessary) and learn a bit about the characters we've talked so much about.
But even here I don't think what you would do is run an exploratory algorithm on the atomic level and see what it finds, even if you had a bajillion units of processing power so it didn't take long. There's just way too much to throw away. If you had a perfect-fidelity, 10²⁵-zoom, full-capacity replica of something worth observing, that resolution and fidelity would be useful for making sure you captured the one key thing worth observing, not because you want to look at everything and "do an algo" to find what's going on. Imagine you have a videotape of a murder scene: the benefit is that you've recorded every angle and every second, so you can zoom in on the murder weapon, or the grisly act being committed, or the face of the person, or the tiny piece of hair they left, and that one little sliver of the data space is what counts.
What would you do with infinite data? I submit that, for analysis, you’d throw most of the 10²⁵ bytes away.