Certainly, projects like archive.org and some governmental data provide long(er)-term solutions, but my impression is that the Really Big Data is provided, with limited access, by corporations (e.g. Google, Facebook, etc.). Collecting the data may (or may not) be central to their business model, but providing access to that data for outside analysis certainly isn't.

While Google currently seems too big to fail, if big data is as revolutionary as you suggest, we need to be thinking about the long-term preservation of that data. Unfortunately, "long term" in internet years means 5–10 years, not 20, 50, or, more accurately, 200 years.

Some questions to consider (again, I'm not sure I grok the issue very well in the first place, so I'm lucky if these are valid questions–I won't presume to have any answers at all):

What sort of consequences derive from data being controlled almost exclusively by corporations and, more dangerously, by a single corporation? What sort of consequences does the fast pace of change in the online world hold for this sort of analysis long term? And what can we do to ensure wider, more permanent availability of the Data?

In my paleo view, you first think about simple harmonic motion, then about the damped oscillator, then about driven oscillations, then about systems with multiple basic frequencies. The natural language for thinking about these is calculus.

Yours is not a paleo view, but what you call the horse and what you call the cart may depend upon the application. For example, the discrete Fourier transform can be motivated purely as a linear transformation of finite-length vectors, and it is this form (shorn of physical terms like "frequency" and "time") that is useful in its several applications in modern discrete mathematics, such as the multiplication of integers and polynomials, or the active discipline of additive combinatorics. Indeed, I can say from personal experience that in several computer science curricula, the continuous version of the Fourier transform is set aside entirely as an interesting curiosity with applications in "other fields," in favor of the more readily applicable discrete version (which also has the advantage that the beautiful theory of Fourier transforms can be presented unencumbered by the delicate real-analytic arguments needed to justify the continuous version in several cases). This is precisely what myl seems to be advocating for the humanities too.
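To make that "purely linear" motivation concrete, here is a minimal sketch (my own illustration, not from the thread) showing the DFT as nothing but an n-by-n matrix applied to a vector, with no calculus in sight:

```python
import numpy as np

def dft_matrix(n):
    """The n-by-n DFT matrix F with entries F[j, k] = exp(-2*pi*i*j*k/n)."""
    j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return np.exp(-2j * np.pi * j * k / n)

x = np.array([1.0, 2.0, 3.0, 4.0])
X = dft_matrix(4) @ x                  # a plain matrix-vector product
assert np.allclose(X, np.fft.fft(x))   # agrees with the library FFT
```

Everything here is finite-dimensional linear algebra: the transform is a change of coordinates, and the fast algorithms are just clever factorizations of that matrix.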

However, as far as I am concerned, every human being should be taught calculus without leaving out any of the (ε, δ)-rigor. But out of consideration for my own physical integrity, I typically keep that opinion to myself.

[(myl) This is all true and good. The question is, what are the priorities for (say) a sociology or neuroscience student who is going to take two, maybe three semesters of math-like stuff? Over and over again I see grad students who took two semesters of calculus as undergraduates, none of which they remember very well, but who can't understand and implement a simple paper about linear dimensionality reduction via PCA or SVD, or design and implement a simple digital filter, and so on.]
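For readers who haven't seen it, the kind of linear dimensionality reduction myl mentions really is just a few lines of linear algebra. A hedged sketch (my own example data, not from any particular paper): PCA by taking the top right-singular vector of a centered data matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 points in 3-D that mostly vary along one direction
t = rng.normal(size=200)
data = np.outer(t, [3.0, 2.0, 1.0]) + 0.1 * rng.normal(size=(200, 3))

centered = data - data.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
scores = centered @ Vt[:1].T   # projection onto the first principal component

# the first component captures nearly all the variance
explained = s[0] ** 2 / np.sum(s ** 2)
```

No derivatives, no integrals: centering, one SVD, one projection.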

By way of analogy, you can learn fluent Spanish while living in an English-only community, from a good teacher for whom Spanish is not a mother tongue. But you would probably do better by moving to a Spanish-speaking country and having a native speaker for a teacher.

Again, this is not to justify a two-year regimen of calculus, complete with ε-δ language and techniques of integration.

I know that the DFT can be viewed as a unitary (or orthogonal, if you wish) transformation. But it is harder to understand why anybody would want to do that without some knowledge of harmonic oscillators and differential equations.

At least as far as the DFT is concerned, its uses in areas other than the study of oscillations abound. One example is the fast multiplication of integers.

[(myl) Because convolution and multiplication are dual operations in the "time" and "frequency" spaces… again no calculus is needed to explain (or use) this.]

[(myl) Because sinusoids are eigenfunctions of linear shift-invariant systems! This is just as true of vectors as it is of continuous functions.]
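The discrete version of that eigenfunction claim can be checked directly. A sketch (my own, under the usual conventions): the DFT basis vectors are eigenvectors of the circular-shift matrix, the prototypical linear shift-invariant operator on finite vectors.

```python
import numpy as np

n = 8
shift = np.roll(np.eye(n), 1, axis=0)          # circular shift by one sample
k = 3
v = np.exp(2j * np.pi * k * np.arange(n) / n)  # k-th DFT basis vector

# shifting the sampled sinusoid only multiplies it by a scalar (its eigenvalue)
eigval = np.exp(-2j * np.pi * k / n)
assert np.allclose(shift @ v, eigval * v)
```

The same computation with any other linear shift-invariant matrix (any circulant matrix) gives the same conclusion with a different eigenvalue, which is exactly why the DFT diagonalizes such systems.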

It is also not clear how to explain the notion of a spectral window, the Nyquist theorem, resonance, and probably a million other things I cannot think of right away.

[(myl) It's easier than you might think. The Nyquist theorem is basically just about degrees of freedom — see e.g. here, especially the "Subspace account of sampling" section at the end…]
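One can see the degrees-of-freedom point numerically. A small sketch (my own illustration, not from the linked material): a 9 Hz sinusoid sampled at 8 Hz is indistinguishable at the sample points from a 1 Hz sinusoid — the samples simply cannot pin down more frequencies than they have degrees of freedom.

```python
import numpy as np

fs = 8.0                     # sampling rate: below Nyquist for a 9 Hz tone
n = np.arange(32)
t = n / fs

high = np.sin(2 * np.pi * 9.0 * t)   # 9 Hz signal
low = np.sin(2 * np.pi * 1.0 * t)    # its 1 Hz alias (9 - 8 = 1)

assert np.allclose(high, low)        # identical at every sample point
```

No limits or integrals are needed for this observation, only counting: 32 samples span a 32-dimensional space, and sinusoids beyond the Nyquist rate fold back onto basis vectors already in it.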

I am sure it can be done. After all, it is possible to teach a bear to ride a bicycle. Isaac Newton is said to have explained all his work on dynamics using geometry. Werner Heisenberg invented quantum mechanics without knowing what a matrix is.

None of this is to deny that the linear-algebra viewpoint is important and useful.

[(myl) Frankly, I feel that much of this stuff is easier and more natural if taken from the discrete side first. See also the DSP First textbook. Not to disrespect calculus, but…]

"These days, a phonetician (or musician or sound engineer) mostly needs to understand the Discrete Fourier Transform, which is best thought of in linear-algebra terms as a linear transformation (a rigid rotation, in fact) of coordinates,"

Perhaps off topic, but let me add that the continuous-time Fourier transform is **also** best viewed as a change of basis in the space of functions over the reals, from the basis of Dirac delta functions to the basis of sinusoids. The calculus involved is just computation (analogous to the "multiply and sum" step in the discrete Fourier transform).
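The discrete analogue of that change-of-basis picture is easy to check. A sketch (my own illustration): the DFT of a unit impulse — the discrete stand-in for a Dirac delta — is exactly a sampled complex exponential, i.e. the delta basis maps onto the sinusoid basis.

```python
import numpy as np

n = 8
delta = np.zeros(n)
delta[3] = 1.0                 # discrete "delta function" at position m = 3

spectrum = np.fft.fft(delta)
# a sampled complex exponential: exp(-2*pi*i*k*m/n) for k = 0..n-1
expected = np.exp(-2j * np.pi * 3 * np.arange(n) / n)
assert np.allclose(spectrum, expected)
```

The "multiply and sum" step referred to above is literally the inner product of the signal with each sinusoid, which in the continuous case becomes an integral.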

Prof. Liberman: Welcome to Berkeley!

I am also sure that Prof. Liberman, as a phonetician, knows the value of calculus. Would it be nice to see a smart grad student who wants to work with you, but does not have the vaguest idea about the Fourier transform?

[(myl) These days, a phonetician (or musician or sound engineer) mostly needs to understand the Discrete Fourier Transform, which is best thought of in linear-algebra terms as a linear transformation (a rigid rotation, in fact) of coordinates, imposed by matrix multiplication on signals viewed as vectors, with sinusoids (expressed as complex exponentials) as basis functions in the appropriate arrangement in the matrix. See here for how I generally teach it — no calculus involved.]
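The "rigid rotation" remark can be verified in a few lines. A hedged sketch (mine, not from the linked teaching materials): the normalized DFT matrix is unitary, so it preserves lengths — which is Parseval's relation in linear-algebra clothing.

```python
import numpy as np

n = 8
j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
F = np.exp(-2j * np.pi * j * k / n) / np.sqrt(n)   # normalized DFT matrix

# unitary: F times its conjugate transpose is the identity
assert np.allclose(F @ F.conj().T, np.eye(n))

# hence signal "length" (energy) is preserved under the rotation
x = np.random.default_rng(1).normal(size=n)
assert np.isclose(np.linalg.norm(F @ x), np.linalg.norm(x))
```

The complex exponentials sit in the rows/columns of F exactly as described in the comment: each basis function is one sampled sinusoid.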

Well not-gone-into, QET! You have failed to make your point with great aplomb.

When I was a grad student, trying to infer phylogeny from what little sequence data existed (protein sequences back then, not DNA) was a rather esoteric exercise of interest to a smallish subset of the biological community. The math behind such inference was non-trivial even then, but it wasn't considered an essential curriculum item or skill set. Now it is impossible to ignore the importance of sequence data to this and many other fascinating questions. We teach basic introductions to sequence comparison and genomic data analysis at all levels from undergraduate seminars to continuing education courses for MDs. As D.O. surmises, there are more people who use the available tools in ignorance than there are front-runners who develop and validate new data-mining tools. But IMHO that is a serious problem, and though the standard tools will hopefully (sic :-) become more foolproof as the field develops, black box science is not what we should be aiming for.

One of the major problems I observe, and suffer from myself, with the current state of the huge genomic databases is the propagation of annotation errors. People continually do their best to attach useful annotations and commentary to the underlying data, but it is the nature of the endeavor that these inferences become out of date as additional data or newer analysis techniques become available. Yet the old, obsolete annotations acquire a life of their own and pollute subsequent queries and analyses. Are similar concerns likely to arise for "big data" repositories in the social sciences? I am reminded of some earlier LL posts about the vagaries of the metadata attached to books scanned by Google from large library collections. My thought is that some of the problems that genomics data repositories currently suffer from could have been avoided if sufficient thought had been given early on to how metadata is organized, validated, and curated. I could go on at length, but I'll leave it there. Anyhow, I'm sure it wouldn't hurt to consider in advance whether newly emerging big data repositories in other fields could learn from our growing pains.
