Chris Anderson’s popular Long Tail concept was recently criticized by Prof. Anita Elberse, a marketing professor at Harvard Business School. Lee Gomes at the WSJ mistakenly described that as a debunking of the Long Tail myth.
Mr Anderson gave an exceptionally polite response.
Wow. Before I noticed that Prof. Elberse is a marketing professor, I became really suspicious about Harvard. I even started to write an extensive post before I discovered an excellent comment by Ali Partovi. I would sign every word of it.
Generally, it is absolutely misleading to study long-tail distributions in terms of average values. The average Homo sapiens has one testicle and one breast. Similarly, in a power-law distribution the average value is not typical and does not mean much. Prof. Elberse made that awful mistake; any further conclusions are just consequences.
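To make the point about averages concrete, here is a minimal sketch. It assumes a made-up catalogue in which the title at rank r sells top_sales / r copies (a Zipf-like law); the function names and the numbers are mine, not taken from Prof. Elberse's data or from Anderson's.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical catalogue: the title at rank r sells top_sales / r copies.
std::vector<double> zipf_sales(int titles, double top_sales) {
    assert(titles > 0);
    std::vector<double> sales;
    for (int rank = 1; rank <= titles; ++rank)
        sales.push_back(top_sales / rank);
    return sales;
}

// Arithmetic mean: total sales divided by the number of titles.
double mean(const std::vector<double>& v) {
    assert(!v.empty());
    return std::accumulate(v.begin(), v.end(), 0.0) / v.size();
}

// Median: the middle value (or the mean of the two middle values).
double median(std::vector<double> v) {
    assert(!v.empty());
    std::sort(v.begin(), v.end());
    const std::size_t n = v.size();
    return n % 2 ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;
}
```

For zipf_sales(1000, 1e6) the mean is roughly 7,500 copies while the median is roughly 2,000, so the "average" title sells several times more than the typical one. The mean is dragged up by a handful of hits at the head.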
I’ve implemented Myers’ diff in C++ using templates. Myers’ LCS is a classic diff algorithm implemented by GNU diffutils (in C), Neil Fraser’s (Google’s) diff-match-patch (in Java/Python/JavaScript) and many others. This particular implementation is rather brief and, importantly, templated. So it can be used over any random access container: std::string, wstring, vector<int>, vector<string> or whatever provides random access iterators and an element equality operator.
The program implements the linear-space variation of the algorithm, not the O(ND)-space Dijkstra-like BFS (and not those scary O(N²)-space matrix-based examples one might find on the Web).
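To give a feel for what "templated over any random access container" means, here is a minimal sketch. It is not the code from the repository and not the linear-space variation: it is the basic greedy forward search from Myers' paper, and it only returns D, the length of the shortest edit script (insertions plus deletions). The name edit_distance is my own choice.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Length D of the shortest edit script (insertions + deletions) turning a
// into b. Container needs size(), operator[] and comparable elements.
template <class Container>
int edit_distance(const Container& a, const Container& b) {
    const int n = static_cast<int>(a.size());
    const int m = static_cast<int>(b.size());
    const int max = n + m;
    // v[k + offset] = furthest x reached so far on diagonal k (k = x - y).
    const int offset = max + 1;
    std::vector<int> v(2 * max + 3, 0);
    for (int d = 0; d <= max; ++d) {
        for (int k = -d; k <= d; k += 2) {
            int x;
            if (k == -d || (k != d && v[k - 1 + offset] < v[k + 1 + offset]))
                x = v[k + 1 + offset];      // step down: insert an element of b
            else
                x = v[k - 1 + offset] + 1;  // step right: delete an element of a
            int y = x - k;
            assert(x >= 0 && y >= 0);       // the search never goes above or left of the grid
            while (x < n && y < m && a[x] == b[y]) { ++x; ++y; }  // follow the snake
            v[k + offset] = x;
            if (x >= n && y >= m) return d;  // reached the bottom-right corner
        }
    }
    return max;  // not reached: d = n + m always suffices
}
```

The same function works for edit_distance(std::string("aple"), std::string("apple")) and for two vector<string> objects holding the lines of a file. That is the whole point of templating: per-char and per-line diffs share one implementation.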
To get the code: svn co svn://oc-co.org/TestDiff
TODO: a lot of things: diff cleanups, patching, whatever else I’ll need
P.S. Performance is acceptable: it does a per-char diff of “War and Peace” Book I (N=271942 D=4356) in 0.15s, while GNU diff does a line-based diff in 0.04s (N=5931 D=1404). Actually, GNU diff does it in 0.01s without locales. My implementation wastes time reinstantiating iterators, while the C code uses pointer arithmetic, but… who cares? That’s good enough for now.
If two users concurrently insert the same letter at the same place, it is doubled at merge. I.e.
both: aple
user1: apple user2: apple
merge: appple
Heuristic: if two atoms have equal content and equal predecessor => count as one
So,
1) The devil is in the details. What if “user1: apple user2: apple”?
2) Is it really a problem?
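The heuristic can be sketched as follows. Everything here is an assumption made for illustration: the Atom layout, the integer ids, the "newer sibling goes first" ordering, and the rule that only atoms new on both sides may be collapsed. It also assumes each replica lists a predecessor before the atoms that follow it. It is not the code of any of the prototypes.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

struct Atom {
    int id;    // unique id of this insertion (hypothetical numbering scheme)
    int pred;  // id of the atom it was typed after; 0 means "start of text"
    char ch;
};

using Children = std::map<int, std::vector<Atom>>;

// Depth-first walk: emit each atom, then everything typed after it.
static void walk(int id, const Children& children, std::string& out) {
    auto it = children.find(id);
    if (it == children.end()) return;
    for (const Atom& a : it->second) {
        out += a.ch;
        walk(a.id, children, out);
    }
}

// Render atoms to text; among siblings the newer (higher id) atom goes first.
std::string render(const std::vector<Atom>& atoms) {
    Children children;
    for (const Atom& a : atoms) children[a.pred].push_back(a);
    for (auto& kv : children)
        std::sort(kv.second.begin(), kv.second.end(),
                  [](const Atom& x, const Atom& y) { return x.id > y.id; });
    std::string text;
    walk(0, children, text);
    return text;
}

// Add to `mine` every atom of `theirs` that I do not have yet. With `dedupe`,
// an incoming atom is dropped if one of my own new atoms has the same content
// and the same predecessor (the heuristic).
std::vector<Atom> merge(const std::vector<Atom>& mine,
                        const std::vector<Atom>& theirs, bool dedupe) {
    std::set<int> my_ids, their_ids;
    for (const Atom& a : mine) my_ids.insert(a.id);
    for (const Atom& a : theirs) their_ids.insert(a.id);

    // Atoms only I have, keyed by (predecessor, content).
    std::map<std::pair<int, char>, int> my_new;
    for (const Atom& a : mine)
        if (!their_ids.count(a.id))
            my_new.emplace(std::make_pair(a.pred, a.ch), a.id);

    std::vector<Atom> out = mine;
    std::map<int, int> alias;  // dropped atom id -> id of the twin that was kept
    for (Atom a : theirs) {
        if (my_ids.count(a.id)) continue;            // already known
        auto al = alias.find(a.pred);
        if (al != alias.end()) a.pred = al->second;  // re-parent onto the kept twin
        auto twin = my_new.find(std::make_pair(a.pred, a.ch));
        if (dedupe && twin != my_new.end()) {
            alias[a.id] = twin->second;
            continue;
        }
        out.push_back(a);
    }
    return out;
}
```

With a base of "aple" and both users adding a 'p' after the existing 'p', a plain merge renders "appple" and a merge with dedupe renders "apple". If the two letters are typed after different atoms, the predecessors differ and this sketch keeps both, heuristic or not.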
P.S. 9 Apr Although convergence is considered to be a virtue of a version control system, in the case of distributed wikis/forums it might easily be a misfeature. E.g. suppose a stereotypical situation of two users adding a “+1” comment to some posting. Those “+1”s, although identical, are not actually the same change.
Sometimes it is better for a program to be predictable and consistent than smart.
I just finished playing with the next prototype and dumped my thoughts and experience into a new article, “Causal trees: towards real-time read-write web”. Briefly, v4 had a problem with conflicts; the diff/patch approach is unacceptable for a distributed peer-to-peer environment of average users, so I implemented a simple OT-like algorithm (every letter/operation has an ID).
It is interesting how people instinctively build complex solutions; an enormous effort is needed to keep things simple.
Linus Torvalds’ Google talk on git, the version control system. Quotes:
The way merging is done is the way real security is done, by a network of trust. If you have ever done any security work and it didn’t involve the concept of network of trust it was not a security work; it was a masturbation.
…we don’t know hundred people. We have five, seven, ten close personal friends…