The Long Tail discussion

July 3rd, 2008

The popular Long Tail concept by Chris Anderson was recently criticized by Anita Elberse, a marketing professor at Harvard’s business school. Lee Gomes at the WSJ mistakenly reported that as a debunking of the Long Tail myth.
Mr. Anderson gave an exceptionally polite response.
Wow. Before I noticed that Prof. Elberse is a marketing professor, I had become really suspicious of Harvard. I even started writing an extensive post before I discovered an excellent comment by Ali Partovi. I would sign every word of it.

Generally, it is absolutely misleading to study long-tail distributions in terms of average values. The average Homo sapiens has one testicle and one breast. Similarly, in a power-law distribution the average value is not typical and does not mean a lot. Prof. Elberse made that awful mistake; all further conclusions are just consequences.
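To make the point about averages concrete, here is a small sketch. It assumes a made-up catalogue in which the title ranked k sells top_seller / k copies (a Zipf law with exponent 1). The function names and the numbers are illustrative only and do not come from Anderson's or Elberse's data.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical catalogue: the title at rank k (1-based) sells top_seller / k copies.
std::vector<double> zipf_sales(std::size_t titles, double top_seller) {
    std::vector<double> sales(titles);
    for (std::size_t k = 0; k < titles; ++k)
        sales[k] = top_seller / static_cast<double>(k + 1);
    return sales;
}

// Arithmetic mean of the sales figures.
double mean(const std::vector<double>& v) {
    return std::accumulate(v.begin(), v.end(), 0.0) / v.size();
}

// Median (upper median for even sizes). Takes a copy because
// std::nth_element reorders its input.
double median(std::vector<double> v) {
    auto mid = v.begin() + v.size() / 2;
    std::nth_element(v.begin(), mid, v.end());
    return *mid;
}

// Fraction of titles that sell at least as much as the mean.
double share_at_or_above_mean(const std::vector<double>& v) {
    const double m = mean(v);
    const auto n = std::count_if(v.begin(), v.end(),
                                 [m](double x) { return x >= m; });
    return static_cast<double>(n) / v.size();
}
```

For zipf_sales(10000, 10000.0) the median title sells 2 copies, the mean is close to 9.8, and only about a tenth of the titles reach the mean. The "average title" describes almost none of the actual titles.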

diff algorithm in C++ (templated)

June 5th, 2008

I’ve implemented Myers’ diff in C++ using templates. Myers’ LCS is a classic diff algorithm implemented by GNU diffutils (in C), Fraser’s (Google’s) diff-match-patch (in Java/Python/JavaScript) and many others. This particular implementation is rather brief and, importantly, templated. So, it can be used over any random access container: std::string, wstring, vector<int>, vector<string> or whatever provides random access iterators and an element equality operator.
The program implements the linear-space variation of the algorithm, not the O(ND)-space Dijkstra-like BFS (and not those scary N×N-space matrix-based examples one might find on the Web).
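For readers who have not seen the algorithm, here is a minimal sketch of the greedy forward pass from Myers' paper, templated over random access iterators. It is not the code from the repository and not the linear-space variation described above. It keeps a single frontier array and returns only D, the number of insertions plus deletions. Recovering the edit script itself needs either the saved frontiers or the linear-space refinement. The name edit_distance is made up for this example.

```cpp
#include <cstddef>
#include <iterator>
#include <vector>

// Length D of the shortest edit script (insertions + deletions) that turns
// [a_begin, a_end) into [b_begin, b_end). Elements only need operator==.
template <typename RandomIt>
std::ptrdiff_t edit_distance(RandomIt a_begin, RandomIt a_end,
                             RandomIt b_begin, RandomIt b_end) {
    const std::ptrdiff_t n = a_end - a_begin;
    const std::ptrdiff_t m = b_end - b_begin;
    const std::ptrdiff_t max = n + m;
    const std::ptrdiff_t offset = max;  // shifts diagonal k into a valid index
    // v[offset + k] = furthest x reached so far on diagonal k, where k = x - y.
    std::vector<std::ptrdiff_t> v(2 * max + 2, 0);

    for (std::ptrdiff_t d = 0; d <= max; ++d) {
        for (std::ptrdiff_t k = -d; k <= d; k += 2) {
            std::ptrdiff_t x;
            if (k == -d || (k != d && v[offset + k - 1] < v[offset + k + 1]))
                x = v[offset + k + 1];      // move down: take an element from b
            else
                x = v[offset + k - 1] + 1;  // move right: drop an element from a
            std::ptrdiff_t y = x - k;
            // Follow the "snake": equal elements cost nothing.
            while (x < n && y < m && a_begin[x] == b_begin[y]) { ++x; ++y; }
            v[offset + k] = x;
            if (x >= n && y >= m) return d;  // reached the end of both sequences
        }
    }
    return max;  // not reached: d = n + m always suffices
}
```

With two std::string values a and b, the call is edit_distance(a.begin(), a.end(), b.begin(), b.end()). The same template works unchanged for vector<int> or vector<string>.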

To get the code: svn co svn://oc-co.org/TestDiff

TODO: a lot of things: diff cleanups, patching, whatever else I’ll need

P.S. Performance is acceptable: it does a per-char diff for “War and Peace” Book I (N=271942, D=4356) in 0.15 s, while GNU diff does a line-based diff in 0.04 s (N=5931, D=1404). Actually, GNU diff does it in 0.01 s without locales. My implementation wastes time reinstantiating iterators, while the C code uses pointer arithmetic, but… who cares? That’s good enough for now.

P.P.S. In terms of N*D, my run is 271942*4356/(5931*1404) = 142 times bigger, yet it takes only 0.15/0.01 = 15 times longer. Wow! Excellent!

TODO: do it with bidirectional iterators; that is more natural

Causal trees problem: doubles

March 25th, 2008

If two users concurrently insert the same letter at the same place, it doubles at merge. I.e.
both: aple
user1: apple user2: apple
merge: appple
Heuristic: if two atoms have equal content and equal predecessor => count as one
So,
1) Devil is in the details. What if “user1: apple user2: apple” ?
2) Is it really a problem?
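The doubling and the heuristic can be sketched in a few lines. Everything here is an assumption made for illustration: the Atom layout, the string ids, the "root" marker, and the rule that a later-inserted sibling is shown first. It is not the prototype's code.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical atom: one inserted character, attached to the atom it follows.
struct Atom {
    std::string id;    // unique per insertion, e.g. "u1:1"
    std::string pred;  // id of the predecessor atom, "root" for the first one
    char content;
};

// Plain merge: union of the two atom sets by id.
std::vector<Atom> merge(const std::vector<Atom>& a, const std::vector<Atom>& b) {
    std::vector<Atom> out = a;
    for (const Atom& x : b) {
        const bool known = std::any_of(out.begin(), out.end(),
            [&](const Atom& y) { return y.id == x.id; });
        if (!known) out.push_back(x);
    }
    return out;
}

// The heuristic: atoms with equal content and equal predecessor count as one.
// Only the first such atom is kept. Children of a dropped atom are NOT
// re-parented here, which is one of the details the heuristic leaves open.
std::vector<Atom> collapse_doubles(const std::vector<Atom>& atoms) {
    std::vector<Atom> out;
    for (const Atom& x : atoms) {
        const bool dup = std::any_of(out.begin(), out.end(),
            [&](const Atom& y) { return y.pred == x.pred && y.content == x.content; });
        if (!dup) out.push_back(x);
    }
    return out;
}

// Depth-first walk; among siblings, the later-inserted atom comes first.
void render_from(const std::vector<Atom>& atoms, const std::string& parent,
                 std::string& out) {
    for (auto it = atoms.rbegin(); it != atoms.rend(); ++it) {
        if (it->pred == parent) {
            out.push_back(it->content);
            render_from(atoms, it->id, out);
        }
    }
}

std::string render(const std::vector<Atom>& atoms) {
    std::string out;
    render_from(atoms, "root", out);
    return out;
}
```

Take the base text “aple” as atoms b1..b4 and let each user add a 'p' atom whose predecessor is b2. The plain merge renders “appple”, and the collapsed merge renders “apple”. The check compares predecessors, so two users who type the same letter after different atoms still produce a double. The two “+1” comments from the P.S. below would be collapsed if they shared a predecessor.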

P.S. 9 Apr. Although convergence is considered to be a virtue of a version control system, in the case of distributed wikis/forums it might easily be a misfeature. E.g. suppose a stereotypical situation of two users adding a “+1” comment to some posting. Those “+1”s, although identical, are not actually the same change.
Sometimes it is better for a program to be predictable and consistent than smart.

Mantra

March 22nd, 2008

The objective is to automate information propagation, just as hyperlinks automated associations and search engines automated search.

Wikipedia as an ant-hill

March 15th, 2008


Incidentally, I posted that to LJ.

P.S. another visualization: LJ as ant-hill

New write-up

January 18th, 2008

I just finished playing with the next prototype and dumped my thoughts and experience into a new article, “Causal trees: towards real-time read-write web”. Briefly, v4 had a problem with conflicts; a diff/patch approach is unacceptable for a distributed peer-to-peer environment of average users, so I implemented a simple OT-like algorithm (every letter/operation has an ID).
It is interesting how people instinctively build complex solutions; an enormous effort is needed to keep things simple.

Is Wikipedia declining/saturating?

October 13th, 2007

The rate at which edits were being made to Wikipedia articles appears to have peaked in February to April 2007 and declined since.

(source: Wikipedia, Dragons_flight)

Torvalds on networks of trust

September 8th, 2007


Linus Torvalds’ Google talk on git, the version control system. Quotes:

The way merging is done is the way real security is done, by a network of trust. If you have ever done any security work and it didn’t involve the concept of a network of trust, it was not security work; it was masturbation.
…we don’t know hundred people. We have five, seven, ten close personal friends…

Web 2.0 … The Machine is Us/ing Us

September 7th, 2007


via Marek Kopel

Bouillon 4 (“broth”) to be showcased at CSR 2007

September 2nd, 2007

Ekaterinburg, Russia, Ural State University, Computer Science Russia’ 2007, Wednesday, September 5, 13-00, room 248. V. Grishchenko. “Bouillon: a wiki-wiki social web”. Also at Springer LNCS Volume 4649/2007. By the way, the second half of the paper (the technical one) is completely obsolete; it is months old! :)

Broth the uberwiki