Notes on peer-to-peer for TAG

Henry S. Thompson
5 Mar 2007

1. Background

In a brief chat with Andrew Herbert, director of Microsoft Research Cambridge (UK), I mentioned the TAG and the issues we were thinking about wrt web architecture going forward, and we got on to p2p. He invited me to come to the lab and talk to some of his people, and I did so last week. I gave a talk about WebArch, and heard about p2p at some length from Christos Gkantsidis and Miguel Castro.

What follows is just a transcription of my notes -- I'll try to remember more about anything people want to drill on.

2. Notes

The current success of p2p for music/video/etc. involves the use of substantial bandwidth from university-hosted accounts.

The interesting technical problems for the future revolve around making p2p more like the OFWeb wrt proxying -- proxying to peers, restartable/chunked proxying.

Some ISPs (in UK and elsewhere) are using special-purpose hardware and software to cache p2p traffic (financial motive). Encryption defeats this, of course.

There is no standardisation wrt e.g. torrents, beyond de facto standards at the lowest data-transfer level. In particular, there is no interop wrt metadata, although many clients add value via idiosyncratic metadata when both ends of a pairing run the same client.

There is actual support for p2p built into Windows XP from SP1 onwards, and in Vista.

In some cases p2p clients have legitimate reasons for wanting anonymity.

Security concerns -- 'Sybil' attack (where one host pretends to be many distinct clients) -- defeats the robustness goal of distributed backup functionality. "URIs don't help" (?)
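
By way of illustration (my arithmetic, not theirs): if backup replicates each block on k randomly chosen identities, and an attacker has minted enough fake identities to hold a fraction f of them, then all k replicas land on attacker identities with probability f^k -- and inflating f costs the attacker essentially nothing.

    # Toy calculation of how replication fares against Sybil identities.
    # f = attacker's share of identities, k = replication factor (both made up).
    for f in (0.1, 0.5, 0.9):
        for k in (3, 5):
            print(f"f={f}, k={k}: P(all replicas on attacker) = {f ** k:.4f}")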

Structured (tracker-based) vs. unstructured (random number addresses) search/connection graph

Distributed hash tables, CAM/CAD(?); routing via random-number neighbourhoods.
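
My rough picture of the routing idea, sketched as a toy below (not any particular system): nodes pick random numeric IDs, content names are hashed into the same ID space, and a lookup is forwarded greedily to whichever known neighbour is numerically closest to the key.

    import hashlib
    import random

    ID_SPACE = 2 ** 16                 # toy identifier space

    def key_for(name):
        # Hash a content name into the same numeric space as node IDs.
        return int(hashlib.sha1(name.encode()).hexdigest(), 16) % ID_SPACE

    def distance(a, b):
        # Circular distance on the ID ring.
        return min((a - b) % ID_SPACE, (b - a) % ID_SPACE)

    # Toy overlay: 50 nodes with random IDs, each knowing 5 random others.
    random.seed(1)
    nodes = random.sample(range(ID_SPACE), 50)
    neighbours = {n: random.sample(nodes, 5) for n in nodes}

    def route(start, key, max_hops=20):
        # Greedy routing: hand the lookup to whichever known node is closest to the key.
        current = start
        for _ in range(max_hops):
            best = min(neighbours[current] + [current], key=lambda n: distance(n, key))
            if best == current:
                # No neighbour is closer; in a real DHT the routing tables are
                # structured so that this node is the one responsible for the key.
                break
            current = best
        return current

    print(route(nodes[0], key_for("some-block-name")))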

"CAD --> number into URI = vulnerability"

Many clients are download only (see the university bandwidth claim above).

Security has been less of a concern so far, because it's early days, and most users are conscious that they're breaking the law, so can't complain.

The only widely reported malicious behaviour to date has actually been from copyright owners, distributing broken data.

Distributed backup/archive and web cache are attractive possible functionalities.

Hashes (to confirm integrity) are important but difficult for p2p.
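
The mechanics as I understand them (the sketch below is mine, in the spirit of what clients do, not any actual wire format): the publisher ships a list of per-block hashes alongside the content's name, and a receiver checks each block against its hash before using or forwarding it. The difficulty for p2p is getting that hash list over a channel you actually trust.

    import hashlib

    BLOCK_SIZE = 256 * 1024            # 256 KiB blocks -- an arbitrary choice

    def block_hashes(data):
        # Split content into fixed-size blocks and hash each one.
        return [hashlib.sha1(data[i:i + BLOCK_SIZE]).hexdigest()
                for i in range(0, len(data), BLOCK_SIZE)]

    def verify_block(index, block, hashes):
        # Check one received block against the published hash list.
        return hashlib.sha1(block).hexdigest() == hashes[index]

    data = b"x" * (3 * BLOCK_SIZE)     # stand-in for a shared file
    hashes = block_hashes(data)
    assert verify_block(1, data[BLOCK_SIZE:2 * BLOCK_SIZE], hashes)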

LOCKSS is an existing system (it assumes goodwill, but p2p in general cannot do so).

Indexing is a major unsolved problem.

MSoft developed an interesting solution to the "last block" problem -- clients don't ship blocks, but linear combinations of all the blocks they have.
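
As I understood it (the sketch below is my own toy over GF(2), i.e. plain XOR; the real scheme uses random linear coding over a larger field): a peer sends random combinations of whatever blocks it holds, each tagged with the coefficient vector that produced it, and a receiver reconstructs the originals by Gaussian elimination once it has enough independent combinations -- so no single "last block" is ever the bottleneck.

    import random

    BLOCK_SIZE = 4                     # tiny blocks, for illustration only

    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    def encode(blocks):
        # One coded block: a random XOR-combination of the blocks this peer
        # holds, tagged with the 0/1 coefficient vector that produced it.
        coeffs = [random.randint(0, 1) for _ in blocks]
        if not any(coeffs):
            coeffs[random.randrange(len(coeffs))] = 1   # avoid the all-zero combination
        payload = bytes(BLOCK_SIZE)
        for c, b in zip(coeffs, blocks):
            if c:
                payload = xor(payload, b)
        return coeffs, payload

    def decode(received, n):
        # Gaussian elimination over GF(2); returns the n original blocks, or
        # None if we don't yet hold n linearly independent combinations.
        rows = [(list(c), bytes(p)) for c, p in received]
        for col in range(n):
            pivot = next((i for i in range(col, len(rows)) if rows[i][0][col]), None)
            if pivot is None:
                return None
            rows[col], rows[pivot] = rows[pivot], rows[col]
            for i in range(len(rows)):
                if i != col and rows[i][0][col]:
                    rows[i] = ([a ^ b for a, b in zip(rows[i][0], rows[col][0])],
                               xor(rows[i][1], rows[col][1]))
        return [rows[i][1] for i in range(n)]

    original = [bytes([i]) * BLOCK_SIZE for i in range(1, 5)]   # four source blocks
    received = []
    while True:
        received.append(encode(original))      # keep pulling coded blocks
        result = decode(received, len(original))
        if result is not None:
            break
    assert result == original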

Today's web caches don't support range requests -- blocks need host-independent names!

eMule does some clever URI manipulation to engage vanilla web caching mechanisms.

p2p and HTTP GET are both 'pull' mechanisms, so they should be able to co-operate.

Could (home) DSL routers open port 80 and do virtual hosting? [This in order to allow HTTP GET for p2p, using per-block URIs and redirection -- failed to get snap of whiteboard :-( ]
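
A guess at how the whiteboard scheme might look (the names, paths and ports below are made up): blocks get host-independent URIs derived from their hashes, and a small HTTP front end answers a GET for /block/<hash> with a redirect to a peer currently offering that block, so vanilla caches and clients can take part.

    import hashlib
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Hypothetical directory: block hash -> peers currently offering it.
    BLOCK_LOCATIONS = {
        hashlib.sha1(b"example block").hexdigest(): ["http://peer1.example:8080",
                                                     "http://peer2.example:8080"],
    }

    class BlockRedirector(BaseHTTPRequestHandler):
        def do_GET(self):
            # Expect host-independent paths of the form /block/<sha1-of-block>.
            if not self.path.startswith("/block/"):
                self.send_error(404)
                return
            digest = self.path[len("/block/"):]
            peers = BLOCK_LOCATIONS.get(digest)
            if not peers:
                self.send_error(404)
                return
            # Redirect to a peer that has the block; the block's own URI stays
            # host-independent, so caches along the way can key on it.
            self.send_response(302)
            self.send_header("Location", f"{peers[0]}/block/{digest}")
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("", 8000), BlockRedirector).serve_forever()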

Apache 2+ server already has modules to convert HTTP GET of v. large files into p2p delivery (HST doesn't understand how this works)

Amazon S3 does something similar, and traffic to/from such facilities (web-based BLOB storage) is growing.

It would sometimes be helpful to have standard access to cache metadata.