Sunday, February 18, 2007

Misinformationweek

Today I happened to read this article on Informationweek entitled "Red Hat Joins Microsoft Interoperability Group". Amusingly, I saw the following quote:
An increasing number of Linux distributors appear to be concluding that their best chance for achieving growth is to ensure that their products are compatible with Windows, which holds by far the largest market share in desktop and server operating systems despite recent inroads by the Linux community. According to tracking site W3Counter.com, 85% of the world's Web sites run on Windows, while just 2% run on Linux.
(Emphasis mine.)

Well, those numbers are simply wrong. Luckily the article gives the source for the figures. A quick check reveals that W3Counter (as the name implies, a web counter service) does indeed report that... 85% of visitors to websites are using Windows XP, while 2% use Linux (or at least, of visitors to websites that use a W3Counter counter). So here we have the origin of the 85% / 2% figures: they describe the operating systems of visitors, not the servers that the websites run on. (Even then, 85% isn't the reported percentage of users running Windows; it's the percentage of users running Windows XP. Add Windows 2000 and 98, and you get 92%; add other versions, and the number may be higher, since W3Counter only reports "<1%" for them.)

An email has been sent to the article's author. Anyhow, mistakes will happen, as we all know (I, for one, make more than my share). With the sheer volume of technology news reports, errors are bound to creep in.

Sunday, February 11, 2007

Update: % of "GPL 2 only" in Linux

It has been brought to my attention that something similar to the test I reported on in my last post has already been carried out. Turns out that Linus himself did such a thing (scroll down to the bottom).

Comparing the two, we see that Linus estimates 33% of the kernel being "GPL 2 or above", while I estimated 40%. It is nice that the two estimates are fairly close; I think we can assume that they are not far off from the truth.

Linus' estimate is smaller, however, probably because of his methodology:
[torvalds@g5 linux]$ git-ls-files '*.c' | wc -l
7978
[torvalds@g5 linux]$ git grep -l "any later version" '*.c' | wc -l
2720
As you can see, Linus probably wasted around 5 seconds of his time doing this, while I on the other hand wasted around 2-3 hours. However, mine is more accurate, since I take into consideration possibilities like "any later version" being cut off at a line break (or at least some of the ways in which that can occur). This may explain why Linus' estimate is smaller; he misses a few files. But not very many, it seems.

Another difference is that Linus counts the number of files, while I sum the file sizes. You can argue either way which is more 'correct' (I should probably have calculated both, but having already wasted some 20 times more of my life than Linus, I think I've wasted enough already).

More important is that I report on other categories, like the files that say only "GPL" but don't specify a version, which is interesting information, I think.

Anyhow, it was interesting to see this.

Saturday, February 10, 2007

How much Linux Kernel code is GPL 2 only? At most 60%

Linus and other Linux kernel developers have made it very clear that the kernel will continue to use the GNU GPL version 2, and not move to the GPL 3. But the fact that the kernel is 'effectively' GPL 2 does not mean that all of its code is GPL 2 only - by this I mean that some of the code may be dual-licensed. To get a working Linux system, you must of course use the GPL 2 license. But with dual-licensed code you can use another license when you use it somewhere else, say if you want to use the code in another kernel (perhaps OpenSolaris). But how much of the Linux kernel code is dual-licensed? That is what I tried to find out.

I am not a lawyer. But I do meddle with code. So, to investigate this, I wrote a short program to scan the Linux kernel code (version 2.6.20) and check what licenses appear in it. The code is in Python. If anyone is interested, I am willing to release it. Update: Following requests, the source code is available here.

You may be tempted to skip to the results, but you should probably read the following explanations of what exactly I did first.

Basically, the program I wrote reads the license statements at the beginning of each file, and tries to match them against patterns. This is extremely 'hackish'; I just randomly sampled a large number of files and kept patching my code until it worked well. Given this method, please note that the results are only an estimate. I did not read each file myself, and while I manually checked a large number of samples, there is room for some statistical error. However, I do believe that the results are more or less indicative of the true numbers.

I scanned only source files, not headers (hopefully I am not mistaken in doing so). I decided to measure the 'amount' of code under each license by, well, the amount of code - measured in file sizes. So, the total sum of file sizes is what counts, not the number of files or the number of lines in them.
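
As a rough sketch of this measurement (an illustration, not the program used for the post), the following walks a source tree, looks only at `.c` files, and adds each file's size in bytes to the total of whatever category a classifier assigns to it. The function name, the `classify` callback and the `header_chars` cutoff are all hypothetical.

```python
import os
from collections import Counter
from typing import Callable


def bytes_per_category(
    root: str,
    classify: Callable[[str], str],
    header_chars: int = 4096,
) -> Counter:
    """Sum the sizes of all .c files under root, grouped by license category."""
    totals: Counter = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".c"):
                continue  # source files only, headers are skipped
            path = os.path.join(dirpath, name)
            # latin-1 maps every byte to a character, so decoding cannot fail.
            with open(path, "r", encoding="latin-1") as handle:
                header = handle.read(header_chars)
            totals[classify(header)] += os.path.getsize(path)
    return totals
```

Counting files instead of bytes would only require replacing `os.path.getsize(path)` with `1`.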

I separated the files into 4 categories:

  1. GPL 2 only. These are files that say that only the GPL 2 may be used.
  2. GPL 2 or above. These files say explicitly that the GPL 2 may be used, or any later version. I included GPL/BSD dual-licensed code here, since it is (I presume) going to be compatible with the GPL 3.
  3. GPL, unspecified. These files just say "licensed under the GPL", or such (see examples below), without specifying any version number. There is some debate about what this means. Some believe that such code can use any GPL license (2, 3, etc.), while others believe otherwise, particularly since Linus has a general statement (in the COPYING file), saying that
    "Also note that the only valid version of the GPL as far as the kernel is concerned is _this_ particular version of the license (ie v2, not v2.2 or v3.x or whatever), unless explicitly otherwise stated."
    (Interestingly, Linus deliberately makes room for dual-licensing.) Now, as mentioned before, I am not a lawyer, so I do not know the status of this issue. However, when a source file said "GPL, see the file COPYING" (or such), I treated this as "GPL 2 only". So things fall under "unspecified" only when all a file says is "GPL'ed", with no further explanation.
  4. Other. Some files didn't mention a license at all; those fall into this category (perhaps they fall under the GPL 2, due to the COPYING file?). Esoteric license phrasings may also end up here (there are few of these; I worked to get my code working with the vast majority of files).
Ok, after much introduction, here are the numbers for the Linux kernel, version 2.6.20:

GPL 2 only: 32,215,150 bytes
GPL 2 or above: 60,637,907 bytes
GPL, unspecified: 19,773,264 bytes
Other: 43,762,840 bytes


A first remark must acknowledge the total size of all of these numbers: the Linux kernel is enormous (as I guess we would expect for the operating system that supports the most hardware out of the box). Now, as for the various licenses: there seems to be quite a lot of "GPL 2 or above" code, even if we do assume that the 'unspecified' and 'other' sections are "GPL 2 only" - almost 40% of the kernel code is explicitly GPL 2 or above.
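
For anyone who wants to check the arithmetic, here is a short calculation using only the four byte counts reported above; the variable names are just for illustration.

```python
sizes = {
    "GPL 2 only": 32_215_150,
    "GPL 2 or above": 60_637_907,
    "GPL, unspecified": 19_773_264,
    "Other": 43_762_840,
}

total = sum(sizes.values())
shares = {name: 100 * size / total for name, size in sizes.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1f}%")

# If everything that is not explicitly "GPL 2 or above" were GPL 2 only,
# this is the largest share that "GPL 2 only" code could have.
upper_bound_gpl2_only = 100 - shares["GPL 2 or above"]
print(f"GPL 2 only, at most: {upper_bound_gpl2_only:.1f}%")
```

The "GPL 2 or above" share comes out a little under 39%, which leaves a little over 61% as the ceiling for "GPL 2 only"; the "at most 60%" in the title is that figure rounded.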

So, it seems that a GPL 3 operating system should be able to use a large part of the Linux kernel code. I am of course thinking mainly of OpenSolaris here (which would be interested mainly in device driver code, and not the rest; perhaps later I'll run the program just on the drivers). The implications of OpenSolaris using the GPL 3 were discussed in my previous post, but I didn't have any figures to talk about back then. But now, the picture is somewhat clearer. We may have interesting times ahead.


Notes:

When sifting through the kernel code, I noticed several things that caught my eye. Here are a few. Emphasis is always my own (but how would text files be emphasized, anyhow). I admit that these notes are mostly boring, except for the final one, which is amusing. You may want to skip ahead.

Let's get started. Some licenses are, well, confused. For example, /arch/arm/mach-integrator/integrator_cp.c contains
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License.
- and that is where the license ends. Either version 2, or what, I asked myself. Perhaps this is the result of some copy&pasting gone wrong. In any case I counted these as "GPL 2 only".

Some files are extremely brief in their licensing. For example, /Documentation/pcmcia/crc32hash.c states
crc32hash.c - derived from linux/lib/crc32.c, GNU GPL v2
Pretty concise. Of course, I counted this as GPL 2 only. Other phrasings are brief but specify no version: /Documentation/networking/ifenslave.c has
This file is under the GPL.
More extreme examples write only "GPL'd". These fell under "GPL, unspecified".

Some files are extremely specific in their licenses, e.g. /drivers/net/atp.c writes
This software may be used and distributed according to the terms of the GNU General Public License (GPL), incorporated herein by reference. Drivers based on or derived from this code fall under the GPL and must retain the authorship, copyright and license notice. This file is not a complete program and may only be used when the entire operating system is licensed under the GPL.
Very detailed - but no GPL version is given. So this is 'GPL, unspecified'.

Worthy of note are other dual-licensing schemes. I saw code that was both BSD and GPL, and code that was both GPL and MIT/XFree86. However, the vast majority is GPL, of one sort or another. Yet, there was one amusing exception: /drivers/md/dm-log.c has the following license:
/*
* Copyright (C) 2003 Sistina Software
*
* This file is released under the LGPL.
*/
LGPL, and not GPL. Are there no legal issues with that? Turns out that there aren't, since LGPL code can be converted to GPL as needed.

Friday, January 19, 2007

GPL3/OpenSolaris

Another day, another rumor that Sun will be releasing OpenSolaris under the GPL version 3. Personally, I think they'll do it, mainly because Sun have been getting pretty smart recently. Why would this be a clever move on Sun's part? Well, part of the Linux kernel code is licensed as "GPL 2 only" (in particular, Linus's code), while part is licensed as "GPL 2, or any later version". That means that a GPL 3 OpenSolaris would be able to incorporate some parts of the Linux kernel (by 'versioning' them up to GPL 3), while the Linux kernel could not use any OpenSolaris code in return. This one-way code flow could provide OpenSolaris with a lot of momentum.

Actually, the 'one-way code flow' issue is kind of amusing, given OpenSolaris's past. Rumor had it that the CDDL, the current license under which OpenSolaris is released, was chosen precisely because it wasn't compatible with the GPL, perhaps because Sun were afraid that Linux would benefit from OpenSolaris code. The fact that Linux isn't moving to the GPL 3 anytime soon allows the much craftier possibility of OpenSolaris using (some) Linux code, while Linux can't benefit from OpenSolaris at all.

Another advantage to using the GPL 3 is that a core part of the FLOSS crowd prefer the GPL 3 to the GPL 2, especially since the Microsoft-Novell deal. OpenSolaris being GPL 3 while the Linux kernel is only GPL 2 may convince them to switch. While not necessarily a majority in the FLOSS world, this group is highly influential, and could jumpstart the process of enlarging OpenSolaris's user base.

Therefore, all things considered, this would be a pretty smart move on Sun's part. Mr. Torvalds probably doesn't have anything to worry about in the short term, as Linux is highly successful and widely-used. But from a strategic point of view, a simple choice of license for OpenSolaris may have significant long-term implications.

Saturday, January 13, 2007

Set XP Free

Copyright law has its place, but its application to the modern digital world, and to software in particular, is pretty awkward. I'll give an example of this awkwardness, one which will, sadly, impact virtually everyone using a computer.

First, we should note that copyright isn't really a 'right', in the sense that "freedom of religion" is a right. Copyright prevents people from doing something, namely copying. Only the creator of a copyrighted work can authorize people to create copies of it; in this way, the creator can benefit from his creation (by charging for the permission to make copies). However, copyright could have been phrased in a different way: everyone could have been allowed to copy anything, so long as they do so while making some standard payment to the creator. Cover versions of songs work this way: anyone can make a cover version (you can't be prevented from doing so, and you don't need to give notification in advance or request permission), but you do need to send some fixed amount of money to the writer of the original song.

So, there are two possible copyright models: (1) you can't copy without permission, and (2) you can only copy if you pay the creator a standard sum. What is the difference between them? Well, in the first one creators have more freedom in negotiating the amount of money paid to them. But there is a more subtle difference that doesn't really come up in the 'normal' application of copyright law (i.e. books, music): under current copyright law, the creator can decide that no more copies of the creation will be made - simply by not authorizing anyone to make copies. In this way the creator can, in effect, prevent society from benefiting from the creation (although existing copies can continue to circulate). Why would anyone do this? Well, normally speaking no one would - authors want their books to be published (generally both because they want people to read them and also because they want to get paid). But in the software world, this isn't the case. Once Windows Vista is released, Windows XP will no longer be sold; no one will be authorized to make copies of it. Windows XP will, using copyright law, be 'killed off' by its creator, thus forcing consumers on their next purchase to buy a different operating system - one which may be less suitable for some people's needs (for example, Vista requires more expensive hardware than XP).

Was copyright law meant to allow this sort of thing? Surely not; even if lawmakers many years ago thought of the possibility of an author who decides to no longer allow his or her book to be published, they surely scoffed at the idea and quickly forgot it. Not that they were short-sighted - no one could have foreseen the digital revolution that we are currently undergoing. But if only they had adopted the second model - where anyone can copy anything, while paying a standard fee - people living today would have been better off.

Since the second model is just speculation at this point, let's speculate a little more. Imagine a world in which, once you no longer sell a product, you cannot prevent others from selling it. The rationale here is that "if you don't want something, let other people have it". In that setting, a copyrighted work that is no longer being sold would be set free into the public domain. In particular, the Windows XP code wouldn't gather dust in a Microsoft backup tape; it would be put to use for the good of everyone. Sadly, we don't live in such a world: XP will not be sold anymore, and new computers will be forced to run Vista, whether it is what the customer would have wanted or not.

Let's consider the concept "if you don't want something, let other people have it" from the previous paragraph, which I believe is a reasonable ethical stand. This is related to another personal conviction of mine regarding copyright law, which I summarize as "if money can be made from something, then the person who created it should be the one doing so": when at all possible, I believe that people that produce content should be able to make a living from doing so, and the test for 'at all possible' should be whether anyone can make money from it, in a reasonable way. For example, if a lyrics site makes money from advertisements, then that money should have gone to the people who wrote the songs. But in this age, when anyone can type the lyrics of a song and send it to their friends via email, or post it to usenet, etc., I don't think we should assume that making money from lyrics is automatic - perhaps it is, perhaps it isn't. Whatever money can be made should go to the artists, but trying to prevent people from casually sharing lyrics just doesn't make sense.

Hopefully a day will come when copyright law is modernized. Sadly, I see no indication of that happening anytime soon.

Wednesday, November 22, 2006

Is Bittorrent Monopolized?

Generally speaking, Bittorrent is a symbol of internet freedom, allowing anyone to download pretty much whatever they want. But I suspect that freedom is only skin-deep. What follow are the conclusions I have arrived at from my experiences as a user of various bittorrent clients, as well as doing some development in that area. Put simply, it seems that we may very well have an open protocol being monopolized in a surprisingly successful manner. This is possible because (1) even an open protocol can be 'controlled' by extending it in a nonstandard way, and (2) in certain circumstances even an open protocol can be implemented in an 'unfair' manner.

In all my experience (which amounts to simply looking at the peers I am connected to as I download) there are only two clients which are widely-used: Azureus and µtorrent (so perhaps I should have said 'duopoly', or 'cartel', and not 'monopoly'). Most of the time I see about 40% or so of peers using Azureus, a similar number using µtorrent, and the remainder split up among the various other clients (BitComet deserves an honorable mention as a noticeable third).
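
A tally like this can be approximated in code. Many clients announce themselves through an "Azureus-style" peer ID: a dash, a two-letter client code, four version characters and another dash, followed by random bytes. The sketch below assumes that convention and assumes the codes AZ, UT, BC and KT for the clients mentioned in this post; the function names are hypothetical, and anything that does not fit the pattern is lumped under "other".

```python
from collections import Counter
from typing import Dict, Iterable

# Two-letter codes assumed for the clients discussed in this post.
CLIENT_CODES = {
    "AZ": "Azureus",
    "UT": "µtorrent",
    "BC": "BitComet",
    "KT": "KTorrent",
}


def client_from_peer_id(peer_id: bytes) -> str:
    """Guess the client from a 20-byte peer ID such as b'-AZ2504-...'."""
    if len(peer_id) == 20 and peer_id[0:1] == b"-" and peer_id[7:8] == b"-":
        code = peer_id[1:3].decode("ascii", errors="replace")
        return CLIENT_CODES.get(code, "other")
    return "other"


def client_shares(peer_ids: Iterable[bytes]) -> Dict[str, float]:
    """Percentage of peers per client; empty input gives an empty result."""
    counts = Counter(client_from_peer_id(p) for p in peer_ids)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {client: 100 * n / total for client, n in counts.items()}
```

Peer IDs are self-reported, so a count like this is only as reliable as the clients are honest about who they are.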

Why are there only two bittorrent clients in wide use? After all, there are several open-source bittorrent libraries out there (the original Mainline; Sourceforge libtorrent; Rakshasa's libtorrent; etc.). Anyone can make their own client; all you need to do is write a frontend. I worked on such a project myself recently (to refrain from self-advertising, its name won't appear anywhere in this document). So, the ease of creating a bittorrent client might make you think that it is easy to enter that market. But, as it turns out, that is not the case.

Recently I have done a lot of testing of various clients. Now, on a very healthy torrent (plenty of peers, good seed/peer ratio, etc.), most bittorrent implementations will do reasonably well. But on more difficult torrents, that is no longer true. In some cases standard backends simply stall - no connections or very few, and no actual downloading. More advanced clients like KTorrent do a little better, managing to connect to a few peers and get some data, albeit slowly. But run Azureus or µtorrent, and you suddenly see what bittorrent is meant to be: fast. It should be no wonder, therefore, that the vast majority of people use one of those two clients. But why are they so good at what they do? After all, the bittorrent protocol is open: anyone is free to implement it, and get the same excellent performance, aren't they?

Yes and no. While it is true that the basic protocol is open and well-known, it has some extensions which are not used by everyone. For example, protocol encryption was first devised by Azureus and µtorrent; later on, other clients adopted it as well - in particular KTorrent, which may account for its relative success in comparison to some other minority clients. Another feature is Peer Exchange, which allows exchanging lists of peers without a tracker. There is no single standard for Peer Exchange; µtorrent and Azureus each use their own (as does BitComet). Distributed Hash Table (DHT) capability is also useful, allowing trackerless torrents. This feature is steadily becoming commonplace, but it should be noted that Azureus use their own version of DHT, incompatible with all the rest.

So, what does all of this mean? Well, when you run µtorrent, you have the numerous other µtorrent peers that you can use Peer Exchange with. And when you run Azureus, you can utilize the large network of other Azureus peers using the Azureus DHT. Thus, both Azureus and µtorrent have advantages over a brand-new client. In both cases the only reason for the advantage is the sizable pre-existing client base.

µtorrent warrants further discussion. Since µtorrent is closed-source, there is no way to tell whether or not it gives peers running the same client 'preferential treatment': for all we know, it may be that when µtorrent gets connection requests from several peers, it responds first to those of them that are also running µtorrent, and so forth. I am not accusing µtorrent of anything - I don't have enough evidence - but I do have a few reasons to suspect this. First, it is not uncommon for me to see, when running µtorrent, that most of my downloading is from other µtorrent clients; likewise uploading (I am of course taking into account the large proportion of peers running µtorrent when I say this; the phenomenon seems large even when taking that matter into consideration). A second issue is that, in my experience, µtorrent consistently sees more peers than any other client, Azureus included. There seem to be peers that only µtorrent can see; it may be that those peers are only reachable via µtorrent's Peer Exchange. Does µtorrent, in some cases, not report itself to the trackers on purpose, thus effectively creating a 'µtorrent-only' section? I don't know, but it does seem like that might be true. I'll say it again: both of these observations can be contested and/or explained away in various ways. Still, I have my suspicions. Perhaps people reading this document have further information from their own experience.

It is possible that Azureus and µtorrent simply have better implementations of the bittorrent protocol. But I don't believe that is the only reason for their performing far better than the competition. Part of it is probably due to the unique advantages each has - µtorrent's Peer Exchange, and Azureus's DHT, each of which leverages an enormous existing client base to boost performance (and I mention only these two features so as not to complicate the discussion: the real picture is even more complicated). Sadly, this makes entering the market difficult for new clients. Note that this is true even though e.g. Azureus's DHT is 'open', because (1) other clients will need to catch up and implement those extensions, (2) with several competing extensions, it becomes confusing to decide which to implement, and (3) while open, the specification may change, thus becoming a 'moving target'. (See, for example, this.)

What is ironic is that this 'monopolization' of the market takes place on a protocol that is open. But despite it being open, some find ways to use it that give them a special advantage, whether by extending the protocol in a nonstandard way (even if that way is still 'open'), or by (perhaps) writing closed-source code that implements the protocol in an 'unfair' manner.

If these things sound familiar, they should: Microsoft were accused of both of them in the past (in particular, a nonstandard extension of an open standard is part of what has been called their 'Embrace, Extend and Extinguish' tactic). Hopefully Microsoft are not currently doing either one, yet if in fact they are not, one reason for that must be the risk of legal action. However, with bittorrent, no regulation exists (Neelie Kroes, we need you!). In such extreme 'free market' conditions, it appears that freedom may have, in fact, been lost.

Tuesday, November 21, 2006

Will the GPL3 Nullify the Microsoft-Novell pact?

Eben Moglen is quoted on The Register as saying:

"Suppose GPL3 says something like, 'if you distribute (or procure the distribution), of a program (or parts of a program) - and if you make patent promises partially to some subset of the distributees of the program - then under this license you have given the same promise or license at no cost in royalties or other obligations to all persons to whom the program is distributed'.

If GPL 3 goes into effect with these terms in it, Novell will suddenly [become] a patent laundry; the minute Microsoft realizes the laundry is under construction it will withdraw."

Obviously there would be a problem for Microsoft if Novell became a 'patent laundry', as Moglen says. But perhaps the Microsoft-Novell agreement specifically states that Novell cannot distribute code in a 'patent laundry' way, i.e. that Novell is not allowed to give a license to anyone but Novell customers. In that case, Novell would be in violation if they distributed GPL3 code, and their only option would be to stop distributing that code. Contrary to what Moglen suggests, Microsoft wouldn't have to withdraw - Novell would. Which would shut Novell down.

In effect, Microsoft would be terminating the 2nd-from-the-top Linux vendor, at a negligible cost (to them) of a few hundred million dollars. And the interesting bit is that they let the free/open-source community carry out the hit; Microsoft themselves are perfectly innocent. Pretty devious.

What benefit would Microsoft gain by taking out Novell? While it is true that most SUSE Linux users would probably continue using Linux, in some other form, Novell's demise would still have two major consequences: first, the Linux industry would seem 'fragile', which might scare enterprise IT buyers (while Microsoft, unstoppable as ever, can always be relied on to continue existing); second, the fact that a free software license is used to destroy an open-source corporation (I use both terms loosely) would cause internal discord in the free/open-source community. In any case, regardless of specifics, Microsoft is the only party with a positive outcome.

So what can the free/open-source community do - perhaps refrain from using the GPL3 to effect Novell's demise? I am not sure that the community can do so; protecting free software from patent abuse is absolutely critical. So, if we assume that the Microsoft-Novell deal contains a clause restricting Novell to bestowing patent protection on their own customers only, then Novell's demise may be inevitable.

Hopefully Novell were not so foolish as to enter into such a pact. In fact, considering that Eben Moglen saw the details of the Microsoft-Novell deal and seems intent on 'nullifying' that agreement via the GPL3, my guess is that Novell were, in fact, not so foolish (or else Moglen might be phrasing himself a little... differently, given the risk to Novell). Perhaps the deal contains a clause that cancels it, or parts of it, if certain circumstances occur, such as Novell no longer being able to distribute GPL code. This would seem consistent with most of the information that we currently have.