I've finally gotten around to downloading those US state department cables that Wikileaks acquired. You can acquire them at Cryptome:

http://cryptome.org/z/z.7z

John Young would probably appreciate it if you can find another way of acquiring the file so he doesn't have to pay the bandwidth bill. I'm not linking directly so he doesn't get hammered by bots following the link.


Storing the cables

The cables will need to be imported into a database before they can be read. I am using MySQL from the XAMPP distribution.

Create a final table and an all-text table for importing into. (Attempting to import directly into the date field will zero out the dates).

create table Cables (
	id int PRIMARY KEY,
	date datetime, 
	local_title varchar(128),
	origin varchar(255), 
	classification varchar(128),
	referenceIDs text, -- really a pipe-separated array
	header text,
	data mediumtext -- some larger than 65536 chars
);
create table Cables2 (
	id text, 
	datetime text, 
	local_title text, 
	origin text, 
	classification text, 
	referenceIDs text,
	header text,
	data mediumtext
);

Load the data using mysql's LOAD DATA INFILE, which needs some help to learn to read multi-line CSV correctly.

load data infile 'c:\\cables.csv' into table cables2 fields ENCLOSED BY '"' escaped by '\\' terminated by ',';

There should be zero warnings. If you see warnings, you can use "show warnings" to see at what record number the import failed.

Copy from the staging table to the final table. This will take about two minutes.

insert into cables (select
id, 
str_to_date(datetime, "%m/%d/%Y %k:%i"),
local_title,
origin,
classification,
referenceIDs,
header,
data
FROM cables2); 

Then create an index on the dates to speed up future searches. This takes two minutes on my computer.

create index idx_date on cables (date);

Building an index for text searches

You will want to search the cables for a specific subject. Two methods are

Fulltext Index

MySQL has fulltext indexing. I found it to be too slow.

Creating the index took 25 minutes:

create fulltext index idx_text on cables (header, data);

Fulltext index queries using MATCH AGAINST took 1-2 minutes to run.

select count(*) from cables where match (header, data) against ('Sudan');
select count(*) from cables where match (header, data) against ('Sudan') OR match (header,data) against ('Sudanese');

I found the fulltext index to be slower than sequentially searching the table for "data like '%SUDAN%'". YMMV.

Custom word cache

I built my own index using separate tables for all search terms and for the connections between the search terms and the cables.

create table words(
	wordID int auto_increment NOT NULL,
	word varchar(64) NOT NULL, 
	CONSTRAINT cx_word_uniqueness UNIQUE (wordID, word)
);

create table idx_words(
	wordID int NOT NULL REFERENCES words(id),
	cableID int NOT NULL REFERENCES cables(id),
	INDEX idx_word_match (wordID, cableID), 
	CONSTRAINT cx_words UNIQUE (wordID, cableID)
);

The custom index can be seeded with two queries:

INSERT INTO words (word) VALUES ('SUDAN'); 
INSERT INTO idx_words (
 SELECT words.wordID, cables.id FROM words, cables 
 WHERE words.word = 'SUDAN' && upper(cables.data) LIKE '%SUDAN%'
);

It takes few minutes to seed each word, but searches are instantaneous with a small number of seeded words. I have not tested this with a large number of seeded words.

Other text search methods?

If you know of a better search method, please mention it in comments.

Other search indexes?

The cables have additional information that a sufficiently intelligent program can pull out of the data field and add to the database metadata. If someone has already done this work, please mention it in comments.

Reading the cables

You will need a program to pull the data out of the database in a form that you can read. I wrote a quick and dirty PHP program to display search results as HTML.

CAPS reformatting?

Cables before circa 2000 were in ALL CAPS and are difficult to read. A program could potentially convert the text to normal mixed case, although it would need to be able to recognize acronyms and peoples' names. If someone has already done this work, please mention it in comments.

Keyword recognition?

A program could potentially recognize key words such as peoples' names, and link these words to other sources of information such as History Commons and Wikipedia. If someone has already done this work, please mention it in comments.

It was the spring of 2010 and Julian Assange had a scoop. He had video of a US helicopter pilot firing on a group of regrouping Iraqi militiamen as they prepared a rocket-propelled grenade launcher to attack an advancing US vehicle. However, that's not what he saw. The weapons were hard to see in the monochrome video. To Assange, it looked like the murder of an unarmed group of Iraqis and so he called it murder. He further called it "collateral" murder to mock the US military's term for accidental killings, even though accidental killings are by definition not murder. And after it was proven that the attack was on an armed group of warriors, Assange made no effort to realign his rhetoric with the truth. He had the headline that he wanted.

Such is the Barnum-like showmanship that Assange and his organization Wikileaks have become known for. Stories are published with information that may or may not embarrass the US government, but are told in such a fashion as to condemn the US whether it has done any wrong or not. In the same vein, the recent delivery of copied State Department diplomatic communications to newspapers is called "Cablegate" even though it has as little to do with the Watergate scandal that it is named for as today's Tea Party movement has to do with the ideals of the American Revolution.

As annoying as Wikileaks's dissembling is, it is not nearly as outrageous as the response by certain influential people to the release of diplomatic notes. To call these people critics of Assange would be insufficient to describe their attitude towards him, which follows a simple line of thought:

  1. We disagree with his conduct.
  2. Kill him.

Among those calling for Assange to be killed for treason or otherwise euphemistically treated like a terrorist are:

They are calling for retribution while there is still a question of whether Assange actually committed a crime or not in publishing the State Department communications, and while the answer to the question is most likely 'no'. To kill a man for redistributing these cables which have only embarrassed the United States would be a far greater assault on American liberty than anything Assange has been accused of doing. Sarah Palin may not be able to see Russia from her house -- that was a Saturday Night Live parody -- but her vision of an America where people are afraid to publish information like this would be a lot more like Russia where over twenty journalists have been assassinated by various parties in the past ten years and in one high-profile case, a Russian journalist was jailed for exposing the Navy's dumping of nuclear waste at sea.

The problem of this reactionary bloodlust is multiplied by the influence of some of these people. If they were not influential they could safely be ignored, but Palin represents the current mainstream of the Republican Party and Huckabee is portrayed in the press as the Republican Party's closest equivalent to a liberal who can still achieve the party's nomination for President. They are leaders who can draw a large share of the public to follow them. Rather than their statements discrediting them, their followers will adjust what they believe to fit the views of their leaders unless more people speak out and say that it is wrong.


Side note: Another person calling for Assange to be killed is Washington Times columnist Jeffrey T. Kuhner. The more interesting thing is that Google's summary of Kuhner's column in its search results pulls the string "Calls for assassination of Julian Assange really shows a barbaric side of some Americans, such as Lieberman and Palin" from the middle of a comment underneath the article. I don't see any sites linking to the article with that text, and that's not what the Washington Times would put in their meta tags, so I doubt an algorithm did this. Perhaps Google has a cranky engineer.

As a side note to the side note, Lieberman called Assange's actions a violation of the Espionage Act and called on the Justice Department to investigate whether the New York Times broke any law by accepting the documents. His opinion is still bad for journalism and would be worth condemning in its own right if I were not busy condemning the people who want Assange to be killed, but to my knowledge Lieberman has not gone that far.


Side note: Former Wikileaks organizers are launching Openleaks, an alternative to Wikileaks. They're disgruntled with Assange over some unspecified internal disputes which they think we don't need to know about. Maybe somebody will leak the details about that.


Page generated May. 30th, 2017 03:45 am
Powered by Dreamwidth Studios