How to read the Wikileaks cables
Apr. 28th, 2012 04:37 pmI've finally gotten around to downloading those US state department cables that Wikileaks acquired. You can acquire them at Cryptome:
http://cryptome.org/z/z.7z
John Young would probably appreciate it if you can find another way of acquiring the file so he doesn't have to pay the bandwidth bill. I'm not linking directly so he doesn't get hammered by bots following the link.
Storing the cables
The cables will need to be imported into a database before they can be read. I am using MySQL from the XAMPP distribution.
Create a final table and an all-text table for importing into. (Attempting to import directly into the date field will zero out the dates).
create table Cables ( id int PRIMARY KEY, date datetime, local_title varchar(128), origin varchar(255), classification varchar(128), referenceIDs text, -- really a pipe-separated array header text, data mediumtext -- some larger than 65536 chars );
create table Cables2 ( id text, datetime text, local_title text, origin text, classification text, referenceIDs text, header text, data mediumtext );
Load the data using mysql's LOAD DATA INFILE, which needs some help to learn to read multi-line CSV correctly.
load data infile 'c:\\cables.csv' into table cables2 fields ENCLOSED BY '"' escaped by '\\' terminated by ',';
There should be zero warnings. If you see warnings, you can use "show warnings" to see at what record number the import failed.
Copy from the staging table to the final table. This will take about two minutes.
insert into cables (select id, str_to_date(datetime, "%m/%d/%Y %k:%i"), local_title, origin, classification, referenceIDs, header, data FROM cables2);
Then create an index on the dates to speed up future searches. This takes two minutes on my computer.
create index idx_date on cables (date);
Building an index for text searches
You will want to search the cables for a specific subject. Two methods are
Fulltext Index
MySQL has fulltext indexing. I found it to be too slow.
Creating the index took 25 minutes:
create fulltext index idx_text on cables (header, data);
Fulltext index queries using MATCH AGAINST took 1-2 minutes to run.
select count(*) from cables where match (header, data) against ('Sudan');
select count(*) from cables where match (header, data) against ('Sudan') OR match (header,data) against ('Sudanese');
I found the fulltext index to be slower than sequentially searching the table for "data like '%SUDAN%'". YMMV.
Custom word cache
I built my own index using separate tables for all search terms and for the connections between the search terms and the cables.
create table words( wordID int auto_increment NOT NULL, word varchar(64) NOT NULL, CONSTRAINT cx_word_uniqueness UNIQUE (wordID, word) ); create table idx_words( wordID int NOT NULL REFERENCES words(id), cableID int NOT NULL REFERENCES cables(id), INDEX idx_word_match (wordID, cableID), CONSTRAINT cx_words UNIQUE (wordID, cableID) );
The custom index can be seeded with two queries:
INSERT INTO words (word) VALUES ('SUDAN'); INSERT INTO idx_words ( SELECT words.wordID, cables.id FROM words, cables WHERE words.word = 'SUDAN' && upper(cables.data) LIKE '%SUDAN%' );
It takes few minutes to seed each word, but searches are instantaneous with a small number of seeded words. I have not tested this with a large number of seeded words.
Other text search methods?
If you know of a better search method, please mention it in comments.
Other search indexes?
The cables have additional information that a sufficiently intelligent program can pull out of the data field and add to the database metadata. If someone has already done this work, please mention it in comments.
Reading the cables
You will need a program to pull the data out of the database in a form that you can read. I wrote a quick and dirty PHP program to display search results as HTML.
CAPS reformatting?
Cables before circa 2000 were in ALL CAPS and are difficult to read. A program could potentially convert the text to normal mixed case, although it would need to be able to recognize acronyms and peoples' names. If someone has already done this work, please mention it in comments.
Keyword recognition?
A program could potentially recognize key words such as peoples' names, and link these words to other sources of information such as History Commons and Wikipedia. If someone has already done this work, please mention it in comments.