Nifty link of the nonce
"Why Python Is Slow" is a quick read on Python's dynamic typing. I had not known that it was possible to change the value of an integer constant in Python.
Math has a concept called singular value decomposition. The short version is that you put in one matrix and get three out. It is apparently a well-known tool in engineering, so it is implemented in the data analysis language IDL and in the NumPy library for Python, and you can probably guess where this is going: the singular value decomposition functions in IDL and NumPy produce different matrices for the same input matrix.
I've only tested one chunk of data, so I do not know if the pattern will hold for different input matrices.
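For anyone who wants to poke at this, here is a minimal NumPy sketch (the input matrix is mine, purely illustrative). An SVD is only unique up to sign and ordering conventions, so two libraries can disagree on the individual factors while both being correct; the reconstruction is what has to match:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# One matrix in, three out: A == U * diag(s) * Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Individual factors can differ between libraries (e.g. flipped signs in
# paired columns of U and rows of Vt), but the product should not.
A_rec = U @ np.diag(s) @ Vt
print(np.allclose(A, A_rec))  # prints True
```

Comparing U @ diag(s) @ Vt from IDL and NumPy against the input would show whether the discrepancy is a real bug or just a convention difference.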
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6034: ordinal not in range(128)
The solution is to comment out the
config = config.decode('ascii')
line in Lib/site-packages/setuptools/easy_install.py.
Here's an even more fun one:
File "geopts.py", line 128, in xp2str
    s = etree.tostring(resultset[0], method="text")
File "lxml.etree.pyx", line 3165, in lxml.etree.tostring (src\lxml\lxml.etree.c:69399)
exceptions.TypeError: Type '_ElementStringResult' cannot be serialized.
This says that the XML library's own tostring() function cannot convert one of its own string types to a string. What's especially brilliant about this is that _ElementStringResult inherits from the native string class.
Here is a hacky attempt to manage the problem:
r = resultset[0]
if isinstance(r, etree._ElementStringResult):
    s = r
else:
    s = etree.tostring(r, method="text")
Chromatic's book Modern Perl (2011) is a free download. It's a good refresher/updater for people who already have some experience with Perl, like me, whose copy of the Camel Book is the 1996 edition.
Consider a generic function like this one:
function square(x){ return x * x; }
In C, you tell the compiler what data type 'x' is, the compiler emits corresponding assembly code, and that becomes the canonical square() function. An object-oriented system adds a layer of overhead to track which objects have which interfaces and decides what to do at runtime.
Imagine something in between with JIT-like compilation based on how the caller uses the function.
There would be no one canonical square() implementation. The compiler/interpreter would be aware of several different square() implementations and would choose to use or create a particular one based on the circumstances.
Now consider having these different compiled segments stored in the binary, with the high level version also available for any subroutines that might need it.
Do any language environments do this?
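The idea can be sketched as a toy specialization cache (in Python; all names here are mine, and a real implementation would generate machine code rather than pick from a dict):

```python
def square_int(x):
    return x * x      # stands in for a compiled integer-multiply version

def square_float(x):
    return x * x      # stands in for a compiled float-multiply version

_specializations = {int: square_int, float: square_float}

def square(x):
    # Choose (or, in a real JIT, generate) an implementation based on how
    # the caller is actually using the function.
    impl = _specializations.get(type(x))
    if impl is None:
        _specializations[type(x)] = impl = lambda v: v * v  # generic fallback
    return impl(x)

print(square(7), square(2.5))  # 49 6.25
```

The high-level definition stays available as the fallback, which is roughly the "high level version also available" case above.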
I found myself with a bit of free time, so I decided to try getting back into programming for fun and see what Haxe is like these days. ( Read more... )
One of my side projects is a Javascript drag-and-drop list using the modern drag and drop specification with the goal of allowing users to rearrange the order of items in a list. I quickly ran into a problem: the drag-and-drop spec was designed for dragging one object onto another single discrete object which is expecting a drag event. This does not translate well to a draggable list's use cases of dragging above, below, and between objects (or subobjects), or of dragging an item off of the drag area to bring it to the top or bottom of the list. To do anything fancy, we need to find a relationship between the coordinates of the MouseEvent parent of a drag event, and the coordinates of the elements on the screen.
Visual elements have coordinate attributes such as offsetTop/offsetLeft and clientHeight/clientWidth.
Drag events have the coordinate attributes of their MouseEvent parent: screenX/screenY, clientX/clientY, and pageX/pageY.
There is no correlation between the two sets of coordinates. I tried summing the offsetTop of an item and its ancestors but found no correlation between that sum and any of the mouse coordinates. I also had no luck using the various page and scroll properties for window and document. Since I couldn't find the answer, I changed the question. Element.clientHeight reliably works across browsers, so we can do this:
This should work: the MouseEvent gives us three sets of coordinates, so we just pick one and compare its changes against the item heights.
Hah.
Among the problems: the coordinate sets behave differently across browsers, and some of them shift when the page scrolls.
I decided to use the difference in screenY, despite the obvious bug that the math will be wrong if the page scrolls in the middle of a drag, because it causes the fewest compatibility problems across browsers.
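A minimal sketch of the screenY-difference approach (the function and its names are mine, illustrative only):

```javascript
// Record the pointer's screenY at dragstart, then on each drag event
// convert the vertical delta into a number of list rows crossed.
function rowsMoved(startScreenY, currentScreenY, itemHeight) {
    // itemHeight would come from Element.clientHeight, which is reliable
    // across browsers; a positive result means the item moved down the list.
    return Math.round((currentScreenY - startScreenY) / itemHeight);
}
```

The caller would stash event.screenY in the dragstart handler and call rowsMoved from the dragover handler to decide where the item should land.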
Side note: The best practice for defining class methods in Javascript is to use the prototype:
ClassName.prototype.method = function(){...}
This allows every instance of the class to use the same function instead of giving each instance its own copy of the function.
Member variables are not in scope in prototype methods; a method is expected to access them through this. In the context of an event handler, however, this is not the containing object, so prototyped methods make poor event handlers. A solution is the old-fashioned "this.method=" declaration, which is less efficient but does the job:
function ClassName(){
    this.method = function(){...}
}
I ran into this problem when I tried to fix my old-style drag-and-drop code to use the best practice instead.
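One way to see both the problem and the fix is to capture the instance in a closure (Dragger is a hypothetical class of mine, not code from the project):

```javascript
function Dragger(name) {
    var self = this;   // capture the instance; inside the handler, `this`
    this.name = name;  // would be the DOM element, not the Dragger
    this.onDrop = function () {
        return self.name + " dropped";
    };
}

var d = new Dragger("item1");
var handler = d.onDrop;   // detached, the way an event dispatcher holds it
console.log(handler());   // "item1 dropped"
```

Because the handler closes over self rather than relying on this, it still works after being detached from the object, which is exactly the situation an event listener is in.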
Recommended reading: Douglas Crockford's tutorial: Private Members in Javascript.
I was getting an unexplained, unlogged 500 internal server error response for a perl Hello World script.
#!/usr/bin/perl
print "Content-type: text/html\n\n<p>Hello World</p>\n";
This was especially odd because I have a perl program running elsewhere on the same server. After comparing .htaccess settings and triple-checking my Content-type syntax, I found the apparent cause: the server requires either the -w (warnings) or the -T (taint checks) flag be turned on.
I was unable to determine which setting causes this. mod_perl's PerlSwitches can force all scripts to be run with -wT turned on, but I could find no setting to refuse to run scripts which lack either flag.
For a description of taint mode, read perlsec.
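Either flag goes on the shebang line, so the working version of the script above is simply:

```
#!/usr/bin/perl -wT
print "Content-type: text/html\n\n<p>Hello World</p>\n";
```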
Reason #1:
foreach($items as $i){ // $i is an item reference
    ...
    // Now let's loop through something
    for($i=0,$max=10; $i<$max; $i++){ // d'oh
Reason #2:
$dir = $asdf ? -1 : +1; // direction
...
$dir = getDirectory(...); // d'oh
Another reason PHP sucks: references to static methods are not supported. There is a hacky way to use them, however. ( Read more... )
Some notes on keypress handling in Python's PyGame library, from about seven years ago: ( Read more... )
I finally got my Javascript implementation of Craig Reynolds's boids algorithm to work. Here is the relevant code:
var weights = new Array(1.0, 1.0, 1.0);
sepvect = sepvect.multiply(weights[0]);
alivect = alivect.multiply(weights[1]);
cohvect = cohvect.multiply(weights[2]);
I just needed to record magnitudes for one run and fiddle with the weights. (0.5,1.0,0.2) seemed to do the trick, though I should test with different numbers of boids to see if the magnitudes are relative to that variable.
As regards "finally", I started on this so long ago that I forget when and have intermittently picked it up and re-abandoned it since then. The oldest timestamp that I can find for it is 2006, but I think it goes back to 2003 or 2004 when it was going to be something that I would put together during spring break. The biggest problem was a trig error that I fixed last month after having almost fixed it earlier, causing directions to be wrong in one or two of the four quadrants.
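Wrong directions in one or two quadrants is the classic symptom of using Math.atan where Math.atan2 is needed (a generic illustration of that class of bug, not necessarily the project's actual code):

```javascript
// atan(dy/dx) cannot tell (dx, dy) from (-dx, -dy), so headings in two
// quadrants come out pointing the opposite way; atan2 keeps the signs.
var dx = -1, dy = 1;
var naive  = Math.atan(dy / dx);   // -PI/4: wrong quadrant
var proper = Math.atan2(dy, dx);   //  3*PI/4: correct
console.log(naive.toFixed(4), proper.toFixed(4));
```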
While looking through old files, I found a TODO list so old that I've actually done most of the things on it. Usually these things double in size every year. I shall celebrate with a mocha. ( Details below the cut, if anyone cares. )
Consider these changes to the try/catch model of C++ and Java:
Has any language already done this or something similar? How would this affect program design, code quality, and readability? What would language developers need to do to implement these features, and how would it impact performance?
Also posted to HN, where one of the users informs me that I've reinvented the Common Lisp condition system.
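For a taste of what the condition system does differently, here is a toy Python sketch (names entirely mine; real Lisp conditions are far richer): a handler supplies a recovery value and the signaling code resumes where it was, instead of the stack unwinding to a catch block.

```python
# A toy "condition": the handler chooses a recovery value and the
# signaling code continues from the point of the error.
handlers = []

def signal(condition):
    # Ask the innermost handler for a recovery value; None means "declined".
    for handler in reversed(handlers):
        result = handler(condition)
        if result is not None:
            return result
    raise RuntimeError(condition)  # unhandled: fall back to unwinding

def parse_number(token):
    try:
        return int(token)
    except ValueError:
        # Signal, then *resume* with whatever the handler supplies.
        return signal(("bad-number", token))

handlers.append(lambda cond: 0 if cond[0] == "bad-number" else None)
print([parse_number(t) for t in ["1", "x", "3"]])  # [1, 0, 3]
```

The key difference from try/catch is that the loop over the tokens never stops: the error is repaired in place rather than aborting the computation.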
I've finally gotten around to downloading those US state department cables that Wikileaks acquired. You can acquire them at Cryptome:
http://cryptome.org/z/z.7z
John Young would probably appreciate it if you can find another way of acquiring the file so he doesn't have to pay the bandwidth bill. I'm not linking directly so he doesn't get hammered by bots following the link.
The cables will need to be imported into a database before they can be read. I am using MySQL from the XAMPP distribution.
Create a final table and an all-text table for importing into. (Attempting to import directly into the date field will zero out the dates).
create table Cables (
    id int PRIMARY KEY,
    date datetime,
    local_title varchar(128),
    origin varchar(255),
    classification varchar(128),
    referenceIDs text, -- really a pipe-separated array
    header text,
    data mediumtext -- some larger than 65536 chars
);
create table Cables2 (
    id text,
    datetime text,
    local_title text,
    origin text,
    classification text,
    referenceIDs text,
    header text,
    data mediumtext
);
Load the data using mysql's LOAD DATA INFILE, which needs some help to learn to read multi-line CSV correctly.
load data infile 'c:\\cables.csv' into table cables2
    fields ENCLOSED BY '"' escaped by '\\' terminated by ',';
There should be zero warnings. If you see warnings, you can use "show warnings" to see at what record number the import failed.
Copy from the staging table to the final table. This will take about two minutes.
insert into cables (
    select id, str_to_date(datetime, "%m/%d/%Y %k:%i"), local_title, origin,
        classification, referenceIDs, header, data
    FROM cables2
);
Then create an index on the dates to speed up future searches. This takes two minutes on my computer.
create index idx_date on cables (date);
You will want to search the cables for a specific subject. Two methods are MySQL's built-in fulltext indexing and building a custom index.
MySQL has fulltext indexing. I found it to be too slow.
Creating the index took 25 minutes:
create fulltext index idx_text on cables (header, data);
Fulltext index queries using MATCH AGAINST took 1-2 minutes to run.
select count(*) from cables where match (header, data) against ('Sudan');
select count(*) from cables where match (header, data) against ('Sudan') OR match (header,data) against ('Sudanese');
I found the fulltext index to be slower than sequentially searching the table for "data like '%SUDAN%'". YMMV.
I built my own index using separate tables for all search terms and for the connections between the search terms and the cables.
create table words(
    wordID int auto_increment NOT NULL,
    word varchar(64) NOT NULL,
    CONSTRAINT cx_word_uniqueness UNIQUE (wordID, word)
);
create table idx_words(
    wordID int NOT NULL REFERENCES words(wordID),
    cableID int NOT NULL REFERENCES cables(id),
    INDEX idx_word_match (wordID, cableID),
    CONSTRAINT cx_words UNIQUE (wordID, cableID)
);
The custom index can be seeded with two queries:
INSERT INTO words (word) VALUES ('SUDAN');
INSERT INTO idx_words (
    SELECT words.wordID, cables.id
    FROM words, cables
    WHERE words.word = 'SUDAN' && upper(cables.data) LIKE '%SUDAN%'
);
It takes a few minutes to seed each word, but searches are instantaneous with a small number of seeded words. I have not tested this with a large number of seeded words.
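Once a word is seeded, a lookup is a straightforward join (a sketch against the schema above):

```
SELECT cables.*
FROM cables
JOIN idx_words ON idx_words.cableID = cables.id
JOIN words ON words.wordID = idx_words.wordID
WHERE words.word = 'SUDAN';
```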
If you know of a better search method, please mention it in comments.
The cables have additional information that a sufficiently intelligent program can pull out of the data field and add to the database metadata. If someone has already done this work, please mention it in comments.
You will need a program to pull the data out of the database in a form that you can read. I wrote a quick and dirty PHP program to display search results as HTML.
Cables before circa 2000 were in ALL CAPS and are difficult to read. A program could potentially convert the text to normal mixed case, although it would need to be able to recognize acronyms and people's names. If someone has already done this work, please mention it in comments.
A program could potentially recognize key words such as people's names, and link these words to other sources of information such as History Commons and Wikipedia. If someone has already done this work, please mention it in comments.