Math has a concept called singular value decomposition. The short version is that you put in one matrix and get three out. It's a well-known concept in engineering, so it is implemented in both the data analysis language IDL and the NumPy library for Python, and you can probably guess where this is going. The singular value decomposition functions in IDL and NumPy produce different matrices for the same input matrix.

  • All three output matrices have their columns swapped.
  • One of the columns in the 'u' matrix is the negative of the matching column in the other language.
  • One of the rows in the 'v' matrix is the negative of the matching row in the other language.

I've only tested one chunk of data, so I do not know if the pattern will hold for different input matrices.
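For what it's worth, both results can be numerically valid. An SVD is only unique up to the ordering of the singular values and the sign of each singular vector pair, so two libraries can disagree in exactly this way and still both be correct. A minimal NumPy sketch, using a toy matrix of my own:

import numpy as np

a = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 2.0],
              [1.0, 2.0, 0.0]])

u, s, vt = np.linalg.svd(a)

# Flip the sign of one singular-vector pair: column 0 of u, row 0 of vt.
u2 = u.copy()
vt2 = vt.copy()
u2[:, 0] *= -1
vt2[0, :] *= -1

# Both factorizations reconstruct the same input matrix.
print(np.allclose(u @ np.diag(s) @ vt, a))    # True
print(np.allclose(u2 @ np.diag(s) @ vt2, a))  # True

So the difference is a convention mismatch, not necessarily a bug in either library.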

Python easy_install on Windows sometimes fails with a UnicodeDecodeError:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6034: ordinal not in range(128)
The solution is to comment out the config = config.decode('ascii') line in Lib/site-packages/setuptools/easy_install.py.

Here's an even more fun one:

  File "geopts.py", line 128, in xp2str
    s = etree.tostring(resultset[0], method="text")
  File "lxml.etree.pyx", line 3165, in lxml.etree.tostring (src\lxml\lxml.etree.c:69399)

exceptions.TypeError: Type '_ElementStringResult' cannot be serialized.

This says that the XML library's own tostring() function cannot convert one of its own string types to a string. What's especially brilliant about this is that _ElementStringResult inherits from the native string class.

Here is a hacky attempt to manage the problem:

from lxml import etree

# resultset comes from an earlier xpath() call.
r = resultset[0]
if isinstance(r, etree._ElementStringResult):
    s = r  # already a str subclass; tostring() refuses to serialize it
else:
    s = etree.tostring(r, method="text")

Hack

Mar. 22nd, 2014 05:39 am
The Hack language adds several missing features to PHP, most notably static typing. It was developed by Facebook for their HipHop VM. Via soy.
This is only my opinion fwiw, but library/add-on packages for a high-level scripting language should not require a C compiler, any version of Visual Studio, or any development environment for any different lower-level language.
In languages based on message-passing, you just call the function, and an error occurs at run time if it does not exist. In languages with reflection, you can test whether the function exists and then call it. In Haxe... ( Read more... )

Chromatic's book Modern Perl (2011) is a free download. It's a good refresher/updater for people who already have some experience with Perl, like me, whose copy of the Camel Book is a 1996 edition.

Consider a generic function like this one:

function square(x){ return x * x; }

In C, you tell the compiler what data type 'x' is, and the compiler will emit some corresponding assembly code and consider that the canonical square() function. In an object-oriented system, a layer of overhead tracks which objects have which interfaces, and the decision of what to do is made at runtime.

Imagine something in between with JIT-like compilation based on how the caller uses the function.

  • If the caller will send an int, compile some assembly code using an int.
  • If the caller will send a float, compile some assembly code using a float.
  • If the value in the caller is known to be low enough to not need a 64-bit int, like if it's in range(0..10), and if the smaller data types are faster on this hardware, then use a smaller data type.
  • If the caller will send an object that has an overridden * method, see if it's possible to optimize that.
  • If the data type is indeterminate, go with the high-overhead object oriented method.

There would be no one canonical square() implementation. The compiler/interpreter would be aware of several different square() implementations and would choose to use or create a particular one based on the circumstances.
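To make that concrete, here is a toy sketch of the dispatch layer in Python. This is ordinary function dispatch, not machine-code generation, and every name in it is invented for illustration:

# Cache one "compiled" variant of square() per argument type, falling back
# to a generic version when the type is indeterminate.
def specializing(generic):
    variants = {}

    def make_variant(tp):
        if tp is int:
            return lambda x: x * x   # stand-in for int-specialized code
        if tp is float:
            return lambda x: x * x   # stand-in for float-specialized code
        return generic               # unknown type: high-overhead generic path

    def dispatch(x):
        tp = type(x)
        if tp not in variants:
            variants[tp] = make_variant(tp)  # "compile" on first use
        return variants[tp](x)

    return dispatch

@specializing
def square(x):
    return x * x

print(square(3), square(2.5))  # 9 6.25

For what it's worth, this is the spirit of the type feedback used by JIT compilers such as V8 and PyPy, and Julia compiles a specialized method instance for each concrete argument-type signature it sees.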

Now consider having these different compiled segments stored in the binary, with the high level version also available for any subroutines that might need it.

Do any language environments do this?

Haxe notes

Mar. 24th, 2013 08:09 pm

I found myself with a bit of free time, so I decided to try getting back into programming for fun and see what Haxe is like these days. ( Read more... )

This has been out for three years and I hadn't heard of it before? It looks really useful. An interesting point is the reversal of flow control: Instead of a PHP page being an HTML template that breaks into and out of PHP, XHP promotes the design style of a PHP file that breaks into and out of HTML.
Another unrelated awesome project from three years ago: The LBW project is Wine for Windows, allowing Linux ELF binaries to be run under Windows XP. Immediately impressive is that "it's adequate for ... downloading and installing packages with apt and dpkg" since I could never get apt and dpkg to compile under mingw or cygwin.

One of my side projects is a Javascript drag-and-drop list using the modern drag-and-drop specification, with the goal of allowing users to rearrange the order of items in a list. I quickly ran into a problem: the drag-and-drop spec was designed for dragging one object onto another single, discrete object that is expecting a drag event. This does not translate well to a draggable list's use cases: dragging above, below, and between objects (or subobjects), or dragging an item off of the drag area to bring it to the top or bottom of the list. To do anything fancy, we need a relationship between the coordinates of a drag event (inherited from its MouseEvent parent class) and the coordinates of the elements on the screen.

Visual elements have these coordinate attributes:

  • offsetTop
  • scrollTop

Drag events have these coordinate attributes:

  • clientY
  • pageY
  • screenY

There is no direct correlation between the two sets of coordinates. I tried summing the offsetTop of an item and its ancestors but found no correlation between that sum and any of the mouse coordinates. I also had no luck with the various page and scroll properties on window and document. Since I couldn't find the answer, I changed the question: Element.clientHeight works reliably across browsers, so we can do this:

  1. Save the initial drag event at the start of a drag.
  2. Calculate the difference between the start and end events.
  3. Count the heights of the elements to see where to place the dragged item.
  4. If we run out of elements, place the dragged item at the head or tail of the list.
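Step 3 is just arithmetic over the item heights. Sketched as a plain function (Python for brevity, since the logic is language-neutral; the real code reads clientHeight off the list's DOM elements):

# Given each item's height, the index where the drag started, and the
# vertical distance moved, find the index where the dragged item lands.
def drop_index(heights, start_index, delta_y):
    top = sum(heights[:start_index]) + delta_y  # dragged item's new top edge
    y = 0
    for i, h in enumerate(heights):
        if top < y + h:
            return i              # landed within element i
        y += h
    return len(heights) - 1       # ran out of elements: clamp to the tail

# Dragging the first of three 20px items down by 50px lands it at index 2:
print(drop_index([20, 20, 20], 0, 50))  # 2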

This should work. The MouseEvent gives us three coordinate systems, so we should be able to pick any one of them and get a consistent difference.

Hah.

Among the problems:

  • In Firefox, the clientY of the starting DragEvent is zero. This is bug #505521, which has been open since 2009.
  • In Safari 5.1.7, the clientY of the starting DragEvent is measured from the top of the window, while the clientY of the ending DragEvent is measured from the bottom of the window.
  • In Safari, the pageY of the ending DragEvent is some ridiculous number that seems to be measured from some point over 500px off the bottom of the screen.
  • In both Firefox and Safari, the differences in clientY, pageY, and screenY are different for the same beginning and ending mouse position.
  • In Opera, the Y values for MouseEvents ending on the sidebar panel are different from the Y values for MouseEvents on a page at the same vertical level.

I decided to use the difference in screenY, even though it has the obvious bug that the math will be wrong if the page scrolls in the middle of a drag, because it produces the fewest compatibility problems across browsers.


Side note: The best practice for defining class methods in Javascript is to use the prototype:

ClassName.prototype.method = function(){...}

This allows every instance of the class to use the same function instead of giving each instance its own copy of the function.

Member variables are not in scope inside prototype methods; the method is expected to access them through this. In the context of an event handler, however, this is not the containing object but the element that fired the event. Therefore, using prototyped methods as event handlers is not a good idea. A solution is to use the old-fashioned "this.method=" declaration, which gives every instance its own copy of the function but does the job:

function ClassName(){
    this.method = function(){...};
}

I ran into this problem when I tried to fix my old-style drag-and-drop code to use the best practice instead.

Recommended reading: Douglas Crockford's tutorial: Private Members in Javascript.

I was getting an unexplained, unlogged 500 internal server error response for a perl Hello World script.

#!/usr/bin/perl

print "Content-type: text/html\n\n<p>Hello World</p>\n";

This was especially odd because I have a perl program running elsewhere on the same server. After comparing .htaccess settings and triple-checking my Content-type syntax, I found the apparent cause: the server requires that either the -w (warnings) or the -T (taint checks) switch be present on the shebang line, e.g. #!/usr/bin/perl -wT.

I was unable to determine which setting causes this. mod_perl's PerlSwitches can force all scripts to be run with -wT turned on, but I could find no setting to refuse to run scripts which lack either flag.

For a description of taint mode, read perlsec.


Reason #1:

foreach($items as $i){ // $i is an item reference
    ...
    // Now let's loop through something
    for($i=0, $max=10; $i<$max; $i++){ // d'oh: clobbers the foreach variable $i

Reason #2:


$dir = $asdf ? -1 : +1; // direction
...
$dir = getDirectory(...); // d'oh: $dir now holds a directory, not a direction

Another reason PHP sucks: references to static methods are not supported. There is a hacky way to use them, however. ( Read more... )

Some notes on keypress handling in Python's PyGame library, from about seven years ago: ( Read more... )

I finally got my Javascript implementation of Craig Reynolds's boids algorithm to work. Here is the relevant code:

// Weights for the three steering vectors: separation, alignment, cohesion.
var weights = [1.0, 1.0, 1.0];

sepvect = sepvect.multiply(weights[0]); // separation: steer away from crowding
alivect = alivect.multiply(weights[1]); // alignment: match neighbors' heading
cohvect = cohvect.multiply(weights[2]); // cohesion: steer toward the local center

I just needed to record the magnitudes for one run and fiddle with the weights. (0.5, 1.0, 0.2) seemed to do the trick, though I should test with different numbers of boids to see whether the right weights depend on the flock size.
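For context, here is a compact sketch of what one update step computes. This is my own minimal NumPy version, not the project's code:

import numpy as np

def boids_step(pos, vel, weights=(0.5, 1.0, 0.2), radius=50.0, max_speed=4.0):
    # pos and vel are (n, 2) arrays of positions and velocities.
    n = len(pos)
    new_vel = vel.copy()
    for i in range(n):
        offsets = pos - pos[i]
        near = (np.linalg.norm(offsets, axis=1) < radius) & (np.arange(n) != i)
        if not near.any():
            continue
        sep = -offsets[near].sum(axis=0)       # separation: steer away
        ali = vel[near].mean(axis=0) - vel[i]  # alignment: match velocities
        coh = pos[near].mean(axis=0) - pos[i]  # cohesion: steer toward center
        new_vel[i] = vel[i] + weights[0]*sep + weights[1]*ali + weights[2]*coh
        speed = np.linalg.norm(new_vel[i])
        if speed > max_speed:
            new_vel[i] *= max_speed / speed    # clamp the speed
    return pos + new_vel, new_vel

# Example: advance a random flock of ten boids by one step.
pos, vel = boids_step(np.random.rand(10, 2) * 100, np.random.randn(10, 2))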

As regards "finally": I started on this so long ago that I forget when, and I have intermittently picked it up and re-abandoned it since then. The oldest timestamp that I can find for it is 2006, but I think it goes back to 2003 or 2004, when it was going to be something I would put together during spring break. The biggest problem was a trig error, which I fixed last month after having almost fixed it earlier, that caused directions to be wrong in one or two of the four quadrants.

While looking through old files, I found a TODO list so old that I've actually done most of the things on it. Usually these lists double in size every year. I shall celebrate with a mocha. ( Details below the cut, if anyone cares. )

Consider these changes to the try/catch model of C++ and Java:

  • Every block is a try{} block. The "try" keyword is dropped. All exceptions will filter upwards until they reach a block that catches the exception.
  • The new "recover" keyword returns from a "catch" block to the line after the point where the exception was thrown, in the context that threw it. The exception carries a ".scope" object that gives access to the variables that were in scope at the time the exception was thrown. Whenever an exception happens, the programmer could twiddle a few variables and set the program back to where it was.

Has any language already done this or something similar? How would this affect program design, code quality, and readability? What would language developers need to do to implement these features, and how would it impact performance?
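You can fake a crude version of recover today by inverting control: the code that would have thrown instead asks a handler for a fixed-up value and then continues where it was. A toy Python sketch, with all names invented:

# Registered handlers get (condition, scope) and may return a replacement
# value, in which case execution resumes instead of unwinding.
handlers = []

def signal(condition, scope):
    for handler in reversed(handlers):
        result = handler(condition, scope)
        if result is not None:
            return result              # "recover": resume with a patched value
    raise RuntimeError(condition)      # nobody recovered: unwind as usual

def parse_age(text):
    if not text.isdigit():
        # Expose the local scope so a handler can twiddle the variables.
        text = signal("bad-age", {"text": text})
    return int(text)

handlers.append(lambda cond, scope: "0" if cond == "bad-age" else None)
print(parse_age("42"))   # 42
print(parse_age("old"))  # 0 -- recovered; execution continued inside parse_age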


Also posted to HN, where one of the users informs me that I've reinvented the Common Lisp condition system.

I've finally gotten around to downloading those US State Department cables that Wikileaks acquired. You can get them from Cryptome:

http://cryptome.org/z/z.7z

John Young would probably appreciate it if you can find another way of acquiring the file so he doesn't have to pay the bandwidth bill. I'm not linking directly so he doesn't get hammered by bots following the link.


Storing the cables

The cables will need to be imported into a database before they can be read. I am using MySQL from the XAMPP distribution.

Create a final table and an all-text staging table to import into. (Attempting to import the CSV directly into the datetime column will zero out the dates.)

create table Cables (
	id int PRIMARY KEY,
	date datetime, 
	local_title varchar(128),
	origin varchar(255), 
	classification varchar(128),
	referenceIDs text, -- really a pipe-separated array
	header text,
	data mediumtext -- some larger than 65536 chars
);
create table Cables2 (
	id text, 
	datetime text, 
	local_title text, 
	origin text, 
	classification text, 
	referenceIDs text,
	header text,
	data mediumtext
);

Load the data using mysql's LOAD DATA INFILE, which needs some help to learn to read multi-line CSV correctly.

LOAD DATA INFILE 'c:\\cables.csv' INTO TABLE cables2
FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\';

There should be zero warnings. If you see warnings, use SHOW WARNINGS to see at which record number the import failed.

Copy from the staging table to the final table. This will take about two minutes.

INSERT INTO cables
SELECT
	id,
	str_to_date(datetime, "%m/%d/%Y %k:%i"),
	local_title,
	origin,
	classification,
	referenceIDs,
	header,
	data
FROM cables2;

Then create an index on the dates to speed up future searches. This takes two minutes on my computer.

create index idx_date on cables (date);

Building an index for text searches

You will want to search the cables for a specific subject. Two methods are described below.

Fulltext Index

MySQL has fulltext indexing. I found it to be too slow.

Creating the index took 25 minutes:

create fulltext index idx_text on cables (header, data);

Fulltext index queries using MATCH AGAINST took 1-2 minutes to run.

select count(*) from cables where match (header, data) against ('Sudan');
select count(*) from cables where match (header, data) against ('Sudan') OR match (header,data) against ('Sudanese');

I found the fulltext index to be slower than sequentially searching the table for "data like '%SUDAN%'". YMMV.

Custom word cache

I built my own index using separate tables for all search terms and for the connections between the search terms and the cables.

create table words(
	wordID int auto_increment NOT NULL PRIMARY KEY,
	word varchar(64) NOT NULL,
	CONSTRAINT cx_word_uniqueness UNIQUE (word)
);

create table idx_words(
	wordID int NOT NULL REFERENCES words(wordID),
	cableID int NOT NULL REFERENCES cables(id),
	INDEX idx_word_match (wordID, cableID),
	CONSTRAINT cx_words UNIQUE (wordID, cableID)
);

The custom index can be seeded with two queries:

INSERT INTO words (word) VALUES ('SUDAN');
INSERT INTO idx_words (
 SELECT words.wordID, cables.id FROM words, cables
 WHERE words.word = 'SUDAN' AND upper(cables.data) LIKE '%SUDAN%'
);

It takes a few minutes to seed each word, but searches are instantaneous with a small number of seeded words. I have not tested this with a large number of seeded words.
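Once a word is seeded, a search is a pair of joins through the cache. Here is a sketch of the lookup driven from Python; the MySQLdb client and connection details are assumptions, and any MySQL interface would work the same way:

import MySQLdb

# Table and column names are the ones defined above.
db = MySQLdb.connect(host="localhost", user="root", passwd="", db="cables")
cur = db.cursor()
cur.execute("""
    SELECT cables.id, cables.date, cables.local_title
    FROM words
    JOIN idx_words ON idx_words.wordID = words.wordID
    JOIN cables ON cables.id = idx_words.cableID
    WHERE words.word = %s
    ORDER BY cables.date""", ("SUDAN",))
for cable_id, date, local_title in cur.fetchall():
    print(cable_id, date, local_title)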

Other text search methods?

If you know of a better search method, please mention it in comments.

Other search indexes?

The cables have additional information that a sufficiently intelligent program can pull out of the data field and add to the database metadata. If someone has already done this work, please mention it in comments.

Reading the cables

You will need a program to pull the data out of the database in a form that you can read. I wrote a quick and dirty PHP program to display search results as HTML.

CAPS reformatting?

Cables before circa 2000 were in ALL CAPS and are difficult to read. A program could potentially convert the text to normal mixed case, although it would need to be able to recognize acronyms and people's names. If someone has already done this work, please mention it in comments.
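To illustrate why recognition is the hard part, here is a naive first pass of my own. It sentence-cases the text and preserves a small whitelist of acronyms, but names like Sudan come out lowercase:

# The acronym list is a made-up sample; a real version would need a much
# longer list plus recognition of people's and place names.
ACRONYMS = {"US", "UN", "EU", "GOS", "UNMIS"}

def mixed_case(text):
    out = []
    start_of_sentence = True
    for word in text.lower().split():
        if word.rstrip(".,;:").upper() in ACRONYMS:
            out.append(word.upper())
        elif start_of_sentence:
            out.append(word.capitalize())
        else:
            out.append(word)
        start_of_sentence = word.endswith((".", "?", "!"))
    return " ".join(out)

print(mixed_case("THE US EMBASSY IN SUDAN REPORTS. GOS OFFICIALS OBJECT."))
# -> The US embassy in sudan reports. GOS officials object.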

Keyword recognition?

A program could potentially recognize key words such as people's names, and link these words to other sources of information such as History Commons and Wikipedia. If someone has already done this work, please mention it in comments.

Imagine a highly-abstract, highly-verbose, machine-readable superlanguage that supports a superset of the features available in the common general-purpose object-oriented languages like C++, Java, Python, Perl, Javascript, and PHP.

Instead of programming code being interpreted or compiled down to a base machine language, it is decompiled upward to the more abstract language. The abstract language is later compiled down to JVM bytecode or machine code, and can also be compiled back down to any of the supported input languages.
( Read more... )

During a classroom exercise, the programming teacher told us not to put argument-checking code such as ptr != NULL inside a function, but instead to check the arguments before calling the function. My first reaction was: well, you're old. To be fair, he was teaching us recursion at the time, where small amounts of spent time build up through repetition, and precautions can sometimes be skipped in the rare case of internal functions that will never be used in other places or by other developers. But let me explain my initial reaction.

There were arguments in the 1990s over who is to blame when a bad parameter causes a function to crash a program. Coders of the old school maintained that crashing libraries were the fault of the third-party programmers who passed in bad values when the documentation clearly said that such values were not allowed. This attitude probably stems from the 1970s and earlier, when every CPU cycle counted and programmer effort was cheaper than CPU time and memory space. The new and contrary idea, which won the argument, was that libraries should be so solid and robust that third-party programmers cannot crash them.

Still, there is a distinction that the old-school programmers recognized: the data-checking code, which only validates the arguments, can be considered separate from the operations code that does whatever the function is meant to do with them. This might not affect how we write code, but perhaps this idea could affect how code is compiled and run.

Applying this concept to the build process

Consider this pseudocode:

myfunction (x, y, z)
10: return false if x == NULL
20: return false if y > 24
30: return false if z < 0
40: Do something
...
90: return true

The first several lines of the function ensure that the arguments were passed correctly. A sufficiently smart compiler could recognize the parameter-checking code -- perhaps as code which returns false or throws an error before any data is modified -- and create some metadata saying that for all calls to myfunction(), x must be non-null, y must be no greater than 24, and z must be non-negative.

An even smarter compiler[1] could inspect the calling code and see whether the parameters passed into myfunction() are set to known valid values, meaning constants or non-volatile values which have already been checked against the same constraints and have not changed since. If the values are all knowable and within acceptable ranges, the compiler can have this function call jump to line 40 instead of line 10, saving a whole three to six ops.

[1] I've earlier used the term "never-watches" to describe this kind of inspection. I hear that modern compilers have some functionality like this, but I don't know what they are capable of.
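The same split can be written out by hand at the source level. A toy Python rendering of the pseudocode above, my own construction, just to make the jump-to-line-40 idea concrete:

# The checked entry point plays the role of line 10; the separate body
# function plays the role of line 40.
def myfunction(x, y, z):
    if x is None:   # line 10
        return False
    if y > 24:      # line 20
        return False
    if z < 0:       # line 30
        return False
    return _myfunction_body(x, y, z)

def _myfunction_body(x, y, z):
    # Line 40: the operations code, entered with the checks already satisfied.
    return True

print(myfunction(None, 1, 1))           # False: rejected at "line 10"
print(_myfunction_body("data", 10, 3))  # True: known-valid values skip the checks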

Problems with the idea

That's not much of a savings

The CPU is probably spending more time blocked on memory I/O than it would spend running these checks at the start of each function. Code reordering means there might be literally no time savings in the common case, since a data fetch instruction might be moved before the checks and then the checks can be run while waiting for the data.

There are also no memory savings, not that it would matter today. The compiler cannot leave lines 10 through 30 out of the library, because third-party developers will often pass in variables whose range of possible values cannot be determined at compile time. This is an undecidability problem rather than a P != NP problem: some ranges can be determined statically, others cannot.

If we were to go further and leave lines 10 through 30 out of the library, relying on the compiler and linker to reject code that does not match the constraints, then anybody could link in bad code by using their own development tools.

Those aren't always check constraints

Functions like isalpha() may use tests indistinguishable from argument checking code as part of their functional logic. A value which causes these functions to immediately return false is not an invalid value. When our too-sufficiently smart compiler says that the arguments should be restricted to a range, it is wrong.

The one advantage of the sufficiently smart compiler is that it would have code with known arguments jump straight to the logic for handling these arguments. An even smarter compiler could inline and optimize that code to get rid of the function call entirely.

The data may change at runtime

Suppose a particular function call jumps straight to line 40 because only constants were used in the calling code. Then a debugger or hostile code sets x to null.
