A variable representing a link to a website might be called $url or $link or $site. Take this project I'm working on. At one point in the code the variable might be called $url. At another point it is $link. Further on, it is $site. I want to reach back across time to my younger self and slap him the face and scream at him and impart a lesson on the value of maintaining a consistent naming convention.

So let's say $site represents a link to a website except when it doesn't because I am also using $site as a unique key to represent a website in the general sense that "this website contains web pages". For this purpose I send $site through a filter that strips out the protocol and leading www. and the ending index.html if it exists. This means that $site is not the same thing that it used to be. So $site and $link actually are two different things with two different meanings. They should be two different variables.

I fixed a few bugs just by noticing that the meaning of these variables was not consistent.


In the same program I have two variables $sites_visited and $destination_count that seem to have a similar purpose.

$sites_visited is:

  • set to 1 when process_site() is run
  • checked when recursing into new sites
  • was sent to process_site_results() in an earlier version, but this is commented out

$destination_count is:

  • increased by 1 when process_site() is run (this is a bug)
  • increased by 1 when a site's list of links is finalized
  • used by process_site_results() to determine whether the site should be included in output.

So both of them are set to true when a site is processed and were intended to be used to tell if a site is worth processing. However, they mean different things.

  • $sites_visited is true when a site has been fetched and processed.
  • $destination_count counts the number of times a site has been linked from another site. It is nonzero (true) before the site is fetched or processed.

So they remain two different variables because they account for different concepts.

Has-A pattern:

class MyClass { 
 FooType foo;
 BarType bar;
}
  • each MyClass allocates space for a FooType and a BarType.
  • foo and bar establish their own namespaces.

Inheritance:

class MyClass extends FooType, BarType { }
  • each MyClass allocates space for a FooType and a BarType.
  • The namespaces for FooType and BarType are incorporated into the MyClass namespace.

So what's the problem? Something in FooType or BarType might overlap.

General outline of a solution:

  • All namespace conflicts must be resolved by the programmer.

Considering specific examples:

1. Both FooType and BarType have a .x property?

  • Solution 1: Forced upcasting. All attempts to access MyClass.x must be upcasted to the parent. All parent methods work on their own .x property.

2. Both FooType and BarType have a .fooMethod()?

  • Solution 1: Forced upcasting. The compiler should warn if the programmer fails to resolve the conflict, and throw an error if the programmer tries to run MyClass.fooMethod() without overriding it. This can cause problems (see case 4).

3. MyClass has its own .x property?

  • Solution 1: Override parent access methods. If the property is interface-compatible with the parent's .x property, the FooType/BarType methods will run on MyClass.x instead of their own .x properties. If the property is not interface-compatible with the parent's .x, the compiler should throw an error and refuse to compile. PROBLEM: FooType and BarType may intend for their .x property to mean different incompatible things or to hold separate pieces of information. Sharing the same piece of data will introduce bugs.
  • Solution 2: Do not override parent access methods. Parents continue to operate on their own .x properties.

4. MyClass has its own .fooMethod()? Theoretically, this should override the FooType/BarType methods. However, it introduces a double-call problem.

  • Solution 1: Override parent methods. This introduces a double-call problem. If both parents have a method called every second that calls their own fooMethod(), this causes the overridden fooMethod() to be called twice per second instead of once per second.
  • Solution 2: Do not override parent methods. Parent calls to fooMethod() will call the parent version of fooMethod().

Conclusions

More thought needs to go into this.

  • Multiple inheritance is fine until you have a namespace collision, and then you're in trouble.
  • If you have the parent classes share data when both have a .x property, then you have the problems that each might treat .x as a different conceptual thing and that conflicting actions on the same .x may introduce race conditions that do not exist when each class uses its own data.
  • If you do not have the parent classes share data, then you have multiple copies of what might be the same conceptual thing, updates to one do not affect the other, and the state of ".x" is unclear.
  • If the child class .x overrides the property for both parent classes, then you risk the parent methods modifying the data in ways you did not intend.
  • If the child's .foo() overrides the method for both parent classes, then you might get a double-call problem when methods in both parent classes call their overridden .foo() method.
  • If the child's .foo() does not override the parent method, then you lose the ability to extend the parent class by having its .foo() method do something different.
  • The programmer may want to resolve namespace collisions in a different way for different properties and methods in the same class.

Gee, I think this software of mine is close to ready for release. I just need to brush it up a little.

... (create a todo list ... it's half a dozen things)

... (open the file, and there's a list of another half dozen things to do)

... (there are several known bugs that have no discernable cause)

... (none of the functions are documented and I cannot tell what one of them is supposed to return)

... (making it run well will require repeatedly rerunning it and massaging input data files)

... (it produces no output when given the same input that had produced good results the last time I had worked on it)

... (the file is still named test.pl)

This might take more time than I had thought.

Consider that a class in c++ or java defines both an implementation and an interface.

Consider the similarities in these three examples:

// Example 1 
interface IFoo {
 // interface
 doStuff();

 // implementation not defined
}
// Example 2
class Foo {

 // implementation
 int x;
 string y;

 //interface
 doStuff(){...}
}
// Example 3
// implementation 
struct foo {
 int x;
 string y;
}

void foo_doStuff(); // interface

An ordinary compiler might be expected to convert lower-level code to a low-level implementation immediately upon reading it. The programmer has defined exactly how the code is to be implemented, and there is no apparent need for higher level considerations.

A hypothetical high-level compiler might first convert as much low level code as possible into a higher-level intermediary language.

  • class Foo is automatically built from struct foo and is automatically populated by the foo_() functions that operate on a struct foo.
  • interface IFoo is automatically built from the internal representation of class Foo
  • This high level representation is saved to an object file and can be accessed by any programming language that can load libraries.

Additional thoughts:

  • An autodoc tool can produce documentation from the resulting high-level objects, especially if the high level object includes comments or links back to the original source code.
  • An IPC mechanism can send the object definition to the receiving side if the receiving side does not have the structure defined.
  • The compiler can detect classes that are binary-compatible with one another and allow them to be casted without harm.

Also, it should be possible for a sufficiently smart compiler to recognize that some for-loops are examples of an iteration. These can be interpreted upwards to a high-level description of an iteration like foreach x in myArray{...} for which the developer has provided a low-level implementation for (i=0; i<length; i++){...}

I cannot think of any benefit to doing this, but it could potentially be done.

A new preview release of Swift 3 is out this month. I tried compiling it with debugging flags turned on, left my computer running, and came back several hours later to a failed build.

FAILED: lib/libLTO.so
(...cut giant parameter list...)
clang-3.7: error: unable to execute command: Killed
clang-3.7: error: linker command failed due to signal (use -v to see invocation)

Huh. I tried it again while browsing the web in the background. A few hours later, my system started swapping like crazy. KDE froze for minutes at a time. top showed several ld processes each using over 1GB of memory. I think it died when one of them hit the 32-bit limit at 2GB.

So, WTF? Let's check out the sizes of some of these libraries.

560968756 Aug  6 19:22 libclangSema.a
620546286 Aug  6 19:49 libclangStaticAnalyzerCheckers.a
660270964 Aug  6 19:19 libLTO.so

And let's see what's in some of these archives.

~/swift/src/build/Ninja-DebugAssert/llvm-freebsd-x86_64/lib% ar -tv libclangStaticAnalyzerCheckers.a
rw-r--r--       0/0        164312 Dec 31 16:00 1969 AllocationDiagnostics.cpp.o
rw-r--r--       0/0       7163456 Dec 31 16:00 1969 AnalyzerStatsChecker.cpp.o
rw-r--r--       0/0       7142344 Dec 31 16:00 1969 ArrayBoundChecker.cpp.o
rw-r--r--       0/0       7290936 Dec 31 16:00 1969 ArrayBoundCheckerV2.cpp.o
rw-r--r--       0/0       9251672 Dec 31 16:00 1969 BasicObjCFoundationChecks.cpp.o
rw-r--r--       0/0       7271432 Dec 31 16:00 1969 BoolAssignmentChecker.cpp.o
rw-r--r--       0/0       7159208 Dec 31 16:00 1969 BuiltinFunctionChecker.cpp.o
...

There are many object files in the 7MB range. It adds up. Looking at one of them with elfdump, nearly all of the space is used by DWARF debugging info.

entry: 730
        sh_name: .debug_str
        sh_type: SHT_PROGBITS
        sh_flags: 
        sh_addr: 0
        sh_offset: 26152
        sh_size: 3267448
        sh_link: 0
        sh_info: 0
        sh_addralign: 1
        sh_entsize: 1
entry: 734
        sh_name: .debug_info
        sh_type: SHT_PROGBITS
        sh_flags: 
        sh_addr: 0
        sh_offset: 3296469
        sh_size: 1252146
        sh_link: 0
        sh_info: 0
        sh_addralign: 1
        sh_entsize: 0

entry: 735
        sh_name: .rela.debug_info
        sh_type: SHT_RELA
        sh_flags: 
        sh_addr: 0
        sh_offset: 5007208
        sh_size: 2054928
        sh_link: 748
        sh_info: 734
        sh_addralign: 8
        sh_entsize: 24

How big are the original source code files?

4944

So that gets turned into 7MB when debugging flags are turned on, and it all goes over 2GB when multiplied by the many files in a large package. This causes the build to fail.

Scrap Dump

Jul. 31st, 2016 06:56 pm

pattern for resolution of an abstraction to a meaningful value

A = an abstract something

function f(A) {
  if(A == "FOO")
    return FOO
  if (A == "BAR")
    return BAR 
  if (A matches /[A-Za-z0-9]+/)
    return TOKEN_ALNUM
  if (A has method "next")
    return I_SEQUENCE
  if ( not (A contains meat) )
    return VEGETARIAN

  return DEFAULT 
}

This can be generalized to a function taking input:

   f(A, DEFAULT, array of [f_comparator -> f_resolver])

where the comparator function runs a comparison or pattern match on A, and the resolver function takes A and produces an output.

Possible error conditions:

  • All comparators fail. Solution: force caller to provide a DEFAULT value, have the caller handle it.
  • Multiple comparators match. Possible solutions:
    1. Use the first matching comparator.
    2. Give different values to comparators. Use the comparator with the highest matching score.
    3. (combination of 1 and 2) Sort the comparator->resolver pairs by comparator value before passing them in.
    4. Throw an exception (nooooooo!)
    5. [Edit Aug 2] Perform ALL of the matching operations. Assume the function is a decorator.
    6. [Edit Aug 2] Perform all of the matching operations and return a set of results. Assume the function returns results, plural, rather than one result.

Possible optimizations:

  • Inline the function calls.

These are not particularly new ideas. All of this has to be well tread ground.


[Edit Aug 2] Thoughts on "inline the function calls"

The comparator functions may be very simple and easily optimizable in theory. My preference is to write code to pass them in as a set and then let the compiler somehow produce code that has the simple comparison functions inlined as in the original example. This might be possible in theory if the functions are defined at compile time. The optimizer would need to be smart enough to unroll the loop and examine its contents and understand them.

If the functions are defined at run time, the optimizer would need to be part of the run time environment. It seems that what I want is the ability to arbitrarily optimize away function calls at runtime.

PC-BSD needs a better startup system

It's 2015 and we still haven't fixed init systems yet.

  • Problem: Duplication of code in the scripts. For example, every script contains code that checks to make sure the script is not disabled. This is best handled by the caller.
  • Problem: Startup scripts are disabled by looking for the name of a magic word in the startup script and manually adding it to a global config file. This is clunky and there has to be a better way.
  • Problem: Startup waits 30s for wireless to not connect when booting my laptop away from home. I can ^C but shouldn't have to.

Goal: allow a parallel system startup Read more... )

  • Any operation may take operands (parameters, arguments). Example: function foo($arg1, $arg2, $arg3)
  • Any operands may have attributes. Example: f(long int $arg1, fancy blue dishwasher $arg2)
  • The reader (compiler or interpreter) takes operation + operands and produces instructions that will have the desired effect. This can be done at runtime or at compile time.

Operations, operands, and attributes might be seen respectively as Verbs, Nouns, and Adjectives.

The question of what instructions are to run depends upon the attributes of the operands. For the simplest example, the code "2 + x" will produce different instructions depending upon the attributes of x.

small int x = 42
return 2 + x // add ax, 2

long int x = 42
return 2 + x // add eax, 2

float x = 42.0
return 2 + x // fadd(42.0) ; adds to st0

At some point the system determines "x is an int, use integer add" or "x is a float, use floating point add". The process of compilation can be described as the resolution of high-level code to lower-level implementation. This may involve translation through intermediary languages.

An interpreted high-level language will make these determinations at run time. A compiler may make the process more efficient by stripping out all of these determinations at compile time. At the most extreme of efficiency the reader may determine that the code will always produce the same result (42+2 = 44) and insert the resulting value rather than the code to produce it.

Optimization vs. reuse

A downside of such optimizations is that it becomes impossible to modify the behavior of a program at runtime if the behavior-determining code is optimized out. Correspondingly, an advantage to keeping code unoptimized is that the program may be modified by a wrapper that changes the resolution of a function call.

A system could provide the best of both worlds by using an object format that stores both the optimized version and a high-level interpretation, and recompiling on the fly when directed to change part of the program. To allow the recompilation of one subroutine at a time, instead of rebuilding the whole program, the metadata may include a catalogue of entry points.

Go claims to have strings. That would be a nice feature except that most of the core library functions operate on byte arrays and you need to make an explicit cast every time you move from strings to byte arrays and back.

Do you want to interpret a boolean as an int? Fuck you.

It is difficult to define complex types such as sets of function references. The syntax that works in a parameter definition produces a syntax error when I try to cast an array. By the way, you need to cast your arrays when you define them because the compiler cannot determine that information from the array's contents.

Error handling is... different. For example, if you wanted to write to a file in any other language you would do something like this:

status = fopen("foo.txt", "w")

or:

try {
 fopen("foo.txt", "w")
} catch e {
...
}

In Golang you need to do something like this:

var writer *bufio.Writer 
f1, err1 := os.Open("foo.txt")
if(err1 != nil){
	return writer, err1
}
tmp2, err2 := io.Writer(f1)
if(err2 ! nil){
	return writer, err2
}

return bufio.Writer(tmp2)

... so you end up with long repeating branches in a function that would be a few lines in any other language.

Do you want to close the file, allow other subprocesses to work on it, and reopen it later for reading? Fuck you.

Have I mentioned that I don't like golang's error handling pattern? I don't like golang's error handling pattern.

Go gets one thing right and that is its goroutines. Parallelization of a function should be as easy as putting the word "go" in front of the function. Other languages should pick up on this.

Sorting

Nov. 29th, 2015 08:44 pm

Streamable sorting

A streamable sorting algorithm is one that has an intermediary state in which the entire set may not yet be sorted, but it can be guaranteed that some number of items at the head are sorted. For example, the bubble sort will bring the optimal item to the head of the set before the rest of the set is sorted. The sorted elements at the head may be dispatched while the sorting algorithm continues to run on the remainder of the list.

The use case is to pipe data to a different operation in a multi-stage process so that the entire process will be completed more quickly than if the process had waited for the set to be sorted. This could be in shell scripting or any I/O bound transfer of data. Another use case is in a user-facing interface where the appearance of doing something is more important than total runtime.

Speed considerations

A strong argument against use of a streamable sorting algorithm is that non-streamable algorithms are so much faster than the bubble sort that the process is likely to complete sooner if one chooses the better algorithm and waits for it to complete.

Made-up example:

  1. Algorithm A will sort in 10 seconds and can start transfering after 4 seconds because it knows that the head is optimal.
  2. Algorithm B will sort in 5 seconds, but I/O cannot begin transfer until all items are sorted.

The slower algorithm will get the data across sooner if certain conditions are true:

  1. Transfer time is slower than the sort time of the faster algorithm.
  2. The transfer process will not interfere with the sorting process. On modern multi-cpu systems this should not be a problem.
  3. The fraction of time when transfer may begin (say, at 40% of sort time) is lower than the speed of the comparable nonstreamable algorithm compared to the nonstreamable one (say, algorithm B finishes sorting in 50% of the time of the streamable algorithm). Note that this will depend heavily on the set size due to the performance differences in sorting algorithms.

Variables:

  • Tx = (Time to transfer X objects)
  • At = (start of transfer for algorithm A)
  • Bt = (start of transfer for algorithm B)

When transfer time is slower than the sorting time of either sort algorithm, the overall process is bound by when the transfer begins. The streamable algorithm will be faster when it begins transfer before the non-streamable algorithm would finish and begin transfer.

When transfer time is faster than the slower streamable sorting algorithm, the streamable algorithm is bound by its own slowness. The process will be faster than a non-streamable algorithm only if transfer time remains slow enough for (Tx + Bt) to be greater than the time needed for the streamable algorithm to finish sorting.


Dividing sort time across sender and receiver

Imagine the following:

  • We need to sort and transfer data between two systems that both need to minimize CPU use.
  • We have a sorting algorithm that divides the set into chunks and sorts the chunks over multiple passes.
  • We can predict the number of passes needed by the algorithm from the set size.

Assuming that it is still the 1990s when these limits would matter (this is more of an intellectual exercise than a serious proposal), we can divide the effort of sorting between client and server. The sender runs half of the expected number of passes and sends the data partially-sorted. The receiver finishes sorting the half-sorted data as the data is received and inserts the items coming down the pipe where they best fit.

Let us say that we have a data object and we want to manage metadata about the object. There are multiple ways of handling this. Read more... )

Proposed: The basic data types of a programming language should be interfaces, the standard library should work on interfaces, and the low-level data types should be implementations of the basic interface.

The php devs spent years trying to convert their default string type from ascii to unicode before giving up. Golang shits a brick if you try to convert between a string and a []byte without an explicit cast. Consider instead that every function were written to operate on a string interface rather than a string data type, and we let the compiler and optimizer figure out what to do with whatever is passed into the function. []byte can implement the string interface. The programmer can send the standard string functions data in any charset (korean, japanese, whatever) and it will just work because a string of any charset will implement the string interface.

Another use case: You read some numbers as an integer. It could be a bigint or a small int. You don't care until you need to store it again. Performance and storage do not matter, you just need some math or a comparison to work. You write your code to work on an "int" interface and let the compiler figure out how to implement it.

Layers of abstraction

At the hardware layer we have native bytecode created from amd64 through a translator. We operate on registers, have to keep track of what data is in what register, and if the data has any meaning there is no way for the computer to know about it.

At the low level of abstraction we operate on bytes and sets of bytes. OS, POSIX, and standard library interfaces are written in C. One might consider high level assembler with strings and matrices to be at this level. There are multiple data definition formats to define how data is laid out in memory: C header files, SQL data definitions, a custom format for each language.

At the intermediate level of abstraction we define classes to associate the data structures with the methods that work on the data. We have templates, interfaces, abstracts, generics to allow code reuse.

At higher levels everything is abstract. We create anonymous objects on the fly and tell the computer to run operations on members that we hope will have been defined by runtime.

Method abstraction

There are similar layers of abstraction for methods, functions, subroutines. At the hardware layer a method is just a series of instructions that you jump to. Low-level programming allows you to describe the inputs and outputs: this function takes a byte pointer and returns a byte pointer. At the intermediate level of abstraction you can define a template function that allows the same code to be reused on different data types. At higher levels of abstraction all functions do that by default.

Abstracting operating environments

We have had all of these advances in abstraction in the programming area but our shells are still stuck in the early 1990s. Programs are written to read bytes or lines at a time. Streams of characters are passed between programs with no information about what the characters mean. Each program in the sequence must re-parse the stream.

Consider this sequence.

sh$ ls -l | bar > baz

Presently this outputs a chunk of text, and the bar program needs to know what to do with the text. Imagine instead that ls -l generates a table or recordset of files and their attributes, and what it outputs is context-dependent.

  • If the receiver is a text terminal, the shell uses a formatter provided with ls to generate a text table compatible with today's ls -l.
  • If the receiver is a hypertext terminal running in a windowed environment, the shell uses a formatter to send the data and metadata to the terminal in a given standard format that the terminal will use to produce an interactive display of files.
  • If the receiver is a .csv file, the shell automatically uses the standard table-to-CSV converter from the operating system's libcsv.
  • If the receiver is a program that requests input of a given class, the shell attempts to present the input as that class.

In a high-level shell design,

  • Programs are written to read more abstract sequences of objects. The implementation of the reader is chosen at runtime. Programs will have metadata exposing their expected input types.
  • Programs may pass both data and metadata. There will likely be a standard for a data+metadata format, and all standard readers will automatically recognize it.
  • Programs may pass pre-parsed chunks of data to skip the step of parsing by subsequent programs. This may be as simple as copying a chunk of memory from one program to another on the same computer. The high level program code will be the same whether the receiving program runs on the same computer (memory copy) or sends the data over the network.

The Supreme Court let stand the Google v. Oracle decision finding APIs to be copyrightable (mentioned earlier). They did not make a decision but chose not to hear it on the advice of Solicitor General Donald Verrilli who affirmed the appeals court ruling.

Nim

Jan. 4th, 2015 06:24 pm

Nim (formerly Nimrod) looks like an interesting language. Via lobsters. Some links:

scrap dump

Jun. 6th, 2014 05:15 am

I had intentions to write a blog post declaring 2011 as the Year of the Hacker but never got around to it. We had LulzSec, J3ster, and Web Ninjas. We had Stuxnet. We had Iran hacking certificate agencies. We had the iPhone "Towson" hack. We had noobs getting arrested for using LOIC from home. We had a virus hit control computers for the CIA's drones.


The benefit of modern Javascript+HTML is that you can do anything with it. The drawback of making Javascript+HTML a Turing-complete environment is that you can do anything with it.


I was studying Android programming a few months years ago, and they made this recommendation: "Don't call the UI-construction code directly! Use XML for your interfaces!"

The Android XML format is so painful to look at that I thought it worth my time to design my own alternate domain-specific language rather than use the one they gave me. (I never finished it)


The old practice of web development

<HTML>
<HEAD>
<TITLE>My Webpage</TITLE>

The new practice of web development

var node_html = document.createElement("HTML");
var node_html_head = document.createElement("HEAD");
var node_html_head_title = document.createElement("TITLE");
node_html_head_title.innerHTML ="My Webpage" 

Android and the Web seem to be going in opposite directions there.


I propose a new baseball statistic that weights and combines multiple different types of failure. Call it Derps Per Perp.

For pitchers:

  • any appearance including:
    • an inning surrendering four or more runs, OR
    • any appearance of under one inning that is not the final appearance of that inning.
  • Balks
  • Hit batters

for batters:

  • Hitting into double and triple plays
  • Fielding errors

Divide batter stats by factors related to at-bats

Divide pitcher stats by factors related to innings pitched

Statistics for the AL might need to be adjusted due to pitchers not hitting and designated hitters not fielding.

"Why Python Is Slow" is a quick read on Python's dynamic typing. I had not known that it was possible to change the value of an integer constant in Python.

The US Court of Appeals for the Federal Circuit has granted copyright protection to APIs and the layout of files on a filesystem. This makes it legally impossible to write a homebrew replacement for a software library in the United States. The decision, by Kathleen O'Malley with S. Jay Plager and Richard Taranto concurring, gives the example of the name "java.lang.Math.max" being a copyrighted work on the grounds that someone trying to implement a replacement math library could have used the name "Math.maximum" or "Arith.larger". Until overturned by an en banc rehearing or the Supreme Court, this decision outlaws Wine, Samba, Mono, Blackdown, probably half of GNU, and certainly Linux and the BSDs, not to mention OSX.
Page generated May. 27th, 2017 05:27 pm
Powered by Dreamwidth Studios