Thoughts on databases, to no effect
Apr. 2nd, 2012 03:43 pm
The other day I had a wild idea about database design.
Indexes should be on constraints rather than on columns.
It just so happens that equality (select where x=y) is the most commonly needed constraint, so conventional indexes work for most needs.
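To make the idea concrete, here's a toy sketch in Python (nothing like a real engine's storage, and the rows are made up): a conventional equality index and a generalized constraint index have the same shape, a precomputed set of matching row ids.

```python
rows = [
    {"id": 0, "color": "red"},
    {"id": 1, "color": "blue"},
    {"id": 2, "color": "red"},
]

# A conventional index answers only "column = value":
# it maps each value to the set of matching row ids.
eq_index = {}
for row in rows:
    eq_index.setdefault(row["color"], set()).add(row["id"])

# An index on an arbitrary constraint is just the precomputed
# set of row ids that satisfy the predicate.
is_red = lambda row: row["color"] == "red"
constraint_index = {row["id"] for row in rows if is_red(row)}

print(eq_index["red"])      # {0, 2} -- the equality special case
print(constraint_index)     # {0, 2} -- the same thing, generalized
```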
What new features could be added to a database engine with this concept in mind?
A gt/lt index can be implemented with a binary search algorithm.
The "index" might be a set of references to the records, sorted by the indexed columns (a, b, ...). Searches of type x<y could then use a binary search to find a starting point to read through the records.
Indexes can already be implemented with B+trees, which are sorted and searchable, so this has been done. Next idea.
A hardcoded static constraint index can be implemented for common queries
An index could be made on the result of a conditional test written in SQL, like (column1 = foo and column2 > bar), where the values are known at creation time. This could add a bit of speed when there is a commonly run query that does not already trigger an index.
This has been done too. It's called "ADD COLUMN", "UPDATE ... SET", and an index on the new column. Triggers can be used to keep the column current, or if the constraint depends on the contents of other tables.
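For what it's worth, here's that workaround sketched against SQLite from Python; the table, values, and trigger are all invented for illustration. (Postgres can also do this directly with a partial index: CREATE INDEX ... WHERE ....)

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t (column1 TEXT, column2 INTEGER);

-- The workaround described above: materialize the test result in
-- its own column, then index that column.
ALTER TABLE t ADD COLUMN matches INTEGER;
CREATE INDEX t_matches ON t (matches);

-- A trigger keeps the precomputed value current as rows arrive.
CREATE TRIGGER t_set_matches AFTER INSERT ON t
BEGIN
    UPDATE t SET matches = (NEW.column1 = 'foo' AND NEW.column2 > 10)
    WHERE rowid = NEW.rowid;
END;
""")
con.execute("INSERT INTO t (column1, column2) VALUES ('foo', 99)")
print(con.execute("SELECT * FROM t WHERE matches = 1").fetchall())
# [('foo', 99, 1)]
```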
A "we already did that query" index
A hypothetical index could be a simple subset of the data, the results of a constraint.
A downside is that the table's other indexes would not cover this subset; they would all need to be rebuilt against it.
This could come in handy if paired with new logic to recognize common arguments to queries. A sufficiently intelligent system might recognize that the same long-running query with "where x like '%foo%'" has been run three times in the past hour, notice that the clause is slow, and keep the results around in case the same clause is used again on the same data.
A database engine could also cache the results of a whole query in a structure containing the result set, references to the tables and subqueries required, and timestamps of the last changes to the data sources at the time of the query. If the query text is the same, none of the tables have changed since the data was last collected, and all of the subqueries pass the same test, the result will be the same. This has been done; MySQL, for example, has shipped a query result cache for about a decade. If the result of a query with multiple constraints can be cached, there is little benefit to caching the result of each individual constraint.
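A minimal sketch of that invalidation rule, assuming the engine keeps a version stamp per table that is bumped on every write (the table names and query are made up):

```python
table_version = {"users": 7, "orders": 12}

query_cache = {}   # sql text -> (table versions at cache time, rows)

def cached_query(sql, tables, run):
    versions = tuple(table_version[t] for t in tables)
    hit = query_cache.get(sql)
    if hit is not None and hit[0] == versions:
        return hit[1]              # nothing changed: same result
    rows = run(sql)                # cache miss: really execute it
    query_cache[sql] = (versions, rows)
    return rows

# The second call is served from the cache because neither table's
# version stamp moved between the two calls.
fake_run = lambda sql: [("row",)]
cached_query("SELECT * FROM users JOIN orders", ("users", "orders"), fake_run)
cached_query("SELECT * FROM users JOIN orders", ("users", "orders"), fake_run)
```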
Date range indexes
Timeframe queries from the current time are fairly common and never the same. An index for dates could be a data structure that contains pointers to:
- a day ago
- a week ago
- ...other time periods as needed...
Time can be expected to move forward. When the query is eventually run, the record for "a day ago" is now likely to be older than a day ago but close enough that a linear search from that time should find the proper cutoff point soon.
When wouldn't the "a day ago" record be older than a day ago? When the last record in the table is itself only a few hours old.
This index will need to be rewritten every single time it is used, but that should be a simple matter of replacing the pointer.
B+tree indexes are probably fast enough that only a minimal speedup would be gained from this setup. In fact, for a table with a very small number of records, the linear search from the entry pointer would be slower than a plain binary search.
The pointer could be used as a hint to begin a binary search. The difference should not be noticeable, likely saving one or two search steps at most.
This idea can be generalized to storing entry-point hints for commonly requested searches on sorted fields.
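Here's what that might look like in Python, with made-up timestamps; the stored hint replaces the lower bound of the binary search rather than starting it from scratch:

```python
import bisect
import time

# Records sorted by timestamp, ascending; the values are invented.
now = time.time()
timestamps = [now - 86400 * d for d in (9, 7, 5, 3, 2, 1.5, 0.9, 0.1)]

# The stored entry-point hint: where "a day ago" fell last time.
day_ago_hint = 2   # hypothetical stale position from an earlier query

def records_from_last_day():
    global day_ago_hint
    cutoff = time.time() - 86400
    # Time only moves forward, so the true boundary is at or after
    # the hint; begin the binary search there instead of at zero.
    i = bisect.bisect_left(timestamps, cutoff, lo=day_ago_hint)
    day_ago_hint = i   # replacing the pointer "rewrites" the index
    return timestamps[i:]

print(records_from_last_day())   # the timestamps newer than one day
```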
Slightly off topic...
Some indexes use bitmap fields: the constraint is reduced to a boolean value ahead of time, so a search compares against a bit that was already set. This speeds up searching because the record contents do not have to be examined.
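A toy version of that in Python (a real engine packs the bits into words on disk; the rows here are invented):

```python
# One bit per row for a precomputed constraint, packed into an
# integer so scanning never touches the records themselves.
rows = [{"status": s} for s in
        ("open", "closed", "open", "open", "closed")]

# Build the bitmap once, when the index is created or updated.
bitmap = 0
for i, row in enumerate(rows):
    if row["status"] == "open":      # the indexed constraint
        bitmap |= 1 << i

# The search examines bits, not record contents.
matches = [i for i in range(len(rows)) if bitmap >> i & 1]
print(matches)   # [0, 2, 3]
```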
A sorted index could point to chunks of serialized bitmap for all possible bitmap queries, so that iteration is just i++ rather than a memory dereference. This would require copying (and re-sorting) the bitmaps for every sorted index. There would be a cost of slower updates from longer index rebuild times, and a small storage cost from each bitmap being repeated for each sorted index. Surely that's something that has been thought of and tested; it's too obvious. In fact, a little more reading shows it's how bitmap fields were first implemented.
Veering further off topic: it is very common for a web browser to request a few items from the web server... then the next page... then the next page. All use the same query, with the only difference being the LIMIT. Normally, a new database session is created for each request. It would be more efficient for the web server to keep the same query open and feed the browser the data as it comes in.
The browser would have to be given a sessionID and return it. The web server would have to expose the sessionID to the runtime environment which would need to be smart enough to keep the session open. It can front-load the next page of data and wait, accepting a record every few seconds so the connection does not close, and timing out and closing the session after an hour of inactivity.
The system should support multiple sessionIDs since a browser application should conceivably support pulling different data from multiple databases at the same time.
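A rough sketch of such a session store in Python with SQLite; the session handling, timeout, and schema are all invented for illustration:

```python
import sqlite3
import time
import uuid

SESSION_TIMEOUT = 3600        # close sessions idle for an hour
sessions = {}                 # sessionID -> [open cursor, last used]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE items (n INTEGER)")
db.executemany("INSERT INTO items VALUES (?)", [(n,) for n in range(100)])

def next_page(session_id=None, page_size=10):
    # Reuse the open cursor instead of re-running the same query
    # with a new LIMIT on every request.
    now = time.time()
    if session_id not in sessions:
        session_id = uuid.uuid4().hex
        cursor = db.execute("SELECT n FROM items ORDER BY n")
        sessions[session_id] = [cursor, now]
    entry = sessions[session_id]
    entry[1] = now
    return session_id, entry[0].fetchmany(page_size)

def reap_idle_sessions():
    # Run periodically: drop sessions idle past the timeout.
    cutoff = time.time() - SESSION_TIMEOUT
    for sid in [s for s, (_, last) in sessions.items() if last < cutoff]:
        del sessions[sid]

sid, page1 = next_page()      # the browser's first request
_, page2 = next_page(sid)     # "next page" continues the same cursor
print(page1[0], page2[0])     # (0,) (10,)
```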
This system could be DoSed by visiting the page a few hundred times in the same hour; the database connection limit would be filled up by normal web use. If the system instead runs the query and holds the data, it will DoS itself by using all of its memory to store multiple copies of the same query results for different sessions as different users come in. It would do more harm than good, and if the query result is already cached, there's really no point.
The low-hanging fruit has already been picked.