User-Defined Functions

In addition to all the built-in SQL functions, such as lower( ) and upper( ), you can extend SQLite to include functions of your own written in PHP. These are known as user-defined functions, or UDFs for short. With a UDF, you embed logic into SQLite and avoid doing it yourself in PHP. This way, you take advantage of all the features inherent in a database, such as sorting and finding distinct entries.

There are two types of UDFs: standard and aggregate. Standard UDFs are one-to-one: when given a single row of data, they return a single result. Functions that change case, calculate cryptographic hashes, and compute the sales tax on an item in a shopping cart are all standard functions. In contrast, aggregate functions are many-to-one: when using an aggregate function, SQLite passes it multiple rows and receives only a single value.

Although it is not a UDF, the most popular aggregate function is count( ), which returns the number of rows passed to it. Besides count( ), most aggregate functions are related to statistics: finding the average, standard deviation, or the maximum or minimum value of a set of data points.

Standard Functions

UDFs are good for chopping up strings so you can perform nonstandard collations and groupings. For example, you want to sort through a list of URLs, maybe from a referrer log file, and create a list of unique hostnames sorted alphabetically. So, http://www.example.com/directory/index.html and http://www.example.com/page.html would both map to one entry: http://www.example.com.

To do this in PHP, you need to retrieve all the URLs, process them inside your script, and then sort them. Plus, somewhere in all that, you need to do the deduping. However, if it weren’t for that pesky URL-conversion process, this could all be done in SQL using the DISTINCT and ORDER BY keywords.

With a UDF like the one shown in Example 4-3, you foist all that hard work back onto SQLite where it belongs.

Example 4-3. Retrieving unique hostnames using an SQLite UDF

// CREATE table and INSERT URLs
$db = sqlite_open('/www/support/log.db');
$sql = 'CREATE TABLE access_log(url);';

$urls = array('http://www.example.com/directory/index.html', 
              'http://www.example.com/page.html');

foreach ($urls as $url) {              
    $sql .= "INSERT INTO access_log VALUES('$url');";
}
sqlite_query($db, $sql);


// UDF written in PHP
function url2host($url) {
    $parts = parse_url($url);
    return "$parts[scheme]://$parts[host]";
}

// Tell SQLite to associate PHP function url2host( ) with the
// SQL function host( ). Say that host( ) will only take 1 argument.
sqlite_create_function($db, 'host', 'url2host', 1);

// Do the query
$r = sqlite_query($db, 'SELECT DISTINCT host(lower(url)) AS clean_host 
                        FROM access_log ORDER BY clean_host;');

// Loop through results
while ($row = sqlite_fetch_array($r)) {
    print "$row[clean_host]\n";
}

http://www.example.com

To use a UDF, you first write a regular function in PHP. The function’s arguments are what you want to pass in during the SELECT, and the function should return a single value. The url2host( ) function takes a URL; calls the built-in PHP function parse_url( ) to break the URL into its component parts; and returns a string containing the scheme, ://, and the host. So, http://www.example.com/directory/index.html gets broken apart into many pieces. http is stored into $parts['scheme'] and www.example.com goes into $parts['host'].[3] This creates a return value of http://www.example.com.

The next step is to register url2host( ) with SQLite using sqlite_create_function( ). This function takes four parameters: the database handle, the name you want the function to be called inside SQLite, the name of your function written in PHP, and the number of arguments your function expects. The last parameter is optional, but if you know for certain that your function accepts only a specific number of parameters, providing this information helps SQLite optimize things behind the scenes. In this example, the SQL function is host( ), while the PHP function is url2host( ). These names can be the same; they’re different here to make the distinction between them clear.

Now you can use host( ) inside any SQL calls using that database connection. The SQL in Example 4-3 SELECTs host(lower(url)) AS clean_host. This takes the URL stored in the url column, converts it to lowercase, and calls the UDF host( ).

The function is not permanently registered with the database, and goes away when you close the database. If you want to use it when you reopen the database, you must reregister it. Also, the function is registered only for that database; if you open up a new database using sqlite_connect( ), you need to call sqlite_create_function( ) again.

The returned string is then named AS clean_host; this lets you refer to the results later on in the SQL query and also access the value in PHP using that name. Since you’re still in SQLite, you can take advantage of this to sort the list using ORDER BY host. This sorts the results in alphabetical order, starting at a.

Now that’s cool, but it’s not that cool. What is cool is SQLite’s ability to call UDFs in the ORDER BY clause. If you use the default alphabetical sort, http://php.example.org and http://www.example.org won’t be near each other, because “p” and “w” aren’t next to each other in the alphabet. Yet both hosts are located under the example.org domain, so it makes sense that they should be listed together. Not surprisingly, another UDF saves the day.

function reverse_host($url) {
    list ($scheme, $host) = explode('://', $url);
    return join('.',array_reverse(explode('.',$host)));
}

sqlite_create_function($db, 'reverse', 'reverse_host', 1);

The reverse_host( ) function takes a URL and chops it into two bits, the scheme and host, by explode( )ing on ://. You can do this because the previous UDF, host( ), has specifically created strings in this manner. Next, $host is passed through a series of three functions that splits it up into its component parts, reverses those parts, and then glues them back together. This flips around the pieces of the host separated by periods, but doesn’t actually reverse the text. So, www.example.org becomes org.example.www and not gro.elpmaxe.www or www.elpmaxe.gro.

This reversed hostname is perfect for sorting. When you alphabetize org.example.www, it nicely sits next to all its brethren in the .org top-level domain, then sorts by the other hosts inside example.org, and finally orders the remaining subdomains. And that’s exactly what you want.

You then register reverse_host( ) in the exact same way you registered url2string( ), using sqlite_create_function( ).

Once that’s done, you can call reverse( ) inside your SQL query:

$r = sqlite_query($db, 'SELECT DISTINCT host(lower(url)) AS clean_host
                        FROM access_log ORDER BY reverse(clean_host);');

Given the following list of URLs as input:

http://www.example.com
http://php.example.org
http://www.example.org

you get the following as output:

               http://www.example.com
               http://php.example.org
               http://www.example.org

The URL containing php.example.com has filtered down in the list below www.example.com, even though php comes before www in the alphabet.

In contrast, Example 4-4 shows what you need to do to implement this in PHP without UDFs.

Example 4-4. Sorting unique hostnames without using SQLite UDFs

function url2host($url) {
    $parts = parse_url($url);
    return "$parts[scheme]://$parts[host]";
}

function reverse_host($url) {
    list ($scheme, $host) = explode('://', $url);
    return join('.',array_reverse(explode('.',$host)));
}

function host_sort($a, $b) {
    $count_a = $GLOBALS['hosts'][$a];    
    $count_b = $GLOBALS['hosts'][$b];
    
    if ($count_a < $count_b) { return  1; }
    if ($count_a > $count_b) { return -1; }

    return strcasecmp(reverse_host($a), reverse_host($b));
}

$hosts = array( );

$r = sqlite_unbuffered_query($db, 'SELECT url FROM access_log');
while ($url = sqlite_fetch_single($r)) {
    $host = url2host($url);
    $hosts[$host]++ ;
}

uksort($hosts, 'host_sort');

This process breaks down into many steps:

  1. Make a database query for urls.

  2. Retrieve url into $url using sqlite_fetch_single( ).

  3. Convert $url into a host and store it in $host.

  4. Place $url as a new element in the $hosts array and increment that element by 1. This tracks the number of times each URL has appeared.

  5. Perform a user-defined key sort on the $hosts array.

The sqlite_fetch_single( ) function returns the first (and in this case only) column from the result as a string. This allows you to skip the step of saving the result as an array and then extracting the element, either by using list or as the 0 th index.

Doing $hosts[$host]++ is a old trick that allows you to easily count the number of times each key appears in a list.

Since uksort( ) only passes array keys to the sorting function, host_host( ) is not very elegant, because it requires using a global variable to determine the number of hits for each element.

Overall, compared to a UDF, this method requires more memory, execution time, and lines of code, because you’re replicating database functionality inside PHP.

User-Defined Aggregate Functions

As discussed earlier, most aggregate functions are statistical functions, such as AVG( ) or STDEV( ). People usually use aggregate functions to return a single row from their query, but that’s not a requirement. You can use them to link together a set of related rows, to compact your query and return one row per group.

This extension to the earlier referrer log sorting example shows how to use an aggregate function to provide the total number of hits per hostname, in addition to everything in the previous section:

SELECT DISTINCT host(lower(url)) AS clean_host, COUNT(*) AS hits 
FROM access_log GROUP BY clean_host ORDER BY hits DESC, reverse(clean_host)

The COUNT(*) function sums the total number of rows per host. However, this won’t work without adding the GROUP BY host clause. GROUPing rows allows COUNT(*) to know which sets of entries belong together. Whenever you have an aggregate function—such as COUNT( ), SUM( ), or any function that takes a set of rows as input and returns only a single value as its output—use GROUP BY when you want your query to return multiple rows. (If you’re just doing a basic SELECT COUNT(*) FROM host to find the total number of rows in the table, there’s no need for any GROUPing.)

COUNT(*) is aliased to hits, which allows you to refer to it in the ORDER BY clause. Then, to sort the results first by total hits, from most to least, and then alphabetically within each total, use ORDER BY hits DESC, reverse(host). By putting hits first, you prioritize it over reverse(clean_host) and the DESC keyword flips the sorting order to descending (the default is ascending).

Using that query, this set of sites:

http://www.example.org
http://www.example.org
http://www.example.com
http://php.example.org

and this PHP code:

while ($row = sqlite_fetch_array($r)) {
    print "$row[hits]: $row[clean_host]\n";
}

gives:

               2: http://www.example.org
               1: http://www.example.com
               1: http://php.example.org

Furthermore, to restrict results to sites with more hits than a specified amount, use a HAVING clause:

SELECT DISTINCT host(lower(url)) AS clean_host, COUNT(*) AS hits
       FROM access_log
       GROUP BY clean_host
       HAVING hits > 1
       ORDER BY hits DESC, reverse(clean_host)

You cannot use WHERE here, because WHERE can only operate on data directly from a table. Here the restriction hits > 1 compares against the result of a GROUP BY, so you need to employ HAVING instead.

You can define your own aggregate functions for SQLite in PHP. Unlike standard UDFs, you actually need to define two functions: one that’s called for each row and one that’s called after all the rows have been passed in.

The code in Example 4-5 shows how to create a basic SQLite user-defined aggregate function that calculates the average of a set of numbers.

Example 4-5. Averaging numbers using an SQLite aggregate function

// CREATE table and INSERT numbers
$db = sqlite_open('/www/support/data.db');
$sql = 'CREATE TABLE numbers(number);';
$numbers = array(1, 2, 3, 4, 5);
foreach ($numbers as $n) {              
    $sql .= "INSERT INTO numbers VALUES($n);";
}
sqlite_query($db, $sql);

// average_step( ) is called on each row.
function average_step(&$existing_data, $new_data) {
    $existing_data['total'] += $new_data;
    $existing_data['count']++;
}

// average_final( ) computes the average and returns it.
function average_final(&$existing_data) {
    return $existing_data['total'] / $existing_data['count'];
}

sqlite_create_aggregate($db, 'average', 'average_step', 'average_final');

$r = sqlite_query($db, 'SELECT average(number) FROM numbers');
$average = sqlite_fetch_single($r);
print $average;

3

First, you define the two aggregate functions in PHP, just as you do for regular UDFs. However, the first parameter for both functions is a variable passed by reference that is used to keep track of the UDF’s state. In this example, you need to track both the running sum of the numbers and how many rows have contributed to this total. That’s done in average_step( ).

In average_final( ), the final sum is divided by the number of elements to find the average. This is the value that’s returned by the function and passed back to SQLite (and, eventually, to you).

To formally create an aggregate UDF, use sqlite_create_aggregate( ). It works like sqlite_create_function( ), but you pass both PHP function names instead of just one.

Binary Data

SQLite is not binary safe by default. Requiring PHP to automatically protect against problems caused by binary data causes a significant reduction in speed, so you must manually encode and decode data when it might be anything other than plain text. If your UDFs only operate on text, this isn’t a problem.

Inside a UDF, use sqlite_udf_binary_decode( ) to convert data stored in SQLite into usable strings in PHP:

function udf_function_encode($encoded_data) {
    $data = sqlite_udf_binary_decode($encoded_data);
    // rest of the function...
}

When you’re finished, if the return value might also be binary unsafe, re-encode it using sqlite_udf_binary_encode( ):

function udf_function_decode($encoded_data) {
    // rest of the function... 
    return sqlite_udf_binary_encode($return_value);
}


[3] The other portions of the URL are stored in different variables.

Get Upgrading to PHP 5 now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.