Querying a power outage database

I have been tracking power outage data using a technique called git-scraping, which I learned about from Simon Willison. The outages are broken down by city; a stylized version of the data looks like this:

1pm     seattle: 300
2pm     seattle: 200    tacoma: 150
3pm                     tacoma: 140     # implied seattle = 0
4pm     seattle:  50                    # tacoma = 0, seattle has new outages

Mr. Willison has also written a library, called git-history, that converts a git-scraped repo into a SQLite database.

Some simple information I'd like to query the database for: what's the total number of customers experiencing an outage over time? My previous solution parses the JSON in each commit in the git repo, but it's kind of slow, and I'd like to use git-history for more complex queries later on. However, I ran into two issues.
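The per-commit arithmetic itself is simple; the slow part is walking the repo. A minimal sketch of the computation, using the stylized snapshots above (the real feed's field names differ):

```python
# Each commit is a complete snapshot of {city: customers affected}.
# A city missing from a snapshot implies zero outages there, so the
# total for a commit is just the sum of the values present in it.
snapshots = {
    "1pm": {"seattle": 300},
    "2pm": {"seattle": 200, "tacoma": 150},
    "3pm": {"tacoma": 140},   # implied seattle = 0
    "4pm": {"seattle": 50},   # implied tacoma = 0
}

def totals_over_time(snapshots):
    return {ts: sum(cities.values()) for ts, cities in snapshots.items()}

totals_over_time(snapshots)
# → {'1pm': 300, '2pm': 350, '3pm': 140, '4pm': 50}
```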

Data format

The first problem was my data format. If a city drops from the list of outages between two commits, that implies the outages there were fixed. (See the 3pm line in the example above.) By default, git-history interprets an entry disappearing from the data as "no change".

I solved this in a straightforward, brute-force fashion: at each commit, query the DB for all known cities. Set their outage count to 0, and then overwrite with any cities actually in the current commit.

Using the --convert option in git-history:

git-history file outages.db outages.json --id "name" --namespace "outage" \
--import sqlite_utils \
--import sqlite3 \
--convert '
from sqlite_utils import Database

db = Database("outages.db")

# get the name of every city in the DB so far
known_cities = {city["name"]: 0 for city in db.query("SELECT DISTINCT name FROM outage")}

# overwrite 0 with the count for cities actually in this commit
for city in json.loads(content):
    known_cities[city["name"]] = city["outage_count"]

# yield one row per known city, zeros included, so git-history
# records the transition when a city drops off the list
for name, count in known_cities.items():
    yield {"name": name, "customers_affected": count}
'

It's not pretty, but it works!

Querying the database

The second issue was actually querying the DB. I've technically found a solution... but it's terribly slow: two nested subqueries! 😬 This is the general idea:

for each timestamp there's a commit at:
    sum the outage counts for:
        select the most recent row for each city (not necessarily in the "current" commit!)

Here's the query I have:

-- for each commit timestamp
select distinct
  _commit_at,

  -- sum the outages for a given datetime
  (select
      sum(customers_affected) as total

    -- select the most recent # of outages for each city, on or before the given datetime
    from (select
        max(inner._commit_at) as timestamp,
        customers_affected
      from
        outage_version_detail as inner
      where
        inner._commit_at <= outer._commit_at
      group by
        inner.name))
from
  outage_version_detail as outer

I believe it runs correctly, but at ~6 seconds per returned row, it would be faster to iterate over the original git repo and parse the JSON in each commit. Surely there's a better way to structure the two inner queries, but I haven't found it yet. If anyone has ideas, please get in touch!
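In the meantime, one restructuring I want to try: replace the innermost max()/group-by with a per-city "latest value at or before this commit" lookup, which an index on (name, _commit_at) can answer without scanning the whole table. A sketch against a toy table standing in for git-history's outage_version_detail view (schema simplified to just the three columns the query needs):

```python
import sqlite3

# Toy stand-in for outage_version_detail: one row per (commit, city)
# only when that city's count *changed* in that commit.
db = sqlite3.connect(":memory:")
db.execute(
    "create table outage_version_detail "
    "(_commit_at text, name text, customers_affected integer)"
)
db.executemany(
    "insert into outage_version_detail values (?, ?, ?)",
    [
        ("1pm", "seattle", 300),
        ("2pm", "seattle", 200), ("2pm", "tacoma", 150),
        ("3pm", "seattle", 0),   # tacoma unchanged at 150: no row recorded
        ("4pm", "seattle", 50),  ("4pm", "tacoma", 0),
    ],
)
# The index lets the "latest value at or before this commit" lookup
# below stop at its first matching row instead of aggregating.
db.execute("create index idx_city_time on outage_version_detail (name, _commit_at)")

rows = db.execute("""
    with timestamps as (select distinct _commit_at from outage_version_detail),
         cities as (select distinct name from outage_version_detail)
    select
      t._commit_at,
      sum(coalesce(
        -- most recent count for this city, at or before this commit
        (select customers_affected
         from outage_version_detail d
         where d.name = c.name and d._commit_at <= t._commit_at
         order by d._commit_at desc
         limit 1),
        0)) as total
    from timestamps t cross join cities c
    group by t._commit_at
    order by t._commit_at
""").fetchall()
# → [('1pm', 300), ('2pm', 350), ('3pm', 150), ('4pm', 50)]
```

Note the 3pm total: tacoma has no row at 3pm, but the limit-1 lookup carries its 2pm value of 150 forward. Whether this is actually faster on the real database is an open question; it still correlates a subquery per (commit, city) pair.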