Processing NGINX Logs With DuckDB; Observable Notebooks 2.0; Textplot
It is time, once again, to obsess over my fav database, which we'll do with three very different resource sections.
TL;DR
(This is an LLM/GPT-generated summary of today's Drop using SmolLM3-3B-8bit via MLX and a custom prompt.)
- Processing NGINX Logs With DuckDB (https://duckdb.org/docs/stable/guides/file_formats/read_file)
  This section describes parsing gzip'd nginx logs using DuckDB, leveraging regex to extract client IP, timestamp, full request, and other fields. It notes that ~1.3 million log lines were processed in ~2.4 seconds, with a focus on identifying bot traffic via user agent analysis.
- Observable Notebooks 2.0 (https://observablehq.com/notebook-kit/databases)
  This section highlights the improved database support in Observable Notebooks 2.0, including interactive SQL querying, local caching, and support for PostgreSQL, BigQuery, and more. It references an updated blog post demonstrating DuckDB integration.
- Textplot (https://query.farm/duckdb_extension_textplot.html)
  This section introduces the Textplot DuckDB Extension, enabling real-time ASCII/Unicode visualizations within SQL queries. It emphasizes the extension's utility for quick data exploration and lightweight dashboards without external charting libraries.
Processing NGINX Logs With DuckDB

I'm in the long and slow process of moving off of WordPress, first for my legacy blog and also for this newsletter. The old blog runs on an ancient VPS that desperately needs a migration, and the blog itself has not been reacting well (thanks to the puny amount of resources on the VPS) to the influx of our "AI" overlords.
I haven't poked super hard (until today) at the bot traffic (this is the only site I still have detailed logging on), and I keep threatening some infosec pals that I'd show how to process semi-structured logs with DuckDB, and this seemed like a perfect opportunity to get two red items off the ledger (yes, I did recently watch "The First Avenger").
I have ~50-ish gzip'd nginx logs that we're going to read in, parse, and have ready to start looking at bot encounters.
While DuckDB can just read text files, it's far easier to reconfigure the CSV reader to just return a single column that we can post-process:
FROM read_csv('logs/rud.is.access.log*',
              auto_detect=FALSE,
              header=FALSE,
              columns={'line':'VARCHAR'},
              delim='\t',
              strict_mode=FALSE)
LIMIT 5;
┌────────────────────────────────────────────────────────────────────────────────┐
│                                      line                                      │
│                                     varchar                                    │
├────────────────────────────────────────────────────────────────────────────────┤
│ 98.82.110.41 - - [05/Sep/2025:02:10:33 -0400] "GET /cisa-kev.rss HTTP/1.1"…    │
│ 47.128.31.167 - - [05/Sep/2025:02:10:56 -0400] "GET /rswitch/guide/index.h…    │
│ 47.128.31.167 - - [05/Sep/2025:02:10:56 -0400] "GET /rswitch/guide/site_li…    │
│ 47.128.31.167 - - [05/Sep/2025:02:10:56 -0400] "GET /rswitch/guide/site_li…    │
│ 47.128.31.167 - - [05/Sep/2025:02:10:56 -0400] "GET /rswitch/guide/site_li…    │
└────────────────────────────────────────────────────────────────────────────────┘
With log lines in hand, let's use some regex magic to break up the nginx fields and see what we have to work with:
CREATE OR REPLACE TABLE logs AS (
  FROM read_csv(
    'logs/rud.is.access.log*',
    auto_detect=FALSE,
    header=FALSE,
    columns={'line':'VARCHAR'},
    delim='\t',
    strict_mode=FALSE
  )
  SELECT
    -- Basic fields
    regexp_extract(line, '^([0-9\.]+)', 1) AS client_ip,
    regexp_extract(line, '\[(.*?)\]', 1) AS timestamp,
    -- Full request string (everything between first set of quotes)
    regexp_extract(line, '"([^"]+)"', 1) AS full_request,
    -- Parse components from full request
    regexp_extract(line, '"([A-Z]+)\s', 1) AS http_method,
    regexp_extract(line, '"[A-Z]+\s([^\s]+)\s', 1) AS request_path,
    regexp_extract(line, 'HTTP/([0-9\.]+)"', 1) AS http_version,
    -- Response details
    regexp_extract(line, '"\s([0-9]{3})\s', 1)::INT AS status_code,
    regexp_extract(line, '"\s[0-9]{3}\s([0-9]+)', 1)::BIGINT AS response_size_bytes,
    -- Referrer and User Agent
    regexp_extract(line, '[0-9]+\s"([^"]*)"', 1) AS referrer,
    regexp_extract(line, '"([^"]+)"$', 1) AS user_agent,
    -- Parse timestamp components (optional)
    regexp_extract(timestamp, '([0-9]+/[A-Za-z]+/[0-9]+)', 1) AS date,
    regexp_extract(timestamp, ':([0-9]{2}:[0-9]{2}:[0-9]{2})', 1) AS time,
    regexp_extract(timestamp, '([-+][0-9]+)$', 1) AS timezone,
    -- Categorize status codes
    CASE
      WHEN regexp_extract(line, '"\s([0-9]{3})\s', 1)::INT < 300 THEN 'Success'
      WHEN regexp_extract(line, '"\s([0-9]{3})\s', 1)::INT < 400 THEN 'Redirect'
      WHEN regexp_extract(line, '"\s([0-9]{3})\s', 1)::INT < 500 THEN 'Client Error'
      ELSE 'Server Error'
    END AS status_category
);
FROM logs SELECT COUNT(*) AS nrows;
┌────────────────┐
│     nrows      │
│     int64      │
├────────────────┤
│    1289059     │
│ (1.29 million) │
└────────────────┘
FROM (
  FROM logs
  SELECT
    strptime(date, '%d/%b/%Y')::DATE AS date
)
SELECT
  MIN(date) AS start_date,
  MAX(date) AS end_date;
┌────────────┬────────────┐
│ start_date │  end_date  │
│    date    │    date    │
├────────────┼────────────┤
│ 2025-07-15 │ 2025-09-05 │
└────────────┴────────────┘
So, ~1.3 million log lines spanning mid-July to this morning.
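A quick aside: the bracketed timestamp keeps its UTC offset, so DuckDB's strptime can also turn it into a proper TIMESTAMP WITH TIME ZONE if you ever need real date math; something along these lines should do it:
FROM logs
SELECT
  strptime(timestamp, '%d/%b/%Y:%H:%M:%S %z') AS ts
LIMIT 3;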
The hyperfine benchmarking tool says it took ~2.4s to process them all on my now-ancient M1 MacBook Pro.
Let's see how the parsing went:
duckdb --json -c "FROM logs LIMIT 1" nginx.ddb | jq
[
{
"client_ip": "98.83.163.27",
"timestamp": "04/Sep/2025:02:10:12 -0400",
"full_request": "GET /cisa-kev.rss HTTP/1.1",
"http_method": "GET",
"request_path": "/cisa-kev.rss",
"http_version": "1.1",
"status_code": 200,
"response_size_bytes": 100463,
"referrer": "-",
"user_agent": "Zapier",
"date": "04/Sep/2025",
"time": "02:10:12",
"timezone": "-0400",
"status_category": "Success"
}
]
We can definitely work with that!
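The status_category column also makes it easy to eyeball the response mix with a quick rollup along these lines:
FROM logs
SELECT
  status_category,
  COUNT(*) AS count
GROUP BY 1
ORDER BY 2 DESC;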
Halfway decent bots/crawlers/spiders/etc. generally let you know who they are in the User-Agent field, and most use some format like Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/) as a signature. Let's see how many we have to work with:
FROM logs
SELECT
  COUNT(DISTINCT user_agent) AS count
WHERE
  regexp_matches(lower(user_agent), 'bot|crawl|spider|slurp|crawler|fetch|robot|archive|index|search');
┌───────┐
│ count │
│ int64 │
├───────┤
│  2690 │
└───────┘
Ugh. That's a lot of unique signatures.
Thankfully, our "AI" overlords have trained their proprietary and open weights models on lots of internet data. Said data included a decent percentage of these strings, many of which are paired with colloquial/short names for the service.
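The naming script below expects the distinct agent strings in a user_agents.txt file (one per line); a COPY along these lines will dump them out (tweak the options to taste):
COPY (
  SELECT DISTINCT user_agent
  FROM logs
  WHERE regexp_matches(lower(user_agent), 'bot|crawl|spider|slurp|crawler|fetch|robot|archive|index|search')
) TO 'user_agents.txt' (FORMAT CSV, HEADER FALSE);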
I used Ollama and llama3.2 in a Bash/curl script that I made (and had Claude pretty up) to assign short names to the ugly signature strings:
#!/bin/bash
# Configuration
USER_AGENTS_FILE="user_agents.txt"
OLLAMA_URL="http://localhost:11434/api/generate"
MODEL="llama3.2:latest"
MAX_LINES=$(wc -l < "${USER_AGENTS_FILE}" | xargs)
OUTPUT_FILE="bot_names_results.txt"
# ANSI color codes
RED='\033[0;31m'
GREEN='\033[0;32m'
BLUE='\033[0;34m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
# Check if user_agents.txt exists
if [ ! -f "$USER_AGENTS_FILE" ]; then
echo -e "${RED}Error: $USER_AGENTS_FILE not found!${NC}"
exit 1
fi
# Function to call Ollama API
generate_bot_name() {
local user_agent="$1"
# Craft an improved prompt with examples and context
local prompt="You are a bot naming expert. Generate a SHORT, memorable name (1-3 words) for this web crawler/bot based on its user agent string.
Rules:
- Keep it simple and descriptive
- Use the actual bot/product name if clearly visible
- Make it memorable and easy to pronounce
- Avoid generic terms like 'Web Crawler' or 'Bot'
- Focus on the brand/service name when available
Examples of good names:
- For Googlebot user agent → 'Google Bot'
- For AhrefsBot user agent → 'Ahrefs Spider'
- For Slackbot user agent → 'Slack Bot'
- For meta-externalagent → 'Meta Agent'
- For MJ12bot → 'MJ12 Bot'
- For ClaudeBot → 'Claude Bot'
- For PetalBot → 'Petal Bot'
User Agent: $user_agent
Generate ONLY the bot name (nothing else):"
# Create JSON payload (Ollama reads sampling params like temperature from the "options" object)
local json_payload=$(jq -n \
--arg model "$MODEL" \
--arg prompt "$prompt" \
'{model: $model, prompt: $prompt, stream: false, options: {temperature: 0.3}}')
# Call Ollama API with timeout
# (declare and assign separately so the $? check below sees timeout/curl, not `local`)
local response
response=$(timeout 10 curl -s -X POST "$OLLAMA_URL" \
-H "Content-Type: application/json" \
-d "$json_payload")
# Check if curl timed out
if [ $? -eq 124 ]; then
echo "Request Timeout"
return
fi
# Extract the response text and clean it
local bot_name=$(echo "$response" | jq -r '.response' 2>/dev/null | \
tr -d '\n' | \
sed 's/^[[:space:]]*//;s/[[:space:]]*$//' | \
sed 's/^"//;s/"$//' | \
head -n 1)
# If response is empty or too long, return error
if [ -z "$bot_name" ] || [ ${#bot_name} -gt 50 ]; then
echo "Generation Failed"
else
echo "$bot_name"
fi
}
# Main script
echo -e "${BLUE}โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ${NC}"
echo -e "${GREEN}Bot Name Generator - Processing first $MAX_LINES user agents${NC}"
echo -e "${BLUE}โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ${NC}"
echo
# Initialize output file with header
{
echo "Bot Name Generation Results"
echo "Generated on: $(date)"
echo "Model: $MODEL"
echo "========================================="
echo
} > "$OUTPUT_FILE"
# Counters
count=0
successful=0
failed=0
# Store results for summary
declare -a results
declare -a bot_names_only
# Read file line by line
while IFS= read -r line && [ $count -lt $MAX_LINES ]; do
# Skip empty lines
if [ -z "$line" ]; then
continue
fi
# Increment counter
((count++))
echo -e "${YELLOW}[$count/$MAX_LINES]${NC} Processing..."
# Clean and truncate user agent for display
clean_line=$(echo "$line" | tr -d '"')
if [ ${#clean_line} -gt 70 ]; then
display_line="${clean_line:0:67}..."
else
display_line="$clean_line"
fi
echo -e " User Agent: ${BLUE}$display_line${NC}"
# Generate bot name
bot_name=$(generate_bot_name "$line")
# Check if generation was successful
if [[ "$bot_name" == "Generation Failed" ]] || [[ "$bot_name" == "Request Timeout" ]]; then
echo -e " Generated: ${RED}$bot_name${NC}"
((failed++))
results+=("FAILED|$display_line")
else
echo -e " Generated: ${GREEN}$bot_name${NC}"
((successful++))
results+=("$bot_name|$display_line")
bot_names_only+=("$bot_name")
# Save to file
{
echo "Entry #$count"
echo "User Agent: $line"
echo "Generated Name: $bot_name"
echo "-----------------------------------------"
echo
} >> "$OUTPUT_FILE"
fi
echo
# Small delay to avoid overwhelming the API
sleep 0.3
done < "$USER_AGENTS_FILE"
# Print summary
echo -e "${BLUE}โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ${NC}"
echo -e "${GREEN}Summary Report${NC}"
echo -e "${BLUE}โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ${NC}"
echo -e "Total Processed: ${YELLOW}$count${NC}"
echo -e "Successful: ${GREEN}$successful${NC}"
echo -e "Failed: ${RED}$failed${NC}"
if [ $count -gt 0 ]; then
echo -e "Success Rate: ${GREEN}$(( successful * 100 / count ))%${NC}"
fi
echo
echo -e "${BLUE}Results saved to: ${YELLOW}$OUTPUT_FILE${NC}"
echo
# Display all results in a table format
echo -e "${BLUE}โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ${NC}"
echo -e "${GREEN}Generated Names Summary${NC}"
echo -e "${BLUE}โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ${NC}"
echo
i=1
for result in "${results[@]}"; do
IFS='|' read -r name agent <<< "$result"
if [[ "$name" == "FAILED" ]]; then
printf "%2d. ${RED}%-25s${NC} <- %s\n" "$i" "[Failed]" "${agent:0:50}"
else
printf "%2d. ${GREEN}%-25s${NC} <- %s\n" "$i" "$name" "${agent:0:50}"
fi
((i++))
done
echo
echo -e "${BLUE}โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ${NC}"
# Save just the bot names to a separate file for easy use
if [ ${#bot_names_only[@]} -gt 0 ]; then
echo
echo -e "${YELLOW}Saving bot names list to: bot_names_only.txt${NC}"
printf "%s\n" "${bot_names_only[@]}" > bot_names_only.txt
fi
echo
echo -e "${GREEN}Script complete!${NC}"
echo -e "${BLUE}โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ${NC}"
(These things sure do like color and emojis.)
I did some manual post-processing to combine some names together, then used those mappings (loaded into a ua_map lookup table) to see what the top 30 bots hitting the site have been since mid-July.
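The real mapping table is in the gists for this Drop; loading it is just another read_csv call (the filename and column names below are placeholders for whatever you saved the mapping as):
CREATE OR REPLACE TABLE ua_map AS (
  FROM read_csv(
    'ua_map.csv',
    header=TRUE,
    columns={'user_agent':'VARCHAR', 'bot_name':'VARCHAR'}
  )
);
With ua_map in place, the top-30 query is: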
FROM
  logs l,
  ua_map u
SELECT
  u.bot_name AS bot_name,
  COUNT(*) AS count
WHERE
  l.user_agent = u.user_agent
GROUP BY 1
ORDER BY 2 DESC
LIMIT 30;
┌────────────┬───────┐
│  bot_name  │ count │
│  varchar   │ int64 │
├────────────┼───────┤
│ MJ12       │ 37710 │
│ OpenAI     │ 33956 │
│ Petal      │ 30720 │
│ ThinkBot   │ 27538 │
│ Slack      │ 22008 │
│ Anthropic  │ 20682 │
│ Bing       │ 16086 │
│ ByteDance  │ 15752 │
│ Mau        │ 14506 │
│ Semrush    │ 14405 │
│ Barkrowler │ 14174 │
│ Amazon     │ 13979 │
│ Ahrefs     │ 11966 │
│ Google     │ 11215 │
│ Apple      │  9602 │
│ BLEX       │  9214 │
│ StartMe    │  7477 │
│ Serpstat   │  6007 │
│ Ready      │  5704 │
│ Seekport   │  5645 │
│ Meta Agent │  5379 │
│ InoReader  │  5334 │
│ Dot        │  5242 │
│ Scour RSS  │  4821 │
│ SuperFeedr │  4320 │
│ Mastodon   │  3697 │
│ Feedly     │  2563 │
│ Inoreader  │  2083 │
│ DataForSeo │  2068 │
│ Yandex     │  2059 │
├────────────┴───────┤
│ 30 rows  2 columns │
└────────────────────┘
From that list, the ones I know for sure are AI-related or have malicious intent are:
- OpenAI: 33,956 (~1K/day)
- ThinkBot: 27,538 (~800/day)
- Anthropic: 20,682 (~600/day)
While those are not giant numbers, those 600-1,000 requests a day do not arrive in a nice, evenly spaced manner, and they put just enough load on the box to make the WordPress install stutter for 1-5 seconds (which I notice because I do monitor everything).
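If you want to see how those hits are spread out, the same ua_map join breaks the counts out by day (swap in whichever bot names you care about):
FROM
  logs l,
  ua_map u
SELECT
  u.bot_name AS bot_name,
  strptime(l.date, '%d/%b/%Y')::DATE AS day,
  COUNT(*) AS count
WHERE
  l.user_agent = u.user_agent
  AND u.bot_name IN ('OpenAI', 'ThinkBot', 'Anthropic')
GROUP BY 1, 2
ORDER BY 1, 2;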
I block ThinkBot at the nginx level and am considering doing the same for a few more of the user agents, since I'm kind of "done" with commercial, generative "AI" companies.
I put both the user-agent short-namer script and the user-agent mapping table into the gists for these Drops, and hope the not-too-deep look at using DuckDB to work with logs was helpful to a few folks.
Observable Notebooks 2.0 (& More DBs)

Database support was a bit sketch in the inaugural release of Observable Notebooks 2.0, but the team sure made up for that in the most recent update.
Their database connectors support a wide array of sources, and bridge the gap between your notebooks (which run in browsers) and databases (which typically donโt expose HTTP APIs). They let you write SQL directly in notebook cells and see results instantly, without manually exporting data files.
Like their predecessor ObservableHQ notebooks, you can query databases interactively with dynamic parameters from UI elements like sliders or dropdowns. This dramatically speeds up exploratory data analysis since you don't need to predict what data you'll need upfront.
Unlike Observable 1.0 notebooks, query results are automatically saved locally in .observable/cache. This means faster loading, consistent data across visitors, and the ability to share notebooks without database access. Everything runs locally on your machine: your data and credentials never go through Observable's servers, and are never exposed through the notebook "apps".
Supported databases include:
- DuckDB (great for analytics, works out of the box)
- SQLite (lightweight, file-based)
- PostgreSQL (and compatible databases like ClickHouse, Redshift)
- Snowflake (cloud data warehouse)
- Google BigQuery (Google's analytics database)
- Databricks (unified analytics platform)
Database settings go in .observable/databases.json. For example:
{
  "mydb": {
    "type": "postgres",
    "host": "localhost",
    "database": "analytics",
    "username": "analyst"
  }
}
MAKE SURE to add .observable/ to your .gitignore to avoid committing credentials.
I've updated [my kicking the tyres post notebook](https://codeberg.org/hrbrmstr/tyres/src/branch/batman/docs/index.html) to show how to work with DuckDB, and also updated my handy standalone notebooks creator NPM script to make it straightforward to kickstart a new DB-powered local notebook.
I'm really looking forward to the incremental improvements the Observable team are clearly planning to release, and am excited to figure out more ways to use and embed v2.0 notebooks into EDA and other workflows.
Textplot

Full Disclosure: I have not tried this extension out, yet, but it looks super cool!
The Textplot DuckDB Extension makes it possible to add pretty fun (and useful) ASCII/Unicode data visualizations directly into your SQL queries.
Instead of exporting data to create charts elsewhere, you can now generate visual representations right in your DuckDB console session. This is super handy for quick data exploration, monitoring, and even creating lightweight dashboards without heavy charting libraries.
The site has scads of detailed examples with code and output, so there's no need for me to add any here (especially since I haven't tried working with it yet).
If you do get to play with it, drop me a note to let me know your experience with it.
FIN
Remember, you can follow and interact with the full text of The Daily Drop's free posts on:
- Mastodon via @dailydrop.hrbrmstr.dev@dailydrop.hrbrmstr.dev
- Bluesky via https://bsky.app/profile/dailydrop.hrbrmstr.dev.web.brid.gy