blog

Running Apache Spark (PySpark) jobs locally

October 19, 2020 .

Painless first-party event collection

August 15, 2020 .

Subtle data loss in Go: printed timestamps are not ordered

July 21, 2020 .

Performance: Using arrays instead of tiny maps in Go

July 20, 2020 .

Intel NUC for the family - journey to non-portable hardware

July 12, 2020 .

Accessing S3 data with Apache Spark from stock PySpark

June 22, 2020 .

High Performance Data Loss

June 12, 2020 .

Assorted tech links #4

May 27, 2020 .

Waiting for a column store in Postgres

May 23, 2020 .

Abandoning Apache Airflow

April 30, 2020 .

Nine years with a MacBook Air

March 18, 2020 .

Byte slices with offsets for read-only slices of strings

February 15, 2020 .

Assorted tech links #3

November 9, 2019 .

Assorted tech links #2

July 20, 2019 .

How to lose data in Apache Spark

July 9, 2019 .

Busting APIs to win Lego Hulkbusters

May 31, 2019 .

Assorted tech links #1

May 8, 2019 .

Unrolling regular expressions

March 31, 2019 .

Benchmarking deserialising of ints from byte arrays

March 19, 2019 .

These DataFrames are made for reading

March 4, 2019 .

GitHub Issues History

December 17, 2018 .

Merging Python iterables using sort merge join

November 25, 2018 .

Big-O doesn't mean performance cannot differ greatly

October 9, 2018 .

Streaming S3 objects in Python

July 26, 2018 .

Sane CSV processing in Apache Spark

May 19, 2018 .

Beyond Testing

May 15, 2018 .

Reading about Hadoop (yes, get the Definitive Guide)

April 2, 2018 .

Downloading Facebook API data in Python with no dependencies

February 4, 2018 .

Wikidata: Reading large streams with a tiny memory footprint

November 12, 2017 .

Here we go again

November 12, 2017 .