# Running Apache Spark (PySpark) jobs locally

2020-10-19

Apache Spark is not among the most lightweight of solutions, so it's only natural that there is a whole host of hosted solutions: AWS EMR, SageMaker, Glue, Databricks etc. While these services abstract out a lot of the moving parts, they introduce a rather clunky workflow with a slow feedback loop. In short, it's not quite like developing locally, so I want to talk about enabling that.
Now, as long as you have Java and Python installed, you're almost all set. At this point you can `python3 -m pip install pyspark`, be it in a virtual env or globally (the former is usually preferred, but it's up to your setup). Spark is neatly packaged with all the JARs included in the Python package, so there's nothing else to install.

When working in Spark, you get a global variable `spark` that enables you to read in data, create dataframes etc. This variable is usually conjured out of thin air in hosted solutions, so we'll have to create it ourselves.

Let's create a `pipeline.py` with just the following:
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">SparkSession</span>
<span class="n">spark</span> <span class="o">=</span> <span class="n">SparkSession</span><span class="p">.</span><span class="n">builder</span><span class="p">.</span><span class="n">getOrCreate</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="n">spark</span><span class="p">.</span><span class="n">sql</span><span class="p">(</span><span class="s">'select 1'</span><span class="p">).</span><span class="n">collect</span><span class="p">())</span>
</code></pre></div></div>
The session alone would suffice, but the SQL statement is there just to test things out. Running `python3 pipeline.py` should launch a Spark session, run the simple query and exit (you could equally run `spark-submit`, but it's not necessary here). If you want to investigate the environment some more, you can run `python3 -i pipeline.py`, which is Python's way of saying "run this file and drop me in a Python interpreter with the state this run results in". Super helpful for debugging.
<h2 id="taking-things-further">Taking things further</h2>
Since we have our Spark session, we can do our typical `df = spark.read.csv('our_data')`, but the data itself may be an issue. At this point, a lot of the data we interact with is within our shared infrastructure, be it HDFS or an object store like S3.
If we point our Spark instance at these sources, we will face a few issues.

1. We need libraries to access these filesystems, which makes our lean operation less lean.
2. There will be tremendous latency as well as high costs for out-of-network transfers (aka egress fees).
3. The source data are likely too big for local experimentation, so any quick development will be hampered by slow processing.
We can solve all these issues with simple sampling - just copy a few of your source files to your local disk (ideally as many files as your CPUs, to enable I/O parallelism). This will lead to quick local testing with no network involved (so you can even develop offline, if that's your thing) and just enough data for your laptop to easily handle.
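If your source data lives on S3, the copying can be a short script. Here's a minimal sketch using boto3 (assuming credentials are already configured) - the bucket name, prefix and sample size are hypothetical placeholders:

```python
# Sample a dataset by downloading its first few files from S3.
# 'our-data-lake', the prefix and MaxKeys are made up - adjust them
# to your own layout (roughly one file per CPU core works well).
import os

import boto3

s3 = boto3.client('s3')
listing = s3.list_objects_v2(Bucket='our-data-lake', Prefix='sales_raw/', MaxKeys=8)

os.makedirs('sales_raw', exist_ok=True)
for obj in listing.get('Contents', []):
    target = os.path.join('sales_raw', os.path.basename(obj['Key']))
    s3.download_file('our-data-lake', obj['Key'], target)
    print('downloaded', obj['Key'])
```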
Here's a more complete example of the pipeline itself:
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">SparkSession</span>
<span class="kn">import</span> <span class="nn">pyspark.sql.functions</span> <span class="k">as</span> <span class="n">f</span>
<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">'__main__'</span><span class="p">:</span>
<span class="n">spark</span> <span class="o">=</span> <span class="n">SparkSession</span><span class="p">.</span><span class="n">builder</span><span class="p">.</span><span class="n">getOrCreate</span><span class="p">()</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="p">.</span><span class="n">read</span><span class="p">.</span><span class="n">csv</span><span class="p">(</span><span class="s">'sales_raw'</span><span class="p">)</span>
<span class="n">df</span><span class="p">.</span><span class="nb">filter</span><span class="p">(</span><span class="n">f</span><span class="p">.</span><span class="n">col</span><span class="p">(</span><span class="s">'price'</span><span class="p">)</span> <span class="o">></span> <span class="mi">1000</span><span class="p">).</span><span class="n">groupby</span><span class="p">(</span><span class="s">'region'</span><span class="p">).</span><span class="n">count</span><span class="p">().</span><span class="n">show</span><span class="p">()</span>
<span class="n">df</span><span class="p">.</span><span class="n">write</span><span class="p">.</span><span class="n">partitionBy</span><span class="p">(</span><span class="s">'region'</span><span class="p">,</span> <span class="s">'brand'</span><span class="p">).</span><span class="n">parquet</span><span class="p">(</span><span class="s">'sales_processed'</span><span class="p">)</span>
</code></pre></div></div>
A minimal Docker image for PySpark looks like this - it's just a generic example, so make sure to pin both the base image and your dependencies.
<div class="language-Dockerfile highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">FROM</span><span class="s"> python:3.8-slim</span>
<span class="c"># the mkdir call is there because of an issue in python's slim base images</span>
<span class="c"># see https://github.com/debuerreotype/docker-debian-artifacts/issues/24</span>
<span class="k">RUN </span><span class="nb">mkdir</span> <span class="nt">-p</span> /usr/share/man/man1 <span class="o">&&</span> apt-get <span class="nt">-y</span> update <span class="o">&&</span> apt-get <span class="nt">-y</span> <span class="nt">--no-install-recommends</span> <span class="nb">install </span>default-jre
<span class="c"># you'll likely use requirements.txt or some other dependency management</span>
<span class="k">RUN </span>python3 <span class="nt">-m</span> pip <span class="nb">install </span>pyspark
</code></pre></div></div>
After building this (`docker build -t imgname .`), you can test things out - just make sure to override the entrypoint, as the one inherited from the base image is a Python interpreter. So drop into the shell via `docker run --rm -it imgname bash` and off you go.

You're now ready to run a local, reproducible, quickly iterated workflow involving a distributed processing system. Isn't that nice?
Happy Sparking!

# Painless first party event collection

2020-08-15

Whenever users interact with your sites, you almost automatically find out certain information. What pages they loaded, what assets they requested, what page they got to your site from etc. But sometimes you need more granular information, some that cannot be read from your server logs. This is usually what Google Analytics will give you. While Google gives you some aggregated statistics, you may want to dig a little deeper and perhaps roll your own solution.
Before we go any further, I want to clarify that this is not about *tracking* - the solution offered here doesn't really take into account any user identification and does not track anyone across sites. This is to find out if your users actually read your articles, if they open modals, how long it takes them to discover certain features, if they click on things that are not clickable, for debugging purposes etc.
We used to run an off-the-shelf commercial product, think Heap or Mixpanel. We eventually decided to migrate away for the following reasons:
1. Warehouse duplication - we were paying for data being stored in an alternative data warehouse together with its own visualisation. But we had our own warehouse and BI tools, so it didn't make sense to run this, do ETL, validation, fix pipelines whenever the format changed etc.
2. Blocking - the solution used a vendor's piece of JavaScript, so it was blocked by a non-trivial number of our customers. We ran into this issue even within the company, when people complained they couldn't see their own actions in the system.
3. Non-linear scaling - the solution wasn't that expensive, but if we wanted to triple our event count, we suddenly got a 10x bill. So we had to work around this by deleting a lot of events, which made the whole thing counterproductive.
4. Ownership of data and schema - this wasn't a terribly sensitive dataset, but it did make sense to own it, not just for its value, but also for the stability of the schema and data model.
<h2 id="rolling-your-own-solution">Rolling your own solution</h2>
There are just a few parts to this:

1. Frontend data collection
2. Backend data persistence
3. Visualisation and analytics
We won't cover the third part - that's a whole can of worms and companies usually have some solution that works for them (e.g. Tableau, Looker).

First let's build a simple frontend site. This won't really do anything fancy; it will just report any click on any button on the site. You can attach these events to pretty much anything, including mouse moves.
<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp"><!DOCTYPE html></span>
<span class="nt"><html></span>
<span class="nt"><head><title></span>My Site<span class="nt"></title></head></span>
<span class="nt"><body></span>
<span class="nt"><button></span>Click me!<span class="nt"></button></span>
<span class="nt"><button></span>Click me as well!<span class="nt"></button></span>
<span class="nt"><script </span><span class="na">type=</span><span class="s">'text/javascript'</span><span class="nt">></span>
<span class="c1">// common functionality across all event logs</span>
<span class="k">async</span> <span class="kd">function</span> <span class="nx">sendEvent</span><span class="p">(</span><span class="nx">eventName</span><span class="p">,</span> <span class="nx">properties</span><span class="p">)</span> <span class="p">{</span>
<span class="kd">const</span> <span class="nx">event</span> <span class="o">=</span> <span class="p">{</span>
<span class="na">event</span><span class="p">:</span> <span class="nx">eventName</span><span class="p">,</span>
<span class="na">url</span><span class="p">:</span> <span class="nb">window</span><span class="p">.</span><span class="nx">location</span><span class="p">.</span><span class="nx">href</span><span class="p">,</span>
<span class="na">properties</span><span class="p">:</span> <span class="nx">properties</span><span class="p">,</span>
<span class="p">}</span>
<span class="k">return</span> <span class="nx">fetch</span><span class="p">(</span><span class="dl">'</span><span class="s1">/events</span><span class="dl">'</span><span class="p">,</span> <span class="p">{</span>
<span class="na">method</span><span class="p">:</span> <span class="dl">'</span><span class="s1">POST</span><span class="dl">'</span><span class="p">,</span>
<span class="na">headers</span><span class="p">:</span> <span class="p">{</span><span class="dl">'</span><span class="s1">Content-Type</span><span class="dl">'</span><span class="p">:</span> <span class="dl">'</span><span class="s1">application/json</span><span class="dl">'</span><span class="p">},</span>
<span class="na">body</span><span class="p">:</span> <span class="nx">JSON</span><span class="p">.</span><span class="nx">stringify</span><span class="p">(</span><span class="nx">event</span><span class="p">),</span>
<span class="p">})</span>
<span class="p">}</span>
<span class="c1">// now let's register all event calls</span>
<span class="kd">const</span> <span class="nx">buttons</span> <span class="o">=</span> <span class="nb">document</span><span class="p">.</span><span class="nx">getElementsByTagName</span><span class="p">(</span><span class="dl">'</span><span class="s1">button</span><span class="dl">'</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="kd">let</span> <span class="nx">button</span> <span class="k">of</span> <span class="nx">buttons</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">button</span><span class="p">.</span><span class="nx">addEventListener</span><span class="p">(</span><span class="dl">'</span><span class="s1">click</span><span class="dl">'</span><span class="p">,</span> <span class="p">(</span><span class="nx">e</span><span class="p">)</span> <span class="o">=></span> <span class="p">{</span>
<span class="kd">const</span> <span class="nx">properties</span> <span class="o">=</span> <span class="p">{</span>
<span class="dl">'</span><span class="s1">button_text</span><span class="dl">'</span><span class="p">:</span> <span class="nx">e</span><span class="p">.</span><span class="nx">target</span><span class="p">.</span><span class="nx">textContent</span><span class="p">,</span>
<span class="p">}</span>
<span class="nx">sendEvent</span><span class="p">(</span><span class="dl">'</span><span class="s1">clicked_event</span><span class="dl">'</span><span class="p">,</span> <span class="nx">properties</span><span class="p">);</span>
<span class="p">})</span>
<span class="p">}</span>
<span class="nt"></script></span>
<span class="nt"></body></span>
<span class="nt"></html></span>
</code></pre></div></div>
This will send a POST request to your backend, so let's deal with that. Here's the poorest man's Node.js server that handles these incoming events.
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">const</span> <span class="nx">http</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="dl">'</span><span class="s1">http</span><span class="dl">'</span><span class="p">);</span>
<span class="kd">const</span> <span class="nx">fs</span> <span class="o">=</span> <span class="nx">require</span><span class="p">(</span><span class="dl">'</span><span class="s1">fs</span><span class="dl">'</span><span class="p">);</span>
<span class="kd">const</span> <span class="nx">listener</span> <span class="o">=</span> <span class="kd">function</span> <span class="p">(</span><span class="nx">req</span><span class="p">,</span> <span class="nx">res</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// serve the frontend</span>
<span class="k">if</span> <span class="p">(</span><span class="nx">req</span><span class="p">.</span><span class="nx">url</span> <span class="o">===</span> <span class="dl">'</span><span class="s1">/</span><span class="dl">'</span> <span class="o">&&</span> <span class="nx">req</span><span class="p">.</span><span class="nx">method</span> <span class="o">===</span> <span class="dl">'</span><span class="s1">GET</span><span class="dl">'</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">fs</span><span class="p">.</span><span class="nx">readFile</span><span class="p">(</span><span class="dl">'</span><span class="s1">index.html</span><span class="dl">'</span><span class="p">,</span> <span class="kd">function</span> <span class="p">(</span><span class="nx">err</span><span class="p">,</span> <span class="nx">data</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="nx">err</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">res</span><span class="p">.</span><span class="nx">writeHead</span><span class="p">(</span><span class="mi">404</span><span class="p">);</span>
<span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
<span class="nx">res</span><span class="p">.</span><span class="nx">writeHead</span><span class="p">(</span><span class="mi">200</span><span class="p">,</span> <span class="p">{</span><span class="dl">'</span><span class="s1">Content-Type</span><span class="dl">'</span><span class="p">:</span> <span class="dl">'</span><span class="s1">text/html</span><span class="dl">'</span><span class="p">});</span>
<span class="nx">res</span><span class="p">.</span><span class="nx">end</span><span class="p">(</span><span class="nx">data</span><span class="p">,</span> <span class="dl">'</span><span class="s1">utf-8</span><span class="dl">'</span><span class="p">)</span>
<span class="p">});</span>
<span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// fail on anything but the frontend and event handling</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="p">(</span><span class="nx">req</span><span class="p">.</span><span class="nx">url</span> <span class="o">===</span> <span class="dl">'</span><span class="s1">/events</span><span class="dl">'</span> <span class="o">&&</span> <span class="nx">req</span><span class="p">.</span><span class="nx">method</span> <span class="o">===</span> <span class="dl">'</span><span class="s1">POST</span><span class="dl">'</span><span class="p">))</span> <span class="p">{</span>
<span class="nx">res</span><span class="p">.</span><span class="nx">writeHead</span><span class="p">(</span><span class="mi">400</span><span class="p">);</span>
<span class="nx">res</span><span class="p">.</span><span class="nx">end</span><span class="p">();</span>
<span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// accumulate data and parse it once it all comes in</span>
<span class="kd">let</span> <span class="nx">data</span> <span class="o">=</span> <span class="p">[]</span>
<span class="nx">req</span><span class="p">.</span><span class="nx">on</span><span class="p">(</span><span class="dl">'</span><span class="s1">data</span><span class="dl">'</span><span class="p">,</span> <span class="nx">chunk</span> <span class="o">=></span> <span class="p">{</span>
<span class="nx">data</span><span class="p">.</span><span class="nx">push</span><span class="p">(</span><span class="nx">chunk</span><span class="p">)</span>
<span class="p">})</span>
<span class="nx">req</span><span class="p">.</span><span class="nx">on</span><span class="p">(</span><span class="dl">'</span><span class="s1">end</span><span class="dl">'</span><span class="p">,</span> <span class="p">()</span> <span class="o">=></span> <span class="p">{</span>
<span class="kd">const</span> <span class="nx">event</span> <span class="o">=</span> <span class="nx">JSON</span><span class="p">.</span><span class="nx">parse</span><span class="p">(</span><span class="nx">data</span><span class="p">);</span>
<span class="nx">event</span><span class="p">.</span><span class="nx">server_time</span> <span class="o">=</span> <span class="nb">Date</span><span class="p">.</span><span class="nx">now</span><span class="p">();</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="nx">event</span><span class="p">);</span>
<span class="c1">// send the event to your backend</span>
<span class="p">})</span>
<span class="nx">res</span><span class="p">.</span><span class="nx">writeHead</span><span class="p">(</span><span class="mi">202</span><span class="p">);</span>
<span class="nx">res</span><span class="p">.</span><span class="nx">end</span><span class="p">();</span>
<span class="p">}</span>
<span class="kd">const</span> <span class="nx">server</span> <span class="o">=</span> <span class="nx">http</span><span class="p">.</span><span class="nx">createServer</span><span class="p">(</span><span class="nx">listener</span><span class="p">);</span>
<span class="nx">server</span><span class="p">.</span><span class="nx">listen</span><span class="p">(</span><span class="mi">8080</span><span class="p">);</span>
</code></pre></div></div>
As you can see, the server just accepts the JSON and adds a timestamp. This is to ensure that we have a credible source of *event time*. This is a term in stream processing and while we don't strictly adhere to the definition - the event happened in the browser, not on the backend - we cannot get a reliable timestamp from the frontend, so this will have to do. Google "event time vs. processing time" to find out more about this.

The only thing we didn't implement in this server was the actual event handling. This really depends on what existing infrastructure you already have and also what kind of traffic you expect.
<h2 id="handling-event-processing">Handling event processing</h2>
For simple sites with not much traffic, I'd just insert each event into a relational database system (Postgres, MySQL etc.) - a sketch follows below. Modern RDBMS can handle JSON data just fine; you can then have some simple ETL process to extract some of the bits in the events into relational objects, but all in all, the pipeline will remain quite simple.
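Here's what that path might look like with Postgres and psycopg2 - a minimal sketch, assuming a pre-created table with a JSONB column; the DSN, table name and sample event are all illustrative:

```python
# Persist incoming events in Postgres; querying them later is plain SQL.
# Assumes: create table events (received_at timestamptz default now(), payload jsonb);
import psycopg2
from psycopg2.extras import Json

event = {'event': 'clicked_event', 'url': '/', 'properties': {'button_text': 'Click me!'}}

conn = psycopg2.connect('dbname=analytics')
with conn, conn.cursor() as cur:
    cur.execute('insert into events (payload) values (%s)', (Json(event),))
```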
If you anticipate more traffic and/or you want to decouple your application from your database, there are a few alternatives. My favourite, and the one we used in production, was that you simply send the event to a message queue - not Kafka or Pulsar, plain AMQP like RabbitMQ or ActiveMQ. These can handle most workloads and they are simple, battle tested and, more importantly, already common within infrastructures.

Now it's completely up to you to handle this stream. We had three independent processes:
1. Collect the data and save it to S3 (as a `.json.gz`) once we reach `n` events or `m` seconds, whichever comes first (sketched below). This way we don't depend on the stream to be the durable part of the architecture; data are safely stored in our blob storage.
2. From time to time, go and compact those S3 blobs into larger files. It's not the best idea to have a bunch of small files on S3, so we combined them into larger (hourly) chunks.
3. Load some of the S3 data into a database, so that we can readily query it.
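Here's a minimal sketch of the first process - batching events off the queue and flushing them to S3. It assumes pika for RabbitMQ and boto3 for S3; the queue name, bucket and thresholds are hypothetical placeholders:

```python
# Batch events from an AMQP queue and flush them to S3 as .json.gz files
# once we reach BATCH_SIZE events or BATCH_SECONDS seconds.
import gzip
import json
import time
import uuid

import boto3
import pika

BATCH_SIZE = 1000    # flush after this many events...
BATCH_SECONDS = 60   # ...or after this many seconds, whichever comes first

s3 = boto3.client('s3')
events = []
started = time.monotonic()


def flush():
    global events, started
    lines = '\n'.join(json.dumps(e) for e in events)
    key = f'events/{time.strftime("%Y/%m/%d/%H")}/{uuid.uuid4()}.json.gz'
    s3.put_object(Bucket='my-event-bucket', Key=key, Body=gzip.compress(lines.encode()))
    events = []
    started = time.monotonic()


def handle(channel, method, properties, body):
    events.append(json.loads(body))
    if len(events) >= BATCH_SIZE or time.monotonic() - started > BATCH_SECONDS:
        flush()
        # ack everything up to this message only once it's safely in S3 -
        # the queue is not our durable storage, the blob store is
        channel.basic_ack(delivery_tag=method.delivery_tag, multiple=True)


connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.basic_consume(queue='events', on_message_callback=handle)
channel.start_consuming()
```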
This was a nicely working pipeline - we leveraged existing technologies across the whole stack, there was minuscule utilisation of resources (< 1%), and we could easily scale this 100x without worrying about a single part of the pipeline.

The collection endpoint could be easily load balanced, the message queue had enough capacity, S3 scales beyond what we'd ever need, and the database could be swapped for Presto, Drill, Athena, BigQuery or anything of this sort - we already have the data in blob storage, so it's readily available for these big data processing tools.
<h2 id="conclusion">Conclusion</h2>
This is not the only solution, obviously. A common alternative is to collect your data in the frontend using "your" library (it's bundled within your JavaScript, but it's really your vendor's code), send it to your backend and then log it to your vendor's APIs from there, bypassing the frontend blockers.

This only solves one of the problems I outlined, which is why we went another route. But your mileage may vary! Also, I omitted a number of things - error handling, uuid generation, validation, … - this only serves to illustrate the principles of our solution.
There are a number of things I like about this solution:

1. It's super simple in terms of architecture.
2. It's composed, not inherited, so parts can easily be swapped, added, removed.
3. It scales way beyond what we need.
I hope you find value in this.

# Subtle data loss in Go: printed timestamps are not ordered

2020-07-21

When I wrote about [high performance data loss](https://kokes.github.io/blog/2020/06/12/high-performance-data-loss.html), many of the scenarios were fairly obvious. You don't adhere to a standard, you get bitten. Here's another example of non-adherence, which can bite you in rather unexpected ways.
Keeping time in computers is extremely tricky - it's riddled with exceptions, changes and corner cases, and it's just hard to implement correctly. But when done right, you are rewarded - e.g. when you print time as strings, especially if it's ISO 8601 or RFC 3339, you get a really nice benefit: [these strings are ordered](https://tools.ietf.org/html/rfc3339#section-5.1). That way, you know that `2020-07-21` comes after `2020-02-13` or `1970-01-05`, and you know this without parsing these strings.
Why is that important? Two reasons.

1. Parsing time and date is expensive. It might not seem that way, but try parsing thousands or millions of dates and you'll see it.
2. In some scenarios parsing is not suitable - e.g. file listing - many filesystems or object stores give you files in sorted order and if you prepend each filename with a date, you can ensure they are sorted by time as well.
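You can see the guarantee in a couple of lines - Python here for brevity, but the point is language-agnostic:

```python
# Lexicographic sort equals chronological sort for fixed-width RFC 3339 strings.
stamps = [
    "2020-07-21T08:21:00.629280Z",
    "1970-01-05T12:34:56.000000Z",
    "2020-02-13T23:59:59.999999Z",
]
print(sorted(stamps))  # chronological order, no date parsing involved
```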
Now to our problem at hand. If you use Go and its `time` package, it will deviate from the standard in the subtlest of ways: it truncates trailing zeroes in milli/micro/nanoseconds, so `2020-07-21T08:21:00.629280Z` becomes `2020-07-21T08:21:00.62928Z`. It doesn't seem like a big deal, but it breaks the ordering guarantee we had originally.

This now means that if you prepend your filenames or if you have objects stored as `${time}.csv`, you can't just invoke a listing function and process data in sorted order - you need to load *all* your filenames, parse the dates within, sort them, and then process the data. Or you can… lose data, since you're relying on them being sorted and they aren't.

Here's a bit of code to reproduce the issue.
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">package</span> <span class="n">main</span>
<span class="k">import</span> <span class="p">(</span>
<span class="s">"fmt"</span>
<span class="s">"log"</span>
<span class="s">"sort"</span>
<span class="s">"time"</span>
<span class="p">)</span>
<span class="k">func</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">run</span><span class="p">();</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="n">log</span><span class="o">.</span><span class="n">Fatal</span><span class="p">(</span><span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">func</span> <span class="n">run</span><span class="p">()</span> <span class="kt">error</span> <span class="p">{</span>
<span class="n">n</span> <span class="o">:=</span> <span class="m">100</span>
<span class="n">times</span> <span class="o">:=</span> <span class="nb">make</span><span class="p">([]</span><span class="kt">string</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="n">n</span><span class="p">)</span>
<span class="c">// now := time.Now().UTC()</span>
<span class="n">now</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">time</span><span class="o">.</span><span class="n">Parse</span><span class="p">(</span><span class="s">"2006-01-02"</span><span class="p">,</span> <span class="s">"2020-07-21"</span><span class="p">)</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">err</span>
<span class="p">}</span>
<span class="k">for</span> <span class="n">j</span> <span class="o">:=</span> <span class="m">0</span><span class="p">;</span> <span class="n">j</span> <span class="o"><</span> <span class="n">n</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span> <span class="p">{</span>
<span class="n">now</span> <span class="o">=</span> <span class="n">now</span><span class="o">.</span><span class="n">Add</span><span class="p">(</span><span class="n">time</span><span class="o">.</span><span class="n">Microsecond</span><span class="p">)</span>
<span class="n">newTime</span> <span class="o">:=</span> <span class="n">now</span><span class="o">.</span><span class="n">Format</span><span class="p">(</span><span class="n">time</span><span class="o">.</span><span class="n">RFC3339Nano</span><span class="p">)</span>
<span class="n">times</span> <span class="o">=</span> <span class="nb">append</span><span class="p">(</span><span class="n">times</span><span class="p">,</span> <span class="n">newTime</span><span class="p">)</span>
<span class="p">}</span>
<span class="n">sort</span><span class="o">.</span><span class="n">Strings</span><span class="p">(</span><span class="n">times</span><span class="p">)</span>
<span class="k">for</span> <span class="n">j</span> <span class="o">:=</span> <span class="m">1</span><span class="p">;</span> <span class="n">j</span> <span class="o"><</span> <span class="nb">len</span><span class="p">(</span><span class="n">times</span><span class="p">);</span> <span class="n">j</span><span class="o">++</span> <span class="p">{</span>
<span class="n">t1</span><span class="p">,</span> <span class="n">t2</span> <span class="o">:=</span> <span class="n">times</span><span class="p">[</span><span class="n">j</span><span class="o">-</span><span class="m">1</span><span class="p">],</span> <span class="n">times</span><span class="p">[</span><span class="n">j</span><span class="p">]</span>
<span class="n">t1p</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">time</span><span class="o">.</span><span class="n">Parse</span><span class="p">(</span><span class="n">time</span><span class="o">.</span><span class="n">RFC3339</span><span class="p">,</span> <span class="n">t1</span><span class="p">)</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">err</span>
<span class="p">}</span>
<span class="n">t2p</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">time</span><span class="o">.</span><span class="n">Parse</span><span class="p">(</span><span class="n">time</span><span class="o">.</span><span class="n">RFC3339</span><span class="p">,</span> <span class="n">t2</span><span class="p">)</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">err</span>
<span class="p">}</span>
<span class="k">if</span> <span class="n">t1p</span><span class="o">.</span><span class="n">UnixNano</span><span class="p">()</span> <span class="o">></span> <span class="n">t2p</span><span class="o">.</span><span class="n">UnixNano</span><span class="p">()</span> <span class="p">{</span>
<span class="n">fmt</span><span class="o">.</span><span class="n">Printf</span><span class="p">(</span><span class="s">"apparently, %s precedes %s</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">t1</span><span class="p">,</span> <span class="n">t2</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">return</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>
This yields:

```
apparently, 2020-07-21T00:00:00.000019Z precedes 2020-07-21T00:00:00.00001Z
apparently, 2020-07-21T00:00:00.000029Z precedes 2020-07-21T00:00:00.00002Z
apparently, 2020-07-21T00:00:00.000039Z precedes 2020-07-21T00:00:00.00003Z
apparently, 2020-07-21T00:00:00.000049Z precedes 2020-07-21T00:00:00.00004Z
apparently, 2020-07-21T00:00:00.000059Z precedes 2020-07-21T00:00:00.00005Z
apparently, 2020-07-21T00:00:00.000069Z precedes 2020-07-21T00:00:00.00006Z
apparently, 2020-07-21T00:00:00.000079Z precedes 2020-07-21T00:00:00.00007Z
apparently, 2020-07-21T00:00:00.000089Z precedes 2020-07-21T00:00:00.00008Z
apparently, 2020-07-21T00:00:00.000099Z precedes 2020-07-21T00:00:00.00009Z
```
If you print the affected portion of the sorted strings slice, you get this series:

```
2020-07-21T00:00:00.000015Z
2020-07-21T00:00:00.000016Z
2020-07-21T00:00:00.000017Z
2020-07-21T00:00:00.000018Z
2020-07-21T00:00:00.000019Z
2020-07-21T00:00:00.00001Z  <- this precedes all the timestamps above
2020-07-21T00:00:00.000021Z
2020-07-21T00:00:00.000022Z
2020-07-21T00:00:00.000023Z
2020-07-21T00:00:00.000024Z
```
At this point you can't rely on these formatted strings being sorted - that is, unless you use your own formatting string: a fixed-width layout such as `2006-01-02T15:04:05.000000000Z07:00` keeps trailing zeroes, because zero-digit placeholders pad fractional seconds while the nines in `RFC3339Nano` trim them. `RFC3339Nano` in Go cannot be relied on, not in terms of its ordering guarantees.
This is [a known issue](https://github.com/golang/go/issues?q=is%3Aissue+RFC3339+is%3Aclosed), among many others.

# Performance: Using arrays instead of tiny maps in Go

2020-07-20

I wrote about the performance characteristics of maps vs. slices/arrays [a while ago](https://kokes.github.io/blog/2018/10/09/big-o-performance.html) and I stumbled upon this again today, so I figured I'd describe the issue at hand with a bit more context.
I have some text data and I want to infer, for each datapoint, whether it's an integer, a float etc. My data types look something like this:
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">type</span> <span class="n">Dtype</span> <span class="kt">uint8</span>
<span class="k">const</span> <span class="p">(</span>
<span class="n">DtypeInvalid</span> <span class="n">Dtype</span> <span class="o">=</span> <span class="no">iota</span>
<span class="n">DtypeNull</span>
<span class="n">DtypeString</span>
<span class="n">DtypeInt</span>
<span class="n">DtypeFloat</span>
<span class="n">DtypeBool</span>
<span class="c">// thanks to DtypeMax, I can add new types here without worrying about anything</span>
<span class="n">DtypeMax</span>
<span class="p">)</span>
</code></pre></div></div>
Notice the `DtypeMax` - that's the key to our success here. But first let's generate some data:
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span> <span class="o">:=</span> <span class="nb">make</span><span class="p">([]</span><span class="n">Dtype</span><span class="p">,</span> <span class="n">n</span><span class="p">)</span>
<span class="k">for</span> <span class="n">j</span> <span class="o">:=</span> <span class="m">0</span><span class="p">;</span> <span class="n">j</span> <span class="o"><</span> <span class="n">n</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span> <span class="p">{</span>
<span class="n">data</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="n">Dtype</span><span class="p">(</span><span class="n">rand</span><span class="o">.</span><span class="n">Intn</span><span class="p">(</span><span class="kt">int</span><span class="p">(</span><span class="n">DtypeMax</span><span class="p">)))</span>
<span class="p">}</span>
</code></pre></div></div>
An intuitive way to count the occurrence of each type would be to do something like this:
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tgmap</span> <span class="o">:=</span> <span class="nb">make</span><span class="p">(</span><span class="k">map</span><span class="p">[</span><span class="n">Dtype</span><span class="p">]</span><span class="kt">int</span><span class="p">)</span>
<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">el</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">data</span> <span class="p">{</span>
<span class="c">// work to determine type</span>
<span class="n">tgmap</span><span class="p">[</span><span class="n">el</span><span class="p">]</span><span class="o">++</span>
<span class="p">}</span>
</code></pre></div></div>
This implementation worked just fine for months, but earlier today I started wondering whether the work that goes into populating the map was significant after all. I leveraged two factors in my rewrite:

1. I know how many types I have, so I can create an array - no manual allocation (unlike slices or maps), size known at compile time.
2. The type constants are numeric, so I can use them as indexes into this array - much cheaper than hash lookups.
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">var</span> <span class="n">tgarr</span> <span class="p">[</span><span class="n">DtypeMax</span><span class="p">]</span><span class="kt">int</span>
<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">el</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">data</span> <span class="p">{</span>
<span class="c">// work to determine type</span>
<span class="n">tgarr</span><span class="p">[</span><span class="n">el</span><span class="p">]</span><span class="o">++</span>
<span class="p">}</span>
</code></pre></div></div>
How did it go? I ran each with a million entries, ten times. I didn't use Go's testing package, because I wanted to play around in `main()`; it's usually enough for small tests with big differences in performance (I don't need statistical tests here, otherwise I'd use `testing` and `benchstat`).
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>took 31.441122ms to populate a map
took 29.453898ms to populate a map
took 29.518603ms to populate a map
took 29.254208ms to populate a map
took 31.35607ms to populate a map
took 28.851067ms to populate a map
took 29.209222ms to populate a map
took 29.355206ms to populate a map
took 29.451898ms to populate a map
took 30.235153ms to populate a map
took 521.01µs to populate an array
took 512.882µs to populate an array
took 516.915µs to populate an array
took 517.194µs to populate an array
took 541.786µs to populate an array
took 534.958µs to populate an array
took 525.162µs to populate an array
took 597.569µs to populate an array
took 534.92µs to populate an array
took 532.322µs to populate an array
</code></pre></div></div>
So without any work involved in the tight loop, I get a 60x speedup. Obviously this won't give you a 60x end-to-end speedup, but depending on how expensive your loop iteration is, this may be significant (it gave me 10-20%, which is nice).

Takeaway: if you have a map with few entries, a known domain, and a constrained range (i.e. max-min is reasonable), it may be very efficient to use a slice or an array instead, as it has much cheaper reads/writes.

The whole code is here:
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">package</span> <span class="n">main</span>
<span class="k">import</span> <span class="p">(</span>
<span class="s">"fmt"</span>
<span class="s">"log"</span>
<span class="s">"math/rand"</span>
<span class="s">"time"</span>
<span class="p">)</span>
<span class="k">func</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">run</span><span class="p">();</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="n">log</span><span class="o">.</span><span class="n">Fatal</span><span class="p">(</span><span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">type</span> <span class="n">Dtype</span> <span class="kt">uint8</span>
<span class="k">const</span> <span class="p">(</span>
<span class="n">DtypeInvalid</span> <span class="n">Dtype</span> <span class="o">=</span> <span class="no">iota</span>
<span class="n">DtypeNull</span>
<span class="n">DtypeString</span>
<span class="n">DtypeInt</span>
<span class="n">DtypeFloat</span>
<span class="n">DtypeBool</span>
<span class="n">DtypeMax</span>
<span class="p">)</span>
<span class="k">func</span> <span class="n">run</span><span class="p">()</span> <span class="kt">error</span> <span class="p">{</span>
<span class="n">work</span> <span class="o">:=</span> <span class="k">func</span><span class="p">()</span> <span class="p">{</span>
<span class="n">time</span><span class="o">.</span><span class="n">Sleep</span><span class="p">(</span><span class="n">time</span><span class="o">.</span><span class="n">Nanosecond</span> <span class="o">*</span> <span class="m">100</span><span class="p">)</span>
<span class="p">}</span>
<span class="n">_</span> <span class="o">=</span> <span class="n">work</span>
<span class="n">n</span> <span class="o">:=</span> <span class="m">1000</span><span class="n">_000</span>
<span class="n">loops</span> <span class="o">:=</span> <span class="m">10</span>
<span class="n">data</span> <span class="o">:=</span> <span class="nb">make</span><span class="p">([]</span><span class="n">Dtype</span><span class="p">,</span> <span class="n">n</span><span class="p">)</span>
<span class="k">for</span> <span class="n">j</span> <span class="o">:=</span> <span class="m">0</span><span class="p">;</span> <span class="n">j</span> <span class="o"><</span> <span class="n">n</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span> <span class="p">{</span>
<span class="n">data</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="n">Dtype</span><span class="p">(</span><span class="n">rand</span><span class="o">.</span><span class="n">Intn</span><span class="p">(</span><span class="kt">int</span><span class="p">(</span><span class="n">DtypeMax</span><span class="p">)))</span>
<span class="p">}</span>
<span class="k">for</span> <span class="n">l</span> <span class="o">:=</span> <span class="m">0</span><span class="p">;</span> <span class="n">l</span> <span class="o"><</span> <span class="n">loops</span><span class="p">;</span> <span class="n">l</span><span class="o">++</span> <span class="p">{</span>
<span class="n">tgmap</span> <span class="o">:=</span> <span class="nb">make</span><span class="p">(</span><span class="k">map</span><span class="p">[</span><span class="n">Dtype</span><span class="p">]</span><span class="kt">int</span><span class="p">)</span>
<span class="n">t</span> <span class="o">:=</span> <span class="n">time</span><span class="o">.</span><span class="n">Now</span><span class="p">()</span>
<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">el</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">data</span> <span class="p">{</span>
<span class="c">// work()</span>
<span class="n">tgmap</span><span class="p">[</span><span class="n">el</span><span class="p">]</span><span class="o">++</span>
<span class="p">}</span>
<span class="n">fmt</span><span class="o">.</span><span class="n">Printf</span><span class="p">(</span><span class="s">"took %v to populate a map</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">time</span><span class="o">.</span><span class="n">Since</span><span class="p">(</span><span class="n">t</span><span class="p">))</span>
<span class="p">}</span>
<span class="k">for</span> <span class="n">l</span> <span class="o">:=</span> <span class="m">0</span><span class="p">;</span> <span class="n">l</span> <span class="o"><</span> <span class="n">loops</span><span class="p">;</span> <span class="n">l</span><span class="o">++</span> <span class="p">{</span>
<span class="k">var</span> <span class="n">tgarr</span> <span class="p">[</span><span class="n">DtypeMax</span><span class="p">]</span><span class="kt">int</span>
<span class="n">t</span> <span class="o">:=</span> <span class="n">time</span><span class="o">.</span><span class="n">Now</span><span class="p">()</span>
<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">el</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">data</span> <span class="p">{</span>
<span class="c">// work()</span>
<span class="n">tgarr</span><span class="p">[</span><span class="n">el</span><span class="p">]</span><span class="o">++</span>
<span class="p">}</span>
<span class="n">fmt</span><span class="o">.</span><span class="n">Printf</span><span class="p">(</span><span class="s">"took %v to populate an array</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">time</span><span class="o">.</span><span class="n">Since</span><span class="p">(</span><span class="n">t</span><span class="p">))</span>
<span class="p">}</span>
<span class="k">return</span> <span class="no">nil</span>
<span class="p">}</span>
# Intel NUC for the family - journey to non-portable hardware

2020-07-12

I have become so accustomed to owning laptops that I haven't really looked into the area of Big Grey Boxes in a while. But not all PCs are big and grey - Intel has been making these cute little mini PCs, [Intel NUCs](https://www.intel.com/content/www/us/en/products/boards-kits/nuc.html). These are essentially laptop components in a tiny box and, crucially, you can buy one without a disk, RAM, and operating system, so that you can customise it a bit.
These have been quite popular as HTPCs or small home servers, but I needed something else: a modest computer for my mum. She doesn't need much, mostly just browsing, some casual gaming, video streaming, messaging. The NUC looked ideal; here's what I learned over the past few weeks.
<h2 id="model">Model</h2>
When talking about the non-gaming models, there are three to choose from: i3, i5, and i7. I would personally have chosen the i5 as I view it as good value for money, but, watching [Robtech's videos](https://www.youtube.com/watch?v=LjWmlacoZ24) (thanks!), I chose the i3, as it has enough oomph and it should run quieter than the i5.

One more thing to choose is either the "low" variant or the "high" one. There is no price difference; the higher one just allows you to plug in a 2.5" disk in addition to your M.2 system disk. This is handy for media/build/storage/backup servers, but I was fine with the M.2, so I went with the low variant, 10i3FNK.
<h2 id="components">Components</h2>
There are just two things to choose: RAM and disk. RAM was easier - I just picked a 16 GB DDR4 2666 MHz stick, they don't really differ much in terms of performance. The only thing to watch out for is to choose a single stick if you want to leave some space for future upgrades. I wanted to go with 8 gigs originally, but seeing how cheap RAM is these days, I opted for 16 to make the PC future-proof.

In terms of disk, boy was I surprised. Fast NVMe drives are fairly cheap and the performance is amazing. I needed 256 gigs, so I went with the [Samsung EVO Plus](https://www.samsung.com/semiconductor/minisite/ssd/product/consumer/970evoplus/), which is one of the most performant drives. You can save a bit by going with an ADATA, a WD Black or some other drive that may be on sale. I admit this was a bit of an overkill and if I were on a tighter budget, I would have gone with an alternative.
<h2 id="peripherals">Peripherals</h2>
I already had a keyboard and mouse available, as well as an old monitor, but you obviously need those if you don't have them. If you're buying these and also use quite a few USB-A ports, I'd look for Bluetooth keyboard/mouse sets, as the newest NUC only has three A ports - but make sure to have a wired keyboard for OS installation.

As for monitors, make sure to look at a monitor's build before buying it, because you want accessible VESA mounts - e.g. my old Dell monitor has VESA screws built into the stand, so you cannot use both the VESA mount and the stand. The NUC can be mounted on monitors that do support this, so if that's something you want to enable, make sure to be careful when shopping for monitors.

Also, if you want extra USB ports, make sure to buy a USB-C enabled monitor, so that it can serve as your hub as well, but this makes the build quite expensive.
<h2 id="upsides">Upsides</h2>
I have very little knowledge when it comes to building computers and was pleasantly surprised by how little I needed to know. Just two components and off you go. You can even buy pre-built NUCs, but I didn't know what sort of SSDs or RAM they used, so I wanted to "build" my own.

The computer is truly tiny - the footprint is smaller than a CD in its case - and it runs fairly quiet. It's not silent, which was a bit surprising in my study, which I keep pretty quiet. I could hear the fan come on and while it wasn't annoying, it was a tiny bit surprising. At my mum's, it wasn't an issue one bit, because you couldn't hear it over the other noise within the flat. You can also tweak the fan characteristics in the BIOS, but I didn't want to run the NUC too hot just for a few decibels of noise.

I was also pleasantly surprised by the two USB-C ports, which will be underutilised at this point, but it's another nice future-proofing feature.
<h2 id="downsides">Downsides</h2>
It's a mobile CPU, so it won't blow you away, but for my use cases, it was more than sufficient. Apart from the fan, the only thing I can complain about is the port organisation - the two USB-A ports in the back are quite close together, so some larger flash drives didn't fit next to my USB cable. And the front 3.5 mm jack is not terribly well positioned for speakers that will be plugged in 24/7 - it's more of a headphone jack that you intend to plug in and out.
<h2 id="verdict">Verdict</h2>
For the price of roughly 1/3 of a nice ultrabook, I got a really nice machine that utilises existing peripherals I had lying around. I don't mean to compare these two form factors, just putting it in perspective.
The new user is very happy with the machine - it runs laps around her old machine with some Celeron CPU. That's why I wanted to pick my own parts: the low-end market is full of low-performing CPUs and cheap drives, but NUCs all have Intel Core processors, which should run most office workloads just fine - slap in enough RAM for Chrome and a fast enough disk and off you go. I'm very happy with the purchase.

# Accessing S3 data with Apache Spark from stock PySpark

2020-06-22

For a while now, you've been able to run `pip install pyspark` on your machine and get all of Apache Spark, all the jars and such, without worrying about much else. While it's a great way to set up PySpark on your machine to troubleshoot things locally, it comes with a set of caveats - you're essentially running a distributed, hard to maintain system… via pip install.
While I'm used to Spark being managed (by either a team at work, Databricks, or within Amazon's EMR), I like to run things locally. I've been using PySpark via pip for quite a while now, but I've always run into issues with S3. Let's now go through them and look at some resolutions to these problems.

For the purposes of this post, I created two S3 buckets - `ok-eu-central-1` in Frankfurt and `ok-us-west-2` in Oregon. We'll see why later on. Let's pip install PySpark and try a simple pipeline that just exports a single data point.
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">SparkSession</span>
<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">'__main__'</span><span class="p">:</span>
<span class="n">spark</span> <span class="o">=</span> <span class="p">(</span><span class="n">SparkSession</span>
<span class="p">.</span><span class="n">builder</span>
<span class="p">.</span><span class="n">appName</span><span class="p">(</span><span class="s">'my_export'</span><span class="p">)</span>
<span class="p">.</span><span class="n">master</span><span class="p">(</span><span class="s">"local[*]"</span><span class="p">)</span>
<span class="p">.</span><span class="n">getOrCreate</span><span class="p">()</span>
<span class="p">)</span>
<span class="n">spark</span><span class="p">.</span><span class="n">sql</span><span class="p">(</span><span class="s">'select 1'</span><span class="p">).</span><span class="n">write</span><span class="p">.</span><span class="n">mode</span><span class="p">(</span><span class="s">'overwrite'</span><span class="p">).</span><span class="n">csv</span><span class="p">(</span><span class="s">'s3a://ok-us-west-2/tmp'</span><span class="p">)</span>
</code></pre></div></div>
<p>But when we invoke <code class="language-plaintext highlighter-rouge">spark-submit pipeline.py</code>, we get the following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>py4j.protocol.Py4JJavaError: An error occurred while calling o34.csv.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
</code></pre></div></div>
<p>This is because Spark doesn’t ship with the relevant AWS libraries, so let’s fix that. We can specify a dependency that will get downloaded upon first invocation (so be careful when bundling this in e.g. Docker, you’ll want to ship it differently there):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.4 pipeline.py
</code></pre></div></div>
<p>Here 2.7.4 reflects the version of Hadoop libraries shipped with PySpark - at the time of writing (June 2020), there’s (Py)Spark version 3.0.0 out there and it ships with 2.7.4. While Spark now supports Hadoop 3.x, this is what it ships with by default.</p>
<p>When we run the command above, it all works fine! So now you can see what you have to do - you need to instruct Spark (somehow) to include this AWS dependency. I did it at runtime via a CLI argument, but there are other (better) ways - one is sketched below.</p>
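<p>One such way - a sketch, not the only option - is to set <code class="language-plaintext highlighter-rouge">spark.jars.packages</code> when building the session, so the pipeline carries its own dependency declaration:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from pyspark.sql import SparkSession

# the dependency still gets downloaded on first use, just like with --packages
spark = (SparkSession
    .builder
    .config('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:2.7.4')
    .getOrCreate()
)
</code></pre></div></div>
<p>Note that this needs to be set before the session is created - setting it on an already-running session won’t trigger dependency resolution.</p>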
<h2 id="not-all-regions-are-equal">Not all regions are equal</h2>
<p>Now, this all works, but let’s change the code above slightly. Let’s write to the bucket in Frankfurt.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># ...
</span> <span class="n">spark</span><span class="p">.</span><span class="n">sql</span><span class="p">(</span><span class="s">'select 1'</span><span class="p">).</span><span class="n">write</span><span class="p">.</span><span class="n">mode</span><span class="p">(</span><span class="s">'overwrite'</span><span class="p">).</span><span class="n">csv</span><span class="p">(</span><span class="s">'s3a://ok-eu-central-1/tmp'</span><span class="p">)</span>
</code></pre></div></div>
<p>Now the invocation, even with the <code class="language-plaintext highlighter-rouge">--packages</code> argument, won’t work. It will fail with a 400 error:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>py4j.protocol.Py4JJavaError: An error occurred while calling o37.csv.
: com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: (...), AWS Error Code: null, AWS Error Message: Bad Request, S3 Extended Request ID: (...)=
</code></pre></div></div>
<p>A 400 is a bad request, so something about our request was incorrect. If you go down the rabbit hole, you’ll find out that there are multiple versions of AWS’s request signing protocol and newer regions only support the newer one (v4). Sadly, the <code class="language-plaintext highlighter-rouge">hadoop-aws</code> library in version 2.7.4 defaults to v2, so it cannot handle Frankfurt and other newer regions without further settings. We can’t just upgrade hadoop-aws, because other Hadoop libraries are linked against it and their versions need to match.</p>
<p>There are two things you need to do if you work with Spark on Hadoop 2 and newer AWS regions. First, specify that you’re using the v4 protocol. Here I’m submitting it as an argument (and for the driver only, since I’m running in local mode), but you may want to specify this elsewhere - e.g. in a config file, as shown after the command.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.4 --conf spark.driver.extraJavaOptions='-Dcom.amazonaws.services.s3.enableV4' pipeline.py
</code></pre></div></div>
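<p>If you’d rather not pass this on every invocation, the same options can live in <code class="language-plaintext highlighter-rouge">spark-defaults.conf</code> - a sketch (the exact path and whether you need the executor line depend on your setup):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># in $SPARK_HOME/conf/spark-defaults.conf
spark.driver.extraJavaOptions   -Dcom.amazonaws.services.s3.enableV4
spark.executor.extraJavaOptions -Dcom.amazonaws.services.s3.enableV4
</code></pre></div></div>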
<p>Second, we need to set up the endpoint for our S3 bucket. Not to diverge too far, but S3 libraries tend to have an <code class="language-plaintext highlighter-rouge">endpoint</code> option, so that you can point them at other servers, even outside of AWS, as long as they support the S3 API. This is very useful for testing, among other things. Here we’ll leverage this option, but only to point it within AWS.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">SparkSession</span>
<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">'__main__'</span><span class="p">:</span>
<span class="n">spark</span> <span class="o">=</span> <span class="p">(</span><span class="n">SparkSession</span>
<span class="p">.</span><span class="n">builder</span>
<span class="p">.</span><span class="n">appName</span><span class="p">(</span><span class="s">'my_export'</span><span class="p">)</span>
<span class="p">.</span><span class="n">master</span><span class="p">(</span><span class="s">"local[*]"</span><span class="p">)</span>
<span class="p">.</span><span class="n">config</span><span class="p">(</span><span class="s">'fs.s3a.endpoint'</span><span class="p">,</span> <span class="s">'s3.eu-central-1.amazonaws.com'</span><span class="p">)</span> <span class="c1"># this has been added
</span> <span class="p">.</span><span class="n">getOrCreate</span><span class="p">()</span>
<span class="p">)</span>
<span class="n">spark</span><span class="p">.</span><span class="n">sql</span><span class="p">(</span><span class="s">'select 1'</span><span class="p">).</span><span class="n">write</span><span class="p">.</span><span class="n">mode</span><span class="p">(</span><span class="s">'overwrite'</span><span class="p">).</span><span class="n">csv</span><span class="p">(</span><span class="s">'s3a://ok-eu-central-1/tmp'</span><span class="p">)</span>
</code></pre></div></div>
<p>And that’s it. Now that we’ve specified the endpoint, protocol version, and hadoop-aws, we can finally write to new S3 regions. Check out <a href="https://docs.aws.amazon.com/general/latest/gr/s3.html">the relevant AWS docs</a> to get your region’s endpoint.</p>
<p>I’m hoping that Spark switches to Hadoop 3.x for its PySpark distribution, that way we can avoid most of the shenanigans here (we’ll only need <code class="language-plaintext highlighter-rouge">hadoop-aws</code> at that point).</p>
<p>Happy data processing!</p>For a while now, you’ve been able to run pip install pyspark on your machine and get all of Apache Spark, all the jars and such, without worrying about much else. While it’s a great way to set up PySpark on your machine to troubleshoot things locally, it comes with a set of caveats - you’re essentially running a distributed, hard-to-maintain system… via pip install.High Performance Data Loss2020-06-12T05:03:53+00:002020-06-12T05:03:53+00:00https://kokes.github.io/blog/2020/06/12/high-performance-data-loss<p><strong>This article reflects software as it was stable at the time of writing (June 2020) and was later updated (in October 2021) to note changes merged in Spark 3 and Spark 3.2.0, two big updates that fix some of the issues noted here. As always, software gets updated and the remaining issues may get resolved by the time you read this.</strong></p>
<p>When talking about data loss, we usually mean some traditional ways of losing data, many of which are related to databases:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">TRUNCATE</code> table, which deletes all the data in a table - and you may have just accidentally truncated the wrong one</li>
<li><code class="language-plaintext highlighter-rouge">DELETE</code> without a <code class="language-plaintext highlighter-rouge">WHERE</code> clause - you meant to delete only some rows, but ended up deleting all of them</li>
<li>running against production instead of dev/staging - you think you’re on your dev machine, you drop things left and right and then realise you messed up your production data</li>
<li>crashing without backups - this is pretty common, you encounter a system failure, but there are no backups to restore from</li>
</ul>
<p>These are fairly common and well understood and I don’t mean to talk about them here. Instead, I want to talk about a class of data integrity issues much less visible than a dropped table. <em>We can actually lose data without noticing it at first. We may still have it, it’s just incorrect due to a fault in our data pipeline.</em></p>
<p>How is that possible? Well, let’s see what a program can do when it encounters invalid input. There are basically three things it can do:</p>
<ol>
<li>Fail (return an error, throw an exception etc.)</li>
<li>Return some NULL-like value (None, NULL, nil or whatever a given system uses)</li>
<li>Parse it incorrectly</li>
</ol>
<p>I ordered these roughly by how I’d expect a parser to behave. I subscribe to the notion that we should fail early and often; some systems don’t, as we’re about to see.</p>
<h2 id="lesson-1-parse-all-the-formats">Lesson 1: parse all the formats</h2>
<p>When it comes to stringified dates, I have <a href="https://xkcd.com/1179/">a favourite format</a>, that is <a href="https://en.wikipedia.org/wiki/ISO_8601">ISO 8601</a>, because it is widely used, cannot be easily misinterpreted, and, as a bonus, is sortable.</p>
<p>Sadly, not all input data come in this format, so you need to accommodate that. You can usually specify the format you expect the string to be in and pass it to <a href="https://docs.python.org/3/library/datetime.html#datetime.datetime.strptime">strptime</a> or its equivalent.</p>
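<p>For instance, if we know our dates come as day/month/year, a single <code class="language-plaintext highlighter-rouge">strptime</code> call suffices and anything else will err:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from datetime import datetime

datetime.strptime('28/2/2020', '%d/%m/%Y').date().isoformat()  # '2020-02-28'
datetime.strptime('2020-02-28', '%d/%m/%Y')  # raises ValueError - good!
</code></pre></div></div>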
<p>This all works if your data is homogenous or if you know how to construct such a date format string. But if either of those preconditions fails, it’s no longer trivial. Luckily, there’s <a href="https://pypi.org/project/python-dateutil/">dateutil</a>. This library has a lot of <em>abstractions</em> for slightly friendlier work with dates and times. The functionality I use the most is its parser; it works like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">dateutil</span> <span class="kn">import</span> <span class="n">parser</span>
<span class="n">formats</span> <span class="o">=</span> <span class="p">[</span><span class="s">'2020-02-28'</span><span class="p">,</span> <span class="s">'2020/02/28'</span><span class="p">,</span> <span class="s">'28/2/2020'</span><span class="p">,</span> <span class="s">'2/28/2020'</span><span class="p">,</span> <span class="s">'28/2/20'</span><span class="p">,</span> <span class="s">'2/28/20'</span><span class="p">]</span>
<span class="p">[</span><span class="n">parser</span><span class="p">.</span><span class="n">parse</span><span class="p">(</span><span class="n">j</span><span class="p">).</span><span class="n">date</span><span class="p">().</span><span class="n">isoformat</span><span class="p">()</span> <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="n">formats</span><span class="p">]</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>['2020-02-28',
'2020-02-28',
'2020-02-28',
'2020-02-28',
'2020-02-28',
'2020-02-28']
</code></pre></div></div>
<p>No matter what you throw at it, it will usually figure things out.</p>
<p>My use case was very similar. I was parsing dozens of datasets from our labour ministry and it had all sorts of date formats. There was the Czech variant of <code class="language-plaintext highlighter-rouge">23. 4. 2020</code>, but also <code class="language-plaintext highlighter-rouge">23. 4. 2020 12:34:33</code> and <code class="language-plaintext highlighter-rouge">23. 4. 2020 12</code> and a few others.</p>
<p>The issue was that the dateutil parser assumes m/d/y as the default (if the input is ambiguous), so the fourth of July in the non-American notation (day first) would get detected as the seventh of April.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">parser</span><span class="p">.</span><span class="n">parse</span><span class="p">(</span><span class="s">'4. 7. 2020'</span><span class="p">).</span><span class="n">date</span><span class="p">().</span><span class="n">isoformat</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>'2020-04-07'
</code></pre></div></div>
<p>Luckily dateutil has a setting, where you can hint that your dates tend to have days first.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">parser</span><span class="p">.</span><span class="n">parse</span><span class="p">(</span><span class="s">'4. 7. 2020'</span><span class="p">,</span> <span class="n">dayfirst</span><span class="o">=</span><span class="bp">True</span><span class="p">).</span><span class="n">date</span><span class="p">().</span><span class="n">isoformat</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>'2020-07-04'
</code></pre></div></div>
<p>… and it works as expected.</p>
<p>This worked just fine for a while, but then I noticed that some of my dates were in the future even though they denoted things from the past. I dug into the code and found out what happened.</p>
<p>The data source switched from non-standard <code class="language-plaintext highlighter-rouge">4. 7. 2020</code> to ISO-8601, so <code class="language-plaintext highlighter-rouge">2020-07-04</code> - excellent! - but since dateutil was told <code class="language-plaintext highlighter-rouge">dayfirst</code>, it misinterpreted a completely valid, standard date, just because I changed some dateutil defaults.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">parser</span><span class="p">.</span><span class="n">parse</span><span class="p">(</span><span class="s">'2020-07-04'</span><span class="p">,</span> <span class="n">dayfirst</span><span class="o">=</span><span class="bp">True</span><span class="p">).</span><span class="n">date</span><span class="p">().</span><span class="n">isoformat</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>'2020-04-07'
</code></pre></div></div>
<p>We ended up with a mangled ISO date and were taught two lessons:</p>
<ul>
<li>sometimes explicit is better than implicit - if we know the possible source formats, we can map their conversions explicitly and have more control over them (hint: use <a href="https://docs.python.org/3/library/datetime.html#datetime.datetime.strptime">strptime</a>/<a href="https://docs.python.org/3/library/datetime.html#datetime.time.fromisoformat">fromisoformat</a>)</li>
<li>you need to err when there’s unexpected input - it’s much safer to crash than to guess what was probably meant (you will be wrong at some point); see the sketch after this list</li>
</ul>
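<p>A minimal sketch of what that explicit, fail-fast approach could look like - the format list and the helper name are hypothetical, you’d enumerate whatever your source actually produces:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from datetime import datetime

# hypothetical list - enumerate the formats your source actually produces
KNOWN_FORMATS = ['%Y-%m-%d', '%d. %m. %Y', '%d. %m. %Y %H:%M:%S']

def parse_date(value):
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date()
        except ValueError:
            continue
    # fail loudly on anything unexpected instead of guessing
    raise ValueError('unexpected date format: %r' % value)
</code></pre></div></div>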
<p>This is a <a href="https://github.com/dateutil/dateutil/issues/268">known (and unresolved) issue</a>.</p>
<h2 id="lesson-2-overflows-are-not-just-for-integers">Lesson 2: overflows are not just for integers</h2>
<p>In computer science, if you have a container that can hold values from -128 to 127 (an 8-bit signed integer), computing 100*2 will <em>overflow</em> (~wrap around) and you’ll get a negative number. I always thought that this was a numbers thing, but I recently found out it can happen for dates and times as well.</p>
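<p>You can see this wrap-around first hand with numpy’s fixed-width integers:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy

numpy.int8(100) * numpy.int8(2)  # -56, it wrapped around (numpy may emit a warning)
</code></pre></div></div>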
<p>There are a number of ways you can end up with invalid dates - it could be a confusion about formats (you swap days and months in ISO-8601, for instance), it could be a bug (adding a year is just adding 1 to the year, without checking whether that day exists, hint: <a href="https://en.wikipedia.org/wiki/Leap_year_problem">leap years</a>), it could be data corruption - it can be many things.</p>
<p>As we learned in lesson 1, a sensible way to resolve this is… not to resolve it and just err.</p>
<p>For this lesson, we’ll need <a href="http://spark.apache.org/">Apache Spark</a>, a very popular data processing engine with Python bindings.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">pyspark.sql.functions</span> <span class="k">as</span> <span class="n">f</span>
<span class="kn">import</span> <span class="nn">pyspark.sql.types</span> <span class="k">as</span> <span class="n">t</span>
<span class="kn">import</span> <span class="nn">pandas</span>
<span class="kn">import</span> <span class="nn">numpy</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">os</span><span class="p">.</span><span class="n">makedirs</span><span class="p">(</span><span class="s">'data'</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div></div>
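<p>(One setup note: the snippets below assume a <code class="language-plaintext highlighter-rouge">spark</code> session already exists - hosted notebooks usually provide one; locally, you’d create it first.)</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
</code></pre></div></div>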
<p>Let’s write some non-existent dates. We’ll let Spark infer the types; we won’t tell it. Since these dates are syntactically alright but non-existent in reality, we can expect one of three things:</p>
<ol>
<li>Error</li>
<li>NULLs instead of dates (and possibly a warning)</li>
<li>Dates will get recognised as strings as they don’t conform to date formats</li>
</ol>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">'data/dates.csv'</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">fw</span><span class="p">:</span>
<span class="n">fw</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="s">'''name,date
john,2019-01-01
jane,2019-02-30
doe,2019-02-29
lily,2019-04-31'''</span><span class="p">)</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="p">.</span><span class="n">read</span><span class="p">.</span><span class="n">option</span><span class="p">(</span><span class="s">'inferSchema'</span><span class="p">,</span> <span class="bp">True</span><span class="p">).</span><span class="n">option</span><span class="p">(</span><span class="s">'header'</span><span class="p">,</span> <span class="bp">True</span><span class="p">).</span><span class="n">csv</span><span class="p">(</span><span class="s">'data/dates.csv'</span><span class="p">)</span>
<span class="n">df</span><span class="p">.</span><span class="n">printSchema</span><span class="p">()</span>
<span class="n">df</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root
|-- name: string (nullable = true)
|-- date: timestamp (nullable = true)
+----+-------------------+
|name| date|
+----+-------------------+
|john|2019-01-01 00:00:00|
|jane|2019-03-02 00:00:00|
| doe|2019-03-01 00:00:00|
|lily|2019-05-01 00:00:00|
+----+-------------------+
</code></pre></div></div>
<p>Turns out we don’t get any of those three sensible results: dates overflow into the following month, which, to me, is pretty insane.</p>
<p><strong>Update (October 2021): This has since been fixed and these invalid dates no longer get inferred as dates, but as strings instead (using Spark 3.2.0)</strong></p>
<p>Let’s try and hint that we have dates in our data. At this point only the first two options apply: error or null.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">schema</span> <span class="o">=</span> <span class="n">t</span><span class="p">.</span><span class="n">StructType</span><span class="p">([</span>
<span class="n">t</span><span class="p">.</span><span class="n">StructField</span><span class="p">(</span><span class="s">'name'</span><span class="p">,</span> <span class="n">t</span><span class="p">.</span><span class="n">StringType</span><span class="p">(),</span> <span class="bp">True</span><span class="p">),</span>
<span class="n">t</span><span class="p">.</span><span class="n">StructField</span><span class="p">(</span><span class="s">'date'</span><span class="p">,</span> <span class="n">t</span><span class="p">.</span><span class="n">DateType</span><span class="p">(),</span> <span class="bp">True</span><span class="p">),</span>
<span class="p">])</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="p">.</span><span class="n">read</span><span class="p">.</span><span class="n">schema</span><span class="p">(</span><span class="n">schema</span><span class="p">).</span><span class="n">option</span><span class="p">(</span><span class="s">'header'</span><span class="p">,</span> <span class="bp">True</span><span class="p">).</span><span class="n">csv</span><span class="p">(</span><span class="s">'data/dates.csv'</span><span class="p">)</span>
<span class="n">df</span><span class="p">.</span><span class="n">printSchema</span><span class="p">()</span>
<span class="n">df</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root
|-- name: string (nullable = true)
|-- date: date (nullable = true)
+----+----------+
|name| date|
+----+----------+
|john|2019-01-01|
|jane|2019-03-02|
| doe|2019-03-01|
|lily|2019-05-01|
+----+----------+
</code></pre></div></div>
<p>And yet… nothing, it still overflows.</p>
<p><strong>Update (October 2021): This, again, has been fixed in Spark 3.0 and the exception this triggers nicely says so, even letting you know how you can revert to its past behaviour.</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>You may get a different result due to the upgrading of Spark 3.0: Fail to parse '2019-02-30' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
</code></pre></div></div>
<p>Let’s see what pandas does (type inference is not shown here, but pandas just assumes they are plain strings in that case). It errs as expected. This way we don’t lose anything and we’re forced to deal with the issue.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">pandas</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'data/dates.csv'</span><span class="p">)</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">pandas</span><span class="p">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">'date'</span><span class="p">])</span>
<span class="k">except</span> <span class="nb">ValueError</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="n">e</span><span class="p">)</span> <span class="c1"># just for output brevity
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>day is out of range for month: 2019-02-30
</code></pre></div></div>
<p>As I noted above - if you swap days and months, you can easily move years into the future. If you place garbage in any of the fields, you can move by a century.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">'data/dates_swap.csv'</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">fw</span><span class="p">:</span>
<span class="n">fw</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="s">'''name,date
john,2019-31-12
jane,2019-999-02
doe,2019-02-22'''</span><span class="p">)</span>
<span class="c1"># luckily, inferSchema will work fine (will think it's a string)
</span><span class="n">schema</span> <span class="o">=</span> <span class="n">t</span><span class="p">.</span><span class="n">StructType</span><span class="p">([</span><span class="n">t</span><span class="p">.</span><span class="n">StructField</span><span class="p">(</span><span class="s">'name'</span><span class="p">,</span> <span class="n">t</span><span class="p">.</span><span class="n">StringType</span><span class="p">(),</span> <span class="bp">True</span><span class="p">),</span> <span class="n">t</span><span class="p">.</span><span class="n">StructField</span><span class="p">(</span><span class="s">'date'</span><span class="p">,</span> <span class="n">t</span><span class="p">.</span><span class="n">DateType</span><span class="p">(),</span> <span class="bp">True</span><span class="p">)])</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="p">.</span><span class="n">read</span><span class="p">.</span><span class="n">schema</span><span class="p">(</span><span class="n">schema</span><span class="p">).</span><span class="n">option</span><span class="p">(</span><span class="s">'header'</span><span class="p">,</span> <span class="bp">True</span><span class="p">).</span><span class="n">csv</span><span class="p">(</span><span class="s">'data/dates_swap.csv'</span><span class="p">)</span>
<span class="n">df</span><span class="p">.</span><span class="n">printSchema</span><span class="p">()</span>
<span class="n">df</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root
|-- name: string (nullable = true)
|-- date: date (nullable = true)
+----+----------+
|name| date|
+----+----------+
|john|2021-07-12|
|jane|2102-03-02|
| doe|2019-02-22|
+----+----------+
</code></pre></div></div>
<p>Digging into Spark’s source, we can trace this back to Java’s standard library. The API that gets used is not generally recommended:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>scala> import java.sql.Date
import java.sql.Date
scala> Date.valueOf("2019-02-29")
val res0: java.sql.Date = 2019-03-01
</code></pre></div></div>
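<p>For contrast, the newer <code class="language-plaintext highlighter-rouge">java.time</code> API refuses such dates outright, with a message along these lines:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>scala> java.time.LocalDate.parse("2019-02-29")
java.time.format.DateTimeParseException: Text '2019-02-29' could not be parsed: Invalid date 'February 29' as '2019' is not a leap year
</code></pre></div></div>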
<p>This issue is being addressed within Apache Spark, the forthcoming major version (3) <a href="https://issues.apache.org/jira/browse/SPARK-26178">will use a different Java API</a> and will change the behaviour described above. When letting Spark infer types automatically, it will not assume dates and will choose strings instead (no data loss). When supplying a schema, it will throw away all the invalid dates, i.e. we’re losing data without any warnings.</p>
<p>Back in our current Spark 2.4.5, we’re left with not just <code class="language-plaintext highlighter-rouge">java.sql.Date</code>, but also other implementations, because e.g. using Spark SQL directly won’t trigger this issue. Go figure.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">spark</span><span class="p">.</span><span class="n">sql</span><span class="p">(</span><span class="s">'select timestamp("2019-02-29 12:59:34") as bad_timestamp'</span><span class="p">).</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>+-------------+
|bad_timestamp|
+-------------+
| null|
+-------------+
</code></pre></div></div>
<p>When researching these issues, I stumbled <a href="https://issues.apache.org/jira/browse/SPARK-12045">upon a comment</a> by Michael Armbrust (a hugely influential person in the Spark community and a great speaker)</p>
<blockquote>
<p>Our general policy for exceptions is that we return null for data dependent problems (i.e. a date string that doesn’t parse) and we throw an AnalysisException for static problems with the query (like an invalid format string).</p>
</blockquote>
<p>Which just about sums up the issues raised earlier: returning a NULL is preferred over dealing with the problem. You’ll see that yourself when running Spark at scale - it won’t fail too often, but it will lose your data instead.</p>
<p>The major issue with this is that not only is it wrong… the result is also a perfectly valid date, so the usual mechanisms for detecting issues - looking for nulls, infinities, warnings etc. - none of that works, because the output looks just fine.</p>
<p><strong>Update (October 2021): This is partially resolved in <a href="https://issues.apache.org/jira/browse/SPARK-35030">SPARK-35030</a> with the introduction of ANSI SQL compliance. This is, however, turned off by default, see <a href="https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html">its docs</a> for more details.</strong></p>
<h3 id="fun-tidbit---spark-overflows-time-as-well">Fun tidbit - Spark overflows time as well</h3>
<p>In case you were wondering, you can place garbage into time as well; it will overflow just like dates do. (And just like the date issue, it will get addressed in Spark 3.0.)</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">'data/times.csv'</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">fw</span><span class="p">:</span>
<span class="n">fw</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="s">'''name,date
john,2019-01-01 12:34:22
jane,2019-02-14 25:03:65
doe,2019-05-30 23:59:00'''</span><span class="p">)</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="p">.</span><span class="n">read</span><span class="p">.</span><span class="n">option</span><span class="p">(</span><span class="s">'inferSchema'</span><span class="p">,</span> <span class="bp">True</span><span class="p">).</span><span class="n">option</span><span class="p">(</span><span class="s">'header'</span><span class="p">,</span> <span class="bp">True</span><span class="p">).</span><span class="n">csv</span><span class="p">(</span><span class="s">'data/times.csv'</span><span class="p">)</span>
<span class="n">df</span><span class="p">.</span><span class="n">printSchema</span><span class="p">()</span>
<span class="n">df</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root
|-- name: string (nullable = true)
|-- date: timestamp (nullable = true)
+----+-------------------+
|name| date|
+----+-------------------+
|john|2019-01-01 12:34:22|
|jane|2019-02-15 01:04:05|
| doe|2019-05-30 23:59:00|
+----+-------------------+
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">'data/times99.csv'</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">fw</span><span class="p">:</span>
<span class="n">fw</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="s">'''name,date
john,2019-01-01 12:34:22
jane,2019-02-14 13:303:65
doe,2019-05-30 984:76:44'''</span><span class="p">)</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="p">.</span><span class="n">read</span><span class="p">.</span><span class="n">option</span><span class="p">(</span><span class="s">'inferSchema'</span><span class="p">,</span> <span class="bp">True</span><span class="p">).</span><span class="n">option</span><span class="p">(</span><span class="s">'header'</span><span class="p">,</span> <span class="bp">True</span><span class="p">).</span><span class="n">csv</span><span class="p">(</span><span class="s">'data/times99.csv'</span><span class="p">)</span>
<span class="n">df</span><span class="p">.</span><span class="n">printSchema</span><span class="p">()</span>
<span class="n">df</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root
|-- name: string (nullable = true)
|-- date: timestamp (nullable = true)
+----+-------------------+
|name| date|
+----+-------------------+
|john|2019-01-01 12:34:22|
|jane|2019-02-14 18:04:05|
| doe|2019-07-10 01:16:44|
+----+-------------------+
</code></pre></div></div>
<p><strong>Update (October 2021): As noted, this has since been addressed and it’s no longer an issue in Spark 3.</strong></p>
<h3 id="funner-tidbit---casting-is-different">Funner tidbit - casting is different</h3>
<p>While we did get date overflows when reading data straight into dates, what if we read in strings and only later decide to cast them to dates? It should do the same thing, right? (I guess you see where I’m going with this.)</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fn</span> <span class="o">=</span> <span class="s">'data/dates.csv'</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="p">.</span><span class="n">read</span><span class="p">.</span><span class="n">option</span><span class="p">(</span><span class="s">'header'</span><span class="p">,</span> <span class="bp">True</span><span class="p">).</span><span class="n">csv</span><span class="p">(</span><span class="n">fn</span><span class="p">)</span>
<span class="n">df</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>+----+----------+
|name| date|
+----+----------+
|john|2019-01-01|
|jane|2019-02-30|
| doe|2019-02-29|
|lily|2019-04-31|
+----+----------+
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">.</span><span class="n">select</span><span class="p">(</span><span class="n">f</span><span class="p">.</span><span class="n">col</span><span class="p">(</span><span class="s">'date'</span><span class="p">).</span><span class="n">cast</span><span class="p">(</span><span class="s">'date'</span><span class="p">)).</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>+----------+
| date|
+----------+
|2019-01-01|
| null|
| null|
| null|
+----------+
</code></pre></div></div>
<p><strong>Update (October 2021): Spark 3.2.0 has a new ANSI SQL compliance mode (<code class="language-plaintext highlighter-rouge">spark.sql.ansi.enabled</code>). If you turn it on (it’s off by default), it will trigger an exception in this case.</strong></p>
<h3 id="integer-overflows">Integer overflows</h3>
<p>I already touched on overflows earlier; let’s put them to the test. Let’s say we have a few numbers and one of them is too large for a 32-bit integer. Let’s tell Spark it’s an integer anyway and see what happens.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fn</span> <span class="o">=</span> <span class="s">'data/integers.csv'</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">fn</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">fw</span><span class="p">:</span>
<span class="n">fw</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="s">'id</span><span class="se">\n</span><span class="s">123123</span><span class="se">\n</span><span class="s">123123123</span><span class="se">\n</span><span class="s">123123123123'</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">schema</span> <span class="o">=</span> <span class="n">t</span><span class="p">.</span><span class="n">StructType</span><span class="p">([</span>
<span class="n">t</span><span class="p">.</span><span class="n">StructField</span><span class="p">(</span><span class="s">'id'</span><span class="p">,</span> <span class="n">t</span><span class="p">.</span><span class="n">IntegerType</span><span class="p">(),</span> <span class="bp">False</span><span class="p">)</span>
<span class="p">])</span>
<span class="n">spark</span><span class="p">.</span><span class="n">read</span><span class="p">.</span><span class="n">option</span><span class="p">(</span><span class="s">'header'</span><span class="p">,</span> <span class="bp">True</span><span class="p">).</span><span class="n">schema</span><span class="p">(</span><span class="n">schema</span><span class="p">).</span><span class="n">csv</span><span class="p">(</span><span class="n">fn</span><span class="p">).</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>+---------+
| id|
+---------+
| 123123|
|123123123|
| null|
+---------+
</code></pre></div></div>
<p>Okay, we get a null - fair(-ish). This is the same reason as noted above in the Armbrust quote - Spark prefers nulls over exceptions. (The number would load just fine if we specified longs instead of integers.)</p>
<p>Let’s digest what this means - if you have a column of integers, say an incrementing ID, and your source gets so popular that the IDs no longer fit into an int32, Spark will start throwing data out without telling you. You have to ensure you have data quality mechanisms in place to catch this.</p>
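<p>A minimal sketch of one such check - the variable names are made up, the column is from the example above: cast the raw strings yourself and verify the cast didn’t null anything that wasn’t null before.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>raw = spark.read.option('header', True).csv(fn)  # everything read as strings
typed = raw.withColumn('id_int', f.col('id').cast('int'))
lost = typed.filter(f.col('id').isNotNull() &amp; f.col('id_int').isNull()).count()
assert lost == 0, '%d value(s) silently nulled by the int cast' % lost
</code></pre></div></div>
<p>Running it on the file above trips the assertion - which is exactly the point.</p>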
<p>Note that if you do supply a schema upon data loading, pandas is no better - at least with the venerable numpy type offering, it silently wraps the value around instead of erroring.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="n">pandas</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">fn</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="p">{</span><span class="s">'id'</span><span class="p">:</span> <span class="n">numpy</span><span class="p">.</span><span class="n">int32</span><span class="p">}))</span> <span class="c1"># is this better? not really
</span></code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> id
0 123123
1 123123123
2 -1430928461
</code></pre></div></div>
<p>You have to resort to pandas 1.x extension array type annotations to avoid these issues.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">try</span><span class="p">:</span>
<span class="n">pandas</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">fn</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="p">{</span><span class="s">'id'</span><span class="p">:</span> <span class="s">'Int32'</span><span class="p">})</span>
<span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="n">e</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cannot safely cast non-equivalent int64 to int32
</code></pre></div></div>
<p>(Type inference “solves” the issue for both Spark and pandas.)</p>
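<p>For the curious, inference on the file above widens the column instead of overflowing - something along these lines:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>spark.read.option('header', True).option('inferSchema', True).csv(fn).printSchema()
# root
#  |-- id: long (nullable = true)
</code></pre></div></div>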
<h2 id="lesson-3-local-problems-can-get-global">Lesson 3: local problems can get global</h2>
<p>You might be thinking: hey, problems in some random column don’t concern me - I’m all about <code class="language-plaintext highlighter-rouge">amount_paid</code> and <code class="language-plaintext highlighter-rouge">client_id</code>, both of which are just fine. Well, not so fast.</p>
<ol>
<li>If you filter on these mangled columns (e.g. “give me all the invoices where <code class="language-plaintext highlighter-rouge">amount_paid</code> is so and so”) - you will lose data if this column is not parsed properly.</li>
<li>This mad thing you’re about to read below.</li>
</ol>
<p>Let’s have a dataset of three integer columns. Or at least these have always been integers, but somehow you got a float in there (never mind that its value equals the integer). We’ll illustrate how this can easily happen in real life.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">schema</span> <span class="o">=</span> <span class="n">t</span><span class="p">.</span><span class="n">StructType</span><span class="p">([</span>
<span class="n">t</span><span class="p">.</span><span class="n">StructField</span><span class="p">(</span><span class="s">'a'</span><span class="p">,</span> <span class="n">t</span><span class="p">.</span><span class="n">IntegerType</span><span class="p">(),</span> <span class="bp">True</span><span class="p">),</span>
<span class="n">t</span><span class="p">.</span><span class="n">StructField</span><span class="p">(</span><span class="s">'b'</span><span class="p">,</span> <span class="n">t</span><span class="p">.</span><span class="n">IntegerType</span><span class="p">(),</span> <span class="bp">True</span><span class="p">),</span>
<span class="n">t</span><span class="p">.</span><span class="n">StructField</span><span class="p">(</span><span class="s">'c'</span><span class="p">,</span> <span class="n">t</span><span class="p">.</span><span class="n">IntegerType</span><span class="p">(),</span> <span class="bp">True</span><span class="p">),</span>
<span class="p">])</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">'data/null_row.csv'</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">fw</span><span class="p">:</span>
<span class="n">fw</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="s">'''a,b,c
10,11,12
13,14,15.0'''</span><span class="p">)</span>
<span class="n">spark</span><span class="p">.</span><span class="n">read</span><span class="p">.</span><span class="n">schema</span><span class="p">(</span><span class="n">schema</span><span class="p">).</span><span class="n">option</span><span class="p">(</span><span class="s">'header'</span><span class="p">,</span> <span class="bp">True</span><span class="p">).</span><span class="n">csv</span><span class="p">(</span><span class="s">'data/null_row.csv'</span><span class="p">).</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>+----+----+----+
| a| b| c|
+----+----+----+
| 10| 11| 12|
|null|null|null|
+----+----+----+
</code></pre></div></div>
<p>The crazy thing that just happened: because <code class="language-plaintext highlighter-rouge">c</code>’s second value is a float instead of an integer, Spark decided to <em>throw away the whole row</em> - so if you’re filtering/aggregating on other columns, your analysis is already wrong (this did bite us <em>hard</em> at work).</p>
<p>Luckily, this gets fixed in the upcoming major version of Spark (3.0; I couldn’t find the ticket just now, but it does get resolved) and only that one single value is NULLed.</p>
<p><strong>Update (October 2021): this has been fixed in Spark 3.</strong></p>
<p>Also, like with some (sadly, not all) of the issues described here, you can catch these nasty bugs by turning the permissiveness down (mode=FAILFAST in the case of file I/O).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">try</span><span class="p">:</span>
<span class="n">spark</span><span class="p">.</span><span class="n">read</span><span class="p">.</span><span class="n">schema</span><span class="p">(</span><span class="n">schema</span><span class="p">).</span><span class="n">option</span><span class="p">(</span><span class="s">'header'</span><span class="p">,</span> <span class="bp">True</span><span class="p">).</span><span class="n">option</span><span class="p">(</span><span class="s">'mode'</span><span class="p">,</span> <span class="s">'FAILFAST'</span><span class="p">).</span><span class="n">csv</span><span class="p">(</span><span class="s">'data/null_row.csv'</span><span class="p">).</span><span class="n">show</span><span class="p">()</span>
<span class="k">except</span> <span class="nb">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">e</span><span class="p">)[:</span><span class="mi">500</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>An error occurred while calling o114.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 17.0 failed 1 times, most recent failure: Lost task 0.0 in stage 17.0 (TID 17, localhost, executor driver): org.apache.spark.SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST.
at org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:70)
at org.apache.spark.sql.execution.datasources.csv.Uni
</code></pre></div></div>
<h2 id="lesson-4-csvs-are-fun-until-they-arent">Lesson 4: CSVs are fun until they aren’t</h2>
<p>CSVs look like such a simple format. A bunch of values separated by commas and newlines - hardly complicated, it seems. But that’s only half of the story; there are two format aspects that don’t get much attention.</p>
<ol>
<li>If you want to use quotes in fields, they need to be escaped <em>by another quote</em>, not by a backslash like in most other places.</li>
<li>You can have newlines in fields. As many as you want in fact. You only need to quote your field first.</li>
</ol>
<p>There isn’t a standard per se, but <a href="https://tools.ietf.org/html/rfc4180">RFC 4180</a> is the closest we’ve got. Let’s now leverage the fact that these rules are not always adhered to.</p>
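<p>To illustrate the first rule - a field containing both a comma and quotes gets written like this under RFC 4180:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>name,quote
joe,"she said ""hi"", and left"
</code></pre></div></div>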
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">csv</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fn</span> <span class="o">=</span> <span class="s">'data/quote.csv'</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">fn</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">fw</span><span class="p">:</span>
<span class="n">cw</span> <span class="o">=</span> <span class="n">csv</span><span class="p">.</span><span class="n">writer</span><span class="p">(</span><span class="n">fw</span><span class="p">)</span>
<span class="n">cw</span><span class="p">.</span><span class="n">writerow</span><span class="p">([</span><span class="s">'name'</span><span class="p">,</span> <span class="s">'age'</span><span class="p">])</span>
<span class="n">cw</span><span class="p">.</span><span class="n">writerow</span><span class="p">([</span><span class="s">'Joe "analyst, among other things" Sullivan'</span><span class="p">,</span> <span class="s">'56'</span><span class="p">])</span>
<span class="n">cw</span><span class="p">.</span><span class="n">writerow</span><span class="p">([</span><span class="s">'Jane Doe'</span><span class="p">,</span> <span class="s">'44'</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">pandas</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">fn</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> name age
0 Joe "analyst, among other things" Sullivan 56
1 Jane Doe 44
</code></pre></div></div>
<p>So here we have a CSV with some people and their ages; let’s ask pandas what their average age is.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">.</span><span class="n">age</span><span class="p">.</span><span class="n">mean</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>50.0
</code></pre></div></div>
<p>Now let’s ask Spark the very same question.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="p">.</span><span class="n">read</span><span class="p">.</span><span class="n">option</span><span class="p">(</span><span class="s">'header'</span><span class="p">,</span> <span class="bp">True</span><span class="p">).</span><span class="n">option</span><span class="p">(</span><span class="s">'inferSchema'</span><span class="p">,</span> <span class="bp">True</span><span class="p">).</span><span class="n">csv</span><span class="p">(</span><span class="n">fn</span><span class="p">)</span>
<span class="n">df</span><span class="p">.</span><span class="n">select</span><span class="p">(</span><span class="n">f</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="s">'age'</span><span class="p">)).</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>+--------+
|avg(age)|
+--------+
| 44.0|
+--------+
</code></pre></div></div>
<p>The reason is that Spark, <em>by default</em>, expects quotes to be escaped with a backslash rather than with a doubled quote, so instead of parsing the first non-header row as two fields, it spilled the name field into the age field and thus threw away the age information.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>+--------------+--------------------+
| name| age|
+--------------+--------------------+
|"Joe ""analyst| among other thin...|
| Jane Doe| 44|
+--------------+--------------------+
</code></pre></div></div>
<p>Not only did it completely destroy the file, it also didn’t complain during the aggregation!</p>
<p>And this continues on if we finish the pipeline - if we actually write the data some place, it gets further destroyed.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">.</span><span class="n">write</span><span class="p">.</span><span class="n">mode</span><span class="p">(</span><span class="s">'overwrite'</span><span class="p">).</span><span class="n">option</span><span class="p">(</span><span class="s">'header'</span><span class="p">,</span> <span class="bp">True</span><span class="p">).</span><span class="n">csv</span><span class="p">(</span><span class="s">'data/write1'</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rfn</span> <span class="o">=</span> <span class="p">[</span><span class="n">j</span> <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="n">os</span><span class="p">.</span><span class="n">listdir</span><span class="p">(</span><span class="s">'data/write1'</span><span class="p">)</span> <span class="k">if</span> <span class="n">j</span><span class="p">.</span><span class="n">endswith</span><span class="p">(</span><span class="s">'.csv'</span><span class="p">)][</span><span class="mi">0</span><span class="p">]</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="n">pandas</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="s">'data/write1/'</span><span class="p">,</span> <span class="n">rfn</span><span class="p">)))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> name age
0 \Joe \"\"analyst" among other things\\" Sullivan\""
1 Jane Doe 44
</code></pre></div></div>
<p>At this point we used Spark to load a CSV and write it back out and in the process, we lost data.</p>
<h3 id="failfast-to-the-not-rescue">FAILFAST to the not rescue</h3>
<p>I noted earlier that the <code class="language-plaintext highlighter-rouge">FAILFAST</code> mode is quite helpful, though since it’s not the default, it doesn’t get used as much as I’d like. Let’s see if it helps here.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="p">.</span><span class="n">read</span><span class="p">.</span><span class="n">option</span><span class="p">(</span><span class="s">'header'</span><span class="p">,</span> <span class="bp">True</span><span class="p">).</span><span class="n">option</span><span class="p">(</span><span class="s">'mode'</span><span class="p">,</span> <span class="s">'FAILFAST'</span><span class="p">).</span><span class="n">option</span><span class="p">(</span><span class="s">'inferSchema'</span><span class="p">,</span> <span class="bp">True</span><span class="p">).</span><span class="n">csv</span><span class="p">(</span><span class="n">fn</span><span class="p">)</span>
<span class="n">df</span><span class="p">.</span><span class="n">select</span><span class="p">(</span><span class="n">f</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="s">'age'</span><span class="p">)).</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>+--------+
|avg(age)|
+--------+
| 44.0|
+--------+
</code></pre></div></div>
<p>FAILFAST didn’t save us - every row still tokenized “successfully”, just wrongly, because Spark’s CSV reader defaults to a backslash escape character rather than the RFC 4180 style of doubled quotes this file uses. The mode only catches rows that can’t be parsed at all, not rows that parse into garbage. You need to adjust the escaping in the Spark CSV reader to get this right.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="p">.</span><span class="n">read</span><span class="p">.</span><span class="n">option</span><span class="p">(</span><span class="s">'header'</span><span class="p">,</span> <span class="bp">True</span><span class="p">).</span><span class="n">option</span><span class="p">(</span><span class="s">'escape'</span><span class="p">,</span> <span class="s">'"'</span><span class="p">).</span><span class="n">option</span><span class="p">(</span><span class="s">'inferSchema'</span><span class="p">,</span> <span class="bp">True</span><span class="p">).</span><span class="n">csv</span><span class="p">(</span><span class="n">fn</span><span class="p">)</span>
<span class="n">df</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>+--------------------+---+
| name|age|
+--------------------+---+
|Joe "analyst, amo...| 56|
| Jane Doe| 44|
+--------------------+---+
</code></pre></div></div>
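<p>One cheap safeguard against this class of bugs is a roundtrip test: write the dataframe back out with the same quoting settings and check that you can read it back unchanged. A minimal sketch, reusing the fixed <code class="language-plaintext highlighter-rouge">df</code> from above and a scratch path of my choosing (<code class="language-plaintext highlighter-rouge">exceptAll</code> needs Spark 2.4+):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>df.write.mode('overwrite').option('header', True).option('escape', '"').csv('data/roundtrip')

df2 = spark.read.option('header', True).option('escape', '"').option('inferSchema', True).csv('data/roundtrip')

# the symmetric difference should be empty if nothing was mangled
assert df.count() == df2.count()
assert df.exceptAll(df2).count() == 0 and df2.exceptAll(df).count() == 0
</code></pre></div></div>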
<h2 id="lesson-5-csvs-are-really-not-trivial">Lesson 5: CSVs are really not trivial</h2>
<p>It’s easy to pick on one technology, so let’s look at another implementation of CSV I/O - <code class="language-plaintext highlighter-rouge">encoding/csv</code>, Go’s standard library implementation. The issue presents itself when you have a single column of data - which happens all the time - and some of the values are missing. Let’s take this example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>john
jane

jack
</code></pre></div></div>
<p>It’s a simple dataset of four values, where the third one is missing. If we write this dataset using <code class="language-plaintext highlighter-rouge">encoding/csv</code> and read it back in, we won’t get the same thing - it fails what’s called a roundtrip test. Here’s my <a href="https://github.com/golang/go/issues/39119">longer explanation</a> together with a snippet of code.</p>
<p>This fails because <code class="language-plaintext highlighter-rouge">encoding/csv</code> skips all empty rows - but in this case, an empty row <em>is</em> a datapoint. The implementation departs from RFC 4180 in a very subtle way, which, in our case, leads to loss of data.</p>
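<p>For contrast, here’s the same roundtrip in Python - a sketch using the standard <code class="language-plaintext highlighter-rouge">csv</code> module, which, as far as I can tell, sidesteps the ambiguity by quoting a lone empty field on write:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import csv
import io

rows = [['john'], ['jane'], [''], ['jack']]  # third value missing

buf = io.StringIO()
csv.writer(buf).writerows(rows)  # the empty field is written as "", not as a blank line

buf.seek(0)
roundtripped = list(csv.reader(buf))
assert roundtripped == rows, roundtripped  # passes - nothing was lost
</code></pre></div></div>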
<p>I reported the issue, submitted a PR, but was told it probably won’t (ever) be fixed.</p>
<h2 id="lesson-6-all-bets-are-off">Lesson 6: all bets are off</h2>
<p>So far I’ve shown you actual cases of data loss that I unfortunately experienced first hand. Now let’s look at a hypothetical - what if someone knew about this problem and wanted to exploit it? Exploit a CSV parser? Do tell!</p>
<p>Let’s look at a few pizza orders.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">pandas</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">fn</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s">'note'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'note'</span><span class="p">].</span><span class="nb">str</span><span class="p">[:</span><span class="mi">20</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> name date pizzas note
0 john 2019-09-30 3 cash payment
1 jane 2019-09-13 5 2nd floor, apt 2C
2 wendy 2019-08-30 1 no olives, moar chee
</code></pre></div></div>
<p>How many pizzas did people order?</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">.</span><span class="n">pizzas</span><span class="p">.</span><span class="nb">sum</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>9
</code></pre></div></div>
<p>The correct answer is… well, it depends on who you ask 😈</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sdf</span> <span class="o">=</span> <span class="n">spark</span><span class="p">.</span><span class="n">read</span><span class="p">.</span><span class="n">option</span><span class="p">(</span><span class="s">'header'</span><span class="p">,</span> <span class="bp">True</span><span class="p">).</span><span class="n">option</span><span class="p">(</span><span class="s">'mode'</span><span class="p">,</span> <span class="s">'FAILFAST'</span><span class="p">).</span><span class="n">option</span><span class="p">(</span><span class="s">'inferSchema'</span><span class="p">,</span> <span class="bp">True</span><span class="p">).</span><span class="n">csv</span><span class="p">(</span><span class="n">fn</span><span class="p">)</span>
<span class="n">sdf</span><span class="p">.</span><span class="n">agg</span><span class="p">(</span><span class="n">f</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="s">'pizzas'</span><span class="p">)).</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>+-----------+
|sum(pizzas)|
+-----------+
| 127453|
+-----------+
</code></pre></div></div>
<p>So now instead of nine pizzas, we need to fulfil 127 <em>thousand</em> orders, that’s a tall order (sorry not sorry).</p>
<p>Wonder what happened? It’s not much clearer when we look at the dataset.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sdf</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>+--------+-------------------+------+--------------------+
| name| date|pizzas| note|
+--------+-------------------+------+--------------------+
| john|2019-09-30 00:00:00| 3| cash payment|
| jane|2019-09-13 00:00:00| 5| 2nd floor, apt 2C|
| wendy|2019-08-30 00:00:00| 1|no olives, moar c...|
| rick| null| 123| null|
| suzie| null|112233| null|
| jim| null| 13593| null|
|samantha| null| 29| null|
| james| null| 1000| null|
| roland| null| 135| null|
| ellie| null| 331| "|
+--------+-------------------+------+--------------------+
</code></pre></div></div>
<p>It’s a touch clearer if we look at the file itself. In the third order, I played with the note value: I made it span multiple lines (a completely valid CSV value) and leveraged the fact that Spark, by default, reads CSVs line by line and will interpret all the subsequent lines (of our note!) as new records. I can fabricate a lot of data this way.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">'data/pizzas.csv'</span><span class="p">)</span> <span class="k">as</span> <span class="n">fr</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="n">fr</span><span class="p">.</span><span class="n">read</span><span class="p">())</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>name,date,pizzas,note
john,2019-09-30,3,cash payment
jane,2019-09-13,5,"2nd floor, apt 2C"
wendy,2019-08-30,1,"no olives, moar cheese
rick,,123,
suzie,,112233,
jim,,13593,
samantha,,29,
james,,1000,
roland,,135,
ellie,,331,"
</code></pre></div></div>
<p>We managed to silently inject data into an analysis by leveraging <em>the most common data format</em> and <em>one of the most used data engineering tools</em>. That’s pretty dangerous.</p>
<p>If we want to fix this in Spark, we need to force it to take multiline values into account.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sdf</span> <span class="o">=</span> <span class="n">spark</span><span class="p">.</span><span class="n">read</span><span class="p">.</span><span class="n">option</span><span class="p">(</span><span class="s">'header'</span><span class="p">,</span> <span class="bp">True</span><span class="p">).</span><span class="n">option</span><span class="p">(</span><span class="s">'multiLine'</span><span class="p">,</span> <span class="bp">True</span><span class="p">).</span><span class="n">option</span><span class="p">(</span><span class="s">'inferSchema'</span><span class="p">,</span> <span class="bp">True</span><span class="p">).</span><span class="n">csv</span><span class="p">(</span><span class="n">fn</span><span class="p">)</span>
<span class="n">sdf</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>+-----+-------------------+------+--------------------+
| name| date|pizzas| note|
+-----+-------------------+------+--------------------+
| john|2019-09-30 00:00:00| 3| cash payment|
| jane|2019-09-13 00:00:00| 5| 2nd floor, apt 2C|
|wendy|2019-08-30 00:00:00| 1|no olives, moar c...|
+-----+-------------------+------+--------------------+
</code></pre></div></div>
<p>(Note that multiline values break Unix tools like grep or cut, because these don’t really operate on CSVs - they are line readers.)</p>
<h2 id="bonus-a-magic-trick">Bonus: a magic trick</h2>
<p>With all the standard examples out of the way, let’s now try a fun trick I learned by accident, when I wanted to demonstrate something completely different.</p>
<p>Let’s look at a small dataset - a list of children. What’s their average age?</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fn</span> <span class="o">=</span> <span class="s">'data/trick.csv'</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">fn</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">fw</span><span class="p">:</span>
<span class="n">fw</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="s">'''name,age
john,1
jane,2
joe,
jill,3
jim,4'''</span><span class="p">)</span>
</code></pre></div></div>
<p>Let’s see, we’d guess 2.5, right? Or maybe 2, if the missing value counted towards the row count. But engines tend to skip missing values in aggregations.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">schema</span> <span class="o">=</span> <span class="n">t</span><span class="p">.</span><span class="n">StructType</span><span class="p">([</span><span class="n">t</span><span class="p">.</span><span class="n">StructField</span><span class="p">(</span><span class="s">'name'</span><span class="p">,</span> <span class="n">t</span><span class="p">.</span><span class="n">StringType</span><span class="p">(),</span> <span class="bp">True</span><span class="p">),</span> <span class="n">t</span><span class="p">.</span><span class="n">StructField</span><span class="p">(</span><span class="s">'age'</span><span class="p">,</span> <span class="n">t</span><span class="p">.</span><span class="n">IntegerType</span><span class="p">(),</span> <span class="bp">True</span><span class="p">)])</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="p">.</span><span class="n">read</span><span class="p">.</span><span class="n">option</span><span class="p">(</span><span class="s">'header'</span><span class="p">,</span> <span class="bp">True</span><span class="p">).</span><span class="n">schema</span><span class="p">(</span><span class="n">schema</span><span class="p">).</span><span class="n">csv</span><span class="p">(</span><span class="n">fn</span><span class="p">)</span>
<span class="n">df</span><span class="p">.</span><span class="n">agg</span><span class="p">(</span><span class="n">f</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="s">'age'</span><span class="p">)).</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>+--------+
|avg(age)|
+--------+
| 2.5|
+--------+
</code></pre></div></div>
<p>Now let’s narrow the dataset down to only 1000 rows and ask Spark again. Just to be sure, let’s do the same thing twice!</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">2</span><span class="p">):</span>
<span class="n">pandas</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'data/trick.csv'</span><span class="p">).</span><span class="n">head</span><span class="p">(</span><span class="mi">1000</span><span class="p">).</span><span class="n">to_csv</span><span class="p">(</span><span class="s">'data/trick.csv'</span><span class="p">)</span> <span class="c1"># lossless, right?
</span>
<span class="c1"># this is the exact same code as above
</span> <span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="p">.</span><span class="n">read</span><span class="p">.</span><span class="n">option</span><span class="p">(</span><span class="s">'header'</span><span class="p">,</span> <span class="bp">True</span><span class="p">).</span><span class="n">schema</span><span class="p">(</span><span class="n">schema</span><span class="p">).</span><span class="n">csv</span><span class="p">(</span><span class="n">fn</span><span class="p">)</span>
<span class="n">df</span><span class="p">.</span><span class="n">agg</span><span class="p">(</span><span class="n">f</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="s">'age'</span><span class="p">)).</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>+--------+
|avg(age)|
+--------+
| null|
+--------+
+--------+
|avg(age)|
+--------+
| 2.0|
+--------+
</code></pre></div></div>
<p>We managed to get three different answers from a seemingly identical dataset. <em>How?</em></p>
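<p>If you’d rather have the trick exposed: <code class="language-plaintext highlighter-rouge">to_csv</code> writes the row index by default, so each “lossless” roundtrip prepends an extra column, and the missing value turns the integer ages into floats. As far as I can reconstruct it, Spark then applies our two-field schema positionally: on the first pass “age” lands on the name strings (hence the null), and on the second pass it lands on the saved index 0-4, whose mean is 2.0. A sketch of the first roundtrip, with the intermediate file shown in comments:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import io

import pandas

csv_data = '''name,age
john,1
jane,2
joe,
jill,3
jim,4'''

df = pandas.read_csv(io.StringIO(csv_data))  # age becomes float64 because of the NaN
buf = io.StringIO()
df.head(1000).to_csv(buf)  # index=True is the default
print(buf.getvalue())
# ,name,age
# 0,john,1.0
# 1,jane,2.0
# 2,joe,
# 3,jill,3.0
# 4,jim,4.0
</code></pre></div></div>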
<h2 id="tidbits-from-other-systems">Tidbits from other systems</h2>
<p>Now we only covered a handful of tools, there are tons more to dissect. Here is just a quick list of notes from using other systems:</p>
<ul>
<li>Be careful when saving the abbreviation of Norway in YAML - it might get turned into <code class="language-plaintext highlighter-rouge">false</code> (see <a href="https://hitchdev.com/strictyaml/why/implicit-typing-removed/">the Norway problem</a> for more details).</li>
<li>Some scikit-learn defaults <a href="https://ryxcommar.com/2019/08/30/scikit-learns-defaults-are-wrong/">have been questioned</a> - do we know what defaults our <em>[complex piece of software]</em> uses?</li>
<li>When you go distributed, a whole host of consistency issues arise. If you like to sleep at night, I don’t recommend Martin Kleppmann’s <a href="http://dataintensive.net/">Designing Data-Intensive Applications</a> or Kyle Kingsbury’s <a href="http://jepsen.io/">Jepsen suite</a>.</li>
<li>Some reconciliation systems are <em>designed</em> to lose data, see <a href="https://dzone.com/articles/conflict-resolution-using-last-write-wins-vs-crdts">Last Write Wins (LWW)</a>.</li>
<li>Building distributed data stores is really hard and some popular ones can lose a bunch of data in some cases. See Jepsen’s reports on MongoDB or Elasticsearch (and keep in mind the version these are valid for).</li>
<li>Parsing JSON is much more difficult than parsing CSVs - the dozens of mainstream parsers don’t seem to agree on what should be done with various inputs (see <a href="http://seriot.ch/parsing_json.php">Parsing JSON is a minefield</a>).</li>
</ul>
<h2 id="the-future">The future</h2>
<p>I’d love to say that all of these issues will get resolved, but I’m not that certain. Some of these technologies had a chance to use major version upgrades to introduce breaking improvements, but they chose not to do so.</p>
<ul>
<li>pandas 1.0 still doesn’t use nullable int columns by default - integers with missing values silently become floats - despite offering this superior functionality as an opt-in (we didn’t explicitly cover this issue, but it was a culprit in one of my examples)</li>
<li>Apache Spark has improved its date handling, but the CSV situation is still dire - my issue has been out there for years now, and v3.0 was the perfect candidate for a breaking improvement, but no dice.</li>
<li>Go’s encoding/csv can still lose data and it seems my battle to improve this has already been lost, so the only option is to fork the package or use a different one.</li>
</ul>
<h2 id="conclusion">Conclusion</h2>
<p>While I described many very specific issues, the moral of this article is more about software abstractions. Do you <em>really</em> know what’s happening when you do things like <code class="language-plaintext highlighter-rouge">pandas.read_csv</code>? Do you understand what gets executed during a Spark query? Frankly, you should keep asking these questions as you use complicated software.</p>
<p>There are tons of abstractions these days and it’s one of the reasons Python has become so popular, but it’s also a very convenient footgun. So many abstractions, so many ways to lose data.</p>
<p>While we did talk about the various ways to mess things up, we didn’t quite cover how to get out of it, that’s perhaps a topic for another day. But there are a few general tips to give you:</p>
<ul>
<li>Correctness should always trump performance. Make sure you only pursue performance once you have ensured your process works correctly.</li>
<li>Explicit is often better than implicit. Sounds like a cliche, but I mean it - abstractions will save you a lot of trouble, until you reach a point where the implicit mechanisms within a giant package are actually working against you. Sometimes it makes sense to get your hands dirty and code it up from scratch.</li>
<li>Fail early, fail often. Don’t catch all the exceptions, don’t log them instead of handling them, don’t throw away invalid data. Crash, fail, err, make sure errors get exposed, so that you can react to them.</li>
<li>Have sensible defaults. It’s not enough your software <em>can</em> work correctly, it’s important it <em>does</em> work correctly from the get go.</li>
<li>The job of (not just) data engineers is to move data <em>reliably</em> - the end user doesn’t care if it’s Spark or Cobol, they want their numbers to line up. Always put your data integrity above your technology on your list of priorities.</li>
<li>Know your tools. Make sure you understand at least a part of what’s happening under the hood, so that you can better diagnose it when it goes wrong. And it will.</li>
</ul>
<p>Last but not least - this article isn’t meant to deter you from using abstractions. Please do, but be careful, get to know them, understand how they work, and good luck!</p>This article reflects software as it was stable at the time of writing (June 2020) and was later updated (in October 2021) to note changes merged in Spark 3 and Spark 3.2.0, two big updates that fix some of the issues noted here. As always, software gets updated and the remaining issues may get resolved by the time you read this.Assorted tech links #42020-05-27T05:03:55+00:002020-05-27T05:03:55+00:00https://kokes.github.io/blog/2020/05/27/assorted-tech-links-4<p>I do quite a bit of reading and watching of tech talks and since I often forget things, I decided to turn this into a sort of bookmarking service.</p>
<hr />
<ul>
<li>With Python 2 about to be deprecated, there are a few talks about migrating to Python 3. There’s Facebook’s take <a href="https://youtu.be/H4SS9yVWJYA">from PyCon 2018</a> and Instagram’s <a href="https://www.youtube.com/watch?v=66XoCk79kjM">from PyCon 2017</a>. Oh, and the most recent one - how <a href="https://www.youtube.com/watch?v=e1vqfBEAkNA&feature=share">Pinterest did the same</a> - though it’s a bit more about the code differences than how they actually migrated.</li>
<li>I expected a dry talk on code reviews and it ended up being <a href="https://youtu.be/iNG1a--SIlk">a really nice overview of culture, soft skills, helping newcomers, dealing with anxiety, GTD etc</a>.</li>
<li>Forget PEP8, version control, reviews or tests. No bash, subprocess run all the things, just good old sloppy code that <em>solves your problem</em>. <a href="https://www.youtube.com/watch?v=Jd8ulMb6_ls">Super fast and excellent</a>.</li>
<li>This is <a href="https://blog.jooq.org/2019/04/09/the-difference-between-sqls-join-on-clause-and-the-where-clause/">a useful reminder</a> that WHERE conditions in SQL are quite different from conditions in JOIN ON clauses.</li>
<li>This is a classic. <a href="https://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html">Why is GNU grep so fast?</a> It does very little and it does this to as few bytes as possible. This advice applies to a much wider range of situations.</li>
</ul>I do quite a bit of reading and watching of tech talks and since I often forget things, I decided to turn this into a sort of bookmarking service.Waiting for a column store in Postgres2020-05-23T05:03:53+00:002020-05-23T05:03:53+00:00https://kokes.github.io/blog/2020/05/23/postgres-column-store<p>Databases have been popular for storing and retrieving information for decades, offering ways to work with all sorts of shapes and sizes of data without specialising in any kind of retrieval or functionality. Over the past decade, there has been an emergence of <em>analytical</em> databases, which leverage the fact that some data, chiefly in reporting and analytics, are written once and read many times. Updates are infrequent, and so are individual row lookups. How are these engines different from what we have in Postgres or MySQL? There are a few concepts to unfold.</p>
<h3 id="rows-and-columns">Rows and columns</h3>
<p>In order to support heterogeneous workloads and strict guarantees (<a href="https://en.wikipedia.org/wiki/ACID">ACID</a>, <a href="https://en.wikipedia.org/wiki/Multiversion_concurrency_control">MVCC</a> and others), databases typically store all the data for a given row in one chunk. A table of names and ages would, in a grossly simplified manner, be stored as <code class="language-plaintext highlighter-rouge">joe,20;jane,24;doug,30;lynn,21</code>, which works alright, but the “issue” is that when we want to query for the average age of all the people in our contacts list, we need to read all the names as well. This is not an issue in this table, but imagine wide tables with a hundred columns (not uncommon in a production setting). You suddenly read a lot more data than you need.</p>
<p>One mitigation is to use indexes. They work great, but they pose a few problems in analytics - 1) You ~duplicate storage, 2) all the table operations are more expensive, because you need to keep indexes in sync, 3) ideally you’d have an index for each column, because you often don’t know what columns you’ll query, thus amplifying issues one and two.</p>
<p>This is where column stores come in. Instead of storing data by rows, they keep column information together, so our example above would yield <code class="language-plaintext highlighter-rouge">joe,jane,doug,lynn;20,24,30,21</code>, and a query for the average age, or for whether we know any Dougs, would be much faster than in a row store. In contrast, looking up all the information for Doug would be fairly slow in a column store, because you need to look in <code class="language-plaintext highlighter-rouge">n</code> places for <code class="language-plaintext highlighter-rouge">n</code> columns. You know, tradeoffs.</p>
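<p>To make the layout difference concrete, here’s a toy illustration in Python - nothing like how real engines store data, but it shows why the access patterns differ:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># row store: each record kept together - great for fetching whole rows
row_store = [('joe', 20), ('jane', 24), ('doug', 30), ('lynn', 21)]

# column store: each column kept together - great for scanning one column
column_store = {
    'name': ['joe', 'jane', 'doug', 'lynn'],
    'age': [20, 24, 30, 21],
}

# average age in the row store touches every record, names included
print(sum(age for _, age in row_store) / len(row_store))

# in the column store, we only read the (contiguous) age column
ages = column_store['age']
print(sum(ages) / len(ages))

# the tradeoff: fetching all of Doug's data needs one lookup per column
idx = column_store['name'].index('doug')
print({col: vals[idx] for col, vals in column_store.items()})
</code></pre></div></div>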
<h3 id="engines-and-storage">Engines and storage</h3>
<p>While we think of databases as homogeneous one-stop shops, they have a few more or less modular pieces. The critical piece here is the storage system - e.g. MySQL has historically had swappable storage engines, which is why it has seen a number of implementations over the years.</p>
<p>The way Postgres users got around it varied. Either new extensions got built - think <a href="https://github.com/citusdata/citus">Citus</a> or <a href="https://github.com/timescale/timescaledb">TimescaleDB</a> - these allow for somewhat native improvements in functionality, but they are not shipped with Postgres, they are not first class citizens, they are often not supported in hosted solutions (like <a href="https://aws.amazon.com/rds/">Amazon RDS</a>) and they need to keep pace with upstream Postgres to remain compatible. The other option is a hard fork, one of the more notable ones being <a href="https://aws.amazon.com/redshift/">Amazon Redshift</a>, Amazon’s data warehousing solution, which started as a fork of Postgres 8 and has lived its own life - new Postgres features didn’t get incorporated and, more importantly, the whole design of this columnar store sacrificed some functionality surrounding integrity and table design (e.g. you can’t change column types on the fly, it’s as frustrating as it sounds).</p>
<p>Postgres has historically taken a different path from MySQL: its own storage system was the only way of maintaining data - that is, until spring 2019, when <a href="https://commitfest.postgresql.org/19/1283/">pluggable storage got committed</a>.</p>
<p>This change has sparked new interest in alternative storage engines, e.g. <a href="https://github.com/EnterpriseDB/zheap">zheap</a>. But the one I’d like to briefly talk about is <a href="https://github.com/greenplum-db/postgres/tree/zedstore">zedstore</a>.</p>
<p>Zedstore finally comes as a first class implementation of a column store with all the guarantees and APIs Postgres currently offers, be it MVCC, ACID, transactional DDL etc. If you like running or testing databases locally, I encourage you to try this one. I’ve never built Postgres myself, but it turned out to be pretty simple.</p>
<p>First clone the repo linked above, configure it using the recommended option noted in the README, and then follow the <a href="https://www.postgresql.org/docs/current/install-procedure.html">upstream installation docs</a>. If you’re already running Postgres, make sure to change the port used by this zedstore branch (I used 5433). Then just create a database, initialise it, and run your new build against this new database on the new port.</p>
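<p>Once it’s up, you can poke at it like any other Postgres instance. Here’s a minimal sketch - the database and table names are made up, and I’m assuming the <code class="language-plaintext highlighter-rouge">CREATE TABLE ... USING zedstore</code> table access method syntax from the branch’s README:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import psycopg2  # any Postgres driver will do

# connect to the zedstore build running on the alternate port
conn = psycopg2.connect(host='localhost', port=5433, dbname='zstest')
conn.autocommit = True

with conn.cursor() as cur:
    # the access method is picked per table, so row-based and columnar
    # tables can coexist in one database
    cur.execute('CREATE TABLE people (name text, age int) USING zedstore')
    cur.execute("INSERT INTO people VALUES ('joe', 20), ('jane', 24)")
    cur.execute('SELECT avg(age) FROM people')
    print(cur.fetchone())
</code></pre></div></div>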
<p>I tested zedstore against a real workload I was running at the time - a set of queries heavy on full table scans, an ideal workload for a column store. I got 3-10x speedups for most aggregation and filter queries. In some cases, I got slowdowns, especially when running expensive calculations like <code class="language-plaintext highlighter-rouge">count(distinct)</code>.</p>
<h3 id="conclusion">Conclusion</h3>
<p>The main advantage of having a column store built into Postgres is that one can easily run hybrid workloads, where you have row-based OLTP-like tables for your day to day operations and then some column store tables/views for analytics and reporting. This approach saves you the hassle of synchronising data between databases, which has been a pain we’ve had to endure for too long. It’s something Microsoft’s SQL Server has offered from quite some time and I’m glad open source databases are catching up.</p>
<p>I think the potential of removing ETL processes is the main benefit of all this, but there are some other considerations I haven’t talked about - notably space savings thanks to on-disk compression, among other things.</p>Databases have been popular for storing and retrieving information for decades, offering ways to work with all sorts of shapes and sizes of data without specialising in any kind of retrieval or functionality. Over the past decade, there has been an emergence of analytical databases, which leverage the fact that some data, chiefly in reporting and analytics, are written once and read many times. Updates are infrequent, and so are individual row lookups. How are these engines different from what we have in Postgres or MySQL? There are a few concepts to unfold.Abandoning Apache Airflow2020-04-30T05:03:53+00:002020-04-30T05:03:53+00:00https://kokes.github.io/blog/2020/04/30/abandoning-apache-airflow<p>I was listening to <a href="https://softwareengineeringdaily.com/2020/04/29/prefect-dataflow-scheduler-with-jeremiah-lowin/">this podcast about Prefect</a>, a dataflow scheduler that tries to learn from the shortcomings in Apache Airflow. It occurred to me that I have discussed our decision to deploy and subsequently abandon Airflow within our company on a number of occasions, so I might as well write it up.</p>
<h3 id="inception">Inception</h3>
<p>We had a job scheduler that triggered various jobs in a system that shall remain nameless (to avoid painful memories). One day we realised we didn’t have to use its enterprise version, because all it brought on top of the community version was a job scheduler. I could do that in a weekend (c)!</p>
<p>At that time (late 2017), <a href="https://airflow.apache.org/">Apache Airflow</a> was the go-to solution, the competition - Luigi and Pinball - wasn’t as mature or lacked key features. I was a bit worried at the start, because Apache projects lean heavily toward the JVM, but this was Python, which we employed for other tasks, so the initial deployment was super easy. Install a package, link it to (our existing) Postgres instance and off we went.</p>
<p>After tinkering with it for a few days, we set up a Jenkins job to deploy this installation and it lived in our infrastructure for some six months and successfully launched tens of thousands of jobs. And then one day, I deleted it all. Here are the reasons for it. <strong>Tl; dr: it did way more than we needed and the complexity got in our way too often during development.</strong></p>
<h3 id="reloading-jobs">Reloading jobs</h3>
<p>Our first hurdle was refreshing the state Airflow operated with. We had all our jobs in a git repo and when we pushed new code, we needed Airflow to reflect it and start scheduling new jobs (or stop scheduling existing ones). That wasn’t a solved problem and we ended up duct-taping together a kill/relaunch sequence (yes, we turned it off and on again).</p>
<p>This was not only suboptimal from a cleanliness perspective, it also didn’t work right at times - there’d be cruft left in the database, some jobs would be left hanging; it was just a silly solution to a very obvious problem.</p>
<h3 id="too-much-airflow-code-in-our-etl">Too much Airflow code in our ETL</h3>
<p>One main advantage of Airflow-like systems is that they decouple the tasks, let you run them, retry them if necessary, and facilitate communication between them (e.g. “I created an S3 object at s3://foo/bar”). Sounds cool, but this turned out to be rather gimmicky and led to more and more integration of Airflow libraries into our internal ETL code. We wanted to keep things separate, but it was just way too intertwined.</p>
<p>Another example is time of execution - instead of using <code class="language-plaintext highlighter-rouge">now()</code>, you’d be provided with a timestamp for the job execution - super handy if your job failed and you needed to backfill it: run it with a “fake” timestamp from the past. Works great in theory, but this was a confusing concept that many (me included) didn’t quite understand, and it led to surprising job invocations and bugs (especially around midnight).</p>
<p>One last hurdle in this regard - and perhaps the biggest - was local debugging. I’m a big believer in programs that are buildable and debuggable on local machines. I don’t want to spawn 12 Docker containers just to run a job; I want to see what’s happening, debug it in a proper way (<code class="language-plaintext highlighter-rouge">print</code> everywhere), and in general be in control. This was not possible with Airflow, because it took over the invocation of jobs and was very cumbersome to clone and debug.</p>
<h3 id="deployment-and-dependencies">Deployment and dependencies</h3>
<p>The way we deployed Airflow (and it’s probably not the way we’d do it now) was we had a Jenkins job that installed it on a long-running EC2 box and it ran the jobs on that machine as it was (Airflow offered workers separate from the scheduler, but we didn’t need that as our jobs were quite lightweight).</p>
<p>Now this worked fine when things were running, but we ran into issues when updating things. Be it Python dependencies or Python itself. Or environment variables. Or the OS. It was this one big persistent thing that operated everything and it was quite difficult to isolate things and to have reproducible runs.</p>
<h3 id="lack-of-integration-into-our-infrastructure">Lack of integration into our infrastructure</h3>
<p>We already had a whole host of technologies within the company for running things (Gitlab, Mesos, Graphite, Grafana, ELK) and our Airflow solution was not integrated with any of those in terms of monitoring, logging, or resource management. Airflow has since gained Kubernetes support, so it’s getting better in this regard, but at the time, we were in the dark and had to build our own Slack integration and other reporting, just so that we’d understand what was going on.</p>
<h3 id="moment-of-realisation">Moment of realisation</h3>
<p>At one point we realised one thing - we weren’t using half of what Airflow offered us, and the remaining features were getting in our way instead of helping us. So was there a better way to do this? We needed a way to schedule things, have them logged and monitored, dependencies handled, etc.</p>
<p>We already had a cron-like solution within our infrastructure, so we just used that. There are no job dependencies and no inter-process communication, just cron schedules and retries on failures - but it ended up working well, we just had to do a few things to make it work.</p>
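<p>The retry part can be as simple as this - a hypothetical sketch, not our actual scheduler:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import time

def run_with_retries(job, attempts=3, backoff_seconds=60):
    # re-running is only safe because our jobs are idempotent
    # (more on that below)
    for attempt in range(1, attempts + 1):
        try:
            return job()
        except Exception:
            if attempt == attempts:
                raise  # let the failure surface, don't swallow it
            time.sleep(backoff_seconds * attempt)
</code></pre></div></div>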
<h3 id="making-all-jobs-independent">Making all jobs independent</h3>
<p>Before Airflow, we’d have to check “is job B running right now? If so, abort”, because some jobs couldn’t overlap. With Airflow, we’d just stitch them one after the other and it worked well. Since our new solution didn’t have this, we had to work around it, which wasn’t all that hard in the end: we made sure that jobs could run concurrently. We realised there shouldn’t be any reason two jobs cannot run concurrently - if we make all jobs atomic and idempotent (more on that later), then a concurrent job will either see old or new data, but nothing in between, so it doesn’t matter which jobs run when.</p>
<p>For this to work, we had to ensure two things:</p>
<ol>
<li>Atomicity - every job would write either everything or nothing. There were no partial results. This is fairly trivial in databases, which are sort of built for this; elsewhere we had to think about it a bit. E.g. in S3, we’d buffer the result locally, or use flags, or use metadata in a database that’d be written only after persisting the data, etc.</li>
<li>Idempotency - this is quite a key property of any job, regardless of the scheduler. All it says is that it doesn’t matter how many times you run a job - it always behaves the same way. This saved us many times.</li>
</ol>
<p>This also meant no more cleanups (previously, we’d always have to make sure we deleted any partial results left by a failure halfway through a job), and everything became more reliable.</p>
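<p>To make the atomicity point concrete, here’s the kind of write-to-a-temp-location-then-rename pattern we’re talking about - a minimal local-filesystem sketch, not our actual code:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import os
import tempfile

def write_atomically(path, data):
    # write into a temp file in the same directory, then rename it into
    # place - os.replace is atomic on POSIX filesystems, so readers see
    # either the old file or the new one, never a partial write
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, 'wb') as fw:
            fw.write(data)
        os.replace(tmp_path, path)
    except BaseException:
        os.remove(tmp_path)  # leave no partial results behind
        raise
</code></pre></div></div>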
<h3 id="chaining-things">Chaining things</h3>
<p>This was tricky and, to be honest, the least satisfactory thing in our solution. How do you compose parts of your jobs in a nice way (the whole DAG spiel)? We sort of did that by breaking up our jobs into smaller pieces, which could be run independently, and when we needed things to run in a given order, we’d just import these jobs into a tiny orchestrator and run them there.</p>
<p>In order for this to work, we had to have a simple system - each job ultimately has a <code class="language-plaintext highlighter-rouge">main()</code> function, which gets run when the main file gets invoked. BUT, you can equally have a different file that imports these various main functions and invokes them instead. So we could compose a number of smaller jobs and run them in parallel if we wanted to - but we mostly didn’t.</p>
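<p>In other words, something like this - a sketch with made-up module names:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># orchestrator.py - composes smaller jobs; the module names are hypothetical
import extract_orders
import load_warehouse

def main():
    # run the jobs in a fixed order; since each one is atomic and
    # idempotent, a failure anywhere can simply be retried from the top
    extract_orders.main()
    load_warehouse.main()

if __name__ == '__main__':
    main()
</code></pre></div></div>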
<h3 id="conclusion">Conclusion</h3>
<p>What we ended up with was a much better codebase, not just because we didn’t have Airflow cruft in our code, but also because we had to think about building reliable jobs that could overlap or fail at any time. The resulting system was much easier to operate from a devops perspective, had much better reliability, monitoring, and logging - but these were only due to the fact we already had a cron-like system in place. If you don’t, your mileage may vary. Also, we didn’t have giant DAGs with dozens of complex relationships - for those use cases, dataflow systems like Airflow are still a good fit.</p>
<p>What I personally took most from this was the fact that this was super easy to debug locally. You just cloned a repo, installed a bunch of dependencies in a virtual environment and you were good to go. You could build a Docker image and you’d see <em>exactly</em> what would happen in production. Most of all, you were just running simple Python scripts, you could introspect it in a million different ways (but, you know, print is the way to go).</p>
<p>I’m not saying people should go this route, but as I gain more experience, I cherish simplicity and debuggability (these two often go hand in hand). This taught me that while Airflow is great and offers a lot of cool features, it might just be in your way.</p>I was listening to this podcast about Prefect, a dataflow scheduler that tries to learn from the shortcomings in Apache Airflow. It occurred to me that I have discussed our decision to deploy and subsequently abandon Airflow within our company on a number of occasions, so I might as well write it up.