Experiments in owning data, part 1

2019-01-20 14:13:04 -05:00
parent cd17762593
commit 88091a4040
1 changed files with 189 additions and 0 deletions
--- a/blog/2019-01-20-experiments-in-owning-data.markdown
+++ b/blog/2019-01-20-experiments-in-owning-data.markdown
@@ -0,0 +1,189 @@
+---
+layout: post
+title: "Experiments in owning data"
+date: 2019-01-20
+comments: true
+tags: freebsd
+---
+
+I have been working for a while to own most of the data I generate. I thought I
+would write down what I mean by that and how I am doing so far.
+
+Before this effort most of my data was spread across many proprietary
+services, some free, some paid. I had always felt I had limited control over
+it, and I had to find out about some free-tier [restrictions the hard
+way](https://www.quora.com/What-happens-to-your-older-photos-once-you-go-over-the-free-200-limit-on-Flickr-without-turning-pro).
+
+So this started as an effort to
+
+1. organize all the things that I wanted in a central (virtual) place, and
+2. have fine-grained control over who has access to the data.
+
+However, all of the services that I looked into were built on exactly what I
+wanted to avoid: a free service that monetizes my personal data. They do take
+my money to provide "upgrades", and then my data may not be mined (or maybe it
+still is; there is no way to verify that). Service behavior was also sometimes
+opaque and confusing, even [causing people to lose
+data](https://productforums.google.com/forum/#!topic/apps/RIHSJ4LIXwE).
+
+Another thing that stood out was how inflexible these services were. They are
+mostly designed as big monoliths that do not play well with others. For example,
+Google Photos is really nice, but what if I want to run an ImageMagick script
+over all of the photos I have? I think there is some way to do this if you poke
+at the Photos API, but the friction is too much compared to just mounting them
+over WebDAV or FUSE. For a lot of these services Linux was a second-class
+citizen, and FreeBSD an undiscovered species. I understand these are not common
+requirements, but I wanted the system to work with things I use and have.
+
+## Hardware
+At the moment, what I call my personal data is ~500GB: that's all the pictures,
+emails, documents, code and other things that I have. Assuming a 3-fold growth (probably too low?) I decided that I need around 2.5TB of storage. Other requirements were:
+
+1. connected to reasonably fast internet and reliable power
+2. cheap (remember, migrating out of this system is going to be really painful)
+
+After some consideration I decided not to host the hardware myself: I move around
+a lot, and the state of home internet in Germany is not where I'd like it to be.
+
+The storage requirement made most of the cloud providers infeasible (_3TB of EBS
+is ~$350/month_).
+
+I finally settled on a physical machine from the Hetzner [server
+auction](https://www.hetzner.com/sb). The server auction is where they sell their
+older-generation machines (read: Sandy Bridge/Ivy Bridge) at a steep discount. I
+was able to get a Xeon E3 with 32GB of ECC RAM and 2x3TB disks for 30 EUR a month.
+
+It could have been a bit cheaper if I had gone with an i7 machine (newer CPU too)
+instead. But those don't support ECC RAM. Intel is very adamant about not
+supporting ECC in "desktop class" processors.
+
+## Installation
+Installation was a piece of cake. Hetzner allows you to boot the server into
+`freebsd rescue mode`, where they point the server to PXE boot from an
+[`mfsbsd`](https://mfsbsd.vx.sk/) image and let you ssh in, and then you can start
+installing `FreeBSD` (one can follow a similar procedure for Linux distros with
+a Linux rescue image).
+
+## Security
+Even though the main goal is to avoid mass surveillance, I also wanted to avoid
+data leaks because of unplanned events - me not paying bills, hardware failures
+etc. The solution was to encrypt the disks, so that at rest nobody can sniff
+data out of them.
+
+This became a challenge because getting access to KVM in the Hetzner environment
+is not instant. One needs to send them a request, and a human mails you KVM
+access credentials valid for an hour (they are usually fast though). Every time I
+need to reboot the server I would have to get KVM access, type in my password
+over KVM (also not sure how much of that encryption I can trust..) and let the
+machine boot.
+
+### Two Zpools approach
+However, a friend of mine had the solution. The idea is to have two
+[zpool](https://en.wikipedia.org/wiki/ZFS)s: one unencrypted, which holds the OS,
+and the other encrypted, which holds the data.
+
+Both of the zpools are in
+[RAID 1](https://en.wikipedia.org/wiki/Standard_RAID_levels), meaning they are
+mirrored across the two physical disks, so as long as both disks don't fail
+together, we won't have any problems.
+
+```
+   disk1
+   +------------------------+
+   | pool1|    pool2        |
+   | unenc|    enc          |
+   +------------------------+
+   disk2
+   +------------------------+
+   | pool1|    pool2        |
+   | unenc|    enc          |
+   +------------------------+
+```
+
+Roughly, this is how it works: when the machine boots, it boots off the
+unencrypted zpool and reaches the custom rc.d script `geli0` that we installed:
+
+```
+#!/bin/sh
+#
+
+# PROVIDE: geli0
+# BEFORE: disks
+# REQUIRE: initrandom
+# KEYWORD: nojail
+
+. /etc/rc.subr
+
+name="geli0"
+start_cmd="geli0_start"
+stop_cmd=":"
+required_modules="geom_eli:g_eli"
+
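+geli0_start()
+{
+        # Mount the unencrypted ZFS filesystems and bring up just enough
+        # networking to accept an ssh login.
+        zfs mount -av
+        /etc/rc.d/hostid start
+        /etc/rc.d/hostname start
+        /etc/rc.d/netif start
+        /etc/rc.d/routing start
+        /etc/rc.d/sshd start
+
+        echo -n "Waiting for zpool:encrypted to become available, "
+        echo -n "press enter to continue..."
+        echo
+
+        # Check every 5 seconds whether both geli providers are attached;
+        # pressing enter on the console skips the wait.
+        while true; do
+                if [ -e /dev/ada0p4.eli -a -e /dev/ada1p4.eli ]; then
+                        break
+                fi
+                read -t 5 dummy && break
+        done
+
+        # Stop the early services again so the regular boot starts them itself.
+        /etc/rc.d/sshd stop
+        pkill sshd
+        /etc/rc.d/routing stop
+        /etc/rc.d/netif stop
+#       /etc/rc.d/devd stop
+}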
+
+load_rc_config $name
+run_rc_command "$1"
+```
+
+This script pauses the boot, sets up some essential services for networking and
+`ssh`, and waits for the encrypted partitions to become available. The machine
+is essentially waiting for me to decrypt the disks, and I can do that by
+ssh-ing to the box and running `decryptvol.sh` (contents below):
+
+```
+#!/bin/sh
+
+#
+# The passphrase for both disks is the same.
+# Read it once and decrypt the disks.
+#
+
+set -e
+
+echo -n "Enter passphrase: "
+stty -echo
+IFS="" read -r passphrase
+stty echo
+echo
+
+# Quote the passphrase so spaces and glob characters reach geli unchanged.
+printf '%s\n' "$passphrase" | geli attach -k /boot/keys/ada1p4.key -j - /dev/ada1p4
+printf '%s\n' "$passphrase" | geli attach -k /boot/keys/ada0p4.key -j - /dev/ada0p4
+```
+
+As soon as the disks are available, the `geli0` script resumes the regular boot,
+but now with access to the encrypted data.
+
+## Conclusion and part 2
+With this setup I have a place to store my data, and it's secure from data mining
+by third-party service providers. One bit that worries me is that someone could
+coerce Hetzner to attack the hardware itself, but I am not sure it's something I
+can solve at the moment.
+
+However, this is only a part of the puzzle. I now have my data platform, so to
+speak, and next I need services that integrate it with the other devices that
+generate and consume data. This post is already longer than I anticipated, so I
+will write about software and other services in a follow-up.