diff --git a/blog/2019-01-20-experiments-in-owning-data.markdown b/blog/2019-01-20-experiments-in-owning-data.markdown
new file mode 100644
index 0000000..d2dd921
--- /dev/null
+++ b/blog/2019-01-20-experiments-in-owning-data.markdown
@@ -0,0 +1,189 @@
+---
+layout: post
+title: "Experiments in owning data"
+date: 2019-01-20
+comments: true
+tags: freebsd
+---
+
+I have been working for a while to own most of the data I generate. I thought I
+would write down what I mean by that and how I am doing so far.
+
+Before this effort most of my data was spread across many proprietary
+services, some free, some paid. I had always felt I had restricted control over
+it, and I had to find out about some free tier [restrictions the hard
+way](https://www.quora.com/What-happens-to-your-older-photos-once-you-go-over-the-free-200-limit-on-Flickr-without-turning-pro).
+
+So this started as an effort to
+
+1. organize all the things that I wanted in a central (virtual) place.
+2. have fine-grained control over who has access to the data.
+
+However, all of the services that I looked into were exactly what I wanted
+to avoid - free services that monetize my personal data. They do take
+my money to provide "upgrades", and my data may not be mined (or it may be - there
+is no way to be sure). Also, service behaviors were sometimes opaque and confusing,
+even [causing people to lose
+data](https://productforums.google.com/forum/#!topic/apps/RIHSJ4LIXwE).
+
+Another thing that stood out was how inflexible these services were. They are mostly
+designed as big monoliths that do not play well with others. For example, Google
+Photos is really nice - but what if I want to run an ImageMagick script over all
+of the photos I have? I think there is some way to do this if you poke at the
+Photos API, however the friction is too much compared to just mounting them over
+WebDAV or FUSE. For a lot of these services Linux was a second class citizen,
+and FreeBSD an undiscovered species.
+I understand these are not common requirements, but I wanted the system to work with things I use and have.
+
+## Hardware
+At the moment, what I call my personal data is ~500GB: all the pictures, emails, documents, code and other things
+that I have. Assuming 3-fold growth (probably too low?), I decided that I need around 2.5TB of storage. Other requirements were:
+
+1. connected to reasonably fast internet and reliable power
+2. cheap (remember, migrating out of this system is going to be really painful)
+
+After some consideration I decided not to host the hardware myself: I move around a
+lot, and the state of home internet in Germany is not where I'd like it to be.
+
+Requirements for storage made most of the cloud providers infeasible (_3TB EBS is
+~$350/month_).
+
+I finally settled on a physical machine from the Hetzner [server
+auction](https://www.hetzner.com/sb). The server auction is where they sell their
+older generation machines (read: Sandy Bridge/Ivy Bridge) at a steep discount. I
+was able to get a Xeon E3 with 32GB ECC RAM and 2x3TB disks for 30 EUR a month.
+
+It could have been a bit cheaper if I had gone with an i7 machine (newer CPU too)
+instead. But they don't support ECC RAM. Intel is very adamant about not supporting ECC in
+"desktop class" processors.
+
+## Installation
+Installation was a piece of cake: Hetzner allows you to boot the server into
+`freebsd rescue mode`, where they point the server to PXE boot from a
+[`mfsbsd`](https://mfsbsd.vx.sk/) disk and let you ssh in, and then you can start
+installing `FreeBSD` (one can follow a similar procedure for Linux distros with
+a Linux rescue image).
+
+## Security
+Even though the main goal is to avoid mass surveillance, I also wanted to avoid
+data leaks because of unplanned events - me not paying bills, hardware failures
+etc. The solution was to encrypt the disks, so that at rest nobody can sniff
+data out of them.
+
+This became a challenge because getting access to KVM in the Hetzner environment is
+not instant.
+One needs to send them a request, and a human mails you KVM access credentials that are
+valid for an hour (they are usually fast though). This is a problem because every time I
+need to reboot the server I would have to get KVM access, type in my password over KVM
+(I am also not sure how much of that encryption I can trust) and let the machine boot.
+
+### Two Zpools approach
+However, a friend of mine had the solution: the idea is to have two
+[zpool](https://en.wikipedia.org/wiki/ZFS)s, one unencrypted that holds the OS
+and the other encrypted that holds the data.
+
+Both of the zpools are in
+[raid1](https://en.wikipedia.org/wiki/Standard_RAID_levels), meaning they are
+mirrored across two physical disks, so as long as both disks don't fail together,
+we won't have any problems.
+
+```
+        disk1
+ +------------------------+
+ | pool1| pool2           |
+ | unenc| enc             |
+ +------------------------+
+        disk2
+ +------------------------+
+ | pool1| pool2           |
+ | unenc| enc             |
+ +------------------------+
+```
+
+Roughly, this is how it works: when the machine boots, it boots off the plain zpool and
+gets to the custom rc.d script `geli0` that we installed:
+
+```
+#!/bin/sh
+#
+
+# PROVIDE: geli0
+# BEFORE: disks
+# REQUIRE: initrandom
+# KEYWORD: nojail
+
+. /etc/rc.subr
+
+name="geli0"
+start_cmd="geli0_start"
+stop_cmd=":"
+required_modules="geom_eli:g_eli"
+
+geli0_start()
+{
+    zfs mount -av                 # mount datasets from the unencrypted pool
+    /etc/rc.d/hostid start
+    /etc/rc.d/hostname start
+    /etc/rc.d/netif start         # bring up just enough networking for ssh
+    /etc/rc.d/routing start
+    /etc/rc.d/sshd start
+
+    echo -n "Waiting for zpool:encrypted to become available, "
+    echo -n "press enter to continue..."
+    echo
+
+    while true; do
+        if [ -e /dev/ada0p4.eli -a -e /dev/ada1p4.eli ]; then
+            break
+        fi
+        read -t 5 dummy && break  # re-check every 5s; Enter on the console continues
+    done
+    /etc/rc.d/sshd stop
+    pkill sshd
+    /etc/rc.d/routing stop
+    /etc/rc.d/netif stop
+#    /etc/rc.d/devd stop
+}
+
+load_rc_config $name
+run_rc_command "$1"
+```
+
+This script pauses the boot, sets up some essential services for networking and
+`ssh`, and waits for the encrypted partitions (`/dev/ada0p4.eli` and `/dev/ada1p4.eli`) to appear. The
+machine is essentially waiting for me to decrypt the disks, and I can do that by
+ssh-ing to the box and running `decryptvol.sh` (contents below):
+
+```
+#!/bin/sh
+
+#
+# The passphrase for both disks is the same.
+# Read it once and decrypt the disks.
+#
+
+set -e
+
+echo -n "Enter passphrase: "
+stty -echo
+IFS="" read -r passphrase
+stty echo
+echo
+
+printf '%s\n' "$passphrase" | geli attach -k /boot/keys/ada1p4.key -j - /dev/ada1p4
+printf '%s\n' "$passphrase" | geli attach -k /boot/keys/ada0p4.key -j - /dev/ada0p4
+```
+
+As soon as the disks are available the `geli0` script resumes the regular boot, but
+now with access to the encrypted data.
+
+## Conclusion and part 2
+With this setup I have a place to store my data and it's secure from data mining
+by third-party service providers. One bit that worries me is that someone can
+coerce Hetzner into attacking the hardware itself, but I am not sure it's something I
+can solve at the moment.
+
+However, this is only a part of the puzzle. Strictly speaking, I now have my data
+platform, so to speak, and next I need services that integrate it with the other
+devices that generate and consume data. This post is already longer than I
+anticipated, so I will write about software and other services in a follow-up.