blogng/blog/2019-01-20-experiments-in-owning-data.markdown
Dhananjay Balan 8dab8ed59a Bump stack version
system ghc is now 8.6.3 but hakyll is not in any lts versions that
support (ghc 8.6.x)
2019-01-23 15:59:05 -05:00


---
layout: post
title: "Experiments In Owning Data"
date: 2019-01-20
comments: true
tags: freebsd
---
I have been working for a while to own most of the data I generate. I thought I
would write down what I mean by that and how I am doing so far.

Before this effort, most of my data was spread across many proprietary
services, some free, some paid. I had always felt I had limited control over
it, and I had to find out about some free-tier [restrictions the hard
way](https://www.quora.com/What-happens-to-your-older-photos-once-you-go-over-the-free-200-limit-on-Flickr-without-turning-pro).

So this started as an effort to
1. organize all the things that I wanted in a central (virtual) place, and
2. have fine-grained control over who has access to the data.

However, all of the services that I looked into were built on exactly the model
I wanted to avoid: a free service that monetizes my personal data. They do take
my money to provide "upgrades", and then my data may not be mined (or maybe it
still is; there is no way to verify that). Service behavior was also sometimes
opaque and confusing, even [causing people to lose
data](https://productforums.google.com/forum/#!topic/apps/RIHSJ4LIXwE).

Another thing that stood out was how inflexible these services were. They are
mostly designed as big monoliths that do not play well with others. For example,
Google Photos is really nice, but what if I want to run an ImageMagick script
over all of the photos I have? I think there is some way to do this if you poke
at the Photos API, but the friction is too much compared to just mounting the
photos over WebDAV or FUSE. For a lot of these services Linux was a second-class
citizen, and FreeBSD an undiscovered species. I understand these are not common
requirements, but I wanted the system to work with the things I use and have.

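
To make the mounting point concrete, here is a minimal sketch of the kind of batch job that becomes trivial once the photos are plain files on a mounted directory. The function name, the directories in the usage line and the 1024-pixel bound are assumptions for illustration, not something from my actual setup. `convert` is ImageMagick's command-line tool, and the output directory should live outside the source tree so thumbnails are not picked up again.

```sh
#!/bin/sh
# Sketch: create thumbnails for every JPEG under a mounted photo directory.
# CONVERT can be overridden, e.g. CONVERT=magick on ImageMagick 7.
CONVERT="${CONVERT:-convert}"

make_thumbnails() {
    src_dir="$1"   # hypothetical mount point, e.g. /mnt/photos
    out_dir="$2"   # hypothetical output directory outside src_dir
    mkdir -p "$out_dir" || return 1
    # -iname matches both .jpg and .JPG
    find "$src_dir" -type f -iname '*.jpg' | while IFS= read -r photo; do
        name=$(basename "$photo")
        # "1024x1024>" only shrinks images that are larger than the bound
        "$CONVERT" "$photo" -resize '1024x1024>' "$out_dir/$name" ||
            echo "failed: $photo" >&2
    done
}

# Usage (paths are placeholders):
#   make_thumbnails /mnt/photos /tmp/thumbs
```

Files with the same base name in different subdirectories would overwrite each other in this sketch; a real script would need to preserve the directory structure.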
## Hardware
At the moment, what I call my personal data is ~500GB: all the pictures,
emails, documents, code and other things that I have. Assuming a 3-fold growth
(probably too low?), I decided that I need around 2.5TB of storage. Other
requirements were:
1. connected to reasonably fast internet and reliable power
2. cheap (remember, migrating out of this system is going to be really painful)

After some consideration I decided not to host the hardware myself: I move
around a lot, and the state of home internet in Germany is not where I'd like it
to be. The storage requirement made most of the cloud providers infeasible
(_3TB of EBS is ~$350/month_).

I finally settled on a physical machine from the Hetzner [server
auction](https://www.hetzner.com/sb). The server auction is where they sell
their older-generation machines (read: Sandy Bridge/Ivy Bridge) at a steep
discount. I was able to get a Xeon E3 with 32GB of ECC RAM and 2x3TB disks for
30 EUR a month. It could have been a bit cheaper if I had gone with an i7
machine (newer CPU too) instead, but those don't support ECC RAM. Intel is very
adamant about not supporting ECC in "desktop class" processors.
## Installation
Installation was a piece of cake. Hetzner allows you to boot the server into
`freebsd rescue mode`, where they point the server to PXE boot from an
[`mfsbsd`](https://mfsbsd.vx.sk/) image and let you ssh in; from there you can
start installing `FreeBSD`. (One can follow a similar procedure for Linux
distros with a Linux rescue image.)
## Security
Even though the main goal is to avoid mass surveillance, I also wanted to avoid
data leaks caused by unplanned events: me not paying bills, hardware failures,
etc. The solution was to encrypt the disks, so that nobody can read data off
them at rest.

This became a challenge because getting access to a KVM in the Hetzner
environment is not instant. One needs to send them a request, and a human mails
you KVM access credentials that are valid for an hour (they are usually fast
though). Every time I need to reboot the server I would have to get KVM access,
type in my password over KVM (I am also not sure how much of that encryption I
can trust) and let the machine boot.
### Two Zpools approach
However, a friend of mine had the solution. The idea is to have two
[zpool](https://en.wikipedia.org/wiki/ZFS)s: one unencrypted, which holds the
OS, and the other encrypted, which holds the data.

Both zpools are
[RAID 1](https://en.wikipedia.org/wiki/Standard_RAID_levels) mirrors, meaning
each one is mirrored across the two physical disks. As long as both disks don't
fail together, we won't have any problems.

```
disk1
+------------------------+
| pool1 | pool2          |
| unenc | enc            |
+------------------------+
disk2
+------------------------+
| pool1 | pool2          |
| unenc | enc            |
+------------------------+
```
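
For readers who want to reproduce the encrypted half of this layout, the following is a rough sketch of the steps, not a record of the exact commands I ran. It assumes the partitions already exist, that the pool is called `encrypted`, and that a 4096-byte GELI sector size is wanted; the helper names `run` and `setup_encrypted_mirror` are made up for this example. By default it only prints the commands, since `geli init` and `zpool create` destroy whatever is on the partitions.

```sh
#!/bin/sh
# Sketch: build an encrypted ZFS mirror on top of two existing partitions.
# With DRY_RUN=1 (the default) commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

setup_encrypted_mirror() {
    pool="$1"; shift
    providers=""
    for part in "$@"; do
        # Key files live on the unencrypted pool, next to the OS.
        key="/boot/keys/${part}.key"
        run dd if=/dev/random of="$key" bs=64 count=1
        # geli init asks for the passphrase interactively when really run.
        run geli init -s 4096 -K "$key" "/dev/$part"
        run geli attach -k "$key" "/dev/$part"
        providers="$providers /dev/${part}.eli"
    done
    # Word splitting of $providers is intentional here.
    run zpool create "$pool" mirror $providers
}

# Usage (prints the plan; set DRY_RUN=0 only on disks you can wipe):
#   setup_encrypted_mirror encrypted ada0p4 ada1p4
```

The key file plus passphrase combination means that neither the key files sitting on the unencrypted pool nor the passphrase alone is enough to attach the partitions.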
Roughly, this is how it works: when the machine boots, it boots off the plain
zpool and gets to `geli0`, a custom rc.d script that we installed:
```
#!/bin/sh
#
# PROVIDE: geli0
# BEFORE: disks
# REQUIRE: initrandom
# KEYWORD: nojail

. /etc/rc.subr

name="geli0"
start_cmd="geli0_start"
stop_cmd=":"
required_modules="geom_eli:g_eli"

geli0_start()
{
	# Mount what is available on the unencrypted pool.
	zfs mount -av

	# Bring up just enough networking to accept an SSH login.
	/etc/rc.d/hostid start
	/etc/rc.d/hostname start
	/etc/rc.d/netif start
	/etc/rc.d/routing start
	/etc/rc.d/sshd start

	echo -n "Waiting for zpool:encrypted to become available, "
	echo -n "press enter to continue..."
	echo

	# Poll until both GELI providers are attached. Pressing enter on the
	# console within the 5 second read timeout skips the wait.
	while true; do
		if [ -e /dev/ada0p4.eli -a -e /dev/ada1p4.eli ]; then
			break
		fi
		read -t 5 dummy && break
	done

	# Stop the temporary services so the regular boot can start them again.
	/etc/rc.d/sshd stop
	pkill sshd
	/etc/rc.d/routing stop
	/etc/rc.d/netif stop
	# /etc/rc.d/devd stop
}

load_rc_config $name
run_rc_command "$1"
```
This script pauses the boot, sets up some essential services related to
`network` and `ssh`, and waits for the encrypted partitions to become available.
The machine is essentially waiting for me to decrypt the disks, which I can do
by ssh-ing to the box and running `decryptvol.sh` (contents below).
```
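#!/bin/sh
#
# The passphrase for both disks is the same.
# Read it once and decrypt the disks.
#
set -e

# Restore terminal echo even if the script exits early.
trap 'stty echo' EXIT

echo -n "Enter passphrase: "
stty -echo
IFS="" read -r passphrase
stty echo
echo

# Quote the passphrase so whitespace and glob characters survive intact.
# "-j -" makes geli read the passphrase from standard input.
printf '%s\n' "$passphrase" | geli attach -k /boot/keys/ada1p4.key -j - /dev/ada1p4
printf '%s\n' "$passphrase" | geli attach -k /boot/keys/ada0p4.key -j - /dev/ada0p4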
As soon as the disks are available, the `geli0` script resumes the regular boot,
but now with access to the encrypted data.
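
From the client side, the unlock step can be wrapped in a small helper. This is only a sketch: the function name, the `root` login and the script path `/root/decryptvol.sh` are assumptions, so adjust them to wherever the script actually lives. The one detail that matters is `-t`, which makes ssh allocate a terminal; `decryptvol.sh` calls `stty`, and that fails without one.

```sh
#!/bin/sh
# Sketch: unlock the server from another machine after a reboot.
# SSH can be overridden, e.g. to point at a wrapper with extra options.
SSH="${SSH:-ssh}"

unlock_server() {
    host="$1"
    script="${2:-/root/decryptvol.sh}"   # hypothetical location
    # -t forces a terminal so the remote passphrase prompt works
    "$SSH" -t "root@$host" "$script"
}

# Usage (host name is a placeholder):
#   unlock_server storage.example.org
```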
## Conclusion and part 2
With this setup I have a place to store my data, and it's secure from data
mining by third-party service providers. One bit that worries me is that someone
can coerce Hetzner into attacking the hardware itself, but I am not sure it's
something I can solve at the moment.

However, this is only a part of the puzzle. Strictly speaking, I have my data
platform, so to speak, and now I need services that integrate it with the other
devices that generate and consume data. This post is already longer than I
anticipated, so I will write about software and other services in a follow-up.