#!/usr/bin/env python3
# Convert the `categories` entry in a post's YAML front matter into a
# comma-separated `tags` entry, and mark posts without tags as "untagged".
import sys

import yaml

with open(sys.argv[1]) as fp:
    data = fp.read()

if not data.startswith("---"):
    # No YAML front matter block at the top of the file.
    print("NO YAML HEAD FOUND")
    sys.exit(1)

# Split the file into the front matter (between the two "---" markers)
# and the remaining body.
data = data[3:]
head_end = data.find("---")
head = data[0:head_end]
data = data[head_end + 3:]
metadata = yaml.safe_load(head)

cats = metadata.pop("categories", None)
if cats is not None:
    # `categories` may be a YAML list or a whitespace-separated string.
    if isinstance(cats, list):
        tags = cats
    elif isinstance(cats, str):
        tags = cats.split()
    else:
        tags = [str(cats)]
    tags = [t.lower() for t in tags]
    metadata["tags"] = ", ".join(tags)
    new_data = f"---\n{yaml.dump(metadata, default_flow_style=False)}---{data}"
    # write it
    print(f"converted: categories to tags: {tags} - {sys.argv[1]}")
    with open(sys.argv[1], "w") as fp:
        fp.write(new_data)
    sys.exit(0)

if not metadata.get("tags"):
    # No categories and no tags: fall back to a placeholder tag.
    metadata["tags"] = "untagged"
    new_data = f"---\n{yaml.dump(metadata, default_flow_style=False)}---{data}"
    print(f"untagged: {sys.argv[1]}")
    # write it
    with open(sys.argv[1], "w") as fp:
        fp.write(new_data)
    sys.exit(0)

print("No changes needed")
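
To apply this to an entire Jekyll-style posts directory, a simple loop is enough. This is just a sketch: the script name `convert_tags.py` and the `_posts` path are placeholders.

```
# Run the converter over every Markdown post; names and paths are illustrative.
for post in _posts/*.md; do
    python3 convert_tags.py "$post"
done
```
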
---
layout: post
title: "Experiments In Owning Data"
date: 2019-01-20
comments: true
tags: freebsd, data, hetzner
---

I have been working for a while to own most of the data I generate. I thought I
would write down what I mean by that and how I am doing so far.

Before this effort most of my data was spread across many proprietary
services, some free, some paid. I had always felt I had limited control over
them, and I had to find out some free tier [restrictions the hard
way](https://www.quora.com/What-happens-to-your-older-photos-once-you-go-over-the-free-200-limit-on-Flickr-without-turning-pro).

So this started as an effort to

1. organize all the things that I wanted to keep in a central (virtual) place, and
2. gain fine-grained control over who has access to the data.

However, all of the services that I looked into were built around exactly what
I wanted to avoid: a free service that monetizes my personal data. They will
take my money to provide "upgrades", and then my data may not be mined (or
maybe it still is - there is no way to ensure that). Service behavior was also
sometimes opaque and confusing, even [causing people to lose
data](https://productforums.google.com/forum/#!topic/apps/RIHSJ4LIXwE).

Another thing that stood out was how inflexible these services were. Most are
designed as big monoliths that do not play well with others. For example,
Google Photos is really nice - but what if I want to run an ImageMagick script
over all of the photos I have? I think there is some way to do this if you poke
at the Photos API, but the friction is too high compared to just mounting them
over WebDAV or FUSE. For a lot of these services Linux was a second-class
citizen, and FreeBSD an undiscovered species. I understand these are not common
requirements, but I wanted the system to work with the things I use and have.
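
As a rough sketch of the kind of workflow I mean - the hostname, paths, and the
`sshfs`/`mogrify` invocation below are only illustrative, not part of my actual
setup:

```
# Mount the photo collection over sshfs (FUSE) and run an ImageMagick
# batch job against it as if it were a local directory.
# "user@myserver" and the paths are placeholders.
mkdir -p ~/mnt/photos ~/previews
sshfs user@myserver:/data/photos ~/mnt/photos

# Generate 1024px-wide previews of every JPEG into ~/previews.
find ~/mnt/photos -name '*.jpg' -exec mogrify -path ~/previews -resize 1024x {} +

umount ~/mnt/photos
```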

## Hardware
At the moment, what I call my personal data is ~500GB; that's all the pictures,
emails, documents, code and other things that I have. Assuming a 3-fold growth
(probably too low?) I decided that I need around 2.5TB of storage. Other
requirements were:

1. connected to reasonably fast internet and reliable power
2. cheap (remember, migrating out of this system is going to be really painful)

After some consideration I decided not to host the hardware myself; I move
around a lot and the state of home internet in Germany is not where I'd like it
to be.

The storage requirement made most of the cloud providers unfeasible (_3TB of
EBS is ~$350/month_).

I finally settled on a physical machine from the Hetzner [server
auction](https://www.hetzner.com/sb). The server auction is where they sell
their older generation machines (read: Sandy Bridge/Ivy Bridge) at a steep
discount. I was able to get a Xeon E3 with 32GB of ECC RAM and 2x3TB disks for
30 EUR a month.

It could have been a bit cheaper if I had gone with an i7 machine (newer CPU
too) instead. But those don't have ECC RAM; Intel is very adamant about not
supporting ECC in "desktop class" processors.

## Installation
Installation was a piece of cake. Hetzner allows you to boot the server into a
FreeBSD rescue mode, where they point the server to PXE boot from an
[`mfsbsd`](https://mfsbsd.vx.sk/) image and let you ssh in, and then you can
start installing `FreeBSD` (one can follow a similar procedure for Linux
distros with a Linux rescue image).

## Security
Even though the main goal is to avoid mass surveillance, I also wanted to avoid
data leaks because of unplanned events - me not paying bills, hardware failures
etc. The solution was to encrypt the disks, so that nobody can sniff data out
of them at rest.

This became a challenge because getting access to a KVM in the Hetzner
environment is not instant. You need to send them a request and a human mails
you KVM access credentials for an hour (they are usually fast though). So every
time I need to reboot the server I would have to get KVM access, type in my
password over the KVM (also not sure how much of that encryption I can
trust...) and let the machine boot.

### Two Zpools approach
However, a friend of mine had the solution: the idea is to have two
[zpool](https://en.wikipedia.org/wiki/ZFS)s, one unencrypted that holds the OS
and the other encrypted that holds the data.

Both of the zpools are in
[raid1](https://en.wikipedia.org/wiki/Standard_RAID_levels), meaning they are
mirrored across two physical disks, so as long as both disks don't fail
together, we won't have any problems.

```
   disk1
   +------------------------+
   | pool1|    pool2        |
   | unenc|    enc          |
   +------------------------+
   disk2
   +------------------------+
   | pool1|    pool2        |
   | unenc|    enc          |
   +------------------------+
```
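
For context, a layout like this could be created roughly as follows. This is
only a sketch: the device names (`ada0p4`, `ada1p4`), key paths, and pool name
are inferred from the scripts further down, not a transcript of the actual
installation.

```
# Initialize GELI on the data partition of each disk, using a key file
# plus a passphrase (providers and paths are assumptions).
geli init -K /boot/keys/ada0p4.key -s 4096 /dev/ada0p4
geli init -K /boot/keys/ada1p4.key -s 4096 /dev/ada1p4

# Attach both so the .eli providers show up under /dev.
geli attach -k /boot/keys/ada0p4.key /dev/ada0p4
geli attach -k /boot/keys/ada1p4.key /dev/ada1p4

# Build the mirrored (raid1) data pool on top of the encrypted providers.
zpool create data mirror /dev/ada0p4.eli /dev/ada1p4.eli
```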

Roughly, this is how it works: when the machine boots, it boots off the plain
zpool and gets to the custom rc.d script `geli0` that we installed:

```
#!/bin/sh
#

# PROVIDE: geli0
# BEFORE: disks
# REQUIRE: initrandom
# KEYWORD: nojail

. /etc/rc.subr

name="geli0"
start_cmd="geli0_start"
stop_cmd=":"
required_modules="geom_eli:g_eli"

geli0_start()
{
        zfs mount -av
        /etc/rc.d/hostid start
        /etc/rc.d/hostname start
        /etc/rc.d/netif start
        /etc/rc.d/routing start
        /etc/rc.d/sshd start

        echo -n "Waiting for zpool:encrypted to become available, "
        echo -n "press enter to continue..."
        echo

        while true; do
                if [ -e /dev/ada0p4.eli -a -e /dev/ada1p4.eli ]; then
                        break
                fi
                read -t 5 dummy && break
        done
        /etc/rc.d/sshd stop
        pkill sshd
        /etc/rc.d/routing stop
        /etc/rc.d/netif stop
#       /etc/rc.d/devd stop
}

load_rc_config $name
run_rc_command "$1"
```
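
Because of the `BEFORE: disks` ordering, the script presumably has to live in
the base `/etc/rc.d` directory so that `rcorder` picks it up early; installing
it might look something like this (path and mode are my assumption, not taken
from the original setup):

```
# Put the script where the base rc system will order it before "disks".
install -m 555 geli0 /etc/rc.d/geli0
```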

This script pauses the boot, sets up some essential services related to
networking and `ssh`, and waits for the second set of disks to become
available. The machine is essentially waiting for me to decrypt the disks, and
I can do that by ssh-ing to the box and running `decryptvol.sh` (contents
below):

```
#!/bin/sh

#
# The passphrase for both disks is the same.
# Read it once and decrypt the disks.
#

set -e

echo -n "Enter passphrase: "
stty -echo
IFS="" read -r passphrase
stty echo
echo

echo "$passphrase" | geli attach -k /boot/keys/ada1p4.key -j - /dev/ada1p4
echo "$passphrase" | geli attach -k /boot/keys/ada0p4.key -j - /dev/ada0p4
```
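
So after a reboot, unlocking the data pool is a single command away from any
machine that has the SSH key (the hostname and script location here are
placeholders):

```
# Unlock the encrypted pool remotely after a reboot; -t allocates a tty
# so the interactive passphrase prompt (and stty) works.
ssh -t root@storage.example.org ./decryptvol.sh
```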

As soon as the disks are available, the `geli0` script resumes the regular
boot, but now with access to the encrypted data.

## Conclusion and part 2
With this setup I have a place to store my data, and it is secure from data
mining by third-party service providers. One bit that worries me is that
someone could coerce Hetzner into attacking the hardware itself, but I am not
sure that is something I can solve at the moment.

However, this is only a part of the puzzle. Strictly speaking, I have my data
platform, so to speak, and now I need services that integrate it with the other
devices that generate and consume data. This post is already longer than I
anticipated, so I will write about the software and other services in a
follow-up.