Ok, I finally figured it out (for the most part)!
The important detail I left out of the question was that I was running PostgreSQL on btrfs and streaming to a different server running ext4. There seemed to be some race condition during high-load periods that caused the streamed data to be corrupted or read incorrectly; I don't know exactly what. Sometimes replication failed after 30 seconds, sometimes after 30 minutes.
So last night I shut down the system, backed everything up to a separate HDD, reformatted my btrfs partition as ext4, moved everything back, and brought the system back up. Once I restarted the live replication, it caught up, and now, 24 hours later, it is still perfectly in sync with no errors!
So whatever was going on, it was related to the btrfs partition. I spent this entire week trying to figure this out, so I hope this saves someone else some time. :-)
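
If you want to check whether your own data directory sits on btrfs, here is a minimal sketch in Python. It is not what I ran at the time, just an illustration: the helper name `fstype_for_path` is made up, and it assumes a Linux system where mount information is available in the `/proc/mounts` format (device, mount point, filesystem type, ...).

```python
import os


def fstype_for_path(path, mounts_text):
    """Return (mount_point, fstype) for the mount that contains `path`.

    `mounts_text` is text in /proc/mounts format. Hypothetical helper for
    illustration only; mount points containing spaces (escaped as \\040
    in /proc/mounts) are not decoded here.
    """
    path = os.path.normpath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fstype = fields[1], fields[2]
        mp = mount_point.rstrip("/") or "/"
        # The path is under this mount if it equals the mount point,
        # the mount point is the root, or the path continues below it.
        if path == mp or mp == "/" or path.startswith(mp + "/"):
            # Prefer the longest mount point; on ties, the later entry wins.
            if best is None or len(mp) >= len(best[0]):
                best = (mp, fstype)
    return best
```

To use it, read `/proc/mounts` and pass its contents along with your PostgreSQL data directory (the exact path depends on your installation). If the second element of the result is `btrfs`, you are in the same situation I was in.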