• I think you're underestimating network latency. If reading from the cache is like talking to a person face-to-face, then reading from disk is like sending a letter, and reading over the network is like sending a rocket to the moon :P

    So for the example you gave, with a 1 Gbps upload speed (pretty fast even by today's standards), a 16 Tb (terabit) folder would take 16,000 seconds — almost 4 1/2 hours — just to send to one other device. So unless the compression task would take over 9 hours (roughly a round trip: data out, results back) and were completely parallel (no compression algorithm is), it's faster to just do it on one machine. You could send it to more than one other device, but then the total transmission time would increase (albeit non-linearly) and you'd see diminishing percentage gains in compute time due to Amdahl's Law.
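    To make the arithmetic concrete, here's a quick back-of-envelope sketch (the 16 Tb / 1 Gbps figures are from the example above; the 95%-parallel fraction is just an illustrative assumption):

    ```python
    def transfer_seconds(size_bits: float, link_bps: float) -> float:
        """Time to push size_bits through a link_bps link, ignoring protocol overhead."""
        return size_bits / link_bps

    def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
        """Amdahl's Law: upper bound on speedup when only part of the work scales."""
        return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)

    folder_bits = 16e12   # 16 Tb
    link_bps = 1e9        # 1 Gbps
    t = transfer_seconds(folder_bits, link_bps)
    print(t, t / 3600)    # 16000.0 seconds, ~4.44 hours

    # Even a job that's 95% parallel tops out well short of linear scaling:
    for n in (2, 4, 8):
        print(n, round(amdahl_speedup(0.95, n), 2))
    ```

    Note the shape of the curve: doubling the workers from 4 to 8 buys much less than doubling from 1 to 2, which is why shipping the data to ever more machines stops paying off.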

    That being said, solutions for what you're talking about do exist! See Hadoop/Apache Spark for a distributed, scalable, general-purpose mapreduce framework — they work because the data already lives on the cluster, so you're not paying the transfer cost up front. I'm sure Azure/AWS have more managed equivalents as well.
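    If you haven't seen the mapreduce pattern before, here's a toy single-process word count in plain Python showing the three phases (map, shuffle, reduce) that frameworks like Hadoop/Spark distribute across machines — illustrative only, not how you'd actually use those frameworks:

    ```python
    from collections import defaultdict
    from itertools import chain

    def map_phase(doc: str):
        # Map: emit (key, value) pairs independently per input record.
        return [(word, 1) for word in doc.split()]

    def shuffle(pairs):
        # Shuffle: group all values by key (the expensive network step in a real cluster).
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(groups):
        # Reduce: combine each key's values into a final result.
        return {key: sum(values) for key, values in groups.items()}

    docs = ["to be or not to be", "to compress or not"]
    counts = reduce_phase(shuffle(chain.from_iterable(map_phase(d) for d in docs)))
    print(counts["to"])  # 3
    ```

    The shuffle step is exactly where the network-latency argument above bites: a real cluster moves the grouped pairs between machines, so mapreduce only wins when the data is already distributed.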
