
I've noticed that loading half of a small dataset twice is faster than loading the full small dataset once, from files around half a terabyte in size.

Any tips?

Comments
  • 0
    uhhhh... half a terabyte is pretty fucking big. That can't be cached in memory, so you're going to be limited by the speed of your hard drive.
    Or are you joking and I missed it lol
  • 0
    @iam13islucky I'm loading "a small dataset" of around 256 MiB from the 512 GiB file.
  • 0
    Also happens for datasets as small as 32 MiB from files as small as 64 GiB.
  • 1
    I dunno then, good luck!
  • 0
    Huge files are not cached entirely, only the ranges you actually access are. When you read a small chunk a second time, it will likely still be in RAM. Other parts of the same file might not be cached yet (even though the OS sometimes reads more than requested to speed up future read requests).
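    Here's a minimal sketch of that effect (assuming Linux, Python 3.3+, and a made-up path standing in for your big file): the second read of the same range is served from the page cache and should be far faster than the first.

      import os
      import time

      BIG_FILE = "/data/huge.bin"   # hypothetical path; stands in for the 512 GiB file
      CHUNK = 256 * 1024 * 1024     # the 256 MiB "small dataset"

      def timed_read(fd, offset, size):
          os.lseek(fd, offset, os.SEEK_SET)
          t0 = time.perf_counter()
          left = size
          while left > 0:
              data = os.read(fd, min(left, 1 << 20))  # read 1 MiB at a time
              if not data:
                  break
              left -= len(data)
          return time.perf_counter() - t0

      fd = os.open(BIG_FILE, os.O_RDONLY)
      # Ask the kernel to drop cached pages for this range so the first
      # read is genuinely cold. POSIX_FADV_DONTNEED is a hint, not a
      # guarantee (Linux-only).
      os.posix_fadvise(fd, 0, CHUNK, os.POSIX_FADV_DONTNEED)
      print("cold read: %.3f s" % timed_read(fd, 0, CHUNK))
      print("warm read: %.3f s" % timed_read(fd, 0, CHUNK))  # likely from RAM
      os.close(fd)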
  • 0
    @kraator my wording was not so good. I meant that loading each half of the small dataset in turn is faster than loading the full dataset at once. I could see that happening if each load puts twice the requested size in the cache. Maybe loading 256 MiB puts 512 MiB in the cache? But then loading the first 128 MiB block would somehow have to put only a small buffer from the second 128 MiB block in the cache?
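    One way to sanity-check that theory on Linux (sketch; the device name "sda" is a placeholder for whatever disk actually holds the file): the kernel exposes its per-device readahead window, and the usual default is only 128 KiB, nowhere near the hundreds of MiB this would require.

      # Value is in KiB; the common default is 128, i.e. a 128 KiB window.
      with open("/sys/block/sda/queue/read_ahead_kb") as f:
          print("kernel readahead window:", f.read().strip(), "KiB")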
  • 0
    @thejohnhoffer ah, but I'm doing these loads in series. Maybe that's the obvious problem here...
  • 0
    @thejohnhoffer I'm loading the full 256 MiB block. Then immediately I try to load two 128 MiB blocks from the same place in the same file. At least some of that must be cached. I'll redo my experiment backwards to see if that changes anything.
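    For the redo, something like this keeps the trials comparable (a sketch assuming Linux and a made-up path; posix_fadvise is only a hint, so writing to /proc/sys/vm/drop_caches as root is the heavier-handed option):

      import os
      import time

      BIG_FILE = "/data/huge.bin"        # hypothetical path
      HALF = 128 * 1024 * 1024           # 128 MiB
      FULL = 2 * HALF                    # 256 MiB

      def read_range(fd, offset, size):
          os.lseek(fd, offset, os.SEEK_SET)
          left = size
          while left > 0:
              buf = os.read(fd, min(left, 1 << 20))
              if not buf:
                  break
              left -= len(buf)

      def trial(fd, ranges, label):
          # Evict the whole region first (Linux-only hint) so every trial
          # starts cold and the order of the trials stops mattering.
          os.posix_fadvise(fd, 0, FULL, os.POSIX_FADV_DONTNEED)
          t0 = time.perf_counter()
          for off, size in ranges:
              read_range(fd, off, size)
          print(label, "%.3f s" % (time.perf_counter() - t0))

      fd = os.open(BIG_FILE, os.O_RDONLY)
      trial(fd, [(0, FULL)], "full 256 MiB block:")
      trial(fd, [(0, HALF), (HALF, HALF)], "two 128 MiB halves:")
      os.close(fd)

    Without the eviction step, whichever load runs second is served from the warm cache, which by itself could explain the original result.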
  • 0
    @kraator thank you! I'll post results
  • 0
    Not an expert on OSes, but I'm pretty sure you'll find some great research papers on this topic.