Topic: Copying large amounts of small files using SMB/CIFS  (Read 7942 times)

Floyd-ATC

Copying large amounts of small files using SMB/CIFS
« on: 17 April 2015, 14:41 »
    We needed to copy more than 500,000 small files from a CIFS share on a NetApp to a local USB disk. To make matters more interesting, the NetApp uses on-access virus scanning. We wanted to discard file times, ownership and permissions if possible, so the USB disk was formatted as exFAT rather than NTFS.

    This was solved using a laptop with a USB3 port and a Gigabit Ethernet port.

    First, a plain old Ctrl+C/Ctrl+V file copy achieved a peak throughput of about 3 Mbytes/sec, at which rate copying 530 Gbytes would take just over 2 days. TeraCopy yielded exactly the same throughput. This was completely unacceptable.

    Second, we downloaded RichCopy, a tool originally developed by Microsoft for internal use but later released to the public. Using 50 parallel threads and request serialization, we achieved a peak throughput of nearly 8 Mbytes/sec, reducing the estimated time needed to under 23 hours.
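The time estimates above are simple division, and a small helper makes it easy to redo them for other sizes and rates. This is only a sketch: the function name is made up for illustration, and it assumes decimal units (1 Gbyte = 1000 Mbytes) and a constant transfer rate. Since the figures quoted are peak rates, the results are best read as lower bounds on the real copy time.

```shell
# Hours needed to move a given number of Gbytes at a given rate in Mbytes/sec.
# Assumes decimal units and a constant rate; estimate_hours is an illustrative name.
estimate_hours() {
    awk -v gbytes="$1" -v rate="$2" \
        'BEGIN { printf "%.1f\n", gbytes * 1000 / rate / 3600 }'
}

estimate_hours 530 3    # plain file copy / TeraCopy peak rate: prints 49.1
estimate_hours 530 8    # RichCopy peak rate: prints 18.4
```

49.1 hours is the "just over 2 days" figure. A sustained rate somewhat below the 8 Mbytes/sec peak would land closer to the 23-hour estimate.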

    Third, we booted a Knoppix live DVD, mounted the exFAT USB drive and CIFS share and used the following commands:
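    Code:
    # -print0/-0 pass names NUL-separated so spaces and quotes survive; -I already runs one command per name, so -n 1 is dropped
    time find Source -noleaf -type d -print0 | xargs -0 -P 10 -I % mkdir -p /mnt/destination/%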
    Code:
    # -print0/-0 keep file names containing spaces or quotes intact
    time find Source -noleaf -type f -print0 | xargs -0 -P 50 -I % cp % /mnt/destination/%
    Here, "Source" is the relative name of the directory containing the files to be copied.
    "/mnt/destination" is the USB drive mount point.
    The first command builds the directory structure using up to 10 parallel processes. This took about 10 minutes.
    The second command copies the actual files using up to 50 parallel processes. This took just over 7 hours with an average throughput of just over 20 Mbytes/sec.
    Note the "-I %" which means that the character "%" should be substituted with the actual directory or file name to be created/copied.
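To see what the "-I %" substitution does without touching any files, you can put "echo" in front of the command so that xargs only prints what it would run. A minimal sketch follows; the function name and the two file names are made up for illustration and are not from the actual job.

```shell
# Dry run: "echo" makes xargs print each command instead of executing it.
# preview_copy and the two sample names are illustrative only.
preview_copy() {
    printf '%s\n' 'Source/a.txt' 'Source/sub dir/b.txt' |
        xargs -I % echo cp % /mnt/destination/%
}

preview_copy
# cp Source/a.txt /mnt/destination/Source/a.txt
# cp Source/sub dir/b.txt /mnt/destination/Source/sub dir/b.txt
```

Each input line replaces every "%" in the command, which is why the same relative path shows up both as the source and underneath the destination mount point.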

    'cp' can keep file times if given the appropriate option (for example "--preserve=timestamps" in GNU cp), but note that this may affect performance. You should also experiment with the number of parallel processes to find the "sweet spot" for your particular scenario, since not all USB drives, computers, network switches and CIFS servers are the same. That said, the most important factor is probably the number and size of the files to be copied.
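One way to look for the sweet spot is to time the same copy on a representative sample directory with a few different process counts. The following is a rough sketch, not what we actually ran: the function names, the list of process counts and the scratch directory are all assumptions. Note that it wipes the scratch directory before every run, and that the destination must be an absolute path.

```shell
# Copy every file under $1 into $2 (absolute path) using $3 parallel cp processes.
# parallel_copy and sweep are illustrative names, not from the original commands.
parallel_copy() {
    (
        src=$1 dest=$2 procs=$3
        cd "$src" || exit 1
        # Create the directory tree first, then copy the files in parallel.
        find . -type d -print0 | xargs -0 -I % mkdir -p "$dest/%" &&
        find . -type f -print0 | xargs -0 -P "$procs" -I % cp % "$dest/%"
    )
}

# Time parallel_copy for several process counts.
# WARNING: the scratch directory ($2) is deleted before each run.
sweep() {
    sample=$1 scratch=$2
    for procs in 5 10 25 50; do
        rm -rf "$scratch" && mkdir -p "$scratch" || return 1
        start=$(date +%s)
        parallel_copy "$sample" "$scratch" "$procs" || return 1
        end=$(date +%s)
        echo "$procs processes: $((end - start)) s"
    done
}
```

Run it as "sweep <sample directory> <scratch directory>" with a subdirectory of the share and an empty directory on the USB drive. Later runs may benefit from caching on the client or the server, so treat the numbers as a rough guide only.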

    I should also mention that the GUI on our Knoppix 7.4.2 crashed halfway through one of the tests, possibly due to a resource leak in the window manager or display driver. This did not seem to affect performance, but we decided to boot to the CLI only by typing "knoppix 2" at the bootloader prompt. As a bonus, this freed up some system resources, improving performance even more.  :)


    -Floyd.

    --
    There are 10 types of people:
    those who understand binary, those who don't, and those who understand Gray code.