Idea ID: 1638649

Back up a single file system with multiple streams

Status: Accepted

Brief description:

While file systems continue to grow and the underlying disk subsystems become faster, backups of large file systems remain one of the major concerns for traditional backup environments. The Data Protector Disk Agent should support multiple streams on a single file system, configurable similarly to device concurrency, to drastically speed up both the backup and the restore operation.

[Attachment: Picture1.png]
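
To illustrate the request, here is a minimal sketch (hypothetical Python, not how the Disk Agent is implemented): a single directory walk feeds a shared queue, and a configurable number of worker streams drains it in parallel. NUM_STREAMS and the path /data are assumptions standing in for a per-volume setting (analogous to device concurrency) and the volume being backed up.

```python
# Hypothetical sketch, not Data Protector code: one walker, N streams.
import os
import queue
import threading

NUM_STREAMS = 4                  # assumed per-volume setting, like device concurrency
work: queue.Queue = queue.Queue(maxsize=1000)

def stream_worker(stream_id: int) -> None:
    """One backup stream: take the next available file and read it through."""
    while True:
        path = work.get()
        if path is None:         # sentinel: the walker is done
            return
        try:
            with open(path, "rb") as f:
                while f.read(1 << 20):   # stands in for "send block to media agent"
                    pass
        except OSError:
            pass                 # skip unreadable files in this sketch

threads = [threading.Thread(target=stream_worker, args=(i,)) for i in range(NUM_STREAMS)]
for t in threads:
    t.start()

# A single walk of the file system feeds all streams; distribution is
# first-come, first-served.
for root, _dirs, files in os.walk("/data"):
    for name in files:
        work.put(os.path.join(root, name))

for _ in threads:
    work.put(None)               # one sentinel per stream
for t in threads:
    t.join()
```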

Existing Enhancement Requests:

QCCR2A62413: Support multiple disk agents per volume

Benefit:

Reduce the complexity of manually subdividing file systems into smaller pieces. Optimize backup and restore performance using multiple streams, even on large file systems with millions of files and folders.

Comments:

  • I have the same question as . What would be your expectation around the split streams with regard to features like object copy and object consolidation? Should the pieces have an independent existence, or should they all be tied together under the parent entity?

  • This idea is being reviewed by engineering to determine the feasibility of this request. We will follow up shortly.

  • Greedy distribution is simplest and self-balancing: assign the next file to whichever stream is available for processing (the queue-based sketch near the top of this idea works exactly this way). This avoids any reasoning about what the "ideal" distribution would be, but at the cost of possibly splitting files from the same directory into multiple streams, thus potentially prolonging the restore process.


    One problem with all multi-stream approaches in DP is that the user is forced to pick the number of streams. But that number cannot be correct for all situations: e.g. using 5 streams for an incremental backup that picks up only 5 small files is wasteful. And the number of streams can conflict with the number of destination devices that are selected or available, leading to situations where the backup cannot continue (you selected 5 streams, but the devices can only handle 3). So maybe the setting should specify an upper limit to prevent the diminishing returns described by , but the actual number up to that limit is then inferred from the selected destination devices and their capacity for concurrency at the time of backup? (A sketch of this inference follows this comment.)


    : How would you expect related functionality to behave in the presence of multiple streams: object consolidation, object copy, protection?
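
A minimal sketch of that inference, with hypothetical names (effective_streams is not a real Data Protector setting or API): the user configures only an upper limit, and the actual stream count is derived at backup time from the selected devices' concurrency and the number of files the backup picked up.

```python
# Hypothetical helper: derive the stream count from a configured upper
# limit, the selected devices' concurrency slots, and the work to do.
def effective_streams(max_streams: int,
                      device_slots: list[int],
                      num_files: int) -> int:
    """Never exceed the configured limit, the total device concurrency,
    or the number of files (a small incremental should not spin up
    streams that would sit idle)."""
    capacity = sum(device_slots)          # e.g. [3] -> one device, concurrency 3
    return max(1, min(max_streams, capacity, num_files))

# With the situations described in the comment above:
print(effective_streams(5, [3], 100_000))   # -> 3: devices limit the streams
print(effective_streams(5, [4, 4], 2))      # -> 2: tiny incremental, fewer streams
```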

  • Hi Jim,

    This is true. I know how challenging it can be to configure multiple streams manually. File systems can have all kinds of structures, with no folders or thousands of folders in the root. While multi-streaming a block-based backup is relatively simple to implement (get the size, process pieces of similar size in parallel), we're looking for ways to optimize a traditional file-system-based backup. This means we have to work with the directory structures the Disk Agent can find.

    I would try to get the overall volume usage first, then fetch the top-level folders and their usage. From there, decide whether the 2nd/3rd-level folders need to be considered or not. Each folder is processed as a stream, where we can read data using a single thread or multiple threads. (A rough sketch of this approach follows below.)

    Regards,
    Sebastian Koehler
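
A rough sketch of the folder-driven split described above, with hypothetical helpers (folder_size, split_points) and a naive size scan standing in for the cached usage data real code would use: top-level folders become streams, and only folders too large to form one balanced stream are split further, down to a limited depth.

```python
# Hypothetical sketch, not the actual Disk Agent logic.
import os

def folder_size(path: str) -> int:
    """Total size of all files under path (a full walk; real code would
    consult cached volume usage instead of rescanning)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass
    return total

def split_points(volume: str, target_streams: int, max_depth: int = 3):
    """Return (directory, mode) pairs, each processed as one stream."""
    total = folder_size(volume)
    limit = total / target_streams   # a folder larger than this is split further
    pending = [(volume, 0)]
    streams = []
    while pending:
        path, depth = pending.pop()
        size = folder_size(path) if depth else total
        subdirs = [e.path for e in os.scandir(path)
                   if e.is_dir(follow_symlinks=False)]
        if size > limit and subdirs and depth < max_depth:
            pending.extend((d, depth + 1) for d in subdirs)
            # files sitting directly in this folder still need a stream
            streams.append((path, "files-only"))
        else:
            streams.append((path, "recursive"))
    return streams
```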

  • Don't rely on folder structure for making the automatic splits. I've seen some file systems with no subfolders that contain thousands of files amounting to hundreds of GB. It's usually some wonky imaging app that stupidly dumps all output into one folder with no option of dividing into subfolders by job, date, etc.
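
For exactly that flat-directory case, a file-level split is the fallback. A minimal sketch with hypothetical names: ignore the folder structure entirely and balance one directory's files across streams by size, placing the largest files first onto the currently lightest stream.

```python
# Hypothetical sketch: size-balanced file partitioning for a flat directory.
import heapq
import os

def partition_files(directory: str, num_streams: int) -> list[list[str]]:
    """Assign each file to the stream with the fewest assigned bytes."""
    sized = [(e.stat().st_size, e.path)
             for e in os.scandir(directory) if e.is_file()]
    sized.sort(reverse=True)                      # big files placed first
    heap = [(0, i) for i in range(num_streams)]   # (assigned bytes, stream id)
    heapq.heapify(heap)
    buckets: list[list[str]] = [[] for _ in range(num_streams)]
    for size, path in sized:
        load, stream = heapq.heappop(heap)
        buckets[stream].append(path)
        heapq.heappush(heap, (load + size, stream))
    return buckets
```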