Issue with gateway selection when creating Deduplication Store

When trying to configure (add) a Deduplication Store device on a new server, I could create the store, but when I have to select a "client" for the gateway, the machine hosting the store isn't listed; instead, machines that still run StoreOnce are listed. I don't understand this, but I'm new to Deduplication Store. It also seems I cannot continue creating the device unless I add at least one gateway. Is this a software bug, or a misunderstanding on my side?

This is for Data Protector 24.4, and the machine that is to host the Deduplication Store has the Deduplication Store and the Disk Agent components installed; does it need a Media Agent, too? I thought the Deduplication Store is a kind of Media Agent.

  • Suggested Answer

    0  

    The B2D gateway is indeed included in the Media Agent. That's why you see the systems hosting your current StoreOnce gateways listed: those systems do have a Media Agent installed. This is the same for all B2D devices.

  • Suggested Answer

    0  

    To expand a little more on this: the "Deduplication Store" and the "StoreOnce Software Deduplication" packages are basically the engines, the servers. In addition to that you need one or more gateways, either on the same host or on one or more different hosts. That's the same for both software implementations (DPD and SOS) as well as for all supported hardware deduplication devices. And that gateway is included in a Media Agent.

  • 0 in reply to   

    Is there some concept overview relating "Backup to disk" (StoreOnce, Deduplication Store, etc.), "Gateway", and "Media Agent"?

    Does the Media Agent connect to the B2D device via a gateway, or what is the relation? When configuring a backup, it seems the B2D device takes the role of a library, while the gateway seems to have the role of a "drive" inside the library, so I don't really see a Media Agent there.

    The other point is: when I add the "Deduplication Store" component and that component needs a Media Agent, why isn't the Media Agent selected automatically?

    I also noticed that adding a component reinstalled the core components even though they were already present and current (and an upgrade always seems to replace the software even if it is current, but that's another topic).

  • 0 in reply to 

    While we are on it: I also think it would be state of the art if "Add component" showed which components are already installed when selecting the component to add. As it is now, the complete set of components is updated (or added if actually new). Most software management is much more clever than this.

  • Suggested Answer

    0   in reply to   

    You are right in saying that, from a DP perspective, the B2D device corresponds to a physical library and the gateway corresponds to a physical drive. It is not true that a MA is automatically required on the same system as the B2D software deduplication component (either DPD or SOS). The gateway (included in the MA) can be on any system in the cell, so an automatic selection is not appropriate. (A small sketch of these relations follows below.)
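
    Purely as an illustration of those relations, here is a minimal Python sketch; it is not a Data Protector API, and the class and host names are invented for the example. It only models that the B2D device plays the role of a library, each gateway plays the role of a drive, a gateway lives in the Media Agent of some cell client, and the deduplication engine host itself does not need a Media Agent.

    ```python
    # Toy data model of the relations above; invented names, not a DP API.
    from dataclasses import dataclass, field
    from typing import List


    @dataclass
    class Client:
        hostname: str
        has_media_agent: bool = False      # the gateway code ships with the Media Agent


    @dataclass
    class Gateway:                         # plays the role of a "drive"
        name: str
        host: Client

        def __post_init__(self):
            # A gateway may sit on any client in the cell, as long as a MA is installed there.
            assert self.host.has_media_agent, "gateway host needs a Media Agent"


    @dataclass
    class B2DDevice:                       # plays the role of a "library"
        name: str
        store_host: Client                 # where the deduplication engine (DPD/SOS) runs
        gateways: List[Gateway] = field(default_factory=list)


    if __name__ == "__main__":
        dedup_host = Client("dedup01")                        # engine only; no MA required here
        app_host = Client("app01", has_media_agent=True)      # has a MA, so it can host a gateway
        device = B2DDevice("DPD-Store", dedup_host, [Gateway("gw_app01", app_host)])
        print(device)
    ```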

  • 0 in reply to   

    While the gateway may be on any host, my guess was that having a local gateway would avoid needless network traffic. I'd still like to see some schematic overview of the data flows. The CM, the system to back up, the gateway, and the deduplication store are four components, and I'd like to avoid any useless data exchange between hosts if possible.

  • Suggested Answer

    0   in reply to   

    I assume the general concept of interaction between CS, DA and MA is known. The additional link in a B2D device scenario is between the gateway (MA) and the B2D device itself. This could be a fiber link (only with hardware devices) or a network link.

    When you are talking about a "local gateway", I'm assuming you mean a gateway residing on the DPD or SOS system. That will indeed avoid the additional network traffic between the deduplication server and the gateway. There is, however, another aspect to keep in mind, and that's the CPU required for deduplication. By using a remote gateway, part of the resources required for deduplication moves to the remote gateway system. So it's not only about network bandwidth, but also about CPU power.

    In general we talk about low bandwidth and high bandwidth data transfers. A high bandwidth data transfer is established with a target-side gateway in Data Protector. In this case all data is transferred between gateway and device, and the deduplication happens fully on the device itself. A low bandwidth transfer is established using a server-side or source-side gateway. In this case the deduplication mainly happens on the gateway system, which basically means less network traffic to the device but more resources needed on the gateway system. The difference between source-side and server-side is that the first one is implicitly defined (it always runs on the DA system) while the second is explicitly defined (on a specific DP client). A rough sketch of the difference follows at the end of this post.

    Let's go back to the scenario of a software deduplication server (SOS or DPD). Having a gateway on the deduplication host itself may not always be the best choice, because the system, although well equipped, may still run out of resources easily. The most obvious choice may be to have the gateway on the DA host, but that only works when that host has enough resources. So in some cases it may help to have it on a remote system (server-side gateway), which offloads the work from the DA host (but means additional network traffic).
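
    To make the high vs. low bandwidth distinction a bit more concrete, here is a minimal, purely illustrative Python sketch; it is not Data Protector code, and the chunk size, SHA-1 hashing and the bytes_over_the_wire helper are just assumptions for the example. It only counts the chunk payload that would cross the network between gateway and device in each mode.

    ```python
    # Toy comparison: target-side (high bandwidth) vs. source-/server-side (low
    # bandwidth) gateway. Invented for illustration, not a Data Protector API.
    import hashlib

    CHUNK_SIZE = 4096  # simplified fixed-size chunking


    def chunk_hashes(stream: bytes):
        """Chunk the stream and hash each chunk (in reality this happens on the
        gateway for low bandwidth and on the device for high bandwidth)."""
        for i in range(0, len(stream), CHUNK_SIZE):
            c = stream[i:i + CHUNK_SIZE]
            yield hashlib.sha1(c).hexdigest(), c


    def bytes_over_the_wire(stream: bytes, device_index: set, low_bandwidth: bool) -> int:
        """Count the chunk payload a gateway would send to the device."""
        sent = 0
        for h, c in chunk_hashes(stream):
            if low_bandwidth and h in device_index:
                continue                  # device already has this chunk, only the hash is exchanged
            device_index.add(h)
            sent += len(c)                # chunk payload travels gateway -> device
        return sent


    if __name__ == "__main__":
        data = b"repeated block.\n" * 2500 + b"unique tail data " * 300   # highly redundant toy stream
        print("target-side (high bandwidth) bytes sent:", bytes_over_the_wire(data, set(), False))
        print("source-/server-side (low bw) bytes sent:", bytes_over_the_wire(data, set(), True))
    ```

    On a redundant stream like this, far fewer bytes reach the device in the low bandwidth case, which is exactly the trade-off described above: less network traffic to the device, more work (hashing, and in reality also compression) on the gateway.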

  • Verified Answer

    +1 in reply to   

    It's still a bit confusing: I always thought the deduplication is actually done on the deduplication device (because only the device can do it), but it seems the gateways also do deduplication, so I wonder what the actual deduplication device does then (especially considering its fragile nature: it's not transactional, and it seems easy to lose data).

    Another aspect is configuration and complexity: if you have multiple hosts but few backup devices, it seems more attractive to configure the gateways for the devices (target-side) than to configure one or more gateways on each host. So far we have not really experienced a CPU bottleneck (modern CPUs have plenty of cores; even my laptop has 32 logical CPUs these days, and a server with a lot of RAM may end up with 100 logical CPUs or even more). I'm still waiting to see an application that makes reasonable use of all those CPUs ("reasonable" because one can always keep the CPUs busy with useless work; sometimes algorithms are just highly inefficient). Talking of that, I think modern software should actually be able to diagnose performance bottlenecks and recommend configuration changes for better performance where needed.

    Say you back up 10 servers to one deduplication store using 10 gateways on those servers: will the gateways coordinate which one is allowed to write to the store, and will the store be responsible for deduplicating the "already deduplicated" streams from each gateway? I mean, a gateway can deduplicate the data it sends to the device from the particular server (does it consider what's on the device already?), but only the device can deduplicate the data streams across individual servers. Or did I misunderstand something?

  • Verified Answer

    +1   in reply to   

    An important detail around deduplication is that the deduplication engine is not only available on the deduplication server, but is also built into the APIs provided by the dedupe device vendors and into our Media Agent serving as a gateway. The low bandwidth data transfer I already mentioned is not our invention, but rather a general deduplication concept. We provide two different gateway types for low bandwidth (source- and server-side), but the only difference is the way they are defined. If you use a server-side gateway on your application or DA host, then you have exactly the same as with a source-side gateway.

    And yes, the way deduplication works is: it buffers the data, chunks and hashes it, looks for a match on the device, compresses the chunks that are not available yet and sends them over to the store. In the high bandwidth case, the only thing done on the gateway is buffering the data and sending over the full blocks; all other tasks occur on the device itself. In the low bandwidth case, most of the tasks happen on the gateway. The device is basically contacted for matching the hash list and for storing the missing chunks, which are already compressed on the gateway side (roughly sketched at the end of this post). Note again that this is all general deduplication knowledge, not specific to Data Protector.

    I believe most of your questions should be clear if the above is understood.
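
    To connect that description to the 10-servers question: in this picture the hash list is always matched against the device's one chunk index, so a chunk that already reached the store via any gateway is not sent again by another. Below is a rough, purely illustrative sketch of the low bandwidth sequence (buffer, chunk, hash, match on the device, compress and send only the new chunks); DedupDevice, gateway_send and the fixed-size chunking are invented for the example, this is not Data Protector code.

    ```python
    # Rough sketch of the low bandwidth sequence: buffer, chunk, hash, match the
    # hash list on the device, compress and send only unknown chunks. Toy code.
    import hashlib
    import zlib


    def chunks(stream: bytes, size: int = 4096):
        # Toy fixed-size chunking; real dedup engines use variable-size chunks.
        for i in range(0, len(stream), size):
            yield stream[i:i + size]


    class DedupDevice:
        """One store with a single chunk index shared by all gateways."""
        def __init__(self):
            self.stored = {}                       # hash -> compressed chunk

        def match(self, hashes):
            """Return the hashes the device does not know yet."""
            return [h for h in hashes if h not in self.stored]

        def write(self, h, compressed_chunk):
            self.stored[h] = compressed_chunk


    def gateway_send(stream: bytes, device: DedupDevice) -> int:
        """What one low bandwidth gateway roughly does; returns number of chunks sent."""
        buffered = {hashlib.sha256(c).hexdigest(): c for c in chunks(stream)}  # buffer + chunk + hash
        new = device.match(buffered)                                           # match hash list on the device
        for h in new:
            device.write(h, zlib.compress(buffered[h]))                        # compress + send new chunks only
        return len(new)


    if __name__ == "__main__":
        store = DedupDevice()
        shared = b"".join(hashlib.sha256(str(i).encode()).digest() for i in range(1000))  # ~32 KB common data
        print("gateway A sent", gateway_send(shared + b"only on server A " * 200, store), "chunks")
        print("gateway B sent", gateway_send(shared + b"only on server B " * 200, store), "chunks")
    ```

    The second gateway ends up sending only the chunks that are really new, because the match is done against what is already in the store, not only against that gateway's own stream.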

  • 0 in reply to   

    Yes, I probably should have a better background, but OTOH I'm old enough to remember that once (in the good old days of hand-written documentation) there was a "Concepts Guide" where (as the name suggests) the concepts were explained. I'm glad I did not delete it; it's still valuable.