Issue with gateway selection when creating Deduplication Store

When trying to configure (add) a Deduplication Store device on a new server, I could create the store, but when I have to select a "client" for the gateway, the machine hosting the store isn't listed; instead, machines that still run StoreOnce are listed. I don't understand why, but I'm new to Deduplication Store. It also seems I cannot continue creating the device unless I add at least one gateway. Is this a software bug, or a misunderstanding on my side?

This is for Data Protector 24.4, and the machine that will host the Deduplication Store has the Deduplication Store and the Disk Agent components installed; does it need a Media Agent, too? I thought the Deduplication Store is a kind of Media Agent.

  • Suggested Answer


    To extend a little more on this ... The "Deduplication Store" and the "StoreOnce Software Deduplication" packages are basically the engines, the servers. In addition to that you need one or more gateways, either on the same host or on one or more different hosts. That is the same for both software implementations (DPD and SOS) as well as for all supported hardware deduplication devices. And that gateway is included in a Media Agent.
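
    For illustration, here is a minimal Python sketch of that relationship as I would model it (the class and host names are invented for this example, they are not the actual Data Protector API): the store package is the engine on one host, and the device needs at least one gateway, each of which is simply a client with a Media Agent.

    ```python
    from dataclasses import dataclass, field

    @dataclass
    class Client:
        hostname: str
        components: set  # e.g. {"Disk Agent", "Media Agent", "Deduplication Store"}

    @dataclass
    class B2DDevice:
        name: str
        store_host: Client                            # runs the Deduplication Store engine
        gateways: list = field(default_factory=list)  # each gateway lives in a Media Agent

        def add_gateway(self, client: Client) -> None:
            # The gateway is part of the Media Agent, so the client must have one installed.
            if "Media Agent" not in client.components:
                raise ValueError(f"{client.hostname} needs a Media Agent to host a gateway")
            self.gateways.append(client)

        def validate(self) -> None:
            # Same rule the GUI enforces: at least one gateway is required.
            if not self.gateways:
                raise ValueError("a B2D device needs at least one gateway")

    # Hypothetical hosts, just to show why the store host may not be selectable:
    store_host = Client("dedup01.example.com", {"Disk Agent", "Deduplication Store"})
    device = B2DDevice("DedupStore_1", store_host)
    try:
        device.add_gateway(store_host)   # fails: no Media Agent on the store host yet
    except ValueError as err:
        print(err)
    ```

    That would also explain the behaviour in the original question: the store host should only show up as a gateway candidate once a Media Agent is installed on it.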

    Although I am an OpenText employee, I am speaking for myself and not for OpenText.

  • Suggested Answer


    I assume the general concept of the interaction between CS, DA (Disk Agent) and MA (Media Agent) is known. The additional link in a B2D (backup to disk) device scenario is between the gateway (MA) and the B2D device itself. This can be a fiber link (only with hardware devices) or a network link.

    When you are talking about a "local gateway", I assume you mean a gateway residing on the DPD or SOS system. That will indeed avoid the additional network traffic between the deduplication server and the gateway. There is, however, another aspect to keep in mind, and that's the CPU required for deduplication. By using a remote gateway, part of the resources required for deduplication moves to the remote gateway system. So it's not only about network bandwidth, but also about CPU power.

    In general we talk about low-bandwidth and high-bandwidth data transfers. A high-bandwidth transfer is established with a target-side gateway in Data Protector; in this case all data is transferred between the gateway and the device, and the deduplication happens entirely on the device itself. A low-bandwidth transfer is established with a server-side or source-side gateway; in this case the deduplication happens mainly on the gateway system, which means less network traffic to the device but more resources needed on the gateway system. The difference between source-side and server-side is that the first is defined implicitly (it always runs on the DA system) while the second is defined explicitly (on a specific DP client).
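
    To make the low-bandwidth vs. high-bandwidth distinction concrete, here is a rough Python sketch (the chunking, hashing and method names are simplified inventions, not the real protocol): with a target-side gateway every chunk crosses the network and the device does the work, while with a server-side or source-side gateway the gateway hashes locally and only sends chunks the device does not already have.

    ```python
    import hashlib

    CHUNK_SIZE = 64 * 1024  # simplified fixed-size chunking; real deduplication is more elaborate

    class InMemoryDevice:
        """Stand-in for the deduplication store, for this sketch only."""
        def __init__(self):
            self.chunks = {}      # digest -> chunk data
            self.references = []  # digests referenced by backup objects

        def store_raw(self, chunk: bytes) -> None:
            digest = hashlib.sha256(chunk).hexdigest()   # device spends the CPU on hashing
            self.chunks.setdefault(digest, chunk)
            self.references.append(digest)

        def has_chunk(self, digest: str) -> bool:
            return digest in self.chunks

        def store_chunk(self, digest: str, chunk: bytes) -> None:
            self.chunks[digest] = chunk
            self.references.append(digest)

        def add_reference(self, digest: str) -> None:
            self.references.append(digest)

    def chunks_of(data: bytes):
        for i in range(0, len(data), CHUNK_SIZE):
            yield data[i:i + CHUNK_SIZE]

    def high_bandwidth_backup(data: bytes, device: InMemoryDevice) -> None:
        """Target-side gateway: ship everything, the device deduplicates."""
        for chunk in chunks_of(data):
            device.store_raw(chunk)                      # full data volume crosses the network

    def low_bandwidth_backup(data: bytes, device: InMemoryDevice) -> None:
        """Server-side/source-side gateway: hash locally, send only unknown chunks."""
        for chunk in chunks_of(data):
            digest = hashlib.sha256(chunk).hexdigest()   # CPU cost lands on the gateway host
            if device.has_chunk(digest):
                device.add_reference(digest)             # only a small reference goes out
            else:
                device.store_chunk(digest, chunk)
    ```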

    Let's go back to the scenario of a software deduplication server (SOS or DPD). Having a gateway on the deduplication host itself may not always be the best choice, as that system, even when well equipped, can still run out of resources easily. The most obvious choice may be to have the gateway on the DA host, but that only works when that host has enough resources. So in some cases it may help to have it remote (a server-side gateway), which offloads the work from the DA host (but means additional network traffic).
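
    Put as a crude, purely illustrative rule of thumb (the function and its inputs are made up, this is not a product feature), the trade-off described above could be summarized like this:

    ```python
    def suggest_gateway_placement(da_host_has_spare_cpu: bool,
                                  network_to_gateway_is_cheap: bool) -> str:
        """Illustrative only: mirrors the trade-offs discussed above."""
        if da_host_has_spare_cpu:
            # Most obvious choice: deduplicate where the data is read.
            return "source-side gateway on the DA host"
        if network_to_gateway_is_cheap:
            # Offload the deduplication CPU to another client, accepting extra network traffic.
            return "server-side gateway on a separate DP client"
        # Fall back to a gateway on the deduplication host, if it really has resources to spare.
        return "gateway on the deduplication host itself"
    ```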

    Although I am an OpenText employee, I am speaking for myself and not for OpenText.

  • Verified Answer


    It's still a bit confusing: I always thought the deduplication is actually done on the deduplication device (because only the device can do it), but it seems the gateways also deduplicate, so I wonder what the actual deduplication device does then, especially considering its fragile nature (it is not transactional, and it seems easy to lose data).

    Another aspect is configuration and complexity: if you have multiple hosts but few backup devices, it seems more attractive to configure the gateways for the devices (target-side) instead of configuring one or more gateways on each host. So far we have not really experienced a CPU bottleneck (modern CPUs have plenty of cores; even my laptop has 32 logical CPUs these days, and a server with a lot of RAM may end up with 100 logical CPUs or more). I'm still waiting to see an application that makes reasonable use of all those CPUs ("reasonable" because one can always keep the CPUs busy with useless work; sometimes algorithms are just highly inefficient). Talking of that, I think modern software should be able to diagnose performance bottlenecks and recommend configuration changes for better performance when needed.

    Say you back up 10 servers to one deduplication store using 10 gateways on the servers: will the gateways control which one is allowed to write to the store, and will the store be responsible for deduplicating the "already deduplicated" streams from each gateway? I mean, a gateway can deduplicate the data it sends to the device from its particular server (does it consider what's already on the device?), but only the device can deduplicate the data streams across individual servers. Or did I misunderstand something?

