I have a Netbackup Environment with a Virtual Master Server running 7506 and a 5230 Appliance running 2.5.2 (7505). There are a small number of Physical clients (Windows and Linux) and many VMware clients.
Whilst most of the VMware clients (ESX5) are working successfully (120+) I have two which are giving major problems. The ESX LUNs are presented to the 5230 over SAN and we are authenticating to the vCenter server to control backups. Policy type is VMWARE.
The two servers in question have a large number of files, over 1TB total size each.
One of the servers, following the backup being initiaited, performs the VMware snapshot, starts the backup job and then sits there for some time before failing with a status 40 - Network Connection Broken. Which network conection ? As this is a VMware policy type I am not talking directly to the VM, it is being snapped and then presented to the 5230 as a local image. I have extended the client read timeouts on both the master and 5230 appliance with no success.
This job has run successfully in the past numerous times - both FULL and DIFF.
Failed Job log:
10/09/2013 10:53:38 - Info nbjm(pid=3704) starting backup job (jobid=4321) for client DUB-APPS-001, policy Daily_18.30__Weekly_00.01__Monthly_00.01__VMware, schedule DAILY
10/09/2013 10:53:38 - estimated 1388605664 Kbytes needed
10/09/2013 10:53:38 - Info nbjm(pid=3704) started backup (backupid=DUB-APPS-001_1378806818) job for client DUB-APPS-001, policy Daily_18.30__Weekly_00.01__Monthly_00.01__VMware, schedule DAILY on storage unit DUB-5230-DeDup
10/09/2013 10:53:39 - started process bpbrm (29973)
10/09/2013 10:53:41 - Info bpbrm(pid=29973) DUB-APPS-001 is the host to backup data from
10/09/2013 10:53:41 - Info bpbrm(pid=29973) reading file list from client
10/09/2013 10:53:41 - Info bpbrm(pid=29973) starting bpbkar on client
10/09/2013 10:53:41 - Info bpbkar(pid=29997) Backup started
10/09/2013 10:53:41 - Info bpbrm(pid=29973) bptm pid: 29998
10/09/2013 10:53:41 - Info bptm(pid=29998) start
10/09/2013 10:53:41 - connecting
10/09/2013 10:53:41 - connected; connect time: 00:00:00
10/09/2013 10:53:42 - Info bptm(pid=29998) using 1048576 data buffer size
10/09/2013 10:53:42 - Info bptm(pid=29998) using 64 data buffers
10/09/2013 10:53:42 - Info bptm(pid=29998) start backup
10/09/2013 10:54:25 - begin writing
10/09/2013 12:01:17 - Error bptm(pid=29998) media manager terminated by parent process
10/09/2013 12:01:24 - Info dub-5230(pid=29998) StorageServer=PureDisk:dub-5230; Report=PDDO Stats for (dub-5230): scanned: 3 KB, CR sent: 0 KB, CR sent over FC: 0 KB, dedup: 100.0%
10/09/2013 12:01:25 - Info bpbkar(pid=0) done. status: 40: network connection broken
10/09/2013 12:01:25 - end writing; write time: 01:07:00
network connection broken(40)
10/09/2013 12:11:28 - Info bpbrm(pid=11144) DUB-APPS-001 is the host to backup data from
10/09/2013 12:11:28 - Info bpbrm(pid=11144) reading file list from client
10/09/2013 12:11:28 - Info bpbrm(pid=11144) starting bpbkar on client
10/09/2013 12:11:28 - Info bpbkar(pid=11151) Backup started
10/09/2013 12:11:28 - Info bpbrm(pid=11144) bptm pid: 11152
10/09/2013 12:11:28 - Info bptm(pid=11152) start
10/09/2013 12:11:29 - Info bptm(pid=11152) using 1048576 data buffer size
10/09/2013 12:11:29 - Info bptm(pid=11152) using 64 data buffers
10/09/2013 12:11:29 - Info bptm(pid=11152) start backup
Following this the job just sits there doing nothing.
Successful Job Log:
07/09/2013 02:59:09 - Info nbjm(pid=3984) starting backup job (jobid=4051) for client DUB-APPS-001, policy Daily_18.30__Weekly_00.01__Monthly_00.01__VMware, schedule MONTHLY
07/09/2013 02:59:09 - estimated 0 Kbytes needed
07/09/2013 02:59:09 - Info nbjm(pid=3984) started backup (backupid=DUB-APPS-001_1378519149) job for client DUB-APPS-001, policy Daily_18.30__Weekly_00.01__Monthly_00.01__VMware, schedule MONTHLY on storage unit DUB-5230-DeDup
07/09/2013 02:59:11 - started process bpbrm (7231)
07/09/2013 02:59:12 - Info bpbrm(pid=7231) DUB-APPS-001 is the host to backup data from
07/09/2013 02:59:12 - Info bpbrm(pid=7231) reading file list from client
07/09/2013 02:59:12 - connecting
07/09/2013 02:59:13 - Info bpbrm(pid=7231) starting bpbkar on client
07/09/2013 02:59:13 - Info bpbkar(pid=7238) Backup started
07/09/2013 02:59:13 - Info bpbrm(pid=7231) bptm pid: 7239
07/09/2013 02:59:13 - Info bptm(pid=7239) start
07/09/2013 02:59:13 - Info bptm(pid=7239) using 1048576 data buffer size
07/09/2013 02:59:13 - Info bptm(pid=7239) using 64 data buffers
07/09/2013 02:59:13 - connected; connect time: 00:00:01
07/09/2013 02:59:14 - Info bptm(pid=7239) start backup
07/09/2013 02:59:51 - begin writing
07/09/2013 04:00:53 - Info bpbkar(pid=7238) INF - Transport Type = san
07/09/2013 20:25:04 - Info bpbkar(pid=7238) bpbkar waited 124 times for empty buffer, delayed 77660 times
07/09/2013 20:25:04 - Info bptm(pid=7239) waited for full buffer 848100 times, delayed 3160597 times
07/09/2013 20:25:29 - Info bptm(pid=7239) EXITING with status 0 <----------
07/09/2013 20:25:29 - Info dub-5230(pid=7239) StorageServer=PureDisk:dub-5230; Report=PDDO Stats for (dub-5230): scanned: 1148749424 KB, CR sent: 57710289 KB, CR sent over FC: 0 KB, dedup: 95.0%
07/09/2013 20:25:29 - Info bpbrm(pid=7231) validating image for client DUB-APPS-001
07/09/2013 20:25:29 - end writing; write time: 17:25:38
07/09/2013 20:25:30 - Info bpbkar(pid=7238) done. status: 0: the requested operation was successfully completed
the requested operation was successfully completed(0)
The other server is more promising, but runs extremely slowly and would take days to finish if let run. This job has run successfully (FULL) twice and secured 1.6TB in 14hrs.
Failed Job Log:
09/09/2013 16:00:45 - Info nbjm(pid=3704) starting backup job (jobid=4227) for client DUB-PTFILE-P01, policy AEGON-PROD-VMWARE-ADHOC, schedule MONTHLY
09/09/2013 16:00:45 - estimated 0 Kbytes needed
09/09/2013 16:00:45 - Info nbjm(pid=3704) started backup (backupid=DUB-PTFILE-P01_1378738845) job for client DUB-PTFILE-P01, policy AEGON-PROD-VMWARE-ADHOC, schedule MONTHLY on storage unit DUB-5230-DeDup
09/09/2013 16:00:46 - started process bpbrm (459)
09/09/2013 16:00:48 - Info bpbrm(pid=459) DUB-PTFILE-P01 is the host to backup data from
09/09/2013 16:00:48 - Info bpbrm(pid=459) reading file list from client
09/09/2013 16:00:48 - Info bpbrm(pid=459) starting bpbkar on client
09/09/2013 16:00:48 - Info bpbkar(pid=467) Backup started
09/09/2013 16:00:48 - Info bpbrm(pid=459) bptm pid: 468
09/09/2013 16:00:48 - connecting
09/09/2013 16:00:48 - connected; connect time: 00:00:00
09/09/2013 16:00:49 - Info bptm(pid=468) start
09/09/2013 16:00:49 - Info bptm(pid=468) using 1048576 data buffer size
09/09/2013 16:00:49 - Info bptm(pid=468) using 64 data buffers
09/09/2013 16:00:50 - Info bptm(pid=468) start backup
09/09/2013 16:00:51 - begin writing
10/09/2013 10:42:07 - Error bptm(pid=468) media manager terminated by parent process
10/09/2013 10:42:14 - Info dub-5230(pid=468) StorageServer=PureDisk:dub-5230; Report=PDDO Stats for (dub-5230): scanned: 3636227 KB, CR sent: 342397 KB, CR sent over FC: 0 KB, dedup: 90.6%
10/09/2013 10:42:15 - Info bpbkar(pid=0) done
10/09/2013 10:42:15 - Info bpbkar(pid=0) done. status: 150: termination requested by administrator
10/09/2013 10:42:15 - end writing; write time: 18:41:24
termination requested by administrator(150)
Successful job log:
10/08/2013 00:13:37 - Info nbjm(pid=3836) starting backup job (jobid=1681) for client DUB-PTFILE-P01, policy Daily_None__Weekly_00.01__Monthly_00.01__VMware, schedule WEEKLY
10/08/2013 00:13:37 - estimated 0 Kbytes needed
10/08/2013 00:13:37 - Info nbjm(pid=3836) started backup (backupid=DUB-PTFILE-P01_1376090017) job for client DUB-PTFILE-P01, policy Daily_None__Weekly_00.01__Monthly_00.01__VMware, schedule WEEKLY on storage unit DUB-5230-DeDup
10/08/2013 00:13:38 - started process bpbrm (23623)
10/08/2013 00:13:40 - Info bpbrm(pid=23623) DUB-PTFILE-P01 is the host to backup data from
10/08/2013 00:13:40 - Info bpbrm(pid=23623) reading file list from client
10/08/2013 00:13:40 - connecting
10/08/2013 00:13:40 - connected; connect time: 00:00:00
10/08/2013 00:13:41 - Info bpbrm(pid=23623) starting bpbkar on client
10/08/2013 00:13:41 - Info bpbkar(pid=23645) Backup started
10/08/2013 00:13:41 - Info bpbrm(pid=23623) bptm pid: 23648
10/08/2013 00:13:41 - Info bptm(pid=23648) start
10/08/2013 00:13:41 - Info bptm(pid=23648) using 1048576 data buffer size
10/08/2013 00:13:41 - Info bptm(pid=23648) using 64 data buffers
10/08/2013 00:13:42 - Info bptm(pid=23648) start backup
10/08/2013 00:13:51 - begin writing
10/08/2013 08:09:45 - Info bpbkar(pid=23645) INF - Transport Type = san
10/08/2013 13:56:30 - Info bpbkar(pid=23645) bpbkar waited 389820 times for empty buffer, delayed 1140862 times
10/08/2013 13:56:30 - Info bptm(pid=23648) waited for full buffer 107426 times, delayed 1894435 times
10/08/2013 13:59:14 - Info bptm(pid=23648) EXITING with status 0 <----------
10/08/2013 13:59:15 - Info dub-5230(pid=23648) StorageServer=PureDisk:dub-5230; Report=PDDO Stats for (dub-5230): scanned: 1617774661 KB, CR sent: 187714934 KB, CR sent over FC: 0 KB, dedup: 88.4%
10/08/2013 13:59:16 - Info bpbrm(pid=23623) validating image for client DUB-PTFILE-P01
10/08/2013 13:59:17 - end writing; write time: 13:45:26
10/08/2013 13:59:18 - Info bpbkar(pid=23645) done. status: 0: the requested operation was successfully completed
the requested operation was successfully completed(0)
I note in both of the above FAILED joblogs we never get the message INF - Transport Type = san, whereas we do in the successful jobs.
I am unaware of any other changes in the environment which have been made - and in either case why would it affect only these two VM clients ? As a test I have now installed the 7506 agent on each of the VM's and am running an MS-Windows policy typw with Accelerator enabled and Client Side Dedup to test if the job will complete in that manner. However - I do not want to progress this way, and both jobs worked perfectly previously ??
Any pointers ?
AJ