Parallel Concurrent Processing
Mike Swing, TruTek
RMOUG 2009

Conclusions
• You don’t need RAC to use Parallel Concurrent Processing (PCP)!
• If you have PCP enabled, secondary nodes must be defined during the upgrade to R12.
• Tuning of TCP, SQLNet and PMON parameters can minimize PCP failover time.
• Implement Failover Sensitive Workshifts.

Concurrent Processing Server
Allows scheduling of jobs – batch jobs, or Requests in Oracle terms. Processes concurrent programs as a Request.
Requests can be grouped together into Request Sets. Different types of concurrent managers handle different types of requests. A concurrent program can be assigned to a responsibility, and that responsibility can be assigned to users, allowing them the permission to run the concurrent program. Concurrent managers may have limits on the concurrent programs that can be run, and the times that they can be started. Requests have priorities, status and log and out files in the above directory.

Definitions
• CP => Concurrent Processing
• DCD => Dead Connection Detection
• ICM => Internal Concurrent Manager
• IM => Internal Monitor
• CRM => Conflict Resolution Manager
• PCP => Parallel Concurrent Processing
• PMON => Process Monitor for ICM

Concurrent Request

Phase and Status of Concurrent Requests
Phase      Status      Description – Action
Pending    Normal      The request is waiting to be picked up by the next available manager.
Pending    Standby     Waiting for CRM to resolve conflict. CRM could be slow or an incompatible program is running.
Running    Normal      The request is running normally.
Completed  Normal      The request has finished successfully.
Completed  Error       The request has finished with an error. Check logs.
Completed  Warning     The request has finished with a Warning. Check the logs.
Inactive   No Manager  Request won’t run without a manager. Specialization rules aren’t configured properly.

PCP Failover
[Diagram labels: DB Node – RH8, Database, Database Listener, sqlnet.ora; nodes RH7, RH8 and RH9, each running PCP with a SQL*Net client]
• TCP_KEEPALIVE takes 240 seconds before issuing DCD

Concurrent Managers
Manager Type                   Service Instance                       Program
Internal Concurrent Manager    Internal Manager                       FNDLIBR
Conflict Resolution Manager    Conflict Resolution Manager            FNDCRM
Internal Monitor               Internal Monitor: Node                 FNDIMON
                               Service Manager: Node                  FNDSM
Concurrent Manager             Standard Manager                       FNDLIBR
Concurrent Manager             Inventory Manager                      INVLIBR
Concurrent Manager             Session History Cleanup                FNDLIBR
Concurrent Manager             PA Streamline Manager                  PALIBR
Transaction Manager            CRP Inquiry Manager                    CYQLIB
Transaction Manager            FastFormula Transaction Manager        FFTM
Transaction Manager            PO Document Approval Manager           POXCON
Transaction Manager            Transaction Manager                    FNDTMTST
                               Scheduler/Prerelease Manager           FNDSVC
                               OAM Generic Collection Service: Node   FNDSVC

Concurrent Processing
1. The Concurrent Processing server communicates with the database using Oracle SQL*Net.
2. The concurrent program log or output file from a request is passed back as a report to the Report Review Agent.
3. The Report Review Agent passes a file containing the entire report to the forms server.
4. The Forms Services component passes the report back to the user’s browser one page at a time. Profile options can be used to control the size of the files and pages passed, to suit report volume and available network capacity.
[Diagram labels: Web Browser, HTML Interface, Web Server, Forms Server, Reports Server, JInitiator, JAVA Interface, SQL*Net, ICM (FNDLIBR), Service Manager (FNDSM), Internal Monitor (FNDIMON), Standard Manager (FNDLIBR), FNDCRM, Report Review Agent, .rdx, Requests, Log, Out]

Internal Concurrent Manager
• The Internal Concurrent Manager (ICM) starts, sets the number of active processes, monitors, and terminates all other concurrent processes through requests made to the Service Manager, including restarting any failed processes.
• The ICM also starts, stops, and restarts the Service Manager for each node.
• The ICM will perform process migration during an instance or node failure.
• The ICM will be active on a single node.
• This is also true in a PCP environment, where the ICM will be active on at least one node at all times.

Internal Concurrent Manager
• The ICM really does not have any scheduling responsibilities. It has NOTHING to do with scheduling requests, or deciding which manager will run a particular request.
The function of the ICM is to run ‘queue control’ requests; requests to startup or shutdown other managers.
• The ICM is responsible for startup and shutdown of the whole concurrent processing facility, and it monitors the other managers periodically, and restarts them if they should go down. It can also take over the Conflict Resolution manager’s job, and resolve incompatibilities.
• If the ICM itself should go down, requests will continue to run normally, except for ‘queue control’ requests. Restart the ICM with ‘startmgr’; no need to kill the other managers first.

Internal Concurrent Manager

Service Manager
FNDSM process – Communicates with the Internal Concurrent Manager, Concurrent Manager, and non-Manager Service processes.
The Service Manager (SM) spawns and terminates manager and service processes (these could be Forms, or Apache Listeners, Metrics or Reports Server, and any other process controlled through Generic Service Management).
• When the ICM terminates, the SM that resides on the same node with the ICM will also terminate.
• The SM is “chained” to the ICM. The SM will only reinitialize after termination when there is a function it needs to perform (start, or stop a process), so there may be periods of time when the SM is not active, and this would be normal.

Service Manager
• All processes initialized by the SM inherit the same environment as the SM.
• The SM’s environment is set by the APPSORA.env file and the gsmstart.sh script.
• The apps_ listener must be active on each CP node to support the SM connection to the local instance.
There should be a Service Manager active on each node where a Concurrent or non-Manager service process will reside.

FNDSM Failure
FNDSM failover as noted in the concurrent manager log:
    Could not contact Service Manager FNDSM_RH8_VIS. The TNS alias could not be located, the listener process on RH8 could not be contacted, or the listener failed to spawn the Service Manager process.
    Found dead process: spid=(962754), cpid=(2259578), Service Instance=(1045)
    CONC-SM TNS FAIL Call to PingProcess failed for WFMAILER
    CONC-SM TNS FAIL Call to StopProcess failed for WFMAILER
    CONC-SM TNS FAIL Call to PingProcess failed for FNDCPGSC

FNDSM Failover
    Found dead process: spid=(716870), cpid=(2259580), Service Instance=(2009)
    Found dead process: spid=(1442020), cpid=(2259579), Service Instance=(2010)
    Starting WFMGSMD Concurrent Manager : 15-AUG-2008 13:28:56
    Starting WFMGSMDB Concurrent Manager : 15-AUG-2008 13:28:56
    Starting WFALSNRSVCB Concurrent Manager : 15-AUG-2008 13:28:57
    Starting STANDARD Concurrent Manager : 15-AUG-2008 13:30:31
    Starting Internal Concurrent Manager Concurrent Manager : 15-AUG-2008 13:30:32

Internal Monitor
(FNDIMON process) – Communicates with the Internal Concurrent Manager.
• This manager/service is used to implement Parallel Concurrent Processing.
• You do not need to run this manager/service unless you are using Parallel Concurrent Processing.
The Internal Monitor (IM) monitors the Internal Concurrent Manager, and restarts any failed ICM on the local node. It monitors whether the ICM is still running, and if the ICM crashes, it will restart it on another node.
• During a node failure in a PCP environment the IM will restart the ICM on a surviving node (multiple ICMs may be started on multiple nodes, but only the first ICM started will eventually remain active; all others will gracefully terminate).
• There should be an Internal Monitor defined on each node where the ICM may migrate.

Standard Manager
• (FNDLIBR process) – Communicates with the Service Manager and any client application process.
The Standard Manager is a worker process that initiates and executes client requests on behalf of Applications batch and OLTP clients.

Standard Manager

Standard Manager – OAM
• The Standard Manager is active on RH9, even though no primary node is defined.
• Since no secondary node is defined, the Standard Manager will not fail over.
• “Failover Processes” in the Work Shifts definition are the number of processes that will run (3) when the Standard Manager fails over to the secondary node.
Transaction Manager
A Transaction Manager communicates with the Service Manager, and any user process initiated on behalf of Forms, or a Standard Manager request.
A Transaction Manager:
• Supports synchronous processing of requests from a client program.
• Gets a request from a client program to run a server-side program synchronously.
• Returns a status/results to the client program.
• At runtime, it starts a number of these managers as defined.
• Doesn’t poll the concurrent request table for a new request.
• Only 1 transaction manager is needed per database, not 1 per instance.

Transaction Managers
Some of the Transaction Managers in R12

Configuring Transaction Managers for RAC
• R11i Transaction Managers use DBMS_PIPE
  – This does not work across RAC instances
  – RAC users must perform additional configuration
    • Requires complicated configuration or additional hardware
• R12 Transaction Managers use AQ
  – Works across RAC Instances
  – Simplifies configuration
  – Reduces complexity
  – Profile Option can switch between mechanisms
    • DBMS_PIPE can be used for non-RAC users if performance becomes an issue

Configuring Transaction Managers for RAC
• Edit $ORACLE_HOME/dbs/_ifile.ora and add these parameters:
    _lm_global_posts=TRUE
    _immediate_commit_propagation=TRUE
• Change the profile option ‘Concurrent: TM Transport Type’ to ‘QUEUE’, and verify that the transaction manager works across the RAC instance.
• ATG RUP3 (4334965) or higher provides an option to use AQs in place of Pipes.
• Profile “Concurrent:TM Transport Type” set to QUEUE.
• Pipes are more efficient but require a Transaction Manager to be running on each DB Instance.
• Navigate to the Concurrent > Manager > Define screen, and set up the primary and secondary node names for transaction managers.

Configuring Transaction Managers for RAC
• Transaction Managers allow a client to make a request for a program to be run on the server immediately. The client then waits for the program to complete and can receive program results from the server. As the client and server are two separate database sessions, the communication between them has been handled using the DBMS_PIPE package.
Unfortunately the DBMS_PIPE package does not extend to communications between sessions on different RAC instances.
On an Applications instance using RAC, the client and server are very likely to be on different instances, causing transactions to wait for long periods before timing out, or to fail completely. The current workaround is to manually set up Transaction Managers to connect to all RAC instances, which not only takes up additional resources but may also require additional middle-tier hardware or a complicated configuration that is difficult to maintain.

R12 Transaction Managers
• In R12, the Transaction Managers use the AQ mechanism; the Transaction Managers work on RAC when connected to either instance.
• This greatly simplifies the configuration and reduces the complexity for RAC administrators. A Profile Option has been introduced to allow users to switch between the two transports, DBMS_PIPE or AQ.
Concurrent:PCP Instance Check
• Concurrent processing provides database instance-sensitive failover capabilities. When an instance is down, all managers connecting to it switch to a secondary middle-tier node.
• However, if you prefer to handle instance failover separately from such middle-tier failover (for example, using the TNS connection-time failover mechanism instead), use the profile option Concurrent:PCP Instance Check.
• When this profile option is set to OFF, Parallel Concurrent Processing will not provide database instance failover support; however, it will continue to provide middle-tier node failover support when a node goes down.

Conflict Resolution Manager
• Concurrent managers read requests to start concurrent programs. The Conflict Resolution Manager checks concurrent program definitions for incompatibility rules.
• If a program is identified as Run Alone, then the Conflict Resolution Manager prevents the concurrent managers from starting other programs in the same conflict domain.
• When a program lists other programs as being incompatible with it, the Conflict Resolution Manager prevents the program from starting until any incompatible programs in the same domain have completed running.
• To enable/disable the Conflict Resolution Manager, use the system profile option ‘Concurrent: Use ICM’. Setting this to ‘No’ (the default) allows the CRM to be started. Setting it to ‘Yes’ causes the CRM to be shut down, and the Internal Manager (ICM) will take over the conflict resolution duties. If the CRM will not start (it is started automatically by the ICM), check this profile option.

Conflict Resolution Manager
• Use the system profile option ‘Concurrent: Use ICM’. ‘No’ allows the CRM to be started.
• Setting it to ‘Yes’ causes the CRM to shut down. The Internal Manager (ICM) will take over the conflict resolution duties.
• Using the ICM to resolve conflicts is not recommended.
• The CRM’s sole purpose is to resolve conflicts, while the ICM has other functions to perform as well.
• Setting this option to ‘YES’ is not recommended.

Generic Service Management
An E-Business Suite system depends on a variety of services, such as Forms Listeners, HTTP Servers, Concurrent Managers, and Workflow Mailers. These services are composed of one or more processes. In the past, many of these processes had to be individually started and monitored by system administrators. Management of these processes is complicated, since these services can be distributed across multiple host machines. The introduction of Generic Service Management in Release 11i helped simplify the management of these processes by providing a fault tolerant service framework and a central management console built into Oracle Applications Manager.
Service Management is an extension of Concurrent Processing, and provides a framework for managing processes on multiple host machines. With Service Management, virtually any application tier service can be integrated into this framework. Patch 2221688 introduces GSM.

GSM

Generic Services

GSM and Multiple Nodes
• GSM enables users to manage Applications services across multiple middle-tier nodes.
• This includes services on Web/Forms nodes that previously have had no concurrent processing footprint.
• Users configuring GSM in a multiple-node system should be sure to have followed the instructions for Parallel Concurrent Processing. This includes setting the environment variable APPLDCP=ON and assigning a primary node for all defined managers and services (if not already defined).

Seeded GSM Services
When configuring GSM the following GSM Services are seeded automatically:
– Forms Listener
– Metrics Server
– Metrics Client
– Reports Server
– Apache Listener
LINUX users should not activate the Reports Server under GSM.

Starting GSM
[Diagram labels: Apps Listener: listener.ora; gsmstart.sh; exec FNDSM]

adcmctl.sh
adcmctl.sh calls:
– startmgr.sh
– batchmgr.sh
– CONCSUB
– FNDSVCRG

FNDSVCRG – Service Controller Utility
• FNDSVCRG is an executable introduced as a part of the Seeded GSM Services.
It provides improved coordination between the GSM monitoring of these services and their command-line control scripts.
• The $FND_TOP/bin/FNDSVCRG executable is called from the adcmctl.sh control script before and after the script starts or stops the service. FNDSVCRG connects to the database using JDBC and validates the configuration of the Seeded GSM Service.

Verify GSM
• To verify GSM is working, start the concurrent managers.
• Once GSM is enabled, the ICM uses Service Managers to start all concurrent managers and activated services.
• If the ICM is successfully starting the managers, then GSM has been configured properly.
• If managers and/or services fail to start, errors should appear in the ICM log file.

Service Manager Log
Each Service Manager maintains its own log file named FNDSMxxxx.mgr, located in the same directory as concurrent manager log files.
• If you cannot locate the Service Manager log file, it is likely that the Service Managers are not starting properly and there is a configuration issue that needs troubleshooting.

Kill FNDSM
Test – Kill services and see if GSM restarts them
    applvis 9007 1 0 11:53 ? 00:00:00 FNDSM
    applvis 9159 9155 0 11:55 ? 00:00:00 FNDLIBR
    applvis 9161 5683 0 11:55 pts/3 00:00:00 grep FND
    $ kill -9 9007
    $ ps -ef |grep FND
    applvis 9159 9155 0 11:55 ? 00:00:00 FNDLIBR
    applvis 9169 1 0 11:55 ? 0:00:00 FNDSM
    applvis 9249 5683 0 11:57 pts/3 00:00:00 grep FND
Kill FNDCRM
    $ ps -ef |grep FNDCRM
    applvis 8886 1 0 11:52 ? 00:00:00 FNDCRM APPS/ZGA13053E1E1B7BA773417089054DA88F194EAC0D687728CC2551870E6B7 8C4B439 EADB287342795115A88DBC85788CCB4 FND FNDCRM N 10 c LOCK Y RH9 1302318
    $ kill -9 8886
    $ ps -ef |grep FNDCRM
    applvis 9457 9392 0 12:09 ? 00:00:00 FNDCRM APPS/ZG26430816FA3570354BC57DE47FF105D145F8DE226EFE58CE04B416633D CB90126 7BFECFA7585114F7090060EFE1147BE FND FNDCRM N 10 c LOCK Y RH9 1302343
Both of these services were restarted before I could enter the grep command to find the corresponding process.

11i – Defining PCP Details
In Release 11i, the Secondary Node doesn’t need to be filled in for failover to occur.

R12 PCP Details
In Release 12, failover won’t occur if there is no Secondary Node defined.

R12 PCP Setup
The only Standard Manager set up to fail over is the “Standard Manager”.

R12 Manager Failover

PCP Failover
[Diagram labels: DB Node – RH8, Database, Database Listener, sqlnet.ora; nodes RH7, RH8 and RH9, each running PCP with a SQL*Net client]
• TCP_KEEPALIVE takes 240 seconds before issuing DCD

Parallel Concurrent Processing
• Parallel concurrent processing allows distribution of concurrent managers across multiple nodes. The benefits are better performance, availability and scalability (load balancing).
• Parallel Concurrent Processing (PCP) is activated along with Generic Service Management (GSM); it cannot be activated independently of GSM.
• With parallel concurrent processing implemented with GSM, the Internal Concurrent Manager (ICM) tries to assign valid nodes for concurrent managers and other service instances.

Parallel Concurrent Processing
• There should be only one ICM and one CRM active at any given time, although the ICM and CRM could be configured to run on several of the nodes.
• Concurrent Managers migrate to the surviving node when one of the concurrent nodes goes down.
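The R12 failover rule stated on the slides above (a manager only moves if a Secondary Node is defined) can be pictured with a toy model. This is a minimal sketch, not Oracle's actual node-assignment logic; the function name `node_after_failure` and the idea of passing a set of down nodes are assumptions made purely for illustration.

```python
from typing import Optional, Set


def node_after_failure(primary: str, secondary: Optional[str], down: Set[str]) -> Optional[str]:
    """Toy model: return the node a manager would run on, or None if it cannot run."""
    if primary not in down:
        return primary          # primary node is still available
    if secondary is not None and secondary not in down:
        return secondary        # fail over to the defined secondary node
    return None                 # R12 per the slides: no secondary defined, no failover


print(node_after_failure("RH9", "RH7", {"RH9"}))  # RH7
print(node_after_failure("RH9", None, {"RH9"}))   # None
```

The second call mirrors the slide's warning: with no secondary node, the manager has nowhere to go when RH9 fails.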
Parallel Concurrent Processing
[Diagram labels: Web Browser, HTML Interface, Web Server, Forms Server, Reports Server, JInitiator, JAVA Interface, Database, Data; on each of two nodes: Internal Monitor (FNDIMON), ICM (FNDLIBR), Standard Manager (FNDLIBR), Service Manager (FNDSM), FNDCRM, Report Review Agent, SQL*Net, .rdx, Requests, Logs, Out]
What’s wrong with this picture?

APPLDCP Profile Option
Starting with Release 11.5.10, FND.H, the APPLDCP environment variable is ignored. R12 GSM requires the value of APPLDCP to be set to “ON”. The value is hard-coded in afpcsq.lpc version 115.35, thereby ignoring the value of APPLDCP.
As per ATG Development: As of file “afpcsq.lpc” version 115.35 or higher, APPLDCP is internally hard-coded to “ON” when the Generic Service Management (GSM) is enabled, “keeping in mind, use of the GSM is required”. In short, at “afpcsq.lpc” version 115.35 or higher with the GSM enabled, the setting of the APPLDCP environment variable is ignored; this is the “default behavior on all R12 releases.”
NOTE: As per ARU, “Patch 11i.FND.H” (3262159) and “Oracle Applications Release 11.5.10” (3140000) contain “afpcsq.lpc” version 115.37.
From Note: 753678.1

PCP Failover Mechanisms
• TCP keepalive
• PMON – ICM Process Monitor
• Dead Connection Detection
• Connection Failure Recovery – R12
10g Timeout Parameters (untested)
– sqlnet.inbound_connect_timeout (server)
– sqlnet.send_timeout (client and/or server)
– sqlnet.recv_timeout (client and/or server)

11i PCP Failure
• TCP Failure
• ICM Lock is released, FNDIMON pings ICM node, if ping fails, check PMON
• PMON detects a “dead process”, crashed ICM
• reviver.sh
• DCD

R12 PCP Failure
• TCP Failure
• PMON detects a “dead process”
• ICM Shutdown – Look for error messages ORA-3113, ORA-3114 or ORA-1041
• reviver.sh
• DCD

Reviver
[Flowchart with ICM and REVIVER columns. Boxes: Start, Starts to Shutdown, Receive Shutdown?, Lost DB Connection?, Attempt to Get DB Connection, Sleep, Spawn Reviver, Kill Previous DB Session, ICM Started?, Start ICM, Exit]
From the CM log file:
• The ICM has lost its database connection and is shutting down.
• Spawning reviver process to restart the ICM when the database becomes available again.
• Spawned reviver process 10910.

reviver.log
    The ICM has lost its database connection and is shutting down.
    Spawning reviver process to restart the ICM when the database becomes available again.
    Spawned reviver process 10910.

TCP
TCP/IP is a connection-oriented protocol; TCP implements packet timeout and retransmission in an effort to guarantee the safe and sequenced order of data packets.
If a timely acknowledgement is not received in response to the probe packet, the TCP/IP stack will retransmit the packet some number of times before timing out. After TCP/IP gives up, SQL*Net receives notification that the probe failed.

TCP Keepalive
At this time, client-side SQL*Net connections do not enable keepalive for TCP connections by default. However, it is possible to enable this by adding the ENABLE=BROKEN parameter to the DESCRIPTION clause of the SQL*Net connect string (the TNS alias), as in the ENABLE=BROKEN sample.
**WARNING** Keepalive intervals are typically set to 2 hours or more (i.e., it can take more than 2 hours to notice a dead server even if keepalive is enabled). To make keepalive useful for PCP and TAF, the keepalive interval needs to be reduced to a smaller value (such as 2 minutes).
If there are a lot of IDLE connections on your network, then reducing keepalive can increase network traffic significantly.

ENABLE=BROKEN
Sample TNS alias to enable keepalive (notice the ENABLE=BROKEN clause):
    VIS_BALANCE =
      (DESCRIPTION =
        (ENABLE=BROKEN)
        (ADDRESS_LIST =
          (LOAD_BALANCE = ON)
          (FAILOVER = ON)
          (ADDRESS = (PROTOCOL = TCP)(HOST = rh8)(PORT = 1521))
          (ADDRESS = (PROTOCOL = TCP)(HOST = rh6)(PORT = 1521))))

TCP Keepalive
• **WARNING** Keepalive intervals are typically set to 2 hours or more (i.e., it can take more than 2 hours to notice a dead server even if keepalive is enabled).
• To make keepalive useful for TAF, the keepalive interval would need to be reduced to a smaller value (such as 2 minutes).
Note: 249213.1

TCP KeepAlive Parameters for Linux
Parameter             Meaning                                                                    Default
tcp_keepalive_time    the time between the last data packet sent and the first keepalive probe   7200 seconds
tcp_keepalive_intvl   the time between keepalive probes                                          75 seconds
tcp_keepalive_probes  the number of probes to be sent before declaring the connection dead       9
With the default settings, detection takes a total of 7200 + 9 × 75 = 7875 seconds, or 2 hours 11 minutes and 15 seconds.

TCP Keepalive
Initial Settings
– tcp_keepalive_time = 200 secs
– tcp_keepalive_intvl = 20
– tcp_keepalive_probes = 2
• After 200 seconds of no response, TCP sends the first of 2 probes, 20 seconds apart.
• TCP notifies SQL*Net of the failure, and SQL*Net removes the offending connection.

TCP Retries
• tcp_retries1 (default: 3) The number of times TCP will attempt to retransmit a packet on an established connection normally, without the extra effort of getting the network layers involved.
• tcp_retries2 (default: 15) The maximum number of times a TCP packet is retransmitted in established state before giving up.
• tcp_syn_retries (default: 5) The maximum number of times initial SYNs for an active TCP connection attempt will be retransmitted. The default value of 5 corresponds to approximately 180 seconds.
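The keepalive totals quoted on the slides above follow from simple arithmetic: the idle time before the first probe plus one probe interval per unanswered probe. A minimal Python sketch of that calculation, assuming every probe goes unanswered; the function names are illustrative and not part of any Oracle or Linux tool.

```python
def keepalive_detection_seconds(time_s: int, intvl_s: int, probes: int) -> int:
    """Seconds until TCP declares an idle connection dead if no probe is answered."""
    return time_s + intvl_s * probes


def format_hms(total_s: int) -> str:
    """Render a number of seconds as hours, minutes and seconds."""
    hours, rem = divmod(total_s, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours}h {minutes}m {seconds}s"


linux_default = keepalive_detection_seconds(7200, 75, 9)  # Linux defaults from the slide
tuned = keepalive_detection_seconds(200, 20, 2)           # the deck's initial settings

print(linux_default, format_hms(linux_default))  # 7875 2h 11m 15s
print(tuned, format_hms(tuned))                  # 240 0h 4m 0s
```

The tuned result of 240 seconds matches the "TCP_KEEPALIVE takes 240 seconds" figure shown on the PCP Failover diagram.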
TCP Retries
Now let’s consider changing the following TCP parameters from their default values:
– tcp_retries1 = 2
– tcp_retries2 = 2
– tcp_syn_retries = 2
In this example, the time to initialize the PCP failover was an average of 8 seconds after changing these TCP parameters.
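One way to keep the tuned values together is to render them as sysctl-style lines. The sketch below assumes the standard Linux `net.ipv4.*` key names for these kernel parameters; the `TUNED` dictionary simply collects the values quoted on the slides, and `render_sysctl` is a hypothetical helper, not something from the presentation.

```python
# Values taken from the slides: keepalive 200/20/2 and all three retry counts set to 2.
TUNED = {
    "tcp_keepalive_time": 200,
    "tcp_keepalive_intvl": 20,
    "tcp_keepalive_probes": 2,
    "tcp_retries1": 2,
    "tcp_retries2": 2,
    "tcp_syn_retries": 2,
}


def render_sysctl(settings: dict) -> str:
    """Render settings as 'net.ipv4.<name> = <value>' lines (assumed sysctl key prefix)."""
    return "\n".join(f"net.ipv4.{name} = {value}" for name, value in settings.items())


print(render_sysctl(TUNED))
```

Such aggressive retry values shorten failover but also make ordinary connections less tolerant of brief network glitches, so they are worth testing before use on a production network.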
Disconnect TCP Connection from RH9
From the ICM log:
    The Internal Concurrent Manager has encountered an error. Review concurrent manager log file for more detailed information. : 12-JAN-2009 15:22:55
    Shutting down Internal Concurrent Manager : 12-JAN-2009 15:22:55
    12-JAN-2009 15:22:55
    The ICM has lost its database connection and is shutting down.
    Spawning reviver process to restart the ICM when the database becomes available again.
    Spawned reviver process 1541.
    The [email protected] internal concurrent manager has terminated with status 1 – giving up.
    Found dead process: spid=(17963), cpid=(1302176), ORA pid=(26), manager=(0/1)

PMON & fnd_concurrent_queues
PMON updates the work_start column in the fnd_concurrent_queues table every 4 PMON cycles.
    fdpsrp() (running_processes correction): ICM cannot obtain exclusive lock on FND_CONCURRENT_QUEUES
    Oracle error code returned: 1
    This message is information and does not indicate a problem with CP functionality.
    remote call function (FNDIMON) 15-AUG-2008 10:06:02 – Function to call: PingProcess

PMON – ICM Lock – 11i
• If the “ICM lock” is not available, FNDIMON will now ping the node of the ICM.
• If the ping succeeds, we conclude that the ICM is fine. What????
• If the ping fails, we further check if it has been over “quesiz” pmon cycles since the ICM updated the work_start column of fnd_concurrent_queues.
• If it has been more than four pmon cycles we conclude that the ICM is dead.

PMON “found dead process”
On RH9 the PMON found a dead process. The PMON takes about 1 second to run, then sleeps for 2 minutes:
    Process monitor session started : 18-JAN-2009 21:46:05
    Found dead process: spid=(16977), cpid=(1321475), Service Instance=(36543)
    Process monitor session ended : 18-JAN-2009 21:46:06
    The Internal Concurrent Manager has encountered an error. Review concurrent manager log file for more detailed information. : 18-JAN-2009 22:02:01

PMON – node RH9 is down
From the ICM log:
    Process monitor session started : 12-JAN-2009 15:18:27
    Internal Concurrent Manager found node RH9 to be down. Adding it to the list of unavailable nodes.
    CONC-SM TNS FAIL Call to PingProcess failed for XDPCTRLS

PMON
    Process monitor session started : 18-JAN-2009 22:38:57
    CONC-SM TNS FAIL Call to PingProcess failed for OAMGCS
    18-JAN-2009 22:38:58 – Node:(RH7), Service Manager:(FNDSM_RH7_VIS) currently unreachable by TNS
    Found dead process: spid=(11234), cpid=(1321563), ORA pid=(167), manager=(0/4)
    Process monitor session ended : 18-JAN-2009 22:38:58

PMON
    Shutting down Internal Concurrent Manager : 18-JAN-2009 22:02:01
    18-JAN-2009 22:02:01
    The ICM has lost its database connection and is shutting down.
    Spawning reviver process to restart the ICM when the database becomes available again.
    Spawned reviver process 10910.

PMON runs every 2 minutes
    Process monitor session ended : 18-JAN-2009 21:49:05
    Process monitor session started : 18-JAN-2009 21:51:05

Edit ICM Runtime Parameters

Edit PMON Parameters

Edit PMON Parameters
ICM parameters are read from batchmgr.sh when adcmctl.sh runs. Changing these parameters here does not change batchmgr.sh!

$FND_TOP/bin/batchmgr.sh
Make sure the PMON changes are made in the $FND_TOP/bin/batchmgr.sh file.
    # FILENAME
    #   batchmgr
    # DESCRIPTION
    #   fire up Internal Concurrent Manager process
    # USAGE
    #   batchmgr arg1=val1 arg2=val2 …
    #
    #   Parameters may be sent via the environment.
    #
    # ARGUMENTS                                  DEFAULT
    #   [appmgr|sysmgr]=username/password
    #   [sleep=sleep_seconds]                    15
    #   [mgrname=manager_name]                   icm
    #   [logfile=log_filename]                   $FND_TOP/$APPLLOG/$mgrname.mgr
    #   [restart=N|mim minutes between restarts] N
    #   [mailto=”user1 user2… “]                 current user
    #   [PRINTER=printer_name]
    #   [pmon=iterations]                        4
    #   [quesiz=pmon_iterations]                 1
    #   [diag=Y|N]                               N

Reviver
[Flowchart repeated from the earlier Reviver slide, with the same CM log excerpt: the ICM has lost its database connection and is shutting down; it spawns reviver process 10910 to restart the ICM when the database becomes available again.]

reviver.log
    reviver.sh starting up…
    [ Mon Jan 12 20:02:15 MST 2009 ] – Read APPS username/password.
    [ Mon Jan 12 20:02:45 MST 2009 ] – Attempting database connection…
    [ Mon Jan 12 20:02:45 MST 2009 ] – Successful database connection.
    [ Mon Jan 12 20:02:45 MST 2009 ] – Killing previous ICM session…
    1 row updated.
    Commit complete.
    [ Mon Jan 12 20:02:45 MST 2009 ] – Looking for a running ICM process…
    [ Mon Jan 12 20:02:45 MST 2009 ] – ICM now running, reviver.sh complete.

reviver.sh
reviver.sh – code summary
    Sleep 30
    Test_connection
    Kill_old_icm
        Get session
        Alter system kill session
    Check_running_icm
        Fnd_conc.ecm_alive
    start_icm
        startmgr.sh

Dead Connection Detection
• Dead Connection Detection (DCD) is a feature of SQL*Net 2.1 and later, including Oracle Net8. DCD detects when a partner in a SQL*Net V2 client/server or server/server connection has terminated unexpectedly, and releases the resources associated with it.

Implement DCD
• Implement by adding SQLNET.EXPIRE_TIME = 1 (minutes) to the sqlnet.ora file.
If the connection is idle for the time interval specified in minutes by the SQLNET.EXPIRE_TIME parameter, the server-side process sends a small 10-byte packet to the client. The packet is sent using TCP/IP.

DCD – ICM Lock
• ICM and IM can use the DCD functionality of the network (TCP, SQL*Net).
• ICM is a client process connected to a DCD-enabled DB dedicated server process.
• ICM holds the named PL/SQL Lock, the “ICM lock”.
• IM is continuously trying to check whether it can get the same named PL/SQL Lock.

DCD – ICM Lock
As soon as the “ICM lock” is released by the DB / DCD, FNDIMON pings the ICM node, and the IM deduces that the ICM has crashed.
– If the ping succeeds, we conclude that the ICM is fine.
  • Obviously, the ICM can be down even if TCP is working; this is bad logic.
– If the ping fails, FNDIMON determines if it’s been over four pmon cycles since the ICM updated the work_start column of fnd_concurrent_queues.
– If it has been more than four pmon cycles, FNDIMON concludes the ICM is dead.
• The DCD comes into the picture here, after the ICM has crashed and the DB needs to identify that the ICM is gone.
• The DB needs to clean up the dedicated server process resource corresponding to the ICM client process.
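The FNDIMON decision steps described above can be written out as a small decision function. This is only a sketch of the logic as the slide words it, not FNDIMON's actual code; the `IcmObservation` type, its field names and the `max_cycles` parameter are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IcmObservation:
    lock_released: bool       # has the named "ICM lock" been released by the DB / DCD?
    ping_ok: bool             # did the ping of the ICM node succeed?
    cycles_since_update: int  # PMON cycles since work_start was last updated


def icm_presumed_dead(obs: IcmObservation, max_cycles: int = 4) -> bool:
    """Follow the slide's steps: lock released, ping fails, more than four stale PMON cycles."""
    if not obs.lock_released:
        return False  # the ICM still holds its lock, nothing to check
    if obs.ping_ok:
        return False  # the slide's criticized shortcut: a reachable node counts as a live ICM
    return obs.cycles_since_update > max_cycles


print(icm_presumed_dead(IcmObservation(True, False, 5)))  # True
print(icm_presumed_dead(IcmObservation(True, True, 9)))   # False
```

The second call shows the weakness the slide points out: as long as the node answers a ping, a crashed ICM is still treated as healthy.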
FNDIMON has the ICM Lock
Check if the ICM updated the work_start column of fnd_concurrent_queues. Be aware that if a TCP failure is not detected, failover will not occur. The following excerpt from a concurrent manager log shows:
    fdpsrp() (running_processes correction): ICM cannot obtain exclusive lock on FND_CONCURRENT_QUEUES
    Oracle error code returned: 1
    This message is information and does not indicate a problem with CP functionality.
    remote call function (FNDIMON) 15-AUG-2008 10:06:02 – Function to call: PingProcess
The PingProcess continues until the CP processes resume, or a TCP failure is detected and failover is begun.

11i PCP Failure
• TCP Failure
• ICM Lock is released, FNDIMON pings ICM node, if ping fails, check PMON
• PMON detects a “dead process”, crashed ICM
• reviver.sh
• DCD

R12 PCP Failure
• TCP Failure
• PMON detects a “dead process”
• ICM Shutdown – Look for error messages ORA-3113, ORA-3114 or ORA-1041
• reviver.sh
• DCD

Test PCP Failover Parameters
• Test to explore the effect of DCD, PMON and TCP failover methods.
• Variables: sqlnet.expire_time, pmon sleep and number of cycles, and the following TCP Keepalive parameters:
  • tcp_keepalive_time
  • tcp_keepalive_intvl
  • tcp_keepalive_probes
  • tcp_retries1 (default: 3, new value 2)
  • tcp_retries2 (default: 15, new value 2)
  • tcp_syn_retries (default: 5, new value 2)

Failover Test Results
Failover / Failback   Expire_time  PMON Sleep  PMON Cycles  tcp_keepalive_time  tcp_keepalive_intvl  tcp_keepalive_probes  tcp_retries1  tcp_retries2  tcp_syn_retries
241 secs /            1 minute     30 secs     4            200                 20                   2                     3             15            5
250 secs / 50 secs    5 minutes    30 secs     4            200                 20                   2                     3             15            5
262 secs / 100 secs   10 minutes   30 secs     4            200                 20                   2                     3             15            5
300 secs / 75 secs    1 minute     15 secs     2            200                 20                   2                     3             15            5
285 secs / 35 min     10 minutes   30 secs     4            1000                60                   10                    3             15            5
8 secs / 105 secs     1 minute     30 secs     4            1000                60                   10                    2             2             2
10 secs / 42 secs     1 minute     30 secs     4            200                 20                   2                     2             2             2
7 secs / 40 secs      10 minutes   30 secs     4            200                 20                   2                     2             2             2
6 secs / 34 secs      1 minute     15 secs     2            200                 20                   2                     2             2             2
87
All Services are UP 88
Concurrent Managers
• Processes – Actual = 1 and Target = 1, manager is running
• Processes – Actual = 0 and Target = 1, manager is not running 89
Actual Processes = 0
Example of Actual Processes = 0. In this example, the CRM is not running. 90
PCP Setup
PCP setup – this screen is continued on the next slide. 91
Primary and Secondary Nodes
Any concurrent programs not assigned to the Standard Manager will not fail over. The CRM, ICM and Standard Manager will fail over. 92
TCP Failure
• TCP disconnected at 2:57:25
• 10 seconds after the TCP connection was pulled, OAM reported the status above.
• It took 10 seconds for OAM to register a failure of services on RH9. 93
CRM is DOWN
If any of the subordinate services fail, the failure rolls up to the Dashboard. 94
CRM Failure
CRM has failed, Actual Processes = 0 95
PCP Failover from RH9 to RH7
Adding Node:(RH9), to unavailable list
Found dead process: spid=(9696), cpid=(1321449), ORA pid=(80), manager=(0/0)
Found dead process: spid=(9784), cpid=(1321458), ORA pid=(114), manager=(0/0)
Found dead process: spid=(9783), cpid=(1321457), ORA pid=(104), manager=(0/0)
Found running request 4413565 attached to dead manager process. Attempting to restart request.
Internal Concurrent Manager found node RH9 to be down. Adding it to the list of unavailable nodes. 96
GSM tries to restart the services
TCP and TNS are unavailable:
Starting STANDARD Concurrent Manager : 18-JAN-2009 21:43:42
CONC-SM TNS FAIL
Routine AFPEIM encountered an error while starting concurrent manager STANDARD with library /d01/oracle/VIS/apps/apps_st/appl/fnd/12.0.0/bin/FNDLIBR.
Check that your system has enough resources to start a concurrent manager process.
Contac : 18-JAN-2009 21:43:42
Starting STANDARD Concurrent Manager : 18-JAN-2009 21:43:42
CONC-SM TNS FAIL
Routine AFPEIM encountered an error while starting concurrent manager STANDARD with library /d01/oracle/VIS/apps/apps_st/appl/fnd/12.0.0/bin/FNDLIBR.
Check that your system has enough resources to start a concurrent manager process. Contac : 18-JAN-2009 21:43:42
Starting STANDARD Concurrent Manager : 18-JAN-2009 21:43:42
CONC-SM TNS FAIL
Routine AFPEIM encountered an error while starting concurrent manager STANDARD with library /d01/oracle/VIS/apps/apps_st/appl/fnd/12.0.0/bin/FNDLIBR. 97
ICM and CRM are DOWN 98
RH9 is DOWN
Not really down, just not on the network 99
PCP is DOWN
This is momentary as GSM figures out what to do 100
Failover to Secondary Node
The ICM and CRM failed over to RH7 in about 1 minute and 30 seconds 101
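The TCP keepalive settings varied in the failover tests bound how long a silent, idle connection can linger before the operating system declares it dead. A minimal sketch of that arithmetic follows, assuming Linux semantics for tcp_keepalive_time, tcp_keepalive_intvl and tcp_keepalive_probes; the function name is hypothetical, and the values are the two combinations used in the tests.

```python
def keepalive_detection_secs(time_secs: int, intvl_secs: int, probes: int) -> int:
    """Worst-case seconds of silence before an idle connection is dropped.

    The first probe is sent after time_secs of idleness; the connection is
    declared dead after `probes` unanswered probes sent intvl_secs apart.
    """
    return time_secs + intvl_secs * probes


# (tcp_keepalive_time, tcp_keepalive_intvl, tcp_keepalive_probes) from the tests
tested = {
    "200/20/2": (200, 20, 2),
    "1000/60/10": (1000, 60, 10),
}

if __name__ == "__main__":
    for label, (t, i, p) in tested.items():
        print(f"{label}: {keepalive_detection_secs(t, i, p)} seconds")
```

For 200/20/2 this gives 240 seconds, which agrees with the 240 seconds quoted for TCP_KEEPALIVE earlier in the deck. Keepalive only covers idle connections; when unacknowledged data is in flight, retransmission limits such as tcp_retries2 apply instead.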
Failover from RH9 to RH7
Starting Internal Concurrent Manager Concurrent Manager : 18-JAN-2009 21:51:23 : Started ICM on Target RH7. Process monitor session ended : 18-JAN-2009 21:52:53 : Migration of ICM has completed. Shutting down Internal Concurrent Manager : 18-JAN-2009 21:53:23 The [email protected] internal concurrent manager has terminated successfully – exiting. 102
ICM Failover to RH7
Starting Internal Concurrent Manager Concurrent Manager : 18-JAN-2009 21:51:23 : Started ICM on Target RH7. Process monitor session ended : 18-JAN-2009 21:52:53 : Migration of ICM has completed. Shutting down Internal Concurrent Manager : 18-JAN-2009 21:53:23 The [email protected] internal concurrent manager has terminated successfully – exiting. 103
RH9 not available 104
Request Failover 105
Standard Manager Failover Configuration
• Note that the Inventory Manager, MRP Manager and OAM Metrics Collection Manager are not set up to fail over. 106
Managers with a Secondary Node
• Note that the Inventory Manager, MRP Manager and OAM Metrics Collection Manager are not set up to fail over. 107
Failback
FAILBACK – TCP connected at 31:40. The host, RH9, becomes available on OAM about 2 minutes later. 108
RH9 available 109
ICM Failback 110
Concurrent Manager Log
Starting Internal Concurrent Manager Concurrent Manager : 18-JAN-2009 22:53:33 : Started ICM on Target RH9.
Process monitor session ended : 18-JAN-2009 22:55:03 : Migration of ICM has completed. Shutting down Internal Concurrent Manager : 18-JAN-2009 22:55:33 The [email protected] internal concurrent manager has terminated successfully – exiting. 111
112
Failback Complete
Total Failback Time: 3 minutes and 45 seconds 113
Standard Manager before Failover
The Standard Manager has 3 Actual and Target processes. 114
Standard Manager is DOWN 115
Standard Manager has 2 Processes on Failover
After 3 minutes and 30 seconds the Standard Manager started on RH7. 116
Shutdown of CP 117
Concurrent Processing Load Balancing
Two types of Load Balancing
• Load Balancing with both nodes running – no failover
• Load Balancing during failover 118
PCP Load Balancing
• One of the benefits Parallel Concurrent Processing provides:
– failover in case of node failure
• Failover maintains throughput and keeps the business running during node failures.
• When a node fails, the processes that were running on the failed node are restarted on secondary nodes.
• However, a resource-intensive node may overload the secondary node when it fails over. 119
PCP Load Balancing
• If too many processes are already running on the secondary node when the primary node fails over, the secondary node may not have the capacity to process the requests from the additional concurrent managers.
• R12 introduces Failover Sensitive Workshifts.
This enhancement allows the System Administrator to configure how many processes fail over for each workshift. With this added control, System Administrators can enjoy the benefits of PCP failover without risking performance issues from overloaded resources. 120
R12 Failover Sensitive Workshifts 121
Failover Sensitive Workshifts 122
Failover Sensitive Workshifts
• Conversely, if a failover occurs from node 1 to node 2, we may want to reduce the failover processes; however, this doesn’t work.
• The “failover processes” setting takes effect only if the node fails. 123
Failover Processes
PO Document Approval Manager and the Standard Manager will reduce the number of processes when RH7 fails.
When RH9 fails, the number of failover processes for managers that run on RH7 is not reduced. 124
Failover Sensitive Workshifts
It’s clear: to run an R11i or R12 system during a failover, there are two choices:
• Run the servers at 35% or less utilization
• Reduce the number of processes that are allowed during failover
For most businesses the second option is the most practical. 125
References
• 249213.1 – Performance problems with Failover when TCP Network goes down
• 364171.1 – TAF Session Hangs, Select Fails To Complete W/ Loss Of NIC: Tune TCP Keepalive
• 211362.1 – Process Monitor Session Cycle Repeats Too Frequently
• 291201.1 – How To Remove a Dead Connection to the Target Database
• 362135. – Configuring Oracle Applications Release 11i with Oracle10g Release 2 Real Application Clusters and Automatic Storage Management
• Optimizing the E-Business Suite with Real Application Clusters (RAC) – Ahmed Alomari
• 240818.1 – Concurrent Processing: Transaction Manager Setup and Configuration Requirement in an 11i RAC Environment
• R12 ATG – Concurrent Processing Functional Overview – Aaron Weisberg
• 210062.1 – Generic Service Management (GSM) in Oracle Applications 11i
• 271090.1 – Parallel Concurrent Processing Failover/Failback Expectations
• 241370.1 – Concurrent Manager Setup and Configuration Requirements in an 11i RAC Environment
• 602899.1 – Some More Facts On How to Activate Parallel Concurrent Processing 126