20151120 - Troubleshooting Oracle Commerce

Many online retailers are preparing to do a lot of business in the next six weeks, and site performance is crucial to these sales, as customers will jump very quickly to a competitor’s site if the one they’re on is slow. To avoid issues during this season, we’ve been working with our clients to plan and prepare since August. We’re also on high alert to help any clients who do run into issues. With so many sales up for grabs, performance issues can cost a retailer by the second, which is why we’ve built out checklists to help clients who are experiencing issues.

Here is the checklist that our support team uses to troubleshoot any of our Oracle Commerce client sites. If your site is experiencing performance issues, it will help you narrow down the problem so that you can fix it and take advantage of the holiday shoppers.

Keep in mind, this list isn’t for beginners; it’s designed for those familiar with Oracle Commerce as a way to organize their problem solving. It’s not very pretty, but it will help diagnose the cause of performance issues for your Oracle Commerce store. If this is too technical for you, we might be able to help. Contact us to discuss how we can help manage your environment and protect your performance.



1. Obtain a good problem description:

A. When is the site slow?

B. Are clients logged in to the application?

C. Are there error messages observed?

D. Are there specific products or product lines involved?

E. Are there specific browsers being used by the end users?

F. Are the end users affected in one geographical location?

2. Look at what changes have been made in the last week

A. Client Application Changes

B. Tenzing Infrastructure Changes

3.  Check session limits – if sessions are approaching the configured limit, it may need to be raised

4.  Assess the infrastructure for the environment

A.  Server

i. Generate a performance report on all servers

a) CPU

(1) If CPU is high, identify the processes that are causing it

(2) Use strace to find cause of load
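
For the CPU checks above, a quick triage might look like the sketch below. It assumes a procps-style `ps`; the PID passed to strace is a placeholder you must replace with the real process ID.

```shell
# Rank processes by CPU to find the likely culprit
ps -eo pid,pcpu,pmem,comm --sort=-pcpu | head -n 5

# Then summarize which syscalls a suspect process is spending time in.
# 12345 is a placeholder PID; Ctrl-C after ~30s prints the summary table.
# strace -c -p 12345
```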

b) Disk

c)  Memory

d) I/O Wait – Use ‘iostat’ to collect metrics
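
For the I/O wait item, ‘iostat -x 1 3’ gives extended per-device statistics. A small awk filter over captured output can flag devices with a high average wait; the device names and numbers in the here-doc below are made up for illustration – point the filter at real ‘iostat -x’ output instead.

```shell
# Flag devices whose average wait (await, column 10) exceeds 20 ms.
# The sample data below is invented; feed in real `iostat -x` output.
awk '$1 ~ /^sd/ && $10+0 > 20 {print $1, $10}' <<'EOF'
Device rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.0 1.2 3.0 9.0 120 480 50 0.4 35.2 1.1 8.0
sdb 0.0 0.3 1.0 2.0 40 160 50 0.1 4.8 0.9 2.0
EOF
```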

ii. Patching Level on key servers (DB, Search)

iii. If “Too many open files” is observed, we are likely hitting an OS-level limit on how many file descriptors one user or group can hold open. These limits should be reviewed and adjusted.
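
Checking and raising the descriptor limit is usually a two-step job; the user name and values below are examples, not requirements.

```shell
# Show the soft open-file limit for the current shell/user
ulimit -n

# Count descriptors a given user actually has open (hypothetical user):
# lsof -u atguser 2>/dev/null | wc -l

# Raise the limit persistently in /etc/security/limits.conf, e.g.:
#   atguser  soft  nofile  65536
#   atguser  hard  nofile  65536
```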

iv. Antivirus

v.  Monitoring Agent malfunction

B. Firewalls – check CPU, health of the firewalls

C. Switching

i.  Check for utilization on key switches:

a) Database

b) Endeca or Search

ii. Have counters cleared on these switches

iii. Check Server Interface Utilization

iv.  MTR

v.   Traceroute

D.  Storage

i.    Reset Switch A and B Counters

ii.   Collect LUN IOPs

iii.  Collect Aggregate IOPs

iv.   Disk Utilization

v.    Latency

vi.   CPU

5.    Assess Performance of VMware Cluster

A.   Identify Cluster

B.   Hypervisors VMs reside on

C.   Performance of Environment

6.   Assess the Database

A.   Date/Time of last fail over

B.   Long Running Queries

C.   Error Messages detected

D.  Row Lock Contention

E.  Session Limits

F.  Resource Utilization

7.  Backups

A. Backup Agent Type

B. Scheduled Backups – Full and Incremental

C. Last Successful

8. Assess Traffic

A. Bandwidth Utilization Changes

B. Bot Traffic – Top 10 IPs

C. Suspicious Traffic – DOS
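
For the bot-traffic check in 8.B, the classic one-liner pulls the top client IPs out of a web access log. The here-doc below substitutes a tiny invented sample so the pipeline can be seen end to end; point awk at your real access log instead.

```shell
# Top client IPs by request count (field 1 = client IP in common/combined
# log format). Replace the here-doc with e.g. /var/log/httpd/access_log.
awk '{print $1}' <<'EOF' | sort | uniq -c | sort -rn | head -n 10
10.0.0.1 - - [20/Nov/2015:10:00:01] "GET / HTTP/1.1" 200 512
10.0.0.2 - - [20/Nov/2015:10:00:02] "GET /cart HTTP/1.1" 200 920
10.0.0.1 - - [20/Nov/2015:10:00:03] "GET /sku/1 HTTP/1.1" 200 300
EOF
```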

9. Application

A. Application Release – if a release was recent, can it be backed out?

B. Caching – what is cached and how frequently is it refreshed?

C. Complete Thread Dumps during healthy periods and during an incident

D. Open a ticket with Oracle ATG Support

E. Review Logs

i.  Use ‘lsof’ or ‘fuser’ to locate the active log files

ii. Search the located logs for specific errors or items of interest

F.  Look for evidence of a network issue in the application – take thread dumps and evaluate how many threads are waiting on a socket read or write
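
For point F, once you have a dump (via jstack or kill -3 against the JVM), counting socket-bound threads is a one-liner. The sample dump lines in the here-doc are invented; run the grep against a real dump file instead.

```shell
# Count threads parked in a socket read; a high count relative to the
# thread pool size suggests a slow backend or a network problem.
# Against a real dump: grep -c 'SocketInputStream.socketRead' threaddump.txt
grep -c 'java.net.SocketInputStream.socketRead' <<'EOF'
"http-1" daemon prio=5 ... at java.net.SocketInputStream.socketRead0(Native Method)
"http-2" daemon prio=5 ... at java.lang.Object.wait(Native Method)
"http-3" daemon prio=5 ... at java.net.SocketInputStream.socketRead0(Native Method)
EOF
```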


G.  JVMs

i.   List JVM instances and roles

ii.  Compare JVM configurations across JVMs

iii. Verify unique names for each JVM

iv.  JVM Thread levels

v.   Determine if there is one JVM causing a problem (it will cascade down to other JVMs and cause the site to crash)

vi.  Deploy JVM level monitoring
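
Verifying unique JVM names can be scripted, assuming instances are started with the usual -Datg.dynamo.server.name flag. The `ps -ef` sample lines below are invented; run the grep against real process listings.

```shell
# Flag duplicate ATG server names from captured `ps -ef` output.
# Any line printed means two JVMs share a name, which can break
# lock management and make the dumps/logs of the two indistinguishable.
grep -o 'server.name=[^ ]*' <<'EOF' | sort | uniq -d
java -Datg.dynamo.server.name=store_prod1 -Xmx2g atg.nucleus.Nucleus
java -Datg.dynamo.server.name=store_prod2 -Xmx2g atg.nucleus.Nucleus
java -Datg.dynamo.server.name=store_prod1 -Xmx2g atg.nucleus.Nucleus
EOF
```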

H.  Garbage Collection

i.    Stop the world or incremental collection?

ii.   Heap Size

iii.  Time to complete GC

iv. Frequency of collection:  if Full GC occurs frequently or constantly, this often points to a memory leak in the code or an undersized heap.

v.  New versus Old Thresholds

vi. Turn on GC logging

vii. Are Out of Memory errors observed? These are often a symptom of a memory leak

viii. Take heap dumps
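
For vi–viii, the JDK 7/8-era flags below (matching ATG stacks of this period) turn on GC logging and automatic heap dumps on OOM. The log paths and the option of appending to JAVA_OPTS are examples; adjust to however your launch scripts pass JVM options.

```shell
# Append GC logging and OOM heap-dump flags to the JVM launch options
JAVA_OPTS="${JAVA_OPTS:-}"
JAVA_OPTS="$JAVA_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
JAVA_OPTS="$JAVA_OPTS -Xloggc:/var/log/atg/gc.log"
JAVA_OPTS="$JAVA_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp"
echo "$JAVA_OPTS"

# On-demand heap dump from a live JVM (hypothetical PID):
# jmap -dump:live,format=b,file=/var/tmp/heap.hprof 12345
```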

I. Endeca

i. Assess Server health of the MDEXes (Search Servers)

a) Memory – not enough RAM can cause degradation in performance due to OS paging

b) CPU utilization

ii. Check how many SKUs are indexed

a) Lots of SKUs (millions) –> add more RAM and CPU to the MDEX to increase processing power

b) Lots of Queries –> add more MDEXes

iii. Access the Workbench – look at the following statistics

a)  Max Query Size – Oracle best practices indicate response sizes should be less than 500kb

b)  Query Speed

iv. Check the Endeca logs – are errors observed?

v.  Is Preview enabled? Check endeca/assembler/cartridge/manager/AssemblerSettings: previewEnabled = false, or look for merchdebug=true in the Dgraph queries

vi. Incremental Indexing turned on?  When does it run?

vii. When are full indexes completed?

viii. Endeca VIP performance – is there saturation on the switches to the VIP (Check Server Interface Utilization)
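
Related to item v, you can confirm whether merchdebug/preview queries are actually hitting the MDEX by scanning a Dgraph request log. The sample log lines below are invented; run the grep against the real request log for your Dgraph instance.

```shell
# Count merchdebug queries in a (sample) Dgraph request log; these force
# extra preview processing and slow the MDEX. Replace the here-doc with
# the real request log file for your Dgraph.
grep -c 'merchdebug=true' <<'EOF'
1448020000 ... /graph?node=0&offset=0&merchdebug=true
1448020001 ... /graph?node=0&offset=0
1448020002 ... /graph?node=123&merchdebug=true
EOF
```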

J. Collect a list of jobs or scheduled activities

i. Inventory Updates

ii. Product Deployments


Aisling McCaffrey

Demand Marketing Specialist at Thinkwrap
Aisling is our Demand Marketing Specialist at Thinkwrap, and loves working with both technology and humans. She studied International Business (concentrating in Marketing) and has spent several years living and working in China, mostly in Shanghai, where she became passionate about global innovation and how the use of social media changes in different cultures. Aisling likes to keep up on internet trends - from business to memes - and is always looking for new ways to learn or entertain herself.