Many online retailers are preparing to do a lot of business in the next six weeks, and site performance is crucial to those sales: customers will jump to a competitor's site the moment the one they're on is slow. To avoid issues during this season, we've been working with our clients to plan and prepare since August, and we're on high alert to help any client who does run into trouble. With so many sales up for grabs, performance issues can cost a retailer by the second, which is why we've built out checklists to help clients who are experiencing issues.
Here is the checklist our support team uses to troubleshoot our Oracle Commerce client sites. If your site is experiencing performance issues, it will help you narrow down the problem so you can fix it and take advantage of the holiday shoppers.
Keep in mind, this list isn't for beginners; it's designed for those familiar with Oracle Commerce as a way to organize their problem solving. It's not very pretty, but it will help you diagnose the cause of performance issues in your Oracle Commerce store. If this is too technical for you, we might be able to help: contact us to discuss how we can manage your environment and protect your performance.
TENZING CYBER WEEK CHECKLISTS
TROUBLESHOOTING ORACLE COMMERCE
1. Obtain a good problem description:
A. When is the site slow?
B. Are clients logged in to the application?
C. Are error messages observed?
D. Are there specific products or product lines involved?
E. Are there specific browsers being used by the end users?
F. Are the end users affected in one geographical location?
2. Look at what changes have been made in the last week:
A. Client Application Changes
B. Tenzing Infrastructure Changes
3. Check session limits – if sessions are approaching the limit, request that it be increased
4. Assess the infrastructure for the environment
A. Servers
i. Generate a performance report on all servers
(1) If CPU is high, identify the processes that are causing it
(2) Use 'strace' to find the cause of load
(3) I/O Wait – use 'iostat' to collect metrics
ii. Patching Level on key servers (DB, Search)
iii. If "Too many files open" errors are observed, the environment is likely hitting an OS-level limit on the number of file descriptors one user or group can hold open. Review and adjust these limits.
iv. Monitoring Agent malfunction
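The server checks above can be run from a shell. Here is a minimal triage sketch using standard Linux utilities ('iostat' comes from the sysstat package; the sample counts are illustrative):

```shell
#!/bin/sh
# Quick server triage: CPU hogs, I/O wait, and file-descriptor limits.

# Top 5 CPU-consuming processes
ps -eo pid,pcpu,comm --sort=-pcpu | head -6

# I/O wait metrics, three one-second samples (skipped if sysstat is absent)
command -v iostat >/dev/null && iostat -x 1 3

# OS-level file-descriptor limits behind "Too many files open"
ulimit -n                                                  # per-process soft limit
[ -r /proc/sys/fs/file-nr ] && cat /proc/sys/fs/file-nr || true   # system-wide usage/max
```

If one process dominates CPU, attach strace to it (e.g. `strace -c -p <pid>`) to see which system calls it is spending its time in.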
B. Firewalls – check CPU and overall health of the firewalls
C. Switches
i. Check for utilization on key switches:
b) Endeca or Search
ii. Have counters cleared on these switches
iii. Check Server Interface Utilization
D. Storage
i. Reset Switch A and B Counters
ii. Collect LUN IOPs
iii. Collect Aggregate IOPs
iv. Disk Utilization
5. Assess Performance of VMware Cluster
A. Identify Cluster
B. Identify the hypervisors the VMs reside on
C. Performance of Environment
6. Assess the Database
A. Date/Time of last failover
B. Long Running Queries
C. Error Messages detected
D. Row Lock Contention
E. Session Limits
F. Resource Utilization
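For the long-running-queries check, a sketch of a diagnostic query a DBA could run via sqlplus. The `v$session` columns shown are standard, but the 60-second threshold is illustrative:

```shell
# Write the diagnostic SQL to a file; a DBA can then run:
#   sqlplus -s / as sysdba @long_running.sql
cat > long_running.sql <<'EOF'
SET LINESIZE 200 PAGESIZE 100
-- Active user sessions whose current call has run for more than 60 seconds
SELECT sid, serial#, username, status, last_call_et AS seconds_active, sql_id
FROM   v$session
WHERE  status = 'ACTIVE'
  AND  type = 'USER'
  AND  last_call_et > 60
ORDER  BY last_call_et DESC;
EOF
```

The `sql_id` column lets you pull the offending statement text from `v$sql` for closer analysis.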
7. Assess Backups
A. Backup Agent Type
B. Scheduled Backups – Full and Incremental
C. Last Successful Backup
8. Assess Traffic
A. Bandwidth Utilization Changes
B. Bot Traffic – Top 10 IPs
C. Suspicious Traffic – DOS
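The top-10-IPs check can be pulled straight from the web tier's access logs. A sketch assuming a combined-format log (the path is illustrative; adjust per server):

```shell
# Top 10 client IPs by request count. A single IP far ahead of the pack
# is often a bot or scripted traffic worth rate-limiting.
LOG=${LOG:-/var/log/httpd/access_log}   # illustrative path
[ -r "$LOG" ] && awk '{print $1}' "$LOG" | sort | uniq -c | sort -rn | head -10 \
  || echo "no access log at $LOG"
```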
9. Assess the Application
A. Application Release – can it be backed out if it was recent?
B. Caching – what is cached and how frequently?
C. Complete Thread Dumps during healthy periods and during an incident
D. Open a ticket with Oracle ATG Support
E. Review Logs
i. Use 'lsof' or 'fuser' to locate logs
ii. Use 'lsof' or 'fuser' to search for specific items
F. Look for evidence of a network issue in the application – take thread dumps and evaluate how many threads are waiting on a socket read or write
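The socket-wait count can be pulled from a thread dump with grep. A sketch ('jstack' ships with the JDK; the stack-frame name is the usual HotSpot one for blocking socket reads):

```shell
# During an incident, dump the application JVM's threads:
#   jstack <pid> > threads.txt
# then count how many are blocked reading from a socket, and compare the
# count against a dump taken during a healthy period.
[ -f threads.txt ] && grep -c 'java.net.SocketInputStream.socketRead' threads.txt || true
```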
G. JVMs
i. List JVM instances and roles
ii. Compare JVM configurations across JVMs
iii. Verify unique names for each JVM
iv. JVM Thread levels
v. Determine if there is one JVM causing a problem (it will cascade down to other JVMs and cause the site to crash)
vi. Deploy JVM level monitoring
H. Garbage Collection
i. Stop the world or incremental collection?
ii. Heap Size
iii. Time to complete GC
iv. Frequency of collection: If Full GC occurs frequently or constantly, this indicates an issue with a memory leak in the code.
v. New versus Old Generation Thresholds
vi. Turn on GC logging
vii. Are Out of Memory Errors observed? This is a symptom of a memory leak
viii. Take heap dumps
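A sketch of the HotSpot flags that turn on GC logging and capture heap dumps (the log paths are illustrative, and these are the pre-Java-9 flag names typical of ATG-era JVMs):

```shell
# Enable verbose GC logging with timestamps
JAVA_OPTS="$JAVA_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/jvm/gc.log"
# Dump the heap automatically when an OutOfMemoryError occurs
JAVA_OPTS="$JAVA_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/jvm"
# On-demand heap dump of a live JVM (jmap ships with the JDK; PID illustrative):
#   jmap -dump:live,format=b,file=heap.hprof <pid>
echo "$JAVA_OPTS"
```

The resulting gc.log shows each collection's pause time and before/after heap occupancy, which answers the heap size, GC duration, and frequency questions above.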
I. Endeca
i. Assess server health of the MDEXes (Search Servers)
a) Memory – not enough RAM can cause degradation in performance due to OS paging
b) CPU utilization
ii. Check how many SKUs are in the index
a) Lots of SKUs (millions) -> add more RAM and CPU to the MDEX to increase processing power
b) Lots of queries -> add more MDEXes
iii. Access the Workbench – look at the following statistics
a) Max Query Size – Oracle best practices indicate response sizes should be less than 500kb
b) Query Speed
iv. Check the Endeca logs – are errors observed?
v. Is Preview enabled? Check endeca/assembler/cartridge/manager/AssemblerSettings/ : previewEnabled = false, OR in Dgraph queries: merchdebug=true
vi. Incremental Indexing turned on? When does it run?
vii. When are full indexes completed?
viii. Endeca VIP performance – is there saturation on the switches to the VIP (Check Server Interface Utilization)
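Several of the Endeca checks above (query counts, response sizes, timings) can also be read from the Dgraph's admin stats page. A sketch with an illustrative host and port:

```shell
# op=stats is a standard Dgraph admin operation; host and port are illustrative
MDEX_URL="http://mdex-host:15000/admin?op=stats"
curl -s "$MDEX_URL" -o dgraph-stats.xml || echo "MDEX not reachable from this host"
```

The returned XML includes per-query-type performance statistics you can compare against a healthy baseline.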
J. Collect a list of jobs or scheduled activities
i. Inventory Updates
ii. Product Deployments