How To
Handling Performance Degradation Symptoms: Initial Triage and Ultimate Resolution
Causes

Swimlane platform performance degradation symptoms can arise from a variety of causes, including:
- Platform limitations
- Client code limitations
- Client job/data load upon the platform
- Large record sets
- Suboptimal backup configuration
- Hardware limitations
- Any combination of the above

Examples of symptoms include:
- Text data entry lag
- UI navigation and/or page load speed slowdowns; these sometimes manifest as HTTP 502 and 504 errors reported in Chrome Developer Tools
- Poor performance when searching records (via workspace-wide search or keyword search within a single app) or when exporting records
- Timing out of Python script tasks that interact with the RESTful API; these also sometimes manifest as HTTP 502 and 504 errors
- Background job queue congestion

Instructions

For all of the symptoms above except text data entry lag, take the following measures, as directed by the support team, to assess, triage, and resolve the issue:
- Take a screen capture video of the symptom while it is manifesting. Use video, screenshots, log file entries, etc., to help the support and engineering teams get into context and see the symptom through your eyes. The support team can also join via Zoom to witness the symptom and record the screen-sharing display.
- For UI slowness, collect a HAR file from Chrome Developer Tools.
- Export the application SSP.
- Follow the instructions in the MongoDB profiler article.
- If requested by the support team, assist them in harvesting MongoDB diagnostic and log data. This data can sometimes be large and require special arrangements for the file transfers.
- For background job queue congestion symptoms, see the special instructions below.
- Provide all of the artifacts above to the support team in the pertinent support portal ticket.

Expectations

The time and effort required to collect the artifacts above is well worth it. The engineering team has a solid track record of providing initial workarounds and long-term platform code fixes, and it needs to study these artifacts in order to arrive at solutions. Resolving performance issues is a complex undertaking, and it can take more time than is comfortable for all stakeholders. Customers should work closely with their Customer Success Manager to establish reasonable expectations. In some cases, the engineering team may request:
- Repeated MongoDB profiling
- Repeated collection of MongoDB diagnostic/log data
- A follow-up Zoom session to further study the symptom

Self-Help: Tuning Scripts

Whenever possible, customers should refactor automations that target the RESTful API to:
- Minimize the number of round trips to the RESTful API; remove superfluous calls to record.save() / record.patch().
- Minimize the load imposed on the RESTful API (and, by extension, the MongoDB database) in each round trip.
- Work with your Professional Services engineer to tune scripts that search against large numbers of records. Where possible, control the number of records against which searches are performed.
- Make smart use of the search APIs: use the limit parameter appropriately with the Swimlane driver's app.records.search(), use app.reports.build() against large record sets, and constrain the sorting order of results. Constraining the sort order can afford an optimization in cases where only the first few results of a sorted set are actually needed. A sketch of these patterns follows this list.
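As an illustration of these patterns, here is a minimal sketch using the swimlane-python driver. The host URL, credentials, app name, report name, and field names are placeholders invented for this example, and the exact keyword arguments supported can vary by driver version; adapt the details to your environment and confirm them with your Professional Services engineer.

```python
# Minimal sketch of the round-trip-minimizing patterns described above.
# Host, credentials, app name, report name, and field names are placeholders.
from swimlane import Swimlane

swimlane = Swimlane('https://swimlane.example.com', 'api-user', 'api-password')
app = swimlane.apps.get(name='Security Alerts')

# Constrain searches: the limit parameter caps how many records are fetched
# instead of pulling back the entire record set.
high_severity = app.records.search(
    ('Severity', 'equals', 'High'),
    limit=50,
)

# Against large record sets, build a report and filter server-side so the
# narrowing happens in the database rather than in the script.
report = app.reports.build('open-high-severity', limit=50)
report.filter('Severity', 'equals', 'High')
report.filter('Status', 'equals', 'Open')

# Batch field changes and persist once per record, rather than calling
# record.save() or record.patch() after every individual field update.
for record in report:
    record['Status'] = 'Triaged'
    record['Triage Notes'] = 'Auto-triaged by tuning script'
    record.patch()  # one round trip per record instead of one per field
```

The sort-order optimization mentioned above is deliberately omitted from the sketch; the mechanism for constraining report sort order depends on the driver version, so check the driver documentation or ask your Professional Services engineer for the appropriate call.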
Background Job Queue Congestion

Introduction

The queue is congested whenever its dashboard shows that new jobs are being enqueued faster than previously enqueued jobs can finish processing. This symptom can be either the result of underlying performance problems or the cause of any other performance degradation symptom. When the congestion is severe (thousands or tens of thousands of enqueued jobs), it can prevent newly enqueued jobs from executing for hours or days.

Background

To understand this symptom and its remedies, it is important to understand the queue itself. Each job is an instance of either:
- a built-in integration task (one example is the nightly directory services sync), or
- a customer-created integration task.

As shown in the queue's dashboard, each job passes through the following states:
- Enqueued
- Processing
- Succeeded / Deleted

The Failed state is not used by Swimlane; failed jobs are grouped in the Deleted state. There are other states, such as Scheduled, Awaiting, and Aborted, but these are not often used.

Initial Triage

The quick fix that temporarily eliminates congestion is to purge the background job queue so that newer jobs can process. This works only for a short time. Before purging the job queue, consider the ramifications:
- The recommended purge method deletes all job execution data from the queue; the support team does not have a more granular purge method.
- Take comfort in knowing that no Swimlane records are altered by the purge.
- The loss of success/fail outcomes for completed jobs is often negligible, because information about Succeeded and Deleted jobs is purged automatically every night at midnight server time; Swimlane only retains this data for 24-48 hours. The same information can often be reconstructed from the Swimlane log stream (the Swimlane logs collection in MongoDB).
- The loss whose cost must be carefully considered is the elimination of the jobs in the Enqueued state.

When thousands or tens of thousands of jobs are congested, the following question must be addressed: is it more harmful to leave these jobs enqueued, knowing that recent Swimlane alarm records will remain under-enriched for hours or days, or to eliminate all enqueued jobs so that subsequent jobs can start and finish more promptly? To answer this question, weigh the consequences of leaving the recently ingested records under-enriched against the special effort required to back-fill them. Consult with your Professional Services engineer for assistance using the bulk edit feature and/or special-purpose scripts to catalyze enrichment on all records neglected during the queue congestion.

Diagnostic Procedure

After stopping the tasks service(s) and purging the queue:
1. Disable all integration tasks.
2. Decide on one small suite of tasks to enable. These tasks should all pertain to one use case (one security alarm type and its automation processing flow), but they may be only a subset of the tasks belonging to that use case.
3. Enable only those chosen tasks.
4. Monitor the job queue for 1-3 hours. Does Swimlane keep up with this reduced load of tasks?
5. If Swimlane keeps up with the reduced load by never falling permanently behind, add a few more tasks (completing the first use case's portfolio or adding a small second use case) and continue monitoring.
6. Repeat, increasing the load incrementally, until the ability of the Swimlane deployment to keep up has been surpassed. As soon as Swimlane falls behind permanently, you know precisely where that threshold lies.

The record of which tasks were enabled, in what order, and during what span of time is precisely the information the engineering team needs (along with the other artifacts requested above) to provide a solution.