# MongoDB query/script for Swimlane to parse all records for an application and automatically remove inaccessible attachments (10.x)
Below is the query to clean up records with corrupted or inaccessible attachments by validating fileIds against the GridFS (`fs.files`) collection. The query will:

- Remove invalid attachments from both `values` and `valuesDocument`.
- Retain valid attachments in the respective fields.
- Ensure no valid data is inadvertently removed.

```javascript
const BATCH_SIZE = 1000; // Define batch size for bulk updates
let totalProcessed = 0;  // Track total processed records

// Fetch all attachment fields
const attachmentFields = db.applications.aggregate([
  { $unwind: "$fields" },
  { $match: { "fields._t": "AttachmentField" } },
  { $project: { fieldId: "$fields.id", applicationId: "$_id" } }
]).toArray();

print(`Found ${attachmentFields.length} attachment fields to process.`);

// Iterate through each attachment field
attachmentFields.forEach(({ fieldId }) => {
  const bulkOps = [];
  let fieldProcessed = 0;

  print(`Processing records for attachment field ${fieldId}`);

  db.records.find({ "values.k": fieldId }).forEach(record => {
    const fieldData = record.values.find(value => value.k === fieldId);
    const attachments = fieldData?.v?.v || [];

    if (!attachments.length) {
      print(`No attachments found for field ID ${fieldId} in record ${record._id}`);
      return;
    }

    // Validate files in GridFS
    const validAttachments = attachments.filter(attachment => {
      const fileExists = db.fs.files.findOne({ _id: attachment.fileId });
      if (!fileExists) {
        print(`Invalid fileId ${attachment.fileId} in record ID ${record._id}`);
      }
      return !!fileExists;
    });

    if (validAttachments.length !== attachments.length) {
      // Rebuild the values array, dropping the field entirely if no
      // valid attachments remain
      const updatedValues = record.values.map(value => {
        if (value.k === fieldId) {
          return validAttachments.length > 0
            ? { k: fieldId, v: { t: "System.Object[]", v: validAttachments } }
            : null;
        }
        return value;
      }).filter(v => v !== null);

      // Mirror the same change in valuesDocument
      const updatedValuesDocument = { ...record.valuesDocument };
      if (validAttachments.length > 0) {
        updatedValuesDocument[fieldId] = { t: "System.Object[]", v: validAttachments };
      } else {
        delete updatedValuesDocument[fieldId];
      }

      const updateDoc = {
        $set: {
          values: updatedValues,
          valuesDocument: updatedValuesDocument
        }
      };

      bulkOps.push({ updateOne: { filter: { _id: record._id }, update: updateDoc } });
      print(`Prepared update for record ID ${record._id}`);
    }

    fieldProcessed++;
    totalProcessed++;

    // Execute bulk updates in batches
    if (bulkOps.length === BATCH_SIZE) {
      db.records.bulkWrite(bulkOps);
      print(`Processed ${fieldProcessed} records for field ID ${fieldId} (batch completed).`);
      bulkOps.length = 0;
    }
  });

  if (bulkOps.length > 0) {
    db.records.bulkWrite(bulkOps);
    print(`Processed ${fieldProcessed} records for field ID ${fieldId} (final batch).`);
  }

  print(`Finished processing field ID ${fieldId}, total processed: ${fieldProcessed}`);
});

print(`Completed processing. Total records processed: ${totalProcessed}`);
```

## When to Run the Script

This query is resource intensive and should be run only when absolutely necessary. Below are the scenarios in which you may consider running it:

- **Critical business impact:** Inaccessible attachments are causing disruptions or critical workflow failures.
- **Large-scale corruption:** A significant number of records contain corrupted attachments and resolving them manually is not feasible.

## Preconditions

To ensure the script runs efficiently and safely:

- **Indexing:** Ensure the following indexes are in place: `values.k` in the `records` collection and `_id` in the `fs.files` collection (see the index sketch after this list).
- **Testing:** Test the script on a small subset of records to verify its behavior before running it on the full dataset (a read-only dry run is sketched below).
- **Batch execution:** If you have a large dataset, split the execution into smaller chunks, such as processing records by application or by attachment field (see the per-application sketch below).
- **Monitoring:** Monitor database performance during execution to ensure minimal disruption to other operations (see the monitoring sketch below).
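For the indexing precondition, note that the `_id` index on `fs.files` exists by default, so in practice only the `values.k` index usually needs to be created. A minimal sketch, assuming a mongosh session connected to the Swimlane database:

```javascript
// Create the supporting index for the "values.k" lookups the script performs.
// createIndex is a no-op if an identical index already exists.
db.records.createIndex({ "values.k": 1 });

// Confirm both indexes are in place before running the cleanup
printjson(db.records.getIndexes());
printjson(db.fs.files.getIndexes()); // the default _id index covers the GridFS lookups
```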
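For the testing precondition, one option is a read-only dry run of the same validation logic that reports what the script would remove without writing anything. This is a sketch reusing the field layout from the script above; `testFieldId` is a placeholder you would substitute:

```javascript
// Dry run: report invalid fileIds for a handful of records of one field,
// without modifying any documents.
const testFieldId = "<attachment-field-id>"; // placeholder: substitute a real field id

db.records.find({ "values.k": testFieldId }).limit(10).forEach(record => {
  const fieldData = record.values.find(value => value.k === testFieldId);
  const attachments = fieldData?.v?.v || [];
  attachments.forEach(attachment => {
    if (!db.fs.files.findOne({ _id: attachment.fileId })) {
      print(`Would remove fileId ${attachment.fileId} from record ${record._id}`);
    }
  });
});
```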
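To process records one application at a time, the aggregation that collects attachment fields can be restricted to a single application before the main loop runs. A sketch using the same match stages as the script above, with the hypothetical placeholder `targetApplicationId`:

```javascript
// Collect attachment fields for a single application only, then feed the
// result into the main loop in place of the full attachmentFields list.
const targetApplicationId = "<application-id>"; // placeholder: substitute a real application _id

const scopedAttachmentFields = db.applications.aggregate([
  { $match: { _id: targetApplicationId } }, // limit to one application
  { $unwind: "$fields" },
  { $match: { "fields._t": "AttachmentField" } },
  { $project: { fieldId: "$fields.id", applicationId: "$_id" } }
]).toArray();
```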
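For the monitoring precondition, MongoDB's built-in server commands are usually sufficient. The sketch below runs from a second mongosh session while the cleanup executes; the 5-second threshold is an arbitrary example:

```javascript
// Surface long-running operations triggered by the cleanup
db.currentOp({ active: true, secs_running: { $gt: 5 } });

// Spot-check overall load between batches
printjson(db.serverStatus().opcounters);
```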
## Summary

- **Time complexity:** The query has a time complexity of O(R × (A_r + U_r)), where R is the total number of records in the `records` collection, A_r is the average number of attachments per record, and U_r is the average number of fields updated per record.
- **Performance considerations:** Linear complexity with respect to records and attachments makes the query feasible, but resource intensive for large datasets.
- **Recommended usage:** Run only when necessary to address severe business or functional issues; avoid running it as part of routine maintenance.
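As a rough illustration with made-up numbers: for 1,000,000 records averaging 3 attachments each, the script performs on the order of 3 million indexed `fs.files` lookups, plus one `bulkWrite` per 1,000 prepared updates, so plan the maintenance window accordingly.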