Overview
Nebula is a browser-based, single-page tool for intensity-matrix QC and visualization. This tab documents the key hard-coded behaviors in the app for data import, replicate parsing, and group completeness filtering (plus the technical-replicate algorithm where it still applies in code).
SAINT and Enrichr (v2.29): load matrix and meta in Data Preparation, then use SAINT to map control/treatment and bait columns, run analysis, and open Network / Scatter Plot sub-tabs. Enrichr accepts pasted genes, current matrix row labels, or SAINT prey names after a run. D3 v7 for SAINT is loaded on demand and does not replace Clustergrammer’s D3 v3.
Enrichr enrichment plots (v2.69+): After enrichment returns the term table, open the Enrichment plots sub-tab (next to Results table) for Plotly charts built from the full API result (not only the paginated table). The main panel uses tabs for the horizontal bar chart (rank by adjusted P, P-value, or combined score), the bubble plot (overlap count or odds ratio vs term; color −log10(adjusted P) or adjusted P), and the Jaccard heatmap between top terms using overlapping genes from each row. Plot options sit in the left sidebar; use Export results TSV for a tab-separated download of parsed columns (includes one TSV column per data matrix sample when row count matches and gene mapping is available).
Enrichr → Heatmap (v3.01+): Third sub-tab draws a Plotly heatmap of per-sample matrix intensities only (columns = samples). Adjusted P, P-value, combined score, odds ratio, and overlap are row filters in the sidebar; max adj. P defaults to 0.05 (clear to disable). Terms are ranked and capped by max rows. For hierarchical clustering of the full protein × sample matrix, use the top-level Heatmap tab.
Heatmap pop-out (v2.82+): After generating a heatmap, use Pop out in the left sidebar (to the right of Generate heatmap) to open popout/heatmap_popout.html in a new window with the same Plotly figure (full-window autosize), using postMessage and sessionStorage like the UpSet/Venn pop-out. The button appears only after a heatmap exists.
Data QC → Overall (v4.18+): Matrix-wide QC on the current matrix (v4.19+: Plotly per-column summary bars for all samples — mean raw/transformed or detection rate — plus D3 box plots and total Σ log10(1+I); transform / treat-zero follow Data QC → Column correlation shared controls). Use Refresh dashboard after loading or filtering; auto-refresh when Data Filter updates while Overall is active.
Data QC → Column correlation (v4.19+): Inner tabs Overall correlations (subset heatmaps) and Paired correlation (scatter, QQ, Bland–Altman, r vs X) use the same data-filter-inner-tabs-row / data-filter-inner-tab markup and styling as Data PreProcess → Data Filter nested tabs (e.g. Row Filter → Passed). Shared Transform / Treat 0 as missing at the top of the sidebar.
Use the outline on the left to jump to specific sections (including Column profile and Column correlation under Data PreProcess, sections 8–9).
Input Data
- Accepts clean matrix files (TSV/CSV/XLSX) and DIANN protein-group matrices.
- First row is treated as sample columns; first column is row ID.
- After loading a DIANN protein-group matrix, use Data PreProcess → DIANN annotations for a read-only table of the file's first six (annotation) columns; row order, sort, and pagination match the Data Filter sub-tab.
DIANN Upload Rules
- For DIANN `results.pg*_matrix.tsv`, intensity values are read from column 7 onward; the first six columns are stored as per-row annotation (same file).
- **Data PreProcess → DIANN annotations** lists those columns in a read-only table (same row order, sort, and pagination as Data Filter).
- Default row-ID column is `Protein.Names` (plots and tables use this ID); you can choose `Genes`, `First.Protein.Description`, or `Protein.Group` instead.
- **Data PreProcess → Row Profile** search matches the primary ID plus all stored annotation fields (e.g. gene, protein names, descriptions). **Row label** dropdown chooses which DIANN annotation column labels the checklist and plot legend (`Protein.Group`, `Protein.Names`, `Genes`, `First.Protein.Description`); without DIANN pg annotations, matrix row IDs are used. Sidebar width is enlarged for the checklist. Quick selection: **Select all rows**, **Select top N** by row intensity sum (numeric input, default 5), **Clear selection**.
- If the selected ID column is missing from the header, fallback order is `Protein.Names` → `Genes` → `First.Protein.Description` → `Protein.Group` → first column.
- If the selected ID column is not `Protein.Group`, rows whose selected ID is blank are removed.
- For DIANN `results.gg*_matrix.tsv`, intensity values are read from column 4 onward and row IDs always come from the first column (`Genes`).
- Sample names are simplified from raw file paths to concise informative names.
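The row-ID rules above can be sketched as follows. This is a minimal Python illustration, not the app's own JavaScript; the function names (`resolve_id_column`, `split_pg_row`, `keep_row`) are hypothetical, and only the fallback order, the six-column annotation split, and the blank-ID rule come from the list above.

```python
# Fallback order stated in the upload rules above.
FALLBACK_ORDER = ["Protein.Names", "Genes", "First.Protein.Description", "Protein.Group"]


def resolve_id_column(header, selected):
    """Return the header name used for row IDs (hypothetical helper)."""
    if selected in header:
        return selected
    for name in FALLBACK_ORDER:
        if name in header:
            return name
    return header[0]  # last resort: first column


def split_pg_row(header, row, id_column, n_annotation=6):
    """Split one pg-matrix row into (row_id, annotations, intensity cells)."""
    annotations = dict(zip(header[:n_annotation], row[:n_annotation]))
    intensities = row[n_annotation:]  # column 7 onward
    return annotations.get(id_column, ""), annotations, intensities


def keep_row(row_id, id_column):
    """Blank IDs are dropped unless the ID column is Protein.Group."""
    return id_column == "Protein.Group" or row_id.strip() != ""
```

For a header that lacks `Protein.Names`, `resolve_id_column(header, "Protein.Names")` falls through to `Genes` when that column exists.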
Name mapping (Data PreProcess)
Data PreProcess → Name mapping calls the free MyGene.info API (no key) to map matrix row IDs to gene_symbol, protein_name, uniprotkb_ac, and related fields. Choose species (human / mouse / rat / custom NCBI taxid), whether IDs are gene symbols, UniProt accessions, or auto (accession regex vs symbol), then Run mapping. Large tables are queried in batches; Stop keeps partial results.
Results are stored in the session as featureNameMapHeaders / featureNameMapRows and appear in feature label dropdowns (heatmap, Clustergrammer, Differential, column profile, column correlation, Row Profile) and in Row Profile search — similar to DIANN row annotations. Mapping is best-effort; ambiguous or missing hits are noted in mapping_note.
If the browser blocks cross-origin requests (e.g. opening the app as file://), serve the folder over http://localhost or another HTTP server so fetch to MyGene can succeed.
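A rough Python sketch of the "auto" ID detection and batching described above (the app itself runs in JavaScript). Everything here is an assumption for illustration: the regex is UniProt's published accession pattern rather than the app's own, the batch size of 500 is arbitrary, and the payload keys (`q`, `scopes`, `fields`, `species`) follow my understanding of the MyGene.info v3 batch query service. No request is sent.

```python
import re

# UniProt's published accession pattern; the app's actual "auto" regex is not
# shown in this document, so this is an assumption.
UNIPROT_AC = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def classify_id(row_id):
    """Guess whether a row ID is a UniProt accession or a gene symbol."""
    return "uniprot" if UNIPROT_AC.fullmatch(row_id.strip()) else "symbol"


def batches(ids, size=500):
    """Yield consecutive slices so large tables are queried in chunks."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def build_query(batch, scope, species="human"):
    """Build the form payload for one batch (hypothetical helper)."""
    return {
        "q": ",".join(batch),
        "scopes": scope,
        "fields": "symbol,name,uniprot",
        "species": species,
    }
```

Because each batch is independent, a caller can stop between batches and keep whatever results have already been collected.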
Replicate Parsing
- Parser detects replicate tags from sample names.
- Supports names carrying both tags (`R#_T#`), only a complete `R1..RN` series, or only a complete `T1..TN` series.
- A replicate axis is accepted only when all samples contain that axis and indices are contiguous from `1` to `N`.
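The acceptance rule can be illustrated with a short Python sketch. The token pattern (an `R` or `T` followed by digits, not embedded in a longer alphanumeric word) and the function names are assumptions; the app's actual parser may tokenize names differently.

```python
import re


def axis_indices(names, letter):
    """Return the index found for `letter` in every name, or None if any name lacks it."""
    pattern = re.compile(rf"(?<![A-Za-z0-9]){letter}(\d+)(?![A-Za-z0-9])")
    found = []
    for name in names:
        match = pattern.search(name)
        if match is None:
            return None  # the axis must be present on all samples
        found.append(int(match.group(1)))
    return found


def is_accepted(indices):
    """Accept an axis only if its indices cover 1..N with no gaps."""
    if not indices:
        return False
    return set(indices) == set(range(1, max(indices) + 1))


def detect_axes(names):
    """Report which replicate axes (R, T) pass the acceptance rule."""
    return {axis: is_accepted(axis_indices(names, axis)) for axis in ("R", "T")}
```

For example, names ending in `_R1` and `_R3` only would fail the `R` axis because index 2 is missing.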
Group completeness filter (Data Filter → Row Filter)
- Apply group completeness filter (Data PreProcess → Data Filter → Row Filter): choose one or two metadata columns to define groups (same combination of meta values ⇒ one group).
- A matrix cell counts as valid if it is finite and > 0.
- Threshold: minimum count of valid columns in the group, or minimum percentage — required valid count = ⌈groupSize × P / 100⌉.
- For each row, if a group fails its threshold, all columns in that group for that row are set to `0`. Rows that are then all non-positive across every sample are removed; heatmap/PCA/t-SNE/Clustergrammer caches are reset.
- Filtered out (Data PreProcess, sub-tab next to Data Filter; v3.12: the tab appears only after at least one row has been dropped): lists row IDs removed for that reason, with intensities at the moment of removal and a short note of the last filter run. The list is cleared when you load new data or a session snapshot. Technical replicate filter (console) updates the same list when it drops all-zero rows.
- If Biological_Replicate / Technical_Replicate cells are empty in the meta table but the matrix column id still contains R# / T# tokens (e.g. DIANN paths), those tags are read from the column id for grouping so replicate triplets are not merged into one huge group.
- The legacy T1..TN technical replicate rule is still implemented in code as applyTechnicalReplicateFiltering() (strict complete technical sets on Group + Biological_Replicate) but is not exposed as a sidebar button.
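The validity test, threshold formula, zeroing, and row removal described in this list can be sketched in Python as below. The data layout (a dict of row ID to value list, groups as lists of column indices) and all function names are hypothetical; the app's JavaScript implementation is not reproduced here.

```python
import math


def is_valid(x):
    """A cell is valid if it is a finite number greater than 0."""
    return isinstance(x, (int, float)) and math.isfinite(x) and x > 0


def required_count(group_size, min_count=None, min_percent=None):
    """Percent mode uses ceil(groupSize * P / 100); otherwise the raw count."""
    if min_percent is not None:
        return math.ceil(group_size * min_percent / 100)
    return min_count


def apply_group_completeness(matrix, groups, min_count=None, min_percent=None):
    """Zero out failing groups per row, then drop rows with no valid cell left."""
    kept, removed_ids = {}, []
    for row_id, values in matrix.items():
        row = list(values)
        for cols in groups.values():
            need = required_count(len(cols), min_count, min_percent)
            if sum(is_valid(row[c]) for c in cols) < need:
                for c in cols:
                    row[c] = 0
        if any(is_valid(v) for v in row):
            kept[row_id] = row
        else:
            removed_ids.append(row_id)
    return kept, removed_ids
```

With three columns per group and a 70% threshold, `required_count(3, min_percent=70)` is 3, because 2.1 rounds up.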
“Ignore groups with fewer than 2 samples”
Each unique key from your one (or two) grouping column(s) defines a group: the matrix columns whose metadata match that key. The checkbox controls whether single-sample groups — only one sample column shares that key — participate in the filter.
When the option is on (default): only groups with at least two columns are kept. Single-sample groups are excluded from validGroups in the implementation: the completeness rule (min count or min percentage of valid >0 values) is not evaluated for those columns. Their intensities are left as they were before you click Apply (they are not zeroed by this rule, and they are not part of any other group’s column set). Use this when you only want to enforce completeness on replicate groups (e.g. R1–R3) and to avoid orphan conditions or odd one-off samples being forced through a rule that expects several values per group. It also avoids pathological cases with mixed designs: if you required e.g. “min count 3” and the smallest group had only one column, the tool would report an error; dropping singletons from the group list can make the remaining “valid” groups all large enough for your threshold.
When the option is off: every meta key counts as a group, including singletons (group size = 1). Then a count threshold greater than 1 cannot be met for that “group”, and the filter will reject the run if the smallest group size is below your threshold. Percent mode still runs, but for a single column the required count of valid values is ⌈1 × P/100⌉, which is 1 for any P from 1–100, so a singleton always “passes” unless the value is invalid (not >0).
Edge case: if all groups are singletons (e.g. every sample has a unique combination in the chosen meta column(s)), with the option on there are no groups of size ≥ 2 and the filter shows an error like No groups match the chosen columns and minimum group size. Uncheck the option or change the grouping so at least one key has two or more samples.
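A small Python sketch of how groups could be built from one or two meta columns and how the singleton option changes the result. It assumes meta rows are listed in the same order as matrix columns; `build_groups` and `check_count_threshold` are hypothetical names, and the second error message is invented for illustration (only the first is quoted from the text above).

```python
def build_groups(meta_rows, columns, ignore_small=True):
    """Map each unique meta key to the column indices that share it."""
    groups = {}
    for index, row in enumerate(meta_rows):
        key = tuple(row[c] for c in columns)
        groups.setdefault(key, []).append(index)
    if ignore_small:
        # Option on: single-sample groups are left out and never evaluated.
        groups = {k: v for k, v in groups.items() if len(v) >= 2}
    if not groups:
        raise ValueError("No groups match the chosen columns and minimum group size")
    return groups


def check_count_threshold(groups, min_count):
    """Reject a count threshold larger than the smallest remaining group."""
    smallest = min(len(cols) for cols in groups.values())
    if min_count > smallest:
        raise ValueError(
            f"min count {min_count} exceeds smallest group size {smallest}"
        )
```

With the option off, a singleton group stays in the dictionary, so a count threshold of 2 or more triggers the second error.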
Post-filter Behavior
- Table and row profile selectors are refreshed.
- Heatmap/PCA/t-SNE/Clustergrammer caches are reset to avoid stale results.
- Data PreProcess → Data Filter: shared matrix preview (Row Filter and Column Filter) uses a scroll area with sticky column headers. ID cells show a native tooltip with DIANN annotation fields (when loaded from a protein group matrix). Column Filter keeps or removes sample columns (per-column checkboxes, optional Choose by group using one or two meta columns—same rules as group completeness—with Check / Uncheck per group); Row Filter includes group completeness, global min-valid, and row ID include actions.
UpSet / Venn / Karnaugh
- Top-level tab UpSet / Venn / K-map uses UpSet.js (lazy-loaded from jsDelivr), following the linked components pattern. The library is licensed under GNU AGPL-3.0; consider license terms if you redistribute or use the app commercially.
- Each set is one matrix sample column, and each row (feature ID) is an element. Membership follows the sidebar Presence rule, and 0 is always treated as missing/NA. Choose either any finite non-zero value, or intensity > threshold (user-set cutoff; the default threshold of 0 means > 0, and values ≤ threshold are treated as absent).
- Set chooser: checkboxes for columns (default: first five selected), filter box, Select all / Clear, and Refresh plots.
- Set combinations (UpSet plot) (collapsible sidebar block, aligned with UpSet.js App): ordering, mode (intersections / unions / distinct intersections), min and max set-members (degree), max number of combinations after sort, and whether to include empty combinations — implemented with UpSetJS.generateCombinations (docs). Venn and Karnaugh still use the full generated combination lattice for layout.
- Main area (scrollable): UpSet full width on top; Venn and Karnaugh map side by side below (stacks vertically on narrow viewports). Each section title row includes a Pop out button that opens popout/upset_popout.html for that view; membership and UpSet combination options are delivered by postMessage (and sessionStorage when shared, e.g. HTTP). Hover in any view updates linked highlighting in the others (by intersection/set name). The pop-out plot uses the same onHover / selection behavior for that single view and reflows on window resize. The built-in plot share control is hidden so PNG/SVG/dump/VEGA toolbar buttons work reliably. With many sets selected, Venn/Karnaugh can be crowded; a note appears when more than six sets are selected.
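The Presence rule that turns sample columns into sets can be sketched in Python as follows. The function names and the convention that `threshold=None` selects the "any finite non-zero value" mode are assumptions made for this example; the resulting sets are what a set-combination library would then consume.

```python
import math


def is_present(value, threshold=None):
    """Apply the presence rule to one cell; 0 and non-finite values are always absent."""
    if value is None or not math.isfinite(value) or value == 0:
        return False
    if threshold is None:
        return True  # "any finite non-zero value" mode
    return value > threshold  # "intensity > threshold" mode


def column_sets(row_ids, columns, threshold=None):
    """Build one set of row IDs per sample column."""
    return {
        name: {rid for rid, v in zip(row_ids, values) if is_present(v, threshold)}
        for name, values in columns.items()
    }
```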
Column profile
- Data PreProcess sub-tab Column profile (after Row Profile): one button per sample column above the plots.
- For the selected column, finite numeric values are sorted high → low. Interactive Plotly views: ranked bars + line (linear intensity y-axis: scientific notation, e.g. 1.2e+6), cumulative % of total, treemap, and pie (top N features + Other; sidebar Max features). Optional log10(1 + intensity) for the bar/line panel uses fixed-decimal y ticks on the transformed scale.
- A full-width panel plots a histogram (probability density) and Gaussian KDE curve of log10(1 + intensity) for every feature in that column (not limited by Max features); KDE uses a Silverman bandwidth and subsamples very large matrices for speed.
- Row Profile bar/line chart: linear Value y-axis uses the same scientific tick style; log Y and Z-score modes keep their own tick formats. The plot uses the full width and height of the Row Profile plot column inside a framed container; sample names get extra bottom space and smaller tick fonts when there are many columns.
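For the Column profile views above, the ranking, cumulative percentage, and "top N + Other" grouping can be sketched in Python like this. The helper names are hypothetical, and dropping an empty "Other" slice is an assumption rather than documented behavior.

```python
import math


def rank_column(row_ids, values):
    """Keep finite numeric values and sort them high to low."""
    pairs = [
        (rid, v)
        for rid, v in zip(row_ids, values)
        if isinstance(v, (int, float)) and math.isfinite(v)
    ]
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return pairs


def cumulative_percent(ranked):
    """Running share of the column total, in percent, in ranked order."""
    total = sum(v for _, v in ranked)
    running, out = 0.0, []
    for _, v in ranked:
        running += v
        out.append(100.0 * running / total if total else 0.0)
    return out


def top_n_with_other(ranked, max_features):
    """Top N features plus one 'Other' slice holding the remainder."""
    head = ranked[:max_features]
    rest = sum(v for _, v in ranked[max_features:])
    return head + ([("Other", rest)] if rest > 0 else [])
```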
Column correlation
Data PreProcess sub-tab Column correlation (after Column profile) explores sample–sample relationships: matrix columns are treated as vectors over rows (features). This view is for QC and replicate agreement, not differential expression.
Transforms and missing values: Choose none, log10(1 + x), or log2(1 + x). Optional Treat 0 as missing (default on) matches DIANN-style intensity QC. Pair plots and correlations use pairwise-complete rows only (finite values after transform and missing rules).
Nested tab strip: Uses the same data-filter-inner-tabs-row / data-filter-inner-tab styling as Data PreProcess → Data Filter (e.g. Row Filter → Passed). Overall correlations shows subset heatmaps plus a mixed matrix plot: lower triangle = pairwise scatter plots; upper triangle = color-coded correlation cells. Paired correlation shows scatter, QQ, Bland–Altman, and the bar chart of r(column X, j) for each column j in the heatmap subset.
Pair plots: Pick Column X and Column Y (sidebar on the Paired tab); scatter includes an identity line when scales match. QQ compares sorted quantiles. Bland–Altman: difference vs mean, or log2(Y/X) vs geometric mean when both raw values are strictly positive.
Correlation and distance matrices: Pearson or Spearman (ranks with average ties, then Pearson on ranks). Matrices use the first N columns in file/matrix order (Max columns in heatmaps, default 40, cap 80). Distance: 1 − r, √(2(1 − r)), or Euclidean on z-scored column vectors (pairwise-complete rows).
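The transform, pairwise-complete pairing, Pearson/Spearman, and correlation-based distances described above can be sketched in plain Python. This is an illustration with hypothetical function names, not the app's JavaScript; the clamp inside the square root is an added guard against floating-point values of r slightly above 1.

```python
import math


def transform(x, mode="log2", zero_missing=True):
    """Return the transformed value, or None when the cell counts as missing."""
    if x is None or not math.isfinite(x) or (zero_missing and x == 0):
        return None
    if mode == "none":
        return x
    if 1 + x <= 0:
        return None  # log(1 + x) undefined
    return math.log10(1 + x) if mode == "log10" else math.log2(1 + x)


def pairwise_complete(xs, ys):
    """Keep only rows where both columns have a value."""
    return [(a, b) for a, b in zip(xs, ys) if a is not None and b is not None]


def pearson(pairs):
    n = len(pairs)
    if n < 2:
        return float("nan")
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    first, count = {}, {}
    for pos, v in enumerate(sorted(values), start=1):
        first.setdefault(v, pos)
        count[v] = count.get(v, 0) + 1
    return [first[v] + (count[v] - 1) / 2 for v in values]


def spearman(pairs):
    """Pearson correlation computed on average ranks."""
    rx = average_ranks([a for a, _ in pairs])
    ry = average_ranks([b for _, b in pairs])
    return pearson(list(zip(rx, ry)))


def correlation_distances(r):
    """Return (1 - r, sqrt(2 * (1 - r)))."""
    return 1 - r, math.sqrt(max(0.0, 2 * (1 - r)))
```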
Per-column summary (Plotly): Moved to Data QC → Overall — bars for every sample column (mean raw, mean transformed, or detection rate). Uses the same transform and zero-as-missing rules as this tab’s Shared scale block.
Session JSON saves the sidebar control values (colCorr* ids) so imports restore your settings; reopen the sub-tab or use Refresh plots to redraw.
Differential analysis
The top-level Differential tab compares groups on the loaded intensity matrix using a metadata column (all columns except Sample_ID). Matrix column headers are matched to meta rows by Sample_ID (same join as technical-replicate filtering).
Two groups (default): Choose Group A and Group B. Per-feature Welch (default) or Student t-tests. Optional Log2(x + 1) before testing; with it on, log2FC is the difference of group means on that scale. Without log transform, log2FC is log2((meanB + pseudocount)/(meanA + pseudocount)) on linear means. Treat 0 / invalid as missing and Min valid values per group apply per group. Two-sided p-values use the Student t CDF (incomplete beta). Each group needs at least two samples.
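The two log2FC definitions above can be written out as a short Python sketch. The direction (B minus A) on the log scale is assumed to match the linear formula, the pseudocount default of 1.0 is a placeholder, and the function names are hypothetical. Intensities are assumed to be non-negative.

```python
import math
from statistics import mean


def valid_values(values, zero_missing=True):
    """Drop non-finite cells and, optionally, zeros (treated as missing)."""
    out = []
    for v in values:
        if v is None or not math.isfinite(v):
            continue
        if zero_missing and v == 0:
            continue
        out.append(v)
    return out


def log2_fold_change(group_a, group_b, log_transform=True,
                     pseudocount=1.0, min_valid=2):
    """Return log2FC for one feature, or None if a group has too few valid values."""
    a, b = valid_values(group_a), valid_values(group_b)
    if len(a) < min_valid or len(b) < min_valid:
        return None
    if log_transform:
        # Difference of group means on the log2(x + 1) scale.
        return mean(math.log2(v + 1) for v in b) - mean(math.log2(v + 1) for v in a)
    # Ratio of linear means, stabilized by the pseudocount.
    return math.log2((mean(b) + pseudocount) / (mean(a) + pseudocount))
```

The two modes generally give different numbers for the same feature, since a mean of logs is not the log of a mean.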
Multiple groups: Mode Multiple groups (ANOVA / Kruskal–Wallis) shows a checklist of meta levels (all checked by default). Each included group must have at least two samples after the meta join. Tests run on the same transformed scale as two-group mode. One-way ANOVA uses the classical F-test (which assumes equal variances between groups). Kruskal–Wallis is rank-based; the raw p-value uses a χ² approximation (Wilson–Hilferty) on H with k−1 df. Effect summaries report η² (ANOVA: SSB/SST, which equals partial η² in a one-way design; KW: (H−df)/(N−1) as a simple effect-size style quantity).
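As a worked illustration of the statistics named above, the following Python sketch computes the one-way ANOVA F statistic with SSB/SST, and the Kruskal–Wallis H with the (H − df)/(N − 1) summary. The function names are hypothetical, p-values are left out, and H is computed without a tie correction, which is an assumption since the text does not say whether the app applies one.

```python
def one_way_anova(groups):
    """Return (F, SSB/SST, (df_between, df_within)) for a list of value lists."""
    values = [v for g in groups for v in g]
    n, k = len(values), len(groups)
    grand = sum(values) / n
    ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    sst = sum((v - grand) ** 2 for v in values)
    ssw = sst - ssb
    if ssw == 0 or sst == 0:
        return float("nan"), float("nan"), (k - 1, n - k)
    f_stat = (ssb / (k - 1)) / (ssw / (n - k))
    return f_stat, ssb / sst, (k - 1, n - k)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    first, count = {}, {}
    for pos, v in enumerate(sorted(values), start=1):
        first.setdefault(v, pos)
        count[v] = count.get(v, 0) + 1
    return [first[v] + (count[v] - 1) / 2 for v in values]


def kruskal_wallis(groups):
    """Return (H, (H - df) / (N - 1), df); no tie correction applied."""
    values = [v for g in groups for v in g]
    n, k = len(values), len(groups)
    ranks = average_ranks(values)
    h, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        h += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * h - 3 * (n + 1)
    df = k - 1
    return h, (h - df) / (n - 1), df
```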
Multi-group plots: Volcano uses η² or the test statistic (F or H) on the x-axis vs −log10(p) or −log10(FDR). Cutoffs in the sidebar (η², optional minimum statistic) combine with p/FDR for point coloring. Mean range (formerly MA for two groups) plots grand mean of group means vs (max − min) group mean. Group heatmap shows the top N features by raw p with optional row z-score. P-value histogram is unchanged.
Multiple testing: None, Benjamini–Hochberg FDR, or Bonferroni on the vector of raw p-values across tested features (same for t-test, ANOVA, and Kruskal–Wallis).
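Both corrections operate on the full vector of raw p-values. A minimal Python sketch (function names hypothetical; the app's JavaScript may differ in details such as handling of missing p-values):

```python
def bonferroni(pvals):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """Step-up BH adjustment: p * m / rank, made monotone from the largest p down."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Adjusted values are returned in the original feature order, so they can be joined back to the results table by position.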
Results table (two-group), t and df: t is the two-sample t-statistic (difference in group means relative to its standard error). Larger |t| means stronger separation relative to noise. df is the degrees of freedom used by the t-distribution to compute the raw p-value. With the Student test (pooled variance), df = nA + nB − 2, where nA and nB count the valid values in each group. With the Welch test (unequal variance), df uses the Welch–Satterthwaite approximation and may be non-integer.
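The t and df columns can be reproduced with a few lines of Python. This sketch assumes the sign convention B minus A (to match the log2FC direction) and uses a hypothetical function name; it stops at t and df and does not compute the p-value.

```python
import math
from statistics import mean, variance


def two_sample_t(a, b, welch=True):
    """Return (t, df) for two lists of valid values, each of length >= 2."""
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)  # sample variances (n - 1 denominator)
    diff = mean(b) - mean(a)
    if welch:
        se2 = va / na + vb / nb
        if se2 == 0:
            return float("nan"), float("nan")
        # Welch-Satterthwaite approximation; usually non-integer.
        df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    else:
        pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
        se2 = pooled * (1 / na + 1 / nb)
        if se2 == 0:
            return float("nan"), float("nan")
        df = na + nb - 2
    return diff / math.sqrt(se2), df
```

For equal group sizes the two tests give the same t but different df, which is why the p-values can still differ.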
Fudge factor volcano (two-group only): Optional SAM-style joint rule inspired by Giai Gianetto et al., Proteomics 2016 (uses and misuses of the fudge factor). On the tested scale, the SE of the mean difference is approximated as |log2FC / t| (when t ≈ 0, the median SE across features is used). The user-set s₀ is added to SE in the denominator: d = |log2FC| / (SE + s₀).
Guide curve: The green guide uses median SE and median df and is drawn as two line traces (negative and positive log2FC, with a small gap at 0). For |log2FC| ≥ need, where need = t★(median SE + s₀), y = y₀ = −log10(α) (flat tails). For |log2FC| < need, the height rises hyperbolically (as 1/|x|) from y₀ at |x| = need to ycap at the inner edge of the drawn branch. The silhouette is therefore flat at the sides with curved “wings” toward the center, not a smooth dome.
Point colors and pmod: Point colors (volcano, MA, results table) use the same median-based boundary: a feature is red/blue if its plotted y (−log10 p or FDR, same cap as the plot) lies on or above that curve at its log2FC and the fold direction matches the sign of log2FC. pmod in the tooltip still uses each feature’s own SE and df (diffStudentTTwoTailP(d, df)) as a t-distribution shorthand, not full SAM permutation. Multi-group ANOVA/Kruskal–Wallis keeps the rectangular volcano; fudge controls are hidden in that mode.
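The SE approximation and the moderated statistic d can be sketched in Python as follows. The helper names, the `eps` cutoff for "t ≈ 0", and taking the median over features with a defined SE are assumptions; the critical value `t_star` is supplied by the caller rather than computed here.

```python
from statistics import median


def standard_errors(log2fc, t_values, eps=1e-12):
    """Approximate SE as |log2FC / t|; fall back to the median SE when t is ~0."""
    raw = [
        abs(fc / t) if abs(t) > eps else None
        for fc, t in zip(log2fc, t_values)
    ]
    known = [s for s in raw if s is not None]
    fallback = median(known) if known else float("nan")
    return [s if s is not None else fallback for s in raw]


def fudge_d(log2fc, se, s0):
    """Moderated statistic d = |log2FC| / (SE + s0) per feature."""
    return [abs(fc) / (s + s0) for fc, s in zip(log2fc, se)]


def guide_need(t_star, se, s0):
    """|log2FC| at which the guide curve flattens: t_star * (median SE + s0)."""
    return t_star * (median(se) + s0)
```

Raising s₀ shrinks d most for features with small SE, which is the intended effect of a fudge factor: low-variance features with tiny fold changes are penalized.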
Session JSON: Saves diffAnalysisMode, multi-group test, volcano x-axis and effect cutoffs, heatmap options, diffVolcanoFudgeFactor, diffFudgeS0, and the list of checked ANOVA levels (diffAnovaIncludedLevels) so imports can rebuild the checklist after the meta column is repopulated. formatVersion 3+ also stores the full diffAnalysisLastResult and table sort state so volcano / MA / heatmap / table restore without recomputation.
Not in scope (browser-only tool): Paired tests and full limma empirical Bayes or SAM permutation calibration. DESeq2 / edgeR / limma-voom require count matrices, normalization/dispersion, and an R/Bioconductor runtime — they are not implemented here. For RNA-seq–style DE, export intensities and design to R or a dedicated pipeline.