If you wish to construct a prediction model from summary statistics, we generally recommend you use BayesR-SS, contained within the full version of MegaPRS. However, running the full version of MegaPRS requires both training and test summary statistics. If you only have one set of summary statistics (and you are unable to generate Pseudo Summaries), then here we explain how you can create models using Lasso-SS and Ridge-SS (in general, we find Lasso-SS performs best). These instructions require that you have already estimated Per-Predictor Heritabilities and that you have a Reference Panel.

Always read the screen output, which suggests arguments and estimates memory usage.

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

**Calculate predictor-predictor correlations:**

The main argument is --calc-cors <outfile>.

This requires the options

--bfile/--gen/--sp/--speed <datastem> - to specify the genetic data files (see File Formats).

--window-cm <float> - to specify the window size (LDAK will only compute correlations for predictors within this distance). It generally suffices to use --window-cm 3. If the genetic data files do not contain genetic distances, an approximate solution is to instead use --window-kb 3000.
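If you are unsure whether your data contain genetic distances, the following is a minimal sketch of one way to check. It assumes PLINK-format data, where the third column of the .bim file holds the genetic distance and is set to zero when distances are unavailable. The functions has_genetic_distances and suggest_window are hypothetical helpers written for this illustration; they are not part of LDAK.

```shell
# Hypothetical helper (not part of LDAK): succeed if any predictor in a
# PLINK .bim file has a non-zero genetic distance (column 3).
has_genetic_distances() {
  awk '$3 != 0 { found = 1; exit } END { exit !found }' "$1"
}

# Print the window option recommended in the text for --calc-cors.
suggest_window() {
  if has_genetic_distances "$1"; then
    echo "--window-cm 3"
  else
    echo "--window-kb 3000"
  fi
}

# Usage (assumes human.bim is in the current directory):
# suggest_window human.bim
```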

Use --keep <keepfile> and/or --remove <removefile> to restrict to a subset of samples. As explained above, if you are using pseudo-summaries, then you must ensure the samples used to calculate predictor-predictor correlations are distinct from those used to generate the pseudo summaries and those used to test prior parameters.

By default, LDAK will save correlations between pairs that are significant (P<0.01). This corresponds to predictor pairs with squared correlation greater than 1-exp(-6.6/n), where n is the sample size. To change this threshold use --min-cor <thresh>.
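To get a feel for this default threshold, the sketch below evaluates 1-exp(-6.6/n) for a given sample size. The function min_sq_cor is a hypothetical helper for illustration only; LDAK computes the threshold itself.

```shell
# Hypothetical helper (not part of LDAK): squared-correlation threshold
# 1-exp(-6.6/n) implied by P<0.01, for a reference panel of n samples.
min_sq_cor() {
  awk -v n="$1" 'BEGIN { printf "%.6f\n", 1 - exp(-6.6 / n) }'
}

min_sq_cor 1000   # prints 0.006578
```

So with 1000 reference samples, pairs with squared correlation below about 0.0066 would not be saved by default.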

To specify a subset of predictors, use --extract <extractfile> and/or --exclude <excludefile>. In particular, you can reduce the computational burden by restricting to predictors for which you have summary statistics. Note that if you plan to analyse summary statistics from multiple association studies, each of which used different SNPs, then it is probably easier to calculate predictor-predictor correlations once using all available predictors, then use these correlations for all analyses.
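One possible way to build such an extract file is sketched below. It assumes the summary statistics file has a header row and stores predictor names in its first column; please check this matches your file. The function make_extract_list and the file name sums.predictors are hypothetical names used only for this illustration.

```shell
# Hypothetical helper (not part of LDAK): list the predictors that have
# summary statistics. Assumes a header row, with names in column 1.
make_extract_list() {
  # $1 = summary statistics file, $2 = output list of predictor names
  awk 'NR > 1 && !seen[$1]++ { print $1 }' "$1" > "$2"
}

# Usage (assumes quant.summaries is in the current directory):
# make_extract_list quant.summaries sums.predictors
# ./ldak.out --calc-cors cors --bfile human --window-cm 3 --extract sums.predictors
```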

The predictor-predictor correlations will be saved in the file <outfile>.cors.bin (this is a binary file, so not human-readable), while details of the data files are provided in <outfile>.cors.root.

To parallelize this process, you can add --chr <integer> to calculate correlations for each chromosome separately, then combine these with the argument --join-cors <output>, using the option --corslist <corsstems> to specify the per-chromosome correlations (see the example below).

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

**Estimate effect sizes:**

The main argument is --mega-prs <outfile>.

This requires the options

--model <type> - to specify the tool. Because you have only one set of summary statistics, you can only use either Lasso-SS (--model lasso) or Ridge-SS (--model ridge). In general, we recommend using Lasso-SS (i.e., using --model lasso).

--cors <corstem> - to specify the predictor-predictor correlations.

--bfile/--gen/--sp/--speed <datastem> - to specify the genetic data files used to compute the predictor-predictor correlations.

--one-sums YES - to tell LDAK to expect only one set of summary statistics.

--summary <sumsfile> - to specify the file containing summary statistics.

--ind-hers <indhersfile> - to specify the per-predictor heritabilities.

--multi-hers NO - to tell LDAK to use only the specified heritability (by default it also considers values a bit lower and a bit higher).

--window-cm <float> - to specify the window size (see below). It generally suffices to use --window-cm 1. If the genetic data files do not contain genetic distances, an approximate solution is to instead use --window-kb 1000.

By default, LDAK will ignore predictors with ambiguous alleles (those with alleles A & T or C & G) to protect against possible strand errors. If you are confident that these are correctly aligned, you can force LDAK to include them by adding --allow-ambiguous YES.
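To see how many predictors this affects, the sketch below lists the predictors with ambiguous alleles in a PLINK .bim file (where column 2 is the predictor name and columns 5 and 6 are the two alleles). The function list_ambiguous is a hypothetical helper, not an LDAK command.

```shell
# Hypothetical helper (not part of LDAK): print the names of predictors
# whose alleles are A & T or C & G, read from a PLINK .bim file.
list_ambiguous() {
  awk '{
    pair = toupper($5) toupper($6)
    if (pair == "AT" || pair == "TA" || pair == "CG" || pair == "GC") print $2
  }' "$1"
}

# Usage (assumes human.bim is in the current directory):
# list_ambiguous human.bim | wc -l
```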

To specify a subset of predictors, use --extract <extractfile> and/or --exclude <excludefile>. Note that LDAK will report an error if any predictors are missing summary statistics (in which case, you can use --extract <extractfile> to specify the predictors with summary statistics).

LDAK will estimate effect sizes using overlapping windows of predictors. The step size is the window size (specified using --window-cm <float> or --window-kb <float>) divided by the number of segments. By default, there are eight segments, but you can change this using --segments <powerof2>. For example, if you used --window-cm 3 and --segments 2, then each predictor will be included in two windows (so the windows will start 1.5cM apart).
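The step size arithmetic can be checked with a short sketch. The function window_step is a hypothetical helper that simply divides the window size by the number of segments, as described above.

```shell
# Hypothetical helper (not part of LDAK): step size between window starts,
# computed as window size divided by number of segments.
window_step() {
  # $1 = window size (cM or kb), $2 = number of segments
  awk -v w="$1" -v s="$2" 'BEGIN { printf "%g\n", w / s }'
}

window_step 3 2   # prints 1.5 (the example above)
window_step 1 8   # prints 0.125 (--window-cm 1 with the default eight segments)
```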

The estimated effect sizes will be saved in the file <outfile>.effects.

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

**Example:**

Here we use the binary PLINK files human.bed, human.bim and human.fam, and the phenotype quant.pheno from the Test Datasets, as well as the file highld.txt from High-LD Regions. We also use the file ldak.thin.ind.hers, created in the example for Per-Predictor Heritabilities. This contains estimates of the heritability contributed by each predictor, obtained assuming the LDAK-Thin Model (note that we normally recommend using the BLD-LDAK Model, but as this is only an example, we use the simpler model).

Although we have individual-level data (i.e., we have genotypes and phenotypes for the same samples), for this example we will pretend we are using summary statistics. Therefore, we will first create summary statistics by running

./ldak.out --linear quant --bfile human --pheno quant.pheno

The summary statistics are saved in quant.summaries (already in the format required by LDAK). For more details on this command, see Single-Predictor Analysis. Then we will use the genetic data files as the reference panel.

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

**1 - Calculate predictor-predictor correlations.**

We run the command

./ldak.out --calc-cors cors --bfile human --window-cm 3

The significant correlations are saved in cors.cors.bin, with some details in cors.cors.root. If analyzing very large data, we can parallelize the process by computing correlations separately for each chromosome, then merging:

for j in {21..22}; do

./ldak.out --calc-cors cors$j --bfile human --window-cm 3 --chr $j

done

rm -f list.txt; for j in {21..22}; do echo "cors$j" >> list.txt; done

./ldak.out --join-cors cors --corslist list.txt

Note that in these scripts, we loop from 21 to 22, because our example dataset contains only these two chromosomes; usually you would loop from 1 to 22.
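When looping over many chromosomes, it can help to confirm that every per-chromosome job finished before joining. The following is a minimal sketch of one way to do this; it relies only on the fact that each run saves its correlations in <outfile>.cors.bin. The function write_cors_list is a hypothetical helper, not an LDAK command.

```shell
# Hypothetical helper (not part of LDAK): write the list of per-chromosome
# stems for --corslist, stopping if any .cors.bin file is missing.
write_cors_list() {
  # $1 = stem prefix, $2 = first chromosome, $3 = last chromosome, $4 = list file
  : > "$4"
  j=$2
  while [ "$j" -le "$3" ]; do
    if [ ! -f "$1$j.cors.bin" ]; then
      echo "Missing $1$j.cors.bin" >&2
      return 1
    fi
    echo "$1$j" >> "$4"
    j=$((j + 1))
  done
}

# Usage for a full set of autosomes:
# write_cors_list cors 1 22 list.txt && ./ldak.out --join-cors cors --corslist list.txt
```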

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

**2 - Estimate effect sizes.**

We run the command

./ldak.out --mega-prs megalasso --model lasso --bfile human --cors cors --ind-hers ldak.thin.ind.hers --multi-hers NO --summary quant.summaries --one-sums YES --window-cm 1 --allow-ambiguous YES

For this example, we have created Lasso models (to instead create Ridge models, replace --model lasso with --model ridge). The estimated effect sizes are saved in megalasso.effects.