What does the Genotype PLINK file format look like?
The Genotype PLINK File Format:
easyGWAS requires genotype data in PLINK [1] format for upload. Also, all downloadable public datasets in
easyGWAS are in PLINK format.
Example files can be downloaded here
Two files are required to store genotypic data, the PED and MAP file. To upload new data to
easyGWAS the PED file must have the following format:
The PED file has six fixed columns at the beginning, followed by the SNP information. The columns should be separated by a space or a tab. The columns hold the following information:
- Family ID (if unknown use the same id as for the sample id in column two)
- Sample ID
- Paternal ID (if unknown use 0)
- Maternal ID (if unknown use 0)
- Sex (if unknown use 0)
- Not used, set to 0
- Remaining columns: SNP alleles (two columns per SNP)
Important: All SNPs must have two alleles, e.g. a heterozygous SNP with the alleles A and T must be specified as A T, whereas a homozygous SNP with allele A must be specified as A A.
Here is a brief example of a genotype PED file containing 5 samples with 10 homozygous SNPs:
4304 4304 0 0 0 0 C C C C G G G G G G C C G G C C T T T T
6925 6925 0 0 0 0 C C C C T T G G A A C C G G C C T T T T
7319 7319 0 0 0 0 C C C C G G G G G G C C G G C C T T T T
6963 6963 0 0 0 0 A A C C T T G G A A C C G G C C T T T T
6968 6968 0 0 0 0 C C C C G G G G G G G G G G C C T T T T
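The PED layout described above can be checked with a few lines of Python. This is only a minimal sketch and not part of easyGWAS or PLINK: the function name `parse_ped_line` and the dictionary keys are hypothetical, chosen for illustration. It splits a row on spaces or tabs, separates the six fixed columns, and pairs the remaining columns into two-allele genotypes.

```python
def parse_ped_line(line):
    """Split one PED row into its six fixed fields and a list of genotypes.

    Hypothetical helper for illustration; not an easyGWAS or PLINK function.
    """
    fields = line.split()  # splits on any run of spaces or tabs
    if len(fields) < 6:
        raise ValueError("a PED row needs at least six leading columns")

    family_id, sample_id, paternal_id, maternal_id, sex, unused = fields[:6]
    alleles = fields[6:]

    # Every SNP must be given as two alleles, so the count must be even.
    if len(alleles) % 2 != 0:
        raise ValueError(f"sample {sample_id}: every SNP needs two alleles")

    # Pair consecutive alleles: columns 7-8 form SNP 1, columns 9-10 SNP 2, ...
    genotypes = list(zip(alleles[0::2], alleles[1::2]))

    return {
        "family_id": family_id,
        "sample_id": sample_id,
        "paternal_id": paternal_id,
        "maternal_id": maternal_id,
        "sex": sex,
        "genotypes": genotypes,
    }


# First row of the example PED file above.
row = "4304 4304 0 0 0 0 C C C C G G G G G G C C G G C C T T T T"
record = parse_ped_line(row)
print(record["sample_id"], len(record["genotypes"]))  # prints: 4304 10
```

A row with an odd number of allele columns raises a `ValueError`, which is one way to catch SNPs that were written with a single allele before uploading.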
The second important file is the MAP file. The MAP file contains information about every single SNP. Each row corresponds to one SNP in the PED file.
The order of the SNPs must be the same as in the PED file, i.e. the order of the rows in the MAP file must match the order of the SNP columns in the PED file (starting at column 7).
The MAP file must have exactly four columns with the following information (the columns should be separated by a space or a tab):
- Chromosome ID (e.g. Chr1 for Chromosome 1)
- Unique SNP identifier
- Genetic distance (if unknown use 0)
- SNP Position
Here is a brief example of a genotype MAP file corresponding to the PED file from above:
Chr1 Chr1_314 0 314
Chr1 Chr1_317 0 317
Chr1 Chr1_323 0 323
Chr1 Chr1_324 0 324
Chr1 Chr1_332 0 332
Chr1 Chr1_334 0 334
Chr1 Chr1_342 0 342
Chr1 Chr1_346 0 346
Chr1 Chr1_348 0 348
Chr1 Chr1_349 0 349
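The MAP rules above (exactly four columns, unique SNP identifiers, one row per SNP in the PED file) can be sketched in Python as follows. The function names `parse_map_lines` and `snp_count_in_ped_line` are hypothetical and not part of easyGWAS or PLINK, and the sketch assumes the distance column is numeric and the position column is an integer.

```python
def parse_map_lines(lines):
    """Parse MAP rows into (chromosome, snp_id, distance, position) tuples.

    Hypothetical helper for illustration; not an easyGWAS or PLINK function.
    """
    snps = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()  # splits on any run of spaces or tabs
        if len(fields) != 4:
            raise ValueError(
                f"MAP line {number}: expected 4 columns, got {len(fields)}"
            )
        chromosome, snp_id, distance, position = fields
        # Assumption: distance is numeric and position is an integer.
        snps.append((chromosome, snp_id, float(distance), int(position)))

    # SNP identifiers must be unique within the file.
    ids = [snp[1] for snp in snps]
    if len(set(ids)) != len(ids):
        raise ValueError("SNP identifiers in the MAP file must be unique")
    return snps


def snp_count_in_ped_line(line):
    """Number of SNPs in one PED row: two allele columns per SNP after column 6."""
    return (len(line.split()) - 6) // 2


# Two MAP rows and a matching PED row with two SNPs.
map_lines = ["Chr1 Chr1_314 0 314", "Chr1 Chr1_317 0 317"]
ped_line = "4304 4304 0 0 0 0 C C C C"

snps = parse_map_lines(map_lines)
consistent = len(snps) == snp_count_in_ped_line(ped_line)
print(consistent)  # prints: True
```

Comparing the number of MAP rows with the number of SNPs in each PED row is a quick sanity check that the two files belong together; it does not verify that the SNPs are listed in the same order.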
Detailed information about the PLINK file formats can be found at PLINK's web page.
References
[1] Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR,
Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ & Sham PC (2007)
PLINK: a toolset for whole-genome association and population-based
linkage analysis. American Journal of Human Genetics, 81(3), 559-575.