Gene 760 – Problem Set 0
The purpose of this problem set is to familiarize students with Louise, UNIX, and Python. By the end of
this problem set we hope students will have learned the following skills:
• Logging onto Louise from their computer
• Use of simplequeue/screen and qsub to work safely on Louise
• The use of basic UNIX commands
• How to find and learn about new UNIX commands
• Writing a basic Python script
• How to find and learn about new Python commands
To teach these skills we will be using genotyping data generated from an Illumina microarray. We will
use these skills to identify the sex of the patients who provided these samples.
Turning in the homework
Students are to submit a gzipped directory named [yourNetID]_PS0.gz containing:
• [yourNetID]_PS0_answers.txt : Text file with the responses to the questions below
• [yourNetID]_PS0.py : Commented Python script used to answer the final problem
to ‘/home2/tgj2/gene760/DROPBOX/PS0’ by 9:00 AM on Thursday, Jan 29, 2015.
CodeCademy
It is expected that students will complete up to the ‘Exam Statistics’ section of the ‘Python’ course at
www.codecademy.com by the time they turn in this problem set. This course will get you up to speed on
most of the basic programming concepts you will need for this class. While it is not mandatory for those
who already know Python, we still recommend the course to get refreshed with the concepts we will
cover in section.
The course as a whole entails up to 13 hours of work for those who have not programmed before, so
start early!
Introduction to Unix
1) Log onto Louise using SSH
ANS: Write the command and a description of what the command does
2) Use ‘screen’ to start a session called ‘genotype’
ANS: Write the command and a description of what the command does
ANS: Describe why it is advantageous to use screen when running an interactive session
3) Use the queuing system to log onto an interactive node
ANS: Write the command and a description of what the command does
4) Copy the file located at ‘/home2/tgj2/gene760/REFERENCE/bashrc_gene760_2015.txt’ and use that
copy to replace your .bashrc file. Source your new .bashrc file after you have done this.
ANS: Write out the commands you used to replace the file
ANS: Explain what the .bashrc file is and what it does
5) Create a file in your home directory called .screenrc . Use a text editor (emacs, nano, or vim) to edit
this file, and add the line “defscrollback 5000”, then save and close the file.
ANS: Write out the commands you used to create, edit, save, and close the file
6) Make symbolic links in your home directory to the “ANNOTATION”, “DATA”, “DROPBOX”, and
“REFERENCE” directories from the /home2/tgj2/gene760 directory.
ANS: What is the command to make a symbolic link and what is a symbolic link?
7) In your home directory make a directory called ‘Problem_set_0’ and change directory into this
directory. Work out the absolute path of this directory.
ANS: Write the commands and a description of what each command does
8) Copy the following file to your directory and then submit a job using simplequeue to unzip it:
‘/home2/tgj2/gene760/DATA/PS0/Test_FinalReport_1Mv3_hg18.txt.gz’
ANS: Write the commands and a description of what each command does
ANS: Explain what simplequeue is and why it is useful
9) Count the number of lines in the file; look at the first and last 50 lines of the unzipped file
ANS: Write the commands and a description of what each command does
ANS: State the number of lines, the name of the SNP on the 50th line from the top, and the name of the
SNP on the 50th line from the bottom
10) Make a file called ‘chrXY_Test_FinalReport_1Mv3_hg18.txt’; extract all the SNPs from chrX and chrY
from ‘Test_FinalReport_1Mv3_hg18.txt’ and put them into the file you made (hint: ‘X’ and ‘Y’ only
appear in column 2)
ANS: Write the commands and a description of what each command does
ANS: State the number of lines in the new file and the number of chrX, chrY and chrXY SNPs
11) Download the file you made to your computer and open it in Excel. Count the number of SNPs that are
heterozygous on chrX (only).
ANS: State the number of heterozygous SNPs on chrX, the percentage of chrX SNPs that are heterozygous,
and whether you think this sample is male or female
12) Change the filename of ‘chrXY_Test_FinalReport_1Mv3_hg18.txt’ to ‘chrXY_Python_In_1.txt’; copy the
following 3 files to your own directory and unzip them:
‘/home2/tgj2/gene760/DATA/PS0/chrXY_Python_In_2.txt.gz’
‘/home2/tgj2/gene760/DATA/PS0/chrXY_Python_In_3.txt.gz’
‘/home2/tgj2/gene760/DATA/PS0/chrXY_Python_In_4.txt.gz’
List the files in your current directory.
ANS: Write the commands and a description of what each command does
ANS: State the size of ‘chrXY_Python_In_4.txt’ in bytes
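The column-based extraction in Problem 10 can also be sketched in Python, which may help when checking the line counts you obtained with UNIX commands. This is only an illustrative sketch: it assumes the report is tab-separated and that the chromosome label sits in column 2 (as the hint states), and the function name `filter_sex_chromosomes` and the example rows are hypothetical, not taken from the real data file.

```python
from collections import Counter


def filter_sex_chromosomes(lines, chrom_col=1, sep="\t", wanted=("X", "Y", "XY")):
    """Keep lines whose chromosome column is X, Y, or XY.

    chrom_col is zero-based, so 1 means "column 2" of the file.
    Returns the kept lines and a per-chromosome count.
    """
    kept = []
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        # Skip short lines (for example, header text) that lack the column.
        if len(fields) > chrom_col and fields[chrom_col] in wanted:
            kept.append(line)
            counts[fields[chrom_col]] += 1
    return kept, counts


# Made-up rows (SNP name, chromosome, position) purely for illustration.
example = [
    "rs_a\t1\t1000\n",
    "rs_b\tX\t2000\n",
    "rs_c\tX\t3000\n",
    "rs_d\tY\t4000\n",
    "rs_e\tXY\t5000\n",
    "rs_f\t7\t6000\n",
]
kept, counts = filter_sex_chromosomes(example)
print(len(kept), dict(counts))
```

To run it on a real file, the same function can be given an open file handle in place of the `example` list, since it only needs an iterable of lines.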
Introduction to Python
13) Write a Python script to determine the sex of the four patient samples. You should assume a 1%
margin of error in the genotyping. Do not include undetermined SNPs (represented by ‘-’) in your
calculation. The Python script should be able to determine the sex of all four patients in a single run.
ANS: State the sex of samples 1, 2, 3, and 4
ANS: How do you create a comment in Python? What is a comment, and why is it important?
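One possible way to structure the core calculation is sketched below. It rests on assumptions that are not spelled out in the problem: that each chrX SNP has already been parsed into a pair of allele calls, and that the 1% margin of error is applied as a cutoff on the heterozygous fraction (a male has one X chromosome, so heterozygous chrX calls at or below the error rate are treated as genotyping errors). The function names are hypothetical, and reading the four input files is left out because the column layout of the allele calls is not described here.

```python
def heterozygous_fraction(genotypes):
    """Return the fraction of called chrX SNPs that are heterozygous.

    genotypes is an iterable of (allele1, allele2) pairs.
    Pairs containing '-' are undetermined and are skipped.
    """
    het = 0
    called = 0
    for allele1, allele2 in genotypes:
        if allele1 == "-" or allele2 == "-":
            continue  # undetermined SNP: excluded from the calculation
        called += 1
        if allele1 != allele2:
            het += 1
    if called == 0:
        raise ValueError("no determined chrX genotypes to evaluate")
    return het / called


def call_sex(genotypes, error_rate=0.01):
    """Call 'male' if chrX heterozygosity is within the error rate."""
    fraction = heterozygous_fraction(genotypes)
    return "male" if fraction <= error_rate else "female"


# Made-up genotype lists purely for illustration.
mostly_homozygous = [("A", "A")] * 200 + [("A", "G")] + [("-", "-")] * 5
mixed = [("A", "A")] * 50 + [("A", "G")] * 50
print(call_sex(mostly_homozygous), call_sex(mixed))
```

A full script would loop over the four input files and call `call_sex` once per file, so that all four samples are handled in a single run.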
Brief Introduction to R
R is a useful software environment for statistical computing and graphics. You will be using it at various
points throughout the course. Here we will use it to do a simple significance test.
14) Use Fisher’s exact test (fisher.test in R) and your answers to determine if the number of heterozygous
and homozygous SNPs on the X chromosome is significantly different between males and females.
ANS: Write your commands and a description of what each command does
ANS: Copy the output from the Fisher’s exact test in R to your answer sheet & provide an
interpretation of the result.
*Hint: make a 2x2 contingency table in R using the ‘matrix’ function:

                       Men      Women
  Heterozygous SNPs
  Homozygous SNPs
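To see what the test computes on such a 2x2 table, the following Python sketch evaluates a two-sided Fisher's exact p-value directly from the hypergeometric distribution. It is meant only to illustrate the calculation and is not a replacement for running fisher.test in R as the problem requires. The function name `fisher_exact_two_sided` is hypothetical, the counts in the example are made up, and the two-sided rule used here (summing the probabilities of all tables no more likely than the observed one) is one common convention.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    With the layout in the hint: a = heterozygous/men, b = heterozygous/women,
    c = homozygous/men, d = homozygous/women.
    """
    row1 = a + b
    row2 = c + d
    col1 = a + c
    total = row1 + row2

    def table_probability(x):
        # Hypergeometric probability that the top-left cell equals x,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = table_probability(a)
    # Small relative tolerance so ties with the observed table are included.
    p_value = sum(
        p
        for p in (table_probability(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-7)
    )
    return min(1.0, p_value)


# Made-up small counts purely for illustration.
p_small = fisher_exact_two_sided(3, 1, 1, 3)
print(round(p_small, 4))
```

This direct summation is easy to follow but becomes slow for counts in the tens of thousands, which is one reason to rely on R's built-in function for the real data.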