For these questions, we will consider the following tables about microarray gene expression data. Your task is to express each of the given queries in SQL. Example data are provided in these tables for your convenience but note that they are only example data. Your queries must work for all potential data residing in the given tables, not just those illustrated here.
Genes gid | name | organism | annotation -----+---------+----------+---------------------------- g1 | YLR180C | yeast | hypothetical protein g2 | YLR181D | yeast | response to desiccation g3 | sp15 | yeast | drought stress responsive g4 | pdp77 | pine | putative stress responsive g5 | hsp70 | pine | heat shock protein g6 | hsp90 | pine | heat shock protein
Expression gid | experimentid | level | significance -----+--------------+-------+-------------- g1 | exp12 | 3.5 | 1 g2 | exp23 | -3 | 1 g3 | exp12 | 1 | 2 g3 | exp13 | -1.5 | 2 g3 | exp23 | 1.7 | 4 g4 | exp12 | 1.5 | 2 g4 | exp13 | 1.5 | 2 g4 | exp23 | 1.5 | 2 g4 | exp6 | 1.5 | 2 g5 | exp6 | 2 | 1 g5 | exp13 | 2.5 | 2 g6 | exp6 | -3.86 | 3
Experiments experimentid | name | whoperformed | date --------------+--------------------+---------------+------------ exp12 | Systematic Torture | Prof. Pain | 2004-06-02 exp23 | Heaped Abuse | Tommy Student | 2004-06-03 exp13 | Salt Stress | Gasch | 1998-07-04 exp6 | Sorbitol Exposure | Gasch | 1999-07-05
Membership gid | category -----+-------------------- g1 | glutathione g2 | antioxidant g3 | glycine binding g1 | amino acid binding g4 | amino acid binding g5 | amino acid binding g6 | binding
GOTree category | parent_category --------------------+-------------------- antioxidant | molecular function binding | molecular function glutathione | antioxidant glycine binding | amino acid binding amino acid binding | binding
Question 1
Find the names of experiments performed by Prof. Pain after Jan 1, 2004.
Answer
SELECT name
FROM Experiments
WHERE whoperformed = 'Prof. Pain'
AND date > '2004-01-01';
Question 2
Find the names of genes that were either positively expressed twofold or more with a significance of at least 1.0, in some experiment, or negatively expressed twofold or less with a significance of at least 1.0, in some experiment. List them alongside their organisms in a two-column format.
Answer
SELECT Genes.gid, name, level, significance
FROM Expression, Genes
WHERE Expression.gid = Genes.gid
AND significance >= 1.0
AND (level >= 2.0 OR level <= 2.0);
Question 3
What is the grandparent category of 'glycine binding'?
Answer
SELECT Parents.parent_category
FROM GOTree as Children, GOTree as Parents
WHERE Children.category = 'glycine binding'
AND Children.parent_category = Parents.category;
Question 4
Find the names of experiments that were performed before some Gasch experiment.
Straightforward Answer
SELECT E1.name
FROM Experiments AS E1, Experiments AS E2
WHERE E1.date < E2.date
AND E2.whoperformed = 'Gasch';
Subquery Answer
SELECT name
FROM Experiments
WHERE Experiments.date < (
SELECT MAX(date)
FROM Experiments
WHERE whoperformed = 'Gasch'
);
Question 5
Find the names of pine genes that were positively expressed more than 0.5-fold (with a significance of 1.0 or more) in at least two experiments.
Straightforward Answer
First, we must find the experiments where genes are upreglated and significant.
CREATE VIEW Upregulated AS
SELECT gid, experimentid
FROM Expression
WHERE significance >= 1.0
AND level >= 0.5;
Next, we must determine the genes which were upregulated in at least two experiments. We do this by taking the product of the upregulated genes and selecting rows where the gene ID is the same but the experiment ID is different.
CREATE VIEW UpInTwoOrMore AS
SELECT DISTINCT U1.gid AS gid
FROM Upregulated AS U1, Upregulated AS U2
WHERE U1.gid = U2.gid
AND U1.experimentid <> U2.experimentid;
Finally, we determine which of these genes come from pine, and project their names.
SELECT name
FROM Genes, UpInTwoOrMore
WHERE Genes.gid = UpInTwoOrMore.gid
AND organism = 'pine';
We can alternatively do these steps all in one single query:
SELECT DISTINCT name
FROM Genes, Expression as E1, Expression as E2
WHERE Genes.gid = E1.gid
AND E1.gid = E2.gid
AND E1.level >= 0.5
AND E2.level >= 0.5
AND E1.significance >= 1.0
AND E2.significance >= 1.0
AND E1.experimentid <> E2.experimentid
AND organism = 'pine';
Subquery Answer
As another approach, we can make use of subqueries to find the answer. The key to this is to make a correlated subquery where the subquery depends on some property (in this case the gene ID) of the parent query. Note that we'll still need to make use of the Upregulated view created above to reduce code redundancy.
SELECT DISTINCT name
FROM Genes, Upregulated as U1
WHERE Genes.gid = U1.gid
AND organism = 'pine'
AND U1.gid = (
SELECT DISTINCT U1.gid
FROM Upregulated as U2
WHERE U1.gid = U2.gid
AND U1.experimentid <> U2.experimentid
);
GROUP BY Answer
We can make use of advanced features for certain database systems such as PostgreSQL and MySQL to make these queries in slightly more convenient ways via the GROUP BY and COUNT.
CREATE VIEW UpInTwoOrMore AS
SELECT gid
FROM Expression
WHERE level >= 0.5
AND significance >= 1.0
GROUP BY gid
HAVING COUNT(*) > 1;
SELECT name
FROM Genes, UpInTwoOrMore
WHERE Genes.gid = UpInTwoOrMore.gid
AND organism = 'pine';
Question 6
Find the names of pine genes that were up-regulated 0.5-fold or more (with a significance of 1.0 or more) in at least three experiments.
Straightforward Answer
Similar to the answer for question 5. The caveat here is that while the equality evaluations are transitive, while inequality evaluations are not, and so every case must be covered.
CREATE VIEW UpInThreeOrMore AS
SELECT DISTINCT U1.gid AS gid
FROM Upregulated AS U1, Upregulated AS U2, Upregulated as U3
WHERE U1.gid = U2.gid
AND U1.gid = U3.gid
AND U1.experimentid <> U2.experimentid
AND U1.experimentid <> U3.experimentid
AND U2.experimentid <> U3.experimentid;
SELECT name
FROM Genes, UpInThreeOrMore
WHERE Genes.gid = UpInThreeOrMore.gid
AND organism = 'pine';
Alternatively:
SELECT DISTINCT name
FROM Genes, Expression as E1, Expression as E2, Expression as E3
WHERE Genes.gid = E1.gid
AND E1.gid = E2.gid
AND E1.gid = E3.gid
AND E1.level >= 0.5
AND E2.level >= 0.5
AND E3.level >= 0.5
AND E1.significance >= 1.0
AND E2.significance >= 1.0
AND E3.significance >= 1.0
AND E1.experimentid <> E2.experimentid
AND E1.experimentid <> E3.experimentid
AND E2.experimentid <> E3.experimentid
AND organism = 'pine';
Subquery Answer
We need to build in another correlated subquery for our original correlated subquery to make this work.
SELECT DISTINCT name
FROM Genes, Upregulated AS U1
WHERE Genes.gid = U1.gid
AND organism = 'pine'
AND U1.gid = (
SELECT DISTINCT U1.gid
FROM Upregulated AS U2
WHERE U1.gid = U2.gid
AND U1.gid = (
SELECT DISTINCT U1.gid
FROM Upregulated AS U3
WHERE U1.gid = U2.gid
AND U1.gid = U3.gid
AND U1.experimentid <> U2.experimentid
AND U1.experimentid <> U3.experimentid
AND U2.experimentid <> U3.experimentid
)
);
GROUP BY Answer
Simply adjust the count evaluation.
CREATE VIEW UpInThreeOrMore AS
SELECT gid
FROM Expression
WHERE level >= 0.5
AND significance >= 1.0
GROUP BY gid
HAVING COUNT(*) > 2;
SELECT name
FROM Genes, UpInThreeOrMore
WHERE Genes.gid = UpInThreeOrMore.gid
AND organism = 'pine';
Question 7
Find the names of pine genes that were up-regulated 0.5-fold or more (with a significance of 1.0 or more) in at exactly two experiments.
Straightforward Answer
The key here is identifying that taking the set of genes upregulated in two or more experiments and subtracting the set of genes upregulated in three or mor experiments gives the set of genes upregulated in precisely two experiments. Thus, our answer is the answer to question 5 subtracted by the answer to question 6.
CREATE VIEW UpInTwo AS
SELECT *
FROM UpInTwoOrMore
EXCEPT
SELECT *
FROM UpInThreeOrMore;
SELECT name
FROM Genes, UpInTwo
WHERE Genes.gid = UpInTwo.gid
AND organism = 'pine';
Or alternatively:
SELECT DISTINCT name
FROM Genes, Expression as E1, Expression as E2
WHERE Genes.gid = E1.gid
AND E1.gid = E2.gid
AND E1.level >= 0.5
AND E2.level >= 0.5
AND E1.significance >= 1.0
AND E2.significance >= 1.0
AND E1.experimentid <> E2.experimentid
AND organism = 'pine'
EXCEPT
SELECT DISTINCT name
FROM Genes, Expression as E1, Expression as E2, Expression as E3
WHERE Genes.gid = E1.gid
AND E1.gid = E2.gid
AND E1.gid = E3.gid
AND E1.level >= 0.5
AND E2.level >= 0.5
AND E3.level >= 0.5
AND E1.significance >= 1.0
AND E2.significance >= 1.0
AND E3.significance >= 1.0
AND E1.experimentid <> E2.experimentid
AND E1.experimentid <> E3.experimentid
AND E2.experimentid <> E3.experimentid
AND organism = 'pine';
Subquery Answer
SELECT DISTINCT name
FROM Genes, Upregulated as U1
WHERE Genes.gid = U1.gid
AND organism = 'pine'
AND U1.gid = (
SELECT DISTINCT U1.gid
FROM Upregulated as U2
WHERE U1.gid = U2.gid
AND U1.experimentid <> U2.experimentid
)
EXCEPT
SELECT DISTINCT name
FROM Genes, Upregulated AS U1
WHERE Genes.gid = U1.gid
AND organism = 'pine'
AND U1.gid = (
SELECT DISTINCT U1.gid
FROM Upregulated AS U2
WHERE U1.gid = U2.gid
AND U1.gid = (
SELECT DISTINCT U1.gid
FROM Upregulated AS U3
WHERE U1.gid = U2.gid
AND U1.gid = U3.gid
AND U1.experimentid <> U2.experimentid
AND U1.experimentid <> U3.experimentid
AND U2.experimentid <> U3.experimentid
)
);
GROUP BY Answer
CREATE VIEW UpInTwo AS
SELECT *
FROM UpInTwoOrMore
EXCEPT
SELECT *
FROM UpInThreeOrMore;
SELECT name
FROM Genes, UpInTwo
WHERE Genes.gid = UpInTwo.gid
AND organism = 'pine';
Question 8
Find the genes that showed positive expression in every experiment recorded for it.
SELECT Expression.gid, level from Expression WHERE level>1.0;
Question 9
What is the name of the gene that was most positively expressed in experiment exp23? Assume a minimum significance of 1.0.
Question 10
This question builds on question 8; what is the name of the gene that was "second most positively expressed"? Assume again a minimum significance of 1.0.
Question 11
Which gene(s) were positively expressed in ALL the experiments listed in the Experiments table? No constraints on significance level.
Question 12
Prepare a table of genes, their annotation, and any experiment in which they were either the highest or lowest expressed (of any significance level). Include a fourth column to say if they were the highest or lowest.