BioMart is pretty cool: http://www.biomart.org
I created an XML query using their form, narrowing down the data set and species and then uploading my gene shopping list as Entrez gene IDs. I specified that I wanted 2000 bp upstream and 2000 bp downstream (2 separate queries) and saved the XML files. I then removed the line breaks in TextPad and ran the queries on a Linux machine using wget:
wget -O results.txt 'http://www.biomart.org/biomart/martservice?query=MY_XML'
…where MY_XML is the newline-free XML query. The result (for one upstream and one downstream query) was 2 text files with 2000 nucleotides per line for each of 200 genes. I’ve sent this off to my biologist colleague to see if it’s indeed what we’re looking for, but I think it is.
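The TextPad step and the wget call can also be scripted. Below is a minimal Python sketch, assuming the query was saved from the form as an XML file. The function names are made up for illustration, and the sketch is untested against the live service. It joins the query onto one line and percent-encodes it, so that quotes, spaces and angle brackets survive inside a URL.

```python
import urllib.request
from urllib.parse import quote

# Endpoint taken from the wget command above.
MARTSERVICE = "http://www.biomart.org/biomart/martservice"

def flatten_xml(xml_text):
    """Collapse a multi-line XML query onto a single line.

    Lines are joined with one space, so an attribute that wrapped onto
    its own line stays separated from the tag name before it.
    """
    return " ".join(line.strip() for line in xml_text.splitlines() if line.strip())

def build_query_url(xml_text, base=MARTSERVICE):
    """Return the martservice URL with the query percent-encoded."""
    return base + "?query=" + quote(flatten_xml(xml_text), safe="")

def fetch(xml_path, out_path):
    """Hypothetical helper: read a saved XML query and download the result."""
    with open(xml_path, encoding="utf-8") as handle:
        url = build_query_url(handle.read())
    urllib.request.urlretrieve(url, out_path)
```

Calling fetch once for the upstream query and once for the downstream query would produce the same two result files as the two wget runs.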
My other colleague is looking at our target gene and getting a consensus sequence or a position weight matrix for the binding target sequence. We will then search for this sequence in the up/downstream sequences of our shopping list of genes and extract the best targets.
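To make the search step concrete, here is a rough Python sketch of scoring a sequence against a position weight matrix. Everything in it is an assumption for illustration: the function names, the pseudocount of 1, the uniform 25% background, and the idea that the matrix is built from a handful of aligned known binding sites. It only scans the forward strand, so a real search would also need the reverse complement.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from aligned sites.

    `sites` is a list of equal-length uppercase DNA strings. Each column
    becomes a dict mapping base -> log2(frequency / background).
    """
    width = len(sites[0])
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for i in range(width):
        column = [site[i] for site in sites]
        pwm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in BASES
        })
    return pwm

def best_hit(sequence, pwm):
    """Return (score, offset) of the best forward-strand window, or None."""
    sequence = sequence.upper()
    width = len(pwm)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(ch not in BASES for ch in window):
            continue  # skip windows containing N or other ambiguity codes
        score = sum(pwm[i][ch] for i, ch in enumerate(window))
        if best is None or score > best[0]:
            best = (score, start)
    return best
```

Running best_hit over every line of the upstream and downstream files and sorting by score would give a ranked list of candidate targets. Where to put the score cutoff is a separate question that this sketch does not answer.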
This is a lot harder than I thought it would be. I guess there isn’t a ready-made tool for every kind of data acquisition one might wish for. An interesting fact: I just downloaded the human genome. It’s just under 3 gigs of text. Talk about needles in a haystack.
So it looks like I get to play the biologist today. I am going to try to figure out an easy way to get upstream and downstream sequences for a shopping list of genes. Then my colleague will see how likely our favorite transcription factor is to bind to each gene, either upstream or downstream. It seems pretty straightforward, but I have two worries. On my part, I’m not sure how easy it will be to get upstream or downstream regions for a whole collection of genes in bulk, automatically, rather than one at a time, and I’m not sure how far up or down to go in the intergenic region. 1 kb? 2?
But after that, I’m not sure how reliable/usable our results will be, since our favorite transcription factor potentially interacts with lots of friends. Just because we determine it highly likely to bind upstream of some gene doesn’t mean it is not working in tandem with other transcription factors on some other gene, and we could miss that.
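On the first worry, the coordinate bookkeeping is at least simple once gene positions are known. Here is a small Python sketch, assuming 1-based inclusive coordinates on the forward strand and a strand given as +1 or -1. The function name and the 2000 bp default are placeholders. The catch is that "upstream" means lower coordinates for a forward-strand gene and higher coordinates for a reverse-strand gene.

```python
def flanking_regions(start, end, strand, flank=2000):
    """Return (upstream, downstream) as 1-based inclusive (start, end) pairs.

    `start` <= `end` are the gene's coordinates on the forward strand and
    `strand` is +1 or -1. The left flank is clamped at position 1. The
    right flank is not clamped, because the chromosome length is unknown
    here.
    """
    if strand not in (1, -1):
        raise ValueError("strand must be +1 or -1")
    left = (max(1, start - flank), start - 1)
    right = (end + 1, end + flank)
    # For a reverse-strand gene the higher coordinates are upstream.
    return (left, right) if strand == 1 else (right, left)
```

For a reverse-strand gene the extracted sequence would also need to be reverse-complemented before scanning. This sketch does not decide whether the flank should stop early when it runs into a neighboring gene.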
So far my biggest inhibitor of progress, at least as far as I can gauge from my mentor, is my dislike for trying things I doubt. For example, I am more likely to spend my time convincing myself that an undertaking (such as finding these 5′ and 3′ regions and binding sites) is worth the effort (and thus not making progress) than I am to try it and fail. I think some of that stems from my anxiety about finishing this degree in under a decade; I don’t want to waste time, per se. However, I guess you don’t learn unless you screw things up all the time, and I think I spend more time avoiding mistakes than achieving success.