Thursday, November 30, 2006

Sequencing technology and the Neanderthal Genome

You've probably read about the two recent papers in Science and Nature reporting the sequencing of portions of the Neanderthal genome. (Subscription required for full text of the Nature and Science papers. Check out Google for the many news stories on this.) This is exciting work, but I'm not really going to comment on the significance of the results - instead, I think it's worth understanding how new sequencing technology enabled researchers to sequence 1 million bases of 38,000-year-old DNA.

Each group used a different sequencing technology, and as a result, their coverage of the genome differed widely: the Nature group got about 1 million base pairs of sequence, while the Science group got about 65,000 base pairs. (Recall that the human genome contains about 3 billion base pairs, and the Neanderthal genome was undoubtedly similar.) The Neanderthal genome is a challenge because, obviously, any DNA remaining in the 30,000-40,000 year old bones we have is highly fragmented, and the amount of contaminating DNA from microbes (not to mention scientists) is significant. To eventually cover the entire genome, we need a method that can generate lots and lots of sequence without being prohibitively expensive.
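To get a feel for these numbers, here's a quick back-of-the-envelope calculation of what fraction of a 3-billion-base-pair genome each dataset represents. It uses only the round figures quoted above, not exact values from the papers:

```python
# Rough figures quoted above - not exact values from the papers.
genome_size = 3_000_000_000   # ~3 billion base pairs
nature_bp = 1_000_000         # ~1 million bp (pyrosequencing, Nature paper)
science_bp = 65_000           # ~65,000 bp (clone-based, Science paper)

for label, bp in [("Nature (454)", nature_bp), ("Science (cloning)", science_bp)]:
    print(f"{label}: {bp / genome_size:.5%} of the genome")
# Nature (454): 0.03333% of the genome
# Science (cloning): 0.00217% of the genome
```

Even the larger dataset is a tiny sliver of the genome, which is why throughput and cost matter so much here.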

The Science group used a more traditional method, which works well for most large-scale genome sequencing efforts but is not really well suited to recovering huge chunks of the Neanderthal genome. As a result, this group obtained only 65,000 base pairs of DNA sequence (which is still a significant accomplishment). The big problem with this method is that the fragments of DNA isolated from Neanderthal bones have to be cloned before they can be sequenced. (In layman's terms, the DNA fragments have to be placed inside circular pieces of DNA called plasmids, which can then be grown in large quantities inside bacteria.) Many fragments of Neanderthal DNA fail to be cloned at this point, meaning that you lose much of the sample that was painstakingly isolated from the ancient bones.


The Nature group used something called pyrosequencing, which is done on machines called 454 sequencers. Crucially, this technique does not require a cloning step, which means much more of the isolated sample gets sequenced. Pyrosequencing also produces lots and lots of sequence data very quickly. (One major downside is that each read is much shorter than what you get with traditional Sanger sequencing, by a factor of ten at least. But in this case, the Neanderthal DNA fragments are so short that this doesn't matter.)
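To see why the shorter reads don't hurt here, consider some rough numbers (these are assumed, typical figures for the era, not values from the papers): a Sanger read runs several hundred bases, an early 454 read around a hundred, and the DNA fragments that survive in ancient bones are often shorter than either. A minimal sketch:

```python
# Assumed, typical figures - not taken from the papers.
sanger_read_len = 800    # bases per Sanger read (rough typical value)
pyro_read_len = 100      # bases per early 454 read (rough typical value)
fragment_len = 75        # a typical surviving ancient-DNA fragment (assumed)

# A read can't extend past the end of the fragment it comes from.
usable_sanger = min(sanger_read_len, fragment_len)
usable_pyro = min(pyro_read_len, fragment_len)
print(usable_sanger, usable_pyro)   # both 75: Sanger's length advantage is wasted here
```

When the fragments themselves are the limiting factor, read length stops mattering and throughput per dollar is what counts.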

You can read a really nice explanation of how pyrosequencing works at 454's web site. (For more technical coverage, look here.) Without this technology, a Neanderthal genome project would not be feasible. With it, we can now consider all sorts of sequencing projects that would not have been financially or technically feasible before - not just Neanderthal sequencing, but also large-scale studies of gene variation in natural populations, including humans. Such large-scale sequencing could help us close in on the genes involved in complex diseases.

I said I was going to talk about sequencing, but I can't resist making a plug for completing the Neanderthal genome. It's interesting to learn about the changes in the genome that took place during evolution, but it's also extremely useful to have a more closely related genome as we try to find and understand the functional portions of the human genome. Having genomes from multiple closely related species has helped enormously in flies, worms, and yeast (for examples, check out this, this, and this). As in almost any genome-level study of human biology, you can't get far without an understanding of evolution.

Tuesday, November 21, 2006

Where did all these scientists come from?

It's no secret that it's now more difficult than ever to get funded by the NIH, in spite of the fact that Congress doubled the NIH budget between 1998 and 2003. Researchers have been frustrated and wondering, over coffee and in print, where all the money has gone.

NIH director Elias Zerhouni explains what's going on in the November 17th issue of Science. Grant applications have nearly doubled since 1998 - from 24,151 in 1998 to an expected 46,000 applications in 2006. And this isn't just because individual scientists are applying for more grants - in 1998, 19,000 scientists applied for grants, while in 2006, there were 34,000 scientists who applied.
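Just to sanity check the arithmetic with the figures quoted above: the number of applications per applicant rose only slightly, so most of the growth really is new applicants.

```python
# Figures quoted above, from Zerhouni's Science article.
apps_1998, applicants_1998 = 24_151, 19_000
apps_2006, applicants_2006 = 46_000, 34_000

print(round(apps_1998 / applicants_1998, 2))   # ~1.27 applications per applicant in 1998
print(round(apps_2006 / applicants_2006, 2))   # ~1.35 applications per applicant in 2006
```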

Where did all these people come from? (I should note that I'm one of these people - I just submitted my first NIH application this summer.) When Congress announced its intention to double the NIH budget, universities started expanding - adding new graduate programs, hiring new faculty, and building new core facilities. All this happened amazingly fast, and now we're feeling the crunch. It doesn't help that the NIH budget hasn't kept pace with inflation since 2003, but Zerhouni presents the numbers that lay to rest other explanations for the funding crunch that have been tossed around - such as an excessive investment in large clinical trials or big, Manhattan Project-style science at the expense of smaller, innovative projects initiated by individual researchers.

Is this a good thing? The downside is that with more people we'll get more fraud, more mediocre science, and more fragmentation of the scientific community. It's already barely possible to seriously keep up with the literature in one's own field, which means it will be harder to find people on review committees who understand each other. It's also much, much harder for a young investigator to get started. In the past, scientists in their 20s and early 30s have been among the most innovative and creative, but young scientists today have their motivation and creativity squashed by the high barriers to independence, barriers that are only overcome when some of your best years are already behind you. Sure, older scientists are still damn good researchers, but if the start of your scientific career is creatively stunted, it can hobble your thinking later on.

In spite of these drawbacks, the fact is that there is still a hell of a lot of good scientific work to do, even if not all of it is the most pathbreaking or innovative science. There are a lot of useful details to be worked out, enough to keep people busy for a long time. Money invested in new scientists will be money well spent - far better spent than much of the billions of dollars we lost in the attempted reconstruction of Iraq, money which did more to enrich the already bloated pockets of Dick Cheney's friends than it did to benefit the Iraqis.

The investment in research infrastructure made by US research universities and biotech companies in recent years has helped keep the US at the forefront of an increasingly competitive world-wide scientific community. If we want to stay there, we need to pay for it.

As Zerhouni put it:

"Since 1945, United States success in scientific research and development has been the result of the implicit partnership that exists among academia, the federal government, and industry. In this model, research institutions take the risk of building and developing our national scientific capacity; the federal government, through a competitive peer-review process, funds the best science; and industry plays the critical role of bringing new, safe, and effective products to the public. This strategy is the keystone to sustaining American competitiveness, and must be preserved."

Wednesday, November 15, 2006

How should we teach our kids math?

The NY Times is reporting on the lagging math skills of US kids and efforts to change, yet again, how we teach math. I'm sympathetic to the desire to teach math in a way that doesn't turn people off. Too many people (including some people who grow up to be biologists) go through their education feeling very, very insecure about their ability to do and understand math.

What's tough about this problem is that you just can't teach the concepts and let kids figure out for themselves how to solve the problems. Understanding is good, but in math, understanding is not a substitute for the ability that comes with lots and lots of practice. This is different from many other fields of study, where if you understand the basic ideas and arguments you can work out a lot for yourself. To actually do math well, you need regular, sometimes mind-numbing practice - you can't just reinvent the wheel (i.e., derive your results from first principles) every time you need to solve a problem.

Gaining proficiency in math is similar to being able to do the NY Times crossword puzzle or play arpeggios on the piano - you simply need a lot of repetitive practice. You may know what an arpeggio is, but that's different from being able to play them over the entire keyboard, in all major and minor keys at a fast tempo.

Just like with the piano, kids will like math more when they are actually reasonably good at it. And there is no reason that most kids can't be fairly good at the kind of basic math we'd expect every educated person to know. Our teaching needs to reflect that - it's good to encourage understanding, but proficiency will never come without plenty of practice.

Sunday, November 12, 2006

The problem with computational biology papers

OK, my title is too general - it should be, "The problem with some computational biology papers that deal with certain research questions." There is a type of trendy science that frequently crops up in many journals (including good ones like Nature and Science). It basically goes like this (for a prime example, look here):

1. A computational biology lab sees one or more genomic-scale datasets that they can do some calculations on (usually microarray data).

2. The computational biologists come up with some algorithm that's supposedly better than what's out there, and they crunch the numbers on the genomic data. This results in some predictions of novel regulatory interactions - for example, they predict that certain transcription factors regulate certain genes involved in cell division. (A toy sketch of what this kind of analysis might look like follows this list.) At this point we have no idea whether their predictions are right, or even persuasive enough to be worth testing. But it's a start.

3. The computational biologists "validate" their results by using (notoriously incomplete) database annotations about the genes in their predictions, or by a shallow, cursory scan of the experimental literature (which the authors are usually not that familiar with). They then state something like "75% of our predicted transcription factor-gene interactions have some basis in the literature." Up to this point things are fine (they have made predictions and given us some reason to believe the predictions have a chance of being right), but then they usually go on to say something like this: "Therefore, we have demonstrated that our algorithm has the ability to find new regulatory interactions..." They have demonstrated no such thing. They have made predictions, but haven't bothered to test them; instead, they do a crappy literature survey (usually with significant omissions).
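Here is the toy sketch promised above: a deliberately minimal version of the generic step-2 analysis, predicting targets for a transcription factor by correlating expression profiles across conditions. Everything in it is hypothetical (made-up data, made-up gene names, an arbitrary correlation cutoff); it stands in for the general pattern, not for any particular paper's algorithm.

```python
import numpy as np

# Toy expression matrix: rows are genes, columns are microarray conditions.
# All names and data here are made up, purely to illustrate the generic pattern.
rng = np.random.default_rng(0)
genes = ["geneA", "geneB", "geneC", "geneD", "geneE"]
expression = rng.normal(size=(len(genes), 12))   # 5 genes x 12 conditions
tf_profile = rng.normal(size=12)                 # expression profile of some transcription factor

# "Crunch the numbers": call a gene a predicted target of the TF if its
# expression profile correlates with the TF's above an arbitrary cutoff.
cutoff = 0.5
predicted_targets = [
    gene for gene, profile in zip(genes, expression)
    if abs(np.corrcoef(tf_profile, profile)[0, 1]) > cutoff
]
print("Predicted targets:", predicted_targets)
```

Whatever the algorithm, the output is a list of untested predictions - which is exactly the point at which the trouble in step 3 begins.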

The result is that you get different groups coming up with all sorts of new analyses of the same genomic data (in my field, cell cycle gene expression and genome-wide transcription factor binding data are big ones), but never really making any serious progress towards improving our understanding of the biological process in question. The worst part is that, over time, the researchers doing this kind of work start talking as if we are making progress in our understanding, even though we haven't really tested that understanding. You start getting an echo chamber resonating with these guys, who cite each other for validation more than they cite the people who actually study the relevant genes in the lab.

This means that the experimentalists ignore the echo chamber, and computational biology becomes irrelevant to experimental biology - which is a sad thing. There are so many 'validated' predictions out there that the experimentalists don't really know where to start or which predictions are any good. And the computational researchers don't care enough to really work with someone who will actually go test things in the lab, in spite of the fact that if they did, they would get much more notice from the experimentalists.

The problem is bad enough that one journal, Nucleic Acids Research, changed its policy on computational papers:

"Computational biology
Manuscripts will be considered only if they describe new algorithms that are a substantial improvement over current applications and have direct biological relevance. The performance of such algorithms must be compared with current methods and, unless special circumstances prevail, predictions must be experimentally verified. The sensitivity and selectivity of predictions must be indicated. Small improvements or modifications of existing algorithms will not be considered. Manuscripts must be written so as to be understandable to biologists. The extensive use of equations should be avoided in the main text and any heavy mathematics should be presented as supplementary material. All source code must be freely available upon request."

This is a move in the right direction. But until more journals adopt this stance, beware of researchers who claim to have calculated the gene regulatory network for this or that process, or have identified 'modules' of interacting proteins that perform a function in the cell. If these claims, usually based on noisy, less than ideal genomic data, haven't been tested with serious experiments, they remain unproven hypotheses.
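For contrast, here is what reporting sensitivity and selectivity against a real experimental gold standard would look like, in the same toy terms as the sketch above. The interaction sets here are made up; the point is only the form of the evaluation - predictions scored against experimentally verified interactions, not against a loose literature survey.

```python
# Hypothetical predicted interactions and an assumed experimentally verified set.
predicted = {("TF1", "geneA"), ("TF1", "geneC"), ("TF1", "geneE")}
verified  = {("TF1", "geneA"), ("TF1", "geneB"), ("TF1", "geneC")}

true_positives = predicted & verified
sensitivity = len(true_positives) / len(verified)    # fraction of real interactions recovered
selectivity = len(true_positives) / len(predicted)   # fraction of predictions that are real

print(f"Sensitivity: {sensitivity:.2f}, Selectivity: {selectivity:.2f}")   # 0.67, 0.67
```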

Wednesday, November 01, 2006

A Republican War on Science? Nature's editors cop out

The Oct 19th issue of Nature contains a feature section on science and the upcoming US Congressional elections. In one of the editorials (subscription required), Nature's editors criticize the phrase "Republican war on science":

"Slogans such as the 'Republican war on science', meant to sum up a host of perceived abuses, do not do justice to the complex relationship between science and each of the two major political parties."

In an effort not to be perceived as partisan, some people, Nature's editors included, can't bring themselves to call things as they really are. There are very good reasons to single out the Republican party for its serious corrosion, in recent years, of the US government's relationship with science. The phrase "Republican War on Science" (the title of Chris Mooney's recent book) is a correct, legitimate characterization for the following reasons:

1. It's true that no political party or presidential administration is monolithic. There are many Republicans who are not part of an assault on the integrity of science, and not every single decision made by the Bush administration has been bad for science. However, the Republican leadership in Congress and the Executive Branch, as well as the active members of the Republican base, have seriously abused science and scientists to push their ideological agenda. Whether it's pushing intelligent design (from school boards to the 'Santorum Amendment' of the No Child Left Behind Act), having an ex-physician novelist testify to Congress on climate change, diagnosing Terri Schiavo by video from Congress, or the more low-profile but pervasive agency decisions that weaken protection of the environment and endangered species, promote ineffective and sometimes inaccurate 'abstinence-only' sex ed programs, and restrict drugs because of anti-abortion ideology rather than safety and efficacy concerns, the Republican leadership and base have attacked mainstream scientific consensus whenever it stands in the way of their ideological position. In recent years, Republicans have been much, much more guilty of this than Democrats.

2. Yes, the Republican-led Congress voted to double the NIH budget in the late '90s, and recently voted to double the NSF budget. But this is an easy vote - it doesn't offend anyone's ideology, it's a fairly small fraction of overall government spending on R&D, and there is broad bipartisan support for these increases. These votes don't negate the Congressional meddling in research whenever that research is politically controversial for conservatives.

3. This 'war on science' fits the description given by Rep. Rush Holt of the current political climate (in Nature, subscription required):

"In official Washington, scientific subjects have become really politicized. There should be debate about the policy that is derived from science. But, historically, if science puts limits on the choices that are possible, the politicians would accept that. Now, by treating science as just another topic to be dealt with ideologically, or to be part of political trades, they will even ignore the laws of science."

Nature's editors should have shown some spine on this issue. At the very least, they shouldn't have taken a blatant swipe at Chris Mooney, who has been a serious champion of scientific integrity in both journalism and government.