Data Mining with Entrez Direct

Wednesday, August 20, 2014, 10:00 AM - 11:30 AM

Entrez Direct (EDirect) is an advanced method for accessing the NCBI's set of interconnected databases (publication, sequence, structure, gene, variation, expression, etc.) from a UNIX terminal window. Functions take search terms from command-line arguments, and individual operations are combined to build multi-step queries. Record retrieval and formatting normally complete the process.
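
As a rough illustration of how individual operations combine into a multi-step query, the sketch below pipes a PubMed search into record retrieval. It assumes EDirect is installed and on the PATH; the helper name `pubmed_abstracts` and the sample search term are made up for this example and are not part of EDirect.

```shell
#!/bin/sh
# Minimal sketch, assuming EDirect is installed and on the PATH.
# pubmed_abstracts is a hypothetical helper name used only for illustration.
pubmed_abstracts() {
  # Step 1: esearch takes the search term from a command-line argument.
  # Step 2: efetch retrieves the matching records and formats them as abstracts.
  esearch -db pubmed -query "$1" |
    efetch -format abstract
}

# Example call (needs network access to NCBI; the query text is illustrative):
#   pubmed_abstracts "opsin gene conversion"
```

Wrapping the pipeline in a function keeps the search term as the only argument, so the same two steps can be reused for other PubMed questions.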

EDirect also provides an argument-driven function that simplifies the extraction of data from results that are returned in structured XML format. This can eliminate the need for writing custom code to answer novel questions that are not addressed by existing analysis software. Queries can move seamlessly between EDirect commands and UNIX utilities or scripts to perform actions that cannot be accomplished entirely within Entrez.
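
The data-extraction step might look something like the following sketch, which retrieves PubMed records as XML, pulls out two fields with `xtract`, and hands the result to the standard UNIX `sort` utility. It assumes EDirect is installed; the helper name `pubmed_titles` is hypothetical, and the chosen fields (PMID and article title) are just one possible selection.

```shell
#!/bin/sh
# Minimal sketch, assuming EDirect is installed and on the PATH.
# pubmed_titles is a hypothetical helper name used only for illustration.
pubmed_titles() {
  tab=$(printf '\t')
  # Fetch results as structured XML, then let xtract print one tab-delimited
  # row per PubmedArticle record: the PMID followed by the article title.
  esearch -db pubmed -query "$1" |
    efetch -format xml |
    xtract -pattern PubmedArticle -element MedlineCitation/PMID ArticleTitle |
    # Continue with an ordinary UNIX utility: sort the rows by title (field 2).
    sort -t "$tab" -k 2
}

# Example call (needs network access to NCBI; the query text is illustrative):
#   pubmed_titles "opsin gene conversion"
```

Because `xtract` emits plain tab-delimited text, any further filtering or counting can be done with familiar tools such as `cut`, `grep`, or `uniq` rather than with custom parsing code.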

A laboratory research biologist or medical librarian can use EDirect to build a sophisticated Entrez query without having to write a program, and without needing any formal training in bioinformatics. Typical ad hoc questions that can be answered by straightforward EDirect queries include "How many exons are in each dystrophin transcript variant?", "What genes are in a given range on the human Y chromosome?", and "What are the missense products of green-sensitive opsin?"

After the talk, there will be a half-hour computer lab session in which attendees can write EDirect queries in their own areas of interest.

