org.bioquery.query
Class NCBIQuerySubmitter

java.lang.Object
  |
  +--org.bioquery.query.QuerySubmitter
        |
        +--org.bioquery.query.NCBIQuerySubmitter

public class NCBIQuerySubmitter
extends QuerySubmitter

The NCBIQuerySubmitter class handles the submission of the 8 types of Querys associated with databases at the NCBI (National Center for Biotechnology Information) through its website at www.ncbi.nlm.nih.gov.
Like all QuerySubmitter subclasses, this class should not be instantiated directly, but instead is returned when the getQuerySubmitter method is called on a Query object.

The submission routine works through 2 scripts at NCBI: ESearch and the web-based query.fcgi. While query.fcgi is largely undocumented, details about ESearch are available at:
http://www.ncbi.nlm.nih.gov/entrez/query/static/esearch_help.html

Should these scripts, their formats, or their URLs change, this class may have to be rewritten or overridden to restore functionality. The internal methods of this class are protected so that subclasses can access them. This allows a developer to correct for changes in NCBI's scripts by writing a subclass that overrides only the methods needed to correct the problem. The subclass can be placed in the bioquery directory and used without recompiling bioquery.jar, as long as all of the NCBI Querys listed in querys.xml are given the class name of the new QuerySubmitter.

Please see the parent class QuerySubmitter for details on how to submit Querys and receive the results.

Author:
James Brundege

Field Summary
protected static java.lang.String EFETCH_ADDRESS
          Deprecated in favor of the RETRIEVE_ADDRESS and query.fcgi script.
Web address of the EFetch utility at NCBI/NIH. Note: This is no longer used.
protected static java.lang.String EFETCH_MODE
          Deprecated in favor of the RETRIEVE_ADDRESS and query.fcgi script.
The mode is the file format of the results (html, file, text).
protected static java.lang.String ESEARCH_ADDRESS
          Web address of the ESearch utility at NCBI/NIH.
ESearch retrieves the counts (when getQueryData is called) and a list of unique identifiers to pass to the retrieve script when getQueryResults is called.
protected  java.lang.String idListURL
          URL to create the links in the id list format.
protected static int MAX_NUM_DISPLAYED
          The max number displayed is determined by how long the PMID-containing URL can be before it exceeds the URL size limit set by the NCBI Entrez server.
protected static java.lang.String RETRIEVE_ADDRESS
          Web address of the query.fcgi script used to retrieve QueryResults from NCBI
This QuerySubmitter passes unique ids retrieved via ESearch to this script, though it is possible to directly pass the submittable Query text in the form of term=exampleterm[FIELD]
 
Fields inherited from class org.bioquery.query.QuerySubmitter
endDate, myQuery, startDate
 
Constructor Summary
protected NCBIQuerySubmitter()
          Empty constructor is protected.
 
Method Summary
protected  java.lang.StringBuffer addBioQueryInfo(java.lang.StringBuffer URLString)
          Courtesy to NCBI that lets them know who is using their utilities and how to contact us.
protected  java.lang.String filterByDate(java.lang.String eSearchURL)
          Helper method adjusts the Query text to search for results added or modified between 2 specific dates: startDate and endDate.
protected  java.lang.String getCount(org.jdom.Document doc)
          Pulls the reference count out of the returned xml document when ESearch is asked to provide counts, such as when getQueryData is called.
protected  java.lang.String getDatabaseCode()
          Returns a database code that can be added to the URLs to all NCBI scripts to tell it which database to search.
protected  java.lang.String getFormatCode(Query theQuery)
          Deprecated in favor of the RETRIEVE_ADDRESS and query.fcgi script.
protected  java.util.List getIDs(org.jdom.Document doc)
          Takes the xml document returned by ESearch and pulls out all of the unique IDs and puts them into a List as Strings.
 java.io.BufferedReader getQueryData()
          Finds the number of references returned by each line of the Query.
 java.io.BufferedReader getQueryResults()
          Submits this QuerySubmitter's Query object to the appropriate database and returns a BufferedReader containing the text results.
protected  java.lang.String getResultsFooter()
          Returns a footer String that simply ends the body and html tags.
protected  java.lang.String getResultsHeader()
          Returns a header containing details about the Query in HTML format.
protected  java.lang.String getSubmittableText()
          Returns the text of the Query to be submitted.
protected  java.lang.String parseQueryText(java.lang.String queryText)
          Helper method takes the queryText and converts it to a String that can be directly submitted to NCBI.
 java.io.BufferedReader parseWebPage(java.io.BufferedReader webpage)
          Parses the HTML query results returned by the query.fcgi script.
protected  void setInitParams(java.util.HashMap parameters)
          Loads the idListURL from querys.xml, which is only used by the ID List format.
protected  java.io.BufferedReader submitEFetch(java.util.List PMIDs)
          Deprecated in favor of the RETRIEVE_ADDRESS and query.fcgi script.
This helper method does the second of a 2-step process for submitting Querys to NCBI.
protected  java.util.List submitESearch(java.lang.String queryText, boolean count)
          This helper method does the first of a 2-step process for submitting Querys to NCBI.
protected  java.util.List submitESearch(java.lang.String queryText, int startNum, int endNum)
          startNum less than 0 means we should just get the number of items, not the items themselves (getQueryData, not getQueryResults).
protected  java.io.BufferedReader submitRetrieveUrl(java.util.List PMIDs)
          Returns the full results of the Query.
protected  java.io.BufferedReader submitURL(java.lang.StringBuffer URLString)
          Submits a URL and returns a BufferedReader containing the response.
protected  java.io.BufferedReader writeLinks(java.util.List PMIDs)
          Returns the Query results in the ID List format.
 
Methods inherited from class org.bioquery.query.QuerySubmitter
convertOperatorCase, parseQueryLine, resetEndDate, resetStartDate, setEndDate, setQuery, setStartDate
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

ESEARCH_ADDRESS

protected static java.lang.String ESEARCH_ADDRESS
Web address of the ESearch utility at NCBI/NIH.
ESearch retrieves the counts (when getQueryData is called) and a list of unique identifiers to pass to the retrieve script when getQueryResults is called.

RETRIEVE_ADDRESS

protected static java.lang.String RETRIEVE_ADDRESS
Web address of the query.fcgi script used to retrieve QueryResults from NCBI
This QuerySubmitter passes unique ids retrieved via ESearch to this script, though it is possible to directly pass the submittable Query text in the form of term=exampleterm[FIELD]

MAX_NUM_DISPLAYED

protected static int MAX_NUM_DISPLAYED
The max number displayed is determined by how long the PMID-containing URL can be before it exceeds the URL size limit set by the NCBI Entrez server.

idListURL

protected java.lang.String idListURL
URL to create the links in the ID List format. This is a custom parameter pulled from the tag in file querys.xml and loaded in the setInitParams method.

EFETCH_ADDRESS

protected static java.lang.String EFETCH_ADDRESS
Deprecated in favor of the RETRIEVE_ADDRESS and query.fcgi script.
Web address of the EFetch utility at NCBI/NIH. Note: This is no longer used. It is an alternative to the RETRIEVE_ADDRESS.

EFETCH_MODE

protected static java.lang.String EFETCH_MODE
Deprecated in favor of the RETRIEVE_ADDRESS and query.fcgi script.
The mode is the file format of the results (html, file, text). This utility uses the 'text' mode for all operations.

Constructor Detail

NCBIQuerySubmitter

protected NCBIQuerySubmitter()
Empty constructor is protected. This class should not be instantiated directly, but instead should be returned by a call to getQuerySubmitter on a Query object.

Method Detail

setInitParams

protected void setInitParams(java.util.HashMap parameters)

Loads the idListURL from querys.xml, which is only used by the ID List format.

Loads URLs and parsing parameters from the tag of the querys.xml file. This method is called by the QueryFactory, which passes it a HashMap containing all of the name-value pairs listed in querys.xml.

The QueryFactory should be the only caller of this method. Other classes have no need to call it, though it must be custom written for each subclass of QuerySubmitter.

Overrides:
setInitParams in class QuerySubmitter

getQueryResults

public java.io.BufferedReader getQueryResults()
                                       throws InvalidQueryException,
                                              InvalidQueryLineException
Submits this QuerySubmitter's Query object to the appropriate database and returns a BufferedReader containing the text results. A null BufferedReader may be returned if there was a problem connecting to the database server.
Overrides:
getQueryResults in class QuerySubmitter
Returns:
a BufferedReader containing the complete list of references returned by the Query's submitLine, formatted according to the settings in the Query.

getQueryData

public java.io.BufferedReader getQueryData()
                                    throws InvalidQueryException,
                                           InvalidQueryLineException
Finds the number of references returned by each line of the Query. QueryLines marked as not needing an update are not checked. The returned BufferedReader contains the data in the following text format:
QueryLine_Number BQProtocol.DELIMITER Count_Number Linefeed_Character
Overrides:
getQueryData in class QuerySubmitter
Returns:
a BufferedReader containing the number of entries found for each QueryLine that needed updating.
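
The record format described above can be consumed one line at a time. The following is a minimal sketch of a reader for it, not part of this class. The value of BQProtocol.DELIMITER is not stated here, so the tab character used below is an assumption, and the class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

final class QueryDataSketch {

    // Assumed stand-in for BQProtocol.DELIMITER; the real value may differ.
    static final String DELIMITER = "\t";

    /** Reads "lineNumber DELIMITER count" records into a map, keeping input order. */
    static Map<Integer, Integer> readCounts(BufferedReader reader) {
        Map<Integer, Integer> counts = new LinkedHashMap<>();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // ignore blank lines
                }
                String[] parts = line.split(Pattern.quote(DELIMITER), 2);
                if (parts.length != 2) {
                    continue; // skip records that do not contain the delimiter
                }
                counts.put(Integer.valueOf(parts[0].trim()),
                           Integer.valueOf(parts[1].trim()));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counts;
    }

    /** Convenience overload for text that is already in memory. */
    static Map<Integer, Integer> readCounts(String text) {
        return readCounts(new BufferedReader(new StringReader(text)));
    }

    public static void main(String[] args) {
        // Two QueryLines: line 1 found 42 references, line 2 found 7.
        System.out.println(readCounts("1\t42\n2\t7\n"));
    }
}
```

QueryLines that did not need an update are absent from the data, so a caller should treat a missing line number as "unchanged" rather than as a count of zero.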

filterByDate

protected java.lang.String filterByDate(java.lang.String eSearchURL)
Helper method adjusts the Query text to search for results added or modified between 2 specific dates: startDate and endDate. These parameters are set using methods in the parent QuerySubmitter class. This method adjusts the submittable text that will be sent to NCBI's ESearch utility, which will then filter the unique ids passed to the query retrieve utility.
If the startDate and endDate are not set, it sets the filter to search from year 1900 to the present moment.
Parameters:
eSearchURL - The URL to be passed to the ESearch utility, which will have 2 additional parameters added to the end.
Returns:
The eSearchURL modified to filter by the startDate and endDate set in this QuerySubmitter.
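
As an illustration of the behavior described above, the sketch below appends two date parameters to an ESearch URL and falls back to the year 1900 and the current date when no dates are set. The parameter names mindate and maxdate and the YYYY/MM/DD format are taken from the public ESearch interface. Whether this class uses exactly these two parameters is an assumption, as are the class name and the use of java.time in place of the inherited startDate and endDate fields.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

final class DateFilterSketch {

    // ESearch accepts dates written as YYYY/MM/DD.
    private static final DateTimeFormatter ENTREZ_DATE =
            DateTimeFormatter.ofPattern("yyyy/MM/dd");

    /**
     * Appends a date range to an ESearch URL. A null startDate falls back to
     * 1900-01-01 and a null endDate falls back to the supplied "today".
     */
    static String filterByDate(String eSearchURL, LocalDate startDate,
                               LocalDate endDate, LocalDate today) {
        LocalDate from = (startDate != null) ? startDate : LocalDate.of(1900, 1, 1);
        LocalDate to = (endDate != null) ? endDate : today;
        return eSearchURL
                + "&mindate=" + ENTREZ_DATE.format(from)
                + "&maxdate=" + ENTREZ_DATE.format(to);
    }

    public static void main(String[] args) {
        // example.org stands in for the real ESEARCH_ADDRESS.
        String url = "http://example.org/esearch?term=exampleterm[FIELD]";
        System.out.println(filterByDate(url, null, null, LocalDate.now()));
    }
}
```

Passing the current date in as an argument, rather than reading the clock inside the method, keeps the sketch deterministic and easy to check.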

parseQueryText

protected java.lang.String parseQueryText(java.lang.String queryText)
Helper method takes the queryText and converts it to a String that can be directly submitted to NCBI. This just converts whitespace to "%20". Note that the queryText parameter must have already had the line numbers removed from standard BioQuery format via the QuerySubmitter.parseQueryLine() method prior to being passed to this method.
Parameters:
queryText - the text from a QueryLine which has already had any line numbers removed via the QuerySubmitter.parseQueryLine() method
Returns:
a string with the space characters replaced by '+' characters.
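
The description above mentions "%20" while the Returns note mentions '+'. Both are accepted encodings of a space in a URL query string, and which one this class actually emits is not confirmed here. The sketch below assumes "%20"; the class and method names are hypothetical, and collapsing runs of whitespace into a single separator is a choice made for the example.

```java
final class QueryTextSketch {

    /**
     * Replaces each run of whitespace with "%20" so the text can be placed
     * in a URL. Leading and trailing whitespace is dropped first.
     */
    static String encodeWhitespace(String queryText) {
        return queryText.trim().replaceAll("\\s+", "%20");
    }

    public static void main(String[] args) {
        System.out.println(encodeWhitespace("first term[FIELD] AND second"));
    }
}
```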

submitESearch

protected java.util.List submitESearch(java.lang.String queryText,
                                       boolean count)
                                throws InvalidQueryException
This helper method does the first of a 2-step process for submitting Querys to NCBI. It takes the text of the Query and submits it to the NCBI ESearch database. This returns an ID number for each entity in the Query's resultset. This method parses the ID numbers and returns them as Strings contained in a List.
Parameters:
queryText - the text of searchable terms from the Query. The queryText should already be converted into a format understood by ESearch (by the submitQuery() method).
Returns:
a List containing NCBI PMIDs (unique identifiers) as Strings. Each PMID is one String / item in the List.

submitESearch

protected java.util.List submitESearch(java.lang.String queryText,
                                       int startNum,
                                       int endNum)
                                throws InvalidQueryException
startNum less than 0 means we should just get the number of items, not the items themselves (getQueryData, not getQueryResults). endNum == -1 means we should return all refs.

addBioQueryInfo

protected java.lang.StringBuffer addBioQueryInfo(java.lang.StringBuffer URLString)
Courtesy to NCBI that lets them know who is using their utilities and how to contact us.
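
NCBI's E-utilities accept tool and email parameters for exactly this kind of identification. The sketch below shows one way such a method could append them; whether this class uses those two parameter names is an assumption, and the tool name and contact address shown are placeholders.

```java
final class ToolInfoSketch {

    /**
     * Appends identifying parameters to a URL that already has a query string.
     * The values are assumed to contain only characters that are legal in a
     * URL query, so no further encoding is applied.
     */
    static StringBuffer addToolInfo(StringBuffer urlString, String tool, String email) {
        return urlString.append("&tool=").append(tool)
                        .append("&email=").append(email);
    }

    public static void main(String[] args) {
        StringBuffer url = new StringBuffer("http://example.org/esearch?term=x");
        // "bioquery" and the address below are placeholder values.
        System.out.println(addToolInfo(url, "bioquery", "contact@example.org"));
    }
}
```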

submitURL

protected java.io.BufferedReader submitURL(java.lang.StringBuffer URLString)
                                    throws java.io.IOException
Submits a URL and returns a BufferedReader containing the response. The StringBuffer containing the URL should have the complete protocol and address.
Parameters:
URLString - A StringBuffer containing a full URL address as text.
Returns:
A BufferedReader containing the response.

getCount

protected java.lang.String getCount(org.jdom.Document doc)
Pulls the reference count out of the returned xml document when ESearch is asked to provide counts, such as when getQueryData is called.
Parameters:
doc - the JDOM document for the returned xml document.
Returns:
The count number as a String (should be parsable to an int)

getIDs

protected java.util.List getIDs(org.jdom.Document doc)
Takes the xml document returned by ESearch and pulls out all of the unique IDs and puts them into a List as Strings.
Parameters:
doc - A JDOM document created from the xml document returned by ESearch.
Returns:
A List containing all of the unique IDs for the returned entries as Strings.
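
To make the shape of the ESearch response concrete, the sketch below extracts the count and the ID list from a small sample document. ESearch returns an eSearchResult element containing a Count element and an IdList of Id elements. The sketch uses the JDK's built-in DOM parser instead of JDOM so that it runs without extra libraries, so it illustrates the extraction logic rather than this class's actual code. The class name is hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ESearchXmlSketch {

    /** Parses an XML string into a DOM document, rethrowing failures unchecked. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse ESearch response", e);
        }
    }

    /** Returns the text of the first Count element, or "0" if there is none. */
    static String getCount(Document doc) {
        NodeList counts = doc.getElementsByTagName("Count");
        return counts.getLength() == 0 ? "0" : counts.item(0).getTextContent().trim();
    }

    /** Returns the text of every Id element, in document order. */
    static List<String> getIDs(Document doc) {
        List<String> ids = new ArrayList<>();
        NodeList nodes = doc.getElementsByTagName("Id");
        for (int i = 0; i < nodes.getLength(); i++) {
            ids.add(nodes.item(i).getTextContent().trim());
        }
        return ids;
    }

    public static void main(String[] args) {
        // Sample response with made-up IDs.
        Document doc = parse("<eSearchResult><Count>2</Count>"
                + "<IdList><Id>111</Id><Id>222</Id></IdList></eSearchResult>");
        System.out.println(getCount(doc));
        System.out.println(getIDs(doc));
    }
}
```

A real response can contain further Count elements nested deeper in the document, which is why the sketch reads only the first one.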

getDatabaseCode

protected java.lang.String getDatabaseCode()
                                    throws InvalidQueryException
Returns a database code that can be added to the URLs to all NCBI scripts to tell it which database to search. Takes the database listed in the Query (which is human readable) and translates it into a "URL suitable" name.
Returns:
a String that indicates the database and can be appended to a URL submitted to any NCBI script.

getSubmittableText

protected java.lang.String getSubmittableText()
                                       throws InvalidQueryLineException
Returns the text of the Query to be submitted. This method handles expanding line numbers, filtering by date if required, and removing whitespace.
Returns:
the submittable text of the Query, based on the submit line in that Query and converted to a form ready to send to the database.

submitRetrieveUrl

protected java.io.BufferedReader submitRetrieveUrl(java.util.List PMIDs)
                                            throws InvalidQueryException
Returns the full results of the Query. This is the second half of a 2-stage process: first the IDs of the results are retrieved by ESearch, then this method sends those IDs to the query.fcgi script, which returns the complete results. This method uses the RETRIEVE_ADDRESS hard-coded into this class.
Parameters:
PMIDs - a List containing the unique IDs of the entries to be retrieved, as Strings.
Returns:
A BufferedReader containing the results. The results have already been parsed and have a header with info about the Query added.
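
The second stage can be pictured as joining the IDs into one URL and capping the list so the URL stays within the server's length limit, which is the role MAX_NUM_DISPLAYED plays. The sketch below is an assumption about how such a URL might be assembled, not this class's code: the comma separator, the list_uids parameter name in the usage example, and the class name are all hypothetical, and query.fcgi itself is largely undocumented.

```java
import java.util.Arrays;
import java.util.List;

final class RetrieveUrlSketch {

    /**
     * Builds a retrieve URL from at most maxNumDisplayed IDs. The base address
     * is expected to end with the parameter that takes the ID list.
     */
    static String buildRetrieveUrl(String retrieveAddress, List<String> ids,
                                   int maxNumDisplayed) {
        int n = Math.min(ids.size(), maxNumDisplayed);
        return retrieveAddress + String.join(",", ids.subList(0, n));
    }

    public static void main(String[] args) {
        // Placeholder address and made-up IDs; only the first two are kept.
        String base = "http://example.org/query.fcgi?list_uids=";
        System.out.println(buildRetrieveUrl(base, Arrays.asList("1", "2", "3"), 2));
    }
}
```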

parseWebPage

public java.io.BufferedReader parseWebPage(java.io.BufferedReader webpage)

Parses the HTML query results returned by the query.fcgi script. This method may be the most fragile section of this QuerySubmitter. If the results are returned but are no longer parsed correctly, NCBI may have changed the format of the returned web page. The problem can be corrected by altering or overriding this method.

This parser does 3 things:

Not all links will be properly rendered, but most will.

Parameters:
webpage - The BufferedReader returned by the submitRetrieveUrl method, which contains the response from the query.fcgi script at NCBI.
Returns:
A BufferedReader containing the parsed page.

writeLinks

protected java.io.BufferedReader writeLinks(java.util.List PMIDs)
Returns the Query results in the ID List format. This adds the header and then writes the unique IDs returned by ESearch along with links to more details about each item at NCBI's website.
Parameters:
PMIDs - The list of IDs returned by ESearch via the submitESearch method
Returns:
A BufferedReader containing an HTML document of the Query results in ID List format.
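
A minimal sketch of the ID List idea follows: each ID becomes a link formed by appending the ID to idListURL. The exact markup this class produces, and the way idListURL and the ID are combined, are assumptions. The header here is reduced to bare html and body tags, and the class name is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class IdListSketch {

    /** Writes one link per ID between a minimal header and footer. */
    static String writeLinks(String idListURL, List<String> ids) {
        StringBuilder html = new StringBuilder("<html><body>\n");
        for (String id : ids) {
            html.append("<a href=\"").append(idListURL).append(id).append("\">")
                .append(id).append("</a><br>\n");
        }
        return html.append("</body></html>\n").toString();
    }

    public static void main(String[] args) {
        // Placeholder link prefix and made-up IDs.
        System.out.print(writeLinks("http://example.org/item?id=",
                Arrays.asList("111", "222")));
    }
}
```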

getResultsHeader

protected java.lang.String getResultsHeader()
Returns a header containing details about the Query in HTML format.

getResultsFooter

protected java.lang.String getResultsFooter()
Returns a footer String that simply ends the body and html tags.

submitEFetch

protected java.io.BufferedReader submitEFetch(java.util.List PMIDs)
                                       throws InvalidQueryException
Deprecated in favor of the RETRIEVE_ADDRESS and query.fcgi script.
This helper method does the second of a 2-step process for submitting Querys to NCBI. It takes the List of IDs returned from the submitESearch() method and submits it to PmFetch. The results are contained within the BufferedReader that is returned. A failure to successfully connect to PmFetch will return a null BufferedReader reference.
Parameters:
PMIDs - the List of IDs that is returned from the submitESearch() method.
Returns:
A BufferedReader containing the full response from the Entrez PmFetch server as ASCII text.

getFormatCode

protected java.lang.String getFormatCode(Query theQuery)
                                  throws InvalidQueryException
Deprecated in favor of the RETRIEVE_ADDRESS and query.fcgi script.