WebCSD v1.1.1 FAQs
- How does the similarity search work?
- What are the differences between the Tanimoto and Dice similarity coefficients?
- What are the strengths/weaknesses of similarity searching in WebCSD?
- What should I sketch for a similarity search?
- Why do I get different reduced cell search results in WebCSD compared to ConQuest?
- Why do I only see refcodes beginning with 'A' when I browse the database?
- What are the client-side technical requirements for WebCSD access?
- How can I speed up Java on my computer?
- Why is WebCSD always slow at the beginning of each session?
- My searches are running really slowly - what should I do?
- I have just started using a new version of WebCSD - why are some features behaving strangely or not working?
- How does author name searching work?
- How can I search for more complicated compound names?
- Why does WebCSD run searches in a new window each time?
- Why do my WebCSD applets stop appearing when I already have many WebCSD tabs/windows open?
- Why does WebCSD give a 'Socket is not connected' error every time I try to run a search?
- Why does the Jmol visualiser give an 'access denied' error when I try to view WebCSD structures?
- What are CSD X-Press and Structures Pending?
- How do I reference WebCSD?
- What are Retracted CSD entries?
- What is the 'teaching subset' of the CSD?
- What are user accounts for and how do I get one?
- Are there any differences between substructure searches in WebCSD and ConQuest?
How does the similarity search work?
The similarity calculation in WebCSD is based on molecular fingerprints
that are calculated using the chemical features of the molecule such as atom types,
bond types and bonded paths through the molecule. When a molecule is drawn in the
similarity sketcher, the molecular fingerprint for this molecule is calculated and
then it is compared to pre-calculated fingerprints of all the structures in the CSD.
The fingerprint comparison is performed using either the Tanimoto or the Dice coefficient,
which gives a measure of the similarity between the molecules. Each coefficient
produces a similarity value in the range 0 to 1, with 0 meaning completely
dissimilar and 1 meaning identical fingerprints. To produce a manageable set of similar structures,
a cut-off value for the similarity coefficient is used, below which matches are
discarded (the default is 0.7 for Tanimoto and 0.975 for Dice).
N.B. The two types of similarity coefficient are not directly comparable, so calculated
similarity values cannot be compared between the two types in a quantitative fashion.
What are the differences between the Tanimoto and Dice similarity coefficients?
The Tanimoto coefficient is determined by looking at the number of chemical
features that are common to both molecules (the intersection of the data
strings) compared to the number of chemical features that are in either (the
union of the data strings). The Dice coefficient also compares these values but
using a slightly different weighting.
In the formulas below, A and B are the numbers of features in each of the two
molecules, and ( A intersect B ) is the number of features common to both.
The Tanimoto coefficient is the ratio of the number of features common to both
molecules to the total number of distinct features, i.e.
( A intersect B ) / ( A + B - ( A intersect B ) )
The range is 0 to 1 inclusive.
The Dice coefficient is the number of features common to both molecules relative
to the average number of features in the two molecules, i.e.
( A intersect B ) / ( 0.5 ( A + B ) )
The weighting factor comes from the 0.5 in the denominator. The range is 0 to 1 inclusive.
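As an illustration only, the two formulas above can be sketched in Python by treating each fingerprint as a set of feature labels. The feature sets here are made up for the example and are not real WebCSD fingerprints, and returning 0.0 for two empty sets is a convention chosen for this sketch. For any one pair of fingerprints the two formulas imply Dice = 2T / (1 + T), where T is the Tanimoto value, which is one way to see why values from the two coefficients cannot be compared directly.

```python
def tanimoto(a, b):
    """Tanimoto: common features divided by distinct features in either set."""
    common = len(a & b)
    total = len(a) + len(b) - common
    # Two empty fingerprints are scored 0.0 here (a convention for this sketch).
    return common / total if total else 0.0


def dice(a, b):
    """Dice: common features divided by the average size of the two sets."""
    common = len(a & b)
    mean_size = 0.5 * (len(a) + len(b))
    return common / mean_size if mean_size else 0.0


# Hypothetical feature sets standing in for two molecular fingerprints.
query = {"C", "N", "O", "C-C", "C-N", "C=O", "C-C-N"}
candidate = {"C", "N", "O", "C-C", "C-N", "C=O", "C-C-O", "C-O"}

t = tanimoto(query, candidate)  # 6 common / 9 distinct
d = dice(query, candidate)      # 6 common / 7.5 average size
print(f"Tanimoto = {t:.3f}, Dice = {d:.3f}")
```

Because the Dice value is never lower than the Tanimoto value for the same pair, a given cut-off is less selective when applied to Dice scores than to Tanimoto scores.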
What are the strengths/weaknesses of similarity searching in WebCSD?
With all fingerprint-based methods of similarity searching there are certain
strengths and weaknesses inherent in the fingerprint definitions. The
fingerprints used by WebCSD for similarity searching are created using atom
types, bond types and bonded paths through the molecules. This definition for
the fingerprints means that the search will tend to find matches that contain
closely related scaffolds. There are, however, a number of weaknesses
associated with the fingerprints and similarity calculations as they are
implemented at the moment.
The first issue is that although the bond types are compared, cyclicity is not
explicitly taken into account within the fingerprints. This means that
cyclohexane will be indistinguishable from hexane in a similarity search.
Molecules that contain fewer atoms will also be less well defined, and
are therefore more prone to low similarity scores. Finally, no information is
stored about chemically related elements, such as transition metals. This means
that closely related metal complexes, for example, may not be listed with high
similarity coefficients.
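The cyclohexane/hexane point can be illustrated with a toy model. The sketch below is not the WebCSD fingerprint definition: it assumes a fingerprint is simply the set of labels of all simple bonded paths through the heavy atoms, with every atom labelled C and every bond single. The function name and the path-length limit are hypothetical. Under those assumptions a six-membered chain and a six-membered ring produce exactly the same set of features.

```python
def path_features(bonds, n_atoms, max_atoms=7):
    """Collect a label for every simple bonded path of up to max_atoms atoms.

    Toy model: all atoms are carbon and all bonds are single, so a path of
    k atoms is labelled "C-C-...-C" with k carbons.
    """
    neighbours = {i: set() for i in range(n_atoms)}
    for i, j in bonds:
        neighbours[i].add(j)
        neighbours[j].add(i)

    features = set()

    def walk(path):
        features.add("-".join("C" for _ in path))
        if len(path) == max_atoms:
            return
        for nxt in neighbours[path[-1]]:
            if nxt not in path:  # simple paths only: never revisit an atom
                walk(path + [nxt])

    for start in range(n_atoms):
        walk([start])
    return features


hexane = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
cyclohexane = hexane + [(5, 0)]  # the same chain closed into a ring

fp_hexane = path_features(hexane, 6)
fp_cyclohexane = path_features(cyclohexane, 6)
print(fp_hexane == fp_cyclohexane)
```

Since no simple path can visit more than six atoms in either molecule, ring closure adds no new path label, and the two toy fingerprints are identical.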
For further information about the similarity search calculation, see the
following open access publication:
Thomas et al., 2010, J. Appl. Cryst., 43, 362-366.
What should I sketch for a similarity search?
The similarity search is based on a comparison of molecular fingerprints, so
it is important to sketch a full molecule rather than a substructure. It is not crucial,
however, to draw the hydrogens on your molecule because hydrogens are not included
explicitly in the similarity calculation.
Why do I get different reduced cell search results in WebCSD compared to ConQuest?
If a particular unit cell is entered for a reduced cell search in either
ConQuest or WebCSD, the search algorithm will not miss any matches that should be hit
for that search. The ConQuest search, however, uses only the reduced unit cell
lengths to find matches, because of known mathematical instabilities associated with including
the unit cell angles (Andrews, Bernstein & Pelletier, Acta Cryst, 1980, A36, 248-252).
The new implementation in WebCSD takes into account the cell angles as well by using
a more advanced methodology involving nearly Buerger-reduced cells (Andrews & Bernstein,
Acta Cryst, 1988, A44, 1009-1018). This approach avoids the problems with instabilities and
means that the reduced cell search in WebCSD gives fewer false positive hits.
Why do I only see refcodes beginning with 'A' when I browse the database?
The scrollable list of refcodes in the Browse Database section has been designed such
that it only loads the set of refcodes beginning with one particular letter at any time. This
has been done to avoid over-loading the Javascript menu and also to make scrolling through the
list easier and more useful. As such, when you first enter the Browse Database page, the browser
will be showing only the refcodes starting with the letter 'A'. The browser can be prompted to go
to a particular section of the database by typing letters into the textbox - as you type, the
browser will jump to the most relevant refcode.
What are the client-side technical requirements for WebCSD access?
Supported Browsers
The following browsers are fully supported for WebCSD v1.0:
Apple Mac Users
We recommend the use of Safari on Mac OS X as this generally offers the best
user experience on this platform.
Please note - we no longer offer formal support for Mac OS X 10.4 ("Tiger").
Alternative Browsers
If none of the supported browsers are available, you could use one of the following
alternatives to run WebCSD v1.0 even though they are not formally supported at
this stage. You may notice some limitations when using one of these browsers - please
let us know if you encounter any technical difficulties and we will endeavour to assist you.
Other Requirements
- Java Runtime Environment (JRE) v1.5 or later.
Latest JRE v1.6 highly recommended for optimal performance.
- Your network must allow you to open TCP socket connections to webcsdserver.ccdc.cam.ac.uk on either port 80 or port 8765.
- You must allow pop-ups for the *.ccdc.cam.ac.uk domain.
- You must enable Javascript in your web browser.
- Client-side cookies are used to store personal preferences within WebCSD. They may also
be used to store essential per-session data required for WebCSD access. If you disable cookies,
your preferences and interface settings will not be retained and you may not be able to access
WebCSD.
- You must accept the CCDC digital certificate when prompted to do so by the WebCSD Java applets.
The Java applets have been digitally signed to give them sufficient privileges to connect to the WebCSD
server. Failure to accept the certificate will prevent them from starting your searches. If you reject
the digital certificate, you will still be prompted to accept it next time you visit the site in a new
browser session.
How can I speed up Java on my computer?
WebCSD relies heavily on Java technology. Java is used to power the chemical sketcher, the 3D visualiser and the results browser. There are four key factors that determine the speed of Java applications:
- The version of Java you are using - Generally speaking, the newer the better.
- The speed of your computer - This determines how quickly the Java Runtime Environment can be started at the beginning of each browser session and how quickly the applet can be initialised on the page.
- The speed of your internet connection - This determines how quickly the applet can be downloaded from the web server.
- The internet browser you use - Some internet browsers work better with Java applications than others. If you are having performance issues with one browser, it's worth trying a different one.
We recommend the use of Java Runtime Environment 6 (the current release version) which can be downloaded
here.
Why is WebCSD always slow at the beginning of each session?
Before a Java application can run, the "Java Runtime Environment" (JRE) must be initialised. This can take quite a few seconds (depending purely on the speed of your client machine and the version of Java you are using). Until the JRE has completely initialised, the page you are trying to use will be inactive and will probably appear empty.
Once the JRE has loaded, the page will come to life, the missing components will appear, and you will be able to use WebCSD. The JRE only needs to initialise once per browser session, the first time a Java application is run. The next time Java is used, the application should appear virtually immediately due to the internal caching that automatically takes place within the JRE.
My searches are running really slowly - what should I do?
There are many possible explanations for slow searches. The most common reasons are:
- Slow internet connection
- Slow client PC
- Slow Java performance
- Very busy server
In order to diagnose the underlying cause of this problem, we have added a 'Socket Connection Test' mechanism to WebCSD. This test retrieves the first 100,000 database entries from the CSD via your network connection. Please allow the test to run to completion and retrieve all 100,000 entries. You can then send us an automated performance report by choosing the 'Send Search Statistics Report' option from the 'Help' menu of the result browser applet down the left-hand side. Please enter your name, email address and any other relevant information in the dialog that appears and then click 'Send Report'. The information will then be automatically sent to the CCDC support team for their prompt attention.
The information contained within this report should indicate where the performance bottleneck lies and therefore what needs to be done to resolve it. You may be asked to submit several performance reports in this way either in quick succession or at different times of day to give us a better average. For example, the general level of traffic on the internet varies throughout the day and can skew the results at certain times of day.
I have just started using a new version of WebCSD - why are some features behaving strangely or not working?
Of course it is possible that you have identified a genuine issue in WebCSD, but it is quite common for this to be caused by a web browser failing to notice that a file has changed on the WebCSD server and therefore continuing to use the old cached version.
Before contacting us to report the problem, we recommend that you empty your browser's cache of temporary internet files and try WebCSD again just in case this provides a quick and easy solution.
- On Mozilla Firefox 3.0.*, go to the 'Tools' menu and choose 'Clear Private Data...'.
- On Mozilla Firefox 3.5.*, go to the 'Tools' menu and choose 'Clear Recent History...'.
Make sure the 'Cache' checkbox is selected in the 'Details' section before clicking 'Clear Now'.
- On Internet Explorer 7, go to the 'Tools' menu and choose 'Delete Browsing History...', then click the 'Delete Files' button.
- On Internet Explorer 8, go to the 'Safety' menu and choose 'Delete Browsing History...'.
How does author name searching work?
To search on author name, select the 'Author Name' query type and enter the required surname
in the text/numeric 'Query' box. Optionally authors' initials may also be specified, but each
must be followed by a full-stop with no spaces between initials or between initials and surname,
e.g. 'F.H.Allen'. When initials are provided, all must match exactly, e.g. 'F.H.Allen' would not
match 'F.Allen'.
When using the match anywhere option, the query 'Allen' would match names like 'Allenby' and 'Allenford'.
Use the match exact word option to only allow exact name matches.
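The formatting rule for initials can be sketched as a small helper. This is an illustration of the query format described above, not part of WebCSD; the function name and its arguments are hypothetical.

```python
def format_author_query(surname, initials=()):
    """Build an author query such as 'F.H.Allen' from a surname and initials."""
    surname = surname.strip()
    if not surname:
        raise ValueError("a surname is required")
    # Each initial is followed by a full stop, with no spaces anywhere.
    prefix = "".join(
        f"{initial.strip().rstrip('.')}."
        for initial in initials
        if initial.strip()
    )
    return prefix + surname


print(format_author_query("Allen", ["F", "H"]))
print(format_author_query("Allen"))
```

The first call produces the query form with initials, and the second produces a surname-only query.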
How can I search for more complicated compound names?
Here are some useful conventions and tips for compound name searching:
- Standard parenthesis and bracket characters can be used in WebCSD
text/numeric searches, so you can search for 'cobalt(ii)'
or 'bicyclo[3.3.1]nonane'.
- You can use '+' and '-' characters to define
stereochemistry, e.g. '(+-)-Nefopam'.
- Lower case Greek characters are stored in the text
using their Latin alphabet descriptions, e.g. alpha for
α and mu for μ. Upper case Greek characters are
spelt out and prefixed by c, e.g. cdelta for Δ.
- The names of the elements Al, Cs and S are spelt aluminium,
cesium and sulfur.
- Bridging ligands in polymeric metal coordination complexes
are identified by the bridging indicator μ, with the polymer
identified by the prefix catena, e.g. catena-((μ2-2,5-dihydroxy-p-benzoquinonato)-zinc).
- Names of hydrates will contain the words hemihydrate,
monohydrate, dihydrate, etc., or just hydrate if
the multiplier is a non-integer value.
- If other solvents are present, the name will contain the
word solvate; clathrate is used for solvates which are
clathrated, as in host-guest compounds.
- Deuterated species will always contain the name characters deuter.
- Characters which would normally be typeset
as superscripts or subscripts are enclosed within the
characters $ (up) and ! (down), e.g.
'eta$5!-cyclopentadienyl' will match strings including
'η⁵-cyclopentadienyl'.
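The Greek-letter convention above can be sketched as a small text conversion. This is an illustrative helper rather than anything supplied by WebCSD: the function name is hypothetical, it relies on Python's standard Unicode character names, and it assumes that lambda is spelt "lambda" (Unicode itself names that letter "LAMDA"). Accented or variant Greek characters are left unchanged.

```python
import unicodedata

# Unicode names this letter "LAMDA"; the usual English spelling is assumed here.
SPELLING_FIXES = {"lamda": "lambda"}


def greek_to_query_text(text):
    """Spell out Greek letters; capital letters get a 'c' prefix (e.g. cdelta)."""
    out = []
    for ch in text:
        # Names look like "GREEK SMALL LETTER ALPHA" or "GREEK CAPITAL LETTER DELTA".
        parts = unicodedata.name(ch, "").split()
        is_plain_greek = (
            len(parts) == 4
            and parts[0] == "GREEK"
            and parts[1] in ("SMALL", "CAPITAL")
            and parts[2] == "LETTER"
        )
        if is_plain_greek:
            spelt = parts[3].lower()
            spelt = SPELLING_FIXES.get(spelt, spelt)
            out.append("c" + spelt if parts[1] == "CAPITAL" else spelt)
        else:
            out.append(ch)
    return "".join(out)


print(greek_to_query_text("\u03b1-D-glucose"))  # input starts with Greek small alpha
print(greek_to_query_text("\u0394"))            # Greek capital delta
```

Escapes such as "\u03bc" are used for the Greek letters so that the Greek small letter mu is not confused with the visually similar micro sign, which has a different Unicode name and would be left unchanged by this sketch.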
Why does WebCSD run searches in a new window each time?
WebCSD is designed to launch each search in a new 'pop-up'. This approach offers two key advantages:
- Your query is retained in the original window/tab so it can easily be modified, saved or run again.
- You can compare multiple search results side-by-side.
You have some control over how your internet browser handles these pop-ups. Most modern tabbed browsers
(including Internet Explorer, Firefox, Safari, Opera and Google Chrome) allow you to specify whether
pop-ups should open in a new window or a new tab by default. We would recommend configuring your browser
to open pop-ups in a new tab as this offers the best user experience in web applications such as WebCSD.
Why do my WebCSD applets stop appearing when I already have many WebCSD tabs/windows open?
Sun's Java Runtime Environment (JRE) applies a default limit on the maximum amount of memory made available
to the Java applets running in your web browser. Depending on your browser and JRE version, this
limit may be shared across all applets running in your browser, even if they are in different windows
or tabs. If you open too many applets at once, you may run out of Java heap memory and be unable to
open any more. When this happens, you will see an error message like "java.lang.OutOfMemoryError: Java
heap space" in your Java console. To resolve it, please update your JRE to the latest version. If
you are unable to run Java 6 Update 10 or later, please refer to this
article.
Why does WebCSD give a 'Socket is not connected' error every time I try to run a search?
In order to run a search, WebCSD's result browser applet must make a TCP socket connection back
to the CCDC's search server at webcsdserver.ccdc.cam.ac.uk. By default, it attempts to connect on
port 80. However, some networks block direct port 80 access to the internet and force all traffic through
an HTTP web proxy which is not suitable for WebCSD traffic. If port 80 is blocked, the applet will
automatically try to connect on port 8765 instead. If it successfully connects to port 8765, it remembers
to use that port by default for all subsequent searches in that session. Therefore, in order to run
searches on the public internet version of WebCSD, you must ensure that your network allows your PC
to connect to webcsdserver.ccdc.cam.ac.uk on either port 80 or port 8765.
If you want to use a different port to the one automatically selected by the result browser, you
can manually override its selection by going to the 'Help/Settings' menu and choosing a
new port number. Your selection will be saved in a browser cookie for future sessions.
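If you want to check from your own machine whether either port is reachable, the port fallback described above can be imitated with a short script. This is a minimal sketch, not the applet's actual code: the function name, the timeout value and the injectable `connect` parameter are assumptions made for the example, and a successful TCP connection only shows that the port is not blocked, not that searches will run.

```python
import socket

WEBCSD_HOST = "webcsdserver.ccdc.cam.ac.uk"
WEBCSD_PORTS = (80, 8765)  # default port first, then the fallback


def first_reachable_port(host=WEBCSD_HOST, ports=WEBCSD_PORTS, timeout=5.0,
                         connect=socket.create_connection):
    """Return the first port that accepts a TCP connection, or None if none do."""
    for port in ports:
        try:
            conn = connect((host, port), timeout)
        except OSError:
            # Blocked, refused, timed out or unresolved: try the next port.
            continue
        conn.close()
        return port
    return None
```

Calling `first_reachable_port()` with no arguments attempts real connections to the server on port 80 and then port 8765. A result of `None` suggests that your network is blocking both ports.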
Why does the Jmol visualiser give an 'access denied' error when I try to view WebCSD structures?
If you get an error message similar to:
access denied (java.net.SocketPermission 127.0.0.1:8081 connect,resolve)
at the top of the Jmol display window and no molecule appears, you may need to update your Java
security policy to allow connections to the WebCSD server.
To do this, you will need to edit the java.policy file that your local computer is using
- this will probably be in the lib/security subdirectory of your Java runtime installation.
In the java.policy file, add a line like this:
permission java.net.SocketPermission "127.0.0.1:8081", "connect,resolve";
(or whatever address is used to connect to your WebCSD server) in the grant section.
If this does not work, you can also try adding:
permission java.security.AllPermission;
in the grant section, but this disables the Java security mechanism and should
ideally be avoided.
What are CSD X-Press and Structures Pending?
Because processing and curating CSD structures takes a significant amount
of time, the CCDC has decided to take advantage of the new Web-based architecture in WebCSD
and start releasing structures to the public before they are fully curated. These structures
have been automatically processed using our specialist in-house software to ensure a certain
level of quality, but may not have had any manual input. Because they have not yet been fully
processed by a CCDC editor, some are likely to contain errors, and some entries will lack
fields that are added during the curation process, such as 'recrystallisation solvent'
and 'bioactivity'. New structures will be added in batches on a regular basis as they are received,
and these uncurated structures will appear in WebCSD as a separate database designated
"CSD X-Press". The intention is that early access to these structures
will be beneficial to users in spite of possible errors in the "Structures Pending",
especially in the handling of disorder and diagram generation.
More about Structures Pending:
- CSD X-Press: The structures that are pending curation are kept in a
separate database within the WebCSD architecture named "CSD X-Press". This means that it is
simple to perform searches or extract results based on only the fully curated CSD,
only the structures pending, or both sets of structures using the checkboxes provided
in the Settings tab of WebCSD.
- Refcode Format: The reference code for a regular CSD structure
has the format of six letters followed by an optional two digits (e.g. AABHTZ or
AACRUB01). For structures pending, the temporary refcodes assigned will always end
in '00' to indicate that the structure has not yet been fully curated. Please bear
in mind that these refcodes are temporary and the code will either be changed
completely or the '00' will be removed.
- Citing a Structure Pending: If you would like to refer to a
CSD X-Press entry within a scientific publication, please report the CCDC reference
number (e.g. CCDC 747743) rather than the temporary refcode. For example use one
of the following styles:
- For published structures, write in the body text "(CCDC 747743)", then
cite the original paper in your references section.
- Or, for private communications use a reference like so: "S. Parsons,
C. Grant, R. Winpenny, R. Gould & P. Wood (2004). Private communication to
CSD, CCDC 248052".
- Reliability Score: This score indicates the level of reliability
assigned automatically to a structure based on the curation status of the entry
and the likelihood of complications in the automatic processing. The reliability
score does not reflect the quality of the crystallography/science and is purely
based on the difficulty of processing the particular entry.
- 4 stars are given to all fully curated entries in the main CSD. This rating represents a
wide-ranging set of professionally-edited structures containing molecules with a
broad level of complexity.
- 3 stars will be given to CSD X-Press entries which encountered a low number of complications
during automatic processing and typically represent entries with simple
chemistry/crystallography.
- 2 stars will be given to CSD X-Press entries for which a moderate number of automatic
processing problems were discovered, normally representing entries with more
complicated chemistry/crystallography.
- 1 star will be given to CSD X-Press entries for which a high number of automatic processing
problems were discovered, normally representing larger entries with complex
chemistry/crystallography.
Entries can be sorted on reliability score in the Results Browser applet by
clicking on the 'Reliability' column heading.
- 2D Diagrams: If there is a match in the CSD based on
structural topology, the chemical diagram for this match will be used as a
template - the majority of 2D diagrams are derived using this method. Diagram
generation for any new structural topologies is automated using Marvin (a
ChemAxon package); entries for which acceptable diagrams cannot be automatically
generated are simply shown with no diagram. The automatically generated diagrams
do not necessarily reflect the exact 3D geometry of the structure.
- Compound Name: This field will be populated if anything has
been supplied in the author's original deposited CIF, or if a name can be generated
automatically using ACD/Name Batch (an ACD/Labs product) based on the chemical
connectivity. Any automatically generated compound names will be subject to
possible errors, especially in the case of disordered structures or highly
complicated connectivities.
- Disorder: Any disorder will be identified and processed
automatically, but it is likely that some structures will require manual checking
and editing to fully characterise the disorder.
- Feedback: If you have any questions or comments relating to
this new functionality, please e-mail the CSD X-Press Team at
csdxpress@ccdc.cam.ac.uk.
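The refcode format described under "Refcode Format" above can be checked with a short pattern. This is an illustrative sketch only: the function names are hypothetical, the pattern assumes upper case letters as in the examples given, and 'ABCDEF00' is a made-up code rather than a real entry.

```python
import re

# Six letters, optionally followed by two digits (e.g. AABHTZ, AACRUB01).
REFCODE = re.compile(r"[A-Z]{6}([0-9]{2})?")


def is_refcode(code):
    """True if the whole string has the shape of a CSD refcode."""
    return REFCODE.fullmatch(code) is not None


def looks_pending(code):
    """True for a well-formed refcode ending in '00', the temporary pending form."""
    return is_refcode(code) and code.endswith("00")


for code in ("AABHTZ", "AACRUB01", "ABCDEF00"):
    print(code, is_refcode(code), looks_pending(code))
```

Because pending refcodes are temporary, a check like this is only useful for recognising the format; the CCDC reference number remains the stable way to identify a CSD X-Press entry.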
How do I reference WebCSD?
WebCSD: the online portal to the Cambridge Structural Database
I. R. Thomas, I. J. Bruno, J. C. Cole, C. F. Macrae, E. Pidcock and P. A. Wood,
J. Appl. Cryst., 43, 362-366, 2010
DOI: 10.1107/S0021889810000452
What are Retracted CSD entries?
Evidence was discovered in 2010 proving that a
substantial series of crystal structures published in Acta Crystallographica Section
E were based on falsified data. These structures, primarily published in 2007,
originated from research groups at the Jinggangshan University in China. The
editors of the journal, along with Ton Spek (Utrecht University), identified the
fraudulent structures and the papers have been retracted as described in this
Editorial article.
In order to accurately represent the situation, we have decided to flag each
relevant entry as "Retracted" - all data for these structures have been removed,
but the journal references remain in place. See this Statement
by Dr. Colin Groom, Executive Director of the CCDC, for further information on the
matter.
What is the 'teaching subset' of the CSD?
The teaching subset of the Cambridge Structural Database comprises 500 structures
chosen specifically for their educational value. The subset is freely available
via the WebCSD interface. For further information see:
Teaching Three-Dimensional Structural Chemistry Using Crystal Structure Databases
1. An Interactive Web-Accessible Teaching Subset of the Cambridge Structural Database
G. M. Battle, F. H. Allen and G. M. Ferrence
J. Chem. Educ., 87, 809-812, 2010.
DOI: 10.1021/ed100256k
What are user accounts for and how do I get one?
The optional WebCSD database security mechanism can be used to protect access to
WebCSD data on a per-database basis. When enabled, the WebCSD server administrator
can restrict access to a predefined list of privileged user groups for each
individual database.
There are two ways to determine from your Settings
that the security mechanism is currently blocking your access to a specific database:
- You cannot select the database in question because it is 'Denied'.
- The database in question is not listed in the database selector (which
means the security mechanism is in stealth mode and denied databases are hidden
from non-privileged users).
In order to access a protected database, you must log in to a user account which
is a member of a group with permission to access that database. If you do not yet
have a user account on the WebCSD server in question, you must first request a new
account from the WebCSD server administrator. Once you have a user account, you
must then ask your administrator to add your account to an appropriate group which
has permissions to access the database in question.
You can contact your server administrator via the
Support Request form.
NOTE - If the database security mechanism is not enabled on your WebCSD server, or
none of the databases you are interested in are protected, you do not require a user
account.
Are there any differences between substructure searches in WebCSD and ConQuest?
Although the substructure search engine behind WebCSD is entirely new and does not
work in the same way as ConQuest, the results of substructure searches using these
two programs are generally identical.
There is, however, a subtle difference between the two
search programs: a WebCSD substructure search currently requires that the fragments
drawn are within the same connectivity, i.e. the same bonded molecule. ConQuest, on the other hand, will by default
allow a user to define fragments in unconnected moieties (e.g. co-crystals, solvates
or salts) and even allows one to define contacts between these non-covalently-bonded
fragments. Identical substructure search behaviour between the two systems can be
achieved by using the option in the ConQuest sketcher to request "All Atoms in Same
Molecule" under the "Atoms" menu.