Chemical data validation
NOTE: Chemical Abstracts have agreed to perform this work for us. Information on this process will be posted once details are available.
This topic is the main agenda item at the February 5th IRC Meeting. Please join us!
There are two types of validation which may be done with CAS numbers. One is a mathematical validation, designed to detect mistyped CAS numbers. The other method is to validate that the number is assigned to a chemical.
Mathematical validation (using a check digit):
The Chemical Abstracts Service (CAS) registry number system was designed to be fault-tolerant. Built into every CAS number is a check digit that makes it possible to detect mistyped numbers. Validation is a mathematical and repetitive process well-suited for software. Note that a CAS number that passes this check can still be absent from the CAS database; mathematical validation only says that a CAS number could be valid based on its format and check digit.[1]
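Here is sample code for this validation:

    module CAS
      # Returns true when cas_number has the CAS format and its check digit matches.
      def self.validate(cas_number)
        # Anchor the pattern so that surrounding characters are rejected.
        return false unless cas_number && cas_number.match(/\A[0-9]{2,7}-[0-9]{2}-[0-9]\z/)

        check_digit = cas_number[-1, 1].to_i
        sum = 0
        # In the reversed string the check digit has index 0 and adds nothing;
        # the remaining digits are weighted 1, 2, 3, ... from right to left.
        cas_number.reverse.scan(/[0-9]/).each_with_index do |digit, i|
          sum += digit.to_i * i
        end

        check_digit == sum.remainder(10)
      end
    end

    # Prompt for CAS numbers until an empty line or end of input.
    loop do
      print "CAS Number: "
      input = gets
      break if input.nil? || input.strip.empty?
      puts CAS.validate(input.strip) ? "valid" : "invalid"
    end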
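Because the check is purely arithmetic, it can also be run over a whole list of entries before any lookup is attempted. The following is a minimal sketch of that idea, not part of the sample above: the method names `cas_checksum_ok?` and `flag_invalid` and the sample entries are hypothetical and chosen only for illustration.

```ruby
# Hypothetical helper: true when the string has the CAS format
# (2-7 digits, 2 digits, 1 check digit) and the check digit matches.
def cas_checksum_ok?(cas_number)
  return false unless cas_number.is_a?(String)
  return false unless cas_number.match?(/\A[0-9]{2,7}-[0-9]{2}-[0-9]\z/)

  digits = cas_number.delete("-").chars.map(&:to_i)
  check_digit = digits.pop
  # Weight the remaining digits 1, 2, 3, ... from right to left.
  sum = digits.reverse.each_with_index.map { |digit, i| digit * (i + 1) }.sum
  sum % 10 == check_digit
end

# Hypothetical helper: returns the names whose CAS number fails the check.
def flag_invalid(entries)
  entries.reject { |_name, cas| cas_checksum_ok?(cas) }.keys
end

# Illustrative entries only; the last one has two digits transposed on purpose.
entries = {
  "Caffeine" => "58-08-2",
  "Water" => "7732-18-5",
  "Mistyped caffeine" => "58-80-2"
}

puts flag_invalid(entries).inspect
# prints ["Mistyped caffeine"]
```

Entries that pass this filter still need to be confirmed by lookup, since a well-formed number is not necessarily assigned to the chemical in question.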
Validation by lookup:
CAS numbers need to be validated for the ~4000 chemical pages. Since the only authoritative source is the American Chemical Society, SciFinder looks like the best bet. For various reasons (see previous IRC discussions), it is not practical for one editor to validate them all. Thus, the division of labor:
ChemSpiderMan (talk · contribs) will be in charge of the distribution. Help is wanted! To contribute, simply request the block (number of entries) you would like to handle, and sign by using ~~~~ after you are done. It may be helpful to try tackling a smaller block, before making further requests.
Visit the authority for CAS numbers and use either SciFinder or STN to search/validate the CAS number for the represented structure. There ARE cases where multiple CAS numbers are associated with a single compound, so the CAS number itself might need to be annotated. For most complex organics I don't think this will be a problem, but it will be for the inorganics and a number of the more common organics. --ChemSpiderMan (talk) 05:15, 23 January 2008 (UTC)
CAS Number Legality
Please see the Ruby code posted by Rich Apodaca at http://depth-first.com/articles/2008/07/23/validating-cas-numbers [1]. This code demonstrates how the check digit (the last digit) is calculated.
As an example, the CAS number of caffeine is 58-08-2. The check digit 2 is calculated by weighting the preceding digits 1, 2, 3, ... from right to left: (8*1) + (0*2) + (8*3) + (5*4) = 8 + 0 + 24 + 20 = 52, and 52 modulo 10 is 2.
Now, this validates that this numerical sequence (2-7 digits) - (2 digits) - (1 digit) is legal to use as a CAS number, but doesn't validate that it is in use in the CAS Registry. --Underscore bruce (talk) 18:07, 18 September 2008 (UTC)
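The caffeine arithmetic above can be reproduced step by step. This is a small sketch for illustration only, not the code from the linked article; the method name `cas_check_digit_steps` is hypothetical.

```ruby
# Hypothetical helper: takes the part of a CAS number before the final
# hyphen (e.g. "58-08") and returns the weighted terms, their sum, and
# the resulting check digit.
def cas_check_digit_steps(body)
  digits = body.delete("-").chars.map(&:to_i)
  # Pair each digit with its weight: the rightmost digit gets 1, the next 2, ...
  terms = digits.reverse.each_with_index.map { |digit, i| [digit, i + 1] }
  sum = terms.sum { |digit, weight| digit * weight }
  [terms, sum, sum % 10]
end

terms, sum, check = cas_check_digit_steps("58-08")
puts terms.map { |digit, weight| "(#{digit}*#{weight})" }.join(" + ")
# prints (8*1) + (0*2) + (8*3) + (5*4)
puts "sum = #{sum}, check digit = #{check}"
# prints sum = 52, check digit = 2
```

Returning the intermediate terms alongside the result makes it easy to compare a hand calculation against the computed value.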
Pending requests
Allocated blocks
- 50 entries, to get a feel for it. --Rifleman 82 (talk) 17:29, 22 January 2008 (UTC)
- 50 entries have been uploaded for you here: http://www.chemspider.com/docs/wikipedia/Structures_1_to_25.pdf
- 50 entries for me, as a trial, I'll see if I can get the access somehow. Walkerma (talk) 20:49, 22 January 2008 (UTC) Note that I may not get to these for a couple of weeks, it will involve a special trip to another college. But I want to see what's involved. Walkerma (talk) 07:43, 26 January 2008 (UTC)
- http://www.chemspider.com/docs/wikipedia/Structures_26_to_50.pdf
Let's try this format... if it works we will stick with it. I will not generate any more until you've moved through these, but no pressure on you. I will then generate in blocks of 25. --ChemSpiderMan (talk) 00:49, 24 January 2008 (UTC)
- 150 entries for validation. I will put the project on hold until we have worked our way through these. All are here:
- 1 to 25 and
- 26 to 50 and
- 51 to 75 and
- 76 to 100 and
- 101 to 125 and
- 126 to 150 -- Done with these 25. Notes: 1) found no CAS number for Activella; 2) the only number I found for Adenosine thiamine triphosphate is for the neutral chloride salt, which does not match the structure diagram; 3) found a CAS number for acriflavine, but CAS has no structure diagram, which is unusual. In the cases with no chemboxes I just added the CAS number to the talk page or somewhere in the article. --Itub (talk) 14:27, 12 February 2008 (UTC)
Enjoy.--ChemSpiderMan (talk) 03:38, 4 February 2008 (UTC)
I have now updated the links above so you can access the files without passwords.--ChemSpiderMan (talk) 01:20, 6 February 2008 (UTC)
References
See also
- Wikipedia:WikiProject Chemistry/Curation (the initial list of chemicals being validated, missing some inorganics)
- Wikipedia:WikiProject_Chemicals/Inorganics (a list which aims to include all inorganics, to supplement the Curation list)
- List of CAS numbers by chemical compound