This is the talk page for discussing improvements to the Tesseract (software) article. This is not a forum for general discussion of the article's subject.
This article is written in American English, which has its own spelling conventions (color, defense, traveled), and some terms used in it may be different or absent from other varieties of English. According to the relevant style guide, this should not be changed without broad consensus.
This article is rated Start-class on Wikipedia's content assessment scale.
Not quite free software?
Although most of Tesseract is free software under the Apache License v2.0, the Aspirin neural network engine may not be. I've no idea if that license is free. I might email the FSF and ask - David Gerard 20:58, 7 September 2006 (UTC)
- It seems Aspirin was removed in v. 1.02. Rwxrwxrwx 18:25, 5 November 2006 (UTC)
- Yeah, I finally got email back from the FSF - they asked Google about that bit of the licence and Google apparently went "oops" :-) - David Gerard 16:23, 15 April 2007 (UTC)
User-friendly versions
Tesseract seems rather technically challenging to install/configure. FreeOCR is built on it, and may be more user-friendly for people who have the required Windows 2K/XP. Archivista Box is a complete document management solution on a Linux live CD that includes Tesseract.[1] [2] The ISO download is here:[3] Do any other live CDs include Tesseract? Does anyone make it available as an online tool? It is odd that this is a Google project, but they aren't making it available in readily usable forms. -69.87.204.80 20:34, 2 October 2007 (UTC)
- Tesseract is available on the Ubuntu repositories via the Synaptic package manager. It is therefore very easy to install, just a matter of checking a couple of boxes. Using it from the command line is also very simple as described in the Ubuntu Documentation - Ahunt (talk) 12:31, 28 June 2008 (UTC)
Userbox
If you use Tesseract, please feel free to put this userbox on your user page!
Code | Result | |
---|---|---|
|{{User:Ahunt/Tesseract}} | Usage |
Formats
I've just tried to scan a file on Ubuntu. I got this output:
screenshot.bmp: Not a TIFF or MDI file, bad magic number 19778 (0x4d42).
It seems that Tesseract wants a TIFF, or Microsoft's proprietary version of TIFF. No BMP. That contradicts the article. — Chameleon 23:53, 20 August 2008 (UTC)
- You are quite right: the article is wrong and the Ubuntu wiki is right. I will fix the article. If you use ".tif" (and only that extension) it works really well. - Ahunt (talk) 00:07, 21 August 2008 (UTC)
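As a rough illustration of the error message above (a sketch only, not Tesseract's or libtiff's actual code): the reported magic number 19778 (0x4d42) is what the two ASCII bytes "BM", which open a BMP file, look like when read as a little-endian 16-bit integer. TIFF files instead begin with "II" or "MM". The helper name `describe_magic` is hypothetical, and the little-endian read is an assumption chosen because it reproduces the number in the message.

```python
import struct


def describe_magic(header: bytes) -> str:
    """Classify an image by its first two bytes (hypothetical helper)."""
    if len(header) < 2:
        return "too short to identify"
    # Assumption: interpret the first two bytes as a little-endian
    # unsigned 16-bit integer, which matches the number in the error.
    (magic,) = struct.unpack("<H", header[:2])
    if header[:2] == b"BM":
        kind = "BMP"
    elif header[:2] in (b"II", b"MM"):
        kind = "TIFF"
    else:
        kind = "unknown"
    return f"{kind}: magic number {magic} (0x{magic:04x})"


# A BMP header starts with the bytes "BM"; the rest is padding here.
print(describe_magic(b"BM" + bytes(12)))
```

Running the same check on a file before passing it to Tesseract would show whether it needs converting to TIFF first.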
Spell checking?
A spell checker is not integrated, it seems. -- Matthead Discuß 13:02, 26 February 2011 (UTC)
- No it isn't. - Ahunt (talk) 14:50, 26 February 2011 (UTC)
- BTW, thank you very very much for replacing the link to a web page explaining how to turn on the hOCR feature with a "Citation needed". This will improve the article and the reliability of wikipedia a lot. Keep up your good work. -- Matthead Discuß 18:10, 26 February 2011 (UTC)
- And you should read WP:CIVIL because sarcasm like that isn't civil. You should also have a read of WP:SPS where it says: "Anyone can create a personal web page or pay to have a book published, then claim to be an expert in a certain field. For that reason, self-published media, such as books, patents, newsletters, personal websites, open wikis, personal or group blogs, Internet forum postings, and tweets, are largely not acceptable as sources." If you can find a proper ref for that feature then great, otherwise the wording will be removed from the article as explained at WP:V, which says "The threshold for inclusion in Wikipedia is verifiability, not truth; that is, whether readers can check that material in Wikipedia has already been published by a reliable source, not whether editors think it is true." - Ahunt (talk) 18:25, 26 February 2011 (UTC)
- Thank you for making Wikipedia such a nice place. Please go ahead and remove the offending gibberish of mine. -- Matthead Discuß 19:26, 26 February 2011 (UTC)
- Why don't you drop the incivility and find a ref for your text instead. I have done a search, but haven't found one yet. - Ahunt (talk) 20:01, 26 February 2011 (UTC)
- Had to go through the Tesseract Issues Logs but I found the whole history of it there and added it as a ref. It is a primary source, though so it would be ideal to have a reliable third party ref as well. - Ahunt (talk) 20:12, 26 February 2011 (UTC)
Should the reference to FreeOCR be removed?
Should the reference to FreeOCR be removed from the article on Tesseract (software)?
The user comments section at http://download.cnet.com/FreeOCR/3000-10743_4-10717191.html emphatically identifies FreeOCR as sneakware.
Please note: the initial download of FreeOCR is only a download of an installer; the installer itself passes virus scans, but then the installer goes on to download the bulk of the product. — Preceding unsigned comment added by 74.94.104.84 (talk) 20:09, 5 February 2014 (UTC)
Well, there is a redirect from FreeOCR to this article, so it may be smarter to just tell the whole story instead. - Ahunt (talk) 20:57, 5 February 2014 (UTC)
Someone braver than I might want to check, but currently (April 2018) the FreeOCR download is about 10 megabytes and the download page seems to be more reputable than before, so maybe things have changed. Or maybe not :) Someone (someone else) should try it out and see .... 116.231.75.71 (talk) 11:47, 15 April 2018 (UTC)
- Oddly FreeOCR now redirects here to this article, but is not mentioned on the page. I think that redirect needs to be deleted. - Ahunt (talk) 12:46, 15 April 2018 (UTC)
- Done - Ahunt (talk) 12:49, 15 April 2018 (UTC)
External links modified
Hello fellow Wikipedians,
I have just modified 2 external links on Tesseract (software). Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:
- Corrected formatting/usage for http://google-code-updates.blogspot.com/2006/08/announcing-tesseract-ocr.html
- Corrected formatting/usage for http://code.google.com/p/tesseract-ocr/issues/detail?id=263
When you have finished reviewing my changes, please set the checked parameter below to true or failed to let others know (documentation at {{Sourcecheck}}).
An editor has reviewed this edit and fixed any errors that were found.
- If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
- If you found an error with any archives or the URLs themselves, you can fix them with this tool.
Cheers. - cyberbot II (Talk to my owner: Online) 16:42, 31 March 2016 (UTC)
one of the most accurate open-source OCR ??
Tesseract is considered one of the most accurate open-source OCR engines currently available.[1][2]
- ^ Canonical Ltd. (February 2011). "OCR". Retrieved 2011-02-11.
- ^ Willis, Nathan (September 2006). "Google's Tesseract OCR engine is a quantum leap forward". Retrieved 2008-07-18.
The two references given are 6 and 9 years old. Are there any newer references? Otherwise the statement seems to be a little pretentious. --Dichter (talk) 13:09, 27 April 2017 (UTC)
- The refs are still valid, but I think it should be dated and I will add that. See what you think. - Ahunt (talk) 13:39, 27 April 2017 (UTC)
Ad hoc logo?
Does anybody have an official Tesseract page that uses the image that is listed as the logo here? The original URL for the image points to a consulting company that seems only tenuously related to Tesseract (though I didn't delve). I did an image search for the displayed image and only found this page and a few blog entries that likely cut/pasted from here. I think we should either post a citation to an official Tesseract page for the logo or cut it. B k (talk) 19:50, 30 January 2020 (UTC)