Computing desk

September 8

Algorithms to measure how closely two short sections of text match

I want to write a program that parses downloaded bank statements. This would include comparing text strings of up to about 50 characters long with another similar text string, or perhaps a whole group of similar text strings. What algorithm could I use to measure how similar the two compared text strings are? The result would be a number on a scale that at one extreme means an exact match, at the other no match at all. Perhaps such an algorithm may be similar to those which match misspelt words - how do they work? 78.149.167.102 (talk) 00:04, 8 September 2009 (UTC)[reply]

Levenshtein distance is a good article to start with; there are several articles in the same area linked to at the bottom of that article. -- Finlay McWalter • Talk 00:26, 8 September 2009 (UTC)[reply]

Hamming distance is another - but a lot depends on what your desired kind of similarity is...if you look for precise binary differences - then a single letter missing from one string towards the start will make the strings totally different from that point onwards - when from a "human" perspective, they are almost identical. Two interchanged letters is another common typo that you might want to consider to be a 'minimal' change. Two strings might be identical but one is in uppercase and the other lower - not one character matches! You might also want to consider using a Soundex approach - which makes two strings that SOUND similar come out similar in comparisons...handy when a message has been relayed by phone and typed by someone who doesn't spell so well! But it depends on what you want and why. SteveBaker (talk) 03:29, 8 September 2009 (UTC)[reply]
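A minimal Python sketch of the two approaches mentioned so far, for comparison. The function names and the sample statement strings are made up for illustration. The similarity score simply divides the Levenshtein distance by the length of the longer string, which is one common way to get a 0-to-1 scale but not the only one, and case is folded first so that upper and lower case do not count as differences.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # drop ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute (or keep)
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance so 1.0 is an exact match and 0.0 is no match at all."""
    a, b = a.casefold(), b.casefold()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein(a, b) / longest


def positional_matches(a: str, b: str) -> int:
    """Hamming-style count of positions where both strings have the same character."""
    return sum(ca == cb for ca, cb in zip(a, b))
```

With the hypothetical pair "CARD PAYMENT TESCO" and "CRD PAYMENT TESCO", the edit distance is 1, so the similarity stays close to 1.0, whereas the position-by-position comparison agrees only on the first character because everything after the missing letter is shifted.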

For things like bank statements it may also be useful to compare similarity of numerical values, for example saying $100 is closer to $99.99 than to $1000 or $900. PrimeHunter (talk) 12:18, 8 September 2009 (UTC)[reply]
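One possible way to score numerical closeness, sketched in Python. The function name and the choice of relative difference are assumptions, not something the posts above prescribe. Amounts are passed as strings and parsed with Decimal so that values like 99.99 are not distorted by binary floating point.

```python
from decimal import Decimal


def amount_similarity(a: str, b: str) -> float:
    """Return 1.0 for equal amounts, falling towards 0.0 as they diverge."""
    x, y = Decimal(a), Decimal(b)
    largest = max(abs(x), abs(y))
    if largest == 0:
        return 1.0  # both amounts are zero
    score = 1 - abs(x - y) / largest
    # Amounts of opposite sign would go below zero, so clamp the result.
    return float(max(Decimal(0), score))
```

By this measure 100 against 99.99 scores far higher than 100 against 900, which in turn scores slightly higher than 100 against 1000.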

Wikipedia

I'm not sure where to ask this question so I picked this section. How much data is stored in Wikipedia? I'm curious to know. Thanks for any information. NeoJustin (talk) 07:50, 8 September 2009 (UTC)[reply]

Text only, or including media (e.g. pictures)? Only en.wikipedia.org, or *.wikipedia.org? Do you want to include wiktionary.org, and other projects as well? If you only count the pure textual content (with formatting) in en.wikipedia.org, you might use the last dump of the database as a measure. --Andreas Rejbrand (talk) 07:53, 8 September 2009 (UTC)[reply]

I was thinking en.wikipedia.org and everything including pictures. Any breakdown would be nice. I was thinking it must be like trillions and trillions of bytes of data. I'll look at the link. NeoJustin (talk) 08:08, 8 September 2009 (UTC)[reply]

The dump doesn’t seem to provide a total byte count, only 17,449,963 pages of which 2,404,861 pages are in English. If a “page” is, on average, about 10 KB (this is just a guess – someone might have done some research on article length?) then the total in the English Wikipedia would be around 24 GB, but the last dump on September 4th was 9.7 GB compressed, which could expand to nearly 200 GB including talk pages and user pages. Can anyone narrow down the discrepancy?

By way of comparison, The U.S. Library of Congress Web Capture team has claimed that "As of May 2009, the Library has collected almost 100 terabytes of data", and, according to Kevin Kelly of the New York Times, "the entire written works of humankind, from the beginning of recorded history, in all languages" would amount to 50 petabytes of data. Google processes about 20 petabytes of data per day. The big advantages of Wikipedia are its ready accessibility and its relevance (usually). Dbfirs 08:55, 8 September 2009 (UTC)[reply]
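The page-count estimate can be redone as explicit arithmetic. The short Python sketch below uses the figures quoted in the post (2,404,861 English pages, a guessed 10 KB per page, a possible 200 GB expanded size); the variable names are made up, and 1 KB is taken as 1000 bytes, which is an assumption.

```python
ENGLISH_PAGES = 2_404_861        # page count quoted from the dump
ASSUMED_PAGE_BYTES = 10_000      # the 10 KB guess, taking 1 KB = 1000 bytes
EXPANDED_DUMP_BYTES = 200e9      # the "nearly 200 GB" figure


def estimated_total_gb(pages: int, bytes_per_page: float) -> float:
    """Total size in GB if every page had the same size."""
    return pages * bytes_per_page / 1e9


def implied_page_bytes(total_bytes: float, pages: int) -> float:
    """Average page size that a given total would imply."""
    return total_bytes / pages


estimate_gb = estimated_total_gb(ENGLISH_PAGES, ASSUMED_PAGE_BYTES)
implied_bytes = implied_page_bytes(EXPANDED_DUMP_BYTES, ENGLISH_PAGES)
```

Under these assumptions the estimate comes to roughly 24 GB, and a 200 GB total would imply an average of about 83 KB per page, so the two figures differ by a factor of about eight.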

So all the articles on the English Wikipedia would be less than 200 GB. Interesting... I could download it all onto my computer (if I had a reason). NeoJustin (talk) 17:49, 8 September 2009 (UTC)[reply]

I'm not sure whether that dump included pictures (probably not if they reside in commons). Are there any Wikipedia experts reading this who can check on my guesses? Dbfirs 22:54, 8 September 2009 (UTC)[reply]

From Wikipedia database dump, the dump of all current articles is 5 GB, and all current pages (including talk pages etc) is 10 GB. However, this doesn't include page histories or images. Just the full article histories are 2.8 TB, not counting images. — QuantumEleven 09:36, 9 September 2009 (UTC)[reply]

Is this the compressed size? What does it expand to? Dbfirs 21:44, 10 September 2009 (UTC)[reply]

That is the compressed size. Uncompressed it's a lot larger, but remember it's XML, which is an inefficient representation. It gets even worse if you try to install it on your computer (which means expanding it into MySQL tables), but that doesn't really count as Wikipedia itself using the space. I think if you include all the static pictures that you see in articles, Wikipedia is a few hundred GB, but if you include all the media in Commons, it's tens of TB heading towards hundreds of TB. It might be better to ask this question on meta. 70.90.174.101 (talk) 18:03, 11 September 2009 (UTC)[reply]

The August 2009 dump of current article text only is 22.4 GB uncompressed. Look at the Wikimedia technical blog and some of the links off of it. They have good info about the size of Commons, which doesn't do dumps. Also [1] and [2]. Shadowjams (talk) 01:48, 12 September 2009 (UTC)[reply]

Turn the Windows search index off

how

Open Control Panel -> Administrative Tools -> Services, scroll down to "Windows Search" and double-click to open it. Click Stop and then change the startup type to "Disabled" to stop it from restarting. This will obviously have a knock-on effect on the performance of anything trying to use Windows Search. ZX81 ^talk 19:11, 8 September 2009 (UTC)[reply]
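The same two steps can probably also be scripted with the built-in sc command. The Python sketch below assumes the service's internal name is WSearch (check the name in the Services dialog first) and that it runs from an administrator account. By default it only prints the commands rather than running them.

```python
import subprocess
import sys

SERVICE = "WSearch"  # assumed internal name of the "Windows Search" service


def disable_commands(service: str = SERVICE) -> list[list[str]]:
    """Build the sc invocations: stop the service, then disable its startup."""
    return [
        ["sc", "stop", service],
        # sc expects the value as a separate argument after "start=".
        ["sc", "config", service, "start=", "disabled"],
    ]


def disable_service(dry_run: bool = True) -> None:
    """Print the commands, or run them when dry_run is False on Windows."""
    for command in disable_commands():
        if dry_run or sys.platform != "win32":
            print(" ".join(command))
        else:
            # check=False: stopping a service that is already stopped is not fatal.
            subprocess.run(command, check=False)
```

Calling disable_service() shows what would be run; disable_service(dry_run=False) actually runs it on Windows.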

Vision client

Is there any way to turn it off without my teacher knowing? Accdude92 (talk) (sign) 13:21, 8 September 2009 (UTC)[reply]

Probably not, if they've set it up reasonably intelligently. (I'm assuming they don't give you access to the Program Files, Control Panel, or an unencumbered Task Manager). And even if you could, if you got caught, there would probably be severe penalties regarding computer "hacking" and other such things. You'll have a whole, long life to goof around on the computer — surely you can make it through a few months of school. --98.217.14.211 (talk) 13:52, 8 September 2009 (UTC)[reply]

As 98.217 mentioned, unless there is a serious flaw in the software and setup, you will not be able to circumvent it while it is running. But there are other ways to work around such software locks. As mentioned above, try not to get yourself in trouble; schools sometimes overreact to any minor digital deviance with legal threats and a variety of other hassles.

The following advice applies in general to a machine you wish to unlock and to which you have physical access. A common adage is that "physical access to the machine" trumps every software security methodology, but this is not strictly true. A skilled systems engineer can set up many levels of protection.

You could conceivably reboot the machine with a Live CD. This will circumvent almost every software protection on the local machine. However, a few technical caveats come to mind. If the machine is running a Trusted Platform Module, or some other pre-boot hardware lock (including certain setups with encrypted hard disk drives), the machine may refuse to boot any unauthorized software. In fact, there may be a pre-boot BIOS password at power-up, which will deny you the ability to boot from disk or CD. Even if you successfully circumvent these and boot a Live CD, you're not in the clear yet. The network may be configured to deny access to any machine which isn't running exactly as it was configured.

If such draconian technological protections were actually in place, attempting to circumvent them would be technically challenging and would certainly not win you any points with your teacher.

As always, though, the best way to circumvent a computer security system is social engineering. See if you can volunteer to work with the computer or IT department as an extra-curricular activity. It will be a good learning experience, and you will probably gain fuller access to the computers and the network. The "best way" to "break in" to a system is to politely ask permission and earn the privilege to use it. Nimur (talk) 15:04, 8 September 2009 (UTC)[reply]

Short answer to your question is "No". As a computer teacher, I use Vision, and I've had students try to disconnect... if they are successful I see an indication on the classroom overview so I can check on that student personally. 206.131.39.6 (talk) 16:25, 8 September 2009 (UTC)[reply]

My college teacher uses Vision as well (in fact he's probably watching me type this right now...), and it isn't possible to turn it off without them knowing. If you did, they would most likely see that your computer isn't connected. This could easily be justifiable if the computer was actually turned off, but if the lights are off in the classroom, the glow from the monitor onto your face would be a dead giveaway as to what you're up to... Do you go to high school or college? I personally don't highly recommend it, but if it's college, and it has wireless internet access, you could probably bring a laptop to do... whatever it is you want to do. Until It Sleeps ^alternate 15:41, 9 September 2009 (UTC)[reply]

php html

I have a php script that generates a page from its flat file database when I visit a url such as "example.php?page=2". I would like it to also create a .html page on the server so that the php is not being queried constantly to view the page. What would be the best way to do this? —Preceding unsigned comment added by 82.43.88.99 (talk) 14:15, 8 September 2009 (UTC)[reply]

Why? If you save an html page, it's no longer dynamic, which is the point of that setup. And you could have it automatically generate one, but it would be easier to load it and copy the HTML source code from the browser, then just upload that. Ale_Jrb ^talk 15:13, 8 September 2009 (UTC)[reply]

I think the OP wants to cache the generated pages for a time, perhaps for performance reasons. This is precisely what Wikipedia does. Wikipedia's caching is done using the squid proxy. -- Finlay McWalter • Talk 15:18, 8 September 2009 (UTC)[reply]

I am aware that caches exist, and indeed that Wikipedia uses one ;). However, while you could technically call saving a static page in your web directory a form of caching, it really isn't. It would be completely useless to simply save the page - you would have to direct users there. To do that, you would need a dynamic script (or a separate caching server/program) to direct requests there - in most cases, this is no more efficient than simply retrieving a page from the database in the first place, and databases are often (usually?) more efficient than file writes.

It might just be worthwhile if the page in question is performing lots of complicated calculations, but it would still then be better just to cache results in a database and display them directly. And the question says that the original script is just displaying a page - as in '?page=2'. PHP is efficient enough that doing this dynamically is superior in most standard situations to periodically performing a file write to save the HTML, and then redirecting users there.

And for the example of Wikipedia, it generates everything dynamically - it is the separate server software, squid, that performs the caching. Imagine if the Wiki web directory (SVN) had a static HTML page for every article... Ale_Jrb ^talk 15:46, 8 September 2009 (UTC)[reply]

I do this for sites similar to Wikipedia. Users edit content. When they edit the content, I don't just update the database. I generate the entire HTML and replace the html file on the main website. The public hits static HTML pages. However, the site is dynamic in the sense that it is easily updated online by the users. Of course, these pages do not have truly dynamic content that changes based on time of day or browsing history of the user. -- kainaw ™ 15:40, 8 September 2009 (UTC)[reply]

Honestly, that sounds like pretty bad practise to me - you're merging several jobs into one. If you feel the need to have a cache, which only really improves performance on extremely busy sites, you should have a proper cache server - or failing that a cache script that stores requests and responses, and redirects users. Dynamically re-saving HTML pages when someone changes your content isn't the way I'd do things, though I suppose if it works for you, that's fine :). Ale_Jrb ^talk 15:49, 8 September 2009 (UTC)[reply]

Caching is very good practice for sites that don't update constantly. It can save a huge amount of processing on the back end. It's not very hard to develop good caching procedures. I'm not sure where you got the idea that caches are bad, but they aren't, and nearly everybody uses them. Dynamically generating a complex page for each user is not a good idea if you can avoid it. (Does taking up processing power matter? Yes, if you have a lot of users, and yes, if you are co-located, where processes that go over 1-2% of the total CPU can get you disabled temporarily. Running multiple MySQL queries, for example, can easily do that if you are not careful about optimizing them.) Complicated scripts like WordPress can be sped up dramatically with caching, for example.--98.217.14.211 (talk) 15:56, 8 September 2009 (UTC)[reply]

Yes, caching is effective. That's why large sites invest significant resources in cache servers. But if you update your site so infrequently, and it has so few pages, that you consider a script that dynamically saves static HTML pages directly into your web directory for people to view a good thing, then your site would be fine being solely static. A proper cache acts as a proxy between the content server and the user - for most sites, it's on a different machine. Wikipedia is an excellent example - the squid (cache) servers are most definitely not PHP scripts that recreate static HTML whenever you perform an edit. Is that how you would cache wordpress? If so, yikes.

Caching properly is a good thing. That method of caching is (usually) not. See the difference? In a situation where the potential bottleneck is lots of database queries, a far better method would be a system like memcached. Ale_Jrb ^talk 17:50, 8 September 2009 (UTC)[reply]

It depends on the usage. Many of the projects that I work on have one or two changes per month. The entire point behind the PHP-based online admin is to keep them from calling me to change a phone number or add a new employee name to a list. Of course, they still call and ask how to make the change. -- kainaw ™ 15:53, 8 September 2009 (UTC)[reply]

Heh - the danger of the layman. :) Ale_Jrb ^talk 17:50, 8 September 2009 (UTC)[reply]

You could easily write a function that, when called, would check if there was an existing HTML file in the directory, and if so, output its contents, and if not, generate it, and then output its contents. Whenever you updated the flat file, you'd just delete (either manually or algorithmically) the existing HTML file to clear the cache. Depending on the complexity of generating the page, this could easily deliver higher performance. --98.217.14.211 (talk) 15:56, 8 September 2009 (UTC)[reply]
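A rough sketch of that check-then-generate logic, written in Python purely to show the control flow. The function names, the page-N.html file naming and the render callback are all hypothetical. In PHP the same steps would presumably map onto file_exists, file_get_contents and file_put_contents, but that translation is not shown here.

```python
from pathlib import Path
from typing import Callable


def cached_page(page: int, cache_dir: Path, render: Callable[[int], str]) -> str:
    """Return the HTML for a page, generating and saving it only on a cache miss."""
    # Using an integer page number keeps user input out of the file path.
    cache_file = cache_dir / f"page-{page}.html"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    html = render(page)  # the expensive step: build the page from the flat file
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(html, encoding="utf-8")
    return html


def clear_cache(cache_dir: Path) -> None:
    """Delete every saved page; call this whenever the flat file is updated."""
    for cache_file in cache_dir.glob("page-*.html"):
        cache_file.unlink()
```

The first request for a page pays the cost of rendering, and later requests read the saved file until clear_cache is called. Whether this is actually faster than regenerating the page depends on how expensive the rendering is, as the posts above point out.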

woah! lots of answers :) But how exactly do I do this? What is the code I have to enter into my php file to make it do this? —Preceding unsigned comment added by 82.43.88.99 (talk) 18:56, 8 September 2009 (UTC)[reply]

It is not a line of code. It is an entire program (possibly hundreds of lines of code). I seriously doubt anyone is going to flood this page with tons of code. As for caching - that is separate from PHP. You install a caching mod in your webserver. As for writing web pages, you just use fopen and fputs. -- kainaw ™ 19:36, 8 September 2009 (UTC)[reply]

ok, sorry. I am absolutely clueless with php, I thought it would just be a simple matter of telling the script to make an actual .html file out of the html code it sends to the browser when a page is viewed. my bad —Preceding unsigned comment added by 82.43.88.99 (talk) 19:43, 8 September 2009 (UTC)[reply]

Screenshots From Games

I'm thinking of making a blog/diary based on some of the wargames I play regularly. It's more of a private account of certain scenarios I've played and how certain little dramas within those scenarios played out (e.g. defense of a certain building by a certain unit; assault on an enemy position using a particular combination of forces; etc.), and I was wondering if it would be possible to get multiple screen shots without leaving the game. I would take a bunch of screen shots, then organize them after the battle and post them on the blog. Is there any software that does this (i.e. runs side-by-side with a game, with hotkeys to take screen shots, with said screen shots saved to a clipboard or auto-saved)? TIA!! --KageTora - (영호 (影虎)) (talk) 18:21, 8 September 2009 (UTC)[reply]

Fraps. -- Finlay McWalter • Talk 18:29, 8 September 2009 (UTC)[reply]

Excellent! Just what I needed, thanks! --KageTora - (영호 (影虎)) (talk) 18:55, 8 September 2009 (UTC)[reply]

Processor comparison

As per my previous posts detailing my recent bad luck with computers, I've decided to throw in the towel and simply buy a new system. The place I normally buy from (www.neutron.ca) allows me to customize my system piece by piece and I'm out of step with how processors are compared. On the system I'm looking at, the cost will be about the same for either of these two setups: Intel Core 2 Duo E8400 3.00 GHz, 6 MB, 1333 MHz and Intel Core 2 Quad Q8400 2.66 GHz, 4 MB 1333 MHz. My computer gets intermittent heavy use (graphically intensive games, multitasking between videos, web, other progs), but is not a work machine. That is, when it gets used, it tends to get used pretty roughly, but it also lies fallow most of the day. Which setup can handle that kind of work better? Or is the difference between them more subtle than that? Matt Deres (talk) 19:17, 8 September 2009 (UTC)[reply]

If you do more multitasking, the Quad core processor will allow your system to feel more responsive while the Core 2 Duo has more raw "horsepower." Either way, you can't go wrong, and both will be plenty fast. (Personally I'd go for the quad core). When dealing with games, your graphics card is going to pull all the weight, so go for something reasonably new (GTX 200 line for nVidia) and you'll be relatively futureproofed. —Preceding unsigned comment added by 206.131.39.6 (talk) 19:34, 8 September 2009 (UTC)[reply]

They only offer ATI cards, so mine will be a Radeon HD 4670 (1 GB), which should also have lots of horsepower. Given that it's a higher end graphics card, would that make the quad core the smarter choice? Let the quad core work the multi-tasking and let the Radeon handle the games? Matt Deres (talk) 19:44, 8 September 2009 (UTC)[reply]

Modern games are surprisingly well multithreaded; I think you will see a strong performance boost from the quad-core, even if it is mildly weaker in "raw horsepower" per core. Nimur (talk) 20:22, 8 September 2009 (UTC)[reply]

The ATI card you picked is plenty good, I'm just more familiar with the nVidia line (hence my example before I signed in). Nimur also makes a good point in that many games are multithreaded, though any modern dual or quad core CPU won't have any problems as the CPU isn't used as extensively as the graphics card in gaming. You will mainly see a difference in the fact that you can run relatively CPU intensive tasks behind the game while seeing a negligible performance hit to the game. I've encoded video while playing games on my similarly configured computer, and it's done just fine. No matter what processor you ultimately choose, you will not be disappointed. Caltsar (talk) 20:34, 8 September 2009 (UTC)[reply]

Wikipedia:Reference desk/Archives/Computing/2009 September 8
