Hey,
Well - I've got a need for something I can't find elsewhere. If you can think of an existing program that satisfies my needs, please LET ME KNOW so I don't have to make my own! I'm experienced with PHP, so I'm not asking for help - fear not!
Assuming nobody knows of something already, here's my idea - I'd like feedback and suggestions about what to put in it. Please understand it's for my own personal use and will be built as such, but I'll release it as open source so others can use it, tinker, upgrade, etc.!
The problem:
I have lots of hard drives - especially as a computer engineer who gets in lots of half-working drives (good enough for storage; not good enough for customers to use). Plus lots and lots of old files. My partner and I both download regularly, as do my brother and parents (at different locations). I would like a file server, running on Windows or Linux and written in PHP, that would search through all my files and drives, create hashes of them all (MD5 or similar), attach filenames, and add additional information (such as ID3 tags, video dimensions, files contained within ZIP files, etc.) to a database.
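The core indexing pass described above could be sketched roughly like this, assuming a PDO database connection and a hypothetical `files` table (the table and column names are mine, not a final design). Per-file-type metadata (ID3 tags, ZIP contents) would be pulled in at the same point:

```php
<?php
// Sketch: walk a directory tree, hash every file with md5_file(),
// and record path, name, size and hash in a (hypothetical) `files` table.
function indexDirectory(string $root, PDO $db): int
{
    $stmt = $db->prepare(
        'INSERT INTO files (path, name, size, hash) VALUES (?, ?, ?, ?)'
    );
    $iter = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );
    $count = 0;
    foreach ($iter as $file) {
        if (!$file->isFile()) {
            continue; // skip directories, symlink loops, etc.
        }
        $path = $file->getPathname();
        $stmt->execute([
            $path,
            $file->getFilename(),
            $file->getSize(),
            md5_file($path),
        ]);
        $count++;
    }
    return $count; // number of files indexed
}
```

`RecursiveDirectoryIterator` handles the tree walk, so the function stays short; swapping `md5_file()` for `sha1_file()` is a one-line change.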
If you 'upload' a file to the system, it'll hash it, gather all the information, and check whether the file already exists. If it does, you'll be given the option to add it again or skip it. This matters because the file may exist on another server - say, my brother's at a different location, not on the LAN but over the Internet - and of course I'd want a local copy of a big file like that, not to use his!
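The duplicate check at 'upload' time boils down to one hash lookup. A minimal sketch, again against a hypothetical `files` table (the `server` column is my assumption for tracking which node holds each copy):

```php
<?php
// Hash the incoming file and look that hash up in the index. Any rows
// returned are existing copies (local or on another server), so the
// caller can offer "add anyway" or "skip". Table/column names hypothetical.
function findDuplicates(string $newFile, PDO $db): array
{
    $stmt = $db->prepare('SELECT path, server FROM files WHERE hash = ?');
    $stmt->execute([md5_file($newFile)]);
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}
```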
It'll connect to multiple servers and automatically update each one with a list of the files on every system. These files could (potentially, as a future upgrade?) be encrypted if they're private, or simply flagged as private as an initial measure.
So - if you're looking for a file (be it a document, film, program, etc.) you simply type in a keyword and it'll search its entire index - probably pretty damn quickly - and return results with file sizes, information, and locations. You choose the best location and it'll send the file to you. As it's all done through the Apache web server, it's accessible online, from a LAN, and so on.
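Since every node holds a copy of the index, the keyword search is just a local database query. A minimal sketch using a `LIKE` match on the filename (a MySQL `FULLTEXT` index would scale better; table and column names are hypothetical):

```php
<?php
// Keyword search against the local copy of the index: return name, size
// and location of every match so the user can pick the best source.
function searchIndex(PDO $db, string $keyword): array
{
    $stmt = $db->prepare(
        'SELECT name, size, path, server FROM files WHERE name LIKE ?'
    );
    $stmt->execute(['%' . $keyword . '%']);
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}
```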
I have plenty of issues to overcome, such as: how do you make it automatically index new files, and how will it know a drive has been removed (a hot-swap drive or a USB/external drive, for instance) without constantly polling said devices? But they'll be overcome!
Who would like such a system, and what are people's opinions?!
Cheers
Mark
PHP File Server [New idea - feedback requested!]
It would be pretty nice, but are you sure it will be fast? I don't think so. PHP isn't that slow, but it has its limits, you know. Remember, your storage can hold millions of files, and indexing them will take time. On the other hand, you could run the index-maker on weekends only - then I guess it would work fine. Basically, I like the idea. It's very interesting.
Hi roxaz, thanks for the reply!
PHP in itself I believe will be fast enough, but you're right that there will be speed problems. The ones I have thought of so far are:
- Creating hashes of every file rapidly. This isn't so hard - SHA-1 is fast enough, as is MD5 - and we won't index files at full speed, but instead leave a small delay (a second?) between hashes to keep server load low. Remembering that the server will most likely run on your home PC, that's important!
- The actual MySQL database, and handling thousands (millions) of files, hashes, descriptions, aliases, etc. There's a tradeoff between speed, storage (normalisation) and correctness. A flat database could be extremely quick, but won't have the redundancy protection and correctness of a fully normalised one - while a fully normalised design has speed implications of its own. My experience has always been a happy mixture of the two, partly because I'm not a MySQL expert and can't fine-tune queries so tightly that they'll execute at top speed in all situations.
- Indexing new files. One solution - the 'easy' method - is to have an Incoming folder. All files you want indexed (after the initial 'full' index) go in here, are indexed periodically (every minute, perhaps) and are then moved away to prevent reindexing. This isn't perfect, but there's no decent way I know of for PHP to tap into the OS and find out about new file creations - certainly not one that would work on Windows and Linux (and Mac, etc.) without modification.
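The throttled hashing from the first point above could look something like this sketch - the 250ms default delay is an arbitrary example value, not a tuned figure:

```php
<?php
// Hash a batch of files with a short pause between each, so a background
// index pass doesn't hog CPU/disk on a home PC shared with other work.
function hashFilesThrottled(array $paths, int $delayMicroseconds = 250000): array
{
    $hashes = [];
    foreach ($paths as $path) {
        $hashes[$path] = sha1_file($path);
        usleep($delayMicroseconds); // deliberate breather between files
    }
    return $hashes;
}
```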
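As a sketch of the "happy mixture" on the normalisation trade-off: a flat `files` table (indexed on hash for fast duplicate checks) with alternate filenames split into an `aliases` table. All names are hypothetical, and the DDL is shown SQLite-compatible for brevity - the real target would be MySQL:

```php
<?php
// Middle-ground schema: flat main table for speed, one normalised
// side table so alias names don't bloat or duplicate file rows.
function createSchema(PDO $db): void
{
    $db->exec('CREATE TABLE files (
        id   INTEGER PRIMARY KEY,
        hash TEXT NOT NULL,
        path TEXT NOT NULL,
        size INTEGER NOT NULL
    )');
    $db->exec('CREATE INDEX files_hash ON files (hash)'); // fast dedup lookups
    $db->exec('CREATE TABLE aliases (
        file_id INTEGER NOT NULL REFERENCES files (id),
        name    TEXT NOT NULL
    )');
}
```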
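And the Incoming folder idea from the last point, sketched as a function meant to run periodically (from cron or a loop). The folder names and the indexing callback are placeholders:

```php
<?php
// Scan a drop folder, hand each file to an indexing callback, then move
// it into the permanent store so it can't be indexed twice.
function processIncoming(string $incoming, string $store, callable $index): int
{
    $moved = 0;
    foreach (scandir($incoming) as $name) {
        $src = $incoming . DIRECTORY_SEPARATOR . $name;
        if (!is_file($src)) {
            continue; // skip '.', '..' and subfolders
        }
        $index($src); // hash the file and record its metadata
        rename($src, $store . DIRECTORY_SEPARATOR . $name); // prevent re-indexing
        $moved++;
    }
    return $moved;
}
```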
PHP isn't the fastest tool for this job - but that comes with the benefit of tapping straight into TCP/IP, having PHP's feature set (hashing, streams, image output for thumbnails - LOADS of features!) and being cross-platform, whilst remaining easy to set up. For Windows, all I'd need to do is ship a full WAMP (Windows + Apache + MySQL + PHP) setup with the databases, permissions and folders all included - et voila! For Linux, pretty much the same, but Windows will be my target OS. (I develop on Linux for web sites, but Windows is more suitable as this is something to go on all your PCs.)
There are lots of issues to overcome - please keep posting your thoughts and concerns, plus, if you're interested, what would be useful to you to have in this. I've got lots of ideas, so I'll work on the most important first, get a beta up, and then launch a web site to host the project (AFTER the beta is ready - don't want to rush things!)
Cheers!
I must also make a note: I'm aware that a system like this has the capability to become a P2P program in itself (sharing the indexes across all nodes, searching and downloading directly off another node, etc.), but as I don't plan to make the sharing of indexes hyper-efficient (I want it to work nicely with maybe 10-15 nodes, not 10-15 million!) I don't think that will be its use. It may be useful between friends and family as a P2P program, but again only with very small groups - groups that could already share these files by another method.
Also - as each client should keep a copy of the entire index, searching won't require sending a message to every node in the network, just a query on its own database. The index still has to be kept relatively small, but I'd also like to be able to index an ENTIRE drive, so searching for (say) a file in the Windows directory that you've lost is possible: find it on your nearest node and download it. Searching for the title of a song from its ID3 tag is possible, and so on...