You may use this tool to search your own data and other files stored in your individual computer accounts on University-owned systems. A technician or administrator may use this tool to search data and other files stored in individual computer accounts on University-owned systems within his or her scope of responsibility if the technician:
The technician should avoid opening the files Spider finds. Instead, send the names of the files found to the owner of the account/system where the files were stored, and direct the owner to review the files and take appropriate action.
For more details, see the Policy on Privacy of Indiana University Information Technology Resources.
Searching a Windows computer for a word or phrase is simple. Click Start, Search, select All files and folders, and enter the word or words you want to look for in the A word or phrase in the file: box. That works well when you know exactly what text you are looking for. If you wanted to search your own computer for your Social Security Number, you could simply enter it (for example, 999-99-9999). But that does not work when you want to find any Social Security Number on the hard drive.
That is why the security team at Cornell University wrote Spider. Spider is able to search any drive attached to the computer (physical or network) for patterns of data, not just exact strings.
The computer that is going to run Spider needs to have the Windows .NET Framework installed. You can install the .NET Framework through Microsoft Update. Select Custom, let your system be analyzed, then select the Software, Optional area on the left side of your window. You can now select the .NET Framework for installation. After the .NET Framework is installed and your computer has been rebooted, return to Microsoft Update to install critical patches for the .NET Framework.
Once the computer has been prepared, you can install Spider by downloading the installer from Cornell's computer security web site. Run setup.exe as an administrator on the computer. The installation is straightforward.
You need to be logged into your computer as an administrator. If you cannot log in as administrator, contact your Local Support Provider or IT Staff for help.
- Note: At Indiana University, the IT Security Office (ITSO) recommends that you normally run your Windows computer as a member of the Users Group, not as an administrator or a member of the Power Users Group. For more information, see the Knowledge Base document In Windows 2000 and later, why should I avoid running my computer as an administrator or Power User? For tasks requiring administrative access, you can gain it quickly using the Windows Secondary Logon service. For more information, see the Knowledge Base document What is the principle of least privilege?
After you start the Cornell Spider program, you get a pretty simple interface. To look at how Spider is configured, click Configure then Settings from the menu at the top of the window.
Selecting the Files tab displays more tabbed options. Under the Directory tab you can configure which drives or folders you are going to scan. Notice that you can select network drives that have been mapped to other computers. The Types tab allows you to specify a list of file extensions that you want to scan or exclude from the search. The list is pre-populated with common file types that are known to produce false positives. That means that if you were to enable scanning on one of those file types, MP3 for example, you are more likely to find a random string of numbers that fits the pattern you are searching for than an actual Social Security Number. As of Windows version 2.9.3, Spider is not able to search Outlook archive files (.pst). You need to use the Linux version (boot the computer from a Linux CD that includes Spider) or use another means to search for data in .pst files.
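The effect of an extension exclusion list can be illustrated with a short Python sketch. This is not how Spider itself is implemented; the function name should_scan and the three-entry exclusion list are hypothetical and exist only to show the idea of skipping file types that tend to produce false positives.

```python
from pathlib import PurePath

# Hypothetical exclusion list for illustration only; Spider's real
# pre-populated list is longer and is managed on the Types tab.
EXCLUDED_EXTENSIONS = {".mp3", ".jpg", ".exe"}


def should_scan(filename, excluded=EXCLUDED_EXTENSIONS):
    """Return True unless the file's extension is on the exclusion list."""
    # Compare in lower case so that SONG.MP3 and song.mp3 are treated alike.
    return PurePath(filename).suffix.lower() not in excluded


candidates = ["budget.xls", "song.MP3", "notes.txt", "photo.jpg"]
to_scan = [name for name in candidates if should_scan(name)]
print(to_scan)
```

Only budget.xls and notes.txt remain in the list to scan; the media files are skipped before any pattern matching takes place.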
Next you have the Regexes tab. Regular expressions let you search for patterns instead of exact strings. For instance, you can search for any single digit from 0 to 9 using a string like this:
[0-9]
To search for three digits in a row, you could use:
[0-9][0-9][0-9]
For more information on regular expressions, see the Wikipedia article on regular expressions.
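If you want to experiment with these two patterns outside of Spider, the following is a minimal sketch using Python's standard re module. The sample text is made up, and Python's regular expression engine is not necessarily identical to the one Spider uses, although simple character classes like these should behave the same way.

```python
import re

# Made-up sample text for illustration.
text = "Room 7, extension 555, badge 12"

# [0-9] matches one digit at a time.
single_digits = re.findall(r"[0-9]", text)

# [0-9][0-9][0-9] matches only where three digits appear in a row.
three_in_a_row = re.findall(r"[0-9][0-9][0-9]", text)

print(single_digits)
print(three_in_a_row)
```

The first pattern finds every digit individually, while the second finds only 555, because 7 and 12 are shorter than three digits.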
The writers of the Cornell Spider have already built some common regular expressions into the program. To enable these search patterns, you simply select the check box next to the type. If you are confident that a computer does not contain credit card numbers, you can disable those searches. The fewer patterns Spider has to match, the faster it runs.
The built-in searches are not exhaustive. They simply look for very common patterns. If you know that the sensitive data you or your colleagues work with is in a particular format, you should create a specific regular expression for that pattern.
You can also search for custom strings by clicking the Add regex box on this screen. You can enter anything from a simple string you want to match exactly (payroll) to a complex regular expression. For example, to find 9 digits surrounded by commas, use ,\d{9}, as the expression. Be careful with these expressions. A search for too common a string (any nine digits with no separators) will yield a large number of false positives.
To find 9 digits surrounded by commas (common in IUIE data extracts) ,123456789,
,\d{9},
To find a SSN formatted number separated by spaces 123 45 6789
\b\d{3} \d{2} \d{4}\b
To find any 9 digits in a row (with a very high false positive rate) 123456789
\d{9}
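The three custom expressions above can be tried against a sample line before you add them to Spider. The following is a minimal sketch in Python, assuming its re module treats these expressions the same way Spider does; the sample line, including the order number, is invented for illustration.

```python
import re

# Invented sample line: one comma-delimited value, one space-separated
# value, and a 10-digit order number that is not an SSN at all.
line = "jdoe,123456789,Payroll; alt format 123 45 6789; order 9876543210"

comma_delimited = re.findall(r",\d{9},", line)
space_separated = re.findall(r"\b\d{3} \d{2} \d{4}\b", line)
any_nine_digits = re.findall(r"\d{9}", line)

print(comma_delimited)
print(space_separated)
print(any_nine_digits)
```

The bare \d{9} expression reports the first nine digits of the order number as a match, which is the kind of false positive the more specific expressions avoid.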
Here are some common regular expressions (written by Northwestern University) that you can use, built for use by the Cornell Spider.
SSN w/ dashes:
[0-7]\d{2}\-\d{2}\-\d{4}
SSN w/ dashes and breaks:
\b[0-7]\d{2}\-\d{2}\-\d{4}\b
SSN – 9 consecutive digits:
[0-7]\d{8}
SSN – 9 consecutive digits with breaks:
\b[0-7]\d{8}\b
SSN w/ spaces:
[0-7]\d{2}\s\d{2}\s\d{4}
SSN w/ spaces and breaks:
\b[0-7]\d{2}\s\d{2}\s\d{4}\b
All SSN search options with breaks:
\b[0-7]\d{2}\-\d{2}\-\d{4}\b|\b[0-7]\d{8}\b|\b[0-7]\d{2}\s\d{2}\s\d{4}\b
All SSN search options with no breaks:
[0-7]\d{2}\-\d{2}\-\d{4}|[0-7]\d{8}|[0-7]\d{2}\s\d{2}\s\d{4}
Visa/Mastercard/Discover:
\d{4}\-\d{4}\-\d{4}\-\d{4}|\d{4}\s\d{4}\s\d{4}\s\d{4}
Visa/Mastercard/Discover with breaks:
\b\d{4}\-\d{4}\-\d{4}\-\d{4}\b|\b\d{4}\s\d{4}\s\d{4}\s\d{4}\b
American Express:
\d{4}\-\d{6}\-\d{5}|\d{4}\s\d{6}\s\d{5}
American Express with breaks:
\b\d{4}\-\d{6}\-\d{5}\b|\b\d{4}\s\d{6}\s\d{5}\b
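The difference between the "with breaks" and plain versions of these expressions is the \b word-boundary anchor, which stops a pattern from matching inside a longer run of digits. The following is a minimal Python sketch of that difference using the SSN-with-dashes expressions from the list; the sample text and part number are invented, and Python's re module is assumed to interpret \b the same way Spider's engine does.

```python
import re

# Expressions copied from the list above.
SSN_DASHES = r"[0-7]\d{2}\-\d{2}\-\d{4}"
SSN_DASHES_BREAKS = r"\b[0-7]\d{2}\-\d{2}\-\d{4}\b"

# Invented sample: one SSN-shaped value and one longer part number
# that happens to contain an SSN-shaped substring.
sample = "SSN 123-45-6789 on file; part no. 98123-45-67890"

without_breaks = re.findall(SSN_DASHES, sample)
with_breaks = re.findall(SSN_DASHES_BREAKS, sample)

print(without_breaks)
print(with_breaks)
```

Without breaks, the expression matches twice because it also finds 123-45-6789 inside the part number. With breaks, only the standalone value is reported. The versions with breaks usually give fewer false positives, at the cost of missing values that are embedded directly in other text.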
The Logging tab gives you three ways to log the results of your Spider scan: to a plain text file, to the Windows event log, or to a Unix syslog. The simplest way to use Spider is to write to a local log file and open the resulting file in a text editor to look for matches.
Simply click Run Spider.
Spider does not automatically display results. You have to open the log manually. If you logged locally to the hard drive (the default is c:\spider.log), you can select File, View Log. Click Open and you will see any file that contained the pattern you searched for.
Note: Just because Spider found a result, that does not guarantee the type of data found is in fact sensitive. We frequently find that international phone numbers are written in a way that causes them to match the social security number search pattern. You will need to open the files listed in the Spider log and decide if they really are a risk or if the result was a false positive.
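The phone number problem is easy to reproduce. The following is a minimal Python sketch that runs the SSN-with-dashes-and-breaks expression against two invented lines, one with an SSN-shaped value and one with a made-up international phone number; the numbers shown are placeholders, not real data.

```python
import re

# SSN w/ dashes and breaks, as listed earlier in this document.
SSN_WITH_BREAKS = re.compile(r"\b[0-7]\d{2}\-\d{2}\-\d{4}\b")

# Invented sample lines; the phone number is a placeholder.
lines = [
    "Employee SSN: 123-45-6789",
    "Tokyo office: +81 045-12-3456",
]

# Collect (line, matched text) for every match on every line.
hits = [
    (line, match.group())
    for line in lines
    for match in SSN_WITH_BREAKS.finditer(line)
]

for line, matched in hits:
    print(f"{matched!r} found in: {line}")
```

Both lines are reported, because the pattern describes only the shape of the number, not its meaning. Deciding which result is a real risk still requires a person to review the context.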
Also note: Just because Spider did not find any results, that does not guarantee that the computer does not contain sensitive data. It simply means that the patterns used by Spider to search your computer did not find any results.
This document was written based on Cornell Spider 2.9.3 for Windows. A Linux version exists that operates in a client-server fashion. A Mac version has been released and is available on the Cornell Security Tools website.