NTFS discovery on single share with 18 million files/folders, only one task generated

I'm trying to prove the feasibility of using Enterprise Reporter for a number of ongoing data projects in my org.

I have some questions about expected behaviour regarding NTFS discoveries. I have a test discovery currently running on a single share on my Isilon; this location has roughly 18 million files/folders.

I only have one node in my testing cluster, and I'm not sure if/how additional nodes would behave. I only see a single discovery task assigned in the discovery, and it appears to be processing serially through the structures.

I do see the option to create a separate task for each share, but this discovery only has a single share, so that doesn't seem like it will help.

At this point it has been running for 6 days. I'm trying to compare that to the estimated times from the user guide, and it just doesn't seem to add up.

So I'm trying to understand exactly how this should be operating. Will Quest only ever assign a single task per share in the discovery? If I have multiple nodes in my cluster, will the task be split amongst all of the nodes?

What's the appropriate configuration for performing discoveries at this scale?

  • Hi,

    As you have seen, only one node can be assigned to a share on a server. If you are collecting from a lot of shares and have numerous nodes, the nodes will automatically be assigned to tasks, one task per share. You mention that it has been 6+ days and Enterprise Reporter is still processing 18 million files and folders. While this may seem excessive, I would like to bring something to your attention. If you have selected all of the options, such as collecting advanced file metadata and calculating duplicate files, then every file needs to be accessed for the advanced data, and every file needs to be checked against every other file (all 18 million) to see if it is a duplicate. The same applies to folder permissions: if a group account is found, all permissions must be collected recursively, which can take an extremely long time depending on your Active Directory.
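
    Quest doesn't document how ER implements duplicate detection internally, but a minimal sketch of the usual approach (group files by size, then hash only the candidates) shows why the option is so expensive: every candidate file must be read in full. Everything below is illustrative, not ER's actual code.

    ```python
    import hashlib
    import os
    from collections import defaultdict

    def find_duplicates(root):
        """Illustrative duplicate detection: group by size, hash candidates.

        Even with the size pre-filter, every same-size file must be read
        in full to hash it, which is why this option dominates runtime on
        an 18-million-file share.
        """
        by_size = defaultdict(list)
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                try:
                    by_size[os.path.getsize(path)].append(path)
                except OSError:
                    continue  # unreadable entry; skip it

        by_hash = defaultdict(list)
        for size, paths in by_size.items():
            if len(paths) < 2:
                continue  # a unique size cannot have duplicates
            for path in paths:
                h = hashlib.sha256()
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        h.update(chunk)
                by_hash[h.hexdigest()].append(path)

        return {k: v for k, v in by_hash.items() if len(v) > 1}
    ```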

    One other thing you should check is the log on the node, to see whether it is actually collecting data. It has been known for a node to stop processing data while ER still reports the job as processing. You can find the node log on the system where the node is installed, in C:\ProgramData\Quest\Enterprise_Reporter.
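
    If you'd rather not eyeball the log files by hand, a quick sketch like the one below flags logs that have gone quiet while the job still claims to be running. The directory is the one above; the *.log pattern and the 15-minute threshold are assumptions, so adjust them to what you actually see in that folder.

    ```python
    import glob
    import os
    import time

    # Directory from the reply above; the *.log pattern is an assumption --
    # match whatever file names you actually find there.
    LOG_DIR = r"C:\ProgramData\Quest\Enterprise_Reporter"

    for path in glob.glob(os.path.join(LOG_DIR, "**", "*.log"), recursive=True):
        age_min = (time.time() - os.path.getmtime(path)) / 60
        status = "ACTIVE" if age_min < 15 else "STALE"
        print(f"{status:6} last write {age_min:7.1f} min ago  {path}")
    ```

    A log that hasn't been written to in a long stretch while the discovery still shows as processing is a good hint the node has stalled.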

    If you have selected all of the possible options, I would recommend that you stop the job, select fewer options, and just retrieve files and folders with basic permissions. Once you have a good idea of the file and folder structure, you can then either create a new discovery for specific folders, or enable more options and perform a delta discovery.

  • Thanks for the info; this is what I was afraid of. Requirements will be pulling this data on a relatively high cadence, and we really will need all of the data pulled. Since this is the first of many very large data shares, I can't wait months for complete data sets.

    Really annoying that there is no method to enable multiple tasks for single-share discoveries, or any other options short of building thousands and thousands of discoveries. Will most likely have to investigate other vendors.

  • "or any other options short of building thousands and thousands of discoveries."

    I don't see why you would need thousands of discoveries - the data gathering is not that slow.

    "requirements will be pulling this data on a relatively high cadence."

    So what exactly is driving this cadence?

    If it's monitoring changes to the ACLs, you might be better off looking at a product like Change Auditor for File Systems, which would give you a more transactional view.

    The other recommendation I always make when dealing with the file system is to prioritize collecting data on those folder structures that contain the most sensitive data.  


  • We're beginning an unstructured data project, and as part of the requirements the project wants this data available wholesale and needs it to be recent before analysis and cleanup processes can begin.

    As it encompasses all of our data structures, we need good data consistently, i.e. ensuring we pick up newly saved/modified data and the like.

    If it takes a full week to process only our home folder locations once, then extrapolating to the entire data structure in scope, I'm looking at months, short of building a hundred additional discovery nodes, which is cost prohibitive.
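
    For what it's worth, the numbers in this thread make the extrapolation easy to sanity-check. A rough back-of-envelope, where the 18 million objects and 6 days come from this thread and the 200 million total is a placeholder assumption (substitute your real estate size):

    ```python
    # Back-of-envelope throughput check using the figures from this thread.
    processed = 18_000_000       # objects scanned so far (from this thread)
    elapsed_days = 6             # wall-clock time so far (from this thread)
    total_objects = 200_000_000  # placeholder estate size -- replace with yours

    rate_per_sec = processed / (elapsed_days * 86_400)
    full_scan_days = total_objects / rate_per_sec / 86_400

    print(f"~{rate_per_sec:.0f} objects/sec observed")      # ~35/sec
    print(f"~{full_scan_days:.0f} days for one full pass")  # ~67 days at this rate
    ```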

    So I'm left with... asking to stand up additional discovery nodes... essentially one per location.

    I'll try experimenting with an NTFS discovery minus the permissions, but that data is highly desired by a number of other projects, so I'm trying to kill multiple birds with one stone.

    Either way you cut it, only allowing one task per share is a very limiting decision that has serious negative ramifications for large-scale discoveries.

  • You could break up the main share into a number of temporary hidden shares, say five or six or more, and then have a node for each temporary share. You would have all the data in a reduced time. This, with fewer options selected, should speed up the process. Blocks of data would then become available for advancing your project.
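
    If you go that route, the main question is where to draw the share boundaries so each node gets roughly equal work. Purely as an illustration (the UNC path is hypothetical, and counting immediate entries is only a crude proxy for scan effort), a greedy split of the top-level folders into N buckets might look like the sketch below; the hidden shares themselves would still be created on the Isilon.

    ```python
    import heapq
    import os

    def partition_top_level(root, buckets=6):
        """Greedily split top-level folders into roughly equal buckets.

        Counts immediate entries per top-level folder as a cheap proxy
        for scan effort, then assigns each folder to the currently
        lightest bucket. Each bucket becomes one temporary hidden share
        with its own discovery node.
        """
        folders = []
        for entry in os.scandir(root):
            if entry.is_dir(follow_symlinks=False):
                count = sum(1 for _ in os.scandir(entry.path))
                folders.append((count, entry.path))

        # Largest folders first, each into the currently lightest bucket.
        heap = [(0, i, []) for i in range(buckets)]
        heapq.heapify(heap)
        for count, path in sorted(folders, reverse=True):
            load, i, members = heapq.heappop(heap)
            members.append(path)
            heapq.heappush(heap, (load + count, i, members))

        return sorted((load, members) for load, _i, members in heap)

    # Hypothetical share path -- replace with your own.
    for load, members in partition_top_level(r"\\isilon\bigshare"):
        print(f"~{load} items -> candidate share roots: {members}")
    ```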