Automated Analysis
Overview
Trinket is designed to mirror what experienced data scientists do on their first few passes through a new dataset. It intelligently automates large portions of the wrangling and exploration phases of the data science pipeline, integrating them into the initial ingestion (upload) phase.
Architecture
The autoanalysis and text parsing features of Trinket are written in Python. They work by scanning columns of uploaded data with numpy and unicodecsv, then applying one-dimensional kernel density estimates (KDEs), standard analyses of variance (ANOVAs), and other hypothesis tests.
This enables Trinket to perform type identification, e.g. to identify and differentiate: discrete integers, floats, text data, normal distributions, classes, outliers, and errors. To perform this analysis quickly and accurately during data ingestion, Trinket includes a rules-based system trained on previously annotated datasets and coupled with heuristic rules drawn from discussions with a range of experienced data scientists.
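As a sketch of the kind of heuristic such a system might encode (the function and threshold below are illustrative assumptions, not Trinket's actual rules):

import numpy as np

def looks_categorical(column, max_unique_ratio=0.1):
    """Hypothetical rule: a column whose distinct values are rare
    relative to its length is probably categorical."""
    values = np.asarray(column)
    return len(np.unique(values)) / float(len(values)) <= max_unique_ratio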
Mechanics
Autoanalysis works by assigning each column/feature a data type (dtype in the parlance of NumPy and Pandas), e.g. categorical, numeric, real, integer, etc. These types must be automatically inferred from the dataset.
The autoanalysis method takes as input a file-like object and generic keyword arguments, and returns as output a tuple/list whose length is the (maximum) number of columns in the dataset and whose values contain the datatype of each column, ordered by column index.
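A minimal sketch of that interface, assuming a CSV source; here np.array's own dtype inference stands in for the smarter per-column logic discussed below:

import numpy as np
import unicodecsv

def autoanalyze(fileobj, **kwargs):
    """Return a tuple of inferred dtypes, one per column, ordered by column index."""
    reader = unicodecsv.reader(fileobj, **kwargs)
    columns = zip(*reader)  # transpose rows into columns
    return tuple(np.array(col).dtype for col in columns)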
Questions to answer:

How do other libraries like pandas and messytables do this?
Pandas computes histograms, looks for the min and max values of a column, samples quantiles, and counts unique values. 
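Roughly, these are the summaries pandas exposes (a sketch using the public pandas API on an illustrative frame):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2, 9], 'b': ['x', 'y', 'y', 'y']})
print(df['a'].min(), df['a'].max())         # column extrema
print(df['a'].quantile([0.25, 0.5, 0.75]))  # sampled quantiles
print(df['b'].nunique())                    # count of unique values
print(df['a'].value_counts(bins=2))         # a coarse histogram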
Do you have to go through the whole dataset to make a decision?
Yes and no; decide based on how big the dataset is. The strategy sketched below builds a sample from 50 non-empty rows for each column, plus the rows with the longest and shortest values. For larger datasets, maybe sample 10%; for extremely large datasets, 1% might be enough.
Can we use a sampling approach to reading the data? Naive method (assumes straightforward densities):
import numpy as np

def sample_column(col):
    """Naive sampler: longest value, shortest value, first 50 non-empty values."""
    lengths = np.char.str_len(col.astype(str))
    mx = col[lengths.argmax()]           # value with the longest representation
    mn = col[lengths.argmin()]           # value with the shortest representation
    non_empty = col[col.nonzero()][:50]  # first 50 non-empty values via ndarray.nonzero()
    return np.append(non_empty, [mn, mx])

Is there a certain density of data required to make a decision?
This is a good question; some libraries build histograms for each column to examine densities. See the pandas method for histograms. TODO: look into thresholds.
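For example, pandas can bin a column into relative frequencies in one call (illustrative data; one of several ways to get at per-column densities):

import pandas as pd

col = pd.Series([0.1, 0.2, 0.2, 0.9, 5.0])
hist = col.value_counts(bins=10, normalize=True)  # binned relative frequencies
print(hist.sort_index())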
What types are we looking for? string, datetime, float, integer, boolean. See also messytables types.
Attempt parsing from broadest type to narrowest:
import numpy as np

def infer_column_dtype(col_sample):
    """Check sampled values against dtypes from broadest to narrowest."""
    for val in col_sample:
        t = np.asarray(val).dtype.type
        if issubclass(t, np.string_):
            n = max(len(str(v)) for v in col_sample)
            return np.dtype('S%d' % n)        # 'Sn', where n is the max length value in col
        elif issubclass(t, np.datetime64):
            return np.dtype('datetime64[s]')  # new & experimental in NumPy 1.7.0
        elif issubclass(t, np.floating):
            return np.dtype('float64')
        elif issubclass(t, np.integer):
            return np.dtype('int64')
        elif issubclass(t, np.bool_):
            return np.dtype('bool')
        # else: fall through and try the next sampled value
    return np.dtype('O')                      # what about unicode and complex types?
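A usage example tying the two sketches together (sample_column and infer_column_dtype are the hypothetical helpers above, not Trinket's own functions):

import numpy as np

col = np.array(['', '3.14', '2.71', '42'])
print(infer_column_dtype(sample_column(col)))  # string values in, so an 'S' dtype comes back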

What does column-major mean for Trinket?
Use transpose and/or reshape from numpy.
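For example, numpy turns the row-major records a CSV reader yields into a column-major layout in one call (illustrative data):

import numpy as np

rows = np.array([['1', 'a'], ['2', 'b'], ['3', 'c']])  # rows as read from a CSV
cols = rows.transpose()                                # now cols[0] is the first column
print(cols[0])                                         # ['1' '2' '3']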
Can we automatically detect delimiters and quote characters? (e.g. ; vs ,)
See the messytables method for delimiter detection.
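The standard library's csv.Sniffer does something comparable, for reference (a sketch, not the messytables implementation):

import csv

sample = 'name;age\nalice;30\nbob;25\n'
dialect = csv.Sniffer().sniff(sample)  # infer a dialect from a text sample
print(dialect.delimiter)               # ';'
print(dialect.quotechar)               # '"'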
How do we detect if there is a header row or not?
See the messytables method for header detection.
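Again for reference, csv.Sniffer offers a heuristic header check (a sketch, not the messytables approach):

import csv

sample = 'name,age\nalice,30\nbob,25\n'
print(csv.Sniffer().has_header(sample))  # True: the first row differs in type from the rest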
How lightweight/heavyweight must this be?
Look into making it more lightweight using regular expressions and hard-coded rules (see Brill tagging); a sketch follows.
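A lightweight, regex-only pass might look like this (the patterns and their ordering are illustrative assumptions):

import re

# Ordered narrowest to broadest; the first full match wins.
RULES = [
    ('boolean',  re.compile(r'(?i)^(true|false)$')),
    ('integer',  re.compile(r'^[+-]?\d+$')),
    ('float',    re.compile(r'^[+-]?\d*\.\d+$')),
    ('datetime', re.compile(r'^\d{4}-\d{2}-\d{2}')),
]

def guess_type(value):
    for name, pattern in RULES:
        if pattern.match(value):
            return name
    return 'string'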