Welcome to DSA’s documentation!


dsa.da.util_feature

Methods for feature extraction and preprocessing. util_feature: input/output is pandas.

dsa.da.util_feature.col_extractname(col_onehot)[source]

Column name extraction.
Parameters:
  • col_onehot – list of one-hot encoded column names
Returns:

dsa.da.util_feature.col_extractname_colbin(cols2)[source]

Convert one-hot column names to generic column names.
Parameters:
  • cols2
Returns:

dsa.da.util_feature.col_remove(cols, colsremove, mode='exact')[source]
Remove column names from a list.
Parameters:
  • cols (list) – list of column names.
  • colsremove (list) – column names to remove from cols.
  • mode (str, optional) – matching mode. The default is "exact"; the alternative is "fuzzy".
Returns:

cols – the input list with the specified columns removed.

Return type:

list
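A minimal usage sketch, assuming col_remove follows the documented signature; the column names below are illustrative:

    from dsa.da.util_feature import col_remove

    cols = ["age", "income", "income_bin", "gender"]           # hypothetical column list
    cols_clean = col_remove(cols, ["income_bin"], mode="exact")
    # Expected: ["age", "income", "gender"]; mode="fuzzy" would also match partial names (assumption).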

dsa.da.util_feature.col_stat_getcategorydict_freq(catedict)[source]

Generates the frequency of each category (Id, Freq, Freq in %, CumSum %, ZScore), given a dictionary of categories parsed previously.

dsa.da.util_feature.convert(data, to)[source]
Parameters:
  • data
  • to

Returns:

dsa.da.util_feature.draw_plots(input_data, feature, target_col, trend_correlation=None)[source]

Draws univariate dependence plots for a feature.
Parameters:
  • input_data – grouped data containing bins of the feature and the target mean
  • feature – feature column name
  • target_col – target column
  • trend_correlation – correlation between train and test trends of the feature wrt the target
Returns: draws trend plots for the feature

dsa.da.util_feature.get_trend_changes(grouped_data, feature, target_col, threshold=0.03)[source]

Calculates the number of times the trend of a feature wrt the target changed direction.
Parameters:
  • grouped_data – grouped dataset
  • feature – feature column name
  • target_col – target column
  • threshold – minimum % difference required to count as a trend change
Returns: number of trend changes for the feature

dsa.da.util_feature.get_trend_correlation(grouped, grouped_test, feature, target_col)[source]

Calculates the correlation between the train and test trend of a feature wrt the target.
Parameters:
  • grouped – train grouped data
  • grouped_test – test grouped data
  • feature – feature column name
  • target_col – target column name
Returns: trend correlation between train and test

dsa.da.util_feature.get_trend_stats(data, target_col, features_list=0, bins=10, data_test=0)[source]

Calculates trend changes and the train/test trend correlation for a list of features.
Parameters:
  • data – dataframe containing the feature and target columns
  • target_col – target column name
  • features_list – by default uses all features; if a list is passed, only those features are used
  • bins – number of bins to be created from each continuous feature
  • data_test – test data to be compared with the input data for trend correlation
Returns: dataframe with trend changes and trend correlation (if test data is passed)
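A usage sketch based on the documented signature; the dataframe and column names are illustrative:

    import numpy as np
    import pandas as pd
    from dsa.da.util_feature import get_trend_stats

    df_train = pd.DataFrame({"x1": np.random.rand(1000), "y": np.random.randint(0, 2, 1000)})
    df_test  = pd.DataFrame({"x1": np.random.rand(500),  "y": np.random.randint(0, 2, 500)})

    # Trend changes per feature, plus train/test trend correlation because test data is passed.
    stats = get_trend_stats(data=df_train, target_col="y", bins=10, data_test=df_test)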

dsa.da.util_feature.get_univariate_plots(data, target_col, features_list=0, bins=10, data_test=0)[source]

Creates univariate dependence plots for features in the dataset.
Parameters:
  • data – dataframe containing the feature and target columns
  • target_col – target column name
  • features_list – by default creates plots for all features; if a list is passed, creates plots only for those features
  • bins – number of bins to be created from each continuous feature
  • data_test – test data to be compared with the input data for trend correlation
Returns: draws univariate plots for all columns in data

dsa.da.util_feature.pd_col_fillna(dfref, colname=None, method='frequent', value=None, colgroupby=None, return_val='dataframe, param')[source]

Fill NaNs with a specific value in certain columns.
Parameters:
  • df – dataframe
  • colname – list of columns to fill
  • value – value to replace NaNs with
Returns: new dataframe with filled values
Return type: df
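A usage sketch, assuming the default return_val yields the filled dataframe plus the fill parameters; the data is illustrative:

    import numpy as np
    import pandas as pd
    from dsa.da.util_feature import pd_col_fillna

    df = pd.DataFrame({"age": [25, np.nan, 40], "city": ["a", None, "a"]})
    # method="frequent" fills with the most frequent value of each column (per the signature default).
    df_filled, fill_params = pd_col_fillna(df, colname=["age", "city"], method="frequent",
                                           return_val="dataframe, param")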
dsa.da.util_feature.pd_col_fillna_advanced(dfref, colname=None, method='median', colname_na=None, return_val='dataframe, param')[source]

Fill NaNs in certain columns using an imputation method; see https://impyute.readthedocs.io/en/master/

Parameters:
  • df – dataframe
  • colname – list of columns to fill
  • colname_na – target NA columns
  • value – value to replace NaNs with
Returns:

new dataframe with filled values

Return type:

df

https://impyute.readthedocs.io/en/master/user_guide/overview.html

dsa.da.util_feature.pd_col_fillna_datawig(dfref, colname=None, method='median', colname_na=None, return_val='dataframe, param')[source]

Fill NaNs in certain columns (datawig variant); see https://impyute.readthedocs.io/en/master/

Parameters:
  • df – dataframe
  • colname – list of columns to fill
  • colname_na – target NA columns
  • value – value to replace NaNs with
Returns:

new dataframe with filled values

Return type:

df

https://impyute.readthedocs.io/en/master/user_guide/overview.html

dsa.da.util_feature.pd_col_filter(df, filter_val=None, iscol=1)[source]

Remove columns (or rows, depending on iscol) whose index value is not in filter_val, e.g. filter_val = X_client['client_id'].values.
Parameters:
  • df
  • filter_val
  • iscol
Returns:

dsa.da.util_feature.pd_col_intersection(df1, df2, colid)[source]
Parameters:
  • df1
  • df2
  • colid

Returns:

dsa.da.util_feature.pd_col_merge_onehot(df, colname)[source]
Merge one-hot encoded columns into a single column.
Parameters:
  • df
  • colname

Returns:

dsa.da.util_feature.pd_col_to_onehot(dfref, colname=None, colonehot=None, return_val='dataframe, column')[source]
Parameters:
  • df
  • colname
  • colonehot – previous one hot columns
  • return_val
Returns:
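A usage sketch following the documented parameters; reusing colonehot from a first call on new data is an assumption based on the parameter description:

    import pandas as pd
    from dsa.da.util_feature import pd_col_to_onehot

    df_train = pd.DataFrame({"color": ["red", "blue", "red"], "size": ["S", "M", "L"]})
    df_hot, onehot_cols = pd_col_to_onehot(df_train, colname=["color", "size"],
                                           return_val="dataframe, column")
    # On test data, pass the previously generated columns so both frames align:
    # df_test_hot, _ = pd_col_to_onehot(df_test, colname=["color", "size"], colonehot=onehot_cols)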

dsa.da.util_feature.pd_colcat_mapping(df, colname)[source]
Apply a category mapping, e.g.:

    for col in colcat:
        df[col] = df[col].apply(lambda x: colcat_map["cat_map"][col].get(x))
Parameters:
  • df
  • colname
Returns:

dsa.da.util_feature.pd_colcat_mergecol(df, col_list, x0, colid='easy_id')[source]
Merge categorical one-hot columns into a single column.
Parameters:
  • df
  • col_list
  • x0
Returns:

dsa.da.util_feature.pd_colcat_tonum(df, colcat='all', drop_single_label=False, drop_fact_dict=True)[source]

Encode a data-set with mixed data (numerical and categorical) into a numerical-only data-set, using the following logic (see the sketch after this entry):
  • categorical with only a single value will be marked as zero (or dropped, if requested)
  • categorical with two values will be replaced with the result of Pandas factorize
  • categorical with more than two values will be replaced with the result of Pandas get_dummies
  • numerical columns will not be modified

Returns: DataFrame or (DataFrame, dict). If drop_fact_dict is True, returns the encoded DataFrame alone; otherwise returns a tuple of the encoded DataFrame and a dictionary, where each key is a two-value column and the value is the original labels, as supplied by Pandas factorize. The dictionary will be empty if no two-value columns are present in the data-set.
Parameters:
  • df (NumPy ndarray / Pandas DataFrame) – The data-set to encode
  • colcat – A sequence of the nominal (categorical) columns in the dataset. If string, must be ‘all’ to state that all columns are nominal. If None, nothing happens. Default: ‘all’
  • drop_single_label (Boolean, default = False) – If True, nominal columns with only a single value will be dropped.
  • drop_fact_dict (Boolean, default = True) – If True, the return value will be the encoded DataFrame alone. If False, it will be a tuple of the DataFrame and the dictionary of the binary factorization (originating from pd.factorize)
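The encoding rules above can be sketched directly with pandas; this illustrates the documented logic, not the library's actual implementation:

    import pandas as pd

    def encode_mixed(df, nominal_cols):
        """Sketch of the documented rules: 1 value -> 0, 2 values -> factorize, >2 -> get_dummies."""
        out = df.copy()
        fact_dict = {}
        for col in nominal_cols:
            nunique = out[col].nunique()
            if nunique == 1:
                out[col] = 0
            elif nunique == 2:
                codes, labels = pd.factorize(out[col])
                out[col] = codes
                fact_dict[col] = labels
            else:
                dummies = pd.get_dummies(out[col], prefix=col)
                out = pd.concat([out.drop(columns=[col]), dummies], axis=1)
        return out, fact_dict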
dsa.da.util_feature.pd_colnum_normalize(df, colnum_log, colproba)[source]
Parameters:
  • df
  • colnum_log
  • colproba
Returns:

dsa.da.util_feature.pd_colnum_tocat(df, colname=None, colexclude=None, colbinmap=None, bins=5, suffix='_bin', method='uniform', na_value=-1, return_val='dataframe, param', params={'KMeans_algorithm': 'auto', 'KMeans_copy_x': True, 'KMeans_init': 'k-means++', 'KMeans_max_iter': 300, 'KMeans_n_clusters': 8, 'KMeans_n_init': 10, 'KMeans_n_jobs': None, 'KMeans_precompute_distances': 'auto', 'KMeans_random_state': None, 'KMeans_tol': 0.0001, 'KMeans_verbose': 0})[source]

colbinmap: per-column definition of the bins. See https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing

Parameters:
  • df
  • method
Returns:
dsa.da.util_feature.pd_colnum_tocat_stat(input_data, feature, target_col, bins, cuts=0)[source]

Bins continuous features into equal-sample-size buckets and returns the target mean in each bucket. Separates out nulls into another bucket.
Parameters:
  • input_data – dataframe containing the feature and target column
  • feature – feature column name
  • target_col – target column
  • bins – number of bins required
  • cuts – if buckets with specific cuts are required; used on test data to reuse the cuts from train
Returns: if cuts are passed, only the grouped data is returned; otherwise both the cuts and the grouped data are returned
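A rough pandas sketch of the described binning (equal-sample buckets via quantiles, target mean per bucket); the real function additionally separates nulls into their own bucket:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"feat": np.random.randn(1000), "y": np.random.randint(0, 2, 1000)})

    # Equal-sample-size buckets from quantile cuts, then the target mean and size per bucket.
    df["feat_bin"], cuts = pd.qcut(df["feat"], q=10, retbins=True, duplicates="drop")
    grouped = df.groupby("feat_bin", observed=True)["y"].agg(["mean", "size"]).reset_index()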

dsa.da.util_feature.pd_df_sampling(df, coltarget='y', n1max=10000, n2max=-1, isconcat=1)[source]
Down-sampler for an imbalanced binary-class dataframe.
Parameters:
  • df
  • coltarget – binary class
  • n1max
  • n2max
  • isconcat
Returns:
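A minimal sketch of the down-sampling suggested by the parameters (cap the majority class at n1max rows); this is an illustration, not the library's exact implementation:

    import pandas as pd

    def downsample(df, coltarget="y", n1max=10000, n2max=-1, isconcat=1):
        """Cap class 0 at n1max rows and class 1 at n2max rows (-1 keeps all), then optionally concatenate."""
        df0 = df[df[coltarget] == 0]
        df1 = df[df[coltarget] == 1]
        df0 = df0.sample(n=min(n1max, len(df0)), random_state=0)
        if n2max > 0:
            df1 = df1.sample(n=min(n2max, len(df1)), random_state=0)
        return pd.concat([df0, df1]) if isconcat else (df0, df1)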

dsa.da.util_feature.pd_df_stack(df_list, ignore_index=True)[source]

Concatenate dataframes vertically.
Parameters:
  • df_list
Returns:

dsa.da.util_feature.pd_pipeline_apply(df, pipeline)[source]
Example pipeline:

    pipe_preprocess_colnum = [
        (pd_col_to_num,    {"val": "?"}),
        (pd_colnum_tocat,  {"colname": None, "colbinmap": colnum_binmap, "bins": 5,
                            "method": "uniform", "suffix": "_bin", "return_val": "dataframe"}),
        (pd_col_to_onehot, {"colname": None, "colonehot": colnum_onehot,
                            "return_val": "dataframe"}),
    ]

Parameters:
  • df
  • pipeline
Returns:
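The example above suggests a list of (function, kwargs) steps applied in order; a sketch of that assumed behaviour:

    def apply_pipeline(df, pipeline):
        """Apply each (function, kwargs) step to the dataframe sequentially (assumed behaviour)."""
        for func, kwargs in pipeline:
            df = func(df, **kwargs)
        return df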

dsa.da.util_feature.pd_row_drop_above_thresh(df, colnumlist, thresh)[source]

Remove rows with outliers above a certain threshold.
Parameters:
  • df – dataframe
  • colnumlist – list of columns from which to remove outliers
  • thresh – value above which the row is removed

Returns:dataframe with outliers removed
Return type:df
dsa.da.util_feature.pd_stat_colcheck(df)[source]
Parameters:df

Returns:

dsa.da.util_feature.pd_stat_correl_pair(df, coltarget=None, colname=None)[source]
Generate the correlation between each column and the target column; df is the dataframe containing the columns, coltarget is the target column.
Parameters:
  • df
  • colname – list of columns
  • coltarget – target column

Returns:
dsa.da.util_feature.pd_stat_distribution(df, subsample_ratio=1.0)[source]
Parameters:df
Returns:
dsa.da.util_feature.pd_stat_distribution_colnum(df)[source]

Describe the numerical columns of the dataframe.

dsa.da.util_feature.pd_stat_histogram(df, bins=50, coltarget='diff')[source]
Parameters:
  • df
  • bins
  • coltarget
Returns:

dsa.da.util_feature.pd_stat_histogram_groupby(df, bins=50, coltarget='diff', colgroupby='y')[source]
Parameters:
  • df
  • bins
  • coltarget
  • colgroupby
Returns:

dsa.da.util_feature.pd_stat_jupyter_profile(df, savefile='report.html', title='Pandas Profile')[source]

Describe the dataframe with pandas-profiling 2.0.0 (df.profile_report()) and save the report to an HTML file.

dsa.da.util_feature.pd_stat_na_perow(df, n=1000000)[source]
Parameters:
  • df
  • n
Returns:

dsa.da.util_feature.univariate_plotter(feature, data, target_col, bins=10, data_test=0)[source]

Calls the draw-plot function and handles the layout around the plots.
Parameters:
  • feature – feature column name
  • data – dataframe containing the feature and target columns
  • target_col – target column name
  • bins – number of bins to be created from the continuous feature
  • data_test – test data to be compared with the input data for trend correlation
Returns: grouped data if only train is passed, else (grouped train data, grouped test data)

dsa.da.util_model

Methods for ML models, model ensembles, metrics, etc. util_model: input/output is numpy.

https://stats.stackexchange.com/questions/222558/classification-evaluation-metrics-for-highly-imbalanced-data

Besides the AUC and Cohen’s kappa already discussed in the other answers, I’d also like to add a few metrics I’ve found useful for imbalanced data. They are all related to precision and recall, because by averaging these you get a metric weighing TPs and both types of errors (FP and FN):

  • F1 score, which is the harmonic mean of precision and recall.
  • G-measure, which is the geometric mean of precision and recall. Compared to F1, I’ve found it a bit better for imbalanced data.
  • Jaccard index, which you can think of as TP/(TP+FP+FN). This is actually the metric that has worked for me the best.

Note: For imbalanced datasets, it is best to have your metrics be macro-averaged.
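These metrics are available in scikit-learn; a minimal sketch with illustrative labels, macro-averaged as suggested above:

    import numpy as np
    from sklearn.metrics import f1_score, jaccard_score, precision_score, recall_score

    y_true = [0, 0, 0, 0, 1, 1]
    y_pred = [0, 0, 1, 0, 1, 0]

    f1_macro      = f1_score(y_true, y_pred, average="macro")
    jaccard_macro = jaccard_score(y_true, y_pred, average="macro")
    # G-measure: geometric mean of precision and recall (here for the positive class only).
    g_measure = np.sqrt(precision_score(y_true, y_pred) * recall_score(y_true, y_pred))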


Final intuition for metric selection:
  • Use precision and recall to focus on a small positive class: when the positive class is smaller and the ability to detect positive samples correctly is our main focus (correct detection of negative examples is less important to the problem), we should use precision and recall.
  • Use ROC when detection of both classes is equally important: when we want to give equal weight to the prediction ability for both classes, we should look at the ROC curve.
  • Use ROC when the positives are the majority, or switch the labels and use precision and recall: when the positive class is larger, we should probably use the ROC metrics, because precision and recall would mostly reflect the ability to predict the positive class and not the negative class, which will naturally be harder to detect due to the smaller number of samples. If the negative class (the minority in this case) is more important, we can switch the labels and use precision and recall (as we saw in the examples above, switching the labels can change everything).


dsa.da.util_model.model_catboost_classifier(Xtrain, Ytrain, Xcolname=None, pars={'iterations': 1000, 'learning_rate': 0.1, 'loss_function': 'MultiClass', 'random_seed': 0}, isprint=0)[source]
Example (CatBoost usage):

    from catboost import Pool, CatBoostClassifier

    TRAIN_FILE = '../data/cloudness_small/train_small'
    TEST_FILE  = '../data/cloudness_small/test_small'
    CD_FILE    = '../data/cloudness_small/train.cd'

    # Load data from files to Pool
    train_pool = Pool(TRAIN_FILE, column_description=CD_FILE)
    test_pool  = Pool(TEST_FILE, column_description=CD_FILE)

    # Initialize CatBoostClassifier
    model = CatBoostClassifier(iterations=2, learning_rate=1, depth=2, loss_function='MultiClass')
    # Fit model
    model.fit(train_pool)

    # Get predicted classes
    preds_class = model.predict(test_pool)
    # Get predicted probabilities for each class
    preds_proba = model.predict_proba(test_pool)
    # Get predicted RawFormulaVal
    preds_raw = model.predict(test_pool, prediction_type='RawFormulaVal')

https://tech.yandex.com/catboost/doc/dg/concepts/python-usages-examples-docpage/

class dsa.da.util_model.model_template1(alpha=0.5, low_y_cut=-0.09, high_y_cut=0.09, ww0=0.95)[source]
dsa.da.util_model.pd_dim_reduction(df, colname, colprefix='colsvd', method='svd', dimpca=2, model_pretrain=None, return_val='dataframe, param')[source]

Dimension reduction techniques.

Example:

    dftext_svd, svd = pd_dim_reduction(dfcat_test, None, colprefix="colsvd",
                                       method="svd", dimpca=2, return_val="dataframe,param")
Parameters:
  • df
  • colname
  • colprefix
  • method
  • dimpca
  • return_val
Returns:

dsa.da.util_model.sk_cluster(Xmat, method='kmode', args=(), kwds={'metric': 'euclidean', 'min_cluster_size': 150, 'min_samples': 3}, isprint=1, preprocess={'norm': False})[source]

Supported methods and example keyword arguments:

    'hdbscan', (), kwds={'metric': 'euclidean', 'min_cluster_size': 150, 'min_samples': 3}
    'kmodes',  (), kwds={'n_clusters': 2, 'n_init': 5, 'init': 'Huang', 'verbose': 1}
    'kmeans',      kwds={'n_clusters': nbcluster}

Example:

    Xmat[Xcluster == 5]

    # HDBSCAN clustering
    Xcluster_hdbscan = da.sk_cluster_algo_custom(Xtrain_d, hdbscan.HDBSCAN, (),
                                                 {'metric': 'euclidean', 'min_cluster_size': 150, 'min_samples': 3})
    print(len(np.unique(Xcluster_hdbscan)))
    Xcluster_use = Xcluster_hdbscan

    # Calculate the distribution for each cluster
    kde = da.plot_distribution_density(Y[Xcluster_use == 2], kernel='gaussian', N=200, bandwith=1 / 500.)
    kde.sample(5)

dsa.da.util_model.sk_feature_concept_shift(df)[source]
(X,y) distribution relation is shifting.

https://dkopczyk.quantee.co.uk/covariate_shift/

Parameters:df (TYPE) – DESCRIPTION.
Returns:
Return type:None.
dsa.da.util_model.sk_feature_covariate_shift(dftrain, dftest, colname, nsample=10000)[source]
X is drifting
Parameters:
  • dftrain (DataFrame) – training dataframe.
  • dftest (DataFrame) – test dataframe.
  • colname – columns to check for drift.
  • nsample (int, optional) – number of rows to sample. The default is 10000.
Returns:

drop_list – columns flagged as drifting.

Return type:

TYPE
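The linked article's general approach to detecting covariate shift is to train a classifier to distinguish train rows from test rows; a hedged sketch of that idea (not the dsa implementation, and assuming numeric feature columns):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    def covariate_shift_score(dftrain, dftest, colname):
        """AUC of a classifier separating train vs test rows; ~0.5 means no detectable drift."""
        dfa = dftrain[colname].copy(); dfa["is_test"] = 0
        dfb = dftest[colname].copy();  dfb["is_test"] = 1
        dfall = pd.concat([dfa, dfb], ignore_index=True)
        X, y = dfall[colname], dfall["is_test"]
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        return cross_val_score(clf, X, y, cv=3, scoring="roc_auc").mean()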

dsa.da.util_model.sk_feature_impt(clf, colname, model_type='logistic')[source]
Feature importance with colname
Parameters:
  • clf – model or colnum with weights
  • colname
Returns:

dsa.da.util_model.sk_feature_prior_shift(df)[source]
Label is drifting

https://dkopczyk.quantee.co.uk/covariate_shift/

Parameters:df (TYPE) – DESCRIPTION.
Returns:
Return type:None.
dsa.da.util_model.sk_metric_roc_optimal_cutoff(ytest, ytest_proba)[source]

Find the optimal probability cutoff point for a classification model related to event rate.
Parameters:
  • ytest – matrix with dependent or target data, where rows are observations
  • ytest_proba – matrix with predicted probabilities, where rows are observations
Returns: the optimal cutoff value

Example:

    # Add predictions to the dataframe, applying the threshold
    data['pred'] = data['pred_proba'].map(lambda x: 1 if x > threshold else 0)

    # Print the confusion matrix
    from sklearn.metrics import confusion_matrix
    confusion_matrix(data['admit'], data['pred'])
    # array([[175,  98],
    #        [ 46,  81]])
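A sketch of one common way to pick such a cutoff (maximising Youden's J = TPR - FPR on the ROC curve); whether the library uses exactly this rule is an assumption:

    import numpy as np
    from sklearn.metrics import roc_curve

    def roc_optimal_cutoff(ytest, ytest_proba):
        """Return the probability threshold that maximises TPR - FPR."""
        fpr, tpr, thresholds = roc_curve(ytest, ytest_proba)
        return thresholds[np.argmax(tpr - fpr)]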

dsa.da.util_model.sk_model_eval_classification_cv(clf, X, y, test_size=0.5, ncv=1, method='random')[source]
Parameters:
  • clf
  • X
  • y
  • test_size
  • ncv
  • method
Returns:

dsa.da.util_model.sk_params_search_best(clf, X, y, param_grid={'alpha': array([0., 0.25, 0.5, 0.75, 1. ])}, method='gridsearch', param_search={'cv': 5, 'generations_number': 3, 'population_size': 5, 'scorename': 'r2'})[source]

Genetic search: population_size=5, gene_mutation_prob=0.10, gene_crossover_prob=0.5, tournament_size=3, generations_number=3

Parameters:
  • clf
  • X
  • y
  • param_grid
  • method
  • param_search
Returns:
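For reference, a plain scikit-learn grid search over the same kind of param_grid; the dsa wrapper's exact behaviour is inferred from its parameters:

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import GridSearchCV

    X = np.random.rand(100, 3)
    y = np.random.rand(100)

    search = GridSearchCV(Ridge(),
                          param_grid={"alpha": np.array([0.0, 0.25, 0.5, 0.75, 1.0])},
                          cv=5, scoring="r2")
    search.fit(X, y)
    best_clf, best_params = search.best_estimator_, search.best_params_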

dsa.da.util_text

Methods for text feature extraction and preprocessing. util_text: input/output is pandas.

If you need to compute tf-idf scores on documents within your “training” dataset, use TfidfVectorizer. If you need to compute tf-idf scores on documents outside your “training” dataset, either one will work.

feature_extraction.text.CountVectorizer – convert a collection of text documents to a matrix of token counts
feature_extraction.text.HashingVectorizer – convert a collection of text documents to a matrix of token occurrences
feature_extraction.text.TfidfVectorizer – convert a collection of raw documents to a matrix of TF-IDF features
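A short illustration of the vectorizers mentioned above on toy documents:

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    docs = ["the cat sat", "the dog sat", "the cat ran"]

    counts = CountVectorizer().fit_transform(docs)    # token counts
    tfidf  = TfidfVectorizer().fit_transform(docs)    # tf-idf weights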

dsa.da.util_text.pd_coltext_countvect(df, coltext, word_tokeep=None, word_minfreq=1, return_val='dataframe, param')[source]

Adds word-count features for a given text column, based on a word corpus.
Parameters:
  • df – original dataframe
  • coltext – column of df to apply the count vectorizer to
  • word_tokeep – corpus of words to keep

Returns:dataframe with a new column for each word https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
Return type:concat_df
dsa.da.util_text.pd_coltext_encoder(df)[source]

https://dirty-cat.github.io/stable/auto_examples/02_fit_predict_plot_employee_salaries.html#sphx-glr-auto-examples-02-fit-predict-plot-employee-salaries-py

Parameters:df
Returns:
dsa.da.util_text.pd_coltext_hashing(df, coltext, n_features=20)[source]

Adds hashed features for a given text column (feature hashing).
Parameters:
  • df – original dataframe
  • coltext – column of df to hash
  • n_features – number of hashing features

Returns:dataframe with a new column for each word
Return type:concat_df
dsa.da.util_text.pd_coltext_minhash(dfref, colname, n_component=2, model_pretrain_dict=None, return_val='dataframe, param')[source]
Example:

    dfhash, colcat_hash_param = pd_colcat_minhash(df, colcat, n_component=[2] * len(colcat),
                                                  return_val="dataframe,param")
Parameters:
  • dfref
  • colname
  • n_component
  • return_val
Returns:

dsa.da.util_text.pd_coltext_tdidf(df, coltext, word_tokeep=None, word_minfreq=1, return_val='dataframe, param')[source]

Adds tf-idf features for a given text column, based on a word corpus.
Parameters:
  • df – original dataframe
  • coltext – column of df to apply tf-idf to
  • word_tokeep – corpus of words to keep

Returns:dataframe with a new column for each word https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
Return type:concat_df
dsa.da.util_text.pd_coltext_wordfreq(df, coltext, sep=' ')[source]
Parameters:
  • df
  • coltext – text column from which word frequencies are extracted
  • sep – token separator (default: ' ')
Returns:

dsa.da.util_text.pd_fromdict(ddict, colname)[source]
Parameters:
  • ddict
  • colname
Returns:

dsa.da.util_stat

Methods for statistics: association metrics, entropy, correlation ratios, etc. util_stat: input/output is numpy.

dsa.da.util_stat.np_conditional_entropy(x, y)[source]

Calculates the conditional entropy of x given y: S(x|y). Wikipedia: https://en.wikipedia.org/wiki/Conditional_entropy
Parameters:
  • x (list / NumPy ndarray / Pandas Series) – a sequence of measurements
  • y (list / NumPy ndarray / Pandas Series) – a sequence of measurements
Returns: float
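A small illustrative implementation of S(x|y) from empirical joint counts; it mirrors the definition above, not necessarily the library's exact code:

    import math
    from collections import Counter

    def conditional_entropy(x, y):
        """Empirical conditional entropy S(x|y) in nats."""
        y_counter = Counter(y)
        xy_counter = Counter(zip(x, y))
        total = len(x)
        entropy = 0.0
        for (xv, yv), n_xy in xy_counter.items():
            p_xy = n_xy / total            # joint probability p(x, y)
            p_y = y_counter[yv] / total    # marginal probability p(y)
            entropy += p_xy * math.log(p_y / p_xy)
        return entropy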

dsa.da.util_stat.np_correl_cat_cat_cramers_v(x, y)[source]

Calculates Cramer’s V statistic for categorical-categorical association. Uses the correction from Bergsma and Wicher, Journal of the Korean Statistical Society 42 (2013). This is a symmetric coefficient: V(x,y) = V(y,x). Original function taken from https://stackoverflow.com/a/46498792/5863503. Wikipedia: https://en.wikipedia.org/wiki/Cram%C3%A9r%27s_V
Parameters:
  • x (list / NumPy ndarray / Pandas Series) – a sequence of categorical measurements
  • y (list / NumPy ndarray / Pandas Series) – a sequence of categorical measurements
Returns: float in the range [0,1]
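An illustrative version of the bias-corrected Cramér's V described above, following the referenced Stack Overflow answer; not necessarily the library's exact code:

    import numpy as np
    import pandas as pd
    import scipy.stats as ss

    def cramers_v(x, y):
        """Bias-corrected Cramér's V for two categorical sequences."""
        confusion = pd.crosstab(pd.Series(x), pd.Series(y))
        chi2 = ss.chi2_contingency(confusion)[0]
        n = confusion.values.sum()
        phi2 = chi2 / n
        r, k = confusion.shape
        # Bergsma-Wicher bias correction
        phi2corr = max(0, phi2 - (k - 1) * (r - 1) / (n - 1))
        rcorr = r - (r - 1) ** 2 / (n - 1)
        kcorr = k - (k - 1) ** 2 / (n - 1)
        return np.sqrt(phi2corr / min(kcorr - 1, rcorr - 1))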

dsa.da.util_stat.np_correl_cat_cat_theils_u(x, y)[source]

Calculates Theil’s U statistic (uncertainty coefficient) for categorical-categorical association. This is the uncertainty of x given y: the value is in the range [0,1], where 0 means y provides no information about x, and 1 means y provides full information about x. This is an asymmetric coefficient: U(x,y) != U(y,x). Wikipedia: https://en.wikipedia.org/wiki/Uncertainty_coefficient
Parameters:
  • x (list / NumPy ndarray / Pandas Series) – a sequence of categorical measurements
  • y (list / NumPy ndarray / Pandas Series) – a sequence of categorical measurements
Returns: float in the range [0,1]

dsa.da.util_stat.np_correl_cat_num_ratio(cat_array, num_array)[source]

Calculates the correlation ratio (sometimes marked by the Greek letter eta) for categorical-continuous association. Answers the question: given a continuous value of a measurement, is it possible to know which category it is associated with? The value is in the range [0,1], where 0 means a category cannot be determined by a continuous measurement, and 1 means a category can be determined with absolute certainty. Wikipedia: https://en.wikipedia.org/wiki/Correlation_ratio
Parameters:
  • cat_array (list / NumPy ndarray / Pandas Series) – a sequence of categorical measurements
  • num_array (list / NumPy ndarray / Pandas Series) – a sequence of continuous measurements
Returns: float in the range [0,1]

dsa.da.util_stat.np_transform_pca(X, dimpca=2, whiten=True)[source]

Project ndim data into dimpca sub-space

dsa.da.util_stat.pd_num_correl_associations(df, colcat=None, mark_columns=False, theil_u=False, plot=True, return_results=False, **kwargs)[source]

Calculate the correlation/strength-of-association of features in the data-set, covering both categorical and continuous features, using:

  • Pearson’s R for continuous-continuous cases
  • Correlation Ratio for categorical-continuous cases
  • Cramer’s V or Theil’s U for categorical-categorical cases

Returns: a DataFrame of the correlation/strength-of-association between all features. Example: see associations_example under dython.examples.
Parameters:
  • df (NumPy ndarray / Pandas DataFrame) – the data-set for which the features’ correlation is computed
  • colcat – names of the columns of the data-set which hold categorical values. Can also be the string ‘all’ to state that all columns are categorical, or None (default) to state that none are categorical
  • mark_columns (Boolean, default = False) – if True, output columns’ names will have a suffix of ‘(nom)’ or ‘(con)’ based on their type (categorical or continuous), as provided by colcat
  • theil_u (Boolean, default = False) – in the case of categorical-categorical features, use Theil’s U instead of Cramer’s V
  • plot (Boolean, default = True) – if True, plot a heat-map of the correlation matrix
  • return_results (Boolean, default = False) – if True, the function will return a Pandas DataFrame of the computed associations
  • kwargs (any key-value pairs) – arguments to be passed to the used function and methods
dsa.da.util_stat.sk_distribution_kernel_bestbandwidth(X, kde)[source]

Find the best bandwidth for a given kernel.
Parameters:
  • X
  • kde
Returns:

dsa.da.util_stat.sk_distribution_kernel_sample(kde=None, n=1)[source]

Example:

    kde = sm.nonparametric.KDEUnivariate(np.array(Y[Y_cluster == 0], dtype=np.float64))
    kde = sm.nonparametric.KDEMultivariate()  # ... you already did this

dsa.da.util_stat.stat_hypothesis_test_permutation(df, variable, classes, repetitions)[source]

Test whether two numerical samples come from the same underlying distribution, using the absolute difference between the means.
Parameters:
  • df – table containing the sample
  • variable – label of the column containing the numerical variable
  • classes – label of the column containing the names of the two samples
  • repetitions – number of random permutations
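A sketch of the described permutation test (shuffle the class labels, recompute the absolute difference of the group means); it assumes exactly two classes and is not the library's exact implementation:

    import numpy as np
    import pandas as pd

    def permutation_test(df, variable, classes, repetitions=1000, seed=0):
        """Approximate p-value for the absolute difference in group means under label permutations."""
        rng = np.random.default_rng(seed)
        observed = abs(df.groupby(classes)[variable].mean().diff().iloc[-1])
        labels = df[classes].values
        count = 0
        for _ in range(repetitions):
            shuffled = rng.permutation(labels)
            diff = abs(df[variable].groupby(shuffled).mean().diff().iloc[-1])
            if diff >= observed:
                count += 1
        return count / repetitions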

dsa.da.util_date

Example:

    import datetime
    datetime.datetime.strptime('20-Nov-2002', '%d-%b-%Y').strftime('%Y%m%d')
    # '20021120'

Format codes: %d - 2-digit day, %b - 3-letter month abbreviation, %m - 2-digit month, %Y - 4-digit year, %a - abbreviated weekday name

    df = DataFrame(dict(date=date_range('20130101', periods=10)))

https://python-utils.readthedocs.io/en/latest/usage.html#quickstart
https://dateutil.readthedocs.io/en/stable/examples.html

dsa.da.util_date.datestring_todatetime(datelist, fmt='%Y-%m-%d %H:%M:%S')[source]
Parse date strings, e.g. 'Jun 1 2005 1:33PM' with format '%b %d %Y %I:%M%p'.
Parameters:
  • datelist
  • fmt
Returns:

dsa.da.util_date.datetime_tostring(datelist, fmt='%Y-%m-%d %H:%M:%S')[source]

See https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior for format codes.
Parameters:
  • datelist
  • fmt
Returns:

dsa.da.util_date.datetime_weekday_fast(dateval)[source]
date values
Parameters:dateval
Returns:
dsa.da.util_date.pd_datestring_split(dfref, coldate, fmt='%Y-%m-%d %H:%M:%S', return_val='split')[source]
Parse date strings, e.g. 'Jun 1 2005 1:33PM' with format '%b %d %Y %I:%M%p'.
Parameters:
  • dfref
  • coldate
  • fmt
  • return_val
Returns:

dsa.da.util

Various utilities

dsa.da.util.load(filename='/folder1/keyname', isabsolutpath=0, encoding1='utf-8')[source]

Pickle load.
Parameters:
  • filename
  • isabsolutpath
  • encoding1
Returns:

dsa.da.util.load_arguments(config_file=None, arg_list=None)[source]

Load CLI arguments and config.toml; CLI input overrides the config.toml values.

dsa.da.util.logger_setup(logger_name=None, log_file=None, formatter=<logging.Formatter object>, isrotate=False, isconsole_output=True, logging_level=10)[source]

Example:

    my_logger = util_log.logger_setup("my module name", log_file="")
    APP_ID = util_log.create_appid(__file__)

    def log(*argv):
        my_logger.info(",".join([str(x) for x in argv]))
dsa.da.util.save(obj, filename='/folder1/keyname', isabsolutpath=0)[source]

Pickle save.
Parameters:
  • obj
  • filename
  • isabsolutpath
Returns:

dsa.da.util.save_all(variable_list, folder, globals_main=None)[source]

Pickle save in batch.
Parameters:
  • variable_list
  • folder
  • globals_main
Returns:

dsa.da.util.sk_tree_get_ifthen(tree, feature_names, target_names, spacer_base=' ')[source]

Produce pseudo-code for a decision tree.
Parameters:
  • tree – scikit-learn DecisionTree
  • feature_names – list of feature names
  • target_names – list of target (output) names
  • spacer_base – used for spacing the code (default: ' ')
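For comparison, scikit-learn ships a textual exporter that produces a similar if/then view of a fitted tree; a small self-contained example:

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    data = load_iris()
    clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)
    # Prints an indented if/then representation of the fitted tree.
    print(export_text(clf, feature_names=list(data.feature_names)))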

dsa.da.util_automl