Welcome to DSA’s documentation!¶
dsa.da.util_feature¶
Methods for feature extraction and preprocessing util_feature: input/output is pandas
-
dsa.da.util_feature.
col_extractname
(col_onehot)[source]¶ Column extraction :param col_onehot :return:
-
dsa.da.util_feature.
col_extractname_colbin
(cols2)[source]¶ 1hot column name to generic column names :param cols2: :return:
-
dsa.da.util_feature.
col_remove
(cols, colsremove, mode='exact')[source]¶ Parameters: - cols (TYPE) – DESCRIPTION.
- colsremove (TYPE) – DESCRIPTION.
- mode (TYPE, optional) – DESCRIPTION. Either “exact” or “fuzzy”. The default is “exact”.
Returns: cols – DESCRIPTION. remove column name from list
Return type: TYPE
-
dsa.da.util_feature.
col_stat_getcategorydict_freq
(catedict)[source]¶ Generate frequency statistics per category (Id, Freq, Freq in %, CumSum %, ZScore), given a dictionary of categories parsed previously
-
dsa.da.util_feature.
draw_plots
(input_data, feature, target_col, trend_correlation=None)[source]¶ Draws univariate dependence plots for a feature :param input_data: grouped data containing bins of feature and target mean. :param feature: feature column name :param target_col: target column :param trend_correlation: correlation between train and test trends of feature wrt target :return: Draws trend plots for feature
-
dsa.da.util_feature.
get_trend_changes
(grouped_data, feature, target_col, threshold=0.03)[source]¶ Calculates number of times the trend of feature wrt target changed direction. :param grouped_data: grouped dataset :param feature: feature column name :param target_col: target column :param threshold: minimum % difference required to count as trend change :return: number of trend changes for the feature
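The counting rule described for get_trend_changes can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper name count_trend_changes is hypothetical, it takes a plain list of per-bin target means rather than a grouped dataframe, and it assumes that "minimum % difference" means a step must be at least threshold times the overall spread of the means to count.

```python
def count_trend_changes(bin_means, threshold=0.03):
    """Count direction reversals in a sequence of per-bin target means."""
    if len(bin_means) < 3:
        return 0
    spread = max(bin_means) - min(bin_means)
    # Keep only the sign of steps that are large enough relative to the
    # overall spread (assumed interpretation of `threshold`).
    directions = []
    for previous, current in zip(bin_means, bin_means[1:]):
        step = current - previous
        if step != 0 and abs(step) >= threshold * spread:
            directions.append(1 if step > 0 else -1)
    # A trend change is a sign flip between consecutive retained steps.
    return sum(1 for a, b in zip(directions, directions[1:]) if a != b)


means = [0.10, 0.20, 0.30, 0.25, 0.40]
changes_default = count_trend_changes(means)
changes_strict = count_trend_changes(means, threshold=0.2)
```

With the default threshold the small dip at the fourth bin counts as two reversals; with a stricter threshold it is treated as noise and ignored.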
-
dsa.da.util_feature.
get_trend_correlation
(grouped, grouped_test, feature, target_col)[source]¶ Calculates correlation between train and test trend of feature wrt target. :param grouped: train grouped data :param grouped_test: test grouped data :param feature: feature column name :param target_col: target column name :return: trend correlation between train and test
-
dsa.da.util_feature.
get_trend_stats
(data, target_col, features_list=0, bins=10, data_test=0)[source]¶ Calculates trend changes and correlation between train/test for list of features :param data: dataframe containing features and target columns :param target_col: target column name :param features_list: by default creates plots for all features. If list passed, creates plots of only those features. :param bins: number of bins to be created from continuous feature :param data_test: test data which has to be compared with input data for correlation :return: dataframe with trend changes and trend correlation (if test data passed)
-
dsa.da.util_feature.
get_univariate_plots
(data, target_col, features_list=0, bins=10, data_test=0)[source]¶ Creates univariate dependence plots for features in the dataset :param data: dataframe containing features and target columns :param target_col: target column name :param features_list: by default creates plots for all features. If list passed, creates plots of only those features. :param bins: number of bins to be created from continuous feature :param data_test: test data which has to be compared with input data for correlation :return: Draws univariate plots for all columns in data
-
dsa.da.util_feature.
pd_col_fillna
(dfref, colname=None, method='frequent', value=None, colgroupby=None, return_val='dataframe, param')[source]¶ Function to fill NaNs with a specific value in certain columns :param dfref: dataframe :param colname: list of columns in which to fill NaNs :param value: value to replace NaNs with
Returns: new dataframe with filled values Return type: df
-
dsa.da.util_feature.
pd_col_fillna_advanced
(dfref, colname=None, method='median', colname_na=None, return_val='dataframe, param')[source]¶ Function to fill NaNs with a specific value in certain columns https://impyute.readthedocs.io/en/master/
Parameters: - df – dataframe
- colname – list of columns to remove text
- colname_na – target NA columns
- value – value to replace NaNs with
Returns: new dataframe with filled values
Return type: df
https://impyute.readthedocs.io/en/master/user_guide/overview.html
-
dsa.da.util_feature.
pd_col_fillna_datawig
(dfref, colname=None, method='median', colname_na=None, return_val='dataframe, param')[source]¶ Function to fill NaNs with a specific value in certain columns https://impyute.readthedocs.io/en/master/
Parameters: - df – dataframe
- colname – list of columns to remove text
- colname_na – target NA columns
- value – value to replace NaNs with
Returns: new dataframe with filled values
Return type: df
https://impyute.readthedocs.io/en/master/user_guide/overview.html
-
dsa.da.util_feature.
pd_col_filter
(df, filter_val=None, iscol=1)[source]¶ # Remove Columns where Index Value is not in the filter_value # filter1= X_client[‘client_id’].values :param df: :param filter_val: :param iscol: :return:
-
dsa.da.util_feature.
pd_col_intersection
(df1, df2, colid)[source]¶ Parameters: - df1 –
- df2 –
- colid –
:return :
-
dsa.da.util_feature.
pd_col_merge_onehot
(df, colname)[source]¶ - Merge columns into single (hotn
Parameters: - df –
- colname –
:return :
-
dsa.da.util_feature.
pd_col_to_onehot
(dfref, colname=None, colonehot=None, return_val='dataframe, column')[source]¶ Parameters: - df –
- colname –
- colonehot – previous one hot columns
- returncol –
Returns:
-
dsa.da.util_feature.
pd_colcat_mapping
(df, colname)[source]¶ - for col in colcat :
- df[col] = df[col].apply(lambda x : colcat_map[“cat_map”][col].get(x) )
Parameters: - df –
- colname –
Returns:
-
dsa.da.util_feature.
pd_colcat_mergecol
(df, col_list, x0, colid='easy_id')[source]¶ - Merge category onehot column
Parameters: - df –
- l –
- x0 –
Returns:
-
dsa.da.util_feature.
pd_colcat_tonum
(df, colcat='all', drop_single_label=False, drop_fact_dict=True)[source]¶ Encodes a data-set with mixed data (numerical and categorical) into a numerical-only data-set, using the following logic:
- categorical columns with only a single value are marked as zero (or dropped, if requested)
- categorical columns with two values are replaced with the result of Pandas factorize
- categorical columns with more than two values are replaced with the result of Pandas get_dummies
- numerical columns are not modified
Parameters: - df (NumPy ndarray / Pandas DataFrame) – The data-set to encode.
- colcat – A sequence of the nominal (categorical) columns in the dataset. If a string, it must be ‘all’ to state that all columns are nominal. If None, nothing happens. Default: ‘all’
- drop_single_label (Boolean, default = False) – If True, nominal columns with only a single value will be dropped.
- drop_fact_dict (Boolean, default = True) – If True, the return value will be the encoded DataFrame alone. If False, it will be a tuple of the DataFrame and the dictionary of the binary factorization (originating from pd.factorize).
Returns: DataFrame or (DataFrame, dict). If drop_fact_dict is True, returns the encoded DataFrame. Otherwise, returns a tuple of the encoded DataFrame and a dictionary, where each key is a two-value column and the value is the original labels, as supplied by Pandas factorize. The dictionary is empty if no two-value columns are present in the data-set.
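The three encoding rules can be illustrated without pandas. The sketch below is only an approximation of the logic described above: encode_categorical is a hypothetical helper working on a dict of column lists, it numbers two-value columns in order of first appearance (as pd.factorize does), and the one-hot column naming is an assumption.

```python
def encode_categorical(columns, drop_single_label=False):
    """Encode a dict of categorical columns into numeric columns."""
    encoded, factor_labels = {}, {}
    for name, values in columns.items():
        # Unique labels in order of first appearance.
        labels = list(dict.fromkeys(values))
        if len(labels) == 1:
            # Single-valued column: all zeros, or dropped on request.
            if not drop_single_label:
                encoded[name] = [0] * len(values)
        elif len(labels) == 2:
            # Two-valued column: integer codes plus the label lookup.
            encoded[name] = [labels.index(v) for v in values]
            factor_labels[name] = labels
        else:
            # More than two values: one indicator column per label.
            for label in labels:
                encoded[f"{name}_{label}"] = [1 if v == label else 0 for v in values]
    return encoded, factor_labels


columns = {
    "const": ["x", "x", "x"],
    "sex": ["m", "f", "m"],
    "color": ["r", "g", "b"],
}
encoded, factor_labels = encode_categorical(columns)
```

Returning the label lookup alongside the data corresponds to calling the documented function with drop_fact_dict=False.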
-
dsa.da.util_feature.
pd_colnum_normalize
(df, colnum_log, colproba)[source]¶ Parameters: - df –
- colnum_log –
- colproba –
Returns:
-
dsa.da.util_feature.
pd_colnum_tocat
(df, colname=None, colexclude=None, colbinmap=None, bins=5, suffix='_bin', method='uniform', na_value=-1, return_val='dataframe, param', params={'KMeans_algorithm': 'auto', 'KMeans_copy_x': True, 'KMeans_init': 'k-means++', 'KMeans_max_iter': 300, 'KMeans_n_clusters': 8, 'KMeans_n_init': 10, 'KMeans_n_jobs': None, 'KMeans_precompute_distances': 'auto', 'KMeans_random_state': None, 'KMeans_tol': 0.0001, 'KMeans_verbose': 0})[source]¶ colbinmap = for each column, definition of bins https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing
param df: param method: return:
-
dsa.da.util_feature.
pd_colnum_tocat_stat
(input_data, feature, target_col, bins, cuts=0)[source]¶ Bins continuous features into equal sample size buckets and returns the target mean in each bucket. Separates out nulls into another bucket. :param input_data: dataframe containing features and target column :param feature: feature column name :param target_col: target column :param bins: number of bins required :param cuts: if buckets of certain specific cuts are required. Used on test data to use cuts from train. :return: If cuts are passed only grouped data is returned, else cuts and grouped data is returned
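A rough sketch of equal-sample-size bucketing with a separate null bucket follows. It is an assumption-laden illustration rather than the library's code: bin_target_mean is a hypothetical helper on plain lists, None stands in for NaN, and buckets are formed by slicing the sorted values, so tied values may be split across buckets (quantile-based cuts would behave differently).

```python
def bin_target_mean(values, targets, bins):
    """Return (bucket_min, bucket_max, sample_count, target_mean) per bucket.

    Pairs whose value is None are collected in a trailing "Nulls" bucket.
    """
    known = sorted((v, t) for v, t in zip(values, targets) if v is not None)
    nulls = [t for v, t in zip(values, targets) if v is None]
    grouped = []
    n = len(known)
    for i in range(bins):
        # Integer slicing gives buckets whose sizes differ by at most one.
        chunk = known[i * n // bins:(i + 1) * n // bins]
        if not chunk:
            continue
        chunk_targets = [t for _, t in chunk]
        grouped.append((chunk[0][0], chunk[-1][0], len(chunk),
                        sum(chunk_targets) / len(chunk_targets)))
    if nulls:
        grouped.append(("Nulls", "Nulls", len(nulls), sum(nulls) / len(nulls)))
    return grouped


values = [1, 2, 3, 4, 5, 6, None, 8]
targets = [0, 0, 1, 0, 1, 1, 1, 1]
grouped = bin_target_mean(values, targets, bins=2)
```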
-
dsa.da.util_feature.
pd_df_sampling
(df, coltarget='y', n1max=10000, n2max=-1, isconcat=1)[source]¶ - DownSampler
Parameters: - df –
- coltarget – binary class
- n1max –
- n2max –
- isconcat –
Returns:
-
dsa.da.util_feature.
pd_df_stack
(df_list, ignore_index=True)[source]¶ Concat vertically dataframe :param df_list: :return:
-
dsa.da.util_feature.
pd_pipeline_apply
(df, pipeline)[source]¶ - pipe_preprocess_colnum = [ (pd_col_to_num, {“val”: “?”, })
- , (pd_colnum_tocat, {“colname”: None, “colbinmap”: colnum_binmap, ‘bins’: 5,
- “method”: “uniform”, “suffix”: “_bin”, “return_val”: “dataframe”})
- , (pd_col_to_onehot, {“colname”: None, “colonehot”: colnum_onehot,
- “return_val”: “dataframe”})
]
Parameters: - df –
- pipeline –
Returns:
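The pipeline format shown above is a list of (function, keyword-arguments) pairs. One plausible way to apply such a list is sketched here; how pd_pipeline_apply actually threads the dataframe through the steps is an assumption, and apply_pipeline, replace_value and to_bins are hypothetical stand-ins that operate on a plain list instead of a dataframe.

```python
def apply_pipeline(data, pipeline):
    """Apply each (function, kwargs) step to the running result."""
    for step_function, step_kwargs in pipeline:
        data = step_function(data, **step_kwargs)
    return data


# Hypothetical stand-in steps, not functions from the library.
def replace_value(values, val, new):
    return [new if v == val else v for v in values]


def to_bins(values, width):
    return [v // width for v in values]


pipeline = [
    (replace_value, {"val": "?", "new": 0}),
    (to_bins, {"width": 10}),
]
result = apply_pipeline([5, "?", 27, 31], pipeline)
```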
-
dsa.da.util_feature.
pd_row_drop_above_thresh
(df, colnumlist, thresh)[source]¶ Function to remove outliers above a certain threshold :param df: dataframe :param col: col from which to remove outliers :param thresh: value above which to remove row :param colnumlist: list
Returns: dataframe with outliers removed Return type: df
-
dsa.da.util_feature.
pd_stat_correl_pair
(df, coltarget=None, colname=None)[source]¶ - Generate the correlation between each column in colname and the target column coltarget; df is the dataframe containing both
Parameters: - df –
- colname – list of columns
:param coltarget : target column
Returns:
-
dsa.da.util_feature.
pd_stat_histogram
(df, bins=50, coltarget='diff')[source]¶ Parameters: - df –
- bins –
- coltarget –
Returns:
-
dsa.da.util_feature.
pd_stat_histogram_groupby
(df, bins=50, coltarget='diff', colgroupby='y')[source]¶ Parameters: - df –
- bins –
- coltarget –
- colgroupby –
Returns:
-
dsa.da.util_feature.
pd_stat_jupyter_profile
(df, savefile='report.html', title='Pandas Profile')[source]¶ Describe the tables #Pandas-Profiling 2.0.0 df.profile_report()
-
dsa.da.util_feature.
univariate_plotter
(feature, data, target_col, bins=10, data_test=0)[source]¶ Calls the draw plot function and editing around the plots :param feature: feature column name :param data: dataframe containing features and target columns :param target_col: target column name :param bins: number of bins to be created from continuous feature :param data_test: test data which has to be compared with input data for correlation :return: grouped data if only train passed, else (grouped train data, grouped test data)
dsa.da.util_model¶
Methods for ML models, model ensembles, metrics etc. util_model : input/output is numpy
Besides the AUC and Cohen’s kappa, I’d also like to add a few metrics I’ve found useful for imbalanced data. They are all related to precision and recall, because by averaging these you get a metric weighing TPs and both types of errors (FP and FN):
- F1 score, which is the harmonic mean of precision and recall.
- G-measure, which is the geometric mean of precision and recall. Compared to F1, I’ve found it a bit better for imbalanced data.
- Jaccard index, which you can think of as TP/(TP+FP+FN). This is actually the metric that has worked best for me.
Note: For imbalanced datasets, it is best to have your metrics be macro-averaged.
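As a small worked example, the three metrics can be computed directly from confusion-matrix counts. The helper name imbalance_metrics is illustrative only, and the sketch assumes tp > 0 so that no denominator is zero.

```python
import math


def imbalance_metrics(tp, fp, fn):
    """Precision/recall based metrics from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        # Harmonic mean of precision and recall.
        "f1": 2 * precision * recall / (precision + recall),
        # Geometric mean of precision and recall.
        "g_measure": math.sqrt(precision * recall),
        # TP / (TP + FP + FN).
        "jaccard": tp / (tp + fp + fn),
    }


scores = imbalance_metrics(tp=30, fp=10, fn=20)
```

None of these metrics uses the true-negative count, which is why they remain informative when negatives dominate the data.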
Final intuition for metric selection:
- Use precision and recall to focus on a small positive class. When the positive class is smaller and the ability to correctly detect positive samples is the main focus (correct detection of negative examples is less important to the problem), use precision and recall.
- Use ROC when detection of both classes is equally important. When we want to give equal weight to the prediction ability of both classes, look at the ROC curve.
- Use ROC when the positives are the majority, or switch the labels and use precision and recall. When the positive class is larger, the ROC metrics are probably the better choice, because precision and recall would mostly reflect the ability to predict the positive class and not the negative class, which is naturally harder to detect due to the smaller number of samples. If the negative class (the minority in this case) is more important, switch the labels and use precision and recall (switching the labels can change everything).
-
dsa.da.util_model.
model_catboost_classifier
(Xtrain, Ytrain, Xcolname=None, pars={'iterations': 1000, 'learning_rate': 0.1, 'loss_function': 'MultiClass', 'random_seed': 0}, isprint=0)[source]¶ CatBoost usage example:
    from catboost import Pool, CatBoostClassifier
    TRAIN_FILE = '../data/cloudness_small/train_small'
    TEST_FILE = '../data/cloudness_small/test_small'
    CD_FILE = '../data/cloudness_small/train.cd'
    # Load data from files to Pool
    train_pool = Pool(TRAIN_FILE, column_description=CD_FILE)
    test_pool = Pool(TEST_FILE, column_description=CD_FILE)
    # Initialize CatBoostClassifier
    model = CatBoostClassifier(iterations=2, learning_rate=1, depth=2, loss_function='MultiClass')
    # Fit model
    model.fit(train_pool)
    # Get predicted classes
    preds_class = model.predict(test_pool)
    # Get predicted probabilities for each class
    preds_proba = model.predict_proba(test_pool)
    # Get predicted RawFormulaVal
    preds_raw = model.predict(test_pool, prediction_type='RawFormulaVal')
https://tech.yandex.com/catboost/doc/dg/concepts/python-usages-examples-docpage/
-
class
dsa.da.util_model.
model_template1
(alpha=0.5, low_y_cut=-0.09, high_y_cut=0.09, ww0=0.95)[source]¶
-
dsa.da.util_model.
pd_dim_reduction
(df, colname, colprefix='colsvd', method='svd', dimpca=2, model_pretrain=None, return_val='dataframe, param')[source]¶ Dimension reduction techniques.
Example: dftext_svd, svd = pd_dim_reduction(dfcat_test, None, colprefix="colsvd", method="svd", dimpca=2, return_val="dataframe,param")
Parameters: - df –
- colname –
- colprefix –
- method –
- dimpca –
- return_val –
Returns:
-
dsa.da.util_model.
sk_cluster
(Xmat, method='kmode', args=(), kwds={'metric': 'euclidean', 'min_cluster_size': 150, 'min_samples': 3}, isprint=1, preprocess={'norm': False})[source]¶ Supported methods with example arguments:
- 'hdbscan', (), kwds={'metric': 'euclidean', 'min_cluster_size': 150, 'min_samples': 3}
- 'kmodes', (), kwds={'n_clusters': 2, 'n_init': 5, 'init': 'Huang', 'verbose': 1}
- 'kmeans', kwds={'n_clusters': nbcluster}
Example:
    Xmat[Xcluster == 5]
    # HDBSCAN clustering
    Xcluster_hdbscan = da.sk_cluster_algo_custom(Xtrain_d, hdbscan.HDBSCAN, (), {'metric': 'euclidean', 'min_cluster_size': 150, 'min_samples': 3})
    print(len(np.unique(Xcluster_hdbscan)))
    Xcluster_use = Xcluster_hdbscan
    # Calculate distribution for each cluster
    kde = da.plot_distribution_density(Y[Xcluster_use == 2], kernel='gaussian', N=200, bandwith=1 / 500.)
    kde.sample(5)
-
dsa.da.util_model.
sk_feature_concept_shift
(df)[source]¶ - (X,y) distribution relation is shifting.
https://dkopczyk.quantee.co.uk/covariate_shift/
Parameters: df (TYPE) – DESCRIPTION. Returns: Return type: None.
-
dsa.da.util_model.
sk_feature_covariate_shift
(dftrain, dftest, colname, nsample=10000)[source]¶ - X is drifting
Parameters: - dftrain (TYPE) – DESCRIPTION.
- dftest (TYPE) – DESCRIPTION.
- colname (TYPE) – DESCRIPTION.
- nsample (TYPE, optional) – DESCRIPTION. The default is 10000.
Returns: drop_list – DESCRIPTION.
Return type: TYPE
-
dsa.da.util_model.
sk_feature_impt
(clf, colname, model_type='logistic')[source]¶ - Feature importance with colname
Parameters: - clf – model or colnum with weights
- colname –
Returns:
-
dsa.da.util_model.
sk_feature_prior_shift
(df)[source]¶ - Label is drifting
https://dkopczyk.quantee.co.uk/covariate_shift/
Parameters: df (TYPE) – DESCRIPTION. Returns: Return type: None.
-
dsa.da.util_model.
sk_metric_roc_optimal_cutoff
(ytest, ytest_proba)[source]¶ Find the optimal probability cutoff point for a classification model related to event rate.
Parameters: - ytest – Matrix with dependent or target data, where rows are observations
- ytest_proba – Matrix with predicted data, where rows are observations
Returns: the optimal cutoff value
Example:
    # Add the prediction to the dataframe by applying the threshold
    data['pred'] = data['pred_proba'].map(lambda x: 1 if x > threshold else 0)
    # Print confusion matrix
    from sklearn.metrics import confusion_matrix
    confusion_matrix(data['admit'], data['pred'])
    # array([[175, 98],
    #        [ 46, 81]])
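The reference does not state which criterion defines "optimal". One common choice, used here purely as an assumption, is the threshold that maximises Youden's J statistic (TPR minus FPR). The sketch is written in plain Python; roc_optimal_cutoff is a hypothetical name, and the labels are assumed to be 0/1 with both classes present.

```python
def roc_optimal_cutoff(y_true, y_proba):
    """Return the probability threshold that maximises TPR - FPR."""
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    best_threshold, best_j = None, float("-inf")
    for threshold in sorted(set(y_proba)):
        # Predict positive when the probability is at least the threshold.
        tp = sum(1 for y, p in zip(y_true, y_proba) if p >= threshold and y == 1)
        fp = sum(1 for y, p in zip(y_true, y_proba) if p >= threshold and y == 0)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_threshold, best_j = threshold, j
    return best_threshold


y_true = [0, 0, 0, 1, 1, 0, 1, 1]
y_proba = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9]
cutoff = roc_optimal_cutoff(y_true, y_proba)
```

On ties the lowest threshold wins, which favours recall; a different tie-breaking rule would be equally defensible.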
-
dsa.da.util_model.
sk_model_eval_classification_cv
(clf, X, y, test_size=0.5, ncv=1, method='random')[source]¶ Parameters: - clf –
- X –
- y –
- test_size –
- ncv –
- method –
Returns:
-
dsa.da.util_model.
sk_params_search_best
(clf, X, y, param_grid={'alpha': array([0., 0.25, 0.5, 0.75, 1. ])}, method='gridsearch', param_search={'cv': 5, 'generations_number': 3, 'population_size': 5, 'scorename': 'r2'})[source]¶ Genetic: population_size=5, ngene_mutation_prob=0.10, gene_crossover_prob=0.5, tournament_size=3, generations_number=3
param X: param y: param clf: param param_grid: param method: param param_search: return:
dsa.da.util_text¶
Methods for feature extraction and preprocessing util_feature: input/output is pandas
If you need to compute tf-idf scores on documents within your “training” dataset, use TfidfVectorizer. If you need to compute tf-idf scores on documents outside your “training” dataset, use either one, both will work.
- feature_extraction.text.CountVectorizer([…]) Convert a collection of text documents to a matrix of token counts
- feature_extraction.text.HashingVectorizer([…]) Convert a collection of text documents to a matrix of token occurrences
- feature_extraction.text.TfidfVectorizer([…]) Convert a collection of raw documents to a matrix of TF-IDF features.
-
dsa.da.util_text.
pd_coltext_countvect
(df, coltext, word_tokeep=None, word_minfreq=1, return_val='dataframe, param')[source]¶ Function that adds count of a given column for words in a text corpus. :param df: original dataframe :param word_tokeep: corpus of words to look into :param coltext: column of df to apply tf-idf to
Returns: dataframe with a new column for each word https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html Return type: concat_df
-
dsa.da.util_text.
pd_coltext_hashing
(df, coltext, n_features=20)[source]¶ Function that adds hashed token features of a given column for words in a text corpus. :param df: original dataframe :param coltext: column of df to apply hashing to :param n_features: number of hashed feature columns
Returns: dataframe with a new column for each word Return type: concat_df
-
dsa.da.util_text.
pd_coltext_minhash
(dfref, colname, n_component=2, model_pretrain_dict=None, return_val='dataframe, param')[source]¶ - dfhash, colcat_hash_param = pd_colcat_minhash(df, colcat, n_component=[2] * len(colcat),
- return_val=”dataframe,param”)
Parameters: - dfref –
- colname –
- n_component –
- return_val –
Returns:
-
dsa.da.util_text.
pd_coltext_tdidf
(df, coltext, word_tokeep=None, word_minfreq=1, return_val='dataframe, param')[source]¶ Function that adds tf-idf of a given column for words in a text corpus. :param df: original dataframe :param word_tokeep: corpus of words to look into :param coltext: column of df to apply tf-idf to
Returns: dataframe with a new column for each word https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html Return type: concat_df
dsa.da.util_stat¶
Methods for ML models, model ensembles, metrics etc. util_model : input/output is numpy
-
dsa.da.util_stat.
np_conditional_entropy
(x, y)[source]¶ Calculates the conditional entropy of x given y: S(x|y) Wikipedia: https://en.wikipedia.org/wiki/Conditional_entropy Returns: float :param x: A sequence of measurements :type x: list / NumPy ndarray / Pandas Series :param y: A sequence of measurements :type y: list / NumPy ndarray / Pandas Series
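For intuition, S(x|y) can be computed from joint and marginal frequencies as the sum over observed pairs of p(x, y) · log(p(y) / p(x, y)). The sketch below uses the natural logarithm, which is an assumption about the library's convention, and the function name is illustrative.

```python
import math
from collections import Counter


def conditional_entropy(x, y):
    """Entropy of x given y, in nats, estimated from paired observations."""
    total = len(x)
    y_counts = Counter(y)
    xy_counts = Counter(zip(x, y))
    entropy = 0.0
    for (_, y_value), count in xy_counts.items():
        p_xy = count / total
        p_y = y_counts[y_value] / total
        entropy += p_xy * math.log(p_y / p_xy)
    return entropy


x = ["a", "b", "a", "b"]
y = ["u", "u", "v", "v"]
independent = conditional_entropy(x, y)   # y tells us nothing about x
determined = conditional_entropy(x, x)    # x is fully known given itself
```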
-
dsa.da.util_stat.
np_correl_cat_cat_cramers_v
(x, y)[source]¶ Calculates Cramer’s V statistic for categorical-categorical association. Uses the bias correction from Wicher Bergsma, Journal of the Korean Statistical Society 42 (2013). This is a symmetric coefficient: V(x,y) = V(y,x) Original function taken from: https://stackoverflow.com/a/46498792/5863503 Wikipedia: https://en.wikipedia.org/wiki/Cram%C3%A9r%27s_V Returns: float in the range of [0,1] :param x: A sequence of categorical measurements :type x: list / NumPy ndarray / Pandas Series :param y: A sequence of categorical measurements :type y: list / NumPy ndarray / Pandas Series
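The bias-corrected statistic can be sketched in plain Python as follows. This is an illustration under stated assumptions: cramers_v_corrected is a hypothetical name, the chi-squared statistic is the plain Pearson statistic with no continuity correction (library routines may apply one for 2x2 tables), and at least two observations are required.

```python
import math
from collections import Counter


def cramers_v_corrected(x, y):
    """Bias-corrected Cramer's V for two paired categorical sequences."""
    n = len(x)
    x_counts, y_counts = Counter(x), Counter(y)
    xy_counts = Counter(zip(x, y))
    # Pearson chi-squared over the full contingency table.
    chi2 = 0.0
    for x_value, x_count in x_counts.items():
        for y_value, y_count in y_counts.items():
            expected = x_count * y_count / n
            observed = xy_counts.get((x_value, y_value), 0)
            chi2 += (observed - expected) ** 2 / expected
    phi2 = chi2 / n
    r, k = len(x_counts), len(y_counts)
    # Bias correction of phi^2 and of the table dimensions.
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denominator = min(k_corr - 1, r_corr - 1)
    return 0.0 if denominator <= 0 else math.sqrt(phi2_corr / denominator)


identical = cramers_v_corrected(["a", "b"] * 10, ["a", "b"] * 10)
unrelated = cramers_v_corrected(["a", "a", "b", "b"] * 5, ["u", "v", "u", "v"] * 5)
```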
-
dsa.da.util_stat.
np_correl_cat_cat_theils_u
(x, y)[source]¶ Calculates Theil’s U statistic (Uncertainty coefficient) for categorical-categorical association. This is the uncertainty of x given y: value is on the range of [0,1] - where 0 means y provides no information about x, and 1 means y provides full information about x. This is an asymmetric coefficient: U(x,y) != U(y,x) Wikipedia: https://en.wikipedia.org/wiki/Uncertainty_coefficient Returns: float in the range of [0,1] :param x: A sequence of categorical measurements :type x: list / NumPy ndarray / Pandas Series :param y: A sequence of categorical measurements :type y: list / NumPy ndarray / Pandas Series
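The asymmetry of Theil's U is easy to see in a small example. The sketch below computes U(x|y) = (S(x) - S(x|y)) / S(x) with natural logarithms; the function names are illustrative, and returning 1.0 when x has zero entropy is a convention assumed here.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy in nats."""
    total = len(values)
    return -sum((c / total) * math.log(c / total) for c in Counter(values).values())


def conditional_entropy(x, y):
    """Entropy of x given y, in nats."""
    total = len(x)
    y_counts = Counter(y)
    result = 0.0
    for (_, y_value), count in Counter(zip(x, y)).items():
        result += (count / total) * math.log(y_counts[y_value] / count)
    return result


def theils_u(x, y):
    """Fraction of the uncertainty in x that is removed by knowing y."""
    s_x = entropy(x)
    if s_x == 0:
        return 1.0  # x is constant, so nothing is left to explain
    return (s_x - conditional_entropy(x, y)) / s_x


x = ["a", "a", "b", "b"]
y = ["p", "q", "r", "s"]
u_x_given_y = theils_u(x, y)  # y identifies x completely
u_y_given_x = theils_u(y, x)  # x only halves the uncertainty about y
```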
-
dsa.da.util_stat.
np_correl_cat_num_ratio
(cat_array, num_array)[source]¶ Calculates the Correlation Ratio (sometimes marked by the greek letter Eta) for categorical-continuous association. Answers the question - given a continuous value of a measurement, is it possible to know which category is it associated with? Value is in the range [0,1], where 0 means a category cannot be determined by a continuous measurement, and 1 means a category can be determined with absolute certainty. Wikipedia: https://en.wikipedia.org/wiki/Correlation_ratio Returns: float in the range of [0,1] :param cat_array: :type cat_array: list / NumPy ndarray / Pandas Series A sequence of categorical measurements :param num_array: :type num_array: list / NumPy ndarray / Pandas Series A sequence of continuous measurements
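The correlation ratio is the square root of the between-category variance divided by the total variance. A minimal plain-Python sketch, with an illustrative function name and the convention (assumed here) of returning 0.0 when the measurements have no variance:

```python
import math
from collections import defaultdict


def correlation_ratio(categories, measurements):
    """Eta: sqrt(between-group sum of squares / total sum of squares)."""
    groups = defaultdict(list)
    for category, value in zip(categories, measurements):
        groups[category].append(value)
    overall_mean = sum(measurements) / len(measurements)
    between = sum(len(g) * (sum(g) / len(g) - overall_mean) ** 2
                  for g in groups.values())
    total = sum((v - overall_mean) ** 2 for v in measurements)
    return 0.0 if total == 0 else math.sqrt(between / total)


separated = correlation_ratio(["a", "a", "b", "b"], [1, 1, 3, 3])
mixed = correlation_ratio(["a", "b", "a", "b"], [1, 1, 3, 3])
```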
-
dsa.da.util_stat.
np_transform_pca
(X, dimpca=2, whiten=True)[source]¶ Project ndim data into dimpca sub-space
-
dsa.da.util_stat.
pd_num_correl_associations
(df, colcat=None, mark_columns=False, theil_u=False, plot=True, return_results=False, **kwargs)[source]¶ Calculate the correlation/strength-of-association of features in data-set with both categorical (eda_tools) and continuous features using:
- Pearson’s R for continuous-continuous cases
- Correlation Ratio for categorical-continuous cases
- Cramer’s V or Theil’s U for categorical-categorical cases
Returns: a DataFrame of the correlation/strength-of-association between all features
Example: see associations_example under dython.examples
Parameters: - df (NumPy ndarray / Pandas DataFrame) – The data-set for which the features’ correlation is computed
- colcat – Names of columns of the data-set which hold categorical values. Can also be the string ‘all’ to state that all columns are categorical, or None (default) to state none are categorical
- mark_columns (Boolean, default = False) – if True, output’s columns’ names will have a suffix of ‘(nom)’ or ‘(con)’ based on their type (eda_tools or continuous), as provided by colcat
- theil_u (Boolean, default = False) – In the case of categorical-categorical features, use Theil’s U instead of Cramer’s V
- plot (Boolean, default = True) – If True, plot a heat-map of the correlation matrix
- return_results (Boolean, default = False) – If True, the function will return a Pandas DataFrame of the computed associations
- kwargs (any key-value pairs) – Arguments to be passed to used function and methods
-
dsa.da.util_stat.
sk_distribution_kernel_bestbandwidth
(X, kde)[source]¶ Find the best bandwidth for a given kernel :param kde: :return:
-
dsa.da.util_stat.
sk_distribution_kernel_sample
(kde=None, n=1)[source]¶ kde = sm.nonparametric.KDEUnivariate(np.array(Y[Y_cluster==0],dtype=np.float64)) kde = sm.nonparametric.KDEMultivariate() # … you already did this
-
dsa.da.util_stat.
stat_hypothesis_test_permutation
(df, variable, classes, repetitions)[source]¶ Test whether two numerical samples come from the same underlying distribution, using the absolute difference between the means.
- df: table containing the sample
- variable: label of column containing the numerical variable
- classes: label of column containing names of the two samples
- repetitions: number of random permutations
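A permutation test of this kind can be sketched with the standard library. The version below is an illustration, not the library's code: permutation_test is a hypothetical helper that takes plain lists instead of a dataframe, the seed parameter is added for reproducibility, and the empirical p-value is taken as the share of shuffles whose statistic is at least as large as the observed one.

```python
import random


def permutation_test(values, labels, repetitions, seed=0):
    """Return (observed |mean difference|, empirical p-value)."""
    first_label = sorted(set(labels))[0]

    def abs_mean_difference(current_labels):
        group_a = [v for v, lab in zip(values, current_labels) if lab == first_label]
        group_b = [v for v, lab in zip(values, current_labels) if lab != first_label]
        return abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))

    observed = abs_mean_difference(labels)
    rng = random.Random(seed)
    shuffled = list(labels)
    at_least_as_extreme = 0
    for _ in range(repetitions):
        # Shuffling the labels simulates the null hypothesis that the
        # group assignment is unrelated to the measured value.
        rng.shuffle(shuffled)
        if abs_mean_difference(shuffled) >= observed:
            at_least_as_extreme += 1
    return observed, at_least_as_extreme / repetitions


values = [1, 2, 3, 11, 12, 13]
labels = ["a", "a", "a", "b", "b", "b"]
observed, p_value = permutation_test(values, labels, repetitions=1000)
```

A small p-value suggests that a difference as large as the observed one rarely arises from random labelling alone.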
dsa.da.util_date¶
Example:
    import datetime
    datetime.datetime.strptime('20-Nov-2002', '%d-%b-%Y').strftime('%Y%m%d')
    # '20021120'
Formats:
- %d - 2 digit date
- %b - 3-letter month abbreviation
- %Y - 4 digit year
- %m - 2 digit month
- %a - abbreviated weekday name
    df = DataFrame(dict(date=date_range('20130101', periods=10)))
https://python-utils.readthedocs.io/en/latest/usage.html#quickstart
https://dateutil.readthedocs.io/en/stable/examples.html
-
dsa.da.util_date.
datestring_todatetime
(datelist, fmt='%Y-%m-%d %H:%M:%S')[source]¶ - Parsing date ‘Jun 1 2005 1:33PM’, ‘%b %d %Y %I:%M%p’
Parameters: - datelist –
- fmt –
Returns:
-
dsa.da.util_date.
datetime_tostring
(datelist, fmt='%Y-%m-%d %H:%M:%S')[source]¶ https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior :param x: :param fmt: :return:
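The two conversions map directly onto datetime.strptime and datetime.strftime. The sketch below shows a round trip using the format string quoted above; parse_dates and format_dates are hypothetical stand-ins for the documented functions, assumed here to operate element-wise on a list, and month/AM-PM names are parsed according to the current locale.

```python
from datetime import datetime


def parse_dates(date_strings, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse each string in the list with the given format."""
    return [datetime.strptime(text, fmt) for text in date_strings]


def format_dates(dates, fmt="%Y-%m-%d %H:%M:%S"):
    """Format each datetime in the list with the given format."""
    return [moment.strftime(fmt) for moment in dates]


parsed = parse_dates(["Jun 1 2005 1:33PM"], fmt="%b %d %Y %I:%M%p")
formatted = format_dates(parsed)
```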
dsa.da.util¶
Various utilities
-
dsa.da.util.
load
(filename='/folder1/keyname', isabsolutpath=0, encoding1='utf-8')[source]¶ pickle load :param filename: :param isabsolutpath: :param encoding1: :return:
-
dsa.da.util.
load_arguments
(config_file=None, arg_list=None)[source]¶ Load CLI input, load config.toml , overwrite config.toml by CLI Input [{}, {}]
-
dsa.da.util.
logger_setup
(logger_name=None, log_file=None, formatter=<logging.Formatter object>, isrotate=False, isconsole_output=True, logging_level=10)[source]¶ Example:
    my_logger = util_log.logger_setup("my module name", log_file="")
    APP_ID = util_log.create_appid(__file__)
    def log(*argv):
        my_logger.info(",".join([str(x) for x in argv]))
-
dsa.da.util.
save
(obj, filename='/folder1/keyname', isabsolutpath=0)[source]¶ Pickle saving :param obj: :param filename: :param isabsolutpath: :return:
-
dsa.da.util.
save_all
(variable_list, folder, globals_main=None)[source]¶ Pickle saving batch :param variable_list: :param folder: :param globals_main: :return:
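A pickle round trip of the kind these helpers wrap can be sketched as follows. The functions below are simplified stand-ins: they take a full file path and ignore the isabsolutpath handling, whose exact behaviour is not described in this reference.

```python
import os
import pickle
import tempfile


def save(obj, filename):
    """Serialise obj to filename with pickle."""
    with open(filename, "wb") as handle:
        pickle.dump(obj, handle)


def load(filename, encoding="utf-8"):
    """Read a pickled object back from filename."""
    with open(filename, "rb") as handle:
        return pickle.load(handle, encoding=encoding)


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "keyname.pkl")
    save({"bins": 5, "method": "uniform"}, path)
    restored = load(path)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.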
-
dsa.da.util.
sk_tree_get_ifthen
(tree, feature_names, target_names, spacer_base=' ')[source]¶ Produce pseudo-code for a decision tree. tree – scikit-learn DecisionTree. feature_names – list of feature names. target_names – list of target (output) names. spacer_base – used for spacing code (default: ” “).