Machine Learning¶
CivisML uses the Civis Platform to train machine learning models and parallelize their predictions over large datasets. It contains best-practice models for general-purpose classification and regression modeling as well as model quality evaluations and visualizations. All CivisML models use the scikit-learn API for interoperability with other platforms and to allow you to leverage resources in the open-source software community when creating machine learning models.
Optional Dependencies¶
You do not need any external libraries installed to use CivisML, but the following pip-installable dependencies enhance the capabilities of the ModelPipeline:
- pandas
- scikit-learn
- glmnet
- feather-format
- civisml-extensions
- muffnn
Install pandas if you wish to download tables of predictions. With pandas installed, you can also model on DataFrame objects in your interpreter.
If you wish to use the ModelPipeline code to model on DataFrame objects in your local environment, the feather-format package (requires pandas >= 0.20) will improve data transfer speeds and guarantee that your data types are correctly detected by CivisML. You must install feather-format if you wish to use pd.Categorical columns in your DataFrame objects, since that type information is lost when writing data as a CSV.
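To see why a CSV cannot carry categorical type information, the following sketch round-trips a small DataFrame through an in-memory CSV using only pandas. The column names and values are made up for illustration, and no CivisML call is involved.

```python
import io

import pandas as pd

# Hypothetical data: one categorical column and one numeric column.
df = pd.DataFrame({
    "color": pd.Categorical(["red", "blue", "red"],
                            categories=["red", "blue", "green"]),
    "age": [31, 45, 27],
})

# Write to CSV and read back. A CSV stores only text, not dtypes.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
roundtrip = pd.read_csv(buffer)

was_categorical = isinstance(df["color"].dtype, pd.CategoricalDtype)
still_categorical = isinstance(roundtrip["color"].dtype, pd.CategoricalDtype)
print(was_categorical, still_categorical)
```

The values survive the round trip, but the categorical dtype (including the unused "green" level) does not.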
If you wish to use custom models or download trained models, you’ll need scikit-learn installed.
Several pre-defined models rely on public Civis Analytics libraries. The “sparse_logistic”, “sparse_linear_regressor”, “sparse_ridge_regressor”, “stacking_classifier”, and “stacking_regressor” models all use the glmnet library. Pre-defined MLP models (“multilayer_perceptron_classifier” and “multilayer_perceptron_regressor”) depend on the muffnn library. Finally, models which use the default CivisML ETL, along with models which use stacking or hyperband, depend on civisml-extensions. Install these packages if you wish to download the pre-defined models that depend on them.
Define Your Model¶
Start the modeling process by defining your model. Do this by creating an instance of the ModelPipeline class. Each ModelPipeline corresponds to a scikit-learn Pipeline which will run in Civis Platform. A Pipeline allows you to combine multiple modeling steps (such as missing value imputation and feature selection) into a single model. The Pipeline is treated as a unit; for example, cross-validation happens over all steps together.
You can define your model in two ways, either by selecting a pre-defined algorithm or by providing your own scikit-learn Pipeline or BaseEstimator object. Note that whichever option you choose, CivisML will pre-process your data using either its default ETL, or ETL that you provide (see Custom ETL).
If you have already trained a scikit-learn model outside of Civis Platform, you can register it with Civis Platform as a CivisML model so that you can score it using CivisML. Read Registering Models Trained Outside of Civis for how to do this.
Pre-Defined Models¶
You can use the following pre-defined models with CivisML. All models start by imputing missing values with the mean of non-null values in a column. The “sparse_*” models include a LASSO regression step (using the glmnet package) to do feature selection before passing data to the final model. In some models, CivisML uses default parameters different from those in scikit-learn, as indicated in the “Altered Defaults” column. All models also have random_state=42.
Name | Model Type | Algorithm | Altered Defaults |
---|---|---|---|
sparse_logistic | classification | LogisticRegression | C=499999950, tol=1e-08 |
gradient_boosting_classifier | classification | GradientBoostingClassifier | n_estimators=500, max_depth=2 |
random_forest_classifier | classification | RandomForestClassifier | n_estimators=500, max_depth=7 |
extra_trees_classifier | classification | ExtraTreesClassifier | n_estimators=500, max_depth=7 |
multilayer_perceptron_classifier | classification | muffnn.MLPClassifier | |
stacking_classifier | classification | civismlext.StackedClassifier | |
sparse_linear_regressor | regression | LinearRegression | |
sparse_ridge_regressor | regression | Ridge | |
gradient_boosting_regressor | regression | GradientBoostingRegressor | n_estimators=500, max_depth=2 |
random_forest_regressor | regression | RandomForestRegressor | n_estimators=500, max_depth=7 |
extra_trees_regressor | regression | ExtraTreesRegressor | n_estimators=500, max_depth=7 |
multilayer_perceptron_regressor | regression | muffnn.MLPRegressor | |
stacking_regressor | regression | civismlext.StackedRegressor | |
The “stacking_classifier” model stacks the “gradient_boosting_classifier” and “random_forest_classifier” predefined models together with a glmnet.LogitNet(alpha=0, n_splits=4, max_iter=10000, tol=1e-5, scoring='log_loss'). The models are combined using a Pipeline containing a Normalizer step, followed by LogisticRegressionCV with penalty='l2' and tol=1e-08. The “stacking_regressor” works similarly, stacking together the “gradient_boosting_regressor” and “random_forest_regressor” models and a glmnet.ElasticNet(alpha=0, n_splits=4, max_iter=10000, tol=1e-5, scoring='r2'), combining them using NonNegativeLinearRegression. The estimators that are being stacked have the same names as the associated pre-defined models, and the meta-estimator steps are named “meta-estimator”.
Note that although default parameters are provided for multilayer perceptron models, it is highly recommended that multilayer perceptrons be run using hyperband.
Custom Models¶
You can create your own Pipeline instead of using one of the pre-defined ones. Create the object and pass it as the model parameter of the ModelPipeline. Your model must follow the scikit-learn API, and you will need to include any dependencies as Custom Dependencies if they are not already installed in CivisML. Please check here for the available pre-installed libraries and their versions.
When you’re assembling your own model, make sure that you either add a missing value imputation step or that your data has no missing values. If you’re making a classification model, the model must have a predict_proba method. If the class you’re using doesn’t have a predict_proba method, you can add one by wrapping it in a CalibratedClassifierCV.
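The two requirements above (an imputation step and a predict_proba method) can be sketched with plain scikit-learn. The step names, the synthetic data, and the helper as_civisml_model are illustrative assumptions; only the ModelPipeline(model, dependent_variable=...) call follows the usage shown elsewhere on this page.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# LinearSVC has no predict_proba, so wrap it in CalibratedClassifierCV.
custom_model = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("classify", CalibratedClassifierCV(LinearSVC(), cv=3)),
])

# Small synthetic dataset with a few missing values, for a local smoke test.
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[::10, 2] = np.nan

custom_model.fit(X, y)
probabilities = custom_model.predict_proba(X)


def as_civisml_model(pipeline, dependent_variable):
    """Hypothetical helper: needs the civis package and a Civis API key."""
    from civis.ml import ModelPipeline
    return ModelPipeline(pipeline, dependent_variable=dependent_variable)
```

Checking the pipeline locally on a small sample like this is a cheap way to catch a missing imputer or a missing predict_proba before starting a Platform job.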
Custom ETL¶
By default, CivisML pre-processes data using the DataFrameETL class, with cols_to_drop equal to the excluded_columns parameter. You can replace this with your own ETL by creating an object of class BaseEstimator and passing it as the etl parameter during training.
By default, DataFrameETL automatically one-hot encodes all categorical columns in the dataset. If you are passing a custom ETL estimator, you will have to ensure that no categorical columns remain after the transform method is called on the dataset.
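As a rough illustration of the requirement that no categorical columns remain after transform, here is a minimal custom ETL sketch built on pandas.get_dummies. The class name DummyEncodingETL and the example columns are hypothetical, and this is not how DataFrameETL itself is implemented.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DummyEncodingETL(BaseEstimator, TransformerMixin):
    """Hypothetical ETL step: one-hot encode every non-numeric column."""

    def fit(self, X, y=None):
        # Remember the expanded column set seen during training.
        self.columns_ = list(pd.get_dummies(X).columns)
        return self

    def transform(self, X):
        # Encode, then align to the training columns: unseen levels are
        # dropped and levels missing from X are filled with zeros.
        encoded = pd.get_dummies(X)
        return encoded.reindex(columns=self.columns_, fill_value=0)


train_df = pd.DataFrame({"state": ["IL", "NY", "IL"], "income": [50, 72, 64]})
new_df = pd.DataFrame({"state": ["NY", "CA"], "income": [58, 91]})

etl = DummyEncodingETL().fit(train_df)
transformed = etl.transform(new_df)
print(list(transformed.columns))
```

An instance could then be passed as the etl argument of ModelPipeline. Because the class is defined outside CivisML, it would presumably have to be installable in the CivisML environment (see Custom Dependencies).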
Hyperparameter Tuning¶
You can tune hyperparameters using one of two methods: grid search or hyperband. CivisML will perform grid search if you pass a dictionary of hyperparameters to the cross_validation_parameters parameter, where the keys are hyperparameter names, and the values are lists of hyperparameter values to grid search over. You can run hyperparameter tuning in parallel by setting the n_jobs parameter to however many jobs you would like to run in parallel. By default, n_jobs is dynamically calculated based on the resources available on your cluster, such that a modeling job will never take up more than 90% of the cluster resources at once.
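The size of a grid grows multiplicatively with each hyperparameter, which is what makes n_jobs relevant. The sketch below enumerates the combinations implied by a dictionary in the cross_validation_parameters format, using only the standard library. It illustrates the arithmetic and is not CivisML's internal implementation.

```python
from itertools import product

# Example grid in the format accepted by cross_validation_parameters.
grid = {
    "n_estimators": [100, 200, 500],
    "learning_rate": [0.01, 0.1],
    "max_depth": [2, 3],
}

# Enumerate every combination a grid search would have to evaluate.
names = sorted(grid)
combinations = [dict(zip(names, values))
                for values in product(*(grid[name] for name in names))]

print(len(combinations))
print(combinations[0])
```

Here 3 x 2 x 2 values give 12 candidate settings, each of which is cross-validated.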
Hyperband is an efficient approach to hyperparameter optimization, and recommended over grid search where possible. CivisML will perform hyperband optimization for a pre-defined model if you pass the string 'hyperband' to cross_validation_parameters. Hyperband is currently only supported for the following models: gradient_boosting_classifier, random_forest_classifier, extra_trees_classifier, multilayer_perceptron_classifier, stacking_classifier, gradient_boosting_regressor, random_forest_regressor, extra_trees_regressor, multilayer_perceptron_regressor, and stacking_regressor. Although hyperband is supported for stacking models, stacking itself is a kind of model tuning, and the combination of stacking and hyperband is likely too computationally intensive to be useful in many cases.
Hyperband cannot be used to tune GLMs. For this reason, preset GLMs do not have a hyperband option. Similarly, when cross_validation_parameters='hyperband' and the model is stacking_classifier or stacking_regressor, only the GBT and random forest steps of the stacker are tuned using hyperband. Note that if you want to use hyperband with a custom model, you will need to wrap your estimator in a civismlext.hyperband.HyperbandSearchCV estimator yourself.
CivisML runs pre-defined models with hyperband using the following distributions:
Models | Cost Parameter | Hyperband Distributions |
---|---|---|
gradient_boosting_classifier, gradient_boosting_regressor, GBT step in stacking_classifier, GBT step in stacking_regressor | n_estimators: min = 100, max = 1000 | max_depth: randint(low=1, high=5); max_features: [None, 'sqrt', 'log2', 0.5, 0.3, 0.1, 0.05, 0.01]; learning_rate: truncexpon(b=5, loc=.0003, scale=1./167.) |
random_forest_classifier, random_forest_regressor, extra_trees_classifier, extra_trees_regressor, RF step in stacking_classifier, RF step in stacking_regressor | n_estimators: min = 100, max = 1000 | criterion: ['gini', 'entropy']; max_features: truncexpon(b=10., loc=.01, scale=1./10.11); max_depth: [1, 2, 3, 4, 6, 10] |
multilayer_perceptron_classifier, multilayer_perceptron_regressor | n_epochs: min = 5, max = 50 | keep_prob: uniform(); hidden_units: [(), (16,), (32,), (64,), (64, 64), (64, 64, 64), (128,), (128, 128), (128, 128, 128), (256,), (256, 256), (256, 256, 256), (512, 256, 128, 64), (1024, 512, 256, 128)]; learning_rate: [1e-2, 2e-2, 5e-2, 8e-2, 1e-3, 2e-3, 5e-3, 8e-3, 1e-4] |
The truncated exponential distribution for the gradient boosting classifier and regressor was chosen to skew the distribution toward small values, ranging between .0003 and .03, with a mean close to .006. Similarly, the truncated exponential distribution for the random forest and extra trees models skews toward small values, ranging between .01 and 1, and with a mean close to .1.
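The quoted ranges and means can be checked with a few lines of arithmetic. This sketch assumes the scipy.stats.truncexpon parameterization (a standardized variable z on [0, b] with density exp(-z) / (1 - exp(-b)), and x = loc + scale * z) and uses the closed-form mean, so it needs only the standard library. The helper name truncexpon_summary is made up for this example.

```python
import math


def truncexpon_summary(b, loc, scale):
    """Return (low, high, mean) of a truncated exponential distribution.

    Assumes the scipy.stats.truncexpon parameterization: z lies in [0, b]
    with density exp(-z) / (1 - exp(-b)), and x = loc + scale * z.
    """
    low = loc
    high = loc + scale * b
    # E[z] = 1 - b / (exp(b) - 1) for the standardized variable.
    standardized_mean = 1.0 - b / math.expm1(b)
    return low, high, loc + scale * standardized_mean


gbt_low, gbt_high, gbt_mean = truncexpon_summary(b=5, loc=.0003, scale=1. / 167.)
rf_low, rf_high, rf_mean = truncexpon_summary(b=10., loc=.01, scale=1. / 10.11)

print(f"learning_rate: [{gbt_low:.4f}, {gbt_high:.4f}], mean {gbt_mean:.4f}")
print(f"max_features:  [{rf_low:.3f}, {rf_high:.3f}], mean {rf_mean:.3f}")
```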
Custom Dependencies¶
Installing packages from PyPI is straightforward. You can specify a dependencies argument to ModelPipeline which will install the dependencies in your runtime environment. VCS support is also enabled (see docs). Installing a remote git repository from, say, GitHub only requires passing the HTTPS URL in the form of, for example, git+https://github.com/scikit-learn/scikit-learn.
CivisML will run pip install [your package here]. We strongly encourage you to pin package versions for consistency. Example code looks like:
from civis.ml import ModelPipeline
from pyearth import Earth
deps = ['git+https://github.com/scikit-learn-contrib/py-earth.git@da856e11b2a5d16aba07f51c3c15cef5e40550c7']
est = Earth()
model = ModelPipeline(est, dependent_variable='age', dependencies=deps)
train = model.train(table_name='donors.from_march', database_name='client')
Additionally, you can store a remote git host’s API token in the Civis Platform as a credential to use for installing private git repositories. For example, you can go to GitHub at the https://github.com/settings/tokens URL, copy your token into the password field of a credential, and pass the credential name to the git_token_name argument in ModelPipeline. This also works with other hosting services. A simple example of how to do this with the API looks as follows:
import civis
password = 'abc123' # token copied from https://github.com/settings/tokens
username = 'user123' # Github username
git_token_name = 'Github credential'
client = civis.APIClient()
credential = client.credentials.post(password=password,
                                     username=username,
                                     name=git_token_name,
                                     type="Custom")
pipeline = civis.ml.ModelPipeline(..., git_token_name=git_token_name)
Note, installing private dependencies with submodules is not supported.
CivisML Versions¶
By default, CivisML uses its latest version in production. If you would like a specific version (e.g., for a production pipeline where pinning the CivisML version is desirable), ModelPipeline (both its constructor and the class method civis.ml.ModelPipeline.register_pretrained_model()) has the optional parameter civisml_version that accepts a string, e.g., 'v2.3' for CivisML v2.3. Please see here for the list of CivisML versions.
Asynchronous Execution¶
All calls to a ModelPipeline object are non-blocking, i.e. they immediately provide a result without waiting for the job in the Civis Platform to complete. Calls to civis.ml.ModelPipeline.train() and civis.ml.ModelPipeline.predict() return a ModelFuture object, which is a subclass of Future from the Python standard library. This behavior lets you train multiple models at once, or generate predictions from models, and still do other work while waiting for your jobs to complete.
The ModelFuture can find and retrieve outputs from your CivisML jobs, such as trained Pipeline objects or out-of-sample predictions. The ModelFuture only downloads outputs when you request them.
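Because ModelFuture subclasses the standard library Future, the non-blocking pattern can be tried out locally with concurrent.futures alone. In this sketch a thread pool and the made-up function fake_training_job stand in for Civis Platform jobs, so nothing here calls CivisML.

```python
import time
from concurrent import futures


def fake_training_job(name, seconds):
    """Stand-in for a remote job: sleeps instead of training a model."""
    time.sleep(seconds)
    return {"model": name, "seconds": seconds}


with futures.ThreadPoolExecutor(max_workers=2) as pool:
    # submit() returns a Future immediately, as train() and predict() do.
    jobs = [pool.submit(fake_training_job, "model_a", 0.2),
            pool.submit(fake_training_job, "model_b", 0.1)]
    still_running = not all(job.done() for job in jobs)

    # Other work can happen here. result() blocks only when it is called.
    results = [job.result() for job in jobs]

print(still_running, [r["model"] for r in results])
```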
Model Persistence¶
Civis Platform permanently stores all models, indexed by the job ID and the run ID (also called a “build”) of the training job. If you wish to use an existing model, call civis.ml.ModelPipeline.from_existing() with the job ID of the training job. You can find the job ID with the train_job_id attribute of a ModelFuture, or by looking at the URL of your model on the Civis Platform models page. If the training job has multiple runs, you may also provide a run ID to select a run other than the most recent. You can list all model runs of a training job by calling civis.APIClient().jobs.get(train_job_id)['runs']. You may also store the ModelPipeline itself with the pickle module.
Examples¶
Future objects have the method add_done_callback(). The callback you register is called as soon as the run completes. It takes a single argument, the Future for the completed job. You can use this method to chain jobs together:
from concurrent import futures
from civis.ml import ModelPipeline
import pandas as pd
df = pd.read_csv('data.csv')
training, predictions = [], []
model = ModelPipeline('sparse_logistic', dependent_variable='type')
training.append(model.train(df))
training[-1].add_done_callback(lambda fut: predictions.append(model.predict(df)))
futures.wait(training) # Blocks until all training jobs complete
futures.wait(predictions) # Blocks until all prediction jobs complete
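The same chaining pattern can be exercised without Civis Platform by using a thread pool as a stand-in. In this sketch the functions train, predict, and start_prediction are hypothetical placeholders. It waits on an Event set by the callback, on the assumption that a finished future does not by itself guarantee that its callbacks have already run.

```python
import threading
from concurrent import futures

pool = futures.ThreadPoolExecutor(max_workers=2)
predictions = []
prediction_submitted = threading.Event()


def train(data):
    """Stand-in for a training job: the 'model' is just the mean."""
    return sum(data) / len(data)


def predict(model, data):
    """Stand-in for a prediction job that uses the trained 'model'."""
    return [value - model for value in data]


def start_prediction(finished_training):
    # The callback receives the completed Future as its only argument.
    model = finished_training.result()
    predictions.append(pool.submit(predict, model, [1, 2, 3]))
    prediction_submitted.set()


training = pool.submit(train, [1, 2, 3])
training.add_done_callback(start_prediction)

# Wait for the callback itself, not just the training job, before
# reading the list that the callback fills in.
prediction_submitted.wait(timeout=10)
chained_result = predictions[0].result()
pool.shutdown()
print(chained_result)
```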
You can create and train multiple models at once to find the best approach for solving a problem. For example:
from civis.ml import ModelPipeline
algorithms = ['gradient_boosting_classifier', 'sparse_logistic', 'random_forest_classifier']
pkey = 'person_id'
depvar = 'likes_cats'
models = [ModelPipeline(alg, primary_key=pkey, dependent_variable=depvar) for alg in algorithms]
train = [model.train(table_name='schema.name', database_name='My DB') for model in models]
aucs = [tr.metrics['roc_auc'] for tr in train] # Code blocks here
Registering Models Trained Outside of Civis¶
Instead of using CivisML to train your model, you may train any scikit-learn-compatible model outside of Civis Platform and use civis.ml.ModelPipeline.register_pretrained_model() to register it as a CivisML model in Civis Platform. This will let you use Civis Platform to make predictions using your model, either to take advantage of distributed predictions on large datasets, or to create predictions as part of a workflow or service in Civis Platform.
When registering a model trained outside of Civis Platform, you are strongly advised to provide an ordered list of feature names used for training. This will allow CivisML to ensure that tables of data input for predictions have the correct features in the correct order. If your model has more than one output, you should also provide a list of output names so that CivisML knows how many outputs to expect and how to name them in the resulting table of model predictions.
If your model uses dependencies which aren’t part of the default CivisML execution environment, you must provide them to the dependencies parameter of the register_pretrained_model() function, just as with the ModelPipeline constructor.
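To illustrate why an ordered feature list matters, the sketch below fits a scikit-learn model locally, saves the training column order, and uses it to realign a scoring table whose columns arrive shuffled and with an extra ID column. The data and column names are invented for the example, and the realignment is done by hand here rather than by CivisML.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Hypothetical training data; the column names are illustrative only.
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.normal(size=(50, 3)), columns=["age", "income", "tenure"])
y = 2.0 * X["age"] - 1.0 * X["tenure"] + rng.normal(scale=0.1, size=50)

est = Lasso(alpha=0.01).fit(X, y)
features = list(X.columns)  # ordered feature names to register with the model

# A scoring table with an extra column and a different column order.
scoring = X[["tenure", "age", "income"]].assign(person_id=range(50))

# Selecting by the saved feature list restores the training order.
aligned = scoring[features]
predictions = est.predict(aligned)
```

The saved list is what you would pass as the features argument of register_pretrained_model().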
Sharing Models¶
Models produced by CivisML can’t be shared directly through the Civis Platform UI or API. The ml namespace provides functions which will let you share your CivisML models with other Civis Platform users. To share your models, use the functions
- put_models_shares_users()
- put_models_shares_groups()
- delete_models_shares_users()
- delete_models_shares_groups()
To find out what models a user has, use list_models().
Object and Function Reference¶
class civis.ml.ModelPipeline(model, dependent_variable, primary_key=None, parameters=None, cross_validation_parameters=None, model_name=None, calibration=None, excluded_columns=None, client=None, cpu_requested=None, memory_requested=None, disk_requested=None, notifications=None, dependencies=None, git_token_name=None, verbose=False, etl=None, civisml_version=None)
Interface for scikit-learn modeling in the Civis Platform
Each ModelPipeline corresponds to a scikit-learn Pipeline which will run in Civis Platform.
Note that this object can be safely pickled and unpickled, but it does not store the state of any attached APIClient object. An unpickled ModelPipeline will use the API key from the user’s environment.
Parameters:
- model : string or Estimator
Either the name of a pre-defined model (e.g. “sparse_logistic” or “gradient_boosting_classifier”) or else a pre-existing Estimator object.
- dependent_variable : string or List[str]
The dependent variable of the training dataset. For a multi-target problem, this should be a list of column names of dependent variables. Nulls in a single dependent variable will automatically be dropped.
- primary_key : string, optional
The unique ID (primary key) of the training dataset. This will be used to index the out-of-sample scores.
- parameters : dict, optional
Specify parameters for the final stage estimator in a predefined model, e.g. {'C': 2} for a “sparse_logistic” model.
- cross_validation_parameters : dict or string, optional
Options for cross validation. For grid search, supply a parameter grid as a dictionary, e.g., {'n_estimators': [100, 200, 500], 'learning_rate': [0.01, 0.1], 'max_depth': [2, 3]}. For hyperband, pass the string “hyperband”.
- model_name : string, optional
The prefix of the Platform modeling jobs. It will have “ Train” or “ Predict” added to become the Script title.
- calibration : {None, “sigmoid”, “isotonic”}
If not None, calibrate output probabilities with the selected method. Valid only with classification models.
- excluded_columns : array, optional
A list of columns which will be considered ineligible to be independent variables.
- client : APIClient, optional
If not provided, an APIClient object will be created from the CIVIS_API_KEY.
- cpu_requested : int, optional
Number of CPU shares requested in the Civis Platform for training jobs. 1024 shares = 1 CPU.
- memory_requested : int, optional
Memory requested from Civis Platform for training jobs, in MiB
- disk_requested : float, optional
Disk space requested on Civis Platform for training jobs, in GB
- notifications : dict
See post_custom() for further documentation about email and URL notification.
- dependencies : array, optional
List of packages to install from PyPI or git repository (e.g., GitHub or Bitbucket). If a private repo is specified, please include a git_token_name argument as well (see below). Make sure to pin dependencies to a specific version, since dependencies will be reinstalled during every training and predict job.
- git_token_name : str, optional
Name of remote git API token stored in Civis Platform as the password field in a custom platform credential. Used only when installing private git repositories.
- verbose : bool, optional
If True, supply debug outputs in Platform logs and make prediction child jobs visible.
- etl : Estimator, optional
Custom ETL estimator which overrides the default ETL, and is run before training and validation.
- civisml_version : str, optional
CivisML version to use for training and prediction. If not provided, the latest version in production is used.
Examples
>>> from civis.ml import ModelPipeline
>>> model = ModelPipeline('gradient_boosting_classifier', 'depvar',
...                       primary_key='voterbase_id')
>>> train = model.train(table_name='schema.survey_data',
...                     fit_params={'sample_weight': 'survey_weight'},
...                     database_name='My Redshift Cluster',
...                     oos_scores='scratch.survey_depvar_oos_scores')
>>> train
<ModelFuture at 0x11be7ae10 state=queued>
>>> train.running()
True
>>> train.done()
False
>>> df = train.table  # Read OOS scores from its Civis File. Blocking.
>>> meta = train.metadata  # Metadata from training run
>>> train.metrics['roc_auc']
0.88425
>>> pred = model.predict(table_name='schema.demographics_table',
...                      database_name='My Redshift Cluster',
...                      output_table='schema.predicted_survey_response',
...                      if_exists='drop')
>>> df_pred = pred.table  # Blocks until finished
# Modify the parameters of the base estimator in a default model:
>>> model = ModelPipeline('sparse_logistic', 'depvar',
...                       primary_key='voterbase_id',
...                       parameters={'C': 2})
# Grid search over hyperparameters in the base estimator:
>>> model = ModelPipeline('sparse_logistic', 'depvar',
...                       primary_key='voterbase_id',
...                       cross_validation_parameters={'C': [0.1, 1, 10]})
Attributes:
- estimator : Pipeline
The trained scikit-learn Pipeline
- train_result_ : ModelFuture
ModelFuture encapsulating this model’s training run
- state : str
Status of the training job (non-blocking)
Methods
train() Train the model on data in Civis Platform; outputs ModelFuture
predict() Make predictions on new data; outputs ModelFuture
from_existing() Class method; use to create a ModelPipeline from an existing model training run
classmethod from_existing(train_job_id, train_run_id='latest', client=None)
Create a ModelPipeline object from existing model IDs
Parameters:
- train_job_id : int
The ID of the CivisML job in the Civis Platform
- train_run_id : int or string, optional
Location of the model run, either
- an explicit run ID,
- “latest” : The most recent run
- “active” : The run designated by the training job’s “active build” parameter
- client : APIClient, optional
If not provided, an APIClient object will be created from the CIVIS_API_KEY.
Returns: ModelPipeline
A ModelPipeline which refers to a previously-trained model
Examples
>>> from civis.ml import ModelPipeline
>>> model = ModelPipeline.from_existing(job_id)
>>> model.train_result_.metrics['roc_auc']
0.843
predict(self, df=None, csv_path=None, table_name=None, database_name=None, manifest=None, file_id=None, sql_where=None, sql_limit=None, primary_key=Sentinel(), output_table=None, output_db=None, if_exists='fail', n_jobs=None, polling_interval=None, cpu=None, memory=None, disk_space=None, dvs_to_predict=None)
Make predictions on a trained model
Provide input through one of a DataFrame (df), a local CSV (csv_path), a Civis Table (table_name and database_name), a Civis File containing a CSV (file_id), or a Civis File containing a manifest file (manifest).
A “manifest file” is JSON which specifies the location of many shards of the data to be used for prediction. A manifest file is the output of a Civis export job with force_multifile=True set, e.g. from civis.io.civis_to_multifile_csv(). Large Civis Tables (provided using table_name) will automatically be exported to manifest files.
Prediction outputs will always be stored as gzipped CSVs in one or more Civis Files. You can find a list of File ID numbers for output files at the “output_file_ids” key in the metadata returned by the prediction job. Provide an output_table (and optionally an output_db, if it’s different from database_name) to copy these predictions into a Civis Table.
Parameters:
- df : pd.DataFrame, optional
A DataFrame of data for prediction. The DataFrame will be uploaded to a Civis file so that CivisML can access it. Note that the index of the DataFrame will be ignored; use df.reset_index() if you want your index column to be included with the data passed to CivisML. NB: You must install feather-format if your DataFrame contains Categorical columns, to ensure that CivisML preserves data types.
- csv_path : str, optional
The location of a CSV of data on the local disk. It will be uploaded to a Civis file.
- table_name : str, optional
The qualified name of the table containing your data
- database_name : str, optional
Name of the database holding the data, e.g., ‘My Redshift Cluster’.
- manifest : int, optional
ID for a manifest file stored as a Civis file. (Note: if the manifest is not a Civis Platform-specific manifest, like the one returned from civis.io.civis_to_multifile_csv(), this must be used in conjunction with table_name and database_name due to the need for column discovery via Redshift.)
- file_id : int, optional
If the data are a CSV stored in a Civis file, provide the integer file ID.
- sql_where : str, optional
A SQL WHERE clause used to scope the rows to be predicted
- sql_limit : int, optional
SQL LIMIT clause to restrict the size of the prediction set
- primary_key : str, optional
Primary key of the prediction table. Defaults to the primary key of the training data. Use None to indicate that the prediction data don’t have a primary key column.
- output_table : str, optional
The table in which to put the predictions.
- output_db : str, optional
Database of the output table. Defaults to the database of the input table.
- if_exists : {‘fail’, ‘append’, ‘drop’, ‘truncate’}
Action to take if the prediction table already exists.
- n_jobs : int, optional
Number of concurrent Platform jobs to use for multi-file / large table prediction. Defaults to None, which allows CivisML to dynamically calculate an appropriate number of workers to use (in general, as many as possible without using all resources in the cluster).
- polling_interval : float, optional
Check for job completion every this number of seconds. Do not set if using the notifications endpoint.
- cpu : int, optional
CPU shares requested by the user for a single job.
- memory : int, optional
RAM requested by the user for a single job.
- disk_space : float, optional
Disk space requested by the user for a single job.
- dvs_to_predict : list of str, optional
If this is a multi-output model, you may list a subset of dependent variables for which you wish to generate predictions. This list must be a subset of the original dependent_variable input. The scores for the returned subset will be identical to the scores which those outputs would have had if all outputs were written, but ignoring some of the model’s outputs will let predictions complete faster and use less disk space. The default is to produce scores for all DVs.
Returns: ModelFuture
classmethod register_pretrained_model(model, dependent_variable=None, features=None, primary_key=None, model_name=None, dependencies=None, git_token_name=None, skip_model_check=False, verbose=False, client=None, civisml_version=None)
Use a fitted scikit-learn model with CivisML scoring
Use this function to set up your own fitted scikit-learn-compatible Estimator object for scoring with CivisML. This function will upload your model to Civis Platform and store enough metadata about it that you can subsequently use it with a CivisML scoring job.
The only required input is the model itself, but you are strongly advised to also provide a list of feature names. Without a list of feature names, CivisML will have to assume that your scoring table contains only the features needed for scoring (perhaps also with a primary key column), all in the correct order.
Parameters:
- model : sklearn.base.BaseEstimator or int
The model object. This must be a fitted scikit-learn compatible Estimator object, or else the integer Civis File ID of a pickle or joblib-serialized file which stores such an object. If an Estimator object is provided, it will be uploaded to the Civis Files endpoint and set to be available indefinitely.
- dependent_variable : string or List[str], optional
The dependent variable of the training dataset. For a multi-target problem, this should be a list of column names of dependent variables.
- features : string or List[str], optional
A list of column names of features which were used for training. These will be used to ensure that tables input for prediction have the correct features in the correct order.
- primary_key : string, optional
The unique ID (primary key) of the scoring dataset
- model_name : string, optional
The name of the Platform registration job. It will have “ Predict” added to become the Script title for predictions.
- dependencies : array, optional
List of packages to install from PyPI or git repository (e.g., GitHub or Bitbucket). If a private repo is specified, please include a git_token_name argument as well (see below). Make sure to pin dependencies to a specific version, since dependencies will be reinstalled during every predict job.
- git_token_name : str, optional
Name of remote git API token stored in Civis Platform as the password field in a custom platform credential. Used only when installing private git repositories.
- skip_model_check : bool, optional
If you’re sure that your model will work with CivisML, but it will fail the comprehensive verification, set this to True.
- verbose : bool, optional
If True, supply debug outputs in Platform logs and make prediction child jobs visible.
- client : APIClient, optional
If not provided, an APIClient object will be created from the CIVIS_API_KEY.
- civisml_version : str, optional
CivisML version to use. If not provided, the latest version in production is used.
Returns: ModelPipeline
Examples
This example assumes that you already have training data X and y, where X is a DataFrame.
>>> from civis.ml import ModelPipeline
>>> from sklearn.linear_model import Lasso
>>> est = Lasso().fit(X, y)
>>> model = ModelPipeline.register_pretrained_model(
...     est, 'concrete', features=X.columns)
>>> model.predict(table_name='my.table', database_name='my-db')
train(self, df=None, csv_path=None, table_name=None, database_name=None, file_id=None, sql_where=None, sql_limit=None, oos_scores=None, oos_scores_db=None, if_exists='fail', fit_params=None, polling_interval=None, validation_data='train', n_jobs=None)
Start a Civis Platform job to train your model
Provide input through one of a DataFrame (df), a local CSV (csv_path), a Civis Table (table_name and database_name), or a Civis File containing a CSV (file_id).
Model outputs will always contain out-of-sample scores (accessible through ModelFuture.table on this function’s output), and you may choose to store these out-of-sample scores in a Civis Table with the oos_scores, oos_scores_db, and if_exists parameters.
Parameters:
- df : pd.DataFrame, optional
A DataFrame of training data. The DataFrame will be uploaded to a Civis file so that CivisML can access it. Note that the index of the DataFrame will be ignored; use df.reset_index() if you want your index column to be included with the data passed to CivisML. NB: You must install feather-format if your DataFrame contains Categorical columns, to ensure that CivisML preserves data types.
- csv_path : str, optional
The location of a CSV of data on the local disk. It will be uploaded to a Civis file.
- table_name : str, optional
The qualified name of the table containing the training set from which to build the model.
- database_name : str, optional
Name of the database holding the training set table used to build the model. E.g., ‘My Cluster Name’.
- file_id : int, optional
If the training data are stored in a Civis file, provide the integer file ID.
- sql_where : str, optional
A SQL WHERE clause used to scope the rows of the training set (used for table input only)
- sql_limit : int, optional
SQL LIMIT clause for querying the training set (used for table input only)
- oos_scores : str, optional
If provided, store out-of-sample predictions on training set data to this Redshift “schema.tablename”.
- oos_scores_db : str, optional
If not provided, store OOS predictions in the same database which holds the training data.
- if_exists : {‘fail’, ‘append’, ‘drop’, ‘truncate’}
Action to take if the out-of-sample prediction table already exists.
- fit_params : Dict[str, str]
Mapping from parameter names in the model’s fit method to the column names which hold the data, e.g. {'sample_weight': 'survey_weight_column'}.
- polling_interval : float, optional
Number of seconds to wait between checks for job completion. Do not set if using the notifications endpoint.
- validation_data : str, optional
Source for validation data. There are currently two options: ‘train’ (the default), which cross-validates over training data for validation; and ‘skip’, which skips the validation step.
- n_jobs : int, optional
Number of jobs to use for training and validation. Defaults to None, which allows CivisML to dynamically calculate an appropriate number of workers to use (in general, as many as possible without using all resources in the cluster). Increase n_jobs to parallelize over many hyperparameter combinations in grid search/hyperband, or decrease to use fewer computational resources at once.
Returns:
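The input rules for train described above can be sketched as a small pre-flight check. This is only an illustration: build_train_kwargs is a hypothetical helper, not part of civis, and rejecting sql_where or sql_limit for non-table input is a stricter choice than the text requires (it only says they are used for table input). The table, database, and column names are placeholders.

```python
def build_train_kwargs(df=None, csv_path=None, table_name=None,
                       database_name=None, file_id=None, sql_where=None,
                       sql_limit=None, fit_params=None):
    """Hypothetical helper: assemble keyword arguments for ModelPipeline.train."""
    # Exactly one of the four documented input sources should be given.
    sources = {
        "df": df is not None,
        "csv_path": csv_path is not None,
        "table_name": table_name is not None,
        "file_id": file_id is not None,
    }
    chosen = [name for name, given in sources.items() if given]
    if len(chosen) != 1:
        raise ValueError(f"Provide exactly one input source, got: {chosen}")
    # A Civis Table is identified by table_name together with database_name.
    if table_name is not None and database_name is None:
        raise ValueError("table_name requires database_name")
    # Assumption: treat SQL scoping on non-table input as an error.
    if table_name is None and (sql_where is not None or sql_limit is not None):
        raise ValueError("sql_where and sql_limit apply to table input only")

    candidates = {
        "df": df,
        "csv_path": csv_path,
        "table_name": table_name,
        "database_name": database_name,
        "file_id": file_id,
        "sql_where": sql_where,
        "sql_limit": sql_limit,
        "fit_params": fit_params,
    }
    # Drop unset values so train() falls back to its own defaults.
    return {key: value for key, value in candidates.items() if value is not None}


train_kwargs = build_train_kwargs(
    table_name="schema.training_table",
    database_name="my-db",
    sql_where="year >= 2015",
    fit_params={"sample_weight": "survey_weight_column"},
)
print(train_kwargs)
# With a ModelPipeline and Civis credentials available, one could then call:
# future = model.train(**train_kwargs)
```

Checking the arguments locally avoids starting a Platform job that would fail on ambiguous input.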
- class civis.ml.ModelFuture(job_id, run_id, train_job_id=None, train_run_id=None, polling_interval=None, client=None, poll_on_creation=True)
Encapsulates asynchronous execution of a CivisML job.
This object knows where to find modeling outputs from CivisML jobs. All data attributes are lazily retrieved and block on job completion.
This object can be pickled, but it does not store the state of the attached APIClient object. An unpickled ModelFuture will use the API key from the user’s environment.
Parameters: - job_id : int
ID of the modeling job
- run_id : int
ID of the modeling run
- train_job_id : int, optional
If not provided, this object is assumed to encapsulate a training job, and train_job_id will equal job_id.
- train_run_id : int, optional
If not provided, this object is assumed to encapsulate a training run, and train_run_id will equal run_id.
- polling_interval : int or float, optional
The number of seconds between API requests to check whether a result is ready. The default switches between a short interval if pubnub is not available and a long interval (as a backup to pubnub) if that library is installed.
- client : civis.APIClient, optional
If not provided, a civis.APIClient object will be created from the CIVIS_API_KEY.
- poll_on_creation : bool, optional
If True (the default), it will poll upon calling result() the first time. If False, it will wait the number of seconds specified in polling_interval from object creation before polling.
See also
civis.futures.CivisFuture
civis.futures.ContainerFuture
concurrent.futures.Future
Attributes: - metadata : dict, blocking
The metadata associated with this modeling job
- metrics : dict, blocking
Validation metrics from this job’s training
- validation_metadata : dict, blocking
Metadata from this modeling job’s validation run
- train_metadata : dict, blocking
Metadata from this modeling job’s training run (will be identical to metadata if this is a training run)
- estimator : sklearn.pipeline.Pipeline, blocking
The fitted scikit-learn Pipeline resulting from this model run
- table : pandas.DataFrame, blocking
The table output from this modeling job: out-of-sample predictions on the training set for a training job, or a table of predictions for a prediction job. If the prediction job was split into multiple files (this happens automatically for large tables), this attribute will provide only predictions for the first file.
- state : str
The current state of the Civis Platform run
- job_id : int
- run_id : int
- train_job_id : int
Container ID for the training job; identical to job_id if this is a training job.
- train_run_id : int
As train_job_id but for runs
- is_training : bool
True if this ModelFuture corresponds to a train-validate job.
Methods
- cancel(): Cancels the corresponding Platform job before completion
- succeeded(): (Non-blocking) Is the job a success?
- failed(): (Non-blocking) Did the job fail?
- cancelled(): (Non-blocking) Was the job cancelled?
- running(): (Non-blocking) Is the job still running?
- done(): (Non-blocking) Is the job finished?
- result(): (Blocking) Return the final status of the Civis Platform job.
- add_done_callback(self, fn)
Attaches a callable that will be called when the future finishes.
- Args:
- fn: A callable that will be called with this future as its only argument when the future completes or is cancelled. The callable will always be called by a thread in the same process in which it was added. If the future has already completed or been cancelled then the callable will be called immediately. These callables are called in the order that they were added.
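The callback ordering described above can be seen with a plain concurrent.futures.Future from the standard library, which ModelFuture lists under "See also". This sketch uses the stdlib class only, so that it runs without Civis credentials; a real ModelFuture completes when the Platform job finishes rather than through an explicit set_result call.

```python
from concurrent.futures import Future

calls = []

future = Future()
# Callbacks registered before completion run once the result is set,
# in the order in which they were added.
future.add_done_callback(lambda f: calls.append(("first", f.result())))
future.add_done_callback(lambda f: calls.append(("second", f.result())))
future.set_result("done")

# A callback added after completion is called immediately.
future.add_done_callback(lambda f: calls.append(("late", f.result())))

print(calls)
```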
- cancel(self)
Submit a request to cancel the container/script/run.
Returns: - bool
Whether or not the job is in a cancelled state.
- cancelled(self)
Return True if the future was cancelled.
- done(self)
Return True if the future was cancelled or finished executing.
- exception(self, timeout=None)
Return the exception raised by the call that the future represents.
- Args:
- timeout: The number of seconds to wait for the exception if the future isn’t done. If None, then there is no limit on the wait time.
- Returns:
- The exception raised by the call that the future represents or None if the call completed without raising.
- Raises:
- CancelledError: If the future was cancelled.
- TimeoutError: If the future didn’t finish executing before the given timeout.
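The three outcomes of exception() listed above can be sketched with standard-library futures. This is an illustration of the documented contract, not of the Civis implementation; the ValueError is a placeholder for whatever error a real job would raise.

```python
from concurrent.futures import Future, TimeoutError as FutureTimeoutError

# A future that completed without raising: exception() returns None.
ok = Future()
ok.set_result(42)
no_error = ok.exception()

# A future that failed: exception() returns the error instead of raising it.
bad = Future()
bad.set_exception(ValueError("placeholder failure"))
error = bad.exception()

# A future that is still pending: a finite timeout raises TimeoutError.
pending = Future()
try:
    pending.exception(timeout=0.01)
    timed_out = False
except FutureTimeoutError:
    timed_out = True

print(no_error, repr(error), timed_out)
```

Because exception() returns the error rather than raising it, it is a convenient way to inspect a failure without a try/except around result().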
- failed(self)
Return True if the Civis job failed.
- outputs(self)
Block on job completion and return a list of run outputs.
The method will only return run outputs for successful jobs. Failed jobs will raise an exception.
Returns: - list[dict]
List of run outputs from a successfully completed job.
Raises: - civis.base.CivisJobFailure
If the job fails.
- result(self, timeout=None)
Return the result of the call that the future represents.
- Args:
- timeout: The number of seconds to wait for the result if the future isn’t done. If None, then there is no limit on the wait time.
- Returns:
- The result of the call that the future represents.
- Raises:
- CancelledError: If the future was cancelled.
- TimeoutError: If the future didn’t finish executing before the given timeout.
- Exception: If the call raised then that exception will be raised.
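A minimal sketch of the result() behaviour above, using a standard-library thread pool in place of a Civis Platform job. The slow_job and failing_job functions are hypothetical stand-ins; with a ModelFuture the work runs remotely and the error type would come from Civis.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeoutError


def slow_job():
    """Stand-in for a long-running job."""
    time.sleep(0.5)
    return "finished"


def failing_job():
    """Stand-in for a job that fails."""
    raise RuntimeError("job failed")


with ThreadPoolExecutor(max_workers=2) as pool:
    slow = pool.submit(slow_job)
    failing = pool.submit(failing_job)

    # A short timeout expires before the job is done.
    try:
        slow.result(timeout=0.01)
        timed_out = False
    except FutureTimeoutError:
        timed_out = True

    # With no timeout, result() blocks until the job completes.
    value = slow.result()

    # If the call raised, result() re-raises that exception.
    try:
        failing.result()
        raised = None
    except RuntimeError as err:
        raised = str(err)

print(timed_out, value, raised)
```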
- running(self)
Return True if the future is currently executing.
- set_exception(self, exception)
Sets the result of the future as being the given exception.
This is adapted from https://github.com/python/cpython/blob/3.8/Lib/concurrent/futures/_base.py#L532-L545
This version does not try to change the _state or check that the initial _state is running, since the Civis implementation has _state depend on the Platform job state.
- set_result(self, result)
Sets the return value of work associated with the future.
This is adapted from https://github.com/python/cpython/blob/3.8/Lib/concurrent/futures/_base.py#L517-L530
This version does not try to change the _state or check that the initial _state is running, since the Civis implementation has _state depend on the Platform job state.
- set_running_or_notify_cancel(self)
Mark the future as running or process any cancel notifications.
Should only be used by Executor implementations and unit tests.
If the future has been cancelled (cancel() was called and returned True) then any threads waiting on the future completing (through calls to as_completed() or wait()) are notified and False is returned.
If the future was not cancelled then it is put in the running state (future calls to running() will return True) and True is returned.
This method should be called by Executor implementations before executing the work associated with this future. If this method returns False then the work should not be executed.
- Returns:
- False if the Future was cancelled, True otherwise.
- Raises:
- RuntimeError: if this method was already called or if set_result() or set_exception() was called.
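The return values and the RuntimeError described above can be observed on a standard-library Future. This sketch shows the behaviour of the stdlib base class only; as noted for set_result and set_exception, the Civis implementation ties its state to the Platform job, so a ModelFuture may not behave identically.

```python
from concurrent.futures import Future

# A pending future that is cancelled before any work starts.
cancelled = Future()
cancelled.cancel()
run_cancelled = cancelled.set_running_or_notify_cancel()  # work should be skipped

# A fresh future moves to the running state.
fresh = Future()
run_fresh = fresh.set_running_or_notify_cancel()
is_running = fresh.running()

# Calling the method a second time on the same future is an error.
try:
    fresh.set_running_or_notify_cancel()
    second_call_failed = False
except RuntimeError:
    second_call_failed = True

print(run_cancelled, run_fresh, is_running, second_call_failed)
```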
- succeeded(self)
Return True if the job completed in Civis with no error.
Set the permissions users have on this object
Use this on both training and scoring jobs. If used on a training job, note that “read” permission is sufficient to score the model.
Parameters: - id : integer
The ID of the resource that is shared.
- user_ids : list
An array of one or more user IDs.
- permission_level : string
Options are: “read”, “write”, or “manage”.
- client : civis.APIClient, optional
If not provided, a civis.APIClient object will be created from the CIVIS_API_KEY.
- share_email_body : string, optional
Custom body text for e-mail sent on a share.
- send_shared_email : boolean, optional
Send email to the recipients of a share.
Returns: - readers : dict
    - users : list
        - id : integer
        - name : string
    - groups : list
        - id : integer
        - name : string
- writers : dict
    - users : list
        - id : integer
        - name : string
    - groups : list
        - id : integer
        - name : string
- owners : dict
    - users : list
        - id : integer
        - name : string
    - groups : list
        - id : integer
        - name : string
- total_user_shares : integer
For owners, the number of total users shared. For writers and readers, the number of visible users shared.
- total_group_shares : integer
For owners, the number of total groups shared. For writers and readers, the number of visible groups shared.
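The nested return value described above can be summarized with a few lines of plain Python. This is a sketch under assumptions: the sample_shares dictionary is made-up illustrative data shaped like the documented fields, names_by_level is a hypothetical helper, and the object actually returned by the client may be a response type rather than a plain dict.

```python
def names_by_level(shares):
    """Hypothetical helper: list user and group names per permission level."""
    summary = {}
    for level in ("readers", "writers", "owners"):
        entry = shares.get(level, {})
        summary[level] = {
            "users": [user["name"] for user in entry.get("users", [])],
            "groups": [group["name"] for group in entry.get("groups", [])],
        }
    return summary


# Illustrative data only; IDs and names are invented for this example.
sample_shares = {
    "readers": {"users": [{"id": 1, "name": "Reader One"}], "groups": []},
    "writers": {"users": [], "groups": [{"id": 10, "name": "Modelers"}]},
    "owners": {"users": [{"id": 2, "name": "Owner Two"}], "groups": []},
    "total_user_shares": 2,
    "total_group_shares": 1,
}

summary = names_by_level(sample_shares)
print(summary)
```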
Set the permissions groups have on this model.
Use this on both training and scoring jobs. If used on a training job, note that “read” permission is sufficient to score the model.
Parameters: - id : integer
The ID of the resource that is shared.
- group_ids : list
An array of one or more group IDs.
- permission_level : string
Options are: “read”, “write”, or “manage”.
- client : civis.APIClient, optional
If not provided, a civis.APIClient object will be created from the CIVIS_API_KEY.
- share_email_body : string, optional
Custom body text for e-mail sent on a share.
- send_shared_email : boolean, optional
Send email to the recipients of a share.
Returns: - readers : dict
    - users : list
        - id : integer
        - name : string
    - groups : list
        - id : integer
        - name : string
- writers : dict
    - users : list
        - id : integer
        - name : string
    - groups : list
        - id : integer
        - name : string
- owners : dict
    - users : list
        - id : integer
        - name : string
    - groups : list
        - id : integer
        - name : string
- total_user_shares : integer
For owners, the number of total users shared. For writers and readers, the number of visible users shared.
- total_group_shares : integer
For owners, the number of total groups shared. For writers and readers, the number of visible groups shared.
Revoke the permissions a user has on this object
Use this function on both training and scoring jobs.
Parameters: - id : integer
The ID of the resource that is shared.
- user_id : integer
The ID of the user.
- client : civis.APIClient, optional
If not provided, a civis.APIClient object will be created from the CIVIS_API_KEY.
Returns: - None
Response code 204: success
Revoke the permissions a group has on this object
Use this function on both training and scoring jobs.
Parameters: - id : integer
The ID of the resource that is shared.
- group_id : integer
The ID of the group.
- client : civis.APIClient, optional
If not provided, a civis.APIClient object will be created from the CIVIS_API_KEY.
Returns: - None
Response code 204: success
- civis.ml.list_models(job_type='train', author=Sentinel(), client=None, **kwargs)
List a user’s CivisML models.
Parameters: - job_type : {“train”, “predict”, None}
The type of model job to list. If “train”, list training jobs only (including registered models trained outside of CivisML). If “predict”, list prediction jobs only. If None, list both.
- author : int, optional
User id of the user whose models you want to list. Defaults to the current user. Use None to list models from all users.
- client : civis.APIClient, optional
If not provided, a civis.APIClient object will be created from the CIVIS_API_KEY.
- **kwargs : kwargs
Extra keyword arguments passed to client.scripts.list_custom()
See also
APIClient.scripts.list_custom