Machine Learning

CivisML uses the Civis Platform to train machine learning models and parallelize their predictions over large datasets. It contains best-practice models for general-purpose classification and regression modeling as well as model quality evaluations and visualizations. All CivisML models use scikit-learn for interoperability with other platforms and to allow you to leverage resources in the open-source software community when creating machine learning models.

Define Your Model

Start the modeling process by defining your model. Do this by creating an instance of the ModelPipeline class. Each ModelPipeline corresponds to a scikit-learn Pipeline which will run in Civis Platform. A Pipeline allows you to combine multiple modeling steps (such as missing value imputation and feature selection) into a single model. The Pipeline is treated as a unit – for example, cross-validation happens over all steps together.

You can define your model in two ways: either by selecting a pre-defined algorithm or by providing your own scikit-learn Pipeline or BaseEstimator object. Note that whichever option you choose, CivisML will pre-process your data to one-hot encode categorical features (the non-numerical columns) into binary indicator columns before sending the features to the Pipeline.
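To make the pre-processing step concrete, here is a minimal pure-Python sketch of one-hot encoding. It only illustrates the idea and is not CivisML's actual ETL code; the function name `one_hot_encode`, the sample rows, and the `column_value` naming of the indicator columns are assumptions made for this example.

```python
def one_hot_encode(rows):
    """Expand non-numeric columns into 0/1 indicator columns (illustrative only)."""
    # Collect the distinct values seen in each non-numeric (string) column.
    categories = {}
    for row in rows:
        for col, val in row.items():
            if isinstance(val, str):
                categories.setdefault(col, set()).add(val)

    encoded = []
    for row in rows:
        out = {}
        for col, val in row.items():
            if col in categories:
                # One indicator column per observed category value.
                for cat in sorted(categories[col]):
                    out[f"{col}_{cat}"] = 1 if val == cat else 0
            else:
                out[col] = val  # numeric columns pass through unchanged
        encoded.append(out)
    return encoded


# Hypothetical sample data: one numeric and one categorical column.
rows = [
    {"age": 34, "state": "IL"},
    {"age": 51, "state": "NY"},
]
encoded = one_hot_encode(rows)
print(encoded)
```

Each categorical column becomes one indicator column per observed value, while numeric columns are left alone.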

Pre-Defined Models

You can use the following pre-defined models with CivisML. All models start by imputing missing values with the mean of non-null values in a column. The “sparse_*” models include a LASSO regression step (using the glmnet package) to do feature selection before passing data to the final model. In some models, CivisML uses default parameters different from those in scikit-learn, as indicated in the “Altered Defaults” column. All models also have random_state=42.

Name	Model Type	Algorithm	Altered Defaults
sparse_logistic	classification	LogisticRegression	`C=499999950, tol=1e-08`
gradient_boosting_classifier	classification	GradientBoostingClassifier	`n_estimators=500, max_depth=2`
random_forest_classifier	classification	RandomForestClassifier	`n_estimators=500`
extra_trees_classifier	classification	ExtraTreesClassifier	`n_estimators=500`
sparse_linear_regressor	regression	LinearRegression
sparse_ridge_regressor	regression	Ridge
gradient_boosting_regressor	regression	GradientBoostingRegressor	`n_estimators=500, max_depth=2`
random_forest_regressor	regression	RandomForestRegressor	`n_estimators=500`
extra_trees_regressor	regression	ExtraTreesRegressor	`n_estimators=500`
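The mean imputation step that every pre-defined model starts with can be sketched in a few lines of plain Python. This is an illustration of the technique only, assuming missing values are represented as `None`; the helper name `impute_with_mean` and the sample column are hypothetical, and the real models run this step inside a scikit-learn Pipeline.

```python
def impute_with_mean(column):
    """Replace None entries with the mean of the non-null values in the column."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


# Hypothetical column with two missing entries.
ages = [30, None, 50, None, 40]
imputed = impute_with_mean(ages)
print(imputed)
```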

Custom Models

You can create your own Pipeline instead of using one of the pre-defined ones. Create the object and pass it as the model parameter of the ModelPipeline. Your model must be built from libraries which CivisML recognizes. You can use code from:

scikit-learn v0.18.1
glmnet v2.0.0
xgboost v0.6a2
muffnn v1.1.1

When you’re assembling your own model, remember that either your Pipeline must include a missing value imputation step or your data must not contain any missing values. If you’re making a classification model, the model must have a predict_proba method. If the class you’re using doesn’t have a predict_proba method, you can add one by wrapping it in a CalibratedClassifierCV.

Custom Dependencies

Installing packages from PyPI is straightforward. Pass a dependencies argument to ModelPipeline, and CivisML will install those packages in your runtime environment. VCS support is also enabled (see the pip documentation on VCS support). To install from a remote git repository on, say, GitHub, pass the HTTPS URL in a form such as git+https://github.com/scikit-learn/scikit-learn.

CivisML will run pip install [your package here]. We strongly encourage you to pin package versions for consistency. Example code looks like:

from civis.ml import ModelPipeline
from pyearth import Earth
deps = ['git+https://github.com/scikit-learn-contrib/py-earth.git@da856e11b2a5d16aba07f51c3c15cef5e40550c7']
est = Earth()
model = ModelPipeline(est, dependent_variable='age', dependencies=deps)
train = model.train(table_name='donors.from_march', database_name='client')
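Because pinning matters for reproducibility, it can help to check a dependency list before handing it to ModelPipeline. The sketch below is a hypothetical client-side helper, not part of the civis package: `is_pinned` treats a PyPI requirement as pinned when it contains `==`, and a `git+` URL as pinned only when it ends in a full 40-character commit hash (tags and branches are deliberately not accepted here). The package names in the sample list are placeholders.

```python
import re


def is_pinned(dependency):
    """Return True if a requirement string names an exact version or commit."""
    if dependency.startswith("git+"):
        # Accept only URLs that end in "@<40-character commit hash>".
        return re.search(r"@[0-9a-f]{40}$", dependency) is not None
    # For PyPI packages, require an exact "==" version specifier.
    return "==" in dependency


deps = [
    "git+https://github.com/scikit-learn-contrib/py-earth.git"
    "@da856e11b2a5d16aba07f51c3c15cef5e40550c7",
    "example-package==1.2.3",   # hypothetical package, pinned
    "another-example-package",  # hypothetical package, not pinned
]
unpinned = [d for d in deps if not is_pinned(d)]
print(unpinned)
```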

Additionally, you can store a remote git host’s API token in the Civis Platform as a credential to use for installing private git repositories. For example, you can go to GitHub at https://github.com/settings/tokens, copy your token into the password field of a credential, and pass the credential name to the git_token_name argument in ModelPipeline. This also works with other hosting services. A simple example of how to do this with the API client looks as follows:

import civis
password = 'abc123'  # token copied from https://github.com/settings/tokens
username = 'user123'  # Github username
git_token_name = 'Github credential'

client = civis.APIClient()
credential = client.credentials.post(password=password,
                                     username=username,
                                     name=git_token_name,
                                     type="Custom")

pipeline = civis.ml.ModelPipeline(..., git_token_name=git_token_name)

Note that installing private dependencies with submodules is not supported.

Asynchronous Execution

All calls to a ModelPipeline object are non-blocking, i.e. they immediately provide a result without waiting for the job in the Civis Platform to complete. Calls to civis.ml.ModelPipeline.train() and civis.ml.ModelPipeline.predict() return a ModelFuture object, which is a subclass of Future from the Python standard library. This behavior lets you train multiple models at once, or generate predictions from models, and keep doing other work while your jobs run.

The ModelFuture can find and retrieve outputs from your CivisML jobs, such as trained Pipeline objects or out-of-sample predictions. The ModelFuture only downloads outputs when you request them.
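The non-blocking behavior can be tried out with standard-library futures alone. In the sketch below, a thread pool and a sleeping function stand in for Civis Platform jobs; `slow_job` and the job names are hypothetical, and no CivisML call is made. The point is only that submitting work returns a Future immediately, while `result()` blocks.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_job(name, seconds):
    """Stand-in for a remote job: sleeps instead of training a model."""
    time.sleep(seconds)
    return f"{name} finished"


with ThreadPoolExecutor(max_workers=2) as pool:
    # submit() returns a Future right away; both jobs now run concurrently.
    fut_a = pool.submit(slow_job, "model A", 0.2)
    fut_b = pool.submit(slow_job, "model B", 0.2)
    # Other work could happen here while the jobs run.
    # result() blocks until the corresponding job has completed.
    results = [fut_a.result(), fut_b.result()]

print(results)
```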

Model Persistence

Civis Platform permanently stores all models, indexed by the job ID and the run ID (also called a “build”) of the training job. If you wish to use an existing model, call civis.ml.ModelPipeline.from_existing() with the job ID of the training job. You can find the job ID with the train_job_id attribute of a ModelFuture, or by looking at the URL of your model on the Civis Platform models page. If the training job has multiple runs, you may also provide a run ID to select a run other than the most recent. You can list all model runs of a training job by calling civis.APIClient().jobs.get(train_job_id)['runs']. You may also store the ModelPipeline itself with the pickle module.

Examples

Future objects have the method add_done_callback(), which registers a function to be called as soon as the run completes. The function takes a single argument, the Future for the completed job. You can use this method to chain jobs together:

from concurrent import futures
from civis.ml import ModelPipeline
import pandas as pd
df = pd.read_csv('data.csv')
training, predictions = [], []
model = ModelPipeline('sparse_logistic', dependent_variable='type')
training.append(model.train(df))
training[-1].add_done_callback(lambda fut: predictions.append(model.predict(df)))
futures.wait(training)  # Blocks until all training jobs complete
futures.wait(predictions)  # Blocks until all prediction jobs complete
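The callback mechanism itself can be explored without a Civis Platform account. The following sketch uses a bare standard-library Future as a stand-in for the result of a training call and completes it by hand; the names `start_prediction`, `events`, and `predict_started` are hypothetical. It also shows one way to wait for the callback's side effect (here a `threading.Event`) rather than only for the original future.

```python
import threading
from concurrent.futures import Future

train_future = Future()              # stand-in for the result of a training call
predict_started = threading.Event()  # set once the callback has done its work
events = []


def start_prediction(fut):
    # The callback receives the completed Future as its only argument.
    events.append(f"training result: {fut.result()}")
    events.append("prediction started")
    predict_started.set()


train_future.add_done_callback(start_prediction)
train_future.set_result("succeeded")  # simulate the training run completing
predict_started.wait(timeout=5)       # wait for the chained step to begin
print(events)
```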

You can create and train multiple models at once to find the best approach for solving a problem. For example:

from civis.ml import ModelPipeline
algorithms = ['gradient_boosting_classifier', 'sparse_logistic', 'random_forest_classifier']
pkey = 'person_id'
depvar = 'likes_cats'
models = [ModelPipeline(alg, primary_key=pkey, dependent_variable=depvar) for alg in algorithms]
train = [model.train(table_name='schema.name', database_name='My DB') for model in models]
aucs = [tr.metrics['roc_auc'] for tr in train]  # Blocks here until each training job finishes
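Once the metrics are in hand, choosing the best-scoring algorithm is a one-liner. The sketch below assumes a list of AUC values like the `aucs` list above; the numbers are made-up placeholders, not results from any real training run.

```python
algorithms = ['gradient_boosting_classifier', 'sparse_logistic', 'random_forest_classifier']
# Placeholder values standing in for [tr.metrics['roc_auc'] for tr in train].
aucs = [0.81, 0.78, 0.84]

# Pair each algorithm with its score and keep the pair with the highest AUC.
best_algorithm, best_auc = max(zip(algorithms, aucs), key=lambda pair: pair[1])
print(best_algorithm, best_auc)
```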

Optional dependencies

You do not need any external libraries installed to use CivisML, but the following pip-installable dependencies enhance the capabilities of the ModelPipeline:

pandas
scikit-learn
glmnet

Install pandas if you wish to download tables of predictions. You can also model on DataFrame objects in your interpreter.

If you wish to use custom models or download trained models, you’ll need scikit-learn installed.

The “sparse_logistic”, “sparse_linear_regressor”, and “sparse_ridge_regressor” models all use the public Civis Analytics glmnet library. Install it if you wish to download a model created from one of these pre-defined models.

Object reference

class civis.ml.ModelPipeline(model, dependent_variable, primary_key=None, parameters=None, cross_validation_parameters=None, model_name=None, calibration=None, excluded_columns=None, client=None, cpu_requested=None, memory_requested=None, disk_requested=None, notifications=None, dependencies=None, git_token_name=None, verbose=False, etl=None)

Interface for scikit-learn modeling in the Civis Platform

Each ModelPipeline corresponds to a scikit-learn Pipeline which will run in Civis Platform.

Note that this object can be safely pickled and unpickled, but it does not store the state of any attached APIClient object. An unpickled ModelPipeline will use the API key from the user’s environment.

Parameters:

model : string or Estimator

Either the name of a pre-defined model (e.g. “sparse_logistic” or “gradient_boosting_classifier”) or else a pre-existing Estimator object.

dependent_variable : string or List[str]

The dependent variable of the training dataset. For a multi-target problem, this should be a list of column names of dependent variables.

primary_key : string, optional

The unique ID (primary key) of the training dataset. This will be used to index the out-of-sample scores.

parameters : dict, optional

Specify parameters for the final stage estimator in a predefined model, e.g. {'C': 2} for a “sparse_logistic” model.

cross_validation_parameters : dict, optional

Cross validation parameter grid for learner parameters, e.g. {'n_estimators': [100, 200, 500], 'learning_rate': [0.01, 0.1], 'max_depth': [2, 3]}.

model_name : string, optional

The prefix of the Platform modeling jobs. “ Train” or “ Predict” will be appended to form the Script title.

calibration : {None, “sigmoid”, “isotonic”}

If not None, calibrate output probabilities with the selected method. Valid only with classification models.

excluded_columns : array, optional

A list of columns which will be considered ineligible to be independent variables.

client : APIClient, optional

If not provided, an APIClient object will be created from the CIVIS_API_KEY.

cpu_requested : int, optional

Number of CPU shares requested in the Civis Platform for training jobs. 1024 shares = 1 CPU.

memory_requested : int, optional

Memory requested from Civis Platform for training jobs, in MiB

disk_requested : float, optional

Disk space requested on Civis Platform for training jobs, in GB

notifications : dict

See post_custom() for further documentation about email and URL notification.

dependencies : array, optional

List of packages to install from PyPI or a git repository (e.g., GitHub or Bitbucket). If a private repo is specified, please include a git_token_name argument as well (see below). Make sure to pin dependencies to a specific version, since dependencies will be reinstalled during every training and predict job.

git_token_name : str, optional

Name of remote git API token stored in Civis Platform as the password field in a custom platform credential. Used only when installing private git repositories.

verbose : bool, optional

If True, supply debug outputs in Platform logs and make prediction child jobs visible.

etl : Estimator, optional

Custom ETL estimator which overrides the default ETL, and is run before training and validation.
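The resource parameters above use units that are easy to mix up: CPU is requested in shares (1024 shares = 1 CPU), memory in MiB, and disk in GB. The helpers below are a hypothetical convenience, not part of the civis package, for converting from whole CPUs and GiB; the requested amounts are arbitrary example values.

```python
def cpu_shares(cpus):
    """Convert a number of CPUs to CPU shares (1024 shares = 1 CPU)."""
    return int(cpus * 1024)


def gib_to_mib(gib):
    """Convert GiB to MiB (1 GiB = 1024 MiB)."""
    return int(gib * 1024)


# Example request: 2 CPUs, 8 GiB of memory, 16 GB of disk.
resources = {
    'cpu_requested': cpu_shares(2),
    'memory_requested': gib_to_mib(8),
    'disk_requested': 16.0,
}
print(resources)
```

A dictionary like this could then be unpacked into the constructor alongside the other arguments.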

See also

civis.ml.ModelFuture

Examples

>>> from civis.ml import ModelPipeline
>>> model = ModelPipeline('gradient_boosting_classifier', 'depvar',
...                       primary_key='voterbase_id')
>>> train = model.train(table_name='schema.survey_data',
...                     fit_params={'sample_weight': 'survey_weight'},
...                     database_name='My Redshift Cluster',
...                     oos_scores='scratch.survey_depvar_oos_scores')
>>> train
<ModelFuture at 0x11be7ae10 state=queued>
>>> train.running()
True
>>> train.done()
False
>>> df = train.table  # Read OOS scores from its Civis File. Blocking.
>>> meta = train.metadata  # Metadata from training run
>>> train.metrics['roc_auc']
0.88425
>>> pred = model.predict(table_name='schema.demographics_table',
...                      database_name='My Redshift Cluster',
...                      output_table='schema.predicted_survey_response',
...                      if_exists='drop',
...                      n_jobs=50)
>>> df_pred = pred.table  # Blocks until finished
# Modify the parameters of the base estimator in a default model:
>>> model = ModelPipeline('sparse_logistic', 'depvar',
...                       primary_key='voterbase_id',
...                       parameters={'C': 2})
# Grid search over hyperparameters in the base estimator:
>>> model = ModelPipeline('sparse_logistic', 'depvar',
...                       primary_key='voterbase_id',
...                       cross_validation_parameters={'C': [0.1, 1, 10]})
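To get a feel for how large a grid search is before launching it, the grid passed as cross_validation_parameters can be expanded locally. This sketch uses only the standard library and the example grid from the parameter description; the assumption that every combination is fit once per fold, with 4 folds, is taken from the n_jobs description of train() and may not reflect exactly how CivisML schedules the work.

```python
from itertools import product

grid = {
    'n_estimators': [100, 200, 500],
    'learning_rate': [0.01, 0.1],
    'max_depth': [2, 3],
}

# Build every combination of one value per parameter.
names = sorted(grid)
combinations = [dict(zip(names, values))
                for values in product(*(grid[name] for name in names))]

n_folds = 4  # assumed number of cross-validation folds
n_fits = len(combinations) * n_folds
print(len(combinations), n_fits)
```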

Attributes

estimator	(`Pipeline`) The trained scikit-learn Pipeline
train_result_	(`ModelFuture`) `ModelFuture` encapsulating this model’s training run
state	(str) Status of the training job (non-blocking)

Methods

train()	Train the model on data in Civis Platform; outputs `ModelFuture`
predict()	Make predictions on new data; outputs `ModelFuture`
from_existing()	Class method; use to create a `ModelPipeline` from an existing model training run

classmethod from_existing(train_job_id, train_run_id='latest', client=None)

Create a ModelPipeline object from existing model IDs

Parameters:

train_job_id : int

The ID of the CivisML job in the Civis Platform

train_run_id : int or string, optional

Location of the model run, either

an explicit run ID,

“latest” : The most recent run

“active” : The run designated by the training job’s “active build” parameter

client : APIClient, optional

If not provided, an APIClient object will be created from the CIVIS_API_KEY.

Returns:

ModelPipeline

A ModelPipeline which refers to a previously-trained model

Examples

>>> from civis.ml import ModelPipeline
>>> model = ModelPipeline.from_existing(job_id)
>>> model.train_result_.metrics['roc_auc']
0.843

predict(df=None, csv_path=None, table_name=None, database_name=None, manifest=None, file_id=None, sql_where=None, sql_limit=None, primary_key=Sentinel(), output_table=None, output_db=None, if_exists='fail', n_jobs=None, polling_interval=None, cpu=None, memory=None, disk_space=None)

Make predictions on a trained model

Provide input through one of a DataFrame (df), a local CSV (csv_path), a Civis Table (table_name and database_name), a Civis File containing a CSV (file_id), or a Civis File containing a manifest file (manifest).

A “manifest file” is JSON which specifies the location of many shards of the data to be used for prediction. A manifest file is the output of a Civis export job with force_multifile=True set, e.g. from civis.io.civis_to_multifile_csv(). Large Civis Tables (provided using table_name) will automatically be exported to manifest files.

Prediction outputs will always be stored as gzipped CSVs in one or more Civis Files. You can find a list of File ID numbers for output files at the “output_file_ids” key in the metadata returned by the prediction job. Provide an output_table (and optionally an output_db, if it’s different from database_name) to copy these predictions into a Civis Table.
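When predictions are split across several gzipped CSVs, they need to be stitched back together after download. The sketch below shows only that last step with the standard library, using two in-memory payloads in place of downloaded Civis Files; the helper name `read_gzipped_csvs` and the `person_id` and `score` column names are hypothetical, and fetching the files by ID is left out.

```python
import csv
import gzip
import io


def read_gzipped_csvs(payloads):
    """Concatenate the rows of several gzipped CSV payloads sharing one header."""
    rows = []
    for payload in payloads:
        text = gzip.decompress(payload).decode("utf-8")
        rows.extend(csv.DictReader(io.StringIO(text)))
    return rows


# Two in-memory shards standing in for downloaded prediction output files.
shard_1 = gzip.compress(b"person_id,score\n1,0.25\n2,0.75\n")
shard_2 = gzip.compress(b"person_id,score\n3,0.5\n")
all_rows = read_gzipped_csvs([shard_1, shard_2])
print(len(all_rows))
```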

Parameters:

df : pd.DataFrame, optional

A DataFrame of data for prediction. The DataFrame will be uploaded to a Civis file so that CivisML can access it. Note that the index of the DataFrame will be ignored – use df.reset_index() if you want your index column to be included with the data passed to CivisML.

csv_path : str, optional

The location of a CSV of data on the local disk. It will be uploaded to a Civis file.

table_name : str, optional

The qualified name of the table containing your data

database_name : str, optional

Name of the database holding the data, e.g., ‘My Redshift Cluster’.

manifest : int, optional

ID for a manifest file stored as a Civis file. (Note: if the manifest is not a Civis Platform-specific manifest, like the one returned from civis.io.civis_to_multifile_csv(), this must be used in conjunction with table_name and database_name due to the need for column discovery via Redshift.)

file_id : int, optional

If the data are a CSV stored in a Civis file, provide the integer file ID.

sql_where : str, optional

A SQL WHERE clause used to scope the rows to be predicted

sql_limit : int, optional

SQL LIMIT clause to restrict the size of the prediction set

primary_key : str, optional

Primary key of the prediction table. Defaults to the primary key of the training data. Use None to indicate that the prediction data don’t have a primary key column.

output_table : str, optional

The table in which to put the predictions.

output_db : str, optional

Database of the output table. Defaults to the database of the input table.

if_exists : {‘fail’, ‘append’, ‘drop’, ‘truncate’}

Action to take if the prediction table already exists.

n_jobs : int, optional

Number of concurrent Platform jobs to use for multi-file / large table prediction.

polling_interval : float, optional

Check for job completion every this number of seconds. Do not set if using the notifications endpoint.

cpu : int, optional

CPU shares requested by the user for a single job.

memory : int, optional

RAM requested by the user for a single job.

disk_space : float, optional

Disk space requested by the user for a single job.

Returns:

ModelFuture

train(df=None, csv_path=None, table_name=None, database_name=None, file_id=None, sql_where=None, sql_limit=None, oos_scores=None, oos_scores_db=None, if_exists='fail', fit_params=None, polling_interval=None, validation_data='train', n_jobs=4)

Start a Civis Platform job to train your model

Provide input through one of a DataFrame (df), a local CSV (csv_path), a Civis Table (table_name and database_name), or a Civis File containing a CSV (file_id).

Model outputs will always contain out-of-sample scores (accessible through ModelFuture.table on this function’s output), and you may choose to store these out-of-sample scores in a Civis Table with the oos_scores, oos_scores_db, and if_exists parameters.
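Since exactly one input source is expected, a small client-side check can catch mistakes before a job is submitted. The helper below is purely hypothetical and is not how the civis package validates its arguments; it simply encodes the rule that one of a DataFrame, a CSV path, a table (with its database), or a file ID should be given.

```python
def pick_input_source(df=None, csv_path=None, table_name=None,
                      database_name=None, file_id=None):
    """Return the name of the single input source provided, or raise ValueError."""
    provided = {
        'df': df is not None,
        'csv_path': csv_path is not None,
        'table': table_name is not None or database_name is not None,
        'file_id': file_id is not None,
    }
    chosen = [name for name, given in provided.items() if given]
    if len(chosen) != 1:
        raise ValueError(f"expected exactly one input source, got {chosen}")
    # Table input needs both the table name and the database name.
    if chosen[0] == 'table' and (table_name is None or database_name is None):
        raise ValueError("table input needs both table_name and database_name")
    return chosen[0]


source = pick_input_source(table_name='schema.name', database_name='My DB')
print(source)
```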

Parameters:

df : pd.DataFrame, optional

A DataFrame of training data. The DataFrame will be uploaded to a Civis file so that CivisML can access it. Note that the index of the DataFrame will be ignored – use df.reset_index() if you want your index column to be included with the data passed to CivisML.

csv_path : str, optional

The location of a CSV of data on the local disk. It will be uploaded to a Civis file.

table_name : str, optional

The qualified name of the table containing the training set from which to build the model.

database_name : str, optional

Name of the database holding the training set table used to build the model. E.g., ‘My Cluster Name’.

file_id : int, optional

If the training data are stored in a Civis file, provide the integer file ID.

sql_where : str, optional

A SQL WHERE clause used to scope the rows of the training set (used for table input only)

sql_limit : int, optional

SQL LIMIT clause for querying the training set (used for table input only)

oos_scores : str, optional

If provided, store out-of-sample predictions on training set data to this Redshift “schema.tablename”.

oos_scores_db : str, optional

If not provided, store OOS predictions in the same database which holds the training data.

if_exists : {‘fail’, ‘append’, ‘drop’, ‘truncate’}

Action to take if the out-of-sample prediction table already exists.

fit_params : Dict[str, str]

Mapping from parameter names in the model’s fit method to the column names which hold the data, e.g. {'sample_weight': 'survey_weight_column'}.

polling_interval : float, optional

Check for job completion every this number of seconds. Do not set if using the notifications endpoint.

validation_data : str, optional

Source for validation data. There are currently two options: ‘train’ (the default), which cross-validates over training data for validation; and ‘skip’, which skips the validation step.

n_jobs : int, optional

Number of jobs to use for training and validation. Defaults to 4, which allows parallelization over the 4 cross validation folds. Increase n_jobs to parallelize over many hyperparameter combinations in grid search/hyperband, or decrease to use fewer computational resources at once.

Returns:

ModelFuture

class civis.ml.ModelFuture(job_id, run_id, train_job_id=None, train_run_id=None, polling_interval=None, client=None, poll_on_creation=True)

Encapsulates asynchronous execution of a CivisML job

This object knows where to find modeling outputs from CivisML jobs. All data attributes are lazily retrieved and block on job completion.

This object can be pickled, but it does not store the state of the attached APIClient object. An unpickled ModelFuture will use the API key from the user’s environment.
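The idea of attributes that are retrieved lazily and block on job completion can be illustrated with a small stand-in class. This is a sketch of the general pattern, not the ModelFuture implementation: the class `LazyOutputs`, its `fetch` callable, and the cached-after-first-access behavior are all assumptions made for the example.

```python
from concurrent.futures import Future


class LazyOutputs:
    """Hypothetical stand-in showing lazy, blocking, cached attribute access."""

    def __init__(self, future, fetch):
        self._future = future   # completes when the remote job finishes
        self._fetch = fetch     # callable that retrieves the output
        self._metadata = None
        self.fetch_count = 0

    @property
    def metadata(self):
        if self._metadata is None:
            self._future.result()           # block until the job is done
            self._metadata = self._fetch()  # retrieve only on first access
            self.fetch_count += 1
        return self._metadata


job = Future()
outputs = LazyOutputs(job, fetch=lambda: {"status": "example output"})
fetches_before_access = outputs.fetch_count  # nothing retrieved at construction
job.set_result("succeeded")                  # simulate the job completing
first = outputs.metadata                     # triggers the only fetch
second = outputs.metadata                    # served from the stored value
print(outputs.fetch_count)
```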

Parameters:

job_id : int

ID of the modeling job

run_id : int

ID of the modeling run

train_job_id : int, optional

If not provided, this object is assumed to encapsulate a training job, and train_job_id will equal job_id.

train_run_id : int, optional

If not provided, this object is assumed to encapsulate a training run, and train_run_id will equal run_id.

polling_interval : int or float, optional

The number of seconds between API requests to check whether a result is ready. The default intelligently switches between a short interval if pubnub is not available and a long interval for pubnub backup if that library is installed.

client : civis.APIClient, optional

If not provided, a civis.APIClient object will be created from the CIVIS_API_KEY.

poll_on_creation : bool, optional

If True (the default), it will poll upon calling result() the first time. If False, it will wait the number of seconds specified in polling_interval from object creation before polling.

metadata	(dict, blocking) The metadata associated with this modeling job
metrics	(dict, blocking) Validation metrics from this job’s training
validation_metadata	(dict, blocking) Metadata from this modeling job’s validation run
train_metadata	(dict, blocking) Metadata from this modeling job’s training run (will be identical to metadata if this is a training run)
estimator	(`sklearn.pipeline.Pipeline`, blocking) The fitted scikit-learn Pipeline resulting from this model run
table	(`pandas.DataFrame`, blocking) The table output from this modeling job: out-of-sample predictions on the training set for a training job, or a table of predictions for a prediction job. If the prediction job was split into multiple files (this happens automatically for large tables), this attribute will provide only predictions for the first file.
state	(str) The current state of the Civis Platform run
job_id	(int)
run_id	(int)
train_job_id	(int) Container ID for the training job – identical to `job_id` if this is a training job.
train_run_id	(int) As `train_job_id` but for runs
is_training	(bool) True if this `ModelFuture` corresponds to a train-validate job.

cancel()	Cancels the corresponding Platform job before completion
succeeded()	(Non-blocking) Is the job a success?
failed()	(Non-blocking) Did the job fail?
cancelled()	(Non-blocking) Was the job cancelled?
running()	(Non-blocking) Is the job still running?
done()	(Non-blocking) Is the job finished?
result()	(Blocking) Return the final status of the Civis Platform job.