MCI 600 project lifecycle

  • What does high bias in a model usually cause?
    Underfitting – the model is too simple and misses patterns.
  • Why is data understanding important before modeling?
    To explore distributions, detect missing values, and understand context.
  • Which metric balances precision and recall into one score?
    F1-score.
  • What technique would you use to fairly evaluate a model on imbalanced data, ensuring each fold has similar class proportions?
    Stratified K-Fold cross-validation.
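To make the idea concrete, here is a minimal pure-Python sketch of stratified fold assignment. The function name `stratified_folds` and the toy labels are illustrative assumptions; library implementations such as scikit-learn's `StratifiedKFold` also offer shuffling, which this sketch omits.

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Split sample indices into k folds with similar class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    # Deal each class's indices round-robin so every fold gets its share.
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

# Hypothetical imbalanced labels: 8 negatives (0) and 4 positives (1).
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
folds = stratified_folds(labels, k=4)

for fold in folds:
    positives = sum(labels[i] for i in fold)
    print(f"fold size={len(fold)}, positives={positives}")
# Each of the 4 folds holds 3 samples, exactly 1 of them positive.
```

Each fold then serves once as the validation set while the remaining folds form the training set.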
  • In a confusion matrix, what is a false positive?
    When the model predicts “positive” but the true class is negative (Type I error).
  • In supervised learning, what does the training dataset consist of?
    Input-output pairs (features and labels).
  • Why is cross-validation important in ML?
    It gives a more reliable estimate of model performance than a single train/test split and helps detect overfitting.
  • What is the purpose of the Turing Test?
    To check if a machine can mimic human intelligence so well that a human judge cannot distinguish it from a real person.
  • A hospital is testing a new ML model to detect cancer from screening data. The confusion matrix results are: True Positives = 100, True Negatives = 50, False Positives = 10, False Negatives = 5. 👉 How would you evaluate this model’s perfo
    Calculate accuracy, precision, recall, F1-score. Recall is critical: missing a cancer patient (false negative) is worse than a false alarm. Precision matters
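As a worked example, the four metrics can be computed directly from the counts given in the question. This is a plain-arithmetic sketch; the variable names are illustrative.

```python
# Counts taken from the hospital question.
tp, tn, fp, fn = 100, 50, 10, 5

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are real
recall = tp / (tp + fn)                      # share of real positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f}")    # accuracy=0.909
print(f"precision={precision:.3f}")  # precision=0.909
print(f"recall={recall:.3f}")        # recall=0.952
print(f"f1={f1:.3f}")                # f1=0.930
```

A recall of about 0.952 means 5 of the 105 actual cancer cases were missed, which is the figure to watch in a screening setting.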
  • Why do we calculate the standard deviation instead of just variance when describing data spread?
    Because standard deviation is in the same units as the data, making it easier to interpret (variance is squared units).
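A small sketch of the units argument, using Python's standard `statistics` module on made-up heights (the data are an assumption for illustration only):

```python
import statistics

# Hypothetical heights in centimetres.
heights_cm = [160, 165, 170, 175, 180]

variance = statistics.pvariance(heights_cm)  # population variance, in cm squared
std_dev = statistics.pstdev(heights_cm)      # population standard deviation, in cm

print(f"variance = {variance:.2f} cm^2")  # variance = 50.00 cm^2
print(f"std dev  = {std_dev:.2f} cm")     # std dev  = 7.07 cm
```

A spread of roughly 7 cm around a mean of 170 cm is easy to picture, whereas 50 cm² has no direct reading on the height scale.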
  • What is the first stage of the ML project cycle?
    Business understanding (define goals, scope, problem).
  • In unsupervised learning, why are labels not needed?
    The algorithm discovers hidden patterns or clusters without predefined outputs.
  • Why is data privacy critical in ML projects?
    To ensure anonymization, secure storage, and compliance with regulations like GDPR.
  • You are given a dataset of university students with features such as: attendance rate, assignment submission timeliness, LMS login frequency, socioeconomic background, and previous academic performance. 👉 Design a supervised ML pipeline to pr
    Use supervised learning (e.g., logistic regression, decision trees, or random forests). Address imbalance with techniques like SMOTE, class-weight adjustments,
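One of the imbalance techniques named above, class-weight adjustment, can be sketched in a few lines. The formula below, n_samples / (n_classes * class_count), is the common "balanced" heuristic (the one scikit-learn documents for `class_weight="balanced"`). The helper name and the 90/10 label split are assumptions for illustration, not details from the card.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical labels: 90 majority-class (0) and 10 minority-class (1) students.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)

print(f"majority weight = {weights[0]:.3f}")  # majority weight = 0.556
print(f"minority weight = {weights[1]:.3f}")  # minority weight = 5.000
```

With these weights, an error on a minority-class student costs nine times as much as an error on a majority-class student during training.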
  • A model achieves 95% accuracy on an imbalanced dataset where 90% are “negative.” Why is accuracy misleading here?
    Because predicting everything as negative already gives 90% accuracy, so 95% is only a small gain over a trivial baseline; metrics like precision, recall, or F1 are more informative.
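The trivial baseline can be checked with a short sketch. The 100-sample test set below is a made-up example that matches the 90% negative rate in the question.

```python
# Hypothetical test set: 90 negatives (0) and 10 positives (1).
y_true = [0] * 90 + [1] * 10
# A "model" that ignores its input and always predicts negative.
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
recall = true_positives / actual_positives

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")  # accuracy=0.90, recall=0.00
```

The baseline scores 90% accuracy while finding none of the positives, which is why recall (or F1) exposes what accuracy hides.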
  • An online retail company wants to segment customers based on purchasing behavior. Data includes: number of purchases per month, average order value, types of products bought, and time since last purchase. 👉 Propose an unsupervised learning ap
    Use clustering (e.g., K-means, DBSCAN, or hierarchical clustering). Validate clusters using metrics like Silhouette score, Davies-Bouldin index. Interpret clu
  • In a dataset of student exam scores, the mean score is 65 while the median score is 80. What does this suggest about the distribution of the scores?
    The distribution is left-skewed (negatively skewed), meaning more students scored higher, but a few very low scores pulled the mean down.
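A quick illustration with invented scores chosen to reproduce the numbers in the question (mean 65, median 80); the list itself is an assumption, not real exam data.

```python
import statistics

# Hypothetical exam scores: most are high, a few are very low.
scores = [15, 20, 25, 78, 80, 80, 82, 85, 90, 95]

mean_score = statistics.mean(scores)
median_score = statistics.median(scores)

print(f"mean={mean_score}, median={median_score}")
# The three low scores drag the mean (65) well below the median (80),
# the signature of a left-skewed distribution.
```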
  • How can outliers affect machine learning models if not handled properly during data preparation?
    They can distort measures like mean, inflate variance, and mislead algorithms (especially sensitive models like linear regression).
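The effect on summary statistics is easy to see in a small sketch. The measurements below are made up, with 500 standing in for an outlier.

```python
import statistics

# Hypothetical measurements; 500 is an obvious outlier.
values = [48, 50, 51, 49, 52, 50, 500]
clean = values[:-1]  # the same data without the outlier

mean_with, mean_clean = statistics.mean(values), statistics.mean(clean)
median_with, median_clean = statistics.median(values), statistics.median(clean)
spread_with, spread_clean = statistics.pstdev(values), statistics.pstdev(clean)

print(f"mean:   {mean_clean:.2f} -> {mean_with:.2f}")      # mean:   50.00 -> 114.29
print(f"median: {median_clean:.2f} -> {median_with:.2f}")  # median: 50.00 -> 50.00
print(f"std dev grows by a factor of {spread_with / spread_clean:.0f}")
```

A single extreme value more than doubles the mean and inflates the standard deviation, while the median does not move. This is one reason robust statistics are often used when screening for outliers.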
  • If a model has low bias but high variance, how does it perform on training vs test data?
    Very good on training data but poor on test data.
  • Give one real-world example of supervised learning.
    Predicting student exam performance from study hours, attendance, etc.
  • What is the key difference between Artificial Intelligence (AI) and Machine Learning (ML)?
    AI is the broader concept of machines acting intelligently, while ML is a subset that enables machines to learn from data.
  • What does high variance in a model usually cause?
    Overfitting – the model learns noise and fails to generalize.
  • Who first asked the question “Can machines think?”
    Alan Turing, in his 1950 paper “Computing Machinery and Intelligence.”
  • Why is the normal distribution important in machine learning and statistics?
    Many statistical methods (e.g., inference in linear regression, t-tests) assume normally distributed errors or data; it also underpins concepts like the Central