Text as Data: A New Framework for Machine Learning and the Social Sciences, 1st Edition, by Justin Grimmer, Margaret Roberts, and Brandon Stewart

Product details:
ISBN 10: 0691207992
ISBN 13: 9780691207995
Authors: Justin Grimmer, Margaret Roberts, Brandon Stewart
A guide for using computational text analysis to learn about the social world.

From social media posts and text messages to digital government documents and archives, researchers are bombarded with a deluge of text reflecting the social world. This textual data gives unprecedented insights into fundamental questions in the social sciences, humanities, and industry. Meanwhile, new machine learning tools are rapidly transforming the way science and business are conducted. Text as Data shows how to combine new sources of data, machine learning tools, and social science research design to develop and evaluate new insights.

Text as Data is organized around the core tasks in research projects using text: representation, discovery, measurement, prediction, and causal inference. The authors offer a sequential, iterative, and inductive approach to research design. Each research task is presented complete with real-world applications, example methods, and a distinct style of task-focused research. Bridging many divides (computer science and social science, the qualitative and the quantitative, and industry and academia), Text as Data is an ideal resource for anyone wanting to analyze large collections of text in an era when data is abundant and computation is cheap, but the enduring challenges of social science remain.

- Overview of how to use text as data
- Research design for a world of data deluge
- Examples from across the social sciences and industry
Table of Contents:
Part I: Preliminaries
Chapter 1: Introduction
- 1.1 How This Book Informs the Social Sciences
- 1.2 How This Book Informs the Digital Humanities
- 1.3 How This Book Informs Data Science in Industry and Government
- 1.4 A Guide to This Book
- 1.5 Conclusion
Chapter 2: Social Science Research and Text Analysis
- 2.1 Discovery
- 2.2 Measurement
- 2.3 Inference
- 2.4 Social Science as an Iterative and Cumulative Process
- 2.5 An Agnostic Approach to Text Analysis
- 2.6 Discovery, Measurement, and Causal Inference: How the Chinese Government Censors Social Media
- 2.7 Six Principles of Text Analysis
- 2.7.1 Social Science Theories and Substantive Knowledge Are Essential for Research Design
- 2.7.2 Text Analysis Does Not Replace Humans—It Augments Them
- 2.7.3 Building, Refining, and Testing Social Science Theories Requires Iteration and Cumulation
- 2.7.4 Text Analysis Methods Distill Generalizations from Language
- 2.7.5 The Best Method Depends on the Task
- 2.7.6 Validations Are Essential and Depend on the Theory and the Task
- 2.8 Conclusion: Text Data and Social Science
Part II: Selection and Representation
Chapter 3: Principles of Selection and Representation
- 3.1 Principle 1: Question-Specific Corpus Construction
- 3.2 Principle 2: No Values-Free Corpus Construction
- 3.3 Principle 3: No Right Way to Represent Text
- 3.4 Principle 4: Validation
- 3.5 State of the Union Addresses
- 3.6 The Authorship of the Federalist Papers
- 3.7 Conclusion
Chapter 4: Selecting Documents
- 4.1 Populations and Quantities of Interest
- 4.2 Four Types of Bias
- 4.2.1 Resource Bias
- 4.2.2 Incentive Bias
- 4.2.3 Medium Bias
- 4.2.4 Retrieval Bias
- 4.3 Considerations of “Found Data”
- 4.4 Conclusion
Chapter 5: Bag of Words
- 5.1 The Bag of Words Model
- 5.2 Choose the Unit of Analysis
- 5.3 Tokenize
- 5.4 Reduce Complexity
- 5.4.1 Lowercase
- 5.4.2 Remove Punctuation
- 5.4.3 Remove Stop Words
- 5.4.4 Create Equivalence Classes (Lemmatize/Stem)
- 5.4.5 Filter by Frequency
- 5.5 Construct Document-Feature Matrix
- 5.6 Rethinking the Defaults
- 5.6.1 Authorship of the Federalist Papers
- 5.6.2 The Scale Argument against Preprocessing
- 5.7 Conclusion
Chapter 6: The Multinomial Language Model
- 6.1 Multinomial Distribution
- 6.2 Basic Language Modeling
- 6.3 Regularization and Smoothing
- 6.4 The Dirichlet Distribution
- 6.5 Conclusion
Chapter 7: The Vector Space Model and Similarity Metrics
- 7.1 Similarity Metrics
- 7.2 Distance Metrics
- 7.3 tf-idf Weighting
- 7.4 Conclusion
Chapter 8: Distributed Representations of Words
- 8.1 Why Word Embeddings
- 8.2 Estimating Word Embeddings
- 8.2.1 The Self-Supervision Insight
- 8.2.2 Design Choices in Word Embeddings
- 8.2.3 Latent Semantic Analysis
- 8.2.4 Neural Word Embeddings
- 8.2.5 Pretrained Embeddings
- 8.2.6 Rare Words
- 8.2.7 An Illustration
- 8.3 Aggregating Word Embeddings to the Document Level
- 8.4 Validation
- 8.5 Contextualized Word Embeddings
- 8.6 Conclusion
Chapter 9: Representations from Language Sequences
- 9.1 Text Reuse
- 9.2 Parts of Speech Tagging
- 9.2.1 Using Phrases to Improve Visualization
- 9.3 Named-Entity Recognition
- 9.4 Dependency Parsing
- 9.5 Broader Information Extraction Tasks
- 9.6 Conclusion
Part III: Discovery
Chapter 10: Principles of Discovery
- 10.1 Principle 1: Context Relevance
- 10.2 Principle 2: No Ground Truth
- 10.3 Principle 3: Judge the Concept, Not the Method
- 10.4 Principle 4: Separate Data Is Best
- 10.5 Conceptualizing the US Congress
- 10.6 Conclusion
Chapter 11: Discriminating Words
- 11.1 Mutual Information
- 11.2 Fightin’ Words
- 11.3 Fictitious Prediction Problems
- 11.3.1 Standardized Test Statistics as Measures of Separation
- 11.3.2 χ² Test Statistics
- 11.3.3 Multinomial Inverse Regression
- 11.4 Conclusion
Chapter 12: Clustering
- 12.1 An Initial Example Using k-Means Clustering
- 12.2 Representations for Clustering
- 12.3 Approaches to Clustering
- 12.3.1 Components of a Clustering Method
- 12.3.2 Styles of Clustering Methods
- 12.3.3 Probabilistic Clustering Models
- 12.3.4 Algorithmic Clustering Models
- 12.3.5 Connections between Probabilistic and Algorithmic Clustering
- 12.4 Making Choices
- 12.4.1 Model Selection
- 12.4.2 Careful Reading
- 12.4.3 Choosing the Number of Clusters
- 12.5 The Human Side of Clustering
- 12.5.1 Interpretation
- 12.5.2 Interactive Clustering
- 12.6 Conclusion
Chapter 13: Topic Models
- 13.1 Latent Dirichlet Allocation
- 13.1.1 Inference
- 13.1.2 Example: Discovering Credit Claiming for Fire Grants in Congressional Press Releases
- 13.2 Interpreting the Output of Topic Models
- 13.3 Incorporating Structure into LDA
- 13.3.1 Structure with Upstream, Known Prevalence Covariates
- 13.3.2 Structure with Upstream, Known Content Covariates
- 13.3.3 Structure with Downstream, Known Covariates
- 13.3.4 Additional Sources of Structure
- 13.4 Structural Topic Models
- 13.4.1 Example: Discovering the Components of Radical Discourse
- 13.5 Labeling Topic Models
- 13.6 Conclusion
Chapter 14: Low-Dimensional Document Embeddings
- 14.1 Principal Component Analysis
- 14.1.1 Automated Methods for Labeling Principal Components
- 14.1.2 Manual Methods for Labeling Principal Components
- 14.1.3 Principal Component Analysis of Senate Press Releases
- 14.1.4 Choosing the Number of Principal Components
- 14.2 Classical Multidimensional Scaling
- 14.2.1 Extensions of Classical MDS
- 14.2.2 Applying Classical MDS to Senate Press Releases
- 14.3 Conclusion
Part IV: Measurement
Chapter 15: Principles of Measurement
- 15.1 From Concept to Measurement
- 15.2 What Makes a Good Measurement
- 15.2.1 Principle 1: Measures should have clear goals
- 15.2.2 Principle 2: Source material should always be identified and ideally made public
- 15.2.3 Principle 3: The coding process should be explainable and reproducible
- 15.2.4 Principle 4: The measure should be validated
- 15.2.5 Principle 5: Limitations should be explored, documented, and communicated to the audience
- 15.3 Balancing Discovery and Measurement with Sample Splits
Chapter 16: Word Counting
- 16.1 Keyword Counting
- 16.2 Dictionary Methods
- 16.3 Limitations and Validations of Dictionary Methods
- 16.3.1 Moving beyond Dictionaries: Wordscores
- 16.4 Conclusion
Chapter 17: An Overview of Supervised Classification
- 17.1 Example: Discursive Governance
- 17.2 Create a Training Set
- 17.3 Classify Documents with Supervised Learning
- 17.4 Check Performance
- 17.5 Using the Measure
- 17.6 Conclusion
Chapter 18: Coding a Training Set
- 18.1 Characteristics of a Good Training Set
- 18.2 Hand Coding
- 18.2.1 Step 1: Decide on a Codebook
- 18.2.2 Step 2: Select Coders
- 18.2.3 Step 3: Select Documents to Code
- 18.2.4 Step 4: Manage Coders
- 18.2.5 Step 5: Check Reliability
- 18.2.6 Managing Drift
- 18.2.7 Example: Making the News
- 18.3 Crowdsourcing
- 18.4 Supervision with Found Data
- 18.5 Conclusion
Chapter 19: Classifying Documents with Supervised Learning
- 19.1 Naive Bayes
- 19.1.1 The Assumptions in Naive Bayes Are Almost Certainly Wrong
- 19.1.2 Naive Bayes Is a Generative Model
- 19.1.3 Naive Bayes Is a Linear Classifier
- 19.2 Machine Learning
- 19.2.1 Fixed Basis Functions
- 19.2.2 Adaptive Basis Functions
- 19.2.3 Quantification
- 19.2.4 Concluding Thoughts on Supervised Learning with Random Samples
- 19.3 Example: Estimating Jihad Scores
- 19.4 Conclusion
Chapter 20: Checking Performance
- 20.1 Validation with Gold-Standard Data
- 20.1.1 Validation Set
- 20.1.2 Cross-Validation
- 20.1.3 The Importance of Gold-Standard Data
- 20.1.4 Ongoing Evaluations
- 20.2 Validation without Gold-Standard Data
- 20.2.1 Surrogate Labels
- 20.2.2 Partial Category Replication
- 20.2.3 Nonexpert Human Evaluation
- 20.2.4 Correspondence to External Information
- 20.3 Example: Validating Jihad Scores
- 20.4 Conclusion
Chapter 21: Collecting a Gold-Standard Dataset
- 21.1 The Challenges of Gold-Standard Data
- 21.2 Creating the Dataset
- 21.3 Validating the Dataset
- 21.4 The Gold-Standard in the Field
- 21.5 Conclusion
Part V: Inference
Chapter 22: Introduction to Inference
- 22.1 The Importance of Causal Inference
- 22.2 Measurement vs. Inference
- 22.3 Prediction vs. Inference
Part VI: Conclusion
Chapter 23: Closing Thoughts
- 23.1 Text Analysis in the Social Sciences
- 23.2 Final Considerations on Data and Methods