Text as Data: A New Framework for Machine Learning and the Social Sciences, 1st Edition, by Justin Grimmer, Margaret Roberts, and Brandon Stewart

Product details:
ISBN 10: 0691207992
ISBN 13: 9780691207995
Authors: Justin Grimmer, Margaret Roberts, Brandon Stewart
A guide for using computational text analysis to learn about the social world.

From social media posts and text messages to digital government documents and archives, researchers are bombarded with a deluge of text reflecting the social world. This textual data gives unprecedented insights into fundamental questions in the social sciences, humanities, and industry. Meanwhile, new machine learning tools are rapidly transforming the way science and business are conducted. Text as Data shows how to combine new sources of data, machine learning tools, and social science research design to develop and evaluate new insights.

Text as Data is organized around the core tasks in research projects using text: representation, discovery, measurement, prediction, and causal inference. The authors offer a sequential, iterative, and inductive approach to research design. Each research task is presented complete with real-world applications, example methods, and a distinct style of task-focused research. Bridging many divides (computer science and social science, the qualitative and the quantitative, and industry and academia), Text as Data is an ideal resource for anyone wanting to analyze large collections of text in an era when data is abundant and computation is cheap, but the enduring challenges of social science remain.

- Overview of how to use text as data
- Research design for a world of data deluge
- Examples from across the social sciences and industry
Table of Contents:
Part I: Preliminaries
Chapter 1: Introduction
- 1.1 How This Book Informs the Social Sciences
- 1.2 How This Book Informs the Digital Humanities
- 1.3 How This Book Informs Data Science in Industry and Government
- 1.4 A Guide to This Book
- 1.5 Conclusion
Chapter 2: Social Science Research and Text Analysis
- 2.1 Discovery
- 2.2 Measurement
- 2.3 Inference
- 2.4 Social Science as an Iterative and Cumulative Process
- 2.5 An Agnostic Approach to Text Analysis
- 2.6 Discovery, Measurement, and Causal Inference: How the Chinese Government Censors Social Media
- 2.7 Six Principles of Text Analysis
- 2.7.1 Social Science Theories and Substantive Knowledge Are Essential for Research Design
- 2.7.2 Text Analysis Does Not Replace Humans—It Augments Them
- 2.7.3 Building, Refining, and Testing Social Science Theories Requires Iteration and Cumulation
- 2.7.4 Text Analysis Methods Distill Generalizations from Language
- 2.7.5 The Best Method Depends on the Task
- 2.7.6 Validations Are Essential and Depend on the Theory and the Task
- 2.8 Conclusion: Text Data and Social Science
Part II: Selection and Representation
Chapter 3: Principles of Selection and Representation
- 3.1 Principle 1: Question-Specific Corpus Construction
- 3.2 Principle 2: No Values-Free Corpus Construction
- 3.3 Principle 3: No Right Way to Represent Text
- 3.4 Principle 4: Validation
- 3.5 State of the Union Addresses
- 3.6 The Authorship of the Federalist Papers
- 3.7 Conclusion
Chapter 4: Selecting Documents
- 4.1 Populations and Quantities of Interest
- 4.2 Four Types of Bias
- 4.2.1 Resource Bias
- 4.2.2 Incentive Bias
- 4.2.3 Medium Bias
- 4.2.4 Retrieval Bias
- 4.3 Considerations of “Found Data”
- 4.4 Conclusion
Chapter 5: Bag of Words
- 5.1 The Bag of Words Model
- 5.2 Choose the Unit of Analysis
- 5.3 Tokenize
- 5.4 Reduce Complexity
- 5.4.1 Lowercase
- 5.4.2 Remove Punctuation
- 5.4.3 Remove Stop Words
- 5.4.4 Create Equivalence Classes (Lemmatize/Stem)
- 5.4.5 Filter by Frequency
- 5.5 Construct Document-Feature Matrix
- 5.6 Rethinking the Defaults
- 5.6.1 Authorship of the Federalist Papers
- 5.6.2 The Scale Argument against Preprocessing
- 5.7 Conclusion
Chapter 6: The Multinomial Language Model
- 6.1 Multinomial Distribution
- 6.2 Basic Language Modeling
- 6.3 Regularization and Smoothing
- 6.4 The Dirichlet Distribution
- 6.5 Conclusion
Chapter 7: The Vector Space Model and Similarity Metrics
- 7.1 Similarity Metrics
- 7.2 Distance Metrics
- 7.3 tf-idf Weighting
- 7.4 Conclusion
Chapter 8: Distributed Representations of Words
- 8.1 Why Word Embeddings
- 8.2 Estimating Word Embeddings
- 8.2.1 The Self-Supervision Insight
- 8.2.2 Design Choices in Word Embeddings
- 8.2.3 Latent Semantic Analysis
- 8.2.4 Neural Word Embeddings
- 8.2.5 Pretrained Embeddings
- 8.2.6 Rare Words
- 8.2.7 An Illustration
- 8.3 Aggregating Word Embeddings to the Document Level
- 8.4 Validation
- 8.5 Contextualized Word Embeddings
- 8.6 Conclusion
Chapter 9: Representations from Language Sequences
- 9.1 Text Reuse
- 9.2 Parts of Speech Tagging
- 9.2.1 Using Phrases to Improve Visualization
- 9.3 Named-Entity Recognition
- 9.4 Dependency Parsing
- 9.5 Broader Information Extraction Tasks
- 9.6 Conclusion
Part III: Discovery
Chapter 10: Principles of Discovery
- 10.1 Principle 1: Context Relevance
- 10.2 Principle 2: No Ground Truth
- 10.3 Principle 3: Judge the Concept, Not the Method
- 10.4 Principle 4: Separate Data Is Best
- 10.5 Conceptualizing the US Congress
- 10.6 Conclusion
Chapter 11: Discriminating Words
- 11.1 Mutual Information
- 11.2 Fightin’ Words
- 11.3 Fictitious Prediction Problems
- 11.3.1 Standardized Test Statistics as Measures of Separation
- 11.3.2 χ² Test Statistics
- 11.3.3 Multinomial Inverse Regression
- 11.4 Conclusion
Chapter 12: Clustering
- 12.1 An Initial Example Using k-Means Clustering
- 12.2 Representations for Clustering
- 12.3 Approaches to Clustering
- 12.3.1 Components of a Clustering Method
- 12.3.2 Styles of Clustering Methods
- 12.3.3 Probabilistic Clustering Models
- 12.3.4 Algorithmic Clustering Models
- 12.3.5 Connections between Probabilistic and Algorithmic Clustering
- 12.4 Making Choices
- 12.4.1 Model Selection
- 12.4.2 Careful Reading
- 12.4.3 Choosing the Number of Clusters
- 12.5 The Human Side of Clustering
- 12.5.1 Interpretation
- 12.5.2 Interactive Clustering
- 12.6 Conclusion
Chapter 13: Topic Models
- 13.1 Latent Dirichlet Allocation
- 13.1.1 Inference
- 13.1.2 Example: Discovering Credit Claiming for Fire Grants in Congressional Press Releases
- 13.2 Interpreting the Output of Topic Models
- 13.3 Incorporating Structure into LDA
- 13.3.1 Structure with Upstream, Known Prevalence Covariates
- 13.3.2 Structure with Upstream, Known Content Covariates
- 13.3.3 Structure with Downstream, Known Covariates
- 13.3.4 Additional Sources of Structure
- 13.4 Structural Topic Models
- 13.4.1 Example: Discovering the Components of Radical Discourse
- 13.5 Labeling Topic Models
- 13.6 Conclusion
Chapter 14: Low-Dimensional Document Embeddings
- 14.1 Principal Component Analysis
- 14.1.1 Automated Methods for Labeling Principal Components
- 14.1.2 Manual Methods for Labeling Principal Components
- 14.1.3 Principal Component Analysis of Senate Press Releases
- 14.1.4 Choosing the Number of Principal Components
- 14.2 Classical Multidimensional Scaling
- 14.2.1 Extensions of Classical MDS
- 14.2.2 Applying Classical MDS to Senate Press Releases
- 14.3 Conclusion
Part IV: Measurement
Chapter 15: Principles of Measurement
- 15.1 From Concept to Measurement
- 15.2 What Makes a Good Measurement
- 15.2.1 Principle 1: Measures should have clear goals
- 15.2.2 Principle 2: Source material should always be identified and ideally made public
- 15.2.3 Principle 3: The coding process should be explainable and reproducible
- 15.2.4 Principle 4: The measure should be validated
- 15.2.5 Principle 5: Limitations should be explored, documented, and communicated to the audience
- 15.3 Balancing Discovery and Measurement with Sample Splits
Chapter 16: Word Counting
- 16.1 Keyword Counting
- 16.2 Dictionary Methods
- 16.3 Limitations and Validations of Dictionary Methods
- 16.3.1 Moving beyond Dictionaries: Wordscores
- 16.4 Conclusion
Chapter 17: An Overview of Supervised Classification
- 17.1 Example: Discursive Governance
- 17.2 Create a Training Set
- 17.3 Classify Documents with Supervised Learning
- 17.4 Check Performance
- 17.5 Using the Measure
- 17.6 Conclusion
Chapter 18: Coding a Training Set
- 18.1 Characteristics of a Good Training Set
- 18.2 Hand Coding
- 18.2.1 Step 1: Decide on a Codebook
- 18.2.2 Step 2: Select Coders
- 18.2.3 Step 3: Select Documents to Code
- 18.2.4 Step 4: Manage Coders
- 18.2.5 Step 5: Check Reliability
- 18.2.6 Managing Drift
- 18.2.7 Example: Making the News
- 18.3 Crowdsourcing
- 18.4 Supervision with Found Data
- 18.5 Conclusion
Chapter 19: Classifying Documents with Supervised Learning
- 19.1 Naive Bayes
- 19.1.1 The Assumptions in Naive Bayes Are Almost Certainly Wrong
- 19.1.2 Naive Bayes Is a Generative Model
- 19.1.3 Naive Bayes Is a Linear Classifier
- 19.2 Machine Learning
- 19.2.1 Fixed Basis Functions
- 19.2.2 Adaptive Basis Functions
- 19.2.3 Quantification
- 19.2.4 Concluding Thoughts on Supervised Learning with Random Samples
- 19.3 Example: Estimating Jihad Scores
- 19.4 Conclusion
Chapter 20: Checking Performance
- 20.1 Validation with Gold-Standard Data
- 20.1.1 Validation Set
- 20.1.2 Cross-Validation
- 20.1.3 The Importance of Gold-Standard Data
- 20.1.4 Ongoing Evaluations
- 20.2 Validation without Gold-Standard Data
- 20.2.1 Surrogate Labels
- 20.2.2 Partial Category Replication
- 20.2.3 Nonexpert Human Evaluation
- 20.2.4 Correspondence to External Information
- 20.3 Example: Validating Jihad Scores
- 20.4 Conclusion
Chapter 21: Collecting a Gold-Standard Dataset
- 21.1 The Challenges of Gold-Standard Data
- 21.2 Creating the Dataset
- 21.3 Validating the Dataset
- 21.4 The Gold-Standard in the Field
- 21.5 Conclusion
Part V: Inference
Chapter 22: Introduction to Inference
- 22.1 The Importance of Causal Inference
- 22.2 Measurement vs. Inference
- 22.3 Prediction vs. Inference
Part VI: Conclusion
Chapter 23: Closing Thoughts
- 23.1 Text Analysis in the Social Sciences
- 23.2 Final Considerations on Data and Methods